What is Google Gemini: The next-gen language model that can do it all
Large language models like OpenAI’s GPT-4 and Google’s PaLM 2 have dominated the news cycle for the past few months. And while we all thought the world of AI would return to the usual slow pace, that hasn’t happened yet. Case in point: Google spent nearly an hour talking about AI at its recent I/O keynote where it also debuted cutting-edge hardware like the Pixel Fold. So it goes without saying that the company’s next-generation AI architecture, dubbed Gemini, deserves some attention.
Gemini can generate and process text, images, and other kinds of data like graphs and maps. That’s right — the future of AI isn’t just chatbots or image generators. As impressive as those tools may seem today, Google believes that they’re far from maximizing the technology’s full potential. So in this article, let’s break down what the search giant aims to achieve with Gemini, how it works, and why it signals the future of AI.
What is Google Gemini: Beyond a simple language model
Gemini is Google’s next-generation AI architecture that will eventually replace PaLM 2. Currently, the latter powers many of the company’s AI services, including the Bard chatbot and Duet AI in Workspace apps like Google Docs. Put simply, Gemini will allow these services to simultaneously analyze or generate text, images, audio, videos, and other data types.
Thanks to ChatGPT and Bing Chat, you’re probably already familiar with machine learning models that can understand and generate natural language. And it’s the same story with AI image generators — with a single line of text, they can create beautiful art or even photorealistic imagery. But Google’s Gemini will go one step further as it isn’t bound by a single data type — and that’s why you may hear it called a “multimodal” model.
Here’s an example that shows the impressive capabilities of a multimodal model, courtesy of Google’s AI Research blog. It shows how the AI can not only extract features from a video to generate a summary but also answer follow-up text questions.
Gemini’s ability to combine visuals and text should also allow it to generate more than one kind of data at the same time. Imagine an AI that could not just write the contents of a magazine, but also design the layout and graphics for it. Or an AI that could summarize an entire newspaper or podcast based on the topics you care about the most.
How does Gemini differ from other large language models?
Gemini differs from other large language models in that it isn’t trained on text alone. Google says that it built the model with multimodal capabilities in mind from the start. That indicates the future of AI might be more general-purpose than the tools we have today. The company has also consolidated its AI teams into one working unit, now named Google DeepMind. All of this strongly suggests that the company is betting on Gemini to compete with GPT-4.
A multimodal model can decode many data types at once, similar to how humans use different senses in the real world.
So how does a multimodal AI like Google Gemini work? You have a few main components that work in unison, starting with an encoder and a decoder. When given input with more than one data type (like a piece of text and an image), the encoder extracts all relevant details from each data type (modality) separately.
The AI then uses an attention mechanism to weigh the extracted features by how relevant they are to the task at hand, so that it focuses on the parts of the input that matter most. For example, identifying the animal in the above example would involve looking only at the specific areas of the image with a moving subject. Finally, the AI can fuse the information it has learned from different data types to make a prediction.
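To make the encode, attend, and fuse steps a little more concrete, here is a minimal toy sketch in plain Python. It is not Gemini’s actual architecture, which Google hasn’t published. The feature vectors stand in for encoder outputs and are made up, the names (`attend`, `text_features`, `image_features`, `query`) are hypothetical, and fusing by simple concatenation is an assumption chosen for brevity.

```python
import math


def softmax(scores):
    """Turn raw relevance scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def attend(query, features):
    """Scaled dot-product attention over one modality.

    Each feature vector is weighted by how well it matches the query,
    then the weighted vectors are summed into a single summary vector.
    """
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, f) / scale for f in features])
    dim = len(features[0])
    summary = [sum(w * f[i] for w, f in zip(weights, features)) for i in range(dim)]
    return summary, weights


# Made-up "encoder outputs": one vector per text token and per image patch.
text_features = [
    [1.0, 0.0, 0.0, 0.0],  # token 0
    [0.0, 1.0, 0.0, 0.0],  # token 1
]
image_features = [
    [0.0, 0.1, 0.0, 0.9],  # patch 0: background
    [0.0, 0.9, 0.1, 0.0],  # patch 1: the moving subject
    [0.2, 0.0, 0.8, 0.0],  # patch 2: background
]

# Hypothetical task query, e.g. "which animal is this?"
query = [0.0, 4.0, 0.0, 0.0]

# Attend to each modality separately, then fuse by concatenation.
text_summary, text_weights = attend(query, text_features)
image_summary, image_weights = attend(query, image_features)
fused = text_summary + image_summary

print("image attention weights:", [round(w, 3) for w in image_weights])
print("fused vector length:", len(fused))
```

In this sketch, the patch that best matches the query receives the largest attention weight, which mirrors the idea of looking only at the relevant area of an image. A real multimodal model would learn the encoders, the query, and the fusion step from data rather than hard-coding them.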
When will Google release Gemini?
When OpenAI announced GPT-4, it spoke extensively about the model’s ability to handle multimodal problems. Even though we haven’t seen these features make their way to services like ChatGPT Plus, the demos we’ve seen so far look extremely promising. With Gemini, Google hopes to match or surpass GPT-4 before it gets left behind for good.
We don’t have the technical details on Gemini just yet, but Google has confirmed that it will come in different sizes. If what we’ve seen with PaLM 2 so far holds true, that could mean four different models. The smallest PaLM 2 model is light enough to run on a typical smartphone, so a similarly sized Gemini model would be a perfect fit for generative AI on the go. However, the more likely outcome is that Gemini will come to the Bard chatbot and other Google services first.
For now, all we know is that Gemini is still in its training phase. Once that’s complete, the company will move on to fine-tuning and improving safety. The latter can take a while, as it requires human workers to manually rate responses and steer the AI toward answers that people find helpful and safe. So with all of this in mind, it’s tough to answer when Google will release Gemini, but with mounting competition, it can’t be that far off.