Gemini Embeddings 2: Google Revolutionizes Multimodal Integration


Key Takeaways

1. Google unveils Gemini Embeddings 2, an embedding model that accepts text, images, audio, and video.

2. The model supports retrieval-augmented generation (RAG) by turning content and queries into vectors that can be compared in a similarity search.

3. Current limits include 8192 tokens for text and 2 minutes for video, although the model is still in preview.

Why it matters: Gemini Embeddings 2 could change how multimodal data is processed, since text, images, audio, and video can all be searched through the same embedding model.

Gemini Embeddings 2: A Major Advancement for Multimodal Integration

The Rise of Large Language Models

Google recently unveiled Gemini Embeddings 2 Preview, an embedding model that stands out for its ability to handle several types of content: text, PDFs, images, audio, and video. This makes it a versatile tool for bringing multimodal content into artificial intelligence applications.

Embedding is a core technology behind retrieval-augmented generation (RAG), one of the most common patterns in modern AI applications.

Understanding RAG and Embedding

RAG relies on segmenting content into chunks, encoding each chunk, and storing the results so they can be searched with a similarity function. Encoding transforms a piece of content into a series of numbers called a vector, which is the essence of embedding. These vectors are stored in a vector database.
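The segmenting step can be sketched as follows. This is only an illustration, not how Gemini or any particular RAG library splits content: the helper name chunk_words, the word-based splitting, and the chunk sizes are all assumptions made for the example. Real pipelines usually split on tokens or on document structure.

```python
def chunk_words(text, chunk_size=200, overlap=20):
    """Split text into overlapping word-based chunks (hypothetical helper)."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far each new chunk advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


# Tiny demonstration with deliberately small sizes.
sample = "one two three four five six seven"
for chunk in chunk_words(sample, chunk_size=3, overlap=1):
    print(chunk)
```

The overlap keeps a sentence that straddles a chunk boundary from being lost to both neighbours. Each chunk would then be encoded and stored with its vector.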

When a user performs a search, the query is encoded with the same embedding model. The resulting vector is then compared with the vectors in the database, often using cosine similarity. The closer the query vector is to a stored vector, the more relevant that content is considered. A large language model can then use the retrieved content to compose and display the most pertinent answer.
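The comparison step can be sketched in a few lines of standard-library Python. The vectors below are made up and have only three dimensions, whereas real embeddings have far more; the function names are illustrative. NumPy or scikit-learn provide equivalent, faster routines.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, stored):
    """Return (label, score) pairs sorted from most to least similar."""
    scored = [(label, cosine_similarity(query_vec, vec))
              for label, vec in stored.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up vectors standing in for a tiny vector database.
stored = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.6, 0.8, 0.0],
    "doc_c": [0.0, 0.0, 1.0],
}
query = [0.9, 0.1, 0.0]

for label, score in rank_by_similarity(query, stored):
    print(f"{label}: {score:.3f}")
```

Cosine similarity depends only on the angle between vectors, not their length, which is why it is a common default for comparing embeddings.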

The Innovation of Gemini Embedding

Gemini Embeddings 2 is distinguished by its ability to embed multimodal inputs. Unlike earlier embedding models, which were often limited to text and PDFs, it can also embed images, audio, and video. Although the model is still in preview, these capabilities could broaden how embeddings are used.

Current Limitations of Inputs

The Gemini model imposes certain restrictions on the types of inputs it can process:

Text: Up to 8192 tokens, or approximately 6000 words.
Images: Up to 6 images per query, supporting PNG and JPEG formats.
Videos: Maximum duration of 2 minutes, in MP4 and MOV formats.
Audio: Maximum duration of 80 seconds, in MP3 and WAV formats.
Documents: Maximum length of 6 pages.
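A simple pre-flight check against these limits could look like the sketch below. The numbers and formats come from the list above; the dictionary layout and the function names within_limit and supported_format are assumptions for illustration and are not part of any Google SDK. Treating the .jpg extension as JPEG is also an assumption of this example.

```python
# Limits as listed above (preview values, which may change).
LIMITS = {
    "text_tokens": 8192,
    "images_per_query": 6,
    "video_seconds": 120,
    "audio_seconds": 80,
    "document_pages": 6,
}

# File extensions per modality; ".jpg" is treated as JPEG here.
FORMATS = {
    "image": {"png", "jpeg", "jpg"},
    "video": {"mp4", "mov"},
    "audio": {"mp3", "wav"},
}


def within_limit(kind, amount):
    """Return True if amount does not exceed the limit for kind."""
    if kind not in LIMITS:
        raise KeyError(f"unknown limit: {kind}")
    return 0 <= amount <= LIMITS[kind]


def supported_format(modality, filename):
    """Return True if the file extension is listed for the modality."""
    extension = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    return extension in FORMATS.get(modality, set())


print(within_limit("audio_seconds", 75))
print(supported_format("video", "clip.MOV"))
```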

Setting Up a Development Environment

To experiment with Gemini Embeddings 2, it is advisable to set up a dedicated development environment. The uv tool can be used for this, but other methods may also be suitable depending on preferences.

$ uv init embed-test --python 3.13
$ cd embed-test
$ uv add google-genai jupyter numpy scikit-learn audioop-lts
$ uv run jupyter notebook

A Gemini API key is required, available on the Google AI Studio homepage. After logging in, a "Get API Key" link is accessible at the bottom left of the screen.
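Once the key exists, one way to keep it out of notebooks is to read it from an environment variable. The sketch below assumes the variable is named GEMINI_API_KEY and that the google-genai package installed above is available; the helper names load_api_key and make_client are hypothetical.

```python
import os


def load_api_key(env_var="GEMINI_API_KEY"):
    """Read the API key from the environment, failing loudly if it is missing."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return key


def make_client():
    """Create a Gemini client (requires the google-genai package)."""
    # Imported lazily so load_api_key works even without the SDK installed.
    from google import genai
    return genai.Client(api_key=load_api_key())
```

Calling make_client() in a notebook cell then gives a client object without the key ever appearing in the notebook itself.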

Usage Example: Embedding Images

To illustrate the use of Gemini Embeddings 2, consider embedding three images: an orange cat, a Labrador, and a yellow dolphin. The goal is to pose specific questions or phrases and see whether the model can identify the most relevant image. This is done by calculating a similarity score between each question and each image.

The questions posed include:

Which animal is yellow?
Which one is most likely to be named Rover?
Something strange is happening here.
A perfect image.
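The scoring step can be sketched as below: every question vector is compared with every image vector, and the image with the highest cosine similarity wins. The vectors here are placeholders invented for illustration and are NOT output from the model; in practice they would come from embedding the questions and the image files. The function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two non-zero, equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def best_match(question_vecs, image_vecs):
    """Map each question to the image label with the highest cosine score."""
    result = {}
    for question, q_vec in question_vecs.items():
        scores = {label: cosine(q_vec, i_vec)
                  for label, i_vec in image_vecs.items()}
        result[question] = max(scores, key=scores.get)
    return result


# Placeholder vectors invented for this sketch (not real embeddings).
image_vecs = {
    "cat": [0.9, 0.1, 0.2],
    "labrador": [0.1, 0.9, 0.2],
    "dolphin": [0.2, 0.1, 0.9],
}
question_vecs = {
    "Which animal is yellow?": [0.1, 0.2, 0.8],
    "Which one is most likely to be named Rover?": [0.2, 0.8, 0.1],
}

for question, label in best_match(question_vecs, image_vecs).items():
    print(f"{question} -> {label}")
```

Vaguer phrases such as "A perfect image." are useful test cases precisely because no image is an obvious match, so the scores show how the model behaves when relevance is ambiguous.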

Usage Example: Embedding Audio

For the audio, a recording of a man describing a fishing trip is used. He recounts seeing a bright yellow dolphin. The full transcript is:

“Hello, my name is Glen, and I want to tell you about a fascinating sight I saw Tuesday afternoon while fishing at sea with friends. It was a hot day with a yellow sun in the sky. We were fishing for tuna and had no luck. We had to spend the best part of 5 hours out there. So we were pretty down when we returned to shore. But suddenly, and I swear this is not a lie, we saw a pod of dolphins. Not only that, but one of them was a bright yellow color. We had never seen anything like it in our lives, but I can tell you that all our thoughts of a bad fishing day disappeared. It was fascinating.”

The goal is to determine where the speaker mentions the yellow dolphin, thus demonstrating the model's ability to effectively process audio data.
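One possible way to locate a passage, sketched here as an assumption rather than the approach the article necessarily takes, is to cut the recording into short overlapping windows, embed each window separately, and compare each one with a text query such as "yellow dolphin". The helper below only computes the window boundaries; the name audio_windows and the default window sizes are hypothetical, while the 80-second cap reflects the audio limit listed earlier.

```python
def audio_windows(total_seconds, window=20.0, overlap=5.0, max_window=80.0):
    """Return (start, end) offsets in seconds that cover the recording."""
    if not 0 < window <= max_window:
        raise ValueError("window must be positive and within the audio limit")
    if not 0 <= overlap < window:
        raise ValueError("overlap must be non-negative and smaller than window")
    step = window - overlap  # distance between consecutive window starts
    windows = []
    start = 0.0
    while start < total_seconds:
        end = min(start + window, total_seconds)
        windows.append((start, end))
        if end >= total_seconds:
            break  # the recording is fully covered
        start += step
    return windows


for start, end in audio_windows(50, window=20, overlap=5):
    print(f"{start:.0f}s to {end:.0f}s")
```

The window whose embedding scores highest against the query would then indicate roughly where in the recording the dolphin is mentioned.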