Gemini Embeddings 2: Google Revolutionizes Multimodal Integration
The Rise of Large Language Models
Google recently unveiled Gemini Embeddings 2 Preview, an embedding model that can embed several types of content: text, PDFs, images, audio, and video. That range makes it a versatile tool for building multimodal artificial intelligence applications.
Embeddings are a core building block of retrieval-augmented generation (RAG), one of the most common patterns in modern AI applications.
Understanding RAG and Embedding
RAG relies on splitting source material into chunks, encoding each chunk, and storing the result so that it can be searched with a similarity function. Encoding transforms a chunk into a series of numbers called a vector, which is the essence of embedding. These vectors are stored in a vector database.
When a user performs a search, the query is encoded with the same embedding model. The resulting vector is then compared with the vectors in the database, often using cosine similarity. The closer the query vector is to a stored vector, the more relevant that stored content is. A large language model can then use the retrieved content to compose and display the most pertinent answer.
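The retrieval step described above can be sketched in a few lines. This is a minimal illustration, not a real vector database: the three-dimensional vectors are made up for the example, and the names `cosine_similarity`, `search`, and `store` are hypothetical. Real embeddings have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)

def search(query_vector, store, top_k=2):
    """Return the top_k (score, text) pairs, most similar first."""
    scored = [(cosine_similarity(query_vector, vec), text) for text, vec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]

# Toy vectors standing in for real embeddings (invented for illustration).
store = [
    ("Dolphins are marine mammals.", [0.9, 0.1, 0.0]),
    ("Labradors are friendly dogs.", [0.1, 0.9, 0.1]),
    ("Cats sleep most of the day.", [0.0, 0.2, 0.9]),
]
query = [0.8, 0.2, 0.1]  # pretend embedding of a question about sea animals

for score, text in search(query, store):
    print(f"{score:.3f}  {text}")
```

A production system would delegate storage and nearest-neighbor search to a vector database, but the comparison it performs is the same idea.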
The Innovation of Gemini Embedding
Google's Gemini embedding model stands out for its ability to embed multimodal inputs. Unlike earlier embedding models, which were often limited to text and PDFs, it can also embed images, audio, and video. The model is still in preview, but its capabilities could broaden how embeddings are used.
Current Limitations of Inputs
The Gemini model imposes certain restrictions on the types of inputs it can process:
- Text: Up to 8192 tokens, or approximately 6000 words.
- Images: Up to 6 images per query, supporting PNG and JPEG formats.
- Videos: Maximum duration of 2 minutes, in MP4 and MOV formats.
- Audio: Maximum duration of 80 seconds, in MP3 and WAV formats.
- Documents: Maximum length of 6 pages.
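Checking inputs against these limits before sending a request can save failed calls. The sketch below is a hypothetical pre-flight check: the function `check_request` and its parameters are not part of any Google library, and the constants simply copy the list above, so they may change while the model is in preview.

```python
from pathlib import Path

# Limits copied from the list above; preview limits may change.
MAX_TEXT_TOKENS = 8192
MAX_IMAGES = 6
MAX_VIDEO_SECONDS = 120
MAX_AUDIO_SECONDS = 80
MAX_DOCUMENT_PAGES = 6
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg"}

def check_request(text_tokens=0, image_paths=(), video_seconds=0,
                  audio_seconds=0, document_pages=0):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if text_tokens > MAX_TEXT_TOKENS:
        problems.append(f"text has {text_tokens} tokens (max {MAX_TEXT_TOKENS})")
    if len(image_paths) > MAX_IMAGES:
        problems.append(f"{len(image_paths)} images (max {MAX_IMAGES})")
    for path in image_paths:
        if Path(path).suffix.lower() not in IMAGE_EXTENSIONS:
            problems.append(f"unsupported image format: {path}")
    if video_seconds > MAX_VIDEO_SECONDS:
        problems.append(f"video is {video_seconds}s (max {MAX_VIDEO_SECONDS}s)")
    if audio_seconds > MAX_AUDIO_SECONDS:
        problems.append(f"audio is {audio_seconds}s (max {MAX_AUDIO_SECONDS}s)")
    if document_pages > MAX_DOCUMENT_PAGES:
        problems.append(f"document has {document_pages} pages (max {MAX_DOCUMENT_PAGES})")
    return problems

print(check_request(image_paths=["cat.png", "dog.gif"], audio_seconds=95))
```

Token counts, durations, and page counts are assumed to be measured by the caller; the check itself only compares numbers.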
Setting Up a Development Environment
To try Gemini Embeddings 2, it is advisable to set up a dedicated development environment. The uv tool can be used for this, though other methods work equally well depending on preference.
$ uv init embed-test --python 3.13
$ cd embed-test
$ uv add google-genai jupyter numpy scikit-learn audioop-lts
$ uv run jupyter notebook
A Gemini API key is required; one can be obtained from Google AI Studio. After logging in, a "Get API Key" link appears at the bottom left of the screen.
Usage Example: Embedding Images
To illustrate the use of Gemini, consider embedding three images: an orange cat, a Labrador, and a yellow dolphin. The goal is to pose specific questions or phrases and see whether the model can identify the most relevant image. This is done by calculating a similarity score between each question and each image.
The questions posed include:
- Which animal is yellow?
- Which one is most likely to be named Rover?
- Something strange is happening here.
- A perfect image.
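Once each image and each question has an embedding, the scoring step reduces to a ranking by cosine similarity. The sketch below assumes the vectors have already been obtained from the embedding model; the values shown are placeholders invented for illustration, and `rank_images` is a hypothetical helper name.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def rank_images(question_vec, image_vecs):
    """Return (image_name, score) pairs sorted from most to least similar."""
    scores = {name: cosine(question_vec, vec) for name, vec in image_vecs.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Placeholder vectors, not real model output.
image_vecs = {
    "orange_cat": [0.7, 0.1, 0.2],
    "labrador": [0.1, 0.8, 0.1],
    "yellow_dolphin": [0.2, 0.1, 0.9],
}
question_vecs = {
    "Which animal is yellow?": [0.1, 0.0, 0.8],
    "Which one is most likely to be named Rover?": [0.0, 0.9, 0.1],
}

for question, q_vec in question_vecs.items():
    best_name, best_score = rank_images(q_vec, image_vecs)[0]
    print(f"{question} -> {best_name} ({best_score:.3f})")
```

With numpy and scikit-learn installed, `sklearn.metrics.pairwise.cosine_similarity` computes the full question-by-image score matrix in one call.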
Usage Example: Embedding Audio
For the audio, a recording of a man describing a fishing trip is used. He recounts seeing a bright yellow dolphin. The full transcript is:
“Hello, my name is Glen, and I want to tell you about a fascinating sight I saw Tuesday afternoon while fishing at sea with friends. It was a hot day with a yellow sun in the sky. We were fishing for tuna and had no luck. We had to spend the best part of 5 hours out there. So we were pretty down when we returned to shore. But suddenly, and I swear this is not a lie, we saw a pod of dolphins. Not only that, but one of them was a bright yellow color. We had never seen anything like it in our lives, but I can tell you that all our thoughts of a bad fishing day disappeared. It was fascinating.”
The goal is to determine where the speaker mentions the yellow dolphin, thus demonstrating the model's ability to effectively process audio data.
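One way to locate a passage inside a recording, offered here as an assumption rather than the method the article necessarily used, is to cut the audio into short overlapping windows, embed each window, and keep the window that scores highest against the query. The sketch below covers only the windowing and selection logic; `split_windows` and `best_window` are hypothetical names, the 45-second duration is invented, and the similarity scores are assumed to come from comparing each window's embedding with the query embedding.

```python
def split_windows(total_seconds, window_seconds=10.0, step_seconds=5.0):
    """Return (start, end) pairs covering the recording with overlapping windows."""
    if window_seconds <= 0 or step_seconds <= 0:
        raise ValueError("window_seconds and step_seconds must be positive")
    windows = []
    start = 0.0
    while start < total_seconds:
        end = min(start + window_seconds, total_seconds)
        windows.append((start, end))
        if end >= total_seconds:
            break  # the last window already reaches the end of the recording
        start += step_seconds
    return windows

def best_window(windows, scores):
    """Pick the window whose similarity score to the query is highest."""
    if len(windows) != len(scores):
        raise ValueError("need exactly one score per window")
    index = max(range(len(scores)), key=lambda i: scores[i])
    return windows[index]

# Hypothetical 45-second recording split into 10-second windows every 5 seconds.
windows = split_windows(total_seconds=45.0)
print(windows)
```

Shorter windows give a more precise location but carry less context per embedding, so the window and step sizes are tuning choices.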