Proxy-Pointer RAG: Multimodal Responses Without Embeddings

Key Takeaways

1. Proxy-Pointer RAG provides multimodal responses without relying on multimodal embeddings, enhancing the user experience.

2. The model treats documents as hierarchical trees of semantic blocks, allowing for more accurate retrieval of images.

3. The pipeline uses tools like Adobe PDF Extract and FAISS to efficiently extract and index multimodal content.

💡 Why it matters: This approach could transform how enterprise chatbots integrate and present visual information, increasing their utility and relevance.

Proxy-Pointer RAG: Multimodal Answers Without Multimodal Embeddings

A picture is worth a thousand words. Yet, very few enterprise chatbots can reliably return images anchored in their source documents.

Returning images would be a significant improvement over a purely text-based user experience, but doing so reliably and consistently is hard. The use cases are numerous: from real estate clients to service technicians querying the latest machine settings, users would prefer to see images of the targeted properties and maintenance charts as an integral part of the response. Instead, the best most systems can do is respond with links to source documents (brochures, videos, manuals) and web pages.

In this article, I will present an open-source multimodal Proxy-Pointer RAG pipeline that can accomplish this, primarily because it treats a document as a hierarchical tree of semantic blocks rather than as a set of words to be blindly shredded into pieces to answer a query.

Challenges of Multimodal Responses

Why is multimodal response a difficult problem to solve? What are the current techniques that can be applied?

Multimodal RAG almost always means multimodal input: you can search the knowledge base with images as well as a text query. The other way around, where the response itself contains images, is rare. To understand the reasons, let's examine the approaches generally used:

Image Captioning: Run an OCR/vision model on each image to turn it into a paragraph of text, then index that text in a chunk alongside the surrounding text. This is not ideal, because sliding-window segmentation can split the caption across chunks.

The central problem is a misalignment between retrieval units and semantic units. Traditional RAG retrieves arbitrary chunks, while meaning — especially for images — belongs to coherent sections of a document.

When a chunk is retrieved, the LLM may only see a partial caption (for example, for Figure 5), making it difficult to determine the relevance of the image to that chunk or to another adjacent one that has not been retrieved. Moreover, the synthesizer often receives multiple chunks from different documents without shared context, which may contain several unrelated image captions. This complicates the LLM's task of reliably deciding which images, if any, are relevant to the user's query.
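
To make the caption-splitting problem concrete, here is a minimal sketch of fixed-size sliding-window chunking. The function name, the sample text, and the window sizes are all illustrative assumptions, not taken from any particular RAG library.

```python
def sliding_window_chunks(text: str, size: int, overlap: int) -> list[str]:
    """Cut text into fixed-size character windows; blind to sentences and sections."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]


# Hypothetical caption text produced by a vision model for one figure.
caption = "Figure 5: Training loss of the model across 10,000 steps."
document = (
    "The optimizer settings are described in the previous section. "
    + caption
    + " The following section discusses memory usage."
)

chunks = sliding_window_chunks(document, size=80, overlap=10)

# Chunks that still contain the caption in one piece.
chunks_with_full_caption = [c for c in chunks if caption in c]

for i, chunk in enumerate(chunks):
    print(i, repr(chunk))
print("chunks holding the whole caption:", len(chunks_with_full_caption))
```

With these sizes the caption straddles a window boundary, so no single chunk carries it intact, which is the situation described above.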

Multimodal Embedding: Another approach is to embed both images and text into a shared vector space using a multimodal model. While this allows for cross-modal retrieval, it introduces a different challenge. Multimodal embeddings optimize for similarity, not anchoring. Visually or structurally similar artifacts, such as financial tables from different companies, can appear almost identical in the vector space, even when only one is relevant to the query.

Without the context of the document structure, the system retrieves candidates based on similarity but cannot confidently determine which image actually belongs to the answer. As a result, the LLM is forced to choose between several plausible but potentially incorrect visuals — it is often safer to show nothing rather than risk showing the wrong one.

Proxy-Pointer Architecture

Proxy-Pointer addresses this by replacing text-based chunking with tree-based chunking. We do not chunk by character count; we chunk by section boundaries. If a section contains 3 paragraphs and 2 images, none of its chunks overflows into the next section. The LLM can treat each section as a fully independent semantic unit and can confidently judge the images it contains.
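
As a rough illustration of chunking by section boundaries, the sketch below packs whole paragraphs of a single section into chunks and never mixes text from two sections. The `Chunk` class, `chunk_section`, the `max_chars` limit, and the sample section IDs are assumptions made for this example, not the repository's actual code.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    node_id: str  # the section this chunk belongs to
    text: str


def chunk_section(node_id: str, paragraphs: list[str], max_chars: int = 800) -> list[Chunk]:
    """Greedily pack whole paragraphs of ONE section into chunks.

    A paragraph longer than max_chars is kept whole rather than cut.
    """
    chunks: list[Chunk] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            chunks.append(Chunk(node_id, current))
            current = para
        else:
            current = candidate
    if current:
        chunks.append(Chunk(node_id, current))
    return chunks


# Hypothetical sections: each maps a node ID to its paragraphs (image paths stay inline).
sections = {
    "3.1": ["Paragraph A " * 10, "![](figures/a.png)", "Paragraph B " * 10],
    "3.2": ["Paragraph C " * 10],
}

all_chunks = [
    chunk
    for node_id, paragraphs in sections.items()
    for chunk in chunk_section(node_id, paragraphs, max_chars=150)
]
for chunk in all_chunks:
    print(chunk.node_id, len(chunk.text))
```

Because the chunker is called once per section, a chunk boundary can fall between paragraphs but never between sections.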

I built a multimodal chatbot on 5 AI research papers (all under CC-BY license). These include CLIP, Nemobot, GaLore, VectorFusion, and VectorPainter. For PDF extraction, the Adobe PDF Extract API was used. As expected, the papers contain dense text as well as a total of 270 images (figures, tables, formulas) among them, which were successfully extracted by Adobe. The embedding model used is gemini-embedding-001 (with dimensions reduced to 1536 from the default 3072, making search faster and reducing memory usage). This is a text-only embedding model. No multimodal embedding model is used. For all LLM uses (noise filter, re-ranker, synthesizer, and final visual filter), gemini-3.1-flash-lite-preview is employed. The vector index used is FAISS.
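
The memory effect of halving the embedding dimension can be estimated with simple arithmetic. The sketch below assumes a flat (uncompressed) index that stores each vector as 32-bit floats; the article does not state which FAISS index type is used, and the chunk count is a made-up figure for illustration.

```python
BYTES_PER_FLOAT32 = 4


def flat_index_bytes(num_vectors: int, dim: int) -> int:
    """Raw vector storage for an uncompressed float32 index (assumed layout)."""
    return num_vectors * dim * BYTES_PER_FLOAT32


num_chunks = 10_000  # hypothetical corpus size, not from the article

full = flat_index_bytes(num_chunks, 3072)
reduced = flat_index_bytes(num_chunks, 1536)

print(f"3072 dims: {full / 1e6:.1f} MB")
print(f"1536 dims: {reduced / 1e6:.1f} MB")
```

Under that assumption, storage scales linearly with the dimension, so halving it halves the raw vector memory and the work done per distance computation.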

Multimodal Retrieval Pipeline

For multimodal output, we modify the pipeline steps with the following principle: images (figures, tables, formulas, video clips, etc.) can be extracted as artifact files (.jpg, .png, .svg, .mp4, etc.) and stored alongside the document content. This is fairly straightforward if the source document is a web page or XML. For PDFs, although not perfect, an extractor such as the Adobe PDF Extract API used here can extract tables and figures as artifacts.

In the extracted document itself, which in our case is Markdown, each figure appears in the text as a relative path, for example ![](figures/fileoutpart11.png), which points to the actual image file.
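
A minimal sketch of how such image references could be pulled out of the Markdown text with a regular expression follows. The pattern and the function name `find_image_paths` are assumptions for illustration; the pattern covers only the plain `![alt](path)` form and not titles or reference-style images.

```python
import re

# Matches ![alt](path) and captures the path; alt text may be empty.
IMAGE_REF = re.compile(r"!\[[^\]]*\]\(([^)\s]+)\)")


def find_image_paths(markdown: str) -> list[str]:
    """Return the relative image paths in the order they appear."""
    return IMAGE_REF.findall(markdown)


sample = "The architecture is shown below.\n\n![](figures/fileoutpart11.png)\n\nAs the figure shows..."
paths = find_image_paths(sample)
print(paths)
```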

Here is the indexing pipeline:

Skeleton Tree: As before, we parse the Markdown headers into a hierarchical tree using pure Python. The difference is that each node now carries a nested list of figures, recording every figure found in that node (section) along with its path. The path is used to retrieve the image file for display.
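
The skeleton tree described above could be built along these lines. This is a sketch under stated assumptions: the `Node` class, its field names, and `build_skeleton` are hypothetical, and the parser handles only `#`-style headers (it does not skip fenced code blocks).

```python
import re
from dataclasses import dataclass, field

HEADER = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")
IMAGE = re.compile(r"!\[[^\]]*\]\(([^)\s]+)\)")


@dataclass
class Node:
    title: str
    level: int
    figures: list[str] = field(default_factory=list)   # image paths found in this section
    children: list["Node"] = field(default_factory=list)


def build_skeleton(markdown: str) -> Node:
    """Parse Markdown headers into a tree and attach figure paths to their section."""
    root = Node(title="ROOT", level=0)
    stack = [root]  # path from the root to the section currently being read
    for line in markdown.splitlines():
        match = HEADER.match(line)
        if match:
            node = Node(title=match.group(2), level=len(match.group(1)))
            # Climb up until the top of the stack is a valid parent.
            while stack[-1].level >= node.level:
                stack.pop()
            stack[-1].children.append(node)
            stack.append(node)
        else:
            stack[-1].figures.extend(IMAGE.findall(line))
    return root


doc = (
    "# Paper\nIntro text\n"
    "## Method\n![](figures/a.png)\n"
    "### Details\n![](figures/b.png)\n"
    "## Results\n![](figures/c.png)\n"
)
root = build_skeleton(doc)
paper = root.children[0]
for section in paper.children:
    print(section.title, section.figures, [c.title for c in section.children])
```

Each figure is recorded only on the deepest section that contains it, so a node's `figures` list reflects what a reader would see in that section.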

The next 4 steps remain essentially the same as before:

Breadcrumb Injection: Prefix the full structural path to each chunk before embedding.
Structure-Guided Chunking: Split the text within section boundaries, never across them.
Noise Filtering: Remove distracting sections (table of contents, glossary, executive summaries, references) from the index using an LLM.
Pointer-Based Context: Use the retrieved chunks as pointers to load the complete and intact section of the document (which now contains image paths in the text) for the synthesizer.
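
Two of the steps above, breadcrumb injection and pointer-based context, can be sketched as follows. The `Chunk` fields, the " > " separator, and the in-memory `sections` dictionary are assumptions for illustration; the actual pipeline may store and join these differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    doc_id: str
    node_id: str
    breadcrumb: tuple[str, ...]  # structural path, e.g. ("GaLore", "4 Experiments")
    text: str


def text_for_embedding(chunk: Chunk) -> str:
    """Breadcrumb injection: prefix the structural path before embedding."""
    return " > ".join(chunk.breadcrumb) + "\n" + chunk.text


def load_sections(hits: list[Chunk], sections: dict[tuple[str, str], str]) -> list[str]:
    """Pointer-based context: use each hit only as a pointer to its full section."""
    seen: set[tuple[str, str]] = set()
    loaded: list[str] = []
    for chunk in hits:
        key = (chunk.doc_id, chunk.node_id)
        if key not in seen:
            seen.add(key)
            loaded.append(sections[key])
    return loaded


# Hypothetical data: one section stored whole, two chunks cut from it.
sections = {
    ("galore", "4"): "Full section text...\n\n![](figures/fileoutpart11.png)\n\nMore text...",
}
hits = [
    Chunk("galore", "4", ("GaLore", "4 Experiments"), "Full section text..."),
    Chunk("galore", "4", ("GaLore", "4 Experiments"), "More text..."),
]

print(text_for_embedding(hits[0]))
context = load_sections(hits, sections)
print(len(context), "section(s) passed to the synthesizer")
```

The synthesizer receives the intact section, including its image paths, even though only a fragment of it matched the query.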

The updated retrieval pipeline for multimodal retrieval is as follows:

Step 1 (Broad Recall): FAISS returns the top 200 chunks by embedding similarity. These are deduplicated by (doc_id, node_id) to ensure we are examining unique document sections, resulting in a narrowed list of the top 50 candidate nodes.
Step 2 (Anchor-Sensitive Structural Re-Ranking): The re-ranker now receives the full breadcrumb path + a semantic excerpt (150 characters) for each of the 50 candidates.
Step 3 (Context-Sensitive Synthesis and Image Selection): The LLM Synthesizer examines the final 5 sections and forms the textual response. Additionally, it makes the visual decision on the found images to determine which should be displayed.
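
Steps 1 and 2 above can be sketched in plain Python. The sketch assumes the vector search has already returned hits ordered best-first as dictionaries; the field names, the function names, and the candidate line format are illustrative assumptions, not the repository's actual prompt.

```python
def dedupe_to_nodes(hits: list[dict], max_nodes: int = 50) -> list[dict]:
    """Keep the best-ranked hit per (doc_id, node_id), up to max_nodes sections."""
    seen: set[tuple[str, str]] = set()
    nodes: list[dict] = []
    for hit in hits:  # hits are assumed to be sorted by similarity, best first
        key = (hit["doc_id"], hit["node_id"])
        if key in seen:
            continue
        seen.add(key)
        nodes.append(hit)
        if len(nodes) == max_nodes:
            break
    return nodes


def rerank_line(index: int, hit: dict, excerpt_chars: int = 150) -> str:
    """Format one candidate for the re-ranker: breadcrumb plus a short excerpt."""
    excerpt = " ".join(hit["text"].split())[:excerpt_chars]
    return f"[{index}] {hit['breadcrumb']} | {excerpt}"


# Hypothetical search result: 200 chunks spread over 60 distinct sections.
hits = [
    {
        "doc_id": "clip",
        "node_id": str(i % 60),
        "breadcrumb": f"CLIP > Section {i % 60}",
        "text": f"Chunk {i} text " * 20,
    }
    for i in range(200)
]

candidates = dedupe_to_nodes(hits, max_nodes=50)
prompt_lines = [rerank_line(i, hit) for i, hit in enumerate(candidates)]
print(len(candidates))
print(prompt_lines[0])
```

Deduplicating before re-ranking means the LLM compares sections rather than overlapping fragments of the same section.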

The above pipeline achieves 95% image-retrieval accuracy on the 20-question benchmark I created, as judged by Claude. The complete results are available in the repository. If you wish to refine the results further, the next step is an optional visual filter.

Step 4 (Visual Filter, optional): To refine the selected images further, an optional visual selection step can be enabled in config.py. Here, the LLM is prompted to actually view the 6 images.