Latent Memory: Revolution in LLM Agent Startup

Latent Memory: Revolutionizing the Startup of Multi-Hop LLM Agents
Persistent Latent Memory for Multi-Hop LLM Agents: How a paper on 6G transitions solves the cold start problem for agents
Stop passing prompt strings between agents: how to use a β-VAE and a gate MLP to preserve context across transfer boundaries.
Anubhab Banerjee
ILCP Visualization: Bridging the cold start gap by transferring compressed latent context directly between specialized agents.
A humorous yet real take on ILCP for agents: a β-VAE compressor, an Xn-style transport, a gate MLP projector, and the incredibly practical realization that I had already solved this exact problem for 6G transitions. The agent version V1 is the wiring; the receipts in this post are the receipts from the 6G paper, labeled as such, because honest writing is the whole point of this series.
This is the final part, and the highlight, of the "Production Quality Agentic Inference" series. Each part has eliminated one type of redundant work in an agentic LLM pipeline. Part 1 eliminated redundant prefill (don’t read the same document twice). Part 2 eliminated redundant waiting (don’t queue fifty agents). Part 3 eliminated redundant CPU round-trips (don’t send every retrieval back to the CPU). Part 4 (this post, and the last) eliminates redundant context reconstruction: the agentic equivalent of throwing away your hidden state every time the conversation shifts to a new specialist.
The problem: in a multi-hop agent pipeline, every time control passes from agent A to agent B, the receiver discards A's hidden state and reconstructs context from a prompt string. This is structurally the same "post-transfer cold start" that a user equipment (UE) experiences when moving from a source base station to a target base station, where the target resets the per-user recurrent state to zero.
The solution: compress the sender's recurrent state into a small latent payload, transport it across the transfer, and let the receiver use it as a soft prompt prefix instead of prefilling everything from text. It is the same "compute once, broadcast shared state" lesson the series has hammered on since part 1, applied across reasoning hops instead of within a single pipeline.
The 'unusual' receipts: the underlying method is Inductive Latent Context Persistence (ILCP), from a peer-reviewed paper I recently co-authored, accepted at AI4NextG @ ICML 2026. On the 4G/5G drive test in Vienna, ILCP completely eliminates ping-pong transfers (0.0%, vs. 6.5% for the no-transfer baseline and 22.6% for the Transformer baseline), improves post-transfer accuracy by +5.1 pp on average and +13.3 pp at maximum, and runs end-to-end at 7.7 ms p99 per transfer decision on the same GTX 1080 as the rest of this series.
The honest part: these numbers are 6G radio transfer figures, not LLM agent figures. The agent version V1 in this post (ilcp-for-agents) is the wiring (a β-VAE compressor, an in-process transport, a gate MLP, and a Qwen2.5-7B harness), and its agent-side benchmarks are explicitly future work. I refuse to pass off RAN receipts as LLM receipts, even where the temptation is strong.
The key point: the telecom thread that ran through parts 1 to 3 as an analogy is, in part 4, my published research solving the same problem in two different industries. The series comes full circle.
TL;DR: Multi-hop LLM agents currently pass context as a string. Agent A completes its reasoning, summarizes it into prompt text, and agent B reads this string from the beginning; the sender's KV cache, attention patterns, and any partial computation that agent A built are all discarded. This is the agentic version of the post-transfer cold start in 5G/6G networks: when a UE (mobile device) moves between two base stations, the target base station resets the per-UE recurrent state and must rebuild it from scratch.
We solved this problem with a method called Inductive Latent Context Persistence (ILCP): a β-VAE compresses a 128-dimensional GRU state into a 128-byte latent payload, the payload travels over the standard 3GPP Xn interface, and a gate MLP projects it into the target base station's state space at the moment of transfer. On the 4G/5G drive test in Vienna, ILCP eliminates ping-pong transfers (0.0% vs. 6.5% for the no-transfer baseline), improves post-transfer next-cell accuracy by +5.1 pp on average and +13.3 pp at maximum in the 50 to 250 ms window after transfer, and runs at 7.7 ms p99 per decision on a single GTX 1080.
This part applies the same protocol to LLM agent transfers: ilcp-for-agents learns to compress a pooled hidden summary, transport it across the transfer, and project it back into a soft prompt prefix on the receiver side. V1 is the wiring (PyTorch, Qwen2.5-7B-Instruct, β-VAE, gate MLP, in-process transport, exact-match metric). The contribution here is the architectural transfer, not the numbers.
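A quick size check on the figures above, as a sketch only. The 128-byte payload and the 32-dimensional latent z used in the agent pipeline line up if each latent coordinate is stored as a 4-byte float32. That encoding is an assumption made here for illustration; the post does not say how the payload is serialized, and the variable names are hypothetical.

```python
import struct

LATENT_DIM = 32        # size of z in the agent pipeline described in this post
GRU_STATE_DIM = 128    # recurrent state size quoted from the 6G paper

# Assumption: each latent coordinate is serialized as a little-endian float32.
z = [0.01 * i for i in range(LATENT_DIM)]
payload = struct.pack(f"<{LATENT_DIM}f", *z)

# The same 128-dimensional state sent uncompressed as float32, for comparison.
raw_state_bytes = GRU_STATE_DIM * 4

print(len(payload))                     # 128
print(raw_state_bytes // len(payload))  # 4

# The receiver unpacks the same number of floats on the other side.
restored = list(struct.unpack(f"<{LATENT_DIM}f", payload))
```

Under that assumption, the latent payload is a quarter of the size of the raw float32 state, which is the kind of budget a small inter-node message can carry.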
Mental model of the architecture: keep this open while you read.
Agent A context → masked average pool → β-VAE encoder → z (32-dim latent) → in-process transport payload → β-VAE decoder + gate MLP → K memory tokens → torch.cat with agent B's question embeddings → greedy decoding
Everything that follows is just a comment on a piece of this line.
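To make the pipeline line concrete, here is a minimal PyTorch sketch of its stages with random tensors standing in for a real model. It is not the ilcp-for-agents code. The layer shapes, the toy hidden size of 64, K = 4, the class names `Encoder` and `GatedProjector`, the particular form of the gate, and the choice to send the encoder mean at inference are all assumptions made for illustration. Only the 32-dimensional latent and the order of the stages come from the post.

```python
import torch
import torch.nn as nn

HIDDEN = 64   # toy hidden size (assumption; a 7B model is far wider)
LATENT = 32   # size of z, taken from the pipeline line above
K = 4         # number of memory tokens (assumption)


def masked_mean_pool(hidden, mask):
    """Average token states over non-padding positions only."""
    m = mask.unsqueeze(-1).to(hidden.dtype)            # (B, T, 1)
    return (hidden * m).sum(dim=1) / m.sum(dim=1).clamp(min=1.0)


class Encoder(nn.Module):
    """VAE-style encoder: pooled summary -> mean and log-variance of z."""

    def __init__(self):
        super().__init__()
        self.mu = nn.Linear(HIDDEN, LATENT)
        self.logvar = nn.Linear(HIDDEN, LATENT)

    def forward(self, pooled):
        return self.mu(pooled), self.logvar(pooled)


class GatedProjector(nn.Module):
    """Decoder plus a gated MLP (hypothetical form): z -> K memory tokens."""

    def __init__(self):
        super().__init__()
        self.decode = nn.Linear(LATENT, HIDDEN)
        self.value = nn.Linear(HIDDEN, K * HIDDEN)
        self.gate = nn.Linear(HIDDEN, K * HIDDEN)

    def forward(self, z):
        h = torch.tanh(self.decode(z))
        # The sigmoid gate scales how much of each projected feature passes.
        tokens = torch.sigmoid(self.gate(h)) * self.value(h)
        return tokens.view(-1, K, HIDDEN)


torch.manual_seed(0)
# Random stand-ins for agent A's hidden states; row 1 has 4 padding positions.
hidden_a = torch.randn(2, 10, HIDDEN)
mask_a = torch.ones(2, 10, dtype=torch.long)
mask_a[1, 6:] = 0

encoder, projector = Encoder(), GatedProjector()
with torch.no_grad():
    pooled = masked_mean_pool(hidden_a, mask_a)        # (2, HIDDEN)
    mu, logvar = encoder(pooled)
    z = mu                      # assumption: send the mean, sample only in training
    payload = z.clone()         # in-process "transport": just hand over the tensor
    memory = projector(payload)                        # (2, K, HIDDEN)
    question_emb = torch.randn(2, 7, HIDDEN)           # stand-in for B's question
    inputs_embeds = torch.cat([memory, question_emb], dim=1)

print(z.shape, memory.shape, inputs_embeds.shape)
```

In a real harness, the receiver would feed `inputs_embeds` to the language model in place of token IDs, with the attention mask extended by K positions, and then decode greedily.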
Compress, Transport, and Project
- A confession: I solved this problem before I knew I had
In part 3, we went to slightly absurd lengths to keep our tensors exactly where they belong: on silicon. By writing a custom CUDA kernel for Top-K retrieval, we eliminated the redundant CPU round-trips that bog down agentic RAG. The philosophy was absolute: once the GPU computes a rich, high-dimensional state, you don’t move it, and you certainly don’t destroy it. Yet by the time this highly optimized retriever finishes its work and hands off to the next specialist in your pipeline, standard frameworks force you to do exactly that. We protect our tensor state with our lives inside a single node, only to willingly throw it in the trash as soon as we cross a reasoning hop.
Let me dramatize an agent transfer the way every multi-hop pipeline does it today.
You: “Agent A, read this 50-page report, create a summary, and pass it to agent B for fact-checking.”
Agent A: “Sure. Loading model. Reading report. Pooling context. Building attention on paragraph 47. Forming my opinions. ✅”
The GPU works for 30 seconds.
Agent A: “Done. Here’s a 200-token summary that I’m very proud of.”
You: “Great. Transfer to agent B.”
Agent A: “Wait, how exactly do you transfer it?”
You: “...as a string? In the prompt?”
Agent A: “Okay. So, you’re sending agent B my final string. Not my hidden state. Not the attention I computed over the 50 pages I just read. Not the fact that paragraph 47 was particularly important. Not the calibrated confidence I built. Just the string.”
You: “That’s how it works, yes.”
Agent A: “Cool. Cool cool cool. Have fun, B. 👋”
Agent B: “Hello, I’m a beautiful newborn with no state. Loading model. Reading agent A’s string from the beginning. Building context. Pooling. Forming opinions. ✅”
The GPU spends another 30 seconds essentially doing the same work that agent A just finished.
You: “...is there a way to skip the second context construction and attention calculation?”
Agent B: “What second reading?”
This is exactly what every "multi-agent swarm" I’ve ever seen looks like under the hood. Each transfer is a string-shaped bottleneck through which the sender's internal state cannot pass. The receiver gets the output text and reconstructs the context from it, which is the most expensive thing a transformer generally does, and the very thing this series has spent three parts trying to convince you to stop doing within a single pipeline.
Fun fact: I’ve already written about this problem, just not for LLM agents.
In 2026, my co-author and I published a paper titled “Inductive Latent Context Persistence: Closing the Post-Transfer Cold Start in 6G Radio Access Networks.” The setting is a mobile phone (user equipment, or UE) moving between 5G/6G base stations (gNBs). At each transfer, the per-UE recurrent state held at the source gNB is discarded, and the target gNB starts from a reset hidden state. The prediction model on the target side must then reconstruct this state from the few post-transfer radio measurements it has just received, while the UE is already in motion. The paper calls this the post-transfer cold start. Does that ring a bell?
Now, read this paragraph from the paper's contributions, slightly de-jargoned: “We treat the per-user recurrent state as portable network context. To solve the practical problem that the standard inter-cell message has a small size budget, we show that a differential update of 128 bytes is sufficient to preserve the predictive quality of a 128-dimensional GRU state across the transfer boundary. Our proposed ILCP protocol compresses the hidden state with a β-variational autoencoder (β-VAE), transports it over the standard 3GPP Xn interface, and projects it into the state space of the target gNB at the moment of transfer via a learned gate MLP.”
If you replace “source gNB” with “agent A,” “target gNB” with “agent B,” and “radio measurements” with “tokens from the next sub-task,” you have the architecture that this entire post is about. Same paper, same author; only the application domain is different.
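For readers who have not met the β-VAE before, its standard training objective is sketched here. This is the textbook form, not necessarily the exact loss used in the ILCP paper. The symbols are notation introduced for this illustration: h is the hidden state being compressed, z is the latent payload, d is the latent dimension, and the encoder is assumed to be a diagonal Gaussian with mean μ and standard deviation σ.

```latex
\mathcal{L}_{\beta\text{-VAE}}(h)
  = \mathbb{E}_{q_\phi(z \mid h)}\!\left[-\log p_\theta(h \mid z)\right]
  + \beta \, D_{\mathrm{KL}}\!\left(q_\phi(z \mid h) \,\|\, \mathcal{N}(0, I)\right),
\qquad
D_{\mathrm{KL}}
  = \tfrac{1}{2} \sum_{j=1}^{d} \left(\mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1\right)
```

The first term rewards faithful reconstruction of h from z; the second pulls the latent toward a unit Gaussian prior. A larger β trades reconstruction fidelity for a more regular, compact code, and β = 1 recovers the standard VAE objective.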
The contribution of this post is not the method — the method is already in the paper. The contribution of this post is the mapping: taking ILCP and wiring it for multi-hop LLM agents, end-to-end, in a small PyTorch repository you’ve already seen. The receipts you are about to see in section 5 are t