Brief IA

Memory and Context Engineering: Challenges of Agentic AIs

🔬 Research · Tom Levy

Key Takeaways
1. AI agents face context and memory issues during complex tasks, compromising their effectiveness.
2. Context engineering focuses on managing temporary information within a single AI session.
3. Memory engineering ensures the persistence and reliability of information across AI sessions.
💡 Why it matters: Effective management of context and memory is crucial for the proper functioning of AI agents in complex environments.

📄 Full Analysis

Introduction

As artificial intelligence (AI) agents take on longer tasks and multi-session use, a problematic pattern is emerging. Agents sometimes drop constraints mid-task, previously used information resurfaces where it does not belong, and the context of one step unduly influences the next. Tracing these failures to a source is difficult, because no single component appears to be at fault.

The root of the problem often lies in two areas: context engineering and memory engineering. The two disciplines are interconnected, but they are distinct and fail in different ways, so each needs its own dedicated mechanisms.

This article explores the critical decisions behind each discipline and their point of interaction:

  • The implications of context engineering and the decisions that influence an agent's reasoning during a single call.
  • The aspects of memory engineering, including writing policy, storage, retrieval, and maintenance, that affect long-term reliability.
  • The common boundary between these two disciplines during information retrieval, and the frequent modes of failure when this boundary is poorly managed.

Understanding these two areas, separately and together, is essential to ensure that an agent operates effectively in real-world work environments.

An Overview of Context and Memory Engineering

Context engineering concerns the design of a single inference call: deciding what to include, what to compress, where to place each element, and what to omit. The information involved is temporary; it disappears once the call completes.

In contrast, memory engineering focuses on what persists beyond a single interaction with a model. It encompasses the systems and policies responsible for writing, storing, retrieving, updating, and governing information so that future interactions can benefit from it. When an agent recalls information from a previous session, collaborates with another agent, or applies a previously learned user preference, it relies on memory engineering rather than context engineering.

While context engineering determines what information is available to the model during a specific request, memory engineering determines what information persists between requests and how it is maintained, retrieved, and deemed reliable over time. Here’s an overview:

| Aspect | Context Engineering | Memory Engineering |
|---|---|---|
| Scope | A single inference call | Across calls, sessions, agents |
| Where data lives | In the model's active window | External storage: vector DB, K/V, relational |
| Main issue | What to include and how to organize it | What to persist, retrieve, and trust |
| Fails when | The window is full, placement is incorrect, noise outweighs signal | Retrieval failures, obsolescence, poisoning, no writing policy |
| Engineering surface | Prompt structure, compression, token budgeting | Storage schema, retrieval strategy, writing and updating policies |
| Data lifespan | Duration of an LLM call | Depends on the type of memory |

Context Engineering: Assembling the Optimal Context Window

For an agent executing a multi-step workflow, each inference call assembles a context window from multiple sources: system prompt, task description, conversation history, tool outputs, retrieved documents, summaries from sub-agents. Context engineering encompasses the decisions that determine what each component contributes, in what form, and at what position.

Selective Inclusion

Not everything available should enter the context. A database query returning hundreds of rows, a web search yielding five full articles, or a code executor logging verbose output can overload the window and degrade reasoning quality well before the token limit is reached. Deciding what to include as-is, what to compress into key facts, and what to omit is a deliberate design choice, not something to leave to default behavior.
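The include-or-omit decision can be sketched as a budgeted selection. This is a minimal illustration, not a reference implementation: the `Candidate` type, the `relevance` score (assumed to be computed elsewhere), the word-count token estimate, and the thresholds are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    text: str
    relevance: float  # hypothetical score in [0, 1], assumed to come from elsewhere


def estimate_tokens(text: str) -> int:
    # Crude assumption: roughly one token per whitespace-separated word.
    return len(text.split())


def select_for_context(candidates, budget_tokens, min_relevance=0.3):
    """Keep the most relevant candidates that fit the budget; omit the rest."""
    included, omitted = [], []
    remaining = budget_tokens
    # Visit candidates from most to least relevant.
    for cand in sorted(candidates, key=lambda c: c.relevance, reverse=True):
        cost = estimate_tokens(cand.text)
        if cand.relevance >= min_relevance and cost <= remaining:
            included.append(cand.name)
            remaining -= cost
        else:
            omitted.append(cand.name)
    return included, omitted
```

An oversized but relevant candidate lands in `omitted` here; in a fuller design it would be a natural input for compression rather than being dropped outright.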

Structural Placement

The location of information in the window affects how reliably the model uses it. Models tend to pay more attention to content at the beginning and end of long contexts, while material in the middle receives significantly less weight. This is known as the "lost in the middle" effect.

Hard constraints and critical task instructions should be positioned at the top of the window. The retrieved information most relevant to the current task should be placed near the end of the context window.

The user's current query or task should generally follow the retrieved information, positioning both the relevant context and the immediate goal as close as possible to the generation point. This arrangement increases the likelihood that the model will effectively utilize the retrieved information when producing its response.
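The ordering described above can be sketched as a small assembly function. The section labels, the function name `assemble_context`, and the `(score, text)` shape of `retrieved` are assumptions made for illustration.

```python
def assemble_context(constraints, history_summary, retrieved, query):
    """Build one prompt string: constraints first, retrieved facts late, query last.

    `retrieved` is a list of (relevance_score, text) pairs; the scores are
    assumed to come from an upstream retriever.
    """
    # Ascending sort puts the highest-scoring snippet last, closest to the query.
    ordered = [text for _, text in sorted(retrieved, key=lambda pair: pair[0])]
    sections = [
        ("CONSTRAINTS", "\n".join(constraints)),
        ("HISTORY", history_summary),
        ("RETRIEVED", "\n".join(ordered)),
        ("TASK", query),
    ]
    # Skip empty sections so they do not add noise to the window.
    return "\n\n".join(f"[{title}]\n{body}" for title, body in sections if body)
```

Less critical material such as the history summary sits in the middle of the window, where reduced attention costs the least.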

[Figure: Overview of Context Engineering]

Compression on Arrival

Tool outputs should be compressed as soon as a call returns, not once the window is full. A raw API response of 3,000 tokens, of which the agent needs only 150, should be summarized before it enters the context for the next step. Waiting until the window is full and then rushing to truncate is a reactive fix for a problem that compression at the source prevents.
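A minimal sketch of compression at the point of arrival, assuming the tool returns JSON and the caller knows which fields the next step needs. `AgentContext`, `add_tool_result`, and the field names are hypothetical.

```python
import json


def compress_tool_output(raw_json: str, needed_fields: list[str]) -> str:
    """Reduce a raw JSON tool response to the fields the next step needs."""
    payload = json.loads(raw_json)
    kept = {key: payload[key] for key in needed_fields if key in payload}
    # Compact separators avoid spending tokens on whitespace.
    return json.dumps(kept, separators=(",", ":"))


class AgentContext:
    """Hypothetical holder for the pieces that will form the next prompt."""

    def __init__(self):
        self.entries = []

    def add_tool_result(self, tool_name, raw_json, needed_fields):
        # Compression happens here, on arrival, so the raw payload never
        # enters the context in the first place.
        compact = compress_tool_output(raw_json, needed_fields)
        self.entries.append(f"{tool_name}: {compact}")
```

Field selection only works for structured responses; free-text outputs would need a summarization step in the same position.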

Conversation History Management

Conversation history grows faster than any other context component. For long-running agents, carrying the complete history into every call makes each subsequent inference more costly and less reliable. A compression strategy (sliding window, hierarchical summaries, or structured state extraction) should be applied at defined intervals, not when the window overflows.
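One way to combine a sliding window with a running summary is sketched below. The `History` class and its parameters are hypothetical, and `summarize` is a placeholder: a real system would likely call a model there. The sketch assumes `keep_recent` is at least 1.

```python
def summarize(turns):
    # Placeholder summarizer (assumption): stands in for a model-generated summary.
    return f"[summary of {len(turns)} earlier turns]"


class History:
    def __init__(self, keep_recent=4, compact_every=6):
        self.summary = ""
        self.turns = []
        self.keep_recent = keep_recent      # turns kept verbatim (must be >= 1)
        self.compact_every = compact_every  # interval that triggers compaction

    def add(self, turn):
        self.turns.append(turn)
        # Compact at a defined interval, not when the window overflows.
        if len(self.turns) >= self.compact_every:
            self.compact()

    def compact(self):
        old = self.turns[:-self.keep_recent]
        self.turns = self.turns[-self.keep_recent:]
        pieces = ([self.summary] if self.summary else []) + [summarize(old)]
        self.summary = " ".join(pieces)

    def render(self):
        """Return what would be placed in the context: summary, then recent turns."""
        return ([self.summary] if self.summary else []) + self.turns
```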

Memory Engineering: Designing Persistent Memory Systems for AI

Once an inference call is complete, memory engineering determines what deserves to persist and under what conditions it will be reused. This covers four distinct concerns: what to write, where to store it, how to retrieve it, and how to maintain accuracy over time.

Writing Policy Design

Writing policy design is one of the most overlooked aspects of memory engineering, yet it has a disproportionate impact on memory quality over time. While retrieval systems often receive the most attention, the quality of retrieval is ultimately constrained by what enters memory storage in the first place.

A well-defined writing policy specifies:

  • What events trigger a memory write
  • What information is eligible for storage
  • The format in which information is stored, such as plain text, structured records, extracted facts, or summaries
  • Trust or validation requirements for accepting new entries
  • Which agents, tools, or system components are allowed to write to specific memory spaces
  • How updates, corrections, and conflicting information are managed
  • Retention rules, expiration policies, and lifespan requirements (TTL) for different types of memory

Without an explicit writing policy, systems tend to store too much, assign equal trust to every entry, and retain data indefinitely. Over time, low-value and obsolete memories accumulate, the signal-to-noise ratio falls, and retrieval quality deteriorates. The result is a memory system that keeps growing while becoming progressively less useful.
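Several items from the checklist above (who may write, what is eligible, trust requirements, and TTL) can be expressed as a small gate in front of the store. This is a sketch under stated assumptions: `WritePolicy`, `evaluate_write`, the memory kinds, and the threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class WritePolicy:
    allowed_writers: set    # agents or tools permitted to write
    ttl_seconds: dict       # eligible kind -> TTL in seconds (None = no expiry)
    min_confidence: float   # trust threshold for accepting an entry


def evaluate_write(policy, writer, kind, confidence):
    """Return (True, ttl) if the write is accepted, else (False, reason)."""
    if writer not in policy.allowed_writers:
        return False, "writer not allowed"
    # Only kinds listed in the TTL table are eligible for storage.
    if kind not in policy.ttl_seconds:
        return False, "kind not eligible"
    if confidence < policy.min_confidence:
        return False, "confidence below threshold"
    return True, policy.ttl_seconds[kind]
```

Rejecting at write time is what keeps the store from accumulating entries that retrieval would later have to filter out.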

Storage Level Selection

Different types of memory serve different purposes and require different storage backends. The choice of backend also constrains the available retrieval strategies.

| Type of Memory | What It Stores |
|---|---|
| Working Memory | Active task state, intermediate results |
| Episodic Memory | Past interactions, task executions, decisions |
| Semantic Memory | Persistent facts, user preferences, domain knowledge |
| Procedural Memory | Learned workflows, successful action patterns |

OpenAI's context customization cookbook makes a useful distinction between retrieval-based memory and state-based memory for use cases requiring continuity. Retrieval-based memory treats past interactions as loosely related documents and is brittle when phrasing varies or updates conflict. Structured state extraction (writing typed, validated facts rather than embedding raw conversation snippets) produces more consistent results for facts that must be applied reliably across sessions.

[Figure: Overview of Memory Engineering]

Retrieval Strategy

Reading from memory is not a single lookup. A well-designed retrieval layer first checks working memory (fast, low-cost, exact key search), falls back to semantic search over episodic or semantic memory when nothing relevant turns up, applies metadata filters for recency and confidence before returning results, and injects only what the current step needs.
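The tiered lookup can be sketched as follows. This is an illustration only: the entry layout (`text`, `written_at`, `confidence`), the function names, and the thresholds are assumptions, and `word_overlap` is a deliberately simple stand-in for embedding-based semantic search.

```python
import time


def word_overlap(query, text):
    # Stand-in for embedding similarity (assumption): Jaccard overlap of words.
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / len(q | t) if q | t else 0.0


def retrieve(query, key, working, long_term, now=None,
             max_age_s=30 * 86400, min_confidence=0.6, top_k=2):
    """Check working memory by exact key, then fall back to filtered search."""
    now = time.time() if now is None else now
    # Tier 1: exact key lookup in working memory.
    if key in working:
        return [working[key]]
    # Metadata filters: drop entries that are too old or not trusted enough.
    eligible = [e for e in long_term
                if now - e["written_at"] <= max_age_s
                and e["confidence"] >= min_confidence]
    # Tier 2: similarity search over what survived the filters.
    scored = [(word_overlap(query, e["text"]), e["text"]) for e in eligible]
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Inject only the top few results, not everything that matched.
    return [text for _, text in scored[:top_k]]
```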

Memory Maintenance

Storage without a maintenance policy degrades over time. Entries accumulate, obsolete facts compete with current ones, and retrieval quality drops as the signal-to-noise ratio falls. Four maintenance routines matter in practice: confidence decay on volatile facts, deduplication of semantically similar entries, TTL-based expiration of working memory and time-sensitive data, and periodic compression of old episodic records into session-level summaries.
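Three of these routines (TTL expiration, confidence decay, and deduplication) can be sketched as a single maintenance pass. The entry fields, the exponential half-life decay, and the thresholds are assumptions, and deduplication here uses normalized exact text as a simple stand-in for semantic similarity.

```python
def decayed_confidence(confidence, age_s, half_life_s):
    """Exponential decay: confidence halves every half_life_s seconds."""
    return confidence * 0.5 ** (age_s / half_life_s)


def maintain(entries, now, min_confidence=0.3):
    """Return the entries that survive expiry, decay, and deduplication."""
    kept, seen = [], set()
    # Newest first, so the most recent copy of a duplicate is the one kept.
    for e in sorted(entries, key=lambda e: e["written_at"], reverse=True):
        age = now - e["written_at"]
        if e.get("ttl_s") is not None and age > e["ttl_s"]:
            continue  # TTL expired
        conf = e["confidence"]
        if e.get("half_life_s"):
            # Only volatile facts carry a half-life; stable facts do not decay.
            conf = decayed_confidence(conf, age, e["half_life_s"])
        if conf < min_confidence:
            continue  # trust has decayed too far
        fingerprint = " ".join(e["text"].lower().split())
        if fingerprint in seen:
            continue  # duplicate of a newer entry
        seen.add(fingerprint)
        kept.append({**e, "confidence": conf})
    return kept
```

Compressing old episodic records into session summaries is left out of the sketch, since it depends on a summarizer.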

A MemoryEntry schema that encodes these concerns directly keeps the writing and maintenance logic simple.
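The source names `MemoryEntry` but does not show its fields, so the following is one possible shape rather than the author's schema. Every field name and default here is an assumption, chosen to cover memory type, provenance, trust, and expiry.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class MemoryEntry:
    content: str
    memory_type: str                   # "working", "episodic", "semantic", "procedural"
    source: str                        # which agent or tool wrote the entry
    confidence: float = 1.0            # trust level, lowered by decay or conflicts
    created_at: float = field(default_factory=time.time)
    ttl_seconds: Optional[float] = None  # None means the entry never expires
    supersedes: Optional[str] = None     # id of an entry this one replaces, if any

    def is_expired(self, now: Optional[float] = None) -> bool:
        """True once the entry has outlived its TTL."""
        now = time.time() if now is None else now
        return self.ttl_seconds is not None and now - self.created_at > self.ttl_seconds
```

With expiry and trust stored on the entry itself, write-time checks and maintenance passes can read the same fields instead of consulting separate tables.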
