Recursive Language Models: Dominating the Benchmarks

⚡

Key Takeaways

1Recursive Language Models (RLMs) dominate long-context benchmarks, surpassing traditional methods.

2A study revealed that RLMs use counterintuitive techniques to solve complex problems, such as counting letters in lists.

3The ReAct and CodeAct architectures show limitations compared to RLMs, particularly in context management and executing complex tasks.

💡Why it matters — RLMs are redefining the capabilities of language models, potentially influencing many technology sectors.

Large Language Models

Recursive Language Models (RLMs) are currently dominating all long-context benchmarks. This article explores what RLMs are, why they perform so well, and how they differ from existing agent designs.

Case Study

I spent a good part of last month implementing RLMs, running benchmarks, and producing a 50-minute tutorial video on the subject. Throughout this process, I answered over 100 questions on YouTube and X regarding RLMs. This article summarizes what I learned from answering these questions and the specific nuances of RLMs that made me say "eureka!"

The main reason RLMs seem inaccessible to many people is that some ideas are actually quite counterintuitive compared to existing methods (like ReAct, CodeAct, or sub-agents). The best way to understand RLMs is first to grasp where these other methods fail and to realize the missing piece in agent harnesses: the idea of passing context by reference, instead of replicating it.

Experiments

Among all the complicated experiments I conducted, the most enlightening was when I asked an RLM to:

"Generate 50 fruit names and count the number of Rs in each, returning as a dictionary."

A more advanced variation (let's call it Problem 2) was:

"Generate a dictionary of different categories: fruits, countries, animals. For each category, generate 50 names and count the number of Rs in each, returning as a nested dictionary."

For Problem 1, the expected output is something like:

{"strawberry": 3, "berry": 2, ... "grape": 1}

And for Problem 2, it looks like:

"fruits": {"strawberry": 3, "berry": 2, ... "grape": 1, ...},
"countries": {"united states of america": 1, "russia": 1, ...},
"animals": {"kangaroo": 1, "tiger": 1, ... "deer": 1, ...}

Although this may seem trivial, the way an RLM solves this problem is fundamentally different from other architectures like ReAct or CodeAct. Understanding how each method approaches this playful problem is essential to appreciate the beauty of RLMs.

The Agent Landscape

Direct Generation

The first method is direct generation. The LLM "thinks" about the user's request and generates a dictionary autoregressively.

Problems with this approach:

The LLM has no way to verify if it is mathematically correct.
The LLM is likely to be incorrect because, fundamentally, counting letters is not a "next word prediction" problem.
The chances of hallucination or errors are extremely high, even if the underlying LLM is intelligent.

ReAct (Reasoning and Action)

ReAct is a reasoning and action loop where the LLM first thinks about the problem (chain of thought) and then generates a tool call. In the system prompt, we pass a list of "function names" and instructions on how to call them.

For example, you might give a simple tool to the LLM that is just:

def count_alphabets_in_word(word: string, alphabet: string) -> int

With this idea, the ReAct agent can:

Generate a list of fruit names.
Use the tool to pass each fruit name and receive the integer output.
From its output memory, reconstruct the dictionary of fruits and their counts, then return.

The tracking of such a transaction would look like this:

Generate a dictionary with 50 fruits and the number of 'r' in each
<think> 50 fruit names are: strawberry, berry, grape, ... </think>
count_alphabets_in_word("strawberry", "r")
# Tool_Out (executes our function)
count_alphabets_in_word("berry", "r")  ## Tool call executed!
# Tool_Out (executes our function)
<think> I now have everything I need in my message history,
let's build this dictionary </think>
{ "strawberry": 3, "berry": 2, .... }

You see what the problems are, right? First, you need to define a count_alphabet_in_r function in advance for this specific use case. If you don't define a function, the agent simply reverts to the old method (i.e., directly generating alphabet counts)!

This ensures that the LLM has an idea of what the output should be, but the LLM still has to generate tokens one by one from its message history.

CodeAct

CodeAct allows the LLM to write code and execute it. This means you (the human) no longer need to write exact tools. You can simply give the LLM the ability to write any Python code and execute it in an isolated terminal environment, read the results, and generate the output.

This would unfold as follows:

Generate a dictionary with 50 fruits and the number of 'r' in each
<think> Alright, let's write some Python code for that. </think>
fruits = [ 'strawberry', 'berry', 'grape', .... ]
count_r = { k: sum(1 for c in fruit if c == 'r') for k, f in fruits }
print("Number of fruits: ", len(fruits)) print("Counts: " , count_r) '
# Tool output (Terminal output)
Number of fruits: 50
The counts are: {"strawberry": 3, "berry": 2 ....}
<think> Alright, I've read the terminal output,
let me write it back to return the output </think>
{ "strawberry": 3, "berry": 2, .... }

Basic Rules for ReAct vs CodeAct

Use ReAct when working on narrow products and you know exactly which tools the AI should use to solve a problem.
Use CodeAct when the domain is general.

Sub-agents

Sub-agent architectures are quite simple. There is a main agent, and it can launch smaller agents to perform sub-tasks. Each sub-agent is also a CodeAct agent that executes the tasks assigned to it and returns the output to the main agent.

The main advantage of the sub-agent architecture is that the main agent does not suffer from context degradation since it does not need to worry about the internal workings of the sub-agents. It is a complete black box.

We have already seen that the sub-agent architecture will easily solve Problem 1 with num_subagent = 0 (classic CodeAct), so let's see how this will work on Problem 2.

Generate a dictionary of different categories: fruits, countries, animals. For each category, generate 50 names. And count the number of Rs in each, returning as a nested dictionary
<think> Let's call some sub-agents and divide the tasks among them </think>
call_subagent("Return a dictionary of 50 fruit names and the number of r in each")
# Sub-agent (A new code-act module)
{"strawberry": 3, "berry": 2 ....}
call_subagent("Return a dictionary of 50 country names and the number of r in each")
{"france": 1, "russia": 1 ....}
call_subagent("Return a dictionary of 50 animal names and the number of r in each")
{"kangaroo": 1, "deer": 1 ....}
<think> I have responses from all the sub-agents, now I will write the final JSON </think>
"fruits": { "strawberry": 3, "berry": 2, .... },
"countries": { "france": 1, "russia": 1 .... },
"animals": { "kangaroo": 1, "deer": 1 .... }

We have made significant progress. CodeAct + Sub-agents can write arbitrary code for arbitrary things, but they still must:

READ the entire user prompt in their context window.
READ the entire output from the sub-agents in their context window.
WRITE autoregressively.