Recursive Language Models: Overcoming Context Rot and Enabling Complex Reasoning in AI Agents
The Core Challenge: Context Length vs. Task Complexity
- Context length isn’t enough – large documents (legal contracts, codebases) contain many internal references that create high task complexity.
- Context rot: performance drops not only when the token limit is reached but also as the reasoning task becomes more intricate.
- Lost‑in‑the‑middle problem: retrieving an isolated "needle" from a haystack is largely solved, but multi‑hop reasoning over inter‑linked clauses remains an open problem.
Why Traditional Approaches Fail
- Naïve stuffing – dumping the entire document into an LLM leads to noise, high cost, and rapid degradation.
- Summarization (e.g., Claude’s autocompact) – lossy; essential details for the task are often omitted, causing drift.
- Retrieval‑Augmented Generation (RAG) – works for simple Q&A but cannot capture the logical relationships needed for multi‑hop reasoning; it also depends heavily on fragile chunking strategies.
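The chunking fragility above can be seen in a toy example. The document and chunk size below are illustrative choices, not from the video: fixed-size chunks separate a clause from the rules it modifies, so a keyword retriever returns text that cannot answer a multi-hop question on its own.

```python
# Toy demo: fixed-size chunking separates cross-referenced clauses.
# The document text and chunk size are hypothetical, chosen to force a bad split.
doc = (
    "Section 1: Payment is due within 30 days, except as stated in Section 3. "
    "Section 2: Late fees accrue at 2% per month. "
    "Section 3: Payment terms in Section 1 are waived during a force majeure event."
)

chunk_size = 80
chunks = [doc[i:i + chunk_size] for i in range(0, len(doc), chunk_size)]

# A naive keyword retriever: return only chunks mentioning the query term.
hits = [c for c in chunks if "force majeure" in c]

# The retrieved chunk says the terms are waived, but the "30 days" rule it
# overrides lives in a different chunk -- answering "when is payment due
# during force majeure?" requires a hop the retriever never makes.
print(hits[0])
```

A graph-aware approach would follow the "Section 1" reference inside the retrieved clause instead of stopping at the first keyword match.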
A Better Mental Model: Dependency Graphs
- Treat contracts or codebases as nodes (clauses, functions) linked by edges (references, calls) rather than linear text.
- This graph view mirrors how humans navigate cross‑referencing sections and enables systematic reasoning.
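A minimal sketch of this graph view, using hypothetical clause IDs and texts: clauses become nodes, textual references become edges, and a traversal recovers everything a clause transitively depends on.

```python
import re

# Hypothetical contract clauses; each "clause X.Y" mention is a graph edge.
clauses = {
    "1.1": "Payment is due within 30 days, subject to clause 4.2.",
    "2.3": "Confidentiality survives termination per clause 5.1.",
    "4.2": "Deadlines in clause 1.1 are suspended during force majeure.",
    "5.1": "Termination requires 60 days written notice.",
}

def build_edges(clauses):
    """Edge (a, b) means clause a references clause b."""
    edges = set()
    for cid, text in clauses.items():
        for ref in re.findall(r"clause (\d+\.\d+)", text):
            edges.add((cid, ref))
    return edges

def dependencies(cid, edges):
    """Transitively collect every clause cid depends on (iterative DFS)."""
    seen, stack = set(), [cid]
    while stack:
        cur = stack.pop()
        for a, b in edges:
            if a == cur and b not in seen:
                seen.add(b)
                stack.append(b)
    return seen

edges = build_edges(clauses)
print(sorted(dependencies("1.1", edges)))
```

Note that clauses 1.1 and 4.2 reference each other, so the traversal must track visited nodes; linear reading of the same text gives no such cycle detection for free.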
Introducing Recursive Language Models (RLM)
- REPL = Read‑Evaluate‑Print‑Loop, executed inside a Python environment.
- Read: fetch the current state of the data object (e.g., a contract variable).
- Evaluate: run any programmatic operation – slicing, keyword search, custom logic.
- Print: return results to the interpreter.
- Loop: repeat until the query is resolved.
- Recursion: the primary model can hand off sub‑tasks to a smaller model, creating a controlled, one‑layer deep recursion that mimics a hand‑off rather than an infinite loop.
- This structure reduces required context, enables flexible searching, and builds the dependency graph on‑the‑fly.
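The steps above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `call_llm` is a hypothetical stub standing in for a real model API, and the contract text is invented.

```python
# Sketch of one RLM step. `call_llm` is a hypothetical stub; a real
# implementation would call an LLM API here.
def call_llm(prompt: str, model: str = "large") -> str:
    return f"[{model} model response to {len(prompt)} chars]"

# The large context lives as a plain Python variable, NOT inside any prompt.
contract = "... thousands of clauses ... Clause 9: force majeure suspends deadlines."

# Read: cheap inspection of the object before committing any tokens.
preview = contract[:200]

# Evaluate: programmatic search narrows the context to the relevant region.
idx = contract.find("force majeure")
relevant = contract[max(0, idx - 100): idx + 100]

# Recurse (one layer deep): hand the slice off to a smaller model.
summary = call_llm("Summarize: " + relevant, model="small")

# The root model answers from the distilled summary, not the full document.
answer = call_llm("Given: " + summary + "\nWhat happens to deadlines?")
print(answer)
```

The key property is that the root model only ever sees slices and summaries it requested, so the effective context it consumes stays small regardless of document size.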
Experimental Results
- Tested on GPT‑5 and a 340‑billion‑parameter Qwen model.
- RLMs achieved higher accuracy at lower cost compared to plain context stuffing, summarization, or RAG.
- They could reason over contexts orders of magnitude larger than the model’s native window without severe performance loss.
Limitations & Guardrails
- Model size matters – small models showed noticeable degradation; high‑capacity models are still preferred.
- Recursion safety – infinite loops can become expensive; the paper enforces a single‑layer recursion and synchronous execution.
- When not to use RLMs – for low‑complexity, short‑context tasks a single‑shot LLM call often outperforms the REPL approach.
- Operational complexity – monitoring, observability, and prompt engineering become more demanding.
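The single-layer recursion guardrail mentioned above can be enforced with a simple depth counter. The function and constant names below are illustrative, not from the paper:

```python
# Illustrative single-layer recursion guard: sub-tasks may be delegated once,
# and any attempt to recurse a second level fails loudly instead of looping.
MAX_DEPTH = 1

def run_subtask(task: str, depth: int = 0) -> str:
    if depth > MAX_DEPTH:
        raise RecursionError("RLM guardrail: only one layer of recursion allowed")
    if task.startswith("split:"):
        # Delegate each piece exactly one level down, synchronously.
        parts = task[len("split:"):].split(";")
        return " | ".join(run_subtask(p, depth + 1) for p in parts)
    return f"done({task})"

print(run_subtask("split:a;b"))
```

Synchronous, depth-bounded execution keeps cost predictable: the worst case is one root call plus a fixed fan-out of sub-calls, never an open-ended chain.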
Practical Implications
- Beyond software engineering: legal analysis, policy review, internal document synthesis, and any domain with large, self‑referencing data assets.
- Data provenance remains essential to mitigate hallucinations.
- The approach opens a path to AI agents that can reliably handle high‑complexity, large‑context workloads without prohibitive cost.
Key Takeaways
- Model complex documents as dependency graphs, not linear text.
- Use code execution + recursion (RLM/REPL) to intelligently search and synthesize information, dramatically reducing context requirements.
- Apply this method selectively: ideal for large‑context, high‑complexity retrieval and synthesis tasks, but keep guardrails and model size considerations in mind.
Treating intricate documents as dependency graphs and leveraging a simple read‑evaluate‑print‑loop with controlled recursion lets AI agents overcome context rot, enabling accurate, cost‑effective reasoning over massive, self‑referencing data sets.
Frequently Asked Questions
Who is Brainqub3 on YouTube?
Brainqub3 is a YouTube channel that publishes videos on a range of topics.