New AI Reasoning System Shocks Researchers: Unlimited Context Window

Context windows have gotten huge in just a few years. 8,000 tokens became 32,000, then 100,000, and now million-token models are part of the conversation.

On paper, it sounds like the long-context problem is over. Just dump everything into the prompt and let AI sort it out.

In real use, it doesn’t work like that. Quality slips, answers get vague, costs jump, and at some point the model starts to miss the thread. The fix researchers are excited about right now isn’t “even more tokens.”

It’s a different idea: stop forcing the model to read everything up front, and instead let it work through information step by step. That’s the promise behind recursive language models (RLMs), introduced by MIT and later turned into a more concrete system by Prime Intellect.


Why long context windows fail in real use

The hype is easy to understand: a bigger context window feels like giving the model a bigger brain.

But long prompts create several problems at once:

  • Performance drops as input length grows, even before the hard limit.
  • Answers get fuzzy, especially when reasoning depends on many parts of the input.
  • Costs explode because you pay for every token processed.
  • The model can lose the plot, mixing details or focusing on the wrong parts.

What’s surprising is that the failure isn’t only about “not enough room.” It’s also about how models behave when you give them too much at once.

The research framing this shift is laid out in the Recursive Language Models paper (arXiv), which treats long-context failures as a real, measurable phenomenon rather than a vibes-based complaint.

Related: if you’ve been tracking long-context progress across models, this Grok 4.1 long-context benchmark analysis is a helpful snapshot of how quickly “bigger context” has become a selling point.

“Context rot” is the real issue (and benchmarks show it)

Researchers have started using a blunt phrase for what happens when prompts get long: context rot.

The idea is simple: as inputs scale up, quality degrades. And it degrades faster when the task requires more than simple retrieval.

Simple retrieval vs real reasoning

Some tasks scale pretty well with longer inputs.

If the job is “find a phrase somewhere in this giant document,” models can often manage. That’s closer to search.

Where things break is when the answer depends on many parts of the input at once, or on relationships between parts. That’s where models start dropping details, confusing items, or making confident mistakes.

The benchmarks that make the drop obvious

Two benchmark styles called out in the discussion are:

  • Oolong: focuses on transforming or working across lots of entries.
  • Ulong Pairs: requires pair-wise aggregation across the input (a quadratic-style workload).

In the MIT results described, GPT-5’s performance drops sharply as input length moves from a few thousand tokens toward hundreds of thousands, especially on linear and quadratic tasks. On Ulong Pairs, performance essentially collapses, with F1 scores approaching zero.

Here’s the key takeaway: this happens before hitting the hard context limit, which suggests the bottleneck isn’t only “window size.”

| Benchmark type | What it demands | Observed behavior (GPT-5 in the paper discussion) |
| --- | --- | --- |
| Simple search | Find one thing in a huge input | Holds up relatively well |
| Oolong-style transforms | Work across many entries | Drops as length grows |
| Ulong Pairs | Pair-wise aggregation across the input | F1 falls close to zero |

How recursive language models work (the mindset shift)

RLMs are not “a bigger model” and not “a clever compression trick.”

They change what the model sees, and when.

Treat the prompt like an external world

Instead of stuffing a massive prompt into the model’s context and hoping for the best, RLMs treat the full input like an external workspace. The model doesn’t read everything.

It explores it.

It can inspect parts, search through it, pull small chunks, take notes, write little helper code to locate what matters, and call smaller helper models to process pieces.

A useful analogy is: the AI stops trying to memorize the whole book, and starts flipping pages, highlighting lines, and asking an assistant to summarize a paragraph when needed.
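
To make that idea concrete, here's a minimal sketch in Python. None of it is the MIT or Prime Intellect code: the `Workspace` class, its method names, and the toy input are assumptions for illustration. The point is only that the full input lives outside the model, and the model only ever sees small views of it.

```python
import re


class Workspace:
    """Holds the full input outside the model's context window."""

    def __init__(self, text: str):
        self.text = text            # the entire "prompt" lives here, unread
        self.notes: list[str] = []  # scratch space for intermediate results

    def peek(self, n: int = 500) -> str:
        """Return only the first n characters, e.g. to guess what kind of input this is."""
        return self.text[:n]

    def grep(self, pattern: str, window: int = 80) -> list[str]:
        """Return short snippets around each regex match instead of the whole text."""
        return [
            self.text[max(0, m.start() - window): m.end() + window]
            for m in re.finditer(pattern, self.text)
        ]

    def slice(self, start: int, end: int) -> str:
        """Pull one small chunk for closer analysis."""
        return self.text[start:end]


# The model works against `ws` instead of receiving the raw text directly.
huge_input = "...millions of tokens of documents, logs, or code..."
ws = Workspace(huge_input)
print(ws.peek(40))
```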


The “main model + workspace + helpers” setup

In the description, there’s one main model running the show. It connects to an environment where the full input lives. The main model can:

  • Skim a small section to understand what it’s looking at
  • Search for relevant words or patterns
  • Pull out short snippets for closer analysis
  • Store notes and intermediate results
  • Hand specific chunks to cheaper, smaller helper models

Then, once it has what it needs, it assembles a final answer that looks like a normal chat response. You still ask one question and get one answer, but behind the scenes the model is working in steps.
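
Continuing the sketch from above, the control loop could look roughly like this. The `call_llm` helper is a hypothetical stand-in for a model API call and the prompt wording is made up; the real MIT and Prime Intellect implementations are more elaborate.

```python
def call_llm(model: str, prompt: str) -> str:
    """Hypothetical stand-in for an API call to a language model."""
    raise NotImplementedError("wire up your own client here")


def answer(question: str, ws: "Workspace", main_model: str = "main", helper_model: str = "helper") -> str:
    # 1) Skim a small section to understand what we're looking at.
    overview = ws.peek(500)

    # 2) Ask the main model which terms are worth searching for.
    terms = call_llm(
        main_model,
        f"Question: {question}\nThe input starts with:\n{overview}\n"
        "List three search terms, comma separated.",
    ).split(",")

    # 3) Pull short snippets and hand each one to a cheaper helper model.
    for term in terms:
        for snippet in ws.grep(term.strip())[:5]:
            ws.notes.append(call_llm(
                helper_model,
                f"Extract anything relevant to '{question}' from:\n{snippet}",
            ))

    # 4) Assemble a normal-looking final answer from the stored notes.
    return call_llm(main_model, f"Question: {question}\nNotes:\n" + "\n".join(ws.notes))
```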

Why this changes the “long input” ceiling

There’s a benchmark described where the system can be given up to 1,000 full documents at once, which is millions of words. A normal “one-shot prompt” approach can’t realistically read all of that.

With RLMs, the system doesn’t try. It searches, zooms in, and ignores the rest.

So the main limit becomes less about “how much can I cram into context” and more about “how good is the system at finding what matters.”

Benchmark results: accuracy goes up, cost can go down

The numbers shared are hard to ignore because they show two wins at once: quality and cost.

1,000 documents at once

On the 1,000-document style benchmark:

  • RLM paired with GPT-5 reaches a little over 91% accuracy
  • Average cost per question comes in at just under $1

The older approach (forcing the model to read everything directly) is estimated at roughly $1.50 to nearly $3 per query, assuming it could even handle the data.

| Approach | Accuracy (reported) | Average cost per query (reported) |
| --- | --- | --- |
| RLM + GPT-5 | ~91% | Just under $1 |
| “Stuff it all in context” baseline | Not reliably feasible at this scale | ~$1.50 to ~$3 |

Code QA on LongBench V2

A tougher example is Code QA from LongBench V2:

| Setup | Accuracy (reported) |
| --- | --- |
| GPT-5 alone | 24% |
| + Summarization agent | 41.33% |
| RLM setup | 62% |

Then there’s a result that really clarifies what’s happening.

An ablation removes recursion but still gives the model access to the external environment (referred to as the REPL environment in the discussion). That variant hits 66% accuracy, which is higher than the full RLM in that case.

That’s a big signal: moving context out of the model’s head and into an external workspace helps a lot, even before recursion kicks in.

Ulong Pairs (the quadratic stress test)

This is where the gap gets extreme.

| Setup | F1 score (reported) |
| --- | --- |
| GPT-5 | 0.04 |
| Summarization agents | ~0 |
| Kodak (with retrieval) | 24.67 |
| Full RLM | 58.00 |
| “REPL only” (no recursion) | 43.93 |
| Qwen 3 Coder (base) | < 0.1 |
| Qwen 3 Coder (RLM) | 23.11 |

The way it’s described: the external environment gives the model somewhere to put all that context so it doesn’t overload, and the recursive sub-calls give it a way to reason over the input in manageable chunks.

What an RLM does step by step (and why it feels more “human”)

The MIT work also shows what these systems do while they’re working, and the behavior is pretty intuitive.

1) A quick first look

The model glances at the start of the input to classify what it’s dealing with.

Is it a list, documents, logs, code, or something else? That early read helps it choose a strategy.
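
A hedged sketch of that first-look step, reusing the hypothetical workspace and `call_llm` stand-in from earlier:

```python
def classify_input(ws: "Workspace", call_llm) -> str:
    """Guess the input type from a small peek so a strategy can be chosen up front."""
    sample = ws.peek(400)
    label = call_llm(
        "main",
        "Classify this input as one of: list, documents, logs, code, other.\n"
        f"Sample:\n{sample}\nAnswer with a single word.",
    )
    return label.strip().lower()

# e.g. "logs" -> search by timestamps, "code" -> search by identifiers,
# "documents" -> search by keywords before reading anything in full
```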

2) Selective search instead of full reading

Next, it searches for relevant words, patterns, or lines.

This is a quiet but important point: the model is already shrinking the problem without loading the entire input into working memory.

3) Split the workload into pieces

When things get complex, the model breaks the input into smaller chunks (lines, entries, individual docs).

Each chunk can be handled separately, often by helper models. The main model stays in control, collects the useful parts, and combines them.
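
A rough map-reduce sketch of that chunking step (again with the hypothetical `call_llm` stand-in, and a chunk size picked arbitrarily):

```python
def chunk_lines(text: str, lines_per_chunk: int = 200) -> list[str]:
    """Split a huge input into fixed-size line chunks that each fit a helper's context."""
    lines = text.splitlines()
    return ["\n".join(lines[i:i + lines_per_chunk])
            for i in range(0, len(lines), lines_per_chunk)]


def map_reduce(question: str, text: str, call_llm) -> str:
    # Map: a cheap helper model processes each chunk independently.
    partials = [call_llm("helper", f"{question}\n\nChunk:\n{chunk}")
                for chunk in chunk_lines(text)]
    # Reduce: the main model stays in control and combines only the useful parts.
    return call_llm("main", f"{question}\n\nPartial findings:\n" + "\n".join(partials))
```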

4) Build long outputs in parts

When the final answer itself is long, RLMs can build it piece by piece, save partial results, then assemble the final output at the end.

So it’s not forced to “talk continuously” inside a single output limit. It’s assembling a result from stored pieces.
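
In sketch form, assuming the same hypothetical `call_llm` helper:

```python
def build_long_output(section_plan: list[str], call_llm) -> str:
    """Draft each section separately, store the partial results, assemble at the end."""
    parts = []
    for section in section_plan:
        parts.append(call_llm("main", f"Write only this section, nothing else:\n{section}"))
    # The final answer is stitched together from stored pieces, so no single call
    # has to fit the whole output inside its own generation limit.
    return "\n\n".join(parts)
```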

Where recursion matters (and where it hurts)

The “recursive” part means the model can go back, ask again, refine something, or check its own work with smaller, focused calls.

That can catch mistakes that happen when too much information gets mixed together.

But it can also create extra work and higher cost. Some runs are fast and clean, others wander and double-check too much.
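
One way to picture the trade-off is a depth cap on recursive checking. The sketch below is an assumption about how such a cap could work, not how the paper's system actually decides when to stop:

```python
def solve(question: str, context: str, call_llm, depth: int = 0, max_depth: int = 2) -> str:
    draft = call_llm("main", f"{question}\n\nContext:\n{context}")
    if depth >= max_depth:
        return draft  # stop recursing: more checking costs more than it's likely worth

    # A small, focused sub-call checks the draft; only recurse if it flags a problem.
    verdict = call_llm(
        "helper",
        "Does this answer follow from the context? Reply yes or no.\n"
        f"Context:\n{context}\n\nAnswer:\n{draft}",
    )
    if verdict.strip().lower().startswith("yes"):
        return draft
    return solve(question, context, call_llm, depth + 1, max_depth)
```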

Cost and efficiency: competitive on average, messy in the worst case

One of the more honest parts of the discussion is about variance.

On average, RLM runs can be surprisingly competitive, sometimes cheaper than a single base model call that tries to handle everything directly.

But some runs go long. They make many sub-calls, burn budget, and still don’t always land the right answer.

The authors point out clear “low-hanging fruit” in current implementations:

  • Calls are synchronous and blocking
  • There’s no parallelism
  • There are no learned policies for when to split, when to stop, or how much checking is enough

That’s a roadmap for future improvement.
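
The "no parallelism" point, for example, could be addressed with ordinary async fan-out. The sketch below assumes a hypothetical async client (`call_llm_async`); it isn't from either codebase:

```python
import asyncio


async def call_llm_async(model: str, prompt: str) -> str:
    """Hypothetical async stand-in for a model API call."""
    await asyncio.sleep(0)  # placeholder for network I/O
    return f"[{model}] processed {len(prompt)} chars"


async def process_chunks_in_parallel(question: str, chunks: list[str]) -> list[str]:
    # Fire off all helper calls at once instead of one blocking call at a time.
    tasks = [call_llm_async("helper", f"{question}\n\n{chunk}") for chunk in chunks]
    return await asyncio.gather(*tasks)


# asyncio.run(process_chunks_in_parallel("Summarize the errors", ["chunk 1", "chunk 2"]))
```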

Prime Intellect’s RLMNV: turning the blueprint into a real system

Prime Intellect took the MIT approach and built a more concrete system called RLMNV (described as an environment-first setup).

The design choice is strict: the main model only gets access to a simple workspace. No giant tool outputs flooding its memory.

All the heavy lifting (web search, file access, messy retrieval) gets pushed to smaller helper models, while the main model stays focused on reasoning and decision-making.


Batch execution to speed things up

Prime Intellect also adds a way to send out many small tasks at once using batched LLM calls.

Instead of doing everything step by step, the system can split work and process it faster.
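
In sketch form, assuming a hypothetical `submit_batch` endpoint that accepts many prompts in one request (the actual batching interface isn't specified in the write-ups):

```python
def submit_batch(model: str, prompts: list[str]) -> list[str]:
    """Hypothetical batch endpoint: many prompts in one request, one result per prompt."""
    return [f"[{model}] result for prompt {i}" for i, _ in enumerate(prompts)]


def delegate_in_batches(subtasks: list[str], batch_size: int = 32) -> list[str]:
    results: list[str] = []
    # Group sub-tasks so helpers handle them in a few batched requests
    # rather than a long chain of one-at-a-time calls.
    for i in range(0, len(subtasks), batch_size):
        results.extend(submit_batch("helper", subtasks[i:i + batch_size]))
    return results
```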

A strict “final answer” rule

At the end, the system enforces a simple discipline: the model must write the final answer into a specific place and mark it done.

That sounds small, but it stops the common failure mode where an agent rambles, half-finishes, or never cleanly commits to a final output.
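
A minimal sketch of that discipline, with made-up names (`AnswerSlot`, `run_until_done`) rather than Prime Intellect's actual API:

```python
class AnswerSlot:
    """The run only counts as finished once a final answer is committed here."""

    def __init__(self) -> None:
        self.text: str | None = None
        self.done: bool = False

    def finalize(self, text: str) -> None:
        self.text = text
        self.done = True


def run_until_done(step, slot: AnswerSlot, max_steps: int = 50) -> str:
    for _ in range(max_steps):
        step(slot)              # one reasoning/tool step; it may or may not finalize
        if slot.done:
            return slot.text    # clean commit: no rambling, half-finished output
    raise RuntimeError("the agent never committed a final answer")
```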

The tasks they tested

Prime Intellect tested the setup across different scenarios:

  • Deep Dive: web research on long, noisy pages
  • Math Python: competition-style math using a coding environment
  • Oolong: reused directly
  • Verbatim Copy: reproducing complex structured data like JSON, CSV, or mixed code

Models like GPT-5 Mini and Prime Intellect’s own Intellect-3 MoE became more reliable when wrapped in this structure, according to the described results.

For a broader agent angle, this post on open-source AI agents with 128k context windows pairs well with the RLM idea, because both are pushing toward “real work” across long horizons.

Same RLM structure, different model behavior (GPT-5 vs Qwen 3 Coder)

One of the most interesting observations is that RLMs can work with different base models, but they don’t all behave the same way.

On BrowseComp-Plus:

  • RLM with GPT-5 nearly solves the benchmark.
  • RLM with Qwen 3 Coder struggles on about half the tasks.

The system prompt is described as basically identical. The main difference is a warning telling Qwen 3 Coder not to overuse helper calls.

That tiny change leads to very different behavior:

  • GPT-5 tends to be cautious and selective.
  • Qwen 3 Coder is more aggressive, splitting line by line, especially on oolong-style tasks.

So while the scaffold is general, results depend heavily on whether the base model has good judgment about when to search, when to split, and when to stop.

Limits today, and why the upside is big

Current RLM systems, as described, still have constraints:

  • Only one level deep (helper calls are normal LLM calls, not full RLMs)
  • Runs are largely sequential, not parallel
  • No reinforcement learning guiding decisions like splitting, recursion depth, or stopping
  • Some runs overthink, re-check too much, and waste budget

The upside is how trainable this looks.

The argument is that RLM runs create a new kind of reasoning trace, one that includes exploration behavior: what it looked at, what it ignored, what it verified, and how it decomposed the task.

Reasoning traces can be trained. If you combine this structure with reinforcement learning, you could teach models how to explore huge inputs efficiently, how deep to recurse, and when it’s time to stop.
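
What would such a trace even look like? A guess, not a format from either paper:

```python
from dataclasses import dataclass, field


@dataclass
class TraceEvent:
    action: str          # e.g. "peek", "grep", "delegate", "verify", "finalize"
    target: str          # what was looked at: a pattern, a chunk id, a sub-question
    result_summary: str  # what came back, compressed


@dataclass
class RunTrace:
    question: str
    events: list[TraceEvent] = field(default_factory=list)
    final_answer: str = ""
    # A reward over traces like this is what an RL setup could optimize:
    # fewer wasted lookups, shallower recursion, earlier stopping.
```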

There’s also a clean computer science analogy: out-of-core algorithms process data sets much larger than memory by controlling what gets loaded when. RLMs apply a similar approach to AI, using small working memory plus symbolic access to a massive external store.


For readers who want to go deeper than summaries, Prime Intellect’s perspective is worth comparing against the MIT framing in Prime Intellect’s Recursive Language Models overview. There’s also a more narrative explanation from the author in Alex L. Zhang’s Recursive Language Models write-up, plus an implementation direction to explore via the recursive language model code repository.

What I learned (and what changed how I think about AI)

Three ideas stuck with me.

First, long-context failures aren’t mysterious. The “context rot” framing matches what people see day to day: the longer and more interconnected the input, the more likely the model is to blur details or pick the wrong thread.

Second, the biggest win isn’t recursion by itself. The ablation results (where the model uses an external workspace but no recursive sub-calls) make a strong case that externalizing context is a major improvement on its own. It’s a cleaner way to work, because the model stops pretending it can hold everything at once.

Third, this shifts the question from “how many tokens can it read?” to “how well can it work?” That includes searching, chunking, verifying, and assembling. It also explains why two base models can act so differently under the same RLM scaffold. Judgment calls matter, and today’s models weren’t trained specifically for that role.

Conclusion

Bigger context windows helped AI, but they didn’t solve the hardest long-input problems. RLMs point to a more practical path: keep a smaller working memory, move the full input into an external environment, and let the system work through information step by step.

The early benchmark results described here suggest this can handle millions of tokens at competitive cost, and it can rescue tasks that cause standard long-context prompting to collapse. The most exciting part is that this progress happens at inference time, without changing the underlying architecture.

If you’re thinking about agents that can handle huge codebases, months of logs, or company-scale knowledge, recursive language models are one of the clearest answers on the table right now.
