The assumption driving AI for the last three years is wrong. Or at least, it might be. The assumption: more parameters equals more intelligence. Stack more layers, train on more tokens, spend more compute. That has been the equation. A 22-year-old just built something that questions it — not theoretically, but in working PyTorch code — and the math behind it is harder to dismiss than most people expected.
The project is called OpenMythos. The architecture at its core is called a Recurrent Depth Transformer, or RDT. And the claim on the table is this: a 770 million parameter RDT matches the output quality of a 1.3 billion parameter standard transformer trained on identical data. Roughly 40 percent fewer parameters. Same result.
That is either a significant efficiency breakthrough or a well-constructed research hypothesis that hasn't fully been stress-tested yet. Probably somewhere in between. Either way, it is worth understanding what the architecture actually does.
In This Article
- The Problem With Standard Transformers
- What a Recurrent Depth Transformer Actually Does
- The Three-Part Structure: Prelude, Loop, Coda
- Why Mixture-of-Experts Makes the Loop Smarter
- Latent Space Reasoning vs Chain-of-Thought
- The Stability Problem and How OpenMythos Addresses It
- What the Research Actually Shows
- OpenMythos and Claude Mythos: The Hypothesis
- My Take
- Key Takeaways
- FAQ
The Problem With Standard Transformers
Thesis: Standard transformers scale capability by adding unique layers. That works, but it locks reasoning depth at training time.
Every transformer-based model you've heard of — GPT, LLaMA, Mistral, Claude, Gemini — follows roughly the same structural logic. You take an input, pass it through a fixed stack of transformer layers, and get an output. Each layer has its own weights. Each layer processes the representation slightly differently. Want more capability? Add more layers. Add more parameters. That has been the path.
The problem is structural. Once you train the model with a specific number of layers, that depth is fixed. It doesn't matter if the question you're asking needs five steps of reasoning or fifty. The model has the same fixed depth every time. Chain-of-thought prompting was invented partly to work around this, forcing the model to generate intermediate tokens so it can "use" more of its own output as context. That helps. It doesn't solve the underlying rigidity.
Saunshi et al. (2025) put a specific number on the gap: standard transformers trained on 5-hop reasoning chains reliably fail when tested on 10-hop chains at inference time. The model can't extend its depth beyond what it saw in training. It has no mechanism to do so.
Verdict: Stacking more layers solves the capability problem but creates a rigidity problem. Reasoning depth becomes hard-coded into the model's structure at training time. RDT attempts to decouple those two things.
What a Recurrent Depth Transformer Actually Does
Thesis: Instead of stacking hundreds of different layers, RDT takes a smaller set of layers and runs them multiple times.
The core idea is simpler than it sounds. A standard transformer applies each layer once and moves on. An RDT applies the same computational block repeatedly — in OpenMythos, up to 16 times — before producing any output. The weights don't change between iterations. The hidden state does.
Think of it less like reading a book and more like revising a draft. First pass: rough understanding. Second pass: sharper. By the sixteenth pass, the internal representation of the input has been refined through 16 rounds of processing — all using the same underlying parameters.
Each iteration updates the hidden state using three components: the previous hidden state, the original input signal, and the transformer computation itself. That last part matters. The input signal is re-injected at every loop. Without that, the model would drift — by loop 10, it would have lost track of what it was actually supposed to be processing. The re-injection is what keeps the reasoning anchored.
The mathematical structure is an update rule: hidden state at step N equals a learned combination of the previous state and the current computation. The matrices controlling how much weight to give each component are learned during training.
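As a sketch, that update rule is easy to write down in PyTorch. Everything below is a hypothetical simplification for illustration, not OpenMythos code: the class name, the stand-in feed-forward block, and the plain linear mixing matrices are all assumptions.

```python
import torch
import torch.nn as nn

class RecurrentUpdate(nn.Module):
    """One loop step: a learned combination of the previous hidden state,
    the re-injected input, and the shared block's computation.
    (Hypothetical sketch, not the actual OpenMythos implementation.)"""
    def __init__(self, d_model: int):
        super().__init__()
        self.W_h = nn.Linear(d_model, d_model, bias=False)  # weight on previous state
        self.W_x = nn.Linear(d_model, d_model, bias=False)  # weight on re-injected input
        self.block = nn.Sequential(  # stand-in for the shared transformer block
            nn.Linear(d_model, d_model), nn.GELU(), nn.Linear(d_model, d_model)
        )

    def forward(self, h: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
        # h_{n+1} = W_h h_n + W_x x + f(h_n): same weights at every step
        return self.W_h(h) + self.W_x(x) + self.block(h)

step = RecurrentUpdate(d_model=64)
x = torch.randn(2, 10, 64)   # (batch, seq, dim) encoded input
h = torch.zeros_like(x)      # initial hidden state
for _ in range(16):          # 16 iterations, one set of parameters
    h = step(h, x)           # x is re-injected every loop to anchor the state
```

The key detail is in the loop: `x` appears in every call, which is the re-injection the paragraph above describes.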
Verdict: The efficiency gain is real in theory. Fewer parameters, more inference-time compute. The depth of reasoning isn't fixed — it's a function of how many loops you run.
The Three-Part Structure: Prelude, Loop, Coda
OpenMythos implements this as three distinct components. Clean separation. Each does one thing.
Prelude: A standard transformer layer that runs exactly once. Its job is to encode the input. Take raw tokens, produce a meaningful initial representation. It sets the starting hidden state for everything that follows.
Recurrent Block: This is the loop. The same transformer block, same weights, applied up to 16 times. At each iteration, the hidden state is updated. The input signal is re-injected to prevent drift. By the final iteration, the representation has been refined through 16 passes of processing.
Coda: Another standard transformer layer, runs once, at the end. Takes the refined hidden state and produces the final output. The coda doesn't do heavy lifting — the recurrent block already did it.
First time I saw this structure, I thought the prelude and coda were redundant. They're not. The prelude converts raw tokens into a representation suitable for the recurrent block. The coda converts the refined latent state back into something that can produce a token prediction. They're adapters, essentially. The loop is where the actual reasoning happens.
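A minimal skeleton of the prelude/loop/coda split might look like this in PyTorch. The layer internals are simplified stand-ins (a stock encoder layer instead of the project's actual blocks), and all names are mine, not OpenMythos's:

```python
import torch
import torch.nn as nn

class RDTSketch(nn.Module):
    """Prelude (once) -> recurrent block (n times, shared weights) -> coda (once).
    Illustrative skeleton only; internals are simplified stand-ins."""
    def __init__(self, vocab: int, d: int, max_loops: int = 16):
        super().__init__()
        self.embed = nn.Embedding(vocab, d)
        self.prelude = nn.TransformerEncoderLayer(d, nhead=4, batch_first=True)
        self.loop_block = nn.TransformerEncoderLayer(d, nhead=4, batch_first=True)
        self.coda = nn.TransformerEncoderLayer(d, nhead=4, batch_first=True)
        self.head = nn.Linear(d, vocab)
        self.max_loops = max_loops

    def forward(self, tokens: torch.Tensor, n_loops=None) -> torch.Tensor:
        n = n_loops or self.max_loops
        x = self.prelude(self.embed(tokens))   # runs once: encode the input
        h = x
        for _ in range(n):                     # same block, same weights, n times
            h = self.loop_block(h + x)         # re-inject the input to prevent drift
        return self.head(self.coda(h))         # runs once: decode to token logits

model = RDTSketch(vocab=100, d=32, max_loops=16)
logits = model(torch.randint(0, 100, (2, 8)))
```

Note that `n_loops` is a runtime argument, not a structural constant. Nothing stops you from running more iterations at inference than were used in training, which is the depth-flexibility the architecture is built around.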
Why Mixture-of-Experts Makes the Loop Smarter
Thesis: Running the same computation 16 times sounds wasteful. Mixture-of-Experts (MoE) is what makes each pass genuinely different.
The obvious objection to RDT: if you're reusing the same weights every loop, aren't you just doing the same thing 16 times? That would be useless. The answer is no, and the reason is MoE routing.
Instead of a single feed-forward layer inside the recurrent block, OpenMythos uses a mixture of 384 experts. Think of each expert as a small specialized sub-network, trained to handle different types of reasoning or content domains. At any given loop iteration, only a small subset of those experts is active — 8 experts per input, per Kimi K 2.6's implementation of the same idea.
The routing is dynamic. The same input at loop 3 might activate a completely different set of experts than at loop 9. So each pass isn't repeating the same computation — it's routing through different specialists, processing different aspects of the problem.
This gives you two complementary axes. MoE provides breadth — access to a wide range of domain-specific processing. The loop provides depth — multiple passes of iterative refinement. That combination is what makes the parameter efficiency claim credible. You don't need to store separate weights for every reasoning pattern. You route to the right experts, multiple times, and let the iteration do the depth work.
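The routing mechanism itself is standard top-k MoE. Here is a scaled-down sketch (16 experts with 2 active, versus the described 384 with 8, and a naive loop over experts instead of a batched dispatch; all names are hypothetical):

```python
import torch
import torch.nn as nn

class TopKMoE(nn.Module):
    """Top-k expert routing sketch, scaled down from the described
    384-expert / 8-active configuration. Illustrative only."""
    def __init__(self, d: int, n_experts: int = 16, k: int = 2):
        super().__init__()
        self.router = nn.Linear(d, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d, d), nn.GELU(), nn.Linear(d, d))
            for _ in range(n_experts)
        )
        self.k = k

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # Router scores depend on the *current* hidden state, so the same
        # token can reach different experts at different loop iterations.
        scores = self.router(h)                     # (..., n_experts)
        weights, idx = scores.topk(self.k, dim=-1)  # keep the k best experts
        weights = weights.softmax(dim=-1)
        flat_h = h.reshape(-1, h.shape[-1])
        flat_idx = idx.reshape(-1, self.k)
        flat_w = weights.reshape(-1, self.k)
        flat_out = torch.zeros_like(flat_h)
        for j in range(self.k):                     # naive dispatch, for clarity
            for e, expert in enumerate(self.experts):
                mask = flat_idx[:, j] == e
                if mask.any():
                    flat_out[mask] += flat_w[mask, j:j+1] * expert(flat_h[mask])
        return flat_out.reshape(h.shape)

moe = TopKMoE(d=32)
y = moe(torch.randn(4, 6, 32))
```

Because the router reads the hidden state, and the hidden state changes every iteration, each loop pass can activate a different expert subset, which is the point the paragraph above makes.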
Verdict: MoE is not an add-on here. It is what prevents the loop from being circular. Without it, RDT's efficiency argument is much weaker.
Latent Space Reasoning vs Chain-of-Thought
Thesis: RDT does its reasoning silently, in continuous vector space. Chain-of-thought reasons out loud, in tokens. These are fundamentally different mechanisms with different tradeoffs.
When a model like o3 or Claude 3.7 uses chain-of-thought, it generates intermediate reasoning tokens before producing a final answer. You can see the thinking. It's written out in discrete steps, visible in the output. That's horizontal recurrence — the model extends its thinking by extending the sequence.
RDT is different. All 16 iterations happen entirely inside the hidden state vectors. No intermediate tokens are generated. No visible reasoning steps. The model just thinks, internally, through 16 passes — and then produces a single output at the end. It's vertical recurrence rather than horizontal.
Research from Saunshi et al. (2025) draws a formal equivalence: each loop iteration in an RDT is functionally equivalent to one step of chain-of-thought, but operating over real-valued vectors rather than discrete tokens. That means 16 loops corresponds roughly to 16 chain-of-thought steps — except the intermediate states are continuous, not tokenized.
There is a specific capability advantage to continuous space. A real-valued vector can represent multiple possible next reasoning directions simultaneously. A discrete token cannot — it collapses to one choice. This means RDT can run something closer to a breadth-first search over reasoning paths within a single forward pass, representing multiple hypotheses in parallel before converging on the output.
Nobody thinks much about latent-state reasoning until they run into a model that seems to "just know" something it never wrote out step by step. That is probably what's happening.
The Stability Problem and How OpenMythos Addresses It
Thesis: Looping creates instability. The hidden state can either explode or drift into noise. OpenMythos uses three specific mechanisms to prevent this.
Recurrent architectures have a well-documented stability problem. Run a value through a loop repeatedly and it will either grow without bound — vanishing or exploding gradients — or drift so far from the original input that the model loses track of what it was processing. This is not a new problem. RNNs struggled with it for years before LSTMs were invented.
OpenMythos addresses this with three mechanisms:
LTI-stable injection (Parcae paper, Prairie et al. 2026): Linear Time Invariant injection constrains the system mathematically so that the hidden state cannot grow unbounded regardless of how many loops run. The architecture is provably stable under this constraint.
Adaptive Computation Time (ACT): Each token gets a learned signal that decides when to stop looping. Easier parts of the input stop early — maybe at loop 4. Harder parts run all 16 iterations. This means you're not burning compute on reasoning depth that isn't needed. Dynamic depth per token, not a fixed 16 for everything.
Depthwise LoRA adapters: Small parameter additions that slightly modify the behavior at each loop step. The base weights are shared across all 16 iterations. But each iteration gets a tiny unique adapter that makes it genuinely different from the previous one. Not different enough to cost significant memory. Different enough to prevent each loop from being a pure repeat.
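Of the three, ACT is the easiest to sketch. Below is a minimal per-token halting loop, assuming a learned sigmoid halting head and a stand-in update block; the threshold value and everything else here is a hypothetical simplification, not the OpenMythos mechanism verbatim:

```python
import torch
import torch.nn as nn

# Adaptive Computation Time sketch: each token accumulates halting
# probability per loop; once it crosses a threshold, that token stops.
d, max_loops = 32, 16
halt_head = nn.Sequential(nn.Linear(d, 1), nn.Sigmoid())   # learned stop signal
step = nn.Sequential(nn.Linear(d, d), nn.Tanh())           # stand-in shared block

h = torch.randn(2, 8, d)                    # (batch, seq, dim)
cum_halt = torch.zeros(2, 8)                # accumulated halting probability
active = torch.ones(2, 8, dtype=torch.bool)
loops_used = torch.zeros(2, 8, dtype=torch.long)

for _ in range(max_loops):
    if not active.any():                    # every token has halted early
        break
    h = torch.where(active.unsqueeze(-1), step(h), h)  # update active tokens only
    cum_halt = cum_halt + active.float() * halt_head(h).squeeze(-1)
    loops_used += active.long()
    active = active & (cum_halt < 0.99)     # stop once enough mass accumulates
```

The result is exactly the behavior described above: easy tokens exit after a few iterations, hard tokens run the full budget, and `loops_used` differs per token.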
Verdict: These aren't elegant theoretical solutions — they're engineering patches for a real problem. That the problem exists and requires multiple mitigations is worth noting. Anyone claiming RDT is a solved architecture is overselling.
What the Research Actually Shows
Thesis: Three specific experiments give the RDT hypothesis empirical weight. None of them are definitive. Together they are harder to dismiss.
Parameter efficiency: The Parcae paper demonstrates that a 770M parameter RDT matches a 1.3B standard transformer trained on identical data. That's roughly 41% fewer parameters for equivalent quality. The caveat: "equivalent quality" is benchmark-specific. What counts as equivalent depends heavily on which tasks you measure.
Systematic generalization: RDTs handle combinations of knowledge they never saw during training. Standard transformers fail this test. The experiment is simple — train on a set of reasoning patterns, then test on novel combinations of those patterns. Vanilla transformers collapse. Looped transformers pass. This is probably the most significant result because it suggests the bottleneck in current LLMs isn't knowledge storage. It's knowledge composition.
Depth extrapolation: Train on 5-hop reasoning chains. Test on 10-hop or 20-hop. Standard transformer: fails. Recurrent transformer: adds more inference-time loops and keeps going. The model doesn't need to have seen 20-hop reasoning during training — it can extend its reasoning depth dynamically by running more iterations at inference time.
The Parcae paper also establishes the first predictable scaling laws for looped training. Both optimal recurrence and optimal token count follow power laws with consistent exponents across scales. That matters because it means you can predict how much to loop and how much data to train on without running expensive experiments at every scale point.
Verdict: The evidence supports the hypothesis. It doesn't prove it at production scale. OpenMythos is not a deployed model. It's a research implementation. The results from Huginn-3.5B (Geiping et al., 2025) — an actual trained looped transformer — show real reasoning gains. That's the closest thing to a real-world test we have so far.
OpenMythos and Claude Mythos: The Hypothesis
Thesis: OpenMythos is not a leaked model. It's a testable hypothesis — and the hypothesis is specific enough to be falsifiable.
Anthropic has never published a technical paper on Claude Mythos. What exists is fragments: hints from researchers, patterns in model behavior, speculation in the research community. Kye Gomez, a 22-year-old founder of the Swarms agent framework, took all of that public information and built a working PyTorch implementation of what the architecture might be.
The project proposes that Claude Mythos belongs to the RDT class — a looped transformer with MoE routing, latent-space reasoning, and the stability mechanisms described above. OpenMythos ships seven pre-configured scale variants from 1B to 1T parameters, switchable between MLA and GQA attention, all running the same core recurrent depth architecture.
The value of OpenMythos isn't in whether it correctly guesses Anthropic's internals. It probably doesn't get every detail right. Nobody outside Anthropic knows. The value is in giving the research community a concrete, runnable implementation of an architectural class that the broader literature increasingly suggests is important. Researchers can now test the hypothesis, run their own experiments, and either validate or refute specific claims.
Worth noting: Kimi K 2.6, Moonshot AI's 1 trillion parameter model released around the same period, independently uses the same architectural components — MoE with 384 experts, multi-head latent attention for memory efficiency, and multi-agent parallelism. Whether or not Mythos is an RDT, these ideas are converging across labs.
RDT is the move. The research is pointing too consistently in this direction for it to be coincidence.
My Take
The parameter efficiency claim is probably real, within a specific range of tasks. 770M matching 1.3B on benchmarks designed for compositional reasoning isn't surprising given how the architecture works. The loop literally extends the depth of reasoning without adding parameters. Of course it's going to outperform a fixed-depth model on tasks that require variable reasoning depth. That's not a coincidence, it's the mechanism working as designed.
What I'm more skeptical of is the broad efficiency claim at production scale. The Parcae paper's power law results are encouraging — they mean you can predict how to train looped models without brute-forcing it. But OpenMythos is a research codebase. Huginn-3.5B is the only actually trained looped model with published results, and it's 3.5B parameters. Nobody has trained a 70B or 400B looped transformer and published the results. The gap between a working hypothesis and a production architecture is not small.
The depth extrapolation result is the one that stays with me. A model trained on 5-hop chains that handles 10-hop chains at inference — just by running more loops — is doing something structurally different from anything in the standard transformer family. That's not a marginal improvement. That's a different category of generalization. Standard transformers can't do that. At all.
Whether Anthropic actually built Mythos this way is probably unknowable from the outside. But that's almost beside the point now. The architecture class is real, it has published research behind it, and a working open-source implementation exists. That means anyone can run experiments. The hypothesis is no longer purely speculative — it's testable. That's what makes OpenMythos worth paying attention to, regardless of what Anthropic's internal codebase actually looks like.
Key Takeaways
- An RDT reuses a small set of weights up to 16 times instead of stacking hundreds of unique layers
- 770M parameter RDT matches 1.3B standard transformer on identical training data (Parcae paper)
- MoE routing with 384 experts ensures each loop pass activates different specialists — not just repeating the same computation
- All reasoning happens in latent space — no intermediate tokens generated
- RDTs handle depth extrapolation: trained on 5-hop reasoning, they handle 10-hop at inference by adding more loops
- Three stability mechanisms prevent hidden state explosion: LTI injection, Adaptive Computation Time, depthwise LoRA adapters
- OpenMythos is a testable hypothesis in PyTorch — not a leaked model, not a fine-tune
FAQ
Is OpenMythos actually Claude Mythos?
No. OpenMythos is a theoretical reconstruction — a hypothesis built from public research about what Claude Mythos might be. Anthropic has never confirmed or denied the RDT hypothesis. The project's value is in giving researchers a working implementation of an architecture class, not in revealing Anthropic's internals.
Why does running the same weights 16 times produce different results each loop?
Two reasons. First, the hidden state changes between iterations — each pass refines the representation, so the input to the transformer block is different even though the weights aren't. Second, MoE routing is dynamic — different experts activate at different loop iterations, so the actual computation is genuinely different each pass.
How is this different from chain-of-thought reasoning?
Chain-of-thought extends reasoning by generating intermediate tokens — the thinking is horizontal, written out in the output sequence. RDT extends reasoning by running more internal loop iterations — the thinking is vertical, happening inside hidden state vectors. No intermediate tokens are produced. The model "thinks" silently and outputs once.
What is the stability problem with looped transformers?
If you run any value through a multiplicative loop repeatedly without constraints, it can grow without bound (exploding gradients) or shrink to zero (vanishing gradients). This was the central challenge of early RNNs. In RDTs, the same risk exists — hidden states can diverge over 16 iterations. LTI-stable injection mathematically constrains the update rule to prevent this.
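The effect is easy to demonstrate numerically. Below, the same starting vector is run through an unconstrained linear loop and through a contractive update (a stand-in for stability constraints in general, not the actual LTI mechanism); the first blows up, the second stays bounded:

```python
import torch

torch.manual_seed(0)
d = 64
W = torch.randn(d, d) / d**0.5 * 3.0   # gain well above 1: an unstable map
h_unstable = torch.randn(d)
h_stable = h_unstable.clone()

for _ in range(16):
    # Unconstrained loop: the state's norm grows geometrically.
    h_unstable = W @ h_unstable
    # Contractive update: decay plus a bounded nonlinearity keeps it finite.
    h_stable = 0.7 * h_stable + 0.3 * torch.tanh(W @ h_stable)
```

After 16 iterations the unconstrained state has grown by orders of magnitude while the constrained one sits near its starting norm. That, in miniature, is the problem the stability mechanisms exist to solve.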
Do more loops always mean better output?
No. The Parcae paper shows gains follow a saturating exponential curve — each additional loop improves quality, but with diminishing returns. There's also a risk of overthinking: too many loops can cause the hidden state to drift past the correct answer into noise. Adaptive Computation Time addresses this by letting the model decide when to stop looping per token.
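The diminishing-returns shape is easy to visualize with a toy curve. The time constant below is made up for illustration; the paper reports the curve's form, not these numbers:

```python
import math

# Saturating-exponential shape: each extra loop helps, by less each time.
# tau is an illustrative constant, not a value from the Parcae paper.
def loop_gain(n_loops: int, tau: float = 4.0) -> float:
    return 1.0 - math.exp(-n_loops / tau)

gains = [round(loop_gain(n), 2) for n in (1, 4, 8, 16)]  # monotone, flattening
```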
What is the Parcae paper and why does it matter?
Parcae (Prairie et al., 2026) from UCSD and Together AI establishes the first predictable scaling laws for looped language models. It shows that optimal recurrence and token count follow power laws — which means engineers can predict training configurations at new scales without running exploratory experiments from scratch. It also introduces the LTI stable injection mechanism that OpenMythos uses for loop stability.
Where This Goes Next
The honest caveat is this: RDT research is still at the 3.5B scale. Huginn-3.5B is the largest trained looped transformer with published results. Nobody has run this at 70B or 400B and shown that the efficiency gains hold, that the stability mechanisms scale, that the depth extrapolation results survive on harder reasoning benchmarks at production data volumes.
OpenMythos gives the research community the tools to start running those experiments. Seven scale variants from 1B to 1T, all in PyTorch, open source. The architecture class now has a standard implementation that anyone can run, fork, and test. That's a different situation from six months ago when this was mostly theoretical.
If the scaling results hold at larger sizes, the dominant assumption in AI development — that more parameters is the primary path to more capability — will need serious revision. If they don't hold, we'll at least know why, which is also useful.
Either way, the architecture is worth understanding now. Because if this is what Anthropic actually built Mythos on — and the public research increasingly suggests the RDT hypothesis is plausible — then the models that follow will work fundamentally differently from everything that came before them. Knowing the mechanism matters.
The OpenMythos codebase is available at github.com/kyegomez/OpenMythos. The Parcae paper on scaling laws for looped models can be found via the Thinking Deeper, Not Longer paper on arxiv.org — the closest published work in the same research thread.
If you're tracking how AI reasoning architectures are evolving, the story of Google's strike team response to Claude gives useful context for why architectural efficiency like this matters competitively. The Abacus AI agent swarm breakdown covers a different architectural idea — master-worker hierarchies — that represents the multi-agent parallel execution path rather than the per-model depth path.