5 Breakthroughs | 20 Yrs Oldest Record Broken | 25% Compute Saved (Attention Residuals) | 52% Agent Task Completion (OpenViking)
There's a pattern worth understanding in how AI progress actually happens — and this past week illustrated it unusually clearly. Five separate research groups published results that, taken together, tell a coherent story. The story isn't "AI got smarter." It's more specific and more interesting than that: AI is now being used to improve AI itself, at five distinct levels of the stack simultaneously.
Most coverage of this week treated these as five separate news items. They're not. AlphaEvolve inventing its own algorithms, Moonshot AI redesigning how transformer layers communicate, GLM-OCR showing you don't need billions of parameters to read a complex document, OpenViking rethinking how agents store memory, and IBM Granite 4.0 proving compact speech models work in practice — these are happening at the same moment because the field has entered a phase where the tools are good enough to turn on themselves.
Let me take each one in turn and explain what it actually does — and why it matters beyond the press release.
📋 Table of Contents
- AlphaEvolve — The AI That Invents Algorithms Instead of Solving Problems
- Attention Residuals — Moonshot AI's Quiet Redesign of the Transformer
- GLM-OCR — Why 0.9 Billion Parameters Can Now Read What Bigger Models Miss
- OpenViking — The File System Approach to AI Agent Memory
- IBM Granite 4.0 — The Case for Compact Speech Models
- The Pattern Across All Five — Why This Week Was Different
- My Take
AlphaEvolve — The AI That Invents Algorithms Instead of Solving Problems
To understand why AlphaEvolve's result in Ramsey theory matters, it helps to first understand what Ramsey theory is — and why mathematicians treat it with a mixture of fascination and despair.
Start with a simple observation: put six people in a room. No matter how those six people relate to each other — strangers, acquaintances, friends — you will always find either three people who all know each other, or three people who are all strangers. This isn't a coincidence. It's mathematically guaranteed. Ramsey theory is the study of exactly when these kinds of unavoidable patterns appear inside large networks. The question becomes: at what size does a structure guarantee that a certain pattern must exist?
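The six-people fact is small enough to verify exhaustively. The sketch below (plain Python, brute force over every red/blue coloring of the "knows each other" relation) confirms that five people can avoid the pattern but six cannot, which is exactly the statement R(3,3) = 6:

```python
from itertools import combinations, product

def has_mono_triangle(n, coloring):
    """True if some trio has all three pairwise relations the same color."""
    for a, b, c in combinations(range(n), 3):
        if coloring[(a, b)] == coloring[(a, c)] == coloring[(b, c)]:
            return True
    return False

def every_coloring_forced(n):
    """True if every red/blue coloring of K_n contains a monochromatic triangle."""
    edges = list(combinations(range(n), 2))
    for colors in product([0, 1], repeat=len(edges)):
        if not has_mono_triangle(n, dict(zip(edges, colors))):
            return False  # found a coloring that avoids the pattern
    return True

print(every_coloring_forced(5))  # False: 5 people can avoid it, so R(3,3) > 5
print(every_coloring_forced(6))  # True: 6 people cannot, so R(3,3) = 6
```

The brute force works here because K6 has only 15 edges (32,768 colorings); the reason larger Ramsey numbers are open is that this search space explodes exponentially.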
The legendary mathematician Paul Erdős once framed the difficulty of computing these Ramsey numbers memorably: if aliens threatened to destroy Earth unless we calculated R(5,5) — the Ramsey number for groups of five — humanity's best option would be to dedicate all of its resources to the calculation. For R(6,6), he suggested, humanity should try to destroy the aliens instead. That's how hard these numbers are. Even today, R(5,5) is only known to fall somewhere between 43 and 48. Nobody knows the exact value.
This context is what makes AlphaEvolve's result meaningful. The system pushed forward the lower bounds of five Ramsey numbers simultaneously — R(3,13), R(3,18), R(4,13), R(4,14), and R(4,15). Each moved by exactly one. That sounds modest. But R(3,18) had not moved in exactly 20 years. R(3,13), R(4,13), and R(4,14) had stood for 11 years each. Moving any one of these by even a single integer is considered a significant achievement. AlphaEvolve moved five in one deployment.
| Ramsey number | Lower bound | Record stood for |
|---|---|---|
| R(3,13) | 60 → 61 | 11 years |
| R(3,18) | 99 → 100 | 20 years |
| R(4,13) | 138 → 139 | 11 years |
| R(4,14) | 147 → 148 | 11 years |
| R(4,15) | 158 → 159 | 6 years |
But the how matters more than the what here. AlphaEvolve didn't directly solve the Ramsey problems. It searched for the algorithms that search for the answers. Think of it this way: instead of digging for the treasure itself, AlphaEvolve invented better shovels. The system begins with a population of simple algorithms, then uses Google's Gemini model to modify them — changing strategies, rewriting code, proposing new search approaches. Each modified algorithm gets tested. If it performs better than its predecessor, it survives and evolves further. If it fails, it gets discarded. Over many iterations, the system converges on algorithms capable of breaking records that human-designed approaches couldn't crack.
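The loop itself is conceptually simple. Below is a toy sketch of the propose, test, keep cycle, with a random numeric perturbation standing in for the LLM-driven code modification and a trivial fitness function standing in for the real Ramsey-construction benchmark:

```python
import random

def evolve(initial, mutate, fitness, generations=500):
    """Toy evolutionary loop in the spirit of AlphaEvolve's search:
    propose a variant, score it, keep it only if it improves.
    `mutate` stands in for the LLM-driven code-modification step."""
    best = initial
    for _ in range(generations):
        candidate = mutate(best)                # a modified algorithm is proposed
        if fitness(candidate) > fitness(best):  # tested against the benchmark
            best = candidate                    # survivors evolve further
    return best

# Toy problem: nudge a vector toward the origin (fitness = -squared distance).
start = [random.uniform(-5, 5) for _ in range(4)]
score = lambda xs: -sum(x * x for x in xs)
best = evolve(start, lambda xs: [x + random.gauss(0, 0.3) for x in xs], score)
print(score(best) >= score(start))  # True: the loop never accepts a regression
```

The real system's mutation step is an LLM rewriting code and the fitness step is an automated evaluator, but the selection pressure works exactly like this.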
When researchers examined the algorithms AlphaEvolve independently invented, they found something striking: the system had rediscovered techniques that human mathematicians had previously developed by hand — including algebraic graph approaches like Paley graphs and quadratic residue graphs. The system wasn't guessing randomly. It was converging on actual mathematical strategies.
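One of those rediscovered constructions can be checked directly. The Paley graph on 17 vertices (connect i and j when i minus j is a quadratic residue mod 17) is the classical witness that R(4,4) is at least 18; the brute-force check below confirms it contains neither four mutually connected vertices nor four mutually disconnected ones:

```python
from itertools import combinations

p = 17
residues = {(x * x) % p for x in range(1, p)}  # nonzero quadratic residues mod 17

def adjacent(i, j):
    """Paley graph edge rule: i ~ j iff (i - j) is a quadratic residue mod p."""
    return (i - j) % p in residues

# No four vertices are all mutually adjacent (no red K4), and no four are all
# mutually non-adjacent (no blue K4) — so this coloring shows R(4,4) > 17.
no_clique4 = not any(all(adjacent(a, b) for a, b in combinations(s, 2))
                     for s in combinations(range(p), 4))
no_indep4 = not any(not any(adjacent(a, b) for a, b in combinations(s, 2))
                    for s in combinations(range(p), 4))
print(no_clique4 and no_indep4)  # True: R(4,4) >= 18 (in fact, it equals 18)
```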
The broader implications extend beyond Ramsey theory. According to the official Google DeepMind AlphaEvolve announcement, when tested across 50+ open problems in mathematics — geometry, combinatorics, number theory, analysis — AlphaEvolve matched state-of-the-art solutions in 75% of cases and improved on them in 20%. It also broke a 50-year-old record in matrix multiplication, finding a way to multiply two 4×4 complex matrices using 48 scalar multiplications instead of 49 — a result that had eluded researchers since Volker Strassen's landmark 1969 algorithm. And those aren't just academic results: AlphaEvolve has already been deployed inside Google's data centers, chip design processes, and AI training pipelines, saving meaningful compute resources at scale.
Attention Residuals — Moonshot AI's Quiet Redesign of the Transformer
While AlphaEvolve was making headlines, a research team from Moonshot AI was working on something quieter but potentially more fundamental: a proposed modification to how transformer neural networks are structured internally.
To understand the problem they were solving, think about how a modern transformer model processes information. The model is built from many layers stacked on top of each other. Each layer processes information and passes it to the next. To prevent instability — what happens when information gets distorted as it passes through many layers — models use what are called residual connections: each layer's output gets added to the original input before being passed forward. This keeps gradients stable during training and has been a cornerstone of transformer design since 2017.
The problem Moonshot identified is subtle but real. As models get deeper — more layers — those residual connections treat every layer's contribution with equal weight. Over time, the outputs of all those layers get blended together uniformly, and the signal from earlier layers gets progressively diluted in a growing pile of mixed information. The model cannot tell which of its earlier layers contributed useful signal and which contributed noise.
The Moonshot team's solution — attention residuals — borrows the attention mechanism itself (the same mechanism transformers use to decide which words in a sentence are most relevant to each other) and applies it to decide which layers of the network matter most at each step. Instead of blending everything together uniformly, each layer can now dynamically weight its relationship to earlier layers. It's effectively giving the model a smarter memory of its own computational process.
The results from their benchmarks are worth noting. Models with attention residuals consistently performed better across standard evaluation sets — and in several cases, they matched the performance of models requiring approximately 25% more computing power. They also tested the architecture inside their Kimi Linear model (48 billion total parameters, trained on 1.4 trillion tokens), and found improvements across reasoning, coding, and knowledge benchmarks. A 25% compute efficiency gain at that scale is the kind of number that gets people's attention in a field where training costs run into the tens of millions of dollars.
Standard residuals = "Mix everything from all layers with equal weight"
Attention residuals = "Dynamically decide which layers deserve more weight at each step"
Result: Models that are smarter about their own internal reasoning process — and potentially 25% more efficient to achieve the same performance.
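The contrast can be made concrete with a toy numpy sketch. This is schematic only, not Moonshot's actual implementation; the per-layer query vector `q` and the toy layer function are assumptions made here for simplicity:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
layers = [rng.normal(0, 0.1, (d, d)) for _ in range(6)]  # toy layer weights
f = lambda W, h: np.tanh(h @ W)                           # toy layer function

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def standard_residuals(x):
    h = x
    for W in layers:
        h = h + f(W, h)  # every layer's output blended with equal weight
    return h

def attention_residuals(x, q):
    """Schematic: each step attends over the stream of earlier outputs and
    mixes them with data-dependent weights instead of a uniform sum."""
    stream = [x]
    h = x
    for W in layers:
        # score each earlier output against a query, softmax into weights
        weights = softmax(np.array([s @ q for s in stream]))
        mixed = sum(w * s for w, s in zip(weights, stream))
        h = mixed + f(W, mixed)  # residual taken against the weighted mix
        stream.append(h)
    return h

x = rng.normal(0, 1, d)
q = rng.normal(0, 1, d)
print(standard_residuals(x).shape, attention_residuals(x, q).shape)
```

The key difference is the `weights` line: in the standard scheme those weights are implicitly fixed and uniform; in the attention-residual scheme the model computes them at each step.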
GLM-OCR — Why 0.9 Billion Parameters Can Now Read What Bigger Models Miss
The third development is the most practically immediate for anyone building real applications. Researchers from Zhipu AI and Tsinghua University released GLM-OCR — a document understanding model with just 0.9 billion parameters that can read complex documents containing tables, mathematical formulas, stamps, structured fields, and messy page layouts.
To appreciate why this is significant, it helps to understand the limitation it's solving. Traditional OCR (Optical Character Recognition) systems are well-optimized for reading clean, plain text. A page of uniformly formatted prose? They handle it efficiently. But the moment a document contains a table with merged cells, a mathematical formula, a stamped approval mark, or an invoice with mixed structured and unstructured data — traditional OCR falls apart. The system tries to read the whole page in one linear pass and loses the structural relationships that make tables and forms meaningful.
GLM-OCR's approach is architecturally different. First, it breaks the document into meaningful regions — identifying tables, paragraphs, and diagrams as separate sections. Then it processes each region independently. This "divide the page into meaningful segments first" approach means the model can handle the specific structural rules of a table without being confused by the free-form text around it. The model also generates multiple words simultaneously rather than word by word, which contributes to the reported 50% faster processing speed compared to traditional approaches.
The output format detail is what makes this genuinely deployment-ready: GLM-OCR can output directly to JSON or Markdown. For a developer building a system that needs to extract data from invoices, forms, or reports, that means the model's output slots directly into an existing data pipeline without a second parsing step. Combined with the model's small size — 0.9B parameters is easily deployable on modest hardware — this is the kind of tool that becomes a component in production systems rather than a research demonstration.
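A minimal sketch of the segment-then-process idea, using invented region labels and handlers rather than GLM-OCR's real API:

```python
import json
from dataclasses import dataclass

@dataclass
class Region:
    kind: str      # "table", "paragraph", "formula", ... (illustrative labels)
    content: str

def read_region(region):
    """Route each region to type-specific handling: the 'divide the page into
    meaningful segments first' idea. A real system would reconstruct table
    structure properly; here we just split on a delimiter."""
    if region.kind == "table":
        rows = [r.split("|") for r in region.content.splitlines()]
        return {"type": "table", "rows": rows}
    return {"type": region.kind, "text": region.content}

# Hypothetical output of a layout-segmentation step on an invoice page:
page = [
    Region("paragraph", "Invoice 2024-117, issued March 3"),
    Region("table", "Item|Qty|Price\nWidget|2|9.99"),
]
print(json.dumps([read_region(r) for r in page], indent=2))
```

The point of the JSON-shaped output is exactly what the paragraph above describes: each region arrives as structured data that slots into a pipeline with no second parsing step.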
OpenViking — The File System Approach to AI Agent Memory
The fourth development addresses a problem that anyone who has built or used AI agents will recognize immediately: memory management. Specifically, the limitations of the dominant current approach — vector databases.
Here's how most AI agents currently handle memory. They take text — previous conversations, documents, knowledge — break it into chunks, convert those chunks into numerical vectors, and store them in a database. When the agent needs to retrieve something, it converts the query into a vector and finds the chunks whose vectors are most similar. This approach works. But it has real limitations: the similarity search is essentially matching patterns, not understanding structure. An agent searching for "the budget figure from the Q3 planning document" might retrieve fragments from many Q3 discussions that happen to use similar words, rather than finding the specific organized information it needs.
OpenViking takes a structurally different approach. Instead of storing information as random text chunks in a vector space, it organizes memory like a computer file system — folders, directories, and files that can be navigated with commands similar to a terminal. An agent can browse to a specific directory rather than searching blindly through thousands of chunks. This is a conceptually simple but practically significant change: it makes the agent's memory browsable rather than just searchable.
OpenViking also introduces tiered context loading: every piece of stored information automatically gets three versions — a one-sentence summary, a medium overview, and the full content. When the agent needs information, it reads the summary first. Only if necessary does it retrieve the full file. This dramatically reduces the number of tokens the model processes, which directly reduces cost and latency. In testing using a long conversation dataset, OpenViking improved agent task completion rates from 35% to over 52% while using fewer tokens. That's a 17-percentage-point improvement in task success — which in production agent deployments is the difference between a useful tool and an unreliable one.
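A minimal sketch of the two ideas together, a navigable directory tree plus three-tier entries, with all names, paths, and contents invented for illustration (this is the design philosophy, not OpenViking's actual API):

```python
from dataclasses import dataclass

@dataclass
class Entry:
    summary: str    # one-sentence tier, read first
    overview: str   # medium tier
    full: str       # loaded only when actually needed

# Hypothetical agent memory laid out as a browsable directory tree
memory = {
    "projects": {
        "q3_planning": {
            "budget.md": Entry(
                summary="Q3 budget totals $1.2M across three teams.",
                overview="Breakdown: infra $600k, research $400k, ops $200k.",
                full="(full document text here)",
            ),
        },
    },
}

def ls(tree, path):
    """List a directory, terminal-style: ls('projects/q3_planning')."""
    node = tree
    for part in filter(None, path.split("/")):
        node = node[part]
    return sorted(node)

def read(tree, path, tier="summary"):
    """Read a file at the requested tier, defaulting to the cheapest one."""
    *dirs, name = path.split("/")
    node = tree
    for part in dirs:
        node = node[part]
    return getattr(node[name], tier)

print(ls(memory, "projects"))                          # browse, don't search
print(read(memory, "projects/q3_planning/budget.md"))  # summary tier only
```

An agent navigating this structure pays for one short summary instead of retrieving dozens of vector-similar chunks, which is where the token savings come from.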
IBM Granite 4.0 — The Case for Compact Speech Models
The fifth development is IBM's Granite 4.0 1B Speech, a compact speech recognition and translation model with 1 billion parameters that supports six languages (English, French, German, Spanish, Portuguese, Japanese) and can translate speech to and from English, including into targets beyond the core six, such as English to Italian and English to Mandarin.
The design philosophy here is a deliberate contrast to the trend of building increasingly large models. IBM's approach with Granite is to build the smallest model that achieves production-grade performance on a specific task — in this case, accurate multilingual speech recognition. The system uses a two-step modular architecture: first, speech gets converted to text; then a language model processes that text for translation or response generation. This modular design makes integration into existing applications straightforward — developers can swap out either component independently as better versions become available.
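The modularity is easy to express in code. Here is a sketch of the two-step pipeline with stand-in components (purely illustrative, not IBM's API); the interfaces are the point, since either side can be swapped independently:

```python
from typing import Protocol

class Transcriber(Protocol):
    def transcribe(self, audio: bytes) -> str: ...

class Translator(Protocol):
    def translate(self, text: str, target_lang: str) -> str: ...

def speech_to_translation(audio: bytes, asr: Transcriber,
                          lm: Translator, target: str) -> str:
    """Two-step modular pipeline: speech to text first, then the
    language model. Each component is replaceable on its own."""
    text = asr.transcribe(audio)
    return lm.translate(text, target)

# Stand-in components for demonstration only:
class FakeASR:
    def transcribe(self, audio): return "bonjour le monde"

class FakeLM:
    def translate(self, text, target_lang): return f"[{target_lang}] hello world"

print(speech_to_translation(b"...", FakeASR(), FakeLM(), "en"))
```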
On the Open ASR Leaderboard, Granite 4.0 1B achieved a word error rate of 5.52% — a strong result for a model of this size. It performed particularly well on the LibriSpeech and SPGISpeech datasets. The Apache 2.0 license removes the commercial-use restrictions that limit deployment of many comparable models. For companies building multilingual voice interfaces or transcription tools, a compact, commercially usable, production-quality model is considerably more useful than a large model that performs slightly better but costs three times as much to run.
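Word error rate, the metric behind that 5.52% figure, is worth knowing concretely: it is the word-level edit distance between the model's transcript and the reference, divided by the reference length. A self-contained implementation:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length,
    computed via word-level edit distance (dynamic programming)."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[len(ref)][len(hyp)] / len(ref)

# One substitution in a four-word reference -> WER of 0.25
print(word_error_rate("the cat sat down", "the cat sat town"))  # 0.25
```

A 5.52% WER means roughly one word-level error per 18 reference words, which is why it counts as production-grade transcription quality.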
The Pattern Across All Five — Why This Week Was Different
Here's the framing that ties all five together. Each development is operating at a different level of what you might call the AI capability stack:
| Level | Development | What It Improves |
|---|---|---|
| Algorithm Discovery | AlphaEvolve | AI inventing better algorithms for itself and for science |
| Core Architecture | Attention Residuals | How transformer layers communicate internally |
| Specialized Capability | GLM-OCR | Document understanding at minimal model size |
| Memory Systems | OpenViking | How AI agents store and retrieve information |
| Deployment Efficiency | IBM Granite 4.0 | Production-grade performance at practical model sizes |
What makes the current moment unusual isn't that any one of these is a world-changing breakthrough on its own. It's that all five levels of improvement are happening simultaneously, and in many cases the tools being improved are being used to improve themselves. AlphaEvolve has already been deployed to improve the training pipelines of the very Gemini models that power it. That's a genuinely recursive loop — AI improving AI improving AI — and it's operating in production, not in theory.
The attention residuals work from Moonshot suggests the transformer architecture — which has been essentially unchanged in its core design since 2017 — still has meaningful room for improvement that can be found by applying attention mechanisms to the architecture itself. The GLM-OCR result demonstrates that specialization is producing meaningful capability gains at scales that were previously considered too small to be useful. OpenViking shows that the dominant current paradigm for AI agent memory (vector databases) has structural limitations that a different design philosophy can meaningfully address. And IBM Granite shows the deployment reality: compact, commercially-licensed, production-ready models are where a large part of actual enterprise value gets built.
My Take
The coverage of AlphaEvolve has predictably focused on the Ramsey number records — partly because the alien-destroying-Earth framing makes for a compelling headline. But I think that framing actually obscures the more important signal. The records themselves are modest by any mathematical standard: each Ramsey number moved by exactly one. What's significant is the infrastructure claim underneath: that a single system, in one deployment, pushed five different decade-old records simultaneously. That pattern — general-purpose beats specialized, breadth without sacrificing depth — is the same pattern that made large language models disruptive. It showed up first in language, then in code, and is now showing up in mathematical algorithm discovery. The question worth asking isn't "what did AlphaEvolve find?" It's "what happens when this approach is applied to the 80% of scientific problems where the evaluation function is clearly defined and the search space is bounded?"
The attention residuals paper is the development I'd watch most carefully among the five — not for its immediate results but for what it signals about where transformer research is heading. The transformer architecture has been so dominant for so long that it's easy to treat it as settled. The Moonshot result suggests it isn't. Applying attention to the architecture itself — letting the model decide which of its own layers matter — is a conceptually clean idea that, if it generalizes across model sizes and training regimes, could quietly change the efficiency calculus for everyone training large models. A 25% compute reduction at scale is not a small number. I want to see replication results from independent teams before reading too much into the specific benchmarks, but the direction feels sound.
The honest note about OpenViking: 35% to 52% task completion on a single long-conversation benchmark is encouraging, but that benchmark is a narrow test. The real question for AI agent memory systems isn't whether they work better in testing — it's whether they hold up when agents are deployed in messy real-world tasks where information is inconsistently formatted, the queries are ambiguous, and the "right answer" isn't clearly defined. The file-system metaphor is intuitive, but intuitive architectures don't always survive contact with real user behavior. I'd treat OpenViking as a promising direction rather than a solved problem.
The thread connecting all five is efficiency. Not raw capability — efficiency. AlphaEvolve finds better algorithms instead of brute-forcing solutions. Attention residuals get more from fewer parameters. GLM-OCR does more with 0.9 billion parameters than traditional approaches needed 7 billion for. OpenViking reduces the tokens an agent needs to process. IBM Granite fits production-grade speech into a model small enough to deploy affordably. The direction of travel in AI right now is not simply "bigger models." It's "the same output for less compute" — and this week's results show that direction has real momentum across multiple layers of the stack simultaneously.