Google's "Mathematical AGI" Moment: What DeepMind's Aletheia Really Proved (and Why Energy Is Now Part of the Story)

For years, a lot of people had a quiet line in their head about AI. Sure, it could talk, write, code, even make art, but it would hit a wall in real mathematics. Not contest math, not textbook exercises, but the kind of research problems where even understanding the question takes a specialist.

That line just got a lot harder to defend.

Google DeepMind revealed an AI research agent called Aletheia that autonomously solved six open, PhD-level math problems from active research areas, with no human help during the solving process, and under a real submission deadline. The reactions from working mathematicians are the part that sticks.

The breakthrough isn't "hard math," it's open math

The headline sounds dramatic until you separate two ideas: "difficult" and "unknown."

A hard contest problem is difficult, but the path to a solution usually exists inside a shared toolbox. Someone clever can combine known tricks under pressure and get it done. That's amazing, but it's still a sprint on a marked track.

Open research math is different. Sometimes only a few dozen people on Earth fully understand what's being asked. There may be no agreed solution strategy. There may not even be a solution. You can burn months chasing something that turns out to be impossible, or true for a boring reason no one noticed, or true but only after you invent a new lens to look through.

That's the context for the FirstProof challenge, a set of 10 frontier problems drawn from modern mathematical research. DeepMind's Aletheia solved six of them (Problems 2, 5, 7, 8, 9, and 10), fully correctly, within the official deadline, with zero human intervention during the solving run.

On-screen text explains the FirstProof challenge format and highlights that Aletheia solved 6 out of 10 problems.

A detail that matters: the remaining four problems did not get "pretty-looking" fake proofs. Aletheia either reported no solution found, or said nothing until time expired. In math, that restraint is not weakness. It's what keeps the whole project from turning into a time-wasting machine.

One problem, labeled Problem 7, had been open for years. People familiar with it said no other AI system got close. Aletheia solved it, and the proposer of the problem personally confirmed the reasoning.

And then there's the line from Terence Tao that hit like a cymbal crash: "AI has become my junior co-author." It's casual, but it carries a serious implication about where day-to-day research is headed.

When a system can search, fail, backtrack, and still land on correct new proofs, it's not doing "math homework." It's doing research behavior.

If you want a broader read on how messy the word AGI has become lately, this related piece puts some of the competing claims side by side: Inside Integral AI's first AGI-capable model.

What the FirstProof challenge tests (and why it's not the IMO)

The International Mathematical Olympiad is the famous benchmark most people know. It's brutal, and it rewards deep skill. Still, it's a closed world in an important way: you're expected to solve problems using known techniques, with creativity in how you combine them.

FirstProof is more like being dropped into a foggy forest with a compass that might be wrong.

These are research-level questions from active areas of mathematics. In some cases, you don't just need technique, you need a willingness to try an idea, realize it's dead, throw it out, and keep going anyway. That "keep going" part is where humans are limited by time, mood, ego, and basic fatigue. A machine agent doesn't have those limits. It has a compute budget, a clock, and rules.

Here's what makes the FirstProof result feel heavier than "AI got better at math":

Aletheia had to work through long chains of reasoning, including failed attempts, and still produce correct proofs. DeepMind ran two versions of Aletheia (built on slightly different underlying models) and used cross-checking between them to reduce the odds of a silent error.

That cross-check idea sounds simple, but in research math it's everything. A single convincing-but-wrong proof can burn weeks of an expert's life. The cost isn't embarrassment, it's lost human time.

For the primary source, DeepMind documented the FirstProof run in an arXiv paper: Aletheia tackles FirstProof autonomously (arXiv PDF).

Diagram shows Aletheia as a long-horizon research agent built on Gemini Deep Think, designed for sustained reasoning and self-correction.

How Aletheia works: a research agent that argues with itself

Aletheia is not framed as a chatbot that "knows math." It's described as a long-horizon research agent built on top of Gemini 3 Deep Think. The key is that it's designed to do work that takes time, where thousands of attempts can fail before one path clicks.

DeepMind's approach to reliability is the part worth sitting with. Instead of hoping the model "behaves," the system creates internal conflict on purpose. Two roles constantly push against each other:

  • The generator proposes ideas aggressively. It tries strategies, makes conjectures, explores routes that might be wrong, and keeps moving.
  • The verifier attacks every step. It checks logic, looks for cracks, and rejects anything that doesn't hold exactly.

So the system spends a lot of its time in a kind of structured argument. That might sound inefficient, but it matches what careful mathematicians already do in their head, just with more stamina and less ego.
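
As a mental model only, here is a minimal sketch of that generator-verifier loop. Everything in it (the function names, the state shape, the loop structure) is hypothetical illustration, not DeepMind's actual architecture or API:

```python
import random

def propose_step(state):
    """Generator role: aggressively propose a next step (may well be wrong)."""
    # Stand-in for a model call; picks any remaining idea, or None when stuck.
    ideas = state["open_ideas"]
    return random.choice(ideas) if ideas else None

def check_step(step):
    """Verifier role: accept a step only if it holds exactly."""
    # Stand-in for strict logical checking; rejects anything unproven.
    return step.get("verified", False)

def research_loop(problem, budget):
    state = {"proof": [], "open_ideas": list(problem["ideas"])}
    for _ in range(budget):                    # a compute budget, not patience
        step = propose_step(state)
        if step is None:
            return "no solution found"         # honest failure, no bluffing
        if check_step(step):
            state["proof"].append(step)        # keep only verified steps
            if step.get("closes_proof"):
                return state["proof"]          # a complete, checked proof
        state["open_ideas"].remove(step)       # either way, don't retry it
    return "no solution found"                 # budget exhausted
```

The structural point survives the simplification: only verified steps accumulate, dead ends get discarded instead of defended, and running out of ideas or budget produces an honest "no solution found" rather than a bluffed proof.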

The most important behavioral point shows up when things go badly. When Aletheia couldn't solve a problem, it didn't bluff. It either said "no solution found" or ran out the clock. DeepMind even says it's willing to solve fewer problems if that's the price of not producing nonsense.

That may sound conservative, but it's also how you build a tool people can actually trust.

DeepMind also released full interaction logs, including failed attempts and wrong turns. If you want to see what "research behavior" looks like in raw form, those logs matter as much as the final proofs: Aletheia interaction logs on GitHub.

Problem 7: the proof that made people uncomfortable (in a real way)

Problem 7 sits where algebraic topology and differential geometry overlap, which is a polite way of saying: you can lose a weekend just parsing the setup.

In simplified terms, the question asks whether a certain kind of discrete group can appear as the fundamental group of a compact, boundaryless manifold, under strict conditions on the universal cover. If that sentence feels like a filter, it is.

Aletheia didn't just answer. It proved the answer is "no" in two different ways, and that's the part that feels less like pattern matching and more like mathematical taste.

The video displays a plain-language summary of Problem 7, describing its link to manifolds, universal covers, and group actions.

The first proof: a clean contradiction using Lefschetz numbers

The first proof is almost rude in how direct it is.

Using the assumption about the universal cover (described as rationally acyclic in the narration), Aletheia computes a Lefschetz number tied to an element of order two. Then it forces the same quantity to be both non-zero and zero:

On one side, because the universal cover is rationally acyclic, only the degree-zero homology contributes, so the Lefschetz number must be non-zero. On the other side, because the group action is free with no fixed points, the Lefschetz fixed point theorem forces it to be zero. That creates an impossible equality, described bluntly as:

0 = ±1

Contradiction, done.
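
For readers who want the skeleton, here is roughly what that collision looks like in symbols. This is a sketch under stated assumptions (an order-two deck transformation g on a rationally acyclic universal cover, with the Lefschetz fixed point theorem applicable in the form Aletheia used), not a transcription of the actual proof:

```latex
% Lefschetz number of g acting on the universal cover \tilde{M}:
L(g) = \sum_{i \ge 0} (-1)^i \,\operatorname{tr}\bigl(g_* : H_i(\tilde{M};\mathbb{Q}) \to H_i(\tilde{M};\mathbb{Q})\bigr)

% \tilde{M} rationally acyclic: only H_0 \cong \mathbb{Q} survives, and g_*
% acts trivially there, so
L(g) = 1 \neq 0

% But g is a deck transformation of order two, hence acts freely with no
% fixed points, and the fixed point theorem then forces
L(g) = 0
% The on-screen summary renders the clash as 0 = \pm 1; either way, impossible.
```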

A highlighted contradiction shows a Lefschetz-number argument forcing an impossible equality, summarized as "0 equals plus or minus 1."

What's unsettling (in a good way) is how little structure the proof uses. It doesn't depend heavily on the geometry that makes the problem look intimidating. It also proves something stronger than the original question: in this setting, no discrete group containing torsion elements can work at all.

That "stronger than asked" move is something mathematicians recognize instantly. It's a sign the solver isn't just chasing the target, it's seeing the shape behind it.

The second proof: fully geometric, different tools, same collision

Then Aletheia offers a second proof that goes the opposite direction. Instead of staying abstract, it leans into geometry:

It constructs an equivariant map from the universal cover to a symmetric space associated with the group. Then it compares Lefschetz numbers on both sides.

  • On the universal cover side, the group action is free, so the Lefschetz number vanishes.
  • On the symmetric space side, the Cartan fixed point theorem forces fixed points, which pushes the Lefschetz number to be non-zero.

Different machinery, same contradiction, same "no."
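
In the same shorthand, and with the same caveat that this is a schematic rather than the proof itself, the second argument runs the comparison through an equivariant map from the universal cover into the symmetric space:

```latex
% Universal cover side: the action of g is free, so
L\bigl(g \curvearrowright \tilde{M}\bigr) = 0

% Symmetric space side: the Cartan fixed point theorem gives every
% finite-order isometry of G/K a fixed point, which forces
L\bigl(g \curvearrowright G/K\bigr) \neq 0

% Comparing the two Lefschetz numbers through the equivariant map
% f : \tilde{M} \to G/K produces the same contradiction, and the same "no".
```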

One expert reaction mentioned in the narration was that this looked like the first time they'd seen an AI combine multiple deep theorems in a way that didn't feel stitched together.

It wasn't cheap: the compute story matters

To solve Problem 7, Aletheia used 16 times the reasoning budget DeepMind used on a prior flagship math result (mentioned as the "Erdős 151 problem" in the narration). Whatever you call the comparison point, the punchline is simple: this took sustained exploration.

DeepMind even visualized the reasoning cost over time, showing repeated dead ends, backtracking, and persistence.

A chart of reasoning effort over time shows repeated spikes, indicating dead ends, backtracking, and renewed attempts before a final proof is found.

Another solved problem: a very "human" move in number theory and representation theory

Problem 7 gets the spotlight, but the other solves weren't soft targets.

One of the problems Aletheia solved came from number theory and representation theory. In broad terms, it involved matrix groups over non-Archimedean local fields, and the existence of a universal Whittaker function that guarantees a certain integral never vanishes across paired representations.

Even in a simplified retelling, the structure of the proof is the interesting bit.

Aletheia's approach starts with a choice that changes the whole board. It picks a particular Whittaker function that compresses the integration domain into a compact set, and in the process removes an entire complex parameter that would otherwise hang over the argument. With that simplification in place, the problem reduces to whether a finite functional can be zero.

From there, Aletheia runs a contradiction:

Assume the integral vanishes for all representations. Then use finite Fourier analysis to show that this assumption would force the representation to have invariant vectors under a subgroup larger than its conductor allows. That clashes with the definition of the conductor itself, so the assumption collapses.
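
Schematically, and with illustrative notation that is mine rather than DeepMind's (W_0 for the chosen Whittaker function, K_c for the compact domain it produces, \pi ranging over the paired representations), the contradiction has this shape:

```latex
% Step 1: the chosen Whittaker function W_0 compresses the integration
% domain to a compact set K_c and removes the complex parameter entirely.

% Step 2: assume the resulting finite functional vanishes for every \pi:
\Lambda(\pi) = \int_{K_c} W_0(k)\,\overline{W_\pi(k)}\,dk = 0 \quad \text{for all } \pi

% Step 3: finite Fourier analysis over K_c then forces \pi to carry vectors
% invariant under a subgroup larger than its conductor permits, which
% contradicts the definition of the conductor, so the assumption collapses.
```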

An on-screen summary shows the proof sketch: a carefully chosen Whittaker function, then a contradiction involving Fourier analysis and representation conductors.

The narration points out what human mathematicians immediately notice: that initial "pick the right Whittaker function" move feels like experience. It's the kind of simplifying decision people learn after years of wrestling with similar integrals.

For extra context on how DeepMind frames the jump from competition math to research math, their earlier related paper is also on arXiv: Towards Autonomous Mathematics Research (arXiv PDF).

Why people are calling this the end of "manual" math research

This part doesn't mean mathematicians are out of a job next week. It means the workflow changes.

For centuries, math progress has depended on humans doing everything by hand: reading, guessing, trying, failing, rewriting, checking, getting stuck, and sometimes giving up. The bottleneck has always been time and attention.

A research agent doesn't get tired. It doesn't cling to a beautiful idea that's wrong. It doesn't need a walk to cool off. It just keeps pushing until it finds a proof, or until the budget runs out.

That's why the "junior co-author" line lands. It's not about replacement, it's about throughput. If even a small slice of research time shifts from grind work to supervision and taste, the pace changes.

Also, there's a quiet second-order effect: if AI systems can propose and verify proofs faster than humans can read them, then human review becomes the new bottleneck. That's not a math problem, it's a social and tooling problem.

For a wider view on where Google seems to be steering agents and long-horizon systems next, this piece is a good companion: Demis Hassabis' Google AI 2026 vision.

Google's energy play: because long-horizon reasoning burns real power

The story took an unexpected turn, but it makes sense once you think about it.

At the same time as the Aletheia news, Google revealed plans for a major data center complex south of Minneapolis, tied to a renewable-heavy power buildout and what's described as the world's largest battery storage system.

The project details, as described:

  • About 1.4 GW of wind
  • 200 MW of solar
  • A 300 MW iron-air battery capable of delivering power for up to 100 hours (more than 4 days)

A clean energy infographic lists wind, solar, and a 300 MW iron-air battery providing up to 100 hours of storage for a Minnesota data center.

Unlike common lithium-ion grid batteries that often target 4 to 8 hours of storage, this system is meant for multi-day gaps, the kind you get in major weather events.

The iron-air battery tech comes from Form Energy and works through reversible rusting: discharging oxidizes iron with oxygen (rusting) to release energy, and charging runs the reaction in reverse, turning rust back into iron.
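
The arithmetic behind "multi-day" is worth a quick check. A short sketch using the headline ratings above (nameplate figures only; real usable energy depends on round-trip efficiency and degradation, which aren't specified here):

```python
power_mw = 300                 # rated power of the iron-air system
duration_h = 100               # rated discharge duration in hours

energy_mwh = power_mw * duration_h      # 30,000 MWh = 30 GWh of storage
days = duration_h / 24                  # about 4.2 days of continuous output

# Typical lithium-ion grid batteries at the same power, 4-8 hour durations:
li_ion_mwh = {hours: power_mw * hours for hours in (4, 8)}  # 1,200-2,400 MWh

print(f"Iron-air: {energy_mwh:,} MWh (~{days:.1f} days of continuous output)")
print(f"Lithium-ion at the same power: {li_ion_mwh} MWh")
```

That tenfold-plus gap in duration, not the power rating, is what makes the system relevant for multi-day weather events.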

This isn't just a "green PR" side quest. It's part of the compute story. Aletheia's Problem 7 run alone required 16 times the reasoning budget of a previous big math milestone. If long-horizon agents become normal, power becomes a hard constraint, not an afterthought.

For reporting on the Minnesota deal and the 100-hour system, here are two useful sources: American Public Power Association coverage of the 300 MW iron-air system and Fortune's report on the 100-hour battery buildout.



What I learned while sitting with this (my honest take)

I've seen enough AI demos to have a reflex now: wait for the fine print. So at first, I treated this like another "look how smart the model is" cycle that would fade once people checked the work.

Then I kept coming back to two details.

First, it solved six open problems under a deadline, and it didn't pretend to solve the rest. That sounds small, but it's the difference between a tool you can hand to an expert and a tool that wastes experts.

Second, the idea of releasing the interaction logs, dead ends included, changes the vibe. It's a very different feeling to watch a system take wrong turns, back up, and keep going. It feels less like magic, and more like process. Weirdly, that makes it more believable, not less.

Most of all, the energy angle snapped something into focus for me. Long-horizon reasoning isn't just "a smarter model." It's an industrial activity. If we want AI agents that think for hours or days, we're also signing up for the power, storage, and grid decisions that come with that.

Conclusion: mathematical AGI is starting to look less like a slogan

Aletheia's FirstProof results don't settle every AGI argument, but they do change what's plausible. Solving open research problems, checking the work, refusing to bluff, and doing it within real constraints looks like the early shape of machine research.

At the same time, the Minnesota energy buildout hints at the next bottleneck: not ideas, but electricity. If this is where AI is going, AGI won't be just a software story. It'll be a systems story, from proofs to power plants.

What part feels bigger to you, the math itself, or the fact that someone is already building the infrastructure to run this kind of thinking at scale?
