If the world's most powerful AI models — GPT-5, Claude Sonnet 4.5, Gemini 3.1 — all scored below 1% on the same test, would you call that a coincidence or a verdict? On March 25, 2026, that verdict arrived. The ARC Prize Foundation launched ARC-AGI-3, and within 24 hours the results were clear: frontier LLMs scored under 1% while humans scored 100%. Meanwhile, OpenAI quietly shelved Sora to free compute for a new model codenamed "Spud," and Anthropic is warning government officials that its next Claude release will supercharge cyber capabilities — framing designed to stir urgency around a stalled Pentagon deal. A lot happened in 48 hours. This article breaks down what actually matters — and what doesn't.
What Is ARC-AGI-3 — And Why Does the Format Change Matter?
Thesis: ARC-AGI-3 is not a harder version of the old benchmark — it is a completely different type of test, and that distinction matters more than the score.
The first two versions of ARC-AGI tested what researchers call passive fluid intelligence: show the model a static grid pattern, ask it to complete the next one. No memory required. No interaction. No goals to infer. By 2025, frontier models were cracking 84–90% on those grids — and the ARC team concluded those benchmarks had effectively been gamed, not solved.
Evidence: The benchmark paper notes that models like Gemini 3 showed in their chain-of-thought traces that they had encountered ARC-like tasks in training — either incidentally or deliberately. The public and private test sets were similar enough that dense sampling of the task space could shortcut performance. It wasn't memorization exactly — it was a higher-level attack on benchmark construction itself. The authors cite this explicitly as the reason the new benchmark uses a fully separate distribution for its private test set, making task-space gaming structurally harder.
ARC-AGI-3 closes that loophole entirely. Agents are now dropped into turn-based interactive game environments. No stated rules. No win conditions. No instructions of any kind beyond: "You are playing a game. Your goal is to win. Reply with the exact action you want to take." The agent sees a 64×64 visual grid, picks an action, observes what changes, and must infer both what the goal is and how to reach it — all from scratch, in real time. The benchmark includes 150+ handcrafted environments and 1,000+ levels, with each environment introducing new mechanics across 8–10 progressive levels. Critically, what an agent learns in level 1 must carry forward to level 2. Memory is now being tested, not just pattern recognition.
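The interaction loop the benchmark imposes can be sketched in a few lines. This is an illustrative reconstruction, not the official harness API: `ToyEnv`, the action names, and the `reset`/`step` signatures are all assumptions made for the sketch.

```python
import random

class ToyEnv:
    """Stand-in environment for illustration: the level is 'won' after
    three clicks. The real environments are handcrafted games whose
    mechanics the agent must discover on its own."""
    def reset(self):
        self.clicks = 0
        return [[0] * 64 for _ in range(64)]  # blank 64x64 color grid

    def step(self, action):
        if action == "click":
            self.clicks += 1
        done = self.clicks >= 3
        return [[0] * 64 for _ in range(64)], done

def play_level(env, max_steps=500, seed=0):
    """Minimal agent loop: observe a grid, pick an action, observe the
    result. No rules, goals, or instructions are provided; the agent's
    only memory is the history it records for itself."""
    rng = random.Random(seed)
    actions = ["up", "down", "left", "right", "click"]  # assumed action set
    grid = env.reset()
    history = []
    for _ in range(max_steps):
        action = rng.choice(actions)        # a real agent would plan here
        next_grid, done = env.step(action)
        history.append((action, done))      # self-maintained memory
        grid = next_grid
        if done:
            return True, len(history)       # action count is what RHAE scores
    return False, len(history)

won, actions_taken = play_level(ToyEnv())
```

A random policy eventually stumbles through this toy level, but it burns far more actions than necessary — which, under the scoring system described next, is exactly what gets punished.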
Verdict: This shift — from static puzzles to interactive goal inference — is the most significant format change in the benchmark's seven-year history. The old ARC tested whether AI could recognize patterns. The new one tests whether it can figure out what game it's playing while playing it. Those are not the same skill, and the scores make that gap impossible to ignore.
The Scoring System That Makes 12% Feel Like 0%
Thesis: The scoring methodology is designed to be deliberately unforgiving — and that's a feature, not a flaw.
ARC-AGI-3 doesn't use binary pass/fail. Instead it uses a metric called RHAE (Relative Human Action Efficiency) — measuring how many actions an AI takes to complete each level compared to the second-best human performance on the same level. The score is then squared. This matters more than it sounds: an agent that takes 10× as many actions as the human baseline doesn't score 10% — it scores 1%. Inefficiency is penalized quadratically.
Evidence: There's also a hard cap — if an AI takes more than 5× the number of actions a human used, the attempt is immediately terminated. The human baseline is derived from data across 1,200+ human players in a 30-day preview period, and the score is pegged to the second-best human run, not the median. That means even a moderately inefficient human (one taking roughly 40% more actions than the baseline) would score only around 50% under this rubric.
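The arithmetic of the quadratic penalty is easy to state precisely. The function below is a reconstruction from the description above, not the official scoring code, and the name is made up:

```python
def rhae_score(agent_actions: int, human_actions: int) -> float:
    """Relative Human Action Efficiency, per the article's description:
    efficiency vs. the second-best human run, squared, capped at 100%.
    Attempts beyond 5x the human action count are terminated (score 0)."""
    if agent_actions > 5 * human_actions:
        return 0.0                      # hard cap: run terminated
    efficiency = min(human_actions / agent_actions, 1.0)
    return efficiency ** 2              # quadratic penalty

# 2x the human actions scores 25%, 3x scores ~11%, 6x is terminated
print(rhae_score(200, 100))   # 0.25
print(rhae_score(300, 100))   # ~0.111
print(rhae_score(600, 100))   # 0.0 (past the 5x cap)
```

Note the compounding: halving your inefficiency quadruples your score, which is why closing the gap from 12% to 50% is a much larger leap than the raw numbers suggest.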
During the preview phase, the best-performing agent — not a frontier LLM, but a simpler RL and graph-search approach — scored 12.58%. Frontier models like Gemini 3.1 came in at 0.37% on the general leaderboard. One group, Symbolica AI, built a multi-agent harness where a sub-agent summarized each game state for an orchestrator and managed to solve all three public environments. But harnesses specifically engineered for ARC-AGI-3 are banned from the official leaderboard. The competition explicitly measures general-purpose reasoning, not benchmark-specific engineering.
Verdict: The prize pool — $850,000 for the ARC-AGI-3 track alone, with a $700,000 grand prize for the first agent to score 100% — runs through December 2026. Milestone checkpoints fall in June and September. Whether any team gets close to 50% by year-end is genuinely uncertain. The scoring makes 50% harder to reach than it sounds.
Why Frontier LLMs Are Failing Below 1%
Thesis: The sub-1% performance of GPT-5, Claude, and Gemini on ARC-AGI-3 isn't a fluke — it reveals a structural limitation in how these models process sequential interactive environments.
Frontier LLMs are built for text in, text out — or at most, image in, text out. They process each prompt as a relatively self-contained input. ARC-AGI-3 demands that an agent maintain a coherent world model across hundreds of sequential steps, update that model as new evidence appears, and do all of this without any language cues. There are no words in the environment. There is no instruction. Just a 64×64 colored grid and the observable result of the last action taken.
Evidence: The ARC paper documents a consistent failure mode: as context windows grow across hundreds of game steps, models get overwhelmed by accumulated state. Performance degrades not because they become less intelligent step-by-step, but because the signal drowns in the noise of their own history. This is precisely why the Symbolica multi-agent harness worked — it compressed game state into short textual summaries, giving the orchestrator a manageable context at each step. The irony is that this compression behavior — summarizing experience to maintain a higher-level plan — is exactly what ARC-AGI-3 is designed to test as a native capability, not an engineered workaround.
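The compression idea is easy to illustrate. In the actual harness a sub-agent produced textual summaries; the diff-based `summarize` below is a stand-in for that, and all names here are hypothetical:

```python
from collections import deque

def summarize(before, action, after):
    """Toy compressor: record the action and which cells changed,
    instead of carrying two full 64x64 grids forward."""
    changed = tuple(
        (r, c)
        for r, row in enumerate(before)
        for c, v in enumerate(row)
        if after[r][c] != v
    )
    return (action, changed)

class RollingContext:
    """Bounded window of compressed transitions, so the planner's
    context stays small no matter how long the episode runs."""
    def __init__(self, max_items=20):
        self.window = deque(maxlen=max_items)

    def add(self, before, action, after):
        self.window.append(summarize(before, action, after))

    def context(self):
        return list(self.window)
```

Feeding a planner this window instead of the raw transcript is the engineered analogue of what ARC-AGI-3 wants agents to do natively: maintain a compact, task-relevant model of their own history.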
The pattern holds beyond ARC. NetHack — a sequential decision-making benchmark that Tim Rocktäschel of Google DeepMind helped author — has been unsaturated for six years. On NetHack, even Gemini 3 Pro, the highest-scoring model, reaches only 6.8%. Once a benchmark requires genuine exploration, hypothesis revision, and efficient planning under uncertainty across time, current LLMs consistently fall short.
Verdict: This is not a data problem or a scale problem. The paper is explicit: throwing more parameters or longer training runs at ARC-AGI-3 will not produce a solution. The path runs through novel algorithmic ideas. That's a meaningful claim, and one that the 12.58% performance of the simpler RL/graph-search agent — outperforming every frontier LLM by more than 30× — makes hard to dismiss.
OpenAI Spud and the Next Claude: Does the Hype Hold Up?
Thesis: Both OpenAI and Anthropic are signaling major upcoming model releases. ARC-AGI-3 gives us a useful pre-release filter for evaluating those claims before the models drop.
OpenAI's situation is unusually transparent right now. Sam Altman publicly confirmed that initial development of a new model codenamed "Spud" is complete and that it will be ready within weeks. To free up the required compute, OpenAI shut down Sora — its standalone AI video app, launched just six months earlier — and the $1 billion Disney licensing deal tied to it collapsed with it. Altman described Spud as something that will "really accelerate the economy." The company is also consolidating ChatGPT, Codex, and its browser into a single super-app, with Spud as the anchor model.
Anthropic's signaling is less about product and more about consequence. According to Axios reporting from March 26, Anthropic has been privately warning US government officials that its next major model advance will supercharge both offensive and defensive cyber capabilities — language designed to create urgency. The implicit message: if you are not in a deal with us when this drops, you will notice. Whether that's credible or strategic positioning, it is notable that government sources on both sides of the Pentagon dispute confirm talks are not fully dead.
Evidence: ARC-AGI-3 is now the most useful pre-release filter available for evaluating these claims. The benchmark's authors designed it specifically to measure the residual gap between frontier AI and human-level general intelligence — to see what actual deficiencies remain when the newest models arrive. When Spud and the next Claude drop, their ARC-AGI-3 scores will be among the most informative single numbers published about either model.
Verdict: Hype is easy. The question is whether either model shows a qualitative improvement on the kind of interactive, goal-directed, memory-dependent reasoning ARC-AGI-3 measures. Historically, each new model generation has shifted the landscape of narrow capabilities while leaving deeper structural gaps intact. There's no particular reason to expect that pattern to break — but if either model cracks even 5–10% on ARC-AGI-3, that would be a signal worth taking seriously.
Anthropic, the Pentagon, and Why Benchmarks Now Have Political Weight
Thesis: The Anthropic–Pentagon dispute is no longer just a contract story. It has made AI capability benchmarks politically consequential in a way they weren't six months ago.
Here's the sequence worth understanding: Anthropic held a $200 million Pentagon contract and was the only major commercial AI model cleared for classified government use — including active deployment in the capture of Venezuelan President Nicolás Maduro in January 2026, and operations involving Iran. On February 27, after failed negotiations over two specific restrictions — use of Claude for domestic mass surveillance and fully autonomous weapons — President Trump ordered all federal agencies to cease using Anthropic's products, and Defense Secretary Hegseth designated the company a supply chain risk to national security.
OpenAI moved within hours to sign its own Pentagon deal. A federal judge has since called the government's actions against Anthropic "troubling," describing them as an apparent attempt to cripple the company. Talks are reportedly ongoing, with negotiators on both sides describing themselves as "very close" — a phrase that has now been in circulation for months without resolution.
Evidence: The connection to ARC-AGI-3 is direct: Anthropic is explicitly using its upcoming model's expected capability jump as leverage to revive these negotiations. Government sources cited in the Axios piece confirm that cyberwarfare officials within the Pentagon still want access to Claude specifically — and that the upcoming release is a primary reason they're pushing to revive talks. In this context, benchmark performance is no longer just a technical metric. It is a negotiating chip.
Verdict: The business impact on Anthropic is real but likely overstated in most coverage. The company's annualized revenue from Claude Code alone topped $2.5 billion by February 2026, and Claude hit No. 1 on the App Store following the Pentagon backlash, partly driven by a "QuitGPT" campaign. The enterprise contract loss hurts, but Anthropic is not in existential trouble. What's more significant is the precedent: AI model capability is now a factor in geopolitical negotiations, and benchmarks are part of how those claims get tested.
Jensen Huang Said AGI Is Already Here. ARC-AGI-3 Disagrees.
Thesis: The debate over whether AGI has arrived is a definitional argument, not a factual one — and ARC-AGI-3 makes that distinction explicit in its design.
Jensen Huang, CEO of Nvidia, said this week that artificial general intelligence has already been achieved. Similar claims have been made by figures at OpenAI. The ARC Prize Foundation's position is precisely the opposite: as long as a measurable gap exists between AI and human learning efficiency in novel environments, the AGI threshold has not been crossed.
Evidence: The ARC-AGI-3 paper is methodologically careful here. The benchmark caps AI scores at 100% — meaning even if a model someday solves the environments more efficiently than humans, it still only scores 100%. The ceiling is intentional: the benchmark cannot be used to claim AI exceeds human intelligence. But it can be used to claim AI has not yet reached it. That asymmetry is deliberate. A 0.37% score from Gemini 3.1 on the general leaderboard is not ambiguous data.
There is also a mathematical quirk worth noting. The human baseline is derived from the second-best human run across 10 participants per environment. The best human run — always on the first attempt to keep conditions fair — typically used around 28% fewer actions than the second-best. Under the quadratic penalty system, that means the second-best human, measured against the best run, would score only about 52%. The benchmark is somewhat adversarial even to humans — which the paper's authors acknowledge but have not yet formally addressed in the scoring methodology.
Verdict: "AGI is here" is a claim about definitions, not about data. Huang likely means something like "AI can now perform most economically valuable tasks better than most humans most of the time" — which is defensible on its own terms. What ARC-AGI-3 measures is narrower and more specific: adaptive, goal-inferred, memory-dependent reasoning in novel environments where no scaffolding is provided. On that metric, the gap between humans and frontier AI is not small. It is not closing rapidly. As of March 2026, it is 99.63 percentage points wide.
My Take
The number that keeps pulling my attention isn't 0.37% — it's 12.58%. That's what a simple RL and graph-search approach scored in the preview phase, outperforming every frontier LLM by more than 30×. Not GPT-5. Not Claude. Not Gemini. A simpler algorithmic system designed for sequential decision-making under uncertainty. If that single data point doesn't prompt a rethink of how the industry frames "intelligence," nothing will.
What ARC-AGI-3 has actually done is separate two things the field has been conflating: language capability and general reasoning. Frontier LLMs are extraordinary at the former. They have become genuinely useful tools for writing, coding, synthesis, and structured analysis. But they are failing almost completely at interactive, goal-directed reasoning in environments that provide no linguistic scaffold. The environments don't speak. There's no prompt to optimize. And current models have no real answer for that.
The Spud and next-Claude hype deserves a specific kind of scrutiny now. Both companies are signaling qualitative leaps. That may be true on coding benchmarks, on reasoning over structured problems, on multimodal tasks. But if neither model shows meaningful improvement on ARC-AGI-3's interactive format — say, failing to reach even 2–3% on the general leaderboard — then "qualitative leap" needs to be interrogated more carefully. A model that is 40% better at writing code is not moving toward the kind of fluid adaptive intelligence this benchmark is measuring. Those are different axes.
The most useful framing for where we are: AI is an exceptional first-drafter and an increasingly capable specialist. It is not yet an adaptive reasoner in the sense ARC-AGI-3 defines. Those positions are not contradictory — they describe different capabilities. The confusion between them is costing people an accurate model of what AI can and cannot do. Whatever its methodological quirks, ARC-AGI-3 is doing the work of making that distinction testable and public.
Key Takeaways
- ARC-AGI-3 launched March 25, 2026 — the first interactive AI benchmark requiring goal inference, exploration, memory, and planning without any instructions.
- Frontier LLMs (GPT-5, Claude, Gemini 3.1) scored under 1%. The best preview agent scored 12.58% using RL and graph-search — not a language model.
- The RHAE scoring system (quadratic penalty) is intentionally harsh: taking 10× human actions = 1% score, not 10%. A 5× cap terminates attempts entirely.
- OpenAI shut down Sora to free compute for "Spud" — its next major model, completing development and expected within weeks.
- Anthropic is using its upcoming model's capability jump as leverage in reviving Pentagon negotiations, with both sides describing talks as ongoing.
- Jensen Huang's "AGI is here" claim and ARC-AGI-3's 0.37% result coexist because they're measuring fundamentally different things — that distinction matters.
- $700,000 grand prize for the first agent to score 100% on ARC-AGI-3. Milestone checkpoints in June and September 2026.
FAQ
What is ARC-AGI-3 in simple terms?
It's a benchmark that drops an AI agent into a video-game-like environment with no instructions, no stated rules, and no win conditions. The agent must figure out what the goal is and achieve it as efficiently as a human would. Every environment is entirely new, so memorization provides zero advantage. As of March 2026, humans score 100% and the best frontier AI scores under 1%.
Why did simpler AI methods outperform GPT-5 and Claude on ARC-AGI-3?
Because the benchmark doesn't reward language ability or memorized knowledge — it rewards efficient exploration and adaptive planning across sequential states. Reinforcement learning and graph-search algorithms are specifically designed for this kind of sequential decision-making. Frontier LLMs are not. When the linguistic scaffolding is stripped away entirely, their core advantage disappears.
What is OpenAI's "Spud" model?
Spud is the internal codename for OpenAI's next major model, which completed initial development in late March 2026. OpenAI freed compute resources by shutting down Sora to prioritize it. Sam Altman has described it as a model that will accelerate the economy, with a release expected within weeks of this writing. No benchmark results are public yet.
How does the Anthropic–Pentagon dispute connect to ARC-AGI-3?
Anthropic is reportedly using its upcoming model's expected capability jump — specifically in cyber-offense and cyber-defense — as leverage to revive stalled Pentagon negotiations. ARC-AGI-3 is the benchmark most likely to give an objective readout on how much of that claimed leap is real. It's become the external measuring stick against which those capability claims will be tested publicly.
Has any AI model ever scored above 50% on ARC-AGI-3?
No. The best score during the preview phase was 12.58%, achieved by a simple RL and graph-search system. The best frontier LLM score was Gemini 3.1 at 0.37% on the general leaderboard. Scoring above 50% under the quadratic penalty system would require performance approaching human action efficiency — something no current system is remotely close to achieving.
Is ARC-AGI-3 a fair test? Aren't the penalties too harsh?
That's a legitimate methodological critique. The quadratic penalty means inefficiency compounds fast — an agent taking 3× human actions scores only 11%, not 33%. The human baseline is also pegged to the second-best run, not the median, so a typical human performer would score well below 100% by this measure; even the best human run beat the baseline by roughly 28%. The benchmark's authors are aware of this and have flagged that reporting median human performance would be a useful addition to future versions. That said, a benchmark that's slightly adversarial to everyone is arguably more honest than one calibrated to make AI look good on a press release.
The gap ARC-AGI-3 is measuring is real. Whether Spud or the next Claude closes it meaningfully — or whether they push narrow benchmarks further while leaving the interactive-reasoning gap intact — is the most useful question to track for the rest of 2026. The benchmark has plenty of runway. Whether current model architectures do is considerably less clear.
Sources: ARC Prize Foundation · ARC-AGI-3 Technical Paper (arxiv.org) · Axios — Anthropic Pentagon coverage (March 26, 2026) · Fast Company — ARC-AGI-3 launch