Seven days. That is how long it took OpenAI to fire back at Anthropic after Claude Opus 4.7 shipped. GPT-5.5 dropped on April 23, 2026, and the leaderboard flipped — again.
But here is the part worth slowing down on: on the 10 benchmarks where both providers publish numbers, Claude Opus 4.7 leads on 6 of them. GPT-5.5 leads on 4. And yet, by most mainstream coverage, GPT-5.5 "won." That gap between the headline and the data is exactly what this article is about.
The leads cluster by category. GPT-5.5 dominates long-running agentic tool-use tasks. Claude Opus 4.7 dominates precision coding and reasoning-heavy tests. The two models are not competing on the same axis. Picking the wrong one for your workload costs real money; for API teams running production pipelines at scale, it also costs real performance.
1. The Release Context: What Just Happened
Anthropic released Claude Opus 4.7 on April 16, 2026. OpenAI released GPT-5.5 one week later. Both ship with 1M-token context windows. Both claim to be the best model for agentic coding. Both are priced at $5 per million input tokens on the standard tier.
The era where one lab held a context-size advantage is over. The era where one lab was clearly faster is narrowing. What remains as a differentiator: retrieval quality under pressure, agentic task coverage, benchmark choice, and — critically — token efficiency in long-running loops.
GPT-5.5 is the first fully retrained base model from OpenAI since GPT-4.5. It was co-designed for, trained with, and served on NVIDIA GB200 and GB300 NVL72 systems. Despite being a larger, more capable model, it matches GPT-5.4's per-token latency in real-world serving — which OpenAI's own engineering team describes as genuinely unusual. Bigger models are almost always slower.
Claude Opus 4.7 shipped with a focus on precision coding and reasoning. Anthropic's broader architectural bets — including its recurrent depth transformer experiments — show a lab increasingly willing to diverge from the standard transformer playbook. Opus 4.7 also reads images at roughly 3.3x the resolution of any comparable model (up to 2,576 pixels on the long edge), which matters for vision-heavy agentic tasks.
Seven days between flagship releases. The pace alone tells you something about where the industry is right now.
2. Benchmark Breakdown: Where Each Model Actually Wins
Most coverage picked up GPT-5.5's lead on 14 benchmarks from OpenAI's own comparison table. That number is real. But it includes benchmarks where the Claude score comes from OpenAI's own evaluation, meaning Anthropic never published a number for the same test. The honest picture requires looking at what both labs independently reported.
On the 10 benchmarks where both providers report results: Claude Opus 4.7 leads on 6 (GPQA Diamond, HLE without tools, HLE with tools, SWE-Bench Pro, MCP Atlas, FinanceAgent v1.1). GPT-5.5 leads on 4 (Terminal-Bench 2.0, GDPval, OSWorld-Verified, CyberGym). The split is clean and category-level.
| Benchmark | GPT-5.5 | Claude Opus 4.7 | Winner |
|---|---|---|---|
| Terminal-Bench 2.0 | 82.7% | 69.4% | GPT-5.5 (+13.3) |
| SWE-Bench Pro | 58.6% | 64.3% | Claude (+5.7) |
| OSWorld-Verified | 78.7% | 78.0% | GPT-5.5 (+0.7) |
| MCP Atlas | 75.3% | 79.1% | Claude (+3.8) |
| GDPval (Knowledge Work) | 84.9% | 80.3% | GPT-5.5 (+4.6) |
| BrowseComp | 90.1% | — | GPT-5.5 |
| FrontierMath Tier 4 | 39.6% (Pro) | 22.9% | GPT-5.5 Pro (+16.7) |
| ARC-AGI 2 | 85.0% | — | GPT-5.5 |
| HLE (without tools) | — | 46.9% | Claude (GPT-5.5 Pro: 43.1%) |
Sources: OpenAI official launch page, Anthropic model documentation, LLM-Stats independent verification. Where both labs published different numbers for the same benchmark, both are cited. "—" means that provider did not publish a score for that benchmark.
3. Agentic Coding: The Headline Battle
Terminal-Bench 2.0 is where GPT-5.5's lead is decisive — 82.7% against 69.4% for Claude. That is a 13-point gap, not noise. The benchmark tests complex command-line workflows requiring planning, iteration, and tool coordination in a sandboxed terminal. The kind of task where the model has to drive a terminal, run tests, and recover from its own bad output.
Michael Truell, co-founder at Cursor, noted that GPT-5.5 stays on task for complex, long-running engineering work without stopping prematurely. That observation matches what Terminal-Bench measures. It's not about writing clean code in one pass; it's about sustained execution across a messy, multi-step task.
On OSWorld-Verified — which tests whether a model can actually operate a real computer environment, clicking, typing, and navigating software rather than analyzing screenshots — GPT-5.5 leads at 78.7% versus Claude's 78.0%. That margin is a statistical tie. Call it even.
The real agentic question for production teams is token efficiency in loops. GPT-5.5 uses roughly 40% fewer output tokens than GPT-5.4 on equivalent Codex tasks. One comparison puts GPT-5.5 at approximately 72% fewer output tokens than Claude Opus 4.7 on the same coding tasks. That number is contested and depends heavily on task type — but even if the actual figure is half of that, it compounds massively in agentic loops where every step adds up.
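To make the compounding concrete, here is a minimal sketch. The step count, tokens per step, and efficiency factors are assumptions chosen for illustration (the efficiency value of 0.64 corresponds to roughly half of the contested 72% figure); only the output list prices come from the pricing section below.

```python
# Illustrative sketch of how per-step output-token efficiency compounds over an
# agentic loop. Workload numbers are assumptions; only the list prices come from
# the article's pricing section.

STEPS = 50                         # assumed number of tool-use steps in one run
BASELINE_TOKENS_PER_STEP = 2_000   # assumed output tokens per step at baseline

def run_cost(output_rate_per_m: float, efficiency: float) -> float:
    """Output-token cost for one agentic run.

    efficiency is the fraction of baseline output tokens actually emitted,
    e.g. 0.64 means roughly 36% fewer output tokens than the baseline.
    """
    total_tokens = STEPS * BASELINE_TOKENS_PER_STEP * efficiency
    return total_tokens / 1_000_000 * output_rate_per_m

baseline = run_cost(output_rate_per_m=25.0, efficiency=1.0)    # $25/M output, full token count
efficient = run_cost(output_rate_per_m=30.0, efficiency=0.64)  # $30/M output, 36% fewer tokens

print(f"baseline run:  ${baseline:.2f}")   # $2.50
print(f"efficient run: ${efficient:.2f}")  # $1.92
```

Even at half the contested efficiency figure and a higher per-token rate, the more efficient model comes out cheaper per run under these assumptions. The comparison only makes sense per task, not per token.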
4. Precision Coding and Reasoning: Where Claude Holds
SWE-Bench Pro is where Claude Opus 4.7 leads cleanly — 64.3% against GPT-5.5's 58.6%. This benchmark evaluates real-world GitHub issue resolution end-to-end. Not a terminal workflow. A codebase, a real issue, one shot to fix it.
There is a caveat worth keeping in mind here. OpenAI flagged evidence that other labs' models show memorization on SWE-Bench Pro. Anthropic published a re-score on filtered, decontaminated subsets showing its margin holds. OpenAI did not run a matched re-test. The gap may narrow under perfectly controlled conditions. It probably doesn't disappear.
On MCP Atlas — which tests multi-tool orchestration via the Model Context Protocol — Claude leads 79.1% to 75.3%. For teams building heavily on MCP, that 4-point gap in tool-call reliability in complex, chained scenarios is not abstract.
GPQA Diamond and Humanity's Last Exam (without tools) also go to Claude. These are reasoning-heavy tests that don't involve tool use — pure model intelligence on hard academic problems. The broader pattern in AI research points to a distinction between models that reason deeply in context versus models that execute reliably across long tool-use chains. Claude Opus 4.7 and GPT-5.5 are, at this moment, on different sides of that line.
Dan Shipper at Every ran what amounts to a controlled test: he gave both models the same broken codebase and asked whether they could arrive at the same solution a senior engineer had landed on. GPT-5.4 couldn't do it. GPT-5.5 could. But the same review noted that Opus 4.7 produces better architectural plans, and that GPT-5.5 actually performed best when it executed a plan written by Opus 4.7. That one observation tells you a lot about how teams might realistically route between these models.
5. Pricing Math: The Real Cost Comparison
Both models list at $5 per million input tokens. The divergence is on output: GPT-5.5 at $30 per million output tokens, Claude Opus 4.7 at $25 per million output tokens. GPT-5.5 Pro jumps to $30 input / $180 output. That is a significant premium.
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Notes |
|---|---|---|---|
| GPT-5.5 (standard) | $5.00 | $30.00 | Plus/Pro/Business/Enterprise |
| GPT-5.5 Pro | $30.00 | $180.00 | Pro/Business/Enterprise only |
| Claude Opus 4.7 | $5.00 | $25.00 | 2x surcharge above 200K input tokens |
| GPT-5.4 (previous) | $2.50 | $15.00 | Still available at half GPT-5.5 cost |
Sam Altman's argument on pricing is that token efficiency offsets the rate increase. The math: if GPT-5.5 uses 40% fewer output tokens on the same Codex tasks, then 100M output tokens on GPT-5.4 at $15/M = $1,500. The same task on GPT-5.5 at 60M tokens (40% fewer) at $30/M = $1,800. You pay 20% more, not 100% more. For teams where Codex's higher task completion rate means fewer retries, the gap closes further.
The long-prompt surcharge on Claude is worth flagging. Opus 4.7 charges 2x on input once you exceed 200K tokens. For teams running large context windows routinely, that changes the math considerably. GPT-5.5 has no equivalent surcharge at this tier.
GPT-5.5's batch processing tier is discounted 50%; priority processing runs at 2.5x the standard rate. Those tiers matter for production workloads running off-peak or needing guaranteed throughput.
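Putting the paragraphs above together, here is a minimal cost sketch using the published list prices, the 200K-token surcharge threshold, and the batch discount as described. The function names are mine, and the workload figures at the bottom are assumptions for illustration, not representative traffic.

```python
# Cost sketch using the published list prices described above.
# Workload numbers at the bottom are illustrative assumptions.

def gpt55_cost(input_tokens: int, output_tokens: int, batch: bool = False) -> float:
    """GPT-5.5 standard tier: $5/M input, $30/M output, 50% off for batch."""
    cost = input_tokens / 1e6 * 5.00 + output_tokens / 1e6 * 30.00
    return cost * 0.5 if batch else cost

def opus47_cost(input_tokens: int, output_tokens: int) -> float:
    """Claude Opus 4.7: $5/M input, $25/M output, 2x input rate once the request exceeds 200K tokens."""
    input_rate = 10.00 if input_tokens > 200_000 else 5.00
    return input_tokens / 1e6 * input_rate + output_tokens / 1e6 * 25.00

# Illustrative workload: a 300K-token context with a 20K-token response.
print(f"GPT-5.5:         ${gpt55_cost(300_000, 20_000):.2f}")              # $2.10
print(f"GPT-5.5 batch:   ${gpt55_cost(300_000, 20_000, batch=True):.2f}")  # $1.05
print(f"Claude Opus 4.7: ${opus47_cost(300_000, 20_000):.2f}")             # $3.50
```

At short contexts the two standard tiers are close; under the surcharge as described, the comparison flips well before you reach the full 1M-token window.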
6. The Inference Efficiency Story Nobody Covered Properly
The most technically interesting part of this GPT-5.5 launch got buried under benchmark tables. GPT-5.5 participated in optimizing its own inference infrastructure during training.
Before GPT-5.5, OpenAI split GPU requests into a fixed number of chunks to balance workload across cores. A fixed number is not optimal for all traffic patterns. So Codex — running on GPT-5.5 — analyzed weeks of production traffic data and wrote custom heuristic algorithms to optimize how work gets partitioned across the hardware. That single improvement increased token generation speeds by over 20%. The model helped build the systems that run the model.
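OpenAI has not published the actual heuristics, so the following is only a toy sketch of the general idea: fixed-count chunking splits a batch of requests the same way regardless of traffic, while a traffic-aware heuristic picks the split based on the request sizes it actually sees. Everything here, including the greedy balancing rule, is an assumption for illustration, not OpenAI's algorithm.

```python
# Toy illustration of fixed vs. traffic-aware work partitioning.
# This is NOT OpenAI's algorithm; it only sketches the general idea.

def fixed_chunks(request_sizes: list[int], n_chunks: int) -> list[list[int]]:
    """Always split into the same number of contiguous groups, regardless of traffic."""
    size = max(1, len(request_sizes) // n_chunks)
    return [request_sizes[i:i + size] for i in range(0, len(request_sizes), size)]

def traffic_aware_chunks(request_sizes: list[int], workers: int) -> list[list[int]]:
    """Greedy balancing: assign each request (largest first) to the lightest chunk."""
    chunks = [[] for _ in range(workers)]
    loads = [0] * workers
    for size in sorted(request_sizes, reverse=True):
        lightest = loads.index(min(loads))
        chunks[lightest].append(size)
        loads[lightest] += size
    return chunks

requests = [4096, 1024, 512, 2048, 256, 3072, 768, 1536]  # assumed token counts per request
print([sum(c) for c in fixed_chunks(requests, 4)])          # [5120, 2560, 3328, 2304] -> unbalanced
print([sum(c) for c in traffic_aware_chunks(requests, 4)])  # [4096, 3072, 3072, 3072] -> much closer
```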
That is not a marketing claim. It is a specific, falsifiable engineering outcome with a measurable result. The implication is that future models may increasingly self-optimize their deployment infrastructure — which changes the cost curve in ways that are hard to predict from current pricing alone.
Latency data from independent testing puts Claude Opus 4.7 at a time-to-first-token of around 0.5s versus GPT-5.5's roughly 3s baseline. Per-token throughput is closer between the two. For interactive applications where the user is waiting on a first response, Claude is noticeably snappier. For batch jobs that care about total throughput rather than first-token latency, GPT-5.5 closes the gap.
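A back-of-the-envelope way to think about this: total response time is roughly time-to-first-token plus output length divided by throughput. The TTFT values below come from the figures above; the throughput values are assumptions chosen only to show where a crossover could land, since the independent testing only says per-token throughput is "closer" between the two.

```python
# Rough latency model: total_time = TTFT + output_tokens / tokens_per_second.
# TTFT values are from the article; throughput values are ASSUMED for illustration.

def total_latency(ttft_s: float, tokens_per_s: float, output_tokens: int) -> float:
    return ttft_s + output_tokens / tokens_per_s

for output_tokens in (50, 500, 5_000):
    claude = total_latency(ttft_s=0.5, tokens_per_s=60, output_tokens=output_tokens)  # assumed 60 tok/s
    gpt = total_latency(ttft_s=3.0, tokens_per_s=75, output_tokens=output_tokens)     # assumed 75 tok/s
    print(f"{output_tokens:>5} tokens -> Claude {claude:6.1f}s | GPT-5.5 {gpt:6.1f}s")
```

Under these assumed throughputs, Claude wins anything interactive-length and GPT-5.5 only pulls ahead on very long generations. The exact crossover depends entirely on real throughput on your traffic, which is worth measuring rather than assuming.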
7. Scientific Research: A Surprising Gap
These results did not get enough attention in the launch coverage. On GeneBench, which tests multi-stage scientific data analysis in genetics and quantitative biology, GPT-5.5 scored 25.0% against GPT-5.4's 19.0%. On BixBench, a real-world bioinformatics benchmark, GPT-5.5 reached 80.5% against GPT-5.4's 74.0%.
The GeneBench gap widens as tasks get longer, with a clear separation appearing around the 15,000-token output mark. That suggests GPT-5.5's gains on sustained reasoning show up more on extended scientific workflows than on shorter tasks.
An internal version of GPT-5.5 equipped with a custom tool framework contributed to a new mathematical proof about Ramsey numbers, a core research area in combinatorial mathematics where new results are rare and technically difficult. The proof was then verified in Lean, the formal proof verification system. This is not GPT-5.5 summarizing a paper about Ramsey theory. It contributed an actual mathematical argument that held up to formal verification.
Derya Unutmaz, an immunology professor, used GPT-5.5 Pro to analyze a gene expression dataset with 62 samples and nearly 28,000 genes, generating a full research report that he said would have taken his team months. On FrontierMath Tier 4, which covers postdoctoral-level math problems, GPT-5.5 Pro reached 39.6% against Claude Opus 4.7's 22.9%. That nearly 17-point gap on the hardest math tier is the biggest single gap in this comparison, though it comes from the Pro tier rather than the standard model. Neither model is close to solving Tier 4 (39.6% still means failing more than 60% of the hardest problems in the benchmark), but the gap itself is notable.
8. Workload Routing Guide: Which One to Use When
Use GPT-5.5 for:
- Long-running agentic terminal workflows — sustained execution over complex multi-step tasks where the model needs to drive a shell, recover from failures, and keep going without stopping early
- Knowledge work at scale — GDPval performance (84.9%) across 44 professions makes it the default for cross-functional automation (finance, comms, legal, data science)
- Hard math and scientific research — FrontierMath Tier 4 lead is real; GeneBench and BixBench gains are real
- Web research requiring deep browsing — BrowseComp at 90.1% is a strong result, though Claude did not publish a comparable score
- High-volume agentic pipelines — token efficiency at scale; the per-task cost argument holds if you're running at volume
Use Claude Opus 4.7 for:
- GitHub issue resolution and refactoring — SWE-Bench Pro lead of 64.3% is meaningful for real-world codebase work
- MCP-heavy multi-tool orchestration — 79.1% on MCP Atlas means better tool-call reliability in chained scenarios
- Dense visual input processing — 3.3x the image resolution of comparable models; computer-use agents reading full-resolution screenshots
- Reasoning-heavy tasks without tools — GPQA Diamond and HLE without tools both go to Claude
- Interactive applications needing fast first response — 0.5s TTFT vs GPT-5.5's ~3s baseline matters when users are waiting
- Architectural planning — the observation that GPT-5.5 performs best executing a plan written by Opus 4.7 is worth taking seriously
The best production setups are increasingly multi-model: route GPT-5.5 for agentic execution and sustained terminal work, Claude Opus 4.7 for architecture, precision coding, and tool-heavy reasoning. Neither model needs to win everything for that setup to be optimal.
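If you want to encode the routing guide above as an actual dispatch layer, a minimal sketch might look like the following. The task categories and model identifiers are placeholders of my own, not official model IDs, and the fallback to the cheaper previous-generation model reflects the pricing table earlier in the article.

```python
# Minimal routing sketch for the workload guide above.
# Task categories and model identifiers are placeholders, not official IDs.

ROUTES = {
    "terminal_agent":    "gpt-5.5",          # long-running shell/tool loops
    "knowledge_work":    "gpt-5.5",          # cross-functional document and data tasks
    "hard_math":         "gpt-5.5-pro",      # FrontierMath-style problems
    "web_research":      "gpt-5.5",          # deep browsing
    "issue_resolution":  "claude-opus-4.7",  # SWE-Bench-style codebase fixes
    "mcp_orchestration": "claude-opus-4.7",  # chained multi-tool calls
    "vision_dense":      "claude-opus-4.7",  # high-resolution screenshot reading
    "planning":          "claude-opus-4.7",  # architectural plans, then hand off
}

def route(task_type: str) -> str:
    """Pick a model for a task category; default to the cheaper previous-gen model."""
    return ROUTES.get(task_type, "gpt-5.4")

print(route("issue_resolution"))  # claude-opus-4.7
print(route("terminal_agent"))    # gpt-5.5
```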
My Take
The benchmark war framing misses what is actually happening here. OpenAI and Anthropic are not racing to build the same model faster. They have made different architectural bets, optimized for different workflows, and ended up in genuinely different places. That is good for users, even if it makes simple "winner" declarations impossible.
The number that moved me in this whole comparison isn't from a benchmark table. It's that GPT-5.5 performed best when executing a plan written by Claude Opus 4.7. One reviewer ran this test and noted it almost as an aside — but if that pattern holds at scale, it describes an obvious and valuable production architecture: Opus 4.7 as the planner, GPT-5.5 as the executor. Two models, complementary strengths, better results than either delivers alone.
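As a sketch of what that planner/executor split could look like in practice: the SDK calls below (`messages.create` on the Anthropic client, `chat.completions.create` on the OpenAI client) are the standard ones, but the model identifiers are assumed and the task and prompts are purely illustrative, not a production recipe.

```python
# Sketch of a planner/executor pipeline: Opus 4.7 plans, GPT-5.5 executes.
# Model identifiers below are assumptions; prompts and task are illustrative only.
import anthropic
import openai

anthropic_client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
openai_client = openai.OpenAI()           # reads OPENAI_API_KEY from the environment

def plan(task: str) -> str:
    """Ask the planner model for a step-by-step implementation plan."""
    response = anthropic_client.messages.create(
        model="claude-opus-4.7",  # assumed identifier
        max_tokens=2048,
        messages=[{"role": "user",
                   "content": f"Write a concise, numbered implementation plan for: {task}"}],
    )
    return response.content[0].text

def execute(task: str, plan_text: str) -> str:
    """Hand the plan to the executor model to carry out."""
    response = openai_client.chat.completions.create(
        model="gpt-5.5",  # assumed identifier
        messages=[
            {"role": "system", "content": "Execute the provided plan step by step."},
            {"role": "user", "content": f"Task: {task}\n\nPlan:\n{plan_text}"},
        ],
    )
    return response.choices[0].message.content

task = "Fix the failing integration tests in the payments service"  # illustrative task
print(execute(task, plan(task)))
```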
The Ramsey number proof is also worth sitting with. Not because it means GPT-5.5 is doing original mathematics at a level that replaces mathematicians; it isn't, and 39.6% on FrontierMath Tier 4 means it fails most of the hardest problems. But a formally verified new proof in combinatorics, contributed by a model, is not a demo. It is a machine-checked result in a domain where new findings are genuinely rare. That is qualitatively different from summarizing existing work.
On pricing: I think the token efficiency argument is real but requires verification against your specific workload. OpenAI's published number is for Codex tasks specifically. If your production workload looks different, run the math yourself before assuming the 40% token reduction applies. The rate per token is still double GPT-5.4 — efficiency only saves you if the efficiency applies to your task type.
Key Takeaways
- Claude Opus 4.7 leads on 6 of 10 shared benchmarks; GPT-5.5 leads on 4. The headline "14 benchmark wins" includes tests where only OpenAI published Claude's score.
- GPT-5.5 dominates Terminal-Bench 2.0 (+13.3 points), the largest gap between the two standard-tier models. Claude leads SWE-Bench Pro (+5.7 points).
- Token efficiency is real: GPT-5.5 uses roughly 40% fewer output tokens on Codex tasks than GPT-5.4. Against Claude Opus 4.7 the reported gap is larger but contested; verify against your workload.
- Claude Opus 4.7's 0.5s TTFT vs GPT-5.5's ~3s matters for interactive applications. Batch jobs care more about throughput, where the gap closes.
- GPT-5.5 helped optimize its own inference infrastructure during training, increasing token generation speeds by over 20%. That is novel and not yet matched by any publicly announced Anthropic equivalent.
- Best production setup: multi-model routing. Opus 4.7 for planning and precision; GPT-5.5 for execution at scale.
Frequently Asked Questions
Is GPT-5.5 better than Claude Opus 4.7 overall?
Neither model is better overall — they're better at different things. GPT-5.5 leads on agentic terminal workflows, knowledge work breadth, and hard math. Claude Opus 4.7 leads on precision coding (SWE-Bench Pro), multi-tool orchestration (MCP Atlas), and pure reasoning tasks. On the 10 benchmarks where both providers published scores, Claude leads 6 to GPT-5.5's 4.
How much does GPT-5.5 cost compared to Claude Opus 4.7?
Both list at $5 per million input tokens. GPT-5.5 charges $30/M output tokens; Claude Opus 4.7 charges $25/M output tokens. However, Claude applies a 2x surcharge on inputs above 200K tokens. GPT-5.5 batch processing gets a 50% discount. GPT-5.5 Pro is significantly more expensive at $30/$180 per million input/output tokens.
What is Terminal-Bench 2.0 and why does GPT-5.5's lead matter?
Terminal-Bench 2.0 tests a model's ability to navigate and complete tasks in a sandboxed terminal environment: planning, iterating, using tools, and recovering from errors across a complex command-line workflow. GPT-5.5 scored 82.7%, Claude Opus 4.7 scored 69.4%. A 13-point gap in agentic tool execution is the largest difference between the two standard-tier models in this comparison.
Which model is faster in practice?
Claude Opus 4.7 has a time-to-first-token of around 0.5 seconds versus GPT-5.5's roughly 3 seconds baseline. If your application is interactive and users are waiting on a first response, Claude is noticeably faster. For batch processing and sustained agentic runs where total throughput matters more than first-token latency, GPT-5.5 closes the gap. Both models match or improve on their predecessors' speed despite being larger.
What is GPT-5.5's inference self-optimization and why is it notable?
During training, GPT-5.5 analyzed weeks of production traffic data and wrote custom heuristic algorithms to optimize how GPU compute is partitioned across requests. This increased token generation speeds by over 20%. It is the first publicly documented case of an AI model contributing to the optimization of the infrastructure used to run it — which has implications for how future model deployment costs might evolve.
Should I switch from Claude Opus 4.7 to GPT-5.5 for coding work?
Depends on what kind of coding work. For multi-file refactoring, GitHub issue resolution, and architectural reasoning across large codebases, Claude Opus 4.7 still leads on SWE-Bench Pro. For executing long-running terminal workflows, automated deployment pipelines, and agentic tasks where the model needs to drive a shell and recover from errors, GPT-5.5 is the stronger choice. The best production architecture may use both: Claude for planning and design, GPT-5.5 for execution at scale.
The second half of April 2026 is shaping up as the most competitive stretch in AI model history. Two frontier models, seven days apart, each with genuine leads in different categories, each requiring honest assessment rather than a simple ranking. The honest answer, that it depends on what you're building, is not a hedge. It is the most useful thing to know.
For a deeper look at the architectural thinking behind these releases, this breakdown of Google's response to Claude's enterprise momentum gives useful context on why the agentic coding race has become the primary battleground for AI labs in 2026.
Benchmark data sourced from OpenAI official launch page (openai.com/index/introducing-gpt-5-5), Anthropic model documentation, LLM-Stats, DigitalApplied, and BenchLM independent verification. Published April 2026. Model specifications and pricing subject to change — always check provider documentation for current rates.