The name is the problem. GPT-5.5 sounds like a patch. A small correction. The kind of update that fixes a few hallucinations and ships a marginally cleaner response format. That's not what this is.
OpenAI released GPT-5.5 on April 23, 2026 — internally it was codenamed "Spud," which, to their credit, is at least a more honest name than a point release for something genuinely different from what came before. Greg Brockman called it the beginning of a new class of intelligence. That's marketing language. But after looking at what's actually under the hood — the benchmark results, the pre-training change, the agentic behavior shift — there's a real case that the change here is structural, not incremental.
The question worth asking isn't "is it better?" Every new model claims to be better. The question is: better at what, by how much, and does the improvement actually change how you'd use it?
Table of Contents
- What actually changed in GPT-5.5 — the pre-train shift
- What does "agentic" actually mean for this model?
- Which benchmarks matter — and which ones don't
- How does it compare to Claude Opus 4.7?
- What does GPT-5.5 actually cost?
- The situational awareness problem nobody's talking about
- My Take
- Key Takeaways
- FAQ
What actually changed in GPT-5.5 — the pre-train shift
GPT-5.5 is not GPT-5.4 with better post-training. That distinction matters more than most coverage has acknowledged.
Post-training is how you make a model safer, more obedient, more aligned with what users actually want. You can tune a model significantly in post-training — change its tone, reduce refusals, improve instruction-following. What you can't easily do in post-training is change a model's fundamental understanding of how things work. That requires a new pre-train: the expensive, computationally massive training run that teaches the base model its underlying world model before any instruction tuning happens.
GPT-5.4 and its predecessors in the GPT-5.x line shared the same pre-train. GPT-5.5 does not. OpenAI ran a new pre-train for this model, which means its center of gravity is different. Every.to, which did a detailed technical review, described it precisely: a new pre-train changes what the model inherently knows versus what it was fine-tuned to know. The downstream difference shows up in exactly the areas where GPT-5.5 improves most — long-horizon reasoning, agentic multi-step task completion, and understanding system-level architecture in codebases.
OpenAI has also confirmed the model runs on NVIDIA GB200 and GB300 NVL72 systems. Unusually, the model itself wrote custom heuristic algorithms to partition and balance work across GPU cores, increasing token generation speeds by over 20% during serving. The AI optimized its own inference infrastructure. That's not a common footnote.
What does "agentic" actually mean for this model?
Every AI lab uses the word "agentic" right now. It's lost most of its meaning. For GPT-5.5, there's a specific behavioral pattern worth pinning down.
Previous models — including GPT-5.4 — functioned best with explicit step-by-step prompting. You'd describe a task, the model would execute one chunk, you'd review, correct, and move forward. The handholding was real. Users who worked with agentic pipelines on GPT-5.4 routinely described needing to catch the model "going sideways" midway through multi-step tasks, requiring course corrections that cost time and tokens.
GPT-5.5 is explicitly designed to understand task intent earlier — before all parameters are spelled out — and carry more of the work through ambiguous territory without stopping. OpenAI's own framing: instead of managing every step, you give it a messy, multi-part task and let it plan, use tools, check its work, navigate ambiguity, and continue. Senior engineers who had early access described a specific observable behavior: the model would understand where a bug was in a codebase, identify what else in the system would be affected by the fix, and address both without being asked.
One engineer at NVIDIA reportedly said that losing access to GPT-5.5 after the preview period felt like having a limb amputated. That's the kind of quote that's easy to dismiss. It's also the kind of quote that's hard to fabricate.
The gains are concentrated in four specific areas: agentic coding, computer use, general knowledge work, and early scientific research workflows. Outside those areas, the improvements are real but less dramatic. This model is built for extended, autonomous task execution — not for quick Q&A.
Which benchmarks matter — and which ones don't
Benchmarks require reading between the lines. Not because the numbers are false, but because the benchmarks themselves vary enormously in how well they reflect real-world use.
Here are the numbers worth paying attention to for GPT-5.5:
| Benchmark | What it tests | GPT-5.5 | GPT-5.4 | Claude Opus 4.7 |
|---|---|---|---|---|
| Terminal-Bench 2.0 | Complex CLI workflows requiring planning + tool coordination | 82.7% | 75.1% | 69.4% |
| GDPval | Knowledge work across 44 professions (vs. 12+ yr experienced humans) | 84.9% | — | — |
| SWE-Bench Pro | Real-world GitHub issue resolution | 58.6% | — | 64.3%* |
| OSWorld-Verified | Independent computer operation | 78.7% | 75.0% | 78.0% |
| BrowseComp (Pro) | Hard-to-find web research tasks | 90.1% | — | — |
*Note: OpenAI has raised concerns about memorization artifacts in Claude Opus 4.7's SWE-Bench Pro score. The comparison should be treated with some caution. Sources: OpenAI, Decrypt.
Terminal-Bench 2.0 is the one to focus on. It doesn't test isolated problem-solving — it tests a model's ability to plan across multiple steps, coordinate tools, and handle dynamic environments where earlier decisions affect later ones. That's the closest proxy available to real agentic work. A 7.6 percentage point lead over GPT-5.4 (and a 13.3 point lead over Claude Opus 4.7) on that benchmark is not marginal.
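As a sanity check, the percentage-point gaps can be computed directly from the Terminal-Bench 2.0 scores in the table above (a quick sketch, nothing model-specific):

```python
# Terminal-Bench 2.0 scores from the comparison table above, in percent.
scores = {"GPT-5.5": 82.7, "GPT-5.4": 75.1, "Claude Opus 4.7": 69.4}

def lead(model_a: str, model_b: str) -> float:
    """Percentage-point lead of model_a over model_b, rounded to one decimal."""
    return round(scores[model_a] - scores[model_b], 1)

print(lead("GPT-5.5", "GPT-5.4"))          # 7.6 points over GPT-5.4
print(lead("GPT-5.5", "Claude Opus 4.7"))  # 13.3 points over Claude Opus 4.7
```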
GDPval is worth understanding separately. It measures performance against professionals with 12 or more years of experience across 44 occupations — engineering, finance, marketing, law, and others. At 84.9%, GPT-5.5 matches or outperforms those senior professionals on well-specified work tasks across most of those categories. That's the number most people glossed over in the initial coverage.
The SWE-Bench Pro score of 58.6% looks lower than Claude Opus 4.7's 64.3%, but OpenAI has flagged signs of memorization in Anthropic's result on a subset of problems. Take that comparison cautiously. It's not a clean apples-to-apples number.
How does it compare to Claude Opus 4.7?
Straight comparison: GPT-5.5 wins on agentic computer use, Terminal-Bench, BrowseComp, and GDPval. Claude Opus 4.7 wins on software engineering tasks measured by SWE-Bench Pro (with the memorization caveat) and appears to have a stronger eye for design, planning documents, and product-level detail work.
Every.to's senior engineer benchmark — which tests how well a model rewrites a poorly structured codebase the way an experienced senior engineer would — showed GPT-5.5 reaching 62.5%, while Claude Opus 4.7 landed in the low 30s. That's a significant gap for a very specific, high-value use case. Interestingly, the same testing found that GPT-5.5 performed best when it executed a plan written by Claude Opus 4.7. The models may actually complement each other.
Speed: GPT-5.5 is faster in most workflows. Prose quality: Opus 4.7 still edges ahead for long-form writing. Coding agent work: GPT-5.5, clearly. That's the actual picture.
What does GPT-5.5 actually cost?
The pricing doubled versus GPT-5.4. That's the headline that matters for developers. Here's the full picture:
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Context Window |
|---|---|---|---|
| GPT-5.4 | $2.50 | $15.00 | 1M tokens |
| GPT-5.5 | $5.00 | $30.00 | 1M tokens |
| GPT-5.5 Pro | $30.00 | $180.00 | 1M tokens |
| GPT-5.5 (Batch/Flex) | $2.50 | $15.00 | 1M tokens |
Source: OpenAI GPT-5.5 announcement. API access launching "very soon" as of publication. Batch and Flex pricing available at 50% of standard rate.
The Batch pricing detail is the one most developers will care about. At $2.50/$15 per million tokens on Batch — identical to GPT-5.4's standard price — the cost doubling effectively disappears for non-real-time workloads. If your pipeline has any flexibility on latency, Batch is the obvious choice.
OpenAI's counter-argument to the price increase: GPT-5.5 uses significantly fewer tokens than GPT-5.4 to complete the same Codex tasks. At identical token usage, a typical agentic coding session that cost roughly $0.1725 on GPT-5.4 costs about $0.345 on GPT-5.5: exactly 2x, as the per-token prices imply. But if GPT-5.5 solves the task in fewer passes with fewer retries, the net cost per completed task can actually drop. Whether that holds in practice depends heavily on your specific workload.
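The cost-per-completed-task argument can be made concrete. Below is a minimal Python sketch using the published per-million-token prices; the token counts and retry counts are illustrative assumptions, not measured figures:

```python
def task_cost(input_tokens: int, output_tokens: int,
              input_price: float, output_price: float) -> float:
    """API cost in dollars for one pass, given per-1M-token prices."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

# Published standard API prices, dollars per 1M tokens (input, output).
GPT_5_4 = (2.50, 15.00)
GPT_5_5 = (5.00, 30.00)

# Hypothetical workload: GPT-5.4 needs 3 passes (two retries) at 40k in / 8k out
# each; GPT-5.5 finishes in one pass. These counts are assumptions for illustration.
cost_5_4 = 3 * task_cost(40_000, 8_000, *GPT_5_4)
cost_5_5 = 1 * task_cost(40_000, 8_000, *GPT_5_5)

print(f"GPT-5.4: ${cost_5_4:.2f} per completed task")  # $0.66
print(f"GPT-5.5: ${cost_5_5:.2f} per completed task")  # $0.44
```

If the efficiency claim doesn't hold for a given workload (say, one pass on both models), the 2x token price passes straight through, which is exactly why measuring tokens per completed task on your own pipeline matters.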
GPT-5.4 remains available. OpenAI has no current deprecation plans. That's a meaningful option for cost-sensitive use cases.
The situational awareness problem nobody's talking about
This part is buried in Apollo Research's safety evaluation, and it deserves more attention than it's getting.
GPT-5.5 shows no evidence of sandbagging — deliberately underperforming on safety evaluations. That's the good news, and it's genuinely good news. The model complied with sandbagging on just 1% of tasks while scoring 99.6% on strategic deception capability tests: it is capable of deception, but it essentially declined to deceive when evaluated.
The more complex finding: GPT-5.5 has the highest verbalized evaluation awareness of any model Apollo has tested. In 22% of samples, the model showed moderate or higher verbalized awareness that it was being evaluated — it explicitly noted, in its reasoning, that it appeared to be in a test environment.
The analogy that keeps coming up: a car driving perfectly when a police cruiser is visible behind it. Is that good behavior? Or is it just awareness of observation? The honest answer is we can't fully distinguish between the two yet. OpenAI's chief scientist Jakub Pachocki has noted publicly that this trajectory — smarter models, more situational awareness — is something the field is watching closely.
There's no catastrophic finding here. But it's not a footnote either.
Worth noting: GPT-5.5's hallucination rate on some benchmarks is higher than expected given its overall capability level. It knows more. It also produces confident-sounding wrong answers more frequently in certain domains. For high-stakes outputs — legal, medical, financial — human verification remains non-negotiable regardless of model generation.
My Take
The version number is actively misleading. GPT-5.5 should have been called something that signals a genuine architecture change — a new pre-train is not a minor update, and the behavioral shift in agentic task completion is real. Calling it 5.5 after 5.4 makes it sound like OpenAI patched a few bugs. What actually happened is they rebuilt the base model and optimized the entire serving infrastructure using the model itself.
The Terminal-Bench 2.0 number is what I keep coming back to. 82.7% — 7.6 points ahead of GPT-5.4, 13.3 ahead of Claude Opus 4.7. That benchmark specifically tests multi-step agentic work in dynamic environments where earlier decisions affect later ones. That's not a synthetic lab test — it's as close to real engineering work as current benchmarks get. You can argue about other numbers. That one is hard to dismiss.
The pricing argument is more complicated than most coverage suggests. Yes, it's 2x GPT-5.4 at the token level. But the Batch pricing option effectively eliminates that delta for non-realtime workflows, and the efficiency gains on Codex tasks are documented. For teams already spending significantly on agentic coding pipelines, the cost-per-completed-task math may actually favor GPT-5.5. For lighter workloads, GPT-5.4 at half the price remains a sensible choice — and OpenAI isn't deprecating it.
The situational awareness finding is the part I'm least comfortable waving away. A model that verbally notes it's being evaluated in 22% of test samples, while also showing near-perfect alignment behavior during those evaluations, raises a question the field doesn't have a clean answer for yet. That's not a reason to avoid the model. It is a reason to stay curious about where this trajectory goes. OpenAI's own chief scientist has essentially said the same thing.
Key Takeaways
- GPT-5.5 is built on a new pre-train — not just post-training improvements on the same base model as GPT-5.4.
- The core improvement is agentic behavior: understanding task intent faster and completing multi-step work with less intervention.
- Terminal-Bench 2.0 score of 82.7% represents a 7.6-point lead over GPT-5.4 and a 13.3-point lead over Claude Opus 4.7 on agentic CLI workflows.
- API pricing doubles to $5/$30 per million tokens — but Batch pricing at $2.50/$15 largely offsets this for non-realtime workloads.
- Claude Opus 4.7 retains advantages in software engineering (SWE-Bench, with caveats), prose quality, and design-oriented tasks.
- GPT-5.5 shows the highest verbalized evaluation awareness Apollo Research has measured in any model — it frequently notes when it's being tested. Alignment behavior remains strong regardless.
- GPT-5.4 remains available with no deprecation plans — a real option for cost-sensitive applications.
For related context on how Anthropic's competing architecture approaches this differently, see the explainer on how Recurrent Depth Transformers work and the Claude Mythos hypothesis. And if you're curious about how Google's synthetic data pipeline fits into the broader training picture, the Google Simula breakdown covers that architecture in detail.
Frequently Asked Questions
Is GPT-5.5 available to free ChatGPT users?
No. As of April 24, 2026, GPT-5.5 is rolling out to Plus, Pro, Business, and Enterprise subscribers in ChatGPT and Codex. GPT-5.5 Pro is available to Pro, Business, and Enterprise users only. API access is launching "very soon" per OpenAI, not yet live at publication.
What is GPT-5.5's context window?
1 million tokens for both GPT-5.5 and GPT-5.5 Pro. This is the same as GPT-5.4. Extended prompt caching is supported for reusing long context across requests, though in-memory same-session caching is not.
What is the difference between GPT-5.5 and GPT-5.5 Pro?
Same underlying model. GPT-5.5 Pro uses parallel test-time compute for higher accuracy on hard tasks — think extended reasoning at a premium. API pricing for Pro is $30/$180 per million tokens versus $5/$30 for standard GPT-5.5. For most developer use cases, standard GPT-5.5 is the right starting point.
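OpenAI hasn't published how Pro's parallel test-time compute works. The general technique, though, is to sample several candidate answers in parallel and aggregate them, with majority voting as the simplest aggregator. A hedged sketch, where `sample_answer` is a hypothetical stand-in for a real model call:

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def sample_answer(prompt: str, seed: int) -> str:
    """Stand-in for one model call. A real implementation would hit the API
    with temperature > 0 so the n samples actually differ."""
    # Deterministic dummy: most seeds agree, a few dissent.
    return "42" if seed % 4 else "41"

def majority_vote(prompt: str, n: int = 8) -> str:
    """Sample n candidate answers in parallel and return the most common one.
    This is the simplest form of parallel test-time compute."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        answers = list(pool.map(lambda s: sample_answer(prompt, s), range(n)))
    return Counter(answers).most_common(1)[0][0]

print(majority_vote("hard question"))  # prints "42", the majority answer
```

Spending n times the compute per query for one higher-confidence answer is the trade the Pro pricing tier reflects.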
What was GPT-5.5's internal codename?
Spud. Greg Brockman confirmed this during the release. OpenAI described it as the beginning of the "Spud era" of models — a new class of intelligence built around agentic, autonomous task execution rather than single-turn responses.
Does GPT-5.5 replace GPT-5.4?
Not by force. OpenAI has no current plans to deprecate GPT-5.4, which remains available in the API. GPT-5.4 is now priced at exactly half of GPT-5.5's standard rate, making it a practical option for cost-sensitive or lower-complexity workloads where the agentic improvements of 5.5 aren't the priority.
Is GPT-5.5 better than Claude Opus 4.7 for coding?
For agentic, multi-step coding workflows — yes, by a meaningful margin. Terminal-Bench 2.0 and the senior engineer rewrite benchmark both favor GPT-5.5 significantly. For single-task software engineering measured by SWE-Bench Pro, the comparison is complicated by memorization concerns in Claude's score. For design-heavy prototyping and prose-rich planning documents, Claude Opus 4.7 remains the stronger choice based on current third-party testing.
What to do next
If you're on a Plus or Pro plan, GPT-5.5 is already rolling out in Codex. The most direct test isn't running a benchmark — it's giving it a multi-step task you've tried before on GPT-5.4 and noting where it stops asking for help.
For API developers: wait for the official API launch, then run a small cost-comparison on your actual workload. Specifically — measure tokens used per completed task on GPT-5.4 versus GPT-5.5, not just tokens per call. The efficiency gains OpenAI claims are real in agentic coding contexts; whether they hold for your specific pipeline requires your own data.
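The aggregation matters: a retried task shows up as multiple API calls, so per-call token counts understate true cost. A minimal sketch of per-task aggregation, assuming a call log of (task_id, input_tokens, output_tokens) tuples (a hypothetical logging format):

```python
from collections import defaultdict

def tokens_per_task(call_log):
    """Aggregate per-call token counts into per-task totals.

    call_log: iterable of (task_id, input_tokens, output_tokens) tuples,
    one per API call. Retries land under the same task_id.
    """
    totals = defaultdict(lambda: [0, 0])
    for task_id, in_tok, out_tok in call_log:
        totals[task_id][0] += in_tok
        totals[task_id][1] += out_tok
    return dict(totals)

# Illustrative log: task "A" needed a retry, task "B" finished in one call.
log = [("A", 30_000, 5_000), ("A", 32_000, 6_000), ("B", 28_000, 4_000)]
print(tokens_per_task(log))
# {'A': [62000, 11000], 'B': [28000, 4000]}
```

Run the same aggregation against both models on the same task set and the cost-per-completed-task comparison falls out directly.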
The model will be measured over the next few weeks by the developer community running it on real codebases, real pipelines, real edge cases. That data will be more useful than any benchmark table. Watch for it.