Stats Strip: DeepSeek V4 launched April 24, 2026 | V4 Pro output: $3.48/M tokens | GPT-5.5 output: $30/M tokens | V4 Flash output: $0.28/M tokens | Context window: 1 million tokens (both models)
$180 per million output tokens. That is what GPT-5.5 Pro charges. DeepSeek V4 Pro charges $3.48. Not for a cheaper, weaker model. For a model that scores within 0.2 points of Claude Opus 4.6 on the SWE-bench Verified coding benchmark, and beats it on LiveCodeBench and Codeforces.
The gap is not subtle. It is not a rounding error. It is 51x on output pricing, for performance that is close enough to matter in most production workflows.
Whether DeepSeek V4 Pro is "better" than GPT-5.5 is the wrong question. The right question is: at these prices, does it need to be? This article breaks down exactly what V4 and V4 Flash cost, how the cache pricing math works in real developer workflows, where the benchmarks actually hold up, and where they fall short. No hype in either direction — just numbers.
1. What DeepSeek Actually Released
On April 24, 2026, DeepSeek released preview versions of two models: DeepSeek V4 Pro and DeepSeek V4 Flash. Both are open-weight under the MIT license. Both support a 1 million token context window. Both are available via DeepSeek's API and on Hugging Face.
The two models are not the same product at different price points. They are different bets on different use cases.
V4 Pro is a Mixture-of-Experts model with 1.6 trillion total parameters and 49 billion active parameters per inference pass. The full model is enormous. What runs during any given request is not. DeepSeek calls it the best open-source model available today, and the benchmark data largely supports that claim — at least within the open-weight category.
V4 Flash is the leaner version: 284 billion total parameters, 13 billion active. Smaller active footprint than many mid-range models — but with access to far more specialized expert knowledge than those models carry. On SWE-bench Verified, Flash scores 79.0% against Pro's 80.6%. For most developer coding tasks, that 1.6-point gap is not meaningful.
Both models run in Thinking and Non-Thinking modes. Both support up to 384,000 output tokens. The old deepseek-chat and deepseek-reasoner endpoints now route to V4 Flash under the hood, and will be fully retired on July 24, 2026.
2. The Pricing Math: Pro, Flash, and Cache
Here are the official DeepSeek API rates, as of April 2026:
| Model | Input (cache miss) | Input (cache hit) | Output |
|---|---|---|---|
| V4 Flash | $0.14 / M | $0.028 / M | $0.28 / M |
| V4 Pro | $1.74 / M | $0.145 / M | $3.48 / M |
The cache pricing is where the real savings sit. DeepSeek caches prompt prefixes automatically. Any prefix of 1,024 tokens or more that repeats across requests within the same account gets billed at the cache-hit rate, which is roughly 80% off Flash input and 92% off Pro input.
Run the numbers on a real agent setup. Say you have a 20,000-token system prompt that never changes, and you run 100 different user requests of 500 tokens each:
- First call: 20,500 tokens at cache-miss rate — $0.036 for Pro
- Calls 2-100: 20,000 tokens at cache-hit + 500 at cache-miss, roughly $0.004 per call for Pro
- Total for 100 calls: around $0.41 on V4 Pro
- Same 100 calls on GPT-5.5 Standard at its $5/M input rate: $10.25 on input alone without caching, and still several times the V4 Pro figure after any cached-input discount
That difference compounds fast at production scale. Legal review pipelines, internal document agents, and RAG applications that reuse system prompts get close to the cache-hit rate on almost every request. In those workflows the gap between DeepSeek and Western list pricing is not the headline 7x; on input it stretches well past 30x.
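If you want to sanity-check that arithmetic yourself, here it is as a short Python sketch. The token counts are the hypothetical workload above; the rates are DeepSeek's published V4 Pro prices from the table:

```python
# Cache-hit vs cache-miss cost for the worked example above:
# a 20k-token system prompt reused across 100 calls of 500 new tokens each.
MISS = 1.74 / 1_000_000   # $/token, V4 Pro cache-miss input
HIT = 0.145 / 1_000_000   # $/token, V4 Pro cache-hit input

SYSTEM, USER, CALLS = 20_000, 500, 100

# Call 1: nothing is cached yet, so the entire prompt bills at the miss rate.
first = (SYSTEM + USER) * MISS
# Calls 2-100: the shared prefix hits the cache; only the 500 new tokens miss.
later = SYSTEM * HIT + USER * MISS

total = first + (CALLS - 1) * later
print(f"first call:  ${first:.4f}")   # ~$0.0357
print(f"later calls: ${later:.4f}")   # ~$0.0038 each
print(f"100 calls:   ${total:.2f}")   # ~$0.41
```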
One thing worth noting: DeepSeek is currently offering a limited-time 75% discount on V4 Pro, valid until May 5, 2026. Discounted pricing during a model launch is normal. The listed rates are what you plan around.
3. V4 vs GPT-5.5 vs Claude Opus 4.7: Side-by-Side Cost
Here is the comparison that has been circulating across developer communities since the V4 launch. These are standard cache-miss rates at the time of writing:
| Model | Input / M | Output / M | License |
|---|---|---|---|
| DeepSeek V4 Flash | $0.14 | $0.28 | MIT (open weights) |
| DeepSeek V4 Pro | $1.74 | $3.48 | MIT (open weights) |
| Claude Opus 4.7 | $5.00 | $25.00 | Closed |
| GPT-5.5 Standard | $5.00 | $30.00 | Closed |
| GPT-5.5 Pro | $30.00 | $180.00 | Closed |
On output pricing, which is usually where inference costs land the hardest, V4 Pro is 7.2x cheaper than Claude Opus 4.7 and 8.6x cheaper than GPT-5.5 Standard. Against GPT-5.5 Pro, it is 51x cheaper.
V4 Flash at $0.28 output is in a different category entirely. There is no Western frontier model at that price point that competes on capability. GPT-5.4 Nano exists but is not a comparable product for serious coding or agent workloads. According to Simon Willison's analysis, V4 Flash is now the cheapest small model in this performance tier — cheaper than even GPT-5.4 Nano.
The MIT license adds another dimension. You are not only renting API access. The weights are on Hugging Face. V4 Pro is 865GB. V4 Flash is 160GB. Companies with data sovereignty requirements, regulated industries, or on-premise constraints can self-host without paying DeepSeek anything after download.
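For teams going the self-host route, pulling the weights is a one-liner with huggingface_hub. A minimal sketch, assuming a repo ID of this shape (DeepSeek's actual Hugging Face repo names may differ):

```python
# Minimal weight-download sketch. The repo ID is an assumption -- check
# DeepSeek's Hugging Face organization for the real V4 repo names.
from huggingface_hub import snapshot_download

path = snapshot_download(
    repo_id="deepseek-ai/DeepSeek-V4-Flash",  # hypothetical repo ID
    local_dir="./deepseek-v4-flash",
)
print(f"weights at {path}")  # expect roughly 160GB on disk for Flash
```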
4. What the Benchmarks Actually Say
DeepSeek did something unusual in their technical report. They published their own gaps. Most model launches present five graphs where they win, then stop. DeepSeek ran comparisons against GPT-5.4 and Gemini 3.1 Pro, found that V4 Pro trails the frontier by three to six months on general reasoning, and printed it anyway. That is worth noting.
Here is where V4 Pro actually performs well:
- SWE-bench Verified (coding): 80.6% — within 0.2 points of Claude Opus 4.6's 80.8%, and comparable to where GPT-5.5 standard lands
- LiveCodeBench: 93.5%, ahead of Claude Opus 4.6 at 88.8%
- Codeforces rating: 3,206, a rating that places V4 Pro above all but a small fraction of rated human contest participants
- Apex Shortlist (STEM and math): 90.2%, ahead of Claude Opus 4.6 at 85.9% and GPT-5.4 at 78.1%
- GDPval-AA (economic knowledge work): Ranked first among open-weight models by Artificial Analysis, scoring 1,554 Elo — though Claude Opus 4.6 still leads at 1,619
Vals AI called V4 the top open-source model on their Vibe Code benchmark, and Arena.ai ranked V4 Pro in thinking mode 14th overall in their code arena and third among open-source models. For agentic coding specifically, DeepSeek's internal survey of 85 developers found that more than 90% said V4 Pro was ready to be their primary coding agent.
The pattern is consistent. On code, math, and STEM tasks, V4 Pro competes at or near the top. On general knowledge work and writing quality, there is still a real gap to the best closed models.
5. Where V4 Pro Still Falls Short
The gaps are real. They matter for specific workloads.
On Humanity's Last Exam, a benchmark testing expert-level cross-domain reasoning, Gemini 3.1 Pro scores 44.4% against V4 Pro's 37.7%. Claude Opus 4.6 sits at 40.0%. That is not a minor rounding difference. For research workflows requiring deep expert-level synthesis across multiple domains, the closed frontier models still have a measurable edge.
MMLU Pro shows a similar story: Gemini 3.1 Pro at 91.0% against V4 Pro at 87.5%. GPQA Diamond — testing graduate-level scientific reasoning — has Gemini at 94.3% versus V4 Pro at 90.1%. On SimpleQA-Verified, which tests factual knowledge retrieval, V4 Pro scores 57.9% against Gemini's 75.6%. That gap matters for anything requiring accurate real-world fact recall.
On Terminal-Bench 2.0 — complex command-line agent workflows — GPT-5.5 scores 82.7% while V4 Pro lands at 70.0% (67.9% in some reports). That is the area where GPT-5.5's agentic optimization shows most clearly.
V4 is also text-only right now. No image, audio, or video input. OpenAI, Google, and Xiaomi all ship multimodal models. DeepSeek has indicated multimodal capabilities are in development, but nothing is available yet.
There is also the reliability question. DeepSeek's hosted API routes through infrastructure in China. During peak demand, 503 errors and latency spikes are a documented issue. For mission-critical production systems, routing through OpenRouter, Fireworks, or Together AI adds a reliability layer worth the modest premium.
6. Who Should Use V4 Pro vs V4 Flash
This is not a one-size answer. The two models target genuinely different use cases.
V4 Flash is the right default for most high-volume workloads: chat interfaces, summarization pipelines, RAG applications, code autocomplete, document classification, and routing layers that escalate harder tasks upstream. At $0.28 per million output tokens, it is cheap enough that token cost stops being the primary engineering conversation. Flash scores 79.0% on SWE-bench Verified. That is not a downgraded product; it sits within a point or two of the closed frontier models of 2025.
V4 Pro makes sense when you have measured a quality gap that matters: deep codebase analysis, complex multi-step agent workflows, architecture planning, and inference over very long documents where the 1M context window is genuinely in use. At $3.48 output, it is still roughly a seventh of Claude Opus 4.7's price for workloads where the performance difference between Flash and Pro actually moves results.
A practical routing pattern that has gained traction among developers: send 60-70% of traffic to V4 Flash as the default tier, escalate complex coding tasks to V4 Pro or Claude Opus 4.7, and keep GPT-5.5 for agentic desktop automation tasks where Terminal-Bench performance matters. According to analysis from Lushbinary's April 2026 routing guide, this kind of multi-model strategy can reduce total API costs by 40-60% without meaningful quality loss.
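Here is a toy sketch of the DeepSeek half of that router, assuming an OpenAI-compatible endpoint. The model IDs, the per-request thinking flag, and the complexity heuristic are all placeholders rather than confirmed API surface; the closed-model escalation tiers would hang off the same heuristic:

```python
# Toy routing sketch. Model IDs and the "thinking" flag are assumptions --
# check DeepSeek's API docs for the real identifiers -- and the complexity
# heuristic is a placeholder for your own tiering logic.
from openai import OpenAI

deepseek = OpenAI(api_key="...", base_url="https://api.deepseek.com")

def complexity(prompt: str) -> int:
    """Placeholder heuristic: 0 = easy, 1 = medium, 2 = hard."""
    return min(len(prompt) // 8_000, 2)

def route(prompt: str) -> str:
    tier = complexity(prompt)
    model = "deepseek-v4-pro" if tier == 2 else "deepseek-v4-flash"  # hypothetical IDs
    resp = deepseek.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        # Hypothetical per-request flag: enable Flash's Thinking mode on
        # medium-difficulty tasks instead of escalating straight to Pro.
        extra_body={"thinking": tier == 1},
    )
    return resp.choices[0].message.content
```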
One more thing on Flash: it supports Thinking mode too. Thinking is off by default for speed, but you can enable it per request with a single parameter, as in the routing sketch above. Same model, same price, stronger reasoning on harder tasks when you need it.
7. The Architecture Behind the Low Cost
The pricing is not just aggressive marketing. It is a consequence of architectural choices that reduce inference cost without gutting capability.
The Mixture-of-Experts setup is the core of it. V4 Pro has 1.6 trillion total parameters, but only 49 billion activate per inference pass. You are not running 1.6T parameters on every token. You are routing to the relevant expert sub-networks. V4 Flash runs 13 billion active parameters. Both models process context efficiently because only the necessary parts wake up for any given request.
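To make the "only the necessary parts wake up" point concrete, here is a toy top-k gating layer. This illustrates the general MoE mechanism, not V4's actual router, and the sizes are tiny stand-ins:

```python
import numpy as np

# Toy MoE routing: only k of E experts run per token, so active compute
# scales with k, not with the total parameter count.
E, k, d = 8, 2, 16                      # experts, experts per token, hidden dim
rng = np.random.default_rng(0)
experts = [rng.standard_normal((d, d)) for _ in range(E)]
gate = rng.standard_normal((d, E))      # router weights

def moe_forward(x: np.ndarray) -> np.ndarray:
    scores = x @ gate                   # one router score per expert
    top = np.argsort(scores)[-k:]       # indices of the k best experts
    w = np.exp(scores[top])
    w = w / w.sum()                     # softmax over the winners only
    # The other E-k experts contribute zero compute for this token.
    return sum(wi * (x @ experts[i]) for wi, i in zip(w, top))

y = moe_forward(rng.standard_normal(d))
```

The thing to notice is that per-token compute scales with k, not with E, which is why a 1.6T-parameter model can bill like a far smaller one.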
For long context specifically, DeepSeek introduced a hybrid attention mechanism combining Compressed Sparse Attention and Heavily Compressed Attention. In the 1M-token context setting, V4 Pro requires only 27% of single-token inference compute and 10% of KV cache compared to V3.2. That is a significant engineering improvement. A 1M context window is only economically viable if processing it does not cost 10x what a shorter context does.
DeepSeek also replaced the AdamW optimizer with the Muon optimizer for large-scale MoE training, and added Manifold-Constrained Hyper-Connections to stabilize signal propagation across layers. These are not headline-grabbing innovations, but they explain why the model trains efficiently enough that DeepSeek can undercut Western pricing and still operate a viable API business.
The hardware angle matters too. V4 is optimized for Huawei Ascend NPU platforms in addition to Nvidia infrastructure, with acceleration ratios between 1.50x and 1.73x on general inference workloads. DeepSeek has stated that V4 Pro throughput is currently limited by compute constraints, but that prices could drop further once Huawei Ascend 950 super nodes begin shipping at scale in the second half of 2026. Already cheap, and potentially getting cheaper.
My Take
The benchmark conversation around V4 is a distraction from the thing that actually matters: the pricing forces a different conversation in enterprise procurement. When you can hand a CTO a side-by-side where V4 Pro outputs 8x more tokens per dollar than GPT-5.5, the burden of proof flips. Now you have to justify paying the premium, not justify considering the alternative. That shift is real and it happened this week.
What I find more interesting than the headline price gap is the cache math. Most people are comparing V4 Pro to GPT-5.5 at standard rates. But the typical enterprise agent workflow reuses system prompts, tool schemas, and RAG context constantly. At cache-hit rates, V4 Pro input drops to $0.145 per million. That is not a promotional rate — that is the operational cost of any well-built agent pipeline. The effective savings in production are larger than the headline numbers suggest.
On benchmarks: DeepSeek's decision to publish their own gaps is the signal worth paying attention to. A lab that prints its weaknesses has more credibility than one that doesn't. V4 is genuinely weaker on cross-domain expert reasoning, factual retrieval, and complex terminal-based agent tasks. Those gaps are documented in DeepSeek's own material. That's not a red flag — it is scope clarity. Know what you are buying.
The open-weight angle is underrated. This is not just about API pricing. The weights are on Hugging Face under MIT license. Hospitals, law firms, defense contractors — any regulated industry that cannot send prompts to external APIs — now has a genuinely frontier-adjacent model they can self-host. That market was essentially locked out of the 2025 AI cycle. V4 changes that.
Key Takeaways
- V4 Pro outputs at $3.48/M vs $30/M for GPT-5.5 Standard, an 8.6x difference on the line item that matters most
- Cache-hit rates compress V4 Pro input to $0.145/M — the real cost for agent pipelines that reuse system prompts
- V4 Flash at $0.28 output is the cheapest high-capability small model currently available, scoring 79.0% on SWE-bench
- Closed frontier models (GPT-5.5, Gemini 3.1 Pro) still lead on expert cross-domain reasoning, factual recall, and terminal agent tasks
- MIT license means self-hosting is a real option — V4 Flash is 160GB, V4 Pro is 865GB on Hugging Face
- Pricing may drop further in H2 2026 when Huawei Ascend 950 super nodes ship at scale
FAQ
Is DeepSeek V4 Pro actually better than GPT-5.5?
Depends on the task. On SWE-bench Verified (coding), V4 Pro scores 80.6% — essentially the same as GPT-5.5. On LiveCodeBench and Codeforces, it leads. On Terminal-Bench 2.0 (complex command-line agent tasks) and cross-domain expert reasoning, GPT-5.5 leads clearly. Neither model wins everywhere.
What is the DeepSeek V4 context window limit?
Both V4 Pro and V4 Flash support 1 million input tokens and up to 384,000 output tokens via DeepSeek's API. This is the same for both Thinking and Non-Thinking modes. For Think Max mode, DeepSeek recommends setting the context window to at least 384K tokens.
Can I self-host DeepSeek V4?
Yes. Both models are available on Hugging Face under the MIT license — V4 Flash at 160GB, V4 Pro at 865GB. You can download, fine-tune, and deploy on your own infrastructure without paying DeepSeek. The hardware requirement for Pro is significant. Flash is more practical for most self-hosting setups; a quantized version of Flash may run on a 128GB M5 MacBook Pro.
Does DeepSeek V4 support the Anthropic and OpenAI API formats?
Yes. DeepSeek's API supports both OpenAI ChatCompletions format and Anthropic Messages format. If your stack already uses the OpenAI SDK, you change the base URL to https://api.deepseek.com and update the model ID. No new SDK required.
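Concretely, a minimal sketch of the swap using the OpenAI SDK (the model ID here is an assumption; take the exact identifier from DeepSeek's docs):

```python
# The only changes from a stock OpenAI setup: the base URL and model ID.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_DEEPSEEK_KEY",
    base_url="https://api.deepseek.com",
)
resp = client.chat.completions.create(
    model="deepseek-v4-flash",  # hypothetical model ID
    messages=[{"role": "user", "content": "Summarize this diff: ..."}],
)
print(resp.choices[0].message.content)
```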
When is DeepSeek retiring the old deepseek-chat endpoint?
The deepseek-chat and deepseek-reasoner endpoints are deprecated and will be fully retired on July 24, 2026. They currently route to V4 Flash non-thinking and thinking modes respectively, so API users are already getting V4 Flash quality without changing anything — migration is just updating the model ID before the deadline.
Is there a free tier for DeepSeek V4?
There is no unlimited free API tier. New accounts occasionally receive trial credits. The chat interface at chat.deepseek.com remains free for general use, with both Expert Mode (V4 Pro) and Instant Mode (V4 Flash) accessible. For API access at scale, it is pay-as-you-go from the first token.
The most useful first step right now is running V4 Flash on your actual workload. Not a synthetic benchmark. Your prompts, your outputs, your quality bar. Trial credits on a new account are usually enough for that. For a baseline, run the same 20 tasks you use to evaluate any model change. If Flash passes that bar, the pricing math from there is simple.
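If it helps, here is a skeleton for that kind of spot check, reusing the OpenAI-compatible setup from the FAQ above. The model ID is a placeholder and `check` is whatever assertion fits your workload:

```python
# Bare-bones eval loop: run your own tasks through V4 Flash and count
# how many clear your quality bar. A skeleton, not a benchmark harness.
from openai import OpenAI

client = OpenAI(api_key="...", base_url="https://api.deepseek.com")

tasks = [
    # (prompt, check) pairs -- fill in from your real 20-task eval set
    ("Write a SQL query that ...", lambda out: "SELECT" in out.upper()),
]

passed = 0
for prompt, check in tasks:
    out = client.chat.completions.create(
        model="deepseek-v4-flash",  # hypothetical model ID
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content
    passed += check(out)

print(f"{passed}/{len(tasks)} tasks passed")
```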
For teams already weighing the closed models against each other, the full breakdown of GPT-5.5 vs Claude Opus 4.7 pricing and benchmarks on this site covers where each one leads and where the premium is actually justified. Worth reading before finalizing any routing strategy that involves all three model families. And if you want to understand the agentic capabilities GPT-5.5 was specifically built around, the GPT-5.5 agentic shift explainer covers that architecture in detail.