72 to 75 percent. That is where Grok 4 sits on SWE-bench Verified, the benchmark developers actually use to measure coding capability. Claude Opus 4.6 is at 80.8 percent. GPT-5.5 is at 88.7 percent. The gap is not subtle. And it explains almost everything that happened in the last week of May.
Three companies made moves. Each move looked different on the surface. Together they describe the same underlying reality: the models pulling ahead are the ones trained on what developers actually do, not just what code looks like in a repository.
The Gap That Explains Everything
Grok has 6 percent enterprise adoption as of March 2026. OpenAI sits at 55 percent. Anthropic has jumped from 20 percent a year ago to 47 percent. Those numbers did not happen because of marketing. They happened because developers tried the tools on real work and made a call.
The SWE-bench gap is the technical version of the same story. A deficit of roughly 6 to 17 points on a benchmark that simulates real software engineering tasks is not the kind of gap you close by training on more GitHub code. GitHub code is already in every model. The models at the top of that leaderboard learned something different: how developers think while they work, not just what finished code looks like.
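The size of that deficit follows directly from the scores quoted at the top of this article. A minimal sketch of the arithmetic, using only those figures (the variable and function names are illustrative):

```python
# Scores quoted in this article, in percent on SWE-bench Verified.
grok_4_range = (72.0, 75.0)
rivals = {"Claude Opus 4.6": 80.8, "GPT-5.5": 88.7}


def gap_range(low, high, rival_score):
    """Return (smallest, largest) deficit in percentage points."""
    return (round(rival_score - high, 1), round(rival_score - low, 1))


gaps = {name: gap_range(*grok_4_range, score) for name, score in rivals.items()}
smallest = min(lo for lo, _ in gaps.values())
largest = max(hi for _, hi in gaps.values())

for name, (lo, hi) in gaps.items():
    print(f"{name}: Grok 4 trails by {lo} to {hi} points")
print(f"Overall deficit: {smallest} to {largest} points")
```

The low end compares Grok 4's best figure against Claude Opus 4.6, and the high end compares its worst figure against GPT-5.5.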
On May 24, Elon Musk announced that Grok V9, internally called V9-Medium, had completed training at 1.5 trillion parameters — three times the size of the current V8 model running in production. The scale jump is real. But the detail that mattered more was a single phrase in the announcement: the model was trained on a large amount of Cursor data, with more still coming.
Cursor is used by over 67 percent of Fortune 500 companies. It is projected to cross 6 billion dollars in annualized revenue by end of 2026. Jensen Huang has called it his favorite enterprise-level AI service. SpaceX made a 60 billion dollar acquisition move on it in April, with a 10 billion dollar cooperation fee payable even if the deal does not close. That is not a company buying a product. That is a company buying the data.
What Cursor Data Actually Contains
Someone on the internet did the obvious thing and asked Grok directly what the Cursor training data contained. The model answered: high-quality real programming interactions, including developer prompts, code context, editing operations, and task completion records.
That description is worth pausing on. It is not code. It is the process of writing code. The wrong turns, the rollbacks, the debugging sessions, the multi-file edits that span an hour of work. Public repositories show you what survived. Cursor data shows you how a developer got there.
Current language models already produce code that looks correct. The harder problem — navigating a real codebase, understanding what a developer is trying to do three steps ahead, catching errors before they compound — requires a different kind of training signal. One that exists in Cursor's logs and almost nowhere else at scale.
Grok Build, the coding agent xAI launched on May 14, supports up to 8 sub-agents running in parallel, handles file editing, dependency management, and shell command execution, and is natively compatible with the configuration format Claude Code uses. That last detail is telling. You do not build compatibility with a competitor's ecosystem unless your users are already switching between the two.
Qwen 3.7 Max and the 35-Hour Test
The Code Arena leaderboard placed Qwen 3.7 Max fourth globally in late May 2026, ahead of GPT-5.5 and Gemini 3.5 Flash. Only Claude Opus 4.7 and Opus 4.6 sit above it. For context: this is the first time a Chinese model has reached this position in programming evaluations.
The benchmark numbers matter less than one internal test Alibaba published. In an autonomous programming task, Qwen 3.7 Max ran for 35 consecutive hours, executing 1,158 tool calls. Zero context degradation. Zero instruction drift. Zero infinite loops.
That last part is the meaningful one. Infinite loops are one of the most documented failure modes in long-horizon agent tasks. A model that calls tools 1,000 times without losing the thread of what it was supposed to accomplish is not demonstrating raw intelligence. It is demonstrating a specific kind of learned discipline — knowing when to move forward, when to backtrack, when a strategy is failing.
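To make the failure mode concrete, here is a minimal sketch of the kind of external guard an agent harness might use to notice a loop. It is an illustration only: the `RepeatGuard` class, its thresholds, and the tool names are hypothetical, and nothing here describes how Qwen or any specific framework handles the problem.

```python
from collections import Counter, deque


class RepeatGuard:
    """Flags an agent that keeps issuing the same tool call.

    Hypothetical helper: it remembers the last `window` calls and reports
    a suspected loop once one identical call appears `max_repeats` times
    within that window.
    """

    def __init__(self, window=20, max_repeats=5):
        self.recent = deque(maxlen=window)
        self.max_repeats = max_repeats

    def record(self, tool, args):
        """Record one call; return True if it looks like a loop."""
        # Assumes argument values are hashable (strings, numbers, tuples).
        key = (tool, tuple(sorted(args.items())))
        self.recent.append(key)
        return Counter(self.recent)[key] >= self.max_repeats


guard = RepeatGuard(window=10, max_repeats=3)
calls = [
    ("read_file", {"path": "a.py"}),
    ("run_tests", {"target": "unit"}),
    ("read_file", {"path": "a.py"}),
    ("read_file", {"path": "a.py"}),
]
flags = [guard.record(tool, args) for tool, args in calls]
print(flags)
```

A guard like this only catches exact repeats. The harder case, which the 35-hour run speaks to, is a model that varies its calls slightly while making no real progress.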
Alibaba reportedly trained the model using environment expansion — running the same programming tasks across multiple execution frameworks and verification methods. Instead of learning shortcuts for one setup, the model was forced to develop general problem-solving patterns. That approach likely explains why it holds up across different agent frameworks rather than performing well only in its own ecosystem.
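The idea can be pictured as a cross product: every task is paired with every execution framework and every verification method, so no single setup can be memorized. The sketch below illustrates that reading only. The task, framework, and verifier names are placeholders, and Alibaba's actual pipeline is not described in the source.

```python
from itertools import product

# Placeholder names, assumed for illustration only.
tasks = ["fix_failing_test", "add_cli_flag"]
frameworks = ["framework_a", "framework_b", "framework_c"]
verifiers = ["unit_tests", "output_diff"]


def expand(tasks, frameworks, verifiers):
    """Pair every task with every framework and verifier."""
    return [
        {"task": t, "framework": f, "verifier": v}
        for t, f, v in product(tasks, frameworks, verifiers)
    ]


episodes = expand(tasks, frameworks, verifiers)
print(len(episodes), "training episodes from", len(tasks), "tasks")
print(episodes[0])
```

Two tasks become twelve distinct episodes here, each requiring the same underlying solution under different tooling and a different check.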
In a developer test involving a self-training Tetris AI, Qwen 3.7 Max beat both Claude Opus 4.7 and GPT-5.5 at a total token cost of $1.32, with a 56 percent performance improvement over competitors. The cheaper option won. That is not always the story in AI benchmarks. When it is, it tends to move adoption fast.
For broader context on how these coding agent comparisons have played out at the frontier in 2026, see how AI is learning to automate its own training.
A 46-Page Paper, 2 Hours of Human Thinking
Deli Chen is a senior researcher at DeepSeek and a core contributor to multiple versions of the DeepSeek architecture, including the R1 model that appeared on the cover of Nature. He published a 46-page survey paper in late May. His own disclosure: approximately 99 percent of it was written by his autonomous research agent framework.
The numbers he reported: six total iterations, a first draft completed in 76 minutes, six days of total calendar time, 108 rounds of agent interaction, 648,000 tokens consumed, 103 references all verified, 2,234 lines of LaTeX. His own time spent on the actual thinking: under 2 hours.
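A few ratios fall out of those reported figures. This is a quick back-of-the-envelope sketch using only the numbers above, with 2 hours taken as the upper bound on his time. Note that "tokens consumed" counts everything the agent processed, not the length of the paper.

```python
# Figures as reported by Chen, per this article.
rounds = 108
tokens = 648_000
latex_lines = 2_234
pages = 46
human_hours_max = 2  # "under 2 hours", so this is an upper bound

tokens_per_round = tokens / rounds
tokens_per_page = tokens / pages
lines_per_page = latex_lines / pages
min_pages_per_human_hour = pages / human_hours_max

print(f"tokens per interaction round: {tokens_per_round:.0f}")
print(f"tokens consumed per page:     {tokens_per_page:.0f}")
print(f"LaTeX lines per page:         {lines_per_page:.1f}")
print(f"pages per human hour:         at least {min_pages_per_human_hour:.0f}")
```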
The paper's topic was autonomous research agents. A human used an AI agent to write a comprehensive review of AI agents conducting research. The irony is deliberate. Chen framed it as both a demonstration and an analysis of exactly what it describes.
The paper proposes a five-level autonomy taxonomy for research agents, modeled on self-driving car classification. Level one is autocomplete tools like GitHub Copilot — the human drives every step. Level three, where Claude Code and Cursor agents currently sit, involves multi-step operation with human checkpoints. Level four is full autonomy within bounded domains. Level five, self-directed research where the agent chooses its own problems, remains mostly hypothetical.
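The taxonomy maps naturally onto an ordered enumeration. The sketch below encodes only what this article states. The member names are invented for illustration, level two is left unlabeled because it is not described here, and the `needs_human_checkpoint` cutoff is an assumption rather than a rule from the paper.

```python
from enum import IntEnum


class AutonomyLevel(IntEnum):
    """Five-level scale as summarized in this article (names are illustrative)."""

    L1_AUTOCOMPLETE = 1       # human drives every step, e.g. GitHub Copilot
    L2_UNSPECIFIED = 2        # not described in this article
    L3_CHECKPOINTED = 3       # multi-step operation with human checkpoints
    L4_BOUNDED_AUTONOMY = 4   # full autonomy within bounded domains
    L5_SELF_DIRECTED = 5      # agent chooses its own research problems


def needs_human_checkpoint(level):
    """Assumed cutoff: levels up to three keep a human in the loop."""
    return level <= AutonomyLevel.L3_CHECKPOINTED


current_agents = AutonomyLevel.L3_CHECKPOINTED  # Claude Code and Cursor agents
print(current_agents.name, needs_human_checkpoint(current_agents))
print(AutonomyLevel.L4_BOUNDED_AUTONOMY.name,
      needs_human_checkpoint(AutonomyLevel.L4_BOUNDED_AUTONOMY))
```

Using `IntEnum` keeps the levels comparable, which matches the self-driving analogy where each level implies more autonomy than the one below it.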
The paper also identifies six fundamental problems that current systems have not solved: cognitive loop traps, context window limitations in long sessions, the difficulty of evaluating genuine novelty in AI-generated research, reproducibility issues from non-deterministic inference, safety and dual-use risks, and cost barriers — a single complex task can run $5 to $50 in API calls.
Chen's agent framework name as cited in the source transcript may contain a transcription error — treat the specific name as unverified. The paper itself and Chen's authorship disclosure are from his own public statements.
What the paper demonstrates, separate from its content, is that the gap between a human researcher and an AI agent doing research-adjacent work has compressed to the point where a credible senior researcher at a major lab found it worth documenting publicly. That is a different kind of data point than a benchmark score.
What June Actually Means
Grok V9's public release is expected in mid-June, timed just before SpaceX's NASDAQ listing on June 12. GPT-5.6 has appeared in Codex infrastructure with a reported 1.5 million token context window — Polymarket assigns over 85 percent probability to a release before end of June. Claude Opus 4.8 has surfaced in Google Vertex infrastructure. Gemini 3.5 Pro is also scheduled for June.
Four labs. Same month. All of them have been watching the same benchmark numbers and drawing the same conclusion about what the next gap to close actually is.
The interesting question is not which model scores highest on SWE-bench in July. It is whether tripling parameters while adding real developer interaction data produces a qualitative change in how Grok handles actual engineering work — or whether it just closes the benchmark gap while the production gap remains. Those are not the same thing. They rarely are.
What Anthropic's Mythos model has been doing with real-world task execution in parallel is a separate thread, worth tracking alongside the coding benchmark race.
My Take
The Cursor move is the one that matters most long-term. Not because of Grok V9 specifically, but because it signals that the training data scarcity problem has shifted from text and code to process. Anyone can scrape GitHub. Nobody else has what Cursor has — millions of hours of real engineering work captured in structured interaction logs. That data existed for years. The labs are only now paying 60 billion dollars to access it. That tells you where the bottleneck actually was.
The DeepSeek paper is the most honest thing published this month. A senior researcher at one of the world's leading AI labs spent 2 hours supervising a paper that would have taken him a month. He published it anyway and told everyone exactly what happened. That is worth more than the paper itself.
Qwen's 35-hour number is the one I keep coming back to. Just is.
- Grok V9 (1.5T parameters) trained on Cursor developer interaction data — real prompts, edits, debugging sessions, not just finished code
- SWE-bench gap: Grok 4 at 72-75% vs Claude Opus 4.6 at 80.8% vs GPT-5.5 at 88.7% — this gap drove the Cursor move
- Qwen 3.7 Max ran 35 hours, 1,158 tool calls, zero infinite loops — 4th globally on Code Arena, ahead of GPT-5.5
- DeepSeek researcher Deli Chen: 46-page paper, under 2 hours personal effort, 99% agent-written — published and disclosed publicly
- June 2026: GPT-5.6, Claude Opus 4.8, Gemini 3.5 Pro, and Grok V9 all targeting the same window
- Cursor data sourcing and authorization details have not been officially confirmed by xAI or Cursor
Frequently Asked Questions
What is Grok V9 and when is it releasing?
Grok V9, internally called V9-Medium, is xAI's next-generation model at 1.5 trillion parameters — three times the size of the current V8 model in production. Elon Musk announced training completion on May 24, 2026. Public release is expected mid-June 2026, approximately 2-3 weeks from that announcement.
What kind of data does Cursor collect from developers?
According to a response Grok gave when asked directly, Cursor interaction data includes developer prompts, code context, editing operations, and task completion records: essentially the full workflow of how engineers write, debug, and modify code in real sessions. Whether and how this data was used in V9 training has not been officially confirmed by xAI or Cursor.
How does Qwen 3.7 Max compare to Claude Opus 4.6 for coding?
On SWE-bench Verified, the scores are very close — Qwen 3.7 Max at 80.4% vs Claude Opus 4.6 at 80.8%. On Terminal-Bench 2.0, Qwen leads at 69.7% vs Opus 4.6 at 65.4%. On SWE-bench Pro (harder real-world tasks), Qwen leads at 60.6%. Pricing is significantly lower. For most agentic coding workflows, the two are currently co-leaders.
What is SWE-bench and why do developers care about it?
SWE-bench Verified is a benchmark that tests AI models on real GitHub issues from actual software projects — the model has to read the issue, understand the codebase, and produce a working fix. Unlike benchmarks based on synthetic problems, it measures the kind of work a developer would actually hand off to an AI agent. High SWE-bench scores correlate with models that are genuinely useful in production coding environments.
Did a DeepSeek researcher really publish a paper written 99% by AI?
Yes. Deli Chen, a senior researcher at DeepSeek and contributor to multiple versions of its architecture including R1, publicly disclosed that his autonomous research agent wrote approximately 99% of a 46-page survey paper. He reported spending under 2 hours on the actual intellectual work. The paper's topic — autonomous research agents — made the disclosure particularly pointed. All 103 references were verified by the agent.
June 2026 will settle some of these benchmark questions. It will not settle the more interesting one: whether a model trained on how developers actually work produces meaningfully different results in production, or just on leaderboards. That answer comes from developers using the tools, not from announcements. Watch what they say in June, not what the labs publish.