“AI wrote some code” isn’t news anymore. But AI helped build a whole deep learning engine, the kind that has to juggle Python, C++, and GPU kernels, and that engine can train real neural networks. That’s different.
This comes from a real NVIDIA experiment: a team let coding agents powered by large language models generate and revise most of an open source deep learning runtime called Vibe Tensor (positioned as research and education, not production). The surprising part isn’t that it compiles. It’s that it runs end-to-end training loops on NVIDIA GPUs and matches the learning behavior you’d expect from a trusted framework, even when the code wasn’t hand-written line by line.
In this post, I’ll explain what they built, how they checked it worked without reading every change, where performance looks good (and where it doesn’t), and what this says about how software might get built next.
What the AI actually built (and why it is more than a toy demo)
Calling this “a demo” undersells it. Vibe Tensor is closer to a mini deep learning framework stack: a Python-facing API up top, a C++ core in the middle, and CUDA code at the bottom that talks to NVIDIA GPUs.
A quick translation of the jargon, in plain words. A tensor is just a multi-dimensional array (the numbers your model learns with). A runtime is the machinery that runs operations and keeps track of memory and execution. A kernel is a specialized GPU routine for a specific operation. A GPU is the parallel math engine that makes modern deep learning practical.
What makes Vibe Tensor feel real is that it doesn’t stop at “look, matrix multiply works.” The project runs full training loops, including forward pass, backward pass, optimizers, and enough core ops to train small transformer-style models. And it does it in an eager style that feels familiar if you’ve used PyTorch.
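To ground that in something concrete, here is what a minimal eager-style training step looks like in PyTorch, the framework whose feel Vibe Tensor mimics; this is shown as a reference point, not Vibe Tensor’s own API.

```python
# A minimal eager-style training step in PyTorch: ops run immediately,
# autograd records the graph, and an optimizer updates the weights.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(8, 32)           # a batch of 8 inputs with 32 features
y = torch.randint(0, 10, (8,))   # target class labels

logits = model(x)                # forward pass, executed eagerly op by op
loss = loss_fn(logits, y)
loss.backward()                  # backward pass: autograd fills in .grad
optimizer.step()                 # optimizer applies the weight update
optimizer.zero_grad()
```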
If you want the primary source, the project details are laid out in the VibeTensor arXiv paper, and the code is in the open at the NVLabs VibeTensor repository.
A PyTorch-like feel on the surface, with serious plumbing underneath
From the outside, Vibe Tensor behaves like the kind of framework people actually use. You write Python, create tensors, call ops, and they execute right away. That “right away” part matters, because it means the system isn’t hiding behind a pre-compiled graph you can’t inspect.
Under that friendly layer, the C++ core has to keep a lot of facts straight. Every tensor needs to know its shape, data type, memory layout, and whether its data lives on CPU or GPU. It also needs to support views, like slicing and reshaping, without copying memory all the time. That’s one of those unglamorous details that separates “works on a blog post” from “can run a model without wasting half the GPU.”
There are also safety-ish mechanics you normally never think about until things break. Version counters, for example, help detect risky in-place edits, the kind that can silently corrupt training if a tensor gets modified while something else still expects the old value. Mature frameworks do this because deep learning code is messy in practice, and Vibe Tensor mirrors that idea.
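To see both ideas in action, here is how PyTorch, whose behavior Vibe Tensor mirrors, handles views and version counters; the specifics below are PyTorch’s, used purely as an illustration.

```python
import torch

# Views share storage instead of copying: slicing and reshaping create a new
# "window" onto the same memory.
a = torch.arange(6.0)
v = a.view(2, 3)            # no copy; v aliases a's storage
v[0, 0] = 100.0
print(a[0])                 # tensor(100.) - the original saw the write

# Version counters catch risky in-place edits during training.
x = torch.ones(3, requires_grad=True)
y = x * 2
z = y * y                   # autograd saves y for the backward pass
y.add_(1.0)                 # in-place edit bumps y's version counter
try:
    z.sum().backward()
except RuntimeError as err: # "...modified by an inplace operation..."
    print(err)
```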
The key building blocks: dispatcher, autograd, and a GPU memory manager
Here’s the mental model I used while reading about it.
The dispatcher is like a traffic cop. When Python calls an operation (add, matmul, layer norm), the dispatcher decides which implementation to run and where, CPU or GPU. It can also wrap ops so they connect into training, not just compute a value and disappear.
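Here is a toy sketch of that routing idea, purely illustrative and not the project’s actual dispatcher: a registry keyed by (op name, device) picks which implementation runs.

```python
# Toy dispatcher: a table keyed by (op, device) decides which kernel runs.
_KERNELS = {}

def register(op, device):
    def wrap(fn):
        _KERNELS[(op, device)] = fn
        return fn
    return wrap

@register("add", "cpu")
def add_cpu(a, b):
    return [x + y for x, y in zip(a, b)]

@register("add", "cuda")
def add_cuda(a, b):
    raise NotImplementedError("this variant would launch a CUDA kernel")

def dispatch(op, device, *args):
    # In a real framework this wrapper is also where autograd hooks in,
    # recording the op so it participates in training instead of vanishing.
    return _KERNELS[(op, device)](*args)

print(dispatch("add", "cpu", [1, 2], [3, 4]))   # [4, 6]
```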
The autograd engine is what makes learning happen. In simple terms, during the forward pass it records how outputs depend on inputs. During the backward pass, it walks that chain backward and computes gradients. Gradients are the “how much should I change this weight?” signals that drive training.
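A minimal scalar-only sketch of reverse-mode autograd, illustrative rather than how Vibe Tensor actually implements it: the forward pass records how each value was produced, and the backward pass walks the graph in reverse and accumulates gradients.

```python
# Toy reverse-mode autograd for scalars.
class Value:
    def __init__(self, data):
        self.data, self.grad = data, 0.0
        self.parents, self.grad_fn = (), None

    def __add__(self, other):
        out = Value(self.data + other.data)
        out.parents = (self, other)
        out.grad_fn = lambda g: ((self, g), (other, g))
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data)
        out.parents = (self, other)
        out.grad_fn = lambda g: ((self, g * other.data), (other, g * self.data))
        return out

    def backward(self):
        # Reverse topological order: a node's gradient is complete before
        # it is pushed down to its parents.
        order, seen = [], set()
        def topo(node):
            if node not in seen:
                seen.add(node)
                for p in node.parents:
                    topo(p)
                order.append(node)
        topo(self)
        self.grad = 1.0
        for node in reversed(order):
            if node.grad_fn:
                for parent, g in node.grad_fn(node.grad):
                    parent.grad += g

w, x = Value(3.0), Value(2.0)
loss = w * x + w
loss.backward()
print(w.grad, x.grad)   # 3.0 3.0  (d loss/dw = x + 1, d loss/dx = w)
```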
Then there’s the GPU memory manager. GPUs are fast, but repeatedly asking the GPU driver for fresh memory can slow you down. A reuse-based allocator grabs chunks and reuses them safely, and it can also track stats so you can see what memory is doing over time. That sounds boring, and it is, but it’s also the difference between stable training and random performance cliffs.
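Here is a toy sketch of the reuse idea with simple stats, just to show the shape of it; this is not the project’s real allocator.

```python
# Toy caching allocator: freed blocks go into a pool keyed by size and get
# handed back out instead of asking the device driver for new memory.
from collections import defaultdict

class CachingAllocator:
    def __init__(self):
        self.free_blocks = defaultdict(list)   # size -> blocks available for reuse
        self.stats = {"device_allocs": 0, "reuses": 0, "bytes_in_use": 0}
        self._next_id = 0

    def _device_malloc(self, size):
        # Stand-in for an expensive cudaMalloc call to the driver.
        self.stats["device_allocs"] += 1
        self._next_id += 1
        return ("block", self._next_id, size)

    def allocate(self, size):
        if self.free_blocks[size]:
            block = self.free_blocks[size].pop()   # reuse a previously freed block
            self.stats["reuses"] += 1
        else:
            block = self._device_malloc(size)
        self.stats["bytes_in_use"] += size
        return block

    def free(self, block):
        size = block[2]
        self.stats["bytes_in_use"] -= size
        self.free_blocks[size].append(block)   # cache it, don't return it to the driver

alloc = CachingAllocator()
a = alloc.allocate(1024)
alloc.free(a)
b = alloc.allocate(1024)   # served from the cache, no second device allocation
print(alloc.stats)         # {'device_allocs': 1, 'reuses': 1, 'bytes_in_use': 1024}
```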
The wild twist is that the agents didn’t just write one layer. They stitched together Python calls into C++ tensor logic, and down into CUDA execution with streams, events, and even experiments around CUDA graphs (recording sequences of GPU work so they can replay faster later).
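For flavor, here is what CUDA graph capture and replay look like through PyTorch’s Python API. Vibe Tensor’s own stream, event, and graph experiments live down in its C++/CUDA layer, so treat this as an analogy; it needs a CUDA-capable GPU to run.

```python
import torch

x = torch.randn(1024, 1024, device="cuda")
w = torch.randn(1024, 1024, device="cuda")

# Warm up on a side stream so lazy initialization happens outside capture.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    y = torch.relu(x @ w)
torch.cuda.current_stream().wait_stream(s)

# Record the sequence of GPU work once...
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    y = torch.relu(x @ w)

# ...then replay it with much lower launch overhead. Replays reuse the same
# input buffers, so new data is copied into x in place before replaying.
x.copy_(torch.randn_like(x))
g.replay()
torch.cuda.synchronize()
print(y.sum().item())
```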
How they proved it works without humans reading every line
If AI is writing big chunks of system code, the obvious question is, “Who’s checking this?”
The answer, in this experiment, is a validation-heavy workflow. Agents propose code changes, the system compiles, unit tests run in both C++ and Python, and outputs get compared against a known baseline (often PyTorch). If a change fails, it doesn’t “kind of pass.” It gets rejected, and the agent tries again.
That’s the core bet: instead of trusting the author, you trust a tight loop of builds, tests, comparisons, and benchmarks. Not vibes, not reputation.
This approach only works if your test suite is serious, and the team treated tests like a product feature. They also used reproducible benchmarks, because performance claims are slippery if you can’t run them again under the same conditions.
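To make the “compare against a baseline” step concrete, here is a minimal sketch of the pattern; `my_layer_norm` is a hand-written stand-in for an agent-generated op, not code from the project.

```python
import torch
import torch.nn.functional as F

def my_layer_norm(x, eps=1e-5):
    # Candidate implementation under test.
    mean = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, unbiased=False, keepdim=True)
    return (x - mean) / torch.sqrt(var + eps)

x = torch.randn(4, 128, dtype=torch.float32)
expected = F.layer_norm(x, normalized_shape=(128,))   # trusted reference

# The loop only accepts a change if outputs match the baseline within tolerance.
torch.testing.assert_close(my_layer_norm(x), expected, rtol=1e-5, atol=1e-5)
print("matches the baseline")
```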
If you’ve been following how agent workflows are getting more autonomous in general, this fits into a bigger theme I wrote about in why the AI of 2026 will feel more independent. Vibe Tensor is a concrete example of that shift, where “agent” stops meaning chat, and starts meaning iterative work.
Build, test, compare, repeat: the loop that kept bad code from sticking
The development loop is simple to describe, but brutal in practice:
An agent proposes a change, the codebase builds, tests run, results are checked, and only then does the change stick. Over and over.
Humans still matter here, just not in the way people imagine. The humans set goals and constraints, like “support slicing without copies,” “add a diagnostic memory allocator,” or “implement a dispatcher that can route CPU vs GPU ops.” The agents fill in the details, generate code, run through failures quickly, and keep iterating until the system behaves.
That division of labor feels… honest. Humans do architecture and judgment. Agents do volume work plus endless retries. In a big stack, retries are most of the job.
The bugs were very real, and the fixes became new regression tests
This wasn’t a clean ride. The failures were the same kind you get in normal systems programming, the kind that makes you stare at logs at 2 a.m.
Some GPU kernels crashed because they exceeded hardware limits. Some numerical mistakes showed up because the wrong stability trick (or the wrong formula) can make values drift. And one of the most relatable problems, because it’s so classic, was training diverging due to memory reuse issues, basically a buffer getting reused without proper initialization.
Each time a bug surfaced, the workflow response was practical: write a targeted regression test (a test designed to make sure the same bug never comes back), then rerun the whole validation loop. That’s how the codebase becomes more trustworthy over time, even if you never “read” the whole thing in the traditional sense.
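Here is what a targeted regression test for that kind of buffer-reuse bug could look like; the function and the bug below are invented for illustration, not taken from the repository.

```python
# Once a "reused buffer wasn't initialized" bug is found, a small test
# pins the fix in place so the same failure can't quietly return.
import torch

def accumulate_grads(grads, out=None):
    # The buggy version skipped this zero_ when `out` was a reused buffer.
    if out is None:
        out = torch.zeros_like(grads[0])
    else:
        out.zero_()                  # the fix: clear stale values before reuse
    for g in grads:
        out.add_(g)
    return out

def test_reused_buffer_is_cleared():
    grads = [torch.ones(4), torch.ones(4)]
    scratch = torch.full((4,), 123.0)        # simulate a dirty, reused buffer
    result = accumulate_grads(grads, out=scratch)
    torch.testing.assert_close(result, torch.full((4,), 2.0))

test_reused_buffer_is_cleared()
print("regression test passed")
```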
This is also why open source matters here. People can inspect the system, run tests, and reproduce claims. If you want a short outside write-up on the story from a news angle, this piece covers the broad idea: AI agents generating the VibeTensor stack.
Performance, benchmarks, and the "Frankenstein" problem nobody expects
Let’s be clear: Vibe Tensor is not presented as “PyTorch but faster.” It’s presented as a proof of concept that agent-written system software can be coherent enough to run real training on GPUs.
So performance is mixed. Some kernels beat baselines in specific cases, others don’t, and the full training stack is often slower than mature frameworks. That’s not shocking. Tuning a deep learning system is years of work and a lot of profiling pain.
But the more interesting lesson is what the authors describe as a Frankenstein-style issue: individual parts can look fine alone, yet when you glue them together, you get hidden global bottlenecks. It’s the system version of “each ingredient tastes good, but the soup is weird.”
And that’s exactly the kind of problem agent coding will run into more often, because agents can generate lots of reasonable local solutions that don’t optimize for the whole.
Where it shines: a few custom GPU kernels can beat the baseline
Vibe Tensor includes an AI-generated kernel suite for common building blocks used in modern models, things like layer normalization, rotary embeddings, and pieces of attention.
In benchmarks, some of these kernels show big speedups in certain setups, including cases where a routine runs multiple times faster than a reference baseline. That’s the fun part, because it hints at a future where agents generate many kernel variants, test them automatically, and keep the best ones per GPU architecture.
But it’s not consistent. Attention, in particular, can be a mixed story, with gains in some large training setups and slower results in smaller workloads. That variability is a reminder that GPU performance depends on details like memory access patterns, launch overhead, and how well the work fits the hardware.
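Here is a hedged sketch of the “generate variants, benchmark them, keep the best” idea; the two GELU variants and the harness are illustrative, not the project’s kernels, and a real harness would also check numerical correctness before comparing speed.

```python
import time
import torch

def variant_exact(x):
    return torch.nn.functional.gelu(x)

def variant_tanh(x):
    # A tanh-approximation candidate for the same op.
    return torch.nn.functional.gelu(x, approximate="tanh")

def benchmark(fn, x, iters=50):
    fn(x)                                # warm-up
    if x.is_cuda:
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn(x)
    if x.is_cuda:
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

x = torch.randn(4096, 4096)
candidates = {"exact": variant_exact, "tanh": variant_tanh}
timings = {name: benchmark(fn, x) for name, fn in candidates.items()}
best = min(timings, key=timings.get)
print(timings, "-> keeping", best)
```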
So you don’t walk away thinking “agents solved performance.” You walk away thinking “agents can stumble into performance wins, and then the hard part is sorting signal from noise.”
Where it slows down: safety choices and global bottlenecks
Some slowdowns come from design choices that make correctness easier.
One example mentioned is a global lock around parts of the backward pass. It simplifies reasoning about correctness, because it reduces concurrency problems. But it also blocks parallel execution, and GPUs hate waiting. If multiple training tasks can’t run in parallel, you end up with expensive hardware sitting idle.
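A tiny sketch of that trade-off, illustrative only: a single lock keeps the backward pass easy to reason about, but it also forces concurrent training tasks to take turns.

```python
import threading
import time

backward_lock = threading.Lock()

def backward_pass(task_id):
    with backward_lock:        # only one backward pass runs at a time
        time.sleep(0.1)        # stand-in for the actual gradient computation
        print(f"task {task_id} finished backward")

threads = [threading.Thread(target=backward_pass, args=(i,)) for i in range(4)]
start = time.perf_counter()
for t in threads:
    t.start()
for t in threads:
    t.join()
# Roughly 0.4s total: the lock serializes work that could have overlapped.
print(f"elapsed: {time.perf_counter() - start:.2f}s")
```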
Other slowdowns are more subtle and honestly very believable in machine-generated code: redundant layers, inconsistent style, and extra indirection. Each one might be harmless on its own, but together they add friction. The code still works, but it’s harder to tune, harder to profile cleanly, and easier to end up with “why is this so slow?” moments.
The takeaway is not that agent code is bad. It’s that system performance is a whole-body problem. You can’t win it with isolated good ideas.
What I learned watching this unfold (and why it changed how I think about AI coding)
I’ve been around enough software projects to know how hard it is to build “the boring parts.” The dispatcher logic. The memory allocator. The endless edge cases that don’t show up in a demo. Seeing agents crank through those layers, then compile, test, fail, fix, and repeat, changed my mental model a bit.
First, tests beat vibes. I used to think of tests as the safety net you add after you trust the code. This flips it. In an agent-driven workflow, tests are the thing you trust first. You can’t afford “looks right.” You need “fails loud.”
Second, humans don’t get pushed out, they get pulled upward. When the day-to-day code changes come in large volumes, the human job becomes setting constraints, choosing what matters, and spotting the places where “correct but dumb” architecture choices will create long-term pain. That’s not glamorous either, but it’s real engineering.
Third, I had to remind myself that a slow prototype can still be a landmark. Vibe Tensor trained real models on real tasks, including small transformers, and the learning curves behaved like you’d expect: loss down, accuracy up, text outputs getting less random. That’s not a benchmark win, but it’s a correctness win at scale. Those are different muscles.
And finally, trust doesn’t come from who wrote the code. It comes from how you validate it. If an agent writes your system code but the build, tests, comparisons, and benchmarks are tight, I start to care less about authorship. I care about the pipeline.
It also made me think of NVIDIA’s bigger message lately: the future of AI isn’t just models, it’s infrastructure, software plumbing, and huge real-world constraints. That theme shows up in Jensen Huang’s point about trillions for AI infrastructure, and Vibe Tensor sits right in the “software plumbing” part of that stack.
Conclusion
Vibe Tensor is a strong signal that AI agents can assemble a working deep learning engine across Python, C++, and CUDA, and validate it well enough to run real training on GPUs. It’s also honest about its limits: it’s a research project, the API surface is incomplete, and performance can be uneven compared to production frameworks.
The next chapter isn’t “agents write everything.” It’s better test design, more transparent benchmarks, and new engineering roles that focus on constraints, architecture, and failure modes, not line-by-line edits.
Would you trust AI-written system code if the tests were strong enough, or would you still want a human reading every critical change?