NVIDIA's Physical AI Stack: Cosmos 3, Vera, and the Robot Operating Layer



Twenty trillion tokens. That is the scale NVIDIA trained Cosmos 3 on, according to Axios. Not a research paper number. An actual production number. And it gives you a sense of what NVIDIA is actually building here, because this is not a model release. It is a platform play. Three components, one strategy: own the operating layer for physical AI the same way NVIDIA owned the compute layer for language models.

Quick Answer: NVIDIA launched three interconnected products at once: Cosmos 3, a world model that simulates physical environments and can cut robot training cycles from months to days; Vera, a CPU built specifically for AI agent workloads claiming up to 1.8x faster performance than x86; and the Isaac Groot humanoid reference robot, a research platform with 75 degrees of freedom and onboard AI compute. Together, they are NVIDIA's bid to become the infrastructure layer for robots and autonomous systems.

Why physical AI is harder than language AI

A chatbot can read half the internet and pick up language patterns. A robot cannot learn from that data. It needs something categorically different: motion sequences, cause-and-effect chains, the physics of objects in space. What happens when a hand reaches for a cup. What happens when a wheel loses traction. What happens when two objects collide and one tips over.

That is the gap Cosmos 3 is designed to close. Language models were stuck behind screens, as Jensen Huang put it. The real world is harder because it demands a model of physical reality, not just a model of text.

Robot training is also painfully slow by nature. You cannot let a humanoid fail a million times in a warehouse. It breaks hardware. It burns time. It creates safety problems. So companies rely on simulation, controlled environments, synthetic data. The bottleneck is not the robot. It is the speed at which you can teach it before it ever touches the real world. That is the exact problem NVIDIA is targeting.

Cosmos 3: the world model at the core

NVIDIA describes Cosmos 3 as an "open world foundation model for physical AI." It is built on a Transformer mixture-of-experts architecture and combines four capabilities in one system: vision, reasoning, world generation, and action prediction. In plain terms, it can understand what it is seeing, simulate physical environments, and help predict what should happen next.
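For readers unfamiliar with the term, here is a minimal sketch of what "mixture-of-experts" routing generally means. It is a generic top-k gating illustration, not Cosmos 3's actual architecture, which the source does not detail. The toy experts, the fixed gate scores, and all function names are assumptions made up for the example.

```python
import math


def softmax(scores):
    """Convert raw gate scores to probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(gate_scores, k=2):
    """Pick the k highest-scoring experts and renormalize their weights."""
    probs = softmax(gate_scores)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return {i: probs[i] / norm for i in top}


def moe_forward(x, experts, gate_scores, k=2):
    """Weighted sum of the selected experts' outputs; unselected experts never run."""
    weights = route_top_k(gate_scores, k)
    return sum(w * experts[i](x) for i, w in weights.items())


# Four toy "experts" (hypothetical): each is a simple function of a scalar input.
experts = [
    lambda x: x + 1.0,
    lambda x: 2.0 * x,
    lambda x: x * x,
    lambda x: -x,
]

# In a real model the gate scores come from a learned layer; here they are fixed.
gate_scores = [0.1, 2.0, 1.5, -1.0]
weights = route_top_k(gate_scores, k=2)
output = moe_forward(3.0, experts, gate_scores, k=2)
print(weights, output)
```

The point of the design is that only a few experts run per input, so a model can carry a large total parameter count while keeping the compute per token much lower.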

The training scale is the detail that matters most. Axios reported 20 trillion tokens of multimodal data: real and synthetic video, images, ambient audio, text, and action sequences from both humans and robots. The specific inclusion of action trajectories is what separates this from a standard foundation model. Most models are trained on human-generated text and images. Cosmos 3 is trained on how things move and what they do next.

NVIDIA's core claim: Cosmos 3 can cut physical AI training and evaluation cycles from months down to days. That is the number robotics engineers will actually care about. The model can also function as a vision-language model, a world model, or a video foundation model depending on the application.

Note: Whether Cosmos 3 is available as open weights or API-access only was not specified in the source. Worth checking NVIDIA's developer documentation before building on it.

NVIDIA also announced the Cosmos Coalition with partners including Agile Robots, Black Forest Labs, Generalist, LTX, Runway, and Skilled AI. World models are becoming a platform war: OpenAI, Google DeepMind, Tesla, simulation labs, video model companies are all circling the same idea. Google's entry, Gemini Omni Flash, trains across all four modalities simultaneously rather than building a dedicated physical simulation layer. NVIDIA is trying to lock down the foundation layer before the others do.

Vera: why NVIDIA built a CPU for AI agents

For years, NVIDIA's GPU was the only thing that mattered. Massive models needed massive parallel compute. Now the industry is moving toward agentic AI, and agents work differently from a chatbot. An agent does not just answer a question. It plans a task, calls tools, runs code, checks files, queries databases, retries failed steps, and keeps grinding through a workflow. That creates a different kind of load inside data centers. The CPU becomes the coordinator.
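The coordination work described above can be sketched in a few lines. This is an illustrative agent loop under stated assumptions, not any vendor's agent runtime: the tool names, the `make_flaky_tool` helper, and the retry policy are all hypothetical. What it shows is that planning, dispatching, and retrying are ordinary branching control flow, which is the kind of work that lands on a CPU rather than a GPU.

```python
import time
from dataclasses import dataclass


@dataclass
class StepResult:
    name: str
    attempts: int
    output: str


def run_step(name, tool, max_retries=3, backoff_s=0.0):
    """Call one tool, retrying on transient failure up to max_retries times."""
    for attempt in range(1, max_retries + 1):
        try:
            return StepResult(name, attempt, tool())
        except RuntimeError:
            if attempt == max_retries:
                raise  # out of retries: surface the failure to the caller
            time.sleep(backoff_s * attempt)  # simple linear backoff


def run_agent(plan, tools):
    """Walk the plan in order, dispatching each step to its tool."""
    return [run_step(name, tools[name]) for name in plan]


def make_flaky_tool(fail_times, result):
    """Hypothetical tool that fails a fixed number of times, then succeeds."""
    state = {"calls": 0}

    def tool():
        state["calls"] += 1
        if state["calls"] <= fail_times:
            raise RuntimeError("transient failure")
        return result

    return tool


tools = {
    "query_database": make_flaky_tool(0, "42 rows"),
    "run_code": make_flaky_tool(2, "tests passed"),
    "check_files": make_flaky_tool(0, "3 files changed"),
}
plan = ["query_database", "run_code", "check_files"]
results = run_agent(plan, tools)
for r in results:
    print(r.name, r.attempts, r.output)
```

In a real deployment the model call that produces the plan would run on accelerators, while a loop like this one runs many times per task on the host CPU.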

Vera is NVIDIA's answer to that shift. They describe it as a high-performance, energy-efficient CPU designed specifically for agentic AI, reinforcement learning, and data processing workloads. The claimed benchmark: up to 1.8x faster on diverse agent workloads compared to traditional x86 processors.

Note: The 1.8x figure is a marketing claim from NVIDIA. The specific benchmark methodology, test conditions, and which x86 processors it was compared against were not detailed in the source material. Treat it as directional, not definitive.
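It helps to translate "up to 1.8x faster" into wall-clock terms. The sketch below is plain arithmetic, and the 10-minute baseline task is a made-up number for illustration. It also assumes the best case of the "up to" claim applies to the whole task.

```python
def time_at_speedup(baseline_seconds, speedup):
    """Wall-clock time if the whole task runs `speedup` times faster."""
    return baseline_seconds / speedup


def time_saved_fraction(speedup):
    """Fraction of the baseline time that the speedup removes."""
    return 1.0 - 1.0 / speedup


baseline = 600.0  # hypothetical 10-minute agent task on the comparison CPU
vera_time = time_at_speedup(baseline, 1.8)
saved = time_saved_fraction(1.8)
print(f"{vera_time:.0f} s, {saved:.0%} less time")
```

So a 1.8x speedup means roughly 44 percent less time at best, not 80 percent less, and only for the portion of a workload that is actually CPU-bound.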

The adoption list is serious. NVIDIA says Anthropic, OpenAI, and xAI are among the planned adopters, along with ByteDance, CoreWeave, and Oracle Cloud Infrastructure. Reuters reported Jensen Huang describing Vera as a potential $200 billion market opportunity, with OpenAI, Anthropic, and SpaceX named among major early adopters. Dell, HPE, Lenovo, Super Micro, Asus, Foxconn, Gigabyte, QCT, and Wistron are building standalone Vera CPU systems at scale.

The strategic logic is clear. If AI agents become the primary workload of data centers in the next five years, the company that supplies the CPU layer controls a new choke point. NVIDIA ran this play once with GPUs. Vera is the same play, one layer up the stack. The shift toward agents is also visible inside AI labs themselves — Anthropic is already using agentic loops to accelerate its own model training, which means the compute demand for agent workloads is not hypothetical.

Isaac Groot: giving the stack a body

If Cosmos 3 is the brain and Vera is the nervous system, the Isaac Groot reference humanoid is the body. NVIDIA announced it as an open humanoid robot reference design for academic research, built around a Unitree H2 chassis. If you want context on where Unitree stands in the broader humanoid market right now, the G1 vs Atlas comparison has the shipment numbers. The specifications are detailed enough to take seriously: nearly 6 feet tall, around 150 pounds, 31 degrees of freedom across the body. Add the dual Sharpa Wave tactile five-finger hands, with 22 degrees of freedom each, and the full system reaches 75 degrees of freedom total (31 + 2 × 22).

Hands get less attention than walking in robotics coverage, but they matter more for real usefulness. A humanoid has to grab objects, hold tools, open doors, lift items, press buttons. Five-finger tactile hands are a meaningful step toward that. The sensing stack on the reference design includes a head-mounted stereo camera with 140-degree horizontal and 102-degree vertical field of view, wrist cameras for close-range manipulation, and an IMU for motion tracking.

Control specs: arm torque up to 120 Nm, leg torque up to 360 Nm, rated arm payload 7 kg with a 15 kg peak. These are not demo numbers. The onboard compute uses a Jetson AGX Thor module with a T5000 Blackwell GPU delivering 270 FP4 teraflops, a 14-core ARM CPU, 128 GB unified memory, and configurable power from 40 to 130 watts.

Research partners confirmed at announcement: AI2, ETH Zurich, Stanford Robotics Center, and UC San Diego's Advanced Robotics and Controls Lab. Reuters also reported NVIDIA plans to work with US, European, and South Korean humanoid makers beyond Unitree. That last detail is important because Unitree is a Chinese company, and there are already concerns from US lawmakers about Unitree hardware in federally funded research. NVIDIA is positioning itself as the secure platform layer, with software updates routed through NVIDIA chips and protections like secure boot and confidential computing built in.

The larger ambition is not a robot. It is standardization. If labs and companies build on Jetson Thor, Isaac Groot, Cosmos, and Omniverse, NVIDIA becomes the operating layer for physical AI. The same way CUDA locked in GPU compute, this stack would lock in robotics development infrastructure.

The military subplot nobody is talking about enough

While NVIDIA's announcement dominated the headlines, a parallel story was already further along. Foundation Future Industries has been field-testing its Phantom Mark1 humanoid in Ukraine. According to Business Insider, two Phantom robots were sent there earlier this year for logistics pilot testing, carrying supplies near hazardous areas so soldiers do not have to expose themselves.

This is a different category of humanoid story. Most robotics companies talk about warehouse automation, manufacturing, home assistance. Foundation is openly focused on dangerous environments, including active conflict zones. Business Insider also reported the company secured a $24 million Pentagon contract, and Foundation's leadership has discussed future roles that include humanoids eventually handling weapons.

Even the company acknowledges the gap between a slow logistics demo and reliable operation in a firefight. Battery life is a problem. Durability is a problem. Water, dust, shock, terrain, manipulation under pressure, all massive barriers. The hardest part may still be the hand. Using a weapon or handling supplies under real field conditions requires dexterity that works when everything else is going wrong.

The company believes humanoids could carry out significantly more complex military missions within 5 to 10 years. That timeline sits somewhere between reassuring and unsettling depending on what you think gets solved in that window.

My Take

The 20 trillion token number is the one that actually matters. Not the robot specs, not the Vera benchmark claim, not the coalition announcements. Training data at that scale, specifically designed around physical cause-and-effect, is what separates Cosmos 3 from every previous attempt at this.

The military angle is undercovered and uncomfortable. Most technology coverage of NVIDIA's launch treated the military subplot as a footnote. It is not a footnote. A humanoid that has already been field-tested in a war zone, backed by a Pentagon contract, operated by a company that has openly discussed combat roles, is a different kind of milestone than a warehouse demo. The gap between where Foundation is today and where they say they will be in a decade is narrowing in ways that deserve more attention than a paragraph at the end of a product roundup.

Key Takeaways
  • Cosmos 3 was trained on 20 trillion tokens of multimodal data including action sequences. NVIDIA claims it can cut robot training cycles from months to days.
  • Vera CPU targets AI agent workloads specifically, claiming up to 1.8x faster performance than x86. Anthropic, OpenAI, and xAI are among planned adopters.
  • Isaac Groot reference humanoid: 75 degrees of freedom, 270 FP4 teraflops of onboard compute, 7 kg rated arm payload. Research partners include Stanford, ETH Zurich, AI2, and UC San Diego.
  • NVIDIA's strategy is vertical lock-in: if Cosmos, Vera, and Isaac Groot become the standard stack, NVIDIA becomes infrastructure for physical AI the way CUDA became infrastructure for language AI.
  • Foundation Future Industries' Phantom Mark1 has already been tested in Ukraine for military logistics. The company has a $24M Pentagon contract and is targeting combat roles within 5 to 10 years.

FAQ

What is NVIDIA Cosmos 3 and how is it different from a regular AI model?

Cosmos 3 is what NVIDIA calls an open world foundation model for physical AI. Unlike a language model that processes text, Cosmos 3 is trained on multimodal data including video, audio, and action sequences from robots and humans. It can simulate physical environments, predict future states, and understand cause-and-effect in the real world. The training dataset reportedly reached 20 trillion tokens, with a specific focus on the structure of physical reality rather than language patterns.

What does the Vera CPU do that existing CPUs cannot?

Vera is designed specifically for agentic AI workloads, where an AI model is not just answering questions but continuously coordinating tasks, calling tools, running code, and managing workflows. NVIDIA's pitch is that general-purpose x86 CPUs handle this type of workload less efficiently. The company claims Vera finishes diverse agent workloads up to 1.8x faster, which matters because agents create a very different compute pattern than batch inference. Speed here means faster task completion, not just faster token generation.

Is the Isaac Groot humanoid available for researchers to use?

NVIDIA announced it as an open reference design for academic research, and confirmed that institutions including Stanford, ETH Zurich, AI2, and UC San Diego will use the platform. However, pricing, availability timelines, and ordering details were not specified in the announcement materials. Checking NVIDIA's Isaac platform developer pages directly would give the most current access information.

Are humanoid robots actually being used in military operations right now?

Foundation Future Industries has conducted field tests of its Phantom Mark1 in Ukraine, according to Business Insider. The testing was focused on logistics tasks, specifically carrying supplies near hazardous areas so soldiers do not have to expose themselves. These were pilot tests, not full deployments. The company holds a $24 million Pentagon contract and has stated ambitions for more complex military roles within 5 to 10 years, though the company itself acknowledges significant technical gaps remain before reliable combat-level operation is possible.

Where physical AI goes next

The interesting question is not whether physical AI will advance. It clearly will. The question is whether one company standardizing the entire stack, from world model to CPU to robot hardware, produces better outcomes than a more fragmented ecosystem. NVIDIA made that bet on GPU compute and it mostly worked out. Whether the same consolidation is healthy when the output is not a chatbot but a humanoid that can operate in hazardous environments is a different kind of question entirely.

The pieces are moving faster than the frameworks for thinking about them.


Source: "The Big Bang Of AI Just Happened: Cosmos 3" (YouTube). Additional reporting credited to Axios (training data scale) and Reuters (Vera market estimate, Jensen Huang statements).
