The benchmark numbers are not the story. I know that sounds wrong — a 31B model ranking #3 globally on Arena AI, beating models 20 times its size on scientific reasoning, running on a single H100 GPU. That's all genuinely impressive. But developers who've been burned by open-weight licenses before know that benchmark positions don't mean much if your legal team won't sign off on deployment. That was the real problem with every previous Gemma release.
Google just fixed it. Gemma 4, released on April 2, 2026, ships under the Apache 2.0 license — the first time Google has ever done this for the Gemma family. No custom usage policies. No termination clauses buried in the fine print. No legal review friction. And it comes with four models spanning edge devices all the way to workstation-grade deployments. Let's get into what this actually means.
Table of Contents
- What is Gemma 4 and where does it fit?
- What are the four model sizes and what is each one for?
- Do the benchmarks actually hold up?
- Why does the Apache 2.0 license change everything?
- What can Gemma 4 actually do?
- Where do you access and deploy Gemma 4?
- How does it compare against Qwen and Llama 4?
- My Take
- Key Takeaways
- FAQ
What is Gemma 4 and where does it fit?
Gemma 4 is Google's latest family of open-weight AI models, built directly from the same research base as Gemini 3 — their proprietary frontier model. That's the framing Google is leaning on hard: you're getting a slice of closed-model research in a package you can actually download, modify, and deploy on your own hardware.
The positioning is clear. This isn't a research demo. Google is going after the developer ecosystem that has been increasingly gravitating toward Qwen, Llama, and Mistral — models from competitors who've been more permissive with licensing and more aggressive with on-device deployment. Gemma 4 is Google saying: we're playing that game now, and we're playing it seriously.
The family spans four models across two distinct deployment categories — edge devices and workstation/cloud — with each size optimized for a specific compute environment. And for the first time, the entire lineup ships under Apache 2.0, which removes the friction that killed adoption of previous Gemma versions in commercial settings.
What are the four model sizes and what is each one for?
This is where Gemma 4 gets genuinely interesting architecturally. The four variants aren't just the same model at different sizes — they have meaningfully different design priorities.
E2B and E4B (Edge Models): These activate 2 billion and 4 billion parameters at inference respectively, and they're designed for constrained edge hardware — smartphones, Raspberry Pi, NVIDIA Jetson Orin Nano. Google worked directly with Qualcomm, MediaTek, and the Pixel team to optimize these for mobile chips. They come with 128,000-token context windows, native multimodal support for images, video, and audio, and they run completely offline. The audio input capability on edge devices is genuinely useful — it means local speech understanding without any cloud round-trip.
26B MoE (Mixture of Experts): This is probably the most practically interesting model in the family. The total parameter count is 26 billion, but only about 3.8 billion parameters are active during any single inference pass. That's the MoE architecture doing its job — routing each input through a subset of the model's "experts" rather than activating everything at once. The result is significantly better latency and lower memory requirements than a comparable dense model. It supports a 256,000-token context window and targets workstations and consumer GPUs.
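To make the routing idea concrete, here's a toy sketch of top-k expert gating in Python. Gemma 4's actual router is a learned layer whose internals Google hasn't published, so treat the expert count, the scores, and the top-2 choice here as illustrative assumptions, not the model's real configuration:

```python
import math

def topk_gate(logits, k=2):
    """Toy MoE router: softmax over per-expert scores, keep only the
    top-k experts, renormalize their weights to sum to 1. Real routers
    are learned layers, but the selection step looks like this."""
    m = max(logits)
    exp = [math.exp(x - m) for x in logits]
    total = sum(exp)
    probs = [e / total for e in exp]
    top = sorted(range(len(probs)), key=lambda i: -probs[i])[:k]
    norm = sum(probs[i] for i in top)
    return {i: probs[i] / norm for i in top}  # expert index -> weight

# One token routed across 8 hypothetical experts: only 2 activate,
# so only those experts' parameters do work for this token.
weights = topk_gate([0.1, 2.0, -1.0, 0.5, 1.8, 0.0, -0.5, 0.3], k=2)
print(weights)
```

The key point is that the other six experts contribute zero compute for this token — which is exactly why a 26B-total model can have the latency and memory profile of a much smaller dense one.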
31B Dense: The flagship. Full parameter activation at inference, 256,000-token context, and currently ranked #3 on the Arena AI global leaderboard. If you have the compute budget and need maximum capability, this is the one. Quantized to 4-bit via Ollama or GGUF format, it fits in 16GB of RAM — an RTX 4090 handles it comfortably.
Do the benchmarks actually hold up?
Benchmark skepticism is warranted. Models get tuned for leaderboards. Numbers get cherry-picked. So let's look at what's verifiable here and what context actually matters.
The 31B Dense scored 85.7% on GPQA Diamond — a graduate-level scientific reasoning benchmark covering physics, biology, and chemistry. Independent verification from Artificial Analysis puts it second among all open models under 40 billion parameters, just behind Qwen3.5 27B at 85.8%. Effectively tied. The 26B MoE scored 79.2% on the same test, ahead of OpenAI's gpt-oss-120B at 76.2%. A 26 billion parameter MoE model beating a 120 billion parameter dense model on scientific reasoning isn't a trivial result — that's the architecture working.
On LiveCodeBench v6 — which tests coding on problems released after training data cutoffs, specifically to prevent memorization — the 31B scored 80.0% and the 26B hit 77.1%. The 31B also achieved a 2150 Codeforces ELO rating, which places it above roughly 98% of human competitive programmers on that platform. The 26B at 1718 still exceeds 90% of human participants.
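Codeforces uses an Elo-style rating, so a rating gap translates into an expected head-to-head score via the standard formula. A quick sketch — the percentile claims above depend on the platform's actual rating distribution, which this doesn't model; it only shows the pairwise math:

```python
def elo_win_prob(r_a, r_b):
    """Standard Elo expected score for player A against player B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

# The 31B (2150) against the 26B (1718) on the Codeforces scale:
# a 432-point gap implies roughly a 92% expected score.
p = elo_win_prob(2150, 1718)
print(round(p, 3))
```

The intuition: every 400 rating points multiplies the expected odds by 10, which is why a few hundred points of Elo is a large practical gap.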
The Arena AI ranking is where things get more interesting. The 31B sits at #3 globally — and that's not #3 among open models, that's #3 across everything including closed commercial models. The Arena ranking is based on aggregated human preference votes across diverse prompt types, not a fixed academic benchmark. Humans consistently preferred Gemma 4's responses even when automated scores were nearly identical to competitors. The 31B's Arena AI ELO of 1,452 places it above models with far more parameters.
The edge models also punch above their weight. The E4B hits 42.5% on AIME 2026 and 52.0% on LiveCodeBench — strong numbers for a model running on a T4 GPU. Both edge models significantly outperform Gemma 3 27B without thinking mode, despite being a fraction of the size.
That said, early community reports noted real-world inference speed issues with the 26B MoE at launch. Benchmarks and production throughput aren't the same thing. Worth watching as the ecosystem matures.
Why does the Apache 2.0 license change everything?
Previous Gemma releases weren't actually open in any meaningful commercial sense. They came with a custom Google usage policy that compliance teams routinely flagged. The terms included restrictions that required legal review before deployment, and the wording had enough edge cases that you couldn't just assume your use case was covered without asking legal counsel. Legal review adds friction. Friction kills adoption. Teams went elsewhere.
Apache 2.0 removes all of that. Full commercial use. Full modification rights. No termination clauses. You can take the model, change it, deploy it on-prem, keep full control over your data and infrastructure, and build a commercial product around it without worrying about whether your use case is technically permitted. That's what open actually means to developers.
The competitive context matters here. This is clearly a response to pressure from open-weight models coming out of China. Models from Alibaba (Qwen family) and others have been gaining serious developer traction with genuinely permissive licensing. Google's previous approach was too restrictive to compete with that. Apache 2.0 is Google acknowledging that reality and making a structural shift — not just releasing a better model, but changing the terms on which they're competing.
Gemma already had over 400 million downloads and more than 100,000 community variants before this release. With Apache 2.0, that adoption curve is going to steepen considerably. Organizations that previously couldn't deploy Gemma in production because of licensing friction now have no remaining legal obstacle.
What can Gemma 4 actually do?
Capability-wise, these are genuinely multimodal models across the entire family. Text, images with variable aspect ratio and resolution support, video for OCR and chart understanding, and native audio processing on the edge variants. Over 140 languages are supported. Code generation works fully offline.
For agentic workflows specifically, the larger models support function calling, structured JSON outputs, and multi-step reasoning chains. The 256K context window on the 26B and 31B means you can feed in very long documents, codebases, or conversation histories without hitting limits that force chunking strategies. Configurable thinking modes let you dial the reasoning depth up or down depending on your latency requirements.
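To ground what "function calling with structured JSON outputs" looks like in practice, here's a hedged sketch using the OpenAI-style tool schema that many open-model runtimes accept. Google hasn't published Gemma 4's exact schema in this material, so the field names and the `get_weather` tool are illustrative assumptions:

```python
import json

# Hypothetical tool definition in the common JSON Schema shape.
# Gemma 4's exact expected format may differ -- check your runtime's docs.
get_weather = {
    "name": "get_weather",
    "description": "Look up current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string"},
            "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
        },
        "required": ["city"],
    },
}

# A structured-output model call would return machine-parseable JSON
# like this, which your agent loop can validate and dispatch on:
raw = '{"name": "get_weather", "arguments": {"city": "Oslo", "unit": "celsius"}}'
call = json.loads(raw)
print(call["name"], call["arguments"]["city"])
```

The value of structured output is exactly this round-trip: the response is data you can `json.loads` and route, not prose you have to scrape.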
The edge models are particularly notable for the audio input capability. Running speech understanding locally on a smartphone — without any cloud round-trip — is the kind of feature that enables a whole category of privacy-sensitive applications that simply weren't practical before. Healthcare, legal, financial services — anywhere the data can't leave the device.
Where do you access and deploy Gemma 4?
Distribution is wide. The 31B and 26B models are available on Hugging Face, Kaggle, Ollama, and Google AI Studio. The edge models (E2B and E4B) are available through Google AI Edge Gallery for mobile development. Android developers can prototype agentic flows in the AICore Developer Preview today.
Framework support at launch covers Hugging Face Transformers, TRL, Transformers.js, vLLM, llama.cpp, MLX, NVIDIA NIM and NeMo, LM Studio, Unsloth, SGLang, and more. Fine-tuning is available through Google Colab or Vertex AI, and production scaling goes through Google Cloud via Vertex AI endpoints, Cloud Run with NVIDIA RTX PRO 6000 GPUs, or GKE for teams that want infrastructure control.
For local deployment, quantizing the 31B to 4-bit via Ollama or GGUF format brings it down to 16GB of RAM. An RTX 4090 handles it comfortably at that precision. The 26B MoE is even more practical locally — because only 3.8B parameters are active at inference, the memory footprint is much more manageable than its total parameter count suggests.
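The 16GB figure is easy to sanity-check with back-of-envelope math: 31 billion weights at 4 bits each is about 15.5 GB before runtime overhead. A quick sketch — the 10% overhead factor is a rough assumption on my part, and real KV-cache usage grows with context length:

```python
def quantized_footprint_gb(params_b, bits_per_weight, overhead=1.1):
    """Back-of-envelope VRAM/RAM estimate for quantized weights.
    `overhead` is a rough 10% allowance for KV cache and activations;
    actual usage varies with context length and runtime."""
    bytes_total = params_b * 1e9 * bits_per_weight / 8
    return bytes_total * overhead / 1e9

print(round(quantized_footprint_gb(31, 4, overhead=1.0), 1))  # weights alone
print(round(quantized_footprint_gb(31, 4), 1))                # with headroom
```

Weights alone land at 15.5 GB, which is why the 4-bit 31B just squeezes into a 16GB budget, and why a 24GB RTX 4090 runs it with room to spare.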
How does it compare against Qwen and Llama 4?
This is genuinely a knife fight, not a coronation. April 2026 is arguably the most crowded month in open-source AI history. Alibaba released Qwen 3.6-Plus on the same day as Gemma 4, with a 1 million token context window. Meta's Llama 4 Scout already offers 10 million tokens. Google entered a competitive environment where several strong models are fighting for developer attention simultaneously.
| Model | Variant | Context Window | License | Arena AI Rank |
|---|---|---|---|---|
| Gemma 4 31B | Dense | 256K | Apache 2.0 ✅ | #3 Global |
| Gemma 4 26B | MoE (3.8B active) | 256K | Apache 2.0 ✅ | #6 Global |
| Qwen 3.6-Plus | MoE | 1M | Apache 2.0 ✅ | Top 10 |
| Llama 4 Scout | MoE | 10M | Llama License | Top 10 |
Where Gemma 4 wins clearly: human preference scores on Arena AI (the 31B ELO of 1,452 beats models with far more parameters), scientific reasoning benchmarks, and edge deployment capabilities. Where it faces real competition: Qwen's vastly larger context window (1M vs 256K) and Llama 4's 10M tokens for long-document applications. Developers choosing between these aren't picking a winner — they're making specific trade-offs based on their deployment environment and use case. According to analysis from VentureBeat, the performance gap between the MoE and dense variants is modest given the significant inference cost advantage of the MoE architecture — making the 26B a more practical choice for most production scenarios.
My Take
I've watched Google release open models with one hand tied behind their back for two years now. Gemma 1, Gemma 2, Gemma 3 — technically solid, but the license was always the problem nobody wanted to talk about loudly. Compliance teams flagged it. Legal review created friction. Developers who wanted to ship commercial products just went with Llama or Qwen instead. Google kept releasing better models and kept losing the adoption race for the same preventable reason.
The Apache 2.0 switch is more significant than any benchmark number in this release. Not because the numbers are bad — they're genuinely strong, especially the 26B MoE punching well above its active parameter weight — but because benchmarks improve every few months regardless. A license change is structural. It changes the calculus for every company that was previously in "wait and evaluate" mode. Those teams don't need to evaluate anymore. The legal obstacle is gone.
The 26B MoE is the model I'd actually pay attention to. Only 3.8 billion parameters active at inference, 256K context, running on a single H100, scoring 79.2% on GPQA Diamond against a 120B dense model from OpenAI. That's an architecture story, not just a numbers story. If you're building production pipelines where latency and memory costs are real constraints, this is a more interesting model than the 31B despite lower benchmark ceilings.
What I'm watching: the edge models with native audio input are the sleeper feature here. Local speech understanding on mobile devices — no cloud required — opens a category of privacy-sensitive applications that were genuinely impractical before. Healthcare, legal, financial services. That's where I'd expect to see the most interesting production deployments in the next six months, not from the 31B flagship on a server rack.
⚡ Key Takeaways
- Gemma 4 is the first Google open model under Apache 2.0 — full commercial use, no legal friction
- Four model sizes: E2B and E4B for edge/mobile, 26B MoE and 31B Dense for workstations and cloud
- 31B ranks #3 globally on Arena AI leaderboard, scoring 85.7% on GPQA Diamond scientific reasoning
- 26B MoE activates only 3.8B parameters at inference — better latency than its size suggests
- Edge models support native audio input for local speech processing — no cloud required
- Available now on Hugging Face, Kaggle, Ollama, Google AI Studio, and Vertex AI
- Directly derived from Gemini 3 research — proprietary model capabilities in an open package
- Context windows: 128K for edge models, 256K for the larger variants
Frequently Asked Questions
Is Gemma 4 actually free for commercial use?
Yes. For the first time in the Gemma series, Google is releasing all four models under the Apache 2.0 license. This means you can use the models commercially, modify them, fine-tune them, and deploy them in production without any special permissions or licensing agreements. There are no termination clauses and no restrictions on commercial applications. This is a meaningful departure from previous Gemma releases, which had a custom Google usage policy that many legal and compliance teams flagged as problematic for commercial deployment.
Can I run Gemma 4 locally on my GPU?
Yes, and the hardware requirements are more accessible than you might expect. The 26B MoE model, despite its total parameter count, only activates 3.8 billion parameters during inference — so its actual memory footprint is much lower than a full dense 26B model. The 31B Dense can be quantized to 4-bit precision via Ollama or GGUF format and fits in 16GB of RAM, making it compatible with an RTX 4090 (24GB VRAM) or RTX 4080 in most configurations. Both larger models were benchmarked running on a single H100 GPU — not a multi-GPU cluster.
What makes the 26B MoE different from the 31B Dense?
The Mixture of Experts architecture in the 26B model means that only a subset of the model's parameters — about 3.8 billion — activates for any given input. The model routes each token through whichever "expert" networks are most relevant, rather than running everything. This gives significantly better inference latency and lower memory usage compared to a dense model of similar total size. The 26B MoE scored 79.2% on GPQA Diamond versus the 31B Dense's 85.7%, so there is a capability gap, but for most production applications the efficiency trade-off favors the MoE variant.
How does Gemma 4 compare to GPT-4o and Claude for everyday tasks?
The 31B Dense is competitive on many benchmarks with commercial closed models. Its Arena AI ELO of 1,452 places it above many models with hundreds of billions of parameters, and human evaluators consistently preferred its responses in comparative testing. For coding specifically, the 2,150 Codeforces ELO rating exceeds 98% of human competitive programmers. Where it still trails frontier closed models is on tasks requiring very long context (GPT-4o and Claude offer larger effective context windows with stronger long-document comprehension in practice) and on general multimodal reasoning where closed models still hold advantages at the top end.
What is the context window for Gemma 4 models?
The edge models (E2B and E4B) support 128,000 tokens. The larger models — the 26B MoE and 31B Dense — support up to 256,000 tokens. For comparison, competitors like Qwen 3.6-Plus offer 1 million tokens and Llama 4 Scout offers 10 million tokens. Google's 256K is competitive for most standard use cases, but if your application requires very long document processing or extremely long conversations without truncation, the context window gap matters.
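For intuition on what those token counts mean in practice, a common rule of thumb is roughly 0.75 English words per token. A quick sketch — the ratio is tokenizer-dependent and Gemma 4's tokenizer behavior isn't detailed here, so treat these as order-of-magnitude estimates:

```python
def approx_words(tokens, words_per_token=0.75):
    """Rough English-word capacity for a token budget. The 0.75
    words-per-token ratio is a rule of thumb, not a measured value
    for any specific tokenizer."""
    return int(tokens * words_per_token)

for name, ctx in [("edge (E2B/E4B)", 128_000), ("26B/31B", 256_000)]:
    print(name, approx_words(ctx), "words")
```

By this estimate, 256K tokens is on the order of 190,000 words — several full-length books in a single prompt — which is why the gap to 1M or 10M tokens only matters for genuinely long-document workloads.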
Where can I try Gemma 4 right now without setting anything up?
Google AI Studio is the fastest no-setup option — both the 31B and 26B models are available there directly. For a more developer-oriented experience, the models are on Hugging Face and Kaggle. If you want local inference, Ollama lets you pull and run the models with a single command, and llama.cpp is supported for custom quantization setups. The edge models (E2B and E4B) are accessible through Google AI Edge Gallery for mobile development testing.
🔗 More From AI Revolution
- Qwen 3.6 Plus vs Claude Opus 4.6: What the Benchmarks Actually Show — the Chinese open model that launched the same day as Gemma 4
- The Hidden Cost of Claude's 200K Context Window — why larger context windows come with real trade-offs
- AI Just Broke Software's Unspoken Moat — why open licensing questions are reshaping the whole industry
What to Watch Next
Gemma 4 is a strong release. The Apache 2.0 license removes the biggest obstacle to commercial adoption, the benchmark numbers are independently verified and genuinely competitive, and the MoE architecture on the 26B makes it a practical choice for production workloads where the 31B's resource requirements are too steep.
But the honest caveat is that open-weight AI in April 2026 is moving faster than any single release can define. Qwen 3.6-Plus launched the same day with a 1M token context. Meta's Llama 4 is already in production. The leaderboard position that looks strong today will be challenged within weeks. What's more durable is the license change — that structural shift doesn't get erased by the next benchmark update. Whether the developer community actually builds on it at the scale Google is hoping for is the real question to track over the next few months.
Sources: Google Blog — Gemma 4 launch · VentureBeat — Apache 2.0 analysis · The Decoder — benchmark verification