DiffusionGemma is Google's fastest AI yet, but it comes with a big trade-off

Google's new AI trades shine for speed.
2 hours ago

Image: DiffusionGemma (credit: Google)
TL;DR
  • DiffusionGemma writes a whole chunk of text in one go and then keeps polishing it rather than building it word by word.
  • Google says it can be up to 4x faster than standard autoregressive models, hitting 1,000+ tokens per second on an NVIDIA H100 and around 700 on an RTX 5090, thanks to parallel processing.
  • Output quality is still inferior to Gemma 4, so it’s more of an experimental tool than a finished product.

Google has released DiffusionGemma, an experimental AI model that takes a very different approach to how most chatbots generate text today. Instead of writing one word after another in a strict sequence, it generates a whole block of text at once and then keeps refining it until it becomes readable. The idea is to push for speed and hardware efficiency, even if it means giving up some polish in the final output.

Image: DiffusionGemma compared with other Gemma models (credit: Google)

This new AI model is open-sourced under the Apache 2.0 license and is aimed at developers and researchers rather than everyday users. To understand why this matters, it helps to look at how most large language models work. Systems like Google’s Gemma 4 generate text step by step, one token at a time. Each new word depends on what came before it, which makes the process inherently sequential and harder to speed up.

DiffusionGemma, on the other hand, starts with a full canvas of random tokens, essentially noisy, unreadable text, and then repeatedly cleans it up in multiple passes. With each pass, the output becomes more structured and coherent until it settles into a final response. A simple way to picture it is that traditional models write, while DiffusionGemma drafts and edits everything at once.
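
The draft-then-refine loop described above can be sketched as a toy. This is not Google's algorithm or code: `toy_denoiser` is a hypothetical stand-in that already knows the answer (`TARGET`) and only invents a confidence score, and the rule of locking in an equal share of positions per pass is an assumption made for illustration. What it does show is the shape of the process: every position is proposed at once, and the canvas moves from noise to a finished block over a fixed number of passes.

```python
import random

VOCAB = ["the", "cat", "sat", "on", "a", "mat", "dog", "ran"]
# Assumption: stands in for the text a trained model would converge to.
TARGET = ["the", "cat", "sat", "on", "a", "mat"]


def toy_denoiser(canvas, rng):
    """Hypothetical model: proposes (token, confidence) for EVERY position at once."""
    return [(TARGET[i], rng.random()) for i in range(len(canvas))]


def refine(length, passes, rng):
    """Start from random tokens and lock in the most confident proposals each pass."""
    canvas = [rng.choice(VOCAB) for _ in range(length)]  # noisy starting canvas
    fixed = [False] * length
    history = [list(canvas)]
    for step in range(passes):
        remaining = [i for i in range(length) if not fixed[i]]
        if not remaining:
            break
        proposals = toy_denoiser(canvas, rng)
        # Ceil division: spread the open positions evenly over the passes left.
        quota = -(-len(remaining) // (passes - step))
        best = sorted(remaining, key=lambda i: proposals[i][1], reverse=True)[:quota]
        for i in best:
            canvas[i] = proposals[i][0]
            fixed[i] = True
        history.append(list(canvas))
    return canvas, history


if __name__ == "__main__":
    final, history = refine(len(TARGET), passes=3, rng=random.Random(0))
    for n, snapshot in enumerate(history):
        print(n, " ".join(snapshot))
```

Each printed row is one pass. The number of passes is fixed in advance, so generation time depends on the pass count rather than on the length of the block.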

That shift has a direct impact on performance. Per Google’s claims, DiffusionGemma can be up to four times faster than standard autoregressive models in low-concurrency scenarios, where a single user or process uses the GPU. On high-end hardware, Google claims more than 1,000 tokens per second on an NVIDIA H100 and over 700 tokens per second on an RTX 5090.
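
To put those throughput figures in wall-clock terms, here is a rough back-of-the-envelope calculation. It takes Google's claimed rates at face value as flat averages, uses a 256-token reply as an illustrative length, and assumes the "up to 4x" figure applies to the same H100 setup, which the article does not confirm. Real latency also includes prompt processing, which is ignored here.

```python
REPLY_TOKENS = 256  # illustrative reply length


def seconds_for(tokens, tokens_per_second):
    """Time to emit `tokens` at a flat throughput (ignores prompt processing)."""
    return tokens / tokens_per_second


h100_seconds = seconds_for(REPLY_TOKENS, 1000)         # claimed H100 rate
rtx5090_seconds = seconds_for(REPLY_TOKENS, 700)       # claimed RTX 5090 rate
# Assumption: a sequential baseline exactly 4x slower than the H100 figure.
baseline_seconds = seconds_for(REPLY_TOKENS, 1000 / 4)

if __name__ == "__main__":
    print(f"H100:     {h100_seconds:.3f} s")
    print(f"RTX 5090: {rtx5090_seconds:.3f} s")
    print(f"Baseline: {baseline_seconds:.3f} s")
```

Under these assumptions, a reply that would take about a second from a sequential model arrives in roughly a quarter of a second.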

Under the hood, DiffusionGemma is a 26-billion-parameter Mixture-of-Experts model, but it does not activate all of that at once. Only about 3.8 billion parameters are used during inference, helping keep compute requirements manageable. Google says this makes it possible to run the model on high-end consumer GPUs when quantized, with a memory footprint of around 18GB VRAM.
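
The memory figure can be sanity-checked with simple arithmetic. The sketch below assumes decimal gigabytes and, as an upper bound, that the whole 18GB budget goes to weights. It also assumes every expert stays resident in memory even though only a fraction is active per token, which is how Mixture-of-Experts models are usually served. Google has not stated which quantization format is used.

```python
TOTAL_PARAMS = 26e9    # total parameters, from the article
ACTIVE_PARAMS = 3.8e9  # parameters used during inference, from the article
VRAM_BYTES = 18e9      # ~18GB, assuming decimal gigabytes


def weight_bytes(params, bits_per_param):
    """Storage needed for the weights alone at a given precision."""
    return params * bits_per_param / 8


active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS
gb_at_16_bits = weight_bytes(TOTAL_PARAMS, 16) / 1e9
# Upper bound on average precision if all 18GB held nothing but weights.
max_bits_per_param = VRAM_BYTES * 8 / TOTAL_PARAMS

if __name__ == "__main__":
    print(f"Active share of parameters: {active_fraction:.1%}")
    print(f"Weights at 16-bit precision: {gb_at_16_bits:.0f} GB")
    print(f"Max average bits per parameter in 18GB: {max_bits_per_param:.2f}")
```

At 16-bit precision the weights alone would need about 52GB, so fitting in 18GB implies an average of roughly 5.5 bits per parameter or less, which is why the quantization caveat matters.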

Where things get more interesting is how the model actually generates text. It can produce up to 256 tokens in parallel in a single step, and each token can attend to every other token in the block. That gives the model a global view of the output instead of a strictly linear one.
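
The difference between a linear and a global view comes down to which positions are allowed to see which. The sketch below builds both visibility patterns for a 256-token block as plain boolean grids. The function names are illustrative and not taken from any DiffusionGemma code; the causal mask is the standard pattern for one-token-at-a-time models.

```python
BLOCK = 256  # tokens generated in parallel, per the article


def causal_mask(n):
    """Sequential generation: position i sees only positions 0..i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def full_block_mask(n):
    """Block generation as described: every position sees every other."""
    return [[True] * n for _ in range(n)]


def visible_pairs(mask):
    """Count the (query, key) pairs the mask allows."""
    return sum(sum(row) for row in mask)


if __name__ == "__main__":
    print("causal pairs:", visible_pairs(causal_mask(BLOCK)))
    print("full pairs:  ", visible_pairs(full_block_mask(BLOCK)))
    # In the causal pattern the first token cannot see the last one.
    print("first sees last (causal):", causal_mask(BLOCK)[0][BLOCK - 1])
    print("first sees last (full):  ", full_block_mask(BLOCK)[0][BLOCK - 1])
```

With the full mask, an early token can be revised in light of a later one, which the causal pattern rules out by construction.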

This makes it better suited for structured or rule-based tasks. For example, it can help fill in missing sections of code, complete structured formats like JSON, work through logic-heavy problems such as Sudoku-style puzzles, or handle mathematical patterns where consistency across the whole output matters more than sentence-by-sentence flow. Because it sees the entire block at once, it can also correct contradictions within the same generation cycle, rather than waiting for a later token to fix them.

But there is a catch, and Google is upfront about it. DiffusionGemma does not match the output quality of its standard Gemma 4 models. The writing can be less stable, less refined, and not as reliable for complex or nuanced responses. So, you get speed but lose some polish.

Image: DiffusionGemma comparison (credit: Google)

That is why Google is positioning it as an experimental tool. It is designed for scenarios where responsiveness matters more than perfection, such as real-time AI tools, inline writing or coding assistants, and fast iterative workflows where users care more about instant feedback than final-quality text.

DiffusionGemma is therefore not meant to replace existing Gemini or Gemma models. It is a speed-first experiment that trades output quality for efficiency and responsiveness. But it also hints at a different direction for AI text generation, where models do not just predict the next word but generate and refine entire blocks of text simultaneously.
