Understanding diffusion large language models (dLLMs)
Hailed as a paradigm shift in a world of copycat autoregressive LLMs, Mercury Coder from Inception Labs runs about 10x faster than standard LLMs on the same underlying hardware.
There is a famous quote attributed to Michelangelo. When asked about sculpting David, he reportedly said, “It is easy. You just chip away the stone that doesn't look like David.”
Diffusion inference works in a similar way. Michelangelo started with a block of marble and carved David from it. Diffusion starts with noise and arrives at the final output by reducing the noise step by step.
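To make the idea concrete, here is a minimal sketch of that denoising loop. It assumes a toy setup where the "statue" is a known target vector and each step blends away a fraction of the noise; the target, step count, and schedule are all illustrative, not Mercury's actual algorithm, and a real diffusion model would use a learned denoiser instead of being handed the answer.

```python
import numpy as np

# Toy denoising loop: start from pure noise and move toward a clean
# "target" signal over a fixed number of steps. This only illustrates
# the start-with-noise, refine-step-by-step idea; real diffusion models
# learn a denoiser network rather than being given the target.
rng = np.random.default_rng(0)
target = np.array([1.0, -2.0, 0.5, 3.0])   # the "statue" hiding in the marble
x = rng.normal(size=target.shape)          # the block of marble: pure noise

steps = 10
for t in range(steps):
    estimate = target                      # a trained denoiser would predict this from x
    alpha = (t + 1) / steps                # how much noise to remove this step
    x = (1 - alpha) * x + alpha * estimate
    print(f"step {t + 1}: distance to target = {np.linalg.norm(x - target):.3f}")
```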
The video below shows an example of a diffusion model generating code. It is not sped up, and as you can see, it takes ~5 seconds to generate the code for a snake game.
What are diffusion LLMs and how are they different from other LLMs?
You might have already used diffusion models before, just not in text generation.
Diffusion models are making huge leaps in how we generate text and code. Unlike current autoregressive models (GPTs, Claudes, Llamas), which create content serially by predicting one next token at a time, diffusion models take a more flexible, whole-sequence approach. They start with noise and gradually refine it through denoising steps. This process allows them to consider the output as a whole and adapt based on feedback throughout generation.
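The difference in decoding strategy is easiest to see side by side. The sketch below contrasts the two loops using a stand-in predict function; it is a schematic of the control flow only, not any particular model's implementation, and it treats text diffusion as iterative unmasking, the formulation Mercury-style dLLMs are reported to use.

```python
MASK = "<mask>"

def predict(tokens, position):
    """Stand-in for a trained model: returns the best token for a position.
    Here it simply 'knows' the answer, purely for illustration."""
    answer = ["def", "add", "(", "a", ",", "b", ")", ":", "return", "a+b"]
    return answer[position]

def autoregressive_decode(length):
    # Serial: commit one token per step, strictly left to right.
    tokens = []
    for i in range(length):
        tokens.append(predict(tokens, i))
    return tokens

def diffusion_decode(length, tokens_per_step=3):
    # Parallel refinement: start fully masked, then unmask several
    # positions per round until nothing is masked. A real dLLM would
    # unmask the positions the model is most confident about first.
    tokens = [MASK] * length
    while MASK in tokens:
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        for i in masked[:tokens_per_step]:
            tokens[i] = predict(tokens, i)
    return tokens

print(autoregressive_decode(10))   # 10 model calls, one token each
print(diffusion_decode(10))        # ~4 refinement rounds over the whole sequence
```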
Because of this paradigm shift in approach, diffusion models are reported to do well on reasoning tasks and to produce more coherent, relevant responses. They can also correct mistakes and inaccuracies more readily than autoregressive models, which are locked into the sequence generated so far and can carry errors forward.
We've already seen diffusion models shine in areas like video, image, and audio generation, with tools like DALL-E, Sora, and Midjourney showing their potential. While applying them to text and code has been tricky, recent progress indicates that these challenges are easing, making it possible to use diffusion models effectively in these fields too. The demo you saw above is one such example of diffusion-based text generation.
This breakthrough could usher in a new wave of AI solutions that are not only quicker and more efficient but also produce higher-quality content. With the strengths of diffusion models, we might see better outcomes in tasks like understanding natural language and generating code, which would make top-notch AI tools more available and useful for many applications.
In short, moving from autoregressive to diffusion-based models is a big leap forward in AI development. It could completely change how we create and interact with text and code. As research continues, we can look forward to exciting new applications that tap into the unique benefits of diffusion techniques.
How to get started
You can test out Mercury Coder, a diffusion-based code generation model from Inception Labs. It has strong code generation abilities, as we saw in the video above. It ranks at the top for speed and has accuracy comparable to GPT-4o Mini.
Mercury Coder runs about 10x faster on the same hardware.
This model runs 5 to 10 times faster than current autoregressive LLMs, hitting speeds of over 1,000 tokens per second on NVIDIA H100 GPUs. It improves response quality by refining its answers over multiple denoising passes rather than committing to tokens one at a time. On coding tasks, it outshines models like GPT-4o Mini and Claude 3.5 Haiku. Plus, it's versatile, working well for things like retrieval-augmented generation and tool use.
One of its standout features is that it can update multiple tokens at once, which leads to better outputs. It's also easy to integrate as a drop-in replacement for existing LLMs, as the sketch below shows. The great news is that its performance boosts don't rely on exotic new hardware: you can get high throughput even with standard setups. Overall, this model marks a big leap forward in AI code generation, and the gains will compound with faster chips in the future or accelerated inference platforms such as Groq, Cerebras, and SambaNova.
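As a sketch of what that drop-in integration might look like, the snippet below calls Mercury Coder through an OpenAI-compatible chat completions client. The base URL and model name here are assumptions for illustration; check Inception Labs' documentation for the actual endpoint and model identifiers.

```python
from openai import OpenAI

# Assumed endpoint and model id, for illustration only;
# consult Inception Labs' docs for the real values.
client = OpenAI(
    base_url="https://api.inceptionlabs.ai/v1",
    api_key="YOUR_INCEPTION_API_KEY",
)

response = client.chat.completions.create(
    model="mercury-coder-small",
    messages=[
        {"role": "user",
         "content": "Write a Python function that checks whether a string is a palindrome."},
    ],
)
print(response.choices[0].message.content)
```

Because the request shape matches the familiar chat completions API, swapping Mercury in mostly comes down to changing the base URL and model name.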
To sum it up, Mercury Coder's speed and efficiency are setting new standards in code generation. Its dLLM architecture not only speeds things up but also improves output quality. Its flexibility makes it easy to plug into different applications and workflows. Thanks to its innovative algorithms, it keeps performing well without the need for advanced hardware. For developers looking for dependable code generation, this model is a real gem.
Note: Inception Labs has not publicly released a full technical paper detailing their "Mercury" diffusion LLM.