

Mercury 2 is a reasoning language model built for production AI applications where speed and responsiveness are critical. Unlike traditional autoregressive models, which emit one token at a time, it employs a diffusion-based architecture.
Mercury 2 generates responses through parallel refinement, producing multiple tokens simultaneously and converging on the final text in a small number of steps. Key features include tunable reasoning, a 128K context length, native tool use, and schema-aligned JSON output. The model achieves 1,009 tokens per second on NVIDIA Blackwell GPUs and is priced at $0.25 per million input tokens and $0.75 per million output tokens.
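The listed throughput and pricing make back-of-envelope budgeting straightforward. The sketch below works through the arithmetic at the stated rates ($0.25 per million input tokens, $0.75 per million output tokens, ~1,009 tokens/s); the example token counts are illustrative, not from the source.

```python
# Cost estimate at the listed rates: $0.25 / 1M input tokens,
# $0.75 / 1M output tokens.
def request_cost(input_tokens, output_tokens):
    return input_tokens * 0.25e-6 + output_tokens * 0.75e-6

# e.g. a 4,000-token prompt with a 1,000-token response:
print(f"${request_cost(4_000, 1_000):.6f}")  # $0.001750

# At ~1,009 tokens/s, a 1,000-token response streams in about one second.
print(f"{1_000 / 1_009:.2f} s")  # ~0.99 s
```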
Because the diffusion-based approach refines many token positions in parallel rather than emitting them one at a time, Mercury 2 generates text more than 5x faster than traditional models. That speedup makes reasoning-grade quality attainable within real-time latency budgets, changing the usual trade-off between intelligence and speed.
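The parallel-refinement idea can be illustrated with a toy decoder: every position starts masked, and each refinement step commits several tokens at once instead of one. The canned scorer below is a stand-in for a real diffusion model (it simply knows the target sentence), so this is a sketch of the decoding pattern, not of Mercury 2's actual algorithm.

```python
# Illustrative only: diffusion-style decoding commits multiple masked
# positions per step, so a 9-token sentence finishes in 3 steps rather
# than 9 sequential ones.
TARGET = "the quick brown fox jumps over the lazy dog".split()
MASK = "<mask>"

def toy_model(draft):
    """Pretend model: proposes the target token at each masked position,
    with a fake confidence that is higher next to committed context."""
    proposals = {}
    for i, tok in enumerate(draft):
        if tok == MASK:
            neighbors = sum(draft[j] != MASK for j in (i - 1, i + 1)
                            if 0 <= j < len(draft))
            proposals[i] = (TARGET[i], 0.5 + 0.25 * neighbors)
    return proposals

def diffusion_decode(length, commits_per_step=3):
    draft = [MASK] * length
    steps = 0
    while MASK in draft:
        proposals = toy_model(draft)
        # Commit the highest-confidence positions in parallel.
        best = sorted(proposals, key=lambda i: -proposals[i][1])
        for i in best[:commits_per_step]:
            draft[i] = proposals[i][0]
        steps += 1
    return " ".join(draft), steps

text, steps = diffusion_decode(len(TARGET))
print(text)   # full sentence recovered
print(steps)  # 3 parallel refinement steps for 9 tokens
```

An autoregressive decoder would need one step per token; here the step count scales with sequence length divided by the number of tokens committed per step, which is the source of the latency advantage.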
Mercury 2 excels in latency-sensitive applications where user experience is non-negotiable. It enables coding and editing workflows with fast autocomplete and next-edit suggestions, supports agentic workflows that chain dozens of inference calls per task, powers real-time voice interfaces with natural speech cadences, and enhances search and RAG pipelines with multi-hop retrieval and summarization.
Mercury 2 targets production deployments requiring high-speed reasoning: developers building interactive applications, enterprises running agentic workflows at scale, companies developing real-time voice interfaces, and organizations implementing intelligent search, spanning coding environments, advertising optimization platforms, conversational AI systems, and enterprise search solutions. The model is OpenAI API compatible and can be integrated into existing stacks without rewrites.
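OpenAI API compatibility means existing clients only need a new base URL. The sketch below builds a chat-completions request body in the OpenAI schema using only the standard library; the base URL and model identifier are hypothetical placeholders, not confirmed values.

```python
import json

# Sketch of an OpenAI-compatible chat request. The endpoint path follows
# the OpenAI API shape; BASE_URL and the model name are illustrative.
BASE_URL = "https://api.example.com/v1"  # hypothetical

payload = {
    "model": "mercury-2",  # hypothetical model identifier
    "messages": [
        {"role": "user", "content": "Summarize this retrieval result."}
    ],
    "response_format": {"type": "json_object"},  # structured JSON output
    "max_tokens": 256,
}

# Because the request schema matches OpenAI's, an existing client can be
# pointed at the new base URL without rewriting application code.
body = json.dumps(payload)
print(f"POST {BASE_URL}/chat/completions")
print(body)
```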