

Mercury 2 is a reasoning language model built for production AI applications where speed and responsiveness are critical. Unlike traditional autoregressive models, which emit one token at a time, it employs a diffusion-based architecture.
Mercury 2 generates responses through parallel refinement, producing multiple tokens simultaneously and converging on the final text in a small number of steps. Key features include tunable reasoning, a 128K context length, native tool use, and schema-aligned JSON output. The model achieves 1,009 tokens per second on NVIDIA Blackwell GPUs and is priced at $0.25 per million input tokens and $0.75 per million output tokens.
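The listed throughput and pricing make back-of-envelope budgeting straightforward. The sketch below works through the arithmetic at the stated rates ($0.25 per million input tokens, $0.75 per million output tokens, ~1,009 tokens/s); the example token counts are illustrative, not from the source.

```python
# Cost estimate at the listed rates: $0.25 / 1M input tokens,
# $0.75 / 1M output tokens.
def request_cost(input_tokens, output_tokens):
    return input_tokens * 0.25e-6 + output_tokens * 0.75e-6

# e.g. a 4,000-token prompt with a 1,000-token response:
print(f"${request_cost(4_000, 1_000):.6f}")  # $0.001750

# At ~1,009 tokens/s, a 1,000-token response streams in about one second.
print(f"{1_000 / 1_009:.2f} s")  # ~0.99 s
```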
Because the diffusion-based approach refines many token positions in parallel rather than emitting them one at a time, Mercury 2 generates text more than 5x faster than traditional models. That speedup makes reasoning-grade quality attainable within real-time latency budgets, changing the usual trade-off between intelligence and speed.
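The parallel-refinement idea can be illustrated with a toy decoder: every position starts masked, and each refinement step commits several tokens at once instead of one. The canned scorer below is a stand-in for a real diffusion model (it simply knows the target sentence), so this is a sketch of the decoding pattern, not of Mercury 2's actual algorithm.

```python
# Illustrative only: diffusion-style decoding commits multiple masked
# positions per step, so a 9-token sentence finishes in 3 steps rather
# than 9 sequential ones.
TARGET = "the quick brown fox jumps over the lazy dog".split()
MASK = "<mask>"

def toy_model(draft):
    """Pretend model: proposes the target token at each masked position,
    with a fake confidence that is higher next to committed context."""
    proposals = {}
    for i, tok in enumerate(draft):
        if tok == MASK:
            neighbors = sum(draft[j] != MASK for j in (i - 1, i + 1)
                            if 0 <= j < len(draft))
            proposals[i] = (TARGET[i], 0.5 + 0.25 * neighbors)
    return proposals

def diffusion_decode(length, commits_per_step=3):
    draft = [MASK] * length
    steps = 0
    while MASK in draft:
        proposals = toy_model(draft)
        # Commit the highest-confidence positions in parallel.
        best = sorted(proposals, key=lambda i: -proposals[i][1])
        for i in best[:commits_per_step]:
            draft[i] = proposals[i][0]
        steps += 1
    return " ".join(draft), steps

text, steps = diffusion_decode(len(TARGET))
print(text)   # full sentence recovered
print(steps)  # 3 parallel refinement steps for 9 tokens
```

An autoregressive decoder would need one step per token; here the step count scales with sequence length divided by the number of tokens committed per step, which is the source of the latency advantage.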
Mercury 2 excels in latency-sensitive applications where user experience is non-negotiable. It enables coding and editing workflows with fast autocomplete and next-edit suggestions, supports agentic workflows that chain dozens of inference calls per task, powers real-time voice interfaces with natural speech cadences, and enhances search and RAG pipelines with multi-hop retrieval and summarization.
Mercury 2 targets production deployments requiring high-speed reasoning: developers building interactive applications, enterprises running agentic workflows at scale, companies developing real-time voice interfaces, and organizations implementing intelligent search, spanning coding environments, advertising optimization platforms, conversational AI systems, and enterprise search solutions. The model is OpenAI API compatible and can be integrated into existing stacks without rewrites.
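OpenAI API compatibility means existing clients only need a new base URL. The sketch below builds a chat-completions request body in the OpenAI schema using only the standard library; the base URL and model identifier are hypothetical placeholders, not confirmed values.

```python
import json

# Sketch of an OpenAI-compatible chat request. The endpoint path follows
# the OpenAI API shape; BASE_URL and the model name are illustrative.
BASE_URL = "https://api.example.com/v1"  # hypothetical

payload = {
    "model": "mercury-2",  # hypothetical model identifier
    "messages": [
        {"role": "user", "content": "Summarize this retrieval result."}
    ],
    "response_format": {"type": "json_object"},  # structured JSON output
    "max_tokens": 256,
}

# Because the request schema matches OpenAI's, an existing client can be
# pointed at the new base URL without rewriting application code.
body = json.dumps(payload)
print(f"POST {BASE_URL}/chat/completions")
print(body)
```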