MiniCPM-o 4.5 is the latest and most capable model in the MiniCPM-o series. It is an end-to-end multimodal AI model built on SigLip2, Whisper-medium, CosyVoice2, and Qwen3-8B, with 9B parameters in total. It delivers significant performance improvements over its predecessors and introduces full-duplex multimodal live streaming, enabling real-time processing of video and audio inputs while generating concurrent text and speech outputs.
The model features leading visual capability, with an average score of 77.6 on OpenCompass, surpassing proprietary models such as GPT-4o and Gemini 2.0 Pro. Its speech stack supports bilingual real-time conversation in English and Chinese, configurable voices, voice cloning, and role play. Full-duplex streaming lets the model process real-time video and audio inputs while generating outputs, with neither side blocking the other. It also offers strong OCR capability, efficiently handling high-resolution images up to 1.8 million pixels and videos at up to 10 fps.
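For reference, below is a minimal single-image chat sketch in the style of earlier MiniCPM-o releases. The repo id `openbmb/MiniCPM-o-4_5` and the `.chat()` interface are assumptions carried over from MiniCPM-o 2.6, not a confirmed API for this release.

```python
# Minimal single-image chat sketch. Assumes MiniCPM-o 4.5 keeps the
# AutoModel + .chat() interface of earlier MiniCPM-o releases; the repo id
# below is an assumption, not a confirmed identifier.
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model_id = "openbmb/MiniCPM-o-4_5"  # assumed repo id
model = AutoModel.from_pretrained(
    model_id, trust_remote_code=True, torch_dtype=torch.bfloat16
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# A high-resolution document photo exercises the OCR path described above.
image = Image.open("receipt.jpg").convert("RGB")
msgs = [{"role": "user", "content": [image, "Transcribe all text in this image."]}]

answer = model.chat(msgs=msgs, tokenizer=tokenizer)
print(answer)
```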
MiniCPM-o 4.5 processes continuous, real-time video and audio input streams while generating concurrent text and speech output streams, all in an end-to-end fashion. The model can see, listen, and speak at the same time without mutual blocking, creating a fluid, real-time omnimodal conversation experience. It can also interact proactively, initiating reminders or comments based on its continuous understanding of the live scene.
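A sketch of how such a duplex loop might be driven from Python is shown below, modeled loosely on the omni-streaming examples of earlier MiniCPM-o releases. The `streaming_prefill` and `streaming_generate` method names, their arguments, and the playback helper are all assumptions for illustration, not a documented API of this release.

```python
# Illustrative duplex loop: feed 1-second audio/video chunks in while draining
# generated text/speech out. streaming_prefill / streaming_generate are modeled
# on earlier MiniCPM-o streaming examples and are ASSUMED names, not confirmed.
import uuid

def duplex_loop(model, tokenizer, av_chunks):
    """av_chunks yields dicts, each holding one video frame + 1 s of audio."""
    session_id = str(uuid.uuid4())
    for chunk in av_chunks:
        # Push the latest slice of the live scene into the model's context.
        msg = {"role": "user", "content": [chunk["frame"], chunk["audio"]]}
        model.streaming_prefill(session_id=session_id, msgs=[msg], tokenizer=tokenizer)

        # Drain any response produced so far; input ingestion is not blocked.
        for out in model.streaming_generate(session_id=session_id, tokenizer=tokenizer):
            if out.get("text"):
                print(out["text"], end="", flush=True)
            if out.get("audio_wav") is not None:
                play(out["audio_wav"])  # hypothetical audio playback helper
```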
The model achieves state-of-the-art end-to-end English document parsing on OmniDocBench, outperforming proprietary models such as Gemini-3 Flash and GPT-5. Its trustworthiness matches Gemini 2.5 Flash on MMHal-Bench, it supports more than 30 languages, and it delivers high-quality spoken interaction with natural prosody, including appropriate rhythm, stress, and pauses.
The model runs through llama.cpp and Ollama for efficient CPU inference on local devices, ships quantized models in int4 and GGUF formats across 16 sizes, and works with vLLM and SGLang for high-throughput inference. It also supports FlagOS as a unified multi-chip backend plugin, fine-tuning with LLaMA-Factory, and online web demos. The llama.cpp-omni inference framework enables full-duplex multimodal live streaming on local devices such as PCs and MacBooks.
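As one concrete deployment path, here is a minimal vLLM sketch for offline batch inference. It assumes MiniCPM-o 4.5 is registered in vLLM the way earlier MiniCPM checkpoints are, and the model id is again an assumption.

```python
# High-throughput offline inference via vLLM. Assumes vLLM support analogous
# to earlier MiniCPM checkpoints; the model id is an assumption.
from vllm import LLM, SamplingParams

llm = LLM(model="openbmb/MiniCPM-o-4_5", trust_remote_code=True)
params = SamplingParams(temperature=0.7, max_tokens=256)

prompts = ["List three use cases for on-device multimodal assistants."]
for out in llm.generate(prompts, params):
    print(out.outputs[0].text)
```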
MiniCPM-o 4.5 is designed for developers and researchers building multimodal AI applications that require real-time processing. It serves users who need local, cloud-free deployment for privacy-sensitive applications, voice cloning and role-play features, document parsing, multilingual support, or AI systems that handle visual, audio, and text inputs simultaneously.