

Molmo 2 is a suite of state-of-the-art vision-language models designed for advanced video and image analysis. It extends Molmo's strengths in grounded vision to video and multi-image understanding, providing capabilities for video pointing, tracking, and dense captioning. The models can analyze single images, multi-image inputs, and video clips of varying length, moving beyond simple descriptive answers to pinpointing events in space and time.
Molmo 2 offers three variants serving different needs: Molmo 2 (8B) is Qwen 3-based and provides the best overall performance for video grounding and QA; Molmo 2 (4B) is also Qwen 3-based and optimized for efficiency; Molmo 2-O (7B) is built on Olmo and offers a fully open end-to-end stack, including the underlying LLM. These models support video tracking, image and multi-image reasoning, video grounding (pointing and counting), long-video QA, and human preference evaluations.
The architecture consists of a vision encoder that processes images or video frames into visual tokens and a language model backbone (Qwen 3 or Olmo) that consumes those tokens alongside text. A lightweight connector interleaves visual tokens with timestamps, image indices, and text so the model can reason jointly over space, time, and language. Training proceeds in two stages: pretraining for alignment and grounding through joint image captioning and pointing, followed by supervised fine-tuning on a multimodal mixture.
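To make the interleaving concrete, here is a minimal Python sketch of how visual tokens could be merged with timestamp and image-index markers and a text prompt before reaching the language model. The `Frame` class, the marker format, and the `tokenize` callback are illustrative stand-ins, not Molmo 2's actual connector code or vocabulary.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Frame:
    timestamp: float          # seconds into the clip
    patch_tokens: List[int]   # visual token ids produced by the vision encoder

def build_multimodal_sequence(frames: List[Frame], question: str,
                              tokenize: Callable[[str], List[int]]) -> List[int]:
    """Interleave visual tokens with timestamp/index markers and the prompt.

    The marker strings below are illustrative; Molmo 2's real special tokens differ.
    """
    sequence: List[int] = []
    for idx, frame in enumerate(frames):
        # The image index and timestamp are rendered as text so the language
        # model can reason about *which* frame it is seeing and *when* it occurred.
        sequence += tokenize(f"<image {idx} t={frame.timestamp:.1f}s>")
        sequence += frame.patch_tokens
    sequence += tokenize(question)
    return sequence

# Toy usage with a fake word-hash tokenizer.
fake_tokenize = lambda s: [hash(w) % 50_000 for w in s.split()]
frames = [Frame(0.0, [101, 102, 103]), Frame(1.5, [201, 202, 203])]
tokens = build_multimodal_sequence(frames, "When did the cup fall?", fake_tokenize)
print(len(tokens), "tokens in the interleaved sequence")
```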
Molmo 2 enables counting-by-pointing, multi-object tracking with persistent IDs that follow objects across occlusions and re-entries, dense video captioning with highly descriptive narratives, anomaly detection that flags rare events, artifact detection that points to flaws in generative video, and subtitle-aware QA that combines visual evidence with in-video subtitles. At inference time, Molmo 2 scales gracefully: it can process more frames directly for maximum fidelity or adopt a SlowFast-style sampling strategy.
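As an illustration of the SlowFast-style idea, the sketch below splits a clip into a sparse set of frames kept at high resolution and a dense set kept at low resolution. The function name, frame counts, and stride are assumptions for illustration, not Molmo 2's actual sampling schedule.

```python
import numpy as np

def slowfast_sample(num_frames: int, slow_count: int = 8, fast_stride: int = 4):
    """Return frame indices for a SlowFast-style two-rate sampling scheme.

    Slow pathway: a few evenly spaced frames kept at full resolution.
    Fast pathway: many strided frames kept at reduced resolution.
    The counts and stride here are illustrative, not Molmo 2's schedule.
    """
    slow_idx = np.linspace(0, num_frames - 1, slow_count, dtype=int)
    fast_idx = np.arange(0, num_frames, fast_stride, dtype=int)
    return slow_idx, fast_idx

slow, fast = slowfast_sample(num_frames=300)
print(f"{len(slow)} high-resolution frames, {len(fast)} low-resolution frames")
```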
The product is designed for research and educational use and supports applications including robotics and assistive technology, traffic monitoring and safety, scientific measurement, and various video analysis tasks. It enables users to ask questions like "How many times does the robot grasp the red block?" and receive points and timestamps for each grasp event, or "When did the cup fall?" and get the timestamp and location of the fall.
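If the model returns Molmo-style XML point markup in its answers, those points and timestamps can be parsed mechanically. The snippet below assumes a hypothetical `t` timestamp attribute alongside `x`/`y` point coordinates; the real output format is documented on the model card.

```python
import re

# Hypothetical answer text: Molmo-style <point> markup extended with a "t"
# (timestamp) attribute purely for illustration.
answer = (
    '<point x="31.4" y="62.0" t="2.1"/> '
    '<point x="48.9" y="57.3" t="5.8"/> The robot grasps the red block 2 times.'
)

point_re = re.compile(r'<point x="([\d.]+)" y="([\d.]+)" t="([\d.]+)"\s*/?>')
events = [
    {"x": float(x), "y": float(y), "time_s": float(t)}
    for x, y, t in point_re.findall(answer)
]
print(events)
```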
Molmo 2 targets researchers, practitioners, and educators working with multimodal intelligence applications. It is available through the Ai2 Playground for video and multi-image workflows, with model weights downloadable from Hugging Face and training code released under an open-source license.
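A minimal loading-and-query sketch with the Hugging Face transformers library follows. The repository id, the processor call, and the frame-extraction step are assumptions; consult the released model card for the exact API.

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

repo = "allenai/Molmo2-8B"  # placeholder repository id; check the real one
processor = AutoProcessor.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo, trust_remote_code=True, torch_dtype=torch.bfloat16, device_map="auto"
)

# A handful of frames pre-extracted from the clip (e.g. with ffmpeg).
frames = [Image.open(f"frame_{i:04d}.jpg") for i in range(0, 64, 8)]

# The processor call mirrors the generic transformers multimodal interface;
# Molmo 2's released processor may expose a different entry point.
inputs = processor(images=frames, text="When did the cup fall?", return_tensors="pt")
inputs = {k: v.to(model.device) for k, v in inputs.items()}

with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=128)

print(processor.tokenizer.decode(output[0], skip_special_tokens=True))
```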
Molmo 2 is intended for research and educational use in accordance with Ai2's Responsible Use Guidelines. It serves users in robotics and assistive technology, traffic monitoring and safety, scientific measurement, and other video analysis domains, and it is particularly useful for researchers who want full control over every part of the stack: vision encoder, connector, and language model.