Xiaomi Native Multimodal Model

MiMo-V2-Omni for Perception, Action, and Rich Media Workflows

MiMo-V2-Omni is our multimodal foundation model built to unify vision, audio, and text with native browser action. Within the MiMo-V2 family, it is positioned as the model for Browser Use, document generation, and complex perception-to-action scenarios.

Overview

The MiMo-V2 report positions MiMo-V2-Omni as the omni-modal base model in the series. It combines vision, audio, and text in a unified architecture and is explicitly tied to browser-native action, document production, and multimodal situational understanding.

Context

256K Tokens

MiMo-V2-Omni is listed with a 256K-token context window, giving it room for long multimodal prompts and larger action-oriented sessions.
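For a rough sense of scale: under the common heuristic of about 0.75 English words per token, 256K tokens works out to roughly 192,000 words of text (256,000 × 0.75), before any image or audio tokens are counted.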

Architecture

Unified Modal Stack

The model is described as using a unified vision, audio, and text architecture rather than treating media modalities as separate add-ons.

Action

Native Browser Use

The report highlights native Browser Use support, positioning MiMo-V2-Omni not just as a model for perception, but for perception bound directly to action.

This page is grounded in the MiMo-V2 research report and focuses on MiMo-V2-Omni as Xiaomi's model for multimodal reasoning and browser-native execution.

Multimodal Capability

MiMo-V2-Omni is defined by native multimodal perception across visual, audio, and combined media inputs.

Visual Understanding

The report attributes support for complex chart analysis and cross-disciplinary visual reasoning to MiMo-V2-Omni, making it applicable to data-heavy and visually rich workflows.

Long Audio Understanding

MiMo-V2-Omni is described as supporting up to 10 hours of continuous audio understanding, including environmental sound recognition and multi-speaker separation.

Video-Aware Reasoning

The model supports joint audio-video input, which the report ties to stronger situational awareness and predictive reasoning under complex media conditions.

Perception Bound to Action

Rather than stopping at interpretation, the model is framed as linking perception to downstream operation, making it suitable for multimodal agents.

Browser Use and Document Workflows

The report gives MiMo-V2-Omni a clear operational role: browser-native execution and near-finished document output.

Automation Across Websites

MiMo-V2-Omni is described as supporting cross-platform shopping, price comparison, automated communication, checkout flows, and social media publishing with interaction handling.

Near-Finished Documents

The report states that MiMo-V2-Omni can directly generate near-finished Excel, Word, PDF, and PPT outputs with formatting and layout included.

Operational Usefulness

This positioning makes MiMo-V2-Omni the most natural choice in the lineup when a system must see, hear, interpret, and act in one integrated loop.

Multimodal Expansion Layer

The report's summary recommendation explicitly pairs MiMo-V2-Omni with multimodal expansion across product and workflow deployments.

API Pricing

Model          Input / 1M Tokens   Output / 1M Tokens   Positioning
MiMo-V2-Omni   $0.40               $2.00                Competitively priced multimodal API access.
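As a quick illustration of how these rates translate into per-session cost, here is a minimal Python sketch. Only the $0.40 and $2.00 per-million-token prices come from the table above; the token counts in the example are hypothetical.

```python
# Cost estimate from the listed MiMo-V2-Omni API rates.
# The two prices come from the pricing table; the session
# token counts in the example below are hypothetical.

INPUT_PRICE_PER_M = 0.40   # USD per 1M input tokens
OUTPUT_PRICE_PER_M = 2.00  # USD per 1M output tokens

def session_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of one API call."""
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000

# Hypothetical session: a 200K-token multimodal prompt
# producing a 4K-token document draft.
print(f"${session_cost(200_000, 4_000):.4f}")  # $0.0880
```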
