Trillion-Class Architecture
MiMo-V2 is our large model family designed to move AI from the conversation era to the agent era. Across flagship reasoning, native multimodal perception, real-time speech generation, and efficient inference, MiMo-V2 unifies understanding and action for websites, automation workflows, content generation, and complex task orchestration.
Released from late 2025 to early 2026, the MiMo-V2 series represents Xiaomi's self-developed large model portfolio built around trillion-scale capacity, full-modal perception, and human-like interaction. The family spans high-end foundation intelligence, omni-modal input and action, speech synthesis, and cost-efficient deployment.
MiMo-V2-Pro is positioned as the flagship base model with more than 1T total parameters and 42B activated parameters under a Mixture-of-Experts design.
The Pro model extends to a 1,000,000-token context window while maintaining high response efficiency through a hybrid attention ratio of 7:1.
MiMo-V2-Omni combines vision, audio, and text in one native architecture and binds perception directly to action for browser-native workflows.
MiMo-V2-Flash targets low-latency, high-frequency production usage with 150+ tokens per second and a highly competitive cost profile.
The MiMo-V2 lineup covers a broad deployment spectrum, from advanced agentic reasoning to omni-modal interaction, speech output, and budget-sensitive high-volume use cases.
| Model | Positioning | Total / Active Parameters | Context Window | Core Architecture and Traits |
|---|---|---|---|---|
| MiMo-V2-Pro | Flagship foundation model | >1T / 42B (MoE) | 1,000,000 tokens | Hybrid attention 7:1, MTP layers, deep agent optimization |
| MiMo-V2-Omni | Omni-modal foundation model | Not publicly disclosed | 256K tokens | Unified vision, audio, and text architecture with native Browser Use support |
| MiMo-V2-TTS | Speech synthesis model | Not disclosed (pretrained on hundred-million-hour-scale speech data) | 8K tokens | Multi-codebook speech-text joint modeling with dialect, emotion, and singing support |
| MiMo-V2-Flash | Extreme-efficiency model | 309B / 15B (MoE) | 256K tokens | Hybrid attention 5:1, 150+ tps inference, optimized price-performance |
MiMo-V2 is defined by a set of architecture and post-training choices oriented toward long-context reasoning, tool reliability, and multimodal action under real deployment constraints.
MiMo-V2 introduces a hybrid attention mechanism that balances long-text modeling against inference efficiency. In MiMo-V2-Pro, the 7:1 hybrid ratio supports ultra-long 1M context processing while preserving high responsiveness.
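The exact layer interleaving behind the 7:1 ratio is not described in detail, but a common reading of such hybrid designs is seven efficient-attention layers (for example, sliding-window or linear attention) for every one full global-attention layer. A minimal sketch of that assumed schedule:

```python
def hybrid_attention_schedule(num_layers: int, ratio: int = 7) -> list[str]:
    """Return a per-layer attention-type schedule.

    For every `ratio` efficient-attention layers, one full (global)
    attention layer is inserted, mirroring a 7:1 hybrid ratio.
    The specific layer types and their ordering in MiMo-V2 are
    assumptions, not documented facts.
    """
    schedule = []
    for i in range(num_layers):
        # Place a full-attention layer at the end of every (ratio + 1)-layer group.
        if (i + 1) % (ratio + 1) == 0:
            schedule.append("full")
        else:
            schedule.append("efficient")
    return schedule

# A 16-layer stack at a 7:1 ratio contains 2 full-attention layers.
print(hybrid_attention_schedule(16).count("full"))  # 2
```

The appeal of this layout is that full attention, whose cost grows quadratically with context length, is paid for in only one of every eight layers, which is what makes a 1M-token window tractable at inference time.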
MiMo-V2-Pro is deeply optimized during supervised fine-tuning and reinforcement learning for agent frameworks such as OpenClaw, with emphasis on task planning, tool use stability, and self-correction after errors.
MiMo-V2-Omni supports advanced chart interpretation, cross-domain visual reasoning, up to 10 hours of continuous audio understanding, environmental sound recognition, speaker separation, and combined audio-video input for situational prediction.
MiMo-V2-TTS is built for real-time audio generation and expressive speech output, supporting dialectal variation, emotional control, and singing scenarios within a unified speech-text modeling framework.
For website development and production integration, MiMo-V2 provides an OpenAI-compatible API surface, tool calling support, structured output, web search connectivity, and direct speech generation capability.
- **Base URL:** `https://api.xiaomimimo.com/v1`
- **Authentication:** `api-key: $MIMO_API_KEY` or `Authorization: Bearer $MIMO_API_KEY`
- **Reasoning:** `thinking: { "type": "enabled" }` returns `reasoning_content` in the response
- **Structured output:** `response_format: { "type": "json_object" }`
- **Tool calling:** `tools` definitions are supported, with specific optimization for multi-step reasoning stability
- **Web search:** `web_search` can be invoked to retrieve real-time information directly
- **Speech:** `mimo-v2-tts` supports direct audio stream generation through the API

The MiMo-V2 family is positioned for website automation, productivity document generation, multimodal interaction, and voice-enabled service experiences.
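As a rough illustration, an OpenAI-compatible chat request against the documented base URL might be assembled as below. Only the base URL, the auth headers, and the `thinking`, `response_format`, and `tools` field names come from the description above; the model identifier `mimo-v2-pro` and the exact request shape are assumptions.

```python
import json

BASE_URL = "https://api.xiaomimimo.com/v1"
API_KEY = "sk-..."  # placeholder; either documented header form is accepted

headers = {
    "Authorization": f"Bearer {API_KEY}",  # or: "api-key": API_KEY
    "Content-Type": "application/json",
}

payload = {
    "model": "mimo-v2-pro",               # assumed model identifier
    "messages": [{"role": "user", "content": "Summarize today's tasks."}],
    "thinking": {"type": "enabled"},      # reasoning_content appears in the response
    "response_format": {"type": "json_object"},
}

body = json.dumps(payload)
print(f"POST {BASE_URL}/chat/completions")
```

Because the surface is OpenAI-compatible, existing OpenAI SDK clients should in principle work by pointing their base URL at the endpoint above, though that should be verified against the official platform documentation.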
With MiMo-V2-Omni and its native browser operation capabilities, the family supports cross-platform shopping and price comparison, automated communication, checkout completion, social media publishing, and interactive comment handling.
MiMo-V2-Omni can generate nearly production-ready Excel, Word, PDF, and PPT materials, including formatted reports, planning documents, layouts, and presentation structures derived from raw source data.
MiMo-V2-TTS enables branded voice personas, customer service voices, dialect-aware delivery, and more natural spoken experiences through emotional control and paralinguistic events such as laughter and sighs.
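A speech request to such a service might look like the sketch below. The `/audio/speech` path and the `voice` and `emotion` fields are assumptions modeled on OpenAI-style speech APIs; only the model name `mimo-v2-tts` comes from the material above.

```python
# Hypothetical TTS request payload; field names other than "model"
# are illustrative assumptions, not documented parameters.
tts_payload = {
    "model": "mimo-v2-tts",
    "input": "Your order has shipped!",
    "voice": "brand-persona-1",  # assumed branded voice identifier
    "emotion": "cheerful",       # assumed emotional-control field
}
print("POST https://api.xiaomimimo.com/v1/audio/speech")
```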
MiMo-V2-Pro is recommended for complex business logic and agent planning, while MiMo-V2-Flash is positioned for frequent foundational interactions where throughput, latency, and unit economics are critical.
Reported pricing positions MiMo-V2 across premium reasoning, competitive multimodality, and cost-sensitive high-frequency usage.
| Model | Input Price / 1M Tokens | Output Price / 1M Tokens | Notes |
|---|---|---|---|
| MiMo-V2-Pro | $1.00 within 256K / $2.00 at 1M | $3.00 within 256K / $6.00 at 1M | Cached input is priced at 20% of the standard input rate. |
| MiMo-V2-Omni | $0.40 | $2.00 | Competitive multimodal API pricing. |
| MiMo-V2-Flash | $0.10 | $0.30 | Designed for high-frequency, low-latency scenarios. |
| MiMo-V2-TTS | Free for a limited time | Free for a limited time | Speech synthesis is temporarily not billed. |
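The tiered Pro pricing and the 20% cached-input rate combine as in the sketch below. Selecting the tier per request based on a `long_context` flag is an assumption about how billing works; consult the official billing documentation for the actual rules.

```python
def mimo_v2_pro_cost(input_tokens: int, output_tokens: int,
                     cached_tokens: int = 0, long_context: bool = False) -> float:
    """Estimate MiMo-V2-Pro request cost in USD from the table above.

    Per 1M tokens: $1 in / $3 out within 256K; $2 in / $6 out at 1M.
    Cached input is billed at 20% of the standard input rate.
    Per-request tier selection here is an assumption.
    """
    in_rate, out_rate = (2.00, 6.00) if long_context else (1.00, 3.00)
    fresh = input_tokens - cached_tokens
    cost = (fresh * in_rate
            + cached_tokens * in_rate * 0.20
            + output_tokens * out_rate) / 1_000_000
    return round(cost, 6)

# 100K fresh input + 100K cached input + 10K output, within 256K:
print(mimo_v2_pro_cost(200_000, 10_000, cached_tokens=100_000))  # 0.15
```

The cached-input discount matters most for agent loops, where the same long system prompt and tool definitions are resent on every step.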
The FAQ below summarizes the most important points about the MiMo-V2 family in compact form.
**What is MiMo-V2?**
MiMo-V2 is Xiaomi's large model family introduced from late 2025 to early 2026 and positioned to move artificial intelligence from pure conversation toward agent-oriented execution, where understanding and action are more tightly integrated.

**Which models make up the MiMo-V2 family?**
The supplied material identifies four core models: MiMo-V2-Pro, MiMo-V2-Omni, MiMo-V2-TTS, and MiMo-V2-Flash. Together they cover flagship reasoning, multimodal perception, speech synthesis, and high-efficiency deployment.

**Does MiMo-V2 offer an OpenAI-compatible API?**
Yes. The report states that the Xiaomi MiMo platform provides an OpenAI-compatible API and supports features such as tool calls, structured JSON output, web search, and speech generation through the mimo-v2-tts endpoint.

**Which application scenarios are documented?**
The documented scenarios include browser-native automation, shopping and comparison workflows, social content operations, document generation for Excel, Word, PDF, and PPT, plus customer service and voice assistant use cases.

**Which model fits which workload?**
Within the provided report, MiMo-V2-Pro is recommended for complex business logic, MiMo-V2-Flash for high-frequency interactions, and MiMo-V2-Omni for multimodal expansion where perception and action need to be linked.

**How is the family priced?**
The research positions MiMo-V2-Pro as the premium long-context option, MiMo-V2-Omni as a competitively priced multimodal service, MiMo-V2-Flash as a low-cost high-throughput model, and MiMo-V2-TTS as temporarily free at the time of the report.
Within the supplied report, MiMo-V2 is framed as a strong foundation for website development and agent-oriented upgrades. The recommended deployment pattern is to use MiMo-V2-Pro for complex business logic, MiMo-V2-Flash for high-frequency interaction, and MiMo-V2-Omni for multimodal expansion.
Dedicated model pages are available for users who want a more focused view of each member of the MiMo-V2 family, including flagship reasoning, multimodal workflows, speech generation, and high-efficiency deployment.
Visit the MiMo-V2-Pro page for a focused overview of Xiaomi's flagship agent foundation model.
Visit the MiMo-V2-Flash page for a focused view of Xiaomi's high-efficiency model for low-latency production workloads.
Visit the MiMo-V2-Omni page for a focused view of Xiaomi's multimodal model for perception, browser action, and rich media workflows.
Visit the MiMo-V2-TTS page for a focused view of Xiaomi's speech synthesis model for expressive voice generation.