Xiaomi Large Model Family

MiMo-V2 for the Agent Era

MiMo-V2 is our large model family designed to move AI from the conversation era to the agent era. Across flagship reasoning, native multimodal perception, real-time speech generation, and efficient inference, MiMo-V2 unifies understanding and action for websites, automation workflows, content generation, and complex task orchestration.

Overview

Released from late 2025 to early 2026, the MiMo-V2 series represents Xiaomi's self-developed large model portfolio built around trillion-scale capacity, full-modal perception, and human-like interaction. The family spans high-end foundation intelligence, omni-modal input and action, speech synthesis, and cost-efficient deployment.

Scale

Trillion-Class Architecture

MiMo-V2-Pro is positioned as the flagship base model with more than 1T total parameters and 42B activated parameters under a Mixture-of-Experts design.

Context

Long-Range Reasoning

The Pro model extends to a 1,000,000-token context window while maintaining high response efficiency through a hybrid attention ratio of 7:1.

Modality

Unified Perception

MiMo-V2-Omni combines vision, audio, and text in one native architecture and binds perception directly to action for browser-native workflows.

Efficiency

Deployment Flexibility

MiMo-V2-Flash targets low-latency, high-frequency production usage with 150+ tokens per second and a highly competitive cost profile.


Model Matrix

The MiMo-V2 lineup covers a broad deployment spectrum, from advanced agentic reasoning to omni-modal interaction, speech output, and budget-sensitive high-volume use cases.

| Model | Positioning | Total / Active Parameters | Context Window | Core Architecture and Traits |
| --- | --- | --- | --- | --- |
| MiMo-V2-Pro | Flagship foundation model | >1T / 42B (MoE) | 1,000,000 tokens | Hybrid attention 7:1, MTP layers, deep agent optimization |
| MiMo-V2-Omni | Omni-modal foundation model | Not publicly disclosed | 256K tokens | Unified vision, audio, and text architecture with native Browser Use support |
| MiMo-V2-TTS | Speech synthesis model | Pretrained on hundred-million-hour scale data | 8K tokens | Multi-codebook speech-text joint modeling with dialect, emotion, and singing support |
| MiMo-V2-Flash | Extreme-efficiency model | 309B / 15B (MoE) | 256K tokens | Hybrid attention 5:1, 150+ tps inference, optimized price-performance |

Core Technology

MiMo-V2 is defined by a set of architecture and post-training choices oriented toward long-context reasoning, tool reliability, and multimodal action under real deployment constraints.

Hybrid Attention

MiMo-V2 introduces a hybrid attention mechanism that balances long-text modeling against inference efficiency. In MiMo-V2-Pro, the 7:1 hybrid ratio supports ultra-long 1M context processing while preserving high responsiveness.
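One plausible reading of the 7:1 ratio is seven efficient-attention layers (for example, sliding-window or linear attention) for every full-attention layer. The sketch below generates such a layer schedule; the interpretation of the ratio and the layer-type names are assumptions, since the actual MiMo-V2 layer pattern is not documented here.

```python
# Sketch of a 7:1 hybrid layer schedule, assuming the ratio means seven
# efficient-attention layers per full-attention layer. This is an
# illustrative assumption, not the documented MiMo-V2 architecture.
def hybrid_schedule(num_layers, ratio=7):
    """Return a layer-type list with one 'full' layer per `ratio` 'efficient' layers."""
    return ["full" if (i + 1) % (ratio + 1) == 0 else "efficient"
            for i in range(num_layers)]

sched = hybrid_schedule(16)
# With a 7:1 ratio, layers 8 and 16 (1-indexed) use full attention,
# so only 2 of 16 layers pay the full quadratic attention cost.
```

Under this reading, long-context cost is dominated by the cheap layers, which is how a 1M-token window can remain responsive.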

Agent-Centric Post-Training

MiMo-V2-Pro is deeply optimized during supervised fine-tuning and reinforcement learning for agent frameworks such as OpenClaw, with emphasis on task planning, tool use stability, and self-correction after errors.

Omni-Modal Perception

MiMo-V2-Omni supports advanced chart interpretation, cross-domain visual reasoning, up to 10 hours of continuous audio understanding, environmental sound recognition, speaker separation, and combined audio-video input for situational prediction.

Speech Generation

MiMo-V2-TTS is built for real-time audio generation and expressive speech output, supporting dialectal variation, emotional control, and singing scenarios within a unified speech-text modeling framework.

Developer Integration

For website development and production integration, MiMo-V2 provides an OpenAI-compatible API surface, tool calling support, structured output, web search connectivity, and direct speech generation capability.

API Compatibility

  • Base URL: https://api.xiaomimimo.com/v1
  • Authentication: api-key: $MIMO_API_KEY or Authorization: Bearer $MIMO_API_KEY
  • Thinking support: thinking: { type: "enabled" } with reasoning_content in the response
  • Structured output: response_format: { type: "json_object" }
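The bullets above can be combined into a single OpenAI-compatible request. The sketch below assembles the URL, headers, and body for a chat completion with thinking and structured output enabled; the model name "mimo-v2-pro" as an API identifier is an assumption, and no request is actually sent.

```python
import json

# Minimal sketch of a MiMo-V2 chat completion request, assuming the
# OpenAI-compatible surface described above. The model identifier
# "mimo-v2-pro" is an assumption, not a documented value.
BASE_URL = "https://api.xiaomimimo.com/v1"

def build_chat_request(api_key, model, messages, thinking=False, json_output=False):
    """Assemble URL, headers, and JSON body for /chat/completions."""
    body = {"model": model, "messages": messages}
    if thinking:
        body["thinking"] = {"type": "enabled"}  # reasoning_content appears in the response
    if json_output:
        body["response_format"] = {"type": "json_object"}
    headers = {
        "Authorization": f"Bearer {api_key}",  # alternatively: "api-key": api_key
        "Content-Type": "application/json",
    }
    return f"{BASE_URL}/chat/completions", headers, json.dumps(body)

url, headers, payload = build_chat_request(
    "MY_KEY", "mimo-v2-pro",
    [{"role": "user", "content": "Summarize this page as JSON."}],
    thinking=True, json_output=True,
)
# To send: POST `payload` to `url` with `headers` using any HTTP client.
```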

Tooling and Workflow Support

  • Standard tool definitions are supported, with specific optimization for multi-step reasoning stability.
  • web_search can be invoked to retrieve real-time information directly.
  • mimo-v2-tts supports direct audio stream generation through the API.
  • Documented ecosystem compatibility includes Claude Code, Cline, Roo Code, Kilo Code, LiteLLM, LangChain, and OpenRouter.
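A tools-enabled request body might look like the sketch below, which pairs a standard OpenAI-style function definition with the built-in web_search tool. The web_search invocation shape and the get_price function are illustrative assumptions based on the bullets above, not a documented schema.

```python
# Sketch of a tools-enabled request body. The `{"type": "web_search"}`
# entry and the `get_price` function are hypothetical examples, not a
# documented MiMo-V2 schema.
def build_tools_body(model, user_text):
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
        "tools": [
            {   # a standard OpenAI-style function tool definition
                "type": "function",
                "function": {
                    "name": "get_price",
                    "description": "Look up the current price of a product.",
                    "parameters": {
                        "type": "object",
                        "properties": {"sku": {"type": "string"}},
                        "required": ["sku"],
                    },
                },
            },
            {"type": "web_search"},  # assumed shape for built-in real-time search
        ],
    }

body = build_tools_body("mimo-v2-pro", "What does this product cost today?")
```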

Application Scenarios

The MiMo-V2 family is positioned for website automation, productivity document generation, multimodal interaction, and voice-enabled service experiences.

Browser-Native Automation

With MiMo-V2-Omni and its native browser operation capabilities, the family supports cross-platform shopping and price comparison, automated communication, checkout completion, social media publishing, and interactive comment handling.

Document Production

MiMo-V2-Omni can generate nearly production-ready Excel, Word, PDF, and PPT materials, including formatted reports, planning documents, layouts, and presentation structures derived from raw source data.

Customer Service and Voice Agents

MiMo-V2-TTS enables branded voice personas, customer service voices, dialect-aware delivery, and more natural spoken experiences through emotional control and paralinguistic events such as laughter and sighs.
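A voice-agent request to mimo-v2-tts might be assembled as below. Every field name here (input, voice, emotion, dialect, stream) is an illustrative assumption modeled on common speech APIs; the actual MiMo-V2-TTS parameters are not documented on this page.

```python
# Hypothetical request body for mimo-v2-tts. All parameter names below
# (voice, emotion, dialect, stream) are assumptions for illustration,
# not documented MiMo-V2-TTS fields.
def build_tts_body(text, voice="default", emotion=None, dialect=None, stream=True):
    body = {"model": "mimo-v2-tts", "input": text, "voice": voice, "stream": stream}
    if emotion:
        body["emotion"] = emotion    # e.g. "cheerful" for a friendly persona
    if dialect:
        body["dialect"] = dialect    # e.g. a regional variant tag
    return body

body = build_tts_body("Welcome back! How can I help you today?", emotion="cheerful")
```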

Task Orchestration

MiMo-V2-Pro is recommended for complex business logic and agent planning, while MiMo-V2-Flash is positioned for frequent foundational interactions where throughput, latency, and unit economics are critical.

API Pricing

The supplied pricing information positions MiMo-V2 across premium reasoning, competitive multimodality, and cost-sensitive high-frequency usage.

| Model | Input Price / 1M Tokens | Output Price / 1M Tokens | Notes |
| --- | --- | --- | --- |
| MiMo-V2-Pro | $1.00 within 256K / $2.00 at 1M | $3.00 within 256K / $6.00 at 1M | Cached input is priced at 20% of the standard input rate. |
| MiMo-V2-Omni | $0.40 | $2.00 | Competitive multimodal API pricing. |
| MiMo-V2-Flash | $0.10 | $0.30 | Designed for high-frequency, low-latency scenarios. |
| MiMo-V2-TTS | Free for a limited time | Free for a limited time | Speech synthesis is temporarily not billed. |
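The table above translates into a simple cost estimate. The sketch below applies per-million-token pricing and the 20% cached-input rate; the tier names are labels chosen for this example, and the cached discount is shown for Pro because the table's note attaches it to that model.

```python
# Cost estimator for the pricing table above, assuming per-million-token
# billing. Tier keys are labels for this sketch, not API model names.
PRICES = {  # (input $/1M tokens, output $/1M tokens)
    "mimo-v2-pro-256k": (1.00, 3.00),
    "mimo-v2-pro-1m":   (2.00, 6.00),
    "mimo-v2-omni":     (0.40, 2.00),
    "mimo-v2-flash":    (0.10, 0.30),
}
CACHED_INPUT_RATE = 0.20  # cached input billed at 20% of the input price

def estimate_cost(model, input_tokens, output_tokens, cached_tokens=0):
    """Return the request cost in USD; `cached_tokens` counts toward `input_tokens`."""
    in_price, out_price = PRICES[model]
    fresh = input_tokens - cached_tokens
    cost = (fresh * in_price
            + cached_tokens * in_price * CACHED_INPUT_RATE
            + output_tokens * out_price) / 1_000_000
    return round(cost, 6)

# 100K fresh + 100K cached input, 20K output, on the Pro 256K tier:
print(estimate_cost("mimo-v2-pro-256k", 200_000, 20_000, cached_tokens=100_000))  # → 0.18
```

The same request on MiMo-V2-Flash would cost a small fraction of that, which is the point of the price-performance positioning.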

Frequently Asked Questions

The FAQ below summarizes the most important points from the supplied MiMo-V2 research material in a compact, source-aligned format.

What is MiMo-V2?

MiMo-V2 is Xiaomi's large model family introduced from late 2025 to early 2026 and positioned to move artificial intelligence from pure conversation toward agent-oriented execution, where understanding and action are more tightly integrated.

Which models are included in the MiMo-V2 family?

The supplied material identifies four core models: MiMo-V2-Pro, MiMo-V2-Omni, MiMo-V2-TTS, and MiMo-V2-Flash. Together they cover flagship reasoning, multimodal perception, speech synthesis, and high-efficiency deployment.

Does MiMo-V2 support API-based developer integration?

Yes. The report states that the Xiaomi MiMo platform provides an OpenAI-compatible API and supports features such as tool calls, structured JSON output, web search, and speech generation through the mimo-v2-tts endpoint.

What application scenarios are described?

The documented scenarios include browser-native automation, shopping and comparison workflows, social content operations, document generation for Excel, Word, PDF, and PPT, plus customer service and voice assistant use cases.

How should the model family be used across workloads?

Within the provided report, MiMo-V2-Pro is recommended for complex business logic, MiMo-V2-Flash for high-frequency interactions, and MiMo-V2-Omni for multimodal expansion where perception and action need to be linked.

What does the pricing positioning look like?

The research positions MiMo-V2-Pro as the premium long-context option, MiMo-V2-Omni as a competitively priced multimodal service, MiMo-V2-Flash as a low-cost high-throughput model, and MiMo-V2-TTS as temporarily free at the time of the report.

Official Resources

The following references are the external resources listed in the research material. All outbound links are marked with nofollow.

Positioning Summary

Within the supplied report, MiMo-V2 is framed as a strong foundation for website development and agent-oriented upgrades. The recommended deployment pattern is to use MiMo-V2-Pro for complex business logic, MiMo-V2-Flash for high-frequency interaction, and MiMo-V2-Omni for multimodal expansion.

Model Pages

Dedicated model pages are available for users who want a more focused view of each member of the MiMo-V2 family, including flagship reasoning, multimodal workflows, speech generation, and high-efficiency deployment.

MiMo-V2-Omni

Visit the MiMo-V2-Omni page for a focused view of Xiaomi's multimodal model for perception, browser action, and rich media workflows.