Xiaomi High-Efficiency Model

MiMo-V2-Flash for High-Frequency Production Workloads

MiMo-V2-Flash is our high-efficiency model designed for throughput, responsiveness, and strong price-performance. For teams that need low-latency inference and budget-aware deployment at scale, MiMo-V2-Flash is positioned as the practical production layer in the MiMo-V2 family.

Overview

In the supplied MiMo-V2 research material, MiMo-V2-Flash is positioned as the extreme-efficiency member of the lineup. It balances a large MoE base with fast inference, long context, and one of the most aggressive pricing profiles in the family.

Scale

309B / 15B MoE

The report lists 309B total parameters with 15B active parameters, providing a substantial base while keeping active compute comparatively lean.
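As a quick back-of-envelope check, those figures imply that only a small fraction of the weights participate in any single forward pass. The arithmetic below is a sketch based solely on the reported parameter counts.

```python
# Active-parameter fraction implied by the reported MoE figures:
# 309B total parameters, 15B active per token.

TOTAL_PARAMS = 309e9
ACTIVE_PARAMS = 15e9

active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS
print(f"Active fraction: {active_fraction:.1%}")  # roughly 4.9%
```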

Speed

150+ TPS

MiMo-V2-Flash is described as supporting 150+ tokens per second, making it well suited to high-frequency interaction layers and fast response scenarios.
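To make the throughput figure concrete, here is a rough latency sketch assuming a sustained decode rate of 150 tokens per second; the reply length is an illustrative assumption, not a measured value for any deployment.

```python
# Back-of-envelope decode-time estimate at the reported 150+ tokens/second.

def decode_time_seconds(output_tokens: int, tps: float = 150.0) -> float:
    """Approximate time to stream `output_tokens` at a sustained decode rate."""
    return output_tokens / tps

# A typical ~300-token chat reply at 150 TPS:
print(f"{decode_time_seconds(300):.1f}s")  # 2.0s
```

At this rate, most short interactive replies stay well under a few seconds of decode time, which is what makes the model plausible for high-frequency interaction layers.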

Context

256K Tokens

The model supports a 256K-token context window, giving production teams meaningful long-context capacity without stepping into flagship-level cost.

This page is grounded in the supplied MiMo-V2 research report and focuses on MiMo-V2-Flash as a distinct Xiaomi offering for efficient, low-latency deployment.

Efficiency Profile

MiMo-V2-Flash is defined in the report by a combination of MoE scale, compact activation cost, long context, and highly competitive API rates.

Hybrid Attention 5:1

The supplied material attributes a 5:1 hybrid attention architecture to MiMo-V2-Flash, helping it maintain inference efficiency while still supporting long documents and multi-step interaction.

Built for Throughput

Rather than taking the flagship reasoning role, MiMo-V2-Flash is presented as the model for frequent production traffic where responsiveness and unit economics matter most.

Low Activation Cost

With 15B active parameters under an MoE design, the model is framed as a practical choice for scale-sensitive deployment where inference efficiency is central.

Long-Context Practicality

Its 256K context window allows Flash to support richer sessions, larger prompts, and document-driven use cases without moving up to the 1M context tier of MiMo-V2-Pro.
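A simple way to reason about that window is a token-budget check before dispatching a request. The sketch below uses a coarse 4-characters-per-token heuristic, which is an assumption; real tokenizers vary by language and content.

```python
# Rough check of whether a prompt plus expected reply fits the 256K window.
# The 4-chars-per-token heuristic is a coarse assumption.

CONTEXT_WINDOW = 256_000  # tokens, per the MiMo-V2-Flash spec in the report

def fits_in_context(prompt_chars: int, expected_output_tokens: int,
                    chars_per_token: float = 4.0) -> bool:
    est_prompt_tokens = prompt_chars / chars_per_token
    return est_prompt_tokens + expected_output_tokens <= CONTEXT_WINDOW

print(fits_in_context(800_000, 4_000))    # ~200K prompt tokens + 4K output -> True
print(fits_in_context(1_100_000, 4_000))  # ~275K prompt tokens -> False
```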

Recommended Use Cases

The report explicitly recommends MiMo-V2-Flash for high-frequency foundational interactions. That makes it a practical model layer for speed-sensitive production systems.

High-Frequency Interaction

MiMo-V2-Flash fits chat surfaces, assistant panels, frequent website interactions, and operational flows where latency and unit cost need to stay tightly controlled.

Production Fallback Layer

It can serve as a fast execution layer beneath a more capable reasoning model, handling routine prompts, retrieval-driven answers, and repeated structured interactions at lower cost.
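A minimal sketch of that two-tier layout is shown below: routine traffic goes to the fast, low-cost model, while requests flagged as complex escalate to a stronger reasoning model. The model identifiers and the routing heuristic are illustrative assumptions, not part of any official API.

```python
# Two-tier routing sketch: fast model for routine prompts, stronger model
# for requests that need deep reasoning. Names below are assumed identifiers.

FAST_MODEL = "mimo-v2-flash"     # assumed identifier
REASONING_MODEL = "mimo-v2-pro"  # assumed identifier

def route(prompt: str, needs_deep_reasoning: bool = False) -> str:
    """Pick a model tier for a request."""
    # Simple heuristic: explicitly flagged or very long prompts escalate.
    if needs_deep_reasoning or len(prompt) > 20_000:
        return REASONING_MODEL
    return FAST_MODEL

print(route("What are your store hours?"))       # mimo-v2-flash
print(route("Prove this theorem step by step", needs_deep_reasoning=True))  # mimo-v2-pro
```

In practice the escalation signal would come from a classifier or product logic rather than prompt length alone; the point is that routine, repeated traffic never has to pay the premium tier's rates.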

Cost-Efficient Automation

For batch-style or repetitive automation, the report positions Flash as a fit for workloads where response quality must stay strong but premium model pricing is unnecessary.

Traffic Scaling

Its price-performance profile makes it the most natural candidate in the family for scenarios where request volume matters as much as raw capability.

API Pricing

The supplied report positions MiMo-V2-Flash as the lowest-cost of the text-focused variants in the lineup.

Model            Input / 1M Tokens   Output / 1M Tokens   Positioning
MiMo-V2-Flash    $0.10               $0.30                Designed for high-frequency, low-latency scenarios with strong price-performance.
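The listed rates make monthly cost estimation straightforward. The sketch below applies the stated per-token prices; the traffic volumes in the example are illustrative assumptions.

```python
# Cost estimate from the listed MiMo-V2-Flash API rates:
# $0.10 per 1M input tokens, $0.30 per 1M output tokens.

INPUT_RATE = 0.10 / 1_000_000   # USD per input token
OUTPUT_RATE = 0.30 / 1_000_000  # USD per output token

def monthly_cost(input_tokens: int, output_tokens: int) -> float:
    """Total USD cost for a month of traffic at the listed rates."""
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# Example: 1B input tokens and 200M output tokens in a month:
print(f"${monthly_cost(1_000_000_000, 200_000_000):.2f}")  # $160.00
```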

Official Resources