MiMo-V2-Flash: Xiaomi MoE Model for Low-Latency Workloads

MiMo-V2-Flash is Xiaomi's low-latency mixture-of-experts (MoE) model, built for throughput, responsiveness, and strong price-performance. For teams that need cost-efficient AI inference and budget-aware deployment at scale, MiMo-V2-Flash is positioned as the practical production layer of the MiMo-V2 family.

Overview

In the supplied MiMo-V2 research material, MiMo-V2-Flash is positioned as Xiaomi's efficiency-focused MoE model in the lineup. It balances a large MoE base with fast inference, 256K context, and one of the most aggressive pricing profiles in the family.

Scale

309B / 15B MoE

The report lists 309B total parameters with 15B active parameters, providing a substantial base while keeping active compute comparatively lean.
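As a quick sanity check on what that ratio implies (the parameter figures come from the report; the arithmetic below is ours):

```python
total_params = 309e9   # total parameters reported for MiMo-V2-Flash
active_params = 15e9   # active parameters per token under the MoE design

# Fraction of the full parameter set exercised on each forward pass.
active_fraction = active_params / total_params
print(f"{active_fraction:.1%}")  # 4.9% of parameters active per token
```

In other words, roughly one in twenty parameters participates in any given token's compute, which is where the "substantial base, lean activation" framing comes from.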

Speed

150+ TPS

MiMo-V2-Flash is described as supporting 150+ tokens per second, making it well suited to high-frequency interaction layers and fast response scenarios.
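A rough feel for what 150+ TPS means in practice (the response length below is a hypothetical, and prefill time is ignored for simplicity):

```python
tokens_per_second = 150   # reported lower bound for MiMo-V2-Flash
response_tokens = 500     # hypothetical reply length for a chat turn

# Time to stream the full reply, ignoring prefill and network overhead.
generation_seconds = response_tokens / tokens_per_second
print(f"{generation_seconds:.2f}s")  # 3.33s for a 500-token reply
```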

Context

256K Tokens

The model supports a 256K-token context window, giving production teams meaningful long-context capacity without stepping into flagship-level cost.

This page summarizes Xiaomi MiMo public materials and focuses on MiMo-V2-Flash as a distinct product entity for efficient, low-latency deployment and API-based production workloads.

Efficiency Profile

MiMo-V2-Flash is defined in the report by a combination of MoE scale, compact activation cost, long context, and highly competitive API rates.

Hybrid Attention 5:1

The supplied material attributes a 5:1 hybrid attention architecture to MiMo-V2-Flash, helping it maintain inference efficiency while still supporting long documents and multi-step interaction.

Built for Throughput

Rather than taking the flagship reasoning role, MiMo-V2-Flash is presented as the model for frequent production traffic where responsiveness and unit economics matter most.

Low Activation Cost

With 15B active parameters under an MoE design, the model is framed as a practical choice for scale-sensitive deployment where inference efficiency is central.

Long-Context Practicality

Its 256K context window allows Flash to support richer sessions, larger prompts, and document-driven use cases without moving up to the 1M context tier of MiMo-V2-Pro.
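To put the 256K window in concrete terms, here is a back-of-the-envelope context budget; the output reservation and tokens-per-page figure are assumptions, not values from the report:

```python
context_window = 256_000   # reported context window in tokens
reserved_output = 4_000    # hypothetical reservation for the model's reply
tokens_per_page = 500      # rough assumption for a dense page of text

# Tokens left for prompt material after reserving room for the answer.
usable = context_window - reserved_output
print(usable // tokens_per_page)  # 504 pages of input material
```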

Recommended Use Cases

The report explicitly recommends MiMo-V2-Flash for high-frequency foundational interactions. That makes it a practical model layer for speed-sensitive production systems.

High-Frequency Interaction

MiMo-V2-Flash fits chat surfaces, assistant panels, frequent website interactions, and operational flows where latency and unit cost need to stay tightly controlled.

Production Fallback Layer

It can serve as a fast execution layer beneath a more capable reasoning model, handling routine prompts, retrieval-driven answers, and repeated structured interactions at lower cost.
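That two-tier layering can be sketched as a simple router. Everything below is illustrative: the model names, the complexity heuristic, and the thresholds are assumptions, and a real deployment would call an actual inference API instead of returning a string:

```python
# Sketch of a two-tier routing layer: send routine traffic to a fast,
# cheap model and escalate only when a heuristic flags complexity.
ROUTINE_MAX_CHARS = 2000
COMPLEX_MARKERS = ("prove", "derive", "step by step", "analyze")

def pick_model(prompt: str) -> str:
    """Return the model tier a prompt should be routed to."""
    text = prompt.lower()
    needs_reasoning = any(marker in text for marker in COMPLEX_MARKERS)
    if needs_reasoning or len(prompt) > ROUTINE_MAX_CHARS:
        return "mimo-v2-pro"    # hypothetical heavier reasoning tier
    return "mimo-v2-flash"      # fast execution layer for routine traffic

print(pick_model("What are your store hours?"))           # mimo-v2-flash
print(pick_model("Derive the closed form step by step"))  # mimo-v2-pro
```

In production the heuristic would typically be a classifier or confidence signal rather than keyword matching, but the shape of the layer is the same: routine prompts stay on the cheap tier.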

Cost-Efficient Automation

For batch-like or repetitive automation scenarios, the report's positioning makes Flash a fit for workloads where response quality must remain strong but premium model pricing is unnecessary.

Traffic Scaling

Its price-performance profile makes it the most natural candidate in the family for scenarios where request volume matters as much as peak capability.

API Pricing

The supplied report positions MiMo-V2-Flash as the lowest-cost model in the lineup among the text-focused variants.

Model | Input / 1M Tokens | Output / 1M Tokens | Positioning
MiMo-V2-Flash | $0.10 | $0.30 | Designed for high-frequency, low-latency scenarios with strong price-performance.
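At the listed rates, a workload's cost is easy to estimate; the traffic and token figures below are hypothetical, only the per-token prices come from the table:

```python
input_price = 0.10 / 1_000_000    # $ per input token (listed rate)
output_price = 0.30 / 1_000_000   # $ per output token (listed rate)

requests_per_day = 100_000        # assumed traffic volume
input_tokens = 800                # assumed average prompt size
output_tokens = 300               # assumed average response size

# Daily spend = requests x (prompt cost + response cost per request).
daily_cost = requests_per_day * (
    input_tokens * input_price + output_tokens * output_price
)
print(f"${daily_cost:.2f} per day")  # $17.00 per day at these assumptions
```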

Official Resources