Xiaomi Speech Synthesis Model

MiMo-V2-TTS for Expressive Voice Generation

MiMo-V2-TTS is Xiaomi's speech synthesis model for voice-first experiences, combining expressive generation, dialect control, emotional variation, and singing support. In the supplied research material, it is positioned as the audio layer for natural-sounding service interactions and branded voice experiences.

Overview

The supplied MiMo-V2 report positions MiMo-V2-TTS as the speech synthesis member of the family. It is defined by expressive voice output, support for dialect and emotion, and suitability for customer service, branded voices, and interactive assistant experiences.

Training

Hundred-Million-Hour Scale

The report describes MiMo-V2-TTS as pretrained on data at the hundred-million-hour scale, giving it a large base for robust speech generation.

Modeling

Speech-Text Joint Modeling

MiMo-V2-TTS uses a multi-codebook speech-text joint modeling approach to support more controllable and expressive generation.
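
The report does not describe the token layout behind this approach. As a generic illustration of what "multi-codebook" commonly means in neural audio codecs, the sketch below assumes each audio frame is represented by one index per codebook, and flattens those indices into a single token stream that follows the text tokens. The function names, the ID offset scheme, and the vocabulary sizes are all hypothetical and are not taken from MiMo-V2-TTS.

```python
from typing import List, Sequence


def flatten_codebooks(
    frames: Sequence[Sequence[int]],
    codebook_size: int,
    text_vocab_size: int,
) -> List[int]:
    """Map per-frame codebook indices to one flat list of token IDs.

    Hypothetical scheme: codebook k owns the ID range starting at
    text_vocab_size + k * codebook_size, so audio IDs never collide
    with text IDs or with each other.
    """
    tokens: List[int] = []
    for frame in frames:
        for k, index in enumerate(frame):
            if not 0 <= index < codebook_size:
                raise ValueError(f"index {index} out of range for codebook {k}")
            tokens.append(text_vocab_size + k * codebook_size + index)
    return tokens


def build_joint_sequence(
    text_tokens: Sequence[int],
    frames: Sequence[Sequence[int]],
    codebook_size: int,
    text_vocab_size: int,
) -> List[int]:
    """Concatenate text tokens with flattened audio tokens (illustrative only)."""
    return list(text_tokens) + flatten_codebooks(frames, codebook_size, text_vocab_size)
```

A single sequence of this kind is what would let one model attend over text and speech together; real systems may instead predict the codebooks in parallel or with a delay pattern.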

Context

8K Tokens

The model is listed with an 8K context window in the supplied matrix, aligning it with targeted speech output tasks rather than long-document reasoning.
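
A bounded context window means long scripts may need to be split before synthesis. The sketch below shows one way a client could pack whole sentences into segments under a token budget. It is an illustration only: the helper name split_for_tts is hypothetical, the whitespace word count is a crude stand-in for the model's real tokenizer, and the report does not say how much of the 8K window is available for input text.

```python
import re
from typing import Callable, List


def split_for_tts(
    text: str,
    max_tokens: int,
    count_tokens: Callable[[str], int] = lambda s: len(s.split()),
) -> List[str]:
    """Greedily pack whole sentences into segments of at most max_tokens.

    count_tokens defaults to a whitespace word count (an assumption);
    pass the real tokenizer's counting function when one is available.
    A single sentence longer than max_tokens is kept as its own segment.
    """
    # Split after sentence-ending punctuation that is followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    segments: List[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and count_tokens(candidate) > max_tokens:
            segments.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        segments.append(current)
    return segments
```

Splitting on sentence boundaries rather than at a fixed length keeps each segment prosodically complete, which matters more for speech than for text-only processing.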

This page is grounded in the supplied MiMo-V2 research report and focuses on MiMo-V2-TTS as the Xiaomi product entity for text-to-speech and expressive voice generation.

Voice Control and Expressiveness

The report highlights MiMo-V2-TTS as more than plain speech output. It is positioned as a controllable expressive voice model.

Dialect Support

MiMo-V2-TTS is described as supporting dialect control, making it useful for region-aware voice experiences and for broader accessibility across regional language varieties.

Emotion Control

The model supports emotional delivery, allowing voice output to align more naturally with service tone, brand voice, and conversation context.

Singing Support

The supplied material explicitly includes singing among the supported capabilities, distinguishing MiMo-V2-TTS from narrower utilitarian TTS systems.

Real-Time Audio Generation

MiMo-V2-TTS is positioned for real-time audio generation, making it relevant for live assistant and voice interaction experiences.

Use Cases

The report describes MiMo-V2-TTS as a voice layer for service, branding, and more natural spoken interaction.

Customer Service Voices

MiMo-V2-TTS is suitable for customer support and assistant scenarios where voice tone and emotional nuance can improve the experience.

Branded Personas

The report highlights role-play and customized brand spokesperson use cases, where a distinct voice identity matters.

Natural Interaction

MiMo-V2-TTS supports paralinguistic events such as laughter and sighs, which the supplied report presents as a path to more natural and human-like interaction.

API Voice Output

The MiMo platform documentation in the supplied report states that mimo-v2-tts supports direct audio stream generation through the API.
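
As a rough illustration of consuming such a stream on the client side, the sketch below writes incoming audio chunks to a WAV container using Python's standard wave module. The report does not specify the stream's encoding, so the raw 16-bit mono PCM format, the 24 kHz sample rate, and the function name write_pcm_stream_to_wav are all assumptions. The chunks iterable stands in for whatever the real API client yields.

```python
import wave
from typing import BinaryIO, Iterable, Union

# Assumed audio format; the report does not state the stream's encoding.
SAMPLE_RATE_HZ = 24_000   # assumption
SAMPLE_WIDTH_BYTES = 2    # assumption: 16-bit PCM
CHANNELS = 1              # assumption: mono


def write_pcm_stream_to_wav(
    chunks: Iterable[bytes],
    target: Union[str, BinaryIO],
) -> int:
    """Write raw PCM chunks to a WAV file as they arrive.

    target is a file path or a seekable binary file object.
    Returns the number of complete audio frames written.
    """
    total_bytes = 0
    with wave.open(target, "wb") as wav:
        wav.setnchannels(CHANNELS)
        wav.setsampwidth(SAMPLE_WIDTH_BYTES)
        wav.setframerate(SAMPLE_RATE_HZ)
        for chunk in chunks:
            if not chunk:
                continue  # ignore empty chunks
            wav.writeframes(chunk)
            total_bytes += len(chunk)
    return total_bytes // (SAMPLE_WIDTH_BYTES * CHANNELS)
```

Writing chunk by chunk avoids holding the full utterance in memory. For live playback, the same loop could hand each chunk to an audio output device instead of a file.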

Pricing Status

Model | Input | Output | Positioning
MiMo-V2-TTS | Free for a limited time | Free for a limited time | Speech synthesis is temporarily not billed in the supplied report.

Official Resources