VoxCPM2 · 30-language TTS · voice cloning

VoxCPM: Tokenizer-Free Multilingual Voice Cloning TTS

OpenBMB's VoxCPM2 speech model, running live from the official Hugging Face Space

VoxCPM turns text into natural, expressive speech without a discrete speech tokenizer. Use the embedded demo below to test multilingual VoxCPM text-to-speech, design brand-new voices from a description, and clone a voice from a short reference clip at 48 kHz — no install required.

2B parameters · 30 languages · Voice design · Controllable cloning · 48 kHz output · Apache-2.0

VoxCPM2 · official Hugging Face Space

The demo above is the official VoxCPM-Demo Space, hosted on Hugging Face by OpenBMB. Whisper Web only embeds it for convenience. A cold start can take a little longer while the Space wakes up, and you should never paste confidential scripts or private voice samples into a public demo.

Quick answer

What is VoxCPM?

VoxCPM is an open-source text-to-speech model family from OpenBMB. Instead of converting speech into discrete audio tokens, it predicts continuous speech representations directly, which is why the project describes it as a tokenizer-free, diffusion-autoregressive TTS system. The result is speech that keeps natural rhythm, emotion and timbre rather than the flat delivery older pipelines often produce.

VoxCPM2 is the current flagship release: a 2B-parameter model built on a MiniCPM-4 backbone and trained, according to its public model card, on more than two million hours of multilingual speech. It supports 30 languages plus several Chinese dialects, accepts a short reference clip for cloning, and renders 48 kHz studio-quality audio through AudioVAE V2. The lighter VoxCPM1.5 (0.6B) and the original VoxCPM-0.5B checkpoints are also published for resource-constrained setups.

The Space embedded on this page is the fastest way to judge VoxCPM for yourself. It exposes the three workflows teams test first — direct multilingual text-to-speech, natural-language voice design, and reference-based voice cloning — so you can hear the model before deciding whether to self-host it.

Showcase

What you can produce with VoxCPM

These examples illustrate the kind of finished audio projects VoxCPM is built for — from designed brand voices to multilingual narration and consent-based cloning. Generate your own versions in the live demo above.

Design a voice from words

Describe a speaker — age, gender, tone, emotion, pace — and VoxCPM voice design invents a brand-new synthetic voice without any reference recording.

Clone a voice you own

Upload a short, clean reference clip and VoxCPM controllable cloning preserves the timbre while you steer emotion and pacing. Use it only with consent.

Narrate in 30 languages

Draft voiceovers for tutorials, explainers and product tours across VoxCPM's 30 supported languages before committing to a final human recording.

Test dubbing at 48 kHz

Evaluate pronunciation, rhythm and expressiveness for localized scripts, then export 48 kHz studio-quality audio straight from the VoxCPM2 pipeline.

Features

Why VoxCPM2 stands out

VoxCPM brings multilingual generation, voice design, cloning and production-ready serving together in a single open model you can run yourself.

Tokenizer-free diffusion-autoregressive TTS

VoxCPM2 generates continuous speech representations through a four-stage LocEnc → TSLM → RALM → LocDiT pipeline instead of relying on discrete speech tokens, which helps it preserve natural prosody and timbre.

30-language multilingual speech

The model card lists 30 supported languages — including English, Chinese, Japanese, Korean, Spanish, French, German, Portuguese, Hindi, Arabic and Vietnamese — plus nine Chinese dialects.

Voice design from a description

Describe a speaker in plain language — gender, age, tone, emotion or pace — and VoxCPM creates an entirely new synthetic voice, with no reference recording needed.

Controllable and ultimate cloning

Clone a voice from a short clip and add style guidance to steer delivery, or provide the reference audio together with its transcript for the highest-fidelity 'ultimate' cloning mode.

48 kHz output with fast serving

VoxCPM2 takes 16 kHz reference audio and returns 48 kHz studio-quality speech. The docs report real-time streaming, an OpenAI-compatible vLLM-Omni server and Nano-vLLM acceleration with an RTF as low as ~0.13.
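To illustrate what an OpenAI-compatible server means for client code, here is a minimal sketch that builds a request in the shape of the OpenAI speech endpoint (POST /audio/speech with model, input, voice and response_format fields). The server address, model id and voice name are placeholders rather than values taken from the VoxCPM documentation, and which fields a given vLLM-Omni deployment honors is an assumption to verify against its own docs.

```python
import json
import urllib.request

# Assumptions (not from the VoxCPM docs): the server address, the model id
# and the voice name are placeholders for a self-hosted deployment.
BASE_URL = "http://localhost:8000/v1"
MODEL_ID = "openbmb/VoxCPM2"


def build_speech_request(text: str, voice: str = "default") -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style POST /audio/speech request."""
    payload = {
        "model": MODEL_ID,
        "input": text,
        "voice": voice,
        "response_format": "wav",
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(text: str, out_path: str = "sample.wav") -> None:
    """Send the request and save the returned audio bytes to disk."""
    with urllib.request.urlopen(build_speech_request(text)) as response:
        with open(out_path, "wb") as f:
            f.write(response.read())
```

Calling synthesize("Hello from a self-hosted server.") would write sample.wav once a compatible server is actually listening at BASE_URL; build_speech_request can be inspected on its own without any network access.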

Open Apache-2.0, safety-aware release

VoxCPM is released under Apache-2.0 and free for commercial use, but the project explicitly forbids impersonation, fraud and disinformation — always label AI-generated audio and get consent before cloning a real voice.

How to use

Try VoxCPM in four steps

The hosted Space gives you the core VoxCPM2 workflows without any local setup.

Open the VoxCPM Space

Use the demo embedded above or open it in a new tab. A cold start may take extra time while the hosted environment loads its dependencies and the VoxCPM2 weights.

Enter text in a supported language

Paste your script directly. VoxCPM2 handles multilingual input without a separate language tag, so start with one short sentence in your target language.

Pick voice design or cloning

For voice design, write a natural-language speaker description before your text. For cloning, upload a clean single-speaker reference clip and add style guidance only when you want to change pace or emotion.

Tune the settings and listen

Adjust the inference-step, guidance, and denoise or style controls in the demo, then generate a short sample. Iterate until pronunciation, pacing and speaker character sound right.

Model details

VoxCPM capabilities at a glance

Key public facts about VoxCPM, drawn from the VoxCPM2 model card, the OpenBMB GitHub repo and the project's documentation, summarized here for quick evaluation.

Model family

VoxCPM, VoxCPM1.5 and VoxCPM2 by OpenBMB

Primary tasks

Multilingual text-to-speech, voice design, controllable voice cloning and high-fidelity 'ultimate' cloning

Architecture

Tokenizer-free, diffusion-autoregressive pipeline (LocEnc → TSLM → RALM → LocDiT) with AudioVAE V2

Size and backbone

VoxCPM2 is 2B parameters on a MiniCPM-4 backbone; VoxCPM1.5 (0.6B) and VoxCPM-0.5B are also available

Training scale

More than 2 million hours of multilingual speech, per the public model card

Language coverage

30 languages plus nine Chinese dialects, including Cantonese, Sichuanese, Wu and Northeastern Mandarin

Audio quality

Accepts 16 kHz reference audio and outputs 48 kHz studio-quality speech

Hosted demo

Official Gradio Space openbmb/VoxCPM-Demo, served from openbmb-voxcpm-demo.hf.space

License

Apache-2.0, described by the project as free for commercial use

Use cases

What you can build with VoxCPM

VoxCPM is most useful when a voice project needs multilingual breadth, controllable expression and the freedom to self-host an open model.

Multilingual product narration

Create draft voiceovers for product tours, lessons and explainers across many languages before recording final human narration.

Voice-agent prototyping

Explore voice personas, speaking rate, emotional tone and 48 kHz quality before wiring VoxCPM into a production voice-agent stack.

Dubbing and localization tests

Check pronunciation, rhythm and expressiveness for localized scripts in languages that older TTS systems handle poorly.

Consent-based voice cloning

Clone a speaker only when you have permission, then use controllable guidance to test different emotions and pacing while keeping the original timbre.

Synthetic character voices

Reach for voice design when you need a fresh, brand-safe or fictional voice without cloning a real person at all.

Open-model evaluation

Benchmark VoxCPM against other open TTS systems, try the Python package, or study the Nano-vLLM serving path for lower-latency experiments.
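For latency comparisons, the usual yardstick is the real-time factor (RTF): generation time divided by the duration of the audio produced, so values below 1 mean faster than real time. The sketch below shows one way to measure it. The synthesize callable is a hypothetical stand-in for whichever TTS system you are benchmarking and is assumed to return a sequence of audio samples; the 48 kHz default matches the output rate quoted for VoxCPM2.

```python
import time
from typing import Callable, Sequence


def real_time_factor(generation_seconds: float, num_samples: int,
                     sample_rate: int = 48_000) -> float:
    """Return generation time divided by the duration of the generated audio."""
    audio_seconds = num_samples / sample_rate
    if audio_seconds <= 0:
        raise ValueError("audio must contain at least one sample")
    return generation_seconds / audio_seconds


def measure_rtf(synthesize: Callable[[str], Sequence[float]], text: str,
                sample_rate: int = 48_000) -> float:
    """Time one call to a TTS function and convert the result to an RTF.

    `synthesize` is a hypothetical wrapper around the system under test.
    """
    start = time.perf_counter()
    samples = synthesize(text)
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, len(samples), sample_rate)
```

At an RTF of 0.13, ten seconds of 48 kHz audio (480,000 samples) would take about 1.3 seconds to generate. When comparing systems, run each on the same text and hardware, and discard the first call so model loading is not counted.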

Best practices

Tips for better VoxCPM results

Start with short text so you can quickly judge pronunciation, rhythm and voice consistency before generating long passages.
Use clean, single-speaker reference audio for cloning; noisy clips or overlapping speakers make speaker similarity hard to evaluate.
If a designed voice misses the mark, rewrite the speaker description with concrete attributes — age, pace, emotion and vocal texture.
For the highest-fidelity 'ultimate' cloning, provide the reference transcript when the demo asks for it, because transcript-aligned context preserves the original delivery.
Keep confidential scripts and private voice samples out of any public Space — self-host VoxCPM2 for sensitive production work.
Never use VoxCPM for impersonation, fraud or unlabeled synthetic media, and always get consent before cloning a real voice.
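The reference-audio tip above can be partly automated before you upload anything. The following sketch uses Python's standard wave module to read a PCM WAV file's channel count, sample rate and duration, then flags clips that look unsuitable. The duration limits are illustrative assumptions, not values from the VoxCPM documentation, and a check like this cannot detect background noise or overlapping speakers, so you still need to listen to the clip.

```python
import io
import wave


def describe_wav(source) -> dict:
    """Return basic properties of a PCM WAV file (path or file-like object)."""
    with wave.open(source, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_rate": rate,
            "seconds": frames / rate,
        }


def reference_warnings(info: dict, min_seconds: float = 3.0,
                       max_seconds: float = 30.0) -> list[str]:
    """Flag likely problems; the duration limits are illustrative assumptions."""
    warnings = []
    if info["channels"] != 1:
        warnings.append("clip is not mono; consider downmixing to one channel")
    if info["sample_rate"] < 16_000:
        warnings.append("sample rate is below 16 kHz")
    if not min_seconds <= info["seconds"] <= max_seconds:
        warnings.append("clip duration is outside the assumed range")
    return warnings


def make_silent_wav(seconds: int, sample_rate: int, channels: int = 1) -> io.BytesIO:
    """Create an in-memory 16-bit silent WAV, useful for trying the checks."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * channels * sample_rate * seconds)
    buffer.seek(0)
    return buffer
```

For a real clip, pass its path instead of the in-memory buffer, for example reference_warnings(describe_wav("reference.wav")), where reference.wav is a hypothetical file name.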

FAQ

VoxCPM frequently asked questions

Short, practical answers for anyone evaluating VoxCPM for multilingual TTS, voice cloning, voice design and self-hosted AI voice work.

What is VoxCPM?

VoxCPM is OpenBMB's open-source, tokenizer-free text-to-speech model family. The current VoxCPM2 release is a 2B-parameter model for multilingual speech generation, voice design and voice cloning.

Can I try VoxCPM online for free?

Yes. This page embeds the official VoxCPM-Demo Hugging Face Space, so you can test VoxCPM2 directly in your browser without installing the Python package, and the model itself is released under Apache-2.0.

What languages does VoxCPM2 support?

The public VoxCPM2 model card lists 30 languages — among them English, Chinese, Japanese, Korean, Spanish, French, German, Portuguese, Hindi, Arabic and Vietnamese — plus nine Chinese dialects.

Does VoxCPM support voice cloning?

Yes. VoxCPM offers controllable cloning from a short reference clip and an 'ultimate' cloning mode that also uses the reference transcript for higher fidelity. Only clone voices you own or have explicit permission to use.

What is VoxCPM voice design?

Voice design lets you describe a desired speaker in natural language — for example their age, gender, tone, emotion or pace — and generate speech in that newly created voice without uploading any reference recording.

Is VoxCPM the same as Whisper Web?

No. VoxCPM is an external open-source model from OpenBMB, and the demo on this page runs on a public Hugging Face Space. Whisper Web is a separate browser-based speech-to-text product; this page is an independent guide that embeds the VoxCPM demo for convenience.

Try VoxCPM in your browser

Generate multilingual speech, experiment with voice design, and test consent-based cloning in the official Hugging Face demo before you install VoxCPM locally.
