VoxCPM: Tokenizer-Free Multilingual Voice Cloning TTS
OpenBMB's VoxCPM2 speech model, running live from the official Hugging Face Space
VoxCPM turns text into natural, expressive speech without a discrete speech tokenizer. Use the embedded demo below to test multilingual VoxCPM text-to-speech, design brand-new voices from a description, and clone a voice from a short reference clip. Output is 48 kHz audio, and no install is required.
The demo above is the official VoxCPM-Demo Space, hosted on Hugging Face by OpenBMB. Whisper Web only embeds it for convenience. A cold start can take a little longer while the Space wakes up, and you should never paste confidential scripts or private voice samples into a public demo.
What is VoxCPM?
VoxCPM is an open-source text-to-speech model family from OpenBMB. Instead of converting speech into discrete audio tokens, it predicts continuous speech representations directly, which is why the project describes it as a tokenizer-free, diffusion-autoregressive TTS system. The result is speech that keeps natural rhythm, emotion and timbre rather than the flat delivery older pipelines often produce.
VoxCPM2 is the current flagship release: a 2B-parameter model built on a MiniCPM-4 backbone and trained, according to its public model card, on more than two million hours of multilingual speech. It supports 30 languages plus nine Chinese dialects, accepts a short reference clip for cloning, and renders 48 kHz studio-quality audio through AudioVAE V2. Lighter VoxCPM1.5 (0.6B) and the original VoxCPM-0.5B checkpoints are also published for resource-constrained setups.
The Space embedded on this page is the fastest way to judge VoxCPM for yourself. It exposes the three workflows teams test first — direct multilingual text-to-speech, natural-language voice design, and reference-based voice cloning — so you can hear the model before deciding whether to self-host it.
What you can produce with VoxCPM
These examples illustrate the kind of finished audio projects VoxCPM is built for — from designed brand voices to multilingual narration and consent-based cloning. Generate your own versions in the live demo above.

Design a voice from words
Describe a speaker — age, gender, tone, emotion, pace — and VoxCPM voice design invents a brand-new synthetic voice without any reference recording.

Clone a voice you own
Upload a short, clean reference clip and VoxCPM controllable cloning preserves the timbre while you steer emotion and pacing — use it only with consent.

Narrate in 30 languages
Draft voiceovers for tutorials, explainers and product tours across VoxCPM's 30 supported languages before committing to a final human recording.

Test dubbing at 48 kHz
Evaluate pronunciation, rhythm and expressiveness for localized scripts, then export 48 kHz studio-quality audio straight from the VoxCPM2 pipeline.
Why VoxCPM2 stands out
VoxCPM brings multilingual generation, voice design, cloning and production-ready serving together in a single open model you can run yourself.
Tokenizer-free diffusion-autoregressive TTS
VoxCPM2 generates continuous speech representations through a four-stage LocEnc → TSLM → RALM → LocDiT pipeline instead of relying on discrete speech tokens, which helps it preserve natural prosody and timbre.
30-language multilingual speech
The model card lists 30 supported languages — including English, Chinese, Japanese, Korean, Spanish, French, German, Portuguese, Hindi, Arabic and Vietnamese — plus nine Chinese dialects.
Voice design from a description
Describe a speaker in plain language — gender, age, tone, emotion or pace — and VoxCPM creates an entirely new synthetic voice, with no reference recording needed.
Controllable and ultimate cloning
Clone a voice from a short clip and add style guidance to steer delivery, or provide the reference audio together with its transcript for the highest-fidelity 'ultimate' cloning mode.
48 kHz output with fast serving
VoxCPM2 takes 16 kHz reference audio and returns 48 kHz studio-quality speech. The docs report real-time streaming, an OpenAI-compatible vLLM-Omni server, and Nano-vLLM acceleration with a real-time factor (RTF, generation time divided by audio duration) as low as ~0.13.
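To make the reported RTF figure concrete, here is a minimal arithmetic sketch. The function names are illustrative and not part of any VoxCPM package. The ~0.13 value is the best case quoted from the docs, so actual timings will depend on hardware and settings.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Return RTF: time spent generating divided by duration of the audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def estimated_synthesis_seconds(audio_seconds: float, rtf: float) -> float:
    """Estimate how long generating a clip of a given length takes at a given RTF."""
    return audio_seconds * rtf


# A 60-second narration at the reported best-case RTF of ~0.13
print(round(estimated_synthesis_seconds(60.0, 0.13), 1))  # 7.8
```

An RTF below 1.0 means audio is produced faster than it plays back, which is what makes streaming feasible.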
Open Apache-2.0, safety-aware release
VoxCPM is released under Apache-2.0 and free for commercial use, but the project explicitly forbids impersonation, fraud and disinformation — always label AI-generated audio and get consent before cloning a real voice.
Try VoxCPM in four steps
The hosted Space gives you the core VoxCPM2 workflows without any local setup.
Open the VoxCPM Space
Use the demo embedded above or open it in a new tab. A cold start may take extra time while the hosted environment loads its dependencies and the VoxCPM2 weights.
Enter text in a supported language
Paste your script directly. VoxCPM2 handles multilingual input without a separate language tag, so start with one short sentence in your target language.
Pick voice design or cloning
For voice design, write a natural-language speaker description before your text. For cloning, upload a clean single-speaker reference clip and add style guidance only when you want to change pace or emotion.
Tune the settings and listen
Adjust the inference steps, guidance, and denoise or style controls in the demo. Generate a short sample, then iterate until pronunciation, pacing and speaker character are right.
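The text inputs for the steps above can be prepared offline before you open the demo. The sketch below is only a convenience for drafting: `SpeakerDescription` and `first_sentence` are hypothetical helpers, the attribute wording is an assumption rather than a format VoxCPM requires, and the sentence splitter is deliberately naive.

```python
import re
from dataclasses import dataclass


@dataclass
class SpeakerDescription:
    """Hypothetical container for the attributes a voice-design prompt can mention."""
    age: str = ""
    gender: str = ""
    tone: str = ""
    emotion: str = ""
    pace: str = ""

    def to_text(self) -> str:
        """Join the non-empty attributes into one plain-language description."""
        parts = [self.age, self.gender]
        if self.tone:
            parts.append(f"{self.tone} tone")
        if self.emotion:
            parts.append(f"{self.emotion} emotion")
        if self.pace:
            parts.append(f"{self.pace} pace")
        return ", ".join(p for p in parts if p)


def first_sentence(script: str) -> str:
    """Return the first sentence of a script (naive split on common end punctuation)."""
    match = re.search(r".+?[.!?。！？]", script.strip(), flags=re.S)
    return match.group(0).strip() if match else script.strip()


description = SpeakerDescription(age="middle-aged", gender="female", tone="warm", pace="slow")
print(description.to_text())
print(first_sentence("Welcome to the tour. Next we open settings."))
```

Paste the description and the short test sentence into the demo separately, then switch to the full script once the voice sounds right.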
VoxCPM capabilities at a glance
Key public facts about VoxCPM, drawn from the VoxCPM2 model card, the OpenBMB GitHub repo and the project's documentation, summarized here for quick evaluation.
What you can build with VoxCPM
VoxCPM is most useful when a voice project needs multilingual breadth, controllable expression and the freedom to self-host an open model.
Multilingual product narration
Create draft voiceovers for product tours, lessons and explainers across many languages before recording final human narration.
Voice-agent prototyping
Explore voice personas, speaking rate, emotional tone and 48 kHz quality before wiring VoxCPM into a production voice-agent stack.
Dubbing and localization tests
Check pronunciation, rhythm and expressiveness for localized scripts in languages that older TTS systems handle poorly.
Consent-based voice cloning
Clone a speaker only when you have permission, then use controllable guidance to test different emotions and pacing while keeping the original timbre.
Synthetic character voices
Reach for voice design when you need a fresh, brand-safe or fictional voice without cloning a real person at all.
Open-model evaluation
Benchmark VoxCPM against other open TTS systems, try the Python package, or study the Nano-vLLM serving path for lower-latency experiments.
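For serving experiments against a self-hosted, OpenAI-compatible endpoint, a request could be assembled roughly as follows. This is a sketch under stated assumptions: the `/v1/audio/speech` route and the `model`, `input`, `voice` and `response_format` fields follow the OpenAI speech API, while the base URL, model name and voice value are placeholders that should be checked against the documentation of the server you actually deploy.

```python
import json
import urllib.request


def build_speech_request(base_url: str, text: str,
                         model: str = "voxcpm2",      # placeholder model name
                         voice: str = "default"       # placeholder voice name
                         ) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style text-to-speech request."""
    payload = {
        "model": model,
        "input": text,
        "voice": voice,
        "response_format": "wav",
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_speech_request("http://localhost:8000", "Hello from a self-hosted server.")
print(request.full_url)

# Sending is left commented out because it needs a running server:
# with urllib.request.urlopen(request) as response:
#     with open("sample.wav", "wb") as out:
#         out.write(response.read())
```

Building the request separately from sending it keeps the payload easy to inspect and makes latency measurements simpler to wrap around the network call.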
Tips for better VoxCPM results
- Start with short text so you can quickly judge pronunciation, rhythm and voice consistency before generating long passages.
- Use clean, single-speaker reference audio for cloning; noisy clips or overlapping speakers make speaker similarity hard to evaluate.
- If a designed voice misses the mark, rewrite the speaker description with concrete attributes — age, pace, emotion and vocal texture.
- For the highest-fidelity 'ultimate' cloning, provide the reference transcript when the demo asks for it, because transcript-aligned context preserves the original delivery.
- Keep confidential scripts and private voice samples out of any public Space — self-host VoxCPM2 for sensitive production work.
- Never use VoxCPM for impersonation, fraud or unlabeled synthetic media, and always get consent before cloning a real voice.
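The reference-audio tip above can be partly checked before upload. The sketch below reads a WAV header with Python's standard `wave` module and reports channel count, sample rate and duration. The duration thresholds are illustrative assumptions, not limits published by VoxCPM, and a header check cannot tell whether a clip is noisy or contains more than one speaker, so you still need to listen to it.

```python
import wave


def inspect_reference_wav(path: str) -> dict:
    """Read basic properties of a PCM WAV file from its header."""
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        sample_rate = wav.getframerate()
        duration = wav.getnframes() / float(sample_rate)
    return {
        "channels": channels,
        "sample_rate": sample_rate,
        "duration_seconds": duration,
        "is_mono": channels == 1,
    }


def reference_warnings(info: dict,
                       min_seconds: float = 3.0,    # assumed lower bound
                       max_seconds: float = 30.0    # assumed upper bound
                       ) -> list:
    """Return human-readable warnings for a clip that may be a poor reference."""
    warnings = []
    if not info["is_mono"]:
        warnings.append("clip has more than one channel; check that it is a single speaker")
    if info["duration_seconds"] < min_seconds:
        warnings.append("clip may be too short to capture the speaker's timbre")
    if info["duration_seconds"] > max_seconds:
        warnings.append("clip is long; a shorter, cleaner excerpt is easier to evaluate")
    return warnings
```

Call `inspect_reference_wav("my_clip.wav")` on your own file, where `my_clip.wav` is a placeholder path, and pass the result to `reference_warnings` before uploading.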
VoxCPM frequently asked questions
Short, practical answers for anyone evaluating VoxCPM for multilingual TTS, voice cloning, voice design and self-hosted AI voice work.
What is VoxCPM?
VoxCPM is OpenBMB's open-source, tokenizer-free text-to-speech model family. The current VoxCPM2 release is a 2B-parameter model for multilingual speech generation, voice design and voice cloning.
Can I try VoxCPM online for free?
Yes. This page embeds the official VoxCPM-Demo Hugging Face Space, so you can test VoxCPM2 directly in your browser without installing the Python package. The model itself is also free to use under the Apache-2.0 license.
What languages does VoxCPM2 support?
The public VoxCPM2 model card lists 30 languages — among them English, Chinese, Japanese, Korean, Spanish, French, German, Portuguese, Hindi, Arabic and Vietnamese — plus nine Chinese dialects.
Does VoxCPM support voice cloning?
Yes. VoxCPM offers controllable cloning from a short reference clip and an 'ultimate' cloning mode that also uses the reference transcript for higher fidelity. Only clone voices you own or have explicit permission to use.
What is VoxCPM voice design?
Voice design lets you describe a desired speaker in natural language — for example their age, gender, tone, emotion or pace — and generate speech in that newly created voice without uploading any reference recording.
Is VoxCPM the same as Whisper Web?
No. VoxCPM is an external open-source model from OpenBMB, and the demo on this page runs on a public Hugging Face Space. Whisper Web is a separate browser-based speech-to-text product; this page is an independent guide that embeds the VoxCPM demo for convenience.
Try VoxCPM in your browser
Generate multilingual speech, experiment with voice design, and test consent-based cloning in the official Hugging Face demo before you install VoxCPM locally.