OmniVoice: Zero-Shot Voice Cloning TTS for 600+ Languages
k2-fsa's open OmniVoice speech model, running live from the official Hugging Face Space
OmniVoice turns text into natural, expressive speech in hundreds of languages and clones a voice from just a few seconds of reference audio. Use the embedded demo below to test OmniVoice text-to-speech, design a brand-new speaker from plain-language attributes, and add expressive cues like [laughter] — no install required.
The demo above is the official OmniVoice Space, hosted on Hugging Face by k2-fsa. Whisper Web only embeds it for convenience. The Space runs on shared GPUs, so a cold start can take a little longer while it wakes up, and you should never paste confidential scripts or upload private voice samples into a public demo.
What is OmniVoice?
OmniVoice is an open-source, massively multilingual text-to-speech model from k2-fsa — the Next-gen Kaldi team behind Kaldi, k2 and sherpa. It is a zero-shot TTS system, which means it can read your text in a target voice without any per-speaker training: give it a short reference clip and it reproduces that voice, or describe a speaker in words and it invents one.
Technically, OmniVoice is built as a single-stage diffusion language model. A bidirectional Transformer, initialized from Qwen3-0.6B-Base, is trained with a masked-diffusion objective that maps text directly to multi-codebook acoustic tokens — skipping the separate text-to-semantic-to-acoustic stages older pipelines rely on. According to the project's paper it was trained on roughly 581,000 hours of audio spanning 646 languages, which is why the team describes it as having the broadest language coverage of any current zero-shot TTS model.
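To make the masked-diffusion idea concrete, here is a toy sketch of iterative unmasking: a sequence starts fully masked and a model fills in the most confident positions over a fixed number of steps. It is an illustration only, not OmniVoice's implementation. The predictor is a random stand-in for the Transformer, there is a single token stream rather than multiple codebooks, and every name in it (`MASK`, `toy_predict`, `iterative_unmask`) is hypothetical.

```python
import random

MASK = -1  # sentinel for a position that has not been decoded yet


def toy_predict(tokens, vocab_size, rng):
    """Random stand-in for the model: one (token, confidence) proposal per position."""
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]


def iterative_unmask(length, vocab_size=1024, steps=16, seed=0):
    """Fill a fully masked sequence over a fixed number of refinement steps."""
    rng = random.Random(seed)
    tokens = [MASK] * length
    for step in range(steps):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        proposals = toy_predict(tokens, vocab_size, rng)
        # Commit an even share of the remaining positions, most confident first.
        remaining_steps = steps - step
        n_commit = -(-len(masked) // remaining_steps)  # ceiling division
        masked.sort(key=lambda i: proposals[i][1], reverse=True)
        for i in masked[:n_commit]:
            tokens[i] = proposals[i][0]
    return tokens


if __name__ == "__main__":
    decoded = iterative_unmask(length=50, steps=16)
    print(len(decoded), MASK in decoded)
```

Because the last step always commits every remaining position, the sequence is guaranteed to be complete after `steps` iterations, which is why the step count acts as a direct speed and quality dial in this style of decoder.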
The Space embedded on this page is the quickest way to judge OmniVoice for yourself. It exposes the three workflows teams test first — direct multilingual text-to-speech, reference-based voice cloning, and attribute-driven voice design — and outputs 24 kHz audio fast enough to iterate in near real time, so you can hear the model before deciding whether to self-host it.
What you can produce with OmniVoice
These examples illustrate the kind of finished audio projects OmniVoice is built for — from consent-based voice clones to designed speakers, multilingual narration and expressive, directed delivery. Generate your own versions in the live demo above.

Clone a voice in seconds
Upload a short, clean reference clip and OmniVoice reproduces that speaker's timbre zero-shot — no fine-tuning — so you can prototype audiobooks or assistants in a familiar voice you have permission to use.

Design a speaker from attributes
Skip the reference recording entirely: set gender, age, pitch, accent or dialect — even a whisper style — and OmniVoice voice design composes a brand-new synthetic speaker to match.

Narrate in 600+ languages
Draft voiceovers for tutorials, explainers and localized product tours across OmniVoice's 600+ supported languages before committing to a final human recording.

Direct expressive delivery
Drop in non-verbal cues such as [laughter], or fix tricky names with pinyin and phonemes, to steer OmniVoice toward the exact pronunciation and emotion a scene needs.
Why OmniVoice stands out
OmniVoice brings massively multilingual generation, zero-shot cloning, attribute-based voice design and fast, open serving together in a single compact model you can run yourself.
Zero-shot voice cloning
Give OmniVoice a few seconds of clean reference audio and it reproduces that speaker's voice with no per-speaker fine-tuning — the project reports state-of-the-art zero-shot cloning quality and high speaker similarity.
Speech in 600+ languages
Trained on roughly 581k hours of audio spanning 646 languages, OmniVoice is described by the team as having the broadest language coverage of any current zero-shot TTS model.
Voice design from attributes
Describe a speaker by gender, age, pitch, accent, dialect or whisper style and OmniVoice creates an entirely new synthetic voice — no reference recording required.
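When iterating on a designed voice, it can help to keep the attribute set in one structured place and render the description from it. The sketch below assumes a plain comma-separated description is acceptable input. The `VoiceSpec` class and its output format are hypothetical and are not part of OmniVoice's API; only the attribute names come from the list above.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass(frozen=True)
class VoiceSpec:
    """Hypothetical container for the voice-design attributes named on this page."""
    gender: Optional[str] = None
    age: Optional[str] = None
    pitch: Optional[str] = None
    accent: Optional[str] = None
    dialect: Optional[str] = None
    whisper: bool = False

    def describe(self) -> str:
        """Render only the attributes that were set, in declaration order."""
        parts = []
        for f in fields(self):
            value = getattr(self, f.name)
            if f.name == "whisper":
                if value:
                    parts.append("whisper style")
            elif value:
                parts.append(f"{f.name}: {value}")
        return ", ".join(parts)


if __name__ == "__main__":
    spec = VoiceSpec(gender="female", age="middle-aged", pitch="low")
    print(spec.describe())
```

Keeping the spec immutable makes it easy to log exactly which combination produced a given sample before changing one attribute at a time.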
Fine-grained expressive control
Insert non-verbal symbols such as [laughter] and correct tricky pronunciations with pinyin or phonemes, so you can direct delivery instead of re-recording it.
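Before sending a script to the model, a quick pre-check can confirm that bracketed cues are spelled as intended. This is a minimal sketch that assumes cues are lowercase words in square brackets. `[laughter]` is the only cue confirmed on this page, so the `KNOWN_CUES` set and both helper functions are illustrative, not an official list or API.

```python
import re

CUE_PATTERN = re.compile(r"\[([a-z_]+)\]")
KNOWN_CUES = {"laughter"}  # assumption: extend with cues the model card documents


def split_cues(script):
    """Split a script into ordered ("text", ...) and ("cue", ...) segments."""
    segments = []
    pos = 0
    for match in CUE_PATTERN.finditer(script):
        text = script[pos:match.start()].strip()
        if text:
            segments.append(("text", text))
        segments.append(("cue", match.group(1)))
        pos = match.end()
    tail = script[pos:].strip()
    if tail:
        segments.append(("text", tail))
    return segments


def unknown_cues(script):
    """Return bracketed cues that are not in KNOWN_CUES, sorted and de-duplicated."""
    return sorted({v for kind, v in split_cues(script)
                   if kind == "cue" and v not in KNOWN_CUES})


if __name__ == "__main__":
    print(split_cues("That was great [laughter] see you soon"))
    print(unknown_cues("Hi [laughter] there [sigh]"))
```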
Single-stage, real-time-class speed
A diffusion language model maps text straight to acoustic tokens and renders 24 kHz audio with a real-time factor reported as low as ~0.025 — roughly 40x faster than real time — using just 16-32 diffusion steps.
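The real-time factor (RTF) is synthesis time divided by the duration of the audio produced, so lower is faster. The small helpers below show how the reported ~0.025 figure translates into a speed-up and a wall-clock estimate. The function names are illustrative, and actual timings depend on hardware and step count.

```python
def real_time_factor(synthesis_seconds, audio_seconds):
    """RTF = time spent generating / duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def speedup(rtf):
    """How many times faster than real time a given RTF is."""
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return 1.0 / rtf


def synthesis_time(audio_seconds, rtf):
    """Estimated wall-clock seconds to generate audio of the given length."""
    return audio_seconds * rtf


if __name__ == "__main__":
    print(speedup(0.025))             # speed-up at the reported RTF
    print(synthesis_time(60, 0.025))  # seconds to generate one minute of audio
```

At an RTF of 0.025, one minute of speech would take about 1.5 seconds to generate, which matches the "roughly 40x faster than real time" figure.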
Open Apache-2.0, safety-aware release
OmniVoice is released under Apache-2.0 and is free for commercial use. Cloning a real person's voice is a powerful capability, so always get consent before cloning a voice and label AI-generated audio clearly.
Try OmniVoice in four steps
The hosted Space gives you the core OmniVoice workflows — text-to-speech, cloning and voice design — without any local setup.
Open the OmniVoice Space
Use the demo embedded above or open it in a new tab. Because the Space runs on shared GPUs, a cold start may take extra time while it loads the OmniVoice weights.
Enter text in your target language
Paste a short script in any of the 600+ supported languages. Start with one or two sentences so you can quickly judge pronunciation and pacing before generating longer passages.
Pick cloning or voice design
For cloning, upload a clean single-speaker reference clip. For voice design, describe the speaker instead — gender, age, pitch, accent or whisper style — with no reference audio at all.
Tune the controls and listen
Adjust diffusion steps and speaking-speed settings, add cues like [laughter] or pinyin corrections where needed, generate a sample, then iterate until the voice and delivery are right.
OmniVoice capabilities at a glance
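Key public facts about OmniVoice, drawn from the k2-fsa model card, the OmniVoice GitHub repository and the project's paper, summarized here for quick evaluation.
- Developer: k2-fsa (the Next-gen Kaldi team)
- Architecture: single-stage diffusion language model, a bidirectional Transformer initialized from Qwen3-0.6B-Base
- Training data: roughly 581,000 hours of audio spanning 646 languages
- Language coverage: 600+ languages
- Workflows: multilingual text-to-speech, zero-shot voice cloning, attribute-based voice design
- Output: 24 kHz audio
- Speed: real-time factor reported as low as ~0.025, using 16-32 diffusion steps
- License: Apache-2.0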
What you can build with OmniVoice
OmniVoice is most useful when a voice project needs broad language coverage, zero-shot cloning and the freedom to self-host a small, fast open model.
Multilingual product narration
Create draft voiceovers for product tours, lessons and explainers across hundreds of languages before recording final human narration.
Consent-based voice cloning
Clone a speaker zero-shot from a short clip — only with permission — to prototype audiobooks, characters or personalized assistants in a familiar voice.
Synthetic character voices
Reach for voice design when you need a fresh, brand-safe or fictional speaker, dialing in age, accent and tone without cloning a real person at all.
Low-resource language coverage
Reach languages and dialects that mainstream TTS systems skip, thanks to OmniVoice's 600+ language training footprint.
Voice-agent prototyping
Explore personas, accents and speaking rate, and check 24 kHz quality, before wiring OmniVoice into a production voice-agent stack.
Open-model evaluation
Benchmark OmniVoice against other open TTS systems, try the pip package, or study the diffusion-LM serving path for low-latency experiments.
Tips for better OmniVoice results
- Start with short text so you can quickly judge pronunciation, rhythm and voice consistency before generating long passages.
- Use clean, single-speaker reference audio for cloning; noisy clips or overlapping speakers make speaker similarity hard to evaluate.
- If a designed voice misses the mark, rewrite the attributes with concrete values — gender, age, pitch, accent and whisper style can be combined.
- For tricky names or rare words, use pinyin or phoneme hints, and add cues like [laughter] only where you actually want them.
- More diffusion steps usually trade speed for stability; lower the step count when you just need a fast draft and raise it for final takes.
- Keep confidential scripts and private voice samples out of any public Space — self-host OmniVoice for sensitive production work, and never use it for impersonation, fraud or unlabeled synthetic media.
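For the reference-audio tip above, a basic sanity check on a clip can catch obvious problems before upload. The sketch below uses Python's standard `wave` module, so it only handles PCM WAV files. The mono requirement and the duration thresholds are illustrative assumptions, not limits documented by OmniVoice, and a check like this cannot detect noise or overlapping speakers.

```python
import wave


def inspect_reference(path, min_seconds=3.0, max_seconds=30.0):
    """Return (info, problems) for a PCM WAV clip; thresholds are illustrative."""
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        sample_rate = wav.getframerate()
        duration = wav.getnframes() / sample_rate
    problems = []
    if channels != 1:
        # Assumption: a mono file is the simplest single-speaker reference.
        problems.append(f"expected mono audio, found {channels} channels")
    if duration < min_seconds:
        problems.append(f"clip is {duration:.1f}s, shorter than {min_seconds}s")
    if duration > max_seconds:
        problems.append(f"clip is {duration:.1f}s, longer than {max_seconds}s")
    info = {
        "channels": channels,
        "sample_rate": sample_rate,
        "duration_seconds": duration,
    }
    return info, problems


if __name__ == "__main__":
    import sys
    info, problems = inspect_reference(sys.argv[1])
    print(info)
    for problem in problems:
        print("warning:", problem)
```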
OmniVoice frequently asked questions
Short, practical answers for anyone evaluating OmniVoice for multilingual TTS, zero-shot voice cloning, voice design and self-hosted AI voice work.
What is OmniVoice?
OmniVoice is k2-fsa's open-source, massively multilingual zero-shot text-to-speech model. Built as a single-stage diffusion language model initialized from Qwen3-0.6B-Base, it generates natural speech in 600+ languages and clones a voice from a short reference clip without per-speaker training.
Can I try OmniVoice online for free?
Yes. This page embeds the official OmniVoice Hugging Face Space, so you can test OmniVoice directly in your browser without installing the Python package. The model itself is released under Apache-2.0.
How many languages does OmniVoice support?
OmniVoice supports 600+ languages. According to the project's paper it was trained on roughly 581,000 hours of audio spanning 646 languages, which the team describes as the broadest language coverage of any current zero-shot TTS model.
Does OmniVoice support voice cloning?
Yes. OmniVoice does zero-shot voice cloning: provide a few seconds of clean reference audio and it reproduces that speaker's voice with no fine-tuning. Only clone voices you own or have explicit permission to use.
What is OmniVoice voice design?
Voice design lets you create a speaker from attributes instead of a recording. Describe gender, age, pitch, accent, dialect or a whisper style and OmniVoice generates a brand-new synthetic voice that matches, with no reference audio required.
Is OmniVoice the same as Whisper Web?
No. OmniVoice is an external open-source model from k2-fsa, and the demo on this page runs on a public Hugging Face Space. Whisper Web is a separate browser-based speech-to-text product; this page is an independent guide that embeds the OmniVoice demo for convenience.
Try OmniVoice in your browser
Generate multilingual speech, clone a consenting speaker zero-shot, and experiment with attribute-based voice design in the official Hugging Face demo before you install OmniVoice locally.