OmniVoice · 600+ languages · zero-shot voice cloning

OmniVoice: Zero-Shot Voice Cloning TTS for 600+ Languages

k2-fsa's open OmniVoice speech model, running live from the official Hugging Face Space

OmniVoice turns text into natural, expressive speech in hundreds of languages and clones a voice from just a few seconds of reference audio. Use the embedded demo below to test OmniVoice text-to-speech, design a brand-new speaker from plain-language attributes, and add expressive cues like [laughter] — no install required.

Diffusion-LM TTS · 600+ languages · Zero-shot cloning · Voice design · 24 kHz output · Apache-2.0
[Embedded demo: OmniVoice, official Hugging Face Space]

The demo above is the official OmniVoice Space, hosted on Hugging Face by k2-fsa; Whisper Web only embeds it for convenience. The Space runs on shared GPUs, so a cold start can take a little longer while it wakes up. Because it is a public demo, never paste confidential scripts or upload private voice samples into it.

Quick answer

What is OmniVoice?

OmniVoice is an open-source, massively multilingual text-to-speech model from k2-fsa — the Next-gen Kaldi team behind Kaldi, k2 and sherpa. It is a zero-shot TTS system, which means it can read your text in a target voice without any per-speaker training: give it a short reference clip and it reproduces that voice, or describe a speaker in words and it invents one.

Technically, OmniVoice is built as a single-stage diffusion language model. A bidirectional Transformer, initialized from Qwen3-0.6B-Base, is trained with a masked-diffusion objective that maps text directly to multi-codebook acoustic tokens — skipping the separate text-to-semantic-to-acoustic stages older pipelines rely on. According to the project's paper it was trained on roughly 581,000 hours of audio spanning 646 languages, which is why the team describes it as having the broadest language coverage of any current zero-shot TTS model.
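To make the masked-diffusion idea concrete, here is a toy sketch of iterative unmasking: start from a fully masked token sequence and reveal a share of it at every step. This is not the project's code. The cosine schedule, the single token stream (the real model uses multiple codebooks), and the random stub standing in for the Transformer are all assumptions made for illustration.

```python
import math
import random

MASK = -1  # placeholder id for an acoustic token that has not been generated yet


def unmask_schedule(num_tokens: int, num_steps: int) -> list[int]:
    """Number of tokens to reveal at each step under a cosine schedule (assumed)."""
    # Tokens still masked after step t: num_tokens * cos(pi/2 * t / num_steps).
    remaining = [
        round(num_tokens * math.cos(math.pi / 2 * t / num_steps))
        for t in range(num_steps + 1)
    ]
    return [remaining[t] - remaining[t + 1] for t in range(num_steps)]


def toy_decode(num_tokens: int, num_steps: int, vocab_size: int = 1024, seed: int = 0) -> list[int]:
    """Fill a fully masked sequence over num_steps rounds of unmasking."""
    rng = random.Random(seed)
    tokens = [MASK] * num_tokens
    for reveal in unmask_schedule(num_tokens, num_steps):
        masked = [i for i, tok in enumerate(tokens) if tok == MASK]
        # A real model would predict all masked positions in parallel and keep
        # its most confident guesses; this stub picks positions and ids at random.
        for i in rng.sample(masked, reveal):
            tokens[i] = rng.randrange(vocab_size)
    return tokens


if __name__ == "__main__":
    print(unmask_schedule(100, 16))
    print(toy_decode(12, 4))
```

The point of the sketch is the shape of the loop: the number of passes is fixed by the step count, not by the length of the audio, which is what separates this family of models from token-by-token autoregressive decoding.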

The Space embedded on this page is the quickest way to judge OmniVoice for yourself. It exposes the three workflows teams test first — direct multilingual text-to-speech, reference-based voice cloning, and attribute-driven voice design — and outputs 24 kHz audio fast enough to iterate in near real time, so you can hear the model before deciding whether to self-host it.

Showcase

What you can produce with OmniVoice

These examples illustrate the kind of finished audio projects OmniVoice is built for — from consent-based voice clones to designed speakers, multilingual narration and expressive, directed delivery. Generate your own versions in the live demo above.

[Image: OmniVoice zero-shot voice cloning from a few seconds of reference audio]

Clone a voice in seconds

Upload a short, clean reference clip and OmniVoice reproduces that speaker's timbre zero-shot — no fine-tuning — so you can prototype audiobooks or assistants in a familiar voice you have permission to use.

[Image: OmniVoice voice design, creating a new synthetic speaker from attributes like gender, age, pitch and accent]

Design a speaker from attributes

Skip the reference recording entirely: set gender, age, pitch, accent or dialect — even a whisper style — and OmniVoice voice design composes a brand-new synthetic speaker to match.

[Image: OmniVoice multilingual narration generating voiceovers across 600+ languages]

Narrate in 600+ languages

Draft voiceovers for tutorials, explainers and localized product tours across OmniVoice's 600+ supported languages before committing to a final human recording.

[Image: OmniVoice expressive control with non-verbal cues like laughter and pinyin pronunciation correction]

Direct expressive delivery

Drop in non-verbal cues such as [laughter], or fix tricky names with pinyin and phonemes, to steer OmniVoice toward the exact pronunciation and emotion a scene needs.

Showcase images illustrate typical OmniVoice use cases; generate your own audio in the live demo above.

Features

Why OmniVoice stands out

OmniVoice brings massively multilingual generation, zero-shot cloning, attribute-based voice design and fast, open serving together in a single compact model you can run yourself.

Zero-shot voice cloning

Give OmniVoice a few seconds of clean reference audio and it reproduces that speaker's voice with no per-speaker fine-tuning — the project reports state-of-the-art zero-shot cloning quality and high speaker similarity.

Speech in 600+ languages

Trained on roughly 581k hours across 646 languages, OmniVoice covers 600+ languages — described by the team as the broadest language coverage of any current zero-shot TTS model.
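For a sense of scale, the two reported figures can be divided to get an average. The sketch below does only that arithmetic; the page does not say how the hours are distributed, so the mean should not be read as the amount of data behind any particular language.

```python
TRAINING_HOURS = 581_000  # approximate total reported by the project's paper
LANGUAGES = 646           # languages reported by the project's paper


def mean_hours_per_language(total_hours: float, languages: int) -> float:
    """Plain average; says nothing about how evenly the data is spread."""
    return total_hours / languages


if __name__ == "__main__":
    print(f"{mean_hours_per_language(TRAINING_HOURS, LANGUAGES):.1f} hours per language on average")
```

That works out to roughly 899 hours per language on average.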

Voice design from attributes

Describe a speaker by gender, age, pitch, accent, dialect or whisper style and OmniVoice creates an entirely new synthetic voice — no reference recording required.
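If you plan to compare several designed voices, it can help to keep the attributes in a small structure and render them consistently. The sketch below is a hypothetical helper, not an OmniVoice API: the `VoiceSpec` class, its field names and the "key: value" text it produces are assumptions, and the Space may expect the attributes in a different form.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class VoiceSpec:
    """Hypothetical container for the attributes this guide mentions."""
    gender: Optional[str] = None
    age: Optional[str] = None
    pitch: Optional[str] = None
    accent: Optional[str] = None
    dialect: Optional[str] = None
    whisper: bool = False

    def describe(self) -> str:
        """Render only the attributes that were set, in field order."""
        parts = [
            f"{name}: {value}"
            for name, value in asdict(self).items()
            if isinstance(value, str) and value
        ]
        if self.whisper:
            parts.append("style: whisper")
        return ", ".join(parts)


if __name__ == "__main__":
    spec = VoiceSpec(gender="female", age="middle-aged", accent="British English")
    print(spec.describe())
```

Keeping the attributes explicit makes it easy to change one value at a time and hear what actually moved.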

Fine-grained expressive control

Insert non-verbal symbols such as [laughter] and correct tricky pronunciations with pinyin or phonemes, so you can direct delivery instead of re-recording it.
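Before sending a long script, a quick pass over the bracketed cues can catch typos. The sketch below is a hypothetical pre-flight check: `[laughter]` is the only cue this page names, so it is the only entry in the allow-list, and anything else in square brackets is merely flagged for review. If pinyin or phoneme hints also use brackets in the real syntax, they would be flagged too.

```python
import re

# Only cue named in this guide; extend after checking the model's documentation.
KNOWN_CUES = {"[laughter]"}

# Any run of non-bracket characters enclosed in square brackets.
CUE_PATTERN = re.compile(r"\[[^\[\]]+\]")


def find_cues(script: str) -> list[str]:
    """Return every bracketed cue in the order it appears."""
    return CUE_PATTERN.findall(script)


def unknown_cues(script: str) -> list[str]:
    """Return bracketed cues that are not in the allow-list."""
    return [cue for cue in find_cues(script) if cue.lower() not in KNOWN_CUES]


if __name__ == "__main__":
    line = "Well [laughter] that was unexpected [laugther]."
    print(find_cues(line))
    print(unknown_cues(line))
```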

Single-stage, real-time-class speed

A diffusion language model maps text straight to acoustic tokens and renders 24 kHz audio with a real-time factor reported as low as ~0.025 — roughly 40x faster than real time — using just 16-32 diffusion steps.
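The speed claim is easy to check by hand. Real-time factor (RTF) is compute time divided by audio duration, so the speedup is its reciprocal. The sketch below only restates that arithmetic with the reported figures; actual speed depends on hardware, step count and text length.

```python
SAMPLE_RATE_HZ = 24_000  # output sample rate stated for OmniVoice


def synthesis_seconds(audio_seconds: float, rtf: float) -> float:
    """Compute time implied by an RTF (compute time / audio time)."""
    return audio_seconds * rtf


def speedup(rtf: float) -> float:
    """How many times faster than real time a given RTF is."""
    return 1.0 / rtf


def num_samples(audio_seconds: float) -> int:
    """Number of output samples for a clip of the given length."""
    return round(audio_seconds * SAMPLE_RATE_HZ)


if __name__ == "__main__":
    rtf = 0.025  # best-case figure reported by the project
    print(f"speedup: {speedup(rtf):.0f}x")
    print(f"60 s of audio: {synthesis_seconds(60, rtf):.2f} s of compute")
    print(f"60 s of audio: {num_samples(60):,} samples")
```

At an RTF of 0.025, one minute of speech needs about 1.5 seconds of compute and contains 1,440,000 samples at 24 kHz.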

Open Apache-2.0, safety-aware release

OmniVoice is released under Apache-2.0, which the project describes as free for commercial use. Cloning a real person's voice is easy to misuse, so always get consent before cloning a voice and label AI-generated audio clearly.

How to use

Try OmniVoice in four steps

The hosted Space gives you the core OmniVoice workflows — text-to-speech, cloning and voice design — without any local setup.

1. Open the OmniVoice Space

Use the demo embedded above or open it in a new tab. Because the Space runs on shared GPUs, a cold start may take extra time while it loads the OmniVoice weights.
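If you would rather script the Space than click through it, the `gradio_client` package can connect to a public Gradio Space and list its endpoints. The sketch below assumes the Space id is `k2-fsa/OmniVoice`, as listed on this page, and that `gradio_client` is installed (`pip install gradio_client`). It deliberately does not guess endpoint names or parameters; `view_api()` prints whatever the Space actually exposes.

```python
SPACE_ID = "k2-fsa/OmniVoice"  # Space id as given on this page


def print_space_api(space_id: str = SPACE_ID) -> None:
    """Connect to a public Gradio Space and print its callable endpoints."""
    # Imported lazily so the module loads even without gradio_client installed.
    from gradio_client import Client

    client = Client(space_id)  # may take a while if the Space is cold-starting
    client.view_api()          # prints endpoint names and their parameters


if __name__ == "__main__":
    print_space_api()
```

The same privacy caveat applies to scripted calls: anything sent through the client goes to the public Space.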

2. Enter text in your target language

Paste a short script in any of the 600+ supported languages. Start with one or two sentences so you can quickly judge pronunciation and pacing before generating longer passages.
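One way to follow the "one or two sentences first" advice is to cut a test excerpt from the full script automatically. The sketch below is a rough helper, not a real sentence segmenter: it treats Latin and CJK full stops, question marks and exclamation marks as sentence ends, which will misfire on abbreviations and decimals and will miss scripts that use other punctuation.

```python
import re

# Sentence-ending punctuation for Latin-script and CJK text (a simplification).
_SENTENCE_END = re.compile(r"[.!?。！？]+")


def first_sentences(script: str, n: int = 2) -> str:
    """Return the text up to the end of the n-th sentence, or all of it if shorter."""
    text = script.strip()
    ends = [match.end() for match in _SENTENCE_END.finditer(text)]
    if len(ends) < n:
        return text
    return text[: ends[n - 1]]


if __name__ == "__main__":
    print(first_sentences("Hello world. How are you? Fine."))
```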

3. Pick cloning or voice design

For cloning, upload a clean single-speaker reference clip. For voice design, describe the speaker instead — gender, age, pitch, accent or whisper style — with no reference audio at all.
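A couple of properties of a reference clip can be checked before uploading. The sketch below uses Python's standard `wave` module to confirm that a WAV file is mono and of a sensible length. The 3 to 30 second window is an assumption, not a documented OmniVoice limit, and a mono file is not proof of a single speaker or a clean recording; listening remains the real test.

```python
import io
import wave


def check_reference_clip(
    wav_bytes: bytes, min_seconds: float = 3.0, max_seconds: float = 30.0
) -> list[str]:
    """Return a list of problems; an empty list means the basic checks passed."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        channels = wav.getnchannels()
        duration = wav.getnframes() / wav.getframerate()

    problems = []
    if channels != 1:
        problems.append(f"expected mono audio, found {channels} channels")
    if duration < min_seconds:
        problems.append(f"clip is {duration:.1f} s, shorter than {min_seconds} s")
    elif duration > max_seconds:
        problems.append(f"clip is {duration:.1f} s, longer than {max_seconds} s")
    return problems


if __name__ == "__main__":
    import sys

    with open(sys.argv[1], "rb") as handle:
        print(check_reference_clip(handle.read()) or "basic checks passed")
```

Run it with the path to a WAV file as the only argument; other formats would need converting first.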

4. Tune the controls and listen

Adjust diffusion steps and speaking-speed settings, add cues like [laughter] or pinyin corrections where needed, generate a sample, then iterate until the voice and delivery are right.
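When iterating, it helps to settle on two step counts, one for drafts and one for final takes, instead of adjusting the slider at random. The sketch below uses the 16 to 32 range quoted on this page; the preset names are hypothetical, and the cost estimate assumes generation time grows roughly in proportion to the number of steps, which the page does not state.

```python
MIN_STEPS, MAX_STEPS = 16, 32  # range quoted for OmniVoice on this page

# Hypothetical presets: fewer steps for quick drafts, more for final takes.
PRESETS = {"draft": MIN_STEPS, "final": MAX_STEPS}


def clamp_steps(steps: int) -> int:
    """Keep a requested step count inside the quoted range."""
    return max(MIN_STEPS, min(MAX_STEPS, steps))


def relative_cost(steps: int, baseline: int = MIN_STEPS) -> float:
    """Estimated cost relative to the baseline, assuming linear scaling with steps."""
    return steps / baseline


if __name__ == "__main__":
    for name, steps in PRESETS.items():
        print(f"{name}: {steps} steps, about {relative_cost(steps):.1f}x the draft cost")
```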

Model details

OmniVoice capabilities at a glance

Key public facts about OmniVoice, drawn from the k2-fsa model card, the OmniVoice GitHub repository and the project's paper, summarized here for quick evaluation.

Model: OmniVoice by k2-fsa, the Next-gen Kaldi team
Primary tasks: Zero-shot multilingual text-to-speech, reference-based voice cloning and attribute-based voice design
Architecture: Single-stage diffusion language model, a bidirectional Transformer trained with a masked-diffusion objective that maps text directly to multi-codebook acoustic tokens
Size and backbone: Compact model initialized from Qwen3-0.6B-Base weights (~0.8B backbone, per the paper)
Training scale: About 581,000 hours of audio across 646 languages from 50 open datasets, per the paper
Language coverage: 600+ languages, described as the broadest coverage among current zero-shot TTS models
Audio quality: 24 kHz output, with zero-shot voice cloning from a short reference clip
Inference speed: Real-time factor reported as low as ~0.025 (about 40x faster than real time), using 16-32 diffusion steps
Hosted demo: Official Gradio Space k2-fsa/OmniVoice, served from k2-fsa-omnivoice.hf.space
License: Apache-2.0, described by the project as free for commercial use

Use cases

What you can build with OmniVoice

OmniVoice is most useful when a voice project needs broad language coverage, zero-shot cloning and the freedom to self-host a small, fast open model.

Multilingual product narration

Create draft voiceovers for product tours, lessons and explainers across hundreds of languages before recording final human narration.

Consent-based voice cloning

Clone a speaker zero-shot from a short clip — only with permission — to prototype audiobooks, characters or personalized assistants in a familiar voice.

Synthetic character voices

Reach for voice design when you need a fresh, brand-safe or fictional speaker, dialing in age, accent and tone without cloning a real person at all.

Low-resource language coverage

Reach languages and dialects that mainstream TTS systems skip, thanks to OmniVoice's 600+ language training footprint.

Voice-agent prototyping

Explore personas, accents and speaking rate, and check 24 kHz quality, before wiring OmniVoice into a production voice-agent stack.

Open-model evaluation

Benchmark OmniVoice against other open TTS systems, try the pip package, or study the diffusion-LM serving path for low-latency experiments.
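For like-for-like comparisons, measure real-time factor the same way for every system. The sketch below is a minimal timing harness: `synthesize` is a hypothetical callable you supply that runs one model on a text and returns the number of audio samples it produced. It is not part of OmniVoice or any other package.

```python
import time
from typing import Callable


def measure_rtf(
    synthesize: Callable[[str], int], text: str, sample_rate_hz: int = 24_000
) -> float:
    """Return wall-clock time divided by the duration of the generated audio."""
    start = time.perf_counter()
    num_samples = synthesize(text)  # hypothetical: returns samples generated
    elapsed = time.perf_counter() - start
    audio_seconds = num_samples / sample_rate_hz
    return elapsed / audio_seconds


if __name__ == "__main__":
    # Stand-in for a real model: pretends to produce 10 s of 24 kHz audio.
    def fake_tts(text: str) -> int:
        time.sleep(0.05)
        return 24_000 * 10

    print(f"RTF: {measure_rtf(fake_tts, 'Hello there.'):.4f}")
```

Run each system several times on the same texts and hardware, and discard the first run, since model loading and warm-up would otherwise dominate the number.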

Best practices

Tips for better OmniVoice results

  • Start with short text so you can quickly judge pronunciation, rhythm and voice consistency before generating long passages.
  • Use clean, single-speaker reference audio for cloning; noisy clips or overlapping speakers make speaker similarity hard to evaluate.
  • If a designed voice misses the mark, rewrite the attributes with concrete values — gender, age, pitch, accent and whisper style can be combined.
  • For tricky names or rare words, use pinyin or phoneme hints, and add cues like [laughter] only where you actually want them.
  • More diffusion steps usually trade speed for stability; lower the step count when you just need a fast draft and raise it for final takes.
  • Keep confidential scripts and private voice samples out of any public Space. Self-host OmniVoice for sensitive production work, and never use it for impersonation, fraud or unlabeled synthetic media.

FAQ

OmniVoice frequently asked questions

Short, practical answers for anyone evaluating OmniVoice for multilingual TTS, zero-shot voice cloning, voice design and self-hosted AI voice work.

What is OmniVoice?

OmniVoice is k2-fsa's open-source, massively multilingual zero-shot text-to-speech model. Built as a single-stage diffusion language model initialized from Qwen3-0.6B-Base, it generates natural speech in 600+ languages and clones a voice from a short reference clip without per-speaker training.

Can I try OmniVoice online for free?

Yes. This page embeds the official OmniVoice Hugging Face Space, so you can test OmniVoice directly in your browser without installing the Python package, and the model itself is released under Apache-2.0.

How many languages does OmniVoice support?

OmniVoice supports 600+ languages. According to the project's paper it was trained on roughly 581,000 hours of audio spanning 646 languages, which the team describes as the broadest language coverage of any current zero-shot TTS model.

Does OmniVoice support voice cloning?

Yes. OmniVoice does zero-shot voice cloning: provide a few seconds of clean reference audio and it reproduces that speaker's voice with no fine-tuning. Only clone voices you own or have explicit permission to use.

What is OmniVoice voice design?

Voice design lets you create a speaker from attributes instead of a recording. Describe gender, age, pitch, accent, dialect or a whisper style and OmniVoice generates a brand-new synthetic voice that matches, with no reference audio required.

Is OmniVoice the same as Whisper Web?

No. OmniVoice is an external open-source model from k2-fsa, and the demo on this page runs on a public Hugging Face Space. Whisper Web is a separate browser-based speech-to-text product; this page is an independent guide that embeds the OmniVoice demo for convenience.

Try OmniVoice in your browser

Generate multilingual speech, clone a consenting speaker zero-shot, and experiment with attribute-based voice design in the official Hugging Face demo before you install OmniVoice locally.