Model · 2026-06-30

Zyphra ZONOS2: open TTS MoE, voice cloning, and tool-workflow boundaries

Zyphra released ZONOS2 under Apache-2.0. The model uses an 8B-total, roughly 900M-active MoE TTS architecture trained on more than 6M hours of speech. Its official materials document benchmark framing, language tiers, and voice-cloning limits.


Released: 2026-06-12
Apache-2.0
8B total / ~900M active
Reference speech and text form the public voice-cloning input contract

Abstract

ZONOS2 is Zyphra's June 2026 open real-time TTS release. Official materials describe it as an Apache-2.0, sparse MoE speech model with 8B total parameters, roughly 900M active parameters, high-fidelity voice cloning, low-latency TTS, more than six million hours of multilingual speech data, and Zyphra-reported benchmark results for seed-tts eval and ZTTS1-Eval.

The important story is not just that another voice model shipped. ZONOS2 is notable because it ties open weights, a higher-fidelity DAC audio path, language tiers, and speaker conditioning into one route. Reference speech supplies identity and acoustic context; target text supplies content; the decoder returns 44.1 kHz speech.

Figure 1 (official Zyphra ZONOS2 release image): the Zyphra/ZONOS2 model-card image highlights real-time TTS, voice cloning, and multilingual speech generation.

From Zonos-v0.1 to ZONOS2: data, model capacity, and conditioning changed together

ZONOS2 is not a larger copy of Zonos-v0.1. The previous model had about 1.6B parameters and was trained on roughly 200,000 hours of speech. It relied on phonemization and language labels at the text front end, and its speaker representation carried less detail. That version established an open zero-shot voice-cloning route, but it still faced the usual TTS tradeoffs: dense scaling slowed real-time generation, phoneme dictionaries complicated multilingual and code-switched text, and a compact speaker vector could discard vocal and recording details present in the reference clip.

Zyphra moved all three boundaries in ZONOS2. The training set grew from about 200,000 hours to more than six million hours. Total model size increased from 1.6B to 8B parameters, while sparse MoE routing keeps the active set near 900M for each step. The release reports real-time throughput four times higher than the previous model's. The larger parameter pool can therefore hold broader language, accent, prosody, and recording variation without asking every token to execute the full 8B network.
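The scale changes above are easier to compare as ratios. The short calculation below uses only the rounded figures quoted in this article; the variable names are illustrative, and "more than six million hours" is treated as a lower bound, so the data ratio is a floor rather than an exact value.

```python
# Rounded, Zyphra-reported figures as quoted in this article.
TOTAL_PARAMS = 8e9        # ZONOS2 total parameters
ACTIVE_PARAMS = 900e6     # roughly active per step
PREV_PARAMS = 1.6e9       # Zonos-v0.1
PREV_HOURS = 200_000      # Zonos-v0.1 training speech, approximate
NEW_HOURS = 6_000_000     # "more than six million", so a lower bound

active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS   # share of the pool used per step
total_growth = TOTAL_PARAMS / PREV_PARAMS        # growth in total parameters
active_vs_prev = ACTIVE_PARAMS / PREV_PARAMS     # per-step parameters vs. the old total
data_growth = NEW_HOURS / PREV_HOURS             # growth in training hours

print(f"active share of total parameters: {active_fraction:.2%}")
print(f"total parameter growth: {total_growth:.1f}x")
print(f"active parameters vs. previous total: {active_vs_prev:.2f}x")
print(f"training data growth: at least {data_growth:.0f}x")
```

On these numbers, roughly one ninth of the parameter pool runs per step. If the earlier model executed all of its roughly 1.6B parameters per step, the new active set is smaller than the old one even though the total pool is five times larger.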

Speaker conditioning changed as well. Zyphra describes the new ECAPA-TDNN embedding as having 20 times the bandwidth of the previous speaker model. That bandwidth concerns how much speaker information the embedding can carry, not the sample rate of the output WAV. It helps explain the release's distinction between faithful cloning and studio-clean speech: noise, room response, breath, and unusual vocal texture may be identity evidence in one job and defects to remove in another.

The text path moved from explicit phonemization and language tags to normalized UTF-8 bytes. Chinese characters, Japanese scripts, Korean, numbers, punctuation, and code-switching no longer have to pass through one fixed pronunciation dictionary before reaching the model. Byte input does not guarantee correct polyphones or phrasing, but it removes several front-end failure points: missing dictionary entries, incorrect language labels, and hard language boundaries inside a sentence.
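A minimal sketch of what byte-level input looks like for mixed-script text. The function name is hypothetical, and Unicode NFC normalization stands in for the text normalization step, which the release handles differently and which is not reproduced here.

```python
import unicodedata


def text_to_byte_ids(text: str) -> list[int]:
    """Return UTF-8 byte values (0-255) for the given text.

    NFC normalization is an assumption of this sketch, used only so that
    visually identical strings map to the same byte sequence.
    """
    normalized = unicodedata.normalize("NFC", text)
    return list(normalized.encode("utf-8"))


# One sentence mixing Latin, Chinese, Japanese, and Korean scripts.
sample = "Hello 你好 こんにちは 안녕"
ids = text_to_byte_ids(sample)

print(len(sample), "characters ->", len(ids), "byte ids")
print(ids[:8])
```

No language tag or pronunciation dictionary appears anywhere in this path: every script reduces to the same 256-value vocabulary, which is why code-switched text needs no hard language boundary. The cost is longer sequences, since most CJK characters take three bytes each.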

More data alone would have left the model with a noisy objective, so the official description lays out a three-stage training schedule. Pretraining runs the full dataset for eight epochs with minimal transcript-agreement filtering and without speaker-cloning information. Mid-training tightens transcript agreement and dataset selection to reduce hallucinations, mispronunciations, and repetitions. A final annealing stage introduces speaker embeddings, speaking-rate controls, and quality conditions under stricter filtering. Coverage is learned first; controllable pronunciation and cloning are concentrated later.
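The three stages can be written down as a small schedule table. This is only a restatement of the description above in code: the class, field names, and filter labels are illustrative and do not come from Zyphra's training configuration, and epoch counts are left as `None` where this article gives no number.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Stage:
    name: str
    epochs: Optional[int]            # None where no figure is given
    transcript_filter: str           # relative strictness, paraphrased
    speaker_embeddings: bool         # is cloning information present?
    rate_and_quality_controls: bool


SCHEDULE = (
    Stage("pretraining", 8, "minimal", False, False),
    Stage("mid-training", None, "tightened", False, False),
    Stage("annealing", None, "strictest", True, True),
)


def first_stage_with_cloning(schedule) -> Optional[str]:
    """Return the name of the first stage that sees speaker embeddings."""
    for stage in schedule:
        if stage.speaker_embeddings:
            return stage.name
    return None


print(first_stage_with_cloning(SCHEDULE))
```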

Voice cloning is more than similarity

Voice-cloning coverage often collapses into one question: does it sound like the reference speaker? That matters, but it is not enough for a tool. A usable TTS system also depends on naturalness, stability, language coverage, latency, licensing, and setup burden. ZONOS2 is interesting because it does not only follow the closed-API story of "more human-like speech." It pairs high-fidelity cloning with an open local-running path.

That changes the user's tradeoff. Closed TTS APIs are convenient for hosted calls and uniform billing. Many open TTS projects are flexible for experiments but require users to solve quality, environment, and runtime problems themselves. ZONOS2 sits between those poles by combining high-quality voice cloning, open model materials, and a product workflow that users can try directly from this site.

The release is a model and inference stack rather than an external voice API. Its public contract combines reference audio, target text, and language and sampling controls; the result still depends on preprocessing, codec decoding, and the selected checkpoint.

Official animation (Zyphra ZONOS2 inference overview): the model card shows ZONOS2 using text, speaker conditioning, and a MoE backbone to generate DAC audio tokens.

What Is Actually New

Three official details are worth separating. First is architecture: ZONOS2 is a sparse Mixture of Experts TTS model. Zyphra describes it as 8B total parameters with roughly 900M active at inference. That means the model is not merely a dense model scaled upward; the MoE design lets a larger parameter pool coexist with a real-time TTS target.
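To make "total versus active" concrete, here is a generic top-k routing sketch in plain Python. It shows the common MoE pattern, not ZONOS2's actual router: the expert count, the value of k, and the toy experts are all assumptions chosen for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k):
    """Pick the k highest-scoring experts and renormalize their weights."""
    probs = softmax(router_logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen = ranked[:k]
    mass = sum(probs[i] for i in chosen)
    return {i: probs[i] / mass for i in chosen}


def moe_layer(x, experts, router_logits, k=2):
    """Run only the selected experts and mix their outputs."""
    weights = route_top_k(router_logits, k)
    return sum(w * experts[i](x) for i, w in weights.items())


# Four toy "experts" that each scale their input by a different factor.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
logits = [0.1, 2.0, -1.0, 1.5]   # hypothetical router scores for one token

weights = route_top_k(logits, k=2)
output = moe_layer(1.0, experts, logits, k=2)
print(sorted(weights), round(output, 3))
```

Only two of the four experts execute for this token; the other two still count toward total parameters but cost nothing at this step. That is the sense in which an 8B pool can sit behind a roughly 900M per-step budget.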

Second is the audio representation. The model card says ZONOS2 takes UTF-8 bytes normalized with NeMo text normalization (TN) and an ECAPA-TDNN speaker embedding at inference, then uses the MoE backbone to generate DAC tokens. Zyphra's official article frames the DAC route as producing 44.1 kHz audio. That matters because the intended use is higher-detail narration, character voice, podcast, and multilingual voice cloning rather than low-sample-rate telephony.

Third is language coverage. The model card lists Tier 1 as English, Mandarin Chinese, and Japanese; Tier 2 as Korean, Russian, Italian, Portuguese, French, Spanish, Vietnamese, German, Hebrew, and Dutch; and Tier 3 as Swedish, Hindi, Tamil, Telugu, Thai, Norwegian, Bengali, Tagalog, Arabic, Danish, Indonesian, Polish, Ukrainian, Romanian, Finnish, Hungarian, Lithuanian, Estonian, Slovak, Croatian, and Latvian. The tiering is itself a boundary: support does not mean every language, accent, and reference clip will behave identically.
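For tools that need to surface the tier to users, the published list maps naturally onto a lookup table. The structure and function name below are illustrative; only the language names and tier assignments come from the model card as quoted above.

```python
from typing import Optional

LANGUAGE_TIERS = {
    1: ("English", "Mandarin Chinese", "Japanese"),
    2: ("Korean", "Russian", "Italian", "Portuguese", "French", "Spanish",
        "Vietnamese", "German", "Hebrew", "Dutch"),
    3: ("Swedish", "Hindi", "Tamil", "Telugu", "Thai", "Norwegian", "Bengali",
        "Tagalog", "Arabic", "Danish", "Indonesian", "Polish", "Ukrainian",
        "Romanian", "Finnish", "Hungarian", "Lithuanian", "Estonian",
        "Slovak", "Croatian", "Latvian"),
}


def tier_of(language: str) -> Optional[int]:
    """Return the published tier for a language name, or None if unlisted."""
    wanted = language.strip().lower()
    for tier, names in LANGUAGE_TIERS.items():
        if wanted in (name.lower() for name in names):
            return tier
    return None


for name in ("Japanese", "Hebrew", "Latvian", "Swahili"):
    print(name, "->", tier_of(name))
```

Returning `None` for an unlisted language, rather than defaulting to the lowest tier, keeps "not documented" distinct from "documented with weaker support".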

How To Read The Benchmark Claim

Zyphra's release page reports strong results under seed-tts eval and the new ZTTS1-Eval, with emphasis on speaker similarity, prosody, naturalness, and voice-cloning fidelity. The accurate reading is source-scoped: this is Zyphra's benchmark and sample framing, not a universal guarantee that every downstream generation will outperform every closed or open model.

The evaluation discussion is also more subtle than a single score. Zyphra notes that WER can be awkward for TTS because a generated voice may be cleaner than real human speech and therefore easier for ASR to transcribe, without necessarily being a more faithful clone of the reference speaker. ZONOS2 explicitly emphasizes vocal fidelity and offers stable and expressive orientations: one favors cleaner stable output, while the other aims for more faithful natural cloning.
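The WER caveat is easier to see once the metric is written out. Below is a standard word-level edit-distance implementation, given as a generic sketch rather than the scoring code of either benchmark; the lowercase-and-split tokenization is a simplifying assumption.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")

    # prev[j] = edits needed to turn the first i-1 reference words
    # into the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = cur[j - 1] + 1
            cur.append(min(substitution, deletion, insertion))
        prev = cur
    return prev[-1] / len(ref)


target = "the cat sat on the mat"
print(word_error_rate(target, "the cat sat on the mat"))
print(word_error_rate(target, "the cat sat on a mat"))
```

Nothing in this computation looks at the audio itself. Two generations with identical ASR transcripts score the same WER even if one sounds like the reference speaker and the other does not, which is why speaker similarity has to be measured separately.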

ZONOS2 is a June 2026 open MoE TTS release that Zyphra reports under specific TTS and voice-cloning evaluation settings. Reference audio quality, target language, text length, style control and sampling configuration remain separate variables, so the published results stay inside their stated protocol.

Compared With Common TTS Routes

Dimension	ZONOS2 / task-workflow route	Closed hosted TTS API	Typical self-hosted open TTS
Weights and license	Official weights are available; the model card marks Apache-2.0.	Usually API-only, with no weight control.	Depends on the model; license, commercial use, and quality vary widely.
Usage control	A task workflow can provide status tracking and WAV download integration while keeping the model boundary visible.	Convenient hosting, but latency, region, audit, and data path depend on the vendor.	High control, but the user must manage environment, weights, serving, and debugging.
Voice cloning	Zyphra explicitly emphasizes high-fidelity and naturalistic voice cloning.	Can be strong, but often constrained by account, authorization, quota, and platform policy.	Quality varies heavily by model family and reference audio.
Language boundaries	The model card publishes Tier 1/2/3 language support.	Product coverage may be broad, but training and evaluation details are not always public.	Coverage depends on training data and tokenizer/phonemizer design.
User workflow	Upload reference speech, enter text, and wait for a WAV output.	Usually direct real-time API or console calls.	Often requires CLI work, tool setup, and audio post-processing.

Using It As A Tool Workflow

The user flow is simple: upload a reference voice, enter the text to be spoken, choose language and parameters, and receive a WAV file.
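One way to model that flow in application code is a task record with explicit status transitions. Everything here is hypothetical: the class, field names, and status values are illustrative and do not describe any Zyphra or site API.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Status(Enum):
    QUEUED = "queued"
    RUNNING = "running"
    DONE = "done"
    FAILED = "failed"


# Allowed status transitions for a single generation task.
NEXT = {
    Status.QUEUED: {Status.RUNNING, Status.FAILED},
    Status.RUNNING: {Status.DONE, Status.FAILED},
    Status.DONE: set(),
    Status.FAILED: set(),
}


@dataclass
class VoiceTask:
    reference_audio: str                 # path or upload id of the reference clip
    text: str                            # what the cloned voice should say
    language: str
    sampling: dict = field(default_factory=dict)
    status: Status = Status.QUEUED
    wav_path: Optional[str] = None       # set only when the task is done

    def advance(self, new_status: Status, wav_path: Optional[str] = None) -> None:
        if new_status not in NEXT[self.status]:
            raise ValueError(
                f"cannot go from {self.status.value} to {new_status.value}"
            )
        if new_status is Status.DONE and not wav_path:
            raise ValueError("a finished task needs a WAV path")
        self.status = new_status
        self.wav_path = wav_path if new_status is Status.DONE else None


task = VoiceTask("reference.wav", "Hello from a cloned voice.", "English")
task.advance(Status.RUNNING)
task.advance(Status.DONE, wav_path="output.wav")
print(task.status.value, task.wav_path)
```

Keeping the four inputs as plain fields mirrors the model's own contract, so the workflow layer adds status tracking without hiding what the model was actually given.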

For users, the main value is lower setup burden: they do not need to prepare a research environment or understand every implementation detail before evaluating ZONOS2 through reference voice, text, language, and quality parameters.

Practical Usage Notes

Reference quality matters

Prefer a single speaker, close microphone, low reverb, and low background noise. A clean 10-30 second clip is often more useful than a long recording with music or room noise.
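The length guideline is the one part of this advice that can be checked mechanically. The sketch below uses Python's standard `wave` module, which reads uncompressed PCM WAV files only; the function names are illustrative, the 10-30 second window is the rule of thumb from this article rather than a documented model limit, and noise, reverb, and speaker count still need a human listen.

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def reference_length_notes(path: str, min_s: float = 10.0, max_s: float = 30.0) -> list[str]:
    """Return human-readable notes if the clip falls outside the window."""
    duration = wav_duration_seconds(path)
    notes = []
    if duration < min_s:
        notes.append(f"clip is {duration:.1f}s; consider at least {min_s:.0f}s")
    elif duration > max_s:
        notes.append(f"clip is {duration:.1f}s; consider trimming to {max_s:.0f}s")
    return notes
```

Calling `reference_length_notes("reference.wav")` on a clip inside the window returns an empty list.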

Language support has tiers

English, Mandarin Chinese, and Japanese are Tier 1 in the official model card. Other listed languages should still be judged by the generated output and human listening.

Do not over-read the benchmark claim

Benchmark claims depend on the test and setup and should not be read as unconditional quality promises.

Consent boundary

Clone only voices you have the right to use. Do not impersonate real people or use generated speech to mislead anyone about who is speaking.

ZONOS2 Facts And Product Reading

Item	Official or product fact	Accurate meaning
Release date	Zyphra's official page is dated 2026-06-12.	This is the model-news date, not proof that every downstream product updated that day.
License	The Hugging Face model card marks apache-2.0.	Friendly to local running and integration, while still requiring responsible use.
Scale	8B total parameters / roughly 900M active parameters.	MoE activates only part of the expert pool per step; it is not a dense 8B inference claim.
Training data	Zyphra says more than six million hours of multilingual speech.	Large scale, but language tier, recording condition, and reference voice still matter.
Audio path	Official materials describe DAC-token generation and 44.1 kHz audio.	Better aligned with detailed speech, narration, and character voices than low-rate telephony.
Inference contract	Reference speech, target text, language and sampling settings condition the released model.	Quality still depends on input material, preprocessing and decoding configuration.

References

Reference materials

Official: Zyphra ZONOS2 official page. Official release page, benchmark framing, and links.
Model card: Zyphra/ZONOS2. Official model card with license, assets, and API notes.
Repository: Zyphra/ZONOS2. Official source repository and local server instructions.
