Zyphra ZONOS2: an open TTS MoE brings high-fidelity voice cloning into TelkNet tools
Zyphra released ZONOS2 under Apache-2.0. The model uses an 8B-total, roughly 900M-active MoE TTS architecture trained on more than 6M hours of speech; this article explains the official SOTA framing, language tiers, voice-cloning boundaries, and TelkNet's ZONOS2 Voice Cloning TTS workflow.

ZONOS2 is Zyphra's June 2026 open real-time TTS release. Official materials describe it as an Apache-2.0, sparse MoE speech model with 8B total parameters, roughly 900M active parameters, high-fidelity voice cloning, low-latency TTS, more than six million hours of multilingual speech data, and state-of-the-art results in benchmark contexts such as seed-tts eval and ZTTS1-Eval.
The important story is not just that another voice model shipped. ZONOS2 is notable because it ties open weights, a higher-fidelity DAC audio path, language tiers, and speaker conditioning into one deployable route. TelkNet has integrated ZONOS2 Voice Cloning TTS into the tool catalog so users can upload a reference voice, enter text, and generate WAV speech through the site's task workflow.
Our Read: The Value Is Not Only Similarity
Voice-cloning coverage often collapses into one question: does it sound like the reference speaker? That matters, but it is not enough for a tool. A usable TTS system also depends on naturalness, stability, language coverage, latency, licensing, and deployment cost. ZONOS2 is interesting because it does not simply repeat the closed-API story of "more human-like speech." It pairs high-fidelity cloning with an open path that can run locally.
That changes the user's tradeoff. Closed TTS APIs are convenient for hosted calls, uniform billing, and platform-managed scaling. Many open TTS projects are flexible for experiments but require users to solve quality, environment, and serving problems themselves. ZONOS2 sits between those poles by pairing high-quality voice cloning, open model materials, and a product workflow users can try from the site.
For TelkNet, the point is not to act as a wrapper around an external voice API. It is to bring speech generation into the same tool-execution model as the rest of the site: users upload a reference voice, enter text, choose parameters, track the job, and download the generated file.
What Is Actually New
Three official details are worth separating. First is architecture: ZONOS2 is a sparse Mixture of Experts (MoE) TTS model. Zyphra describes it as 8B total parameters with roughly 900M active at inference. In other words, it is not simply a dense model scaled upward: only part of the expert pool runs at each step, so per-step compute tracks the roughly 900M active parameters rather than the full 8B, which is what lets a large parameter pool coexist with a real-time TTS target.
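To make the total-versus-active distinction concrete, here is a minimal sketch of the arithmetic and of generic top-k expert routing. The two parameter counts come from the release materials; the gate logits, the four-expert pool, and `k=2` are hypothetical values chosen for illustration, since the source does not state ZONOS2's expert count or routing rule.

```python
import math

TOTAL_PARAMS = 8_000_000_000   # "8B total" from the release materials
ACTIVE_PARAMS = 900_000_000    # "roughly 900M active" from the release materials


def active_fraction(active: int, total: int) -> float:
    """Share of all parameters that take part in one inference step."""
    return active / total


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    norm = sum(exps)
    return [e / norm for e in exps]


def route_top_k(gate_logits, k):
    """Generic top-k routing: keep the k best experts, renormalize weights.

    Returns (expert_index, weight) pairs. Experts not listed are skipped,
    which is where the compute saving of a sparse MoE comes from.
    This is a textbook sketch, not ZONOS2's actual router.
    """
    probs = softmax(gate_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    kept = sum(probs[i] for i in chosen)
    return [(i, probs[i] / kept) for i in chosen]


if __name__ == "__main__":
    # 900M / 8B = 11.25% of parameters active per step
    print(f"active share: {active_fraction(ACTIVE_PARAMS, TOTAL_PARAMS):.2%}")
    # Hypothetical gate scores for a pool of four experts
    print(route_top_k([0.1, 2.0, -1.0, 1.5], k=2))
```

The active share is not the same as the share of experts selected, because layers that are not expert layers run on every step regardless of routing.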
Second is the audio representation. The model card says that at inference ZONOS2 takes UTF-8 bytes of text normalized with NeMo text normalization (TN), together with an ECAPA-TDNN speaker embedding, and uses the MoE backbone to generate DAC tokens. Zyphra's official article frames the DAC route as producing 44.1 kHz audio. That matters because the intended use is higher-detail narration, character voices, podcasts, and multilingual voice cloning rather than low-sample-rate telephony.
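A byte-level text input is easy to picture with a few lines of Python. The sketch below shows only the UTF-8 step and the sample count implied by a 44.1 kHz output; it skips text normalization entirely, and the function names are hypothetical rather than part of any Zyphra tooling. The real model may also add special tokens that are not shown here.

```python
SAMPLE_RATE_HZ = 44_100  # output rate named in Zyphra's article


def text_to_byte_ids(text: str) -> list[int]:
    """Encode text as UTF-8 and return one integer (0-255) per byte."""
    return list(text.encode("utf-8"))


def samples_for(seconds: float, sample_rate: int = SAMPLE_RATE_HZ) -> int:
    """Number of PCM samples per channel for a clip of the given length."""
    return round(seconds * sample_rate)


if __name__ == "__main__":
    print(text_to_byte_ids("Hi"))           # [72, 105]
    print(len(text_to_byte_ids("日本語")))  # 9, three bytes per character
    print(samples_for(10))                  # 441000
```

One consequence of a byte-level input is that sequence length depends on the script: the three Japanese characters above take nine input positions, while three ASCII letters take three.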
Third is language coverage. The model card lists Tier 1 as English, Mandarin Chinese, and Japanese; Tier 2 as Korean, Russian, Italian, Portuguese, French, Spanish, Vietnamese, German, Hebrew, and Dutch; and Tier 3 as Swedish, Hindi, Tamil, Telugu, Thai, Norwegian, Bengali, Tagalog, Arabic, Danish, Indonesian, Polish, Ukrainian, Romanian, Finnish, Hungarian, Lithuanian, Estonian, Slovak, Croatian, and Latvian. The tiering is itself a boundary: support does not mean every language, accent, and reference clip will behave identically.
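For anyone wiring the tiers into a form or a validation step, the list above maps naturally onto a lookup table. This is a minimal sketch: the language names are copied from the model card listing quoted above, while `LANGUAGE_TIERS` and `tier_of` are hypothetical names, not identifiers from Zyphra or TelkNet.

```python
from typing import Optional

# Tier membership as listed in the model card (names copied from the article).
LANGUAGE_TIERS = {
    1: ["English", "Mandarin Chinese", "Japanese"],
    2: ["Korean", "Russian", "Italian", "Portuguese", "French", "Spanish",
        "Vietnamese", "German", "Hebrew", "Dutch"],
    3: ["Swedish", "Hindi", "Tamil", "Telugu", "Thai", "Norwegian", "Bengali",
        "Tagalog", "Arabic", "Danish", "Indonesian", "Polish", "Ukrainian",
        "Romanian", "Finnish", "Hungarian", "Lithuanian", "Estonian",
        "Slovak", "Croatian", "Latvian"],
}


def tier_of(language: str) -> Optional[int]:
    """Return the tier (1, 2, or 3) for a language, or None if unlisted.

    Matching is case-insensitive and ignores surrounding whitespace.
    """
    wanted = language.strip().lower()
    for tier, names in LANGUAGE_TIERS.items():
        if wanted in (name.lower() for name in names):
            return tier
    return None


if __name__ == "__main__":
    for lang in ("Japanese", "hebrew", "Latvian", "Klingon"):
        print(lang, "->", tier_of(lang))
```

Returning `None` for an unlisted language, rather than defaulting to Tier 3, keeps "not listed" distinct from "listed at the lowest tier".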
How To Read The SOTA Claim
Zyphra's release page says ZONOS2 reaches state-of-the-art results in contexts including seed-tts eval and the new ZTTS1-Eval, with emphasis on speaker similarity, prosody, naturalness, and voice-cloning fidelity. The accurate reading is source-scoped: this is Zyphra's benchmark and sample framing, not a universal guarantee that every downstream generation will outperform every closed or open model.
The evaluation discussion is also more subtle than a single score. Zyphra notes that WER can be awkward for TTS because a generated voice may be cleaner than real human speech and therefore easier for ASR to transcribe, without necessarily being a more faithful clone of the reference speaker. ZONOS2 explicitly emphasizes vocal fidelity and offers two orientations, stable and expressive: the first favors cleaner, more stable output, while the second aims for a more faithful, natural clone.
That is why this article does not state "ZONOS2 is the best TTS" as an unconditional fact. The accurate statement is narrower: ZONOS2 is a June 2026 open MoE TTS release that Zyphra says reaches SOTA results under specific TTS and voice-cloning evaluation settings. Real TelkNet results still depend on reference audio quality, target language, text length, style control, tool load, and human listening.
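The WER caveat above is easier to see once the metric is written out. The following is a minimal sketch of the standard word error rate, computed as word-level edit distance divided by reference length. It is not Zyphra's evaluation harness, and the lowercase-and-split tokenization is a simplifying assumption.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference words.

    `reference` is the text the TTS system was asked to speak;
    `hypothesis` is what an ASR system transcribed from the generated audio.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")

    # prev[j] holds the edit distance between the ref words seen so far
    # (before the current row) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of an extra word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("the cat sat on the mat", "the cat sat on the mat"))  # 0.0
    print(word_error_rate("a b c", "a x c"))
```

Nothing in this formula looks at the speaker: a perfectly intelligible voice that sounds nothing like the reference still scores 0.0, which is why WER alone cannot rank voice-cloning fidelity.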
Compared With Common TTS Routes
| Dimension | ZONOS2 / TelkNet route | Closed hosted TTS API | Typical self-hosted open TTS |
|---|---|---|---|
| Weights and license | Official weights are available; the model card marks Apache-2.0. | Usually API-only, with no weight control. | Depends on the model; license, commercial use, and quality vary widely. |
| Deployment control | TelkNet provides an in-site task workflow, status tracking, and WAV download integration. | Convenient hosting, but latency, region, audit, and data path depend on the vendor. | High control, but the user must manage environment, weights, serving, and debugging. |
| Voice cloning | Zyphra explicitly emphasizes high-fidelity and naturalistic voice cloning. | Can be strong, but often constrained by account, authorization, quota, and platform policy. | Quality varies heavily by model family and reference audio. |
| Language boundaries | The model card publishes Tier 1/2/3 language support. | Product coverage may be broad, but training and evaluation details are not always public. | Coverage depends on training data and tokenizer/phonemizer design. |
| User workflow | Upload reference speech, enter text, and download the WAV output when the TelkNet task completes. | Usually direct real-time API or console calls. | Often requires CLI work, tool setup, and audio post-processing. |
What TelkNet Deployed
TelkNet has deployed the ZONOS2 Voice Cloning TTS tool experience and backend task path, not merely a link to an external page. The user flow is simple: upload a reference voice, enter the text to be spoken, choose language and parameters, submit through the TelkNet task system, and receive a WAV file when the task completes.
The public page presents the user-facing workflow only: reference voice and text input, language and quality parameters, task status, and WAV downloads. Deployment details stay in server-side operations and internal documentation, not in the regular tool page or public article body.
For users, this removes the setup burden: they can stay inside the TelkNet tool flow and use the same task/download experience as other tools. For the product, it makes speech generation a normal TelkNet tool capability with queueing, billing, task records, and downloads.
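The submit, track, and download flow described above can be pictured as a small task state machine. Everything in this sketch is hypothetical: the state names, the `TtsTask` fields, and the transition rules are assumptions made for illustration and do not describe TelkNet's actual backend or any public API.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class TaskState(Enum):
    QUEUED = "queued"
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


# Allowed transitions in this hypothetical lifecycle.
_TRANSITIONS = {
    TaskState.QUEUED: {TaskState.RUNNING, TaskState.FAILED},
    TaskState.RUNNING: {TaskState.SUCCEEDED, TaskState.FAILED},
    TaskState.SUCCEEDED: set(),
    TaskState.FAILED: set(),
}


@dataclass
class TtsTask:
    """One voice-cloning job: inputs, current state, and eventual output."""
    reference_audio: str   # name of the uploaded reference clip (hypothetical)
    text: str              # text to be spoken
    language: str
    state: TaskState = TaskState.QUEUED
    output_wav: Optional[str] = None
    history: list = field(default_factory=list)

    def __post_init__(self):
        if not self.text.strip():
            raise ValueError("text to speak must not be empty")
        self.history.append(self.state)

    def advance(self, new_state: TaskState, output_wav: Optional[str] = None):
        """Move to new_state, rejecting transitions the lifecycle forbids."""
        if new_state not in _TRANSITIONS[self.state]:
            raise ValueError(f"cannot go from {self.state.value} to {new_state.value}")
        if new_state is TaskState.SUCCEEDED and not output_wav:
            raise ValueError("a succeeded task needs an output WAV")
        self.state = new_state
        if output_wav:
            self.output_wav = output_wav
        self.history.append(new_state)


if __name__ == "__main__":
    task = TtsTask("reference.wav", "Hello from the task queue.", "English")
    task.advance(TaskState.RUNNING)
    task.advance(TaskState.SUCCEEDED, output_wav="result.wav")
    print([s.value for s in task.history], task.output_wav)
```

Modeling the job as an explicit record with a state history is what makes status tracking and task records possible in the first place, independent of how the model itself is served.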
Practical Usage Notes
Prefer a single speaker, close microphone, low reverb, and low background noise. A clean 10-30 second clip is often more useful than a long recording with music or room noise.
English, Mandarin Chinese, and Japanese are Tier 1 in the official model card. Other listed languages should still be judged by the generated output and human listening.
SOTA is benchmark- and setup-specific. TelkNet keeps that boundary visible instead of turning it into an unconditional quality promise.
Clone only voices you have the right to use. Do not impersonate real people or use generated speech to mislead anyone about a speaker's identity.
ZONOS2 Facts And TelkNet Reading
| Item | Official or TelkNet fact | Accurate meaning |
|---|---|---|
| Release date | Zyphra's official page is dated 2026-06-12. | This is the model-news date, not proof that every downstream deployment shipped that day. |
| License | The Hugging Face model card marks apache-2.0. | Friendly to local deployment and integration, while still requiring responsible use. |
| Scale | 8B total parameters / roughly 900M active parameters. | MoE activates only part of the expert pool per step; it is not a dense 8B inference claim. |
| Training data | Zyphra says more than six million hours of multilingual speech. | Large scale, but language tier, recording condition, and reference voice still matter. |
| Audio path | Official materials describe DAC-token generation and 44.1 kHz audio. | Better aligned with detailed speech, narration, and character voices than low-rate telephony. |
| TelkNet | The ZONOS2 Voice Cloning TTS tool and in-site task path are integrated on the site. | Users can submit web tasks, but quality still depends on input material and runtime load. |