Compare Supertone TTS models on quality, latency, language coverage, voice settings, and features — and pick the right one for your use case.
Supertone offers five TTS models with different trade-offs between quality, latency, language coverage, and configurability. Use this page to choose the model that fits your product.
The most natural, highest-quality voice on the platform with broad multilingual coverage. Recommended for narration, audiobooks, character dialogue, and production-quality marketing audio — anywhere quality matters more than latency.
Languages (23): en, ko, ja, bg, cs, da, el, es, et, fi, hu, it, nl, pl, pt, ro, ar, de, fr, hi, id, ru, vi
Voice settings: all parameters except subharmonic_amplitude_control
Extras: include_phonemes (timestamps for lip-sync), normalized_text (pronunciation control)
A lightweight variant of sona_speech_2 optimized for lower latency while keeping the same multilingual coverage. Use it when response time matters more than peak quality, for example in interactive agents or batch generation at scale.
The next-generation successor to supertonic_api_1 with significantly improved speech stability. Trained differently from the open-weights Supertonic 3 release, this API variant inherits the ultra-low latency profile of supertonic_api_1 while delivering far more reliable pronunciation and reduced reading errors. The best default for voice agents, chatbots, and any real-time experience where time-to-first-audio is the top priority.
Languages (31): en, ko, ja, ar, bg, cs, da, de, el, es, et, fi, fr, hi, hr, hu, id, it, lt, lv, nl, pl, pt, ro, ru, sk, sl, sv, tr, uk, vi
Voice settings: speed only; all other settings are silently ignored
Extras: none
Streaming: not supported (but per-call latency is so low that streaming is usually unnecessary)
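Because unsupported settings are dropped without an error, it can help to surface them on the client side before sending a request. The following is a minimal sketch, assuming voice settings are held in a plain dict. The helper name `effective_voice_settings` is hypothetical and not part of any Supertone SDK; the model and setting names are the ones used on this page.

```python
import warnings

# Models documented on this page as honoring only the `speed` voice setting.
SPEED_ONLY_MODELS = {"supertonic_api_3", "supertonic_api_1"}


def effective_voice_settings(model: str, settings: dict) -> dict:
    """Return the subset of `settings` that a speed-only model will apply.

    Hypothetical client-side helper: the API ignores the extra settings
    silently, so this only makes that behavior visible to the caller.
    """
    if model not in SPEED_ONLY_MODELS:
        # Other models are passed through unchanged; this sketch does not
        # model their individual rules.
        return dict(settings)
    ignored = sorted(name for name in settings if name != "speed")
    if ignored:
        warnings.warn(f"{model} ignores voice settings: {', '.join(ignored)}")
    return {name: value for name, value in settings.items() if name == "speed"}
```

Calling `effective_voice_settings("supertonic_api_3", {"speed": 1.2, "similarity": 3.0})` keeps only the `speed` entry and emits a warning that names `similarity`.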
The legacy supertonic model. Superseded by supertonic_api_3, which offers broader language coverage and dramatically better speech stability at the same latency profile. Pick supertonic_api_1 only if you have an existing integration pinned to it; new projects should use supertonic_api_3.
Languages (5): en, ko, ja, es, pt
Voice settings: speed only; all other settings are silently ignored
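The two supertonic language lists on this page can be compared directly to check that moving a pinned integration from supertonic_api_1 to supertonic_api_3 drops no language. This is a small sketch that uses only the language codes listed here; the constant and function names are illustrative.

```python
# Language codes copied from the model descriptions on this page.
SUPERTONIC_API_1_LANGS = frozenset("en ko ja es pt".split())
SUPERTONIC_API_3_LANGS = frozenset(
    "en ko ja ar bg cs da de el es et fi fr hi hr hu id it lt lv "
    "nl pl pt ro ru sk sl sv tr uk vi".split()
)


def migration_report() -> dict:
    """Summarize how language coverage changes when moving from api_1 to api_3."""
    return {
        # True when every api_1 language is also available on api_3.
        "is_superset": SUPERTONIC_API_1_LANGS <= SUPERTONIC_API_3_LANGS,
        # Languages that become available only after migrating.
        "gained": sorted(SUPERTONIC_API_3_LANGS - SUPERTONIC_API_1_LANGS),
    }
```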
The legacy flagship. It supports the full voice-settings surface and is the only model that currently supports chunked streaming (stream_speech). For most use cases the newer models are a better starting point; pick sona_speech_1 if you specifically need stream_speech output or the full set of fine-tuning parameters (similarity, text_guidance, subharmonic_amplitude_control).
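The rule of thumb for sona_speech_1 can be written as a small predicate. This is a sketch that reads the guidance above literally (streaming, or all three fine-tuning parameters together). The function name is hypothetical, and a real selection would also weigh quality, latency, and language coverage.

```python
# Fine-tuning parameters named in the sona_speech_1 description.
FINE_TUNING_SETTINGS = frozenset(
    {"similarity", "text_guidance", "subharmonic_amplitude_control"}
)


def needs_sona_speech_1(wants_streaming: bool, setting_names=()) -> bool:
    """Return True when the request matches the cases reserved for sona_speech_1.

    Either chunked streaming (stream_speech) is required, or the caller
    wants the full set of fine-tuning parameters at once.
    """
    return wants_streaming or FINE_TUNING_SETTINGS <= set(setting_names)
```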
For multilingual content, fire one request per language rather than mixing languages inside a single text. For Japanese inputs with kanji, numbers, units, or symbols, see Normalized text.
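One way to follow the one-request-per-language advice is to tag each piece of text with its language up front and build a separate request for each run of same-language text. The following is a minimal sketch: the payload keys (`model`, `language`, `text`) and the function name are assumptions made for illustration, not the documented request schema.

```python
from itertools import groupby


def split_by_language(segments, model="supertonic_api_3"):
    """Build one request payload per consecutive run of same-language text.

    `segments` is an iterable of (language_code, text) pairs in playback
    order. The payload shape is hypothetical; adapt it to the real request
    body before sending anything.
    """
    requests = []
    for language, run in groupby(segments, key=lambda segment: segment[0]):
        # Join consecutive segments of the same language into a single text.
        text = " ".join(piece for _, piece in run)
        requests.append({"model": model, "language": language, "text": text})
    return requests
```

Grouping consecutive runs rather than collecting all text per language preserves playback order, so the resulting audio clips can be concatenated in the order the requests were built.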
Looking to run TTS locally on CPU, with no API call and no network round-trip? Supertone also publishes an open-weights model in the same Supertonic 3 family — Supertonic 3 (99M parameters, ONNX Runtime, OpenRAIL-M license).
Supertonic 3 (open-weights) is a different model from supertonic_api_3. They share the same family name and lineage, but were trained differently and produce different audio. The API model (supertonic_api_3) is what’s exposed by this API; the open-weights model is a separate on-device release. Don’t assume parity in voice quality, supported voices, or behavior.
Supertonic 3: On-device TTS
99M-parameter open-weights TTS that runs locally on CPU via ONNX Runtime — 31 languages, no GPU, no cloud, no API. A separate model from supertonic_api_3; visit the project site for weights, samples, and SDKs (Python, Node.js, Web, iOS, Android, C++).