Convert text to speech
Convert text into a complete audio file using a voice of your choice.
POST
Generates speech from text and returns the audio in the response body. For the conceptual walkthrough, SDK examples, and tips, see Docs: Create speech.
Endpoint
Path parameters
| Name | Required | Description |
|---|---|---|
| voice_id | ✅ | The ID of the target voice. |
Request body
| Name | Required | Description |
|---|---|---|
| text | ✅ | The text to convert. Max 300 characters. Use an SDK or split client-side for longer input. |
| language | ✅ | Language code (e.g. en, ko, ja). Must be supported by the voice and the model. |
| style | — | Emotional style (e.g. neutral, happy). If omitted, the voice’s default style is used. |
| model | — | TTS model. Defaults to sona_speech_1. |
| output_format | — | wav (default) or mp3. |
| voice_settings | — | Advanced voice parameters (see below). |
| include_phonemes | — | If true, response switches to JSON with base64 audio plus phoneme timing data. Default: false. |
| normalized_text | — | Pronunciation-normalized companion text (used by sona_speech_2 and sona_speech_2_flash, primarily for Japanese). |
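The field rules in the table above can be checked client-side before a request is sent. The following is a minimal Python sketch, not an official client: the helper name `build_speech_request` is hypothetical, and it assumes the 300-character limit is counted the way Python's `len` counts characters. It only builds the JSON body; the endpoint URL and authentication are deliberately left out.

```python
import json

MAX_TEXT_LENGTH = 300  # limit stated in the request body table

# Optional fields listed in the request body table.
OPTIONAL_FIELDS = {
    "style", "model", "output_format",
    "voice_settings", "include_phonemes", "normalized_text",
}


def build_speech_request(text, language, **optional):
    """Return a JSON-serializable body (hypothetical helper, not part of any SDK)."""
    if not text:
        raise ValueError("text is required")
    if len(text) > MAX_TEXT_LENGTH:
        raise ValueError(
            f"text is {len(text)} characters; the limit is {MAX_TEXT_LENGTH}"
        )
    if not language:
        raise ValueError("language is required")

    unknown = set(optional) - OPTIONAL_FIELDS
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")

    body = {"text": text, "language": language}
    # Omitted (None) fields are left out so the server-side defaults apply.
    body.update({k: v for k, v in optional.items() if v is not None})
    return body


payload = build_speech_request("Hello, world.", "en", output_format="mp3")
print(json.dumps(payload))
```

Leaving optional fields out of the body, rather than sending explicit defaults, keeps the request aligned with whatever defaults the service applies.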
Supported languages by model
| Model | Languages |
|---|---|
| sona_speech_2, sona_speech_2_flash | en, ko, ja, bg, cs, da, el, es, et, fi, hu, it, nl, pl, pt, ro, ar, de, fr, hi, id, ru, vi |
| supertonic_api_3 | en, ko, ja, ar, bg, cs, da, de, el, es, et, fi, fr, hi, hr, hu, id, it, lt, lv, nl, pl, pt, ro, ru, sk, sl, sv, tr, uk, vi |
| supertonic_api_1 | en, ko, ja, es, pt |
| sona_speech_1 | en, ko, ja |
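The model-to-language table can be mirrored as a lookup so that an unsupported combination is caught before a request is made. This is a sketch that copies the table as it appears above; the function names are hypothetical, and the table may change. It cannot check the second condition, that the chosen voice also supports the language.

```python
# Language codes per model, copied from the table above.
_SONA_2 = "en ko ja bg cs da el es et fi hu it nl pl pt ro ar de fr hi id ru vi".split()

SUPPORTED_LANGUAGES = {
    "sona_speech_2": frozenset(_SONA_2),
    "sona_speech_2_flash": frozenset(_SONA_2),
    "supertonic_api_3": frozenset(
        "en ko ja ar bg cs da de el es et fi fr hi hr hu id it "
        "lt lv nl pl pt ro ru sk sl sv tr uk vi".split()
    ),
    "supertonic_api_1": frozenset("en ko ja es pt".split()),
    "sona_speech_1": frozenset("en ko ja".split()),
}


def is_supported(model, language):
    """True if the table lists `language` for `model`; unknown models yield False."""
    return language in SUPPORTED_LANGUAGES.get(model, frozenset())


def models_for_language(language):
    """Return the model names (sorted) that list `language`."""
    return sorted(m for m, langs in SUPPORTED_LANGUAGES.items() if language in langs)


print(is_supported("sona_speech_1", "es"))
print(models_for_language("hr"))
```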
Voice settings
Unsupported settings are silently ignored; they don’t error.
| Name | Range | Default | Description |
|---|---|---|---|
| pitch_shift | -24 → 24 | 0 | Pitch shift in semitones. |
| pitch_variance | 0 → 2 | 1 | Degree of pitch variation. |
| speed | 0.5 → 2 | 1 | Playback rate multiplier. Applied after duration. |
| duration | 0 → 60 | 0 | When non-zero, generates audio targeting this length in seconds. |
| similarity | 1 → 5 | 3 | How closely the output matches the original character voice. |
| text_guidance | 0 → 4 | 1 | How sensitively delivery adapts to the text content. |
| subharmonic_amplitude_control | 0 → 2 | 1 | Subharmonic amplitude in the generated speech. |
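The ranges above lend themselves to a local pre-flight check. The sketch below assumes both ends of each range are inclusive, which the table does not state explicitly, and the helper name `out_of_range` is hypothetical. How the service itself reacts to out-of-range values is not described here, so the function only reports them.

```python
# (low, high) per setting, copied from the voice settings table.
VOICE_SETTING_RANGES = {
    "pitch_shift": (-24, 24),
    "pitch_variance": (0, 2),
    "speed": (0.5, 2),
    "duration": (0, 60),
    "similarity": (1, 5),
    "text_guidance": (0, 4),
    "subharmonic_amplitude_control": (0, 2),
}


def out_of_range(voice_settings):
    """Return {name: (value, low, high)} for every setting outside its range.

    Assumes inclusive bounds. Names not in the table are skipped.
    """
    problems = {}
    for name, value in voice_settings.items():
        if name not in VOICE_SETTING_RANGES:
            continue
        low, high = VOICE_SETTING_RANGES[name]
        if not low <= value <= high:
            problems[name] = (value, low, high)
    return problems


print(out_of_range({"speed": 3, "pitch_shift": -12}))
```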
Voice settings by model
| Setting | sona_speech_2 | sona_speech_2_flash | supertonic_api_3 | supertonic_api_1 | sona_speech_1 |
|---|---|---|---|---|---|
| pitch_shift, pitch_variance, duration | ✅ | ✅ | — | — | ✅ |
| speed | ✅ | ✅ | ✅ | ✅ | ✅ |
| similarity, text_guidance | ✅ | — | — | — | ✅ |
| subharmonic_amplitude_control | — | — | — | — | ✅ |
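Because unsupported settings are ignored without an error, a caller may want to know which of its settings will have no effect for the chosen model. The following sketch encodes the support matrix above; `split_settings` is a hypothetical helper, and the matrix is a copy of this page rather than something queried from the service.

```python
_COMMON = {"pitch_shift", "pitch_variance", "duration", "speed"}

# Settings each model honors, copied from the matrix above.
SETTINGS_BY_MODEL = {
    "sona_speech_2": _COMMON | {"similarity", "text_guidance"},
    "sona_speech_2_flash": set(_COMMON),
    "supertonic_api_3": {"speed"},
    "supertonic_api_1": {"speed"},
    "sona_speech_1": _COMMON
    | {"similarity", "text_guidance", "subharmonic_amplitude_control"},
}


def split_settings(model, voice_settings):
    """Return (kept, ignored): settings the model honors, and names it would ignore."""
    supported = SETTINGS_BY_MODEL[model]  # raises KeyError for an unknown model
    kept = {k: v for k, v in voice_settings.items() if k in supported}
    ignored = sorted(k for k in voice_settings if k not in supported)
    return kept, ignored


kept, ignored = split_settings("supertonic_api_3", {"speed": 1.2, "pitch_shift": 2})
print(kept, ignored)
```

Logging the `ignored` list is one way to surface a mismatch that the API itself would not report.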
Response
Default (include_phonemes=false): Binary audio in the body.
- Content-Type: audio/wav or audio/mpeg (matches output_format).
- X-Audio-Length header: duration of the generated audio in seconds.
include_phonemes=true: JSON body with base64 audio plus phoneme arrays.
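A client has to handle both response shapes. The sketch below works on headers and a raw body that have already been received, so it is independent of any HTTP library. Several things in it are assumptions rather than documented facts: the JSON key `audio_base64` is a placeholder, because this page does not name the field that carries the audio; `X-Audio-Length` is assumed to parse as a decimal number; and any response that is not audio/wav or audio/mpeg is treated as the JSON variant.

```python
import base64
import json


def parse_speech_response(headers, body, audio_key="audio_base64"):
    """Return (audio_bytes, length_seconds_or_None, extra_json_fields).

    `audio_key` is a placeholder name; check the real JSON schema before use.
    """
    # HTTP header names are case-insensitive, so normalize them first.
    lowered = {k.lower(): v for k, v in headers.items()}
    media_type = lowered.get("content-type", "").split(";")[0].strip().lower()

    if media_type in ("audio/wav", "audio/mpeg"):
        length = lowered.get("x-audio-length")
        seconds = float(length) if length is not None else None
        return body, seconds, {}

    # Otherwise assume the include_phonemes=true JSON variant.
    payload = json.loads(body)
    audio = base64.b64decode(payload[audio_key])
    extras = {k: v for k, v in payload.items() if k != audio_key}
    return audio, None, extras


raw = b"RIFF0000"
audio, seconds, extras = parse_speech_response(
    {"Content-Type": "audio/wav", "X-Audio-Length": "1.5"}, raw
)
print(len(audio), seconds, extras)
```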
Notes
- text over 300 characters returns 400. Use the Python or TypeScript SDK for automatic chunking, or split manually (see Long text).
- speed applies after duration. Setting duration=5 with speed=2 produces ~10 seconds of audio.
- When style is omitted, the first value in the voice’s styles array is used. Different voices can have different defaults; call Get voice to check.
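For callers who split long text manually, one possible approach is to pack whole sentences into chunks of at most 300 characters and fall back to cutting at a space, or mid-word, when a single sentence is too long. This is an illustrative sketch only, not the SDKs' chunking logic: `split_text` is a hypothetical helper, it only recognizes `.`, `!` and `?` followed by whitespace as sentence ends, and it collapses the whitespace between sentences to a single space. Text without such boundaries (for example, Japanese) falls through to the hard-cut path.

```python
import re

MAX_CHARS = 300  # per-request limit stated in the notes above


def split_text(text, limit=MAX_CHARS):
    """Split text into chunks of at most `limit` characters, preferring sentence ends."""
    chunks = []
    current = ""
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= limit:
            current = candidate
            continue
        if current:
            chunks.append(current)
        # A sentence longer than the limit is cut at the last space that fits.
        while len(sentence) > limit:
            cut = sentence.rfind(" ", 0, limit + 1)
            if cut <= 0:
                cut = limit  # no usable space: hard cut
            chunks.append(sentence[:cut].rstrip())
            sentence = sentence[cut:].lstrip()
        current = sentence
    if current:
        chunks.append(current)
    return chunks


sample = "One short sentence. " * 40
chunks = split_text(sample)
print(len(chunks), max(len(c) for c in chunks))
```

Each chunk can then be sent as its own request, with the resulting audio concatenated by the caller.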
See also
Docs: Create speech
Walkthrough with SDK examples.
Stream speech
Stream audio chunks instead of waiting for the full clip.
Body
application/json
| Name | Description |
|---|---|
| text | The text to convert to speech. Maximum string length: 300. |
| language | The language code of the text. Available options: en, ko, ja, bg, cs, da, el, es, et, fi, hu, it, nl, pl, pt, ro, ar, de, fr, hi, id, ru, vi, hr, lt, lv, sk, sl, sv, tr, uk. |
| style | The style of character to use for the text-to-speech conversion. |
| model | The model type to use for the text-to-speech conversion. Available options: sona_speech_1, sona_speech_2, sona_speech_2_flash, supertonic_api_1, supertonic_api_3. |
| output_format | The desired output format of the audio file (wav, mp3). Default is wav. Available options: wav, mp3. |
| include_phonemes | Return phoneme timing data with the audio. |
| normalized_text | Pre-normalized text for TTS. Only used with sona_speech_2 and sona_speech_2_flash models. |
Response
Returns either binary audio or JSON with phoneme data based on the include_phonemes parameter.
Binary audio file (when include_phonemes=false or omitted)