create_speech converts text into a finished audio file. The full audio is returned in the response body, ready to save or play.
If you need to stream audio chunks as they are synthesized — for example, to start playback before generation finishes — see Stream speech instead.
Basic usage
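As a rough illustration of the call described above, the following sketch builds a raw HTTP request using only the Python standard library. The endpoint URL, the API key header name, and the JSON body encoding are assumptions made for this example and are not confirmed by this page; `build_create_speech_request` is a hypothetical helper, not an SDK function. Check the API reference for the real values.

```python
import json
import urllib.parse
import urllib.request

# Assumptions for illustration only; confirm both in the API reference.
BASE_URL = "https://supertoneapi.com/v1/text-to-speech"  # hypothetical endpoint
API_KEY_HEADER = "x-sup-api-key"  # hypothetical auth header name
MAX_TEXT_CHARS = 300  # per-call limit from the Request fields table


def build_create_speech_request(api_key, voice_id, text, language,
                                output_format="wav", **optional_fields):
    """Build, but do not send, a create_speech HTTP request."""
    if not text.strip():
        raise ValueError("text must not be empty")
    if len(text) > MAX_TEXT_CHARS:
        raise ValueError(f"text exceeds {MAX_TEXT_CHARS} characters")
    if output_format not in ("wav", "mp3"):
        raise ValueError("output_format must be 'wav' or 'mp3'")

    # voice_id is a path parameter; the other fields are assumed to
    # travel in a JSON body.
    body = {"text": text, "language": language, "output_format": output_format}
    body.update(optional_fields)  # e.g. style, model, voice_settings

    return urllib.request.Request(
        url=f"{BASE_URL}/{urllib.parse.quote(voice_id, safe='')}",
        data=json.dumps(body).encode("utf-8"),
        headers={API_KEY_HEADER: api_key, "Content-Type": "application/json"},
        method="POST",
    )
```

If the assumed endpoint and header are right, passing the result to `urllib.request.urlopen` and reading the response body would yield the audio bytes. The official SDKs wrap this step, so prefer them where available.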
Request fields
| Field | Required | Description |
|---|---|---|
| voice_id | ✅ | Path parameter: identifies the character. |
| text | ✅ | The text to synthesize. Max 300 characters per API call (SDKs auto-chunk longer text). |
| language | ✅ | Language code. Must be supported by the voice and the model. |
| style | — | Emotional style (e.g. neutral, happy). Defaults to the first style in the voice’s styles array. |
| model | — | TTS model. Defaults to sona_speech_1. See Models. |
| output_format | — | wav (default) or mp3. See Output formats below. |
| voice_settings | — | Pitch, intonation, speed, etc. See Voice settings. |
| include_phonemes | — | When true, returns phoneme symbols and timestamps. See Pronunciation and phonemes. |
| normalized_text | — | Pronunciation-normalized companion text (currently for Japanese on sona_speech_2 family). See Normalized text. |
Output formats
| Format | output_format value | Content type | Use when |
|---|---|---|---|
| WAV (default) | wav | audio/wav | You want lossless audio — best for production audio pipelines, further processing, or game/animation assets. |
| MP3 | mp3 | audio/mpeg | You want smaller files for delivery to end-user devices and don’t need lossless quality. |
If you omit output_format, the API defaults to wav. The same option applies to Stream speech: chunks come back as binary in the requested format.
Response
By default the API returns binary audio in the body. The response carries two useful headers:

| Header | Meaning |
|---|---|
| Content-Type | audio/wav or audio/mpeg, matching output_format. |
| X-Audio-Length | Duration of the generated audio in seconds (float). |
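The two headers can be read with a few lines of Python. This is a minimal sketch that assumes the headers arrive as a plain mapping of names to string values; `read_audio_headers` is a hypothetical helper name, not part of any SDK.

```python
# Content types listed in the Output formats table.
EXTENSION_BY_CONTENT_TYPE = {"audio/wav": ".wav", "audio/mpeg": ".mp3"}


def read_audio_headers(headers):
    """Return (file_extension, duration_seconds) from response headers."""
    # HTTP header names are case-insensitive, so normalise the keys first.
    lowered = {name.lower(): value for name, value in headers.items()}

    # Drop any parameters such as "; charset=..." before the lookup.
    content_type = lowered.get("content-type", "").split(";")[0].strip().lower()
    if content_type not in EXTENSION_BY_CONTENT_TYPE:
        raise ValueError(f"unexpected Content-Type: {content_type!r}")

    # X-Audio-Length is a float number of seconds; tolerate its absence.
    raw_length = lowered.get("x-audio-length")
    duration = float(raw_length) if raw_length is not None else None
    return EXTENSION_BY_CONTENT_TYPE[content_type], duration
```

Raising on an unexpected content type is a deliberate choice here: a JSON error body saved with a .wav extension is harder to debug than an early exception.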
When include_phonemes=true
If you opt in to phoneme timestamps, the response switches to JSON with a base64-encoded audio payload alongside the phoneme arrays:
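A JSON response of this kind could be unpacked roughly as follows. The field names `audio_base64`, `phonemes`, `symbols`, and `start_times_seconds` are hypothetical placeholders chosen for this sketch; the actual schema is defined in the API reference and may differ.

```python
import base64
import json


def decode_phoneme_response(body_text):
    """Split a JSON response into raw audio bytes and phoneme timing rows."""
    payload = json.loads(body_text)

    # Hypothetical field names; the real schema is in the API reference.
    audio = base64.b64decode(payload["audio_base64"])
    phonemes = payload["phonemes"]

    # Pair each phoneme symbol with its start time for easier iteration.
    rows = list(zip(phonemes["symbols"], phonemes["start_times_seconds"]))
    return audio, rows
```

The decoded bytes are the same audio that the default binary response would carry, so they can be written to disk in the same way.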
Save the result
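Because the default response body is binary audio, saving it amounts to writing the bytes to a file whose extension matches output_format. A minimal sketch, assuming the audio bytes are already in memory; `save_audio` is a hypothetical helper, not an SDK function.

```python
from pathlib import Path


def save_audio(audio_bytes, stem, output_format="wav"):
    """Write audio bytes to '<stem>.<output_format>' and return the path."""
    if output_format not in ("wav", "mp3"):
        raise ValueError("output_format must be 'wav' or 'mp3'")

    path = Path(f"{stem}.{output_format}")
    # Audio is binary data, so write bytes rather than text.
    path.write_bytes(audio_bytes)
    return path
```

For example, `save_audio(audio, "greeting", "mp3")` would create greeting.mp3 in the current working directory.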
Tips
- Style matters. Different voices may have different default styles. Either explicitly set style, or call Get voice once at startup to read the voice’s default.
- Estimate before you generate. predict_duration returns the expected audio length without consuming credits, which is useful for UI hints and cost forecasting.
- Long text. The raw API caps text at 300 characters. The Python and TypeScript SDKs split, generate, and merge automatically; see Long text.
- Empty or very short input can produce unnatural results. Aim for at least a complete short sentence.
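When calling the raw API directly, text over the 300-character cap has to be split by the caller. The sketch below shows one possible approach, splitting at sentence boundaries and hard-wrapping any single sentence that is still too long. It is not the algorithm the SDKs use, and `split_text` is a hypothetical helper name.

```python
import re

MAX_TEXT_CHARS = 300  # per-call cap on the text field


def split_text(text, limit=MAX_TEXT_CHARS):
    """Split text into chunks of at most `limit` characters."""
    # Break after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Hard-wrap any single sentence that is itself over the limit.
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        # Pack sentences together until the next one would overflow.
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be sent as its own create_speech call. Joining the resulting audio is a separate step that this sketch does not cover; for WAV output in particular, simply concatenating the files would leave one header per chunk in the result.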
Related
- Pick a model: Choose between fast and high-quality TTS models.
- Long text: Generate audio from text longer than 300 characters.
- Voice settings: Tune pitch, intonation, and speed.
- API reference: Full request and response schema.