Converts text into speech using a voice of your choice and configurable voice settings, and returns the output as an audio stream.
| Name | Required | Description |
|---|---|---|
| voice_id | Yes | The ID of the target voice. |

| Name | Required | Description |
|---|---|---|
| text | Yes | The text to convert (max 300 characters). |
| language | Yes | Language code. Supported: en, ko, ja. |
| style | No | Emotional style, e.g., neutral, happy, sad. If not specified, the character’s default style is applied. |
| model | No | TTS model. Default: sona_speech_1. |
| output_format | No | Output format. Options: wav, mp3. Default: wav. |
| voice_settings | No | Advanced voice parameters (see below). |
| include_phonemes | No | If true, returns phoneme timing data along with audio (Base64-encoded). Default: false. |

| Name | Range | Default | Description |
|---|---|---|---|
| pitch_shift | -24 → 24 | 0 | Pitch adjustment in semitones. |
| pitch_variance | 0 → 2 | 1 | Degree of pitch variation. |
| speed | 0.5 → 2 | 1 | Speed ratio. Makes the generated audio uniformly faster or slower. |
| duration | 0 → 60 | 0 | When provided, speech is generated to match the given duration, in seconds. |
| similarity | 1 → 5 | 3 | Controls how closely the generated speech matches the original character voice. |
| text_guidance | 0 → 4 | 1 | Controls how sensitively speech characteristics adapt to the input text content. |
| subharmonic_amplitude_control | 0 → 2 | 1 | Controls the amount of subharmonic amplitude of the generated speech. |
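
The limits in the tables above can be checked on the client before a request is sent. The following is a minimal sketch, not part of the API itself: the function name `build_tts_payload` is hypothetical, the ranges are assumed to be inclusive, and the language set is the one listed in the request-body table. It only builds and validates the JSON body and does not call any endpoint.

```python
# Ranges copied from the voice_settings table: name -> (min, max).
# Assumption: both ends of each range are inclusive.
VOICE_SETTING_RANGES = {
    "pitch_shift": (-24, 24),
    "pitch_variance": (0, 2),
    "speed": (0.5, 2),
    "duration": (0, 60),
    "similarity": (1, 5),
    "text_guidance": (0, 4),
    "subharmonic_amplitude_control": (0, 2),
}
MAX_TEXT_LENGTH = 300
LANGUAGES = {"en", "ko", "ja"}
OUTPUT_FORMATS = {"wav", "mp3"}


def build_tts_payload(text, language, style=None, model=None,
                      output_format=None, voice_settings=None,
                      include_phonemes=None):
    """Validate arguments against the documented limits; return a JSON-ready dict."""
    if len(text) > MAX_TEXT_LENGTH:
        raise ValueError(f"text exceeds {MAX_TEXT_LENGTH} characters")
    if language not in LANGUAGES:
        raise ValueError(f"unsupported language: {language!r}")
    if output_format is not None and output_format not in OUTPUT_FORMATS:
        raise ValueError(f"unsupported output_format: {output_format!r}")

    for name, value in (voice_settings or {}).items():
        if name not in VOICE_SETTING_RANGES:
            raise ValueError(f"unknown voice setting: {name!r}")
        low, high = VOICE_SETTING_RANGES[name]
        if not low <= value <= high:
            raise ValueError(f"{name} must be between {low} and {high}")

    # Required fields first; optional fields are included only when set,
    # so the server-side defaults apply to anything omitted.
    payload = {"text": text, "language": language}
    optional = {
        "style": style,
        "model": model,
        "output_format": output_format,
        "voice_settings": voice_settings,
        "include_phonemes": include_phonemes,
    }
    payload.update({k: v for k, v in optional.items() if v is not None})
    return payload
```

For example, `build_tts_payload("Hello", "en", voice_settings={"speed": 1.5})` returns a dict containing only `text`, `language`, and `voice_settings`, while a 301-character `text` raises `ValueError`.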
include_phonemes, returns:
Audio Stream

- sona_speech_1 model.
- text length exceeds 300 characters.
- speed is applied after duration. (Example: duration = 5 seconds, speed = 2 times → final audio ≈ 10 seconds)
- style, but the default style may vary by character.

| Name | Description | Allowed values |
|---|---|---|
| text | The text to convert to speech. | Maximum length: 300 |
| language | The language code of the text. | en, ko, ja, bg, cs, da, el, es, et, fi, hu, it, nl, pl, pt, ro, ar, de, fr, hi, id, ru, vi |
| style | The style of character to use for the text-to-speech conversion. | |
| model | The model type to use for the text-to-speech conversion. | sona_speech_1, sona_speech_2, supertonic_api_1 |
| output_format | The desired output format of the audio file (wav, mp3). Default is wav. | wav, mp3 |
| include_phonemes | Return phoneme timing data with the audio. | |

Streaming audio data in binary format or NDJSON format with phoneme data based on includePhonemes parameter

Binary audio stream (when includePhonemes=false or omitted)
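
Because the response is either a raw binary stream or NDJSON, a client has to branch on whether phoneme data was requested. The sketch below illustrates one way to do that over any iterable of byte chunks. It is an assumption-laden example: the helper names `iter_ndjson` and `collect_audio` are hypothetical, and the NDJSON field name `audio_base64` is a placeholder, since the actual record schema is not given here.

```python
import base64
import json


def iter_ndjson(chunks):
    """Yield one parsed JSON object per complete line from an iterable of byte chunks."""
    buffer = b""
    for chunk in chunks:
        buffer += chunk
        # A network chunk may end mid-line, so only complete lines are parsed.
        while b"\n" in buffer:
            line, buffer = buffer.split(b"\n", 1)
            if line.strip():
                yield json.loads(line)
    # Parse a final record that was not terminated by a newline.
    if buffer.strip():
        yield json.loads(buffer)


def collect_audio(chunks, include_phonemes=False, audio_key="audio_base64"):
    """Return (audio_bytes, records).

    Without phonemes the chunks are raw audio and are simply concatenated.
    With phonemes each NDJSON record is kept, and the Base64 audio found under
    `audio_key` (hypothetical field name) is decoded and appended.
    """
    if not include_phonemes:
        return b"".join(chunks), []

    audio = bytearray()
    records = []
    for record in iter_ndjson(chunks):
        records.append(record)
        if audio_key in record:
            audio += base64.b64decode(record[audio_key])
    return bytes(audio), records
```

The chunk iterable can come from any HTTP client that exposes the response body incrementally. Buffering until a newline is seen keeps the parser correct even when a JSON record is split across chunks.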