Endpoint
Path Parameters
| Name | Required | Description |
|---|---|---|
| voice_id | Yes | The ID of the target voice. |

Request Body
Content-Type: application/json

| Name | Required | Description |
|---|---|---|
| text | Yes | The text to convert (max 300 characters). |
| language | Yes | Language code. Supported: en, ko, ja. |
| style | No | Emotional style, e.g. neutral, happy, sad. If not specified, the character's default style is applied. |
| model | No | TTS model. Default: sona_speech_1. |
| output_format | No | Output format. Options: wav, mp3. Default: wav. |
| voice_settings | No | Advanced voice parameters (see Voice Settings below). |
| include_phonemes | No | If true, returns phoneme timing data along with the audio (Base64-encoded). Default: false. |

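
A minimal Python sketch of assembling a request with only the standard library. The base URL, the `x-api-key` header name, and the helper `build_tts_request` are hypothetical, because this section states neither the endpoint path nor the authorization scheme. The client-side checks only mirror the constraints in the table above; note that Python's `len()` counts code points, which may not match how the server counts characters.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical base URL: this section does not state the endpoint path.
BASE_URL = "https://api.example.com/v1/text-to-speech"

MAX_TEXT_LENGTH = 300
SUPPORTED_LANGUAGES = {"en", "ko", "ja"}
SUPPORTED_FORMATS = {"wav", "mp3"}


def build_tts_request(voice_id, text, language, *, style=None, model=None,
                      output_format=None, voice_settings=None,
                      include_phonemes=None, api_key="YOUR_API_KEY"):
    """Build (but do not send) a POST request for one text-to-speech call."""
    # Client-side checks that mirror the documented constraints.
    if len(text) > MAX_TEXT_LENGTH:  # len() counts code points
        raise ValueError(f"text exceeds {MAX_TEXT_LENGTH} characters")
    if language not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language: {language!r}")
    if output_format is not None and output_format not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported output_format: {output_format!r}")

    # Required fields always; optional fields only when set, so server defaults apply.
    body = {"text": text, "language": language}
    optional = {
        "style": style,
        "model": model,
        "output_format": output_format,
        "voice_settings": voice_settings,
        "include_phonemes": include_phonemes,
    }
    body.update({key: value for key, value in optional.items() if value is not None})

    # voice_id is a path parameter; percent-encode it so it stays one path segment.
    url = f"{BASE_URL}/{urllib.parse.quote(voice_id, safe='')}"
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "x-api-key": api_key,  # hypothetical header name; see Authorizations
        },
        method="POST",
    )


if __name__ == "__main__":
    req = build_tts_request("my-voice-id", "Hello there!", "en",
                            output_format="mp3", voice_settings={"speed": 1.1})
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
    # urllib.request.urlopen(req) would actually send the request.
```

Leaving an optional argument as `None` omits the field entirely rather than sending `null`, so the documented server-side defaults (for example `sona_speech_1` and `wav`) remain in effect.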
Voice Settings (optional)
| Name | Range | Default | Description |
|---|---|---|---|
| pitch_shift | -24 to 24 | 0 | Pitch adjustment in semitones. |
| pitch_variance | 0 to 2 | 1 | Degree of pitch variation. |
| speed | 0.5 to 2 | 1 | Speed ratio that uniformly makes the generated audio faster or slower. |
| duration | 0 to 60 | 0 | Target length in seconds. When provided, speech is generated to match this duration. |
| similarity | 1 to 5 | 3 | Controls how closely the generated speech matches the original character voice. |
| text_guidance | 0 to 4 | 1 | Controls how sensitively speech characteristics adapt to the input text content. |
| subharmonic_amplitude_control | 0 to 2 | 1 | Controls the subharmonic amplitude of the generated speech. |

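
The documented ranges can be checked before a request is sent. This is a hedged sketch: it assumes the ranges are inclusive and that each setting is a plain number, and `check_voice_settings` is a hypothetical helper, not part of the API.

```python
# Documented (min, max) range for each voice_settings field, assumed inclusive.
VOICE_SETTINGS_RANGES = {
    "pitch_shift": (-24, 24),
    "pitch_variance": (0, 2),
    "speed": (0.5, 2),
    "duration": (0, 60),
    "similarity": (1, 5),
    "text_guidance": (0, 4),
    "subharmonic_amplitude_control": (0, 2),
}


def check_voice_settings(settings):
    """Return a list of problems in a voice_settings dict (empty list if none)."""
    problems = []
    for name, value in settings.items():
        if name not in VOICE_SETTINGS_RANGES:
            problems.append(f"unknown setting: {name}")
            continue
        low, high = VOICE_SETTINGS_RANGES[name]
        # Reject bools explicitly: bool is a subclass of int in Python.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            problems.append(f"{name} must be a number")
        elif not low <= value <= high:
            problems.append(f"{name}={value} is outside [{low}, {high}]")
    return problems


if __name__ == "__main__":
    print(check_voice_settings({"pitch_shift": 30, "speed": 1.5}))
```

Returning a list instead of raising on the first error lets a client report every out-of-range value at once.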
Response
Depending on `include_phonemes`, the response is one of the following:

- Audio stream (default, or when `include_phonemes=false`)
  - `audio/wav`: binary audio stream.
  - `audio/mpeg`: binary audio stream.
- NDJSON stream with phoneme data (when `include_phonemes=true`)
  - Streamed as newline-delimited JSON.

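Because the response shape depends on `include_phonemes`, a client has to branch on it. The sketch below assumes the NDJSON stream is served with a Content-Type outside `audio/` (the exact media type is not stated here) and does not rely on any field names inside the JSON records, since their schema is not documented in this section. `iter_ndjson` and `handle_tts_response` are hypothetical helpers.

```python
import io
import json


def iter_ndjson(stream):
    """Yield one parsed JSON object per non-empty line of a binary NDJSON stream."""
    for raw_line in stream:  # binary file-like objects iterate line by line
        line = raw_line.strip()
        if line:  # skip blank lines, e.g. a trailing newline
            yield json.loads(line)


def handle_tts_response(content_type, stream):
    """Return ("audio", bytes) for audio/* responses, else ("ndjson", [records])."""
    media_type = content_type.split(";")[0].strip().lower()
    if media_type.startswith("audio/"):
        # include_phonemes=false (default): audio/wav or audio/mpeg bytes.
        return "audio", stream.read()
    # include_phonemes=true: assumed to be NDJSON (media type not documented here).
    return "ndjson", list(iter_ndjson(stream))


if __name__ == "__main__":
    # Synthetic NDJSON body; real record fields are not documented in this section.
    body = io.BytesIO(b'{"seq": 1}\n{"seq": 2}\n')
    print(handle_tts_response("application/x-ndjson", body))
```

`handle_tts_response` collects all records for simplicity; a client that wants to act on each record as it arrives can iterate `iter_ndjson(response)` directly instead.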
Notes
- A 400 error is returned if the `text` length exceeds 300 characters.
- `speed` is applied after `duration`. (Example: `duration`=5 seconds, `speed`=2× → final audio ≈ 10 seconds.)
- The API can be called without specifying `style`, but the default style may vary by character. Use the Get Voices API to check the default (the first value in the style array is the default).
- The returned audio file can be saved or played directly. (Appropriate handling may be required depending on the client.)

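
To send the character's default style explicitly rather than relying on the server, a client can read it from a Get Voices result. This sketch assumes each voice object exposes its style array under a hypothetical key `styles`; adapt it to the actual Get Voices response. `pick_style` is likewise a hypothetical helper.

```python
def pick_style(voice, requested=None):
    """Return the style to send for `voice`, or None to let the server decide.

    `voice` is one entry from a Get Voices response; "styles" is a
    hypothetical name for its style array.
    """
    styles = voice.get("styles") or []
    if requested is None:
        # The first value in the style array is the character's default.
        return styles[0] if styles else None
    if styles and requested not in styles:
        raise ValueError(f"style {requested!r} is not offered by this voice: {styles}")
    return requested


if __name__ == "__main__":
    voice = {"styles": ["neutral", "happy", "sad"]}
    print(pick_style(voice))           # the character's default style
    print(pick_style(voice, "happy"))  # an explicit, validated choice
```

A `None` result means the `style` field can simply be omitted from the request body.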
Authorizations
Path Parameters
Body
application/json
- text: The text to convert to speech. Maximum length: 300.
- language: The language code of the text. Available options: en, ko, ja.
- style: The character style to use for the text-to-speech conversion.
- model: The model type to use for the text-to-speech conversion.
- output_format: The desired output format of the audio file. Available options: wav, mp3. Default: wav.
- include_phonemes: Return phoneme timing data with the audio.

Response
Streaming audio data in binary format, or NDJSON with phoneme data, depending on the `include_phonemes` parameter.
Binary audio stream (when `include_phonemes=false` or omitted)
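
When `include_phonemes` is false or omitted, the body is raw audio, so it can be written to disk incrementally instead of being buffered in memory. A minimal sketch: `save_audio_stream` is a hypothetical helper that accepts any binary file-like object with a `read(n)` method, such as the response object returned by `urllib.request.urlopen`.

```python
import io
import os
import tempfile


def save_audio_stream(stream, path, chunk_size=64 * 1024):
    """Copy a binary stream to `path` chunk by chunk; return the number of bytes written."""
    written = 0
    with open(path, "wb") as out:
        while True:
            chunk = stream.read(chunk_size)
            if not chunk:  # empty bytes signals end of stream
                break
            out.write(chunk)
            written += len(chunk)
    return written


if __name__ == "__main__":
    # Stand-in for a real HTTP response body.
    fake_response = io.BytesIO(b"\x00" * 200_000)
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "speech.wav")  # use .mp3 when output_format=mp3
        print(save_audio_stream(fake_response, target))
```

Choosing the file extension from the requested `output_format` keeps the saved file consistent with its `audio/wav` or `audio/mpeg` content.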