1. Endpoint and Basic Structure
Required Headers
Path Parameters
- voice_id: Unique ID of the voice to use
Query Parameters
- output_format (optional): Audio format to generate. Choose between wav (default) and mp3.
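As a minimal sketch of how the output_format query parameter might be attached to a request, the helper below validates the value against the two formats listed above and builds a query string. The name build_query is hypothetical, and leaving the parameter out to get the default is an assumption; the endpoint URL is not shown in this section and is left to the caller.

```python
from urllib.parse import urlencode

# Formats listed in this guide; wav is the documented default.
ALLOWED_FORMATS = ("wav", "mp3")


def build_query(output_format=None):
    """Return a query string such as 'output_format=mp3' (hypothetical helper).

    Returns an empty string when output_format is None, so the server
    applies its default (wav).
    """
    if output_format is None:
        return ""
    if output_format not in ALLOWED_FORMATS:
        raise ValueError(f"output_format must be one of {ALLOWED_FORMATS}")
    return urlencode({"output_format": output_format})
```

Validating on the client side is optional, but it turns a typo such as "mp4" into an immediate local error instead of a failed request.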
2. Request Body
Requests are sent as JSON and can include the following fields:

Field | Required | Description |
---|---|---|
text | ✅ | Text to convert to speech (max 300 characters) |
language | ✅ | Language of the text. Choose from the languages supported by the voice (ko, en, ja) |
style | ❌ | Emotion style to apply (neutral, happy, etc.). If omitted, the default style is used; the first style value is the default. |
model | ❌ | Voice model to use (sona_speech_1). The default model is applied if omitted |
voice_settings | ❌ | Advanced options to adjust voice pitch, intonation, and speed (see below) |
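The field table above can be sketched as a small body builder. This is an illustration under stated assumptions, not the official client: build_tts_body is a hypothetical name, the 300-character check uses Python's len (how the service counts characters is not stated), and a given voice may support only a subset of ko, en, and ja.

```python
import json

MAX_TEXT_LENGTH = 300                     # limit stated in the field table
SUPPORTED_LANGUAGES = {"ko", "en", "ja"}  # a voice may support only some of these


def build_tts_body(text, language, style=None, model=None, voice_settings=None):
    """Build the JSON body for a TTS request (hypothetical helper)."""
    if not text:
        raise ValueError("text is required")
    if len(text) > MAX_TEXT_LENGTH:
        # Assumption: the service counts characters the way len() does.
        raise ValueError(f"text exceeds {MAX_TEXT_LENGTH} characters")
    if language not in SUPPORTED_LANGUAGES:
        raise ValueError(f"language must be one of {sorted(SUPPORTED_LANGUAGES)}")

    body = {"text": text, "language": language}
    # Optional fields are omitted entirely so the server applies its defaults.
    if style is not None:
        body["style"] = style
    if model is not None:
        body["model"] = model
    if voice_settings is not None:
        body["voice_settings"] = voice_settings
    return body


# Usage: serialize the body before sending it with any HTTP client.
print(json.dumps(build_tts_body("Hello, world!", "en", style="neutral")))
```

Leaving optional keys out, rather than sending null values, keeps the request aligned with the "applied if omitted" behavior described in the table.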
3. Complete Request Example
4. voice_settings Options
voice_settings is an advanced option for fine-tuning how the generated voice sounds.
Parameter | Description | Allowed Range | Default |
---|---|---|---|
pitch_shift | Adjusts the pitch in semitone steps. 0 keeps the original voice pitch; the pitch can be shifted up or down by as many as 12 steps. | -12 ~ +12 | 0 |
pitch_variance | Controls the degree of intonation variation during speech. Smaller values create flatter intonation, larger values create richer intonation. | 0.1 ~ 2 | 1 |
speed | Controls speech speed. Values less than 1 make it slower, values greater than 1 make it faster. | 0.5 ~ 2 | 1 |
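The ranges and defaults in the table can be checked before a request is sent. The sketch below is illustrative: make_voice_settings and semitone_ratio are hypothetical helpers, and whether pitch_shift accepts non-integer values is not stated here. The ratio function uses the standard equal-temperament relation (one semitone is a factor of 2^(1/12) in frequency); that the service applies exactly this ratio is an assumption.

```python
# Allowed ranges and defaults copied from the table above.
RANGES = {
    "pitch_shift": (-12, 12),
    "pitch_variance": (0.1, 2),
    "speed": (0.5, 2),
}
DEFAULTS = {"pitch_shift": 0, "pitch_variance": 1, "speed": 1}


def make_voice_settings(**overrides):
    """Return a voice_settings dict, rejecting out-of-range values."""
    settings = dict(DEFAULTS)
    for name, value in overrides.items():
        if name not in RANGES:
            raise KeyError(f"unknown voice setting: {name}")
        low, high = RANGES[name]
        if not low <= value <= high:
            raise ValueError(f"{name} must be between {low} and {high}")
        settings[name] = value
    return settings


def semitone_ratio(steps):
    """Frequency ratio for a shift of `steps` equal-tempered semitones."""
    return 2 ** (steps / 12)
```

For example, make_voice_settings(pitch_shift=-3, speed=1.2) keeps pitch_variance at its default, and semitone_ratio(12) shows that the maximum upward shift corresponds to one octave, a doubling of frequency.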
5. Response
On success, the API responds with an audio stream (audio/wav or audio/mpeg). The audio length can be checked through the response headers.
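When saving the returned audio, the Content-Type of the response can be used to pick a file extension. The sketch below is a hypothetical helper, not part of the API. It covers only the two media types named above and deliberately does not read the audio length, because the name of the header that carries it is not given in this section.

```python
# Media types named in this guide, mapped to conventional file extensions.
EXTENSIONS = {"audio/wav": ".wav", "audio/mpeg": ".mp3"}


def extension_for(content_type):
    """Return the file extension for a Content-Type header value."""
    # Drop any parameters after ';' and normalize case before the lookup.
    media_type = content_type.split(";", 1)[0].strip().lower()
    try:
        return EXTENSIONS[media_type]
    except KeyError:
        raise ValueError(f"unexpected content type: {content_type!r}") from None
```

Raising on an unexpected type helps catch the case where an error payload is returned instead of audio, before it is written to disk as a sound file.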
6. Text Input Considerations
- Text can be up to 300 characters long.
- Very short sentences may result in unnatural speech.
- Only Korean, English, and Japanese are supported; other languages may produce unexpected results.
- Emojis and special symbols may not be read or may be ignored.
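Longer passages have to be split to respect the 300-character limit. One possible approach, sketched below, splits at sentence-ending punctuation and packs neighboring sentences into the same chunk, which also avoids sending very short fragments. split_text is a hypothetical helper; it rejoins sentences with a single space (a simplification for Japanese text) and assumes the service counts characters the way Python's len does.

```python
import re

MAX_TEXT_LENGTH = 300  # limit stated in this guide

# Split after Latin punctuation followed by whitespace, or directly
# after CJK sentence-ending punctuation (which may have no space).
SENTENCE_END = re.compile(r"(?<=[.!?。！？])\s+|(?<=[。！？])")


def split_text(text, limit=MAX_TEXT_LENGTH):
    """Split text into chunks of at most `limit` characters."""
    chunks, current = [], ""
    for sentence in SENTENCE_END.split(text.strip()):
        if not sentence:
            continue
        # A single sentence longer than the limit is cut at the limit.
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= limit:
            current = candidate  # sentence still fits in the open chunk
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk can then be sent as the text of a separate request, and the resulting audio segments concatenated in order.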
7. Model Selection
Currently, only one model, sona_speech_1, is supported. This parameter is optional, and the default model is applied automatically if it is not provided.
8. Check Speech Duration First with Predict Duration API
You can predict how many seconds of speech the input text will produce without generating any audio.
- The request is made the same way as a TTS request
- Response example:
9. Stream Text-to-Speech
This is a streaming TTS mode designed for real-time services such as AI chatbots and character-based chats. With streaming TTS, you can start receiving audio quickly without waiting for the entire text to be synthesized.
For detailed usage instructions, please refer to the guide below: