To convert text to speech through Supertone API, you need to pass information such as text, language, and style along with a specific voice ID to the API. This document provides step-by-step guidance on the complete call structure of the Text-to-Speech function, parameter configuration methods, response format, and voice adjustment options.

1. Endpoint and Basic Structure

POST /v1/text-to-speech/{voice_id}

Required Headers

x-sup-api-key: [YOUR_API_KEY]
Content-Type: application/json

Path Parameters

  • voice_id: Unique ID of the voice to use

Query Parameters

  • output_format (optional): Audio format to generate. Choose between wav (default) and mp3

2. Request Body

Requests are sent in JSON format and can include the following fields:

FieldRequiredDescription
textText to convert to speech (max 300 characters)
languageLanguage of the text. Choose within languages supported by the voice (ko, en, ja)
styleEmotion style to apply (neutral, happy, etc.). If not entered, the default style will be used. The first value becomes the default style.
modelVoice model to use (sona_speech_1). Automatically applied if omitted
voice_settingsAdvanced options to adjust voice pitch, intonation, and speed (see below)

3. Complete Request Example

POST /v1/text-to-speech/91992bbd4758bdcf9c9b01?output_format=mp3
x-sup-api-key: [YOUR_API_KEY]
Content-Type: application/json

{
  "text": "안녕하세요, 수퍼톤 API입니다.",
  "language": "ko",
  "style": "neutral",
  "model": "sona_speech_1",
  "voice_settings": {
    "pitch_shift": 0,
    "pitch_variance": 1,
    "speed": 1
  }
}

4. voice_settings Options

voice_settings is an advanced option you can use when you want to fine-tune the speech feel of the generated voice.

ParameterDescriptionAllowed RangeDefault
pitch_shiftAdjusts the pitch level.
0 is the original voice pitch, with ±12 steps possible. 1 step is a semitone.
-12 ~ +120
pitch_varianceControls the degree of intonation variation during speech.
Smaller values create flatter intonation, larger values create richer intonation.
0.1 ~ 21
speedControls speech speed.
Values less than 1 make it slower, values greater than 1 make it faster.
0.5 ~ 21

5. Response

On success, responds with an audio stream (audio/wav or audio/mpeg).
Audio length can be checked through headers.

X-Audio-Length: 3.42

The above example indicates that 3.42 seconds of speech was generated.

6. Text Input Considerations

  • Text can be input up to 300 characters maximum.
  • Too short sentences may result in unnatural speech.
  • Only Korean, English, and Japanese are supported; other languages may produce unexpected results.
  • Emojis and special symbols may not be read or may be ignored.

7. Model Selection

Currently, only one model sona_speech_1 is supported.
This parameter is optional, and the default model is automatically applied if not provided.

8. Check Speech Duration First with Predict Duration API

Even without generating speech, you can predict how many seconds of speech the input text will produce.

POST /v1/predict-duration/{voice_id}
  • Request method is the same as TTS
  • Response example:
{
  "duration": 2.87
}

This API does not deduct credits. It can be useful for usage prediction or implementing preview UI features.