POST /v1/text-to-speech/{voice_id}
Convert text to speech
curl --request POST \
  --url https://supertoneapi.com/v1/text-to-speech/{voice_id} \
  --header 'Content-Type: application/json' \
  --header 'x-sup-api-key: <api-key>' \
  --data '{
  "text": "<string>",
  "language": "en",
  "style": "<string>",
  "model": "sona_speech_1",
  "output_format": "wav",
  "voice_settings": {
    "pitch_shift": 0,
    "pitch_variance": 1,
    "speed": 1,
    "duration": 0,
    "similarity": 3,
    "text_guidance": 1,
    "subharmonic_amplitude_control": 1
  },
  "include_phonemes": false
}'

Endpoint

https://supertoneapi.com/v1/text-to-speech/{voice_id}

Path Parameters

Name      Required  Description
voice_id  Yes       The ID of the target voice.

Request Body

Content-Type: application/json
Name              Required  Description
text              Yes       The text to convert (max 300 characters).
language          Yes       Language code. Supported: en, ko, ja.
style             No        Emotional style, e.g. neutral, happy, sad. If not specified, the character’s default style is applied.
model             No        TTS model. Default: sona_speech_1.
output_format     No        Output format. Options: wav, mp3. Default: wav.
voice_settings    No        Advanced voice parameters (see Voice Settings below).
include_phonemes  No        If true, returns JSON containing phoneme timing data and the Base64-encoded audio. Default: false.
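
As an illustration of the request body described above, here is a minimal Python sketch that assembles (but does not send) the POST request using only the standard library. The function name `build_tts_request` is hypothetical, and the client-side length check assumes the API counts characters the same way Python's `len` does; the server remains the authority on what it accepts.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://supertoneapi.com/v1/text-to-speech/"
SUPPORTED_LANGUAGES = {"en", "ko", "ja"}
SUPPORTED_FORMATS = {"wav", "mp3"}
MAX_TEXT_LENGTH = 300  # assumption: counted like Python's len()


def build_tts_request(api_key, voice_id, text, language, **optional):
    """Return an unsent urllib Request for the text-to-speech endpoint.

    `optional` may carry style, model, output_format, voice_settings
    and include_phonemes; fields that are omitted are simply not sent.
    """
    if len(text) > MAX_TEXT_LENGTH:
        raise ValueError(f"text exceeds {MAX_TEXT_LENGTH} characters")
    if language not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language: {language!r}")
    if optional.get("output_format", "wav") not in SUPPORTED_FORMATS:
        raise ValueError("output_format must be 'wav' or 'mp3'")

    body = {"text": text, "language": language, **optional}
    return urllib.request.Request(
        API_URL + urllib.parse.quote(voice_id, safe=""),
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "x-sup-api-key": api_key,
        },
        method="POST",
    )
```

The returned object can be passed to `urllib.request.urlopen` to perform the call; keeping construction separate from sending makes the validation easy to test without network access.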

Voice Settings (optional)

Name                           Range     Default  Description
pitch_shift                    -24 → 24  0        Pitch adjustment in semitones.
pitch_variance                 0 → 2     1        Degree of pitch variation.
speed                          0.5 → 2   1        Adjusts the generated audio uniformly faster or slower (ratio).
duration                       0 → 60    0        When provided, speech is generated to match the given duration (seconds).
similarity                     1 → 5     3        Controls how closely the generated speech matches the original character voice.
text_guidance                  0 → 4     1        Controls how sensitively speech characteristics adapt to the input text content.
subharmonic_amplitude_control  0 → 2     1        Controls the amount of subharmonic amplitude of the generated speech.
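
The ranges above can be checked on the client before a request is sent. The following Python sketch is one way to do that; `validate_voice_settings` is a hypothetical helper, and treating both ends of each range as inclusive is an assumption, since the table does not say whether the bounds are open or closed.

```python
# Ranges copied from the Voice Settings table: name -> (min, max).
VOICE_SETTING_RANGES = {
    "pitch_shift": (-24, 24),
    "pitch_variance": (0, 2),
    "speed": (0.5, 2),
    "duration": (0, 60),
    "similarity": (1, 5),
    "text_guidance": (0, 4),
    "subharmonic_amplitude_control": (0, 2),
}


def validate_voice_settings(settings):
    """Return a copy of `settings`, raising ValueError on bad input.

    Unknown keys and values outside the documented range are rejected.
    Bounds are treated as inclusive (an assumption).
    """
    for name, value in settings.items():
        if name not in VOICE_SETTING_RANGES:
            raise ValueError(f"unknown voice setting: {name!r}")
        low, high = VOICE_SETTING_RANGES[name]
        if not low <= value <= high:
            raise ValueError(f"{name}={value} is outside {low} → {high}")
    return dict(settings)
```

For example, `validate_voice_settings({"speed": 1.5, "pitch_shift": -2})` returns the same mapping, while `{"speed": 3}` raises `ValueError`.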

Response

Depending on include_phonemes, the response is one of the following.

Binary Audio (default, and when include_phonemes=false)

  • audio/wav – Raw WAV file.
  • audio/mpeg – Raw MP3 file.

JSON with Phoneme Data (when include_phonemes=true)

{
  "audio_base64": "UklGRnoGAABXQVZF...",
  "phonemes": {
    "symbols": ["", "h", "ɐ", "ɡ", "ʌ", ""],
    "start_times_seconds": [0, 0.092, 0.197, 0.255, 0.29, 0.58],
    "durations_seconds": [0.092, 0.104, 0.058, 0.034, 0.29, 0.162]
  }
}
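
A JSON response of this shape could be unpacked as in the Python sketch below. `parse_phoneme_response` is a hypothetical helper; it assumes the three phoneme arrays are parallel (same length, same order), which the sample suggests but the page does not state outright. Empty symbols are kept as they appear.

```python
import base64
import json


def parse_phoneme_response(body):
    """Split a phoneme JSON response into audio bytes and a timeline.

    Returns (audio_bytes, timeline), where timeline is a list of
    (symbol, start_seconds, duration_seconds) tuples.
    """
    payload = json.loads(body)
    audio = base64.b64decode(payload["audio_base64"])
    phonemes = payload["phonemes"]
    timeline = list(
        zip(
            phonemes["symbols"],
            phonemes["start_times_seconds"],
            phonemes["durations_seconds"],
        )
    )
    return audio, timeline
```

The audio bytes can then be written to a file in binary mode, and the timeline used for tasks such as lip-sync or subtitle alignment.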

Headers:

X-Audio-Length (number) – Duration of the audio in seconds.
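
For the binary case, a client typically needs the file extension and the audio length from the response headers. The Python sketch below shows one way to read them; `describe_audio_response` is a hypothetical helper, and it assumes the headers are available as a plain mapping of names to string values.

```python
EXTENSION_BY_CONTENT_TYPE = {"audio/wav": ".wav", "audio/mpeg": ".mp3"}


def describe_audio_response(headers):
    """Return (file_extension, length_seconds) for a binary audio response.

    length_seconds is None when X-Audio-Length is absent.
    """
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    # Drop any parameters such as "; charset=..." from the media type.
    content_type = lowered.get("content-type", "").split(";")[0].strip().lower()
    if content_type not in EXTENSION_BY_CONTENT_TYPE:
        raise ValueError(f"unexpected content type: {content_type!r}")
    raw_length = lowered.get("x-audio-length")
    length = float(raw_length) if raw_length is not None else None
    return EXTENSION_BY_CONTENT_TYPE[content_type], length
```

With `{"Content-Type": "audio/mpeg", "X-Audio-Length": "2.5"}` as input, the function returns `(".mp3", 2.5)`.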

Notes

  • The API returns a 400 error if text exceeds 300 characters.
  • speed is applied after duration. Example: duration=5 seconds with speed=2 → final audio ≈ 10 seconds.
  • style is optional, but the default style varies by character. Call the Get Voices API to check a character's default style (the first value in its styles array).
  • The audio in the response can be saved or played directly; how to handle it depends on the client.
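
The default-style rule in the notes above can be expressed as a small Python helper. This is only a sketch: `resolve_style` is a hypothetical name, and the voice record is assumed to be a dictionary with a `styles` list, as the note describes; any other fields returned by the Get Voices API are not modelled here.

```python
def resolve_style(voice, requested=None):
    """Return the style to send for a voice record from Get Voices.

    If `requested` is given it is returned unchanged; otherwise the
    first entry of the record's styles list (the default) is used.
    Returns None when no style is requested and none is listed.
    """
    if requested is not None:
        return requested
    styles = voice.get("styles") or []
    return styles[0] if styles else None
```

For a record such as `{"styles": ["neutral", "happy"]}`, calling `resolve_style(record)` yields `"neutral"`, and `resolve_style(record, "happy")` yields `"happy"`.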

Authorizations

x-sup-api-key (string, header, required)

Path Parameters

voice_id (string, required)

Body

application/json
text (string, required)
The text to convert to speech. Maximum length: 300.

language (enum<string>, required)
The language code of the text. Available options: en, ko, ja.

style (string)
The style of character to use for the text-to-speech conversion.

model (string, default: sona_speech_1)
The model type to use for the text-to-speech conversion.

output_format (enum<string>, default: wav)
The desired output format of the audio file. Available options: wav, mp3.

voice_settings (object)

include_phonemes (boolean, default: false)
Return phoneme timing data with the audio.

Response

Returns either binary audio or JSON with phoneme data, depending on the include_phonemes parameter.

Binary audio file (when include_phonemes=false or omitted)

I