POST /v1/text-to-speech/{voice_id}
Convert text to speech
curl --request POST \
  --url https://supertoneapi.com/v1/text-to-speech/{voice_id} \
  --header 'Content-Type: application/json' \
  --header 'x-sup-api-key: <api-key>' \
  --data '
{
  "text": "<string>",
  "language": "en",
  "style": "<string>",
  "model": "sona_speech_1",
  "output_format": "wav",
  "voice_settings": {
    "pitch_shift": 0,
    "pitch_variance": 1,
    "speed": 1,
    "duration": 0,
    "similarity": 3,
    "text_guidance": 1,
    "subharmonic_amplitude_control": 1
  },
  "include_phonemes": false,
  "normalized_text": "<string>"
}
'
Generates speech from text and returns the audio in the response body. For the conceptual walkthrough, SDK examples, and tips, see Docs: Create speech.

Endpoint

POST https://supertoneapi.com/v1/text-to-speech/{voice_id}

Path parameters

Name | Required | Description
voice_id | Yes | The ID of the target voice.

Request body

Name | Required | Description
text | Yes | The text to convert. Maximum 300 characters. For longer input, use an SDK or split the text client-side.
language | Yes | Language code (e.g. en, ko, ja). Must be supported by both the voice and the model.
style | No | Emotional style (e.g. neutral, happy). If omitted, the voice’s default style is used.
model | No | TTS model. Defaults to sona_speech_1.
output_format | No | wav (default) or mp3.
voice_settings | No | Advanced voice parameters (see Voice settings).
include_phonemes | No | If true, the response switches to JSON with base64 audio plus phoneme timing data. Default: false.
normalized_text | No | Pronunciation-normalized companion text (used by sona_speech_2 and sona_speech_2_flash, primarily for Japanese).
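
As a rough illustration of how the request fields fit together, the sketch below builds the POST request with Python's standard library without sending it. The helper name `build_tts_request` is hypothetical and not part of any Supertone SDK, and the length check assumes the 300-character limit counts Unicode code points the way Python's `len` does.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "https://supertoneapi.com/v1/text-to-speech"
MAX_TEXT_LENGTH = 300  # documented limit for `text`


def build_tts_request(api_key, voice_id, text, language, **optional):
    """Build (but do not send) a POST request for the endpoint.

    Hypothetical helper for illustration only.
    """
    # Assumption: the server counts characters like Python's len().
    if len(text) > MAX_TEXT_LENGTH:
        raise ValueError(
            f"text is {len(text)} characters; the limit is {MAX_TEXT_LENGTH}"
        )
    payload = {"text": text, "language": language}
    # Optional fields (style, model, output_format, voice_settings,
    # include_phonemes, normalized_text) are sent only when given, so the
    # documented defaults apply otherwise.
    payload.update({k: v for k, v in optional.items() if v is not None})
    return urllib.request.Request(
        url=f"{BASE_URL}/{urllib.parse.quote(voice_id, safe='')}",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "x-sup-api-key": api_key,
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would send the request; the response body then holds the audio (or JSON when `include_phonemes` is true).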

Supported languages by model

Model | Languages
sona_speech_2, sona_speech_2_flash | en, ko, ja, bg, cs, da, el, es, et, fi, hu, it, nl, pl, pt, ro, ar, de, fr, hi, id, ru, vi
supertonic_api_3 | en, ko, ja, ar, bg, cs, da, de, el, es, et, fi, fr, hi, hr, hu, id, it, lt, lv, nl, pl, pt, ro, ru, sk, sl, sv, tr, uk, vi
supertonic_api_1 | en, ko, ja, es, pt
sona_speech_1 | en, ko, ja
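
The table above can be mirrored as a client-side lookup to catch an unsupported model and language pair before sending a request. This is a minimal sketch: the names `SUPPORTED_LANGUAGES` and `model_supports` are hypothetical, the data is copied from the table and would need updating if the service changes, and it does not check whether the chosen voice supports the language.

```python
_SONA_2 = frozenset(
    "en ko ja bg cs da el es et fi hu it nl pl pt ro ar de fr hi id ru vi".split()
)

# Copied from the "Supported languages by model" table.
SUPPORTED_LANGUAGES = {
    "sona_speech_2": _SONA_2,
    "sona_speech_2_flash": _SONA_2,
    "supertonic_api_3": frozenset(
        "en ko ja ar bg cs da de el es et fi fr hi hr hu id it lt lv "
        "nl pl pt ro ru sk sl sv tr uk vi".split()
    ),
    "supertonic_api_1": frozenset("en ko ja es pt".split()),
    "sona_speech_1": frozenset("en ko ja".split()),
}


def model_supports(model, language):
    """Return True if the table lists `language` for `model`."""
    return language in SUPPORTED_LANGUAGES.get(model, frozenset())
```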

Voice settings

Unsupported settings are silently ignored — they don’t error.
Name | Range | Default | Description
pitch_shift | -24 → 24 | 0 | Pitch shift in semitones.
pitch_variance | 0 → 2 | 1 | Degree of pitch variation.
speed | 0.5 → 2 | 1 | Playback rate multiplier. Applied after duration.
duration | 0 → 60 | 0 | When non-zero, generates audio targeting this length in seconds.
similarity | 1 → 5 | 3 | How closely the output matches the original character voice.
text_guidance | 0 → 4 | 1 | How sensitively delivery adapts to the text content.
subharmonic_amplitude_control | 0 → 2 | 1 | Subharmonic amplitude in the generated speech.
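
Because unsupported settings are silently ignored, a misspelled key would go unnoticed, so a client-side check can be useful. The sketch below validates a `voice_settings` dict against the ranges in the table. The helper name is hypothetical, and treating both range bounds as inclusive is an assumption; the page does not say how the server handles out-of-range values.

```python
# Ranges copied from the "Voice settings" table.
# Assumption: both bounds are inclusive.
VOICE_SETTING_RANGES = {
    "pitch_shift": (-24, 24),
    "pitch_variance": (0, 2),
    "speed": (0.5, 2),
    "duration": (0, 60),
    "similarity": (1, 5),
    "text_guidance": (0, 4),
    "subharmonic_amplitude_control": (0, 2),
}


def validate_voice_settings(settings):
    """Return a list of problems found in `settings` (empty if none)."""
    problems = []
    for name, value in settings.items():
        if name not in VOICE_SETTING_RANGES:
            problems.append(f"unknown setting: {name}")
            continue
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            problems.append(f"{name} must be a number")
            continue
        low, high = VOICE_SETTING_RANGES[name]
        if not low <= value <= high:
            problems.append(f"{name}={value} is outside {low}..{high}")
    return problems
```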

Voice settings by model

Setting | sona_speech_2 | sona_speech_2_flash | supertonic_api_3 | supertonic_api_1 | sona_speech_1
pitch_shift, pitch_variance, duration
speed
similarity, text_guidance
subharmonic_amplitude_control

Response

Default (include_phonemes=false): Binary audio in the body.
  • Content-Type: audio/wav or audio/mpeg (matches output_format).
  • X-Audio-Length header: duration of the generated audio in seconds.
When include_phonemes=true: JSON body with base64 audio plus phoneme arrays.
{
  "audio_base64": "UklGRnoGAABXQVZF...",
  "phonemes": {
    "symbols": ["", "h", "ɐ", "ɡ", "ʌ", ""],
    "start_times_seconds": [0, 0.092, 0.197, 0.255, 0.29, 0.58],
    "durations_seconds": [0.092, 0.104, 0.058, 0.034, 0.29, 0.162]
  }
}
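
A client has to handle both response shapes. The following sketch picks between them based on the Content-Type header and, for the JSON form, decodes the audio and zips the three phoneme arrays into one timeline. The function name is hypothetical, and it assumes the JSON variant is served as `application/json`, which this page does not state explicitly.

```python
import base64
import json


def parse_tts_response(content_type, body):
    """Return (audio_bytes, phonemes).

    `phonemes` is a list of (symbol, start_seconds, duration_seconds)
    tuples, or None for a binary audio response.
    """
    media_type = content_type.split(";")[0].strip().lower()
    # Assumption: the include_phonemes=true response uses application/json.
    if media_type == "application/json":
        data = json.loads(body)
        ph = data["phonemes"]
        timeline = list(
            zip(ph["symbols"], ph["start_times_seconds"], ph["durations_seconds"])
        )
        return base64.b64decode(data["audio_base64"]), timeline
    # Otherwise the body is the audio itself (audio/wav or audio/mpeg).
    return body, None
```

For the binary form, the audio length can be read separately from the `X-Audio-Length` response header.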

Notes

  • text over 300 characters returns 400. Use the Python or TypeScript SDK for automatic chunking, or split manually — see Long text.
  • speed applies after duration. Setting duration=5 with speed=2 produces ~10 seconds of audio.
  • When style is omitted, the first value in the voice’s styles array is used. Different voices can have different defaults — call Get voice to check.
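
For manual splitting, one possible approach is to pack whole sentences into chunks of at most 300 characters and fall back to word boundaries for overlong sentences. The sketch below is an illustration only, not the SDKs' chunking algorithm: the function names are hypothetical, whitespace between chunks is normalized to single spaces, and text without spaces (common in Japanese) is hard-cut at the limit.

```python
import re


def _split_long(sentence, limit):
    """Break one sentence into pieces of at most `limit` characters."""
    pieces, current = [], ""
    for word in sentence.split():
        # A single word longer than the limit is hard-cut.
        while len(word) > limit:
            if current:
                pieces.append(current)
                current = ""
            pieces.append(word[:limit])
            word = word[limit:]
        candidate = f"{current} {word}".strip()
        if len(candidate) <= limit:
            current = candidate
        else:
            pieces.append(current)
            current = word
    if current:
        pieces.append(current)
    return pieces


def split_text(text, limit=300):
    """Split `text` into chunks of at most `limit` characters."""
    chunks, current = [], ""
    # Split after sentence-ending punctuation that is followed by whitespace.
    for sentence in re.split(r"(?<=[.!?。！？])\s+", text.strip()):
        for piece in _split_long(sentence, limit):
            candidate = f"{current} {piece}".strip()
            if len(candidate) <= limit:
                current = candidate
            else:
                if current:
                    chunks.append(current)
                current = piece
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be sent as its own request and the resulting audio clips concatenated in order.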

See also

  • Docs: Create speech. Walkthrough with SDK examples.
  • Stream speech. Stream audio chunks instead of waiting for the full clip.

Authorizations

x-sup-api-key (string, header, required)

Path Parameters

voice_id (string, required)

Body

application/json

text (string, required)

The text to convert to speech. Maximum string length: 300.

language (enum<string>, required)

The language code of the text.

Available options: en, ko, ja, bg, cs, da, el, es, et, fi, hu, it, nl, pl, pt, ro, ar, de, fr, hi, id, ru, vi, hr, lt, lv, sk, sl, sv, tr, uk

style (string)

The style of character to use for the text-to-speech conversion.

model (enum<string>, default: sona_speech_1)

The model type to use for the text-to-speech conversion.

Available options: sona_speech_1, sona_speech_2, sona_speech_2_flash, supertonic_api_1, supertonic_api_3

output_format (enum<string>, default: wav)

The desired output format of the audio file.

Available options: wav, mp3

voice_settings (object)

include_phonemes (boolean, default: false)

Return phoneme timing data with the audio.

normalized_text (string)

Pre-normalized text for TTS. Only used with sona_speech_2 and sona_speech_2_flash models.

Response

Returns either binary audio or JSON with phoneme data, depending on the include_phonemes parameter.

Binary audio file (when include_phonemes=false or omitted)