POST /v1/text-to-speech/{voice_id}/stream
Convert text to speech with streaming response
curl --request POST \
  --url https://supertoneapi.com/v1/text-to-speech/{voice_id}/stream \
  --header 'Content-Type: application/json' \
  --header 'x-sup-api-key: <api-key>' \
  --data '{
  "text": "<string>",
  "language": "en",
  "style": "<string>",
  "model": "sona_speech_1",
  "output_format": "wav",
  "voice_settings": {
    "pitch_shift": 0,
    "pitch_variance": 1,
    "speed": 1,
    "duration": 0,
    "similarity": 3,
    "text_guidance": 1,
    "subharmonic_amplitude_control": 1
  },
  "include_phonemes": false
}'
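
The same request can be sketched in Python using only the standard library. This is a minimal sketch, not an official client: the helper names build_stream_request and stream_to_file, the API_BASE constant, and the 4096-byte chunk size are illustrative assumptions; only the URL, headers, and body fields come from the curl example above.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://supertoneapi.com/v1"  # base URL taken from the curl example


def build_stream_request(voice_id: str, api_key: str, text: str,
                         language: str = "en",
                         output_format: str = "wav") -> urllib.request.Request:
    """Build (but do not send) the POST request from the curl example."""
    payload = {
        "text": text,
        "language": language,
        "output_format": output_format,
    }
    # Escape the voice ID so it is safe to place in the URL path.
    safe_voice_id = urllib.parse.quote(voice_id, safe="")
    return urllib.request.Request(
        url=f"{API_BASE}/text-to-speech/{safe_voice_id}/stream",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "x-sup-api-key": api_key,
        },
        method="POST",
    )


def stream_to_file(request: urllib.request.Request, path: str,
                   chunk_size: int = 4096) -> int:
    """Send the request and write audio bytes to `path` as they arrive.

    Returns the number of bytes written. urlopen raises
    urllib.error.HTTPError for 4xx/5xx responses.
    """
    written = 0
    with urllib.request.urlopen(request) as response, open(path, "wb") as out:
        while True:
            chunk = response.read(chunk_size)
            if not chunk:
                break
            out.write(chunk)
            written += len(chunk)
    return written
```

A call such as stream_to_file(build_stream_request("<voice_id>", "<api-key>", "Hello"), "out.wav") would then save the streamed audio; it performs real network I/O and needs a valid key and voice ID.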

Endpoint

https://supertoneapi.com/v1/text-to-speech/{voice_id}/stream

Path Parameters

Name      Required  Description
voice_id  Yes       The ID of the target voice.

Request Body

Content-Type: application/json
Name              Required  Description
text              Yes       The text to convert (max 300 characters).
language          Yes       Language code. Supported: en, ko, ja.
style             No        Emotional style, e.g. neutral, happy, sad. If not specified, the character’s default style is applied.
model             No        TTS model. Default: sona_speech_1.
output_format     No        Output format. Options: wav, mp3. Default: wav.
voice_settings    No        Advanced voice parameters (see below).
include_phonemes  No        If true, returns phoneme timing data along with Base64-encoded audio. Default: false.
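
The constraints in this table can be checked on the client before sending, which avoids a round trip for requests the API would reject. The following is a minimal sketch: build_request_body and its error messages are hypothetical, and counting characters with Python's len() is an assumption about how the API measures the 300-character limit.

```python
SUPPORTED_LANGUAGES = {"en", "ko", "ja"}
SUPPORTED_FORMATS = {"wav", "mp3"}
MAX_TEXT_LENGTH = 300  # documented limit; len() counts code points (assumption)


def build_request_body(text, language, style=None, model="sona_speech_1",
                       output_format="wav", voice_settings=None,
                       include_phonemes=False):
    """Return a JSON-serializable request body, validating documented limits."""
    if not text:
        raise ValueError("text is required")
    if len(text) > MAX_TEXT_LENGTH:
        raise ValueError(f"text exceeds {MAX_TEXT_LENGTH} characters")
    if language not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language: {language!r}")
    if output_format not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported output_format: {output_format!r}")

    body = {
        "text": text,
        "language": language,
        "model": model,
        "output_format": output_format,
        "include_phonemes": include_phonemes,
    }
    # Optional fields are omitted so the server applies its own defaults.
    if style is not None:
        body["style"] = style
    if voice_settings:
        body["voice_settings"] = dict(voice_settings)
    return body
```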

Voice Settings (optional)

Name                           Range     Default  Description
pitch_shift                    -24 → 24  0        Pitch adjustment in semitones.
pitch_variance                 0 → 2     1        Degree of pitch variation.
speed                          0.5 → 2   1        Adjusts the generated audio uniformly faster or slower (ratio).
duration                       0 → 60    0        When provided, speech is generated to match the given duration (seconds).
similarity                     1 → 5     3        Controls how closely the generated speech matches the original character voice.
text_guidance                  0 → 4     1        Controls how sensitively speech characteristics adapt to the input text content.
subharmonic_amplitude_control  0 → 2     1        Controls the amount of subharmonic amplitude of the generated speech.
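
The ranges and defaults above lend themselves to a small lookup table. In this sketch, make_voice_settings is a hypothetical helper, and treating both ends of each range as inclusive is an assumption, since the table does not say.

```python
# (minimum, maximum, default) for each documented voice setting.
VOICE_SETTING_RANGES = {
    "pitch_shift": (-24, 24, 0),
    "pitch_variance": (0, 2, 1),
    "speed": (0.5, 2, 1),
    "duration": (0, 60, 0),
    "similarity": (1, 5, 3),
    "text_guidance": (0, 4, 1),
    "subharmonic_amplitude_control": (0, 2, 1),
}


def make_voice_settings(**overrides):
    """Return the documented defaults with validated overrides applied."""
    settings = {name: default
                for name, (_, _, default) in VOICE_SETTING_RANGES.items()}
    for name, value in overrides.items():
        if name not in VOICE_SETTING_RANGES:
            raise KeyError(f"unknown voice setting: {name}")
        low, high, _ = VOICE_SETTING_RANGES[name]
        # Bounds are assumed inclusive.
        if not low <= value <= high:
            raise ValueError(f"{name}={value} is outside {low}..{high}")
        settings[name] = value
    return settings
```

For example, make_voice_settings(speed=1.5, pitch_shift=-2) returns a dictionary that can be passed as voice_settings in the request body.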

Response

The response depends on include_phonemes.

Audio stream (default, and when include_phonemes=false)
  • audio/wav: binary audio stream.
  • audio/mpeg: binary audio stream.

NDJSON stream with phoneme data (when include_phonemes=true)
Streamed as newline-delimited JSON, one object per line (shown here pretty-printed for readability):
{
  "audio_base64": "UklGRnoGAABXQVZF...",
  "phonemes": {
    "symbols": ["", "h", "ɐ", "ɡ", "ʌ", ""],
    "start_times_seconds": [0, 0.092, 0.197, 0.255, 0.29, 0.58],
    "durations_seconds": [0.092, 0.104, 0.058, 0.034, 0.29, 0.162]
  }
}
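
A client could consume this stream line by line, decoding the audio and pairing each phoneme symbol with its timing. This is a sketch under assumptions: the Phoneme type and parse_ndjson_lines are hypothetical names, each line is assumed to be one complete JSON object with the fields shown above, and whether each audio_base64 value is a standalone file or a fragment to be concatenated is not stated in the source.

```python
import base64
import json
from typing import Iterable, Iterator, List, NamedTuple, Tuple


class Phoneme(NamedTuple):
    symbol: str
    start: float     # seconds, from start_times_seconds
    duration: float  # seconds, from durations_seconds


def parse_ndjson_lines(
    lines: Iterable[bytes],
) -> Iterator[Tuple[bytes, List[Phoneme]]]:
    """Yield (audio_bytes, phonemes) for each non-empty NDJSON line."""
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        audio = base64.b64decode(record["audio_base64"])
        timing = record.get("phonemes") or {}
        # The three arrays are parallel; zip pairs them index by index.
        phonemes = [
            Phoneme(symbol, start, duration)
            for symbol, start, duration in zip(
                timing.get("symbols", []),
                timing.get("start_times_seconds", []),
                timing.get("durations_seconds", []),
            )
        ]
        yield audio, phonemes
```

An HTTP response object opened with urllib.request.urlopen is iterable line by line, so it can be passed directly as lines.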

Notes

  • The API returns a 400 error if the text exceeds 300 characters.
  • speed is applied after duration. (Example: duration = 5 seconds, speed = 2 → final audio ≈ 10 seconds.)
  • The API can be called without specifying style, but the default style may vary by character.
    Please use the Get Voices API to check the default (the first value in the style array is the default).
  • The returned audio file can be saved or played directly. (Appropriate handling may be required depending on the client.)
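
Because of the 300-character limit, longer input has to be split across several requests. One possible approach, sketched here with the standard textwrap module, breaks at whitespace where it can and hard-splits otherwise (for example, Japanese text without spaces). The split_text helper is hypothetical, and splitting on sentence boundaries instead may give more natural prosody.

```python
import textwrap


def split_text(text: str, max_chars: int = 300) -> list:
    """Split text into chunks of at most max_chars characters.

    Breaks at whitespace when possible; a run longer than max_chars
    with no whitespace is cut at the limit. Returns [] for empty text.
    """
    return textwrap.wrap(
        text,
        width=max_chars,
        break_long_words=True,
        break_on_hyphens=False,
    )
```

Each returned chunk can then be sent as the text field of its own request.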

Authorizations

x-sup-api-key (string, header, required)

Path Parameters

voice_id (string, required)

Body

application/json

text (string, required)
The text to convert to speech. Maximum length: 300.

language (enum<string>, required)
The language code of the text. Available options: en, ko, ja.

style (string)
The style of character to use for the text-to-speech conversion.

model (string, default: sona_speech_1)
The model type to use for the text-to-speech conversion.

output_format (enum<string>, default: wav)
The desired output format of the audio file. Available options: wav, mp3.

voice_settings (object)

include_phonemes (boolean, default: false)
Return phoneme timing data with the audio.

Response

Streaming audio data in binary format, or in NDJSON format with phoneme data, depending on the include_phonemes parameter.

Binary audio stream (when include_phonemes is false or omitted)