Stream speech

Convert text to speech with a streaming response.

curl --request POST \
  --url https://supertoneapi.com/v1/text-to-speech/{voice_id}/stream \
  --header 'Content-Type: application/json' \
  --header 'x-sup-api-key: <api-key>' \
  --data '{
  "text": "<string>",
  "language": "en",
  "style": "<string>",
  "model": "sona_speech_1",
  "output_format": "wav",
  "voice_settings": {
    "pitch_shift": 0,
    "pitch_variance": 1,
    "speed": 1,
    "duration": 0,
    "similarity": 3,
    "text_guidance": 1,
    "subharmonic_amplitude_control": 1
  },
  "include_phonemes": false
}'

Endpoint

https://supertoneapi.com/v1/text-to-speech/{voice_id}/stream

Path Parameters

Name	Required	Description
`voice_id`	Yes	The ID of the target voice.

Request Body

Content-Type: application/json

Name	Required	Description
`text`	Yes	The text to convert (max 300 characters).
`language`	Yes	Language code. Supported: `en`, `ko`, `ja`.
`style`	No	Emotional style, e.g. `neutral`, `happy`, `sad`. If not specified, the character’s default style is applied.
`model`	No	TTS model. Default: `sona_speech_1`.
`output_format`	No	Output format. Options: `wav`, `mp3`. Default: `wav`.
`voice_settings`	No	Advanced voice parameters (see below).
`include_phonemes`	No	If `true`, returns phoneme timing data along with audio (Base64-encoded). Default: `false`.
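
As a rough illustration of how the fields above fit together, the following Python sketch builds the POST request with the standard library and writes the streamed bytes to a file as they arrive. The helper names (`build_stream_request`, `stream_to_file`), the chunk size, and the output path are assumptions made for this example and are not part of the API; only the URL, the header names, and the body fields come from this page.

```python
import json
import urllib.request
from urllib.parse import quote

BASE_URL = "https://supertoneapi.com/v1/text-to-speech"


def build_stream_request(voice_id, api_key, text, language, **optional):
    """Build (but do not send) the POST request for the stream endpoint.

    `optional` may carry style, model, output_format, voice_settings,
    or include_phonemes; fields left out are simply not sent.
    """
    payload = {"text": text, "language": language, **optional}
    return urllib.request.Request(
        url=f"{BASE_URL}/{quote(voice_id, safe='')}/stream",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "x-sup-api-key": api_key,
        },
        method="POST",
    )


def stream_to_file(request, path, chunk_size=4096):
    """Send the request and write the body to `path` chunk by chunk.

    Returns the number of bytes written. The chunk size is an arbitrary
    choice for this sketch.
    """
    total = 0
    with urllib.request.urlopen(request) as response, open(path, "wb") as out:
        while True:
            chunk = response.read(chunk_size)
            if not chunk:
                break
            out.write(chunk)
            total += len(chunk)
    return total
```

A call might look like `stream_to_file(build_stream_request("<voice_id>", "<api-key>", "Hello", "en", output_format="mp3"), "speech.mp3")`. This sketch targets the binary audio case; with `include_phonemes` set to `true` the body is NDJSON rather than raw audio and would need to be parsed line by line instead.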

Voice Settings (optional)

Name	Range	Default	Description
`pitch_shift`	-24 → 24	0	Pitch adjustment in semitones.
`pitch_variance`	0 → 2	1	Degree of pitch variation.
`speed`	0.5 → 2	1	Speed ratio. Makes the generated audio uniformly faster or slower.
`duration`	0 → 60	0	When provided, speech is generated to match the given duration (seconds).
`similarity`	1 → 5	3	Controls how closely the generated speech matches the original character voice.
`text_guidance`	0 → 4	1	Controls how sensitively speech characteristics adapt to the input text content.
`subharmonic_amplitude_control`	0 → 2	1	Controls the amount of subharmonic amplitude of the generated speech.
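
The ranges in the table above can be checked on the client before a request is sent. The sketch below is one possible way to do that; `VOICE_SETTING_RANGES` and `validate_voice_settings` are hypothetical names, and treating both ends of each range as inclusive is an assumption, since the table does not say whether the bounds are inclusive.

```python
# Ranges and defaults copied from the Voice Settings table.
# Each entry is (minimum, maximum, default).
VOICE_SETTING_RANGES = {
    "pitch_shift": (-24, 24, 0),
    "pitch_variance": (0, 2, 1),
    "speed": (0.5, 2, 1),
    "duration": (0, 60, 0),
    "similarity": (1, 5, 3),
    "text_guidance": (0, 4, 1),
    "subharmonic_amplitude_control": (0, 2, 1),
}


def validate_voice_settings(settings):
    """Return a checked copy of `settings`.

    Raises ValueError for an unknown name or a value outside its range.
    Bounds are treated as inclusive (an assumption of this sketch).
    """
    checked = {}
    for name, value in settings.items():
        if name not in VOICE_SETTING_RANGES:
            raise ValueError(f"unknown voice setting: {name}")
        low, high, _default = VOICE_SETTING_RANGES[name]
        if not low <= value <= high:
            raise ValueError(f"{name}={value} is outside {low}..{high}")
        checked[name] = value
    return checked
```

The returned dictionary can be passed as the `voice_settings` object in the request body; settings that are left out are not sent, so the defaults listed in the table apply.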

Response

Depending on `include_phonemes`, the endpoint returns one of two stream types.

Audio stream (default, and when `include_phonemes=false`)
`audio/wav`: binary audio stream.
`audio/mpeg`: binary audio stream.

NDJSON stream with phoneme data (when `include_phonemes=true`)
Streamed as newline-delimited JSON:

{
  "audio_base64": "UklGRnoGAABXQVZF...",
  "phonemes": {
    "symbols": ["", "h", "ɐ", "ɡ", "ʌ", ""],
    "start_times_seconds": [0, 0.092, 0.197, 0.255, 0.29, 0.58],
    "durations_seconds": [0.092, 0.104, 0.058, 0.034, 0.29, 0.162]
  }
}
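
A client consuming the NDJSON variant might process it one line at a time, roughly as sketched below. The function name `iter_ndjson` is hypothetical, and the sketch assumes every line carries the `audio_base64` and `phonemes` fields shown in the example above. Whether the stream contains one record or several, and how the audio of successive records should be combined, is not stated on this page.

```python
import base64
import json


def iter_ndjson(lines):
    """Yield (audio_bytes, timings) for each non-empty NDJSON line.

    `lines` is any iterable of str or bytes lines. `timings` is a list of
    (symbol, start_seconds, end_seconds) tuples built from the three
    parallel phoneme arrays.
    """
    for raw in lines:
        if isinstance(raw, bytes):
            raw = raw.decode("utf-8")
        raw = raw.strip()
        if not raw:
            continue  # skip blank lines between records
        record = json.loads(raw)
        audio = base64.b64decode(record["audio_base64"])
        phonemes = record.get("phonemes") or {}
        timings = [
            (symbol, start, start + duration)
            for symbol, start, duration in zip(
                phonemes.get("symbols", []),
                phonemes.get("start_times_seconds", []),
                phonemes.get("durations_seconds", []),
            )
        ]
        yield audio, timings
```

Because the response object returned by `urllib.request.urlopen` can be iterated line by line, it could be passed to `iter_ndjson` directly.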

Notes

A 400 error occurs if the text length exceeds 300 characters.
`speed` is applied after `duration`. (Example: `duration` = 5 seconds, `speed` = 2 → final audio ≈ 10 seconds.)
The API can be called without specifying `style`, but the default style may vary by character. Use the Get Voices API to check the default (the first value in the `style` array is the default).
The returned audio can be saved to a file or played directly. (Appropriate handling may be required depending on the client.)
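
Since longer input is rejected with a 400 error, a client may want to split text before sending it. The sketch below shows one simple approach; `split_text` is a hypothetical helper, and it assumes the limit is counted in the same units as Python's `len()` on a string, which this page does not confirm. It breaks at whitespace where it can and collapses runs of whitespace, so text without spaces (as is common in Japanese) is cut at the limit instead.

```python
MAX_TEXT_LENGTH = 300  # limit stated in the notes above


def split_text(text, limit=MAX_TEXT_LENGTH):
    """Split `text` into pieces of at most `limit` characters.

    Breaks at whitespace where possible; a single word longer than
    `limit` is cut hard at the limit.
    """
    chunks = []
    current = ""
    for word in text.split():
        while len(word) > limit:
            # Flush the pending chunk, then hard-cut the oversized word.
            if current:
                chunks.append(current)
                current = ""
            chunks.append(word[:limit])
            word = word[limit:]
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks
```

Each returned piece can then be sent as the `text` of a separate request. Splitting at sentence boundaries would likely sound more natural; that refinement is left out to keep the sketch short.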

Authorizations

Name	Type	In	Required
`x-sup-api-key`	string	header	Yes

Path Parameters

Name	Type	Required
`voice_id`	string	Yes

Body

Content-Type: application/json

Name	Type	Required	Description
`text`	string	Yes	The text to convert to speech. Maximum length: 300.
`language`	enum<string>	Yes	The language code of the text. Available options: `en`, `ko`, `ja`.
`style`	string	No	The style of character to use for the text-to-speech conversion.
`model`	string	No	The model type to use for the text-to-speech conversion. Default: `sona_speech_1`.
`output_format`	enum<string>	No	The desired output format of the audio file. Available options: `wav`, `mp3`. Default: `wav`.
`voice_settings`	object	No	Advanced voice parameters (see Voice Settings).
`include_phonemes`	boolean	No	Return phoneme timing data with the audio. Default: `false`.

Response

Streaming audio data in binary format, or in NDJSON format with phoneme data, depending on the `include_phonemes` parameter.

Binary audio stream (when `include_phonemes=false` or omitted)
