create_speech converts text into a finished audio file. The full audio is returned in the response body, ready to save or play. If you need to stream audio chunks as they are synthesized — for example, to start playback before generation finishes — see Stream speech instead.

Basic usage

import os
from supertone import Supertone

VOICE_ID = "20160a4c5ba38967330c84"  # replace with your voice ID

with Supertone(api_key=os.environ["SUPERTONE_API_KEY"]) as client:
    response = client.text_to_speech.create_speech(
        voice_id=VOICE_ID,
        text="Hello from Supertone.",
        language="en",
        model="sona_speech_1",
        output_format="wav",
    )

    with open("speech.wav", "wb") as f:
        f.write(response.result.read())

Request fields

| Field | Required | Description |
| --- | --- | --- |
| voice_id | Yes | Path parameter; identifies the character. |
| text | Yes | The text to synthesize. Max 300 characters per API call (SDKs auto-chunk longer text). |
| language | Yes | Language code. Must be supported by the voice and the model. |
| style | No | Emotional style (e.g. neutral, happy). Defaults to the first style in the voice’s styles array. |
| model | No | TTS model. Defaults to sona_speech_1. See Models. |
| output_format | No | wav (default) or mp3. See Output formats below. |
| voice_settings | No | Pitch, intonation, speed, etc. See Voice settings. |
| include_phonemes | No | When true, returns phoneme symbols and timestamps. See Pronunciation and phonemes. |
| normalized_text | No | Pronunciation-normalized companion text (currently for Japanese on the sona_speech_2 family). See Normalized text. |

For the complete schema, see Create speech (API reference).
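
The style default described above can be mirrored on the client when you want to log or display which style a request will use. The following is a minimal sketch, assuming the voice record is available as a dict with a `styles` list (as the field description implies). `resolve_style` is a hypothetical helper, not part of the SDK.

```python
from typing import Optional


def resolve_style(voice: dict, requested: Optional[str] = None) -> str:
    """Return the style a request would use for this voice.

    Assumption: `voice` is a dict with a non-empty "styles" list whose
    first entry is the default. This helper is illustrative only.
    """
    styles = voice.get("styles") or []
    if not styles:
        raise ValueError("voice has no styles")
    if requested is None:
        # No explicit style: fall back to the first style in the array.
        return styles[0]
    if requested not in styles:
        raise ValueError(f"style {requested!r} not offered; choose from {styles}")
    return requested
```

Validating the style locally is a design choice: it surfaces a typo before a request is sent, at the cost of one voice lookup at startup.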

Output formats

| Format | output_format value | Content type | Use when |
| --- | --- | --- | --- |
| WAV (default) | wav | audio/wav | You want lossless audio: best for production audio pipelines, further processing, or game/animation assets. |
| MP3 | mp3 | audio/mpeg | You want smaller files for delivery to end-user devices and don’t need lossless quality. |

If you omit output_format, the API defaults to wav. The same option applies to Stream speech, where chunks come back as binary in the requested format.
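
Because the file extension should follow the requested format, it can help to derive the file name from output_format instead of hard-coding it. A small sketch using only the two formats listed above; `AUDIO_FORMATS` and `output_path` are hypothetical names, not SDK features.

```python
# Values copied from the Output formats section; the names are illustrative.
AUDIO_FORMATS = {
    "wav": {"extension": ".wav", "content_type": "audio/wav"},
    "mp3": {"extension": ".mp3", "content_type": "audio/mpeg"},
}


def output_path(stem: str, output_format: str = "wav") -> str:
    """Build a file name whose extension matches output_format."""
    try:
        return stem + AUDIO_FORMATS[output_format]["extension"]
    except KeyError:
        raise ValueError(f"unsupported output_format: {output_format!r}") from None
```

For example, `output_path("speech", "mp3")` gives `speech.mp3`, so the same value can be passed to both the request and `open()`.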

Response

By default the API returns binary audio in the body. The response carries two useful headers:

| Header | Meaning |
| --- | --- |
| Content-Type | audio/wav or audio/mpeg, matching output_format. |
| X-Audio-Length | Duration of the generated audio in seconds (float). |
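
How these headers are exposed depends on the client you use, and this page does not say how the SDK surfaces them. The sketch below therefore assumes only that you have the response headers as a plain name-to-value mapping. `read_audio_headers` is a hypothetical helper.

```python
from typing import Mapping, Optional, Tuple


def read_audio_headers(
    headers: Mapping[str, str],
) -> Tuple[Optional[str], Optional[float]]:
    """Return (content_type, duration_seconds) from a header mapping."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    content_type = lowered.get("content-type")
    raw_length = lowered.get("x-audio-length")
    # X-Audio-Length is documented as a float number of seconds.
    duration = float(raw_length) if raw_length is not None else None
    return content_type, duration
```

Returning `None` for a missing header, instead of raising, keeps the helper usable for the JSON response variant, where these audio headers may not apply.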

When include_phonemes=true

If you opt in to phoneme timestamps, the response switches to JSON with a base64-encoded audio payload alongside the phoneme arrays:
{
  "audio_base64": "UklGRnoGAABXQVZF...",
  "phonemes": {
    "symbols": ["", "h", "ɐ", "ɡ", "ʌ", ""],
    "start_times_seconds": [0, 0.092, 0.197, 0.255, 0.29, 0.58],
    "durations_seconds": [0.092, 0.104, 0.058, 0.034, 0.29, 0.162]
  }
}
See Pronunciation and phonemes for the full structure.
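
To use this JSON variant, decode the audio and align the three phoneme arrays by index. A minimal sketch, assuming the payload has exactly the keys shown in the sample above; `decode_phoneme_response` is a hypothetical helper operating on the already-parsed JSON dict.

```python
import base64
from typing import List, Tuple


def decode_phoneme_response(
    payload: dict,
) -> Tuple[bytes, List[Tuple[str, float, float]]]:
    """Split the payload into audio bytes and (symbol, start, end) rows."""
    audio = base64.b64decode(payload["audio_base64"])
    phonemes = payload["phonemes"]
    rows = []
    # The three arrays are parallel: entry i of each describes phoneme i.
    for symbol, start, duration in zip(
        phonemes["symbols"],
        phonemes["start_times_seconds"],
        phonemes["durations_seconds"],
    ):
        # Round the end time to milliseconds to hide float addition noise.
        rows.append((symbol, start, round(start + duration, 3)))
    return audio, rows
```

The returned bytes can be written to disk exactly as in the binary case, and the rows can drive captions or lip-sync timing. Empty symbols, as seen at the start and end of the sample, are kept as-is.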

Save the result

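# `client` is the Supertone client from Basic usage; pass the same arguments.
response = client.text_to_speech.create_speech(...)
# Write the binary audio to disk; match the extension to output_format.
with open("speech.wav", "wb") as f:
    f.write(response.result.read())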

Tips

  • Style matters. Different voices may have different default styles. Either explicitly set style, or call Get voice once at startup to read the voice’s default.
  • Estimate before you generate. predict_duration returns the expected audio length without consuming credits — useful for UI hints and cost forecasting.
  • Long text. The raw API caps text at 300 characters. The Python and TypeScript SDKs split, generate, and merge automatically — see Long text.
  • Empty or very short input can produce unnatural results. Aim for at least a complete short sentence.
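
If you call the raw API without an SDK, the 300-character cap means long text has to be split before sending. The sketch below shows one possible approach, not the SDKs' actual algorithm: it packs whole sentences into chunks and hard-cuts any single sentence that exceeds the limit. `split_text` is a hypothetical helper, and the sentence pattern only recognizes `.`, `!`, and `?` followed by whitespace, so other scripts would need a different pattern.

```python
import re
from typing import List

MAX_CHARS = 300  # per-call limit stated in the Long text tip


def split_text(text: str, limit: int = MAX_CHARS) -> List[str]:
    """Split text into chunks of at most `limit` characters.

    Prefers sentence boundaries; a sentence longer than `limit` is cut
    at the limit, which may fall mid-word.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        # Hard-cut sentences that cannot fit in one request on their own.
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be sent as its own create_speech call. Joining the resulting audio is a separate step that this sketch does not cover.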

Related pages

  • Pick a model: choose between fast and high-quality TTS models.
  • Long text: generate audio from text longer than 300 characters.
  • Voice settings: tune pitch, intonation, and speed.
  • API reference: full request and response schema.