Create speech

`create_speech` converts text into a finished audio file. The full audio is returned in the response body, ready to save or play. To stream audio chunks as they are synthesized (for example, to start playback before generation finishes), see Stream speech instead.

Basic usage

The same request in Python, TypeScript, and cURL, in that order:

import os
from supertone import Supertone

VOICE_ID = "20160a4c5ba38967330c84"  # replace with your voice ID

with Supertone(api_key=os.environ["SUPERTONE_API_KEY"]) as client:
    response = client.text_to_speech.create_speech(
        voice_id=VOICE_ID,
        text="Hello from Supertone.",
        language="en",
        model="sona_speech_1",
        output_format="wav",
    )

    with open("speech.wav", "wb") as f:
        f.write(response.result.read())

import { Supertone } from "@supertone/supertone";
import * as fs from "node:fs";

const VOICE_ID = "20160a4c5ba38967330c84"; // replace with your voice ID

const client = new Supertone({ apiKey: process.env.SUPERTONE_API_KEY });

const response = await client.textToSpeech.createSpeech({
  voiceId: VOICE_ID,
  apiConvertTextToSpeechUsingCharacterRequest: {
    text: "Hello from Supertone.",
    language: "en",
    model: "sona_speech_1",
    outputFormat: "wav",
  },
});

if (response.result instanceof Uint8Array) {
  fs.writeFileSync("speech.wav", response.result);
} else if (response.result && "getReader" in response.result) {
  const reader = (response.result as ReadableStream<Uint8Array>).getReader();
  const chunks: Uint8Array[] = [];
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    if (value) chunks.push(value);
  }
  fs.writeFileSync("speech.wav", Buffer.concat(chunks));
}

VOICE_ID="20160a4c5ba38967330c84"  # replace with your voice ID

curl -X POST "https://supertoneapi.com/v1/text-to-speech/$VOICE_ID" \
  -H "x-sup-api-key: $SUPERTONE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Hello from Supertone.",
    "language": "en",
    "model": "sona_speech_1",
    "output_format": "wav"
  }' \
  --output speech.wav

Request fields

| Field | Required | Description |
| --- | --- | --- |
| `voice_id` | Yes | Path parameter; identifies the character. |
| `text` | Yes | The text to synthesize. Max 300 characters per API call (SDKs auto-chunk longer text). |
| `language` | Yes | Language code. Must be supported by both the voice and the model. |
| `style` | No | Emotional style (e.g. `neutral`, `happy`). Defaults to the first style in the voice’s `styles` array. |
| `model` | No | TTS model. Defaults to `sona_speech_1`. See Models. |
| `output_format` | No | `wav` (default) or `mp3`. See Output formats below. |
| `voice_settings` | No | Pitch, intonation, speed, etc. See Voice settings. |
| `include_phonemes` | No | When `true`, returns phoneme symbols and timestamps. See Pronunciation and phonemes. |
| `normalized_text` | No | Pronunciation-normalized companion text (currently for Japanese on the `sona_speech_2` model family). See Normalized text. |

For the complete schema, see Create speech (API reference).
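
For callers not using an SDK, the fields above map onto a plain HTTP request. The sketch below builds (but does not send) that request with Python's standard library, reusing the endpoint and `x-sup-api-key` header from the cURL example. The helper name `build_speech_request` and the client-side length check are illustrative assumptions, not part of the Supertone SDK.

```python
import json
import urllib.request

API_BASE = "https://supertoneapi.com/v1"  # base URL taken from the cURL example
MAX_TEXT_CHARS = 300  # per-call limit on `text` from the request fields table


def build_speech_request(voice_id, text, language, api_key, **optional):
    """Build (without sending) a POST request for the create-speech endpoint.

    Hypothetical helper: `optional` carries fields such as `style`, `model`,
    or `output_format`. Entries set to None are omitted so the API default applies.
    """
    # Assumption: the limit is counted in Python characters (code points).
    if len(text) > MAX_TEXT_CHARS:
        raise ValueError(
            f"text is {len(text)} characters; the raw API accepts at most {MAX_TEXT_CHARS}"
        )
    body = {"text": text, "language": language}
    body.update({key: value for key, value in optional.items() if value is not None})
    return urllib.request.Request(
        f"{API_BASE}/text-to-speech/{voice_id}",
        data=json.dumps(body).encode("utf-8"),
        headers={"x-sup-api-key": api_key, "Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would send it. Omitted optional fields fall back to the defaults listed in the table.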

Output formats

| Format | `output_format` value | Content type | Use when |
| --- | --- | --- | --- |
| WAV (default) | `wav` | `audio/wav` | You want lossless audio; best for production audio pipelines, further processing, or game/animation assets. |
| MP3 | `mp3` | `audio/mpeg` | You want smaller files for delivery to end-user devices and don’t need lossless quality. |

If you omit `output_format`, the API defaults to `wav`. The same option applies to Stream speech, where chunks come back as binary in the requested format.

Response

By default the API returns binary audio in the body. The response carries two useful headers:

| Header | Meaning |
| --- | --- |
| `Content-Type` | `audio/wav` or `audio/mpeg`, matching `output_format`. |
| `X-Audio-Length` | Duration of the generated audio in seconds (float). |
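
When working with the raw HTTP response, these two headers are enough to pick a file extension and report the clip length. A minimal sketch, assuming the headers arrive as a plain name-to-value mapping; `describe_audio_response` is a hypothetical helper, not an SDK function.

```python
# Content types listed in the Output formats table, mapped to file extensions.
CONTENT_TYPE_TO_EXTENSION = {"audio/wav": ".wav", "audio/mpeg": ".mp3"}


def describe_audio_response(headers):
    """Return (file_extension, duration_seconds) from response headers.

    `duration_seconds` is None when X-Audio-Length is absent.
    """
    # HTTP header names are case-insensitive, so normalize before lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    # Ignore any parameters after ';' in the Content-Type value.
    content_type = normalized.get("content-type", "").split(";")[0].strip().lower()
    extension = CONTENT_TYPE_TO_EXTENSION.get(content_type)
    if extension is None:
        raise ValueError(f"unexpected content type: {content_type!r}")
    raw_length = normalized.get("x-audio-length")
    duration = float(raw_length) if raw_length is not None else None
    return extension, duration
```

A JSON response (as returned when `include_phonemes` is `true`) raises `ValueError` here, which makes it easy to route that case to separate handling.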

When `include_phonemes=true`

If you opt in to phoneme timestamps, the response switches to JSON with a base64-encoded audio payload alongside the phoneme arrays:

{
  "audio_base64": "UklGRnoGAABXQVZF...",
  "phonemes": {
    "symbols": ["", "h", "ɐ", "ɡ", "ʌ", ""],
    "start_times_seconds": [0, 0.092, 0.197, 0.255, 0.29, 0.58],
    "durations_seconds": [0.092, 0.104, 0.058, 0.034, 0.29, 0.162]
  }
}

See Pronunciation and phonemes for the full structure.
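
To make the JSON shape concrete, the sketch below decodes the audio and pairs each phoneme symbol with its start and end time. It assumes the three arrays are index-aligned, as in the sample above; `Phoneme` and `parse_phoneme_response` are hypothetical names used only for illustration.

```python
import base64
from dataclasses import dataclass


@dataclass
class Phoneme:
    symbol: str   # may be an empty string, as in the sample response
    start: float  # seconds from the beginning of the audio
    end: float    # start + duration, in seconds


def parse_phoneme_response(payload):
    """Split the JSON payload into raw audio bytes and a phoneme timeline."""
    audio = base64.b64decode(payload["audio_base64"])
    phonemes = payload["phonemes"]
    symbols = phonemes["symbols"]
    starts = phonemes["start_times_seconds"]
    durations = phonemes["durations_seconds"]
    # Assumption: the arrays are parallel, one entry per phoneme.
    if not (len(symbols) == len(starts) == len(durations)):
        raise ValueError("phoneme arrays have different lengths")
    timeline = [
        Phoneme(symbol, start, start + duration)
        for symbol, start, duration in zip(symbols, starts, durations)
    ]
    return audio, timeline
```

The returned bytes can be written to disk exactly like the binary response, and the timeline can drive captions or lip-sync cues.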

Save the result

The snippets below use the Python and TypeScript SDKs, in that order:

response = client.text_to_speech.create_speech(...)
with open("speech.wav", "wb") as f:
    f.write(response.result.read())

import * as fs from "node:fs";

const response = await client.textToSpeech.createSpeech(...);

if (response.result instanceof Uint8Array) {
  fs.writeFileSync("speech.wav", response.result);
} else if (response.result && "getReader" in response.result) {
  const reader = (response.result as ReadableStream<Uint8Array>).getReader();
  const chunks: Uint8Array[] = [];
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    if (value) chunks.push(value);
  }
  fs.writeFileSync("speech.wav", Buffer.concat(chunks));
}

Tips

Style matters. Different voices may have different default styles. Either set `style` explicitly, or call Get voice once at startup to read the voice’s default.
Estimate before you generate. `predict_duration` returns the expected audio length without consuming credits, which is useful for UI hints and cost forecasting.
Long text. The raw API caps `text` at 300 characters. The Python and TypeScript SDKs split, generate, and merge automatically; see Long text.
Short input. Empty or very short input can produce unnatural results. Aim for at least one complete short sentence.
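
If you call the raw API rather than an SDK, text over the 300-character cap has to be split before sending. The sketch below shows one possible approach: break on sentence-ending punctuation, then greedily pack the pieces into chunks. It is an illustration only, not the algorithm the SDKs use, and `split_text` is a hypothetical helper.

```python
import re

MAX_CHARS = 300  # per-call limit on `text` for the raw API


def _pieces(text, limit):
    """Yield sentences, falling back to words (or hard cuts) when one is too long."""
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        if len(sentence) <= limit:
            if sentence:
                yield sentence
            continue
        for word in sentence.split():
            # A single word longer than the limit is cut into fixed-size slices.
            for i in range(0, len(word), limit):
                yield word[i:i + limit]


def split_text(text, limit=MAX_CHARS):
    """Greedily pack pieces into chunks of at most `limit` characters."""
    chunks = []
    current = ""
    for piece in _pieces(text, limit):
        candidate = f"{current} {piece}" if current else piece
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be sent as its own request and the resulting audio concatenated. Splitting at sentence boundaries tends to keep prosody natural at the joins, though that depends on the voice and model.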

Related

Pick a model: Choose between fast and high-quality TTS models.
Long text: Generate audio from text longer than 300 characters.
Voice settings: Tune pitch, intonation, and speed.
API reference: Full request and response schema.
