text at 300 characters per request. Anything longer returns 400 Bad Request. To synthesize longer scripts you have two options:
- Use the SDKs. Both the Python and TypeScript SDKs automatically split long input, generate each segment, and merge the audio for you. You get one final clip back even if the input was 2,000 characters.
- Chunk yourself. If you call the REST API directly, split the text on sentence boundaries and concatenate the resulting audio.
The 300-character limit, at a glance
| Layer | Behavior on >300 characters |
|---|---|
| REST API (POST /v1/text-to-speech/{voice_id}) | Returns 400 Bad Request. |
| Python SDK (create_speech, stream_speech) | Auto-chunks at 300, generates in parallel for create_speech (sequential for streaming), merges audio. |
| TypeScript SDK (createSpeech, streamSpeech) | Auto-chunks at 300, generates sequentially, merges audio. |
| predict_duration | Not auto-chunked; the same 300-character limit applies. |
SDK auto-chunking, end to end
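As a rough illustration of the split, bounded-parallel generate, and in-order merge pattern this section covers, here is a minimal sketch. It is not the SDK's implementation. synthesize_segment is a hypothetical stand-in for a single TTS request, and textwrap.wrap is a simplified splitter that only honors word and character boundaries, not sentence boundaries.

```python
from concurrent.futures import ThreadPoolExecutor
import textwrap

MAX_CHARS = 300      # per-request limit from this page
MAX_PARALLEL = 3     # parallelism figure quoted on this page

LONG_TEXT = "This sentence is repeated to build a long script. " * 40


def synthesize_segment(segment: str) -> bytes:
    """Hypothetical stand-in for one TTS request; returns fake audio bytes."""
    return segment.encode("utf-8")


def synthesize_long(text: str) -> bytes:
    # textwrap.wrap breaks at whitespace and cuts an over-long word by
    # characters; it does not look for sentence boundaries.
    segments = textwrap.wrap(text, width=MAX_CHARS)
    with ThreadPoolExecutor(max_workers=MAX_PARALLEL) as pool:
        # map() returns results in input order even when requests
        # finish out of order, so the merged audio stays in sequence.
        parts = list(pool.map(synthesize_segment, segments))
    return b"".join(parts)


audio = synthesize_long(LONG_TEXT)
```

Capping the pool at three workers keeps a single long input from issuing an unbounded burst of requests.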
The SDK splits LONG_TEXT at sentence boundaries (then word boundaries, then character boundaries if a single word is too long), runs up to 3 parallel create_speech requests, and merges the resulting WAV/MP3 audio with intermediate file headers stripped.
Streaming long text
The SDKs also auto-chunk on stream_speech / streamSpeech. The audio is delivered to your iterator as if it were a single continuous stream, so you don’t need to know how many segments were used.
See Stream speech for the streaming pattern.
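Conceptually, presenting several per-segment streams as one continuous stream amounts to chaining iterators. The sketch below shows that idea only. stream_segment is a hypothetical stand-in for one streaming request, and the SDKs may work differently internally.

```python
from itertools import chain
from typing import Iterable, Iterator


def stream_segment(segment: str) -> Iterator[bytes]:
    """Hypothetical stand-in for one streaming TTS request."""
    data = segment.encode("utf-8")
    for i in range(0, len(data), 8):
        yield data[i:i + 8]


def stream_long(segments: Iterable[str]) -> Iterator[bytes]:
    # The generator expression is lazy, so the stream for segment N+1 is
    # only started after segment N has been fully consumed.
    return chain.from_iterable(stream_segment(s) for s in segments)


for audio_chunk in stream_long(["First sentence.", "Second sentence."]):
    pass  # write audio_chunk to a file or an audio player here
```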
Chunking yourself (cURL or raw HTTP)
If you call the REST API directly, you need to split before sending. A reasonable strategy:
- Split on sentence-ending punctuation (., !, ?, 。, ?).
- If a sentence is still over 300 characters, split on commas, then on word boundaries.
- For each segment, call POST /v1/text-to-speech/{voice_id} and append the returned audio to your output file.
- For WAV concatenation, strip the WAV header (first 44 bytes) from every segment after the first so the final file plays as one clip.
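The splitting and WAV-merging steps above can be sketched as follows. This is one possible approach under stated assumptions, not a reference implementation. split_text and concat_wav are illustrative names. Pieces are packed greedily up to the limit and rejoined with a single space. The limit is assumed to be counted the way Python's len counts characters. The WAV merge assumes every segment is PCM with the canonical 44-byte header and the same sample format. The HTTP call itself is left out because the request body and authentication are not covered here.

```python
import re
import struct

MAX_CHARS = 300
WAV_HEADER_BYTES = 44  # canonical PCM header size, as stated on this page

# Tried in order: sentence ends, then commas, then whitespace.
PATTERNS = [r"(?<=[.!?])\s+|(?<=[。?])", r"(?<=,)\s*", r"\s+"]


def _split(text: str, patterns: list[str], limit: int) -> list[str]:
    if len(text) <= limit:
        return [text] if text else []
    if not patterns:
        # Last resort: hard cut at character boundaries.
        return [text[i:i + limit] for i in range(0, len(text), limit)]
    chunks, current = [], ""
    for piece in (p for p in re.split(patterns[0], text) if p):
        candidate = f"{current} {piece}" if current else piece
        if len(candidate) <= limit:
            current = candidate          # keep packing the current chunk
            continue
        if current:
            chunks.append(current)
        if len(piece) <= limit:
            current = piece
        else:
            # The piece alone is too long: retry with a finer boundary.
            sub = _split(piece, patterns[1:], limit)
            chunks.extend(sub[:-1])
            current = sub[-1]
    if current:
        chunks.append(current)
    return chunks


def split_text(text: str, limit: int = MAX_CHARS) -> list[str]:
    """Split text into chunks of at most `limit` characters."""
    return _split(text.strip(), PATTERNS, limit)


def concat_wav(segments: list[bytes]) -> bytes:
    """Keep the first header, drop later ones, then patch both size fields."""
    if not segments:
        return b""
    out = bytearray(segments[0])
    for seg in segments[1:]:
        out += seg[WAV_HEADER_BYTES:]
    struct.pack_into("<I", out, 4, len(out) - 8)                   # RIFF chunk size
    struct.pack_into("<I", out, 40, len(out) - WAV_HEADER_BYTES)   # data chunk size
    return bytes(out)
```

Patching the two size fields goes one step beyond stripping headers: without it, the first segment's header still declares only the first segment's length, and stricter players may stop there.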
Tips
- Punctuation matters. Auto-chunking prefers sentence boundaries. Well-punctuated input produces cleaner cuts and more natural transitions.
- Estimate cost first. predict_duration doesn’t auto-chunk, but you can split text yourself and sum durations to estimate total credits.
- Watch rate limits. A single long input becomes multiple TTS requests, so track your account’s rate limits and consider throttling in your own caller.
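The last two tips can be sketched as small helpers. Both are illustrations under assumptions. predict is a caller-supplied function that wraps predict_duration for one segment (its exact signature and return unit are not specified here), and max_per_minute must come from your own tier's limits. Converting the summed duration into credits depends on your plan and is not shown.

```python
import time
from typing import Callable, Iterable, Iterator


def estimate_total_seconds(segments: Iterable[str],
                           predict: Callable[[str], float]) -> float:
    """Sum per-segment duration predictions (predict is a hypothetical wrapper)."""
    return sum(predict(s) for s in segments)


def throttled(segments: Iterable[str], max_per_minute: int,
              sleep: Callable[[float], None] = time.sleep) -> Iterator[str]:
    """Yield segments no faster than max_per_minute, spaced evenly."""
    interval = 60.0 / max_per_minute
    next_slot = time.monotonic()
    for segment in segments:
        delay = next_slot - time.monotonic()
        if delay > 0:
            sleep(delay)  # wait until the next request slot opens
        next_slot = max(next_slot, time.monotonic()) + interval
        yield segment
```

A caller would loop over throttled(chunks, max_per_minute=...) and issue one request per yielded segment.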
Related
Long-form narration
End-to-end example for generating a multi-paragraph narration.
Rate limits
Per-minute request limits by tier.