This area will display the streaming input text in real time.
This demo requires the full text to be provided upfront. The model then receives the text via streaming input during synthesis.
If the text contains special characters other than punctuation, normalizing the text before synthesis often yields better results.
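As a rough illustration of the normalization mentioned above, the sketch below spells out a few common symbols before the text is sent for synthesis. The function name `normalize_for_tts` and the `SYMBOL_WORDS` mapping are hypothetical and are not part of this demo; which symbols need handling depends on your text and language.

```python
import re
import unicodedata

# Hypothetical symbol-to-word mapping (an assumption, not the demo's own rules).
# Extend or change it for your own text and language.
SYMBOL_WORDS = {
    "&": " and ",
    "%": " percent ",
    "@": " at ",
    "+": " plus ",
    "=": " equals ",
    "#": " number ",
}


def normalize_for_tts(text: str) -> str:
    """Replace a few non-punctuation symbols with words and tidy whitespace."""
    # Fold compatibility forms (for example full-width symbols) to plain ones.
    text = unicodedata.normalize("NFKC", text)
    # Spell out each mapped symbol.
    for symbol, word in SYMBOL_WORDS.items():
        text = text.replace(symbol, word)
    # Collapse the extra spaces introduced by the replacements.
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    print(normalize_for_tts("Tom & Jerry scored 95% on test #3"))
```

Ordinary punctuation is left untouched here, since the note above only concerns other special characters.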
Clone a Voice (add your own speaker)
✨ High-Quality TTS: Voice Cloning (F5-TTS)
Upload or record 5–10 seconds of clean reference audio.
F5-TTS then generates speech in that voice with ElevenLabs-level quality. Note: generation takes roughly 30–90 s on CPU; upgrade to a GPU VPS for near-instant results.
Reference Transcription (optional; leave empty to auto-detect)