VibeVoice-Realtime TTS Demo

Text

Streaming Input Text

This area will display the streaming input text in real time.

This demo requires the full text to be provided upfront. The model then receives the text via streaming input during synthesis.
If the text contains special characters other than punctuation, normalizing it before synthesis often yields better results.
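As a rough illustration of the normalization step mentioned above, the sketch below rewrites a few common symbols as words before the text is sent for synthesis. It is not part of this demo: the function name `normalize_for_tts` and the replacement table `SPOKEN_FORMS` are hypothetical, and the table is an assumption you would adapt to your own text and language.

```python
import re
import unicodedata

# Hypothetical replacement table (an assumption, not the demo's own rules).
SPOKEN_FORMS = {
    "&": " and ",
    "%": " percent",
    "@": " at ",
    "+": " plus ",
    "=": " equals ",
}


def normalize_for_tts(text: str) -> str:
    """Return text with selected symbols spelled out and whitespace tidied."""
    # NFKC folds compatibility characters (full-width forms, ligatures)
    # into their plain equivalents.
    text = unicodedata.normalize("NFKC", text)
    # Spell out each symbol from the table.
    for symbol, spoken in SPOKEN_FORMS.items():
        text = text.replace(symbol, spoken)
    # Collapse runs of whitespace left behind by the replacements.
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    print(normalize_for_tts("R&D grew 5%"))
```

Punctuation is left untouched on purpose, since the note above only concerns non-punctuation characters.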

Clone a Voice (add your own speaker) ▼

💾 Upload audio

Use acoustic inversion (slower, better)

✨ High Quality TTS — Voice Cloning (F5-TTS) ▼

Upload or record 5–10 seconds of clean reference audio. F5-TTS generates speech in that voice with ElevenLabs-level quality.
Note: synthesis takes roughly 30–90 s on CPU; upgrade to a GPU VPS for near-instant results.

💾 Reference Audio No audio selected

Reference Transcription (optional — leave empty to auto-detect)

Text to synthesize

Speed 1.00x

Speaker

CFG: 1.5 | Inference Steps: 5

Model Generated Audio: 0.00s | Audio Played: 0.00s

Runtime Logs