VibeVoice-Realtime TTS Demo

Streaming Input Text
This area will display the streaming input text in real time.
This demo requires the full text to be provided upfront. The model then receives the text via streaming input during synthesis.
If the text contains special characters other than punctuation, applying text normalization before synthesis often yields better results.
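As a rough illustration of the normalization step mentioned above, here is a minimal Python sketch. It is not part of the demo: the function name normalize_for_tts and the SPOKEN_FORMS table are hypothetical, and the choice of which symbols to spell out or drop is an assumption to adapt to your own input.

```python
import re
import unicodedata

# Hypothetical replacement table (an assumption, not taken from the demo).
SPOKEN_FORMS = {
    "&": " and ",
    "%": " percent",
    "@": " at ",
    "+": " plus ",
    "=": " equals ",
}


def normalize_for_tts(text: str) -> str:
    """Return text with special symbols spelled out or removed."""
    # NFKC folds compatibility characters (ligatures, full-width forms)
    # into their plain equivalents.
    text = unicodedata.normalize("NFKC", text)
    # Spell out the symbols we have a spoken form for.
    for symbol, spoken in SPOKEN_FORMS.items():
        text = text.replace(symbol, spoken)
    # Drop any remaining symbol characters (Unicode categories starting
    # with "S"); letters, digits, and punctuation are kept.
    text = "".join(
        ch for ch in text if not unicodedata.category(ch).startswith("S")
    )
    # Collapse the whitespace runs left behind by the steps above.
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    print(normalize_for_tts("Tom & Jerry: 50% off!"))
```

Note that this sketch silently drops symbols it has no spoken form for, including currency signs, so extend the table before relying on it for text where those matter.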
Clone a Voice (add your own speaker)
✨ High Quality TTS — Voice Cloning (F5-TTS)

Upload or record 5–10 seconds of clean reference audio. F5-TTS generates speech in that voice with ElevenLabs-level quality.
Note: synthesis takes about 30–90 s on CPU. Upgrade to a GPU VPS for near-instant results.

Reference Transcription (optional — leave empty to auto-detect)
Text to synthesize
Speaker
Model Generated Audio: 0.00s | Audio Played: 0.00s
Runtime Logs