VibeVoice Large Modal

$20

We’re bringing long-form, multi-speaker conversational speech synthesis to everyone — fully open-source, easy to run, and now live on Modal.

Bigger Model (7B): Generate high-fidelity speech for extended conversations.
Multi-Speaker Dialogue: Up to 4 unique voices for podcasts, audiobooks, or roleplay.
Ultra-Long Context: Handle conversations up to 45 minutes with natural flow.
Next-Token Diffusion Framework: Combines LLM-style context with diffusion-based acoustic detail for expressive realism.
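
As a rough illustration of the multi-speaker limit, a dialogue can be prepared as one labeled line per turn before it is sent for synthesis. The sketch below is only an assumption about how input might be organized: the "Speaker N:" line format and the helper name format_script are hypothetical and should be checked against the notebook you deploy.

```python
MAX_SPEAKERS = 4  # limit stated in the feature list above


def format_script(turns):
    """Turn (speaker_name, text) pairs into 'Speaker N: text' lines.

    The 'Speaker N:' labeling is an assumed convention, not a confirmed
    input format for this deployment.
    """
    speaker_ids = {}
    lines = []
    for name, text in turns:
        if name not in speaker_ids:
            # Reject a fifth distinct voice instead of silently reusing one.
            if len(speaker_ids) == MAX_SPEAKERS:
                raise ValueError(
                    f"at most {MAX_SPEAKERS} distinct speakers are supported"
                )
            speaker_ids[name] = len(speaker_ids) + 1
        lines.append(f"Speaker {speaker_ids[name]}: {text.strip()}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_script([("Ana", "Welcome back."), ("Ben", "Glad to be here.")]))
```

Numbering speakers in order of first appearance keeps each voice label stable across a long conversation.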

A ready-to-run Google Colab link that includes:

Once deployed, your model will be live at:

https://<your-modal-space>/vibevoice
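
A minimal sketch of how a client might prepare a call to that URL, using only the Python standard library. Everything about the request shape is an assumption: the POST method, the JSON body, and the field names "script" and "speakers" are hypothetical, so match them to whatever the deployed endpoint actually expects.

```python
import json
import urllib.request


def build_request(base_url, script, speakers):
    """Build (but do not send) a POST request for the /vibevoice route.

    base_url is your own Modal URL; the payload keys are hypothetical.
    """
    payload = {"script": script, "speakers": speakers}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/vibevoice",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Placeholder host; replace with the URL printed after deployment.
    req = build_request("https://example.invalid", "Speaker 1: Hello!", ["voice_a"])
    print(req.get_method(), req.full_url)
```

Passing the request to urllib.request.urlopen would send it. How the audio comes back (raw bytes, a file link, or something else) depends on the deployment and is not specified here.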