Meta's Voicebox Shines in Speech Synthesis

Trained on over 50,000 hours of audio data, Voicebox surpasses previous TTS models on benchmark results.

Meta has introduced Voicebox, a sophisticated speech generation model that excels at text-to-speech (TTS) synthesis across six languages and demonstrates superior noise removal capabilities.

It predicts masked sections in audio inputs, allowing tasks like noise removal and cross-lingual style transfer.
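As a rough illustration of what predicting masked sections means, the sketch below builds one infilling example from a sequence of audio feature frames. It is a minimal sketch under stated assumptions, not Meta's pipeline: the function name make_infilling_example, the zero-filling of the masked span, and the toy two-dimensional frames are all hypothetical choices made for illustration.

```python
def make_infilling_example(frames, start, length):
    """Split a list of feature frames into model context and target.

    Hypothetical helper: a model would see `context` (masked span zeroed
    out) together with `mask`, and be trained to predict `target`, the
    original frames inside the masked span.
    """
    n = len(frames)
    if start < 0 or length <= 0 or start + length > n:
        raise ValueError("masked span must lie inside the utterance")
    # True marks frames the model must fill in.
    mask = [start <= i < start + length for i in range(n)]
    # Assumption: masked frames are replaced by zeros in the context.
    context = [[0.0] * len(f) if m else list(f) for f, m in zip(frames, mask)]
    target = [list(f) for f, m in zip(frames, mask) if m]
    return context, mask, target


# Toy stand-in for 10 frames of a 2-dimensional audio feature.
frames = [[float(i), float(i) + 0.5] for i in range(10)]
context, mask, target = make_infilling_example(frames, start=3, length=4)
```

Seen this way, noise removal amounts to masking the noisy span and asking the model to regenerate it from the surrounding audio.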

Voicebox uses a flow-matching architecture, which sets it apart from autoregressive models.
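To make the contrast with autoregressive generation concrete, here is a minimal sketch of the flow-matching idea. It assumes the straight-line (optimal-transport) path commonly used in the flow-matching literature; the article does not describe Voicebox's exact formulation. The names flow_matching_pair and euler_sample, the value of SIGMA_MIN, and the "oracle" velocity standing in for a trained network are all illustrative assumptions.

```python
import random

SIGMA_MIN = 1e-5  # assumed small constant; not taken from the article


def flow_matching_pair(x0, x1, t, sigma_min=SIGMA_MIN):
    """Return (x_t, target_velocity) on a straight path from noise x0 to data x1.

    A network would be trained so that its output at (x_t, t) matches
    target_velocity under a squared-error loss.
    """
    scale = 1.0 - (1.0 - sigma_min) * t
    x_t = [scale * a + t * b for a, b in zip(x0, x1)]
    velocity = [b - (1.0 - sigma_min) * a for a, b in zip(x0, x1)]
    return x_t, velocity


def euler_sample(velocity_fn, x0, steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with fixed Euler steps.

    Every element of x is updated together on each step, unlike an
    autoregressive decoder that emits one element at a time.
    """
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


random.seed(0)
x1 = [0.5, -1.0, 2.0]                        # toy "data" vector
x0 = [random.gauss(0.0, 1.0) for _ in x1]    # Gaussian noise starting point
x_t, u = flow_matching_pair(x0, x1, t=0.3)

# Stand-in for a trained network: it simply returns the true target velocity.
oracle = lambda x, t: u
sample = euler_sample(oracle, x0, steps=10)  # lands very close to x1
```

In a real system the oracle would be replaced by a learned network, and the number of integration steps would trade generation speed against quality.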

Meta refrains from open-sourcing Voicebox, citing safety concerns.

Despite being trained on audiobooks in multiple languages for specific tasks, Voicebox exhibits in-context learning for style transfer and noise removal.

To balance openness and responsibility, Meta shares audio samples and a detailed research paper.

Commentators have debated Meta's decision, noting that the model could potentially be replicated given the abundant training data available from audiobooks, podcasts, and broadcast archives.

For safety, Meta has introduced a classifier that detects synthesized speech, reaffirming its commitment to ethical AI development.