Amazon Researchers Create Largest Text-to-Speech Model, BASE TTS

Researchers at Amazon have developed BASE TTS, a new large language model (LLM) for text-to-speech. With 980 million parameters, it is the largest text-to-speech model created to date, according to the team. The researchers trained models of various sizes on up to 100,000 hours of public domain speech data.

Improved Versatility and Robustness

The medium-sized model, with 400 million parameters, showed significantly improved versatility and robustness on a set of deliberately difficult test sentences. These sentences contain complex lexical, syntactic, and paralinguistic features that often trip up text-to-speech systems.

Handling Challenging Sentences

“These sentences are designed to contain challenging tasks—none of which BASE TTS is explicitly trained to perform,” explained the researchers.

Further Work Required

Although the work is still experimental, BASE TTS demonstrates that text-to-speech models can reach new versatility thresholds as they scale. The researchers plan further work to identify the optimal model size for emergent abilities.

Lightweight and Streamable Design

The model is also designed to be lightweight and streamable, packaging emotional and prosodic data separately. This could allow natural-sounding spoken audio to be transmitted across low-bandwidth connections.

