Collecting Large-Scale High-Quality Audio Datasets for Text-to-Speech Synthesis

Supervisor(s):	Nicolas Müller
Status:	finished
Topic:	Others
Author:	Simon Roschmann
Submission:	2021-03-15
Type of Thesis:	Bachelorthesis
Thesis topic in co-operation with the Fraunhofer Institute for Applied and Integrated Security AISEC, Garching
Description:	Text-to-Speech (TTS) synthesis creates artificial speech from a text prompt. Based on a sequence-to-sequence architecture with attention and trained on appropriate <text, audio> pairs, state-of-the-art TTS systems produce synthesized audio close to the naturalness and intelligibility of human speech. Since high-quality training data remains a scarce resource, particularly for non-English TTS models, we present two approaches to collecting large-scale high-quality audio datasets. Both pipelines are applied and evaluated in collecting the first German multi-speaker TTS corpus from LibriVox audiobooks for model pretraining and two German single-speaker corpora from YouTube videos for model finetuning. Evaluating the output of both pipelines and the performance of TTS models trained on it, we find that the collected training data allows high-quality audio to be synthesized.
