Collecting Large-Scale High-Quality Audio Datasets for Text-to-Speech Synthesis

Supervisor(s): Nicolas Müller
Status: finished
Topic: Others
Author: Simon Roschmann
Submission: 2021-03-15
Type of Thesis: Bachelor's thesis
Thesis topic in co-operation with the Fraunhofer Institute for Applied and Integrated Security AISEC, Garching

Description

Text-to-Speech (TTS) synthesis creates artificial speech from a text prompt.
Based on a sequence-to-sequence architecture with attention and trained on
appropriate <text, audio> pairs, state-of-the-art TTS systems produce
synthesized audio close to the naturalness and intelligibility of human speech.
Because high-quality training data remains scarce, particularly for
non-English TTS models, we present two approaches to collecting large-scale,
high-quality audio datasets. Both pipelines are applied and evaluated by
collecting the first German multi-speaker TTS corpus from LibriVox audiobooks
for model pretraining and two German single-speaker corpora from YouTube
videos for model fine-tuning. Evaluating the output of both pipelines and the
performance of TTS models trained on it, we find that the collected
training data makes it possible to synthesize high-quality audio.