Description
Text-to-Speech (TTS) synthesis creates artificial speech from a text prompt.
Based on a sequence-to-sequence architecture with attention and trained on
appropriate <text, audio> pairs, state-of-the-art TTS systems produce
synthesized audio close to the naturalness and intelligibility of human speech.
Since high-quality training data remains a scarce resource, particularly for
non-English TTS models, we present two approaches to collecting large-scale
high-quality audio datasets. Both pipelines are applied and evaluated by
collecting the first German multi-speaker TTS corpus from LibriVox audiobooks
for model pretraining and two German single-speaker corpora from YouTube
videos for model finetuning. Evaluating the output of both pipelines and the
performance of TTS models trained on these corpora, we find that the collected
training data allows high-quality audio to be synthesized.