Description
Recently, the significance of reacting to a user’s emotional state has
been widely acknowledged in the field of human-computer interaction, and
voice and speech, in particular, have gained greater attention as a medium
from which to automatically deduce information about emotion. Up to this point,
research has mostly been academic and non-application-oriented, conducted offline
on previously recorded and labeled datasets of emotional speech.
However, the needs of online analysis differ from those of offline
analysis; in particular, situations are more unpredictable and require faster algorithms.
This thesis therefore studies real-time automated emotion detection from acoustic
characteristics of speech. First, offline experiments were carried out to determine
suitable audio segmentation and feature extraction settings. Supervised deep learning
approaches were then used for classification. The methods were analyzed against the
following requirement: they should be as fast as possible while producing results that
are as accurate as possible.
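As a rough illustration of this stage (not the exact configuration used in the thesis),
the sketch below segments an utterance into fixed-length, overlapping windows and computes
a log-mel spectrogram per segment; the file path, window length, and feature dimensions
are placeholder assumptions.

import numpy as np
import librosa

# Placeholder settings; the segmentation and feature parameters used in the
# thesis may differ.
SAMPLE_RATE = 16000
SEGMENT_SECONDS = 1.0   # fixed-length audio segments for online processing
HOP_SECONDS = 0.5       # 50% overlap between consecutive segments
N_MELS = 64

def extract_segment_features(path):
    """Segment an utterance and compute a log-mel spectrogram per segment."""
    signal, sr = librosa.load(path, sr=SAMPLE_RATE, mono=True)
    seg_len = int(SEGMENT_SECONDS * sr)
    hop_len = int(HOP_SECONDS * sr)
    # Pad short utterances so that at least one full segment exists.
    if len(signal) < seg_len:
        signal = np.pad(signal, (0, seg_len - len(signal)))
    # Frame the waveform into overlapping fixed-length segments.
    segments = librosa.util.frame(signal, frame_length=seg_len, hop_length=hop_len).T
    features = []
    for seg in segments:
        mel = librosa.feature.melspectrogram(y=seg, sr=sr, n_mels=N_MELS)
        features.append(librosa.power_to_db(mel))  # (N_MELS, time_frames)
    return np.stack(features)                      # (n_segments, N_MELS, time_frames)

# Example usage with a hypothetical file:
# feats = extract_segment_features("example_utterance.wav")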
For the evaluation, we report results on two databases of different speech and emotion
types, the Berlin Database of Emotional Speech (EMO-DB) and RAVDESS, both of which
consist of acted emotional speech based on scripted utterances. We compare
two different learning approaches: supervised and unsupervised training, with labeled and
unlabeled data, respectively. For this purpose, we discuss and analyze the speech synthesis
problem and Google's text-to-speech system Tacotron as well as its extension, the Global
Style Tokens. We demonstrate that combining Convolutional Neural Networks for feature
extraction with LSTMs for classification, trained in a supervised fashion, achieves results
close to the state of the art. In the second approach, we achieve better style control and
transfer through speech synthesis, as well as increased interpretability of the Global Style
Tokens, while using only 5% labeled data. This thesis also illustrates the similarities and
differences between the two approaches.
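As a concrete illustration of the first, supervised approach, the following is a minimal
sketch of a CNN feature extractor followed by an LSTM classifier over log-mel segments.
The layer sizes, number of emotion classes, and input shape are illustrative assumptions,
not the exact architecture from the thesis.

import torch
import torch.nn as nn

class CnnLstmEmotionClassifier(nn.Module):
    """CNN over the spectrogram, LSTM over the resulting time sequence."""
    def __init__(self, n_mels=64, n_classes=7, hidden_size=128):
        super().__init__()
        # 2-D convolutions extract local time-frequency features.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 2)),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 2)),
        )
        # After two pooling steps the mel axis is reduced by a factor of 4.
        self.lstm = nn.LSTM(input_size=32 * (n_mels // 4),
                            hidden_size=hidden_size, batch_first=True)
        self.classifier = nn.Linear(hidden_size, n_classes)

    def forward(self, x):
        # x: (batch, 1, n_mels, time_frames)
        feats = self.cnn(x)                     # (batch, 32, n_mels//4, time//4)
        feats = feats.permute(0, 3, 1, 2)       # (batch, time//4, 32, n_mels//4)
        feats = feats.flatten(start_dim=2)      # (batch, time//4, 32 * n_mels//4)
        _, (h_n, _) = self.lstm(feats)          # h_n: (1, batch, hidden_size)
        return self.classifier(h_n[-1])         # emotion logits per segment

# Example: a batch of 8 segments with 64 mel bands and 32 time frames.
# model = CnnLstmEmotionClassifier()
# logits = model(torch.randn(8, 1, 64, 32))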
Furthermore, we examine the possibility of combining the two approaches proposed in the
thesis and evaluate whether the models can be combined with other types of emotion
recognition, such as facial expression, image, or text.
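To make the Global Style Token mechanism more concrete, the sketch below shows a simplified
version of the idea: a bank of learnable style token embeddings is attended over, with a
reference embedding as the query, producing a style embedding whose attention weights make
the tokens interpretable. This is a simplification, not Tacotron's actual implementation,
and the number of tokens, embedding size, and head count are assumptions.

import torch
import torch.nn as nn

class GlobalStyleTokens(nn.Module):
    """Simplified GST layer: attention over a bank of learnable style tokens."""
    def __init__(self, n_tokens=10, embed_dim=256, n_heads=4):
        super().__init__()
        # Learnable style token embeddings shared across all utterances.
        self.tokens = nn.Parameter(torch.randn(n_tokens, embed_dim) * 0.1)
        self.attention = nn.MultiheadAttention(embed_dim, n_heads, batch_first=True)

    def forward(self, reference_embedding):
        # reference_embedding: (batch, embed_dim), e.g. produced by a reference
        # encoder that summarizes the prosody/style of a reference utterance.
        query = reference_embedding.unsqueeze(1)            # (batch, 1, embed_dim)
        keys = torch.tanh(self.tokens).unsqueeze(0)          # (1, n_tokens, embed_dim)
        keys = keys.expand(query.size(0), -1, -1)
        style, weights = self.attention(query, keys, keys)
        # style:   (batch, 1, embed_dim) conditions the synthesizer.
        # weights: (batch, 1, n_tokens) show how much each token contributes,
        # which is what makes the tokens interpretable.
        return style.squeeze(1), weights.squeeze(1)

# Example: a batch of 4 reference embeddings.
# gst = GlobalStyleTokens()
# style_emb, token_weights = gst(torch.randn(4, 256))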