
Detection of Misuse and Malicious Behaviours through an Emotion Analysis System - Speech
Supervisor(s): Ching-Yu Kao, Karla Markert
Status: finished
Topic: Others
Author: Georgi Hrusanov
Submission: 2021-10-15
Type of Thesis: Bachelor's Thesis
Thesis topic in co-operation with the Fraunhofer Institute for Applied and Integrated Security AISEC, Garching

Description

Recently, the significance of reacting to a user’s emotional state has 
been widely acknowledged in the field of human-computer interaction, and 
voice and speech, in particular, have gained greater attention as a medium 
from which to automatically deduce information about emotion. Up to this point, 
only academic and non-application-oriented offline research has been conducted, 
using previously recorded and labeled datasets of emotional speech.
However, the needs of online analysis differ from those of offline 
analysis; in particular, situations are more unpredictable and require faster algorithms.
As a result, this thesis studies real-time automated emotion detection from acoustic
characteristics of speech. First, offline tests were carried out to identify suitable audio
segmentation and feature extraction methods. Then, supervised deep learning approaches
were used for the classification. The methods were analysed against two competing
requirements: they should be as fast as feasible while producing results that are as
accurate as possible. For the evaluation, we collected findings on two databases of
distinct speech and emotion kinds, the Berlin Database of Emotional Speech and RAVDESS.
Both comprise read and spontaneous speech as well as acted and spontaneous emotions.
We compare two different learning approaches: supervised and unsupervised training, with
labeled and unlabeled data, respectively. For this purpose, we discuss and analyze the
speech synthesis problem and Google's text-to-speech system Tacotron, as well as its
extension, the Global Style Tokens. We demonstrate that combining Convolutional Neural
Networks for feature extraction with LSTMs for classification under supervised learning
achieves results close to the state of the art. With the second approach, we achieve
better style control and transfer through speech synthesis, as well as increased
interpretability of the Global Style Tokens, while using only 5% labeled data. This thesis
also illustrates the similarities and differences between the two approaches. Furthermore,
we examined the possibility of combining the two approaches proposed in the thesis and
evaluated whether the models can be combined with other types of emotion recognition,
such as face, image, or text.
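The kind of segmentation and feature extraction step described above can be sketched as
follows: the waveform is split into short overlapping frames, and a log-magnitude spectrum
is computed per frame, yielding the time-frequency features a CNN front end would consume.
The frame length, hop size, and FFT size below are common illustrative defaults (25 ms
windows, 10 ms hop at 16 kHz), not the configuration actually used in the thesis.

```python
import numpy as np

def frame_signal(signal, frame_len=400, hop=160):
    """Split a 1-D signal into overlapping frames.

    Default: 25 ms windows with a 10 ms hop at 16 kHz (illustrative values).
    """
    n_frames = 1 + max(0, len(signal) - frame_len) // hop
    return np.stack(
        [signal[i * hop : i * hop + frame_len] for i in range(n_frames)]
    )

def log_spectrogram(frames, n_fft=512, eps=1e-10):
    """Hamming-windowed magnitude spectrum per frame, on a log scale."""
    window = np.hamming(frames.shape[1])
    spectrum = np.abs(np.fft.rfft(frames * window, n=n_fft, axis=1))
    return np.log(spectrum + eps)

# Example: one second of a synthetic 440 Hz tone at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
signal = np.sin(2 * np.pi * 440 * t)
feats = log_spectrogram(frame_signal(signal))
print(feats.shape)  # (98, 257): 98 frames, n_fft // 2 + 1 frequency bins
```

In an online setting, the same framing can be applied incrementally to an audio buffer as
it fills, which is what makes frame-level features attractive for the real-time
requirement discussed above.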