Description
Recently, the significance of reacting to a user’s emotional state has
been widely acknowledged in the field of human-computer interaction, and
voice and speech, in particular, have gained greater attention as a medium
from which to automatically deduce information about emotion. Up to this point,
research has mostly been academic and non-application-oriented, conducted offline
on previously recorded and labeled datasets of emotional speech.
However, the needs of online analysis differ from those of offline
analysis; in particular, situations are more unpredictable and require faster algorithms.
This thesis therefore studies real-time automated emotion detection from acoustic
characteristics of speech. First, offline experiments were carried out to determine
suitable audio segmentation and feature extraction settings. Supervised deep learning
approaches were then used for classification. The methods were analyzed against the
following requirement: they should be as fast as possible while producing results that
are as accurate as possible.
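As a rough illustration of this stage (not the exact configuration used in the thesis),
the sketch below segments an utterance into fixed-length, overlapping windows and computes
a log-mel spectrogram per segment; the file path, window length, and feature dimensions
are placeholder assumptions.

import numpy as np
import librosa

# Placeholder settings; the segmentation and feature parameters used in the
# thesis may differ.
SAMPLE_RATE = 16000
SEGMENT_SECONDS = 1.0   # fixed-length audio segments for online processing
HOP_SECONDS = 0.5       # 50% overlap between consecutive segments
N_MELS = 64

def extract_segment_features(path):
    """Segment an utterance and compute a log-mel spectrogram per segment."""
    signal, sr = librosa.load(path, sr=SAMPLE_RATE, mono=True)
    seg_len = int(SEGMENT_SECONDS * sr)
    hop_len = int(HOP_SECONDS * sr)
    # Pad short utterances so that at least one full segment exists.
    if len(signal) < seg_len:
        signal = np.pad(signal, (0, seg_len - len(signal)))
    # Frame the waveform into overlapping fixed-length segments.
    segments = librosa.util.frame(signal, frame_length=seg_len, hop_length=hop_len).T
    features = []
    for seg in segments:
        mel = librosa.feature.melspectrogram(y=seg, sr=sr, n_mels=N_MELS)
        features.append(librosa.power_to_db(mel))  # (N_MELS, time_frames)
    return np.stack(features)                      # (n_segments, N_MELS, time_frames)

# Example usage with a hypothetical file:
# feats = extract_segment_features("example_utterance.wav")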
For the evaluation, we report results on two databases of different speech and emotion
types, the Berlin Database of Emotional Speech (EMO-DB) and RAVDESS, both of which
consist of acted emotional speech based on scripted utterances. We compare
two different learning approaches: supervised and unsupervised training, with labeled and
unlabeled data, respectively. For this purpose, we discuss and analyze the speech synthesis
problem and Google's text-to-speech system Tacotron as well as its extension, the Global
Style Tokens. We demonstrate that combining Convolutional Neural Networks for feature
extraction with LSTMs for classification, trained in a supervised fashion, achieves results
close to the state of the art. In the second approach, we achieve better style control and
transfer through speech synthesis, as well as increased interpretability of the Global Style
Tokens, while using only 5% labeled data. This thesis also illustrates the similarities and
differences between the two approaches.
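As a concrete illustration of the first, supervised approach, the following is a minimal
sketch of a CNN feature extractor followed by an LSTM classifier over log-mel segments.
The layer sizes, number of emotion classes, and input shape are illustrative assumptions,
not the exact architecture from the thesis.

import torch
import torch.nn as nn

class CnnLstmEmotionClassifier(nn.Module):
    """CNN over the spectrogram, LSTM over the resulting time sequence."""
    def __init__(self, n_mels=64, n_classes=7, hidden_size=128):
        super().__init__()
        # 2-D convolutions extract local time-frequency features.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 2)),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 2)),
        )
        # After two pooling steps the mel axis is reduced by a factor of 4.
        self.lstm = nn.LSTM(input_size=32 * (n_mels // 4),
                            hidden_size=hidden_size, batch_first=True)
        self.classifier = nn.Linear(hidden_size, n_classes)

    def forward(self, x):
        # x: (batch, 1, n_mels, time_frames)
        feats = self.cnn(x)                     # (batch, 32, n_mels//4, time//4)
        feats = feats.permute(0, 3, 1, 2)       # (batch, time//4, 32, n_mels//4)
        feats = feats.flatten(start_dim=2)      # (batch, time//4, 32 * n_mels//4)
        _, (h_n, _) = self.lstm(feats)          # h_n: (1, batch, hidden_size)
        return self.classifier(h_n[-1])         # emotion logits per segment

# Example: a batch of 8 segments with 64 mel bands and 32 time frames.
# model = CnnLstmEmotionClassifier()
# logits = model(torch.randn(8, 1, 64, 32))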
Furthermore, we examine the possibility of combining the two approaches proposed in the
thesis and evaluate whether the models can be combined with other types of emotion
recognition, such as facial expression, image, or text.
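To make the Global Style Token mechanism more concrete, the sketch below shows a simplified
version of the idea: a bank of learnable style token embeddings is attended over, with a
reference embedding as the query, producing a style embedding whose attention weights make
the tokens interpretable. This is a simplification, not Tacotron's actual implementation,
and the number of tokens, embedding size, and head count are assumptions.

import torch
import torch.nn as nn

class GlobalStyleTokens(nn.Module):
    """Simplified GST layer: attention over a bank of learnable style tokens."""
    def __init__(self, n_tokens=10, embed_dim=256, n_heads=4):
        super().__init__()
        # Learnable style token embeddings shared across all utterances.
        self.tokens = nn.Parameter(torch.randn(n_tokens, embed_dim) * 0.1)
        self.attention = nn.MultiheadAttention(embed_dim, n_heads, batch_first=True)

    def forward(self, reference_embedding):
        # reference_embedding: (batch, embed_dim), e.g. produced by a reference
        # encoder that summarizes the prosody/style of a reference utterance.
        query = reference_embedding.unsqueeze(1)            # (batch, 1, embed_dim)
        keys = torch.tanh(self.tokens).unsqueeze(0)          # (1, n_tokens, embed_dim)
        keys = keys.expand(query.size(0), -1, -1)
        style, weights = self.attention(query, keys, keys)
        # style:   (batch, 1, embed_dim) conditions the synthesizer.
        # weights: (batch, 1, n_tokens) show how much each token contributes,
        # which is what makes the tokens interpretable.
        return style.squeeze(1), weights.squeeze(1)

# Example: a batch of 4 reference embeddings.
# gst = GlobalStyleTokens()
# style_emb, token_weights = gst(torch.randn(4, 256))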