Description
Devices running automatic speech recognition (ASR) software are
becoming ever more popular. These AI systems interpret spoken words to
execute voice commands or provide automatic transcriptions. Because
they are part of AI personal assistants like Amazon Alexa or Apple
Siri, which are often used to control Smart Home functions, their
security is an important issue. However, the technology many
state-of-the-art ASR systems rely on, end-to-end neural networks, is
inherently vulnerable to adversarial examples: specifically crafted,
small, and ideally imperceptible manipulations of a network's input
that enable an attacker to arbitrarily control the network's output.
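To make this concrete, the sketch below shows how such a targeted
perturbation is commonly found by gradient descent. The model, loss
function, and parameter values are illustrative assumptions, not the
setup used in this work.

import torch

def targeted_attack(model, x, target, loss_fn, steps=1000, lr=1e-3, eps=0.05):
    # Learn a small additive perturbation that steers the model's output
    # toward `target` while staying within a smallness budget `eps`.
    delta = torch.zeros_like(x, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = loss_fn(model(x + delta), target)  # push the output toward the target
        loss.backward()
        opt.step()
        with torch.no_grad():
            delta.clamp_(-eps, eps)               # keep the manipulation small
    return (x + delta).detach()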
In this work, we investigate the current possibilities of audio
adversarial examples, building on previous work in both the image and
audio domains. We use room impulse responses dynamically created by a
neural network to simulate physical environments during the generation
process and to harden our examples against the transformations they
undergo in over-the-air attacks.
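The following sketch illustrates how such a simulation can be folded
into the generation loop: every optimization step convolves the current
adversarial audio with a freshly sampled room impulse response before
the ASR loss is computed. The helpers sample_rir and asr_loss are
hypothetical placeholders for the RIR generator and the recognizer, not
the interfaces used in this work.

import torch
import torch.nn.functional as F

def simulate_room(audio, rir):
    # Convolve the 1-D signal with an impulse response so the optimizer
    # "hears" the attack as it would sound after playback in a room.
    a = audio.view(1, 1, -1)
    k = rir.flip(0).view(1, 1, -1)                # flip: conv1d computes cross-correlation
    return F.conv1d(a, k, padding=rir.numel() - 1).view(-1)

def robust_step(audio, delta, target_phrase, optimizer, sample_rir, asr_loss):
    # `delta` is the perturbation being optimized (requires_grad=True).
    optimizer.zero_grad()
    rir = sample_rir()                            # fresh simulated room each step
    distorted = simulate_room(audio + delta, rir) # what a microphone would record
    loss = asr_loss(distorted, target_phrase)     # push the ASR toward the target phrase
    loss.backward()
    optimizer.step()
    return loss.item()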
Furthermore, we use a psychoacoustic model of auditory masking, which
exploits the fact that strong signal components at certain frequencies
render small changes at adjacent frequencies completely imperceptible.
As a result, the attack is hidden in frequency components inaudible to
a human listener, yet the ASR transcribes a phrase clearly different
from the original.
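A full psychoacoustic model (such as the MPEG-1 model used in MP3
encoding) derives masking thresholds per frequency band; the crude
sketch below only illustrates the principle, using a fixed margin below
the original signal's spectrum as a stand-in for those thresholds and
penalizing any perturbation energy that would rise above them.

import torch

def masking_penalty(original, delta, n_fft=2048, margin_db=20.0):
    # Penalize only those perturbation components that exceed a crude
    # masking threshold placed `margin_db` below the original spectrum.
    window = torch.hann_window(n_fft)
    spec_x = torch.stft(original, n_fft, window=window, return_complex=True).abs()
    spec_d = torch.stft(delta, n_fft, window=window, return_complex=True).abs()
    threshold = spec_x / (10 ** (margin_db / 20))  # stand-in for the masking threshold
    excess = torch.clamp(spec_d - threshold, min=0.0)
    return excess.pow(2).mean()                    # zero if the change stays fully masked

Adding such a penalty to the recognition loss discourages the optimizer
from placing perturbation energy where a listener could notice it.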
Thanks to our improved attack mechanisms, the target phrase is
transcribed correctly even in physical environments under volatile
real-world conditions, while the attack remains less perceptible to
human listeners. We further evaluate this in a human study.