An Integrated Algorithm for Robust and Imperceptible Audio Adversarial Examples

Proc. 3rd Symposium on Security and Privacy in Speech Communication

Authors: Armin Ettenhofer, Jan-Philipp Schulze, and Karla Pizzi
Year/month: 2023/10
Booktitle: Proc. 3rd Symposium on Security and Privacy in Speech Communication
Fulltext: https://doi.org/10.21437/SPSC.2023-4

Abstract

Audio adversarial examples are audio files that have been manipulated to fool an automatic speech recognition (ASR) system while still sounding benign to a human listener. Most methods for generating such samples follow a two-step algorithm: first, a viable adversarial audio file is produced; then, it is fine-tuned with respect to perceptibility and robustness. In this work, we present an integrated algorithm that uses psychoacoustic models and room impulse responses (RIRs) in the generation step. The RIRs are created dynamically by a neural network during the generation process to simulate a physical environment, hardening our examples against the transformations experienced in over-the-air attacks. We compare the different approaches in three experiments: in a simulated environment and in a realistic over-the-air scenario to evaluate robustness, and in a human study to evaluate perceptibility. Our algorithms that consider psychoacoustics, either alone or in addition to robustness, show an improvement in the signal-to-noise ratio (SNR) as well as in the human perception study, at the cost of an increased word error rate (WER).
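
The two metrics named in the abstract can be illustrated with a minimal sketch. It is not taken from the paper: the function names `snr_db` and `wer` are hypothetical, the SNR treats the difference between the adversarial and the original audio as the noise, and the WER is the word-level edit distance divided by the number of reference words. The paper may define or measure either quantity differently.

```python
import math


def snr_db(original, adversarial):
    """SNR in dB, assuming the noise is (adversarial - original), sample by sample."""
    signal_power = sum(x * x for x in original)
    noise_power = sum((a - x) ** 2 for x, a in zip(original, adversarial))
    if noise_power == 0:
        return math.inf  # identical signals: no perturbation at all
    if signal_power == 0:
        return -math.inf  # silent original: any perturbation dominates
    return 10.0 * math.log10(signal_power / noise_power)


def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcription must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)
```

Under these assumptions, a higher SNR means a quieter perturbation relative to the original audio, and a lower WER against the attacker's target transcription means the ASR output is closer to that target.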

Bibtex:

@inproceedings{ettenhofer2023integrated,
  author = {Armin Ettenhofer and Jan-Philipp Schulze and Karla Pizzi},
  title = {An Integrated Algorithm for Robust and Imperceptible Audio Adversarial Examples},
  year = {2023},
  month = {October},
  booktitle = {Proc. 3rd Symposium on Security and Privacy in Speech Communication},
  url = {https://doi.org/10.21437/SPSC.2023-4}
}