Description
Numerous failure cases, past and recent, have been observed in which machine
learning systems achieve super-human performance on a task yet fail, often
astonishingly, to generalize to slightly different tasks. One reason for this
behavior is shortcut learning, in which shortcuts, i.e., spurious correlations
in a dataset, yield good training and test accuracy but cause significant
performance drops on out-of-distribution data. Because of the good training
and test performance, shortcuts often go unnoticed until the trained model is
deployed. These machine learning (ML) shortcuts limit the generalization
ability of even sophisticated models for two main reasons: 1) the spuriously
correlating features are salient enough that they are exploited and preferred
over more complex features, and 2) the learned shortcut features are absent
from unseen, real-world data.
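To make the failure mode concrete, the following is a minimal, self-contained
sketch (not taken from the thesis) of shortcut learning on synthetic data: a
spurious feature that matches the label on most training samples is preferred
over a noisier causal feature, and accuracy collapses once the correlation
disappears. The function names and the 95% correlation strength are
illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000

def make_split(shortcut_corr):
    # Illustrative synthetic data: `core` is a noisy but causal feature,
    # `shortcut` equals the label on a fraction `shortcut_corr` of samples.
    y = rng.integers(0, 2, n)
    core = y + rng.normal(0.0, 1.0, n)
    shortcut = np.where(rng.random(n) < shortcut_corr,
                        y, rng.integers(0, 2, n)).astype(float)
    return np.column_stack([core, shortcut]), y

X_train, y_train = make_split(0.95)  # spurious correlation holds in-distribution
X_ood, y_ood = make_split(0.0)       # and breaks on out-of-distribution data

clf = LogisticRegression().fit(X_train, y_train)
print("train accuracy:", clf.score(X_train, y_train))  # high: shortcut exploited
print("OOD accuracy:  ", clf.score(X_ood, y_ood))      # significant drop
```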
In this thesis, we will investigate different ML shortcuts and their impact on
deep learning models. On the one hand, we will use explainable AI methods to
develop a strategy for detecting shortcuts in image classification tasks
without a human in the loop. On the other hand, we will exploit the effect of
shortcuts on generalization ability and the fact that they are hard to
discover: ML shortcuts can be leveraged to prevent features from being learned
from datasets that contain proprietary data or personal information released
to the public without the data subject's consent.
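As an illustration of how an explainability-based detector could work (a
sketch under assumptions, not the strategy developed in the thesis):
input-gradient saliency maps are averaged per class, and a class is flagged
when its attribution mass concentrates in one small, fixed image region, a
typical signature of watermark- or tag-like shortcuts. The function names, the
32x32 input size, and the top-1%-of-pixels heuristic are assumptions.

```python
import torch

def input_gradient_saliency(model, x, y):
    # Simple XAI attribution: gradient of the class score w.r.t. the input.
    x = x.clone().requires_grad_(True)
    model(x).gather(1, y.unsqueeze(1)).sum().backward()
    return x.grad.abs().amax(dim=1)            # (B, H, W): max over channels

def shortcut_scores(model, loader, num_classes):
    """Score each class by how localized its average attribution is.

    A consistently localized attribution peak across a whole class suggests
    the model relies on a fixed cue (e.g. a watermark) rather than the object.
    """
    heat = torch.zeros(num_classes, 32, 32)    # assumes 32x32 inputs
    counts = torch.zeros(num_classes)
    for x, y in loader:
        s = input_gradient_saliency(model, x, y)
        for c in range(num_classes):
            mask = y == c
            if mask.any():
                heat[c] += s[mask].sum(dim=0)
                counts[c] += mask.sum()
    heat = (heat / counts.view(-1, 1, 1).clamp(min=1)).flatten(1)
    heat = heat / heat.sum(dim=1, keepdim=True)
    # Fraction of attribution mass in the top 1% of pixels; values near 1
    # indicate a fixed, highly localized shortcut cue.
    k = max(1, heat.shape[1] // 100)
    return heat.topk(k, dim=1).values.sum(dim=1)
```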
Ultimately, we propose a defense strategy that protects online databases
against web crawling: Providers such as dating platforms, clothing
manufacturers, or used car dealers have to deal with a professionalized
crawling industry that harvests and resells data points on a large scale. We
show that a deterrent can be created by deliberately adding ML shortcuts: the
augmented datasets become unusable for ML use cases, which removes the
incentive to crawl them. Using real-world data from three use cases, we show
that the proposed approach renders the collected data unusable for model
training while remaining imperceptible to humans, and can thus serve as
proactive protection against data crawling.
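The following sketch shows one way such a deterrent could be embedded (an
illustrative assumption, not the exact perturbation used in the thesis): a
faint, label-correlated noise pattern is added to each image before
publication. At an amplitude of roughly 2/255 the change stays below human
perception, yet a model trained on crawled copies latches onto the pattern
instead of the genuine features.

```python
import numpy as np

def add_shortcut(image, label, amplitude=2.0, seed=0):
    """Embed a faint, label-correlated pattern before publishing an image.

    `image` is an HxWx3 uint8 array. Every record with the same label gets
    the same fixed pseudo-random +/-1 pattern, so models trained on crawled
    data learn the pattern as a shortcut, while the low amplitude (~2/255)
    keeps the change imperceptible. Names and values are illustrative.
    """
    rng = np.random.default_rng(seed + label)   # one fixed pattern per label
    pattern = rng.choice([-1.0, 1.0], size=image.shape)
    out = image.astype(np.float32) + amplitude * pattern
    return np.clip(out, 0, 255).astype(np.uint8)
```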