A Study of Detecting Computer Viruses in Real-Infected Files in the n-gram Representation with Machine Learning Methods

Authors: Thomas Stibor
Year/month: 2010/
Booktitle: A Study of Detecting Computer Viruses in Real-Infected Files in the n-gram Representation with Machine Learning Methods
Series: Lecture Notes in Artificial Intelligence
Publisher: Springer-Verlag
Fulltext: iea.aie.final.extended.pdf

Abstract

Machine learning methods have been applied successfully in recent years to detect new and unseen computer viruses. The viruses were, however, detected in small virus loader files and not in real infected executable files. We created data sets of benign files, virus loader files, and real infected executable files and represented the data as collections of n-grams. Histograms of the relative frequencies of the n-gram collections indicate that detecting viruses in real infected executable files with machine learning methods is nearly impossible in the n-gram representation. This statement is underpinned by exploring the n-gram representation from an information-theoretic perspective and, empirically, by performing classification experiments with machine learning methods.
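
As a rough illustration of the n-gram representation mentioned in the abstract, the following minimal sketch computes relative n-gram frequencies over a byte sequence and the Shannon entropy of the resulting distribution. It is not the paper's implementation: the sliding byte window, the default n = 2, and the function names `ngram_relative_frequencies` and `shannon_entropy` are assumptions made here for illustration only.

```python
import math
from collections import Counter


def ngram_relative_frequencies(data: bytes, n: int = 2) -> dict:
    """Return the relative frequency of every byte n-gram in `data`.

    Assumption (not from the paper): n-grams are taken with a sliding
    window that advances one byte at a time.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    total = len(data) - n + 1  # number of windows of length n
    if total <= 0:
        return {}
    counts = Counter(data[i:i + n] for i in range(total))
    return {gram: count / total for gram, count in counts.items()}


def shannon_entropy(freqs: dict) -> float:
    """Shannon entropy (in bits) of a relative-frequency distribution."""
    return -sum(p * math.log2(p) for p in freqs.values() if p > 0)


if __name__ == "__main__":
    # Hypothetical byte string standing in for the contents of a file.
    sample = b"abab"
    freqs = ngram_relative_frequencies(sample, n=2)
    for gram, p in sorted(freqs.items()):
        print(gram, round(p, 3))
    print("entropy (bits):", round(shannon_entropy(freqs), 3))
```

Comparing such frequency histograms between benign and infected files is the kind of inspection the abstract refers to. When a small virus body is embedded in a large host executable, it contributes only a small share of the n-grams, so one would expect the two distributions to look very similar.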

Bibtex:

@inproceedings { Stibor:2010,
author = { Thomas Stibor},
title = { A Study of Detecting Computer Viruses in Real-Infected Files in the n-gram Representation with Machine Learning Methods },
year = { 2010 },
booktitle = { A Study of Detecting Computer Viruses in Real-Infected Files in the n-gram Representation with Machine Learning Methods },
series = { Lecture Notes in Artificial Intelligence },
publisher = { Springer-Verlag },
url = {https://www.sec.in.tum.de/i20/publications/a-study-of-detecting-computer-viruses-in-real-infected-files-in-the-n-gram-representation-with-machine-learning-methods/@@download/file/iea.aie.final.extended.pdf}
}