Detection of Software Vulnerabilities using Fine-Tuned Large Language Models

Supervisor(s): Daniel Kowatsch, Tobias Specht
Status: finished
Topic: Others
Author: Fami Mahmud
Submission: 2024-11-12
Type of Thesis: Master's Thesis
Thesis topic in co-operation with the Fraunhofer Institute for Applied and Integrated Security AISEC, Garching

Description

Traditional and deep learning methods have been widely used for detecting software
vulnerabilities, but they often struggle to provide consistent and reliable automated
solutions. Recent advancements in large language models (LLMs) have demonstrated remarkable
capabilities in understanding complex patterns within both natural language
and code. This thesis explores the potential of fine-tuning state-of-the-art open-source
LLMs for the specific task of vulnerability detection.

A major challenge in training models for this purpose is the limited availability of
high-quality, large-scale datasets. To address this, we construct an extensive and
well-curated dataset by combining existing real-world and synthetic datasets. Careful
selection, preprocessing, merging, and additional cleaning yield a dataset designed to
better support effective training and evaluation of LLMs for vulnerability detection.
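
The merging and cleaning step can be illustrated with a minimal sketch. It is not the
thesis's actual pipeline: the record schema (`code`, `label`), the helper names
`normalize` and `merge_datasets`, and the policy of dropping samples whose labels
disagree across sources are all assumptions made for illustration.

```python
import hashlib
import re


def normalize(code: str) -> str:
    """Strip C-style comments and collapse whitespace (hypothetical helper).

    Deliberately naive: a '//' inside a string literal is also treated as a comment.
    """
    code = re.sub(r"/\*.*?\*/", " ", code, flags=re.DOTALL)  # block comments
    code = re.sub(r"//[^\n]*", " ", code)                    # line comments
    return " ".join(code.split())


def merge_datasets(*sources):
    """Merge labeled samples from several sources (assumed schema: code, label).

    Exact duplicates after normalization are kept once. Samples that appear
    with contradictory labels are dropped entirely (an assumed cleaning policy).
    """
    merged = {}
    conflicts = set()
    for source in sources:
        for sample in source:
            digest = hashlib.sha256(normalize(sample["code"]).encode("utf-8"))
            key = digest.hexdigest()
            if key in conflicts:
                continue
            previous = merged.get(key)
            if previous is None:
                merged[key] = sample
            elif previous["label"] != sample["label"]:
                del merged[key]
                conflicts.add(key)
    return list(merged.values())
```

Hashing the normalized text only catches near-verbatim copies; detecting semantically
equivalent functions would need a stronger similarity measure than this sketch uses.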

We evaluate the effectiveness of fine-tuned LLMs in detecting vulnerabilities in
software code, comparing models of different sizes and architectures. Furthermore,
we analyze the impact of dataset composition, examining how class balance and data
complexity affect model performance.
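
Why class balance matters for evaluation can be shown with a small sketch. The
function `binary_metrics` and the label lists are hypothetical and purely
illustrative; the abstract does not state which metrics the thesis reports.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for labels in {0, 1}.

    Here 1 means "vulnerable" and 0 means "benign" (an assumed convention).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up imbalanced test set: 95 benign samples, 5 vulnerable ones.
y_true = [0] * 95 + [1] * 5
# A degenerate classifier that labels everything as benign.
y_pred = [0] * 100
scores = binary_metrics(y_true, y_pred)
```

On this made-up split the degenerate classifier reaches an accuracy of 0.95 while its
recall and F1 are both 0.0, which is why accuracy alone says little on imbalanced data.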

Our results indicate that while fine-tuned LLMs can learn certain patterns and
identify some vulnerabilities with high confidence, distinguishing more complex cases
remains a challenge, particularly when subtle code modifications are crucial for identifying
vulnerabilities. Additionally, we observe that model size alone is not a determining
factor for improved performance in vulnerability detection, highlighting the importance
of dataset size and quality for task-specific fine-tuning.