Description
Graph-based code analysis systems are a versatile tools for reasoning about the correctness of complex software projects. One area in which they are widely used is in source code auditing: Security vulnerabilities, for example using cryptographic functions with insecure algorithms, can be introduced by coding patterns that spread over the boundaries of several methods, classes or even files in the project. This is where graph-based analysis makes finding these vulnerabilities easier, by creating a framework where the source code can be represented as a graph and vulnerable patterns can be found by performing efficient graph-based pattern matching queries. There are several graph representations for source code, but this thesis is based on the Code Property Graph (CPG) approach, as it conveniently combines a program's Abstract Syntax Tree (AST), Control Flow Graph (CFG) and Data Flow Graph (DFG) in a single data structure. One drawback of this representation is that its generation is a complex task that can be very time consuming for larger projects. Considering that security analyses are desirably carried out repeatedly, to find out if recent changes introduced new vulnerabilities, long graph generation times may slow down the feedback cycle. This is why this thesis presents an improved CPG generation process that can take advantage of the fact that most of the changes to a project will not affect the whole program, but will rather have a limited scope: Portions of the CPG generated for the previous code version can be reused, if the corresponding code did not change. The actual file changes are then modeled as incremental updates to the graph, which avoids having to re-generate the CPG for all files in case only a few have been modified. This technique has been evaluated by analyzing the source code of the CPG implementation that serves as the basis for this thesis, an actively maintained project containing over 200 Java files: While the previous implementation took 60 minutes to sequentially parse 50 commits in this repository, the new incremental approach showed a performance improvement of nearly 80%, taking only 13 minutes for the same commit range.
|