A Metric for Evaluating Static Analysis Tools

1 A Metric for Evaluating Static Analysis Tools
Katrina Tsipenyuk, Fortify Software
Brian Chess, Fortify Software

2 Four Perspectives on the Problem
- General: How good are software security tools today?
- Tools vendor:
  - Is my static analysis product getting better over time? (What is better?)
  - How much has it improved since the last release?
  - What should I focus on to improve my tool in the future?
  - If I make my tool detect a new kind of security bug, will an auditor or a developer thank me? Or both?
- Tools user, auditor: Is the tool finding all the important types of security bugs?
- Tools user, developer: Is the tool producing a lot of noise?
- Auditors and developers have different criteria for security tools, so we need a way to answer the posed questions on two scales: "Auditor" and "Developer"

3 Proposed Solution
- Define metrics that model tool characteristics and conjecture a formula for calculating the score for each tool version
  - Counts of true positives (t), false positives (p), and false negatives (n)
  - 100 * t / (t + p + n), augmented by the weights and penalties (score out of 100); a sketch of this calculation follows this list
- Define weights & penalties for reported results
  - Results with different reported severities should be weighed differently: High (h), Medium (m), and Low (l)
  - False negative penalties per bug category should differ depending on whether the tool claims to detect this kind of bug or not
- Define weights & penalties for the "Auditor" and "Developer" scales
  - Auditors tolerate false positives while developers tolerate false negatives; make the false positive and false negative weights different to reflect this
  - Importance & value of a vulnerability category (vc) for auditors & developers should affect the weights of the results
- Conduct an experiment and collect the necessary data to prove or disprove the conjecture
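The exact way the weights and penalties enter the formula is not spelled out above, so the Python sketch below is only one illustrative reading: the function names and the idea of weighting each finding before taking the ratio are assumptions, not the actual Fortify implementation.

```python
# Hypothetical sketch of the proposed score. The base formula from the slide is
# 100 * t / (t + p + n); how the weights and penalties augment it is assumed
# here (each finding contributes a weight instead of a raw count of 1).

def base_score(t, p, n):
    """Score out of 100 from raw counts of true positives (t),
    false positives (p), and false negatives (n)."""
    total = t + p + n
    return 100.0 * t / total if total else 100.0

def weighted_score(tp_weights, fp_weights, fn_weights):
    """Weighted variant: each argument is a list of per-finding weights
    (severity * category importance * scale penalty; see Tables 1-4 on slide 4)."""
    t, p, n = sum(tp_weights), sum(fp_weights), sum(fn_weights)
    total = t + p + n
    return 100.0 * t / total if total else 100.0

# Example: 8 true positives, 3 false positives, 5 false negatives, unweighted.
print(base_score(8, 3, 5))  # 50.0
```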

4 Experiment
- Analyzed three different projects: wuftpd (C), webgoat (Java), and securibench (Java)
- Ran four versions of the Fortify tool
- Did a full audit of reported results for all product / version combinations (time consuming)
- Defined weights based on our experiences with auditors and developers (encoded in the sketch after the tables below)
  - Table 1 presents chosen weights & penalties for true positives (t), false positives (p), and false negatives (n) based on high-value (high vc) and low-value (low vc) categories
  - Table 2 presents the false negatives penalty per bug category based on whether the tool claims to detect the category or not
  - Table 3 presents High (h), Medium (m), and Low (l) severity weights
  - Table 4 presents false positives (p) and false negatives (n) penalties for the "Auditor" and "Developer" scales

Table 1. Penalties with respect to category importance (applied to t, p, and n)
  Important       2
  Not important   0.5

Table 2. False negatives penalty based on whether the tool claims to detect the category or not
  Claims to detect   1
  Doesn't detect     0.5

Table 3. Severity weights
  High (h)     4
  Medium (m)   2
  Low (l)      1

Table 4. "Auditor" vs. "Developer" scale penalties
              FP (p)   FN (n)
  Auditor     0.5      2
  Developer
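For concreteness, the sketch below encodes the weights and penalties from Tables 1-4 as plain data and combines them into a single per-finding weight. Multiplying the factors together is an assumption made for illustration (the slides do not say how the factors compose), and the Developer row of Table 4 is omitted because its values do not appear above.

```python
# Weights and penalties from Tables 1-4 on slide 4. Multiplying them into one
# per-finding weight is an assumption made for illustration only.

SEVERITY_WEIGHT = {"high": 4, "medium": 2, "low": 1}       # Table 3
CATEGORY_WEIGHT = {"important": 2, "not_important": 0.5}   # Table 1
FN_CLAIM_PENALTY = {True: 1, False: 0.5}                   # Table 2: claims to detect / doesn't
SCALE_PENALTY = {                                          # Table 4 (Auditor row only;
    "auditor": {"fp": 0.5, "fn": 2},                       # Developer row not shown above)
}

def finding_weight(kind, severity, importance, claims_to_detect, scale):
    """Per-finding weight for kind in {"tp", "fp", "fn"} (assumed multiplicative)."""
    w = SEVERITY_WEIGHT[severity] * CATEGORY_WEIGHT[importance]
    if kind == "fp":
        w *= SCALE_PENALTY[scale]["fp"]
    elif kind == "fn":
        w *= SCALE_PENALTY[scale]["fn"] * FN_CLAIM_PENALTY[claims_to_detect]
    return w

# Example: a missed high-severity, important bug that the tool claims to
# detect, weighted on the Auditor scale: 4 * 2 * 2 * 1 = 16.
print(finding_weight("fn", "high", "important", True, "auditor"))  # 16
```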

5 Experimental Results & Analysis
- Collected data seems to indicate that we are headed in the right direction
- Both scores for wuftpd get higher until version 3.1
  - The number of false positives decreases, but in version 3.1 it increases
- wuftpd "Developer" score is lower than "Auditor" score for all four versions
  - "Developer" false positives penalty is higher -- the tool is tuned better for Java than for C
  - After all, Fortify is a security company
- webgoat "Developer" score drops between versions 3.1 and 3.5
  - With the addition of multiple auditor-oriented categories
- Both scores are best for the latest release examined (whew)

Auditor and Developer scores by tool version, for webgoat, securibench, and wuftpd:
                    Version 2.1   Version 3.0   Version 3.1   Version 3.5
  Auditor score          7             8            38            94
  Developer score        2            16            78            72
                         6            10            15            55
                        59            43            34            40
                        12            27            14            18

(complete set of data for one experiment is available as a handout)

6 Conclusions & Future Work
- The proposed approach is useful for our purposes: measuring improvements of the Fortify static analyzer
- It is unclear whether the same approach would be useful for comparing two different tools
- Determining an "answer key" against which to grade the tool's results is still a hard problem
- On our to-do list:
  - Do more audits of various projects to collect more data and adjust the weights and penalties
    - Include projects written in other languages the tool supports
  - Experiment with additional weights and penalties
    - Introduce a penalty for incorrectly reporting the severity of results
  - Define a good visual representation of the collected data
    - Make it intuitive to determine the area that needs improvement

