# A Metric for Evaluating Static Analysis Tools

Katrina Tsipenyuk, Fortify Software; Brian Chess, Fortify Software


## 1 Four Perspectives on the Problem

- General
  - How good are software security tools today?
- Tools vendor
  - Is my static analysis product getting better over time? (What is better?)
  - How much has it improved since the last release?
  - What should I focus on to improve my tool in the future?
  - If I make my tool detect a new kind of security bug, will an auditor or a developer thank me? Or both?
- Tools user: auditor
  - Is the tool finding all the important types of security bugs?
- Tools user: developer
  - Is the tool producing a lot of noise?

Auditors and developers have different criteria for security tools, so we need a way to answer posed questions on two scales: “Auditor” and “Developer”.

## 2 Proposed Solution

- Define metrics that model tool characteristics and conjecture a formula for calculating the score for each tool version
  - Counts of true positives (t), false positives (p), and false negatives (n)
  - 100 * t / (t + p + n), augmented by the weights and penalties (score out of 100)
- Define weights & penalties for reported results
  - Results with different reported severities should be weighed differently: High (h), Medium (m), and Low (l)
  - False negatives penalties per bug category should differ depending on whether the tool claims to detect this kind of bug or not
- Define weights & penalties for “Auditor” and “Developer” scale
  - Auditors tolerate false positives while developers tolerate false negatives; make false positive and false negative weights different to reflect this
  - Importance & value of a vulnerability category (v_c) for auditors & developers should affect the weights of the results
- Conduct an experiment and collect the necessary data to prove or disprove the conjecture
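As a minimal sketch, the unweighted starting point 100 * t / (t + p + n) can be written as a small function. The function name `base_score` and the sample counts are illustrative assumptions, not taken from the slides, and the weights and penalties are deliberately left out here.

```python
def base_score(t: int, p: int, n: int) -> float:
    """Unweighted score out of 100: 100 * t / (t + p + n).

    t, p, n are counts of true positives, false positives, and
    false negatives. The name and signature are hypothetical.
    """
    total = t + p + n
    if total == 0:
        # The slides do not say how an empty result set is scored.
        raise ValueError("no results to score")
    return 100.0 * t / total


# Illustrative counts only: 30 TP, 10 FP, 10 FN.
print(base_score(t=30, p=10, n=10))  # 60.0
```

A tool with no false positives and no false negatives scores 100, and every false positive or false negative pulls the score down by enlarging the denominator.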

## 3 Experiment

- Analyzed three different projects: wuftpd (C), webgoat (Java), and securibench (Java)
- Ran four versions of Fortify tool
- Did a full audit of reported results for all product / version combinations (time consuming)
- Defined weights based on our experiences with auditors and developers
  - Table 1 presents chosen weights & penalties for true positives (t), false positives (p), and false negatives (n) based on high-value (high v_c) and low-value (low v_c) categories
  - Table 2 presents false negatives penalty per bug category based on whether the tool claims to detect the category or not
  - Table 3 presents High (h), Medium (m), and Low (l) severity weights
  - Table 4 presents false positives (p) and false negatives (n) penalties for “Auditor” and “Developer” scales

Table 1. Penalties with respect to category importance

| Category | TP (t) | FP (p) | FN (n) |
|---|---|---|---|
| Important | 2 | 0.5 | 2 |

The “Not important” row survives in the transcript only as the fused value 0.52; its split across the three columns is not recoverable.

Table 2. False negatives penalty based on whether the tool claims to detect the category or not

| Claims to detect | Doesn’t detect |
|---|---|
| 1 | 0.5 |

Table 3. Severity weights

| High (h) | Medium (m) | Low (l) |
|---|---|---|
| 4 | 2 | 1 |

Table 4. “Auditor” vs. “Developer” scales penalties

| Scale | FP (p) | FN (n) |
|---|---|---|
| Auditor | 0.5 | 2 |
| Developer | 2 | 0.5 |
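The slides give the weights but not the exact way they are combined, so the following is only one possible reading, sketched under stated assumptions: severity weights (Table 3) scale true and false positives by reported severity, the claim penalty (Table 2) scales false negatives, and the scale penalties (Table 4) multiply the false positive and false negative terms. Table 1 is left out because its values are not fully legible in the transcript. The function name, argument shapes, and sample counts are all hypothetical.

```python
# Values as read from Tables 2-4; how they combine is an assumption.
SEVERITY_WEIGHT = {"high": 4, "medium": 2, "low": 1}      # Table 3
FN_CLAIM_PENALTY = {True: 1.0, False: 0.5}                # Table 2
SCALE_PENALTY = {                                         # Table 4
    "auditor": {"fp": 0.5, "fn": 2.0},
    "developer": {"fp": 2.0, "fn": 0.5},
}


def weighted_score(true_pos, false_pos, false_neg, scale):
    """Hypothetical weighted variant of 100 * t / (t + p + n).

    true_pos, false_pos: dict mapping reported severity -> count.
    false_neg: list of (count, tool_claims_to_detect) pairs.
    scale: "auditor" or "developer".
    """
    penalty = SCALE_PENALTY[scale]
    t = sum(SEVERITY_WEIGHT[s] * c for s, c in true_pos.items())
    p = penalty["fp"] * sum(SEVERITY_WEIGHT[s] * c for s, c in false_pos.items())
    n = penalty["fn"] * sum(FN_CLAIM_PENALTY[claims] * c for c, claims in false_neg)
    total = t + p + n
    return 100.0 * t / total if total else 0.0


# Illustrative counts only, not data from the experiment.
tp = {"high": 5, "medium": 10, "low": 0}
fp = {"high": 2, "low": 8}
fn = [(4, True), (8, False)]

print(weighted_score(tp, fp, fn, "auditor"))    # 62.5
print(weighted_score(tp, fp, fn, "developer"))
```

With the same audit data, the “Developer” scale penalizes false positives more heavily, so a noisy tool scores lower for developers than for auditors, which matches the intent stated for the two scales.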

## 4 Experimental Results & Analysis

- Collected data seems to indicate that we are headed in the right direction
- Both scores for wuftpd get higher until version 3.1
  - The number of false positives decreases, but in version 3.1 it increases
- wuftpd “Developer” score is lower than “Auditor” score for all four versions
  - “Developer” false positives penalty is higher; tool is tuned better for Java than for C
  - After all, Fortify is a security company
- webgoat “Developer” score drops between versions 3.1 and 3.5
  - With the addition of multiple auditor-oriented categories
- Both scores are best for latest release examined (whew)

| Project | Score | Version 2.1 | Version 3.0 | Version 3.1 | Version 3.5 |
|---|---|---|---|---|---|
| wuftpd | Auditor | 38 | 43 | 34 | 40 |
| wuftpd | Developer | 12 | 27 | 14 | 18 |

The webgoat and securibench rows are fused in the transcript and cannot be split into per-version values unambiguously: Auditor score 783894 and Developer score 2167872 for the first project, Auditor score 861072 and Developer score 6155559 for the second (project labels appear in the order webgoat, securibench, wuftpd).

(complete set of data for one experiment is available as a handout)

## 5 Conclusions & Future Work

- Proposed approach is useful for our purposes: measuring improvements of Fortify static analyzer
  - It is unclear whether the same approach would be useful for comparing two different tools
- Determining an “answer key” to grade the results of the tool with is still a hard problem
- On our to-do list:
  - Do more audits of various projects to collect more data to adjust the weights and penalties
    - Include projects written for other languages the tool supports
  - Experiment with additional weights and penalties
    - Introduce penalty for incorrectly reporting severity of results
  - Define a good visual representation of the collected data
    - Make it intuitive to determine the area that needs improvement
