ESEM | October 9, 2008

On Establishing a Benchmark for Evaluating Static Analysis Prioritization and Classification Techniques
Sarah Heckman and Laurie Williams
Department of Computer Science, North Carolina State University

ESEM | October 9, 2008
Contents

Motivation
Research Objective
FAULTBENCH
Case Study
– False Positive Mitigation Models
– Results
Future Work

ESEM | October 9, 2008
Motivation

Static analysis tools identify potential anomalies early in the development process.
– They can generate an overwhelming number of alerts.
– Alert inspection is required to determine whether the developer should fix each alert.
Actionable: an important anomaly the developer wants to fix (a true positive, TP).
Unactionable: an unimportant or inconsequential alert (a false positive, FP).
FP mitigation techniques can prioritize or classify alerts after static analysis is run.

ESEM | October 9, 2008
Research Objective

Problem
– Several false positive mitigation models have been proposed.
– It is difficult to compare and evaluate the different models.
Research objective: to propose the FAULTBENCH benchmark to the software anomaly detection community for the comparison and evaluation of false positive mitigation techniques.

ESEM | October 9, 2008
FAULTBENCH Definition [1]

Motivating comparison: find the static analysis FP mitigation technique that correctly prioritizes or classifies actionable and unactionable alerts.
Research questions:
– Q1: Can alert prioritization improve the rate of anomaly detection when compared to the tool's output?
– Q2: How does the rate of anomaly detection compare between alert prioritization techniques?
– Q3: Can alert categorization correctly predict actionable and unactionable alerts?

ESEM | October 9, 2008
FAULTBENCH Definition [1] (2)

Task sample: a representative sample of tests that FP mitigation techniques should solve.
– Sample programs
– Oracles of FindBugs alerts (actionable or unactionable)
– Source code changes for each fix (for adaptive FP mitigation techniques)

ESEM | October 9, 2008
FAULTBENCH Definition [1] (3)

Evaluation measures: metrics used to evaluate and compare FP mitigation techniques
Prioritization
– Spearman rank correlation
Classification
– Precision
– Recall
– Accuracy
– Area under the anomaly detection rate curve

                          Actual Actionable       Actual Unactionable
Predicted Actionable      True Positive (TP_C)    False Positive (FP_C)
Predicted Unactionable    False Negative (FN_C)   True Negative (TN_C)
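As a concrete illustration of the classification measures above, the following sketch (not from the original study; the function name and example counts are illustrative) computes precision, recall, and accuracy from the confusion-matrix counts TP_C, FP_C, FN_C, and TN_C.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute the FAULTBENCH classification measures from
    confusion-matrix counts (TP_C, FP_C, FN_C, TN_C)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0   # predicted actionable that really are
    recall    = tp / (tp + fn) if (tp + fn) else 0.0   # actual actionable that were found
    accuracy  = (tp + tn) / (tp + fp + fn + tn)        # overall agreement with the oracle
    return precision, recall, accuracy

# Example: 12 alerts correctly predicted actionable, 3 false alarms,
# 5 missed actionable alerts, 30 correctly ignored unactionable alerts.
p, r, a = classification_metrics(tp=12, fp=3, fn=5, tn=30)
print(f"precision={p:.2f} recall={r:.2f} accuracy={a:.2f}")
```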

ESEM | October 9, 2008
Subject Selection

Selection criteria:
– Open source
– Various domains
– Small
– Java
– SourceForge
– Small, commonly used libraries and applications

ESEM | October 9, 2008
FAULTBENCH v0.1 Subjects

Subject                    Domain         # Dev.   # LOC   # Alerts   Maturity   Alert Dist. Area
csvobjects                 Data format                                Prod
importscrubber             Software dev                               Beta
iTrust                     Web                                        Alpha
jbook                      Edu                                        Prod
jdom                       Data format                                Prod
org.eclipse.core.runtime   Software dev                               Prod

ESEM | October 9, 2008
Subject Characteristics Visualization

ESEM | October 9, 2008
FAULTBENCH Initialization

Alert oracle: classification of alerts as actionable or unactionable
– Read the alert description generated by FindBugs
– Inspect the surrounding code and comments
– Search message boards
Alert fixes
– Changes required to fix each alert
– Minimize alert closures and creations
Experimental controls
– Optimal ordering of alerts
– Random ordering of alerts
– Tool ordering of alerts
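The deck does not spell out how the experimental control orderings are generated. The sketch below is my construction, not the authors': it builds the three controls from the tool's output and the alert oracle, assuming the optimal ordering simply moves oracle-actionable alerts ahead of unactionable ones while preserving the tool's relative order, which matches the bias acknowledged on the limitations slide.

```python
import random

def control_orderings(alerts, oracle, seed=0):
    """Build the three experimental control orderings.
    alerts: the tool-ordered list of alert ids.
    oracle: maps alert id -> True (actionable) / False (unactionable).
    Assumption: the optimal ordering keeps the tool's relative order
    within the actionable and unactionable groups."""
    tool = list(alerts)                               # tool ordering: as reported by the tool
    optimal = ([a for a in tool if oracle[a]]         # all actionable alerts first...
               + [a for a in tool if not oracle[a]])  # ...then the unactionable ones
    rnd = list(alerts)
    random.Random(seed).shuffle(rnd)                  # reproducible random ordering
    return {"optimal": optimal, "random": rnd, "tool": tool}
```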

ESEM | October 9, 2008
FAULTBENCH Process

1. For each subject program:
   1.1. Run static analysis on the clean version of the subject.
   1.2. Record the original state of the alert set.
   1.3. Prioritize or classify alerts with the FP mitigation technique.
2. Inspect each alert, starting at the top of the prioritized list or by randomly selecting an alert predicted as actionable:
   2.1. If the oracle says actionable, fix it with the specified code change.
   2.2. If the oracle says unactionable, suppress the alert.
3. After each inspection, record the alert set state and rerun the static analysis tool.
4. Evaluate the results via the evaluation metrics.
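A minimal in-memory sketch of the process above. The parameter names and the simplification that fixing or suppressing an alert closes exactly that alert (with no new alerts created) are assumptions; a real run would rerun FindBugs and edit the subject's source after every inspection.

```python
def run_faultbench(initial_alerts, oracle, prioritize):
    """Self-contained simulation of the FAULTBENCH process for one subject.
    initial_alerts: alert ids from a clean run of the tool (step 1.1).
    oracle: alert id -> True (actionable) / False (unactionable).
    prioritize: the FP mitigation technique under evaluation; takes the open
    alert set and the feedback so far and returns a ranked list."""
    open_alerts = set(initial_alerts)                 # step 1.2: original alert-set state
    feedback, inspected = [], []

    while open_alerts:
        ranked = prioritize(open_alerts, feedback)    # step 1.3 (and adaptive re-ranking)
        alert = ranked[0]                             # step 2: take the top-ranked alert
        actionable = oracle[alert]
        feedback.append((alert, "fixed" if actionable else "suppressed"))  # steps 2.1 / 2.2
        inspected.append(actionable)                  # True when a real anomaly was found
        open_alerts.remove(alert)                     # step 3: alert closed, state recorded

    return inspected                                  # step 4: input to the evaluation metrics
```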

ESEM | October 9, 2008
Case Study Process

1. Open the subject program in Eclipse:
   1.1. Run FindBugs on the clean version of the subject.
   1.2. Record the original state of the alert set.
   1.3. Prioritize alerts with a version of AWARE-APM.
2. Inspect each alert, starting at the top of the prioritized list:
   2.1. If the oracle says actionable, fix it with the specified code change.
   2.2. If the oracle says unactionable, suppress the alert.
3. After each inspection, record the alert set state. FindBugs should run automatically.
4. Evaluate the results via the evaluation metrics.

ESEM | October 9, 2008
AWARE-APM

Adaptively prioritizes and classifies static analysis alerts by the likelihood that an alert is actionable.
Uses alert characteristics, alert history, and size information to prioritize alerts.
(figure: ranking scale from 0 to 1 spanning unactionable, unknown, and actionable alerts)

ESEM | October 9, 2008
AWARE-APM Concepts

Alert Type Accuracy (ATA): based on the alert's type
Code Locality (CL): based on the location of the alert at the source folder, class, and method level
Both measure the likelihood that an alert is actionable based on developer feedback:
– Alert closure: the alert is no longer identified by the static analysis tool.
– Alert suppression: an explicit action by the developer to remove the alert from the listing.
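The deck names the ATA and CL factors but not the ranking formula, so the sketch below is only an assumed illustration: each factor is estimated as the fraction of fixed (rather than suppressed) alerts among the feedback sharing the alert's type or code location, unknown factors default to 0.5, and the two factors are averaged. Alerts are assumed to expose `type` and `location` attributes; the real AWARE-APM weighting and history handling may differ.

```python
from collections import defaultdict

def rank_alerts(alerts, feedback):
    """Illustrative AWARE-APM-style ranking: each alert gets a score in [0, 1],
    higher meaning more likely actionable. feedback is a list of
    (alert, "fixed" or "suppressed") pairs from earlier inspections."""
    type_hist = defaultdict(lambda: [0, 0])   # alert type -> [times fixed, times seen]
    loc_hist = defaultdict(lambda: [0, 0])    # location   -> [times fixed, times seen]
    for alert, action in feedback:            # fixes raise a factor, suppressions lower it
        for hist, key in ((type_hist, alert.type), (loc_hist, alert.location)):
            hist[key][0] += (action == "fixed")
            hist[key][1] += 1

    def factor(hist, key):
        fixed, seen = hist.get(key, (0, 0))
        return fixed / seen if seen else 0.5  # 0.5 = unknown (no feedback yet)

    def score(alert):                         # simple unweighted average of ATA and CL
        return (factor(type_hist, alert.type) + factor(loc_hist, alert.location)) / 2

    return sorted(alerts, key=score, reverse=True)  # most likely actionable first
```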

ESEM | October 9, 2008
Rate of Anomaly Detection Curve

Subject   Optimal   Random   ATA      CL       ATA+CL   Tool
jdom      91.82%    71.66%   86.16%   63.54%   85.35%   46.89%
Average   87.58%    61.73%   72.57%   53.94%   67.88%   50.42%

(figure: rate of anomaly detection curves for jdom)
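A sketch of how a rate of anomaly detection curve and its area could be computed from an inspection sequence. The normalization and trapezoidal construction here are assumptions for illustration; the paper defines the exact measure.

```python
def detection_rate_curve(inspected):
    """inspected: sequence of oracle outcomes (True = actionable) in the order a
    technique had the alerts inspected. Returns the cumulative fraction of all
    actionable alerts found after each inspection."""
    total = sum(inspected)
    found, curve = 0, []
    for actionable in inspected:
        found += actionable
        curve.append(found / total if total else 0.0)
    return curve

def area_under_curve(curve):
    """Trapezoidal-rule area under the detection rate curve, normalized over the
    inspection axis; values near 1 mean actionable alerts were found early."""
    n = len(curve)
    if n < 2:
        return curve[0] if curve else 0.0
    return sum((curve[i] + curve[i + 1]) / 2 for i in range(n - 1)) / (n - 1)

# Example with 6 inspections, 3 of them actionable:
print(f"{area_under_curve(detection_rate_curve([True, True, False, True, False, False])):.2%}")
```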

ESEM | October 9, 2008
Spearman Rank Correlation

Subject                    ATA       CL        ATA+CL    Tool
csvobjects
importscrubber             0.512**
iTrust                     0.418**   0.264**   0.261**   0.772**
jbook                      0.798**   0.389**   0.599**
jdom                       0.675**   0.288*    0.457**   0.724**
org.eclipse.core.runtime   0.395**   0.325**   0.246*    0.691**

* Significant at the 0.05 level   ** Significant at the 0.01 level
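For reference, the Spearman rank correlation between a technique's ordering and the optimal ordering can be computed with SciPy; the ranks below are hypothetical and only illustrate the calculation.

```python
from scipy.stats import spearmanr

# Hypothetical ranks for six alerts (1 = inspected first): a technique's
# ordering compared against the optimal ordering.
technique_ranks = [1, 3, 2, 5, 4, 6]
optimal_ranks   = [1, 2, 3, 4, 5, 6]

rho, p_value = spearmanr(technique_ranks, optimal_ranks)
print(f"Spearman rho = {rho:.3f}, p = {p_value:.4f}")  # compare p against the 0.05 / 0.01 levels
```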

ESEM | October 9, 2008
Classification Evaluation Measures

                           Average Precision        Average Recall           Average Accuracy
Subject                    ATA    CL    ATA+CL      ATA    CL    ATA+CL      ATA    CL    ATA+CL
csvobjects
importscrubber
iTrust
jbook
jdom
org.eclipse.core.runtime
Average

ESEM | October 9, 2008
Case Study Limitations

Construct validity
– Possible alert closures and creations when fixing alerts
– Duplicate alerts
Internal validity
– The alert classifications (an external variable) are subjective, coming from inspection.
External validity
– The results may not scale to larger programs.

ESEM | October 9, 2008
FAULTBENCH Limitations

Alert oracles were created from a third-party inspection of the source code, not by the developers.
Generation of the optimal ordering is biased toward the tool ordering of alerts.
The subjects are written in Java, so results may not generalize to FP mitigation techniques for other languages.

ESEM | October 9, 2008
Future Work

Collaborate with other researchers to evolve FAULTBENCH.
Use FAULTBENCH to compare FP mitigation techniques from the literature.

ESEM | October 9, 2008
Questions?

FAULTBENCH:
Sarah Heckman:

ESEM | October 9, 2008
References

[1] S. E. Sim, S. Easterbrook, and R. C. Holt, "Using Benchmarking to Advance Research: A Challenge to Software Engineering," ICSE, Portland, Oregon, May 3-10, 2003.