Statistical Debugging: A Tutorial
Steven C.H. Hoi

Acknowledgement: Some slides in this tutorial were borrowed from Chao Liu at UIUC.

Motivations

Software is full of bugs:
- Windows 2000 had about 63,000 known bugs at its time of release, about 2 bugs per 1,000 lines.
- A study by the National Institute of Standards and Technology estimated that software faults cost the U.S. economy about $59.5 billion annually.

Testing and debugging are laborious and expensive:
- "50% of my company employees are testers, and the rest spend 50% of their time testing!" --Bill Gates, 1995

Expedite Debugging

Manual debugging:
- Trace executions step by step.
- Verify observations against expectations.

Automated debugging:
- Collect runtime behaviors as the program executes.
- Identify bug-relevant points by contrasting correct and incorrect executions.
- Best effort so far: bug localization.

An Example

Symptoms:
- 563 lines of C code
- 130 out of 5,542 test cases fail to give correct outputs
- no crashes

Conventional debugging offers few hints beyond step-by-step tracing; a better method would pinpoint the buggy line directly. The buggy version of subline is shown first, followed by the fixed version.

Buggy version:

    void subline(char *lin, char *pat, char *sub)
    {
        int i, lastm, m;
        lastm = -1;
        i = 0;
        while (lin[i] != ENDSTR) {
            m = amatch(lin, i, pat, 0);
            if (m >= 0) {                    /* BUG: missing (lastm != m) check */
                putsub(lin, i, m, sub);
                lastm = m;
            }
            if ((m == -1) || (m == i)) {
                fputc(lin[i], stdout);
                i = i + 1;
            } else
                i = m;
        }
    }

Fixed version:

    void subline(char *lin, char *pat, char *sub)
    {
        int i, lastm, m;
        lastm = -1;
        i = 0;
        while (lin[i] != ENDSTR) {
            m = amatch(lin, i, pat, 0);
            if ((m >= 0) && (lastm != m)) {  /* corrected condition */
                putsub(lin, i, m, sub);
                lastm = m;
            }
            if ((m == -1) || (m == i)) {
                fputc(lin[i], stdout);
                i = i + 1;
            } else
                i = m;
        }
    }

Review of Recent Work

- SOBER algorithm
- Cause Transition algorithm
- Statistical debugging: Liblit05
- Statistical debugging: simultaneous identification of multiple bugs

SOBER: Statistical Model-based Bug Localization

- Program predicates
- Predicate rankings
- Experimental results

Program Predicates

A predicate is a proposition about some program property, e.g., "idx < 0", ...
- Each predicate can be evaluated multiple times during one execution.
- Every evaluation yields either true or false.

A predicate is therefore simply a Boolean random variable that encodes program executions from a particular aspect.
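
For concreteness, here is a minimal hand-written sketch of how such a predicate can be instrumented with per-run counters (statistical-debugging tools insert this kind of instrumentation automatically; predicate_t, observe, and lookup are illustrative names, not from any actual tool):

    /* Per-predicate counters for one run. */
    typedef struct {
        const char *desc;   /* e.g., "idx < 0 in lookup()" */
        long n_true;        /* evaluations where the proposition held */
        long n_total;       /* all evaluations in the current run */
    } predicate_t;

    static predicate_t p_idx_neg = { "idx < 0", 0, 0 };

    /* Record one evaluation of a predicate and pass the value through. */
    static int observe(predicate_t *p, int value)
    {
        p->n_total += 1;
        if (value)
            p->n_true += 1;
        return value;
    }

    int lookup(const int *table, int idx)
    {
        if (observe(&p_idx_neg, idx < 0))   /* instrumented predicate */
            return -1;
        return table[idx];
    }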

Evaluation Bias of a Predicate P

Definition: the probability that P is evaluated as true within one execution.
- Maximum-likelihood estimate: the number of true evaluations divided by the total number of evaluations in one run. For example, a predicate that evaluates to true 3 times and false 7 times in a run has an observed evaluation bias of 0.3.
- Each run gives one observation of the evaluation bias of P.

Suppose we have n correct and m incorrect executions. For any predicate P, we end up with:
- an observation sequence from correct runs, S_p = (X'_1, X'_2, ..., X'_n), and
- an observation sequence from incorrect runs, S_f = (X_1, X_2, ..., X_m).

Can we infer whether P is suspicious from S_p and S_f?
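
A minimal sketch of the estimator, using the per-run counters from the instrumentation sketch above:

    /* Maximum-likelihood estimate of the evaluation bias of a predicate
     * in a single run: true evaluations over total evaluations.
     * Returns -1.0 when the predicate was never evaluated in the run. */
    double evaluation_bias(long n_true, long n_total)
    {
        if (n_total == 0)
            return -1.0;
        return (double)n_true / (double)n_total;
    }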

Underlying Populations

Imagine that evaluation bias follows some underlying distribution f_p(X) for correct executions and f_f(X) for incorrect executions; S_p and S_f can then be viewed as random samples from these two populations, respectively.

The major heuristic: the larger the divergence between f_p(X) and f_f(X), the more relevant the predicate P is to the bug.

[Figure: probability densities of evaluation bias over [0, 1] for correct and for incorrect runs.]

Major Challenges

- We have no knowledge of the closed form of either distribution.
- We usually do not have enough incorrect executions to estimate f_f(X) reliably.

SOBER’s Approach

Algorithm Outputs

A ranked list of program predicates with respect to the bug-relevance score s(P):
- Higher-ranked predicates are regarded as more relevant to the bug.

What is this useful for?
- Top-ranked predicates suggest possible buggy regions.
- Several predicates may point to the same region.
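
Producing the ranked list is then just a descending sort by score; a minimal sketch with a hypothetical record type:

    #include <stdlib.h>

    typedef struct {
        const char *desc;   /* predicate description, e.g., "idx < 0 in lookup()" */
        double score;       /* bug-relevance score s(P) */
    } scored_pred_t;

    static int by_score_desc(const void *a, const void *b)
    {
        double sa = ((const scored_pred_t *)a)->score;
        double sb = ((const scored_pred_t *)b)->score;
        return (sa < sb) - (sa > sb);   /* higher scores first */
    }

    void rank_predicates(scored_pred_t *preds, size_t n)
    {
        qsort(preds, n, sizeof preds[0], by_score_desc);
    }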

Cause Transition (CT)

"Locating Causes of Program Failures", Cleve and Zeller, ICSE 2005 (May 15, 2005).
- A variant of delta debugging [Z02].
- The previous state-of-the-art performance holder on the Siemens suite.
- Limitation: it relies on detecting abnormality in program memory state, which restricts its performance.

Statistical Debugging: Liblit05

"Scalable Statistical Bug Isolation", Liblit et al., PLDI 2005 (June 12, 2005).
Main idea: rank predicates according to their correlation with program crashes.

Statistical Debugging: Liblit05

Context(P) = Pr(Crash | P observed)
Failure(P) = Pr(Crash | P observed as true)

The probability difference:
Increase(P) = Failure(P) - Context(P)

Limitation: this ignores the evaluation patterns of predicates within each execution.
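
These probabilities can be estimated from simple per-predicate run counts, as in the sketch below (the struct and field names are illustrative, not Liblit's implementation):

    /* Counts aggregated over the whole test suite for one predicate P:
     *   s_obs / f_obs  : successful / failing runs in which P was evaluated
     *   s_true / f_true: successful / failing runs in which P was ever true */
    typedef struct {
        long s_obs, f_obs, s_true, f_true;
    } pred_counts_t;

    /* Context(P) = Pr(failure | P observed) */
    double context(const pred_counts_t *c)
    {
        long n = c->s_obs + c->f_obs;
        return n ? (double)c->f_obs / (double)n : 0.0;
    }

    /* Failure(P) = Pr(failure | P observed as true) */
    double failure(const pred_counts_t *c)
    {
        long n = c->s_true + c->f_true;
        return n ? (double)c->f_true / (double)n : 0.0;
    }

    /* Increase(P) > 0 means that P being true correlates with failure. */
    double increase(const pred_counts_t *c)
    {
        return failure(c) - context(c);
    }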

Experimental Results

Localization quality metric:
- software bug benchmark
- quantitative metric

Related work:
- Cause Transition (CT) [CZ05]
- statistical debugging [LN+05]

Performance comparisons follow.

Bug Benchmark

The dream benchmark would offer a large number of known bugs in large-scale programs with an adequate test suite. In practice, we use the Siemens program suite:
- 130 faulty variants of 7 subject programs
- 130 known bugs in total, mainly logic (semantic) bugs

Advantages:
- The bugs are known, so judgments are objective.
- The large number of bugs makes comparative studies statistically significant.

Disadvantage:
- The subject programs are small.

The state-of-the-art performance claimed in the literature so far is the cause-transition approach [CZ05].

Localization Quality Metric [RR03]

The T-score [RR03] measures how much of the program a developer would have to examine, starting from the locations flagged by the tool and expanding along the program dependence graph, before reaching the actual fault; lower percentages mean better localization.

1st Example: T-score = 70%

2nd Example: T-score = 20%

Localized bugs w.r.t. Examined Code

Cumulative Effects w.r.t. Code Examination

Top-k Selection

- Regardless of the specific choice of k, both Liblit05 and SOBER outperform CT, the previous state-of-the-art.
- For k = 2 through 10, SOBER consistently outperforms Liblit05.

Conclusion and Discussion

A tutorial on statistical debugging.

Future work:
- better statistical models
- identification of multiple bugs
- robustness to sampling
- ...