CS590Z Statistical Debugging
Xiangyu Zhang (some of the slides are from Chao Liu)


A Very Important Principle
- Traditional debugging techniques deal with a single execution (or very few).
- Given a large set of executions, both passing and failing, statistical debugging is often highly effective.
[Figure labels: failure reporting, in-house testing]

Tarantula (ASE 2005, ISSTA 2007)

Scalable Remote Bug Isolation (PLDI 2004, 2005)
- Look at predicates
  - Branches
  - Function returns (<0, <=0, >0, >=0, ==0, !=0)
  - Scalar pairs: for each assignment x = …, find all variables y_i and constants c_j, and compare x with each y_i or c_j (=, <, <=, …)
- Sample the predicate evaluations (Bernoulli sampling)
- Investigate how the probability of a predicate being true relates to the manifestation of the bug.
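The scalar-pairs scheme combined with Bernoulli sampling can be sketched as follows. This is a minimal illustration, not the instrumentation used in the cited papers: the names `pair_counts`, `should_sample`, and `observe_scalar_pair`, and the use of `rand()` as the random source, are all assumptions made for this example.

```c
#include <stdlib.h>

/* Counters for the three mutually exclusive outcomes of comparing x with y
   (hypothetical layout for this sketch). */
struct pair_counts {
    unsigned long lt, eq, gt;   /* times x < y, x == y, x > y were observed */
};

/* One Bernoulli trial: returns 1 with probability roughly `rate`
   (0.0 means never sample, 1.0 means always sample). */
static int should_sample(double rate)
{
    return (double)rand() / ((double)RAND_MAX + 1.0) < rate;
}

/* Intended to be called right after an assignment x = ...; records how x
   relates to y, but only when the Bernoulli trial succeeds. */
static void observe_scalar_pair(struct pair_counts *c, long x, long y,
                                double rate)
{
    if (!should_sample(rate))
        return;
    if (x < y)
        c->lt++;
    else if (x == y)
        c->eq++;
    else
        c->gt++;
}
```

The remaining relations follow from the three counters: for example, x <= y was observed `lt + eq` times and x != y was observed `lt + gt` times.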

Bug Isolation

How much does P being true increase the probability of failure, compared with simply reaching the line where P is sampled?
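One way to make this question concrete is to subtract two failure ratios: the fraction of failing runs among runs where P was observed true, and the fraction of failing runs among runs where P's site was sampled at all. The sketch below assumes that formulation; the struct layout and the name `increase` are hypothetical, and returning 0.0 for undefined ratios is a convention chosen here.

```c
/* Per-predicate run counts (hypothetical layout for this sketch).
   *_true:     runs in which P was observed to be true at least once
   *_observed: runs in which P's site was reached and sampled at all */
struct pred_stats {
    unsigned f_true, s_true;         /* failing / successful runs with P true */
    unsigned f_observed, s_observed; /* failing / successful runs observing P */
};

/* Returns (failure ratio given P true) minus (failure ratio given P's site
   was observed), or 0.0 when either ratio is undefined. */
static double increase(const struct pred_stats *p)
{
    unsigned n_true = p->f_true + p->s_true;
    unsigned n_obs = p->f_observed + p->s_observed;

    if (n_true == 0 || n_obs == 0)
        return 0.0;

    double failure = (double)p->f_true / n_true;
    double context = (double)p->f_observed / n_obs;
    return failure - context;
}
```

A positive result suggests that P being true adds failure risk beyond what reaching the line already implies; a value near zero suggests P carries little extra information.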

An Example
- Symptoms
  - 563 lines of C code
  - 130 out of 5542 test cases fail to give correct outputs
  - No crashes
- The predicates are evaluated to both true and false in one execution

Version with the condition (m >= 0):

    void subline(char *lin, char *pat, char *sub)
    {
        int i, lastm, m;

        lastm = -1;
        i = 0;
        while ((lin[i] != ENDSTR)) {
            m = amatch(lin, i, pat, 0);
            if (m >= 0) {
                putsub(lin, i, m, sub);
                lastm = m;
            }
            if ((m == -1) || (m == i)) {
                fputc(lin[i], stdout);
                i = i + 1;
            } else
                i = m;
        }
    }

Version with the condition (m >= 0) && (lastm != m):

    void subline(char *lin, char *pat, char *sub)
    {
        int i, lastm, m;

        lastm = -1;
        i = 0;
        while ((lin[i] != ENDSTR)) {
            m = amatch(lin, i, pat, 0);
            if ((m >= 0) && (lastm != m)) {
                putsub(lin, i, m, sub);
                lastm = m;
            }
            if ((m == -1) || (m == i)) {
                fputc(lin[i], stdout);
                i = i + 1;
            } else
                i = m;
        }
    }

not enough

With A = (m >= 0) and B = (lastm != m), the two conjuncts of the condition (m >= 0) && (lastm != m) in subline:

$P_f(A) = \tilde{P}(A \mid A \wedge \neg B)$
$P_t(A) = \tilde{P}(A \mid \neg(A \wedge \neg B))$

Program Predicates
- A predicate is a proposition about any program property
  - e.g., a comparison on idx, …
  - Each predicate can be evaluated multiple times during one execution
  - Every evaluation gives either true or false
- Therefore, a predicate is simply a boolean random variable that encodes program executions from a particular aspect.

Evaluation Bias of Predicate P
- Evaluation bias
  - Definition: the probability that P is evaluated as true within one execution
  - Maximum likelihood estimate: the number of true evaluations over the total number of evaluations in one run
  - Each run gives one observation of the evaluation bias for predicate P
- Suppose we have n correct and m incorrect executions. For any predicate P, we end up with
  - an observation sequence for correct runs: S_p = (X'_1, X'_2, …, X'_n)
  - an observation sequence for incorrect runs: S_f = (X_1, X_2, …, X_m)
- Can we infer whether P is suspicious based on S_p and S_f?
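The estimate above can be computed directly from per-run evaluation counts. The following is a minimal sketch; the struct `run_counts`, the function names, and the choice to skip runs in which P was never evaluated are assumptions made for illustration.

```c
#include <stddef.h>

/* Evaluation counts of one predicate P within a single execution. */
struct run_counts {
    unsigned long n_true;   /* evaluations of P that yielded true  */
    unsigned long n_false;  /* evaluations of P that yielded false */
};

/* Maximum likelihood estimate of the evaluation bias for one run:
   true evaluations over total evaluations. Returns -1.0 if P was never
   evaluated in the run (a convention chosen for this sketch). */
static double evaluation_bias(struct run_counts r)
{
    unsigned long total = r.n_true + r.n_false;

    if (total == 0)
        return -1.0;
    return (double)r.n_true / (double)total;
}

/* Builds an observation sequence (one bias per run) from per-run counts.
   Runs that never evaluated P are skipped. Returns the number of
   observations written to out, which must have room for n values. */
static size_t observation_sequence(const struct run_counts *runs, size_t n,
                                   double *out)
{
    size_t k = 0;

    for (size_t i = 0; i < n; i++) {
        double b = evaluation_bias(runs[i]);
        if (b >= 0.0)
            out[k++] = b;
    }
    return k;
}
```

Calling `observation_sequence` once on the counts from correct runs and once on the counts from incorrect runs yields the two sequences S_p and S_f.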

Underlying Populations
- Imagine the underlying distributions of evaluation bias for correct and incorrect executions are $f(X \mid \theta_p)$ and $f(X \mid \theta_f)$, respectively.
- S_p and S_f can be viewed as random samples from the respective underlying populations.
- One major heuristic: the larger the divergence between $f(X \mid \theta_p)$ and $f(X \mid \theta_f)$, the more relevant the predicate P is to the bug.
[Figure: two plots of probability against evaluation bias on the interval 0 to 1, one per population]
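One simple way to quantify such a divergence without knowing either distribution's closed form is to ask how far the mean bias of the incorrect runs lies from the mean of the correct runs, measured in standard errors. The sketch below computes a squared z-statistic. This statistic and the function names are assumptions chosen for illustration, not necessarily the score used in these slides.

```c
#include <stddef.h>

/* Arithmetic mean of n values (n must be at least 1). */
static double mean(const double *x, size_t n)
{
    double sum = 0.0;

    for (size_t i = 0; i < n; i++)
        sum += x[i];
    return sum / (double)n;
}

/* Unbiased sample variance around mu (n must be at least 2). */
static double sample_variance(const double *x, size_t n, double mu)
{
    double ss = 0.0;

    for (size_t i = 0; i < n; i++)
        ss += (x[i] - mu) * (x[i] - mu);
    return ss / (double)(n - 1);
}

/* Squared z-statistic of the mean of s_f (m incorrect runs) against the
   mean and variance estimated from s_p (n correct runs):
       z^2 = m * (mean_f - mean_p)^2 / var_p
   Returns 0.0 when there is too little data or no variance in s_p. */
static double squared_z(const double *s_p, size_t n,
                        const double *s_f, size_t m)
{
    if (n < 2 || m == 0)
        return 0.0;

    double mu_p = mean(s_p, n);
    double var_p = sample_variance(s_p, n, mu_p);
    if (var_p == 0.0)
        return 0.0;

    double diff = mean(s_f, m) - mu_p;
    return (double)m * diff * diff / var_p;
}
```

Only the mean of the incorrect runs is needed here, which helps when incorrect executions are too few to estimate their full distribution.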

Major Challenges
- No knowledge of the closed forms of both distributions
- Usually, we do not have sufficient incorrect executions to estimate the distribution for incorrect executions reliably.
[Figure: two plots of probability against evaluation bias on the interval 0 to 1, one per population]

Our Approach

Algorithm Outputs
- A list of program predicates ranked by the bug relevance score s(P)
  - Higher-ranked predicates are regarded as more relevant to the bug
- What's the use?
  - Top-ranked predicates suggest the possible buggy regions
  - Several predicates may point to the same region
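Producing the ranked list is a plain sort once each predicate has a score. A minimal sketch using the standard `qsort`, where the struct `scored_pred` and its fields are hypothetical and the score is taken as already computed by whatever s(P) is in use:

```c
#include <stdlib.h>

/* A predicate identifier paired with its bug relevance score
   (hypothetical layout for this sketch). */
struct scored_pred {
    int id;
    double score;
};

/* qsort comparator: orders by score, highest first. */
static int by_score_desc(const void *a, const void *b)
{
    double sa = ((const struct scored_pred *)a)->score;
    double sb = ((const struct scored_pred *)b)->score;

    return (sa < sb) - (sa > sb);
}

/* Sorts preds in place so that the most suspicious predicate comes first. */
static void rank_predicates(struct scored_pred *preds, size_t n)
{
    qsort(preds, n, sizeof preds[0], by_score_desc);
}
```

A developer would then inspect the code around the top few entries, noting that several of them may point to the same region.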

Outline
- Program Predicates
- Predicate Rankings
- Experimental Results
- Case Study: bc-1.06
- Future Work
- Conclusions

Experiment Results
- Localization quality metric
  - Software bug benchmark
  - Quantitative metric
- Related works
  - Cause Transition (CT) [CZ05]
  - Statistical Debugging [LN+05]
- Performance comparisons

Bug Benchmark
- Bug benchmark
  - Dream benchmark: a large number of known bugs in large-scale programs with adequate test suites
  - Siemens Program Suite
    - 130 variants of 7 subject programs, each of … LOC
    - 130 known bugs in total
    - Mainly logic (or semantic) bugs
  - Advantages
    - Known bugs, so judgments are objective
    - Large number of bugs, so the comparative study is statistically significant
  - Disadvantages
    - Small-scale subject programs
- State-of-the-art performance claimed so far in the literature: Cause Transition approach [CZ05]

Localization Quality Metric [RR03]

1st Example T-score = 70%

2nd Example T-score = 20%

Related Works
- Cause Transition (CT) approach [CZ05]
  - A variant of delta debugging [Z02]
  - Previous state-of-the-art performance holder on the Siemens suite
  - Published in ICSE'05, May 15, 2005
  - Cons: it relies on memory abnormality, hence its performance is restricted
- Statistical Debugging (Liblit05) [LN+05]
  - Predicate ranking based on discriminant analysis
  - Published in PLDI'05, June 12, 2005
  - Cons: ignores the evaluation patterns of predicates within each execution

Localized bugs w.r.t. Examined Code

Cumulative Effects w.r.t. Code Examination

Top-k Selection
- Regardless of the specific choice of k, both Liblit05 and SOBER are better than CT, the current state-of-the-art holder
- From k=2 to 10, SOBER is consistently better than Liblit05

Outline
- Evaluation Bias of Predicates
- Predicate Rankings
- Experimental Results
- Case Study: bc-1.06
- Future Work
- Conclusions

Case Study: bc 1.06
- bc (… LOC)
  - An arbitrary-precision calculator shipped with most distributions of Unix/Linux
- Two bugs were localized
  - One was reported by Liblit in [LN+05]
  - One was not reported previously
- Sheds some light on scalability

Future Work
- Further improve the localization quality
- Robustness to sampling
- Stress-test on large-scale programs to confirm scalability with code size
- …

Conclusions
- We devised a principled statistical method for bug localization.
- No parameter-setting hassles.
- It handles both crashing and noncrashing bugs.
- Best localization quality so far among the compared approaches (CT and Liblit05).

Discussion
- Features
  - Easy implementation
  - Difficult experimentation
  - More advanced statistical techniques may not be necessary
  - Go wide, not deep…
- Predicates are treated as independent random variables.
- Can execution indexing help?
- Can statistical principles be combined with slicing or IWIH?