1 Data Mining: Concepts and Techniques — Chapter 11 — —Software Bug Mining—
Jiawei Han and Micheline Kamber, Department of Computer Science, University of Illinois at Urbana-Champaign. ©2006 Jiawei Han and Micheline Kamber. All rights reserved. Acknowledgement: Chao Liu

2 Data Mining: Principles and Algorithms

3 Data Mining: Principles and Algorithms
Outline
  Automated Debugging and Failure Triage
  SOBER: Statistical Model-Based Fault Localization
  Fault Localization-Based Failure Triage
  Copy and Paste Bug Mining
  Conclusions & Future Research
Here is the outline. We start with a general discussion of automated debugging and failure triage, then discuss our approaches to these two problems separately, then turn to copy-and-paste bug mining, and finally draw conclusions and discuss future research directions.

4 Software Bugs Are Costly
Software is "full of bugs": Windows 2000, with 35 million lines of code, had 63,000 known bugs at the time of release, about 2 per 1,000 lines. Software failures are costly: the Ariane 5 explosion was due to "errors in the software of the inertial reference system" (Ariane 5 Flight 501 inquiry board report), and a study by the National Institute of Standards and Technology found that software errors cost the U.S. economy about $59.5 billion annually. Testing and debugging are laborious and expensive: "50% of my company employees are testers, and the rest spend 50% of their time testing!" (Bill Gates, 1995). This work is about automatically localizing software bugs. The major motivation is that software is full of bugs; one study showed an average error rate of 1 to 4.5 errors per 1,000 lines of code. Windows 2000, with 35M lines of code, contained 63 thousand known bugs at release, roughly 2 errors per thousand lines. When bugs strike in practice, the costs are tremendous: in 1996 the Ariane 5 exploded 40 seconds after launch, and the investigation attributed the explosion to errors in the software of the inertial reference system. The NIST study estimated that software errors cost the U.S. economy about $59.5 billion annually. Consequently, a great deal of effort goes into testing and debugging during the software life cycle; as Bill Gates put it, half of his company's employees are testers, and the rest spend half of their time testing. Because testing and debugging are such laborious tasks, research has been carried out on automatic bug localization.

5 Automated Failure Reporting
End-users act as beta testers, providing valuable information about failure occurrences in the field: 24.5 million reports per day would arrive in Redmond if all users sent them (John Dvorak, PC Magazine). The practice is widely adopted because of its usefulness: Microsoft Windows, Gentoo Linux, Mozilla applications, and more; any application can implement this functionality. Software bugs cause program failures, and developers need access to those failures in order to debug. Recent years have seen a software practice known as automated failure reporting: you have probably seen such dialogs yourself; when an application fails, a window pops up asking whether you would like to send a report to a central server for the software vendor to diagnose. Because failure reports contain valuable information about program failures in the real world, they are very useful for vendors improving their software. Automated failure reporting has therefore been widely adopted, in Linux, Mozilla applications, and Microsoft Windows, and by using third-party libraries or Windows APIs any application can implement its own failure-reporting functionality.

6 After Failures Collected …: Failure triage
Failure prioritization: what are the most severe bugs? Failure assignment: which developers should debug a given set of failures? Automated debugging: where is the likely bug location? After failures are collected, we cannot just pick an arbitrary failure and ask an arbitrary developer to diagnose it. Instead, we need to identify the most severe failures and assign them to the appropriate developers, which is the so-called failure triage. Two tasks must be handled: failure prioritization and failure assignment. The severity of a bug is determined by the number of failures it causes; since the most frequently reported failures are the most severe, we need to identify failures that are likely due to the same bug. Failure assignment, in turn, means finding the appropriate developers to diagnose a given set of failures.

7 A Glimpse on Software Bugs
Crashing bugs. Symptoms: segmentation faults. Reasons: memory access violations. Tools: Valgrind, CCured. Noncrashing bugs. Symptoms: unexpected outputs. Reasons: logic or semantic errors, e.g. if ((m >= 0)) vs. if ((m >= 0) && (m != lastm)); < vs. <=, > vs. >=; j = i vs. j = i + 1. Tools: no sound tools.

8 Semantic Bugs Dominate
Memory-related bugs: many are detectable. Others: concurrency bugs. Semantic bugs: application-specific, only a few are detectable, and detection mostly requires annotations or specifications. Although semantic bugs look tricky, they are by no means rare: according to a recent study of bug characteristics, semantic bugs dominate, accounting for about 78% of all bugs, whereas memory bugs account for only 16%. This is because many memory-checking tools have been developed in recent years and are actually used in practice. Because semantic bugs have become dominant, we need to pay more attention to them. Bug distribution [Li et al., ICSE'07]: 264 bugs in Mozilla and 98 bugs in Apache checked manually, 29,000 bugs in Bugzilla checked automatically. Courtesy of Zhenmin Li.

9 Hacking Semantic Bugs is HARD
Major challenge: no crashes! No failure signatures, no debugging hints. Major methods: statistical debugging of semantic bugs [Liu et al., FSE'05, TSE'06]; triaging noncrashing failures through statistical debugging [Liu et al., FSE'06]. Unfortunately, hacking semantic bugs is hard because there are no crashes. With no crashes there is no failure signature, so failure triage becomes elusive; and without crashes, developers have few hints about where the bug could be. Our contributions: (1) a statistical debugging algorithm that automatically localizes semantic bugs without any prior knowledge of program semantics, and (2) the observation that statistical debugging can be used to triage noncrashing failures.

10 Data Mining: Principles and Algorithms
Outline
  Automated Debugging and Failure Triage
  SOBER: Statistical Model-Based Fault Localization
  Fault Localization-Based Failure Triage
  Copy and Paste Bug Mining
  Conclusions & Future Research

11 Data Mining: Principles and Algorithms
A Running Example

    void subline(char *lin, char *pat, char *sub)
    {
        int i, lastm, m;
        lastm = -1;
        i = 0;
        while ((lin[i] != ENDSTR)) {
            m = amatch(lin, i, pat, 0);
            if (m >= 0) {              /* buggy: should be if ((m >= 0) && (lastm != m)) */
                lastm = m;
            }
            if ((m == -1) || (m == i)) {
                i = i + 1;
            } else
                i = m;
        }
    }

Instrumented predicates (with the number of true and false evaluations recorded per execution):
  (lin[i] != ENDSTR) == true
  Ret_amatch < 0, Ret_amatch == 0, Ret_amatch > 0
  (m >= 0) == true
  (m >= -1) == true
  (m == i) == true

Predicate evaluation is like tossing a coin. 130 of 5,542 test cases fail, with no crashes.

Let us use an example to explain statistical debugging. Here is a function from a buggy program. The bug is in the highlighted condition: it should be a conjunction of two subclauses, but the developer forgot one. As a consequence, 130 out of 5,542 test cases fail to give correct outputs, and none of the failures crashes. Conventionally, for such a semantic bug, a developer has to find a failing execution and trace it step by step. Our tool SOBER pinpoints the likely bug location immediately, so the developer can examine this part first, set a breakpoint here, and check for abnormality. How does the tool work? We first instrument the source code with predicates. In general, predicates can be about any program property; we instrument two kinds. The first is Boolean predicates: for every Boolean expression, we instrument a predicate that the expression evaluates to true. The second concerns function calls: for every call, we instrument three predicates stating that the return value is less than, equal to, and greater than zero. Every time the associated source code is executed, the predicate is evaluated, and each evaluation is either true or false, just like tossing a coin. For example, if the while loop is executed six times, five times true and one time false, the numbers of true and false evaluations are recorded; these counts form the predicate profile of the execution.
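For concreteness, here is a minimal Python sketch of this kind of predicate profiling. The recorder names (observe, returns) and the profile layout are illustrative assumptions, not the actual SOBER instrumentation, which works on C source.

    from collections import defaultdict

    # profile[pred_id] = [num_true, num_false] for the current execution
    profile = defaultdict(lambda: [0, 0])

    def observe(pred_id, value):
        """Record one evaluation of a Boolean predicate and pass the value through."""
        profile[pred_id][0 if value else 1] += 1
        return value

    def returns(pred_base, ret):
        """Instrument a call site with three predicates: ret < 0, ret == 0, ret > 0."""
        observe(pred_base + "<0", ret < 0)
        observe(pred_base + "==0", ret == 0)
        observe(pred_base + ">0", ret > 0)
        return ret

    # The instrumented while loop and branch of subline would then look like:
    #   while (observe("P_loop", lin[i] != ENDSTR)) {
    #       m = returns("Ret_amatch", amatch(lin, i, pat, 0));
    #       if (observe("P_m_ge_0", m >= 0)) { ... }
    #   }
    # If the loop condition is evaluated 6 times (5 true, 1 false),
    # profile["P_loop"] ends up as [5, 1] for this execution.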

12 Profile Executions as Vectors
Two passing executions: (5, 1, 4, 2) and (19, 1, 18, 2); one failing execution: (9, 1, 8, 2). Extreme case: a predicate is always false in passing executions and always true in failing executions. Generalized case: different true probabilities in passing and failing executions. We concatenate the predicate profiles of all predicates to obtain a vector representation of each execution; with three executions of the function, two passing and one failing, we may get representations like those above. Intuitively, if a predicate is always false in passing executions and happens to be true in all failing executions, it is very likely related to the program failures. Our tool generalizes this intuition and looks for predicates with divergent true probabilities.

13 Estimated Head Probability
Evaluation bias: the head probability estimated from every execution. Specifically, pi(P) = n_t / (n_t + n_f), where n_t and n_f are the numbers of true and false evaluations of P in one execution. It is defined for each predicate and each execution. In order to identify the divergence in head probability, we define a random variable, the evaluation bias, that estimates the head probability from every execution: it is the fraction of true evaluations of the predicate in one execution. We treat each predicate independently.
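In code, the evaluation bias of a predicate in one execution is just the fraction of true evaluations, a direct transcription of the definition above (function name is mine):

    def evaluation_bias(n_true, n_false):
        """pi(P) = n_t / (n_t + n_f): estimated head probability of P in one execution."""
        return n_true / (n_true + n_false)

    # e.g. the while-loop predicate evaluated 5 times true and 1 time false in one run:
    print(evaluation_bias(5, 1))   # 0.8333...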

14 Divergence in Head Probability
Multiple evaluation biases are observed from multiple executions, and they can be treated as generated from underlying models (figure: probability density over head probability), one for passing and one for failing executions. We then quantify the divergence between the two models: the larger the divergence, the more likely the corresponding predicate is fault-relevant. The rationale is that the predicate with the largest divergence is the most suspicious.

15 Data Mining: Principles and Algorithms
Major Challenges (figure: the unknown passing-run and failing-run distributions of head probability). There is no closed form of either model, and there are not enough failing executions to estimate the failing-run model. So there are two major challenges in quantifying the model divergence directly: first, we have no idea of the closed form of either model; second, even if we did, the number of failing executions is usually insufficient for a reliable estimate. Since direct quantification is hard, we consider an indirect approach.

16 Data Mining: Principles and Algorithms
We proposed a hypothesis testing-based indirect approach to quantifying the model divergence. We take the passing-run model as the null hypothesis and derive a test statistic that conforms to a normal distribution by the central limit theorem. Intuitively, the statistic corresponds to the likelihood of observing the evaluation biases of the failing executions as if they had been generated from the model of the passing executions: the smaller this likelihood, the more likely the null hypothesis is false, the larger the divergence between the two models, and the more fault-relevant the predicate P.
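The following Python sketch captures the spirit of this step, not the exact SOBER statistic (which is derived in Liu et al., FSE'05): treat the passing-run biases as the null model, form a z-like statistic for the mean failing-run bias via the central limit theorem, and rank predicates by how improbable that mean is under the null. All names and the simple variance estimate are assumptions for illustration.

    import math

    def fault_relevance(biases_pass, biases_fail):
        """How unlikely are the failing-run evaluation biases if they were
        generated from the passing-run model?  Larger = more fault-relevant."""
        n, m = len(biases_pass), len(biases_fail)
        mu = sum(biases_pass) / n                                    # null-model mean
        var = sum((x - mu) ** 2 for x in biases_pass) / max(n - 1, 1)
        sigma = math.sqrt(var) or 1e-9                               # guard against zero variance
        mean_f = sum(biases_fail) / m
        z = (mean_f - mu) * math.sqrt(m) / sigma                     # approx. N(0,1) under the null (CLT)
        return abs(z)

    # Predicates are then sorted by this score, largest first, to produce the ranking.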

17 Data Mining: Principles and Algorithms
SOBER in Summary (diagram: Source Code -> instrumented program -> run on a Test Suite -> predicate profiles of passing and failing runs -> SOBER -> ranked predicate list: Pred2, Pred6, Pred1, Pred3). Here is a summary of SOBER. Given the source code, we first instrument it with program predicates. We then run a test suite and collect predicate profiles for both passing and failing executions; this step needs a test oracle to label each execution as passing or failing. SOBER takes the predicate profiles and generates a ranked list of all instrumented predicates, and the top-ranked predicates point to the blamed bug location.

18 Previous State of the Art [Liblit et al, 2005]
Correlation analysis:
  Context(P) = Prob(fail | P ever evaluated)
  Failure(P) = Prob(fail | P ever evaluated as true)
  Increase(P) = Failure(P) - Context(P)
The idea of statistical debugging is not new. Liblit et al. proposed a statistical debugging algorithm based on correlation analysis. For each predicate, they estimate the probability that the program fails given that the predicate is ever evaluated, which they call Context(P), and the probability that the program fails given that the predicate is ever evaluated as true, Failure(P). The difference, Increase(P), is the fault-relevance score of the predicate: it estimates how much more likely the program is to fail when the predicate is ever evaluated true.

19 Liblit05 in Illustration
(Figure: failing runs marked +, passing runs marked O.)
  Context(P) = Prob(fail | P ever evaluated) = 4/10 = 2/5
  Failure(P) = Prob(fail | P ever evaluated as true) = 3/7
  Increase(P) = Failure(P) - Context(P) = 3/7 - 2/5 = 1/35
We use a simple example to illustrate their algorithm. The discrimination of Liblit05 relies on the proportion of executions in which P is evaluated only as false; a larger Increase(P) implies higher fault relevance.
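The sketch below computes Context, Failure, and Increase from per-run observations and reproduces the numbers in this illustration; the run data is constructed to match the figure (10 runs evaluate P, 4 of them fail; 7 runs ever see P true, 3 of them fail) and the data layout is my own.

    def liblit05_increase(runs):
        """runs: list of (failed, ever_evaluated, ever_true) per execution.
        Returns (Context(P), Failure(P), Increase(P))."""
        evaluated = [r for r in runs if r[1]]
        ever_true = [r for r in runs if r[2]]
        context = sum(r[0] for r in evaluated) / len(evaluated)
        failure = sum(r[0] for r in ever_true) / len(ever_true)
        return context, failure, failure - context

    runs = [(True, True, True)] * 3 + [(True, True, False)] * 1 \
         + [(False, True, True)] * 4 + [(False, True, False)] * 2
    print(liblit05_increase(runs))   # (0.4, 0.4285..., 0.0285... = 1/35)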

20 Data Mining: Principles and Algorithms
SOBER in Illustration (figure: the distribution of evaluation bias in failing runs, marked +, versus passing runs, marked O). In comparison, SOBER adopts a fundamentally different approach: we learn a model of the evaluation bias for each class of executions and contrast the two. Because a direct approach is infeasible, we proposed an indirect approach to quantifying the divergence between the two models.

21 Difference between SOBER and Liblit05
Methodology: Liblit05 uses correlation analysis; SOBER uses a model-based approach. Utilized information: Liblit05 asks "is the predicate ever true?"; SOBER asks "what percentage of its evaluations are true?"

    void subline(char *lin, char *pat, char *sub) {
    1    int i, lastm, m;
    2    lastm = -1;
    3    i = 0;
    4    while ((lin[i] != ENDSTR)) {
    5        m = amatch(lin, i, pat, 0);
    6        if (m >= 0) {
                 putsub(lin, i, m, sub);
                 lastm = m;
             }
             ...
    11   }

Liblit05: the predicate at line 6 is ever true in most passing and most failing executions, so it barely discriminates between the two classes. SOBER: the predicate is prone to be true in failing executions and prone to be false in passing executions, so the divergence between the failing-run biases X_f and the passing-run biases X_p exposes it. Besides the difference in methodology, SOBER thus also differs from Liblit05 in the information it uses.

22 T-Score: Metric of Debugging Quality
How close is the blamed location to the real bug location? In this example, T-score = 70%. (The buggy subline function from the running example is shown, with the blamed location far from the real bug.) To evaluate the debugging quality of SOBER, we use T-score as the quality metric. Intuitively, T-score quantifies how close the blamed bug location is to the real bug location. It is computed as follows: represent the program as a program dependence graph (PDG), where each node represents a piece of code and edges represent data and control dependences; mark the real bug locations; mark the blamed locations; perform a breadth-first search from the blamed locations until a real bug location is reached; the T-score is the percentage of the PDG covered by the search. Intuitively, it estimates the percentage of code a developer needs to examine before locating the bug if he or she examines the code along dependences. It is a good measure because it is objective and reflects the manual effort needed, which is why it is widely used.
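A sketch of this BFS-based T-score computation, assuming the PDG is given as an adjacency map and that both the blamed and the real bug nodes are known; the graph in the usage example is made up.

    def t_score(pdg, blamed, buggy):
        """Layered BFS from the blamed nodes over the PDG; stop once a real bug
        node has been covered.  T-score = fraction of PDG nodes examined."""
        seen = set(blamed)
        frontier = set(blamed)
        while frontier and not (seen & set(buggy)):
            frontier = {nb for node in frontier for nb in pdg.get(node, ())} - seen
            seen |= frontier
        return len(seen) / len(pdg)

    # A 10-node chain PDG; blamed node 2, real bug at node 5 -> 6 of 10 nodes examined.
    chain = {i: [j for j in (i - 1, i + 1) if 0 <= j <= 9] for i in range(10)}
    print(t_score(chain, blamed=[2], buggy=[5]))   # 0.6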

23 A Better Debugging Result
T-score = 40% (the same buggy subline function, with a blamed location closer to the real bug).

24 Evaluation 1: Siemens Program Suite
130 buggy versions of 7 small (<700 LOC) programs. The question: what percentage of the bugs can be located with no more than a given percentage of code examination? Here is the comparison between SOBER, Liblit05, and another debugging algorithm, CT, which claimed the best result on the Siemens suite before SOBER. The x-axis shows the T-score, and the y-axis shows what percentage of the 130 bugs can be located with no more than that percentage of code examination. For example, at a T-score of 10%, SOBER identifies 52% of the 130 bugs while Liblit05 identifies 40%. When a developer is willing to examine no more than 20% of the code, SOBER identifies 73% of the bugs while Liblit05 identifies about 63%. A T-score above 20% is generally considered meaningless, since examining more than 20% of the code is rarely acceptable in practice; T-score <= 20% is the meaningful range.

25 Evaluation 2: Reasonably Large Programs
Case studies on reasonably large programs (subject programs, bugs, and test suites from the Software-artifact Infrastructure Repository, SIR):

Flex (8,834 LOC)
  Bug 1: misuse of >= for >, 163/525 cases fail, T-score 0.5%
  Bug 2: misuse of = for ==, 356/525 cases fail, T-score 1.6%
  Bug 3: mis-assigned value true for false, 69/525 cases fail, T-score 7.6%
  Bug 4: mis-parenthesized ((a||b)&&c) as (a || (b && c)), 22/525 cases fail, T-score 15.4%
  Bug 5: off-by-one, 92/525 cases fail, T-score 45.6%
Grep 2.2 (11,826 LOC)
  Bug 1: 48/470 cases fail, T-score 0.6%
  Bug 2: subclause missing, 88/470 cases fail, T-score 0.2%
Gzip 1.2 (6,184 LOC)
  Bug 1: 65/217 cases fail
  Bug 2: 17/217 cases fail, T-score 2.9%

Because the Siemens suite mainly contains small programs, we also ran a set of case studies on these reasonably large programs: flex, grep, and gzip. The lines of code are shown above, and the T-score achieved by SOBER is listed for each bug.

26 A Glimpse of Bugs in Flex-2.4.7
Let’s take a quick look at those bugs, and get a sense of how tricky semantic bugs could be. 11/22/2018 Data Mining: Principles and Algorithms

27 Evaluation 2: Reasonably Large Programs
(Case-study results table as on slide 25; subject programs, bugs, and test suites from the Software-artifact Infrastructure Repository, SIR.) In the following, let us examine two cases where SOBER performs well and one case where it performs badly.

28 A Close Look: Grep-2.2: Bug 1
11,826 lines of C code; 3,136 predicates instrumented; 48 out of 470 test cases fail. This is the first bug in grep-2.2, which has 11,826 lines of code, with 3,136 predicates instrumented. The bug is at line 553: the "+ 1" should not be there, and it causes 48 out of 470 cases to fail. After running SOBER, we get a predicate ranking whose top predicates are P1470 and P1484, which point to the corresponding code locations. With these two points identified, we check where the variables beg and lastout are assigned, and it turns out that the buggy line is the only place where beg is assigned. With SOBER, a developer can easily identify the bug; otherwise he or she has to hunt for it in more than 11K lines of code.

29 Data Mining: Principles and Algorithms
Grep-2.2: Bug 2. 11,826 lines of C code; 3,136 predicates instrumented; 88 out of 470 test cases fail. For the second bug in grep, the highest-ranked predicate identified by SOBER is P1952, and it pinpoints the bug location.

30 No Silver Bullet: Flex Bug 5
8,834 lines of C code; 2,699 predicates instrumented; no wrong value remains in chk[offset - 1]. Certainly, SOBER is not a silver bullet: it cannot guarantee effectiveness for all semantic bugs, because semantic bugs can be very tricky. The fifth bug in Flex is an example. The constant should have been assigned to chk[offset] rather than chk[offset - 1]. But because the wrong value is overwritten with the correct value, no wrong value remains in chk[offset - 1]; in addition, chk[offset] is not used here but only later. This is a very tricky bug that even a human developer would find hard to localize.

31 Experiment Result in Summary
Effective for bugs demonstrating abnormal control flow. (Case-study results table as on slide 25.) How to capture value abnormality is future work. The subclause-missing bug is relatively easy to locate, which can be justified by reasoning; details are in our paper. Off-by-one bugs can be easy or hard to detect, depending on how strongly the off-by-one value affects the control flow, because SOBER currently monitors Boolean predicates, which roughly correspond to control flow. Bug 1 of Grep-2.2 works well because the wrong value is immediately used for flow control; Bug 5 of Flex works badly because the wrong value is never used for control flow and only causes value errors. The remaining four kinds of bugs, although they look wild, can be detected because they cause control-flow anomalies. In conclusion, SOBER is more sensitive to abnormal program flow than to abnormal values, partly because our current predicates do not capture value abnormality; how to capture value abnormality through predicates is an interesting open question.

32 SOBER Handles Memory Bugs As Well
bc 1.06: two memory bugs found with SOBER; one of them was previously unreported; the blamed location is NOT the crashing venue. While we have demonstrated the effectiveness of SOBER on semantic bugs, this does not mean it can do nothing with memory bugs; SOBER is effective for memory bugs as well. We used SOBER to identify two memory bugs in bc 1.06, an arbitrary-precision calculator, one of which had not been reported before. An important point: the location blamed by SOBER is NOT the crashing venue, but it is very close to the root cause. For the first (unreported) bug, the top predicate is the one circled; since old_count is copied from v_count, putting a watch on v_count shows that it is unexpectedly overwritten. The second bug is the widely reported copy-paste bug in bc: the top predicate is a_count < v_count, and the v_count shown should be a_count.

33 Data Mining: Principles and Algorithms
Outline
  Automated Debugging and Failure Triage
  SOBER: Statistical Model-Based Fault Localization
  Fault Localization-Based Failure Triage
  Copy and Paste Bug Mining
  Conclusions & Future Research

34 Major Problems in Failure Triage
Failure prioritization: which failures are likely due to the same bug, and which bugs are the most severe? (Worst 1% of bugs = 50% of failures.) Failure assignment: which developer should debug which set of failures? As a reminder, these are the two major problems in failure triage. (Chart courtesy of Microsoft Corporation.)

35 A Solution: Failure Clustering
Failure indexing: identify failures likely due to the same bug (figure: failure reports plotted as crosses, with callouts such as "Fault in core.io?" and "Fault in function initialize()?", and clusters ordered from most severe to least severe). One solution that addresses all three problems is to cluster failures so that failures likely due to the same fault end up together; it is even better if the clustering can be visualized so that developers can analyze the relationships between failures intuitively. In this graph you need not worry about what the x- and y-axes mean: each cross represents a failure, and a small distance between two crosses means the corresponding failures are likely due to the same fault. Such a graph helps with all three problems. First, we can visually identify the failure clusters. Second, based on cluster size we can prioritize diagnosis: the larger cluster likely represents the most severe fault and should be diagnosed first, followed by the smaller cluster. Finally, if each cluster automatically comes with a likely fault location, failure clusters can be assigned to the appropriate developers automatically. In this talk we discuss how to obtain such a graph, which we call the failure proximity graph, and how to find the fault location for each cluster automatically.

36 The Central Question: A Distance Measure between Failures
Different distance measures render different clusterings (figure: the same failures clustered by a distance defined on the x-axis versus a distance defined on the y-axis). As you may have noticed, the central question of failure clustering is how to define a distance between failures. Different distance definitions result in different failure clusterings, so our objective is to find a distance measure under which failures due to different faults are well separated.

37 How to Define a Distance
Previous work [Podgurski et al., 2003]: T-Proximity, a distance defined on literal trace similarity. Our approach [Liu et al., 2006]: R-Proximity, a distance defined on the likely bug location (obtained with SOBER). So how do we define a distance? Podgurski et al. proposed T-Proximity, which measures the similarity of the raw execution traces. However, since not the entire trace is failure-relevant, T-Proximity is not good at grouping failures due to the same bug. We therefore proposed R-Proximity, which uses SOBER to find the likely bug location for every failure and defines the distance between failures on those likely bug locations.

38 Why Our Approach is Reasonable
Optimal proximity: defined on root causes (RC). Our approach: defined on likely causes (LC), obtained through automated fault localization. The optimal clustering would put failures with the same root cause together, but finding root causes requires inevitable manual work, and what we want is to avoid manually identifying the root cause of every failure. This view provides a general framework for failure triage: if we could find the root cause of each failure, the clustering based on root causes would be optimal. For example, given failing executions fail1, fail2, ..., failm and passing executions pass1, pass2, ..., passn, a developer could manually investigate each failure and find its root cause; if RC2 and RC3 are the same, failure 2 and failure 3 are clustered together, and similarly if RC1 and RCm are the same, failure 1 and failure m are clustered together. Since root causes can only be found through manual work, the optimal failure proximity is too expensive to obtain, which is why we substitute the likely causes reported by automated fault localization.

39 R-Proximity: An Instantiation with SOBER
Likely causes (LCs) are predicate rankings: for each failing execution, SOBER produces a ranking of the instrumented predicates (e.g. Pred2, Pred6, Pred1, Pred3), so a distance between rankings is needed.

40 Distance between Rankings
Traditional Kendall's tau distance: the number of preference disagreements between two rankings. However, not all predicates need to be considered: predicates are uniformly instrumented, but only the fault-relevant ones should count. A ranking is an expression of preferences, and a ranking distance measures the preference disagreement.
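For reference, a plain (unweighted) Kendall's tau distance that counts pairwise preference disagreements; the predicate-weighted variant used for R-Proximity would additionally weight each disagreement by the relevance of the two predicates involved (weighting scheme as on the next slide).

    from itertools import combinations

    def kendall_tau_distance(rank_a, rank_b):
        """Number of predicate pairs ordered differently by the two rankings.
        rank_a, rank_b: lists of predicate ids, most fault-relevant first."""
        pos_a = {p: i for i, p in enumerate(rank_a)}
        pos_b = {p: i for i, p in enumerate(rank_b)}
        return sum(1 for p, q in combinations(rank_a, 2)
                   if (pos_a[p] - pos_a[q]) * (pos_b[p] - pos_b[q]) < 0)

    print(kendall_tau_distance(["Pred2", "Pred6", "Pred1", "Pred3"],
                               ["Pred2", "Pred3", "Pred1", "Pred6"]))   # 3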

41 Predicate Weighting in a Nutshell
Fault-relevant predicates receive higher weights. Fault relevance is implied by the rankings themselves: the most-favored predicates (e.g. Pred2, Pred6, Pred1, Pred3) receive higher weights.

42 Automated Failure Assignment
The most-favored predicates indicate the agreed bug location for a group of failures. Predicate spectrum graph (figure: a bar chart over predicate index summarizing how often each predicate is favored across the group's rankings, e.g. Pred2/Pred6/Pred1/Pred3, Pred2/Pred1/Pred3/Pred6, and so on).

43 Data Mining: Principles and Algorithms
Case Study 1: Grep-2.2 Our first case study is with Grep-2.2. 470 test cases in total 136 cases fail due to both faults, no crashes 48 fail due to Fault 1, 88 fail due to Fault 2 11/22/2018 Data Mining: Principles and Algorithms

44 Failure Proximity Graphs
T-Proximity vs. R-Proximity. For both proximities we calculate the pairwise distances between failures and then plot the failures in a 2-D space such that the pairwise distances are preserved as well as possible. Red crosses are failures due to Fault 1; blue circles are failures due to Fault 2. The same fault can induce divergent behaviors, and the clustering result is clearly better under R-Proximity.

45 Guided Failure Assignment
What predicates are favored in each group? No matter what you circle, the predicates 1470 and 1484 always dominate 11/22/2018 Data Mining: Principles and Algorithms

46 Assign Failures to Appropriate Developers
The 21 failing cases in Cluster 1 are assigned to developers responsible for the function grep The 112 failing cases in Cluster 2 are assigned to developers responsible for the function comsub 11/22/2018 Data Mining: Principles and Algorithms

47 Data Mining: Principles and Algorithms
Case Study 2: Gzip-1.2.3 217 test cases in total 82 cases fail due to both faults, no crashes 65 fail due to Fault 1, 17 fail due to Fault 2 11/22/2018 Data Mining: Principles and Algorithms

48 Failure Proximity Graphs
T-Proximity R-Proximity Red crosses are for failures due to Fault 1 Blue circles are for failures due to Fault 2 Nearly perfect clustering under R-Proximity Accurate failure assignment 11/22/2018 Data Mining: Principles and Algorithms

49 Data Mining: Principles and Algorithms
Outline
  Automated Debugging and Failure Triage
  SOBER: Statistical Model-Based Fault Localization
  Fault Localization-Based Failure Triage
  Copy and Paste Bug Mining
  Conclusions & Future Research

50 Mining Copy-Paste Bugs
Copy-pasting is common: 12% in the Linux file system [Kasper2003], 19% in the X Window system [Baker1995]. Copy-pasted code is error-prone: among 35 errors in Linux drivers/i2o, 34 were caused by copy-paste [Chou2001].

    void __init prom_meminit(void)
    {
        ......
        for (i = 0; i < n; i++) {
            total[i].adr   = list[i].addr;
            total[i].bytes = list[i].size;
            total[i].more  = &total[i+1];
        }
        ......
        for (i = 0; i < n; i++) {
            taken[i].adr   = list[i].addr;
            taken[i].bytes = list[i].size;
            taken[i].more  = &total[i+1];   /* forgot to change total to taken! */
        }
    }

(Simplified example from linux-2.6.6/arch/sparc/prom/memory.c)

51 An Overview of Copy-Paste Bug Detection
1. Parse source code and build a sequence database
2. Mine for basic copy-pasted segments
3. Compose larger copy-pasted segments
4. Prune false positives

52 Data Mining: Principles and Algorithms
Parsing Source Code. Purpose: building a sequence database. Idea: map each statement to a number. Tokenize each component: different operators, constants, and keywords map to different tokens. Handle identifier renaming: identifiers of the same type map to the same token. Example: "old = 3;" and "new = 3;" tokenize to the same token sequence and therefore hash to the same value (16).
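A toy Python version of this statement-to-number mapping; the regular expression, keyword set, and modulo-100 hash are simplifications for illustration, not CP-Miner's actual tokenizer.

    import re

    KEYWORDS = {"if", "else", "for", "while", "return"}

    def tokenize(stmt):
        """Map every identifier to the same token so that renamed copies look identical."""
        tokens = []
        for tok in re.findall(r"[A-Za-z_]\w*|\d+|==|<=|>=|!=|\S", stmt):
            if re.match(r"[A-Za-z_]\w*$", tok) and tok not in KEYWORDS:
                tokens.append("ID")
            elif re.match(r"\d+$", tok):
                tokens.append("NUM")
            else:
                tokens.append(tok)
        return tuple(tokens)

    def stmt_hash(stmt):
        return hash(tokenize(stmt)) % 100   # small range just for the illustration

    print(stmt_hash("old = 3;") == stmt_hash("new = 3;"))   # True: renamed copy, same number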

53 Building Sequence Database
A program becomes one long sequence of hash values, but we need a sequence database, so we cut the long sequence. The naive method cuts it into fixed-length pieces; our method cuts it at basic-block boundaries. In the example above, each for loop hashes to 65 (the loop header) followed by 16, 16, 71 (the three assignments), so the final sequence database contains (65) (16, 16, 71) ... (65) (16, 16, 71).

54 Mining for Basic Copy-pasted Segments
Apply a frequent sequence mining algorithm to the sequence database, with one modification: constrain the maximum gap. The frequent subsequence (16, 16, 71), corresponding to the three total[i] assignments, still matches the copy in which one statement has been inserted, i.e. (16, 16, 10, 71) for the taken[i] block, as long as the gap is at most 1.
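The gap constraint can be checked as below; this is only the matching step, under the assumption that candidate frequent subsequences come from a miner such as CloSpan, and the function name and default max_gap are illustrative.

    def matches_with_gap(pattern, sequence, max_gap=1):
        """True if pattern occurs in sequence as a subsequence with at most
        max_gap skipped elements between consecutive matched elements."""
        i = 0                      # position in sequence
        for k, sym in enumerate(pattern):
            gap = 0
            while i < len(sequence) and sequence[i] != sym:
                if k > 0:          # gaps are only counted between matched elements
                    gap += 1
                    if gap > max_gap:
                        return False
                i += 1
            if i == len(sequence):
                return False
            i += 1
        return True

    print(matches_with_gap((16, 16, 71), (16, 16, 10, 71)))      # True  (one inserted statement)
    print(matches_with_gap((16, 16, 71), (16, 16, 10, 12, 71)))  # False (gap of 2 > max_gap)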

55 Composing Larger Copy-Pasted Segments
Combine neighboring copy-pasted segments repeatedly: the basic segment (16, 16, 71) for the loop body is combined with the neighboring segment (65) for the loop header, so the whole for loop over total[i] and the whole for loop over taken[i] are recognized as one larger copy-pasted pair.

56 Pruning False Positives
Unmappable segments: identifier names cannot be mapped to corresponding ones consistently, e.g. f(a1); f(a2); f(a3); versus f1(b1); f1(b2); f2(b3);, where f would have to map to both f1 and f2, a conflict. Tiny segments are also pruned. For more detail, see Zhenmin Li, Shan Lu, Suvda Myagmar, and Yuanyuan Zhou, "CP-Miner: A Tool for Finding Copy-paste and Related Bugs in Operating System Code," in Proc. 6th Symp. Operating Systems Design and Implementation, 2004.
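A minimal check for the unmappable-segment case above; this is a hypothetical helper, and CP-Miner's real pruning works on the token mapping produced during parsing.

    def consistent_renaming(seg_a, seg_b):
        """seg_a, seg_b: identifier names at corresponding positions in two copies.
        Returns False if some identifier would need two different counterparts."""
        mapping = {}
        for a, b in zip(seg_a, seg_b):
            if mapping.setdefault(a, b) != b:
                return False       # conflict: 'a' maps to two different names
        return True

    print(consistent_renaming(["f", "f", "f"], ["f1", "f1", "f2"]))     # False (the slide's example)
    print(consistent_renaming(["total", "total"], ["taken", "taken"]))  # True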

57 Some Test Results of C-P Bug Detection
Verified bugs / potential bugs (careless programming):
  Linux 28 / 21, FreeBSD 23 / 8, Apache 5, PostgreSQL 2
Code size:
  Linux 4.4 M LOC, FreeBSD 3.3 M LOC, Apache 224 K LOC, PostgreSQL 458 K LOC
Space and time:
  PostgreSQL 57 MB, 38 secs; Apache 30 MB, 15 secs; FreeBSD 459 MB, 20 mins; Linux 527 MB

58 Data Mining: Principles and Algorithms
Outline
  Automated Debugging and Failure Triage
  SOBER: Statistical Model-Based Fault Localization
  Fault Localization-Based Failure Triage
  Copy and Paste Bug Mining
  Conclusions & Future Research

59 Data Mining: Principles and Algorithms
Conclusions. Data mining applies to software and computer systems: we can identify incorrect executions from program runtime behaviors; classification dynamics can give away a "backtrace" for noncrashing bugs without any semantic input; a hypothesis testing-based approach localizes logic bugs in software, with no prior knowledge of program semantics assumed; and many other software bug mining methods remain to be explored.

60 Future Research: Mining into Computer Systems
Huge volume of data from computer systems Persistent state interactions, event logs, network logs, CPU usage, … Mining system data for … Reliability Performance Manageability Challenges in data mining Statistical modeling of computer systems Online, scalability, interpretability … Most of these problems are noncrashing failures. 11/22/2018 Data Mining: Principles and Algorithms

61 Data Mining: Principles and Algorithms
References
[DRL+98] David L. Detlefs, K. Rustan M. Leino, Greg Nelson, and James B. Saxe. Extended static checking, 1998.
[EGH+94] David Evans, John Guttag, James Horning, and Yang Meng Tan. LCLint: A tool for using specifications to check code. In Proceedings of the ACM SIGSOFT '94 Symposium on the Foundations of Software Engineering, pages 87-96, 1994.
[DLS02] Manuvir Das, Sorin Lerner, and Mark Seigle. ESP: Path-sensitive program verification in polynomial time. In Conference on Programming Language Design and Implementation, 2002.
[ECC00] D. R. Engler, B. Chelf, A. Chou, and S. Hallem. Checking system rules using system-specific, programmer-written compiler extensions. In Proc. 4th Symp. Operating Systems Design and Implementation, October 2000.
[M93] Ken McMillan. Symbolic Model Checking. Kluwer Academic Publishers, 1993.
[H97] Gerard J. Holzmann. The model checker SPIN. Software Engineering, 23(5), 1997.
[DDH+92] David L. Dill, Andreas J. Drexler, Alan J. Hu, and C. Han Yang. Protocol verification as a hardware design aid. In IEEE Int. Conf. Computer Design: VLSI in Computers and Processors, 1992.
[MPC+02] M. Musuvathi, D. Y. W. Park, A. Chou, D. R. Engler, and D. L. Dill. CMC: A pragmatic approach to model checking real code. In Proc. 5th Symp. Operating Systems Design and Implementation, 2002.

62 Data Mining: Principles and Algorithms
References (cont'd)
[G97] P. Godefroid. Model checking for programming languages using VeriSoft. In Proc. 24th ACM Symp. Principles of Programming Languages, 1997.
[BHP+00] G. Brat, K. Havelund, S. Park, and W. Visser. Model checking programs. In IEEE Int'l Conf. Automated Software Engineering (ASE), 2000.
[HJ92] R. Hastings and B. Joyce. Purify: Fast detection of memory leaks and access errors. In Proc. Winter 1992 USENIX Conference, San Francisco, California, 1992.
Chao Liu, Xifeng Yan, and Jiawei Han. Mining control flow abnormality for logic error isolation. In Proc. 2006 SIAM Int. Conf. on Data Mining (SDM'06), Bethesda, MD, April 2006.
C. Liu, X. Yan, L. Fei, J. Han, and S. Midkiff. SOBER: Statistical model-based bug localization. In Proc. 2005 ACM SIGSOFT Symp. Foundations of Software Engineering (FSE 2005), Lisbon, Portugal, Sept. 2005.
C. Liu, X. Yan, H. Yu, J. Han, and P. S. Yu. Mining behavior graphs for backtrace of noncrashing bugs. In Proc. 2005 SIAM Int. Conf. on Data Mining (SDM'05), Newport Beach, CA, April 2005.
[SN00] Julian Seward and Nick Nethercote. Valgrind, an open-source memory debugger for x86-GNU/Linux.
[LLM+04] Zhenmin Li, Shan Lu, Suvda Myagmar, and Yuanyuan Zhou. CP-Miner: A tool for finding copy-paste and related bugs in operating system code. In Proc. 6th Symp. Operating Systems Design and Implementation, 2004.
[LCS+04] Zhenmin Li, Zhifeng Chen, Sudarshan M. Srinivasan, and Yuanyuan Zhou. C-Miner: Mining block correlations in storage systems. In Proc. 3rd USENIX Conf. on File and Storage Technologies, 2004.

63 Data Mining: Principles and Algorithms

64 Data Mining: Principles and Algorithms
Surplus Slides. The remaining slides are leftover material. Now let's take a look at how semantic bugs that incur no crashes can be located through statistical analysis.

65 Representative Publications
Chao Liu, Long Fei, Xifeng Yan, Jiawei Han and Samuel Midkiff, “Statistical Debugging: A Hypothesis Testing-Based Approach,” IEEE Transaction on Software Engineering, Vol. 32, No. 10, pp , Oct., 2006. Chao Liu and Jiawei Han, “R-Proximity: Failure Proximity Defined via Statistical Debugging,” IEEE Transaction on Software Engineering, Sept (under review) Chao Liu, Zeng Lian and Jiawei Han, "How Bayesians Debug", the 6th IEEE International Conference on Data Mining, pp. pp ,Hong Kong, China, Dec. 2006. Chao Liu and Jiawei Han, "Failure Proximity: A Fault Localization-Based Approach", the 14th ACM SIGSOFT Symposium on the Foundations of Software Engineering, pp , Portland, USA, Nov. 2006. Chao Liu, "Fault-aware Fingerprinting: Towards Mutualism between Failure Investigation and Statistical Debugging", the 14th ACM SIGSOFT Symposium on the Foundations of Software Engineering, Portland, USA, Nov. 2006. Chao Liu, Chen Chen, Jiawei Han and Philip S. Yu, "GPLAG: Detection of Software Plagiarism by Program Dependence Graph Analysis", the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp , Philadelphia, USA, Aug Qiaozhu Mei, Chao Liu, Hang Su and Chengxiang Zhai, "A Probabilistic Approach to Spatiotemporal Theme Pattern Mining on Weblogs", the 15th International Conference on World Wide Web, pp.  , Edinburgh, Scotland, May, 2006.  Chao Liu, Xifeng Yan and Jiawei Han, "Mining Control Flow Abnormality for Logic Error Isolation", 2006 SIAM International Conference on Data Mining, pp , Bethesda, US, April, 2006. Chao Liu, Xifeng Yan, Long Fei, Jiawei Han and Samuel Midkiff, "SOBER: Statistical Model-Based Bug Localization", the 5th joint meeting of the European Software Engineering Conference and ACM SIGSOFT Symposium on the Foundations of Software Engineering, pp.  , Lisbon, Portugal, Sept William Yurcik and Chao Liu. "A First Step Toward Detecting SSH Identity Theft on HPC Clusters: Discriminating Cluster Masqueraders Based on Command Behavior" the 5th International Symposium on Cluster Computing and the Grid, pp.  , Cardiff, UK, May 2005. Chao Liu, Xifeng Yan, Hwanjo Yu, Jiawei Han and Philip S. Yu, "Mining Behavior Graphs for "Backtrace" of Noncrashing Bugs", In Proc SIAM Int. Conf. on Data Mining, pp.  , Newport Beach, US, April, 2005. 11/22/2018 Data Mining: Principles and Algorithms

66 Example of Noncrashing Bugs
    void subline(char *lin, char *pat, char *sub)
    {
        int i, lastm, m;
        lastm = -1;
        i = 0;
        while ((lin[i] != ENDSTR)) {
            m = amatch(lin, i, pat, 0);
            if ((m >= 0) && (lastm != m)) {    /* correct condition */
                putsub(lin, i, m, sub);
                lastm = m;
            }
            if ((m == -1) || (m == i)) {
                fputc(lin[i], stdout);
                i = i + 1;
            } else
                i = m;
        }
    }

The buggy variants differ only in the first if condition: if (m >= 0) (missing subclause) or if (m > 0) (wrong operator). From the memory-access point of view, even the incorrect executions are correct.

67 Data Mining: Principles and Algorithms
Debugging Crashes Crashing Bugs 11/22/2018 Data Mining: Principles and Algorithms

68 Bug Localization via Backtrace
Can we circle out the backtrace for noncrashing bugs? The major challenge is that we do not know where the abnormality happens. Observation: classification depends on discriminative features, which can be regarded as a kind of abnormality. Can we extract a backtrace from classification results? Recall that for crashing bugs, memory accesses are obviously where the abnormality happens, so the call stack constitutes the backtrace. If we knew where the abnormality happens, the call stack could serve as the backtrace for noncrashing bugs as well.

69 Data Mining: Principles and Algorithms
Outline Motivation Related Work Classification of Program Executions Extract “Backtrace” from Classification Dynamics Mining Control Flow Abnormality for Logic Error Isolation CP-Miner: Mining Copy-Paste Bugs Conclusions 11/22/2018 Data Mining: Principles and Algorithms

70 Data Mining: Principles and Algorithms
Related Work Crashing bugs Memory access monitoring Purify [HJ92], Valgrind [SN00] … Noncrashing bugs Static program analysis Traditional model checking Model checking source code 11/22/2018 Data Mining: Principles and Algorithms

71 Static Program Analysis
Methodology: examine the source code directly, enumerate all possible execution paths without running the program, and check user-specified properties, e.g. no dereference of *p after free(p), lock(res) paired with unlock(res), receive_ack() before send_data(). Strength: all possible execution paths are checked. Problems: only shallow semantics, i.e. properties that can be mapped directly onto source code structure. Tools: ESC [DRL+98], LCLint [EGH+94], ESP [DLS02], MC Checker [ECC00], ...

72 Traditional Model Checking
Methodology: formally model the system under check in a particular description language (usually a finite state machine), then exhaustively explore the reachable states to check desired or undesired properties. Strengths: models deep semantics; naturally fits event-driven systems such as protocols. Problems: a significant amount of manual modeling effort; state-space explosion. Tools: SMV [M93], SPIN [H97], Murphi [DDH+92], ...

73 Model Checking Source Code
Methodology: run the real program in a sandbox and manipulate event outcomes, e.g. incoming messages or the results of memory allocation. Strength: less manual specification. Problems: application restrictions remain, e.g. (still) event-driven programs and the need for a clear mapping between source code and logical events. Tools: CMC [MPC+02], VeriSoft [G97], Java PathFinder [BHP+00], ...

74 Summary of Related Work
In common, Semantic inputs are necessary Program model Properties to check Application scenarios Shallow semantics Event-driven system When these methods do not work? 11/22/2018 Data Mining: Principles and Algorithms

75 Data Mining: Principles and Algorithms
Outline Motivation Related Work Classification of Program Executions Extract “Backtrace” from Classification Dynamics Mining Control Flow Abnormality for Logic Error Isolation CP-Miner: Mining Copy-Paste Bugs Conclusions 11/22/2018 Data Mining: Principles and Algorithms

76 Data Mining: Principles and Algorithms
Example Revisited: the subline function from before, in its correct version and the buggy variants differing only in the first if condition. There are no memory violations, this is not an event-driven program, and there are no explicit error properties. From the memory-access point of view even the incorrect executions are correct, and the program is hard to model with a finite state machine.

77 Identification of Incorrect Executions
A two-class classification problem. How do we abstract program executions? With program behavior graphs; features are edges plus closed frequent subgraphs. A program behavior graph is a function-level abstraction of program behavior, for example:

    int main() { ... A(); B(); }
    int A() { ... }
    int B() { ... C(); ... }
    int C() { ... }

Behavior graph = call graph + transition graph; one graph is built from each execution.
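A sketch of how one behavior graph might be assembled from a trace of call/return events; the event format and the root name "main" are assumptions for illustration, not the actual instrumentation.

    def behavior_graph(events):
        """events: sequence of ('call', f) / ('return',) records from one execution.
        Returns a set of edges: call edges (caller -> callee) plus transition edges
        between consecutive callees of the same caller."""
        edges, stack, last_child = set(), ["main"], {"main": None}
        for ev in events:
            if ev[0] == "call":
                caller, callee = stack[-1], ev[1]
                edges.add(("call", caller, callee))
                if last_child.get(caller):
                    edges.add(("trans", last_child[caller], callee))
                last_child[caller] = callee
                stack.append(callee)
                last_child[callee] = None
            else:                       # 'return'
                stack.pop()
        return edges

    # main calls A then B, and B calls C (the example above):
    print(behavior_graph([("call", "A"), ("return",),
                          ("call", "B"), ("call", "C"), ("return",), ("return",)]))
    # {('call','main','A'), ('call','main','B'), ('trans','A','B'), ('call','B','C')} (in some order)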

78 Values of Classification
A graph classification problem: every execution gives one behavior graph, and we have two sets of instances, correct and incorrect. What is the value of classification? Classification by itself does not readily support bug localization: a classifier only labels each run as correct or incorrect as a whole and does not tell when the abnormality happens. But successful classification relies on discriminative features. Can discriminative features be treated as a kind of abnormality, and can incremental classification reveal when the abnormality happens?

79 Data Mining: Principles and Algorithms
Outline Motivation Related Work Classification of Program Executions Extract “Backtrace” from Classification Dynamics Mining Control Flow Abnormality for Logic Error Isolation CP-Miner: Mining Copy-Paste Bugs Conclusions 11/22/2018 Data Mining: Principles and Algorithms

80 Incremental Classification
Classification works only when the instances of the two classes differ, so classification accuracy can serve as a measure of difference, and classification dynamics can be related to bug-relevant functions. The main idea of incremental classification is to train classifiers at different stages of program execution so that we have a chance to capture when the bug happens or where the abnormality is. An incorrect execution looks the same as a correct one at the beginning; at some stage it triggers the bug and then diverges from the correct executions, and the stage at which the classifier begins to separate the two classes reveals that point.

81 Illustration: Precision Boost
(Figure: call trees of one correct and one incorrect execution over the functions main, A, B, C, D, E, F, G, H.)

82 Data Mining: Principles and Algorithms
Bug Relevance. For each function F: precision boost = exit precision - entrance precision. Intuition: the differences take place within the execution of F, i.e. the abnormality happens while F is on the stack. The larger the precision boost, the more likely F is part of the backtrace, i.e. a bug-relevant function.
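In code the score is just a subtraction; the sketch below ranks functions by it, with illustrative precision numbers rather than measured values, and assumes the entrance/exit precisions have already been obtained from the incremental classifiers.

    def precision_boost(entrance_precision, exit_precision):
        """Precision boost of function F = exit precision - entrance precision."""
        return exit_precision - entrance_precision

    # Rank functions by boost, largest first (numbers are made up for illustration):
    scores = {"subline": precision_boost(0.55, 0.95), "amatch": precision_boost(0.60, 0.70)}
    print(sorted(scores, key=scores.get, reverse=True))   # ['subline', 'amatch']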

83 Data Mining: Principles and Algorithms
Outline Related Work Classification of Program Executions Extract “Backtrace” from Classification Dynamics Case Study Conclusions 11/22/2018 Data Mining: Principles and Algorithms

84 Data Mining: Principles and Algorithms
Case Study. Subject program: replace, which performs regular expression matching and substitution; 563 lines of C code; 17 functions involved. (The code is the subline function shown earlier, in its correct and buggy versions.) Execution behavior: 130 out of 5,542 test cases fail to give correct outputs, and no incorrect execution incurs a segmentation fault; this is a logic bug. Can we circle out the backtrace for this bug? From the memory-access point of view, even the incorrect executions are correct.

85 Data Mining: Principles and Algorithms
Precision Pairs 11/22/2018 Data Mining: Principles and Algorithms

86 Precision Boost Analysis
Objective judgment of bug relevant functions main function is always bug relevant Stepwise precision boost Line-up property 11/22/2018 Data Mining: Principles and Algorithms

87 Backtrace for Noncrashing Bugs
11/22/2018 Data Mining: Principles and Algorithms

88 Data Mining: Principles and Algorithms
Method Summary. We identify incorrect executions from program runtime behaviors, and classification dynamics can give away a "backtrace" for noncrashing bugs without any semantic input. Data mining can contribute to software engineering and systems research in general: CP-Miner [LLM+04] detects copy-paste bugs in OS code using the CloSpan algorithm, and C-Miner [LCS+04] discovers block correlations in storage systems, again using CloSpan, and effectively reduces I/O response time.

89 Data Mining: Principles and Algorithms
Outline Motivation Related Work Classification of Program Executions Extract “Backtrace” from Classification Dynamics Mining Control Flow Abnormality for Logic Error Isolation CP-Miner: Mining Copy-Paste Bugs Conclusions 11/22/2018 Data Mining: Principles and Algorithms

90 Data Mining: Principles and Algorithms
An Example

    void dodash(char delim, char *src, int *i, char *dest, int *j, int maxset)
    {
        while (...) {
            ...
            if (isalnum(src[*i+1]) && src[*i-1] <= src[*i+1]) {
                /* the buggy version is missing one of these two subclauses
                   (highlighted in red on the slide) */
                for (k = src[*i-1]+1; k <= src[*i+1]; k++)
                    junk = addst(k, dest, j, maxset);
                *i = *i + 1;
            }
            ...
        }
    }

Had the function been written correctly, the highlighted subclause would have been there. Replace program: 563 lines of C code, 20 functions. Symptom: 30 out of 5,542 test cases fail to give correct outputs, with no crashes. Goal: localize the bug and prioritize manual examination.

91 Difficulty & Expectation
Statically, even small programs are complex due to dependencies Dynamically, execution paths can vary significantly across all possible inputs Logic errors have no apparent symptoms Expectations Unrealistic to fully unload developers Localize buggy region Prioritize manual examination 11/22/2018 Data Mining: Principles and Algorithms

92 Data Mining: Principles and Algorithms
Execution Profiling Full execution trace Control flow + value tags Too expensive to record at runtime Unwieldy to process Summarized control flow for conditionals (if, while, for) Branch evaluation counts Lightweight to take at runtime Easy to process and effective How to represent 11/22/2018 Data Mining: Principles and Algorithms

93 Analysis of the Example
Consider the dodash condition from the previous slide, with A = isalnum(src[*i+1]) and B = src[*i-1] <= src[*i+1]. An execution is logically correct until (A and not B) evaluates to true when evaluation reaches this condition. If we monitor program conditionals like A and B here, their evaluations shed light on the hidden error and can be exploited for error isolation.

94 Analysis of Branching Actions
Correct vs. incorrect runs of program P. Over the 5,542 test cases, the true evaluation probability of (A and not B) is essentially zero in correct executions but positive in incorrect executions on average, so the error location does exhibit detectable abnormal behavior in incorrect executions. Contingency tables of evaluation counts (rows B / not B, columns A / not A): in a correct run n_{A,notB} = 0, while in an incorrect run n_{A,notB} >= 1.

95 Conditional Test Works for Nonbranching Errors
    void makepat(char *arg, int start, char delim, char *pat)
    {
        ...
        if (!junk)
            result = 0;
        else
            result = i + 1;   /* off-by-one error: should be result = i */
        return result;
    }

The off-by-one error can still be detected using the conditional tests.

96 Ranking Based on Boolean Bias
Let input d_i have a desired output o_i. We execute P; P passes the test iff the actual output o_i' is identical to o_i. T_p = {t_i | o_i' = P(d_i) matches o_i}, T_f = {t_i | o_i' = P(d_i) does not match o_i}. Boolean bias: with n_t the number of times a Boolean feature B evaluates true and n_f the number of times it evaluates false, pi(B) = (n_t - n_f) / (n_t + n_f). It encodes the distribution of B's value: 1 if B is always true, -1 if always false, and in between for all other mixtures.

97 Evaluation Abnormality
Boolean bias for branch P the probability of being evaluated as true within one execution Suppose we have n correct and m incorrect executions, for any predicate P, we end up with An observation sequence for correct runs S_p = (X’_1, X’_2, …, X’_n) An observation sequence for incorrect runs S_f = (X_1, X_2, …, X_m) Can we infer whether P is suspicious based on S_p and S_f? 11/22/2018 Data Mining: Principles and Algorithms

98 Underlying Populations
Imagine the underlying distribution of boolean bias for correct and incorrect executions are f(X|θp) and f(X|θf) S_p and S_f can be viewed as random sample from the underlying populations respectively Major heuristic: The larger the divergence between f(X|θp) and f(X|θf), the more relevant the branch P is to the bug 1 Prob Evaluation bias 1 Prob Evaluation bias 11/22/2018 Data Mining: Principles and Algorithms

99 Data Mining: Principles and Algorithms
Major Challenges (figure: the unknown distributions of evaluation bias for correct and incorrect runs). We have no knowledge of the closed forms of the two distributions, and usually we do not have enough incorrect executions to estimate f(X|theta_f) reliably. If we knew them, standard measures such as KL-divergence would apply.

100 Our Approach: Hypothesis Testing
11/22/2018 Data Mining: Principles and Algorithms

101 Data Mining: Principles and Algorithms
Faulty Functions Motivation Bugs are not necessarily on branches Higher confidence in function rankings than branch rankings Abnormality score for functions Calculate the abnormality score for each branch within each function Aggregate them 11/22/2018 Data Mining: Principles and Algorithms

102 Two Evaluation Measures
CombineRank: combine the branch scores by summation; intuition: when a function contains many abnormal branches, it is likely bug-relevant. UpperRank: choose the largest score as the representative; intuition: when a function has one extremely abnormal branch, it is likely bug-relevant. Some derivation is shown in the paper.
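A minimal sketch of the two aggregation schemes, with hypothetical per-branch abnormality scores (the score computation itself is the hypothesis-testing step discussed earlier); it also shows how the two measures can rank the same functions differently, which is exactly the dodash vs. omatch question on the next slide.

    def combine_rank(branch_scores):
        """CombineRank: sum of abnormality scores of all branches in the function."""
        return sum(branch_scores)

    def upper_rank(branch_scores):
        """UpperRank: the single most abnormal branch represents the function."""
        return max(branch_scores)

    scores = {"dodash": [3.2, 0.4, 0.1], "omatch": [1.5, 1.4, 1.3]}   # hypothetical branch scores
    print(sorted(scores, key=lambda f: combine_rank(scores[f]), reverse=True))  # ['omatch', 'dodash']
    print(sorted(scores, key=lambda f: upper_rank(scores[f]), reverse=True))    # ['dodash', 'omatch']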

103 Data Mining: Principles and Algorithms
Dodash vs. Omatch: Which function is likely buggy?─And Which Measure is More Effective? 11/22/2018 Data Mining: Principles and Algorithms

104 Data Mining: Principles and Algorithms
Bug Benchmark Bug benchmark Siemens Program Suite 89 variants of 6 subject programs, each of LOC 89 known bugs in total Mainly logic (or semantic) bugs Widely used in software engineering research 11/22/2018 Data Mining: Principles and Algorithms

105 Results on Program “replace”
11/22/2018 Data Mining: Principles and Algorithms

106 Comparison between CombineRank and UpperRank
Buggy function ranked within top-k 11/22/2018 Data Mining: Principles and Algorithms

107 Results on Other Programs
11/22/2018 Data Mining: Principles and Algorithms

108 More Questions to Be Answered
What will happen (i.e., how to handle) if multiple errors exist in one program? How to detect bugs if only very few error test cases are available? Is it really more effective if we have more execution traces? How to integrate program semantics in this statistics-based testing algorithm? How to integrate program semantics analysis with statistics-based analysis? Here comes the outline. We first discuss based on an example, which illustrates why logic errors are hard to deal with. 11/22/2018 Data Mining: Principles and Algorithms

