
1 Data Mining: Concepts and Techniques — Chapter 11 — Additional Theme: Software Bug Mining
Jiawei Han and Micheline Kamber, Department of Computer Science, University of Illinois at Urbana-Champaign. ©2006 Jiawei Han and Micheline Kamber. All rights reserved. Acknowledgement: Chao Liu


3 Outline
Motivation
Related Work
Classification of Program Executions
Extract “Backtrace” from Classification Dynamics
Mining Control Flow Abnormality for Logic Error Isolation
CP-Miner: Mining Copy-Paste Bugs
Conclusions

4 Motivation
Software is “full of bugs”
Windows 2000: 35 million lines of code, 63,000 known bugs at the time of release, i.e., 2 per 1,000 lines
Software failure costs
Ariane 5 explosion due to “errors in the software of the inertial reference system” (Ariane 5 Flight 501 inquiry board report)
A study by the National Institute of Standards and Technology found that software errors cost the U.S. economy about $59.5 billion annually
Testing and debugging are laborious and expensive
“50% of my company employees are testers, and the rest spend 50% of their time testing!” —Bill Gates, 1995 (courtesy of CNN.com)

This work is about how to automatically localize software bugs. The major motivation is that software is full of bugs: one study showed an average rate of 1–4.5 errors per 1,000 lines of code. Windows 2000, with 35 million lines of code, contained 63,000 known bugs at the time of its release, i.e., two errors per thousand lines. When bugs strike in practice, the costs are tremendous. In 1996, the Ariane 5 exploded 40 seconds after launch; as the investigation showed, the explosion was due to errors in the software of the inertial reference system. A study by the National Institute of Standards and Technology found that software errors cost the U.S. economy about $59.5 billion annually. Consequently, great effort goes into testing and debugging across the software life cycle; Bill Gates once remarked that 50% of his company's employees are testers, and the rest spend 50% of their time testing. Since testing and debugging are such laborious tasks, research has turned to automated bug localization.

5 A Glimpse at Software Bugs
Crashing bugs
Symptoms: segmentation faults
Reasons: memory access violations
Tools: Valgrind, CCured
Noncrashing bugs
Symptoms: unexpected outputs
Reasons: logic or semantic errors, e.g.,
if ((m >= 0)) vs. if ((m >= 0) && (m != lastm))
< vs. <=, > vs. >=, etc.
j = i vs. j = i + 1
Tools: no sound tools exist

6 Example of Noncrashing Bugs

Correct version:

void subline(char *lin, char *pat, char *sub)
{
    int i, lastm, m;
    lastm = -1;
    i = 0;
    while (lin[i] != ENDSTR) {
        m = amatch(lin, i, pat, 0);
        if ((m >= 0) && (lastm != m)) {   /* the correct condition */
            putsub(lin, i, m, sub);
            lastm = m;
        }
        if ((m == -1) || (m == i)) {
            fputc(lin[i], stdout);
            i = i + 1;
        } else
            i = m;
    }
}

The buggy variants shown on the slide change only the first condition: if (m >= 0), which drops the (lastm != m) subclause, and if (m > 0), which additionally turns >= into >. From a memory-access point of view, even the incorrect executions are correct.

7 Debugging Crashes
Crashing bugs
[Figure omitted from transcript]

8 Bug Localization via Backtrace
Can we circle out the backtrace for noncrashing bugs?
Major challenge: we do not know where the abnormality happens
Observation: classification depends on discriminative features, which can be regarded as a kind of abnormality
Can we extract a backtrace from classification results?

Recall that for crashing bugs, memory accesses are obviously where the abnormality happens, so the call stack constitutes the backtrace. If we knew where the abnormality happens in a noncrashing run, the call stack there could likewise serve as its backtrace.

9 Outline
Motivation
Related Work
Classification of Program Executions
Extract “Backtrace” from Classification Dynamics
Mining Control Flow Abnormality for Logic Error Isolation
CP-Miner: Mining Copy-Paste Bugs
Conclusions

10 Related Work
Crashing bugs
Memory access monitoring: Purify [HJ92], Valgrind [SN00], …
Noncrashing bugs
Static program analysis
Traditional model checking
Model checking source code

11 Static Program Analysis
Methodology
Examine the source code directly
Enumerate all possible execution paths without running the program
Check user-specified properties, e.g.,
free(p) …… (*p)
lock(res) …… unlock(res)
receive_ack() …… send_data()
Strengths
Checks all possible execution paths
Problems
Shallow semantics: only properties that map directly onto source code structure can be checked
Tools
ESC [DRL+98], LCLint [EGH+94], ESP [DLS02], MC Checker [ECC00], …

12 Traditional Model Checking
Methodology
Formally model the system under check in a particular description language (usually a finite state machine)
Exhaustively explore the reachable states while checking desired or undesired properties
Strengths
Models deep semantics
Naturally fits event-driven systems, like protocols
Problems
Significant manual effort in modeling
State space explosion
Tools
SMV [M93], SPIN [H97], Murphi [DDH+92], …

13 Model Checking Source Code
Methodology
Run the real program in a sandbox
Manipulate event happenings, e.g.,
message arrivals
the outcomes of memory allocation
Strengths
Requires less manual specification
Problems
Application restrictions, e.g.,
(still) event-driven programs
needs a clear mapping between source code and logical events
Tools
CMC [MPC+02], VeriSoft [G97], Java PathFinder [BHP+00], …

14 Summary of Related Work
In common: semantic inputs are necessary
a program model
properties to check
Application scenarios
shallow semantics
event-driven systems
When do these methods not work?

15 Outline
Motivation
Related Work
Classification of Program Executions
Extract “Backtrace” from Classification Dynamics
Mining Control Flow Abnormality for Logic Error Isolation
CP-Miner: Mining Copy-Paste Bugs
Conclusions

16 Example Revisited
(The correct and buggy versions of subline from slide 6 are shown again.)
No memory violations
Not an event-driven program
No explicit error properties, hence hard to model using finite state machines
From a memory-access point of view, even incorrect executions are correct.

17 Identification of Incorrect Executions
A two-class classification problem
How to abstract program executions: program behavior graphs
Feature selection: edges + closed frequent subgraphs
Program behavior graphs: a function-level abstraction of program behaviors

int main(){ ... A(); B(); }
int A(){ ... }
int B(){ ... C() ... }
int C(){ ... }

Behavior graph = call graph + transition graph
One graph from one execution
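As a rough illustration (the representation and names here are our own, not from the paper), a behavior graph can be stored as a matrix of call/transition edge counts over the program's functions; each execution yields one such graph, and its edges later serve as classification features:

#include <stdio.h>

#define NFUNC 4                 /* hypothetical program: main, A, B, C */
enum { FMAIN, FA, FB, FC };

/* One behavior graph per execution: edge[i][j] counts how often
 * function i called, or transferred control to, function j. */
typedef struct {
    int edge[NFUNC][NFUNC];
} behavior_graph;

static void record_edge(behavior_graph *g, int from, int to) {
    g->edge[from][to]++;
}

int main(void) {
    behavior_graph g = {{{0}}};
    record_edge(&g, FMAIN, FA);  /* main() calls A() */
    record_edge(&g, FMAIN, FB);  /* main() calls B() */
    record_edge(&g, FB, FC);     /* B() calls C()    */
    printf("main->B edge count: %d\n", g.edge[FMAIN][FB]);
    return 0;
}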

18 Values of Classification
A graph classification problem
Every execution gives one behavior graph
Two sets of instances: correct and incorrect
Values of classification
Classification itself does not readily work for bug localization
A classifier only labels each run as either correct or incorrect as a whole
It does not tell when the abnormality happens
Successful classification relies on discriminative features
Can discriminative features be treated as a kind of abnormality?
When does the abnormality happen? Incremental classification?

19 Outline
Motivation
Related Work
Classification of Program Executions
Extract “Backtrace” from Classification Dynamics
Mining Control Flow Abnormality for Logic Error Isolation
CP-Miner: Mining Copy-Paste Bugs
Conclusions

20 Incremental Classification
Classification works only when instances of the two classes differ, so classification accuracy can serve as a measure of difference.
Relate classification dynamics to bug-relevant functions.

The main idea of incremental classification is to train classifiers at different stages of program execution, so that we have a chance to capture when the bug happens, i.e., where the abnormality is. An incorrect execution looks the same as a correct one at the beginning; at a certain stage it triggers the bug and diverges from the correct executions. So if we can locate the stage at which the two classes become separable, we can localize the abnormality.

21 Illustration: Precision Boost
[Figure: behavior graphs of one correct and one incorrect execution, over functions main, A, B, C, D, E, F, G, H]

22 Bug Relevance
Precision boost
For each function F: precision boost = exit precision − entrance precision
Intuition
Differences take place within the execution of F
Abnormalities happen while F is on the stack
The larger this precision boost, the more likely F is part of the backtrace, i.e., a bug-relevant function
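A minimal sketch of this measure (struct and field names are assumptions for illustration, not from the paper):

#include <stdio.h>

/* Hypothetical record for a function F: classifier precision measured
 * at the entrance of F and at the exit of F. */
typedef struct {
    const char *name;
    double entrance_precision;
    double exit_precision;
} func_stat;

/* Precision boost = exit precision - entrance precision; the larger
 * it is, the more likely F belongs to the backtrace. */
static double precision_boost(const func_stat *f) {
    return f->exit_precision - f->entrance_precision;
}

int main(void) {
    func_stat f = { "dodash", 0.55, 0.90 };  /* made-up numbers */
    printf("boost(%s) = %.2f\n", f.name, precision_boost(&f));
    return 0;
}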

23 Outline
Related Work
Classification of Program Executions
Extract “Backtrace” from Classification Dynamics
Case Study
Conclusions

24 Case Study
(The correct and buggy versions of subline from slide 6 are shown again.)
Subject program: replace, which performs regular expression matching and substitution
563 lines of C code; 17 functions involved
Execution behaviors
130 out of 5542 test cases fail to give correct outputs
No incorrect execution incurs a segmentation fault
A logic bug
Can we circle out the backtrace for this bug?
From a memory-access point of view, even incorrect executions are correct.

25 Precision Pairs
[Figure omitted from transcript]

26 Precision Boost Analysis
Objective judgment of bug-relevant functions
The main function is always bug-relevant
Stepwise precision boost
Line-up property

27 Backtrace for Noncrashing Bugs
[Figure omitted from transcript]

28 Method Summary
Identify incorrect executions from program runtime behaviors
Classification dynamics can give away a “backtrace” for noncrashing bugs without any semantic input
Data mining can contribute to software engineering and systems research in general
CP-Miner [LLM+04]: detects copy-paste bugs in OS code; uses the CloSpan algorithm
C-Miner [LCS+04]: discovers block correlations in storage systems; again uses CloSpan; effectively reduces I/O response time

29 Outline
Motivation
Related Work
Classification of Program Executions
Extract “Backtrace” from Classification Dynamics
Mining Control Flow Abnormality for Logic Error Isolation
CP-Miner: Mining Copy-Paste Bugs
Conclusions

30 An Example

void dodash(char delim, char *src, int *i, char *dest, int *j, int maxset)
{
    while (…) {
        …
        if (isalnum(src[*i+1]) && src[*i-1] <= src[*i+1]) {
            for (k = src[*i-1]+1; k <= src[*i+1]; k++)
                junk = addst(k, dest, j, maxset);
            *i = *i + 1;
        }
        …
    }
}

Had the function been written correctly, an additional subclause (shown in red on the original slide) would have been in the if-condition.
Replace program: 563 lines of C code, 20 functions
Symptom: 30 out of 5542 test cases fail to give correct outputs, and there are no crashes
Goal: localize the bug and prioritize manual examination

31 Difficulty & Expectation
Difficulty
Statically, even small programs are complex due to dependencies
Dynamically, execution paths can vary significantly across possible inputs
Logic errors have no apparent symptoms
Expectations
It is unrealistic to fully relieve developers of debugging; instead,
localize the buggy region
prioritize manual examination

32 Execution Profiling
How to represent an execution?
Full execution trace
Control flow + value tags
Too expensive to record at runtime; unwieldy to process
Summarized control flow for conditionals (if, while, for)
Branch evaluation counts
Lightweight to take at runtime; easy to process, and effective
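A hedged sketch of what such lightweight profiling might look like (the instrumentation scheme and names are our own illustration): every monitored conditional gets a pair of counters bumped on each evaluation, and the counter table is the per-run summary:

#include <stdio.h>

#define NBRANCH 64              /* assumed bound on monitored conditionals */

static long true_cnt[NBRANCH];  /* times branch b evaluated true  */
static long false_cnt[NBRANCH]; /* times branch b evaluated false */

/* Wraps a monitored conditional: records the outcome, then behaves
 * exactly like the original boolean expression. */
static int profile_branch(int branch_id, int outcome) {
    if (outcome) true_cnt[branch_id]++;
    else         false_cnt[branch_id]++;
    return outcome;
}

int main(void) {
    int m = -1;
    /* original: if (m >= 0) ...  instrumented as branch 0 */
    if (profile_branch(0, m >= 0)) { /* then-part */ }
    printf("branch 0: true=%ld false=%ld\n", true_cnt[0], false_cnt[0]);
    return 0;
}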

33 Analysis of the Example

if (isalnum(src[*i+1]) && src[*i-1] <= src[*i+1]) {
    for (k = src[*i-1]+1; k <= src[*i+1]; k++)
        junk = addst(k, dest, j, maxset);
    *i = *i + 1;
}

Let A = isalnum(src[*i+1]) and B = (src[*i-1] <= src[*i+1]).
An execution is logically correct until (A ∧ ¬B) is evaluated as true when evaluation reaches this condition.
If we monitor program conditionals like A, their evaluations shed light on the hidden error and can be exploited for error isolation.
Had the function been written correctly, the missing subclause (in red on the original slide) would have been in the condition.

34 Analysis of Branching Actions
Correct vs. incorrect runs of program P, summarized as contingency tables over A and B:

Correct runs:          A            ¬A
            B          n_AB         n_¬AB
            ¬B         n_A¬B = 0    n_¬A¬B

Incorrect runs:        A            ¬A
            B          n_AB         n_¬AB
            ¬B         n_A¬B ≥ 1    n_¬A¬B

Testing through the 5542 test cases, the probability that (A ∧ ¬B) evaluates true is zero in a correct execution but nonzero in an incorrect one on average.
The error location does exhibit detectably abnormal behavior in incorrect executions.

35 Conditional Test Works for Nonbranching Errors

void makepat(char *arg, int start, char delim, char *pat)
{
    …
    if (!junk)
        result = 0;
    else
        result = i + 1;  /* off-by-one error; should be: result = i */
    return result;
}

The off-by-one error can still be detected using conditional tests.

36 Ranking Based on Boolean Bias
Let input d_i have desired output o_i. We execute P; P passes the test iff the actual output o_i′ is identical to o_i.
T_p = {t_i | o_i′ = P(d_i) matches o_i}
T_f = {t_i | o_i′ = P(d_i) does not match o_i}
Boolean bias: let n_t be the number of times a boolean feature B evaluates true, and n_f the number of times it evaluates false; then π(B) = (n_t − n_f)/(n_t + n_f).
It encodes the distribution of B's value: 1 if B always assumes true, −1 if B always assumes false, and in between for all other mixtures.
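Computed directly from the two evaluation counters, the definition is a one-liner; a minimal sketch:

#include <stdio.h>

/* Boolean bias pi(B) = (n_t - n_f) / (n_t + n_f):
 *   +1 if B always evaluated true, -1 if always false,
 *   in between otherwise.  Returning 0 when B was never evaluated is
 *   our own convention for illustration. */
static double boolean_bias(long n_true, long n_false) {
    long n = n_true + n_false;
    if (n == 0) return 0.0;
    return (double)(n_true - n_false) / (double)n;
}

int main(void) {
    printf("%.2f\n", boolean_bias(3, 1));  /* prints 0.50 */
    return 0;
}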

37 Evaluation Abnormality
Boolean bias for branch P: the probability of P being evaluated as true within one execution
Suppose we have n correct and m incorrect executions; for any predicate P, we end up with
an observation sequence for correct runs: S_p = (X′_1, X′_2, …, X′_n)
an observation sequence for incorrect runs: S_f = (X_1, X_2, …, X_m)
Can we infer whether P is suspicious based on S_p and S_f?

38 Underlying Populations
Imagine the underlying distributions of boolean bias for correct and incorrect executions are f(X|θ_p) and f(X|θ_f)
S_p and S_f can then be viewed as random samples from these underlying populations
Major heuristic: the larger the divergence between f(X|θ_p) and f(X|θ_f), the more relevant the branch P is to the bug
[Figure: the two probability densities of evaluation bias over [0, 1]]

39 Major Challenges
No knowledge of the closed forms of either distribution
If we knew them, standard measures would apply, e.g., KL-divergence
Usually, we do not have sufficiently many incorrect executions to estimate f(X|θ_f) reliably
[Figure: the two (unknown) densities of evaluation bias]

40 Our Approach: Hypothesis Testing
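The slide itself carries no formula. As one plausible reading (a generic two-sample sketch, not necessarily the exact statistic of the SOBER paper), the boolean-bias samples S_p and S_f can be compared by standardizing the deviation of the incorrect-run mean against the correct-run distribution:

#include <math.h>
#include <stdio.h>

static double mean(const double *x, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++) s += x[i];
    return s / n;
}

/* Abnormality of a predicate: how far the mean boolean bias of the
 * incorrect runs deviates from the correct-run mean, in units of the
 * correct-run standard error.  Illustration only. */
static double abnormality(const double *sp, int n,   /* correct runs   */
                          const double *sf, int m) { /* incorrect runs */
    double mu = mean(sp, n), var = 0.0;
    for (int i = 0; i < n; i++) var += (sp[i] - mu) * (sp[i] - mu);
    var /= (n > 1 ? n - 1 : 1);
    double sigma = sqrt(var) + 1e-9;   /* guard against zero variance */
    return fabs(mean(sf, m) - mu) / (sigma / sqrt((double)m));
}

int main(void) {
    double sp[] = { 0.10, 0.20, 0.15 };  /* made-up bias observations */
    double sf[] = { 0.90, 0.80 };
    printf("abnormality = %.2f\n", abnormality(sp, 3, sf, 2));
    return 0;
}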

41 Faulty Functions
Motivation
Bugs are not necessarily on branches
Higher confidence in function rankings than in branch rankings
Abnormality score for functions
Calculate the abnormality score of each branch within a function
Aggregate the branch scores

42 Two Evaluation Measures
CombineRank
Combines the branch scores by summation
Intuition: when a function contains many abnormal branches, it is likely bug-relevant
UpperRank
Chooses the largest branch score as the representative
Intuition: when a function has one extremely abnormal branch, it is likely bug-relevant
(Derivations are shown in the paper.)
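A sketch of the two aggregations over per-branch abnormality scores (array-based, with names of our own choosing):

#include <stdio.h>

/* CombineRank: sum of branch scores; rewards functions that contain
 * many mildly abnormal branches. */
static double combine_rank(const double *score, int nbranch) {
    double s = 0.0;
    for (int i = 0; i < nbranch; i++) s += score[i];
    return s;
}

/* UpperRank: maximum branch score; rewards functions that contain one
 * extremely abnormal branch. */
static double upper_rank(const double *score, int nbranch) {
    double best = 0.0;
    for (int i = 0; i < nbranch; i++)
        if (score[i] > best) best = score[i];
    return best;
}

int main(void) {
    double s[] = { 0.3, 0.4, 0.2 };  /* made-up branch scores */
    printf("combine=%.1f upper=%.1f\n", combine_rank(s, 3), upper_rank(s, 3));
    return 0;
}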

43 dodash vs. omatch: Which Function Is Likely Buggy, and Which Measure Is More Effective?
[Figure omitted from transcript]

44 Bug Benchmark
The Siemens Program Suite
89 variants of 6 subject programs; 89 known bugs in total
Mainly logic (or semantic) bugs
Widely used in software engineering research

45 Results on Program “replace”
[Figure omitted from transcript]

46 Comparison between CombineRank and UpperRank
Buggy function ranked within top-k
[Figure omitted from transcript]

47 Results on Other Programs
[Figure omitted from transcript]

48 More Questions to Be Answered
What will happen, and how should we handle it, if multiple errors exist in one program?
How can bugs be detected if only very few failing test cases are available?
Is it really more effective if we have more execution traces?
How can program semantics be integrated into this statistics-based testing algorithm?
How can program semantic analysis be combined with statistics-based analysis?

49 Outline
Motivation
Related Work
Classification of Program Executions
Extract “Backtrace” from Classification Dynamics
Mining Control Flow Abnormality for Logic Error Isolation
CP-Miner: Mining Copy-Paste Bugs
Conclusions

50 Mining Copy-Paste Bugs
Copy-pasting is common
12% in the Linux file system [Kasper2003]
19% in the X Window system [Baker1995]
Copy-pasted code is error-prone
Among 35 errors in Linux drivers/i2o, 34 were caused by copy-paste [Chou2001]

Simplified example from linux-2.6.6/arch/sparc/prom/memory.c:

void __init prom_meminit(void)
{
    ……
    for (i=0; i<n; i++) {
        total[i].adr   = list[i].addr;
        total[i].bytes = list[i].size;
        total[i].more  = &total[i+1];
    }
    ……
    for (i=0; i<n; i++) {
        taken[i].adr   = list[i].addr;
        taken[i].bytes = list[i].size;
        taken[i].more  = &total[i+1];   /* forgot to change! */
    }
}

51 An Overview of Copy-Paste Bug Detection
Parse the source code and build a sequence database
Mine for basic copy-pasted segments
Compose larger copy-pasted segments
Prune false positives

52 Parsing Source Code
Purpose: build a sequence database
Idea: statement → number
Tokenize each statement
Different operators/constants/keywords → different tokens
Handle identifier renaming: identifiers of the same type → the same token
Example: old = 3; and new = 3; tokenize and then hash to the same value (16)
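A toy version of this step (the token codes and hash function are our own choices, not CP-Miner's): every identifier maps to a single IDENT token, so renamed copies hash identically.

#include <stdio.h>

/* Toy token codes: all identifiers share one token, so "old = 3;"
 * and "new = 3;" produce identical token sequences. */
enum { IDENT = 1, ASSIGN = 2, CONST = 3, SEMI = 4 };

/* djb2-style hash over a statement's token sequence. */
static unsigned statement_hash(const int *tok, int n) {
    unsigned h = 5381;
    for (int i = 0; i < n; i++)
        h = h * 33u + (unsigned)tok[i];
    return h;
}

int main(void) {
    int old_eq_3[] = { IDENT, ASSIGN, CONST, SEMI };  /* old = 3; */
    int new_eq_3[] = { IDENT, ASSIGN, CONST, SEMI };  /* new = 3; */
    printf("%u %u\n", statement_hash(old_eq_3, 4),
                      statement_hash(new_eq_3, 4));   /* equal by design */
    return 0;
}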

53 Building Sequence Database
A program hashes to one long sequence, but the miner needs a sequence database
Cut the long sequence:
Naïve method: fixed length
Our method: one sequence per basic block
Example: each of the two for-loops from the earlier example hashes, statement by statement, to 65 (the loop header) and (16, 16, 71) (the loop body)
Final sequence DB: (65), (16, 16, 71), …, (65), (16, 16, 71)

54 Mining for Basic Copy-Pasted Segments
Apply a frequent sequence mining algorithm to the sequence database
Modification: constrain the maximum gap
Example: (16, 16, 71) and (16, 16, 10, 71), i.e., the total[i] loop body and the taken[i] loop body with 1 inserted statement (gap = 1), share the frequent subsequence (16, 16, 71)
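The gap constraint can be pictured with a simple containment check: a pattern occurs in a block's hash sequence only if consecutive pattern elements match within max_gap inserted statements of each other. A greedy sketch (not the actual CloSpan mining code):

#include <stdio.h>

/* Does `pat` occur in `seq` as a subsequence whose consecutive
 * matches are at most max_gap positions apart?  Greedy: takes the
 * first match for each pattern element. */
static int matches_with_gap(const int *seq, int n,
                            const int *pat, int m, int max_gap) {
    int j = 0, last = -1;
    for (int i = 0; i < n && j < m; i++) {
        if (seq[i] == pat[j]) {
            if (last >= 0 && i - last - 1 > max_gap) return 0;
            last = i;
            j++;
        }
    }
    return j == m;
}

int main(void) {
    int seq[] = { 16, 16, 10, 71 };  /* copy with 1 inserted statement */
    int pat[] = { 16, 16, 71 };
    printf("%d\n", matches_with_gap(seq, 4, pat, 3, 1));  /* prints 1 */
    return 0;
}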

55 Composing Larger Copy-Pasted Segments
Combine neighboring copy-pasted segments repeatedly
Example: the copy-pasted loop bodies (hashes 16, 16, 71) are combined with their neighboring copy-pasted loop headers for (i=0; i<n; i++) { (hash 65), yielding the two full for-loops as one larger copy-pasted pair

56 Pruning False Positives
Unmappable segments
Identifier names cannot be mapped to corresponding ones
e.g., f(a1); f(a2); f(a3); vs. f1(b1); f1(b2); f2(b3); (conflict: f maps to both f1 and f2)
Tiny segments
For more detail, see Zhenmin Li, Shan Lu, Suvda Myagmar, and Yuanyuan Zhou, “CP-Miner: A Tool for Finding Copy-Paste and Related Bugs in Operating System Code,” Proc. 6th Symp. Operating Systems Design and Implementation, 2004.

57 Some Test Results of C-P Bug Detection

Software    | LOC    | Space (MB) | Time    | Verified Bugs | Potential Bugs (careless programming)
Linux       | 4.4 M  | 527        |         | 28            | 21
FreeBSD     | 3.3 M  | 459        | 20 mins | 23            | 8
Apache      | 224 K  | 30         | 15 secs | 5             |
PostgreSQL  | 458 K  | 57         | 38 secs | 2             |

58 Outline
Motivation
Related Work
Classification of Program Executions
Extract “Backtrace” from Classification Dynamics
Mining Control Flow Abnormality for Logic Error Isolation
CP-Miner: Mining Copy-Paste Bugs
Conclusions

59 Conclusions
Data mining into software and computer systems
Identify incorrect executions from program runtime behaviors
Classification dynamics can give away a “backtrace” for noncrashing bugs without any semantic input
A hypothesis-testing-like approach is developed to localize logic bugs in software
No prior knowledge about program semantics is assumed
Many other software bug mining methods remain to be explored

60 References
[DRL+98] David L. Detlefs, K. Rustan M. Leino, Greg Nelson, and James B. Saxe. Extended static checking. 1998.
[EGH+94] David Evans, John Guttag, James Horning, and Yang Meng Tan. LCLint: A tool for using specifications to check code. In Proc. ACM SIGSOFT '94 Symp. Foundations of Software Engineering, pages 87–96, 1994.
[DLS02] Manuvir Das, Sorin Lerner, and Mark Seigle. ESP: Path-sensitive program verification in polynomial time. In Conf. Programming Language Design and Implementation, 2002.
[ECC00] D. R. Engler, B. Chelf, A. Chou, and S. Hallem. Checking system rules using system-specific, programmer-written compiler extensions. In Proc. 4th Symp. Operating Systems Design and Implementation, October 2000.
[M93] Ken McMillan. Symbolic Model Checking. Kluwer Academic Publishers, 1993.
[H97] Gerard J. Holzmann. The model checker SPIN. IEEE Trans. Software Engineering, 23(5), 1997.
[DDH+92] David L. Dill, Andreas J. Drexler, Alan J. Hu, and C. Han Yang. Protocol verification as a hardware design aid. In IEEE Int. Conf. Computer Design: VLSI in Computers and Processors, 1992.
[MPC+02] M. Musuvathi, D. Y. W. Park, A. Chou, D. R. Engler, and D. L. Dill. CMC: A pragmatic approach to model checking real code. In Proc. 5th Symp. Operating Systems Design and Implementation, 2002.

61 References (cont'd)
[G97] P. Godefroid. Model checking for programming languages using VeriSoft. In Proc. 24th ACM Symp. Principles of Programming Languages, 1997.
[BHP+00] G. Brat, K. Havelund, S. Park, and W. Visser. Model checking programs. In IEEE Int. Conf. Automated Software Engineering (ASE), 2000.
[HJ92] R. Hastings and B. Joyce. Purify: Fast detection of memory leaks and access errors. In Proc. Winter 1992 USENIX Conference, San Francisco, California, 1992.
Chao Liu, Xifeng Yan, and Jiawei Han. Mining control flow abnormality for logic error isolation. In Proc. 2006 SIAM Int. Conf. Data Mining (SDM'06), Bethesda, MD, April 2006.
C. Liu, X. Yan, L. Fei, J. Han, and S. Midkiff. SOBER: Statistical model-based bug localization. In Proc. 2005 ACM SIGSOFT Symp. Foundations of Software Engineering (FSE 2005), Lisbon, Portugal, September 2005.
C. Liu, X. Yan, H. Yu, J. Han, and P. S. Yu. Mining behavior graphs for “backtrace” of noncrashing bugs. In Proc. 2005 SIAM Int. Conf. Data Mining (SDM'05), Newport Beach, CA, April 2005.
[SN00] Julian Seward and Nick Nethercote. Valgrind, an open-source memory debugger for x86-GNU/Linux.
[LLM+04] Zhenmin Li, Shan Lu, Suvda Myagmar, and Yuanyuan Zhou. CP-Miner: A tool for finding copy-paste and related bugs in operating system code. In Proc. 6th Symp. Operating Systems Design and Implementation, 2004.
[LCS+04] Zhenmin Li, Zhifeng Chen, Sudarshan M. Srinivasan, and Yuanyuan Zhou. C-Miner: Mining block correlations in storage systems. In Proc. 3rd USENIX Conf. File and Storage Technologies, 2004.


