Duplicate code detection using anti-unification
Peter Bulychev, Moscow State University
Marius Minea, Institute eAustria, Timisoara

Outline
- Code duplication problem
- Our anti-unification based algorithm
- Comparison with existing methods
- Clone Digger, the tool for finding software clones

What is a software clone?
Two fragments of code form a clone if they are similar enough (according to a given measure of similarity).

    for (int i = 0; i < 5; i++)
        for (int j = 0; j <= i; j++)
            cout << i + j;

    for (int k = 0; k < 6; k++)
        for (int m = 0; m <= k; m++)
            cout << k + m;

Why is it important to detect code clones?
- 5% to 20% of the code in software systems is cloned [1]
Why do programmers produce clones? [2]
- Development strategy
- Maintenance benefits
- Overcoming underlying limitations
- Cloning by accident
Why is the presence of code clones bad?
- Errors in the original must be fixed in every clone
[1] I.D. Baxter et al. Clone Detection Using Abstract Syntax Trees, 1998.
[2] C.K. Roy and J.R. Cordy. A Survey on Software Clone Detection Research, 2007.

Our clone definition
Different clone definitions can be classified according to the level of granularity:
- List of strings
- Sequence of tokens
- Abstract syntax trees (AST)
- Semantic information
We work on the AST level.
We consider two sequences of statements a clone if one of them can be obtained from the other by replacing some subtrees.

Example

    x = a;
    y = f(x,i);
    cout << y;

    x = a + b;
    y = f(x,j);
    cout << y;

[Figure: abstract syntax trees of the two statement sequences]

Automatic clone detection tool
- Detects occurrences of similar code
Applications
- Refactoring into new functions or base classes
- The number of clones can be used as a measure of code quality
Several tools exist [1]
[1] S. Bellon et al. Comparison and Evaluation of Clone Detection Tools, 2007.

The sketch of the algorithm
1. Partition similar statements into clusters
2. Find pairs of identical cluster sequences
3. Refine by examining identified code sequences for structural similarity
[Figure: example statements such as i=0, i++, f(i) and k=0, k++, f(k) illustrating the three steps]

Main problems
How to compute similarity between two trees?
- Use editing distance
How to compute similarity between a new tree and an existing tree cluster?
- Comparing with each tree in the cluster is expensive
- Compare the new tree with an average value stored for the cluster

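Anti-unification
The anti-unifier of two trees is the most specific generalization that matches both.
[Figure: example trees f(x + y, x * 2) and f(x + z, x / 2) with their anti-unifier, in which differing subtrees are replaced by placeholders ?]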
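The definition above can be sketched in a few lines of Python. This is a simplified illustration, not Clone Digger's implementation: the tuple encoding of trees, the names `anti_unify`, `leaf` and `HOLE`, and the use of a single anonymous placeholder (instead of distinct named variables) are all assumptions made for the example.

```python
# Trees are nested tuples: (label, child, child, ...). A leaf is (label,).
HOLE = ("?",)  # placeholder for a differing subtree (notation assumed here)


def anti_unify(t1, t2):
    """Return the most specific tree that generalizes both t1 and t2."""
    # Different labels or arities: the whole subtree becomes a placeholder.
    if t1[0] != t2[0] or len(t1) != len(t2):
        return HOLE
    # Same label and arity: keep the label, generalize the children pairwise.
    return (t1[0],) + tuple(anti_unify(a, b) for a, b in zip(t1[1:], t2[1:]))


def leaf(name):
    """Build a leaf node."""
    return (name,)


# f(x + y, x * 2) and f(x + z, x / 2)
e1 = ("f", ("+", leaf("x"), leaf("y")), ("*", leaf("x"), leaf("2")))
e2 = ("f", ("+", leaf("x"), leaf("z")), ("/", leaf("x"), leaf("2")))

print(anti_unify(e1, e2))
```

In this sketch the common structure (the call to f, the + node and its left operand x) survives, while y versus z and the * versus / subtrees are each replaced by a placeholder.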

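Anti-unification features
The anti-unifier of a set of trees keeps their common features: tree structure and common labels.
Anti-unification can be used to compute the editing distance between two trees $E_1$ and $E_2$:
Let $E_0$ be the anti-unifier of $E_1$ and $E_2$, and let $\theta_1$ and $\theta_2$ be substitutions such that $E_0\theta_1 = E_1$ and $E_0\theta_2 = E_2$. Then
distance $= |\theta_1| + |\theta_2|$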
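A minimal sketch of this distance, under stated assumptions: trees are nested tuples `(label, child, ...)`, and the size of a substitution is taken to be the total number of nodes in the subtrees it substitutes. The slides do not define the size of a substitution, so that measure, like the function names below, is hypothetical.

```python
def size(tree):
    """Number of nodes in a tuple tree (label, child, ...)."""
    return 1 + sum(size(child) for child in tree[1:])


def anti_unify(t1, t2, subs1, subs2):
    """Anti-unify t1 and t2, recording what each placeholder stands for."""
    if t1[0] != t2[0] or len(t1) != len(t2):
        # The subtrees differ: remember both sides and emit a placeholder.
        subs1.append(t1)
        subs2.append(t2)
        return ("?%d" % len(subs1),)
    return (t1[0],) + tuple(
        anti_unify(a, b, subs1, subs2) for a, b in zip(t1[1:], t2[1:])
    )


def distance(t1, t2):
    """Sum of the sizes of both substitutions (size measure is an assumption)."""
    subs1, subs2 = [], []
    anti_unify(t1, t2, subs1, subs2)
    return sum(size(s) for s in subs1) + sum(size(s) for s in subs2)


# f(x + y, x * 2) and f(x + z, x / 2)
e1 = ("f", ("+", ("x",), ("y",)), ("*", ("x",), ("2",)))
e2 = ("f", ("+", ("x",), ("z",)), ("/", ("x",), ("2",)))

print(distance(e1, e2))  # substituted subtrees: y and x * 2, z and x / 2
```

Identical trees need no substitutions, so their distance is zero; the more (and the larger) the subtrees that must be substituted, the larger the distance.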

The first phase: building clusters of statements
We use a simple one-pass clustering algorithm:

    for each tree in statement trees:
        # pick the cluster that is cheapest to extend with this tree
        bestcluster = argmin(cluster.add_cost(tree))
        if bestcluster.add_cost(tree) < threshold:
            bestcluster.append(tree)
        else:
            clusters.append(new Cluster(tree))
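The one-pass scheme can be tried out independently of trees. The sketch below is a runnable illustration with a stand-in cost: items are plain numbers and the cost of adding an item is its distance to the cluster mean. The function names and the numeric cost are assumptions for the example; the slides use trees and the add_cost function based on anti-unifiers.

```python
def one_pass_cluster(items, add_cost, threshold):
    """Assign each item to the cheapest cluster, or start a new one."""
    clusters = []
    for item in items:
        # Cheapest existing cluster for this item (None if there are none yet).
        best = min(clusters, key=lambda c: add_cost(c, item), default=None)
        if best is not None and add_cost(best, item) < threshold:
            best.append(item)
        else:
            clusters.append([item])
    return clusters


def mean_distance(cluster, value):
    """Stand-in cost: distance from the value to the cluster's mean."""
    return abs(sum(cluster) / len(cluster) - value)


groups = one_pass_cluster([1.0, 1.2, 5.0, 0.9, 5.3], mean_distance, threshold=1.0)
print(groups)
```

Because each item is examined once and never reassigned, the result can depend on the order of the input; that is the price of a single pass.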

Finding the best cluster
What add_cost function should we use?
The cost value should be high in these cases:
- The cluster is large and joining the new tree changes the cluster's average value significantly
- The average value of the new cluster is far away from the tree
add_cost = n * (|au| - |au'|) + (|tree| - |au'|)
where
n is the old size of the cluster
au is the old anti-unifier of the cluster
au' is the new anti-unifier of the cluster
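A small worked example of the formula, assuming that |t| denotes the number of nodes in tree t (the slides do not spell out the size measure) and using made-up sizes:

```python
def add_cost(n, au_size, new_au_size, tree_size):
    """add_cost = n * (|au| - |au'|) + (|tree| - |au'|)."""
    # First term: how much generality each of the n old members loses.
    # Second term: how far the new tree is from the new anti-unifier.
    return n * (au_size - new_au_size) + (tree_size - new_au_size)


# Hypothetical cluster of 3 trees whose anti-unifier has 7 nodes; adding a
# 9-node tree shrinks the anti-unifier to 5 nodes.
cost = add_cost(n=3, au_size=7, new_au_size=5, tree_size=9)
print(cost)  # 3 * (7 - 5) + (9 - 5) = 10

# A tree identical to the current anti-unifier changes nothing.
print(add_cost(n=3, au_size=7, new_au_size=7, tree_size=7))  # 0
```

The factor n makes it expensive to generalize a large cluster, which matches the first case listed above.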

Improving efficiency
To avoid comparing each AST with every other AST, we use hashing.
The upper parts of the trees are hashed.
[Figure: two example assignment trees whose upper parts are hashed]
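One way to picture this is to truncate each tree at a fixed depth and use the truncated tree as a dictionary key, so that only trees in the same bucket need to be compared. This is a sketch under assumptions: the tuple encoding, the names `upper_part` and `bucket_by_upper_part`, and the fixed-depth cut are illustrative, and the slides do not say how the upper part is chosen.

```python
from collections import defaultdict


def upper_part(tree, depth):
    """Copy of the tree cut off below `depth` levels; the rest becomes '?'."""
    if depth == 0:
        return ("?",)
    return (tree[0],) + tuple(upper_part(c, depth - 1) for c in tree[1:])


def bucket_by_upper_part(trees, depth):
    """Group trees whose upper parts are equal (tuples hash structurally)."""
    buckets = defaultdict(list)
    for tree in trees:
        buckets[upper_part(tree, depth)].append(tree)
    return buckets


t1 = ("=", ("i",), ("+", ("j",), ("1",)))  # i = j + 1
t2 = ("=", ("k",), ("+", ("m",), ("2",)))  # k = m + 2
t3 = ("call", ("f",), ("i",))              # f(i)

buckets = bucket_by_upper_part([t1, t2, t3], depth=1)
for key, members in buckets.items():
    print(key, len(members))
```

With depth 1 only the root label and arity are kept, so the two assignments share a bucket and the call lands in its own; a larger depth gives finer buckets and fewer comparisons, at the risk of separating trees that differ near the root.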

Why is this not enough?
If we only consider individual pairs of statements from the same cluster, we miss cloned sequences of statements.
We should find all pairs of identical cluster sequences and then check them for similarity.

    void f() {               // cluster №1
        cin >> i;            // cluster №2
        int j = i * 100;     // cluster №3
        cout << i << j;      // cluster №4
    }

    void f(int j) {          // cluster №5
        cin >> i;            // cluster №2
        int j = i * 100;     // cluster №3
        cout << j;           // cluster №6
    }

The second phase: finding all common subsequences
- After the first phase each statement node is marked with the ID of its cluster
- We want to find all pairs of similar sequences of cluster IDs
- We do it using suffix trees
- Only long common subsequences are considered
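To make the goal of this phase concrete, the sketch below finds maximal common contiguous runs of cluster IDs in two sequences. It deliberately swaps the suffix tree for a simple quadratic dynamic-programming table, which is easier to read but slower; the function name `common_runs` and the `min_len` parameter are assumptions for the example.

```python
def common_runs(a, b, min_len):
    """Return (start_a, start_b, length) for maximal common contiguous runs."""
    # lengths[i][j] = length of the common run ending at a[i-1] and b[j-1].
    lengths = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                lengths[i][j] = lengths[i - 1][j - 1] + 1

    runs = []
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            n = lengths[i][j]
            # Report a run only where it cannot be extended to the right.
            extends = i < len(a) and j < len(b) and a[i] == b[j]
            if n >= min_len and not extends:
                runs.append((i - n, j - n, n))
    return runs


# Cluster IDs of the statements in the two functions from the previous slide.
first = [1, 2, 3, 4]
second = [5, 2, 3, 6]
print(common_runs(first, second, min_len=2))
```

Here the shared run is the pair of statements in clusters 2 and 3, starting at index 1 in both sequences; raising `min_len` corresponds to considering only long common sequences.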

The third phase: finding similar sequences of statements
[Figure: the statement sequences i=0, k=3, f(i,k) and k=0, n=3, f(k,n) compared for structural similarity]

Comparison with existing AST methods
- W. Yang, 1991: editing distance between two trees
- I. Baxter et al., 1998: hash functions on subtrees, some kind of editing distance
- V. Wahler, 2004: feature vectors comparison
- S. Evans et al., 2007: subtree patterns (similar to anti-unification), hash functions on subtrees

Clone Digger
- The tool is written in Python
- Supported languages:
  - Python (ASTs are built using the standard package "compiler")
  - Java 1.5 (parser generator ANTLR)
- The information on found clones is written to HTML, with differences highlighted
- Its application to the open-source projects NLTK and BioPython showed that 12% of their code is clones

Clone Digger
The tool is provided under the GPL license and can be downloaded from http://clonedigger.sourceforge.net

Thank you!
