Duplicate code detection using Clone Digger
Peter Bulychev, Lomonosov Moscow State University, CS department
Outline
- Theoretical part
  - Clone detection problem in general
  - The theory behind the tool
- Practical part
  - Clone Digger and the results of its application to several Python open-source projects
  - Other ongoing projects
What is a software clone?
Two fragments of code form a clone if they are similar enough (according to a given measure of similarity)
    for i in range(5):
        for j in range(i):
            print i+j

    for k in range(6):
        for m in range(k):
            print k+m
Why is it important to detect code clones?
- 5% to 20% of the code in software systems is cloned [1]
- Why do programmers produce clones? [2]
  - Development strategy
  - Maintenance benefits
  - Overcoming underlying limitations
  - Cloning by accident
- Why is the presence of code clones bad?
  - Errors in the original must be fixed in every clone
[1] I.D. Baxter et al. Clone Detection Using Abstract Syntax Trees.
[2] C.K. Roy and J.R. Cordy. A Survey on Software Clone Detection Research, 2007.
Our definition of clone
- Different clone definitions can be classified according to the level of granularity:
  - List of strings
  - Sequence of tokens
  - Abstract syntax trees (AST)
  - Semantic information
- We work on the AST level
- We consider two sequences of statements as a clone if one of them can be obtained from the other by replacing some subtrees
Example
    x = a
    y = f(x,i)
    print y

    x = a + b
    y = f(x,j)
    print y
[Figure: abstract syntax trees of the two statement sequences, each rooted at a block node]
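To make the AST view of this example concrete, here is a minimal sketch using Python's standard `ast` module. It is an illustration only, not part of Clone Digger: the helper name `statement_shapes` is hypothetical, and the fragments are written with `print(y)` so that they parse on Python 3.

```python
import ast

# The two statement sequences from the example slide (print written as a call).
FRAGMENT_1 = "x = a\ny = f(x, i)\nprint(y)\n"
FRAGMENT_2 = "x = a + b\ny = f(x, j)\nprint(y)\n"


def statement_shapes(source):
    """Return (statement type, value type) for each top-level statement."""
    shapes = []
    for stmt in ast.parse(source).body:
        # Both Assign and Expr nodes carry their expression in `.value`.
        shapes.append((type(stmt).__name__, type(stmt.value).__name__))
    return shapes


shapes_1 = statement_shapes(FRAGMENT_1)
shapes_2 = statement_shapes(FRAGMENT_2)
print(shapes_1)
print(shapes_2)
```

The two lists differ only in the first statement, where the subtree for `a` (a name) is replaced by the subtree for `a + b` (a binary operation). The other difference, `i` versus `j`, sits deeper in the tree and does not show up at this level.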
The sketch of the algorithm
- Partition similar statements into clusters
- Find pairs of identical cluster sequences
- Refine by examining identified code sequences for structural similarity
[Figure: example statements i=0, i+=1, f(i), k=0, k+=1, f(k)]
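The first two steps can be sketched in Python under loudly stated assumptions. The clustering key used here (a statement's sequence of AST node types, ignoring names and constants) is a crude stand-in invented for this illustration, not Clone Digger's actual clustering, and `cluster_key` and `find_candidate_pairs` are hypothetical names. The third step, structural refinement, is left out.

```python
import ast


def cluster_key(stmt):
    """Stand-in for clustering: node types only, so names and constants are ignored."""
    return tuple(type(node).__name__ for node in ast.walk(stmt))


def find_candidate_pairs(source, length):
    """Return start-index pairs whose `length` consecutive statements share cluster keys."""
    keys = [cluster_key(stmt) for stmt in ast.parse(source).body]
    windows = {}
    for start in range(len(keys) - length + 1):
        windows.setdefault(tuple(keys[start:start + length]), []).append(start)
    pairs = []
    for starts in windows.values():
        for position, first in enumerate(starts):
            for second in starts[position + 1:]:
                if second - first >= length:  # skip overlapping windows
                    pairs.append((first, second))
    return pairs


SOURCE = """\
i = 0
i += 1
f(i)
print(i)
k = 0
k += 1
f(k)
"""

pairs = find_candidate_pairs(SOURCE, 3)
print(pairs)
```

With this key, statements 0 to 2 and statements 4 to 6 map to the same cluster sequence, so they come out as one candidate pair that a refinement step would then examine more closely.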
Main problems
- How to compute similarity between two trees?
  - Use edit distance
- How to compute similarity between a new tree and an existing tree cluster?
  - Comparing with each tree in the cluster is expensive
  - Compare the new tree with an average value stored for the cluster
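The idea of comparing a new tree with one stored value per cluster can be sketched as follows. This is an assumption-laden illustration: trees are nested tuples `(label, child, ...)`, the stored "average" is taken to be a generalization with `"?"` holes, the cost function and the rule that an existing hole absorbs anything for free are choices made for this sketch, and `add_to_clusters` is a hypothetical name rather than the tool's API.

```python
def size(tree):
    """Number of nodes in a nested-tuple tree of the form (label, child, ...)."""
    if isinstance(tree, tuple):
        return 1 + sum(size(child) for child in tree[1:])
    return 1


def anti_unify(a, b):
    """Return (generalization, cost); cost counts nodes replaced by "?" on both sides."""
    if a == b:
        return a, 0
    if a == "?":
        # Assumption: a hole already present in the representative costs nothing.
        return "?", 0
    if (isinstance(a, tuple) and isinstance(b, tuple)
            and a[0] == b[0] and len(a) == len(b)):
        parts = [anti_unify(x, y) for x, y in zip(a[1:], b[1:])]
        pattern = (a[0],) + tuple(part for part, _ in parts)
        return pattern, sum(cost for _, cost in parts)
    return "?", size(a) + size(b)


def add_to_clusters(clusters, tree, threshold):
    """Compare `tree` with each cluster's representative only, never with every member."""
    best = None
    for cluster in clusters:
        pattern, cost = anti_unify(cluster["representative"], tree)
        if cost <= threshold and (best is None or cost < best[0]):
            best = (cost, pattern, cluster)
    if best is None:
        clusters.append({"representative": tree, "members": [tree]})
    else:
        _, pattern, cluster = best
        cluster["representative"] = pattern  # generalize the stored value
        cluster["members"].append(tree)


clusters = []
for statement in [("=", "i", 0), ("call", "f", "i"), ("=", "k", 0), ("=", "m", 0)]:
    add_to_clusters(clusters, statement, threshold=2)
print([cluster["representative"] for cluster in clusters])
```

Each insertion costs one comparison per cluster, regardless of how many trees the cluster already holds.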
Anti-unification
The anti-unifier of two trees is the most specific generalization that matches both of them
[Figure: the trees f(x + y, x * 2) and f(x + z, x / 2) with their anti-unifier f(x + ?, ?)]
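A minimal sketch of anti-unification on trees written as nested tuples `(label, child, ...)`. It is a simplification for illustration: every differing subtree becomes the same anonymous placeholder `"?"`, whereas a full implementation would use distinct variables and reuse one variable for a repeated pair of subtrees. The function name and tree encoding are assumptions of this sketch.

```python
def anti_unify(a, b):
    """Generalize two nested-tuple trees, replacing differing subtrees with "?"."""
    if a == b:
        return a
    both_nodes = isinstance(a, tuple) and isinstance(b, tuple)
    if both_nodes and a[0] == b[0] and len(a) == len(b):
        # Same label and arity: keep the node and generalize child by child.
        return (a[0],) + tuple(anti_unify(x, y) for x, y in zip(a[1:], b[1:]))
    return "?"


tree_1 = ("f", ("+", "x", "y"), ("*", "x", 2))  # f(x + y, x * 2)
tree_2 = ("f", ("+", "x", "z"), ("/", "x", 2))  # f(x + z, x / 2)
pattern = anti_unify(tree_1, tree_2)
print(pattern)  # ('f', ('+', 'x', '?'), '?')
```

The result keeps the shared upper part of both trees, `f` and `x + ...`, and places holes exactly where the trees diverge.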
Anti-unification features
- The anti-unifier of a set of trees keeps their common features: the common upper part
- Anti-unification can be used to compute the edit distance between two trees $E_1$ and $E_2$ with anti-unifier $E_0$:
  - $\theta_1$ and $\theta_2$ are substitutions such that $E_0\theta_1 = E_1$ and $E_0\theta_2 = E_2$
  - $\text{distance} = |\theta_1| + |\theta_2|$
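The distance formula can be tried out with a small sketch. The slide does not define the size of a substitution, so this example assumes it is the total number of nodes in the subtrees the substitution puts into the holes. The tuple encoding of trees and the names `size`, `anti_unify` and `distance` are likewise assumptions made for illustration.

```python
def size(tree):
    """Number of nodes in a nested-tuple tree of the form (label, child, ...)."""
    if isinstance(tree, tuple):
        return 1 + sum(size(child) for child in tree[1:])
    return 1


def anti_unify(a, b, subst_a, subst_b):
    """Build the generalization and record what each hole stands for in a and in b."""
    if a == b:
        return a
    if (isinstance(a, tuple) and isinstance(b, tuple)
            and a[0] == b[0] and len(a) == len(b)):
        children = tuple(anti_unify(x, y, subst_a, subst_b)
                         for x, y in zip(a[1:], b[1:]))
        return (a[0],) + children
    subst_a.append(a)
    subst_b.append(b)
    return "?%d" % len(subst_a)  # numbered hole, e.g. "?1"


def distance(a, b):
    """Assumed reading of |theta_1| + |theta_2|: nodes substituted on each side."""
    subst_a, subst_b = [], []
    anti_unify(a, b, subst_a, subst_b)
    return sum(size(t) for t in subst_a) + sum(size(t) for t in subst_b)


tree_1 = ("f", ("+", "x", "y"), ("*", "x", 2))
tree_2 = ("f", ("+", "x", "z"), ("/", "x", 2))
print(distance(tree_1, tree_2))  # 8
```

Here the holes stand for `y` and `x * 2` in the first tree (1 + 3 nodes) and for `z` and `x / 2` in the second (1 + 3 nodes), giving 8. Identical trees have distance 0.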
Clone Digger
- Is the first clone detection tool focused on Python (except Pylint)
- Is provided under the GPL license
- Writes the clones it finds to an HTML report in a two-column format with the differences highlighted
Comparison with existing tools working with ASTs
- CloneDR by Semantic Designs, I. Baxter, 1998
  - Hash functions on subtrees, some kind of edit distance
- Asta by Microsoft Research, S. Evans et al., 2007
  - Subtree patterns (similar to anti-unification), hash functions on subtrees
Quick Start
1. $ easy_install clonedigger
2. $ clonedigger --recursive source_tree
3. $ firefox output.html
Additional parameters, such as thresholds, can also be set (use --help to learn more)
Running on real-life open-source projects
    BioPython   12.19%
    NLTK        11.85%
    Zope        27.41%
    Plone       29.89%
These numbers mean nothing, except that every large project has clones and they should be detected
What to do with found clones?
- Remove clones by refactoring
  - Extract Method and Pull Up Method can be used
- Detect library candidates
- Search for bugs
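As an illustration of removing a clone with Extract Method, the two nested loops from the "What is a software clone?" slide can be folded into one function whose parameter is the only part that differed. This is a sketch, not output of Clone Digger: `pair_sums` is a hypothetical name, and the function returns the values instead of printing them so that the result is easy to check.

```python
def pair_sums(limit):
    """Shared body of both clones; `limit` was 5 in one fragment and 6 in the other."""
    return [outer + inner for outer in range(limit) for inner in range(outer)]


# The two original fragments become two calls to the extracted function.
for total in pair_sums(5):
    print(total)
for total in pair_sums(6):
    print(total)
```

After the refactoring, a bug in the loop body needs to be fixed in one place only.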