Duplicate code detection using Clone Digger
Peter Bulychev, Lomonosov Moscow State University, CS department



Outline
- Theoretical part
    - Clone detection problem in general
    - The theory behind the tool
- Practical part
    - Clone Digger and the results of its application to several Python open-source projects
    - Other ongoing projects

What is a software clone?
Two fragments of code form a clone if they are similar enough (according to a given measure of similarity).

    for i in range(5):
        for j in range(i):
            print(i + j)

    for k in range(6):
        for m in range(k):
            print(k + m)

Why is it important to detect code clones?
- 5% to 20% of the code in software systems consists of clones [1]
- Why do programmers produce clones? [2]
    - Development strategy
    - Maintenance benefits
    - Overcoming underlying limitations
    - Cloning by accident
- Why is the presence of code clones bad?
    - Errors in the original must be fixed in every clone

[1] I. D. Baxter et al. Clone Detection Using Abstract Syntax Trees.
[2] C. K. Roy and J. R. Cordy. A Survey on Software Clone Detection Research, 2007.

Our definition of clone
- Different clone definitions can be classified according to the level of granularity:
    - List of strings
    - Sequence of tokens
    - Abstract syntax trees (AST)
    - Semantic information
- We work on the AST level
- We consider two sequences of statements to be a clone if one of them can be obtained from the other by replacing some subtrees
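To make the first three granularity levels concrete, here is a minimal sketch that views one statement as a string, as a token sequence, and as an AST. It uses only Python's standard `tokenize` and `ast` modules; the sample statement and the variable names are chosen here for illustration and are not taken from Clone Digger itself.

```python
import ast
import io
import tokenize

source = "y = f(x, i)\n"

# Level 1: list of strings (source lines)
lines = source.splitlines()

# Level 2: sequence of tokens (whitespace-only tokens dropped)
tokens = [
    tok.string
    for tok in tokenize.generate_tokens(io.StringIO(source).readline)
    if tok.string.strip()
]

# Level 3: abstract syntax tree, summarized by its node type names
tree = ast.parse(source)
node_types = [type(node).__name__ for node in ast.walk(tree)]

print(lines)
print(tokens)
print(node_types)
```

At the AST level the statement is an assignment node whose value is a call node, so replacing the argument `i` with `j` changes a single subtree and leaves the rest of the tree unchanged.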

Example

    x = a
    y = f(x, i)
    print(y)

    x = a + b
    y = f(x, j)
    print(y)

[Figure: the ASTs of the two blocks, which differ only in the subtrees a vs. a + b and i vs. j]

The sketch of the algorithm
1. Partition similar statements into clusters
2. Find pairs of identical cluster sequences
3. Refine by examining the identified code sequences for structural similarity

[Figure: the example statements i=0, i+=1, f(i), k=0, k+=1, f(k) grouped into clusters]
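The first two steps can be sketched in a few lines of Python. This is a deliberately simplified illustration, not Clone Digger's implementation: it treats two statements as "similar" only when their ASTs are identical once names and constants are erased, it looks at top-level statements only, and it skips the refinement step. The helper names `shape` and `find_clone_candidates` are hypothetical.

```python
import ast


def shape(node):
    """Describe a statement's AST structure with names and constants erased."""
    if isinstance(node, (ast.Name, ast.Constant)):
        return type(node).__name__
    children = tuple(shape(child) for child in ast.iter_child_nodes(node))
    return (type(node).__name__,) + children


def find_clone_candidates(source, length):
    """Return index pairs (a, b) of statement runs with equal cluster sequences."""
    statements = ast.parse(source).body

    # Step 1 (simplified): statements with the same shape share a cluster id.
    cluster_of = {}
    ids = [cluster_of.setdefault(shape(s), len(cluster_of)) for s in statements]

    # Step 2: find non-overlapping runs whose cluster id sequences are identical.
    pairs = []
    last_start = len(ids) - length
    for a in range(last_start + 1):
        for b in range(a + length, last_start + 1):
            if ids[a:a + length] == ids[b:b + length]:
                pairs.append((a, b))
    return pairs


SOURCE = """\
i = 0
i += 1
f(i)
k = 0
k += 1
f(k)
"""
print(find_clone_candidates(SOURCE, 3))  # [(0, 3)]
```

The quadratic scan in step 2 keeps the sketch short; a real tool would need something more scalable for large code bases.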

Main problems
- How to compute similarity between two trees?
    - Use edit distance
- How to compute similarity between a new tree and an existing tree cluster?
    - Comparing with each tree in the cluster is expensive
    - Compare the new tree with an average value stored for the cluster

Anti-unification
The anti-unifier of two trees is the most specific generalization that matches both of them.

[Figure: the trees f(x + y, x * 2) and f(x + z, x / 2) and their anti-unifier f(x + ?, ?)]
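A minimal sketch of anti-unification, assuming trees are encoded as nested tuples of the form (label, child, ...) with strings or numbers as leaves. This encoding, the function name `anti_unify`, and the placeholder names `?1`, `?2` are chosen here for illustration and are not Clone Digger's internal representation. The function walks both trees in parallel and replaces each mismatching pair of subtrees with a placeholder.

```python
def anti_unify(t1, t2, holes=None):
    """Return (generalization, holes) for two trees given as nested tuples.

    holes maps each mismatching pair (subtree of t1, subtree of t2) to the
    placeholder that replaces it, so equal pairs share one placeholder.
    """
    if holes is None:
        holes = {}
    if t1 == t2:
        return t1, holes
    same_root = (
        isinstance(t1, tuple) and isinstance(t2, tuple)
        and len(t1) == len(t2) and t1[0] == t2[0]
    )
    if same_root:
        # Same label and arity: keep the root and generalize the children.
        children = tuple(
            anti_unify(a, b, holes)[0] for a, b in zip(t1[1:], t2[1:])
        )
        return (t1[0],) + children, holes
    # Mismatch: replace the whole pair of subtrees with a placeholder.
    if (t1, t2) not in holes:
        holes[(t1, t2)] = "?%d" % (len(holes) + 1)
    return holes[(t1, t2)], holes


# Illustrative trees: f(x + y, x * 2) and f(x + z, x / 2)
e1 = ("f", ("+", "x", "y"), ("*", "x", 2))
e2 = ("f", ("+", "x", "z"), ("/", "x", 2))
e0, holes = anti_unify(e1, e2)
print(e0)  # ('f', ('+', 'x', '?1'), '?2')
```

The `holes` dictionary doubles as the two substitutions: for each placeholder, the first element of the key is what it stands for in the first tree and the second element is what it stands for in the second.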

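Anti-unification features
- The anti-unifier of a set of trees keeps their common features: the common upper part
- Anti-unification can be used to compute the edit distance between two trees. Let $E_0$ be the anti-unifier of the trees $E_1$ and $E_2$, and let $\theta_1$ and $\theta_2$ be substitutions such that $E_0\theta_1 = E_1$ and $E_0\theta_2 = E_2$. Then

$$\text{distance} = |\theta_1| + |\theta_2|$$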
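As a worked illustration, take $E_1 = f(x + y,\ x * 2)$ and $E_2 = f(x + z,\ x / 2)$. The slide does not define $|\theta|$; the calculation below assumes it counts the total number of tree nodes in the substituted subtrees, so the resulting number depends on that assumption.

```latex
\begin{align*}
E_0 &= f(x + ?_1,\ ?_2) \\
\theta_1 &= \{\, ?_1 \mapsto y,\ ?_2 \mapsto x * 2 \,\}, & |\theta_1| &= 1 + 3 = 4 \\
\theta_2 &= \{\, ?_1 \mapsto z,\ ?_2 \mapsto x / 2 \,\}, & |\theta_2| &= 1 + 3 = 4 \\
\text{distance} &= |\theta_1| + |\theta_2| = 8
\end{align*}
```

Two identical trees have empty substitutions and therefore distance 0, while trees that differ near the root need large substitutions and end up far apart.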

Clone Digger
- Is the first clone detection tool focused on Python (except Pylint)
- Is provided under the GPL license
- Writes the information on found clones to HTML in a two-column format with highlighting of differences

Comparison with existing tools working with ASTs
- CloneDR by Semantic Designs (I. Baxter, 1998)
    - Hash functions on subtrees, some kind of edit distance
- Asta by Microsoft Research (S. Evans et al., 2007)
    - Subtree patterns (similar to anti-unification), hash functions on subtrees

Quick Start
1. $ easy_install clonedigger
2. $ clonedigger --recursive source_tree
3. $ firefox output.html

Additional parameters, such as thresholds, can also be set (use --help to learn more).

Running on real-life open-source projects

    Project     Clones
    BioPython   12.19%
    NLTK        11.85%
    Zope        27.41%
    Plone       29.89%

These numbers mean nothing...
...except that every large project has clones and they should be detected.

What to do with found clones?
- Remove clones by refactoring: Extract Method and Pull Up Method can be used
- Detect library candidates
- Search for bugs
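As a hedged illustration of Extract Method, consider two nested-loop fragments that differ only in their variable names and in the range bound (`range(5)` versus `range(6)`). The shared structure can be pulled into one function, with the differing value as a parameter. The function name `pair_sums` is hypothetical.

```python
def pair_sums(limit):
    """Yield i + j for every 0 <= j < i < limit (the body shared by both clones)."""
    for i in range(limit):
        for j in range(i):
            yield i + j


# The two original fragments become two calls with different arguments.
for total in pair_sums(5):
    print(total)
for total in pair_sums(6):
    print(total)
```

After the refactoring, a bug in the loop body only has to be fixed in one place.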

Any questions?