CMCD: Count Matrix based Code Clone Detection Yang Yuan and Yao Guo Key Laboratory of High-Confidence Software Technologies (Ministry of Education) Peking.


CMCD: Count Matrix based Code Clone Detection Yang Yuan and Yao Guo Key Laboratory of High-Confidence Software Technologies (Ministry of Education) Peking University

Code Clones
In software development, it is common to reuse code fragments by copying them, with or without minor modifications. Such code fragments are called code clones. [Jurgens et al., ICSE 2009]

Scenario-based Evaluation
[Figure: original vs. copy, example of Scenario #1]

Scenario-based Evaluation
[Figure: original vs. copy, example of Scenario #2]

Scenario-based Evaluation
[Figure: original vs. copy, example of Scenario #3]

Scenario-based Evaluation
[Figure: original vs. copy, example of Scenario #4]

Importance of Code Clones
Code clones cause trouble:
– They increase the complexity of the source code
– They increase the maintenance cost of the software system
– They increase the likelihood of bugs
7%–23% of the code in large software systems is cloned. [Roy et al., SCP 2009]
Detecting code clones may help:
– Analyze the programming habits of the programmers
– Find the design patterns in the source code

Previous Work in Clone Detection
Lower level:
– Textual approach
    SDD [Lee and Jeong, OOPSLA 2005]
    NICAD [Roy and Cordy, ICPC 2008]
    …
– Lexical approach
    DUP [Baker, WCRE 1995]
    CCFinder [Kamiya et al., TSE 2002]
    CP-Miner [Li et al., OSDI 2004, TSE 2006]
    …

Previous Work in Clone Detection
Higher level:
– Syntactic approach
    CloneDr [Baxter et al., ICSM 1998]
    Deckard [Jiang et al., ICSE 2007]
    CloneDigger [Bulychev, SyRCoSE 2008]
    …
– Semantic approach
    Duplix [Krinke, WCRE 2001]
    GPLAG [Liu et al., KDD 2006]
    …

Challenges
Low-level approaches:
– Faster
– Usually focus on local characteristics
– No notion of global meaning
High-level approaches:
– Slower
– Better understanding of the programs
– Difficult to scale
GAP between the two levels

Our Idea
A novel count matrix based clone detection approach.
Benefits of counting:
– Ignoring the order of variables lets it identify clones with swapped statements, which are difficult for both lexical and syntactic approaches
– Easy to calculate and implement
– Reduces space and time complexity

Count Matrix Construction
Token sequence -> count vectors -> count matrix

Token sequence:
    tot,=,n,+,Find,(,n,),for,i,=,1,to,n,-,1, if,a,[,i,],>,a,[,j,],,k,=,a,[,i,]….

Count matrix (one count vector per variable; columns elided on the slide are shown as …):
    tot   1  0  0  …  0
    i     3  0  0  …  2
    j     1  0  0  …  1
    a     3  0  0  …  3
    n     2  1  0  …  0
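As a rough illustration of the construction above, the sketch below builds one count vector per variable from a token sequence. The three counting contexts (all occurrences, definitions, uses inside an array subscript) and the names `CountMatrixSketch` and `build` are assumptions made for this example; the slide elides the actual columns of the count matrix.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: the real CMCD counting contexts are not listed on the slide.
final class CountMatrixSketch {
    static final int OCCURRENCES = 0;   // every appearance of the variable
    static final int DEFINITIONS = 1;   // appearances directly followed by "="
    static final int IN_SUBSCRIPT = 2;  // appearances between "[" and "]"
    static final int DIMENSIONS = 3;

    /** Returns the count matrix as a map from variable name to its count vector. */
    static Map<String, int[]> build(List<String> tokens, Set<String> variables) {
        Map<String, int[]> matrix = new LinkedHashMap<>();
        int bracketDepth = 0;
        for (int i = 0; i < tokens.size(); i++) {
            String token = tokens.get(i);
            if (token.equals("[")) { bracketDepth++; continue; }
            if (token.equals("]")) { bracketDepth--; continue; }
            if (!variables.contains(token)) continue;  // skip operators, literals, keywords

            int[] vector = matrix.computeIfAbsent(token, k -> new int[DIMENSIONS]);
            vector[OCCURRENCES]++;
            boolean followedByAssign =
                i + 1 < tokens.size() && tokens.get(i + 1).equals("=");
            if (followedByAssign) vector[DEFINITIONS]++;
            if (bracketDepth > 0) vector[IN_SUBSCRIPT]++;
        }
        return matrix;
    }
}
```

Because each vector only accumulates counts, swapping two statements in the token sequence leaves every count vector unchanged, which is the order-insensitivity the approach relies on.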

Comparison Algorithms
Goal:
– Find more scenario #4 clones with more transformations, such as statement swapping
– Run fast
General principles:
– Compare individual variables instead of variable sequences
– Ignore variable order in the count matrix

Bipartite Graph Matching
Use bipartite graph matching to find code clones at different granularities:
– Bottom-up approach
    Can be used to compute the similarity between two projects, two classes, or two methods
– Use two kinds of bipartite graph matching
    KM algorithm (low-level, slow, accurate)
    Hungarian algorithm (high-level, fast, inaccurate)
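To make the matching step concrete, the sketch below pairs the count vectors of one method with those of another so that the total Euclidean distance is minimal. It uses exhaustive search as a stand-in that is only practical for tiny inputs; it is not the KM or Hungarian algorithm named on the slide, and the class and method names are hypothetical.

```java
// Hypothetical sketch: minimum-cost bipartite matching by exhaustive search.
// A real implementation would use a polynomial-time algorithm such as KM.
final class BipartiteMatchSketch {

    /** Euclidean distance between two count vectors of equal length. */
    static double distance(int[] u, int[] v) {
        double sum = 0.0;
        for (int k = 0; k < u.length; k++) {
            double d = u[k] - v[k];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** Smallest total distance over all ways to match each row of the smaller matrix. */
    static double minCostMatching(int[][] a, int[][] b) {
        if (a.length > b.length) return minCostMatching(b, a);
        return search(a, b, 0, new boolean[b.length]);
    }

    // Try every unused row of b as the partner of a[row], then recurse.
    private static double search(int[][] a, int[][] b, int row, boolean[] used) {
        if (row == a.length) return 0.0;
        double best = Double.POSITIVE_INFINITY;
        for (int j = 0; j < b.length; j++) {
            if (used[j]) continue;
            used[j] = true;
            double cost = distance(a[row], b[j]) + search(a, b, row + 1, used);
            best = Math.min(best, cost);
            used[j] = false;
        }
        return best;
    }
}
```

A total cost of zero means every variable in one method has a counterpart with an identical count vector in the other, regardless of the order in which the variables appear.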

Optimization
– Use the Euclidean distance to compute the similarity of count vectors (CVs)
– Use a quick rejection algorithm to improve speed
– Eliminate false positives:
    Cut and check
    Slice and match
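The slide does not say how the quick rejection works, so the sketch below shows one possible form under stated assumptions. It relies on the reverse triangle inequality, | ||u|| - ||v|| | <= ||u - v||: if the (cacheable) norms of two count vectors already differ by more than the distance threshold, the full distance need not be computed. The threshold `maxDistance` and all names are hypothetical.

```java
// Hypothetical sketch of Euclidean comparison with a norm-based quick rejection.
final class QuickRejectSketch {

    /** Euclidean norm of a count vector; can be computed once per vector and cached. */
    static double norm(int[] v) {
        double sum = 0.0;
        for (int x : v) sum += (double) x * x;
        return Math.sqrt(sum);
    }

    /** Euclidean distance between two count vectors of equal length. */
    static double distance(int[] u, int[] v) {
        double sum = 0.0;
        for (int k = 0; k < u.length; k++) {
            double d = u[k] - v[k];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** True when the norms alone prove the distance exceeds maxDistance. */
    static boolean quickReject(double normU, double normV, double maxDistance) {
        return Math.abs(normU - normV) > maxDistance;
    }

    /** Full check: cheap rejection first, exact distance only if needed. */
    static boolean similar(int[] u, int[] v, double maxDistance) {
        if (quickReject(norm(u), norm(v), maxDistance)) return false;
        return distance(u, v) <= maxDistance;
    }
}
```

The rejection is sound in the sense that it never discards a pair whose true distance is within the threshold; it only skips work for pairs that could not match anyway.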

Implementation
Use Soot to convert Java to Jimple [Vallee-Rai et al., CASCON 1999]:
– 3-address intermediate representation
– Smaller language set
– Breaks complex statements into basic ones
– Does not change the meaning of the program
A new version of CMCD without using Soot
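To illustrate what "breaking complex statements into basic ones" means, the sketch below writes the same computation twice in plain Java: once as a compound expression and once with a single operator per statement, as a 3-address representation would express it. The temporaries `t0`, `t1`, `t2` are illustrative names only and are not actual Soot or Jimple output.

```java
// Hypothetical illustration of 3-address form, written in Java rather than Jimple.
final class ThreeAddressSketch {

    /** Compound expression as a programmer would write it. */
    static int original(int a, int b, int c, int d) {
        return a + b * c - d;
    }

    /** Same computation with one operator per statement. */
    static int threeAddress(int a, int b, int c, int d) {
        int t0 = b * c;   // multiplication binds tighter, so it is evaluated first
        int t1 = a + t0;
        int t2 = t1 - d;
        return t2;
    }
}
```

Both methods return the same value for every input, which mirrors the slide's point that the conversion does not change the meaning of the program while giving the counter simpler statements to work on.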

Overview

Performance Comparison to Deckard

Scenario-based Evaluation
Based on the scenario classification from the paper by Roy et al., “Comparison and Evaluation of Code Clone Detection Techniques”

Detecting Plagiarism
Student-submitted compiler lab projects:
– 29 submissions
– Java classes
– 7,825–38,086 lines of code
Experimental results:
– Running time: 123 minutes
– 2 clusters of code clones, each with 3 copies
– Confirmed
– Now used by two courses at Peking University to detect plagiarism in students’ homework

Analyzing JDK 1.6 Source Code
JDK 1.6.0_18:
– 7,197 files
– 2,079,166 LoC
Experimental results:
– Running time: 163 minutes
– Found: 786 methods in 174 clusters (small methods are omitted)

Code Comparison: Two Clones

Method 1: (in com.sun.corba.se.impl.ior.iiop.SyncFactory)

    public static SyncFactory getSyncFactory() {
        if (syncFactory == null) {
            synchronized (SyncFactory.class) {
                if (syncFactory == null) {
                    syncFactory = new SyncFactory();
                } // end if
            } // end synchronized block
        } // end if
        return syncFactory;
    }

Method 2: (in javax.swing.JComponent)

    static Set getManagingFocusBackwardTraversalKeys() {
        synchronized (JComponent.class) {
            if (managingFocusBackwardTraversalKeys == null) {
                managingFocusBackwardTraversalKeys = new HashSet(1);
                managingFocusBackwardTraversalKeys.add(KeyStroke.getKeyStroke(
                    KeyEvent.VK_TAB, InputEvent.SHIFT_MASK | InputEvent.CTRL_MASK));
            }
        }
        return managingFocusBackwardTraversalKeys;
    }

Detected a Bug

Method 1: (in com.sun.corba.se.impl.ior.iiop.SyncFactory)

    public static SyncFactory getSyncFactory() {
        if (syncFactory == null) {
            synchronized (SyncFactory.class) {
                if (syncFactory == null) {
                    syncFactory = new SyncFactory();
                } // end if
            } // end synchronized block
        } // end if
        return syncFactory;
    }

Method 3: (in com.sun.corba.se.impl.ior.iiop.JavaSerializationComponent)

    public static JavaSerializationComponent singleton() {
        if (singleton == null) {
            synchronized (JavaSerializationComponent.class) {
                singleton = new JavaSerializationComponent(Message.JAVA_ENC_VERSION);
            }
        }
        return singleton;
    }

w_bug.do?bug_id=
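Comparing the two clones shows the difference: Method 3 has no second null check inside the synchronized block, so two threads that both pass the outer check can each construct an instance. The sketch below shows how the pattern is usually completed; `LazySingletonSketch` is a hypothetical class, not the JDK code, and the `volatile` field is an addition based on the Java memory model rather than something shown on the slide.

```java
// Hypothetical sketch of double-checked locking with both null checks in place.
final class LazySingletonSketch {
    // volatile ensures a fully constructed instance is visible to other threads
    private static volatile LazySingletonSketch instance;

    private LazySingletonSketch() { }

    static LazySingletonSketch getInstance() {
        LazySingletonSketch local = instance;
        if (local == null) {                          // first check, without locking
            synchronized (LazySingletonSketch.class) {
                local = instance;
                if (local == null) {                  // second check, under the lock
                    local = new LazySingletonSketch();
                    instance = local;
                }
            }
        }
        return local;
    }
}
```

With the inner check present, a thread that waited for the lock sees the instance created by the previous holder and returns it instead of creating another one.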

Conclusion
We propose a code clone detection approach, CMCD:
– Extracts count-based information
– Language independent
– Scales to large programs (> 1M LoC)
Capabilities:
– Performs well in scenario-based evaluation
– Detects code plagiarism in students’ homework
– Identifies a potential bug in JDK source code