CMCD: Count Matrix based Code Clone Detection Yang Yuan and Yao Guo Key Laboratory of High-Confidence Software Technologies (Ministry of Education) Peking.


1 CMCD: Count Matrix based Code Clone Detection
Yang Yuan and Yao Guo
Key Laboratory of High-Confidence Software Technologies (Ministry of Education), Peking University

2 Code Clones
In software development, it is common to reuse code fragments by copying them with or without minor modifications. Such code fragments are called code clones. [Jurgens et al., ICSE 2009]

3 Scenario-based Evaluation
[Figure: original vs. copy, example of Scenario #1]

4 Scenario-based Evaluation
[Figure: original vs. copy, example of Scenario #2]

5 Scenario-based Evaluation
[Figure: original vs. copy, example of Scenario #3]

6 Scenario-based Evaluation
[Figure: original vs. copy, example of Scenario #4]

7 Importance of Code Clones
Code clones cause trouble:
– Increase the complexity of source code
– Increase the maintenance cost of software systems
– Increase the likelihood of bugs
7%-23% of the code in large software systems is cloned. [Roy et al., SCP 2009]
Detecting code clones may help:
– Analyze the programming habits of the programmers
– Find the design patterns in the source code

8 Previous Work in Clone Detection
Lower level:
– Textual approach: SDD [Lee and Jeong, OOPSLA 2005], NICAD [Roy and Cordy, ICPC 2008], ...
– Lexical approach: DUP [Baker, WCRE 1995], CCFinder [Kamiya et al., TSE 2002], CP-Miner [Li et al., OSDI 2004, TSE 2006], ...

9 Previous Work in Clone Detection
Higher level:
– Syntactic approach: CloneDR [Baxter et al., ICSM 1998], Deckard [Jiang et al., ICSE 2007], CloneDigger [Bulychev, SyRCoSE 2008], ...
– Semantic approach: Duplix [Krinke, WCRE 2001], GPLAG [Liu et al., KDD 2006], ...

10 Challenges
Low-level approaches:
– Faster
– Usually focus on local characteristics
– No idea about global meaning
High-level approaches:
– Slower
– Better understanding of the programs
– Difficult to scale
A gap remains between the two.

11 Our Idea
A novel count matrix based clone detection approach.
Benefits of counting:
– By ignoring the order of variables, it can identify clones that involve statement swapping, which is difficult for both lexical and syntactic approaches
– Easy to calculate and implement
– Reduces space and time complexity

12 Count Matrix Construction
Panels: Token Sequence, Count Vector, Count Matrix
Token sequence: tot,=,n,+,Find,(,n,),for,i,=,1,to,n,-,1, if,a,[,i,],>,a,[,j,],,k,=,a,[,i,]…
Count matrix (one count vector per variable):
tot  1 0 0 … 0
i    3 0 0 … 2
j    1 0 0 … 1
a    3 0 0 … 3
n    2 1 0 … 0
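The construction on this slide can be sketched roughly as follows. The slide does not say what each column of the count matrix counts, so the three counting conditions below (total uses, uses as an assignment target, uses inside an array subscript) are assumptions made for illustration, as are the class name `CountMatrixSketch` and the tiny keyword list.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only: the counting conditions are hypothetical,
// not the conditions used by CMCD itself.
final class CountMatrixSketch {
    static final int USED = 0;          // every occurrence of the variable
    static final int ASSIGNED = 1;      // occurrence directly followed by "="
    static final int IN_SUBSCRIPT = 2;  // occurrence between "[" and "]"

    // Assumed keyword list for the toy token stream on the slide.
    private static final Set<String> KEYWORDS = Set.of("for", "to", "if");

    // Returns one count vector (a row of the count matrix) per variable.
    static Map<String, int[]> build(List<String> tokens) {
        Map<String, int[]> matrix = new LinkedHashMap<>();
        int bracketDepth = 0;
        for (int i = 0; i < tokens.size(); i++) {
            String token = tokens.get(i);
            if (token.equals("[")) { bracketDepth++; continue; }
            if (token.equals("]")) { bracketDepth--; continue; }
            String next = i + 1 < tokens.size() ? tokens.get(i + 1) : "";
            boolean isIdentifier = token.matches("[A-Za-z_]\\w*");
            // Skip literals, operators, keywords, and names of called functions.
            if (!isIdentifier || KEYWORDS.contains(token) || next.equals("(")) {
                continue;
            }
            int[] vector = matrix.computeIfAbsent(token, k -> new int[3]);
            vector[USED]++;
            if (next.equals("=")) vector[ASSIGNED]++;
            if (bracketDepth > 0) vector[IN_SUBSCRIPT]++;
        }
        return matrix;
    }
}
```

For the token list `k, =, a, [, i, ], +, k` this yields the rows `k → [2, 1, 0]`, `a → [1, 0, 0]`, and `i → [1, 0, 1]`. Because each row depends only on how a variable is used and not on where, reordering independent statements leaves the rows unchanged.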

13 Comparison Algorithms
Goal:
– Find more Scenario #4 clones with more transformations, such as statement swapping
– Run fast
General principles:
– Compare individual variables instead of variable sequences
– Ignore variable order in the count matrix

14 Bipartite Graph Matching
Use bipartite graph matching to find code clones at different granularities:
– Bottom-up approach: can be used to compute the similarity between two projects, two classes, or two methods
– Use two kinds of bipartite graph matching:
  KM algorithm (low-level, slow, accurate)
  Hungarian algorithm (high-level, fast, inaccurate)
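To make the matching step concrete, here is a minimal sketch of a minimum-cost one-to-one assignment between the variables of two code fragments. It is not the KM or Hungarian algorithm named on the slide: it swaps in a simple bitmask dynamic program that gives the same optimal total cost but only works for small inputs (at most 20 columns). The class name `AssignmentSketch` and the idea that `cost[i][j]` holds the distance between count vector `i` of the first fragment and count vector `j` of the second are assumptions.

```java
import java.util.Arrays;

// Illustrative stand-in for a weighted bipartite matching algorithm.
final class AssignmentSketch {
    // cost[i][j]: cost of pairing row i (a variable of fragment A)
    // with column j (a variable of fragment B). Requires rows <= cols <= 20.
    static double minCostAssignment(double[][] cost) {
        int rows = cost.length;
        int cols = rows == 0 ? 0 : cost[0].length;
        if (rows > cols || cols > 20) {
            throw new IllegalArgumentException("need rows <= cols <= 20");
        }
        // best[mask]: cheapest way to pair the first bitCount(mask) rows
        // with exactly the columns whose bits are set in mask.
        double[] best = new double[1 << cols];
        Arrays.fill(best, Double.POSITIVE_INFINITY);
        best[0] = 0.0;
        double answer = Double.POSITIVE_INFINITY;
        for (int mask = 0; mask < (1 << cols); mask++) {
            if (Double.isInfinite(best[mask])) continue;
            int row = Integer.bitCount(mask);
            if (row == rows) {            // every row has a partner
                answer = Math.min(answer, best[mask]);
                continue;
            }
            for (int col = 0; col < cols; col++) {
                int bit = 1 << col;
                if ((mask & bit) != 0) continue;   // column already used
                double candidate = best[mask] + cost[row][col];
                if (candidate < best[mask | bit]) best[mask | bit] = candidate;
            }
        }
        return answer;
    }
}
```

A low total cost means the variables of the two fragments can be paired so that each pair is used in nearly the same way, regardless of the order in which the variables appear.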

15 Optimization
– Use the Euclidean metric to compute the similarity of CVs
– Use a quick rejection algorithm to improve speed
– Eliminate false positives:
  Cut and check
  Slice and match
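A minimal sketch of the first two points, under stated assumptions: the slide does not describe the quick rejection algorithm, so the norm-based test below is only one plausible form of it, and the class name `CountVectorDistanceSketch` and the `threshold` parameter are hypothetical. The test is safe because of the reverse triangle inequality: if the norms of two vectors differ by more than the threshold, their Euclidean distance must exceed it too.

```java
// Illustrative sketch; not the rejection test used by CMCD itself.
final class CountVectorDistanceSketch {
    // Euclidean length of a count vector.
    static double norm(int[] v) {
        double sum = 0.0;
        for (int x : v) sum += (double) x * x;
        return Math.sqrt(sum);
    }

    // Euclidean distance between two count vectors of equal length.
    static double distance(int[] a, int[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("count vectors differ in length");
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    // True when the two vectors are within threshold of each other.
    static boolean similar(int[] a, int[] b, double threshold) {
        // Quick rejection: | ||a|| - ||b|| | <= ||a - b||, so a large
        // gap between the norms already rules the pair out.
        if (Math.abs(norm(a) - norm(b)) > threshold) return false;
        return distance(a, b) <= threshold;
    }
}
```

The rejection pays off when the norms are computed once per vector and cached, since the check then costs a single subtraction per pair; this sketch recomputes them for brevity.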

16 Implementation
Use Soot to convert Java to Jimple [Vallee-Rai et al., CASCON 1999]:
– 3-address intermediate representation
– Smaller language set
– Breaks complex statements into basic ones
– Does not change the meaning of the program
A newer version of CMCD does not use Soot.
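The effect of a three-address representation can be illustrated in plain Java. This is a hand-written analogy, not actual Soot or Jimple output; the class name `ThreeAddressSketch` and the temporaries `t1`, `t2`, `t3` are made up for the example.

```java
// Illustrative analogy only; Jimple has its own syntax and naming.
final class ThreeAddressSketch {
    // One complex statement, as a programmer would write it.
    static int complex(int a, int b, int c, int d) {
        return (a + b) * (c - d);
    }

    // The same computation with at most one operator per statement,
    // which is the shape a three-address representation enforces.
    static int threeAddress(int a, int b, int c, int d) {
        int t1 = a + b;
        int t2 = c - d;
        int t3 = t1 * t2;
        return t3;
    }
}
```

Both methods compute the same value. Normalizing statements in this way means two fragments that differ only in how expressions are nested end up with the same basic operations to count.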

17 Overview

18 Performance Comparison to Deckard

19 Scenario-based Evaluation
Based on the scenario classification from the paper by Roy et al., “Comparison and Evaluation of Code Clone Detection Techniques”

20 Detecting Plagiarism
Student-submitted compiler lab projects:
– 29 submissions
– 106–251 Java classes
– 7,825–38,086 lines of code
Experimental results:
– Running time: 123 minutes
– 2 clusters of code clones, each with 3 copies
– Confirmed
– Now used by two courses at Peking University to detect plagiarism in students’ homework

21 Analyzing JDK 1.6 Source Code
JDK 1.6.0_18:
– 7,197 files
– 2,079,166 LoC
Experimental results:
– Running time: 163 minutes
– Found: 786 methods in 174 clusters (small methods are omitted)

22 Code Comparison: Two Clones
Method 1: (in com.sun.corba.se.impl.ior.iiop.SyncFactory)

    public static SyncFactory getSyncFactory() {
        if (syncFactory == null) {
            synchronized (SyncFactory.class) {
                if (syncFactory == null) {
                    syncFactory = new SyncFactory();
                } // end if
            } // end synchronized block
        } // end if
        return syncFactory;
    }

Method 2: (in javax.swing.JComponent)

    static Set getManagingFocusBackwardTraversalKeys() {
        synchronized (JComponent.class) {
            if (managingFocusBackwardTraversalKeys == null) {
                managingFocusBackwardTraversalKeys = new HashSet(1);
                managingFocusBackwardTraversalKeys.add(KeyStroke.getKeyStroke(
                    KeyEvent.VK_TAB, InputEvent.SHIFT_MASK | InputEvent.CTRL_MASK));
            }
            return managingFocusBackwardTraversalKeys;
        }
    }

23 Detected a Bug
Method 1: (in com.sun.corba.se.impl.ior.iiop.SyncFactory)

    public static SyncFactory getSyncFactory() {
        if (syncFactory == null) {
            synchronized (SyncFactory.class) {
                if (syncFactory == null) {
                    syncFactory = new SyncFactory();
                } // end if
            } // end synchronized block
        } // end if
        return syncFactory;
    }

Method 3: (in com.sun.corba.se.impl.ior.iiop.JavaSerializationComponent)

    public static JavaSerializationComponent singleton() {
        if (singleton == null) {
            synchronized (JavaSerializationComponent.class) {
                // no second null check inside the synchronized block
                singleton = new JavaSerializationComponent(Message.JAVA_ENC_VERSION);
            }
        }
        return singleton;
    }

http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6999537
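For comparison, a minimal sketch of lazy initialization with both checks in place. Method 3 above omits the null check inside the synchronized block, so two threads that both pass the outer check can each construct an instance. The class name `LazySingletonSketch` is hypothetical and this is not the JDK's actual fix; it is the general double-checked locking idiom, which on Java 5 and later also needs the field to be `volatile`.

```java
// Illustrative sketch of double-checked locking; not code from the JDK.
final class LazySingletonSketch {
    // volatile ensures other threads see a fully constructed instance.
    private static volatile LazySingletonSketch instance;

    private LazySingletonSketch() { }

    static LazySingletonSketch getInstance() {
        LazySingletonSketch local = instance;      // first (unsynchronized) check
        if (local == null) {
            synchronized (LazySingletonSketch.class) {
                local = instance;                  // second check, under the lock
                if (local == null) {
                    local = new LazySingletonSketch();
                    instance = local;
                }
            }
        }
        return local;
    }
}
```

Repeated calls to `getInstance()` return the same object. The second check is the part that matters here: a thread that waited for the lock re-reads the field and skips construction if another thread got there first.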

24 Conclusion
We propose a code clone detection approach, CMCD:
– Extracts count-based information
– Language independent
– Scales to large programs (> 1M LoC)
Capabilities:
– Performs well in scenario-based evaluation
– Detects code plagiarism in students’ homework
– Identifies a potential bug in JDK source code

