1 Gemini: Code Clone Analysis Tool
Yasushi Ueda†, Yoshiki Higo‡, Toshihiro Kamiya*, Shinji Kusumoto‡, and Katsuro Inoue‡
†Graduate School of Engineering Science, Osaka Univ., Japan
‡Graduate School of Information Science and Technology, Osaka Univ., Japan
*PRESTO, Japan Science and Technology Corp., Japan
{y-ueda, y-higo, kamiya, kusumoto,

2 Contents
- Background
- Code Clone Analysis Tool, Gemini
  - Overview
  - System structure
  - Scatter Plot

3 Background (1/2) A code clone is a pair/set of code portions in source files that are identical or similar to each other.

4 Background (2/2)
- Code clones are one of the factors that make software maintenance more difficult.
  - If a fault is found in a code portion, the same fault must be corrected in all of its clones.
- We have developed a code clone detection tool, CCFinder [1].
  - Token-based clone detector
  - Its input is a set of source files, and its output is the locations of clone pairs.
[1] T. Kamiya, S. Kusumoto, and K. Inoue, “CCFinder: A multi-linguistic token-based code clone detection system for large scale source code”, IEEE Transactions on Software Engineering, 28(7).

5 Example of clone detection process
CCFinder pipeline: Source files → Lexical analysis → Token sequence → Transformation → Transformed token sequence → Match detection → Clones on transformed sequence → Formatting → Clone pairs

Source files:
  1. static void foo() throws RESyntaxException {
  2.   String a[] = new String [] { "123,400", "abc", "orange 100" };
  3.   org.apache.regexp.RE pat = new org.apache.regexp.RE("[0-9,]+");
  4.   int sum = 0;
  5.   for (int i = 0; i < a.length; ++i)
  6.     if (pat.match(a[i]))
  7.       sum += Sample.parseNumber(pat.getParen(0));
  8.   System.out.println("sum = " + sum);
  9. }
 10. static void goo(String [] a) throws RESyntaxException {
 11.   RE exp = new RE("[0-9,]+");
 12.   int sum = 0;
 13.   for (int i = 0; i < a.length; ++i)
 14.     if (exp.match(a[i]))
 15.       sum += parseNumber(exp.getParen(0));
 16.   System.out.println("sum = " + sum);
 17. }

Clone pairs: 0.13,1 9,111,1 17,1

6 Gemini overview
- A GUI-based code clone analysis tool
- Uses CCFinder as its code clone detector
- Has several views for interactive analysis
  - Scatter plot view
    - Select by mouse dragging
    - Sorting function
    - Zoom in/out
  - Metric graph view
    - Select by metric values
  - Source code view
- Implemented in Java
  - About 10,000 lines of code

7 Scatter plot
- Both the vertical and horizontal axes represent the token sequence of the source code.
- A dot means that the two corresponding tokens on the two axes are the same.
  - The main diagonal is always drawn, because each dot on it refers to the same position on both axes.
- A clone pair is shown as a diagonal line segment.
- The distribution is symmetric about the main diagonal.
Figure (axis labels: a b c a b c a d e c). Legend: a, b, c, ...: tokens; dot: matched position.
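The scatter plot described above can be sketched in a few lines of Java. This is only an illustration under stated assumptions, not Gemini's implementation: the class `ScatterPlotSketch`, its method names, and the `minLen` threshold are hypothetical, and the input is assumed to be an already tokenized sequence.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the scatter plot idea; not taken from Gemini.
class ScatterPlotSketch {
    /** dots[i][j] is true when token i and token j are the same. */
    static boolean[][] dotMatrix(String[] tokens) {
        int n = tokens.length;
        boolean[][] dots = new boolean[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                dots[i][j] = tokens[i].equals(tokens[j]);
            }
        }
        return dots;
    }

    /**
     * Finds maximal diagonal runs of dots above the main diagonal.
     * Each result is {row, column, length}; runs shorter than minLen are dropped.
     */
    static List<int[]> diagonalSegments(boolean[][] dots, int minLen) {
        List<int[]> segments = new ArrayList<>();
        int n = dots.length;
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                if (!dots[i][j]) continue;
                // Skip positions that continue a run started further up the diagonal.
                if (i > 0 && dots[i - 1][j - 1]) continue;
                int len = 0;
                while (j + len < n && dots[i + len][j + len]) len++;
                if (len >= minLen) segments.add(new int[] {i, j, len});
            }
        }
        return segments;
    }
}
```

Only the upper triangle is scanned because the matrix is symmetric about the main diagonal. For the token sequence `a b c a b c a d e c` and `minLen = 3`, the only segment starts at row 0, column 3 and has length 4.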

8 Sorting function
- When multiple files are compared in the scatter plot, the file boundaries are shown on the axes.
- Depending on the file order, the dots may be spread widely across the plot.
- The sorting function places similar files as close to each other as possible.

9 Snapshots of Gemini

10 Conclusions
- We presented Gemini, a maintenance support environment based on code clone analysis.
- As future work, we plan to evaluate its applicability to large-scale software in actual maintenance.

12 CCFinder: Implementation
- CCFinder extracts code clones by direct comparison of source text.
- It transforms the source text for precise and effective detection of code clones.
  - Token-based transformation rules regularize and select code portions for Java, C++, COBOL, and other languages.
- It uses an efficient matching algorithm for large source code.
  - Complexity of the algorithm: O(n), where n is the length of the source code
  - Scalability: 108 min. for 7.2 million lines (Pentium III 650 MHz, 640 MB memory)

13 The difference between ‘diff’ and clone detection tools
- diff finds the longest common subsequence.
  - Given a code portion, diff does not report two or more identical code portions (clones).
- A clone detection tool finds all identical or similar code portions.

14 Example of transformation rules in Java
- All user-defined identifiers are transformed to the same token.
- A unique identifier is inserted at each end of the top-level definitions and declarations.
  - Prevents detecting clones that begin in the middle of one class definition and end in the middle of another.
- "java.lang.Math.PI" is transformed to "Math.PI".
  - With an import statement, a class can be referred to by either its full package name or a shorter name.
- "new int[] {1, 2, 3}" is transformed to "new int[] {$}".
  - Eliminates table initialization code.
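The first rule (mapping identifiers to one token) can be illustrated with a small sketch. It is not CCFinder's lexer: the class `TokenNormalizerSketch`, the regular expression, and the short keyword list are assumptions made for this example. As a simplification, every non-keyword identifier is treated as user-defined, so library names such as `String` are replaced as well.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of identifier normalization; not CCFinder's actual rules.
class TokenNormalizerSketch {
    // Assumption: a short keyword list that covers the sample code only.
    private static final Set<String> KEYWORDS = new HashSet<>(Arrays.asList(
            "static", "void", "int", "for", "if", "new", "throws", "return"));

    // Identifiers, integer literals, string literals, or any single non-space character.
    private static final Pattern TOKEN =
            Pattern.compile("[A-Za-z_][A-Za-z0-9_]*|\\d+|\"[^\"]*\"|\\S");

    /** Splits source text into tokens and replaces every non-keyword identifier with "$". */
    static List<String> normalize(String source) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(source);
        while (m.find()) {
            String t = m.group();
            boolean isIdentifier =
                    Character.isJavaIdentifierStart(t.charAt(0)) && !KEYWORDS.contains(t);
            tokens.add(isIdentifier ? "$" : t);
        }
        return tokens;
    }
}
```

With this sketch, `if (pat.match(a[i]))` and `if (exp.match(a[i]))` from the sample methods normalize to the same token sequence, which is what allows a later matching step to report them as clones despite the renamed variable.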

15 Clone class metrics
- LEN(C): Length of the token sequence of each element in clone class C
- POP(C): Number of elements in clone class C
- RAD(C): Distribution in the file system of the elements in clone class C
- DFL(C): Estimate of how many tokens would be removed from the source files if all code fragments of clone class C were replaced with caller statements of a new identical routine
Figure labels: new sub routine, caller statements

16 Snapshots of clone class metric graph
Axes: RAD, LEN, POP, DFL (Filtering mode: ON)

17 Aims of clone class metrics
We are interested in:
- Clone classes whose elements are spread widely.
  - A high value of POP means that there are many similar code fragments.
  - A high value of RAD means that the clones are spread over many subsystems.
  - Such clones are difficult to find all together during maintenance.
- Clone classes that are appropriate for refactoring.
  - A high value of DFL (high POP and high LEN) means that the clone class is worth evaluating to see whether its elements can be merged into one routine.

18 Definition of DFL and RAD
DFL(C)
- DFL(C) = LEN(C) × POP(C) - (5 × POP(C) + LEN(C))
  - LEN(C) × POP(C): the target code size for restructuring
  - 5 × POP(C): the code size of the new caller statements
  - LEN(C): the code size of the new identical routine
RAD(C)
- Distribution in the file system of the elements in clone class C
  - RAD(C) = 0: C is enclosed within a single file.
  - RAD(C) = 1: C is enclosed within a single directory.
  - RAD(C) = n: C is enclosed within a directory tree of n layers.
Figure labels: new sub routine, caller statements
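A small sketch may make the two definitions concrete. It reads DFL as the size of the removed code minus the size of the added code (caller statements plus the new routine), and it interprets RAD as the largest number of path components between a file and the deepest directory shared by all files. Both readings, the class `CloneClassMetricsSketch`, and the "/"-separated path format are assumptions of this example, not Gemini's code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the DFL and RAD metrics; not taken from Gemini.
class CloneClassMetricsSketch {
    // The slide counts each new caller statement as 5 tokens.
    static final int CALLER_TOKENS = 5;

    /** Estimated token reduction: removed clones minus (new callers + new routine). */
    static int dfl(int len, int pop) {
        return len * pop - (CALLER_TOKENS * pop + len);
    }

    /** RAD for the files of one clone class, given "/"-separated paths (assumed format). */
    static int rad(List<String> filePaths) {
        boolean singleFile = true;
        List<String[]> parts = new ArrayList<>();
        for (String path : filePaths) {
            if (!path.equals(filePaths.get(0))) singleFile = false;
            parts.add(path.split("/"));
        }
        if (singleFile) return 0; // all fragments live in one file

        // Count the leading directory components shared by every path.
        int common = 0;
        outer:
        while (true) {
            for (String[] p : parts) {
                boolean onlyFileNameLeft = common >= p.length - 1;
                if (onlyFileNameLeft || !p[common].equals(parts.get(0)[common])) break outer;
            }
            common++;
        }

        // Longest remaining path below the shared directory, file name included.
        int max = 0;
        for (String[] p : parts) max = Math.max(max, p.length - common);
        return max;
    }
}
```

For example, a clone class with LEN = 20 and POP = 3 gives `dfl(20, 3)` = 60 - (15 + 20) = 25 tokens. Two files in the same directory give RAD 1, and `src/ui/A.java` with `src/core/B.java` gives RAD 2.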

19 CCFinder (3/4)
Application of CCFinder
- Free software
  - JDK libraries (Java, 570 KLOC)
  - Linux, FreeBSD (C, MLOC)
  - FreeBSD, OpenBSD, NetBSD (C)
  - Qt (C++, 240 KLOC)
- Commercial software
  - NTT data Corp., Hitachi Ltd., NEC soft Ltd., ASTEC Inc., SRA Inc.
  - NASDA (control program for rocket)

20 CCFinder (4/4)
Output of CCFinder
  #version: ccfinder 3.1
  #langspec: JAVA
  #option: -b 30,1
  #option: -k +
  #option: -r abcdfikmnprsv
  #option: -c wfg
  #begin{file description}
  C:\Gemini.java
  C:\GeneralManager.java
  :
  #end{file description}
  #begin{clone}
  ,9 63, ,9 553, ,9 63, ,9 633, ,9 152, ,9 216,51 42
  :
  #end{clone}
Annotations:
- Object file ID (file 0 in Group 0)
- Location of a clone pair (Lines in file 0.1 and Lines in file 1.10 are identical or similar to each other)
It is difficult to analyze source code using only this text-based information about the locations of clone pairs.

21 System structure of Gemini
Figure components: User; User interfaces (Scatter plot view, Metric graph views, Source code view); Clone pair manager; Metrics manager; Source code manager; Clone selection information; Code clone detector CCFinder; Code clone database; Source files

22 Example of clone detection process
CCFinder pipeline: Source files → Lexical analysis → Token sequence → Transformation → Transformed token sequence → Match detection → Clones on transformed sequence → Formatting → Clone pairs

Source files:
  1. static void foo() throws RESyntaxException {
  2.   String a[] = new String [] { "123,400", "abc", "orange 100" };
  3.   org.apache.regexp.RE pat = new org.apache.regexp.RE("[0-9,]+");
  4.   int sum = 0;
  5.   for (int i = 0; i < a.length; ++i)
  6.     if (pat.match(a[i]))
  7.       sum += Sample.parseNumber(pat.getParen(0));
  8.   System.out.println("sum = " + sum);
  9. }
 10. static void goo(String [] a) throws RESyntaxException {
 11.   RE exp = new RE("[0-9,]+");
 12.   int sum = 0;
 13.   for (int i = 0; i < a.length; ++i)
 14.     if (exp.match(a[i]))
 15.       sum += parseNumber(exp.getParen(0));
 16.   System.out.println("sum = " + sum);
 17. }

Clone pairs: 0.13,1, 9,111,1 17,1

23 Suffix tree
A suffix tree is a tree that satisfies the following conditions.
1. A leaf node represents the starting position of a substring.
2. A path from the root node to a leaf node represents a substring.
3. The first characters of the labels of all edges leaving one node differ from each other.
→ A common path means a clone.
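The idea that a shared path corresponds to a clone can be shown without building the tree. The sketch below swaps in a simpler technique: it sorts all suffixes of a token sequence and compares neighbours, since two suffixes that share a path in the suffix tree share a common prefix of the same length. It is not a suffix tree and not linear-time, so it does not reflect CCFinder's algorithm or performance. The class `RepeatedSequenceSketch` and its methods are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch using sorted suffixes instead of a suffix tree.
class RepeatedSequenceSketch {
    /** Length of the common prefix of the suffixes starting at a and b. */
    static int commonPrefix(List<String> tokens, int a, int b) {
        int len = 0;
        while (a + len < tokens.size() && b + len < tokens.size()
                && tokens.get(a + len).equals(tokens.get(b + len))) {
            len++;
        }
        return len;
    }

    /** Lexicographic comparison of two suffixes; a shorter prefix sorts first. */
    static int compareSuffixes(List<String> tokens, int a, int b) {
        int p = commonPrefix(tokens, a, b);
        boolean aEnded = a + p == tokens.size();
        boolean bEnded = b + p == tokens.size();
        if (aEnded || bEnded) return aEnded ? (bEnded ? 0 : -1) : 1;
        return tokens.get(a + p).compareTo(tokens.get(b + p));
    }

    /** Returns {startA, startB, length} of the longest repeated run, or {-1, -1, 0}. */
    static int[] longestRepeat(List<String> tokens) {
        Integer[] suffixes = new Integer[tokens.size()];
        for (int i = 0; i < suffixes.length; i++) suffixes[i] = i;
        Arrays.sort(suffixes, (x, y) -> compareSuffixes(tokens, x, y));

        int[] best = {-1, -1, 0};
        // The longest common prefix over all pairs always occurs between sorted neighbours.
        for (int k = 1; k < suffixes.length; k++) {
            int a = suffixes[k - 1];
            int b = suffixes[k];
            int len = commonPrefix(tokens, a, b);
            if (len > best[2]) best = new int[] {Math.min(a, b), Math.max(a, b), len};
        }
        return best;
    }
}
```

For the token sequence `a b c a b c a d e c`, the longest repeated run is `a b c a`, found at positions 0 and 3 with length 4. A real detector would also enforce a minimum length and handle overlapping occurrences, which this sketch does not.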

24 Case study overview
- Application target
  - Programs developed in a programming exercise at Osaka Univ.
    - Compiler written in C
    - Programs of 69 students
    - Total size is 360,000 lines of code
- Issue of analysis
  - Similarity among all programs
    - Plagiarism sometimes happens in the programming exercise.

25 Analysis (1/2)
- The compilers of 69 students are arranged on the two axes.
  - The distribution is spread widely.
- Rearrangement of the scatter plot using the sorting function
  - The grid represents the boundary lines between individual students.

26 Analysis (2/2)
Figure: scatter plot with regions A and B marked, shown with the corresponding code
- A (2 students): The similar code fragments came from the source code of the sample compiler described in the textbook.
- B (4 students): Many code fragments were similar even in variable names and comments.

27 Sorting function
- RSA(i): Ratio of the code range in file i covered by clones between file i and the other files
- RST(i, j): Ratio of the code range in file i covered by clones between file i and file j
- Step 1: Select a head file by the value of RSA (call the head file F).
- Step 2: From among the remaining files, select the file most similar to F by the value of RST and put it next to F.
- Step 3: Repeat step 2 while any file remains, treating the most similar file from the previous step 2 as the new F.
Figure: files f1 to f6 placed one at a time (f1; f1 f6; f1 f6 f3; ...)
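The three steps amount to a greedy nearest-neighbour ordering, sketched below. The slide does not say how the head file is chosen from RSA, so picking the file with the highest RSA is an assumption of this example, as are the class `CloneSortSketch`, the method `order`, and the array-based inputs (`rsa[i]` for RSA(i), `rst[i][j]` for RST(i, j)).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the file sorting steps; not taken from Gemini.
class CloneSortSketch {
    /** Returns file indices in display order, given precomputed RSA and RST values. */
    static List<Integer> order(double[] rsa, double[][] rst) {
        int n = rsa.length;
        List<Integer> result = new ArrayList<>();
        if (n == 0) return result;

        // Step 1: pick the head file F (assumption: the file with the highest RSA).
        int current = 0;
        for (int i = 1; i < n; i++) {
            if (rsa[i] > rsa[current]) current = i;
        }
        boolean[] placed = new boolean[n];
        result.add(current);
        placed[current] = true;

        // Steps 2 and 3: repeatedly append the unplaced file most similar to F,
        // then treat that file as the new F.
        while (result.size() < n) {
            int next = -1;
            for (int j = 0; j < n; j++) {
                if (placed[j]) continue;
                if (next < 0 || rst[current][j] > rst[current][next]) next = j;
            }
            result.add(next);
            placed[next] = true;
            current = next;
        }
        return result;
    }
}
```

With `rsa = {0.2, 0.9, 0.5}` and `rst = {{0, 0.1, 0.6}, {0.1, 0, 0.3}, {0.6, 0.3, 0}}`, file 1 becomes the head, file 2 follows because RST(1, 2) exceeds RST(1, 0), and file 0 is placed last. Greedy ordering keeps each file next to a similar neighbour but does not guarantee a globally best arrangement.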