
1 13/07/2015 Dr Andy Brooks
Lectures 9 & 10
CCFinder: A Tool to Detect Clones
“I can just copy these lines. That is the safest thing to do. The code has been tested after all.”
“What a mess. This code has been copied, then changed a bit, all over the code base.”
MSc Software Maintenance

2 Case Study
Reference
CCFinder: A Multi-Linguistic Token-based Code Clone Detection System For Large Scale Source Code, Toshihiro Kamiya, Shinji Kusumoto, and Katsuro Inoue, Software Engineering Laboratory, Department of Computer Science, Graduate School of Information Science and Technology, Osaka University.
http://sel.ist.osaka-u.ac.jp/~kamiya/
http://sel.ics.es.osaka-u.ac.jp/cdtools/index.html.en
einrækt (“clone”)

3 Reasons For Clones
Copy-and-paste
– one of the easiest ways to re-use code
– one of the safest ways to re-use code in legacy applications, as the original code base is unaltered
Mental macro
– frequently coded computations are remembered and coded the same way
Repeated code portions for performance
– inlined code is faster than called code
Systematic code generation from a single base
– several variations of code needed

4 The Problem With Clones
It is difficult to consistently modify source files with many clones.
When a fault is found, the engineer has to identify all occurrences in every subsystem.
In large and complex systems, there can be dozens of engineers, each working on only one subsystem.
Documenting the existence of clones as they are introduced does not happen in practice.

5 Motivation For CCFinder
Government software system
1 million lines of code
2 thousand modules
Written in COBOL and a PL/I-like language
Developed over 20 years ago
Continually maintained by a large number of engineers
Suspected that clones heavily reduce the maintainability of the system

6 Underlying Concepts: CCFinder
Industrial strength
– deals with million-line systems without excessive demands on time or memory
– token-by-token matching is more expensive than line-by-line matching
– several optimization techniques employed
Report only interesting clones
– apply heuristic knowledge to remove unwanted clones
Copy-and-paste detection
– deal with variable renaming and other small changes
Limited language dependence
– easy to adapt the tool to specific languages
– adaptation for Java took two person-days

7 Definitions and Terms
A clone relation holds between two code portions if and only if they are the same sequence.
A pair of code portions is called a clone pair if the clone relation holds between the portions.
A clone class is a maximal set of code portions in which a clone relation holds between any pair of code portions.
In CCFinder, clone relations are determined for transformed token sequences.

8 a x y z b x y z c x y d (12 tokens)
Brackets mark the code portions that belong to each class.
Clone class C1
– a [x y z] b [x y z] c x y d
Clone class C2
– a [x y] z b [x y] z c [x y] d
– Note how the 3rd x y is not in C1.
Clone class C3
– a x [y z] b x [y z] c x y d
– Both portions lie inside the portions of C1, so this class is not of interest because it is not maximal.
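
The 12-token example can be checked with a small brute-force sketch. This is not CCFinder's algorithm (the slides state that CCFinder uses a suffix tree); it simply enumerates every repeated token run and then discards classes whose portions could all be extended by the same neighbouring token. The function names `clone_classes` and `is_maximal` are hypothetical, and overlapping occurrences are not handled.

```python
from collections import defaultdict


def clone_classes(tokens, min_len=2):
    """Map each repeated token run (as a tuple) to the list of its start indices."""
    n = len(tokens)
    classes = {}
    for length in range(min_len, n):
        positions = defaultdict(list)
        for start in range(n - length + 1):
            positions[tuple(tokens[start:start + length])].append(start)
        for gram, starts in positions.items():
            if len(starts) >= 2:  # a run is a clone only if it occurs at least twice
                classes[gram] = starts
    return classes


def is_maximal(tokens, gram, starts):
    """A class is non-maximal if every portion has the same left or right neighbour."""
    n = len(tokens)
    lefts = {tokens[s - 1] if s > 0 else None for s in starts}
    rights = {tokens[s + len(gram)] if s + len(gram) < n else None for s in starts}
    extends_left = len(lefts) == 1 and None not in lefts
    extends_right = len(rights) == 1 and None not in rights
    return not (extends_left or extends_right)


tokens = "a x y z b x y z c x y d".split()
classes = clone_classes(tokens)
for gram, starts in classes.items():
    label = "maximal" if is_maximal(tokens, gram, starts) else "not maximal"
    print(" ".join(gram), starts, label)
```

With zero-based indices, the run x y z appears at 1 and 5, x y at 1, 5, and 9, and y z at 2 and 6; only y z is flagged as not maximal, because both of its portions are preceded by x.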

9 Identification Of Structures
A code portion that begins in the middle of a function definition and ends some way through another function definition can be very difficult to rewrite as shared code.
– CCFinder separates each function definition.
A code portion that is part of table initialization code can be very difficult to rewrite as shared code.
– CCFinder identifies table definition code.

10 Clone Detection Process
[Figure: the process runs in four steps: 1. Lexical Analysis, 2. Transformation, 3. Match Detection, 4. Formatting]

11 1. Lexical Analysis
Source files are divided into tokens according to the rules of the language.
The tokens from all source files are concatenated into a single sequence of tokens.
White space (spaces, newlines, and tabs) and comments between tokens are removed.
– They are sent to ‘Formatting’ to enable reconstruction of the original source files.
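
A minimal sketch of this step, assuming a C-like language and a deliberately simplified token grammar (single-character punctuators, no preprocessor handling). The names `TOKEN_SPEC` and `tokenize` are hypothetical and are not taken from CCFinder. Instead of forwarding the removed white space to a formatter, the sketch records the file and line of every token, which serves the same purpose of mapping results back to the source.

```python
import re

# Simplified, illustrative token grammar for a C-like language.
TOKEN_SPEC = [
    ("comment", r"//[^\n]*|/\*.*?\*/"),
    ("space", r"\s+"),
    ("string", r'"(?:\\.|[^"\\])*"' + r"|'(?:\\.|[^'\\])*'"),
    ("word", r"[A-Za-z_]\w*"),
    ("number", r"\d+"),
    ("punct", r"."),
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC), re.DOTALL)


def tokenize(files):
    """files maps a file name to its text; returns one token list plus token origins."""
    tokens, origins = [], []
    for name, text in files.items():
        line = 1
        for match in TOKEN_RE.finditer(text):
            if match.lastgroup not in ("comment", "space"):
                tokens.append(match.group())
                origins.append((name, line))  # where this token came from
            line += match.group().count("\n")  # comments and white space still advance lines
    return tokens, origins


files = {"a.c": "int x = 1; // set x\n/* block\ncomment */ x++;\n", "b.c": "y = x;"}
tokens, origins = tokenize(files)
print(tokens)
print(origins)
```

All files end up in a single token sequence, and `origins[i]` gives the file name and line number of `tokens[i]`.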

12 2.1 Transformation By Rules

13 2.1 Transformation By Rules

14 2.2 Parameter Replacement
After 2.1, identifiers for types, variables, and constants are replaced with a special token.
3. Match Detection
All clone pairs are detected
– (LeftBegin, LeftEnd, RightBegin, RightEnd) with respect to the token sequence
4. Formatting
Locations of clone pairs are converted into line numbers in the original source files.
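
Parameter replacement and formatting can be sketched as follows. The placeholder `$p`, the keyword subset, and the function names are assumptions made for illustration; the slides only say that "a special token" is used. Literals are replaced as well, on the assumption that constants are treated like identifiers.

```python
# Illustrative subset of keywords that must survive replacement (an assumption).
KEYWORDS = {"if", "else", "for", "while", "return", "switch", "case", "break"}


def replace_parameters(tokens, placeholder="$p"):
    """Replace identifiers and literals with one placeholder token."""
    out = []
    for tok in tokens:
        is_identifier = (tok[0].isalpha() or tok[0] == "_") and tok not in KEYWORDS
        is_literal = tok[0].isdigit() or tok[0] in "\"'"
        out.append(placeholder if is_identifier or is_literal else tok)
    return out


def to_line_ranges(pair, token_lines):
    """Convert (LeftBegin, LeftEnd, RightBegin, RightEnd) token indices to line ranges."""
    left_begin, left_end, right_begin, right_end = pair
    return ((token_lines[left_begin], token_lines[left_end]),
            (token_lines[right_begin], token_lines[right_end]))


first = "total = total + price ;".split()
second = "sum = sum + cost ;".split()
print(replace_parameters(first) == replace_parameters(second))  # True: renaming no longer matters

token_lines = [1, 1, 1, 2, 2, 2]  # line number of each token in a two-line file
print(to_line_ranges((0, 2, 3, 5), token_lines))  # ((1, 1), (2, 2))
```

After replacement, two fragments that differ only in names or constants yield the same token sequence, so a plain equality match finds them.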

15 Sample Code

16 Sample Code Transformed 2.1

17 Sample Code Transformed 2.2
Clone pairs
– Lines 1:7 and 11:17
– Lines 8:10 and 19:21

18 Matrix Visualization
[Figure: clone-pair matrix with token and line axes; axis marks at lines 11, 17, 19, and 21]

19 Metrics For Clone Pairs/Classes
LEN(p), LEN(C)
– Length can be measured in tokens, SLOC, and LOC (LOC excludes null or comment lines).
– The token length of each portion of a clone class is identical when measured on the transformed token sequence.
– LOC is used in the following metric definitions.
POP(C)
– The number of elements in a clone class C.
– A large POP means similar code portions appear in many places.

20 Metrics For Clone Pairs/Classes
DFL(C)
– Deflation is an estimate of how much code is removed when a clone class is rewritten as shared code.
– Suppose USELEN(C) is the length of the caller statement.
– DFL(C) = LEN(C) × POP(C) - (USELEN(C) × POP(C) + LEN(C))
COVERAGE (%LOC)
– percentage of lines that include any portion of a clone
COVERAGE (%FILE)
– percentage of files that include any clones
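
The deflation formula reads as "code before" minus "code after": LEN(C) × POP(C) lines of cloned code are replaced by POP(C) call sites plus one shared copy. A short worked example, with a hypothetical function name `dfl` and made-up input values:

```python
def dfl(length, pop, uselen=1):
    """DFL(C) = LEN(C) * POP(C) - (USELEN(C) * POP(C) + LEN(C))."""
    before = length * pop          # every portion written out in full
    after = uselen * pop + length  # one call per portion plus the shared routine
    return before - after


# A 20-line clone that appears in 4 places, each replaced by a 1-line call:
print(dfl(20, 4))  # 80 - (4 + 20) = 56 lines removed
```

A class with only two short portions gains little: `dfl(5, 2)` is 10 - (2 + 5) = 3.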

21 Metrics For Clone Pairs/Classes
RAD(C)

22 Metrics For Clone Pairs/Classes
RAD(C) is the maximum length of the path from each file (containing a code portion belonging to C) to the lowest common ancestor in the directory tree.
If all code portions of C are included in one file then RAD(C) = 0.
A large RAD implies code portions spread throughout different subsystems.
– This makes maintenance difficult if each subsystem is maintained by different engineers.
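
One reading of this definition, sketched below, counts edges in the directory tree from each file up to the deepest node shared by all files in the class. The function name `rad` and the example paths are hypothetical, and the exact counting convention used by CCFinder is an assumption.

```python
from pathlib import PurePosixPath


def rad(paths):
    """Maximum number of tree edges from any file up to the lowest common ancestor."""
    parts = [PurePosixPath(p).parts for p in set(paths)]
    common = 0
    for level in zip(*parts):  # walk the paths component by component
        if len(set(level)) != 1:
            break
        common += 1
    return max(len(p) - common for p in parts)


print(rad(["java/awt/A.java", "java/awt/A.java"]))          # 0: a single file
print(rad(["java/awt/A.java", "java/awt/B.java"]))          # 1: same directory
print(rad(["java/awt/A.java", "javax/swing/plaf/B.java"]))  # 4: only the root is shared
```

When every portion is in the same file, the whole path is shared and the result is 0, which matches the special case stated above.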

23 CCFinder Time and Space Complexities
CCFinder uses a suffix-tree algorithm with a time and space complexity of O(n).
Complexity measurements were made on a PC (Pentium 4, 1.5 GHz, 640 MB RAM) given various-sized subsets of the Linux 2.4.9 source files (2600K lines).

24 Leading Token Restriction Optimization
Identifying as clones code portions that begin and end in the middle of statements is not that useful.
Leading tokens at the beginning of clones are restricted to labels or keywords that either initiate or terminate statements.
Leading token restriction reduces the number of nodes in the suffix tree to one third in the C, C++, and Java case studies.
– A very important restriction for making the tool scalable.
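
The effect of the restriction can be illustrated with a simplified rule: a clone may start only at the first token, at a statement keyword, or right after a token that closes a statement or block. The two token sets and the function name `allowed_starts` are assumptions for illustration, not CCFinder's actual rule tables.

```python
# Hypothetical token sets for a C-like language.
STATEMENT_KEYWORDS = {"if", "for", "while", "switch", "case", "return", "break"}
STATEMENT_ENDERS = {";", "{", "}"}


def allowed_starts(tokens):
    """Indices at which a clone is allowed to begin under the simplified rule."""
    return [i for i, tok in enumerate(tokens)
            if i == 0 or tok in STATEMENT_KEYWORDS or tokens[i - 1] in STATEMENT_ENDERS]


tokens = "x = a + b ; if ( x ) { y = x ; }".split()
starts = allowed_starts(tokens)
print(starts)                        # [0, 6, 11, 15]
print(len(starts), "of", len(tokens), "suffixes would be inserted")
```

Only suffixes that begin at an allowed index need to be inserted into the suffix tree, which is where the reduction in node count comes from.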

25 Repeated Code Removal Optimization
The clone class {a2, a3, a4, a5, a6, b1-b3} will be detected.
C(6, 2) = 15 clone pairs
    a1  switch (c) {
    a2  case '0': value = 0; break;
    a3  case '1': value = 1; break;
    a4  case '2': value = 2; break;
    a5  case '3': value = 3; break;
    a6  case '4': value = 4; break;
    a7  }
    b1  case 'a':
    b2      flag = 2;
    b3      break;

26 Repeated Code Removal Optimization
To reduce the number of clone pairs, when building a suffix tree, after the first identification (repetition of a2 at a3), succeeding repetitions are not inserted.
Clone pair (a2, b1-b3) is still reported.
Repeated code removal is also said to stop reporting of self clones, e.g. (a2-a5, a3-a6).
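
A toy model of the idea, assuming the code has already been reduced to one transformed string per statement group and using `$p` as a hypothetical placeholder token. It keeps the first repetition of a unit and skips the ones that follow immediately; CCFinder applies the rule while building the suffix tree, which this sketch does not attempt.

```python
from math import comb


def drop_succeeding_repetitions(units):
    """Keep a unit and its first repetition; skip further consecutive copies."""
    kept = []
    for unit in units:
        if len(kept) >= 2 and kept[-1] == unit and kept[-2] == unit:
            continue  # third or later consecutive copy: not inserted
        kept.append(unit)
    return kept


case_stmt = "case $p : $p = $p ; break ;"
units = ["switch ( $p ) {"] + [case_stmt] * 5 + ["}"] + [case_stmt]  # a1..a7, then b1-b3
kept = drop_succeeding_repetitions(units)

print(comb(units.count(case_stmt), 2))  # 15 pairs among 6 matching portions
print(comb(kept.count(case_stmt), 2))   # 3 pairs among the 3 portions left in this model
```

In this model a2, a3, and b1-b3 remain, so a pair linking the switch body to the separate case block is still found while the pair count drops sharply.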

27 Token Concatenation Optimization
Abutting tokens that are not punctuator keywords are joined together.
The token sequence is made shorter in exchange for greater variation in what a token stands for.

28 Clones in the JDK 1.3.0 (>= 30 tokens)
[Figure: scatter plot of clone pairs; labeled axis regions: java/awt/*.java, javax/swing/*.java, org/omg/CORBA/*.java]

29 Clones in the JDK 1.3.0 (>= 30 tokens)
570k lines in 1877 files.
CCFinder took 3 minutes on a PC.
Files in the same directory are next to one another on the diagram axes.
Most line segments look like dots because of the scale of the graph.
Most cloning is near the main diagonal, which means most clones occur within a file or between neighbouring directories.

30 Similar source files in the JDK 1.3.0
These section D files are identical apart from lines 32, 161, 163.

31 Longest clone in the JDK 1.3.0
1647 tokens, 627 lines
WindowsFileChooserUI.java and MetalFileChooserUI.java each have nine internal classes, one constructor, and 45 methods.
All but three methods are clones.

32 Effects Of Rules And Preprocessing Techniques
Disabling various rules and techniques has dramatic effects on the number of clone pairs and classes detected.

33 Population And Length Of Clone Classes
[Figure: JDK 1.3.0; axes: LEN (tokens) and POP]

34 Clone Classes Of Top 5% DFL
Source file investigation reveals various kinds of cloning:
– sequence of several methods
– single method body
– source files generated by tool
– routines within a method
– entire class body
Evidence points to different kinds of copy-and-paste style reuse in the JDK.

35 POP And RAD In JDK 1.3.0
Over 20 transformed tokens.
[Figure: POP against RAD; labeled groups: swing, exception classes]

36 Conclusions
Tools to detect clones are themselves complex pieces of software.
Clone detection in CCFinder is sensitive to the rules, techniques, and clone threshold size employed.
CCFinder has been successfully used to detect clones in the JDK 1.3.0.
As software systems get even bigger, clone detection will play an increasingly important part in code reengineering.

