1 Detection of Plagiarism in University Projects Using Metrics-Based Similarity
Dagstuhl Seminar on Duplication, Redundancy, and Similarity in Software, July 2006
Ettore Merlo, Ecole Polytechnique de Montréal

2 Context
Detect plagiarism in first-year programming projects at university
  – Programming skills have to be developed during courses
Ettore Merlo (Ecole Polytechnique de Montréal), (C) Copyright, July 2006

3 Plagiarism Detection
Comparison of sets of syntactic blocks
Spectral analysis of similarity
  – Increasing thresholds
  – Spectral shape parameters are computed
Projects are ranked by similarity spectrum
The most similar projects are considered candidates for plagiarism

4 Plagiarism Problem
Detect code transformations that require little programming effort yet make the source code look different
  – Identifiers changed by editing operations
  – Source code layout changed (comments, indentation, order of procedures, functions, and methods, file structure)
  – Constants changed (initialization, loops)

5 Metrics-Based Similarity
Definition
  – Two code fragments are similar if their associated vectors of metrics satisfy some similarity criterion

6 Similarity Identification Process
Source code → Parsing and Analysis → Abstract Syntax Tree → Metrics Extraction → Metrics → Clones Extraction → Clones
Each fragment $F_j$ is associated with a vector of $k$ metrics:
$F_1: m_{11}, m_{12}, \ldots, m_{1k}$
$\ldots$
$F_j: m_{j1}, m_{j2}, \ldots, m_{jk}$

7 Metrics Extraction
Metrics for similarity detection
  – Volume
  – Complexity
  – Module/function interface
  – Call graph structure
  – Local memory
  – Global memory
  – Dataflow

8 Metrics Matching
$\mathrm{similar}(f_I, f_J) = \left( |m_k(f_I) - m_k(f_J)| \le th_k \right)$
  – for all $k$ within the size of the metrics vector
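
The matching criterion on this slide can be sketched in Python as follows. This is a minimal sketch: the function name `similar`, the tuple representation of metric vectors, and the sample values are assumptions made for illustration. Only the per-metric threshold test comes from the slide.

```python
from typing import Sequence


def similar(m_i: Sequence[float], m_j: Sequence[float], th: Sequence[float]) -> bool:
    """Return True if every metric differs by at most its own threshold."""
    if not (len(m_i) == len(m_j) == len(th)):
        raise ValueError("metric vectors and thresholds must have the same length")
    return all(abs(a - b) <= t for a, b, t in zip(m_i, m_j, th))


# Hypothetical metric vectors for three fragments (values are made up).
f1 = (2, 3, 10)
f2 = (2, 4, 12)
f3 = (5, 3, 10)
thresholds = (1, 1, 3)

print(similar(f1, f2, thresholds))  # True: the differences are 0, 1 and 2
print(similar(f1, f3, thresholds))  # False: the first metric differs by 3
```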

9 Metrics Matching Complexity
n = |fragments_set|
Exact solution algorithms show a worst-case $O(n^2)$ complexity in general
Linear-complexity exact solutions exist for specific sub-problems
Opportunistic strategies and heuristics may reduce the average-case complexity
Approximate solutions may reduce the worst-case complexity

10 Threshold-Based Quantization

11 Threshold-Based Quantization (2)
Clusters represent the following hyper-parallelepiped: [formula not preserved in the transcript]
Clusters represent a partition of all fragments
Complexity is $O(M \cdot n)$ where:
  – $M$ is the number of metrics
  – $n$ is the total number of fragments
  – often $M \ll n$
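
A minimal sketch of how such a quantization could be implemented in Python. The quantization formula itself did not survive in the transcript, so floor division of each metric by its threshold is an assumption here, as are the names `quantize`, `fragments`, and `thresholds` and the sample values. Each fragment is visited once and each of its M metrics is divided once, which matches the O(M·n) cost stated on the slide.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def quantize(
    fragments: Dict[str, Sequence[int]], th: Sequence[int]
) -> Dict[Tuple[int, ...], List[str]]:
    """Group fragments by the cell of the threshold grid they fall into.

    Assumption: the cell index of a metric value m is m // th (thresholds > 0).
    """
    clusters: Dict[Tuple[int, ...], List[str]] = defaultdict(list)
    for name, metrics in fragments.items():
        key = tuple(m // t for m, t in zip(metrics, th))
        clusters[key].append(name)
    return dict(clusters)


# Hypothetical fragments, each described by three metric values.
fragments = {"a": (2, 3, 10), "b": (2, 3, 11), "c": (5, 3, 10)}
thresholds = (1, 1, 3)

clusters = quantize(fragments, thresholds)
print(clusters)  # {(2, 3, 3): ['a', 'b'], (5, 3, 3): ['c']}
```

Because every fragment lands in exactly one cell, the clusters form a partition of the fragments.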

12 Quantization Error
Fragments in neighboring clusters may be closer than $th_i / 2$ and still be in different clusters
Errors for threshold level $th_i$ disappear for threshold levels $k \cdot th_i$, $k > 1$
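
A small worked example of the quantization error, as a sketch. It assumes that a metric value is assigned to a cluster by floor division by the threshold, which the transcript does not confirm. The helper `cluster_index` and the sample values are hypothetical.

```python
def cluster_index(value: int, th: int) -> int:
    """Assumed quantization rule: floor division by the threshold."""
    return value // th


a, b, th = 2, 3, 3

# The two values are closer than th / 2 ...
print(abs(a - b) < th / 2)  # True
# ... yet they straddle a cluster boundary at threshold level th.
print(cluster_index(a, th) == cluster_index(b, th))  # False
# In this example the error is gone at the next level, 2 * th.
print(cluster_index(a, 2 * th) == cluster_index(b, 2 * th))  # True
```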

13 Project Comparison
Compute structural similarity spectrum
  – Compute similarity for increasing threshold levels in $s$ steps
      Quantize projects for the current threshold level
      Traverse current clusters to check for commonality with the compared project
      Count common structurally-similar fragments under the current threshold level
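
The steps on this slide can be sketched in Python as follows. Several details are assumptions, because the transcript does not specify them: quantization is done by floor division, the threshold level at step k is k times the base threshold, and "common" fragments are counted as the fragments of the first project that share a cluster with at least one fragment of the second. All names and sample values are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Project = Dict[str, Sequence[int]]  # fragment name -> metric vector


def quantize(project: Project, th: Sequence[int]) -> Dict[Tuple[int, ...], List[str]]:
    """Group fragments by grid cell (assumed rule: metric // threshold)."""
    clusters: Dict[Tuple[int, ...], List[str]] = defaultdict(list)
    for name, metrics in project.items():
        clusters[tuple(m // t for m, t in zip(metrics, th))].append(name)
    return clusters


def similarity_spectrum(p1: Project, p2: Project, base_th: Sequence[int], steps: int) -> List[int]:
    """Count common fragments at each of `steps` increasing threshold levels."""
    spectrum = []
    for k in range(1, steps + 1):
        th = [k * t for t in base_th]  # assumed schedule: k * base threshold
        c1, c2 = quantize(p1, th), quantize(p2, th)
        # Fragments of p1 whose cluster is also occupied by a fragment of p2.
        spectrum.append(sum(len(names) for key, names in c1.items() if key in c2))
    return spectrum


p1 = {"f": (2, 3, 10), "g": (0, 1, 4)}
p2 = {"h": (2, 3, 11), "i": (1, 1, 5)}
base_th = (1, 1, 3)

print(similarity_spectrum(p1, p2, base_th, steps=3))  # [1, 2, 2]
```

Each step quantizes both projects once and then does one hash lookup per cluster, so no fragment-to-fragment pair comparison is needed.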

14 Project Comparison (2)
Complexity: $O(s \cdot M \cdot (n_1 + n_2))$
  – $n_1$, $n_2$: sizes of the projects
  – $M$: number of metrics
  – $s$: threshold steps
Rationale:
  – Plagiarism is hard to hide deeply if little programming effort is invested
  – Surface differences are quickly ignored by thresholds of increasing levels

15 Project Comparison (3)
Typical spectrum

16 Parameters
Granularity: functions and methods
Steps: 5
Metrics and thresholds:
  – CALLS: 1
  – LOCALS: 1
  – NONLCALS: 1
  – PARNUM: 1
  – STMNT: 3
  – NBRANCHES: 1
  – NLOOPS: 1
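
For illustration, these parameters could be written down as a small Python configuration. The metric names and base thresholds are copied from the slide. The schedule that derives the five threshold levels (integer multiples of the base thresholds) is an assumption, and `threshold_levels` is a hypothetical helper.

```python
from typing import Dict, List

# Base thresholds as listed on the slide (metric names kept verbatim).
BASE_THRESHOLDS: Dict[str, int] = {
    "CALLS": 1,
    "LOCALS": 1,
    "NONLCALS": 1,
    "PARNUM": 1,
    "STMNT": 3,
    "NBRANCHES": 1,
    "NLOOPS": 1,
}
STEPS = 5


def threshold_levels(base: Dict[str, int], steps: int) -> List[Dict[str, int]]:
    """Assumed schedule: level k uses k times each base threshold."""
    return [{name: k * th for name, th in base.items()} for k in range(1, steps + 1)]


levels = threshold_levels(BASE_THRESHOLDS, STEPS)
print(levels[0]["STMNT"], levels[-1]["STMNT"])  # 3 15
```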

17 Plagiarism Problem
Projects are composed of a variable number of fragments
  – Problem similar to class comparison or to software evolution analysis
Identify projects with high spectral similarity
  – $p$ = number of projects
  – Galaxy approach: $O(p)$
  – Pair comparison: $O(p^2)$

18 Galaxy Algorithm: [content not preserved in the transcript]

19 Procedural Projects

20 OO Projects

21 Clone Visualization
Visual display of differences between source code fragments
DP-matching algorithm on tokens

22 Matching Algorithms
Compute the sets of lexical changes
  – Dynamic programming
  – Sub-optimal and heuristic algorithms

23 Matching Example
    int restore_stack ( object info ) {
    int restore_list ( int index, object info ) {
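
A minimal sketch of a dynamic-programming token matcher applied to the two signatures of the matching example. It uses a standard longest-common-subsequence table, which is one possible DP formulation and not necessarily the one used in the presented tool. The function `token_diff` is hypothetical, and tokens are obtained by splitting on whitespace (with the comma written as a separate token) to keep the example short.

```python
from typing import List, Tuple


def token_diff(a: List[str], b: List[str]) -> List[Tuple[str, str]]:
    """Align two token lists; '=' is a match, '-' is only in a, '+' is only in b."""
    n, m = len(a), len(b)
    # lcs[i][j] = length of the longest common subsequence of a[i:] and b[j:]
    lcs = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                lcs[i][j] = lcs[i + 1][j + 1] + 1
            else:
                lcs[i][j] = max(lcs[i + 1][j], lcs[i][j + 1])
    # Walk the table from the start to recover one optimal alignment.
    ops: List[Tuple[str, str]] = []
    i = j = 0
    while i < n and j < m:
        if a[i] == b[j]:
            ops.append(("=", a[i]))
            i, j = i + 1, j + 1
        elif lcs[i + 1][j] >= lcs[i][j + 1]:
            ops.append(("-", a[i]))
            i += 1
        else:
            ops.append(("+", b[j]))
            j += 1
    ops.extend(("-", tok) for tok in a[i:])
    ops.extend(("+", tok) for tok in b[j:])
    return ops


left = "int restore_stack ( object info ) {".split()
right = "int restore_list ( int index , object info ) {".split()

changes = [op for op in token_diff(left, right) if op[0] != "="]
print(changes)
# [('-', 'restore_stack'), ('+', 'restore_list'), ('+', 'int'), ('+', 'index'), ('+', ',')]
```

The lexical changes are the renamed function and the added `int index` parameter; the remaining six tokens match.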

24 Remarks
Similarity contrast is very good for procedural code
Distribution of similarity for OO code is less sharp
  – Reference classes were given as part of the projects
  – Methods tend to be smaller
  – More methods tend to be similar
  – Class structure could be taken into consideration
  – Inter-class relationships could be taken into account

25 Administrative Approach
Identify the most similar projects
Do not make any hypothesis about the causes of similarity
Shift the burden of explanation onto the authors of a project

26 Conclusions
A metrics-based plagiarism detection approach for an academic environment has been presented
The presented approach has been successfully used to discourage plagiarism in course projects

32 Further Contacts
Ettore Merlo
Ecole Polytechnique de Montréal
tel: +1 (514) 340 4711 ext. 5758
fax: +1 (514) 340 3240
ettore.merlo@polymtl.ca

