
1 CodeSimian CS491B – Andrew Weng

2 Motivation
Academic integrity is a universal issue.
Plagiarism is still common today:
  Kaavya Viswanathan (Harvard student): book contains many plagiarized passages
  Yoshihiko Wada (painter, Japan): artwork plagiarized from Alberto Sughi
  Scott D. Miller (Wesley College president): plagiarized material found on his website

3 Is Plagiarism Harmful?
Who does plagiarism really hurt?
  The student
  The class
  The university
Plagiarism is not only a matter of protecting intellectual property rights.

4 Plagiarism Detection
Benefits of utilizing plagiarism detection:
  Prevention
  Enforcement
  Objective standpoint

5 Platform Overview
Developed on Visual Studio.NET 2005
Coded in Microsoft Visual C#.NET
Windows Forms application
Simple and familiar GUI (Windows)
Intended focus is ease of use

6 Theoretical Overview
CodeSimian is based on two primary principles:
  Kolmogorov complexity
  Information distance

7 Kolmogorov Complexity
Simple definition: the length of the shortest program that, when run on a universal Turing machine, produces a specified output.
Purely theoretical
Impossible to calculate exactly

8 Kolmogorov Complexity
Define x to be a desired output string.
K(x) = the length of the shortest program that produces x
K(x|y) = the length of the shortest program that produces x given y as an input
K(xy) = the length of the shortest program that produces x concatenated with y

9 Kolmogorov Complexity
Compare two infinitely long numbers, π and a randomly generated number n between 0 and 1:
π = 3.1415926535897932384626433832795…
n = 0.5234958723957329875320935293853…
K(π) is a small, finite number: the length of the code required to generate the value of π to an infinite number of digits.

10 Kolmogorov Complexity
π = 3.1415926535897932384626433832795…
K(π) is a small, finite number: the length of the code required to generate the value of π to an infinite number of digits.
Perhaps something as simple as an implementation of Leibniz’s formula: π/4 = 1 − 1/3 + 1/5 − 1/7 + …
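
To make the point concrete, here is a minimal Python sketch of Leibniz's series, π/4 = 1 − 1/3 + 1/5 − 1/7 + …. The function name `leibniz_pi` and the choice of term count are illustrative assumptions, not part of CodeSimian. The point is only that a few lines of code describe as many digits of π as one cares to compute.

```python
def leibniz_pi(terms: int) -> float:
    """Approximate pi by summing the first `terms` terms of Leibniz's series."""
    total = 0.0
    for k in range(terms):
        # k-th term of pi/4 = 1 - 1/3 + 1/5 - 1/7 + ...
        total += (-1) ** k / (2 * k + 1)
    return 4 * total


print(leibniz_pi(100_000))  # an approximation of pi; the error shrinks roughly like 1/terms
```

The series converges slowly, and this version is limited to double precision, so it illustrates program length rather than a practical way to compute π. An arbitrary-precision version would be longer but still short and finite.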

11 Kolmogorov Complexity
n = 0.5234958723957329875320935293853…
To generate the full output of a truly random number n, the program would have to be infinitely long.
The code would essentially be System.out.println("0.52349587…");
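
A general-purpose compressor gives a rough, computable feel for this contrast. The sketch below uses Python's standard `zlib` module as a stand-in, which is an assumption made for illustration: compressed size is only an upper-bound proxy for Kolmogorov complexity. A highly patterned byte string shrinks to a tiny fraction of its size, while bytes from `os.urandom` barely shrink at all.

```python
import os
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of `data` after zlib (DEFLATE) compression at the highest level."""
    return len(zlib.compress(data, 9))


patterned = b"0123456789" * 1000    # 10,000 bytes with an obvious short description
random_bytes = os.urandom(10_000)   # 10,000 bytes with no pattern to exploit

print(compressed_size(patterned))     # a small number
print(compressed_size(random_bytes))  # close to the original 10,000 bytes
```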

12 Kolmogorov Complexity
So how does this apply to plagiarism detection?
Define x = π and y = π/4.
K(x|y) would be a very small value: given y, one can calculate π with a simple multiplication by 4.

13 Information Distance
The distance (or difference) between two objects
Formula used: [formula not preserved in this transcript]

14 Information Distance
Similarity factor: if we take the amount of information in x that is also contained in y, and normalize it by the amount of information in both x and y together, we obtain a percentage of similarity.
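
One expression that matches this verbal description, and uses only the quantities defined on slides 8 and 20, is sketched below. It is an assumption reconstructed from the wording, since the slide's own formula is not preserved here.

```latex
R(x, y) = \frac{K(x) - K(x \mid y)}{K(xy)}, \qquad d(x, y) = 1 - R(x, y)
```

Under this reading, K(x) − K(x|y) is the information in x that y already accounts for, and K(xy) normalizes by the information in both together. R near 1 (100%) would indicate heavy sharing, and R near 0 would indicate little.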

15 Implementation
What does CodeSimian do to obtain the similarity factors?
1. Parse and tokenize the code
2. Compress the tokenized strings
3. Compare the compressed strings

16 Parsing the Code
ANTLR was used to parse and tokenize the code.
ANTLR, ANother Tool for Language Recognition (formerly PCCTS), is a language tool that provides a framework for constructing recognizers, compilers, and translators from grammatical descriptions containing Java, C#, C++, or Python actions. (www.antlr.org)

17 Tokenizing the Code
The tokenized output is a string of characters, each of which represents one token in the code.
For example, { int c = 0; } contains 7 “letters”:
  {    open bracket
  int  integer type declaration
  c    variable name
  =    assignment operator
  0    integer value
  ;    statement end
  }    close bracket
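
The idea can be sketched with a toy tokenizer. CodeSimian used an ANTLR-generated parser; the regex-based stand-in below, the `tokenize` function, and the one-character codes (`T`, `V`, `N`, and so on) are all hypothetical choices made for illustration. It covers only the handful of token kinds in the slide's example.

```python
import re

# Hypothetical token classes and one-character codes (not CodeSimian's actual alphabet).
# Order matters: type keywords must be tried before general identifiers.
TOKEN_SPEC = [
    ("T", r"\b(?:int|long|double|float|boolean|char|String|void)\b"),  # type declaration
    ("N", r"\d+(?:\.\d+)?"),                                           # numeric value
    ("V", r"[A-Za-z_]\w*"),                                            # variable name
    ("=", r"="),                                                       # assignment operator
    (";", r";"),                                                       # statement end
    ("{", r"\{"),                                                      # open bracket
    ("}", r"\}"),                                                      # close bracket
]
TOKEN_RE = re.compile("|".join(f"({pattern})" for _, pattern in TOKEN_SPEC))


def tokenize(source: str) -> str:
    """Map source text to a string with one character per recognized token."""
    letters = []
    for match in TOKEN_RE.finditer(source):
        # lastindex is the 1-based number of the alternative that matched.
        letters.append(TOKEN_SPEC[match.lastindex - 1][0])
    return "".join(letters)


print(tokenize("{ int c = 0; }"))         # {TV=N;}
print(tokenize("{ int counter = 42; }"))  # {TV=N;}
```

Because names and literal values collapse to a single letter each, renaming a variable or changing a constant leaves the tokenized string unchanged, which is what makes the later comparison robust to such edits.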

18 Compressing the String
This string is then compressed using a Lempel-Ziv compression algorithm with unbounded buffers.
As the string is read, a library is built up as the algorithm progresses.
When a repeat is detected, the algorithm uses a pointer into the library to recreate the repeated section.

19 Compressing the String
Normally, limits exist on the library size and on the length of the “words” stored.
Memory utilization and efficiency are not important here.
Lempel-Ziv is suitable for this application.
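
The slides do not say which Lempel-Ziv variant CodeSimian uses. As an illustration of the "library plus pointers" idea with no size limits, here is a minimal LZ78-style sketch. The function names `lz78_compress` and `lz78_decompress` are assumptions for this example, not CodeSimian's code.

```python
def lz78_compress(text: str) -> list[tuple[int, str]]:
    """Encode text as (library index, next character) pairs; the library is unbounded."""
    library = {"": 0}   # phrase -> index; index 0 is the empty phrase
    phrases = []
    current = ""
    for ch in text:
        candidate = current + ch
        if candidate in library:
            current = candidate              # keep extending a known phrase
        else:
            phrases.append((library[current], ch))
            library[candidate] = len(library)
            current = ""
    if current:
        # Input ended inside a known phrase: emit it as (prefix index, last character).
        phrases.append((library[current[:-1]], current[-1]))
    return phrases


def lz78_decompress(phrases: list[tuple[int, str]]) -> str:
    """Rebuild the text by replaying the same library construction."""
    entries = [""]
    pieces = []
    for index, ch in phrases:
        phrase = entries[index] + ch
        entries.append(phrase)
        pieces.append(phrase)
    return "".join(pieces)


tokens = "{TV=N;}" * 20
encoded = lz78_compress(tokens)
print(len(tokens), len(encoded))  # the number of pairs is smaller than the input length
```

The more repetition the input contains, the longer the phrases the library accumulates and the fewer pairs are emitted, so the output length serves as a rough measure of how much new information the string holds.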

20 Comparing the Compressed String
K(x) is taken to be the size of the compressed, tokenized code x.
K(x|y) is the size of the compressed, tokenized code x, given y as a “free” library.
K(xy) is the size of the compressed, tokenized code x+y (x concatenated with y).
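
A rough sketch of how these three sizes could be combined into a similarity percentage follows. Several things here are assumptions: `zlib` replaces the unbounded-buffer Lempel-Ziv compressor described on slides 18 and 19 (its 32 KB window makes it suitable only for short inputs), K(x|y) is estimated as the extra bytes needed to compress x after y, and the ratio (K(x) − K(x|y)) / K(xy) is one reading of the description on slide 14, not a formula confirmed by this transcript.

```python
import zlib


def k(data: bytes) -> int:
    """Stand-in for K(data): size in bytes of the zlib-compressed data."""
    return len(zlib.compress(data, 9))


def k_given(x: bytes, y: bytes) -> int:
    """Stand-in for K(x|y): extra bytes needed for x once y has already been seen."""
    return max(k(y + x) - k(y), 0)


def similarity(x: bytes, y: bytes) -> float:
    """Assumed similarity factor (K(x) - K(x|y)) / K(xy); higher means more shared content."""
    return (k(x) - k_given(x, y)) / k(x + y)
```

Calling `similarity(a, b)` on two tokenized submissions encoded as bytes gives a score near 1 when one is largely contained in the other and near 0 when they share little. Scores depend heavily on the compressor, so any threshold would need to be calibrated against known cases.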

21 Results
Test on trivial examples: LinkedList.java, LinkedList2.java, LinkedList3.java
Changes were limited to renaming variables, reformatting, removing comments, rearranging variable declarations, and adding “junk” code such as random debugging text output.
All files came out as >85% similar.

22 Results
Test on a small real-world sample: Professor Kang’s CS201 HW1
Relatively simple homework assignment
30-50% similarity on average
95% similarity detected on one pair of submissions
Confirmed by Professor Kang as correct

23 Results
Test on another small real-world sample: Professor Kang’s CS201 HW4
More complex homework assignment involving 2-3 files, with the Java files broken down according to function
Possible problem: specialized function files may present false positives
30-70% similarity on average
95+% similarity detected on pairs of submissions
Confirmed by Professor Kang as correct

24 Results
Things to note:
The results showed a similarity of 80% on one pair of submissions, which is deemed significant by the application but is not necessarily conclusive.
Careful inspection by hand of the suspected files revealed one block of code that was apparently copied with variable names changed.

25 Conclusions
Successful test cases
Simple and straightforward to use
Based on an objective principle, which works!

26 Future Work
Enhancing the application to compare internal “blocks” of code
Improving the compression algorithm to better handle and adapt to “approximate matches”
Improving the functionality of the GUI
Providing a report-printing capability for directories

