1 GPLAG: Detection of Software Plagiarism by Program Dependence Graph Analysis
  Chao Liu, Chen Chen, Jiawei Han, Philip S. Yu
  University of Illinois at Urbana-Champaign
  IBM T.J. Watson Research Center
  Presented by Chao Liu

2 Motivations
  Blossom of open-source projects
    SourceForge.net: 125,090 projects as of July 2006
  Convenience for software plagiarism?
    You can always find something online
  Core-part plagiarism
    Ripping off GUIs and irrelevant parts
    (Illegally) reusing the implementations of core algorithms
  Our goal
    Efficient detection of core-part plagiarism

3 Challenges
  Effectiveness
    Professional plagiarists
    Automated plagiarism
  Efficiency
    Only a small part of the code is plagiarized; how can it be detected efficiently?

4 Outline
  Plagiarism Disguises
  Review of Plagiarism Detection
  GPLAG: PDG-based Plagiarism Detection
  Efficiency and Scalability
  Experiments
  Conclusions

5 Original Program
  01 static void
  02 make_blank (struct line *blank, int count)
  03 {
  04   int i;
  05   unsigned char *buffer;
  06   struct field *fields;
  07   blank->nfields = count;
  08   blank->buf.size = blank->buf.length = count + 1;
  09   blank->buf.buffer = (char*) xmalloc (blank->buf.size);
  10   buffer = (unsigned char *) blank->buf.buffer;
  11   blank->fields = fields = (struct field *) xmalloc (sizeof (struct field) * count);
  12   for (i = 0; i < count; i++){ }
  15 }
  A procedure in a program called join

6 Disguise 1: Format Alteration
  01 static void
  02 make_blank (struct line *blank, int count)
  03 {
  04   int i;
  05   unsigned char *buffer;
  06   struct field *fields;
  07   blank->nfields = count; // initialization
  08   blank->buf.size = blank->buf.length = count + 1;
  09   blank->buf.buffer = (char*) xmalloc (blank->buf.size);
  10   buffer = (unsigned char *) blank->buf.buffer;
  11   blank->fields = fields = (struct field *) xmalloc (sizeof (struct field) * count);
  12   for (i = 0; i < count; i++){ }
  15 }
  Insert comments and blanks

7 Disguise 2: Identifier Renaming
  01 static void
  02 fill_content (struct line *fill, int num)
  03 {
  04   int i;
  05   unsigned char *buffer;
  06   struct field *fields;
  07   fill->nfields = num; // initialization
  08   fill->buf.size = fill->buf.length = num + 1;
  09   fill->buf.buffer = (char*) xmalloc (fill->buf.size);
  10   buffer = (unsigned char *) fill->buf.buffer;
  11   fill->fields = fields = (struct field *) xmalloc (sizeof (struct field) * num);
  12   for (i = 0; i < num; i++){ }
  15 }
  Rename variables consistently

8 Disguise 3: Statement Reordering
  01 static void
  02 fill_content (struct line *fill, int num)
  03 {
  04   int i;
  05   unsigned char *buffer;
  06   struct field *fields;
  11   fill->fields = fields = (struct field *) xmalloc (sizeof (struct field) * num);
  08   fill->buf.size = fill->buf.length = num + 1;
  09   fill->buf.buffer = (char*) xmalloc (fill->buf.size);
  10   buffer = (unsigned char *) fill->buf.buffer;
  07   fill->nfields = num; // initialization
  12   for (i = 0; i < num; i++){ }
  15 }
  Reorder non-dependent statements

9 Disguise 4: Control Replacement
  01 static void
  02 fill_content (struct line *fill, int num)
  03 {
  04   int i;
  05   unsigned char *buffer;
  06   struct field *fields;
  11   fill->fields = fields = (struct field *) xmalloc (sizeof (struct field) * num);
  08   fill->buf.size = fill->buf.length = num + 1;
  09   fill->buf.buffer = (char*) xmalloc (fill->buf.size);
  10   buffer = (unsigned char *) fill->buf.buffer;
  07   fill->nfields = num; // initialization
  12   i = 0;
  13   while (i < num){
         i++;
  16   }
  17 }
  Use an equivalent control structure

10 Disguise 5: Code Insertion
  01 static void
  02 fill_content (struct line *fill, int num)
  03 {
  04   int i;
  05   unsigned char *buffer;
  06   struct field *fields;
  11   fill->fields = fields = (struct field *) xmalloc (sizeof (struct field) * num);
  08   fill->buf.size = fill->buf.length = num + 1;
  09   fill->buf.buffer = (char*) xmalloc (fill->buf.size);
  10   buffer = (unsigned char *) fill->buf.buffer;
  07   fill->nfields = num; // initialization
  12   i = 0;
  13   while (i < num){
         for (int j = 0; j < i; j++);
  15     i++;
  16   }
  17 }
  Insert immaterial code

11 Fully Disguised

12 Outline
  Plagiarism Disguises
  Review of Plagiarism Detection
  GPLAG: PDG-based Plagiarism Detection
  Efficiency and Scalability
  Experiments
  Conclusions

13 Review of Plagiarism Detection
  String-based [Baker et al. 1995]
    A program is represented as a string
    Blanks and comments are ignored
  AST-based [Baxter et al. 1998, Kontogiannis et al. 1995]
    A program is represented as an Abstract Syntax Tree (AST)
    Fragile to statement reordering, control replacement and code insertion
  Token-based [Kamiya et al. 2002, Prechelt et al. 2002]
    Variables of the same type are mapped to the same token
    A program is represented as a token string
    Fingerprints of token strings are used for robustness [Schleimer et al. 2003]
    Partially robust to statement reordering, control replacement and code insertion
    Representatives: Moss and JPlag

14 Outline
  Plagiarism Disguises
  Review of Plagiarism Detection
  GPLAG: PDG-based Plagiarism Detection
  Efficiency and Scalability
  Experiments
  Conclusions

15 Graphic representation of source code
  int sum(int array[], int count)
  {
    int i, sum;
    sum = 0;
    for(i = 0; i < count; i++){
      sum = add(sum, array[i]);
    }
    return sum;
  }
  int add(int a, int b)
  {
    return a + b;
  }

17 Control Dependency
  int sum(int array[], int count)
  {
    int i, sum;
    sum = 0;
    for(i = 0; i < count; i++){
      sum = add(sum, array[i]);
    }
    return sum;
  }
  int add(int a, int b)
  {
    return a + b;
  }

18 Data Dependency
  int sum(int array[], int count)
  {
    int i, sum;
    sum = 0;
    for(i = 0; i < count; i++){
      sum = add(sum, array[i]);
    }
    return sum;
  }
  int add(int a, int b)
  {
    return a + b;
  }
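
The dependence figures for these slides did not survive in the transcript, so the following is a hand-built sketch of what a PDG for sum() might look like as a data structure. The type names, vertex kinds, and the edge list are assumptions made for illustration; they are not the output of CodeSurfer or the authors' representation. The sketch also simplifies: the parameters, the declaration of i and sum, the callee add(), and the loop predicate's dependence on itself are left out.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative labels only; a real PDG tool defines its own vertex kinds. */
enum vertex_kind { V_ASSIGN, V_CONTROL, V_CALL, V_INCREMENT, V_RETURN };
enum edge_kind { E_CONTROL, E_DATA };

struct pdg_edge { int from, to; enum edge_kind kind; };

struct pdg {
    size_t n_vertices;
    const enum vertex_kind *kinds;   /* kinds[v] labels vertex v */
    size_t n_edges;
    const struct pdg_edge *edges;
};

/* Vertices of sum(): 0 "sum = 0", 1 "i = 0", 2 "i < count",
   3 "sum = add(sum, array[i])", 4 "i++", 5 "return sum". */
static const enum vertex_kind sum_kinds[] = {
    V_ASSIGN, V_ASSIGN, V_CONTROL, V_CALL, V_INCREMENT, V_RETURN
};

static const struct pdg_edge sum_edges[] = {
    /* control: the loop predicate decides whether the body and i++ run */
    {2, 3, E_CONTROL}, {2, 4, E_CONTROL},
    /* data: definitions of sum that reach a use of sum */
    {0, 3, E_DATA}, {3, 3, E_DATA}, {0, 5, E_DATA}, {3, 5, E_DATA},
    /* data: definitions of i that reach a use of i */
    {1, 2, E_DATA}, {1, 3, E_DATA}, {1, 4, E_DATA},
    {4, 2, E_DATA}, {4, 3, E_DATA}, {4, 4, E_DATA},
};

/* Returns a view of the hand-built PDG above. */
struct pdg sum_pdg(void)
{
    struct pdg g = {
        sizeof sum_kinds / sizeof sum_kinds[0], sum_kinds,
        sizeof sum_edges / sizeof sum_edges[0], sum_edges
    };
    return g;
}

/* Counts the edges of one kind, e.g. all data dependencies. */
size_t count_edges(const struct pdg *g, enum edge_kind kind)
{
    size_t count = 0;
    assert(g != NULL);
    for (size_t e = 0; e < g->n_edges; e++)
        if (g->edges[e].kind == kind)
            count++;
    return count;
}
```

Renaming identifiers or reordering independent statements changes none of these edges, which is why a graph like this is a more stable fingerprint than the token sequence.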

19 Plagiarism Detectable?

20 Corresponding PDGs
  PDG for the Original Code
  PDG for the Plagiarized Code

21 PDG-based Plagiarism Detection
  A program is represented as a set of PDGs
    Let g be a PDG of procedure P in the original program
    Let g' be a PDG of procedure P' in the plagiarism suspect
  Subgraph isomorphism implies plagiarism
    If g is subgraph isomorphic to g', P' is likely plagiarized from P
  γ-isomorphism: Graph g is γ-isomorphic to g' if there exists a subgraph s of g such that s is subgraph isomorphic to g', and |s| ≥ γ|g|.
  If g is γ-isomorphic to g', the PDG pair (g, g') is regarded as a plagiarized PDG pair and is returned to human reviewers for examination.
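
To make "subgraph isomorphic" concrete, here is a minimal backtracking sketch for small graphs. It is not the authors' implementation and not VFLib: the struct layout, MAX_V, and the function names are hypothetical, edge kinds are ignored, and the match is non-induced (every edge of the pattern must exist in the target, extra target edges are allowed). The γ relaxation is not implemented; it would additionally search over subgraphs s of g with |s| ≥ γ|g|.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_V 16

/* A small directed graph with a kind label on every vertex (hypothetical layout). */
struct graph {
    int n;                    /* number of vertices, at most MAX_V */
    int kind[MAX_V];          /* vertex kind, e.g. assignment or call */
    bool adj[MAX_V][MAX_V];   /* adj[u][v] is true when edge u -> v exists */
};

/* Can pattern vertex p be mapped to target vertex t, given map[0..p-1]? */
static bool feasible(const struct graph *pat, const struct graph *tgt,
                     const int map[], int p, int t)
{
    if (pat->kind[p] != tgt->kind[t])
        return false;
    if (pat->adj[p][p] && !tgt->adj[t][t])
        return false;
    for (int q = 0; q < p; q++) {
        if (map[q] == t)
            return false;     /* keep the mapping injective */
        if (pat->adj[q][p] && !tgt->adj[map[q]][t])
            return false;     /* edge q -> p has no image in the target */
        if (pat->adj[p][q] && !tgt->adj[t][map[q]])
            return false;     /* edge p -> q has no image in the target */
    }
    return true;
}

/* Tries every target vertex for pattern vertex p, then recurses. */
static bool extend(const struct graph *pat, const struct graph *tgt,
                   int map[], int p)
{
    if (p == pat->n)
        return true;          /* every pattern vertex has been placed */
    for (int t = 0; t < tgt->n; t++) {
        if (feasible(pat, tgt, map, p, t)) {
            map[p] = t;
            if (extend(pat, tgt, map, p + 1))
                return true;
        }
    }
    return false;
}

/* True when pat can be embedded in tgt with kinds and edges preserved. */
bool subgraph_isomorphic(const struct graph *pat, const struct graph *tgt)
{
    int map[MAX_V];
    assert(pat->n >= 0 && pat->n <= MAX_V);
    assert(tgt->n >= 0 && tgt->n <= MAX_V);
    return extend(pat, tgt, map, 0);
}
```

With g as the pattern and g' as the target, inserted code in the suspect only adds vertices and edges to g', so the embedding of g can still be found. The search is exponential in the worst case, which is the reason the later slides prune PDG pairs before testing them.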

22 Advantages
  Robust because it is hard to overhaul PDGs
    Dependencies encode program logic
    Incentive of plagiarism

23 Outline
  Plagiarism Disguises
  Review of Plagiarism Detection
  GPLAG: PDG-based Plagiarism Detection
  Efficiency and Scalability
  Experiments
  Conclusions

24 Efficiency and Scalability
  Search space
    If the original program has n procedures and the plagiarism suspect has m procedures
    n*m subgraph isomorphism tests
  Pruning the search space
    Lossless filter
    Statistical lossy filter

25 Lossless filter
  Interestingness
    PDGs smaller than an interesting size K are excluded from both sides
  γ-isomorphism definition
    A PDG pair (g, g') is discarded if |g'| < γ|g|.
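
The two lossless rules reduce to size comparisons, so they can be sketched in a few lines. The function names are hypothetical, and measuring |g| as a vertex count is an assumption; the slide does not say how size is measured.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* True when the pair (g, g') survives both lossless rules.
   size_g and size_gp stand for |g| and |g'|, k is the interesting size K,
   and gamma is the relaxation parameter, assumed to lie in (0, 1]. */
bool passes_lossless_filter(size_t size_g, size_t size_gp, size_t k, double gamma)
{
    assert(gamma > 0.0 && gamma <= 1.0);
    if (size_g < k || size_gp < k)
        return false;   /* uninteresting: one side is too small */
    if ((double)size_gp < gamma * (double)size_g)
        return false;   /* g' is too small to contain gamma * |g| vertices of g */
    return true;
}

/* Counts the pairs that survive out of the n*m candidates. */
size_t count_surviving_pairs(const size_t orig[], size_t n,
                             const size_t susp[], size_t m,
                             size_t k, double gamma)
{
    size_t kept = 0;
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < m; j++)
            if (passes_lossless_filter(orig[i], susp[j], k, gamma))
                kept++;
    return kept;
}
```

The filter is lossless because a pair it rejects cannot satisfy the γ-isomorphism definition: a subgraph of g with at least γ|g| vertices cannot be embedded in a smaller g'.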

26 Lossy Filter
  Observation
    If procedure P' is plagiarized from procedure P, its PDG g' should look similar to g.
    So discard the dissimilar PDG pairs
  Requirement
    This filter must be lightweight

27 Vertex Histogram
  Represent PDG g by h(g) = (n_1, n_2, …, n_k), where n_i is the frequency of the i-th kind of vertex.
  Similarly, represent PDG g' by h(g') = (m_1, m_2, …, m_k).
  Direct similarity measurement?
    How to define a proper similarity threshold?
    Is a threshold defined this way program-independent?
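
A vertex histogram is a single counting pass over the vertices. The sketch below assumes each vertex kind has already been mapped to an integer in [0, k); that encoding and the function name are illustrative choices.

```c
#include <assert.h>
#include <stddef.h>

/* Fills hist[0..k-1] with the number of vertices of each kind.
   kinds[v] is the kind of vertex v and must lie in [0, k). */
void vertex_histogram(const int kinds[], size_t n_vertices, size_t k, unsigned hist[])
{
    for (size_t i = 0; i < k; i++)
        hist[i] = 0;
    for (size_t v = 0; v < n_vertices; v++) {
        assert(kinds[v] >= 0 && (size_t)kinds[v] < k);
        hist[kinds[v]]++;
    }
}
```

The cost is linear in the number of vertices, which fits the requirement that the lossy filter be lightweight compared with an isomorphism test.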

28 Hypothesis Testing-based Approach
  Basic idea
    Estimate a k-dimensional multinomial distribution from h(g)
    Test whether h(g') is likely an observation from that distribution
    If it is, g' looks similar to g, and an isomorphism test is needed.
    Otherwise, (g, g') is discarded
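
The content of the "Technical Details" slides is missing from this transcript, so the exact test statistic is not shown here. As a stand-in, the sketch below uses Pearson's chi-square goodness-of-fit statistic, one standard way to test a histogram against a multinomial; it should not be read as the authors' formula. The function names, the -1.0 sentinel, and passing the critical value as a parameter are all assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Pearson chi-square statistic of the observed histogram m[] against the
   multinomial estimated from n[] with p_i = n_i / N.
   Returns -1.0 when a kind occurs in m[] but never in n[], because the
   estimated model gives that observation probability zero. */
double chi_square_stat(const unsigned n[], const unsigned m[], size_t k)
{
    double total_n = 0.0, total_m = 0.0, stat = 0.0;
    for (size_t i = 0; i < k; i++) {
        total_n += n[i];
        total_m += m[i];
    }
    assert(total_n > 0.0);
    for (size_t i = 0; i < k; i++) {
        double expected = total_m * ((double)n[i] / total_n);
        if (expected == 0.0) {
            if (m[i] > 0)
                return -1.0;
            continue;         /* kind absent on both sides contributes nothing */
        }
        double diff = (double)m[i] - expected;
        stat += diff * diff / expected;
    }
    return stat;
}

/* True when h(g') is not rejected, so the pair goes on to isomorphism testing.
   critical is the chi-square critical value for k - 1 degrees of freedom
   at the chosen significance level, supplied by the caller. */
bool needs_isomorphism_test(const unsigned n[], const unsigned m[],
                            size_t k, double critical)
{
    double stat = chi_square_stat(n, m, k);
    return stat >= 0.0 && stat <= critical;
}
```

For k = 3 there are 2 degrees of freedom, and the critical value at the 0.05 level is about 5.991. Because the threshold comes from the significance level rather than from a hand-tuned similarity score, it does not depend on the program being checked, which answers the question raised on the Vertex Histogram slide.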

29 Technical Details

30 Technical Details (cont'd)

31 Work-flow of GPLAG
  PDGs are generated with CodeSurfer.
  Isomorphism testing is implemented with VFLib.

32 Outline
  Plagiarism Disguises
  Review of Plagiarism Detection
  GPLAG: PDG-based Plagiarism Detection
  Efficiency and Scalability
  Experiments
  Conclusions

33 Experiment Design
  Subject programs
  Effectiveness
  Filter efficiency
  Core-part plagiarism detection

34 Effectiveness
  2-hour manual plagiarism, but can it be automated?
  GPLAG detects all plagiarized PDG pairs within 1 second
  PDG isomorphism also reveals which plagiarism disguises were applied

35 Efficiency
  Subject programs
    bc, less and tar.
    Exact copy as plagiarism.
  Lossless and lossy filter
    Pruning PDG pairs.
    Implication for overall time cost.

36 Pruning Uninteresting PDG Pairs
  Lossless only
  Lossless and lossy

37 Implication for Overall Time Cost
  Time-out for subgraph isomorphism testing (time hogs).
  Lossless filter does not save much time.
  Lossy filter significantly reduces the time cost.
  Major time saving comes from the avoidance of time hogs.

38 Detection of Core-part Plagiarism
  Lower time cost with lossy filter.
  Fewer false positives with lossy filter.

39 Outline
  Plagiarism Disguises
  Review of Plagiarism Detection
  GPLAG: PDG-based Plagiarism Detection
  Efficiency and Scalability
  Experiments
  Conclusions

40 Conclusions
  We developed a new algorithm, GPLAG, for software plagiarism detection
    It is more effective against professional plagiarists
  We developed a statistical lossy filter, which improves the efficiency of GPLAG
  We experimentally verified the effectiveness and efficiency of GPLAG

41 Q & A
  Thank You!

42 References
  [1] B. S. Baker. On finding duplication and near duplication in large software systems. In Proc. of 2nd Working Conf. on Reverse Engineering, 1995.
  [2] I. D. Baxter, A. Yahin, L. Moura, M. Sant'Anna, and L. Bier. Clone detection using abstract syntax trees. In Proc. of Int. Conf. on Software Maintenance, 1998.
  [3] K. Kontogiannis, M. Galler, and R. DeMori. Detecting code similarity using patterns. In Working Notes of 3rd Workshop on AI and Software Engineering, 1995.
  [4] T. Kamiya, S. Kusumoto, and K. Inoue. CCFinder: a multilinguistic token-based code clone detection system for large scale source code. IEEE Trans. Softw. Eng., 28(7), 2002.
  [5] L. Prechelt, G. Malpohl, and M. Philippsen. Finding plagiarisms among a set of programs with JPlag. J. of Universal Computer Science, 8(11), 2002.
  [6] S. Schleimer, D. S. Wilkerson, and A. Aiken. Winnowing: local algorithms for document fingerprinting. SIGMOD, 2003.
  [7] V. B. Livshits and T. Zimmermann. Dynamine: Finding common error patterns by mining software revision histories. In Proc. of 13th Int. Symp. on the Foundations of Software Engineering.
  [8] C. Liu, X. Yan, and J. Han. Mining control flow abnormality for logic error isolation. In Proc. of SIAM Int. Conf. on Data Mining.
  [9] C. Liu, X. Yan, H. Yu, J. Han, and P. S. Yu. Mining behavior graphs for backtrace of noncrashing bugs. In SDM, 2005.

