Matching Program Versions

CS590 Z Matching Program Versions Xiangyu Zhang

Problem Statement
Suppose a program P’ is created by modifying P. Determine the difference between P and P’: for an artifact c’ in P’, decide whether c’ belongs to the difference and, if not, find the correspondence of c’ in P.
Static mapping is non-trivial. Name comparison? What if
Clone analysis, comparison checking.

Motivations
Validate compiler transformations; facilitate regression testing; reverse obfuscation; information propagation; debugging; code plagiarism detection; information assurance.

Approaches
Static approaches: entity name based; string based (MOSS); AST based (DECKARD); CFG based (JDIFF); PDG based (PDIFF); binary based (BMAT); log based (editor plugin, comparison checking).
Dynamic approaches (not today).

Static Approaches
Entity name matching: model a function/field as tuples. Coarse-grained matching.
String matching: diff (CVS, Subversion) computes a longest common subsequence (LCS). The available operations are addition and deletion, and matched pairs cannot cross one another. For two strings, LCS has polynomial complexity (by dynamic programming).
Programs are far more complicated than strings: copy, paste, move. CP-Miner (scales to Linux kernel clone detection) uses frequent subsequence mining.
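The LCS step can be sketched as follows. This is a minimal illustration of the textbook dynamic program, not the implementation used by diff; the function name `lcs_pairs` and the sample statements are made up for the example.

```python
def lcs_pairs(a, b):
    """Return index pairs (i, j) of one longest common subsequence of a and b."""
    n, m = len(a), len(b)
    # table[i][j] holds the LCS length of the suffixes a[i:] and b[j:].
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = 1 + table[i + 1][j + 1]
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    # Walk the table from the top-left corner to recover the matched pairs.
    pairs, i, j = [], 0, 0
    while i < n and j < m:
        if a[i] == b[j]:
            pairs.append((i, j))
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return pairs


# Hypothetical old and new versions: two statements have been swapped.
old = ["int x;", "x = 1;", "y = x;", "return y;"]
new = ["int x;", "y = x;", "x = 1;", "return y;"]
print(lcs_pairs(old, new))
```

Because matched pairs cannot cross, only one of the two swapped statements is matched; the other shows up as a deletion plus an addition, which is why moves are a problem for this approach.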

MOSS
Code plagiarism detection; it also handles other digital content.
Challenges: white space (variable name); noise (“the”, “int i”); order scrambling (paragraph reordering).
Problem statement: given a set of documents, identify substring matches that satisfy two properties. (1) If there is a substring match at least as long as the guarantee threshold t, then this match is detected. (2) No match shorter than the noise threshold k is detected.

MOSS k-gram
A contiguous substring of length k.
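As a small illustration, the k-grams of a string can be enumerated like this (the helper name `kgrams` is chosen for the example only):

```python
def kgrams(text, k):
    """Return every contiguous substring of length k, in order of position."""
    return [text[i:i + k] for i in range(len(text) - k + 1)]


print(kgrams("abcde", 3))
```

A string of length n has n - k + 1 k-grams, so nearly every position starts one.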

MOSS Incremental hashing
Hashing strings of length k is expensive for large k. Use a “rolling” hash function: the (i+1)th k-gram hash = F(the ith k-gram hash, …).
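The slide leaves F unspecified. One common choice, assumed here for illustration, is a Karp-Rabin style polynomial hash, where the next hash is obtained by removing the leading character's contribution and appending the new character. `BASE`, `MOD`, and the function names are illustrative and not taken from MOSS.

```python
BASE = 257            # assumed radix for the polynomial hash
MOD = (1 << 61) - 1   # assumed large prime modulus


def direct_hash(s):
    """Hash a string from scratch in time proportional to len(s)."""
    h = 0
    for ch in s:
        h = (h * BASE + ord(ch)) % MOD
    return h


def rolling_hashes(text, k):
    """Hash every k-gram of text, updating each hash from the previous one."""
    if k <= 0 or len(text) < k:
        return []
    high = pow(BASE, k - 1, MOD)   # weight of the leading character
    h = direct_hash(text[:k])
    hashes = [h]
    for i in range(k, len(text)):
        h = (h - ord(text[i - k]) * high) % MOD   # drop the leading character
        h = (h * BASE + ord(text[i])) % MOD       # append the new character
        hashes.append(h)
    return hashes


text = "matching program versions"
print(rolling_hashes(text, 5)[:3])
```

Each update costs a constant number of arithmetic operations regardless of k, whereas rehashing every k-gram from scratch costs time proportional to k per position.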

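MOSS Fingerprint selection
A fingerprint is a subset of the hash values. Our goals: find all matching substrings of length t or more; ignore matches shorter than k. Candidate strategies: select every t-th hash value; select the hash values that are 0 mod p.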
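The two candidate strategies can be compared on a toy sequence. This is a sketch with made-up hash values; the function names are illustrative. It shows that selecting by position is disturbed by a single insertion at the front, while selecting by value is not, although value-based selection can leave long stretches with no fingerprint at all.

```python
def select_every_tth(hashes, t):
    """Position-based selection: keep the hashes at positions 0, t, 2t, ..."""
    return [(i, h) for i, h in enumerate(hashes) if i % t == 0]


def select_zero_mod_p(hashes, p):
    """Value-based selection: keep the hashes that are divisible by p."""
    return [(i, h) for i, h in enumerate(hashes) if h % p == 0]


# Made-up hash values; `shifted` models one extra k-gram inserted at the front.
hashes = [77, 74, 42, 17, 98, 50, 16, 91, 8, 88]
shifted = [5] + hashes

print(select_every_tth(hashes, 4), select_every_tth(shifted, 4))
print(select_zero_mod_p(hashes, 4), select_zero_mod_p(shifted, 4))
```

In this example the 0 mod p selection picks nothing among the first six hashes, so a shared substring falling entirely in that region would go unnoticed.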

MOSS Winnowing
Observation: given a sequence of hashes h1, …, hn, if n > t-k, then at least one of the hi must be chosen. Use a sliding window of size w = t-k+1. In each window, select the minimum hash value; break ties by selecting the rightmost occurrence.
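The window rule above can be sketched directly. The function name `winnow` and the hash values are illustrative, and this simple version recomputes the minimum of every window rather than maintaining it incrementally.

```python
def winnow(hashes, w):
    """Select (position, hash) fingerprints using a sliding window of size w."""
    selected = []
    for start in range(len(hashes) - w + 1):
        window = hashes[start:start + w]
        smallest = min(window)
        # Break ties by taking the rightmost occurrence of the minimum.
        offset = w - 1 - window[::-1].index(smallest)
        pos = start + offset
        # Selected positions never move left, so comparing with the last
        # recorded fingerprint is enough to avoid recording it twice.
        if not selected or selected[-1][0] != pos:
            selected.append((pos, smallest))
    return selected


# Made-up hash sequence; w = 4 corresponds to, for example, t = 8 and k = 5.
hashes = [77, 74, 42, 17, 98, 50, 17, 98, 8, 88, 67, 39, 77, 74, 42, 17, 98]
print(winnow(hashes, 4))
```

Every window contributes a fingerprint, so any run of w consecutive hashes (a match of length t) is guaranteed to contain at least one selected value.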

MOSS Algorithm
Build an index mapping fingerprints to locations for all documents. Each document is then fingerprinted a second time and the selected fingerprints are looked up in the index; this gives the list of all matching fingerprints for each document. Sort the matches (d, d1, fx), (d, d2, fy) by their first two elements. Matches between documents are rank-ordered by size (number of fingerprints).
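A compact sketch of the indexing and ranking steps, assuming the fingerprints of each document have already been selected. The input layout (a dict from document id to a list of (position, hash) pairs) and the name `rank_matches` are assumptions made for the example, not the MOSS data structures.

```python
from collections import defaultdict


def rank_matches(fingerprints):
    """Rank document pairs by the number of distinct fingerprints they share."""
    # Step 1: index every fingerprint hash to the locations where it occurs.
    index = defaultdict(list)
    for doc, fps in fingerprints.items():
        for pos, h in fps:
            index[h].append((doc, pos))
    # Step 2: look each document's fingerprints up in the index.
    shared = defaultdict(set)
    for doc, fps in fingerprints.items():
        for _, h in fps:
            for other, _ in index[h]:
                if other != doc:
                    shared[tuple(sorted((doc, other)))].add(h)
    # Step 3: order pairs by match size, largest first.
    ranked = [(pair, len(hs)) for pair, hs in shared.items()]
    return sorted(ranked, key=lambda item: (-item[1], item[0]))


# Hypothetical fingerprints for three documents.
docs = {
    "a": [(0, 1), (3, 2), (5, 3)],
    "b": [(1, 2), (4, 3)],
    "c": [(0, 3), (2, 9)],
}
print(rank_matches(docs))
```

Keeping the positions in the index is what lets a real tool map a shared fingerprint back to the matching regions of each document.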

MOSS
Advantages: guaranteed to detect any substring match of length t or more.
Limitations: minor edits defeat MOSS, e.g. x = a*b + c vs. z = c + a*b; insertion, deletion.
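The reordering example can be checked mechanically. The sketch below removes white space and compares the k-gram sets of the two statements; the helper name and the values of k are chosen for illustration.

```python
def kgram_set(text, k):
    """Return the set of contiguous substrings of length k."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


# The two statements from the slide, with white space removed.
original = "x=a*b+c"
edited = "z=c+a*b"

shared3 = kgram_set(original, 3) & kgram_set(edited, 3)
shared4 = kgram_set(original, 4) & kgram_set(edited, 4)
print(shared3, shared4)
```

With k = 3 only the operand pair "a*b" survives, and with k = 4 nothing is shared, so no fingerprint can link the two statements even though they compute the same value.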

AST based matching [YANG, 1991, Software Practice and Experience]
Given two functions, build their ASTs. Match the roots. If they match, apply LCS to align the subtrees, and continue recursively. Fragile.
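The recursive scheme can be sketched on toy trees. The tree encoding (a `(label, children)` tuple), the label-equality test, and the function names are assumptions made for the example; they are not taken from Yang's paper.

```python
def lcs_pairs(xs, ys, same):
    """Index pairs of one longest common subsequence under the predicate same."""
    n, m = len(xs), len(ys)
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if same(xs[i], ys[j]):
                table[i][j] = 1 + table[i + 1][j + 1]
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    pairs, i, j = [], 0, 0
    while i < n and j < m:
        if same(xs[i], ys[j]):
            pairs.append((i, j))
            i, j = i + 1, j + 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return pairs


def match_trees(t1, t2, path1=(), path2=()):
    """Return matched node pairs, each node named by its path of child indices."""
    (label1, kids1), (label2, kids2) = t1, t2
    if label1 != label2:
        return []   # roots differ: nothing below them is matched
    matches = [(path1, path2)]
    # Align the children by root label, then recurse into each aligned pair.
    for i, j in lcs_pairs(kids1, kids2, lambda a, b: a[0] == b[0]):
        matches += match_trees(kids1[i], kids2[j], path1 + (i,), path2 + (j,))
    return matches


# Hypothetical function bodies: the new version adds a call and edits a constant.
old = ("func", [("assign", [("x", []), ("1", [])]), ("return", [("x", [])])])
new = ("func", [("call", [("log", [])]),
                ("assign", [("x", []), ("2", [])]),
                ("return", [("x", [])])])
print(match_trees(old, new))
```

The fragility is visible in the first check: if the two roots carry different labels, the whole subtree is reported as unmatched even when everything beneath it is identical.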

DECKARD (ICSE 2007)

DECKARD
Advantages: scalability; insensitive to minor structural changes such as reordering, insertion, deletion.
Limitations: structural similarity only; insertion that incurs structure change.

CFG matching
Hammock graph (JDIFF, ASE 2004). Match classes by names, fields by types, methods by signatures, and instructions in methods by hammock graphs. A hammock is a single-entry, single-exit subgraph of a CFG.
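The top two levels of this hierarchy can be sketched with plain dictionaries. The program encoding (class name to a dict of method signature to body) and the function names are hypothetical; field matching and the hammock-level comparison of method bodies are omitted.

```python
def match_by_key(old, new):
    """Split the keys of two dicts into matched, removed, and added."""
    return {
        "matched": sorted(old.keys() & new.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "added": sorted(new.keys() - old.keys()),
    }


def match_program(old_prog, new_prog):
    """Match classes by name, then methods within matched classes by signature."""
    classes = match_by_key(old_prog, new_prog)
    methods = {
        cls: match_by_key(old_prog[cls], new_prog[cls])
        for cls in classes["matched"]
    }
    return classes, methods


# Hypothetical programs: class name -> {method signature: body}.
old_prog = {
    "Account": {"deposit(int)": "...", "close()": "..."},
    "Audit": {"log(String)": "..."},
}
new_prog = {
    "Account": {"deposit(int)": "...", "withdraw(int)": "..."},
    "Report": {"print()": "..."},
}
classes, methods = match_program(old_prog, new_prog)
print(classes)
print(methods)
```

Only the bodies of matched methods would then be compared at the instruction level, which keeps the expensive graph comparison confined to small, already-paired units.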

CFG matching
Pros: orthogonal (can be combined with other matching techniques); simple.
Cons: coarse-grained matching only; not good at clone detection; in case of code transformation

Semantic Based Matching
Using PDG (SAS’01)

Semantic Based

Semantic Based
Pros: handles non-contiguous, intertwined, reordered code; insensitive to code transformations.
Cons: scalability; points-to analysis; starting from a matching pair seems to be a problem.

Wrap Up
For clone detection, structural or text similarity may be a good idea. For whole-program matching or method matching with code transformations, semantic based matching is more appropriate.
Scalability: PDG < CFG | AST < STRING < NAME
