Near-Duplicate Detection by Instance-level Constrained Clustering Hui Yang, Jamie Callan Language Technologies Institute School of Computer Science Carnegie Mellon University

Introduction
Near-Duplicate Detection
To identify and organize “nearly-identical” documents
Different definition of “similarity” from other fields
–Database: Almost-identical documents
  Fingerprint-based approaches
  Only allow small changes to the texts
  Sensitive to text positions
–Information Retrieval: Relevant documents
  Bag-of-words approaches
  Measure overlap of the vocabulary
  Focus more on semantic similarity, while near-duplicates depend more on syntactic (surface text) similarity
  Cannot identify near-duplicates when they only share a small amount of text

Near-Duplicate Detection in eRulemaking
U.S. regulatory agencies receive and process a large number of public comments every day
–By law, they need to read each of them
Many of them are “Form Letters”
–Comments are generated from form letters provided by online special interest groups
  http://www.moveon.org
  http://www.getactive.com
Need to automate the duplicate detection process and save human effort

Editing Styles
Block Added: Add one or more paragraphs (<200 words) to a document;
Block Deleted: Remove one or more paragraphs (<200 words) from a document;
Key Block: Contains at least one paragraph from a document;
Minor Change: A few words altered within a paragraph (<5% or 15 word change in a paragraph);
Minor Change & Block Edit: A combination of minor change and block edit;
Block Reordering: Reorder the same set of paragraphs;
Repeated: Repeat the entire document several times in another document;
Bag-of-word similar: >80% word overlap (not in above categories); and
Exact: 100% word overlap.

“Key Block” Problem

Need More Flexible Framework
Need to use additional knowledge from the document collection
Instance-level Constrained Clustering
–A semi-supervised clustering approach to incorporate additional knowledge
  Document attributes
  Content structure
  Pair-wise relationships

Instance-level Constrained Clustering
Instance-level Constraints
–Pair-wise
–Easy to generate
–Cannot generate class labels
–Weaker condition than semi-supervised classification
Types of Constraints
–Must-links, cannot-links, family-links

Must-links
Two instances must be in the same cluster
Created when a document shows
–complete containment of the reference copy (key block), or
–word overlap > 95% with it (minor change).

Cannot-links
Two instances cannot be in the same cluster
Created when two documents
–cite different docket identification numbers
  (people submitted comments to the wrong place)

Family-links
Two instances are likely to be in the same cluster
Created when two documents have
–the same email relayer,
–similar file sizes, or
–the same footer block.
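The creation rules for the three link types can be sketched as a single pairwise check. This is a minimal illustration, not the authors' implementation: the `Doc` fields, the Jaccard-style `word_overlap`, the 10% file-size tolerance, and the order in which the rules are tested are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Doc:
    # Hypothetical record; field names are assumptions for illustration.
    doc_id: str
    text: str
    docket_id: str
    relayer: str
    footer: str


def word_overlap(a: str, b: str) -> float:
    """Vocabulary overlap as |A & B| / |A | B| (an assumed definition)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)


def pair_constraint(ref: Doc, doc: Doc, size_tolerance: float = 0.1) -> Optional[str]:
    """Return 'cannot', 'must', 'family', or None for a (reference copy, document) pair."""
    # Cannot-link: the two documents cite different docket identification numbers.
    if ref.docket_id != doc.docket_id:
        return "cannot"
    # Must-link: the reference copy is fully contained (key block),
    # or word overlap exceeds 95% (minor change).
    if ref.text in doc.text or word_overlap(ref.text, doc.text) > 0.95:
        return "must"
    # Family-link: same email relayer, same footer block, or similar file size.
    longest = max(len(ref.text), len(doc.text), 1)
    similar_size = abs(len(ref.text) - len(doc.text)) <= size_tolerance * longest
    if ref.relayer == doc.relayer or ref.footer == doc.footer or similar_size:
        return "family"
    return None
```

Checking the cannot-link rule first means a docket mismatch overrides any text similarity; the slides do not say which rule takes precedence, so that ordering is a design choice of this sketch.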

Must-links Group the Corrects [Figure: documents marked +]

Cannot-links Push Away Wrongs [Figure: documents marked + and -]

Family-links Attract the Similars [Figure: documents marked +]

Constraint Transitive Closure
An initial set of constraints is created for pairs of documents
Taking transitive closure over the constraints
–Must-link transitive closure:
  d_a =_m d_b, d_b =_m d_c => d_a =_m d_c
–Cannot-link transitive closure:
  d_a =_c d_b, d_b =_m d_c => d_a =_c d_c
–Family-link transitive closure:
  d_a =_f d_b, d_b =_m d_c => d_a =_f d_c
  d_a =_f d_b, d_b =_c d_c => d_a =_c d_c
  d_a =_f d_b, d_b =_f d_c => d_a =_f d_c
(=_m, =_c and =_f indicate must-link, cannot-link and family-link respectively.)
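The five closure rules can be applied repeatedly until no new link appears. The sketch below is a simple fixed-point loop written for clarity rather than speed (it is cubic in the number of documents per pass). Two things are assumptions of the example, not statements from the slides: links are treated as symmetric, and when two rules disagree on a pair the stronger link wins in the order must > cannot > family.

```python
from itertools import permutations

# Composition table taken from the closure rules:
# (link between a and b, link between b and c) -> link between a and c
COMPOSE = {
    ("m", "m"): "m",
    ("c", "m"): "c",
    ("f", "m"): "f",
    ("f", "c"): "c",
    ("f", "f"): "f",
}

# Assumed conflict resolution: a derived link only replaces a weaker one.
RANK = {"f": 1, "c": 2, "m": 3}


def transitive_closure(links):
    """links maps frozenset({a, b}) -> 'm' | 'c' | 'f'; returns the closed mapping."""
    closed = dict(links)
    docs = {d for pair in closed for d in pair}
    changed = True
    while changed:
        changed = False
        for a, b, c in permutations(docs, 3):
            ab = closed.get(frozenset((a, b)))
            bc = closed.get(frozenset((b, c)))
            if ab is None or bc is None:
                continue
            # Links are symmetric, so look the pair up in either order.
            derived = COMPOSE.get((ab, bc)) or COMPOSE.get((bc, ab))
            if derived is None:
                continue  # e.g. two cannot-links imply nothing
            key = frozenset((a, c))
            current = closed.get(key)
            if current is None or RANK[derived] > RANK[current]:
                closed[key] = derived
                changed = True
    return closed
```

For a large collection, a union-find structure over must-links followed by lifting cannot-links and family-links to whole components would be the usual way to avoid the cubic loop.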

Constraint Transitive Closure Example:

Document-Space With Initial Links [Figure. Legend: F = form letter; cannot-link; must-link; family-link]

Document-Space After Link Propagation [Figure. Legend: F = form letter; cannot-link; must-link; family-link]

Incorporating the Constraints
When forming clusters,
–if two documents have a must-link, they must be put into the same group, even if their text similarity is low
–if two documents have a cannot-link, they cannot be put into the same group, even if their text similarity is high
–if two documents have a family-link, increase their text similarity score, so that their chance of being in the same group increases.
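The three rules above amount to a small decision function applied when a document is compared against a cluster member. The following is a sketch only: the threshold of 0.8, the additive `family_boost` of 0.2, and the function name are hypothetical values chosen for illustration, since the slides do not give the actual similarity adjustment.

```python
def should_merge(text_sim, link=None, threshold=0.8, family_boost=0.2):
    """Decide whether two documents go in the same group.

    text_sim: text similarity in [0, 1]
    link: 'must', 'cannot', 'family', or None
    """
    if link == "must":
        return True   # grouped even if text similarity is low
    if link == "cannot":
        return False  # kept apart even if text similarity is high
    if link == "family":
        # Raise the score so the pair is more likely to pass the threshold.
        text_sim = min(1.0, text_sim + family_boost)
    return text_sim >= threshold
```

Must-links and cannot-links act as hard overrides, while a family-link only shifts the score, so a family-linked pair with very different text can still end up in separate groups.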

Redundancy-based Reference Copy Detection
Apply a hash function to the document string (all words in a document concatenated together)
–NIST’s Secure Hash Algorithm: SHA-1
–Each document gets one hash value; identical document strings get the same value
Sort the tuples by the hash value
–Same hash values stay together
Linear scan over the sorted list
–Same hash value indicates exact duplicates
The reference copy is the document with the earliest timestamp in an exact-duplicate group of size bigger than 5
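The hash, sort, and linear-scan steps can be sketched with Python's standard `hashlib` and `itertools.groupby`. The input layout of `(doc_id, timestamp, text)` tuples, joining words with single spaces, and the function name are assumptions for this example; the slides only specify SHA-1, sorting by hash value, and the group-size cutoff.

```python
import hashlib
from itertools import groupby


def find_reference_copies(docs, min_duplicates=5):
    """docs: iterable of (doc_id, timestamp, text) with comparable timestamps.

    Returns the doc_id of the earliest document in every exact-duplicate
    group whose size is bigger than min_duplicates.
    """
    tuples = []
    for doc_id, timestamp, text in docs:
        # Document string: all words concatenated (space-joined here, an assumption).
        doc_string = " ".join(text.split())
        digest = hashlib.sha1(doc_string.encode("utf-8")).hexdigest()
        tuples.append((digest, timestamp, doc_id))

    # Sorting by hash puts exact duplicates next to each other;
    # within a hash, the earliest timestamp comes first.
    tuples.sort()

    references = []
    # Linear scan over the sorted list, one group per hash value.
    for _, group in groupby(tuples, key=lambda t: t[0]):
        members = list(group)
        if len(members) > min_duplicates:
            references.append(members[0][2])
    return references
```

Sorting on the full tuple is what makes the first member of each group the earliest submission, so no second pass over the group is needed.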

Evaluation
Assessors (from a coding lab at the University of Pittsburgh) manually organized documents into near-duplicate clusters
Compare human-human agreement to human-computer agreement

Experimental Results
–Comparing with human-human intercoder agreement
–Metric: AC1
–A modified version of Kappa

Experimental Results
–Comparing with other duplicate detection algorithms
–Metric: F1

Impact of Instance-level Constraints Number of Constraints vs. F1.

Conclusion
Near-duplicate detection on large public comment datasets is practical
Instance-based constrained clustering/semi-supervised clustering
–Efficient
–Greater control over the clustering
–Encourages use of other forms of evidence
–Easily applied to other datasets

Thank You! Questions?
