Detecting Phrase-Level Duplication on the World Wide Web

Paper by Fetterly, Manasse, and Najork. Presentation by Vinay Goel.

Introduction
Problem: identify instances of “slice and dice” generation of web pages.
Example: a German spammer. 1 million URLs originated from a single IP address (but used many host names). Pages changed completely on every download. Pages consisted of grammatically well-formed sentences stitched together at random.

Goal: find instances of sentence-level synthesis of web pages and, more generally, pages with an unusually large number of popular phrases.

The Data
DS1: BFS crawl starting at www.yahoo.com; 151 million HTML pages.
DS2: large crawl conducted by MSN Search; 96 million HTML pages chosen at random.

Finding Phrase Replication
Sampling: reduce each document to a feature vector. Employ a variant of the shingling algorithm of Broder et al. This significantly reduces the data volume.

Sampling method: replace all HTML markup by white-space. The k-phrases of a document are all sequences of k consecutive words. Treat the document as a circle: the last word is followed by the first word, so an n-word document has exactly n phrases.
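The circular k-phrase construction can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names `words_from_html` and `k_phrases` are hypothetical, and the regular expression is a simplified stand-in for real HTML markup removal.

```python
import re

def words_from_html(html):
    """Replace markup with white-space, then split into words.

    Assumption: a crude tag-stripping regex is good enough for illustration;
    a real crawler would use a proper HTML parser.
    """
    return re.sub(r"<[^>]+>", " ", html).split()

def k_phrases(words, k):
    """Return every run of k consecutive words, wrapping around the end.

    Because the document is treated as a circle, an n-word document
    yields exactly n phrases regardless of k.
    """
    n = len(words)
    if n == 0:
        return []
    return [tuple(words[(i + j) % n] for j in range(k)) for i in range(n)]

words = words_from_html("<p>the quick</p> <b>brown</b> fox")
for phrase in k_phrases(words, 3):
    print(" ".join(phrase))
```

The wrap-around is what makes the phrase count independent of k: the last phrases simply borrow words from the start of the document.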

Sampling method: exploit properties of Rabin fingerprints. Rabin fingerprints support efficient extension and prefix deletion. Fingerprints of distinct bit patterns are distinct with high probability.
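To show why extension and prefix deletion matter for sliding a k-token window, here is a hedged sketch using a polynomial rolling hash modulo a prime. This is a swapped-in technique, not a true Rabin fingerprint (which works with polynomials over GF(2)); it only mirrors the two operations. The class name `RollingFingerprint` and the constants `P` and `B` are assumptions chosen for illustration.

```python
from collections import deque

P = (1 << 61) - 1   # a Mersenne prime used as the modulus (illustrative choice)
B = 1_000_003       # base of the polynomial hash (illustrative choice)

class RollingFingerprint:
    """Fingerprint of a token window, updatable at both ends."""

    def __init__(self):
        self.tokens = deque()
        self.value = 0

    def extend(self, token):
        """Append one integer token to the end of the window."""
        self.tokens.append(token)
        self.value = (self.value * B + token) % P

    def delete_prefix(self):
        """Drop the first token of the window."""
        first = self.tokens.popleft()
        # After popleft, len(self.tokens) is the exponent the first token carried.
        self.value = (self.value - first * pow(B, len(self.tokens), P)) % P

# Slide a 3-token window over a token sequence.
data = [3, 1, 4, 1, 5, 9]
fp = RollingFingerprint()
for t in data[:3]:
    fp.extend(t)
for t in data[3:]:
    fp.delete_prefix()
    fp.extend(t)
    print(list(fp.tokens), fp.value)
```

With these two operations, each new phrase fingerprint is obtained from the previous one instead of re-hashing all k tokens.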

Computing feature vectors
Fingerprint each word in the document, giving n tokens. Compute the fingerprint of each k-token phrase, giving n phrase fingerprints. Apply m different fingerprint functions to the phrases. Retain the smallest of the n resulting values for each function. The resulting vector of m fingerprints is representative of the document; its elements are referred to as shingles.
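The steps above amount to a min-wise sampling scheme, which can be sketched as follows. This is an illustration under stated assumptions: salted BLAKE2b from Python's `hashlib` stands in for the m Rabin fingerprint functions, the names `fingerprint` and `feature_vector` are hypothetical, and the defaults for `k` and `m` are arbitrary small values, not the parameters used in the paper.

```python
import hashlib

def fingerprint(data, seed):
    """One of m hash functions, selected by seed (BLAKE2b stand-in)."""
    digest = hashlib.blake2b(
        data, digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "big")

def feature_vector(words, k=3, m=4):
    """Return m shingles: for each function, the smallest phrase fingerprint."""
    n = len(words)
    if n == 0:
        return []
    # Circular k-phrases: an n-word document has exactly n of them.
    phrases = [
        " ".join(words[(i + j) % n] for j in range(k)).encode("utf-8")
        for i in range(n)
    ]
    return [min(fingerprint(p, seed) for p in phrases) for seed in range(m)]

doc = "pages consisted of well formed sentences stitched together at random".split()
print(feature_vector(doc))
```

Because each element is a minimum over the same set of phrases, two documents that share most of their phrases tend to agree in many vector positions, which is what makes the vector usable as a compact summary.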

Duplicate Suppression
Replication is rampant on the web. All pages in the data set were clustered into equivalence classes; each class contains all pages that are exact or near duplicates of one another.

Popular phrases: phrases that occur in more documents than would be expected by chance. Assumptions: “normal” web pages are characterized by a generative model; the sought web pages follow a copying model (need to consider number of phrases, length of typical documents…)

Popular Phrases: limit attention to the shingles chosen by the sampling functions. A phrase is popular if it is selected as a shingle in sufficiently many documents. To determine popular phrases, consider triplets (i,s,d).
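One way to read the triplets, assumed here since the slide does not spell it out, is that s is the shingle chosen by the i-th sampling function for document d. Grouping triplets by (i, s) and counting distinct documents then gives each shingle's popularity. The function name `popular_shingles` and the `threshold` parameter are hypothetical.

```python
from collections import defaultdict

def popular_shingles(vectors, threshold):
    """Count documents per (i, s) pair and keep the popular ones.

    vectors maps a document id d to its feature vector, so each entry
    contributes the triplets (i, s, d) for i = 0 .. m-1.
    """
    docs_by_key = defaultdict(set)
    for d, vector in vectors.items():
        for i, s in enumerate(vector):
            docs_by_key[(i, s)].add(d)
    return {
        key: len(docs)
        for key, docs in docs_by_key.items()
        if len(docs) >= threshold
    }

# Toy feature vectors with m = 3; the integers stand in for fingerprints.
vectors = {
    "doc1": [11, 20, 30],
    "doc2": [11, 21, 30],
    "doc3": [11, 22, 31],
}
print(popular_shingles(vectors, threshold=2))
```

At web scale the grouping would be done by sorting the triplets on disk rather than in an in-memory dictionary; the dictionary is used here only to keep the sketch short.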

Popular Phrases: the first 24 most popular phrases are not very interesting. Starting from the 36th phrase, we discover phrases caused by machine-generated content.
Templatic form: common text, “fill in the blank” slots and optional
60th phrase: an instance of an idiomatic phrase.

Zipfian Distribution

Histogram of popular shingles per doc

Covering set: compute covering sets for the shingles of each page. Approximate a minimum covering set using a greedy heuristic.
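A standard greedy set-cover heuristic is sketched below as one plausible reading of this step; the slide does not give the exact procedure, so the details are assumptions. Here `target` is the set of shingles of the page being examined, `others` maps candidate page ids to their shingle sets, and the function name `greedy_cover` is hypothetical.

```python
def greedy_cover(target, others):
    """Greedily pick pages that cover the most still-uncovered shingles.

    Returns the chosen page ids (in pick order) and any shingles that
    no candidate page covers.
    """
    uncovered = set(target)
    chosen = []
    while uncovered:
        # Page covering the largest number of remaining shingles;
        # ties go to the first such page in iteration order.
        best = max(others, key=lambda d: len(uncovered & others[d]), default=None)
        if best is None or not (uncovered & others[best]):
            break  # nothing left that any candidate can cover
        chosen.append(best)
        uncovered -= others[best]
    return chosen, uncovered

target = {1, 2, 3, 4, 5}
others = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5}, "d": {9}}
cover, leftover = greedy_cover(target, others)
print(cover, leftover)
```

A small covering set suggests the page was copied largely from a few sources, while a large one is consistent with sentences stitched together from many different pages.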

Distribution of covering set sizes

German spammer

Looking for likely sources

Conclusion
Power law distribution.
Popular phrases: often limited by design choices; legal disclaimers; navigational phrases; “fill in the blanks”.
More replicated than original content.
