Detecting Phrase-Level Duplication on the World Wide Web

1 Detecting Phrase-Level Duplication on the World Wide Web
Fetterly, Manasse, Najork
Paper Presentation by: Vinay Goel

2 Introduction
Problem: identify instances of “slice and dice” generation
Example: German spammer
1 million URLs originating from a single IP (but using many host names)
Pages changed completely on every download
Pages consisted of grammatically well-formed sentences stitched together at random

3 Goal
Find instances of sentence-level synthesis of web pages
More generally, find pages with an unusually large number of popular phrases

4 The Data
Datasets: DS1, DS2
DS1: BFS crawl starting at www.yahoo.com; 151 million HTML pages
DS2: large crawl conducted by MSN Search; 96 million HTML pages chosen at random

5 Finding Phrase Replication
Sampling
Reduce each document to a feature vector
Employ a variant of the shingling algorithm of Broder et al.
Significantly reduces the data volume

6 Sampling method
Replace all HTML markup by white-space
k-phrases of a document: all sequences of k consecutive words
Treat the document as a circle: the last word is followed by the first word
An n-word document has exactly n phrases
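The circular k-phrase definition above can be sketched in a few lines of Python. This is only an illustration: the function name `k_phrases` and the regex used to blank out markup are assumptions, not part of the presentation.

```python
import re

def k_phrases(text, k):
    """Return the k-phrases of a document treated as a circle of words."""
    # Assumption: a simple regex stands in for real HTML parsing;
    # every tag is replaced by white-space before splitting into words.
    words = re.sub(r"<[^>]*>", " ", text).split()
    n = len(words)
    if n == 0:
        return []
    # Phrase i starts at word i and wraps around past the last word,
    # so an n-word document yields exactly n phrases.
    return [tuple(words[(i + j) % n] for j in range(k)) for i in range(n)]
```

For a four-word document and k = 2, the last phrase pairs the final word with the first one, which is what makes the phrase count equal the word count.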

7 Sampling method
Exploit properties of Rabin fingerprints
Rabin fingerprints support efficient extension and prefix deletion
Fingerprints of distinct bit patterns are distinct with high probability
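To make "extension and prefix deletion" concrete, here is a minimal sketch using a polynomial rolling hash over integers modulo a prime. This is a stand-in chosen for readability, not a true Rabin fingerprint (which works with polynomials over GF(2)); `BASE`, `MOD`, and all function names are assumptions made for illustration.

```python
BASE = 257
MOD = (1 << 61) - 1  # a Mersenne prime; an illustrative choice

def extend(fp, token):
    """Fingerprint of a sequence after appending one token."""
    return (fp * BASE + token) % MOD

def delete_prefix(fp, first_token, length):
    """Fingerprint after dropping the first token of a length-token sequence."""
    return (fp - first_token * pow(BASE, length - 1, MOD)) % MOD

def fingerprint(tokens):
    """Fingerprint of a whole token sequence, built by repeated extension."""
    fp = 0
    for t in tokens:
        fp = extend(fp, t)
    return fp

def sliding_fingerprints(tokens, k):
    """Fingerprints of every window of k tokens, updated in O(1) per step."""
    fp = fingerprint(tokens[:k])
    out = [fp]
    for i in range(k, len(tokens)):
        fp = delete_prefix(fp, tokens[i - k], k)  # drop the oldest token
        fp = extend(fp, tokens[i])                # append the newest token
        out.append(fp)
    return out
```

The point of the two operations is that each successive phrase fingerprint is derived from the previous one in constant time, instead of rehashing all k tokens.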

8 Computing feature vectors
Fingerprint each word in the document: gives n tokens
Compute the fingerprint of each k-token phrase: gives n phrase fingerprints
Apply m different fingerprint functions
Retain the smallest of the n resulting values for each function
Vector of m fingerprints is representative of the document (elements referred to as shingles)
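The steps on this slide can be sketched as follows. This is a hedged illustration, not the authors' implementation: salted BLAKE2b hashes from Python's `hashlib` stand in for Rabin fingerprints, and the names `fp64` and `feature_vector` as well as the default values of `k` and `m` are assumptions.

```python
import hashlib

def fp64(data: bytes, salt: int = 0) -> int:
    """64-bit fingerprint; different salts act as different fingerprint functions."""
    h = hashlib.blake2b(data, digest_size=8, salt=salt.to_bytes(8, "big"))
    return int.from_bytes(h.digest(), "big")

def feature_vector(words, k=5, m=8):
    """Reduce a word list to m shingles (k and m are illustrative defaults)."""
    n = len(words)
    if n == 0:
        return []
    # Step 1: fingerprint each word, giving n tokens.
    tokens = [fp64(w.encode("utf-8")) for w in words]
    # Step 2: fingerprint each circular k-token phrase, giving n phrase fingerprints.
    phrase_fps = []
    for i in range(n):
        raw = b"".join(tokens[(i + j) % n].to_bytes(8, "big") for j in range(k))
        phrase_fps.append(fp64(raw).to_bytes(8, "big"))
    # Steps 3-4: apply m functions and keep the smallest value under each one.
    return [min(fp64(p, salt=i + 1) for p in phrase_fps) for i in range(m)]
```

Because the document is treated as a circle, rotating the word list leaves the set of phrases, and therefore the feature vector, unchanged.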

9 Duplicate Suppression
Replication is rampant on the web
Clustered all pages in the data set into equivalence classes
Each class contains all pages that are exact or near duplicates of one another
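One way to picture the grouping into equivalence classes is a union-find pass over feature vectors. The slides do not state the near-duplicate criterion, so the rule used here (two pages are linked when their vectors agree in at least `min_matches` positions) is purely an assumption, as are the function name and the quadratic pairwise comparison, which would not scale to a real crawl.

```python
def cluster_near_duplicates(vectors, min_matches):
    """Group document indices into classes via union-find.

    Assumption: documents are linked when their feature vectors agree in at
    least min_matches positions; the slides do not give the actual rule.
    """
    parent = list(range(len(vectors)))

    def find(x):
        # Walk to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a in range(len(vectors)):
        for b in range(a + 1, len(vectors)):
            matches = sum(x == y for x, y in zip(vectors[a], vectors[b]))
            if matches >= min_matches:
                parent[find(a)] = find(b)  # merge the two classes

    classes = {}
    for d in range(len(vectors)):
        classes.setdefault(find(d), []).append(d)
    return sorted(classes.values())
```

Union-find makes the relation transitive: if page A is linked to B and B to C, all three land in one class even when A and C are not directly linked.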

10 Popular phrases
Occur in more documents than would be expected by chance
Assumptions:
“Normal” web pages are characterized by a generative model
Sought web pages follow a copying model (need to consider number of phrases, length of typical documents…)

11 Popular Phrases
Limit attention to the shingles chosen by the sampling functions
A phrase is popular if it is selected as a shingle in sufficiently many documents
To determine popular phrases, consider triplets (i,s,d)
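A small sketch of the counting step follows. The slide does not spell out the triplet, so this assumes i is the index of the sampling function, s the shingle value, and d the document identifier; the function name and the `threshold` parameter are likewise hypothetical.

```python
from collections import Counter

def popular_shingles(feature_vectors, threshold):
    """Count documents per (i, s) pair and keep pairs seen in >= threshold documents.

    feature_vectors maps a document id d to its list of shingles, so each
    (i, s, d) triplet is position i, value s, document d (an assumed reading).
    """
    counts = Counter()
    for d, vector in feature_vectors.items():
        for i, s in enumerate(vector):
            counts[(i, s)] += 1  # one triplet (i, s, d) per position and document
    return {key: c for key, c in counts.items() if c >= threshold}
```

For example, with three documents whose first shingle is identical, the pair for position 0 reaches a count of 3 and is reported as popular for any threshold up to 3.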

12 Popular Phrases
The first 24 most popular phrases are not very interesting
Starting from the 36th phrase, discover phrases caused by machine-generated content
Templatic form: common text, “fill in the blank” slots and optional
60th phrase: instance of an idiomatic phrase

13 Zipfian Distribution

14 Histogram of popular shingles per doc

15 Covering set
Covering sets for the shingles of each page
Approximate a minimum covering set using a greedy heuristic
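The greedy heuristic can be sketched as the standard greedy set-cover loop. This assumes a covering set for a page is a set of other pages whose shingles together include the page's shingles; that reading, the function name, and the data layout are assumptions rather than details given on the slide.

```python
def greedy_covering_set(target, others):
    """Greedily pick pages from `others` until the target's shingles are covered.

    target: set of shingles of the page under study.
    others: dict mapping a page id to that page's set of shingles.
    Returns (chosen page ids, shingles that no other page contains).
    """
    uncovered = set(target)
    cover = []
    while uncovered:
        # Pick the page that covers the most still-uncovered shingles.
        best = max(others, key=lambda d: len(uncovered & others[d]), default=None)
        if best is None or not (uncovered & others[best]):
            break  # nothing left can cover the remaining shingles
        cover.append(best)
        uncovered -= others[best]
    return cover, uncovered
```

Greedy selection does not guarantee a minimum cover, but it is the usual approximation for set cover, which is NP-hard to solve exactly.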

16 Distribution of covering set sizes

17 German spammer

18 Looking for likely sources

19 Conclusion
Power law distribution
Popular phrases
Often limited by design choices
Legal disclaimers
Navigational phrases
“Fill in the blanks”
More replicated than original content

