SpotSigs Robust & Efficient Near Duplicate Detection in Large Web Collections Martin Theobald Jonathan Siddharth Andreas Paepcke Sigir 2008, Singapore.

Slides:



Advertisements
Similar presentations
EER to Relation Models Mapping
Advertisements

Monday, January 13, Instructor Development Lesson 9.
Monday, January 13, Instructor Development Lesson 6 Instructor Resources.
Rev Monday, January 13, Foundations, Technology, Skills Tools.
an innovative solution by central control
National Seminar on Developing a Program for the Implementation of the 2008 SNA and Supporting Statistics in Turkey Arzu TOKDEMİR 10 September 2013 Ankara.
Module 6 – Evaluation Methods and Techniques. 13/02/20142 Questions and criteria Methods and techniques Quality How the evaluation will be done Overview.
Incorporating Site-Level Knowledge for Incremental Crawling of Web Forums: A List-wise Strategy Jiang-Ming Yang, Rui Cai, Lei Zhang, and Wei-Ying Ma Microsoft.
Welcome Welcome to the next session in the professional development program focused around the 9-12 Mathematics Standards. 3/1/20141Geometry.
Song Intersection by Approximate Nearest Neighbours Michael Casey, Goldsmiths Malcolm Slaney, Yahoo! Inc.
SAP-Customizing SAP-Customizing.

Linear network code for erasure broadcast channel with feedback Presented by Kenneth Shum Joint work with Linyu Huang, Ho Yuet Kwan and Albert Sung 1Mar.
Object Recognition Using Locality-Sensitive Hashing of Shape Contexts Andrea Frome, Jitendra Malik Presented by Ilias Apostolopoulos.
A DIY Accessible Hardware Solution for Students with Disabilities | Ryan Vernon BC College and Institute.
Sampling of Coal Dr kalyan sen Director, Central Fuel Research Institute, Dhanbad, 2003.
6/2/20141 A Short Tour of our School. 6/2/20142 Hazelwood is a part of the Edmonds School District th Street SW Lynnwood, WA (425)
Virtual Network Embedding with Coordinated Node and Link Mapping N. M. Mosharaf Kabir Chowdhury Muntasir Raihan Rahman and Raouf Boutaba University of.
6/10/20141 Top-Down Clustering Method Based On TV-Tree Zbigniew W. Ras.
6/12/20141 Money and Banking Chapter Outline The Functions of Money The Functions of Money The Components of Money Supply The Components of Money.
Please, select a question: How does a personal account work? How to apply to a job offer? How to send a spontaneous application? How to recover your password?
8/25/ Click on Member Billing Icon 2 8/25/2014 Recent Updates to the Program 3.
UUCS Congregational Meeting December 5, /25/20141.
Indexing DNA Sequences Using q-Grams
Session Agenda  What is WebCRD?  The four ways to place an order  Placing an order from an application  Uploading a document  Placing a Catalog order.
10/12/20141Chem-160. Covalent Bonds 10/12/20142Chem-160.
10/22/20141 GDP and Economic Growth Chapter /22/20142 Outline Gross Domestic Product Gross Domestic Product Economic Growth Economic Growth.
Detecting Near Duplicates for Web Crawling Authors : Gurmeet Singh Manku, Arvind Jain, Anish Das Sarma Published in May 2007 Presented by : Shruthi Venkateswaran.
Chapter 5: Introduction to Information Retrieval
Latent Semantic Indexing (mapping onto a smaller space of latent concepts) Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 18.
Big Data Lecture 6: Locality Sensitive Hashing (LSH)
High Dimensional Search Min-Hashing Locality Sensitive Hashing
Ph.D. DefenceUniversity of Alberta1 Approximation Algorithms for Frequency Related Query Processing on Streaming Data Presented by Fan Deng Supervisor:
Page-level Template Detection via Isotonic Smoothing Deepayan ChakrabartiYahoo! Research Ravi KumarYahoo! Research Kunal PuneraUniv. of Texas at Austin.
Ph.D. SeminarUniversity of Alberta1 Approximation Algorithms for Frequency Related Query Processing on Streaming Data Presented by Fan Deng Supervisor:
Detecting Near Duplicates for Web Crawling Authors : Gurmeet Singh Mank Arvind Jain Anish Das Sarma Presented by Chintan Udeshi 6/28/ Udeshi-CS572.
1 Lecture 18 Syntactic Web Clustering CS
1 Hash-Based Indexes Chapter Introduction  Hash-based indexes are best for equality selections. Cannot support range searches.  Static and dynamic.
Near Duplicate Detection
Finding Similar Items. Set Similarity Problem: Find similar sets. Motivation: Many things can be modeled/represented as sets Applications: –Face Recognition.
SpotSigs Robust & Efficient Near Duplicate Detection in Large Web Collections Martin Theobald Jonathan Siddharth Andreas Paepcke Stanford University Sigir.
Finding Similar Items.
DETECTING NEAR-DUPLICATES FOR WEB CRAWLING Authors: Gurmeet Singh Manku, Arvind Jain, and Anish Das Sarma Presentation By: Fernando Arreola.
Detecting Near-Duplicates for Web Crawling Manku, Jain, Sarma
FINDING NEAR DUPLICATE WEB PAGES: A LARGE- SCALE EVALUATION OF ALGORITHMS - Monika Henzinger Speaker Ketan Akade 1.
Finding Similar Items 1 Wu-Jun Li Department of Computer Science and Engineering Shanghai Jiao Tong University Lecture 10: Finding Similar Items Mining.
Intelligent Database Systems Lab N.Y.U.S.T. I. M. SpotSigs: Robust and Efficient Near Duplicate Detection in Large Web Collections Presenter: Tsai Tzung.
Scott Wen-tau Yih (Microsoft Research) Joint work with Hannaneh Hajishirzi (University of Illinois) Aleksander Kolcz (Microsoft Bing)
Similarity Searching in High Dimensions via Hashing Paper by: Aristides Gionis, Poitr Indyk, Rajeev Motwani.
Finding Near-Duplicate Web Pages: A Large-Scale Evaluation of Algorithms Author: Monika Henzinger Presenter: Chao Yan.
© 2006 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice Applying Syntactic Similarity Algorithms.
Document duplication (exact or approximate) Paolo Ferragina Dipartimento di Informatica Università di Pisa Slides only!
DATA MINING LECTURE 6 Sketching, Min-Hashing, Locality Sensitive Hashing.
CS425: Algorithms for Web Scale Data Most of the slides are from the Mining of Massive Datasets book. These slides have been modified for CS425. The original.
KNN & Naïve Bayes Hongning Wang Today’s lecture Instance-based classifiers – k nearest neighbors – Non-parametric learning algorithm Model-based.
SpotSigs: Robust and Efficient Near Duplicate Detection in Large Web Collections Martin Theobald, Jonathan Siddharth, and Andreas Paepcke Dept. of Computer.
1 Text Categorization  Assigning documents to a fixed set of categories  Applications:  Web pages  Recommending pages  Yahoo-like classification hierarchies.
KNN & Naïve Bayes Hongning Wang
Optimizing Parallel Algorithms for All Pairs Similarity Search
Indexing & querying text
Near Duplicate Detection
Finding Near-Duplicate Web Pages: A Large-Scale Evaluation of Algorithms By Monika Henzinger Presented.
Accessing nearby copies of replicated objects
Hash-Based Indexes Chapter 10
Locality Sensitive Hashing
Detecting Phrase-Level Duplication on the World Wide Web
Hash-Based Indexes Chapter 11
Minwise Hashing and Efficient Search
Hash-Based Indexes Chapter 11
Chapter 11 Instructor: Xin Zhang
Presentation transcript:

SpotSigs Robust & Efficient Near Duplicate Detection in Large Web Collections Martin Theobald Jonathan Siddharth Andreas Paepcke Sigir 2008, Singapore Stanford University … or Are stopwords finally good for something? (Standord InfoBlog entry, search for SpotSigs) … or Are stopwords finally good for something? (Standord InfoBlog entry, search for SpotSigs)

Near-Duplicate News Articles (I) June 1, SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections

Near-Duplicate News Articles (II) June 1, SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections

Stanford WebBase Project June 1, SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections Long-running project for Web archival at Stanford – Periodic, selective crawls of the Web, in particular news sites, government sites, etc. – E.g.: daily crawls of 350 news sites after Hurricane Katrina, Virginia Tech shooting, U.S. elections 2008 – 117 TB data (incl. media files), 1.5 TB text (since 2001) – Can request special-interest crawls – Also early Google crawls used WebBase (late 90s)

Our Tagging Application June 1, SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections

… but Many different news sites get their core articles delivered by the same sources (e.g., Associated Press) Even within a news site, often more than 30% of articles are near duplicates (dynamically created content, navigational pages, advertisements, etc.) Near-duplicate detection muuuuch harder than exact-duplicate detection! June 1, SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections

What does SpotSigs do? Robust signature extraction – Stopword-based signatures favor natural-language contents of web pages over navigational banners and advertisements Efficient near-duplicate matching – Self-tuning, highly parallelizable clustering algorithm – Threshold-based collection partitioning and inverted index pruning June 1, SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections

Case (I): Whats different about the core contents? = stopword occurrences: the, that, {be}, {have} June 1, SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections

Case(II): Do not consider for deduplication! no occurrences of: the, that, {be}, {have} June 1, SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections

Spot Signature Extraction Localized Spot Signatures: n-grams close to a stopword antecedent – E.g.: that:presidential:campaign:hit antecedent nearby n-gram Spot Signature s June 1, SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections

Spot Signature Extraction Localized Spot Signatures: n-grams close to a stopword antecedent – E.g.: that:presidential:campaign:hit – Parameters: Predefined list of (stopword) antecedents Spot distance d, chain length c Spot Signatures occur uniformly and frequently throughout any piece of natural-language text Hardly occur in navigational web page components or ads Spot Signatures occur uniformly and frequently throughout any piece of natural-language text Hardly occur in navigational web page components or ads June 1, SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections

Signature Extraction Example Consider the text snippet: At a rally to kick off a weeklong campaign for the South Carolina primary, Obama tried to set the record straight from an attack circulating widely on the Internet that is designed to play into prejudices against Muslims and fears of terrorism. (for antecedents {a, the, is}, uniform spot distance d=1, chain length c=2) June 1, SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections S = {a:rally:kick, a:weeklong:campain, the:south:carolina, the:record:straight, an:attack:circulating, the:internet:designed, is:designed:play}

Signature Extraction Algorithm Simple & efficient sliding window technique O(|tokens|) runtime Largely independent of input format (maybe remove markup) No expensive and error-prone layout analysis required O(|tokens|) runtime Largely independent of input format (maybe remove markup) No expensive and error-prone layout analysis required June 1, SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections

Choice of Antecedents F1 measure for different stopword antecedents over a Gold Set of 2,160 manually selected near-duplicate news articles, 68 topics June 1, SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections

June 1, Done?

How to deduplicate a large collection efficiently? Given {S 1,…,S N } Spot Signature sets – For each S i, find all similar signature sets S i1,…,S ik with similarity sim(S i, S ij ) τ Common similarity measures: – Jaccard, Cosine, Kullback-Leibler, … Common matching algorithms: – Various clustering techniques, similarity hashing, … June 1, SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections

Which documents (not) to compare? Idea: Two signature sets A, B can only have high (Jaccard) similarity if they are of similar cardinality! Given 3 Spot Signature sets: A with |A| = 345 B with |B| = 1045 C with |C| = 323 Which pairs would you compare first? Which pairs could you spare? ? June 1, SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections

June 1, ) Pick a specific similarity measure: Jaccard 2) Develop an upper bound for the similarity of two documents based on a simple property: signature set lengths 3) Partition the collection into near-duplicate candidates using this bound 4) For each document: process only the relevant partitions and prune as early as possible General Idea: Divide & Conquer

Upper bound for Jaccard Consider Jaccard similarity Upper bound Never compare signature sets A, B with |A|/|B| (1-τ) |B| Never compare signature sets A, B with |A|/|B| (1-τ) |B| (for |B||A|, w.l.o.g.) June 1, SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections

Multi-set Generalization Consider weighted Jaccard similarity Upper bound Still skip pairs A, B with |B|-|A| > (1-τ) |B| |A||B| June 1, SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections

Partitioning the Collection #sigs per doc … but: there are many possible partitionings, s.t. (A) any similar pair is (at most) mapped into two neighboring partitions … but: there are many possible partitionings, s.t. (A) any similar pair is (at most) mapped into two neighboring partitions Given a similarity threshold τ, there is no contiguous partitioning (based on signature set lengths), s.t. (A) any potentially similar pair is within the same partition, and (B) any non-similar pair cannot be within the same partition June 1, SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections SiSi SjSj ?

Partitioning the Collection #sigs per doc … but: there are many possible partitionings, s.t. (A) any similar pair is (at most) mapped into two neighboring partitions … but: there are many possible partitionings, s.t. (A) any similar pair is (at most) mapped into two neighboring partitions Given a similarity threshold τ, there is no contiguous partitioning (based on signature set lengths), s.t. (A) any potentially similar pair is within the same partition, and (B) any non-similar pair cannot be within the same partition June 1, SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections SiSi SjSj Also: Partition widths should be a function of τ

Optimal Partitioning Given τ, find partition boundaries p 0,…,p k, s.t. (A) all similar pairs (based on signature length) are mapped into at most two neighboring partitions (no false negatives) (B) no non-similar pair (based on signature length) is mapped into the same partition (no false positives) (C) all partitions widths are minimized w.r.t. (A) & (B) (minimality) But still expensive to solve exactly … June 1, SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections

Approximate Solution Converges to optimal partitioning when distribution is dense Web collections typically skewed towards shorter document lengths Progressively increasing bucket widths are even beneficial for more uniform bucket sizes (next slide!) Converges to optimal partitioning when distribution is dense Web collections typically skewed towards shorter document lengths Progressively increasing bucket widths are even beneficial for more uniform bucket sizes (next slide!) June 1, SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections Starting with p 0 = 1, for any given p k, choose p k+1 as the smallest integer p k+1 > p k s.t. p k+1 p k > (1 τ) p k+1 E.g. (for τ=0.7): p 0 =1, p 1 =3, p 2 =6, p 3 =10,…, p 7 =43, p 8 =59,…

Partitioning Effects Optimal partitioning approach even smoothes skewed bucket sizes (plot for 1,274,812 TREC WT10g docs with at least 1 Spot Signature) Uniform bucket widthsProgressively increasing bucket widths #docs #sigs #docs #sigs June 1, SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections

… but Comparisons within partitions still quadratic! Can do better: – Create auxiliary inverted indexes within partitions – Prune inverted index traversals using the very same threshold-based pruning condition as for partitioning Can do better: – Create auxiliary inverted indexes within partitions – Prune inverted index traversals using the very same threshold-based pruning condition as for partitioning June 1, SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections

Inverted Index Pruning Pass 1: – For each partition, create an inverted index as follows: For each Spot Signature s j – Create inverted list L j with pointers to documents d i containing s j – Sort inverted list in descending order of freq i (s j ) in d i Pass 1: – For each partition, create an inverted index as follows: For each Spot Signature s j – Create inverted list L j with pointers to documents d i containing s j – Sort inverted list in descending order of freq i (s j ) in d i Pass 2: – For each document d i, find its partition, then: Process lists in descending order of |L j | Maintain two thresholds: δ 1 – Minimum length distance to any document in the next list δ 2 – Minimum length distance to next document within the current list Break if δ 1 + δ 2 > (1- τ )|d i |, also iterate into right neighbor partition Pass 2: – For each document d i, find its partition, then: Process lists in descending order of |L j | Maintain two thresholds: δ 1 – Minimum length distance to any document in the next list δ 2 – Minimum length distance to next document within the current list Break if δ 1 + δ 2 > (1- τ )|d i |, also iterate into right neighbor partition June 1, SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections the:campaign an:attack d 7 :8 d 6 :6d 7 :4 … …d 2 :6d 5 :3d 1 :3 d 1 :5d 5 :4 Partition k

Deduplication Example Deduplicate d 1 S 3 : 1) δ 1 =0, δ 2 =1 sim(d 1,d 3 ) = 0.8 2) d 1 =d 1 continue 3) δ 1 =4, δ 2 =0 break! June 1, SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections pkpk p k+1 … … d 3 :5 d 2 :8d 3 :4d 1 :5 d 1 :4 Partition k s3s3 s1s1 d 3 :5d 3 :4d 1 :4 s2s2 δ 2 =1 δ 1 =4 Given: d 1 = {s 1 :5, s 2 :4, s 3 :4}, |d 1 |=13 d 2 = {s 1 :8, s 2 :4}, |d 2 |=12 d 3 = {s 1 :4, s 2 :5, s 3 :5}, |d 3 |=13 Threshold: τ = 0.8 Break if: δ 1 + δ 2 > (1- τ)|d i |

SpotSigs Deduplication Algorithm Still O(n 2 m) worst case runtime Empirically much better, may outperform hashing Tuning parameters: none Still O(n 2 m) worst case runtime Empirically much better, may outperform hashing Tuning parameters: none June 1, SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections

Experiments Collections – Gold Set of 2,160 manually selected near-duplicate news articles from various news sites, 68 clusters – TREC WT10g reference collection (1.6 Mio Web docs) Hardware – Dual Xeon 3GHz, 32 GB RAM – 8 threads for sorting, hashing & deduplication For all approaches – Remove HTML markup – Simple IDF filter for signatures, remove most frequent & infrequent signatures June 1, SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections

Competitors Shingling [Broder, Glassman, Manasse & Zweig 97] – N-gram sets/vectors compared with Jaccard/Cosine similarity in between O (n 2 m) and O (n m) runtime (if using LSH for matching) I-Match [Chowdhury, Frieder, Grossman & McCabe 02] – Employs a single SHA-1 hash function – Hardly tunable O (n m) runtime Locality Sensitive Hashing (LSH) [Indyk, Gionis & Motwani 99], [Broder et al. 03] – Employs l (random) hash functions, each concatenating k MinHash signatures – Highly sensitive to tuning, probabilistic guarantees only O ( k l n m) runtime Hybrids of I-Match and LSH with Spot Signatures (I-Match-S & LSH-S) June 1, SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections

Gold Set of News Articles Manually selected set of 2,160 near-duplicate news articles (LA Times, SF Chronicle, Huston Chronicle, etc.), manually clustered into 68 topic directories Huge variations in layout and ads added by different sites Macro- Avg. Cosine 0.64 June 1, SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections

SpotSigs vs. Shingling – Gold Set Using (weighted) Jaccard similarityUsing Cosine similarity (no pruning!) June 1, SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections

Runtime Results – TREC WT10g June 1, SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections SpotSigs vs. LSH-S using I-Match-S as recall base

Tuning I-Match & LSH on the Gold Set SpotSigs does not need this tuning step! June 1, SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections Tuning I-Match-S: varying IDF thresholdsTuning LSH-S: varying #hash functions l for fixed #MinHashes k = 6

Summary – Gold Set June 1, SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections Summary of algorithms at their best F1 spots (τ = 0.44 for SpotSigs & LSH)

Summary – TREC WT10g June 1, SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections Relative recall of SpotSigs & LSH using I-Match-S as recall base at τ = 1.0 and τ = 0.9

Conclusions & Outlook Robust Spot Signatures favor natural-language page components Full-fledged clustering algorithm, returns complete graph of all near-duplicate pairs Efficient & self-tuning collection partitioning and inverted index pruning, highly parallelizable deduplication step Surprising: May outperform linear-time similarity hashing approaches for reasonably high similarity thresholds (despite of being an exact- match algorithm!) Future Work: – Efficient (sequential) index structures for disk-based storage – Tight bounds for more similarity metrics, e.g., Cosine measure – More distribution June 1, SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections

Related Work Shingling [Broder, Glassman, Manasse & Zweig 97], [Broder 00], [Hod & Zobel 03] Random Projection [Charikar 02], [Henzinger 06] Signatures & Fingerprinting [Manbar 94], [Brin, Davis & Garcia-Molina 95], [Shivakumar 95], [Manku 06] Constraint-based Clustering [Klein, Kamvar & Manning 02], [Yang & Callan 06] Similarity Hashing I-Match: [Chowdhury, Frieder, Grossman & McCabe 02], [Chowdhury 04] LSH: [Indyk & Motwani 98], [Indyk, Gionis & Motwani 99] MinHashing: [Indyk 01], [Broder, Charikar & Mitzenmacher 03] Various filtering techniques Entropy-based: [Büttcher & Clarke 06] IDF, rules & constraints, … June 1, SpotSigs: Robust & Efficient Near Duplicate Detection in Large Web Collections