SpotSigs: Robust and Efficient Near Duplicate Detection in Large Web Collections Martin Theobald, Jonathan Siddharth, and Andreas Paepcke Dept. of Computer Science, Stanford University.

Similar presentations
SpotSigs Robust & Efficient Near Duplicate Detection in Large Web Collections Martin Theobald Jonathan Siddharth Andreas Paepcke Sigir 2008, Singapore.

Hashing.
Nearest Neighbor Search in High Dimensions Seminar in Algorithms and Geometry Mica Arie-Nachimson and Daniel Glasner April 2009.
Near-Duplicates Detection
3/13/2012Data Streams: Lecture 161 CS 410/510 Data Streams Lecture 16: Data-Stream Sampling: Basic Techniques and Results Kristin Tufte, David Maier.
Latent Semantic Indexing (mapping onto a smaller space of latent concepts) Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 18.
Theoretical Analysis. Objective Our algorithm use some kind of hashing technique, called random projection. In this slide, we will show that if a user.
Big Data Lecture 6: Locality Sensitive Hashing (LSH)
Bahman Bahmani  Fundamental Tradeoffs  Drug Interaction Example [Adapted from Ullman’s slides, 2012]  Technique I: Grouping 
High Dimensional Search Min-Hashing Locality Sensitive Hashing
Comparing Twitter Summarization Algorithms for Multiple Post Summaries David Inouye and Jugal K. Kalita SocialCom May 10 Hyewon Lim.
MMDS Secs Slides adapted from: J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, October.
Experiments on Query Expansion for Internet Yellow Page Services Using Log Mining Summarized by Dongmin Shin Presented by Dongmin Shin User Log Analysis.
Query Operations: Automatic Local Analysis. Introduction Difficulty of formulating user queries –Insufficient knowledge of the collection –Insufficient.
Ph.D. DefenceUniversity of Alberta1 Approximation Algorithms for Frequency Related Query Processing on Streaming Data Presented by Fan Deng Supervisor:
Page-level Template Detection via Isotonic Smoothing Deepayan ChakrabartiYahoo! Research Ravi KumarYahoo! Research Kunal PuneraUniv. of Texas at Austin.
Ph.D. SeminarUniversity of Alberta1 Approximation Algorithms for Frequency Related Query Processing on Streaming Data Presented by Fan Deng Supervisor:
Semantic text features from small world graphs Jure Leskovec, IJS + CMU John Shawe-Taylor, Southampton.
1 Lecture 18 Syntactic Web Clustering CS
1 Overview of Storage and Indexing Yanlei Diao UMass Amherst Feb 13, 2007 Slides Courtesy of R. Ramakrishnan and J. Gehrke.
Near Duplicate Detection
Reverse Hashing for Sketch Based Change Detection in High Speed Networks Ashish Gupta Elliot Parsons with Robert Schweller, Theory Group Advisor: Yan Chen.
Finding Similar Items. Set Similarity Problem: Find similar sets. Motivation: Many things can be modeled/represented as sets Applications: –Face Recognition.
Query Optimization 3 Cost Estimation R&G, Chapters 12, 13, 14 Lecture 15.
SpotSigs Robust & Efficient Near Duplicate Detection in Large Web Collections Martin Theobald Jonathan Siddharth Andreas Paepcke Stanford University Sigir.
1 Edit Distance and Large Data Sets Ziv Bar-Yossef Robert Krauthgamer Ravi Kumar T.S. Jayram IBM Almaden Technion.
Finding Similar Items.
Near-duplicates detection Comparison of the two algorithms seen in class Romain Colle.
Hashed Samples Selectivity Estimators for Set Similarity Selection Queries.
DETECTING NEAR-DUPLICATES FOR WEB CRAWLING Authors: Gurmeet Singh Manku, Arvind Jain, and Anish Das Sarma Presentation By: Fernando Arreola.
GmImgProc Alexandra Olteanu SCPD Alexandru Ştefănescu SCPD.
Detecting Near-Duplicates for Web Crawling Manku, Jain, Sarma
A Web Crawler Design for Data Mining
FINDING NEAR DUPLICATE WEB PAGES: A LARGE- SCALE EVALUATION OF ALGORITHMS - Monika Henzinger Speaker Ketan Akade 1.
A Markov Random Field Model for Term Dependencies Donald Metzler W. Bruce Croft Present by Chia-Hao Lee.
Experiments An Efficient Trie-based Method for Approximate Entity Extraction with Edit-Distance Constraints Entity Extraction A Document An Efficient Filter.
Hashing Chapter 20. Hash Table A hash table is a data structure that allows fast find, insert, and delete operations (most of the time). The simplest.
Finding Similar Items 1 Wu-Jun Li Department of Computer Science and Engineering Shanghai Jiao Tong University Lecture 10: Finding Similar Items Mining.
Intelligent Database Systems Lab N.Y.U.S.T. I. M. SpotSigs: Robust and Efficient Near Duplicate Detection in Large Web Collections Presenter: Tsai Tzung.
Cache-Conscious Performance Optimization for Similarity Search Maha Alabduljalil, Xun Tang, Tao Yang Department of Computer Science University of California.
N-Gram-based Dynamic Web Page Defacement Validation Woonyon Kim Aug. 23, 2004 NSRI, Korea.
Scott Wen-tau Yih (Microsoft Research) Joint work with Hannaneh Hajishirzi (University of Illinois) Aleksander Kolcz (Microsoft Bing)
Similarity Searching in High Dimensions via Hashing Paper by: Aristides Gionis, Poitr Indyk, Rajeev Motwani.
Finding Near-Duplicate Web Pages: A Large-Scale Evaluation of Algorithms Author: Monika Henzinger Presenter: Chao Yan.
© 2006 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice Applying Syntactic Similarity Algorithms.
Automatic Video Tagging using Content Redundancy Stefan Siersdorfer 1, Jose San Pedro 2, Mark Sanderson 2 1 L3S Research Center, Germany 2 University of.
Document duplication (exact or approximate) Paolo Ferragina Dipartimento di Informatica Università di Pisa Slides only!
DATA MINING LECTURE 6 Sketching, Min-Hashing, Locality Sensitive Hashing.
Chapter 5 Ranking with Indexes 1. 2 More Indexing Techniques n Indexing techniques:  Inverted files - best choice for most applications  Suffix trees.
CS425: Algorithms for Web Scale Data Most of the slides are from the Mining of Massive Datasets book. These slides have been modified for CS425. The original.
KNN & Naïve Bayes Hongning Wang Today’s lecture Instance-based classifiers – k nearest neighbors – Non-parametric learning algorithm Model-based.
Text Information Management ChengXiang Zhai, Tao Tao, Xuehua Shen, Hui Fang, Azadeh Shakery, Jing Jiang.
Parameter Reduction for Density-based Clustering on Large Data Sets Elizabeth Wang.
KNN & Naïve Bayes Hongning Wang
Federated text retrieval from uncooperative overlapped collections Milad Shokouhi, RMIT University, Melbourne, Australia Justin Zobel, RMIT University,
Scalability of Local Image Descriptors Björn Þór Jónsson Department of Computer Science Reykjavík University Joint work with: Laurent Amsaleg (IRISA-CNRS)
Big Data Infrastructure
Cohesive Subgraph Computation over Large Graphs
Gleb Skobeltsyn Flavio Junqueira Vassilis Plachouras
Optimizing Parallel Algorithms for All Pairs Similarity Search
Near Duplicate Detection
Finding Near-Duplicate Web Pages: A Large-Scale Evaluation of Algorithms By Monika Henzinger Presented.
Hash-Based Indexes Chapter 11
Hash-Based Indexes Chapter 10
Locality Sensitive Hashing
Detecting Phrase-Level Duplication on the World Wide Web
Hash-Based Indexes Chapter 11
Minwise Hashing and Efficient Search
Chapter 11 Instructor: Xin Zhang
Three Essential Techniques for Similar Documents
Presentation transcript:

SpotSigs: Robust and Efficient Near Duplicate Detection in Large Web Collections Martin Theobald, Jonathan Siddharth, and Andreas Paepcke Dept. of Computer Science, Stanford University. SIGIR 2008. Presented by JongHeum Yeon, IDS Lab., Seoul National University.

Copyright 2009 by CEBT. Motivation: Same (Almost)

Motivation (cont'd): Same (Almost)

Motivation (cont'd)  Many different news sites get their core articles delivered by the same sources, e.g., the Associated Press  Even within a news site, often more than 30% of articles are near duplicates: dynamically created content, navigational pages, advertisements, etc.  Robust signature extraction: stopword-based signatures favor the natural-language contents of web pages over navigational banners and advertisements  Efficient near-duplicate matching: a self-tuning, highly parallelizable clustering algorithm with threshold-based collection partitioning and inverted index pruning

Stopword Signature  Stopword occurrences: the, that, {be}, {have}  Spot Signatures occur uniformly and frequently throughout any piece of natural-language text  They hardly occur in navigational web page components or ads

Ignore Case  No occurrences of: the, that, {be}, {have}

Spot Signature Extraction  "Localized" signatures: n-grams close to a stopword antecedent, e.g., that:presidential:campaign:hit (a Spot Signature s consists of an antecedent followed by a nearby n-gram)  Parameters: a predefined list of (stopword) antecedents, the spot distance d, and the chain length c

Signature Extraction Example  "At a rally to kick off a weeklong campaign for the South Carolina primary, Obama tried to set the record straight from an attack circulating widely on the Internet that is designed to play into prejudices against Muslims and fears of terrorism."  S = {a:rally:kick, a:weeklong:campaign, the:south:carolina, the:record:straight, an:attack:circulating, the:internet:designed, is:designed:play} (for antecedents {a, the, is}, uniform spot distance d=1, chain length c=2)
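The extraction rule above can be sketched in a few lines. This is an illustrative reimplementation, not the authors' code; the exact stopword list used for skipping tokens inside a chain is an assumption.

```python
import re

def spot_signatures(text, antecedents, stopwords, chain_len=2):
    """Sketch of Spot Signature extraction: for each occurrence of an
    antecedent, chain the next `chain_len` non-stopword tokens
    (uniform spot distance d=1)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    sigs = set()
    for i, tok in enumerate(tokens):
        if tok not in antecedents:
            continue
        chain = []
        for nxt in tokens[i + 1:]:
            if nxt in stopwords:
                continue  # stopwords are skipped when chaining words
            chain.append(nxt)
            if len(chain) == chain_len:
                break
        if len(chain) == chain_len:
            sigs.add(":".join([tok] + chain))
    return sigs
```

Run on the slide's sentence with antecedents {a, an, the, is}, this reproduces exactly the signature set S listed above.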

Signature Extraction Algorithm  Simple & efficient sliding-window technique  O(|tokens|) runtime  Largely independent of the input format (maybe remove markup)  No expensive and error-prone layout analysis required

Choice of Antecedents  F1 measure for different antecedents over a "Gold Set" of 2,160 manually selected near-duplicate news articles in 68 clusters

Efficiency Problem  Given Spot Signature sets {S1, …, SN}: for each Si, find all similar signature sets Si1, …, Sik with similarity sim(Si, Sij) ≥ τ  Common similarity measures: Jaccard, Cosine, Kullback-Leibler, …  Common matching algorithms: various clustering techniques, similarity hashing, …

Upper Bound for Jaccard  Given 3 Spot Signature sets: A with |A| = 345, B with |B| = 1045, C with |C| = 323  Which pairs would you compare first? Which pairs could you spare?  Two signature sets A, B can only have high (Jaccard) similarity if they are of similar cardinality

Upper Bound for Jaccard (cont'd)  Consider the Jaccard similarity sim(A,B) = |A ∩ B| / |A ∪ B|  Upper bound (for |B| ≥ |A|, w.l.o.g.): sim(A,B) ≤ |A| / |B|  Never compare signature sets A, B with |A| / |B| < τ, i.e., skip any pair with |B| − |A| > (1 − τ)|B|
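This length-based filter is easy to state in code; a minimal sketch (function names are ours):

```python
def jaccard(a, b):
    """Exact Jaccard similarity of two sets."""
    return len(a & b) / len(a | b)

def worth_comparing(len_a, len_b, tau):
    """Length-based pruning: since J(A,B) <= min(|A|,|B|) / max(|A|,|B|),
    a pair can only reach threshold tau if that cardinality ratio does."""
    small, large = sorted((len_a, len_b))
    return small / large >= tau
```

For the slide's example with τ = 0.7, only the pair (A, C) with |A| = 345 and |C| = 323 survives the filter; any pair involving B with |B| = 1045 is pruned without computing the actual similarity.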

Partitioning the Collection  Given a similarity threshold τ, there is no contiguous partitioning (based on signature set lengths) s.t. (A) any potentially similar pair is within the same partition, and (B) any non-similar pair cannot be within the same partition  However, there are many possible partitionings s.t. (A) any similar pair is mapped into at most two neighboring partitions  Also, partition widths should be a function of τ  (figure: documents Si and Sj ordered by #sigs per doc)

Optimal Partitioning  Given τ, find partition boundaries p0, …, pk s.t. (A) all similar pairs (based on length) are mapped into at most two neighboring partitions (no false negatives), (B) no non-similar pair (based on length) is mapped into the same partition (no false positives), and (C) all partitions' widths are minimized w.r.t. (A) & (B) (minimality)  But this is expensive to solve exactly

Approximate Solution  "Starting with p0 = 1, for any given pk, choose pk+1 as the smallest integer pk+1 > pk s.t. pk+1 − pk > (1 − τ)pk+1", e.g. (for τ = 0.7): p0 = 1, p1 = 3, p2 = 6, p3 = 10, …, p7 = 43, p8 = 59, …  Converges to the optimal partitioning when the length distribution is dense  Web collections are typically skewed towards shorter document lengths  Progressively increasing bucket widths are even beneficial for more uniform bucket sizes
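The quoted rule can be transcribed directly; the sketch below is our own reading of it, using exact rational arithmetic (a `Fraction` τ) to avoid floating-point edge cases, and it is checked against the rule itself rather than any particular boundary values.

```python
from fractions import Fraction

def partition_boundaries(tau, max_len):
    """Boundary p_{k+1} is the smallest integer > p_k satisfying
    p_{k+1} - p_k > (1 - tau) * p_{k+1}, which rearranges to
    p_{k+1} * tau > p_k.  `tau` must be a Fraction."""
    bounds = [1]
    while bounds[-1] <= max_len:
        p = bounds[-1]
        # smallest integer strictly greater than p / tau
        bounds.append(p * tau.denominator // tau.numerator + 1)
    return bounds
```

Each consecutive pair of boundaries then provably satisfies the rule, and no smaller choice of pk+1 would.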

Partitioning Effects  The optimal partitioning approach even smoothes skewed bucket sizes (plots of #docs over #sigs for 1,274,812 TREC WT10g docs with at least 1 Spot Signature: uniform bucket widths vs. progressively increasing bucket widths)

Inverted Index Pruning  Pass 1: for each partition, create an inverted index. For each Spot Signature sj, create an inverted list Lj with pointers to the documents di containing sj, sorted in descending order of freqi(sj) in di  Pass 2: for each document di, find its partition, then process the lists in descending order of |Lj|, maintaining two thresholds: δ1, the minimum length distance to any document in the next list, and δ2, the minimum length distance to the next document within the current list. Break if δ1 + δ2 > (1 − τ)|di|; also iterate into the right neighbor partition  (figure: inverted lists for the:campaign and an:attack in partition k)

Deduplication Example  Given: d1 = {s1:5, s2:4, s3:4} with |d1| = 13; d2 = {s1:8, s2:4} with |d2| = 12; d3 = {s1:4, s2:5, s3:5} with |d3| = 13  Threshold: τ = 0.8  Break if: δ1 + δ2 > (1 − τ)|di|  Deduplicate d1: 1) δ1 = 0, δ2 = 1 → sim(d1,d3) = 0.8; 2) d1 = d1 → continue; 3) δ1 = 4, δ2 = 0 → break!  (figure: inverted lists s1, s2, s3 for partition k)
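The similarity used in this example is a weighted Jaccard over signature frequencies (sum of minimum frequencies over sum of maximum frequencies); a small sketch, with document names taken from the example and the function name assumed:

```python
def weighted_jaccard(a, b):
    """Weighted Jaccard of two signature-frequency maps:
    sum of per-signature minimum frequencies over sum of maximums."""
    keys = a.keys() | b.keys()
    num = sum(min(a.get(k, 0), b.get(k, 0)) for k in keys)
    den = sum(max(a.get(k, 0), b.get(k, 0)) for k in keys)
    return num / den

d1 = {"s1": 5, "s2": 4, "s3": 4}
d2 = {"s1": 8, "s2": 4}
d3 = {"s1": 4, "s2": 5, "s3": 5}
```

This reproduces sim(d1, d3) = 12/15 = 0.8, exactly the threshold, while sim(d1, d2) = 9/16 falls below it.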

SpotSigs Deduplication Algorithm  Still O(n²m) worst-case runtime  Empirically much better, may outperform hashing  Tuning parameters: none

Experiments  Collections: the "Gold Set" of 2,160 manually selected near-duplicate news articles from various news sites (68 clusters), and the TREC WT10g reference collection (1.6 million docs)  Hardware: dual Xeon 3GHz, 32 GB RAM; 8 threads for sorting, hashing & deduplication  For all approaches: remove HTML markup; simple IDF filter for signatures, removing the most frequent & infrequent signatures

Competitors  Shingling [Broder, Glassman, Manasse & Zweig '97]: n-gram sets/vectors compared with Jaccard/Cosine similarity; between O(n²m) and O(nm) runtime (using LSH for matching)  I-Match [Chowdhury, Frieder, Grossman & McCabe '02]: employs a single SHA-1 hash function; O(nm) runtime  Locality Sensitive Hashing (LSH) [Indyk, Gionis & Motwani '99], [Broder et al. '03]: employs l (random) hash functions, each concatenating k MinHash signatures; O(klnm) runtime  Hybrids of I-Match and LSH with Spot Signatures (I-Match-S & LSH-S)
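For reference, the LSH competitor builds on MinHash; a toy sketch in which salted built-in hashing stands in for random permutations (all names and parameters here are our own, not from the paper):

```python
import random

def minhash_signature(items, num_hashes=200, seed=42):
    """One MinHash component per salted hash function: the minimum
    hash value over the set approximates the minimum of a random
    permutation, so P(components agree) = Jaccard similarity."""
    rng = random.Random(seed)
    salts = [rng.getrandbits(32) for _ in range(num_hashes)]
    return [min(hash((salt, x)) for x in items) for salt in salts]

def estimate_jaccard(sig_a, sig_b):
    """Fraction of agreeing components estimates J(A, B)."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

LSH-style candidate generation would then concatenate k such components into a bucket key and repeat with l independent tables, as in the runtime bound above.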

The "Gold Set" of News Articles  Manually selected set of 2,160 near-duplicate news articles (LA Times, SF Chronicle, Houston Chronicle, etc.), manually clustered into 68 topic directories  Huge variations in layout and ads added by different sites  Macro-avg. Cosine similarity ≈ 0.64

SpotSigs vs. Shingling – Gold Set  (plots: using (weighted) Jaccard similarity vs. using Cosine similarity, no pruning!)

Runtime Results – TREC WT10g  SpotSigs vs. LSH-S, using I-Match-S as the recall base

Summary – Gold Set  Summary of algorithms at their best F1 spots (τ = 0.44 for SpotSigs & LSH)

Conclusion  Robust Spot Signatures favor natural-language page components  Full-fledged clustering algorithm that returns the complete graph of all near-duplicate pairs  Efficient & self-tuning collection partitioning and inverted index pruning; highly parallelizable deduplication step  Surprising: may outperform linear-time similarity hashing approaches for reasonably high similarity thresholds  Future work: efficient (sequential) index structures for disk-based storage; tight bounds for more similarity metrics, e.g., the Cosine measure