Near Duplicate Detection


Near Duplicate Detection
Slides adapted from: Information Retrieval and Web Search, Stanford University, Christopher Manning and Prabhakar Raghavan; and CS345A, Winter 2009: Data Mining, Stanford University, Anand Rajaraman and Jeffrey D. Ullman.

Duplication is a problem (Sec. 19.4.1)
[Diagram of the basic search-engine architecture: a web spider crawls the Web and feeds the indexer, which builds the indexes (including ad indexes); the user issues queries to the search system, which returns links.]

Duplicate documents (Sec. 19.6)
The web is full of duplicated content: about 30% of pages are duplicates.
Duplicates need to be removed for crawling, indexing, and statistical studies.
Strict duplicate detection (exact match) is not as common.
But there are many, many cases of near duplicates, e.g., two copies of a page whose only difference is the last-modified date, or other minor differences such as the webmaster, a logo, etc.

Other applications
Many web-mining problems can be expressed as finding "similar" sets:
Topic classification: pages with similar words, mirror web sites, similar news articles.
Recommendation systems: Netflix users with similar tastes in movies; movies with similar sets of fans.
Images of related things.
Communities in online social networks.
Plagiarism detection.

Algorithms for finding similarities
Edit distance: the distance between A and B is defined as the minimal number of operations needed to edit A into B. Mathematically elegant, with many applications (such as auto-correction of spelling), but not efficient at web scale. A small sketch follows below.
Shingling: the approach developed in the following slides.
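As a concrete illustration, here is a minimal dynamic-programming sketch of edit (Levenshtein) distance, assuming the allowed operations are single-character insertion, deletion, and substitution; the function name and example strings are illustrative, not from the slides.

def edit_distance(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))            # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                            # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,      # delete ca
                            curr[j - 1] + 1,  # insert cb
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]

print(edit_distance("kitten", "sitting"))     # 3

Even with this O(|A| * |B|) dynamic program, computing edit distance for every pair of documents in a web-scale collection is far too slow, which is what motivates shingling and sketching.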

Techniques for Similar Documents
Shingling: convert documents, emails, etc., to sets (the set of length-k sequences of terms that appear in the document).
Min-hashing: convert large sets to short signatures (short integer vectors) that represent the sets and preserve their similarity.
Candidate pairs: the pairs of signatures that we still need to test for similarity.
[Pipeline: document → shingling → sets of shingles → min-hashing → signatures → candidate pairs.]
From Anand Rajaraman and Jeffrey D. Ullman.

Shingles
A k-shingle (or k-gram) for a document is a sequence of k terms that appears in the document.
Example: for k = 4, the document "a rose is a rose is a rose" yields the shingles a rose is a, rose is a rose, is a rose is, a rose is a, rose is a rose, and the set of shingles is {a rose is a, rose is a rose, is a rose is}.
Note that "a rose is a" appears twice in the document but only once in the set.
Option: regard the shingles as a bag (multiset), and count "a rose is a" twice.
Represent a doc by its set of k-shingles. Documents that have lots of shingles in common have similar text, even if the text appears in a different order.
Careful: you must pick k large enough. If k = 1, most documents overlap a lot.
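A minimal sketch of term-level k-shingling in Python; the function name is illustrative, and lowercasing plus whitespace tokenization are simplifying assumptions rather than something the slides prescribe.

def k_shingles(text: str, k: int = 4) -> set:
    terms = text.lower().split()
    # every window of k consecutive terms is one shingle; duplicates collapse in the set
    return {" ".join(terms[i:i + k]) for i in range(len(terms) - k + 1)}

print(k_shingles("a rose is a rose is a rose"))
# {'a rose is a', 'rose is a rose', 'is a rose is'} (order may vary)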

Jaccard similarity
The Jaccard similarity of two sets is the size of their intersection divided by the size of their union.
Example (k = 4):
a rose is a rose is a rose → {a rose is a, rose is a rose, is a rose is}
a rose is a rose that is it → {a rose is a, rose is a rose, is a rose that, a rose that is, rose that is it}
2 shingles in the intersection, 6 in the union, so the Jaccard similarity is 2/6 ≈ 0.33.
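A minimal Jaccard-similarity helper, reusing the illustrative k_shingles() function sketched above.

def jaccard(s1: set, s2: set) -> float:
    return len(s1 & s2) / len(s1 | s2)

d1 = k_shingles("a rose is a rose is a rose")
d2 = k_shingles("a rose is a rose that is it")
print(jaccard(d1, d2))   # 2 shared shingles, 6 distinct in total -> 0.333...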

The size is the problem
The shingle sets can be very large, and there are many documents (many shingle sets) to compare: billions of documents and shingles.
Problems:
Memory: the shingle sets are so large, or so many, that they cannot fit in main memory.
Time: there are so many sets that comparing all pairs of sets takes too much time.
Or both.

Shingles + Set Intersection
Computing the exact set intersection of shingles between all pairs of documents is expensive/intractable.
Approximate it using a cleverly chosen subset of the shingles from each document (a sketch).
Estimate the Jaccard similarity (size_of_intersection / size_of_union) from the short sketches.
[Diagram: Doc A → shingle set A → sketch A; Doc B → shingle set B → sketch B; the Jaccard similarity is estimated from the two sketches.]

Set Similarity of sets Ci, Cj (Sec. 19.6)
View the sets as columns of a matrix A, with one row for each element in the universe; aij = 1 indicates the presence of shingle i in set (document) j.
Example:
   C1   C2
    0    1
    1    0
    1    1
    0    0
    1    1
    0    1
Jaccard(C1, C2) = 2/5 = 0.4
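A quick illustrative check of the example above, computing the Jaccard similarity directly from the 0/1 columns; the variable names are made up.

c1 = [0, 1, 1, 0, 1, 0]   # column C1
c2 = [1, 0, 1, 0, 1, 1]   # column C2
both   = sum(1 for x, y in zip(c1, c2) if x == 1 and y == 1)
either = sum(1 for x, y in zip(c1, c2) if x == 1 or y == 1)
print(both / either)      # 2/5 = 0.4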

Key Observation (Sec. 19.6)
For columns C1, C2, there are four types of rows:
          C1   C2
Type A:    1    1
Type B:    1    0
Type C:    0    1
Type D:    0    0
Overloading notation, let A = the number of rows of type A (and similarly for B, C, D).
Claim: Jaccard(C1, C2) = A / (A + B + C), since the type-A rows are exactly the intersection and the type-A, B, and C rows together are exactly the union.

Estimating Jaccard similarity (Sec. 19.6)
Randomly permute the rows of the matrix.
Define the hash h(Ci) = the index of the first row (in the permuted order) with a 1 in column Ci.
Property: Pr[ h(Ci) = h(Cj) ] = Jaccard(Ci, Cj).
Why? Both quantities equal A / (A + B + C). Look down columns Ci and Cj until the first non-type-D row: h(Ci) = h(Cj) exactly when that row is a type-A row, and under a random permutation every non-type-D row is equally likely to come first.
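A small empirical check of this property (a sketch, not from the slides), using the example columns from the earlier slide: over many random row permutations, the fraction of permutations with h(C1) = h(C2) approaches the Jaccard similarity 2/5.

import random

c1 = {2, 3, 5}        # rows (1-indexed) where column C1 has a 1
c2 = {1, 3, 5, 6}     # rows where column C2 has a 1
rows = list(range(1, 7))

matches, trials = 0, 100_000
for _ in range(trials):
    perm = rows[:]
    random.shuffle(perm)
    rank = {row: i for i, row in enumerate(perm)}   # position of each row in the permutation
    h1 = min(c1, key=rank.get)   # first permuted row with a 1 in C1
    h2 = min(c2, key=rank.get)
    matches += (h1 == h2)

print(matches / trials)          # ≈ 0.4, the Jaccard similarity of C1 and C2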

Representing documents and shingles
To compress long shingles, we can hash them to (say) 4 bytes, and represent a doc by the set of hash values of its k-shingles.
Represent the documents as a matrix: each column is a document, each row is a shingle (here, 4 documents and 7 shingles in total).
[Example matrix: rows Shingle 1 through Shingle 7, columns doc1 through doc4, with a 1 wherever the shingle appears in the document.]
In real applications the matrix is sparse: there are many empty cells.
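A minimal sketch of compressing shingles to 4-byte integers; zlib.crc32 is used here only as a convenient 32-bit hash and is an assumption, not the hash the slides have in mind.

import zlib

def shingle_ids(shingles: set) -> set:
    # map each k-shingle string to an unsigned 32-bit (4-byte) value
    return {zlib.crc32(s.encode("utf-8")) & 0xFFFFFFFF for s in shingles}

print(shingle_ids({"a rose is a", "rose is a rose", "is a rose is"}))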

Random permutation
[Worked example over 4 documents and 7 shingle rows: the rows of the input matrix are hashed/randomly permuted and sorted; for each document, the signature entry is the position of the first row containing a 1 under that permutation. With this single permutation, the estimated similarities are 1~3: 1, 2~4: 1, and 1~4: 0.]

Repeat the previous process
[Apply additional independent random permutations to the rows of the input matrix; each permutation contributes one more min-hash row to the signature matrix M.]

More hashings produce better results
[Same input matrix, now with three independent permutations, giving a three-row signature matrix M.]
Similarities:   1-3    2-4    1-2    3-4
Col/Col:        0.75   0.75   0      0
Sig/Sig:        0.67   1.00   0      0
(Col/Col is the true Jaccard similarity of the columns; Sig/Sig is the fraction of signature entries on which the two columns agree.)
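A compact sketch (illustrative, not the slides' exact example) that builds a signature matrix from many random permutations and compares the signature-based estimate with the true column (Jaccard) similarity; the toy documents are made up.

import random

docs = [{0, 2, 3, 5}, {1, 4}, {0, 2, 3}, {1, 4, 6}]   # sets of shingle row-indices
n_rows, n_perms = 7, 100

signature = [[None] * len(docs) for _ in range(n_perms)]
for p in range(n_perms):
    perm = list(range(n_rows))
    random.shuffle(perm)
    rank = {row: i for i, row in enumerate(perm)}
    for d, doc in enumerate(docs):
        signature[p][d] = min(doc, key=rank.get)       # min-hash of doc under permutation p

def sig_sim(i, j):
    return sum(signature[p][i] == signature[p][j] for p in range(n_perms)) / n_perms

def col_sim(i, j):
    return len(docs[i] & docs[j]) / len(docs[i] | docs[j])

for i, j in [(0, 2), (1, 3), (0, 1)]:
    print((i, j), round(col_sim(i, j), 2), round(sig_sim(i, j), 2))

With more permutations (hash functions), the Sig/Sig estimate concentrates around the true Col/Col similarity.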

Sketch of a document (Sec. 19.6)
Create a "sketch vector" (of size ~200) for each document.
Documents that share ≥ t (say 80%) of their corresponding vector elements are near duplicates.
For doc D, sketch_D[i] is computed as follows:
Let f map all shingles in the universe to 0..2^m (e.g., f = fingerprinting).
Let π_i be a random permutation of 0..2^m.
sketch_D[i] = MIN { π_i(f(s)) } over all shingles s in D.
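A minimal, assumed implementation of such a sketch in Python: instead of explicit random permutations of the fingerprint space, it uses 200 independently salted 64-bit hashes as a practical stand-in; all names are illustrative.

import hashlib

NUM_HASHES = 200

def _h(shingle: str, salt: int) -> int:
    # a salted 64-bit hash standing in for pi_i applied to the shingle's fingerprint
    data = f"{salt}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

def sketch(shingles: set) -> list:
    # sketch[i] = min over all shingles s of the i-th salted hash of s
    return [min(_h(s, i) for s in shingles) for i in range(NUM_HASHES)]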

Computing Sketch[i] for Doc1 (Sec. 19.6)
[Figure: start with the 64-bit fingerprints f(s) of Document 1's shingles, laid out on the number line 0..2^64; permute the line with π_i; pick the minimum value as Sketch[i].]

Test if Doc1.Sketch[i] = Doc2.Sketch[i] (Sec. 19.6)
[Figure: the shingle fingerprints of Document 1 and Document 2 are permuted on the number line 0..2^64 with the same π_i; A and B are the respective minimum values. Are they equal?]
Test this for 200 random permutations: π_1, π_2, …, π_200.
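A small sketch of this comparison, reusing the illustrative sketch() and k_shingles() helpers above; the 80% threshold is the example value from the earlier slide.

def estimated_similarity(sk1: list, sk2: list) -> float:
    # fraction of positions i where the two sketches agree
    return sum(a == b for a, b in zip(sk1, sk2)) / len(sk1)

sk1 = sketch(k_shingles("a rose is a rose is a rose"))
sk2 = sketch(k_shingles("a rose is a rose that is it"))
print(estimated_similarity(sk1, sk2))   # flag as near-duplicates if >= 0.8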

Summary
Conceptually:
Characterize documents by their shingles; each shingle is represented by a unique integer.
Reduce the resemblance problem to a set-intersection problem (the Jaccard similarity coefficient).
The intersection is estimated by random sampling: in effect, randomly sample about 200 shingles (via min-hashing) and check, for each, whether it appears in both documents.
Computationally:
Documents are represented by a sketch: a small set (~200) of shingles, each produced by a min-hash, computed once per document.
The set intersection is then computed on the sketches alone.
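To tie the pieces together, an end-to-end toy run of the summarized pipeline (shingling → min-hash sketches → pairwise comparison), reusing the illustrative helpers above; the documents, the choice k = 2, and the 0.8 threshold are only for demonstration.

documents = {
    "d1": "a rose is a rose is a rose",
    "d2": "a rose is a rose that is it",
    "d3": "the web is full of duplicated content",
    "d4": "the web is full of duplicated content today",
}
sketches = {name: sketch(k_shingles(text, k=2)) for name, text in documents.items()}

names = list(documents)
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        sim = estimated_similarity(sketches[names[i]], sketches[names[j]])
        if sim >= 0.8:
            print(names[i], names[j], "look like near-duplicates:", round(sim, 2))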