Finding Near Duplicates (Adapted from slides and material from Rajeev Motwani and Jeff Ullman)

Slides:



Advertisements
Similar presentations
Lecture outline Nearest-neighbor search in low dimensions
Advertisements

Algorithms of Google News An Analysis of Google News Personalization Scalable Online Collaborative Filtering 1.
Similarity and Distance Sketching, Locality Sensitive Hashing
Data Mining of Very Large Data
Near-Duplicates Detection
Latent Semantic Indexing (mapping onto a smaller space of latent concepts) Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 18.
CSCE 3400 Data Structures & Algorithm Analysis
High Dimensional Search Min-Hashing Locality Sensitive Hashing
MMDS Secs Slides adapted from: J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, October.
Randomized / Hashing Algorithms
Min-Hashing, Locality Sensitive Hashing Clustering
Maths for Computer Graphics
Large-scale matching CSE P 576 Larry Zitnick
Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University Note to other teachers and users of these.
Association Rule Mining
My Favorite Algorithms for Large-Scale Data Mining
CS 361A1 CS 361A (Advanced Data Structures and Algorithms) Lecture 20 (Dec 7, 2005) Data Mining: Association Rules Rajeev Motwani (partially based on notes.
1 Lecture 18 Syntactic Web Clustering CS
Near Duplicate Detection
Similarity Search in High Dimensions via Hashing Aristides Gionis, Protr Indyk and Rajeev Motwani Department of Computer Science Stanford University presented.
May 21, 2003CS Forum Annual Meeting1 Randomization for Massive and Streaming Data Sets Rajeev Motwani.
Finding Similar Items. Set Similarity Problem: Find similar sets. Motivation: Many things can be modeled/represented as sets Applications: –Face Recognition.
Tirgul 7. Find an efficient implementation of a dynamic collection of elements with unique keys Supported Operations: Insert, Search and Delete. The keys.
1 Near-Neighbor Search Applications Matrix Formulation Minhashing.
Lecture 11 oct 7 Goals: hashing hash functions chaining closed hashing application of hashing.
Finding Similar Items.
1 Near-Neighbor Search Applications Matrix Formulation Minhashing.
Near-duplicates detection Comparison of the two algorithms seen in class Romain Colle.
Recommendation Systems
1 Low-Support, High-Correlation Finding Rare but Similar Items Minhashing Locality-Sensitive Hashing.

Similarity and Distance Sketching, Locality Sensitive Hashing
CS276 Lecture 16 Web characteristics II: Web size measurement Near-duplicate detection.
Brief (non-technical) history Full-text index search engines Altavista, Excite, Infoseek, Inktomi, ca Taxonomies populated with web page Yahoo.
Finding Similar Items 1 Wu-Jun Li Department of Computer Science and Engineering Shanghai Jiao Tong University Lecture 10: Finding Similar Items Mining.
Hashing Sections 10.2 – 10.3 CS 302 Dr. George Bebis.
1 Low-Support, High-Correlation Finding Rare but Similar Items Minhashing.
1 Low-Support, High-Correlation Finding rare, but very similar items.
1 Hashing - Introduction Dictionary = a dynamic set that supports the operations INSERT, DELETE, SEARCH Dictionary = a dynamic set that supports the operations.
CS4432: Database Systems II Query Processing- Part 3 1.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Hash-Based Indexes Chapter 11 Modified by Donghui Zhang Jan 30, 2006.
DATA MINING LECTURE 6 Similarity and Distance Sketching, Locality Sensitive Hashing.
Document duplication (exact or approximate) Paolo Ferragina Dipartimento di Informatica Università di Pisa Slides only!
DATA MINING LECTURE 6 Sketching, Min-Hashing, Locality Sensitive Hashing.
DATA MINING LECTURE 5 MinHashing, Locality Sensitive Hashing, Clustering.
Netflix Challenge: Combined Collaborative Filtering Greg Nelson Alan Sheinberg.
Equivalence Relations
DATA MINING LECTURE 6 Sketching, Locality Sensitive Hashing.
Jeffrey D. Ullman Stanford University. 2  Generalized LSH is based on some kind of “distance” between points.  Similar points are “close.”  Example:
Shingling Minhashing Locality-Sensitive Hashing
Table of Contents Matrices - Definition and Notation A matrix is a rectangular array of numbers. Consider the following matrix: Matrix B has 3 rows and.
Fast Pseudo-Random Fingerprints Yoram Bachrach, Microsoft Research Cambridge Ely Porat – Bar Ilan-University.
Big Data Infrastructure
CS276A Text Information Retrieval, Mining, and Exploitation
Near Duplicate Detection
Finding Similar Items Jeffrey D. Ullman Application: Similar Documents
Streaming & sampling.
Sketching, Locality Sensitive Hashing
Lecture 16: Web search/Crawling/Link Analysis
Finding Similar Items: Locality Sensitive Hashing
Finding Similar Items: Locality Sensitive Hashing
Shingling Minhashing Locality-Sensitive Hashing
Theory of Locality Sensitive Hashing
Lecture 11: Nearest Neighbor Search
CSCE 3110 Data Structures & Algorithm Analysis
Hashing Sections 10.2 – 10.3 Lecture 26 CS302 Data Structures
Minwise Hashing and Efficient Search
Near Neighbor Search in High Dimensional Data (1)
Three Essential Techniques for Similar Documents
Locality-Sensitive Hashing
Presentation transcript:

Finding Near Duplicates (Adapted from slides and material from Rajeev Motwani and Jeff Ullman)

Set Similarity Set Similarity (Jaccard measure) View sets as columns of a matrix; one row for each element in the universe. a ij = 1 indicates presence of item i in set j Example C 1 C sim J (C 1,C 2 ) = 2/5 =

Identifying Similar Sets? Signature Idea Hash columns C i to signature sig(C i ) sim J (C i,C j ) approximated by sim H (sig(C i ),sig(C j )) Naïve Approach Sample P rows uniformly at random Define sig(C i ) as P bits of C i in sample Problem sparsity  would miss interesting part of columns sample would get only 0’s in columns

Key Observation For columns C i, C j, four types of rows C i C j A 1 1 B 1 0 C 0 1 D 0 0 Overload notation: A = # of rows of type A Claim

Min Hashing Randomly permute rows Hash h(C i ) = index of first row with 1 in column C i Suprising Property Why? Both are A/(A+B+C) Look down columns C i, C j until first non-Type- D row h(C i ) = h(C j )  type A row

Min-Hash Signatures Pick – P random row permutations MinHash Signature sig(C) = list of P indexes of first rows with 1 in column C Similarity of signatures Let sim H (sig(C i ),sig(C j )) = fraction of permutations where MinHash values agree Observe E[sim H (sig(C i ),sig(C j ))] = sim J (C i,C j )

Example C 1 C 2 C 3 R R R R R Signatures S 1 S 2 S 3 Perm 1 = (12345) Perm 2 = (54321) Perm 3 = (34512) Similarities Col-Col Sig-Sig

Implementation Trick Permuting rows even once is prohibitive Row Hashing Pick P hash functions h k : {1,…,n}  {1,…,O(n)} Ordering under h k gives random row permutation One-pass Implementation For each C i and h k, keep “slot” for min-hash value Initialize all slot(C i,h k ) to infinity Scan rows in arbitrary order looking for 1’s Suppose row R j has 1 in column C i For each h k, if h k (j) < slot(C i,h k ), then slot(C i,h k )  h k (j)

Example C 1 C 2 R R R R R h(x) = x mod 5 g(x) = 2x+1 mod 5 h(1) = 11- g(1) = 33- h(2) = 212 g(2) = 030 h(3) = 312 g(3) = 220 h(4) = 412 g(4) = 420 h(5) = 010 g(5) = 120 C 1 slots C 2 slots

Comparing Signatures Signature Matrix S Rows = Hash Functions Columns = Columns Entries = Signatures Compute – Pair-wise similarity of signature columns Problem MinHash fits column signatures in memory But comparing signature-pairs takes too much time Technique to limit candidate pairs? A-Priori does not work Locality Sensitive Hashing (LSH)