Finding Near Duplicates (Adapted from slides and material from Rajeev Motwani and Jeff Ullman)

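Set Similarity
Set similarity (Jaccard measure): $\mathrm{sim}_J(C_i, C_j) = |C_i \cap C_j| / |C_i \cup C_j|$
View sets as columns of a matrix, with one row for each element in the universe.
$a_{ij} = 1$ indicates the presence of item $i$ in set $j$.
Example (the matrix for columns $C_1$ and $C_2$ is not preserved in this transcript): $\mathrm{sim}_J(C_1, C_2) = 2/5 = 0.4$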
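The Jaccard measure on 0/1 columns can be sketched in a few lines of Python. The columns `c1` and `c2` below are hypothetical, not the slide's original example; they are chosen only so that two rows have a 1 in both columns and five rows have a 1 in at least one, which gives the same 2/5 value.

```python
def jaccard(col_a, col_b):
    """Jaccard similarity of two equal-length 0/1 columns."""
    both = sum(1 for a, b in zip(col_a, col_b) if a == 1 and b == 1)
    either = sum(1 for a, b in zip(col_a, col_b) if a == 1 or b == 1)
    # Two all-zero columns have an empty union; treat that case as similarity 0.
    return both / either if either else 0.0

# Hypothetical columns (assumption: not the matrix from the slide).
c1 = [0, 1, 1, 0, 1, 0]
c2 = [1, 0, 1, 0, 1, 1]
print(jaccard(c1, c2))  # 2 shared rows out of 5 non-empty rows
```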

Identifying Similar Sets?
Signature idea:
  Hash each column $C_i$ to a short signature $\mathrm{sig}(C_i)$.
  Approximate $\mathrm{sim}_J(C_i, C_j)$ by $\mathrm{sim}_H(\mathrm{sig}(C_i), \mathrm{sig}(C_j))$.
Naïve approach:
  Sample $P$ rows uniformly at random.
  Define $\mathrm{sig}(C_i)$ as the $P$ bits of $C_i$ in the sampled rows.
Problem:
  Sparsity: the sample would miss the interesting part of the columns.
  For sparse columns, the sample would contain only 0's.
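The sparsity problem with the naïve approach can be made concrete with a small sketch. The sizes (`n`, `ones`, `P`) and the seed are assumptions picked for illustration. The code builds one sparse column, samples `P` rows, and also computes the exact probability that a uniform sample of `P` rows sees no 1 at all.

```python
import math
import random

def naive_signature(column, sample_rows):
    """Signature = the bits of the column at the sampled row positions."""
    return [column[r] for r in sample_rows]

# Assumed sizes: 10,000 rows, only 10 of them set, 20 sampled rows.
n, ones, P = 10_000, 10, 20
rng = random.Random(0)

column = [0] * n
for r in rng.sample(range(n), ones):
    column[r] = 1

sample_rows = rng.sample(range(n), P)
sig = naive_signature(column, sample_rows)

# Exact chance that all P sampled rows miss every 1 in the column.
p_all_zero = math.comb(n - ones, P) / math.comb(n, P)
print(sum(sig), round(p_all_zero, 3))
```

With these sizes the all-zero outcome is by far the most likely one, so two unrelated sparse columns would usually receive identical (all-zero) signatures.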

Key Observation
For columns $C_i$, $C_j$, there are four types of rows:

  Type   C_i   C_j
  A      1     1
  B      1     0
  C      0     1
  D      0     0

Overload notation: $A$ = number of rows of type A (likewise $B$, $C$, $D$).
Claim: $\mathrm{sim}_J(C_i, C_j) = A / (A + B + C)$
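A short sketch of the row-type counting, assuming the claim takes the form $A/(A+B+C)$ quoted on the Min Hashing slide. The two columns are hypothetical.

```python
from collections import Counter

def row_type_counts(col_i, col_j):
    """Count rows of type A (1,1), B (1,0), C (0,1) and D (0,0)."""
    names = {(1, 1): "A", (1, 0): "B", (0, 1): "C", (0, 0): "D"}
    counts = Counter(names[(a, b)] for a, b in zip(col_i, col_j))
    return {t: counts.get(t, 0) for t in "ABCD"}

# Hypothetical columns for illustration.
c_i = [1, 0, 1, 1, 0, 0]
c_j = [1, 1, 0, 1, 0, 0]

t = row_type_counts(c_i, c_j)
# Type D rows appear in neither the intersection nor the union, so they drop out.
sim = t["A"] / (t["A"] + t["B"] + t["C"])
print(t, sim)
```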

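Min Hashing
Randomly permute the rows.
Hash: $h(C_i)$ = index of the first row (in permuted order) with a 1 in column $C_i$.
Surprising property: $P[h(C_i) = h(C_j)] = \mathrm{sim}_J(C_i, C_j)$
Why?
  Both are $A/(A+B+C)$.
  Look down columns $C_i$, $C_j$ until the first non-type-D row.
  $h(C_i) = h(C_j)$ if and only if that row is a type A row.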
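The property can be checked empirically with a rough simulation. The columns, seed, and trial count are assumptions. The columns have $A = 2$, $B = 1$, $C = 1$, so the agreement rate should land near $2/4$.

```python
import random

def minhash(column, perm):
    """Position, in permuted order, of the first row with a 1 in the column.

    perm[k] is the original row index that the permutation places at position k.
    """
    for position, row in enumerate(perm):
        if column[row] == 1:
            return position
    return None  # all-zero column

# Hypothetical columns with A=2, B=1, C=1, D=2.
c_i = [1, 0, 1, 1, 0, 0]
c_j = [1, 1, 0, 1, 0, 0]

rng = random.Random(42)
trials = 20_000
rows = list(range(len(c_i)))
agree = 0
for _ in range(trials):
    rng.shuffle(rows)  # a fresh random row permutation
    if minhash(c_i, rows) == minhash(c_j, rows):
        agree += 1

estimate = agree / trials
print(estimate)  # should be close to A/(A+B+C) = 0.5
```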

Min-Hash Signatures
Pick $P$ random row permutations.
MinHash signature: $\mathrm{sig}(C)$ = list of the $P$ indexes of the first rows with a 1 in column $C$, one per permutation.
Similarity of signatures: let $\mathrm{sim}_H(\mathrm{sig}(C_i), \mathrm{sig}(C_j))$ = fraction of permutations where the MinHash values agree.
Observe: $E[\mathrm{sim}_H(\mathrm{sig}(C_i), \mathrm{sig}(C_j))] = \mathrm{sim}_J(C_i, C_j)$
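A minimal sketch of signatures built from explicit permutations, assuming small in-memory columns. The function names, columns, seed, and `P = 200` are illustrative choices, not part of the original slides.

```python
import random

def minhash_signature(column, perms):
    """One MinHash value per permutation: position of the first row holding a 1."""
    return [
        next(pos for pos, row in enumerate(perm) if column[row] == 1)
        for perm in perms
    ]

def signature_similarity(sig_a, sig_b):
    """Fraction of permutations on which the two MinHash values agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

# Hypothetical columns with Jaccard similarity 3/5.
c_i = [1, 1, 0, 1, 0, 0, 1, 0]
c_j = [1, 0, 0, 1, 1, 0, 1, 0]

rng = random.Random(7)
n_rows, P = len(c_i), 200
perms = [rng.sample(range(n_rows), n_rows) for _ in range(P)]

sim_h = signature_similarity(minhash_signature(c_i, perms),
                             minhash_signature(c_j, perms))
print(sim_h)  # an estimate of 0.6; the error shrinks as P grows
```

The estimate is only equal to the Jaccard similarity in expectation, so a single run with finite $P$ will typically be off by a few percent.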

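Example
Input matrix: columns $C_1$, $C_2$, $C_3$ and rows $R_1$ to $R_5$ (entries not preserved in this transcript).
Permutations: Perm 1 = (12345), Perm 2 = (54321), Perm 3 = (34512)
Signatures $S_1$, $S_2$, $S_3$: one MinHash value per permutation (values not preserved).
Similarities: column-to-column (Col-Col) compared with signature-to-signature (Sig-Sig) (values not preserved).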

Implementation Trick
Permuting rows even once is prohibitive.
Row hashing:
  Pick $P$ hash functions $h_k : \{1,\dots,n\} \to \{1,\dots,O(n)\}$.
  Ordering the rows by $h_k$ gives a random row permutation.
One-pass implementation:
  For each $C_i$ and $h_k$, keep a “slot” for the min-hash value.
  Initialize all $\mathrm{slot}(C_i, h_k)$ to infinity.
  Scan the rows in arbitrary order, looking for 1's.
  Suppose row $R_j$ has a 1 in column $C_i$.
  For each $h_k$: if $h_k(j) < \mathrm{slot}(C_i, h_k)$, then $\mathrm{slot}(C_i, h_k) \leftarrow h_k(j)$.
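The one-pass procedure could look roughly like the following. The hash family `(a*j + b) mod prime`, the prime `10007`, the seed, and the row encoding (row index plus the list of columns holding a 1) are all assumptions made for this sketch. The slides only require hash functions whose ordering behaves like a random permutation, and a linear hash only approximates that.

```python
import math
import random

def make_hash_functions(P, prime, seed=0):
    """Hypothetical family h(j) = (a*j + b) mod prime, with prime > number of rows."""
    rng = random.Random(seed)
    params = [(rng.randrange(1, prime), rng.randrange(prime)) for _ in range(P)]
    return [lambda j, a=a, b=b: (a * j + b) % prime for a, b in params]

def minhash_one_pass(rows, n_cols, hash_funcs):
    """rows: iterable of (row_index, [columns that have a 1 in this row])."""
    # slots[i][k] holds the smallest h_k value seen so far for column i.
    slots = [[math.inf] * len(hash_funcs) for _ in range(n_cols)]
    for j, cols_with_one in rows:
        values = [h(j) for h in hash_funcs]  # hash the row index once per h_k
        for i in cols_with_one:
            for k, v in enumerate(values):
                if v < slots[i][k]:
                    slots[i][k] = v
    return slots

# Hypothetical sparse input: five rows, two columns.
rows = [(1, [0]), (2, [1]), (3, [0, 1]), (4, [0]), (5, [1])]
hash_funcs = make_hash_functions(P=4, prime=10007)
slots = minhash_one_pass(rows, n_cols=2, hash_funcs=hash_funcs)
print(slots)
```

Because each slot only ever keeps a minimum, the result does not depend on the order in which rows are scanned, which is what allows a single pass over the data in arbitrary order.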

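Example
Input matrix: columns $C_1$, $C_2$ and rows $R_1$ to $R_5$ (entries not preserved in this transcript).
Hash functions: $h(x) = x \bmod 5$, $g(x) = (2x + 1) \bmod 5$

Slot values after each step ("-" means still infinity):

  Step        C_1 slots   C_2 slots
  h(1) = 1    1           -
  g(1) = 3    3           -
  h(2) = 2    1           2
  g(2) = 0    3           0
  h(3) = 3    1           2
  g(3) = 2    2           0
  h(4) = 4    1           2
  g(4) = 4    2           0
  h(5) = 0    1           0
  g(5) = 1    2           0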
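The slot trace can be replayed in code. The input matrix did not survive in the transcript, so the `matrix` below is an assumption: it is one matrix consistent with the recorded slot values, and the trace alone does not pin down every entry of rows 3 and 4. The hash functions `h` and `g` are the ones given in the example.

```python
import math

# ASSUMED matrix: row index -> (bit in C1, bit in C2). Consistent with the
# slot trace, but not recoverable with certainty from the transcript.
matrix = {1: (1, 0), 2: (0, 1), 3: (1, 1), 4: (1, 0), 5: (0, 1)}

def h(x):
    return x % 5

def g(x):
    return (2 * x + 1) % 5

slots = {"C1": {"h": math.inf, "g": math.inf},
         "C2": {"h": math.inf, "g": math.inf}}
trace = []
for row in sorted(matrix):
    for name, f in (("h", h), ("g", g)):
        value = f(row)
        for col, bit in zip(("C1", "C2"), matrix[row]):
            # Only rows with a 1 in the column may lower that column's slot.
            if bit == 1 and value < slots[col][name]:
                slots[col][name] = value
    trace.append((row, slots["C1"]["h"], slots["C1"]["g"],
                  slots["C2"]["h"], slots["C2"]["g"]))

for step in trace:
    print(step)  # (row, C1 h-slot, C1 g-slot, C2 h-slot, C2 g-slot)
```

Under this assumed matrix the final slots are 1 and 2 for $C_1$ and 0 and 0 for $C_2$, matching the last two steps of the trace.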

Comparing Signatures
Signature matrix $S$:
  Rows = hash functions
  Columns = columns of the original matrix
  Entries = signatures
Compute: pair-wise similarity of signature columns.
Problem:
  MinHash fits the column signatures in memory,
  but comparing all signature pairs takes too much time.
Technique to limit candidate pairs?
  A-Priori does not work.
  Locality Sensitive Hashing (LSH).
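To see why comparing all signature pairs is the bottleneck, here is a brute-force sketch. The signatures and the one-million-column figure are hypothetical; the code only illustrates the quadratic pair count, not LSH itself.

```python
from itertools import combinations

def all_pairs_similarity(signatures):
    """Compare every pair of signature columns: n*(n-1)/2 comparisons."""
    result = {}
    for (i, sig_i), (j, sig_j) in combinations(enumerate(signatures), 2):
        agree = sum(a == b for a, b in zip(sig_i, sig_j))
        result[(i, j)] = agree / len(sig_i)
    return result

# Hypothetical signatures for three columns, three hash functions each.
signatures = [[1, 4, 3], [2, 5, 5], [1, 4, 4]]
pairs = all_pairs_similarity(signatures)
print(pairs)

# Number of pairs for an assumed collection of one million columns.
n = 1_000_000
print(n * (n - 1) // 2)  # roughly 5 * 10**11 comparisons
```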