Finding Similar Items 1 Wu-Jun Li Department of Computer Science and Engineering Shanghai Jiao Tong University Lecture 10: Finding Similar Items Mining.

Slides:

Advertisements

Similar presentations

Lecture outline Nearest-neighbor search in low dimensions

Advertisements

Applications Shingling Minhashing Locality-Sensitive Hashing

Similarity and Distance Sketching, Locality Sensitive Hashing

Data Mining of Very Large Data

Near-Duplicates Detection

CSCE 3400 Data Structures & Algorithm Analysis

Data Structures Using C++ 2E

High Dimensional Search Min-Hashing Locality Sensitive Hashing

MMDS Secs Slides adapted from: J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, October.

Indian Statistical Institute Kolkata

Randomized / Hashing Algorithms

Min-Hashing, Locality Sensitive Hashing Clustering

Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University Note to other teachers and users of these.

Association Rule Mining

My Favorite Algorithms for Large-Scale Data Mining

Near Duplicate Detection

Asssociation Rules Prof. Sin-Min Lee Department of Computer Science.

Finding Similar Items. Set Similarity Problem: Find similar sets. Motivation: Many things can be modeled/represented as sets Applications: –Face Recognition.

1 Near-Neighbor Search Applications Matrix Formulation Minhashing.

Finding Similar Items.

1 Near-Neighbor Search Applications Matrix Formulation Minhashing.

Data Structures Using C++ 2E Chapter 9 Searching and Hashing Algorithms.

Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University Note to other teachers and users of these.

1 Finding Similar Pairs Divide-Compute-Merge Locality-Sensitive Hashing Applications.

1 Locality-Sensitive Hashing Basic Technique Hamming-LSH Applications.

Finding Near Duplicates (Adapted from slides and material from Rajeev Motwani and Jeff Ullman)

1 Low-Support, High-Correlation Finding Rare but Similar Items Minhashing Locality-Sensitive Hashing.

CS2110 Recitation Week 8. Hashing Hashing: An implementation of a set. It provides O(1) expected time for set operations Set operations Make the set empty.

Cloud and Big Data Summer School, Stockholm, Aug., 2015 Jeffrey D. Ullman.

Similarity and Distance Sketching, Locality Sensitive Hashing

Cloud and Big Data Summer School, Stockholm, Aug., 2015 Jeffrey D. Ullman.

CHAPTER 09 Compiled by: Dr. Mohammad Omar Alhawarat Sorting & Searching.

Frequent Itemsets and Association Rules 1 Wu-Jun Li Department of Computer Science and Engineering Shanghai Jiao Tong University Lecture 3: Frequent Itemsets.

Brief (non-technical) history Full-text index search engines Altavista, Excite, Infoseek, Inktomi, ca Taxonomies populated with web page Yahoo.

Hashing Sections 10.2 – 10.3 CS 302 Dr. George Bebis.

1 Low-Support, High-Correlation Finding Rare but Similar Items Minhashing.

1 Low-Support, High-Correlation Finding rare, but very similar items.

Hashing Basis Ideas A data structure that allows insertion, deletion and search in O(1) in average. A data structure that allows insertion, deletion and.

DATA MINING LECTURE 6 Similarity and Distance Sketching, Locality Sensitive Hashing.

Document duplication (exact or approximate) Paolo Ferragina Dipartimento di Informatica Università di Pisa Slides only!

DATA MINING LECTURE 6 Sketching, Min-Hashing, Locality Sensitive Hashing.

UNIT 5.  The related activities of sorting, searching and merging are central to many computer applications.  Sorting and merging provide us with a.

CS425: Algorithms for Web Scale Data Most of the slides are from the Mining of Massive Datasets book. These slides have been modified for CS425. The original.

CS425: Algorithms for Web Scale Data Most of the slides are from the Mining of Massive Datasets book. These slides have been modified for CS425. The original.

Jeffrey D. Ullman Stanford University.  2% of your grade will be for answering other students’ questions on Piazza.  18% for Gradiance.  Piazza code.

DATA MINING LECTURE 5 MinHashing, Locality Sensitive Hashing, Clustering.

1 Hashing by Adlane Habed School of Computer Science University of Windsor May 6, 2005.

Randomized / Hashing Algorithms Shannon Quinn (with thanks to William Cohen of Carnegie Mellon University, and J. Leskovec, A. Rajaraman, and J. Ullman.

MapReduce and the New Software Stack. Outline  Algorithm Using MapReduce  Matrix-Vector Multiplication  Matrix-Vector Multiplication by MapReduce 

DATA MINING LECTURE 6 Sketching, Locality Sensitive Hashing.

Jeffrey D. Ullman Stanford University. 2  Generalized LSH is based on some kind of “distance” between points.  Similar points are “close.”  Example:

Shingling Minhashing Locality-Sensitive Hashing

Clustering Shannon Quinn (with thanks to J. Leskovec, A. Rajaraman, and J. Ullman of Stanford University)

Fast Pseudo-Random Fingerprints Yoram Bachrach, Microsoft Research Cambridge Ely Porat – Bar Ilan-University.

CS276A Text Information Retrieval, Mining, and Exploitation

Near Duplicate Detection

Finding Similar Items Jeffrey D. Ullman Application: Similar Documents

PageRank Random Surfers on the Web Transition Matrix of the Web Dead Ends and Spider Traps Topic-Specific PageRank Jeffrey D. Ullman Stanford University.

Sketching, Locality Sensitive Hashing

Finding Similar Items: Locality Sensitive Hashing

Finding Similar Items: Locality Sensitive Hashing

Shingling Minhashing Locality-Sensitive Hashing

Theory of Locality Sensitive Hashing

Evaluation of Relational Operations

Hash-Based Improvements to A-Priori

Hashing Sections 10.2 – 10.3 Lecture 26 CS302 Data Structures

Minwise Hashing and Efficient Search

Near Neighbor Search in High Dimensional Data (1)

Three Essential Techniques for Similar Documents

Locality-Sensitive Hashing

Presentation transcript:

Finding Similar Items 1 Wu-Jun Li Department of Computer Science and Engineering Shanghai Jiao Tong University Lecture 10: Finding Similar Items Mining Massive Datasets

Finding Similar Items 2 Outline  Introduction  Shingling  Minhashing  Locality-Sensitive Hashing

Finding Similar Items 3 Goals  Many Web-mining problems can be expressed as finding “ similar ” sets: 1.Pages with similar words, e.g., for classification by topic. 2.NetFlix users with similar tastes in movies, for recommendation systems. 3.Dual: movies with similar sets of fans. 4.Images of related things. Introduction

Finding Similar Items 4 Example Problem: Comparing Documents  Goal: common text.  Special cases are easy, e.g., identical documents, or one document contained character-by-character in another.  General case, where many small pieces of one doc appear out of order in another, is very hard. Introduction

Finding Similar Items 5 Similar Documents – (2)  Given a body of documents, e.g., the Web, find pairs of documents with a lot of text in common, e.g.:  Mirror sites, or approximate mirrors.  Application: Don ’ t want to show both in a search.  Plagiarism, including large quotations.  Similar news articles at many news sites.  Application: Cluster articles by “ same story. ” Introduction

Finding Similar Items 6 Three Essential Techniques for Similar Documents 1.Shingling : convert documents, s, etc., to sets. 2.Minhashing : convert large sets to short signatures, while preserving similarity. 3.Locality-sensitive hashing : focus on pairs of signatures likely to be similar. Introduction

Finding Similar Items 7 The Big Picture Shingling Docu- ment The set of strings of length k that appear in the document Minhash- ing Signatures : short integer vectors that represent the sets, and reflect their similarity Locality- sensitive Hashing Candidate pairs : those pairs of signatures that we need to test for similarity. Introduction

Finding Similar Items 8 Outline  Introduction  Shingling  Minhashing  Locality-Sensitive Hashing

Finding Similar Items 9 Shingles  A k -shingle (or k -gram) for a document is a sequence of k characters that appears in the document.  Example: k=2; doc = abcab. Set of 2-shingles = {ab, bc, ca}.  Option: regard shingles as a bag, and count ab twice.  Represent a doc by its set of k-shingles. Shingling

Finding Similar Items 10 Working Assumption  Documents that have lots of shingles in common have similar text, even if the text appears in different order.  Careful: you must pick k large enough, or most documents will have most shingles.  k = 5 is OK for short documents; k = 10 is better for long documents. Shingling

Finding Similar Items 11 Shingles: Compression Option  To compress long shingles, we can hash them to (say) 4 bytes (integer).  Represent a doc by the set of hash values of its k- shingles.  Two documents could rarely appear to have shingles in common, when in fact only the hash- values were shared. Shingling

Finding Similar Items 12 Outline  Introduction  Shingling  Minhashing  Locality-Sensitive Hashing

Finding Similar Items 13 Basic Data Model: Sets  Many similarity problems can be couched as finding subsets of some universal set that have significant intersection.  Examples include: 1.Documents represented by their sets of shingles (or hashes of those shingles). 2.Similar customers or products. Minhashing

Finding Similar Items 14 Jaccard Similarity of Sets  The Jaccard similarity of two sets is the size of their intersection divided by the size of their union.  Sim (C 1, C 2 ) = |C 1  C 2 |/|C 1  C 2 |. Minhashing

Finding Similar Items 15 Example: Jaccard Similarity 3 in intersection. 8 in union. Jaccard similarity = 3/8 Minhashing

Finding Similar Items 16 From Sets to Boolean Matrices  Rows = elements of the universal set.  Columns = sets.  1 in row e and column S if and only if e is a member of S.  Column similarity is the Jaccard similarity of the sets of their rows with 1.  Typical matrix is sparse. Minhashing

Finding Similar Items 17 Example: Jaccard Similarity of Columns C 1 C Sim (C 1, C 2 ) = 2/5 = * * * * * * * Minhashing

Finding Similar Items 18 Aside  We might not really represent the data by a boolean matrix.  Sparse matrices are usually better represented by the list of places where there is a non-zero value.  But the matrix picture is conceptually useful. Minhashing

Finding Similar Items 19 When Is Similarity Interesting? 1.When the sets are so large or so many that they cannot fit in main memory. 2.Or, when there are so many sets that comparing all pairs of sets takes too much time. 3.Or both. Minhashing

Finding Similar Items 20 Outline: Finding Similar Columns 1.Compute signatures of columns = small summaries of columns. 2.Examine pairs of signatures to find similar signatures.  Essential: similarities of signatures and columns are related. 3.Optional: check that columns with similar signatures are really similar. Minhashing

Finding Similar Items 21 Warnings 1.Comparing all pairs of signatures may take too much time, even if not too much space.  A job for Locality-Sensitive Hashing. 2.These methods can produce false negatives, and even false positives (if the optional check is not made). Minhashing

Finding Similar Items 22 Signatures  Key idea: “ hash ” each column C to a small signature Sig (C), such that: 1.Sig (C) is small enough that we can fit a signature in main memory for each column. 2.Sim (C 1, C 2 ) is the same as the “ similarity ” of Sig (C 1 ) and Sig (C 2 ). Minhashing

Finding Similar Items 23 Four Types of Rows  Given columns C 1 and C 2, rows may be classified as: C 1 C 2 a11 b10 c01 d00  Also, a = # rows of type a, etc.  Note Sim (C 1, C 2 ) = a /(a +b +c ). Minhashing

Finding Similar Items 24 Minhashing  Imagine the rows permuted randomly.  Define “ hash ” function h (C ) = the number of the first (in the permuted order) row in which column C has 1.  Use several (e.g., 100) independent hash functions to create a signature. Minhashing

Finding Similar Items 25 Minhashing Example Input matrix Signature matrix M Minhashing

Finding Similar Items 26 Surprising Property  The probability (over all permutations of the rows) that h (C 1 ) = h (C 2 ) is the same as Sim (C 1, C 2 ).  Both are a /(a +b +c )!  Why?  Look down the permuted columns C 1 and C 2 until we see a 1.  If it ’ s a type-a row, then h (C 1 ) = h (C 2 ). If a type-b or type-c row, then not. Minhashing

Finding Similar Items 27 Similarity for Signatures  The similarity of signatures is the fraction of the hash functions in which they agree. Minhashing

Finding Similar Items 28 Min Hashing – Example Input matrix Signature matrix M Similarities: Col/Col Sig/Sig Minhashing

Finding Similar Items 29 Minhash Signatures  Pick (say) 100 random permutations of the rows.  Think of Sig (C) as a column vector.  Let Sig (C)[i] = according to the i th permutation, the number of the first row that has a 1 in column C. Minhashing

Finding Similar Items 30 Implementation – (1)  Suppose 1 billion rows.  Hard to pick a random permutation from 1 … billion.  Representing a random permutation requires 1 billion entries.  Accessing rows in permuted order leads to thrashing. Minhashing

Finding Similar Items 31 Implementation – (2)  A good approximation to permuting rows: pick 100 (?) hash functions.  For each column c and each hash function h i, keep a “ slot ” M (i, c ).  Intent: M (i, c ) will become the smallest value of h i (r ) for which column c has 1 in row r.  I.e., h i (r ) gives order of rows for i th permuation. Minhashing

Finding Similar Items 32 Implementation – (3) Initialize M(i,c) to ∞ for all i and c for each row r for each column c if c has 1 in row r for each hash function h i do if h i (r ) is a smaller value than M (i, c ) then M (i, c ) := h i (r ); Minhashing

Finding Similar Items 33 Example RowC1C h(x) = x mod 5 g(x) = 2x+1 mod 5 h(1) = 11- g(1) = 33- h(2) = 212 g(2) = 030 h(3) = 312 g(3) = 220 h(4) = 412 g(4) = 420 h(5) = 010 g(5) = 120 Sig1Sig2 Minhashing

Finding Similar Items 34 Implementation – (4)  Often, data is given by column, not row.  E.g., columns = documents, rows = shingles.  If so, sort matrix once so it is by row.  And always compute h i (r ) only once for each row. Minhashing

Finding Similar Items 35 Outline  Introduction  Shingling  Minhashing  Locality-Sensitive Hashing

Finding Similar Items 36 Finding Similar Pairs  Suppose we have, in main memory, data representing a large number of objects.  May be the objects themselves.  May be signatures as in minhashing.  We want to compare each to each, finding those pairs that are sufficiently similar. Locality-Sensitive Hashing

Finding Similar Items 37 Checking All Pairs is Hard  While the signatures of all columns may fit in main memory, comparing the signatures of all pairs of columns is quadratic in the number of columns.  Example: 10 6 columns implies 5*10 11 column- comparisons.  At 1 microsecond/comparison: 6 days. Locality-Sensitive Hashing

Finding Similar Items 38 Locality-Sensitive Hashing  General idea: Use a function f(x,y) that tells whether or not x and y is a candidate pair : a pair of elements whose similarity must be evaluated.  For minhash matrices: Hash columns to many buckets, and make elements of the same bucket candidate pairs. Locality-Sensitive Hashing

Finding Similar Items 39 Candidate Generation From Minhash Signatures  Pick a similarity threshold s, a fraction < 1.  A pair of columns c and d is a candidate pair if their signatures agree in at least fraction s of the rows.  I.e., M (i, c ) = M (i, d ) for at least fraction s values of i. Locality-Sensitive Hashing

Finding Similar Items 40 LSH for Minhash Signatures  Big idea: hash columns of signature matrix M several times.  Arrange that (only) similar columns are likely to hash to the same bucket.  Candidate pairs are those that hash at least once to the same bucket. Locality-Sensitive Hashing

Finding Similar Items 41 Partition Into Bands Matrix M r rows per band b bands One signature Locality-Sensitive Hashing

Finding Similar Items 42 Partition into Bands – (2)  Divide matrix M into b bands of r rows.  For each band, hash its portion of each column to a hash table with k buckets.  Make k as large as possible.  Candidate column pairs are those that hash to the same bucket for ≥ 1 band.  Tune b and r to catch most similar pairs, but few dissimilar pairs. Locality-Sensitive Hashing

Finding Similar Items 43 Matrix M r rows b bands Buckets Columns 2 and 6 are probably identical. Columns 6 and 7 are surely different. Locality-Sensitive Hashing

Finding Similar Items 44 Simplifying Assumption  There are enough buckets that columns are unlikely to hash to the same bucket unless they are identical in a particular band.  Hereafter, we assume that “ same bucket ” means “ identical in that band. ” Locality-Sensitive Hashing

Finding Similar Items 45 Example: Effect of Bands  Suppose 100,000 columns.  Signatures of 100 integers.  Therefore, signatures take 40Mb.  Want all 80%-similar pairs.  5,000,000,000 pairs of signatures can take a while to compare.  Choose 20 bands of 5 integers/band. Locality-Sensitive Hashing

Finding Similar Items 46 Suppose C 1, C 2 are 80% Similar  Probability C 1, C 2 identical in one particular band: (0.8) 5 =  Probability C 1, C 2 are not similar in any of the 20 bands: ( ) 20 =  i.e., about 1/3000th of the 80%-similar column pairs are false negatives. Locality-Sensitive Hashing

Finding Similar Items 47 Suppose C 1, C 2 Only 30% Similar  Probability C 1, C 2 identical in any one particular band: (0.3) 5 =  Probability C 1, C 2 identical in ≥ 1 of 20 bands: ≤ 20 * =  In other words, approximately 4.86% pairs of docs with similarity 30% end up becoming candidate pairs  False positives Locality-Sensitive Hashing

Finding Similar Items 48 LSH Involves a Tradeoff  Pick the number of minhashes, the number of bands, and the number of rows per band to balance false positives/negatives.  Example: if we had only 15 bands of 5 rows, the number of false positives would go down, but the number of false negatives would go up. Locality-Sensitive Hashing

Finding Similar Items 49 Analysis of LSH – What We Want Similarity s of two sets Probability of sharing a bucket t No chance if s < t Probability = 1 if s > t Locality-Sensitive Hashing

Finding Similar Items 50 What One Band of One Row Gives You Similarity s of two sets Probability of sharing a bucket t Remember: probability of equal hash-values = similarity Locality-Sensitive Hashing

Finding Similar Items 51 What b Bands of r Rows Gives You Similarity s of two sets Probability of sharing a bucket t s rs r All rows of a band are equal 1 - Some row of a band unequal ()b)b No bands identical 1 - At least one band identical t ~ (1/b) 1/r Locality-Sensitive Hashing

Finding Similar Items 52 Example: b = 20; r = 5 s 1-(1-s r ) b Locality-Sensitive Hashing

Finding Similar Items 53 LSH Summary  Tune to get almost all pairs with similar signatures, but eliminate most pairs that do not have similar signatures.  Check in main memory that candidate pairs really do have similar signatures.  Optional: In another pass through data, check that the remaining candidate pairs really represent similar sets. Locality-Sensitive Hashing

Finding Similar Items 54 Acknowledgement  Slides are from  Prof. Jeffrey D. Ullman  Dr. Anand Rajaraman  Dr. Jure Leskovec