Minwise Hashing and Efficient Search

Similar presentations
Hash Tables CS 310 – Professor Roch Weiss Chapter 20 All figures marked with a chapter and section number are copyrighted © 2006 by Pearson Addison-Wesley.
Hashing.
Indexing DNA Sequences Using q-Grams
Nearest Neighbor Search in High Dimensions Seminar in Algorithms and Geometry Mica Arie-Nachimson and Daniel Glasner April 2009.
Overcoming the L 1 Non- Embeddability Barrier Robert Krauthgamer (Weizmann Institute) Joint work with Alexandr Andoni and Piotr Indyk (MIT)
Fast Algorithms For Hierarchical Range Histogram Constructions
Near-Duplicates Detection
Latent Semantic Indexing (mapping onto a smaller space of latent concepts) Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 18.
High Dimensional Search Min-Hashing Locality Sensitive Hashing
Searching on Multi-Dimensional Data
MIT CSAIL Vision interfaces Towards efficient matching with random hashing methods… Kristen Grauman Gregory Shakhnarovich Trevor Darrell.
MMDS Secs Slides adapted from: J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, October.
Similarity Search in High Dimensions via Hashing
Data Structures and Functional Programming Algorithms for Big Data Ramin Zabih Cornell University Fall 2012.
1 Algorithms for Large Data Sets Ziv Bar-Yossef Lecture 12 June 18, 2006
1 Lecture 18 Syntactic Web Clustering CS
Near Duplicate Detection
Hash Tables1 Part E Hash Tables  
1 An Empirical Study on Large-Scale Content-Based Image Retrieval Group Meeting Presented by Wyman
Nearest Neighbor Retrieval Using Distance-Based Hashing Michalis Potamias and Panagiotis Papapetrou supervised by Prof George Kollios A method is proposed.
Finding Near Duplicates (Adapted from slides and material from Rajeev Motwani and Jeff Ullman)
Cloud and Big Data Summer School, Stockholm, Aug., 2015 Jeffrey D. Ullman.
Streaming Algorithms Piotr Indyk MIT. Data Streams A data stream is a sequence of data that is too large to be stored in available memory Examples: –Network.
David Luebke 1 10/25/2015 CS 332: Algorithms Skip Lists Hash Tables.
80 million tiny images: a large dataset for non-parametric object and scene recognition CS 4763 Multimedia Systems Spring 2008.
1 CSE 326: Data Structures: Hash Tables Lecture 12: Monday, Feb 3, 2003.
Similarity Searching in High Dimensions via Hashing Paper by: Aristides Gionis, Poitr Indyk, Rajeev Motwani.
Document duplication (exact or approximate) Paolo Ferragina Dipartimento di Informatica Università di Pisa Slides only!
DATA MINING LECTURE 6 Sketching, Min-Hashing, Locality Sensitive Hashing.
CS425: Algorithms for Web Scale Data Most of the slides are from the Mining of Massive Datasets book. These slides have been modified for CS425. The original.
CS6045: Advanced Algorithms Data Structures. Hashing Tables Motivation: symbol tables –A compiler uses a symbol table to relate symbols to associated.
Artificial Intelligence in Game Design Lecture 20: Hill Climbing and N-Grams.
Similarity Estimation Techniques from Rounding Algorithms Paper Review Jieun Lee Moses S. Charikar Princeton University Advanced Database.
DS.H.1 Hashing Chapter 5 Overview The General Idea Hash Functions Separate Chaining Open Addressing Rehashing Extendible Hashing Application Example: Geometric.
Locality-sensitive hashing and its applications
Large Scale Search: Inverted Index, etc.
SIMILARITY SEARCH The Metric Space Approach
COMP 53 – Week Eleven Hashtables.
Lower bounds for approximate membership dynamic data structures
Near Duplicate Detection
CS 332: Algorithms Hash Tables David Luebke /19/2018.
Streaming & sampling.
Sketching, Locality Sensitive Hashing
Finding Similar Items: Locality Sensitive Hashing
Theory of Locality Sensitive Hashing
Sublinear Algorithmic Tools 3
Lecture 11: Nearest Neighbor Search
Sublinear Algorithmic Tools 2
Database Implementation Issues
Randomized Algorithms CS648
Lecture 7: Dynamic sampling Dimension Reduction
mEEC: A Novel Error Estimation Code with Multi-Dimensional Feature
Locality Sensitive Hashing
CH 9.2 : Hash Tables Acknowledgement: These slides are adapted from slides provided with Data Structures and Algorithms in C++, Goodrich, Tamassia and.
Searching Similar Segments over Textual Event Sequences
CH 9.2 : Hash Tables Acknowledgement: These slides are adapted from slides provided with Data Structures and Algorithms in C++, Goodrich, Tamassia and.
CS202 - Fundamental Structures of Computer Science II
DATABASE IMPLEMENTATION ISSUES
2018, Spring Pusan National University Ki-Joune Li
CPS216: Advanced Database Systems
Lecture 13: Query Execution
Pseudorandom number, Universal Hashing, Chaining and Linear-Probing
CS5112: Algorithms and Data Structures for Applications
Lecture 15: Least Square Regression Metric Embeddings
On the resemblance and containment of documents (MinHash)
Database Implementation Issues
CSE 326: Data Structures Lecture #14
Database Implementation Issues
Index Structures Chapter 13 of GUW September 16, 2019
Presentation transcript:

Minwise Hashing and Efficient Search

Example Problem: large-scale search. We have a query image and want to search a giant database (the internet) to find similar images, fast and accurately. (Picture taken from the Internet.)

Large-Scale Image Search in a Database: find similar images in a large database. (Kristen Grauman et al.)

Large-scale image search: the representation must fit in memory (disk is too slow). Facebook has ~10 billion images (10^10); a PC has ~10 GB of memory (~10^11 bits). Images are very high-dimensional objects. (Fergus et al.)

Solution: Hashing Algorithms. Simple problem: given an array of n integers, e.g. [1,2,3,1,5,2,1,3,2,3,7], remove the duplicates. Is it O(n^2), O(n log n), or O(n)? What if we want near-duplicates, so that 1.001 is considered a duplicate of 1? (This is more realistic.)
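A minimal sketch of the exact-duplicate case in Python (the function name and the example array are just for illustration): a hash set gives expected O(n) deduplication.

    def remove_duplicates(values):
        # A hash set gives expected O(1) membership tests, so one pass is O(n).
        seen = set()
        out = []
        for v in values:
            if v not in seen:
                seen.add(v)
                out.append(v)
        return out

    print(remove_duplicates([1, 2, 3, 1, 5, 2, 1, 3, 2, 3, 7]))  # [1, 2, 3, 5, 7]

Near-duplicates are the hard part: an exact hash of 1.001 and of 1 will almost never collide, which is what the rest of the lecture addresses.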

Similarity or Near-Neighbor Search. Given a (relatively fixed) collection C and a similarity (or distance) metric, for any query q compute x* = argmax_{x in C} sim(q, x). Brute force costs O(nD) per query, where n is the size of C and D is the number of dimensions. Querying is a very frequent operation, and n and D are large. Goal: something more efficient. Hope: preprocess the collection. Approximate solutions are perfectly fine.
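As a baseline, the brute-force scan described here can be sketched as below (sim is any similarity function; the names are illustrative). It is one similarity evaluation per item, hence O(nD) per query when each evaluation costs O(D).

    def brute_force_search(query, collection, sim):
        # O(n) similarity evaluations, each O(D) -> O(nD) per query.
        return max(collection, key=lambda x: sim(query, x))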

Space Partitioning Methods: partition the space and organize the database into trees. In high dimensions, space partitioning is not at all efficient; even D > 10 leads to a near-exhaustive search. (Picture taken from the Internet.)

Motivating Problem: Search Engines. Task: correcting a user-typed query in real time. Take a database D of statistically significant query strings observed in the past (around 50 million). Given a new user-typed query q, find the closest string s in D (in some distance) and return the associated results. Latency: 50 million distance computations per query. With a cheap distance function this takes about 400 s, roughly 7 minutes, on a reasonable CPU; with edit distance it would take hours. The latency limit is roughly < 20 ms. Can we do better? Exact solution: no. Approximation: yes. We can do it in about 2 ms, around 210,000x faster!

Locality Sensitive Hashing. Classical hashing: if x = y ⇒ h(x) = h(y); if x ≠ y ⇒ h(x) ≠ h(y). Locality sensitive hashing (a randomized relaxation): if sim(x,y) is high ⇒ the probability that h(x) = h(y) is high; if sim(x,y) is low ⇒ the probability that h(x) = h(y) is low. Many families of similarity are known to admit such an h; we will see one today. Conversely, h(x) = h(y) implies that sim(x,y) is high (probabilistically).

Our Notion of Similarity: Jaccard. Given two sets S1 and S2, the Jaccard similarity is defined as J = sim(S1, S2) = |S1 ∩ S2| / |S1 ∪ S2|. Simple example: S1 = {3, 10, 15, 19}, S2 = {4, 10, 15}. What is J? The intersection is {10, 15} and the union has 5 elements, so J = 2/5. What about strings? Weren't we looking at query strings?
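A two-line sketch of the definition, checked against the example above (the function name is illustrative):

    def jaccard(s1, s2):
        # |S1 ∩ S2| / |S1 ∪ S2|
        return len(s1 & s2) / len(s1 | s2)

    print(jaccard({3, 10, 15, 19}, {4, 10, 15}))  # 0.4, i.e. 2/5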

N-grams turn a string into a set! We will use character 3-gram representations: take a string and convert it into the set of all contiguous 3-character tokens. String "iphone 6" → {iph, pho, hon, one, "ne ", "e 6"}. What is the Jaccard similarity (assuming the character 3-gram representation)?
amazon vs anazon: {ama, maz, azo, zon} vs {ana, naz, azo, zon} = 2/6 = 1/3; with boundary markers, {$am, ama, maz, azo, zon, on.} vs {$an, ana, naz, azo, zon, on.} = 3/9 = 1/3.
amazon vs amazom: {ama, maz, azo, zon} vs {ama, maz, azo, zom} = 3/5; with boundary markers, {$am, ama, maz, azo, zon, on.} vs {$am, ama, maz, azo, zom, om.} = 4/8 = 1/2.
amazon vs random?
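A small sketch of 3-gram shingling, with an option for the '$' start marker and '.' end marker used in the padded variants above (names are illustrative):

    def char_3grams(s, pad=False):
        # pad=True adds the '$' start marker and '.' end marker.
        if pad:
            s = "$" + s + "."
        return {s[i:i + 3] for i in range(len(s) - 2)}

    a, b = char_3grams("amazon"), char_3grams("anazon")
    print(len(a & b) / len(a | b))  # 2/6 = 1/3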

Random Sampling Using Universal Hashing. Given a set {ama, maz, azo, zon} and a random hash function U: strings → [0, R], with Pr(U(s) = c) = 1/R for every c. Problem: using U, can we pick a random element of the set, with every element equally likely (probability 1/4)? Solution: hash every token using U and pick the token with the minimum (or maximum) hash value. Example: {U(ama), U(maz), U(azo), U(zon)} = {10, 2005, 199, 2}, so the randomly sampled element is "zon". Proof?

Minwise Hashing (Minhash). Document: S = {ama, maz, azo, zon, on.}. Generate a random U_i: strings → N, e.g. MurmurHash3 with a new random seed i. U_i(S) = {U_i(ama), U_i(maz), U_i(azo), U_i(zon), U_i(on.)}. Let's say U_i(S) = {153, 283, 505, 128, 292}. Then the minhash is h_i = min U_i(S) = 128. A new seed gives a new hash function U_j and a new minhash.
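A minimal sketch of one minhash, assuming the third-party mmh3 package (MurmurHash3 bindings) is available; the seed plays the role of i:

    import mmh3

    def minhash(tokens, seed):
        # Hash every token with the seeded hash function and keep the minimum.
        return min(mmh3.hash(t, seed) for t in tokens)

    S = {"ama", "maz", "azo", "zon", "on."}
    print([minhash(S, seed) for seed in range(5)])  # five independent minhashes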

Properties of Minwise Hashing. It can be applied to any set. Minhash: Set → [0, R] (R large enough). For any sets S1 and S2: Pr[Minhash(S1) = Minhash(S2)] = |S1 ∩ S2| / |S1 ∪ S2|. Proof (under the randomness of the hash function U). Fact 1: for any set, the element with the minimum hash value is a uniformly random sample. Consider the set S1 ∪ S2 and sample a random element e using U (the element of S1 ∪ S2 with the smallest hash). Claim 1: e ∈ S1 ∩ S2 if and only if Minhash(S1) = Minhash(S2), because that overall minimizer determines both minhashes, and they agree exactly when it belongs to both sets.
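A quick Monte Carlo check of the collision probability (a sketch under the same mmh3 assumption; the two sets are arbitrary examples):

    import mmh3

    def minhash(tokens, seed):
        return min(mmh3.hash(t, seed) for t in tokens)

    S1 = {"ama", "maz", "azo", "zon", "on."}
    S2 = {"ama", "maz", "azo", "zom", "om."}
    trials = 10000
    agree = sum(minhash(S1, seed) == minhash(S2, seed) for seed in range(trials))
    print(agree / trials)               # empirical collision rate
    print(len(S1 & S2) / len(S1 | S2))  # exact Jaccard = 3/7, should be close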

Estimate Similarity Efficiently. Pr[Minhash(S1) = Minhash(S2)] = |S1 ∩ S2| / |S1 ∪ S2| = J. Given 50 minhashes of S1 and S2, how can we estimate J? Count the fraction of the 50 positions where the minhashes agree. Memory is 50 numbers. The variance of the estimate is J(1-J)/50; for J = 0.8 the standard error is roughly sqrt(0.16/50) ≈ 0.05. How about random sampling?
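A sketch of the estimator (same mmh3 assumption; 50 seeds give a 50-number signature):

    import mmh3

    def signature(tokens, num_hashes=50):
        # One minhash per seed: the signature is 50 integers.
        return [min(mmh3.hash(t, seed) for t in tokens) for seed in range(num_hashes)]

    def estimate_jaccard(sig1, sig2):
        # Fraction of positions where the two signatures collide.
        return sum(a == b for a, b in zip(sig1, sig2)) / len(sig1)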

The Parity of the Minhash Is Good Too. Document: S = {ama, maz, azo, zon, on.}. Generate a random U_i: strings → N, e.g. MurmurHash3 with a new random seed i. U_i(S) = {U_i(ama), U_i(maz), U_i(azo), U_i(zon), U_i(on.)}. Let's say U_i(S) = {153, 283, 505, 128, 292}. Then the minhash is h_i = min U_i(S) = 128, and its parity is 0 (even): just 1 bit of information. Pr[Parity(Minhash(S1)) = Parity(Minhash(S2))] = |S1 ∩ S2| / |S1 ∪ S2| + (1 - |S1 ∩ S2| / |S1 ∪ S2|) * 0.5 = 0.5 * (1 + |S1 ∩ S2| / |S1 ∪ S2|).

Parity of Minhash: Compression. Given 50 parities of minhashes, how do we estimate J? Memory is 50 bits, i.e. under 7 bytes (two integers). The error for J = 0.8 is a little worse than 0.05 (how would you compute it?). The cost depends only on the similarity, not on how heavy the set is, which is a completely different tradeoff: a set can have 100, 1,000, or 10,000 elements, but the storage cost for similarity estimation is the same.
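A sketch of the estimator implied by the formula on the previous slide: if p_hat is the observed fraction of agreeing parity bits, then p_hat ≈ 0.5 * (1 + J), so J ≈ 2 * p_hat - 1 (names are illustrative):

    def estimate_jaccard_from_parities(bits1, bits2):
        # bits1, bits2: 0/1 parities of the minhashes (e.g. 50 bits each).
        p_hat = sum(a == b for a, b in zip(bits1, bits2)) / len(bits1)
        return 2 * p_hat - 1  # invert p = 0.5 * (1 + J)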

Minwise Hashing Is Locality Sensitive. Pr[Minhash(S1) = Minhash(S2)] = |S1 ∩ S2| / |S1 ∪ S2| = J. If J is high → the probability of collision is high; if J is low → the probability of collision is low. The minhash is an integer, so it can be used for indexing; even its parity can be used.

Locality Sensitive Hashing. Classical hashing: if x = y ⇒ h(x) = h(y); if x ≠ y ⇒ h(x) ≠ h(y). Locality sensitive hashing (a randomized relaxation): if sim(x,y) is high ⇒ the probability that h(x) = h(y) is high; if sim(x,y) is low ⇒ the probability that h(x) = h(y) is low. We will see how to get such an h for Jaccard similarity! Conversely, h(x) = h(y) implies that the Jaccard similarity sim(x,y) is high (probabilistically).

Why is it helpful in search? We have access to h(x) (with a random seed) such that h(x) = h(y) ⇒ a noisy indicator that sim(x,y) is high.

Why is it helpful in search? We have access to h(x) (with a random seed) such that h(x) = h(y) ⇒ a noisy indicator that sim(x,y) is high. Given a query q, compute h_1(q) = 01 and h_2(q) = 11, and treat the concatenated bucket 0111 as a good candidate set. (Why?) We can turn this idea into a sound algorithm later.

Create multiple independent hash tables. For every query, take the union of the L buckets. K controls the quality of each bucket; L controls the failure probability. The optimal choice of K and L is provably efficient.

The LSH Algorithm. Choose K and L (parameters). Generate K x L random seeds (for the hash functions) and create L independent hash tables. Preprocess the database D: for every x ∈ D, index it in hash table i at location (h_{(i-1)K+1}(x), h_{(i-1)K+2}(x), ..., h_{iK}(x)). This costs O(LK) hashes per item. Query with q: take the union of the L buckets, i.e. bucket (h_{(i-1)K+1}(q), h_{(i-1)K+2}(q), ..., h_{iK}(q)) in table i, and return the best elements from that union based on their similarity with q.
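A compact sketch of the whole algorithm (again assuming the mmh3 package; the class and method names are illustrative, and the K-tuple of minhashes is used directly as the bucket key rather than being rehashed, which the next slide addresses):

    import mmh3
    from collections import defaultdict

    class MinhashLSH:
        def __init__(self, K=5, L=10):
            self.K, self.L = K, L
            self.tables = [defaultdict(list) for _ in range(L)]
            self.items = {}

        def _minhash(self, tokens, seed):
            return min(mmh3.hash(t, seed) for t in tokens)

        def _key(self, tokens, i):
            # The K minhashes with seeds i*K, ..., i*K + K - 1 form table i's key.
            return tuple(self._minhash(tokens, i * self.K + j) for j in range(self.K))

        def index(self, item_id, tokens):
            self.items[item_id] = set(tokens)
            for i in range(self.L):
                self.tables[i][self._key(tokens, i)].append(item_id)

        def query(self, tokens, top=10):
            tokens = set(tokens)
            candidates = set()
            for i in range(self.L):  # union of the L buckets
                candidates.update(self.tables[i].get(self._key(tokens, i), ()))
            # Rerank the candidates by exact Jaccard similarity with the query.
            jac = lambda x: len(self.items[x] & tokens) / len(self.items[x] | tokens)
            return sorted(candidates, key=jac, reverse=True)[:top]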

One implementation detail: how is (h_{(i-1)K+1}(x), h_{(i-1)K+2}(x), ..., h_{iK}(x)) a bucket? Use a universal hash that takes K integers and maps them to [0, R]. Typically use ((a_1 x_1 + a_2 x_2 + ... + a_K x_K + c) mod p) mod R, with the a_i chosen randomly. Random collisions are negligible, and insertion and deletion are straightforward!
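A sketch of that bucket hash (the values of p and R below are arbitrary example choices, not anything prescribed by the slide):

    import random

    P = (1 << 61) - 1   # a large prime p
    R = 1 << 20         # range of bucket ids per table

    def make_bucket_hash(K, seed=0):
        rng = random.Random(seed)
        a = [rng.randrange(1, P) for _ in range(K)]
        c = rng.randrange(P)
        def bucket(xs):  # xs: the K minhash integers
            # ((a_1*x_1 + ... + a_K*x_K + c) mod p) mod R
            return (sum(ai * xi for ai, xi in zip(a, xs)) + c) % P % R
        return bucket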

A bit of analysis. Let p_x = Pr(minhash(x) = minhash(q)) = Jaccard(x, q). What is the probability of collision in a hash table? p_x^K is the probability that x is in the bucket mapped by q in hash table 1; 1 - p_x^K is the probability that x is not in that bucket; (1 - p_x^K)^L is the probability that x is in none of the L buckets, i.e. that x is not retrieved; so 1 - (1 - p_x^K)^L is the probability that x is retrieved, a monotonic function of p_x. Example with K = 5 and L = 10: p_x > 0.8 gives a probability of retrieval > 0.98, while p_x < 0.5 gives a probability of retrieval below about 0.27.
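The retrieval probability is easy to evaluate; a small sketch reproducing the two numbers above:

    def retrieval_probability(p, K, L):
        # 1 - (1 - p^K)^L: the chance x lands in at least one probed bucket.
        return 1 - (1 - p ** K) ** L

    print(retrieval_probability(0.8, 5, 10))  # ~0.98
    print(retrieval_probability(0.5, 5, 10))  # ~0.27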

More Theory. Linear search is O(n); LSH-based search is O(n^ρ), where ρ < 1 is a function of the similarity threshold and the gap (we will skip the math). Rule of thumb: if we want very high similarity, LSH is very efficient (it approaches O(1) in the limit). Do the limits make sense? Rule of thumb for parameters: K ≈ log n and L ≈ √n, where n is the size of the database D. Increasing K decreases the number of candidates retrieved exponentially; increasing L increases it only linearly. Practice and play with different datasets and similarity levels to become an expert.
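A tiny sketch of the parameter rule of thumb (the log base and the rounding are arbitrary choices here, not part of the rule):

    import math

    def suggest_parameters(n):
        # Rule of thumb: K ~ log n, L ~ sqrt(n).
        return max(1, round(math.log2(n))), max(1, round(math.sqrt(n)))

    print(suggest_parameters(50_000_000))  # (26, 7071) for the 50-million-string database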