Summer School on Hashing’14 Locality Sensitive Hashing Alex Andoni (Microsoft Research)

Presentation transcript:

Nearest Neighbor Search (NNS)

Approximate NNS (figure labels: q, p, r, cr, c-approximate)

Heuristic for Exact NNS (figure labels: q, p, r, cr, c-approximate)

Locality-Sensitive Hashing [Indyk-Motwani’98] (figure labels: q, p, 1, “not-so-small”)

Locality sensitive hash functions

Full algorithm
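The body of this slide did not survive in the transcript. As a rough illustration of what a full LSH data structure looks like, here is a minimal sketch that assumes Hamming space and the classic bit-sampling hash. The class name `HammingLSH` and the parameters `k` (bits per hash) and `num_tables` are hypothetical names chosen for this example, not taken from the slides.

```python
import random
from collections import defaultdict


def hamming(a, b):
    """Number of coordinates in which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


class HammingLSH:
    """Toy LSH index for bit vectors, using bit sampling (an assumed scheme)."""

    def __init__(self, dim, k, num_tables, seed=0):
        rng = random.Random(seed)
        # Table j uses g_j(p) = (p[i_1], ..., p[i_k]) for k random coordinates.
        self.projections = [
            [rng.randrange(dim) for _ in range(k)] for _ in range(num_tables)
        ]
        self.tables = [defaultdict(list) for _ in range(num_tables)]

    def _key(self, point, coords):
        return tuple(point[i] for i in coords)

    def insert(self, point):
        # Store the point in one bucket per table.
        for coords, table in zip(self.projections, self.tables):
            table[self._key(point, coords)].append(point)

    def query(self, q, radius):
        """Return some stored point within `radius` of q, or None."""
        for coords, table in zip(self.projections, self.tables):
            for p in table.get(self._key(q, coords), []):
                if hamming(p, q) <= radius:
                    return p
        return None
```

A query only inspects the buckets that q itself hashes to, so a near point is found only if it collides with q in at least one table. Using more tables raises that chance at the cost of space.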

Analysis of LSH Scheme

Analysis: Correctness

Analysis: Runtime

NNS for Euclidean space [Datar-Immorlica-Indyk-Mirrokni’04]
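The slide body is missing from the transcript. The sketch below assumes the slide refers to the well-known random-projection hash from that paper: project onto a random Gaussian direction, shift by a random offset, and cut the line into segments of width w. The helper names `make_hash` and `collision_rate` and the sample width are assumptions made for illustration.

```python
import math
import random


def make_hash(dim, w, rng):
    """Draw one hash h(p) = floor((a.p + b) / w), a Gaussian, b uniform in [0, w)."""
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    b = rng.uniform(0.0, w)

    def h(p):
        dot = sum(ai * pi for ai, pi in zip(a, p))
        return math.floor((dot + b) / w)

    return h


def collision_rate(p, q, dim, w, trials=2000, seed=0):
    """Empirical fraction of random hash functions under which p and q collide."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        h = make_hash(dim, w, rng)
        hits += h(p) == h(q)
    return hits / trials


origin = [0.0] * 8
near = [1.0] + [0.0] * 7    # distance 1 from origin
far = [10.0] + [0.0] * 7    # distance 10 from origin
print(collision_rate(origin, near, dim=8, w=4.0))
print(collision_rate(origin, far, dim=8, w=4.0))
```

The near pair should collide far more often than the far pair, which is the gap the LSH analysis exploits.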

Optimal* LSH [A-Indyk’06]
Regular grid → grid of balls
  p can hit empty space, so take more such grids until p is in a ball
Need (too) many grids of balls
  Start by projecting in dimension t
Analysis gives [formula missing from transcript]
Choice of reduced dimension t?
  Tradeoff between # hash tables, n^ρ, and time to hash, t^O(t)
Total query time: d·n^(1/c² + o(1))
(figure labels: 2D, p, R^t)

Proof idea
Claim: [formula missing from transcript], i.e., P(r) = probability of collision when ||p-q|| = r
Intuitive proof:
  Projection approximately preserves distances [JL]
  P(r) = intersection / union
  P(r) ≈ probability that a random point u is beyond the dashed line
  Fact (high dimensions): the x-coordinate of u has a nearly Gaussian distribution → P(r) ≈ exp(-A·r²)
(figure labels: p, q, r, u, x)
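The Gaussian-tail fact is what produces the 1/c² in the query-time exponent. The short derivation below assumes the standard LSH exponent ρ = log(1/P(1)) / log(1/P(c)), with distances scaled so that near points are at distance 1 and far points at distance c. That definition is not visible in the transcript, so treat it as an assumption.

```latex
\rho \;=\; \frac{\log\bigl(1/P(1)\bigr)}{\log\bigl(1/P(c)\bigr)}
\;\approx\; \frac{\log \exp(A \cdot 1^2)}{\log \exp(A \cdot c^2)}
\;=\; \frac{A}{A\,c^2}
\;=\; \frac{1}{c^2}.
```

The constant A cancels, so only the quadratic dependence on r matters.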

LSH Zoo
Example strings: “To be or not to be” and “To Simons or not to Simons”
Count vectors (coordinates labeled be, to, or, not, Simons): …21102… and …01122…
Binary vectors: …11101… and …01111…
Word sets: {be, not, or, to} and {not, or, to, Simons}
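The two word sets on this slide make a convenient test case for min-wise hashing, which is the usual LSH family for Jaccard similarity. The sketch below is an illustration under that assumption; the transcript does not show which hash families the slide listed. The function names are hypothetical.

```python
import random


def jaccard(a, b):
    """Exact Jaccard similarity |a & b| / |a | b|."""
    return len(a & b) / len(a | b)


def minhash_signature(items, universe, num_hashes, seed=0):
    """For each random permutation of the universe, keep the first item of `items`."""
    rng = random.Random(seed)
    order = sorted(universe)
    signature = []
    for _ in range(num_hashes):
        rng.shuffle(order)  # a fresh random permutation
        rank = {word: i for i, word in enumerate(order)}
        signature.append(min(items, key=rank.__getitem__))
    return signature


def estimate_similarity(sig_a, sig_b):
    """Fraction of positions where the two signatures agree."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


A = {"be", "not", "or", "to"}
B = {"not", "or", "to", "Simons"}
universe = A | B
# Same seed for both sets, so both signatures use the same permutations.
sig_a = minhash_signature(A, universe, num_hashes=2000, seed=7)
sig_b = minhash_signature(B, universe, num_hashes=2000, seed=7)
print(jaccard(A, B), estimate_similarity(sig_a, sig_b))
```

The sets share 3 of their 5 distinct words, so the exact similarity is 3/5. Two signatures agree at a position exactly when the first word of the union under that permutation lies in the intersection, so the agreement rate should be close to 3/5.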

LSH in the wild (slide annotations: safety not guaranteed; fewer false positives; fewer tables)

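Time-Space Trade-offs
Table columns: Space, Time, Comment, Reference (formula cells missing from transcript)
References in the table: [IM’98], [DIIM’04, AI’06], [MNP’06, OWZ’11], [KOR’98, IM’98, Pan’06], [AIP’06], [Ind’01, Pan’06], [AI’06], [PTW’08, PTW’10]
Surviving comments: ω(1) memory lookups [AIP’06]; ω(1) memory lookups [PTW’08, PTW’10]; 1 mem lookup
(row labels: space medium / low / high; query time low)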

LSH is tight… leave the rest to cell-probe lower bounds?

Data-dependent Hashing! [A-Indyk-Nguyen-Razenshteyn’14]

A look at LSH lower bounds [O’Donnell-Wu-Zhou’11]

Why not NNS lower bound?

Intuition

Nice Configuration: “sparsity”

Reduction: into spherical LSH

Two-level algorithm

Details
Inside a bucket, need to ensure “sparse” case
  1) drop all “far pairs”
  2) find minimum enclosing ball (MEB)
  3) partition by “sparsity” (distance from center)

1) Far points

2) Minimum Enclosing Ball
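The transcript does not say how the enclosing ball is computed. One simple possibility is the Badoiu-Clarkson iteration, which repeatedly nudges the center toward the current farthest point. It is swapped in here purely as an illustration and may differ from what the talk uses; `approx_meb` is a hypothetical name.

```python
import math


def approx_meb(points, iterations=1000):
    """Approximate minimum enclosing ball via the Badoiu-Clarkson iteration."""
    center = list(points[0])
    for i in range(1, iterations + 1):
        # Find the point farthest from the current center.
        far = max(points, key=lambda p: math.dist(p, center))
        # Move a 1/(i+1) fraction of the way toward it.
        center = [c + (f - c) / (i + 1) for c, f in zip(center, far)]
    radius = max(math.dist(p, center) for p in points)
    return center, radius


square = [(1.0, 0.0), (-1.0, 0.0), (0.0, 1.0), (0.0, -1.0)]
center, radius = approx_meb(square)
print(center, radius)
```

For these four points the exact ball is centered at the origin with radius 1, and the iteration should land close to that. The step size shrinks as 1/(i+1), so more iterations give a tighter approximation.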

3) Partition by “sparsity”

Practice of NNS
Data-dependent partitions…
Practice:
  Trees: kd-trees, quad-trees, ball-trees, rp-trees, PCA-trees, sp-trees…
  often no guarantees
Theory?
  assuming more about data: PCA-like algorithms “work” [Abdullah-A-Kannan-Krauthgamer’14]

Finale

Open question: [Prob. needle of length 1 is not cut] ≥ [Prob. needle of length c is not cut]^(1/c²)