Geometric Problems in High Dimensions: Sketching
Piotr Indyk

Dimensionality Reduction in Hamming Metric

Theorem: For any r and eps > 0 (small enough), there is a distribution over mappings G: {0,1}^d → {0,1}^t such that for any two points p, q the following holds with probability at least 1 - P, as long as t = C·log(2/P)/eps^2 for a large enough constant C:
– If D(p,q) < r then D(G(p), G(q)) < (c + eps/20)·t
– If D(p,q) > (1+eps)·r then D(G(p), G(q)) > (c + eps/10)·t
(Here D is the Hamming distance and c = (1-1/e)/2 is the constant that emerges in the analysis.)
Given n points, we can reduce the dimension to O(log n) and still approximately preserve the distances between them.
The mapping works (with high probability) even if you don't know the points in advance.

Proof

Mapping: G(p) = (g_1(p), g_2(p), …, g_t(p)), where g_j(p) = f_j(p|I_j)
– I_j: a multiset of s indices chosen independently and uniformly at random from {1…d}
– p|I: the projection of p onto the coordinates in I
– f_j: a random function into {0,1}
Example: p = 01101, s = 3, I = {2,2,4} → p|I = 110
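To make the construction concrete, here is a minimal Python sketch of the mapping G; the function names, the seed handling, and the lazy dictionary realization of the random functions f_j are my own illustrative choices, not part of the slides:

```python
import random

def build_sketch_function(d, t, s, seed=0):
    """Sample G = (g_1, ..., g_t): each g_j projects p onto a random
    multiset I_j of s coordinates and feeds the result to a random
    function f_j into {0,1} (realized lazily as a dictionary of bits)."""
    rng = random.Random(seed)
    index_sets = [[rng.randrange(d) for _ in range(s)] for _ in range(t)]
    tables = [dict() for _ in range(t)]  # lazy random functions f_j

    def G(p):
        bits = []
        for I_j, f_j in zip(index_sets, tables):
            proj = tuple(p[i] for i in I_j)        # p restricted to I_j
            if proj not in f_j:
                f_j[proj] = rng.randrange(2)       # draw f_j(proj) on first use
            bits.append(f_j[proj])
        return tuple(bits)

    return G

# Example matching the slide's dimensions: p = 01101, s = 3
G = build_sketch_function(d=5, t=8, s=3)
print(G((0, 1, 1, 0, 1)))   # an 8-bit sketch of p
```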

Analysis

What is Pr[p|I = q|I]? It is equal to (1 - D(p,q)/d)^s.
We set s = d/r. Then Pr[p|I = q|I] ≈ e^(-D(p,q)/r), which decays exponentially in D(p,q)/r.
Thus:
– If D(p,q) < r then Pr[p|I = q|I] > 1/e
– If D(p,q) > (1+eps)·r then Pr[p|I = q|I] < 1/e - eps/3
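As a quick sanity check of this step, one can estimate Pr[p|I = q|I] empirically and compare it with (1 - D(p,q)/d)^s and with the approximation e^(-D(p,q)/r); the following small Python experiment (parameter values are arbitrary illustrations of mine) does exactly that:

```python
import math, random

def collision_prob(D, d, r, trials=20000, seed=0):
    """Estimate Pr[p|_I = q|_I] for two points at Hamming distance D,
    with I a multiset of s = d/r coordinates sampled with replacement."""
    rng = random.Random(seed)
    s = d // r
    p = [0] * d
    q = [1] * D + [0] * (d - D)   # q differs from p on the first D coordinates
    hits = 0
    for _ in range(trials):
        I = [rng.randrange(d) for _ in range(s)]
        hits += all(p[i] == q[i] for i in I)
    return hits / trials

d, r = 1000, 100
for D in (50, 100, 200):
    exact = (1 - D / d) ** (d // r)
    approx = math.exp(-D / r)
    print(D, round(collision_prob(D, d, r), 3), round(exact, 3), round(approx, 3))
```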

Analysis II

What is Pr[g(p) ≠ g(q)]? It is equal to
Pr[p|I = q|I]·0 + (1 - Pr[p|I = q|I])·1/2 = (1 - Pr[p|I = q|I])/2
Thus:
– If D(p,q) < r then Pr[g(p) ≠ g(q)] < (1 - 1/e)/2 = c
– If D(p,q) > (1+eps)·r then Pr[g(p) ≠ g(q)] > c + eps/6

Analysis III

What is D(G(p), G(q))? Since G(p) = (g_1(p), g_2(p), …, g_t(p)), we have:
D(G(p), G(q)) = Σ_j [g_j(p) ≠ g_j(q)]
By linearity of expectation:
E[D(G(p), G(q))] = Σ_j Pr[g_j(p) ≠ g_j(q)] = t·Pr[g_1(p) ≠ g_1(q)]
To get the high-probability bound, use the Chernoff inequality.

Chernoff Bound

Let X_1, X_2, …, X_t be independent 0-1 random variables with Pr[X_i = 1] = r. Let X = Σ_j X_j. Then for any 0 < b < 1:
Pr[ |X - t·r| > b·t·r ] < 2e^(-b^2·t·r/3)
Proof I: Cormen, Leiserson, Rivest, Stein, Appendix C.
Proof II: attend one of David Karger's classes.
Proof III: do it yourself.
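The bound is easy to check numerically; the small simulation below (with arbitrary illustrative parameters of my choosing) compares the empirical deviation probability with the stated bound, which is loose but valid:

```python
import math, random

def chernoff_check(t=2000, r=0.3, b=0.1, trials=5000, seed=1):
    """Compare Pr[|X - tr| > btr], estimated by simulation, with 2e^(-b^2 tr/3)."""
    rng = random.Random(seed)
    bad = 0
    for _ in range(trials):
        X = sum(1 for _ in range(t) if rng.random() < r)
        if abs(X - t * r) > b * t * r:
            bad += 1
    bound = 2 * math.exp(-b * b * t * r / 3)
    print(f"empirical {bad / trials:.4f}   bound {bound:.4f}")

chernoff_check()
```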

Analysis IV

In our case X_j = [g_j(p) ≠ g_j(q)] and X = D(G(p), G(q)). Therefore:
– Near case (D(p,q) < r, bound applied with r = c): Pr[X > (c + eps/20)·t] ≤ Pr[ |X - t·c| > (eps/20)·t·c ] < 2e^(-(eps/20)^2·t·c/3)
– Far case (D(p,q) > (1+eps)·r, bound applied with r = c + eps/6): Pr[X < (c + eps/10)·t] ≤ Pr[ |X - t·(c + eps/6)| > (eps/20)·t·(c + eps/6) ] < 2e^(-(eps/20)^2·t·(c + eps/6)/3)
In both cases the probability of failure is at most 2e^(-(eps/20)^2·t·c/3).

Finally…

2e^(-(eps/20)^2·t·c/3) = 2e^(-(eps/20)^2·c/3 · C·log(2/P)/eps^2) = 2e^(-log(2/P)·c·C/1200)
Take C so that c·C/1200 = 1. We get
2e^(-log(2/P)·c·C/1200) = 2e^(-log(2/P)) = 2·(P/2) = P
Thus, the probability of failure is at most P.

Algorithmic Implications

Approximate Near Neighbor:
– Given: a set of n points in {0,1}^d, eps > 0, r > 0
– Goal: a data structure that, for any query q: if there is a point p within distance r from q, reports a point p' within distance (1+eps)·r from q
Can solve Approximate Nearest Neighbor by taking r = 1, (1+eps), (1+eps)^2, …
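The reduction from Approximate Nearest Neighbor to Approximate Near Neighbor is just a geometric sweep over radii; in the sketch below, the `near_neighbor_query` callback stands in for whichever near-neighbor structure is used (this interface is my own illustrative choice, not part of the slides):

```python
def approx_nearest_neighbor(near_neighbor_query, eps, r_max):
    """Sweep r = 1, (1+eps), (1+eps)^2, ... and return the first hit.

    near_neighbor_query(r) should return a point within (1+eps)*r of the
    query if some data point lies within r of it, and None otherwise.
    """
    r = 1.0
    while r <= r_max:
        p = near_neighbor_query(r)
        if p is not None:
            return p          # p is within (1+eps)*r, and r is near-minimal
        r *= (1 + eps)
    return None
```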

Algorithm I - Practical

Set the probability of error to 1/poly(n) → t = O(log n / eps^2)
Map all points p to G(p)
To answer a query q:
– Compute G(q)
– Find the nearest neighbor G(p) of G(q) among the sketched points
– If D(p,q) < r·(1+eps), report p
Query time: O(n·log n / eps^2)
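A minimal sketch of this practical algorithm, assuming a sketch function G like the build_sketch_function example above (the helper names are illustrative):

```python
def hamming(x, y):
    """Hamming distance between two equal-length bit tuples."""
    return sum(a != b for a, b in zip(x, y))

def build_index(points, G):
    """Preprocessing: store each point together with its sketch."""
    return [(p, G(p)) for p in points]

def near_neighbor_query(index, q, G, r, eps):
    """Find the point whose sketch is closest to G(q); report it only if it
    is actually within (1+eps)*r of q in the original space."""
    Gq = G(q)
    best_p, _ = min(index, key=lambda item: hamming(item[1], Gq))
    return best_p if hamming(best_p, q) < r * (1 + eps) else None
```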

Algorithm II - Theoretical

The exact nearest neighbor problem in {0,1}^t can be solved with:
– 2^t space
– O(t) query time
(just store precomputed answers to all possible queries)
By applying the mapping G(·), we solve approximate near neighbor with:
– n^O(1/eps^2) space
– O(d·log n / eps^2) query time
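For intuition, the 2^t-space / O(t)-query-time structure can be written out directly for tiny t; the sketch below tabulates the answer for every possible t-bit query sketch (names and the pairing of points with sketches are illustrative assumptions):

```python
from itertools import product

def precompute_all_answers(sketched_points, t):
    """sketched_points: list of (original_point, sketch) pairs, sketches in {0,1}^t.
    Returns a table mapping every possible t-bit query sketch to the original
    point whose sketch is nearest to it; a query q is then answered by table[G(q)]."""
    def hamming(x, y):
        return sum(a != b for a, b in zip(x, y))

    table = {}
    for query in product((0, 1), repeat=t):          # all 2^t possible sketches
        table[query] = min(sketched_points, key=lambda item: hamming(item[1], query))[0]
    return table
```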

Another Sketching Method

In many applications the points tend to be quite sparse:
– Large dimension
– Very few 1's
It is easier to think of them as sets, e.g., the set of words in a document.
The previous method would require a very large s.
For two sets A, B, define Sim(A,B) = |A ∩ B| / |A ∪ B|:
– If A = B, Sim(A,B) = 1
– If A, B are disjoint, Sim(A,B) = 0
How do we compute short sketches of sets that preserve Sim(·)?

"Min" Approach

Mapping: g(A) = min_{a in A} h(a), where h is a random permutation of the elements of the universe.
Fact: Pr[g(A) = g(B)] = Sim(A,B)
Proof: Where does min( h(A ∪ B) ) land? It is equally likely to be h(x) for any x in A ∪ B, and g(A) = g(B) exactly when that minimizer x lies in A ∩ B.
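A minimal Python illustration of this fact (universe, sets, and trial count are arbitrary choices of mine): estimate Pr[g(A) = g(B)] over many independent random permutations and compare it with Sim(A,B).

```python
import random

def make_min_hash(universe, seed):
    """One random permutation h of the universe; returns g with g(A) = min over a in A of h(a)."""
    rng = random.Random(seed)
    ranks = list(range(len(universe)))
    rng.shuffle(ranks)
    h = dict(zip(universe, ranks))
    return lambda A: min(h[a] for a in A)

def sim(A, B):
    return len(A & B) / len(A | B)

universe = list(range(100))
A, B = set(range(0, 60)), set(range(40, 100))   # |A ∩ B| = 20, |A ∪ B| = 100
trials = 5000
hits = 0
for i in range(trials):
    g = make_min_hash(universe, seed=i)
    hits += (g(A) == g(B))
print(hits / trials, sim(A, B))   # both should be close to 0.2
```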

Min Sketching

Define G(A) = (g_1(A), g_2(A), …, g_t(A)).
By the Chernoff bound, if t = C·log(1/P)/eps^2, then for any A, B, the number of indices j such that g_j(A) = g_j(B) equals t·[Sim(A,B) ± eps] with probability at least 1 - P.
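A compact sketch of the full construction, assuming t independent random permutations (the names and the estimator helper are my own illustrative choices): the fraction of agreeing coordinates of G(A) and G(B) estimates Sim(A,B).

```python
import random

def make_min_sketch(universe, t, seed=0):
    """G(A) = (g_1(A), ..., g_t(A)) for t independent random permutations."""
    rng = random.Random(seed)
    perms = []
    for _ in range(t):
        ranks = list(range(len(universe)))
        rng.shuffle(ranks)
        perms.append(dict(zip(universe, ranks)))
    return lambda A: tuple(min(h[a] for a in A) for h in perms)

def estimate_sim(GA, GB):
    """Fraction of coordinates on which the two sketches agree."""
    return sum(a == b for a, b in zip(GA, GB)) / len(GA)

universe = list(range(100))
A, B = set(range(0, 60)), set(range(40, 100))   # Sim(A,B) = 0.2
G = make_min_sketch(universe, t=1000)
print(estimate_sim(G(A), G(B)))                 # typically within a few hundredths of 0.2
```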