Geometric Problems in High Dimensions: Sketching
Piotr Indyk

High Dimensions

We have seen several algorithms for low-dimensional problems (d=2, to be specific):
–data structure for orthogonal range queries (kd-tree)
–data structure for approximate nearest neighbor (kd-tree)
–algorithms for reporting line intersections
Many more interesting algorithms exist (see the Computational Geometry course next year)
Time to move on to high dimensions
–Many (not all) low-dimensional problems make sense in high d:
*nearest neighbor: YES (multimedia databases, data mining, vector quantization, etc.)
*line intersection: probably NO
–Techniques are very different

What’s the Big Deal About High Dimensions?

Let’s see how the kd-tree performs in R^d…

Déjà vu I: Approximate Nearest Neighbor

Packing argument (in the plane):
–All cells C seen so far have diameter > eps*r
–The number of cells with diameter eps*r, bounded aspect ratio, and touching a ball of radius r is at most O(1/eps^2)
In R^d, this gives O(1/eps^d). E.g., take eps=1, r=1: there are 2^d unit cubes touching the origin, and thus intersecting the unit ball.

Déjà vu II: Orthogonal Range Search

What is the max number Q(n) of regions in an n-point kd-tree intersecting a vertical line?
–If we split on x, Q(n)=1+Q(n/2)
–If we split on y, Q(n)=2*Q(n/2)+2
–Since we alternate, we can write Q(n)=3+2Q(n/4), which solves to O(sqrt(n))
In R^d we need to take Q(n) to be the number of regions intersecting a (d-1)-dimensional hyperplane orthogonal to one of the coordinate directions
We get Q(n)=2^(d-1) Q(n/2^d) + lower-order terms
For constant d, this solves to O(n^((d-1)/d))=O(n^(1-1/d))
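As a quick numeric check of the planar recurrence (my own sketch, with Q(1)=1 as an arbitrary base case), Q(n) indeed grows like sqrt(n):

```python
import math

def Q(n):
    """Alternating-splits recurrence Q(n) = 3 + 2*Q(n/4), with Q(1) = 1."""
    return 1 if n <= 1 else 3 + 2 * Q(n // 4)

# The ratio Q(n)/sqrt(n) converges to a constant, showing Q(n) = Theta(sqrt(n)).
for n in (4**4, 4**6, 4**8):
    print(n, Q(n), round(Q(n) / math.sqrt(n), 2))
```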

High Dimensions

Problem: when d > log n, the query time is essentially O(dn)
Need to use different techniques:
–Dimensionality reduction, a.k.a. sketching:
*Since d is high, let’s reduce it while preserving the important properties of the data set
–Algorithms with “moderate” dependence on d (e.g., 2^d but not n^d)

Hamming Metric

Points: from {0,1}^d (or {0,1,2,…,q}^d)
Metric: D(p,q) equals the number of positions on which p and q differ
Simplest high-dimensional setting
Still useful in practice
In theory, as hard (or easy) as Euclidean space
Trivial in low d
Example (d=3): {000, 001, 010, 011, 100, 101, 110, 111}
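As a toy illustration (my own sketch, not part of the original slides), the Hamming distance can be computed coordinate by coordinate; the same helper works for bit-strings and for the 0/1 lists used by the sketches later on:

```python
def hamming(p, q):
    """Hamming distance: number of positions on which p and q differ."""
    assert len(p) == len(q)
    return sum(1 for a, b in zip(p, q) if a != b)

print(hamming("01101", "01011"))  # prints 2
```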

Dimensionality Reduction in Hamming Metric

Theorem: For any r and eps>0 (small enough), there is a distribution over mappings G: {0,1}^d → {0,1}^t such that for any two points p, q, with probability at least 1-P:
–If D(p,q) < r then D(G(p), G(q)) < (c+eps/20)t
–If D(p,q) > (1+eps)r then D(G(p), G(q)) > (c+eps/10)t
as long as t=O(log(1/P)/eps^2).
Given n points, we can reduce the dimension to O(log n) and still approximately preserve the distances between them
The mapping works (with high probability) even if you don’t know the points in advance

Proof

Mapping: G(p) = (g_1(p), g_2(p),…,g_t(p)), where each coordinate is an independent copy of g(p)=f(p|I)
–I: a multiset of s indices chosen independently and uniformly at random from {1…d}
–p|I: the projection of p onto the coordinates in I
–f: a random function into {0,1}
Example: p=01101, s=3, I={2,2,4} → p|I = 110
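A minimal Python sketch of this mapping (my own reconstruction of the slide, not Indyk's code; representing the random function f by a lazily-filled dictionary is an implementation choice):

```python
import random

def sample_g(d, s):
    """One coordinate of the sketch: a multiset I of s random indices plus a
    random function f from projected patterns to {0,1}, filled in lazily."""
    I = [random.randrange(d) for _ in range(s)]
    f = {}
    def g(p):
        pattern = tuple(p[i] for i in I)       # p restricted to the multiset I
        if pattern not in f:
            f[pattern] = random.randrange(2)   # fresh random bit per new pattern
        return f[pattern]
    return g

def sample_G(d, s, t):
    """G(p) = (g_1(p), ..., g_t(p)), with the t coordinates drawn independently."""
    gs = [sample_g(d, s) for _ in range(t)]
    return lambda p: [g(p) for g in gs]
```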

Analysis

What is Pr[p|I = q|I]? It is equal to (1-D(p,q)/d)^s
We set s=d/r. Then Pr[p|I = q|I] ≈ e^(-D(p,q)/r), which decays exponentially in D(p,q)
Thus
–If D(p,q) < r then Pr[p|I = q|I] > 1/e
–If D(p,q) > (1+eps)r then Pr[p|I = q|I] < 1/e - eps/3
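A quick numerical sanity check (my own, with arbitrary illustrative parameters) comparing the exact collision probability with the exponential approximation:

```python
import math

d, r = 1000, 100
s = d // r                      # s = d/r = 10
for D in (50, 100, 200, 400):   # distances below r, at r, and above (1+eps)r
    exact = (1 - D / d) ** s
    approx = math.exp(-D / r)
    print(f"D={D}: exact={exact:.3f}, approx={approx:.3f}")
```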

Analysis II

What is Pr[g(p) ≠ g(q)]? If the projections agree, the bits are equal; otherwise f assigns two independent random bits. So it is equal to Pr[p|I = q|I]*0 + (1 - Pr[p|I = q|I])*1/2 = (1 - Pr[p|I = q|I])/2
Thus
–If D(p,q) < r then Pr[g(p) ≠ g(q)] < (1-1/e)/2 = c
–If D(p,q) > (1+eps)r then Pr[g(p) ≠ g(q)] > c+eps/6
By linearity of expectation, E[D(G(p),G(q))] = Pr[g(p) ≠ g(q)] * t
To get the high-probability bound, use the Chernoff inequality
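A small Monte Carlo check of the whole construction (my own experiment, with arbitrary parameters, reusing `sample_G` and `hamming` from above): the normalized sketch distance should typically land below c ≈ 0.316 for close pairs and above it for far pairs.

```python
import random

d, r, eps, t = 200, 20, 0.5, 400
s = d // r
G = sample_G(d, s, t)

def random_pair(dist):
    """A random point p and a copy q with exactly `dist` flipped coordinates."""
    p = [random.randrange(2) for _ in range(d)]
    q = p[:]
    for i in random.sample(range(d), dist):
        q[i] ^= 1
    return p, q

for dist in (r // 2, int((1 + eps) * r) + 5):
    p, q = random_pair(dist)
    print(dist, hamming(G(p), G(q)) / t)
```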

Algorithmic Implications

Approximate Near Neighbor:
–Given: a set of n points in {0,1}^d, eps>0, r>0
–Goal: a data structure that, for any query q:
*if there is a point p within distance r from q, reports some point p’ within distance (1+eps)r from q
Can solve Approximate Nearest Neighbor by taking r=1, (1+eps), (1+eps)^2, …
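One way the last bullet could be realized, as a rough sketch of my own (here `near_neighbor(q, r)` is a hypothetical black box for the data structure defined above, returning a (1+eps)-near point or None):

```python
def approx_nearest_neighbor(q, d, eps, near_neighbor):
    """Try radii r = 1, (1+eps), (1+eps)^2, ... and return the answer from the
    smallest radius whose near-neighbor query succeeds."""
    r = 1.0
    while r <= d:                     # Hamming distances never exceed d
        p = near_neighbor(q, r)       # hypothetical (1+eps)-near-neighbor query
        if p is not None:
            return p
        r *= (1 + eps)
    return None
```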

Algorithm I - Practical

Set the probability of error to 1/poly(n) → t=O(log n/eps^2)
Map all points p to G(p)
To answer a query q:
–Compute G(q)
–Find the nearest neighbor of G(q) among all points G(p)
–Check the distance; if it is less than r(1+eps), report the point
Query time: O(n log n/eps^2)
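Putting the pieces together, Algorithm I might look like the following (my own illustrative code, reusing the `hamming` and `sample_G` sketches from above):

```python
def build_index(points, G):
    """Store each point together with its sketch G(p)."""
    return [(p, G(p)) for p in points]

def query(index, q, G, r, eps):
    """Nearest sketch to G(q), accepted only if the true distance is small."""
    Gq = G(q)
    p_best, _ = min(index, key=lambda item: hamming(item[1], Gq))
    if hamming(p_best, q) <= (1 + eps) * r:
        return p_best
    return None
```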

Algorithm II - Theoretical

The exact nearest neighbor problem in {0,1}^t can be solved with
–2^t space
–O(t) query time (just store pre-computed answers to all queries)
By applying the mapping G(.), we solve approximate near neighbor with:
–n^O(1/eps^2) space
–O(d log n/eps^2) time
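For intuition only, a toy version of the 2^t-space lookup table (my own sketch; this is feasible only for very small t):

```python
from itertools import product

def build_table(sketched_points, t):
    """sketched_points: list of (p, G(p)). For every z in {0,1}^t, precompute
    the stored point whose sketch is nearest to z: 2^t entries, O(t) lookup."""
    table = {}
    for z in product((0, 1), repeat=t):
        p_best, _ = min(sketched_points, key=lambda item: hamming(item[1], z))
        table[z] = p_best
    return table

# Query time is O(t) plus the cost of computing G(q): answer = table[tuple(G(q))]
```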

Another Sketching Method

In many applications, the points tend to be quite sparse
–Large dimension
–Very few 1’s
It is easier to think of them as sets, e.g., the set of words in a document
The previous method would require a very large s
For two sets A, B, define Sim(A,B)=|A ∩ B|/|A ∪ B|
–If A=B, Sim(A,B)=1
–If A, B are disjoint, Sim(A,B)=0
How do we compute short sketches of sets that preserve Sim(.)?
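Sim(A,B) is just the Jaccard similarity; a trivial sketch of my own in Python:

```python
def sim(A, B):
    """Jaccard similarity |A ∩ B| / |A ∪ B| (taken to be 0 for two empty sets)."""
    union = A | B
    return len(A & B) / len(union) if union else 0.0

print(sim({"high", "dimensional", "data"}, {"high", "dimensional", "geometry"}))  # 0.5
```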

“Min Approach”

Mapping: G(A)=min_{a in A} g(a), where g is a random permutation of the elements
Fact: Pr[G(A)=G(B)]=Sim(A,B)
Proof: Where is min(g(A) ∪ g(B))? It is equally likely to be attained by any element of A ∪ B, and G(A)=G(B) exactly when that element lies in A ∩ B, which happens with probability |A ∩ B|/|A ∪ B| = Sim(A,B).
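A minimal MinHash-style sketch along these lines (my own code; the permutations-as-rank-dictionaries representation and the number of repetitions k are illustrative choices, not from the slides). Repeating the mapping k times and counting agreeing coordinates gives an estimator of Sim(A,B):

```python
import random

def make_minhash(universe, k):
    """Sample k random permutations g_1, ..., g_k of the universe; the sketch of
    a set A (a subset of the universe) is the vector (min_{a in A} g_i(a))_i."""
    universe = list(universe)
    perms = []
    for _ in range(k):
        ranks = list(range(len(universe)))
        random.shuffle(ranks)
        perms.append(dict(zip(universe, ranks)))   # g_i: element -> its rank
    def sketch(A):
        return [min(g[a] for a in A) for g in perms]
    return sketch

def estimate_sim(sA, sB):
    """Fraction of agreeing coordinates; estimates Sim(A,B), since
    Pr[G(A)=G(B)] = Sim(A,B) for each permutation."""
    return sum(1 for x, y in zip(sA, sB) if x == y) / len(sA)
```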