Data Structures and Functional Programming: Algorithms for Big Data
Ramin Zabih, Cornell University, Fall 2012


Motivation
Modern CS is largely about "big data"
– Examples: the web, the genome, Walmart, the AT&T call graph, Flickr, Facebook, Twitter, etc.
How do we find patterns in such large datasets?
– Example: frequently purchased item sets
Today: a classical algorithm, its application to some image editing problems, and some state-of-the-art research techniques
Slide credits: Alyosha Efros, Piotr Indyk, and their students

Motivating example: sink that ship! Credit:

How do we sink that ship?

Another example: building removal Credit: (Hays & Efros)

Hays and Efros, SIGGRAPH 2007

… 200 scene matches Hays and Efros, SIGGRAPH 2007

Nearest Neighbor Search
Given: a set P of n points in R^d
Goal: a data structure which, given a query point q, finds the nearest neighbor p of q in P
[Figure: a query point q and its nearest neighbor p]
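
As a point of reference, here is a minimal brute-force sketch (my illustration, not from the slides): it answers a query in O(nd) time by scanning all of P.

```python
import math

def nearest_neighbor(P, q):
    """Exact NN by exhaustive search: O(n*d) per query."""
    return min(P, key=lambda p: math.dist(p, q))

print(nearest_neighbor([(0, 0), (3, 4), (1, 1)], (2, 2)))  # -> (1, 1)
```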

Variants of nearest neighbor
– k-nearest neighbors: find the k nearest neighbors
– Near neighbor (range search): find one/all points in P within distance r from q
– Approximate near neighbor: find one/all points p' in P whose distance to q is at most (1+ε) times the distance from q to its nearest neighbor
– Decision problem: is there a point within r of q?
Note: there can be many notions of distance!
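
Hedged sketches of the first two variants (mine, in the same brute-force setting as above):

```python
import heapq
import math

def k_nearest(P, q, k):
    """k-nearest neighbors via one linear scan."""
    return heapq.nsmallest(k, P, key=lambda p: math.dist(p, q))

def range_search(P, q, r):
    """Near-neighbor (range) query: all points of P within distance r of q."""
    return [p for p in P if math.dist(p, q) <= r]
```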

Some early NN algorithms
One can always do exhaustive search: it is linear in n and d (why?)
– Note that a low-resolution version of the input data will allow you to ignore lots of points!
There are two simple data structure solutions:
– quad trees (and their variants, like k-d trees)
– Voronoi diagrams
I won't talk about these in any detail, but will give you the basic idea

A quad tree
[Figure: a quad tree recursively subdividing the plane]

NN search in a quad tree
[Figure: walking the quad tree to answer an NN query]
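
The slides illustrate the search pictorially; below is a minimal sketch of the same descend-then-backtrack idea for a k-d tree, the variant named above (all names and representation choices are mine):

```python
import math

def build_kdtree(points, depth=0):
    """Split on coordinates in round-robin order, median point at the root."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {"point": points[mid],
            "left": build_kdtree(points[:mid], depth + 1),
            "right": build_kdtree(points[mid + 1:], depth + 1)}

def kdtree_nn(node, q, depth=0, best=None):
    """Descend toward q, then backtrack into any subtree that could
    still hold a closer point than the best found so far."""
    if node is None:
        return best
    if best is None or math.dist(q, node["point"]) < math.dist(q, best):
        best = node["point"]
    axis = depth % len(q)
    if q[axis] < node["point"][axis]:
        near, far = node["left"], node["right"]
    else:
        near, far = node["right"], node["left"]
    best = kdtree_nn(near, q, depth + 1, best)
    # Prune: visit the far side only if the splitting plane is closer than best.
    if abs(q[axis] - node["point"][axis]) < math.dist(q, best):
        best = kdtree_nn(far, q, depth + 1, best)
    return best
```

In the worst case the pruning test fails everywhere and the query visits every node, which is the behavior discussed under "Asymptotic behavior" below.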

Voronoi diagram
Basic idea: represent the decision regions, i.e., the areas that have a given point as their NN
Every cell contains exactly one data point, and every location in that cell has that data point as its NN

Asymptotic behavior
Quad tree: space and query time are both exponential in the dimension d
k-d tree (a smarter variant) has linear space, but queries can still take time exponential in d
– So you should run exhaustive search too!
The Voronoi diagram generalization also uses space exponential in d

The curse of dimensionality
In high-dimensional data, almost everything is far away from everything else; the data tends to become sparse
So your nearest neighbor is pretty dissimilar, simply because everything is dissimilar
The curse is not well defined, but as an example you can compare the volume of a unit sphere to a unit hypercube
As the dimension increases, the ratio goes to zero, so "everything is far away from the center"
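
A quick check of that ratio (my sketch; the exact constants are not in the slides). Take the ball inscribed in the unit hypercube, which has radius 1/2; its share of the cube's volume collapses as d grows:

```python
import math

def inscribed_ball_fraction(d):
    """Volume of the radius-1/2 ball over the unit cube's volume:
    pi^(d/2) * (1/2)^d / Gamma(d/2 + 1)."""
    return math.pi ** (d / 2) * 0.5 ** d / math.gamma(d / 2 + 1)

for d in (2, 3, 10, 20):
    print(d, inscribed_ball_fraction(d))
# 2 -> ~0.785, 3 -> ~0.524, 10 -> ~0.0025, 20 -> ~2.5e-08
```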

The curse of dimensionality (2)
Concretely, exact NN methods tend to scale poorly with dimension: either space or query time goes up exponentially with dimension
– So you are better off just using exhaustive search!
Alternatively: settle for approximate nearest neighbors, which is usually good enough
– After all, your input data was noisy to start with!

Locality sensitive hashing (LSH)
Suppose we had a hash function such that:
1. Nearby points tend to hash to the same bucket
2. Far away points tend to hash to different buckets
We need a family of such hash functions, but if we have one we can achieve a cool guarantee: any data point within a specified distance of the query will be reported with a specified probability
– Need to know the distance r in advance
Space is O(n^(1+ρ)) and query time is O(n^ρ), where ρ < 1 depends on how accurate we need to be
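
One way to picture the data structure behind this guarantee (a hypothetical sketch, not from the slides): build several hash tables, one per function drawn from the family, and check true distances only against the points that collide with the query.

```python
from collections import defaultdict

class LSHIndex:
    """Toy LSH index: one table per locality-sensitive hash function."""
    def __init__(self, hash_fns):
        self.hash_fns = hash_fns
        self.tables = [defaultdict(list) for _ in hash_fns]

    def insert(self, p):
        for h, table in zip(self.hash_fns, self.tables):
            table[h(p)].append(p)

    def candidates(self, q):
        """Shortlist of colliding points; verify real distances on these."""
        out = []
        for h, table in zip(self.hash_fns, self.tables):
            out.extend(table[h(q)])
        return out
```

The next two slides give a concrete family of hash functions to plug in.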

Visualizing LSH
[Figure: a query q and a nearby point p landing in the same hash bucket]

How do we find such hashing functions?
It is not clear they always exist, but they do for lots of important problems
Consider binary vectors under the Hamming distance D(p,q), a.k.a. the XOR count
Given a parameter k < d, we create a hash function by picking k random coordinates and reporting the values of our input on them
– Example: pick the 2nd and 4th coordinates; an input whose 2nd bit is 1 and 4th bit is 0 hashes to 10
This family of hash functions is locality sensitive!
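
A minimal sketch of this bit-sampling family (names mine); the resulting functions can be plugged into the LSHIndex sketch above:

```python
import random

def make_hash(d, k, seed=None):
    """Sample k of the d coordinates; hash a vector to its values there."""
    rng = random.Random(seed)
    coords = rng.sample(range(d), k)
    return lambda v: tuple(v[i] for i in coords)

h = make_hash(d=8, k=2, seed=0)
p = (0, 1, 1, 0, 1, 0, 0, 1)
q = (0, 1, 1, 1, 1, 0, 0, 1)  # Hamming distance 1 from p
print(h(p), h(q))  # equal unless the one differing coordinate was sampled
```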

Hash function behavior
Consider two binary vectors p, q at Hamming distance D(p,q); the chance of picking a bit where they differ is D(p,q)/d
The probability that they have the same hash value is (1 - D(p,q)/d)^k
[Figure: collision probability Pr versus distance, plotted for the special cases k=1 and k=2]
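
Plugging in some illustrative numbers (mine, not from the slides) shows the locality sensitivity: raising k sharpens the gap between near and far points.

```python
def collision_prob(dist, d, k):
    """Probability of a hash collision at Hamming distance dist,
    i.e. (1 - dist/d)**k."""
    return (1 - dist / d) ** k

print(collision_prob(10, 100, 5))  # ~0.59 for nearby points
print(collision_prob(50, 100, 5))  # ~0.03 for far-away points
```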

Dimensionality reduction
High-dimensional data is painful. But perhaps your data isn't really high dimensional?
There are many fascinating variants of this problem
– For example, we could discover that some dimensions simply don't matter at all: they could be uniform in all points, or random!
Simplest solution: project the points onto a line
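
A hedged sketch of that simplest solution: project every point onto one direction. The slides don't say how to pick the line; choosing it at random (as below) is my assumption.

```python
import math
import random

def project_to_line(points, d, seed=None):
    """Project d-dimensional points onto a random unit direction."""
    rng = random.Random(seed)
    u = [rng.gauss(0, 1) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in u))
    u = [x / norm for x in u]
    return [sum(pi * ui for pi, ui in zip(p, u)) for p in points]
```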

Swiss roll example
[Figure: the "Swiss roll" dataset, a 2-D surface curled up in 3-D]

Distance-preserving embeddings
Another variant: embed your data in a lower-dimensional space so that the distances between pairs of points don't change very much
– After all, the distances really are the data
– Multi-dimensional scaling (MDS)
There is an area of pure math that studies this kind of problem, called distance geometry
– Bob Connelly (Cornell math) is a leader
There are lots of interesting distances too, such as the Earth Mover's Distance (EMD)
– Related to the L1 metric, so we can use LSH for it
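
To make "don't change the distances very much" concrete, here is a compact sketch of classical MDS (my illustration, assuming numpy; the slides only name the method): given a matrix of pairwise distances, it recovers low-dimensional coordinates whose distances approximate it.

```python
import numpy as np

def classical_mds(D, k=2):
    """Embed points in R^k from an n-by-n pairwise distance matrix D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    B = -0.5 * J @ (D ** 2) @ J           # double-centered Gram matrix
    w, V = np.linalg.eigh(B)
    idx = np.argsort(w)[::-1][:k]         # top-k eigenpairs
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0))
```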