Dimensionality Reduction

Multimedia DBs
- Many multimedia applications require efficient indexing in high dimensions (time series, images, videos, etc.)
- Answering similarity queries in high dimensions is a hard problem, due to the "curse of dimensionality"
- A solution is to use dimensionality reduction

High-dimensional datasets
- Range queries have very small selectivity
- The surface is everything: almost all of the volume lies near the boundary
- Partitioning the space is not easy: we get 2^d cells even if we split each dimension only once
- Pairwise distances between points are very skewed
[Figure: histogram of pairwise distances (frequency vs. distance)]
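A tiny numpy experiment (not from the slides) that makes the last point concrete: as the dimensionality grows, pairwise distances between random points concentrate around their mean, so range queries and nearest-neighbor distinctions become less meaningful.

    import numpy as np
    from scipy.spatial.distance import pdist

    rng = np.random.default_rng(0)
    for d in (2, 100):
        X = rng.random((1000, d))             # 1000 uniform points in [0, 1]^d
        dists = pdist(X)                      # all pairwise Euclidean distances
        print(d, dists.std() / dists.mean())  # relative spread shrinks as d grows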

Dimensionality Reduction
- The main idea: reduce the dimensionality of the space
- Project the d-dimensional points into a k-dimensional space so that:
  - k << d
  - distances are preserved as well as possible
- Solve the problem in low dimensions (the GEMINI idea, of course…)

MDS (Multidimensional Scaling)
- Map the items into a k-dimensional space, trying to minimize the stress (the distortion of the pairwise distances)
- But the running time is O(N^2), and adding a new item costs O(N)
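A minimal sketch with scikit-learn (an illustration, not the lecture's code): feed a precomputed N x N distance matrix to metric MDS and read off the residual stress.

    import numpy as np
    from scipy.spatial.distance import pdist, squareform
    from sklearn.manifold import MDS

    X = np.random.rand(50, 10)                        # 50 items in 10 dimensions
    D = squareform(pdist(X))                          # N x N pairwise distance matrix
    mds = MDS(n_components=2, dissimilarity='precomputed', random_state=0)
    Y = mds.fit_transform(D)                          # 50 items mapped to 2 dimensions
    print(mds.stress_)                                # residual stress of the embedding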

FastMap
- Find two objects (the pivots) that are far away from each other
- Project all points onto the line defined by the two pivots to get the first coordinate (see the sketch below)
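The projection uses only pairwise distances, via the cosine law; a minimal sketch (the function dist and the pivot objects a, b are placeholders):

    def fastmap_coordinate(i, a, b, dist):
        """First FastMap coordinate of object i with respect to pivots a and b,
        computed from pairwise distances alone (cosine law):
            x_i = (d(a,i)^2 + d(a,b)^2 - d(b,i)^2) / (2 * d(a,b))
        """
        d_ab = dist(a, b)
        return (dist(a, i) ** 2 + d_ab ** 2 - dist(b, i) ** 2) / (2 * d_ab)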

FastMap - next iteration
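In the next iteration FastMap works in the hyperplane perpendicular to the pivot line; the distances there can be computed from the original distances and the coordinates already found. A sketch of the standard update (a hedged reconstruction, since the slide shows it only as a figure):

    def residual_distance(i, j, dist, x):
        """Distance between objects i and j after projecting out one coordinate:
        d'(i, j)^2 = d(i, j)^2 - (x[i] - x[j])^2   (clipped at 0 here; see the
        'FastMap Extensions' slide for the signed-square-root variant)."""
        s = dist(i, j) ** 2 - (x[i] - x[j]) ** 2
        return max(s, 0.0) ** 0.5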

Example
[Figure: five objects O1-O5 and their pairwise distances; O1, O2, O3 are close to each other (distances ~1), while O4 and O5 lie far away (distances ~100 from the first group)]

Example (cont.)
- Pivot objects: O1 and O4
- First coordinate X1: O1: 0, O2: 0.005, O3: 0.005, O4: 100, O5: 99
- For the second iteration the pivots are O2 and O5

Results
- Documents compared with cosine similarity are mapped to points under Euclidean distance (how? see below)
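One standard answer to the "how?" (not spelled out on the slide, so stated here as the usual trick): normalize each document vector to unit length; then the squared Euclidean distance is a monotone function of the cosine similarity, so cosine neighbors and Euclidean neighbors coincide:

    ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b = 2 (1 - cos(a, b))    when ||a|| = ||b|| = 1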

Results
[Figure: 2-d FastMap projection of the documents; cluster labels visible in the plot: "bb", "reports", "recipes"]

FastMap Extensions
- If the original space is not a Euclidean space, we may have a problem: the projected distance may be a complex number!
- A solution is to define the distance at the next level as
  d_i(a,b) = sign(s) * sqrt(|s|),   where   s = d_{i-1}(a,b)^2 - (x_i^a - x_i^b)^2
  and x_i denotes the i-th FastMap coordinate

Other DR methods
- SVD: project the items onto the first k eigenvectors
- Random projections

SVD decomposition - the Karhunen-Loève transform
- Intuition: find the axis that shows the greatest variation, and project all points onto that axis
[Figure: 2-d point cloud with the original axes f1, f2 and the new axes e1, e2 aligned with the directions of greatest variation]

SVD: The mathematical formulation
- Find the eigenvectors of the covariance matrix; these define the new space
- Sort the eigenvalues in "goodness" order (largest first)
[Figure: the same 2-d point cloud, with the eigenvector axes e1 and e2]

SVD: The mathematical formulation (cont.)
- Let A be the N x M matrix of N M-dimensional points
- The SVD decomposition of A is A = U x L x V^T
- But the running time is O(N M^2). How many I/Os?
- Idea: find the eigenvectors of C = A^T A (an M x M matrix)
  - If M << N, then C fits in main memory and can be constructed in one pass over A (so we can find V and L)
  - One more pass over A constructs U
(a numpy sketch follows)
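A minimal numpy sketch of this two-pass idea (illustration only, not the lecture's code); it assumes M << N and that A has full column rank:

    import numpy as np

    N, M, k = 10000, 20, 5
    A = np.random.rand(N, M)                 # N points in M dimensions, M << N

    # "Pass 1": build the small M x M matrix C = A^T A (fits in main memory)
    C = A.T @ A
    eigvals, V = np.linalg.eigh(C)           # eigenvectors of C = columns of V
    order = np.argsort(eigvals)[::-1]        # sort in "goodness" order (largest eigenvalue first)
    eigvals, V = eigvals[order], V[:, order]
    L = np.sqrt(np.clip(eigvals, 0, None))   # singular values

    # "Pass 2": recover U from A, V and L   (A = U L V^T  =>  U = A V L^{-1})
    U = (A @ V) / L

    A_k = A @ V[:, :k]                       # project the points onto the first k eigenvectors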

SVD Cont’d
- Advantages:
  - Optimal dimensionality reduction (for linear projections)
- Disadvantages:
  - Computationally hard, especially if the time series are very long
  - Does not work for subsequence indexing

SVD Extensions
- On-line approximation algorithm [Ravi Kanth et al., 1998]
- Local dimensionality reduction: cluster the time series, then solve for each cluster separately [Chakrabarti and Mehrotra, 2000], [Thomasian et al.]

Random Projections
- Based on the Johnson-Lindenstrauss lemma: for any (sufficiently large) set S of M points in R^n and k = O(ε^-2 ln M), there exists a linear map f: S -> R^k such that
  (1 - ε) D(S,T) < D(f(S), f(T)) < (1 + ε) D(S,T)   for all S, T in S
- A random projection achieves this with constant probability

Random Projection: Application
- Set k = O(ε^-2 ln M)
- Select k random n-dimensional vectors (one approach: vectors with Gaussian-distributed entries of mean 0 and variance 1, i.e. N(0,1))
- Project the time series onto the k vectors
- The resulting k-dimensional space approximately preserves the distances with high probability
- This is a Monte Carlo algorithm: we do not know whether the result is correct (see the sketch below)
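A minimal numpy sketch of the recipe above (illustration only; uses the conventional N(0,1) entries and the usual 1/sqrt(k) scaling so that distances are preserved in expectation):

    import numpy as np

    def random_project(X, k, seed=0):
        """Project N x n data onto k random Gaussian directions."""
        rng = np.random.default_rng(seed)
        R = rng.normal(0.0, 1.0, size=(X.shape[1], k))  # n x k matrix, entries ~ N(0,1)
        return (X @ R) / np.sqrt(k)                     # scaling keeps distances unbiased

    X = np.random.rand(1000, 500)      # 1000 series in 500 dimensions
    Y = random_project(X, k=50)        # 1000 points in 50 dimensions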

Random Projection
- A very useful technique, especially when used in conjunction with another technique (for example, SVD)
- Use random projection to reduce the dimensionality from thousands to a few hundred, then apply SVD to reduce the dimensionality further (a sketch of this pipeline follows)
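A sketch of that two-stage pipeline (an assumption about how one might wire it up with scikit-learn; the sizes are made up for illustration):

    import numpy as np
    from sklearn.random_projection import GaussianRandomProjection
    from sklearn.decomposition import PCA

    X = np.random.rand(2000, 5000)                       # thousands of dimensions
    X_rp = GaussianRandomProjection(n_components=300,
                                    random_state=0).fit_transform(X)
    X_out = PCA(n_components=20).fit_transform(X_rp)     # SVD-based second stage
    print(X_out.shape)                                   # (2000, 20)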