CS685 : Special Topics in Data Mining, UKY The UNIVERSITY of KENTUCKY Dimensionality Reduction CS 685: Special Topics in Data Mining Spring 2008 Jinze Liu
CS685 : Special Topics in Data Mining, UKY Overview What is Dimensionality Reduction? – Simplifying complex data – Using dimensionality reduction as a Data Mining “tool” – Useful for both “data modeling” and “data analysis” – Tool for “clustering” and “regression” Linear Dimensionality Reduction Methods – Principle Component Analysis (PCA) – Multi-Dimensional Scaling (MDS) Non-Linear Dimensionality Reduction – Next week
CS685 : Special Topics in Data Mining, UKY What is Dimensionality Reduction? Given N objects, each with M measurements, find the best D-dimensional parameterization Goal: Find a “compact parameterization” or “Latent Variable” representation Given N examples of find where Underlying assumptions to DimRedux – Measurements over-specify data, M > D – The number of measurements exceed the number of “true” degrees of freedom in the system – The measurements capture all of the significant variability
CS685 : Special Topics in Data Mining, UKY Uses for DimRedux Build a “compact” model of the data – Compression for storage, transmission, & retrieval – Parameters for indexing, exploring, and organizing – Generate “plausible” new data Answer fundamental questions about data – What is its underlying dimensionality? How many degrees of freedom are exhibited? How many “latent variables”? – How independent are my measurements? – Is there a projection of my data set where important relationships stand out?
CS685 : Special Topics in Data Mining, UKY DimRedux in Data Modeling Data Clustering - Continuous to Discrete – The curse of dimensionality: the sampling density is proportional to N 1/p. – Need a mapping to a lower-dimensional space that preserves “important” relations Regression Modeling – Continuous to Continuous – A functional model that generates input data – Useful for interpolation Embedding Space
CS685 : Special Topics in Data Mining, UKY Today’s Focus Linear DimRedux methods – PCA – Pearson (1901); Hotelling (1935) – MDS – Torgerson (1952), Shepard (1962) “Linear” Assumption – Data is a linear function of the parameters (latent variables) – Data lies on a linear (Affine) subspace where the matrix M is m x d
CS685 : Special Topics in Data Mining, UKY PCA: What problem does it solve? Minimizes “least-squares” (Euclidean) error – The D-dimensional model provided by PCA has the smallest Euclidean error of any D-parameter linear model. where is the model predicted by the D- dimensional PCA. Projects data s.t. the variance is maximized Find an optimal “orthogonal” basis set for describing the given data
CS685 : Special Topics in Data Mining, UKY Principle Component Analysis Also known to engineers as the Karhunen-Loéve Transform (KLT) Rotate data points to align successive axes with directions of greatest variance – Subtract mean from data – Normalize variance along each direction, and reorder according to the variance magnitude from high to low – Normalized variance direction = principle component Eigenvectors of system’s Covariance Matrix permute to order eigenvectors in descending order
CS685 : Special Topics in Data Mining, UKY Simple PCA Example Simple 3D example >> x = rand(2, 500); >> z = [1,0; 0,1; -1,-1] * x + [0;0;1] * ones(1, 500); >> m = (100 * rand(3,3)) * z + rand(3, 500); >> scatter3(m(1,:), m(2,:), m(3,:), 'filled');
CS685 : Special Topics in Data Mining, UKY Simple PCA Example (cont) >> mm = (m- mean(m')' * ones(1, 500));; >> [E,L] = eig(cov(mm ‘ )); >> E E = >> L L = >> newm = E’ * (m - mean(m’)’' * ones(1, 500)); >> scatter3(newm(1,:), newm(2,:), newm(3,:), 'filled'); axis([-50,50, -50,50, -50,50]);
CS685 : Special Topics in Data Mining, UKY Simple PCA Example (cont)
CS685 : Special Topics in Data Mining, UKY PCA Applied to Reillumination Illumination can be modeled as an additive linear system. )(R ixy
CS685 : Special Topics in Data Mining, UKY Simulating New Lighting We can simulate the appearance of a model under new illumination by combining images taken from a set of basis lights We can then capture real-world lighting and use it to modulate our basis lighting functions
CS685 : Special Topics in Data Mining, UKY Problems There are too many basis lighting functions – These have to be stored in order to use them – The resulting lighting model can be huge, in particular when representing high frequency lighting – Lighting differences can be very subtle The cost of modulation is excessive – Every basis image must be scaled and added together – Each image requires a high-dynamic range Is there a more compact representation? – Yes, use PCA.
CS685 : Special Topics in Data Mining, UKY PCA Applied to Illumination More than 90% variance is captured in the first five principle components Generate new illumination by combining only 5 basis images V0 for n lights
CS685 : Special Topics in Data Mining, UKY Results Video
CS685 : Special Topics in Data Mining, UKY Results Video
CS685 : Special Topics in Data Mining, UKY Results Video
CS685 : Special Topics in Data Mining, UKY MDS: What problem does it solve? Takes as input a dissimilarity matrix M, containing pairwise dissimilarities between N-dimensional data points Finds the best D-dimensional linear parameterization compatible with M (in other words, outputs a projection of data in D-dimensional space where the pairwise distances match the original dissimilarities as faithfully as possible)
CS685 : Special Topics in Data Mining, UKY Multidimensional Scaling (MDS) Dissimilarities can be metric or non-metric Useful when absolute measurements are unavailable; uses relative measurements Computation is invariant to dimensionality of data
CS685 : Special Topics in Data Mining, UKY An example: map of the US Given only the distance between a bunch of cities
CS685 : Special Topics in Data Mining, UKY An example: map of the US MDS finds suitable coordinates for the points of the specified dimension.
CS685 : Special Topics in Data Mining, UKY MDS Properties Parameterization is not unique – Axes are meaningless – Not surprising since Euclidean transformations and reflections preserve distances between points Useful for visualizing relationships in high dimensional data. – Define a dissimilarity measure – Map to a lower-dimensional space using MDS Common preprocess before cluster analysis – Aids in understanding patterns and relationships in data Widely used in marketing and psychometrics
CS685 : Special Topics in Data Mining, UKY Dissimilarities Dissimilarities are distance-like quantities that satisfy the following conditions: A dissimilarity is metric if, in addition, it satisfies: “The triangle inequality”
CS685 : Special Topics in Data Mining, UKY Relating MDS to PCA Special case: when distances are Euclidean PCA = eigendecomposition of covariance matrix M T M Convert the pair-wise distance matrix to the covariance matrix
CS685 : Special Topics in Data Mining, UKY How to get M T M from Euclidean Pair-wise Distances Eigendecomposition on b to get VSV T VS 1/2 = matrix of new coordinates j i k Law of cosines Definition of a dot product
CS685 : Special Topics in Data Mining, UKY Algebraically… The distance between points p i and p j The *Column Average* the average distance that a given point is from p j The *Row Average* the average distance that a given point is from p i The “Matrix Average” So we “centered” the matrix
CS685 : Special Topics in Data Mining, UKY MDS Mechanics Given a Dissimilarity matrix, D, the MDS model is computed as follows: Where, H, the so called “centering” matrix, is a scaled identity matrix computed as follows: MDS coordinates given by (in order of decreasing :
CS685 : Special Topics in Data Mining, UKY MDS Stress The residual variance of B (i.e. the sum of the remaining eigenvalues) indicate the goodness of fit for the selected d-dimensional model This term is often called MDS “stress” Examining the residual variance gives an indication of the inherent dimensionality
CS685 : Special Topics in Data Mining, UKY Reflectance Modeling Example From Pellacini, et. al. “Toward a Psychophysically-Based Light Reflection Model for Image Synthesis,” SIGGRAPH 2000 Objective – Find a perceptually meaningful parameterization for reflectance modeling The top row of white, grey, and black balls have the same “physical” reflectance parameters, however, the bottom row is “perceptually” more consistent.
CS685 : Special Topics in Data Mining, UKY Reflectance Modeling Example User Task – Subjects were presented with 378 pairs of rendered spheres an asked to rate their difference in “glossiness” on a scale of 0 (no difference) to 100. A dissimilarity 27 x 27 dissimilarity matrix was constructed and MDS applied
CS685 : Special Topics in Data Mining, UKY Reflectance Modeling Example Parameters of a 2D embedding space were determined Two axes of “gloss” were established
CS685 : Special Topics in Data Mining, UKY Limitations of Linear methods What if the data does not lie within a linear subspace? Do all convex combinations of the measurements generate plausible data? Low-dimensional non-linear Manifold embedded in a higher dimensional space Next time: Nonlinear Dimensionality Reduction
CS685 : Special Topics in Data Mining, UKY Summary Linear dimensionality reduction tools are widely used for – Data analysis – Data preprocessing – Data compression PCA transforms the measurement data s. t. successive directions of greatest variance are mapped to orthogonal axis directions (bases) – An D-dimensional embedding space (parameterization) can be established by modeling the data using only the first d of these basis vectors – Residual modeling error is the sum of the remaining eigenvalues
CS685 : Special Topics in Data Mining, UKY Summary (cont) MDS finds a d-dimensional parameterization that best preserves a given dissimilarity matrix – Resulting model can be Euclidean transformed to align data with a more intuitive parameterization – An D-dimensional embedding spaces (parameterization) are established by modeling the data using only the first d coordinates of the scaled eigenvectors – Residual modeling error (MDS stress) is the sum of the remaining eigenvalues – If Euclidean metric dissimilarity matrix is used for MDS the resulting d-dimensional model will match the PCA weights for the same dimensional model