1 Learning a Kernel Matrix for Nonlinear Dimensionality Reduction By K. Weinberger, F. Sha, and L. Saul Presented by Michael Barnathan

2 The Problem:
• Data lies on or near a manifold.
  • Lower dimensionality than the overall space.
  • Locally Euclidean.
  • Example: data along a line in R³, or a flat patch on a sphere.
• Goal: learn a kernel that lets us work in the lower-dimensional space.
  • "Unfold" the manifold.
  • First we need to know what the manifold is:
    • Its dimensionality.
    • How it can vary.
(Figure: a 2D manifold on a sphere. Source: Wikipedia.)

3 Background Assumptions:
• Kernel Trick
  • Mercer's Theorem: continuous, symmetric, positive semidefinite kernel functions can be represented as dot (inner) products in a higher-dimensional space (Wikipedia; implied in the paper).
  • So we replace the dot product with a kernel function.
  • Kernel ("Gram") matrix: K_nm = φ(x_n)ᵀ φ(x_m) = k(x_n, x_m).
  • The kernel provides the mapping into the high-dimensional space.
  • Consequence of Cover's theorem: the nonlinear problem can become linearly separable there.
  • Example: SVMs replace x_iᵀ x_j with φ(x_i)ᵀ φ(x_j) = k(x_i, x_j).
• Linear Dimensionality Reduction Techniques:
  • SVD and derived techniques (PCA, ICA, etc.) remove linear correlations.
  • This reduces the dimensionality.
• Now combine these!
  • Kernel PCA for nonlinear dimensionality reduction: map the input to a higher dimension using a kernel, then apply PCA (see the sketch below).
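A minimal numpy sketch of these two ingredients, using a toy dataset and a degree-2 polynomial kernel of my own choosing (an illustration of the kernel trick and kernel PCA, not code from the paper):

    import numpy as np

    # Toy data: 5 points in R^3 (illustration only, not the paper's data).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(5, 3))

    # Gram matrix for the degree-2 polynomial kernel k(x, y) = (1 + x.y)^2.
    K = (1.0 + X @ X.T) ** 2

    # Mercer's theorem: a valid kernel's Gram matrix is symmetric positive semidefinite.
    assert np.allclose(K, K.T)
    assert np.all(np.linalg.eigvalsh(K) >= -1e-9)

    # Kernel PCA step: double-center K, then its top eigenvectors give the
    # nonlinear principal components; eigenvalues measure variance captured.
    N = K.shape[0]
    one = np.ones((N, N)) / N
    K_centered = K - one @ K - K @ one + one @ K @ one
    eigenvalues, eigenvectors = np.linalg.eigh(K_centered)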

4 The (More Specific) Problem:
• Data is described by a manifold.
• Using kernel PCA, discover the manifold.
• There is only one detail missing: how do we find the appropriate kernel?
  • This forms the basis of the paper's approach.
  • It is also a motivation for the paper…

5 Motivation:
• Exploits properties of the data, not just of its space.
• Relates kernel discovery to manifold learning.
  • With the right kernel, kernel PCA will let us discover the manifold.
  • So it has implications for both fields.
  • Another paper by the same authors focuses on applicability to manifold learning; this paper focuses on kernel learning.
• Unlike previous methods, this approach is unsupervised; the kernel is learned automatically.
• Not specific to PCA; the approach can learn any kernel.

6 Methodology – Idea:
• Semidefinite programming (optimization).
• Look for a locally isometric mapping from the space to the manifold.
  • Preserves distances and angles between nearby points.
  • Locally, the mapping is a rotation and translation of each neighborhood.
  • Fix the distances and angles between each point and its k nearest neighbors.
• Intuition:
  • Represent the points as a lattice of "steel balls".
  • Neighborhoods are connected by "rigid rods" that fix angles and distances (the local isometry constraint).
  • Now pull the balls as far apart as possible (the objective function).
  • The lattice flattens → lower dimensionality!
• The "balls" and "rods" represent the manifold…
  • …if the data is well sampled (Wikipedia).
  • This shouldn't be a problem in practice.
(A sketch of the neighborhood construction follows this slide.)
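A small numpy sketch of which pairs get "rods", i.e. which pairs of points are constrained (my reading of "neighbors of each other or of a common point"; the function name, the default k, and the brute-force distance computation are illustrative choices, not the authors' implementation):

    import numpy as np

    def neighbor_pairs(X, k=4):
        """Pairs (i, j) whose distances the 'rigid rods' fix: j is one of i's
        k nearest neighbors, or i and j are both neighbors of a common point."""
        D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
        nn = np.argsort(D, axis=1)[:, 1:k + 1]   # k nearest neighbors of each point
        pairs = set()
        for i in range(len(X)):
            for j in nn[i]:                       # direct neighbors of i
                pairs.add((int(min(i, j)), int(max(i, j))))
            for a in nn[i]:                       # pairs sharing the common point i
                for b in nn[i]:
                    if a != b:
                        pairs.add((int(min(a, b)), int(max(a, b))))
        return sorted(pairs)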

7 Optimization Constraints:
• Isometry:
  • Distances and angles are fixed between every point x_i and its k nearest neighbors.
  • Let G_mn = x_m · x_n be the input Gram matrix and K_mn = Φ(x_m) · Φ(x_n) the learned one.
  • Then, for all i, j that are neighbors of each other or of a common point, we require K_ii + K_jj − K_ij − K_ji = G_ii + G_jj − G_ij − G_ji (worked out below).
• Positive semidefiniteness, K ⪰ 0 (required for the kernel trick).
  • No negative eigenvalues.
• Centered on the origin: Σ_i Φ(x_i) = 0, equivalently Σ_ij K_ij = 0.
  • So the eigenvalues measure the variance of the principal components.
  • The dataset can be centered if it is not already.
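Why this Gram-matrix identity preserves distances (a one-line expansion, not shown on the slide):

    \|\Phi(x_i)-\Phi(x_j)\|^2
      = \Phi(x_i)\cdot\Phi(x_i) + \Phi(x_j)\cdot\Phi(x_j) - 2\,\Phi(x_i)\cdot\Phi(x_j)
      = K_{ii} + K_{jj} - K_{ij} - K_{ji}

and, by the same expansion, G_ii + G_jj − G_ij − G_ji = ||x_i − x_j||², so the constraint equates output and input squared distances over each neighborhood.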

8 Objective Function
• We want to maximize the pairwise distances between the mapped points.
• This is an inversion of SSE/MSE!
• So we maximize T(K) = (1/2N) Σ_ij ||Φ(x_i) − Φ(x_j)||², which is just Tr(K)!
• Proof (not given in the paper; worked out below):
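A worked version of the omitted step, using the centering constraint Σ_ij K_ij = 0 from the previous slide:

    T(K) = \frac{1}{2N}\sum_{ij}\|\Phi(x_i)-\Phi(x_j)\|^2
         = \frac{1}{2N}\sum_{ij}\bigl(K_{ii}+K_{jj}-2K_{ij}\bigr)
         = \frac{1}{2N}\Bigl(N\,\mathrm{Tr}(K)+N\,\mathrm{Tr}(K)-2\sum_{ij}K_{ij}\Bigr)
         = \mathrm{Tr}(K).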

9 Semidefinite Embedding (SDE)
• Maximize Tr(K) subject to:
  • K ⪰ 0 (positive semidefinite).
  • Σ_ij K_ij = 0 (centering).
  • K_ii + K_jj − K_ij − K_ji = G_ii + G_jj − G_ij − G_ji for all i, j that are neighbors of each other or of a common point.
• This optimization is convex, so it has no local optima and a globally optimal K can be found.
• Use semidefinite programming to perform the optimization (no SDP details in the paper).
• Once we have the optimal kernel, perform kernel PCA.
• This technique (SDE) is the paper's contribution. (A sketch of the optimization follows.)
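A minimal sketch of this SDP in cvxpy, assuming the neighbor_pairs helper from the earlier sketch; the use of cvxpy, the variable names, and the default solver are my choices, not the authors' implementation:

    import cvxpy as cp
    import numpy as np

    def sde_kernel(X, pairs):
        """Learn a kernel matrix by maximizing Tr(K) under the SDE constraints."""
        N = len(X)
        G = X @ X.T                           # input Gram matrix
        K = cp.Variable((N, N), PSD=True)     # K is symmetric positive semidefinite
        constraints = [cp.sum(K) == 0]        # centering
        for i, j in pairs:                    # local isometry on neighbor pairs
            constraints.append(K[i, i] + K[j, j] - 2 * K[i, j]
                               == G[i, i] + G[j, j] - 2 * G[i, j])
        problem = cp.Problem(cp.Maximize(cp.trace(K)), constraints)
        problem.solve()
        return K.value

Kernel PCA on the returned matrix (as in the slide-3 sketch) then gives the low-dimensional embedding.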

10 Experimental Setup
• Four kernels: SDE (proposed), linear, polynomial, Gaussian.
• "Swiss Roll" dataset:
  • 23 dimensions: 3 meaningful (top right), 20 filled with small noise (not shown).
  • 800 inputs.
  • k = 4, p = 4, σ = 1.45 (the σ of the 4-neighborhoods).
• "Teapot" dataset:
  • The same teapot, rotated through angles from 0 to 360 degrees.
  • 23,028 dimensions (76 × 101 × 3).
  • Only one degree of freedom (the angle of rotation).
  • 400 inputs.
  • k = 4, p = 4, σ = 1541.
• "The handwriting dataset":
  • No dimensionality or parameters specified (16 × 16 × 1 = 256D?).
  • 953 images. No images or kernel matrix shown.

11 Results – Dimensionality Reduction
• Two measures:
  • The learned kernel matrices (SDE).
  • "Eigenspectra": the variance captured by individual eigenvalues, normalized by the trace (the sum of the eigenvalues); see the sketch below.
• The eigenspectrum seems to indicate the manifold dimensionality.
(Figures: learned kernels and eigenspectra for the "Swiss Roll", "Teapot", and "Digits" datasets.)
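A small numpy helper (my illustration, name included) for the normalized eigenspectrum described above:

    import numpy as np

    def normalized_eigenspectrum(K):
        """Eigenvalues of the learned kernel, sorted in descending order and
        normalized by the trace, i.e. the fraction of variance per component."""
        eigenvalues = np.linalg.eigvalsh(K)[::-1]
        return eigenvalues / np.trace(K)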

12 Results – Large Margin Classification
• Used the SDE kernels with SVMs (see the sketch below).
• Results were very poor.
• Lowering the dimensionality can impair separability: the classes may no longer be linearly separable after unfolding.
(Table: error rates with a 90/10 training/test split, averaged over 10 experiments.)
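For concreteness, this is how a precomputed kernel matrix is plugged into an SVM in scikit-learn; the toy data and the stand-in linear kernel are my own, and out-of-sample test points would need an extension step that the paper does not address:

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = rng.normal(size=(40, 5))
    y = rng.integers(0, 2, size=40)

    K = X @ X.T                                   # stand-in for a learned SDE kernel
    clf = SVC(kernel="precomputed", C=1.0).fit(K, y)
    train_error = 1.0 - clf.score(K, y)           # error rate on the training kernel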

13 Strengths and Weaknesses
• Strengths:
  • Unsupervised, convex kernel optimization.
  • Generalizes well in theory.
  • Relates manifold learning and kernel learning.
  • Easy to implement; just solve the optimization.
  • Intuitive (stretching a string).
• Weaknesses:
  • May not generalize well in practice (SVMs).
  • Implicit assumption: lower dimensionality is better.
    • Not always the case (as with SVMs, due to separability in higher dimensions).
  • Robustness: what if a neighborhood contains an outlier?
  • Offline algorithm: the entire Gram matrix is required.
    • Only a problem if N is large.
  • The paper doesn't mention SDP details.
    • No algorithm analysis, complexity, etc. The complexity is "relatively high".
  • In fact, there is no proof of convergence (according to the authors' other 2004 paper).
    • Isomap, LLE, et al. already have such proofs.

14 Possible Improvements
• Introduce slack variables for robustness (a sketch follows this slide).
  • The "rods" would no longer be "rigid", but would be penalized for "bending".
  • This would introduce a "C" parameter, as in SVMs.
• Incrementally accept minors of K for large values of N, and use incremental kernel PCA.
• Convolve the SDE kernel with others for SVMs?
  • SDE unfolds the manifold; the other kernel makes the problem linearly separable again.
  • This only makes sense if SDE simplifies the problem.
• Analyze the complexity of the SDP.
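One hypothetical way the slack-variable idea could look, modifying the earlier cvxpy sketch (this is the slide's suggestion as sketched by me, not anything from the paper; penalizing |violation| ≤ ξ with a C·Σξ term is one of several possible formulations):

    import cvxpy as cp
    import numpy as np

    def sde_kernel_with_slack(X, pairs, C=10.0):
        """SDE variant where each isometry constraint may 'bend' by a slack
        amount, penalized in the objective (hypothetical formulation)."""
        N = len(X)
        G = X @ X.T
        K = cp.Variable((N, N), PSD=True)
        xi = cp.Variable(len(pairs), nonneg=True)   # one slack per constrained pair
        constraints = [cp.sum(K) == 0]
        for m, (i, j) in enumerate(pairs):
            lhs = K[i, i] + K[j, j] - 2 * K[i, j]
            rhs = G[i, i] + G[j, j] - 2 * G[i, j]
            constraints.append(cp.abs(lhs - rhs) <= xi[m])
        objective = cp.Maximize(cp.trace(K) - C * cp.sum(xi))
        cp.Problem(objective, constraints).solve()
        return K.value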

15 Conclusions
• Using SDP, SDE can learn kernel matrices that "unfold" data embedded in manifolds.
  • Without requiring parameters beyond the neighborhood size k.
• Kernel PCA then reduces the dimensionality.
• Excellent for nonlinear dimensionality reduction / manifold learning.
  • Dramatic results when the difference in dimensionalities is high.
• Poorly suited for SVM classification.

