Learning a Kernel Matrix for Nonlinear Dimensionality Reduction
by K. Weinberger, F. Sha, and L. Saul
Presented by Michael Barnathan

The Problem:
- Data lies on or near a manifold.
  - Lower dimensionality than the overall space.
  - Locally Euclidean.
  - Example: data on a 2D plane in R^3, or a flat area on a sphere.
- Goal: learn a kernel that lets us work in the lower-dimensional space.
  - "Unfold" the manifold.
  - First we need to know what it is: its dimensionality and how it can vary.
[Figure: 2D manifold on a sphere (Wikipedia).]

Background Assumptions:
- Kernel Trick
  - Mercer's Theorem: a continuous, symmetric, positive semidefinite kernel function can be represented as a dot (inner) product in a high-dimensional space (Wikipedia; implied in the paper).
  - So we replace the dot product with a kernel function, or "Gram matrix": K_nm = φ(x_n)^T φ(x_m) = k(x_n, x_m).
  - The kernel provides the mapping into the high-dimensional space.
  - Consequence of Cover's theorem: the nonlinear problem then becomes linear.
  - Example: SVMs: x_i^T x_j -> φ(x_i)^T φ(x_j) = k(x_i, x_j).
- Linear Dimensionality Reduction Techniques:
  - SVD and derived techniques (PCA, ICA, etc.) remove linear correlations, reducing the dimensionality.
- Now combine these: kernel PCA for nonlinear dimensionality reduction!
  - Map the input to a higher dimension using a kernel, then apply PCA (see the sketch below).
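
To make the kernel PCA step concrete, here is a minimal numpy sketch (mine, not from the paper): it centers a precomputed Gram matrix, eigendecomposes it, and returns the top-d embedding. The RBF kernel and the names rbf_kernel / kernel_pca are illustrative choices, not the paper's.

```python
import numpy as np

def rbf_kernel(X, sigma=1.0):
    # Gram matrix K[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2))
    X = np.asarray(X, dtype=float)
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.exp(-d2 / (2 * sigma**2))

def kernel_pca(K, d=2):
    """Project onto the top-d kernel principal components of Gram matrix K."""
    n = K.shape[0]
    one = np.full((n, n), 1.0 / n)
    Kc = K - one @ K - K @ one + one @ K @ one   # center in feature space
    vals, vecs = np.linalg.eigh(Kc)              # eigenvalues in ascending order
    idx = np.argsort(vals)[::-1][:d]             # keep the d largest components
    vals, vecs = vals[idx], vecs[:, idx]
    return vecs * np.sqrt(np.maximum(vals, 0))   # embedded coordinates

# Usage sketch: Y = kernel_pca(rbf_kernel(X, sigma=1.45), d=2)
```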

The (More Specific) Problem:
- The data is described by a manifold.
- Using kernel PCA, discover that manifold.
- There's only one detail missing: how do we find the appropriate kernel?
  - This forms the basis of the paper's approach, and is also its motivation...

Motivation:
- Exploits properties of the data, not just of its space.
- Relates kernel discovery to manifold learning.
  - With the right kernel, kernel PCA will let us discover the manifold.
  - So it has implications for both fields.
  - Another paper by the same authors focuses on applicability to manifold learning; this paper focuses on kernel learning.
- Unlike previous methods, this approach is unsupervised; the kernel is learned automatically.
- Not specific to PCA; it can learn any kernel.

Methodology – Idea:
- Semidefinite programming (optimization).
- Look for a locally isometric mapping from the input space to the manifold.
  - Preserves distances and angles between points: only rotation and translation of each neighborhood are allowed.
  - Fix the distances and angles between each point and its k nearest neighbors (see the sketch after this slide).
- Intuition:
  - Represent the points as a lattice of "steel balls".
  - Neighborhoods are connected by "rigid rods" that fix angles and distances (the local isometry constraint).
  - Now pull the balls as far apart as possible (the objective function).
  - The lattice flattens -> lower dimensionality!
  - The "balls" and "rods" represent the manifold, provided the data is well-sampled (Wikipedia); this shouldn't be a problem in practice.
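
As a concrete illustration of the neighborhood step (my sketch, not code from the paper), the following numpy helper finds each point's k nearest neighbors; these index pairs are the (i, j) pairs whose distances the "rigid rods" constrain. The name k_nearest_neighbors is an assumption of this sketch.

```python
import numpy as np

def k_nearest_neighbors(X, k=4):
    """Return an (n, k) array: row i holds the indices of x_i's k nearest neighbors."""
    X = np.asarray(X, dtype=float)
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T   # pairwise squared distances
    np.fill_diagonal(d2, np.inf)                   # exclude each point itself
    return np.argsort(d2, axis=1)[:, :k]

# Neighbor pairs (i, j) whose distances and angles the constraints will preserve:
# nbrs = k_nearest_neighbors(X, k=4)
# pairs = {(i, int(j)) for i in range(len(X)) for j in nbrs[i]}
```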

Optimization Constraints:
- Isometry:
  - Applies to every pair x_i, x_j that are neighbors of each other or neighbors of a common point.
  - Let G and K be the Gram matrices of the inputs and of the feature vectors: G_ij = x_i^T x_j and K_ij = φ(x_i)^T φ(x_j).
  - We then require K_ii + K_jj - K_ij - K_ji = G_ii + G_jj - G_ij - G_ji (distances within each neighborhood are preserved).
- Positive semidefiniteness (required for the kernel trick): no negative eigenvalues.
- Centered on the origin (Σ_i φ(x_i) = 0, equivalently Σ_ij K_ij = 0).
  - So the eigenvalues measure the variance of the principal components.
  - The dataset can be centered if it is not already.

Objective Function
- We want to maximize the pairwise distances between embedded points.
  - This is an inversion of SSE/MSE!
- So we maximize (1/2N) Σ_ij ||φ(x_i) - φ(x_j)||², which is just Tr(K)!
- Proof (not given in the paper): see the derivation below.
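
A short derivation of this identity (my reconstruction, not taken from the paper), using the centering constraint Σ_ij K_ij = 0:

```latex
\frac{1}{2N}\sum_{ij}\bigl\|\phi(x_i)-\phi(x_j)\bigr\|^2
  = \frac{1}{2N}\sum_{ij}\bigl(K_{ii}+K_{jj}-2K_{ij}\bigr)
  = \frac{1}{2N}\Bigl(2N\,\mathrm{Tr}(K)-2\sum_{ij}K_{ij}\Bigr)
  = \mathrm{Tr}(K),
```

where the last step uses the centering constraint, which makes the Σ_ij K_ij term vanish.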

Semidefinite Embedding (SDE)
- Maximize Tr(K) subject to:
  - K ⪰ 0 (positive semidefinite).
  - Σ_ij K_ij = 0 (centering).
  - K_ii + K_jj - K_ij - K_ji = G_ii + G_jj - G_ij - G_ji for all i, j that are neighbors of each other or of a common point.
- This optimization is convex, and thus has a global optimum.
- Use semidefinite programming to perform the optimization (no SDP details in the paper; a sketch follows below).
- Once we have the optimal kernel, perform kernel PCA.
- This technique (SDE) is the paper's contribution.
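
A minimal sketch of the optimization using the off-the-shelf cvxpy modeling library (my choice; the paper does not specify an SDP solver), reusing the k_nearest_neighbors and kernel_pca helpers sketched earlier. For brevity it constrains only direct neighbor pairs; the paper also constrains pairs that share a common neighbor.

```python
import numpy as np
import cvxpy as cp

def sde_kernel(X, k=4):
    """Learn a kernel matrix by maximizing Tr(K) under SDE-style constraints (sketch)."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    G = X @ X.T                                   # Gram matrix of the inputs
    nbrs = k_nearest_neighbors(X, k)              # helper sketched earlier

    K = cp.Variable((n, n), PSD=True)             # K is symmetric positive semidefinite
    constraints = [cp.sum(K) == 0]                # centering on the origin
    for i in range(n):
        for j in map(int, nbrs[i]):               # local isometry on neighbor pairs
            constraints.append(
                K[i, i] + K[j, j] - 2 * K[i, j]
                == G[i, i] + G[j, j] - 2 * G[i, j])

    cp.Problem(cp.Maximize(cp.trace(K)), constraints).solve()
    return K.value

# Usage sketch: Y = kernel_pca(sde_kernel(X, k=4), d=2)  # then unfold with kernel PCA
```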

Experimental Setup
- Four kernels: SDE (proposed), linear, polynomial, Gaussian.
- "Swiss Roll" dataset:
  - 23 dimensions: 3 meaningful (top right), 20 filled with small noise (not shown).
  - 800 inputs.
  - k = 4, p = 4, σ = 1.45 (the σ of the 4-neighborhoods).
- "Teapot" dataset:
  - The same teapot, rotated through 0 ≤ i < 360 degrees.
  - 23,028 dimensions (76 x 101 x 3); only one degree of freedom (the angle of rotation).
  - 400 inputs.
  - k = 4, p = 4, σ =
- "The handwriting dataset":
  - No dimensionality or parameters specified (16 x 16 x 1 = 256D?).
  - 953 images. No images or kernel matrix shown.

Results – Dimensionality Reduction
- Two measures:
  - Learned kernels (SDE).
  - "Eigenspectra": the variance captured by individual eigenvalues, normalized by the trace (the sum of eigenvalues).
    - Seems to indicate the manifold dimensionality.
[Figures: "Swiss Roll", "Teapot", "Digits".]

Results – Large Margin Classification
- Used SDE kernels with SVMs (sketched below).
- Results were very poor.
  - Lowering the dimensionality can impair separability; the decision boundary may no longer be linearly separable.
[Figure: error rates, 90/10 training/test split, mean of 10 experiments.]
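
To illustrate how a learned kernel matrix can be plugged into an SVM (my sketch, not the paper's experimental code), scikit-learn's SVC accepts a precomputed Gram matrix; the names K (the learned SDE kernel over all points) and y (the labels) are assumptions of this sketch.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

# Assumed inputs: K is the learned SDE kernel matrix over all n points, y holds labels.
idx = np.arange(len(y))
tr, te = train_test_split(idx, test_size=0.1, random_state=0)   # 90/10 split

clf = SVC(kernel="precomputed", C=1.0)
clf.fit(K[np.ix_(tr, tr)], y[tr])                 # train on the train-vs-train block
err = 1.0 - clf.score(K[np.ix_(te, tr)], y[te])   # test rows vs. training columns
print(f"test error rate: {err:.3f}")
```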

Strengths and Weaknesses
- Strengths:
  - Unsupervised, convex kernel optimization.
  - Generalizes well in theory.
  - Relates manifold learning and kernel learning.
  - Easy to implement: just solve the optimization.
  - Intuitive (stretching a string).
- Weaknesses:
  - May not generalize well in practice (SVMs).
  - Implicit assumption that lower dimensionality is better; not always the case (as with SVMs, due to separability in higher dimensions).
  - Robustness: what if a neighborhood contains an outlier?
  - Offline algorithm: the entire Gram matrix is required. Only a problem if N is large.
  - The paper doesn't mention SDP details: no algorithm analysis, complexity, etc.; the complexity is "relatively high".
  - In fact, there is no proof of convergence (according to the authors' other 2004 paper); Isomap, LLE, et al. already have such proofs.

Possible Improvements
- Introduce slack variables for robustness (see the sketch below).
  - The "rods" would not be "rigid", but penalized for "bending".
  - This would introduce a "C" parameter, as in SVMs.
- Incrementally accept minors of K for large values of N, and use incremental kernel PCA.
- Convolve the SDE kernel with other kernels for SVMs?
  - SDE unfolds the manifold; the other kernel makes the problem linearly separable again.
  - Only makes sense if SDE simplifies the problem.
- Analyze the complexity of the SDP.
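
As a hedged sketch of the first idea (my own illustration, not an implementation proposed by the paper), the hard isometry equalities in the earlier cvxpy model could be softened with penalized slack variables and a hypothetical C parameter:

```python
import numpy as np
import cvxpy as cp

def sde_kernel_slack(X, k=4, C=10.0):
    """SDE variant with slack on the isometry constraints (illustrative sketch)."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    G = X @ X.T
    nbrs = k_nearest_neighbors(X, k)                  # helper sketched earlier
    pairs = [(i, int(j)) for i in range(n) for j in nbrs[i]]

    K = cp.Variable((n, n), PSD=True)
    xi = cp.Variable(len(pairs), nonneg=True)         # one slack per "rod"
    constraints = [cp.sum(K) == 0]
    for m, (i, j) in enumerate(pairs):
        gap = (K[i, i] + K[j, j] - 2 * K[i, j]) - (G[i, i] + G[j, j] - 2 * G[i, j])
        constraints += [gap <= xi[m], -gap <= xi[m]]  # |constraint violation| <= slack

    # Trade off unfolding (trace) against "bent rods" (total slack), as with C in SVMs.
    cp.Problem(cp.Maximize(cp.trace(K) - C * cp.sum(xi)), constraints).solve()
    return K.value
```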

Conclusions
- Using SDP, SDE can learn kernel matrices that "unfold" data embedded in manifolds, without requiring parameters.
- Kernel PCA then reduces the dimensionality.
- Excellent for nonlinear dimensionality reduction / manifold learning.
  - Dramatic results when the difference in dimensionalities is high.
- Poorly suited for SVM classification.