
1 Optimal Reverse Prediction: A Unified Perspective on Supervised, Unsupervised and Semi-supervised Learning. Linli Xu, Martha White and Dale Schuurmans, ICML 2009, Best Overall Paper Honorable Mention. Discussion led by Chunping Wang, ECE, Duke University, October 23, 2009

2 Outline: Motivations; Preliminary Foundations; Reverse Supervised Least Squares; the relationship between Unsupervised Least Squares and PCA, k-means, and Normalized Graph-cut; Semi-supervised Least Squares; Experiments; Conclusions

3 Motivations. There is a lack of a foundational connection between supervised and unsupervised learning: supervised learning minimizes prediction error, while unsupervised learning re-represents the input data. For semi-supervised learning, one needs to consider both together. The semi-supervised learning literature relies on intuitions: the "cluster assumption" and the "manifold assumption". The unification demonstrated in this paper leads to a novel semi-supervised principle.

4 Preliminary Foundations: Forward Supervised Least Squares. Data: an input matrix X and an output matrix Y, with t instances, n features, and k responses; in regression the outputs Y are real-valued, while in classification each row of Y is a class indicator; X and Y are assumed to be full rank. Problem: find parameters W minimizing the least squares loss for a linear model.
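
To make the forward problem concrete, here is a minimal numpy sketch (not from the paper) of multi-response least squares with the shapes used on the slide: t instances, n features, k responses; the closed form assumes X has full column rank.

```python
import numpy as np

t, n, k = 100, 5, 3                      # instances, features, responses
rng = np.random.default_rng(0)
X = rng.normal(size=(t, n))              # input matrix
Y = rng.normal(size=(t, k))              # output matrix (real-valued: regression)

# Forward least squares: find W minimizing ||X W - Y||_F^2.
# Closed form W = (X'X)^{-1} X'Y, assuming X'X is invertible.
W = np.linalg.solve(X.T @ X, X.T @ Y)

forward_loss = np.linalg.norm(X @ W - Y, "fro") ** 2
print(forward_loss)
```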

5 Preliminary Foundations: variants of the forward least squares problem, namely linear least squares, ridge regularization, kernelization, and instance weighting.
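
A hedged sketch of the usual closed forms for these variants; the ridge parameter, the Gaussian kernel, and the instance weights below are illustrative choices, not values taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
t, n, k = 100, 5, 3
X = rng.normal(size=(t, n))
Y = rng.normal(size=(t, k))
lam = 0.1                                          # illustrative ridge parameter

# Ridge regularization: W = (X'X + lam I)^{-1} X'Y.
W_ridge = np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ Y)

# Kernelization: work with a Gram matrix K instead of X; training-set
# predictions are K (K + lam I)^{-1} Y.
def gaussian_kernel(A, B, gamma=0.5):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

K = gaussian_kernel(X, X)
alpha = np.linalg.solve(K + lam * np.eye(t), Y)    # dual coefficients
Y_hat_kernel = K @ alpha

# Instance weighting: with a diagonal weight matrix Lambda,
# W = (X' Lambda X)^{-1} X' Lambda Y (weights here are illustrative).
weights = rng.uniform(0.5, 1.5, size=t)
Lam = np.diag(weights)
W_weighted = np.linalg.solve(X.T @ Lam @ X, X.T @ Lam @ Y)
```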

6 Preliminary Foundations: Principal Components Analysis (dimensionality reduction), k-means (clustering), and Normalized Graph-cut (clustering). A weighted undirected graph consists of nodes, edges, and an affinity matrix. Graph partition problem: find a partition minimizing the total weight of edges connecting nodes in distinct subsets.

7 Preliminary Foundations: Normalized Graph-cut (clustering). The formulation is stated in terms of a partition indicator matrix Z, the weighted degree matrix, the total cut, and the normalized cut objective with its constraint (from Xing & Jordan, 2003).
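
Since the slide's equations did not survive the transcript, the following is a standard matrix form of the normalized cut consistent with the quantities named above; the exact notation of the original slide may differ.

```latex
% Standard matrix form of the normalized cut (notation may differ from the slide).
% A: affinity matrix, D = diag(A 1): weighted degree matrix,
% Z: partition indicator matrix with one 1 per row.
\min_{Z}\; \operatorname{tr}\!\left( (Z^\top D Z)^{-1}\, Z^\top (D - A)\, Z \right)
\quad \text{s.t. } Z \in \{0,1\}^{t \times k},\; Z\mathbf{1} = \mathbf{1}.
```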

8 First contribution: a unification diagram connecting supervised methods (least squares regression and least squares classification) with unsupervised methods (principal component analysis, k-means, and normalized graph-cut). The individual connections appear in the literature; this paper unifies them under one framework.

10 Reverse Supervised Least Squares. Traditional forward least squares predicts the outputs from the inputs; reverse least squares predicts the inputs from the outputs. Given reverse solutions U, the corresponding forward solutions W can be recovered exactly.
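
A minimal numpy sketch of the forward/reverse relationship. The recovery step below is one algebraically exact route from the reverse solution U back to the forward solution W (it assumes X'X and Y'Y are invertible) and is meant as an illustration rather than a transcription of the paper's formulas.

```python
import numpy as np

rng = np.random.default_rng(2)
t, n, k = 200, 6, 3
X = rng.normal(size=(t, n))
Y = rng.normal(size=(t, k))

# Forward: predict outputs from inputs, min_W ||X W - Y||^2.
W = np.linalg.solve(X.T @ X, X.T @ Y)

# Reverse: predict inputs from outputs, min_U ||X - Y U||^2.
U = np.linalg.solve(Y.T @ Y, Y.T @ X)

# Recover the forward solution from the reverse one:
# U'(Y'Y) = X'Y, hence (X'X)^{-1} U'(Y'Y) = (X'X)^{-1} X'Y = W.
W_recovered = np.linalg.solve(X.T @ X, U.T @ (Y.T @ Y))

assert np.allclose(W, W_recovered)
```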

11 Reverse Supervised Least Squares: the ridge-regularized, kernelized, and instance-weighted variants each have a reverse problem as well, and in each case the forward solution can be recovered from the reverse solution.

12 Reverse Supervised Least Squares. For supervised learning with the least squares loss: the forward and reverse perspectives are equivalent; each can be recovered exactly from the other; however, the forward and reverse losses are not identical, since they are measured in different units, so it is not principled to combine them directly.

13 Unsupervised Least Squares. In unsupervised learning no training labels Y are given; the principle is to optimize over guessed labels Z. The forward formulation does not work: for any W we can choose Z = XW to achieve zero loss, so it only gives trivial solutions. The reverse formulation, in contrast, gives non-trivial solutions.
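
In symbols, the contrast sketched on the slide is roughly the following, with Z ranging over guessed labels (a paraphrase, not the paper's exact display):

```latex
% Forward (degenerate): choosing Z = XW drives the loss to zero.
\min_{W,\,Z}\; \|XW - Z\|_F^2 \;=\; 0 ,
% Reverse (non-trivial): reconstruct the inputs from the guessed labels.
\min_{U,\,Z}\; \|X - ZU\|_F^2 .
```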

14 Unsupervised Least Squares and PCA. Proposition 1: unconstrained reverse prediction is equivalent to principal components analysis. This connection was made in Jong & Kotz, 1999; the authors extend it to the kernelized case. Corollary 1: kernelized reverse prediction is equivalent to kernel principal components analysis.

15 Unsupervised Least Squares and PCA, proof of Proposition 1 (slides 15-17): the solution for Z is not unique; considering the SVD of Z and substituting it into the objective reduces the problem to one solved by the principal components of X.
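
A small numerical illustration of Proposition 1 under common assumptions (centered data, rank-k reconstruction): the optimal unconstrained reverse reconstruction ZU is the best rank-k approximation of X, which is exactly what projecting onto the top-k principal components gives. This is a sketch of the idea, not the paper's proof.

```python
import numpy as np

rng = np.random.default_rng(3)
t, n, k = 200, 8, 3
X = rng.normal(size=(t, n))
X = X - X.mean(axis=0)                    # center, as PCA assumes

# Unconstrained reverse prediction: min_{Z, U} ||X - Z U||^2 with Z t-by-k.
# The optimal reconstruction Z U is the best rank-k approximation of X,
# obtained here from the truncated SVD.
Uleft, s, Vt = np.linalg.svd(X, full_matrices=False)
X_rank_k = (Uleft[:, :k] * s[:k]) @ Vt[:k, :]

# PCA reconstruction with k principal components (top right singular vectors).
components = Vt[:k, :]                    # k x n
X_pca = (X @ components.T) @ components   # project then reconstruct

assert np.allclose(X_rank_k, X_pca)

# The corresponding reverse losses agree as well.
loss_reverse = np.linalg.norm(X - X_rank_k, "fro") ** 2
loss_pca = np.linalg.norm(X - X_pca, "fro") ** 2
assert np.isclose(loss_reverse, loss_pca)
```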

18 Unsupervised Least Squares and k-means. Proposition 2: constrained reverse prediction is equivalent to k-means clustering. The connection between PCA and k-means clustering was made in Ding & He, 2004, but the authors show the connection of both to supervised (reverse) least squares. Corollary 2: constrained kernelized reverse prediction is equivalent to kernel k-means.

19 Unsupervised Least Squares and k-means, proof of Proposition 2 (slides 19-21): rewrite the constrained problem in an equivalent form and consider the difference between objectives; for an indicator matrix Z, Z'Z is a diagonal matrix of the counts of data in each class and each row of Z'X is the sum of the data in that class, so the inner solution is exactly the k-means encoding of cluster means; therefore constrained reverse prediction is equivalent to k-means.
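
A quick numerical check of the idea behind Proposition 2: for any hard assignment encoded by an indicator matrix Z, the inner solution U = (Z'Z)^{-1} Z'X contains the cluster means, and the reverse loss equals the k-means within-cluster sum of squares. The random labels below are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
t, n, k = 150, 4, 3
X = rng.normal(size=(t, n))

# A hard clustering encoded as an indicator matrix Z (one 1 per row).
labels = rng.integers(0, k, size=t)
Z = np.zeros((t, k))
Z[np.arange(t), labels] = 1.0

# Z'Z is diagonal with the class counts; each row of Z'X is the sum of the
# data in that class, so U = (Z'Z)^{-1} Z'X holds the cluster means.
U = np.linalg.solve(Z.T @ Z, Z.T @ X)

reverse_loss = np.linalg.norm(X - Z @ U, "fro") ** 2

# k-means objective for the same assignment: within-cluster sum of squares.
wcss = sum(((X[labels == c] - U[c]) ** 2).sum() for c in range(k))

assert np.isclose(reverse_loss, wcss)
```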

22 Unsupervised Least Squares and Norm-cut. Proposition 3: for a doubly nonnegative matrix K and a suitable instance weighting, weighted reverse prediction is equivalent to normalized graph-cut. Proof: for any Z, the inner minimization over U has a closed-form solution; substituting it back gives a reduced objective.

23 Proof of Proposition 3, continued: recall the normalized cut (from Xing & Jordan, 2003). Since K is doubly nonnegative, it can serve as an affinity matrix, and the reduced objective is equivalent to normalized graph-cut.

24 Unsupervised Least Squares and Norm-cut. Corollary 3: the weighted least squares problem is equivalent to normalized graph-cut on a particular affinity matrix under the matching condition on K and the weighting. In other words, with a specific choice of K we can relate normalized graph-cut to reverse least squares.

25 Second contribution: reverse prediction links supervised least squares learning with the unsupervised methods (principal component analysis, k-means, and normalized graph-cut) and leads to a new semi-supervised approach. The figure is taken from Xu's slides.

27 Semi-supervised Least Squares: a principled approach via reverse loss decomposition, which relates the supervised reverse losses to the unsupervised reverse losses. The figure is taken from Xu's slides.

32 Semi-supervised Least Squares. Proposition 4: for any X, Y, and U, the supervised loss decomposes into an unsupervised loss plus a squared distance term; the unsupervised loss depends only on the input data X, while the squared distance depends on both X and Y. Note: we cannot compute the true supervised loss since we do not have all the labels Y; we may estimate it using only labeled data, or also using auxiliary unlabeled data.
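
A hedged reconstruction of the decomposition named on the slide, with \hat{Z} denoting the optimal guessed-label matrix for the given U; the grouping matches the three terms above, though the original slide's notation may differ.

```latex
% Hedged reconstruction of the reverse loss decomposition, with
% \hat{Z} = \arg\min_{Z} \|X - Z U\|_F^2 for the given U:
\underbrace{\|X - Y U\|_F^2}_{\text{supervised loss}}
  \;=\;
\underbrace{\|X - \hat{Z} U\|_F^2}_{\text{unsupervised loss}}
  \;+\;
\underbrace{\|\hat{Z} U - Y U\|_F^2}_{\text{squared distance}} .
```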

33 Semi-supervised Least Squares. Corollary 4: for any U, the supervised loss estimate decomposes into an unsupervised loss estimate plus a squared distance estimate. Labeled data are scarce, but plenty of unlabeled data are available; the variance of the supervised loss estimate is strictly reduced by introducing the second term, using the unlabeled data to obtain a better unbiased unsupervised loss estimate.

34 Semi-supervised Least Squares. A naive approach: combine the loss on the labeled data with the loss on the unlabeled data. Advantages: the authors combine supervised and unsupervised reverse losses, whereas previous approaches combine an unsupervised (reverse) loss with a supervised (forward) loss, which are not in the same units; compared to the principled approach, it also admits more straightforward optimization procedures (alternating between U and Z).
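
A sketch of what such a naive combined objective and its alternating optimization could look like for the regression-style case, where the guessed labels Z for the unlabeled data are unconstrained; the trade-off weight mu, the data sizes, and the fixed number of sweeps are illustrative assumptions, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(5)
n, k = 6, 2
tL, tU = 20, 200                          # few labeled, many unlabeled instances
W_true = rng.normal(size=(n, k))
XL = rng.normal(size=(tL, n))
YL = XL @ W_true + 0.1 * rng.normal(size=(tL, k))
XU = rng.normal(size=(tU, n))             # unlabeled inputs

mu = 1.0                                  # illustrative trade-off weight

# Initialize U from the supervised (labeled-only) reverse problem.
U = np.linalg.solve(YL.T @ YL, YL.T @ XL)

for _ in range(50):                       # fixed number of sweeps, for simplicity
    # Z-step: guessed labels for the unlabeled data, min_Z ||XU - Z U||^2.
    Z = XU @ U.T @ np.linalg.inv(U @ U.T)
    # U-step: stack the labeled and (weighted) unlabeled parts and re-solve.
    A = np.vstack([YL, np.sqrt(mu) * Z])
    B = np.vstack([XL, np.sqrt(mu) * XU])
    U = np.linalg.solve(A.T @ A, A.T @ B)

Z = XU @ U.T @ np.linalg.inv(U @ U.T)     # guessed labels for the final U
combined_loss = (np.linalg.norm(XL - YL @ U, "fro") ** 2
                 + mu * np.linalg.norm(XU - Z @ U, "fro") ** 2)
print(combined_loss)
```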

35 Regression Experiments: Least Squares + PCA. In the basic formulation the two terms are not jointly convex, so there is no closed-form solution. Learning method: alternating optimization, with the initial U obtained from the supervised setting. The forward solution is then recovered and, for testing, the output for a new x is predicted from it. The method can be kernelized.

36 Regression Experiments: Least Squares + PCA. Results: forward root mean squared error (mean ± standard deviation over 10 random splits of the data); the values of (k, n; T_L, T_U) are indicated for each data set. The table is taken from Xu's paper.

37 Classification Experiments: Least Squares + k-means and Least Squares + Norm-cut. The forward solution is recovered and, for testing, given a new x we predict the class with the maximum response.
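
A minimal sketch of the test-time rule described on the slide, assuming some recovered forward map W is available (here W is simply a forward least squares fit on toy indicator labels, purely for illustration): compute the response vector for a new x and predict the arg-max class.

```python
import numpy as np

rng = np.random.default_rng(6)
t, n, k = 90, 5, 3

# Toy labeled data with an indicator output matrix Y, just to obtain some W.
X = rng.normal(size=(t, n))
labels = rng.integers(0, k, size=t)
Y = np.eye(k)[labels]

# Recovered forward solution (here simply the forward least squares fit).
W = np.linalg.solve(X.T @ X, X.T @ Y)

def predict(x_new, W):
    """Predict the class with the maximum response, as described on the slide."""
    responses = x_new @ W          # one response per class
    return int(np.argmax(responses))

x_new = rng.normal(size=n)
print(predict(x_new, W))
```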

38 Classification Experiments: Least Squares + k-means. Results: forward root mean squared error (mean ± standard deviation over 10 random splits of the data); the values of (k, n; T_L, T_U) are indicated for each data set. The table is taken from Xu's paper.

39 Conclusions. Two main contributions: 1. A unified framework based on the reverse least squares loss is proposed for several existing supervised and unsupervised algorithms; 2. Within the unified framework, a novel semi-supervised principle is proposed.


Download ppt "Optimal Reverse Prediction: Linli Xu, Martha White and Dale Schuurmans ICML 2009, Best Overall Paper Honorable Mention A Unified Perspective on Supervised,"

Similar presentations


Ads by Google