
1 Optimal Reverse Prediction: A Unified Perspective on Supervised, Unsupervised and Semi-supervised Learning. Linli Xu, Martha White and Dale Schuurmans, ICML 2009, Best Overall Paper Honorable Mention. Discussion led by Chunping Wang, ECE, Duke University, October 23, 2009

2 Outline: Motivations; Preliminary Foundations; Reverse Supervised Least Squares; the relationship between Unsupervised Least Squares and PCA, k-means, and Normalized Graph-cut; Semi-supervised Least Squares; Experiments; Conclusions

3 Motivations. There is a lack of a foundational connection between supervised and unsupervised learning: supervised learning minimizes prediction error, while unsupervised learning re-represents the input data. For semi-supervised learning, one needs to consider both together. The semi-supervised learning literature relies on intuitions: the "cluster assumption" and the "manifold assumption". The unification demonstrated in this paper leads to a novel semi-supervised principle.

4 Preliminary Foundations: Forward Supervised Least Squares. Data: an input matrix X and an output matrix Y, with t instances, n features, and k responses; in regression the outputs Y are real-valued, while in classification each row of Y is a class indicator; X and Y are assumed to be full rank. Problem: find parameters W minimizing the least squares loss for a linear model.
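
To make the forward problem concrete, here is a minimal numpy sketch (not from the paper) of multi-response least squares with the shapes used on the slide: t instances, n features, k responses; the closed form assumes X has full column rank.

```python
import numpy as np

t, n, k = 100, 5, 3                      # instances, features, responses
rng = np.random.default_rng(0)
X = rng.normal(size=(t, n))              # input matrix
Y = rng.normal(size=(t, k))              # output matrix (real-valued: regression)

# Forward least squares: find W minimizing ||X W - Y||_F^2.
# Closed form W = (X'X)^{-1} X'Y, assuming X'X is invertible.
W = np.linalg.solve(X.T @ X, X.T @ Y)

forward_loss = np.linalg.norm(X @ W - Y, "fro") ** 2
print(forward_loss)
```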

5 Preliminary Foundations: variants of the forward least squares problem, namely linear least squares, ridge regularization, kernelization, and instance weighting.
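
A hedged sketch of the usual closed forms for these variants; the ridge parameter, the Gaussian kernel, and the instance weights below are illustrative choices, not values taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
t, n, k = 100, 5, 3
X = rng.normal(size=(t, n))
Y = rng.normal(size=(t, k))
lam = 0.1                                          # illustrative ridge parameter

# Ridge regularization: W = (X'X + lam I)^{-1} X'Y.
W_ridge = np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ Y)

# Kernelization: work with a Gram matrix K instead of X; training-set
# predictions are K (K + lam I)^{-1} Y.
def gaussian_kernel(A, B, gamma=0.5):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

K = gaussian_kernel(X, X)
alpha = np.linalg.solve(K + lam * np.eye(t), Y)    # dual coefficients
Y_hat_kernel = K @ alpha

# Instance weighting: with a diagonal weight matrix Lambda,
# W = (X' Lambda X)^{-1} X' Lambda Y (weights here are illustrative).
weights = rng.uniform(0.5, 1.5, size=t)
Lam = np.diag(weights)
W_weighted = np.linalg.solve(X.T @ Lam @ X, X.T @ Lam @ Y)
```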

6 Preliminary Foundations: Principal Components Analysis (dimensionality reduction), k-means (clustering), and Normalized Graph-cut (clustering). A weighted undirected graph consists of nodes, edges, and an affinity matrix. Graph partition problem: find a partition minimizing the total weight of edges connecting nodes in distinct subsets.

7 Preliminary Foundations: Normalized Graph-cut (clustering). The formulation is stated in terms of a partition indicator matrix Z, the weighted degree matrix, the total cut, and the normalized cut objective with its constraint (from Xing & Jordan, 2003).
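
Since the slide's equations did not survive the transcript, the following is a standard matrix form of the normalized cut consistent with the quantities named above; the exact notation of the original slide may differ.

```latex
% Standard matrix form of the normalized cut (notation may differ from the slide).
% A: affinity matrix, D = diag(A 1): weighted degree matrix,
% Z: partition indicator matrix with one 1 per row.
\min_{Z}\; \operatorname{tr}\!\left( (Z^\top D Z)^{-1}\, Z^\top (D - A)\, Z \right)
\quad \text{s.t. } Z \in \{0,1\}^{t \times k},\; Z\mathbf{1} = \mathbf{1}.
```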

8 First contribution: a unification diagram connecting supervised methods (least squares regression and least squares classification) with unsupervised methods (principal component analysis, k-means, and normalized graph-cut). The individual connections appear in the literature; this paper unifies them under one framework.

10 Reverse Supervised Least Squares. Traditional forward least squares predicts the outputs from the inputs; reverse least squares predicts the inputs from the outputs. Given reverse solutions U, the corresponding forward solutions W can be recovered exactly.
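
A minimal numpy sketch of the forward/reverse relationship. The recovery step below is one algebraically exact route from the reverse solution U back to the forward solution W (it assumes X'X and Y'Y are invertible) and is meant as an illustration rather than a transcription of the paper's formulas.

```python
import numpy as np

rng = np.random.default_rng(2)
t, n, k = 200, 6, 3
X = rng.normal(size=(t, n))
Y = rng.normal(size=(t, k))

# Forward: predict outputs from inputs, min_W ||X W - Y||^2.
W = np.linalg.solve(X.T @ X, X.T @ Y)

# Reverse: predict inputs from outputs, min_U ||X - Y U||^2.
U = np.linalg.solve(Y.T @ Y, Y.T @ X)

# Recover the forward solution from the reverse one:
# U'(Y'Y) = X'Y, hence (X'X)^{-1} U'(Y'Y) = (X'X)^{-1} X'Y = W.
W_recovered = np.linalg.solve(X.T @ X, U.T @ (Y.T @ Y))

assert np.allclose(W, W_recovered)
```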

11 Reverse Supervised Least Squares: the ridge-regularized, kernelized, and instance-weighted variants each have a reverse problem as well, and in each case the forward solution can be recovered from the reverse solution.

12 Reverse Supervised Least Squares. For supervised learning with the least squares loss: the forward and reverse perspectives are equivalent; each can be recovered exactly from the other; however, the forward and reverse losses are not identical, since they are measured in different units, so it is not principled to combine them directly.

13 Unsupervised Least Squares. In unsupervised learning no training labels Y are given; the principle is to optimize over guessed labels Z. The forward formulation does not work: for any W we can choose Z = XW to achieve zero loss, so it only gives trivial solutions. The reverse formulation, in contrast, gives non-trivial solutions.
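
In symbols, the contrast sketched on the slide is roughly the following, with Z ranging over guessed labels (a paraphrase, not the paper's exact display):

```latex
% Forward (degenerate): choosing Z = XW drives the loss to zero.
\min_{W,\,Z}\; \|XW - Z\|_F^2 \;=\; 0 ,
% Reverse (non-trivial): reconstruct the inputs from the guessed labels.
\min_{U,\,Z}\; \|X - ZU\|_F^2 .
```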

14 Unsupervised Least Squares and PCA. Proposition 1: unconstrained reverse prediction is equivalent to principal components analysis. This connection was made in Jong & Kotz, 1999; the authors extend it to the kernelized case. Corollary 1: kernelized reverse prediction is equivalent to kernel principal components analysis.

15 Unsupervised Least Squares and PCA, proof of Proposition 1 (slides 15-17): the solution for Z is not unique; considering the SVD of Z and substituting it into the objective reduces the problem to one solved by the principal components of X.
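
A small numerical illustration of Proposition 1 under common assumptions (centered data, rank-k reconstruction): the optimal unconstrained reverse reconstruction ZU is the best rank-k approximation of X, which is exactly what projecting onto the top-k principal components gives. This is a sketch of the idea, not the paper's proof.

```python
import numpy as np

rng = np.random.default_rng(3)
t, n, k = 200, 8, 3
X = rng.normal(size=(t, n))
X = X - X.mean(axis=0)                    # center, as PCA assumes

# Unconstrained reverse prediction: min_{Z, U} ||X - Z U||^2 with Z t-by-k.
# The optimal reconstruction Z U is the best rank-k approximation of X,
# obtained here from the truncated SVD.
Uleft, s, Vt = np.linalg.svd(X, full_matrices=False)
X_rank_k = (Uleft[:, :k] * s[:k]) @ Vt[:k, :]

# PCA reconstruction with k principal components (top right singular vectors).
components = Vt[:k, :]                    # k x n
X_pca = (X @ components.T) @ components   # project then reconstruct

assert np.allclose(X_rank_k, X_pca)

# The corresponding reverse losses agree as well.
loss_reverse = np.linalg.norm(X - X_rank_k, "fro") ** 2
loss_pca = np.linalg.norm(X - X_pca, "fro") ** 2
assert np.isclose(loss_reverse, loss_pca)
```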

18 Unsupervised Least Squares and k-means. Proposition 2: constrained reverse prediction is equivalent to k-means clustering. The connection between PCA and k-means clustering was made in Ding & He, 2004, but the authors show the connection of both to supervised (reverse) least squares. Corollary 2: constrained kernelized reverse prediction is equivalent to kernel k-means.

19 Unsupervised Least Squares and k-means, proof of Proposition 2 (slides 19-21): rewrite the constrained problem in an equivalent form and consider the difference between objectives; for an indicator matrix Z, Z'Z is a diagonal matrix of the counts of data in each class and each row of Z'X is the sum of the data in that class, so the inner solution is exactly the k-means encoding of cluster means; therefore constrained reverse prediction is equivalent to k-means.
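
A quick numerical check of the idea behind Proposition 2: for any hard assignment encoded by an indicator matrix Z, the inner solution U = (Z'Z)^{-1} Z'X contains the cluster means, and the reverse loss equals the k-means within-cluster sum of squares. The random labels below are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
t, n, k = 150, 4, 3
X = rng.normal(size=(t, n))

# A hard clustering encoded as an indicator matrix Z (one 1 per row).
labels = rng.integers(0, k, size=t)
Z = np.zeros((t, k))
Z[np.arange(t), labels] = 1.0

# Z'Z is diagonal with the class counts; each row of Z'X is the sum of the
# data in that class, so U = (Z'Z)^{-1} Z'X holds the cluster means.
U = np.linalg.solve(Z.T @ Z, Z.T @ X)

reverse_loss = np.linalg.norm(X - Z @ U, "fro") ** 2

# k-means objective for the same assignment: within-cluster sum of squares.
wcss = sum(((X[labels == c] - U[c]) ** 2).sum() for c in range(k))

assert np.isclose(reverse_loss, wcss)
```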

22 Unsupervised Least Squares and Norm-cut. Proposition 3: for a doubly nonnegative matrix K and a suitable instance weighting, weighted reverse prediction is equivalent to normalized graph-cut. Proof: for any Z, the inner minimization over U has a closed-form solution; substituting it back gives a reduced objective.

23 Proof of Proposition 3, continued: recall the normalized cut (from Xing & Jordan, 2003). Since K is doubly nonnegative, it can serve as an affinity matrix, and the reduced objective is equivalent to normalized graph-cut.

24 Unsupervised Least Squares and Norm-cut. Corollary 3: the weighted least squares problem is equivalent to normalized graph-cut on a particular affinity matrix under the matching condition on K and the weighting. In other words, with a specific choice of K we can relate normalized graph-cut to reverse least squares.

25 Second contribution: reverse prediction links supervised least squares learning with the unsupervised methods (principal component analysis, k-means, and normalized graph-cut) and leads to a new semi-supervised approach. The figure is taken from Xu's slides.

27 Semi-supervised Least Squares: a principled approach via reverse loss decomposition, which relates the supervised reverse losses to the unsupervised reverse losses. The figure is taken from Xu's slides.

32 Semi-supervised Least Squares. Proposition 4: for any X, Y, and U, the supervised loss decomposes into an unsupervised loss plus a squared distance term; the unsupervised loss depends only on the input data X, while the squared distance depends on both X and Y. Note: we cannot compute the true supervised loss since we do not have all the labels Y; we may estimate it using only labeled data, or also using auxiliary unlabeled data.
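
A hedged reconstruction of the decomposition named on the slide, with \hat{Z} denoting the optimal guessed-label matrix for the given U; the grouping matches the three terms above, though the original slide's notation may differ.

```latex
% Hedged reconstruction of the reverse loss decomposition, with
% \hat{Z} = \arg\min_{Z} \|X - Z U\|_F^2 for the given U:
\underbrace{\|X - Y U\|_F^2}_{\text{supervised loss}}
  \;=\;
\underbrace{\|X - \hat{Z} U\|_F^2}_{\text{unsupervised loss}}
  \;+\;
\underbrace{\|\hat{Z} U - Y U\|_F^2}_{\text{squared distance}} .
```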

33 Semi-supervised Least Squares. Corollary 4: for any U, the supervised loss estimate decomposes into an unsupervised loss estimate plus a squared distance estimate. Labeled data are scarce, but plenty of unlabeled data are available; the variance of the supervised loss estimate is strictly reduced by introducing the second term, using the unlabeled data to obtain a better unbiased unsupervised loss estimate.

34 Semi-supervised Least Squares. A naive approach: combine the loss on the labeled data with the loss on the unlabeled data. Advantages: the authors combine supervised and unsupervised reverse losses, whereas previous approaches combine an unsupervised (reverse) loss with a supervised (forward) loss, which are not in the same units; compared to the principled approach, it also admits more straightforward optimization procedures (alternating between U and Z).
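
A sketch of what such a naive combined objective and its alternating optimization could look like for the regression-style case, where the guessed labels Z for the unlabeled data are unconstrained; the trade-off weight mu, the data sizes, and the fixed number of sweeps are illustrative assumptions, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(5)
n, k = 6, 2
tL, tU = 20, 200                          # few labeled, many unlabeled instances
W_true = rng.normal(size=(n, k))
XL = rng.normal(size=(tL, n))
YL = XL @ W_true + 0.1 * rng.normal(size=(tL, k))
XU = rng.normal(size=(tU, n))             # unlabeled inputs

mu = 1.0                                  # illustrative trade-off weight

# Initialize U from the supervised (labeled-only) reverse problem.
U = np.linalg.solve(YL.T @ YL, YL.T @ XL)

for _ in range(50):                       # fixed number of sweeps, for simplicity
    # Z-step: guessed labels for the unlabeled data, min_Z ||XU - Z U||^2.
    Z = XU @ U.T @ np.linalg.inv(U @ U.T)
    # U-step: stack the labeled and (weighted) unlabeled parts and re-solve.
    A = np.vstack([YL, np.sqrt(mu) * Z])
    B = np.vstack([XL, np.sqrt(mu) * XU])
    U = np.linalg.solve(A.T @ A, A.T @ B)

Z = XU @ U.T @ np.linalg.inv(U @ U.T)     # guessed labels for the final U
combined_loss = (np.linalg.norm(XL - YL @ U, "fro") ** 2
                 + mu * np.linalg.norm(XU - Z @ U, "fro") ** 2)
print(combined_loss)
```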

35 Regression Experiments: Least Squares + PCA. In the basic formulation the two terms are not jointly convex, so there is no closed-form solution. Learning method: alternating optimization, with the initial U obtained from the supervised setting. The forward solution is then recovered and, for testing, the output for a new x is predicted from it. The method can be kernelized.

36 Regression Experiments: Least Squares + PCA. Results: forward root mean squared error (mean ± standard deviation over 10 random splits of the data); the values of (k, n; T_L, T_U) are indicated for each data set. The table is taken from Xu's paper.

37 Classification Experiments: Least Squares + k-means and Least Squares + Norm-cut. The forward solution is recovered and, for testing, given a new x we predict the class with the maximum response.
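
A minimal sketch of the test-time rule described on the slide, assuming some recovered forward map W is available (here W is simply a forward least squares fit on toy indicator labels, purely for illustration): compute the response vector for a new x and predict the arg-max class.

```python
import numpy as np

rng = np.random.default_rng(6)
t, n, k = 90, 5, 3

# Toy labeled data with an indicator output matrix Y, just to obtain some W.
X = rng.normal(size=(t, n))
labels = rng.integers(0, k, size=t)
Y = np.eye(k)[labels]

# Recovered forward solution (here simply the forward least squares fit).
W = np.linalg.solve(X.T @ X, X.T @ Y)

def predict(x_new, W):
    """Predict the class with the maximum response, as described on the slide."""
    responses = x_new @ W          # one response per class
    return int(np.argmax(responses))

x_new = rng.normal(size=n)
print(predict(x_new, W))
```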

38 Classification Experiments: Least Squares + k-means. Results: forward root mean squared error (mean ± standard deviation over 10 random splits of the data); the values of (k, n; T_L, T_U) are indicated for each data set. The table is taken from Xu's paper.

39 Conclusions. Two main contributions: 1. A unified framework based on the reverse least squares loss is proposed for several existing supervised and unsupervised algorithms; 2. Within the unified framework, a novel semi-supervised principle is proposed.


Download ppt "Optimal Reverse Prediction: Linli Xu, Martha White and Dale Schuurmans ICML 2009, Best Overall Paper Honorable Mention A Unified Perspective on Supervised,"

Similar presentations


Ads by Google