Optimal Reverse Prediction: A Unified Perspective on Supervised, Unsupervised and Semi-supervised Learning. Linli Xu, Martha White and Dale Schuurmans, ICML 2009 (Best Overall Paper Honorable Mention).

Presentation transcript:

Optimal Reverse Prediction: A Unified Perspective on Supervised, Unsupervised and Semi-supervised Learning. Linli Xu, Martha White and Dale Schuurmans, ICML 2009, Best Overall Paper Honorable Mention. Discussion led by Chunping Wang, ECE, Duke University, October 23, 2009.

Outline (1/31)
- Motivations
- Preliminary Foundations
- Reverse Supervised Least Squares
- Relationship between Unsupervised Least Squares and PCA, k-means, and Normalized Graph-cut
- Semi-supervised Least Squares
- Experiments
- Conclusions

Motivations (2/31)
- There is a lack of a foundational connection between supervised and unsupervised learning: supervised learning minimizes prediction error, while unsupervised learning re-represents the input data.
- For semi-supervised learning, one needs to consider both together.
- The semi-supervised learning literature relies on intuitions: the "cluster assumption" and the "manifold assumption".
- The unification demonstrated in this paper leads to a novel semi-supervised principle.

Preliminary Foundations: Forward Supervised Least Squares (3/31)
Data:
- an input matrix X and an output matrix Y
- t instances, n features, k responses
- regression: the outputs Y are real-valued
- classification: each row of Y is a class indicator
- assumption: X and Y are full rank
Problem:
- find the parameters W that minimize the least squares loss of a linear model mapping inputs to outputs.
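For concreteness, here is a hedged reconstruction of the forward problem in standard notation (the slide's own notation may differ):

```latex
% Forward least squares in standard notation (a reconstruction, not
% necessarily the slide's exact formulation).
\min_{W \in \mathbb{R}^{n \times k}} \|XW - Y\|_F^2 ,
\qquad X \in \mathbb{R}^{t \times n},\; Y \in \mathbb{R}^{t \times k},
\qquad W^{*} = (X^\top X)^{-1} X^\top Y .
```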

Preliminary Foundations (4/31)
Variants of the forward least squares problem: linear, ridge regularization, kernelization, and instance weighting.
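As a hedged sketch of what these variants typically look like (λ denotes a regularization weight and Λ a diagonal instance-weight matrix, symbols of my own choosing; the slide's formulas may use different notation):

```latex
% Common forms of the forward least squares variants (a sketch, not the
% slide's exact formulas).
\text{ridge:}\quad \min_{W} \|XW - Y\|_F^2 + \lambda \|W\|_F^2
\qquad
\text{kernelized:}\quad \min_{A} \|KA - Y\|_F^2, \;\; K = XX^\top,\; W = X^\top A
\qquad
\text{instance weighting:}\quad \min_{W} \operatorname{tr}\!\left((XW - Y)^\top \Lambda\, (XW - Y)\right)
```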

Preliminary Foundations (5/31)
- Principal Components Analysis: dimensionality reduction
- k-means: clustering
- Normalized Graph-cut: clustering
A weighted undirected graph is specified by its nodes, edges, and affinity matrix. The graph partition problem: find a partition that minimizes the total weight of the edges connecting nodes in distinct subsets.
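For reference, hedged textbook statements of the first two objectives (the slide's notation may differ; PCA is commonly applied to centered data):

```latex
% Standard objectives for PCA and k-means (textbook forms, for reference).
\text{PCA (rank-}k\text{):}\quad
  \min_{V \in \mathbb{R}^{n \times k},\; V^\top V = I} \|X - X V V^\top\|_F^2
\qquad
\text{k-means:}\quad
  \min_{\{A_c\},\, \{\mu_c\}} \sum_{c=1}^{k} \sum_{i \in A_c} \|x_i - \mu_c\|^2
```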

Preliminary Foundations: Normalized Graph-cut, clustering (6/31)
The formulation uses a partition indicator matrix Z, a weighted degree matrix, the total cut, and the normalized cut (its constraint and objective), following Xing & Jordan, 2003.
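A hedged reconstruction of the standard quantities, with K the affinity matrix, D = diag(K1) the weighted degree matrix, and Z a {0,1}-valued partition indicator with one 1 per row (my notation; conventions vary, and some definitions carry an extra factor of 1/2 in the total cut):

```latex
% Standard k-way cut quantities (a reconstruction; conventions vary).
\text{total cut:}\quad \operatorname{cut}(Z) = \sum_{c=1}^{k} z_c^\top (D - K)\, z_c
\qquad
\text{normalized cut:}\quad \operatorname{ncut}(Z)
  = \sum_{c=1}^{k} \frac{z_c^\top (D - K)\, z_c}{z_c^\top D\, z_c}
  = \operatorname{tr}\!\left((Z^\top D Z)^{-1} Z^\top (D - K)\, Z\right)
```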

First contribution (7/31): a diagram contrasting the literature with this paper. In the literature, the supervised methods (least squares regression, least squares classification) and the unsupervised methods (principal component analysis, k-means, normalized graph-cut) are treated separately; this paper unifies them.

Reverse Supervised Least Squares (8/31)
- Traditional forward least squares: predict the outputs from the inputs.
- Reverse least squares: predict the inputs from the outputs.
- Given a reverse solution U, the corresponding forward solution W can be recovered exactly.
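A hedged reconstruction of the two problems, together with one way to see the exact recovery from the closed-form solutions (assuming X and Y are full rank; the paper may state the recovery differently):

```latex
% Forward vs. reverse least squares and a recovery argument (a sketch).
\text{forward:}\quad \min_{W} \|XW - Y\|_F^2, \qquad W^{*} = (X^\top X)^{-1} X^\top Y
\\
\text{reverse:}\quad \min_{U} \|YU - X\|_F^2, \qquad U^{*} = (Y^\top Y)^{-1} Y^\top X
\\
\text{recovery:}\quad Y^\top X = (Y^\top Y)\, U^{*}
  \;\Rightarrow\; W^{*} = (X^\top X)^{-1} (U^{*})^\top (Y^\top Y)
```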

Reverse Supervised Least Squares (9/31)
The same variants apply in the reverse direction: ridge regularization, kernelization, and instance weighting. For each variant, there is a reverse problem and a rule for recovering the corresponding forward solution.

Reverse Supervised Least Squares (10/31)
For supervised learning with the least squares loss:
- the forward and reverse perspectives are equivalent;
- each solution can be recovered exactly from the other;
- however, the forward and reverse losses are not identical, since they are measured in different units, so it is not principled to combine them directly.

Unsupervised Least Squares (11/31)
Unsupervised learning: no training labels Y are given. Principle: optimize over guessed labels Z.
- Forward: for any W we can choose Z = XW and achieve zero loss, so it only yields trivial solutions; it does not work.
- Reverse: it yields non-trivial solutions.
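A hedged sketch of the two unsupervised objectives over guessed labels Z (the constraints imposed on Z differ by case in the following slides):

```latex
% Unsupervised least squares over guessed labels Z (a sketch).
\text{forward (degenerate):}\quad \min_{Z,\, W} \|XW - Z\|_F^2 = 0
  \quad\text{(choose } Z = XW \text{ for any } W\text{)}
\\
\text{reverse (non-trivial):}\quad \min_{Z,\, U} \|ZU - X\|_F^2
```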

Unsupervised Least Squares and PCA (12/31)
Proposition 1: Unconstrained reverse prediction is equivalent to principal components analysis. This connection was made in Jong & Kotz, 1999; the authors extend it to the kernelized case.
Corollary 1: Kernelized reverse prediction is equivalent to kernel principal components analysis.

Unsupervised Least Squares and PCA (13/31)
Proposition 1: Unconstrained reverse prediction is equivalent to principal components analysis.
Proof: recall the form of the unconstrained reverse problem, and note that the solution for Z is not unique.

Unsupervised Least Squares and PCA (14/31)
Proposition 1, proof continued: consider the SVD of Z and substitute it into the objective; the resulting solution is the PCA solution.
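As an illustrative numerical check of Proposition 1 (a sketch of my own, not the authors' code): the optimal value of the unconstrained reverse problem with a k-column Z should match the rank-k PCA reconstruction error, since ZU ranges over all matrices of rank at most k. Centering is skipped here for simplicity.

```python
import numpy as np

rng = np.random.default_rng(0)
t, n, k = 200, 10, 3
X = rng.standard_normal((t, n))

# Rank-k PCA reconstruction error (uncentered, for simplicity).
Usvd, s, Vt = np.linalg.svd(X, full_matrices=False)
X_pca = Usvd[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
pca_loss = np.linalg.norm(X - X_pca) ** 2

# Unconstrained reverse prediction: min_{Z, U} ||Z U - X||^2 with Z of width k,
# solved by alternating least squares from a random start.
Z = rng.standard_normal((t, k))
for _ in range(200):
    U = np.linalg.lstsq(Z, X, rcond=None)[0]          # best U given Z
    Z = np.linalg.lstsq(U.T, X.T, rcond=None)[0].T    # best Z given U
reverse_loss = np.linalg.norm(Z @ U - X) ** 2

print(pca_loss, reverse_loss)  # the two losses should agree closely
```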

Unsupervised Least Squares and k-means (15/31)
Proposition 2: Constrained reverse prediction is equivalent to k-means clustering. The connection between PCA and k-means clustering was made in Ding & He, 2004; here the authors show the connection of both to supervised (reverse) least squares.
Corollary 2: Constrained kernelized reverse prediction is equivalent to kernel k-means.

Unsupervised Least Squares and k-means (16/31)
Proposition 2, proof: rewrite the constrained problem in an equivalent form and consider the difference between the two objectives. The proof uses a diagonal matrix whose entries are the counts of the data in each class, and a matrix whose rows are the sums of the data in each class.

Unsupervised Least Squares and k-means (17/31)
Proposition 2, proof continued: the means encoding.

Unsupervised Least Squares and k-means (18/31)
Proposition 2, proof continued: therefore, constrained reverse prediction is equivalent to k-means clustering.
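An illustrative numerical check of the key identity behind this proof (a sketch of my own, not the authors' code): for a fixed class-indicator matrix Z, the optimal reverse map U = (Z^T Z)^{-1} Z^T X stacks the cluster means, and the reverse loss ||ZU - X||^2 equals the k-means within-cluster sum of squared distances for the same assignment.

```python
import numpy as np

rng = np.random.default_rng(1)
t, n, k = 300, 5, 4
X = rng.standard_normal((t, n))
labels = rng.integers(0, k, size=t)      # an arbitrary hard assignment
Z = np.eye(k)[labels]                    # t x k class-indicator matrix

# Optimal reverse map for this Z: its rows are the cluster means.
U = np.linalg.solve(Z.T @ Z, Z.T @ X)    # (Z^T Z)^{-1} Z^T X (assumes no empty class)
reverse_loss = np.linalg.norm(Z @ U - X) ** 2

# k-means objective for the same assignment: within-cluster sum of squares.
kmeans_loss = sum(
    np.linalg.norm(X[labels == c] - X[labels == c].mean(axis=0)) ** 2
    for c in range(k)
)

print(reverse_loss, kmeans_loss)         # the two values should match
```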

Unsupervised Least Squares and Normalized Cut (19/31)
Proposition 3: For a doubly nonnegative matrix K and a suitable instance weighting, weighted reverse prediction is equivalent to normalized graph-cut.
Proof: for any Z, substitute the solution to the inner minimization; this gives a reduced objective.

Unsupervised Least Squares and Normalized Cut (20/31)
Proposition 3, proof continued: recall the normalized cut (from Xing & Jordan, 2003). Since K is doubly nonnegative, it can serve as an affinity matrix, and the reduced objective is equivalent to normalized graph-cut.

Unsupervised Least Squares and Normalized Cut (21/31)
Corollary 3: for a specific choice of the affinity matrix and weighting, the weighted least squares problem is equivalent to normalized graph-cut. In other words, with a specific K we can relate normalized graph-cut to reverse least squares.

Second contribution (22/31): a diagram (taken from Xu's slides) showing that reverse prediction links supervised least squares learning to the unsupervised methods (principal component analysis, k-means, normalized graph-cut), and that this unification yields a new semi-supervised method.

Semi-supervised Least Squares (23/31)
A principled approach: reverse loss decomposition. A figure (taken from Xu's slides) illustrates the supervised and unsupervised reverse losses.

Semi-supervised Least Squares (24/31)
Proposition 4: for any X, Y, and U, the supervised loss decomposes into the unsupervised loss plus a squared distance. The unsupervised loss depends only on the input data X, while the squared distance depends on both X and Y.
Note: we cannot compute the true supervised loss, since we do not have all the labels Y. We may estimate it using only labeled data, or also using auxiliary unlabeled data.
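A hedged reconstruction of the decomposition (it follows from projecting X onto the row space of U, which makes the cross term vanish; the paper's notation and scaling, e.g. a 1/t factor, may differ):

```latex
% Reverse loss decomposition (a reconstruction; scaling conventions may differ).
\hat{Z} = \arg\min_{Z} \|ZU - X\|_F^2 = X U^\top (U U^\top)^{-1}
\\
\underbrace{\|YU - X\|_F^2}_{\text{supervised loss}}
  \;=\; \underbrace{\|\hat{Z}U - X\|_F^2}_{\text{unsupervised loss}}
  \;+\; \underbrace{\|YU - \hat{Z}U\|_F^2}_{\text{squared distance}}
```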

Semi-supervised Least Squares (25/31)
Corollary 4: for any U, the supervised loss estimate decomposes into an unsupervised loss estimate plus a squared distance estimate. Labeled data are scarce, but plenty of unlabeled data are available; the variance of the supervised loss estimate is strictly reduced by using the unlabeled data to form a better unbiased estimate of the unsupervised loss term.

Semi-supervised Least Squares (26/31)
A naive approach: combine the loss on the labeled data with the loss on the unlabeled data.
Advantages:
- The authors combine supervised and unsupervised reverse losses, whereas previous approaches combine an unsupervised (reverse) loss with a supervised (forward) loss, which are not in the same units.
- Compared to the principled approach, it admits a more straightforward optimization procedure (alternating between U and Z).
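A hedged sketch of such a combined objective (X_L, Y_L denote the labeled data, X_U the unlabeled data, T_L and T_U their sizes, and μ a trade-off weight; all symbols are my own choices):

```latex
% Naive semi-supervised reverse objective (a sketch; notation is assumed).
\min_{U,\, Z}\;
  \frac{1}{T_L}\,\|Y_L U - X_L\|_F^2
  \;+\; \frac{\mu}{T_U}\,\|Z U - X_U\|_F^2
```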

Regression Experiments: Least Squares + PCA (27/31)
Basic formulation: the two terms are not jointly convex, so there is no closed-form solution. Learning method: alternating optimization, with an initial U obtained from the supervised setting. The forward solution is then recovered; at test time, given a new x, predictions are made with the recovered forward model. The formulation can be kernelized.
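A minimal sketch of such an alternating scheme for the combined objective sketched above (my own illustration, not the authors' implementation; function and variable names are assumptions): alternately solve for the unlabeled guesses Z given U, then for U given all the data, starting from the supervised solution.

```python
import numpy as np

def semi_supervised_ls_pca(XL, YL, XU, mu=1.0, iters=50):
    """Alternating minimization of an assumed combined reverse objective:
    (1/TL) ||YL U - XL||^2 + (mu/TU) ||Z U - XU||^2, over U and Z."""
    TL, TU = len(XL), len(XU)
    # Initialize U from the purely supervised reverse problem min_U ||YL U - XL||^2.
    U = np.linalg.lstsq(YL, XL, rcond=None)[0]
    for _ in range(iters):
        # Given U, the best unconstrained Z projects XU onto the row space of U.
        Z = XU @ np.linalg.pinv(U)
        # Given Z, U solves a stacked, reweighted least squares problem.
        A = np.vstack([YL / np.sqrt(TL), np.sqrt(mu / TU) * Z])
        B = np.vstack([XL / np.sqrt(TL), np.sqrt(mu / TU) * XU])
        U = np.linalg.lstsq(A, B, rcond=None)[0]
    # One simple way to obtain a forward predictor: regress the given and
    # guessed outputs on the corresponding inputs.
    W = np.linalg.lstsq(np.vstack([XL, XU]), np.vstack([YL, Z]), rcond=None)[0]
    return U, Z, W
```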

Regression Experiments: Least Squares + PCA (28/31)
A table (taken from Xu's paper) reports the forward root mean squared error (mean ± standard deviation over 10 random splits of the data); the values of (k, n; T_L, T_U) are indicated for each data set.

Classification Experiments: Least Squares + k-means and Least Squares + Norm-cut (29/31)
The forward solution is recovered from the learned reverse model; at test time, given a new x, the class with the maximum response is predicted.
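A small hedged sketch of this test-time rule (assuming a recovered forward weight matrix W, with a function name of my own choosing): compute the responses x W for a new input and predict the class with the largest response.

```python
import numpy as np

def predict_class(W, X_test):
    """Predict the class with the maximum forward response X_test @ W (a sketch)."""
    responses = X_test @ W              # shape: (num_test, k)
    return np.argmax(responses, axis=1)
```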

Classification Experiments: Least Squares + k-means (30/31)
A table (taken from Xu's paper) reports the forward root mean squared error (mean ± standard deviation over 10 random splits of the data); the values of (k, n; T_L, T_U) are indicated for each data set.

Conclusions (31/31)
Two main contributions:
1. A unified framework based on the reverse least squares loss is proposed for several existing supervised and unsupervised algorithms.
2. Within the unified framework, a novel semi-supervised principle is proposed.