Fast Max–Margin Matrix Factorization with Data Augmentation
Minjie Xu, Jun Zhu & Bo Zhang, Tsinghua University

Matrix Factorization and M³F (I)
Setting: fit a partially observed matrix with a low-rank factorization, subject to certain constraints.
Examples (standard objectives sketched below):
- Singular Value Decomposition (SVD): when the matrix is fully observed, approximate it with the leading components (hence minimizing the squared loss).
- Probabilistic Matrix Factorization (PMF): assume Gaussian priors on the factors and a Gaussian likelihood (equivalent to squared-loss minimization with a Frobenius-norm regularizer).
- Max-Margin Matrix Factorization (M³F): hinge-loss minimization over the observed entries, with a nuclear-norm regularizer on the reconstruction (I) or, equivalently, Frobenius-norm regularizers on the two factors (II).
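The equations for these examples were lost in the transcript; the block below is a hedged reconstruction in standard notation (Y is the partially observed matrix, U and V the factors, Ω the set of observed entries; all symbols are my own choice, not taken from the slides).

```latex
% Standard objectives (my notation; binary labels Y_{ij} in {±1} for the M^3F case):
\begin{align*}
\text{SVD (fully observed):}\quad & \min_{U,V}\ \|Y - UV^\top\|_F^2
  \quad\text{(keep the leading $k$ components)}\\
\text{PMF:}\quad & \min_{U,V}\ \sum_{(i,j)\in\Omega} \big(Y_{ij} - U_i V_j^\top\big)^2
  + \tfrac{\lambda}{2}\big(\|U\|_F^2 + \|V\|_F^2\big)\\
\text{M}^3\text{F (I):}\quad & \min_{X}\ \sum_{(i,j)\in\Omega} \max\big(0,\,1 - Y_{ij}X_{ij}\big)
  + \lambda\,\|X\|_*\\
\text{M}^3\text{F (II):}\quad & \min_{U,V}\ \sum_{(i,j)\in\Omega} \max\big(0,\,1 - Y_{ij}\,U_i V_j^\top\big)
  + \tfrac{\lambda}{2}\big(\|U\|_F^2 + \|V\|_F^2\big)
\end{align*}
```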

Matrix Factorization and M³F (II)
Benefits of M³F:
- max-margin approach, more applicable to binary, ordinal or categorical data (e.g. ratings)
- the nuclear-norm regularizer (I) allows a flexible latent dimensionality
Limitations:
- scalability vs. flexibility: SDP solvers for (I) scale poorly, while the more scalable (II) requires a pre-specified, fixed, finite latent dimensionality
- efficiency vs. approximation: gradient-descent solvers for (II) require a smooth hinge, while bilinear SVM solvers can be time-consuming
Motivation: to build an M³F model that is scalable and flexible, and admits highly efficient solvers.

Roadmap: M³F → Gibbs M³F → Gibbs iPM³F (via data augmentation)

RRM as MAP, A New Look (I)
Setting: fit training data (features and labels) with a discriminative model.
- Regularized Risk Minimization (RRM): minimize a regularizer plus the empirical risk, i.e. a loss function applied to the discriminant function.
- Maximum a Posteriori (MAP): maximize the posterior, i.e. the product of a prior and a likelihood.
The correspondence is sketched below.
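A minimal formal sketch of the correspondence this slide draws, in my own notation (data D = {(x_n, y_n)}, model θ, discriminant f, loss ℓ, regularizer Ω):

```latex
\begin{align*}
\text{RRM:}\quad & \hat\theta = \arg\min_\theta\ \Omega(\theta) + \sum_n \ell\big(y_n, f(x_n;\theta)\big)\\
\text{MAP:}\quad & \hat\theta = \arg\max_\theta\ p_0(\theta)\prod_n p(y_n \mid x_n, \theta)\\
\text{correspondence:}\quad & p_0(\theta) \propto e^{-\Omega(\theta)},\qquad
  p(y_n \mid x_n, \theta) \propto e^{-\ell(y_n,\,f(x_n;\theta))}
\end{align*}
% The exponentiated loss need not normalize in y for every theta,
% which is exactly what the "delegate" pair on the next slide relaxes.
```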

RRM as MAP, A New Look (II)
Bridge RRM and MAP via a delegate prior and likelihood:
- jointly intact: the delegate pair induces exactly the same joint distribution as the genuine pair (and thus the same posterior)
- singly relaxed: each delegate factor is freed from the normalization constraint (and thus is no longer a probability density)
The transition from the genuine pair to the delegate pair is sketched below.
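A hedged formalization of "jointly intact, singly relaxed" (tildes mark the delegate pair; notation mine):

```latex
\begin{align*}
\text{jointly intact:}\quad & \tilde p_0(\theta)\,\tilde p(\mathcal D \mid \theta)
  \;\propto\; p_0(\theta)\,p(\mathcal D \mid \theta)
  \quad\Longrightarrow\quad \text{same posterior } p(\theta \mid \mathcal D)\\
\text{singly relaxed:}\quad & \tilde p_0 \text{ and } \tilde p \text{ need not integrate to one on their own}
\end{align*}
```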

Delegate prior & likelihood
Consider the simplest case: the delegate pair can be completely different from the genuine pair when viewed as functions of the model (figure comparing the two pairs omitted; both are scaled, up to a constant, for better visualization).

M³F as MAP: the full model
We consider M³F for ordinal ratings.
- Risk: introduce thresholds and sum the binary M³F losses over them (see the sketch below).
- Regularizer: as in the factored formulation (II).
- MAP: the corresponding delegate prior and likelihood, with hyper-parameters.
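The lost risk formula is presumably the standard all-threshold ordinal hinge construction (as in fast MMMF); a hedged sketch in my own notation, with L rating levels and per-user thresholds θ_{i1} ≤ … ≤ θ_{i,L-1}:

```latex
\[
\mathcal R(U,V,\theta) \;=\; \sum_{(i,j)\in\Omega}\ \sum_{r=1}^{L-1}
  \max\!\Big(0,\ 1 - T^{r}_{ij}\big(\theta_{ir} - U_i V_j^\top\big)\Big),
\qquad
T^{r}_{ij} \;=\; \begin{cases} +1 & r \ge Y_{ij}\\ -1 & r < Y_{ij} \end{cases}
\]
```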

Data Augmentation for M³F (I)
Data augmentation in general: introduce auxiliary variables to facilitate Bayesian inference on the original variables of interest.
- inject independence: e.g. the EM algorithm (joint); stick-breaking constructions (conditional)
- exchange for a much simpler conditional representation: e.g. slice sampling; the data augmentation strategies for logistic models and for SVMs
Lemma (location-scale mixture of Gaussians): each hinge factor can be written as an integral of a Gaussian density over an auxiliary scale variable, as sketched below.
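The lemma referenced here is presumably the scale-mixture identity used for SVM-style data augmentation (Polson & Scott); in my notation, for any real x and constant c > 0, with auxiliary variable λ > 0:

```latex
\[
e^{-2c\,\max(0,\,1-x)}
  \;=\; \int_0^\infty \frac{1}{\sqrt{2\pi\lambda}}
  \exp\!\Big(-\frac{\big(\lambda + c(1-x)\big)^2}{2\lambda}\Big)\, d\lambda
\]
% Conditioned on lambda, each hinge factor is therefore Gaussian in x.
```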

Data Augmentation for M³F (II)
Benefit of the augmented representation:
- the model parameters have a Gaussian conditional, "conjugate" to the Gaussian "prior"
- each auxiliary variable has a generalized inverse Gaussian conditional, and its reciprocal an inverse Gaussian one (see below)
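Concretely (hedged, my notation, for a single hinge factor with margin x and constant c as in the lemma above):

```latex
\begin{align*}
p(x \mid \lambda) &\;\propto\; \exp\!\Big(-\frac{\big(\lambda + c(1-x)\big)^2}{2\lambda}\Big)
  && \text{Gaussian in } x,\ \text{conjugate to a Gaussian prior}\\
p(\lambda \mid x) &\;=\; \mathcal{GIG}\big(\lambda;\ \tfrac12,\ 1,\ c^2(1-x)^2\big)
  && \text{generalized inverse Gaussian}\\
\lambda^{-1} \mid x &\;\sim\; \mathcal{IG}\big(\,|c(1-x)|^{-1},\ 1\,\big)
  && \text{inverse Gaussian, easy to sample}
\end{align*}
```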

Data Augmentation for M³F (III)
M³F before augmentation: the unnormalized posterior is the delegate prior times a product of exponentiated hinge losses over the observed entries.
M³F after augmentation (one auxiliary variable per hinge factor): each hinge term is replaced by the corresponding Gaussian-mixture integrand, as sketched below.
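A hedged sketch of the two forms for the binary case (my notation; the ordinal model carries an extra sum over thresholds):

```latex
% x_{ij} = Y_{ij} U_i V_j^T is the margin of observed entry (i,j), with Y_{ij} in {±1}.
\begin{align*}
\text{before:}\quad & p(U,V \mid Y) \;\propto\;
  e^{-\frac{1}{2\sigma^2}(\|U\|_F^2 + \|V\|_F^2)}
  \prod_{(i,j)\in\Omega} e^{-2c\,\max(0,\,1-x_{ij})}\\
\text{after:}\quad & p(U,V,\lambda \mid Y) \;\propto\;
  e^{-\frac{1}{2\sigma^2}(\|U\|_F^2 + \|V\|_F^2)}
  \prod_{(i,j)\in\Omega} \frac{1}{\sqrt{2\pi\lambda_{ij}}}
  \exp\!\Big(-\frac{\big(\lambda_{ij} + c(1-x_{ij})\big)^2}{2\lambda_{ij}}\Big)
\end{align*}
% Marginalizing out lambda recovers the "before" form by the lemma.
```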

Data Augmentation for M³F (IV)
Posterior inference via Gibbs sampling:
- draw each user factor from its Gaussian conditional
- draw each item factor likewise
- draw each auxiliary variable from its (generalized) inverse Gaussian conditional
For details, please refer to our paper; a minimal sketch for the binary case follows.
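A minimal, self-contained sketch of such a Gibbs sampler for the binary case under the pseudo-posterior sketched above (function and variable names, and the constant c, are my own; this is not the authors' code, and the paper's ordinal sampler with thresholds is more involved):

```python
# Gibbs sampler sketch for binary max-margin matrix factorization with the
# SVM-style data augmentation (assumed model; my notation throughout).
import numpy as np

def gibbs_m3f_binary(Y_obs, N, M, K=10, c=1.0, sigma2=1.0, iters=100, seed=0):
    """Y_obs: list of (i, j, y) triples with y in {-1, +1}."""
    rng = np.random.default_rng(seed)
    U = 0.1 * rng.standard_normal((N, K))
    V = 0.1 * rng.standard_normal((M, K))
    lam = np.ones(len(Y_obs))
    by_row = [[] for _ in range(N)]   # observation indices per user
    by_col = [[] for _ in range(M)]   # observation indices per item
    for t, (i, j, _) in enumerate(Y_obs):
        by_row[i].append(t)
        by_col[j].append(t)

    for _ in range(iters):
        # 1) auxiliary variables: 1/lambda | rest is inverse Gaussian (numpy's Wald)
        for t, (i, j, y) in enumerate(Y_obs):
            zeta = 1.0 - y * U[i] @ V[j]
            lam[t] = 1.0 / rng.wald(1.0 / max(c * abs(zeta), 1e-8), 1.0)
        # 2) user factors: Gaussian conditional from the augmented (quadratic) terms
        for i in range(N):
            prec, lin = np.eye(K) / sigma2, np.zeros(K)
            for t in by_row[i]:
                _, j, y = Y_obs[t]
                prec += (c * c / lam[t]) * np.outer(V[j], V[j])
                lin += c * y * (1.0 + c / lam[t]) * V[j]
            cov = np.linalg.inv(prec)
            U[i] = rng.multivariate_normal(cov @ lin, cov)
        # 3) item factors, likewise
        for j in range(M):
            prec, lin = np.eye(K) / sigma2, np.zeros(K)
            for t in by_col[j]:
                i, _, y = Y_obs[t]
                prec += (c * c / lam[t]) * np.outer(U[i], U[i])
                lin += c * y * (1.0 + c / lam[t]) * U[i]
            cov = np.linalg.inv(prec)
            V[j] = rng.multivariate_normal(cov @ lin, cov)
    return U, V
```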

Nonparametric M³F (I)
We want to automatically infer the latent dimensionality from data in an elegant way (rather than via, e.g., cross-validation).
The Indian buffet process (IBP) induces a distribution on binary matrices with an unbounded number of columns and follows a culinary metaphor. Behavioral pattern of the i-th customer:
- sample each previously sampled dish according to its popularity
- then sample a number of new dishes
A sampling sketch follows this slide.
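A short sketch of drawing a binary feature matrix from the IBP prior via the culinary metaphor (alpha is the IBP concentration parameter; the helper name is my own):

```python
# Sample Z ~ IBP(alpha): customer i takes existing dish k with probability
# m_k / i (m_k = how many earlier customers took it), then Poisson(alpha / i)
# brand-new dishes.
import numpy as np

def sample_ibp(num_customers, alpha, seed=0):
    rng = np.random.default_rng(seed)
    counts = []                       # m_k for each dish sampled so far
    rows = []
    for i in range(1, num_customers + 1):
        row = [int(rng.random() < m / i) for m in counts]
        counts = [m + z for m, z in zip(counts, row)]
        new = rng.poisson(alpha / i)  # number of new dishes for this customer
        row += [1] * new
        counts += [1] * new
        rows.append(row)
    Z = np.zeros((num_customers, len(counts)), dtype=int)
    for i, row in enumerate(rows):
        Z[i, :len(row)] = row
    return Z

Z = sample_ibp(num_customers=10, alpha=2.0)
print(Z.shape, Z.sum(axis=0))         # column sums reflect dish popularity
```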

Nonparametric M³F (II)
The IBP enjoys several nice properties:
- favors sparse matrices
- finitely many non-empty columns for finitely many customers (with probability one)
- exchangeability, so Gibbs sampling would be easy
We replace the real-valued factor with a binary feature matrix drawn from the IBP and change the delegate prior accordingly, with hyper-parameters.

Nonparametric M³F (III)
Inference via Gibbs sampling:
- draw the binary feature matrix, entry by entry, exploiting exchangeability
- draw the real-valued factor from its Gaussian conditional
- draw the auxiliary variables from their (generalized) inverse Gaussian conditionals
- draw the thresholds and the remaining variables

Experiments and Discussions
Datasets: MovieLens 1M & EachMovie
Test error (NMAE) and training time: see the tables in the paper.

Experiments and Discussions
Convergence: single samples vs. averaged samples (plots of the RRM objective and the validation error (NMAE)).

Experiments and Discussions
Latent dimensionality: see the plot in the paper.

Thanks!