Fast Maximum Margin Matrix Factorization for Collaborative Prediction. Jason Rennie (MIT), Nati Srebro (Univ. of Toronto).

Similar presentations
Nonnegative Matrix Factorization with Sparseness Constraints S. Race MA591R.

Learning User Preferences Jason Rennie MIT CSAIL Advisor: Tommi Jaakkola.
Regularized risk minimization
Regularization David Kauchak CS 451 – Fall 2013.
Projection-free Online Learning
Optimization Tutorial
CMPUT 466/551 Principal Source: CMU
1/11 Tea Talk: Weighted Low Rank Approximations Ben Marlin Machine Learning Group Department of Computer Science University of Toronto April 30, 2003.
Continuous optimization Problems and successes
Robust Multi-Kernel Classification of Uncertain and Imbalanced Data
Cos 429: Face Detection (Part 2) Viola-Jones and AdaBoost Guest Instructor: Andras Ferencz (Your Regular Instructor: Fei-Fei Li) Thanks to Fei-Fei Li,
Classification and Prediction: Regression Via Gradient Descent Optimization Bamshad Mobasher DePaul University.
Jun Zhu Dept. of Comp. Sci. & Tech., Tsinghua University This work was done when I was a visiting researcher at CMU. Joint.
2. Introduction Multiple Multiplicative Factor Model For Collaborative Filtering Benjamin Marlin University of Toronto. Department of Computer Science.
Lecture: Dudu Yanay.  Input: Each instance is associated with a rank or a rating, i.e. an integer from ‘1’ to ‘K’.  Goal: To find a rank-prediction.
Direct Convex Relaxations of Sparse SVM Antoni B. Chan, Nuno Vasconcelos, and Gert R. G. Lanckriet The 24th International Conference on Machine Learning.
Modeling User Rating Profiles For Collaborative Filtering
Unconstrained Optimization Problem
Collaborative Ordinal Regression Shipeng Yu Joint work with Kai Yu, Volker Tresp and Hans-Peter Kriegel University of Munich, Germany Siemens Corporate.
Review Rong Jin. Comparison of Different Classification Models  The goal of all classifiers Predicating class label y for an input x Estimate p(y|x)
Group Norm for Learning Latent Structural SVMs Overview Daozheng Chen (UMD, College Park), Dhruv Batra (TTI Chicago), Bill Freeman (MIT), Micah K. Johnson.
Online Learning Algorithms
An Introduction to Support Vector Machines Martin Law.
STRUCTURED PERCEPTRON Alice Lai and Shi Zhi. Presentation Outline Introduction to Structured Perceptron ILP-CRF Model Averaged Perceptron Latent Variable.
Collaborative Filtering Matrix Factorization Approach
Cao et al. ICML 2010 Presented by Danushka Bollegala.
Introduction to Machine Learning for Information Retrieval Xiaolong Wang.
Introduction to variable selection I Qi Yu. 2 Problems due to poor variable selection: Input dimension is too large; the curse of dimensionality problem.
Yan Yan, Mingkui Tan, Ivor W. Tsang, Yi Yang,
Fast Max–Margin Matrix Factorization with Data Augmentation Minjie Xu, Jun Zhu & Bo Zhang Tsinghua University.
Non Negative Matrix Factorization
EMIS 8381 – Spring Netflix and Your Next Movie Night Nonlinear Programming Ron Andrews EMIS 8381.
Center for Evolutionary Functional Genomics Large-Scale Sparse Logistic Regression Jieping Ye Arizona State University Joint work with Jun Liu and Jianhui.
An Introduction to Support Vector Machines (M. Law)
Ensemble Learning Spring 2009 Ben-Gurion University of the Negev.
Combining Audio Content and Social Context for Semantic Music Discovery José Carlos Delgado Ramos Universidad Católica San Pablo.
Biointelligence Laboratory, Seoul National University
Linear Models for Classification
CoFi Rank : Maximum Margin Matrix Factorization for Collaborative Ranking Markus Weimer, Alexandros Karatzoglou, Quoc Viet Le and Alex Smola NIPS’07.
Pairwise Preference Regression for Cold-start Recommendation Speaker: Yuanshuai Sun
CpSc 881: Machine Learning
Yue Xu Shu Zhang.  A person has already rated some movies, which movies he/she may be interested, too?  If we have huge data of user and movies, this.
A global approach Finding correspondence between a pair of epipolar lines for all pixels simultaneously Local method: no guarantee we will have one to.
ICONIP 2010, Sydney, Australia 1 An Enhanced Semi-supervised Recommendation Model Based on Green’s Function Dingyan Wang and Irwin King Dept. of Computer.
A Kernel Approach for Learning From Almost Orthogonal Pattern * CIS 525 Class Presentation Professor: Slobodan Vucetic Presenter: Yilian Qin * B. Scholkopf.
Matrix Factorization & Singular Value Decomposition Bamshad Mobasher DePaul University Bamshad Mobasher DePaul University.
Collaborative filtering applied to real-time bidding.
Strong Supervision From Weak Annotation Interactive Training of Deformable Part Models ICCV /05/23.
Learning by Loss Minimization. Machine learning: Learn a Function from Examples Function: Examples: – Supervised: – Unsupervised: – Semisuprvised:
Regularized Least-Squares and Convex Optimization.
Page 1 CS 546 Machine Learning in NLP Review 2: Loss minimization, SVM and Logistic Regression Dan Roth Department of Computer Science University of Illinois.
BSP: An iterated local search heuristic for the hyperplane with minimum number of misclassifications Usman Roshan.
1 Dongheng Sun 04/26/2011 Learning with Matrix Factorizations By Nathan Srebro.
Non-separable SVM's, and non-linear classification using kernels Jakob Verbeek December 16, 2011 Course website:
Combining Models Foundations of Algorithms and Machine Learning (CS60020), IIT KGP, 2017: Indrajit Bhattacharya.
Machine Learning & Deep Learning
Learning Recommender Systems with Adaptive Regularization
Empirical risk minimization
Optimization in Machine Learning
Group Norm for Learning Latent Structural SVMs
An Introduction to Support Vector Machines
Machine Learning Today: Reading: Maria Florina Balcan
Advanced Artificial Intelligence
Collaborative Filtering Matrix Factorization Approach
CSCI B609: “Foundations of Data Science”
L5 Optimal Design concepts pt A
Kai-Wei Chang University of Virginia
The European Conference on e-learing ,2017/10
Empirical risk minimization
Presentation transcript:

Fast Maximum Margin Matrix Factorization for Collaborative Prediction Jason Rennie MIT Nati Srebro Univ. of Toronto

Collaborative Prediction. Based on a partially observed users x movies matrix, predict the unobserved entries: "Will user i like movie j?"

Problems to Address
– Underlying representation of preferences: norm-constrained matrix factorization (MMMF)
– Discrete, ordered labels: threshold-based ordinal regression
– Scaling up MMMF to large problems: factorized objective, gradient descent
– Ratings may not be missing at random

Linear Factor Model. (Figure: a user's ratings across movies are a linear combination of movie feature values, e.g. "comic value", "dramatic value", "violence", weighted by that user's preference weights: Feature Values x User Preference Weights = Ratings.)
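As a minimal numpy sketch of the linear factor model idea (the feature values and weights below are made-up numbers, not from the talk):

```python
import numpy as np

# Hypothetical feature values for 5 movies over 3 latent attributes
# (columns: comic value, dramatic value, violence); numbers are illustrative only.
V = np.array([[0.9, 0.1, 0.2],
              [0.2, 0.8, 0.1],
              [0.1, 0.3, 0.9],
              [0.7, 0.6, 0.0],
              [0.0, 0.2, 0.8]])

# One user's preference weights over the same attributes.
w = np.array([1.5, -0.5, 2.0])

# Preference scores: one linear combination of the attributes per movie.
scores = V @ w
print(scores)
```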

Ordinal Regression. (Figure: movie feature vectors v1, ..., v10 and a single user's preference weight vector w1 give preference scores, which are thresholded into discrete ratings.)

Matrix Factorization. (Figure: movie feature vectors v1, ..., v10 and user preference weight vectors w1, ..., w15 give a full matrix of preference scores, which are mapped to ratings.)

Matrix Factorization. (Figure: the preference-score matrix X = U V' approximates the rating matrix Y; U holds user preference weights, V holds movie feature values.)
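A minimal sketch of that factorization in numpy (sizes and random factors are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_movies, k = 100, 50, 5      # illustrative sizes only

U = rng.normal(size=(n_users, k))      # user preference weights
V = rng.normal(size=(n_movies, k))     # movie feature values

X = U @ V.T                            # preference-score matrix, rank at most k
print(X.shape, np.linalg.matrix_rank(X))
```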

Ordinal Regression. (Figure, repeated from the earlier slide: feature vectors and preference weights give preference scores, which are thresholded into ratings.)

Max-Margin Ordinal Regression [Shashua & Levin, NIPS 2002]

Absolute Difference. Shashua & Levin's loss bounds the misclassification error; for ordinal regression we instead want to minimize the absolute difference between predicted and true labels.

All-Thresholds Loss [Srebro et al., NIPS 2004] [Chu & Keerthi, ICML 2005]
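The slide only names the loss; as a hedged sketch of the all-thresholds construction from these papers: given the true rating, the score pays a hinge penalty at every threshold it falls on the wrong side of. The threshold values below are arbitrary.

```python
import numpy as np

def hinge(z):
    return np.maximum(0.0, 1.0 - z)

def all_thresholds_loss(x, y, thetas):
    """All-thresholds loss for one preference score x and true rating y in 1..R.

    thetas: R-1 increasing thresholds. For thresholds below the true rating we
    want x above them; for thresholds at or above it we want x below them.
    Every violated threshold contributes a hinge penalty, so the loss grows
    with the ordinal distance between the prediction and the true rating.
    """
    r = np.arange(1, len(thetas) + 1)
    signs = np.where(r >= y, 1.0, -1.0)
    return hinge(signs * (np.asarray(thetas) - x)).sum()

# Scores further from the correct band for y = 4 pay a larger penalty.
thetas = [-1.5, -0.5, 0.5, 1.5]
print(all_thresholds_loss(1.0, 4, thetas), all_thresholds_loss(-1.0, 4, thetas))
```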

All-Thresholds Loss. Experiments comparing:
– Least squares regression
– Multi-class classification
– Shashua & Levin's Max-Margin OR
– All-Thresholds OR
All-Thresholds Ordinal Regression achieved the lowest misclassification error and the lowest absolute difference error. [Rennie & Srebro, IJCAI Wkshp 2005]

Learning Weights & Features. (Figure: both the movie feature vectors v1, ..., v10 and the user preference weights w1, ..., w15 are now learned jointly from the observed ratings.)

Low Rank Matrix Factorization. X = U V' (rank k) approximates Y.
– Sum-squared loss, fully observed Y: use the SVD to find the global optimum.
– Classification-error loss, partially observed Y: non-convex, no explicit solution.
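For the fully observed, sum-squared-loss case, a quick numpy illustration of the truncated SVD giving the optimal rank-k approximation (random data, illustrative sizes):

```python
import numpy as np

rng = np.random.default_rng(0)
Y = rng.normal(size=(30, 20))   # pretend this is a fully observed rating matrix
k = 5

# Truncated SVD gives the globally optimal rank-k approximation under sum-squared loss.
U, s, Vt = np.linalg.svd(Y, full_matrices=False)
X = (U[:, :k] * s[:k]) @ Vt[:k, :]

print(np.linalg.matrix_rank(X), ((Y - X) ** 2).sum())
```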

V’ UX low norm [Fazel et al., 2001] Norm Constrained Factorization ||X|| tr = min U,V (||U|| Fro 2 + ||V|| Fro 2 )/2 ||U|| Fro 2 = ∑ i,j U ij 2

MMMF Objective [Srebro et al., NIPS 2004].
Original objective (low-norm X): min over X of ||X||_tr + c * loss(X, Y), with the all-thresholds loss.
Factorized objective (X = U V'): min over U, V of (||U||_Fro^2 + ||V||_Fro^2)/2 + c * loss(U V', Y), with the all-thresholds loss.

Smooth Hinge. (Figure: function value and gradient of the hinge vs. the smooth hinge.)
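A sketch of the piecewise smooth hinge used in the paper (zero beyond 1, linear below 0, quadratic in between) and its derivative:

```python
import numpy as np

def smooth_hinge(z):
    """Smooth hinge: 1/2 - z for z <= 0, (1 - z)^2 / 2 for 0 < z < 1, 0 for z >= 1."""
    z = np.asarray(z, dtype=float)
    return np.where(z >= 1, 0.0, np.where(z <= 0, 0.5 - z, 0.5 * (1.0 - z) ** 2))

def smooth_hinge_grad(z):
    """Derivative of the smooth hinge; continuous everywhere, unlike the hinge's."""
    z = np.asarray(z, dtype=float)
    return np.where(z >= 1, 0.0, np.where(z <= 0, -1.0, -(1.0 - z)))

print(smooth_hinge([-1.0, 0.5, 2.0]))       # [1.5   0.125 0.   ]
print(smooth_hinge_grad([-1.0, 0.5, 2.0]))  # [-1.  -0.5  0. ]
```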

Collaborative Prediction Results. Dataset size and sparsity: EachMovie 36656 x 1648 (96% sparse), MovieLens 6040 x 3952 (96% sparse). Weak and strong generalization error compared for URP, Attitude, and MMMF on each dataset; URP and Attitude results are from [Marlin, 2004].

Local Minima? Factorized objective: min over U, V of (||U||_Fro^2 + ||V||_Fro^2)/2 + c * loss(U V', Y). Compare optimizing X with an SDP against optimizing U, V with gradient descent.
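A hedged sketch of what optimizing U and V by plain gradient descent on the factorized objective could look like, reusing the smooth hinge from the previous sketch (shared thresholds held fixed for simplicity; the paper itself uses per-user thresholds and conjugate gradients, and all sizes and constants below are made up):

```python
import numpy as np

def smooth_hinge(z):
    return np.where(z >= 1, 0.0, np.where(z <= 0, 0.5 - z, 0.5 * (1 - z) ** 2))

def smooth_hinge_grad(z):
    return np.where(z >= 1, 0.0, np.where(z <= 0, -1.0, -(1 - z)))

def objective_and_grads(U, V, Y, thetas, c):
    """Factorized MMMF objective and its gradients with respect to U and V.

    Y holds ratings 1..R on observed entries and 0 on unobserved ones;
    thetas are shared thresholds theta_1 <= ... <= theta_{R-1}, held fixed here.
    """
    X = U @ V.T
    rows, cols = np.nonzero(Y)
    y_obs = Y[rows, cols]
    x_obs = X[rows, cols]
    obj = 0.5 * ((U ** 2).sum() + (V ** 2).sum())
    grad_X = np.zeros_like(X)
    for r, theta in enumerate(thetas, start=1):
        signs = np.where(r >= y_obs, 1.0, -1.0)  # side of threshold r the score should be on
        z = signs * (theta - x_obs)
        obj += c * smooth_hinge(z).sum()
        grad_X[rows, cols] += c * smooth_hinge_grad(z) * (-signs)
    return obj, U + grad_X @ V, V + grad_X.T @ U

# Tiny synthetic problem: 20 users, 15 movies, ratings 1..5, roughly half observed.
rng = np.random.default_rng(0)
Y = rng.integers(1, 6, size=(20, 15)) * (rng.random((20, 15)) < 0.5)
thetas = np.array([-1.5, -0.5, 0.5, 1.5])
U = 0.1 * rng.normal(size=(20, 4))
V = 0.1 * rng.normal(size=(15, 4))

for _ in range(200):
    obj, grad_U, grad_V = objective_and_grads(U, V, Y, thetas, c=1.0)
    U -= 0.005 * grad_U
    V -= 0.005 * grad_V
print("final objective:", obj)
```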

Local Minima? (Figure: matrix difference and objective difference between the SDP solution and the gradient-descent solution. Data: 100 x 100 MovieLens, 65% sparse.)

Summary. We scaled MMMF to large problems by optimizing the factorized objective. Empirical tests indicate that local minima issues are rare or absent. Results on large-scale data show substantial improvements over the state of the art. D'Aspremont & Srebro: large-scale SDP optimization methods that train on 1.5 million binary labels in 20 hours.