Collaborative Filtering Matrix Factorization Approach

Collaborative filtering algorithms
Common types:
- Global effects
- Nearest neighbor
- Matrix factorization
- Restricted Boltzmann machine
- Clustering
- Etc.

Optimization
Optimization is an important part of many machine learning methods. The thing we're usually optimizing is the loss function for the model. For a given set of training data X and outcomes y, we want to find the model parameters w that minimize the total loss over all X, y.

Loss function
Suppose target outcomes come from a set Y:
- Binary classification: Y = { 0, 1 }
- Regression: Y = ℝ (real numbers)
A loss function L( ŷ, y ) maps decisions to costs: it defines the penalty for predicting ŷ when the true value is y.
- Standard choice for classification: 0/1 loss (same as misclassification error)
- Standard choice for regression: squared loss

Least squares linear fit to data
Calculate the sum of squared loss (SSL) and determine w:
SSL = Σi ( yi - w · xi )²
w = ( XᵀX )⁻¹ Xᵀ y
Can prove that this method of determining w minimizes SSL.
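As a minimal illustration (not from the original slides; the data and variable names are made up), this fit can be computed with NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                  # training data, one row per sample
y = X @ np.array([1.5, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

# Solve the least squares problem: minimize SSL = sum_i (y_i - w.x_i)^2
w, *_ = np.linalg.lstsq(X, y, rcond=None)      # equivalent to the normal equation
ssl = np.sum((y - X @ w) ** 2)                 # sum of squared loss at the solution
print(w, ssl)
```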

Optimization
Simplest example - a quadratic function in 1 variable: f( x ) = x² + 2x - 3
Want to find the value of x where f( x ) is minimum.

Optimization
This example is simple enough we can find the minimum directly:
- Minimum occurs where slope of curve is 0
- First derivative of function = slope of curve
- So set first derivative to 0, solve for x

Optimization
f( x ) = x² + 2x - 3
df( x ) / dx = 2x + 2
2x + 2 = 0
x = -1 is the value of x where f( x ) is minimum

Optimization
Another example - a quadratic function in 2 variables: f( x ) = f( x1, x2 ) = x1² + x1·x2 + 3·x2²
f( x ) is minimum where the gradient of f( x ) is zero in all directions.

Optimization
Gradient is a vector:
- Each element is the slope of the function along the direction of one of the variables
- Each element is the partial derivative of the function with respect to one of the variables
Example: ∇f( x ) = [ ∂f / ∂x1, ∂f / ∂x2 ] = [ 2·x1 + x2, x1 + 6·x2 ]
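A small sketch, using the example function above, comparing the analytical gradient with a finite-difference approximation (the check itself is an addition, not part of the slides):

```python
import numpy as np

# f(x1, x2) = x1^2 + x1*x2 + 3*x2^2, the example function above
def f(x):
    return x[0]**2 + x[0]*x[1] + 3*x[1]**2

# Analytical gradient: [ 2*x1 + x2, x1 + 6*x2 ]
def grad_f(x):
    return np.array([2*x[0] + x[1], x[0] + 6*x[1]])

# Central-difference approximation, useful as a sanity check
def numerical_grad(f, x, h=1e-6):
    g = np.zeros(len(x))
    for k in range(len(x)):
        e = np.zeros(len(x))
        e[k] = h
        g[k] = (f(x + e) - f(x - e)) / (2 * h)
    return g

x = np.array([1.0, -2.0])
print(grad_f(x))             # [  0. -11.]
print(numerical_grad(f, x))  # approximately the same
```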

Optimization
Gradient vector points in the direction of steepest ascent of the function.

Optimization
This two-variable example is still simple enough that we can find the minimum directly:
- Set both elements of the gradient to 0
- Gives two linear equations in two variables: 2·x1 + x2 = 0 and x1 + 6·x2 = 0
- Solve for x1, x2 (here the solution is x1 = x2 = 0)

Optimization
Finding the minimum directly by a closed form analytical solution is often difficult or impossible.
- Quadratic functions in many variables:
  - system of equations for partial derivatives may be ill-conditioned
  - example: linear least squares fit where redundancy among features is high
- Other convex functions:
  - global minimum exists, but there is no closed form solution
  - example: maximum likelihood solution for logistic regression
- Nonlinear functions:
  - partial derivatives are not linear
  - example: f( x1, x2 ) = x1·sin( x1·x2 ) + x2²
  - example: sum of transfer functions in neural networks

Optimization
Many approximate methods for finding minima have been developed:
- Gradient descent
- Newton method
- Gauss-Newton
- Levenberg-Marquardt
- BFGS
- Conjugate gradient
- Etc.

Gradient descent optimization
Simple concept: follow the gradient downhill.
Process (a sketch in code follows below):
1) Pick a starting position: x0 = ( x1, x2, …, xd )
2) Determine the descent direction: -∇f( xt )
3) Choose a learning rate: η
4) Update your position: xt+1 = xt - η·∇f( xt )
5) Repeat from 2) until a stopping criterion is satisfied
Typical stopping criteria:
- ∇f( xt+1 ) ≈ 0
- some validation metric is optimized
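A minimal Python sketch of this process, applied to the two-variable quadratic from the earlier slides (the learning rate, starting point, and tolerance are arbitrary illustrative choices):

```python
import numpy as np

# The two-variable example: f(x1, x2) = x1^2 + x1*x2 + 3*x2^2
def f(x):
    return x[0]**2 + x[0]*x[1] + 3*x[1]**2

def grad_f(x):
    return np.array([2*x[0] + x[1], x[0] + 6*x[1]])

def gradient_descent(grad, x0, eta=0.1, tol=1e-8, max_iter=10000):
    x = np.array(x0, dtype=float)          # 1) starting position
    for _ in range(max_iter):
        g = grad(x)                        # 2) descent direction is -g
        if np.linalg.norm(g) < tol:        # stop when the gradient is ~ 0
            break
        x = x - eta * g                    # 4) x_{t+1} = x_t - eta * grad f(x_t)
    return x

x_min = gradient_descent(grad_f, x0=[3.0, 2.0])
print(x_min, f(x_min))   # x_min is close to [0, 0], the minimum of f
```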

Gradient descent optimization
Slides thanks to Alexandre Bayen (CE 191, Univ. California, Berkeley, 2006)
http://www.ce.berkeley.edu/~bayen/ce191www/lecturenotes/lecture10v01_descent2.pdf

Gradient descent optimization
Example in MATLAB - find the minimum of a function in two variables: y = x1² + x1·x2 + 3·x2²
http://www.youtube.com/watch?v=cY1YGQQbrpQ

Gradient descent optimization
Problems:
- Choosing the step size:
  - too small → convergence is slow and inefficient
  - too large → may not converge
- Can get stuck on "flat" areas of the function
- Easily trapped in local minima

Stochastic gradient descent
Stochastic (definition):
- involving a random variable
- involving chance or probability; probabilistic

Stochastic gradient descent
Application to training a machine learning model (a sketch in code follows below):
1) Choose one sample from the training set
2) Calculate the loss function for that single sample
3) Calculate the gradient from the loss function
4) Update the model parameters a single step based on the gradient and learning rate
5) Repeat from 1) until a stopping criterion is satisfied
Typically the entire training set is processed multiple times before stopping. The order in which samples are processed can be fixed or random.
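A minimal sketch of this loop for a simple linear regression model with squared loss (the data, model, and hyperparameters here are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy training data: y = 2*x1 - 1*x2 + noise
X = rng.normal(size=(200, 2))
y = X @ np.array([2.0, -1.0]) + 0.1 * rng.normal(size=200)

w = np.zeros(2)        # model parameters
eta = 0.01             # learning rate

for epoch in range(20):                       # multiple passes over the training set
    for n in rng.permutation(len(X)):         # random order of samples
        pred = X[n] @ w                       # 1)-2) one sample, its loss is (y[n]-pred)^2
        grad = -2 * (y[n] - pred) * X[n]      # 3) gradient of that single-sample loss
        w = w - eta * grad                    # 4) one update step
print(w)   # close to [2, -1]
```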

Matrix factorization in action
[Figure: the training data (a ratings matrix) is fed to the factorization (training process), which produces the factor matrices - "a bunch of numbers" for users and for items.]

Matrix factorization in action
[Figure: to make the desired < user, movie > prediction, multiply and add the corresponding features (a dot product of the user's and movie's factor vectors).]

Matrix factorization
Notation:
- Number of users = I
- Number of items = J
- Number of factors per user / item = F
- User of interest = i
- Item of interest = j
- Factor index = f
- User matrix U, dimensions = I x F
- Item matrix V, dimensions = J x F

Matrix factorization
Prediction for a < user, item > pair i, j:
r̂ij = Σf uif · vjf
Loss for the prediction where the true rating is rij:
L = ( rij - r̂ij )²
Using squared loss; other loss functions are possible.
The loss function contains F model variables from U, and F model variables from V.
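For example, with NumPy the prediction and its squared loss for one < user, item > pair look like this (the factor values and the rating are made up for illustration):

```python
import numpy as np

# Illustrative factor matrices with F = 3 factors
U = np.array([[0.8, 1.2, -0.3],      # user factors, shape (I, F)
              [0.5, 0.1,  0.9]])
V = np.array([[1.1, 0.7,  0.2],      # item factors, shape (J, F)
              [0.4, 1.5, -0.6]])

i, j = 0, 1          # user of interest, item of interest
r_ij = 2.5           # true rating for this pair
pred = U[i] @ V[j]   # prediction: sum over f of U[i, f] * V[j, f]
loss = (r_ij - pred) ** 2
print(pred, loss)    # 2.3 and 0.04
```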

Matrix factorization
Gradient of the loss function for sample < i, j > (for f = 1 to F):
∂L / ∂uif = -2·( rij - r̂ij )·vjf
∂L / ∂vjf = -2·( rij - r̂ij )·uif

Matrix factorization
Let's simplify the notation: define the prediction error eij = rij - r̂ij. Then (for f = 1 to F):
∂L / ∂uif = -2·eij·vjf
∂L / ∂vjf = -2·eij·uif

Matrix factorization
Set the learning rate = η. Then the factor matrix updates for sample < i, j > are (for f = 1 to F):
uif ← uif + 2·η·eij·vjf
vjf ← vjf + 2·η·eij·uif
(The constant 2 is often folded into η.)

Matrix factorization
SGD for training a matrix factorization (a sketch in code follows below):
1) Decide on F = dimension of factors
2) Initialize factor matrices with small random values
3) Choose one sample from the training set
4) Calculate the loss function for that single sample
5) Calculate the gradient from the loss function
6) Update the 2·F model parameters a single step using the gradient and learning rate
7) Repeat from 3) until a stopping criterion is satisfied
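A compact Python sketch of this procedure (the function name, the (i, j, r) triple format for ratings, and the hyperparameter values are assumptions for illustration; the factor of 2 from the gradient is folded into the learning rate here):

```python
import numpy as np

def train_mf(ratings, I, J, F=10, eta=0.01, epochs=100, seed=0):
    """SGD training of a matrix factorization (no regularization yet).

    ratings: list of (i, j, r) triples - user index, item index, true rating.
    """
    rng = np.random.default_rng(seed)
    U = 0.1 * rng.standard_normal((I, F))    # small random initial user factors
    V = 0.1 * rng.standard_normal((J, F))    # small random initial item factors
    for _ in range(epochs):                  # many passes over the training set
        for i, j, r in ratings:
            e = r - U[i] @ V[j]              # prediction error e_ij for this sample
            # update the 2*F parameters touched by this sample (simultaneously)
            U[i], V[j] = U[i] + eta * e * V[j], V[j] + eta * e * U[i]
    return U, V

# Tiny usage example with made-up ratings for 3 users and 4 items
ratings = [(0, 0, 4.0), (0, 2, 1.0), (1, 1, 5.0), (2, 3, 3.0)]
U, V = train_mf(ratings, I=3, J=4, F=2, eta=0.05, epochs=500)
print(U[0] @ V[0])   # moves toward the observed rating 4.0 as training proceeds
```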

Matrix factorization
Must use some form of regularization (usually L2):
L = ( rij - r̂ij )² + λ·( Σf uif² + Σf vjf² )
The update rules become (for f = 1 to F):
uif ← uif + 2·η·( eij·vjf - λ·uif )
vjf ← vjf + 2·η·( eij·uif - λ·vjf )
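A sketch of the corresponding L2-regularized per-sample update, which would replace the update line in the training sketch above (again the factor of 2 is folded into the learning rate, and the name lam and its value are illustrative):

```python
import numpy as np

def sgd_step_l2(U, V, i, j, r, eta=0.01, lam=0.05):
    """One L2-regularized SGD step for sample (i, j, r); updates U and V in place."""
    e = r - U[i] @ V[j]                                    # prediction error e_ij
    U[i], V[j] = (U[i] + eta * (e * V[j] - lam * U[i]),    # shrink factors toward 0
                  V[j] + eta * (e * U[i] - lam * V[j]))    # while reducing the error

# Example: one step on small random factor matrices
rng = np.random.default_rng(0)
U, V = 0.1 * rng.standard_normal((3, 2)), 0.1 * rng.standard_normal((4, 2))
sgd_step_l2(U, V, i=0, j=0, r=4.0)
```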

Stochastic gradient descent
Random thoughts …
- Samples can be processed in small batches instead of one at a time → mini-batch / batch gradient descent
- We'll see stochastic / batch gradient descent again when we learn about neural networks (as back-propagation)