Netflix Prize Solution: A Matrix Factorization Approach
By Atul S. Kulkarni, Graduate Student, University of Minnesota Duluth
Agenda
  Problem Description
  Netflix Data
  Why is it a tough nut to crack?
  Overview of methods already applied to this problem
  Overview of the Paper
  Details of the method
  How does this method work for the Netflix problem?
  My implementation
  Results
  Q and A
Netflix Prize Problem
  Given a set of users and their previous ratings for a set of movies, can we predict the rating a user will assign to a movie they have not previously rated?
  Defined at (link not preserved)
  Seeks to improve the prediction performance of Cinematch (Netflix's existing movie recommender system) by 10%.
  How is the performance measured?
    – Root Mean Square Error (RMSE)
  The winner gets a prize of 1 million USD.
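The metric named on this slide can be sketched in a few lines. This is a minimal illustration, not Netflix's scoring code: the function names `rmse` and `improvement_percent` and the sample ratings are assumptions made up for the example.

```python
import math

def rmse(predicted, actual):
    """Root Mean Square Error between two equal-length rating lists."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    squared_error_sum = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared_error_sum / len(actual))

def improvement_percent(baseline_rmse, new_rmse):
    """Relative RMSE improvement over a baseline, in percent."""
    return 100.0 * (baseline_rmse - new_rmse) / baseline_rmse

# Hypothetical ratings, for illustration only.
actual = [2, 2, 5]
predicted = [1, 2, 3]
score = rmse(predicted, actual)
```

A lower RMSE is better; a 10% improvement means the new RMSE is at most 0.9 times the baseline RMSE.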
Problem Description
  Recommender Systems
    – Use knowledge of the preferences of a group of users for certain items to help predict the interest level of other users from the same community.
  Collaborative filtering
    – Widely used method for recommender systems
    – Tries to find traits of shared interest among users in a group to help predict the likes and dislikes of the other users within the group.
Why is this problem interesting?
  Used by almost every recommender system today
    – Amazon
    – Yahoo
    – Google
    – Netflix
    – …
Netflix Data
  Netflix released data for this competition
  Contains nearly 100 million ratings
  Number of users (anonymous) = 480,189
  Number of movies rated by them = 17,770
  Training data is provided per movie
  To verify the model developed without submitting the predictions to Netflix, “probe.txt” is provided
  To submit the predictions for the competition, “qualifying.txt” is used
Netflix Data in Pictures These pictures are taken as is from 
Netflix Data in Pictures Contd.
Netflix Data
  Data in the training file is per movie. It looks like this:
    Movie#:
    Customer#,Rating,Date of Rating
  Example (customer numbers and dates not preserved):
    4:
    ,3,
    ,1,
    ,5,
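A per-movie training file in this layout could be read roughly as follows. This is only a sketch: the function name `parse_movie_file` is hypothetical, and the customer numbers and dates in the sample are invented placeholders, not real records from the data set.

```python
import io

def parse_movie_file(lines):
    """Parse 'Movie#:' header lines followed by 'Customer#,Rating,Date' rows.

    Returns a list of (movie_id, customer_id, rating, date_string) tuples.
    """
    movie_id = None
    ratings = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.endswith(":"):
            # Header line: the movie all following rows belong to.
            movie_id = int(line[:-1])
        else:
            customer, rating, date = line.split(",")
            ratings.append((movie_id, int(customer), int(rating), date))
    return ratings

# Invented sample rows (customer numbers and dates are placeholders).
sample = io.StringIO("4:\n1001,3,2005-01-02\n1002,1,2005-03-04\n1003,5,2005-05-06\n")
parsed = parse_movie_file(sample)
```

Any iterable of lines works as input, so an open file handle can be passed in place of the `io.StringIO` sample.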
Netflix Data
  Data points in “probe.txt” look like this (answers are available):
    Movie#:
    Customer#
  Example (customer numbers not preserved):
    1:
  Data in “qualifying.txt” looks like this (no answers):
    Movie#:
    Customer#,DateofRating
  Example (customer numbers and dates not preserved):
    1:
    ,
    ,
    ,
Hard Nut to Crack?
  Why is this problem such a difficult one?
    – Total ratings possible = 480,189 (users) * 17,770 (movies) = 8,532,958,530 (about 8.5 billion)
    – Total available = about 100 million
    – The Users x Movies matrix has about 8.4 billion entries missing
    – Consider the problem as a least-squares problem
    – We can represent this problem as a system of equations in matrix form
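The arithmetic on this slide can be checked directly. The user and movie counts come from the slides; the rating count is the slide's round figure of 100 million, used here as an approximation.

```python
n_users = 480_189
n_movies = 17_770
n_ratings = 100_000_000  # the slides' approximate figure, not an exact count

possible = n_users * n_movies   # every (user, movie) cell of the matrix
missing = possible - n_ratings  # cells with no rating
density = n_ratings / possible  # fraction of cells that are filled

print(f"possible: {possible:,}")
print(f"missing:  {missing:,}")
print(f"density:  {density:.2%}")
```

Under these figures only a little over 1% of the matrix is observed, which is why the problem is treated as one of estimating missing entries.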
Technically tough as well
  Huge memory requirements
  High time requirements
  Because we use only ~100 million of the possible 8.5 billion ratings, the predictors have some error in their weights (small training data)
Various Methods Employed for Netflix Prize Problem
  Nearest Neighbor methods
    – k-NN with variations
  Matrix factorization
    – Probabilistic Latent Semantic Analysis
    – Probabilistic Matrix Factorization
    – Expectation Maximization for Matrix Factorization
    – Singular Value Decomposition
    – Regularized Matrix Factorization
The Paper
  Title: “Improving regularized singular value decomposition for collaborative filtering”, Arkadiusz Paterek, Proceedings of KDD Cup and Workshop.
  Uses the algorithm described by Simon Funk (Brandyn Webb).
  The algorithm revolves around regularized Singular Value Decomposition (SVD) and adds biases to it to improve performance.
  It also proposes some methods for post-processing the features extracted by the SVD.
  It compares various combinations of the methods suggested in the paper on the Netflix data.
Singular Value Decomposition
  Consider the given problem as a matrix A of Users x Movies, or of Movies x Users.
  Shown are the two examples (a dash marks a movie the user has not rated):

        M1  M2  M3  M4  M5  M6
    U1   2   4   5   5   -   1
    U2   -   -   3   5   1   5
    U3   2   -   4   5   -   5

        U1  U2  U3
    M1   2   -   2
    M2   4   -   -
    M3   5   3   4
    M4   5   5   5
    M5   -   1   -
    M6   1   5   5

  What do we do with this representation?
Singular Value Decomposition
  A method of matrix factorization
  Applicable to rectangular and square matrices alike
  Decomposes the matrix into 3 component matrices whose product approximates the original matrix
  E.g. (R output, numeric values not preserved): D = $d, U = $u (3 x 3), V = $v (6 x 3)
Can we recover the original matrix?
  Yes. (Well, almost!) Here is how.
  We multiply the 3 matrices: U * D * V^T
  We get A* ~= A (R output, 3 x 6, most values not preserved). Column 5 holds values of order e-17, e+00, and e-16, and the last column reads 1, 5, 5.
  We can see this is an approximation of the original matrix.
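The decompose-and-multiply-back step can be reproduced with NumPy instead of R. The sketch below is an illustration under assumptions: the 3 x 6 matrix values are an example chosen for this block, with 0 standing in for unrated entries, and are not taken from the lost R output.

```python
import numpy as np

# Assumed example ratings matrix (users x movies); 0 stands in for "not rated".
A = np.array([
    [2, 4, 5, 5, 0, 1],
    [0, 0, 3, 5, 1, 5],
    [2, 0, 4, 5, 0, 5],
], dtype=float)

# Thin SVD: U is 3x3, d holds the 3 singular values, Vt is 3x6 (V transposed).
U, d, Vt = np.linalg.svd(A, full_matrices=False)

# Multiply the three factors back together: U * D * V^T.
A_star = U @ np.diag(d) @ Vt

# Largest entry-wise difference; only floating-point round-off is expected.
max_abs_diff = np.max(np.abs(A_star - A))
print(np.round(A_star, 6))
```

Zeros in the original typically come back as very small floating-point values rather than exact zeros, which is the "well, almost" on the slide.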
How do we use SVD?
  We use the 2 matrices U and V to estimate the original matrix A.
  So what happened to the diagonal matrix D?
  We train our method on the given training set and learn U and V directly, rolling the diagonal matrix into the two matrices.
  We compute U * V^T and obtain A’.
  Error: $e_{ij} = A_{ij} - A'_{ij}$ for all i, j.
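One way to see why two matrices are enough is to split the singular values evenly between the two factors. This is a sketch of that idea only, not the training procedure: the example matrix, the names `P` and `Q`, and the square-root split are assumptions for illustration.

```python
import numpy as np

# Assumed example ratings matrix; 0 stands in for "not rated".
A = np.array([
    [2, 4, 5, 5, 0, 1],
    [0, 0, 3, 5, 1, 5],
    [2, 0, 4, 5, 0, 5],
], dtype=float)

U, d, Vt = np.linalg.svd(A, full_matrices=False)

# Roll D into both factors: scale column k of U and of V by sqrt(d[k]).
root = np.sqrt(d)
P = U * root      # users x features
Q = Vt.T * root   # movies x features

A_prime = P @ Q.T    # two-matrix estimate of A
error = A - A_prime  # e_ij = A_ij - A'_ij
```

In the learned version, `P` and `Q` are not computed from an SVD at all; they are fitted directly to the known ratings, so the diagonal matrix never appears explicitly.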
Algorithm variations covered in this paper
  Simple Predictors
  Regularized SVD
  Improved Regularized SVD (with Biases)
  Post processing SVD with KNN
  Post processing SVD with kernel ridge regression
  K-means
  Linear model for each item
  Decreasing the number of Parameters
The SVD Algorithm from paper [3,4,6]
  Initialize 2 arrays, movieFeatures (U) and customerFeatures (V), to a small value, 0.1
  For every feature# in features
    Until the minimum number of iterations is done and RMSE is no longer improving by more than the minimum improvement
      squareerrsum = 0
      For every data point in the training set  // data point has custID and movieID
        cf = customerFeatures[feature#][custID]  // locally copy current feature value
        mf = movieFeatures[feature#][movieID]  // locally copy current feature value
        prating = (contribution of the features already trained) + cf * mf  // predict the rating
        error = originalrating - prating  // find the error
        squareerrsum += error * error  // sum the squared error for RMSE
        customerFeatures[feature#][custID] += learningrate * (error * mf - regularizationfactor * cf)  // roll the error into the feature
        movieFeatures[feature#][movieID] += learningrate * (error * cf - regularizationfactor * mf)  // roll the error into the feature
      RMSE = sqrt(squareerrsum / total number of data points)  // calculate RMSE
  Now we do the testing
  For every test point with custID and movieID
    predictedrating = 0
    For every feature# in features
      predictedrating += customerFeatures[feature#][custID] * movieFeatures[feature#][movieID]
  Caveats
    – Clip the ratings to the range [1, 5]; the predicted rating might go out of bounds
    – The “regularization factor” was introduced by Brandyn Webb to reduce overfitting
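The training and testing loops above can be sketched in Python on a toy data set. This is an illustrative sketch under assumptions, not the author's implementation: the function names, the toy ratings, and the hyperparameter values (learning rate 0.01, regularization 0.02, epoch limits) are all chosen for this example, and the per-point cache of already-trained features is one common way to realize the feature-by-feature scheme.

```python
import numpy as np

def train_regularized_svd(ratings, n_users, n_movies, n_features=2,
                          lrate=0.01, reg=0.02, min_epochs=50,
                          max_epochs=500, min_improvement=1e-4, init=0.1):
    """Train one feature at a time by SGD on (user, movie, rating) triples."""
    user_f = np.full((n_features, n_users), init)
    movie_f = np.full((n_features, n_movies), init)
    cached = np.zeros(len(ratings))  # contribution of features already trained
    for f in range(n_features):
        prev_rmse = float("inf")
        for epoch in range(max_epochs):
            sq_err_sum = 0.0
            for k, (u, m, r) in enumerate(ratings):
                uf, mf = user_f[f, u], movie_f[f, m]
                err = r - (cached[k] + uf * mf)
                sq_err_sum += err * err
                user_f[f, u] += lrate * (err * mf - reg * uf)
                movie_f[f, m] += lrate * (err * uf - reg * mf)
            rmse = np.sqrt(sq_err_sum / len(ratings))
            if epoch >= min_epochs and prev_rmse - rmse < min_improvement:
                break
            prev_rmse = rmse
        # Freeze this feature and fold it into the cached predictions.
        for k, (u, m, _) in enumerate(ratings):
            cached[k] += user_f[f, u] * movie_f[f, m]
    return user_f, movie_f

def predict(user_f, movie_f, u, m):
    """Sum over all features, then clip to the valid rating range [1, 5]."""
    return float(np.clip(user_f[:, u] @ movie_f[:, m], 1.0, 5.0))

def rmse_on(user_f, movie_f, ratings):
    errs = [r - predict(user_f, movie_f, u, m) for u, m, r in ratings]
    return float(np.sqrt(np.mean(np.square(errs))))

# Toy (user, movie, rating) triples with 0-based ids, for illustration only.
ratings = [(0, 0, 2), (0, 1, 4), (0, 2, 5), (0, 3, 5), (0, 5, 1),
           (1, 2, 3), (1, 3, 5), (1, 4, 1), (1, 5, 5),
           (2, 0, 2), (2, 2, 4), (2, 3, 5), (2, 5, 5)]
untrained = (np.full((2, 3), 0.1), np.full((2, 6), 0.1))
rmse_before = rmse_on(*untrained, ratings)
user_f, movie_f = train_regularized_svd(ratings, n_users=3, n_movies=6)
rmse_after = rmse_on(user_f, movie_f, ratings)
```

On the full Netflix data the same loop would need far more memory-efficient storage than a Python list of tuples; the structure of the updates is the point here.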
Variation: Improved Regularized SVD
  That was regularized SVD
  Improved Regularized SVD with Biases
    – Predict the rating with 2 added biases, C_i per customer and D_j per movie:
      Rating = C_i + D_j + sum over feature# of customerFeatures[feature#][i] * movieFeatures[feature#][j]
    – During training, update the biases as:
      C_i += learningrate * (err - regularization * (C_i + D_j - global_mean))
      D_j += learningrate * (err - regularization * (C_i + D_j - global_mean))
    – learningrate = 0.001, regularization = 0.05, global_mean = (value not preserved)
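The biased prediction and the bias update can be sketched as two small functions. The learning rate and regularization are the values from the slide; the function names, array shapes, and the `global_mean` of 3.6 are assumptions for illustration (the slide's own value is not available).

```python
import numpy as np

def predict_biased(c, d, user_f, movie_f, i, j):
    """Rating = C_i + D_j + dot product of the feature vectors."""
    return c[i] + d[j] + user_f[:, i] @ movie_f[:, j]

def update_biases(c, d, i, j, err, global_mean, lrate=0.001, reg=0.05):
    """One SGD step on the customer bias c[i] and the movie bias d[j]."""
    # Compute the shared penalty once so both updates use the old values.
    penalty = reg * (c[i] + d[j] - global_mean)
    c[i] += lrate * (err - penalty)
    d[j] += lrate * (err - penalty)

# Tiny illustrative setup: 2 features, 2 customers, 3 movies.
c = np.zeros(2)
d = np.zeros(3)
user_f = np.full((2, 2), 0.1)
movie_f = np.full((2, 3), 0.1)

GLOBAL_MEAN = 3.6  # assumed placeholder value, not taken from the slides
err = 4.0 - predict_biased(c, d, user_f, movie_f, 0, 1)  # true rating 4
update_biases(c, d, 0, 1, err, GLOBAL_MEAN)
```

In a full training loop the feature updates from the regularized SVD would run alongside these bias updates for every training point.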
Variation: KNN for Movies
  Post-processing with KNN
    – On the regularized SVD movieFeature matrix, we compute the cosine similarity between 2 movie vectors:
      $\text{similarity} = \dfrac{\text{movieFeature}[\text{movieID1}]^T \, \text{movieFeature}[\text{movieID2}]}{\lVert \text{movieFeature}[\text{movieID1}] \rVert \, \lVert \text{movieFeature}[\text{movieID2}] \rVert}$
    – Using this similarity measure, we build a neighborhood of the 1 nearest movie and use the rating of that nearest movie as the predicted rating
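The similarity measure and the 1-nearest-movie prediction can be sketched as follows. This is a sketch under assumptions: the function names are hypothetical, the feature values are invented, and restricting the neighbor search to movies the user has already rated is an assumption about how the nearest movie's rating becomes available.

```python
import numpy as np

def cosine_similarity(a, b):
    """Dot product of two vectors divided by the product of their norms."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def predict_1nn(movie_f, user_ratings, target):
    """Return the user's rating of the rated movie most similar to `target`.

    movie_f: features x movies array; user_ratings: {movie_id: rating}.
    """
    nearest = max(
        user_ratings,
        key=lambda m: cosine_similarity(movie_f[:, target], movie_f[:, m]),
    )
    return user_ratings[nearest]

# Invented feature vectors for 3 movies (one column per movie).
movie_f = np.array([[1.0, 2.0, 0.0],
                    [0.0, 0.1, 3.0]])
user_ratings = {1: 4, 2: 2}  # this user rated movies 1 and 2
prediction = predict_1nn(movie_f, user_ratings, target=0)
```

Movie 1 points in almost the same direction as movie 0, so its rating is the one returned.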
Experimentation Strategy by author
  Select 1.5% - 15% of probe.txt as the hold-out (test) set
  Train all models on the rest of the ratings
  All models predict the ratings
  Merge the results using linear regression on the test set
  Combine two methods for the initial prediction and then perform linear regression
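The merging step, linear regression over the predictions of several models, can be sketched with an ordinary least-squares fit. The function names and the synthetic predictions are assumptions for illustration; the paper's exact regression setup may differ.

```python
import numpy as np

def fit_blend(predictions, truth):
    """Least-squares weights (intercept first) for combining model outputs."""
    X = np.column_stack([np.ones(len(truth))] + list(predictions))
    weights, *_ = np.linalg.lstsq(X, truth, rcond=None)
    return weights

def apply_blend(weights, predictions):
    """Combine model outputs with previously fitted weights."""
    X = np.column_stack([np.ones(len(predictions[0]))] + list(predictions))
    return X @ weights

# Synthetic hold-out predictions from two hypothetical models.
p1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
p2 = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
truth = 0.7 * p1 + 0.3 * p2  # constructed so the ideal weights are known

weights = fit_blend([p1, p2], truth)
blended = apply_blend(weights, [p1, p2])
```

Because the weights are fitted on the hold-out set itself, a separate set is needed to judge the blended predictor fairly.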
Results from the Paper
  Columns: Predictor | Test RMSE with BASIC | Test RMSE with BASIC and RSVD2 | Cumulative Test RMSE
  Predictors, in table order (RMSE values not preserved): BASIC, RSVD, RSVD, KMEANS, SVD_KNN, SVD_KRR, LM, NSVD, NSVD, SVD_KRR * NSVD, SVD_KRR * NSVD
  Replicated from the paper as is
My Experiments
  I am trying out the regularized SVD method and the improved regularized SVD method with qualifying.txt and probe.txt
  I am also going to implement the first 3 steps of the author’s experimentation strategy (in my case I will predict with regularized SVD and improved regularized SVD)
  If time permits, I might try the SVD KNN method
  I am also varying some parameters, such as the learning rate and the number of features, to see their effect on the results
  I shall have all my results posted on the web site soon
References
  1. Herlocker, J., Konstan, J., Terveen, L., and Riedl, J. Evaluating Collaborative Filtering Recommender Systems. ACM Transactions on Information Systems 22 (2004), ACM Press, 5-53.
  2. Gábor Takács, István Pilászy, Bottyán Németh, Domonkos Tikk. Scalable Collaborative Filtering Approaches for Large Recommender Systems. JMLR Volume 10.
  3. Arkadiusz Paterek. Improving regularized singular value decomposition for collaborative filtering. Proceedings of KDD Cup and Workshop.
  4. http://sifter.org/~simon/journal/ html
  5. http://www.igvita.com/2006/10/29/dissecting-the-netflix-dataset/
  6. G. Gorrell and B. Webb. Generalized hebbian algorithm for incremental latent semantic analysis. Proceedings of Interspeech, 2006.