Slide 3: Agenda
- Problem description
- Netflix data
- Why is it a tough nut to crack?
- Overview of methods already applied to this problem
- Overview of the paper
- Details of the method
- How does this method work for the Netflix problem?
- My implementation
- Results
- Q and A
Slide 4: Netflix Prize Problem
- Given a set of users with their previous ratings for a set of movies, can we predict the rating they will assign to a movie they have not previously rated?
- Defined at […]
- Seeks to improve the prediction performance of Cinematch (Netflix's existing movie recommender system) by 10%.
- How is the performance measured? Root Mean Square Error (RMSE).
- Winner gets a prize of 1 million USD.
Speaker notes: Take ratings for a movie from everyone in the class and try to make a prediction based on that. Example movies: The Dark Knight, Star Wars.
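Since the contest is scored by RMSE, a minimal sketch of the metric may help. The function name `rmse` and the sample numbers are illustrative assumptions, not taken from the Netflix data.

```python
import math


def rmse(predicted, actual):
    """Root mean square error between two equal-length rating sequences."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two equal-length, non-empty sequences")
    squared_error_sum = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared_error_sum / len(actual))


# Made-up predictions and true ratings, for illustration only.
example_rmse = rmse([3.5, 4.0, 2.0], [4, 4, 1])
```

A lower value is better; a perfect predictor scores 0.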
Slide 5: Problem Description
Recommender systems
- Use knowledge about the preferences of a group of users for certain items to help predict the interest level of other users from the same community.
Collaborative filtering
- Widely used method for recommender systems.
- Tries to find traits of shared interest among users in a group to help predict the likes and dislikes of the other users within the group.
Slide 6: Why is this problem interesting?
Collaborative filtering is used by almost every recommender system today, for example at:
- Amazon
- Yahoo
- Google
- Netflix
- …
Slide 7: Netflix Data
- Netflix released data for this competition.
- It contains nearly 100 million ratings.
- Number of users (anonymous) = 480,189
- Number of movies rated by them = 17,770
- Training data is provided per movie.
- "probe.txt" is provided to verify the developed model without submitting predictions to Netflix.
- "qualifying.txt" is used to submit predictions for the competition.
Slide 8: Netflix Data in Pictures
(Pictures taken as is from […])
Slide 11: Netflix Data
Data in the training file is per movie. It looks like this:
    Movie#:
    Customer#,Rating,DateOfRating
Example:
    4:
    […],3,[…]
    […],1,[…]
    410199,5,[…]
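The per-movie layout above can be read with a few lines of Python. This is only a sketch: the function name `parse_training_lines` is hypothetical, the sample IDs and dates are invented, and the date is kept as a plain string because the slide does not show its format.

```python
def parse_training_lines(lines):
    """Yield (movie_id, customer_id, rating, date_string) tuples.

    Assumes a 'Movie#:' header line followed by
    'Customer#,Rating,DateOfRating' lines.
    """
    movie_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.endswith(":"):
            movie_id = int(line[:-1])  # header line such as "7:"
            continue
        if movie_id is None:
            raise ValueError("rating line found before any 'Movie#:' header")
        customer, rating, date = line.split(",")
        yield movie_id, int(customer), int(rating), date


# Invented sample lines, not real Netflix records.
sample = ["7:", "1001,4,2005-01-15", "1002,2,2004-11-03"]
ratings = list(parse_training_lines(sample))
```

Passing an open file object instead of `sample` would work the same way, since the function only iterates over lines.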
Slide 12: Netflix Data
Data points in "probe.txt" look like this (have answers):
    Movie#:
    Customer#
Example:
    1:
    30878
    2647871
    1283744
Data in "qualifying.txt" looks like this (no answers):
    Movie#:
    Customer#,DateOfRating
Example:
    1:
    […],[…]
Slide 13: Hard Nut to Crack?
Why is this problem such a difficult one?
- Total ratings possible = 480,189 (users) * 17,770 (movies) ≈ 8.5 billion
- Total available ≈ 100 million
- The Users x Movies matrix has about 8.4 billion entries missing.
- Consider the problem as a least squares problem: we can represent it as a system of equations in a matrix.
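The sparsity figures follow from simple arithmetic. In the sketch below the user and movie counts come from the slides, while the available count is the rounded "nearly 100 million" figure, so the derived numbers are approximate.

```python
users = 480_189
movies = 17_770
available = 100_000_000  # rounded "nearly 100 million" figure

possible = users * movies       # every user rating every movie
missing = possible - available  # empty cells of the Users x Movies matrix
density = available / possible  # fraction of cells that hold a rating

print(f"possible: {possible:,}")
print(f"missing:  {missing:,}")
print(f"density:  {density:.2%}")
```

Only a little over 1% of the matrix is filled, which is why the problem is treated as learning from very sparse data.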
Slide 14: Technically tough as well
- Huge memory requirements: 4.3 GB if we don't design the data structures carefully; […] MB if we go to a bit-level representation in C.
- High time requirements: training time varies from a few hours to days (15 in my case).
- Sparse training data: because we are using only ~100 million of the possible 8.5 billion ratings, the predictors have some error in their weights (small training data).
Slide 15: Various Methods Employed for the Netflix Prize Problem
- Nearest neighbor methods: k-NN with variations. We will not talk a great deal about these.
- Matrix factorization:
  - Probabilistic Latent Semantic Analysis: probabilistic variant of LSA, a method from NLP that aims to find hidden concepts in a given set of documents.
  - Probabilistic Matrix Factorization: uses a Gaussian model; scales well.
  - Expectation Maximization for Matrix Factorization: tries to find the maximum likelihood for a rating using matrix factorization methods.
  - Singular Value Decomposition (SVD)
  - Regularized Matrix Factorization
Slide 16: The Paper
- Title: "Improving regularized singular value decomposition for collaborative filtering", Arkadiusz Paterek, Proceedings of KDD Cup and Workshop, 2007.
- Uses the algorithm described by Simon Funk (Brandyn Webb) in […].
- The algorithm revolves around the regularized Singular Value Decomposition (SVD) described in […] and adds biases to it to improve performance.
- It also proposes some methods for post-processing the features extracted by the SVD.
- It compares the various combinations of methods suggested in the paper on the Netflix data.
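Slide 17: Singular Value Decomposition
Consider the given problem as a matrix A of Users x Movies, or Movies x Users. Shown here are the two examples (blank cells are missing ratings).

Users x Movies:
      M1  M2  M3  M4  M5  M6
U1     2   4   5       1
U2             3
U3

Movies x Users:
      U1  U2  U3
M1     2
M2     4
M3     5   3
M4
M5     1
M6

What do we do with this representation?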
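Because most cells are empty, one way to hold such a matrix is to store only the known ratings and index them both by user and by movie. The sketch below uses the five ratings readable from the slide's small example; their exact cell positions are an assumption, and the variable names are illustrative.

```python
from collections import defaultdict

# Known ratings only, keyed by (user, movie). Cell positions are assumed.
ratings = {
    ("U1", "M1"): 2,
    ("U1", "M2"): 4,
    ("U1", "M3"): 5,
    ("U1", "M5"): 1,
    ("U2", "M3"): 3,
}

by_user = defaultdict(dict)   # Users x Movies view
by_movie = defaultdict(dict)  # Movies x Users view
for (user, movie), rating in ratings.items():
    by_user[user][movie] = rating
    by_movie[movie][user] = rating

users, movies = 3, 6
filled_fraction = len(ratings) / (users * movies)
```

Both views share the same five entries; the 13 missing cells are simply never stored.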
Slide 18: Singular Value Decomposition
- A method of matrix factorization.
- Applicable to rectangular and square matrices alike.
- Decomposes the matrix into 3 component matrices whose product approximates the original matrix.
- E.g. (R output) D is $d, U is $u (3 x 3) and V is $v (6 x 3). [matrix values not recoverable]
Slide 19: Can we recover the original matrix?
Yes. (Well, almost!) Here is how.
- We multiply the 3 matrices: U * D * V^T.
- We get A* ~= A, a 3 x 6 matrix. [matrix values not recoverable]
- We can see this is an approximation of the original matrix.
Speaker note: emphasize the small values that show up instead of the missing values.
Slide 20: How do we use SVD?
- We use the 2 matrices U and V to estimate the original matrix A.
- So what happened to the diagonal matrix D? We train our method on the given training set and learn by rolling the diagonal matrix into the two matrices.
- We compute U * V^T and obtain A'.
- Error: for every known rating (i, j), error_ij = A_ij - A'_ij.
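To see why the diagonal matrix can be "rolled into" the other two, the sketch below checks on a tiny made-up factorization that U * D * V^T equals the product of two rescaled matrices. Splitting D as sqrt(D) on each side is one common choice and an assumption here; the trained model never forms D explicitly. All names and values are illustrative.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(col) for col in zip(*m)]


def scale_columns(m, factors):
    """Multiply column k of m by factors[k]."""
    return [[value * f for value, f in zip(row, factors)] for row in m]


# Made-up factors: U is 2 x 2, d holds the diagonal of D, V is 3 x 2.
U = [[0.6, -0.8], [0.8, 0.6]]
d = [4.0, 1.0]
V = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]

D = [[d[0], 0.0], [0.0, d[1]]]
full = matmul(matmul(U, D), transpose(V))  # U * D * V^T

root = [math.sqrt(x) for x in d]
U_rolled = scale_columns(U, root)  # U * sqrt(D)
V_rolled = scale_columns(V, root)  # V * sqrt(D)
rolled = matmul(U_rolled, transpose(V_rolled))  # same matrix, no explicit D
```

Both products give the same 2 x 3 matrix, so learning two feature matrices directly loses nothing compared with keeping D separate.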
Slide 21: Algorithm variations covered in this paper
1. Simple predictors: 6 predictors in total. 5 are empirical probabilities for the user in question and the 6th is the mean rating of the movie.
2. Regularized SVD: we try to find the two matrices U and V by iterating over the training set.
3. Improved regularized SVD (with biases): adds 1 variable per movie and 1 per user, called biases, to the prediction and runs the same training algorithm.
4. Post-processing SVD with KNN (SVD_KNN): proposed by an anonymous contestant. Find movie-movie similarity, take the 1 nearest neighbor for this user, and assign that movie's rating.
5. Post-processing SVD with kernel ridge regression (SVD_KRR): a complex method that discards all the values of matrix U and defines the prediction using a Gaussian kernel function.
6. K-means: divides the users into K clusters; the predicted rating is the median rating of the cluster.
7. Linear model for each item: another item-item similarity method where, for every item, we build a weighted linear model learned using gradient descent.
8. Decreasing the number of parameters: only movies rated by user i are considered, and a model is fit with weights for those movies for that user. This model has #users * #features parameters.
Which of these will we cover?
Slide 22: The SVD Algorithm from the paper [3,4,6]
Initialize 2 arrays, movieFeatures (U) and customerFeatures (V), to a small value (0.1)
For every feature# in features
    Until minimum iterations are done and RMSE is not improving by more than the minimum improvement
        For every data point in training set  // data point has custID and movieID
            prating = prior + customerFeatures[feature#][custID] * movieFeatures[feature#][movieID]  // predict the rating; prior is the cached contribution of the features trained so far (0 for the first feature)
            error = originalrating - prating  // find the error
            squareerrsum += error * error  // sum the squared error for RMSE
            cf = customerFeatures[feature#][custID]  // locally copy current feature value
            mf = movieFeatures[feature#][movieID]  // locally copy current feature value
(Contd.)
Slide 23: Algorithm contd.
            customerFeatures[feature#][custID] += learningrate * (error * mf - regularizationfactor * cf)  // rolling the error into the feature
            movieFeatures[feature#][movieID] += learningrate * (error * cf - regularizationfactor * mf)  // rolling the error into the feature
        RMSE = sqrt(squareerrsum / total number of data points)  // calculate RMSE
Now we do the testing
For every test point with custID and movieID
    For every feature# in features
        predictedrating += customerFeatures[feature#][custID] * movieFeatures[feature#][movieID]
Caveats
- Clip the ratings to the range [1, 5]; the predicted rating might go out of bounds.
- The "regularization factor" is introduced by Brandyn Webb in […] to reduce overfitting.
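The pseudocode can be sketched in Python as follows. This is an illustration under stated assumptions, not the paper's or Simon Funk's actual code: the names `train_rsvd` and `predict`, the default hyperparameters, the residual cache for already trained features, and the four toy ratings are all choices made for this example.

```python
import math


def train_rsvd(ratings, n_features=2, learning_rate=0.001, regularization=0.02,
               min_epochs=5, max_epochs=200, min_improvement=1e-4, init=0.1):
    """Train features one at a time on (customer_id, movie_id, rating) triples."""
    customers = {c for c, _, _ in ratings}
    movies = {m for _, m, _ in ratings}
    customer_features = [{c: init for c in customers} for _ in range(n_features)]
    movie_features = [{m: init for m in movies} for _ in range(n_features)]
    cache = [0.0] * len(ratings)  # contribution of features already trained
    for f in range(n_features):
        cf_row, mf_row = customer_features[f], movie_features[f]
        previous_rmse = float("inf")
        for epoch in range(max_epochs):
            squared_error_sum = 0.0
            for idx, (cust, movie, rating) in enumerate(ratings):
                cf, mf = cf_row[cust], mf_row[movie]  # local copies
                error = rating - (cache[idx] + cf * mf)
                squared_error_sum += error * error
                # Roll the error into both features, with regularization.
                cf_row[cust] += learning_rate * (error * mf - regularization * cf)
                mf_row[movie] += learning_rate * (error * cf - regularization * mf)
            rmse = math.sqrt(squared_error_sum / len(ratings))
            if epoch + 1 >= min_epochs and previous_rmse - rmse < min_improvement:
                break  # minimum epochs done and RMSE no longer improving enough
            previous_rmse = rmse
        for idx, (cust, movie, _) in enumerate(ratings):
            cache[idx] += cf_row[cust] * mf_row[movie]
    return customer_features, movie_features


def predict(customer_features, movie_features, cust, movie):
    """Sum the per-feature products and clip the result to [1, 5]."""
    total = sum(cf[cust] * mf[movie]
                for cf, mf in zip(customer_features, movie_features))
    return min(5.0, max(1.0, total))


# Toy data: (customer_id, movie_id, rating). Invented for illustration.
toy = [(1, 10, 5), (1, 11, 3), (2, 10, 4), (2, 11, 2)]
cust_f, movie_f = train_rsvd(toy, learning_rate=0.05)
toy_predictions = [predict(cust_f, movie_f, c, m) for c, m, _ in toy]
```

The toy run uses a much larger learning rate than the default so that four data points converge in a few dozen passes; on the full data set a small rate such as the one quoted in the slides is the more plausible setting.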
Slide 24: Variation: Improved Regularized SVD
That was regularized SVD. Next: improved regularized SVD with biases.
- Predict the rating with 2 added biases: Ci per customer and Dj per movie.
  Rating = Ci + Dj + customerFeatures[feature#][i] * movieFeatures[feature#][j]
- During training, update the biases as
  Ci += learningrate * (err - regularization * (Ci + Dj - global_mean))
  Dj += learningrate * (err - regularization * (Ci + Dj - global_mean))
- learningrate = 0.001, regularization = 0.05, global_mean = […]
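A single bias update can be sketched as below. The function name `update_biases` is hypothetical, the feature term is written as a dot product over all features (an assumption about how the per-feature products are combined), and the `global_mean` passed in the example is a placeholder, not the data set's actual mean.

```python
def update_biases(rating, c_bias, d_bias, cust_vec, movie_vec, global_mean,
                  learning_rate=0.001, regularization=0.05):
    """Return updated (Ci, Dj) and the prediction error for one rating."""
    predicted = c_bias + d_bias + sum(u * v for u, v in zip(cust_vec, movie_vec))
    err = rating - predicted
    # Both biases are pulled so that Ci + Dj stays close to the global mean.
    shrink = regularization * (c_bias + d_bias - global_mean)
    new_c = c_bias + learning_rate * (err - shrink)
    new_d = d_bias + learning_rate * (err - shrink)
    return new_c, new_d, err


# Placeholder numbers, for illustration only.
new_c, new_d, err = update_biases(
    rating=4, c_bias=1.5, d_bias=2.0,
    cust_vec=[0.1], movie_vec=[0.1], global_mean=3.0)
```

Both biases receive the same increment because the update rule on the slide is symmetric in Ci and Dj.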
Slide 25: Variation: KNN for Movies
Post-processing with KNN
- On the regularized SVD movieFeature matrix, we compute the cosine similarity between 2 movie vectors:
  similarity = (movieFeature[movieID1]^T * movieFeature[movieID2]) / (||movieFeature[movieID1]|| * ||movieFeature[movieID2]||)
- Using this similarity measure, we build a neighborhood of the 1 nearest movie and use the rating of that nearest movie as the predicted rating.
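The similarity and the 1-nearest-neighbor lookup can be sketched as follows. The function names, the toy feature vectors, and the reading that the neighbor is picked among movies the user has already rated are assumptions for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def predict_from_nearest(target_movie, user_ratings, movie_vectors):
    """Return the user's rating of the rated movie most similar to the target."""
    nearest = max(
        user_ratings,
        key=lambda m: cosine_similarity(movie_vectors[target_movie],
                                        movie_vectors[m]))
    return user_ratings[nearest]


# Toy movie feature vectors and one user's ratings, invented for illustration.
movie_vectors = {"A": [1.0, 0.0], "B": [0.9, 0.1], "C": [0.0, 1.0]}
user_ratings = {"B": 4, "C": 2}
predicted_for_a = predict_from_nearest("A", user_ratings, movie_vectors)
```

Here movie B points in almost the same direction as A, so the user's rating of B is used as the prediction for A.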
Slide 26: Experimentation Strategy by the author
1. Select 1.5% - 15% of probe.txt as the hold-out (test) set.
2. Train all models on the rest of the ratings.
3. All models predict the ratings.
4. Merge the results using linear regression on the test set.
5. Combine two methods for the initial prediction and then perform linear regression.
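The merging step can be illustrated with a least squares fit of blending weights for two predictors. This is a simplified sketch: it uses only two predictors and no intercept, solves the 2 x 2 normal equations directly, and the name `fit_blend_weights` and the numbers are made up; the paper's regression combines many more predictors.

```python
def fit_blend_weights(pred_a, pred_b, actual):
    """Least squares weights (w_a, w_b) so that w_a*a + w_b*b fits actual."""
    saa = sum(a * a for a in pred_a)
    sbb = sum(b * b for b in pred_b)
    sab = sum(a * b for a, b in zip(pred_a, pred_b))
    say = sum(a * y for a, y in zip(pred_a, actual))
    sby = sum(b * y for b, y in zip(pred_b, actual))
    det = saa * sbb - sab * sab
    if abs(det) < 1e-12:
        raise ValueError("predictors are collinear; weights are not unique")
    # Cramer's rule on the 2 x 2 normal equations.
    w_a = (say * sbb - sby * sab) / det
    w_b = (sby * saa - say * sab) / det
    return w_a, w_b


# Made-up hold-out predictions from two models and the true ratings.
model_a = [4.0, 2.0, 5.0]
model_b = [2.0, 4.0, 3.0]
truth = [3.0, 3.0, 4.0]
w_a, w_b = fit_blend_weights(model_a, model_b, truth)
blended = [w_a * a + w_b * b for a, b in zip(model_a, model_b)]
```

In this toy case the true ratings are the exact average of the two models, so the fit recovers equal weights.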
Slide 27: Results from the Paper
(Table replicated from the paper as is; blank cells are missing values.)

Predictor          Test RMSE     Test RMSE with     Cumulative
                   with BASIC    BASIC and RSVD2    Test RMSE
BASIC              .9826         .9039
RSVD               .9024         .9018              .9094
RSVD2
KMEANS             .9410         .9029              .9010
SVD_KNN            .9525         .9013              .8988
SVD_KRR            .9006         .8959              .8933
LM                 .9506         .8995              .8902
NSVD1              .9312         .8986              .8887
NSVD2              .9590         .9032              .8879
SVD_KRR * NSVD1    -
SVD_KRR * NSVD2                                     .8877

- With the RSVD2 and BASIC methods, the author achieved an RMSE of […], around 4-5% lower than the Cinematch algorithm.
- Linear regression with all the predictors from the table gives […] on the test set and […] on the qualifying.txt set (~6% improvement over Netflix).
- […]% improvement: the solution submitted to the Netflix Prize is the result of merging, in proportion 85/15, two linear regressions trained on different training-test partitions: one linear regression with 56 predictors (most of them different variations of regularized SVD and post-processing with KNN) and 63 two-way interactions, and a second one with 16 predictors (a subset of the predictors from the first regression) and 5 two-way interactions.
Slide 28: My Experiments
- I am trying out the regularized SVD and improved regularized SVD methods with qualifying.txt and probe.txt.
- I am also going to implement the first 3 steps of the author's experimentation strategy (in my case, predicting with regularized SVD and improved regularized SVD).
- If time permits, I might try the SVD_KNN method.
- I am also varying some parameters, such as the learning rate and the number of features, to see their effect on the results.
- I shall have all my results posted on the web site soon.
Slide 30: References
- Herlocker, J., Konstan, J., Terveen, L., and Riedl, J. Evaluating Collaborative Filtering Recommender Systems. ACM Transactions on Information Systems 22 (2004), ACM Press, 5-53.
- Takács, G., Pilászy, I., Németh, B., and Tikk, D. Scalable Collaborative Filtering Approaches for Large Recommender Systems. JMLR 10 (2009).
- Paterek, A. Improving regularized singular value decomposition for collaborative filtering. Proceedings of KDD Cup and Workshop, 2007.
- Gorrell, G., and Webb, B. Generalized Hebbian algorithm for incremental latent semantic analysis. Proceedings of Interspeech, 2006.
Slide 31: Thanks for your time!
Atul S. Kulkarni
firstname.lastname@example.org