CS910: Foundations of Data Analytics Graham Cormode Recommender Systems.


1 CS910: Foundations of Data Analytics Graham Cormode Recommender Systems

2 Objectives
• To understand the concept of recommendation
• To see neighbour-based methods
• To see latent factor methods
• To see how recommender systems are evaluated

3 Recommendations
• A modern problem: a very large number of possible items
 – Which item should I try next, based on my preferences?
• Arises in many different places:
 – Recommendations of content: books, music, movies, videos...
 – Recommendations of places to travel, hotels, restaurants
 – Recommendations of food to eat, sites to visit
 – Recommendations of articles to read: news, research, gossip
• Each person has different preferences/interests
 – How to elicit and model these preferences?
 – How to customize recommendations to a particular individual?

4 Recommendations in the Wild

5 Recommender Systems
• Recommender systems produce tailored recommendations
 – Inputs: ratings of items by users; possibly also user profiles and item profiles
 – Outputs: for a given user, a list of recommended items; or, for a given (user, item) pair, a predicted rating
• Ratings can be in many forms:
 – "Star rating" (out of 5)
 – Binary rating (thumbs up/thumbs down)
 – Likert scale (strongly like, like, neutral, dislike, strongly dislike)
 – Comparisons: prefer X to Y
• Will use movie recommendation as a running example

6 Ratings Matrix
[Example ratings matrix: users as rows, items as columns, with most entries blank]
• n × m matrix of ratings R, where r_u,i is the rating of user u for item i
• Typically, the matrix is large and sparse:
 – Thousands of users (n) and thousands of items (m)
 – Each user has rated only a few items
 – Each item is rated by at most a small fraction of users
• Goal is to provide predictions p_u,i for certain (user, item) pairs

7 Evaluating a recommender system
• Evaluation is similar to evaluating classifiers:
 – Break labelled data into training and test data
 – For each test (user, item) pair, predict the user's score for the item
 – Measure the difference, and aim to minimize it over the N tests
• Combine the differences into a single score to compare systems
 – Most common: Root-Mean-Square-Error (RMSE) between p_u,i and r_u,i:
   RMSE = √( Σ_u,i (p_u,i − r_u,i)² / N )
 – Sometimes also use Mean Absolute Error (MAE): Σ_u,i |p_u,i − r_u,i| / N
 – If recommendations are simply 'good' or 'bad', can use precision and recall
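The two error measures above can be sketched in a few lines of Python (a minimal illustration; the function names and example values are mine, not from the slides):

```python
import math

def rmse(preds, truths):
    """Root-Mean-Square-Error over N predicted/true rating pairs."""
    n = len(preds)
    return math.sqrt(sum((p - r) ** 2 for p, r in zip(preds, truths)) / n)

def mae(preds, truths):
    """Mean Absolute Error over the same pairs."""
    return sum(abs(p - r) for p, r in zip(preds, truths)) / len(preds)
```

Note that RMSE penalizes a few large errors more heavily than the same total error spread over many small ones, which MAE does not.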

8 Initial attempts
• Can we use existing methods: classification, regression, etc.?
• Assume we have features for each user and each item:
 – User: demographics, stated preferences
 – Item: e.g. genre, director, actors
• Can treat as a classification problem: predict a score
 – Train a classifier from examples
• Limitations of the classifier approach:
 – We don't necessarily have user and item information
 – It ignores what we do have: lots of ratings between users and items; these are hard to use as features, unless everyone has rated a fixed set

9 Neighbourhood method
• Neighbourhood-based collaborative filtering: users "collaborate" to help recommend (filter) items
 1. Find k other users K who are similar to the target user u
  – Possibly assign each user v a weight w_u,v based on how similar they are
 2. Combine the k users' (weighted) preferences, and use these to make predictions for u
• Can use existing methods to measure similarity:
 – PMCC (Pearson correlation) of ratings as w_u,v
 – Cosine similarity of rating vectors
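The similarity step might be sketched as follows, computing the PMCC restricted to co-rated items (an illustration; representing each user's ratings as a dict from item id to rating is my assumption, not the lecture's):

```python
import math

def pearson(ru, rv):
    """PMCC between two users' ratings, over the items both have rated.
    ru, rv: dicts mapping item id -> rating."""
    common = set(ru) & set(rv)
    if len(common) < 2:
        return 0.0                       # too little overlap to estimate
    mu = sum(ru[i] for i in common) / len(common)
    mv = sum(rv[i] for i in common) / len(common)
    num = sum((ru[i] - mu) * (rv[i] - mv) for i in common)
    den = math.sqrt(sum((ru[i] - mu) ** 2 for i in common)
                    * sum((rv[i] - mv) ** 2 for i in common))
    return num / den if den else 0.0
```

The k neighbours would then be the users with the largest similarity to u.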

10 Neighbourhood example (unweighted)
• 3 users like the same set of movies as Joe (exact match)
 – All three like "Saving Private Ryan", so this is the top recommendation

11 Different Rating Scales
• Every user rates slightly differently
 – Some consistently rate high, some consistently rate low
• Using PMCC avoids this effect when picking neighbours, but predictions still need adjustment
• Make an adjustment when computing a score:
 – Predict: p_u,i = r̄_u + ( Σ_v∈K (r_v,i − r̄_v) w_u,v ) / ( Σ_v∈K w_u,v )
 – r̄_u: average rating for user u
 – w_u,v: weight assigned to user v based on their similarity to u, e.g. the correlation coefficient
 – p_u,i takes the weighted average of the neighbours' deviations from their own average scores, and adds it onto u's average score
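The adjusted prediction formula above might be sketched like this (illustrative only; the function name and the dict-based representation are mine):

```python
def predict_mean_adjusted(u_ratings, neighbours, item):
    """Mean-adjusted neighbourhood prediction for one (user, item) pair.
    u_ratings: the target user's own ratings (dict item -> rating).
    neighbours: list of (ratings_dict, weight) for the k similar users."""
    r_u = sum(u_ratings.values()) / len(u_ratings)   # target user's mean
    num = den = 0.0
    for rv, w in neighbours:
        if item in rv:
            r_v = sum(rv.values()) / len(rv)         # neighbour's mean
            num += (rv[item] - r_v) * w              # weighted deviation
            den += w
    return r_u + (num / den if den else 0.0)
```

If no neighbour has rated the item, the prediction falls back to the user's own average.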

12 Item-based Collaborative Filtering
• Often there are many more users than items
 – E.g. only a few thousand movies available, but millions of users
 – Comparing to all users can be slow
• Can do neighbourhood-based filtering using items:
 – Two items are similar if the users rating both rate them similarly
 – Compute the PMCC between the ratings of users who rated both, as w_i,j
 – Find the k most similar items J
 – Compute a simple weighted average: p_u,i = Σ_j∈J r_u,j w_i,j / Σ_j∈J w_i,j
   (no adjustment by mean, as we assume no bias from items)
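The item-based weighted average can be sketched as follows (a hedged illustration; the interface, with a dict of the user's own ratings and a dict of item similarities to the target item, is my assumption):

```python
def predict_item_based(u_ratings, sims, k):
    """Weighted average of the user's ratings of the k items most
    similar to the target item.
    u_ratings: dict item -> the user's rating.
    sims: dict item -> similarity w_i,j to the target item."""
    rated = [(j, sims[j]) for j in u_ratings if j in sims]
    top = sorted(rated, key=lambda jw: -jw[1])[:k]   # k most similar
    den = sum(w for _, w in top)
    if den == 0:
        return None                                   # no usable neighbours
    return sum(u_ratings[j] * w for j, w in top) / den
```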

13 Latent Factor Analysis
• We rejected methods based on features of items, since we could not guarantee they would be available
• Latent factor analysis tries to find "hidden" features from the ratings matrix
 – Factors might correspond to recognisable features like genre
 – Other factors: child-friendly, comedic, light/dark
 – More abstract: depth of character, quirkiness
 – Could find factors that are hard to interpret

14 Latent Factor Example

15 Matrix Factorization
• Model each user and item as a vector of (inferred) factors
 – Let q_i be the vector for item i, and w_u the vector for user u
 – The predicted rating p_u,i is then the dot product (w_u · q_i)
• How to learn the factors from the given data?
 – Given ratings matrix R, try to express R as a product WQ
   W is an n × f matrix of users and their latent factors
   Q is an f × m matrix of items and their latent factors
 – A matrix factorization problem: factor R into W × Q
• Can be solved by Singular Value Decomposition

16 Singular Value Decomposition
• Given an m × n matrix M, decompose into M = U Σ Vᵀ, where:
 – U is an m × m matrix with orthogonal columns [left singular vectors]
 – Σ is a rectangular m × n diagonal matrix [singular values]
 – Vᵀ is an n × n matrix with orthogonal rows [right singular vectors]
• The Singular Value Decomposition is highly structured:
 – The singular values are the square roots of the eigenvalues of MMᵀ
 – The left (right) singular vectors are eigenvectors of MMᵀ (MᵀM)
• SVD can be used to give approximate representations
 – Keep the k largest singular values, set the rest to zero
 – Picks out the k most important "directions"
 – Gives the k latent factors to describe the data
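The truncated decomposition can be sketched with NumPy on a toy complete matrix (the values are invented; `numpy.linalg.svd` returns the singular values as a vector, largest first):

```python
import numpy as np

# A small complete ratings-style matrix with two blocks of "taste".
M = np.array([[5., 5., 0., 1.],
              [5., 4., 1., 0.],
              [0., 1., 5., 4.],
              [1., 0., 4., 5.]])

U, s, Vt = np.linalg.svd(M, full_matrices=False)

k = 2                                          # keep the k largest singular values
M_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]    # best rank-k approximation of M
```

The rows of `U[:, :k]` and columns of `Vt[:k, :]` play the roles of the user and item latent factors.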

17 SVD for recommender systems
• Textbook SVD doesn't work when the matrix has missing values!
 – Could try to fill in the missing values somehow, then factor
• Instead, set up as an optimization problem:
 – Learn length-k vectors q_i, w_u to solve:
   min_{q,w} Σ_(u,i)∈R (r_u,i − q_i · w_u)²
 – Minimize the squared error between the predicted and true values
• If we had a complete matrix, SVD would solve this problem:
 – Set W = U_k Σ_k^½ and Q = Σ_k^½ V_kᵀ, where U_k, V_k hold the singular vectors corresponding to the k largest singular values
• Additional problem: too much freedom (not enough ratings)
 – Risk of overfitting the training data, failing to generalize

18 Regularization
• Regularization is a technique used in many places
 – Here, avoid overfitting by penalizing large parameter values
 – Achieve this by adding the size of the parameters to the optimization:
   min_{q,w} Σ_(u,i)∈R (r_u,i − q_i · w_u)² + λ(ǁq_iǁ²₂ + ǁw_uǁ²₂)
   where ǁxǁ²₂ is the squared L₂ (Euclidean) norm: the sum of squared values
 – The effect is to shrink the values of q and w towards 0, keeping the model simple
• Many different forms of regularization:
 – L₂ regularization: add terms of the form ǁxǁ²₂
 – L₁ regularization: terms of the form ǁxǁ₁ (can give sparser solutions)
• The form of the regularization should fit the optimization

19 Solving the optimization: Gradient Descent
• How to solve min_{q,w} Σ_(u,i)∈R (r_u,i − q_i · w_u)² + λ(ǁq_iǁ²₂ + ǁw_uǁ²₂)?
• Gradient descent:
 – For each training example, find the error of the current prediction: e_u,i = r_u,i − q_i · w_u
 – Modify the parameters by taking a step against the gradient:
   q_i ← q_i + γ (e_u,i w_u − λ q_i)  [from the derivative of the objective with respect to q_i]
   w_u ← w_u + γ (e_u,i q_i − λ w_u)  [from the derivative with respect to w_u]
 – γ is the learning rate, a parameter that controls the speed of descent
• Advantages and disadvantages of gradient descent:
 – ++ Fairly easy to implement: easy to compute the update at each step
 – −− Can be slow, and hard to parallelize
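The update rules above can be sketched as stochastic gradient descent over the observed ratings (a minimal illustration; the function, its defaults, and the use of `numpy` are my assumptions):

```python
import numpy as np

def sgd_mf(ratings, n_users, n_items, f=2, gamma=0.01, lam=0.02, epochs=200):
    """SGD on the regularized squared error.
    ratings: list of (u, i, r) triples for the observed entries."""
    rng = np.random.default_rng(0)
    W = 0.1 * rng.standard_normal((n_users, f))   # user factors w_u
    Q = 0.1 * rng.standard_normal((n_items, f))   # item factors q_i
    for _ in range(epochs):
        for u, i, r in ratings:
            e = r - W[u] @ Q[i]                   # prediction error e_u,i
            w_old = W[u].copy()                   # update both from old values
            W[u] += gamma * (e * Q[i] - lam * W[u])
            Q[i] += gamma * (e * w_old - lam * Q[i])
    return W, Q
```

Each pass touches every observed rating once; the λ terms pull the factors gently toward zero as the slide's regularized objective requires.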

20 Solving the optimization: Least Squares
• How to solve min_{q,w} Σ_(u,i)∈R (r_u,i − q_i · w_u)² + λ(ǁq_iǁ²₂ + ǁw_uǁ²₂)?
• Reducing to least squares:
 – Suppose the values of w_u are fixed
 – Then the goal is to minimize a quadratic function of the q_i
 – Solved by techniques from regression: least squares minimization
• Alternating least squares:
 – Pretend the values of w_u are fixed, optimize the values of q_i
 – Swap: pretend the values of q_i are fixed, optimize the values of w_u
 – Repeat until convergence
• Can be slower than gradient descent on a single machine
 – But can parallelize: compute each q_i independently
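Alternating least squares can be sketched as follows: with the user factors fixed, each q_i is found by a small ridge regression over the users who rated item i, and then the roles swap (the interface, with a dense ratings array plus a boolean mask of observed entries, is my assumption):

```python
import numpy as np

def als(R, mask, f=2, lam=0.01, iters=20):
    """Alternating least squares for regularized matrix factorization.
    R: n x m ratings array; mask: boolean array of observed entries."""
    n, m = R.shape
    rng = np.random.default_rng(1)
    W = rng.standard_normal((n, f))              # user factors
    Q = rng.standard_normal((m, f))              # item factors
    I = lam * np.eye(f)
    for _ in range(iters):
        for i in range(m):                       # fix W, solve each q_i
            u = mask[:, i]
            if u.any():
                Q[i] = np.linalg.solve(W[u].T @ W[u] + I, W[u].T @ R[u, i])
        for v in range(n):                       # fix Q, solve each w_v
            j = mask[v]
            if j.any():
                W[v] = np.linalg.solve(Q[j].T @ Q[j] + I, Q[j].T @ R[v, j])
    return W, Q
```

Note the parallelism the slide mentions: within one half-step, every q_i (or every w_v) can be solved independently.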

21 Adding biases
• Can generalize matrix factorization to incorporate other factors
 – E.g. Fred always rates 1 star less than average
 – E.g. Citizen Kane is rated 0.5 higher than other films on average
• These are not captured well by a model of the form q_i · w_u
 – Explicitly modelling biases (intercepts) can give a better fit
• Model with biases: p_u,i = μ + b_i + b_u + (w_u · q_i)
 – μ: global average rating
 – b_i: bias for item i
 – b_u: rating bias of user u (similar to the neighbourhood method)
• Optimize the new error function in the same way:
   min_{q,w,b} Σ_(u,i)∈R (r_u,i − μ − b_u − b_i − q_i · w_u)² + λ(ǁq_iǁ²₂ + ǁw_uǁ²₂ + b_u² + b_i²)
 – Can add more biases, e.g. to incorporate variation over time
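The bias terms can be illustrated with a simple fit that sets each bias to the average residual left over after the terms before it (a deliberate simplification of the joint optimization above, which would learn all the terms together; the names are mine):

```python
def fit_biases(ratings):
    """Estimate the global mean and simple item/user biases.
    ratings: list of (u, i, r) triples."""
    mu = sum(r for _, _, r in ratings) / len(ratings)    # global average
    b_i, b_u, cnt_i, cnt_u = {}, {}, {}, {}
    for u, i, r in ratings:                              # item bias: avg (r - mu)
        b_i[i] = b_i.get(i, 0.0) + (r - mu)
        cnt_i[i] = cnt_i.get(i, 0) + 1
    for i in b_i:
        b_i[i] /= cnt_i[i]
    for u, i, r in ratings:                              # user bias: avg residual
        b_u[u] = b_u.get(u, 0.0) + (r - mu - b_i[i])
        cnt_u[u] = cnt_u.get(u, 0) + 1
    for u in b_u:
        b_u[u] /= cnt_u[u]
    return mu, b_i, b_u

def baseline_predict(mu, b_i, b_u, u, i):
    """p_u,i without the factor term: mu + b_i + b_u."""
    return mu + b_i.get(i, 0.0) + b_u.get(u, 0.0)
```

The full model would add the learned (w_u · q_i) term on top of this baseline.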

22 Cold start problem: new items
• How to cope when new objects are added to the system?
 – New users arrive, new movies are released: the "cold start" problem
• A new item has no ratings, so will it never be recommended?
 – Use attributes of the item (actors, genre) to give it some score
 – Randomly suggest it to users to get some ratings

23 Cold start problem: new users
• New users arrive: we have no idea what they like!
 – Recommend globally popular items to them (Harry Potter...); may not give much specific information about their tastes
 – Encourage new users to rate some items before recommending
 – Suggest items that are "divisive", to try to maximize information
   Tradeoff: "poor" recommendations may drive users away

24 Case Study: The Netflix Prize
• Netflix ran a public competition to improve its recommendations
 – Netflix streams movies over the internet (and rents DVDs by mail)
 – Users rate each movie on a 5-star scale
 – Netflix makes recommendations of what to watch next
• Object of the competition: improve over the current recommendations
 – The "Cinematch" algorithm: "uses straightforward linear models..."
 – Prize: $1M to improve RMSE by 10%
• Training data: 100M dated ratings from 480K users on 18K movies
 – Could submit ratings of the test data at most once per day, to avoid stressing the servers and to deter attempts to elicit the true answers

25 The Netflix Prize
https://www.youtube.com/watch?v=ImpV70uLxyw

26 Netflix prize factors
• Postscript: Netflix adopted some of the winning ideas, but not all
 – "Explainability" of recommendations is an additional requirement
 – The cost of fitting models and making predictions is also important

27 Recommender Systems Summary
• Introduced the concept of recommendation
• Saw neighbour-based methods
• Saw latent factor methods
• Understood how recommender systems are evaluated
 – Netflix Prize as a case study in applied recommender systems
• Recommended reading:
 – "Recommender Systems" (Encyclopedia of Machine Learning)
 – "Matrix Factorization Techniques for Recommender Systems", Koren, Bell, Volinsky, IEEE Computer, 2009

