
1 Topic 12 – Recommender Systems and the Netflix Prize

2 Outline
– Intro to Recommender Systems
– The Netflix Prize – structure and history
– Models for Recommendation
– Extensions of the model to other domains

3 Recommender Systems
Systems which take user preferences about items as input and output recommendations.
Early examples: Bellcore Music Recommender (1995); MIT Media Lab: Firefly (1996)
Best example: Amazon.com. Worst example: Amazon.com.
Also: Netflix, eBay, Google Reader, iTunes Genius, digg.com, Hulu.com

4 Recommender Systems
Basic idea – recommend item i to user u for the purpose of:
– Exposing them to something they would not otherwise have seen
– Leading customers to the Long Tail
– Increasing customer satisfaction
Data for recommender systems (need to know who likes what):
– Purchases/rentals
– Ratings
– Web page views
– Which do you think is best?

5 Recommender Systems
Two types of data:
Explicit data: the user provides information about their preferences
– Pro: high-quality ratings
– Con: hard to get – people cannot be bothered
Implicit data: infer whether or not the user likes a product based on behavior
– Pro: much more data available, less invasive
– Con: the inference is often wrong (does purchase imply preference?)
In either case, the data is just a big matrix:
– Users x items
– Entries binary or real-valued
Biggest problem:
– Sparsity: most users have not rated most products.
[figure: a mostly empty users x items ratings matrix]
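In code, a matrix like this is almost always stored sparsely, so the missing entries cost nothing. A minimal sketch (the library choice and the toy numbers are mine, not from the slides):

```python
import numpy as np
from scipy.sparse import csr_matrix

# Observed (user, item, rating) triplets -- the only cells we actually have.
users   = np.array([0, 0, 1, 2, 2])
items   = np.array([1, 4, 0, 2, 4])
ratings = np.array([4, 5, 3, 2, 4])

# users x items matrix; unobserved cells take no storage at all.
R = csr_matrix((ratings, (users, items)), shape=(3, 5))
print(R.toarray())  # dense view, zeros standing in for "unknown"
```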

6 Recommender Systems: Models
Two camps on how to make recommendations:
– Collaborative Filtering (CF): use the collective intelligence from all available rating information to make predictions for individuals. Depends on the fact that user tastes are correlated: if Alice and Bob both like X and Alice likes Y, then Bob is more likely to like Y.
– Content based: extracts "features" from items for a big regression or rule-based model. See www.nanocrowd.com
15 years of research in the field. Conventional wisdom:
– CF performs better when there is sufficient data
– Content-based is useful when there is little data

7 Recommender Systems: Evaluation
A narrow focus on accuracy sometimes misses the point:
– Prediction diversity, prediction context, order of predictions
For explicit ratings: use a training/test setup and some form of MSE score
– OK, but treats all ratings the same (we care most about the top k)
– Doesn't reflect the interventional impact of the recommender
For implicit data, we don't have a score from which to calculate MSE, but we can still do a training/test split:
– Create a ranked vector of preferences – a recommendation algorithm is really only interested in the top k
– Use "discounted cumulative gain", which gives more weight to items predicted at the top

8 Evaluation: Discounted Cumulative Gain
Recommending stories on a web site:
– Ten stories to recommend in the test set: A B C D E F G H I J
– Only room on the page for 5 recommendations
– Algorithm 1 ranks: A E C H I
– Algorithm 2 ranks: C D B F G
– User actually viewed: A E B F G J
– rel = {0,1}
– DCG_1 = 1 + 1/log2(2) + 0 + 0 + 0 = 2
– DCG_2 = 0 + 0 + 1/log2(3) + 1/log2(4) + 1/log2(5) = 1.56
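A minimal sketch of the slide's DCG variant (binary relevance; rel_1 plus rel_i/log2(i) for positions i >= 2); the function name is illustrative:

```python
import math

def dcg_at_k(ranked, relevant, k=5):
    """DCG with binary relevance: rel_1 + sum over i >= 2 of rel_i / log2(i)."""
    score = 0.0
    for i, item in enumerate(ranked[:k], start=1):
        rel = 1.0 if item in relevant else 0.0
        score += rel if i == 1 else rel / math.log2(i)
    return score

viewed = set("AEBFGJ")                            # what the user actually viewed
print(dcg_at_k(list("AECHI"), viewed))            # Algorithm 1 -> 2.0
print(round(dcg_at_k(list("CDBFG"), viewed), 2))  # Algorithm 2 -> 1.56
```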

9 The Netflix Prize

10 Netflix
A US-based DVD rental-by-mail company
>10M customers, 100K titles, ships 1.9M DVDs per day
Good recommendations = happy customers

11 Netflix Prize
October 2006: offers $1,000,000 for an improved recommender algorithm
Training data:
– 100 million ratings
– 480,000 users
– 17,770 movies
– 6 years of data: 2000-2005
Test data:
– Last few ratings of each user (2.8 million)
– Evaluation via RMSE: root mean squared error
– Netflix Cinematch RMSE: 0.9514
Competition:
– $1 million grand prize for 10% improvement
– If 10% not met, $50,000 annual "Progress Prize" for best improvement
[tables: sample training ratings with columns user, movie, date, score; and the test set with the same columns but the scores withheld ("?")]
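For reference, a one-function sketch of the RMSE used for scoring (nothing Netflix-specific, just the standard definition):

```python
import numpy as np

def rmse(predicted, actual):
    """Root mean squared error over the held-out ratings."""
    err = np.asarray(predicted, dtype=float) - np.asarray(actual, dtype=float)
    return float(np.sqrt(np.mean(err ** 2)))

print(rmse([3.8, 2.1, 4.4], [4, 2, 5]))  # ~0.37
```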

12 Netflix Prize
Competition design:
Hold-out set created by taking the last 9 ratings for each user
– A non-random, biased set
Hold-out set split randomly three ways (Probe, Quiz, Test):
– Probe set: appended to the training data (with answers) to allow unbiased estimation of RMSE
– Teams submit ratings for the Quiz + Test sets; Netflix returns the RMSE on the Quiz set only
– Quiz set results are posted on the public leaderboard, but the Test set is used to determine the winner – this prevents overfitting

13 Data Characteristics

14 Ratings per movie/user
Most active users (User ID – # ratings – mean rating):
– 305344 – 17,651 – 1.90
– 387418 – 17,432 – 1.81
– 2439493 – 16,560 – 1.22
– 1664010 – 15,811 – 4.26
– 2118461 – 14,829 – 4.08
– 1461435 – 9,820 – 1.37
Avg # ratings/user: 208
Avg # ratings/movie: 5,627

15 Data Characteristics
Most Loved Movies (count – avg rating):
– The Shawshank Redemption – 137,812 – 4.593
– Lord of the Rings: The Return of the King – 133,597 – 4.545
– The Green Mile – 180,883 – 4.306
– Lord of the Rings: The Two Towers – 150,676 – 4.460
– Finding Nemo – 139,050 – 4.415
– Raiders of the Lost Ark – 117,456 – 4.504
Most Rated Movies: Miss Congeniality, Independence Day, The Patriot, The Day After Tomorrow, Pretty Woman, Pirates of the Caribbean
Highest Variance: The Royal Tenenbaums, Lost in Translation, Pearl Harbor, Miss Congeniality, Napoleon Dynamite, Fahrenheit 9/11

16 The competition progresses…
– Cinematch beaten in two weeks
– Halfway to 10% in 6 weeks
– Our team, BellKor (with Bob Bell and Yehuda Koren), took over the lead in the summer…
– With 48 hours to go to the $50K Progress Prize, we had a comfortable lead… with another submission in pocket

17 [figure: leaderboard screenshot, 05:00 pm Sept 30]

18 [figure: leaderboard screenshot, 06:00 pm Sept 30 – "wanna split 50/50?"]

19 [figure: leaderboard timeline] ARRRRGH! We have one more chance….

20 A Nervous Last Day
– Start in a virtual tie on Quiz data; unclear who leads on Test data
– Can Gravity/Dinosaurs improve again?
– More offers for combining
– Can we squeeze out a few more points?
– Improved our mixing strategy
– Created a second team, KorBell, just to be safe

21 [figure: leaderboard timeline] Our final submission(s)…


23 So The Drama
Timeline:
– 2008: BellKor merges with BigChaos to win the 2nd $50K Progress Prize (9.4%)
– 2009:
June 26: BellKor, BigChaos and Pragmatic Theory merge, passing the 10% threshold and beginning the 30-day 'last call' period.
July 24: All is quiet. We crawl up to 10.08%…
July 25: The Ensemble, a coalition of 23 independent teams, combines to pass us on the public leaderboard.
July 26: BellKor pulls back into a tie. Then The Ensemble pulls ahead by 0.01%.
July 26: The Netflix Prize ends: the winner (determined on the Test set) is unclear.
September 21: Netflix announces BellKor's Pragmatic Chaos as winner of the $1M prize.

24 So The Drama
Timeline:
– 2009: June 26: BellKor, BigChaos and Pragmatic Theory merge, passing the 10% threshold and beginning the 30-day 'last call' period.
We lead the next-best team by 0.37%; only two teams are within 0.80%.
– But soon, the cavalry charges: mega-mergers start to form.
– July 25 (25 hours left): We crawl up to 10.08%… The Ensemble, a coalition of 23 independent teams, combines to pass us on the public leaderboard.


26 Test Set Results
– BellKor's Pragmatic Chaos: 0.8567
– The Ensemble: 0.8567
The tie breaker was submission date/time – we won by 20 minutes!
But really:
– BellKor's Pragmatic Chaos: 0.856704
– The Ensemble: 0.856714
Also, a combination of the BPC (10.06%) and Ensemble (10.06%) scores results in a 10.19% improvement!

27 Our Approach
Our prize-winning solutions were an ensemble of many separate solution sets:
– Progress Prize 2007: 103 sets
– Progress Prize 2008 (w/ BigChaos): 205 sets
– Grand Prize 2009 (w/ BigChaos and Pragmatic Theory): >800 sets!!
We used two main classes of models:
– Nearest Neighbors
– Latent Factor Models (via Singular Value Decomposition)
Also regularized regression, not a big factor. Teammates used neural nets and other methods.
Approaches mainly algorithmic, not statistical, in nature

28 Data representation (excluding dates)
[figure: a movies x users matrix, users 1-12 across the columns; an entry is a rating between 1 and 5, a blank is an unknown rating]

29 Nearest Neighbors

30 Nearest Neighbors
[figure: the same movies x users ratings matrix]

31 Nearest Neighbors
Estimate the rating of movie 1 by user 5.
[figure: the ratings matrix with the (movie 1, user 5) entry marked "?"]

32 Nearest Neighbors
Neighbor selection: identify movies similar to movie 1 that were rated by user 5.
[figure: candidate neighbor rows highlighted in the matrix]

33 Nearest Neighbors
Compute similarity weights: s_13 = 0.2, s_16 = 0.3
[figure: movies 3 and 6 selected as neighbors]

34 Nearest Neighbors
Predict by taking the weighted average: (0.2*2 + 0.3*3)/(0.2 + 0.3) = 2.6
[figure: the "?" entry filled in with 2.6]

35 Nearest Neighbors
To predict the rating for user u on item i, use the user's ratings of similar movies – a weighted average of baseline-adjusted ratings over the neighborhood:
r̂_ui = b_ui + Σ_{j in N(i;u)} s_ij (r_uj − b_uj) / Σ_{j in N(i;u)} s_ij
– r_ui = rating for user u and item i
– b_ui = baseline rating for user u and item i
– s_ij = similarity between items i and j
– N(i;u) = neighborhood of item i for user u (might be fixed at k)

36 Nearest Neighbors
Useful to "center" the data and model the residuals.
What is s_ij?
– Cosine distance
– Correlation
What is N(i,u)?
– Top-k
– Threshold
What is b_ui? How to deal with missing values?
Choose several different options and throw them all in!
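To make those choices concrete, here is a toy item-item predictor (a sketch, not the BellKor code): ratings are centered on item means as a crude stand-in for b_ui, similarity is the cosine of the co-rated residuals, and N(i,u) is a top-k cutoff. All names are illustrative.

```python
import numpy as np

def predict_item_item(R, u, i, k=2):
    """Predict R[u, i] from user u's ratings of the k items most similar to i.

    R: users x items array with np.nan for missing ratings.
    """
    means = np.nanmean(R, axis=0)   # crude per-item baseline b_ui
    resid = R - means               # center each column

    sims = []
    for j in range(R.shape[1]):
        if j == i or np.isnan(R[u, j]):
            continue                # a neighbor must be rated by user u
        both = ~np.isnan(resid[:, i]) & ~np.isnan(resid[:, j])
        if both.sum() < 2:
            continue                # need co-raters to measure similarity
        a, b = resid[both, i], resid[both, j]
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        if denom > 0:
            sims.append((float(a @ b) / denom, j))  # cosine similarity s_ij

    top = sorted(sims, reverse=True)[:k]            # top-k neighborhood N(i, u)
    if not top:
        return means[i]
    num = sum(s * resid[u, j] for s, j in top)
    den = sum(abs(s) for s, _ in top)
    return means[i] + num / den
```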

37 Nearest Neighbors, cont.
This is called "item-item" NN
– Can also do user-user
– Which do you think is better?
Advantages of NN:
– Few modeling assumptions
– Easy to explain to users
– Most popular RS tool

38 Nearest Neighbors, Modified
Problem with traditional k-NN: similarity weights are calculated globally and do not account for correlation among the neighbors.
– Instead, estimate the weights (w_ij) simultaneously via a least squares optimization: basically, a regression using the ratings in the neighborhood.
– Shrinkage helps address the correlation
– (don't try this at home)
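A sketch of the least squares idea under stated assumptions (a toy ridge regression, not the actual prize code): regress the target item's residual ratings on its neighbors' residuals across users, with the ridge penalty providing the shrinkage.

```python
import numpy as np

def fit_neighbor_weights(X, y, lam=1.0):
    """Solve min_w ||y - X w||^2 + lam ||w||^2.

    X: (n_users, n_neighbors) residual ratings of the neighbor items.
    y: (n_users,) residual ratings of the target item by the same users.
    Fitting all w_ij jointly accounts for correlation among the neighbors;
    the ridge penalty shrinks the weights toward zero.
    """
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)
```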

39 Latent factor models – Singular Value Decomposition
SVD finds "concepts."
[figure: movies placed in a two-factor space, one axis running from "geared towards females" to "geared towards males", the other from "serious" to "escapist"; e.g. The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility]

40 Matrix Decomposition – SVD
Example with 3 factors (concepts). Each user and each item is described by a feature vector across concepts.
[figure: the users x items ratings matrix approximated by the product of a user-factor matrix and an item-factor matrix]

41 Factorization-based modeling
This is a strange way to use SVD!
– It is usually used for reducing dimensionality; here, for filling in missing data!
– Special techniques to do SVD with missing data: Alternating Least Squares, a variant of the EM algorithm
Probably the most popular model among contestants
– 12/11/2006: Simon Funk describes an SVD-based method
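A compact sketch of ALS on the observed entries only (illustrative sizes and regularization; real implementations add baselines and more careful tuning):

```python
import numpy as np

def als(R, n_factors=3, lam=0.1, n_iters=20):
    """Factor R ~ P @ Q.T using only the observed (non-NaN) entries.

    Holding Q fixed, each row of P has a closed-form ridge solution,
    and vice versa; alternating the two steps lowers the penalized loss.
    """
    n_users, n_items = R.shape
    seen = ~np.isnan(R)
    rng = np.random.default_rng(0)
    P = rng.normal(scale=0.1, size=(n_users, n_factors))
    Q = rng.normal(scale=0.1, size=(n_items, n_factors))
    I = lam * np.eye(n_factors)
    for _ in range(n_iters):
        for u in range(n_users):
            J = seen[u]                 # items rated by user u
            if J.any():
                P[u] = np.linalg.solve(Q[J].T @ Q[J] + I, Q[J].T @ R[u, J])
        for i in range(n_items):
            U = seen[:, i]              # users who rated item i
            if U.any():
                Q[i] = np.linalg.solve(P[U].T @ P[U] + I, P[U].T @ R[U, i])
    return P, Q                         # predict with P @ Q.T
```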

42 Latent Factor Models, Modified
Problem with traditional SVD:
– User and item factors are determined globally
– Each user is described as a fixed linear combination across factors
– What if there are different people in the household?
Let the linear combination change as a function of the item rated: substitute p_u with p_u(i), and add similarity weights.
Again, don't try this at home!

43 First 2 Singular Vectors
[figure: movies plotted on the first two singular vectors]

44 Incorporating Implicit Data
Implicit data: what you choose to rate is an important piece of information, separate from how you rate it.
Helps incorporate negative information, especially for users with low variance.
Can be fit in NN or SVD.
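One published way to fold this into the SVD side is Koren's SVD++-style term, where the user vector is augmented by factors for the set of items the user chose to rate. A sketch with made-up shapes and names:

```python
import numpy as np

def predict_svdpp(mu, b_u, b_i, q_i, p_u, Y, rated_items):
    """r_hat = mu + b_u + b_i + q_i . (p_u + |N(u)|^-0.5 * sum_j y_j).

    Y: (n_items, n_factors) implicit item factors.
    rated_items: indices of the items user u rated -- *which* items carry
    the signal, independent of the scores given.
    """
    implicit = Y[rated_items].sum(axis=0) / np.sqrt(len(rated_items))
    return mu + b_u + b_i + q_i @ (p_u + implicit)
```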

45 Temporal Effects
It took us 1.5 years to figure out how to use the dates!
Account for temporal effects in two ways:
– Directly in the SVD model: allow the user factor vectors p_u to vary with time, p_u(t). Many ways to do this; it adds many parameters and needs regularization. (Don't try this at home.)
– Observed: the number of user ratings on a given date is a proxy for how long ago the movie was seen. Some movies age better than others.
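One simple way to let a user factor vary with time, sketched under assumptions (plain time-binning, which is only one of the "many ways"; the bin count and names are illustrative):

```python
import numpy as np

N_BINS = 30  # e.g. ~2-month bins across the 6-year rating window

def user_factor_at(p_static, p_bin, t, t_min, t_max):
    """p_u(t) = p_u + offset[bin(t)]: static factors plus a per-bin offset.

    p_static: (n_factors,) static user factors.
    p_bin: (N_BINS, n_factors) offsets, learned like any other parameter --
    and regularized, since this multiplies the parameter count.
    """
    b = int((t - t_min) / (t_max - t_min + 1e-9) * N_BINS)
    return p_static + p_bin[min(b, N_BINS - 1)]
```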

46 [figure: rating frequency over time, Memento vs. Patch Adams]

47 Netflix Prize: Other Topics
– Ensembles of models
– Application to TV data

48 The Power of Ensembles
Ensembles of models: over 800 in the Grand Prize solution!
Some of our "models" were blends of other models, or models on the residuals of other models.
– Why? Because it worked. Black-box magic at its worst.
However, mega-blends are not needed in practice:
– Best model: a complex model with implicit and explicit data and time-varying coefficients, millions of parameters, and regularization.
– This did as well as our 2007 Progress Prize winning combo of 107 models!
– Even a small handful of 'simple' models gets you 80-85% of the way.
In practice, .01% does not matter much.
(Image courtesy LingPipe Blog)

49 Blending Models
Figuring out the right way to blend models was an important piece of our team strategy.
Subsequent slides courtesy of Michael Jahrer (BigChaos)… Keep in mind…

50 Probe Blending < Quiz Blending
A few facts before we go on:
1) Every model i can be reduced to its prediction vector on the probe set (p_i), which can be compared to the true probe answers (r).
2) The model can also be applied to the quiz set (q_i), but can only be compared to the true answers (s) by submission.
3) We submitted all individual models to get the RMSE for every model i.
4) The distribution of the quiz vector (how many 1s, 2s, 3s, etc.) was well known. This gives the variance of the vector.
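To make fact (1) concrete, a minimal sketch of probe blending, assuming a plain least-squares combination (names are illustrative): stack each model's probe predictions p_i as columns, fit weights against the true probe answers r, then apply the same weights to the quiz predictions q_i.

```python
import numpy as np

def blend_weights(probe_preds, r):
    """Least-squares blend: find w minimizing ||P w - r||^2."""
    P = np.column_stack(probe_preds)            # one column per model's p_i
    w, *_ = np.linalg.lstsq(P, r, rcond=None)
    return w

# blended quiz prediction: np.column_stack(quiz_preds) @ w
```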

51 Probe Blending

52 Quiz Blending

