Published by Donna Ketchum. Modified over 2 years ago.

1
The Netflix Prize
Sam Tucker, Erik Ruggles, Kei Kubo, Peter Nelson and James Sheridan
Advisor: Dave Musicant

2
The Problem

3
The User
Meet Dave:
He likes: 24, Highlander, Star Wars Episode V, Footloose, Dirty Dancing
He dislikes: The Room, Star Wars Episode II, Barbarella, Flesh Gordon
What new movies would he like to see?
What would he rate: Star Trek, Battlestar Galactica, Grease, Forrest Gump?

4
The Other User
Meet College Dave:
He likes: 24, Highlander, Star Wars Episode V, Barbarella, Flesh Gordon
He dislikes: The Room, Star Wars Episode II, Footloose, Dirty Dancing
What new movies would he like to see?
What would he rate: Star Trek, Battlestar Galactica, Grease, Forrest Gump?

5
The Netflix Prize
Netflix offered $1 million to anyone who could improve on its existing system by 10%
Huge publicly available set of ratings for contestants to “train” their systems on
Small “probe” set for contestants to test their own systems
Larger hidden set of ratings to officially test the submissions
Performance measured by RMSE
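
Since every result in this deck is reported as RMSE, a minimal sketch of the metric may help. The function name and the sample ratings are illustrative assumptions, not taken from the project code.

```python
import math

def rmse(predicted, actual):
    """Root mean squared error between two equal-length sequences of ratings."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    squared_error = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared_error / len(actual))

# Hypothetical predictions and true ratings on a 1-5 star scale.
print(rmse([3.5, 4.0, 2.0], [4, 4, 1]))
```

Lower is better: a perfect predictor scores 0, and large misses are penalized more heavily than small ones because the errors are squared before averaging.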

6
The Project
For a given user and movie, predict the rating
– RBMs
– kNN, LPP
– SVD
Identify patterns in the data
– Clustering
Make pretty pictures
– Force-directed Layout

7
The Dataset
17,770 movies
480,189 users
About 100 million ratings
Efficiency paramount:
– Storing as a matrix: At least 5G (too big)
– Storing as a list: 0.5G (linear search too slow)
We started running it in Python in October…
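
One common middle ground between the full matrix and the flat list is to sort the list by user and binary-search within each user's slice. The sketch below illustrates that idea only; the class name, the type codes and the layout are assumptions, and the slides do not say how the project actually stored its data.

```python
from array import array
from bisect import bisect_left

class RatingStore:
    """Compact rating storage: ratings sorted by (user, movie) in flat arrays."""

    def __init__(self, triples, num_users):
        # triples: iterable of (user_index, movie_index, rating), all small ints.
        ordered = sorted(triples)
        self.movies = array("H", (m for _, m, _ in ordered))   # unsigned short
        self.ratings = array("B", (r for _, _, r in ordered))  # unsigned byte
        # offsets[u] .. offsets[u + 1] is the slice holding user u's ratings.
        self.offsets = array("L", [0] * (num_users + 1))
        for u, _, _ in ordered:
            self.offsets[u + 1] += 1
        for u in range(num_users):
            self.offsets[u + 1] += self.offsets[u]

    def get(self, user, movie):
        """Return the rating, or None if the user has not rated the movie."""
        lo, hi = self.offsets[user], self.offsets[user + 1]
        i = bisect_left(self.movies, movie, lo, hi)  # binary search in the slice
        if i < hi and self.movies[i] == movie:
            return self.ratings[i]
        return None
```

A lookup costs one binary search over a single user's ratings instead of a scan over the whole list.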

8
The Dataset

9
Results
        Netflix   RBMs     kNN      SVD    Clustering
RMSE    0.9525

10
Restricted Boltzmann Machines

11
Goals
Create a better recommender than Netflix
Investigate Problem Children of Netflix Dataset
– Napoleon Dynamite Problem
– Users with few ratings

12
Neural Networks
Want to use Neural Networks
– Layers
– Weights
– Threshold

13
[Diagram: neural network with Input nodes (Cloudy, Freezing, Umbrella), a Hidden layer, and an Output node (Is it Raining?)]
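
The layers, weights and thresholds in the rain example can be sketched as a tiny hand-wired network. Every weight and threshold below is a hypothetical value chosen for illustration; the diagram's actual numbers are not recoverable from the slide text.

```python
def step(total, threshold):
    """Threshold unit: fires (1) when the weighted input reaches the threshold."""
    return 1 if total >= threshold else 0

def is_it_raining(cloudy, freezing, umbrella):
    # All weights and thresholds below are hypothetical, hand-picked values.
    # Hidden unit 1 fires when it is cloudy and not freezing.
    h1 = step(1.0 * cloudy - 1.0 * freezing, threshold=1.0)
    # Hidden unit 2 fires when umbrellas are out.
    h2 = step(1.0 * umbrella, threshold=1.0)
    # Output: rain is predicted only when both hidden units fire.
    return step(1.0 * h1 + 1.0 * h2, threshold=2.0)

print(is_it_raining(cloudy=1, freezing=0, umbrella=1))
```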

18
Neural Networks
Want to use Neural Networks
– Layers
– Weights
– Threshold
– Hard to train large Nets
RBMs
– Fast and Easy to Train
– Use Randomness
– Biases

19
Structure
Two sides
– Visible
– Hidden
All nodes Binary
– Calculate Probability
– Random Number
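
A binary stochastic node of this kind can be sketched as follows: compute a probability from the weighted input, then compare it with a random number. The sigmoid activation is the usual choice for RBMs and is assumed here; the function names are illustrative.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sample_unit(inputs, weights, bias, rng=random):
    """Return (probability, state) for one binary unit."""
    activation = bias + sum(w * x for w, x in zip(weights, inputs))
    probability = sigmoid(activation)                # "Calculate Probability"
    state = 1 if rng.random() < probability else 0   # "Random Number"
    return probability, state
```

Passing a seeded `random.Random` as `rng` makes runs reproducible.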

20
[Diagram: rating units 1 to 5 for each of 24, Footloose, Highlander and The Room, with a "Missing" label]

23
Contrastive Divergence
Positive Side
– Insert actual user ratings
– Calculate hidden side

24
[Diagram: rating units 1 to 5 for each of 24, Footloose, Highlander and The Room, with a "Missing" label]

26
Contrastive Divergence
Positive Side
– Insert actual user ratings
– Calculate hidden side
Negative Side
– Calculate visible side
– Calculate hidden side
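
The positive and negative sides can be sketched as a single CD-1 update. This is a simplified illustration that assumes plain binary visible units and a single training vector, rather than the five-way rating units drawn in the diagrams; the learning rate and all names are hypothetical.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def hidden_probs(v, W, b_hid):
    # W[i][j] connects visible unit i to hidden unit j.
    return [sigmoid(b_hid[j] + sum(v[i] * W[i][j] for i in range(len(v))))
            for j in range(len(b_hid))]

def visible_probs(h, W, b_vis):
    return [sigmoid(b_vis[i] + sum(h[j] * W[i][j] for j in range(len(h))))
            for i in range(len(b_vis))]

def sample(probs, rng):
    return [1 if rng.random() < p else 0 for p in probs]

def cd1_update(v0, W, b_vis, b_hid, rate=0.1, rng=random):
    """One contrastive-divergence (CD-1) step on a single training vector."""
    # Positive side: clamp the data, compute the hidden side.
    h0_p = hidden_probs(v0, W, b_hid)
    h0 = sample(h0_p, rng)
    # Negative side: reconstruct the visible side, then the hidden side again.
    v1_p = visible_probs(h0, W, b_vis)
    v1 = sample(v1_p, rng)
    h1_p = hidden_probs(v1, W, b_hid)
    # Move weights toward the data statistics and away from the reconstruction.
    for i in range(len(v0)):
        for j in range(len(b_hid)):
            W[i][j] += rate * (v0[i] * h0_p[j] - v1[i] * h1_p[j])
        b_vis[i] += rate * (v0[i] - v1[i])
    for j in range(len(b_hid)):
        b_hid[j] += rate * (h0_p[j] - h1_p[j])
    return v1_p  # reconstruction probabilities, useful for monitoring
```

The function updates `W`, `b_vis` and `b_hid` in place; repeating it over many training vectors is what one training iteration amounts to in this sketch.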

27
[Diagram: rating units 1 to 5 for each of 24, Footloose, Highlander and The Room, with a "Missing" label]

32
Predicting Ratings
For each user:
– Insert known ratings
– Calculate Hidden side
– For each movie:
  – Calculate probability of all ratings
  – Take expected value
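
The last two steps can be sketched for a single movie. The sketch assumes the five rating units of a movie are normalized with a softmax, which is a common choice for rating RBMs but is not stated on the slide; the function name is illustrative.

```python
import math

def expected_rating(activations):
    """Expected star rating from the five rating-unit activations of one movie.

    activations[k] is the total input to the unit for rating k + 1.
    """
    peak = max(activations)                       # subtract for numerical stability
    weights = [math.exp(a - peak) for a in activations]
    total = sum(weights)
    probs = [w / total for w in weights]          # probability of each rating
    return sum((k + 1) * p for k, p in enumerate(probs))
```

Taking the expected value instead of the most probable rating yields fractional predictions, which generally helps under a squared-error metric such as RMSE.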

33
[Diagram: rating units 1 to 5 for each of 24, Footloose, Highlander, The Room and BSG, with a "Missing" label]

37
Results
Fri Feb 19 09:18:59 2010
The RMSE for iteration 0 is 0.904828 with a probe RMSE of 0.977709
The RMSE for iteration 1 is 0.861516 with a probe RMSE of 0.945408
The RMSE for iteration 2 is 0.847299 with a probe RMSE of 0.936846
...
The RMSE for iteration 17 is 0.802811 with a probe RMSE of 0.925694
The RMSE for iteration 18 is 0.802389 with a probe RMSE of 0.925146
The RMSE for iteration 19 is 0.801736 with a probe RMSE of 0.925184
Fri Feb 19 17:54:02 2010
2.857% better than Netflix’s advertised error of 0.9525 for the competition
Cult Movies: 1.1663
Few Ratings: 1.0510

38
Results
        Netflix   RBMs     kNN      SVD    Clustering
RMSE    0.9525    0.9252

39
k Nearest Neighbors

40
kNN
One of the most common algorithms for finding similar users in a dataset.
Simple but various ways to implement
– Calculation
  Euclidean Distance
  Cosine Similarity
– Analysis
  Average
  Weighted Average
  Majority
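
The "Analysis" step can be sketched for the average and weighted-average variants. The sketch assumes non-negative similarity scores where larger means more similar; the function name and input format are illustrative.

```python
def knn_predict(neighbors, k, weighted=True):
    """Predict a rating from the k most similar users who rated the movie.

    neighbors: list of (similarity, rating) pairs, one per candidate user.
    """
    top = sorted(neighbors, key=lambda pair: pair[0], reverse=True)[:k]
    if not top:
        return None                      # no neighbors: no prediction
    if weighted:
        total_weight = sum(sim for sim, _ in top)
        if total_weight > 0:
            return sum(sim * r for sim, r in top) / total_weight
    # Plain average (also the fallback when all similarities are zero).
    return sum(r for _, r in top) / len(top)
```

With a distance such as Euclidean, smaller means closer, so the distances would first have to be converted into similarities or sorted in ascending order.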

41
The Methods of Measuring Distances
Euclidean Distance: D(a, b)
Cosine Similarity: θ
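
Both measures can be sketched on dense rating vectors of equal length. The function names are illustrative; how the project handled unrated movies in these vectors is not shown on the slide.

```python
import math

def euclidean_distance(a, b):
    """D(a, b): straight-line distance between two rating vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """cos(theta): cosine of the angle between two rating vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0                      # undefined angle; treat as "no similarity"
    return dot / (norm_a * norm_b)
```

Euclidean distance depends on vector length, while cosine similarity depends only on direction, so two users whose ratings are proportional have cosine similarity 1 even when they are far apart.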

42
The Problem of Cosine Similarity
Problem:
– Because the matrix of users and movies is highly sparse, we often cannot find users who rated the same movies.
Conclusion:
– We cannot compare users in these cases, because the similarity becomes 0 when there is no commonly rated movie.
Solution:
– Set small default values to avoid this.
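
One way the default-value fix could work is sketched below. The default of 0.1 and the rule for where defaults are applied are assumptions; the slide does not say which values or rule the project used.

```python
import math

def cosine_with_defaults(ratings_a, ratings_b, default=0.1):
    """Cosine similarity between two users' sparse ratings.

    ratings_a, ratings_b: dicts mapping movie id -> rating.
    Movies rated by only one of the two users get `default` for the other,
    so the dot product is non-zero even with no movie in common.
    """
    movies = set(ratings_a) | set(ratings_b)
    if not movies:
        return 0.0
    a = [ratings_a.get(m, default) for m in movies]
    b = [ratings_b.get(m, default) for m in movies]
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0
```

Setting `default=0.0` reproduces the original problem: two users with no movie in common get similarity 0.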

43
RMSE (Root Mean Squared Error)
k     Euclidean   Cosine Similarity*   Cosine Similarity w/ Default Values
1     1.593319    1.442683             1.430385
2     1.390024    1.277889             1.257577
3     1.293187    1.224314             1.222081
…     …           …                    …
27    1.160647    1.147757             1.149164
28    1.160366    1.147843             1.149094
29    1.160058    1.148418             1.149145
* For Cosine Similarity, the RMSE is computed only over the predictions the program returned. There are many missing predictions where the program could not find nearest neighbors.

44
Local Minimum Issue

49
Dimensionality Reduction
LPP (Locality Preserving Projections)
1. Construct the adjacency graph
2. Choose the weights
3. Compute the eigenvector equation below:
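
For reference, the standard LPP formulation solves the generalized eigenvector problem below. Whether the slide used exactly this notation is an assumption; the symbols follow the usual presentation of the method.

```latex
% X: data matrix with one column per point; W: edge weights chosen in step 2
% D: diagonal matrix with D_{ii} = \sum_j W_{ij};  L = D - W (graph Laplacian)
X L X^{T} \mathbf{a} = \lambda \, X D X^{T} \mathbf{a}
```

The eigenvectors belonging to the d smallest eigenvalues form the columns of the projection matrix, and each point x is mapped to the d-dimensional vector obtained by applying that projection to x.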

50
The Result of Dimensionality Reduction
Other techniques when k = 15:
– Euclidean: error = 1.173049
– Cosine: error = 1.147835
– Cosine w/ Defaults: error = 1.148560
Using dimensionality reduction technique:
– k = 15 and d = 100: error = 1.060185

51
Results
        Netflix   RBMs     kNN      SVD    Clustering
RMSE    0.9525    0.9252   1.0602

52
Singular Value Decomposition

53
The Dataset

54
A Simpler Dataset

55
Collection of points
A Scatterplot

56
Low-Rank Approximations
The points mostly lie on a plane
Perpendicular variation = noise

57
Low-Rank Approximations
How do we discover the underlying 2d structure of the data?
Roughly speaking, we want the “2d” matrix that best explains our data.
Formally,
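
The usual formal statement of this goal is the best rank-2 approximation in the least-squares sense, shown below. This is the standard formulation and is assumed to match the slide's formula; the symbols A, B and the hatted matrix are chosen here for illustration.

```latex
% A: the data matrix; \hat{A}: its best rank-2 approximation
\hat{A} = \arg\min_{B \,:\, \operatorname{rank}(B) \le 2} \; \lVert A - B \rVert_F^2,
\qquad
\lVert A - B \rVert_F^2 = \sum_{i,j} \left( A_{ij} - B_{ij} \right)^2
```

For a fully observed matrix, the minimizer is obtained by keeping the two largest singular values of A and discarding the rest.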

58
Low-Rank Approximations
Singular Value Decomposition (SVD) in the world of linear algebra
Principal Component Analysis (PCA) in the world of statistics

59
Practical Applications
Compressing images
Discovering structure in data
“Denoising” data
Netflix: Filling in missing entries (i.e., ratings)

60
Netflix as Seen Through SVD

61
Strategy to solve the Netflix problem:
– Assume the data has a simple (affine) structure with added noise
– Find the low-rank matrix that best approximates our known values (i.e., infer that simple structure)
– Fill in the missing entries based on that matrix
– Recommend movies based on the filled-in values

62
Netflix as Seen Through SVD

63
Every user is represented by a k-dimensional vector (This is the matrix U)
Every movie is represented by a k-dimensional vector (This is the matrix M)
Predicted ratings are dot products between user vectors and movie vectors
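
A predicted rating under this model is a single dot product, as the short sketch below shows. The vectors are hypothetical k = 3 examples, not values from the trained model.

```python
def predict_rating(user_vector, movie_vector):
    """Predicted rating = dot product of a user vector and a movie vector."""
    if len(user_vector) != len(movie_vector):
        raise ValueError("user and movie vectors must both have k entries")
    return sum(u * m for u, m in zip(user_vector, movie_vector))

# Hypothetical k = 3 vectors.
dave = [1.2, 0.5, -0.3]
grease = [0.9, 2.0, 1.0]
print(predict_rating(dave, grease))
```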

64
SVD Implementation
Alternating Least Squares:
– Initialize U and M randomly
– Hold U constant and solve for M (least squares)
– Hold M constant and solve for U (least squares)
– Keep switching back and forth, until your error on the training set isn’t changing much (alternating)
– See how it did!
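
The alternating steps can be sketched for the simplest case, k = 1, where each least-squares solve reduces to a closed-form ratio. This is an illustration only: the function name, the small regularization term `reg` and the fixed number of sweeps (instead of a convergence check) are assumptions, and a k-dimensional version would solve a small linear system per user and per movie.

```python
import random

def als_rank1(ratings, num_users, num_movies, sweeps=50, reg=0.01, seed=0):
    """Alternating least squares with k = 1 on known (user, movie, rating) triples."""
    rng = random.Random(seed)
    u = [rng.uniform(0.5, 1.5) for _ in range(num_users)]   # initialize U randomly
    m = [rng.uniform(0.5, 1.5) for _ in range(num_movies)]  # initialize M randomly
    for _ in range(sweeps):
        # Hold U constant and solve for M: each m[j] is a 1-d least-squares fit.
        num, den = [0.0] * num_movies, [reg] * num_movies
        for i, j, r in ratings:
            num[j] += u[i] * r
            den[j] += u[i] * u[i]
        m = [n / d for n, d in zip(num, den)]
        # Hold M constant and solve for U the same way.
        num, den = [0.0] * num_users, [reg] * num_users
        for i, j, r in ratings:
            num[i] += m[j] * r
            den[i] += m[j] * m[j]
        u = [n / d for n, d in zip(num, den)]
    return u, m
```

Only the known ratings enter the sums, so missing entries never influence the fit; they are filled in afterwards as `u[i] * m[j]`.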

65
SVD Results
How did it do?
– Probe Set: RMSE of about 0.90, ??% improvement over the Netflix recommender system

66
Dimensional Fun
Each movie or user is represented by a 60-dimensional vector
Do the dimensions mean anything?
Is there an “action” dimension or a “comedy” dimension, for instance?

67
Dimensional Fun
Some of the lowest movies along the 0th dimension:
– Michael Moore Hates America
– In the Face of Evil: Reagan’s War in Word & Deed
– Veggie Tales: Bible Heroes
– Touched by an Angel: Season 2
– A History of God

68
Dimensional Fun
Some of the highest movies along the 47th dimension:
– Emanuelle in America
– Lust for Dracula
– Timegate: Tales of the Saddle Tramps
– Legally Exposed
– Sexual Matrix

69
Dimensional Fun
Some of the highest movies along the 55th dimension:
– Strange Things Happen at Sundown
– Alien 3000
– Shaolin vs. Evil Dead
– Dark Harvest
– Legend of the Chupacabra

70
Results
        Netflix   RBMs     kNN      SVD    Clustering
RMSE    0.9525    0.9252   1.0602   0.90

71
Clustering

72
Goals
Identify groups of similar movies
Provide ratings based on similarity between movies
Provide ratings based on similarity between users

81
Predictions
We want to know what College Dave will think of “Grease”.
Find out what he thinks of the prototype most similar to “Grease”.
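
The prediction rule can be sketched as a nearest-prototype lookup. Everything concrete below is hypothetical: the 2-d feature vectors, the prototype names, the use of Euclidean distance and the user's opinions are invented for illustration, since the slides describing the clustering itself are not part of this transcript.

```python
import math

def nearest_prototype(movie_vector, prototypes):
    """Return the name of the prototype closest to the movie (Euclidean distance)."""
    return min(prototypes, key=lambda name: math.dist(movie_vector, prototypes[name]))

def predict_from_prototype(movie_vector, prototypes, user_opinion):
    """Predict a user's rating of a movie as their rating of its nearest prototype."""
    return user_opinion[nearest_prototype(movie_vector, prototypes)]

# Hypothetical 2-d movie features, cluster prototypes and user opinions.
prototypes = {"musicals": [0.9, 0.1], "sci-fi": [0.1, 0.9]}
college_dave = {"musicals": 1, "sci-fi": 5}
grease = [0.8, 0.2]
print(predict_from_prototype(grease, prototypes, college_dave))
```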

82
College Dave gives “Grease” 1 Star!

83
Other Approaches
Distribute across many machines
Density Based Algorithms
Ensembles
– It is better to have a bunch of predictors that can each do one thing well than one predictor that can do everything well.
– (In theory, but it actually doesn’t help much.)

84
Results
Rating prediction
– Best RMSE ≈ 0.93, but randomness gives us a pretty wide range.
Genre Clustering
– Classifying based only on the most popular: 40%
– Classifying based on two most popular: 63%

85
Clustering Fun!
(These are the ONLY two movies in the cluster)
(These are AWESOME MOVIES!)
(These are NOT!)
(Pretty obvious)
(Pretty surprising)

86
More Clustering Fun!
(Also surprising)
(For those of you born before 1965)
(Insight into who actually likes Tom Cruise)
(“Go forth, and kill! Zardoz has spoken.”)

87
The last of the fun
(Also, movies to recommend to College Dave)
(If only we could recommend based on T-Shirt purchases…)
(Intellectual humor.)
(Ahhhhhhhh!!!!!)
(Did not see the last one coming…)

88
Results
        Netflix   RBMs     kNN      SVD    Clustering
RMSE    0.9525    0.9252   1.0602   0.90   0.93

89
Visualization

107
THANK YOU!
Questions?
– Email compsgroup@lists.carleton.edu

