
1 The Netflix Prize Sam Tucker, Erik Ruggles, Kei Kubo, Peter Nelson and James Sheridan Advisor: Dave Musicant

2 The Problem

3 The User Meet Dave: He likes: 24, Highlander, Star Wars Episode V, Footloose, Dirty Dancing He dislikes: The Room, Star Wars Episode II, Barbarella, Flesh Gordon What new movies would he like to see? What would he rate: Star Trek, Battlestar Galactica, Grease, Forrest Gump?

4 The Other User Meet College Dave: He likes: 24, Highlander, Star Wars Episode V, Barbarella, Flesh Gordon He dislikes: The Room, Star Wars Episode II, Footloose, Dirty Dancing What new movies would he like to see? What would he rate: Star Trek, Battlestar Galactica, Grease, Forrest Gump?

5 The Netflix Prize Netflix offered $1 million to anyone who could improve on their existing system by 10% Huge publicly available set of ratings for contestants to “train” their systems on Small “probe” set for contestants to test their own systems Larger hidden set of ratings to officially test the submissions Performance measured by RMSE (root mean squared error)
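The RMSE metric above can be computed in a few lines; a minimal sketch (the rating values are made up for illustration):

```python
import math

def rmse(predicted, actual):
    """Root mean squared error between two equal-length lists of ratings."""
    n = len(actual)
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / n)

# A predictor that is off by exactly 1 star on every rating scores RMSE 1.0.
print(rmse([4.0, 2.0, 5.0], [3.0, 3.0, 4.0]))
```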

6 The Project For a given user and movie, predict the rating – RBMs – kNN, LPP – SVD Identify patterns in the data – Clustering Make pretty pictures – Force-directed Layout

7 The Dataset 17,770 movies 480,189 users About 100 million ratings Efficiency is paramount: – Storing as a dense matrix: at least 5 GB (too big) – Storing as a flat list: 0.5 GB (linear search too slow) We started running it in Python in October…
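One middle ground between the two layouts above is a per-user sparse index: compact arrays keep memory close to the flat list, while binary search avoids the linear scan. A hypothetical sketch (the class and names are ours, not the project's code):

```python
from array import array
from bisect import bisect_left

class SparseRatings:
    """Ratings stored per user as sorted parallel arrays of
    movie ids (4-byte ints) and ratings (1-byte ints)."""

    def __init__(self):
        self.by_user = {}  # user_id -> (sorted movie_id array, rating array)

    def add(self, user, movie, rating):
        movies, ratings = self.by_user.setdefault(user, (array('i'), array('b')))
        i = bisect_left(movies, movie)
        movies.insert(i, movie)
        ratings.insert(i, rating)

    def get(self, user, movie):
        """Binary-search a user's movies; None if unrated."""
        movies, ratings = self.by_user.get(user, (array('i'), array('b')))
        i = bisect_left(movies, movie)
        if i < len(movies) and movies[i] == movie:
            return ratings[i]
        return None
```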

8 The Dataset

9 Results [RMSE table: Netflix 0.9525; the RBMs, kNN, SVD, and Clustering values were not captured in the transcript]

10 Restricted Boltzmann Machines

11 Goals Create a better recommender than Netflix Investigate Problem Children of Netflix Dataset – Napoleon Dynamite Problem – Users with few ratings

12 Neural Networks Want to use Neural Networks – Layers – Weights – Threshold

13–17 [Animation: a small neural network with input nodes Cloudy, Freezing, and Umbrella feeding a hidden layer, which feeds the output node “Is it Raining?”]

18 Neural Networks Want to use Neural Networks – Layers – Weights – Threshold – Hard to train large Nets RBMs – Fast and Easy to Train – Use Randomness – Biases

19 Structure Two sides – Visible – Hidden All nodes are binary – Calculate a probability – Compare against a random number
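The “calculate probability, then compare against a random number” step can be sketched for binary hidden units. A minimal illustration (the function names and the injectable random source are ours, not the project's code):

```python
import math, random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sample_hidden(visible, weights, hidden_bias, rng=random.random):
    """Each hidden unit fires with probability sigmoid(bias + w . v);
    a random draw then fixes its binary state."""
    states = []
    for j, bias in enumerate(hidden_bias):
        activation = bias + sum(v * weights[i][j] for i, v in enumerate(visible))
        p = sigmoid(activation)
        states.append(1 if rng() < p else 0)
    return states
```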

20–22 [Animation: an RBM whose visible units correspond to the movies 24, Footloose, Highlander, and The Room, with one rating initially missing]

23 Contrastive Divergence Positive Side – Insert actual user ratings – Calculate hidden side

24–25 [Animation continued: positive phase, with the user's actual ratings on the visible side and the hidden units computed from them]

26 Contrastive Divergence Positive Side – Insert actual user ratings – Calculate hidden side Negative Side – Calculate Visual side – Calculate hidden side
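The positive and negative phases above combine into one weight update. A simplified CD-1 sketch on a tiny binary RBM (biases are dropped and probabilities are used in place of random samples to keep it deterministic; this is an illustration, not the project's implementation):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def cd1_update(v0, W, lr=0.1):
    """One contrastive-divergence (CD-1) step: positive phase from the
    data, negative phase from a one-step reconstruction."""
    n_vis, n_hid = len(W), len(W[0])
    # Positive phase: hidden probabilities from the actual ratings v0.
    h0 = [sigmoid(sum(v0[i] * W[i][j] for i in range(n_vis)))
          for j in range(n_hid)]
    # Negative phase: reconstruct the visible side, then the hidden side.
    v1 = [sigmoid(sum(h0[j] * W[i][j] for j in range(n_hid)))
          for i in range(n_vis)]
    h1 = [sigmoid(sum(v1[i] * W[i][j] for i in range(n_vis)))
          for j in range(n_hid)]
    # Push weights toward <v h>_data and away from <v h>_reconstruction.
    for i in range(n_vis):
        for j in range(n_hid):
            W[i][j] += lr * (v0[i] * h0[j] - v1[i] * h1[j])
    return W
```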

27–31 [Animation continued: negative phase, alternating between recomputing the visible side and the hidden side]

32 Predicting Ratings For each user: – Insert known ratings – Calculate the hidden side For each movie: – Calculate the probability of each rating – Take the expected value
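The final expected-value step might look like this (the probability vector below is made up for illustration):

```python
def predict_rating(star_probs):
    """Given P(rating = 1..5) for one movie, return the expected rating."""
    return sum(star * p for star, p in enumerate(star_probs, start=1))

# A distribution leaning toward 4 stars yields a prediction near 3.9.
print(predict_rating([0.0, 0.1, 0.2, 0.4, 0.3]))
```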

33–36 [Animation: predicting a rating for BSG (Battlestar Galactica) from the user's known ratings of Footloose, Highlander, and The Room]

37 Results [Training log, Fri Feb 19, 09:18 to 17:54: training and probe RMSE for iterations 0 through 19; numeric values not captured in the transcript] – __% better than Netflix’s advertised error for the competition – Cult movies and users with few ratings: [values not captured]

38 Results [RMSE comparison table: Netflix, RBMs, kNN, SVD, Clustering; values not captured in the transcript]

39 k Nearest Neighbors

40 kNN One of the most common algorithms for finding similar users in a dataset. Simple, but there are various ways to implement it: – Distance calculation: Euclidean distance, cosine similarity – Analysis: average, weighted average, majority vote
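The two distance calculations can be sketched directly; plain-Python versions (not the project's code):

```python
import math

def euclidean(a, b):
    """Straight-line distance between two rating vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between two rating vectors; 1 means the same
    direction, 0 means orthogonal (or an all-zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0
```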

41 The Methods of Measuring Distances [Figure: Euclidean distance D(a, b) between two points, and cosine similarity as the angle θ between two vectors]

42 The Problem with Cosine Similarity Problem: – Because the user–movie matrix is highly sparse, two users often have no commonly rated movies. Conclusion: – Such users cannot be compared: with no common rated movie, the similarity is 0. Solution: – Assign small default values to unrated entries to avoid this.

43 RMSE (Root Mean Squared Error) [Table: RMSE for each value of k under Euclidean distance, cosine similarity*, and cosine similarity with default values; numeric values not captured in the transcript] * For cosine similarity, the RMSE covers only the predictions the program returned; many predictions are missing where it could not find nearest neighbors.

44 Local Minimum Issue

45–48 [Figure slides illustrating the local minimum issue]

49 Dimensionality Reduction LPP (Locality Preserving Projections) 1. Construct the adjacency graph 2. Choose the weights 3. Solve the generalized eigenvector equation X L Xᵀ a = λ X D Xᵀ a, where W is the weight matrix, D is the diagonal matrix of its row sums, and L = D − W
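Step 3 can be sketched with NumPy. A minimal illustration of the generalized eigenproblem, assuming W is a symmetric adjacency-weight matrix and X stacks the data points as columns (this is our own sketch, not the project's implementation):

```python
import numpy as np

def lpp(X, W, d):
    """Locality Preserving Projections sketch: returns the d projection
    vectors a solving X L X^T a = lambda X D X^T a for the smallest
    eigenvalues, with D = diag(row sums of W) and L = D - W."""
    D = np.diag(W.sum(axis=1))
    L = D - W
    A = X @ L @ X.T
    B = X @ D @ X.T
    # Solve the generalized problem via B^{-1} A (fine for a small sketch).
    vals, vecs = np.linalg.eig(np.linalg.solve(B, A))
    order = np.argsort(vals.real)
    return vecs[:, order[:d]].real
```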

50 The Result of Dimensionality Reduction Other techniques when k = 15: – Euclidean: error = __ – Cosine: error = __ – Cosine w/ defaults: error = __ Using the dimensionality reduction technique: – k = 15 and d = 100: error = __ [numeric values not captured in the transcript]

51 Results [RMSE comparison table: Netflix, RBMs, kNN, SVD, Clustering; values not captured in the transcript]

52 Singular Value Decomposition

53 The Dataset

54 A Simpler Dataset

55 [Figure: a collection of points, shown as a scatterplot]

56 Low-Rank Approximations [Figure] The points mostly lie on a plane; the perpendicular variation is noise

57 Low-Rank Approximations How do we discover the underlying 2d structure of the data? Roughly speaking, we want the “2d” matrix that best explains our data. Formally, we want the rank-2 matrix X̂ minimizing ‖X − X̂‖, the sum-of-squares (Frobenius) distance to our data matrix X.

58 Low-Rank Approximations Singular Value Decomposition (SVD) in the world of linear algebra Principal Component Analysis (PCA) in the world of statistics
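The link between the two views is the Eckart–Young result: truncating the SVD gives the best rank-k approximation under the Frobenius norm. A NumPy sketch:

```python
import numpy as np

def low_rank(A, k):
    """Best rank-k approximation of A in the Frobenius norm,
    obtained by keeping the top k singular values/vectors."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] * s[:k] @ Vt[:k, :]
```

Applied to an image or a noisy point cloud, the same three lines compress and denoise; for Netflix, the same idea fills in missing structure.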

59 Practical Applications Compressing images Discovering structure in data “Denoising” data Netflix: Filling in missing entries (i.e., ratings)

60 Netflix as Seen Through SVD

61 Strategy to solve the Netflix problem: – Assume the data has a simple (affine) structure with added noise – Find the low-rank matrix that best approximates our known values (i.e., infer that simple structure) – Fill in the missing entries based on that matrix – Recommend movies based on the filled-in values

62 Netflix as Seen Through SVD

63 Every user is represented by a k-dimensional vector (this is the matrix U) Every movie is represented by a k-dimensional vector (this is the matrix M) Predicted ratings are dot products between user vectors and movie vectors

64 SVD Implementation Alternating Least Squares: – Initialize U and M randomly – Hold U constant and solve for M (least squares) – Hold M constant and solve for U (least squares) – Keep switching back and forth, until your error on the training set isn’t changing much (alternating) – See how it did!
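The alternation above can be sketched on a small, fully observed matrix. Note the hedge: the real Netflix version solves each least-squares problem only over the known ratings, while this simplified illustration ignores missing entries:

```python
import numpy as np

def als(R, k, iters=50, seed=0):
    """Alternating least squares on a fully observed matrix R:
    hold U fixed and solve for M, then hold M fixed and solve for U."""
    rng = np.random.default_rng(seed)
    n_users, n_movies = R.shape
    U = rng.normal(size=(n_users, k))
    M = rng.normal(size=(k, n_movies))
    for _ in range(iters):
        M = np.linalg.lstsq(U, R, rcond=None)[0]        # solve U M ~ R for M
        U = np.linalg.lstsq(M.T, R.T, rcond=None)[0].T  # solve M^T U^T ~ R^T for U
    return U, M
```

A predicted rating is then the dot product of a user row of U with a movie column of M, matching the previous slide.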

65 SVD Results How did it do? – Probe set: RMSE of about .90, a ??% improvement over the Netflix recommender system

66 Dimensional Fun Each movie or user is represented by a 60-dimensional vector Do the dimensions mean anything? Is there an “action” dimension or a “comedy” dimension, for instance?

67 Dimensional Fun Some of the lowest movies along the 0th dimension: – Michael Moore Hates America – In the Face of Evil: Reagan’s War in Word & Deed – Veggie Tales: Bible Heroes – Touched by an Angel: Season 2 – A History of God

68 Dimensional Fun Some of the highest movies along the 47th dimension: – Emanuelle in America – Lust for Dracula – Timegate: Tales of the Saddle Tramps – Legally Exposed – Sexual Matrix

69 Dimensional Fun Some of the highest movies along the 55th dimension: – Strange Things Happen at Sundown – Alien 3000 – Shaolin vs. Evil Dead – Dark Harvest – Legend of the Chupacabra

70 Results [RMSE comparison table: Netflix, RBMs, kNN, SVD, Clustering; values not captured in the transcript]

71 Clustering

72 Goals Identify groups of similar movies Provide ratings based on similarity between movies Provide ratings based on similarity between users

73–80 [Figure slides: animation of clustering the ratings data into groups around prototypes]

81 Predictions We want to know what College Dave will think of “Grease”. Find out what he thinks of the prototype most similar to “Grease”.

82 College Dave gives “Grease” 1 Star!
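The prototype lookup above can be sketched as follows (the vectors and ratings are illustrative, not the project's data):

```python
import math

def predict_from_prototypes(movie_vec, prototypes, user_ratings):
    """Rate an unseen movie by finding the nearest cluster prototype
    and returning the user's rating of that prototype."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    nearest = min(range(len(prototypes)),
                  key=lambda i: dist(movie_vec, prototypes[i]))
    return user_ratings[nearest]
```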

83 Other Approaches Distribute across many machines Density-based algorithms Ensembles – It is better to have a bunch of predictors that each do one thing well than one predictor that does everything well. – (In theory; in practice it didn’t help much.)

84 Results Rating prediction: – Best RMSE ≈ .93, but randomness gives us a pretty wide range Genre clustering: – Classifying based only on the most popular genre: 40% – Classifying based on the two most popular: 63%

85 Clustering Fun! [Cluster examples; the movie lists were images and are not in the transcript] (These are the ONLY two movies in the cluster) (These are AWESOME MOVIES!) (These are NOT!) (Pretty obvious) (Pretty surprising)

86 More Clustering Fun! [More cluster examples; images not in the transcript] (Also surprising) (For those of you born before 1965) (Insight into who actually likes Tom Cruise) (“Go forth, and kill! Zardoz has spoken.”)

87 The last of the fun [Final cluster examples; images not in the transcript] (Also, movies to recommend to College Dave) (If only we could recommend based on T-Shirt purchases…) (Intellectual humor.) (Ahhhhhhhh!!!!!) (Did not see the last one coming…)

88 Results NetflixRBMskNNSVDClustering RMSE

89 Visualization

90–106 [Figure slides: force-directed layout visualizations of the ratings data]

107 THANK YOU! Questions?

108 References ifsc.ualr.edu/xwxu/publications/kdd-96.pdf gael-varoquaux.info/scientific_computing/ica_pca/index.html

