
1
The Netflix Prize
Sam Tucker, Erik Ruggles, Kei Kubo, Peter Nelson, and James Sheridan
Advisor: Dave Musicant

2
The Problem

3
The User
Meet Dave. He likes: 24, Highlander, Star Wars Episode V, Footloose, Dirty Dancing. He dislikes: The Room, Star Wars Episode II, Barbarella, Flesh Gordon.
What new movies would he like to see? What would he rate Star Trek, Battlestar Galactica, Grease, or Forrest Gump?

4
The Other User
Meet College Dave. He likes: 24, Highlander, Star Wars Episode V, Barbarella, Flesh Gordon. He dislikes: The Room, Star Wars Episode II, Footloose, Dirty Dancing.
What new movies would he like to see? What would he rate Star Trek, Battlestar Galactica, Grease, or Forrest Gump?

5
The Netflix Prize
Netflix offered $1 million to anyone who could improve on their existing system by 10%.
Huge, publicly available set of ratings for contestants to "train" their systems on.
Small "probe" set for contestants to test their own systems.
Larger hidden set of ratings to officially score the submissions.
Performance measured by RMSE.
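RMSE squares each error before averaging, so large misses are penalized heavily. A minimal sketch (the function name and sample numbers are illustrative, not from the project):

```python
import math

def rmse(predicted, actual):
    """Root mean squared error between equal-length lists of ratings."""
    n = len(actual)
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / n)

# Toy example: three predictions against true 1-5 star ratings.
print(rmse([3.5, 4.0, 2.0], [4, 4, 3]))
```

Given the 0.9525 baseline quoted later in these slides, a 10% improvement meant reaching roughly 0.857 on the hidden test set.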

6
The Project
For a given user and movie, predict the rating:
– RBMs
– kNN, LPP
– SVD
Identify patterns in the data:
– Clustering
Make pretty pictures:
– Force-directed layout

7
The Dataset
17,770 movies; 480,189 users; about 100 million ratings.
Efficiency is paramount:
– Storing as a dense matrix: at least 5 GB (too big)
– Storing as a list: 0.5 GB (but linear search is too slow)
We started running it in Python in October…
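One way to get both compactness and fast lookups (a sketch of the general idea, not the project's actual data structure): keep each user's ratings in sorted parallel arrays and binary-search them.

```python
from bisect import bisect_left

# Toy ratings as (user_id, movie_id, rating). The real data has ~100M of these.
triples = [(0, 5, 4), (0, 12, 3), (1, 5, 5), (1, 7, 2)]

# Group each user's ratings into sorted parallel arrays so a single lookup
# is a binary search instead of a linear scan over the whole list.
by_user = {}
for u, m, r in sorted(triples):
    movies, ratings = by_user.setdefault(u, ([], []))
    movies.append(m)
    ratings.append(r)

def get_rating(user, movie):
    """Return the user's rating for a movie, or None if unrated."""
    movies, ratings = by_user.get(user, ([], []))
    i = bisect_left(movies, movie)
    if i < len(movies) and movies[i] == movie:
        return ratings[i]
    return None

print(get_rating(0, 12))
```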

8
The Dataset

9
Results
RMSE by method: Netflix 0.9525; RBMs …; kNN …; SVD …; Clustering …

10
Restricted Boltzmann Machines

11
Goals
Create a better recommender than Netflix.
Investigate the problem children of the Netflix dataset:
– The Napoleon Dynamite problem
– Users with few ratings

12
Neural Networks
Want to use neural networks:
– Layers
– Weights
– Threshold

13
[Figure, animated across slides 13–17: a neural network with input nodes Cloudy, Freezing, and Umbrella, a hidden layer, and an output node asking "Is it raining?"]
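The toy network above boils down to a single threshold unit; the weights and threshold below are made-up values for illustration:

```python
# A single threshold unit for the "Is it raining?" toy network.
# Weights and threshold are illustrative, not trained values.
WEIGHTS = {"cloudy": 0.6, "freezing": -0.2, "umbrella": 0.5}
THRESHOLD = 0.7

def fires(inputs):
    """Return True if the weighted sum of active inputs reaches the threshold."""
    total = sum(WEIGHTS[name] for name, on in inputs.items() if on)
    return total >= THRESHOLD

print(fires({"cloudy": True, "freezing": False, "umbrella": True}))  # → True
```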

18
Neural Networks
Want to use neural networks:
– Layers
– Weights
– Threshold
– But large nets are hard to train
RBMs:
– Fast and easy to train
– Use randomness
– Biases

19
Structure
Two sides:
– Visible
– Hidden
All nodes are binary:
– Calculate a probability
– Compare it to a random number to set the node's state
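The probability-then-random-number step looks like this in a sketch (sizes, weights, and the seed are all illustrative):

```python
import math, random

random.seed(0)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sample_hidden(visible, weights, hidden_bias):
    """Sample binary hidden states given binary visible states.

    weights[i][j] connects visible node i to hidden node j.
    """
    hidden = []
    for j, bias in enumerate(hidden_bias):
        activation = bias + sum(v * weights[i][j] for i, v in enumerate(visible))
        p = sigmoid(activation)              # probability the node turns on
        hidden.append(1 if random.random() < p else 0)
    return hidden

W = [[0.5, -0.3], [0.2, 0.8], [-0.6, 0.1]]   # 3 visible x 2 hidden
print(sample_hidden([1, 0, 1], W, [0.0, 0.0]))
```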

20
[Figure, animated across slides 20–22: an RBM with visible units for 24, Footloose, Highlander, and The Room (one rating missing), connected to hidden units.]

23
Contrastive Divergence
Positive side:
– Insert the user's actual ratings
– Calculate the hidden side

24
[Figure, slides 24–25: the positive phase, with activity flowing from the visible movie units up to the hidden units.]

26
Contrastive Divergence
Positive side:
– Insert the user's actual ratings
– Calculate the hidden side
Negative side:
– Recalculate the visible side from the hidden side
– Calculate the hidden side again
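Both phases fit in a few lines. This sketch uses probabilities instead of sampled states and omits biases (a common simplification); all weights and values are illustrative:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def hidden_probs(v, W):
    """P(h_j = 1 | v) for each hidden unit (biases omitted for brevity)."""
    return [sigmoid(sum(v[i] * W[i][j] for i in range(len(v))))
            for j in range(len(W[0]))]

def visible_probs(h, W):
    """P(v_i = 1 | h)."""
    return [sigmoid(sum(h[j] * W[i][j] for j in range(len(h))))
            for i in range(len(W))]

def cd1_step(v0, W, lr=0.1):
    """One CD-1 weight update: positive phase minus negative phase."""
    h0 = hidden_probs(v0, W)          # positive side
    v1 = visible_probs(h0, W)         # reconstruction (negative side)
    h1 = hidden_probs(v1, W)
    for i in range(len(W)):
        for j in range(len(W[0])):
            W[i][j] += lr * (v0[i] * h0[j] - v1[i] * h1[j])
    return W

W = [[0.1, -0.2], [0.0, 0.3], [0.2, 0.1]]   # 3 visible x 2 hidden
cd1_step([1, 0, 1], W)
```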

27
[Figure, animated across slides 27–31: the negative phase, reconstructing the visible movie units and recomputing the hidden units.]

32
Predicting Ratings
For each user:
– Insert known ratings
– Calculate the hidden side
For each movie:
– Calculate the probability of each rating
– Take the expected value
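The expected-value step, with an illustrative probability distribution over the five star ratings:

```python
# Turn a per-rating probability distribution into a predicted star rating
# by taking the expected value. The probabilities below are illustrative.
def expected_rating(probs):
    """probs[r] = P(rating == r + 1) for stars 1..5."""
    total = sum(probs)
    return sum((r + 1) * p for r, p in enumerate(probs)) / total

print(expected_rating([0.05, 0.10, 0.20, 0.40, 0.25]))  # ≈ 3.7 stars
```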

33
[Figure, animated across slides 33–36: predicting the missing rating for BSG from the hidden units inferred from Footloose, Highlander, and The Room.]

37
Results
Fri Feb 19 09:18
The RMSE for iteration 0 is … with a probe RMSE of …
The RMSE for iteration 1 is … with a probe RMSE of …
The RMSE for iteration 2 is … with a probe RMSE of …
The RMSE for iteration 17 is … with a probe RMSE of …
The RMSE for iteration 18 is … with a probe RMSE of …
The RMSE for iteration 19 is … with a probe RMSE of …
Fri Feb 19 17:54
…% better than Netflix's advertised error for the competition
Cult movies: …  Few ratings: …

38
RMSE by method: Netflix …; RBMs …; kNN …; SVD …; Clustering …

39
k Nearest Neighbors

40
kNN
One of the most common algorithms for finding similar users in a dataset.
Simple, but there are various ways to implement it:
– Distance calculation: Euclidean distance, cosine similarity
– Analysis: average, weighted average, majority

41
The Methods of Measuring Distances
Euclidean distance: D(a, b), the straight-line distance between two rating vectors.
Cosine similarity: based on the angle θ between the two vectors.
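Both measures in a few lines (the toy vectors echo Dave and College Dave rating three shared movies; the numbers are invented):

```python
import math

def euclidean(a, b):
    """Straight-line distance between two rating vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine(a, b):
    """Cosine of the angle between two rating vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

dave = [5, 4, 1]          # toy ratings on three shared movies
college_dave = [5, 1, 4]
print(euclidean(dave, college_dave), cosine(dave, college_dave))
```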

42
The Problem with Cosine Similarity
Problem:
– Because the user-movie matrix is highly sparse, we often cannot find users who have rated the same movies.
Conclusion:
– We cannot compare users in these cases, because the similarity becomes 0 when there is no commonly rated movie.
Solution:
– Fill in small default values to avoid this.
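A sketch of the default-value workaround (the default of 0.1, the movie names, and the ratings are all illustrative):

```python
import math

DEFAULT = 0.1  # small illustrative default for unrated movies

def cosine_with_defaults(ratings_a, ratings_b, all_movies):
    """Cosine similarity over {movie: rating} dicts, filling gaps with DEFAULT."""
    a = [ratings_a.get(m, DEFAULT) for m in all_movies]
    b = [ratings_b.get(m, DEFAULT) for m in all_movies]
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Two users with no commonly rated movie still get a nonzero similarity.
print(cosine_with_defaults({"24": 5}, {"Grease": 4}, ["24", "Grease", "Footloose"]))
```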

43
RMSE (Root Mean Squared Error)
Results by k for Euclidean distance, cosine similarity*, and cosine similarity with default values: …
* For cosine similarity, the RMSE is computed only over the predictions the program returned; many predictions are missing where it could not find nearest neighbors.

44
Local Minimum Issue

45
[Figures, slides 45–48: plots illustrating the local-minimum issue.]

49
Dimensionality Reduction
LPP (Locality Preserving Projections):
1. Construct the adjacency graph
2. Choose the weights
3. Solve the eigenvector equation
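The equation on the original slide is not preserved; the standard LPP formulation (He and Niyogi's generalized eigenvector problem) is:

```latex
X L X^{\top} \mathbf{a} \;=\; \lambda\, X D X^{\top} \mathbf{a}
```

where X is the data matrix, W holds the weights chosen in step 2, D is the diagonal matrix with entries D_ii = Σ_j W_ij, and L = D − W is the graph Laplacian; the eigenvectors a with the smallest eigenvalues give the projection directions.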

50
The Result of Dimensionality Reduction
Other techniques when k = 15:
– Euclidean: error = …
– Cosine: error = …
– Cosine w/ defaults: error = …
Using the dimensionality-reduction technique:
– k = 15 and d = 100: error = …

51
Results
RMSE by method: Netflix …; RBMs …; kNN …; SVD …; Clustering …

52
Singular Value Decomposition

53
The Dataset

54
A Simpler Dataset

55
A collection of points, shown as a scatterplot.

56
Low-Rank Approximations
The points mostly lie on a plane; perpendicular variation is noise.

57
Low-Rank Approximations
How do we discover the underlying 2-d structure of the data? Roughly speaking, we want the "2-d" matrix that best explains our data.
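The formal statement on the slide did not survive; the standard objective it gestures at is the best rank-k approximation under the Frobenius norm:

```latex
\hat{X} \;=\; \underset{\operatorname{rank}(B)\,\le\,k}{\arg\min}\;\lVert X - B \rVert_F^{2}
```

By the Eckart–Young theorem, the minimizer is obtained by truncating the SVD of X to its k largest singular values.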

58
Low-Rank Approximations
Singular Value Decomposition (SVD) in the world of linear algebra.
Principal Component Analysis (PCA) in the world of statistics.

59
Practical Applications
Compressing images
Discovering structure in data
"Denoising" data
Netflix: filling in missing entries (i.e., ratings)

60
Netflix as Seen Through SVD

61
Strategy to solve the Netflix problem:
– Assume the data has a simple (affine) structure with added noise
– Find the low-rank matrix that best approximates our known values (i.e., infer that simple structure)
– Fill in the missing entries based on that matrix
– Recommend movies based on the filled-in values

62
Netflix as Seen Through SVD

63
Every user is represented by a k-dimensional vector (this is the matrix U).
Every movie is represented by a k-dimensional vector (this is the matrix M).
Predicted ratings are dot products between user vectors and movie vectors.
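The prediction step itself is just a dot product. The 3-d vectors below are invented for illustration (the project used about 60 dimensions):

```python
# Predict a rating as the dot product of a user vector and a movie vector.
def predict(user_vec, movie_vec):
    return sum(u * m for u, m in zip(user_vec, movie_vec))

dave = [1.2, -0.4, 0.9]     # illustrative user factors
grease = [0.5, 2.0, 1.0]    # illustrative movie factors
print(predict(dave, grease))
```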

64
SVD Implementation
Alternating Least Squares:
– Initialize U and M randomly
– Hold U constant and solve for M (least squares)
– Hold M constant and solve for U (least squares)
– Keep switching back and forth until the error on the training set stops changing much (alternating)
– See how it did!
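The alternation can be sketched in the rank-1 case, where each least-squares solve has a one-line closed form (the toy ratings are invented; the real project used dozens of factors and far more data):

```python
import random

random.seed(1)

# Toy ratings: ratings[user][movie], with None for missing entries.
ratings = [
    [5, 4, None],
    [4, None, 2],
    [None, 1, 5],
]

# Rank-1 alternating least squares: each user and movie gets one factor,
# and the predicted rating is their product.
U = [random.random() for _ in ratings]
M = [random.random() for _ in ratings[0]]

def solve_users():
    """Hold M fixed; least-squares solve for each user's factor."""
    for i, row in enumerate(ratings):
        num = sum(r * M[j] for j, r in enumerate(row) if r is not None)
        den = sum(M[j] ** 2 for j, r in enumerate(row) if r is not None)
        U[i] = num / den

def solve_movies():
    """Hold U fixed; least-squares solve for each movie's factor."""
    for j in range(len(M)):
        col = [(row[j], U[i]) for i, row in enumerate(ratings) if row[j] is not None]
        M[j] = sum(r * u for r, u in col) / sum(u ** 2 for _, u in col)

for _ in range(20):          # alternate until the training error settles
    solve_users()
    solve_movies()

print(round(U[0] * M[2], 2))  # predicted rating for a missing entry
```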

65
SVD Results
How did it do?
– Probe set: RMSE of about 0.90, a ??% improvement over the Netflix recommender system

66
Dimensional Fun
Each movie or user is represented by a 60-dimensional vector. Do the dimensions mean anything? Is there an "action" dimension or a "comedy" dimension, for instance?

67
Dimensional Fun
Some of the lowest movies along the 0th dimension:
– Michael Moore Hates America
– In the Face of Evil: Reagan's War in Word & Deed
– Veggie Tales: Bible Heroes
– Touched by an Angel: Season 2
– A History of God

68
Dimensional Fun
Some of the highest movies along the 47th dimension:
– Emanuelle in America
– Lust for Dracula
– Timegate: Tales of the Saddle Tramps
– Legally Exposed
– Sexual Matrix

69
Dimensional Fun
Some of the highest movies along the 55th dimension:
– Strange Things Happen at Sundown
– Alien 3000
– Shaolin vs. Evil Dead
– Dark Harvest
– Legend of the Chupacabra

70
Results
RMSE by method: Netflix …; RBMs …; kNN …; SVD …; Clustering …

71
Clustering

72
Goals
Identify groups of similar movies
Provide ratings based on similarity between movies
Provide ratings based on similarity between users

73
[Figures, slides 73–80: clustering walkthrough, grouping movies around prototypes.]

81
Predictions
We want to know what College Dave will think of "Grease". Find out what he thinks of the prototype most similar to "Grease".

82
College Dave gives “Grease” 1 Star!
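A sketch of prototype-based prediction (the vectors, prototype names, and ratings below are invented for illustration):

```python
import math

# Toy 2-d movie coordinates and cluster prototypes (all values illustrative).
prototypes = {
    "musical-romance": [4.0, 0.5],
    "sci-fi-cult":     [0.5, 4.5],
}
# College Dave's rating of each prototype.
college_dave_opinion = {"musical-romance": 1, "sci-fi-cult": 5}

def nearest_prototype(movie_vec):
    """Return the name of the prototype closest to the movie (Euclidean)."""
    return min(prototypes,
               key=lambda name: math.dist(movie_vec, prototypes[name]))

grease = [3.8, 0.7]
proto = nearest_prototype(grease)
print(proto, college_dave_opinion[proto])  # predict his "Grease" rating
```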

83
Other Approaches
Distribute across many machines
Density-based algorithms
Ensembles:
– It is better to have a bunch of predictors that each do one thing well than one predictor that does everything well.
– (In theory; in practice it doesn't help much.)

84
Results
Rating prediction: best RMSE ≈ 0.93, but randomness gives us a pretty wide range.
Genre clustering:
– Classifying based only on the most popular genre: 40%
– Classifying based on the two most popular: 63%

85
Clustering Fun!
[Figure: example movie clusters, captioned "These are the ONLY two movies in the cluster", "These are AWESOME MOVIES!", "These are NOT!", "Pretty obvious", and "Pretty surprising".]

86
More Clustering Fun!
[Figure: more example clusters, captioned "Also surprising", "For those of you born before 1965", "Insight into who actually likes Tom Cruise", and "Go forth, and kill! Zardoz has spoken."]

87
The last of the fun
[Figure: final example clusters, captioned "Also, movies to recommend to College Dave", "If only we could recommend based on T-shirt purchases…", "Intellectual humor.", "Ahhhhhhhh!!!!!", and "Did not see the last one coming…"]

88
Results
RMSE by method: Netflix …; RBMs …; kNN …; SVD …; Clustering …

89
Visualization

90
[Figures, slides 90–106: force-directed layout visualizations of the movie data.]

107
THANK YOU! Questions?

108
References
ifsc.ualr.edu/xwxu/publications/kdd-96.pdf
gael-varoquaux.info/scientific_computing/ica_pca/index.html
