The User Meet Dave: He likes: 24, Highlander, Star Wars Episode V, Footloose, Dirty Dancing He dislikes: The Room, Star Wars Episode II, Barbarella, Flesh Gordon What new movies would he like to see? What would he rate: Star Trek, Battlestar Galactica, Grease, Forrest Gump?
The Other User Meet College Dave: He likes: 24, Highlander, Star Wars Episode V, Barbarella, Flesh Gordon He dislikes: The Room, Star Wars Episode II, Footloose, Dirty Dancing What new movies would he like to see? What would he rate: Star Trek, Battlestar Galactica, Grease, Forrest Gump?
The Netflix Prize Netflix offered $1 million to anyone who could improve on their existing system by 10% Huge publicly available set of ratings for contestants to “train” their systems on Small “probe” set for contestants to test their own systems Larger hidden set of ratings to officially test the submissions Performance measured by RMSE
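RMSE is just the square root of the mean squared difference between predicted and actual ratings; a minimal sketch (the function name is ours):

```python
import numpy as np

def rmse(predicted, actual):
    """Root mean squared error between predicted and true ratings."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return float(np.sqrt(np.mean((predicted - actual) ** 2)))
```

Lower is better; a perfect predictor scores 0, and Netflix's advertised baseline was 0.9525.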
The Project For a given user and movie, predict the rating – RBMs – kNN, LPP – SVD Identify patterns in the data – Clustering Make pretty pictures – Force-directed Layout
The Dataset 17,770 movies 480,189 users About 100 million ratings Efficiency paramount: – Storing as a matrix: At least 5G (too big) – Storing as a list: 0.5G (linear search too slow) We started running it in Python in October…
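One way to get list-like memory without linear-time lookups is to sort the rating triples once and binary-search them; a toy sketch with made-up ratings (the `get_rating` helper and the data are ours, not the project's actual code):

```python
import numpy as np

# Hypothetical toy ratings: (user_id, movie_id, rating) triples.
triples = np.array([
    (2, 10, 4), (0, 5, 3), (0, 2, 5), (1, 10, 1), (2, 3, 2),
], dtype=np.int32)

# Sort once by (user, movie) so lookups can binary-search instead of scanning.
order = np.lexsort((triples[:, 1], triples[:, 0]))
users, movies, ratings = triples[order].T

def get_rating(user, movie):
    """O(log n) lookup via binary search on the sorted (user, movie) keys."""
    lo = np.searchsorted(users, user, side="left")
    hi = np.searchsorted(users, user, side="right")
    i = lo + np.searchsorted(movies[lo:hi], movie)
    if i < hi and movies[i] == movie:
        return int(ratings[i])
    return None
```

With compact integer types this keeps the ~100 million ratings near the 0.5G list footprint while making individual lookups logarithmic rather than linear.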
Results
Fri Feb 19 09:18:59 2010
The RMSE for iteration 0 is 0.904828 with a probe RMSE of 0.977709
The RMSE for iteration 1 is 0.861516 with a probe RMSE of 0.945408
The RMSE for iteration 2 is 0.847299 with a probe RMSE of 0.936846
…
The RMSE for iteration 17 is 0.802811 with a probe RMSE of 0.925694
The RMSE for iteration 18 is 0.802389 with a probe RMSE of 0.925146
The RMSE for iteration 19 is 0.801736 with a probe RMSE of 0.925184
Fri Feb 19 17:54:02 2010
2.857% better than Netflix’s advertised error of 0.9525 for the competition
Cult Movies: 1.1663   Few Ratings: 1.0510
kNN One of the most common algorithms for finding similar users in a dataset. Simple, but there are various ways to implement it – Calculation: Euclidean Distance, Cosine Similarity – Analysis: Average, Weighted Average, Majority
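A sketch of one such combination: Euclidean distance over commonly rated movies, plus a weighted average over the k nearest neighbors (the helper name and toy data are ours, not the project's code):

```python
import numpy as np

def knn_predict(target, neighbors_ratings, movie, k=3):
    """Predict `target`'s rating for `movie` from the k most similar users.

    target: 1-D array of the target user's ratings (0 = unrated)
    neighbors_ratings: 2-D array, one row of ratings per other user
    """
    dists = []
    for row in neighbors_ratings:
        # Euclidean distance over movies both users actually rated
        common = (target > 0) & (row > 0)
        if not common.any() or row[movie] == 0:
            dists.append(np.inf)  # unusable neighbor
        else:
            dists.append(np.linalg.norm(target[common] - row[common]))
    dists = np.array(dists)
    nearest = np.argsort(dists)[:k]
    nearest = nearest[np.isfinite(dists[nearest])]
    if len(nearest) == 0:
        return None
    # weighted average: closer neighbors count more
    weights = 1.0 / (1.0 + dists[nearest])
    return float(np.average(neighbors_ratings[nearest, movie], weights=weights))
```

Swapping the distance function or replacing the weighted average with a plain average or majority vote gives the other variants listed above.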
The Methods of Measuring Distance Euclidean Distance Cosine Similarity [Diagram: the Euclidean distance D(a, b) between two rating vectors vs. the angle θ used by cosine similarity]
The Problem of Cosine Similarity Problem: – Because the user–movie rating matrix is highly sparse, we often cannot find users who rated the same movies. Conclusion: – We cannot compare users in these cases, because the similarity becomes 0 when there is no commonly rated movie. Solution: – Set small default values for unrated entries to avoid this.
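A minimal sketch of the default-value workaround: unrated entries (stored as 0) are filled with a small default before taking the cosine, so even users with no overlap get a nonzero similarity (the value 2.5 is an illustrative choice, not the project's tuned constant):

```python
import numpy as np

def cosine_similarity(a, b, default=2.5):
    """Cosine similarity between two sparse rating vectors (0 = unrated).

    Unrated entries are filled with a small default value so that two
    users with no commonly rated movie still get a nonzero similarity.
    """
    a = np.where(a > 0, a, default)
    b = np.where(b > 0, b, default)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```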
RMSE (Root Mean Squared Error)

k     Euclidean    Cosine Similarity*   Cosine Similarity w/ Default Values
1     1.593319     1.442683             1.430385
2     1.390024     1.277889             1.257577
3     1.293187     1.224314             1.222081
…     …            …                    …
27    1.160647     1.147757             1.149164
28    1.160366     1.147843             1.149094
29    1.160058     1.148418             1.149145

* For Cosine Similarity, the RMSE is computed only over the predictions the program actually returned; many predictions are missing where it could not find nearest neighbors.
Dimensionality Reduction LPP (Locality Preserving Projections) 1. Construct the adjacency graph 2. Choose the weights 3. Solve the generalized eigenvector problem X L X^T a = λ X D X^T a (where W is the weight matrix, D its diagonal degree matrix, and L = D − W the graph Laplacian)
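The three steps can be sketched roughly as follows (a toy numpy implementation, not the project's code; the function name and parameter choices are illustrative):

```python
import numpy as np

def lpp(X, n_neighbors=3, t=1.0, d=2):
    """Locality Preserving Projections sketch.

    X: (n_features, n_samples) data matrix.
    Returns a (d, n_samples) low-dimensional embedding.
    """
    n = X.shape[1]
    # 1. adjacency graph: connect each point to its nearest neighbors
    dists = np.linalg.norm(X[:, :, None] - X[:, None, :], axis=0)
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(dists[i])[1:n_neighbors + 1]
        # 2. heat-kernel weights, symmetrized afterwards
        W[i, nbrs] = np.exp(-dists[i, nbrs] ** 2 / t)
    W = np.maximum(W, W.T)
    D = np.diag(W.sum(axis=1))
    L = D - W  # graph Laplacian
    # 3. generalized eigenproblem X L X^T a = lam X D X^T a,
    #    solved here via B^{-1} A for simplicity
    A = X @ L @ X.T
    B = X @ D @ X.T
    vals, vecs = np.linalg.eig(np.linalg.solve(B, A))
    order = np.argsort(vals.real)
    a = vecs[:, order[:d]].real  # projection directions
    return a.T @ X
```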
The Result of Dimensionality Reduction Other techniques when k = 15: – Euclidean: error = 1.173049 – Cosine: error = 1.147835 – Cosine w/ Defaults: error = 1.148560 Using the dimensionality reduction technique: – k = 15 and d = 100: error = 1.060185
Strategy to solve the Netflix problem: – Assume the data has a simple (affine) structure with added noise – Find the low-rank matrix that best approximates our known values (i.e., infer that simple structure) – Fill in the missing entries based on that matrix – Recommend movies based on the filled-in values
Every user is represented by a k-dimensional vector (This is the matrix U) Every movie is represented by a k-dimensional vector (This is the matrix M) Predicted ratings are dot products between user vectors and movie vectors
SVD Implementation Alternating Least Squares: – Initialize U and M randomly – Hold U constant and solve for M (least squares) – Hold M constant and solve for U (least squares) – Keep switching back and forth, until your error on the training set isn’t changing much (alternating) – See how it did!
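The alternating loop above can be sketched as follows (a toy numpy version; the names and parameters are ours, and we add a small ridge regularization term for numerical stability that the slide does not mention):

```python
import numpy as np

def als(R, mask, k=2, n_iters=30, reg=0.05):
    """Alternating least squares for the low-rank model R ≈ U^T M.

    R: (n_users, n_movies) ratings; mask: boolean, True where rated.
    Returns U (k, n_users) and M (k, n_movies).
    """
    rng = np.random.default_rng(0)
    n_users, n_movies = R.shape
    # initialize U and M randomly
    U = rng.standard_normal((k, n_users))
    M = rng.standard_normal((k, n_movies))
    I = reg * np.eye(k)
    for _ in range(n_iters):
        # hold U constant, solve a least-squares problem per movie column
        for j in range(n_movies):
            rated = mask[:, j]
            A = U[:, rated]
            M[:, j] = np.linalg.solve(A @ A.T + I, A @ R[rated, j])
        # hold M constant, solve per user vector
        for i in range(n_users):
            rated = mask[i, :]
            A = M[:, rated]
            U[:, i] = np.linalg.solve(A @ A.T + I, A @ R[i, rated])
    return U, M
```

A predicted rating is then the dot product `U[:, user] @ M[:, movie]`, including for (user, movie) pairs the mask never covered.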
SVD Results How did it do? – Probe Set: RMSE of about .90, a ??% improvement over the Netflix recommender system
Dimensional Fun Each movie or user is represented by a 60-dimensional vector Do the dimensions mean anything? Is there an “action” dimension or a “comedy” dimension, for instance?
Dimensional Fun Some of the lowest movies along the 0 th dimension: – Michael Moore Hates America – In the Face of Evil: Reagan’s War in Word & Deed – Veggie Tales: Bible Heroes – Touched by an Angel: Season 2 – A History of God
Dimensional Fun Some of the highest movies along the 47 th dimension: – Emanuelle in America – Lust for Dracula – Timegate: Tales of the Saddle Tramps – Legally Exposed – Sexual Matrix
Dimensional Fun Some of the highest movies along the 55 th dimension: – Strange Things Happen at Sundown – Alien 3000 – Shaolin vs. Evil Dead – Dark Harvest – Legend of the Chupacabra
Other Approaches Distribute across many machines Density Based Algorithms Ensembles – It is better to have a bunch of predictors that can each do one thing well than one predictor that can do everything well. – (In theory; in practice it doesn’t help much.)
Results Rating prediction Best RMSE ≈ .93, but randomness gives us a pretty wide range. Genre Clustering Classifying based only on the most popular genre: 40% Classifying based on the two most popular genres: 63%
Clustering Fun! (These are the ONLY two movies in the cluster) (These are AWESOME MOVIES!) (These are NOT!) (Pretty obvious) (Pretty surprising)
More Clustering Fun! (Also surprising) (For those of you born before 1965) (Insight into who actually likes Tom Cruise) (“Go forth, and kill! Zardoz has spoken.”)
The last of the fun (Also, movies to recommend to College Dave) (If only we could recommend based on T-Shirt purchases…) (Intellectual humor.) (Ahhhhhhhh!!!!!) (Did not see the last one coming…)