Netflix Prize: Predicting Ratings

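Data
–mv_00(movieID).txt: first line is the movie ID followed by a colon (e.g. 1:), then one line per rating with userID (1-2,649,429) and rating (1-5)
–Over 17,000 movie txt files
–Over 400,000 userIDs
–Two gigs zipped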
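A minimal sketch of reading one movie file, assuming each rating line is comma-separated as userID,rating with any trailing fields ignored. The slide does not show the separator, so that layout is an assumption, and the class and method names (MovieFileParser, Rating, parse) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical parser for one movie file; the comma-separated layout is an assumption. */
class MovieFileParser {

    /** One rating row: which user gave which rating to which movie. */
    static final class Rating {
        final int movieId;
        final int userId;
        final int rating;

        Rating(int movieId, int userId, int rating) {
            this.movieId = movieId;
            this.userId = userId;
            this.rating = rating;
        }
    }

    /**
     * Parses the lines of one movie file. The first line is expected to be
     * "movieID:", every following non-blank line "userID,rating[,ignored...]".
     */
    static List<Rating> parse(List<String> lines) {
        List<Rating> ratings = new ArrayList<>();
        if (lines.isEmpty()) {
            return ratings;
        }
        String header = lines.get(0).trim();
        if (!header.endsWith(":")) {
            throw new IllegalArgumentException("Expected 'movieID:' header, got: " + header);
        }
        int movieId = Integer.parseInt(header.substring(0, header.length() - 1));
        for (String line : lines.subList(1, lines.size())) {
            if (line.trim().isEmpty()) {
                continue; // tolerate blank lines
            }
            String[] fields = line.trim().split(",");
            int userId = Integer.parseInt(fields[0].trim());
            int rating = Integer.parseInt(fields[1].trim());
            ratings.add(new Rating(movieId, userId, rating));
        }
        return ratings;
    }
}
```

Taking a list of lines rather than a file path keeps the sketch independent of where the files live; the lines of a real file could be supplied with java.nio.file.Files.readAllLines.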

Overall Plan
Compute user similarity using:
–termFrequency: # of movies in common
–documentFrequency: 1/|rating 1 – rating 2|
tfdf = (# of movies in common) * 1/|rating 1 – rating 2|
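The tfdf score can be sketched as below. The slide does not say how |rating 1 – rating 2| is combined across several shared movies or what happens when two users agree exactly, so this sketch assumes the mean absolute difference over the movies in common and a floor of 0.5 (MIN_DIFF) to avoid dividing by zero. Both choices, and the names TfdfSimilarity and tfdf, are illustrative assumptions rather than the presenter's method.

```java
import java.util.Map;

/** Hypothetical tfdf similarity between two users' ratings (movieID -> rating). */
class TfdfSimilarity {

    /** Assumed floor for the rating difference, so identical ratings do not divide by zero. */
    static final double MIN_DIFF = 0.5;

    static double tfdf(Map<Integer, Integer> a, Map<Integer, Integer> b) {
        // Iterate over the smaller map and probe the larger one.
        Map<Integer, Integer> small = a.size() <= b.size() ? a : b;
        Map<Integer, Integer> large = (small == a) ? b : a;

        int common = 0;      // termFrequency: # of movies in common
        long totalDiff = 0;  // sum of |rating 1 - rating 2| over common movies
        for (Map.Entry<Integer, Integer> entry : small.entrySet()) {
            Integer other = large.get(entry.getKey());
            if (other != null) {
                common++;
                totalDiff += Math.abs(entry.getValue() - other);
            }
        }
        if (common == 0) {
            return 0.0; // no shared movies, no evidence of similarity
        }
        // Assumption: use the mean difference, floored at MIN_DIFF.
        double meanDiff = (double) totalDiff / common;
        return common / Math.max(meanDiff, MIN_DIFF);
    }
}
```

With these assumptions, two users who share many movies and rate them almost identically get the highest scores, which matches the intent of the formula on the slide.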

Plan 1
Store it all in memory (haha) in Java.
Store a User class with:
–UserID
–Array of Movie classes: movieID, rating
Then have a matrix of users with an array of top similar users (using tfdf).
Problem 1 - Memory issues
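One way the in-memory User record could look is sketched below. Instead of an array of Movie objects, this hypothetical CompactUser class swaps in two parallel primitive arrays (int movie IDs, byte ratings), which avoids one object per rating. Whether the full data set then fits in memory depends on the heap available, so this is an illustration of the data structure, not a claim that it resolves Problem 1.

```java
import java.util.Arrays;

/** Hypothetical compact user record: parallel primitive arrays instead of Movie objects. */
class CompactUser {
    final int userId;
    private int[] movieIds = new int[8];
    private byte[] ratings = new byte[8]; // ratings are 1-5, so a byte is enough
    private int count = 0;

    CompactUser(int userId) {
        this.userId = userId;
    }

    /** Appends one rating, doubling the arrays when they are full. */
    void addRating(int movieId, int rating) {
        if (count == movieIds.length) {
            movieIds = Arrays.copyOf(movieIds, count * 2);
            ratings = Arrays.copyOf(ratings, count * 2);
        }
        movieIds[count] = movieId;
        ratings[count] = (byte) rating;
        count++;
    }

    int ratingCount() {
        return count;
    }

    /** Returns the rating for the movie, or 0 if this user has not rated it (linear scan). */
    int ratingFor(int movieId) {
        for (int i = 0; i < count; i++) {
            if (movieIds[i] == movieId) {
                return ratings[i];
            }
        }
        return 0;
    }
}
```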

Plan 2*
Step 1: Store in text files on hard drive in Java
–text file for each user
Step 2: Compute similarity (tfdf)
–text file of top ten users for each user
Step 3: Predictions
–Run through two directories of text files to compute an average movie rating prediction
Problem 2 - Very Slow:
–Step 1: 3 days, ~5000 movie text files currently
–Step 2: 1 user every 35 mins | 1 user every 5 mins
–Step 3: ~10 minutes currently
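The prediction step might look roughly like the sketch below: average the ratings that a user's top similar users gave to the target movie. It assumes the neighbors' ratings have already been loaded into maps (movieID to rating) and falls back to 3.0 when no neighbor rated the movie. The fallback value is borrowed from the "Predicting everything 3.0" baseline on the Results slide; using it this way is an assumption, as are the names NeighborPredictor and predict.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical predictor: unweighted average over the similar users' ratings. */
class NeighborPredictor {

    /** Assumed fallback when none of the similar users rated the movie. */
    static final double DEFAULT_RATING = 3.0;

    /**
     * @param movieId         the movie to predict a rating for
     * @param neighborRatings one map (movieID -> rating) per similar user
     */
    static double predict(int movieId, List<Map<Integer, Integer>> neighborRatings) {
        int sum = 0;
        int raters = 0;
        for (Map<Integer, Integer> ratings : neighborRatings) {
            Integer rating = ratings.get(movieId);
            if (rating != null) {
                sum += rating;
                raters++;
            }
        }
        return raters == 0 ? DEFAULT_RATING : (double) sum / raters;
    }
}
```

A weighted average using the tfdf scores would be a natural variation, but the slides only mention an average, so the sketch keeps it unweighted.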

Plan 3
Step 1: Store the text files' data in a database using PHP
–Table: userID | movieID | rating
–Primary keys: userID, movieID
Step 2: Compute similarity
–Table: userID | 1st userID | 2nd userID | etc.
–Primary key: userID
Step 3: Predictions
Problem 3 - Very Slow:
–Step 1: 4 days, 7000 movie text files currently
–Step 2: n/a
–Step 3: n/a

Results
Predicting everything 3.0:
–RMSE = 1.3149
Similarities I have so far:
–RMSE = 1.3149 | 384 users
–RMSE = 1.3149 | 575 users
http://www.netflixprize.com/leaderboard
–Grand Prize RMSE = 0.8563
RMSE:
–sqrt(avg((actual_rating - predicted_rating) * (actual_rating - predicted_rating)))
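The RMSE formula above translates directly into code. This is a minimal sketch in which the class and method names (Rmse, rmse) are hypothetical and the two arrays are assumed to be aligned so that index i refers to the same user and movie in both.

```java
/** Root mean squared error between actual and predicted ratings. */
class Rmse {

    static double rmse(double[] actual, double[] predicted) {
        if (actual.length != predicted.length || actual.length == 0) {
            throw new IllegalArgumentException("Arrays must be non-empty and of equal length");
        }
        double sumSquared = 0.0;
        for (int i = 0; i < actual.length; i++) {
            double error = actual[i] - predicted[i];
            sumSquared += error * error;
        }
        // sqrt(avg(squared errors))
        return Math.sqrt(sumSquared / actual.length);
    }
}
```

As a sanity check, actual ratings of 1, 2, 3, 4, 5 against a constant prediction of 3.0 give squared errors of 4, 1, 0, 1, 4, so the result is sqrt(2), about 1.414.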

Future Idea
