
1 The Netflix Challenge: Parallel Collaborative Filtering
James Jolly, Ben Murrell
CS 387 Parallel Programming with MPI, Dr. Fikret Ercal

2 What is Netflix?
- subscription-based movie rental with an online frontend
- over 100,000 movies to pick from
- 8M subscribers
- 2007 net income: $67M

3 What is the Netflix Prize?
- an attempt to increase the accuracy of Cinematch, Netflix's own recommender
- goal: predict how users will rate movies they haven't seen
- $1M for a 10% improvement

4 The contest dataset… contains 100,480,577 ratings from 480,189 users for 17,770 movies

5 Why is it hard?
- user tastes are difficult to model in general
- movies are tough to classify
- large volume of data

6 Sounds like a job for collaborative filtering!
- infer relationships between users
- leverage those relationships to make predictions

7 Why is it hard?

User      Movie            Rating
Dijkstra  Office Space     5
Knuth     Office Space     5
Turing    Office Space     5
Knuth     Dr. Strangelove  4
Turing    Dr. Strangelove  2
Boole     Titanic          5
Knuth     Titanic          1
Turing    Titanic          2

8 What makes users similar?
[Figure: user ratings compared across Titanic, Dr. Strangelove, and Office Space]

9 What makes users similar? The Pearson Correlation Coefficient!
[Figure: the same ratings comparison across Titanic, Dr. Strangelove, and Office Space, annotated with a correlation of p_c = .813]
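A minimal sketch of the coefficient computation in C, assuming the two users' ratings over their co-rated movies have already been collected into parallel arrays (the function name and signature are illustrative, not from the slides):

```c
#include <math.h>
#include <stddef.h>

/* Pearson correlation between two users over the n movies both rated.
 * x[i] and y[i] are the two users' ratings for the i-th co-rated movie. */
double pearson(const double *x, const double *y, size_t n)
{
    double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
    for (size_t i = 0; i < n; i++) {
        sx  += x[i];
        sy  += y[i];
        sxx += x[i] * x[i];
        syy += y[i] * y[i];
        sxy += x[i] * y[i];
    }
    double cov = n * sxy - sx * sy;                        /* scaled covariance */
    double var = (n * sxx - sx * sx) * (n * syy - sy * sy);
    return var > 0 ? cov / sqrt(var) : 0.0;                /* 0 if either user is flat */
}
```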

10 Building a similarity matrix…

          Turing  Knuth  Boole  Chomsky
Turing     1.000  0.813  0.750    0.125
Knuth      0.813  1.000  0.325    0.500
Boole      0.750  0.325  1.000    0.500
Chomsky    0.125  0.500  0.500    1.000

11 Predicting user ratings… Would Chomsky like “Grammar Rock”?
Approach:
- use the similarity matrix to find users like Chomsky
- drop ratings from those who haven’t seen it
- take a weighted average of the remaining ratings

12 Predicting user ratings…

          Turing  Knuth  Boole  Chomsky
Turing     1.000  0.813  0.750    0.125
Knuth      0.813  1.000  0.325    0.500
Boole      0.750  0.325  1.000    0.500
Chomsky    0.125  0.500  0.500    1.000

Suppose Turing, Knuth, and Boole rated it 5, 3, and 1. Their similarities to Chomsky sum to .125 + .5 + .5 = 1.125, so we predict

r_Chomsky = (.125/1.125)(5) + (.5/1.125)(3) + (.5/1.125)(1) ≈ 2.333
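The same weighted average as a small C sketch, assuming the neighbors' similarities and ratings have already been filtered down to users who rated the target movie (names are illustrative):

```c
#include <stddef.h>

/* Predict a rating as a similarity-weighted average of neighbor ratings.
 * sim[i] is the querying user's similarity to neighbor i; rating[i] is
 * that neighbor's rating of the target movie. */
double predict(const double *sim, const double *rating, size_t n)
{
    double num = 0, den = 0;
    for (size_t i = 0; i < n; i++) {
        num += sim[i] * rating[i];
        den += sim[i];
    }
    /* e.g. (.125*5 + .5*3 + .5*1) / 1.125 = 2.625 / 1.125 ≈ 2.333 */
    return den > 0 ? num / den : 0.0;
}
```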

13 So how is the data really organized?

movie file 1:   user 1, rating ‘5’
                user 13, rating ‘3’
                user 42, rating ‘2’
                …
movie file 2:   user 13, rating ‘1’
                user 42, rating ‘1’
                user 1337, rating ‘2’
                …
movie file 3:   user 13, rating ‘5’
                user 311, rating ‘4’
                user 666, rating ‘5’
                …

14 Training Data
- 17,770 text files (one for each movie)
- > 2 GB in total
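As a sketch of the data handling, a loader might read one movie file of "user,rating" lines into parallel arrays. This follows the simplified layout pictured on the previous slide; the actual contest files differ (they begin with a movie-ID header line and carry a date column), so a real loader would need adjusting:

```c
#include <stdio.h>
#include <stddef.h>

/* Read one movie file of "user,rating" lines into parallel arrays.
 * Returns the number of ratings read, at most max. */
size_t load_movie_file(const char *path, int *users, int *ratings, size_t max)
{
    FILE *f = fopen(path, "r");
    if (!f) return 0;
    size_t n = 0;
    while (n < max && fscanf(f, "%d,%d", &users[n], &ratings[n]) == 2)
        n++;
    fclose(f);
    return n;
}
```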

15 Parallelization
Two-step process:
- learning step
- prediction step
Concerns:
- data distribution
- task distribution

16 Parallelizing the learning step…

         user 1  user 2  user 3  user 4  user 5  user 6  user 7  user 8
user 1   c1,1    c1,2    c1,3    c1,4    c1,5    c1,6    c1,7    c1,8
user 2   c2,1    c2,2    c2,3    c2,4    c2,5    c2,6    c2,7    c2,8
user 3   c3,1    c3,2    c3,3    c3,4    c3,5    c3,6    c3,7    c3,8
user 4   c4,1    c4,2    c4,3    c4,4    c4,5    c4,6    c4,7    c4,8
user 5   c5,1    c5,2    c5,3    c5,4    c5,5    c5,6    c5,7    c5,8
user 6   c6,1    c6,2    c6,3    c6,4    c6,5    c6,6    c6,7    c6,8
user 7   c7,1    c7,2    c7,3    c7,4    c7,5    c7,6    c7,7    c7,8
user 8   c8,1    c8,2    c8,3    c8,4    c8,5    c8,6    c8,7    c8,8

17 Parallelizing the learning step…

[Figure: the same 8-user similarity matrix as the previous slide, with its entries divided into blocks assigned to processors P=1 through P=4]

18 Parallelizing the learning step…
- store data as user[movie] = rating
- each proc has all rating data for n/p users
- calculating each c_i,j requires message passing (only 1/p of the correlations can be computed locally within a node; see the sketch below)
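A sketch of how that exchange might look in MPI, assuming ratings are stored as dense per-user vectors of length m and each rank owns n/p users. pearson() is the routine sketched earlier; for brevity this correlates full vectors rather than restricting to co-rated movies, and all names are illustrative:

```c
#include <mpi.h>
#include <stddef.h>

double pearson(const double *x, const double *y, size_t n); /* earlier sketch */

/* mine:   n_local x m ratings for the users this rank owns
 * theirs: scratch buffer of the same size for a remote user block
 * sim:    this rank's n_local x (nprocs * n_local) slice of the matrix */
void learn_step(double *mine, double *theirs, double *sim,
                int n_local, int m, int rank, int nprocs)
{
    for (int owner = 0; owner < nprocs; owner++) {
        /* The owner's user block is copied to every rank in turn. */
        double *block = (owner == rank) ? mine : theirs;
        MPI_Bcast(block, n_local * m, MPI_DOUBLE, owner, MPI_COMM_WORLD);

        /* Correlate each local user with each user in the received block;
         * the owner == rank round is the 1/p of work that is purely local. */
        for (int i = 0; i < n_local; i++)
            for (int j = 0; j < n_local; j++)
                sim[i * (nprocs * n_local) + owner * n_local + j] =
                    pearson(&mine[i * m], &block[j * m], (size_t)m);
    }
}
```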

19 Parallelizing the prediction step…
Data distribution directly affects task distribution.
Method 1: store all user information on each processor and stripe movie information (less communication)
[Figure: P0 issues predict(user, movie); each of P1-P3 holds all user information plus a stripe of the movies and returns a rating estimate]
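Under Method 1 a query can be answered entirely by the rank that owns the movie's stripe. A hedged sketch, assuming round-robin striping (movie mod p) and that the owning rank runs a matching receive/predict/reply loop; the tags, striping rule, and names are illustrative:

```c
#include <mpi.h>

#define TAG_QUERY  1
#define TAG_ANSWER 2

/* Issued from rank 0: forward (user, movie) to the rank owning that
 * movie's stripe and wait for its locally computed estimate. Assumes
 * the owner is not rank 0 itself. */
double query(int user, int movie, int nprocs)
{
    int q[2] = { user, movie };
    int dest = movie % nprocs;   /* assumed round-robin striping rule */
    double estimate;

    MPI_Send(q, 2, MPI_INT, dest, TAG_QUERY, MPI_COMM_WORLD);
    MPI_Recv(&estimate, 1, MPI_DOUBLE, dest, TAG_ANSWER,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    return estimate;
}
```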

20 Parallelizing the prediction step…
Method 2: store all movie information on each processor and stripe user information (more communication)
[Figure: P0 issues predict(user, movie); each of P1-P3 holds all movie ratings plus a stripe of the users, and P0 gathers their partial estimates]
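Under Method 2 each rank only sees its stripe of users, so the weighted average must be assembled from partial sums. A minimal sketch with MPI_Reduce, where num and den are this rank's local sums of similarity-weighted ratings and of similarities over neighbors who rated the movie (illustrative names):

```c
#include <mpi.h>

/* Combine per-rank partial sums into one estimate on rank 0.
 * The return value is meaningful only on rank 0. */
double gather_estimate(double num, double den)
{
    double part[2]  = { num, den };
    double total[2] = { 0.0, 0.0 };

    MPI_Reduce(part, total, 2, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    return total[1] > 0.0 ? total[0] / total[1] : 0.0;
}
```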

21 Parallelizing the prediction step…
Method 3: hybrid approach (lots of communication, high number of nodes)
[Figure: processors form a grid over user blocks and movie blocks, e.g. P1-P3 hold users 1-3 with movies 1-12, P4-P6 hold users 1-3 with movies 13-24, P7-P9 hold users 4-6 with movies 13-24, and so on; P0 routes predict(user, movie) to the right block]
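One way to picture the hybrid routing, as a hypothetical sketch: ranks form a 2-D grid over user blocks and movie blocks, and a query is addressed by both coordinates:

```c
/* Map a (user block, movie block) pair to the rank that owns both,
 * assuming ranks are laid out row-major over the two block dimensions. */
int block_owner(int user_block, int movie_block, int movie_blocks)
{
    return user_block * movie_blocks + movie_block;
}
```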

22 Our Present Implementation
- operates on a trimmed-down dataset
- stripes movie information and stores the similarity matrix on each processor
- this won’t scale well! storing all movie information on each node would be optimal, but nic.mst.edu can’t handle it

23 In summary…
- tackling the Netflix Prize requires lots of data handling
- we are working toward an implementation that can operate on the entire training set
- simple collaborative filtering should get us close to the old Cinematch performance

