Published by Ezekiel Bagwell. Modified over 10 years ago.
1
CoBaFi: Collaborative Bayesian Filtering
Alex Beutel
Joint work with Kenton Murray, Christos Faloutsos, Alex Smola
April 9, 2014, Seoul, South Korea
2
Online Recommendation
[Figure: users linked to movies by example ratings.]
3
Online Rating Models
4
Normal Collaborative Filtering: Fit a Gaussian - Minimize the error
Reality: Minimizing error isn't good enough - Understanding the shape matters!
5
Online Rating Models
Normal Collaborative Filtering: Fit a Gaussian - Minimize the error
Our Model
6
Our Goals and Challenges
Given: A matrix of user ratings
Find: A model that best fits and predicts user preferences
Goals:
G1. Fit the recommender distribution
G2. Understand users who rate few items
G3. Detect abnormal spam behavior
7
OUTLINE
1. Background
2. Model Formulation
3. Inference
4. Catching Spam
5. Experiments
8
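Collaborative Filtering [Background]
[Figure: ratings matrix X (Users by Movies) written as the product of U (Users by Genres) and V (Genres by Movies); an example rating of 5 is computed from one row of U and one column of V.]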
9
Matrix Factorization [Background]
[Figure: ratings matrix X (Users by Movies) factored into U (Users by Genres) and V (Genres by Movies).]
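As a minimal sketch of the factorization named on this slide, the snippet below rebuilds a ratings matrix from user and movie factors. The toy values, the two-genre size, and the function name `predict_ratings` are hypothetical and not taken from the talk.

```python
import numpy as np

def predict_ratings(U, V):
    """Approximate the ratings matrix X as the product U @ V."""
    return U @ V

# Hypothetical toy factors: 2 users, 2 genres, 3 movies.
U = np.array([[1.0, 2.0],
              [0.5, 0.0]])       # users x genres
V = np.array([[1.0, 0.0, 2.0],
              [2.0, 1.0, 0.0]])  # genres x movies

X_hat = predict_ratings(U, V)    # users x movies
```

Each predicted rating is the inner product of one user's genre weights with one movie's genre weights.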
10
Bayesian Probabilistic Matrix Factorization (Salakhutdinov & Mnih, ICML 2008) [Background]
μ_U ~ …
11
OUTLINE
1. Background
2. Our Model
3. Inference
4. Catching Spam
5. Experiments
12
Our Model
Use user preferences to predict ratings
Cluster users (& items)
Share preferences within clusters
13
The Recommender Distribution
First introduced by Tan et al., 2013
[Equation with labeled parts: Normalization, Linear, Quadratic.]
[Figure: θ_1 = 0, vary θ_2; panels for θ_2 = -1.0 and θ_2 = 0.4.]
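A rough sketch of how a distribution with a linear term, a quadratic term, and a normalization can take either a peaked or a polarized shape. The exact parameterization used by Tan et al. and in CoBaFi is not shown on the slide; the exponential form over a discrete 1 to 5 scale, the centering of ratings, and the name `recommender_pmf` are all assumptions made for illustration.

```python
import numpy as np

def recommender_pmf(theta1, theta2, ratings=np.arange(1, 6)):
    """Assumed form: p(r) proportional to exp(theta1 * c + theta2 * c**2),
    where c is the rating centered on the middle of the scale."""
    centered = ratings - ratings.mean()                     # assumption: centered ratings
    logits = theta1 * centered + theta2 * centered ** 2     # linear + quadratic terms
    logits = logits - logits.max()                          # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()                          # normalization

peaked = recommender_pmf(0.0, -1.0)     # negative theta2: mass near the middle
polarized = recommender_pmf(0.0, 0.4)   # positive theta2: mass at the extremes
```

Under these assumptions a negative θ_2 gives a Gaussian-like bump, while a positive θ_2 gives a U shape with most mass on the lowest and highest ratings.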
14
The Recommender Distribution
[Figure: user vector u_i with example entries 0.3, 0.4, 0.3, 0.2, -0.7, 0.4, 0.3, 0.8, 0.4, divided into Genre Preferences, General Leaning, and How Polarized.]
Goal 1: Fit the recommender distribution
15
Understanding varying preferences
[Figure: example ratings illustrating varying user preferences.]
16
Resulting Co-clustering
[Figure: co-clustering of U and V.]
17
Finding User Preferences
[Figure: user preferences drawn around μ_U.]
Goal 2: Understand users who rate few items
18
Chinese Restaurant Process
[Figure: clusters with parameters μ_1, μ_2, μ_3.]
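For readers unfamiliar with the Chinese Restaurant Process, here is a minimal sketch of its standard sampling rule: each new customer joins an existing table with probability proportional to its size, or opens a new table with probability proportional to a concentration parameter. The name `sample_crp` and the parameter `alpha` are illustrative choices, not names from the talk.

```python
import numpy as np

def sample_crp(n_customers, alpha, rng):
    """Draw table (cluster) assignments from a Chinese Restaurant Process."""
    assignments = []
    counts = []  # counts[k] = number of customers at table k
    for _ in range(n_customers):
        # Existing tables weighted by size; a new table weighted by alpha.
        weights = np.array(counts + [alpha], dtype=float)
        table = int(rng.choice(len(weights), p=weights / weights.sum()))
        if table == len(counts):
            counts.append(1)       # open a new table
        else:
            counts[table] += 1     # join an existing table
        assignments.append(table)
    return assignments, counts

rng = np.random.default_rng(0)
assignments, counts = sample_crp(100, alpha=1.0, rng=rng)
```

Large tables tend to grow, so a few big clusters and many small ones emerge without fixing the number of clusters in advance.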
19
OUTLINE
1. Background
2. Our Model
3. Inference
4. Catching Spam
5. Experiments
20
Gibbs Sampling - Clusters [Details]
Probability of picking a cluster = Probability of a cluster based on size (CRP) × Probability u_i would come from the cluster
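The product on this slide can be sketched as follows. The slide only states the two factors; the isotropic Gaussian likelihood with a shared variance, the handling of a brand-new cluster through a prior mean, and all names (`cluster_posterior`, `alpha`, `prior_mean`, `var`) are simplifying assumptions rather than the model's actual form.

```python
import numpy as np

def cluster_posterior(u_i, cluster_sizes, cluster_means, alpha, prior_mean, var=1.0):
    """Probability of each cluster for user vector u_i:
    (CRP size term) x (likelihood of u_i under the cluster).
    The last entry is the probability of opening a new cluster."""
    sizes = np.append(np.asarray(cluster_sizes, dtype=float), alpha)
    means = np.vstack([cluster_means, prior_mean])
    sq_dist = ((means - u_i) ** 2).sum(axis=1)
    # Assumed isotropic Gaussian; shared normalizing constants cancel.
    log_post = np.log(sizes) - 0.5 * sq_dist / var
    log_post = log_post - log_post.max()
    post = np.exp(log_post)
    return post / post.sum()

# Hypothetical example: two existing clusters in a 2-D preference space.
u_i = np.array([1.0, 1.0])
probs = cluster_posterior(
    u_i,
    cluster_sizes=[10, 2],
    cluster_means=np.array([[1.0, 1.0], [-3.0, -3.0]]),
    alpha=1.0,
    prior_mean=np.array([0.0, 0.0]),
)
```

A Gibbs step would then draw the user's cluster from `probs` and repeat for every user and item.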
21
Sampling user parameters [Details]
Probability of user preferences u_i = Probability of preferences u_i given cluster parameters × Probability of predicting ratings r_{i,j} using new preferences
Recommender distribution is non-conjugate: can't sample directly!
22
OUTLINE
1. Background
2. Our Model
3. Inference
4. Catching Spam
5. Experiments
23
Review Spam and Fraud
[Figure: example ratings dominated by runs of 1s and 5s.]
Image from http://sinovera.deviantart.com/art/Cute-Devil-117932337
24
Clustering Fraudsters
[Figure: clusters μ_1, μ_2, μ_3, with labels New Spam Cluster and Previous Real Cluster.]
25
Clustering Fraudsters
[Figure: clusters μ_1, μ_2, μ_3.]
Too much spam: the accounts get separated into a fraud cluster
Trying to hide just means (a) very little spam or (b) camouflage reinforcing realistic reviews.
26
Clustering Fraudsters
[Figure: clusters μ_1 through μ_5, with groups labeled Naïve Spammers, Spam + Noise, and Hijacked Accounts.]
Goal 3: Detect abnormal spam behavior
27
OUTLINE
1. Background
2. Our Model
3. Inference
4. Catching Spam
5. Experiments
28
Does it work?
[Figure label: Better Fit]
29
Catching Naïve Spammers
83% are clustered together
[Figure label: Injection]
30
Clustered Hijacked Accounts
[Figure panels: Clustered hijacked accounts; Clustered attacked movies. Label: Injection]
31
Real world clusters
32
Shape of real world data
33
Shape of Netflix reviews

Most Gaussian             Most skewed
The Rookie                The O.C. Season 2
The Fan                   Samurai X: Trust and Betrayal
Cadet Kelly               Aqua Teen Hunger Force: Vol. 2
Money Train               Sealab 2001: Season 1
Alice Doesn't Live Here   Aqua Teen Hunger Force: Vol. 2
Sea of Love               Gilmore Girls: Season 3
Boiling Point             Felicity: Season 4
True Believer             The O.C. Season 1
Stakeout                  The Shield Season 3
The Package               Queer as Folk Season 4

[Figure axis: More Gaussian to More Skewed]
34
Shape of Amazon Clothing reviews
Amazon Clothing Most Skewed Reviews:
Bra Disc Nipple Covers
Vanity Fair Women's String Bikini Panty
Lee Men's Relaxed Fit Tapered Jean
Carhartt Men's Dungaree Jean
Wrangler Men's Cowboy Cut Slim Fit Jean
Nearly all are heavily polarized!
35
Shape of Amazon Electronics reviews
Amazon Electronics Most Skewed Reviews:
Sony CD-R 50 Pack Spindle
Olympus Stylus Epic Zoom Camera
Sony AC Adapter Laptop Charger
Apricorn Hard Drive Upgrade Kit
Corsair 1GB Desktop Memory
Nearly all are heavily polarized!
36
Shape of BeerAdvocate reviews
BeerAdvocate Most Gaussian Reviews:
Weizenbock (Sierra Nevada)
Ovila Abbey Saison (Sierra Nevada)
Stoudt's Abbey Double Ale
Stoudt's Fat Dog Stout
Juniper Black Ale
Nearly all are Gaussian!
37
Hypotheses on shape of data
Hard to evaluate beyond binary
Selection bias: only committed viewers watch Season 4 of a TV series
Hard to compare value across very different items:
Lots of beers and movies to compare
Fewer TV shows
Even fewer jeans or hard drives
38
Key Points
Modeling: Fit real data with flexible recommender distribution
Prediction: Predict user preferences
Anomaly Detection: When does a user not match the normal model?
39
Questions?
Alex Beutel
abeutel@cs.cmu.edu
http://alexbeutel.com
40
Sampling Cluster Parameters
[Figure: graphical model with nodes u_5, u_6, μ_a, and μ_α.]
Hyperparameters μ_α, λ_α, W_α, ν
Priors on μ_α, λ_α, W_α
41
Gibbs Sampling - Clusters [Details]
Probability of a cluster (CRP)
Probability u_i would be sampled from cluster a
42
Sampling user parameters [Details]
Probability of u_i given cluster parameters
Probability of predicting ratings r_{i,j}
Recommender distribution is non-conjugate: can't sample directly!
Use a Laplace approximation and perform Metropolis-Hastings sampling
43
Sampling user parameters [Details]
Use candidate normal distribution, with mean at the mode of p(u_i) and variance from the variance of p(u_i)
Sample a candidate
Metropolis-Hastings Sampling: keep the new sample with probability [equation not recovered]
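The procedure on this slide can be illustrated with a one-dimensional sketch: fit a normal candidate distribution at the mode of the target (a Laplace approximation), then run Metropolis-Hastings with that fixed candidate. The acceptance rule below is the standard one for an independence sampler and is not copied from the slide; the target `log_target` is a hypothetical non-Gaussian density, not the recommender distribution, and all function names are illustrative.

```python
import numpy as np

def log_target(x):
    # Hypothetical unnormalized log-density (log-concave, non-Gaussian).
    return -0.25 * x ** 4 - 0.5 * (x - 1.0) ** 2

def laplace_fit(log_p, x0=0.0, steps=50, h=1e-4):
    """Find the mode with Newton's method (finite differences) and
    return (mode, variance), where variance = -1 / second derivative."""
    x = x0
    for _ in range(steps):
        grad = (log_p(x + h) - log_p(x - h)) / (2 * h)
        hess = (log_p(x + h) - 2 * log_p(x) + log_p(x - h)) / h ** 2
        x -= grad / hess
    return x, -1.0 / hess

def mh_independence(log_p, mode, var, n_samples, rng):
    """Metropolis-Hastings with a fixed normal candidate N(mode, var)."""
    def log_q(x):
        return -0.5 * (x - mode) ** 2 / var

    x = mode
    samples, accepted = [], 0
    for _ in range(n_samples):
        cand = rng.normal(mode, np.sqrt(var))
        # Accept with probability min(1, p(cand) q(x) / (p(x) q(cand))).
        log_ratio = (log_p(cand) - log_p(x)) + (log_q(x) - log_q(cand))
        if np.log(rng.uniform()) < log_ratio:
            x = cand
            accepted += 1
        samples.append(x)
    return np.array(samples), accepted / n_samples

rng = np.random.default_rng(0)
mode, var = laplace_fit(log_target)
samples, rate = mh_independence(log_target, mode, var, 5000, rng)
```

The closer the target is to its normal approximation, the closer the acceptance rate gets to 100%, which is why the acceptance rate is a useful diagnostic for this scheme.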
44
Sampling Cluster Parameters [Details]
Priors
Users/Items in the cluster
45
Inferring Hyperparameters [Details]
Solved directly: no sampling needed!
Prior hidden as additional cluster
46
Does Metropolis-Hastings work?
Have to use non-standard sampling procedure:
99.12% acceptance rate for Amazon Electronics
77.77% acceptance rate for Netflix 24k
47
Does it work?
Compare on Predictive Probability (PP) to see how well our model fits the data

                      Uniform   BPMF     CoBaFi (us)
Netflix (24k users)   1.6904    1.2525   1.1827
BeerAdvocate          2.1972    1.9855   1.6741
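The slides do not define Predictive Probability. One plausible reading, assumed here purely for illustration, is the average negative log-probability that a model assigns to the observed ratings, so that lower is better. Under that assumed definition a uniform guess over nine rating levels scores ln 9 ≈ 2.1972, which coincides with the Uniform entry for BeerAdvocate; this is suggestive rather than confirmation. The function name `predictive_probability` is hypothetical.

```python
import numpy as np

def predictive_probability(probs_of_observed):
    """Assumed definition: mean negative log-probability of the
    observed ratings under the model (lower is better)."""
    return float(-np.mean(np.log(probs_of_observed)))

# A uniform model over nine rating levels gives every rating probability 1/9.
uniform_9 = predictive_probability(np.full(1000, 1.0 / 9.0))
```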
48
Handling Spammers

          PP Before   PP After
BPMF      1.7047      1.8146
CoBaFi    1.0549      1.7042

          PP Before   PP After
BPMF      1.2375      1.3057
CoBaFi    0.9670      1.2935

Random naïve spammers in Amazon Electronics dataset
Random hijacked accounts in Netflix 24k dataset
49
Clustered Naïve Spammers
83% are clustered together
50
Clustered Hijacked Accounts
[Figure panels: Clustered hijacked accounts; Clustered attacked movies.]