
Published by Ezekiel Bagwell. Modified over 2 years ago.

1
CoBaFi: Collaborative Bayesian Filtering
Alex Beutel
Joint work with Kenton Murray, Christos Faloutsos, Alex Smola
April 9, 2014, Seoul, South Korea

2
Online Recommendation 25 Users Movies

3
Online Rating Models

4
Normal Collaborative Filtering: Fit a Gaussian - minimize the error
Reality: Minimizing error isn't good enough - understanding the shape matters!

5
Online Rating Models
Our Model
Normal Collaborative Filtering: Fit a Gaussian - minimize the error

6
Our Goals and Challenges
Given: A matrix of user ratings
Find: A model that best fits and predicts user preferences
Goals:
G1. Fit the recommender distribution
G2. Understand users who rate few items
G3. Detect abnormal spam behavior

7
OUTLINE: 1. Background 2. Model Formulation 3. Inference 4. Catching Spam 5. Experiments

8
Collaborative Filtering [Background]
[Figure: the Users × Movies rating matrix X (example rating: 5) expressed through user factors U and movie factors V over Genres]

9
Matrix Factorization [Background]
[Figure: the Users × Movies matrix X factored into U and V over Genres]
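The factorization on this slide can be illustrated with a minimal sketch, assuming the usual reading in which a predicted rating is the inner product of a user's genre vector and a movie's genre vector. The user names, movie names, and factor values below are hypothetical and are not taken from the talk.

```python
# Minimal matrix-factorization sketch: each user and each movie gets a
# vector over latent "genres"; a predicted rating is their dot product.
# All names and numbers are made up for illustration.

def predict_rating(user_factors, movie_factors):
    """Dot product of a user's and a movie's genre vectors."""
    return sum(u * v for u, v in zip(user_factors, movie_factors))

# U: users x genres, V: movies x genres (hypothetical values)
U = {"alice": [1.0, 0.2], "bob": [0.1, 0.9]}
V = {"action_movie": [4.5, 0.5], "romance_movie": [0.5, 4.0]}

# Reconstructed rating matrix X_hat, one entry per (user, movie) pair
X_hat = {
    (user, movie): predict_rating(U[user], V[movie])
    for user in U
    for movie in V
}
```

In this toy setup each user scores highest on the movie whose genre vector lines up with their own.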

10
Bayesian Probabilistic Matrix Factorization (Salakhutdinov & Mnih, ICML 2008) [Background]
[Figure: model diagram with μ_U ~ …]

11
OUTLINE: 1. Background 2. Our Model 3. Inference 4. Catching Spam 5. Experiments

12
Our Model
Use user preferences to predict ratings
Cluster users (& items)
Share preferences within clusters

13
The Recommender Distribution
First introduced by Tan et al., 2013
[Equation with labeled parts: Linear, Quadratic, Normalization]
[Figure: vary θ_2 with θ_1 = 0; curves shown for θ_2 = -1.0 and θ_2 = 0.4]
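The slide's equation did not survive extraction, so what follows is only a sketch under stated assumptions: a discrete distribution over ratings 1 to 5 whose log-probability has a linear term (weight θ_1), a quadratic term (weight θ_2), and a normalization constant. Centering the ratings at their midpoint is an assumption made here for readability and may differ from the parameterization in Tan et al.

```python
import math

def recommender_pmf(theta1, theta2, ratings=(1, 2, 3, 4, 5)):
    """Discrete pmf with log p(r) = theta1*c + theta2*c**2 - log Z,
    where c is the rating centered at the scale midpoint (an assumption)."""
    center = sum(ratings) / len(ratings)
    scores = [
        theta1 * (r - center) + theta2 * (r - center) ** 2 for r in ratings
    ]
    top = max(scores)  # subtract the max before exponentiating, for stability
    weights = [math.exp(s - top) for s in scores]
    normalizer = sum(weights)
    return {r: w / normalizer for r, w in zip(ratings, weights)}

# The two settings named on the slide, with theta1 = 0:
peaked = recommender_pmf(0.0, -1.0)    # negative theta2: mass near the middle
polarized = recommender_pmf(0.0, 0.4)  # positive theta2: mass at 1 and 5
```

Under this parameterization a negative θ_2 gives a Gaussian-like, single-peaked shape, while a positive θ_2 gives a polarized shape with most mass at the extremes.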

14
The Recommender Distribution
User parameters u_i: Genre Preferences, General Leaning, How Polarized
Goal 1: Fit the recommender distribution

15
Understanding varying preferences

16
Resulting Co-clustering [Figure: clustered U and V]

17
Finding User Preferences [Figure: user prior mean μ_U]
Goal 2: Understand users who rate few items

18
Chinese Restaurant Process [Figure: clusters with means μ_1, μ_2, μ_3]
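For readers unfamiliar with the Chinese Restaurant Process, here is a minimal sketch of its sequential form: each new user joins an existing cluster with probability proportional to that cluster's size, or opens a new cluster with probability proportional to a concentration parameter. The parameter name `alpha` and its value are assumptions for illustration; the talk does not state the value it used.

```python
import random

def crp_assignments(n_customers, alpha, seed=0):
    """Seat customers one at a time according to the CRP.

    Returns (assignments, sizes): the table index chosen by each
    customer and the final size of every table.
    """
    rng = random.Random(seed)
    assignments = []
    sizes = []
    for _ in range(n_customers):
        # Existing tables are weighted by size, a new table by alpha.
        weights = sizes + [alpha]
        table = rng.choices(range(len(weights)), weights=weights)[0]
        if table == len(sizes):
            sizes.append(1)      # open a new table (cluster)
        else:
            sizes[table] += 1    # join an existing table
        assignments.append(table)
    return assignments, sizes

assignments, sizes = crp_assignments(n_customers=100, alpha=1.0)
```

Because large clusters attract new members more strongly, the process tends to produce a few big clusters and a tail of small ones, without fixing the number of clusters in advance.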

19
OUTLINE: 1. Background 2. Our Model 3. Inference 4. Catching Spam 5. Experiments

20
Gibbs Sampling - Clusters [Details]
Probability of picking a cluster = Probability of a cluster based on size (CRP) × Probability u_i would come from the cluster
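The product on this slide can be sketched as follows, under simplifying assumptions that are not from the talk: each cluster is an isotropic Gaussian with a known mean, and the "new cluster" option is scored with a broad Gaussian around a prior mean rather than by integrating over the prior. The function and parameter names are hypothetical.

```python
import math

def log_gaussian(x, mean, var):
    """Log-density of an isotropic Gaussian with variance `var` per dimension."""
    return sum(
        -0.5 * math.log(2 * math.pi * var) - 0.5 * (xi - mi) ** 2 / var
        for xi, mi in zip(x, mean)
    )

def cluster_probabilities(u_i, clusters, alpha, prior_mean,
                          cluster_var=1.0, prior_var=10.0):
    """Normalized probabilities for assigning user vector u_i.

    clusters: list of (size, mean) pairs, with user i removed from the counts.
    The last entry of the result is the probability of a new cluster.
    """
    # CRP term (log size, or log alpha for a new cluster) plus fit term
    log_w = [
        math.log(size) + log_gaussian(u_i, mean, cluster_var)
        for size, mean in clusters
    ]
    log_w.append(math.log(alpha) + log_gaussian(u_i, prior_mean, prior_var))
    top = max(log_w)  # normalize in log space to avoid underflow
    weights = [math.exp(lw - top) for lw in log_w]
    total = sum(weights)
    return [w / total for w in weights]

probs = cluster_probabilities(
    u_i=[0.2, -0.1],
    clusters=[(10, [0.0, 0.0]), (3, [5.0, 5.0])],
    alpha=1.0,
    prior_mean=[0.0, 0.0],
)
```

A Gibbs step would then draw the user's cluster from `probs`; in this toy example the user sits close to the first, larger cluster, so that cluster receives most of the probability.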

21
Sampling user parameters [Details]
Probability of user preferences u_i = Probability of preferences u_i given cluster parameters × Probability of predicting ratings r_i,j using new preferences
Recommender distribution is non-conjugate
Can't sample directly!

22
OUTLINE: 1. Background 2. Our Model 3. Inference 4. Catching Spam 5. Experiments

23
Review Spam and Fraud [Image: example 5-star ratings]

24
Clustering Fraudsters
[Figure: clusters μ_1, μ_2, μ_3; labels: New Spam Cluster, Previous Real Cluster]

25
Clustering Fraudsters
[Figure: clusters μ_1, μ_2, μ_3]
Too much spam: spammers get separated into a fraud cluster
Trying to hide just means (a) very little spam or (b) camouflage reinforcing realistic reviews.

26
Clustering Fraudsters
[Figure: clusters μ_1 through μ_5; labels: Naïve Spammers, Spam + Noise, Hijacked Accounts]
Goal 3: Detect abnormal spam behavior

27
OUTLINE: 1. Background 2. Our Model 3. Inference 4. Catching Spam 5. Experiments

28
Does it work? [Figure annotation: Better Fit]

29
Catching Naïve Spammers [Figure label: Injection]
83% are clustered together

30
Clustered Hijacked Accounts [Figure label: Injection]
Clustered hijacked accounts
Clustered attacked movies

31
Real world clusters

32
Shape of real world data

33
Shape of Netflix reviews
Most Gaussian | Most skewed
The Rookie | The O.C. Season 2
The Fan | Samurai X: Trust and Betrayal
Cadet Kelly | Aqua Teen Hunger Force: Vol. 2
Money Train | Sealab 2001: Season 1
Alice Doesn't Live Here | Aqua Teen Hunger Force: Vol. 2
Sea of Love | Gilmore Girls: Season 3
Boiling Point | Felicity: Season 4
True Believer | The O.C. Season 1
Stakeout | The Shield Season 3
The Package | Queer as Folk Season 4
[Figure axis: More Gaussian to More Skewed]

34
Shape of Amazon Clothing reviews
Amazon Clothing Most Skewed Reviews:
Bra Disc Nipple Covers
Vanity Fair Women's String Bikini Panty
Lee Men's Relaxed Fit Tapered Jean
Carhartt Men's Dungaree Jean
Wrangler Men's Cowboy Cut Slim Fit Jean
Nearly all are heavily polarized!

35
Shape of Amazon Electronics reviews
Amazon Electronics Most Skewed Reviews:
Sony CD-R 50 Pack Spindle
Olympus Stylus Epic Zoom Camera
Sony AC Adapter Laptop Charger
Apricorn Hard Drive Upgrade Kit
Corsair 1GB Desktop Memory
Nearly all are heavily polarized!

36
Shape of BeerAdvocate reviews
BeerAdvocate Most Gaussian Reviews:
Weizenbock (Sierra Nevada)
Ovila Abbey Saison (Sierra Nevada)
Stoudts Abbey Double Ale
Stoudts Fat Dog Stout
Juniper Black Ale
Nearly all are Gaussian!

37
Hypotheses on shape of data
Hard to evaluate beyond binary
Selection bias: only committed viewers watch Season 4 of a TV series
Hard to compare value across very different items:
Lots of beers and movies to compare
Fewer TV shows
Even fewer jeans or hard drives

38
Key Points
Modeling: Fit real data with flexible recommender distribution
Prediction: Predict user preferences
Anomaly Detection: When does a user not match the normal model?

39
Questions?
Alex Beutel

40
Sampling Cluster Parameters
[Figure: users u_5, u_6; cluster mean μ_a; hyperparameter μ_α]
Hyperparameters: μ_α, λ_α, W_α, ν
Priors on μ_α, λ_α, W_α

41
Gibbs Sampling - Clusters [Details]
Probability of a cluster (CRP)
Probability u_i would be sampled from cluster a

42
Sampling user parameters [Details]
Probability of u_i given cluster parameters
Probability of predicting ratings r_i,j
Recommender distribution is non-conjugate
Can't sample directly!
Use a Laplace approximation and perform Metropolis-Hastings sampling

43
Sampling user parameters [Details]
Use candidate normal distribution: mean = mode of p(u_i), variance = variance of p(u_i)
Sample a candidate from it
Metropolis-Hastings sampling: keep the new sample with probability [equation]
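The acceptance formula did not survive extraction. As a hedged illustration, the sketch below uses the standard Metropolis-Hastings rule for a fixed (independence) proposal q: accept candidate u' with probability min(1, p(u') q(u) / (p(u) q(u'))). The one-dimensional target `log_target` is a hypothetical stand-in for the user posterior, and the Newton search with finite differences is just one simple way to obtain a Laplace approximation; neither is taken from the talk.

```python
import math
import random

def log_target(u):
    # Hypothetical unnormalized log-density standing in for
    # log p(u_i | cluster) + sum_j log p(r_ij | u_i); not from the talk.
    return -0.5 * (u - 1.0) ** 2 - 0.1 * u ** 4

def laplace_fit(log_p, start=0.0, steps=25, h=1e-4):
    """Find the mode by Newton's method with finite differences and
    return (mode, variance), where variance = -1 / curvature at the mode."""
    u = start
    for _ in range(steps):
        grad = (log_p(u + h) - log_p(u - h)) / (2 * h)
        curv = (log_p(u + h) - 2 * log_p(u) + log_p(u - h)) / (h * h)
        u -= grad / curv
    curv = (log_p(u + h) - 2 * log_p(u) + log_p(u - h)) / (h * h)
    return u, -1.0 / curv

def mh_with_laplace_proposal(log_p, n_samples, seed=0):
    """Independence Metropolis-Hastings using the Laplace fit as proposal."""
    rng = random.Random(seed)
    mode, var = laplace_fit(log_p)

    def log_q(u):  # proposal log-density, up to a constant
        return -0.5 * (u - mode) ** 2 / var

    current, accepted, samples = mode, 0, []
    for _ in range(n_samples):
        candidate = rng.gauss(mode, math.sqrt(var))
        log_ratio = (log_p(candidate) - log_p(current)
                     + log_q(current) - log_q(candidate))
        if rng.random() < math.exp(min(0.0, log_ratio)):
            current, accepted = candidate, accepted + 1
        samples.append(current)
    return samples, accepted / n_samples

samples, acceptance_rate = mh_with_laplace_proposal(log_target, 2000)
```

When the Laplace approximation matches the target well, most candidates are accepted, which is why the acceptance rate is a useful diagnostic for this scheme.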

44
Sampling Cluster Parameters [Details]
[Equation labels: Priors; Users/Items in the cluster]

45
Inferring Hyperparameters [Details]
Solved directly: no sampling needed!
Prior hidden as additional cluster

46
Does Metropolis-Hastings work?
Have to use non-standard sampling procedure:
99.12% acceptance rate for Amazon Electronics
77.77% acceptance rate for Netflix 24k

47
Does it work?
Compare on Predictive Probability (PP) to see how well our model fits the data
[Table: columns Uniform, BPMF, CoBaFi (us); rows Netflix (24k users), BeerAdvocate]
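The slide does not define Predictive Probability. One plausible reading, assumed here and not confirmed by the transcript, is the mean log-probability a model assigns to held-out ratings. The sketch below computes that quantity for a uniform baseline over five rating values; the held-out ratings are made up.

```python
import math

def predictive_probability(pmfs, held_out_ratings):
    """Mean log-probability of held-out ratings (an assumed definition).

    pmfs: one dict per held-out rating, mapping rating value -> probability.
    """
    total = sum(math.log(pmf[r]) for pmf, r in zip(pmfs, held_out_ratings))
    return total / len(held_out_ratings)

# Uniform baseline: every rating from 1 to 5 gets probability 0.2.
uniform = {r: 0.2 for r in range(1, 6)}
held_out = [5, 3, 1, 4]  # hypothetical test ratings
pp_uniform = predictive_probability([uniform] * len(held_out), held_out)
```

Under this reading, higher (less negative) values mean the model places more probability on the ratings users actually gave, and the uniform baseline scores log(0.2) regardless of the data.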

48
Handling Spammers
[Table: PP Before and PP After for BPMF and CoBaFi; random naïve spammers in Amazon Electronics dataset]
[Table: PP Before and PP After for BPMF and CoBaFi; random hijacked accounts in Netflix 24k dataset]

49
Clustered Naïve Spammers
83% are clustered together

50
Clustered Hijacked Accounts
Clustered hijacked accounts
Clustered attacked movies
