Offline Components: Collaborative Filtering in Cold-start Situations.

Slides:

Advertisements

Similar presentations

Pattern Recognition and Machine Learning

Advertisements

Part 2: Unsupervised Learning

Recommender System A Brief Survey.

Bayesian Belief Propagation

Topic models Source: Topic models, David Blei, MLSS 09.

Biointelligence Laboratory, Seoul National University

Pattern Recognition and Machine Learning

Statistical Topic Modeling part 1

Jun Zhu Dept. of Comp. Sci. & Tech., Tsinghua University This work was done when I was a visiting researcher at CMU. Joint.

2. Introduction Multiple Multiplicative Factor Model For Collaborative Filtering Benjamin Marlin University of Toronto. Department of Computer Science.

Stochastic Collapsed Variational Bayesian Inference for Latent Dirichlet Allocation James Foulds 1, Levi Boyles 1, Christopher DuBois 2 Padhraic Smyth.

Regulatory Network (Part II) 11/05/07. Methods Linear –PCA (Raychaudhuri et al. 2000) –NIR (Gardner et al. 2003) Nonlinear –Bayesian network (Friedman.

Customizable Bayesian Collaborative Filtering Denver Dash Big Data Reading Group 11/19/2007.

Collaborative Ordinal Regression Shipeng Yu Joint work with Kai Yu, Volker Tresp and Hans-Peter Kriegel University of Munich, Germany Siemens Corporate.

Ranking by Odds Ratio A Probability Model Approach let be a Boolean random variable: document d is relevant to query q otherwise Consider document d as.

Topic models for corpora and for graphs. Motivation Social graphs seem to have –some aspects of randomness small diameter, giant connected components,..

Review Rong Jin. Comparison of Different Classification Models  The goal of all classifiers Predicating class label y for an input x Estimate p(y|x)

Chapter 12 (Section 12.4) : Recommender Systems Second edition of the book, coming soon.

Cao et al. ICML 2010 Presented by Danushka Bollegala.

Matrix Factorization Bamshad Mobasher DePaul University Bamshad Mobasher DePaul University.

Introduction to Machine Learning for Information Retrieval Xiaolong Wang.

Topic Models in Text Processing IR Group Meeting Presented by Qiaozhu Mei.

Fast Max–Margin Matrix Factorization with Data Augmentation Minjie Xu, Jun Zhu & Bo Zhang Tsinghua University.

Annealing Paths for the Evaluation of Topic Models James Foulds Padhraic Smyth Department of Computer Science University of California, Irvine* *James.

Training and Testing of Recommender Systems on Data Missing Not at Random Harald Steck at KDD, July 2010 Bell Labs, Murray Hill.

Evaluation Methods and Challenges. 2 Deepak Agarwal & Bee-Chung ICML’11 Evaluation Methods Ideal method –Experimental Design: Run side-by-side.

GAUSSIAN PROCESS FACTORIZATION MACHINES FOR CONTEXT-AWARE RECOMMENDATIONS Trung V. Nguyen, Alexandros Karatzoglou, Linas Baltrunas SIGIR 2014 Presentation:

Topic Modelling: Beyond Bag of Words By Hanna M. Wallach ICML 2006 Presented by Eric Wang, April 25 th 2008.

Chengjie Sun,Lei Lin, Yuan Chen, Bingquan Liu Harbin Institute of Technology School of Computer Science and Technology 1 19/11/ :09 PM.

- 1 - Recommender Problems for Content Optimization Deepak Agarwal Yahoo! Research MMDS, June 15 th, 2010 Stanford, CA.

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 3: LINEAR MODELS FOR REGRESSION.

CSC321: 2011 Introduction to Neural Networks and Machine Learning Lecture 11: Bayesian learning continued Geoffrey Hinton.

CS 782 – Machine Learning Lecture 4 Linear Models for Classification  Probabilistic generative models  Probabilistic discriminative models.

Latent Dirichlet Allocation D. Blei, A. Ng, and M. Jordan. Journal of Machine Learning Research, 3: , January Jonathan Huang

Chapter 11 Statistical Techniques. Data Warehouse and Data Mining Chapter 11 2 Chapter Objectives  Understand when linear regression is an appropriate.

1 Modeling Long Distance Dependence in Language: Topic Mixtures Versus Dynamic Cache Models Rukmini.M Iyer, Mari Ostendorf.

Overview of the final test for CSC Overview PART A: 7 easy questions –You should answer 5 of them. If you answer more we will select 5 at random.

Predictive Discrete Latent Factor Models for large incomplete dyadic data Deepak Agarwal, Srujana Merugu, Abhishek Agarwal Y! Research MMDS Workshop, Stanford.

Topic Models Presented by Iulian Pruteanu Friday, July 28 th, 2006.

The Summary of My Work In Graduate Grade One Reporter: Yuanshuai Sun

Lecture 2: Statistical learning primer for biologists

Pairwise Preference Regression for Cold-start Recommendation Speaker: Yuanshuai Sun

Chapter 13 (Prototype Methods and Nearest-Neighbors )

Gaussian Processes For Regression, Classification, and Prediction.

CS246 Latent Dirichlet Analysis. LSI  LSI uses SVD to find the best rank-K approximation  The result is difficult to interpret especially with negative.

Tutorial I: Missing Value Analysis

Collaborative Filtering via Euclidean Embedding M. Khoshneshin and W. Street Proc. of ACM RecSys, pp , 2010.

1 Chapter 8: Model Inference and Averaging Presented by Hui Fang.

Personalization Services in CADAL Zhang yin Zhuang Yuting Wu Jiangqin College of Computer Science, Zhejiang University November 19,2006.

Analysis of Social Media MLD , LTI William Cohen

Regression Based Latent Factor Models Deepak Agarwal Bee-Chung Chen Yahoo! Research KDD 2009, Paris 6/29/2009.

Multi-label Prediction via Sparse Infinite CCA Piyush Rai and Hal Daume III NIPS 2009 Presented by Lingbo Li ECE, Duke University July 16th, 2010 Note:

Predicting Consensus Ranking in Crowdsourced Setting Xi Chen Mentors: Paul Bennett and Eric Horvitz Collaborator: Kevyn Collins-Thompson Machine Learning.

Collaborative Deep Learning for Recommender Systems

Ch 1. Introduction Pattern Recognition and Machine Learning, C. M. Bishop, Updated by J.-H. Eom (2 nd round revision) Summarized by K.-I.

1 Dongheng Sun 04/26/2011 Learning with Matrix Factorizations By Nathan Srebro.

CMPS 142/242 Review Section Fall 2011 Adapted from Lecture Slides.

Matchbox Large Scale Online Bayesian Recommendations

Learning Recommender Systems with Adaptive Regularization

MATRIX FACTORIZATION TECHNIQUES FOR RECOMMENDER SYSTEMS

Multimodal Learning with Deep Boltzmann Machines

Adopted from Bin UIC Recommender Systems Adopted from Bin UIC.

Latent Variables, Mixture Models and EM

CSCI 5822 Probabilistic Models of Human and Machine Learning

Advanced Artificial Intelligence

Q4 : How does Netflix recommend movies?

Probabilistic Models with Latent Variables

Bayesian Inference for Mixture Language Models

Topic models for corpora and for graphs

Topic models for corpora and for graphs

Presentation transcript:

Offline Components: Collaborative Filtering in Cold-start Situations

2 Deepak Agarwal & Bee-Chung ICML’11 Problem Item j with User i with user features x i (demographics, browse history, search history, …) item features x j (keywords, content categories,...) (i, j) : response y ij visits Algorithm selects (explicit rating, implicit click/no-click) Predict the unobserved entries based on features and the observed entries

3 Deepak Agarwal & Bee-Chung ICML’11 Model Choices Feature-based (or content-based) approach –Use features to predict response (regression, Bayes Net, mixture models, …) –Limitation: need predictive features Bias often high, does not capture signals at granular levels Collaborative filtering (CF aka Memory based) –Make recommendation based on past user-item interaction User-user, item-item, matrix factorization, … See [Adomavicius & Tuzhilin, TKDE, 2005], [Konstan, SIGMOD’08 Tutorial], etc. –Better performance for old users and old items –Does not naturally handle new users and new items (cold-start)

4 Deepak Agarwal & Bee-Chung ICML’11 Collaborative Filtering (Memory based methods) User-User Similarity Item-Item similarities, incorporating both Estimating Similarities Pearson’s correlation Optimization based (Koren et al)

5 Deepak Agarwal & Bee-Chung ICML’11 How to Deal with the Cold-Start Problem Heuristic-based approaches –Linear combination of regression and CF models –Filterbot Add user features as psuedo users and do collaborative filtering -Hybrid approaches -Use content based to fill up entries, then use CF Matrix Factorization –Good performance on Netflix (Koren, 2009) Model-based approaches –Bilinear random-effects model (probabilistic matrix factorization) Good on Netflix data [Ruslan et al ICML, 2009] –Add feature-based regression to matrix factorization (Agarwal and Chen, 2009) –Add topic discovery (from textual items) to matrix factorization (Agarwal and Chen, 2009; Chun and Blei, 2011)

6 Deepak Agarwal & Bee-Chung ICML’11 Per-item regression models When tracking users by cookies, distribution of visit patters could get extremely skewed –Majority of cookies have 1-2 visits Per item models (regression) based on user covariates attractive in such cases

7 Deepak Agarwal & Bee-Chung ICML’11 Several per-item regressions: Multi-task learning Low dimension (5-10), B estimated retrospective data Agarwal,Chen and Elango, KDD, 2010 Affinity to old items

Per-user, per-item models via bilinear random-effects model

9 Deepak Agarwal & Bee-Chung ICML’11 Motivation Data measuring k-way interactions pervasive –Consider k = 2 for all our discussions E.g. User-Movie, User-content, User-Publisher-Ads,…. –Power law on both user and item degrees Classical Techniques –Approximate matrix through a singular value decomposition (SVD) After adjusting for marginal effects (user pop, movie pop,..) –Does not work Matrix highly incomplete, severe over-fitting –Key issue Regularization of eigenvectors (factors) to avoid overfitting

10 Deepak Agarwal & Bee-Chung ICML’11 Early work on complete matrices Tukey’s 1-df model (1956) –Rank 1 approximation of small nearly complete matrix Criss-cross regression (Gabriel, 1978) Incomplete matrices: Psychometrics (1-factor model only; small data sets; 1960s) Modern day recommender problems –Highly incomplete, large, noisy.

11 Deepak Agarwal & Bee-Chung ICML’11 Latent Factor Models “newsy” “sporty” “newsy” s item v z Affinity = u’v Affinity = s’z us p o r t y

12 Deepak Agarwal & Bee-Chung ICML’11 Factorization – Brief Overview Latent user factors: (α i, u i =(u i1,…,u in )) (Nn + Mm) parameters Key technical issue: Latent movie factors: (β j, v j =(v j1,….,v jn )) will overfit for moderate values of n,m Regularization Interaction

13 Deepak Agarwal & Bee-Chung ICML’11 Latent Factor Models: Different Aspects Matrix Factorization –Factors in Euclidean space –Factors on the simplex Incorporating features and ratings simultaneously Online updates

14 Deepak Agarwal & Bee-Chung ICML’11 Maximum Margin Matrix Factorization (MMMF) Complete matrix by minimizing loss (hinge,squared-error) on observed entries subject to constraints on trace norm –Srebro, Rennie, Jakkola (NIPS 2004) Convex, Semi-definite programming (expensive, not scalable) Fast MMMF (Rennie & Srebro, ICML, 2005) –Constrain the Frobenious norm of left and right eigenvector matrices, not convex but becomes scalable. Other variation: Ensemble MMMF (DeCoste, ICML2005) –Ensembles of partially trained MMMF (some improvements)

15 Deepak Agarwal & Bee-Chung ICML’11 Matrix Factorization for Netflix prize data Minimize the objective function Simon Funk: Stochastic Gradient Descent Koren et al (KDD 2007): Alternate Least Squares –They move to SGD later in the competition

16 Deepak Agarwal & Bee-Chung ICML’11 uiui vjvj r ij auau avav Optimization is through Gradient Descent (Iterated conditional modes) Other variations like constraining the mean through sigmoid, using “who-rated- whom” Combining with Boltzmann Machines also improved performance Probabilistic Matrix Factorization (Ruslan & Minh, 2008, NIPS)

17 Deepak Agarwal & Bee-Chung ICML’11 Bayesian Probabilistic Matrix Factorization (Ruslan and Minh, ICML 2008) Fully Bayesian treatment using an MCMC approach Significant improvement Interpretation as a fully Bayesian hierarchical model shows why that is the case –Failing to incorporate uncertainty leads to bias in estimates –Multi-modal posterior, MCMC helps in converging to a better one r Var-comp: a u MCEM also more resistant to over-fitting

18 Deepak Agarwal & Bee-Chung ICML’11 Non-parametric Bayesian matrix completion (Zhou et al, SAM, 2010) Specify rank probabilistically (automatic rank selection)

19 Deepak Agarwal & Bee-Chung ICML’11 How to incorporate features: Deal with both warm start and cold-start Models to predict ratings for new pairs –Warm-start: (user, movie) present in the training data with large sample size –Cold-start: At least one of (user, movie) new or has small sample size Rough definition, warm-start/cold-start is a continuum. Challenges –Highly incomplete (user, movie) matrix –Heavy tailed degree distributions for users/movies Large fraction of ratings from small fraction of users/movies –Handling both warm-start and cold-start effectively in the presence of predictive features

20 Deepak Agarwal & Bee-Chung ICML’11 Possible approaches Large scale regression based on covariates –Does not provide good estimates for heavy users/movies –Large number of predictors to estimate interactions Collaborative filtering –Neighborhood based –Factorization Good for warm-start; cold-start dealt with separately Single model that handles cold-start and warm-start –Heavy users/movies → User/movie specific model –Light users/movies → fallback on regression model –Smooth fallback mechanism for good performance

Add Feature-based Regression into Matrix Factorization RLFM: Regression-based Latent Factor Model

22 Deepak Agarwal & Bee-Chung ICML’11 Regression-based Factorization Model (RLFM) Main idea: Flexible prior, predict factors through regressions Seamlessly handles cold-start and warm-start Modified state equation to incorporate covariates

23 Deepak Agarwal & Bee-Chung ICML’11 RLFM: Model Rating: Gaussian Model Logistic Model (for binary rating) Poisson Model (for counts) user i gives item j Bias of user i: Popularity of item j: Factors of user i: Factors of item j: Could use other classes of regression models

24 Deepak Agarwal & Bee-Chung ICML’11 Graphical representation of the model

25 Deepak Agarwal & Bee-Chung ICML’11 Advantages of RLFM Better regularization of factors –Covariates “shrink” towards a better centroid Cold-start: Fallback regression model (FeatureOnly)

26 Deepak Agarwal & Bee-Chung ICML’11 RLFM: Illustration of Shrinkage Plot the first factor value for each user (fitted using Yahoo! FP data)

27 Deepak Agarwal & Bee-Chung ICML’11 Induced correlations among observations Hierarchical random-effects model Marginal distribution obtained by integrating out random effects

28 Deepak Agarwal & Bee-Chung ICML’11 Closer look at induced marginal correlations

29 Deepak Agarwal & Bee-Chung ICML’11 Model fitting: EM for our class of models

30 Deepak Agarwal & Bee-Chung ICML’11 The parameters for RLFM Latent parameters Hyper-parameters

31 Deepak Agarwal & Bee-Chung ICML’11 Computing the mode Minimized

32 Deepak Agarwal & Bee-Chung ICML’11 The EM algorithm

33 Deepak Agarwal & Bee-Chung ICML’11 Computing the E-step Often hard to compute in closed form Stochastic EM (Markov Chain EM; MCEM) –Compute expectation by drawing samples from –Effective for multi-modal posteriors but more expensive Iterated Conditional Modes algorithm (ICM) –Faster but biased hyper-parameter estimates

34 Deepak Agarwal & Bee-Chung ICML’11 Monte Carlo E-step Through a vanilla Gibbs sampler (conditionals closed form) Other conditionals also Gaussian and closed form Conditionals of users (movies) sampled simultaneously Small number of samples in early iterations, large numbers in later iterations

35 Deepak Agarwal & Bee-Chung ICML’11 M-step (Why MCEM is better than ICM) Update G, optimize Update A u =a u I Ignored by ICM, underestimates factor variability Factors over-shrunk, posterior not explored well

36 Deepak Agarwal & Bee-Chung ICML’11 Experiment 1: Better regularization MovieLens-100K, avg RMSE using pre-specified splits ZeroMean, RLFM and FeatureOnly (no cold-start issues) Covariates: –Users : age, gender, zipcode (1 st digit only) –Movies: genres

37 Deepak Agarwal & Bee-Chung ICML’11 Experiment 2: Better handling of Cold-start MovieLens-1M; EachMovie Training-test split based on timestamp Same covariates as in Experiment 1.

38 Deepak Agarwal & Bee-Chung ICML’11 Experiment 4: Predicting click-rate on articles Goal: Predict click-rate on articles for a user on F1 position Article lifetimes short, dynamic updates important User covariates: –Age, Gender, Geo, Browse behavior Article covariates –Content Category, keywords 2M ratings, 30K users, 4.5 K articles

39 Deepak Agarwal & Bee-Chung ICML’11 Results on Y! FP data

40 Deepak Agarwal & Bee-Chung ICML’11 Some other related approaches Stern, Herbrich and Graepel, WWW, 2009 –Similar to RLFM, different parametrization and expectation propagation used to fit the models Porteus, Asuncion and Welling, AAAI, 2011 –Non-parametric approach using a Dirichlet process Agarwal, Zhang and Mazumdar, Annals of Applied Statistics, 2011 –Regression + random effects per user regularized through a Graphical Lasso

Add Topic Discovery into Matrix Factorization fLDA: Matrix Factorization through Latent Dirichlet Allocation

42 Deepak Agarwal & Bee-Chung ICML’11 fLDA: Introduction Model the rating y ij that user i gives to item j as the user’s affinity to the topics that the item has –Unlike regular unsupervised LDA topic modeling, here the LDA topics are learnt in a supervised manner based on past rating data –fLDA can be thought of as a “multi-task learning” version of the supervised LDA model [Blei’07] for cold-start recommendation User i ’s affinity to topic k Pr(item j has topic k) estimated by averaging the LDA topic of each word in item j Old items: z jk ’s are Item latent factors learnt from data with the LDA prior New items: z jk ’s are predicted based on the bag of words in the items

43 Deepak Agarwal & Bee-Chung ICML’11  11, …,  1W …  k1, …,  kW …  K1, …,  KW Topic 1 Topic k Topic K LDA Topic Modeling (1) LDA is effective for unsupervised topic discovery [Blei’03] –It models the generating process of a corpus of items (articles) –For each topic k, draw a word distribution  k = [  k1, …,  kW ] ~ Dir(  ) –For each item j, draw a topic distribution  j = [  j1, …,  jK ] ~ Dir( ) –For each word, say the nth word, in item j, Draw a topic z jn for that word from  j = [  j1, …,  jK ] Draw a word w jn from  k = [  k1, …,  kW ] with topic k = z jn Item j Topic distribution: [  j1, …,  jK ] Words: w j1, …, w jn, … Per-word topic: z j1, …, z jn, … Assume z jn = topic k Observed

44 Deepak Agarwal & Bee-Chung ICML’11 LDA Topic Modeling (2) Model training: –Estimate the prior parameters and the posterior topic  word distribution  based on a training corpus of items –EM + Gibbs sampling is a popular method Inference for new items –Compute the item topic distribution based on the prior parameters and  estimated in the training phase Supervised LDA [Blei’07] –Predict a target value for each item based on supervised LDA topics Target value of item j Pr(item j has topic k) estimated by averaging the topic of each word in item j Regression weight for topic k vs. One regression per user Same set of topics across different regressions

45 Deepak Agarwal & Bee-Chung ICML’11 fLDA: Model Rating: Gaussian Model Logistic Model (for binary rating) Poisson Model (for counts) user i gives item j Bias of user i: Popularity of item j: Topic affinity of user i: Pr(item j has topic k): The LDA topic of the nth word in item j Observed words: The nth word in item j

46 Deepak Agarwal & Bee-Chung ICML’11 Model Fitting Given: –Features X = {x i, x j, x ij } –Observed ratings y = {y ij } and words w = {w jn } Estimate: –Parameters:  = [b, g 0, d 0, H,  2, a , a , A s,,  ] Regression weights and prior parameters –Latent factors:  = {  i,  j, s i } and z = {z jn } User factors, item factors and per-word topic assignment Empirical Bayes approach: –Maximum likelihood estimate of the parameters: –The posterior distribution of the factors:

47 Deepak Agarwal & Bee-Chung ICML’11 The EM Algorithm Iterate through the E and M steps until convergence –Let be the current estimate –E-step: Compute The expectation is not in closed form We draw Gibbs samples and compute the Monte Carlo mean –M-step: Find It consists of solving a number of regression and optimization problems

48 Deepak Agarwal & Bee-Chung ICML’11 Supervised Topic Assignment Same as unsupervised LDA Likelihood of observed ratings by users who rated item j when z jn is set to topic k Probability of observing y ij given the model The topic of the nth word in item j

49 Deepak Agarwal & Bee-Chung ICML’11 fLDA: Experimental Results (Movie) Task: Predict the rating that a user would give a movie Training/test split: –Sort observations by time –First 75%  Training data –Last 25%  Test data Item warm-start scenario –Only 2% new items in test data ModelTest RMSE RLFM fLDA Factor-Only FilterBot unsup-LDA MostPopular Feature-Only Constant fLDA is as strong as the best method It does not reduce the performance in warm-start scenarios

50 Deepak Agarwal & Bee-Chung ICML’11 fLDA: Experimental Results (Yahoo! Buzz) Task: Predict whether a user would buzz-up an article Severe item cold-start –All items are new in test data Data Statistics 1.2M observations 4K users 10K articles fLDA significantly outperforms other models

51 Deepak Agarwal & Bee-Chung ICML’11 Experimental Results: Buzzing Topics Top Terms (after stemming)Topic bush, tortur, interrog, terror, administr, CIA, offici, suspect, releas, investig, georg, memo, al CIA interrogation mexico, flu, pirat, swine, drug, ship, somali, border, mexican, hostag, offici, somalia, captain Swine flu NFL, player, team, suleman, game, nadya, star, high, octuplet, nadya_suleman, michael, week NFL games court, gai, marriag, suprem, right, judg, rule, sex, pope, supreme_court, appeal, ban, legal, allow Gay marriage palin, republican, parti, obama, limbaugh, sarah, rush, gop, presid, sarah_palin, sai, gov, alaska Sarah Palin idol, american, night, star, look, michel, win, dress, susan, danc, judg, boyl, michelle_obama American idol economi, recess, job, percent, econom, bank, expect, rate, jobless, year, unemploy, month Recession north, korea, china, north_korea, launch, nuclear, rocket, missil, south, said, russia North Korea issues 3/4 topics are interpretable; 1/2 are similar to unsupervised topics

52 Deepak Agarwal & Bee-Chung ICML’11 fLDA Summary fLDA is a useful model for cold-start item recommendation It also provides interpretable recommendations for users –User’s preference to interpretable LDA topics Future directions: –Investigate Gibbs sampling chains and the convergence properties of the EM algorithm –Apply fLDA to other multi-task prediction problems fLDA can be used as a tool to generate supervised features (topics) from text data

53 Deepak Agarwal & Bee-Chung ICML’11 Summary Regularizing factors through covariates effective Regression based factor model that regularizes better and deals with both cold-start and warm-start in a single framework in a seamless way looks attractive Fitting method scalable; Gibbs sampling for users and movies can be done in parallel. Regressions in M-step can be done with any off-the-shelf scalable linear regression routine Distributed computing on Hadoop: Multiple models and average across partitions

54 Deepak Agarwal & Bee-Chung ICML’11 Hierarchical smoothing Advertising application, Agarwal et al. KDD 10 Product of states for each node pair Spike and Slab prior –Known to encourage parsimonious solutions Several cell states have no corrections –Not used before for multi-hierarchy models, only in regression –We choose P =.5 (and choose “a” by cross-validation) a – psuedo number of successes pubclass Pub-id Advertiser Conv-id campaign Ad-id z ( S z, E z, λ z )