Cao et al. ICML 2010 Presented by Danushka Bollegala.


1 Cao et al. ICML 2010 Presented by Danushka Bollegala.

2
- Predict links (relations) between entities
  - Recommend items for users (MovieLens, Amazon)
  - Recommend users for users (social recommendation)
  - Similarity search (suggest similar web pages)
  - Query suggestion (suggest related queries by other users)
- Collective Link Prediction (CLP)
  - Perform multiple prediction tasks for the same set of users simultaneously
  - Predict/recommend multiple item types (books and movies)
- Pros
  - Prediction tasks might not be independent; one can benefit from another (books vs. movies vs. food)
  - Less affected by data sparseness (cold-start problem)

3
- Transfer Learning + Collective Link Prediction (this paper)
- Gaussian Process for Regression (GPR) (PRML Sec. 6.4)
- Link prediction = matrix factorization
- Probabilistic Principal Component Analysis (PPCA) (Tipping & Bishop, 1999; PRML Chapter 12)
- Probabilistic non-linear matrix factorization (Lawrence & Urtasun, ICML 2009)
- Task similarity matrix, T

4
- Link matrix X (x_ij is the rating given by user i to item j)
- x_ij is modeled by f(u_i, v_j, ε)
  - f: link function
  - u_i: latent representation of user i
  - v_j: latent representation of item j
  - ε: noise term
- Generalized matrix approximation (written out below)
- Assumption: E is Gaussian noise N(0, σ²I)
- Use Y = f⁻¹(X)
- Then Y follows a multivariate Gaussian distribution.
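The slide does not spell out the formula; a minimal way to write the generalized matrix approximation it describes (the exact placement of the noise term is an assumption) is:

```latex
y_{ij} = \mathbf{u}_i^{\top}\mathbf{v}_j + \varepsilon_{ij},
\qquad \varepsilon_{ij} \sim \mathcal{N}(0, \sigma^{2}),
\qquad x_{ij} = f(y_{ij}),
\qquad Y = f^{-1}(X)
```

So the inverse-linked matrix Y = UVᵀ + E is linear-Gaussian, which is what makes the PPCA/GP machinery in the following slides applicable.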

5 Revision (PRML Section 6.4)

6
- We can view a function as an infinite-dimensional vector
  - f(x): (f(x_1), f(x_2), ...)^T
  - Each point in the domain is mapped by f to a dimension in the vector
- In machine learning we must find functions (e.g. linear predictors) that map input values to their corresponding output values
  - We must also avoid over-fitting
- This can be visualized as sampling from a distribution over functions with certain properties (see the sketch below)
  - Preference bias (cf. restriction bias)
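A small sketch (not from the slides) of what "sampling from a distribution over functions" looks like, using a zero-mean GP prior with a Gaussian (RBF) kernel:

```python
import numpy as np

# Draw sample functions from a zero-mean GP prior with an RBF kernel,
# illustrating the idea of a "distribution over functions".
def rbf_kernel(x1, x2, lengthscale=1.0, variance=1.0):
    # Squared distances between every pair of (1-D) input points
    d2 = (x1[:, None] - x2[None, :]) ** 2
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2)

x = np.linspace(-5, 5, 100)                   # points at which the function is evaluated
K = rbf_kernel(x, x) + 1e-8 * np.eye(len(x))  # small jitter keeps K positive definite
samples = np.random.multivariate_normal(np.zeros(len(x)), K, size=3)
# Each row of `samples` is one function drawn from the GP prior.
```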

7
- Linear regression model
- We get different output functions y for different weight vectors w.
- Let us impose a Gaussian prior over w
- Training dataset: {(x_1, y_1), ..., (x_N, y_N)}
- Targets: y = (y_1, ..., y_N)^T
- Design matrix
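The formulas appear on the slide only as images; the standard PRML Section 6.4 forms assumed here are:

```latex
y(\mathbf{x}) = \mathbf{w}^{\top}\boldsymbol{\phi}(\mathbf{x}),
\qquad p(\mathbf{w}) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \alpha^{-1}\mathbf{I}),
\qquad \mathbf{y} = \boldsymbol{\Phi}\mathbf{w},
\quad \text{with } \Phi_{nk} = \phi_k(\mathbf{x}_n)
```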

8
- When we impose a Gaussian prior over the weight vector, the target vector y is also Gaussian.
- K: kernel matrix (Gram matrix)
- k: kernel function
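Since y = Φw is a linear transform of a Gaussian w, the induced distribution (the standard PRML result, reproduced here for completeness) is:

```latex
p(\mathbf{y}) = \mathcal{N}(\mathbf{y} \mid \mathbf{0}, \mathbf{K}),
\qquad \mathbf{K} = \tfrac{1}{\alpha}\boldsymbol{\Phi}\boldsymbol{\Phi}^{\top},
\qquad K_{nm} = k(\mathbf{x}_n, \mathbf{x}_m) = \tfrac{1}{\alpha}\,\boldsymbol{\phi}(\mathbf{x}_n)^{\top}\boldsymbol{\phi}(\mathbf{x}_m)
```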

9
- A Gaussian process is defined as a probability distribution over functions y(x) such that the set of values y(x) evaluated at an arbitrary set of points x_1, ..., x_N jointly have a Gaussian distribution.
  - p(y(x_1), ..., y(x_N)) is Gaussian.
- Often the mean is set to zero
  - Non-informative prior
  - Then the kernel function fully defines the GP.
- Gaussian kernel
- Exponential kernel
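The two kernels are shown as images on the slide; their usual forms (an assumption about which parametrizations were meant) are:

```latex
k_{\text{Gauss}}(\mathbf{x}, \mathbf{x}') = \exp\!\left(-\frac{\lVert \mathbf{x} - \mathbf{x}' \rVert^{2}}{2\sigma^{2}}\right),
\qquad
k_{\text{exp}}(\mathbf{x}, \mathbf{x}') = \exp\!\left(-\theta \lVert \mathbf{x} - \mathbf{x}' \rVert\right)
```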

10
- Predict outputs with noise
- (Figure on slide: inputs x, function values y, noise ε, observed targets t)
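The noisy-observation model and the predictive distribution (standard GPR from PRML 6.4, filled in here since the slide shows only the figure) are:

```latex
t_n = y(\mathbf{x}_n) + \varepsilon_n,\quad \varepsilon_n \sim \mathcal{N}(0, \beta^{-1}),
\qquad p(\mathbf{t}) = \mathcal{N}(\mathbf{t} \mid \mathbf{0}, \mathbf{C}_N),\quad \mathbf{C}_N = \mathbf{K} + \beta^{-1}\mathbf{I}
```

and, for a new input x_{N+1} with kernel vector k between x_{N+1} and the training inputs and c = k(x_{N+1}, x_{N+1}) + β⁻¹,

```latex
m(x_{N+1}) = \mathbf{k}^{\top}\mathbf{C}_N^{-1}\mathbf{t},
\qquad
\sigma^{2}(x_{N+1}) = c - \mathbf{k}^{\top}\mathbf{C}_N^{-1}\mathbf{k}
```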

11
- PMF can be seen as a Gaussian process with latent variables (GP-LVM) [Lawrence & Urtasun, ICML 2009]
- Generalized matrix approximation model: Y = f⁻¹(X) follows a multivariate Gaussian distribution
- A Gaussian prior is set on U
- This is a non-linear version of the probabilistic PCA model of Tipping & Bishop (1999)
- Mapping back to X is done through the link function f
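One way to make this concrete, assuming the dual-PPCA / GP-LVM form of Lawrence & Urtasun (2009): integrating out one set of factors under a Gaussian prior leaves each item's column of Y jointly Gaussian across users, and the non-linear version simply swaps the linear covariance UUᵀ for a kernel matrix evaluated on the latent user representations:

```latex
p(\mathbf{Y} \mid \mathbf{U}) = \prod_{j=1}^{M} \mathcal{N}\!\left(\mathbf{y}_{:,j} \mid \mathbf{0},\; \mathbf{U}\mathbf{U}^{\top} + \sigma^{2}\mathbf{I}\right)
\;\;\longrightarrow\;\;
\prod_{j=1}^{M} \mathcal{N}\!\left(\mathbf{y}_{:,j} \mid \mathbf{0},\; \mathbf{K}(\mathbf{U},\mathbf{U}) + \sigma^{2}\mathbf{I}\right)
```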

12

13
- GP model for each task
- A single model for all tasks

14
- Known as the Kronecker product of two matrices (e.g., numpy.kron(a, b))
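A small illustration of how a Kronecker product can combine a task-similarity matrix with a within-task kernel into one joint covariance (this is the common multi-task GP construction; the exact matrices used in the paper are an assumption here):

```python
import numpy as np

T = np.array([[1.0, 0.7],
              [0.7, 1.0]])          # similarity between 2 tasks (e.g., books vs. movies)
K = np.array([[1.0, 0.5, 0.2],
              [0.5, 1.0, 0.5],
              [0.2, 0.5, 1.0]])     # kernel over 3 users within a task

C = np.kron(T, K)                   # joint covariance over all (task, user) pairs
print(C.shape)                      # (6, 6)
```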

15
- Each task might have a different rating distribution.
- c, α, b are parameters that must be estimated from the data.
- We can relax the constraint α > 0 if we have no prior knowledge regarding the negativity of the skewness of the rating distribution.

16
- Similar to GPR prediction
- Predicting y = g(x)
- Predicting x

17
- Compute the likelihood of the dataset
- Use stochastic gradient descent (SGD) for optimization
  - Non-convex optimization
  - Sensitive to initial conditions
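A schematic SGD loop for intuition (illustrative only: a plain squared-error matrix-factorization objective stands in for the paper's GP marginal likelihood, which is the actual non-convex objective being optimized):

```python
import numpy as np

def sgd_mf(ratings, n_users, n_items, k=5, lr=0.01, epochs=20, seed=0):
    # ratings: list of observed (user_index, item_index, rating) triples
    rng = np.random.default_rng(seed)
    U = 0.1 * rng.standard_normal((n_users, k))   # latent user factors
    V = 0.1 * rng.standard_normal((n_items, k))   # latent item factors
    for _ in range(epochs):
        rng.shuffle(ratings)                      # visit observed entries in random order
        for i, j, x in ratings:
            u_i, v_j = U[i].copy(), V[j].copy()
            err = x - u_i @ v_j                   # residual of the current prediction
            U[i] += lr * err * v_j                # noisy gradient step on the user factor
            V[j] += lr * err * u_i                # noisy gradient step on the item factor
    return U, V

# Example: three observed ratings for 2 users and 2 items.
U, V = sgd_mf([(0, 0, 4.0), (0, 1, 2.0), (1, 0, 5.0)], n_users=2, n_items=2, k=2)
```

Because the objective is non-convex, different seeds (initial conditions) can end in different local optima, which is the sensitivity the slide points out.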

18
- Setting
  - Use each dataset and predict multiple items
- Datasets
  - MovieLens
    - 100,000 ratings, 1-5 scale, 943 users, 1682 movies, 5 popular genres
  - Book-Crossing
    - 56,148 ratings, 1-10 scale, 28,503 users, 9909 books, 4 most general Amazon book categories
  - Douban
    - A social network-based recommendation service
    - 10,000 users, 200,000 items
    - Movies, books, music

19
- Evaluation measure
  - Mean Absolute Error (MAE)
- Baselines
  - I-GP: independent link prediction using GP
  - CMF: collective matrix factorization
    - non-GP, classical NMF
  - M-GP: joint link prediction using a multi-relational GP
    - Does not consider the similarity between tasks
- Proposed method = CLP-GP
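For reference, the evaluation measure is the standard mean absolute error over the N held-out ratings:

```latex
\mathrm{MAE} = \frac{1}{N}\sum_{n=1}^{N} \left| \hat{x}_n - x_n \right|
```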

20 Note: (1) smaller values are better; (2) with (+) / without (-) link function.

21 (Results figure; slide annotation: "Good")

22

23
- Romance and Drama are very similar
- Action and Comedy are very dissimilar

24
- Elegant model and well-written paper
- Few parameters (latent space dimension k) need to be specified
- All other parameters can be learnt
- Applicable to a wide range of tasks
- Cons:
  - Computational complexity
    - Predictions require kernel matrix inversion
    - SGD updates might not converge
    - The problem is non-convex...

