A Latent Space Approach to Dynamic Embedding of Co-occurrence Data


1 A Latent Space Approach to Dynamic Embedding of Co-occurrence Data
Purnamrita Sarkar (Machine Learning Department), Sajid M. Siddiqi (Robotics Institute), Geoffrey J. Gordon (Machine Learning Department), CMU

2 A Static Problem: Pairwise Discrete Entity Links → Embedding in R²
Input: pairwise link data between authors and keywords (e.g. Alice–SVM, Bob–Neural, Charlie–Tree, Charlie–Entropy), summarized as a table of co-occurrence counts. Output: an embedding of the authors and keywords in R². This problem was addressed by the CODE algorithm of Globerson et al., NIPS 2004.

3 A Dynamic Problem: Pairwise Discrete Entity Links Over Time → Embeddings in R²
Input: pairwise link data per timestep, summarized as co-occurrence counts per timestep. Output: a smooth embedding in Rⁿ over time. This is the problem we address with our algorithm, D-CODE (Dynamic CODE). Additionally, we want distributions over entity coordinates, rather than point estimates.

4 Notation

5 One time-slice of the model
Graphical model for one time slice: the author coordinates and word coordinates (latent) generate the author-word co-occurrence counts (observed).

6 The dynamic model

7 The Observation Model
The closer a pair is in latent space, the higher its probability of co-occurrence. But what if we want a distribution over the coordinates, and not just point estimates? Kalman Filters!
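For concreteness, the static CODE model of Globerson et al. ties co-occurrence probability to the distance between latent coordinates. One CODE-style form is sketched below; the exact handling of the marginal terms and the normalizer varies between variants, so treat this as an illustration rather than the precise model used here:

p(a, w) \;=\; \frac{1}{Z}\,\bar{p}(a)\,\bar{p}(w)\,\exp\!\left(-\lVert x_a - y_w\rVert^2\right),
\qquad
Z \;=\; \sum_{a',\,w'} \bar{p}(a')\,\bar{p}(w')\,\exp\!\left(-\lVert x_{a'} - y_{w'}\rVert^2\right)

Here x_a and y_w are the latent coordinates of author a and word w, and \bar{p} denotes empirical marginals. The normalizer Z is the "nasty log-sum" that the approximation slides below have to deal with.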

8 Kalman Filters
X0 → X1 → … → XT : hidden states; O1 … OT : observations (Xt hidden, Ot observed)
Requirements: (1) X and O are real-valued, and (2) p(Xt | Xt-1) and p(Ot | Xt) are Gaussian.
Operations (a minimal sketch in code follows this list):
Start with an initial belief over X, i.e. P(X0)
For t = 1…T, with P(Xt | O1:t-1) as the current belief:
Condition on Ot to obtain P(Xt | O1:t)
Predict the joint belief P(Xt, Xt+1 | O1:t) with the transition model
Roll up (i.e. integrate out) Xt to get the updated belief P(Xt+1 | O1:t)
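A minimal numerical sketch of this recursion for a generic linear-Gaussian model (this is textbook Kalman filtering, not D-CODE's nonlinear observation model; the matrices A, Q, C, R are illustrative placeholders):

import numpy as np

def kalman_filter(mu0, S0, A, Q, C, R, observations):
    # Generic Kalman filter: returns the filtered beliefs P(X_t | O_1:t).
    # Transition model:  X_t = A X_{t-1} + noise,  noise ~ N(0, Q)
    # Observation model: O_t = C X_t + noise,      noise ~ N(0, R)
    mu, S = mu0, S0
    filtered = []
    for o in observations:
        # Predict + roll-up: belief P(X_t | O_1:t-1) from the previous posterior
        mu_pred = A @ mu
        S_pred = A @ S @ A.T + Q
        # Condition on the new observation O_t (Kalman gain update)
        K = S_pred @ C.T @ np.linalg.inv(C @ S_pred @ C.T + R)
        mu = mu_pred + K @ (o - C @ mu_pred)
        S = (np.eye(len(mu)) - K @ C) @ S_pred
        filtered.append((mu, S))
    return filtered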

9 Kalman Filter Inference
Conditioning step: combine the prior belief N(μ_t|t-1, Σ_t|t-1) with the Gaussian observation term N(μ_obs, Σ_obs).

10 Kalman Filter Inference
Prediction step: propagate the belief N(μ_t|t-1, Σ_t|t-1) forward through the transition model with noise N(0, Σ_transition).

11 Approximating the Observation Model
Let's take a closer look at the observation model. It can be moment-matched to a Gaussian… except for the normalizer (a nasty log-sum). Our approach: approximate the normalization constant.

12 Linearizing the normalizer
We want to approximate the observation model with a Gaussian distribution. Step 1: a first-order Taylor approximation of the normalizer. However, this is still hard.

13 Linearizing the normalizer
We want to approximate the observation model with a Gaussian distribution. Step 2: a second-order Taylor approximation of the remaining term. We obtain a closed-form Gaussian with parameters related to the Jacobian and Hessian of this Taylor approximation. This Gaussian N(μ_approx, Σ_approx) is our approximated observation model. We choose the linearization points to be the posterior means of the coordinates, given the data observed so far. This Gaussian preserves x-y correlations!
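As a generic illustration (not the paper's exact expansion) of why a Taylor approximation of the log-normalizer yields a Gaussian, expand around a linearization point x_0:

\log Z(x) \;\approx\; \log Z(x_0) + g^{\top}(x - x_0) + \tfrac{1}{2}\,(x - x_0)^{\top} H\,(x - x_0),
\qquad g = \nabla \log Z(x_0), \quad H = \nabla^2 \log Z(x_0)

Substituting a quadratic expression like this into the log observation density leaves something quadratic in the coordinates, i.e. the log of an (unnormalized) Gaussian, which is what makes the closed-form N(μ_approx, Σ_approx) above possible.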

14 Approximating the Observation Model
Two pairs of contour plots of an author's true posterior conditional (left panel of each pair) in a 3-author, 5-word embedding and the corresponding approximate Gaussian posterior conditional (right panel). A is a difficult-to-approximate bimodal case; B is an easier unimodal case.

15 Our Algorithms
D-CODE: expected model probability, which can be obtained in closed form using our approximation.
D-CODE MLE: evaluate the model probability at the posterior means.
Static versions of the above, which learn an embedding on the counts C_T-1 and predict for year T.
(A schematic contrast of the first two appears below.)
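The difference between the two variants can be sketched schematically. The code below contrasts a plug-in evaluation at the posterior mean with an expectation under the posterior over coordinates; the expectation is approximated by Monte Carlo purely for illustration (the paper obtains it in closed form), and log_model_prob is a hypothetical placeholder for the model's log-probability function:

import numpy as np

def mle_score(log_model_prob, mu):
    # D-CODE MLE-style: evaluate the model probability at the posterior mean.
    return log_model_prob(mu)

def expected_score(log_model_prob, mu, Sigma, n_samples=1000, seed=0):
    # D-CODE-style: expected model probability under the posterior over
    # coordinates, approximated here by sampling (the paper uses a
    # closed-form expression based on its Gaussian approximation).
    rng = np.random.default_rng(seed)
    samples = rng.multivariate_normal(mu, Sigma, size=n_samples)
    return np.mean([log_model_prob(x) for x in samples])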

16 Algorithms for Comparison
We compare with a dynamic version of PCA over overlapping windows of data. Consistency between the configurations at two consecutive time-steps is maintained by a Procrustes transform (a sketch of this alignment step follows below). For ranking, we evaluate our model probability at the PCA coordinates.
We also compare to Locally Linear Embedding (LLE) on the author prediction task. Like the static D-CODE variants above, we embed data for year T − 1 and predict for year T. We define author-author distances based on the words they use, as in Mei and Shelton (2006); this allows us to compare with LLE.
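A minimal sketch of the Procrustes alignment step assumed here: rotate/reflect the PCA configuration at time t so it lines up with the configuration at time t − 1. This uses scipy's orthogonal_procrustes; the centering choice is illustrative and may differ from the actual setup.

import numpy as np
from scipy.linalg import orthogonal_procrustes

def align_to_previous(X_prev, X_curr):
    # X_prev, X_curr: (n_points, dim) coordinates of the same entities at
    # consecutive time-steps. Returns X_curr rotated/reflected to best match X_prev.
    X_prev_c = X_prev - X_prev.mean(axis=0)   # center both configurations (illustrative choice)
    X_curr_c = X_curr - X_curr.mean(axis=0)
    R, _ = orthogonal_procrustes(X_curr_c, X_prev_c)
    return X_curr_c @ R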

17 Experiments
We present the experiments in two sections:
Qualitative results: visualization
Quantitative results: ranking on the Naive Bayes author prediction task
Naive Bayes author prediction: we use the distributions / point estimates over entity locations at each timestep to perform a Naive Bayes ranking of authors, given a subset of words from a paper in the next timestep (a schematic sketch follows below).
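A schematic sketch of this ranking, assuming the CODE-style observation model sketched earlier, point estimates of the coordinates, and conditional independence of words given the author; author-dependent normalizers and word marginals are dropped for brevity, so this is illustrative only:

import numpy as np

def rank_authors(author_coords, word_coords, word_ids, author_log_prior=None):
    # author_coords: (n_authors, dim), word_coords: (n_words, dim)
    # word_ids: indices of the words observed in the new paper.
    # Naive Bayes score: log prior + sum over words of an (unnormalized)
    # log p(w | a), taken here to be -||x_a - y_w||^2.
    n_authors = author_coords.shape[0]
    scores = np.zeros(n_authors) if author_log_prior is None else author_log_prior.copy()
    for w in word_ids:
        scores -= np.sum((author_coords - word_coords[w]) ** 2, axis=1)
    return np.argsort(-scores)   # authors ranked best-first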

18 Synthetic Data
Consider a dataset of 6 authors and 3 words. There is one group of words A1…3 and two groups of authors X1…3 and Y1…3. Initially the words Ai are mostly used by authors Xi; over time, the words gradually shift towards authors Yi. There is a random noise component in the data (one illustrative generator is sketched below).
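One hypothetical way to generate counts with this drifting structure; the linear mixing schedule, count scale, and Poisson noise are illustrative choices, not the paper's actual generator:

import numpy as np

def synthetic_counts(T=20, scale=50, noise_rate=2, seed=0):
    # Counts for 6 authors (X1-3, Y1-3) and 3 words (A1-3) over T timesteps.
    # Word Ai starts out used mostly by author Xi and drifts towards Yi.
    # Returns an array of shape (T, 6, 3): counts[t, author, word].
    rng = np.random.default_rng(seed)
    counts = np.zeros((T, 6, 3))
    for t in range(T):
        alpha = t / (T - 1)                      # 0: words with X authors, 1: words with Y authors
        for i in range(3):
            counts[t, i, i] = (1 - alpha) * scale        # author Xi with word Ai
            counts[t, 3 + i, i] = alpha * scale          # author Yi with word Ai
        counts[t] += rng.poisson(noise_rate, size=(6, 3))  # random noise component
    return counts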

19 The dynamic embedding successfully reflects trends in the underlying data

20 Ranking with and without distributions
Ranks computed using D-CODE shift much more smoothly over time.

21 NIPS co-authorship data

22

23

24 NIPS author rankings: (Jordan, variational)
Average author rank given a word, predicted using D-CODE (top) and Dynamic PCA (middle), and the empirical probabilities p(a | w) on NIPS data (bottom), plotted over time; t = 13 corresponds to the last year of the data. Note that D-CODE's predicted rank is close to 1 when p(a | w) is high, and larger otherwise. In contrast, Dynamic PCA's predicted rank shows no noticeable correlation.

25 NIPS author rankings: (Smola, kernel)
Average author rank given a word, predicted using D-CODE (top) and Dynamic PCA (middle), and the empirical probabilities p(a | w) on NIPS data (bottom), plotted over time; t = 13 corresponds to the last year of the data. Note that D-CODE's predicted rank is close to 1 when p(a | w) is high, and larger otherwise. In contrast, Dynamic PCA's predicted rank shows no noticeable correlation.

26 NIPS author rankings: (Waibel, speech)
Average author rank given a word, predicted using D-CODE (top) and Dynamic PCA (middle), and the empirical probabilities p(a | w) on NIPS data (bottom), plotted over time; t = 13 corresponds to the last year of the data. Note that D-CODE's predicted rank is close to 1 when p(a | w) is high, and larger otherwise. In contrast, Dynamic PCA's predicted rank shows no noticeable correlation.

27 Rank Prediction Median predicted rank of true authors of papers in t = 13 based on embeddings until t = 12. Values statistically indistinguishable from the best in each row are in bold. D-CODE is the best model in most cases, showing the usefulness of having distributions rather than just point estimates. D-CODE and D-CODE MLE also beat their static counterparts, showing the advantage of dynamic modeling.

28 Conclusion
A novel dynamic embedding algorithm for co-occurrence count data, using Kalman Filters: visualization, prediction, and detecting trends. Distributions in embeddings make a difference! We can also do smoothing with closed-form updates.

29 Acknowledgements We gratefully acknowledge Carlos Guestrin for his guidance and helpful comments.

