
1 A Latent Space Approach to Dynamic Embedding of Co-occurrence Data
Purnamrita Sarkar, Machine Learning Department
Sajid M. Siddiqi, Robotics Institute
Geoffrey J. Gordon, Machine Learning Department
CMU

2 A Static Problem: Pairwise Discrete Entity Links → Embedding in R^2
This problem was addressed by the CODE algorithm of Globerson et al., NIPS 2004.
Input: pairwise link data, e.g.

  Author   Keyword
  Alice    SVM
  Bob      Neural
  Charlie  Tree
  Alice    Entropy
  Charlie  Neural

summarized as author-by-keyword co-occurrence counts:

       E  N  S  T
  A    1  0  1  0
  B    0  1  0  0
  C    0  1  0  1

(rows A, B, C = Alice, Bob, Charlie; columns E, N, S, T = Entropy, Neural, SVM, Tree)

Output: an embedding of both the authors (Alice, Bob, Charlie) and the keywords (SVM, Neural, Tree, Entropy) as points in R^2.

3 A Dynamic Problem: Pairwise Discrete Entity Links Over Time → Embeddings in R^2
Input: pairwise link data per timestep, i.e. co-occurrence counts per timestep.
Output: a smooth embedding in R^n over time.
This is the problem we address with our algorithm: D-CODE (Dynamic CODE). Additionally, we want distributions over entity coordinates, rather than point estimates.

4 Notation

5 One time-slice of the model (graphical model figure): the author coordinates and word coordinates generate the author-word co-occurrence counts.

6 The dynamic model
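The slide presents the model as a figure: the per-timestep structure of slide 5, chained over time. A plausible form of the Gaussian transition, consistent with the Kalman-filter requirements on slide 8 (the random-walk form and the isotropic variance \sigma^2 I are assumptions here, not stated on the slide), is

    p(X_t \mid X_{t-1}) = \mathcal{N}(X_t \,;\, X_{t-1},\, \sigma^2 I),

i.e., entity coordinates drift by a small Gaussian perturbation between timesteps, which is what makes the embeddings evolve smoothly over time.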

7 The Observation Model
The closer a pair of entities is in latent space, the higher their probability of co-occurrence. But what if we want a distribution over the coordinates, and not just point estimates? Kalman filters!
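The slide states the model only qualitatively. In CODE-style models (Globerson et al., NIPS 2004), the co-occurrence probability of author a and word w with latent coordinates x_a and y_w typically takes a form like

    p(a, w \mid X, Y) = \frac{\exp(-\lVert x_a - y_w \rVert^2)}{\sum_{a', w'} \exp(-\lVert x_{a'} - y_{w'} \rVert^2)},

treated here as an illustrative sketch rather than D-CODE's exact variant. The denominator is the normalizer whose log-sum form makes exact Gaussian filtering intractable, which motivates the approximation on slides 11-13.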

8 Kalman Filters
(Graphical model figure: a chain of hidden states X_0 → X_1 → … → X_T, each X_t emitting an observation O_t.)
X_t: hidden; O_t: observed.
Requirements: 1. X and O are real-valued, and 2. p(X_t | X_{t-1}) and p(O_t | X_t) are Gaussian.
Operations:
1. Start with an initial belief over X, i.e. P(X_0)
2. For t = 1…T, with P(X_t | O_{1:t-1}) as the current belief:
   a) Condition on O_t to obtain P(X_t | O_{1:t})
   b) Predict the joint belief P(X_t, X_{t+1} | O_{1:t}) with the transition model
   c) Roll-up (i.e. integrate out) X_t to get the updated belief P(X_{t+1} | O_{1:t})
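A minimal sketch of this loop for a generic linear-Gaussian Kalman filter; the transition matrix A, observation matrix C, and noise covariances Q and R are illustrative assumptions, and D-CODE replaces the Gaussian observation term with the approximation developed on the following slides:

import numpy as np

def kalman_filter(mu, Sigma, A, Q, C, R, observations):
    # mu, Sigma: initial belief over the state (step 1 above)
    # Transition: x_{t+1} = A x_t + N(0, Q); Observation: o_t = C x_t + N(0, R)
    beliefs = []
    for o in observations:
        # (a) Condition on o_t to obtain P(X_t | O_{1:t})
        S = C @ Sigma @ C.T + R                    # innovation covariance
        K = Sigma @ C.T @ np.linalg.inv(S)         # Kalman gain
        mu = mu + K @ (o - C @ mu)
        Sigma = Sigma - K @ C @ Sigma
        beliefs.append((mu, Sigma))
        # (b) + (c) Predict with the transition model and roll up X_t,
        # giving the next belief P(X_{t+1} | O_{1:t})
        mu = A @ mu
        Sigma = A @ Sigma @ A.T + Q
    return beliefs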

9 Kalman Filter Inference (figure): the conditioning step, combining the current belief N(μ_{t|t-1}, Σ_{t|t-1}) with the Gaussian observation term N(μ_obs, Σ_obs).

10 Kalman Filter Inference (figure): the prediction step, involving the belief N(μ_{t|t-1}, Σ_{t|t-1}) and the Gaussian transition model N(·, Σ_transition).

11 Approximating the Observation Model
Let's take a closer look at the observation model. It can be moment-matched to a Gaussian… except for the normalizer (a nasty log-sum). Our approach: approximate the normalization constant.

12 Linearizing the normalizer
We want to approximate the observation model with a Gaussian distribution.
Step 1: a first-order Taylor approximation of the normalizer. However, this is still hard.

13 Linearizing the normalizer
We want to approximate the observation model with a Gaussian distribution.
Step 2: a second-order Taylor approximation of the normalizer term.
– We obtain a closed-form Gaussian with parameters related to the Jacobian and Hessian of this Taylor approximation. This Gaussian N(μ_approx, Σ_approx) is our approximated observation model.
– We choose the linearization points to be the posterior means of the coordinates, given the data observed so far.
– This Gaussian preserves x-y correlations!
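A minimal numerical sketch of the idea, assuming a CODE-style observation likelihood (as sketched after slide 7): expand its log around a linearization point z0 to second order and read off the Gaussian that the resulting quadratic defines. Finite differences stand in for the closed-form Jacobian and Hessian that the paper derives, and the function names and the count-based likelihood are illustrative assumptions:

import numpy as np
from scipy.special import logsumexp

def log_obs_model(z, counts, n_authors, n_words, dim=2):
    # Log of a CODE-style observation likelihood (up to a constant).
    # z stacks author and word coordinates; counts[a, w] are co-occurrences.
    X = z[: n_authors * dim].reshape(n_authors, dim)     # author coordinates
    Y = z[n_authors * dim :].reshape(n_words, dim)       # word coordinates
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    log_Z = logsumexp(-sq_dists)                         # the "nasty log-sum"
    return -(counts * sq_dists).sum() - counts.sum() * log_Z

def gaussian_from_second_order_taylor(f, z0, eps=1e-4):
    # Second-order Taylor expansion of the log-density f around z0 -> N(mu, Sigma).
    # Requires the negative Hessian at z0 to be positive definite.
    d = z0.size
    g = np.zeros(d)
    H = np.zeros((d, d))
    for i in range(d):
        ei = np.eye(d)[i] * eps
        g[i] = (f(z0 + ei) - f(z0 - ei)) / (2 * eps)
        for j in range(d):
            ej = np.eye(d)[j] * eps
            H[i, j] = (f(z0 + ei + ej) - f(z0 + ei - ej)
                       - f(z0 - ei + ej) + f(z0 - ei - ej)) / (4 * eps ** 2)
    Sigma = np.linalg.inv(-H)     # covariance = inverse of the negative Hessian
    mu = z0 + Sigma @ g           # mode of the quadratic approximation
    return mu, Sigma

In D-CODE the linearization point z0 would be the posterior means of the coordinates given the data observed so far, as stated on the slide.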

14 Approximating the Observation Model
(Figure, panels A and B:) Two pairs of contour plots of an author's true posterior conditional (left panel) in a 3-author, 5-word embedding and the corresponding approximate Gaussian posterior conditional obtained (right panel). A is a difficult-to-approximate bimodal case; B is an easier unimodal case.

15 Our Algorithms
D-CODE: expected model probability
– Can be obtained in closed form using our approximation.
D-CODE MLE: evaluate model probability using the posterior means.
Static versions of the above, which learn an embedding on C_{T-1} to predict for year T.

16 Algorithms for Comparison
We compare with a dynamic version of PCA over overlapping windows of data:
– Consistency between the configurations at two consecutive time-steps is maintained by a Procrustes transform (see the sketch below).
– For ranking, we evaluate our model probability at the PCA coordinates.
We also compare to Locally Linear Embedding (LLE) on the author prediction task. Like the static D-CODE variants above, we embed data for year T−1 and predict for year T. We define author-author distances based on the words they use, as in Mei and Shelton (2006), which allows us to compare with LLE.
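A minimal sketch of the Procrustes step, assuming the embeddings from two consecutive windows are stored row-wise with matching entities (orthogonal Procrustes via the SVD; the centering choice is an assumption):

import numpy as np

def procrustes_align(prev_coords, curr_coords):
    # Rotate/reflect curr_coords to best match prev_coords in least squares.
    # Both arrays are (n_entities, dim) with rows in the same entity order.
    prev_c = prev_coords - prev_coords.mean(axis=0)
    curr_c = curr_coords - curr_coords.mean(axis=0)
    # Optimal orthogonal transform from the SVD of the cross-covariance
    U, _, Vt = np.linalg.svd(curr_c.T @ prev_c)
    R = U @ Vt
    return curr_c @ R + prev_coords.mean(axis=0)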

17 Experiments
We present the experiments in two sections:
– Qualitative results: visualization.
– Quantitative results: ranking on the Naïve Bayes author prediction task.
Naïve Bayes author prediction: we use the distributions / point estimates over entity locations at each timestep to perform Naïve Bayes ranking of authors given a subset of words from a paper in the next timestep (see the sketch below).
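A minimal sketch of the ranking step at point estimates, using the CODE-style per-word likelihood from earlier; the uniform author prior and the exact scoring details (e.g. the expected-probability variant used by D-CODE) are simplifications here:

import numpy as np

def rank_authors(author_coords, word_coords, query_word_ids):
    # Rank authors for a paper given a subset of its words (Naive Bayes style).
    # author_coords: (n_authors, dim); word_coords: (n_words, dim) point estimates.
    sq_dists = ((author_coords[:, None, :] - word_coords[None, :, :]) ** 2).sum(-1)
    # log p(w | a), normalized over words for each author
    log_p_w_given_a = -sq_dists - np.logaddexp.reduce(-sq_dists, axis=1, keepdims=True)
    scores = log_p_w_given_a[:, query_word_ids].sum(axis=1)
    order = np.argsort(-scores)                   # best-scoring author first
    ranks = np.empty(len(scores), dtype=int)
    ranks[order] = np.arange(1, len(scores) + 1)  # rank 1 = most likely author
    return ranks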

18 Synthetic Data
Consider a dataset of 6 authors and 3 words. There is one group of words A_{1…3}, and two groups of authors X_{1…3} and Y_{1…3}. Initially the words A_i are mostly used by authors X_i. Over time, the words gradually shift towards authors Y_i. There is a random noise component in the data.
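A possible generator for data of this kind; the group sizes match the slide, while the drift schedule, noise level, count scale, and Poisson sampling are illustrative assumptions:

import numpy as np

def synthetic_counts(T=20, scale=20, noise=2, seed=0):
    # T timesteps of 6x3 author-word co-occurrence counts.
    # Authors 0-2 are the X group, authors 3-5 the Y group; word i starts out
    # associated with X_i and gradually shifts toward Y_i.
    rng = np.random.default_rng(seed)
    counts = []
    for t in range(T):
        alpha = t / max(T - 1, 1)               # drifts from 0 to 1 over time
        rates = np.zeros((6, 3))
        for i in range(3):
            rates[i, i] = scale * (1 - alpha)   # X_i uses word A_i early on
            rates[3 + i, i] = scale * alpha     # Y_i takes over later
        counts.append(rng.poisson(rates + noise))
    return counts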

19 The dynamic embedding successfully reflects trends in the underlying data

20 Ranking with and without distributions: ranks using D-CODE shift much more smoothly over time.

21 NIPS co-authorship data

22

23

24 NIPS authors rankings:(Jordan, variational) Average author rank given a word, predicted using D-CODE (above) and Dynamic PCA (middle), and the empirical probabilities p(a | w) on NIPS data (below). t = 13 corresponds to 1999. Note that D-CODE’s predicted rank is close to 1 when p(a | w) is high, and larger otherwise. In contrast, Dynamic PCA’s predicted rank shows no noticeable correlation.

25 NIPS authors rankings: (Smola, kernel) Average author rank given a word, predicted using D-CODE (above) and Dynamic PCA (middle), and the empirical probabilities p(a | w) on NIPS data (below). t = 13 corresponds to 1999. Note that D-CODE’s predicted rank is close to 1 when p(a | w) is high, and larger otherwise. In contrast, Dynamic PCA’s predicted rank shows no noticeable correlation.

26 NIPS authors rankings: (Waibel, speech) Average author rank given a word, predicted using D-CODE (above) and Dynamic PCA (middle), and the empirical probabilities p(a | w) on NIPS data (below). t = 13 corresponds to 1999. Note that D-CODE’s predicted rank is close to 1 when p(a | w) is high, and larger otherwise. In contrast, Dynamic PCA’s predicted rank shows no noticeable correlation.

27 Rank Prediction Median predicted rank of true authors of papers in t = 13 based on embeddings until t = 12. Values statistically indistinguishable from the best in each row are in bold. D-CODE is the best model in most cases, showing the usefulness of having distributions rather than just point estimates. D-CODE and D-CODE MLE also beat their static counterparts, showing the advantage of dynamic modeling.

28 Conclusion
A novel dynamic embedding algorithm for co-occurrence count data, using Kalman filters:
– Visualization
– Prediction
– Detecting trends
Distributions in embeddings make a difference! We can also do smoothing with closed-form updates.

29 Acknowledgements We gratefully acknowledge Carlos Guestrin for his guidance and helpful comments.

