A Latent Space Approach to Dynamic Embedding of Co-occurrence Data

Presentation transcript:

A Latent Space Approach to Dynamic Embedding of Co-occurrence Data
Purnamrita Sarkar, Machine Learning Department; Sajid M. Siddiqi, Robotics Institute; Geoffrey J. Gordon, Machine Learning Department; CMU

A Static Problem: Pairwise Discrete Entity Links → Embedding in R²
Input: pairwise link data over discrete entities, summarized as author-keyword co-occurrence counts (e.g., Alice with SVM, Bob with Neural, Charlie with Tree and Entropy).
Output: an embedding of all the entities in R².
[Figure: example author-keyword co-occurrence count table and the resulting 2-D embedding]
This problem was addressed by the CODE algorithm of Globerson et al., NIPS 2004.

A Dynamic Problem: Pairwise Discrete Entity Links Over Time → Embeddings in R²
Input: pairwise link data (co-occurrence counts) per timestep.
Output: a smooth embedding in Rⁿ over time.
This is the problem we address with our algorithm: D-CODE (Dynamic CODE). Additionally, we want distributions over entity coordinates, rather than point estimates.

Notation

One time-slice of the model
[Figure: graphical model for one time-slice, with the author coordinates and word coordinates generating the author-word co-occurrence counts]

The dynamic model

The Observation Model
The closer a pair of entities in latent space, the higher their probability of co-occurrence.
But what if we want a distribution over the coordinates, and not just point estimates? Kalman filters!
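The slides do not spell out the functional form; a minimal sketch, assuming the CODE-style exponential-of-negative-squared-distance model that D-CODE builds on, with x_a and y_w the latent coordinates of author a and word w:

```latex
% Assumed CODE-style observation model (illustrative, not quoted from the slides):
% co-occurrence probability decays with squared distance in latent space.
p(a, w \mid X, Y) = \frac{\exp\!\left(-\lVert x_a - y_w \rVert^2\right)}{Z(X, Y)},
\qquad
Z(X, Y) = \sum_{a'} \sum_{w'} \exp\!\left(-\lVert x_{a'} - y_{w'} \rVert^2\right).
```

Under this form, log Z(X, Y) is the "nasty log-sum" normalizer discussed on the later slides.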

Kalman Filters
[Figure: chain-structured model X0 → X1 → … → XT with observations O1 … OT; Xt is hidden, Ot is observed]
Requirements:
1. X and O are real-valued, and
2. p(Xt | Xt-1) and p(Ot | Xt) are Gaussian
Operations:
- Start with an initial belief over X, i.e. P(X0)
- For t = 1…T, with P(Xt | O1:t-1) as the current belief:
  - Condition on Ot to obtain P(Xt | O1:t)
  - Predict the joint belief P(Xt, Xt+1 | O1:t) with the transition model
  - Roll up (i.e. integrate out) Xt to get the updated belief P(Xt+1 | O1:t)
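For reference, a sketch of the standard linear-Gaussian recursions behind the condition and predict steps above, assuming the textbook parameterization x_{t+1} = A x_t + w_t, o_t = C x_t + v_t with Gaussian noise covariances Q and R (D-CODE swaps the linear observation for its approximated Gaussian observation model):

```latex
% Condition on o_t (Kalman gain form):
K_t = \Sigma_{t|t-1} C^\top \left( C\, \Sigma_{t|t-1} C^\top + R \right)^{-1}, \quad
\mu_{t|t} = \mu_{t|t-1} + K_t \left( o_t - C \mu_{t|t-1} \right), \quad
\Sigma_{t|t} = (I - K_t C)\, \Sigma_{t|t-1}.
% Predict and roll up:
\mu_{t+1|t} = A\, \mu_{t|t}, \qquad
\Sigma_{t+1|t} = A\, \Sigma_{t|t} A^\top + Q.
```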

Kalman Filter Inference N(obs , obs) N(t|t-1 , t|t-1)

Kalman Filter Inference
[Figure: the prediction step, propagating the belief N(μt|t-1, Σt|t-1) through the transition model with noise N(0, Σtransition)]
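A minimal numpy sketch of the two inference steps pictured above, again assuming the standard linear-Gaussian parameterization; the matrices A, C, Q, R below are illustrative placeholders, not the paper's learned quantities:

```python
import numpy as np

def kf_condition(mu_pred, Sigma_pred, o_t, C, R):
    """Condition the predicted belief N(mu_pred, Sigma_pred) on observation o_t."""
    S = C @ Sigma_pred @ C.T + R                  # innovation covariance
    K = Sigma_pred @ C.T @ np.linalg.inv(S)       # Kalman gain
    mu = mu_pred + K @ (o_t - C @ mu_pred)
    Sigma = (np.eye(len(mu_pred)) - K @ C) @ Sigma_pred
    return mu, Sigma

def kf_predict(mu, Sigma, A, Q):
    """Propagate the belief through the transition model with noise N(0, Q)."""
    return A @ mu, A @ Sigma @ A.T + Q

# Example usage with small random placeholders
rng = np.random.default_rng(0)
d, k = 4, 2                                       # state and observation dimensions
A, C = np.eye(d), rng.normal(size=(k, d))
Q, R = 0.1 * np.eye(d), 0.5 * np.eye(k)
mu, Sigma = np.zeros(d), np.eye(d)
for o_t in rng.normal(size=(5, k)):               # a few synthetic observations
    mu, Sigma = kf_condition(mu, Sigma, o_t, C, R)
    mu, Sigma = kf_predict(mu, Sigma, A, Q)
```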

Approximating the Observation Model
Let's take a closer look at the observation model.
It can be moment-matched to a Gaussian…
…except for the normalizer (a nasty log-sum).
Our approach: approximate the normalization constant.

Linearizing the normalizer
We want to approximate the observation model with a Gaussian distribution.
Step 1: a first-order Taylor approximation of the normalizer.
However, this is still hard.

Linearizing the normalizer
We want to approximate the observation model with a Gaussian distribution.
Step 2: a second-order Taylor approximation of the remaining expression.
We obtain a closed-form Gaussian with parameters related to the Jacobian and Hessian of this Taylor approximation. This Gaussian N(μapprox, Σapprox) is our approximated observation model.
We choose the linearization points to be the posterior means of the coordinates, given the data observed so far.
This Gaussian preserves x-y correlations!
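A hedged sketch of the general recipe; the slides omit the exact expression being expanded, so here f stands in for the troublesome log-normalizer term and x for the stacked coordinates:

```latex
% Second-order Taylor expansion around the linearization point x_0
% (chosen as the posterior mean of the coordinates given the data so far):
f(x) \approx f(x_0) + g^\top (x - x_0) + \tfrac{1}{2} (x - x_0)^\top H (x - x_0),
\qquad g = \nabla f(x_0),\; H = \nabla^2 f(x_0).
% The log observation model then becomes quadratic in x; completing the square
% gives the closed-form Gaussian N(\mu_{\text{approx}}, \Sigma_{\text{approx}}).
```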

Approximating the Observation Model
[Figure: two pairs of contour plots of an author's true posterior conditional (left panels) in a 3-author, 5-word embedding and the corresponding approximate Gaussian posterior conditionals (right panels). A is a difficult-to-approximate bimodal case; B is an easier unimodal case.]

Our Algorithms
- D-CODE: the expected model probability, which can be obtained in closed form using our approximation.
- D-CODE MLE: evaluates the model probability at the posterior means.
- Static versions of the above, which learn an embedding on C_{T-1} (the counts for year T-1) to predict for year T.

Algorithms for Comparison
- We compare with a dynamic version of PCA over overlapping windows of data. Consistency between the configurations at consecutive time-steps is maintained by a Procrustes transform (see the sketch below). For ranking, we evaluate our model probability at the PCA coordinates.
- We also compare to Locally Linear Embedding (LLE) on the author prediction task. Like the static D-CODE variants above, we embed data for year T − 1 and predict for year T. We define author-author distances based on the words they use, as in Mei and Shelton (2006); this allows us to compare with LLE.
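A minimal sketch of the Procrustes alignment step mentioned above, assuming only an orthogonal rotation/reflection (plus centering) is needed to align the embedding at one time-step to the previous one; the array names are hypothetical:

```python
import numpy as np

def procrustes_align(X_prev, X_curr):
    """Rotate/reflect X_curr so it best matches X_prev in the least-squares sense.

    X_prev, X_curr: (n_entities, dim) coordinate matrices for consecutive
    time-steps, with rows in the same entity order.
    """
    # Center both configurations
    A = X_prev - X_prev.mean(axis=0)
    B = X_curr - X_curr.mean(axis=0)
    # Orthogonal Procrustes: R = argmin over orthogonal matrices of ||B R - A||_F
    U, _, Vt = np.linalg.svd(B.T @ A)
    R = U @ Vt
    return B @ R + X_prev.mean(axis=0)

# Usage: align each window's PCA embedding to the previously aligned one, e.g.
# X_t_aligned = procrustes_align(X_tminus1_aligned, X_t)
```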

Experiments
We present the experiments in two sections:
- Qualitative results: visualization
- Quantitative results: ranking on the Naïve Bayes author prediction task
Naïve Bayes author prediction: we use the distributions / point estimates over entity locations at each timestep to perform Naïve Bayes ranking of authors given a subset of words from a paper in the next timestep (a sketch follows).
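The slides do not give the scoring formula; a minimal point-estimate sketch, assuming a CODE-style likelihood p(w | a) proportional to exp(-||x_a - y_w||^2) and a uniform author prior (the distribution-based D-CODE ranking would integrate over the coordinate posterior instead):

```python
import numpy as np

def rank_authors(author_coords, word_coords, obs_word_idxs):
    """Naive Bayes ranking of authors given the observed words of a paper.

    author_coords: (n_authors, dim) point estimates of author coordinates.
    word_coords:   (n_words, dim)   point estimates of word coordinates.
    obs_word_idxs: indices of words observed in a next-timestep paper.
    """
    # Squared distances between every author and every word in the vocabulary
    diffs = author_coords[:, None, :] - word_coords[None, :, :]
    sq_dists = (diffs ** 2).sum(axis=-1)                      # (n_authors, n_words)
    # log p(w | a): -d^2 minus a per-author log-normalizer over the vocabulary
    log_p_w_given_a = -sq_dists - np.logaddexp.reduce(-sq_dists, axis=1, keepdims=True)
    # Naive Bayes score with a uniform author prior: sum over observed words
    scores = log_p_w_given_a[:, obs_word_idxs].sum(axis=1)
    return np.argsort(-scores)                                # best-ranked authors first
```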

Synthetic Data
Consider a dataset of 6 authors and 3 words.
There is one group of words A1…A3 and two groups of authors, X1…X3 and Y1…Y3.
Initially the words Ai are mostly used by the authors Xi. Over time, the words gradually shift towards the authors Yi.
There is a random noise component in the data (a generation sketch follows).
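The slides describe this data only qualitatively; a hypothetical generator consistent with that description (the counts, weights, and names below are illustrative assumptions, not the paper's actual setup):

```python
import numpy as np

def make_synthetic_counts(T=20, n_papers=100, noise=0.05, seed=0):
    """Author-word counts for 6 authors (X1-3, Y1-3) and 3 words (A1-3)
    in which word A_i gradually shifts from author X_i to author Y_i."""
    rng = np.random.default_rng(seed)
    counts = np.zeros((T, 6, 3))             # authors 0-2 are X1-3, authors 3-5 are Y1-3
    for t in range(T):
        shift = t / (T - 1)                  # 0 at the first timestep, 1 at the last
        for i in range(3):                   # word A_i is tied to the pair (X_i, Y_i)
            p = np.full(6, noise / 4)        # small noise mass on the unrelated authors
            p[i] = (1 - noise) * (1 - shift)        # X_i's usage of A_i fades out
            p[3 + i] = (1 - noise) * shift          # Y_i's usage of A_i fades in
            counts[t, :, i] = rng.multinomial(n_papers, p / p.sum())
    return counts
```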

The dynamic embedding successfully reflects trends in the underlying data

Ranking with and without distributions
Ranks computed using D-CODE shift much more smoothly over time.

NIPS co-authorship data

NIPS author rankings: (Jordan, variational)
Average author rank given the word, predicted using D-CODE (above) and Dynamic PCA (middle), together with the empirical probabilities p(a | w) on the NIPS data (below); t = 13 corresponds to 1999. Note that D-CODE's predicted rank is close to 1 when p(a | w) is high, and larger otherwise. In contrast, Dynamic PCA's predicted rank shows no noticeable correlation.

NIPS author rankings: (Smola, kernel)
Same three-panel layout as the previous slide: D-CODE's predicted rank (above) again tracks the empirical p(a | w) (below), while Dynamic PCA's (middle) shows no noticeable correlation.

NIPS author rankings: (Waibel, speech)
Same layout again: D-CODE's predicted rank closely follows the empirical p(a | w), whereas Dynamic PCA's does not.

Rank Prediction
[Table: median predicted rank of the true authors of papers at t = 13, based on embeddings up to t = 12; values statistically indistinguishable from the best in each row are in bold.]
D-CODE is the best model in most cases, showing the usefulness of having distributions rather than just point estimates. D-CODE and D-CODE MLE also beat their static counterparts, showing the advantage of dynamic modeling.

Conclusion
- A novel dynamic embedding algorithm for co-occurrence count data, using Kalman filters
- Supports visualization, prediction, and detecting trends
- Distributions in embeddings make a difference!
- Can also do smoothing with closed-form updates

Acknowledgements We gratefully acknowledge Carlos Guestrin for his guidance and helpful comments.