Beam Sampling for the Infinite Hidden Markov Model by Jurgen Van Gael, Yunus Saatci, Yee Whye Teh and Zoubin Ghahramani (ICML 2008). Presented by Lihan He, ECE, Duke University, Nov 14, 2008.

Outline
- Introduction
- Infinite HMM
- Beam sampler
- Experimental results
- Conclusion

Introduction: HMM

HMM: hidden Markov model
[Figure: graphical model with hidden states s_0, s_1, s_2, ..., s_T, observations y_1, y_2, ..., y_T, and parameters π_0, π, φ; K denotes the number of states.]
- Hidden state sequence s = {s_1, s_2, ..., s_T}
- Observation sequence y = {y_1, y_2, ..., y_T}
- Model parameters π_0, π, φ:
  - π_{0i} = p(s_1 = i)
  - π_{ij} = p(s_t = j | s_{t-1} = i)
- Complete likelihood (see the reconstruction below)
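The complete-likelihood equation on this slide was an image and did not survive extraction; a standard reconstruction from the definitions above (not copied from the slide) is

\[
p(s, y \mid \pi_0, \pi, \phi) \;=\; \pi_{0 s_1}\, p(y_1 \mid \phi_{s_1}) \prod_{t=2}^{T} \pi_{s_{t-1} s_t}\, p(y_t \mid \phi_{s_t}).
\]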

Introduction: HMM Inference

Inference of HMM: the forward-backward algorithm (the forward recursion is written out below)
- Maximum likelihood: suffers from overfitting
- Bayesian learning: variational Bayes (VB) or MCMC

If we do not know K a priori:
- Model selection: run inference for every candidate K; computationally expensive
- Nonparametric Bayesian model: the iHMM (an HMM with an infinite number of states)

Within the iHMM framework:
- The forward-backward algorithm cannot be applied, since the number of states K is infinite
- Gibbs sampling can be used, but convergence is very slow due to the strong dependencies between consecutive time steps
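For reference, the forward pass of the finite-K forward-backward algorithm computes the filtered messages recursively (standard textbook form, not taken from the slide):

\[
\alpha_1(k) = \pi_{0k}\, p(y_1 \mid \phi_k), \qquad
\alpha_t(k) = p(y_t \mid \phi_k) \sum_{j=1}^{K} \alpha_{t-1}(j)\, \pi_{jk},
\]

at O(T K^2) total cost; this is exactly the sum that becomes intractable when K is infinite.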

Introduction: Beam Sampling

Beam sampling = slice sampling + dynamic programming
- Slice sampling: limits the number of states considered at each time step to a finite number
- Dynamic programming: samples the whole state trajectory efficiently

Advantages:
- Converges in far fewer iterations than Gibbs sampling
- Actual complexity per iteration is only marginally higher than that of Gibbs sampling
- Mixes well regardless of strong correlations in the data
- More robust with respect to varying initialization and prior distribution

Infinite HMM

The infinite hidden Markov model is implemented via the hierarchical Dirichlet process (HDP), written in the stick-breaking representation. [The slide's model equations were images; a reconstruction is given below, with π_k the transition probabilities and φ_k the emission distribution parameters.]
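The generative model shown on this slide can be reconstructed from the paper's HDP/stick-breaking formulation; treat the following as a paraphrase rather than a verbatim copy of the slide:

\[
\beta \sim \mathrm{GEM}(\gamma), \qquad
\pi_k \mid \beta \sim \mathrm{DP}(\alpha, \beta), \qquad
\phi_k \sim H,
\]
\[
s_t \mid s_{t-1} \sim \mathrm{Multinomial}(\pi_{s_{t-1}}), \qquad
y_t \mid s_t \sim F(\phi_{s_t}),
\]

where β are the shared stick-breaking weights, π_k is the (infinite) transition distribution out of state k, and φ_k parameterises the emission distribution of state k.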

Beam Sampler

Intuitive thought: only consider the states with large transition probabilities, so that the number of possible states at each time step is finite. Problems with this naive truncation:
- It is an approximation
- How do we define a "large transition probability"?
- It might change the distributions of other variables

Idea: introduce auxiliary variables u such that, conditioned on u, the number of trajectories with positive probability is finite (written out below).
- The auxiliary variables do not change the marginal distribution over the other variables, so MCMC sampling will converge to the true posterior.
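Concretely, the paper's construction attaches one auxiliary variable u_t to each transition, uniform on (0, π_{s_{t-1} s_t}); writing out the augmented joint (with the convention π_{s_0 s_1} = π_{0 s_1}) shows why this works:

\[
p(u_t \mid s_{t-1}, s_t, \pi) = \frac{\mathbb{I}(0 < u_t < \pi_{s_{t-1} s_t})}{\pi_{s_{t-1} s_t}},
\qquad
p(y, s, u \mid \pi, \phi) = \prod_{t=1}^{T} \mathbb{I}(0 < u_t < \pi_{s_{t-1} s_t})\, p(y_t \mid \phi_{s_t}).
\]

Integrating out each u_t returns the factor π_{s_{t-1} s_t}, so the marginal over (s, y) is unchanged; conditioned on u, only transitions with π_{s_{t-1} s_t} > u_t survive, and because each row of π sums to one there are only finitely many such transitions at each step.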

Beam Sampler

Sampling u: for each t we introduce an auxiliary variable u_t with conditional distribution (conditional on π, s_{t-1} and s_t)
  u_t ~ Uniform(0, π_{s_{t-1} s_t}).
Only trajectories s with π_{s_{t-1} s_t} > u_t for all t have non-zero probability given u.

Sampling s: we sample the whole trajectory s given u and the other variables using a form of forward filtering-backward sampling.
- Forward filtering: compute p(s_t | y_{1:t}, u_{1:t}) sequentially for t = 1, 2, ..., T
- Backward sampling: sample s_t sequentially for t = T, T-1, ..., 2, 1

Beam Sampler

Forward filtering
- We only need to compute p(s_t | y_{1:t}, u_{1:t}) for the finitely many s_t values belonging to some trajectory with positive probability:
  p(s_t | y_{1:t}, u_{1:t}) ∝ p(y_t | φ_{s_t}) Σ_{s_{t-1}: u_t < π_{s_{t-1} s_t}} p(s_{t-1} | y_{1:t-1}, u_{1:t-1})
- So computing p(s_t | -) only needs to sum up a finite part of p(s_{t-1} | -).

Backward sampling
- Sample s_T from p(s_T | y_{1:T}, u_{1:T})
- Sample s_t given the sample for s_{t+1}: p(s_t | s_{t+1}, y_{1:T}, u_{1:T}) ∝ p(s_t | y_{1:t}, u_{1:t}) I(u_{t+1} < π_{s_t s_{t+1}})

Sampling φ, π, β: directly from the theory of HDPs.

A code sketch of one forward filtering-backward sampling sweep is given below.
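A minimal Python sketch of one such sweep, assuming the transition matrix has already been truncated to the K states instantiated so far (the stick-breaking remainder and the resampling of u, π, φ, β happen outside this function); the function and variable names are hypothetical, not taken from the authors' code:

```python
import numpy as np

def beam_sweep(pi0, pi, emit_lik, u, rng):
    """One forward filtering-backward sampling sweep of a beam sampler.

    pi0      : (K,) initial-state probabilities, truncated to K instantiated states
    pi       : (K, K) transition probabilities between the instantiated states
    emit_lik : (T, K) emission likelihoods p(y_t | phi_k)
    u        : (T,) auxiliary slice variables for this iteration
    rng      : numpy.random.Generator
    Returns an integer state trajectory s of length T.
    """
    T, K = emit_lik.shape

    # Forward filtering: alpha[t, k] is proportional to p(s_t = k | y_{1:t}, u_{1:t}).
    alpha = np.zeros((T, K))
    alpha[0] = emit_lik[0] * (pi0 > u[0])            # only states allowed by u_1
    alpha[0] /= alpha[0].sum()
    for t in range(1, T):
        allowed = (pi > u[t]).astype(float)          # indicator of transitions with pi[j, k] > u_t
        alpha[t] = emit_lik[t] * (alpha[t - 1] @ allowed)
        alpha[t] /= alpha[t].sum()

    # Backward sampling: draw s_T, then s_t given s_{t+1} for t = T-1, ..., 1.
    s = np.empty(T, dtype=int)
    s[T - 1] = rng.choice(K, p=alpha[T - 1])
    for t in range(T - 2, -1, -1):
        w = alpha[t] * (pi[:, s[t + 1]] > u[t + 1])  # keep predecessors allowed by u_{t+1}
        s[t] = rng.choice(K, p=w / w.sum())
    return s
```

Before each sweep, u_t is drawn as u_t ~ Uniform(0, π_{s_{t-1} s_t}) from the current trajectory, and states are instantiated until the leftover stick mass falls below the smallest u_t, so this truncation is exact rather than approximate.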

Experiments

Toy example 1: examining convergence speed and sensitivity to the prior setting
- Transitions: self-transition probability p = 0.01 [transition matrix shown on the slide]
- Observations: discrete [emission model shown on the slide]
- Strong / vague / fixed prior settings for α and γ
[Figure: convergence plots under the three prior settings; axis/legend label from the slide: "# states summed up".]

Experiments

Toy example 2: examining performance on positively correlated data
[Figure: results for data with a high self-transition probability; the exact value appears only on the slide.]

Experiments

Real example 1: changepoint detection (well data)
[Figures: the state partition from one beam-sampling iteration, and the probability that two datapoints are in the same segment.]
- Gibbs sampling: slow convergence, harder decisions
- Beam sampling: fast convergence, softer decisions

Experiments

Real example 2: text prediction (Alice's Adventures in Wonderland)
- iHMM by Gibbs sampling and by beam sampling: similar results; both converge to around K = 16 states
- VB HMM with model selection: also around K = 16 states, but worse performance than the iHMM

Conclusion
- The beam sampler is introduced for iHMM inference
- The beam sampler combines slice sampling and dynamic programming:
  - Slice sampling limits the number of states considered at each time step to a finite number
  - Dynamic programming samples whole hidden state trajectories efficiently
- Advantages of the beam sampler:
  - Converges faster than the Gibbs sampler
  - Mixes well regardless of strong correlations in the data
  - More robust with respect to varying initialization and prior distribution