Nonparametric hidden Markov models Jurgen Van Gael and Zoubin Ghahramani
Introduction
- Hidden Markov (HM) models: time series with discrete hidden states
- Infinite HM models (iHMM): a nonparametric Bayesian approach
- Equivalence between the Polya urn and HDP interpretations of the iHMM
- Inference algorithms: collapsed Gibbs sampler, beam sampler
- Use of the iHMM: a simple sequence labeling task

From HMMs to Bayesian HMMs
- An example of an HMM: speech recognition
  - Hidden state sequence: phones
  - Observations: acoustic signals
  - The parameters π (transitions) and θ (emissions) either come from a physical model of speech or can be learned from recordings of speech
- Computational questions
  - 1. (π, θ, K) is given: apply Bayes' rule to find the posterior of the hidden variables
    - The computation can be done with a dynamic programming method, the forward-backward algorithm
  - 2. K is given, π and θ are not: apply EM
  - 3. (π, θ, K) is not given: penalization, etc.

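As an illustration of the first computational question, the following is a minimal sketch of the forward-backward recursion for a finite HMM. The names `pi0` (initial state distribution), `pi` (transition matrix), and `lik` (a table of emission likelihoods p(y_t | s_t = k)) are assumptions made for this example and are not taken from the slides.

```python
def forward_backward(pi0, pi, lik):
    """Posterior marginals p(s_t = k | y_1..T) for a finite HMM.

    pi0[k]    : initial state probability (assumed input)
    pi[i][j]  : transition probability p(s_t = j | s_{t-1} = i)
    lik[t][k] : emission likelihood p(y_t | s_t = k)
    """
    T, K = len(lik), len(pi0)
    # Forward pass: alpha[t][k] is proportional to p(s_t = k, y_1..t).
    alpha = []
    for t in range(T):
        if t == 0:
            raw = [pi0[k] * lik[0][k] for k in range(K)]
        else:
            raw = [lik[t][k] * sum(alpha[t - 1][i] * pi[i][k] for i in range(K))
                   for k in range(K)]
        z = sum(raw)
        alpha.append([a / z for a in raw])
    # Backward pass: beta[t][k] is proportional to p(y_{t+1}..T | s_t = k).
    beta = [[1.0] * K for _ in range(T)]
    for t in range(T - 2, -1, -1):
        raw = [sum(pi[k][j] * lik[t + 1][j] * beta[t + 1][j] for j in range(K))
               for k in range(K)]
        z = sum(raw)
        beta[t] = [b / z for b in raw]
    # Combine both passes and normalize at every time step.
    posterior = []
    for t in range(T):
        raw = [alpha[t][k] * beta[t][k] for k in range(K)]
        z = sum(raw)
        posterior.append([r / z for r in raw])
    return posterior


# Toy example with K = 2 states and T = 3 observations.
pi0 = [0.5, 0.5]
pi = [[0.9, 0.1], [0.2, 0.8]]
lik = [[0.7, 0.1], [0.6, 0.3], [0.1, 0.8]]
posterior = forward_backward(pi0, pi, lik)
```

Each row of `posterior` sums to one. Normalizing at every step keeps the recursion numerically stable on long sequences.
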
From HMMs to Bayesian HMMs
- Fully Bayesian approach
  - Add priors for π and θ and extend the full joint pdf to cover the parameters as well as the hidden states and observations
  - Compute the marginal likelihood (evidence) for comparing, choosing or averaging over different values of K
  - Computing the marginal likelihood analytically is intractable

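For reference, one standard way to write the extended joint pdf and the resulting evidence is sketched below. The notation (observations y_{1:T}, hidden states s_{1:T}, a fixed dummy initial state s_0) is an assumption for this sketch and may differ from the equation shown on the original slide.

```latex
\begin{align}
p(y_{1:T}, s_{1:T}, \pi, \theta \mid K)
  &= p(\pi \mid K)\, p(\theta \mid K)
     \prod_{t=1}^{T} p(s_t \mid s_{t-1}, \pi)\, p(y_t \mid s_t, \theta), \\
p(y_{1:T} \mid K)
  &= \int\!\!\int \sum_{s_{1:T}}
     p(y_{1:T}, s_{1:T}, \pi, \theta \mid K)\, d\pi\, d\theta .
\end{align}
```

The sum over all K^T state sequences inside the integral is what makes the evidence intractable to compute analytically.
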
From HMMs to Bayesian HMMs
- Methods for dealing with the intractability
  - MCMC 1: estimate the marginal likelihood explicitly (annealed importance sampling, bridge sampling). Computationally expensive.
  - MCMC 2: switch between different values of K (reversible jump MCMC).
  - Approximation using a good state sequence: given the hidden states, the parameters are independent and the prior is conjugate to the likelihood, so the marginal likelihood can be computed analytically.
  - Variational Bayesian inference: compute a lower bound on the marginal likelihood and apply VB inference.

Infinite HMM – hierarchical Polya Urn
- iHMM: instead of defining K different HMMs, implicitly define a distribution over the number of visited states.
- Polya urn:
  - Add a ball of a new color with probability α / (α + Σ_i n_i).
  - Add a ball of color i with probability n_i / (α + Σ_i n_i).
  - Nonparametric clustering scheme
- Hierarchical Polya urn:
  - Assume a separate Urn(k) for each state k
  - At each time step t, select a ball from the urn of the previous state, Urn(s_{t-1})
  - The transition probability from state i to state j is interpreted through the number of balls of color j in urn i
  - With some probability, the ball is drawn from the oracle urn instead

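The hierarchical urn scheme can be simulated directly. The sketch below assumes the usual two-level construction: urn i defers to a shared oracle urn with probability α / (α + Σ_j n_ij), and the oracle creates a new state with probability γ / (γ + Σ_j c_j). The parameter name `gamma`, the function names, and the choice to start from a single state 0 are assumptions for this illustration.

```python
import random


def draw_from_urn(counts, extra, rng):
    """Return color i with probability counts[i] / (extra + sum(counts)).

    Returns None with the remaining probability extra / (extra + sum(counts)).
    """
    r = rng.random() * (extra + sum(counts))
    for i, n in enumerate(counts):
        if r < n:
            return i
        r -= n
    return None


def sample_ihmm_states(T, alpha, gamma, rng):
    """Sample T hidden states from a hierarchical Polya urn scheme."""
    urns = [[0]]   # urns[i][j]: number of balls of color j in urn i
    oracle = [1]   # oracle[j]: number of balls of color j in the oracle urn
    prev, states = 0, []          # assumed start: the chain begins in state 0
    for _ in range(T):
        j = draw_from_urn(urns[prev], alpha, rng)
        if j is None:             # urn defers to the oracle urn
            j = draw_from_urn(oracle, gamma, rng)
            if j is None:         # oracle creates a brand-new state
                j = len(oracle)
                oracle.append(0)
                for urn in urns:
                    urn.append(0)
                urns.append([0] * len(oracle))
            oracle[j] += 1
        urns[prev][j] += 1        # reinforce the transition prev -> j
        states.append(j)
        prev = j
    return states, urns, oracle


states, urns, oracle = sample_ihmm_states(
    200, alpha=1.0, gamma=2.0, rng=random.Random(0))
print(len(oracle), "states instantiated after", len(states), "steps")
```

The number of instantiated states is not fixed in advance. It grows with the sequence length at a rate governed by `alpha` and `gamma`.
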
Inference
- Gibbs sampler: O(KT^2)
- Approximate Gibbs sampler: O(KT)
- State sequence variables are strongly correlated, which leads to slow mixing
- Beam sampler: an auxiliary variable MCMC algorithm
  - Resamples the whole Markov chain at once
  - Hence suffers less from slow mixing

- Compute only for finitely many s_t, s_{t-1} values.

Inference – Beam sampler
- Complexity: O(TK^2) when K states are represented
- Remark: the auxiliary variables need not be sampled from a uniform distribution. A Beta distribution could also be used to bias the auxiliary variables toward the boundaries.

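To make the idea concrete, here is a minimal sketch of one beam-sampling sweep for a fixed, finite transition matrix. It assumes uniform auxiliary variables u_t on (0, π_{s_{t-1}, s_t}), strictly positive emission likelihoods, and a current state sequence that only uses transitions with non-zero probability. The full iHMM sampler also resamples the transition matrix and instantiates new states, which is omitted here. The names `beam_sweep`, `categorical`, `pi0`, `pi` and `lik` are hypothetical.

```python
import random


def categorical(weights, rng):
    """Sample an index with probability proportional to weights."""
    r = rng.random() * sum(weights)
    for k, w in enumerate(weights):
        if r < w:
            return k
        r -= w
    return max(range(len(weights)), key=lambda k: weights[k])  # rounding guard


def beam_sweep(states, pi0, pi, lik, rng):
    """Resample the whole state sequence at once, given the current one."""
    T, K = len(lik), len(pi0)
    # 1. Auxiliary variables: u_t is uniform below the transition probability
    #    that the current sequence uses at step t.
    u = []
    for t in range(T):
        p = pi0[states[0]] if t == 0 else pi[states[t - 1]][states[t]]
        u.append(rng.random() * p)
    # 2. Forward filtering, keeping only transitions with probability > u_t.
    fwd = []
    for t in range(T):
        if t == 0:
            raw = [lik[0][k] if pi0[k] > u[0] else 0.0 for k in range(K)]
        else:
            raw = [lik[t][k] * sum(fwd[t - 1][i] for i in range(K)
                                   if pi[i][k] > u[t])
                   for k in range(K)]
        z = sum(raw)
        fwd.append([r / z for r in raw])
    # 3. Backward sampling of a new sequence under the same constraints.
    new = [0] * T
    new[T - 1] = categorical(fwd[T - 1], rng)
    for t in range(T - 2, -1, -1):
        w = [fwd[t][i] if pi[i][new[t + 1]] > u[t + 1] else 0.0
             for i in range(K)]
        new[t] = categorical(w, rng)
    return new


# Toy example: 3 states, 5 observations, some forbidden transitions.
pi0 = [1 / 3, 1 / 3, 1 / 3]
pi = [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.5, 0.0, 0.5]]
lik = [[0.6, 0.3, 0.1], [0.5, 0.4, 0.1], [0.2, 0.6, 0.2],
       [0.1, 0.3, 0.6], [0.1, 0.2, 0.7]]
rng = random.Random(1)
states = [0, 0, 1, 2, 2]
for _ in range(20):
    states = beam_sweep(states, pi0, pi, lik, rng)
```

Because only transitions with probability above u_t survive, each forward step sums over a finite set of predecessor states, which is what keeps the dynamic program tractable when the number of potential states is unbounded.
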
Example: unsupervised part-of-speech (PoS) tagging
- PoS tagging: annotating the words in a sentence with their appropriate part-of-speech tags
  - "The man sat": 'The': determiner, 'man': noun, 'sat': verb
  - An HM model is commonly used
    - Observations: words
    - Hidden states: the unknown PoS tags
    - Usually learned from a corpus of annotated sentences; building such a corpus is expensive
  - In the iHMM
    - A multinomial likelihood is assumed
    - The base distribution H is a symmetric Dirichlet, so it is conjugate to the multinomial likelihood
  - Trained on section 0 of the WSJ part of the Penn Treebank: 1917 sentences with a total of 50282 word tokens (observations) and 7904 word types (dictionary size)
  - The sampler is initialized with 50 states and run for 50000 iterations

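The practical benefit of conjugacy is that the emission parameters can be integrated out in closed form. The sketch below shows the resulting predictive probability of a word under one state, assuming a symmetric Dirichlet with concentration `beta` per word type. The function name, the value of `beta` and the toy counts are all hypothetical.

```python
def predictive_word_prob(word_counts, word, vocab_size, beta):
    """p(next word = word | words currently assigned to this state).

    word_counts maps each word to the number of times it is assigned to the
    state; the multinomial parameters are integrated out under a symmetric
    Dirichlet(beta) prior over vocab_size word types.
    """
    total = sum(word_counts.values())
    return (word_counts.get(word, 0) + beta) / (total + vocab_size * beta)


# Toy state that has so far emitted "the" three times and "a" once.
vocab = ["the", "a", "man", "sat"]
counts = {"the": 3, "a": 1}
probs = {w: predictive_word_prob(counts, w, len(vocab), beta=0.5) for w in vocab}
```

Words never seen in a state still receive non-zero probability through `beta`, and the probabilities over the vocabulary sum to one.
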
Example: unsupervised part-of-speech (PoS) tagging
- Top 5 words for the five most common states
  - Top line: state ID and frequency
  - Rows: top 5 words with their frequency in the sample
  - State 9: class of prepositions
  - State 12: determiners + possessive pronouns
  - State 8: punctuation + some coordinating conjunctions
  - State 18: nouns
  - State 17: personal pronouns

Beyond the iHMM: input-output (IO) iHMM
- Markov chain affected by external factors
  - A robot is driving around in a room while taking pictures (room index → picture)
  - If the robot follows a particular policy, the robot's actions can be integrated as an input to the iHMM (IO-iHMM)
  - Three-dimensional transition matrix

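One way to write such a three-dimensional transition matrix is sketched below. The input symbol a_t and the index order are assumptions for this sketch, not the notation of the original slide.

```latex
\begin{equation}
\pi_{a,i,j} = p(s_t = j \mid s_{t-1} = i,\; a_t = a),
\qquad \sum_{j} \pi_{a,i,j} = 1 \quad \text{for every pair } (a, i).
\end{equation}
```

For each input value a, the slice π_{a,·,·} is an ordinary transition matrix, so the input selects which transition matrix drives the chain at time t.
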
Beyond the iHMM: sticky and block-diagonal iHMM
- The weight on the diagonal of the transition matrix controls the frequency of state transitions
- The time spent in state i is geometrically distributed, governed by the self-transition probability
- Sticky iHMM: adds prior probability mass to the diagonal of the transition matrix and applies dynamic-programming-based inference
- Appropriate for segmentation problems where the number of segments is not known a priori
- To put more weight on the diagonal entries, an additional parameter is introduced that controls the switching rate
- Block-diagonal iHMM: for grouping of states
  - The sticky iHMM is the special case of blocks of size 1
  - Larger blocks allow unsupervised clustering of states
  - Used for unsupervised learning of view-based object models from video data, where each block corresponds to an object
  - Intuition: temporally contiguous video frames are more likely to correspond to different views of the same object than to different objects
- Hidden semi-Markov model
  - Assumes an explicit duration model for the time spent in a particular state

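The two statements about the diagonal can be written out as follows. The first line uses the convention that g counts consecutive self-transitions (g = 0, 1, 2, ...). The second line assumes the commonly used sticky HDP-HMM parameterization, with κ as the extra diagonal mass and β as the shared base weights; both conventions are assumptions of this sketch and may differ from the original slide.

```latex
\begin{align}
p(g \text{ consecutive self-transitions in state } i)
  &= \pi_{ii}^{\,g}\,(1 - \pi_{ii}),
  \qquad \mathbb{E}[g] = \frac{\pi_{ii}}{1 - \pi_{ii}}, \\
\pi_i \mid \beta
  &\sim \mathrm{DP}\!\left(\alpha + \kappa,\;
     \frac{\alpha\,\beta + \kappa\,\delta_i}{\alpha + \kappa}\right).
\end{align}
```

Larger κ moves prior mass onto the self-transition, which lengthens the expected time spent in each state and lowers the switching rate.
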
Beyond the iHMM: iHMM with Pitman-Yor base distribution
- Frequency vs. rank of colors (on a log-log scale)
  - The DP is quite specific about the distribution implied by the Polya urn: the number of colors that appear only once or twice is very small
  - Pitman-Yor can be more specific about the tails
  - Pitman-Yor fits a power-law distribution (a linear fit in the plot)
  - Replace the DP by Pitman-Yor in most cases
  - Helpful comments on the beam sampler

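The difference in the tails can be explored with a small simulation. The sketch below implements the standard Pitman-Yor urn, in which a new color appears with probability (α + d·C) / (α + n) and an existing color i with probability (n_i − d) / (α + n), where C is the current number of colors and n the number of balls. The discount parameter `d` (with 0 ≤ d < 1) and the function name are assumptions not named on the slide; d = 0 recovers the Polya urn.

```python
import random


def pitman_yor_urn(num_draws, alpha, d, rng):
    """Return color counts, largest first, after num_draws urn draws."""
    counts = []                      # counts[i]: number of balls of color i
    for _ in range(num_draws):
        n = sum(counts)
        p_new = (alpha + d * len(counts)) / (alpha + n)
        if rng.random() < p_new:
            counts.append(1)         # ball of a new color
        else:
            # Existing color i is chosen proportionally to counts[i] - d.
            r = rng.random() * (n - d * len(counts))
            for i, c in enumerate(counts):
                if r < c - d:
                    counts[i] += 1
                    break
                r -= c - d
            else:
                counts[-1] += 1      # guard against floating-point rounding
    return sorted(counts, reverse=True)


dp_counts = pitman_yor_urn(2000, alpha=1.0, d=0.0, rng=random.Random(0))
py_counts = pitman_yor_urn(2000, alpha=1.0, d=0.5, rng=random.Random(0))
print(len(dp_counts), "colors with d = 0;", len(py_counts), "colors with d = 0.5")
```

Plotting the sorted counts against their rank on a log-log scale gives the frequency-vs-rank picture described above.
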
Beyond the iHMM: autoregressive iHMM, SLD-iHMM
- AR-iHMM: observations follow autoregressive dynamics
- SLD-iHMM: part of the continuous variables are observed, and the unobserved variables follow linear dynamics
- Figure panels: SLD model, FA-HMM model