Style & Topic Language Model Adaptation Using HMM-LDA Bo-June (Paul) Hsu, James Glass.

Style & Topic Language Model Adaptation Using HMM-LDA Bo-June (Paul) Hsu, James Glass

2 Outline
Introduction, LDA, HMM-LDA, Experiments, Conclusions

3 Introduction
An effective LM needs to not only account for the casual speaking style of lectures, but also accommodate the topic-specific vocabulary of the subject matter
Available training corpora rarely match the target lecture in both style and topic
In this paper, syntactic state and semantic topic assignments are investigated jointly using the combined HMM-LDA model

4 LDA
A generative probabilistic model of a corpus
The per-document topic mixture is drawn from a conjugate Dirichlet prior
– PLSA (the topic mixture is a document-specific parameter with no prior)
– LDA (the topic mixture is drawn from a Dirichlet prior)
– Model parameters
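
For reference, the LDA generative process sketched on this slide (shown as figures in the original deck) can be written in its standard form, with α and β the Dirichlet hyperparameters for the document-topic and topic-word distributions:

    \theta^{(d)} \sim \mathrm{Dirichlet}(\alpha)
    \phi^{(z)} \sim \mathrm{Dirichlet}(\beta)
    z_i \mid \theta^{(d_i)} \sim \mathrm{Multinomial}(\theta^{(d_i)})
    w_i \mid z_i, \phi \sim \mathrm{Multinomial}(\phi^{(z_i)})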

5 Markov chain Monte Carlo
A class of algorithms for sampling from probability distributions based on constructing a Markov chain that has the desired distribution as its stationary distribution
The most common application of these algorithms is numerically calculating multi-dimensional integrals
– An ensemble of "walkers" moves around randomly
– The Markov chain is constructed in such a way as to have the integrand as its equilibrium distribution
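
As a minimal illustration of the "random walker" idea (not code from the paper; the target density and step size are arbitrary choices for the example), here is a random-walk Metropolis sampler whose chain has the target as its stationary distribution:

    import math
    import random

    def metropolis_sample(log_target, n_samples=5000, step=0.5, x0=0.0):
        """Random-walk Metropolis: the chain's stationary distribution is the target."""
        x, samples = x0, []
        for _ in range(n_samples):
            proposal = x + random.gauss(0.0, step)                 # random-walk proposal
            log_accept = log_target(proposal) - log_target(x)
            if random.random() < math.exp(min(0.0, log_accept)):   # accept/reject step
                x = proposal
            samples.append(x)
        return samples

    # Example: sample from a standard normal, given only its unnormalized log density.
    draws = metropolis_sample(lambda x: -0.5 * x * x)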

6 LDA
Estimate the posterior over topic assignments
Integrating out the multinomial parameters θ and φ yields a collapsed Gibbs sampler over the topic assignments z
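
The collapsed Gibbs sampling update implied here is the standard one of Griffiths and Steyvers, where the counts n exclude the current token i, W is the vocabulary size, and T is the number of topics:

    P(z_i = j \mid \mathbf{z}_{-i}, \mathbf{w}) \;\propto\;
        \frac{n^{(w_i)}_{-i,j} + \beta}{n^{(\cdot)}_{-i,j} + W\beta} \cdot
        \frac{n^{(d_i)}_{-i,j} + \alpha}{n^{(d_i)}_{-i,\cdot} + T\alpha}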

7 Markov chain Monte Carlo (cont.) Gibbs Sampling
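
A generic skeleton of Gibbs sampling (illustrative only, not the paper's sampler): each variable is resampled in turn from its conditional distribution given the current values of all the others.

    import random

    def gibbs_sweep(state, conditional_samplers):
        """One Gibbs sweep: resample each variable from its full conditional.

        `state` maps variable names to current values; `conditional_samplers` maps
        each name to a function state -> value drawn from p(variable | all others).
        Both names are illustrative placeholders.
        """
        for name, sampler in conditional_samplers.items():
            state[name] = sampler(state)
        return state

    # Toy example: bivariate standard normal with correlation rho, sampled coordinate-wise.
    rho = 0.8
    samplers = {
        "x": lambda s: random.gauss(rho * s["y"], (1 - rho**2) ** 0.5),
        "y": lambda s: random.gauss(rho * s["x"], (1 - rho**2) ** 0.5),
    }
    state = {"x": 0.0, "y": 0.0}
    for _ in range(1000):
        state = gibbs_sweep(state, samplers)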

8 HMM+LDA
HMMs generate documents purely on the basis of syntactic relations among unobserved word classes
– Short-range dependencies
Topic models generate documents based on semantic correlations between words, independent of word order
– Long-range dependencies
A major advantage of generative models is modularity
– Different models are easily combined, either as a mixture of models or as a product of models
– Only a subset of words, the content words, exhibit long-range dependencies
– The composite model replaces one of the word distributions used in the syntactic model with the semantic (topic) model

9 HMM+LDA (cont.)
Notation:
– A sequence of words w = (w_1, ..., w_n)
– A sequence of topic assignments z = (z_1, ..., z_n)
– A sequence of classes c = (c_1, ..., c_n)
– c_i = 1 denotes the semantic class
– The zth topic is associated with a distribution φ^(z) over words
– Each class c ≠ 1 is associated with its own distribution over words
– Each document d has a distribution θ^(d) over topics
– Transitions between classes c_{i-1} and c_i follow a distribution π^(c_{i-1})

10 HMM+LDA (cont.)
A document d is generated as follows:
– Sample θ^(d) from a Dirichlet(α) prior
– For each word w_i in the document:
  – Draw z_i from θ^(d)
  – Draw c_i from π^(c_{i-1})
  – If c_i = 1, then draw w_i from φ^(z_i); else draw w_i from the word distribution of class c_i
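
A minimal sketch of this generative process in Python (illustrative only; the vocabulary size, number of topics and classes, hyperparameter values, and the random draws of φ, π, and the class word distributions are assumptions, not the paper's code):

    import numpy as np

    rng = np.random.default_rng(0)

    V, T, C = 50, 5, 4              # vocabulary size, topics, classes (class 0 = semantic)
    alpha, beta, gamma, delta = 0.1, 0.01, 0.1, 0.01

    phi = rng.dirichlet(beta * np.ones(V), size=T)          # topic -> word distributions
    class_dist = rng.dirichlet(delta * np.ones(V), size=C)  # class -> word distributions
    pi = rng.dirichlet(gamma * np.ones(C), size=C)          # class transition matrix

    def generate_document(length):
        theta = rng.dirichlet(alpha * np.ones(T))           # document topic mixture
        words, prev_c = [], 0
        for _ in range(length):
            z = rng.choice(T, p=theta)                      # topic assignment
            c = rng.choice(C, p=pi[prev_c])                 # syntactic class from the HMM
            if c == 0:                                      # semantic class: emit from the topic
                w = rng.choice(V, p=phi[z])
            else:                                           # syntactic class: emit from the class
                w = rng.choice(V, p=class_dist[c])
            words.append(w)
            prev_c = c
        return words

    doc = generate_document(20)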

11 HMM+LDA (cont.)
Inference:
– The topic word distributions φ^(z) are drawn from a Dirichlet(β) prior
– The rows of the class transition matrix π are drawn from a Dirichlet(γ) prior
– The document topic mixtures θ^(d) are drawn from a Dirichlet(α) prior
– All Dirichlet distributions are assumed symmetric

12 HMM+LDA (cont.) Gibbs Sampling
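
For reference, the collapsed Gibbs updates for the composite model (shown as images on the original slide) take roughly the following form, where counts exclude token i, W is the vocabulary size, C the number of classes, and the class-transition factor carries small count corrections (omitted here) when c_{i-1} = c_i or c_i = c_{i+1}:

    P(z_i = z \mid \mathbf{z}_{-i}, \mathbf{c}, \mathbf{w}) \;\propto\;
        (n^{(d_i)}_{-i,z} + \alpha) \times
        \begin{cases}
            \frac{n^{(w_i)}_{-i,z} + \beta}{n^{(\cdot)}_{-i,z} + W\beta} & \text{if } c_i = 1 \\
            1 & \text{otherwise}
        \end{cases}

    P(c_i = c \mid \mathbf{c}_{-i}, \mathbf{z}, \mathbf{w}) \;\propto\;
        p(w_i \mid c)\,
        \frac{(n_{c_{i-1} \to c} + \gamma)(n_{c \to c_{i+1}} + \gamma)}{n_{c \to \cdot} + C\gamma}

where p(w_i | c) is the topic emission term above when c = 1, and the corresponding class word distribution term (with its own hyperparameter δ) otherwise.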

13 HMM-LDA Analysis
Lectures corpus
– 3 undergraduate subjects in math, physics, and computer science
– 10 CS lectures for the development set, 10 CS lectures for the test set
Textbook corpus
– CS course textbook
– Divided into 271 topic-cohesive documents at every section heading
Run the Gibbs sampler on the two datasets
– Lectures: 2,800 iterations; Textbook: 2,000 iterations
– Use the lowest-perplexity model as the final model

14 HMM-LDA Analysis (cont.)
Semantic topics (Lectures): example topics include Machine Learning, Linear Algebra, Magnetism, and Childhood Memories
– A cursory examination of the data suggests that speakers talking about children tend to laugh more during the lecture
– Although it may not be desirable to capture speaker idiosyncrasies in the topic mixtures, HMM-LDA has clearly demonstrated its ability to capture distinctive semantic topics in a corpus

15 HMM-LDA Analysis (cont.)
Semantic topics (Textbook)
– In a topically coherent paragraph, 6 of the 7 instances of the words "and" and "or" (underlined in the slide) are correctly classified
– Multi-word topic key phrases can be identified for n-gram topic models
– This demonstrates the context-dependent labeling ability of the HMM-LDA model

16 HMM-LDA Analysis (cont.)
Syntactic states (Lectures)
– State 20 is the topic state; other states correspond to verbs, prepositions, hesitation disfluencies, and conjunctions
– As demonstrated on spontaneous speech, HMM-LDA yields syntactic states that correspond well to part-of-speech labels, without requiring any labeled training data

17 Discussions
Although MCMC techniques converge to the global stationary distribution, we cannot guarantee convergence from observation of the perplexity alone
Unlike EM algorithms, random sampling may actually decrease the model likelihood temporarily
The number of iterations was chosen to be at least double the point at which the perplexity first appeared to converge

18 Language Modeling Experiments
Baseline model: Lectures + Textbook interpolated trigram model (using modified Kneser-Ney discounting)
Topic-deemphasized style (trigram) model (Lectures):
– Deemphasizes the observed occurrences of topic words and ideally redistributes these counts to all potential topic words
– The counts of topic-to-style word transitions are not altered
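
A minimal sketch of the linear interpolation used for the baseline (illustrative only; the component probabilities and the mixture weight are placeholder values, and the actual components are modified Kneser-Ney smoothed trigram models):

    def interpolate(p_lectures, p_textbook, weight):
        """Linearly interpolate two component LM probabilities for the same word/context."""
        return weight * p_lectures + (1.0 - weight) * p_textbook

    # Toy usage: p(w | h) from each component model, combined with weight 0.7.
    p_combined = interpolate(p_lectures=0.012, p_textbook=0.004, weight=0.7)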

19 Language Modeling Experiments (cont.)
The Textbook model should ideally receive higher weight in contexts containing topic words
Domain trigram model (Textbook):
– Emphasizes sequences containing a topic word in the context by doubling their counts

20 Language Modeling Experiments (cont.)
Unsmoothed topical trigram model:
– Apply HMM-LDA with 100 topics to identify representative words and their associated contexts for each topic
Topic mixtures for all models
– Mixture weights were tuned on the individual target lectures (a cheating experiment)
– 15 of the 100 topics account for over 90% of the total weight

21 Language Modeling Experiments (cont.)
Since the topic distribution shifts over a long lecture, modeling a lecture with fixed weights may not be optimal
Instead, update the mixture distribution by linearly interpolating it with the posterior topic distribution given the current word
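
A sketch of this dynamic update (illustrative; the interpolation rate and the way the word-level topic posterior is obtained are assumptions, not taken from the paper):

    import numpy as np

    def update_topic_mixture(weights, word_topic_posterior, rate=0.1):
        """Move the current topic mixture toward the posterior implied by the latest word."""
        new_weights = (1.0 - rate) * weights + rate * word_topic_posterior
        return new_weights / new_weights.sum()   # guard against numerical drift

    # Toy usage with 3 topics: the observed word points strongly at topic 2.
    weights = np.array([0.5, 0.3, 0.2])
    posterior = np.array([0.1, 0.1, 0.8])
    weights = update_topic_mixture(weights, posterior)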

22 Language Modeling Experiments (cont.)
The variation of topic mixtures over the lecture:
– Review of the previous lecture -> an example of computation using accumulators -> the lecture then focuses on streams as a data structure, with an intervening example that finds pairs of i and j that sum to a prime

23 Language Modeling Experiments (cont.) Experimental results

24 Conclusions
HMM-LDA shows great promise for finding structure in unlabeled data, from which we can build more sophisticated models
Speaker-specific adaptation will be investigated in the future