1 Bayesian Learning for Latent Semantic Analysis Jen-Tzung Chien, Meng-Sun Wu and Chia-Sheng Wu Presenter: Hsuan-Sheng Chiu

Speech Lab. NTNU 2 Reference Chia-Sheng Wu, "Bayesian Latent Semantic Analysis for Text Categorization and Information Retrieval", 2005. Q. Huo and C.-H. Lee, "On-line adaptive learning of the continuous density hidden Markov model based on approximate recursive Bayes estimate", 1997.

Speech Lab. NTNU 3 Outline Introduction, PLSA, ML (Maximum Likelihood), MAP (Maximum A Posteriori), QB (Quasi-Bayes), Experiments, Conclusions

Speech Lab. NTNU 4 Introduction LSA vs. PLSA: linear algebra and probability; semantic space and latent topics. Batch learning vs. incremental learning.

Speech Lab. NTNU 5 PLSA PLSA is a general machine learning technique which adopts the aspect model to represent co-occurrence data: topics (hidden variables) and a corpus (document-word pairs).

Speech Lab. NTNU 6 PLSA Assume that d_i and w_j are conditionally independent given the associated topic z_k. Joint probability:
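The equation on this slide is not reproduced in the transcript; the standard PLSA aspect-model form, which the slide presumably shows, is:

    P(d_i, w_j) = P(d_i) \sum_{k=1}^{K} P(w_j | z_k) P(z_k | d_i)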

Speech Lab. NTNU 7 ML PLSA Log likelihood of Y: ML estimation:
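The equations are missing from the transcript; with n(d_i, w_j) denoting the count of word w_j in document d_i, the log-likelihood of the co-occurrence data Y and the ML estimate usually take the form:

    L(Y | \theta) = \sum_i \sum_j n(d_i, w_j) \log P(d_i, w_j)
    \theta_{ML} = \arg\max_{\theta} L(Y | \theta)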

Speech Lab. NTNU 8 ML PLSA Maximization:

Speech Lab. NTNU 9 ML PLSA Complete data vs. incomplete data. EM (Expectation-Maximization) algorithm: E-step and M-step.

Speech Lab. NTNU 10 ML PLSA E-Step
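The E-step equation itself is not in the transcript; for the aspect model it is typically the posterior over topics:

    P(z_k | d_i, w_j) = \frac{P(w_j | z_k) P(z_k | d_i)}{\sum_{l=1}^{K} P(w_j | z_l) P(z_l | d_i)}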

Speech Lab. NTNU 11 ML PLSA Auxiliary function (expected complete-data log-likelihood):

Speech Lab. NTNU 12 ML PLSA M-step: Lagrange multiplier

Speech Lab. NTNU 13 ML PLSA Differentiation. New parameter estimation:
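The re-estimation formulas are images in the original slides; as commonly derived, the ML PLSA M-step normalizes the expected counts, P(w_j | z_k) \propto \sum_i n(d_i, w_j) P(z_k | d_i, w_j) and P(z_k | d_i) \propto \sum_j n(d_i, w_j) P(z_k | d_i, w_j). A minimal NumPy sketch of this E-step/M-step loop (variable names are illustrative, not taken from the paper) could look like:

    import numpy as np

    def plsa_em(n_dw, K, n_iter=50, seed=0):
        """ML PLSA via EM. n_dw: (D, W) matrix of counts n(d_i, w_j)."""
        rng = np.random.default_rng(seed)
        D, W = n_dw.shape
        p_w_z = rng.random((K, W)); p_w_z /= p_w_z.sum(axis=1, keepdims=True)  # P(w_j | z_k)
        p_z_d = rng.random((D, K)); p_z_d /= p_z_d.sum(axis=1, keepdims=True)  # P(z_k | d_i)
        for _ in range(n_iter):
            # E-step: posterior P(z_k | d_i, w_j), shape (D, W, K)
            joint = p_z_d[:, None, :] * p_w_z.T[None, :, :]
            post = joint / (joint.sum(axis=2, keepdims=True) + 1e-12)
            # M-step: re-estimate multinomials from expected counts n(d_i, w_j) P(z_k | d_i, w_j)
            exp_counts = n_dw[:, :, None] * post                               # (D, W, K)
            p_w_z = exp_counts.sum(axis=0).T                                   # (K, W)
            p_w_z /= p_w_z.sum(axis=1, keepdims=True) + 1e-12
            p_z_d = exp_counts.sum(axis=1)                                     # (D, K)
            p_z_d /= p_z_d.sum(axis=1, keepdims=True) + 1e-12
        return p_w_z, p_z_d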

Speech Lab. NTNU 14 MAP PLSA Estimation by maximizing the posterior probability. Definition of the prior distribution: Dirichlet density. Prior density (Kronecker delta). Assume the parameter sets P(w_j | z_k) and P(z_k | d_i) are independent.
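The prior density equation is not in the transcript; a product of Dirichlet densities over both multinomial parameter sets, consistent with the slide's description, would be:

    g(\theta) \propto \prod_{k} \prod_{j} P(w_j | z_k)^{\alpha_{jk} - 1} \cdot \prod_{i} \prod_{k} P(z_k | d_i)^{\beta_{ik} - 1}

where \alpha_{jk} and \beta_{ik} are hyperparameters (these symbols are illustrative and not necessarily the paper's notation).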

Speech Lab. NTNU 15 MAP PLSA Consider the prior density. Maximum a posteriori:
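The MAP objective (equation omitted in the transcript) combines the likelihood with the prior:

    \theta_{MAP} = \arg\max_{\theta} \left[ \log P(Y | \theta) + \log g(\theta) \right]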

Speech Lab. NTNU 16 MAP PLSA E-step: expectation. Auxiliary function:

Speech Lab. NTNU 17 MAP PLSA M-step Lagrange multiplier

Speech Lab. NTNU 18 MAP PLSA Auxiliary function:

Speech Lab. NTNU 19 MAP PLSA Differentiation. New parameter estimation:
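The MAP re-estimation formulas are again images in the slides; with Dirichlet priors as above, the updates typically add the prior pseudo-counts to the expected counts of the ML case (a sketch, using the illustrative hyperparameter symbols introduced earlier):

    P(w_j | z_k) \propto \sum_i n(d_i, w_j) P(z_k | d_i, w_j) + (\alpha_{jk} - 1)
    P(z_k | d_i) \propto \sum_j n(d_i, w_j) P(z_k | d_i, w_j) + (\beta_{ik} - 1)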

Speech Lab. NTNU 20 QB PLSA An online information system needs to update the model continuously. Estimation by maximizing the posterior probability. The posterior density is approximated by the closest tractable prior density with hyperparameters. Compared with MAP PLSA, the key difference in QB PLSA is the updating of the hyperparameters.

Speech Lab. NTNU 21 QB PLSA Conjugate prior: in Bayesian probability theory, a conjugate prior is a prior distribution with the property that the posterior distribution belongs to the same family of distributions. This yields a closed-form solution and a reproducible prior/posterior pair for incremental learning.
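As a concrete illustration of conjugacy (not from the slides): for a multinomial parameter \theta with Dirichlet prior Dir(\theta; \alpha_1, ..., \alpha_V), observing counts c_1, ..., c_V gives the posterior Dir(\theta; \alpha_1 + c_1, ..., \alpha_V + c_V), i.e. the same family with updated hyperparameters.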

Speech Lab. NTNU 22 QB PLSA Hyperparameter α:

Speech Lab. NTNU 23 QB PLSA After careful rearrangement, the exponential of the posterior expectation function can be expressed as: A reproducible prior/posterior pair is generated to build the updating mechanism for the hyperparameters.
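The updating equations themselves are not reproduced in the transcript; given the Dirichlet-multinomial conjugacy above, a plausible sketch of the hyperparameter update is to accumulate the expected topic counts from each new batch of adaptation data (illustrative notation, with superscripts indexing the batch):

    \alpha_{jk}^{(n)} = \alpha_{jk}^{(n-1)} + \sum_i n(d_i, w_j) P(z_k | d_i, w_j)
    \beta_{ik}^{(n)} = \beta_{ik}^{(n-1)} + \sum_j n(d_i, w_j) P(z_k | d_i, w_j)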

Speech Lab. NTNU 24 Initial Hyperparameters An open issue in Bayesian learning: if the initial prior knowledge is too strong, or after a lot of adaptation data have been incrementally processed, new adaptation data usually have only a small impact on parameter updating in incremental training.

Speech Lab. NTNU 25 Experiments MED corpus: 1033 medical abstracts with 30 queries; 7014 unique terms; 433 abstracts for ML training; 600 abstracts for MAP or QB training; query subset for testing; K = 8. Reuters corpus: documents for training; 2925 for QB learning; 2790 documents for testing; unique words; 10 categories.

Speech Lab. NTNU 26 Experiments

Speech Lab. NTNU 27 Experiments

Speech Lab. NTNU 28 Experiments

Speech Lab. NTNU 29 Conclusions This paper presented an adaptive text modeling and classification approach for PLSA-based information systems. Future work: extension of PLSA to bigram or trigram models will be explored, along with application to spoken document classification and retrieval.

30 Discriminative Maximum Entropy Language Model for Speech Recognition Chuang-Hua Chueh, To-Chang Chien and Jen-Tzung Chien Presenter: Hsuan-Sheng Chiu

Speech Lab. NTNU 31 Reference R. Rosenfeld, S. F. Chen and X. Zhu, "Whole-sentence exponential language models: a vehicle for linguistic statistical integration", 2001. W. H. Tsai, "An Initial Study on Language Model Estimation and Adaptation Techniques for Mandarin Large Vocabulary Continuous Speech Recognition", 2005.

Speech Lab. NTNU 32 Outline Introduction, Whole-sentence exponential model, Discriminative ME language model, Experiment, Conclusions

Speech Lab. NTNU 33 Introduction Language models: statistical n-gram model, latent semantic language model, structured language model. Based on the maximum entropy principle, we can integrate different features to establish an optimal probability distribution.

Speech Lab. NTNU 34 Whole-Sentence Exponential Model Traditional method: Exponential form: Usage: When used for speech recognition, the model is not suitable for the first pass of the recognizer, and should be used to re-score N-best lists.
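The equations on this slide are not in the transcript; the whole-sentence exponential model of Rosenfeld et al. is usually written as

    P(s) = \frac{1}{Z} P_0(s) \exp\left( \sum_i \lambda_i f_i(s) \right)

where P_0(s) is an initial (e.g. n-gram) model, f_i(s) are sentence-level feature functions, \lambda_i their weights, and Z a normalization constant.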

Speech Lab. NTNU 35 Whole-Sentence ME Language Model Expectation of the feature function. Empirical: Actual: Constraint:
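The constraint equations are missing from the transcript; the ME constraint requires the model expectation of each feature to match its empirical expectation:

    E_{\tilde{p}}[f_i] = E_{p}[f_i], \quad \text{i.e.} \quad \sum_s \tilde{p}(s) f_i(s) = \sum_s p(s) f_i(s)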

Speech Lab. NTNU 36 Whole-Sentence ME Language Model To solve the constrained optimization problem:

Speech Lab. NTNU 37 GIS algorithm
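The GIS update shown on the slide is not reproduced in the transcript; the standard Generalized Iterative Scaling step for the feature weights, with C an upper bound on the total feature count per sentence, is

    \lambda_i^{(t+1)} = \lambda_i^{(t)} + \frac{1}{C} \log \frac{E_{\tilde{p}}[f_i]}{E_{p^{(t)}}[f_i]}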

Speech Lab. NTNU 38 Discriminative ME Language Model In general, ME can be considered a maximum likelihood model with a log-linear distribution. The authors propose a discriminative language model based on the whole-sentence ME model (DME).

Speech Lab. NTNU 39 Discriminative ME Language Model Acoustic features for ME estimation: sentence-level log-likelihood ratio of competing and target sentences. Feature weight parameter: namely, the feature parameter is activated (set to one) for those speech signals observed in the training database.

Speech Lab. NTNU 40 Discriminative ME Language Model New estimation: Upgrade to discriminative linguistic parameters

Speech Lab. NTNU 41 Discriminative ME Language Model

Speech Lab. NTNU 42 Experiment Corpus: TCC; mixtures; 12 Mel-frequency cepstral coefficients, 1 log-energy and first derivatives; 4200 sentences for training, 450 for testing. Corpus: Academia Sinica CKIP balanced corpus; five million words; vocabulary words.

Speech Lab. NTNU 43 Experiment

Speech Lab. NTNU 44 Conclusions A new ME language model integrating linguistic and acoustic features for speech recognition. The derived ME language model was inherently discriminative. The DME model involved a constrained optimization procedure and was powerful for knowledge integration.

Speech Lab. NTNU 45 Relation between DME and MMI MMI criterion: Modified MMI criterion: Express ME model as ML model:
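The criteria themselves are not reproduced in the transcript; the standard MMI objective, to which the slide's "MMI criterion" presumably refers, is, for utterance X_r with reference transcription W_r,

    F_{MMI}(\Lambda) = \sum_r \log \frac{P_{\Lambda}(X_r | W_r) P(W_r)}{\sum_{W} P_{\Lambda}(X_r | W) P(W)}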

Speech Lab. NTNU 46 Relation between DME and MMI The optimal parameter:

Speech Lab. NTNU 47 Relation between DME and MMI

Speech Lab. NTNU 48 Relation between DME and MMI