APPLICATIONS OF DIRICHLET PROCESS MIXTURES TO SPEAKER ADAPTATION
Amir Harati and Joseph Picone, Institute for Signal and Information Processing, Temple University
Marc Sobel, Department of Statistics, Temple University

Presentation transcript:

Application to Speaker Adaptation
Adapting speaker-independent models requires balancing model complexity (e.g., parameter counts) against the amount of adaptation data. Classical solutions to speaker adaptation include the use of Maximum Likelihood Linear Regression (MLLR) and regression trees. Our goal is to replace the regression tree with a DPM and to achieve better performance at a comparable or reduced level of complexity.

Training Algorithm
1. Train a speaker-independent (SI) model. Collect the mixture components and their frequencies.
2. Generate samples for each component and cluster them using a DPM model (see the sketch below).
3. Construct a tree structure over the final result using a bottom-up approach, merging leaf nodes based on Euclidean distance.
4. Assign clusters to each component using a majority-vote scheme.
5. Compute the transformation matrix using a maximum likelihood approach.

Inference Algorithms
Three different variational algorithms are used:
1. Accelerated variational Dirichlet process mixtures (AVDP).
2. Collapsed variational stick-breaking (CSB).
3. Collapsed Dirichlet priors (CDP).
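A minimal sketch of the data flow in steps 2-4, assuming scikit-learn is available: `BayesianGaussianMixture` with a Dirichlet process weight prior stands in for the variational DPM algorithms listed above (AVDP, CSB, and CDP themselves are not reproduced here), and the SI model parameters, feature dimensionality, and sample counts are illustrative placeholders rather than the actual acoustic model.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)

# Placeholder for the speaker-independent (SI) model: 50 diagonal Gaussians
# over 39-dimensional features. Real parameters would come from step 1.
si_means = rng.normal(scale=5.0, size=(50, 39))
si_vars = np.ones((50, 39))

# Step 2: generate samples from each SI component and pool them.
samples, owner = [], []
for k, (m, v) in enumerate(zip(si_means, si_vars)):
    samples.append(rng.normal(m, np.sqrt(v), size=(100, 39)))
    owner.append(np.full(100, k))
X, owner = np.vstack(samples), np.concatenate(owner)

# Cluster the pooled samples with a truncated DP mixture (variational inference).
dpm = BayesianGaussianMixture(
    n_components=20,  # truncation level, not a fixed number of clusters
    weight_concentration_prior_type="dirichlet_process",
    covariance_type="diag",
    max_iter=200,
).fit(X)
labels = dpm.predict(X)

# Step 3 (sketch): build a bottom-up tree over cluster centroids (Euclidean distance).
centroids = np.array([X[labels == c].mean(axis=0) for c in np.unique(labels)])
tree = linkage(centroids, method="average", metric="euclidean")

# Step 4: assign each SI component to a cluster by majority vote over its samples.
assignment = {k: int(np.bincount(labels[owner == k]).argmax())
              for k in range(len(si_means))}
```

The MLLR-style transform of step 5 is then estimated per cluster exactly as with a conventional regression tree; only the way components are grouped changes.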
Observations
A 10% reduction in WER over MLLR was observed. DPMs produce meaningful interpretations of the acoustic distances between data (e.g., broad phonetic classes). Decreases in performance may be related to the tree construction approach.

Extensions To This Work
Generalize the assignment of a distribution to a single cluster using soft-tying. Expand the statistical models to multiple mixtures. The nonparametric Bayesian framework provides two important features: the number of clusters is not known a priori and can grow as new data is obtained, and it generalizes parameter sharing and model (and state) tying.

Future Work
Use DPMs to discover new acoustic units. Use an HDP-HMM to relax the standard left-to-right HMM topology. Integrate the HDP-HMM into the training loop so that training and testing operate under matched conditions.

Key References
E. Sudderth, "Graphical Models for Visual Object Recognition and Tracking," Ph.D. dissertation, MIT, May 2006.
J. Paisley, "Machine Learning with Dirichlet and Beta Process Priors: Theory and Applications," Ph.D. dissertation, Duke University, May 2010.
P. Liang, et al., "Probabilistic Grammars and Hierarchical Dirichlet Processes," The Handbook of Applied Bayesian Analysis, Oxford University Press, 2010.

Dirichlet Processes
A Dirichlet process $G \sim \mathrm{DP}(\alpha, G_0)$ is a random probability measure over a space $\Theta$ such that for any finite measurable partition $(A_1, \ldots, A_K)$ of $\Theta$,
$$(G(A_1), \ldots, G(A_K)) \sim \mathrm{Dir}(\alpha G_0(A_1), \ldots, \alpha G_0(A_K)).$$
$G_0$ is the base distribution and acts like the mean of the DP; $\alpha$ is the concentration parameter and is proportional to the inverse of the variance. Example: the Chinese restaurant process. Dirichlet Process Mixture (DPM) models place a prior on the number of clusters. Hierarchical Dirichlet Processes (HDPs) extend this by allowing sharing of parameters (atoms) across clusters.
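The Chinese restaurant process mentioned above can be illustrated with a short simulation (a sketch, not code from the paper): each observation joins an existing cluster with probability proportional to that cluster's size, or opens a new cluster with probability proportional to the concentration parameter, so the number of clusters is unbounded and grows slowly as data accumulates.

```python
import numpy as np

def chinese_restaurant_process(n, alpha, rng):
    """Sample cluster assignments for n observations from a CRP with concentration alpha."""
    assignments, counts = [], []          # counts[k] = customers seated at table k
    for _ in range(n):
        probs = np.array(counts + [alpha], dtype=float)
        probs /= probs.sum()              # existing table k ~ counts[k]; new table ~ alpha
        table = rng.choice(len(probs), p=probs)
        if table == len(counts):          # a new table (cluster) is opened
            counts.append(0)
        counts[table] += 1
        assignments.append(table)
    return np.array(assignments), np.array(counts)

rng = np.random.default_rng(0)
for n in (20, 200, 2000):
    _, counts = chinese_restaurant_process(n, alpha=1.0, rng=rng)
    print(f"{n:5d} observations -> {len(counts)} clusters")
```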
Introduction
A major challenge in machine learning is generalization: can systems successfully recognize previously unseen data? Controlling complexity is key to good generalization. Data-driven modeling must be capable of preserving the important modalities in the data, and the complexity of the model should be adapted to the available data (a small numerical sketch of this point follows the figure captions below).

Pilot Experiment: Monophones
Experiments use the Resource Management (RM) corpus: 12 different speakers with 600 training utterances. Models use a single Gaussian mixture. The word error rate (WER) can be reduced by more than 10%. The DPM preserves 6 clusters in the data while the regression tree finds only 2 clusters.

Pilot Experiment: Triphones
AVDP is better for moderate amounts of data, while CDP and CSB are better for larger amounts of data.

Figure captions:
Figure 1. Model Complexity as a Function of Available Data: (a) 20, (b) 200, (c) 2000 data points.
Figure 2. Mapping Speaker Independent Models to Speaker Dependent Models.
Figure 3. Regression Tree vs. AVDP.
Figure 4. The Number of Discovered Clusters.
Figure 5. Regression Tree vs. Several DPM Inference Algorithms.
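Figure 1's point, that model complexity should track the amount of available data, can be reproduced qualitatively with the same scikit-learn stand-in used earlier. The six-component toy generator, the two-dimensional features, and the 0.02 weight threshold below are assumptions chosen for illustration; they are not the RM setup used in the pilot experiments.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(1)
true_means = np.array([[0, 0], [5, 5], [-5, 5], [5, -5], [-5, -5], [0, 8]], dtype=float)

def sample(n):
    """Draw n points from a six-component Gaussian mixture with unit variance."""
    comp = rng.integers(0, len(true_means), size=n)
    return true_means[comp] + rng.normal(size=(n, 2))

for n in (20, 200, 2000):
    dpm = BayesianGaussianMixture(
        n_components=10,                              # truncation level
        weight_concentration_prior_type="dirichlet_process",
        max_iter=500,
    ).fit(sample(n))
    effective = int((dpm.weights_ > 0.02).sum())      # clusters with non-negligible weight
    print(f"{n:5d} points -> {effective} effective clusters")
```

With little data the DP prior keeps only a few effective clusters; as more data arrives the inferred complexity grows toward the true number of modes, echoing the observation that the number of clusters is not fixed a priori.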