APPLICATIONS OF DIRICHLET PROCESS MIXTURES TO SPEAKER ADAPTATION
Amir H. Harati Nejad Torbati, Joseph Picone and Marc Sobel
Department of Electrical and Computer Engineering, College of Engineering, Temple University, Philadelphia, Pennsylvania

Adaptation
Def: to adjust model parameters for new speakers. Adjusting all parameters requires an impractical amount of data, so the solution is to create clusters and adapt all models in a cluster together. Clusters are organized hierarchically; the classical solution is a binary regression tree, constructed using a centroid-splitting algorithm. In transform-based adaptation, a transformation is calculated for each cluster; in MLLR, the transforms are computed using the maximum likelihood criterion.

Introduction
The performance of speaker-independent acoustic models in speech recognition is significantly lower than that of speaker-dependent models, but training speaker-dependent models is impractical given the limited amount of data. One of the most popular solutions is speaker adaptation, which transforms the means and covariances of all Gaussian components. Because of the huge number of components, we often need to tie (cluster) components together, and the complexity of the model should be adapted to the available data.
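To make the transform-based idea concrete, here is a toy sketch, not the poster's implementation, of estimating a single regression class's MLLR mean transform. It assumes identity covariances, under which the maximum likelihood estimate of W = [b, A] reduces to a least-squares fit of the adaptation frames against extended means xi = [1, mu]; all data and names (mu_si, A_true, and so on) are synthetic.

```python
# Toy MLLR mean adaptation for one regression class (hypothetical data/shapes).
# With identity covariances, maximizing the likelihood of frames o_t drawn from
# N(W @ xi_{s_t}, I) reduces to least squares, where xi = [1, mu].
import numpy as np

rng = np.random.default_rng(0)
dim, n_gauss, n_frames = 3, 8, 500

mu_si = rng.normal(size=(n_gauss, dim))            # speaker-independent means
A_true = np.eye(dim) + 0.1 * rng.normal(size=(dim, dim))
b_true = rng.normal(size=dim)

# Simulated adaptation data: each frame comes from one (transformed) Gaussian.
comp = rng.integers(0, n_gauss, size=n_frames)     # frame-to-Gaussian alignment
obs = mu_si[comp] @ A_true.T + b_true + 0.05 * rng.normal(size=(n_frames, dim))

# Extended means xi = [1, mu]; solve the least-squares problem for W = [b, A].
xi = np.hstack([np.ones((n_frames, 1)), mu_si[comp]])
W, *_ = np.linalg.lstsq(xi, obs, rcond=None)       # W has shape (dim + 1, dim)
b_hat, A_hat = W[0], W[1:].T

mu_adapted = mu_si @ A_hat.T + b_hat               # adapted means for the class
print(np.allclose(A_hat, A_true, atol=0.05), np.allclose(b_hat, b_true, atol=0.05))
```

In the full method this fit is repeated once per cluster of the regression tree (or, in this work, of the DPM-derived tree), so each cluster of tied Gaussians shares one affine transform.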
Dirichlet Process Mixture (DPM)
One of the classical problems in clustering is determining the number of clusters, i.e., the complexity of the model. Dirichlet process (DP) mixture models use a nonparametric Bayesian framework to put a prior on the number of clusters: the DP induces a prior under which the number of clusters grows logarithmically with the number of observations, and combining it with a likelihood yields a complete mixture model.

Algorithm
Premise: replace the binary regression tree in MLLR with a DPM.
Procedure (steps 2-4 are sketched in code below):
1. Train a speaker-independent (SI) model. Collect all mixture components and their frequencies of occurrence.
2. Generate samples for each component and cluster them using a DPM.
3. Construct a tree over the result using a bottom-up approach: start from the terminal nodes and merge them based on Euclidean distance.
4. Assign each component to a cluster using a majority-vote scheme.
5. With the resulting tree, compute a transformation matrix for each cluster using the maximum likelihood approach.
Inference is accomplished using three different variational algorithms:
1. Accelerated variational Dirichlet process mixture (AVDP).
2. Collapsed variational stick-breaking (CSB).
3. Collapsed Dirichlet priors (CDP).
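As a rough illustration of steps 2-4, and not the poster's actual AVDP/CSB/CDP code, the sketch below uses scikit-learn's truncated variational DP mixture as a stand-in inference algorithm and SciPy's centroid linkage for the bottom-up tree; the SI component means are synthetic, and step 4 is done before step 3 for convenience.

```python
# Minimal sketch of steps 2-4 with hypothetical speaker-independent components.
import numpy as np
from sklearn.mixture import BayesianGaussianMixture
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(1)
dim, n_comp, n_samp = 2, 30, 200
si_means = rng.normal(scale=3.0, size=(n_comp, dim))   # SI mixture components

# Step 2: draw samples from each component, then cluster them with a DP mixture.
samples = np.vstack([m + rng.normal(scale=0.5, size=(n_samp, dim)) for m in si_means])
dpm = BayesianGaussianMixture(
    n_components=20,                                   # truncation level
    weight_concentration_prior_type="dirichlet_process",
    max_iter=500, random_state=0,
).fit(samples)
labels = dpm.predict(samples).reshape(n_comp, n_samp)

# Step 4: assign each SI component to a cluster by majority vote over its samples.
votes = np.array([np.bincount(row, minlength=20).argmax() for row in labels])

# Step 3: build a tree bottom-up by merging cluster centroids on Euclidean distance.
active = np.unique(votes)                              # clusters actually used
tree = linkage(dpm.means_[active], method="centroid")  # regression-class tree
print(len(active), "clusters discovered;", tree.shape[0], "merges in the tree")
```

The key difference from the classical recipe is that the number of regression classes falls out of the DP prior rather than being fixed by a binary split schedule.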
Results for Monophone Models
Experiments used the Resource Management (RM) dataset, with monophone models using a single Gaussian per mixture and 12 different speakers with 600 training utterances. The resulting clusters resemble broad phonetic classes: the DPM finds 6 clusters in the data while the regression tree finds only 2, and word error rate (WER) can be reduced by more than 10%.

Results for Cross-Word Models
Cross-word triphone models also used a single Gaussian per mixture, with 12 different speakers and 600 training utterances. The clusters generated using the DPM have acoustically and phonetically meaningful interpretations. AVDP works better for moderate amounts of data, while CDP and CSB work better for larger amounts of data.

Conclusion
It has been shown that, with enough data, DPMs surpass regression-tree results while producing clusters with meaningful acoustic interpretations. Lower performance in some cases may be related to our tree-construction approach. In this work we assigned each distribution to just one cluster; an obvious extension is some form of soft tying. This research used the RM dataset and models with one Gaussian per mixture; in the future we can use more challenging datasets and more mixtures per state, and we can also use nonparametric Bayesian HMMs (HDP-HMMs) in training to further examine the applications of nonparametric methods to speech recognition.

Figures
Figure 1. Model complexity as a function of available data: (a) 20, (b) 200, (c) 2000 data points.
Figure 2. Mapping speaker-independent models to speaker-dependent models.
Figure 3. Dirichlet process mixture.
Figure 4. Chinese restaurant process.
Figure 5. A comparison of regression tree and AVDP approaches for monophone models.
Figure 6. The number of discovered clusters.
Figure 7. A comparison of WERs between regression-tree-based MLLR and several DPM inference algorithms for cross-word acoustic models.
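The logarithmic cluster growth claimed in the DPM section, and illustrated by the Chinese restaurant process in Figure 4, can be checked with a short simulation; below is a minimal sketch in which the concentration alpha and the sample sizes are arbitrary illustrative choices.

```python
# Simulate the Chinese restaurant process and compare the mean number of
# occupied tables (clusters) against the alpha * log(n) growth rate.
import numpy as np

def crp_num_tables(n, alpha, rng):
    counts = []                                  # customers per table
    for i in range(n):
        # Existing table j with prob counts[j]/(i+alpha); new table with alpha/(i+alpha).
        p = np.array(counts + [alpha], dtype=float) / (i + alpha)
        table = rng.choice(len(counts) + 1, p=p)
        if table == len(counts):
            counts.append(1)                     # open a new table (cluster)
        else:
            counts[table] += 1
    return len(counts)

rng = np.random.default_rng(2)
alpha = 2.0
for n in (20, 200, 2000):                        # the sample sizes of Figure 1
    k = np.mean([crp_num_tables(n, alpha, rng) for _ in range(50)])
    print(f"n={n:5d}  mean clusters {k:5.1f}  alpha*log(n) {alpha * np.log(n):5.1f}")
```

This is the behavior that lets the model complexity in Figure 1 grow slowly as data accumulates, rather than being fixed in advance.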