Presentation transcript:

APPLICATIONS OF DIRICHLET PROCESS MIXTURES TO SPEAKER ADAPTATION
Amir H. Harati Nejad Torbati, Joseph Picone and Marc Sobel
Department of Electrical and Computer Engineering, Temple University, Philadelphia, Pennsylvania

Introduction
The performance of speaker-independent acoustic models in speech recognition is significantly lower than that of speaker-dependent models, but training speaker-dependent models is impractical because of the amount of data required. The solution is speaker adaptation. One of the most popular approaches is to transform the means and covariances of all Gaussian components. Because of the huge number of components, we often need to tie (cluster) components together, and the complexity of the model should be adapted to the available data.

Adaptation
Adaptation adjusts a model's parameters for a new speaker. Adjusting all parameters requires an impractically large amount of data, so models are clustered together and all models in the same cluster are adjusted jointly. The clusters should be organized hierarchically, so that appropriate clusters can be selected depending on the available data. The classical solution is a binary regression tree, constructed using a centroid-splitting algorithm. In transform-based adaptation, a transformation is computed for each cluster; in MLLR, the transforms are computed using a maximum likelihood criterion.
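To make the transform step concrete: in MLLR, each cluster's transform is an affine map applied to the Gaussian means of that cluster. Below is a minimal sketch of applying such a transform, assuming the transform matrices have already been estimated from adaptation data; the dimensions, component means, and cluster assignment are hypothetical placeholders, and the maximum likelihood estimation of W itself is not shown.

```python
import numpy as np

def mllr_adapt_mean(mu, W):
    """Apply an MLLR transform W of shape (d, d+1) to a mean vector mu of shape (d,).

    MLLR adapts means as mu_hat = W @ xi, where xi = [1, mu] is the
    extended mean vector (the first column of W is the bias term).
    """
    xi = np.concatenate(([1.0], mu))  # extended mean vector
    return W @ xi

# Hypothetical example: every Gaussian component in a cluster shares the
# transform estimated for that cluster.
d = 3
rng = np.random.default_rng(0)
component_means = {0: rng.normal(size=d), 1: rng.normal(size=d)}
cluster_of = {0: "cluster_A", 1: "cluster_A"}   # component -> cluster
W = {"cluster_A": rng.normal(size=(d, d + 1))}  # cluster -> transform

adapted = {c: mllr_adapt_mean(mu, W[cluster_of[c]])
           for c, mu in component_means.items()}
print(adapted[0].shape)  # (3,)
```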
Dirichlet Process Mixture (DPM)
One of the classical problems in clustering is determining the number of clusters, and thus the complexity of the model. A solution based on the nonparametric Bayesian framework is the Dirichlet process mixture model, in which a Dirichlet process (DP) places a prior on the number of clusters. The DP induces a prior under which the number of clusters can grow logarithmically with the number of observations.
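To illustrate this logarithmic growth, consider the Chinese restaurant process view of the DP (Figure 4): observation i joins an existing cluster with probability proportional to that cluster's size, and opens a new cluster with probability proportional to the concentration parameter alpha. A minimal simulation sketch follows; alpha = 1.0 is an arbitrary illustrative choice, not a value from this work.

```python
import numpy as np

def crp_cluster_counts(n, alpha, seed=0):
    """Simulate cluster sizes from a Chinese restaurant process with n observations."""
    rng = np.random.default_rng(seed)
    counts = []  # counts[k] = size of cluster k
    for i in range(n):
        # Join cluster k with probability counts[k] / (i + alpha);
        # open a new cluster with probability alpha / (i + alpha).
        probs = np.array(counts + [alpha]) / (i + alpha)
        k = rng.choice(len(probs), p=probs)
        if k == len(counts):
            counts.append(1)  # new cluster
        else:
            counts[k] += 1
    return counts

# The number of clusters grows roughly logarithmically with n:
for n in (20, 200, 2000):
    print(n, len(crp_cluster_counts(n, alpha=1.0)))
```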
Algorithm
The basic idea is to replace the binary regression tree in MLLR with a DPM. The procedure is as follows (a sketch of steps 2-4 appears after the figure list at the end of this transcript):
1. Train speaker-independent (SI) models, collecting all mixture components and their frequencies of occurrence.
2. Generate samples for each component and cluster them using a DPM model.
3. Construct a tree structure over the final result using a bottom-up approach: start from the terminal nodes and merge them based on their Euclidean distance.
4. Assign a cluster to each component using a majority-vote scheme.
5. Use the resulting tree to compute the transformation matrices using the maximum likelihood approach.
Inference is accomplished using three different variational algorithms:
1. Accelerated variational Dirichlet process mixtures (AVDP).
2. Collapsed variational stick-breaking (CSB).
3. Collapsed Dirichlet priors (CDP).

Results for Monophone Models
Monophone models using a single Gaussian mixture; 12 different speakers with 600 training utterances. The resulting clusters resemble broad phonetic classes. The DPM finds 6 clusters in the data, while the regression tree finds 2, and the word error rate (WER) is reduced by more than 10%.

Results for Cross-Word Models
Cross-word triphone models using a single Gaussian mixture; 12 different speakers with 600 training utterances. The clusters generated using the DPM have acoustically and phonetically meaningful interpretations. AVDP works better for medium amounts of data, while CDP and CSB work better for larger amounts of data.

Conclusion
It has been shown that, given enough data, a DPM can do a better job than a regression tree, and the resulting clusters have a meaningful acoustic interpretation. The slightly worse performance in some cases could be related to our tree-construction approach: in this work each distribution was assigned to exactly one cluster, and an obvious extension is some form of soft tying. The Resource Management (RM) dataset and models with one Gaussian per mixture were used in this research. In the future we can use more challenging datasets with more mixture components per state, and we can also use nonparametric Bayesian HMMs (HDP-HMMs) in training to further examine the applications of nonparametric methods in speech recognition.

Figure 1. Model complexity as a function of available data: (a) 20, (b) 200, (c) 2000 data points.
Figure 2. Mapping speaker-independent models to speaker-dependent models.
Figure 3. Dirichlet process mixture.
Figure 4. Chinese restaurant process.
Figure 5. A comparison of regression tree and AVDP approaches for monophone models.
Figure 6. The number of discovered clusters.
Figure 7. A comparison of WERs between regression-tree-based MLLR and several DPM inference algorithms for cross-word acoustic models.
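As referenced in the Algorithm section above, here is a minimal sketch of steps 2-4: clustering per-component samples with a DP mixture, assigning components to clusters by majority vote, and merging the discovered clusters bottom-up by Euclidean distance. It uses scikit-learn's truncated variational DP mixture and SciPy's agglomerative linkage as stand-ins for the AVDP/CSB/CDP inference and the tree construction described in the poster; the component means, sample counts, and truncation level are illustrative assumptions, not values from this work.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)

# Step 2 (sketch): draw samples from each SI mixture component, then
# cluster all samples with a DP mixture.
component_means = rng.normal(size=(30, 3))  # hypothetical SI component means
samples, owner = [], []
for c, mu in enumerate(component_means):
    samples.append(rng.normal(loc=mu, scale=0.1, size=(50, 3)))
    owner.append(np.full(50, c))
X, owner = np.vstack(samples), np.concatenate(owner)

dpm = BayesianGaussianMixture(
    n_components=20,  # truncation level for variational inference
    weight_concentration_prior_type="dirichlet_process",
    max_iter=500, random_state=0,
).fit(X)
labels = dpm.predict(X)

# Step 4 (sketch): assign each SI component to a cluster by majority vote
# over the labels of its samples.
assign = {c: np.bincount(labels[owner == c]).argmax()
          for c in range(len(component_means))}

# Step 3 (sketch): merge the discovered cluster centroids bottom-up by
# Euclidean distance to obtain a tree over the clusters.
used = sorted(set(assign.values()))
centroids = dpm.means_[used]
tree = linkage(centroids, method="average", metric="euclidean")
print(len(used), "clusters; linkage shape:", tree.shape)
```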