
Applications of Dirichlet Process Models to Speech Processing and Machine Learning
Amir Harati and Joseph Picone, PhD
Institute for Signal and Information Processing, Temple University
URL:

Motivation: Parametric models can capture only a bounded amount of information from the data. Real data is complex, so rigid parametric assumptions are often wrong. Nonparametric models can provide the benefits of model selection/averaging without paying the computational cost of those methods. In addition, Bayesian methods provide a mathematically well-defined framework with better extendibility.

Motivation: Speech recognizer architecture. The performance of the system depends on the quality of the acoustic models P(A|W). HMMs and mixture models are frequently used for acoustic modeling. The number of models and the amount of parameter sharing are among the most important model selection problems in a speech recognizer. Can nonparametric Bayesian modeling help us? [Block diagram: Input Speech → Acoustic Front-end → Search, which combines the Acoustic Models P(A|W) and the Language Model P(W) to produce the Recognized Utterance.]

Nonparametric Bayesian: Bayes' rule in machine learning: P(model | data) ∝ P(data | model) P(model). Bayesian methods are sensitive to the prior, which should reflect our beliefs about the model. Inflexible priors (and models) lead to wrong conclusions. Nonparametric models are very flexible: the number of parameters can grow with the amount of data. Areas: regression, classification, clustering, time series analysis, …

Clustering: A generative approach to clustering: (1) pick one of K clusters from the mixing weights π = (π1, …, πK); (2) generate a data point from that cluster's parametric distribution (e.g., a Gaussian). This yields a finite mixture model: p(x) = Σ_k πk p(x | θk). The finite mixture model can also be expressed using an underlying discrete measure G = Σ_k πk δθk, where δθk is a point mass at θk.
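
The two-step generative process above can be made concrete with a short sketch. This is a minimal, assumed illustration (the weights, means, and standard deviations are invented for the example), not code from the presentation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed K = 3 component Gaussian mixture (weights, means, and stds invented for the example).
weights = np.array([0.5, 0.3, 0.2])
means   = np.array([-2.0, 0.0, 3.0])
stds    = np.array([0.5, 1.0, 0.7])

# Step 1: pick one of the K clusters according to pi.
z = rng.choice(len(weights), size=1000, p=weights)
# Step 2: generate each data point from its cluster-specific Gaussian.
x = rng.normal(means[z], stds[z])
```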

Clustering: In Bayesian model-based clustering we place priors on the parameters. θ is model specific; we usually use a conjugate prior, which in the case of a Gaussian likelihood is the normal-inverse-gamma distribution. We name this prior G0. π parameterizes a multinomial, so we use a symmetric Dirichlet distribution as its prior, with concentration parameter α0.
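
Putting the pieces together, the Bayesian finite mixture can be written in its standard hierarchical form (a textbook summary, not copied from the slides):

```latex
% Bayesian finite mixture with K components (standard form):
\pi \sim \mathrm{Dir}\!\left(\tfrac{\alpha_0}{K}, \ldots, \tfrac{\alpha_0}{K}\right), \qquad
\theta_k \sim G_0, \quad k = 1, \ldots, K,
\qquad
z_i \mid \pi \sim \mathrm{Mult}(\pi), \qquad
x_i \mid z_i, \{\theta_k\} \sim F(\theta_{z_i}), \quad i = 1, \ldots, n.
```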

Dirichlet Distribution: (Φ, Σ) is a measurable space, where Σ is the sigma algebra. A measure μ over (Φ, Σ) is a function from Σ → R+ such that μ(∅) = 0 and μ is countably additive over disjoint sets. For a probability measure, μ(Φ) = 1. A Dirichlet distribution is a distribution over the K-dimensional probability simplex.
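
For reference, the density of the Dirichlet distribution on the K-dimensional simplex has the standard form below (not shown on the slide):

```latex
\mathrm{Dir}(\pi \mid \alpha_1, \ldots, \alpha_K)
  = \frac{\Gamma\!\left(\sum_{k=1}^{K} \alpha_k\right)}{\prod_{k=1}^{K} \Gamma(\alpha_k)}
    \prod_{k=1}^{K} \pi_k^{\alpha_k - 1},
\qquad \pi_k \ge 0, \quad \sum_{k=1}^{K} \pi_k = 1.
```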

Finite Mixture Model

Finite Mixture Model: How do we determine K? By using model comparison methods, or by going nonparametric. If we let K → ∞, can we obtain a nonparametric model? What is the definition of G in this case? The answer is the Dirichlet process.

Dirichlet Process: A Dirichlet process (DP) is a random probability measure G over (Φ, Σ) such that for any measurable partition (A1, …, Ar) of Φ we have (G(A1), …, G(Ar)) ~ Dir(αG0(A1), …, αG0(Ar)). A DP has two parameters: the base distribution G0, which acts like a mean for the DP, and the concentration parameter α, which behaves like an inverse variance. We write G ~ DP(α, G0). A draw from a DP is discrete with probability one.

Stick-breaking Representation: The stick-breaking construction represents a DP explicitly: βk ~ Beta(1, α), πk = βk ∏_{l<k} (1 − βl), θk ~ G0, and G = Σ_k πk δθk. Consider a stick of length one. At each step the remaining stick is broken, and the broken-off piece is assigned as the weight of the corresponding atom in the DP. If π is distributed as above, we write π ~ GEM(α).
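
A minimal numpy sketch of a truncated stick-breaking draw is shown below; the function name and the truncation level are illustrative choices, not part of the presentation:

```python
import numpy as np

def stick_breaking_weights(alpha, truncation, rng=None):
    """Draw mixture weights from a truncated stick-breaking construction of a DP."""
    rng = np.random.default_rng() if rng is None else rng
    betas = rng.beta(1.0, alpha, size=truncation)                 # beta_k ~ Beta(1, alpha)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - betas)[:-1]))
    return betas * remaining                                      # pi_k = beta_k * prod_{l<k} (1 - beta_l)

# Small alpha concentrates mass on the first few atoms; large alpha spreads it out.
print(stick_breaking_weights(alpha=1.0, truncation=10))
```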

Polya's Urn Scheme: Consider i.i.d. draws θ1, θ2, … from G. Now marginalize out G and consider the conditional probabilities: θn | θ1, …, θn−1 ~ (α G0 + Σ_{i<n} δθi) / (α + n − 1). Imagine picking balls of different colors from an urn. Start with an empty urn. With probability proportional to α, add a ball of a new color to the urn (a draw from G0). With probability proportional to the number of balls in the urn, draw a ball from the urn; return it and add another ball of the same color.

Customers Integers Tables  Clusters Chinese Restaurant Interpretation Consider draws 1, …, n from Polya’s urn scheme, and consider distinct values of these draws 1*,… K*. In other words, random draws from Polya’s urn scheme induces a partition over natural numbers . The induced partition over partitions is called Chinese Restaurant Process (CRP). Generating from the CRP: First Customer sits at the first table. Other customers sit at table k with probability proportional to the number of customers at table k, or start a new table with probability proportional to . Customers Integers Tables  Clusters Customers Integers Tables  Clusters

Dirichlet Process Mixture (DPM): Draws from a DP are discrete with probability one, so a DP cannot be used directly as a prior on continuous densities. However, we can draw the parameters of a mixture model from a draw from a DP: G ~ DP(α, G0), θi | G ~ G, xi | θi ~ F(θi).
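
As a hedged, practical illustration of a DPM at work, scikit-learn's variational BayesianGaussianMixture with a Dirichlet-process weight prior can infer the effective number of clusters from data. This stands in for the AVDP/CSB/CDP algorithms discussed later, and the data and parameter values are invented for the example:

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
# Synthetic data: three well-separated 2-D Gaussian clusters (invented for the example).
X = np.vstack([rng.normal(loc, 0.3, size=(100, 2)) for loc in (-3.0, 0.0, 3.0)])

# Truncated variational DP Gaussian mixture: n_components is only an upper bound;
# the Dirichlet process prior lets the posterior switch off unneeded components.
dpm = BayesianGaussianMixture(
    n_components=10,
    weight_concentration_prior_type="dirichlet_process",
    weight_concentration_prior=1.0,   # concentration parameter alpha (illustrative value)
    max_iter=500,
    random_state=0,
).fit(X)

print("effective number of components:", int(np.sum(dpm.weights_ > 0.01)))
```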

Applications to Speech Recognition: In speech recognition technology, we deal with the model complexity problem at many levels. Examples include the number of states and the number of mixture components in a hidden Markov model, and the number of models and the parameter sharing between these models. In language modeling, we must estimate the probabilities of unseen events in very large but sparse N-gram models, and nonparametric Bayesian modeling has been used to smooth such N-gram language models. Nonparametric Bayesian HMMs (HDP-HMMs) have been used for speaker diarization and word segmentation. In this project we have investigated replacing the binary regression tree used for speaker adaptation with a DPM.

Adaptation: Definition: to adjust model parameters for new speakers. Adjusting all parameters requires too much data and is computationally complex. Solution: create clusters and adjust all models in a cluster together. Clusters are organized hierarchically; the classical solution is a binary regression tree, constructed using a centroid splitting algorithm. In transform-based adaptation, a transformation is calculated for each cluster. In Maximum Likelihood Linear Regression (MLLR), the transforms are computed using an ML criterion.
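
For orientation, the MLLR mean transform has a standard textbook form (reproduced here as an assumed reference, not an equation taken from these slides): each regression class c shares one affine transform that is applied to every Gaussian mean assigned to it, and the transform is estimated by maximizing the likelihood of the adaptation data for that class.

```latex
\hat{\mu}_{m} = A_c\,\mu_{m} + b_c = W_c\,\xi_{m},
\qquad
\xi_{m} = \begin{bmatrix} 1 \\ \mu_{m} \end{bmatrix},
\qquad
W_c = \begin{bmatrix} b_c & A_c \end{bmatrix}.
```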

Algorithm: Premise: replace the binary regression tree in MLLR with a DPM. Procedure: (1) train a speaker-independent (SI) model; (2) collect all mixture components and their frequencies of occurrence; (3) generate samples for each component and cluster them using a DPM; (4) construct a tree structure from the result using a bottom-up approach, starting from the terminal nodes and merging them based on Euclidean distance; (5) assign a cluster to each component using a majority-vote scheme; (6) with the resulting tree, compute the transformation matrices using a maximum likelihood approach. Inference is accomplished using three different variational algorithms: accelerated variational Dirichlet process mixtures (AVDP), collapsed variational stick-breaking (CSB), and collapsed Dirichlet priors (CDP). A sketch of the tree-building step (4) appears below.
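
A minimal sketch of the bottom-up tree construction in step (4) is given below. It is an assumed illustration: the DPM centroids are fabricated, scipy's agglomerative linkage stands in for the merge procedure, and the majority-vote assignment and ML transform estimation of steps (5) and (6) are not shown.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)
# Hypothetical stand-in for step (3)'s output: centroids of the clusters found by the DPM
# (here 6 fabricated centroids of 39-dimensional acoustic feature vectors).
dpm_centroids = rng.normal(size=(6, 39))

# Step (4): bottom-up merging of the terminal nodes based on Euclidean distance.
tree = linkage(dpm_centroids, method="centroid", metric="euclidean")

# Cutting the tree at a chosen number of classes gives the regression classes that will
# each share one MLLR transform; the cut level is a free choice in this sketch.
regression_classes = fcluster(tree, t=3, criterion="maxclust")
print(regression_classes)
```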

Results for Monophones: Experiments using the Resource Management (RM) corpus. Monophone models using a single Gaussian mixture model. 12 different speakers with 600 training utterances. The resulting clusters resemble broad phonetic classes. The DPM finds 6 clusters in the data, while the regression tree finds only 2. Word error rate (WER) can be reduced by more than 10%.

Results for Cross-Word Triphones: Cross-word triphone models use a single Gaussian mixture model. 12 different speakers with 600 training utterances. The clusters generated using the DPM have acoustically and phonetically meaningful interpretations. AVDP works better for moderate amounts of data, while CDP and CSB work better for larger amounts of data.

Future Directions: Use hierarchical nonparametric models (e.g., HDP-HMMs) to model acoustic units. Use nonparametric Bayesian segmentation to find new sets of acoustic units. The nonparametric Bayesian framework provides two important features that can facilitate speaker-dependent systems: (1) the number of speaker clusters is not known a priori and can grow as new data is obtained; (2) parameter sharing and model (and state) tying can be accomplished elegantly using proper hierarchies. Depending on the available training data, the system would have a different number of models for different acoustic units, with all acoustic units tied. Moreover, each model would have a different number of states and a different number of mixture components per state.

Brief Bibliography of Related Research

Biography Joseph Picone received his Ph.D. in Electrical Engineering in 1983 from the Illinois Institute of Technology. He is currently Professor and Chair of the Department of Electrical and Computer Engineering at Temple University. He recently completed a three-year sabbatical at the Department of Defense where he directed human language technology research and development. His primary research interests are currently machine learning approaches to acoustic modeling in speech recognition. For over 25 years he has conducted research on many aspects of digital speech and signal processing. He has also been a long-term advocate of open source technology, delivering one of the first state-of-the-art open source speech recognition systems, and maintaining one of the more comprehensive web sites related to signal processing. His research group is known for producing many innovative educational materials that have increased access to the field. Dr. Picone has previously been employed by Texas Instruments and AT&T Bell Laboratories, including a two-year assignment in Japan establishing Texas Instruments’ first international research center. He is a Senior Member of the IEEE, holds several patents in this area, and has been active in several professional societies related to human language technology.