Experiments with Detector-based Conditional Random Fields in Phonetic Recognition Jeremy Morris 06/01/2007

Presentation transcript:

1 Experiments with Detector-based Conditional Random Fields in Phonetic Recognition Jeremy Morris 06/01/2007

2 Outline Background Previous Work Feature Combination Experiments Viterbi Realignment Experiments Conclusions and Future Work

3 Background Goal: Integrate the outputs of speech attribute detectors for recognition, e.g. phone classifiers and phonological feature classifiers. Attribute detector outputs are highly correlated, e.g. a stop detector vs. a phone classifier for /t/ or /d/. Options for accounting for these correlations in an HMM: ignore them (decreased performance), use full covariance matrices (increased parameters), or explicitly decorrelate the inputs (e.g. Karhunen-Loeve transform).

4 Background Speech Attributes. Phonological feature attributes: detector outputs describe phonetic features of a speech signal (Place, Manner, Voicing, Vowel Height, Backness, etc.); a phone is described by a vector of feature values. Phone class attributes: detector outputs describe the phone label associated with a portion of the speech signal (/t/, /d/, /aa/, etc.).

5 Background Conditional Random Fields (CRFs): a discriminative probabilistic model, used successfully in domains such as part-of-speech tagging and named entity recognition. A CRF directly defines the posterior probability P(Y|X) of a label sequence Y given an input sequence X, and it does not make independence assumptions about correlations among the input features.

6 Background CRFs for ASR. Phone classification (Gunawardana et al., 2005) uses sufficient statistics to define the feature functions, a different approach from NLP tasks using CRFs, which define binary feature functions to characterize the observations. Our approach follows the latter method: use neural networks to provide "soft binary" feature functions (e.g. posterior phone outputs).

7 Conditional Random Fields Based on the framework of Markov Random Fields. [Figure: linear-chain graph with label nodes /k/ and /iy/]

8 Conditional Random Fields Based on the framework of Markov Random Fields: the label sequence forms a CRF if its graph is an MRF when conditioned on a set of input observations (Lafferty et al., 2001). [Figure: label nodes /k/ and /iy/ conditioned on observation nodes X]

9 Conditional Random Fields Based on the framework of Markov Random Fields: the label sequence forms a CRF if its graph is an MRF when conditioned on the input observations. State functions help determine the identity of the state. [Figure: label nodes /k/ and /iy/ conditioned on observation nodes X]

10 Conditional Random Fields Based on the framework of Markov Random Fields: the label sequence forms a CRF if its graph is an MRF when conditioned on the input observations. State functions help determine the identity of the state; transition functions add associations between transitions from one label to another. [Figure: label nodes /k/ and /iy/ conditioned on observation nodes X]

11 Conditional Random Fields A CRF is defined by a weighted sum of state and transition functions. Both types of functions can be defined to incorporate the observed inputs. The weights are trained by maximizing the conditional log-likelihood with gradient-based methods.
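For reference, this is the standard linear-chain CRF form from Lafferty et al. (2001), with state functions s_i, transition functions f_j, and weights lambda_i and mu_j (generic notation, not copied from these slides):

```latex
P(Y \mid X) \;=\; \frac{1}{Z(X)} \exp\!\left( \sum_{t} \Bigl[ \sum_{i} \lambda_i\, s_i(y_t, X, t) \;+\; \sum_{j} \mu_j\, f_j(y_{t-1}, y_t, X, t) \Bigr] \right)
```

Here Z(X) normalizes over all possible label sequences for the input X.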

12 Previous Work Implemented CRF models on data from phonetic attribute detectors, performed phone recognition, and compared results to a Tandem/HMM system on the same data. Experimental data: the TIMIT corpus of read speech.

13 Attribute Selection Attribute detectors: ICSI QuickNet neural networks, covering two different types of attributes. Phonological feature detectors: Place, Manner, Voicing, Vowel Height, Backness, etc.; N-ary features in eight different classes; posterior outputs, e.g. P(Place=dental | X). Phone detectors: neural network outputs based on the phone labels, trained using PLP features (12 coefficients plus deltas).

14 Experimental Setup CRF code: built on the Java CRF toolkit from Sourceforge. Performs maximum (conditional) log-likelihood training, using the limited-memory BFGS (L-BFGS) algorithm to minimize the negative log-likelihood.
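As a reminder of the standard CRF training math (not taken from these slides), the partial derivative of the conditional log-likelihood for a transition weight mu_j is the observed feature count minus the expected count under the current model; the optimizer works on the negative of this objective, and state-function weights are analogous:

```latex
\frac{\partial \log P(Y \mid X)}{\partial \mu_j}
  \;=\; \sum_{t} f_j(y_{t-1}, y_t, X, t)
  \;-\; \sum_{t} \sum_{y',\,y''} P(Y_{t-1}{=}y',\, Y_t{=}y'' \mid X)\, f_j(y', y'', X, t)
```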

15 Experimental Setup Feature functions are built using the neural net output; each attribute/label combination gives one feature function. Phone class: s_{/t/,/t/} or s_{/t/,/s/}. Feature class: s_{/t/,stop} or s_{/t/,dental}.
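A small sketch of these "soft binary" state feature functions, assuming a hypothetical per-frame posterior lookup (the names state_feature and frame_posteriors are illustrative, not from the toolkit):

```python
# Hypothetical illustration: one state feature function per (label, attribute)
# pair; its value at a frame is the detector's posterior for that attribute,
# and it is active only when the candidate label matches.

def state_feature(label, attribute, candidate_label, frame_posteriors):
    """s_{label,attribute}: soft-binary feature built from a neural-net posterior.

    frame_posteriors maps an attribute or phone name to P(attribute | X) at
    the current frame.
    """
    if candidate_label != label:
        return 0.0
    return frame_posteriors.get(attribute, 0.0)

# Example: s_{/t/,stop} fires with the stop-detector posterior when the
# candidate label for the frame is /t/.
frame_posteriors = {"stop": 0.91, "dental": 0.07, "/t/": 0.85, "/s/": 0.02}
value = state_feature("/t/", "stop", "/t/", frame_posteriors)
```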

16 Experimental Setup Baseline system for comparison: a Tandem/HMM baseline (Hermansky et al., 2000), which uses the outputs from the neural networks as inputs to a Gaussian-based HMM system, built using the HTK HMM toolkit. Linear inputs: Tandem performs better with the linear outputs from the neural network, decorrelated using a Karhunen-Loeve (KL) transform.
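A minimal sketch of this kind of KL (principal component) decorrelation of network outputs, under the assumption that it is a plain eigendecomposition of the output covariance; the array shapes and names here are illustrative only:

```python
import numpy as np

def kl_transform(features):
    """Decorrelate feature vectors (frames x dims) with a Karhunen-Loeve transform.

    Returns the decorrelated features plus the projection (mean, basis) so the
    same transform can be applied to held-out data.
    """
    mean = features.mean(axis=0)
    centered = features - mean
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]         # sort components by variance
    basis = eigvecs[:, order]
    return centered @ basis, (mean, basis)

# Example with stand-in data for the linear network outputs (frames x dims).
nn_linear_outputs = np.random.randn(1000, 45)
decorrelated, projection = kl_transform(nn_linear_outputs)
```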

17 Initial Results (Morris & Fosler-Lussier, 2006)

Model                              Params     Phone Accuracy
Tandem [1] (phones)                20,...     ...%
Tandem [3] (phones), 4mix          420,...    ...%*
CRF [1] (phones)                   ...        ...%*
Tandem [1] (feas)                  14,...     ...%
Tandem [3] (feas), 4mix            360,...    ...%*
CRF [1] (feas)                     ...        ...%*
Tandem [1] (phones/feas)           34,...     ...%
Tandem [3] (phones/feas), 4mix     774,...    ...%
CRF (phones/feas)                  ...        ...%*

* Significantly (p<0.05) better than comparable Tandem monophone system
* Significantly (p<0.05) better than comparable CRF monophone system

18 Feature Combinations The CRF model should be robust to highly correlated features, since it makes no assumptions about feature independence. We tested this claim with combinations of correlated features: phone class outputs + phonological feature outputs, and posterior outputs + transformed linear outputs. We also tested whether linear, decorrelated outputs improve CRF performance.

19 Feature Combinations - Results

Model                                      Phone Accuracy
CRF (phone posteriors)                     67.32%
CRF (phone linear KL)                      66.80%
CRF (phone post + linear KL)               68.13%*
CRF (phono. feature post.)                 65.45%
CRF (phono. feature linear KL)             66.37%
CRF (phono. feature post + linear KL)      67.36%*

* Significantly (p<0.05) better than comparable posterior or linear KL systems

20 Viterbi Realignment Hypothesis: the CRF results above were obtained using only the pre-defined phone boundaries. The HMM allows "boundaries" to shift during training, but the basic CRF training process does not. We therefore modify training to allow for better boundaries (sketched below): train the CRF with fixed boundaries, force-align the training labels using the CRF, then adapt the CRF weights using the new boundaries.
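A schematic of that realignment loop; train_crf and viterbi_align are hypothetical stand-ins for the toolkit's training and constrained-alignment steps, passed in as callables so the sketch stays self-contained:

```python
def realignment_training(train_crf, viterbi_align, features, labels,
                         initial_boundaries, num_passes=1):
    """Viterbi realignment training sketch (illustrative, not the actual code).

    train_crf(features, labels, boundaries, initial_weights=None) -> weights
    viterbi_align(weights, features, labels) -> realigned boundaries
    """
    # 1) Train the CRF with the original, fixed phone boundaries.
    weights = train_crf(features, labels, initial_boundaries)

    boundaries = initial_boundaries
    for _ in range(num_passes):
        # 2) Force-align the training labels with the current CRF
        #    (Viterbi decoding constrained to the reference label sequence).
        boundaries = viterbi_align(weights, features, labels)

        # 3) Adapt the CRF weights using the realigned boundaries.
        weights = train_crf(features, labels, boundaries,
                            initial_weights=weights)
    return weights
```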

21 Viterbi Realignment - Results

Model                                     Accuracy
CRF (phone posteriors)                    67.32%
CRF (phone posteriors – realigned)        69.92%***
Tandem [3] 4mix (phones)                  68.07%
Tandem [3] 16mix (phones)                 69.34%
CRF (phono. fea. linear KL)               66.37%
CRF (phono. fea. lin-KL – realigned)      68.99%**
Tandem [3] 4mix (phono. fea.)             68.30%
Tandem [3] 16mix (phono. fea.)            69.13%
CRF (phones+feas)                         68.43%
CRF (phones+feas – realigned)             70.63%***
Tandem [3] 16mix (phones+feas)            69.40%

* Significantly (p<0.05) better than comparable CRF monophone system
** Significantly (p<0.05) better than comparable Tandem 4mix triphone system
*** Significantly (p<0.05) better than comparable Tandem 16mix triphone system

22 Conclusions Using correlated features in the CRF model did not degrade performance; the extra features improved performance for the CRF model across the board. Viterbi realignment training significantly improved CRF results; the improvement did not occur when the best HMM-aligned transcript was used for training.

23 Future Work Recently implemented stochastic gradient training for CRFs, giving faster training and improved results. Work is currently being done to extend the model to word recognition. We are also examining the use of transition functions that use the observation data.
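For illustration only, a per-utterance stochastic gradient update of the kind the slide refers to; the learning rate and the gradient helper are assumptions, not details from this work:

```python
def sgd_epoch(weights, utterances, gradient_fn, learning_rate=0.01):
    """One stochastic-gradient pass over the training utterances (sketch).

    gradient_fn(weights, utterance) is assumed to return the gradient of the
    conditional log-likelihood for a single utterance (observed minus expected
    feature counts), e.g. as a numpy array matching `weights`.
    """
    for utterance in utterances:
        grad = gradient_fn(weights, utterance)
        weights = weights + learning_rate * grad  # ascend the log-likelihood
    return weights
```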