Conditional Random Fields for ASR

Presentation transcript:

Conditional Random Fields for ASR Jeremy Morris May 5, 2006

Overview
- Problem Statement (Motivation)
- Conditional Random Fields
- Experiments
  - Attribute Selection
  - Experimental Setup
  - Results
- Future Work

Problem Statement
- Developed as part of the ASAT Project (Automatic Speech Attribute Transcription)
- Goal: Develop a system for bottom-up speech recognition using 'speech attributes'

Speech Attributes?
- Any information that could be useful for recognizing the spoken language
  - Phonetic attributes
  - Speaker attributes (gender, age, etc.)
  - Any other useful attributes that could be used for speech recognition
- Note that there is no guarantee that attributes will be independent of each other
- One part of this project is to explore ways to create a framework for easily combining new features for experimental purposes
- Example phonetic attributes:
  - /d/: manner: stop; place of articulation: dental; voicing: voiced
  - /iy/: height: high; backness: front; roundness: nonround
  - /t/: manner: stop; place of articulation: dental; voicing: unvoiced

Evidence Combination
- Two basic ways to build hypotheses
- Top Down
  - Generate a hypothesis
  - See if the data fits the hypothesis
- Bottom Up
  - Examine the data
  - Search for a hypothesis that fits

Top Down
- Traditional Automatic Speech Recognition (ASR) systems use a top-down approach
- Hypothesis is the phone we are predicting
- Data is some encoding of the acoustic speech signal
- A likelihood of the signal given the phone label, P(X|/iy/), is learned from data
- A prior probability for the phone label, P(/iy/), is learned from the data
- These are combined through Bayes' Rule to give us the posterior probability
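The combination step on this slide can be sketched in a few lines. This is only an illustration: the two-phone inventory and all probability values below are hypothetical, not taken from the presentation.

```python
def posterior(priors, likelihoods):
    """Combine P(phone) and P(X | phone) into P(phone | X) via Bayes' rule."""
    joint = {phone: priors[phone] * likelihoods[phone] for phone in priors}
    evidence = sum(joint.values())  # P(X), the normalizing constant
    return {phone: p / evidence for phone, p in joint.items()}

# Hypothetical numbers for a single acoustic observation X.
priors = {"/iy/": 0.6, "/k/": 0.4}        # P(phone)
likelihoods = {"/iy/": 0.2, "/k/": 0.05}  # P(X | phone)

post = posterior(priors, likelihoods)     # P(phone | X)
```

A bottom-up model skips the two intermediate quantities and estimates the posterior directly from the observation.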

Bottom Up
- Bottom-up models have the same high-level goal: determine the label from the observation
- But instead of a likelihood, the posterior probability P(/iy/|X) is learned from the data
- Neural Networks have been used to learn these probabilities

Speech is a Sequence
- Speech is not a single, independent event
- It is a combination of multiple events over time (e.g. /k/ followed by /iy/)
- A model to recognize spoken language should take into account dependencies across time

Speech is a Sequence
- A top-down (generative) model can be extended into a time sequence as a Hidden Markov Model (HMM)
- Now our likelihood of the data is over the entire sequence instead of a single phone
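To make "likelihood over the entire sequence" concrete, here is a minimal sketch of the HMM forward algorithm over two states. The state inventory, the probabilities, and the function name `sequence_likelihood` are assumptions made for illustration and do not come from the slides.

```python
def sequence_likelihood(obs_lik, init, trans):
    """Forward algorithm: P(X_1..X_T), summed over all hidden phone sequences.

    obs_lik[t][s] is P(X_t | state s); init[s] is P(state_1 = s);
    trans[r][s] is P(state_t = s | state_{t-1} = r).
    """
    states = list(init)
    # alpha[s] = P(X_1..X_t, state_t = s), starting at t = 1.
    alpha = {s: init[s] * obs_lik[0][s] for s in states}
    for frame in obs_lik[1:]:
        alpha = {
            s: frame[s] * sum(alpha[r] * trans[r][s] for r in states)
            for s in states
        }
    return sum(alpha.values())

# Hypothetical two-frame example with phones /k/ and /iy/.
init = {"/k/": 0.9, "/iy/": 0.1}
trans = {"/k/": {"/k/": 0.5, "/iy/": 0.5}, "/iy/": {"/k/": 0.1, "/iy/": 0.9}}
obs_lik = [{"/k/": 0.7, "/iy/": 0.1}, {"/k/": 0.2, "/iy/": 0.6}]

likelihood = sequence_likelihood(obs_lik, init, trans)
```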

Speech is a Sequence
- Tandem is a method for using evidence bottom up (discriminative)
- Hypothesis output of a Neural Network (Y) is used to train an HMM
- Not a pure discriminative method, but a combination of generative and discriminative methods

Bottom-up Modelling
- The idea is to have a system that combines evidence layer by layer
  - Speech attributes contribute to phone attribute detection
  - Phone attributes contribute to “syllable” attribute detection, and so on
- Each layer combines information from previous layers to form its hypotheses
- We want to do this probabilistically: no hard decisions

Conditional Random Fields
- A form of discriminative modelling
  - Has been used successfully in various domains such as part-of-speech tagging and other Natural Language Processing tasks
- Processes evidence bottom-up
  - Combines multiple features of the data
  - Builds the probability P(sequence | data)

Conditional Random Fields: Conceptual Overview
- Each attribute of the data we are trying to model fits into a feature function that associates the attribute and a possible label
  - A positive value if the attribute appears in the data
  - A zero value if the attribute is not in the data
- Each feature function carries a weight that gives the strength of that feature function for the proposed label
  - High positive weights indicate a good association between the feature and the proposed label
  - High negative weights indicate a negative association between the feature and the proposed label
  - Weights close to zero indicate the feature has little or no impact on the identity of the label

Conditional Random Fields
- Example: label sequence /k/ /k/ /iy/ /iy/ /iy/, one label per observation X
- CRFs have transition feature functions and state feature functions
  - Transition functions add associations between transitions from one label to another
  - State functions help determine the identity of the state

Conditional Random Fields
- State Feature Function: f([x is stop], /t/)
  - One possible state feature function for our attributes and labels
- State Feature Weight: λ = 10
  - One possible weight value for this state feature (strong)
- Transition Feature Function: g(x, /iy/, /k/)
  - One possible transition feature function
  - Indicates /k/ followed by /iy/
- Transition Feature Weight: μ = 4
  - One possible weight value for this transition feature
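The sketch below shows how these two feature functions and their weights could combine into P(sequence | data). It uses the slide's example functions and weights (λ = 10, μ = 4), but the three-label inventory, the set-of-attributes frame encoding, and the brute-force normalization are assumptions made to keep the example tiny; a real toolkit would compute the normalizer with dynamic programming.

```python
import math
from itertools import product

LABELS = ["/t/", "/k/", "/iy/"]  # hypothetical label inventory
LAMBDA = 10.0  # state feature weight from the slide
MU = 4.0       # transition feature weight from the slide

def state_feature(frame, label):
    """f([x is stop], /t/): fires when the frame is a stop and the label is /t/."""
    return 1.0 if "stop" in frame and label == "/t/" else 0.0

def transition_feature(prev_label, label):
    """g(x, /iy/, /k/): fires when /k/ is followed by /iy/."""
    return 1.0 if prev_label == "/k/" and label == "/iy/" else 0.0

def score(labels, frames):
    """Weighted sum of all state and transition features along the sequence."""
    total = sum(LAMBDA * state_feature(x, y) for x, y in zip(frames, labels))
    total += sum(MU * transition_feature(a, b) for a, b in zip(labels, labels[1:]))
    return total

def sequence_probability(labels, frames):
    """P(labels | frames) = exp(score) / Z, with Z summed over every labelling."""
    z = sum(math.exp(score(cand, frames))
            for cand in product(LABELS, repeat=len(frames)))
    return math.exp(score(labels, frames)) / z

# Two frames: the first carries the 'stop' attribute, the second carries none.
frames = [{"stop"}, set()]
p_t_iy = sequence_probability(("/t/", "/iy/"), frames)
p_k_iy = sequence_probability(("/k/", "/iy/"), frames)
```

With these weights the strong state feature dominates: labellings that start with /t/ receive far more probability mass than /k/ /iy/, even though the latter fires the transition feature.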

Experiments
- Goal: Implement a Conditional Random Field model on ASAT-style data
  - Perform phone recognition
  - Compare results to those obtained via a Tandem system
- Experimental Data
  - TIMIT read speech corpus
  - Moderate-sized corpus of clean, prompted speech, complete with phonetic-level transcriptions

Attribute Selection
- Attribute Detectors
  - ICSI QuickNet Neural Networks
- Two different types of attributes
  - Phonological feature detectors
    - Place, Manner, Voicing, Vowel Height, Backness, etc.
    - Features are grouped into eight classes, with each class having a variable number of possible values based on the IPA phonetic chart
  - Phone detectors
    - Neural network outputs based on the phone labels: one output per label
- Classifiers were applied to 2960 utterances from the TIMIT training set

Experimental Setup
- Code built on the Java CRF toolkit on Sourceforge
  - http://crf.sourceforge.net
- Performs training to maximize the log-likelihood of the training set with respect to the model
  - Uses a Limited Memory BFGS (L-BFGS) algorithm for the optimization, which uses the gradient of the log-likelihood to search for the maximum
  - For CRF models, maximizing the log-likelihood of the empirical distribution of the data as predicted by the model is the same as maximizing the entropy (Berger et al.)
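The slides do not write out the training objective. As a hedged sketch, the standard linear-chain CRF form is given below, reusing the λ/f and μ/g symbols from the earlier slide; the indices i, j, t and the training-example index k are notation introduced here, not taken from the presentation.

```latex
% Conditional probability of a label sequence y given observations x
P(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_{t} \sum_{i} \lambda_i f_i(x, y_t)
            + \sum_{t} \sum_{j} \mu_j g_j(x, y_t, y_{t-1}) \Big)

% Log-likelihood of the training set, the quantity being maximized
\mathcal{L}(\lambda, \mu) = \sum_{k} \log P\big(y^{(k)} \mid x^{(k)}\big)

% Gradient for a state weight: observed feature count minus expected count
\frac{\partial \mathcal{L}}{\partial \lambda_i}
  = \sum_{k} \Big[ \sum_{t} f_i\big(x^{(k)}, y^{(k)}_t\big)
  - \mathbb{E}_{y' \sim P(\cdot \mid x^{(k)})} \sum_{t} f_i\big(x^{(k)}, y'_t\big) \Big]
```

The gradient vanishes when the model's expected feature counts match the counts observed in the training data, which is the same condition that characterizes the maximum entropy solution.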

Experimental Setup
- Outputs from the Neural Nets are themselves treated as feature functions for the observed sequence: each attribute/label combination gives us a value for one feature function
- Note that this makes the feature functions real-valued (non-binary) features
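A minimal sketch of this idea follows, assuming each frame is represented as a dictionary of detector posteriors. The attribute names, weights, and helper names are hypothetical; they only illustrate how one real-valued feature arises per attribute/label pair.

```python
def state_features(posteriors, label):
    """One real-valued feature per (attribute, label) pair for a single frame.

    The feature value is the detector's posterior for that attribute,
    rather than a 0/1 indicator.
    """
    return {(attr, label): p for attr, p in posteriors.items()}

def state_score(posteriors, label, weights):
    """Weighted sum of the real-valued state features for one proposed label."""
    feats = state_features(posteriors, label)
    return sum(weights.get(key, 0.0) * value for key, value in feats.items())

# Hypothetical neural network outputs for one frame.
frame = {"stop": 0.9, "voiced": 0.2}

# Hypothetical learned weights, one per attribute/label combination.
weights = {
    ("stop", "/t/"): 2.0, ("voiced", "/t/"): -1.0,
    ("stop", "/d/"): 2.0, ("voiced", "/d/"): 1.0,
}

score_t = state_score(frame, "/t/", weights)
score_d = state_score(frame, "/d/", weights)
```

Here the low "voiced" posterior only slightly penalizes /t/ and only slightly rewards /d/, so the detector's confidence scales each feature's contribution instead of switching it on or off.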

Results

| Model | Phone Accuracy | Phone Correctness |
|---|---|---|
| Tandem (phones) | 67.32% | 73.81% |
| CRF (phones) | 66.89% | 68.49% |
| Tandem (features) | 66.85% | 72.42% |
| CRF (features) | 63.84% | 65.45% |
| CRF (phones/features) | 67.87% | 69.47% |
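The slides do not define the two metrics. Assuming the common HTK-style scoring convention (an assumption, not something the presentation states), correctness ignores insertion errors while accuracy penalizes them, which would explain why accuracy is never higher than correctness in the table. The counts below are hypothetical.

```python
def phone_correctness(n_ref, substitutions, deletions):
    """Fraction of reference phones recognized correctly (insertions ignored)."""
    return (n_ref - substitutions - deletions) / n_ref

def phone_accuracy(n_ref, substitutions, deletions, insertions):
    """Like correctness, but inserted phones also count as errors."""
    return (n_ref - substitutions - deletions - insertions) / n_ref

# Hypothetical alignment counts against 1000 reference phones.
correct = phone_correctness(1000, substitutions=200, deletions=60)
accuracy = phone_accuracy(1000, substitutions=200, deletions=60, insertions=50)
```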

Future Work
- More features
  - What kinds of features can we add to improve our transitions?
- Tuning
  - The HMM model has parameters that can be tuned for better performance. Can we tweak the CRF to do something similar?
- Word recognition
  - How does this model do at the full word recognition level, instead of just phones?
- Other corpora
  - Can we extend this method beyond TIMIT to different types of corpora (e.g. WSJ)?