Automatic Speech Recognition: Conditional Random Fields for ASR
Jeremy Morris, Eric Fosler-Lussier, Ray Slyh
9/19/2008

Overview
Traditional Automatic Speech Recognition (ASR) systems are built from several components, which combine in the decoding rule shown below:
– Feature extraction
– Acoustic models
– Lexicon
– Language model
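
For reference, these components fit together in the standard noisy-channel decoding rule (a textbook formulation, not specific to this talk):

\hat{W} = \arg\max_{W} P(W \mid O) = \arg\max_{W} P(O \mid W)\, P(W)

where O is the sequence of acoustic feature vectors produced by feature extraction, P(O | W) comes from the acoustic model and lexicon, and P(W) from the language model.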

Overview – ASR System

Overview
The AFRL SCREAM Lab and the OSU SLaTe Lab have both investigated using phonetic features for ASR:
– Based on linguistic phonetic attributes of speech
– A language-independent model of speech: phonetic attributes cross languages
– Acoustic models built using these features have the potential to be language independent
– Both the SCREAM and SLaTe Labs have investigated these features as replacements for acoustic features

Objective
Investigate new methods for integrating phonetic features into ASR systems:
– Phonetic features can provide a level of language independence that acoustic features do not
– Phonetic features can provide a method for leveraging linguistic theory and observations in our ASR systems

Approach
Conditional Random Fields:
– We make use of a discriminative statistical model known as a Conditional Random Field (CRF)
– Traditional ASR systems use a generative statistical model known as a Hidden Markov Model (HMM)
– Any HMM can be transformed into a CRF, though the reverse is not true (see the formulation below)
– As a model, a CRF makes fewer independence assumptions than a corresponding HMM, potentially giving us a better model for our data
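
For reference, a linear-chain CRF models the label sequence y conditioned directly on the observations x (a standard formulation; the state feature functions s_i and transition feature functions f_j here are generic placeholders):

P(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \exp\Big( \sum_{t} \Big[ \sum_{i} \lambda_i\, s_i(y_t, \mathbf{x}, t) + \sum_{j} \mu_j\, f_j(y_{t-1}, y_t, \mathbf{x}, t) \Big] \Big)

Setting the state features to the HMM's log emission scores and the transition features to its log transition probabilities recovers the HMM's conditional distribution, which is why any HMM can be rewritten as a CRF while the converse does not hold.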

Approach
Two methods of integration:
– Use the CRF model to generate inputs for a traditional HMM-based ASR system
– Use the CRF model as a replacement for an HMM in a new type of ASR system
We have performed some investigation into both of these approaches.

Approach 1 – CRANDEM
Use the CRF to generate inputs to an HMM-based system:
– This approach parallels Tandem ASR systems, which use neural networks to generate inputs to a standard HMM-based ASR system (CRF-Tandem -> CRANDEM)
– CRFs are used to generate local posterior probabilities for each frame of input speech data (sketched below)
– These posteriors are then used as inputs to an HMM
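
A minimal sketch of the per-frame posterior computation via the forward-backward algorithm over a linear-chain CRF. This is illustrative only: the function names, array shapes, and log-potential inputs are assumptions for the sketch, not the actual CRF toolkit used in this work.

import numpy as np

def logsumexp(a, axis, keepdims=False):
    # Numerically stable log(sum(exp(a))) along an axis.
    m = np.max(a, axis=axis, keepdims=True)
    out = m + np.log(np.sum(np.exp(a - m), axis=axis, keepdims=True))
    return out if keepdims else np.squeeze(out, axis=axis)

def crf_frame_posteriors(log_unary, log_trans):
    """Per-frame label posteriors P(y_t = k | x) for a linear-chain CRF.

    log_unary: (T, K) state scores, e.g. sum_i lambda_i * s_i(k, x, t)
    log_trans: (K, K) transition scores mu[j, k] for label j -> k
    Returns a (T, K) array of posteriors (each row sums to 1).
    """
    T, K = log_unary.shape
    alpha = np.zeros((T, K))   # forward log scores
    beta = np.zeros((T, K))    # backward log scores (beta[T-1] = 0)
    alpha[0] = log_unary[0]
    for t in range(1, T):
        # Sum over the previous label j of alpha[t-1, j] + trans[j, k].
        alpha[t] = log_unary[t] + logsumexp(alpha[t - 1][:, None] + log_trans, axis=0)
    for t in range(T - 2, -1, -1):
        beta[t] = logsumexp(log_trans + (log_unary[t + 1] + beta[t + 1])[None, :], axis=1)
    # Posterior is proportional to alpha * beta; normalize each frame.
    log_post = alpha + beta
    log_post -= logsumexp(log_post, axis=1, keepdims=True)
    return np.exp(log_post)

Each row of the returned matrix is the posterior distribution over phone labels for one frame (typically 10 ms of speech), which CRANDEM then passes on as input features to the HMM.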

Progress 1 – CRANDEM
Pilot system:
– Train on phonetically transcribed speech data (the TIMIT corpus)
– Test to see if phone recognition accuracy shows improvement over the baseline
– Two experiments: phone classes and phonological feature classes

Progress 1 – Pilot System

Experiment (61 phone classes)    Phn. Acc.
HMM reference baseline (PLP)     68.1%
Tandem baseline                  70.8%
CRF baseline                     70.7%
CRANDEM                          71.8%

(Fosler-Lussier & Morris, ICASSP 2008)

Progress 1 – Pilot System

Experiment (44 phonetic feature classes)    Phn. Acc.
HMM reference baseline (PLP)                68.1%
Tandem baseline                             71.2%
CRF baseline*                               71.6%
CRANDEM                                     72.4%

(Fosler-Lussier & Morris, ICASSP 2008)
* Best score for the CRF is currently 74.5% (Heinz, Fosler-Lussier & Brew, submitted)

Progress 1 – CRANDEM
CRANDEM system:
– Use results from the pilot system to train a system on word-transcribed speech data (the WSJ corpus)
– Test to see if word recognition accuracy shows improvement over the baseline

Progress 1 – CRANDEM System

Experiment (54 phone classes)    WER
HMM reference baseline (MFCC)    9.15%
Tandem baseline (MLP+MFCC)       7.79%
CRANDEM (CRF+MFCC)               11.01%

Progress 1
Pilot CRANDEM systems showed:
– Improvement in phone accuracy
– Degradation in word accuracy
More tuning may be required:
– Initial CRANDEM results had 21.32% WER
– Tuning has dropped this to 11% so far
The error may be due to differing assumptions between CRFs and HMMs:
– CRFs are very "sure" of themselves, even when they are wrong
– CRF outputs do not fit the Gaussian distribution that the HMMs assume (one standard mitigation is sketched below)
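
One standard mitigation, borrowed from Tandem systems, is to take logs of the posteriors (flattening their near-one-hot peaks) and decorrelate them with a PCA/KLT transform before handing them to the Gaussian-based HMM. A minimal sketch follows; the specific preprocessing used in the CRANDEM experiments may differ.

import numpy as np

def tandem_postprocess(posteriors, n_components=None, eps=1e-8):
    """Log + PCA (KLT) transform of posterior features, Tandem-style.

    posteriors: (N, K) frame posteriors from the CRF (or MLP).
    Returns decorrelated log-posterior features better suited to
    diagonal-covariance Gaussian HMMs.
    """
    logp = np.log(posteriors + eps)          # squash near-one-hot peaks
    logp -= logp.mean(axis=0)                # center the features
    cov = np.cov(logp, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)   # KLT basis from the covariance
    order = np.argsort(eigvals)[::-1]        # sort by decreasing variance
    basis = eigvecs[:, order[:n_components]]
    return logp @ basis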

Approach 2 – CRF ASR
Use the CRF as a replacement for the HMM:
– CRFs generate local acoustic probabilities as a lattice of possibilities
– This lattice is composed with a probabilistic word network (the language model) to recognize word strings (a toy decoding sketch follows)
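
Conceptually, decoding searches the CRF's frame-level scores under the constraints of the word network. A toy Viterbi sketch is below; it is purely illustrative, with a dense (K, K) transition matrix standing in for the lexicon and language model, whereas the actual system would use weighted finite-state composition over lattices.

import numpy as np

def viterbi_decode(log_post, log_trans, log_prior):
    """Best state sequence given per-frame CRF scores and a transition graph.

    log_post:  (T, K) per-frame log scores from the CRF.
    log_trans: (K, K) log transition weights encoding the word network
               (use -inf to forbid a transition).
    log_prior: (K,) log weights for the initial state.
    """
    T, K = log_post.shape
    delta = np.full((T, K), -np.inf)         # best log score ending in each state
    backptr = np.zeros((T, K), dtype=int)    # argmax predecessors
    delta[0] = log_prior + log_post[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans   # [prev, cur]
        backptr[t] = np.argmax(scores, axis=0)
        delta[t] = np.max(scores, axis=0) + log_post[t]
    # Trace back from the best final state.
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return path[::-1]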

Progress 2
Pilot system:
– Train a CRF over a phonetically transcribed speech corpus (TIMIT)
– Use this to derive training data from a word-transcribed, restricted-vocabulary speech corpus (TI-DIGITS)
– Perform training and testing over the restricted vocabulary

Progress 2 – TI-DIGITS Recognition

Experiment (PLPs only)    WER
HMM baseline (32 mix)     0.18%
HMM (1 mix)               1.55%
CRF (PLPs only)           1.19%

Progress 2 – TI-DIGITS Recognition

Experiment (PLPs + phone classes)    WER
HMM baseline (32 mix)                0.25%
CRF (PLPs only)                      1.01%

Progress 2 – TI-DIGITS Recognition
Results show the CRF performing within reach of the state of the art for this task:
– Some more tuning may be necessary; certain parameters of the CRF have not yet been exploited to their full potential
– HMMs have had over thirty years of exploration in how to tune parameters for speech recognition; the space for CRFs is just beginning to be explored
– Our initial results were much worse (over 10% WER), but experimentation has brought them down to almost 1%

Progress Summary
Phone recognition:
– Over the last year, we have improved CRF phone recognition results on TIMIT to near the best reported HMM result (75% accuracy)
Large-vocabulary recognition:
– A code framework is now in place for large-vocabulary recognition experiments
– Challenges have been identified, and ideas are being examined for tackling them

Future Work
The CRF model has not yet been fully explored for word recognition in these experiments:
– More tuning is likely required for both the CRANDEM experiments and the direct recognition experiments
– Direct word recognition experiments need to be performed on a larger-vocabulary corpus (WSJ)
– More analysis of the CRANDEM results is needed
Results suggest the need for pronunciation modeling in the direct CRF model:
– These experiments have begun, but there are no results to report as yet

Thank You