Speech and Language Processing


Speech and Language Processing: Chapter 9 of SLP, Automatic Speech Recognition

Outline for ASR
- ASR Architecture
- The Noisy Channel Model
- Five easy pieces of an ASR system:
  - Feature Extraction
  - Acoustic Model
  - Language Model
  - Lexicon/Pronunciation Model (introduction to HMMs again)
  - Decoder
- Evaluation

Speech Recognition
Applications of Speech Recognition (ASR):
- Dictation
- Telephone-based information (directions, air travel, banking, etc.)
- Hands-free (in car)
- Speaker identification
- Language identification
- Second language ('L2') (accent reduction)
- Audio archive searching

LVCSR
- Large Vocabulary Continuous Speech Recognition
- ~20,000-64,000 words
- Speaker-independent (vs. speaker-dependent)
- Continuous speech (vs. isolated-word)

Current error rates
(Ballpark numbers; exact numbers depend very much on the specific corpus)

    Task                      Vocabulary   Error Rate %
    Digits                    11           0.5
    WSJ read speech           5K           3
    WSJ read speech           20K          3
    Broadcast news            64,000+      10
    Conversational Telephone  64,000+      20

HSR versus ASR

    Task               Vocab   ASR   Human SR
    Continuous digits  11      .5    .009
    WSJ 1995 clean     5K      3     0.9
    WSJ 1995 w/noise   5K      9     1.1
    SWBD 2004          65K     20    4

Conclusions:
- Machines are about 5 times worse than humans
- The gap increases with noisy speech
- These numbers are rough; take them with a grain of salt

Why is conversational speech harder?
- A piece of an utterance without context
- The same utterance with more context

LVCSR Design Intuition
- Build a statistical model of the speech-to-words process
- Collect lots and lots of speech, and transcribe all the words
- Train the model on the labeled speech
- Paradigm: supervised machine learning + search

Speech Recognition Architecture

The Noisy Channel Model
- Search through the space of all possible sentences
- Pick the one that is most probable given the waveform

The Noisy Channel Model (II)
- What is the most likely sentence out of all sentences in the language L, given some acoustic input O?
- Treat the acoustic input O as a sequence of individual observations: O = o_1, o_2, o_3, ..., o_t
- Define a sentence as a sequence of words: W = w_1, w_2, w_3, ..., w_n

Noisy Channel Model (III)
- Probabilistic implication: pick the sentence W with the highest probability given the acoustics
- We can use Bayes' rule to rewrite this
- Since the denominator is the same for each candidate sentence W, we can ignore it for the argmax (see the derivation below)
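The equations on this slide were images; reconstructed below is the standard noisy-channel derivation from SLP Chapter 9:

```latex
\hat{W} = \mathop{\arg\max}_{W \in \mathcal{L}} P(W \mid O)
        = \mathop{\arg\max}_{W \in \mathcal{L}} \frac{P(O \mid W)\, P(W)}{P(O)}
        = \mathop{\arg\max}_{W \in \mathcal{L}} P(O \mid W)\, P(W)
```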

Noisy channel model
- W* = argmax_{W in L} P(O|W) P(W), where P(O|W) is the likelihood and P(W) is the prior

The noisy channel model
- Ignoring the denominator leaves us with two factors: P(Source) and P(Signal|Source)

Speech Architecture meets Noisy Channel

Architecture: Five easy pieces (only 3-4 for today)
- HMMs, lexicons, and pronunciation
- Feature extraction
- Acoustic modeling
- Decoding
- Language modeling (seen this already)

Lexicon
- A list of words, each one with a pronunciation in terms of phones
- We get these from an on-line pronunciation dictionary
- CMU dictionary: 127K words, http://www.speech.cs.cmu.edu/cgi-bin/cmudict
- We'll represent the lexicon as an HMM
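In code, a pronunciation lexicon is just a mapping from words to phone sequences. A minimal sketch with CMUdict-style entries (ARPAbet phones; the digits mark vowel stress):

```python
# Tiny CMUdict-style lexicon: word -> list of phones (ARPAbet)
lexicon = {
    "six":  ["S", "IH1", "K", "S"],
    "five": ["F", "AY1", "V"],
    "two":  ["T", "UW1"],
}

print(lexicon["six"])   # ['S', 'IH1', 'K', 'S']
```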

HMMs for speech: the word “six”

Phones are not homogeneous!

Each phone has 3 subphones (beginning, middle, and end)

Resulting HMM word model for “six”, with its subphones

HMM for the digit recognition task

Detecting Phones
Two stages:
- Feature extraction (basically a slice of a spectrogram)
- Building a phone classifier (using a GMM classifier)

MFCC: Mel-Frequency Cepstral Coefficients

MFCC process: windowing

Applying a Hamming window to the signal, and then computing the spectrum

Final Feature Vector
39 features per 10 ms frame:
- 12 MFCC features
- 12 delta MFCC features
- 12 delta-delta MFCC features
- 1 (log) frame energy
- 1 delta (log) frame energy
- 1 delta-delta (log) frame energy
So each frame is represented by a 39-dimensional vector (see the sketch below).
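A minimal sketch of extracting such a feature matrix with the librosa library (one possible toolkit; the slides don't prescribe an implementation, and librosa's 0th cepstral coefficient stands in for the separate log-energy term, so this approximates rather than exactly reproduces the 12+1 layout above):

```python
# Sketch: 39-dimensional MFCC + delta + delta-delta features per frame
import librosa
import numpy as np

y, sr = librosa.load("utterance.wav", sr=16000)   # hypothetical file

# 13 coefficients per frame: 10 ms hop (160 samples at 16 kHz),
# 25 ms analysis window (400 samples)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            hop_length=160, n_fft=400)
delta = librosa.feature.delta(mfcc)               # first derivatives
delta2 = librosa.feature.delta(mfcc, order=2)     # second derivatives

features = np.vstack([mfcc, delta, delta2])       # shape: (39, n_frames)
```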

Acoustic Modeling (= phone detection)
- Given a 39-dimensional vector corresponding to the observation of one frame, o_i
- And given a phone q we want to detect
- Compute p(o_i|q)
- Most popular method: GMMs (Gaussian mixture models)
- Other methods: neural nets, CRFs, SVMs, etc.

Gaussian Mixture Models
- Also called “fully-continuous HMMs”
- P(o|q) computed by a Gaussian:
  P(o|q) = (1 / sqrt(2*pi*sigma^2)) * exp(-(o - mu)^2 / (2*sigma^2))

Gaussians for Acoustic Modeling
- A Gaussian is parameterized by a mean and a variance
- [Figure: Gaussians with different means; P(o|q) is highest at the mean, and low for observations very far from the mean]

Training Gaussians
- A (single) Gaussian is characterized by a mean and a variance
- Imagine that we had some training data in which each phone was labeled
- And imagine that we were computing just one single spectral value (a real-valued number) as our acoustic observation
- We could just compute the mean and variance from the data:
  mu_hat = (1/T) * sum_{t=1..T} o_t
  sigma_hat^2 = (1/T) * sum_{t=1..T} (o_t - mu_hat)^2
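As a worked example, the same estimates in a few lines of Python (the observation values here are made up):

```python
import numpy as np

# Hypothetical single spectral values from frames labeled as one phone
obs = np.array([3.1, 2.9, 3.4, 3.0, 2.8])

mu_hat = obs.mean()                       # (1/T) * sum of o_t
var_hat = ((obs - mu_hat) ** 2).mean()    # ML variance estimate
print(mu_hat, var_hat)
```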

But we need 39 Gaussians, not 1!
- The observation o is really a vector of length 39
- So we need a vector of Gaussians; treating the dimensions as independent (a diagonal covariance):
  P(o|q) = product over d = 1..39 of (1 / sqrt(2*pi*sigma_d^2)) * exp(-(o_d - mu_d)^2 / (2*sigma_d^2))
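In log space this product becomes a sum, which is how it is usually computed in practice. A minimal sketch:

```python
import numpy as np

def diag_gaussian_loglik(o, mu, var):
    """log P(o|q) for a diagonal-covariance Gaussian.

    o, mu, var: length-39 vectors; var holds one variance
    per cepstral dimension."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (o - mu) ** 2 / var)
```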

Actually, a mixture of Gaussians
- Each phone is modeled by a sum of different Gaussians, and is hence able to model complex facts about the data
- [Figure: example mixture densities for Phone A and Phone B]

Gaussians for acoustic modeling: summary
- Each phone is represented by a GMM parameterized by:
  - M mixture weights
  - M mean vectors
  - M covariance matrices
- Usually we assume the covariance matrix is diagonal, i.e. we just keep a separate variance for each cepstral feature (see the fitting sketch below)
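A sketch of fitting one such GMM per phone with scikit-learn (one possible toolkit, not prescribed by the slides; the training frames here are random placeholder data standing in for feature vectors aligned to one phone):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Placeholder for the (n_frames, 39) feature vectors labeled as one phone
frames_for_phone = np.random.randn(500, 39)

# M = 16 mixture components, diagonal covariances (one variance
# per cepstral feature, as on the slide)
gmm = GaussianMixture(n_components=16, covariance_type="diag")
gmm.fit(frames_for_phone)

o = np.random.randn(1, 39)       # one observation frame
log_p = gmm.score_samples(o)     # log P(o | phone)
```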

Where we are
- Given: a wave file
- Goal: output a string of words
- What we know: the acoustic model
  - How to turn the wave file into a sequence of acoustic feature vectors, one every 10 ms
  - If we had a complete phonetic labeling of the training set, we know how to train a Gaussian “phone detector” for each phone
  - We also know how to represent each word as a sequence of phones
- What we knew from Chapter 4: the language model
- Next time: seeing all this back in the context of HMMs
  - Search: how to combine the language model and the acoustic model to produce a sequence of words

Decoding
- In principle: pick the W maximizing the product of acoustic likelihood and prior
- In practice: the language model probability is scaled and a word insertion penalty is added (see below)
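The slide's equations were images; based on SLP Chapter 9, the intended contrast is between the idealized criterion and the practical one with a language model scaling factor (LMSF) and a word insertion penalty (WIP), where N is the number of words in W:

```latex
\text{In principle:}\quad
\hat{W} = \mathop{\arg\max}_{W} P(O \mid W)\, P(W)

\text{In practice:}\quad
\hat{W} = \mathop{\arg\max}_{W} P(O \mid W)\, P(W)^{\mathrm{LMSF}}\, \mathrm{WIP}^{N}
```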

Why is ASR decoding hard?

HMMs for speech

HMM for digit recognition task

The Evaluation (forward) problem for speech
- The observation sequence O is a series of MFCC vectors
- The hidden states W are the phones and words
- For a given phone/word string W, our job is to evaluate P(O|W)
- Intuition: how likely is the input to have been generated by just that word string W?

Evaluation for speech: summing over all different paths!
- f ay ay ay ay v v v v
- f f ay ay ay ay v v v
- f f f f ay ay ay ay v
- f f ay ay ay ay ay ay v
- f f ay ay ay ay ay ay ay ay v
- f f ay v v v v v v v
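The forward algorithm computes this sum with dynamic programming instead of enumerating every path. A toy sketch with a hypothetical three-state "five" model (all numbers invented for illustration; in a real system the observation likelihoods B come from the GMMs):

```python
import numpy as np

# Toy left-to-right HMM for "five" with states f, ay, v
pi = np.array([1.0, 0.0, 0.0])            # start in "f"
A = np.array([[0.5, 0.5, 0.0],            # f  -> f or ay
              [0.0, 0.6, 0.4],            # ay -> ay or v
              [0.0, 0.0, 1.0]])           # v  -> v
B = np.array([[0.8, 0.2, 0.1, 0.1],       # B[j, t] = P(o_t | state j)
              [0.1, 0.7, 0.6, 0.2],
              [0.1, 0.1, 0.3, 0.7]])

def forward(pi, A, B):
    N, T = B.shape
    alpha = np.zeros((N, T))
    alpha[:, 0] = pi * B[:, 0]
    for t in range(1, T):
        # alpha[j, t] = sum_i alpha[i, t-1] * A[i, j] * B[j, t]
        alpha[:, t] = (alpha[:, t - 1] @ A) * B[:, t]
    return alpha[:, -1].sum()   # P(O | word model): sum over all paths

print(forward(pi, A, B))
```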

The forward lattice for “five”

The forward trellis for “five”

Viterbi trellis for “five”

Search space with bigrams

Viterbi trellis

Viterbi backtrace
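A matching sketch of Viterbi decoding over the same toy "five" model as the forward sketch: replace the sum with a max, and keep backpointers so the best state path can be read off by backtracing (illustrative only):

```python
import numpy as np

# Same invented toy model as in the forward sketch
pi = np.array([1.0, 0.0, 0.0])
A = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.6, 0.4],
              [0.0, 0.0, 1.0]])
B = np.array([[0.8, 0.2, 0.1, 0.1],
              [0.1, 0.7, 0.6, 0.2],
              [0.1, 0.1, 0.3, 0.7]])

def viterbi(pi, A, B):
    N, T = B.shape
    v = np.zeros((N, T))
    bp = np.zeros((N, T), dtype=int)
    v[:, 0] = pi * B[:, 0]
    for t in range(1, T):
        scores = v[:, t - 1, None] * A      # scores[i, j]: from i to j
        bp[:, t] = scores.argmax(axis=0)    # best predecessor of j
        v[:, t] = scores.max(axis=0) * B[:, t]
    state = int(v[:, -1].argmax())
    path = [state]
    for t in range(T - 1, 0, -1):           # backtrace
        state = int(bp[state, t])
        path.append(state)
    return path[::-1], v[:, -1].max()

states = ["f", "ay", "v"]
path, p = viterbi(pi, A, B)
print([states[s] for s in path], p)         # e.g. ['f','ay','ay','v']
```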

Training

Evaluation
- How to evaluate the word string output by a speech recognizer?

Word Error Rate

    WER = 100 * (Insertions + Substitutions + Deletions) / Total Words in Correct Transcript

Alignment example:

    REF:  portable ****  PHONE  UPSTAIRS  last  night  so
    HYP:  portable FORM  OF     STORES    last  night  so
    Eval:          I     S      S

    WER = 100 * (1 + 2 + 0) / 6 = 50%
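A minimal sketch of the computation: WER is the word-level edit distance between reference and hypothesis, normalized by the reference length (a stripped-down version of what scoring tools such as sclite do):

```python
def wer(ref, hyp):
    r, h = ref.split(), hyp.split()
    # d[i][j] = min edits to turn the first i ref words
    # into the first j hyp words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                       # deletions
    for j in range(len(h) + 1):
        d[0][j] = j                       # insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return 100.0 * d[len(r)][len(h)] / len(r)

print(wer("portable phone upstairs last night so",
          "portable form of stores last night so"))   # 50.0
```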

NIST sctk-1.3 scoring software: computing WER with sclite
http://www.nist.gov/speech/tools/
Sclite aligns a hypothesized text (HYP, from the recognizer) with a correct or reference text (REF, human-transcribed):

    id: (2347-b-013)
    Scores: (#C #S #D #I) 9 3 1 2
    REF:  was an engineer SO  I   i was always with **** **** MEN  UM   and they
    HYP:  was an engineer **  AND i was always with THEM THEY ALL  THAT and they
    Eval:                 D   S                     I    I    S    S

Sclite output for error analysis

    CONFUSION PAIRS    Total (972)    With >= 1 occurances (972)

     1:  6  ->  (%hesitation) ==> on
     2:  6  ->  the ==> that
     3:  5  ->  but ==> that
     4:  4  ->  a ==> the
     5:  4  ->  four ==> for
     6:  4  ->  in ==> and
     7:  4  ->  there ==> that
     8:  3  ->  (%hesitation) ==> and
     9:  3  ->  (%hesitation) ==> the
    10:  3  ->  (a-) ==> i
    11:  3  ->  and ==> i
    12:  3  ->  and ==> in
    13:  3  ->  are ==> there
    14:  3  ->  as ==> is
    15:  3  ->  have ==> that
    16:  3  ->  is ==> this

Sclite output for error analysis (continued)

    17:  3  ->  it ==> that
    18:  3  ->  mouse ==> most
    19:  3  ->  was ==> is
    20:  3  ->  was ==> this
    21:  3  ->  you ==> we
    22:  2  ->  (%hesitation) ==> it
    23:  2  ->  (%hesitation) ==> that
    24:  2  ->  (%hesitation) ==> to
    25:  2  ->  (%hesitation) ==> yeah
    26:  2  ->  a ==> all
    27:  2  ->  a ==> know
    28:  2  ->  a ==> you
    29:  2  ->  along ==> well
    30:  2  ->  and ==> it
    31:  2  ->  and ==> we
    32:  2  ->  and ==> you
    33:  2  ->  are ==> i
    34:  2  ->  are ==> were

Better metrics than WER?
- WER has been useful
- But should we be more concerned with meaning (“semantic error rate”)?
  - A good idea, but hard to agree on
  - Has been applied in dialogue systems, where the desired semantic output is clearer

Summary: ASR Architecture
Five easy pieces:
- ASR noisy channel architecture
- Feature extraction: 39 “MFCC” features
- Acoustic model: Gaussians for computing p(o|q)
- Lexicon/pronunciation model
  - HMM: what phones can follow each other
- Language model
  - N-grams for computing p(w_i|w_{i-1})
- Decoder
  - Viterbi algorithm: dynamic programming for combining all of these to get a word sequence from speech!

Accents: An experiment
- A word by itself
- The word in context

Summary
- ASR architecture
- The Noisy Channel Model
- Phonetics background
- Five easy pieces of an ASR system:
  - Lexicon/pronunciation model (introduction to HMMs again)
  - Feature extraction
  - Acoustic model
  - Language model
  - Decoder
- Evaluation