1 Automatic Speech Recognition: An Overview Julia Hirschberg CS 4706 (special thanks to Roberto Pieraccini)

2 Recreating the Speech Chain
[Diagram: the human speech chain (dialog, semantics, syntax, lexicon, morphology, phonetics, vocal-tract articulators, inner ear, acoustic nerve) mirrored by its machine counterparts: speech synthesis, speech recognition, spoken language understanding, and dialog management]

3 Speech Recognition: the Early Years
1952 – Automatic Digit Recognition (AUDREY)
–Davis, Biddulph, Balashek (Bell Laboratories)

4 1960s – Speech Processing and Digital Computers
–A/D and D/A converters and digital computers start appearing in the labs
[Photo: James Flanagan, Bell Laboratories]

5 The Illusion of Segmentation... or... Why Speech Recognition is so Difficult
[Diagram: the unsegmented phone string "m I n & m b & r i s e v & n th r E n I n z E r o t ü s e v & n f O r" aligned with the words MY NUMBER IS SEVEN THREE NINE ZERO TWO SEVEN FOUR, with syntactic (NP, VP) and semantic (user:Roberto (attribute:telephone-num value: )) annotation layers]

6 The Illusion of Segmentation... or... Why Speech Recognition is so Difficult
[Same diagram as slide 5, annotated with the sources of difficulty:]
–Intra-speaker variability
–Noise/reverberation
–Coarticulation
–Context-dependency
–Word confusability
–Word variations
–Speaker dependency
–Multiple interpretations
–Limited vocabulary
–Ellipses and anaphors

7 1969 – Whither Speech Recognition?
General purpose speech recognition seems far away. Social-purpose speech recognition is severely limited. It would seem appropriate for people to ask themselves why they are working in the field and what they can expect to accomplish… It would be too simple to say that work in speech recognition is carried out simply because one can get money for it. That is a necessary but not a sufficient condition. We are safe in asserting that speech recognition is attractive to money. The attraction is perhaps similar to the attraction of schemes for turning water into gasoline, extracting gold from the sea, curing cancer, or going to the moon. One doesn't attract thoughtlessly given dollars by means of schemes for cutting the cost of soap by 10%. To sell suckers, one uses deceit and offers glamour… Most recognizers behave, not like scientists, but like mad inventors or untrustworthy engineers. The typical recognizer gets it into his head that he can solve "the problem." The basis for this is either individual inspiration (the "mad inventor" source of knowledge) or acceptance of untested rules, schemes, or information (the untrustworthy engineer approach).
– J. R. Pierce, Executive Director, Bell Laboratories, The Journal of the Acoustical Society of America, June 1969

8 1971-1976: The ARPA SUR project
Despite the anti-speech-recognition campaign led by the Pierce Commission, ARPA launches a 5-year Speech Understanding Research program
Goal: 1000-word vocabulary, 90% understanding rate, near real time on a 100 MIPS machine
4 systems built by the end of the program:
–SDC (24%)
–BBN's HWIM (44%)
–CMU's Hearsay II (74%)
–CMU's HARPY (95% -- but 80 times real time!)
HARPY based on an engineering approach: search on a network of all the possible utterances (Raj Reddy -- CMU)
LESSONS LEARNED:
–Hand-built knowledge does not scale up
–Need for a global "optimization" criterion

9 –Lack of scientific evaluation
–Speech understanding: too early for its time
–Project not extended

10 1970s – Dynamic Time Warping: The Brute Force of the Engineering Approach
[Diagram: a stored template (word 7) aligned against an unknown word on a time-warping grid]
–T.K. Vyntsyuk (1969)
–H. Sakoe, S. Chiba (1970)
–From isolated words, speaker dependent, toward connected words, speaker independent, and sub-word units (a DTW sketch follows this slide)
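A minimal sketch of dynamic time warping, assuming each word is represented as a sequence of feature vectors (one per frame); the Euclidean local distance, step pattern, and the template shapes are illustrative assumptions, not the lecture's setup.

```python
import numpy as np

def dtw_distance(template, unknown):
    """Cumulative alignment cost between two feature-vector sequences."""
    n, m = len(template), len(unknown)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = np.linalg.norm(template[i - 1] - unknown[j - 1])
            # Classic step pattern: diagonal match, insertion, deletion.
            cost[i, j] = local + min(cost[i - 1, j - 1],
                                     cost[i - 1, j],
                                     cost[i, j - 1])
    return cost[n, m]

# Isolated-word recognition by template matching: pick the stored template
# whose warped distance to the unknown utterance is smallest.
templates = {"seven": np.random.randn(40, 13), "four": np.random.randn(35, 13)}
unknown = np.random.randn(38, 13)
print(min(templates, key=lambda w: dtw_distance(templates[w], unknown)))
```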

11 The Statistical Approach
–Based on work on Hidden Markov Models done by Leonard Baum at IDA, Princeton in the late 1960s
–Purely statistical approach pursued by Fred Jelinek and Jim Baker, IBM T.J. Watson Research
–Foundations of modern speech recognition engines
[Diagram: a left-to-right HMM with states S1, S2, S3 and transition probabilities a11, a12, a22, a23, a33]
–Acoustic HMMs, word tri-grams
–"No Data Like More Data"
–"Whenever I fire a linguist, our system performance improves" (1988)
–"Some of my best friends are linguists" (2004)
[Photos: Fred Jelinek, Jim Baker]

12 Statistical approach becomes ubiquitous
–Lawrence Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition," Proceedings of the IEEE, Vol. 77, No. 2, February 1989.

13 1980s-1990s – The Power of Evaluation
Pros and cons of DARPA programs:
+ Continuous incremental improvement
– Loss of "bio-diversity"
[Diagram: the spoken dialog industry ecosystem: technology vendors (SpeechWorks, Nuance, MIT, SRI), platform integrators, application developers, hosting, tools, standards]

14 Today's State of the Art
–Low noise conditions
–Large vocabulary: ~20,000-64,000 words (or more…)
–Speaker independent (vs. speaker-dependent)
–Continuous speech (vs. isolated-word)
–World's best research systems:
 Human-human speech: ~13-20% Word Error Rate (WER)
 Human-machine or monologue speech: ~3-5% WER

15 Building an ASR System
Build a statistical model of the speech-to-words process:
–Collect lots of speech and transcribe all the words
–Train the model on the labeled speech
Paradigm:
–Supervised Machine Learning + Search
–The Noisy Channel Model

16 The Noisy Channel Model Search through space of all possible sentences. Pick the one that is most probable given the waveform.

17 The Noisy Channel Model (II)
What is the most likely sentence out of all sentences in the language L given some acoustic input O?
–Treat the acoustic input O as a sequence of individual observations: O = o1, o2, o3, …, ot
–Define a sentence as a sequence of words: W = w1, w2, w3, …, wn

18 Noisy Channel Model (III)
–Probabilistic implication: pick the sentence W with the highest probability (the slide's equations are reconstructed below)
–We can use Bayes' rule to rewrite this
–Since the denominator is the same for each candidate sentence W, we can ignore it for the argmax
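The equations on this slide were images and did not survive in the transcript; the standard noisy-channel formulation they refer to is:

```latex
\hat{W} = \operatorname*{argmax}_{W \in L} P(W \mid O)
        = \operatorname*{argmax}_{W \in L} \frac{P(O \mid W)\, P(W)}{P(O)}
        = \operatorname*{argmax}_{W \in L} P(O \mid W)\, P(W)
```

Here P(O|W) is the acoustic likelihood and P(W) is the language-model prior, the pairing named on the next slide.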

19 Speech Recognition Meets Noisy Channel: Acoustic Likelihoods and LM Priors

20 Components of an ASR System
–Corpora for training and testing of components
–Representation for the input and a method of extracting it
–Pronunciation Model
–Acoustic Model
–Language Model
–Feature extraction component
–Algorithms to search the hypothesis space efficiently

21 Training and Test Corpora
Collect corpora appropriate for the recognition task at hand:
–Small speech corpus + phonetic transcription, to associate sounds with symbols (Acoustic Model)
–Large (>= 60 hrs) speech corpus + orthographic transcription, to associate words with sounds (Acoustic Model)
–Very large text corpus, to identify unigram and bigram probabilities (Language Model)

22 Building the Acoustic Model
Model the likelihood of phones or subphones given spectral features, pronunciation models, and prior context
Usually represented as an HMM (a toy sketch of these pieces follows this slide):
–Set of states representing phones or other subword units
–Transition probabilities on states: how likely is it to see one phone after seeing another?
–Observation/output likelihoods: how likely is a spectral feature vector to be observed from phone state i, given phone state i-1?
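A minimal sketch of how these pieces can be laid out in code, assuming a single phone modeled by three left-to-right subphone states with Gaussian observation densities; the state names, feature dimensionality, and all parameter values are illustrative, not the lecture's actual model.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Toy HMM acoustic model for one phone: begin, middle, end subphone states.
states = ["b", "m", "e"]

# Transition probabilities: each state loops on itself or moves forward.
trans = {
    "b": {"b": 0.6, "m": 0.4},
    "m": {"m": 0.7, "e": 0.3},
    "e": {"e": 0.5, "exit": 0.5},
}

# Observation likelihoods: one Gaussian per state over a 13-dim feature vector
# (e.g., MFCCs). Real systems use Gaussian mixtures or neural networks.
obs_models = {s: multivariate_normal(mean=np.full(13, float(i)), cov=np.eye(13))
              for i, s in enumerate(states)}

def observation_likelihood(state, feature_vector):
    """P(feature vector | state) under the toy Gaussian model."""
    return obs_models[state].pdf(feature_vector)

print(observation_likelihood("m", np.ones(13)))
```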

23 Initial estimates for:
–Transition probabilities between phone states
–Observation probabilities associating phone states with acoustic examples
Re-estimate both probabilities by feeding the HMM the transcribed speech training corpus (forced alignment)
–I.e., we tell the HMM the 'right' answers -- which words to associate with which sequences of sounds
Iteratively retrain the transition and observation probabilities by running the training data through the model and scoring the output until there is no improvement (a simplified sketch of one re-estimation pass follows)
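A simplified, hypothetical sketch of a single re-estimation pass, assuming forced alignment has already produced a state path for each utterance and that the observation model is just a per-state mean; real systems use Baum-Welch (forward-backward) soft counts and Gaussian-mixture or neural observation models.

```python
import numpy as np
from collections import defaultdict

def reestimate(aligned_corpus):
    """aligned_corpus: list of (state_path, frames) pairs, where state_path is the
    state sequence from forced alignment of the 'right' words against the audio,
    and frames is the matching list of feature vectors."""
    trans_counts = defaultdict(lambda: defaultdict(float))
    obs_frames = defaultdict(list)
    for path, frames in aligned_corpus:
        for prev, cur in zip(path, path[1:]):
            trans_counts[prev][cur] += 1.0          # count observed transitions
        for state, frame in zip(path, frames):
            obs_frames[state].append(frame)         # collect frames per state
    # Normalized counts become the new transition probabilities; each state's
    # observation model is refit on the frames aligned to it (here: a mean).
    trans = {s: {t: c / sum(nxt.values()) for t, c in nxt.items()}
             for s, nxt in trans_counts.items()}
    means = {s: np.mean(f, axis=0) for s, f in obs_frames.items()}
    return trans, means

# Toy aligned data: two utterances, three states, 2-dim features.
corpus = [(["b", "b", "m", "e"], list(np.random.randn(4, 2))),
          (["b", "m", "m", "e"], list(np.random.randn(4, 2)))]
print(reestimate(corpus))
```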

24 HMMs for speech

25 Building the Pronunciation Model
Models the likelihood of a word given a network of candidate phone hypotheses
–Multiple pronunciations for each word
–May be a weighted automaton or a simple dictionary
Words come from all corpora; pronunciations come from a pronouncing dictionary or a TTS system (a toy lexicon sketch follows)
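A toy sketch of the "simple dictionary" flavor, with ARPAbet-style pronunciations and made-up weights for the variants; the entries are illustrative, not taken from any particular lexicon.

```python
# Each word maps to a list of (phone sequence, probability) pronunciation
# variants, roughly like a weighted pronunciation lexicon.
pronunciations = {
    "tomato": [(["t", "ah", "m", "ey", "t", "ow"], 0.7),
               (["t", "ah", "m", "aa", "t", "ow"], 0.3)],
    "the":    [(["dh", "ah"], 0.8),
               (["dh", "iy"], 0.2)],
}

def pronunciation_prob(word, phones):
    """P(phone sequence | word) under the toy weighted lexicon."""
    for variant, prob in pronunciations.get(word, []):
        if variant == phones:
            return prob
    return 0.0

print(pronunciation_prob("tomato", ["t", "ah", "m", "ey", "t", "ow"]))  # 0.7
```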

26 ASR Lexicon: Markov Models for Pronunciation

27 Language Model
Models the likelihood of a word given the previous word(s)
N-gram models (a bigram sketch follows this slide):
–Build the LM by calculating bigram or trigram probabilities from a text training corpus
–Smoothing issues
Grammars:
–Finite state grammar, Context Free Grammar (CFG), or semantic grammar
Out of Vocabulary (OOV) problem
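A minimal bigram language model sketch; add-one (Laplace) smoothing is used here as one illustrative answer to the "smoothing issues" bullet, and the two-sentence corpus is a toy assumption.

```python
from collections import defaultdict

corpus = [["<s>", "i", "want", "to", "fly", "to", "boston", "</s>"],
          ["<s>", "i", "want", "to", "go", "to", "baltimore", "</s>"]]

bigram_counts = defaultdict(lambda: defaultdict(int))
unigram_counts = defaultdict(int)
for sentence in corpus:
    for prev, cur in zip(sentence, sentence[1:]):
        bigram_counts[prev][cur] += 1
        unigram_counts[prev] += 1

vocab = {w for s in corpus for w in s}

def bigram_prob(prev, cur):
    """P(cur | prev) with add-one smoothing over the toy vocabulary."""
    return (bigram_counts[prev][cur] + 1) / (unigram_counts[prev] + len(vocab))

def sentence_prob(words):
    """Probability of a word sequence as a product of bigram probabilities."""
    p = 1.0
    for prev, cur in zip(words, words[1:]):
        p *= bigram_prob(prev, cur)
    return p

print(sentence_prob(["<s>", "i", "want", "to", "fly", "to", "baltimore", "</s>"]))
```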

28 Search/Decoding
Find the best hypothesis, i.e. the one maximizing P(O|W) P(W), given:
–Lattice of phone units (AM)
–Lattice of words (segmentation of the phone lattice into all possible words via the Pronunciation Model)
–Probabilities of word sequences (LM)
How to reduce this huge search space?
–Lattice minimization and determinization
–Pruning: beam search (a small sketch follows this slide)
–Calculating most likely paths
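A small, hypothetical beam-search decoding sketch over word sequences, just to illustrate pruning and scoring; the acoustic_score and lm_score callables stand in for the AM/lexicon and LM components above, and real decoders work over phone/word lattices with dynamic-programming recombination.

```python
import math

def beam_search_decode(observations, vocab, acoustic_score, lm_score, beam=5):
    """Return the best word sequence for a sequence of observations.
    acoustic_score(word, obs) ~ log P(obs | word); lm_score(prev, word) ~ log P(word | prev)."""
    hyps = [(0.0, ["<s>"])]  # each hypothesis: (log probability, word sequence)
    for obs in observations:
        extended = []
        for logp, seq in hyps:
            for word in vocab:
                score = logp + acoustic_score(word, obs) + lm_score(seq[-1], word)
                extended.append((score, seq + [word]))
        # Beam pruning: keep only the highest-scoring partial hypotheses.
        hyps = sorted(extended, key=lambda h: h[0], reverse=True)[:beam]
    return max(hyps, key=lambda h: h[0])[1][1:]

# Toy usage with made-up scorers: each "observation" is just a noisy word label.
vocab = ["seven", "three", "nine"]
acoustic = lambda w, o: 0.0 if w == o else math.log(0.1)
lm = lambda prev, w: math.log(1.0 / len(vocab))
print(beam_search_decode(["seven", "three", "nine"], vocab, acoustic, lm))
```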

29 Evaluating Success
Transcription:
–Low WER = (S+I+D)/N * 100
 "Thesis test" vs. "This is a test.": 75% WER
 Or "That was the dentist calling.": 125% WER (a WER sketch follows this slide)
Understanding:
–High concept accuracy: how many domain concepts were correctly recognized?
 "I want to go from Boston to Baltimore on September 29"
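A small sketch of WER computed as (S+I+D)/N with a word-level Levenshtein alignment; it is a generic illustration rather than a particular toolkit's scorer, and it reproduces the two numbers on the slide.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + insertions + deletions) / reference length * 100."""
    ref, hyp = reference.split(), hypothesis.split()
    # Edit-distance dynamic programming table over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return 100.0 * d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("This is a test", "Thesis test"))                   # 75.0
print(word_error_rate("This is a test", "That was the dentist calling"))  # 125.0
```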

30 Domain concepts and their values:
–source city: Boston
–target city: Baltimore
–travel date: September 29
Score the recognized string "Go from Boston to Washington on December 29" vs. "Go to Boston from Baltimore on September 29"
–(1/3 = 33% CA) (a concept-accuracy sketch follows)
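A toy sketch of concept-accuracy scoring, assuming the concept-value pairs have already been extracted by a (hypothetical) understanding component; it reproduces the 1/3 = 33% figure for the first recognized string.

```python
def concept_accuracy(reference, recognized):
    """Percentage of reference concept-value pairs correctly recovered."""
    correct = sum(1 for concept, value in reference.items()
                  if recognized.get(concept) == value)
    return 100.0 * correct / len(reference)

reference  = {"source city": "Boston", "target city": "Baltimore", "travel date": "September 29"}
recognized = {"source city": "Boston", "target city": "Washington", "travel date": "December 29"}
print(concept_accuracy(reference, recognized))  # 33.33...
```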

31 Summary
ASR today:
–Combines many probabilistic phenomena: varying acoustic features of phones, likely pronunciations of words, likely sequences of words
–Relies upon many approximate techniques to 'translate' a signal
ASR future:
–Can we include more language phenomena in the model?

32 Next Class Speech disfluencies: a challenge for ASR AND….have an EXCELLENT vacation!!