Automatic Speech Recognition: An Overview

Presentation transcript:

Automatic Speech Recognition: An Overview Julia Hirschberg CS 4706 (Thanks to Roberto Pieraccini and Francis Ganong for some slides)

Recreating the Speech Chain
[Figure labels: DIALOG, SEMANTICS, SYNTAX, LEXICON, MORPHOLOGY, PHONETICS, VOCAL-TRACT ARTICULATORS, INNER EAR, ACOUSTIC NERVE; SPEECH RECOGNITION, SPOKEN LANGUAGE UNDERSTANDING, DIALOG MANAGEMENT, SYNTHESIS. Image: Metropolis (1927), directed by Fritz Lang]

The Problem of Segmentation... or... Why Speech Recognition is so Difficult
Phone segmentation -> word identification -> syntactic structure -> semantic representation
[Figure: the utterance MY NUMBER IS SEVEN THREE NINE ZERO TWO FOUR shown as a phone string, as words, as a syntactic structure (NP VP), and as a semantic representation (user:Roberto (attribute:telephone-num value:7360474))]

The Illusion of Segmentation... or... Why Speech Recognition is so Difficult
Intra-speaker variability
Noise/reverberation
Coarticulation
Context-dependency
Word confusability
Word variations
Speaker dependency
Multiple interpretations
Limited vocabulary
Ellipses and anaphors
[Figure: the same utterance, MY NUMBER IS SEVEN THREE NINE ZERO TWO FOUR, with its phone string, syntactic structure (NP VP), and semantic representation (user:Roberto (attribute:telephone-num value:7360474))]

Speech Recognition: the Early Years
1952 – Automatic Digit Recognition (AUDREY), Davis, Biddulph, Balashek (Bell Laboratories)
The first serious speech recognizer was developed in 1952 by Davis, Biddulph, and Balashek of Bell Labs. It was an analog circuit built from filters, amplifiers, and integrators. Using a simple frequency splitter, it generated plots of the first two formants and identified a digit by matching them against patterns prestored in an analog memory. An utterance of an unknown digit could be recognized by comparing its pattern on the formant-one/formant-two plane with those of all the digits plotted during the experimental phase. That was the principle on which Audrey was based. Rather than comparing whole patterns point by point, twenty numbers were computed and stored for each digit, characterizing the corresponding average pattern on the formant-one/formant-two plane. The same twenty numbers were then computed for any incoming utterance and compared with the stored sets; the digit whose stored set was most similar to the newly computed one was taken as the digit most probably uttered. With training, it was reported, the machine achieved 97 percent accuracy on the spoken forms of the ten digits. The system was speaker dependent.

1960’s – Speech Processing and Digital Computers
AD/DA converters and digital computers start appearing in the labs
James Flanagan, Bell Laboratories
Flanagan joined Bell Labs in 1957. Some work had already been done in speech processing, but it mostly used analog circuits. Flanagan helped to introduce AD/DA converters and pioneered new research in speech processing using computers, which was much easier, more flexible, and cheaper than building electric circuits. Much larger experiments became possible, although not in real time for quite a while.

1969 – Whither Speech Recognition?
J. R. Pierce, Executive Director, Bell Laboratories
“General purpose speech recognition seems far away. Special-purpose speech recognition is severely limited. It would seem appropriate for people to ask themselves why they are working in the field and what they can expect to accomplish… It would be too simple to say that work in speech recognition is carried out simply because one can get money for it. That is a necessary but not sufficient condition. We are safe in asserting that speech recognition is attractive to money. The attraction is perhaps similar to the attraction of schemes for turning water into gasoline, extracting gold from the sea, curing cancer, or going to the moon. One doesn’t attract thoughtlessly given dollars by means of schemes for cutting the cost of soap by 10%. To sell suckers, one uses deceit and offers glamour… Most recognizers behave, not like scientists, but like mad inventors or untrustworthy engineers. The typical recognizer gets it into his head that he can solve “the problem.” The basis for this is either individual inspiration (the “mad inventor” source of knowledge) or acceptance of untested rules, schemes, or information (the untrustworthy engineer approach).”
The Journal of the Acoustical Society of America, June 1969
Notes: Every technology has its own negative people. Pierce never believed in speech recognition. He was right in that he never saw good performance.

1971-1976: The ARPA SUR project
Despite the anti-speech-recognition campaign led by the Pierce Commission, ARPA launches a 5-year Speech Understanding Research program
Goal: 1000-word vocabulary, 90% understanding rate, near real time on a 100 MIPS machine
4 systems built by the end of the program:
SDC (24%)
BBN’s HWIM (44%)
CMU’s Hearsay II (74%)
CMU’s HARPY (95% -- but 80 times real time!)
Rule-based systems except for Harpy
Engineering approach: search a network of all the possible utterances
LESSON LEARNED: Hand-built knowledge does not scale up; need for a global “optimization” criterion
Notes: The objective of ASR is, properly, speech understanding, not simply correct identification of words in a spoken message. Systems, therefore, should be evaluated in terms of their ability to respond correctly to spoken messages about such pragmatic problems as travel budget management. The setting was a limited domain, somewhat speaker tuned, in a quiet room. The objective goal of ARPA SUR was a recognition system with 90 percent sentence accuracy for continuous-speech sentences, using thousand-word vocabularies, not in real time. Of the four principal ARPA SUR projects, the only one to meet the stated goal was Carnegie Mellon University's Harpy system, which achieved a 5 percent error rate on a 1,011-word vocabulary on continuous speech. One of the ways the CMU team achieved the goal was clever: they made the task easier by restricting word order, that is, by limiting spoken words to certain sequences in the sentence. These were the beginnings of language modeling.
Raj Reddy -- CMU

Lack of clear evaluation criteria ARPA felt systems had failed Project not extended Speech Understanding: too early for its time Need a standard evaluation method so that progress within and across systems could be measured

1970’s – Dynamic Time Warping: The Brute Force of the Engineering Approach
T.K. Vyntsyuk (1968); H. Sakoe, S. Chiba (1970)
Isolated Words, Speaker Dependent, Connected Words, Speaker Independent, Sub-Word Units
[Figure: UNKNOWN WORD aligned against TEMPLATE (WORD 7)]
Compare the input to word templates. What features discriminate between one word and another? Spectrograms are divided into frames (cells), with average intensity computed over each frame. In the figure, the word is "seven". Compute the least-cost path of best matches of feature vectors between the unknown word and the templates, using dynamic programming. Recognizing one word could take seconds to hours, depending on vocabulary size and word length. Isolated word recognition.
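The template-matching idea on this slide can be sketched in a few lines of Python. This is only an illustrative sketch, not the procedure of any system named above: the one-dimensional "feature vectors", the Euclidean frame distance, the three allowed path steps, and the names `dtw_cost` and `recognize` are all assumptions made for the example.

```python
def dtw_cost(unknown, template):
    """Least-cost alignment between two sequences of feature vectors."""
    def frame_dist(a, b):
        # Euclidean distance between two frames (an assumed local cost).
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    n, m = len(unknown), len(template)
    inf = float("inf")
    # cost[i][j] = best cost of aligning unknown[:i] with template[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_dist(unknown[i - 1], template[j - 1])
            # Dynamic programming: extend the cheapest of the three predecessors.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def recognize(unknown, templates):
    """Return the label of the template with the lowest DTW cost."""
    return min(templates, key=lambda word: dtw_cost(unknown, templates[word]))


# Toy one-dimensional "spectral" frames, invented for illustration.
templates = {
    "seven": [[1.0], [3.0], [3.0], [1.0]],
    "two": [[5.0], [5.0], [2.0]],
}
unknown = [[1.0], [1.0], [3.0], [1.0]]  # a time-warped version of "seven"
print(recognize(unknown, templates))
```

Every template has to be compared against the input, and each comparison fills a table whose size is the product of the two lengths, which is one way to see why recognition time grew with vocabulary size and word length.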

1980s -- The Statistical Approach
Based on work on Hidden Markov Models done by Leonard Baum at IDA, Princeton in the late 1960s
Purely statistical approach pursued by Fred Jelinek and Jim Baker, IBM T.J. Watson Research
[Figure: HMM with states S1, S2, S3 and transitions a11, a12, a22, a23, a33; labels: Acoustic HMMs, Word Tri-grams]
Harpy represented all possible sentences as a finite state machine with 15K nodes of acoustic templates. The problem of finding the best path through the network is much like finding the best path when comparing word templates: again, dynamic programming. IBM added probabilities to the mix, estimating statistical models of phonemes automatically from data. Performance went from 35% with hand-estimated models to 75%.
No Data Like More Data
“Whenever I fire a linguist, our system performance improves” (1988)
“Some of my best friends are linguists” (2004)

1980-1990 – Statistical approach becomes ubiquitous
Lawrence Rabiner, A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition, Proceedings of the IEEE, Vol. 77, No. 2, February 1989.
HMMs represent sequential constraints of language. The actual processes of the vocal tract in producing a particular phone are ‘hidden states’, but we can observe their result in the acoustic signal.

1980s-1990s – The Power of Evaluation
Pros and Cons of DARPA programs
+ Continuous incremental improvement
- Loss of “bio-diversity”
[Figure: timeline of the SPOKEN DIALOG INDUSTRY, 1995-2004 (MIT, SRI, SPEECHWORKS, NUANCE), with layers: TECHNOLOGY VENDORS, PLATFORM INTEGRATORS, APPLICATION DEVELOPERS, HOSTING, TOOLS, STANDARDS]

NUANCE Today

Large Vocabulary Continuous Speech Recognition (LVCSR) Today
~20,000-64,000 words
Speaker independent (vs. speaker-dependent)
Continuous speech (vs. isolated-word)
Conversational speech, dialogue (vs. dictation)
Many languages (vs. English, Japanese, Swedish, French)
9/16/2018, Speech and Language Processing, Jurafsky and Martin

Current error rates
Task                       Vocabulary   Error (%)
Digits                     11           0.5
WSJ read speech            5K           3
WSJ read speech            20K
Broadcast news             64,000+      10
Conversational Telephone                20
Ballpark numbers; exact numbers depend very much on the specific corpus

Human vs. Machine Error Rates
Task                 Vocab   ASR    Human SR
Continuous digits    11      0.5    0.009
WSJ 1995 clean       5K      3      0.9
WSJ 1995 w/noise             9      1.1
SWBD 2004            65K     20     4
Conclusions:
Machines about 5 times worse than humans
Gap increases with noisy speech
These numbers are rough…

How to Build an ASR System Build a statistical model of the speech-to-text process Collect lots of speech and transcribe all the words Train the acoustic model on the labeled speech Train a language model on the transcription + lots more text Paradigms: Supervised Machine Learning + Search The Noisy Channel Model

ASR Paradigm
Given an acoustic observation (1): what is the most likely sequence of words (2) to explain the input (3)?
Using an Acoustic Model and a Language Model
Two problems:
How to score hypotheses (Modeling)
How to pick hypotheses to score (Search)
Notes:
(1) A sentence, or utterance, or “acoustic input up to now”
(2) Or the meaning, or the sequence of words with lowest expected error rate, or the best sequence of words for a Google search
(3) Does not include prosody, nor metalinguistic descriptions (who is the speaker; is the speaker happy, or lying, or speaking with an accent)

The Noisy Channel Model Search through space of all possible sentences. Pick the one that is most probable given the waveform Decoding Decoding is like translating from input sounds to output text

The Noisy Channel Model: Assumptions What is the most likely sentence out of all sentences in the language L, given some acoustic input O? Treat acoustic input O as sequence of individual acoustic observations or frames O = o1,o2,o3,…,ot Define a sentence W as a sequence of words: W = w1,w2,w3,…,wn

Noisy Channel Model
Probabilistic implication: pick the most probable sequence:
$$\hat{W} = \arg\max_{W \in L} P(W \mid O)$$
We can use Bayes' rule to rewrite this:
$$\hat{W} = \arg\max_{W \in L} \frac{P(O \mid W)\, P(W)}{P(O)}$$
Since the denominator is the same for each candidate sentence W, we can ignore it for the argmax:
$$\hat{W} = \arg\max_{W \in L} P(O \mid W)\, P(W)$$
P(W), the prior probability, comes from the LM: how likely is this word sequence to occur, perhaps in some context?
P(O|W), the observation likelihood, comes from the AM: like comparing acoustic templates that are known to correspond to each word, but at a much finer-grained level.
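As a rough illustration of the argmax over P(O|W) P(W), the sketch below scores a handful of candidate sentences and keeps the best one. The candidate sentences and every probability in it are invented for the example, and the name `best_hypothesis` is hypothetical; a real recognizer would get these scores from its acoustic and language models rather than from a hand-written list.

```python
import math


def best_hypothesis(candidates):
    """Pick the sentence W maximizing P(O|W) * P(W).

    candidates: list of (sentence, acoustic_likelihood, lm_prior) tuples.
    Scores are added in the log domain to avoid underflow on long utterances.
    """
    def score(candidate):
        _, p_o_given_w, p_w = candidate
        return math.log(p_o_given_w) + math.log(p_w)

    return max(candidates, key=score)[0]


# Invented numbers: the second candidate fits the acoustics slightly better,
# but the language model considers it far less likely.
candidates = [
    ("recognize speech", 1e-5, 1e-3),
    ("wreck a nice beach", 2e-5, 1e-6),
]
print(best_hypothesis(candidates))
```

P(O) never appears in the code, for the reason given on the slide: it is the same for every candidate and cannot change which one wins.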

Speech Recognition Meets Noisy Channel: Acoustic Likelihoods and LM Priors

Components of an ASR System Corpora for training and testing of components Representation for input and method of extracting features Pronunciation Model (a lexicon) Acoustic Model (acoustic characteristics of subword units) Language Model (word sequence probabilities) Algorithms to search hypothesis space efficiently

Training and Test Corpora Collect corpora appropriate for recognition task at hand Small speech + phonetic transcription to associate sounds with symbols (Acoustic Model) Large (>= 60 hrs) speech + orthographic transcription to associate words with sounds (Acoustic Model+) Very large text corpus to identify ngram probabilities or build a grammar (Language Model)

Building the Acoustic Model Goal: Model likelihood of sounds given spectral features, pronunciation models, and prior context Usually represented as Hidden Markov Model States represent phones or other subword units for each word in the lexicon Transition probabilities (a) on states: how likely to transition from one unit to itself? To the next? Observation likelihoods (b): how likely is spectral feature vector (the acoustic information) to be observed at state i?

Training a Word HMM

Initial estimates of acoustic models from phonetically transcribed corpus or flat start Transition probabilities between phone states Observation probabilities associating phone states with acoustic features of windows of waveform Embedded training: Re-estimate probabilities using initial phone HMMs + orthographically transcribed corpus + pronunciation lexicon to create whole sentence HMMs for each sentence in training corpus Iteratively retrain transition and observation probabilities by running the training data through the model until convergence

Training the Acoustic Model Iteratively sum over all possible segmentations of words and phones – given the transcript -- re-estimating HMM parameters accordingly until convergence
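Summing over all possible segmentations does not require enumerating them: the forward algorithm accumulates the total probability one frame at a time. The sketch below shows only that forward pass, not the full re-estimation loop, and it makes several simplifying assumptions: a toy two-state left-to-right HMM, discrete observation symbols instead of spectral feature vectors, and made-up probabilities. The name `forward_likelihood` is hypothetical.

```python
def forward_likelihood(obs, init, trans, emit):
    """P(obs | HMM), summed over every state sequence (forward algorithm)."""
    states = list(init)
    # alpha[s] = probability of the observations so far, ending in state s
    alpha = {s: init[s] * emit[s][obs[0]] for s in states}
    for o in obs[1:]:
        alpha = {
            s: sum(alpha[r] * trans[r].get(s, 0.0) for r in states) * emit[s][o]
            for s in states
        }
    return sum(alpha.values())


# Toy left-to-right HMM: each state either loops on itself or moves on.
init = {"s1": 1.0, "s2": 0.0}
trans = {"s1": {"s1": 0.5, "s2": 0.5}, "s2": {"s2": 1.0}}
emit = {"s1": {"a": 0.9, "b": 0.1}, "s2": {"a": 0.2, "b": 0.8}}

print(forward_likelihood(["a", "b"], init, trans, emit))
```

In embedded training, quantities of this kind (together with a matching backward pass) are what allow the transition and observation probabilities to be re-estimated without a hand-made segmentation.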

Building the Pronunciation Model Models likelihood of word given network of candidate phone hypotheses Multiple pronunciations for each word May be weighted automaton or simple dictionary Words come from all corpora (including text) Pronunciations come from pronouncing dictionary or TTS system

ASR Lexicon: Markov Models for Pronunciation

Building the Language Model Models likelihood of word given previous word(s) Ngram models: Build the LM by calculating bigram or trigram probabilities from text training corpus: how likely is one word to follow another? To follow the two previous words? Smoothing issues: sparse data Grammars Finite state grammar or Context Free Grammar (CFG) or semantic grammar Out of Vocabulary (OOV) problem
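A bigram model of the kind described here can be estimated by counting. The sketch below is a minimal illustration under stated assumptions: a two-sentence toy corpus, add-one (Laplace) smoothing as one simple answer to the sparse-data problem, and hypothetical helper names `train_bigram` and `bigram_prob`. It is not meant to reflect what any production toolkit does.

```python
from collections import Counter


def train_bigram(sentences):
    """Count histories and bigrams, with <s> and </s> marking sentence edges."""
    history_counts, bigram_counts = Counter(), Counter()
    vocab = {"</s>"}
    for sentence in sentences:
        words = sentence.lower().split()
        vocab.update(words)
        padded = ["<s>"] + words + ["</s>"]
        history_counts.update(padded[:-1])          # every word that has a successor
        bigram_counts.update(zip(padded, padded[1:]))
    return history_counts, bigram_counts, vocab


def bigram_prob(prev, word, history_counts, bigram_counts, vocab):
    """P(word | prev) with add-one smoothing over the known vocabulary."""
    return (bigram_counts[(prev, word)] + 1) / (history_counts[prev] + len(vocab))


corpus = ["my number is seven", "my number is nine"]
model = train_bigram(corpus)
print(bigram_prob("my", "number", *model))   # seen twice in the corpus
print(bigram_prob("is", "my", *model))       # never seen, but still non-zero
```

Smoothing gives unseen bigrams a small probability, but a word that is missing from the vocabulary altogether is still not handled, which is the out-of-vocabulary problem mentioned on the slide.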

Search/Decoding
Find the best hypothesis Ŵ = argmax P(O|W) P(W), given:
A sequence of acoustic feature vectors (O)
A trained HMM (Acoustic Model)
Lexicon (Pronunciation Model)
Probabilities of word sequences (Language Model)
For O:
Calculate the most likely state sequences in the HMM given transition and observation probabilities -> lattice with AM probabilities
Trace backward through the lattice to assign words to states using AM and LM probabilities (decoding)
N-best vs. 1-best vs. lattice output
Limiting search:
Lattice minimization and determinization
Pruning: beam search
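The "most likely state sequence" step is usually computed with the Viterbi algorithm. The following is a minimal sketch on a toy HMM with discrete observations and invented probabilities; it keeps every state at every frame, so it has no beam pruning, no lexicon, and no language model, all of which a real decoder would add. The function name `viterbi` and the toy model are assumptions for illustration.

```python
def viterbi(obs, init, trans, emit):
    """Return (probability, state path) of the single best path through the HMM."""
    states = list(init)
    # best[s] = (probability of the best path ending in s, that path)
    best = {s: (init[s] * emit[s][obs[0]], [s]) for s in states}
    for o in obs[1:]:
        new_best = {}
        for s in states:
            # Choose the best predecessor r for state s.
            prob, path = max(
                ((best[r][0] * trans[r].get(s, 0.0), best[r][1]) for r in states),
                key=lambda item: item[0],
            )
            new_best[s] = (prob * emit[s][o], path + [s])
        best = new_best
    return max(best.values(), key=lambda item: item[0])


# Toy left-to-right HMM with made-up probabilities.
init = {"s1": 1.0, "s2": 0.0}
trans = {"s1": {"s1": 0.5, "s2": 0.5}, "s2": {"s2": 1.0}}
emit = {"s1": {"a": 0.9, "b": 0.1}, "s2": {"a": 0.2, "b": 0.8}}

prob, path = viterbi(["a", "a", "b"], init, trans, emit)
print(path, prob)
```

Beam search would modify the inner loop by discarding, at each frame, every state whose probability falls too far below the current best, trading a small risk of search errors for a large saving in time.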

Evaluating Success
Transcription goal: low Word Error Rate
WER = (Subst + Ins + Del) / N * 100, where N is the number of words in the reference
Reference: This is a test
ASR: Thesis test. (1 subst + 2 del) / 4 words * 100 = 75% WER
ASR: That was the dentist calling. (4 subst + 1 ins) / 4 words * 100 = 125% WER
Understanding goal: high Concept Accuracy
How many domain concepts were correctly recognized?
Example: I want to go from Boston to Washington on September 29
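Word Error Rate is normally computed from a minimum edit distance alignment between the reference and the recognizer output. The sketch below is one straightforward way to do that; the function name `word_error_rate` is hypothetical, and it assumes whitespace tokenization, case folding, and input with punctuation already removed.

```python
def word_error_rate(reference, hypothesis):
    """WER in percent: (substitutions + insertions + deletions) / reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # dist[i][j] = minimum number of edits turning ref[:i] into hyp[:j]
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i          # delete every reference word
    for j in range(len(hyp) + 1):
        dist[0][j] = j          # insert every hypothesis word
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = dist[i - 1][j] + 1
            insertion = dist[i][j - 1] + 1
            dist[i][j] = min(substitution, deletion, insertion)
    return 100.0 * dist[len(ref)][len(hyp)] / len(ref)


print(word_error_rate("this is a test", "thesis test"))                    # 75.0
print(word_error_rate("this is a test", "that was the dentist calling"))   # 125.0
```

Because insertions are counted against the length of the reference, WER can exceed 100%, as the second example shows.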

Domain concept   Value
source city      Boston
target city      Washington
travel month     September
travel day       29
Ref: Go from Boston to Washington on September 29
ASR: Go to Boston from Washington on December 29
3 concept errors / 4 concepts * 100 = 75% Concept Error Rate, or 25% Concept Accuracy
3 subst / 8 words * 100 = 37.5% WER, or 62.5% Word Accuracy
Which is better?
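Concept accuracy can be checked slot by slot once both the reference and the ASR output have been mapped to domain concepts. In the sketch below that mapping is written out by hand as two dictionaries, since the slides do not say how the concepts are extracted; the slot names and the function `concept_accuracy` are assumptions for illustration.

```python
def concept_accuracy(reference, recognized):
    """Percentage of reference slots whose value was recognized correctly."""
    correct = sum(
        1 for slot, value in reference.items() if recognized.get(slot) == value
    )
    return 100.0 * correct / len(reference)


# Ref: Go from Boston to Washington on September 29
reference = {
    "source_city": "Boston",
    "target_city": "Washington",
    "travel_month": "September",
    "travel_day": "29",
}
# ASR: Go to Boston from Washington on December 29
recognized = {
    "source_city": "Washington",
    "target_city": "Boston",
    "travel_month": "December",
    "travel_day": "29",
}

accuracy = concept_accuracy(reference, recognized)
print(accuracy, 100.0 - accuracy)   # concept accuracy, concept error rate
```

Only the travel day survives, so three word substitutions out of eight words wipe out three of the four concepts. That contrast is the reason for asking which measure is the better one.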

ASR Today Combines many probabilistic phenomena: varying acoustic features of phones, likely pronunciations of words, likely sequences of words Relies upon many approximate techniques to ‘translate’ a signal Finite State Transducers now a standard representation for all components

ASR Future
3 -> 5
Triphones -> Quinphones
Trigrams -> Pentagrams
Bigger acoustic models
More parameters
More mixtures
Bigger lexicons
65k -> 256k

Larger language models More data, more parameters Better back-offs More kinds of adaptation Feature space adaptation Discriminative training instead of MLE Rover: combinations of recognizers FSM flattening of knowledge into uniform structure

But What is Human about State-of-the-Art ASR? Input Wave Acoustic Features Acoustic Models Front End Search Lexicon Language Models

Front End: MFCC
Input Wave
Sampling, Windowing
Fast Fourier Transform
Cosine transform: first 8-12 coefficients
Mel Filter Bank: Mel(ody) scale of pitch perception, linear up to ~500 Hz and logarithmic above
Stacking, computation of deltas
Normalizations: filtering, etc.
Linear transformations: dimensionality reduction
Acoustic Features
[Figure: Input Wave, Front End, Acoustic Features, Acoustic Models, Search, Lexicon, Language Models, Postprocessing]
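To make the mel filter bank step a little more concrete, the sketch below converts between Hz and mel and places filter centers evenly on the mel axis. The slides do not give a formula, so the conversion used here, mel = 2595 * log10(1 + f / 700), is an assumption: it is one commonly used approximation, and its knee sits somewhat higher than the ~500 Hz quoted on the slide. The function names and the filter count are likewise illustrative.

```python
import math


def hz_to_mel(f_hz):
    # One common approximation of the mel scale (an assumption, see lead-in).
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    # Inverse of hz_to_mel.
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filter_centers(n_filters, low_hz, high_hz):
    """Center frequencies (Hz) of filters spaced evenly on the mel scale."""
    low_mel, high_mel = hz_to_mel(low_hz), hz_to_mel(high_hz)
    step = (high_mel - low_mel) / (n_filters + 1)
    return [mel_to_hz(low_mel + step * (i + 1)) for i in range(n_filters)]


centers = mel_filter_centers(10, 0.0, 8000.0)
print([round(c) for c in centers])
```

The centers are close together at low frequencies and spread out at high frequencies, which is how the filter bank mimics the roughly linear-then-logarithmic resolution of pitch perception described on the slide.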

Basic Lexicon
A list of spellings and pronunciations
Canonical pronunciations
And a few others
Limited to 64k entries
Support simple stems and suffixes
Linguistically naïve
No phonological rewrites
Doesn’t support all languages
[Figure: Input Wave, Front End, Acoustic Features, Acoustic Models, Search, Lexicon, Language Models]

Language Model Frequency sensitive, like humans Context sensitive, like humans

Next Class Building an ASR system: the Pocket Sphinx Toolkit