Automatic Speech Recognition

Presentation on theme: "Automatic Speech Recognition"— Presentation transcript:

1 Automatic Speech Recognition
Julia Hirschberg CS 6998 11/11/2018

2 What is speech recognition?
Transcribing words? Understanding meaning?

3 It’s hard to recognize speech...
People speak in very different ways: across-speaker variation and within-speaker variation
Speech sounds vary according to the speech context
The environment varies with respect to noise
The transcription task must handle all of this and produce a transcript of the spoken words

4 Success: low WER = (S + I + D) / N × 100
"Thesis test" vs. "This is a test": 1 substitution + 2 deletions over 4 reference words = 75% WER
Progress:
Very large training corpora
Fast machines and cheap storage
Bake-offs
A market for real-time systems
New representations and algorithms: finite-state transducers
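The 75% figure can be reproduced with a short word-level edit-distance computation (a sketch; tokenization here is plain whitespace splitting):

```python
def wer(reference, hypothesis):
    """Word error rate: (S + I + D) / N, via word-level edit distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / len(ref)

print(wer("this is a test", "thesis test"))  # 3 edits / 4 words = 0.75
```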

5 Varieties of Speech Recognition
Mode: isolated words → continuous
Style: read, prepared, spontaneous
Enrollment: speaker-dependent or speaker-independent
Vocabulary size: <20 → 5K → 60K → ~1M
Language model: finite state, n-grams, CFGs, CSGs
Perplexity: <10 → >100
SNR: >30 dB (high) → <10 dB (low)
Input device: telephone, microphones

6 ASR and the Noisy Channel Model
Source → noisy channel → hypothesis
Find the most likely input to have generated the observed "noisy" sentence: the most likely sentence W in the language given acoustic input O
W' = argmax_W P(W|O)
By Bayes' rule:
W' = argmax_W P(O|W) P(W) / P(O)

7 P(O) is the same for every hypothesized W, so W' = argmax_W P(O|W) P(W)
P(W) is the prior; P(O|W) is the (acoustic) likelihood
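A toy illustration of choosing W' = argmax_W P(O|W) P(W); the probability tables are made-up numbers standing in for the acoustic and language models:

```python
# Toy noisy-channel decoder: pick the sentence W maximizing P(O|W) * P(W).
# All probabilities below are invented for illustration.
prior = {                      # language model P(W)
    "this is a test": 0.7,
    "thesis test": 0.3,
}
likelihood = {                 # acoustic model P(O|W) for one fixed observation O
    "this is a test": 0.4,
    "thesis test": 0.5,
}
best = max(prior, key=lambda w: likelihood[w] * prior[w])
print(best)  # "this is a test": 0.4 * 0.7 = 0.28 beats 0.5 * 0.3 = 0.15
```

Note that the acoustically likelier string loses once the prior is factored in; that is the point of the noisy-channel formulation.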

8 Simple Isolated Digit Recognition
Train 10 acoustic templates Mi, one per digit
Compare input x with each
Select the most similar template j according to some comparison function, minimizing the difference: j = argmin_i f(x, Mi)
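A toy version of this template matcher, using Euclidean distance over fixed-length feature vectors as the comparison function f (real systems compare frame sequences with DTW or HMMs); the templates are random stand-ins:

```python
import numpy as np

def recognize(x, templates):
    """Return the index of the template closest to input features x,
    i.e. argmin_i f(x, M_i) with f = Euclidean distance."""
    distances = [np.linalg.norm(x - m) for m in templates]
    return int(np.argmin(distances))

# Ten made-up 3-dimensional "digit templates" for illustration.
rng = np.random.default_rng(0)
templates = [rng.normal(size=3) for _ in range(10)]
x = templates[7] + 0.01 * rng.normal(size=3)  # noisy version of digit 7
print(recognize(x, templates))  # nearest template is 7
```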

9 Scaling Up: Continuous Speech Recognition
Collect training and test corpora:
Speech + word transcription
Speech + phonetic transcription (built by hand or using TTS)
A text corpus
Determine a representation for the signal
Build probabilistic models
Acoustic model: signal to phones

10 Pronunciation model: phones to words
Language model: words to sentences
Select search procedures to decode new input given these trained models

11 Representing the Signal
What parameters (features) of the waveform:
Can be extracted automatically
Will preserve phonetic identity and distinguish one phone from another
Will be independent of speaker variability and channel conditions
Will not take up too much space
…the power spectrum

12 Speech is captured by a microphone and digitized; the signal is divided into frames
A power spectrum is computed to represent the energy in different bands of the signal (LPC spectrum, cepstra, PLP)
Each frame's spectral features are represented by a small set of numbers
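A minimal sketch of the framing-plus-power-spectrum step using NumPy; the 25 ms frame and 10 ms hop are common textbook values, not taken from the slides:

```python
import numpy as np

def power_spectra(signal, sample_rate, frame_ms=25, hop_ms=10):
    """Split a signal into overlapping windowed frames and compute each
    frame's power spectrum |FFT|^2 over the positive frequencies."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop = int(sample_rate * hop_ms / 1000)
    window = np.hamming(frame_len)
    frames = [signal[i:i + frame_len] * window
              for i in range(0, len(signal) - frame_len + 1, hop)]
    return np.array([np.abs(np.fft.rfft(f)) ** 2 for f in frames])

# 1 second of a 440 Hz tone sampled at 16 kHz
t = np.arange(16000) / 16000
spec = power_spectra(np.sin(2 * np.pi * 440 * t), 16000)
peak_bin = spec[0].argmax()
print(peak_bin)  # 440 Hz lands in bin 11 (16000 / 400 = 40 Hz per bin)
```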

13 Why does it work? Different phonemes have different spectral characteristics
Why doesn't it always work? Phonemes can have different properties in different acoustic contexts, spoken by different people, ...

14 Acoustic Models
Model the likelihood of a phone given spectral features and prior context
Usually represented as an HMM:
A set of states representing phones or other subword units
Transition probabilities between states: how likely is one phone to follow another?
Observation/output likelihoods: how likely is a given spectral feature vector to be observed in state i?
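To make these pieces concrete, here is a minimal discrete-observation HMM with invented numbers; the forward algorithm sums over state paths to get the total likelihood of an observation sequence (real acoustic models emit continuous feature vectors, not symbols):

```python
import numpy as np

# A toy 2-state HMM with 2 discrete observation symbols.
# All numbers are invented for illustration.
trans = np.array([[0.7, 0.3],     # P(next state | current state)
                  [0.4, 0.6]])
emit = np.array([[0.9, 0.1],      # P(observation symbol | state)
                 [0.2, 0.8]])
start = np.array([0.5, 0.5])      # initial state distribution

def forward(obs):
    """Total likelihood P(obs) via the forward algorithm."""
    alpha = start * emit[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ trans) * emit[:, o]
    return alpha.sum()

print(forward([0, 0, 1]))
```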

15 Train an initial model on a small hand-labeled corpus to get estimates of the transition and observation probabilities
Tune parameters on a large corpus with only word transcriptions
Iterate until no further improvement

16 Pronunciation Model
Models the likelihood of a word given a network of candidate phone hypotheses (a weighted phone lattice)
Allophones: butter vs. but
The lexicon may be an HMM or a simple dictionary
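A dictionary-style lexicon can be sketched as a mapping from phone sequences to words; the ARPAbet-like symbols and the flap [dx] entry for "butter" (vs. the full [t] of "but") are illustrative simplifications:

```python
# Toy pronunciation lexicon: phone sequences -> words.
# Symbols are simplified ARPAbet-like labels, invented for this sketch.
lexicon = {
    ("b", "ah", "dx", "er"): "butter",   # flapped /t/ allophone
    ("b", "ah", "t"): "but",             # released /t/
}

def phones_to_word(phones):
    """Look up a candidate phone sequence in the dictionary lexicon.

    Returns None when no word matches (a real system would instead
    score paths through a weighted phone lattice)."""
    return lexicon.get(tuple(phones))

print(phones_to_word(["b", "ah", "t"]))          # "but"
print(phones_to_word(["b", "ah", "dx", "er"]))   # "butter"
```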

17 Language Models
Models the likelihood of a word sequence given candidate word hypotheses
Grammars: finite state or CFG
N-grams: corpus-trained; smoothing issues; out-of-vocabulary (OOV) problem
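A corpus-trained bigram model with add-one (Laplace) smoothing, on an invented toy corpus; smoothing gives unseen bigrams a small nonzero probability, though truly OOV words remain a separate problem:

```python
from collections import Counter

# Tiny invented training corpus, already tokenized.
corpus = "i want a train i want a ticket i need a train".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
vocab = set(corpus)

def p_bigram(w1, w2):
    """P(w2 | w1) with add-one smoothing over the known vocabulary."""
    return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + len(vocab))

print(p_bigram("want", "a"))     # seen bigram: relatively high probability
print(p_bigram("want", "need"))  # unseen bigram: small but nonzero
```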

18 Search: find the best hypothesis given
A lattice of subword units (AM)
Segmentations of all paths into possible words (PM)
Probabilities of word sequences (LM)
The search space is huge
Viterbi decoding
Beam search
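A sketch of Viterbi decoding with beam pruning over a toy two-state model; all tables are invented, and a real decoder searches a composed AM/PM/LM network rather than a bare HMM:

```python
import math

def beam_viterbi(obs, states, start, trans, emit, beam=2):
    """Viterbi decoding with beam pruning: keep only the `beam` best
    partial hypotheses at each frame. Log probabilities avoid underflow."""
    hyps = {s: math.log(start[s]) + math.log(emit[s][obs[0]]) for s in states}
    back = {s: [s] for s in states}
    for o in obs[1:]:
        new_hyps, new_back = {}, {}
        for s in states:
            # best surviving predecessor for state s
            prev, score = max(
                ((p, hyps[p] + math.log(trans[p][s])) for p in hyps),
                key=lambda x: x[1])
            new_hyps[s] = score + math.log(emit[s][o])
            new_back[s] = back[prev] + [s]
        # prune to the `beam` highest-scoring states
        kept = sorted(new_hyps, key=new_hyps.get, reverse=True)[:beam]
        hyps = {s: new_hyps[s] for s in kept}
        back = {s: new_back[s] for s in kept}
    best = max(hyps, key=hyps.get)
    return back[best]

states = ["a", "b"]
start = {"a": 0.6, "b": 0.4}
trans = {"a": {"a": 0.7, "b": 0.3}, "b": {"a": 0.4, "b": 0.6}}
emit = {"a": {0: 0.9, 1: 0.1}, "b": {0: 0.2, 1: 0.8}}
print(beam_viterbi([0, 0, 1], states, start, trans, emit))
```

With `beam` set to the full number of states this is exact Viterbi; smaller beams trade accuracy for speed, which is what makes the huge ASR search space tractable.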

19 Challenges for Transcription
Robustness to channel characteristics and noise
Portability to new applications
Adaptation: to speakers, to environments
LMs: simple n-grams need help
Confidence measures
OOV words
New speaking styles/genres
New applications

20 Challenges for Understanding
Recognizing communicative 'problems':
ASR errors
User corrections
Disfluencies and self-repairs

21 An Unsuccessful Dialogue
S: Hi, this is the AT&T Amtrak schedule system. This is TOOT. How may I help you?
U: I'd like to find a train from Ch- to Chicago from Baltimore.
S: Sorry, I can't understand you. Please repeat your utterance.
U: I'd like to find a train from Baltimore to Chicago.
S: I heard you say "leave today around 5:30 in the morning." Which city do you want to go to?
U: I would like to leave Saturday at eight in the morning.

Note that the previous speaker was not a native speaker of English; non-native speakers are often harder for recognizers to understand. Here, however, is a native speaker who has a terrible time with TOOT, which never seems to understand "Baltimore." Note the different ways she produces that city name during the dialogue; you'll hear some clear examples of hyperarticulation.

22 S: You can say the arrival city name, such as "New York City."
U: Baltimore to Chicago.
S: You can say the arrival city name, such as "New York City."
U: Baltimore to Chicago. (I hate it.)
U: Train from Baltimore... S: ...I heard you s... U: to Chicago.... Hello?
S: You can say the name of your departure city, such as "New York City."
U: Baltimore.

Note how she changes her pronunciation of "Baltimore," hoping TOOT will finally hear it clearly enough!

23 U: Bal-ti-more....
S: You can say...
U: Baltimore.
S: Do you want to continue and find a new set of trains?
U: Yes.
S: You can say "yes" or "I am done here" to find a new set of trains.
S: Thank you for using the AT&T Amtrak train timetable system. See you next time.
U: I nev-

24 Disfluencies and Self-Repairs
Disfluencies abound in spontaneous speech: one every 4.6 s in radio call-in speech (Blackmer & Mitton '91)
Hesitation: "Ch- change strategy."
Filled pause: "Um Baltimore."
Self-repair: "Ba- uh Chicago."
Hard to recognize:
"Ch- change strategy." → recognized as "to D C D C today ten fifteen."
"Um Baltimore." → recognized as "From Baltimore ten."
"Ba- uh Chicago." → recognized as "For Boston Chicago."
Kasl & Mahl: 41% more filled pauses in audio-only than in face-to-face conversation; Oviatt: 8.83% vs. 5.50% disfluency rates in phone vs. non-phone conversations

25 Possibilities for Understanding
Recognizing speaker emotion
Identifying speech acts: okay
Locating topic boundaries for topic tracking, audio browsing, speech data mining

26 Next Week


