BravoBrava Mississippi State University
Evaluation Metrics: Evolution
Spontaneous telephone speech is still a “grand challenge”.
Telephone-quality speech is still central to the problem.
The vision for speech technology continues to evolve.
Broadcast news is a very dynamic domain.
[Figure: word error rate (0% to 40%) versus level of difficulty for digits, continuous digits, command and control, letters and numbers, broadcast news, read speech, and conversational speech.]
Evaluation Metrics: Human Performance
Human performance exceeds machine performance by a factor ranging from 4x to 10x, depending on the task.
On some tasks, such as credit card number recognition, machine performance exceeds human performance because human memory retrieval capacity is limited.
The nature of the noise is as important as the SNR (e.g., cellular phones).
A primary failure mode for humans is inattention.
A second major failure mode is the lack of familiarity with the domain (i.e., business terms and corporation names).
[Figure: word error rate (0% to 20%) versus speech-to-noise ratio (10 dB, 16 dB, 22 dB, quiet) on Wall Street Journal with additive noise, comparing machines with a committee of human listeners.]
Evaluation Metrics: Machine Performance
Common evaluations fuel technology development.
Tasks become progressively more ambitious and challenging.
A word error rate (WER) below 10% is considered acceptable.
Performance in the field is typically 2x to 4x worse than performance on an evaluation.
[Figure: WER (axis ticks at 1%, 10%, and 100%) by year, 1988 to 2003, for read speech (1k, 5k, 20k, noisy, varied microphones), spontaneous speech, conversational speech, and broadcast speech (foreign), with a 10x annotation.]
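The slides use WER throughout without defining it. A minimal sketch follows, assuming the common definition: the word-level edit distance (substitutions, deletions, and insertions) divided by the number of reference words. The function name and the whitespace tokenization are illustrative assumptions, not taken from the presentation.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        curr = [i] + [0] * len(hyp)
        for j in range(1, len(hyp) + 1):
            substitution = prev[j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One deleted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because insertions count as errors, this measure can exceed 100% when the hypothesis is much longer than the reference.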
Evaluation Metrics: Beyond WER: Named Entity
Information extraction is the analysis of natural language to collect information about specified types of entities.
As the focus shifts to providing enhanced annotations, WER may not be the most appropriate measure of performance (content-based scoring).
Recall = (# slots correctly filled) / (# slots filled in key)
Precision = (# slots correctly filled) / (# slots filled by system)
F-Measure = (2 x recall x precision) / (recall + precision)
An example of named entity annotation: Mr. Sears bought a new suit at Sears in Washington yesterday
[Figure: F-measure (70% to 100%) versus word error rate (0% to 30%) on Hub-4 Eval’98.]
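The three scoring formulas on this slide can be sketched directly in code. The function name, argument names, and example counts below are hypothetical; only the formulas come from the slide.

```python
def named_entity_scores(correct, in_key, in_system):
    """Return (recall, precision, F-measure) from slot counts.

    correct:   number of slots correctly filled
    in_key:    number of slots filled in the answer key
    in_system: number of slots filled by the system
    """
    recall = correct / in_key
    precision = correct / in_system
    if recall + precision == 0:
        return recall, precision, 0.0  # avoid dividing by zero when nothing is correct
    f_measure = 2 * recall * precision / (recall + precision)
    return recall, precision, f_measure


if __name__ == "__main__":
    # Hypothetical counts: 8 correct slots, 10 in the key, 16 proposed by the system.
    print(named_entity_scores(correct=8, in_key=10, in_system=16))
```

The F-measure is the harmonic mean of recall and precision, so it stays low unless both are high.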
Recognition Architectures: Why Is Speech Recognition So Difficult?
Our measurements of the signal are ambiguous.
The region of overlap represents classification errors.
Reduce overlap by introducing acoustic and linguistic context (e.g., context-dependent phones).
[Figure: Feature No. 1 versus Feature No. 2 for phones Ph_1, Ph_2, and Ph_3; comparison of “aa” in “lOck” vs. “iy” in “bEAt” for conversational speech (SWB).]
Recognition Architectures: A Communication Theoretic Approach
[Figure: message source, linguistic channel, articulatory channel, acoustic channel; observables: message, words, sounds, features.]
Bayesian formulation for speech recognition: P(W|A) = P(A|W) P(W) / P(A)
Objective: minimize the word error rate
Approach: maximize P(W|A) during training
Components:
  P(A|W): acoustic model (hidden Markov models, mixtures)
  P(W): language model (statistical, finite state networks, etc.)
The language model typically predicts a small set of next words based on knowledge of a finite number of previous words (N-grams).
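At recognition time the Bayesian formulation reduces to picking the word sequence W with the largest P(A|W) P(W), because P(A) is the same for every candidate. A minimal sketch of that decision rule follows, working in log probabilities. The candidate sentences and their scores are invented for illustration and do not come from the presentation.

```python
import math


def decode(candidates):
    """Pick the word sequence that maximizes log P(A|W) + log P(W).

    candidates maps a word sequence to a pair
    (acoustic log-likelihood, language-model log-probability).
    P(A) is omitted because it does not depend on W.
    """
    return max(candidates, key=lambda w: candidates[w][0] + candidates[w][1])


if __name__ == "__main__":
    # Hypothetical scores: the second candidate fits the audio slightly
    # better, but the language model considers it far less likely.
    candidates = {
        "recognize speech": (-12.0, math.log(1e-3)),
        "wreck a nice beach": (-11.5, math.log(1e-5)),
    }
    print(decode(candidates))  # prints "recognize speech"
```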
Recognition Architectures: Incorporating Multiple Knowledge Sources
Acoustic Front-end: The signal is converted to a sequence of feature vectors based on spectral and temporal measurements.
Acoustic Models P(A|W): Acoustic models represent sub-word units, such as phonemes, as a finite-state machine in which states model spectral structure and transitions model temporal structure.
Language Model P(W): The language model predicts the next set of words, and controls which models are hypothesized.
Search: Search is crucial to the system, since many combinations of words must be investigated to find the most probable word sequence.
[Figure: block diagram with input speech, acoustic front-end, acoustic models, language model, search, and recognized utterance.]
Acoustic Modeling: Feature Extraction
Incorporate knowledge of the nature of speech sounds in measurement of the features.
Utilize rudimentary models of human perception.
Measure features 100 times per second.
Use a 25 msec window for frequency domain analysis.
Include absolute energy and 12 spectral measurements.
Use time derivatives to model spectral change.
[Figure: input speech passes through a Fourier transform, cepstral analysis, and perceptual weighting to give energy + mel-spaced cepstrum; a time derivative gives delta energy + delta cepstrum; a second time derivative gives delta-delta energy + delta-delta cepstrum.]
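The framing and derivative steps listed above can be sketched without any signal-processing library. The sketch assumes an 8 kHz sampling rate (typical of telephone speech, but not stated on the slide) and approximates the time derivative with a simple two-point difference. Real front-ends often use a longer regression window. All function names are illustrative.

```python
def frame_starts(num_samples, sample_rate=8000, window_ms=25, step_ms=10):
    """Start indices of analysis windows: 25 msec long, 100 per second."""
    window = sample_rate * window_ms // 1000
    step = sample_rate * step_ms // 1000
    if num_samples < window:
        return []
    return list(range(0, num_samples - window + 1, step))


def deltas(frames):
    """Two-point time derivative of each feature, clamped at the edges."""
    last = len(frames) - 1
    result = []
    for t in range(len(frames)):
        following = frames[min(t + 1, last)]
        preceding = frames[max(t - 1, 0)]
        result.append([(a - b) / 2.0 for a, b in zip(following, preceding)])
    return result


def add_derivatives(static):
    """Append delta and delta-delta features to each static feature vector."""
    first = deltas(static)
    second = deltas(first)
    return [s + d1 + d2 for s, d1, d2 in zip(static, first, second)]


if __name__ == "__main__":
    print(len(frame_starts(8000)))  # number of windows in one second of audio
    # 13 static features (energy + 12 spectral measurements) per frame.
    static = [[float(t)] * 13 for t in range(5)]
    print(len(add_derivatives(static)[0]))  # 39 features per frame
```

With 13 static measurements per frame, appending first and second derivatives gives 39 features per frame.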
Acoustic Modeling: Hidden Markov Models
Acoustic models encode the temporal evolution of the features (spectrum).
Gaussian mixture distributions are used to account for variations in speaker, accent, and pronunciation.
Phonetic model topologies are simple left-to-right structures.
Skip states (time-warping) and multiple paths (alternate pronunciations) are also common features of models.
Sharing model parameters is a common strategy to reduce complexity.
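A left-to-right topology with skip transitions can be written as a transition matrix in which each state may stay, advance one state, or skip one state, and never moves backward. The sketch below is illustrative: the probabilities are made-up defaults, and it omits the explicit entry and exit states that practical systems usually add.

```python
def left_to_right_transitions(num_states, p_stay=0.6, p_next=0.3, p_skip=0.1):
    """Build a row-stochastic transition matrix for a left-to-right HMM."""
    rows = []
    for i in range(num_states):
        row = [0.0] * num_states
        row[i] = p_stay                # self-loop models duration
        if i + 1 < num_states:
            row[i + 1] = p_next        # advance to the next state
        if i + 2 < num_states:
            row[i + 2] = p_skip        # skip a state (time-warping)
        total = sum(row)
        rows.append([p / total for p in row])  # renormalize near the end
    return rows


if __name__ == "__main__":
    for row in left_to_right_transitions(3):
        print(row)
```

A typical phone model would use three emitting states, as in the usage example.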
Acoustic Modeling: Parameter Estimation
Closed-loop data-driven modeling supervised only from a word-level transcription.
The expectation/maximization (EM) algorithm is used to improve our parameter estimates.
Computationally efficient training algorithms (Forward-Backward) have been crucial.
Batch mode parameter updates are typically preferred.
Decision trees are used to optimize parameter-sharing, system complexity, and the use of additional linguistic knowledge.
[Figure: training sequence: initialization, single Gaussian estimation, 2-way split, mixture distribution reestimation, 4-way split, reestimation.]
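The forward half of the Forward-Backward algorithm computes the likelihood of an observation sequence by summing over all state paths in time proportional to the sequence length. The sketch below is a simplification: it uses discrete emission tables instead of the Gaussian mixtures described in the slides, it shows only the forward pass rather than a full EM update, and the toy model parameters are invented.

```python
def forward_likelihood(initial, transitions, emissions, observations):
    """Return P(observations | model) using the forward recursion.

    initial[s]        : probability of starting in state s
    transitions[r][s] : probability of moving from state r to state s
    emissions[s][o]   : probability that state s emits symbol o
    """
    n = len(initial)
    # alpha[s] = probability of the observations so far, ending in state s.
    alpha = [initial[s] * emissions[s][observations[0]] for s in range(n)]
    for obs in observations[1:]:
        alpha = [
            sum(alpha[r] * transitions[r][s] for r in range(n)) * emissions[s][obs]
            for s in range(n)
        ]
    return sum(alpha)


if __name__ == "__main__":
    # Toy two-state left-to-right model with made-up parameters.
    initial = [1.0, 0.0]
    transitions = [[0.5, 0.5], [0.0, 1.0]]
    emissions = [{"a": 0.9, "b": 0.1}, {"a": 0.2, "b": 0.8}]
    print(forward_likelihood(initial, transitions, emissions, ["a", "b"]))
```

For long utterances a practical implementation would scale alpha or work in log probabilities to avoid underflow.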
Language Modeling Is A Lot Like Wheel of Fortune
Language Modeling: N-Grams: The Good, The Bad, and The Ugly
Unigrams (SWB):
  Most Common: “I”, “and”, “the”, “you”, “a”
  Rank-100: “she”, “an”, “going”
  Least Common: “Abraham”, “Alastair”, “Acura”
Bigrams (SWB):
  Most Common: “you know”, “yeah SENT!”, “!SENT um-hum”, “I think”
  Rank-100: “do it”, “that we”, “don’t think”
  Least Common: “raw fish”, “moisture content”, “Reagan Bush”
Trigrams (SWB):
  Most Common: “!SENT um-hum SENT!”, “a lot of”, “I don’t know”
  Rank-100: “it was a”, “you know that”
  Least Common: “you have parents”, “you seen Brooklyn”
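Counts like the ones above come from tallying adjacent word pairs in transcripts, with sentence boundary markers treated as words. The sketch below assumes that “!SENT” marks a sentence start and “SENT!” a sentence end, as the examples suggest. The linear interpolation with unigram probabilities is one simple smoothing choice, the weight of 0.7 is arbitrary, and the three training sentences are invented.

```python
from collections import Counter


def count_ngrams(sentences):
    """Count unigrams and bigrams, adding sentence boundary markers."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        words = ["!SENT"] + sentence.split() + ["SENT!"]
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams


def bigram_prob(prev, word, unigrams, bigrams, weight=0.7):
    """Interpolate the bigram estimate with the unigram estimate."""
    total = sum(unigrams.values())
    p_unigram = unigrams[word] / total
    p_bigram = bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0
    return weight * p_bigram + (1 - weight) * p_unigram


if __name__ == "__main__":
    unigrams, bigrams = count_ngrams(["you know", "you know I think", "um-hum"])
    print(bigrams.most_common(3))
    # An unseen pair still receives some probability from the unigram term.
    print(bigram_prob("know", "think", unigrams, bigrams))
```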
Language Modeling: Integration of Natural Language
Natural language constraints can be easily incorporated.
Lack of punctuation and search space size pose problems.
Speech recognition typically produces a word-level time-aligned annotation.
Time alignments for other levels of information are also available.
Implementation Issues: Search Is Resource Intensive
Typical LVCSR systems have about 10M free parameters, which makes training a challenge.
Large speech databases are required (several hundred hours of speech).
Tying, smoothing, and interpolation are required.
Implementation Issues: Dynamic Programming-Based Search
Dynamic programming is used to find the most probable path through the network.
Beam search is used to control resources.
Search is time synchronous and left-to-right.
Arbitrary amounts of silence must be permitted between each word.
Words are hypothesized many times with different start/stop times, which significantly increases search complexity.
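A time-synchronous dynamic programming search with beam pruning can be sketched on a small state network. The sketch below is a state-level Viterbi search rather than a full word-level decoder: it keeps the best-scoring path into each state at every frame and drops hypotheses whose log score falls more than `beam` below the current best. The model parameters and beam widths are invented for illustration.

```python
import math


def _log(p):
    return math.log(p) if p > 0 else float("-inf")


def viterbi_beam(initial, transitions, emissions, observations, beam=10.0):
    """Return (best state path, its log probability) using beam pruning."""
    # hyps maps a state to (log score, best path ending in that state).
    hyps = {}
    for s, p in enumerate(initial):
        score = _log(p) + _log(emissions[s][observations[0]])
        if score > float("-inf"):
            hyps[s] = (score, [s])
    for obs in observations[1:]:
        if not hyps:
            break
        best = max(score for score, _ in hyps.values())
        # Beam pruning: extend only hypotheses close to the current best.
        active = {s: h for s, h in hyps.items() if h[0] >= best - beam}
        extended = {}
        for r, (score, path) in active.items():
            for s, p in enumerate(transitions[r]):
                cand = score + _log(p) + _log(emissions[s][obs])
                if cand == float("-inf"):
                    continue
                if s not in extended or cand > extended[s][0]:
                    extended[s] = (cand, path + [s])
        hyps = extended
    if not hyps:
        raise ValueError("no hypothesis survived the search")
    score, path = max(hyps.values(), key=lambda h: h[0])
    return path, score


if __name__ == "__main__":
    initial = [1.0, 0.0]
    transitions = [[0.5, 0.5], [0.0, 1.0]]
    emissions = [{"a": 0.9, "b": 0.1}, {"a": 0.2, "b": 0.8}]
    path, score = viterbi_beam(initial, transitions, emissions, ["a", "a", "b", "b"])
    print(path)  # prints [0, 0, 1, 1]
```

A narrower beam saves computation but risks pruning the path that would have won later, which is the usual speed versus accuracy trade-off in beam search.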
Implementation Issues: Cross-Word Decoding Is Expensive
Cross-word Decoding: Since word boundaries don’t occur in spontaneous speech, we must allow for sequences of sounds that span word boundaries.
Cross-word decoding significantly increases memory requirements.
Implementation Issues: Decoding Example
Implementation Issues: Internet-Based Speech Recognition
Technology: Conversational Speech
Conversational speech collected over the telephone contains background noise, music, fluctuations in the speech rate, laughter, partial words, hesitations, mouth noises, etc.
WER has decreased from 100% to 30% in six years.
[Figure: examples labeled laughter, singing, unintelligible, spoonerism, background speech, no pauses, restarts, vocalized noise, and coinage.]
Technology: Audio Indexing of Broadcast News
Broadcast news offers some unique challenges:
  Lexicon: important information in infrequently occurring words
  Acoustic Modeling: variations in channel, particularly within the same segment (“in the studio” vs. “on location”)
  Language Model: must adapt (“Bush,” “Clinton,” “Bush,” “McCain,” “???”)
  Language: multilingual systems? language-independent acoustic modeling?
Technology: Real-Time Translation
From President Clinton’s State of the Union address (January 27, 2000):
“These kinds of innovations are also propelling our remarkable prosperity... Soon researchers will bring us devices that can translate foreign languages as fast as you can talk... molecular computers the size of a tear drop with the power of today’s fastest supercomputers.”
Imagine a world where:
  You book a travel reservation from your cellular phone while driving in your car without ever talking to a human (database query).
  You converse with someone in a foreign country and neither speaker speaks a common language (universal translator).
  You place a call to your bank to inquire about your bank account and never have to remember a password (transparent telephony).
  You can ask questions by voice and your Internet browser returns answers to your questions (intelligent query).
Human Language Engineering: a sophisticated integration of many speech and language related technologies... a science for the next millennium.
Technology: Future Directions
What have we learned?
  Supervised training is a good machine learning technique.
  Large databases are essential for the development of robust statistics.
What are the challenges?
  Discrimination vs. representation
  Generalization vs. memorization
  Pronunciation modeling
  Human-centered language modeling
What are the algorithmic issues for the next decade?
  Better features by extracting articulatory information?
  Bayesian statistics? Bayesian networks?
  Decision trees? Information-theoretic measures?
  Nonlinear dynamics? Chaos?
[Figure: timeline from 1960 to 2000 marking analog filter banks, dynamic time-warping, and hidden Markov models.]
To Probe Further: References
Journals and Conferences:
  N. Deshmukh, et al., “Hierarchical Search for Large Vocabulary Conversational Speech Recognition,” IEEE Signal Processing Magazine, vol. 16, no. 5, pp. 84-107, September 1999.
  N. Deshmukh, et al., “Benchmarking Human Performance for Continuous Speech Recognition,” Proceedings of the Fourth International Conference on Spoken Language Processing, pp. SuP1P1.10, Philadelphia, Pennsylvania, USA, October 1996.
  R. Grishman, “Information Extraction and Speech Recognition,” presented at the DARPA Broadcast News Transcription and Understanding Workshop, Lansdowne, Virginia, USA, February 1998.
  R. P. Lippmann, “Speech Recognition By Machines and Humans,” Speech Communication, vol. 22, pp. 1-15, July 1997.
  M. Maybury (editor), “News on Demand,” Communications of the ACM, vol. 43, no. 2, February 2000.
  D. Miller, et al., “Named Entity Extraction from Broadcast News,” presented at the DARPA Broadcast News Workshop, Herndon, Virginia, USA, February 1999.
  D. Pallett, et al., “Broadcast News Benchmark Test Results,” presented at the DARPA Broadcast News Workshop, Herndon, Virginia, USA, February 1999.
  J. Picone, “Signal Modeling Techniques in Speech Recognition,” IEEE Proceedings, vol. 81, no. 9, pp. 1215-1247, September 1993.
  P. Robinson, et al., “Overview: Information Extraction from Broadcast News,” presented at the DARPA Broadcast News Workshop, Herndon, Virginia, USA, February 1999.
  F. Jelinek, Statistical Methods for Speech Recognition, MIT Press, 1998.
URLs and Resources:
  “Speech Corpora,” The Linguistic Data Consortium, http://www.ldc.upenn.edu.
  “Technology Benchmarks,” Spoken Natural Language Processing Group, The National Institute for Standards, http://www.itl.nist.gov/iaui/894.01/index.html.
  “Signal Processing Resources,” Institute for Signal and Information Technology, Mississippi State University, http://www.isip.msstate.edu.
  “Internet-Accessible Speech Recognition Technology,” http://www.isip.msstate.edu/projects/speech/index.html.
  “A Public Domain Speech Recognition System,” http://www.isip.msstate.edu/projects/speech/software/index.html.
  “Remote Job Submission,” http://www.isip.msstate.edu/projects/speech/experiments/index.html.
  “The Switchboard Corpus,” http://www.isip.msstate.edu/projects/switchboard/index.html.