
1 Seminar Speech Recognition 2003 E.M. Bakker LIACS Media Lab
Leiden University

2 Outline
Introduction and State of the Art
A Speech Recognition Architecture
Acoustic Modeling
Language Modeling
Practical Issues
Applications

NB: Some of the slides are adapted from "Can Advances in Speech Recognition Make Spoken Language as Convenient and as Accessible as Online Text?", an excellent presentation by Dr. Patti Price, Speech Technology Consulting, Menlo Park, California 94025, and Dr. Joseph Picone, Institute for Signal and Information Processing, Dept. of Electrical and Computer Engineering, Mississippi State University.

Speaker notes: The aim of this talk is to provide an overview of automatic speech recognition. I'll start by defining some terms and giving some examples of the current state of the art. Then Joe will present the research part of the story, including acoustic modeling, language modeling, issues involved in moving research into applications, and technology trends. I'll summarize and give some hints at future directions.

3 Research Areas
Speech Analysis (Production, Perception, Parameter Estimation)
Speech Coding/Compression
Speech Synthesis (TTS)
Speaker Identification/Recognition/Verification (Sprint, TI)
Language Identification (Transparent Dialogue)
Speech Recognition (Dragon, IBM, AT&T)

Speech recognition sub-categories:
Discrete / Connected / Continuous Speech / Word Spotting
Speaker Dependent / Independent
Small / Medium / Large / Unlimited Vocabulary

Speaker-Independent Large Vocabulary Continuous Speech Recognition (LVCSR for short)

4 Introduction What is Speech Recognition?
Goal: automatically extract the string of words spoken from the speech signal.

Speech Signal → Speech Recognition → Words ("How are you?")

Other interesting areas:
Who the talker is (speaker recognition, identification)
Speech output (speech synthesis)
What the words mean (speech understanding, semantics)

5 Introduction Applications
Command and control: manufacturing, consumer products
Database query: resource management, air travel information, stock quotes
Dictation

Speaker notes: Current applications were not possible 20 years ago. These all use speech as input or as a way to access information, not as a source of information. Command and control of devices (Philips): users of mobile telephones can obtain a connection simply by speaking the name of the person they wish to call. Database queries run mostly over the telephone, where cost can be centralized and no keyboard is needed (American Airlines reservations via Nuance; Schwab stock quotes). Dictation (IBM ViaVoice): some people are still using keyboards and mice.

6 Introduction: State of the Art
Speech-recognition software:
IBM (ViaVoice, Voice Server Applications, ...): speaker-independent continuous command recognition, large-vocabulary recognition, text-to-speech confirmation, barge-in (the ability to interrupt an audio prompt as it is playing)
Dragon Systems, Lernout & Hauspie (L&H Voice Xpress™)
Philips: dictation, telephone, voice control (SpeechWave, VoCon SDK, chip-sets)
Microsoft (Whisper, Dr. Who)

Speaker notes: Dragon's AudioMining™ technology uses its award-winning speech recognition engine to quickly capture and index information contained in recorded video footage, radio broadcasts, telephone conversations, call center dialogues, help desk recordings, and more.

7 Introduction: State of the Art
Speech over the telephone:
AT&T Bell Labs pioneered the use of speech-recognition systems for telephone transactions.
Companies such as Nuance, Philips and SpeechWorks have been active in this field for some years now.
IBM applications over the telephone: requesting news, internet pages, stock quotes, travel information, weather information.
Bell Labs text-to-speech technology now lets mobile phone users access information on the web using natural language commands.

8 Introduction: State of the Art
Speech over the telephone (Philips):
SpeechPearl®: large vocabulary natural language recognition (up to 200,000 words)
SpeechMania®: mixed-initiative dialogue that gives the caller the impression of a truly natural dialogue: full replacement of the human operator
SpeechWave™: relatively small vocabularies (up to hundreds of words), available in nearly 40 languages
Voice ReQuest: recognizes the request and routes the call to the appropriate extension, all without the intervention of an operator

9 Introduction: State of the Art
Speech over the telephone:
Brokerage: E*Trade, ...
Banking
Travel: United Airlines, Continental Airlines
BellSouth
HP
Federal Express
Foodline (finds a nearby restaurant)

10 Introduction: State of the Art
Speech over the telephone (the MIT Jupiter system):
- Speech recognition: SUMMIT converts the spoken sentence into text
- Language understanding: TINA parses the text into a semantic frame, a grammatical structure containing the basic terms needed to query the Jupiter database
- Language generation: GENESIS uses the semantic frame's basic terms to build a Structured Query Language (SQL) query for the database
- Information retrieval: Jupiter executes the SQL query and retrieves the requested information from the database
- Language generation: TINA and GENESIS convert the query result into a natural language sentence
- Information delivery: Jupiter delivers the generated sentence to the user via voice (using a speech synthesizer) and/or display
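To make the flow of stages concrete, here is a toy, self-contained sketch of such a pipeline in Python. Every stage below is a hypothetical stand-in (recognize, parse_to_frame, frame_to_sql, generate_sentence are invented names); none of this is the actual SUMMIT, TINA, or GENESIS code, and the in-memory database only exists for the example.

```python
# Toy sketch of a Jupiter-style pipeline; all stage functions are stand-ins.
import sqlite3

def recognize(audio: str) -> str:
    # Stand-in for speech recognition: pretend the "audio" is already text.
    return audio.lower()

def parse_to_frame(text: str) -> dict:
    # Stand-in for language understanding: extract a city into a semantic frame.
    for city in ("boston", "seattle", "montego bay"):
        if city in text:
            return {"intent": "forecast", "city": city}
    return {"intent": "unknown"}

def frame_to_sql(frame: dict):
    # Stand-in for language generation of an SQL query from the semantic frame.
    return "SELECT forecast FROM weather WHERE city = ?", (frame["city"],)

def generate_sentence(frame: dict, rows) -> str:
    # Stand-in for language generation of the spoken reply.
    return f"In {frame['city'].title()}: {rows[0][0]}." if rows else "I have no forecast."

def handle_utterance(audio: str, db: sqlite3.Connection) -> str:
    text = recognize(audio)
    frame = parse_to_frame(text)
    if frame["intent"] != "forecast":
        return "Sorry, I am not sure what you said."
    sql, params = frame_to_sql(frame)
    rows = db.execute(sql, params).fetchall()   # information retrieval
    return generate_sentence(frame, rows)       # information delivery (text here)

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE weather (city TEXT, forecast TEXT)")
db.execute("INSERT INTO weather VALUES ('boston', 'partly sunny and windy, high around 50')")
print(handle_utterance("What is the forecast for Boston today?", db))
```

In a real system each stage is a substantial component; the point of the sketch is only the division of labor between recognition, understanding, retrieval, and generation.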

11 Introduction: State of the Art
The following is a transcript of a telephone conversation between Victor Zue and the Jupiter system for weather information. Jupiter's toll-free number is:

[Phone ringing]
Jupiter: Hi. Welcome to Jupiter, the M.I.T. Lab for Computer Science Weather Information System. This call is being recorded for system development. You may hang up or ask for help at any time. How can I help you?
V.Z.: What is the forecast for Boston today?
Jupiter: In Boston this afternoon, partly sunny and windy. High around 50. Tonight, clear, lows around 35. What other information can I give you?
V.Z.: Will it rain tomorrow?
Jupiter: To my knowledge, the forecast calls for no rain in Boston tomorrow. Is there anything else?

12 Introduction: State of the Art
V.Z.: How about Seattle?
Jupiter: To my knowledge, the forecast calls for no rain in Seattle tomorrow. Is there something else you would like to know?
V.Z.: Is it sunny anywhere in the Caribbean?
Jupiter: Sorry, I am not sure what you said.
Jupiter: I am sorry, I may not know the city you are asking about. Try asking for the state or country. For example, what cities do you know about in Massachusetts?
V.Z.: What is the temperature in Montego Bay?
Jupiter: In Montego Bay today, high 86 and low 73. Is there something else?
V.Z.: Good-bye.

13 Factors that Affect Performance of Speech Recognition Systems
Progress along four dimensions (approximate state of the art by year):

Year | User population | Noise environment | Speech style | Complexity
1985 | speaker-dependent | quiet room, fixed high-quality microphone | careful reading | application-specific speech and language
1995 | speaker independent and adaptive | normal office, various microphones, telephone | planned speech | expert years to create an application-specific language model
2000 | regional accents, native speakers, competent foreign speakers | vehicle noise, radio, cell phones | natural human-machine dialog (user can adapt) | some application-specific data and one engineer year
2005 (goal) | all speakers of the language, including foreign | wherever speech occurs | all styles, including human-human (unaware) | application independent or adaptive

Speaker notes: This slide is animated. Here is roughly what we could do in 1985. Several factors, then as now, affect performance, and each of them depends on how well the testing conditions match the training conditions: noise, user variability, speech style, and, hardest to quantify, task complexity (how much can you do and carry over from one application to the next). As we expand capabilities along any dimension, we expand the space of potential applications. The dimensions can be more or less traded off; for example, in a noisier environment we can do a simpler task, require more careful speech, or live with a higher error rate. By 1995, research had made progress on these dimensions, but application development was a bottleneck. The experience of getting some applications out, however, led to new research issues such as confidence measures and barge-in. More recent activity aims at using speech as a source of information (broadcast news) and at conversational systems. Applications lag behind (as in the Schwab demo: the system has full control, not the person), largely because of the effort of application development. The technology can do it, but in many cases it takes too much effort to be commercially viable. As the costs of memory and compute cycles shrink, platform sizes shrink, and tools become easier to use, this will change. Some have ambitious goals (solving the problem); I don't really believe that, but I believe that progress will continue (if funding does), that people will still be better at some things, and that machines will be better than people at other things. Hopefully we will make progress on how humans and machines can use their complementary attributes. Using speech as a source of information and as a collaborative partner requires recognizing speech wherever it occurs.

14 How Do You Measure the Performance?
USC, October 15, 1999: "the world's first machine system that can recognize spoken words better than humans can." "In benchmark testing using just a few spoken words, USC's Berger-Liaw ... System not only bested all existing computer speech recognition systems but outperformed the keenest human ears."

Questions to ask: What benchmarks? What was the training? What was the test? Were they independent? How large were the vocabulary and the sample size? Did they really test all existing systems? Is the result different from chance? Was the noise added or coincident with the speech? What kind of noise? Was it independent of the speech?

Speaker notes: USC press release, October 15, 1999, Eric Mankin: "USC biomedical engineers have created the world's first machine system that can recognize spoken words better than humans can. ... In benchmark testing using just a few spoken words, USC's Berger-Liaw Neural Network Speaker Independent Speech Recognition System not only bested all existing computer speech recognition systems but outperformed the keenest human ears." [Joe will show some data on humans vs. machines. There are some situations in which machines are better than humans, e.g., transcribing 12 digits correctly (UPS tracking) or modeling the correlation of noise and speech (too good at that!).] [Benchmark tests? Which ones? Standard?] [Bested all existing systems: did they really test them all?] "The system can distinguish words in vast amounts of random 'white noise' and can pluck words from the background clutter of other voices, the hubbub heard in bus stations and cocktail parties, for example." [Was this tested? Were training and testing independent? How large was the test? How vast was the amount of noise?] "Berger and Liaw's system functions at 60 percent recognition with a hubbub level 560 times the strength of the target stimulus." [Are they recognizing the noise and not the speech?] [I wrote to the authors for data on what they did when the press release appeared. No answer. On February 10, I wrote to the press release contact. No answer. In fact, you can get any error rate you want if you set up the test favorably!]

15 Evaluation Metrics Word Error Rate (WER)

[Chart: word error rate (0% to 40%) plotted against level of difficulty, from command and control, digits, letters and numbers, and continuous digits (lowest WER) through read speech and broadcast news up to conversational speech (roughly 40% WER).]

Spontaneous telephone speech is still a "grand challenge".
Telephone-quality speech is still central to the problem.
Broadcast news is a very dynamic domain.
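Word error rate itself is computed from the minimum edit distance between the reference and the hypothesized word strings: WER = (substitutions + deletions + insertions) / number of reference words. A minimal sketch, not tied to any particular scoring toolkit:

```python
# Minimal WER computation via word-level edit distance (dynamic programming).
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits to turn the first i reference words into the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + sub,   # substitution or match
                          d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1)         # insertion
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("what is the forecast for boston today",
                      "what is the forecast for austin today"))
# 1 substitution out of 7 reference words -> about 0.14
```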

16 Evaluation Metrics Human Performance
Human performance exceeds machine performance by a factor ranging from 4x to 10x, depending on the task.
On some tasks, such as credit card number recognition, machine performance exceeds human performance due to the limits of human memory retrieval.
The nature of the noise is as important as the SNR (e.g., cellular phones).
A primary failure mode for humans is inattention.
A second major failure mode is lack of familiarity with the domain (e.g., business terms and corporation names).

[Chart: word error rate (0% to 20%) versus speech-to-noise ratio (10 dB, 16 dB, 22 dB, quiet) on the Wall Street Journal task with additive noise; the machine curve lies well above the human-listener (committee) curve.]

17 Evaluation Metrics Machine Performance
[Chart: benchmark word error rates on a log scale from 100% down to 1% across tasks: read speech (1k, 5k, and 20k vocabularies, noisy), broadcast speech, conversational speech, spontaneous speech, varied microphones, and foreign-language conditions.]

A Word Error Rate (WER) below 10% is considered acceptable.
Performance in the field is typically 2x to 4x worse than performance on an evaluation.

18 What does a speech signal look like?

19 Spectrogram
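For readers who want to reproduce such a plot: a spectrogram is the short-time Fourier analysis of the waveform, displayed as log power over time and frequency. A minimal sketch using NumPy/SciPy/Matplotlib; the file name and the window/overlap settings are illustrative assumptions, not values from the slides.

```python
# Minimal spectrogram sketch: short-time Fourier analysis of a speech waveform.
# "speech.wav" is a placeholder file name; window/overlap values are illustrative.
import numpy as np
import matplotlib.pyplot as plt
from scipy.io import wavfile
from scipy.signal import spectrogram

fs, x = wavfile.read("speech.wav")          # sampling rate (Hz) and samples
if x.ndim > 1:
    x = x[:, 0]                             # keep one channel if stereo
f, t, Sxx = spectrogram(x, fs=fs, window="hann",
                        nperseg=512, noverlap=384)   # ~30 ms windows at 16 kHz
plt.pcolormesh(t, f, 10 * np.log10(Sxx + 1e-10))     # log power in dB
plt.xlabel("Time (s)")
plt.ylabel("Frequency (Hz)")
plt.title("Spectrogram")
plt.show()
```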

20 Speech Recognition

21 Recognition Architectures Why Is Speech Recognition So Difficult?
Comparison of "aa" in "lOck" vs. "iy" in "bEAt" for conversational speech (SWB).

[Scatter plot: feature no. 2 versus feature no. 1 for phone classes Ph_1, Ph_2, Ph_3, showing heavily overlapping clusters.]

Measurements of the signal are ambiguous.
The region of overlap represents classification errors.
Overlap can be reduced by introducing acoustic and linguistic context (e.g., context-dependent phones).

22 Overlap in the cepstral space (alphadigits)

[Scatter plots of cepstral features for female "aa", female "iy", male "aa", and male "iy".]

23 Overlap in the cepstral space (alphadigits)
[Scatter plots: male "aa" (green) vs. female "aa" (black); male "iy" (blue) vs. female "iy" (red); and a combined comparison of all four classes.]

24 Overlap in the Cepstral Space (SWB-All)

The following plots demonstrate the overlap of recognition features in the cepstral space. They consist of all vowels excised from tokens in the SWITCHBOARD conversational speech corpus.

[Panels: all male vowels, all female vowels, all vowels combined.]

25 Recognition Architectures A Communication Theoretic Approach
Source-channel view: Message Source → Linguistic Channel → Articulatory Channel → Acoustic Channel; observables: message, words, sounds, features.

Bayesian formulation for speech recognition:
P(W|A) = P(A|W) P(W) / P(A)

Objective: minimize the word error rate.
Approach: maximize P(W|A) during training.
Components:
P(A|W): acoustic model (hidden Markov models, mixtures)
P(W): language model (statistical, finite-state networks, etc.)

The language model typically predicts a small set of next words based on knowledge of a finite number of previous words (N-grams).
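Since P(A) does not depend on the word sequence, it can be dropped when choosing the best hypothesis. A short statement of the resulting decoding rule in standard notation:

```latex
% Decoding rule implied by the Bayesian formulation above:
% pick the word string W that maximizes the posterior P(W|A);
% P(A) is constant over W, so it drops out of the maximization.
\[
\hat{W} = \arg\max_{W} P(W \mid A)
        = \arg\max_{W} \frac{P(A \mid W)\,P(W)}{P(A)}
        = \arg\max_{W} P(A \mid W)\,P(W)
\]
```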

26 Recognition Architectures Incorporating Multiple Knowledge Sources
Input Speech → Acoustic Front-end → Acoustic Models P(A|W) + Language Model P(W) → Search → Recognized Utterance

Acoustic front-end: the signal is converted to a sequence of feature vectors based on spectral and temporal measurements.
Acoustic models P(A|W): represent sub-word units, such as phonemes, as finite-state machines in which states model spectral structure and transitions model temporal structure.
Language model P(W): predicts the next set of words and controls which models are hypothesized.
Search: crucial to the system, since many combinations of words must be investigated to find the most probable word sequence.

27 Acoustic Modeling Feature Extraction
Pipeline: Input Speech → Fourier Transform → Cepstral Analysis → Perceptual Weighting → Time Derivatives → mel-spaced cepstrum + energy, delta cepstrum + delta energy, delta-delta cepstrum + delta-delta energy.

Incorporate knowledge of the nature of speech sounds in the measurement of the features.
Utilize rudimentary models of human perception.
Typical: 512 samples at a 16 kHz sampling rate, i.e., a window of roughly 30 ms for frequency-domain analysis.
Include absolute energy and 12 spectral measurements.
Time derivatives model spectral change.
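As an illustration of this kind of front-end, the sketch below computes 13 mel-cepstral coefficients plus their first and second time derivatives with librosa. The file name and the exact frame settings (512-sample FFT, 10 ms hop at 16 kHz) are assumptions for the example, not prescriptions from the slides.

```python
# Sketch of an MFCC-style front-end: mel-spaced cepstra plus delta and
# delta-delta features. "speech.wav" and the analysis settings are illustrative.
import numpy as np
import librosa

y, sr = librosa.load("speech.wav", sr=16000)         # resample to 16 kHz

# 13 coefficients per frame (roughly "energy + 12 spectral measurements")
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=512, hop_length=160)   # ~30 ms window, 10 ms step

delta = librosa.feature.delta(mfcc)                   # first time derivative
delta2 = librosa.feature.delta(mfcc, order=2)         # second time derivative

features = np.vstack([mfcc, delta, delta2])           # 39 features per frame
print(features.shape)                                 # (39, number_of_frames)
```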

28 Acoustic Modeling Hidden Markov Models
Acoustic models encode the temporal evolution of the features (spectrum). Gaussian mixture distributions are used to account for variations in speaker, accent, and pronunciation. Phonetic model topologies are simple left-to-right structures. Skip states (time-warping) and multiple paths (alternate pronunciations) are also common features of models. Sharing model parameters is a common strategy to reduce complexity.
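To make the topology concrete, here is a small sketch of a three-state left-to-right HMM transition matrix with self-loops and skip transitions; the particular probability values are made up for illustration.

```python
# A 3-emitting-state left-to-right HMM topology with self-loops and skip states.
# Rows are source states, columns are destination states; values are illustrative.
import numpy as np

# states: 0, 1, 2 (emitting), 3 (exit)
A = np.array([
    [0.6, 0.3, 0.1, 0.0],   # state 0: self-loop, advance, or skip to state 2
    [0.0, 0.6, 0.3, 0.1],   # state 1: self-loop, advance, or skip to exit
    [0.0, 0.0, 0.7, 0.3],   # state 2: self-loop or exit
    [0.0, 0.0, 0.0, 1.0],   # exit state (absorbing)
])
assert np.allclose(A.sum(axis=1), 1.0)   # each row is a probability distribution

# Self-loops absorb slow speech (time-warping); skips absorb fast or reduced speech.
print(A)
```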

29 Acoustic Modeling Parameter Estimation
Word-level transcriptions supervise a closed-loop, data-driven modeling process, starting from an initial parameter estimate.
The expectation-maximization (EM) algorithm is used to improve the parameter estimates.
Computationally efficient training algorithms (Forward-Backward) are crucial.
Batch-mode parameter updates are typically preferred.
Decision trees and additional linguistic knowledge are used to optimize parameter sharing and system complexity.

Training flow: initialization → single-Gaussian estimation → 2-way split → mixture-distribution reestimation → 4-way split → reestimation → ...
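The split-and-reestimate schedule above can be sketched as follows: each Gaussian is duplicated with a small mean perturbation and the mixture is then re-estimated with EM. This is a 1-D toy illustration; the data, perturbation size, and iteration counts are made up.

```python
# Toy sketch of the "split and re-estimate" schedule for Gaussian mixtures (1-D).
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 0.5, 500), rng.normal(1.5, 0.8, 500)])

# Initialization: a single Gaussian estimated from all the data.
means, variances, weights = np.array([x.mean()]), np.array([x.var()]), np.array([1.0])

def split(means, variances, weights, eps=0.2):
    """2-way split: duplicate each component, perturbing the means by +/- eps*stddev."""
    offset = eps * np.sqrt(variances)
    return (np.concatenate([means - offset, means + offset]),
            np.concatenate([variances, variances]),
            np.concatenate([weights / 2, weights / 2]))

def reestimate(x, means, variances, weights, iters=10):
    """EM re-estimation of a 1-D Gaussian mixture."""
    for _ in range(iters):
        # E-step: responsibilities of each component for each sample
        lik = weights * np.exp(-0.5 * (x[:, None] - means) ** 2 / variances) \
              / np.sqrt(2 * np.pi * variances)
        resp = lik / lik.sum(axis=1, keepdims=True)
        # M-step: weighted re-estimates
        n_k = resp.sum(axis=0)
        means = (resp * x[:, None]).sum(axis=0) / n_k
        variances = (resp * (x[:, None] - means) ** 2).sum(axis=0) / n_k
        weights = n_k / len(x)
    return means, variances, weights

for _ in range(2):                       # 1 -> 2 -> 4 components
    means, variances, weights = split(means, variances, weights)
    means, variances, weights = reestimate(x, means, variances, weights)
print(np.round(means, 2), np.round(weights, 2))
```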

30 Language Modeling Is A Lot Like Wheel of Fortune

31 Language Modeling N-Grams: The Good, The Bad, and The Ugly
Unigrams (SWB):
  Most common: "I", "and", "the", "you", "a"
  Rank 100: "she", "an", "going"
  Least common: "Abraham", "Alastair", "Acura"
Bigrams (SWB):
  Most common: "you know", "yeah SENT!", "!SENT um-hum", "I think"
  Rank 100: "do it", "that we", "don't think"
  Least common: "raw fish", "moisture content", "Reagan Bush"
Trigrams (SWB):
  Most common: "!SENT um-hum SENT!", "a lot of", "I don't know"
  Rank 100: "it was a", "you know that"
  Least common: "you have parents", "you seen Brooklyn"
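An N-gram model of this kind is estimated by counting and normalizing. A minimal bigram sketch on a made-up toy corpus (a real model would be trained on something like the Switchboard transcripts):

```python
# Minimal bigram language model by counting: P(w2 | w1) = count(w1 w2) / count(w1).
from collections import Counter

corpus = [
    "i think you know",
    "you know i do not think so",
    "i think a lot of that",
]
unigrams, bigrams = Counter(), Counter()
for sentence in corpus:
    words = ["<s>"] + sentence.split() + ["</s>"]     # sentence boundary tokens
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

def p_bigram(w1, w2):
    return bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0

print(bigrams.most_common(3))
print(p_bigram("i", "think"))   # 2 of the 3 occurrences of "i" are followed by "think"
```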

32 Language Modeling Integration of Natural Language
Natural language constraints can be easily incorporated. Lack of punctuation and the size of the search space pose problems. Speech recognition typically produces a word-level, time-aligned annotation. Time alignments for other levels of information are also available.

33 Implementation Issues Dynamic Programming-Based Search
Dynamic programming is used to find the most probable path through the network. Beam search is used to control resources. Search is time synchronous and left-to-right. Arbitrary amounts of silence must be permitted between each word. Words are hypothesized many times with different start/stop times, which significantly increases search complexity.
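A compressed sketch of the time-synchronous Viterbi idea with beam pruning follows. The tiny model, observation sequence, and beam width are made up for illustration; real decoders operate in the log domain over very large networks of word and phone models.

```python
# Toy time-synchronous Viterbi search with a beam, in the log domain.
# Transition and observation probabilities below are illustrative.
import numpy as np

logA = np.log(np.array([[0.6, 0.4, 0.0],       # left-to-right transitions
                        [0.0, 0.7, 0.3],
                        [0.0, 0.0, 1.0]]) + 1e-12)
logB = np.log(np.array([[0.7, 0.2, 0.1],       # per-state observation likelihoods
                        [0.2, 0.6, 0.2],
                        [0.1, 0.2, 0.7]]))
obs = [0, 0, 1, 2, 2]                          # a short observation sequence
beam = 5.0                                     # prune paths more than 5 nats behind the best

scores = np.full(3, -np.inf)
scores[0] = logB[0, obs[0]]                    # must start in the first state
for o in obs[1:]:
    new = np.full(3, -np.inf)
    for j in range(3):                         # time-synchronous, left-to-right update
        new[j] = np.max(scores + logA[:, j]) + logB[j, o]
    new[new < new.max() - beam] = -np.inf      # beam pruning controls resources
    scores = new
print("best final log-score:", scores[2])
```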

34 Implementation Issues Cross-Word Decoding Is Expensive
Cross-word Decoding: since word boundaries don’t occur in spontaneous speech, we must allow for sequences of sounds that span word boundaries. Cross-word decoding significantly increases memory requirements.

35 Implementation Issues Search Is Resource Intensive
Typical LVCSR systems have about 10M free parameters, which makes training a challenge. Large speech databases are required (several hundred hours of speech). Tying, smoothing, and interpolation are required.
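"Interpolation" here refers, for example, to smoothing a sparse higher-order estimate with a more robust lower-order one. A minimal sketch of linearly interpolating bigram and unigram probabilities; the mixing weight 0.7 is illustrative and would in practice be tuned on held-out data.

```python
# Linear interpolation smoothing: mix a sparse bigram estimate with a
# more robust unigram estimate. The weight lam is illustrative.
from collections import Counter

text = "we were away a year ago we were there".split()
unigrams = Counter(text)
bigrams = Counter(zip(text, text[1:]))
total = sum(unigrams.values())

def p_interp(w1, w2, lam=0.7):
    p_bi = bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0
    p_uni = unigrams[w2] / total
    return lam * p_bi + (1 - lam) * p_uni

print(p_interp("we", "were"))   # seen bigram: dominated by the bigram estimate
print(p_interp("we", "ago"))    # unseen bigram: falls back to the unigram estimate
```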

36 Applications Conversational Speech
Conversational speech collected over the telephone contains background noise, music, fluctuations in the speech rate, laughter, partial words, hesitations, mouth noises, etc.
WER (Word Error Rate) has decreased from 100% to 30% in six years.
Phenomena annotated in the data include: laughter, singing, unintelligible speech, spoonerisms, background speech, no pauses, restarts, vocalized noise, coinage.

37 Applications Audio Indexing of Broadcast News
Broadcast news offers some unique challenges:
Lexicon: important information in infrequently occurring words.
Acoustic modeling: variations in channel, particularly within the same segment ("in the studio" vs. "on location").
Language model: must adapt ("Bush," "Clinton," "Bush," "McCain," "???").
Language: multilingual systems? Language-independent acoustic modeling?

38 Applications Automatic Phone Centers
Portals: BeVocal, TellMe, HeyAnita
VoiceXML 2.0
Automatic information desk
Reservation desk
Automatic help desk
With speaker identification: bank account services, corporate services

39 Applications Real-Time Translation
From President Clinton's State of the Union address (January 27, 2000): "These kinds of innovations are also propelling our remarkable prosperity... Soon researchers will bring us devices that can translate foreign languages as fast as you can talk... molecular computers the size of a tear drop with the power of today's fastest supercomputers."

Imagine a world where:
You book a travel reservation from your cellular phone while driving in your car, without ever talking to a human (database query).
You converse with someone in a foreign country when neither of you speaks a common language (universal translator).
You place a call to your bank to inquire about your account and never have to remember a password (transparent telephony).
You ask questions by voice and your Internet browser returns answers to your questions (intelligent query).

Human language engineering: a sophisticated integration of many speech- and language-related technologies... a science for the next millennium.

40 Technology Future Directions
[Timeline, 1960-2000: analog filter banks, dynamic time-warping, hidden Markov models.]

Conclusions:
Supervised training is a good machine learning technique.
Large databases are essential for the development of robust statistics.

Challenges:
Discrimination vs. representation.
Generalization vs. memorization.
Pronunciation modeling.
Human-centered language modeling.

The algorithmic issues for the next decade: better features by extracting articulatory information? Bayesian statistics? Bayesian networks? Decision trees? Information-theoretic measures? Nonlinear dynamics? Chaos?

