
1 Seminar Speech Recognition: a Short Overview
E.M. Bakker
LIACS Media Lab, Leiden University

2 Introduction What is Speech Recognition?
Goal: automatically extract the string of words spoken from the speech signal.
Speech Signal → Speech Recognition → Words ("How are you?")
Other interesting areas: who the talker is (speaker recognition, identification); speech output (speech synthesis); what the words mean (speech understanding, semantics).

3 Recognition Architectures A Communication Theoretic Approach
Communication channel view: Message Source → Linguistic Channel → Articulatory Channel → Acoustic Channel. The observables at each stage are the message, words, sounds, and features.
Bayesian formulation for speech recognition: P(W|A) = P(A|W) P(W) / P(A)
Objective: minimize the word error rate. Approach: maximize P(W|A), i.e. choose the word sequence W that is most probable given the acoustics A.
Components: P(A|W): acoustic model (hidden Markov models, mixtures); P(W): language model (statistical, finite state networks, etc.). The language model typically predicts a small set of next words based on knowledge of a finite number of previous words (N-grams).
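A minimal sketch of this decision rule, assuming precomputed log-domain acoustic and language-model scores; the function names and the explicit candidate list are hypothetical (a real decoder searches an exponentially large hypothesis space rather than enumerating it).

import math

def decode(candidates, acoustic_logprob, lm_logprob):
    # Pick the word sequence W maximizing P(A|W) * P(W); P(A) is constant
    # over candidates and can be dropped. Scores are combined in the log
    # domain for numerical stability.
    best_words, best_score = None, -math.inf
    for words in candidates:
        score = acoustic_logprob(words) + lm_logprob(words)
        if score > best_score:
            best_words, best_score = words, score
    return best_words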

4 Recognition Architectures Incorporating Multiple Knowledge Sources
Acoustic front-end: the input speech signal is converted to a sequence of feature vectors based on spectral and temporal measurements.
Acoustic models P(A|W): represent sub-word units, such as phonemes, as finite-state machines in which states model spectral structure and transitions model temporal structure.
Language model P(W): predicts the next set of words and controls which models are hypothesized.
Search: crucial to the system, since many combinations of words must be investigated to find the most probable word sequence, which is output as the recognized utterance.

5 Acoustic Modeling Feature Extraction
Measure features 100 times per sec. Use a 25 msec window for frequency domain analysis. Include absolute energy and 12 spectral measurements. Time derivatives to model spectral change. Incorporate knowledge of the nature of speech sounds in measurement of the features. Utilize rudimentary models of human perception.
Processing pipeline: Input Speech → Fourier Transform → Cepstral Analysis → Perceptual Weighting → Energy + Mel-Spaced Cepstrum; Time Derivative → Delta Energy + Delta Cepstrum; Time Derivative → Delta-Delta Energy + Delta-Delta Cepstrum.
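The measurement scheme above (25 ms windows at a 10 ms frame rate, energy plus 12 cepstral coefficients, with first and second time derivatives) can be sketched with librosa, assuming it is installed; the file name and exact parameter choices are illustrative, not those of the original systems.

import numpy as np
import librosa

y, sr = librosa.load("utterance.wav", sr=16000)           # hypothetical input file
# 25 ms analysis window, 10 ms frame shift -> roughly 100 feature vectors per second
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,        # c0 (energy-like) + 12 cepstra
                            n_fft=int(0.025 * sr),
                            hop_length=int(0.010 * sr))
delta = librosa.feature.delta(mfcc)                       # spectral change (velocity)
delta2 = librosa.feature.delta(mfcc, order=2)             # acceleration
features = np.concatenate([mfcc, delta, delta2], axis=0)  # 39 values per frame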

6 Acoustic Modeling Hidden Markov Models
Acoustic models encode the temporal evolution of the features (spectrum). Gaussian mixture distributions are used to account for variations in speaker, accent, and pronunciation. Phonetic model topologies are simple left-to-right structures. Skip states (time-warping) and multiple paths (alternate pronunciations) are also common features of models. Sharing model parameters is a common strategy to reduce complexity.
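A minimal sketch of the 3-state left-to-right phone topology described above, with an optional skip transition; the transition probabilities are illustrative placeholders, and in a real system the per-state emission densities would be Gaussian mixtures trained with EM.

import numpy as np

# States 0 -> 1 -> 2, left-to-right with self-loops; A[0, 2] is a skip
# transition that lets fast speech bypass the middle state (time-warping).
A = np.array([[0.6, 0.3, 0.1],
              [0.0, 0.7, 0.3],
              [0.0, 0.0, 1.0]])   # stay in the final state until the model is exited

def sample_state_sequence(A, n_frames, seed=0):
    # Draw one state path of length n_frames from the topology above.
    rng = np.random.default_rng(seed)
    states, s = [], 0
    for _ in range(n_frames):
        states.append(s)
        s = rng.choice(len(A), p=A[s])
    return states

print(sample_state_sequence(A, 10))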

7 Acoustic Modeling Parameter Estimation
Closed-loop, data-driven modeling, supervised only by a word-level transcription. The expectation/maximization (EM) algorithm is used to improve the parameter estimates. Computationally efficient training algorithms (Forward-Backward) have been crucial. Batch mode parameter updates are typically preferred. Decision trees are used to optimize parameter-sharing, system complexity, and the use of additional linguistic knowledge.
Training schedule (see the sketch below): Initialization → Single Gaussian Estimation → 2-Way Split → Mixture Distribution Reestimation → 4-Way Split → Reestimation → ...
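A schematic sketch of that split-and-reestimate schedule; model.initialize_single_gaussians, model.split_components and model.reestimate are hypothetical stand-ins for seeding one Gaussian per HMM state, perturbing each Gaussian into two, and running Baum-Welch (forward-backward) passes over the training data.

def train_mixtures(model, data, target_components=8):
    # Grow Gaussian mixtures by successive 2-way splits with EM reestimation.
    model.initialize_single_gaussians(data)    # 1 Gaussian per HMM state
    components = 1
    while components < target_components:
        model.split_components()               # 1 -> 2 -> 4 -> 8 ...
        for _ in range(4):                     # a few EM passes per split
            model.reestimate(data)
        components *= 2
    return model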

8 Language Modeling Is A Lot Like Wheel of Fortune

9 Language Modeling N-Grams: The Good, The Bad, and The Ugly
Unigrams (SWB): most common: "I", "and", "the", "you", "a"; rank 100: "she", "an", "going"; least common: "Abraham", "Alastair", "Acura".
Bigrams (SWB): most common: "you know", "yeah SENT!", "!SENT um-hum", "I think"; rank 100: "do it", "that we", "don't think"; least common: "raw fish", "moisture content", "Reagan Bush".
Trigrams (SWB): most common: "!SENT um-hum SENT!", "a lot of", "I don't know"; rank 100: "it was a", "you know that"; least common: "you have parents", "you seen Brooklyn".
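A small sketch of how unigram, bigram, and trigram counts like those above are gathered from a transcribed corpus; the tokens "<s>" and "</s>" are generic sentence-boundary markers standing in for the SENT tokens in the lists, and the two-sentence corpus is illustrative.

from collections import Counter

def ngram_counts(sentences, n):
    # Count n-grams over whitespace-tokenized sentences with boundary tokens.
    counts = Counter()
    for sentence in sentences:
        tokens = ["<s>"] * (n - 1) + sentence.split() + ["</s>"]
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

corpus = ["i think you know", "you know i do not think so"]
print(ngram_counts(corpus, 2).most_common(3))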

10 Language Modeling Integration of Natural Language
Natural language constraints can be easily incorporated. Lack of punctuation and the size of the search space pose problems. Speech recognition typically produces a word-level, time-aligned annotation; time alignments for other levels of information are also available.

11 Implementation Issues Search Is Resource Intensive
Typical LVCSR systems have about 10M free parameters, which makes training a challenge. Large speech databases are required (several hundred hours of speech). Tying, smoothing, and interpolation are required.

12 Implementation Issues Dynamic Programming-Based Search
Dynamic programming is used to find the most probable path through the network. Beam search is used to control resources. Search is time synchronous and left-to-right. Arbitrary amounts of silence must be permitted between each word. Words are hypothesized many times with different start/stop times, which significantly increases search complexity.

13 Implementation Issues Cross-Word Decoding Is Expensive
Cross-word decoding: since word boundaries are not marked in spontaneous speech, we must allow for sequences of sounds that span word boundaries. Cross-word decoding significantly increases memory requirements.

14 General Specification
The aim of this talk is to provide an overview of automatic speech recognition. I'll start by defining some terms and giving some examples of the current state of the art. Then Joe will present the research part of the story, including acoustic modeling, language modeling, issues involved in moving research into applications, and technology trends. I'll summarize and give some hints at future directions.

15 Applications Conversational Speech
Conversational speech collected over the telephone contains background noise, music, fluctuations in the speech rate, laughter, partial words, hesitations, mouth noises, etc. WER (Word Error Rate) has decreased from 100% to 30% in six years.
Typical phenomena: laughter, singing, unintelligible speech, spoonerisms, background speech, no pauses, restarts, vocalized noise, coinage.

16 Applications Audio Indexing of Broadcast News
Broadcast news offers some unique challenges:
Lexicon: important information in infrequently occurring words.
Acoustic modeling: variations in channel, particularly within the same segment ("in the studio" vs. "on location").
Language model: must adapt ("Bush," "Clinton," "Bush," "McCain," "???").
Language: multilingual systems? Language-independent acoustic modeling?

17 Applications Real-Time Translation
From President Clinton's State of the Union address (January 27, 2000): "These kinds of innovations are also propelling our remarkable prosperity... Soon researchers will bring us devices that can translate foreign languages as fast as you can talk... molecular computers the size of a tear drop with the power of today's fastest supercomputers."
Imagine a world where:
- You book a travel reservation from your cellular phone while driving in your car without ever talking to a human (database query)
- You converse with someone in a foreign country and neither speaker speaks a common language (universal translator)
- You place a call to your bank to inquire about your bank account and never have to remember a password (transparent telephony)
- You can ask questions by voice and your Internet browser returns answers to your questions (intelligent query)
Human Language Engineering: a sophisticated integration of many speech and language related technologies... a science for the next millennium.

18 A Generic Solution

19 A Pattern Recognition Formulation

20 Solution: Signal Modeling

21 Speech Recognition
Erwin M. Bakker, Leiden University

22 THE SPEECH RECOGNITION PROBLEM
- Boundaries between words or phonemes: spoken word, phoneme, and word boundaries are unknown
- Large variations in speaking rates: in fluent speech, words and word endings are less pronounced
- A great deal of inter- as well as intra-speaker variability: sex, physiological and psychological factors
- Quality of the speech signal: environmental noise, microphone, telephone
- Task-inherent syntactic-semantic constraints should be exploited: similar to human-to-human interaction

23 SEARCH ALGORITHMS

24 STATISTICAL METHODS IN SPEECH RECOGNITION
The Bayesian Approach Acoustic Models Language Models

25 A Statistical Speech Recognition System
Schematic overview of a statistical speech recognition system.

26 Acoustic Models (HMM)
Some typical HMM topologies used for acoustic modeling in large vocabulary speech recognition: (a) a typical triphone, (b) short pause, (c) silence. The shaded states denote the start and stop states for each model.

27 Language Models N-grams indirectly encode syntax, semantics and pragmatics by concentrating on the local dependencies between words. Also, N-gram probabilities can be directly computed from text data and therefore do not require explicit linguistic rules like a formal language grammar. N-grams are a good example of how deeply rooted statistical methods are in speech recognition. Most systems use a trigram back-off language model [44], though there are some systems that have ventured into higher-order N-grams [26], long-range dependencies [33], cache [32], link [34] and trigger models [35], class grammars [28], and decision-tree clustered language models [4].
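A minimal sketch of the back-off idea behind the trigram language models mentioned above: if a trigram was never observed, fall back to the bigram (and then unigram) estimate, scaled by a back-off weight. Real systems use discounted counts (e.g. Katz or Kneser-Ney smoothing); the fixed weight here is an assumed placeholder, so this is a "stupid back-off" style approximation rather than the cited models.

def backoff_prob(w1, w2, w3, tri, bi, uni, total, alpha=0.4):
    # Approximate P(w3 | w1, w2) from n-gram count dictionaries;
    # alpha is an assumed back-off weight, not an estimated one.
    if (w1, w2, w3) in tri and (w1, w2) in bi:
        return tri[(w1, w2, w3)] / bi[(w1, w2)]
    if (w2, w3) in bi and w2 in uni:
        return alpha * bi[(w2, w3)] / uni[w2]
    return alpha * alpha * uni.get(w3, 0) / total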

28 SEARCH ALGORITHMS The Complexity of Search. Typical Search Algorithms
Viterbi Search Stack Decoders Multi-Pass Search Forward-Backward Search

29 Hierarchical representation of the search space.

30 An outline of the Viterbi search algorithm
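The original figure is not reproduced in this transcript; as a stand-in, here is a minimal log-domain Viterbi sketch over a generic HMM (transition matrix, per-frame emission log-probabilities). It recovers the single most probable state path, which is the core operation the decoder applies, time-synchronously, to the expanded search network.

import numpy as np

def viterbi(log_A, log_emit, log_init):
    # log_A[i, j]   : log transition probability from state i to state j
    # log_emit[t, j]: log emission probability of frame t in state j
    # log_init[j]   : log initial-state probability
    T, N = log_emit.shape
    score = np.full((T, N), -np.inf)
    back = np.zeros((T, N), dtype=int)
    score[0] = log_init + log_emit[0]
    for t in range(1, T):
        for j in range(N):
            cand = score[t - 1] + log_A[:, j]
            back[t, j] = np.argmax(cand)
            score[t, j] = cand[back[t, j]] + log_emit[t, j]
    # Backtrace from the best final state.
    path = [int(np.argmax(score[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]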

31 Simple overview of the stack decoding algorithm.
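The stack-decoding figure is likewise not reproduced; the sketch below shows only the best-first flavour of the algorithm, with hypothetical expand() and is_complete() helpers: a priority queue ("stack") of scored partial hypotheses is repeatedly popped and extended until a complete hypothesis is reached.

import heapq, itertools

def stack_decode(initial, expand, is_complete):
    # expand(hyp) yields (score, new_hyp) extensions; is_complete(hyp) tests
    # for an end-of-utterance hypothesis. Both are assumed helpers.
    counter = itertools.count()          # tie-breaker so hypotheses never compare
    stack = [(0.0, next(counter), initial)]
    while stack:
        neg_score, _, hyp = heapq.heappop(stack)   # heapq pops the smallest,
        if is_complete(hyp):                       # so scores are negated
            return hyp, -neg_score
        for ext_score, new_hyp in expand(hyp):
            heapq.heappush(stack, (neg_score - ext_score, next(counter), new_hyp))
    return None, float("-inf")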

32 Multi-Pass Search An example of the N-best list of hypotheses generated for a simple utterance, and the resulting word graph, with N equal to 20. Note that most of the paths are almost equally probable and are only minor variants of each other in terms of segmentation. This indicates the severity of the acoustic confusability in spontaneous, conversational speech recognition.

33 Complexity of Search
Lexicon: contains all the words in the system's vocabulary along with their pronunciations (often there are multiple pronunciations per word).
Acoustic models: HMMs that represent the basic sound units the system is capable of recognizing.
Language model: determines the possible word sequences allowed by the system (encodes knowledge of the syntax and semantics of the language).

34 A schematic diagram of the control flow of the decoder in the ISIP automatic speech recognition system, for a single utterance N frames long. The shaded region represents the core search. The preprocessing (data loading etc.) and postprocessing (best path backtrace, word graph generation) are also described.

35 An illustration of the definition of a path instance for two paths in the lexical tree of Figure 15.
Also shown are the actual C++ class definitions for the path marker or Trace class and the Instance class.

36 References
Neeraj Deshmukh, Aravind Ganapathiraju and Joseph Picone, "Hierarchical Search for Large Vocabulary Conversational Speech Recognition," IEEE Signal Processing Magazine, September 1999.
H. Ney and S. Ortmanns, "Dynamic Programming Search for Continuous Speech Recognition."
V. Zue, "Talking with Your Computer," Scientific American, August 1999.

37 Relative complexity of the search problem for large vocabulary conversational speech recognition
An overview of the relative complexity of the search problem for large vocabulary conversational speech recognition, showing the impact of various types of acoustic and language models.

38 A TIME-SYNCHRONOUS VITERBI-BASED DECODER
Complexity of Search
Network Decoding
N-Gram Decoding
Cross-Word Acoustic Models
Search Space Organization
Lexical Trees
Language Model Lookahead
Acoustic Evaluation

39 Network decoding using word-internal context-dependent models
(a) The word network providing linguistic constraints. (b) The pronunciation lexicon for the words involved. (c) The network expanded using the corresponding word-internal triphones derived from the pronunciations of the words.

40 An example of network decoding using word-internal context-dependent models.
(a) The word network providing linguistic constraints (b) The pronunciation lexicon for the words involved (c) The network expanded using the corresponding word-internal triphones derived from the pronunciations of the words. Note that every pronunciation of a word needs to be treated as a different word (e.g. “A”), and every instance of a word needs to be hypothesized separately. The two circled triphones represent different instances as they belong to two different words.
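A small sketch of the word-internal expansion described above: each phone in a word's pronunciation is turned into a context-dependent unit using only contexts inside that word, so no cross-word contexts are generated (edge phones keep only the context available within the word). The lexicon entries are illustrative, not taken from the figure.

def word_internal_triphones(phones):
    # Expand one pronunciation into word-internal context-dependent units.
    # Word-initial and word-final phones get only the within-word context.
    units = []
    for i, p in enumerate(phones):
        left = phones[i - 1] if i > 0 else None
        right = phones[i + 1] if i < len(phones) - 1 else None
        name = p
        if left is not None:
            name = f"{left}-{name}"
        if right is not None:
            name = f"{name}+{right}"
        units.append(name)
    return units

lexicon = {"eight": ["ey", "t"], "a": ["ey"]}   # illustrative pronunciations
for word, pron in lexicon.items():
    print(word, word_internal_triphones(pron))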

41 Search Space Organization
A small part of the expanded network from Figure 11 using cross-word triphones. Note the explosion in the number of paths at the end and start of each word.

42 Lexical Tree An example lexical tree used in the decoder. The triphones are generated dynamically (on the fly) for each of the lexical tree nodes. Each lexical node contains a list of the words (or lattice nodes) on that path covered by the monophone held in the lexical node. The dark circles represent the starts and ends of words; the word identity is unknown until a word-end lexical node is reached.
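A minimal sketch of building a lexical tree (pronunciation prefix tree) of the kind described above: words that share a pronunciation prefix share monophone nodes, and the word identity is attached only at word-end nodes. The pronunciations here are illustrative.

def build_lexical_tree(lexicon):
    # Build a prefix tree of monophones; word identities live at word-end nodes.
    root = {"children": {}, "words": []}
    for word, phones in lexicon.items():
        node = root
        for phone in phones:
            node = node["children"].setdefault(phone, {"children": {}, "words": []})
        node["words"].append(word)          # word identity known only at the leaf
    return node if False else root

lexicon = {"two": ["t", "uw"], "ten": ["t", "eh", "n"], "a": ["ey"]}  # illustrative
tree = build_lexical_tree(lexicon)
print(sorted(tree["children"]))             # shared root branches: ['ey', 't']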

43 Generation of triphones
Generation of triphones from the lexical tree consisting of monophone lexical nodes. Note the increase in the number of triphones at word boundaries due to cross-word context.

44 A TIME-SYNCHRONOUS VITERBI-BASED DECODER
Search Space Reduction
Pruning (see the sketch below):
- setting pruning beams based on the hypothesis score
- limiting the total number of model instances active at a given time
- setting an upper bound on the number of words allowed to end at a given frame
Path Merging
Word Graph Compaction
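A small sketch of the first two pruning steps in the list above: keep only hypotheses whose score lies within a beam of the current best, then cap the number of active instances. The beam width and cap are illustrative values, not the ones used in the decoder.

def prune(hypotheses, beam=200.0, max_active=1000):
    # hypotheses: list of (log_score, hyp). Returns the surviving subset.
    if not hypotheses:
        return []
    best = max(score for score, _ in hypotheses)
    survivors = [(s, h) for s, h in hypotheses if s >= best - beam]   # beam pruning
    survivors.sort(key=lambda sh: sh[0], reverse=True)
    return survivors[:max_active]                                     # cap on active instances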

45 An illustration of the word graph compaction in the decoder
The reduced word graph yields the same unique word sequences as the original, but its size is significantly smaller. On large word graphs, compaction reduces the word graph size by a factor of 2 to 5.

46 A TIME-SYNCHRONOUS VITERBI-BASED DECODER
System Architecture
PERFORMANCE ANALYSIS
A substitution error refers to the case where the decoder misrecognizes a word in the reference sequence as another word in the hypothesis. A deletion error occurs when there is no word in the hypothesis corresponding to a word in the reference transcription. An insertion error corresponds to the case where the hypothesis contains an extra word that has no counterpart in the reference.
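These three error types are exactly the edit operations counted by the standard word error rate (WER) computation; a minimal dynamic-programming sketch is shown below.

def word_error_rate(reference, hypothesis):
    # WER = (substitutions + deletions + insertions) / number of reference words.
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                        # deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                        # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("how are you", "how you do"))   # 2 errors / 3 words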

47 A TIME-SYNCHRONOUS VITERBI-BASED DECODER
Scalability: can the algorithm scale gracefully from small constrained tasks to large unconstrained tasks?
Recognition accuracy: how accurate is the best word sequence found by the system?
Word graph accuracy: can the system generate alternate choices that contain the correct word sequence? How large must this list of choices be?
Memory: how much memory is required to achieve optimal performance? How does performance vary with the amount of memory used?
Run-time: how many seconds of CPU time per second of speech are required (xRT) to achieve optimal performance? How does run-time vary with performance (run-time should decrease significantly as higher error rates are tolerated)?

48 A TIME-SYNCHRONOUS VITERBI-BASED DECODER
Alphadigits Switchboard Beam Pruning MAPMI Pruning

49 The language model for the Alphadigits corpus is a fully connected grammar.
The empty word cells do not correspond to a word, but are used to denote the loop-back for the grammar.

50 Comparisons performed on a 333 MHz Pentium II processor with 512 MB RAM
Summary of the decoder performance on the LDC-SWB task for word graph generation using a bigram language model. Note that the WER is slightly higher because word graphs are generated with tighter pruning thresholds than those used for decoding. Also, real-time rates double when cross-word models are used.

51 Forward-Backward Search
The combined score is the normalized product of the forward and backward path scores.

52 Introduction Speech in the Information Age
Speech and text were revolutionary because of information access. New media and connectivity yield information overload; can speech technology help?
Sources of information have grown over time from speech and text to film, video, multimedia, voice mail, radio, television, conferences, the web, and on-line resources. Access to information has likewise evolved: listen and remember, read books, computer typing, careful spoken or written input, conversational language.
Text is different from spoken language. Sometimes there is no keyboard (devices are shrinking, e.g. the PC watch). It is harder to find what you want as information explodes. Can technology enable information to become a proactive partner in collaboration with humans?

53 Conclusion and Future Directions Trends
Trends: speech as access (what are the words?), speech as source (what does it mean?), information as partner (here's what you need).
We need new technology to help with information overload. Speech information sources are everywhere: voice mail messages, professional talk, lectures, broadcasts. Speech sources of information will increase as devices shrink, as mobility increases, and with new uses such as annotation and documentation.
Do we need words? We need meaning! We need information access (as we said at the beginning); words are a stepping stone to that. Meaning is fuzzy, but can we gradually approximate enough of it to be useful?
From Fred Juang and Sadaoki Furui: worldwide investment in speech recognition and synthesis is estimated at $400M annually, with many people in the field. Trade magazines cautioned users to lower expectations of PC voice-recognition software ("treat it like your dog"?). Application programmers complain about the "bugs": the same voice commands give different results at different times (not very graceful or intuitive degradation; using human adaptation techniques such as speaking louder or hyperarticulating makes systems perform worse). Many people turn off PC speech recognition or synthesis features after less than a week of use; not sticky enough?

54 Conclusion and Future Directions Applications on the Horizon
Beginnings of speech as a source of information: ISLIP, Virage.
Speech technology in education and training. Why it doesn't belong in the classroom (Beulah Arnott: also true of indoor plumbing); Cliff Stoll, High Tech Heretic: good schools need no computers, bad schools won't be improved by them. BravoBrava: co-evolving technology and people can dramatically reduce the cost of delivery of content, increase its timeliness, quality and appropriateness, and target the needs of an individual and/or group. Reading Pal demo.
Speech and other technology may also help resolve a growing problem for the TV industry: finding specific images hidden in vast archives of programming. New digital tools from Virage of San Mateo, Calif., and Islip Media of Orlando can create searchable guides to video recordings by detecting and classifying changes in scenes, extracting closed-caption text, and recognizing speech.
Reading prototype: the teacher creates characteristics (e.g., reading level, vocabulary); the child selects within those criteria. The demo is aimed at reading practice (the child basically knows letter-to-sound rules but needs to become more automatic). Help is available (for a cost, as in video games): dictionary, hints, pronouncing. Assessment: of texts for reading level; of the student for vocabulary, pronunciation (some immediate, some in diagnostics), and fluency; eventually, recommendations for further work.

55 OVERLAP IN THE CEPSTRAL SPACE (ALPHADIGITS)
The following plots demonstrate overlap of recognition features in the cepstral space. These plots consist of the vowels "aa" (as in "lock") and "iy" (as in "beat") excised from tokens in the OGI Alphadigit speech corpus. In these plots, the first two cepstral coefficients are shown (c[1] and c[2]; energy, which is c[0], is not shown). Comparisons are provided as a function of the vowel spoken and the gender of the speaker:
- Vowel comparison: male "aa" to male "iy"
- Vowel comparison: female "aa" to female "iy"
- Vowel comparison: a combined plot of the above conditions
- Gender comparisons: males vs. females for the vowels "aa" and "iy"
- Combined comparisons: "aa" vs. "iy" for both genders
The Alphadigits vowel data used to generate these plots is available for classification experiments.

56 OVERLAP IN THE CEPSTRAL SPACE (SWB)
The following plots demonstrate overlap of recognition features in the cepstral space. These plots consist of the vowels "aa" (as in "lock") and "iy" (as in "beat") excised from tokens in the SWITCHBOARD conversational speech corpus. In these plots, the first two cepstral coefficients are shown (c[1] and c[2]; energy, which is c[0], is not shown). Comparisons are provided as a function of the vowel spoken and the gender of the speaker:
- Vowel comparison: male "aa" to male "iy"
- Vowel comparison: female "aa" to female "iy"
- Vowel comparison: a combined plot of the above conditions
- Gender comparisons: males vs. females for the vowels "aa" and "iy"
- Combined comparisons: "aa" vs. "iy" for both genders
The Switchboard vowel data used to generate these plots is available for classification experiments.
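A sketch of how such an overlap plot can be drawn with matplotlib; the two arrays here are synthetic placeholders standing in for the excised "aa" and "iy" frames, with column indices following the convention above (c[0] is energy, so c[1] and c[2] are plotted).

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical stand-ins for the excised vowel frames: rows are frames,
# column k holds cepstral coefficient c[k]. Replace with real feature arrays.
rng = np.random.default_rng(0)
aa = rng.normal(loc=[0.0, 1.0, -0.5], scale=0.4, size=(500, 3))
iy = rng.normal(loc=[0.0, 0.2, 0.6], scale=0.4, size=(500, 3))

plt.scatter(aa[:, 1], aa[:, 2], s=4, label='vowel "aa"')
plt.scatter(iy[:, 1], iy[:, 2], s=4, label='vowel "iy"')
plt.xlabel("c[1]")
plt.ylabel("c[2]")
plt.legend()
plt.title("Overlap of vowel classes in the cepstral space")
plt.show()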

57 Implementation Issues Decoding Example

58 Implementation Issues Internet-Based Speech Recognition

