Speech recognition and the EM algorithm


1 Speech recognition and the EM algorithm
Karthik Visweswariah, IBM Research India

2 Speech recognition: The problem
- Input: audio data of a speaker saying a sentence in English
- Output: the string of words corresponding to the words spoken
- Data resources: a large corpus (thousands of hours) of audio recordings with associated text

3 Agenda (for next two lectures)
- Overview of the statistical approach to speech recognition
- Discuss sub-components, indicating specific problems to be solved
- Deeper dive into a couple of areas with general applicability:
  - EM algorithm
    - Maximum likelihood estimation
    - Gaussian mixture models
    - The EM algorithm itself
    - Application to machine translation
  - Decision trees (?)

4 Evolution: 1960-present
- Isolated digits: filter bank analysis, time normalisation, dynamic programming
- Isolated words, continuous digits: pattern recognition, LPC analysis, clustering algorithms
- Connected words: statistical approaches, Hidden Markov Models
- Continuous speech, large vocabulary
- Speaker independence
- Speaker adaptation
- Discriminative training
- Deep learning

5 Early attempts
- Trajectory of formant frequencies (resonant frequencies of the vocal tract)
- Source: "Automatic speech recognition: A brief history of the technology development", B. H. Juang et al., 2006

6 Simpler problem
- Given the weight of a person, determine the gender of the person
- Clearly cannot be done deterministically, so model it probabilistically
- Joint distribution: P(gender, weight)
- Bayes: P(gender | weight) = P(gender) P(weight | gender) / P(weight)
- P(gender): just count up the genders of the persons in the database
- P(weight | gender):
  - Non-parametric: histogram of weights for each gender
  - Parametric: assume a Normal (Gaussian) distribution and estimate the mean/variance separately for each gender
- Choose the gender with the higher posterior probability given the weight (see the sketch below)
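A minimal sketch of this classifier in Python, assuming labelled (weight, gender) pairs; the data and variable names are illustrative, not from the lecture:

```python
import numpy as np

def fit_gaussian_bayes(weights, genders):
    """Estimate P(gender) and a Gaussian P(weight | gender) per class."""
    params = {}
    for g in set(genders):
        w = np.array([x for x, y in zip(weights, genders) if y == g])
        params[g] = {"prior": len(w) / len(weights),  # P(gender) by counting
                     "mean": w.mean(), "var": w.var()}
    return params

def classify(weight, params):
    """Pick the gender with the higher posterior P(gender | weight)."""
    def log_post(p):
        # log P(gender) + log N(weight; mean, var); P(weight) cancels in the argmax
        return (np.log(p["prior"]) - 0.5 * np.log(2 * np.pi * p["var"])
                - 0.5 * (weight - p["mean"]) ** 2 / p["var"])
    return max(params, key=lambda g: log_post(params[g]))

# Illustrative data (kg)
weights = [62, 58, 70, 55, 85, 90, 78, 95]
genders = ["F", "F", "F", "F", "M", "M", "M", "M"]
print(classify(72, fit_gaussian_bayes(weights, genders)))
```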

7 How a speech recognizer works
Audio → Signal processing → Feature vectors: x → Search: argmax_w P(w) P(x|w) → Words
(Acoustic model: P(x|w); Language model: P(w))
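In code terms, the search box is a maximization of the combined model scores over candidate word sequences; a toy sketch with stand-in scoring functions (real systems search a graph rather than a list):

```python
def decode(x, candidates, lm_logprob, am_logprob):
    """argmax over word sequences w of log P(w) + log P(x | w)."""
    return max(candidates, key=lambda w: lm_logprob(w) + am_logprob(x, w))
```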

8 How a speech recognizer works
Audio → Signal processing → Feature vectors: x → Search: argmax_w P(w) P(x|w) → Words
(Acoustic model: P(x|w); Language model: P(w))

9 Feature extraction: How do we represent the data?
- Usually the most important step in data science or data analytics; also a function of the amount of data
- Converts data into a vector of real numbers
- Example: represent documents for classification into spam/non-spam (see the sketch below)
  - Counts of various characters? Counts of various words? Are all words equally important?
  - Special characters used, colors used?
- Example: predict attrition of an employee from performance, salary, …
  - Should we capture salary change as a percentage change rather than in absolute numbers?
  - Should we look at the performance of the manager? Salaries of team members?
- Interacts with the algorithm to be used downstream
  - Is the algorithm invariant to scale? Can the algorithm handle correlations in features? What other assumptions?
- Domain/background knowledge comes in here
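As a minimal illustration of the spam example, a bag-of-words featurizer that turns a document into a count vector; the vocabulary and message are made up:

```python
from collections import Counter

def featurize(text, vocabulary):
    """Map a document to a vector of word counts over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

vocabulary = ["free", "winner", "meeting", "report"]   # made-up vocabulary
print(featurize("FREE prize winner claim your FREE gift", vocabulary))
# -> [2, 1, 0, 0]
```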

10 Signal processing
- Input: raw sampled audio; 11 kHz or 22 kHz on desktop, 8 kHz for telephony
- Output: 40-dimensional feature vectors, a fixed number of vectors per second
- Ideally:
  - Different sounds are represented differently
  - Unnecessary variations are removed: noise, speaker, channel
  - Match the modeling assumptions

11 Signal processing (contd.)
- Windowed FFT: sounds are easier to distinguish in frequency space
- Mel binning: sensitivity to frequencies is measured by listening experiments; sensitivity to a fixed difference in tone decreases with tone frequency
- Log scale: humans perceive volume on roughly a log scale
- Decorrelate the data (use a DCT); the pipeline up to this point is called MFCC
- Subtract the mean: scale invariance, channel invariance
(A sketch of this pipeline follows below.)
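A minimal numpy sketch of that pipeline, assuming common parameter choices (25 ms frames, 10 ms steps, 26 mel filters); the exact settings of the lecture's system are not given, so treat all values as illustrative:

```python
import numpy as np
from scipy.fft import dct

def mfcc(audio, sample_rate, n_mels=26, n_ceps=13,
         frame_len=0.025, frame_step=0.010, nfft=512):
    """Sketch: windowed FFT -> mel binning -> log -> DCT -> mean subtraction."""
    # 1. Overlapping frames with a Hamming window, then the power spectrum
    flen, step = int(frame_len * sample_rate), int(frame_step * sample_rate)
    frames = np.array([audio[i:i + flen] * np.hamming(flen)
                       for i in range(0, len(audio) - flen, step)])
    power = np.abs(np.fft.rfft(frames, nfft)) ** 2
    # 2. Mel binning: triangular filters evenly spaced on the mel scale
    hz2mel = lambda f: 2595 * np.log10(1 + f / 700)
    mel2hz = lambda m: 700 * (10 ** (m / 2595) - 1)
    mel_pts = mel2hz(np.linspace(0, hz2mel(sample_rate / 2), n_mels + 2))
    bins = np.floor((nfft + 1) * mel_pts / sample_rate).astype(int)
    fbank = np.zeros((n_mels, nfft // 2 + 1))
    for m in range(1, n_mels + 1):
        fbank[m - 1, bins[m - 1]:bins[m]] = np.linspace(
            0, 1, bins[m] - bins[m - 1], endpoint=False)
        fbank[m - 1, bins[m]:bins[m + 1]] = np.linspace(
            1, 0, bins[m + 1] - bins[m], endpoint=False)
    # 3. Log of mel energies (volume is perceived roughly logarithmically)
    logmel = np.log(power @ fbank.T + 1e-10)
    # 4. DCT decorrelates the log-mel features; keep the first n_ceps (MFCCs)
    ceps = dct(logmel, type=2, axis=1, norm='ortho')[:, :n_ceps]
    # 5. Mean subtraction for channel invariance
    return ceps - ceps.mean(axis=0)

# Illustrative usage on synthetic audio
audio = np.random.randn(16000)            # one second of fake 16 kHz audio
print(mfcc(audio, 16000).shape)           # (number of frames, 13)
```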

12 Signal processing (contd.)
- Model dynamics: concatenate the previous and next few feature vectors
- Project down to throw away noise and reduce computation (Linear/Fisher Discriminant Analysis)
- A linear transform is learned to match the diagonal-Gaussian modeling assumption
(A sketch of the stacking and projection follows below.)
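A minimal sketch of frame stacking plus an LDA projection using scikit-learn; the context width, dimensions, and labels are illustrative assumptions:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def stack_frames(feats, context=4):
    """Concatenate each frame with its `context` neighbours on either side."""
    padded = np.pad(feats, ((context, context), (0, 0)), mode='edge')
    return np.hstack([padded[i:i + len(feats)] for i in range(2 * context + 1)])

# Illustrative data: 1000 frames of 40-dim features with per-frame state labels
feats = np.random.randn(1000, 40)
labels = np.random.randint(0, 50, size=1000)   # e.g. context-dependent states

stacked = stack_frames(feats)                  # (1000, 360)
lda = LinearDiscriminantAnalysis(n_components=39).fit(stacked, labels)
projected = lda.transform(stacked)             # (1000, 39): noise projected away
```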

13 How a speech recognizer works
Audio → Signal processing → Feature vectors: x → Search: argmax_w P(w) P(x|w) → Words
(Acoustic model: P(x|w); Language model: P(w))

14 Language modeling

15 Language modeling
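The bodies of slides 14 and 15 did not survive the transcript. Since the recognizer diagram factors in a language model P(w) and the search slide mentions a five-gram model, here is a minimal count-based bigram language model as a stand-in illustration; the smoothing choice and data are assumptions:

```python
from collections import Counter

def train_bigram(sentences):
    """Count-based bigram LM: P(w_i | w_{i-1}) with add-one smoothing."""
    unigrams, bigrams, vocab = Counter(), Counter(), set()
    for s in sentences:
        words = ["<s>"] + s.split() + ["</s>"]
        vocab.update(words)
        unigrams.update(words[:-1])
        bigrams.update(zip(words[:-1], words[1:]))
    V = len(vocab)
    def prob(prev, word):
        return (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)
    return prob

prob = train_bigram(["the cat sat", "the dog sat", "a cat ran"])
print(prob("the", "cat"))   # 0.2: seen bigram
print(prob("the", "ran"))   # 0.1: unseen bigram, smoothed
```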

16 How a speech recognizer works
Audio → Signal processing → Feature vectors: x → Search: argmax_w P(w) P(x|w) → Words
(Acoustic model: P(x|w); Language model: P(w))

17 Acoustic modeling
- Need to model acoustic sequences given words: P(x|w)
- Obviously cannot create a model for every word; need to break words into their fundamental sounds
- Cat = K AE T: represent the pronunciation using phonemes (at IBM we used phonemes for English)
- Dictionaries: hand-created lists of words with their alternate pronunciations (see the sketch below)
- Handling new words: automatic generation of pronunciations from spellings; clearly a tricky task for English, e.g. foreign names
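A minimal sketch of a pronunciation dictionary with alternate pronunciations; the entries are illustrative:

```python
# Illustrative hand-created lexicon: word -> list of alternate phoneme sequences
lexicon = {
    "cat": [["K", "AE", "T"]],
    "the": [["DH", "AH"], ["DH", "IY"]],   # alternate pronunciations
}

def pronunciations(word, lexicon):
    """Look a word up; unseen words need automatic spelling-to-sound generation."""
    if word in lexicon:
        return lexicon[word]
    raise KeyError(f"{word!r}: generate a pronunciation from the spelling")

print(pronunciations("the", lexicon))
```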

18 Acoustic modeling (contd.)

19 Acoustic modeling (contd.)
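The bodies of slides 18 and 19 did not survive the transcript. Since the agenda lists Gaussian mixture models and the EM algorithm, and later slides refer to Gaussian mixture acoustic models, here is a minimal one-iteration EM update for a 1-D GMM as a stand-in; it is the generic textbook recipe, not necessarily the lecture's derivation:

```python
import numpy as np

def em_step(x, weights, means, vars_):
    """One EM iteration for a 1-D Gaussian mixture model."""
    # E-step: gamma[n, k] = posterior P(component k | x_n)
    gauss = (np.exp(-0.5 * (x[:, None] - means) ** 2 / vars_)
             / np.sqrt(2 * np.pi * vars_))
    gamma = weights * gauss
    gamma /= gamma.sum(axis=1, keepdims=True)
    # M-step: weighted maximum-likelihood re-estimates
    nk = gamma.sum(axis=0)                                  # soft counts
    weights = nk / len(x)
    means = (gamma * x[:, None]).sum(axis=0) / nk
    vars_ = (gamma * (x[:, None] - means) ** 2).sum(axis=0) / nk
    return weights, means, vars_

# Illustrative data: two clusters at 0 and 5
x = np.concatenate([np.random.normal(0, 1, 200), np.random.normal(5, 1, 200)])
weights, means, vars_ = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])
for _ in range(20):
    weights, means, vars_ = em_step(x, weights, means, vars_)
print(means)   # should approach the true cluster means (0 and 5)
```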

20 Acoustic modeling (contd.)
- Pronunciations change in continuous speech depending on the neighboring words: "give me" might sound more like "gimme"
- Emission probabilities should depend on context
- Use a different distribution for each different context? Even with 40 phonemes, looking two phones to either side gives 2.5 million possibilities: way too many
- Learn in which contexts the acoustics differ: tie contexts together using a decision tree
  - At each node we are allowed to ask questions about (typically) two phones to the left and right, e.g. "Is the first phoneme to the right a glottal stop?"
  - Use entropy gain to grow the tree (see the sketch below)
- End up with on the order of 2000 context-dependent states from 120 context-independent states
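A minimal sketch of scoring one candidate question by entropy gain, assuming each training context has been labelled with a discrete acoustic class; the names and data are illustrative:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of discrete labels, in bits."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def entropy_gain(contexts, labels, question):
    """Gain from splitting the data by a yes/no question about the context."""
    yes = [l for c, l in zip(contexts, labels) if question(c)]
    no = [l for c, l in zip(contexts, labels) if not question(c)]
    n = len(labels)
    return (entropy(labels)
            - len(yes) / n * entropy(yes) - len(no) / n * entropy(no))

# Illustrative: contexts are (left phone, right phone) pairs
contexts = [("K", "T"), ("K", "AE"), ("S", "T"), ("S", "AE")]
labels = ["a", "b", "a", "b"]
print(entropy_gain(contexts, labels, lambda c: c[1] == "T"))   # 1.0 bit
```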

21 Acoustic modeling

22 How a speech recognizer works
Acoustic model P(x|w) Feature vectors: x Signal processing Search: argmax P(w)P(x|w) Words Audio Language model: P(w)

23 Search

24 Search
- Current approach is to precompile the language model, dictionary, phone HMMs, and decision tree into a complete search graph
- Uses weighted finite-state machine technology heavily
- Complications:
  - The space of word histories is large (five-gram language model)
  - Context-dependent acoustic models look across word boundaries
- Need to prune to keep search running at reasonable speeds: throw away states that are far enough below the best state (see the sketch below)
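A minimal sketch of that beam-pruning rule, applied to a dictionary of active search states and their log scores; the beam width and state names are illustrative:

```python
def prune(active_states, beam=10.0):
    """Keep only states whose log score is within `beam` of the best state."""
    best = max(active_states.values())
    return {s: score for s, score in active_states.items()
            if score >= best - beam}

# Illustrative: state id -> accumulated log score
active = {"s1": -120.0, "s2": -123.5, "s3": -141.0}
print(prune(active))   # s3 falls more than 10 below s1 and is dropped
```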

25 Speaker/condition dependent systems
- Humans can certainly do better with a little data: they "adapt" to an unfamiliar accent or noise
- With minutes of data we can certainly do better too
- Could change our acoustic models (Gaussian mixture models) based on the new data
- Could change the signal processing
- The techniques described work even without supervision: do a speaker-independent decode, and pretend that the obtained word sequence is the truth (see the sketch below)
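A minimal sketch of that unsupervised loop; `decode` and `adapt` are stand-ins for the recognizer and for any of the adaptation techniques that follow, so this is pseudocode-level Python:

```python
def unsupervised_adapt(audio, decode, adapt, model, passes=2):
    """Decode with the current model, treat the hypothesis as the transcript,
    adapt the model to it, and repeat."""
    for _ in range(passes):
        hypothesis = decode(audio, model)        # speaker-independent decode first
        model = adapt(model, audio, hypothesis)  # pretend the hypothesis is truth
    return model
```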

26 Adaptation
Audio → Signal processing → Feature vectors: x → Search: argmax_w P(w) P(x|w) → Words
(Acoustic model: P(x|w); Language model: P(w))

27 Vocal tract length normalization
- Different speakers have different vocal tract lengths, so frequencies are stretched or squished
- At test time, estimate this frequency stretching/squishing and undo it
- Just a single parameter, quantized to 10 different values: try each value and pick the one that gives the best likelihood (see the sketch below)
- To get the full benefit, retrain in this canonical feature space: Gaussian mixture models and decision trees benefit from being trained in the "cleaned up" feature space
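A minimal sketch of the warp-factor search; `warp_features` and `log_likelihood` are stand-ins for the frequency warping and the acoustic model score, and the warp range is an assumption:

```python
import numpy as np

def pick_warp(audio, warp_features, log_likelihood, model):
    """Try each quantized warp factor; keep the one with the best likelihood."""
    warps = np.linspace(0.8, 1.2, 10)   # 10 quantized values (range illustrative)
    return max(warps,
               key=lambda a: log_likelihood(model, warp_features(audio, a)))
```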

28 Adaptation of models
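The body of this slide did not survive the transcript. One simple, standard way to adapt Gaussian mixture model means to a small amount of new data (not necessarily the method shown on the missing slide) is MAP interpolation between the old means and the new data:

```python
import numpy as np

def map_adapt_means(means, data, gamma, tau=10.0):
    """MAP re-estimation of GMM means: interpolate each prior mean with the
    posterior-weighted average of the adaptation data.
    gamma[n, k] = responsibility of component k for frame n (from an E-step)."""
    nk = gamma.sum(axis=0)                                    # soft counts
    xbar = (gamma.T @ data) / np.maximum(nk, 1e-8)[:, None]   # per-component mean
    w = nk / (nk + tau)                # trust the data more as its count grows
    return w[:, None] * xbar + (1 - w)[:, None] * means
```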

29 Adaptation of features
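This slide's body is also missing. The simplest feature-space adaptation, consistent with the mean subtraction mentioned in the signal-processing slides, is per-speaker mean/variance normalization of the features; richer methods estimate a full speaker-specific linear transform (the "linear transform adaptation" of the results slide):

```python
import numpy as np

def normalize_features(feats):
    """Per-speaker mean/variance normalization: remove the speaker's average
    feature offset and scale, a crude speaker/channel adaptation."""
    return (feats - feats.mean(axis=0)) / (feats.std(axis=0) + 1e-8)
```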

30 Improvements obtained
- Conversational telephony data; the test set is from a call center
- Training data: 2000 hours of Fisher data (0.7 billion frames of acoustic data)
- Language model built with hundreds of millions of words from various sources, including data from the domain of interest (call center IT help desk conversations)
- Roughly 30 million parameters in the acoustic model
- System performance measured by word error rate:
  - Speaker-independent system: %
  - Vocal tract length normalized system: 29.0%
  - Linear transform adaptation: %
  - Discriminative feature space: %
  - Discriminative training of model: %
- It's hard work to improve on the best systems; there is no silver bullet!

31 Current state of the art
Progress is tracked on Switchboard (a conversational telephony test set):

System                                    | Word error rate
1995 "high performance HMM recognizer"    | 45%
Cambridge Univ. (2000)                    | 19.3%
2004 IBM system                           | 15.2%
2015 IBM system (with deep learning)      | 8%
Estimate of human performance             | 4%

The 2015 system replaced GMMs for acoustic modeling with deep networks.
Source:

32 Conclusions
- Gave a brief overview of the components of practical, state-of-the-art speech recognition systems
- Speech recognition technology has relied on generative statistical models with parameters learned from data, moving away from hand-coded knowledge
- Discriminative estimation techniques are more expensive but give significant improvements
- Deep learning has shown significant gains for speech recognition
- Speech recognition systems are good enough to support several useful applications, but they are still sensitive to variations that humans handle with ease

