Deep Learning for Speech Recognition
Hung-yi Lee
Outline Conventional Speech Recognition How to use Deep Learning in acoustic modeling? Why Deep Learning? Speaker Adaptation Multi-task Deep Learning New acoustic features Convolutional Neural Network (CNN) Applications in Acoustic Signal Processing
Conventional Speech Recognition
Machine Learning helps Speech Recognition
Speech signal X → word sequence W (e.g., "The weather is really nice"). This is a structured learning problem.
Evaluation function: F(X, W) = P(W|X)
Inference: W* = arg max_W F(X, W) = arg max_W P(W|X) = arg max_W P(X|W)P(W) / P(X) = arg max_W P(X|W)P(W)
P(X|W): Acoustic Model; P(W): Language Model
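As a toy illustration of the inference rule above, here is a minimal sketch (pure Python, with hypothetical candidate transcriptions and made-up log scores) of picking arg max_W P(X|W)P(W) in the log domain; P(X) is dropped because it is the same for every W:

```python
def decode(candidates, acoustic_loglik, lm_logprob):
    """W* = arg max_W P(X|W) P(W), computed in the log domain.
    P(X) is omitted: it is constant over W."""
    return max(candidates, key=lambda w: acoustic_loglik[w] + lm_logprob[w])

# Hypothetical log scores for two candidate transcriptions:
am = {"the weather is nice": -10.0, "the whether is nice": -9.5}  # log P(X|W)
lm = {"the weather is nice": -3.0, "the whether is nice": -8.0}   # log P(W)
print(decode(list(am), am, lm))  # the weather is nice
```

Here the acoustic model slightly prefers the wrong homophone, but the language model overrules it: this is exactly why the product P(X|W)P(W) is decoded, not P(X|W) alone.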
Input Representation
Audio is represented by a vector sequence X = x1, x2, x3, …, where each xi is a 39-dim MFCC (Mel-scale Frequency Cepstral Coefficients) vector.
Input Representation - Splice
To consider some temporal information, splice each MFCC frame together with its neighboring frames into one longer vector.
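Splicing can be sketched in a few lines of Python; the function name and the edge-padding choice (repeat the first/last frame) are assumptions for illustration:

```python
def splice(frames, context=4):
    """Concatenate each frame with its +/- context neighbours.
    frames: list of feature vectors (each a list of floats).
    Edge frames are padded by repeating the first/last frame."""
    T = len(frames)
    spliced = []
    for t in range(T):
        window = []
        for offset in range(-context, context + 1):
            idx = min(max(t + offset, 0), T - 1)  # clamp at the edges
            window.extend(frames[idx])
        spliced.append(window)
    return spliced

# Toy example: 5 frames of 39-dim MFCC become 5 frames of 39 * 9 = 351 dims.
frames = [[float(t)] * 39 for t in range(5)]
out = splice(frames, context=4)
print(len(out), len(out[0]))  # 5 351
```

With a context of ±4 frames, each spliced vector covers 9 frames, so a 39-dim MFCC becomes a 351-dim input.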
Phoneme
Phoneme: the basic unit. Each word corresponds to a sequence of phonemes, given by a lexicon; different words can correspond to the same phonemes.
Lexicon: "what do you think" → hh w aa t / d uw / y uw / th ih ng k
State
Each phoneme corresponds to a sequence of states. For "what do you think":
Phone: hh w aa t d uw y uw th ih ng k
Tri-phone: … t-d+uw, d-uw+y, uw-y+uw, y-uw+th …
State: … t-d+uw1, t-d+uw2, t-d+uw3, d-uw+y1, d-uw+y2, d-uw+y3 …
Gaussian Mixture Model (GMM)
Each state has a stationary distribution for acoustic features, modeled by a Gaussian Mixture Model (GMM), e.g., P(x|"t-d+uw1"), P(x|"d-uw+y3").
Tied-state
Each state has a stationary distribution for acoustic features. With tied states, different states can share the same distribution: their pointers refer to the same address, so P(x|"t-d+uw1") and P(x|"d-uw+y3") may be the same model.
Acoustic Model
W* = arg max_W P(X|W)P(W), where P(X|W) = P(X|S) for the state sequence S of W (e.g., W: "what do you think").
Assume we also know the alignment h = s1 … sT, i.e., which state each frame x1 … xT belongs to:
P(X|S, h) = ∏_{t=1..T} P(st|st−1) P(xt|st)
P(st|st−1): transition; P(xt|st): emission
Acoustic Model
Actually, we don't know the alignment, so we take the best one:
P(X|S) ≈ max_{s1…sT} ∏_{t=1..T} P(st|st−1) P(xt|st) (Viterbi algorithm)
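The max over alignments is computed by dynamic programming. Below is a minimal Viterbi sketch in pure Python (log domain); the toy model with 2 states and 3 frames uses made-up probabilities:

```python
import math

def viterbi(obs_loglik, trans_logprob, init_logprob):
    """Best state sequence maximizing prod_t P(s_t|s_{t-1}) P(x_t|s_t).
    obs_loglik[t][s]    : log P(x_t | s)   (emission)
    trans_logprob[p][s] : log P(s | p)     (transition)
    init_logprob[s]     : log P(s_1 = s)"""
    T, S = len(obs_loglik), len(init_logprob)
    delta = [init_logprob[s] + obs_loglik[0][s] for s in range(S)]
    back = []
    for t in range(1, T):
        new_delta, ptr = [], []
        for s in range(S):
            best = max(range(S), key=lambda p: delta[p] + trans_logprob[p][s])
            ptr.append(best)
            new_delta.append(delta[best] + trans_logprob[best][s] + obs_loglik[t][s])
        delta, back = new_delta, back + [ptr]
    path = [max(range(S), key=lambda s: delta[s])]  # trace back the best path
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path)), max(delta)

path, score = viterbi(
    [[math.log(0.9), math.log(0.1)],   # emissions for 3 frames, 2 states
     [math.log(0.2), math.log(0.8)],
     [math.log(0.3), math.log(0.7)]],
    [[math.log(0.7), math.log(0.3)],   # transitions
     [math.log(0.4), math.log(0.6)]],
    [math.log(0.6), math.log(0.4)])    # initial probabilities
print(path)  # [0, 1, 1]
```

The best path stays in state 0 for the first frame, then switches to state 1, as the emissions suggest.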
How to use Deep Learning?
People imagine …
… feeding the whole sequence of acoustic features into a DNN that directly outputs "Hello". This cannot be true: a DNN can only take fixed-length vectors as input.
What a DNN can do is …
DNN input: one acoustic feature vector xi. DNN output: P(a|xi), P(b|xi), P(c|xi), … — the probability of each state (senones, i.e., tied tri-phone states). Size of the output layer = number of states.
An RNN may be preferred since the dynamics should be considered, but it still cannot address the whole problem. There are three ways of doing that.
Low-rank approximation
The output-layer weight matrix W is M × N, where N is the size of the last hidden layer and M is the size of the output layer (the number of states). M can be large if the outputs are the states of tri-phones.
Low-rank approximation
Approximate W ≈ U V, where U is M × K and V is K × N with K < M, N — fewer parameters. In the network, the output layer W is replaced by a linear layer V of size K followed by a layer U.
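A minimal sketch of the saving (pure Python, toy and hypothetical sizes): W x is replaced by U (V x), which computes the same product when W = U V exactly, but stores far fewer numbers:

```python
def matvec(A, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

# W = U V with toy sizes M = 3, K = 2, N = 4:
U = [[1.0, 2.0], [0.0, 1.0], [3.0, 1.0]]                  # M x K
V = [[1.0, 0.0, 2.0, 1.0], [0.5, 1.0, 0.0, 0.0]]          # K x N
x = [1.0, 2.0, 3.0, 4.0]
y = matvec(U, matvec(V, x))  # same as (U V) x, via the K-dim bottleneck
print(y)  # [16.0, 2.5, 35.5]

# Parameter count at hypothetical realistic sizes:
M, N, K = 10000, 2048, 256
print(M * N, K * (M + N))  # M*N vs K*(M+N): 20480000 vs 3084288
```

With M = 10000 tri-phone states, N = 2048 hidden units, and K = 256, the factorization cuts the output-layer parameters by roughly 6.6×.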
How we use deep learning
There are three ways to use a DNN for acoustic modeling — successive efforts to exploit deep learning further:
Way 1. Tandem
Way 2. DNN-HMM hybrid
Way 3. End-to-end
How to use Deep Learning? Way 1: Tandem
Way 1: Tandem system
"Tandem" = one after the other. The DNN outputs P(a|xi), P(b|xi), P(c|xi), … (size of output layer = number of states), which are used as a new feature xi′ — the input of your original speech recognition system. The last hidden layer or a bottleneck layer can also be used as the new feature.
How to use Deep Learning? Way 2: DNN-HMM hybrid
Way 2: DNN-HMM Hybrid
W* = arg max_W P(W|X) = arg max_W P(X|W)P(W)
The DNN gives P(st|xt). Convert it into the emission probability:
P(xt|st) = P(xt, st) / P(st) = P(st|xt) P(xt) / P(st)
P(st|xt): from the DNN; P(st): counted from training data; P(xt): the same for all states, so it can be dropped at decoding.
Way 2: DNN-HMM Hybrid
P(X|W) ≈ max_{s1…sT} ∏_{t=1..T} P(st|st−1) P(xt|st)
Substituting the DNN output:
P(X|W; θ) ≈ max_{s1…sT} ∏_{t=1..T} P(st|st−1) P(st|xt) / P(st)
P(st|st−1): from the original HMM; P(st|xt): from the DNN (parameters θ); P(st): counted from training data. This assembled vehicle works.
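The posterior-to-scaled-likelihood conversion can be sketched as follows (pure Python; the alignment, state names, and posterior values are hypothetical):

```python
import math
from collections import Counter

def scaled_loglik(posteriors, state_priors):
    """log P(x_t|s) up to a constant: log P(s|x_t) - log P(s)."""
    return [math.log(p) - math.log(q) for p, q in zip(posteriors, state_priors)]

# State priors P(s) counted from a (hypothetical) training-data alignment:
alignment = ["a", "a", "a", "b", "b", "c"]
counts = Counter(alignment)
priors = [counts[s] / len(alignment) for s in ["a", "b", "c"]]  # [1/2, 1/3, 1/6]

# DNN posteriors P(s|x_t) for one frame (hypothetical softmax output):
post = [0.7, 0.2, 0.1]
llk = scaled_loglik(post, priors)
print(llk)
```

Dividing by the prior matters: frequent states get high posteriors partly because they are frequent, and the division removes that bias before the HMM decoding.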
Way 2: DNN-HMM Hybrid — Sequential Training
W* = arg max_W P(X|W; θ) P(W)
Given training data (X1, W1), (X2, W2), …, (Xr, Wr), …
Fine-tune the DNN parameters θ such that P(Xr|Wr; θ) P(Wr) increases while P(Xr|W; θ) P(W) decreases, where W is any word sequence different from Wr.
How to use Deep Learning? Way 3: End-to-end
Way 3: End-to-end - Character
Input: acoustic features (spectrograms). Output: characters (and space) + blank (~). No phoneme and no lexicon — so no OOV problem. Trained with CTC (Connectionist Temporal Classification).
Baidu Research – Silicon Valley AI Lab: "We do not need a phoneme dictionary, nor even the concept of a 'phoneme.'"
The combination of bidirectional LSTM and CTC had been applied to character-level speech recognition before (Eyben et al., 2009), but the relatively shallow architecture used in that work did not deliver compelling results (the best character error rate was almost 20%).
Refs: A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, "Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks," ICML 2006. A. Graves and N. Jaitly, http://www.jmlr.org/proceedings/papers/v32/graves14.pdf. A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates, A. Ng, "Deep Speech: Scaling up end-to-end speech recognition," arXiv:1412.5567v2, 2014.
Way 3: End-to-end - Character
Frame-level output example (~ = blank): ~ ~ H I S ~ ~ … ~ ~ → "HIS FRIEND'S"
Graves, Alex, and Navdeep Jaitly. "Towards end-to-end speech recognition with recurrent neural networks." Proceedings of the 31st International Conference on Machine Learning (ICML-14), 2014.
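The mapping from the frame-level CTC path to the final character string — merge repeats, then drop blanks — can be sketched as (pure Python; the path below is a made-up example):

```python
def ctc_collapse(path, blank="~"):
    """Map a frame-level CTC path to an output string:
    merge repeated symbols, then remove blanks."""
    out = []
    prev = None
    for sym in path:
        if sym != prev and sym != blank:
            out.append(sym)
        prev = sym
    return "".join(out)

print(ctc_collapse(list("~~HH~II~SS~~  FF~R~~II~EE~ND~'~S~")))  # HIS FRIEND'S
```

Note the role of the blank: a genuinely doubled letter must have a blank between its two copies, so "L~L" collapses to "LL" while "LL" collapses to just "L".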
Way 3: End-to-end – Word?
Output layer: ~50k words in the lexicon (apple, bog, cat, …). Input layer: a fixed size of 200 times the acoustic-feature dimension, padded with zeros; other systems are used to get the word boundaries.
Ref: Bengio, Samy, and Georg Heigold. "Word embeddings for speech recognition," Interspeech 2014. http://bengio.abracadoudou.com/cv/publications/pdf/bengio_2014_interspeech.pdf
Why Deep Learning?
(Mainly based on the DNN-HMM hybrid.)
Deeper is Better
Word error rate (WER) on SWB: multiple hidden layers vs. a single hidden layer. For a fixed number of parameters, a deep model is clearly better than the shallow one.
Seide, Frank, Gang Li, and Dong Yu. "Conversational Speech Transcription Using Context-Dependent Deep Neural Networks." Interspeech 2011.
What does a DNN do?
Speaker normalization is automatically done in the DNN: visualizing the input acoustic features (MFCC) against the activations of the 1st and 8th hidden layers shows that speaker differences are gradually removed in deeper layers. This fits with human speech recognition, which appears to use many layers of feature extractors and event detectors.
A. Mohamed, G. Hinton, and G. Penn, "Understanding how Deep Belief Networks Perform Acoustic Modelling," ICASSP 2012.
What does a DNN do?
In ordinary acoustic models, all the states are modeled independently — not an effective way to model the human voice. The sound of a vowel is controlled by only a few factors, e.g., tongue position (high/low, front/back).
http://www.danielpovey.com/files/icassp10_multiling.pdf http://www.ipachart.com/
What does a DNN do?
Reducing the output of a hidden layer to two dimensions, the vowels /a/ /i/ /u/ /e/ /o/ arrange themselves like the vowel chart (front/back × high/low); /u/ is actually a rounded vowel — the lips form a circular opening. The lower layers detect the manner of articulation, and all the states share the results from the same set of detectors, so the parameters are used effectively. (Bottleneck network architecture: 143-1500-42-1500-152.)
Vu, Ngoc Thang, Jochen Weiner, and Tanja Schultz. "Investigating the Learning Effect of Multilingual Bottle-Neck Features for ASR." Interspeech 2014. http://www.ipachart.com/ http://csl.anthropomatik.kit.edu/downloads/354_Paper.pdf
Speaker Adaptation
Speaker Adaptation
Speaker adaptation: use different models to recognize the speech of different speakers — collect the audio data of each speaker and build a DNN model for each speaker.
Challenge: limited data for training. There is not enough data to directly train a per-speaker DNN model, and often not even enough to simply fine-tune a speaker-independent DNN model.
Supervised adaptation: with labels. Unsupervised adaptation: without labels.
Categories of Methods
Conservative training: re-train the whole DNN with some constraints.
Transformation methods: only train the parameters of one layer.
Speaker-aware training: do not really change the DNN parameters.
From top to bottom, the methods need less and less adaptation data.
Conservative Training
Initialize from a speaker-independent DNN trained on the audio data of many speakers, then re-train with a little data from the target speaker. Just fine-tuning cannot work with so little data, so an extra constraint is added: keep the adapted output (or parameters) close to those of the original network.
(To evaluate how the size of the adaptation set affects the result, the number of adaptation utterances varies from 5 (32 s) to 200 (22 min).)
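One common form of the extra constraint is an L2 penalty pulling the adapted parameters toward the speaker-independent (SI) model; a minimal sketch (the function name, loss values, and weight rho are hypothetical):

```python
def conservative_loss(ce_loss, params, params_si, rho=0.05):
    """Adaptation loss: cross-entropy plus a penalty that keeps the
    adapted parameters close to the speaker-independent (SI) ones."""
    penalty = sum((p - p0) ** 2 for p, p0 in zip(params, params_si))
    return ce_loss + rho * penalty

# Toy numbers: ce_loss = 1.2, two parameters drifting from the SI model.
loss = conservative_loss(1.2, [0.5, -0.3], [0.4, -0.1], rho=0.1)
print(loss)  # 1.2 + 0.1 * (0.1**2 + 0.2**2) = 1.205
```

The weight rho trades off fitting the tiny adaptation set against staying near the well-trained SI model; with very little data, a larger rho is safer.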
Transformation methods
Add an extra linear layer Wa between layer i and layer i+1, fix all the other parameters, and train only Wa with a little data from the target speaker. (Where the extra layer should be added is an open choice.)
Transformation methods
Add the extra layer between the input and the first layer. With splicing, the same Wa is applied to each spliced frame. A larger Wa needs more adaptation data; a smaller Wa needs less.
Transformation methods — SVD bottleneck adaptation
With the low-rank factorization W ≈ U V, insert the adaptation matrix Wa (K × K) between the linear bottleneck V (K × N) and U (M × K). K is usually small, so only a few parameters need to be adapted.
Speaker-aware Training
Represent the data of each speaker (Speaker 1, 2, 3, …) by a fixed-length, low-dimensional vector — the i-vector, originally used in speaker identification. Text transcription is not needed for extracting the vectors, so lots of mismatched data can be used. The same idea gives noise-aware, device-aware training, ……
Speaker-aware Training
In both training and testing, the acoustic features are appended with the speaker-information features. All speakers use the same DNN model; different speakers are simply augmented with different features.
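The augmentation itself is just concatenation per frame; a minimal sketch (the sizes — 39-dim MFCC, 100-dim i-vector — are hypothetical):

```python
def augment(acoustic_frames, speaker_vec):
    """Append the same speaker vector (e.g. an i-vector) to every frame."""
    return [frame + speaker_vec for frame in acoustic_frames]

# Hypothetical sizes: 3 frames of 39-dim MFCC plus a 100-dim i-vector.
frames = [[0.0] * 39 for _ in range(3)]
ivec = [0.1] * 100
aug = augment(frames, ivec)
print(len(aug), len(aug[0]))  # 3 139
```

The DNN input dimension grows from 39 to 139, but the network weights stay shared across all speakers — only the appended vector changes per speaker.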
Multi-task Learning
Multitask Learning
The multi-layer structure makes a DNN suitable for multitask learning: shared lower layers with task-specific upper layers for Task A and Task B, either from a common input feature or from separate input features for each task.
Multitask Learning - Multilingual
Shared hidden layers over the acoustic features, with separate output layers for the states of French, German, Spanish, Italian, and Mandarin. Human languages share some common characteristics.
Multitask Learning - Multilingual
Character error rate on Mandarin vs. hours of Mandarin training data: training jointly with European languages outperforms Mandarin-only.
Huang, Jui-Ting, et al. "Cross-language knowledge transfer using multilingual deep neural network with shared hidden layers." ICASSP 2013.
Multitask Learning - Different units
Train the DNN to predict two kinds of targets A and B from the same acoustic features, e.g.: A = state, B = phoneme; A = state, B = gender; A = state, B = grapheme (character).
Dongpeng Chen, B. Mak, Cheung-Chi Leung, S. Sivadas. "Joint acoustic modeling of triphones and trigraphemes by multi-task learning deep neural networks for low-resource speech recognition," ICASSP 2014.
Deep Learning for Acoustic Modeling — New acoustic features
MFCC
Waveform → DFT → spectrogram → filter bank → log → DCT → MFCC → input of DNN.
DNNs are so strong that parts of this usual cascade can be dropped.
Filter-bank Output
Waveform → DFT → spectrogram → filter bank → log → input of DNN (the final DCT is dropped). Kind of standard now.
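The pipeline above can be sketched end to end for one frame. This is a deliberately simplified version (pure Python, naive DFT, rectangular linearly spaced filters instead of the triangular mel-spaced filters real systems use; the function names are hypothetical):

```python
import math

def log_filterbank(frame, n_filters=4):
    """Simplified log filter-bank features for one windowed frame:
    DFT -> power spectrum -> band energies -> log.
    Real systems use triangular, mel-spaced filters."""
    N = len(frame)
    # Power spectrum via a naive DFT (positive frequencies only).
    spec = []
    for k in range(N // 2 + 1):
        re = sum(frame[n] * math.cos(2 * math.pi * k * n / N) for n in range(N))
        im = sum(frame[n] * math.sin(2 * math.pi * k * n / N) for n in range(N))
        spec.append(re * re + im * im)
    # Pool spectrum bins into n_filters bands, then take the log.
    band = len(spec) / n_filters
    feats = []
    for i in range(n_filters):
        lo, hi = int(i * band), int((i + 1) * band)
        energy = sum(spec[lo:hi]) + 1e-10  # floor to avoid log(0)
        feats.append(math.log(energy))
    return feats

# A 64-sample frame of a pure tone: its energy lands in a single band.
frame = [math.sin(2 * math.pi * 8 * n / 64) for n in range(64)]
feats = log_filterbank(frame, n_filters=4)
print(feats)
```

For the pure tone at DFT bin 8, the second band (bins 8-15) dominates, which is exactly what the filter bank is for: summarizing where in frequency the energy is.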
Spectrogram
Waveform → DFT → spectrogram → input of DNN: common today. Learning the filter banks inside the network yields about 5% relative improvement over the filter-bank output.
Ref: Sainath, T. N., Kingsbury, B., Mohamed, A.-R., & Ramabhadran, B., "Learning filter banks within a deep neural network framework," ASRU 2013. http://ttic.uchicago.edu/~haotang/speech/06707746.pdf
Spectrogram
Waveform?
Feed the raw waveform directly as the input of the DNN. If it succeeded, there would be no need for Signal & Systems — people have tried, but it is not better than the spectrogram yet, so you still need to take Signal & Systems ……
Ref: Tüske, Z., Golik, P., Schlüter, R., & Ney, H., "Acoustic modeling with deep neural networks using raw time signal for LVCSR," INTERSPEECH 2014.
Waveform?
Convolutional Neural Network (CNN)
CNN
Speech can be treated as an image: the spectrogram, with frequency on one axis and time on the other (also called a spectral waterfall, voiceprint, or voicegram). People even practice spectrogram reading and spectrogram painting.
http://www.ultimaserial.com/UltimaSound.html
CNN
Replace the DNN with a CNN that takes the spectrogram as an image and outputs the probabilities of the states. Why does it work?
CNN — Maxout
Max pooling over groups of units: e.g., output max(a1, b1) and max(a2, b2).
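The maxout operation above can be sketched in a few lines (pure Python; the group layout is the pairwise case from the slide):

```python
def maxout(z, group_size=2):
    """Maxout activation: split the pre-activations into groups and
    keep only the maximum of each group (no extra nonlinearity needed)."""
    assert len(z) % group_size == 0
    return [max(z[i:i + group_size]) for i in range(0, len(z), group_size)]

out = maxout([1.0, -2.0, 0.5, 3.0])  # groups (a1, b1) and (a2, b2)
print(out)  # [1.0, 3.0]
```

Because each output is the max over learned linear pieces, maxout learns its own piecewise-linear activation function instead of using a fixed one like ReLU.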
CNN
Tóth, László. "Convolutional Deep Maxout Networks for Phone Recognition," Interspeech 2014 — state-of-the-art phone recognition results (a network-in-network style architecture).
MTA-SZTE Research Group on Artificial Intelligence, Hungarian Academy of Sciences and University of Szeged, Hungary.
https://www.inf.u-szeged.hu/~tothl/pubs/IS2013.pdf http://www.inf.u-szeged.hu/~tothl/pubs/IS2014.pdf
Applications in Acoustic Signal Processing Deep Learning for Regression
DNN for Speech Enhancement
A DNN maps noisy speech to clean speech, for mobile communication or speech recognition.
Demo: http://home.ustc.edu.cn/~xuyong62/demo/SE_DNN.html
DNN for Voice Conversion
A DNN maps the voice of one speaker to another, e.g., female to male.
Ref: https://wiki.inf.ed.ac.uk/twiki/pub/CSTR/Speak14To15/sequence_error_vc_is2014.pdf
Demo: http://research.microsoft.com/en-us/projects/vcnn/default.aspx
Concluding Remarks (and probably future directions)
Concluding Remarks Conventional Speech Recognition How to use Deep Learning in acoustic modeling? Why Deep Learning? Speaker Adaptation Multi-task Deep Learning New acoustic features Convolutional Neural Network (CNN) Applications in Acoustic Signal Processing
Thank you for your attention!
More Research Related to Speech
Speech recognition ("Hello" → text) is the core technique behind many applications: spoken content retrieval (e.g., find the lectures related to "deep learning" in lecture recordings), speech summarization, computer-assisted language learning, dialogue ("I would like to leave Taipei on November 2nd"), and information extraction.
http://www.skyscanner.net/news/7-secrets-learning-language-fast