Deep Learning for Speech Recognition


1 Deep Learning for Speech Recognition
Hung-yi Lee

2 Outline
Conventional Speech Recognition
How to use Deep Learning in acoustic modeling?
Why Deep Learning?
Speaker Adaptation
Multi-task Deep Learning
New acoustic features
Convolutional Neural Network (CNN)
Applications in Acoustic Signal Processing

3 Conventional Speech Recognition

4 Machine Learning helps Speech Recognition
Speech recognition maps an acoustic signal $X$ to a word sequence $W$, e.g., $X$ → 天氣真好 ("The weather is really nice"). This is a structured learning problem.
Evaluation function: $F(X,W) = P(W|X)$
Inference:
$\hat{W} = \arg\max_W F(X,W) = \arg\max_W P(W|X) = \arg\max_W \frac{P(X|W)P(W)}{P(X)} = \arg\max_W P(X|W)P(W)$
$P(X|W)$: acoustic model; $P(W)$: language model.

5 Input Representation
Audio is represented by a vector sequence $X = x_1, x_2, x_3, \ldots$, where each $x_t$ is a 39-dimensional Mel-scale Frequency Cepstral Coefficient (MFCC) vector. A sketch of the feature extraction follows.
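A minimal sketch of 39-dim MFCC extraction, assuming the librosa library; the file name "utterance.wav" and the 16 kHz sample rate are placeholders. The standard 39 dimensions are 13 MFCCs plus their first and second derivatives:

```python
import librosa
import numpy as np

# Load audio at 16 kHz and compute 13 MFCCs per frame.
y, sr = librosa.load("utterance.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # (13, T)

# Append first and second derivatives -> 39 dimensions per frame.
delta1 = librosa.feature.delta(mfcc)
delta2 = librosa.feature.delta(mfcc, order=2)
feats = np.vstack([mfcc, delta1, delta2]).T          # (T, 39)
```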

6 Input Representation - Splice
To take some temporal information into account, several neighboring MFCC frames are spliced (joined/overlapped) into one longer input vector; a sketch follows.
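A minimal splicing sketch in NumPy; the context width of ±4 frames is an illustrative choice, not taken from the slides:

```python
import numpy as np

def splice(feats, context=4):
    """Concatenate each frame with `context` frames on each side;
    edges are padded by repeating the first/last frame."""
    T, _ = feats.shape
    padded = np.pad(feats, ((context, context), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + T] for i in range(2 * context + 1)])

spliced = splice(np.random.randn(100, 39), context=4)
print(spliced.shape)   # (100, 351): 9 frames x 39 dims
```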

7 Phoneme
The phoneme is the basic unit of pronunciation. Each word corresponds to a sequence of phonemes, listed in a lexicon:
what do you think → hh w aa t, d uw, y uw, th ih ng k
Different words can correspond to the same phonemes.

8 State
Each phoneme corresponds to a sequence of states. Phones are first expanded into context-dependent tri-phones, and each tri-phone into several states:
Phone: hh w aa t d uw y uw th ih ng k
Tri-phone: …… t-d+uw d-uw+y uw-y+uw y-uw+th ……
State: …… t-d+uw1 t-d+uw2 t-d+uw3 d-uw+y1 d-uw+y2 d-uw+y3 ……

9 Gaussian Mixture Model (GMM)
Each state has a stationary distribution over acoustic features, conventionally modeled by a Gaussian Mixture Model (GMM): for example, $P(x|\text{t-d+uw1})$ and $P(x|\text{d-uw+y3})$. A sketch follows.
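A minimal sketch of one per-state GMM using scikit-learn; the random frame data, the 8 mixture components, and the diagonal covariances are illustrative assumptions:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Frames aligned to one state, e.g. "t-d+uw1" (random placeholder data).
frames = np.random.randn(500, 39)

# One GMM per state models P(x | state).
gmm = GaussianMixture(n_components=8, covariance_type="diag").fit(frames)

x = np.random.randn(1, 39)
log_emission = gmm.score_samples(x)   # log P(x | "t-d+uw1")
```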

10 Tied-state
Each state has a stationary distribution for acoustic features, but there are too many tri-phone states to train them all. Similar states are therefore tied: $P(x|\text{t-d+uw1})$ and $P(x|\text{d-uw+y3})$ can be two pointers to the same address, i.e., they share one distribution.

11 Acoustic Model
$\hat{W} = \arg\max_W P(X|W)P(W)$, with $P(X|W) = P(X|S)$ for the state sequence $S$ of $W$.
W: what do you think?
S: s1 s2 s3 s4 s5 ……
X: x1 x2 x3 x4 x5 ……
Assume we also know the alignment $h = s_1 \cdots s_T$ of states to frames. Then
$P(X|S,h) = \prod_{t=1}^{T} P(s_t|s_{t-1}) P(x_t|s_t)$
where $P(s_t|s_{t-1})$ is the transition probability and $P(x_t|s_t)$ the emission probability.

12 Acoustic Model
$\hat{W} = \arg\max_W P(X|W)P(W)$, with $P(X|W) = P(X|S)$.
Actually, we don't know the alignment, so we take the best one:
$P(X|S) \approx \max_{s_1 \cdots s_T} \prod_{t=1}^{T} P(s_t|s_{t-1}) P(x_t|s_t)$
computed with the Viterbi algorithm. A sketch follows.
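A minimal NumPy sketch of the Viterbi maximization over state sequences; the uniform initial-state assumption is mine, not from the slides:

```python
import numpy as np

def viterbi_score(log_trans, log_emit):
    """Best score over all state sequences:
    max_{s_1..s_T} sum_t [ log P(s_t|s_{t-1}) + log P(x_t|s_t) ].

    log_trans: (S, S), log_trans[i, j] = log P(s_t = j | s_{t-1} = i)
    log_emit:  (T, S), log_emit[t, s]  = log P(x_t | s)
    """
    T, S = log_emit.shape
    score = log_emit[0].copy()   # uniform initial state assumed here
    for t in range(1, T):
        # best predecessor for each current state, plus current emission
        score = np.max(score[:, None] + log_trans, axis=0) + log_emit[t]
    return score.max()
```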

13 How to use Deep Learning?

14 People imagine ……
…that the DNN maps a whole sequence of acoustic features directly to a word such as "Hello". This cannot be true: a DNN can only take fixed-length vectors as input.

15 What DNN can do is ……
DNN input: one acoustic feature vector $x_i$.
DNN output: the probability of each state, $P(a|x_i), P(b|x_i), P(c|x_i), \ldots$; the size of the output layer equals the number of states, i.e., senones (tied tri-phone states).
Dynamics should still be considered. An RNN may be preferred, but by itself it cannot address the problem either; there are three ways of doing that, described in the following slides.

16 Low rank approximation
The output-layer weight matrix $W$ is $M \times N$, where $N$ is the size of the last hidden layer and $M$ is the size of the output layer, i.e., the number of states. $M$ can be large if the outputs are the states of tri-phones.

17 Low rank approximation
Factorize $W \approx UV$, where $U$ is $M \times K$, $V$ is $K \times N$, and $K < M, N$. This is equivalent to inserting a small linear layer of size $K$ before the output layer, and it has fewer parameters: $K(M+N)$ instead of $MN$. A sketch follows.
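A minimal sketch of the low-rank factorization via truncated SVD; the layer sizes are illustrative assumptions:

```python
import numpy as np

def low_rank_factorize(W, K):
    """Approximate W (M x N) by U (M x K) @ V (K x N) using truncated SVD."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U[:, :K] * s[:K], Vt[:K, :]   # fold singular values into U

M, N, K = 8000, 2048, 256
W = np.random.randn(M, N)
U, V = low_rank_factorize(W, K)
print(W.size, U.size + V.size)   # 16,384,000 vs 2,572,288 parameters
```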

18 How we use deep learning
There are three ways to use a DNN for acoustic modeling:
Way 1: Tandem
Way 2: DNN-HMM hybrid
Way 3: End-to-end
All are efforts to exploit deep learning.

19 How to use Deep Learning? Way 1: Tandem

20 Way 1: Tandem system
The DNN outputs $P(a|x_i), P(b|x_i), P(c|x_i), \ldots$ (output-layer size = number of states) are taken as a new feature $x_i'$, which becomes the input of your original speech recognition system; the two systems thus work in tandem (one behind the other). The last hidden layer or a bottleneck layer can also be used instead of the output layer.

21 How to use Deep Learning? Way 2: DNN-HMM hybrid

22 Way 2: DNN-HMM Hybrid
$\hat{W} = \arg\max_W P(W|X) = \arg\max_W P(X|W)P(W)$
The DNN gives $P(s_t|x_t)$. Convert it into the emission probability by Bayes' rule:
$P(x_t|s_t) = \frac{P(x_t, s_t)}{P(s_t)} = \frac{P(s_t|x_t) P(x_t)}{P(s_t)}$
Here $P(s_t|x_t)$ comes from the DNN and $P(s_t)$ is counted from the training data; $P(x_t)$ does not depend on the state sequence and can be ignored during decoding.

23 Way 2: DNN-HMM Hybrid
$P(X|W) \approx \max_{s_1 \cdots s_T} \prod_{t=1}^{T} P(s_t|s_{t-1}) P(x_t|s_t) \approx \max_{s_1 \cdots s_T} \prod_{t=1}^{T} P(s_t|s_{t-1}) \frac{P(s_t|x_t)}{P(s_t)} = P(X|W;\theta)$
$P(s_t|s_{t-1})$ comes from the original HMM; $P(s_t|x_t)$ comes from the DNN with parameters $\theta$; $P(s_t)$ is counted from the training data. This assembled vehicle works ……. A sketch of the conversion follows.
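A minimal sketch of the posterior-to-likelihood conversion; the function and variable names are mine. The result can be fed to the Viterbi sketch shown earlier as `log_emit`:

```python
import numpy as np

def scaled_log_likelihoods(log_posteriors, state_priors):
    """Turn DNN outputs log P(s|x_t) into scaled log-likelihoods:
    log P(x_t|s) = log P(s|x_t) - log P(s), up to the constant log P(x_t).

    log_posteriors: (T, S) log-softmax outputs of the DNN
    state_priors:   (S,)  state frequencies counted from training alignments
    """
    return log_posteriors - np.log(state_priors)[None, :]
```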

24 Way 2: DNN-HMM Hybrid - Sequential Training
$\hat{W} = \arg\max_W P(X|W;\theta)P(W)$
Given training data $(X^1, \hat{W}^1), (X^2, \hat{W}^2), \cdots, (X^r, \hat{W}^r), \cdots$, fine-tune the DNN parameters $\theta$ such that $P(X^r|\hat{W}^r;\theta)P(\hat{W}^r)$ increases while $P(X^r|W;\theta)P(W)$ decreases, where $W$ is any word sequence different from $\hat{W}^r$.

25 How to use Deep Learning? Way 3: End-to-end

26 Way 3: End-to-end - Character
Input: acoustic features (spectrograms). Output: characters (and space) plus a null symbol (~). There is no phoneme model and no lexicon, so there is no OOV problem. Training uses CTC (Connectionist Temporal Classification). As Baidu Research's Deep Speech paper puts it: "We do not need a phoneme dictionary, nor even the concept of a 'phoneme.'" The combination of bidirectional LSTM and CTC had been applied to character-level speech recognition before (Eyben et al., 2009), but the relatively shallow architecture used in that work did not deliver compelling results (the best character error rate was almost 20%).
A. Graves, S. Fernandez, F. Gomez, and J. Schmidhuber, "Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks," ICML 2006, pp. 369-376.
A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates, A. Ng, "Deep Speech: Scaling up end-to-end speech recognition," arXiv, 2014.
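A minimal sketch of a CTC training step in PyTorch; the bidirectional LSTM size, the 28-symbol character vocabulary, and all tensor shapes are illustrative assumptions:

```python
import torch
import torch.nn as nn

# Vocabulary: blank (~) = index 0, then 27 characters (a-z and space).
vocab_size = 28
rnn = nn.LSTM(input_size=39, hidden_size=128,
              bidirectional=True, batch_first=True)
proj = nn.Linear(2 * 128, vocab_size)
ctc = nn.CTCLoss(blank=0)

feats = torch.randn(4, 200, 39)                   # 4 utterances, 200 frames
targets = torch.randint(1, vocab_size, (4, 30))   # character labels, no blanks
input_lens = torch.full((4,), 200, dtype=torch.long)
target_lens = torch.full((4,), 30, dtype=torch.long)

hidden, _ = rnn(feats)
log_probs = proj(hidden).log_softmax(dim=-1)      # (N, T, C)
loss = ctc(log_probs.transpose(0, 1), targets, input_lens, target_lens)
loss.backward()
```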

27 Way 3: End-to-end - Character
[Figure: network output over time; the characters of "HIS FRIEND'S" are emitted interspersed with many null symbols (~), which are removed to give the final transcription.]
Graves, Alex, and Navdeep Jaitly, "Towards end-to-end speech recognition with recurrent neural networks," Proceedings of the 31st International Conference on Machine Learning (ICML 2014).

28 Way 3: End-to-end - Word?
The DNN output layer has one unit per word in the lexicon (~50k words): apple, bag, cat, …. The input covers a fixed window of about 200 acoustic feature frames, padded with zeros for shorter words; other systems are used to get the word boundaries. (How about our homework?)
Ref: Bengio, Samy, and Georg Heigold, "Word embeddings for speech recognition," Interspeech 2014.

29 Why Deep Learning?
Mainly based on the DNN-HMM hybrid.

30 Deeper is better?
[Figure: word error rate (WER) on the Switchboard (SWB) task versus network depth; networks with multiple hidden layers beat the 1-hidden-layer baseline. Deeper is better.]
Seide, Frank, Gang Li, and Dong Yu, "Conversational Speech Transcription Using Context-Dependent Deep Neural Networks," Interspeech 2011.

31 Deeper is better?
[Same figure as the previous slide.] For a fixed number of parameters, a deep model is clearly better than the shallow one.
Seide, Frank, Gang Li, and Dong Yu, "Conversational Speech Transcription Using Context-Dependent Deep Neural Networks," Interspeech 2011.

32 What does DNN do?
[Figure: visualization of the input acoustic features (MFCC).] Speaker normalization is automatically done inside the DNN, so the lower layers are useful for speaker adaptation. This fits with human speech recognition, which appears to use many layers of feature extractors and event detectors.
A. Mohamed, G. Hinton, and G. Penn, "Understanding how Deep Belief Networks Perform Acoustic Modelling," ICASSP 2012.

33 What does DNN do?
[Figure: the same visualization at the 8-th hidden layer, where the features of different speakers have been normalized.]
A. Mohamed, G. Hinton, and G. Penn, "Understanding how Deep Belief Networks Perform Acoustic Modelling," ICASSP 2012.

34 What does DNN do?
In ordinary acoustic models, all the states are modeled independently. This is not an effective way to model the human voice: the sound of a vowel is controlled by only a few factors, such as tongue height (high/low) and tongue position (front/back).

35 What does DNN do?
[Figure: output of a hidden layer reduced to two dimensions; the vowels /i/, /u/, /e/, /o/, /a/ arrange themselves by tongue height (high/low) and position (front/back), matching the vowel chart. /u/ is actually a rounded vowel: the lips form a circular opening.] The lower layers detect the manner of articulation, and all the states share the results from the same set of detectors, so the parameters are used effectively.
Vu, Ngoc Thang, Jochen Weiner, and Tanja Schultz, "Investigating the Learning Effect of Multilingual Bottle-Neck Features for ASR," Interspeech.

36 Speaker Adaptation

37 Speaker Adaptation
Speaker adaptation: use different models to recognize the speech of different speakers. Collect the audio data of each speaker and build a DNN model for each speaker. Challenge: limited data for training; there is not enough data to directly train a DNN model, and often not even enough to simply fine-tune a speaker-independent DNN model (much less improvement has been observed with fDLR when the network goes from shallow to deep [30]).
Supervised adaptation: with labels. Unsupervised adaptation: without labels.

38 Categories of Methods
Conservative training: re-train the whole DNN with some constraints.
Transformation methods: only train the parameters of one layer.
Speaker-aware training: do not really change the DNN parameters.
The later methods need less training data.

39 Conservative Training
Initialize from a speaker-independent DNN trained on audio data of many speakers, then re-train with a little data from the target speaker under an extra constraint: the output (or the parameters) of the adapted network must stay close to the original network. Just re-training without the constraint cannot work. To evaluate how the size of the adaptation set affects the result, the number of adaptation utterances can be varied from 5 (32 s) to 200 (22 min). A sketch follows.
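A minimal sketch of conservative adaptation in PyTorch, using a parameter-closeness penalty; the layer sizes, learning rate, and penalty weight rho are illustrative assumptions:

```python
import torch
import torch.nn as nn

# Fine-tune on the target speaker while penalizing drift from the
# speaker-independent (SI) weights.
si_model = nn.Sequential(nn.Linear(351, 512), nn.ReLU(), nn.Linear(512, 2000))
adapted = nn.Sequential(nn.Linear(351, 512), nn.ReLU(), nn.Linear(512, 2000))
adapted.load_state_dict(si_model.state_dict())   # initialize from the SI model

si_params = [p.detach().clone() for p in si_model.parameters()]
opt = torch.optim.SGD(adapted.parameters(), lr=1e-3)
ce = nn.CrossEntropyLoss()

def adaptation_step(x, y, rho=0.1):
    opt.zero_grad()
    loss = ce(adapted(x), y)
    # extra constraint: keep adapted parameters close to the SI parameters
    loss = loss + rho * sum(((p - q) ** 2).sum()
                            for p, q in zip(adapted.parameters(), si_params))
    loss.backward()
    opt.step()
    return loss.item()
```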

40 Transformation methods
Add an extra layer between layer $i$ and layer $i+1$ and fix all the other parameters; only the extra transformation $W_a$ is trained with the little data from the target speaker. (Where should the extra layer be added? I don't know.)

41 Transformation methods
Add the extra layer between the input and the first layer. With splicing, the extra transformation can either be one larger $W_a$ over the whole spliced vector, or one smaller $W_a$ shared across the spliced frames: use the larger $W_a$ when more adaptation data is available and the smaller shared $W_a$ when there is less data. A sketch of the shared version follows.
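A sketch of the shared per-frame input transformation in PyTorch; the class name, the identity initialization, and the layer sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn

feat_dim, context = 39, 4
frames = 2 * context + 1                          # 9 spliced frames

class LinearInputNetwork(nn.Module):
    def __init__(self, base_model):
        super().__init__()
        self.wa = nn.Linear(feat_dim, feat_dim)   # one Wa shared by all frames
        nn.init.eye_(self.wa.weight)
        nn.init.zeros_(self.wa.bias)              # start as the identity map
        self.base = base_model
        for p in self.base.parameters():          # fix all other parameters
            p.requires_grad_(False)

    def forward(self, x):                         # x: (batch, frames * feat_dim)
        x = x.view(-1, frames, feat_dim)
        x = self.wa(x)                            # same Wa on every frame
        return self.base(x.reshape(-1, frames * feat_dim))
```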

42 Transformation methods
SVD bottleneck adaptation: with the low-rank output layer $W \approx UV$ (slide 17), insert the adaptation matrix $W_a$ as a $K \times K$ linear layer between $V$ ($K \times N$) and $U$ ($M \times K$). $K$ is usually small, so $W_a$ has few parameters to adapt.

43 Speaker-aware Training
From the data of each speaker (Speaker 1, Speaker 2, Speaker 3, …), extract a fixed-length, low-dimensional vector that characterizes the speaker; the i-vector, originally used in speaker identification, is a typical choice. Text transcriptions are not needed for extracting the vectors, so lots of mismatched data can be used. The same idea also gives noise-aware or device-aware training.

44 Speaker-aware Training
Both in training and in testing, the acoustic features are appended with the speaker-information features. All the speakers use the same DNN model; different speakers are simply augmented with different features. A sketch follows.
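A minimal sketch of the feature augmentation; the frame count and the 100-dim i-vector are illustrative placeholders:

```python
import numpy as np

def append_speaker_vector(feats, spk_vec):
    """Append the same fixed-length speaker vector to every frame."""
    T = feats.shape[0]
    return np.hstack([feats, np.tile(spk_vec, (T, 1))])

frames = np.random.randn(200, 39)    # placeholder acoustic frames
ivector = np.random.randn(100)       # placeholder 100-dim i-vector
augmented = append_speaker_vector(frames, ivector)
print(augmented.shape)               # (200, 139)
```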

45 Multi-task Learning

46 Multitask Learning
The multi-layer structure makes the DNN suitable for multitask learning: Task A and Task B share the lower layers and have separate output layers. The input can be a common feature, or each task can have its own input features mapped into the shared representation. A sketch of the shared-layer version follows.
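A minimal PyTorch sketch of shared layers with two task-specific heads; the class name, layer sizes, and label counts are illustrative assumptions:

```python
import torch
import torch.nn as nn

class MultitaskDNN(nn.Module):
    def __init__(self, in_dim=351, hidden=512, n_states=2000, n_phones=50):
        super().__init__()
        self.shared = nn.Sequential(               # layers shared by both tasks
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU())
        self.head_a = nn.Linear(hidden, n_states)  # Task A: tied states
        self.head_b = nn.Linear(hidden, n_phones)  # Task B: e.g., phonemes

    def forward(self, x):
        h = self.shared(x)
        return self.head_a(h), self.head_b(h)

model = MultitaskDNN()
ce = nn.CrossEntropyLoss()
x = torch.randn(8, 351)
ya = torch.randint(0, 2000, (8,))
yb = torch.randint(0, 50, (8,))
out_a, out_b = model(x)
loss = ce(out_a, ya) + ce(out_b, yb)   # joint objective over both tasks
loss.backward()
```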

47 Multitask Learning - Multilingual
The hidden layers are shared across languages, while each language (French, German, Spanish, Italian, Mandarin) has its own output layer over its states. Human languages share some common characteristics, so all of them can learn from the same acoustic features.

48 Multitask Learning - Multilingual
[Figure: character error rate on Mandarin versus hours of Mandarin training data; training jointly with European languages lowers the error rate compared with Mandarin-only.]
Huang, Jui-Ting, et al., "Cross-language knowledge transfer using multilingual deep neural network with shared hidden layers," ICASSP 2013.

49 Multitask Learning - Different units
Task A is always state prediction; Task B can be the phoneme, the gender, or the grapheme (character). Both tasks share the same acoustic features.
Dongpeng Chen, B. Mak, Cheung-Chi Leung, S. Sivadas, "Joint acoustic modeling of triphones and trigraphemes by multi-task learning deep neural networks for low-resource speech recognition," ICASSP 2014.

50 Deep Learning for Acoustic Modeling
New acoustic features

51 MFCC
The conventional pipeline: waveform → DFT → spectrogram → filter bank → log → DCT → 39-dim MFCC → input of DNN. DNNs are so strong that we can question whether this whole cascade is needed.

52 Filter-bank Output
Stop before the DCT: waveform → DFT → spectrogram → filter bank → log → input of DNN. Using log filter-bank outputs as the input is kind of standard now; a sketch follows.
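A minimal sketch of log mel filter-bank ("fbank") features, assuming librosa; the 40 channels and the 25 ms / 10 ms window and hop (400 / 160 samples at 16 kHz) are illustrative choices:

```python
import librosa
import numpy as np

y, sr = librosa.load("utterance.wav", sr=16000)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400,
                                     hop_length=160, n_mels=40)
log_fbank = np.log(mel + 1e-10).T   # (frames, 40), DNN-ready
```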

53 Spectrogram
Go one step further back: waveform → DFT → spectrogram → input of DNN. This is common today, and about 5% relative improvement over filter-bank output has been reported.
Ref: Sainath, T. N., Kingsbury, B., Mohamed, A. R., & Ramabhadran, B., "Learning filter banks within a deep neural network framework," Automatic Speech Recognition and Understanding (ASRU), 2013.

54 Spectrogram

55 Waveform?
Can the DNN take the raw waveform directly as input? (If this succeeds, no more Signals & Systems!) People have tried, but the results are not better than spectrogram input yet, so we still need to take Signals & Systems ……
Ref: Tüske, Z., Golik, P., Schlüter, R., & Ney, H., "Acoustic modeling with deep neural networks using raw time signal for LVCSR," INTERSPEECH 2014.

56 Waveform?

57 Convolutional Neural Network (CNN)

58 CNN
Speech can be treated as an image: the spectrogram has frequency and time as its two axes. Spectrograms are also called spectral waterfalls, voiceprints, or voicegrams; some people can even read spectrograms, and some can paint them.

59 CNN
Replace the DNN with a CNN: the spectrogram is taken as an image, passed through convolutional layers, and the network outputs the probabilities of states. Why does it work? (Patterns in the spectrogram are local in time and frequency, which suits convolution.)

60 CNN
Max pooling in a CNN is the same operation as a maxout unit: for example, the pairs $(a_1, b_1)$ and $(a_2, b_2)$ are each reduced by taking the max. A sketch follows.
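A minimal sketch of the maxout operation over groups of channels; the group size of 2 matches the pairs on the slide:

```python
import torch

def maxout(x, group_size=2):
    """x: (batch, channels); take the max within each group of channels."""
    b, c = x.shape
    return x.view(b, c // group_size, group_size).max(dim=-1).values

x = torch.tensor([[1.0, 3.0, 2.0, 0.5]])   # (a1, b1, a2, b2)
print(maxout(x))                           # tensor([[3.0, 2.0]])
```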

61 CNN
Tóth, László, "Convolutional Deep Maxout Networks for Phone Recognition," Interspeech 2014 (MTA-SZTE Research Group on Artificial Intelligence, Hungarian Academy of Sciences and University of Szeged, Hungary). Combining convolution, maxout, and network-in-network, this gives the state-of-the-art phone recognition results.

62 Applications in Acoustic Signal Processing
Here deep learning is used for regression rather than classification.

63 DNN for Speech Enhancement
A DNN maps noisy speech to clean speech, useful for mobile communication or as a front end for speech recognition. A sketch follows. Demo for speech enhancement:
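A minimal sketch of enhancement as regression in PyTorch; the 257-bin log-spectral frames, layer sizes, and random placeholder data are illustrative assumptions:

```python
import torch
import torch.nn as nn

# Map log-spectral frames of noisy speech to clean targets (regression).
net = nn.Sequential(nn.Linear(257, 1024), nn.ReLU(),
                    nn.Linear(1024, 1024), nn.ReLU(),
                    nn.Linear(1024, 257))   # output dim = input dim
mse = nn.MSELoss()

noisy = torch.randn(32, 257)    # placeholder noisy log-spectra
clean = torch.randn(32, 257)    # placeholder clean log-spectra targets
loss = mse(net(noisy), clean)   # regression loss, not classification
loss.backward()
```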

64 DNN for Voice Conversion
A DNN converts one speaker's voice into another's, e.g., female to male. Demo for Voice Conversion:

65 Concluding Remarks (and probable future directions)

66 Concluding Remarks
Conventional Speech Recognition
How to use Deep Learning in acoustic modeling?
Why Deep Learning?
Speaker Adaptation
Multi-task Deep Learning
New acoustic features
Convolutional Neural Network (CNN)
Applications in Acoustic Signal Processing

67 Thank you for your attention!

68 More Research Related to Speech
Speech recognition ("你好" → "Hello") is the core technique behind many other research topics: spoken content retrieval (find the lectures related to "deep learning" in lecture recordings), speech summarization (summarize those lectures), computer-assisted language learning, dialogue ("Hi" / "Hello"), and information extraction ("I would like to leave Taipei on November 2nd").

