Deep Learning for Speech Recognition

Presentation transcript:

Deep Learning for Speech Recognition
Hung-yi Lee

Outline
Conventional Speech Recognition
How to use Deep Learning in acoustic modeling?
Why Deep Learning?
Speaker Adaptation
Multi-task Deep Learning
New acoustic features
Convolutional Neural Network (CNN)
Applications in Acoustic Signal Processing

Conventional Speech Recognition

Machine Learning helps Speech Recognition
Speech recognition maps an acoustic signal X to a word sequence W, e.g., the Mandarin utterance 天氣真好 ("the weather is really nice"). This is a structured learning problem.
Evaluation function: $F(X, W) = P(W|X)$
Inference:
$\hat{W} = \arg\max_W F(X, W) = \arg\max_W P(W|X) = \arg\max_W \frac{P(X|W)\,P(W)}{P(X)} = \arg\max_W P(X|W)\,P(W)$
$P(X|W)$: acoustic model; $P(W)$: language model.

Input Representation
Audio is represented by a vector sequence X = x1, x2, x3, …, where each x_t is a 39-dimensional Mel-scale Frequency Cepstral Coefficient (MFCC) vector.
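A minimal sketch of extracting such 39-dim features (13 static MFCCs plus deltas and delta-deltas), assuming the librosa library and a hypothetical file utterance.wav; the 25 ms / 10 ms framing is a common choice, not something fixed by the slide:

```python
import numpy as np
import librosa

# hypothetical input file; 16 kHz is a typical ASR sampling rate
y, sr = librosa.load("utterance.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=400, hop_length=160)  # 25 ms / 10 ms frames
delta = librosa.feature.delta(mfcc)
delta2 = librosa.feature.delta(mfcc, order=2)
X = np.vstack([mfcc, delta, delta2]).T  # (num_frames, 39) feature sequence
```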

Input Representation - Splice
To consider some temporal information, each MFCC frame is spliced (i.e., joined end-to-end) with its neighboring frames before being fed to the model.
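A small numpy sketch of splicing; the +/-4-frame context size is illustrative:

```python
import numpy as np

def splice(X, context=4):
    """Concatenate each frame with its +/- `context` neighbors.
    X: (T, D) feature matrix -> (T, D * (2*context + 1))."""
    T, D = X.shape
    padded = np.pad(X, ((context, context), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + T] for i in range(2 * context + 1)])

X = np.random.randn(100, 39)        # e.g., 100 frames of 39-dim MFCC
X_spliced = splice(X, context=4)    # (100, 351): 9 frames stitched together
```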

Phoneme
The phoneme is the basic unit of speech; each word corresponds to a sequence of phonemes, as listed in a lexicon. For example, "what do you think" → hh w aa t / d uw / y uw / th ih ng k. Different words can correspond to the same phoneme sequence.

State
Each phoneme corresponds to a sequence of states. For "what do you think":
Phones: hh w aa t d uw y uw th ih ng k
Tri-phones: … t-d+uw, d-uw+y, uw-y+uw, y-uw+th, …
States: … t-d+uw1, t-d+uw2, t-d+uw3, d-uw+y1, d-uw+y2, d-uw+y3, …

Gaussian Mixture Model (GMM)
Each state has a stationary distribution over acoustic features, modeled by a Gaussian Mixture Model (GMM): for example, P(x|"t-d+uw1") and P(x|"d-uw+y3").
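A minimal numpy sketch of evaluating such a state's emission density under a diagonal-covariance GMM; the component count and parameter values are purely illustrative:

```python
import numpy as np

def gmm_log_density(x, weights, means, variances):
    """log P(x | state) under a diagonal-covariance GMM.
    weights: (C,); means, variances: (C, D); x: (D,)."""
    log_comp = (np.log(weights)
                - 0.5 * np.sum(np.log(2 * np.pi * variances)
                               + (x - means) ** 2 / variances, axis=1))
    return np.logaddexp.reduce(log_comp)  # log-sum-exp over components

# toy 2-component GMM over 39-dim features
rng = np.random.default_rng(0)
w = np.array([0.4, 0.6])
mu = rng.normal(size=(2, 39))
var = np.ones((2, 39))
print(gmm_log_density(rng.normal(size=39), w, mu, var))
```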

Tied states
Each state has a stationary distribution for acoustic features, but there are too many tri-phone states to model them all separately, so states are tied: two different states can be pointers to the same address, sharing one distribution. For example, P(x|"t-d+uw1") and P(x|"d-uw+y3") may point to the same GMM.

Acoustic Model
$\hat{W} = \arg\max_W P(X|W)\,P(W)$, where $P(X|W) = P(X|S)$: the word sequence W ("what do you think") determines a state sequence $S = s_1 s_2 s_3 s_4 s_5 \cdots$. Assume for now that we also know the alignment $s_1 \cdots s_T$ between the states and the acoustic frames $x_1 x_2 x_3 x_4 x_5 \cdots$. Then
$P(X|S) = \prod_{t=1}^{T} P(s_t|s_{t-1})\,P(x_t|s_t)$
where $P(s_t|s_{t-1})$ is the transition probability and $P(x_t|s_t)$ is the emission probability.

Acoustic Model
Actually, we don't know the alignment, so we maximize over all state sequences consistent with W:
$P(X|W) \approx \max_{s_1 \cdots s_T} \prod_{t=1}^{T} P(s_t|s_{t-1})\,P(x_t|s_t)$
which is computed efficiently by the Viterbi algorithm.
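A compact numpy sketch of the Viterbi max-product recursion over log-probabilities, assuming a uniform initial-state distribution and toy sizes:

```python
import numpy as np

def viterbi(log_trans, log_emit):
    """Log-score of the best state path.
    log_trans: (S, S) log P(s_t | s_{t-1}); log_emit: (T, S) log P(x_t | s_t)."""
    T, S = log_emit.shape
    score = log_emit[0].copy()  # uniform initial state assumed
    for t in range(1, T):
        # for each current state, take the best previous state
        score = np.max(score[:, None] + log_trans, axis=0) + log_emit[t]
    return np.max(score)        # log of the max-alignment P(X|W)

# toy example: 3 states, 4 frames
rng = np.random.default_rng(0)
log_trans = np.log(rng.dirichlet(np.ones(3), size=3))
log_emit = rng.normal(size=(4, 3))
print(viterbi(log_trans, log_emit))
```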

How to use Deep Learning?

People imagine …
… that a DNN takes the whole sequence of acoustic features as input and directly outputs the word ("Hello"). This cannot be true: a DNN can only take fixed-length vectors as input.

What a DNN can do is …
DNN input: one acoustic feature vector x_i. DNN output: P(a|x_i), P(b|x_i), P(c|x_i), …, the probability of each state; the size of the output layer equals the number of states, i.e., the senones (tied tri-phone states). An RNN may be preferred, but it still cannot address the whole problem: the dynamics across frames must be considered. There are three ways of doing that, introduced below.
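A minimal numpy sketch of this frame-wise classifier: one hidden layer mapping a spliced frame to senone posteriors. The sizes and random weights are purely illustrative:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# hypothetical sizes: 351-dim spliced input, 1000 senone states
rng = np.random.default_rng(0)
W1, b1 = rng.normal(scale=0.01, size=(351, 512)), np.zeros(512)
W2, b2 = rng.normal(scale=0.01, size=(512, 1000)), np.zeros(1000)

def dnn_posteriors(x):
    """P(state | x_t) for one spliced frame x of shape (351,)."""
    h = np.maximum(0, x @ W1 + b1)  # ReLU hidden layer
    return softmax(h @ W2 + b2)     # senone posteriors, sums to 1

p = dnn_posteriors(rng.normal(size=351))
```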

Low rank approximation
The output-layer weight matrix W is M × N, where N is the size of the last hidden layer and M is the size of the output layer, i.e., the number of states. M can be very large when the outputs are tri-phone states.

Low rank approximation
Approximate W (M × N) by the product U V, where U is M × K, V is K × N, and K < M, N. This means fewer parameters: MK + KN instead of MN. Architecturally, the single output layer W is replaced by a linear layer V of size K followed by an output layer U.
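A sketch of the factorization via truncated SVD, with illustrative layer sizes, showing the parameter saving:

```python
import numpy as np

def low_rank_factorize(W, K):
    """Approximate W (M x N) by U (M x K) @ V (K x N) via truncated SVD."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U[:, :K] * s[:K], Vt[:K]  # absorb singular values into U

M, N, K = 8000, 2048, 256            # illustrative sizes
W = np.random.randn(M, N)
U, V = low_rank_factorize(W, K)
print(W.size, U.size + V.size)       # 16,384,000 vs 2,572,288 parameters
```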

How we use deep learning
There are three ways to use a DNN for acoustic modeling:
Way 1: Tandem
Way 2: DNN-HMM hybrid
Way 3: End-to-end
These represent increasing efforts to exploit deep learning.

How to use Deep Learning? Way 1: Tandem

Way 1: Tandem system
The DNN takes x_i as input and outputs $P(a|x_i), P(b|x_i), P(c|x_i), \cdots$ (output layer size = number of states). This output vector is used as a new feature $x_i'$, which becomes the input of your original speech recognition system, hence "tandem" (one system placed behind the other). Using the last hidden layer or a bottleneck layer as the new feature is also possible.

How to use Deep Learning? Way 2: DNN-HMM hybrid

Way 2: DNN-HMM Hybrid
$\hat{W} = \arg\max_W P(W|X) = \arg\max_W P(X|W)\,P(W)$
The DNN provides $P(s_t|x_t)$. By Bayes' rule, the emission probability becomes
$P(x_t|s_t) = \frac{P(x_t, s_t)}{P(s_t)} = \frac{P(s_t|x_t)\,P(x_t)}{P(s_t)}$
where $P(s_t|x_t)$ comes from the DNN, $P(s_t)$ is counted from the training data, and $P(x_t)$ does not depend on the state sequence, so it can be ignored in decoding.

Way 2: DNN-HMM Hybrid
$P(X|W) \approx \max_{s_1\cdots s_T} \prod_{t=1}^{T} P(s_t|s_{t-1})\,P(x_t|s_t)$
Substituting the DNN posterior for the emission term gives
$P(X|W;\theta) \approx \max_{s_1\cdots s_T} \prod_{t=1}^{T} P(s_t|s_{t-1})\,\frac{P(s_t|x_t)}{P(s_t)}$
where $P(s_t|s_{t-1})$ comes from the original HMM, $P(s_t|x_t)$ from the DNN with parameters $\theta$, and $P(s_t)$ is counted from the training data. This assembled vehicle works.
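A small sketch of converting DNN posteriors into the scaled emission scores used above; the counts and posteriors are hypothetical:

```python
import numpy as np

def scaled_log_likelihoods(log_posteriors, state_priors):
    """Turn DNN log P(s_t | x_t) (T x S) into HMM emission scores:
    log P(x_t | s_t) ~ log P(s_t | x_t) - log P(s_t), up to a constant."""
    return log_posteriors - np.log(state_priors)

# priors counted from training alignments (hypothetical counts)
counts = np.array([120., 80., 200.])
state_priors = counts / counts.sum()
log_post = np.log(np.array([[0.7, 0.2, 0.1],
                            [0.1, 0.6, 0.3]]))
emission_scores = scaled_log_likelihoods(log_post, state_priors)
# these scores plug into the Viterbi sketch above in place of log_emit
```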

Way 2: DNN-HMM Hybrid - Sequential Training
$\hat{W} = \arg\max_W P(X|W;\theta)\,P(W)$
Given training data $(X^1, \hat{W}^1), (X^2, \hat{W}^2), \cdots, (X^r, \hat{W}^r), \cdots$, fine-tune the DNN parameters $\theta$ so that $P(X^r|\hat{W}^r;\theta)\,P(\hat{W}^r)$ increases while $P(X^r|W;\theta)\,P(W)$ decreases for every word sequence $W$ different from $\hat{W}^r$.

How to use Deep Learning? Way 3: End-to-end

Way 3: End-to-end - Character
Input: acoustic features (spectrograms). Output: characters (and space) plus a null symbol (~). No phoneme set and no lexicon are needed; in the words of Baidu Research's (Silicon Valley AI Lab) Deep Speech paper, "We do not need a phoneme dictionary, nor even the concept of a 'phoneme'." There is also no OOV problem. The key technique is CTC: Connectionist Temporal Classification. The combination of bidirectional LSTM and CTC had been applied to character-level speech recognition before (Eyben et al., 2009), but the relatively shallow architecture used in that work did not deliver compelling results (the best character error rate was almost 20%).
A. Graves, S. Fernandez, F. Gomez, and J. Schmidhuber, "Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks," ICML 2006, pp. 369-376.
A. Graves and N. Jaitly, "Towards end-to-end speech recognition with recurrent neural networks," ICML 2014. http://www.jmlr.org/proceedings/papers/v32/graves14.pdf
A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates, A. Ng, "Deep Speech: Scaling up end-to-end speech recognition," arXiv:1412.5567v2, 2014.

Way 3: End-to-end - Character
Example frame-level output: HIS FRIEND'S interleaved with many null symbols (~ ~ ~ …); removing the nulls and collapsing repeated characters recovers the text.
Graves, Alex, and Navdeep Jaitly, "Towards end-to-end speech recognition with recurrent neural networks," ICML 2014.
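A minimal sketch of the greedy CTC decoding rule (collapse repeats, then drop nulls); the frame sequence is a toy example:

```python
def ctc_collapse(frame_labels, blank="~"):
    """Greedy CTC decoding: collapse repeated labels, then drop blanks."""
    out, prev = [], None
    for c in frame_labels:
        if c != prev and c != blank:
            out.append(c)
        prev = c
    return "".join(out)

# toy frame-level argmax sequence
frames = list("HH~II~SS~~  ~FRR~IEE~ND~'S")
print(ctc_collapse(frames))  # -> "HIS FRIEND'S"
```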

Way 3: End-to-end - Word?
The DNN output layer ranges over the word vocabulary (apple, bog, cat, …; ~50k words in the lexicon). The input layer is a fixed window about 200 times the size of one acoustic feature vector, padded with zeros when the word segment is shorter; other systems are used to obtain the word boundaries. (How about our homework?)
Ref: Bengio, Samy, and Georg Heigold, "Word embeddings for speech recognition," Interspeech 2014. http://bengio.abracadoudou.com/cv/publications/pdf/bengio_2014_interspeech.pdf

Why Deep Learning? Mainly based on the DNN-HMM hybrid.

Deeper is better?
[Chart: word error rate (WER) on the Switchboard (SWB) task versus network depth, comparing multiple hidden layers against a single hidden layer.] Deeper is better, and for a fixed number of parameters, a deep model is clearly better than the shallow one.
Seide, Frank, Gang Li, and Dong Yu, "Conversational Speech Transcription Using Context-Dependent Deep Neural Networks," Interspeech 2011.

What does a DNN do?
[Visualization: input acoustic features (MFCC) alongside activations of the 1st hidden layer.] Speaker normalization is automatically done in the DNN, with the lower layers doing the speaker adaptation. This fits with human speech recognition, which appears to use many layers of feature extractors and event detectors.
A. Mohamed, G. Hinton, and G. Penn, "Understanding how Deep Belief Networks Perform Acoustic Modelling," ICASSP 2012.

What does a DNN do?
[The same visualization at the 8th hidden layer: by this depth, the speaker differences visible in the input MFCCs have largely been normalized away.]
A. Mohamed, G. Hinton, and G. Penn, "Understanding how Deep Belief Networks Perform Acoustic Modelling," ICASSP 2012.

What does a DNN do?
In ordinary acoustic models, all the states are modeled independently, which is not an effective way to model the human voice. The sound of a vowel is controlled by only a few factors: tongue height (high/low) and tongue position (front/back), as in the IPA vowel chart.
http://www.danielpovey.com/files/icassp10_multiling.pdf
http://www.ipachart.com/

What does a DNN do?
Reducing the output of a hidden layer to two dimensions recovers the vowel chart: /i/, /u/, /e/, /o/, /a/ arrange themselves along the high-low and front-back axes, matching the IPA chart. The lower layers detect the manner of articulation, and all states share the results from the same set of detectors, so parameters are used effectively. (The bottleneck network topology is 143-1500-42-1500-152; /u/ is a rounded vowel, for which the lips form a circular opening.)
Vu, Ngoc Thang, Jochen Weiner, and Tanja Schultz, "Investigating the Learning Effect of Multilingual Bottle-Neck Features for ASR," Interspeech 2014. http://www.ipachart.com/ http://csl.anthropomatik.kit.edu/downloads/354_Paper.pdf

Speaker Adaptation

Speaker Adaptation
Speaker adaptation uses different models to recognize the speech of different speakers: collect the audio data of each speaker and build a DNN model per speaker. The challenge is limited data: not enough to directly train a DNN model, and often not even enough to simply fine-tune a speaker-independent DNN model. (In fact, it has been shown that much less improvement is observed with fDLR when the network goes from shallow to deep [30].) Adaptation can be supervised (with labels) or unsupervised (without labels).

Categories of Methods
Conservative training: re-train the whole DNN with some constraints.
Transformation methods: train the parameters of only one layer.
Speaker-aware training: do not change the DNN parameters at all.
Methods further down the list need less training data.

Conservative Training
Start from a speaker-independent DNN trained on audio data of many speakers, then re-train it on a little data from the target speaker, using the original network as the initialization. Just training cannot work on so little data, so an extra constraint is added: keep the adapted parameters (or the network outputs) close to those of the initial network. (To evaluate how the size of the adaptation set affects the result, one study varies the number of adaptation utterances from 5 (32 s) to 200 (22 min).)

Transformation methods
Add an extra layer with weights Wa between layer i and layer i+1, fix all the other parameters (including W), and train only Wa on the little data from the target speaker. Where should the extra layer be added? There is no single answer.

Transformation methods
A common choice is to add the extra layer between the input and the first layer. With splicing, the same transformation Wa is applied to every frame in the spliced window. Match the size of Wa to the adaptation data: a larger Wa requires more data, a smaller Wa works with less.
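A sketch of such an input transformation (in the spirit of LIN/fDLR), assuming a per-frame square matrix Wa initialized to identity; only Wa would be trained during adaptation:

```python
import numpy as np

def linear_input_network(X, Wa):
    """Apply a per-frame linear transform Wa (D x D) before splicing.
    X: (T, D) features from the target speaker."""
    return X @ Wa.T

D = 39
Wa = np.eye(D)                   # identity initialization: no change at start
X = np.random.randn(100, D)
X_adapted = linear_input_network(X, Wa)
# X_adapted is then spliced and fed to the frozen speaker-independent DNN
```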

Transformation methods
SVD bottleneck adaptation: with the low-rank output layer W ≈ U V described earlier, insert a K × K speaker-specific linear transform Wa between V and U. Since K is usually small, Wa has few parameters and can be adapted from little data.

Speaker-aware Training
From the data of each speaker (Speaker 1, Speaker 2, Speaker 3, …), extract a fixed-length, low-dimensional vector of speaker information; the i-vector, originally used in speaker identification, is a typical choice. Text transcriptions are not needed for extracting the vectors, so lots of mismatched data can be used. The same idea gives noise-aware or device-aware training.

Speaker-aware Training
In both training and testing, the acoustic features are appended with the speaker-information features (e.g., the i-vector). All speakers use the same DNN model; different speakers are simply augmented with different features.
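A minimal sketch of the feature augmentation, assuming a hypothetical 100-dim i-vector for the speaker:

```python
import numpy as np

def append_speaker_vector(X, ivec):
    """Append the same speaker vector to every frame.
    X: (T, D) acoustic features; ivec: (K,) e.g. an i-vector."""
    return np.hstack([X, np.tile(ivec, (X.shape[0], 1))])

X = np.random.randn(100, 39)
ivec = np.random.randn(100)              # hypothetical i-vector
X_aug = append_speaker_vector(X, ivec)   # (100, 139), fed to the shared DNN
```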

Multi-task Learning

Multitask Learning
The multi-layer structure makes the DNN suitable for multitask learning: Task A and Task B share the lower layers and have separate output layers. The shared layers can take one common input feature, or separate input features for task A and task B that merge in shared middle layers. A sketch of the shared-trunk structure follows.
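A minimal numpy sketch of the shared-trunk, two-head structure; the sizes are illustrative, and training would backpropagate the sum of both task losses through the shared weights:

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

rng = np.random.default_rng(0)
D, H, A_OUT, B_OUT = 39, 256, 1000, 50           # illustrative sizes
W_shared = rng.normal(scale=0.01, size=(D, H))   # trunk shared by both tasks
W_a = rng.normal(scale=0.01, size=(H, A_OUT))    # head for task A (e.g., states)
W_b = rng.normal(scale=0.01, size=(H, B_OUT))    # head for task B (e.g., phonemes)

def forward(x):
    """Shared hidden layer feeding two task-specific output heads."""
    h = relu(x @ W_shared)
    return h @ W_a, h @ W_b                      # logits for task A and task B

logits_a, logits_b = forward(rng.normal(size=D))
# summing the two task losses lets both tasks shape W_shared
```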

Multitask Learning - Multilingual
A single network takes acoustic features as input and has separate output layers for the states of French, German, Spanish, Italian, and Mandarin. Human languages share some common characteristics, so the shared hidden layers can be learned from all of them.

Multitask Learning - Multilingual
[Chart: character error rate versus hours of Mandarin training data, comparing a Mandarin-only model against one trained jointly with European languages; joint training lowers the error rate, especially when Mandarin data is scarce.]
Huang, Jui-Ting, et al., "Cross-language knowledge transfer using multilingual deep neural network with shared hidden layers," ICASSP 2013.

Multitask Learning - Different units
The two tasks can predict different unit types from the same acoustic features: A = state with B = phoneme, A = state with B = gender, or A = state with B = grapheme (character).
Dongpeng Chen, Mak, B., Cheung-Chi Leung, Sivadas, S., "Joint acoustic modeling of triphones and trigraphemes by multi-task learning deep neural networks for low-resource speech recognition," ICASSP 2014.

Deep Learning for Acoustic Modeling - New acoustic features

MFCC
The classic pipeline: waveform → DFT → spectrogram → filter bank → log → DCT → MFCC, which is the input of the DNN. The DNN is strong enough that stages of this hand-crafted cascade can be removed, as the next slides show.

Filter-bank Output
Drop the final DCT: waveform → DFT → spectrogram → filter bank → log, and feed the log filter-bank output to the DNN directly. This is kind of standard now.
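A sketch of computing 40-dim log filter-bank features, again assuming librosa and a hypothetical utterance.wav; the mel-band count and framing are illustrative:

```python
import numpy as np
import librosa

y, sr = librosa.load("utterance.wav", sr=16000)  # hypothetical file
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400,
                                     hop_length=160, n_mels=40)
log_fbank = np.log(mel + 1e-10).T  # (num_frames, 40) log filter-bank features
```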

Spectrogram
Go a step further and drop the filter bank as well: waveform → DFT → spectrogram, fed directly to the DNN; common today. Learning the filter banks inside the network has been reported to give about 5% relative improvement over filter-bank output.
Ref: Sainath, T. N., Kingsbury, B., Mohamed, A. R., & Ramabhadran, B., "Learning filter banks within a deep neural network framework," ASRU 2013. http://ttic.uchicago.edu/~haotang/speech/06707746.pdf

Spectrogram

Waveform?
Feed the raw waveform directly to the DNN; if this succeeded, there would be no need to take Signals & Systems. People have tried, but it is not better than the spectrogram yet, so you still need to take Signals & Systems.
Ref: Tüske, Z., Golik, P., Schlüter, R., & Ney, H., "Acoustic modeling with deep neural networks using raw time signal for LVCSR," Interspeech 2014.

Waveform?

Convolutional Neural Network (CNN)

CNN
Speech can be treated as an image: the spectrogram, with frequency on one axis and time on the other (also called a spectral waterfall, voiceprint, or voicegram). One can even read, or paint, spectrograms.
http://www.ultimaserial.com/UltimaSound.html

CNN
Replace the DNN with a CNN: the spectrogram image goes in, and probabilities of states come out. Why does it work? Like images, spectrograms have local structure: patterns that are local in time and frequency can be detected by shared filters.

CNN
Maxout / max pooling: each output is the maximum over a small group of units, e.g., max(a1, a2) and max(b1, b2).
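A minimal numpy sketch of the maxout/max-pooling operation over groups of two units:

```python
import numpy as np

def maxout(z, group_size=2):
    """Maxout activation: max over groups of `group_size` linear units.
    z: (..., H) pre-activations with H divisible by group_size."""
    *lead, H = z.shape
    return z.reshape(*lead, H // group_size, group_size).max(axis=-1)

z = np.array([[1.0, -2.0, 0.5, 3.0]])  # a1, a2, b1, b2
print(maxout(z))                        # [[1. 3.]] -> max(a1,a2), max(b1,b2)
```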

CNN
Convolutional deep maxout networks, combined with network-in-network ideas, give state-of-the-art phone recognition results (at the time of this lecture).
Tóth, László, "Convolutional Deep Maxout Networks for Phone Recognition," Interspeech 2014. MTA-SZTE Research Group on Artificial Intelligence, Hungarian Academy of Sciences and University of Szeged, Hungary. https://www.inf.u-szeged.hu/~tothl/pubs/IS2013.pdf http://www.inf.u-szeged.hu/~tothl/pubs/IS2014.pdf

Applications in Acoustic Signal Processing - Deep Learning for Regression

DNN for Speech Enhancement
A DNN maps noisy speech to clean speech, for mobile communication or as a front end for speech recognition.
Demo: http://home.ustc.edu.cn/~xuyong62/demo/SE_DNN.html

DNN for Voice Conversion
A DNN maps the voice of one speaker to another, e.g., female to male.
https://wiki.inf.ed.ac.uk/twiki/pub/CSTR/Speak14To15/sequence_error_vc_is2014.pdf
Demo: http://research.microsoft.com/en-us/projects/vcnn/default.aspx

Concluding Remarks (and probable future directions)

Concluding Remarks
Conventional Speech Recognition
How to use Deep Learning in acoustic modeling?
Why Deep Learning?
Speaker Adaptation
Multi-task Deep Learning
New acoustic features
Convolutional Neural Network (CNN)
Applications in Acoustic Signal Processing

Thank you for your attention!

More Research Related to Speech
Speech recognition (你好 → "Hello") is the core technique behind many applications: spoken content retrieval (e.g., finding the lectures related to "deep learning" in lecture recordings), speech summarization, computer-assisted language learning, dialogue ("I would like to leave Taipei on November 2nd"), and information extraction.
http://www.skyscanner.net/news/7-secrets-learning-language-fast