
Convolutional Neural Networks for Speech Recognition


1 Convolutional Neural Networks for Speech Recognition
Ossama Abdel-Hamid, Abdel-rahman Mohamed, Hui Jiang, Li Deng, Gerald Penn, Dong Yu. IEEE/ACM Transactions on Audio, Speech, and Language Processing, Vol. 22, No. 10, October 2014. Presented by Ming-Han Yang.

2 Outline
Introduction. Review: Deep Neural Networks. Convolutional Neural Networks and their use in ASR. Experiments. Conclusion. (Most of the talk is devoted to the CNN part.)

3 ASR tasks
The aim of automatic speech recognition (ASR) is the transcription of human speech into words. It is a very challenging task because human speech signals are highly variable due to speaker attributes, different speaking styles, uncertain environmental noises, and so on. Moreover, ASR needs to map variable-length speech signals into variable-length sequences of words or phonetic symbols.

4 Hidden Markov Model
Hidden Markov models (HMMs) are well known to be very successful in handling variable-length sequences, as well as in modeling the temporal behavior of speech signals using a sequence of states, each of which is associated with a particular probability distribution of observations.

5 Gaussian Mixture Models
Gaussian mixture models (GMMs) have, until very recently, been regarded as the most powerful model for estimating the probability distribution of the speech signal associated with each of these HMM states. Meanwhile, generative training methods for GMM-HMMs based on the popular expectation-maximization (EM) algorithm have been well developed for ASR. In addition, a plethora of discriminative training methods, as reviewed in [1], [2], [3], are typically employed to further improve HMMs and yield state-of-the-art ASR systems.
[1] H. Jiang, "Discriminative training for automatic speech recognition: A survey."
[2] X. He, L. Deng, and W. Chou, "Discriminative learning in sequential pattern recognition—A unifying review for optimization-oriented speech recognition."
[3] L. Deng and X. Li, "Machine learning paradigms for speech recognition: An overview."

6 Artificial Neural Networks
Very recently, HMMs that use artificial neural networks (ANNs) instead of GMMs have witnessed a significant resurgence of research interest [4], [5], [6], [7], [8], [9]: initially on the TIMIT phone recognition task with mono-phone HMMs on MFCC features, and shortly thereafter on several large-vocabulary ASR tasks with triphone HMM models.
[4] G. E. Dahl, M. Ranzato, A. Mohamed, and G. E. Hinton, "Phone recognition with the mean-covariance restricted Boltzmann machine."
[5] A. Mohamed, T. Sainath, G. Dahl, B. Ramabhadran, G. Hinton, and M. Picheny, "Deep belief networks using discriminative features for phone recognition."
[6] D. Yu, L. Deng, and G. Dahl, "Roles of pre-training and fine-tuning in context-dependent DBN-HMMs for real-world speech recognition."
[7] G. Dahl, D. Yu, L. Deng, and A. Acero, "Large vocabulary continuous speech recognition with context-dependent DBN-HMMs."
[8] F. Seide, G. Li, X. Chen, and D. Yu, "Feature engineering in context-dependent deep neural networks for conversational speech transcription."
[9] N. Morgan, "Deep and wide: Multiple layers in automatic speech recognition."

7 Artificial Neural Networks
In retrospect, the performance improvements of these recent attempts have been ascribed to their use of "deep" learning, a reference both to the number of hidden layers in the neural network and to the abstractness and, by some accounts, psychological plausibility of the representations obtained in the layers furthest removed from the input, which hearkens back to the appeal of ANNs to cognitive scientists thirty years ago. A great many other design decisions have also been made in these ANN-based models, to which significant improvements might equally have been attributed.

8 Artificial Neural Networks
Even without deep learning, ANNs are powerful discriminative models that can directly represent arbitrary classification surfaces in the feature space without any assumptions about the data's structure. GMMs, by contrast, assume that each data sample is generated from one hidden expert (i.e., a Gaussian), and a weighted sum of those Gaussian components is used to model the entire feature space. ANNs have been used for speech recognition for more than two decades.

9 Artificial Neural Networks
Early trials worked on static and limited speech inputs, where a fixed-size buffer was used to hold enough information to classify a word in an isolated speech recognition scheme. ANNs have been used in continuous speech recognition as feature extractors, both in the TANDEM approach and in the so-called bottleneck feature methods, and also as nonlinear predictors to aid the recognition of speech units [25], [26]. Their first successful application to continuous speech recognition, however, was in a manner that almost exactly parallels the current use of GMMs, i.e., as sources of HMM state posterior probabilities, given a fixed number of feature frames.

10 Artificial Neural Networks
How do the recent ANN-HMM hybrids differ from earlier approaches? They are simply much larger. Advances in computing hardware over the last twenty years have played a significant role in the advance of ANN-based approaches to acoustic modeling, because training ANNs with so many hidden units on so many hours of speech data has only recently become feasible. The recent trend towards ANN-HMM hybrids began with restricted Boltzmann machines (RBMs), which can take (temporally) subsequent context into account.

11 ANN vs GMM
Comparatively recent advances in learning through minimizing "contrastive divergence" enable us to approximate learning with RBMs. Compared to conventional GMM-HMMs, ANNs can easily leverage highly correlated feature inputs, such as those found in much wider temporal contexts of acoustic frames, typically 9-15 frames. Hybrid ANN-HMMs also now often directly use log mel-frequency spectral coefficients without a decorrelating discrete cosine transform (DCT), DCTs being largely an artifact of the decorrelated mel-frequency cepstral coefficients (MFCCs) that were popular with GMMs. All of these factors have had a significant impact upon performance.

12 Artificial Neural Networks
This historical deconstruction is important because the premise of the present paper is that very wide input contexts and domain-appropriate representational invariance are so important to the recent success of neural-network-based acoustic models that an ANN-HMM architecture embodying these advantages can, in principle, outperform other ANN architectures of potentially unlimited depth, at least for some tasks. We present just such a novel architecture below, which is based upon convolutional neural networks (CNNs) [31].

13 CNN
CNNs are among the oldest deep neural-network architectures and have enjoyed great popularity as a means for handwriting recognition since the 1980s. A modification of CNNs will be presented here, called limited weight sharing, which, however, to some extent impairs their ability to be stacked unboundedly deep. We moreover illustrate the application of CNNs to ASR in detail, and provide additional experimental results on how different CNN configurations may affect final ASR performance.

14 Why HMM? Not TDNN?
CNNs have been applied to acoustic modeling before, notably in [33] and [34], in which convolution was applied over windows of acoustic frames that overlap in time, in order to learn more stable acoustic features for classes such as phone, speaker and gender. Weight sharing over time is actually a much older idea that dates back to the so-called time-delay neural networks (TDNNs) of the late 1980s, but TDNNs emerged initially as a competitor to HMMs for modeling time variation in a "pure" neural-network-based approach. That purity may be of some value to the aforementioned cognitive scientists, but it is less so to engineers. The TDNN was originally proposed to recognize the voiced consonants "B," "D," and "G" in Japanese speech; in 1989 it achieved 98.5% accuracy on those three sounds, higher than an HMM's 93.7%, making it a precursor of the CNN. In engineering practice, however, HMMs won out partly because TDNN training by back-propagation is harder to implement: different neurons produce values at different time scales (the input layer at the input frame rate, the output layer only after the whole utterance segment has been processed), so training requires special handling and is more complex than for an HMM.

15 Why HMM? Not TDNN?
As far as modeling time variations is concerned, HMMs do relatively well at this task; convolutional methods, i.e., those that use neural networks endowed with weight sharing, local connectivity and pooling (properties that will be defined below), are probably overkill along the time axis, in spite of the initially positive results of [35]. We will therefore continue to use HMMs in our model for handling variation along the time axis, but apply convolution on the frequency axis of the spectrogram.

16 CNN
This endows the learned acoustic features with a tolerance to small shifts in frequency, such as those that may arise from differing vocal tract lengths, and has led to a significant improvement over DNNs of similar complexity on TIMIT speaker-independent phone recognition, with a relative phone error rate reduction of about 8.5%. Learning invariant representations over frequency (or time) is notoriously more difficult for standard DNNs. Deep architectures have considerable merit: they enable a model to handle many types of variability in the speech signal.

17 CNN
The work of [29], [36] shows that the feature representations used in the upper hidden layers of DNNs are indeed more invariant to small perturbations in the input, regardless of their putative deep structural insight or abstraction, and in a manner that leads to better model generalization and improved recognition performance, especially under speaker and environmental variations. The more crucial question we have undertaken to answer is whether even better performance might be attainable if some representational knowledge that arises from a careful study of the empirical domain can be used to explicitly handle the variations in question.

18 Vocal tract length normalization
Vocal tract length normalization (VTLN) is another very good example of this. VTLN warps the frequency axis based on a single learnable warping factor to normalize speaker variations in the speech signals (for instance, the vocal-tract-length differences between male and female speakers), and has been shown [41], [16] to further improve the performance of DNN-HMM hybrid models when applied to the input features. More recently, deep architectures taking the form of recurrent neural networks (RNNs), even unstacked single-layer variants, have been reported to achieve very competitive error rates [42].

19 (Figure: overview of acoustic modeling approaches to date — GMM-HMM, hybrid ANN-HMM, hybrid ANN-HMM + RBM, TDNN, hybrid DNN-HMM, hybrid CNN-HMM.)

20 Review : Deep Neural Networks
A deep neural network (DNN) refers to a feedforward neural network with more than one hidden layer, each hidden unit typically applying a sigmoid() or tanh() activation with a bias. For a multi-class classification problem, the posterior probability of each class can be estimated using an output softmax layer. Why a bias? With h = sigmoid(WX), changing W only changes the slope of the sigmoid; the curve never shifts. With h = sigmoid(WX + b), the bias translates the curve, and it determines the neuron's activation when the input is all zeros.
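To make this concrete, here is a minimal NumPy sketch of a DNN forward pass with sigmoid hidden layers, biases, and a softmax output; the layer sizes (440 inputs, two hidden layers of 1000 units, 183 output states) are illustrative assumptions, not values fixed by this slide.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - np.max(x))          # subtract max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
sizes = [440, 1000, 1000, 183]         # hypothetical input / hidden / state counts
weights = [rng.normal(0, 0.01, (m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]   # the bias shifts each unit's activation;
                                            # changing W alone only changes its slope

def dnn_forward(x):
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = sigmoid(h @ W + b)              # hidden layers: sigmoid(Wx + b)
    return softmax(h @ weights[-1] + biases[-1])  # output layer: softmax posteriors

posteriors = dnn_forward(rng.normal(size=440))
print(posteriors.shape, posteriors.sum())   # (183,) 1.0
```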

21 Review : DNN-HMM model
In the hybrid DNN-HMM model, the DNN replaces the GMMs to compute the HMM state observation likelihoods. In the training stage, forced alignment is first performed to generate a reference state label for every frame (example utterance: "She had your dark suit in greasy wash water all year"). These labels are the desired outputs d used in supervised training to minimize the cross-entropy function, i.e., the DNN's predictions should be as close to d as possible.

22 Review : DNN-HMM model
If we use the stochastic gradient descent algorithm to minimize the objective function, then for each training sample or mini-batch, each weight matrix update can be computed as W ← W − ε ∂E/∂W, where ε is the learning rate, E is the cross-entropy between the forced-alignment target states d of the training data and the DNN's output posteriors over the phoneme states, and the gradients are computed by back-propagation (which involves the derivative of the sigmoid).

23 Review : DNN pre-training
Pretraining initializes all the weights, and matters especially when the amount of training data is limited and when no constraints are imposed on the DNN weights. One popular method to pretrain DNNs uses the restricted Boltzmann machine (RBM) as a building block. An RBM is a generative model of the probability distribution of the data; its hidden units compute a better representation of the input. After learning, the RBM weights can be used as a good initialization for one DNN layer. Learning proceeds greedily layer by layer: the lowest hidden layer is trained first, then frozen, and its outputs are used to pretrain the next hidden layer with another RBM, until every layer has been pretrained. The RBM weights are typically learned with the contrastive divergence (CD) algorithm.
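Below is a minimal sketch of one contrastive-divergence (CD-1) update for a plain binary RBM, the generic form of the procedure described here; a real speech front end would typically use a Gaussian-Bernoulli RBM for the real-valued first layer, which this sketch does not cover.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def cd1_step(W, bv, bh, v0, lr=0.01):
    """One contrastive-divergence (CD-1) update for a binary RBM."""
    ph0 = sigmoid(v0 @ W + bh)                    # p(h=1 | v0)
    h0 = (rng.random(ph0.shape) < ph0) * 1.0      # sample hidden states
    pv1 = sigmoid(h0 @ W.T + bv)                  # reconstruction p(v=1 | h0)
    ph1 = sigmoid(pv1 @ W + bh)                   # hidden probs given reconstruction
    W += lr * (v0.T @ ph0 - pv1.T @ ph1) / len(v0)  # approx. log-likelihood gradient
    bv += lr * (v0 - pv1).mean(axis=0)
    bh += lr * (ph0 - ph1).mean(axis=0)
    return W, bv, bh
```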

24 Convolutional Neural Networks and their use in ASR

25 Organization of the Input Data to the CNN
CNN introduces a special network structure, which consists of alternating so-called convolution and pooling layers. CNNs run a small window over the input image at both training and testing time, so that the weights of the network looking through this window can learn from various features of the input data regardless of their absolute position within the input. The CNN is borrowed from image processing, where the input is naturally a 2-D array; a color image contributes three input feature maps, one each for red, green and blue. With full weight sharing, the same filter weights are used as the window scans the whole input feature map, and the network is called "local" because each unit only sees a small region of the input. In speech, the input "image" can loosely be thought of as a spectrogram, with static, delta and delta-delta features (i.e., first and second temporal derivatives) serving in the roles of red, green and blue.

26 How to organize speech feature vectors into feature maps
We need to use inputs that preserve locality in both axes of frequency and time. Time presents no immediate problem from the standpoint of locality: as in other DNNs for speech, a single window of input to the CNN consists of a wide amount of context (9-15 frames). Frequency is different: the conventional use of MFCCs does present a major problem, because the discrete cosine transform projects the spectral energies onto a new basis that may not maintain locality.

27 How to organize speech feature vectors into feature maps
In this paper, we use the log-energy computed directly from the mel-frequency spectral coefficients (i.e., with no DCT), which we denote as MFSC features. MFSC features, along with their deltas and delta-deltas, are used to represent each speech frame (describing the acoustic energy distribution in each of several different frequency bands). There are several alternatives for organizing these MFSC features into maps for the CNN; the two considered here are shown in Fig. 1(b) and Fig. 1(c).

28 Organizing MFSC features : Fig.1(b)
The features can be arranged as three 2-D feature maps, each of which represents MFSC features (static, delta and delta-delta) distributed along both frequency (using the frequency band index) and time (using the frame number within each context window). In this case, a two-dimensional convolution is performed to normalize both frequency and temporal variations simultaneously, much like an image-processing filter.

29 Organizing MFSC features : Fig.1(c)
Alternatively, we may only consider normalizing frequency variations, in which case the MFSC features form 1-D feature maps along the frequency axis. For example, if the context window contains 15 frames and 40 filter banks are used for each frame, we construct 45 (i.e., 15 times 3) 1-D feature maps, each of 40 dimensions. Both arrangements are sketched below.
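A small NumPy illustration of the two arrangements, assuming a 15-frame context window with 40 mel bands and three streams (static, delta, delta-delta):

```python
import numpy as np

# Hypothetical context window: 15 frames x 40 mel bands x 3 streams.
window = np.random.randn(15, 40, 3)

# Fig. 1(b): three 2-D feature maps of shape (time=15, frequency=40);
# a 2-D convolution then normalizes time and frequency jointly.
maps_2d = [window[:, :, k] for k in range(3)]          # 3 maps, each 15x40

# Fig. 1(c): 45 (= 15 frames x 3 streams) 1-D feature maps of 40 bands each;
# a 1-D convolution along frequency normalizes frequency variation only.
maps_1d = window.transpose(0, 2, 1).reshape(45, 40)    # 45 maps, each length 40
print(len(maps_2d), maps_2d[0].shape, maps_1d.shape)   # 3 (15, 40) (45, 40)
```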

30 Convolution ply
Once the input feature maps are formed, the convolution and pooling layers apply their operations in turn, each producing its own feature maps. In CNN terminology, a pair of convolution and pooling layers is usually called one CNN "layer"; a deep CNN stacks several such pairs. To avoid confusion, we here call them the convolution ply and the pooling ply. Each input feature map O_i is connected to many convolution feature maps Q_j in the convolution ply through a weight matrix W_{i,j}; the mapping can be viewed as the convolution operation from signal processing. Assuming one-dimensional input feature maps, each unit of one convolution feature map is computed as (eq. 9): q_{j,m} = σ( Σ_{i=1..I} Σ_{n=1..F} o_{i,n+m−1} · w_{i,j,n} + b_j ), where o_{i,m} is the m-th unit of the i-th input feature map, q_{j,m} is the m-th unit of the j-th convolution feature map, w_{i,j,n} is the n-th weight of the matrix W_{i,j} connecting the i-th input feature map to the j-th convolution feature map, and F is the filter size, i.e., the number of frequency bands of the input feature maps that each convolution unit receives. This expression can also be written as a matrix multiplication. The bias shifts the activation function; without it, changing the weights would only change the function's slope.

31 Convolution ply
Both O_i and W_{i,j} are vectors if one-dimensional feature maps are used, and matrices if two-dimensional feature maps are used. Note that, in this formulation, the number of feature maps in the convolution ply directly determines the number of local weight matrices used in the above convolutional mapping. With full weight sharing, these local matrices are identical at every position: one filter scans the entire input feature map with the same weights. It is also important that the windows the filter scans overlap as it moves across the input feature map.

32 Convolution ply
The convolution operation itself produces lower-dimensional data: each dimension decreases by the filter size minus one. For example, with 40 frequency bands and a filter size of 5, convolution produces only 40 − (5 − 1) = 36 units per feature map. We can instead pad the input with dummy values (both dummy time frames and dummy frequency bands) to preserve the size of the feature maps; in this example, padding the input with 4 dummy values keeps 40 bands after convolution. All units in the same feature map share the same weights but receive input from different locations of the lower layer.

33 Pooling ply
The pooling ply has the same number of feature maps as its convolution ply, but each map is smaller. The purpose of the pooling ply is to reduce the resolution of the feature maps: each pooling unit summarizes the features of several neighboring units of the lower convolution ply. These summarized features are then invariant to small shifts along the frequency axis.

34 Pooling ply
The pooling function is usually simple, such as maximization or averaging. Max pooling: p_{j,m} = max_{n=1..G} q_{j,(m−1)×s+n}. Average pooling: p_{j,m} = r · (1/G) Σ_{n=1..G} q_{j,(m−1)×s+n}. Here G is the pooling size; s is the shift size, which determines the overlap of adjacent pooling windows; and r is a scaling factor that can be learned. Max pooling tends to give better performance. In image recognition, pooling windows usually do not overlap; experiments below examine how much overlap matters. The pooled results then pass through a nonlinear activation function to produce the final output.

35 (Figure: pooling with pooling size = 3.) Each pooling unit receives input from three convolution ply units of the same feature map. If the pooling size equals the shift size, the pooling ply is one third the size of the convolution ply.

36 Learning weights in CNN
All weights in the convolution ply can be learned using the same error back-propagation algorithm, but some special modifications are needed to take care of sparse connections and weight sharing. In order to illustrate the learning algorithm for CNN layers, let us first represent the convolution operation of eq. (9) in the same mathematical form as a fully connected ANN layer.

37 Learning weights in CNN
Figure (a) assumes a filter size of 5, 45 input feature maps, and 80 feature maps in the convolution ply. With 1-D feature maps, the convolution operation can be viewed as a simple matrix multiplication by a large sparse matrix W (right of the figure), obtained by unrolling the basic matrix w (left of the figure). The basic matrix w is built from the local weight matrices W_{i,j}: it has I × F rows (45 input feature maps times the filter size of 5), with I rows per frequency band, and J columns (one per convolution feature map).

38 Learning weights in CNN
The input feature maps and the convolution feature maps can be viewed as row vectors o and q. The single row vector o is formed by laying all the 1-D input feature maps side by side and concatenating them; v_m is the row vector collecting the m-th frequency band of all I input feature maps. Eq. (9) can then be written as the row vector o multiplied by the weight matrix, followed by the sigmoid — the same form as the typical fully connected hidden layer of a DNN.

39 Learning weights in CNN
The treatment of shared weights in the convolution ply is slightly different from the fully-connected DNN case, where there is no weight sharing. The convolution ply weights are still updated by back-propagation, but the updates ΔW corresponding to the same shared weight in the large sparse matrix must be summed over all I input feature maps and J convolution feature maps. The error vector e is computed as before (desired output minus actual output), and is propagated to the lower layers through the sparse weight matrix using the back-propagation formula of eq. (7). Similarly, the bias can be handled by adding one row to the sparse weight matrix W and one element to the input vector o.

40 Learning weights in CNN
Because the pooling ply has no weights, no learning is needed there; the error signal only has to be back-propagated through the pooling function to the previous layer. For max-pooling, for example, the error signal of each pooling unit is passed only to the unit that was the maximum of its pool in the convolution ply. Formally, the routing is expressed with δ(x), a function that returns 1 if x = 0 and 0 otherwise, where u_{i,m} denotes the index (and value) of the maximum unit among the convolution ply units corresponding to the pooling unit.

41 Pretraining CNN Layers
RBM-based pretraining improves DNN performance, especially when the training set is small; it initializes the weights into a suitable range so that optimization and regularization work better. For the convolutional structure, a convolutional RBM (CRBM) has been proposed [Desjardins and Bengio, 2008, "Empirical Evaluation of Convolutional RBMs for Vision"]. Similar to RBMs, the training of the CRBM aims to maximize the likelihood function of the full training data according to an approximate contrastive divergence (CD) algorithm. The CRBM treats the hidden units of each pool in the convolution ply as a multinomial distribution in which at most one unit per pool may be active; this requires either non-overlapping pooling windows or limited weight sharing (introduced in a later section).

42 Treatment of Energy Features
In ASR, log-energy is usually calculated per frame and appended to the other spectral features. In a CNN, however, it is not suitable to treat energy the same way as the other filter bank energies, since it is the sum of the energy over all frequency bands and so does not depend on frequency. Instead, the log-energy features should be appended as extra inputs to all convolution units, e.g., by adding an extra row to the weight matrix W so that energy is taken into account during convolution. Other non-localized features can be treated similarly. Experiments below compare performance with and without energy features (adding them helps), though it remains open whether the same improvement holds for large-scale ASR; at the least, the results show that frequency-independent features such as energy, when handled appropriately by the CNN architecture, improve performance.
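As a loose illustration of the idea (the exact wiring in the paper's figure is not transcribed, so this is an assumption about one reasonable implementation), the log-energies can be appended to every local patch the filters see, like an always-visible extra input:

```python
import numpy as np

O = np.random.randn(45, 40)        # 45 one-dimensional MFSC feature maps
energy = np.random.randn(45)       # hypothetical log-energy for each of the 45 maps

def patch_with_energy(O, energy, m, F=8):
    """Build the input of the convolution units at position m, with energy appended.

    The energy values enter every position m unchanged, so (like a bias) they
    correspond to extra rows of the weight matrix W rather than to any band.
    """
    local = O[:, m:m + F].ravel()          # the F frequency bands the filter sees
    return np.concatenate([local, energy])
```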

43 The Overall CNN Architecture
The CNN consists of pairs of convolution and pooling plies whose input is a set of feature maps of localized features; as more convolution and pooling plies are stacked, the feature maps in the higher layers become smaller and smaller. On top of the topmost CNN layer, one or more fully connected hidden layers combine the features across all frequency bands before feeding the output layer. In this paper, we follow the hybrid ANN-HMM framework, using a softmax output layer on top of the topmost layer of the CNN to compute the posterior probabilities of all HMM states. Following Bayes' rule, these posteriors are used to estimate the likelihood of all HMM states per frame by dividing by the states' prior probabilities. Finally, the likelihoods of all HMM states are sent to a Viterbi decoder to recognize the continuous stream of speech units.
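A small sketch of the posterior-to-likelihood conversion used before Viterbi decoding; the prior vector would normally be estimated from state frequencies in the forced-aligned training data.

```python
import numpy as np

def scaled_log_likelihoods(posteriors, priors, eps=1e-12):
    """Convert per-frame state posteriors p(s|x) into scaled log-likelihoods.

    By Bayes' rule, p(x|s) = p(s|x) * p(x) / p(s); p(x) is the same for every
    state of a frame, so dividing by the state priors p(s) gives likelihoods
    up to a constant that does not affect the decoded path.
    """
    return np.log(posteriors + eps) - np.log(priors + eps)

# posteriors: (n_frames, n_states) softmax outputs; priors: (n_states,)
# state frequencies counted from the forced-aligned training data.
```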

44 Benefits of CNNs for ASR
The CNN has three key properties, each of which can improve performance. Locality allows more robustness against non-white noise, where some bands are cleaner than others: good local features can be computed from the cleaner parts of the spectrum while only a few features are affected by the noise, so the higher layers can cope with the noise when combining features across frequency bands. This works better than fully connecting all input features to the hidden layer, and locality also reduces the number of weights to be learned. Weight sharing can also improve model robustness and reduce overfitting, since each weight is learned from multiple frequency bands in the input instead of just one single location. Pooling means the same feature values computed at different locations are pooled together and represented by one value, which makes the features, especially with max-pooling, less sensitive to small shifts in frequency. Such shifts occur even for the same speaker across utterances, are hard for GMMs and DNNs to handle, and operations like max pooling are difficult for a plain ANN to learn. By the same mechanism, small variations in speaking rate over time are also handled naturally.

45 CNN with limited weight sharing for ASR

46 Limited Weight Sharing (LWS)
The properties of the speech signal typically vary over different frequency bands, however. Using separate sets of weights for different frequency bands may be more suitable, since it allows for the detection of distinct feature patterns in different filter bands along the frequency axis. In limited weight sharing (LWS), only the convolution units that are attached to the same pooling unit share the same convolution weights. Under this scheme, the convolution feature maps are divided into small LWS maps, called sections (the red box in the figure marks one). Each pooling unit in a section summarizes an entire convolution feature map into one number using a pooling function, such as maximization or averaging. A sketch follows below.
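A minimal NumPy sketch of an LWS layer under simplifying assumptions: non-overlapping sections, each section max-pooled down to a single value per map, and an illustrative section count and size not taken from the paper.

```python
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def lws_layer(O, W, b, G):
    """Limited weight sharing: one filter set per section (pooling window).

    O: (I, B)       input feature maps
    W: (K, I, J, F) separate filters for each of the K sections
    b: (K, J)
    Each section k convolves only over its own G positions and pools them
    into a single value, so the output has one unit per (section, map).
    """
    K, I, J, F = W.shape
    P = np.empty((K, J))
    for k in range(K):
        start = k * G                      # non-overlapping sections (assumption)
        q = np.stack([sigmoid(np.einsum('in,ijn->j',
                                        O[:, start + g:start + g + F], W[k]) + b[k])
                      for g in range(G)])  # the G convolution units in section k
        P[k] = q.max(axis=0)               # max-pool the whole section
    return P

O = np.random.randn(45, 40)
W = 0.01 * np.random.randn(6, 45, 80, 8)   # hypothetical: 6 sections, 80 maps each
P = lws_layer(O, W, np.zeros((6, 80)), G=4)
print(P.shape)                             # (6, 80)
```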

47 Limited Weight Sharing (LWS)

48 Limited Weight Sharing (LWS)

49 Limited Weight Sharing (LWS)
(Slides 47-49 show the LWS figures and equations; no transcript text.)

50 Pretraining of LWS-CNN
For learning the CRBM parameters, we need to define the conditional probabilities of the states of the hidden units given the visible ones, and vice versa, e.g., the conditional probability distribution of v_{i,n}, the visible unit at the n-th frequency band of the i-th input feature map.

51 Pretraining of LWS-CNN
Based on the above two conditional probabilities, all connection weights of the CRBM can be iteratively estimated using the regular contrastive divergence (CD) algorithm. The outputs of the pooling ply are then used as inputs to pretrain the next layer, as in deep belief network training.

52 Experiments

53 Experiments
The experiments in this section were conducted on two speech recognition tasks to evaluate the effectiveness of CNNs in ASR: small-scale phone recognition on TIMIT, and a large vocabulary voice search (VS) task.

54 Speech Data and Analysis
Speech is analyzed using a 25-ms Hamming window with a fixed 10-ms frame rate. Speech feature vectors are generated by Fourier-transform-based filter-bank analysis, which includes 40 log energy coefficients distributed on a mel scale, along with their first and second temporal derivatives. All speech data were normalized so that each vector dimension has zero mean and unit variance.
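A sketch of this front end, assuming 16 kHz audio and that librosa is available; the file name is a placeholder, and normalization here is per-utterance rather than over the whole training set as described above.

```python
import numpy as np
import librosa

# 25-ms Hamming window (400 samples at 16 kHz), 10-ms shift (160 samples),
# 40 mel filter banks -> MFSC (log mel filter-bank) features.
y, sr = librosa.load("utterance.wav", sr=16000)      # hypothetical file
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400, hop_length=160,
                                     win_length=400, window="hamming", n_mels=40)
static = np.log(mel + 1e-10)                         # 40 log-energy coefficients
delta = librosa.feature.delta(static, order=1)       # first temporal derivative
delta2 = librosa.feature.delta(static, order=2)      # second temporal derivative
feats = np.vstack([static, delta, delta2])           # (120, n_frames)

# Normalize every dimension to zero mean and unit variance.
feats = (feats - feats.mean(axis=1, keepdims=True)) / \
        (feats.std(axis=1, keepdims=True) + 1e-10)
```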

55 TIMIT Phone Recognition Settings
Training set: 462 speakers, with all SA records removed. Tuning: a separate development set of 50 speakers. Test set: 24 speakers, with no overlap with the development set. Target class labels: 183 (3 states for each HMM of the 61 phones); the 61 phone classes were mapped to a set of 39 classes for final scoring after decoding. Language model: bigram. Training: mono-phone HMM model, plus a log-energy feature per frame, normalized per utterance to have a maximum value of one and then normalized to zero mean and unit variance over the whole training data set.

56 Effects of Varying CNN Parameters : Settings
In these experiments we used 1 convolution ply, 1 pooling ply and 2 fully connected hidden layers (1000 units per fully connected layer). FWS: pooling size 6, shift size 2, filter size 8, 150 feature maps. LWS: 80 feature maps per frequency band. The tables report the ASR performance of CNNs under different settings of these CNN parameters.

57 Effects of Varying CNN Parameters
Phone Error Rate (PER in %). In the upper table, all configurations yield better performance with increasing pooling size up to 6, and LWS yields better performance with the bigger pooling sizes. Comparing both tables, overlapping pooling windows do not produce a clear performance gain; using the same value for both the pooling size and the shift size produces similar performance while decreasing the model complexity.

58 Effects of Varying CNN Parameters
Phone Error Rate (PER in %). A larger number of feature maps usually leads to better performance, especially with FWS. LWS can achieve better performance with a smaller number of feature maps than FWS, due to its ability to learn different feature patterns for different frequency bands; this indicates that the LWS scheme is more efficient in terms of the number of hidden units.

59 Effects of Energy Features
Phone Error Rate (PER in %). The table shows the benefit of using energy features, which produce a significant accuracy improvement, especially for FWS. While the energy features can easily be derived from the other MFSC features, adding them as separate inputs to the convolution filters results in more discriminative power, as it provides a way to compare the local frequency bands processed by a filter with the overall spectrum.

60 Effects of Pooling Functions
Phone Error Rate (PER in %). The max-pooling function performs better than the average function with the LWS scheme. These results are consistent with what has been observed in image recognition applications.

61 Overall Performance in TIMIT task
Baseline: DNN. ID 1 is a 3-layer DNN; ID 2 is a 5-layer DNN; IDs 3 and 4 differ only in the number of feature maps.

62 Large Vocabulary Speech Recognition Results
18 hours of a voice search dataset; conventional state-tied triphone HMMs. Word Error Rate (WER in %), with and without pretraining. The results show that pretraining the CNN can improve its performance, although the effect of pretraining for the CNN is not as strong as that for the DNN.

63 Conclusion

64 Conclusion
Our hybrid CNN-HMM approach delegates temporal variability to the HMM, while convolving along the frequency axis creates a degree of invariance to small frequency shifts, which normally occur in actual speech signals due to speaker differences. We have proposed a new, limited weight sharing scheme that can handle speech features in a better way than full weight sharing; limited weight sharing also leads to a much smaller number of units in the pooling ply.

65 Conclusion
The use of energy information is very beneficial for the CNN in terms of recognition accuracy. The ASR performance was found to be sensitive to the pooling size, but insensitive to the overlap between pooling units. Pretraining of CNNs was found to yield better performance in the large-vocabulary voice search experiment, but not in the phone recognition experiment; this discrepancy is yet to be examined thoroughly in our future work.

66 Thank you for your attention

