1 Electrical and Computer Engineering, Binghamton University, State University of New York
  Spectral/Temporal Acoustic Features for Automatic Speech Recognition
  Stephen A. Zahorian, Hongbing Hu, Jiang Wu
  Department of Electrical and Computer Engineering, Binghamton University
  November 16th, 2010
2 Overview of talk
  Background/Introduction
  Review of traditional spectral/temporal features
  DCTC/DCS features
  Experimental results
  Conclusions
3 Most Typical Speech Features for ASR
  Spectral Features (Static Features)
    Represent the vocal tract information
    MFCCs (Mel-Frequency Cepstral Coefficients)
  Temporal Features (Dynamic Features)
    Capture time variation (trajectory) of spectral features
    Delta and Delta-Delta terms of MFCCs
4 MFCCs (Mel-Frequency Cepstral Coefficients)
  Mel-frequency scale
  Mel-scale filter banks (20)
  The coefficients c_i are calculated from the log filter-bank amplitudes m_j using the cosine transform
    N: number of filter banks
    m_j: log filter-bank amplitudes
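Assuming the cosine transform on this slide takes the form documented for HTK (the toolkit used later in the talk), c_i = sqrt(2/N) * sum_{j=1..N} m_j * cos(pi * i * (j - 0.5) / N), the computation can be sketched as follows. The function name `cepstra_from_log_fbank` and the toy input are illustrative and do not come from the slides.

```python
import numpy as np

def cepstra_from_log_fbank(log_amplitudes, num_ceps):
    """Cosine transform of log filter-bank amplitudes m_1..m_N into c_1..c_num_ceps.

    Assumes the HTK-style form: c_i = sqrt(2/N) * sum_j m_j * cos(pi*i/N * (j - 0.5)).
    """
    m = np.asarray(log_amplitudes, dtype=float)
    N = m.shape[0]                              # number of filter banks
    j = np.arange(1, N + 1)                     # filter-bank index 1..N
    i = np.arange(1, num_ceps + 1)[:, None]     # cepstral index 1..num_ceps
    basis = np.cos(np.pi * i / N * (j - 0.5))   # shape (num_ceps, N)
    return np.sqrt(2.0 / N) * (basis @ m)

# Toy input: 20 log amplitudes (the slide mentions 20 mel-scale filter banks).
toy_log_fbank = np.log(np.linspace(1.0, 2.0, 20))
toy_cepstra = cepstra_from_log_fbank(toy_log_fbank, num_ceps=12)
```

A flat log spectrum yields zero for every c_i with i >= 1, which is why these coefficients describe spectral shape rather than overall level.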
5 Speech Recognition Architecture
  [Diagram: Speech Waveform -> Feature Extraction -> Speech Features -> Recognizer (HMM/NN), Classification (Recognition) -> Phonemes -> Words (example word output: "I need a")]
6 Hidden Markov Models (HMMs)
  Speech vectors are assumed to be generated by a Markov model
  The probability of an observation sequence along a given state sequence is the product of the transition and output probabilities
  The likelihood can be approximated by considering only the most likely state sequence
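The difference between the full likelihood (summed over all state sequences) and its most-likely-path approximation can be sketched with a toy two-state model. All numbers below are made up for illustration, and both function names are hypothetical; this is a minimal sketch, not the recognizer used in the talk.

```python
import numpy as np

def forward_log_likelihood(log_pi, log_A, log_B):
    """log P(observations | model), summing over all state sequences.

    log_pi: (S,) initial log-probs, log_A: (S, S) transition log-probs,
    log_B: (T, S) output log-probs of each frame under each state.
    """
    alpha = log_pi + log_B[0]
    for t in range(1, log_B.shape[0]):
        # Sum over the previous state i for every current state j.
        alpha = np.logaddexp.reduce(alpha[:, None] + log_A, axis=0) + log_B[t]
    return np.logaddexp.reduce(alpha)

def viterbi_log_likelihood(log_pi, log_A, log_B):
    """Approximation that keeps only the single most likely state sequence."""
    delta = log_pi + log_B[0]
    for t in range(1, log_B.shape[0]):
        # Max over the previous state i instead of a sum.
        delta = np.max(delta[:, None] + log_A, axis=0) + log_B[t]
    return np.max(delta)

# Toy model (illustrative values only).
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.1], [0.4, 0.3], [0.1, 0.6]])   # 3 frames, 2 states

full = forward_log_likelihood(np.log(pi), np.log(A), np.log(B))
best_path = viterbi_log_likelihood(np.log(pi), np.log(A), np.log(B))
```

The best-path score can never exceed the full likelihood, since it is one term of the sum that the forward recursion computes.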
7 DCTC Features
  Discrete Cosine Transform Coefficients (DCTCs)
  Given the spectrum X with the frequency f normalized to a [0, 1] range, the i-th DCTC is computed by projecting the amplitude-scaled spectrum onto the i-th basis vector, where
    a(X): nonlinear amplitude scaling (log)
    g(f): nonlinear frequency warping (Mel-like function)
  [Figure: first 3 DCTC basis vectors]
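One plausible reading of these definitions is DCTC(i) = integral over [0, 1] of a(X(f)) * phi_i(f) df, with basis phi_i(f) = cos(pi * i * g(f)) * dg/df. That form is an assumption made for this sketch, as is the particular warp `mel_like_warp` and its 8000 Hz upper frequency; none of these specifics are given on the slide.

```python
import numpy as np

def mel_like_warp(f, nyquist_hz=8000.0):
    """Hypothetical g(f): mel-style log warp of normalized frequency, rescaled to [0, 1]."""
    return np.log1p(f * nyquist_hz / 700.0) / np.log1p(nyquist_hz / 700.0)

def dctc_basis(num_terms, num_bins, warp=mel_like_warp):
    """Discretized basis phi_i(f) = cos(pi*i*g(f)) * dg/df, including the df step."""
    f = (np.arange(num_bins) + 0.5) / num_bins       # bin centers in [0, 1]
    g = warp(f)
    dg_df = np.gradient(g, f)                        # numerical derivative of the warp
    i = np.arange(num_terms)[:, None]
    return np.cos(np.pi * i * g) * dg_df / num_bins  # shape (num_terms, num_bins)

def dctc(power_spectrum, basis, floor=1e-10):
    """Project the log-scaled spectrum a(X) onto each basis vector."""
    a = np.log(np.maximum(power_spectrum, floor))    # a(X): log amplitude scaling
    return basis @ a

basis = dctc_basis(num_terms=13, num_bins=256)
toy_spectrum = 1.0 + np.abs(np.sin(np.linspace(0.0, 6.0, 256)))  # made-up spectrum
toy_dctcs = dctc(toy_spectrum, basis)
```

Folding dg/df into the basis means the warped-frequency cosine expansion is computed directly on the uniformly sampled spectrum, with no explicit filter bank.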
8 DCS Features
  Discrete Cosine Series Coefficients (DCSCs)
  Represent the spectral evolution of DCTCs over time and encode the modulation spectrum
  Basis vectors are defined using h(t), a time “warping” function that gives non-uniform time resolution
  [Figure: first 3 DCSC basis vectors]
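By analogy with the frequency-domain basis, a time-domain basis of the form theta_j(t) = cos(pi * j * h(t)) * dh/dt can be sketched as below. The basis form, the normalization of t to [0, 1], and the default identity warp h(t) = t are all assumptions for illustration; the slide does not specify the warp actually used.

```python
import numpy as np

def dcs_basis(num_terms, frames_per_block, warp=lambda t: t):
    """Discretized basis theta_j(t) = cos(pi*j*h(t)) * dh/dt over one block of frames."""
    t = (np.arange(frames_per_block) + 0.5) / frames_per_block  # frame centers in [0, 1]
    h = warp(t)
    dh_dt = np.gradient(h, t)            # a steeper h gives that region more weight
    j = np.arange(num_terms)[:, None]
    return np.cos(np.pi * j * h) * dh_dt / frames_per_block

def dcs(dctc_block, basis):
    """dctc_block: (frames_per_block, num_dctc) -> (num_dctc, num_dcs) coefficients."""
    return (basis @ dctc_block).T

# A DCTC trajectory that is constant over the block has energy only in DCS term 0.
constant_block = np.full((20, 3), 2.0)
coeffs = dcs(constant_block, dcs_basis(num_terms=4, frames_per_block=20))
```

Passing a non-linear `warp` changes how time resolution is distributed across the block, which is the role the slide assigns to h(t).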
9 Example
  Original spectrogram and versions rebuilt from different selections of features.
  [Figure panels: Original spectrogram; Rebuilt with 13 DCTC and 3 DCS terms; Rebuilt with 8 DCTC and 5 DCS terms]
10 DCTC/DCS Computation
  [Diagram: Spectrogram -> DCTC/DCS Features. DCTC 1 to DCTC 5 are computed over each frame (Frame Length); DCS1 to DCS3 are computed over each block of frames (Block Length)]
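The frame/block structure in this diagram can be sketched as a sliding block over per-frame DCTCs, with each block expanded over time and flattened into one feature vector. This is a minimal sketch: it uses plain (unwarped) cosine bases over time, random numbers in place of real DCTCs, and hypothetical names throughout.

```python
import numpy as np

def cosine_basis(num_terms, length):
    """Unwarped cosine basis sampled at `length` points (a simplifying assumption)."""
    pos = (np.arange(length) + 0.5) / length
    k = np.arange(num_terms)[:, None]
    return np.cos(np.pi * k * pos) / length

def block_features(dctc_frames, num_dcs, frames_per_block, block_step):
    """dctc_frames: (num_frames, num_dctc) -> (num_blocks, num_dctc * num_dcs)."""
    basis = cosine_basis(num_dcs, frames_per_block)
    last_start = dctc_frames.shape[0] - frames_per_block
    feats = []
    for start in range(0, last_start + 1, block_step):
        block = dctc_frames[start:start + frames_per_block]  # one block of frames
        feats.append((basis @ block).T.reshape(-1))          # DCS terms per DCTC, flattened
    return np.array(feats)

# Stand-in for 100 frames of 8 DCTCs each.
fake_dctcs = np.random.default_rng(0).normal(size=(100, 8))
features = block_features(fake_dctcs, num_dcs=5, frames_per_block=20, block_step=4)
```

With 8 DCTCs and 5 DCS terms, each block yields an 8 x 5 = 40-dimensional feature vector.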
11 Experimental Evaluation
  Database: TIMIT database (“SI” and “SX” only)
  Phoneme: Reduced 48 phone set mapped down from the TIMIT 62 phone set
  Training data: 3696 sentences (462 speakers)
  Testing data: 1344 sentences (168 speakers)
  Recognizer: HMMs
    Left-to-right Markov models with no skip
    48 monophone HMMs are created using the HTK toolkit
    Bigram phone information was used as the language model
  Cambridge University/Microsoft HTK toolkit (Ver3.4)
    Provides powerful tools for data preparation, HMM training and testing, and result analysis
12 Experimental Evaluation
  TIMIT database
    630 total speakers, 10 sentences each
    462 speakers for training, 168 test speakers
  3-state HMM phone models
  Results given as phone accuracy for 39 “standard” phone categories
  Number of mixtures per state “relatively” high to maximize accuracy
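Phone accuracy, as reported by HTK's scoring tool, penalizes insertions as well as deletions and substitutions: Accuracy = (N - D - S - I) / N x 100%, where N is the number of reference phones. The small function below sketches that formula; the counts in the example are invented and are not results from the talk.

```python
def phone_accuracy(num_ref, deletions, substitutions, insertions):
    """HTK-style accuracy in percent: (N - D - S - I) / N * 100."""
    hits = num_ref - deletions - substitutions   # correctly recognized phones
    return 100.0 * (hits - insertions) / num_ref

# Invented counts, for illustration only.
example_accuracy = phone_accuracy(num_ref=1000, deletions=50, substitutions=200, insertions=30)
```

Because insertions are subtracted, accuracy is lower than the simple percent-correct figure (hits / N) whenever the recognizer inserts extra phones.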
13 Evaluation with Static Only Features
  Vary frame length from 5 ms to 30 ms (with 5 ms frame space)
  Vary number of DCTCs (7, 10, 13, 16, 19)
  8 GMM mixtures for each HMM state
14 Evaluation with Dynamic Features
  Use a small number of DCTCs (1, 2, 3, or 4), and vary the number of DCSs
  Vary the number of frames per block, so that DCS terms are computed over 50, 100, or 300 ms
  10 ms frame length, 5 ms frame space
  8 GMM mixtures for each HMM state
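The relation between block duration and frames per block depends on a counting convention the slide does not state. One simple convention, assumed here, is to count the frames that fit entirely inside the block: n = floor((block - frame_length) / frame_space) + 1. The helper name is hypothetical.

```python
def frames_per_block(block_ms, frame_len_ms=10.0, frame_step_ms=5.0):
    """Number of whole frames that fit in a block (assumed counting convention)."""
    return int((block_ms - frame_len_ms) // frame_step_ms) + 1

# Block durations from the slide, with 10 ms frames spaced 5 ms apart.
block_sizes = {ms: frames_per_block(ms) for ms in (50, 100, 300)}
```

Under this convention the three block durations correspond to 9, 19, and 59 frames; counting by block duration divided by frame space instead would give 10, 20, and 60.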
16 Evaluation with Spectral/Temporal Features
  Use 40 features total and 40 GMM mixtures
  Vary frame length and the number of frames per block
    2 ms frame space
    8 ms block space
  Vary the combination of numbers of DCTCs and DCSs, but fix the number of parameters to 40
17 Evaluation with Spectral/Temporal Features
  Condition 1: 8 DCTCs and 5 DCSs
  Condition 2: 9 DCTCs and 5 DCSs
  Condition 3: 10 DCTCs and 4 DCSs
  Condition 4: 11 DCTCs and 4 DCSs
  Condition 5: 12 DCTCs and 4 DCSs
  Condition 6: 13 DCTCs and 4 DCSs
  Condition 7: 14 DCTCs and 3 DCSs
  Condition 8: 15 DCTCs and 3 DCSs
18 Conclusions from these results
  Features which represent trajectories of global spectral shape carry considerable information for ASR.
  There are tradeoffs between “static” spectral features and “dynamic” spectral trajectory features.
  Spectral resolution can be relatively low for spectral ASR features.
  “Information” in trajectory features is more “dilute” than in spectral features.
19 Questions?