Download presentation
Presentation is loading. Please wait.
Published byJeffery Harrell Modified over 6 years ago
1
Spectral and Temporal Modulation Features for Phonetic Recognition Stephen A. Zahorian, Hongbing Hu, Zhengqing Chen, Jiang Wu Department of Electrical and Computer Engineering, Binghamton University, Binghamton, NY 13902, USA {zahorian, hongbing.hu, zhengqing.chen, Introduction Modulation Spectrum Represent temporal trajectory of spectral information (e.g. MFCC delta coefficients) Considered to be of secondary importance as “dynamic” features compared to “static” spectral information (e.g. MFCCs, formants) Very effective for noise robust ASR when extracted from spectral information over long time intervals Combination of Spectral and Modulation Features Spectral Features: Discrete Cosine Transform (DCT) Coefficients of the log magnitude spectrum Temporal Modulation Features: Discrete Cosine Series (DCS) expansion of DCT coefficients Each term is computed from an integrated feature of the entire frequency spectrum The DCT/DCS feature set can be tuned to emphasize spectral info, or modulation info, or their combination DCTC/DCSC Features DCTC Computation Discrete Cosine Transform Coefficients (DCTC) Spectral features obtained form a modified cosine transform of the spectrum Given the spectrum X with the frequency f normalized to a [0, 1] range, the ith DCTC is calculated: DCSC Computation Discrete Cosine Series Coefficients (DCSC) Represent the spectral evolution of DCTCs over time and encode the modulation spectrum A cosine basis vector expansion over time using overlapping blocks of DCTCs. Temporal resolution is highest at the center of the region DCTC Basis Vectors Spectra DCSC Basis Vectors DCTC Features Basis vectors: h(t): time “warping” function—non-uniform time resolution Basis vector : a(X): nonlinear amplitude scaling (log) g(f): nonlinear frequency warping (Mel-like function) DCTC/DCSC Features The features can easily to be varied to examine tradeoffs between static spectral information and trajectory (dynamic) spectral information Literature Comparison Recognition accuracy comparison using clean speech for TIMIT and NTIMIT Exp 2: Spectral Features High resolution spectral DCTC features only The spectral only case as no DCSC terms were used 20 DCTCs computed using 25 ms frames, 10 ms frame spacing and Mel-like frequency warping 3-state HMMs with 25 mixtures Exp 1: Control Control Experiment with MFCCs 12 MFCCs plus energy with delta and acceleration terms (39 total terms) Frame spacing:10 ms, frame length: 25ms 3-state HMMs with 75 mixtures used for recognition Experimental Evaluation Experimental Setup Database: TIMIT and NTIMIT (‘SI’ and ‘SX” only) 39 phoneme set mapped down from 62 phoneme set Training Data: 3696 sentences (460 speakers) Testing Data: 1344 sentences (168 speakers) SNRs: clean, 30, 20, 10, and 0 dB Recognizer: Hidden Markov Model (HTK3.4) Left-to-right HMMs with 3 states and varying number of mixtures per state 48 Monophone HMMs Bigram phones used as the language model TIMIT Feature Recognizer Accuracy Study PLP ANN/HMM 71.50% Ketabdar et al. (2008) [9] MFCC GMM 70% Sha et al. (2006) [10] DCT/DCS HMM 73.85% this study NTIMIT 58.79% * Morales et al. (2007) [11] 62.50% *: A 51-phoneme set and full NTIMIT were used. Exp 4: Combined Features Combine the high resolution spectral features with the modulation spectrum features The 20 spectral features in Exp 2 and the 60 modulation features in Exp 3 were combined (80 features total) 75 mixtures used for each state of HMMs Exp 5: Integrated Features Integrated DCTC/DCSC features to capture both spectral and modulation information 13 DCTCs with 6 DCSCs for each (78 total terms) DCTCs computed with 8 ms frames spaced 2 ms A 500 ms block length (250 frames) and a block spacing of 4 frames (8 ms) used for the DCSC computation 3-state HMMs with 75 mixtures Exp 3: Modulation Spectrum Features Using DSCS features for modulation spectrum 6 DSCSs with 10 DCTCs for each (60 total terms ) DCTCS computed using 6 ms frames and 2 ms frame spacing A 500 ms block length (250 frames) and a block spacing of 4 frames (8 ms) used for the DCSC computation 3-state HMMs with 32 mixtures Conclusions The modulation information (DCSCs) is obtained by extracting the temporal trajectories of integrated frequency domain features (DCTCs) The modulation features are far more noise robust than spectral features The integrated spectral and temporal information approach was the best overall, and (slightly) preferred over the combination of high resolution spectral information with modulation features Phonetic recognition results using both the TIMIT and NTIMIT databases compare favorably with any results reported in the literature using these databases
Similar presentations
© 2024 SlidePlayer.com Inc.
All rights reserved.