1 Electrical and Computer Engineering Binghamton University, State University of New York
Spectral/Temporal Acoustic Features for Automatic Speech Recognition
Stephen A. Zahorian, Hongbing Hu, Jiang Wu
Department of Electrical and Computer Engineering, Binghamton University
November 16th, 2010

2 Overview of talk
- Background/Introduction
- Review of traditional spectral/temporal features
- DCTC/DCS features
- Experimental results
- Conclusions

3 Most Typical Speech Features for ASR
- Spectral Features (Static Features)
  - Represent the vocal tract information
  - MFCCs (Mel-Frequency Cepstral Coefficients)
- Temporal Features (Dynamic Features)
  - Capture time variation (trajectory) of spectral features
  - Delta and Delta-Delta terms of MFCCs

4 MFCCs (Mel-Frequency Cepstral Coefficients)
- Mel-Frequency Scale
- The coefficients c_i are calculated from the log filter-bank amplitudes using the cosine transform
  - N: number of filter banks
  - m_j: log filter-bank amplitudes
[Figure: Mel-scale filter banks (20)]
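The formula on this slide is not preserved in the transcript. As a minimal sketch, assuming the HTK-style cosine transform c_i = sqrt(2/N) * sum over j = 1..N of m_j * cos(pi * i * (j - 0.5) / N), and assuming the common mel mapping mel(f) = 2595 * log10(1 + f/700), the step from log filter-bank amplitudes to cepstral coefficients could look as follows. The function names are illustrative, not taken from the talk.

```python
import math

def hz_to_mel(freq_hz):
    # Common mel-scale mapping (assumed; the slide only names the scale)
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)

def cepstral_coefficients(log_amplitudes, num_ceps):
    # log_amplitudes: m_1..m_N, one log amplitude per mel filter bank
    n = len(log_amplitudes)
    scale = math.sqrt(2.0 / n)
    return [
        scale * sum(m_j * math.cos(math.pi * i / n * (j - 0.5))
                    for j, m_j in enumerate(log_amplitudes, start=1))
        for i in range(1, num_ceps + 1)
    ]

# A flat log spectrum over 20 banks has no spectral shape,
# so every coefficient with i >= 1 comes out numerically zero.
flat_ceps = cepstral_coefficients([1.0] * 20, 12)
```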

5 Speech Recognition Architecture
[Diagram: Speech Waveform -> Feature Extraction -> Speech Features -> Recognizer (HMM/NN), Classification (Recognition) -> Phonemes -> Words (example output: "I need a")]

6 Hidden Markov Models (HMMs)
- Speech vectors are generated by a Markov model
- The overall probability is calculated as the product of the transition and output probabilities
- Likelihood can be approximated by only considering the most likely state sequence
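The difference between the full likelihood and the most-likely-path approximation can be illustrated with a toy model. This is only a sketch: it assumes discrete output symbols instead of the continuous speech vectors used in the talk, and the 3-state left-to-right model and all probabilities below are made-up values.

```python
def forward_likelihood(init, trans, emit, obs):
    # Sum over all state sequences of the product of
    # transition and output probabilities.
    n = len(init)
    alpha = [init[s] * emit[s][obs[0]] for s in range(n)]
    for o in obs[1:]:
        alpha = [sum(alpha[p] * trans[p][s] for p in range(n)) * emit[s][o]
                 for s in range(n)]
    return sum(alpha)

def viterbi_likelihood(init, trans, emit, obs):
    # Same recursion, but keep only the best predecessor at each step,
    # i.e. the probability of the single most likely state sequence.
    n = len(init)
    delta = [init[s] * emit[s][obs[0]] for s in range(n)]
    for o in obs[1:]:
        delta = [max(delta[p] * trans[p][s] for p in range(n)) * emit[s][o]
                 for s in range(n)]
    return max(delta)

# Hypothetical left-to-right model with no skips
init = [1.0, 0.0, 0.0]
trans = [[0.6, 0.4, 0.0],
         [0.0, 0.7, 0.3],
         [0.0, 0.0, 1.0]]
emit = [[0.8, 0.2],
        [0.3, 0.7],
        [0.5, 0.5]]
obs = [0, 0, 1, 1]

full = forward_likelihood(init, trans, emit, obs)
best_path = viterbi_likelihood(init, trans, emit, obs)
```

The best-path value can never exceed the full likelihood, because it is one term of the sum that the forward recursion accumulates.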

7 DCTC Features
- Discrete Cosine Transform Coefficients (DCTCs)
- Given the spectrum X with the frequency f normalized to a [0, 1] range, the ith DCTC is calculated: [Equation: ith DCTC]
- Basis vector: [Equation: DCTC basis vector]
  - a(X): nonlinear amplitude scaling (log)
  - g(f): nonlinear frequency warping (Mel-like function)
[Figure: First 3 DCTC basis vectors]
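The DCTC equations are not preserved in the transcript. The sketch below assumes one plausible form consistent with the definitions on this slide: DCTC(i) is the integral over f in [0, 1] of a(X(f)) * phi_i(f), with basis vector phi_i(f) = cos(pi * i * g(f)) * dg/df. The logarithmic warping g(f) and its constant c are hypothetical stand-ins for the "Mel-like function"; the integral is approximated with a midpoint sum.

```python
import math

def warp(f, c=10.0):
    # Hypothetical Mel-like warping g(f), mapping [0, 1] onto [0, 1]
    return math.log1p(c * f) / math.log1p(c)

def warp_derivative(f, c=10.0):
    # dg/df for the warping above
    return c / ((1.0 + c * f) * math.log1p(c))

def dctc(spectrum, num_coeffs, c=10.0):
    # spectrum: positive magnitudes sampled uniformly on f in [0, 1]
    n = len(spectrum)
    coeffs = []
    for i in range(num_coeffs):
        total = 0.0
        for k, x in enumerate(spectrum):
            f = (k + 0.5) / n  # midpoint of the k-th frequency bin
            basis = math.cos(math.pi * i * warp(f, c)) * warp_derivative(f, c)
            total += math.log(x) * basis / n  # a(X) = log amplitude scaling
        coeffs.append(total)
    return coeffs

# A flat spectrum with log amplitude 1 puts all energy in DCTC 0
flat_dctc = dctc([math.e] * 512, 3)
```

Under these assumptions the warping concentrates resolution at low frequencies, where g changes fastest, which is the role the slide assigns to g(f).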

8 DCS Features
- Discrete Cosine Series Coefficients (DCSCs)
- Represent the spectral evolution of DCTCs over time and encode the modulation spectrum
- Basis vectors: [Equation: DCSC basis vector]
  - h(t): time “warping” function (non-uniform time resolution)
[Figure: First 3 DCSC basis vectors]
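As a rough illustration of how a cosine series can summarize the trajectory of one DCTC over a block of frames, the sketch below assumes basis vectors of the form cos(pi * j * h(t)) * dh/dt on a block normalized to t in [0, 1]. The default h(t) = t (no warping) is a simplification chosen here for clarity; the talk uses a non-uniform h(t) whose exact form is not given in the transcript.

```python
import math

def dcs_terms(trajectory, num_terms, h=lambda t: t, dh=lambda t: 1.0):
    # trajectory: values of one DCTC at successive frames within a block
    n = len(trajectory)
    terms = []
    for j in range(num_terms):
        total = 0.0
        for k, value in enumerate(trajectory):
            t = (k + 0.5) / n  # frame midpoint on the normalized block axis
            total += value * math.cos(math.pi * j * h(t)) * dh(t) / n
        terms.append(total)
    return terms

# A constant trajectory is captured entirely by term 0
flat_terms = dcs_terms([2.0] * 50, 3)

# A rising trajectory shows up in term 1 (the slope-like term)
ramp = [(k + 0.5) / 50 for k in range(50)]
ramp_terms = dcs_terms(ramp, 3)
```

Term 0 behaves like a block average, while the higher terms respond to slower and then faster changes over the block, which is how they relate to delta-style dynamic features.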

9 Example
- Original spectrogram, and its rebuilt version with different selections of features
[Figure panels: Original spectrogram; Rebuilt with 13 DCTC and 3 DCS terms; Rebuilt with 8 DCTC and 5 DCS terms]

10 DCTC/DCS Computation
[Diagram: Spectrogram (Frame Length, Block Length) -> DCTC 1 to DCTC 5 per frame -> DCS1 to DCS3 per block -> DCTC/DCS Features]
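The block-based computation in this diagram can be sketched as a sliding window over per-frame DCTCs, with each block reduced to one flattened vector of (number of DCTCs) x (number of DCS terms) features. The function name, the unwarped cosine basis, and the block and step sizes below are assumptions for illustration only.

```python
import math

def block_features(dctc_frames, frames_per_block, block_step, num_dcs):
    # dctc_frames: one list of DCTC values per frame
    num_dctc = len(dctc_frames[0])
    # Cosine basis sampled at frame midpoints (uniform time axis assumed)
    basis = [[math.cos(math.pi * j * (t + 0.5) / frames_per_block) / frames_per_block
              for t in range(frames_per_block)]
             for j in range(num_dcs)]
    features = []
    last_start = len(dctc_frames) - frames_per_block
    for start in range(0, last_start + 1, block_step):
        block = dctc_frames[start:start + frames_per_block]
        # One DCS expansion per DCTC trajectory, flattened into a single vector
        vector = [sum(block[t][i] * basis[j][t] for t in range(frames_per_block))
                  for i in range(num_dctc)
                  for j in range(num_dcs)]
        features.append(vector)
    return features

# 20 frames of 5 DCTCs, blocks of 10 frames advanced by 5 frames, 3 DCS terms
frames = [[1.0] * 5 for _ in range(20)]
feats = block_features(frames, frames_per_block=10, block_step=5, num_dcs=3)
```

With 5 DCTCs and 3 DCS terms, matching the labels in the diagram, each block yields 15 features.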

11 Experimental Evaluation
- Database: TIMIT database (“SI” and “SX” only)
  - Phoneme: Reduced 48 phone set mapped down from the TIMIT 62 phone set
  - Training data: 3696 sentences (462 speakers)
  - Testing data: 1344 sentences (168 speakers)
- Recognizer: HMMs
  - Left-to-right Markov models with no skip
  - 48 monophone HMMs are created using the HTK toolkit
  - Bigram phone information was used as the language model
- Cambridge University/Microsoft HTK toolkit (Ver3.4)
  - Provide powerful tools for data preparation, HMM training and testing, result analysis

12 Experimental Evaluation
- TIMIT database
  - 630 total speakers, 10 sentences each
  - 462 speakers for training, 168 test speakers
- 3 state HMM phone models
- Results given as phone accuracy for 39 “standard” phone categories
- Number of mixtures per state “relatively” high to maximize accuracy
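The sentence counts quoted for the TIMIT setup follow from the corpus layout: each speaker reads 2 "SA", 5 "SX", and 3 "SI" sentences, so using only "SI" and "SX" leaves 8 sentences per speaker. A quick check of that arithmetic (variable names are illustrative):

```python
# TIMIT sentence types per speaker: 2 SA + 5 SX + 3 SI = 10
SENTENCES_PER_SPEAKER = {"SA": 2, "SX": 5, "SI": 3}

# Only SI and SX sentences are used in these experiments
used_per_speaker = SENTENCES_PER_SPEAKER["SX"] + SENTENCES_PER_SPEAKER["SI"]

train_speakers = 462
test_speakers = 168

train_sentences = train_speakers * used_per_speaker
test_sentences = test_speakers * used_per_speaker
total_speakers = train_speakers + test_speakers
```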

13 Evaluation with Static Only Features
- Vary frame length from 5 ms to 30 ms (5 ms frame space)
- Vary number of DCTCs (7, 10, 13, 16, 19)
- 8 GMM mixtures for each HMM state

14 Evaluation with Dynamic Features
- Use small number of DCTCs (1, 2, 3, or 4), and vary the number of DCSs
- Vary the number of frames per block, so that DCS terms are computed over 50, 100, or 300 ms
- 10 ms frame length, 5 ms frame space
- 8 GMM mixtures for each HMM state
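The relation between block duration and frames per block depends on a convention the slide does not state. The sketch below assumes a block spans from the start of its first frame to the end of its last frame, so n frames cover frame_length + (n - 1) * frame_space milliseconds. A different convention (for example, block duration divided by frame space) would give slightly different counts.

```python
def frames_per_block(block_ms, frame_length_ms=10, frame_space_ms=5):
    # Largest n with frame_length + (n - 1) * frame_space <= block_ms
    if block_ms < frame_length_ms:
        return 0
    return (block_ms - frame_length_ms) // frame_space_ms + 1

# Block durations from the slide, with 10 ms frames spaced 5 ms apart
counts = {block_ms: frames_per_block(block_ms) for block_ms in (50, 100, 300)}
```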

16 Evaluation with Spectral/Temporal Features
- Use 40 features total, and 40 GMM mixtures
- Vary frame length and the number of frames per block
  - 2 ms frame space
  - 8 ms block space
- Vary the combination of different numbers of DCTCs and DCSs, but fix number of parameters to 40

17 Evaluation with Spectral/Temporal Features
- Condition 1: 8 DCTCs and 5 DCSs
- Condition 2: 9 DCTCs and 5 DCSs
- Condition 3: 10 DCTCs and 4 DCSs
- Condition 4: 11 DCTCs and 4 DCSs
- Condition 5: 12 DCTCs and 4 DCSs
- Condition 6: 13 DCTCs and 4 DCSs
- Condition 7: 14 DCTCs and 3 DCSs
- Condition 8: 15 DCTCs and 3 DCSs

18 Conclusions from these results
- Features which represent trajectories of global spectral shape carry considerable information for ASR
- There are tradeoffs between “static” spectral features and “dynamic” spectral trajectory features
- Spectral resolution can be relatively low for spectral ASR features
- “Information” in trajectory features is more “dilute” than in spectral features

19 Questions?

