
1 Voice Activity Detection based on Optimally Weighted Combination of Multiple Features Yusuke Kida and Tatsuya Kawahara School of Informatics, Kyoto University, Sakyo-ku, Kyoto 606-8501, Japan Presenter: Chen, Hung-Bin Eurospeech 2005 / ICSLP 2006

2 Outline Introduction Weighted Combination of VAD Methods –Features and Methods for VAD Weight Optimization Using MCE Training Experiment Conclusion

3 Introduction Voice activity detection (VAD) is a vital front-end in automatic speech recognition (ASR) systems, especially for robust operation in noisy environments. –If speech segments are not correctly detected, the subsequent recognition processes would often be meaningless. However, noise conditions vary widely, and no single method is expected to cope with all of them. To realize VAD that is robust against various kinds of noise, we have proposed a combination of multiple features.

4 Weighted Combination of VAD Methods The framework of our VAD system is shown in Figure 1. Four features are calculated: –amplitude level, ZCR, spectral information, and GMM likelihood. The features are shown as f(1), ..., f(4) in the figure, and they are combined with weights w1, ..., w4.

5 Features and Methods for VAD Amplitude level –Amplitude level is one of the most common features in VAD methods and is used in many applications. –The amplitude level at the t-th frame, Et, is computed as the logarithm of the signal energy over N Hamming-windowed speech samples. –The feature used in the combination is then the ratio of the amplitude level of the input frame to that of noise, where En denotes the amplitude level of noise.
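A minimal sketch of this feature, assuming the log-energy is computed over Hamming-windowed samples, the noise level is estimated beforehand, and the ratio is taken as a difference in the log domain (function and variable names are illustrative, not from the paper):

```python
import numpy as np

def amplitude_feature(frame, noise_log_energy, eps=1e-10):
    """Log-energy of a Hamming-windowed frame, relative to the noise level."""
    windowed = frame * np.hamming(len(frame))
    log_energy = np.log(np.sum(windowed ** 2) + eps)   # E_t
    # Ratio of input amplitude to noise amplitude (difference in the log domain, assumed).
    return log_energy - noise_log_energy
```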

6 Features and Methods for VAD Zero crossing rate (ZCR) –The zero crossing rate (ZCR) is the number of times the signal level crosses zero during a fixed period of time. –As with the amplitude level, the ratio of the input frame's ZCR to that of noise is used as the feature, where Zt denotes the ZCR of the input frame and Zn denotes that of noise.
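A sketch of the ZCR feature under the same convention, assuming the noise ZCR has been estimated from a noise-only segment (names are illustrative):

```python
import numpy as np

def zero_crossing_rate(frame):
    """Number of sign changes within the frame."""
    signs = np.sign(frame)
    signs[signs == 0] = 1                     # treat exact zeros as positive
    return np.sum(signs[1:] != signs[:-1])

def zcr_feature(frame, noise_zcr, eps=1e-10):
    # Ratio of the input-frame ZCR to the noise ZCR, as described above.
    return zero_crossing_rate(frame) / (noise_zcr + eps)
```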

7 Features and Methods for VAD Spectral information –As shown in the figure, we partition the frequency domain into several channels and calculate the signal-to-noise ratio (SNR) for each channel. –The average of the per-channel SNRs is then used as the spectral-information feature.

8 Features and Methods for VAD Spectral information –The spectral-information feature is defined as the average over the B channels of the per-channel SNR, where B denotes the number of channels and the speech and noise terms indicate the average intensity within channel b for the input frame and for noise, respectively.
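A sketch of this feature: the band-averaged SNR over B channels, assuming the average noise power spectrum has been estimated from a noise-only segment and that the channels are a simple uniform split of the spectrum (both are assumptions, not details from the paper):

```python
import numpy as np

def spectral_feature(frame, noise_power_spectrum, num_channels=8, eps=1e-10):
    """Average per-channel SNR (in dB) between the input frame and the noise estimate.

    noise_power_spectrum is assumed to have the same length as the rfft of the frame.
    """
    power = np.abs(np.fft.rfft(frame)) ** 2
    # Split the spectrum into B contiguous channels (uniform split is an assumption).
    speech_bands = np.array_split(power, num_channels)
    noise_bands = np.array_split(noise_power_spectrum, num_channels)
    snrs = [10.0 * np.log10((np.mean(s) + eps) / (np.mean(n) + eps))
            for s, n in zip(speech_bands, noise_bands)]
    return float(np.mean(snrs))   # average over the B channels
```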

9 Features and Methods for VAD GMM likelihood –A log-likelihood ratio of the speech GMM to the noise GMM for the input frame is used as the GMM feature. –The feature is calculated as the difference of the frame's log-likelihoods under the two models, where the two parameter sets denote the GMM parameters for speech and noise, respectively.
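A sketch of the GMM feature using scikit-learn's GaussianMixture as a stand-in for the paper's GMMs; the 32 components and diagonal covariances follow the experimental setup described later, while the training matrices and the frame-level acoustic features are assumed and not shown:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_gmms(speech_feats, noise_feats):
    """speech_feats / noise_feats: (n_frames, n_dims) training matrices (assumed given)."""
    speech_gmm = GaussianMixture(n_components=32, covariance_type="diag").fit(speech_feats)
    noise_gmm = GaussianMixture(n_components=32, covariance_type="diag").fit(noise_feats)
    return speech_gmm, noise_gmm

def gmm_feature(frame_feats, speech_gmm, noise_gmm):
    """Log-likelihood ratio of the speech GMM to the noise GMM for one frame."""
    x = np.atleast_2d(frame_feats)
    return float(speech_gmm.score_samples(x)[0] - noise_gmm.score_samples(x)[0])
```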

10 Weighted Combination of VAD Methods The combined score of a data frame (t: frame number) is defined as the weighted sum of the features, –where K denotes the number of combined features. The weights must satisfy non-negativity and sum-to-one constraints, –and the initial weights are all equal.
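A sketch of the combined score; the non-negativity and sum-to-one constraints and the equal initialization (1/K) are taken from the description above:

```python
import numpy as np

def combined_score(features, weights):
    """Weighted sum of the K per-frame features."""
    weights = np.asarray(weights, dtype=float)
    assert np.all(weights >= 0) and np.isclose(weights.sum(), 1.0)
    return float(np.dot(weights, features))

K = 4
initial_weights = np.full(K, 1.0 / K)   # all weights equal before training
```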

11 Weighted Combination of VAD Methods Two discriminative functions judge whether each frame is speech or noise, –where θ denotes the threshold value of the combined score. A frame is regarded as a speech frame if the discriminative function of speech is larger than that of noise; otherwise, it is regarded as a noise frame.
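A sketch of the frame decision, writing the two discriminative functions as signed distances of the combined score from the threshold θ; this particular form is an assumption consistent with the description, not the paper's exact definition:

```python
def classify_frame(score, theta):
    """Return True for a speech frame, False for a noise frame."""
    g_speech = score - theta    # discriminative function of speech (assumed form)
    g_noise = theta - score     # discriminative function of noise (assumed form)
    return g_speech > g_noise   # equivalent to: score > theta
```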

12 Weight Optimization Using MCE Training To adapt our VAD scheme to noisy environments, we applied MCE training to the optimization of the weights. –For the MCE training, the misclassification measure of a training data frame is defined over the discriminative functions, –where k denotes the true cluster and m indicates the other cluster. –The loss function is defined as a differentiable sigmoid function approximating the 0-1 step loss, –where γ denotes the gradient (slope) of the sigmoid function.
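A sketch of these two quantities, assuming the common two-class MCE form in which the misclassification measure is the competing-class score minus the true-class score (so d > 0 means a misclassified frame); the exact definition in the paper may differ:

```python
import numpy as np

def misclassification_measure(g_true, g_other):
    """d > 0 when the competing cluster scores higher than the true cluster."""
    return g_other - g_true

def mce_loss(d, gamma=1.0):
    """Differentiable sigmoid approximation of the 0-1 step loss; gamma sets the slope."""
    return 1.0 / (1.0 + np.exp(-gamma * d))
```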

13 Weight Adjustment During the weight adjustment in the MCE training, the weight set is transformed into a new parameter set because of the positivity constraint on the weights (wk > 0). The weight adjustment is defined as a gradient-descent update of the transformed parameters, –where the step size is a monotonically decreasing learning rate.

14 Weight Adjustment The gradient used in the weight-adjustment equation is obtained by differentiating the loss function with respect to the transformed weights.

15 Weight Adjustment After the transformed weight set is updated, –it is converted back to the original weights, and the weights are normalized.
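A sketch of one weight-adjustment step covering the three slides above: the weights are re-parameterized to stay positive (a log/exp transform is assumed here; the paper's exact transform is not given in this transcript), updated by gradient descent on the MCE loss with a decreasing step size, then mapped back and renormalized:

```python
import numpy as np

def update_weights(weights, loss_gradient, step, epoch):
    """One MCE-style update; loss_gradient holds d(loss)/d(w_k) for the current frame."""
    eps = np.finfo(float).tiny
    transformed = np.log(weights + eps)          # enforce w_k > 0 via w_k = exp(.)
    learning_rate = step / (1.0 + epoch)         # monotonically decreasing step size
    # Chain rule: d(loss)/d(transformed_k) = d(loss)/d(w_k) * w_k under this transform.
    transformed -= learning_rate * loss_gradient * weights
    new_weights = np.exp(transformed)            # back to the weight domain
    return new_weights / new_weights.sum()       # renormalize so the weights sum to one
```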

16 Experiment Testing set –Speech data from ten speakers were used, with ten utterances per speaker for testing. Each utterance lasted a few seconds, and three-second pauses were inserted between utterances. –Noisy data were created by adding sensor-room, machine, and background-speech noise to the clean speech data at SNRs of 10 and 15 dB. –In total, we had 600 (= 3 noise types × 2 SNRs × 10 speakers × 10 utterances) samples as the test set. Training set –A different set of ten utterances, with text different from the test set, was used for weight training under each condition.

17 Experiment Frame length –100 ms for amplitude level and ZCR –250 ms for spectral information and GMM likelihood –The frame shift was 250 ms for each feature. Noise features –Noise statistics for amplitude level, ZCR, and spectral information were calculated from the first second of the speech data. GMM likelihood –A 32-component GMM with diagonal covariance matrices was used to model speech and noise. –The JNAS (Japanese Newspaper Article Sentences) corpus, which includes 306 speakers and about 32,000 utterances, was used to train the speech GMM parameters with the EM algorithm.

18 Evaluation Evaluation measures –The frame-based false alarm rate (FAR) and false rejection rate (FRR) were used as evaluation measures. –FAR is the percentage of non-speech frames incorrectly classified as speech. –FRR is the percentage of speech frames incorrectly classified as non-speech. –The equal error rate (EER) is obtained at the threshold where the false alarm rate and the false rejection rate are equal; the common value at that threshold is the EER.
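A sketch of the frame-based evaluation: FAR and FRR are swept over candidate thresholds and the EER is read off where the two rates cross (the simple grid search and averaging at the crossing point are simplifications, not the paper's procedure):

```python
import numpy as np

def far_frr(scores, labels, theta):
    """labels: 1 = speech frame, 0 = non-speech frame."""
    decisions = scores > theta
    far = np.mean(decisions[labels == 0])        # non-speech classified as speech
    frr = np.mean(~decisions[labels == 1])       # speech classified as non-speech
    return far, frr

def equal_error_rate(scores, labels):
    scores, labels = np.asarray(scores), np.asarray(labels)
    thetas = np.sort(np.unique(scores))
    rates = [far_frr(scores, labels, t) for t in thetas]
    diffs = [abs(far - frr) for far, frr in rates]
    far, frr = rates[int(np.argmin(diffs))]      # threshold where FAR is closest to FRR
    return (far + frr) / 2.0
```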

19 Experimental Results The equal error rate (EER) under each noise type with an SNR of 10 dB is shown in the table. Before training, the weights are all set equal (= 0.25).

20 Experimental Results These figures compare our proposed method to the individual methods we combined. –The horizontal axis corresponds to the FAR, and the vertical axis corresponds to the FRR.

21 Application and Evaluation in ASR For evaluation in ASR, we collected 1345 utterances from the same ten speakers, and made a test set by adding the same three types of noise at SNRs of 5, 10, and 15 dB. –Thus, we have 12105 samples (= 3 noise types × 3 SNRs × 1345 utterances). The acoustic model is a phonetic tied-mixture (PTM) triphone model based on multicondition training. The recognition task is simple conversation with a robot. A finite-state automaton grammar was handcrafted with a vocabulary of 865 words.

22 Experimental Results Tables 4-6 show ASR performance in word accuracy. (AC: air conditioner, CM: craft machine, BS: background speech)

23 Conclusion This paper presented a robust VAD method that adaptively combines four different features. The proposed method achieves significantly better performance than the conventional individual techniques. It is also shown that weight adaptation is possible with only one utterance and is as reliable as closed training.

