LOG-ENERGY DYNAMIC RANGE NORMALIZATION FOR ROBUST SPEECH RECOGNITION Weizhong Zhu and Douglas O’Shaughnessy INRS-EMT, University of Quebec Montreal, Quebec,


1 LOG-ENERGY DYNAMIC RANGE NORMALIZATION FOR ROBUST SPEECH RECOGNITION Weizhong Zhu and Douglas O’Shaughnessy INRS-EMT, University of Quebec Montreal, Quebec, H5A 1K6, Canada Presenter: Chen, Hung-Bin ICASSP 2005

2 Outline
–Introduction
–Observation
–Energy dynamic range normalization method 1
–Energy dynamic range normalization method 2
–Experiment
–Conclusion

3 Introduction Automatic Speech Recognition (ASR) has been in commercial use for decades but still has severe limitations. –Accuracy of speech recognition degrades rapidly when speech is distorted by noise. Methods to overcome the effects of noise must be applied to achieve good recognition accuracy in real applications, where various types of noise may exist. –Robust speech recognition is one of the most challenging areas of speech recognition.

4 Introduction (cont.) Methods of robust speech recognition can be classified into two approaches. –Front-end processing suppresses the noise to obtain more robust parameters. –Back-end processing compensates for the noise by adapting the parameters inside the HMM system. In this paper, we focus on the first approach. –We try to find a more effective way to remove the effects of additive noise on the log-energy feature. We propose a log-energy dynamic range normalization (ERN) method to minimize the mismatch between training and testing data. –The dynamic range of the log-energy feature sequence of an utterance is normalized to a target dynamic range.

5 Observation Comparing the log-energy feature sequence of noisy speech at 10 dB SNR with that of clean speech shows two differences: 1. The minimum value is elevated. 2. Valleys are buried by additive noise energy, while peaks are not affected as much.

6 Energy dynamic range normalization The larger difference in the valleys leads to a mismatch between clean and noisy speech. To minimize the mismatch, –we suggest an algorithm that scales the log-energy feature sequence of clean speech, lifting the valleys while keeping the peaks unchanged.

7 Energy dynamic range normalization Define the log-energy dynamic range of the sequence as follows:
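For reference, a sketch of the standard decibel definition of dynamic range for a sequence of frame energies. The symbols $E_i$ (energy of frame $i$) and $\mathrm{DR}$ are assumptions introduced here, and the notation on the slide may differ.

```latex
\mathrm{DR}\,(\mathrm{dB})
  = 10 \log_{10} \frac{\max_i E_i}{\min_i E_i}
  = 10 \left( \max_i \log_{10} E_i - \min_i \log_{10} E_i \right)
```

If the log-energy feature uses the natural logarithm instead, the difference of the maximum and minimum log-energy is multiplied by $10 / \ln 10$ to express it in dB.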

8 Energy dynamic range normalization Following are the steps of the proposed log-energy feature dynamic range normalization algorithm:
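A minimal sketch of how such a normalization could look, not the exact steps from the slide. It assumes natural-log energies, a target range given in dB, and a linear scaling that lifts the minimum to a target minimum while leaving the maximum unchanged. The names `ern_linear`, `log_energy_dynamic_range_db` and `target_range_db` are hypothetical.

```python
import math

# Factor converting a natural-log energy difference to dB (assumption:
# the log-energy feature uses the natural logarithm).
LN_TO_DB = 10.0 / math.log(10.0)


def log_energy_dynamic_range_db(log_e):
    """Dynamic range, in dB, of a sequence of natural-log frame energies."""
    return LN_TO_DB * (max(log_e) - min(log_e))


def ern_linear(log_e, target_range_db):
    """Lift valleys so the utterance spans at most target_range_db.

    Hypothetical linear scaling: the maximum is kept, the minimum is moved
    up to the target minimum, and values in between are lifted in
    proportion to their distance from the maximum.
    """
    hi, lo = max(log_e), min(log_e)
    target_lo = hi - target_range_db / LN_TO_DB
    if lo >= target_lo:
        # Dynamic range is already within the target; leave the data as is.
        return list(log_e)
    scale = (target_lo - lo) / (hi - lo)
    return [e + scale * (hi - e) for e in log_e]


if __name__ == "__main__":
    utterance = [2.0, 5.0, 8.0]  # placeholder log-energies, not real data
    normalized = ern_linear(utterance, target_range_db=13.0)
    print(normalized, log_energy_dynamic_range_db(normalized))
```

In this sketch the lift applied to each frame shrinks as the frame's log-energy approaches the maximum, so peaks stay in place while valleys rise.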

9 Energy dynamic range normalization A linear scaling equation may not be the best solution. We modify the linear scaling equation into a non-linear scaling equation.

10 Energy dynamic range normalization Figure 2 shows a schematic representation of the scaling effect of the proposed algorithm. –The scaling effect decreases as the log-energy value goes up, and the maximum of the sequence is unchanged.

11 Experiment The proposed method was evaluated on the Aurora 2 database. All recognition tests were conducted using the HTK recognition toolkit with the settings defined for evaluation. Speech models are eleven whole-word HMMs, each fixed to 16 states with 3 diagonal Gaussian mixtures per state. –Two silence models are defined. Data in Test A are corrupted by Subway, Babble, Car and Exhibition noise. Data in Test B are corrupted by Restaurant, Street, Airport and Station noise. In Test C, channel distortion is included in addition to the additive noise.

12 Recognition results The results in this section are given in terms of relative improvement (R.I.), –where NewScore and Baseline are the recognition accuracies for each test using the proposed and reference algorithms, respectively.
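A small sketch of how R.I. could be computed, assuming the convention commonly used with Aurora 2 results, where relative improvement is the relative reduction in error rate. This convention and the function name `relative_improvement` are assumptions; the slide's own equation may differ.

```python
def relative_improvement(new_score, baseline):
    """Relative improvement in percent.

    Both arguments are recognition accuracies in percent. The formula used
    here (assumed, not taken from the slide) is the relative reduction of
    the error rate: (NewScore - Baseline) / (100 - Baseline) * 100.
    """
    if baseline >= 100.0:
        raise ValueError("baseline accuracy must be below 100%")
    return (new_score - baseline) / (100.0 - baseline) * 100.0


if __name__ == "__main__":
    # Placeholder accuracies for illustration only, not results from the paper.
    print(relative_improvement(80.0, 60.0))
```

Under this convention, moving from 60% to 80% accuracy removes half of the remaining errors, so the relative improvement is 50%.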

13 Experiment Table 1 shows the relative improvements for different target log-energy dynamic ranges.

14 Experiment The relative improvements for different target dynamic ranges using the non-linear normalization method are shown in Table 2. The highest overall relative improvement, 30.83%, is achieved when the target range is set to 14 dB.

15 Experiment Table 3 compares the linear and non-linear normalization methods in terms of average relative improvement at different SNR levels. The mean recognition accuracy for each test set is the average of the recognition accuracies measured at 20, 15, 10, 5 and 0 dB SNR.
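The averaging rule can be sketched as below. The dictionary layout and the name `mean_accuracy` are hypothetical, and the accuracy values are placeholders rather than results from the paper. Conditions outside the 20 to 0 dB range, such as the clean and -5 dB conditions of Aurora 2, are ignored by the average.

```python
# SNR levels (dB) that enter the mean, as stated on the slide.
SNR_LEVELS_DB = (20, 15, 10, 5, 0)


def mean_accuracy(accuracy_by_snr):
    """Average the accuracies (percent) over the 20 to 0 dB SNR conditions."""
    return sum(accuracy_by_snr[snr] for snr in SNR_LEVELS_DB) / len(SNR_LEVELS_DB)


if __name__ == "__main__":
    # Placeholder accuracies for one test set, for illustration only.
    test_set = {"clean": 99.0, 20: 95.0, 15: 90.0, 10: 80.0, 5: 60.0, 0: 25.0, -5: 10.0}
    print(mean_accuracy(test_set))
```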

16 Experiment Experiment 2 (Table 4) addresses two questions:
–(1) What are the results of techniques such as cepstral mean and variance normalization?
–(2) Can the proposed algorithms be combined with these techniques to get an even better result?
Abbreviations:
–CMN: cepstral mean normalization, applied to all 13 parameters
–CVN: cepstral variance normalization, applied to all 13 parameters
–ERN(L): the proposed method with linear scaling
–ERN(N): the proposed method with non-linear scaling
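As a reminder of what CMN and CVN do, here is a minimal sketch assuming per-utterance statistics and features stored as a list of frames, each a list of 13 values. The function name `cmvn` and the per-utterance choice are assumptions; the slides do not say how the statistics were estimated.

```python
import math


def cmvn(frames, normalize_variance=True):
    """Cepstral mean (and optionally variance) normalization of one utterance.

    frames: list of feature vectors of equal length (e.g. 13 parameters).
    Returns a new list of frames with zero mean per dimension and, if
    normalize_variance is True, unit variance per dimension.
    """
    n = len(frames)
    dim = len(frames[0])
    # CMN: subtract the per-dimension mean over the utterance.
    means = [sum(f[d] for f in frames) / n for d in range(dim)]
    out = [[f[d] - means[d] for d in range(dim)] for f in frames]
    if normalize_variance:
        # CVN: divide each dimension by its standard deviation.
        for d in range(dim):
            var = sum(row[d] ** 2 for row in out) / n
            std = math.sqrt(var) or 1.0  # constant dimensions are left at zero
            for row in out:
                row[d] /= std
    return out


if __name__ == "__main__":
    # Two placeholder frames with two parameters each, for illustration only.
    print(cmvn([[0.0, 10.0], [4.0, 10.0]]))
```

Calling `cmvn(frames, normalize_variance=False)` gives CMN alone, which makes it easy to compare the two settings on the same features.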

17 Experiment

18 Conclusion A log-energy dynamic range normalization technique is introduced to improve ASR performance in noisy conditions. Reducing mismatch in log-energy leads to a large recognition improvement. It is also confirmed that the proposed algorithm can be combined with the cepstral mean or variance normalization techniques to achieve an even better result.

