Presenter: Shih-Hsiang(士翔)


1 Presenter: Shih-Hsiang(士翔)
ICASSP 2005

2 Abstract

3 Introduction
A speech recognition system can be divided into two parts:
Front-end processing (feature extraction): suppress the noise and get more robust parameters.
Back-end processing (HMM decoding): compensate for the noise and adapt the parameters.
Numerous efforts, based on MFCCs, have been made to get more noise-robust features:
Add pre-processing: noise reduction, speech enhancement.
Incorporate algorithms in an MFCC calculation framework: frequency masking, SNR normalization.
Add feature post-processing techniques: Cepstral Channel Normalization, Cepstral Mean Normalization.

4 Introduction
CMN is known to be a simple, noise-robust feature post-processing technique.
Compared with the cepstral coefficients, the log-energy feature has quite different characteristics, but the log-energy feature (or C0) is treated in the same way as the other cepstral coefficients.
The authors propose a log-energy dynamic range normalization (ERN) method to minimize the mismatch between training and testing data.
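
As a point of reference for the post-processing step mentioned above, here is a minimal sketch of cepstral mean normalization in its common per-utterance form: the mean of each coefficient over the utterance is subtracted from every frame. The function name and the toy two-coefficient frames are illustrative assumptions, not taken from the slides.

```python
from typing import List


def cepstral_mean_normalize(frames: List[List[float]]) -> List[List[float]]:
    """Subtract the per-utterance mean of each cepstral coefficient.

    `frames` holds one list of cepstral coefficients per frame
    (hypothetical layout chosen for this sketch).
    """
    if not frames:
        return []
    n_frames = len(frames)
    n_coeffs = len(frames[0])
    # Mean of coefficient k over all frames of the utterance.
    means = [sum(f[k] for f in frames) / n_frames for k in range(n_coeffs)]
    return [[f[k] - means[k] for k in range(n_coeffs)] for f in frames]


# Toy utterance with two frames and two coefficients per frame.
utterance = [[1.0, 4.0], [3.0, 8.0]]
normalized = cepstral_mean_normalize(utterance)
```

After normalization each coefficient has zero mean over the utterance, which removes a constant offset but does not change the dynamic range that ERN targets.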

5 Energy Dynamic Range Normalization
Additive noise leads to a mismatch between the log-energy sequences of clean and noisy speech (example shown at 10 dB SNR).
Compared with clean speech, the log-energy feature sequence of noisy speech has:
An elevated minimum value.
Valleys that are buried by additive noise energy, while peaks are not affected as much.

6 Energy Dynamic Range Normalization
The algorithm operates on the log-energy dynamic range of the sequence.
Algorithm:
Find Max = Max(Log(Energy_i)), i = 1..n, and Min = Min(Log(Energy_i)), i = 1..n.
Calculate the target minimum T_Min = α × Max.
If Min < T_Min, then for i = 1..n rescale Log(Energy_i) with either a linear or a non-linear mapping.
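
The steps above can be sketched in code. The exact linear and non-linear mapping formulas are not preserved in this transcript, so the sketch below assumes one plausible linear mapping: it keeps Max fixed and lifts Min to T_Min by linear interpolation. The function name and the value of alpha are illustrative assumptions.

```python
from typing import List


def normalize_log_energy_range(log_energy: List[float], alpha: float) -> List[float]:
    """Lift the minimum of a log-energy sequence to alpha * Max.

    Assumed linear mapping (not confirmed by the slides): Max stays
    where it is, Min moves to T_Min, values in between are interpolated.
    """
    if not log_energy:
        return []
    max_e = max(log_energy)
    min_e = min(log_energy)
    target_min = alpha * max_e
    # Nothing to do if the minimum already reaches the target,
    # or if the sequence is flat (avoids division by zero).
    if min_e >= target_min or max_e == min_e:
        return list(log_energy)
    scale = (target_min - min_e) / (max_e - min_e)
    return [e + scale * (max_e - e) for e in log_energy]


# Toy sequence: Max = 10, Min = 2, alpha = 0.5 gives T_Min = 5.
ern = normalize_log_energy_range([2.0, 6.0, 10.0], alpha=0.5)
```

With these toy values the minimum rises from 2 to 5 while the peak at 10 is unchanged, which mirrors the observation that noise buries valleys but leaves peaks largely intact.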

7 Energy Dynamic Range Normalization

8 Experiment
Evaluated on the Aurora 2 database.
Metrics: relative improvement (R.I.) and overall accuracy.
Two experiments:
Experiment 1: explore how good the performance is in terms of relative improvement.
Experiment 2: compare with other techniques.
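
The formula for relative improvement is not preserved in this transcript. The sketch below assumes the definition commonly used for recognition accuracy, namely the fraction of the baseline error that is removed. Both the definition and the accuracy figures in the example are illustrative assumptions, not results from the paper.

```python
def relative_improvement(baseline_acc: float, new_acc: float) -> float:
    """Relative improvement in percent, assuming accuracies in percent.

    Assumed definition: reduction in error rate divided by the
    baseline error rate.
    """
    baseline_error = 100.0 - baseline_acc
    if baseline_error <= 0.0:
        raise ValueError("baseline accuracy must be below 100%")
    return (new_acc - baseline_acc) / baseline_error * 100.0


# Made-up numbers: going from 80% to 90% accuracy halves the error.
ri = relative_improvement(80.0, 90.0)
```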

9 Experiment 1

10 Experiment 2

11 Conclusions
The proposed log-energy dynamic range normalization algorithm gives an overall relative performance improvement of about 30.83%.
It can also be combined with cepstral mean or variance normalization techniques to achieve an even better result.
The proposed method does not require any prior knowledge of the noise or its level.
It is difficult to use energy dynamic range normalization to deal with channel distortion.
Reducing the mismatch in log-energy leads to a large recognition improvement.

12 Presenter: Shih-Hsiang(士翔)
ASRU 2003

13 Introduction
Methods of robust speech recognition can be classified into two approaches.
The first focuses on finding more robust parameters that are minimally affected by the noise, such as formants and their movements (but there is no effective way to estimate them).
The second focuses on compensating for the noise effect. Error rates of automatic speech recognition could decrease if the noise type is known and well trained in the HMM model.
But there are countless kinds of noise in real speech conditions, and a training-testing mismatch always occurs when unknown noise is involved.

14 Introduction (cont.)
Human auditory properties and noise masking theory show that spectral peaks (formants) could be used successfully to discriminate speech from noise.
The aim of this paper is to introduce noise reduction and spectral emphasis techniques.
The approach does not require retraining the acoustic models.
It utilizes the gain coefficients estimated in noise reduction for spectral emphasis.

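15 Noise Reduction
Spectral subtraction: the basic principle is to subtract the magnitude of the noise spectrum from that of the noisy speech.
Assuming an additive noise model, the input noisy speech signal y(t) is expressed as:
Time domain: y(t) = s(t) + n(t), where s(t) is the clean speech and n(t) is the noise.
Frequency domain: Y(f) = S(f) + N(f).
The enhanced speech is recovered with an inverse Fourier transform.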
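
A minimal sketch of basic magnitude spectral subtraction on a single frame follows. It assumes the noise magnitude estimate is already available, floors negative results at zero, and reuses the phase of the noisy spectrum; these are common textbook choices rather than details confirmed by the slides. The function name is hypothetical.

```python
import cmath
from typing import List


def spectral_subtract(noisy_spectrum: List[complex],
                      noise_magnitude: List[float]) -> List[complex]:
    """Subtract an estimated noise magnitude from each frequency bin.

    The magnitude is floored at zero and the noisy phase is kept
    (both are assumptions of this sketch).
    """
    enhanced = []
    for y, n_mag in zip(noisy_spectrum, noise_magnitude):
        magnitude = max(abs(y) - n_mag, 0.0)
        phase = cmath.phase(y)
        enhanced.append(cmath.rect(magnitude, phase))
    return enhanced


# Two bins: the first keeps magnitude 5 - 2 = 3, the second is floored to 0.
out = spectral_subtract([3 + 4j, 1 + 0j], [2.0, 5.0])
```

An inverse Fourier transform of the enhanced spectrum would then give the time-domain signal; a front end that only needs spectral features can skip that step.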

16 Noise Reduction & Spectral Emphasis
Noise does not affect the speech signal uniformly over the entire spectrum, so multi-band or non-linear spectral subtraction works better, with a frequency-dependent over-subtraction factor.
In this paper, the gain function is computed as a function of the instantaneous estimated SNR in each frequency band on a frame-by-frame basis.
Noise estimate: r is the forgetting factor.
Gain function: a is the scale factor; b and c are the flooring factors.
Spectral emphasis: m is the weight factor.

17 Spectral emphasis
G(t,f) is an SNR-related value: it is higher at the spectral peaks (formants) than at the spectral valleys (buried by noise).
To compensate for the spectral mismatch between clean and noisy speech, some methods have been proposed, such as peak-to-valley ratio locking and SNR normalization, which manipulate the values of the spectral valleys.
This paper instead emphasizes the values of the spectral peaks.
Selecting a proper weight factor is a trade-off; the authors suggest a larger factor to emphasize the formants and discriminate them from noise.

18 Voice Activity Detector
Non-speech parts cause insertion errors, so deleting non-speech frames improves performance.
The estimated gain function can serve as a good detector: a frame whose gain is at the flooring value in all frequency bands is marked as a non-speech frame.
Leave a 10-frame non-speech segment before and after each speech segment, and delete every other non-speech frame.
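
The frame selection rule above can be sketched as follows. The sketch assumes the per-frame gains are given as one list of per-band values per frame and that "at the flooring value" means no band exceeds the floor. The function name, data layout, and toy values are illustrative assumptions.

```python
from typing import List


def select_frames(gains: List[List[float]], floor: float,
                  margin: int = 10) -> List[int]:
    """Return indices of frames to keep.

    A frame is non-speech when every band gain sits at the floor.
    Speech frames are kept together with `margin` non-speech frames
    on either side; all other non-speech frames are dropped.
    """
    is_speech = [any(g > floor for g in frame) for frame in gains]
    keep = [False] * len(gains)
    for i, speech in enumerate(is_speech):
        if speech:
            lo = max(0, i - margin)
            hi = min(len(gains), i + margin + 1)
            for j in range(lo, hi):
                keep[j] = True
    return [i for i, k in enumerate(keep) if k]


# Five frames, two bands each; only frame 2 rises above the floor of 0.1.
toy_gains = [[0.1, 0.1], [0.1, 0.1], [0.9, 0.1], [0.1, 0.1], [0.1, 0.1]]
kept = select_frames(toy_gains, floor=0.1, margin=1)
```

The example uses a margin of 1 to keep the toy data short; the slides use a 10-frame margin, which is the default here.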

19 Experiment
Evaluated on the Aurora 2 database in four experiments:
Noise reduction: a total relative performance improvement of 37.76% can be obtained by the proposed noise reduction method alone.
Spectral emphasis: a more aggressive factor can bring a bigger gain for noisy speech, but it also harms clean speech.
Voice activity detector: it works well, especially in low-SNR conditions (5% improvement).
Noise reduction + spectral emphasis + voice activity detector.

20 Recognition results – Noise Reduction
From 38.61% to 67.02%

21 Recognition results – Spectral Emphasis

22 Recognition results

23 Recognition results

24 Conclusions
Noise reduction and spectral emphasis techniques are used to improve ASR performance in noisy conditions.
The proposed algorithms can easily be embedded in a standard front-end MFCC calculation program, with a low computational load suitable for real-time operation.
The approach works well for all 8 types of test noise as well as for noise plus channel distortion.

