
1 Noise Compensation for Speech Recognition with Arbitrary Additive Noise
Ji Ming, School of Computer Science, Queen's University Belfast, Belfast BT7 1NN, UK
Presented by Shih-Hsiang
IEEE Trans. on Audio, Speech, and Language Processing, Vol. 14, No. 3, May 2006

2 Introduction
Speech recognition performance is known to degrade dramatically when a mismatch occurs between training and testing conditions
Traditional approaches for removing the mismatch, and thereby reducing the effect of noise on recognition, include
– Removing the noise from the test signal
  Noise filtering or speech enhancement: spectral subtraction, Wiener filtering, RASTA filtering
– Assuming the availability of a priori knowledge
  Constructing a new acoustic model to match the appropriate test environment
  Noise or environment compensation: model adaptation, parallel model combination (PMC), multi-condition training, SPLICE
  Real-world noisy training data is needed
More recent studies focus on methods requiring less knowledge, since such knowledge can be difficult to obtain in real-world applications

3 Introduction (cont.)
This paper investigates noise compensation for speech recognition
– Involving additive noise of any corruption type (e.g., full, partial, stationary, or time-varying)
– Assuming no knowledge about the noise characteristics and no training data from the noisy environment
This paper proposes a method that focuses recognition only on reliable features, yet remains robust to full noise corruption that affects all time-frequency components of the speech representation
– Combining artificial noise compensation with the missing-feature method, to accommodate mismatches between the simulated noise condition and the actual noise condition
– This makes it possible to accommodate sophisticated spectral distortion, e.g., full, partial, white, colored, or none
– Based on clean speech training data and simulated noise data
– Named "Universal Compensation (UC)"

4 Methodology
The UC method comprises three steps
– Construct a set of models for short-time speech spectra using artificial multi-condition speech data
  Generated by corrupting the clean training data with artificial wide-band flat-spectrum noise at consecutive SNRs
– Given a test spectrum
  Search for the spectral components in each model spectrum that best match the corresponding spectral components in the test spectrum
  Produce a score based on the matched components for each model spectrum
– Combine the scores from the individual model spectra to form an overall score for recognition

5 Methodology (cont.)
Step 1: Generate noise by passing white noise through a low-pass filter
Step 2: Calculate a score for each model spectrum based only on the matched spectral components
Step 3: Combine the individual scores from the model spectra to produce an overall score
(Figure: clean training spectrum, artificial wide-band flat-spectrum noise, noisy test spectrum)
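Step 1 can be sketched as follows. This is a minimal illustration, not the paper's implementation: the one-pole low-pass filter and the sine "utterance" are assumptions; only the 3.5 kHz cutoff and the 20 dB to 2 dB SNR range come from the slides.

```python
import numpy as np

def flat_spectrum_noise(n_samples, fs=8000, cutoff_hz=3500.0, rng=None):
    """Wide-band flat-spectrum noise: white noise through a low-pass
    filter. The 3.5 kHz cutoff follows the paper; the one-pole filter
    here is an illustrative stand-in for the unspecified filter design."""
    rng = rng or np.random.default_rng(0)
    white = rng.standard_normal(n_samples)
    a = np.exp(-2.0 * np.pi * cutoff_hz / fs)   # one-pole coefficient
    out = np.empty(n_samples)
    acc = 0.0
    for t in range(n_samples):
        acc = a * acc + (1.0 - a) * white[t]
        out[t] = acc
    return out

def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so that 10*log10(P_speech / P_noise) = snr_db,
    then add it to the speech."""
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    gain = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise

# Corrupt one toy utterance at the 10 consecutive SNRs used in the paper
speech = np.sin(2 * np.pi * 440.0 * np.arange(8000) / 8000.0)
noise = flat_spectrum_noise(len(speech))
multi_condition = {snr: mix_at_snr(speech, noise, snr)
                   for snr in range(20, 0, -2)}   # 20 dB ... 2 dB
```

Running this over every clean training utterance yields the artificial multi-condition training set from which the per-SNR model spectra are estimated.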

6 Methodology (cont.)
A key to the success of the UC method is the accuracy of converting a full-band corruption into a partial-band corruption
This accuracy is determined by two factors
– The frequency-band resolution
  Determines the bandwidth of each spectral component
  The smaller the bandwidth, the more accurate the approximation of an arbitrary noise spectrum by piecewise-flat spectra
  But a small bandwidth usually results in a loss of correlation between the spectral components, giving poor phonetic discrimination
  An optimum frequency-band subdivision, in terms of a good balance between noise spectral resolution and phonetic discrimination, remains a topic for study
– The amplitude resolution
  Refers to the number of steps used to quantize the SNR
  The finer the quantizing steps, the more accurate the approximation for any given level of noise
  However, using a large number of SNRs may result in low computational efficiency

7 Formulation: A. Model and Training Algorithms
Assume that each training frame is represented by a spectral vector x = (x_1, ..., x_B) consisting of B sub-band spectral components
Assume that N levels of SNR are used to generate the wide-band flat-spectrum noise to form the noisy training data
Let p(x | s, n) represent a model spectrum, expressed as the probability distribution of the model spectral vector, associated with speech state s and trained on SNR level n
Let y = (y_1, ..., y_B) be a test spectral vector
Recognition involves classifying each test spectrum into an appropriate speech state, based on the probabilities of the test spectrum associated with the individual model spectra within the state
Computing the probability for each model spectrum
– Only the matched spectral components are retained
– The mismatched components are ignored

8 Formulation (cont.): A. Model and Training Algorithms
The probability p(y | s, n) can be approximated by p(y_sub | s, n), the marginal distribution of the matched subset y_sub obtained from p(y | s, n) with the mismatched spectral components ignored, to improve mismatch robustness
Given p(y_sub | s, n) for each model spectrum, the overall probability of y associated with speech state s can be obtained by combining over all N SNR levels:
  p(y | s) = Σ_n P(n) p(y_sub | s, n)   (1)
For simplicity, assume that the individual spectral components are independent of one another, so the probability for any subset can be written as
  p(y_sub | s, n) = Π_{b ∈ sub} p(y_b | s, n)   (2)
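The two equations above can be sketched directly in code. The per-component Gaussian density is an assumption here (the slides do not specify the component distribution); the function and variable names are illustrative.

```python
import numpy as np

def component_pdf(y_b, mean_b, var_b):
    # Per-subband Gaussian density; the Gaussian form is an assumption,
    # standing in for whatever distribution models each component
    return np.exp(-(y_b - mean_b) ** 2 / (2 * var_b)) / np.sqrt(2 * np.pi * var_b)

def marginal_prob(y, mean, var, subset):
    """Eq. (2): with independent components, the probability of any
    subset is the product of the matched components' densities; the
    mismatched components are simply left out of the product."""
    return float(np.prod([component_pdf(y[b], mean[b], var[b]) for b in subset]))

def state_prob(y, snr_models, subset, weights=None):
    """Eq. (1): combine the per-SNR model spectra of one speech state.
    `snr_models` is a list of (mean, var) pairs, one per SNR level;
    uniform SNR weights P(n) are assumed unless given."""
    if weights is None:
        weights = [1.0 / len(snr_models)] * len(snr_models)
    return sum(w * marginal_prob(y, m, v, subset)
               for w, (m, v) in zip(weights, snr_models))
```

Marginalizing by dropping factors from the product is exactly what makes the missing-feature treatment cheap under the independence assumption.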

9 Formulation (cont.): A. Model and Training Algorithms
The model spectrum may be constructed in two different ways
– First, we may estimate each p(y | s, n) explicitly using the training data corresponding to a specific SNR
– Alternatively, we may build the model by pooling the training data from all SNR conditions together and training it as a usual mixture model on the mixed dataset (more flexible)
  The EM algorithm is used to decide the association between data, mixtures, and weights

10 Formulation (cont.): B. Recognition Algorithm
Given a test spectral vector y, evaluate the mixture probability in (1) using only a subset of the components for each of the mixture densities
– This reduces the effect of mismatched noisy spectral components
– But we need to decide the matched subset that contains all the matched components for each model spectrum
If we can assume that the matched subset produces a large probability, then it may be defined as the subset that maximizes the probability among all possible subsets of components
However, (2) indicates that the values of p(y_sub | s, n) for different-sized subsets are of different orders of magnitude and thus not directly comparable
– An appropriate normalization is needed for the probability
– A possible solution is to replace the conditional probability of the test subset with the posterior probability of the model spectrum, which always produces a value in the range [0, 1]

11 Formulation (cont.): B. Recognition Algorithm
By maximizing the posterior probability, we should be able to obtain, for each model spectrum, the subset that contains all the matched components. The optimum decision is
  y_sub* = argmax_sub P(s, n | y_sub) = argmax_sub [ p(y_sub | s, n) P(s, n) / Σ_{s', n'} p(y_sub | s', n') P(s', n') ]   (3)
assuming an equal prior P(s) for all the states (MAP criterion)
The optimized posterior probability can be incorporated into an HMM to form the state-based emission probability
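A toy sketch of this posterior-based subset selection follows. The exhaustive search over subsets, the Gaussian components, and the two-model setup are all illustrative assumptions; only the posterior normalization with equal priors comes from the slide.

```python
import numpy as np
from itertools import combinations

def gauss(y_b, mean_b, var_b):
    return np.exp(-(y_b - mean_b) ** 2 / (2 * var_b)) / np.sqrt(2 * np.pi * var_b)

def subset_likelihood(y, mean, var, subset):
    # Product of independent per-component densities over the subset
    return float(np.prod([gauss(y[b], mean[b], var[b]) for b in subset]))

def map_subset(y, models, target):
    """Pick the component subset maximizing the posterior of the target
    model spectrum given that subset. The posterior lies in [0, 1], so
    subsets of different sizes become comparable; equal priors over
    models are assumed, as on the slide."""
    n = len(y)
    best_sub, best_post = None, -1.0
    for k in range(1, n + 1):
        for sub in combinations(range(n), k):
            den = sum(subset_likelihood(y, m, v, sub) for m, v in models.values())
            post = subset_likelihood(y, *models[target], sub) / den
            if post > best_post:
                best_post, best_sub = post, sub
    return best_sub, best_post

# Toy example: components 0 and 1 match model "A"; component 2 is corrupted
models = {"A": (np.array([0.0, 0.0, 0.0]), np.ones(3)),
          "B": (np.array([5.0, 5.0, 5.0]), np.ones(3))}
observed = np.array([0.0, 0.0, 5.0])
sub, post = map_subset(observed, models, "A")
```

The selected subset excludes the corrupted third component: the raw likelihood over all three components would be tiny, but the normalized posterior makes the two-component subset win cleanly.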

12 Experimental Evaluation: A. Databases
Two databases are used to evaluate the performance of the UC method
– The first database is Aurora 2, for speaker-independent recognition of digit sequences in noisy conditions
– The second database contains the highly confusable E-set words (b, c, d, e, g, p, t, v), used to further examine the ability of the new UC model to deal with acoustically confusing recognition tasks

13 Experimental Evaluation (cont.): Acoustic Modeling for Aurora 2
The performance of the UC model is compared with that of four baseline systems
– The first, trained on the clean training set: 3 mixtures per state for the digits / 6 mixtures per state for the silence
– The second, trained on the multi-condition training set: 3 mixtures per state for the digits / 6 mixtures per state for the silence
– The third, an improved model corresponding to the complex back-end model: 20 mixtures per state for the digits / 36 mixtures per state for the silence
– The fourth uses 32 mixtures for all the states, and thus has the same model complexity as the UC model
The UC model is trained using only the clean training set
– Expanded by adding wide-band flat-spectrum noise to each of the utterances
– 10 different SNR levels, from 20 dB to 2 dB, decreasing by 2 dB per level
– The wide-band flat-spectrum noise is computer-generated white noise filtered by a low-pass filter with a 3-dB bandwidth of 3.5 kHz

14 Experimental Evaluation (cont.): Acoustic Modeling for Aurora 2
The speech is divided into frames of 25 ms at a frame rate of 10 ms
For each frame
– A 13-channel mel filter bank is applied to obtain 13 log filter-bank amplitudes
– These 13 amplitudes are then decorrelated by a high-pass filter, resulting in 12 decorrelated log filter-bank amplitudes
– The bandwidth of a subband can be increased conveniently by grouping neighboring subband components together to form a new subband component; for example, a 6-subband spectral vector groups the 12 amplitudes into pairs
– In this paper, each feature vector consists of 18 components: 6 static subband spectra, 6 delta subband spectra, and 6 delta-delta subband spectra
– Each subband component holds 2 amplitudes, so the overall size of the feature vector for a frame is 18 x 2 = 36
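The feature layout can be sketched as below. The adjacent-channel difference standing in for the unspecified high-pass decorrelating filter, and the frame-wise gradients for the delta streams, are assumptions; the 13 -> 12 -> 6x2 grouping and the 36-number frame size follow the slide.

```python
import numpy as np

def subband_features(log_mel):
    """`log_mel` is a (frames, 13) array of log mel filter-bank
    amplitudes. Differencing adjacent channels approximates the
    high-pass decorrelation (13 -> 12 amplitudes); the 12 amplitudes
    are grouped into 6 subbands of 2, and frame-wise gradients give
    the delta and delta-delta streams."""
    decorr = log_mel[:, 1:] - log_mel[:, :-1]        # (frames, 12)
    static = decorr.reshape(len(log_mel), 6, 2)      # 6 subbands x 2 amplitudes
    delta = np.gradient(static, axis=0)              # delta subband spectra
    ddelta = np.gradient(delta, axis=0)              # delta-delta subband spectra
    # 18 two-dimensional components -> 36 numbers per frame
    return np.concatenate([static, delta, ddelta], axis=1).reshape(len(log_mel), 36)

frames = subband_features(np.random.default_rng(0).standard_normal((50, 13)))
```

Grouping into wider subbands keeps some within-subband correlation while still allowing the missing-feature machinery to drop whole subbands independently.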

15 Experimental Evaluation (cont.): Tests on Aurora 2 Conditions
The table shows the recognition results for clean test data
For the clean data, the best accuracy rates were obtained by the multi-condition baseline models with 20 and 32 mixtures per state
The UC model performed on average slightly better than the multi-condition model with 3 mixtures per state

16 Experimental Evaluation (cont.): Tests on Aurora 2 Conditions
The tables show the recognition results on test set A and test set B
The UC model improved significantly over the baseline model trained on clean data, and achieved an average performance close to that of the multi-condition model with three mixtures per state
Car noise exhibits a less sophisticated spectral structure than babble noise, and thus may be more accurately matched by the piecewise-flat spectra implemented in the UC model

17 Experimental Evaluation (cont.): Tests on Aurora 2 Conditions
The table shows the recognition results on test set C
The channel mismatch problem can be handled by Multi-20 and Multi-32
The UC model also showed a capability of coping with this mismatch
– Its performance is little affected by channel mismatch
The figure summarizes the average word accuracy results for the five systems

18 Experimental Evaluation (cont.): Tests on Noise Unseen in Aurora 2
The purpose of this study is to further investigate the capability of the UC model to offer robustness to a wide variety of noises
– Three additional noises are used: a polyphonic mobile phone ring, a pop song segment, and a broadcast news segment
– The spectral characteristics of the three noises are shown in the figure

19 Experimental Evaluation (cont.): Tests on Noise Unseen in Aurora 2
The UC model offered improved accuracy over all three baseline models
The UC model produced particularly good results for the ringtone noise
– because this noise causes mainly partial corruption over the speech frequency band
The table also indicates that increasing the number of mixtures in the mismatched baseline model
– produced only a small improvement for the news noise
– produced no improvement for the phone ring noise

20 Experimental Evaluation (cont.): Tests on Noise Unseen in Aurora 2
The UC model, with a complexity similar to that of Multi-32, performed similarly to Multi-3 trained in matched conditions
The UC model was able to outperform Multi-32 in the case of unknown/mismatched noise conditions

21 Experimental Evaluation (cont.): Discrimination Study on an E-Set Database
This experiment examines the ability of the UC model to discriminate between acoustically confusing words
– While it reduces the mismatch between training and testing conditions, does it also reduce the discrimination between utterances of different words?
The experiments use a new database containing the highly confusable E-set words (b, c, d, e, g, p, t, v), extracted from the Connex speaker-independent alphabetic database provided by British Telecom
– It contains three repetitions of each word by 104 speakers (53 male and 51 female)
– Of the 104 speakers, 52 are used for training and the other 52 for testing
– For each word, about 156 utterances are available for training
– A total of 1219 utterances are available for testing
– Four different noises from Aurora 2 test set A are artificially added
Two baseline HMMs are built
– One with the clean training set (1 mixture per state)
– The other with the multi-condition training set (11 mixtures per state)

22 Experimental Evaluation (cont.): Discrimination Study on an E-Set Database
For the clean E-set, the UC model achieved a recognition accuracy close to that of the baseline model, with only a small loss in accuracy (84.91% → 83.33%)
For the given noise conditions, the UC model achieved an average performance close to that obtained by the multi-condition baseline model

23 Experimental Evaluation (cont.): Discrimination Study on an E-Set Database
Finally, the performance of the UC model was tested with different resolutions for quantizing the SNR
– Three different training sets are generated with increasing SNR resolution
  Coarse quantization (6 mixtures per state): only five different SNRs, from 20 dB to 4 dB with a 4 dB step
  Medium-resolution quantization (11 mixtures per state): ten different SNRs, from 20 dB to 2 dB with a 2 dB step
  Fine quantization (21 mixtures per state): twenty different SNRs, from 20 dB to 2 dB with a 1 dB step
– Additionally, all three sets include the clean training data
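The coarse and medium quantization schemes can be enumerated as follows; the helper name is illustrative, and the mixture counts follow from the SNR count plus the clean condition.

```python
def snr_levels(start_db, stop_db, step_db):
    """Enumerate the SNR levels of one quantization scheme; the clean
    condition is appended, matching the mixture counts above
    (e.g., five SNRs plus clean -> 6 mixtures per state)."""
    levels = list(range(start_db, stop_db - 1, -step_db))
    return levels + ["clean"]

coarse = snr_levels(20, 4, 4)    # 20, 16, 12, 8, 4 dB, plus clean
medium = snr_levels(20, 2, 2)    # 20, 18, ..., 2 dB, plus clean
```

Each listed condition becomes one mixture component per state, so the amplitude resolution directly sets the model size and hence the decoding cost.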

24 Experimental Evaluation (cont.): Discrimination Study on an E-Set Database
The two models with the medium and fine quantization produce quite similar recognition accuracy in many test conditions
The model with the coarse quantization, trained on the six conditions (five SNRs plus clean), produced poorer results than the other two models, but still showed a significant performance improvement over the baseline model trained on clean data

25 Summary
This paper investigated noise compensation for speech recognition
– Assuming no knowledge about the noise characteristics and no training data from the noisy environment
– Universal compensation (UC) is proposed as a possible solution to the problem
– The UC method involves a novel combination of the principle of multi-condition training and the principle of the missing-feature method
Experiments on Aurora 2 have shown that the UC model has the potential to achieve a recognition performance close to the multi-condition model's performance without assuming knowledge of the noise
Further experiments with noises unseen in Aurora 2 have indicated the ability of the UC model to offer robust performance for a wide variety of noises
Finally, the experimental results on an E-set database have demonstrated the ability of the UC model to deal with acoustically confusing recognition tasks

