Reporter: Shih-Hsiang (士翔)

Introduction
Speech signal carries information from many sources
  –Not all information is relevant or important for speech recognition
  –Feature extraction is the first crucial step
Acoustic features may greatly affect the performance of a speech recognizer
  –Discriminability
  –Robustness
  –Complexity
MFCCs are used almost as “standard” acoustic parameters in currently available speech recognition systems
  –They do not cope well with noisy speech
  –Existing remedies: Wiener filtering, spectral subtraction, RASTA, PMC, MLLR, etc.
In this paper, the authors present the differential power spectrum (DPS) for speech recognition

Definition of the differential power spectrum
  –y(t) : received speech signal
  –s(t) : original clean speech signal
  –h(t) : impulse response of the transmission channel
  –x(t) : the noise-free speech signal
  –v(t) : ambient noise
  –(0≤n<N, where N is the frame length)
Power spectrum
  –ω : radian frequency
  –r_y(τ) : the short-time autocorrelation
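
The symbol definitions on this slide suggest the signal model and power spectrum below. This is a reconstruction from those definitions rather than a copy of the slide's equations, which did not survive in the transcript. The summation limits are an assumption, and the slide writes the frame index as n where the symbols use t.

```latex
\begin{align}
y(t) &= s(t) * h(t) + v(t) = x(t) + v(t), \qquad 0 \le t < N \\
Y(\omega) &= \sum_{\tau=-(N-1)}^{N-1} r_y(\tau)\, e^{-j\omega\tau}
\end{align}
```

Here $*$ denotes convolution, so $x(t) = s(t) * h(t)$ is the channel-filtered speech before ambient noise is added.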

Definition of the differential power spectrum (cont.)
Assume noise and speech signal are mutually uncorrelated
  –The power spectrum of the received signal is then the sum of the speech and noise power spectra
Differential power spectrum (DPS)
  –Defined as the derivative of the power spectrum with respect to ω (continuous frequency domain)
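
Written out, the uncorrelatedness assumption and the DPS definition on this slide plausibly take the form below. The equations are reconstructed from the slide text, and the symbols $X(\omega)$, $V(\omega)$ and $D(\omega)$ are assumed names for the speech power spectrum, the noise power spectrum and the DPS.

```latex
\begin{align}
Y(\omega) &= X(\omega) + V(\omega) \\
D(\omega) &= \frac{dY(\omega)}{d\omega}
           = \frac{dX(\omega)}{d\omega} + \frac{dV(\omega)}{d\omega}
\end{align}
```

One consequence of this form: wherever the noise spectrum $V(\omega)$ varies slowly with frequency, its derivative term is small, so the DPS is dominated by the speech term.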

Definition of the differential power spectrum (cont.)
Its discrete counterpart can be approximated by the following difference equation
  –D(k) = Σ b_l Y(k+l), summed over l = –O, …, P
  –P and O are the orders of the difference equation
  –the b_l’s are real-valued weighting coefficients
  –0≤k<K, where K is the length of the FFT
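
A minimal Python sketch of such a difference equation, read here as D(k) = Σ b_l Y(k+l). That form is an assumption, chosen because it is consistent with the special cases given later in the slides. The function name and the dictionary representation of the weights are hypothetical, and edge bins are simply skipped because the slides do not say how they are handled.

```python
def differential_power_spectrum(power, weights):
    """Return D(k) = sum over l of b_l * Y(k + l).

    power   : list of power-spectrum values Y(k)
    weights : dict mapping an integer offset l to its coefficient b_l
    Bins whose neighbours would fall outside the spectrum are skipped
    (an assumption, not something stated in the slides).
    """
    lo = max(0, -min(weights))               # first k with k + l >= 0
    hi = len(power) - max(0, max(weights))   # one past the last valid k
    return [sum(b * power[k + l] for l, b in weights.items())
            for k in range(lo, hi)]


# Simplest case: D(k) = Y(k) - Y(k+1)
first_difference = {0: 1.0, 1: -1.0}
Y = [1.0, 4.0, 9.0, 16.0, 25.0]
print(differential_power_spectrum(Y, first_difference))  # [-3.0, -5.0, -7.0, -9.0]
```

Passing the weights as a mapping keeps the orders P and O implicit: they are just the largest positive and negative offsets present.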

Definition of the differential power spectrum (cont.) D(k) = Y(k) – Y(k+1)

Representing DPS into speech features
Three problems
  –The selection of proper orders of the difference equations
  –The determination of the weights b_l
  –How DPS should be converted into a few parameters
An optimal solution to any of the three listed problems is difficult to achieve
For the first two problems, they proposed three special forms
  –DPS1: D(k) = Y(k) – Y(k+1)
  –DPS2: D(k) = Y(k) – Y(k+2)
  –DPS3: D(k) = Y(k-2) + Y(k-1) – Y(k+1) – Y(k+2)
The third problem is converting DPS into cepstral coefficients
  –An absolute operation to make negative parts positive
  –The magnitude of DPS is passed through a mel-frequency filter bank
  –Logarithmic filter bank outputs are compressed into a feature vector
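
The conversion steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the triangular mel filter bank, the DCT-II used for the final compression, the filter and coefficient counts, and all function names are assumptions in the style of a standard MFCC front end. For DPS3 the two-bin index offset is ignored for simplicity.

```python
import math


def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(n_filters, n_bins, sample_rate):
    """Triangular filters evenly spaced on the mel scale (assumed design)."""
    nyquist = sample_rate / 2.0
    top = hz_to_mel(nyquist)
    edges = [mel_to_hz(top * i / (n_filters + 1)) / nyquist * (n_bins - 1)
             for i in range(n_filters + 2)]      # fractional bin positions
    bank = []
    for m in range(1, n_filters + 1):
        left, centre, right = edges[m - 1], edges[m], edges[m + 1]
        row = []
        for k in range(n_bins):
            if left < k <= centre:
                row.append((k - left) / (centre - left))
            elif centre < k < right:
                row.append((right - k) / (right - centre))
            else:
                row.append(0.0)
        bank.append(row)
    return bank


def dpscc(power, sample_rate, n_filters=20, n_ceps=12, form=1):
    """Cepstral coefficients from the DPS of one frame's power spectrum."""
    K = len(power)
    if form == 1:      # DPS1: D(k) = Y(k) - Y(k+1)
        dps = [power[k] - power[k + 1] for k in range(K - 1)]
    elif form == 2:    # DPS2: D(k) = Y(k) - Y(k+2)
        dps = [power[k] - power[k + 2] for k in range(K - 2)]
    else:              # DPS3: D(k) = Y(k-2) + Y(k-1) - Y(k+1) - Y(k+2)
        dps = [power[k - 2] + power[k - 1] - power[k + 1] - power[k + 2]
               for k in range(2, K - 2)]
    magnitude = [abs(d) for d in dps]            # make negative parts positive
    bank = mel_filterbank(n_filters, len(magnitude), sample_rate)
    eps = 1e-10                                  # guards against log(0)
    log_energy = [math.log(sum(w * m for w, m in zip(row, magnitude)) + eps)
                  for row in bank]
    # DCT-II compresses the log filter-bank outputs (assumed, as for MFCCs)
    return [sum(e * math.cos(math.pi * i * (j + 0.5) / n_filters)
                for j, e in enumerate(log_energy))
            for i in range(1, n_ceps + 1)]


frame_power = [1.0 + (k % 7) for k in range(129)]   # toy spectrum, 129 bins
print(len(dpscc(frame_power, sample_rate=8000)))
```

A perfectly flat power spectrum has a zero DPS, so every log filter-bank output is the same constant and all coefficients from index 1 upward come out as approximately zero.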

Representing DPS into speech features (cont.)

Comparison with the cepstral liftering technique
If x_i is the i-th cepstral coefficient, then the corresponding liftered cepstral coefficient is given by W_i x_i, where the weights W_i define the lifter
Various types of lifters have been proposed in the literature
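
As a small illustration of liftering, the Python sketch below applies two lifter shapes that are common in the literature. The slide's own lifter formulas did not survive in the transcript, so the linear form W_i = i, the raised-sine form W_i = 1 + (L/2) sin(πi/L), and the value L = 22 are assumptions, not the definitions used in the paper.

```python
import math


def linear_lifter(n):
    """W_i = i for i = 1..n (assumed form of a linear lifter)."""
    return [float(i) for i in range(1, n + 1)]


def sinusoidal_lifter(n, L=22):
    """W_i = 1 + (L/2) * sin(pi * i / L) for i = 1..n (assumed form)."""
    return [1.0 + (L / 2.0) * math.sin(math.pi * i / L)
            for i in range(1, n + 1)]


def lifter(cepstrum, weights):
    """Liftered coefficient: W_i * x_i for each cepstral coefficient x_i."""
    return [w * x for w, x in zip(weights, cepstrum)]


cepstrum = [1.0, 2.0, 3.0]
print(lifter(cepstrum, linear_lifter(3)))   # [1.0, 4.0, 9.0]
```

Both lifters scale each coefficient independently, so they change the relative weight of coefficients in a Euclidean-type distance without mixing them.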

Comparison with the cepstral liftering technique (cont.)
Effect of cepstral liftering on the performance of a DTW-based speech recognizer
Type of lifter    SNR in dB
                  ∞       30      25      20      15
No Lifter         93.0    70.6    55.9    37.2    24.0
Lin. Lifter       94.0    90.8    86.6    80.1    70.6
Stat. Lifter      93.9    86.7    78.3    68.3    55.3
Sin. Lifter       94.5    85.9    78.9    68.7    51.5
Exp. Lifter       94.3    90.1    85.1    78.9    68.1

Comparison with the cepstral liftering technique (cont.)
But liftering has no effect on the recognition process when the Mahalanobis distance is used, as in an HMM
  –Mahalanobis distance
  –Mahalanobis distance when liftered cepstral coefficients are used (W is the weighting matrix)
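
The claim on this slide can be checked numerically. The Python sketch below assumes a diagonal covariance estimated from the same (liftered or unliftered) training vectors. The data and lifter weights are made up for illustration, and the function names are hypothetical.

```python
def mean_and_variance(vectors):
    """Per-dimension mean and variance (diagonal covariance)."""
    n, dim = len(vectors), len(vectors[0])
    mean = [sum(v[d] for v in vectors) / n for d in range(dim)]
    var = [sum((v[d] - mean[d]) ** 2 for v in vectors) / n
           for d in range(dim)]
    return mean, var


def mahalanobis_sq(x, mean, var):
    """Squared Mahalanobis distance with a diagonal covariance."""
    return sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, var))


def apply_lifter(x, weights):
    return [w * xi for w, xi in zip(weights, x)]


# Toy training vectors, test vector and lifter weights (all invented)
train = [[1.0, 0.2, -0.1], [1.4, -0.3, 0.2], [0.8, 0.1, 0.05], [1.1, 0.4, -0.2]]
test_vec = [1.2, 0.0, 0.1]
weights = [1.0, 2.0, 3.0]

d_plain = mahalanobis_sq(test_vec, *mean_and_variance(train))
d_lift = mahalanobis_sq(
    apply_lifter(test_vec, weights),
    *mean_and_variance([apply_lifter(v, weights) for v in train]))

# The two distances agree up to floating-point rounding
print(abs(d_plain - d_lift) < 1e-9)
```

Scaling dimension i by W_i multiplies the squared difference by W_i² and the variance by W_i² as well, so the ratio, and hence the distance, is unchanged.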

Comparison with the cepstral liftering technique (cont.) In DPS based cepstrum

Comparison with spectral subtraction
SS can be formulated in terms of
  –α : controls the amount of noise subtracted from the noisy signal
  –β : spectral flooring
  –E_Y(k) : the output of the k-th band-pass filter when Y(k) is passed through the filter
For speech recognition, it was found that SS operated in each band-pass filter could yield more consistent improvement for MFCC features against noise
Figure labels: attack, decay
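
A hedged Python sketch of spectral subtraction applied per band-pass filter output. The slide's exact formula is not preserved, so this uses one common textbook form: subtract α times a noise estimate and floor the result at β times the noisy band energy. The choice of floor, the default values, and the noise estimate argument are assumptions.

```python
def spectral_subtraction(band_energy, noise_energy, alpha=1.0, beta=0.01):
    """Subtract a scaled noise estimate from each filter-bank output.

    band_energy  : E_Y(k), filter-bank outputs of the noisy signal
    noise_energy : assumed per-band noise estimate (e.g. from non-speech frames)
    alpha        : amount of noise subtracted
    beta         : spectral floor, relative to the noisy band energy (assumed)
    """
    cleaned = []
    for e_y, e_v in zip(band_energy, noise_energy):
        estimate = e_y - alpha * e_v
        cleaned.append(max(estimate, beta * e_y))   # never drop below the floor
    return cleaned


print(spectral_subtraction([10.0, 1.0], [2.0, 5.0], alpha=1.0, beta=0.1))
# [8.0, 0.1]
```

The floor keeps the output positive in bands where the noise estimate exceeds the observed energy, which matters because the next stage takes a logarithm.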

Experiments
In this paper, the authors conduct a number of speech recognition experiments
  –Isolated speech recognition
  –SNR improvement
  –Connected digits recognition
  –Phone recognition
  –Evaluation on AURORA task

Experiments - Isolated speech recognition
TI46 database – an isolated spoken words database (TI)
  –16 speakers (8 males / 8 females)
  –The vocabulary consists of
    10 isolated digits from ‘ZERO’ to ‘NINE’
    26 isolated English letters from ‘A’ to ‘Z’
    10 isolated words: “ENTER, ERASE, GO, HELP, NO, RUBOUT, REPEAT, STOP, START, YES”
  –26 utterances of each word from each speaker (10 training / 16 testing)
In this experiment, four sets of features are considered
  –MFCC
  –DPSCC1
  –DPSCC2
  –DPSCC3

Experiments - Isolated speech recognition (cont.)
The DPS based features yield performance at least comparable to that of the standard MFCCs
For both MFCCs and DPSCCs, the inclusion of dynamic and acceleration features greatly improves the performance
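
For readers unfamiliar with dynamic and acceleration features, the Python sketch below computes them with the usual regression formula over neighbouring frames. The slides do not state which formula or window the authors used, so the window of 2 frames and the edge handling (repeating the first and last frame) are assumptions.

```python
def delta(frames, window=2):
    """Regression-based dynamic features for a sequence of feature vectors.

    d_t = sum_{n=1..window} n * (c_{t+n} - c_{t-n}) / (2 * sum n^2)
    Frames beyond either end are replaced by the first or last frame.
    """
    denom = 2 * sum(n * n for n in range(1, window + 1))
    T, dim = len(frames), len(frames[0])
    out = []
    for t in range(T):
        row = []
        for d in range(dim):
            acc = 0.0
            for n in range(1, window + 1):
                nxt = frames[min(t + n, T - 1)][d]
                prv = frames[max(t - n, 0)][d]
                acc += n * (nxt - prv)
            row.append(acc / denom)
        out.append(row)
    return out


static = [[float(t)] for t in range(6)]   # toy one-dimensional ramp
dynamic = delta(static)                   # dynamic (delta) features
acceleration = delta(dynamic)             # acceleration (delta-delta) features
print(dynamic[2])                         # [1.0]
```

Appending the dynamic and acceleration vectors to the static one triples the feature dimension.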

Experiments - SNR improvement
Clean speech signals are taken from the TI46 database
Lynx noise is taken from the NOISEX database
SNR is measured in two ways
  –Power spectrum based (SNR_Y)
  –DPS based (SNR_D)

Experiments - SNR improvement (cont.)
The average SNR_D is approximately 4 dB higher than SNR_Y

Experiments - Connected digits recognition
TI connected digits database – contains digit strings uttered by adult and child speakers
  –The vocabulary consists of 11 words: 10 digits and an “oh”
  –Each speaker uttered 77 sequences of these words
Noise is added to the speech signal in the test set, while the training speech is kept clean
  –wide-band stationary speech noise, machine-gun noise, Lynx noise
Four sets of feature vectors are investigated
  –MFCC
  –DPSCC
  –MFCC + CMN
  –DPSCC + CMN
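
Adding noise to clean test speech at a chosen SNR can be sketched in Python as below. The slides do not describe the mixing procedure, so this assumes the SNR is defined on average sample power over the whole utterance. The function name is hypothetical.

```python
import math


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the mixture has the requested SNR, then add it.

    SNR is taken as 10*log10(speech power / scaled noise power), with
    power averaged over the utterance (an assumed definition).
    """
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[:len(speech)]
    p_speech = sum(s * s for s in speech) / len(speech)
    p_noise = sum(n * n for n in noise) / len(noise)
    gain = math.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return [s + gain * n for s, n in zip(speech, noise)]


speech = [1.0, -1.0, 1.0, -1.0]     # toy signals for illustration
noise = [0.5, 0.5, -0.5, -0.5]
noisy = mix_at_snr(speech, noise, snr_db=10.0)
print(len(noisy))
```

Only the test signal is corrupted this way. The models are trained on the clean signal, which is what makes the task a test of robustness.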

Experiments - Connected digits recognition (cont.)
Compared with MFCC, DPSCC yields at least comparable performance in clean conditions
In most strong noise conditions, DPSCC outperforms MFCC
CMN effectively improves the robustness of both
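
CMN (cepstral mean normalization) is simple enough to show in a few lines of Python. The sketch below assumes the mean is taken over all frames of one utterance. The slides do not say whether the authors used per-utterance or some other averaging.

```python
def cepstral_mean_normalization(frames):
    """Subtract the per-coefficient mean over the utterance from every frame."""
    n, dim = len(frames), len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    return [[x - m for x, m in zip(f, mean)] for f in frames]


utterance = [[1.0, 2.0], [3.0, 6.0]]            # two toy frames
print(cepstral_mean_normalization(utterance))   # [[-1.0, -2.0], [1.0, 2.0]]
```

A fixed transmission channel multiplies the spectrum and therefore adds a roughly constant offset in the cepstral domain, which is why removing the mean helps against channel mismatch.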

Experiments - Phone recognition
TIMIT phoneme based continuous speech database
  –Contains a total of 6300 sentences
  –10 sentences spoken by each of 630 speakers from 8 major dialect regions of the US
  –Phonetic recognition is performed over the set of 39 classes that are commonly used for evaluation
Noise is added to the speech signal in the test set, while the training speech is kept clean
  –wide-band stationary speech noise, machine-gun noise, Lynx noise
Two feature sets are used
  –MFCC+CMN (39 coefficients)
  –DPSCC+CMN (39 coefficients)

Experiments - Phone recognition (cont.)
The MFCC and the DPSCC features yield comparable results in clean and weak noise conditions
DPSCC features slightly outperform the MFCC features in strong noise conditions

Experiments - Evaluation on AURORA task
Noise signals are recorded at different places
  –suburban train, babble, car, exhibition hall, restaurant, street, airport and train station
Two training modes are defined
  –Training on clean data only
    8440 utterances (55 male / 55 female speakers)
    Signals are filtered with the G.712 characteristic without noise added
  –Training on clean as well as noisy data (multi-condition)
    8440 utterances split into 20 subsets (with 422 utterances each)
    Suburban train, babble, car, and exhibition hall noises are added to the 20 subsets at 5 different conditions (SNRs of 20, 15, 10 and 5 dB, plus the clean condition)
Three test sets are defined
  –Test Set A, Test Set B, Test Set C

Experiments - Evaluation on AURORA task (cont.)
With the use of CMN, the average word error rate is reduced by 8.8%
When SS is used together with CMN, the average performance increases by 19.3%
The DPS based cepstrum outperforms MFCC. It also yields slightly better performance than SS

Discussion and conclusion
DPS can also preserve spectral information to discriminate among different linguistic units (e.g. phonemes and words)
DPS had a higher SNR than the power spectrum, especially for voiced frames
  –DPS based features should be more resilient to noise than the power spectrum based features
The DPSCC can yield at least comparable performance when compared to the conventional MFCCs
  –In most cases, it outperforms MFCC
Compared to the estimation of MFCC, the extraction of DPSCC requires (K/2-1) more addition (subtraction) and absolute operations for each frame
  –This increase in computational complexity is negligible for today’s computers
