Robust Voice Activity Detection for Interview Speech in NIST Speaker Recognition Evaluation Man-Wai MAK and Hon-Bill YU The Hong Kong Polytechnic University.


1 Robust Voice Activity Detection for Interview Speech in NIST Speaker Recognition Evaluation Man-Wai MAK and Hon-Bill YU The Hong Kong Polytechnic University enmwmak@polyu.edu.hk http://www.eie.polyu.edu.hk/~mwmak/

2 Outline: Speaker Verification; Speaker Verification Process; Voice Activity Detection (VAD) in Speaker Verification; Effect of VAD on Acoustic Features; Characteristics of Interview-Speech in NIST Speaker Recognition Evaluation; VAD for NIST Speaker Recognition Evaluation; Experiments on NIST SRE 2008; Preliminary Results on NIST SRE 2010

3 Speaker Verification Process: To verify the identity of a claimant based on his/her own voice. [Figure: the claimant says "I am Mary"; the system asks "Is this Mary’s voice?"]

4 Speaker Verification Process: A 2-class hypothesis problem. H0: MFCC sequence X^(c) comes from the true speaker. H1: MFCC sequence X^(c) comes from an impostor. The verification score is a likelihood ratio. [Diagram: feature extraction -> speaker model (+) and background model (−) -> score -> decision]
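
The likelihood-ratio formula itself does not survive in this transcript. A common log-domain form, consistent with the "+" and "−" signs in the diagram, is sketched below. The symbols \Lambda_s (speaker model), \Lambda_b (background model), and \theta (decision threshold) are assumed notation, not taken from the slides.

```latex
S\big(X^{(c)}\big) = \log p\big(X^{(c)} \mid \Lambda_s\big) - \log p\big(X^{(c)} \mid \Lambda_b\big),
\qquad
\text{accept } H_0 \text{ if } S\big(X^{(c)}\big) > \theta, \text{ otherwise accept } H_1 .
```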

5 Voice Activity Detection in Speaker Verification. [Diagram: speech -> VAD -> speech segments -> feature extraction (log|X(ω)| -> DCT -> MFCC) -> acoustic features (MFCC)]

6 Effect of VAD on Acoustic Features. [Figure: MFCC feature vectors (dim 1 vs. dim 2) obtained by feature extraction alone and by VAD followed by feature extraction; the non-speech region is marked]

8 Interview-Speech in NIST SRE. [Figure: interview room showing the interviewer, the interviewee, and the desk. Source: NIST SRE 2008 Workshop]

9 Interview-Speech in NIST SRE: Far-field and desktop microphones were used for collecting interview speech. Some interview-speech files are very noisy, making it difficult to differentiate speech segments from non-speech segments. [Figure: waveform (amplitude vs. time) and spectrogram (frequency vs. time) of a typical interview-speech file in NIST SRE 2008, with speech and non-speech regions marked]

10 Interview-Speech in NIST SRE: Some files have very low SNR. [Figure: waveform, spectrogram, and segmentation of a whole file; S: speech, h#: non-speech]

11 Interview-Speech in NIST SRE: Some files contain spiky signals, which lead to a wrong VAD decision threshold. [Figure: waveform (amplitude vs. time) with a spiky signal marked]

12 Interview-Speech in NIST SRE: Some files contain low-energy speech superimposed on periodic background noise. [Figure: waveform, spectrogram, and segmentation; non-speech is detected as speech]

14 VAD for NIST Speaker Recognition Evaluation: Use speech enhancement as a pre-processing step. [Diagram: noisy speech -> denoising (spectral subtraction) -> denoised speech -> energy-based VAD -> speech-segment info; denoising plus energy-based VAD form the Spectral-Subtraction VAD (SVAD). Feature extraction (MFCC) -> scoring against the speaker model and the impostor model -> decision making with a decision threshold -> accept/reject]
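
The slide names the stages (denoising by spectral subtraction, then an energy-based VAD) but gives no parameters. The following is a minimal sketch of that chain, not the authors' implementation. It assumes that the first few frames of a file contain background only, and the names `alpha`, `beta`, `margin_db`, and `n_noise_frames` and their values are illustrative.

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Split a 1-D signal into overlapping frames (one frame per row)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

def spectral_subtraction(frames, n_noise_frames=10, alpha=2.0, beta=0.01):
    """Magnitude spectral subtraction applied frame by frame.

    Assumption: the first `n_noise_frames` frames contain background only.
    `alpha` (over-subtraction) and `beta` (spectral floor) are illustrative.
    """
    window = np.hanning(frames.shape[1])
    spec = np.fft.rfft(frames * window, axis=1)
    mag, phase = np.abs(spec), np.angle(spec)
    noise_mag = mag[:n_noise_frames].mean(axis=0)        # background estimate
    clean_mag = np.maximum(mag - alpha * noise_mag, beta * mag)
    return np.fft.irfft(clean_mag * np.exp(1j * phase), n=frames.shape[1], axis=1)

def energy_vad(frames, n_noise_frames=10, margin_db=6.0):
    """Mark a frame as speech if its log-energy exceeds the background level by margin_db."""
    energy_db = 10.0 * np.log10(np.sum(frames ** 2, axis=1) + 1e-12)
    threshold = energy_db[:n_noise_frames].mean() + margin_db
    return energy_db > threshold

def ss_vad(x, frame_len=256, hop=128):
    """Denoise first, then run the energy-based VAD on the denoised frames."""
    frames = frame_signal(np.asarray(x, dtype=float), frame_len, hop)
    return energy_vad(spectral_subtraction(frames))
```

Calling `ss_vad(samples)` returns one boolean per frame. In this sketch the decisions are made on the denoised frames, and mapping them back to time stamps is left to the caller.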

15 VAD for NIST Speaker Recognition Evaluation: Use speech enhancement as a pre-processing step.
Signal                      Frequency spectrum
Clean speech x(n,m)         X(ω,m)
Noisy speech y(n,m)         Y(ω,m)
Background speech b(n,m)    B(ω,m)
The parameter values (not preserved in this transcript) were set such that as much noise as possible is removed.
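
The equations on this slide are not preserved in the transcript. Assuming the usual additive signal model and magnitude-domain subtraction, the relation between the three signals listed on the slide can be sketched as follows. The over-subtraction factor \alpha, the spectral floor \beta, and the background estimate \hat{B} are assumed notation. They may or may not correspond to the values the slide refers to.

```latex
y(n,m) = x(n,m) + b(n,m)
\quad\Longrightarrow\quad
Y(\omega,m) = X(\omega,m) + B(\omega,m)

|\hat{X}(\omega,m)| = \max\!\Big( |Y(\omega,m)| - \alpha\,|\hat{B}(\omega)|,\;\; \beta\,|Y(\omega,m)| \Big)
```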

16 VAD for NIST Speaker Recognition Evaluation. [Figure: waveforms (amplitude vs. time) without denoising and with denoising]

17 VAD for NIST Speaker Recognition Evaluation. [Figure: segmentation without denoising; S: speech, h#: non-speech]

18 VAD for NIST Speaker Recognition Evaluation. [Figure: segmentations with denoising (SS-VAD) and from the VAD in the ETSI-AMR speech coder; S: speech, h#: non-speech]

19 VAD for NIST Speaker Recognition Evaluation: Speech-segment-length to speech-file-length ratio of 3 VADs. 6249 speech files (NIST’05-08) were each segmented into speech / non-speech by the energy-based VAD, the ETSI-AMR coder, and the energy-based VAD with SS. Example: total duration: 10 secs; total speech segment: 3 secs; speech-segment-length to speech-file-length ratio = 3/10.
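
The ratio in the slide's example can be computed directly from a VAD's segment list. This is a small sketch; the function name and the segment boundaries are hypothetical, chosen only so that they add up to the slide's 3 seconds of speech in a 10-second file.

```python
def speech_ratio(segments, file_duration):
    """Return (total speech length) / (file length).

    `segments` is a list of (start, end) times in seconds, assumed non-overlapping.
    """
    if file_duration <= 0:
        raise ValueError("file_duration must be positive")
    speech = sum(end - start for start, end in segments)
    return speech / file_duration

# Hypothetical segmentation: 1.5 s + 1.5 s of speech in a 10 s file.
ratio = speech_ratio([(1.0, 2.5), (6.0, 7.5)], 10.0)
```

A ratio close to 1 for a file means that the VAD kept almost everything, which for interview recordings suggests that non-speech has been labelled as speech.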

20 VAD for NIST Speaker Recognition Evaluation: Speech-segment-length to speech-file-length ratio of 3 VADs. [Figure: distributions of the ratio for the ordinary energy-based VAD, the Spectral-Subtraction VAD, and the VAD in the ETSI AMR coder. Annotation: high frequency of occurrence, suggesting that many non-speech segments are mistakenly detected as speech segments]

22 Experiments on NIST SRE 2008. Speaker Modeling: GMM-SVM. Score Normalization: T-norm.
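
The slide names T-norm without defining it. As a reminder of the usual definition, the sketch below normalizes a raw trial score by the mean and standard deviation of the scores that the same test utterance obtains against a cohort of impostor models. The function name and the example numbers are illustrative and not taken from the experiments.

```python
from statistics import mean, stdev

def t_norm(raw_score, cohort_scores):
    """Test normalization: (score - cohort mean) / cohort standard deviation."""
    mu = mean(cohort_scores)
    sigma = stdev(cohort_scores)  # needs at least two cohort scores
    if sigma == 0:
        raise ValueError("cohort scores have zero spread")
    return (raw_score - mu) / sigma

# Illustrative numbers only.
normalized = t_norm(2.0, [-1.0, 0.0, 1.0])
```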

23 Results on NIST 2008 SRE

24 Results on NIST 2008 SRE, Common Condition 1. [Figure legend: SS-VAD; VAD in ETSI AMR]

25 Preliminary Results on NIST 2010
Common Condition 2: All trials involving interview speech from different microphones
VAD                     EER (%)   Normalized minDCF
Energy-based VAD        11.72     0.99
SS-VAD                   4.45     0.58
SMB                      5.83     0.75
SS-SMB                   4.62     0.60
NIST ASR Transcripts     8.58     0.85
ETSI-AMR                 8.05     0.85
SMB: Statistical-Model Based VAD. Sohn, et al., "A statistical model-based voice activity detection", IEEE Signal Processing Letters, 1999.
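
The results above are reported as equal error rates (EER). As a rough illustration of what that figure measures, the sketch below approximates an EER from lists of target and impostor scores by sweeping every observed score as a threshold. It is a simplified stand-in rather than the official NIST scoring procedure, and the function name is hypothetical.

```python
def equal_error_rate(target_scores, impostor_scores):
    """Approximate EER: the operating point where the false-rejection rate
    and the false-acceptance rate are closest to each other."""
    best_gap, eer = None, None
    for thr in sorted(set(target_scores) | set(impostor_scores)):
        # A trial is accepted when its score is at or above the threshold.
        frr = sum(s < thr for s in target_scores) / len(target_scores)
        far = sum(s >= thr for s in impostor_scores) / len(impostor_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2.0
    return eer
```

With only a handful of scores the estimate is coarse. Evaluation tools typically interpolate between operating points.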

26 Conclusions: Noise reduction is of primary importance for VAD under extremely low SNR. It is important to remove the sinusoidal background found in NIST SRE sound files, as this kind of background signal can lead to many false detections in energy-based VAD. Using noise reduction as a pre-processing step leads to a VAD that outperforms the VAD in ETSI-AMR (Option 2).

27 VAD for NIST Speaker Recognition Evaluation: Threshold Determination and VAD Decision Logic. [Diagram labels: spike; sample-based amplitude ranking (a_p1, ..., a_pL; L); frame-based windowing (frame amplitude; 500 preset non-speech frames; μ_b)]
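
Only the labels of this diagram survive, so the exact decision rule is not recoverable from the transcript. The sketch below shows one way the labelled quantities could fit together, under loudly stated assumptions: a_pL is taken as the L-th largest absolute sample amplitude (so that up to L−1 spike samples are ignored), μ_b is taken as the mean amplitude of the quietest frames, and the threshold is placed at a hypothetical fraction `weight` of the way from μ_b to a_pL. All function names and parameter values are illustrative.

```python
def frame_amplitudes(samples, frame_len):
    """Mean absolute amplitude of consecutive non-overlapping frames."""
    return [
        sum(abs(s) for s in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]

def vad_threshold(samples, frame_amps, L=10, n_background=500, weight=0.1):
    """Spike-tolerant threshold (assumed form, not taken from the slides)."""
    ranked = sorted((abs(s) for s in samples), reverse=True)
    a_pL = ranked[min(L, len(ranked)) - 1]        # L-th largest sample amplitude
    quietest = sorted(frame_amps)[:n_background]  # assumed "preset non-speech frames"
    mu_b = sum(quietest) / len(quietest)          # background amplitude
    return mu_b + weight * (a_pL - mu_b)

def vad_decisions(frame_amps, threshold):
    """A frame is labelled speech when its amplitude exceeds the threshold."""
    return [a > threshold for a in frame_amps]
```

With L = 1 the peak estimate is simply the largest sample, so a single spike can push the threshold above all genuine speech frames. Ranking the amplitudes and skipping the top few samples avoids that failure.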

28 Results

29 Experiments on NIST SRE 2008: GMM-SVM training phase. [Diagram: feature extraction and model creation (NIST’05 & 06); feature extraction and MAP adaptation (NIST’08) give the GMM-supervectors of target speakers; MAP adaptation for 300 background speakers (NIST’06) gives the GMM-supervectors of 300 impostors; NAP; SVM training -> GMM-SVM]

30 Experiments on NIST SRE 2008: Verification phase

