Robust Voice Activity Detection for Interview Speech in NIST Speaker Recognition Evaluation
Man-Wai MAK and Hon-Bill YU
The Hong Kong Polytechnic University

2 Outline
• Speaker Verification
• Speaker Verification Process
• Voice Activity Detection (VAD) in Speaker Verification
• Effect of VAD on Acoustic Features
• Characteristics of Interview-Speech in NIST Speaker Recognition Evaluation
• VAD for NIST Speaker Recognition Evaluation
• Experiments on NIST SRE 2008
• Preliminary Results on NIST SRE 2010

3 Speaker Verification Process
Goal: to verify the identity of a claimant based on his or her own voice.
Claimant: "I am Mary." System: "Is this Mary's voice?"

4 Speaker Verification Process
A two-class hypothesis problem:
H0: MFCC sequence X^(c) comes from the true speaker
H1: MFCC sequence X^(c) comes from an impostor
Verification score is a likelihood ratio.
Block diagram: feature extraction → speaker model (+) and background model (−) → score → decision.
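The likelihood-ratio equation itself is not preserved in this transcript. One common way to write such a score, consistent with the "+ / −" branches of the block diagram, is sketched here. The symbols Λ_s (speaker model), Λ_b (background model) and θ (decision threshold) are assumptions introduced for illustration and may differ from the authors' notation.

```latex
S\bigl(X^{(c)}\bigr)
  = \log p\bigl(X^{(c)} \mid \Lambda_{s}\bigr)
  - \log p\bigl(X^{(c)} \mid \Lambda_{b}\bigr),
\qquad
\text{decide }
\begin{cases}
  H_0 & \text{if } S\bigl(X^{(c)}\bigr) \ge \theta, \\
  H_1 & \text{otherwise.}
\end{cases}
```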

5 Voice Activity Detection in Speaker Verification
Block diagram: speech → VAD → speech segments → feature extraction (log|X(ω)|, DCT) → acoustic features (MFCC).

6 Effect of VAD on Acoustic Features
Figure: MFCC feature vectors (dim1 vs. dim2) obtained by feature extraction with and without VAD; the non-speech region is marked.

8 Interview-Speech in NIST SRE
Figure: interview room layout showing the interviewer, the interviewee, and the desk. Source: NIST SRE 2008 Workshop

9 Interview-Speech in NIST SRE
• Far-field and desktop microphones were used for collecting interview speech.
• Some interview-speech files are very noisy, making it difficult to differentiate speech segments from non-speech segments.
Figure: waveform (amplitude vs. time) and spectrogram (frequency vs. time) of a typical interview-speech file in NIST SRE 2008, with speech and non-speech regions marked.

10 Interview-Speech in NIST SRE
• Some files have very low SNR.
Figure: waveform (amplitude vs. time), spectrogram (frequency vs. time), and segmentation of a whole file (S: speech, h#: non-speech).

11 Interview-Speech in NIST SRE
• Some files contain spiky signals, which lead to a wrong VAD decision threshold.
Figure: waveform (amplitude vs. time) with a spiky signal marked.

12 Interview-Speech in NIST SRE
• Some files contain low-energy speech signals superimposed on periodic background noise.
Figure: waveform (amplitude vs. time), spectrogram (frequency vs. time), and segmentation, with non-speech detected as speech.

14 VAD for NIST Speaker Recognition Evaluation
• Use speech enhancement as a pre-processing step.
Block diagram, Spectral-Subtraction VAD (SVAD): noisy speech → denoising (spectral subtraction) → denoised speech → energy-based VAD → speech segment info.
Verification path: feature extraction (MFCC) → scoring against the speaker model and the impostor model → decision making with a decision threshold → accept/reject.

15 VAD for NIST Speaker Recognition Evaluation
• Use speech enhancement as a pre-processing step.
Notation (signal, frequency spectrum):
Clean speech: x(n,m), X(ω,m)
Noisy speech: y(n,m), Y(ω,m)
Background speech: b(n,m), B(ω,m)
The parameter values were set so that as much noise as possible is removed.
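The slide's subtraction formula and parameter values are not preserved in this transcript. As a rough illustration of the general idea only, the sketch below applies a generic magnitude spectral subtraction to non-overlapping frames. The function name, the over-subtraction factor `alpha`, the spectral floor `beta`, the frame length, and the use of the first few frames as a background estimate are all assumptions, not the authors' settings.

```python
import numpy as np

def spectral_subtract(noisy, noise_frames=10, frame_len=256, alpha=2.0, beta=0.01):
    """Generic magnitude spectral subtraction (illustrative sketch).

    noisy        : 1-D array y(n), the noisy signal
    noise_frames : number of leading frames assumed to contain background only
    alpha, beta  : hypothetical over-subtraction factor and spectral floor
    """
    n_frames = len(noisy) // frame_len
    frames = np.reshape(noisy[: n_frames * frame_len], (n_frames, frame_len))

    spec = np.fft.rfft(frames, axis=1)            # Y(w, m) for each frame m
    mag, phase = np.abs(spec), np.angle(spec)

    # Background magnitude spectrum |B(w)| averaged over the leading frames.
    noise_mag = mag[:noise_frames].mean(axis=0)

    # Subtract the scaled background and floor the result so it stays positive.
    clean_mag = np.maximum(mag - alpha * noise_mag, beta * mag)

    # Resynthesize with the noisy phase.
    clean = np.fft.irfft(clean_mag * np.exp(1j * phase), n=frame_len, axis=1)
    return clean.reshape(-1)
```

A larger `alpha` removes more background at the cost of more speech distortion, which may be acceptable when the denoised signal is used only for the VAD decision.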

16 VAD for NIST Speaker Recognition Evaluation
Figure: waveforms (amplitude vs. time) without denoising and with denoising.

17 VAD for NIST Speaker Recognition Evaluation
Figure: segmentation without denoising (S: speech, h#: non-speech).

18 VAD for NIST Speaker Recognition Evaluation
Figure: segmentation with denoising, comparing SS-VAD and the VAD in the ETSI-AMR speech coder (S: speech, h#: non-speech).

19 VAD for NIST Speaker Recognition Evaluation
Speech-segment-length to speech-file-length ratio of 3 VADs
6249 speech files (NIST'05-08) were segmented into speech / non-speech by each of three VADs: energy-based VAD, the ETSI-AMR coder, and energy-based VAD with SS.
Example: total duration 10 s, total speech segments 3 s, so the speech-segment-length to speech-file-length ratio = 3/10.
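The ratio in the example can be computed from a VAD's segment boundaries as in the short sketch below. The function name and the `(start, end)` segment representation in seconds are assumptions made for illustration.

```python
def speech_ratio(segments, file_duration):
    """Speech-segment-length to speech-file-length ratio.

    segments      : list of (start, end) times in seconds labeled as speech
    file_duration : total length of the file in seconds
    """
    speech = sum(end - start for start, end in segments)
    return speech / file_duration

# Two speech segments of 1.5 s each in a 10 s file give a ratio of 3/10.
print(speech_ratio([(1.0, 2.5), (6.0, 7.5)], 10.0))  # 0.3
```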

20 VAD for NIST Speaker Recognition Evaluation
Speech-segment-length to speech-file-length ratio of 3 VADs
Figure: distribution of the ratio for the ordinary energy-based VAD, the spectral-subtraction VAD, and the VAD in the ETSI AMR coder.
Annotation: high frequency of occurrence, suggesting that many non-speech segments are mistakenly detected as speech segments.

22 Experiments on NIST SRE 2008
Speaker modeling: GMM-SVM
Score normalization: T-norm
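T-norm (test normalization) is usually described as shifting and scaling a trial score by the mean and standard deviation of the scores that the same test utterance obtains against a cohort of impostor models. The minimal sketch below follows that description; the function and argument names are illustrative, and the cohort used in the experiments is not specified in this transcript.

```python
import numpy as np

def t_norm(raw_score, cohort_scores):
    """Normalize one trial score with cohort statistics (illustrative sketch).

    raw_score     : score of the test utterance against the claimed speaker model
    cohort_scores : scores of the same test utterance against cohort impostor models
    """
    cohort = np.asarray(cohort_scores, dtype=float)
    return (raw_score - cohort.mean()) / cohort.std()
```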

23 Results on NIST 2008 SRE

24 Results on NIST 2008 SRE
Common Condition 1
Figure legend: SS-VAD, VAD, ETSI AMR.

25 Preliminary Results on NIST 2010
Table columns: EER (%), normalized minDCF.
Table rows: Energy-based VAD, SS-VAD, SMB, SS-SMB, NIST ASR Transcripts, ETSI-AMR.
Common Condition 2: all trials involving interview speech from different microphones.
SMB: statistical-model-based VAD (Sohn, et al., "A statistical model-based voice activity detection", IEEE Signal Processing Letters, 1999).
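For readers unfamiliar with the EER metric named in the table, the sketch below shows one simple way to estimate it from target and impostor trial scores: sweep a threshold over all observed scores and report the point where the miss rate and the false-alarm rate are closest. The function name and this particular estimator are assumptions; official NIST scoring tools may interpolate differently.

```python
import numpy as np

def equal_error_rate(target_scores, impostor_scores):
    """Estimate the equal error rate in percent (illustrative sketch)."""
    tgt = np.asarray(target_scores, dtype=float)
    imp = np.asarray(impostor_scores, dtype=float)
    thresholds = np.sort(np.concatenate([tgt, imp]))

    # Miss: a target trial scores below the threshold.
    miss = np.array([(tgt < t).mean() for t in thresholds])
    # False alarm: an impostor trial scores at or above the threshold.
    false_alarm = np.array([(imp >= t).mean() for t in thresholds])

    # Take the threshold where the two error rates are closest.
    i = np.argmin(np.abs(miss - false_alarm))
    return 100.0 * (miss[i] + false_alarm[i]) / 2.0
```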

26 Conclusions
• Noise reduction is of primary importance for VAD under extremely low SNR.
• It is important to remove the sinusoidal background found in NIST SRE sound files, as this kind of background signal can lead to many false detections in energy-based VAD.
• Using noise reduction as a pre-processing step yields a VAD that outperforms the VAD in ETSI-AMR (Option 2).

27 VAD for NIST Speaker Recognition Evaluation
Threshold Determination and VAD Decision Logic
Diagram labels: spike; windowing; frame amplitude; amplitude ranking (sample-based and frame-based); ranked amplitudes a_p1 ... a_pL over L frames; 500 preset non-speech frames; background amplitude μ_b.
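The decision logic itself is not recoverable from the diagram labels. The sketch below is therefore only one plausible reading: frame amplitudes are ranked, a peak level is taken as the median of the top-ranked frames so that isolated spikes do not dominate, a background level μ_b is averaged over a preset number of frames assumed to be non-speech, and the threshold is placed between the two. The function name, the choice of the first frames as background, `top_fraction`, and `eta` are all assumptions.

```python
import numpy as np

def energy_vad(signal, frame_len=80, n_background=500, top_fraction=0.05, eta=0.1):
    """Frame-level energy-based VAD with a rank-based threshold (illustrative sketch).

    Returns a boolean array with one speech/non-speech decision per frame.
    """
    n_frames = len(signal) // frame_len
    frames = np.reshape(signal[: n_frames * frame_len], (n_frames, frame_len))
    amp = np.abs(frames).mean(axis=1)             # frame-based amplitude

    # Rank frame amplitudes and take the median of the top ones as the peak
    # level, which is less sensitive to a few spiky frames than the maximum.
    ranked = np.sort(amp)[::-1]
    top = ranked[: max(1, int(top_fraction * n_frames))]
    a_peak = np.median(top)

    # Background level from frames assumed to be non-speech (here: the first ones).
    mu_b = amp[:n_background].mean()

    threshold = mu_b + eta * (a_peak - mu_b)
    return amp > threshold
```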

28 Results

29 Experiments on NIST SRE 2008
GMM-SVM training phase (block diagram):
• Feature extraction → model creation (NIST'05 & 06)
• Feature extraction → MAP adaptation (NIST'08) → GMM-supervectors of target speakers
• 300 background speakers (NIST'06) → MAP adaptation → GMM-supervectors of 300 impostors
• NAP and SVM training
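To make the "MAP adaptation → GMM-supervector" step concrete, the sketch below adapts only the means of a diagonal-covariance background GMM to one utterance using the standard relevance-MAP update, then stacks the adapted means into a single vector. The function name, the relevance factor of 16, and the absence of any kernel-specific scaling of the supervector are assumptions; the transcript does not give the authors' configuration.

```python
import numpy as np

def map_adapt_supervector(X, weights, means, variances, relevance=16.0):
    """Mean-only relevance-MAP adaptation and supervector stacking (sketch).

    X         : (T, D) feature vectors of one utterance
    weights   : (K,) mixture weights of the background GMM
    means     : (K, D) component means
    variances : (K, D) diagonal covariances
    """
    # Log-density of every frame under every diagonal Gaussian component.
    diff = X[:, None, :] - means[None, :, :]
    log_gauss = -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                        + np.sum(np.log(2.0 * np.pi * variances), axis=1))

    # Component posteriors (responsibilities), computed stably in the log domain.
    log_post = np.log(weights) + log_gauss
    log_post -= log_post.max(axis=1, keepdims=True)
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)

    n_k = post.sum(axis=0)                                     # soft counts
    first = post.T @ X                                         # first-order statistics
    data_mean = first / np.maximum(n_k, 1e-10)[:, None]

    # Interpolate between the data mean and the prior mean per component.
    alpha = (n_k / (n_k + relevance))[:, None]
    adapted = alpha * data_mean + (1.0 - alpha) * means
    return adapted.reshape(-1)                                 # (K * D,) supervector
```

Components that see little data keep means close to the background model, so every utterance maps to a vector of the same length regardless of its duration.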

30 Experiments on NIST SRE 2008 Verification phase