SNR-Dependent Mixture of PLDA for Noise Robust Speaker Verification

Presentation transcript:

SNR-Dependent Mixture of PLDA for Noise Robust Speaker Verification. Man-Wai Mak, Interspeech 2014. Department of Electronic and Information Engineering, The Hong Kong Polytechnic University, Hong Kong SAR, China.

Contents: Motivation of Work; Conventional PLDA; Mixture of PLDA for Noise Robust Speaker Verification; Experiments on SRE12; Conclusions.

Motivation: Conventional i-vector/PLDA systems use a single PLDA model to handle all SNR conditions. [Figure: enrollment utterances, i-vector/PLDA scoring, PLDA score]

Motivation: We argue that a PLDA model should focus on a small range of SNR. [Figure: PLDA Models 1, 2 and 3, each producing its own PLDA score]

Distribution of SNR in SRE12: Each SNR region is handled by a PLDA model.

Proposed Solution: The full spectrum of SNRs is handled by a mixture of PLDA in which the posteriors of the indicator variables depend on the utterance's SNR. [Figure: SNR estimator, SNR posterior estimator, PLDA Models 1, 2 and 3, PLDA score]

Key Features of Proposed Solution: Verification scores depend not only on the same-speaker and different-speaker likelihoods but also on the posterior probabilities of SNR.


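Probabilistic LDA (PLDA): In PLDA, the i-vectors x are modeled by a factor analyzer of the form $x_{ij} = m + V z_i + \epsilon_{ij}$, where $x_{ij}$ is the i-vector extracted from the j-th session of the i-th speaker, $m$ is the global mean of all i-vectors, $V$ is the speaker factor loading matrix, $z_i$ is the speaker factor, and $\epsilon_{ij}$ is residual noise with covariance $\Sigma$. With the usual standard normal prior on $z_i$, the density of x is $p(x) = \mathcal{N}(x \mid m, VV^{\mathsf{T}} + \Sigma)$.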
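The factor-analysis form above can be illustrated with a small sampling sketch. It assumes a standard normal prior on the speaker factor, and the function name `sample_plda`, the toy dimensions and the parameter values are hypothetical choices for illustration, not values from the slides. It draws i-vectors from the model and estimates the covariances that the model predicts: $VV^{\mathsf{T}} + \Sigma$ for a single session, and $VV^{\mathsf{T}}$ between two sessions of the same speaker.

```python
import numpy as np

def sample_plda(m, V, Sigma, n_speakers, n_sessions, rng):
    """Draw i-vectors x_ij = m + V z_i + eps_ij for every speaker i and session j."""
    D, Q = V.shape
    z = rng.standard_normal((n_speakers, Q))      # z_i ~ N(0, I), shared by all sessions of speaker i
    chol = np.linalg.cholesky(Sigma)              # Sigma = chol @ chol.T
    eps = rng.standard_normal((n_speakers, n_sessions, D)) @ chol.T  # eps_ij ~ N(0, Sigma)
    x = m + (z @ V.T)[:, None, :] + eps           # broadcast V z_i over the sessions
    return x, z

rng = np.random.default_rng(0)
D, Q = 4, 2                                       # toy sizes (assumption), not the 200-dim i-vectors
m = rng.standard_normal(D)
V = rng.standard_normal((D, Q))
Sigma = 0.5 * np.eye(D)
x, z = sample_plda(m, V, Sigma, n_speakers=20000, n_sessions=2, rng=rng)

# Empirical covariance of one session; the model predicts V V^T + Sigma.
marginal_cov = np.cov(x[:, 0, :], rowvar=False)

# Empirical cross-covariance between two sessions of the same speaker; the model predicts V V^T.
xc = x - x.mean(axis=0)
cross_cov = xc[:, 0, :].T @ xc[:, 1, :] / (len(x) - 1)
```

The cross-covariance term is what makes two i-vectors from the same speaker statistically dependent, which is the property that PLDA scoring exploits.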

Probabilistic LDA (PLDA): The PLDA parameters ω = {m, V, Σ} are estimated by maximizing the likelihood of the training i-vectors with the EM algorithm.


Mixture of PLDA: Model Parameters. The parameters of mPLDA fall into two groups: those for modeling the SNR of utterances and those for modeling SNR-dependent i-vectors.

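Generative Model for mPLDA: With the SNR of each utterance expressed in dB, the model is defined in terms of the posterior probability of SNR, that is, the posterior of the mixture indicator given the utterance's SNR.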
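The SNR posterior can be sketched as follows, under the assumption that the SNR of utterances is modeled by a one-dimensional Gaussian mixture. The names `pi`, `mu` and `sigma` (mixture weights, SNR means in dB, SNR standard deviations in dB) and all numeric values are hypothetical; the slide's own equation was not preserved in this transcript.

```python
import numpy as np

def snr_posteriors(snr_db, pi, mu, sigma):
    """Posterior of each mixture component given an utterance SNR (in dB).

    Assumes a 1-D Gaussian mixture over SNR; returns an array of shape
    (n_utterances, n_components) whose rows sum to one.
    """
    snr = np.atleast_1d(np.asarray(snr_db, dtype=float))[:, None]
    log_pdf = -0.5 * np.log(2.0 * np.pi * sigma**2) - 0.5 * ((snr - mu) / sigma) ** 2
    log_w = np.log(pi) + log_pdf
    log_w -= log_w.max(axis=1, keepdims=True)     # subtract the row maximum for numerical stability
    w = np.exp(log_w)
    return w / w.sum(axis=1, keepdims=True)

# Illustrative three-component model (values are assumptions, not from the slides).
pi = np.array([1 / 3, 1 / 3, 1 / 3])
mu = np.array([0.0, 6.0, 15.0])
sigma = np.array([3.0, 3.0, 3.0])
post = snr_posteriors([0.0, 6.0, 20.0], pi, mu, sigma)
```

Each row of `post` gives the soft assignment of one utterance to the SNR-dependent PLDA models, so an utterance whose SNR lies between two component means contributes to both.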

PLDA vs. mPLDA: Generative Model. [Figure: generative models of PLDA and mixture of PLDA]

Likelihood-Ratio Scores of mPLDA. Same-speaker likelihood: a function of the i-vectors of the target and test speakers and of the SNRs of the target and test utterances.

Likelihood-Ratio Scores of mPLDA. Different-speaker likelihood: defined on the same i-vectors and SNRs. Verification score = (same-speaker likelihood) / (different-speaker likelihood).
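A hedged sketch of such a score is given below. It is one plausible reading of the slides, not the authors' exact formula: the same-speaker likelihood is taken as a sum over pairs of mixture components, weighted by the SNR posteriors of the target and test utterances, of a joint Gaussian in which both i-vectors share one speaker factor; the different-speaker likelihood is the product of the two SNR-weighted marginals. The function names and the arguments `gs` and `gt` (SNR posteriors of the target and test utterances) are hypothetical.

```python
import numpy as np

def log_gauss(x, mean, cov):
    """Log density of a multivariate Gaussian."""
    d = x - mean
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (len(x) * np.log(2.0 * np.pi) + logdet + d @ np.linalg.solve(cov, d))

def logsumexp(values):
    values = np.asarray(values, dtype=float)
    mx = values.max()
    return mx + np.log(np.exp(values - mx).sum())

def mplda_llr(xs, xt, gs, gt, m, V, Sigma):
    """Log-likelihood ratio for a target i-vector xs and a test i-vector xt.

    gs, gt: SNR posteriors (length K) of the target and test utterances.
    m, V, Sigma: lists of K per-component PLDA parameters.
    """
    K = len(m)
    x_pair = np.concatenate([xs, xt])
    same = []
    for ks in range(K):
        for kt in range(K):
            mean = np.concatenate([m[ks], m[kt]])
            cross = V[ks] @ V[kt].T               # covariance induced by the shared speaker factor
            cov = np.block([[V[ks] @ V[ks].T + Sigma[ks], cross],
                            [cross.T, V[kt] @ V[kt].T + Sigma[kt]]])
            same.append(np.log(gs[ks]) + np.log(gt[kt]) + log_gauss(x_pair, mean, cov))
    diff_s = [np.log(gs[k]) + log_gauss(xs, m[k], V[k] @ V[k].T + Sigma[k]) for k in range(K)]
    diff_t = [np.log(gt[k]) + log_gauss(xt, m[k], V[k] @ V[k].T + Sigma[k]) for k in range(K)]
    return logsumexp(same) - logsumexp(diff_s) - logsumexp(diff_t)

# Toy two-component example (all values are assumptions for illustration).
rng = np.random.default_rng(1)
K, D, Q = 2, 3, 2
m = [rng.standard_normal(D) for _ in range(K)]
V = [rng.standard_normal((D, Q)) for _ in range(K)]
Sigma = [0.5 * np.eye(D) for _ in range(K)]
xs, xt = rng.standard_normal(D), rng.standard_normal(D)
score = mplda_llr(xs, xt, np.array([0.9, 0.1]), np.array([0.2, 0.8]), m, V, Sigma)
```

With a single component and unit posteriors, the function reduces to the usual PLDA log-likelihood ratio, which is a useful sanity check on the sketch.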

PLDA vs. mPLDA: Auxiliary Function. The mPLDA auxiliary function involves the latent indicator variables, the latent speaker factors and the SNR of the training utterances, indexed by speaker, session and mixture component (up to the number of mixtures).

PLDA vs. mPLDA: E-Step. [Equations: E-step for PLDA and for mixture of PLDA]
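For reference, the E-step of conventional PLDA (not the mixture version) can be sketched with the standard factor-analysis posterior: for a speaker with H sessions, the posterior precision of the speaker factor is $L = I + H V^{\mathsf{T}} \Sigma^{-1} V$ and the posterior mean is $L^{-1} V^{\mathsf{T}} \Sigma^{-1} \sum_j (x_{ij} - m)$. The function name `plda_e_step` and the toy values are hypothetical, and the slide's own equations may be arranged differently.

```python
import numpy as np

def plda_e_step(X, m, V, Sigma):
    """Posterior moments of the speaker factor z for one speaker.

    X: array of shape (H, D) holding the H i-vectors of the speaker.
    Returns the posterior mean <z> and second moment <z z^T>.
    """
    H = X.shape[0]
    Q = V.shape[1]
    Sinv_V = np.linalg.solve(Sigma, V)            # Sigma^{-1} V
    L = np.eye(Q) + H * (V.T @ Sinv_V)            # posterior precision of z
    cov_z = np.linalg.inv(L)
    mean_z = cov_z @ (Sinv_V.T @ (X - m).sum(axis=0))
    second_moment = cov_z + np.outer(mean_z, mean_z)
    return mean_z, second_moment

# Toy example (all values are assumptions for illustration).
rng = np.random.default_rng(2)
D, Q, H = 4, 2, 3
m = rng.standard_normal(D)
V = rng.standard_normal((D, Q))
Sigma = np.diag(rng.uniform(0.5, 1.5, size=D))
X = rng.standard_normal((H, D))
mean_z, second_moment = plda_e_step(X, m, V, Sigma)
```

In a mixture of PLDA, the E-step would additionally need the posteriors of the latent indicator variables; that part is not sketched here because the transcript does not preserve its equations.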

PLDA vs. mPLDA: M-Step. [Equations: M-step for PLDA and for mixture of PLDA]


Experiments. Evaluation dataset: common evaluation condition 2 of the NIST SRE 2012 core set. Parameterization: 19 MFCCs together with energy, plus their 1st and 2nd derivatives (60-dim). UBM: gender-dependent, 1024 mixtures. Total variability matrix: gender-dependent, 500 total factors. I-vector preprocessing: whitening by WCCN, then length normalization, followed by LDA (500-dim to 200-dim) and WCCN.
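The first two preprocessing steps, WCCN followed by length normalization, can be sketched as below. This is a minimal illustration under stated assumptions: the within-class covariance is taken as the average of per-speaker covariances, the WCCN projection is the Cholesky factor of its inverse, and the function names, toy dimensions and random data are hypothetical. The LDA step and the second WCCN are omitted.

```python
import numpy as np

def within_class_cov(X, labels):
    """Average of the per-speaker covariance matrices."""
    classes = np.unique(labels)
    W = np.zeros((X.shape[1], X.shape[1]))
    for c in classes:
        Xc = X[labels == c]
        Xc = Xc - Xc.mean(axis=0)
        W += Xc.T @ Xc / len(Xc)
    return W / len(classes)

def wccn_matrix(X, labels):
    """Projection B with B B^T = W^{-1}, where W is the within-class covariance."""
    W = within_class_cov(X, labels)
    return np.linalg.cholesky(np.linalg.inv(W))

def length_normalize(X):
    """Scale every i-vector to unit Euclidean norm."""
    return X / np.linalg.norm(X, axis=1, keepdims=True)

# Toy data: 50 speakers with 10 sessions each in 5 dimensions (assumed sizes).
rng = np.random.default_rng(3)
D = 5
labels = np.repeat(np.arange(50), 10)
speaker_means = rng.standard_normal((50, D))
X = speaker_means[labels] + rng.standard_normal((len(labels), D)) @ rng.standard_normal((D, D))

B = wccn_matrix(X, labels)
X_wccn = X @ B                                    # row-wise form of B^T x
X_norm = length_normalize(X_wccn)
```

After the projection, the within-class covariance of `X_wccn` is the identity, so session variability is equalized across dimensions before the vectors are scaled to unit length.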

Experiments. In NIST 2012 SRE, training utterances from telephone channels are clean, but some of the test utterances are noisy. We used the FaNT tool to add babble noise to the clean training utterances. [Figure labels: babble noise; utterances from microphone channels; FaNT; from telephone channels]
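The idea of adding noise at a target SNR can be sketched as follows. This is not FaNT's algorithm or interface: it is a simplified illustration that measures power over the whole signal without any speech-activity detection or filtering, and the function name `add_noise_at_snr` and the synthetic signals are hypothetical.

```python
import numpy as np

def add_noise_at_snr(speech, noise, snr_db):
    """Scale the noise so that the mixture has the requested SNR in dB.

    Power is measured over the entire signal (a simplification).
    """
    noise = np.resize(noise, speech.shape)        # repeat or truncate the noise to the speech length
    p_speech = np.mean(speech**2)
    p_noise = np.mean(noise**2)
    gain = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise

# Synthetic stand-ins for a clean utterance and a babble-noise recording.
rng = np.random.default_rng(4)
speech = rng.standard_normal(16000)
babble = 0.1 * rng.standard_normal(4000)
noisy = add_noise_at_snr(speech, babble, snr_db=6.0)

# SNR of the mixture, recomputed from the added noise component.
measured_snr = 10.0 * np.log10(np.mean(speech**2) / np.mean((noisy - speech) ** 2))
```

A practical tool would measure the speech level only over active speech segments, which is why the slides mention using a VAD when determining SNR.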

Performance on SRE12. Train on tel+mic speech and test on noisy tel speech (CC4). Train on tel+mic speech and test on tel speech recorded in noisy environments (CC5). FaNT and a VAD are used to determine the SNR of test utterances; see our ISCSLP14 paper.

Performance on SRE12. Train on tel+mic speech and test on noisy tel speech (CC4). FaNT and a VAD are used to determine the SNR of test utterances. [Figure: results for male and female speakers, PLDA vs. mPLDA]

Conclusions. A mixture of SNR-dependent PLDA is a flexible model that can handle noisy speech with a wide range of SNR. The contributions of the mixtures are probabilistically combined based on the SNR of the test utterances and the target speaker's utterances. Results show that the mixture of PLDA performs better than conventional PLDA whenever the SNR of test utterances varies widely.

Hard-Decision Mixture of PLDA

Training of mPLDA. The auxiliary function involves the latent indicator variables, the latent speaker factors and the SNR of the training utterances, indexed by speaker, session and mixture component (up to the number of mixtures).

PLDA Scoring: x_s and x_t share the same z.

Probabilistic LDA (PLDA). PLDA example: 2-D data in a 1-D subspace z. Take a sample according to p(z). Source: S. Prince, "Computer vision: models, learning and inference", 2012.