# Acoustic Vector Re-sampling for GMMSVM-Based Speaker Verification


Acoustic Vector Re-sampling for GMMSVM-Based Speaker Verification
Man-Wai MAK and Wei RAO, The Hong Kong Polytechnic University

Good morning. Today I am here to give a presentation on "Acoustic Vector Re-sampling for GMMSVM-Based Speaker Verification".

Outline
- GMM-UBM for Speaker Verification
- GMM-SVM for Speaker Verification
- Data-Imbalance Problem in GMM-SVM
- Utterance Partitioning for GMM-SVM
- Experiments on NIST SRE

Let's look at the outline of the presentation. First, I will introduce the background of the topic. Second, I will explain the details of utterance partitioning with acoustic vector resampling. Third, I will briefly introduce the experimental setup. Finally, …

Speaker Verification
To verify the identity of a claimant based on his or her own voice.
Claimant: "I am Mary." System: "Is this Mary's voice?"

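Verification Process
Diagram: the claimant says "I'm John". After feature extraction, the features are scored against John's model (built from John's "voiceprint", positive branch) and against an impostor model (built from impostors' "voiceprints", negative branch). The scores go to score normalization and decision making, where they are compared with a decision threshold to produce an accept/reject decision.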

Acoustic Features
- Speech is a continuous evolution of the vocal tract
- Need to extract a sequence of spectra or a sequence of spectral coefficients
- Use a sliding window (… ms window, 10 ms shift)

Diagram: Log|X(ω)| → DCT → MFCC
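The sliding-window analysis on this slide can be sketched in a few lines of NumPy. This is a simplified illustration rather than the authors' front end: the 25 ms window length, the Hamming window, and all function names are assumptions, and the mel filterbank that a real MFCC extractor applies before the logarithm is omitted, so the output is a plain cepstrum rather than true MFCCs.

```python
import numpy as np

def frame_signal(signal, sample_rate, win_ms=25.0, shift_ms=10.0):
    """Slice a 1-D signal into overlapping frames (win_ms is an assumed value)."""
    win = int(round(sample_rate * win_ms / 1000.0))
    shift = int(round(sample_rate * shift_ms / 1000.0))
    n_frames = 1 + (len(signal) - win) // shift
    index = np.arange(win)[None, :] + shift * np.arange(n_frames)[:, None]
    return signal[index]

def cepstral_features(signal, sample_rate, n_coef=12):
    """Log-magnitude spectrum followed by a DCT-II; no mel filterbank (simplification)."""
    frames = frame_signal(signal, sample_rate)
    frames = frames * np.hamming(frames.shape[1])
    log_spectrum = np.log(np.abs(np.fft.rfft(frames, axis=1)) + 1e-10)
    n_bins = log_spectrum.shape[1]
    k = np.arange(1, n_coef + 1)[:, None]      # skip the 0th (energy-like) coefficient
    m = np.arange(n_bins)[None, :]
    dct_basis = np.cos(np.pi * k * (2 * m + 1) / (2 * n_bins))
    return log_spectrum @ dct_basis.T          # one row of coefficients per frame

rng = np.random.default_rng(0)
demo_signal = rng.standard_normal(8000)        # 1 s of noise at 8 kHz, illustration only
demo_features = cepstral_features(demo_signal, sample_rate=8000)
```

Each row of `demo_features` corresponds to one 10 ms step of the window, which is the "sequence of spectral coefficients" the slide refers to.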

GMM-UBM for Speaker Verification
The acoustic vectors (MFCCs) of speaker s are modeled by a probability density function parameterized by a Gaussian mixture model (GMM) specific to speaker s.
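For reference, a GMM density of this kind is usually written as below. The notation is assumed for illustration (M mixture components with weights λ, mean vectors μ, and covariance matrices Σ); the slide's own symbols may differ.

$$
p(\mathbf{x} \mid \Lambda^{(s)}) = \sum_{i=1}^{M} \lambda_i^{(s)} \, \mathcal{N}\!\left(\mathbf{x};\, \boldsymbol{\mu}_i^{(s)}, \boldsymbol{\Sigma}_i^{(s)}\right),
\qquad
\Lambda^{(s)} = \left\{ \lambda_i^{(s)}, \boldsymbol{\mu}_i^{(s)}, \boldsymbol{\Sigma}_i^{(s)} \right\}_{i=1}^{M}
$$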

GMM-UBM for Speaker Verification
The acoustic vectors of a general population are modeled by another GMM, called the universal background model (UBM), which has its own set of parameters.

GMM-UBM for Speaker Verification
Diagram: the enrollment utterance X(s) of the client speaker and the universal background model are fed to MAP adaptation, which produces the client speaker model.
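The MAP adaptation step can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: only the means are adapted, covariances are diagonal, and the function and variable names are hypothetical. The default relevance factor of 16 is the value listed on the features-and-models slide.

```python
import numpy as np

def map_adapt_means(frames, weights, means, variances, relevance=16.0):
    """Mean-only MAP adaptation of a diagonal-covariance GMM (assumed setup).

    frames: (T, D) acoustic vectors; weights: (M,); means, variances: (M, D).
    """
    diff = frames[:, None, :] - means[None, :, :]                 # (T, M, D)
    log_gauss = -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                        + np.sum(np.log(2.0 * np.pi * variances), axis=1))
    log_post = np.log(weights) + log_gauss                        # (T, M)
    log_post -= np.logaddexp.reduce(log_post, axis=1, keepdims=True)
    post = np.exp(log_post)                                       # component posteriors
    n = post.sum(axis=0)                                          # soft counts per component
    data_mean = (post.T @ frames) / np.maximum(n, 1e-10)[:, None]
    alpha = (n / (n + relevance))[:, None]                        # adaptation coefficients
    return alpha * data_mean + (1.0 - alpha) * means

rng = np.random.default_rng(0)
ubm_weights = np.array([0.5, 0.5])
ubm_means = np.array([[-1.0, -1.0], [1.0, 1.0]])
ubm_variances = np.ones((2, 2))
enrollment_frames = rng.standard_normal((200, 2)) + 1.0           # synthetic "utterance"
speaker_means = map_adapt_means(enrollment_frames, ubm_weights, ubm_means, ubm_variances)
```

Components that see little enrollment data get a small adaptation coefficient and stay close to the UBM, which is why a very short utterance yields a model that is almost identical to the UBM.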

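GMM-UBM Scoring
Two-class hypothesis problem:
- H0: MFCC sequence X(c) comes from the true speaker
- H1: MFCC sequence X(c) comes from an impostor

The verification score is a likelihood ratio between the speaker model and the background model.

Diagram: after feature extraction, the features are scored against the speaker model (positive branch) and the background model (negative branch); the resulting score is passed to the decision stage.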
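In log form, a likelihood-ratio score of this kind is commonly written as below. The notation is assumed for illustration: Λ^(s) denotes the target-speaker model, Λ^(ubm) the background model, and θ the decision threshold.

$$
S\!\left(X^{(c)}\right) = \log p\!\left(X^{(c)} \mid \Lambda^{(s)}\right) - \log p\!\left(X^{(c)} \mid \Lambda^{(\mathrm{ubm})}\right),
\qquad
\text{accept } H_0 \text{ if } S\!\left(X^{(c)}\right) > \theta
$$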


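GMM-SVM for Speaker Verification
Diagram: utterance → feature extraction → MAP adaptation of the UBM → mean stacking → supervector. The chain as a whole is a mapping from an utterance to a GMM-supervector.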

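GMM-SVM Scoring
Diagram: the utterance of target speaker s, the utterances of the background speakers, and the utterance of claimant c each go through feature extraction and are converted, with the help of the UBM, into GMM-supervectors. The supervector of the target speaker, the supervectors of the background speakers, and the supervector of the claimant are the inputs to SVM scoring.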

GMM-UBM Scoring Vs. GMM-SVM Scoring
Equation annotations on the GMM-SVM side: the normalized GMM-supervector of the claimant's utterance and the normalized GMM-supervector of the target speaker's utterance.
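One common way to form a normalized GMM-supervector and the associated linear kernel follows Campbell et al. (2006), listed in the references: each adapted mean is scaled by the square root of its mixture weight and by the inverse square root of its covariance before stacking. The sketch below assumes diagonal covariances shared with the UBM; the function names and toy numbers are hypothetical.

```python
import numpy as np

def normalized_supervector(means, weights, variances):
    """Stack sqrt(weight_i) * mean_i / sqrt(variance_i) over all mixture components."""
    scaled = np.sqrt(weights)[:, None] * means / np.sqrt(variances)
    return scaled.reshape(-1)

def supervector_kernel(sv_a, sv_b):
    """Linear kernel between two normalized supervectors."""
    return float(np.dot(sv_a, sv_b))

# Toy 2-mixture, 2-dimensional example (values chosen for illustration only).
weights = np.array([0.4, 0.6])
variances = np.array([[1.0, 4.0], [0.25, 1.0]])
means_claimant = np.array([[1.0, 2.0], [0.5, -1.0]])
means_target = np.array([[2.0, 0.0], [1.0, 1.0]])

kernel_value = supervector_kernel(
    normalized_supervector(means_claimant, weights, variances),
    normalized_supervector(means_target, weights, variances),
)
```

With this normalization the plain dot product equals the weighted, variance-scaled sum of mean inner products, so a standard linear SVM can be trained directly on the normalized supervectors.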


Data Imbalance in GMM-SVM
For each target speaker, we have only one utterance (one GMM-supervector) from the target speaker but many utterances from the background speakers. This makes SVM training a highly imbalanced learning problem: there is only one training vector in the target-speaker class.

Data Imbalance in GMM-SVM
Orientation of the decision boundary depends mainly on impostor-class data

Data Imbalance in GMM-SVM
Figure: a 3-dimensional two-class problem (impostor class vs. speaker class) illustrating that the SVM decision plane is largely governed by the impostor-class supervectors. The green region beneath the decision plane is the part of the supervector space in which the speaker-class supervector can move around without changing the orientation of the decision plane.


Utterance Partitioning
Partition an enrollment utterance of a target speaker into a number of sub-utterances, with each sub-utterance producing one GMM-supervector.

As we saw in the previous part, the situation in GMM-SVM-based speaker verification is very special, and conventional over-sampling methods are not appropriate. We therefore propose utterance partitioning.

Utterance Partitioning
Diagram: the target speaker's enrollment utterance and the background speakers' utterances each go through feature extraction; MAP adaptation of the UBM and mean stacking turn every utterance and sub-utterance into a GMM-supervector; SVM training on these supervectors produces the SVM of target speaker s.

This slide illustrates the procedure of utterance partitioning. After feature extraction, we obtain the sequence of acoustic vectors of the target speaker and partition it into four segments. Through MAP adaptation and mean stacking, we generate five GMM-supervectors: one per segment plus one from the full utterance. Matching the duration of target-speaker utterances with that of background utterances has been found useful in previous studies, so the same partitioning strategy is also applied to the B background utterances. This gives 5 × B background-speaker GMM-supervectors, which are used in SVM training to create the SVM of the target speaker.

When the number of partitions increases, the length of each sub-utterance decreases. If the sub-utterances are too short, their supervectors will be almost the same as the supervector corresponding to the UBM.

To increase the influence of the target speaker, we need more sub-utterances. In other words, we need to increase the number of segments, which reduces the length of the sub-utterances. This inevitably compromises the representation power of the sub-utterances and, in turn, of their GMM-supervectors. We address this issue by randomizing the order of the acoustic-vector sequence before partitioning takes place. The randomization and partitioning process can be repeated several times to produce the desired number of GMM-supervectors. We call this method "UP-AVR".

Utterance Partitioning with Acoustic Vector Resampling (UP-AVR)
Goal: Increase the number of sub-utterances without compromising their representation power.

Procedure of UP-AVR:
1. Randomly rearrange the sequence of acoustic vectors in an utterance.
2. Partition the acoustic vectors of the utterance into N segments.
3. Repeat Steps 1 and 2 R times to obtain RN + 1 target-speaker supervectors (RN from the sub-utterances plus one from the full utterance).

One way to generate more sub-utterances of reasonable length is to use the notion of random resampling from bootstrapping. The idea is based on the fact that the MAP adaptation algorithm uses the statistics of the whole utterance to update the GMM parameters. In other words, changing the order of the acoustic vectors does not affect the resulting MAP-adapted model.

Figure: MFCC sequence before randomization and MFCC sequence after randomization.
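The three steps of UP-AVR can be sketched as follows. The supervector extractor is passed in as a function; `mean_stand_in` is a hypothetical placeholder used only so that the example runs, where a real system would apply MAP adaptation and mean stacking. The function names and the fixed seed are likewise assumptions.

```python
import numpy as np

def up_avr(frames, n_partitions, n_resamplings, to_supervector, seed=0):
    """Return R*N + 1 supervectors from one utterance of shape (T, D)."""
    rng = np.random.default_rng(seed)
    supervectors = [to_supervector(frames)]            # full utterance: the "+1"
    for _ in range(n_resamplings):                     # repeat R times
        order = rng.permutation(len(frames))           # Step 1: randomize frame order
        for idx in np.array_split(order, n_partitions):    # Step 2: N segments
            supervectors.append(to_supervector(frames[idx]))
    return np.stack(supervectors)

def mean_stand_in(frames):
    """Placeholder for MAP adaptation + mean stacking (illustration only)."""
    return frames.mean(axis=0)

rng = np.random.default_rng(1)
utterance = rng.standard_normal((100, 3))              # 100 synthetic acoustic vectors
target_svs = up_avr(utterance, n_partitions=4, n_resamplings=3,
                    to_supervector=mean_stand_in)
```

With N = 4 and R = 3 this yields 13 supervectors, and every sub-utterance keeps a quarter of the frames regardless of how many times the resampling is repeated.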

Utterance Partitioning with Acoustic Vector Resampling (UP-AVR)
Diagram: the target speaker's enrollment utterance utt(s) and the background speakers' utterances utt(b1), utt(b2), ..., utt(bB) each go through feature extraction and index randomization. The resulting vector sequence X(s) is partitioned into sub-sequences X1(s) to X4(s), and each background sequence X(b) is partitioned in the same way. MAP adaptation of the UBM and mean stacking convert the sequences into supervectors, which are used in SVM training to produce the SVM of target speaker s.

This slide shows the procedure of UP-AVR. It is similar to the procedure of UP; the only difference is the added index randomization step. Using UP-AVR, we can generate more sub-utterances of reasonable length.

Utterance Partitioning with Acoustic Vector Resampling (UP-AVR)
Characteristics of supervectors (SVs) created by UP-AVR:
- The average pairwise distance between sub-utterance SVs is larger than the average pairwise distance between sub-utterance SVs and the full-utterance SV.
- The average pairwise distance between the speaker class's sub-utterance SVs and the impostor class's SVs is smaller than the average pairwise distance between the speaker class's full-utterance SV and the impostor class's SVs.

Figure legend: impostor class, speaker class, sub-utterance supervector, full-utterance supervector.

Nuisance Attribute Projection
Nuisance Attribute Projection (NAP) [Solomonoff et al., ICASSP2005]
Goal: To reduce the effect of session variability.

Starting from the GMM-supervector kernel, each supervector is treated as both session- and speaker-dependent. The session-dependent part (h) is removed by removing the subspace that causes the session variability; this subspace is defined by V. The new kernel is then computed on the projected supervectors.
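The subspace removal at the heart of NAP can be sketched as a projection P = I − VVᵀ applied to every supervector. The sketch assumes that the columns of V are orthonormal and already estimated; the random V below is purely illustrative, and the function name is hypothetical.

```python
import numpy as np

def nap_project(supervectors, V):
    """Apply (I - V V^T) to each row; V is (dim, corank) with orthonormal columns."""
    return supervectors - (supervectors @ V) @ V.T

rng = np.random.default_rng(0)
dim, corank = 10, 3                                    # toy sizes; the experiments use corank 64
V, _ = np.linalg.qr(rng.standard_normal((dim, corank)))    # orthonormal basis (illustrative)
session_dependent = rng.standard_normal((5, dim))
session_independent = nap_project(session_dependent, V)

# Kernel after NAP: a dot product of projected supervectors.
nap_kernel = float(session_independent[0] @ session_independent[1])
```

Computing the projection as two thin matrix products avoids forming the full dim × dim projection matrix, which matters because real supervectors have tens of thousands of dimensions.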


Enrollment Process of GMM-SVM with UP-AVR
Diagram: MFCCs of an utterance from target speaker s → resampling/partitioning → MAP adaptation of the UBM and mean stacking → session-dependent supervectors → NAP → session-independent supervectors → SVM training → SVM of target speaker s.

Verification Process of GMM-SVM with UP-AVR
Diagram: MFCCs of a test utterance from claimant c → MAP adaptation of the UBM and mean stacking → session-dependent supervector → NAP → session-independent supervector → SVM scoring with the SVM of target speaker s → score → T-norm (using the T-norm models) → normalized score.

T-Norm (Auckenthaler, 2000)
Goal: To shift and scale the verification scores so that a global decision threshold can be used for all speakers.

Diagram: the test utterance is scored by each of the T-norm SVMs (T-Norm SVM 1 to T-Norm SVM R). The mean and standard deviation of these scores are computed and used to normalize the score from the test utterance (the block labelled "Z-norm" in the diagram).
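The shift-and-scale operation can be sketched as a standardization of the raw score by the statistics of the T-norm model scores. The function name and the toy scores are hypothetical; the experiments in this talk use 200 T-norm models.

```python
import numpy as np

def t_norm(raw_score, tnorm_scores):
    """Standardize a raw score by the mean and std of the same test utterance's T-norm scores."""
    tnorm_scores = np.asarray(tnorm_scores, dtype=float)
    return (raw_score - tnorm_scores.mean()) / tnorm_scores.std()

# Toy example: the test utterance scored against three T-norm SVMs.
normalized = t_norm(3.0, [1.0, 2.0, 3.0])
```

Because the statistics are computed from the test utterance itself, the normalized scores of different target speakers become comparable, which is what allows a single global threshold.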


Experiments: Speech Data
Evaluations on NIST SRE 2002 and 2004.

NIST SRE 2002:
- Use NIST’01 for computing the UBMs, impostor-class supervectors of SVMs, T-norm models, and NAP parameters
- 2983 true-speaker trials and … impostor attempts
- 2-min utterances for training and about 1-min utterances for testing

NIST SRE 2004:
- Use the Fisher corpus for computing UBMs, impostor-class supervectors of SVMs, and T-norm models
- Use NIST’99 and NIST’00 for computing NAP parameters
- 2386 true-speaker trials and … impostor attempts
- 5-min utterances for training and testing

The table summarizes the roles played by these corpora in the evaluations. NIST’02 and NIST’04 were used for performance evaluation. When the evaluation database is NIST’02, we use the NIST’01 data to create the UBMs, T-norm models, and impostor-class supervectors of the SVMs, and to calculate the NAP matrices. When the evaluation database is NIST’04, we use the Fisher data to create the UBMs, T-norm models, and impostor-class supervectors of the SVMs, and NIST’99 and NIST’00 to calculate the NAP matrices.

Experiments: Features and Models
- 12 MFCC + 12 ΔMFCC with feature warping
- 1024-mixture GMMs for GMM-UBM
- 256-mixture GMMs for GMM-SVM
- MAP relevance factor = 16
- 300 impostor-class supervectors for GMM-SVM
- 200 T-norm models
- 64-dim session variability subspace (NAP corank, rank of V)

Results: No. of mixtures in GMM-SVM (NIST’02)
Figure annotations: "Normalized"; threshold below which the variance of a feature is deemed too small; large number of features with small variance.

Results: Effects of NAP on Different NIST SRE
Figure annotation: large eigenvalues mean large session variation.

Results: Effect of NAP Corank on Performance
Figure annotation: "No NAP".

Results: Comparing discriminative power of GMM-SVM and GMM-SVM with UP-AVR
Fig. 4: Scores produced by SVMs that use one or more speaker-class supervectors (SVs) and 250 background SVs for training. The horizontal axis represents the training/testing SVs. Values inside the square brackets are the mean difference between speaker scores and impostor scores.

Results: EER and MinDCF vs. No. of Target-Speaker Supervectors (NIST’02)
Figures (a) and (b) show the trends of EER and minimum DCF as the number of speaker-class supervectors increases. They demonstrate that utterance partitioning can reduce both EER and minimum DCF. More importantly, the most significant performance gain is obtained when the number of speaker-class supervectors increases from 1 to 5; the performance levels off when more supervectors are added by increasing the number of resamplings. This is reasonable because a large number of positive supervectors only results in a large number of zero Lagrange multipliers for the speaker class.

Results: Varying the number of resamplings (R) and number of partitions (N) (NIST’02)

Results
Table 1: NIST’04

Experiments and Results
Performance on NIST’02: EER = 9.39%, EER = 9.05%, EER = 8.16%

Experiments and Results
Performance on NIST’04:
- GMM-UBM: EER = 16.05%
- GMM-SVM: EER = 10.42%
- GMM-SVM w/ UP-AVR: EER = 9.46%

References
- S.X. Zhang and M.W. Mak, "Optimized Discriminative Kernel for SVM Scoring and its Application to Speaker Verification," IEEE Trans. on Neural Networks, to appear.
- M.W. Mak and W. Rao, "Utterance Partitioning with Acoustic Vector Resampling for GMM-SVM Speaker Verification," Speech Communication, vol. 53 (1), Jan. 2011, pp. …
- M.W. Mak and W. Rao, "Acoustic Vector Resampling for GMMSVM-Based Speaker Verification," Interspeech, Sept. 2010, Makuhari, Japan, pp. …
- S.Y. Kung, M.W. Mak, and S.H. Lin, Biometric Authentication: A Machine Learning Approach, Prentice Hall, 2005.
- W.M. Campbell, D.E. Sturim, and D.A. Reynolds, "Support vector machines using GMM supervectors for speaker verification," IEEE Signal Processing Letters, vol. 13, pp. 308–311, 2006.
- D.A. Reynolds, T.F. Quatieri, and R.B. Dunn, "Speaker verification using adapted Gaussian mixture models," Digital Signal Processing, vol. 10, pp. 19–41, 2000.
