Brno University of Technology: Lukáš Burget, Michal Fapšo, Valiantsina Hubeika, Ondřej Glembek, Martin Karafiát, Marcel Kockmann, Pavel Matějka, Petr Schwarz and Honza Černocký. NIST Speaker Recognition Workshop 2008. MOBIO
NIST SRE 2008
Outline
Submitted systems
Factor Analysis systems
SVM-MLLR system
Side information based calibration and fusion
Contribution of subsystems in fusion
Analysis of FA system
– Gender dependent vs. independent
– Flavors of FA system
– Sensitivity to number of eigenchannels
– Importance of ZT-norm
– Optimization for microphone data
Techniques that did not make it to the submission
Conclusion
Submitted systems
BUT01 – primary (3 systems)
– Channel and language side information in fusion
– FA-MFCC13_39
– FA-MFCC20_60
– SVM-MLLR
BUT02 (3 systems)
– The same as BUT01, but no side information in fusion
BUT03 (2 systems)
– Channel and language side information in fusion
– FA-MFCC13_39
– FA-MFCC20_60
FA-MFCC13_39 system
MAP adapted UBM with 2048 Gaussian components
– Single UBM trained on Switchboard and NIST 2004, 2005 data
12 MFCC + C0 (20 ms window, 10 ms shift)
Short-time Gaussianization
– Rank of the current frame coefficient in a 3 s window is transformed by the inverse Gaussian cumulative distribution function
Delta + double delta + triple delta coefficients
– Together 52 coefficients, 12 frames of context
HLDA (dimensionality reduction from 52 to 39)
Factor Analysis model – gender independent
– 300 eigenvoices (Switchboards, NIST 2004, 2005)
– 100 eigenchannels for telephone speech (NIST 2004, 2005 tel data)
– 100 eigenchannels for microphone speech (NIST 2005 mic data)
ZT-norm – gender dependent
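The short-time Gaussianization step above can be sketched as follows: for each coefficient, the rank of the current frame within a sliding window (300 frames, about 3 s at a 10 ms shift) is mapped through the inverse Gaussian CDF. This is a minimal illustration of the idea, not the exact BUT implementation.

```python
import numpy as np
from scipy.stats import norm

def short_time_gaussianize(feats, win=300):
    """Warp each feature dimension toward a standard normal distribution.

    For every frame, the rank of the current coefficient within a
    sliding window is transformed by the inverse Gaussian CDF
    (norm.ppf). `feats` is a (frames x dims) array.
    """
    n, d = feats.shape
    out = np.empty_like(feats, dtype=float)
    half = win // 2
    for t in range(n):
        lo, hi = max(0, t - half), min(n, t + half)
        window = feats[lo:hi]
        # rank of the current frame among the window frames, per dimension;
        # +0.5 keeps the ratio strictly inside (0, 1) so ppf stays finite
        rank = (window < feats[t]).sum(axis=0) + 0.5
        out[t] = norm.ppf(rank / window.shape[0])
    return out
```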
FA-MFCC20_60 system
The same as FA-MFCC13_39 with the following differences:
– 60 dimensional features: 19 MFCC + energy + deltas + double deltas (no HLDA)
– Two gender dependent Factor Analysis models
SVM-MLLR system
Linear kernels
Rank normalization
LibSVM C++ library [Chang2001]
Pre-computed Gram matrices
Features are MLLR transformations adapting an LVCSR system (developed for the AMI project) to the speaker of a given speech segment
Estimation of the MLLR transformations makes use of the ASR transcripts provided by NIST
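The rank normalization mentioned above can be sketched as replacing each feature dimension by its normalized rank within a background population. This is a minimal illustration of the idea; the exact recipe used in the BUT system may differ in details.

```python
import numpy as np

def rank_normalize(background, feats):
    """Rank normalization for SVM input features.

    Each dimension of every row in `feats` is mapped into [0, 1]
    according to its rank within the corresponding dimension of the
    `background` set (both are samples x dims arrays).
    """
    order = np.sort(background, axis=0)   # per-dimension sorted background
    n = order.shape[0]
    ranks = np.stack(
        [np.searchsorted(order[:, j], feats[:, j]) for j in range(feats.shape[1])],
        axis=1,
    )
    return ranks / n                      # normalized rank in [0, 1]
```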
SVM-MLLR system
Cascade of CMLLR and MLLR
– 2 CMLLR transformations (silence and speech)
– 3 MLLR transformations (silence and 2 phoneme clusters)
Silence transformations are discarded for SRE
Supervector = 1 CMLLR + 2 MLLR = 3 × 40 × 39 = 4680 coefficients
Impostors: NIST 2005 mic data
ZT-norm: speakers from NIST 2004
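The supervector dimensionality above can be checked with a small sketch: each of the three retained transforms (1 CMLLR + 2 MLLR) is affine, i.e. 39 output rows by 39 inputs + 1 bias column = 40 columns, and the flattened transforms are stacked into one vector. The random matrices here are only stand-ins for real transform estimates.

```python
import numpy as np

n_transforms, rows, cols = 3, 39, 40   # 1 CMLLR + 2 MLLR, 39-dim features + bias
transforms = [np.random.randn(rows, cols) for _ in range(n_transforms)]

# Stack the flattened transform matrices into one SVM supervector.
supervector = np.concatenate([t.ravel() for t in transforms])
print(supervector.shape)   # (4680,)
```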
Side info based calibration and fusion
Side information for each trial is given by its hard assignment to classes:
– Trial channel condition provided by NIST: tel-tel, tel-mic, mic-tel, mic-mic
– English/non-English decision given by our LID system
Side information is used as follows:
– For each system: split trials by channel condition and calibrate scores using linear logistic regression (LLR) in each split separately
– Split trials according to the English/non-English decision and calibrate scores using LLR in each split separately
– Fuse the calibrated scores of all subsystems using LLR without making use of any side information
For convenience, the FoCal Bilinear toolkit by Niko Brummer was used, although we did not make use of its extensions over standard LLR.
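The per-split calibration scheme above can be sketched as follows: an affine score mapping s' = a·s + b is fit by logistic regression separately on each side-information split. This simple gradient-descent fit is a stand-in for the FoCal toolkit, and the function names are illustrative, not from the actual system.

```python
import numpy as np

def train_llr_calibration(scores, labels, iters=200, lr=0.1):
    """Fit an affine calibration s' = a*s + b by logistic regression.

    `scores`: raw detector scores; `labels`: 1 = target, 0 = non-target.
    Plain gradient descent on the logistic loss.
    """
    a, b = 1.0, 0.0
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(a * scores + b)))
        err = p - labels
        a -= lr * np.mean(err * scores)
        b -= lr * np.mean(err)
    return a, b

def calibrate_by_side_info(scores, labels, conditions):
    """Calibrate each side-information split (e.g. channel condition)
    separately, assuming the condition of every trial is known."""
    out = np.empty_like(scores)
    for c in np.unique(conditions):
        m = conditions == c
        a, b = train_llr_calibration(scores[m], labels[m])
        out[m] = a * scores[m] + b
    return out
```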
Side info based calibration and fusion: tel-tel trials
[DET plots on SRE 2006 (all trials, det1) and SRE 2008 (all trials, det6) comparing: no side information (BUT02); channel condition + language by NIST; channel condition + language by LID (BUT01, primary system)]
Use of side information is helpful
Some improvement can be obtained by relying on the language labels provided by NIST instead of the more realistic LID system
Side info based calibration and fusion: mic-mic trials
[DET plots on SRE 2006 (trial list defined by MIT-LL) and SRE 2008 (det1) comparing: no side information; channel condition + language by NIST; channel condition + language by LID]
Use of side information allowed us to use the same unchanged subsystems for all channel conditions
Subsystems and fusion: tel-tel trials
[DET plots on SRE 2006 (all trials, det1) and SRE 2008 (all trials, det6) comparing: SVM-MLLR; FA-MFCC13_39; FA-MFCC20_60; fusion of the 2 FA systems; fusion of all 3]
For tel-tel trials, the single FA-MFCC20_60 performs almost as well as the fusion
Subsystems and fusion: mic-mic trials
[DET plots on SRE 2006 (trial list defined by MIT-LL) and SRE 2008 (det1) comparing: SVM-MLLR; FA-MFCC13_39; FA-MFCC20_60; fusion of the 2 FA systems; fusion of all 3]
For microphone conditions, FA-MFCC20_60 fails to perform well and fusion is beneficial
The FA-MFCC13_39 system outperforms the FA-MFCC20_60 system, which has 3× more parameters and is possibly over-trained to the telephone data primarily used for FA model training
Gender dependent vs. gender independent FA system
[Results on SRE 2006 tel-tel and SRE 2006 mic-mic for: FA-MFCC13_39 GI; FA-MFCC20_60 GD; FA-MFCC20_60 GI]
Halving the number of parameters of the FA-MFCC20_60 system by making it gender independent degrades performance on the telephone condition and improves it on the microphone condition
Flavors of FA
Relevance MAP adaptation: M = m + dz with d² = Σ/τ, where Σ is a diagonal matrix with the UBM variance supervector on its diagonal
Eigenchannel adaptation (SDV, BUT)
– Relevance MAP for enrolling the speaker model
– Adapt the speaker model to the test utterance using eigenchannels estimated by PCA
FA without eigenvoices, with d² = Σ/τ (QUT, LIA)
FA without eigenvoices, with d trained from data (CRIM)
– Can be seen as training a different τ for each supervector coefficient
– Effective relevance factor: τ_ef = trace(Σ)/trace(d²)
FA with eigenvoices (CRIM)
– All hyperparameters can be trained from data using EM
Full model: M = m + vy + dz + ux, where m is the UBM mean supervector, v are the eigenvoices, d is a diagonal matrix, u are the eigenchannels; y and z are speaker specific factors, x are session specific factors
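The full FA model M = m + vy + dz + ux and the effective relevance factor can be illustrated numerically. The dimensions here are deliberately tiny and hypothetical (a real system would use e.g. 2048 Gaussians × 39 features, i.e. an ~80k-dimensional supervector), and the random matrices are stand-ins for trained hyperparameters.

```python
import numpy as np

sv_dim, n_voices, n_channels = 1000, 30, 10   # toy sizes for illustration
rng = np.random.default_rng(0)

m = rng.normal(size=sv_dim)                   # UBM mean supervector
v = rng.normal(size=(sv_dim, n_voices))       # eigenvoice matrix V
u = rng.normal(size=(sv_dim, n_channels))     # eigenchannel matrix U
d = np.abs(rng.normal(size=sv_dim))           # diagonal of D
y = rng.normal(size=n_voices)                 # speaker factors
z = rng.normal(size=sv_dim)                   # residual speaker factors
x = rng.normal(size=n_channels)               # session (channel) factors

# Full FA model: M = m + V y + D z + U x
M = m + v @ y + d * z + u @ x

# Effective relevance factor of a trained diagonal D:
sigma = np.abs(rng.normal(size=sv_dim))       # UBM variance supervector
tau_ef = sigma.sum() / (d ** 2).sum()         # trace(Sigma) / trace(D^2)
```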
Flavors of FA: results
[DET plots on SRE 2006 (all trials, det1) for MFCC13_39 and MFCC20_60 features, comparing: eigenchannel adaptation; FA with d² = Σ/τ; FA with d trained on data (τ_ef = 236.1 for MFCC13_39, τ_ef = 81.2 for MFCC20_60); FA with eigenvoices]
Without eigenvoices, simple eigenchannel adaptation seems to be more robust than FA
FA with trained d fails for MFCC13_39 features. Too high τ_ef? Caused by HLDA?
FA with eigenvoices significantly outperforms the other FA configurations
Sensitivity of FA to number of eigenchannels
[Results on MFCC20_60 features for eigenchannel adaptation, FA with d trained on data, and FA with eigenvoices, each with 50 vs. 100 eigenchannels]
FA systems without eigenvoices seem unable to robustly estimate an increased number of eigenchannels
However, once the speaker variability is explained by eigenvoices, we benefit significantly from more eigenchannels
Importance of ZT-norm
[Results on MFCC13_39 features for eigenchannel adaptation and FA with eigenvoices, with no normalization vs. zt-norm]
ZT-norm is essential for good performance of FA with eigenvoices
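ZT-norm combines the two classic score normalizations: Z-norm (normalize a trial score by the model's scores against impostor utterances) followed by T-norm (normalize by impostor-model scores against the test utterance, themselves Z-normed). The sketch below illustrates the composition; argument names are illustrative and the exact cohort setup in the BUT system may differ.

```python
import numpy as np

def z_norm(score, model_vs_imp_utts):
    """Z-norm: normalize with the model's impostor-utterance score statistics."""
    return (score - model_vs_imp_utts.mean()) / model_vs_imp_utts.std()

def zt_norm(score, model_vs_imp_utts, imp_models_vs_test, imp_models_vs_imp_utts):
    """ZT-norm sketch: Z-norm the trial score, then T-norm it against a
    cohort of impostor-model scores for the test utterance, each of
    which is first Z-normed with the same impostor utterances.

    `imp_models_vs_imp_utts[i]` holds impostor-utterance scores of the
    i-th impostor model.
    """
    z = z_norm(score, model_vs_imp_utts)
    zt_cohort = np.array([
        z_norm(s, imp_models_vs_imp_utts[i])
        for i, s in enumerate(imp_models_vs_test)
    ])
    return (z - zt_cohort.mean()) / zt_cohort.std()
```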
Training eigenchannels for different channel conditions
[FA-MFCC13_39 results on SRE 2006 trials: tel-tel (det1), tel-mic, mic-tel, mic-mic; and SRE 2008 trials: tel-tel (det6), tel-mic (det5), mic-tel (det4), mic-mic (det1); comparing EC trained on SRE04 and SRE05 tel. data alone vs. with EC trained on SRE05 mic. data added]
Negligible degradation on the tel-tel condition and a huge improvement, particularly on the mic-mic condition, is obtained after adding eigenchannels trained on microphone data to those trained on telephone data
Training additional eigenchannels on SRE08 dev data
[FA-MFCC13_39 results on SRE 2008 trials: tel-tel (det6), tel-mic (det5), mic-tel (det4), mic-mic (det1); comparing 100 EC trained on SRE04 and SRE05 tel. data + EC trained on SRE05 mic. data, with and without 20 additional EC trained on SRE08 dev data]
Significant improvement is obtained on the microphone condition for eval data after adding eigenchannels trained on SRE08 dev data (all spontaneous speech from the 6 interviewees)
Training additional eigenchannels on SRE08 dev data: primary fusion
[Results of the primary submission with and without 20 additional EC trained on SRE08 dev data, over det1: int-int, det2: int-int same mic, det3: int-int diff. mic, det4: int-phn, det5: phn-mic, det6: tel-tel, det7: tel-tel Eng., det8: tel-tel native]
20 eigenchannels trained on SRE08 dev data are added to the two FA subsystems
The improvement also generalizes to non-interview microphone data (det5)
Other techniques that did not make it to the submission
The following techniques were also tried but did not improve the fusion performance:
– GMM with eigenchannel adaptation
– SVM-GMM NAP
– SVM-GMM ISV (Inter-Session Variability modeling)
– SVM-GMM ISV derivative (Fisher) kernel
– SVM-GMM ISV based on FA-MFCC13_39
– FA modeling prosodic and cepstral contours
– SVM on phonotactics: counts from binary decision trees
– SVM on soft bigram statistics collected on accumulated posteriograms (matrix of posterior probabilities of phonemes for each frame)
See our poster for more details
Conclusions
FA systems built according to the recipe from [Kenny2008] perform excellently, though there is still some mystery to be solved
It was hard to find another complementary system that would contribute to the fusion of our two FA systems
Although our system was primarily trained on and tuned for telephone data, the FA subsystems can simply be augmented with eigenchannels trained on microphone data (as also proposed in [Kenny2008]), which makes the system perform well also on microphone conditions
Another significant improvement was obtained by training additional eigenchannels on data with matching channel conditions, even though only a very limited amount of such data was provided by NIST
Thanks
To Patrick Kenny for the [Kenny2008] recipe for building an FA system that really works, and for providing the list of files for training the FA system
To MIT-LL for creating and sharing the trial lists based on SRE06 data, which we used for system development
To Niko Brummer for FoCal Bilinear, which allowed us to start playing with the fusion just the last day before the submission deadline
References
[Kenny2008] P. Kenny et al.: A Study of Inter-Speaker Variability in Speaker Verification, IEEE TASLP, July 2008.
[Brummer2008] N. Brummer: FoCal Bilinear: Tools for detector fusion and calibration, with use of side-information, 2008.
[Chang2001] C. Chang et al.: LIBSVM: a library for Support Vector Machines, 2001.
[Stolcke2005/6] A. Stolcke et al.: MLLR Transforms as Features in SpkID, Eurospeech 2005 and Odyssey 2006.
[Hain2005] T. Hain et al.: The 2005 AMI system for RTS, Meeting Recognition Evaluation Workshop, Edinburgh, July 2005.
Inter-session variability
[Figure: UBM and target speaker model in supervector space, with directions of high inter-session variability and high inter-speaker variability]
Inter-session compensation
[Figure: UBM, target speaker model and test data in supervector space, with directions of high inter-session and high inter-speaker variability]
For recognition, move both models along the high inter-session variability direction(s) to best fit the test data
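Geometrically, the compensation above amounts to shifting a model supervector along the eigenchannel directions so that it best fits the test data. The least-squares sketch below illustrates only this geometric picture, not the full likelihood-based eigenchannel adaptation; the toy vectors are made up for illustration.

```python
import numpy as np

def compensate(model_sv, test_sv, U):
    """Shift `model_sv` along the eigenchannel directions (columns of U)
    to best fit `test_sv` in the least-squares sense, i.e. absorb the
    inter-session component before scoring.
    """
    x, *_ = np.linalg.lstsq(U, test_sv - model_sv, rcond=None)
    return model_sv + U @ x

# Toy example: session variability lies along one known direction.
U = np.array([[1.0], [0.0], [0.0]])      # single eigenchannel direction
speaker = np.array([0.0, 2.0, 1.0])      # enrolled speaker model
test = np.array([5.0, 2.1, 0.9])         # test utterance, channel-shifted
adapted = compensate(speaker, test, U)   # moves only along U
```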
LVCSR for SVM-MLLR system
The LVCSR system is adapted to the speaker (a VTLN factor and (C)MLLR transformations are estimated) using the ASR transcriptions provided by NIST
AMI 2005(6) LVCSR: a state-of-the-art system for the recognition of spontaneous speech [Hain2005]:
– 50k word dictionary (pronunciations of OOVs were generated by grapheme-to-phoneme conversion based on rules trained from data)
– PLP, HLDA
– CD-HMM with 7500 tied states, each modeled by 18 Gaussians
– Discriminatively trained using MPE
– Adapted to speaker: VTLN, SAT based on CMLLR, MLLR