Presentation on theme: "Brno University Of Technology Lukáš Burget, Michal Fapšo, Valiantsina Hubeika, Ondřej Glembek, Martin Karafiát, Marcel Kockmann, Pavel Matějka,"— Presentation transcript:
Brno University Of Technology Speech@FIT Lukáš Burget, Michal Fapšo, Valiantsina Hubeika, Ondřej Glembek, Martin Karafiát, Marcel Kockmann, Pavel Matějka, Petr Schwarz and Honza Černocký NIST Speaker Recognition Workshop 2008 MOBIO
NIST SRE 20082/24 Outline Submitted systems Factor Analysis systems SVM-MLLR system Side information based calibration and fusion Contribution of subsystems in fusion Analysis of FA system – Gender dependent vs. independent – Flavors of FA system – Sensitivity to number of eigenchannels – Importance of ZT-norm – Optimization for microphone data Techniques that did not make it to the submission Conclusion
NIST SRE 20083/24 Submitted systems BUT01 - primary (3 systems) – Channel and language side information in fusion – FA-MFCC13 39 – FA-MFCC20 60 – SVM-MLLR BUT02 - (3 systems) – The same as BUT01, but no side information in fusion BUT03 - (2 systems) – Channel and language side information in fusion – FA-MFCC13 39 – FA-MFCC20 60
NIST SRE 20084/24 FA-MFCC13 39 system MAP adapted UBM with 2048 Gaussian components – Single UBM trained on Switchboard and NIST 2004,5 data 12 MFCC + C0 (20ms window, 10ms shift ) Short time Gaussianization – Rank of the current frame coefficient in 3sec window transformed by inverse Gaussian cumulative distribution function. Delta + double delta + triple delta coefficients – Together 52 coefficients, 12 frames context HLDA (dimensionality reduction from 52 to 39) Factor Analysis Model – gender independent – 300 eigenvoices (Switchboards, NIST 2004,5) – 100 eigenchannels for telephone speech (NIST 2004,5 tel data) – 100 eigenchannels for microphone speech (NIST 2005 mic data) ZT-norm – gender dependent
NIST SRE 20085/24 FA-MFCC20 60 system The same as FA-MFCC13 39 with the following differences: – 60 dimensional features are: 19 MFCC + Energy + deltas + double deltas (no HLDA) – Two gender dependent Factor Analysis models
NIST SRE 20086/24 SVM – MLLR system Linear kernels Rank normalization LibSVM C++ library [Chang2001] Pre-computed Gram matrices Features are MLLR transformations adapting LVCSR system (developed for AMI project) to speaker of given speech segment Estimation of MLLR transformations makes use of the ASR transcripts provided by NIST
NIST SRE 20087/24 SVM – MLLR system Cascade of CMLLR and MLLR – 2 CMLLR transformation (silence and speech) – 3 MLLR transformation (silence and 2 phoneme clusters) Silence transformations are discarded for SRE Supervector = 1 CMLLR + 2 MLLR = = 3*39 2 +3*39=4680 Impostors: NIST 2004 + mic data from NIST 2005 ZT-norm: speakers from NIST 2004
NIST SRE 20088/24 Side info based calibration and fusion Side information for each trial is given by its hard assignment to classes: – Trial channel condition provided by NIST: tel-tel, tel-mic, mic- tel, mic-mic – English/non-English decision given by our LID system Side information is used as follows: – For each system: Split trials by channel condition and calibrate scores using linear logistic regression (LLR) in each split separately Split trials according to English/non-English decision and calibrate scores using LLR in each split separately – Fuse the calibrated scores of all subsystems using LLR without making use of any side information For convenience, FoCal Bilinear toolkit by Niko Brummer was used, although we did not make use of its extensions over standard LLR.
NIST SRE 20089/24 Side info based calibration and fusion tel-tel trials SRE 2006 (all trials, det1)SRE 2008 (all trials, det6) No side information (BUT02) Channel cond. + Lang. by NIST Channel cond. + Lang. by LID (BUT01 – primary system) Use of side information is helpful Some improvement can be obtained by relying on language information provided by NIST instead of more realistic LID system
NIST SRE 200810/24 Side info based calibration and fusion mic-mic trials SRE 2006 (trial list defined by MIT-LL)SRE 2008 (det1) No side information Channel cond. + Lang. by NIST Channel cond. + Lang. by LID Use of side information allowed to use the same unchanged subsystems for all the channels
NIST SRE 200811/24 Subsystems and fusion tel-tel trials SRE 2006 (all trials, det1)SRE 2008 (all trials, det6) SVM-MLLR FA-MFCC13 39 FA-MFCC20 60 Fusion of 2 x FA Fusion of all 3 For tel-tel trials, the single FA-MFCC20 60 performs almost as well as the fusion
NIST SRE 200812/24 Subsystems and fusion mic-mic trials SVM-MLLR FA-MFCC13 39 FA-MFCC20 60 Fusion of 2 x FA Fusion of all 3 SRE 2006 (trial list defined by MIT-LL)SRE 2008 (det1) For microphone conditions FA-MFCC20 60 fails to perform well and fusion is beneficial. FA-MFCC13 39 system outperforms FA-MFCC20 60 system having 3x more parameters, which is possibly too over-trained to telephone data primarily used for FA model training.
NIST SRE 200813/24 Gender dependent vs. gender independent FA system SRE 2006 tel-telSRE 2006 mic-mic FA-MFCC13 39 GI FA-MFCC20 60 GD FA-MFCC20 60 GI Halving the number of parameters of FA-MFCC20 60 system by making it gender independent degrades the performance on telephone and improves on microphone condition
NIST SRE 200814/24 Flavors of FA Relevance MAP adaptation M = m + dz with d 2 = Σ/ τ where Σ is matrix with UBM variance supervector in diagonal Eigenchannel adaptation (SDV, BUT) Relevance MAP for enrolling speaker model Adapt speaker model to test utterance using eigenchannels estimated by PCA FA without eigenvoices, with d 2 = Σ/ τ (QUT, LIA) FA without eigenvoices, with d trained from data (CRIM) can be seen as training different τ for each supervector coefficient Effective relevance factor τ ef = trace(Σ)/ trace(d 2 ) FA with eigenvoices (CRIM) All hyperparameters can be trained from data using EM Eigenchannels Diagonal matrix Eigenvoices UBM mean supervector M = m + vy + dz + ux Speaker specific factorsSession specific factors
NIST SRE 200815/24 Flavors of FA Eigenchannel adapt. FA: d 2 = Σ/ τ FA: d trained on data FA with eigenvoices MFCC13 39 featuresMFCC20 60 features SRE 2006 (all trials, det1) τ ef = 236.1τ ef = 81.2 Without eigenvoices, simple eigenchannel adaptation seems to be more robust than FA. FA with trained d fails for MFCC13 39 features. Too high τ ef ? Caused by HLDA? FA with eigenvoices significantly outperform the other FA configurations.
NIST SRE 200816/24 Sensitivity of FA to number of eigenchannels Eigenchannel adapt. FA: d trained on data FA with eigenvoices 50 eigenchannels 100 eigenchannels MFCC20 60 features FA systems without eigenvoices seem not to be able to robustly estimate increased number of eigenchannels However, we benefit from more eigenchannels significantly after explaining the speaker variability by eigenvoices
NIST SRE 200817/24 Importance of zt-norm Eigenchannel adapt. FA with eigenvoices no normalization zt-norm zt-norm is essential for good performance FA with eigenvoices. MFCC13 39 features
NIST SRE 200818/24 Training eigenchannels for different channel conditions SRE2006 trials: tel-tel (det1) tel-mic mic-tel mic-mic SRE2008 trials: tel-tel (det6) tel-mic (det5) mic-tel (det4) mic-mic (det1) FA-MFCC13 39 100 EC trained on SRE04, SRE05 tel. data + 100 EC trained on SRE05 mic. data Negligible degradation on tel-tel condition and huge improvement particularly on mic-mic condition is obtained after adding eigenchannels trained on microphone data to those trained on telephone data.
NIST SRE 200819/24 Training additional eigenchannels on SRE08 dev data SRE2008 trials: tel-tel (det6) tel-mic (det5) mic-tel (det4) mic-mic (det1) FA-MFCC13 39 Significant improvement is obtained on microphone condition for eval data after adding eigenchannels trained on SER08 dev data (all spontaneous speech from the 6 interviewees). 100 EC trained on SRE04, SRE05 tel. data + 100 EC trained on SRE05 mic. data + 20 EC trained on SRE08 dev data
NIST SRE 200820/24 Training additional eigenchannels on SRE08 dev data – primary fusion det1: int-int det2: int-int same mic det3: int-int diff. mic det4: int-phn det5: phn-mic det6: tel-tel det7: tel-tel Eng. det8: tel-tel native primary submission + 20 EC trained on SRE08 dev data 20 eigenchannels trained on SRE08 dev data are added to the two FA subsystems The improvement generalizes also to non-interview microphone data (det5)
NIST SRE 200821/24 Other techniques that did not make it to the submission The following techniques were also tried, which did not improve the fusion performance –GMM with eigenchannel adaptation –SVM-GMM 2048 + NAP –SVM-GMM 2048 + ISV (Inter Session Variability modeling) –SVM-GMM 2048 + ISV derivative (Fisher) kernel –SVM-GMM 2048 + IVS based on FA-MFCC13 39 –FA modeling prosodic and cepstral contours –SVM on phonotactics – counts from Binary decision trees –SVM on soft bigram statistics collected on cumulated posteriograms (matrix of posterior probability of phonemes for each frame) See our poster for more details
NIST SRE 200822/24 Conclusions FA systems build according to recipe from [Kenny2008] performs excellently, though there is still some mystery to be solved. It was hard to find another complementary system that would contribute to fusion of our two FA system. Although our system was primarily trained on and tuned for telephone data, FA subsystems can be simply augmented with eigenchannels trained on microphone data (as also proposed in [Kenny2008]), which makes the system performing well also on microphone conditions. Another significant improvement was obtained by training additional eigenchannels on data with matching channels condition, even thought there was very limited amount of such data provided by NIST.
NIST SRE 200823/24 Thanks To Patrick Kenny for [Kenny2008] recipe to building FA system that really works and for providing list of files for training FA system MIT-LL for creating and sharing the trial lists based on SRE06 data, which we used for the system development Niko Bummer for FoCal Bilinear, which allowed us to start playing with the fusion just the last day before the submission deadline
NIST SRE 200824/24 References [Kenny2008] P. Kenny et al.: A Study of Inter-Speaker Variability in Speaker Verification IEEE TASLP, July 2008. [Brummer2008] N. Brummer: FoCal Bilinear: Tools for detector fusion and calibration, with use of side-information http://niko.brummer.googlepages.com/focalbilinear http://niko.brummer.googlepages.com/focalbilinear [Chang2001] C. Chang et al.: LIBSVM: a library for Support Vector Machines, http://www.csie.ntu.edu.tw/~cjlin/libsvmhttp://www.csie.ntu.edu.tw/~cjlin/libsvm [Stolcke2005/6] A. Stolcke: MLLR Transforms as Features in SpkID, Eurospeech 2005, Odyssey 2006 [Hain2005] T. Hain et al.: The 2005 AMI system for RTS, Meeting Recognition Evaluation Workshop, Edinburgh, July 2005.
NIST SRE 200826/24NIST SRE 2008 Inter-session variability High inter-session variability High inter-speaker variability UBM Target speaker model
NIST SRE 200827/24NIST SRE 2008 Inter-session compensation High inter-session variability UBM Target speaker model Test data For recognition, move both models along the high inter-session variability direction(s) to fit well the test data High inter-speaker variability
NIST SRE 200828/24NIST SRE 2008 LVCSR for SVM-MLLR system LVCSR system is adapted to speaker (VTLN factor and (C)MLLR transformations are estimated) using ASR transcriptions provided by NIST AMI 2005(6) LVCSR – state of the art system for the recognition of spontaneous speech [Hain2005]: – 50k word dictionary (pronunciations of OOVs were generated by grapheme to phoneme conversion based on rules trained from data) – PLP, HLDA – CD-HMM with 7500 tied-states each modeled by 18 Gaussians – Discriminatively trained using MPE – Adapted to speaker: VTLN, SAT based on CMLLR, MLLR