1 LID/SID - Research Stay at BUT: Last Presentation
Luis Fernando D’Haro, Polytechnical University of Madrid
Granted by the “José Castillejo” fellowship, Education Ministry - Spanish Government
February 20th, 2012

2 Outline
- Research stay goals
- Work on phonotactic LID
  - Discriminative n-grams
  - New phonotactic system using i-vectors and a multinomial subspace model
- Work on LID-RATS
  - VAD and LID
- Future work

3 Research Stay Goals
- To work with the most recent techniques for LID, such as:
  - i-vectors, sGMM, WCCN, score calibration/fusion
- To test our ranking templates and discriminative n-gram selection approach together with the acoustic i-vector system for the LID task. Ideas:
  - Fusion of scores
  - Selection of discriminative n-grams
- Collaboration on current BUT campaigns
  - RATS, LRE, SRE
- Publications

4 Work on Phonotactic LID
- LID based on ranking positions and distance
- Original idea: [Cavnar and Trenkle, 1994]
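As a rough illustration of the original idea, the sketch below builds a ranked n-gram profile per language and scores a test utterance with the out-of-place distance of [Cavnar and Trenkle, 1994]: the sum of rank differences, with a fixed maximum penalty for n-grams missing from the language profile. It is a minimal sketch, not the system used in these experiments; the function names, the profile size `top_k=300`, using the profile length as the maximum penalty, and working with a single n-gram order are all assumptions. Tokens would be phone labels from a phone recognizer.

```python
from collections import Counter


def ngram_profile(tokens, n, top_k=300):
    """Return the top_k most frequent n-grams of order n, most frequent first."""
    grams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    # Sort by descending count; break ties lexicographically so ranks are deterministic.
    ranked = sorted(grams.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_k]]


def out_of_place(doc_profile, lang_profile):
    """Out-of-place distance: sum of rank differences between the two profiles."""
    lang_rank = {gram: r for r, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)  # assumed penalty for n-grams absent from the language
    dist = 0
    for r, gram in enumerate(doc_profile):
        dist += abs(r - lang_rank[gram]) if gram in lang_rank else max_penalty
    return dist


def classify(tokens, lang_profiles, n=3):
    """Pick the language whose profile is closest to the test utterance."""
    doc = ngram_profile(tokens, n)
    return min(lang_profiles, key=lambda lang: out_of_place(doc, lang_profiles[lang]))
```

A smaller distance means the test utterance ranks its n-grams more like the language profile does.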

5 Improvements to the Ranking Approach
- One ranking for each n-gram order
- Golf position
  - All n-grams with the same number of occurrences share the same position in the ranking
- Discriminative positions in the ranking
  - Put the most relevant n-grams for each language in the highest positions of the ranking, i.e. n-grams that are very frequent in one language but not in the others
  - A new formula inspired by tf-idf, providing normalized scores (between -1 and 1)
- Advantages: high-order n-grams (up to 5-grams)
- More details in [Caraballo et al, 2010]
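A minimal sketch of the "golf position" idea, assuming it follows golf leaderboards (standard competition ranking): n-grams with the same count share a position, and the next distinct count skips ahead by the number of tied entries. The slide only states that tied n-grams share a position, so the skipping behaviour is an assumption. The tf-idf-inspired discriminative re-ranking is not reproduced because its formula is not given here.

```python
def golf_ranks(counts):
    """Map each n-gram to its 'golf' position (1 = most frequent).

    Ties share a position and the next distinct count skips ahead,
    e.g. counts 9, 7, 7, 3 -> positions 1, 2, 2, 4 (assumed convention).
    """
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    ranks = {}
    prev_count = None
    position = 0
    for index, (gram, count) in enumerate(ordered, start=1):
        if count != prev_count:
            position = index  # a new count value starts at its 1-based list index
            prev_count = count
        ranks[gram] = position
    return ranks
```

Following the first bullet, one such ranking would be built separately for each n-gram order.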

6 Experiments on LRE09
- Baseline: phonotactic PCA [Mikolov et al, 2010]
  - Uses n-gram soft counts from different phone recognizers
- Our system uses only the normalized scores generated by the system, not the classifier
  - Our baseline classifier, based on the distance among languages, did not work well
- Approaches:
  - Comparison/fusion with the PCA system
  - Fusion with the acoustic i-vector system (400 i-vectors, 2048 Gaussians)
  - Selection of discriminative n-grams. Goal: reduce the input vector of n-gram soft counts
- Database:
  - Train: 9763 segments (345 hours, ~500 utt. per language)
  - Dev: 38134 segments from the 23 languages of LRE09
  - Test: 41545 segments

7 Comparison with Phonotactic PCA
- Baseline approach:
  - Feature vector: expected n-gram phoneme counts estimated from lattices
  - For all possible trigrams and the most frequent four-grams, e.g. (Hungarian phone recognizer):
    - 3-grams: 33^3 = 35 937
    - 4-grams: 33^4 = 1 185 921
  - Then PCA is applied to reduce the vector size (baseline: 1000)
- Discriminative approach
  - Original templates (up to 4-grams):
    - Engl: 45_2025_100K_200K
    - Russ: 47_2209_100K_200K
    - Hung: 33_1089_35K_200K
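The feature dimensions above follow from indexing every possible n-gram over the phone inventory. The sketch below, with hypothetical helper names and a hypothetical 33-phone inventory, maps an n-gram to an index in [0, P^n) by base-P encoding and accumulates counts in a sparse dictionary. For simplicity it uses hard counts from a 1-best phone string; the baseline instead uses expected counts from lattices, which would add each n-gram's posterior rather than 1.

```python
from collections import Counter


def ngram_index(gram, phone_to_id):
    """Map an n-gram of phones to a unique index in [0, P**n) via base-P encoding."""
    base = len(phone_to_id)
    idx = 0
    for phone in gram:
        idx = idx * base + phone_to_id[phone]
    return idx


def ngram_count_vector(phones, phone_to_id, n):
    """Sparse {index: count} vector over all P**n possible n-grams.

    Hard counts from a 1-best phone string (a simplification of lattice soft counts).
    """
    counts = Counter()
    for i in range(len(phones) - n + 1):
        counts[ngram_index(phones[i:i + n], phone_to_id)] += 1
    return counts


# Hypothetical 33-phone inventory, the same size as the Hungarian phone set on the slide.
inventory = [f"p{i}" for i in range(33)]
phone_to_id = {phone: i for i, phone in enumerate(inventory)}
trigram_dim = len(phone_to_id) ** 3   # 35 937 possible trigrams
fourgram_dim = len(phone_to_id) ** 4  # 1 185 921 possible 4-grams
```

Keeping the vector sparse matters because almost all of the 4-gram dimensions are zero for any single file; PCA is then applied to these (dense, stacked) vectors.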

8 Results*
- No good results in almost all cases
- Big difference compared with the baseline, which uses only 3-grams and PCA

Cavg (%)                      | 30 s  | 10 s  | 3 s
Baseline_3g_Hung_PCA1000*     | 4.58  | 12.61 | 25.69
DiscriminativeRanking_Engl    | 8.89  | 13.97 | 26.17
DiscriminativeRanking_Russ    | 9.59  | 12.96 | 24.21
DiscriminativeRanking_Hung    | 10.73 | 14.93 | 26.45

* We had problems reproducing the results reported in the paper.
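The Cavg values in these tables are percentages. For reference, the sketch below computes the average detection cost in the form used by the NIST LRE evaluation plans, with C_miss = C_FA = 1 and P_target = 0.5, from hard accept/reject decisions for every (segment, target language) pair. The data layout and function name are assumptions; multiply the result by 100 to compare with the tables.

```python
def cavg(trials, languages, p_target=0.5, c_miss=1.0, c_fa=1.0):
    """Average detection cost (NIST LRE style), returned as a fraction.

    trials: list of (true_language, decisions), where decisions[lang] is True
    if the system accepted `lang` as the language of that segment.
    Every language must have at least one segment.
    """
    n_langs = len(languages)
    p_nontarget = (1.0 - p_target) / (n_langs - 1)
    by_lang = {lang: [d for true, d in trials if true == lang] for lang in languages}
    total = 0.0
    for target in languages:
        # Miss rate: segments of the target language rejected for that language.
        segs = by_lang[target]
        p_miss = sum(not d[target] for d in segs) / len(segs)
        cost = c_miss * p_target * p_miss
        for other in languages:
            if other == target:
                continue
            # False-alarm rate: segments of `other` wrongly accepted as the target.
            segs = by_lang[other]
            p_fa = sum(d[target] for d in segs) / len(segs)
            cost += c_fa * p_nontarget * p_fa
        total += cost
    return total / n_langs
```

Because false alarms are averaged over the N_L - 1 non-target languages, a system that accepts every language for every segment scores 0.5 (50%), not 1.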

9 Selection of Discriminative n-grams
- Goal: help PCA reduce the size of the feature vector by first selecting the most discriminative n-grams and then applying PCA
  - Reducing from 35K to approx. 8K for 3-grams
  - Using 16K 4-grams instead of the 80K most frequent ones [Mikolov et al, 2010], concatenated with the 8K trigrams
  - Selection based on discriminability among all languages
  - We also tried using probabilities instead of vectors of counts
- Fusion with acoustic i-vector systems
  - 600 i-vectors + 2048 Gaussians
  - Cavg for baseline i-vectors: 30 s: 2.40%, 10 s: 4.93%, 3 s: 14.04%
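The slides do not specify how the probabilities are computed. One plausible reading, sketched below as an assumption, is to normalize the (soft) counts of each n-gram order to relative frequencies, so that a longer file with the same phone statistics yields the same feature vector. Normalizing the 3-gram and 4-gram blocks separately, and the function name, are also assumptions.

```python
from collections import Counter


def per_order_probs(counts):
    """Turn {ngram_tuple: count} into relative frequencies within each n-gram order.

    Each order (e.g. the 3-gram and the 4-gram block) is normalized separately,
    so every block sums to 1 regardless of how long the file was.
    """
    totals = Counter()
    for gram, count in counts.items():
        totals[len(gram)] += count
    return {gram: count / totals[len(gram)]
            for gram, count in counts.items() if totals[len(gram)] > 0}


short_file = {("a", "b", "c"): 2.0, ("b", "c", "a"): 2.0, ("a", "b", "c", "a"): 1.0}
# Same phone statistics, ten times longer: raw counts scale, probabilities do not.
long_file = {gram: 10 * count for gram, count in short_file.items()}
```

This is consistent with the observation on the conclusions slide that probabilities help because of the different lengths of the files.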

10 Results – Disc. Phonotactic System
[Chart comparing systems: BASE3G_1KPCA, 3gCounts_1KPCA, 3gProbs_1KPCA, 3g-4gCounts_1KPCA, 3g-4gProbs_1KPCA, Base+3gProbs_1KPCA]

11 Results – Disc. Phonotactic System + iVectors
[Chart comparing systems: BASE3G_1KPCA, 3gCounts_1KPCA, 3gProbs_1KPCA, 3g-4gCounts_1KPCA, 3g-4gProbs_1KPCA, Base+3gProbs_1KPCA, iVectors]

12 Conclusions (Phonotactic)
- For the template-based LID system, we need to find better solutions for score normalization
- Discriminative n-gram selection helps both the phonotactic PCA system and the i-vector system
- Better results using probabilities instead of counts, because of problems with the different lengths of the files
  - ToDo: test length normalization
- Find a better approach to selecting high-order n-grams
  - ToDo: use clusters of scores in the discriminative approach to be able to handle high-order n-grams (already implemented, but not tried this time)

13 New Phonotactic System
- Baseline: [Soufifar et al, 2011]
  - Uses n-gram soft counts from lattices
  - Uses subspace multinomial distributions to estimate i-vectors
  - Uses the i-vectors for classification with logistic regression (LIBLINEAR)
- Differences
  - Instead of n-gram soft counts, we use posterior-gram conditional counts
  - We use the original features, i-vectors, or PCA on the original features
  - We use multiclass logistic regression + length normalization
  - Results on bigrams and trigrams (no time for fine tuning)
- Same training, test and dev sets as for LRE09
- Fusion with the acoustic i-vector system
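Length normalization of i-vectors is commonly done by centering with the training mean and then projecting onto the unit sphere. The sketch below shows that basic form, without the whitening step that some recipes add; the slide does not say whether whitening was used, and the helper names are hypothetical.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors (e.g. training i-vectors)."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def length_normalize(x, mu):
    """Center an i-vector with the training mean, then scale it to unit L2 norm."""
    centered = [xi - mi for xi, mi in zip(x, mu)]
    norm = math.sqrt(sum(c * c for c in centered))
    return [c / norm for c in centered] if norm > 0 else centered


train = [[1.0, 2.0], [3.0, 4.0]]
mu = mean_vector(train)  # [2.0, 3.0]
x_ln = length_normalize([5.0, 7.0], mu)
```

The mean must come from the training set and be reused unchanged for dev and test vectors.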

14 Results: New Phonotactic i-vector System
Cavg (%)                                          | 30 s | Fusion | 10 s  | Fusion | 3 s   | Fusion
Baseline Ivector (600)                            | 2.40 | -      | 4.93  | -      | 14.04 | -
2g_Hu1089_originalFeat                            | 5.20 | 1.66   | 15.34 | 3.75   | 29.61 | 13.42
2g_Hu_1089toPCA100_MCLR                           | 5.36 | 1.70   | 14.12 | 3.69   | 27.44 | 13.18
2g_100iVector_MCLR_LN                             | 5.03 | 1.55   | 10.71 | 3.53   | 23.74 | 12.79
Mehdi’s 600 iVectors HU                           | 3.05 | -      | 8.10  | -      | 21.39 | -
Trig_600iVector_1089Multi_MultiClassLR_LengthNorm | 3.15 | 1.25   | 8.66  | 3.09   | 21.45 | 12.15
(Fusion: fused with the acoustic i-vector system)
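The MCLR / MultiClassLR entries in the slide 14 results refer to multiclass logistic regression. As a generic illustration, not the toolkit used here (the baseline used LIBLINEAR), the sketch below trains a softmax classifier by batch gradient descent on the mean cross-entropy plus an L2 penalty. The learning rate, number of epochs, penalty weight, and function names are arbitrary assumptions.

```python
import math


def softmax(z):
    """Numerically stable softmax of a list of scores."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def scores(W, b, x):
    """Linear class scores W x + b."""
    return [sum(w * xi for w, xi in zip(row, x)) + bk for row, bk in zip(W, b)]


def train_mclr(X, y, n_classes, lr=0.5, epochs=200, l2=1e-3):
    """Batch gradient descent on mean cross-entropy + (l2/2)*||W||^2.

    X: list of feature vectors (e.g. length-normalized i-vectors),
    y: list of integer class labels in [0, n_classes).
    """
    dim, n = len(X[0]), len(X)
    W = [[0.0] * dim for _ in range(n_classes)]
    b = [0.0] * n_classes
    for _ in range(epochs):
        gW = [[l2 * w for w in row] for row in W]  # gradient of the L2 penalty
        gb = [0.0] * n_classes
        for x, label in zip(X, y):
            p = softmax(scores(W, b, x))
            for k in range(n_classes):
                err = (p[k] - (1.0 if k == label else 0.0)) / n
                gb[k] += err
                for i in range(dim):
                    gW[k][i] += err * x[i]
        for k in range(n_classes):
            b[k] -= lr * gb[k]
            for i in range(dim):
                W[k][i] -= lr * gW[k][i]
    return W, b


def predict(W, b, x):
    """Index of the highest-scoring class."""
    s = scores(W, b, x)
    return max(range(len(s)), key=s.__getitem__)
```

In an LID setting each class is a language, and the softmax outputs (or the linear scores) would be passed on to calibration and fusion.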

15 Work on LID-RATS & VAD-RATS
- Goals:
  - Test different noise reduction and speech enhancement algorithms
  - Test different robust features
  - Test different BUT VADs
  - Combine with i-vectors
- Database
  - Eight noise conditions + clean data
  - Experiments on the 2-minute condition and the short list
  - Train: 3458 files (115 h)
  - Dev: 7331 files (244 h)

16 Work on LID-RATS: Noise Tools and Algorithms
- CtuCopy, developed at SpeechLab (FEE CTU, Prague)
  - Extended spectral subtraction [Sovka and Pollák, 1996]
  - Spectral subtraction with full-wave rectification
  - Using internal and external VAD (i.e. BUT-VAD)
- Wiener filter [Zavarehei, 2005]
- QIO Aurora front-end from OGI [QIO, 2009]
  - Internal NN VAD + CMN/CVN + RASTA-LDA + Wiener filter
- ETSI Advanced Front-End [ETSI, 2007]
  - 2-pass adaptive Wiener filter + internal VAD (uses energy information from the whole spectrum and F0 regions)
- Kalman filter [Murphy, 1998]
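To make the rectification variants concrete, the sketch below applies plain magnitude spectral subtraction to one frame, with half-wave rectification (clip negative bins at a floor) or full-wave rectification (take the absolute value). It also estimates the noise spectrum from frames that a VAD marks as non-speech and computes a per-bin Wiener gain SNR / (1 + SNR). It is a textbook sketch under these assumptions, not CtuCopy's extended spectral subtraction or the ETSI/QIO front-ends; names such as `alpha` and `floor` are illustrative.

```python
def estimate_noise(frames, is_speech):
    """Average magnitude spectrum over the frames a VAD marked as non-speech."""
    noise = [f for f, speech in zip(frames, is_speech) if not speech]
    return [sum(col) / len(noise) for col in zip(*noise)]


def spectral_subtraction(noisy_mag, noise_mag, alpha=1.0, full_wave=False, floor=0.0):
    """Magnitude spectral subtraction for one frame (lists of spectral bins)."""
    out = []
    for y, n in zip(noisy_mag, noise_mag):
        d = y - alpha * n
        if full_wave:
            d = abs(d)           # full-wave rectification: flip negative bins
        else:
            d = max(d, floor)    # half-wave rectification: clip at the floor
        out.append(d)
    return out


def wiener_gain(noisy_pow, noise_pow, eps=1e-12):
    """Per-bin Wiener gain SNR / (1 + SNR), with SNR estimated by power subtraction."""
    gains = []
    for y, n in zip(noisy_pow, noise_pow):
        snr = max(y - n, 0.0) / max(n, eps)
        gains.append(snr / (1.0 + snr))
    return gains
```

The quality of the noise estimate depends directly on the VAD, which is why the choice of internal versus external VAD matters for these methods.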

17 Work on LID-RATS: Common and New Features
- MFCC/PLP + delta and delta-delta
- PNCC: proposed by [Kim and Stern, 2010] at CMU
- Spectral delta-delta: proposed by [Kumar et al, 2011] at CMU
- SDC: shifted delta cepstra
- RPLP: proposed by [Rajnoha and Pollák, 2011] at SpeechLab, FEE CTU Prague
  - Hybrid between MFCC and PLP
- Tests with/without RASTA, VTLN, CMN/CVN
- Test new positions of the filterbank
  - After studying the spectrogram and the effects of noise reduction
  - woNR: 300-3200 Hz, wNR: 500-3000 Hz
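Shifted delta cepstra are parameterized as N-d-P-k: N cepstral coefficients, delta spread d, shift P between blocks, and k stacked blocks, where block i of frame t is c(t + iP + d) - c(t + iP - d) for i = 0, ..., k-1. The "7SDC" in the baseline presumably refers to the common 7-1-3-7 configuration, which the sketch below uses as its default. Clamping frame indices at the utterance edges is an assumption.

```python
def sdc(cepstra, N=7, d=1, P=3, k=7):
    """Shifted delta cepstra with parameters N-d-P-k.

    cepstra: list of frames, each a list of at least N coefficients.
    Returns one vector of N*k values per frame. Frame indices outside the
    utterance are clamped to the first/last frame (an assumption).
    """
    T = len(cepstra)

    def frame(t):
        return cepstra[min(max(t, 0), T - 1)][:N]

    out = []
    for t in range(T):
        vec = []
        for i in range(k):
            plus = frame(t + i * P + d)
            minus = frame(t + i * P - d)
            vec.extend(p - m for p, m in zip(plus, minus))
        out.append(vec)
    return out
```

With 7-1-3-7, each frame gets 49 SDC values, usually appended to the 7 static coefficients to give a 56-dimensional feature.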

18 Work on LID-RATS (120 s)
System without noise reduction                                          | VAD3
Baseline: 7 MFCC + CMN/CVN + RASTA + 7 SDC + VTLN                       | 1.60
15 RPLP + CMN/CVN + RASTA + 7 SDC                                       | 1.49
15 PNCC + delta-delta + CMN/CVN + RASTA                                 | 2.17
Baseline with spectral DD instead of SDC or delta-delta                 | 2.48

System with noise reduction                                             | VAD1
Baseline: 7 MFCC + CMN/CVN + RASTA + 7 SDC + VTLN                       | 2.03
Baseline + extended spectral subtraction                                | 2.75
Baseline + spectral subtraction with full-wave rectification + BUT-VAD | 3.31
Baseline + Wiener filter                                                | 9.24
Baseline + QIO                                                          | 2.09

19 Conclusions (RATS-LID)
- No improvement at all when using de-noising techniques
  - The QIO toolkit provided the best result
- Important improvements from correctly selecting the low and high frequency bands
- RPLP: new robust features for LID
- PNCC: promising features for LID, but training time is high
- Spectral delta-delta is slightly better than traditional delta-deltas, but not better than SDC
- RASTA and CMN/CVN are absolutely necessary for high performance
  - Short-term CMN/CVN did not provide better results
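The difference between utterance-level and short-term CMN/CVN lies only in the span over which the mean and standard deviation are estimated. The sketch below supports both: with `window=None` the statistics cover the whole utterance, otherwise a centred sliding window of `window` frames is used. The window shape, its truncation at the edges, and the `eps` guard are assumptions.

```python
import math


def _stats(segment, dim):
    """Per-dimension mean and standard deviation over a list of frames."""
    n = len(segment)
    means = [sum(f[i] for f in segment) / n for i in range(dim)]
    stds = [math.sqrt(sum((f[i] - means[i]) ** 2 for f in segment) / n) for i in range(dim)]
    return means, stds


def cmvn(frames, window=None, eps=1e-10):
    """Cepstral mean and variance normalization.

    window=None: utterance-level statistics.
    window=W: short-term statistics over a centred window of W frames,
    truncated at the utterance edges.
    """
    T, dim = len(frames), len(frames[0])
    if window is None:
        means, stds = _stats(frames, dim)
    out = []
    for t in range(T):
        if window is not None:
            half = window // 2
            means, stds = _stats(frames[max(0, t - half):min(T, t + half + 1)], dim)
        out.append([(frames[t][i] - means[i]) / (stds[i] + eps) for i in range(dim)])
    return out
```

A short window adapts faster to changing channel and noise conditions but also removes more of the long-term spectral information that distinguishes languages.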

20 Future Work
- Discriminative n-grams
  - New techniques for working with higher n-gram orders
  - Better combination of information from parallel phoneme recognizers
  - Write a joint paper based on LRE09
- Phonotactic i-vector: promising results
  - Check combination of parallel phone recognizers
  - Incorporate discriminative information
- LRE/SRE
  - Try collaborations in upcoming NIST evaluations


22 Bibliography I
Caraballo, M. A. et al. 2010. “A Discriminative Text Categorization Technique for Language Identification built into a PPRLM System”. FALA, pp. 193-196.
Cavnar, W. B. and Trenkle, J. M. 1994. “N-Gram-Based Text Categorization”. SDAIR-94, pp. 161-175.
ETSI: Advanced Front End V1.1.5. 2007. Available at http://www.etsi.org/WebSite/Technologies/DistributedSpeechRecognition.aspx
Kim, C. and Stern, R. M. 2010. “Feature extraction for robust speech recognition based on maximizing the sharpness of the power distribution and on power flooring”. ICASSP, pp. 4574-4577.
Mikolov et al. 2010. “PCA-based feature extraction for phonotactic language recognition”. Odyssey, pp. 251-255.
Murphy, K. 1998. “Kalman filter toolbox for Matlab”. Available at http://www.cs.ubc.ca/~murphyk/Software/Kalman/kalman.html

23 Bibliography II
Qualcomm-ICSI-OGI (QIO) Aurora front end. 2009. Available at ftp://ftp.icsi.berkeley.edu/pub/speech/papers/qio/aurora-front-end/
Rajnoha, J. and Pollák, P. 2011. “ASR systems in Noisy Environment: Analysis and Solutions for Increasing Noise Robustness”. Radioengineering, Vol. 20, No. 1, April 2011, pp. 74-84.
Soufifar, M. et al. 2011. “iVector approach to phonotactic language recognition”. Interspeech, pp. 2913-2916.
Sovka, P. and Pollák, P. 1996. “Extended spectral subtraction”. Eurospeech, pp. 963-966.
Zavarehei, E. 2005. Wiener filter implementation in Matlab. Available at http://www.mathworks.com/matlabcentral/fileexchange/7673-wiener-filter/content/WienerScalart96.m

24 Results - Discriminative Phonotactic System
Cavg (%)                                     | 30 s Phon. | 30 s +iVec | 10 s Phon. | 10 s +iVec | 3 s Phon. | 3 s +iVec
Baseline_3g_Hung + PCA(1000)*                | 4.58       | 1.54       | 12.61      | 3.52       | 25.69     | 12.67
Disc3g + PCA(1000) Counts                    | 5.50       | 1.72       | 14.31      | 3.95       | 27.13     | 13.68
Disc3g + PCA(1000) Probs                     | 4.83       | 1.60       | 10.33      | 3.69       | 21.75     | 12.69
Disc3g + Disc4g + PCA(1000) Counts           | 4.32       | 1.48       | 12.52      | 3.43       | 25.83     | 12.75
Disc3g + Disc4g + PCA(1000) Probs            | 5.43       | 1.65       | 11.50      | 3.77       | 22.98     | 12.80
Fusion: Baseline + 3Disc3g + PCA(1000) Probs | 3.48       | 1.48       | 8.49       | 3.49       | 20.58     | 12.41

25 Posterior-gram System

