1 2009 NIST Language Recognition Systems Yan SONG, Bing Xu, Qiang FU, Yanhua LONG, Wenhui LEI, Yin XU, Haibing ZHONG, Lirong DAI USTC-iFlytek Speech Group

2 Outline
Data Preprocessing
System Overview
Processing Time
Result & Analysis


4 Statistics of Training Data (hours)
Columns: VOA (VOA2, VOA3; <35s / >35s splits) and CTS (CallFriend, 05 Full Conversation, 07 New Train, SRE); values appear in that order.
Amharic: 28.1 | 249.3
Bosnian: 18.5 | 14.4
Cantonese: 212.2 | 5.4 | 15.2 | 6.7
Creole: 25.7 | 38.4
Croatian: 4.3 | 20.8
Dari: 578.8 | 251.7
English (ttam): 57.8 | 99.5
English (Afric): 309.6 | 281.7
English (A): 60 | 29
English (I): 14.7
Farsi: 91 | 30
French: 144.4 | 255.4 | 30
Georgian: 20.2 | 23.1
Hausa: 137.2 | 110.9

5 Statistics of Training Data (hours, continued)
Hindi: 13.9 | 29.4 | 30 | 12
Korean: 3.4 | 14.5 | 10.4 | 17.3 | 30 | 27
Mandarin: 0 | 129.3 | 60 | 54
Pashto: 932.1 | 862.3
Portuguese: 14.4 | 98.7
Russian: 309.6 | 790.2 | 6.7 | >12.5
Spanish: 60.8 | 199.7 | 60 | 21.7
Turkish: 8.9 | 40
Ukrainian: 29.2 | 10
Urdu: 129.7 | 203 | 6.7
Vietnamese: 6.7 | 32.5 | 30

6 Conclusions from the Training Data Statistics
The VOA data contains many duplicated recordings.
The amount of data varies greatly across languages.
There are 12 languages with CTS data and 21 languages with VOA data.
Cross-testing between VOA and CTS needs to be considered.

7 Data Preprocessing
Training-data selection by analyzing the outputs of a phone recognizer.
Factor Analysis for channel compensation; different channel-compensation schemes between VOA and CTS are tried.
Energy-based VAD for speech segmentation and utterance-duration estimation.
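
The energy-based VAD above can be sketched as follows; this is a minimal illustration, and the frame length, hop, and the -30 dB threshold are assumed values, not taken from the slides:

```python
import numpy as np

def energy_vad(signal, frame_len=400, hop=160, threshold_db=-30.0):
    """Mark a frame as speech when its log energy is within
    `threshold_db` of the loudest frame (assumed threshold)."""
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    energies = np.array([
        np.sum(signal[i * hop : i * hop + frame_len] ** 2.0)
        for i in range(n_frames)
    ])
    log_e = 10.0 * np.log10(energies + 1e-12)
    return log_e > (log_e.max() + threshold_db)

def speech_duration(signal, sample_rate=16000, hop=160, **kw):
    """Estimate utterance duration as (speech frames) * hop / rate."""
    flags = energy_vad(signal, hop=hop, **kw)
    return flags.sum() * hop / sample_rate
```

Applied to one second of silence followed by one second of a tone, the silent frames are rejected and the estimated speech duration is about one second.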

8 Channel Compensation Method: Factor Analysis
U-space training:
- CTS U space trained on the CTS data (2048*56*30 dimensions)
- VOA U space trained on the VOA data (2048*56*30 dimensions)
Feature extraction:
- Merge the CTS and VOA U spaces into a 2048*56*60-dimensional space
- Use the fused UBM (trained on CTS & VOA data)

9 Factor Analysis
[Diagram: CTS training data yields the CTS U space and VOA training data yields the VOA U space (2048*56*30 each); the two U spaces are merged into a 2048*56*60 space, and features are extracted with the fused UBM to produce the CTS and VOA models.]
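
The pipeline in the diagram can be illustrated with a toy sketch; here the channel factor x is a plain least-squares estimate, whereas a real FA system uses count-weighted point estimates, and the dimensions are toy-sized rather than 2048*56:

```python
import numpy as np

def merge_u_spaces(u_cts, u_voa):
    """Concatenate the CTS and VOA channel-loading matrices column-wise,
    e.g. two (D, 30) matrices become one (D, 60) matrix as on the slide."""
    return np.hstack([u_cts, u_voa])

def compensate(supervector, u):
    """Remove the estimated channel component U @ x from a GMM mean
    supervector.  x is estimated by least squares here; the residual is
    orthogonal to the channel subspace spanned by U's columns."""
    x, *_ = np.linalg.lstsq(u, supervector, rcond=None)
    return supervector - u @ x
```

After compensation the supervector carries no component along the channel subspace, which is the property the merged U space is meant to provide across both channels.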

10 Evaluation of Performance on CTS and VOA
Motivation:
- Evaluate how the channel mismatch between CTS and VOA affects language recognition performance
- Evaluate the performance of the different Factor Analysis (FA) variants: CTS_FA, VOA_FA, and FUSE_FA
Experimental configuration:
- A small development subset of 6 languages selected from both CTS and VOA
- Language identifier: GMM-SVM
- Utterance duration: 30s evaluation condition

11 Experiment Result & Conclusion: VOA Test (EER % per language; All = EER/DCF)

FA        | Model | Farsi | French | Hindi | Russian | Spanish | Urdu | All EER/DCF
VOA_FA    | VOA   | 0.20  | 1.00   | 4.8   | 0.00    | 0.1     | 6.00 | 2.16/2.15
CTS_FA    | VOA   | 0.20  | 1.90   | 6.20  | 0.10    | 0.20    | 5.0  | 2.48/2.36
VOA_FA    | CTS   | 3.0   | 8.9    | 14.9  | 1.0     | -       | 17.1 | 7.66/6.65
CTS_FA    | CTS   | 2.0   | 11.0   | 16.1  | 2.90    | 1.2     | 16.2 | 7.8/6.88
VOA_FA    | FUSE  | 0.10  | 2.0    | 8.0   | 0.0     | 0.20    | 8.0  | 3.01/2.61
CTS_FA    | FUSE  | 0.00  | 2.00   | 7.8   | 1.0     | 0.0     | 6.2  | 2.83/2.83
ALLCTS_FA | FUSE  | 0.00  | 2.00   | 7.00  | 0.8     | 0.00    | 6.10 | 2.83/2.61
ALLVOA_FA | FUSE  |       |        |       |         |         |      | 2.33/2.21

Notes:
- FUSE: training data CTS+VOA.
- ALLVOA_FA: all data use the VOA-trained FA spaces.
- ALLCTS_FA: all data use the CTS-trained FA spaces.

12 Experiment Result & Conclusion: CTS Test (EER % per language; All = EER/DCF)

FA        | Model | Farsi | French | Hindi | Russian | Spanish | Urdu  | All EER/DCF
CTS_FA    | CTS   | 0.97  | 0.00   | 11.1  | 1.79    | 1.63    | 11.25 | 4.87/4.61
VOA_FA    | CTS   | 0.06  | 0.00   | 11.79 | 1.25    | -       | 11.45 | 4.62/4.55
VOA_FA    | VOA   | 1.25  | 7.5    | 11.3  | 1.87    | 4.22    | 15.0  | 6.87/6.53
CTS_FA    | VOA   | 2.5   | 6.25   | 12.0  | 2.57    | 2.97    | 18.75 | 7.23/6.40
CTS_FA    | FUSE  | 0.06  | 0.2    | 10.5  | 1.17    | 1.25    | 15.0  | 4.62/4.25
VOA_FA    | FUSE  | 0.00  | 1.25   | 9.53  | 0.78    | 0.86    | 15.00 | 4.62/4.12
ALLCTS_FA | FUSE  | 0.97  | 0.06   | 10.15 | 1.25    | 0.86    | 13.75 | 4.86/4.32
ALLVOA_FA | FUSE  |       |        |       |         |         |       | 4.61/4.42

Conclusions:
- VOA_FA outperforms FA spaces trained on CTS data.
- Fusing CTS and VOA data compensates for the mismatch in CTS/VOA cross-tests.
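
The EER figures in these tables can be computed from detection scores as in this minimal sketch; the scores below are made up for illustration and are not the system's:

```python
import numpy as np

def eer(target_scores, nontarget_scores):
    """Equal error rate: sweep thresholds over all observed scores and
    return the operating point where the miss rate and the false-alarm
    rate are closest (their average is reported as the EER)."""
    scores = np.concatenate([target_scores, nontarget_scores])
    best_gap, best_eer = 1.0, None
    for t in np.sort(scores):
        p_miss = np.mean(np.asarray(target_scores) < t)
        p_fa = np.mean(np.asarray(nontarget_scores) >= t)
        gap = abs(p_miss - p_fa)
        if gap < best_gap:
            best_gap, best_eer = gap, (p_miss + p_fa) / 2.0
    return best_eer
```

With perfectly separated scores the EER is 0; overlapping score distributions push it up, which is what the cross-channel rows in the tables show.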

13 13 Outline Data Preprocessing System Overview Processing Time Result & Analysis

14 IFLY 2009 NIST LRE System Overview

15 GSV: GMM-Supervector Spectral System
GSV_1:
- GMM mean kernel
- MVP used
- CTS and VOA models are trained separately: 12 models for CTS data, 21 models for VOA data
GSV_2:
- GMM mean kernel
- 23 fused language models covering both CTS and VOA data
The two subsystems are linearly fused.
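
The linear fusion of the two subsystems can be sketched as a weighted sum of their per-language score matrices; the weights below are hypothetical, whereas real fusion weights would be tuned on development data:

```python
import numpy as np

def linear_fuse(score_mats, weights):
    """Fuse per-system score matrices (utterances x languages) as a
    weighted sum; weights are assumed, normally trained on dev data."""
    return sum(w * s for w, s in zip(weights, score_mats))

def decide(fused, languages):
    """Pick the highest-scoring language for each utterance."""
    return [languages[i] for i in np.argmax(fused, axis=1)]
```
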

16 GMM-MMI Subsystem
MLE baseline:
- 2048-mixture GMM models
- 12 CTS models and 21 VOA models, according to our training data
MMI:
- Model parameters updated for 3 iterations
- Different acoustic models of the same language are treated as the same model

17 Example of Modification for MMI Training
[Figure: original net configuration vs. modified net configuration]

18 Phone Recognizers
Four phone recognizers were used:
- Hungarian phone recognizer by BUT
- Russian phone recognizer by BUT
- BUT-style (hybrid TRAP-NN) Chinese phone recognizer
- English phone recognizer based on GMM-HMM triphone models
Lattice decoding:
- Lattices were generated with the HTK toolkit
- N-gram expectations were calculated once and shared by all models
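
N-gram counting over phone sequences can be sketched as below; the real system accumulates expected counts over lattice paths weighted by posterior, and the 1-best counting shown here is the special case where all posterior mass sits on a single path:

```python
from collections import Counter

def ngram_counts(phones, n=3):
    """Count n-grams in a 1-best phone sequence.  Lattice-based
    expected counts reduce to this when one path has posterior 1."""
    return Counter(tuple(phones[i:i + n]) for i in range(len(phones) - n + 1))
```
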

19 PR4G: Phone Recognition with 4-gram Language Model
Key features:
- Both kinds of models are interpolated from a language-independent UBM model
- CTS and VOA models are trained separately, so there are 33 models in all
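
The interpolation of a language model from the language-independent UBM model can be sketched as a convex mix of each n-gram's maximum-likelihood estimate and its UBM probability; the interpolation weight lam is an assumed value, not from the slides:

```python
import math

def interpolate_lm(lang_counts, ubm_probs, lam=0.8):
    """Adapt a language LM from the UBM: each n-gram probability is a
    convex mix of the language's ML estimate and the UBM probability.
    `lam` is an assumed weight."""
    total = sum(lang_counts.values())
    return {g: lam * lang_counts.get(g, 0) / total + (1 - lam) * p
            for g, p in ubm_probs.items()}

def loglik(test_counts, lm):
    """Log-likelihood of observed n-gram counts under the adapted LM,
    used to score a test utterance against each language."""
    return sum(c * math.log(lm[g]) for g, c in test_counts.items())
```

Because the mix is convex, the adapted probabilities still sum to one over a shared vocabulary, and sparse languages fall back gracefully toward the UBM.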

20 PR4G: Phone Recognition with 4-gram Language Model
[Figure: illustration of the adaptation (interpolation) of the language models]

21 PRSVM: Phone Recognition with SVM
Key features:
- Bag-of-3-gram features were used
- N-grams with a frequency below 0.02% were discarded
- SVM models are trained on tokens from both VOA and CTS data
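
Building the bag-of-3-gram features with the 0.02% frequency pruning can be sketched as follows; this is a minimal illustration of the feature extraction only, not of the SVM training itself:

```python
from collections import Counter

def build_vocab(token_seqs, n=3, min_freq=0.0002):
    """Keep n-grams whose corpus-wide relative frequency is at least
    min_freq (0.02%, as on the slide); rarer n-grams are discarded."""
    counts = Counter()
    for seq in token_seqs:
        counts.update(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))
    total = sum(counts.values())
    return sorted(g for g, c in counts.items() if c / total >= min_freq)

def bag_of_ngrams(seq, vocab, n=3):
    """Relative-frequency feature vector over the pruned vocabulary,
    suitable as input to a linear SVM."""
    counts = Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))
    total = sum(counts.values()) or 1
    return [counts[g] / total for g in vocab]
```
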

22 Outline
Data Preprocessing
System Overview
Processing Time
Result & Analysis

23 System Processing Time (xRT)

Subsystem                  | xRT
GMM-MMI                    | 0.16
GMM-SVM                    | 2
Phone recognizer (NN/HMM)  | 6
Phone recognizer (GMM/HMM) | 15

Note: the GMM/HMM PR system can be removed from our fused system.
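
xRT (real-time factor) is processing time divided by audio duration, so the GMM-MMI subsystem at 0.16 xRT needs roughly ten minutes per hour of audio. A trivial sketch:

```python
def xrt(processing_seconds, audio_seconds):
    """Real-time factor: values below 1 mean faster than real time."""
    return processing_seconds / audio_seconds

def processing_time(audio_seconds, xrt_factor):
    """Expected wall-clock time for a subsystem at a given xRT."""
    return audio_seconds * xrt_factor
```
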

24 Outline
Data Preprocessing
System Overview
Processing Time
Result & Analysis

25 Result & Analysis 1: Comparison of the submitted DCF and the optimal DCF for the 30s evaluation

26 Result & Analysis 2: Comparison of the submitted DCF and the optimal DCF for the 10s evaluation

27 Result & Analysis 3: Comparison of the submitted DCF and the optimal DCF for the 3s evaluation

28 Result & Analysis 4
The submitted decisions are correct for most evaluation utterances. For certain languages, such as Bosnian and Ukrainian, the decision cost is high, which hurts our final result. With the provided keys, the optimal decision cost is much lower.
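
The decision costs discussed here follow the NIST detection cost model. A minimal sketch with the LRE 2009 application parameters (C_miss = C_fa = 1, P_target = 0.5), simplified from the official pair-wise average cost:

```python
def detection_cost(p_miss, p_fa, c_miss=1.0, c_fa=1.0, p_target=0.5):
    """Per-language detection cost: a weighted sum of the miss and
    false-alarm probabilities under the application parameters."""
    return c_miss * p_target * p_miss + c_fa * (1 - p_target) * p_fa

def average_cost(per_language_costs):
    """Average the per-language costs, as in the evaluation's C_avg."""
    return sum(per_language_costs) / len(per_language_costs)
```

A language with high miss or false-alarm rates (as noted above for Bosnian and Ukrainian) raises its per-language cost and drags up the average.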

29 Future Work
How to fuse the acoustic and phonetic systems more effectively.
How to determine the threshold for making decisions.
How to improve recognition performance on short-duration utterances.

30 Thank You!

