1. Building A Highly Accurate Mandarin Speech Recognizer
Mei-Yuh Hwang, Gang Peng, Wen Wang (SRI), Arlo Faria (ICSI), Aaron Heidel (NTU), Mari Ostendorf
12/12/2007

2. Outline
- Goal: a highly accurate Mandarin ASR
- Baseline: System-2006
- Improvements
  - Acoustic segmentation
  - Two complementary, comparable systems
  - Language models and adaptation
  - More data
- Error analysis
- Future work

3. Background: System-2006
- 849M words of LM training text
- 60K-word lexicon
- Static 5-gram rescoring
- 465 hours of acoustic training data
- Two AMs (same phone-72 pronunciations):
  - MFCC+pitch (42-dim), SAT+fMPE, CW MPE, 3000x128 Gaussians
  - MFCC+MLP+pitch (74-dim), SAT+fMPE, nonCW MPE, 3000x64 Gaussians
- CER 18.4% on Eval06

4. 2007: Increased Training Data
- 870 hours of acoustic training data; 3500x128 Gaussians.
- 1.2G words of LM training text; trigrams and 4-grams.

         #bigrams  #trigrams  #4-grams  Dev07-IV perplexity
  LM3    58M       108M       ---       325.7
  qLM3   6M        3M         ---       379.8
  LM4    58M       316M       201M      297.8
  qLM4   19M       24M        6M        383.2

5. Acoustic Segmentation
- The former segmenter caused high deletion errors: it misclassified some speech segments as noise.
- Speech segments had a minimum duration of 18 x 30 ms = 540 ms, i.e. about 0.5 s.

  Vocabulary   Pronunciation
  speech       18+ fg
  noise        rej
  silence      bg

[Figure: segmenter HMM with Start/null and End/null nodes connecting speech, silence, and noise loops.]
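To make the minimum-duration constraint concrete, here is a minimal Python sketch (not the actual SRI segmenter code) of how a "speech = 18+ fg" pronunciation forces roughly 540 ms of speech; the lexicon layout and function names are illustrative.

```python
# Hypothetical sketch: expand an "at least N units" pronunciation so the
# segmenter HMM cannot emit a speech segment shorter than N units.

def min_duration_pron(unit: str, min_count: int) -> list[str]:
    """Force at least `min_count` copies of `unit`; in the real segmenter the
    final unit can self-loop, so this encodes '18+ fg' (>= 18 x 30 ms)."""
    return [unit] * min_count

# Segmenter vocabulary mirroring the table above (illustrative format).
SEGMENTER_LEXICON = {
    "speech":  min_duration_pron("fg", 18),  # >= 540 ms of speech
    "noise":   ["rej"],
    "silence": ["bg"],
}

if __name__ == "__main__":
    print(len(SEGMENTER_LEXICON["speech"]), "fg units =",
          18 * 30, "ms minimum speech duration")
```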

6. New Acoustic Segmenter
- Allow shorter speech durations.
- Model Mandarin vs. foreign (English) speech separately.

  Vocabulary   Pronunciation
  Mandarin1    I1 F
  Mandarin2    I2 F
  Foreign      forgn
  Noise        rej
  Silence      bg

[Figure: segmenter HMM with Start/null and End/null nodes connecting Mandarin1, Mandarin2, Foreign, silence, and noise loops.]

7. Improved Acoustic Segmentation
CER with a pruned trigram, SI nonCW-MLP MPE models, on Eval06:

  Segmenter   Sub   Del   Ins   Total
  Old         9.7   7.0   1.9   18.6
  New         9.9   6.4   2.0   18.3
  Oracle      9.5   6.8   1.8   18.1
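For readers unfamiliar with how the Sub/Del/Ins breakdown is obtained, here is a minimal Python sketch of character-level Levenshtein alignment; it is not the NIST scoring tool used for the official numbers, and the example strings are made up.

```python
def cer_breakdown(ref: str, hyp: str):
    """Character error rate with substitution/deletion/insertion breakdown,
    via a standard Levenshtein alignment."""
    r, h = list(ref), list(hyp)
    # dp[i][j] = (edit_cost, #sub, #del, #ins) aligning ref[:i] with hyp[:j]
    dp = [[None] * (len(h) + 1) for _ in range(len(r) + 1)]
    dp[0][0] = (0, 0, 0, 0)
    for i in range(1, len(r) + 1):
        c = dp[i - 1][0]
        dp[i][0] = (c[0] + 1, c[1], c[2] + 1, c[3])      # all deletions
    for j in range(1, len(h) + 1):
        c = dp[0][j - 1]
        dp[0][j] = (c[0] + 1, c[1], c[2], c[3] + 1)      # all insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = 0 if r[i - 1] == h[j - 1] else 1
            candidates = [
                (dp[i - 1][j - 1][0] + sub, dp[i - 1][j - 1], (sub, 0, 0)),
                (dp[i - 1][j][0] + 1,       dp[i - 1][j],     (0, 1, 0)),
                (dp[i][j - 1][0] + 1,       dp[i][j - 1],     (0, 0, 1)),
            ]
            cost, prev, (s, d, ins) = min(candidates, key=lambda x: x[0])
            dp[i][j] = (cost, prev[1] + s, prev[2] + d, prev[3] + ins)
    _, subs, dels, ins = dp[len(r)][len(h)]
    n = max(len(r), 1)
    return {"sub": subs / n, "del": dels / n, "ins": ins / n,
            "total": (subs + dels + ins) / n}

if __name__ == "__main__":
    # Made-up reference/hypothesis pair, just to show the breakdown.
    print(cer_breakdown("今天天气很好", "今天气不好"))
```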

8. Decoding Architecture
[Figure: decoding pipeline. Boxes: MLP nonCW with qLM3 (first pass); PLP CW SAT+fMPE with MLLR, LM3; MLP CW SAT with MLLR, LM3; qLM4 adaptation/rescoring; confusion network combination, which also takes the Aachen (RWTH) system output.]
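A schematic Python sketch of the data flow implied by this diagram; decode, mllr_adapt, rescore and cnc_combine are placeholders rather than real tool names, and the exact cross-adaptation order is my reading of slides 9 and 24, not a confirmed detail.

```python
# Schematic data flow only; every callable here is a stand-in for a real tool.

def run_pipeline(audio, decode, mllr_adapt, rescore, cnc_combine, aachen_hyps):
    # 1) Fast first pass: MLP nonCW acoustic model, pruned trigram qLM3.
    first = decode(audio, am="mlp_noncw", lm="qLM3")

    # 2) Cross-adaptation: adapt each big system on another system's hypotheses.
    plp = decode(audio, am=mllr_adapt("plp_cw_sat_fmpe", first), lm="LM3")
    mlp = decode(audio, am=mllr_adapt("mlp_cw_sat", plp), lm="LM3")

    # 3) Adapt/rescore the lattices with the (topic-adapted) quick 4-gram.
    plp = rescore(plp, lm="adapted_qLM4")
    mlp = rescore(mlp, lm="adapted_qLM4")

    # 4) Confusion network combination, optionally including RWTH (Aachen).
    return cnc_combine([plp, mlp, aachen_hyps])
```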

9. Two Sets of Acoustic Models
- For cross adaptation and system combination:
  - Different error behaviors
  - Similar error rate performance

            System-MLP            System-PLP
  Features  74 (MFCC+pitch+MLP)   42 (PLP+pitch)
  fMPE      no                    yes
  Phones    72                    81

10. MLP Phoneme Posterior Features
- Compute Tandem features with PLP+pitch input.
- Compute HATs features from 19 critical bands.
- Combine the Tandem and HATs posterior vectors into one.
- PCA of the log posteriors: 71 -> 32 dimensions.
- MFCC + pitch + MLP = 74-dim final feature.
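To make the dimensions concrete, here is a minimal numpy sketch of the per-frame combination. The averaging rule and the PCA matrix are assumptions (the slide gives only the dimensionalities), and the random inputs stand in for real Tandem/HATs posteriors and MFCC+pitch features.

```python
import numpy as np

N_PHONES, MLP_DIM, BASE_DIM = 71, 32, 42

def mlp_feature(tandem_post, hats_post, mfcc_pitch, pca_proj):
    """tandem_post, hats_post: (71,) phone posteriors; mfcc_pitch: (42,);
    pca_proj: (71, 32) PCA projection estimated on training data."""
    combined = 0.5 * (tandem_post + hats_post)    # merge the two MLP streams
    log_post = np.log(combined + 1e-10)           # floor to avoid log(0)
    reduced = log_post @ pca_proj                 # 71 -> 32
    return np.concatenate([mfcc_pitch, reduced])  # 42 + 32 = 74-dim

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tandem = rng.dirichlet(np.ones(N_PHONES))
    hats = rng.dirichlet(np.ones(N_PHONES))
    frame = mlp_feature(tandem, hats, rng.normal(size=BASE_DIM),
                        rng.normal(size=(N_PHONES, MLP_DIM)))
    print(frame.shape)  # (74,)
```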

11. Tandem Features [T1, T2, ..., T71]
- Input: 9 frames of PLP+pitch, i.e. PLP (39x9) plus pitch (3x9).
- MLP topology: (42x9) x 15000 x 71.

12. HATS Features [H1, H2, ..., H71]
- One MLP per critical-band energy stream (E1, E2, ..., E19), each 51x60x71.
- Merger MLP over the 19 band outputs: (60*19) x 8000 x 71.

13. MLP and Pitch Features
  HMM Feature                MLP Input        CER
  MFCC (39-dim)              none             24.1
  MFCC+F0 (42-dim)           none             21.4
  MFCC+F0+Tandem (74-dim)    PLP (39x9)       20.3
  MFCC+F0+Tandem (74-dim)    PLP+F0 (42x9)    19.7

nonCW ML, Hub4 training, MLLR, LM2, on Eval04.

14. Phone-81: Diphthongs for BC
- Add diphthongs (4x4 = 16) for fast speech and for modeling longer triphone context.
- Maintain unique syllabification.
- Syllable-ending W and Y not needed anymore.

  Example      Phone-72   Phone-81
  要 /yao4/    a4 W       aw4
  北 /bei3/    E3 Y       ey3
  有 /you3/    o3 W       ow3
  爱 /ai4/     a4 Y       ay4

15. Phone-81: Frequent Neutral Tones for BC
- Neutral tones are more common in conversation.
- Neutral tones were not modeled before; the 3rd tone was used as a replacement.
- Add 3 neutral-tone phones for frequent characters.

  Example     Phone-72   Phone-81
  了 /e5/     e3         e5
  吗 /ma5/    a3         a5
  子 /zi5/    i3         i5

16. Phone-81: Special CI Phones for BC
- Filled pauses (hmm, ah) common in BC; add two CI phones for them.
- Add CI /V/ for English.

  Example     Phone-72   Phone-81
  victory     w          V
  呃 /ah/     o3         fp_o
  嗯 /hmm/    e3 N       fp_en

17. Phone-81: Simplification of Other Phones
- Now 72+14+3+3 = 92 phones, too many triphones to model.
- Merge similar phones to reduce the number of triphones; e.g. I2 was modeled by I1, now by i2.
- 92 - (4x3-1) = 81 phones.

  Example     Phone-72   Phone-81
  安 /an1/    A1 N       a1 N
  词 /ci2/    I1         i2
  池 /chi2/   IH2        i2

18. Different Phone Sets
CER with a pruned trigram, SI nonCW-PLP ML models, on dev07:

             BN    BC     Avg
  Phone-81   7.6   27.3   18.9
  Phone-72   7.4   27.6   19.0

Indeed different error behaviors --- good for system combo.

19. PLP Models with fMPE Transform
- PLP model with an fMPE transform, to compete with the MLP model.
- Smaller ML-trained Gaussian posterior model: 3500x32, CW+SAT.
- h stacks the Gaussian posteriors of 5 neighboring frames.
- M is 42 x (3500*32*5); h is (3500*32*5) x 1.
- Ref: Zheng, ICASSP 2007.
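The transform itself is an additive, posterior-driven offset; here is a minimal numpy sketch with tiny toy dimensions (the real M is 42 x (3500*32*5) and is trained discriminatively per Zheng's paper; the random inputs below are placeholders).

```python
import numpy as np

def fmpe_transform(x, posteriors, M):
    """x: (42,) raw PLP+pitch frame; posteriors: (context, n_post) Gaussian
    posteriors for the neighboring frames; M: (42, context * n_post)."""
    h = posteriors.reshape(-1)     # stack context frames into one long vector
    return x + M @ h               # fMPE adds a learned offset to the feature

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy sizes so the sketch runs instantly; real sizes are context = 5 and
    # n_post = 3500 * 32.
    d, context, n_post = 42, 5, 10
    x = rng.normal(size=d)
    post = rng.dirichlet(np.ones(n_post), size=context)
    M = rng.normal(scale=0.01, size=(d, context * n_post))
    print(fmpe_transform(x, post, M).shape)  # (42,)
```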

20. Topic-based LM Adaptation
[Figure: Latent Dirichlet Allocation topic model applied to {w | w in the same story, within a 4-second window} around one sentence.]
- A 4 s window is used to make adaptation more robust against ASR errors.
- The words {w} are weighted based on distance.

21. Topic-based LM Adaptation
- Training: one topic per sentence.
- Train 64 topic-dependent LMs.
- Testing: top n topics per sentence, with weights from the neighboring 4 s of speech.
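A toy Python sketch of the test-time procedure on slides 20-21, under assumptions made explicit here: p_topic_given_word stands in for the trained LDA topic model, the 64 topic_lms are assumed to expose a prob() method, and the exponential distance weighting is illustrative (the slides only say words are weighted by distance).

```python
import math

def topic_mixture(window_words, p_topic_given_word, n_topics=64, top_n=4):
    """window_words: (word, distance_in_seconds) pairs from the ~4 s window.
    p_topic_given_word: dict word -> {topic_id: probability}."""
    mix = [0.0] * n_topics
    for word, dist in window_words:
        w = math.exp(-abs(dist))                 # closer words count more
        for k, p in p_topic_given_word.get(word, {}).items():
            mix[k] += w * p
    top = sorted(range(n_topics), key=lambda k: -mix[k])[:top_n]
    z = sum(mix[k] for k in top) or 1.0
    return {k: mix[k] / z for k in top}

def adapted_logprob(ngram, mixture, topic_lms):
    """Interpolate the topic-dependent LM probabilities of one n-gram."""
    p = sum(w * topic_lms[k].prob(ngram) for k, w in mixture.items())
    return math.log(max(p, 1e-12))

if __name__ == "__main__":
    class UniformLM:                             # stand-in for a real n-gram LM
        def __init__(self, p): self.p = p
        def prob(self, ngram): return self.p

    topic_lms = [UniformLM(1.0 / (k + 2)) for k in range(64)]
    p_t_w = {"股市": {3: 0.9, 7: 0.1}, "比赛": {12: 1.0}}
    mix = topic_mixture([("股市", 0.5), ("比赛", 3.0)], p_t_w)
    print(mix, adapted_logprob(("股市", "上涨"), mix, topic_lms))
```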

22. Topic-based LM Adaptation
- Is each LM_i still 60K words?
- Per-sentence adaptation?
- Computational cost?

23. LM Adaptation and CNC on Dev07
  Dev07          CW PLP   CW MLP   CNC
  LM3            12.0     11.9     ---
  LM4            11.9     11.7     11.4
  Adapted qLM4   11.7     11.4     11.2

CNC uses the 2 UW systems only.

24. LM Adaptation and CNC on Eval07
  AM (adapt. hyps)   PLP (MLP)   MLP (PLP)   MLP (Aachen)   PLP (Aachen)   Rover
  LM3                10.2        9.6         9.9            10.1           --
  qLM4               10.2        9.7         10.0           10.1           --
  LM4                10.0        9.6         9.8            10.0           9.1
  Adapted qLM4       9.7         9.3         9.6            9.7            8.9

25. Eval07
  Team       CER
  UW         9.1%
  RWTH       12.1%
  UW+RWTH    8.9%
  CU+BBN     9.4%
  IBM+CMU    9.8%

26. 2006 vs. 2007 on Eval07
                Sub   Del   Ins   Total
  2006 system   7.2   6.5   0.4   14.1
  2007 system   5.5   3.0   0.4   8.9

37% relative improvement!!

27. Progress
  Test set   2006    2007-06   2007-12
  Eval06     18.4%   15.3%     14.7%
  Dev07      ---     11.2%     9.6% *
  Eval07     14.1%   8.9%      ---

28. RWTH Demo
- UW acoustic segmenter.
- RWTH single-system ASR; foreign (Korean) speech skipped; misrecognitions highlighted.
- Manual sentence segmentation.
- Machine translation.
- Not real-time.

29. MT Error Analysis on Extreme Cases
  Snippet       Dur   CER     HTER
  a) Worst BN   87s   10.9%   47.73%
  b) Worst BC   72s   24.9%   48.37%
  c) Best BN    62s   0%      12.67%
  d) Best BC    77s   15.2%   14.20%

- CER is not directly related to HTER; genre matters.
- Better CER does ease MT.

30. MT Error Analysis
- (a) Worst BN: OOV names.
- (b) Worst BC: overlapped speech.
- (c) Best BN: composite sentences.
- (d) Best BC: simple sentences with disfluency and re-starts.
- (*.html, *.wav examples)

31. Error Analysis
- OOV words (especially names) are problematic for ASR, MT, and distillation.
- 徐昌霖 / 徐成民 / 徐长明 (Xu, Chang-Lin): the same spoken name recognized with different homophonous characters.
- 黄竹琴 / 黄朱琴 / 黄朱勤 / 皇猪禽 / 黄朱其 (Huang, Zhu-Qin): likewise.

32. Error Analysis
- MT BN high errors:
  - Composite syntactic structure; syntactic parsing would be useful.
- MT BC high errors:
  - Overlapped speech.
  - High ASR errors due to disfluency.
  - Conjecture: MT on perfect BC ASR transcripts would be easy, given their simple/short sentence structure.

33. Next ASR: Chinese Organization Names
- Semi-automatic abbreviation generation for long words:
  - Segment a long word into a sequence of shorter words.
  - Extract the 1st character of each shorter word: 世界卫生组织 -> 世卫.
  - (Make sure the abbreviations are in the MT translation table, too.)
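A minimal Python sketch of the abbreviation rule just described; the greedy longest-match segmenter and the tiny word list are stand-ins, and generating several candidate lengths is one way to cover the slide's 世界卫生组织 -> 世卫 example (candidates would then be checked against the MT translation table, as the slide notes).

```python
# Illustrative word list and segmenter; a real system would use the ASR
# word segmenter and lexicon.
WORD_LIST = {"世界", "卫生", "组织", "中国", "人民", "银行"}

def segment(word: str) -> list[str]:
    """Greedy longest-match segmentation of a long word into shorter words."""
    out, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in WORD_LIST or j == i + 1:   # fall back to 1 char
                out.append(word[i:j])
                i = j
                break
    return out

def abbreviation_candidates(long_word: str) -> list[str]:
    """First character of each segmented word, plus shorter prefixes of that
    string (the slide keeps 世卫, dropping the generic tail 组织)."""
    initials = "".join(w[0] for w in segment(long_word))
    return [initials[:k] for k in range(len(initials), 1, -1)]

if __name__ == "__main__":
    print(abbreviation_candidates("世界卫生组织"))   # ['世卫组', '世卫']
```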

34. Next ASR: Chinese Person Names
- Mandarin has a high rate of homophones: 408 syllables for 6000 common characters, i.e. about 14 homophone characters per syllable!!
- Given a spoken Chinese OOV name, there is no way to be sure which characters to use. But MT doesn't care anyway, as long as the syllables are correct.
- Recognize repetitions of the same name in the same snippet: CNC at the syllable level.
  - Xu -> {Chang, Cheng} -> {Lin, Min, Ming}
  - Huang -> Zhu -> {Qin, Qi}
- After syllable CNC, apply the same name to all occurrences, in Pinyin.
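A simplified Python sketch of the repeated-name idea: vote per syllable position across the recognized occurrences and apply the winner everywhere. A real implementation would vote over confusion-network posteriors rather than 1-best strings, and the Pinyin tones below are illustrative.

```python
from collections import Counter

def vote_name(occurrences: list[list[str]]) -> list[str]:
    """occurrences: the syllable (Pinyin) sequence recognized for each mention
    of the name; assumed here to have equal length."""
    length = len(occurrences[0])
    return [Counter(occ[i] for occ in occurrences).most_common(1)[0][0]
            for i in range(length)]

if __name__ == "__main__":
    # Three mentions of the Xu Chang-Lin name from slide 31 (tones illustrative).
    mentions = [["xu2", "chang1", "lin2"],
                ["xu2", "cheng2", "min2"],
                ["xu2", "chang2", "ming2"]]
    consensus = vote_name(mentions)   # ties fall back to the first mention
    print(consensus)                  # the same Pinyin is then applied to all mentions
```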

35. Next ASR: Foreign Names
- English spelling in the lexicon, with (multiple) Mandarin pronunciations:
  - Bush: /bu4 shi2/ or /bu4 xi1/
  - Bin Laden: /ben1 la1 deng1/ or /ben3 la1 deng1/
  - John: /yue1 han4/
  - Sadr: /sa4 de2 er3/
  - Name mapping from MT?
- Need to do name tagging on the training text (Yang Liu), convert Chinese names to English spelling, and re-train the n-gram LM.
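To make the lexicon idea concrete, here is a tiny Python sketch holding the pronunciations listed above; the data structure and print format are illustrative, not the decoder's actual dictionary format.

```python
# English spellings mapped to their Mandarin (Pinyin) pronunciation variants,
# taken from the slide; format is illustrative only.
FOREIGN_NAME_LEXICON = {
    "Bush":      [["bu4", "shi2"], ["bu4", "xi1"]],
    "Bin Laden": [["ben1", "la1", "deng1"], ["ben3", "la1", "deng1"]],
    "John":      [["yue1", "han4"]],
    "Sadr":      [["sa4", "de2", "er3"]],
}

if __name__ == "__main__":
    for name, prons in FOREIGN_NAME_LEXICON.items():
        for pron in prons:
            print(f"{name}\t{' '.join(pron)}")
```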

36. Next ASR: LM
- LM adaptation with finer topics, each topic with a small vocabulary.
- Spontaneous speech: n-gram backtraces to content words in search or in N-best? Text paring modeling?
  - 我想那(也)(也)也是 -> 我想那也是
  - "I think it, (too), (too), is, too." -> "I think it is, too."
- If optimizing CER, the stm reference needs to be designed such that disfluencies are optionally deletable, e.g. 小孩(儿).
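A minimal Python sketch of the repetition cleanup in the 我想那(也)(也)也是 example above: collapse immediate repeats of the same token. Real disfluency handling (fillers, restarts) needs more than this, which is exactly the slide's point.

```python
def collapse_repeats(tokens: list[str]) -> list[str]:
    """Drop a token when it immediately repeats the previous one."""
    out = []
    for tok in tokens:
        if not out or out[-1] != tok:
            out.append(tok)
    return out

if __name__ == "__main__":
    # Character-level toy example from the slide (a real system works on words).
    print("".join(collapse_repeats(list("我想那也也也是"))))   # 我想那也是
```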

37. Next ASR: AM
- Add explicit tone modeling (Lei07):
  - Prosody info: duration and pitch contour at the word level.
  - Various backoff schemes for infrequent words.
- Better understand why the outside regions do not help with AM adaptation.
- Add an SD MLLR regression tree (Mandal06).
- Improve automatic speaker clustering:
  - Smaller clusters, better performance.
  - Gender ID first.

38. ASR & MT Integration
- Do we need to merge the ASR and MT lexicons?
- Do we need to use the same word segmenter?
- Is word-level or character-level CNC output better for MT?
- Open questions and feedback!!!

