1 Building A Highly Accurate Mandarin Speech Recognizer
Mei-Yuh Hwang, Gang Peng, Wen Wang (SRI), Arlo Faria (ICSI), Aaron Heidel (NTU), Mari Ostendorf
12/12/2007

2 Outline
- Goal: a highly accurate Mandarin ASR
- Background
- Acoustic segmentation
- Acoustic models and adaptation
- Language models and adaptation
- Cross adaptation
- System combination
- Error analysis
- Future

3 Background
- 870 hours of acoustic training data.
- Unigram (N=1) maximum-likelihood Chinese word segmentation (see the sketch below).
- 60K-word lexicon.
- 1.2G words of LM training text; trigrams and 4-grams.

  LM     n2    n3     n4     Dev07-IV perplexity
  LM3    58M   108M   ---    325.7
  qLM3   6M    3M     ---    379.8
  LM4    58M   316M   201M   297.8
  qLM4   19M   24M    6M     383.2
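
To make the word-segmentation step concrete, here is a minimal sketch of unigram (N=1) maximum-likelihood segmentation by dynamic programming. The toy lexicon, counts, and example string are hypothetical; the real system used the 60K-word lexicon with counts from the 1.2G-word training text.

```python
import math

# Toy lexicon with made-up counts; the real system uses a 60K-word lexicon.
LEXICON = {"北京": 500, "大学": 300, "北京大学": 200, "学生": 250, "生": 100}
TOTAL = sum(LEXICON.values())

def word_logprob(w):
    # Unigram ML estimate; unseen single characters get a count-1 floor
    # so that every string stays segmentable.
    count = LEXICON.get(w, 1 if len(w) == 1 else 0)
    return math.log(count / TOTAL) if count > 0 else float("-inf")

def segment(sentence, max_word_len=4):
    """Return the most probable segmentation under the unigram LM."""
    n = len(sentence)
    best = [float("-inf")] * (n + 1)   # best log-prob of the first i characters
    back = [0] * (n + 1)               # start index of the last word on the best path
    best[0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(0, i - max_word_len), i):
            score = best[j] + word_logprob(sentence[j:i])
            if score > best[i]:
                best[i], back[i] = score, j
    words, i = [], n                   # follow back-pointers to recover the words
    while i > 0:
        words.append(sentence[back[i]:i])
        i = back[i]
    return list(reversed(words))

print(segment("北京大学的学生"))   # ['北京大学', '的', '学生'] with the toy counts
```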

4 Acoustic Segmentation
- The former segmenter caused high deletion errors: it mis-classified some speech segments as noise.
- Minimum speech-segment duration: 18 x 30 ms = 540 ms, roughly 0.5 s (see the simplified sketch below).

  Vocabulary   Pronunciation
  speech       fg, repeated at least 18 times
  noise        rej
  silence      bg

[State diagram: Start/null -> {speech, silence, noise} -> End/null]
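
As a much-simplified stand-in for the HMM segmenter above, the sketch below thresholds per-frame speech posteriors and then drops speech runs shorter than the minimum duration. The real segmenter enforces the minimum by decoding through a chain of repeated speech units (18 x 30 ms frames); the posterior source, threshold, and frame shift here are assumptions.

```python
def segment_speech(posteriors, threshold=0.5, min_ms=540, frame_shift_ms=30):
    """Return (start_frame, end_frame) speech segments at least min_ms long."""
    min_frames = min_ms // frame_shift_ms          # 540 / 30 = 18 frames
    labels = [p >= threshold for p in posteriors]  # True = speech frame
    segments, start = [], None
    for i, is_speech in enumerate(labels + [False]):  # sentinel flushes the last run
        if is_speech and start is None:
            start = i
        elif not is_speech and start is not None:
            if i - start >= min_frames:            # keep only long-enough runs
                segments.append((start, i))
            start = None
    return segments

# Example: ~1 s of speech inside 3 s of audio (30 ms frames).
post = [0.1] * 30 + [0.9] * 33 + [0.1] * 37
print(segment_speech(post))   # [(30, 63)]
```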

5 New Acoustic Segmenter
- Allow shorter speech duration.
- Model Mandarin vs. foreign (English) speech separately.

  Vocabulary   Pronunciation
  Mandarin1    I1 F
  Mandarin2    I2 F
  Foreign      forgn
  Noise        rej
  Silence      bg

[State diagram: Start/null -> {Mandarin1, Mandarin2, Foreign, silence, noise} -> End/null]

6 Two Sets of Acoustic Models
- For cross adaptation and system combination:
  - different error behaviors,
  - similar error-rate performance.

             System-MLP       System-PLP
  Features   74 (MFCC+3+32)   42 (PLP+3)
  fMPE       no               yes
  Phones     72               81

7 MLP Phoneme Posterior Features
- Compute Tandem features with pitch + PLP input.
- Compute HATS features with 19 critical bands.
- Combine the Tandem and HATS posterior vectors into one.
- Log, then PCA: 71 -> 32 dimensions.
- MFCC + pitch + MLP = 74 dimensions (see the sketch below).
- 3500 x 128 Gaussians, MPE trained.
- Both cross-word (CW) and nonCW triphones trained.
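
A minimal numpy sketch of this front end, assuming per-frame Tandem and HATS posteriors (71 classes each) have already been computed by their MLPs. The equal-weight merge, the epsilon floor, and the random PCA basis are illustrative stand-ins for the trained combination and projection.

```python
import numpy as np

def combine_mlp_features(tandem_post, hats_post, mfcc_pitch, pca_basis):
    """tandem_post, hats_post: (T, 71) posteriors; mfcc_pitch: (T, 42);
    pca_basis: (71, 32) projection estimated on training data."""
    merged = 0.5 * tandem_post + 0.5 * hats_post      # merge the two posterior streams
    mlp_feats = np.log(merged + 1e-10) @ pca_basis    # log, then PCA 71 -> 32
    return np.hstack([mfcc_pitch, mlp_feats])         # (T, 74) final feature

# Toy usage with random stand-in data.
rng = np.random.default_rng(0)
T = 100
tandem = rng.dirichlet(np.ones(71), size=T)
hats = rng.dirichlet(np.ones(71), size=T)
mfcc_pitch = rng.standard_normal((T, 42))
pca = rng.standard_normal((71, 32))
print(combine_mlp_features(tandem, hats, mfcc_pitch, pca).shape)   # (100, 74)
```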

8 Tandem Features [T1, T2, ..., T71]
- Input: 9 frames of PLP + pitch, i.e. PLP (39x9) and pitch (3x9).
- MLP topology: (42x9) x 15000 x 71.

9 HATS Features [H1, H2, ..., H71]
- One MLP per critical band E1 ... E19, each with topology 51 x 60 x 71.
- Merger MLP over the 19 band MLPs: (60x19) x 8000 x 71.

10 Phone-81: Diphthongs for BC
- Add diphthongs (4x4 = 16) for fast speech and for modeling longer triphone context.
- Maintain unique syllabification.
- Syllable-ending W and Y no longer needed.

  Example     Phone-72   Phone-81
  要 /yao4/   a4 W       aw4
  北 /bei3/   E3 Y       ey3
  有 /you3/   o3 W       ow3
  爱 /ai4/    a4 Y       ay4

11 Phone-81: Frequent Neutral Tones for BC
- Neutral tones are more common in conversation.
- Neutral tones were not modeled; the 3rd tone was used as a replacement.
- Add 3 neutral tones for frequent characters.

  Example    Phone-72   Phone-81
  了 /e5/    e3         e5
  吗 /ma5/   a3         a5
  子 /zi5/   i3         i5

12 Phone-81: Special CI Phones for BC
- Filled pauses (hmm, ah) are common in BC; add two CI phones for them.
- Add CI /V/ for English.

  Example    Phone-72   Phone-81
  victory    w          V
  呃 /ah/    o3         fp_o
  嗯 /hmm/   e3 N       fp_en

13 Phone-81: Simplification of Other Phones
- Now 72 + 14 + 3 + 3 = 92 phones: too many triphones to model.
- Merge similar phones to reduce the number of triphones, e.g. I2 was modeled by I1, now by i2.
- 92 - (4x3 - 1) = 81 phones.

  Example     Phone-72   Phone-81
  安 /an1/    A1 N       a1 N
  词 /ci2/    I1         i2
  池 /chi2/   IH2        i2

14 PLP Models with fMPE Transform
- PLP model with an fMPE transform, to compete with the MLP model.
- Smaller ML-trained Gaussian posterior model: 3500 x 32, CW+SAT.
- 5 neighboring frames of Gaussian posteriors.
- M is 42 x (3500*32*5); h is (3500*32*5) x 1 (see the sketch below).
- Ref: Zheng, ICASSP 2007.
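
A minimal sketch of an fMPE-style transform as described above: each 42-dim frame is offset by M @ h, where h stacks the Gaussian posteriors of 5 neighboring frames. In the real system the posterior model has 3500 x 32 Gaussians and M is discriminatively trained with the MPE criterion; the toy sizes and random M below are only for illustration.

```python
import numpy as np

def fmpe_transform(x, posteriors, M, context=(-2, -1, 0, 1, 2)):
    """x: (T, D) base features; posteriors: (T, G) per-frame Gaussian posteriors;
    M: (D, G * len(context)). Returns the transformed (T, D) features."""
    T = x.shape[0]
    out = np.empty_like(x)
    for t in range(T):
        # Stack posteriors over the context window, clamping at utterance edges.
        h = np.concatenate([posteriors[min(max(t + d, 0), T - 1)] for d in context])
        out[t] = x[t] + M @ h          # additive, discriminatively trained offset
    return out

# Toy usage: D=42 features, G=16 Gaussians instead of 3500*32.
rng = np.random.default_rng(0)
T, D, G = 10, 42, 16
x = rng.standard_normal((T, D))
post = rng.dirichlet(np.ones(G), size=T)
M = 0.01 * rng.standard_normal((D, G * 5))
print(fmpe_transform(x, post, M).shape)   # (10, 42)
```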

15 Topic-based LM Adaptation
- Latent Dirichlet Allocation topic model over {w | w in the same story, within a 4 s window} around one sentence.
- The 4 s window is used to make adaptation more robust against ASR errors.
- {w} are weighted based on distance.

16 Topic-based LM Adaptation
- Training: one topic per sentence; train 64 topic-dependent 4-grams LM_1, LM_2, ..., LM_64.
- Decoding: top n topics per sentence, where θ_i' > threshold.
[Diagram: Latent Dirichlet Allocation topic model applied to one sentence at a time]

17 Improved Acoustic Segmentation
Pruned trigram, SI nonCW-MLP MPE, on eval06:

  Segmenter   Sub   Del   Ins   Total
  OLD         9.7   7.0   1.9   18.6
  NEW         9.9   6.4   2.0   18.3
  Oracle      9.5   6.8   1.8   18.1

18 Different Phone Sets
Pruned trigram, SI nonCW-PLP ML, on dev07:

             BN    BC     Avg
  Phone-81   7.6   27.3   18.9
  Phone-72   7.4   27.6   19.0

Indeed different error behaviors --- good for system combination.

19 Decoding Architecture
[Flowchart: MLP nonCW decoding with qLM3; PLP CW+SAT+fMPE and MLP CW+SAT passes with MLLR and LM3; qLM4 adaptation/rescoring; confusion network combination, also taking the Aachen system output.]
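
A minimal sketch of the final combination step, assuming the systems' confusion networks have already been aligned so that bin i of every network covers the same word slot (real CNC must also compute this alignment). Each network is a list of bins, each bin a dict mapping a word to its posterior, with "" standing for a skip/epsilon arc; the toy bins are hypothetical.

```python
def cnc_vote(networks, weights=None):
    """Combine pre-aligned confusion networks by weighted posterior voting."""
    if weights is None:
        weights = [1.0 / len(networks)] * len(networks)
    combined = []
    for bins in zip(*networks):                      # one word slot at a time
        scores = {}
        for w_sys, bin_ in zip(weights, bins):
            for word, post in bin_.items():
                scores[word] = scores.get(word, 0.0) + w_sys * post
        best = max(scores, key=scores.get)
        if best:                                     # drop slots won by epsilon
            combined.append(best)
    return combined

sys_a = [{"今天": 0.8, "进店": 0.2}, {"天气": 0.6, "": 0.4}, {"好": 1.0}]
sys_b = [{"今天": 0.7, "": 0.3}, {"天气": 0.9, "大气": 0.1}, {"好": 1.0}]
print(" ".join(cnc_vote([sys_a, sys_b])))            # 今天 天气 好
```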

20 Topic-based LM Adaptation (NTU)
- Training, per sentence:
  - 64 topics: θ = (θ_1, θ_2, ..., θ_m).
  - Topic(sentence) = k = argmax {θ_1, θ_2, ..., θ_m}.
  - Train 64 topic-dependent (TD) 4-grams.
- Testing, per utterance:
  - {w}: N-best confidence-based weighting + distance weighting.
  - Pick all TD 4-grams whose θ_i is above a threshold.
  - Interpolate with the topic-independent 4-gram.
  - Rescore the N-best list (see the sketch below).
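
A minimal sketch of the decoding-time mixture described above. A real implementation would interpolate 4-gram LMs and rescore lattice-derived N-best lists; here each "LM" is just a table of unigram log-probabilities, and the threshold, interpolation weight, and backoff floor are illustrative.

```python
import math

def adapted_logprob(words, lm_indep, topic_lms, theta,
                    threshold=0.05, indep_weight=0.5, floor=-10.0):
    """Score a word sequence under a mixture of the topic-independent LM and
    the topic-dependent LMs whose posterior theta_i clears the threshold."""
    chosen = {k: p for k, p in theta.items() if p > threshold}
    z = sum(chosen.values()) or 1.0                  # renormalize selected topics
    total = 0.0
    for w in words:
        p = indep_weight * math.exp(lm_indep.get(w, floor))
        for k, post in chosen.items():
            p += (1.0 - indep_weight) * (post / z) * math.exp(topic_lms[k].get(w, floor))
        total += math.log(p)
    return total

def rescore_nbest(nbest, am_scores, lm_weight=1.0, **lm_args):
    """Pick the hypothesis maximizing acoustic score + weighted adapted LM score."""
    scored = [(am + lm_weight * adapted_logprob(hyp, **lm_args), hyp)
              for am, hyp in zip(am_scores, nbest)]
    return max(scored, key=lambda s: s[0])[1]
```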

21 CERs with Different LMs (internal use)

  LM \ AM (adapt. hyps)   PLP (MLP)   MLP (PLP)   MLP (Aachen)   PLP (Aachen)   Rover
  LM3                     10.2        9.6         9.9            10.1           --
  qLM4                    10.2        9.7         10.0           10.1           --
  LM4                     10.0        9.6         9.8            10.0           9.1
  Adapted qLM4            9.7         9.3         9.6            9.7            8.9

22 Topic-based LM Adaptation (NTU)

  LM \ AM (adapt. hyps)   PLP (MLP)   MLP (PLP)   MLP (Aachen)   PLP (Aachen)   CNC Rover
  LM4                     10.0        9.6         9.8            10.0           9.1
  Adapted qLM4            9.7         9.3         9.6            9.7            8.9

- "q" represents "quick", i.e. tightly pruned.
- Oracle CNC: 4.7%. Could it be a broken word sequence? Need to verify that with word perplexity and HTER.

23 2006 ASR System vs. 2007
CER on Eval07:

                Sub   Del   Ins   Total
  2006 system   7.2   6.5   0.4   14.1
  2007 system   5.5   3.0   0.4   8.9

37% relative improvement!

24 Eval07 BN ASR Error Distribution
[Histogram of snippet CER (%) vs. percentage of snippets: 66 BN snippets, average CER 3.4%]

25 Eval07 BC ASR Error Distribution
[Histogram of snippet CER (%) vs. percentage of snippets: 53 BC snippets, average CER 15.9%]

26 What Worked for Mandarin ASR?
- MLP features
- MPE
- CW+SAT
- fMPE
- Improved acoustic segmentation, particularly for deletion errors
- CNC Rover

27 Small Help for ASR
- Topic-dependent LM adaptation.
- Outside regions for additional AM adaptation data.
- A new phone set with diphthongs, offering different error behaviors.
- Pitch input in Tandem features.
- Cross adaptation with Aachen.
- Successful collaboration among 5 team members from 3 continents.

28 Error Analysis on Extreme Cases

  Snippet       Dur   CER     HTER
  a) Worst BN   87s   10.9%   47.73%
  b) Worst BC   72s   24.9%   48.37%
  c) Best BN    62s   0%      12.67%
  d) Best BC    77s   15.2%   14.20%

- CER is not directly related to HTER; genre matters.
- Better CER does ease MT.

29 Error Analysis
- (a) Worst BN: OOV names.
- (b) Worst BC: overlapped speech.
- (c) Best BN: composite sentences.
- (d) Best BC: simple sentences with disfluency and re-starts.

30 Error Analysis
- OOV (especially names): problematic for both ASR and MT.
  - Hypotheses for the name Xu, Chang-Lin: 徐 昌 霖 / 徐 成 民 / 徐 长 明
  - Hypotheses for the name Huang, Zhu-Qin: 黄 竹 琴 / 黄 朱 琴 / 黄 朱 勤 / 皇 猪 禽 / 黄 朱 其
- Overlapped speech: what to do?
- Content-word mis-recognition (not all errors are equal!), e.g. 升值 (increase in value) vs. 甚至 (even). Parsing scores?

31 Error Analysis
- MT BN high errors: composite syntactic structure; syntactic parsing would be useful.
- MT BC high errors: overlapped speech; high ASR errors due to disfluency.
- Conjecture: MT on perfect BC ASR output is easy, given its simple/short sentence structure.

32 Next ASR: Chinese OOV Org Names
- Semi-automatic abbreviation generation for long words:
  - Segment a long word into a sequence of shorter words.
  - Extract the first character of each shorter word: World Health Organization -> WHO (see the sketch below).
  - (Make sure they are in the MT translation table, too.)
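
A minimal sketch of the abbreviation rule above: greedily segment a long organization name into shorter lexicon words and keep the first character of each. The lexicon is a toy; the real system would use the 60K-word lexicon plus a human check, and the conventional Chinese abbreviation may keep more characters than the literal first-character rule does.

```python
LEXICON = {"世界", "卫生", "组织", "北京", "大学"}   # toy lexicon

def greedy_segment(name, max_word_len=4):
    """Greedy longest-match segmentation into lexicon words (single chars as fallback)."""
    words, i = [], 0
    while i < len(name):
        for length in range(min(max_word_len, len(name) - i), 0, -1):
            piece = name[i:i + length]
            if length == 1 or piece in LEXICON:
                words.append(piece)
                i += length
                break
    return words

def abbreviate(name):
    # First character of each shorter word, mirroring "World Health Organization -> WHO".
    return "".join(w[0] for w in greedy_segment(name))

print(greedy_segment("世界卫生组织"))   # ['世界', '卫生', '组织']
print(abbreviate("世界卫生组织"))       # 世卫组
```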

33 Next ASR: Chinese OOV Personal Names
- Mandarin has a high rate of homophones: 408 syllables vs. 6000 common characters, i.e. about 14 homophone characters per syllable!
- Given a spoken Chinese OOV name, there is no way to be sure which characters to use. But MT does not care anyway, as long as the syllables are correct.
- Recognize repetition of the same name in the same snippet: CNC at the syllable level.
  - Xu -> {Chang, Cheng} -> {Lin, Min, Ming}
  - Huang -> Zhu -> {Qin, Qi}
- After syllable CNC, apply the same name to all occurrences, in Pinyin (see the sketch below).
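
A minimal sketch of syllable-level consensus for a repeated name: pool the hypothesized syllable sequences for every occurrence of the name in a snippet, vote per syllable position, and apply the winning Pinyin sequence to all occurrences. The hypothesis lists are made up, and the sketch assumes all occurrences have the same number of syllables.

```python
from collections import Counter

def consensus_name(occurrences):
    """occurrences: syllable-sequence hypotheses for one name, all the same length.
    Returns the per-position majority syllable sequence (ties: first seen wins)."""
    return [Counter(syls).most_common(1)[0][0] for syls in zip(*occurrences)]

xu_hyps = [["Xu", "Chang", "Lin"],    # 徐 昌 霖
           ["Xu", "Cheng", "Min"],    # 徐 成 民
           ["Xu", "Chang", "Ming"]]   # 徐 长 明
print(consensus_name(xu_hyps))        # ['Xu', 'Chang', 'Lin']
```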

34 Next ASR: English OOV Names
- English spelling in the lexicon, with (multiple) Mandarin pronunciations:
  - Bush /bu4 shi2/ or /bu4 xi1/
  - Bin Laden /ben1 la1 deng1/ or /ben1 la1 dan1/
  - John /yue1 han4/
  - Sadr /sa4 de2 er3/
  - Name mapping from MT?
- Need to do name tagging on the training text (Yang Liu), convert Chinese names to English spelling, and re-train the n-gram.

35 Next ASR: LM
- LM adaptation with fine topics, each topic with a small vocabulary.
- Spontaneous speech: should the n-gram back-trace to content words in search or in the N-best? Text paring modeling?
  - 我想那 (也)(也) 也是 -> 我想那也是
  - I think it, (too), (too), is, too. -> I think it is, too.
- If optimizing CER, the STM references need to be designed so that disfluency is optionally deletable.

36 Next ASR: AM
- Add explicit tone modeling (Lei07):
  - prosodic information: duration and pitch contour at the word level;
  - various backoff schemes for infrequent words.
- Better understand why outside regions are not helping with AM adaptation:
  - add an SD MLLR regression tree (Mandal06);
  - improve automatic speaker clustering: smaller clusters, better performance.

37 ASR & MT Integration
- Do we need to merge lexicons? ASR <= MT.
- Do we need to use the same word segmenter?
- Is word-level or character-level CNC output better for MT?
- Open questions and feedback!

