1 Building A Highly Accurate Mandarin Speech Recognizer
Mei-Yuh Hwang, Gang Peng, Wen Wang (SRI), Arlo Faria (ICSI), Aaron Heidel (NTU), Mari Ostendorf
12/12/2007

2 Outline
- Goal: a highly accurate Mandarin ASR
- Background
- Acoustic segmentation
- Acoustic models and adaptation
- Language models and adaptation
- Cross adaptation
- System combination
- Error analysis
- Future

3 Background
- 870 hours of acoustic training data.
- Unigram (N=1) maximum-likelihood Chinese word segmentation (see the sketch below).
- 60K-word lexicon.
- 1.2G words of LM training text; trigrams and 4-grams.

  LM     n2    n3     n4     Dev07-IV perplexity
  LM3    58M   108M   ---    325.7
  qLM3   6M    3M     ---    379.8
  LM4    58M   316M   201M   297.8
  qLM4   19M   24M    6M     383.2
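
To make the word-segmentation step concrete, here is a minimal sketch of unigram (N=1) maximum-likelihood segmentation by dynamic programming. The toy lexicon, counts, and example string are hypothetical; the real system used the 60K-word lexicon with counts from the 1.2G-word training text.

```python
import math

# Toy lexicon with made-up counts; the real system uses a 60K-word lexicon.
LEXICON = {"北京": 500, "大学": 300, "北京大学": 200, "学生": 250, "生": 100}
TOTAL = sum(LEXICON.values())

def word_logprob(w):
    # Unigram ML estimate; unseen single characters get a count-1 floor
    # so that every string stays segmentable.
    count = LEXICON.get(w, 1 if len(w) == 1 else 0)
    return math.log(count / TOTAL) if count > 0 else float("-inf")

def segment(sentence, max_word_len=4):
    """Return the most probable segmentation under the unigram LM."""
    n = len(sentence)
    best = [float("-inf")] * (n + 1)   # best log-prob of the first i characters
    back = [0] * (n + 1)               # start index of the last word on the best path
    best[0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(0, i - max_word_len), i):
            score = best[j] + word_logprob(sentence[j:i])
            if score > best[i]:
                best[i], back[i] = score, j
    words, i = [], n                   # follow back-pointers to recover the words
    while i > 0:
        words.append(sentence[back[i]:i])
        i = back[i]
    return list(reversed(words))

print(segment("北京大学的学生"))   # ['北京大学', '的', '学生'] with the toy counts
```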

4 Acoustic Segmentation
- The former segmenter caused high deletion errors: it mis-classified some speech segments as noise.
- Minimum speech-segment duration: 18 x 30 ms = 540 ms, roughly 0.5 s (see the simplified sketch below).

  Vocabulary   Pronunciation
  speech       fg, repeated at least 18 times
  noise        rej
  silence      bg

[State diagram: Start/null -> {speech, silence, noise} -> End/null]
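
As a much-simplified stand-in for the HMM segmenter above, the sketch below thresholds per-frame speech posteriors and then drops speech runs shorter than the minimum duration. The real segmenter enforces the minimum by decoding through a chain of repeated speech units (18 x 30 ms frames); the posterior source, threshold, and frame shift here are assumptions.

```python
def segment_speech(posteriors, threshold=0.5, min_ms=540, frame_shift_ms=30):
    """Return (start_frame, end_frame) speech segments at least min_ms long."""
    min_frames = min_ms // frame_shift_ms          # 540 / 30 = 18 frames
    labels = [p >= threshold for p in posteriors]  # True = speech frame
    segments, start = [], None
    for i, is_speech in enumerate(labels + [False]):  # sentinel flushes the last run
        if is_speech and start is None:
            start = i
        elif not is_speech and start is not None:
            if i - start >= min_frames:            # keep only long-enough runs
                segments.append((start, i))
            start = None
    return segments

# Example: ~1 s of speech inside 3 s of audio (30 ms frames).
post = [0.1] * 30 + [0.9] * 33 + [0.1] * 37
print(segment_speech(post))   # [(30, 63)]
```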

5 New Acoustic Segmenter
- Allow shorter speech duration.
- Model Mandarin vs. foreign (English) speech separately.

  Vocabulary   Pronunciation
  Mandarin1    I1 F
  Mandarin2    I2 F
  Foreign      forgn
  Noise        rej
  Silence      bg

[State diagram: Start/null -> {Mandarin1, Mandarin2, Foreign, silence, noise} -> End/null]

6 Two Sets of Acoustic Models
- For cross adaptation and system combination:
  - different error behaviors,
  - similar error-rate performance.

             System-MLP       System-PLP
  Features   74 (MFCC+3+32)   42 (PLP+3)
  fMPE       no               yes
  Phones     72               81

7 MLP Phoneme Posterior Features
- Compute Tandem features with pitch + PLP input.
- Compute HATS features with 19 critical bands.
- Combine the Tandem and HATS posterior vectors into one.
- Log, then PCA: 71 -> 32 dimensions.
- MFCC + pitch + MLP = 74 dimensions (see the sketch below).
- 3500 x 128 Gaussians, MPE trained.
- Both cross-word (CW) and nonCW triphones trained.
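
A minimal numpy sketch of this front end, assuming per-frame Tandem and HATS posteriors (71 classes each) have already been computed by their MLPs. The equal-weight merge, the epsilon floor, and the random PCA basis are illustrative stand-ins for the trained combination and projection.

```python
import numpy as np

def combine_mlp_features(tandem_post, hats_post, mfcc_pitch, pca_basis):
    """tandem_post, hats_post: (T, 71) posteriors; mfcc_pitch: (T, 42);
    pca_basis: (71, 32) projection estimated on training data."""
    merged = 0.5 * tandem_post + 0.5 * hats_post      # merge the two posterior streams
    mlp_feats = np.log(merged + 1e-10) @ pca_basis    # log, then PCA 71 -> 32
    return np.hstack([mfcc_pitch, mlp_feats])         # (T, 74) final feature

# Toy usage with random stand-in data.
rng = np.random.default_rng(0)
T = 100
tandem = rng.dirichlet(np.ones(71), size=T)
hats = rng.dirichlet(np.ones(71), size=T)
mfcc_pitch = rng.standard_normal((T, 42))
pca = rng.standard_normal((71, 32))
print(combine_mlp_features(tandem, hats, mfcc_pitch, pca).shape)   # (100, 74)
```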

8 Tandem Features [T1, T2, ..., T71]
- Input: 9 frames of PLP + pitch, i.e. PLP (39x9) and pitch (3x9).
- MLP topology: (42x9) x 15000 x 71.

9 HATS Features [H1, H2, ..., H71]
- One MLP per critical band E1 ... E19, each with topology 51 x 60 x 71.
- Merger MLP over the 19 band MLPs: (60x19) x 8000 x 71.

10 Phone-81: Diphthongs for BC
- Add diphthongs (4x4 = 16) for fast speech and for modeling longer triphone context.
- Maintain unique syllabification.
- Syllable-ending W and Y no longer needed.

  Example     Phone-72   Phone-81
  要 /yao4/   a4 W       aw4
  北 /bei3/   E3 Y       ey3
  有 /you3/   o3 W       ow3
  爱 /ai4/    a4 Y       ay4

11 Phone-81: Frequent Neutral Tones for BC
- Neutral tones are more common in conversation.
- Neutral tones were not modeled; the 3rd tone was used as a replacement.
- Add 3 neutral tones for frequent characters.

  Example    Phone-72   Phone-81
  了 /e5/    e3         e5
  吗 /ma5/   a3         a5
  子 /zi5/   i3         i5

12 Phone-81: Special CI Phones for BC
- Filled pauses (hmm, ah) are common in BC; add two CI phones for them.
- Add CI /V/ for English.

  Example    Phone-72   Phone-81
  victory    w          V
  呃 /ah/    o3         fp_o
  嗯 /hmm/   e3 N       fp_en

13 Phone-81: Simplification of Other Phones
- Now 72 + 14 + 3 + 3 = 92 phones: too many triphones to model.
- Merge similar phones to reduce the number of triphones, e.g. I2 was modeled by I1, now by i2.
- 92 - (4x3 - 1) = 81 phones.

  Example     Phone-72   Phone-81
  安 /an1/    A1 N       a1 N
  词 /ci2/    I1         i2
  池 /chi2/   IH2        i2

14 PLP Models with fMPE Transform
- PLP model with an fMPE transform, to compete with the MLP model.
- Smaller ML-trained Gaussian posterior model: 3500 x 32, CW+SAT.
- 5 neighboring frames of Gaussian posteriors.
- M is 42 x (3500*32*5); h is (3500*32*5) x 1 (see the sketch below).
- Ref: Zheng, ICASSP 2007.
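
A minimal sketch of an fMPE-style transform as described above: each 42-dim frame is offset by M @ h, where h stacks the Gaussian posteriors of 5 neighboring frames. In the real system the posterior model has 3500 x 32 Gaussians and M is discriminatively trained with the MPE criterion; the toy sizes and random M below are only for illustration.

```python
import numpy as np

def fmpe_transform(x, posteriors, M, context=(-2, -1, 0, 1, 2)):
    """x: (T, D) base features; posteriors: (T, G) per-frame Gaussian posteriors;
    M: (D, G * len(context)). Returns the transformed (T, D) features."""
    T = x.shape[0]
    out = np.empty_like(x)
    for t in range(T):
        # Stack posteriors over the context window, clamping at utterance edges.
        h = np.concatenate([posteriors[min(max(t + d, 0), T - 1)] for d in context])
        out[t] = x[t] + M @ h          # additive, discriminatively trained offset
    return out

# Toy usage: D=42 features, G=16 Gaussians instead of 3500*32.
rng = np.random.default_rng(0)
T, D, G = 10, 42, 16
x = rng.standard_normal((T, D))
post = rng.dirichlet(np.ones(G), size=T)
M = 0.01 * rng.standard_normal((D, G * 5))
print(fmpe_transform(x, post, M).shape)   # (10, 42)
```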

15 Topic-based LM Adaptation
- Latent Dirichlet Allocation topic model over {w | w in the same story, within a 4 s window} around one sentence.
- The 4 s window is used to make adaptation more robust against ASR errors.
- {w} are weighted based on distance.

16 Topic-based LM Adaptation
- Training: one topic per sentence; train 64 topic-dependent 4-grams LM_1, LM_2, ..., LM_64.
- Decoding: top n topics per sentence, where θ_i' > threshold.
[Diagram: Latent Dirichlet Allocation topic model applied to one sentence at a time]

17 Improved Acoustic Segmentation
Pruned trigram, SI nonCW-MLP MPE, on eval06:

  Segmenter   Sub   Del   Ins   Total
  OLD         9.7   7.0   1.9   18.6
  NEW         9.9   6.4   2.0   18.3
  Oracle      9.5   6.8   1.8   18.1

18 Different Phone Sets
Pruned trigram, SI nonCW-PLP ML, on dev07:

             BN    BC     Avg
  Phone-81   7.6   27.3   18.9
  Phone-72   7.4   27.6   19.0

Indeed different error behaviors --- good for system combination.

19 Decoding Architecture
[Flowchart: MLP nonCW decoding with qLM3; PLP CW+SAT+fMPE and MLP CW+SAT passes with MLLR and LM3; qLM4 adaptation/rescoring; confusion network combination, also taking the Aachen system output.]
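
A minimal sketch of the final combination step, assuming the systems' confusion networks have already been aligned so that bin i of every network covers the same word slot (real CNC must also compute this alignment). Each network is a list of bins, each bin a dict mapping a word to its posterior, with "" standing for a skip/epsilon arc; the toy bins are hypothetical.

```python
def cnc_vote(networks, weights=None):
    """Combine pre-aligned confusion networks by weighted posterior voting."""
    if weights is None:
        weights = [1.0 / len(networks)] * len(networks)
    combined = []
    for bins in zip(*networks):                      # one word slot at a time
        scores = {}
        for w_sys, bin_ in zip(weights, bins):
            for word, post in bin_.items():
                scores[word] = scores.get(word, 0.0) + w_sys * post
        best = max(scores, key=scores.get)
        if best:                                     # drop slots won by epsilon
            combined.append(best)
    return combined

sys_a = [{"今天": 0.8, "进店": 0.2}, {"天气": 0.6, "": 0.4}, {"好": 1.0}]
sys_b = [{"今天": 0.7, "": 0.3}, {"天气": 0.9, "大气": 0.1}, {"好": 1.0}]
print(" ".join(cnc_vote([sys_a, sys_b])))            # 今天 天气 好
```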

20 Topic-based LM Adaptation (NTU)
- Training, per sentence:
  - 64 topics: θ = (θ_1, θ_2, ..., θ_m).
  - Topic(sentence) = k = argmax {θ_1, θ_2, ..., θ_m}.
  - Train 64 topic-dependent (TD) 4-grams.
- Testing, per utterance:
  - {w}: N-best confidence-based weighting + distance weighting.
  - Pick all TD 4-grams whose θ_i is above a threshold.
  - Interpolate with the topic-independent 4-gram.
  - Rescore the N-best list (see the sketch below).
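
A minimal sketch of the decoding-time mixture described above. A real implementation would interpolate 4-gram LMs and rescore lattice-derived N-best lists; here each "LM" is just a table of unigram log-probabilities, and the threshold, interpolation weight, and backoff floor are illustrative.

```python
import math

def adapted_logprob(words, lm_indep, topic_lms, theta,
                    threshold=0.05, indep_weight=0.5, floor=-10.0):
    """Score a word sequence under a mixture of the topic-independent LM and
    the topic-dependent LMs whose posterior theta_i clears the threshold."""
    chosen = {k: p for k, p in theta.items() if p > threshold}
    z = sum(chosen.values()) or 1.0                  # renormalize selected topics
    total = 0.0
    for w in words:
        p = indep_weight * math.exp(lm_indep.get(w, floor))
        for k, post in chosen.items():
            p += (1.0 - indep_weight) * (post / z) * math.exp(topic_lms[k].get(w, floor))
        total += math.log(p)
    return total

def rescore_nbest(nbest, am_scores, lm_weight=1.0, **lm_args):
    """Pick the hypothesis maximizing acoustic score + weighted adapted LM score."""
    scored = [(am + lm_weight * adapted_logprob(hyp, **lm_args), hyp)
              for am, hyp in zip(am_scores, nbest)]
    return max(scored, key=lambda s: s[0])[1]
```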

21 CERs with Different LMs (internal use)

  LM \ AM (adapt. hyps)   PLP (MLP)   MLP (PLP)   MLP (Aachen)   PLP (Aachen)   Rover
  LM3                     10.2        9.6         9.9            10.1           --
  qLM4                    10.2        9.7         10.0           10.1           --
  LM4                     10.0        9.6         9.8            10.0           9.1
  Adapted qLM4            9.7         9.3         9.6            9.7            8.9

22 Topic-based LM Adaptation (NTU)

  LM \ AM (adapt. hyps)   PLP (MLP)   MLP (PLP)   MLP (Aachen)   PLP (Aachen)   CNC Rover
  LM4                     10.0        9.6         9.8            10.0           9.1
  Adapted qLM4            9.7         9.3         9.6            9.7            8.9

- "q" represents "quick", i.e. tightly pruned.
- Oracle CNC: 4.7%. Could it be a broken word sequence? Need to verify that with word perplexity and HTER.

23 2006 ASR System vs. 2007
CER on Eval07:

                Sub   Del   Ins   Total
  2006 system   7.2   6.5   0.4   14.1
  2007 system   5.5   3.0   0.4   8.9

37% relative improvement!

24 Eval07 BN ASR Error Distribution
[Histogram of snippet CER (%) vs. percentage of snippets: 66 BN snippets, average CER 3.4%]

25 Eval07 BC ASR Error Distribution
[Histogram of snippet CER (%) vs. percentage of snippets: 53 BC snippets, average CER 15.9%]

26 What Worked for Mandarin ASR?
- MLP features
- MPE
- CW+SAT
- fMPE
- Improved acoustic segmentation, particularly for deletion errors
- CNC Rover

27 Small Help for ASR
- Topic-dependent LM adaptation.
- Outside regions for additional AM adaptation data.
- A new phone set with diphthongs, offering different error behaviors.
- Pitch input in Tandem features.
- Cross adaptation with Aachen.
- Successful collaboration among 5 team members from 3 continents.

28 Error Analysis on Extreme Cases

  Snippet       Dur   CER     HTER
  a) Worst BN   87s   10.9%   47.73%
  b) Worst BC   72s   24.9%   48.37%
  c) Best BN    62s   0%      12.67%
  d) Best BC    77s   15.2%   14.20%

- CER is not directly related to HTER; genre matters.
- Better CER does ease MT.

29 Error Analysis
- (a) Worst BN: OOV names.
- (b) Worst BC: overlapped speech.
- (c) Best BN: composite sentences.
- (d) Best BC: simple sentences with disfluency and re-starts.

30 Error Analysis
- OOV (especially names): problematic for both ASR and MT.
  - Hypotheses for the name Xu, Chang-Lin: 徐 昌 霖 / 徐 成 民 / 徐 长 明
  - Hypotheses for the name Huang, Zhu-Qin: 黄 竹 琴 / 黄 朱 琴 / 黄 朱 勤 / 皇 猪 禽 / 黄 朱 其
- Overlapped speech: what to do?
- Content-word mis-recognition (not all errors are equal!), e.g. 升值 (increase in value) vs. 甚至 (even). Parsing scores?

31 Error Analysis
- MT BN high errors: composite syntactic structure; syntactic parsing would be useful.
- MT BC high errors: overlapped speech; high ASR errors due to disfluency.
- Conjecture: MT on perfect BC ASR output is easy, given its simple/short sentence structure.

32 Next ASR: Chinese OOV Org Names
- Semi-automatic abbreviation generation for long words:
  - Segment a long word into a sequence of shorter words.
  - Extract the first character of each shorter word: World Health Organization -> WHO (see the sketch below).
  - (Make sure they are in the MT translation table, too.)
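
A minimal sketch of the abbreviation rule above: greedily segment a long organization name into shorter lexicon words and keep the first character of each. The lexicon is a toy; the real system would use the 60K-word lexicon plus a human check, and the conventional Chinese abbreviation may keep more characters than the literal first-character rule does.

```python
LEXICON = {"世界", "卫生", "组织", "北京", "大学"}   # toy lexicon

def greedy_segment(name, max_word_len=4):
    """Greedy longest-match segmentation into lexicon words (single chars as fallback)."""
    words, i = [], 0
    while i < len(name):
        for length in range(min(max_word_len, len(name) - i), 0, -1):
            piece = name[i:i + length]
            if length == 1 or piece in LEXICON:
                words.append(piece)
                i += length
                break
    return words

def abbreviate(name):
    # First character of each shorter word, mirroring "World Health Organization -> WHO".
    return "".join(w[0] for w in greedy_segment(name))

print(greedy_segment("世界卫生组织"))   # ['世界', '卫生', '组织']
print(abbreviate("世界卫生组织"))       # 世卫组
```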

33 Next ASR: Chinese OOV Personal Names
- Mandarin has a high rate of homophones: 408 syllables vs. 6000 common characters, i.e. about 14 homophone characters per syllable!
- Given a spoken Chinese OOV name, there is no way to be sure which characters to use. But MT does not care anyway, as long as the syllables are correct.
- Recognize repetition of the same name in the same snippet: CNC at the syllable level.
  - Xu -> {Chang, Cheng} -> {Lin, Min, Ming}
  - Huang -> Zhu -> {Qin, Qi}
- After syllable CNC, apply the same name to all occurrences, in Pinyin (see the sketch below).
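
A minimal sketch of syllable-level consensus for a repeated name: pool the hypothesized syllable sequences for every occurrence of the name in a snippet, vote per syllable position, and apply the winning Pinyin sequence to all occurrences. The hypothesis lists are made up, and the sketch assumes all occurrences have the same number of syllables.

```python
from collections import Counter

def consensus_name(occurrences):
    """occurrences: syllable-sequence hypotheses for one name, all the same length.
    Returns the per-position majority syllable sequence (ties: first seen wins)."""
    return [Counter(syls).most_common(1)[0][0] for syls in zip(*occurrences)]

xu_hyps = [["Xu", "Chang", "Lin"],    # 徐 昌 霖
           ["Xu", "Cheng", "Min"],    # 徐 成 民
           ["Xu", "Chang", "Ming"]]   # 徐 长 明
print(consensus_name(xu_hyps))        # ['Xu', 'Chang', 'Lin']
```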

34 Next ASR: English OOV Names
- English spelling in the lexicon, with (multiple) Mandarin pronunciations:
  - Bush /bu4 shi2/ or /bu4 xi1/
  - Bin Laden /ben1 la1 deng1/ or /ben1 la1 dan1/
  - John /yue1 han4/
  - Sadr /sa4 de2 er3/
  - Name mapping from MT?
- Need to do name tagging on the training text (Yang Liu), convert Chinese names to English spelling, and re-train the n-gram.

35 Next ASR: LM
- LM adaptation with fine topics, each topic with a small vocabulary.
- Spontaneous speech: should the n-gram back-trace to content words in search or in the N-best? Text paring modeling?
  - 我想那 (也)(也) 也是 -> 我想那也是
  - I think it, (too), (too), is, too. -> I think it is, too.
- If optimizing CER, the STM references need to be designed so that disfluency is optionally deletable.

36 Next ASR: AM
- Add explicit tone modeling (Lei07):
  - prosodic information: duration and pitch contour at the word level;
  - various backoff schemes for infrequent words.
- Better understand why outside regions are not helping with AM adaptation:
  - add an SD MLLR regression tree (Mandal06);
  - improve automatic speaker clustering: smaller clusters, better performance.

37 ASR & MT Integration
- Do we need to merge lexicons? ASR <= MT.
- Do we need to use the same word segmenter?
- Is word-level or character-level CNC output better for MT?
- Open questions and feedback!

