2 Machine Translation & Automated Speech Recognition
Jaime Carbonell, with Richard Stern and Alex Rudnicky
Language Technologies Institute, Carnegie Mellon University
Telephone: +1 412 268 7279  Fax: +1 412 268 6298  jgc@cs.cmu.edu
For Samsung, February 20, 2008

3 Outline of Presentation
Machine Translation (MT):
– Three major paradigms for MT (transfer, example-based, statistical)
– New method: Context-Based MT
Automated Speech Recognition (ASR):
– The Sphinx family of recognizers
– Environmental robustness for ASR
Intelligent Tutoring for English:
– Native Accent: pronunciation product
– Grammar, vocabulary & other tutoring
– Components for Virtual English Village

4 Language Technologies
BILL OF RIGHTS             TECHNOLOGY (e.g.)
“…right information”       IR (search engines)
“…right people”            Routing, personalization
“…right time”              Anticipatory analysis
“…right medium”            Speech: ASR, TTS
“…right language”          Machine translation
“…right level of detail”   Summarization, drill-down

5 Machine Translation: Context is Needed to Resolve Ambiguity
Example: English → Japanese
– Power line – densen ( 電線 )
– Subway line – chikatetsu ( 地下鉄 )
– (Be) on line – onrain ( オンライン )
– (Be) on the line – denwachuu ( 電話中 )
– Line up – narabu ( 並ぶ )
– Line one’s pockets – kanemochi ni naru ( 金持ちになる )
– Line one’s jacket – uwagi o nijuu ni suru ( 上着を二重にする )
– Actor’s line – serifu ( セリフ )
– Get a line on – joho o eru ( 情報を得る )
Sometimes local context suffices (as above)... but sometimes not.

6 CONTEXT: More is Better
Examples requiring longer-range context:
– “The line for the new play extended for 3 blocks.”
– “The line for the new play was changed by the scriptwriter.”
– “The line for the new play got tangled with the other props.”
– “The line for the new play better protected the quarterback.”
Better MT requires better context sensitivity:
– Syntax-rule-based MT → little or no context
– EBMT → potentially long context, on exact substring matches
– SMT → short context, with proper probabilistic optimization
– CBMT → long context with optimization (but still experimental)

7 Types of Machine Translation
[Diagram: the translation pyramid from Source (Arabic) to Target (English). Direct translation (SMT, EBMT) at the base; syntactic parsing feeding transfer rules, then sentence planning and text generation; semantic analysis feeding an interlingua at the apex.]

8 Example-Based Machine Translation
Example:
– English: I would like to meet her.
  Mapudungun: Ayükefun trawüael fey engu.
– English: The tallest man is my father.
  Mapudungun: Chi doy fütra chi wentru fey ta inche ñi chaw.
– English: I would like to meet the tallest man.
  Mapudungun (new): Ayükefun trawüael chi doy fütra chi wentru
  Mapudungun (correct): Ayüken ñi trawüael chi doy fütra wentruengu.
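
The failure on this slide falls out of naive fragment stitching. Below is a minimal EBMT sketch in Python, assuming a toy example base of hand-aligned source/target fragments (the fragment alignments here are hypothetical; real EBMT induces sub-sentential alignments from the example pairs):

    # Minimal EBMT sketch (illustrative only): translate a new sentence by
    # reusing the longest source fragments found in an example base.
    examples = {
        # source fragment -> target fragment (hypothetical alignments)
        "i would like to meet": "ayükefun trawüael",
        "the tallest man": "chi doy fütra chi wentru",
    }

    def ebmt_translate(sentence: str) -> str:
        words = sentence.lower().split()
        out, i = [], 0
        while i < len(words):
            # greedily take the longest known source fragment starting at i
            for j in range(len(words), i, -1):
                frag = " ".join(words[i:j])
                if frag in examples:
                    out.append(examples[frag])
                    i = j
                    break
            else:
                out.append(f"<{words[i]}>")  # untranslated word
                i += 1
        return " ".join(out)

    print(ebmt_translate("I would like to meet the tallest man"))
    # -> "ayükefun trawüael chi doy fütra chi wentru": the naive
    #    concatenation, which the slide shows differs from correct Mapudungun.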

9 Statistical Machine Translation: The Three Ingredients
– English language model: p(E)
– English–Korean translation model: p(K|E)
– Decoder: E* = argmax_E p(E|K) = argmax_E p(E) p(K|E)
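
The decoder line is the noisy-channel equation. A toy sketch of that argmax, assuming a small enumerable candidate set and precomputed log-probability functions (real decoders search an enormous space with dynamic programming and pruning):

    # Noisy-channel decoding: E* = argmax_E p(E) * p(K|E), in log space.
    def decode(korean, candidates, lm_logprob, tm_logprob):
        # lm_logprob(E): English language model score, log p(E)
        # tm_logprob(K, E): translation model score, log p(K|E)
        return max(candidates,
                   key=lambda e: lm_logprob(e) + tm_logprob(korean, e))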

10 Alignments
Example pair: “The proposal will not now be implemented” ↔ “Les propositions ne seront pas mises en application maintenant”
Translation models are built in terms of hidden alignments A:
p(F|E) = Σ_A p(F, A|E)
Each English word “generates” zero or more French words.

11 Conditioning Joint Probabilities
The modeling problem is how to approximate the terms of p(F, A|E) to achieve:
– Statistically robust estimates
– Computationally efficient training
– An effective and accurate model

12 The “Gang of Five” IBM++ Models
A series of five models is sequentially trained to “bootstrap” a detailed translation model from parallel corpora:
1. Word-by-word translation. Parameters p(f|e) between pairs of words.
2. Local alignments. Adds alignment probabilities.
3. Fertilities and distortions. Allow a source word to generate zero or more target words, then place words in position with p(j|i, sentence lengths).
4. Tablets. Groups words into phrases.
– The primary idea of example-based MT, incorporated into SMT
– Current work seeks meaningful phrases (vs. arbitrary ones), e.g. NPs, PPs
5. Non-deficient alignments. Don’t ask...
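
Model 1, the first rung of this ladder, fits in a few lines. A sketch of its EM training loop (the NULL word lets a foreign word align to nothing; initialization and iteration count are illustrative):

    # IBM Model 1 EM sketch: estimate p(f|e) from sentence pairs.
    from collections import defaultdict

    def ibm_model1(pairs, iterations=10):
        # pairs: list of (english_words, foreign_words)
        t = defaultdict(lambda: 1.0)                 # uniform-ish init
        for _ in range(iterations):
            count = defaultdict(float)
            total = defaultdict(float)
            for es, fs in pairs:
                es = ["NULL"] + es
                for f in fs:
                    z = sum(t[(f, e)] for e in es)   # normalizer for this f
                    for e in es:
                        c = t[(f, e)] / z            # expected alignment count
                        count[(f, e)] += c
                        total[e] += c
            for (f, e), c in count.items():          # M-step: reestimate p(f|e)
                t[(f, e)] = c / total[e]
        return t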

13 Multi-Engine Machine Translation
MT systems have different strengths:
– Rapidly adaptable: statistical, example-based
– Good grammar: rule-based (linguistic) MT
– High precision in narrow domains: KBMT
– Minority-language MT: learnable from an informant
Combine results of parallel-invoked MT:
– Select the best of multiple translations
– Selection based on optimizing a combination of:
» Target-language joint-exponential model
» Confidence scores of the individual MT engines

14 Illustration of Multi-Engine MT
Three engines translate the same source chunks:
– El punto de descarge → “The drop-off point” / “The discharge point” / “Unload of the point”
– se cumplirá en → “will comply with” / “will self comply in” / “will take place at”
– el puente Agua Fria → “The cold Bridgewater” / “the ‘Agua Fria’ bridge” / “the cold water of bridge”

15 Key Challenges for Corpus-Based MT
Long-range context:
– More is better
– How to incorporate it? How to do so tractably?
– “I conjecture that MT is AI-complete” – H. Simon
– State of the art: trigrams → 4- or 5-grams in the LM only (e.g., Google); SMT → SMT+EBMT (phrase tables, etc.)
Parallel corpora:
– More is better; where to find enough of it? What about rare languages?
– “There’s no data like more data” – IBM SMT, circa 1992
– State of the art: avoids rare languages (GALE only does Arabic & Chinese); symbolic rule-learning from a minimalist parallel corpus (AVENUE project at CMU)

16 Parallel Text: Requiring Less is Better (Requiring None is Best)
Challenge:
– There is just not enough parallel text to approach human-quality MT for most major language pairs (we need ~100X more)
– Much parallel text is not on-point (not on domain)
– Rare languages and some distant pairs have virtually no parallel text
Context-Based MT (CBMT) approach:
– Invented and patented by Meaningful Machines (in New York)
– Requires no parallel text and no transfer rules...
– Instead, CBMT needs:
» A fully-inflected bilingual dictionary
» A (very large) target-language-only corpus
» A (modest) source-language-only corpus [optional, but preferred]

17 CBMT System
[System diagram. Components include: source- and target-language parsers; an n-gram segmenter; n-gram builders (the translation model) with a flooder (the non-parallel-text method); indexed resources (target corpora, optional source corpora, bilingual dictionary, gazetteers); an n-gram connector passing candidate, approved, and stored n-gram pairs through a cross-language n-gram database and cache; an edge locker; TTR substitution requests; and an overlap-based decoder.]

18 Step 1: Source Sentence Chunking
Segment the source sentence into overlapping n-grams via a sliding window:
– Typical n-gram length: 4 to 9 terms
– Each term is a word or a known phrase
– Any sentence length (for the BLEU test: average 27, shortest 8, longest 66 words)
Example: S1 S2 S3 S4 S5 S6 S7 S8 S9 yields
S1 S2 S3 S4 S5 / S2 S3 S4 S5 S6 / S3 S4 S5 S6 S7 / S4 S5 S6 S7 S8 / S5 S6 S7 S8 S9
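
As a sketch, the sliding-window segmenter is a few lines (window fixed at 5 here; the real system varies it from 4 to 9 and treats known phrases as single terms):

    # Step 1 as code: overlapping n-grams via a sliding window.
    def chunk(words, n=5):
        if len(words) <= n:
            return [words]
        return [words[i:i+n] for i in range(len(words) - n + 1)]

    print(chunk("S1 S2 S3 S4 S5 S6 S7 S8 S9".split()))
    # [['S1'..'S5'], ['S2'..'S6'], ['S3'..'S7'], ['S4'..'S8'], ['S5'..'S9']]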

19 Step 2: Dictionary Lookup
Using the (fully inflected) bilingual dictionary, list all possible target translations for each source word or phrase.
Example: source word-string S2 S3 S4 S5 S6 expands to target word lists (the “flooding set”):
S2 → {T2-a, T2-b, T2-c, T2-d}; S3 → {T3-a, T3-b, T3-c}; S4 → {T4-a, T4-b, T4-c, T4-d, T4-e}; S5 → {T5-a}; S6 → {T6-a, T6-b, T6-c}

20 Step 3: Search Target Text
Using the flooding set, search the target text for word-strings containing one word from each group.
Find the maximum number of words from the flooding set in a minimum-length word-string:
– Words or phrases can be in any order
– Ignore function words in the initial step (T5 is a function word in this example)
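
A rough sketch of this flooding search, assuming the corpus is a flat word list and using an arbitrary 12-word window bound (the real system runs over indexed corpora and defers function words):

    # Find the shortest corpus window containing one candidate from
    # each flooding-set group, in any order.
    def flood(corpus, groups):
        # corpus: list of target words; groups: list of sets of candidates
        best = None
        for i in range(len(corpus)):
            seen = [False] * len(groups)
            for j in range(i, min(i + 12, len(corpus))):   # bounded window
                for g, cands in enumerate(groups):
                    if corpus[j] in cands:
                        seen[g] = True
                if all(seen):
                    if best is None or j - i < best[1] - best[0]:
                        best = (i, j)
                    break
        return corpus[best[0]:best[1] + 1] if best else None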

21 Step 3: Search Target Text (Example)
Flooding the target corpus yields Target Candidate 1: T3-b T(x) T2-d T(x) T(x) T6-c — three flooding-set words spread over six positions (T(x) denotes an arbitrary corpus word).

22 Step 3: Search Target Text (Example, continued)
Target Candidate 2: T4-a T6-b T(x) T2-c T3-a — four flooding-set words over five positions.

23 Step 3: Search Target Text (Example, continued)
Target Candidate 3: T3-c T2-b T4-e T5-a T6-a — five contiguous flooding-set words.
Function words are reintroduced after the initial match (e.g. T5).

24 Step 4: Score Word-String Candidates
Scoring of candidates is based on:
– Proximity (minimize extraneous words in the target n-gram → precision)
– Number of word matches (maximize coverage → recall)
– Regular words given more weight than function words
– Combine results (e.g., optimize F1 or p-norm or ...)
Ranking the three candidates:
Candidate                        Proximity  Word matches  “Regular” words  Total
T3-b T(x) T2-d T(x) T(x) T6-c    3rd        3rd           3rd              3rd
T4-a T6-b T(x) T2-c T3-a         —          2nd           —                2nd
T3-c T2-b T4-e T5-a T6-a         1st        1st           1st              1st
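
One plausible reading of the combination step as code (the 0.5 function-word discount and the F1 combination are illustrative choices, not the system's actual weights):

    # Score a candidate by coverage (recall), proximity (precision),
    # and a function-word discount, combined here as F1.
    def score(candidate, flooding_words, function_words):
        hits = [w for w in candidate if w in flooding_words]
        weight = sum(0.5 if w in function_words else 1.0 for w in hits)
        precision = weight / len(candidate)       # penalize extraneous T(x) words
        recall = weight / len(flooding_words)     # reward covering the source
        return 2 * precision * recall / (precision + recall + 1e-9)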

25 Step 5: Select Candidates Using Overlap (propagate context over the entire sentence)
Candidates from adjacent source n-grams are chained where they overlap, e.g.:
– T(x1) T3-c T2-b T4-e / T3-c T2-b T4-e T5-a T6-a / T2-b T4-e T5-a T6-a T(x8)
– T(x2) T4-a T6-b T(x3) T2-c / T4-a T6-b T(x3) T2-c T3-a / T6-b T(x3) T2-c T3-a T(x8)
Other candidates, such as T3-b T(x3) T2-d T(x5) T(x6) T6-c, T(x1) T2-d T3-c T(x2) T4-b, or T6-b T(x11) T2-c T3-a T(x9), fail to chain.

26 Step 5: Select Candidates Using Overlap (continued)
The best translations are selected via maximal overlap:
– Alternative 1: T(x1) T3-c T2-b T4-e + T3-c T2-b T4-e T5-a T6-a + T2-b T4-e T5-a T6-a T(x8) → T(x1) T3-c T2-b T4-e T5-a T6-a T(x8)
– Alternative 2: T(x2) T4-a T6-b T(x3) T2-c + T4-a T6-b T(x3) T2-c T3-a + T6-b T(x3) T2-c T3-a T(x8) → T(x2) T4-a T6-b T(x3) T2-c T3-a T(x8)
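
The overlap join itself is simple; a sketch using the slide's own candidates:

    # Join candidates of adjacent source n-grams when the suffix of one
    # maximally overlaps the prefix of the next, propagating contextual
    # agreement across the whole sentence.
    def overlap(a, b):
        # length of the longest suffix of a equal to a prefix of b
        for k in range(min(len(a), len(b)), 0, -1):
            if a[-k:] == b[:k]:
                return k
        return 0

    def join(a, b, k):
        return a + b[k:]

    c1 = "T(x1) T3-c T2-b T4-e".split()
    c2 = "T3-c T2-b T4-e T5-a T6-a".split()
    k = overlap(c1, c2)      # 3 words overlap
    print(join(c1, c2, k))   # ['T(x1)', 'T3-c', 'T2-b', 'T4-e', 'T5-a', 'T6-a']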

27 A (Simple) Real Example of Overlap
N-grams generated from flooding, connected via overlap:
a United States soldier / United States soldier died / soldier died and two others / died and two others were injured / two others were injured Monday
→ “a United States soldier died and two others were injured Monday”
Systran, for comparison: “A soldier of the wounded United States died and other two were east Monday”
Flooding → n-gram fidelity; overlap → long-range fidelity

28 Comparative System Scores (BLEU; Spanish systems scored on the same test set)
– Systran Spanish: 0.3859
– Google Arabic (’05 NIST): 0.5137
– Google Chinese (’06 NIST): 0.5551
– SDL Spanish: 0.5610
– Google Spanish (’08, top language): 0.7209
– MM (CBMT) Spanish: 0.7447
– MM (CBMT) Spanish, non-blind: 0.7533
– Human scoring range: roughly 0.7–0.85

29 Historical CBMT Scoring
[Chart: BLEU over time, Apr 04–Dec 07, plotted against the human scoring range. Non-blind series: 0.3743, 0.5670, 0.6144, 0.6165, 0.6456, 0.6950, 0.5953, 0.6129; blind series: 0.6374, 0.6393, 0.6267, 0.7059, 0.6694, 0.6929; later points: 0.7354, 0.6645, 0.7365, 0.7276, 0.7447, 0.7533.]

30 Illustrative Translation Examples
Un coche bomba estalla junto a una comisaría de policía en Bagdad
– MM: A car bomb explodes next to a police station in Baghdad
– Systran: A car pump explodes next to a police station of police in bagdad
Hamas anunció este jueves el fin de su cese del fuego con Israel
– MM: Hamas announced Thursday the end of the cease fire with israel
– Systran: Hamas announced east Thursday the aim of its cease-fire with Israel
Testigos dijeron haber visto seis jeeps y dos vehículos blindados
– MM: Witnesses said they have seen six jeeps and two armored vehicles
– Systran: Six witnesses said to have seen jeeps and two armored vehicles

31 A More Complex Example
Un soldado de Estados Unidos murió y otros dos resultaron heridos este lunes por el estallido de un artefacto explosivo improvisado en el centro de Bagdad, dijeron funcionarios militares estadounidenses.
– MM: A United States soldier died and two others were injured monday by the explosion of an improvised explosive device in the heart of Baghdad, American military officials said.
– Systran: A soldier of the wounded United States died and other two were east Monday by the outbreak from an improvised explosive device in the center of Bagdad, said American military civil employees.

32 Summing Up CBMT
What if a source word or phrase is not in the bilingual dictionary?
– Automatically find near-synonyms in the source, replace, and retranslate
Current status:
– Developing a Chinese-to-English CBMT system (no results yet)
– Pre-commercial (field-testing)
– CMU participating in improving CBMT technology
Advantages:
– Most accurate MT method in existence
– Does not require parallel text
Disadvantages:
– Run-time speed: 20 s/word (7/07) → 1.5 s/word (2/08) → 0.1 s/word (12/08, projected)
– Requires heavy-duty servers (not yet miniaturizable)

33 Goal: Learning Syntactic Transfer Rules
1) Flat seed generation: produce rules from word-aligned sentence pairs, abstracted only to the POS level; no syntactic structure.
Example: “The highly qualified applicant visits the company.” / “Der äußerst qualifizierte Bewerber besucht die Firma.”, with alignment ((1,1),(2,2),(3,3),(4,4),(5,5),(6,6)), yields the seed rule
S::S [det adv adj n v det n] → [det adv adj n v det n]
((x1::y1) (x2::y2) ... ((x4 agr) = *3-sing) ... ((y3 case) = *nom) ...)
2) Compositionality: add compositional structure to the seed rule by exploiting previously learned rules. An NP rule can be used to translate part of the sentence; keep or add context constraints and eliminate unnecessary ones:
S::S [NP v det n] → [NP n v det n] ((x1::y1) (x2::y2) ... ((y1 case) = *nom) ...)
NP::NP [det adv adj n] → [det adv adj n] ((x1::y1) ... ((y4 agr) = (x4 agr)) ...)
3) Seeded version space learning: group seed rules into version spaces by constituent sequences and alignments; the seed rules form the S-boundary of the version space; generalize with validation.
Notes: rules in a version space are partially ordered, and generalization proceeds by merging. Merging two rules means (a) deleting a constraint, or (b) raising two value constraints to one agreement constraint, e.g. ((x1 num) = *pl), ((x3 num) = *pl) → ((x1 num) = (x3 num)); the merged rule is then used to translate.

34 Version Spaces (Mitchell, 1980)
[Diagram: a generalization lattice from N specific instances at the bottom (the S boundary) up to “anything” at the top (the G boundary), with the “target” concept lying between the two boundaries; branching factor b.]

35 Original & Seeded Version Spaces
Version spaces (Mitchell, 1980):
– Symbolic multivariate learning
– S & G sets define lattice boundaries
– Exponential worst case: O(b^N)
Seeded version spaces (Carbonell, 2002):
– Generality level → hypothesis seed
– ΔS & ΔG subsets → effective lattice
– Polynomial worst case: O(b^(k/2)), k = 3, 4

36 Seeded Version Spaces (Carbonell, 2002)
[Diagram: the S and G boundaries around the “target” concept for rules of the form X^n → Y^m. Example: “The big book” → “el libro grande”, Det Adj N → Det N Adj, with constraints (Y2 num) = (Y3 num), (Y2 gen) = (Y3 gen), (X3 num) = (Y2 num).]

37 Seeded Version Spaces (continued)
[Diagram: the seed (best guess) sits between the S and G boundaries; a search radius k around the seed bounds the effective lattice for X^n → Y^m rules such as “The big book” → “el libro grande”.]

38 Transfer Rule Learning
One SVS per transfer rule:
– Alignment + generalization level = seed
– Apply the SVS(seed, k) learning method:
» High-confidence seed → small k [k too small increases prob(failure)]
» Low-confidence seed → larger k [k too large increases computation & elicitation]
For translation, use the SVS f-centroid:
– If success % > threshold, halt the SVS
– If no rule is found, k := k+1 and redo (up to max K)

39 Sample Learned Rule
[Rule shown as an image; the next slide gives a learned rule in text form.]

40 Compositionality: A Learned Rule
;; L2: h &niin h rxb b h bxirwt                 ← training example
;; L1: the widespread interest in the election
NP::NP [Det N Det Adj PP] -> [Det Adj N PP]     ← rule type & component sequences
(
 (X1::Y1)(X2::Y3)(X3::Y1)(X4::Y2)(X5::Y4)       ← component alignments
 ((Y3 num) = (X2 num))                          ← agreement constraint
 ((X2 num) = sg)                                ← value constraints
 ((X2 gen) = m)
 ((Y1 def) = +)
)

41 Automated Speech Recognition at CMU: Sphinx and Related Technologies
Sphinx is the name for a family of speech recognition engines, developed by Reddy, Lee & Rudnicky:
– First continuous-speech, speaker-independent large-vocabulary system
– Standard continuous-HMM technology
– Multiple languages
– Open-source code base
– Miniaturized and integrated with MT, dialog, and TTS
– Basis for Native Accent & Carnegie Speech products
Janus system, developed by Waibel and Schultz:
– Started as a neural-net approach
– Now a standard continuous-HMM system
– Multiple languages
– NOT an open-source code base
– Held a WER advantage over Sphinx until recently (now both are equal)

42 Sphinx ASR at Carnegie Mellon
Sphinx systems have evolved along two parallel branches.
Technology innovation occurs in the Research branch, then transitions to the Live branch.
The Live branch focuses on the additional requirements of interactive real-time recognition and on integration with higher-level components such as spoken language understanding and dialog management.
[Timeline, 1986–2006 — Research ASR: Sphinx → Sphinx-2 → Sphinx-3.x → Sphinx-3.7; Live ASR: Sphinx-2L → SphinxL → PocketSphinx.]

43 Some Features of Sphinx
Acoustic modeling:
– CHMM-based representations
– Feature-space reduction (LDA/MLLT)
– Adaptive training (based on VTLN)
– Speaker adaptation (MLLR)
– Discriminative training
Real-time decoding:
– GMM selection (4-level)
– Lexical trees; fixed-size tree copying
– Parameter quantization
Live interaction support:
– End-point detection
– Dynamic class-based language models
– Dynamic language models
– Partial hypotheses
– Confidence annotation
– Dynamic noise processing

44 Spoken Language Processing
[Diagram: speech ↔ text ↔ meaning ↔ action, connected by recognition, understanding, dialogue & discourse, generation, and synthesis, all drawing on task & domain knowledge.]

45 A Generic Speech System
[Diagram: human speech → signal processing → decoder → parser → post-parser → dialog manager, which consults domain agents and drives a language generator → speech synthesizer (machine speech), plus display and effector outputs.]

46 Decoding Speech
Signal processing: reduce the dimensionality of the signal; noise conditioning.
Decoder: transcribe speech to words, using acoustic models, language models, and lexical models — corpus-based statistical models.

47 A Model for Speech Recognition
Speech recognition as an estimation problem:
– A = sequence of acoustic observations
– W = string of words
Find the word string W that is most probable given the acoustics A.

48 Recognition
A Bayesian formulation of the recognition problem — the “fundamental equation” of speech recognition:
W* = argmax_W P(W|A) = argmax_W P(A|W) P(W) / P(A)
where P(W|A) is the posterior, P(W) the prior, P(A|W) the conditional (“the model”), and P(A) the evidence.

49 Understanding Speech
Parser: extracts semantic content from the utterance (using a grammar).
Post-parser: introduces context and world knowledge into the interpretation (context, domain agents).
Supporting work: grounding, knowledge engineering, ontology design, language acquisition.

50 Interacting with the User
Dialog manager: guides the interaction through the task; maps user inputs and system state into actions (task schemas, context, task analysis).
Domain agents: interact with the back-end(s); interpret information using domain knowledge (database, live data such as the Web, domain expertise, knowledge engineering).

51 Communicating with the User
Language generator: decides what to say to the user (and how to phrase it), driven by rules and a database.
Speech synthesizer: constructs sounds and intonation.
Display and action generators produce the non-speech output.

52 Environmental Robustness for the STS
Expected types of disturbance:
– White/broadband noise
– Speech babble
– Single competing speakers
– Other background music
Comments:
– State-of-the-art technologies have been applied mostly to applications where time and computation are unlimited
– Samsung expects reverberation will not be a factor; if reverb actually is in the environment, compensation becomes more difficult

53 Speech Recognition as Pattern Classification
Major functional components:
– Signal processing to extract features from speech waveforms
– Comparison of features to pre-stored representations
Important design choices:
– Choice of features
– Specific method of comparing features to stored representations
[Diagram: speech → feature extraction → features → decision-making procedure → word hypotheses.]

54 Multistage Environmental Compensation for the Samsung STS
Compensation is a combination of:
– ZCAE: separates speech from noise based on direction of arrival
– ANC: cancels noise adaptively
– CDCN, VTS, DCN: provide static and dynamic noise removal, channel compensation, and feature normalization
– CBMF: restores missing features that are lost due to transient noise

55 Binaural Processing for Selective Reconstruction (the ZCAE Algorithm)
– Short-time Fourier analysis separates speech in time and frequency
– Separate time-frequency components according to direction of arrival, based on time-delay comparisons (ITD)
– Reconstruct the signal from only the components believed to arise from the desired source
– Missing-feature restoration can smooth the reconstruction
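
A rough sketch of the selective-reconstruction idea, using the two-channel STFT phase difference as a stand-in for the per-cell delay estimate (ZCAE proper estimates delays from zero crossings, which this sketch does not implement; nperseg and the phase threshold are illustrative):

    # Keep only time-frequency cells whose inter-channel delay (read here
    # from the STFT phase difference) matches the target direction.
    import numpy as np
    from scipy.signal import stft, istft

    def separate(left, right, fs, max_phase=0.5):
        f, t, L = stft(left, fs, nperseg=512)
        _, _, R = stft(right, fs, nperseg=512)
        phase_diff = np.angle(L * np.conj(R))   # proxy for per-cell ITD
        mask = np.abs(phase_diff) < max_phase   # near-zero delay = on-axis
        _, y = istft(L * mask, fs, nperseg=512)
        return y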

56 Performance of Zero-Crossing-Based Amplitude Estimation (ZCAE)
[Audio demo: original vs. separated signals with a 45-degree arrival-angle difference, with no reverberation and with a 300-ms reverberation time.]
ZCAE provides excellent separation in “dry” rooms but fails in reverberation.

57 Adaptive Noise Cancellation (ANC)
– Speech + noise enters the primary channel; correlated noise enters the reference channel
– An adaptive filter attempts to transform the noise in the reference channel to best resemble the noise in the primary channel, then subtracts it
– Performance degrades when speech leaks into the reference channel and in reverberation
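
A minimal normalized-LMS sketch of this arrangement (tap count and step size are illustrative):

    # LMS adaptive noise cancellation: learn to map reference-mic noise
    # onto primary-mic noise and subtract; the error is the cleaned speech.
    import numpy as np

    def anc(primary, reference, taps=32, mu=0.05):
        w = np.zeros(taps)
        out = np.zeros(len(primary))
        for n in range(taps, len(primary)):
            x = reference[n-taps:n][::-1]        # reference history
            y = w @ x                            # estimated noise in primary
            e = primary[n] - y                   # error = cleaned speech
            w += mu * e * x / (x @ x + 1e-8)     # NLMS weight update
            out[n] = e
        return out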

58 Possible Placement of Mics for ZCAE and ANC
One possible arrangement:
– Two omni mics for ZCAE processing
– One uni mic (notch toward the speaker) for the ANC reference
Another option is to place the reference mic on the back, away from the speaker.

59 “Classic” Compensation for Noise and Filtering
Model of signal degradation: clean speech x[m] passes through a linear channel h[m] and picks up additive noise n[m], i.e. z[m] = x[m] * h[m] + n[m].
– Compensation procedures attempt to estimate the characteristics of the filter and the additive noise, and then apply inverse operations
– Algorithms include CDCN, VTS, DCN
– Compensation is applied at the cepstral level (not to waveforms)
– Supersedes Wiener filtering

60 Compensation Results: WSJ Task in White Noise
[Chart: recognition accuracy vs. SNR.] VTS provides a ~7-dB improvement over the baseline.

61 What Does Compensated Speech “Sound” Like?
[Audio demo: speech reconstructed after CDCN processing — clean original vs. “recovered” signals at 0 dB and 20 dB SNR.]
Signals that sound “better” do not produce lower error rates.

62 Missing-Feature Recognition
General approach:
– Determine which cells of a spectrogram-like display are unreliable (or “missing”)
– Ignore missing features, or make the best guess about their values based on the data that are present
Comment:
– Used in the STS for transient compensation and as a supporting technology for ZCAE
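
A sketch of the "determine which cells are unreliable" step, assuming the first few frames are speech-free so they can serve as a noise estimate (frame count and threshold are illustrative; the 0-dB floor matches the slides that follow):

    # Flag spectrogram cells whose estimated local SNR falls below 0 dB.
    import numpy as np

    def missing_mask(power_spec, noise_frames=10, snr_floor_db=0.0):
        # power_spec: [frames x freq] magnitude-squared spectrogram
        noise = power_spec[:noise_frames].mean(axis=0)     # per-bin noise floor
        snr_db = 10 * np.log10(power_spec / (noise + 1e-12) + 1e-12)
        return snr_db >= snr_floor_db    # True = reliable, False = "missing"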

63 Original Speech Spectrogram
[Spectrogram image.]

64 Spectrogram Corrupted by Noise at 15-dB SNR
[Spectrogram image.] Some regions are affected far more than others.

65 Ignoring Regions of the Spectrogram That Are Corrupted by Noise
– All regions with SNR less than 0 dB are deemed missing (dark blue in the image)
– Recognition is performed based on the colored (reliable) regions alone

66 Robustness Summary
The CMU Robust Group will implement an ensemble of complementary technologies for the Samsung STS:
– ZCAE for speech separation by arrival angle
– Adaptive noise cancellation to reduce background interference
– CDCN, VTS, DCN to reduce stationary residual noise and provide channel compensation
– Cluster-based missing-feature compensation to protect against transient interference and to improve ZCAE performance
Most compensation procedures are based on a tight integration of environmental compensation with the core speech recognition engine.

67 Virtual English Village (VEV) [SAMSUNG SLIDE]
Service overview:
– An online 3D virtual world providing language students with an opportunity to practice and develop conversational skills in pre-defined domains/scenarios
– Provides students with a fun, safe environment to develop conversational skills
– Improves students’:
» Grammar, by correcting student input or suggesting other appropriate variations
» Vocabulary, by suggesting other words
» Pronunciation, through clear synthesized speech and correction of student input
– The service requires the student to have access to a microphone
– A complement to existing English-language services in Korea
Target market:
– Initially targeting Korean elementary to high-school students (~8 to 16 years old) with some familiarity with the English language

68 Potential Carnegie Contributions to VEV
– ASR: Sphinx speech recognition (CMU)
– TTS: Festival/FestVox text-to-speech (CMU)
– Let’s Go & Olympus dialog systems (CMU)
– Native Accent pronunciation tutor (Carnegie Speech)
– Grammar & lexicon tutors (Carnegie Speech)
» Story Friends, Story Librarian, ...
» Intelligent tutoring architecture
– Language parsing/generation (CMU & Carnegie Speech)

69 Let’s Go Performance
Automated bus schedules/routes for the Port Authority of Pittsburgh
Has answered the phone every evening and on weekends since March 5, 2005
Presently over 52,000 calls from real users:
– Variety of noise and transmission conditions
– Variety of dialects and other variants
Estimated task success: about 75%
Average number of turns: 12–14
Dialogue characteristics:
– Mixed initiative after the first turn
– Open-ended first turn (“What can I do for you?”)
– Change of topic upon comprehension problems (e.g. “What neighborhood are you in?”)

70 Let’s Go Architecture
– ASR: Sphinx-II
– Parsing: Phoenix
– Dialogue management: Olympus II
– NLG: Rosetta
– Speech synthesis: Festival
– Inter-module hub: Galaxy

71 Olympus II Dialogue Architecture
Improves timing awareness by separating DM decision-making and awareness into separate modules:
– Improves turn-taking
– Higher success after barge-in
– Reduces system cut-ins
The international spoken-dialogue community uses Olympus II for different systems:
– About 10 different applications
» Bus, air reservations, room reservations, movie line, personal minder, etc.
Open source with documentation

72 Olympus II Architecture
[Architecture diagram.]

73 Speech Synthesis
Recognizer: Sphinx-II (enhanced, customized)
Synthesizer: Festival
– Well-known, open source
– Provides very high-quality limited-domain synthesis
» Some people calling Let’s Go think they are talking to a person
Rosetta spoken-language generation:
– Generates natural-language sentences using templates

74 Spoken Dialogue and Language Learning
Correction at any of several levels of language within a dialogue, while keeping the dialogue on track:
– User: “I want to go the airport”
– System: “Going to the airport, at what time?”
Adaptation to non-native speech
Uses F0 unit selection (Raux & Black, ASRU 2003) for emphasis in synthesis
Lexical entrainment for rapid correction:
– Implicit strategy, no break in the dialogue

75 Spoken Dialogue and Language Learning (continued)
Lexical entrainment for rapid correction:
– Collect speech in a WOZ (Wizard-of-Oz) setup; model the expected errors
– Find the closest user utterance: target prompts and the ASR hypothesis are compared via DP-based alignment
– Prompt selection then yields an emphasis/confirmation prompt

