The Use of Speech in Speech-to-Speech Translation
Andrew Rosenberg, 8/31/06, Weekly Speech Lab Talk

Presentation transcript:

1. The Use of Speech in Speech-to-Speech Translation. Andrew Rosenberg, 8/31/06, Weekly Speech Lab Talk

2. Candidacy Exam Organization
- Use and Meaning of Intonation
- Automatic Analysis of Intonation
- Applications: Speech-to-Speech Translation; L2 Learning Systems

3. The Use of Speech in Speech-to-Speech Translation
The Use of Prosodic Event Information:
- On the Use of Prosody in a Speech-to-Speech Translator. Strom et al. 1997
- A Japanese-to-English Speech Translation System: ATR-MATRIX. Takezawa et al. 1998
Cascaded / Loosely Coupled Approaches:
- Janus-III: Speech-to-Speech Translation in Multiple Languages. Lavie et al. 1997
- A Unified Approach in Speech Translation: Integrating Features of Speech Recognition and Machine Translation. Zhang et al. 2004
Integrated / Tightly Coupled Approaches:
- Finite-State Speech-to-Speech Translation. Vidal 1997
- On the Integration of Speech Recognition and Statistical Machine Translation. Matusov et al. 2005
- Coupling vs. Unifying: Modeling Techniques for Speech-to-Speech Translation. Gao 2003
[Diagram: the two architectures compared in the talk, a cascade (ASR, then MT, then TTS) and an integrated pipeline (joint ASR + MT, then TTS).]

4. (Outline slide repeated; next section: The Use of Prosodic Event Information.)

5. On the Use of Prosody in a Speech-to-Speech Translator. Strom et al. 1997
- INTARC: a German-English translator produced for the VERBMOBIL project. Spontaneous speech in a limited domain (appointment scheduling); 80 minutes of prosodically labeled speech.
- Phrase Boundary (PB) Detector: Gaussian classifier based on F0, energy, and timing features over a 4-syllable window (accuracy 80.76%); a minimal sketch of this classifier follows this slide.
- Focus Detector: rule-based approach that identifies the location of the steepest F0 decline (accuracy 78.5%).
- Syntactic parsing search space is reduced by 65%. Baseline syntactic parsing uses:
  - Decoder factor: product of acoustic and bigram scores
  - Grammar factor: grammar-model probability of a parse using the hypothesized word
  - Prosody factor: 4-gram model of prosodic events (focus and PB)
- Semantic parsing search space is reduced by 24.7%. The semantic grammar was augmented, labeling rules as "segment-connecting" (SC) and "segment-internal" (SI): SC rules are applied when there is a PB between segments, SI rules when there is not.
- Ideal phrase boundaries reduced the number of hypotheses by 65.4% (analysis trees by 41.9%). Automatically hypothesized PBs required a backoff mechanism to handle errors and PBs that do not align with grammatical phrase boundaries.
- Prosodically driven translation is used when deep transfer (translation) fails: a focused word determines (probabilistically) a dialog act, which is translated based on the available information from the word chain. Correct: 50%, Incomplete: 45%, Incorrect: 5%.
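A minimal sketch of the kind of Gaussian phrase-boundary classifier the slide describes. The two-class setup, diagonal covariances, and the exact feature layout (F0, energy, and timing features pooled over a 4-syllable window) are assumptions for illustration, not details taken from Strom et al.

    import numpy as np

    class GaussianPBClassifier:
        # One multivariate Gaussian (diagonal covariance) per class,
        # boundary vs. no-boundary. The diagonal form is an assumption;
        # the paper says only "Gaussian classifier".
        def fit(self, X, y):
            self.classes_ = np.unique(y)
            self.means_ = {c: X[y == c].mean(axis=0) for c in self.classes_}
            self.vars_ = {c: X[y == c].var(axis=0) + 1e-6 for c in self.classes_}
            self.priors_ = {c: np.mean(y == c) for c in self.classes_}
            return self

        def _log_joint(self, X, c):
            m, v = self.means_[c], self.vars_[c]
            log_lik = -0.5 * (np.log(2 * np.pi * v) + (X - m) ** 2 / v).sum(axis=1)
            return log_lik + np.log(self.priors_[c])

        def predict(self, X):
            scores = np.stack([self._log_joint(X, c) for c in self.classes_])
            return self.classes_[scores.argmax(axis=0)]

    # Hypothetical usage: one row per syllable, columns are F0/energy/timing
    # features pooled over a 4-syllable window.
    X = np.random.randn(200, 12)
    y = np.random.randint(0, 2, size=200)  # 1 = phrase boundary after this syllable
    print(GaussianPBClassifier().fit(X, y).predict(X[:5]))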

6. A Japanese-to-English Speech Translation System: ATR-MATRIX. Takezawa et al. 1998
- Limited-domain translation system (hotel reservations); cascaded approach:
  - ASR: sequential model, ~2k-word vocabulary
  - MT: syntactically driven, ~12k-word vocabulary
  - TTS: CHATR (now unit selection, then concatenative)
- Early example of "interactive" speech-to-speech translation: when the system has low confidence in either the recognition or the MT output, it prompts the user for corrections.
- Speech information is used in three ways in ATR-MATRIX:
  - Voice selection: based on the source voice, either a male or female voice is used for synthesis.
  - Hypothesized phrase boundaries: using pause information along with POS n-gram information, the source utterance is divided into "meaningful chunks" for translation (sketched after this slide).
  - Phrase-final behavior: if a phrase-final rise is detected, it is passed to the MT module as a "lexical" item potentially indicating a question.
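A toy version of the pause-based chunking step. The real system combines pause durations with POS n-gram scores; only the pause rule is shown here, and the (token, pause) representation and the 200 ms threshold are invented for illustration.

    def chunk_by_pause(words, pause_threshold=0.2):
        # words: list of (token, pause_after_in_seconds) pairs.
        chunks, current = [], []
        for token, pause in words:
            current.append(token)
            if pause >= pause_threshold:  # a long pause closes a chunk
                chunks.append(current)
                current = []
        if current:
            chunks.append(current)
        return chunks

    # Hypothetical recognizer output with per-word trailing pauses.
    print(chunk_by_pause([("heya", 0.05), ("kara", 0.02), ("onegai", 0.31), ("shimasu", 0.08)]))
    # -> [['heya', 'kara', 'onegai'], ['shimasu']]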

7. (Outline slide repeated; next section: Cascaded / Loosely Coupled Approaches.)

8. Janus-III: Speech-to-Speech Translation in Multiple Languages. Lavie et al. 1997
- Interlingua and frame-slot based Spanish-English translation; limited domain (conference registration), spontaneous speech; cascaded approach.
- Two semantic parse techniques:
  - GLR* interlingua parsing (transcript 82.9%; ASR 54%): a manually constructed grammar parses input into an interlingua; robust, does not require "grammatically correct" input; searches for the maximal subset covered by the grammar; generation is performed by an interlingua generator.
  - Phoenix (transcript 76.3%; ASR 48.6%): identifies key concepts and their structure; the parsing grammar contains specific patterns representing domain concepts, compiled into a "recursive transition network"; each concept has one or more fixed phrasings in the target language.
  - Phoenix is used as a backoff when GLR* fails: transcript 83.3%; ASR 63.6%.
- Late-stage disambiguation: multiple translations are processed through the whole system, and translation-hypothesis selection occurs just before generation, using scores from recognition, parsing, and discourse processing (a minimal sketch follows this slide).
[Diagram: cascaded ASR, MT, TTS pipeline with a disambiguation step before generation.]
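A compact sketch of the late-stage disambiguation step: several hypotheses survive to just before generation, and one is chosen by combining scores. The score names, the unit weights, and the example values are illustrative assumptions; the paper does not publish this as code.

    WEIGHTS = {"asr": 1.0, "parse": 1.0, "discourse": 1.0}  # assumed weights

    def select_hypothesis(hypotheses, weights=WEIGHTS):
        # Each hypothesis carries log scores from recognition, parsing,
        # and discourse processing; pick the best weighted combination.
        def combined(h):
            return sum(weights[k] * h[k] for k in weights)
        return max(hypotheses, key=combined)

    best = select_hypothesis([
        {"text": "I want to register", "asr": -12.3, "parse": -4.1, "discourse": -0.7},
        {"text": "I won't to register", "asr": -11.9, "parse": -9.5, "discourse": -2.2},
    ])
    print(best["text"])  # -> I want to register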

9. A Unified Approach in Speech Translation: Integrating Features of Speech Recognition and Machine Translation. Zhang et al. 2004
- Process many hypotheses, then select one. In a cascaded architecture:
  - HMM-based ASR produces N-best recognition hypotheses.
  - IBM Model 4 MT processes all N.
  - Rescore MT hypotheses with a weighted log-linear combination of ASR and MT features (sketched after this slide).
  - Construct the feature-weight model by optimizing a translation distance metric (mWER, mPER, BLEU, NIST).
- Experimental results:
  - Corpus: 162k/510/508 Japanese-English parallel sentences.
  - Baseline: no optimization of MT features.
  - Substantial improvement was obtained by optimizing feature weights against a distance metric; additional improvement was achieved by including ASR features.
  - Translating N-best ASR hypotheses improved sentence translation accuracy on incorrectly recognized 1-best hypotheses by 7.5%.
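A sketch of the log-linear rescoring rule, score(h) = sum_i w_i * f_i(h), applied to an N-best list. The feature names, weights, and scores are invented; in the paper the weights are tuned to optimize a translation distance metric rather than set by hand.

    def rescore_nbest(nbest, weights):
        # nbest: list of dicts mapping feature name -> log score,
        # plus the hypothesis text under "text".
        def score(hyp):
            return sum(w * hyp[name] for name, w in weights.items())
        return max(nbest, key=score)

    weights = {"asr_acoustic": 0.8, "asr_lm": 0.5, "mt_model4": 1.0, "mt_lm": 0.7}
    nbest = [
        {"text": "please book a room", "asr_acoustic": -102.4, "asr_lm": -18.2,
         "mt_model4": -35.0, "mt_lm": -12.1},
        {"text": "please look a room", "asr_acoustic": -101.8, "asr_lm": -21.5,
         "mt_model4": -44.3, "mt_lm": -15.9},
    ]
    print(rescore_nbest(nbest, weights)["text"])  # -> please book a room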

10. (Outline slide repeated; next section: Integrated / Tightly Coupled Approaches.)

11. Finite-State Speech-to-Speech Translation. Vidal 1997
- FSTs apply naturally to translation, and FSTs for statistical MT can be learned from parallel corpora with OSTIA (the Onward Subsequential Transducer Inference Algorithm).
- Speech input is handled in two ways:
  - Baseline cascaded approach.
  - Integrated approach: create an FST on text, then replace each edge with an acoustic model of the lexical item. A major drawback of this approach is its large training-data requirement; to reduce it, align the source and target utterances, reducing their "asynchronicity", and cluster lexical items, reducing the vocabulary size. (A toy transducer illustrating delayed output appears after this slide.)
- Proof-of-concept experiment:
  - Text: ~30 lexical items used in 16k paired sentences (Spanish-English); greater than 99% translation accuracy.
  - Speech: 50k/400 (training/testing) paired utterances, spoken by 4 speakers; best performance 97.2% translation accuracy and 97.4% recognition accuracy, which requires including source and target 4-gram LMs in FST training.
- Travel-domain experiment:
  - Text: ~600 lexical items in 169k/2k paired sentences; 0.7% translation WER with categorization, 13.3% WER without.
  - Speech: 336 test utterances (~3k words) spoken by 4 speakers; the text transducer was used, with edges replaced by concatenations of "phonetic elements" modeled by a continuous HMM; 1.9% translation WER and 2.2% recognition WER.
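A toy subsequential transducer showing why translation fits the FST framework: output can be delayed across several input symbols to absorb word-order differences, the "asynchronicity" the slide mentions. The states, vocabulary, and transitions are invented for illustration; real transducers are learned from parallel data with OSTIA.

    TRANSITIONS = {
        # (state, source word) -> (next state, emitted target words)
        ("q0", "una"): ("q1", []),                 # delay output until the noun
        ("q1", "habitación"): ("q0", ["a", "room"]),
        ("q0", "por"): ("q2", []),
        ("q2", "favor"): ("q0", ["please"]),
    }

    def fst_translate(source_tokens):
        state, output = "q0", []
        for tok in source_tokens:
            state, emitted = TRANSITIONS[(state, tok)]
            output.extend(emitted)
        return output

    print(fst_translate(["una", "habitación", "por", "favor"]))
    # -> ['a', 'room', 'please']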

12. On the Integration of Speech Recognition and Statistical Machine Translation. Matusov et al. 2005
- Use word lattices weighted by HMM ASR scores as input to a weighted FST for translation.
- Noisy channel model using an alignment model A; instead of modeling the alignment, search for the best alignment (a reconstruction of the decision rule follows this slide).
- Evaluation:
  - Material: 4 parallel corpora of spontaneous speech in the travel domain; 3k-66k paired sentences in Italian-English, Spanish-English, and Spanish-Catalan; vocabulary sizes 1.7k-15k words.
  - On all metrics (mWER, mPER, BLEU, NIST), the translation results rank as follows, best first: correct text; word lattice with acoustic scores; fully integrated ASR and MT (FUB Italian-English only); word lattice without acoustic scores; single-best ASR hypothesis (though the single best achieves lower mPER than the lattice without scores on FUB Italian-English).
  - Denser ASR lattices yield reduced translation WER (on FUB Italian-English).
[Slide annotations on the equations: best English sentence, sentence length, French audio, target LM, translation model, length of source, aligned target word, lexical context, acoustic context.]
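A reconstruction, in the notation common to this line of work, of the decision rule that the slide annotations describe; the exact decomposition on the original slide may differ:

    \hat{e}_1^I = \arg\max_{I,\, e_1^I} \left\{ \Pr(e_1^I) \cdot \Pr(x_1^T \mid e_1^I) \right\}

    \Pr(x_1^T \mid e_1^I) \approx \max_{J,\, f_1^J} \left\{ \Pr(f_1^J \mid e_1^I) \cdot \Pr(x_1^T \mid f_1^J) \right\}

Here e_1^I is the best English sentence of length I, x_1^T the French audio, Pr(e_1^I) the target LM, and Pr(f_1^J | e_1^I) the translation model, inside which the alignment is maximized over rather than summed; Pr(x_1^T | f_1^J) is read off the acoustically scored word lattice.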

13. Coupling vs. Unifying: Modeling Techniques for Speech-to-Speech Translation. Gao 2003
- Application of direct modeling to ASR, with the goal of directly modeling interlingua text for MT. A direct model of target text from source acoustics could also be constructed with this approach.
- Composing models (e.g., noisy channel models) can lead to local or sub-optimal solutions. Direct modeling tries to avoid these by creating a single maximum entropy model p(text | acoustics, ...), and it can also include other non-independent observations (features). (A toy version appears after this slide.)
- Major considerations:
  - To limit computational complexity, acoustic features are quantized.
  - Since the feature vector can get very large, reliable feature selection is necessary; in preliminary experiments, 150M features were reduced to 500K via feature selection.
[Diagram: graphical models over states s_t, s_{t-1}, s_{t-2} and observations o_t, o_{t-1}, o_{t-2}, with semantic labels L_i, words W_i, phonemes F_{j-1}, F_j, and subphone observations.]
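A toy conditional maximum entropy model p(text | observations) to make the "single direct model" idea concrete. The features, weights, and candidates are invented; the real models use quantized acoustic features and aggressive feature selection, as the slide notes.

    import math

    def maxent_prob(target, candidates, feats, weights):
        # p(target | observations) = exp(w·f(target)) / sum_c exp(w·f(c)),
        # where f is computed from the candidate text jointly with the
        # (quantized) acoustic observations.
        def score(c):
            return sum(weights.get(name, 0.0) * val for name, val in feats(c).items())
        z = sum(math.exp(score(c)) for c in candidates)
        return math.exp(score(target)) / z

    # Hypothetical features over two candidate transcripts.
    def feats(c):
        return {"length": len(c.split()), "first=" + c.split()[0]: 1.0}

    weights = {"length": 0.1, "first=hello": 0.5}
    candidates = ["hello there", "yellow hair"]
    print(round(maxent_prob("hello there", candidates, feats, weights), 3))  # ≈ 0.622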

14. (Outline slide repeated in closing.)

15. Thank you.

