Computing & Information Sciences, Kansas State University. Rabat, Morocco. Human Language Technologies (HLT) Workshop.


1 Computing & Information Sciences, Kansas State University. Rabat, Morocco. Human Language Technologies (HLT) Workshop 2006
Classification-Based Contextual Correction of Mistranslations: A Machine Learning Approach
William H. Hsu
Joint work with: Waleed Al-Jandal, Martin S. R. Paradesi, Tejaswi Pydimarri, Chris Meyer
Thursday, 01 June 2006
Laboratory for Knowledge Discovery in Databases, Kansas State University
http://www.kddresearch.org/KSU/CIS/HLT-Specialized-20060601.ppt

2 A Technical Survey of Statistical MT: Phrase-Based Methods and Metrics

3 Outline
- Background: Statistical Approaches to MT
- State of the Field: Metrics
- Open Problems
- New Approaches, Applications and Software Tools
- Current and Future Research Prospectus

4 Machine Translation: Generic System Architecture
- Input: source language s
- Output: target language t
- Global search (decoder algorithm): argmax_t P(t) * P(s|t)
- Language model P(t): built from target-language text with a language modeling toolkit
- Translation model P(s|t): built from bilingual parallel corpora with a training program (e.g., GIZA)
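The global search on this slide can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration: the German-English sentence pair, the candidate list, the probability values, and the names `LANGUAGE_MODEL`, `TRANSLATION_MODEL`, and `decode`. A real decoder searches a very large hypothesis space rather than scoring a fixed list.

```python
import math

# Hypothetical toy models (not from the slides): log-probabilities for a
# tiny candidate set, standing in for a trained language model P(t) and a
# trained translation model P(s|t).
LANGUAGE_MODEL = {
    "the house is small": math.log(0.04),
    "small the house is": math.log(0.0001),
    "the home is little": math.log(0.01),
}
TRANSLATION_MODEL = {
    ("das haus ist klein", "the house is small"): math.log(0.3),
    ("das haus ist klein", "small the house is"): math.log(0.3),
    ("das haus ist klein", "the home is little"): math.log(0.2),
}


def decode(source, candidates):
    """Return the candidate t that maximizes log P(t) + log P(s|t)."""
    def score(target):
        return LANGUAGE_MODEL[target] + TRANSLATION_MODEL[(source, target)]
    return max(candidates, key=score)


best = decode("das haus ist klein", list(LANGUAGE_MODEL))
print(best)
```

In this toy setting the translation model cannot tell the fluent word order from the scrambled one, so the language model term decides between them.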

5 Background [1]: Phrase Translation Model
- Based on noisy channel model
  - Source: foreign sentence f
  - Target: English sentence e
- Bayesian inference: Maximum A Posteriori (MAP)
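The MAP step named on this slide can be written out with Bayes' rule. The following is the standard noisy-channel derivation in the slide's notation (f foreign, e English); the symbol e_best is introduced here for illustration and does not appear in the transcript.

```latex
\begin{align*}
e_{\text{best}} &= \arg\max_{e} \; p(e \mid f) \\
                &= \arg\max_{e} \; \frac{p(f \mid e)\, p(e)}{p(f)} \\
                &= \arg\max_{e} \; p(f \mid e)\, p(e)
\end{align*}
```

The denominator p(f) is dropped in the last line because it does not depend on e.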

6 Background [2]: Modeling Steps
- Preliminaries: segmentation of foreign input
  - Result: [formula not preserved]
  - Use: lexical analysis tools (string tokenizer, etc.)
- Goal: decoding
  - Segmented input: [formula not preserved]
  - Output: [formula not preserved]
- Distributions
  - Prediction: [formula not preserved]
  - Distortion: a_i = start of f_i, b_{i-1} = end of f_{i-1}
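The formulas on this slide did not survive extraction. Assuming the slide follows the phrase-based model of Koehn, Och & Marcu (2003), which the deck cites in its survey of methods, the segmented input and the two distributions combine as sketched below. The bar notation for phrases and the symbols \phi and d are taken from that paper, not from the transcript.

```latex
% Foreign input f segmented into I phrases \bar{f}_1, \dots, \bar{f}_I,
% each translated into an English phrase \bar{e}_i.
p(\bar{f}_1^I \mid \bar{e}_1^I)
  = \prod_{i=1}^{I} \phi(\bar{f}_i \mid \bar{e}_i)\; d(a_i - b_{i-1})
```

Here \phi is the phrase translation (prediction) distribution and d is the distortion distribution, evaluated on the gap between the start a_i of the current foreign phrase and the end b_{i-1} of the previous one.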

7 Background [3]: Probabilistic Formulation
- Length normalization factor: [formula not preserved]
- Language model (p_LM): trigram [Seymour and Rosenfeld, 1997]
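One common way to combine a length factor with a trigram language model, as in Koehn, Och & Marcu (2003), is sketched below. Treating this as the formula that originally appeared on the slide is an assumption; the symbol \omega and the word index k are illustrative.

```latex
\begin{align*}
e_{\text{best}} &= \arg\max_{e} \; p(f \mid e)\; p_{LM}(e)\; \omega^{\mathrm{length}(e)} \\
p_{LM}(e)       &= \prod_{k=1}^{\mathrm{length}(e)} p(e_k \mid e_{k-2},\, e_{k-1})
\end{align*}
```

A value of \omega greater than 1 favors longer outputs, which offsets the language model's tendency to prefer short sentences.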

8 Methods for Learning in MT: Survey
- Transformation-Based Learning (TBL)
- Example-Based Machine Translation (EBMT)
- Symbolic AI: Frames, Conceptual Grammars, Analogy, CBR
- Statistical
  0. classical / naïve (cf. Weaver’s correspondence with Wiener)
  1. phrase alignments from word-aligned model [Och & Ney, 2000]
  2. linguistically motivated models [Yamada & Knight, 2001]
  3. joint phrase model [Marcu & Wong, 2002]
  4. generative phrase alignment [Koehn, Och & Marcu, 2003]
  5. hierarchical models [Chiang, 2005; Taskar, 2005]
  6. new approaches

9 Outline
- Background: Statistical Approaches to MT
- State of the Field: Metrics
- Open Problems
- New Approaches, Applications and Software Tools
- Current and Future Research Prospectus

10 Bilingual Evaluation Understudy (BLEU) [1]: Papineni et al. (ACL, 2002)
- N-gram precision (score is between 0 and 1)
  - What percentage of machine n-grams can be found in the reference translation?
  - n-gram: sequence of n units (words)
  - Not allowed to use same portion of reference translation twice (can’t cheat by repetition)
- Brevity penalty
  - Can’t just type out single word “the”
- Notation: p_n = n-gram precision; w_n = positive weights; r = words in reference; c = words in machine output
- Hard to “game” the system (i.e., change machine output so that BLEU goes up, but quality doesn’t)
Reference (human) translation: The U.S. island of Guam is maintaining a high state of alert after the Guam airport and its offices both received an e-mail from someone calling himself the Saudi Arabian Osama bin Laden and threatening a biological/chemical attack against public places such as the airport.
Machine translation: The American [?] international airport and its the office all receives one calls self the sand Arab rich business [?] and so on electronic mail, which sends out ; The threat will be able after public place and so on the airport to start the biochemistry attack, [?] highly alerts after the maintenance.
Adapted from Knight (2003)
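The two ingredients on this slide, clipped n-gram precision and the brevity penalty, combine in Papineni et al. (2002) as BLEU = BP * exp(sum over n of w_n * log p_n), with BP = 1 if c > r and exp(1 - r/c) otherwise. The Python sketch below is a simplified sentence-level version under stated assumptions: uniform weights w_n = 1/N, whitespace tokenization, and no smoothing. The function names are illustrative, and the published metric is computed over a whole test corpus rather than one sentence.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, references, n):
    """Fraction of candidate n-grams found in a reference, where each
    n-gram is credited at most as often as it occurs in any single
    reference (so repetition cannot inflate the score)."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    matched = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return matched / sum(cand_counts.values())


def brevity_penalty(c, r):
    """1 when the candidate is longer than the reference, else exp(1 - r/c)."""
    if c == 0:
        return 0.0
    return 1.0 if c > r else math.exp(1.0 - r / c)


def bleu(candidate, references, max_n=4):
    """Sentence-level BLEU with uniform weights and no smoothing."""
    precisions = [clipped_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # log(0) is undefined; unsmoothed BLEU is zero here
    c = len(candidate)
    # Effective reference length: the reference closest in length to c.
    r = min((len(ref) for ref in references),
            key=lambda length: (abs(length - c), length))
    log_mean = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(c, r) * math.exp(log_mean)


reference = "the gunman was shot to death by the police .".split()
repeated = ["the"] * 4
print(clipped_precision(repeated, [reference], 1))
print(bleu(reference, [reference]))
```

The `repeated` candidate shows the clipping rule at work: the reference contains "the" only twice, so only two of the four candidate tokens are credited.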

11 Bilingual Evaluation Understudy (BLEU) [2]: Multiple Reference Translations
Reference translation 1: The U.S. island of Guam is maintaining a high state of alert after the Guam airport and its offices both received an e-mail from someone calling himself the Saudi Arabian Osama bin Laden and threatening a biological/chemical attack against public places such as the airport.
Reference translation 2: Guam International Airport and its offices are maintaining a high state of alert after receiving an e-mail that was from a person claiming to be the wealthy Saudi Arabian businessman Bin Laden and that threatened to launch a biological and chemical attack on the airport and other public places.
Reference translation 3: The US International Airport of Guam and its office has received an email from a self-claimed Arabian millionaire named Laden, which threatens to launch a biochemical attack on such public places as airport. Guam authority has been on alert.
Reference translation 4: US Guam International Airport and its office received an email from Mr. Bin Laden and other rich businessman from Saudi Arabia. They said there would be biochemistry air raid to Guam Airport and other public places. Guam needs to be in high precaution about this matter.
Machine translation: The American [?] international airport and its the office all receives one calls self the sand Arab rich business [?] and so on electronic mail, which sends out ; The threat will be able after public place and so on the airport to start the biochemistry attack, [?] highly alerts after the maintenance.
© 2003 Knight, K.

12 Bilingual Evaluation Understudy (BLEU) [3]: Tracking Human Judgment (variant of BLEU) Courtesy G. Doddington (NIST)

13 Bilingual Evaluation Understudy (BLEU) [4]: Metrics in Action
Foreign original: 枪手被警方击毙。
Reference translation: the gunman was shot to death by the police.
Machine translations:
#1 the gunman was police kill.
#2 wounded police jaya of
#3 the gunman was shot dead by the police.
#4 the gunman arrested by police kill.
#5 the gunmen were killed.
#6 the gunman was shot to death by the police.
#7 gunmen were killed by police ?SUB>0 ?SUB>0
#8 al by the police.
#9 the ringer is killed by the police.
#10 police killed the gunman.
Legend: green = 4-gram match (good!); red = word not matched (bad!)
© 2003 Kevin Knight

14 Issues with BLEU
- Significance of High Correlation with Human Judgment
- Sensitivity: Need to Recalibrate for Corpus, Language?
- Generalizability to Other Translation Tasks
  - Causal explanation
  - Associative reasoning in customer relationship management
  - Collaborative recommendation
  - Diagnosis (form of gisting)
  - Speech-to-speech
- Meaning

15 New Technologies and Transfer Plan

16 Outline
- Background: Statistical Approaches to MT
- State of the Field: Metrics
- Open Problems
- New Approaches, Applications and Software Tools
- Current and Future Research Prospectus

17 Context-Driven NLP: MT Applications
- Classical Natural Language Processing (NLP)
  - (Noun and verb) phrase extraction
  - Detection of named entity phrases
  - Word sense disambiguation
  - Spelling correction
- Interlingual Challenges
  - Making use of mixed resources: bilingual & monolingual
  - Semi-supervised learning
- Applications
  - Mixed-mode (semi-interactive) MT: assistive technology
  - Correcting mistranslations

18 Translating Named Entity Phrases [1]: Arabic-English Application
The frequency of named-entity phrases in news text reflects the significance of the events they are associated with, so the same news is likely to be reported in many languages.
For example: the Arabic newspaper article is about negotiations between the US and North Korean authorities regarding the search for the remains of US soldiers who died during the Korean war.
[Knight & Al-Onaizan, 2001]

19 Translating Named Entity Phrases [2]: Two-Phase Approach
- Generate ranked list of translation candidates
  - Bilingual resources: parallel corpus
  - Monolingual resources
- Re-score list of candidates using different monolingual clues
[Knight & Al-Onaizan, 2001]
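The two phases can be sketched in Python as follows. This is only an illustration of the generate-then-re-score pattern, not the method of the cited work: the linear interpolation, the `weight` parameter, the function name `rerank`, the candidate names, and the counts are all assumptions.

```python
def rerank(candidates, monolingual_counts, weight=0.5):
    """Re-score phase-one candidates with a monolingual frequency clue.

    candidates: list of (translation, phase_one_score) pairs, where the
        score is assumed to come from bilingual resources.
    monolingual_counts: hypothetical occurrence counts of each candidate
        in target-language text.
    weight: assumed interpolation weight given to the monolingual clue.
    """
    total = sum(monolingual_counts.get(t, 0) for t, _ in candidates) or 1
    rescored = []
    for translation, score in candidates:
        mono = monolingual_counts.get(translation, 0) / total
        rescored.append((translation, (1 - weight) * score + weight * mono))
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)


# Phase one: hypothetical ranked candidates for one named-entity phrase.
phase_one = [("Jon Smith", 0.5), ("John Smith", 0.4), ("Jhon Smith", 0.1)]
# Phase two: hypothetical counts from a monolingual English corpus.
counts = {"John Smith": 90, "Jon Smith": 10}

ranked = rerank(phase_one, counts)
print(ranked[0][0])
```

In this toy run the monolingual evidence promotes the second-ranked spelling over the candidate that phase one preferred.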

20 Correcting Faulty Translations
- Human-Assistive Technology
- Semi-Supervised: Two Training Corpora
  - Labeled: “bad translations” and “near misses”
  - Unlabeled: candidate translations
- Interactive Aspect
  - “Which of these translations is right?”
  - “Why is this candidate incorrect?”
- Application: Boosting Accuracy of SMT

21 Boosting the Accuracy of SMT
- Parsing [Koehn et al., 2003]
  - Pro: Found to slow growth of translation tables
  - Con: Limited effect on BLEU
- Context-Specificity
  - Supported by computational linguistic theory
  - Some positive results in NLP prediction tasks [Elman, 1994]
  - Very effective in sequence learning [Barash & Friedman, 2001]
  - Important for Relational and First-Order Representations
- New Work: Semi-Supervised Approaches

22 Outline
- Background: Statistical Approaches to MT
- State of the Field: Metrics
- Open Problems
- New Approaches, Applications and Software Tools
- Current and Future Research Prospectus

23 Popular SMT Tools
- Translation model generator: GIZA++
- Search decoder: PHARAOH, ISI ReWrite Decoder
- Language model generator: SRILM, CMU-Cambridge Statistical Language Modeling Toolkit
- EGYPT: a toolkit for SMT that consists of GIZA/GIZA++ and word alignment tools
- Evaluation packages: MTEVAL, GMT
- Metrics: BLEU, NIST, n-grams, WER, PER and SSER
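Of the metrics listed, word error rate (WER) is the simplest to state: the word-level edit distance between hypothesis and reference, divided by the reference length. The Python sketch below assumes whitespace tokenization and a single reference; the function name `word_error_rate` is illustrative and is not the interface of any of the evaluation packages named on the slide.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # reference word missing from hypothesis
                current[j - 1] + 1,      # extra word in hypothesis
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1] / len(ref)


reference = "the gunman was shot to death by the police"
hypothesis = "the gunman was shot dead by the police"
print(word_error_rate(reference, hypothesis))
```

Unlike BLEU, lower is better here, and the value can exceed 1 when the hypothesis is much longer than the reference.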

24 Software Tools for Graphical Models: BNJ v3
[Figure: ALARM Network]
© 2005 KSU Bayesian Network tools in Java (BNJ) Development Team

25 Outline
- Background: Statistical Approaches to MT
- State of the Field: Metrics
- Open Problems
- New Approaches, Applications and Software Tools
- Current and Future Research Prospectus

26 Current Work
- Development: End-to-End SMT System for NIST 2006 Evaluation
  - Arabic-English
  - Chinese-English
- Assemblage of Parallel Corpora
- Software Library Development: SMT Modules
  - Aligners
  - Parsers
  - Phrase-Based Learning
  - Transformation-Based Learning (TBL)
- Development of Graphical Models Toolkit
  - BNJ v4 under development: http://bnj.sourceforge.net
  - Integration with KSU SMT library
- Applications: Relational Link Mining in Social Networks
© 2005 Walker Blogs

27 Future Research Directions
[Figure: MT Strategies (1954-2006). Slide courtesy of Laurie Gerber]
- Axis: Knowledge Representation Strategy, from Shallow/Simple to Deep/Complex. Labels: Word-based only; Phrase tables; Syntactic Constituent Structure; Semantic analysis; Interlingua
- Axis: Knowledge Acquisition Strategy, from All manual to Fully automated. Labels: Hand-built by experts; Hand-built by non-experts; Electronic dictionaries; Learn from annotated data; Learn from unannotated data
- Systems plotted: Original direct approach; Typical transfer system; Classic interlingual system; Example-based MT; Original statistical MT
- New Research: Context-Specificity

28 Questions and Discussion

