2008 – copyright SYSTRAN SYSTRAN Challenges and Recent Advances in Hybrid Machine Translation Jean Senellart, Jin Yang, Jens Stephan

1 SYSTRAN Challenges and Recent Advances in Hybrid Machine Translation. Jean Senellart, Jin Yang, Jens Stephan (jyang@systransoft.com)

2 Overview: SYSTRAN – 40 years of innovation; The MT Challenges; SYSTRANLab Projects; Hybrid Engines; From Research to Products; CWMT08; Conclusions

3 SYSTRAN: 40 years of history; located in Paris (La Défense) and San Diego; 70+ employees (~20 linguists, ~30 engineers), including 10 PhDs

4 Core Technology: the core technology is “Rule-Based”, based on language description; Analysis – Transfer – Generation paradigm; builds a « syntax tree » based on hierarchical constituents with multi-level relationships; multi-pass analysis (Morphology Analysis, Homograph Resolution, Clause Boundary, Syntagm Identification, Syntactic Role Identification, …); relies heavily on linguistic resources
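
The Analysis – Transfer – Generation paradigm named on this slide can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the tiny English-to-French lexicons, the single reordering rule, and the function names are hypothetical and say nothing about how SYSTRAN's engine is actually implemented.

```python
# Toy lexicons (hypothetical, for illustration only).
ANALYSIS_LEXICON = {"the": "DET", "red": "ADJ", "car": "NOUN"}
TRANSFER_DICT = {"the": "la", "red": "rouge", "car": "voiture"}


def analyze(sentence):
    """Analysis: tokenize and tag each word with a part of speech."""
    return [(w, ANALYSIS_LEXICON.get(w, "UNK")) for w in sentence.lower().split()]


def transfer(tagged):
    """Transfer: look up target words, then apply one structural rule
    (ADJ NOUN -> NOUN ADJ). Unknown words pass through unchanged."""
    out = [(TRANSFER_DICT.get(w, w), pos) for w, pos in tagged]
    i = 0
    while i < len(out) - 1:
        if out[i][1] == "ADJ" and out[i + 1][1] == "NOUN":
            out[i], out[i + 1] = out[i + 1], out[i]
            i += 2  # skip past the pair just reordered
        else:
            i += 1
    return out


def generate(units):
    """Generation: linearize the transferred units into a target string."""
    return " ".join(word for word, _ in units)


def translate(sentence):
    return generate(transfer(analyze(sentence)))


print(translate("The red car"))
```

A real rule-based engine works on a full syntax tree built over several analysis passes; this sketch only shows how the three stages hand their results to one another.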

5 2008 – copyright SYSTRAN

6 Languages
Chinese       882    Korean           78
Arabic        422    Italian          62
Spanish       358    Ukrainian        47
English       350    Polish           42
Hindi         325    Dutch            23
Portuguese    250    Serbo-Croatian   21
Russian       170    Greek            18
French        130    Czech            12
Japanese      125    Albanian          6
Urdu          100    Slovak            6
German        100
Farsi          82
3600
22 source languages; 70 language pairs
Dictionaries: 200K-1M entries per LP; ~6M reference multi-source / multi-target dictionary

7 SYSTRAN Activity: Retail products (Windows Desktop Product, SYSTRAN Mobile on PDA, Mac OS Dashboard Widget); Online Services (SYSTRAN Box, SYSTRAN Net, SYSTRAN Links); Corporate customers (Symantec, Cisco, Verizon, Ford, Daimler, Chemical Abstract…); Institutional customers (EC and US agencies); Portals - Online Translation (“Babel Fish”, Google, Yahoo!, Microsoft Live, …)

8 MT Challenges: RBMT/SMT Strengths and Weaknesses - I. A rule-based system builds a translation with available linguistic resources (dictionaries, rules). Human-built resources; incremental; the translation process can be tracked; predictable output. Some phenomena are hard to formalize; needs semantic/pragmatic knowledge; not designed to deal with exceptions to the rules … which are very frequent.

9 MT Challenges: RBMT/SMT Strengths and Weaknesses - II. A statistical system finds a translation within a choice of many, many possible translations. Very easy to build; automatic training process; knowledge acquisition is easy…; not limited to predefined linguistic patterns (“phrase”); … but cannot “understand” or generalize information; not even elementary rules; output is “unpredictable”.

10 MT Challenges: Corpus-Based or Rule-Based Approach? No conflict between “corpus” and “rule-based” approaches. Possible to learn rules: already learns terminology (monolingual and multilingual); some approaches acquire complex rules. Possible to find the best translation amongst several translations: “decoding” can be constrained by syntactic restrictions. Linguistic rules, but corpus drives!

11 SYSTRANLab Research Projects: Overview; Toward Hybrid Engines; Collaborations; Statistical Post-Edition; Lattice Decoding; Source Analysis; Adaptation; From Research to Products

12 Research Projects. Resources Acquisition: consolidating a 6M entry multilingual dictionary; acquiring more from corpus (lexicon and rules). Linguistic Development: Entity Recognition with local grammars; Autonomous Generation modules; introduction of corpus-based technology. Applications: more interactive applications; Professional Post-Edition Module (POEM).

13 SYSTRANLab Research Projects: The Phoenix Project. Collaboration with P. Koehn (University of Edinburgh). Introduce corpus-based decision modules in SYSTRAN. Specialized modules: Word Sense Disambiguation; Lattice Generation; Preposition / Determiner Choice.

14 SYSTRANLab Research Projects: The Sphinx Project. Collaboration with CNRC. Sequential use of SYSTRAN and statistical engines (Statistical Post-Edition). GALE (DARPA Project). Participated in WMT07, NIST08.

15 SYSTRANLab Research Projects: The Pegasus Project. Collaboration with H. Schwenk (Université du Maine). Introduce linguistic knowledge in statistical engines. Participated in WMT08.

16 SYSTRANLab Hybrid Engines: introduce Self-Learning capability; learn “post-edition rules”; deep integration of statistical decision modules; insert linguistic knowledge in statistical engines; HYBRID

17 CWMT08: Chinese-English MT evaluation
Primary: RBMT+SPE
Contrast: RBMT
Started in 1994, 1.2M terms, S&T-focus
            BLEU4   BLEU4-SBP  NIST5   GTM     mWER    mPER    ICT
Primary-a   0.2275  0.2193     7.9180  0.7101  0.7209  0.5085  0.3262
Contrast-b  0.1956  0.1930     7.6356  0.7089  0.7165  0.5123  0.2942
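
As a quick reading aid for the scores on this slide, the relative BLEU4 gain of the primary system (RBMT+SPE, 0.2275) over the contrast system (RBMT, 0.1956) can be worked out directly. The helper name `relative_gain` is ours and does not come from the slides; only the two BLEU4 values are taken from the table.

```python
def relative_gain(primary, contrast):
    """Relative improvement of `primary` over `contrast`, in percent."""
    return (primary - contrast) / contrast * 100.0


# BLEU4 scores as reported for CWMT08 on this slide.
BLEU4_PRIMARY = 0.2275   # RBMT+SPE
BLEU4_CONTRAST = 0.1956  # RBMT only

print(f"Relative BLEU4 gain: {relative_gain(BLEU4_PRIMARY, BLEU4_CONTRAST):.1f}%")
```

This comes to roughly a 16% relative gain in BLEU4. The same helper should not be applied blindly to mWER and mPER, which are error rates where lower is better.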

18 CWMT08: SPE Usage. SPE module trained on 1.8M sentences; CWMT08 training data not used. Not only translation but also annotation by RBMT (dates, numerals, etc.). Transfer model is filtered: exclusion of “bad rules” by rule-based filtering; examples are “random” quotes, entities appearing. Some expressions are “protected”: constituents are replaced with placeholders before SPE, translated with RBMT, and re-injected in the translation after SPE. The SPE model for CWMT08 is trained using GIZA++, with decoding using Moses (www.statmt.org/moses).
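
The “protected” expressions step on this slide can be sketched as a mask, post-edit, re-inject cycle. This is a minimal illustration under stated assumptions: the regular expression (ISO-style dates and plain numbers), the `__PHn__` placeholder format, and the toy `spe` callable are all hypothetical. In the system described, the SPE step is a Moses-decoded model, not a string function.

```python
import re

# Hypothetical notion of "protected" spans: ISO-like dates and plain numbers.
PROTECTED = re.compile(r"\d{4}-\d{2}-\d{2}|\d+(?:\.\d+)?")


def protect(text):
    """Replace protected spans with numbered placeholders.

    Returns the masked text and the list of original spans.
    """
    spans = []

    def _mask(match):
        spans.append(match.group(0))
        return f"__PH{len(spans) - 1}__"

    return PROTECTED.sub(_mask, text), spans


def reinject(text, spans):
    """Put the protected spans back in place of their placeholders."""
    for i, span in enumerate(spans):
        text = text.replace(f"__PH{i}__", span)
    return text


def post_edit(rbmt_output, spe):
    """Run the post-editor `spe` on RBMT output without touching protected spans."""
    masked, spans = protect(rbmt_output)
    return reinject(spe(masked), spans)


def toy_spe(text):
    """Stand-in for a statistical post-editor: two crude string rewrites."""
    return text.replace("-", " ").replace("the the", "the")


print(post_edit("Released on 2008-11-20 with 3 the the bug-fixes", toy_spe))
```

Because the date is masked before the post-editor runs, the hyphen rewrite in `toy_spe` cannot damage it. That is the point of protecting constituents the rule-based engine already handles reliably.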

19 Statistical Post-Edition: A Case Study
Case Study – SYMANTEC – English>Chinese
                                    BLEU   PERFECT  Improv / Degrad
SYSTRAN Raw                         20.89  2        -
SYSTRAN Cust                        34.49  4.8      ref
SYSTRAN Raw + Translation Model     46.86  7.4      -
SYSTRAN Cust + Translation Model    50.90  10.5     15

20 Conclusions. Our approach is to start with a rule-based framework. Developed techniques give very competitive results: major focus on “degradation” control; learn more advanced post-edition rules. Generic Translation: still a long way to go; bigger still better? Domain Translation: quality is there; statistics provides adaptation and fluidity → need dedicated applications, workflow. Bootstrapping new language pair development.

