Exploiting Multilingual Corpora for Machine Translation Andreas Eisele Saarland University & DFKI Arona, September 2005 JRC Enlargement.

Exploiting Multilingual Corpora for Machine Translation
Andreas Eisele, Saarland University & DFKI, eisele@dfki.de
Arona, September 2005
JRC Enlargement and Integration Workshop: Exploiting parallel corpora in up to 20 languages

Overview
- Multilingual/MT Projects & Tools at DFKI
- MT-Related Activities at Saarland University
- Work in the PTOLEMAIOS Project
- Plans for Near-Term Future

Multilingual Projects at DFKI

Main LT Application Areas:
- Multilingual Natural Communication
- Multilingual Document Production
- Crosslingual Information and Knowledge Management

Multilingual Natural Communication
- NL Dialogue Systems (DISCO, COSMA, Interprice)
- Speech Dialogue Processing (Verbmobil, Interprice)
- Robust Speech Parsing (Verbmobil, Interprice)
- Automatic Processing and Answering of Email (COSMA, ICC, XtraMind)
- Natural Speech Synthesis (Mary, Interprice)

Sample Application Areas: e-commerce (product search, CRM)

Application Projects with Interprice, AOL Europe and spin-off company XtraMind Technologies

Multilingual Document Production
- Terminology Checking (DiET, FLAG, WHITEBOARD, SKATE)
- Grammar and Style Checking (LATESLAV, FLAG, SKATE)
- Controlled Language Checking (FLAG, WHITEBOARD, SKATE)
- Automatic XML Tagging (WHITEBOARD)
- Consistency Control (BiLD, WHITEBOARD)

Sample Application Areas: multilingual document production, web-content production

Application Project with SAP Spin-Off company

Crosslingual Information and Knowledge Management
- Crosslingual Content Management (TWENTYONE, MUCHMORE)
- Crosslingual Information Retrieval (TWENTYONE, MULINEX, MIETTA, MUCHMORE)
- Crosslingual Multimedia Retrieval (POP-EYE, OLIVE, MUMIS, DIRECT INFO)
- Crosslingual Information Extraction (PARADIME, WHITEBOARD, DIRECT INFO)
- Crosslingual Text Mining, Terminology Extraction (GETESS, AIRFORCE, WIPO)
- Multilingual Summarization (MULINEX, MUCHMORE, MUSI)
- Multilingual Language Generation (TG/2, TEMSIS, MIETTA)

Sample Application Areas: multilingual and crosslingual search, tourism information on the web, up-to-date air quality reporting, information management for mega-events (world championship, Olympic Games), phonetic trademark search, term extraction from patent translations

Application Projects with German Telekom, ESG, Dresdner Bank, law firm Boehmert&Boehmert, feasibility study on terminology extraction with WIPO (via acrolinx), …

Multilingual Resources at DFKI
- POS-tagger TnT (T. Brants) and Chunkie can be trained for arbitrary languages
- Middleware HoG for multilingual robust shallow and HPSG-based deep analysis (mapping into RMRSs)
- Morphologies from MMorph project exist for German, English, French, Spanish, Italian
- Morphologies are encoded as FS transducers, usable for error-tolerant analysis and generation
- Adding more languages is very easy (as done for Arabic with A. Soudi)
- Uniform handling of all EU languages would be extremely convenient, but linguistic resources are currently lacking

Multilingual Projects at DFKI

Main LT Application Areas:
- Multilingual Natural Communication
- Multilingual Document Production
- Crosslingual Information and Knowledge Management

Topic emerging since 2005:
- Machine Translation

Machine Translation at DFKI

Topics in Compass (Digital Olympics 2006): Multi-Engine Machine Translation, Speech Technologies, Multilingual Content Management, Cross-lingual Information Retrieval and Multilingual Question Answering

Open LOGOS
- LOGOS MT® is one of the largest and most powerful among the commercial MT engines
- DFKI turned LOGOS MT into an open source product (in cooperation with GlobalWare AG)

Plans for integrated, hybrid MT from rule-based and stochastic engines (code name: EuroMatrix)

MT Activities at Saarland University

Guiding principle: Start with a method that works today, improve it by adding linguistic functionality as appropriate

Starting point: Phrase-based SMT (Köhn, Och, Marcu, HLT-NAACL 2003)
- Conceptually, phrase-based SMT is an intermediate step between TM and MT; it combines TM's ability to learn from examples with the compositionality of MT
- Among the best approaches in the ongoing DARPA evaluation campaign
- Easy to deploy (thanks to tools by F.J. Och and P. Köhn)
- Conceptually very simple, hence a good candidate to enrich models with linguistic sophistication
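As a rough illustration of the "between TM and MT" idea, the sketch below looks up memorized phrase pairs (the TM-like part) and concatenates them left to right (the compositional part). The phrase table entries, probabilities, and the function name `translate_monotone` are made up for illustration. A real phrase-based decoder also searches over reorderings and uses a language model, which this greedy toy does not attempt.

```python
from typing import Dict, List, Tuple

# Hypothetical toy phrase table: source phrase -> (target phrase, probability).
# In a real system these pairs would be extracted from a word-aligned corpus.
PHRASE_TABLE: Dict[Tuple[str, ...], Tuple[str, float]] = {
    ("das", "haus"): ("the house", 0.8),
    ("das",): ("the", 0.6),
    ("haus",): ("house", 0.9),
    ("ist",): ("is", 0.9),
    ("klein",): ("small", 0.7),
}


def translate_monotone(source: List[str], max_len: int = 3) -> Tuple[str, float]:
    """Greedy left-to-right translation, always taking the longest known phrase."""
    output: List[str] = []
    score = 1.0
    i = 0
    while i < len(source):
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(max_len, len(source) - i), 0, -1):
            key = tuple(source[i:i + n])
            if key in PHRASE_TABLE:
                target, prob = PHRASE_TABLE[key]
                output.append(target)
                score *= prob
                i += n
                break
        else:
            # Unknown word: copy it through unchanged.
            output.append(source[i])
            i += 1
    return " ".join(output), score


if __name__ == "__main__":
    print(translate_monotone("das haus ist klein".split()))
```

The two-word entry ("das", "haus") is preferred over the single-word entries, which is how memorized multi-word examples take precedence while unseen combinations are still composed from smaller pieces.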

MT Activities at Saarland University
- April '05: participation in ACL shared task on statistical machine translation with a multi-engine approach, {Finnish, French, German, Spanish} → English
- May '05: participation in DARPA MT evaluation with baseline phrase-based SMT system (Chinese → English)
- Project seminar on empirical MT: students learned to turn parallel corpora into SMT systems (based on EuroParl corpus, but also Welsh ↔ English and Arabic ↔ English)
- Diploma thesis on corpus-based MT via RMRS alignment

Experience: Using parallel corpora for MT quickly yields very promising results! We should have more language pairs and more data…
- Crawling of UN document repository, collection of 6-way parallel {Arabic, Chinese, English, French, Russian, Spanish} corpus (+ some German)

The PTOLEMAIOS project

Assumptions:
- Advanced language technology for truly multilingual applications is a key challenge for computational linguistics
- Treebanking and supervised learning have been successful for English (and some other languages), but may not be feasible for "smaller" languages
- Parallel corpora can be used to transfer knowledge about linguistic relations across languages or to induce linguistic knowledge from data
- Word alignments derived from simple models (GIZA++) can help to support this process

"Parallel-Text-based Optimization for Language learning: Exploiting Multilingual Alignment for the Induction Of Syntactic grammars"
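To give a feel for what "word alignments derived from simple models" means, here is a minimal sketch of IBM Model 1 trained with EM on a three-sentence toy corpus. This is not GIZA++ itself, which implements a family of more elaborate models; the function names `train_model1` and `align` and the toy sentences are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Bitext = List[Tuple[List[str], List[str]]]


def train_model1(bitext: Bitext, iterations: int = 10) -> Dict[Tuple[str, str], float]:
    """Estimate lexical translation probabilities t[(target, source)] with EM."""
    src_vocab = {w for src, _ in bitext for w in src}
    # Uniform initialization; any positive constant gives the same first E-step.
    t: Dict[Tuple[str, str], float] = defaultdict(lambda: 1.0 / len(src_vocab))
    for _ in range(iterations):
        count: Dict[Tuple[str, str], float] = defaultdict(float)
        total: Dict[str, float] = defaultdict(float)
        # E-step: distribute each target word over the source words of its sentence.
        for src, tgt in bitext:
            for w_t in tgt:
                norm = sum(t[(w_t, w_s)] for w_s in src)
                for w_s in src:
                    frac = t[(w_t, w_s)] / norm
                    count[(w_t, w_s)] += frac
                    total[w_s] += frac
        # M-step: renormalize expected counts per source word.
        for (w_t, w_s), c in count.items():
            t[(w_t, w_s)] = c / total[w_s]
    return t


def align(src: List[str], tgt: List[str], t: Dict[Tuple[str, str], float]) -> List[Tuple[int, int]]:
    """Link each target position to its most probable source position."""
    return [
        (j, max(range(len(src)), key=lambda i: t[(tgt[j], src[i])]))
        for j in range(len(tgt))
    ]


if __name__ == "__main__":
    toy = [
        ("das haus".split(), "the house".split()),
        ("das buch".split(), "the book".split()),
        ("ein buch".split(), "a book".split()),
    ]
    table = train_model1(toy)
    print(align("das haus".split(), "the house".split(), table))
```

Even with no dictionary, co-occurrence across sentence pairs is enough for EM to separate "das/the" from "haus/house". Links of this kind are what later steps, such as alignment-guided parsing, can build on.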

PTOLEMAIOS

Funding: Emmy-Noether fellowship from DFG, P.I. Jonas Kuhn

Expected Duration: April 2005 – March 2009

Original Goal: Induce grammars from parallel corpora (and evaluate them in isolation)

Revised Goal (since August '05): Evaluate grammars wrt. impact on MT performance

First Steps:
- Use GIZA++-derived word alignment as filter to speed up parsing, several papers on suitable parsing algorithms
- Use of LinearB's SMT decoder on phrase-aligned EuroParl corpus

Planned Steps:
- Explore the usefulness of syntactic analyses for phrase-based SMT
  - word-based and syntax-based partial analyses are offered to decoder
  - decoder can exploit syntax if useful, fall back to plain PBSMT if not
  - optimal weight of syntactic dependencies can be determined empirically
- Work on more languages (UN corpus in 6 languages, AC corpus)
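The planned step of determining the weight of syntactic information empirically can be pictured as follows. This is a sketch under assumptions, not the project's method: each candidate translation carries a plain phrase-based score and a syntax score (both invented toy numbers), the two are combined linearly, and the syntax weight is picked from a small grid by counting exact matches on a held-out set. A weight of 0 corresponds to falling back to plain PBSMT. Real systems typically tune against an automatic metric such as BLEU rather than exact match.

```python
from typing import Dict, List, Tuple

Candidate = Dict[str, object]  # keys: "text", "pbsmt", "syntax" (hypothetical layout)
DevSet = List[Tuple[List[Candidate], str]]


def best_candidate(candidates: List[Candidate], w_syntax: float) -> Candidate:
    """Pick the candidate with the highest combined log-linear score."""
    return max(candidates, key=lambda c: c["pbsmt"] + w_syntax * c["syntax"])


def tune_weight(dev_set: DevSet, grid: List[float]) -> float:
    """Return the first grid value that maximizes exact matches on the dev set."""
    def n_correct(w: float) -> int:
        return sum(best_candidate(cands, w)["text"] == ref for cands, ref in dev_set)
    return max(grid, key=n_correct)


# Invented toy dev set: log-scores are illustrative only.
DEV: DevSet = [
    ([{"text": "a", "pbsmt": -1.0, "syntax": -2.0},
      {"text": "b", "pbsmt": -1.2, "syntax": -0.5}], "b"),
    ([{"text": "c", "pbsmt": -0.5, "syntax": -1.0},
      {"text": "d", "pbsmt": -2.0, "syntax": -0.2}], "c"),
]

if __name__ == "__main__":
    print(tune_weight(DEV, [0.0, 0.5, 1.0, 2.0]))
```

In this toy setup, ignoring syntax gets the first sentence wrong and over-weighting it gets the second wrong, so an intermediate weight is selected.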

EuroMatrix: current situation (joint work with Philipp Köhn and Chris Callison-Burch, Edinburgh)

MT systems per language pair (data taken from J. Hutchins' Compendium of Translation Software, 10th Edition)

EuroMatrix: current situation

Most language pairs remain uncovered

EuroMatrix: SMT for many languages

EuroParl Corpus has been constructed to build statistical MT systems

Source: "Europarl: A Parallel Corpus for Statistical Machine Translation", Philipp Köhn, MT Summit X, September 2005

EuroMatrix: SMT for many languages

Multilingual corpora can be aligned across all languages…

Source: "Europarl: A Parallel Corpus for Statistical Machine Translation", Philipp Köhn, MT Summit X, September 2005

EuroMatrix: SMT for many languages

SMT systems derived from the corpora vary in quality

Source: "Europarl: A Parallel Corpus for Statistical Machine Translation", Philipp Köhn, MT Summit X, September 2005

EuroMatrix: SMT for many languages

Difficulty of translation into and from a given language may differ widely…

Source: "Europarl: A Parallel Corpus for Statistical Machine Translation", Philipp Köhn, MT Summit X, September 2005

EuroMatrix Ideas:
- For language pairs where rule-based MT and SMT based on parallel corpora exist, they should be integrated to exploit complementary strengths of both approaches
- Parallel corpora can then be used in two ways
  - feeding the SMT sub-system
  - fine-tuning the integrated setup
- For language pairs where only monolingual resources (lexicons, morphologies, taggers, …) and parallel corpora exist, transfer rules operating on linguistic representations should be derived from data
- We need a generic framework that allows plug and play with different approaches (an open source MT toolbox)
- Development of MT systems needs an open evaluation campaign, in the style of DARPA MTeval / ACL shared task
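One way to picture the "plug and play" framework is a shared engine interface plus a selector that chooses among the engines' outputs. The sketch below is purely illustrative: the class names, the stub engines with hard-coded outputs, and the token-count scorer are all hypothetical placeholders. In an integrated setup, a tuned scoring model would take the scorer's place.

```python
from typing import Callable, List, Protocol


class Engine(Protocol):
    """Any object with a translate() method can be plugged in."""

    def translate(self, source: str) -> str: ...


class RuleBasedStub:
    """Placeholder for a rule-based engine; returns a fixed string."""

    def translate(self, source: str) -> str:
        return "the house is small"


class StatisticalStub:
    """Placeholder for a statistical engine; returns a fixed string."""

    def translate(self, source: str) -> str:
        return "house small"


class MultiEngine:
    """Runs all engines and keeps the hypothesis the scorer likes best."""

    def __init__(self, engines: List[Engine], scorer: Callable[[str], float]):
        self.engines = engines
        self.scorer = scorer

    def translate(self, source: str) -> str:
        candidates = [engine.translate(source) for engine in self.engines]
        return max(candidates, key=self.scorer)


def toy_scorer(hypothesis: str) -> float:
    # Hypothetical stand-in for a language model: prefers more tokens.
    return float(len(hypothesis.split()))


if __name__ == "__main__":
    system = MultiEngine([RuleBasedStub(), StatisticalStub()], toy_scorer)
    print(system.translate("das haus ist klein"))
```

Because the combiner depends only on the `translate` method, engines of either kind can be added or swapped without changing the selection logic.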

Conclusion
- Machine translation performance can be enabled/boosted by parallel corpora
- Current work just scratches the surface of what can be done
- SMT systems for the languages of new member states should soon emerge from AC corpus
- More parallel data for these languages would be desirable (100MW much better than 10MW!)
- It would be very helpful to cooperate with teams from "new" countries for morphologies, taggers, parsers, …
