En->Cz MT system based on TR
  Zdeněk Žabokrtský, IFAL, Charles University in Prague
Goals
  Primary goal: to build a high-quality, linguistically motivated MT system using the PDT layered framework.
  Secondary goal: to create a system for testing the true usefulness of various NLP tools within a real-life application.
MT pyramid in terms of PDT
  [Figure: translation pyramid with the labels source language, target language, analysis, synthesis, transfer, w-layer, m-layer, a-layer, t-layer, and a question mark.]
Building the first prototype...
  Chosen direction: English -> Czech
  Main design decisions:
    several well-defined, linguistically relevant intermediate levels
    modularity: decompose the task into many isolated subtasks
    neutral w.r.t. the chosen methodology (e.g. rules vs. statistics)
  Available resources:
    experience (and software tools) from PDT and PCEDT
    freely available NLP tools for analysis on the English side
    an existing module for sentence synthesis on the Czech side
MT pyramid in the prototype
  [Figure: pyramid with the labels input text, src-m-layer, src-p-layer, src-a-layer, src-t-layer, trg-t-layer, output text.]
Data representation
  Different types of structures are associated with each source sentence.
  They should be stored simultaneously and interlinked, instead of being rewritten.
  New data format supported by TrEd:
    tree bundles (instead of single trees) for each sentence
    simplified addition of new attributes
  [Figure: example trees; surviving node labels: John, for, PP, VP.]
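The idea of a tree bundle can be illustrated with a small sketch. It is written in Python for brevity, whereas the system itself uses the TrEd data format and Perl. The class names `Node` and `TreeBundle`, the `links` field, and the layer names used as dictionary keys are assumptions made for this illustration, not the actual format.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    """One tree node; 'attrs' is an open dictionary, so adding a new attribute is cheap."""
    node_id: str
    attrs: Dict[str, str] = field(default_factory=dict)
    children: List["Node"] = field(default_factory=list)
    # Identifiers of related nodes in other trees of the same bundle
    # (a hypothetical linking scheme, chosen for this sketch).
    links: List[str] = field(default_factory=list)


@dataclass
class TreeBundle:
    """All representations of one sentence, stored side by side."""
    sentence: str
    trees: Dict[str, Node] = field(default_factory=dict)  # layer name -> root node

    def add_tree(self, layer: str, root: Node) -> None:
        # A new layer is added next to the existing ones; nothing is rewritten.
        self.trees[layer] = root

    def find(self, node_id: str) -> Optional[Node]:
        """Look a node up by identifier across every tree in the bundle."""
        stack = list(self.trees.values())
        while stack:
            node = stack.pop()
            if node.node_id == node_id:
                return node
            stack.extend(node.children)
        return None


# Toy bundle for the fragment "for John": a phrase-structure tree and a
# dependency tree kept together and interlinked through node identifiers.
bundle = TreeBundle("for John")
p_root = Node("p1", {"label": "PP"},
              [Node("p2", {"form": "for"}), Node("p3", {"form": "John"})])
a_root = Node("a1", {"form": "John"},
              [Node("a2", {"form": "for"}, links=["p2"])], links=["p3"])
bundle.add_tree("src-p", p_root)
bundle.add_tree("src-a", a_root)
```

Because both trees stay in the bundle, a later module can follow a link such as `a_root.links[0]` back to the phrase-structure node it came from.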
Translation scenarios
  Translation scenario: a chain of translation modules.
  Modules are implemented as (or wrapped by) btred/ntred macros (Perl).
  Phases are well defined, so that the modules can be easily substituted.
  Scenario 1, Scenario 2, Scenario 3: [diagrams not preserved in the transcript]
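A scenario as a chain of interchangeable modules might look roughly like the following sketch. It is in Python rather than the btred/ntred Perl macros the system uses, and every function name here (`tokenize`, `naive_lemmatize`, `stripping_lemmatize`, `run_scenario`) is hypothetical. The point is only that two scenarios can share most modules and differ in one.

```python
from typing import Callable, Dict, List

Bundle = Dict[str, object]           # stand-in for the per-sentence data
Module = Callable[[Bundle], Bundle]  # every module has the same interface


def tokenize(bundle: Bundle) -> Bundle:
    # Toy tokenizer: split on whitespace only.
    bundle["tokens"] = str(bundle["text"]).split()
    return bundle


def naive_lemmatize(bundle: Bundle) -> Bundle:
    # Toy lemmatizer: lowercase each token.
    bundle["lemmas"] = [t.lower() for t in bundle["tokens"]]
    return bundle


def stripping_lemmatize(bundle: Bundle) -> Bundle:
    # Alternative toy lemmatizer: lowercase and strip surrounding punctuation.
    bundle["lemmas"] = [t.lower().strip(".,") for t in bundle["tokens"]]
    return bundle


def run_scenario(scenario: List[Module], bundle: Bundle) -> Bundle:
    """Apply the modules in order; each one adds to the same bundle."""
    for module in scenario:
        bundle = module(bundle)
    return bundle


# Two scenarios that differ in a single, substituted module.
scenario_1 = [tokenize, naive_lemmatize]
scenario_2 = [tokenize, stripping_lemmatize]

result_1 = run_scenario(scenario_1, {"text": "The girl died."})
result_2 = run_scenario(scenario_2, {"text": "The girl died."})
```

Keeping one uniform module interface is what makes substitution cheap: swapping a lemmatizer changes one list entry and nothing else.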
Input text -> src-m-data
  1) segment the input text into sentences (Lingua::EN::Tagger from CPAN)
  2) create an empty tree bundle for each sentence
  3) tokenize and tag the sentences (Lingua::EN::Tagger from CPAN)
  4) lemmatize each token with Schmid's TreeTagger
src-p-data -> src-a-data
  7) mark phrase heads (Collins’s heads + minor adjustments)
  8) run the constituency-to-dependency transformation
  9) assign (selected) analytical functions
  10) mark subject nodes
  11) add a-node identifiers
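Steps 7 and 8 rely on a standard idea: once every phrase has a marked head child, a dependency tree follows by attaching the lexical head of each non-head child to the lexical head of the phrase. The Python sketch below illustrates that idea only. The `Phrase` class and the hand-set `head_index` values are assumptions, and no real head-rule table is reproduced.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Phrase:
    label: str
    word: Optional[str] = None                  # set only on leaves
    children: List["Phrase"] = field(default_factory=list)
    head_index: int = 0                         # which child is the marked head


def lexical_head(node: Phrase) -> Optional[str]:
    """Follow marked heads down to a leaf and return its word."""
    while node.children:
        node = node.children[node.head_index]
    return node.word


def to_dependencies(node: Phrase,
                    deps: Optional[Dict[str, Optional[str]]] = None
                    ) -> Dict[str, Optional[str]]:
    """Return {dependent word: governing word}; the root word maps to None.

    Words serve as keys, so this toy version assumes no word is repeated.
    """
    if deps is None:
        deps = {lexical_head(node): None}
    if node.children:
        head_word = lexical_head(node)
        for i, child in enumerate(node.children):
            if i != node.head_index:
                # A non-head child hangs under the head of its parent phrase.
                deps[lexical_head(child)] = head_word
            to_dependencies(child, deps)
    return deps


# "girl lived on farm", with heads marked by hand for the example.
tree = Phrase("S", head_index=1, children=[
    Phrase("NP", children=[Phrase("NN", word="girl")]),
    Phrase("VP", head_index=0, children=[
        Phrase("VBD", word="lived"),
        Phrase("PP", head_index=0, children=[
            Phrase("IN", word="on"),
            Phrase("NP", children=[Phrase("NN", word="farm")]),
        ]),
    ]),
])
dependencies = to_dependencies(tree)
```

In this toy tree the verb becomes the root, the subject noun and the preposition depend on the verb, and the noun inside the prepositional phrase depends on the preposition.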
src-a-data -> src-t-data
  12) determine the t-tree topology (collapsing function-word subtrees)
  13) label t-nodes with t-lemmas
  14) assign coordination/apposition functors
  15) mark t-nodes corresponding to finite clauses
  16) assign (some of) the remaining functors
  17) fill the nodetype attribute
  18) detect grammatical co-reference in relative clauses
  19) determine the semantic part of speech
  20) fill grammateme attributes (number, tense, degree...)
  21) detect the sentence modality
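Step 12 derives the t-tree shape by collapsing function-word ("fw.") subtrees, so that only content words keep their own nodes. A minimal Python sketch of one way to do this follows. The classes `ANode` and `TNode`, the `is_function_word` flag, and the `aux_forms` field are assumptions for illustration, and the real module presumably handles many more configurations.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ANode:
    form: str
    is_function_word: bool = False
    children: List["ANode"] = field(default_factory=list)


@dataclass
class TNode:
    form: str
    aux_forms: List[str] = field(default_factory=list)  # absorbed function words
    children: List["TNode"] = field(default_factory=list)


def build_t_nodes(a_node: ANode, pending: Tuple[str, ...] = ()) -> List[TNode]:
    """Return the t-nodes that replace a_node after collapsing function words.

    Simplification: a function word with no content word below it is dropped.
    """
    if a_node.is_function_word:
        # A governing function word (e.g. a preposition) is handed down
        # to the content words below it instead of getting its own node.
        carried = pending + (a_node.form,)
        result: List[TNode] = []
        for child in a_node.children:
            result.extend(build_t_nodes(child, carried))
        return result
    t_node = TNode(a_node.form, list(pending))
    for child in a_node.children:
        if child.is_function_word and not child.children:
            # A leaf function word (e.g. an auxiliary) is absorbed by its parent.
            t_node.aux_forms.append(child.form)
        else:
            t_node.children.extend(build_t_nodes(child))
    return [t_node]


# Toy a-tree for "was treated in hospital".
a_tree = ANode("treated", children=[
    ANode("was", is_function_word=True),
    ANode("in", is_function_word=True, children=[ANode("hospital")]),
])
t_root = build_t_nodes(a_tree)[0]
```

Here the four-node a-tree shrinks to two t-nodes: one for the verb, which remembers the auxiliary, and one for the noun, which remembers the preposition.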
src-t-data -> trg-t-data
  22) clone the source-language t-tree
  23) translate t-lemmas using a simple 1:1 probabilistic lexicon
  24) set the gender attribute according to the noun lemma
  25) set the aspect attribute according to the verb lemma
  26) apply specific conversion rules (e.g. for indefinite pronouns)
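Steps 22 and 23 can be pictured as follows: copy the source t-tree, then replace each t-lemma with the most probable entry in the lexicon. The sketch is in Python. The lexicon entries and their probabilities are invented for the example, and leaving unknown lemmas untranslated is an assumption about fallback behaviour, not a documented feature of the system.

```python
import copy
from typing import Dict, List, Tuple

# Hypothetical toy lexicon: source lemma -> [(target lemma, probability)].
# The probabilities are made up for illustration.
LEXICON: Dict[str, List[Tuple[str, float]]] = {
    "girl": [("dívka", 0.6), ("holka", 0.4)],
    "die": [("zemřít", 0.9), ("umřít", 0.1)],
}


def translate_tree(src_root: dict, lexicon: Dict[str, List[Tuple[str, float]]]) -> dict:
    """Clone the source t-tree and translate each t-lemma one to one."""
    trg_root = copy.deepcopy(src_root)  # step 22: the source tree stays untouched
    stack = [trg_root]
    while stack:
        node = stack.pop()
        candidates = lexicon.get(node["t_lemma"])
        if candidates:
            # Step 23: pick the single most probable translation.
            node["t_lemma"] = max(candidates, key=lambda pair: pair[1])[0]
        # Unknown lemmas are left as they are (an assumption of this sketch).
        stack.extend(node["children"])
    return trg_root


src_tree = {
    "t_lemma": "die",
    "children": [
        {"t_lemma": "girl", "children": []},
        {"t_lemma": "flu", "children": []},
    ],
}
trg_tree = translate_tree(src_tree, LEXICON)
```

Since the tree shape is copied unchanged, a 1:1 lexicon of this kind cannot reorder, merge, or split nodes; such cases would have to be handled by later steps.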
trg-t-data -> output sentence
  27) for prepositional groups, guess the target-language surface form
  28) run Jan Ptáček’s sentence generator
Translation sample
  English input:
  A Turkish girl has died from bird flu, days after her brother and sister died from the disease. The girl, 11, who lived on a poultry farm in eastern Turkey's Van province, was being treated in hospital after her siblings became infected with bird flu. The cases are the first human deaths from bird flu outside Asia, where the virus has killed more than 70 people. The hospital in Van is treating 15 others, three of whom are in a critical condition, according to a doctor there. The latest victim, Hulya Kocyigit, died early on Friday at the hospital.
  Czech output:
  Turecká ďouka zemřela z ptačí chřipky dny after, že její bratr a sestra zemřeli z nemoci. Ďouka 11, kdo žilo v drůbeží farmě ve van provincii východního Turecka, jsoucno zacházet v nemocnici, že její sourozenci slušeli nakažený s ptačí chřipkou. Případy jsou přední lidské smrti z ptačí chřipky mimo Asii, kde virus zabilo than 70 lid. Nemocnice ve Van zachází 15 zbývajících, whom three of v kritické podmínce souzvuk lékaře tam. Nejpozdnější oběť Kocyigit Hulya zemřela brzy v pátku v nemocnici.
Final remarks
  We have only just started (<1000 lines of Perl code, <50 development hours), and performance is limited at the moment.
  However, the system works and can be tested and gradually improved.
  Every translation error can be traced back to its source.
  Any part of the system can easily be "unplugged" and replaced with a better module.