En->Cz MT system based on TR Zdeněk Žabokrtský IFAL, Charles University in Prague.



Goals
- primary goal: to build a high-quality, linguistically motivated MT system using the PDT layered framework
- secondary goal: to create a system for testing the true usefulness of various NLP tools within a real-life application

MT pyramid in terms of PDT
[Figure: MT pyramid between source language and target language. Labels: analysis, synthesis, w-layer, m-layer, a-layer, t-layer, transfer, "?".]

Building the first prototype...
- chosen direction: English -> Czech
- main design decisions:
  - several well-defined, linguistically relevant intermediate levels
  - modularity: decompose the task into many isolated subtasks
  - neutral w.r.t. chosen methodology (e.g. rules vs. statistics)
- available resources:
  - experience (and software tools) from PDT and PCEDT
  - freely available NLP tools for analysis on the English side
  - an existing module for sentence synthesis on the Czech side

MT pyramid in the prototype
[Figure: input text -> src-m-layer -> src-p-layer -> src-a-layer -> src-t-layer -> trg-t-layer -> output text]

Data representation
- different types of structures are associated with each source sentence
- they should be stored simultaneously and interlinked, instead of being rewritten
- new data format supported by TrEd:
  - tree bundles (instead of single trees) for each sentence
  - simplified addition of new attributes
[Figure: interlinked trees of one bundle; node labels include "John", "for", PP and VP.]
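The tree-bundle idea can be illustrated with a small sketch. This is not the TrEd data format or the prototype's Perl code; the class names `Node` and `Bundle`, the `links` field and the helper methods are hypothetical, chosen only to show several trees for one sentence being stored side by side and linked by node identifiers.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    node_id: str
    label: str
    children: List["Node"] = field(default_factory=list)
    # Open-ended attribute dictionary: new attributes need no schema change.
    attrs: Dict[str, str] = field(default_factory=dict)
    # Links to nodes in other trees of the same bundle: layer name -> node id.
    links: Dict[str, str] = field(default_factory=dict)


@dataclass
class Bundle:
    """All trees built for one sentence, kept together (hypothetical layout)."""
    sentence: str
    trees: Dict[str, Node] = field(default_factory=dict)  # layer name -> root

    def find(self, layer: str, node_id: str) -> Optional[Node]:
        # Depth-first search for a node id within one layer's tree.
        stack = [self.trees[layer]] if layer in self.trees else []
        while stack:
            node = stack.pop()
            if node.node_id == node_id:
                return node
            stack.extend(node.children)
        return None

    def follow(self, node: Node, layer: str) -> Optional[Node]:
        # Resolve a cross-layer link stored on the node, if there is one.
        target = node.links.get(layer)
        return self.find(layer, target) if target else None


# A phrase-structure tree ("p") and a dependency tree ("a") for one sentence.
bundle = Bundle(sentence="John sleeps")
bundle.trees["p"] = Node("p1", "S", children=[Node("p2", "NP"), Node("p3", "VP")])
bundle.trees["a"] = Node("a1", "sleeps",
                         children=[Node("a2", "John", links={"p": "p2"})])

john = bundle.find("a", "a2")
john.attrs["afun"] = "Sb"  # adding an attribute later is a plain dict update
```

Because earlier trees are never overwritten, a later module can still walk from the dependency node for "John" back to the NP it came from with `bundle.follow(john, "p")`.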

Translation scenarios
- translation scenario: a chain of translation modules
- modules implemented as (or wrapped by) btred/ntred macros (Perl)
- well-defined phases, so that the modules can be easily substituted
[Figure: three example scenarios, labelled Scenario 1, Scenario 2 and Scenario 3.]
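A scenario as a chain of substitutable modules can be sketched as a list of functions applied in order. The prototype's modules are btred/ntred macros in Perl; the Python below only illustrates the chaining idea, and every name in it (`run_scenario`, `tokenize`, `lowercase_lemmas`, `identity_lemmas`) is a hypothetical stand-in rather than one of the real modules.

```python
from typing import Callable, Dict, List

Bundle = Dict[str, object]           # toy stand-in for a tree bundle
Module = Callable[[Bundle], Bundle]  # each module reads and extends the bundle


def run_scenario(modules: List[Module], bundle: Bundle) -> Bundle:
    """Apply the modules of a scenario one after another."""
    for module in modules:
        bundle = module(bundle)
    return bundle


def tokenize(bundle: Bundle) -> Bundle:
    out = dict(bundle)
    out["tokens"] = str(bundle["text"]).split()
    return out


def lowercase_lemmas(bundle: Bundle) -> Bundle:
    # Toy lemmatizer: lowercases each token.
    out = dict(bundle)
    out["lemmas"] = [t.lower() for t in bundle["tokens"]]
    return out


def identity_lemmas(bundle: Bundle) -> Bundle:
    # Alternative toy lemmatizer: keeps tokens unchanged.
    out = dict(bundle)
    out["lemmas"] = list(bundle["tokens"])
    return out


# Two scenarios that differ in a single phase.
scenario_a = [tokenize, lowercase_lemmas]
scenario_b = [tokenize, identity_lemmas]

result_a = run_scenario(scenario_a, {"text": "John sleeps"})
result_b = run_scenario(scenario_b, {"text": "John sleeps"})
```

Swapping one module for another changes only the list that defines the scenario, which is the property the slide attributes to well-defined phases.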

Input text -> src-m-data
1) segment the input text into sentences (Lingua::EN::Tagger from CPAN)
2) create an empty tree bundle for each sentence
3) tokenize and tag the sentences (Lingua::EN::Tagger from CPAN)
4) lemmatize each token with Schmid's TreeTagger

src-m-data -> src-p-data
5) phrase-structure parsing (Lingua::CollinsParser from CPAN)
6) add p-node identifiers

src-p-data -> src-a-data
7) mark phrase heads (Collins's heads + minor arrangements)
8) run the constituency -> dependency transformation
9) assign (selected) analytical functions
10) mark subject nodes
11) add a-node identifiers
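Steps 7 and 8 can be illustrated with the standard head-driven conversion: once every phrase has one child marked as its head, the lexical head of each non-head child is made dependent on the lexical head of the phrase. The sketch below assumes heads are already marked; the `PNode` class and function names are hypothetical, and the prototype's actual head rules and "minor arrangements" are not reproduced.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class PNode:
    label: str
    word: Optional[str] = None      # set on leaves only
    is_head: bool = False           # marked by the head-finding step
    children: List["PNode"] = field(default_factory=list)


def lexical_head(node: PNode) -> PNode:
    """Follow head-marked children down to a leaf."""
    if node.word is not None:
        return node
    for child in node.children:
        if child.is_head:
            return lexical_head(child)
    raise ValueError(f"phrase {node.label} has no head-marked child")


def to_dependencies(node: PNode, deps: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Return a dependent -> governor map (keyed by word form).

    Keying by word assumes every word occurs once in the sentence,
    which is a simplification made only for this sketch.
    """
    if deps is None:
        deps = {}
    if node.word is None:
        head = lexical_head(node)
        for child in node.children:
            child_head = lexical_head(child)
            if child_head is not head:
                deps[child_head.word] = head.word
            to_dependencies(child, deps)
    return deps


# (S (NP John) (VP* saw (NP Mary))), with heads marked by is_head=True.
tree = PNode("S", children=[
    PNode("NP", children=[PNode("NNP", word="John", is_head=True)]),
    PNode("VP", is_head=True, children=[
        PNode("VBD", word="saw", is_head=True),
        PNode("NP", children=[PNode("NNP", word="Mary", is_head=True)]),
    ]),
])
deps = to_dependencies(tree)
```

The word that never appears as a key in `deps` (here "saw") is the root of the resulting dependency tree.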

src-a-data -> src-t-data
12) determine the t-tree topology (collapsing function-word subtrees)
13) label t-nodes with t-lemmas
14) assign coordination/apposition functors
15) mark t-nodes corresponding to finite clauses
16) assign (some of) the remaining functors
17) fill the nodetype attribute
18) detect grammatical co-reference in relative clauses
19) determine the semantic part of speech
20) fill grammateme attributes (number, tense, degree...)
21) detect the sentence modality
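Step 12 can be sketched in a much simplified form, assuming that the collapsed subtrees are those headed by function words (prepositions, auxiliaries and the like) and that such words get no t-node of their own. In the sketch, each content word is attached to its nearest content-word ancestor; the function name `collapse` and the parent-map representation are hypothetical, and real tectogrammatical analysis also records the collapsed words on the surviving nodes, which is omitted here.

```python
from typing import Dict, Optional, Set


def collapse(parents: Dict[str, Optional[str]],
             function_words: Set[str]) -> Dict[str, Optional[str]]:
    """Derive a t-tree parent map from an a-tree parent map.

    `parents` maps each word to its governor (None for the root).
    Function words are dropped; each remaining word is re-attached to
    its nearest ancestor that is not a function word.
    """
    t_parents: Dict[str, Optional[str]] = {}
    for node, parent in parents.items():
        if node in function_words:
            continue
        # Skip over any chain of function words above this node.
        while parent is not None and parent in function_words:
            parent = parents[parent]
        t_parents[node] = parent
    return t_parents


# a-tree for "John waited for Mary": the preposition governs the noun.
a_tree = {"waited": None, "John": "waited", "for": "waited", "Mary": "for"}
t_tree = collapse(a_tree, function_words={"for"})
```

In this toy example "Mary" ends up directly under "waited", and the preposition no longer appears as a node.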

src-t-data -> trg-t-data
22) clone the source-language t-tree
23) translate t-lemmas using a simple 1:1 probabilistic lexicon
24) set the gender attribute according to the noun lemma
25) set the aspect attribute according to the verb lemma
26) apply specific conversion rules (e.g. for indefinite pronouns)
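Step 23 can be read as a per-node lookup that picks the most probable target lemma for each source t-lemma. The sketch below makes that reading concrete under stated assumptions: the lexicon entries and probabilities are invented for illustration, the function name `translate_lemma` is hypothetical, and leaving unknown lemmas untranslated is a choice made for this sketch, not a documented behaviour of the prototype.

```python
from typing import Dict

# Toy lexicon: source lemma -> {target lemma: probability}.
# Entries and probabilities are invented for illustration only.
TOY_LEXICON: Dict[str, Dict[str, float]] = {
    "girl": {"dívka": 0.6, "holka": 0.4},
    "hospital": {"nemocnice": 0.9, "špitál": 0.1},
}


def translate_lemma(lemma: str, lexicon: Dict[str, Dict[str, float]]) -> str:
    """Return the most probable translation, or the lemma itself if unknown."""
    candidates = lexicon.get(lemma)
    if not candidates:
        return lemma  # fallback chosen for this sketch
    return max(candidates, key=candidates.get)


translated = [translate_lemma(l, TOY_LEXICON) for l in ["girl", "hospital", "flu"]]
```

A 1:1 lookup of this kind ignores context entirely, so each source lemma always receives the same target lemma regardless of the sentence it appears in.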

trg-t-data -> output sentence
27) for prepositional groups, guess the target-language surface form
28) run Jan Ptáček's sentence generator

Translation sample
Source (English):
A Turkish girl has died from bird flu, days after her brother and sister died from the disease. The girl, 11, who lived on a poultry farm in eastern Turkey's Van province, was being treated in hospital after her siblings became infected with bird flu. The cases are the first human deaths from bird flu outside Asia, where the virus has killed more than 70 people. The hospital in Van is treating 15 others, three of whom are in a critical condition, according to a doctor there. The latest victim, Hulya Kocyigit, died early on Friday at the hospital.
System output (Czech):
Turecká ďouka zemřela z ptačí chřipky dny after, že její bratr a sestra zemřeli z nemoci. Ďouka 11, kdo žilo v drůbeží farmě ve van provincii východního Turecka, jsoucno zacházet v nemocnici, že její sourozenci slušeli nakažený s ptačí chřipkou. Případy jsou přední lidské smrti z ptačí chřipky mimo Asii, kde virus zabilo than 70 lid. Nemocnice ve Van zachází 15 zbývajících, whom three of v kritické podmínce souzvuk lékaře tam. Nejpozdnější oběť Kocyigit Hulya zemřela brzy v pátku v nemocnici.

Final remarks
- We have only just started (fewer than 1000 lines of Perl code, fewer than 50 development hours), and the performance is limited at this moment...
- However, the system works and can be tested and gradually improved.
- Every translation error can be traced back to its source.
- Any part of the system can be easily "unplugged" and substituted with a better module.