Dependency tree projection across parallel texts David Mareček Charles University in Prague Institute of Formal and Applied Linguistics.


Dependency tree projection across parallel texts David Mareček Charles University in Prague Institute of Formal and Applied Linguistics FEAST Saarbrücken, Germany, October 18, 2010

Motivation
For many languages we don't have any manually annotated data for training statistical parsers. For many of these languages, however, some form of parallel corpus exists, often with English. Our goal is to:
1. compute a word alignment on this parallel corpus,
2. run a statistical dependency parser on the English side,
3. transfer the dependencies from English to our language using the alignment,
4. train the parser on the resulting trees.

Outline
- Word alignment: unidirectional alignment, symmetrization methods
- Algorithm for projecting dependencies using the alignment; projecting tags in case we don't have any tagger
- Training and evaluating the MST parser using either a tagger trained on a manually annotated corpus or tags projected from English across our parallel corpus
- Ways to filter the noise out of the training data: recognition of bad trees

Word alignment
We use the GIZA++ toolkit [Och and Ney, 2003]. Its output is asymmetric: for each word in one language, a counterpart in the other language is found. GIZA++ is therefore run in both directions, and the two outputs can then be symmetrized. This gives four alignments:
- English-to-X
- X-to-English
- intersection symmetrization
- grow-diag-final-and symmetrization

Word alignment - example: English to German
English: Coordination of fiscal policies indeed, can be counterproductive.
German: Eine Koordination finanzpolitischer Maßnahmen kann in der Tat kontraproduktiv sein.

Word alignment - example: German to English
English: Coordination of fiscal policies indeed, can be counterproductive.
German: Eine Koordination finanzpolitischer Maßnahmen kann in der Tat kontraproduktiv sein.

Word alignment - example: intersection symmetrization
The intersection of the previous two unidirectional alignments.
English: Coordination of fiscal policies indeed, can be counterproductive.
German: Eine Koordination finanzpolitischer Maßnahmen kann in der Tat kontraproduktiv sein.

Word alignment - example: grow-diag-final-and symmetrization
This symmetrization contains:
- the links from the intersection,
- links in which one or both ends neighbour an already added link.
English: Coordination of fiscal policies indeed, can be counterproductive.
German: Eine Koordination finanzpolitischer Maßnahmen kann in der Tat kontraproduktiv sein.
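The two symmetrizations can be illustrated with a minimal Python sketch. It assumes that links are stored as (e_index, f_index) pairs of token positions; the function names are hypothetical. The `grow` function is a deliberately simplified neighbour-growing step, not the full grow-diag-final-and heuristic used by the standard tools.

```python
def intersection(e2f, f2e):
    """Links present in both unidirectional alignments."""
    return set(e2f) & set(f2e)


def grow(e2f, f2e):
    """Simplified growing step (an assumption, not full grow-diag-final-and):
    start from the intersection and keep adding links from the union that
    neighbour an already added link, horizontally, vertically or diagonally."""
    union = set(e2f) | set(f2e)
    result = intersection(e2f, f2e)
    added = True
    while added:
        added = False
        # sorted() copies the candidates, so adding to result is safe here
        for e, f in sorted(union - result):
            if any(abs(e - e2) <= 1 and abs(f - f2) <= 1 for e2, f2 in result):
                result.add((e, f))
                added = True
    return result
```

Links that are far from every intersection link are never added by this sketch, which mirrors the intuition that isolated unidirectional links are the least reliable.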

Alignment links used for the projection
We use only links that appeared in the unidirectional X-to-English alignment:
- we need to find a parent for each token in language X,
- we don't care about unaligned English words.
We distinguish three weights of links:
1: links that appeared only in the X-to-English alignment (blue)
2: links that also appeared in the grow-diag-final-and symmetrization (yellow)
3: links that appeared in the intersection symmetrization (red)
English: Coordination of fiscal policies indeed, can be counterproductive.
German: Eine Koordination finanzpolitischer Maßnahmen kann in the Tat kontraproduktiv sein.
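The three link weights can be computed as in the following sketch. It assumes each alignment is a set of (e_index, f_index) pairs; the function name and the argument names are hypothetical.

```python
def link_weights(f2e_links, gdfa_links, intersection_links):
    """Assign weight 3, 2 or 1 to every link of the X-to-English alignment."""
    weights = {}
    for link in f2e_links:
        if link in intersection_links:
            weights[link] = 3  # confirmed by both directions
        elif link in gdfa_links:
            weights[link] = 2  # added by grow-diag-final-and
        else:
            weights[link] = 1  # only in the X-to-English alignment
    return weights
```

The intersection is tested first because every intersection link is also part of the grow-diag-final-and symmetrization.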

The algorithm for dependency projection

    e_root = technical root of English parse tree;
    f_root = technical root of foreign parse tree;
    build_subtree(e_root, f_root);

    function build_subtree(e_node, f_node);
    begin
        for e_child in e_node->get_children do begin
            links = all alignment links leading from e_child;
            if not links then
                build_subtree(e_child, f_node);
            else begin
                main_link = the link with the highest weight (or the first one of them);
                main_f_child = the node which is connected to e_child by main_link;
                other_f_children = nodes connected to e_child by other links;
                main_f_child->set_parent(f_node);
                main_f_child->set_tag(e_child->get_tag);
                for f_child in other_f_children do
                    f_child->set_parent(main_f_child);
                build_subtree(e_child, main_f_child);
            end;
        end;
    end;
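A runnable Python rendering of the pseudocode above might look as follows. The `Node` class and the `links` dictionary (mapping an English node to a list of (weight, foreign_node) pairs) are assumptions made for this sketch; the slides do not specify the data structures.

```python
class Node:
    """Minimal tree node (hypothetical helper for this sketch)."""

    def __init__(self, form, tag=None):
        self.form = form
        self.tag = tag
        self.parent = None
        self.children = []

    def set_parent(self, parent):
        # Detach from the old parent before attaching to the new one.
        if self.parent is not None:
            self.parent.children.remove(self)
        self.parent = parent
        parent.children.append(self)


def build_subtree(e_node, f_node, links):
    """Project the subtree under e_node onto foreign nodes, hanging them on f_node."""
    for e_child in list(e_node.children):
        candidates = links.get(e_child, [])
        if not candidates:
            # Unaligned English word: skip it, keep the same foreign parent.
            build_subtree(e_child, f_node, links)
            continue
        # max() returns the first of several equally weighted links.
        _, main_f_child = max(candidates, key=lambda pair: pair[0])
        main_f_child.set_parent(f_node)
        main_f_child.tag = e_child.tag
        for _, f_child in candidates:
            if f_child is not main_f_child:
                f_child.set_parent(main_f_child)
        build_subtree(e_child, main_f_child, links)
```

Because only X-to-English links are used, each foreign token has at most one link, so every aligned foreign node receives exactly one parent.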

Algorithm - example
English: Coordination of fiscal policies indeed, can be counterproductive.
German: Eine Koordination finanzpolitischer Maßnahmen kann in der Tat kontraproduktiv sein.

Training a parser – tagging
We need some tags for training the parser. If we have a tagger for our language, we can use it to tag our data. If we have neither a tagger nor a human-annotated corpus, we can project the English tags into our language. Tokens that do not receive any tag by the projection are assigned the special tag ??.

Projecting tags
English: Coordination/NN of/IN fiscal/JJ policies/NN indeed/RB ,/, can/MD be/VB counterproductive/JJ ./.
German: Eine Koordination finanzpolitischer Maßnahmen kann in der Tat kontraproduktiv sein.
Projected German tags: NN JJ NN MD RB JJ VB . (three German tokens receive no tag and are marked ??)
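Tag projection with the ?? fallback can be sketched as below. The `main_links` dictionary (English token index to foreign token index) and the function name are assumptions for illustration; in the projection algorithm, only the main link of each English word carries its tag.

```python
UNKNOWN_TAG = "??"


def project_tags(e_tags, f_length, main_links):
    """Copy English tags along main links; untouched foreign tokens keep '??'."""
    f_tags = [UNKNOWN_TAG] * f_length
    for e_index, f_index in main_links.items():
        f_tags[f_index] = e_tags[e_index]
    return f_tags
```

For a hypothetical three-word English sentence tagged NN MD VB whose words link to foreign positions 1, 2 and 3 of a four-token sentence, the first foreign token stays ??.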

Filtering the training data
Why filter?
- There is a lot of noise in the data: non-parallel sentences, very free translations, very strange trees.
- Training a parser on the whole corpus would take a lot of time (hours, days, weeks...).
We filter out trees that have a poor alignment (alignment sparseness).
We filter out trees that have a lot of non-projective dependencies, which are often caused by a wrong alignment.

Alignment sparseness limit
A sentence pair is bad if it has few intersection links relative to the length of the sentences. We filter out all sentence pairs whose sparseness $S$ is greater than some limit $S_{max}$, where

$$S = 1 - \frac{\#\mathrm{links}}{(e\_length + f\_length) / 2}$$
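The sparseness filter can be sketched in a few lines of Python, reading the formula on this slide as one minus the number of intersection links divided by the average sentence length. The function names are hypothetical, and the sketch assumes both sentences are non-empty.

```python
def alignment_sparseness(n_links, e_length, f_length):
    """S = 1 - #links / ((e_length + f_length) / 2)."""
    return 1.0 - n_links / ((e_length + f_length) / 2.0)


def keep_pair(n_links, e_length, f_length, s_max):
    """Keep a sentence pair unless its sparseness is greater than s_max."""
    return alignment_sparseness(n_links, e_length, f_length) <= s_max
```

For instance, a 10-token and an 11-token sentence with 8 intersection links give a sparseness of roughly 0.24, so the pair passes a limit of 0.25.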

Non-projectivity limit
A sentence is bad if there are many non-projective edges in the projected tree. We filter out all sentences that have more non-projectivities than some limit.
[Plot: measured for S = 0.25; N = count of non-projectivities]
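One common way to count non-projectivities is to count the edges whose span contains a token that is not a descendant of the edge's head. The sketch below uses that definition as an assumption, since the slides do not say exactly how N is counted. Tokens are numbered from 1, `heads[i]` is the head of token i, 0 is the technical root, and `heads[0]` is an unused placeholder.

```python
def is_descendant(node, ancestor, heads):
    """True if ancestor lies on the path from node up to the technical root."""
    while node != 0:
        node = heads[node]
        if node == ancestor:
            return True
    return False


def count_nonprojective_edges(heads):
    """Count edges that span at least one token not dominated by the edge's head."""
    count = 0
    for dep in range(1, len(heads)):
        head = heads[dep]
        lo, hi = sorted((head, dep))
        if any(not is_descendant(k, head, heads) for k in range(lo + 1, hi)):
            count += 1
    return count
```

With this function, the strictest filter (no non-projective dependency allowed) keeps only trees for which the count is zero. The sketch assumes the input is a well-formed tree; a cycle in `heads` would make `is_descendant` loop forever.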

Experimental setup
Languages used: Czech, German
Parallel corpus: Project Syndicate (news commentaries from the WMT10 translation task), about 100,000 parallel sentences
Treebanks: dtest sets from the CoNLL shared tasks 2006/2007
Parser: maximum spanning tree parser [McDonald, 2005]
Taggers: Morce tagger for English [Spoustova, 2007], Tree-tagger [Schmid, 1994]

Results
The best results were achieved with the following filtering:
- we filter out all trees with at least one non-projective dependency,
- we filter out all trees where the alignment sparseness S was greater than …

Language   Tags        Sentences   Accuracy   Complete
Czech      by tagger   15,…        … %        10.7 %
German     by tagger   17,…        … %        14.9 %
Czech      projected   15,…        … %        7.14 %
German     projected   17,…        … %        11.7 %

Conclusions
We showed that it is possible to create a dependency parser without a manually annotated treebank. The unlabeled accuracy is about 60%. We tested the approach on languages for which a treebank exists. One problem with this evaluation is that each treebank follows different annotation guidelines.

Thank you for your attention