
Lernverfahren auf Basis von Parallelkorpora (Learning Techniques based on Parallel Corpora)
Jonas Kuhn, Universität des Saarlandes, Saarbrücken
Heidelberg, 16 June 2005

Talk Outline
- Parallel Corpora
- The PTOLEMAIOS Project: Goals and Motivation
  - Specification of the intended prototype system
- Study 1: Unsupervised learning of a probabilistic context-free grammar (PCFG) using partial information from a parallel corpus
- Study 2: A chart-based parsing algorithm for bilingual parallel grammars based on a word alignment
- Conclusion: The PTOLEMAIOS research agenda

Parallel Corpora
- Collections of texts and their translations into different languages
- Alignment across languages at various levels: document, section, paragraph, sentence (not necessarily one-to-one), phrase, word
- Varying quality depending on the origin of the translation: translation is often not literal; parts of a document may occur in only one version

Examples of Parallel Corpora
- Hansards of the 36th Parliament of Canada (http://www.isi.edu/natural-language/download/hansard/): 1.3 million sentence pairs (19.8 million word forms English / 21.2 million word forms French)
- The Bible
- Europarl corpus (European Parliament Proceedings Parallel Corpus, 1996-2003) (http://www.isi.edu/~koehn/europarl/): 11 languages (Danish, Dutch, English, Finnish, French, German, Greek, Italian, Portuguese, Spanish, Swedish), up to 28 million word forms per language
- OPUS, an open-source parallel corpus (http://logos.uio.no/opus/): includes the Europarl corpus and various manuals for open-source software (parts of which have been translated into more than 20 languages, incl. e.g. Chinese, Hebrew, Japanese, Korean, Russian, and Turkish)

Uses for Parallel Corpora
- Building bilingual dictionaries
- Resource for Machine Translation research
  - "Classical" MT: data basis for transfer rules/dictionary entries
  - Statistical MT: training data for statistical alignment models (paragraph level, word level)
- Resource for (multilingual) NLP applications: "annotation projection" for training monolingual tools

Query interface
- From the OPUS corpus website: http://logos.uio.no/cgi-bin/opus/opuscqp.pl (based on the Corpus Workbench/CQP from IMS Stuttgart)


Training Statistical Alignments
- GIZA++ tool: http://www.fjoch.com/GIZA++.html
- Implementation of the "IBM models" for Statistical MT
- Extension of the program GIZA (part of the SMT toolkit EGYPT), developed at a summer workshop in 1999 at Johns Hopkins University
- Sample word alignment (from the Europarl corpus):
  This was actually referred to in the Minutes
  Dies war im Protokoll nachzulesen

“Annotation Projection”
- Yarowsky/Ngai/Wicentowski 2001
- Given: parallel corpus and word alignment; a tagger/chunker for English exists
- The projected annotation is used as training data for a tagger/chunker in the target language
- Robust learning techniques based on confidence in the training data
- Example (POS tags and bracketing projected from English to French):
  DT JJ NN IN JJ NN
  [ a significant producer ] for [ crude oil ]
  [ un producteur important ] de [ pétrole brut ]
  DT NN JJ IN NN JJ
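The projection step above can be sketched in a few lines (a minimal illustrative sketch, not the authors' implementation: the alignment is given as 0-based (source, target) index pairs, the function name is hypothetical, and the robust confidence-based filtering mentioned on the slide is omitted):

```python
def project_tags(src_tags, alignment, tgt_len, default="UNK"):
    """Project POS tags from source to target tokens through a word
    alignment given as (src_index, tgt_index) pairs (0-based)."""
    projected = [default] * tgt_len
    for s, t in alignment:
        projected[t] = src_tags[s]
    return projected

# The example from the slide: "a significant producer for crude oil"
# -> "un producteur important de petrole brut"
src_tags = ["DT", "JJ", "NN", "IN", "JJ", "NN"]
# a-un, significant-important, producer-producteur, for-de,
# crude-brut, oil-petrole (a plausible hand alignment)
alignment = [(0, 0), (1, 2), (2, 1), (3, 3), (4, 5), (5, 4)]
print(project_tags(src_tags, alignment, 6))
# -> ['DT', 'NN', 'JJ', 'IN', 'NN', 'JJ'], as on the slide
```

Unaligned target words keep a default tag here, which is exactly where the confidence-weighted training of Yarowsky et al. becomes important in practice.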

Quality of projected information
- Evaluation results for part-of-speech tagging (table from Yarowsky et al. 2001)

Train grammars on parallel corpora?
- New weakly supervised learning approach for (probabilistic) syntactic grammars
- Training data: parallel corpora – collections of original texts and their translations into one or more languages
- Preparatory step: identification of word correspondences with known statistical techniques (word alignment from statistical machine translation)
- Example sentence pair:
  Heute stellt sich die Lage jedoch völlig anders dar
  The situation now however is radically different

Train grammars on parallel corpora?
- Beyond lexical information, patterns in the word correspondence relation contain rich implicit information about the grammars of the languages
- One should be able to exploit this implicit information about structure and meaning for grammar learning
- Little manual annotation effort should be required
- Combination of insights from linguistics and machine learning
  Heute stellt sich die Lage jedoch völlig anders dar
  The situation now however is radically different

The PTOLEMAIOS Project
- Parallel Corpus-Based Grammar Induction: PTOLEMAIOS = Parallel-Text-based Optimization for Language Learning – Exploiting Multilingual Alignment for the Induction Of Syntactic Grammars
- Funded by the DFG (German Research Foundation) as an Emmy Noether research group
- Universität des Saarlandes (Saarbrücken), Department of Computational Linguistics
- Starting date: 1 April 2005; expected duration: 4 years (1-year extension possible)
(Slide image: the Rosetta Stone)

Project Goals
- Development of formalisms and algorithms to support grammar induction for arbitrary languages from parallel corpora
- To make the goals tangible: an intended prototype, the PTOLEMAIOS I system for building grammars for new (sub-)languages

The PTOLEMAIOS I system
- Resources required: parallel corpus of language L and one or (ideally) more other languages; no NLP tools for language L required
- Preparatory work required: manual annotation of a set of seed sentence pairs (e.g., 50-100 pairs)
  - Phrasal correspondence across languages
  - "Lean" bracketing: mark only full argument/modifier phrases (PPs, NPs) and full clauses

The PTOLEMAIOS I system
Training steps:
- (Sentence alignment on the parallel corpus)
- Word alignment on the parallel corpus, using standard techniques from Statistical Machine Translation (GIZA++ tool)
- Part-of-speech clustering for L
- Bootstrapping learning of syntactic grammars for L and the other language(s)
  - Starting from the annotated seed data
  - Exploiting large amounts of unannotated data, finding systematic patterns in phrasal correspondences
  - Assuming an implicit underlying representation ("pseudo meaning representation")
  - Relying on consensus across the grammars

The PTOLEMAIOS I system
Result:
- Robust probabilistic grammar for L
- Representation of predicate-argument and modifier relations
- Models predict probabilities for cross-linguistic argument/modifier links (these will be particularly useful in lexicalized models)
Applications:
- Multilingual Information Extraction, Question Answering
- Intermediate step for syntax-based MT

Motivation
- Practical
  - Explore an alternative to standard treebank training of grammars
  - For "smaller" languages, it is unrealistic to do the necessary manual resource annotation
- Theoretical
  - Establish parallel corpora as an empirical basis for (crosslinguistic or monolingual) linguistic studies
  - Frequency-related phenomena (like multi-word expressions/collocations) are otherwise hard to assess empirically at the level of syntax
  - Learnability properties as a criterion for assessing formal models for natural language

Study 1
- Unsupervised learning of a probabilistic context-free grammar (PCFG) exploiting partial information from a parallel corpus [Kuhn 2004 (ACL)]
- Underlying consideration: the distribution of word correspondences in the translation of a string contains partial information about possible phrase boundaries
  Heute stellt sich die Lage jedoch völlig anders dar
  The situation now however is radically different

Alignment-induced word blocks
- Word alignments obtained with the standard model are asymmetrical [Brown et al. 1993]: each word from language L1 is mapped to a set of words from L2
- (For L2 words without an overt correspondent in L1, an artificial word NULL is assumed in each sentence in L1)
- Definition: a word alignment mapping α induces a word block w1...wn in language L1 iff the union α(w1) ∪ ... ∪ α(wn) forms a continuous string in L2 [separated only by words from α(NULL)]
- Example: α(The) = {die}, α(situation) = {Lage}, α(NULL) = {der}
  L2: Heute ist die Lage in der Türkei anders
  L1: NULL The situation in Turkey now is different
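The block condition can be stated compactly in code (an illustrative sketch under the definition above: the alignment is a dict from L1 words to sets of L2 positions, NULL-aligned L2 positions are passed separately, and all names are hypothetical):

```python
def is_word_block(span_words, align, null_positions):
    """Check whether a sequence of L1 words induces a word block:
    the union of their aligned L2 positions must form a continuous
    string, with gaps filled only by NULL-aligned L2 positions."""
    positions = set()
    for w in span_words:
        positions |= align.get(w, set())
    if not positions:
        return False
    gap = set(range(min(positions), max(positions) + 1)) - positions
    return gap <= null_positions  # every gap position is NULL-aligned

# The slide's example: "die"=3, "Lage"=4, "der"=6 in the L2 sentence
align = {"The": {3}, "situation": {4}}
print(is_word_block(["The", "situation"], align, {6}))  # -> True
```

A span whose aligned positions are interrupted by a non-NULL word, e.g. positions {3, 5} with position 4 aligned elsewhere, would correctly come out as no block.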

Word blocks and constituents
- Let's call a word block maximal if adding a word to the left or right leads to a non-word block
- Maximal word blocks are possible constituents – but not necessary ones
- Conservative formulation: exclusion of impossible constituents ("distituents")
  - ... whenever a word sequence crosses block boundaries ...
  - ... without fully covering one of the adjacent blocks
- Example:
  Wir brauchen bald eine Entscheidung
  We need a decision soon
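One way to state this distituent filter in code (a sketch under stated assumptions: the maximal blocks are given as sorted, adjacent half-open [start, end) intervals over L1 positions, and the function name is hypothetical):

```python
def is_distituent(span, blocks):
    """A candidate span is excluded if it crosses the boundary between
    two adjacent maximal blocks without fully covering either of them."""
    i, j = span  # half-open [i, j)
    for (a, b), (c, d) in zip(blocks, blocks[1:]):
        if i < b and j > c:  # span crosses the boundary at b == c
            covers_left = i <= a and j >= b
            covers_right = i <= c and j >= d
            if not (covers_left or covers_right):
                return True
    return False

# Toy blocks over a 5-word sentence: [0,2), [2,4), [4,5)
blocks = [(0, 2), (2, 4), (4, 5)]
print(is_distituent((1, 3), blocks))  # -> True: straddles a boundary
print(is_distituent((0, 3), blocks))  # -> False: covers the left block
print(is_distituent((2, 4), blocks))  # -> False: exactly one block
```

Only spans flagged here are excluded from the phrase candidates; everything else remains available to the learner, matching the conservative formulation above.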

Unsupervised Learning
- Grammar induction of an X-bar grammar
- Using a variant of standard PCFG induction with the Inside-Outside algorithm (an Expectation-Maximization algorithm)
- All word spans are considered as phrase candidates – except the excluded distituents
- Automatic generalization based on patterns in the learning data (after part-of-speech tagging)
- Exclusion of distituents can reduce the effect of frequent non-phrasal word sequences
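To illustrate how excluded spans enter the computation, here is a minimal inside-probability pass (CKY-style, for a toy PCFG in Chomsky normal form) in which banned spans simply receive zero probability mass. This is a simplified stand-in for the full Inside-Outside training loop, and all names are illustrative:

```python
from collections import defaultdict

def inside_with_mask(words, lex, rules, start, banned):
    """Inside probabilities for a CNF PCFG; spans listed in `banned`
    (the excluded distituents) contribute nothing to any analysis."""
    n = len(words)
    chart = defaultdict(float)  # (i, j, symbol) -> inside probability
    for i, w in enumerate(words):
        for sym, p in lex.get(w, {}).items():
            if (i, i + 1) not in banned:
                chart[(i, i + 1, sym)] = p
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            if (i, j) in banned:
                continue  # distituent: no constituent may cover it
            for k in range(i + 1, j):
                for (parent, left, right), p in rules.items():
                    chart[(i, j, parent)] += (
                        p * chart[(i, k, left)] * chart[(k, j, right)]
                    )
    return chart[(0, n, start)]

# Tiny grammar: S -> A B, A -> "a", B -> "b"
lex = {"a": {"A": 1.0}, "b": {"B": 1.0}}
rules = {("S", "A", "B"): 1.0}
print(inside_with_mask(["a", "b"], lex, rules, "S", set()))      # -> 1.0
print(inside_with_mask(["a", "b"], lex, rules, "S", {(0, 2)}))   # -> 0.0
```

The outside pass and the EM re-estimation of rule probabilities work on the same masked chart, so distituents never receive expected counts.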

Empirical Results
- Comparative experiment [Kuhn 2004 (ACL)]
  - A: Grammar induction based on English corpus data only
  - B: Induction including partial information from a parallel corpus (Europarl) [Koehn 2002]
    - Statistical word alignment, trained with GIZA++ [Al-Onaizan et al. 1999; Och/Ney 2003]
    - Exclusion of "distituents" based on alignment-induced word blocks
- Evaluation: parsing sentences from the Penn Treebank (Wall Street Journal) with the induced grammars; comparison of the automatic analyses with the gold-standard treebank analyses created by linguists

Empirical Results
- Score: mean of precision and recall
  - Precision = # correctly identified phrases / # proposed phrases
  - Recall = # correctly identified phrases / # gold standard phrases

Study 2
- Underlying consideration: it should be possible to learn more complex syntactic generalizations from parallel corpora if phrase correspondences can be observed systematically
- Cross-linguistic "consensus structure" as a poor man's meaning representation (pseudo-meaning representation)
- The form-meaning relation is important for Optimality-Theory-style discriminative learning
(Figure: parallel S/XP trees with linked phrases over "So we must look at the agricultural policy" / "Wir müssen deshalb die Agrarpolitik prüfen")

Parallel Parsing
- Prerequisites for structural learning from parallel corpora:
  - Grammar formalism generating sentence pairs or tuples (bitexts or multitexts): a generalization of monolingual context-free grammars, generating two or more trees simultaneously (keeping note of phrase correspondences)
  - Algorithms for parsing and learning (parallel parsing/synchronous parsing)
- Problem: higher complexity than monolingual parsing; compare theoretical work on machine translation [Wu 1997, Melamed 2004]
- Focus for this study: efficient parallel parsing based on a word alignment [Kuhn 2005 (IJCAI)]

Chart Parsing (review)
- In chart parsing for context-free grammars, partial analyses covering the same string under the same category are merged
- Example: a WP covering string positions j through m with two possible internal analyses becomes a single chart entry – the internal distinction is irrelevant from the external point of view
  - Reading 1: WP → XP YP (XP: j-k, YP: k-m)
  - Reading 2: WP → ZP XP (ZP: j-n, XP: n-m)

Data Structure for Parallel Parsing
- The input is not a one-dimensional word string as in classical parsing, but a two-dimensional word array
- Representation of the word alignment: the array is indexed by the words of the two sentences ("So we must look at the agricultural policy" × "Wir müssen deshalb die Agrarpolitik prüfen"), with aligned word pairs marked

“3-D” Parsing
(Figure: S and XP constituents built simultaneously over both dimensions of the word array for "So we must look at the agricultural policy" / "Wir müssen deshalb die Agrarpolitik prüfen")

Assumptions for Parallel Parsing
- A particular word alignment is given (or the n best alignments according to a statistical model)
- Note: both languages may contain words without a correspondent ("null words")
- Complete constituents must be continuous in both languages
- There may however be phrase-internal "discontinuities"
- Example:
  We need a decision soon
  Wir brauchen bald eine Entscheidung

Generalized Earley Parsing
- Main idea: one of the languages is the "master language"
  - The choice of master language is arbitrary in principle; for efficiency, pick the language with more null words
- Primary index (PI) for chart entries according to the master language: string span (from-to)
- Secondary index (SI): bit vector for word coverage in the secondary language (cp. chart-based generation)
- Example: active chart entry for a partial constituent over the sentence pair
  1: We, 2: need, 3: a, 4: decision, 5: soon / 1: Wir, 2: brauchen, 3: bald, 4: eine, 5: Entscheidung
  entry: PI: 1-3, SI: [01001]

Primary/Secondary Index in Parsing
- Combination of chart entries (sentence pair "We need a decision soon" / "Wir brauchen bald eine Entscheidung"):
  PI: 1-3, SI: [01001] + PI: 3-5, SI: [00110] → PI: 1-5, SI: [01111]
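The combination step can be sketched with spans and integer bit masks (a minimal sketch; the entry layout and function names are illustrative, with the leftmost bit of the slide's vectors stored as the highest bit):

```python
def combine(entry1, entry2):
    """Combine two chart entries: primary spans must be adjacent and
    the secondary-index bit vectors must be disjoint."""
    (i1, j1, si1), (i2, j2, si2) = entry1, entry2
    if j1 != i2 or (si1 & si2):
        return None
    return (i1, j2, si1 | si2)

def contiguous(si):
    """A passive entry must cover a contiguous stretch of the secondary
    language: the set bits of its bit vector may contain no gap."""
    if si == 0:
        return True
    si >>= (si & -si).bit_length() - 1  # shift away trailing zeros
    return (si & (si + 1)) == 0         # remaining bits form 0b111...1

# The combination from the slide: [01001] + [00110] = [01111]
print(combine((1, 3, 0b01001), (3, 5, 0b00110)))  # -> (1, 5, 15), i.e. [01111]
# [11001] has a gap in its coverage:
print(contiguous(0b11001))  # -> False
print(contiguous(0b01111))  # -> True
```

The disjointness test (`si1 & si2`) and the union (`si1 | si2`) are single machine operations, which is why the slides can treat the bit-vector bookkeeping as essentially free.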

Complexity
- With a given word alignment, the secondary index is fully determined by the primary index unless there are null words in the secondary language
- In the absence of null words, the worst-case parsing complexity is essentially the same as in monolingual parsing
  - The secondary index does not add any free variables to the inference rules for parsing (bit-vector operations are very cheap – ignored here)
- In the average case, the search space is even reduced compared to monolingual parsing
- Example: the (incorrect) combination of a chart entry with an incomplete constituent is excluded early
  PI: 0-1, SI: [10000] + PI: 1-3, SI: [01001] → PI: 0-3, SI: [11001] – illegal as a passive chart entry
  (sentence pair "We need a decision soon" / "Wir brauchen bald eine Entscheidung")

Complexity and Null Words
- Null words in the secondary language L2 create possible variations for the secondary index (with a single primary index)
- However, the effect is fairly local thanks to the continuity assumption
- Example: XP covering "look at the agricultural policy" in the sentence pair "So we must look at the agricultural policy" / "Wir müssen deshalb die Agrarpolitik prüfen" (L2 words 4: die, 5: Agrarpolitik, 6: prüfen)
  1. PI: 3-3, SI: [0,0,0,0,1,0,0,0] + PI: 3-4, SI: [0,0,0,0,0,1,0,0] → PI: 3-4, SI: [0,0,0,0,1,1,0,0]
  2. PI: 3-3, SI: [0,0,0,0,1,0,0,0] + PI: 3-5, SI: [0,0,0,0,0,1,1,1] → PI: 3-5, SI: [0,0,0,0,1,1,1,1]
  3. PI: 3-3, SI: [0,0,0,0,1,0,0,0] + PI: 3-7, SI: [0,0,0,1,0,1,1,1] → PI: 3-7, SI: [0,0,0,1,1,1,1,1]
     (no passive entry, i.e., can be used only locally)

Complexity and Null Words
- Variability due to null words increases the worst-case complexity, depending on the number of null words in L2
  - n: total number of words in L1
  - m: number of null words in L2 (note: typically clearly smaller than n)
- Complexity class for alignment-based parallel parsing (time complexity for non-lexicalized grammars): O(n^3 m^3)
- For comparison: complexity of general parallel parsing without a fixed alignment is O(n^6) [cf. Wu 1997, Melamed 2004]; monolingual parsing is O(n^3)

Experimental Results
- Prototype implementation of the parser in SWI Prolog (chart implemented as a hash function)
- Probabilistic variant: Viterbi parser (determining the most probable reading)
- Scripts for the generation of training data (currently based on hand-labeled data using MMAX2) [http://mmax.eml-research.de]
- Comparison of "correspondence-guided synchronous parsing" (CGSP) with (simulated) monolingual parsing


Alignment-based Parallel Parsing
- Conclusion:
  - Alignment-based parsing of parallel corpora on a larger scale should be realistic
  - Use in (weakly supervised) grammar learning should be possible
  - Sentences with a large proportion of null words on both sides could be filtered out
- Open questions:
  - Effective heuristic for determining a single word alignment from a statistical alignment
  - Tractable relaxation of the continuity assumption

Talk Summary
- Parallel Corpora
- The PTOLEMAIOS Project: Goals and Motivation
  - Specification of the intended prototype system
- Study 1: Unsupervised learning of a probabilistic context-free grammar (PCFG) using partial information from a parallel corpus
- Study 2: A chart-based parsing algorithm for bilingual parallel grammars based on a word alignment
- Conclusion: The PTOLEMAIOS research agenda

The PTOLEMAIOS research agenda
- Formal grammar model for parallel linguistic analysis (grammar formalism)
- Specific choice of linguistic representations/constraints (linguistic specification)
- Efficient parallel parsing algorithms (algorithmic realization)
- Probability models for parallel structural representations (probabilistic modeling)
- Weakly supervised learning techniques for bootstrapping the grammars (bootstrapping learning)

Conclusion
(Figure: the planned PTOLEMAIOS architecture)

Selected References
- Al-Onaizan, Yaser, Jan Curin, Michael Jahr, Kevin Knight, John Lafferty, Dan Melamed, Franz-Josef Och, David Purdy, Noah A. Smith, and David Yarowsky. 1999. Statistical machine translation. Final report, JHU Workshop.
- Brown, P. F., S. A. Della Pietra, V. J. Della Pietra, and R. L. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics 19:263–311.
- Dubey, Amit, and Frank Keller. 2003. Probabilistic parsing for German using sister-head dependencies. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pp. 96–103, Sapporo.
- Koehn, Philipp. 2002. Europarl: A multilingual corpus for evaluation of machine translation. Ms., University of Southern California.
- Kuhn, Jonas. 2004. Experiments in parallel-text based grammar induction. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL 2004), Barcelona.
- Melamed, I. Dan. 2004. Multitext grammars and synchronous parsers. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL 2004), Barcelona.
- Och, Franz Josef, and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics 29(1):19–51.
- Wu, Dekai. 1997. Stochastic inversion transduction grammars and bilingual parsing of parallel corpora. Computational Linguistics 23(3):377–403.
- Yarowsky, D., G. Ngai, and R. Wicentowski. 2001. Inducing multilingual text analysis tools via robust projection across aligned corpora. In Proceedings of HLT, pp. 161–168.