
1 Lernverfahren auf Basis von Parallelkorpora (Learning Techniques Based on Parallel Corpora)
Jonas Kuhn
Universität des Saarlandes, Saarbrücken
Heidelberg, 16 June 2005

2 Talk Outline
- Parallel Corpora
- The PTOLEMAIOS Project: goals and motivation; specification of the intended prototype system
- Study 1: unsupervised learning of a probabilistic context-free grammar (PCFG) using partial information from a parallel corpus
- Study 2: a chart-based parsing algorithm for bilingual parallel grammars, based on a word alignment
- Conclusion: the PTOLEMAIOS research agenda

3 Parallel Corpora
- Collections of texts and their translations into different languages
- Alignment across languages at various levels: document, section, paragraph, sentence (not necessarily one-to-one), phrase, word
- Varying quality, depending on the origin of the translation: the translation is often not literal, and parts of a document may occur in only one version

4 Examples of Parallel Corpora
- Hansards of the 36th Parliament of Canada: 1.3 million sentence pairs (19.8 million word forms English / 21.2 million word forms French)
- The Bible
- Europarl corpus (European Parliament Proceedings Parallel Corpus): 11 languages (Danish, Dutch, English, Finnish, French, German, Greek, Italian, Portuguese, Spanish, Swedish), up to 28 million word forms per language
- OPUS, an open-source parallel corpus: includes the Europarl corpus and various manuals for open-source software, parts of which have been translated into more than 20 languages, incl. e.g., Chinese, Hebrew, Japanese, Korean, Russian, and Turkish

5 Uses for Parallel Corpora
- Building bilingual dictionaries
- Resource for Machine Translation research: in "classical" MT, a data basis for transfer rules and dictionary entries; in statistical MT, training data for statistical alignment models (at the paragraph and word level)
- Resource for (multilingual) NLP applications: "annotation projection" for training monolingual tools

6 Query interface
From the OPUS corpus website (based on the Corpus Workbench/CQP from IMS Stuttgart)

7 Query interface

8 Training Statistical Alignments
- The GIZA++ tool: an implementation of the "IBM models" for statistical MT
- An extension of the program GIZA (part of the SMT toolkit EGYPT), developed at a 1999 summer workshop at Johns Hopkins University
- Sample word alignment (from the Europarl corpus):
  This was actually referred to in the Minutes
  Dies war im Protokoll nachzulesen
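To make the "IBM models" idea concrete, here is a minimal sketch of EM training for IBM Model 1, the simplest of these models. This is illustrative Python, not GIZA++ code; the toy sentence pairs and all names are invented for the example.

```python
from collections import defaultdict

def train_model1(sentence_pairs, iterations=10):
    """sentence_pairs: list of (src_tokens, tgt_tokens); returns t(f|e)."""
    t = defaultdict(lambda: 1.0)          # uniform start; initial scale is irrelevant
    for _ in range(iterations):
        count = defaultdict(float)        # expected counts c(f, e)
        total = defaultdict(float)        # marginal counts per source word e
        for src, tgt in sentence_pairs:
            src = ["NULL"] + src          # NULL accounts for unaligned target words
            for f in tgt:                 # E-step: alignment posteriors
                norm = sum(t[(f, e)] for e in src)
                for e in src:
                    p = t[(f, e)] / norm
                    count[(f, e)] += p
                    total[e] += p
        for (f, e), c in count.items():   # M-step: renormalize
            t[(f, e)] = c / total[e]
    return t

pairs = [("das Haus".split(), "the house".split()),
         ("das Buch".split(), "the book".split()),
         ("ein Buch".split(), "a book".split())]
t = train_model1(pairs)
print(round(t[("book", "Buch")], 2))      # rises toward 1.0 over the iterations
```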

9 "Annotation Projection"
- Yarowsky/Ngai/Wicentowski 2001
- A parallel corpus and a word alignment are given, and a tagger/chunker exists for English
- The projected annotation is used as training data for a tagger/chunker in the target language
- Robust learning techniques based on confidence in the training data
- Example (tags and bracketing projected from English to French):
  DT JJ NN IN JJ NN
  [ a significant producer ] for [ crude oil ]
  [ un producteur important ] de [ pétrole brut ]
  DT NN JJ IN NN JJ
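The projection step itself is simple once the alignment is given. A hedged sketch (function names and the link format are illustrative), reproducing the example above:

```python
def project_tags(src_tags, alignment, tgt_len, default="UNK"):
    """src_tags: POS tag per source position.
    alignment: set of (src_idx, tgt_idx) alignment links.
    Returns one projected tag per target position."""
    projected = [default] * tgt_len
    for s, t in sorted(alignment):
        projected[t] = src_tags[s]      # last link wins on conflicts
    return projected

# "a significant producer" / "un producteur important"
src_tags = ["DT", "JJ", "NN"]
links = {(0, 0), (1, 2), (2, 1)}        # a-un, significant-important, producer-producteur
print(project_tags(src_tags, links, 3))  # ['DT', 'NN', 'JJ']
```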

10 Quality of projected information
Evaluation results for part-of-speech tagging (from Yarowsky et al. 2001)

11 Train grammars on parallel corpora?
- A new weakly supervised learning approach for (probabilistic) syntactic grammars
- Training data: parallel corpora – collections of original texts and their translations into one or more languages
- Preparatory step: identification of word correspondences with known statistical techniques (word alignment from statistical machine translation)
- Example (de/en): Heute stellt sich die Lage jedoch völlig anders dar / The situation now however is radically different

12 Train grammars on parallel corpora?
- Beyond lexical information, patterns in the word correspondence relation contain rich implicit information about the grammars of the languages
- One should be able to exploit this implicit information about structure and meaning for grammar learning
- Little manual annotation effort should be required
- Combination of insights from linguistics and machine learning
- Example: Heute stellt sich die Lage jedoch völlig anders dar / The situation now however is radically different

13 The PTOLEMAIOS Project
- Parallel corpus-based grammar induction
- PTOLEMAIOS: Parallel-Text-based Optimization for Language Learning – Exploiting Multilingual Alignment for the Induction Of Syntactic Grammars
- Funded by the DFG (German Research Foundation) as an Emmy Noether research group
- Universität des Saarlandes (Saarbrücken), Department of Computational Linguistics
- Starting date: 1 April 2005; expected duration: 4 years (1-year extension possible)
(Image: Rosetta Stone)

14 Project Goals
- Development of formalisms and algorithms to support grammar induction for arbitrary languages from parallel corpora
- To make the goals tangible, an intended prototype: the PTOLEMAIOS I system for building grammars for new (sub-)languages

15 The PTOLEMAIOS I system
- Resources required: a parallel corpus of language L and one or (ideally) more other languages; no NLP tools for language L are required
- Preparatory work required: manual annotation of a set of seed sentence pairs
  - phrasal correspondences across languages
  - "lean" bracketing: mark only full argument/modifier phrases (PPs, NPs) and full clauses

16 The PTOLEMAIOS I system
Training steps:
- (Sentence alignment on the parallel corpus)
- Word alignment on the parallel corpus, using standard techniques from statistical machine translation (the GIZA++ tool)
- Part-of-speech clustering for L
- Bootstrapped learning of syntactic grammars for L and the other language(s):
  - starting from the annotated seed data
  - exploiting large amounts of unannotated data by finding systematic patterns in phrasal correspondences
  - assuming an implicit underlying representation ("pseudo meaning representation")
  - relying on consensus across the grammars

17 The PTOLEMAIOS I system
Result: a robust probabilistic grammar for L
- representation of predicate-argument and modifier relations
- the models predict probabilities for cross-linguistic argument/modifier links (these will be particularly useful in lexicalized models)
Applications: multilingual Information Extraction and Question Answering; an intermediate step towards syntax-based MT

18 Motivation
Practical:
- Explore an alternative to standard treebank training of grammars: for "smaller" languages, it is unrealistic to do the necessary manual resource annotation
Theoretical:
- Establish parallel corpora as an empirical basis for (crosslinguistic or monolingual) linguistic studies: frequency-related phenomena (like multi-word expressions/collocations) are otherwise hard to assess empirically at the level of syntax
- Learnability properties as a criterion for assessing formal models of natural language

19 Study 1
Unsupervised learning of a probabilistic context-free grammar (PCFG) exploiting partial information from a parallel corpus [Kuhn 2004 (ACL)]
Underlying consideration: the distribution of word correspondences in the translation of a string contains partial information about possible phrase boundaries
Example: Heute stellt sich die Lage jedoch völlig anders dar / The situation now however is radically different

20 Alignment-induced word blocks
- Word alignments obtained with the standard models are asymmetrical [Brown et al. 1993]: each word from language L1 is mapped to a set of words from L2 (for L2 words without an overt correspondent in L1, an artificial word NULL is assumed in each L1 sentence)
- Definition: a word alignment mapping a induces a word block w1...wn in language L1 iff the union a(w1) ∪ ... ∪ a(wn) forms a continuous string in L2 [separated only by words from a(NULL)]
- Example: a(The) = {die}, a(situation) = {Lage}, a(NULL) = {der}
  L1: The situation in Turkey now is different
  L2: Heute ist die Lage in der Türkei anders
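A small Python sketch of the definition (illustrative names; the alignment dictionary encodes the example sentence pair above):

```python
def is_word_block(span, alignment, l2_len):
    """span: (i, j), inclusive L1 positions; alignment: L1 pos -> set of L2 pos."""
    covered = set()
    for k in range(span[0], span[1] + 1):
        covered |= alignment.get(k, set())
    if not covered:
        return False
    aligned = set().union(*alignment.values())      # L2 words with a correspondent
    null_aligned = set(range(l2_len)) - aligned     # the rest: a(NULL)
    # Continuous in L2, with gaps allowed only for NULL-aligned words:
    return all(p in covered or p in null_aligned
               for p in range(min(covered), max(covered) + 1))

# L1: The situation in Turkey now is different  (positions 0..6)
# L2: Heute ist die Lage in der Türkei anders   (positions 0..7; "der" -> NULL)
a = {0: {2}, 1: {3}, 2: {4}, 3: {6}, 4: {0}, 5: {1}, 6: {7}}
print(is_word_block((0, 1), a, 8))   # True:  {die, Lage} is continuous
print(is_word_block((2, 3), a, 8))   # True:  {in, Türkei}, gap filled by "der"
print(is_word_block((3, 4), a, 8))   # False: {Türkei, Heute} spans aligned words
```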

21 Word blocks and constituents
- Call a word block maximal if adding a word to the left or right leads to a non-block
- Maximal word blocks are possible constituents – but not necessary ones
- Conservative formulation: exclude only impossible constituents ("distituents") – whenever a word sequence crosses block boundaries without fully covering one of the adjacent blocks (see the sketch below)
- Example: Wir brauchen bald eine Entscheidung / We need a decision soon
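Continuing the previous sketch (it reuses is_word_block), one hedged reading of the distituent rule: enumerate the maximal blocks, then flag any span that crosses a block edge without fully covering the block on either side of that edge. The edge dictionaries assume at most one maximal block per edge position, a simplification:

```python
def maximal_blocks(alignment, l1_len, l2_len):
    """All word blocks that cannot be extended to the left or right."""
    def block(i, j):
        return 0 <= i <= j < l1_len and is_word_block((i, j), alignment, l2_len)
    return [(i, j) for i in range(l1_len) for j in range(i, l1_len)
            if block(i, j) and not block(i - 1, j) and not block(i, j + 1)]

def is_distituent(span, blocks):
    """True if (i, j) crosses a block edge without covering an adjacent block."""
    i, j = span
    ends_at = {bj: (bi, bj) for bi, bj in blocks}
    starts_at = {bi: (bi, bj) for bi, bj in blocks}
    def covered(blk):
        return blk is not None and i <= blk[0] and blk[1] <= j
    for k in range(i, j):                  # boundary between positions k and k+1
        left, right = ends_at.get(k), starts_at.get(k + 1)
        if (left or right) and not (covered(left) or covered(right)):
            return True
    return False

# L1: We need a decision soon / L2: Wir brauchen bald eine Entscheidung
a = {0: {0}, 1: {1}, 2: {3}, 3: {4}, 4: {2}}
blocks = maximal_blocks(a, 5, 5)
print(is_distituent((1, 2), blocks))   # True:  "need a" crosses a block edge
print(is_distituent((2, 3), blocks))   # False: "a decision" may be a phrase
```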

22 Unsupervised Learning
- Grammar induction of an X-bar grammar, using a variant of standard PCFG induction with the Inside-Outside algorithm (an Expectation-Maximization algorithm)
- All word spans are considered as phrase candidates – except the excluded distituents (see the sketch below)
- Automatic generalization based on patterns in the learning data (after part-of-speech tagging)
- The exclusion of distituents can reduce the effect of frequent non-phrasal word sequences
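To illustrate where the exclusion enters, here is a schematic inside pass (the core of the E-step in Inside-Outside) for a PCFG in Chomsky normal form, in which spans flagged as distituents are simply never built. The grammar encoding and the toy rules are illustrative, not the actual setup of the study:

```python
from collections import defaultdict

def inside(words, lexicon, rules, distituent):
    """lexicon: (tag, word) -> prob; rules: (A, B, C) -> prob for A -> B C.
    distituent(i, j) returns True if span [i, j) must be excluded."""
    n = len(words)
    beta = defaultdict(float)                    # (cat, i, j) -> inside prob
    for i, w in enumerate(words):                # lexical spans of width 1
        for (tag, word), p in lexicon.items():
            if word == w:
                beta[(tag, i, i + 1)] = p
    for width in range(2, n + 1):                # longer spans, bottom-up
        for i in range(n - width + 1):
            j = i + width
            if distituent(i, j):
                continue                         # excluded span: prob stays 0
            for (a, b, c), p in rules.items():
                for k in range(i + 1, j):
                    beta[(a, i, j)] += p * beta[(b, i, k)] * beta[(c, k, j)]
    return beta

lex = {("D", "a"): 1.0, ("N", "decision"): 1.0}
rules = {("NP", "D", "N"): 1.0}
beta = inside("a decision".split(), lex, rules, lambda i, j: False)
print(beta[("NP", 0, 2)])                        # 1.0
```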

23 Empirical Results
Comparative experiment [Kuhn 2004 (ACL)]:
- A: grammar induction based on English corpus data only
- B: induction including partial information from a parallel corpus (Europarl) [Koehn 2002]
  - statistical word alignment trained with GIZA++ [Al-Onaizan et al. 1999; Och/Ney 2003]
  - exclusion of "distituents" based on alignment-induced word blocks
Evaluation: parsing sentences from the Penn Treebank (Wall Street Journal) with the induced grammars; comparison of the automatic analyses with the gold-standard treebank analyses created by linguists

24 Empirical Results
Metric: mean of precision and recall
- Precision = # correctly identified phrases / # proposed phrases
- Recall = # correctly identified phrases / # gold-standard phrases

25 Study 2
Underlying consideration: it should be possible to learn more complex syntactic generalizations from parallel corpora if phrase correspondences can be observed systematically
- Cross-linguistic "consensus structure" as a poor man's meaning representation (pseudo-meaning representation)
- The form-meaning relation is important for Optimality Theory-style discriminative learning
(Tree diagram: aligned S/XP structures for "So we must look at the agricultural policy" / "Wir müssen deshalb die Agrarpolitik prüfen")

26 Parallel Parsing
Prerequisites for structural learning from parallel corpora:
- A grammar formalism generating sentence pairs or tuples (bitexts or multitexts): a generalization of monolingual context-free grammars, generating two or more trees simultaneously (keeping note of phrase correspondences)
- Algorithms for parsing and learning (parallel parsing/synchronous parsing)
  - Problem: higher complexity than monolingual parsing; compare theoretical work on machine translation [Wu 1997, Melamed 2004]
Focus of this study: efficient parallel parsing based on a word alignment [Kuhn 2005 (IJCAI)]

27 Chart Parsing (review)
- In chart parsing for context-free grammars, partial analyses covering the same string under the same category are merged
- Example: a WP covering string positions j through m, with two possible internal analyses, becomes a single chart entry – the internal distinction is irrelevant from an external point of view
  - Reading 1: WP → XP YP (XP: j-k, YP: k-m)
  - Reading 2: WP → ZP XP (ZP: j-n, XP: n-m)
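A toy illustration of the merging step in Python (purely schematic): two derivations of the same category over the same span share one chart entry, so later rules see a single item:

```python
from collections import defaultdict

chart = defaultdict(list)                 # (cat, i, j) -> list of backpointers

def add_edge(cat, i, j, reading):
    chart[(cat, i, j)].append(reading)    # merged: one key, many readings

add_edge("WP", 0, 3, ("WP -> XP YP", 1))  # split point k = 1
add_edge("WP", 0, 3, ("WP -> ZP XP", 2))  # split point n = 2
print(len(chart))                         # 1: a single chart entry
print(chart[("WP", 0, 3)])                # two internal readings
```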

28 Data Structure for Parallel Parsing
The input is not a one-dimensional word string as in classical parsing, but a two-dimensional word array representing the word alignment
(Figure: alignment matrix for "So we must look at the agricultural policy" / "Wir müssen deshalb die Agrarpolitik prüfen")

29 "3-D" Parsing
(Figure: the S and XP analyses of "So we must look at the agricultural policy" and "Wir müssen deshalb die Agrarpolitik prüfen", built in parallel over the two-dimensional word array)

30 Assumptions for Parallel Parsing
- A particular word alignment is given (or the n best alignments according to a statistical model)
- Note: both languages may contain words without a correspondent ("null words")
- Complete constituents must be continuous in both languages; there may, however, be phrase-internal "discontinuities"
- Example: We need a decision soon / Wir brauchen bald eine Entscheidung

31 Generalized Earley Parsing
Main idea:
- One of the languages is the "master language"; the choice is in principle arbitrary – for efficiency, pick the language with more null words
- Primary index (PI) for chart entries according to the master language: string span (from-to)
- Secondary index (SI): bit vector for word coverage in the secondary language (cp. chart-based generation)
- Example: active chart entry for a partial constituent of "1 We, 2 need, 3 a, 4 decision, 5 soon" / "1 Wir, 2 brauchen, 3 bald, 4 eine, 5 Entscheidung": PI: 1-3, SI: [01001] (see the sketch below)
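A data-structure sketch for such chart entries, with the secondary index stored as a Python integer bit vector (all names illustrative; the slide's left-to-right bit display corresponds to ascending bit positions here):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Entry:
    cat: str        # grammar category
    pi: tuple       # primary index: (from, to) span in the master language
    si: int         # secondary index: bit k set <=> L2 word k is covered

def si_bits(positions):
    """Secondary-index bit vector from a set of covered L2 positions."""
    v = 0
    for p in positions:
        v |= 1 << p
    return v

# An active entry for a partial constituent: primary span 1-3 in the
# master language, covering L2 positions 1 and 4 (cf. SI: [01001]):
e = Entry("XP", (1, 3), si_bits({1, 4}))
```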

32 Primary/Secondary Index in Parsing
Combination of chart entries (same example sentence pair):
PI: 1-3, SI: [01001] + PI: 3-5, SI: [00110] → PI: 1-5, SI: [01111]
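Continuing the sketch, combination then checks adjacency on the primary index and disjointness on the secondary index, and unions both:

```python
def combine(cat, a, b):
    """Fundamental rule, sketched: adjacent primary spans, disjoint coverage."""
    if a.pi[1] != b.pi[0]:      # spans must be adjacent in the master language
        return None
    if a.si & b.si:             # overlapping L2 coverage: combination illegal
        return None
    return Entry(cat, (a.pi[0], b.pi[1]), a.si | b.si)

left = Entry("XP", (1, 3), si_bits({1, 4}))     # SI [01001]
right = Entry("YP", (3, 5), si_bits({2, 3}))    # SI [00110]
print(combine("S", left, right))                # PI (1, 5), SI [01111]
```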

33 Complexity
- With a given word alignment, the secondary index is fully determined by the primary index, unless there are null words in the secondary language
- In the absence of null words, the worst-case parsing complexity is essentially the same as in monolingual parsing: the secondary index does not add any free variables to the inference rules for parsing (bit-vector operations are very cheap and ignored here)
- In the average case, the search space is even reduced compared to monolingual parsing: an (incorrect) combination of a chart entry with an incomplete constituent is excluded early
- Example – illegal as a passive chart entry: combining PI: 0-1, SI: [10000] with PI: 1-3, SI: [01001] yields PI: 0-3, SI: [11001], whose coverage gap is not licensed by null words (see the sketch below)
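The early exclusion can be expressed as a contiguity test on the secondary index before an entry becomes passive; a sketch continuing the code above (null_mask marks L2 words without a correspondent):

```python
def si_is_contiguous(si, null_mask=0):
    """True iff the covered L2 positions form one run, up to NULL-word gaps."""
    if si == 0:
        return False
    lo = (si & -si).bit_length() - 1          # lowest covered position
    hi = si.bit_length() - 1                  # highest covered position
    run = ((1 << (hi - lo + 1)) - 1) << lo    # every position between lo and hi
    return run & ~(si | null_mask) == 0       # any gap must be NULL-aligned

print(si_is_contiguous(si_bits({1, 2, 3, 4})))  # True:  [01111] may become passive
print(si_is_contiguous(si_bits({0, 1, 4})))     # False: [11001] has unlicensed gaps
```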

34 Complexity and Null Words
- Null words in the secondary language L2 create possible variations of the secondary index (for a single primary index)
- However, the effect is fairly local, thanks to the continuity assumption
- Example (XP combinations over "... 4 look, 5 at, 6 the, 7 agricultural, 8 policy" / "... 4 die, 5 Agrarpolitik, 6 prüfen"):
  1. PI: 3-3, SI: [0,0,0,0,1,0,0,0] + PI: 3-4, SI: [0,0,0,0,0,1,0,0] → PI: 3-4, SI: [0,0,0,0,1,1,0,0]
  2. PI: 3-3, SI: [0,0,0,0,1,0,0,0] + PI: 3-5, SI: [0,0,0,0,0,1,1,1] → PI: 3-5, SI: [0,0,0,0,1,1,1,1]
  3. PI: 3-3, SI: [0,0,0,0,1,0,0,0] + PI: 3-7, SI: [0,0,0,1,0,1,1,1] → PI: 3-7, SI: [0,0,0,1,1,1,1,1] – no passive entry, i.e., usable only locally

35 Complexity and Null Words
- Variability due to null words increases the worst-case complexity, depending on the number of null words in L2
  - n: total number of words in L1
  - m: number of null words in L2 (note: typically clearly smaller than n)
- Complexity class for alignment-based parallel parsing (time complexity for non-lexicalized grammars): O(n³m³)
- For comparison: general parallel parsing without a fixed alignment is O(n⁶) [cf. Wu 1997, Melamed 2004]; monolingual parsing is O(n³)

36 Experimental Results
- Prototype implementation of the parser in SWI-Prolog (the chart is implemented as a hash function)
- Probabilistic variant: a Viterbi parser (determining the most probable reading)
- Scripts for the generation of training data (currently based on hand-labeled data using MMAX2)
- Comparison of "correspondence-guided synchronous parsing" (CGSP) with (simulated) monolingual parsing

37 Experimental Results

38 Experimental Results

39 Alignment-based Parallel Parsing
Conclusions:
- Alignment-based parsing of parallel corpora on a larger scale should be realistic
- Use in (weakly supervised) grammar learning should be possible; sentences with a large proportion of null words on both sides could be filtered out
Open questions:
- An effective heuristic for determining a single word alignment from a statistical alignment
- A tractable relaxation of the continuity assumption

40 Talk Summary
- Parallel Corpora
- The PTOLEMAIOS Project: goals and motivation; specification of the intended prototype system
- Study 1: unsupervised learning of a probabilistic context-free grammar (PCFG) using partial information from a parallel corpus
- Study 2: a chart-based parsing algorithm for bilingual parallel grammars, based on a word alignment
- Conclusion: the PTOLEMAIOS research agenda

41 The PTOLEMAIOS research agenda
- Formal grammar model for parallel linguistic analysis (grammar formalism)
- Specific choice of linguistic representations/constraints (linguistic specification)
- Efficient parallel parsing algorithms (algorithmic realization)
- Probability models for parallel structural representations (probabilistic modeling)
- Weakly supervised learning techniques for bootstrapping the grammars (bootstrapping learning)

42 Conclusion
(Figure: the planned PTOLEMAIOS architecture)

43 Selected References
Al-Onaizan, Yaser, Jan Curin, Michael Jahr, Kevin Knight, John Lafferty, Dan Melamed, Franz-Josef Och, David Purdy, Noah A. Smith, and David Yarowsky. 1999. Statistical machine translation. Final report, JHU Workshop.
Brown, P. F., S. A. Della Pietra, V. J. Della Pietra, and R. L. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics 19:263–311.
Dubey, Amit, and Frank Keller. 2003. Probabilistic parsing for German using sister-head dependencies. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pp. 96–103, Sapporo.
Koehn, Philipp. 2002. Europarl: A multilingual corpus for evaluation of machine translation. Ms., University of Southern California.
Kuhn, Jonas. 2004. Experiments in parallel-text based grammar induction. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL 2004), Barcelona.
Melamed, I. Dan. 2004. Multitext grammars and synchronous parsers. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL 2004), Barcelona.
Och, Franz Josef, and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics 29(1):19–51.
Wu, Dekai. 1997. Stochastic inversion transduction grammars and bilingual parsing of parallel corpora. Computational Linguistics 23(3):377–403.
Yarowsky, D., G. Ngai, and R. Wicentowski. 2001. Inducing multilingual text analysis tools via robust projection across aligned corpora. In Proceedings of HLT, pp. 161–168.

