
1 Lernverfahren auf Basis von Parallelkorpora / Learning Techniques based on Parallel Corpora. Jonas Kuhn, Universität des Saarlandes, Saarbrücken. Heidelberg, 16 June 2005

2 Talk Outline
- Parallel Corpora
- The PTOLEMAIOS Project: goals and motivation; specification of the intended prototype system
- Study 1: Unsupervised learning of a probabilistic context-free grammar (PCFG) using partial information from a parallel corpus
- Study 2: A chart-based parsing algorithm for bilingual parallel grammars based on a word alignment
- Conclusion: The PTOLEMAIOS research agenda

3 Parallel Corpora
- Collections of texts and their translations into different languages
- Alignment across languages at various levels: document, section, paragraph, sentence (not necessarily one-to-one), phrase, word
- Varying quality depending on the origin of the translation: translation is often not literal; parts of a document may occur in only one version

4 Examples of Parallel Corpora
- Hansards of the 36th Parliament of Canada (http://www.isi.edu/natural-language/download/hansard/): 1.3 million sentence pairs (19.8 million word forms English / 21.2 million word forms French)
- The Bible
- Europarl corpus (European Parliament Proceedings Parallel Corpus) (http://www.isi.edu/~koehn/europarl/): 11 languages (Danish, Dutch, English, Finnish, French, German, Greek, Italian, Portuguese, Spanish, Swedish); up to 28 million word forms per language
- OPUS, an open source parallel corpus (http://logos.uio.no/opus/): includes the Europarl corpus and various manuals for open source software, parts of which have been translated into more than 20 languages, incl. e.g. Chinese, Hebrew, Japanese, Korean, Russian, and Turkish

5 Uses for Parallel Corpora
- Building bilingual dictionaries
- Resource for Machine Translation research: classical MT (data basis for transfer rules/dictionary entries); statistical MT (training data for statistical alignment models at paragraph and word level)
- Resource for (multilingual) NLP applications: annotation projection for training monolingual tools

6 Query interface (screenshot from the OPUS corpus website, based on the Corpus Workbench/CQP from IMS Stuttgart)

7 Query interface (screenshot, continued)

8 Training Statistical Alignments
- GIZA++ tool: implementation of the IBM models for statistical MT
- Extension of the program GIZA (part of the SMT toolkit EGYPT), developed at a summer workshop in 1999 at Johns Hopkins University
- Sample word alignment (from the Europarl corpus): "Dies war im Protokoll nachzulesen" / "This was actually referred to in the Minutes"
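A minimal sketch of the kind of asymmetric, one-to-many word alignment the IBM models produce, using the slide's Europarl example pair. The dict representation and the concrete links are illustrative assumptions, not GIZA++'s actual output format.

```python
english = "This was actually referred to in the Minutes".split()
german = "Dies war im Protokoll nachzulesen".split()

# Each English position maps to a set of German positions (0-based);
# the links below are illustrative, not real aligner output.
alignment = {
    0: {0},   # This     -> Dies
    1: {1},   # was      -> war
    3: {4},   # referred -> nachzulesen
    5: {2},   # in       -> im
    7: {3},   # Minutes  -> Protokoll
}

for e, gs in sorted(alignment.items()):
    print(english[e], "->", ", ".join(german[g] for g in sorted(gs)))
```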

9 Annotation Projection [Yarowsky/Ngai/Wicentowski 2001]
- Given: a parallel corpus and a word alignment; a tagger/chunker for English exists
- The projected annotation is used as training data for a tagger/chunker in the target language
- Robust learning techniques compensate for limited confidence in the training data
- Example (POS tags and brackets projected through the alignment): English "[a significant producer] for [crude oil]" (DT JJ NN IN JJ NN) maps to French "[un producteur important] de [petrole brut]"
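A minimal sketch of the projection idea, in the spirit of Yarowsky et al. 2001 but not their implementation: copy each source word's POS tag to its aligned target word(s). The alignment links for the slide's example are illustrative assumptions.

```python
def project_tags(src_tags, alignment, tgt_len):
    """src_tags: POS tags of the source sentence.
    alignment: dict mapping source index -> set of target indices.
    Returns projected tags for the target sentence (None = unaligned)."""
    tgt_tags = [None] * tgt_len
    for s, targets in alignment.items():
        for t in targets:
            tgt_tags[t] = src_tags[s]   # naive: last writer wins
    return tgt_tags

# English "a significant producer for crude oil" -> French
# "un producteur important de petrole brut" (illustrative links)
src_tags = ["DT", "JJ", "NN", "IN", "JJ", "NN"]
alignment = {0: {0}, 1: {2}, 2: {1}, 3: {3}, 4: {5}, 5: {4}}
print(project_tags(src_tags, alignment, 6))
# ['DT', 'NN', 'JJ', 'IN', 'NN', 'JJ']
```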

10 Quality of projected information Evaluation results for part-of-speech tagging (from Yarowsky et al. 2001)

11 Train grammars on parallel corpora?
- New weakly supervised learning approach for (probabilistic) syntactic grammars
- Training data: parallel corpora, i.e., collections of original texts and their translations into one or more languages
- Preparatory step: identification of word correspondences with known statistical techniques (word alignment from statistical machine translation)
- Example aligned sentence pair (en/de): "Heute stellt sich die Lage jedoch völlig anders dar" / "The situation now however is radically different"

12 Train grammars on parallel corpora?
- Beyond lexical information, patterns in the word correspondence relation contain rich implicit information about the grammars of the languages
- One should be able to exploit this implicit information about structure and meaning for grammar learning
- Little manual annotation effort should be required
- Combination of insights from linguistics and machine learning

13 The PTOLEMAIOS Project
- Parallel Corpus-Based Grammar Induction: PTOLEMAIOS (Parallel-Text-based Optimization for Language Learning: Exploiting Multilingual Alignment for the Induction Of Syntactic Grammars)
- Funded by the DFG (German Research Foundation) as an Emmy Noether research group
- Universität des Saarlandes (Saarbrücken), Department of Computational Linguistics
- Starting date: 1 April 2005; expected duration: 4 years (1-year extension possible)
- (Slide image: the Rosetta Stone)

14 Project Goals
- Development of formalisms and algorithms to support grammar induction for arbitrary languages from parallel corpora
- To make the goals tangible, an intended prototype: the PTOLEMAIOS I system for building grammars for new (sub-)languages

15 The PTOLEMAIOS I system
- Resources required: a parallel corpus of language L and one or (ideally) more other languages; no NLP tools for language L required
- Preparatory work required: manual annotation of a set of seed sentence pairs
- Phrasal correspondence across languages
- Lean bracketing: mark only full argument/modifier phrases (PPs, NPs) and full clauses

16 The PTOLEMAIOS I system
Training steps:
1. (Sentence alignment on the parallel corpus)
2. Word alignment on the parallel corpus, using standard techniques from statistical machine translation (GIZA++ tool)
3. Part-of-speech clustering for L
4. Bootstrapping learning of syntactic grammars for L and the other language(s): starting from the annotated seed data; exploiting large amounts of unannotated data by finding systematic patterns in phrasal correspondences; assuming an implicit underlying representation (pseudo meaning representation); relying on consensus across the grammars

17 The PTOLEMAIOS I system
Result:
- Robust probabilistic grammar for L
- Representation of predicate-argument and modifier relations
- Models predict probabilities for cross-linguistic argument/modifier links (these will be particularly useful in lexicalized models)
Applications:
- Multilingual information extraction, question answering
- Intermediate step for syntax-based MT

18 Motivation
Practical:
- Explore an alternative to standard treebank training of grammars
- For smaller languages, it is unrealistic to do the necessary manual resource annotation
Theoretical:
- Establish parallel corpora as an empirical basis for (crosslinguistic or monolingual) linguistic studies
- Frequency-related phenomena (like multi-word expressions/collocations) are otherwise hard to assess empirically at the level of syntax
- Learnability properties as a criterion for assessing formal models for natural language

19 Study 1
- Unsupervised learning of a probabilistic context-free grammar (PCFG) exploiting partial information from a parallel corpus [Kuhn 2004 (ACL)]
- Underlying consideration: the distribution of word correspondences in the translation of a string contains partial information about possible phrase boundaries
- Example sentence pair (as on slide 11): "Heute stellt sich die Lage jedoch völlig anders dar" / "The situation now however is radically different"

20 Alignment-induced word blocks
- Word alignments obtained with the standard model are asymmetrical [Brown et al. 1993]
- Each word from language L1 is mapped to a set of words from L2 (for L2 words without an overt correspondent in L1, an artificial word NULL is assumed in each L1 sentence)
- Example (English "The situation in Turkey is now different" / German "Heute ist die Lage in der Türkei anders"): α(The) = {die}, α(situation) = {Lage}, α(NULL) = {der}
- Definition: a word alignment mapping α induces a word block w1...wn in language L1 iff the union over α(w1)...α(wn) forms a continuous string in L2 [separated only by words from α(NULL)]
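A minimal sketch of the word-block test from the definition above, under the assumption that the alignment function α is given as a dict from L1 positions to sets of L2 positions, and the NULL-aligned L2 positions as a separate set.

```python
def is_word_block(i, j, align, null_covered):
    """True iff L1 words i..j (inclusive) induce a word block: the union of
    their L2 correspondents is contiguous up to NULL-aligned positions."""
    covered = set()
    for k in range(i, j + 1):
        covered |= align.get(k, set())
    if not covered:
        return False
    lo, hi = min(covered), max(covered)
    # Every gap inside the covered L2 span must be NULL-aligned.
    return all(p in covered or p in null_covered for p in range(lo, hi + 1))

# German "Heute ist die Lage in der Türkei anders" (positions 0..7);
# illustrative links: The->die, situation->Lage, in->in, Turkey->Türkei
align = {0: {2}, 1: {3}, 2: {4}, 3: {6}}
print(is_word_block(2, 3, align, null_covered={5}))
# True: the gap at position 5 ("der") is aligned only to NULL
```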

21 Word blocks and constituents
- Let's call a word block maximal if adding a word to the left or right leads to a non-word-block
- Maximal word blocks are possible constituents, but not necessary ones
- Conservative formulation: exclusion of impossible constituents (distituents) whenever a word sequence crosses block boundaries without fully covering one of the adjacent blocks
- Example sentence pair: "Wir brauchen bald eine Entscheidung" / "We need a decision soon"
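A minimal sketch of this conservative distituent filter, assuming the maximal word blocks have already been computed as inclusive (start, end) spans over L1. The crossing test below is one reading of "crosses block boundaries without fully covering one of the adjacent blocks": the span properly overlaps some block.

```python
def is_distituent(i, j, max_blocks):
    """Span i..j (inclusive) is excluded if it properly overlaps some
    maximal block: the spans intersect, but neither contains the other."""
    for (a, b) in max_blocks:
        overlap = i <= b and a <= j
        span_covers_block = i <= a and b <= j
        block_covers_span = a <= i and j <= b
        if overlap and not span_covers_block and not block_covers_span:
            return True
    return False

# "Wir brauchen bald eine Entscheidung": suppose "eine Entscheidung"
# (positions 3-4) is a maximal block; then "bald eine" (2-3) is excluded.
print(is_distituent(2, 3, [(3, 4)]))   # True
```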

22 Unsupervised Learning
- Grammar induction of an X-bar grammar
- Using a variant of standard PCFG induction with the Inside-Outside algorithm (an Expectation Maximization algorithm); a simplified sketch of the span-filtered inside pass follows below
- All word spans are considered as phrase candidates, except the excluded distituents
- Automatic generalization based on patterns in the learning data (after part-of-speech tagging)
- Exclusion of distituents can reduce the effect of frequent non-phrasal word sequences
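A simplified sketch (not the system of Kuhn 2004) of how the distituent filter plugs into the inside pass of Inside-Outside estimation. The rule encoding (`lex`, `rules`) is an illustrative assumption; `allowed` can be instantiated as `lambda i, j: not is_distituent(i, j, max_blocks)` using the sketch from slide 21.

```python
from collections import defaultdict

def inside(words, lex, rules, allowed):
    """words: input sentence (list of tokens).
    lex:   {(category, word): prob}   unary lexical rules
    rules: {(lhs, rhs1, rhs2): prob}  binary rules
    allowed(i, j): False for excluded distituent spans (inclusive indices).
    Returns inside probabilities keyed by (start, end, category)."""
    n = len(words)
    chart = defaultdict(float)
    for i, w in enumerate(words):                  # lexical layer
        for (cat, word), p in lex.items():
            if word == w:
                chart[(i, i + 1, cat)] += p
    for width in range(2, n + 1):                  # longer spans, bottom-up
        for i in range(n - width + 1):
            j = i + width
            if not allowed(i, j - 1):              # distituent spans get no entries
                continue
            for k in range(i + 1, j):              # split point
                for (lhs, r1, r2), p in rules.items():
                    chart[(i, j, lhs)] += p * chart[(i, k, r1)] * chart[(k, j, r2)]
    return chart
```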

23 Empirical Results
- Comparative experiment [Kuhn 2004 (ACL)]:
  A: grammar induction based on English corpus data only
  B: induction including partial information from a parallel corpus (Europarl corpus) [Koehn 2002]; statistical word alignment trained with GIZA++ [Al-Onaizan et al. 1999; Och/Ney 2003]; exclusion of distituents based on alignment-induced word blocks
- Evaluation: parsing sentences from the Penn Treebank (Wall Street Journal) with the induced grammars; comparison of the automatic analyses with the gold-standard treebank analyses created by linguists

24 Empirical Results # correctly identified phrases # proposed phrases # correctly identified phrases # gold standard phrases Mean of Precision und Recall

25 Study 2
- Underlying consideration: it should be possible to learn more complex syntactic generalizations from parallel corpora if phrase correspondences can be observed systematically
- Cross-linguistic consensus structure as a poor man's meaning representation (pseudo-meaning representation)
- The form-meaning relation is important for Optimality Theory-style discriminative learning
- Example (corresponding XPs marked in both clause trees): "Wir müssen deshalb die Agrarpolitik prüfen" / "So we must look at the agricultural policy"

26 Parallel Parsing
Prerequisites for structural learning from parallel corpora:
- A grammar formalism generating sentence pairs or tuples (bitexts or multitexts): a generalization of monolingual context-free grammars, generating two or more trees simultaneously (keeping note of phrase correspondences)
- Algorithms for parsing and learning (parallel parsing/synchronous parsing)
- Problem: higher complexity than monolingual parsing; compare theoretical work on machine translation [Wu 1997, Melamed 2004]
Focus for this study: efficient parallel parsing based on a word alignment [Kuhn 2005 (IJCAI)]

27 Chart Parsing (review)
- In chart parsing for context-free grammars, partial analyses covering the same string under the same category are merged
- Example: a WP covering string positions j through m with two possible internal analyses yields a single chart entry WP: j-m
  Reading 1: XP: j-k combined with YP: k-m
  Reading 2: ZP: j-n combined with XP: n-m
- The internal distinction is irrelevant from an external point of view

28 Data Structure for Parallel Parsing
- The input is not a one-dimensional word string as in classical parsing, but a two-dimensional word array with a representation of the word alignment (a data-structure sketch follows below)
- Example: "So we must look at the agricultural policy" / "Wir müssen deshalb die Agrarpolitik prüfen"
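A minimal sketch of this two-dimensional input object, under the assumption that the alignment is stored as a set of position links. The class name and the concrete links are illustrative, not the project's implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Bitext:
    l1: list                                   # words of language 1 (here: English)
    l2: list                                   # words of language 2 (here: German)
    links: set = field(default_factory=set)    # alignment links (i1, i2), 0-based

    def aligned(self, i1):
        """L2 positions aligned to L1 position i1."""
        return {i2 for (a, i2) in self.links if a == i1}

pair = Bitext(
    "So we must look at the agricultural policy".split(),
    "Wir müssen deshalb die Agrarpolitik prüfen".split(),
    # Illustrative links: So-deshalb, we-Wir, must-müssen, look/at-prüfen,
    # the-die, agricultural/policy-Agrarpolitik
    {(0, 2), (1, 0), (2, 1), (3, 5), (4, 5), (5, 3), (6, 4), (7, 4)},
)
print(pair.aligned(3))   # {5}, i.e., "prüfen"
```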

29 3-D Parsing
(Figure: the S trees for "Wir müssen deshalb die Agrarpolitik prüfen" and "So we must look at the agricultural policy", with corresponding XPs, drawn over the two-dimensional word array)

30 Assumptions for Parallel Parsing
- A particular word alignment is given (or the n best alignments according to a statistical model)
- Note: both languages may contain words without a correspondent (null words)
- Complete constituents must be continuous in both languages; there may however be phrase-internal discontinuities
- Example: "Wir brauchen bald eine Entscheidung" / "We need a decision soon" (the correspondences of "bald"/"soon" and "eine Entscheidung"/"a decision" cross)

31 Generalized Earley Parsing
Main idea:
- One of the languages is the master language; the choice is arbitrary in principle (for efficiency: pick the language with more null words)
- Primary index for chart entries according to the master language: string span (from-to)
- Secondary index: bit vector for word coverage in the secondary language (cp. chart-based generation)
- Example sentences: "1: Wir 2: brauchen 3: bald 4: eine 5: Entscheidung" / "1: We 2: need 3: a 4: decision 5: soon"
- Example: active chart entry for a partial constituent: PI: 1-3, SI: [01001] (a data-structure sketch follows below)
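A minimal sketch of such a chart entry, mirroring the primary index (string span in the master language) and the secondary index (coverage bit vector over the other language). The class name and the tuple encoding are illustrative assumptions, not the project's implementation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ChartEntry:
    cat: str     # grammatical category of the (partial) constituent
    pi: tuple    # primary index: (from, to) string span in the master language
    si: tuple    # secondary index: one coverage bit per word of the other language

# The slide's example: active chart entry with PI: 1-3, SI: [01001]
entry = ChartEntry("XP", (1, 3), (0, 1, 0, 0, 1))
```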

32 Primary/Secondary Index in Parsing
- Combination of chart entries (same example sentence pair as on the previous slide): PI: 1-3, SI: [01001] combined with PI: 3-5, SI: [00110] yields PI: 1-5, SI: [01111]
- (A sketch of the combination step follows below)
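A sketch of the combination step, reusing the hypothetical ChartEntry class from the previous slide: the primary spans must be adjacent, the secondary bit vectors must be disjoint, and the result takes the union of both indices. The result category is passed in here to stand in for whatever the grammar rule would supply.

```python
def combine(cat, a, b):
    """Combine two chart entries under result category `cat`,
    or return None if the entries are incompatible. (Illustrative sketch;
    assumes the ChartEntry dataclass from the previous sketch.)"""
    if a.pi[1] != b.pi[0]:                        # spans must be adjacent, e.g. 1-3 + 3-5
        return None
    if any(x & y for x, y in zip(a.si, b.si)):    # L2 coverage must be disjoint
        return None
    si = tuple(x | y for x, y in zip(a.si, b.si))
    return ChartEntry(cat, (a.pi[0], b.pi[1]), si)

left = ChartEntry("XP", (1, 3), (0, 1, 0, 0, 1))
right = ChartEntry("YP", (3, 5), (0, 0, 1, 1, 0))
print(combine("S", left, right))
# ChartEntry(cat='S', pi=(1, 5), si=(0, 1, 1, 1, 1)), i.e., SI: [01111]
```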

33 Complexity
- With a given word alignment, the secondary index is fully determined by the primary index unless there are null words in the secondary language
- In the absence of null words, the worst-case parsing complexity is essentially the same as in monolingual parsing: the secondary index does not add any free variables to the inference rules for parsing (bit-vector operations are very cheap and ignored here)
- In the average case, the search space is even reduced compared to monolingual parsing
- Example: the (incorrect) combination of a chart entry with an incomplete constituent is excluded early: PI: 0-1, SI: [10000] combined with PI: 1-3, SI: [01001] yields PI: 0-3, SI: [11001], which is illegal as a passive chart entry (the secondary coverage has a gap)

34 Complexity and Null Words
- Example pair: "So we must look at the agricultural policy" (4: look, 5: at, 6: the, 7: agricultural, 8: policy) / "Wir müssen deshalb die Agrarpolitik prüfen" (4: die, 5: Agrarpolitik, 6: prüfen)
- Null words in the secondary language L2 create possible variations of the secondary index for a single primary index, e.g. for an XP over "look at the agricultural policy":
  1. PI: 3-3, SI: [0,0,0,0,1,0,0,0] + PI: 3-4, SI: [0,0,0,0,0,1,0,0] yields PI: 3-4, SI: [0,0,0,0,1,1,0,0]
  2. PI: 3-3, SI: [0,0,0,0,1,0,0,0] + PI: 3-5, SI: [0,0,0,0,0,1,1,1] yields PI: 3-5, SI: [0,0,0,0,1,1,1,1]
  3. PI: 3-3, SI: [0,0,0,0,1,0,0,0] + PI: 3-7, SI: [0,0,0,1,0,1,1,1] yields PI: 3-7, SI: [0,0,0,1,1,1,1,1] (no passive entry, i.e., can be used only locally)
- However, the effect is fairly local thanks to the continuity assumption

35 Complexity and Null Words
- Variability due to null words increases the worst-case complexity, depending on the number of null words in L2
- n: total number of words in L1; m: number of null words in L2 (note: typically clearly smaller than n)
- Complexity class for alignment-based parallel parsing (time complexity for non-lexicalized grammars): O(n^3 m^3)
- For comparison: general parallel parsing without a fixed alignment is O(n^6) [cf. Wu 1997, Melamed 2004]; monolingual parsing is O(n^3)

36 Experimental Results
- Prototype implementation of the parser in SWI-Prolog (chart implemented via hashing)
- Probabilistic variant: Viterbi parser (determining the most probable reading)
- Scripts for generation of training data (currently based on hand-labeled data using MMAX2) [http://mmax.eml-research.de]
- Comparison of correspondence-guided synchronous parsing (CGSP) with (simulated) monolingual parsing

37 Experimental Results (chart)

38 (chart: experimental results, continued)

39 Alignment-based Parallel Parsing
Conclusions:
- Alignment-based parsing of parallel corpora on a larger scale should be realistic
- Use in (weakly supervised) grammar learning should be possible; sentences with a large proportion of null words on both sides could be filtered out
Open questions:
- An effective heuristic for determining a single word alignment from a statistical alignment
- A tractable relaxation of the continuity assumption

40 Talk Summary
- Parallel Corpora
- The PTOLEMAIOS Project: goals and motivation; specification of the intended prototype system
- Study 1: Unsupervised learning of a probabilistic context-free grammar (PCFG) using partial information from a parallel corpus
- Study 2: A chart-based parsing algorithm for bilingual parallel grammars based on a word alignment
- Conclusion: The PTOLEMAIOS research agenda

41 The PTOLEMAIOS research agenda
- Grammar formalism: formal grammar model for parallel linguistic analysis
- Linguistic specification: specific choice of linguistic representations/constraints
- Algorithmic realization: efficient parallel parsing algorithms
- Probabilistic modeling: probability models for parallel structural representations
- Bootstrapping learning: weakly supervised learning techniques for bootstrapping the grammars

42 Conclusion (diagram: the planned PTOLEMAIOS architecture)

43 Selected References
Al-Onaizan, Yaser, Jan Curin, Michael Jahr, Kevin Knight, John Lafferty, Dan Melamed, Franz-Josef Och, David Purdy, Noah A. Smith, and David Yarowsky. 1999. Statistical machine translation. Final report, JHU Workshop.
Brown, P. F., S. A. Della Pietra, V. J. Della Pietra, and R. L. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics 19:263-311.
Dubey, Amit, and Frank Keller. 2003. Probabilistic parsing for German using sister-head dependencies. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pp. 96-103, Sapporo.
Koehn, Philipp. 2002. Europarl: A multilingual corpus for evaluation of machine translation. Ms., University of Southern California.
Kuhn, Jonas. 2004. Experiments in parallel-text based grammar induction. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL 2004), Barcelona.
Melamed, I. Dan. 2004. Multitext grammars and synchronous parsers. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL 2004), Barcelona.
Och, Franz Josef, and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics 29(1):19-51.
Wu, Dekai. 1997. Stochastic inversion transduction grammars and bilingual parsing of parallel corpora. Computational Linguistics 23(3):377-403.
Yarowsky, D., G. Ngai, and R. Wicentowski. 2001. Inducing multilingual text analysis tools via robust projection across aligned corpora. In Proceedings of HLT 2001.

