1
Morphosyntactic correspondence: a progress report on bitext parsing. Alexander Fraser, Renjing Wang, Hinrich Schütze. Institute for NLP, University of Stuttgart. INFuture2009: Digital Resources and Knowledge Sharing, Nov 4th 2009, Zagreb
2
Outline The Institute for Natural Language Processing at the University of Stuttgart Bitext parsing Using morphosyntactic correspondence
3
IfNLP Stuttgart The Institute for Natural Language Processing (IfNLP/IMS) at the University of Stuttgart Large department: Dogil (Phonetics and Speech), Kuhn/Rohrer (LFG syntax and semantics), Cahill (LFG generation), Heid (Terminology extraction, morphology), Padó (Semantics, lexical semantics), Schütze (Statistical NLP and Information Retrieval) More on next slide
4
IfNLP – Statistical NLP Group Hinrich Schütze (director since 2004) Bernd Möbius – Speech recognition and synthesis Helmut Schmid - Parsing, morphology (known for TreeTagger, BitPar) Sabine Schulte im Walde – NLP and cognitive modeling of lexical semantics Michael Walsh – Speech, exemplar theoretic syntax Alex Fraser - Statistical machine translation, parsing, cross-lingual information retrieval General department areas of research New statistical NLP models and methods Semi-supervised and active learning Cognitive/linguistic representation models Applied to: NLP, retrieval, MT, speech, e-learning, …
5
IfNLP - Partnerships Stuttgart: large projects with linguistics, computer science, EE signal processing, high performance computing Germany: Darmstadt, Tübingen, DSPIN/CLARIN consortium (UIMA-based German processing) International: large French-led European project (6 universities, 4 industrial partners), collaborations on South African languages, Edinburgh, CLARIN Industrial: various projects with publishers (many focusing on terminology)
6
Outline The Institute for Natural Language Processing at the University of Stuttgart Bitext parsing Using morphosyntactic correspondence
7
What is bitext parsing? Bitext: a text and its translation Sentences and their translations are aligned Sometimes called a parallel corpus Syntactic parsing: automatically find the syntactic structure of a sentence (syntactic parse) Bitext parsing: automatically find the syntactic structure of the parallel sentences in a bitext We will use the complementarity of the syntax of the two languages to obtain improved parses
8
Motivation for bitext parsing Many advances in syntactic parsing come from better modeling But the overall bottleneck is the size of the treebank Our research asks a different question: Where can we (cheaply) obtain additional information, which helps to supplement the treebank? A new information source for resolving ambiguity is a translation The human translator understands the sentence and disambiguates for us! Our research goal was to build large databases of improved parses to help establish preferences for difficult phenomena like PP-attachment
9
Clause attachment ambiguity Parse 1: high attachment (wrong) Parse 2: low attachment (correct)
10
Not ambiguous in German Number agreement disambiguates FRAU (woman) and HATTE (had) agree Unambiguous low attachment
11
Parse reranking of bitext Goal: improve English parsing accuracy Parse English sentence, obtain list of 100 best parse candidates Parse German sentence, obtain single best parse Determine the correspondence of German to English words using a word alignment Calculate syntactic divergence of each English parse candidate and the projection of the German parse Choose probable English parse candidate with low syntactic divergence
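The reranking steps above can be sketched as follows. This is a minimal illustration, not the system described in the talk: constituents are reduced to word-index spans, the German parse is projected through the word alignment, and "syntactic divergence" is approximated by a simple crossing-bracket count. The function names, the divergence measure and the `weight` parameter are all assumptions made for the example.

```python
def project_span(span, alignment):
    """Map a German span (start, end), end exclusive, to the smallest
    English span covering every aligned English word. Returns None if
    no word in the span is aligned."""
    start, end = span
    targets = [e for g, e in alignment if start <= g < end]
    if not targets:
        return None
    return (min(targets), max(targets) + 1)


def crosses(a, b):
    """True if spans a and b overlap without one containing the other."""
    return a[0] < b[0] < a[1] < b[1] or b[0] < a[0] < b[1] < a[1]


def divergence(english_spans, german_spans, alignment):
    """Hypothetical divergence: number of projected German constituents
    that cross some bracket of the English candidate parse."""
    projected = (project_span(s, alignment) for s in german_spans)
    return sum(
        1
        for p in projected
        if p is not None and any(crosses(p, e) for e in english_spans)
    )


def rerank(candidates, german_spans, alignment, weight=1.0):
    """candidates: list of (log_prob, english_spans) from the n-best list.
    Picks a probable candidate with low divergence."""
    return max(
        candidates,
        key=lambda c: c[0] - weight * divergence(c[1], german_spans, alignment),
    )


# Toy 5-word sentence pair with a monotone one-to-one alignment.
alignment = [(i, i) for i in range(5)]
german_spans = [(2, 5)]
high = (-10.0, [(0, 5), (0, 3), (3, 5)])   # slightly more probable
low = (-10.5, [(0, 5), (0, 2), (2, 5)])    # agrees with the German parse
best = rerank([high, low], german_spans, alignment)
```

In this toy case the second candidate wins even though the baseline parser prefers the first, because the projected German constituent crosses a bracket of the first candidate.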
12
Measuring syntactic divergence
$$P(e \mid g) = \frac{\exp \sum_m \lambda_m h_m(g, e, a)}{\sum_{e'} \exp \sum_m \lambda_m h_m(g, e', a)}$$
Define features to capture different (overlapping) aspects of syntactic divergence. Functions of: Candidate English parse e German parse g Word alignment a Combine in log-linear model Discriminatively train λ parameters to maximize parsing accuracy on a training set (minimum error rate training)
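As a small sketch of how the log-linear model turns feature values into a distribution over the candidate English parses: each candidate is represented only by its vector of feature values h_m(g, e, a), and the weights λ_m are given. The function name and the toy numbers are assumptions for illustration.

```python
import math


def posterior(feature_vectors, weights):
    """Return P(e | g) for each candidate parse e.

    feature_vectors[i][m] holds h_m(g, e_i, a); weights[m] holds lambda_m.
    """
    scores = [
        sum(w * h for w, h in zip(weights, feats)) for feats in feature_vectors
    ]
    # Subtract the maximum score before exponentiating for numerical
    # stability; this leaves the normalized probabilities unchanged.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    return [x / z for x in exps]


# Two candidates, two features (say, baseline log-probability and one
# bitext feature); the numbers are made up.
probs = posterior([[-1.0, 0.0], [-1.5, 1.0]], weights=[1.0, 1.0])
```

For reranking, only the argmax matters, so the normalization can be skipped in practice; it is shown here to match the formula.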
13
Rich bitext projection features Defined 36 features by looking at common English parsing errors No monolingual features, except baseline parser probability General features Is there a probable label correspondence between German and the hypothesized English parse? How expected is the size of each constituent in the hypothesized English parse given the German parse? Specific features Are coordinations realized identically? Is the NP structure the same? Mix of probabilistic and heuristic features
14
Training Use BitPar syntactic forest parser English BitPar trained on Penn Treebank German BitPar trained on Tiger Treebank Probabilistic feature functions built using large parallel text (Europarl) Weights on feature functions (lambda vector) trained on portion of the Penn Treebank together with its translation into German Minimum error rate training using F score
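To make "training the weights using F score" concrete, here is a heavily simplified sketch. It computes labeled-bracket F1 over a corpus and picks a single divergence weight from a grid by brute force. This grid search is a stand-in chosen for brevity, not the minimum error rate training procedure itself, and the data layout (tuples of log-probability, divergence and brackets) is an assumption for the example.

```python
from collections import Counter


def bracket_counts(candidate, gold):
    """Return (matched, candidate_total, gold_total) for labeled brackets,
    where each bracket is a (label, start, end) tuple."""
    c, g = Counter(candidate), Counter(gold)
    matched = sum((c & g).values())  # multiset intersection
    return matched, sum(c.values()), sum(g.values())


def corpus_f1(selected, golds):
    """Labeled-bracket F1 aggregated over all sentences."""
    matched = cand_total = gold_total = 0
    for cand, gold in zip(selected, golds):
        m, c, g = bracket_counts(cand, gold)
        matched += m
        cand_total += c
        gold_total += g
    if matched == 0:
        return 0.0
    precision = matched / cand_total
    recall = matched / gold_total
    return 2 * precision * recall / (precision + recall)


def tune_weight(nbest_lists, golds, grid):
    """nbest_lists: per sentence, a list of (log_prob, divergence, brackets).
    Returns the grid value whose reranked output has the highest corpus F1."""

    def pick(nbest, w):
        return max(nbest, key=lambda c: c[0] - w * c[1])[2]

    return max(
        grid,
        key=lambda w: corpus_f1([pick(n, w) for n in nbest_lists], golds),
    )


gold = [("S", 0, 5), ("NP", 2, 5)]
nbest = [
    (-10.0, 1.0, [("S", 0, 5), ("NP", 0, 3)]),
    (-10.5, 0.0, [("S", 0, 5), ("NP", 2, 5)]),
]
best_weight = tune_weight([nbest], [gold], grid=[0.0, 1.0])
```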
15
Reranking English parses Difficult task German is difficult to parse Our knowledge source, the German parser, is out-of-domain (poor performance) Baseline English parser we are trying to improve is in-domain (good performance) Test set has long sentences Result: 0.70% F1 improvement on test data (stat. significant)
16
New results Reranking German parses We needed German gold standard parses (and English translations) Sebastian Padó has made a small parallel treebank for Europarl available No engineering on German yet We are using the same syntactic divergence features which were designed to improve English parsing There are German-specific ambiguities which could be modeled, such as subject-object ambiguity (e.g., Die Maus jagt die Katze, “the mouse chases the cat” or “the cat chases the mouse”) But easier task because the parser we are trying to improve is weaker (German is hard to parse, Europarl is out of domain) 2.3% F1 improvement currently; we think this can be further improved
17
Summary: bitext parsing I showed you an approach for bitext parsing Reranking the parses of English to minimize syntactic divergence with an automatically generated German parse I then showed our first results for reranking German parses using a single English parse The approach we used for this kind of morphosyntactic correspondence is more general than just parse reranking Machine translation involves morphosyntactic correspondence And this is where we are interested in looking at Croatian
18
Outline The Institute for Natural Language Processing at the University of Stuttgart Bitext parsing Using morphosyntactic correspondence
19
Morphosyntactic processing I am co-PI of a new IfNLP project funded by the DFG (German Science Foundation) Project: morphosyntactic modeling for statistical machine translation (SMT) SMT research, up until recently, has been dominated by translation into English English expresses a lot of information through word order, very little through inflection Approaches to translating morphologically rich languages to English are preprocessing based
20
Present: linguistic preprocessing Linguistic preprocessing for SMT (stat. machine translation) From: freer syntax, morphologically rich language To: rigid syntax, morphologically poor language Existing examples: German to English, Czech to English
21
Present: linguistic preprocessing How this works Produce morphosyntactic analysis of German (or Czech) Reorder words in the German/Czech sentence to be in English order Reduce morphological inflection (for instance, remove case marking, remove all agreement on adjectives, etc) For Czech: insert pseudo-words (e.g. indicate PRO-drop pronouns) Use statistics on this “simplified” German or Czech to map directly to English using SMT
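A toy illustration of the "reorder" and "reduce inflection" steps. Everything here is invented for the example: the coarse tag set, the hand-supplied lemmas and the single reordering rule. None of it is taken from the Stuttgart system, which works from a full morphosyntactic analysis.

```python
from dataclasses import dataclass


@dataclass
class Token:
    surface: str
    lemma: str
    pos: str  # hypothetical coarse tags: CONJ, DET, ADJ, NOUN, VFIN


def reorder_verb_final(tokens):
    """Toy rule: in a clause opened by a conjunction, move a clause-final
    finite verb to just after the first noun (roughly English order)."""
    if len(tokens) < 3 or tokens[0].pos != "CONJ" or tokens[-1].pos != "VFIN":
        return list(tokens)
    body, verb = tokens[:-1], tokens[-1]
    for i, tok in enumerate(body):
        if tok.pos == "NOUN":
            return body[: i + 1] + [verb] + body[i + 1 :]
    return list(tokens)


def reduce_inflection(tokens):
    """Replace determiners and adjectives by their lemma, which drops
    case and agreement marking; other words keep their surface form."""
    return [t.lemma if t.pos in {"DET", "ADJ"} else t.surface for t in tokens]


# "weil die Frau den Hund sah" (because the woman saw the dog)
clause = [
    Token("weil", "weil", "CONJ"),
    Token("die", "der", "DET"),
    Token("Frau", "Frau", "NOUN"),
    Token("den", "der", "DET"),
    Token("Hund", "Hund", "NOUN"),
    Token("sah", "sehen", "VFIN"),
]
simplified = reduce_inflection(reorder_verb_final(clause))
```

The output is the kind of "simplified" German that the SMT system is then trained on in place of the original word forms and word order.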
22
Present: linguistic preprocessing How well does this work? German to English SMT with linguistic preprocessing (Stuttgart system) Results from 2008 ACL workshop on machine translation (extensive human evaluation) Only system limited to organizer’s data competitive with: The best system of 5 rule-based MT systems Saarbrücken hybrid rule-based/SMT system Google Translate, which does not use linguistic preprocessing but does use vastly more data
23
Future: modeling What about translating from English to German or to Slavic languages? Problem: morphological generation is more difficult It is easy to reduce multiple inflections to one (for instance, stemming) Harder to learn to generate the right inflection
24
Future: modeling Current work on morphological generation Work at Charles University in Prague on Czech Tectogrammatical representation is not (yet) competitive with simple statistics (little explicit knowledge of morphology or syntax) Best English to German SMT systems also use little or no morphological knowledge And they are much worse than rule-based English to German systems Challenge: using morphosyntactic knowledge with statistical approaches requires more than just linguistic preprocessing; it requires morphosyntactic modeling
25
Morphosyntactic correspondence In fact, all multilingual problems involve morphosyntactic correspondence: If we have a source parse tree, and source text, and we would like a target text, this is machine translation If we have a source parse tree, source text and target text, and we would like a target parse, this is bitext parsing If we would like to know which word in the target text is a translation of a particular word in the source text and we use morphosyntactic analysis, this is syntactic word alignment The same thinking can be used for cross-lingual information retrieval Very relevant when one of the languages is morphologically rich
26
Conclusion I introduced the IfNLP Stuttgart I presented a new approach to improving parsing using morphosyntactic correspondence: bitext parsing I discussed the general challenge of using morphosyntactic correspondence, focusing on statistical machine translation Biggest challenge is translating into freer word order, morphologically rich languages (e.g., German and particularly Slavic languages) We are interested in the challenge of building systems to translate to Croatian To do this: we need partners who are working on Croatian analysis! We also request that you think about multilingual applications when producing Croatian NLP resources The type of approach I showed for bitext parsing is useful for other multilingual applications
27
Thank you!
29
Statistical Approach Using statistical models Create many alternatives, called hypotheses Give a score to each hypothesis Find the hypothesis with the best score through search Disadvantages Difficulties handling structurally rich models (math and computation) Need data to train the model parameters Difficult to understand decision process made by system Advantages Avoid hard decisions Speed can be traded with quality, no all-or-nothing Works better in the presence of unexpected input Learns automatically as more data becomes available Modified from Vogel
30
Morphosyntactic knowledge We use: morphological analyzers & treebanks, which are combined in parsing models learned from treebanks English models have little morphological analysis (suffix analysis to determine POS for unknown words) German syntactic parser BitPar (Schmid) uses SMOR (Stuttgart Morphological Analyzer) Given inflected form, SMOR returns possible fine-grained POS tags E.g., for nouns/adjectives: POS, case, gender, number, definiteness BitPar puts possible analyses in the chart, and disambiguates Slavic languages require even more morphological knowledge than German
31
Transferring syntactic knowledge Need knowledge source! English syntactic parser About 90% bracketing accuracy Mapping Requires bitext Work discussed here uses German/English Europarl (European Parliament Proceedings) Resource for Croatian: Acquis Communautaire Automatically generated word alignment
32
Additional details in the paper Formalization of bitext parsing as a parse reranking task Definitions of bitext feature functions Analysis of feature functions through feature selection Comparison of MERT (minimum error rate training) with SVM-Rank