1 Morphosyntactic correspondence: a progress report on bitext parsing  Alexander Fraser, Renjing Wang, Hinrich Schütze  Institute for NLP, University of Stuttgart  INFuture2009: Digital Resources and Knowledge Sharing  Nov 4th 2009, Zagreb

2 Outline  The Institute for Natural Language Processing at the University of Stuttgart  Bitext parsing  Using morphosyntactic correspondence

3 IfNLP Stuttgart  The Institute for Natural Language Processing (IfNLP/IMS) at the University of Stuttgart  Dogil (Phonetics and Speech)  Large department  Kuhn/Rohrer (LFG syntax and semantics)  Cahill (LFG generation)  Heid (Terminology extraction, morphology)  Padó (Semantics, lexical semantics)  Schütze (Statistical NLP and Information Retrieval)  More on next slide

4 IfNLP – Statistical NLP Group  Hinrich Schütze (director since 2004)  Bernd Möbius – Speech recognition and synthesis  Helmut Schmid – Parsing, morphology (known for TreeTagger, BitPar)  Sabine Schulte im Walde – NLP and cognitive modeling of lexical semantics  Michael Walsh – Speech, exemplar-theoretic syntax  Alex Fraser – Statistical machine translation, parsing, cross-lingual information retrieval  General department areas of research  New statistical NLP models and methods  Semi-supervised and active learning  Cognitive/linguistic representation models  Applied to: NLP, retrieval, MT, speech, e-learning, …

5 IfNLP – Partnerships  Partnerships  Stuttgart: large projects with linguistics, computer science, EE signal processing, high performance computing  Germany: Darmstadt, Tübingen, DSPIN/CLARIN consortium (UIMA-based German processing)  International: large French-led European project (6 universities, 4 industrial partners), collaborations on South African languages, Edinburgh, CLARIN  Industrial: various projects with publishers (many focusing on terminology)

6 Outline  The Institute for Natural Language Processing at the University of Stuttgart  Bitext parsing  Using morphosyntactic correspondence

7 What is bitext parsing?  Bitext: a text and its translation  Sentences and their translations are aligned  Sometimes called a parallel corpus  Syntactic parsing: automatically find the syntactic structure of a sentence (syntactic parse)  Bitext parsing: automatically find the syntactic structure of the parallel sentences in a bitext  We will use the complementarity of the syntax of the two languages to obtain improved parses

8 Motivation for bitext parsing  Many advances in syntactic parsing come from better modeling  But the overall bottleneck is the size of the treebank  Our research asks a different question:  Where can we (cheaply) obtain additional information, which helps to supplement the treebank?  A new information source for resolving ambiguity is a translation  The human translator understands the sentence and disambiguates for us!  Our research goal was to build large databases of improved parses to help establish preferences for difficult phenomena like PP-attachment

9 Clause attachment ambiguity  Parse 1: high attachment (wrong)  Parse 2: low attachment (correct)

10 Not ambiguous in German  Number agreement disambiguates  FRAU (woman) and HATTE (had) agree  Unambiguous low attachment

11 Parse reranking of bitext  Goal: improve English parsing accuracy  Parse English sentence, obtain list of 100 best parse candidates  Parse German sentence, obtain single best parse  Determine the correspondence of German to English words using a word alignment  Calculate syntactic divergence of each English parse candidate and the projection of the German parse  Choose probable English parse candidate with low syntactic divergence
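The projection and divergence steps on this slide can be sketched roughly as follows. This is a minimal illustration, not the features used in the talk: it assumes parses are reduced to sets of token spans, that the word alignment is a set of (German index, English index) links, and that "divergence" is simply the number of projected German constituents that cross an English bracket. The names `project_span`, `crosses`, `divergence` and `rerank` are hypothetical, and the parser probability is ignored here.

```python
from typing import List, Optional, Set, Tuple

Span = Tuple[int, int]   # inclusive (start, end) token indices
Link = Tuple[int, int]   # (german_index, english_index)


def project_span(span: Span, alignment: Set[Link]) -> Optional[Span]:
    """Project a German span onto the English sentence via alignment links."""
    targets = [e for g, e in alignment if span[0] <= g <= span[1]]
    if not targets:
        return None  # no aligned words, nothing to project
    return (min(targets), max(targets))


def crosses(a: Span, b: Span) -> bool:
    """True if the two spans overlap without one containing the other."""
    return (a[0] < b[0] <= a[1] < b[1]) or (b[0] < a[0] <= b[1] < a[1])


def divergence(english_spans: List[Span], german_spans: List[Span],
               alignment: Set[Link]) -> int:
    """Count projected German constituents that cross some English bracket."""
    count = 0
    for g_span in german_spans:
        projected = project_span(g_span, alignment)
        if projected is not None and any(crosses(projected, e) for e in english_spans):
            count += 1
    return count


def rerank(candidates: List[List[Span]], german_spans: List[Span],
           alignment: Set[Link]) -> List[Span]:
    """Return the English candidate with the lowest divergence (assumed criterion)."""
    return min(candidates, key=lambda spans: divergence(spans, german_spans, alignment))


# Toy data: six tokens, one-to-one alignment, two English candidates.
alignment = {(i, i) for i in range(6)}
german = [(0, 5), (3, 5)]
candidate_a = [(0, 5), (0, 3), (4, 5)]   # bracket (0, 3) crosses projected (3, 5)
candidate_b = [(0, 5), (0, 2), (3, 5)]   # compatible with the German bracketing
best = rerank([candidate_a, candidate_b], german, alignment)
```

In this toy case `candidate_b` is chosen because none of its brackets cross the projected German constituents. A real system would weigh such evidence against the baseline parser probability rather than use it alone.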

12 Measuring syntactic divergence  $P(e \mid g) = \frac{\exp \sum_m \lambda_m h_m(g, e, a)}{\sum_{e'} \exp \sum_m \lambda_m h_m(g, e', a)}$  Define features to capture different (overlapping) aspects of syntactic divergence. Functions of:  Candidate English parse e  German parse g  Word alignment a  Combine in log-linear model  Discriminatively train λ parameters to maximize parsing accuracy on a training set (minimum error rate training)
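A small sketch of how the log-linear model turns feature values into a distribution over parse candidates. The feature values and weights below are made up for illustration (a parser log-probability and a negated divergence count); the function name `loglinear_posteriors` is hypothetical.

```python
import math
from typing import List, Sequence


def loglinear_posteriors(features: Sequence[Sequence[float]],
                         weights: Sequence[float]) -> List[float]:
    """P(e | g) for each candidate e, given its feature vector h(g, e, a)."""
    scores = [sum(w * h for w, h in zip(weights, feats)) for feats in features]
    top = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)                           # normalizer: sum over all candidates e'
    return [x / z for x in exps]


# Two candidates, two features each: [parser log-probability, -divergence].
candidate_features = [[-10.0, -1.0], [-10.5, 0.0]]
lambdas = [1.0, 2.0]                        # illustrative weights, not trained values
posteriors = loglinear_posteriors(candidate_features, lambdas)
best_index = max(range(len(posteriors)), key=posteriors.__getitem__)
```

With these weights the second candidate wins despite its lower parser probability, because the divergence feature outweighs the difference. For reranking alone the normalizer is not needed, since the argmax of the unnormalized scores is the same.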

13 Rich bitext projection features  Defined 36 features by looking at common English parsing errors  No monolingual features, except baseline parser probability  General features  Is there a probable label correspondence between German and the hypothesized English parse?  How expected is the size of each constituent in the hypothesized English parse given the German parse?  Specific features  Are coordinations realized identically?  Is the NP structure the same?  Mix of probabilistic and heuristic features

14 Training  Use BitPar syntactic forest parser  English BitPar trained on Penn Treebank  German BitPar trained on Tiger Treebank  Probabilistic feature functions built using large parallel text (Europarl)  Weights on feature functions (lambda vector) trained on portion of the Penn Treebank together with its translation into German  Minimum error rate training using F score
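To make "trained using F score" concrete, here is a simplified sketch. It computes a labeled-bracket F1 and then picks a single divergence weight by coarse grid search. This grid search is a stand-in for minimum error rate training, which uses a more efficient line search over many weights; the data layout and the names `bracket_f1`, `pick` and `tune_weight` are assumptions made for the example.

```python
from typing import List, Sequence, Set, Tuple

Bracket = Tuple[str, int, int]                    # (label, start, end)
Candidate = Tuple[float, float, Set[Bracket]]     # (parser logprob, divergence, brackets)


def bracket_f1(gold: Sequence[Set[Bracket]], pred: Sequence[Set[Bracket]]) -> float:
    """Corpus-level labeled bracketing F1 over parallel lists of bracket sets."""
    matched = sum(len(g & p) for g, p in zip(gold, pred))
    n_gold = sum(len(g) for g in gold)
    n_pred = sum(len(p) for p in pred)
    if matched == 0:
        return 0.0
    precision, recall = matched / n_pred, matched / n_gold
    return 2 * precision * recall / (precision + recall)


def pick(candidates: List[Candidate], weight: float) -> Set[Bracket]:
    """Choose the candidate maximizing logprob - weight * divergence."""
    return max(candidates, key=lambda c: c[0] - weight * c[1])[2]


def tune_weight(train: List[Tuple[Set[Bracket], List[Candidate]]],
                grid: Sequence[float]) -> Tuple[float, float]:
    """Return the (weight, F1) pair with the highest training-set F1 on the grid."""
    best_w, best_f = grid[0], -1.0
    gold = [g for g, _ in train]
    for w in grid:
        score = bracket_f1(gold, [pick(cands, w) for _, cands in train])
        if score > best_f:
            best_w, best_f = w, score
    return best_w, best_f


# Toy training set: one sentence with two parse candidates.
gold_parse = {("S", 0, 5), ("NP", 0, 2), ("VP", 3, 5)}
wrong_parse = {("S", 0, 5), ("NP", 0, 3), ("VP", 4, 5)}
train_set = [(gold_parse, [(-10.0, 1.0, wrong_parse), (-10.5, 0.0, gold_parse)])]
weight, f1 = tune_weight(train_set, [0.0, 0.5, 1.0, 2.0])
```

On this toy set the search settles on the smallest grid weight at which the divergence penalty overturns the baseline parser's preference for the wrong parse.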

15 Reranking English parses  Difficult task  German is difficult to parse  Our knowledge source, the German parser, is out-of-domain (poor performance)  Baseline English parser we are trying to improve is in-domain (good performance)  Test set has long sentences  Result: 0.70% F1 improvement on test data (stat. significant)

16 New results  Reranking German parses  We needed German gold standard parses (and English translations)  Sebastian Padó has made a small parallel treebank for Europarl available  No engineering on German yet  We are using the same syntactic divergence features which were designed to improve English parsing  There are German-specific ambiguities which could be modeled, such as subject-object ambiguity (e.g., Die Maus jagt die Katze, “the mouse chases the cat” or “the cat chases the mouse”)  But easier task because the parser we are trying to improve is weaker (German is hard to parse, Europarl is out of domain)  2.3% F1 improvement currently; we think this can be further improved

17 Summary: bitext parsing  I showed you an approach for bitext parsing  Reranking the parses of English to minimize syntactic divergence with an automatically generated German parse  I then showed our first results for reranking German parses using a single English parse  The approach we used for this kind of morphosyntactic correspondence is more general than just parse reranking  Machine translation involves morphosyntactic correspondence  And this is where we are interested in looking at Croatian

18 Outline  The Institute for Natural Language Processing at the University of Stuttgart  Bitext parsing  Using morphosyntactic correspondence

19 Morphosyntactic processing  I am co-PI of a new IfNLP project funded by the DFG (German Research Foundation)  Project: morphosyntactic modeling for statistical machine translation (SMT)  SMT research, up until recently, has been dominated by translation into English  English expresses a lot of information through word order, very little through inflection  Approaches to translating morphologically rich languages to English are preprocessing based

20 Present: linguistic preprocessing  Linguistic preprocessing for SMT (stat. machine translation)  From: freer syntax, morphologically rich language  To: rigid syntax, morphologically poor language  Existing examples: German to English, Czech to English

21 Present: linguistic preprocessing  How this works  Produce morphosyntactic analysis of German (or Czech)  Reorder words in the German/Czech sentence to be in English order  Reduce morphological inflection (for instance, remove case marking, remove all agreement on adjectives, etc)  For Czech: insert pseudo-words (e.g. indicate PRO-drop pronouns)  Use statistics on this “simplified” German or Czech to map directly to English using SMT
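The reordering and inflection-reduction steps can be illustrated with a toy sketch. It is not the Stuttgart system: it assumes hand-annotated tokens with STTS-style POS tags and lemmas, handles only verb-final subordinate clauses, and crudely treats the first noun as the end of the subject. The names `Token`, `reorder_verb_final` and `reduce_inflection` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Token:
    form: str    # surface form
    lemma: str   # hand-annotated lemma (assumption for this toy example)
    pos: str     # STTS-style tag, e.g. KOUS, ART, NN, VVFIN


def reorder_verb_final(tokens: List[Token]) -> List[Token]:
    """Move a clause-final finite verb to directly after the first noun."""
    if len(tokens) < 3 or tokens[0].pos != "KOUS" or tokens[-1].pos != "VVFIN":
        return list(tokens)              # not a verb-final subordinate clause
    verb, rest = tokens[-1], tokens[:-1]
    for i, tok in enumerate(rest):
        if tok.pos == "NN":              # crude proxy for the end of the subject
            return rest[:i + 1] + [verb] + rest[i + 1:]
    return list(tokens)


def reduce_inflection(tokens: List[Token]) -> List[str]:
    """Replace articles and attributive adjectives by their lemma (drops case marking)."""
    return [t.lemma if t.pos in {"ART", "ADJA"} else t.form for t in tokens]


# "dass die Frau den Mann sah" (that the woman saw the man)
clause = [
    Token("dass", "dass", "KOUS"),
    Token("die", "der", "ART"),
    Token("Frau", "Frau", "NN"),
    Token("den", "der", "ART"),
    Token("Mann", "Mann", "NN"),
    Token("sah", "sehen", "VVFIN"),
]
simplified = reduce_inflection(reorder_verb_final(clause))
```

The result, `dass der Frau sah der Mann`, is no longer grammatical German, but its word order matches English and the articles carry no case marking, which is the kind of "simplified" input the slide describes.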

22 Present: linguistic preprocessing  How well does this work?  German to English SMT with linguistic preprocessing (Stuttgart system)  Results from 2008 ACL workshop on machine translation (extensive human evaluation)  The only system limited to the organizers’ data that was competitive with:  The best system of 5 rule-based MT systems  Saarbrücken hybrid rule-based/SMT system  Google Translate, which does not use linguistic preprocessing but does use vastly more data

23 Future: modeling  What about translating from English to German or to Slavic languages?  Problem: morphological generation is more difficult  It is easy to reduce multiple inflections to one (for instance, stemming)  Harder to learn to generate the right inflection

24 Future: modeling  Current work on morphological generation  Work at Charles University in Prague on Czech  Tectogrammatical representation is not (yet) competitive with simple statistics (little explicit knowledge of morphology or syntax)  Best English to German SMT systems also use little or no morphological knowledge  And they are much worse than rule-based English to German systems  Challenge: using morphosyntactic knowledge with statistical approaches requires more than just linguistic preprocessing; it requires morphosyntactic modeling

25 Morphosyntactic correspondence  In fact, all multilingual problems involve morphosyntactic correspondence:  If we have a source parse tree, and source text, and we would like a target text, this is machine translation  If we have a source parse tree, source text and target text, and we would like a target parse, this is bitext parsing  If we would like to know which word in the target text is a translation of a particular word in the source text and we use morphosyntactic analysis, this is syntactic word alignment  The same thinking can be used for cross-lingual information retrieval  Very relevant when one of the languages is morphologically rich

26 Conclusion  I introduced the IfNLP Stuttgart  I presented a new approach to improving parsing using morphosyntactic correspondence: bitext parsing  I discussed the general challenge of using morphosyntactic correspondence, focusing on statistical machine translation  Biggest challenge is translating into freer word order, morphologically rich languages (e.g., German and particularly Slavic languages)  We are interested in the challenge of building systems to translate to Croatian  To do this: we need partners who are working on Croatian analysis!  We also request that you think about multilingual applications when producing Croatian NLP resources  The type of approach I showed for bitext parsing is useful for other multilingual applications

27 Thank you!

29 Statistical Approach  Using statistical models  Create many alternatives, called hypotheses  Give a score to each hypothesis  Find the hypothesis with the best score through search  Disadvantages  Difficulties handling structurally rich models (math and computation)  Need data to train the model parameters  Difficult to understand decision process made by system  Advantages  Avoid hard decisions  Speed can be traded for quality, no all-or-nothing  Works better in the presence of unexpected input  Learns automatically as more data becomes available  (Modified from Vogel)

30 Morphosyntactic knowledge  We use: morphological analyzers & treebanks, which are combined in parsing models learned from treebanks  English models have little morphological analysis (suffix analysis to determine POS for unknown words)  German syntactic parser BitPar (Schmid) uses SMOR (Stuttgart Morphological Analyzer)  Given inflected form, SMOR returns possible fine-grained POS tags  E.g., for nouns/adjectives: POS, case, gender, number, definiteness  BitPar puts possible analyses in the chart, and disambiguates  Slavic languages require even more morphological knowledge than German
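The idea of keeping all possible fine-grained analyses and letting agreement disambiguate can be sketched with a toy lexicon. This does not reproduce SMOR's output format or BitPar's chart; the hand-written `LEXICON`, the (case, gender, number) tuples and the function `agreeing_analyses` are assumptions for illustration only.

```python
from typing import Dict, Set, Tuple

Analysis = Tuple[str, str, str]   # (case, gender, number)

# Hand-written toy lexicon: each inflected form maps to all analyses it allows.
LEXICON: Dict[str, Set[Analysis]] = {
    "dem": {("dat", "masc", "sg"), ("dat", "neut", "sg")},
    "der": {("nom", "masc", "sg"), ("dat", "fem", "sg"), ("gen", "fem", "sg")}
           | {("gen", g, "pl") for g in ("masc", "fem", "neut")},
    "Mann": {(c, "masc", "sg") for c in ("nom", "acc", "dat")},
    "Frau": {(c, "fem", "sg") for c in ("nom", "acc", "dat", "gen")},
}


def agreeing_analyses(article: str, noun: str) -> Set[Analysis]:
    """Keep only analyses on which article and noun agree in case, gender and number."""
    return LEXICON[article] & LEXICON[noun]


dem_mann = agreeing_analyses("dem", "Mann")   # fully disambiguated by agreement
der_frau = agreeing_analyses("der", "Frau")   # case stays ambiguous (dative or genitive)
```

Agreement alone resolves "dem Mann" to a single analysis, while "der Frau" remains ambiguous between dative and genitive; resolving such leftovers is where the wider syntactic context in the parser comes in.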

31 Transferring syntactic knowledge  Need knowledge source!  English syntactic parser  About 90% bracketing accuracy  Mapping  Requires bitext  Work discussed here uses German/English Europarl (European Parliament Proceedings)  Resource for Croatian: Acquis Communautaire  Automatically generated word alignment

32 Additional details in the paper  Formalization of bitext parsing as a parse reranking task  Definitions of bitext feature functions  Analysis of feature functions through feature selection  Comparison of MERT (minimum error rate training) with SVM-Rank

