NAACL 2007: Treebank-Based Acquisition of Multilingual LFG Resources for Parsing, Generation and Transfer

1 NAACL 2007 Treebank-Based Acquisition of Multilingual LFG Resources
Treebank-Based Acquisition of Multilingual LFG Resources for Parsing, Generation and Transfer
Josef van Genabith, National Centre for Language Technology (NCLT), Dublin City University, Ireland
Treebank Workshop, NAACL 2007

2 Lexical-Functional Grammar (LFG)
“Shallow” grammar: defines a language (set of strings)
“Deep” grammar: as above, plus a mapping from strings to a “meaning” representation: predicate-argument structure, dependencies, simple logical form, …; usually involves some form of long-distance dependency (LDD) resolution
Deep grammars (HPSG, LFG, CCG, TAG, …) are usually hand-crafted
Very difficult and expensive to scale to unrestricted text
Motivation for treebank-based deep grammar acquisition (LFG/CCG/HPSG/TAG/DepGr/…)!
LFG: [Kaplan and Bresnan, 82; Dalrymple, 2001; Bresnan, 2001]
– Constraint-based (“unification”), lexicalised
– c(onstituent)-structure and f(unctional)-structure
– c-str: surface configuration (CFG trees)
– f-str: abstract grammatical functions/relations (SUBJ, OBJ, OBL, COMP, XCOMP, ADJN, POSS, APP, …)
– f-str: AVM (feature-structure) encoding of dependencies/pred-arg. structure
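To make the "AVM encoding of dependencies" point concrete, here is a minimal sketch that models an f-structure as a nested Python dict and flattens it into dependency triples. The example sentence ("John saw Mary"), the feature names NUM and PERS, and the helper `dependencies` are assumptions chosen for illustration; they are not taken from the slides.

```python
# Illustrative only: an f-structure (AVM) for "John saw Mary" as a nested dict.
# The sentence and the NUM/PERS/TENSE features are assumptions for this sketch.
f_structure = {
    "PRED": "see<SUBJ,OBJ>",
    "TENSE": "past",
    "SUBJ": {"PRED": "John", "NUM": "sg", "PERS": 3},
    "OBJ": {"PRED": "Mary", "NUM": "sg", "PERS": 3},
}


def dependencies(fs):
    """Flatten an f-structure into (relation, head, dependent) triples.

    Only attributes whose value is itself an f-structure (a dict) are
    treated as grammatical functions; atomic features are skipped.
    """
    head = fs["PRED"].split("<")[0]  # strip the subcategorisation frame
    triples = []
    for attr, value in fs.items():
        if isinstance(value, dict):
            triples.append((attr, head, value["PRED"].split("<")[0]))
            triples.extend(dependencies(value))  # recurse into embedded f-structures
    return triples


triples = dependencies(f_structure)
```

Here `triples` holds one entry per grammatical function, which is the sense in which an f-structure encodes a dependency analysis.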

3 Lexical-Functional Grammar LFG

4 Lexical-Functional Grammar LFG
Treebank: trees
How do we get from trees to f-structures? What’s missing is the equations!
Automatic f-structure annotation algorithm
– Traverses tree and assigns LFG equations
– Principle-based c-str/f-str interface

5 F-Structure Annotation Algorithm
Algorithm exploits:
– Categorial information (NP, VP, VBZ, …)
– Configurational information: local head, left/right of head
  Leftmost NP sister to right of V(erbal) head: (↑ OBJ)=↓
– Morphological information: Him: (↑ OBJ)=↓
– “Functional” tag information: -LGS (↑ PASSIVE)=+, -SBJ, -CLR, …
– Trace/co-indexation information
  Translate traces + co-indexation to corresponding re-entrancies at f-str.
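The kinds of information listed above can be sketched as a toy annotator for a single local tree (one mother node and its daughters). This is a hypothetical illustration, not the algorithm used in the work presented: the head table `HEAD_CATEGORIES`, the function `annotate`, and the default adjunct equation are all assumptions made for the example.

```python
# Hypothetical sketch: assign LFG equations to the daughters of one local tree
# using category, position relative to the head, and Penn-II functional tags.
# The head table below is a simplification assumed for this example.
HEAD_CATEGORIES = {
    "S": {"VP"},
    "VP": {"VB", "VBD", "VBZ", "VBP", "VBN", "VBG", "MD"},
}


def annotate(mother, daughters):
    """Return (label, equation) pairs for the daughters of a local tree."""
    heads = HEAD_CATEGORIES.get(mother, set())
    cats = [d.split("-")[0] for d in daughters]
    # Configurational information: locate the local head (first match).
    head_idx = next((i for i, c in enumerate(cats) if c in heads), None)
    annotated = []
    obj_assigned = False
    for i, label in enumerate(daughters):
        cat, *tags = label.split("-")
        if i == head_idx:
            eq = "↑=↓"                      # head shares the mother's f-structure
        elif "SBJ" in tags:
            eq = "(↑ SUBJ)=↓"               # "functional" tag information
        elif (mother == "VP" and cat == "NP" and head_idx is not None
              and i > head_idx and not obj_assigned):
            eq = "(↑ OBJ)=↓"                # leftmost NP to the right of the verbal head
            obj_assigned = True
        else:
            eq = "↓∈(↑ ADJN)"               # assumed catch-all default: adjunct
        annotated.append((label, eq))
    return annotated


s_rule = annotate("S", ["NP-SBJ", "VP"])
vp_rule = annotate("VP", ["VBD", "NP", "PP"])
```

A fuller annotator would also use morphological cues and traces; this sketch covers only the categorial, configurational and functional-tag cases.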

6 F-Structure Annotation Algorithm
Diagram labels:
Left-Right Context Annotation Principles
Coordination Annotation Principles
Catch-All and Clean-Up
Traces
Proto F-Structures
Proper F-Structures
Head-Lexicalization [Magerman, 1994]
Lemmatization + Macros
Lexical Entries
Defaults – “Functional Tags”

7 Treebank Annotation: Control & Wh-Rel. LDD
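A re-entrancy at f-structure means that two attribute paths point to one and the same value. The sketch below illustrates this for subject control, using object identity in Python to stand for structure sharing. The example sentence ("John wants to leave"), the path-based helpers `get` and `resolve_trace`, and the dict encoding are assumptions for illustration only.

```python
# Illustrative assumption: "John wants to leave" with subject control.
# A co-indexed trace in the tree becomes a shared value in the f-structure.

def get(fs, path):
    """Follow a list of attribute names into a nested f-structure."""
    for attr in path:
        fs = fs[attr]
    return fs


def resolve_trace(fs, antecedent_path, trace_path):
    """Make the value at trace_path the very same object as the antecedent."""
    *parent_path, attr = trace_path
    get(fs, parent_path)[attr] = get(fs, antecedent_path)
    return fs


fs = {
    "PRED": "want<SUBJ,XCOMP>",
    "SUBJ": {"PRED": "John"},
    "XCOMP": {"PRED": "leave<SUBJ>"},  # embedded subject not yet resolved
}
resolve_trace(fs, ["SUBJ"], ["XCOMP", "SUBJ"])

# Both paths now reach one shared f-structure rather than two copies.
shared = fs["SUBJ"] is fs["XCOMP"]["SUBJ"]
```

Because the value is shared rather than copied, any feature later added to the matrix subject is visible from the embedded clause as well.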

8 Multilingual Treebank-Based LFG Resources
English + Penn-II: parsers (+ LDD resolution), generators, subcat-frame extraction, bootstrapping of new TB-resources (QuestionBank), transfer
Pilots/proof of concept: multilingual treebank-based LFG acquisition:
– German: TIGER (Cahill et al 2003, 2005)
– Chinese: CTB (Burke et al 2004)
– Spanish: Cast3LB (O’Donovan et al 2005), (Chrupala and van Genabith 2006)
GramLab Project (2005-2008): Chinese, Japanese, Arabic, Spanish, French and German

9 Multilingual Treebank-Based LFG Resources
Language   Treebank    Size              Coding/Data
English    Penn-II     50,000            CFG+traces+FT
Chinese    CTB 5.1     18,000            CFG+traces+FT
Japanese   KTC 4.0     38,000            Dep (+traces)
German     TIGER 2.0   50,000            Graphs+CFG+Dep
German     TüBa-D/Z    22,000            CFG+Dep+f-traces
Spanish    Cast3LB     3,500             CFG+Dep+f-traces
Arabic     ATB         300,000 (words)
French     P7T         20,000            CFG+Dep+f-traces
Total                  > 200,000

10 Q2
What was missing in TB resource?
– F-structures, pred-argument structure, dependencies => f-structure annotation algorithm
– Limited domain in Penn-II (most treebanks …) => bootstrap grammar and QuestionBank (4000 questions from TREC and CCG)
– GFs, active/passive, decl/interrog/imp, control, raising, LDDs, pro-drop, zero-anaphora, tense/aspect, …
What was done by hand?
– F-structure annotation algorithm (principle-based c-/f-str interface)
– No restructuring, no clean-up of TB (unlike CCG/HPSG/TAG; but see P7T)
– No manual additions (unlike CCG/HPSG/TAG)
– Future work …

11 Q3
Methodological Issues - Quality Assurance:
Evaluation against hand-crafted/corrected Gold Standard DepBanks:
– PARC 700
– CBS 500
– PropBank
– Own Gold Standard DepBanks for English, Chinese, Japanese, German, Arabic, Spanish, French (200-500)
CCG-style evaluation against automatically annotated Gold (Silver-) Standard DepBanks based on WSJ Sec. 23 trees (CCG, HPSG)
Quality of annotation process and parsing resources: treebank-based LFG parsing statistically significantly outperforms XLE and RASP (PARC 700 & CBS 500)
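Evaluation against a dependency bank is commonly reported as precision, recall and f-score over dependency triples. The following is a minimal sketch of that computation under stated assumptions: triples are compared as exact matches in a set (so duplicates collapse), and the function name `prf` and the toy triples are hypothetical. It is not the evaluation software used for the results above.

```python
# A minimal sketch of triple-based evaluation: compare predicted dependency
# triples against gold-standard triples by exact match.

def prf(gold, predicted):
    """Return (precision, recall, f_score) for two collections of triples."""
    gold, predicted = set(gold), set(predicted)
    matched = len(gold & predicted)
    precision = matched / len(predicted) if predicted else 0.0
    recall = matched / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Toy data, invented for the example.
gold = [("SUBJ", "see", "John"), ("OBJ", "see", "Mary"), ("TENSE", "see", "past")]
predicted = [("SUBJ", "see", "John"), ("OBJ", "see", "Bill")]
precision, recall, f_score = prf(gold, predicted)
```

In this toy case one of the two predicted triples is correct and one of the three gold triples is recovered, so precision is 1/2 and recall is 1/3.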

12 Q4
Phrase Structure or Dependencies? Both! Why?
Phrase structure is good for parsing and generation => tap into lots of mature, efficient & well understood technology (but see dependency parsing)
Dependencies are close to f-structure/predicate-argument structures …
– Penn-II: CFG trees + traces/co-indexation + “functional” labels/tags
– TIGER: graphs + CFG categories + grammatical function labels + LDDs through crossing edges
– Cast3LB/P7T/TüBa-D/Z: CFG trees + grammatical function labels + LDDs through GF paths

13 Q5 & Q6
Pros/Cons Formalism-Specific Treebank?
– Formalism-Specific Treebank? Bad! Limits usefulness/user group/…
– Better to have a generic TB with CFG + Dep Label + LDDs + other feature labels (as required), and then extract LFG/HPSG/CCG/TAG/Dependency Grammars
Grammar First vs. Treebank First?
– Depends on what you want to do …
– If you want high-quality, wide-coverage resources (that can parse unrestricted text) then it's definitely better to do treebanking first (or use bootstrapping)
– Problem: many traditionally trained linguists see treebanking as a menial task
– Highly qualified and interesting task: empirical linguistics: confront rather than invent data
– Sociological task: how to make treebanking/bootstrapping sexy?

14 Some Resources
ESSLLI 2006 course material: Treebank-Based Acquisition of LFG, HPSG and CCG Resources. J. van Genabith, Y. Miyao and J. Hockenmaier. http://www.computing.dcu.ie/~josef/Malaga06.ppt
LFG parser demo: http://lfg-demo.computing.dcu.ie/lfgparser.html
A. Cahill and J. van Genabith. Robust PCFG-Based Generation using Automatically Acquired LFG-Approximations. COLING/ACL 2006, Sydney, Australia.
J. Judge, A. Cahill and J. van Genabith. QuestionBank: Creating a Corpus of Parse-Annotated Questions. COLING/ACL 2006, Sydney, Australia.
R. O'Donovan, M. Burke, A. Cahill, J. van Genabith and A. Way. Large-Scale Induction and Evaluation of Lexical Resources from the Penn-II and Penn-III Treebanks. Computational Linguistics, 2005.
A. Cahill, M. Forst, M. Burke, M. McCarthy, R. O'Donovan, C. Rohrer, J. van Genabith and A. Way. Treebank-Based Acquisition of Multilingual Unification Grammar Resources. Journal of Research on Language and Computation, Kluwer Academic Press, 2005.
R. O'Donovan, A. Cahill, J. van Genabith and A. Way. Automatic Acquisition of Spanish LFG Resources from the CAST3LB Treebank. In Proceedings of the Tenth International Conference on LFG, Bergen, Norway, 2005.
M. Burke, O. Lam, A. Cahill, R. Chan, R. O'Donovan, A. Bodomo, J. van Genabith and A. Way. Treebank-Based Acquisition of a Chinese Lexical-Functional Grammar. Proceedings of the PACLING-18 Conference, Waseda University, Tokyo, Japan, pp. 161-172, 2004.
A. Cahill, M. Burke, R. O'Donovan, J. van Genabith and A. Way. Long-Distance Dependency Resolution in Automatically Acquired Wide-Coverage PCFG-Based LFG Approximations. In Proceedings of ACL-04, pp. 320-7, Barcelona, Spain, 2004.
A. Cahill, M. McCarthy, J. van Genabith and A. Way. Parsing with PCFGs and Automatic F-Structure Annotation. In M. Butt and T. Holloway-King (eds.): LFG’02, Athens, Greece, CSLI Publications, Stanford, CA, pp. 76-95, 2002.

