
1 ESSLLI 2006 Treebank-Based Acquisition of LFG, HPSG and CCG Resources
Advanced Course: Treebank-Based Acquisition of LFG, HPSG and CCG Resources
Josef van Genabith, Dublin City University
Yusuke Miyao, University of Tokyo
Julia Hockenmaier, University of Pennsylvania and University of Edinburgh
ESSLLI 2006, 18th European Summer School for Language, Logic and Information, University of Malaga, July – August 2006

2 Lecturer Contact Information
Josef van Genabith, National Centre for Language Technology (NCLT), School of Computing, Dublin City University, Dublin 9, Ireland, josef@computing.dcu.ie
Julia Hockenmaier, juliahr@cis.upenn.edu
Yusuke Miyao, Department of Computer Science, The University of Tokyo, Hongo 7-3-1, Bunkyo-ku, Tokyo 113-0033, JAPAN, yusuke@is.s.u-tokyo.ac.jp

3 Motivation
What do grammars do?
– Grammars define languages as sets of strings
– Grammars define which strings are grammatical and which are not
– Grammars tell us about the syntactic structure of (associated with) strings
“Shallow” vs. “deep” grammars:
– Shallow grammars do all of the above
– Deep grammars (in addition) relate text to information/meaning representations
– Information: predicate-argument-adjunct structure, deep dependency relations, logical forms, …
In natural languages, linguistic material is not always interpreted locally where you encounter it: long-distance dependencies (LDDs)
Resolution of LDDs is crucial for constructing accurate and complete information/meaning representations.
Deep grammars := (relating text to meaning) + (LDD resolution)

4 Motivation
Unification (constraint-based) grammar formalisms (FU, GPSG, PATR-II, …):
– Lexical-Functional Grammar (LFG)
– Head-Driven Phrase Structure Grammar (HPSG)
– Combinatory Categorial Grammar (CCG)
– Tree-Adjoining Grammar (TAG)
Traditionally, deep constraint-based grammars are hand-crafted:
– LFG ParGram, HPSG LinGO ERG, Core Language Engine (CLE), Alvey Tools, RASP, ALPINO, …
Wide-coverage, deep unification (constraint-based) grammar development is knowledge-intensive and expensive!
Very hard to scale hand-crafted grammars to unrestricted text!
– English XLE (Riezler et al. 2002); German XLE (Forst and Rohrer 2006); Japanese XLE (Masuichi and Okuma 2003); RASP (Carroll and Briscoe 2002); ALPINO (Bouma, van Noord and Malouf, 2000)

5 Motivation
This is an instance of the “knowledge acquisition bottleneck” familiar from classical “rationalist” rule/knowledge-based AI/NLP
Alternative to classical “rationalist” rule/knowledge-based AI/NLP: the “empiricist” research paradigm (AI/NLP):
– Corpora, treebanks, …, machine-learning-based and statistical approaches, …
– Treebank-based grammar acquisition, probabilistic parsing
– Advantage: grammars can be induced (learned) automatically
– Very low development cost, wide-coverage, robust, but …
Most treebank-based grammar induction/parsing technology produces “shallow” grammars
Shallow grammars don’t resolve LDDs (but see (Johnson 2002); …) and do not map strings to information/meaning representations …

6 Motivation
This poses a research question: can we address the knowledge acquisition bottleneck for deep grammar development by combining insights from the rationalist and empiricist research paradigms?
Specifically:
– Can we automatically acquire wide-coverage “deep”, probabilistic, constraint-based grammars from treebanks?
– How do we use them in parsing?
– Can we use them for generation?
– Can we acquire resources for different languages and treebank encodings?
– How do these resources compare with hand-crafted resources?
– …

7 Course Overview
Monday: Motivation, Course Overview, Introductions to TAG, LFG, CCG, HPSG and Penn-II Treebank, TAG Resources
Tuesday: Penn-II-Based Acquisition of LFG Resources
Wednesday: Penn-II-Based Acquisition of CCG Resources
Thursday: Penn-II-Based Acquisition of HPSG Resources
Friday: Multilingual Resources, Formal Semantics, Comparing LFG, CCG, HPSG and TAG-Based Approaches, Demos, Current and Future Work, Discussion

8 Course Overview
Tuesday/Wednesday/Thursday: Penn-II-Based Acquisition of XXG Resources:
– Treebank Preprocessing/Clean-Up
– Treebank Annotation/Conversion
– Grammar and Lexicon Extraction
– Parsing (Architectures, Probability Models, Evaluation)
– Generation (Architectures, Probability Models, Evaluation)
– Other (Semantics, Domain Variation, …)

9 Grammar Formalisms

10 Grammar formalisms and linguistic theories
Linguistics aims to explain natural language:
– What is universal grammar?
– What are language-specific constraints?
Formalisms are mathematical theories:
– They provide a language in which linguistic theories can be expressed (like calculus for physics)
– They define elementary objects (trees, strings, feature structures) and recursive operations which generate complex objects from simple objects.
– They do impose linguistic constraints (e.g. on the kinds of dependencies they can capture)

11 Lexicalised Grammar Formalisms: TAG, CCG, LFG and HPSG

12 Lexicalised formalisms (TAG, CCG, LFG and HPSG)
The lexicon:
– pairs words with elementary objects
– specifies all language-specific information (number and location of arguments, control and binding theory)
The grammatical operations:
– are universal
– define (and impose constraints on) recursion

13 TAG, CCG, LFG and HPSG
They describe different kinds of linguistic objects:
– TAG is a theory of trees
– CCG is a theory of (syntactic and semantic) types
– LFG is a multi-level theory based on a projection architecture relating different types of linguistic objects (trees, AVMs, linear logic–based semantics)
– HPSG uses a single, uniform formalism (typed feature structures) to describe phonological, morphological, syntactic and semantic representations (signs)
They differ in details:
– treatment of wh-movement, coordination, etc.

14 TAG, CCG, LFG and HPSG
TAG and CCG are weakly equivalent. Both are mildly context-sensitive:
– can capture Dutch crossing dependencies
– but are still efficiently parseable (in polynomial time)
LFG is context-sensitive

15 Tree-Adjoining Grammar (TAG)

16 (Lexicalized) Tree-Adjoining Grammar
TAG is a tree-rewriting formalism:
– TAG defines operations (substitution and adjunction) on trees.
– The elementary objects in TAG are trees (not strings)
TAG is lexicalized:
– Each elementary tree is anchored to a lexical item (word)
– “Extended domain of locality”: the elementary tree contains all arguments of the anchor.
– TAG requires a linguistic theory which specifies the shape of these elementary trees.
TAG is mildly context-sensitive:
– can capture Dutch crossing dependencies
– but is still efficiently parseable
A.K. Joshi and Y. Schabes (1996) Tree Adjoining Grammars. In G. Rosenberg and A. Salomaa, Eds., Handbook of Formal Languages

17 TAG substitution (arguments)
[Figure: substitution of trees rooted in X and Y at the matching substitution nodes X and Y, showing the derived tree and the derivation tree.]

18 TAG adjunction (modifiers)
[Figure: adjunction of an auxiliary tree (root X, foot node X*) at a node labelled X, showing the derived tree and the derivation tree.]

19 A small TAG lexicon
– eats: (S NP (VP (VBZ eats) NP))
– John: (NP John)
– always: (VP (RB always) VP*)
– tapas: (NP tapas)

20 A TAG derivation
[Figure: the elementary trees for "eats", "John", "tapas" and "always" before they are combined.]

21 A TAG derivation
[Figure: "John" and "tapas" substituted at the NP nodes of the "eats" tree; the auxiliary tree (VP (RB always) VP*) is still to be adjoined at VP.]

22 A TAG derivation
[Figure: the derived tree after adjoining "always" at VP: (S (NP John) (VP (RB always) (VP (VBZ eats) (NP tapas)))).]
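The two TAG operations used in this derivation can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the `Node` class and the helper names are made up for this sketch, substitution always picks the leftmost open node with a matching label, and adjunction at the root of a tree is not handled.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)
    word: Optional[str] = None  # set on lexical leaves only
    foot: bool = False          # marks the foot node (e.g. VP*) of an auxiliary tree


def leaves(tree: Node) -> List[str]:
    """Return the words of a tree from left to right."""
    if tree.word is not None:
        return [tree.word]
    return [w for child in tree.children for w in leaves(child)]


def substitute(tree: Node, initial: Node) -> bool:
    """Plug `initial` into the leftmost open substitution node with its label."""
    for i, child in enumerate(tree.children):
        is_open = child.word is None and not child.children and not child.foot
        if is_open and child.label == initial.label:
            tree.children[i] = initial
            return True
        if substitute(child, initial):
            return True
    return False


def plug_foot(aux: Node, subtree: Node) -> bool:
    """Replace the foot node of the auxiliary tree with `subtree`."""
    for i, child in enumerate(aux.children):
        if child.foot:
            aux.children[i] = subtree
            return True
        if plug_foot(child, subtree):
            return True
    return False


def adjoin(tree: Node, aux: Node) -> bool:
    """Adjoin `aux` at the first internal (non-root) node labelled like its root."""
    for i, child in enumerate(tree.children):
        if child.children and child.label == aux.label:
            plug_foot(aux, child)      # the old subtree moves under the foot node
            tree.children[i] = aux     # the auxiliary tree takes its place
            return True
        if adjoin(child, aux):
            return True
    return False


# Elementary trees from the small lexicon on slide 19.
eats = Node("S", [Node("NP"), Node("VP", [Node("VBZ", word="eats"), Node("NP")])])
always = Node("VP", [Node("RB", word="always"), Node("VP", foot=True)])

substitute(eats, Node("NP", word="John"))   # fills the subject NP
substitute(eats, Node("NP", word="tapas"))  # fills the object NP
adjoin(eats, always)                        # adjoins at VP

print(" ".join(leaves(eats)))
```

The trees are modified in place to keep the sketch short; an actual implementation would more likely copy elementary trees and record the derivation tree alongside the derived tree.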

23 Combinatory Categorial Grammar (CCG)

24 Combinatory Categorial Grammar
– CCG is a lexicalized grammar formalism (the “rules” of the grammar are completely general, all language-specific information is given in the lexicon)
– CCG is nearly context-free (can capture Dutch crossing dependencies, but is still efficiently parseable)
– CCG has a flexible constituent structure
– CCG has a simple, unified treatment of extraction and coordination
– CCG has a transparent syntax-semantics interface (every syntactic category and operation has a semantic counterpart)
– CCG rules are monotonic (movement or traces don’t exist)
– CCG rules are type-driven, not structure-driven (this means e.g. that intransitive verbs and VPs are indistinguishable)

25 CCG: the machinery
– Categories: specify subcat lists of words/constituents.
– Combinatory rules: specify how constituents can combine.
– The lexicon: specifies which categories a word can have.
– Derivations: spell out the process of combining constituents.

26 CCG categories
Simple categories: NP, S, PP
Complex categories: functions which return a result when combined with an argument:
– VP or intransitive verb: S\NP
– Transitive verb: (S\NP)/NP
– Adverb: (S\NP)\(S\NP)
– PPs: ((S\NP)\(S\NP))/NP, (NP\NP)/NP
Every category has a semantic interpretation

27 Function application
Combines a function with its argument to yield a result:
– (S\NP)/NP NP -> S\NP (eats + tapas = eats tapas)
– NP S\NP -> S (John + eats tapas = John eats tapas)
Used in all variants of categorial grammar
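As a minimal sketch of function application, categories can be encoded as nested tuples. The encoding and the function names below are assumptions made for this illustration, not part of any CCG toolkit: an atomic category is a string, and a complex category is a triple (result, slash, argument).

```python
# Complex categories are (result, slash, argument) triples; atoms are strings.
VP = ("S", "\\", "NP")   # S\NP
TV = (VP, "/", "NP")     # (S\NP)/NP


def forward_apply(left, right):
    """X/Y  Y  ->  X"""
    if isinstance(left, tuple) and left[1] == "/" and left[2] == right:
        return left[0]
    return None


def backward_apply(left, right):
    """Y  X\\Y  ->  X"""
    if isinstance(right, tuple) and right[1] == "\\" and right[2] == left:
        return right[0]
    return None


eats_tapas = forward_apply(TV, "NP")          # eats + tapas
sentence = backward_apply("NP", eats_tapas)   # John + eats tapas
print(eats_tapas, sentence)
```

Both functions return `None` when the categories do not fit, which is how a chart parser would know that two adjacent constituents cannot be combined by application.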

28 A (C)CG derivation

29 Type-raising and function composition
Type-raising: turns an argument into a function. Corresponds to case:
– NP -> S/(S\NP) (nominative)
– NP -> (S\NP)\((S\NP)/NP) (accusative)
Function composition: composes two functions (complex categories):
– (S\NP)/PP PP/NP -> (S\NP)/NP
– S/(S\NP) (S\NP)/NP -> S/NP
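The following sketch reproduces the forward type-raising and forward composition examples from this slide. It assumes a tuple encoding of categories (result, slash, argument) chosen for this illustration; the function names are likewise hypothetical.

```python
# Complex categories are (result, slash, argument) triples; atoms are strings.
VP = ("S", "\\", "NP")   # S\NP


def forward_type_raise(x, t="S"):
    """X  ->  T/(T\\X), e.g. NP -> S/(S\\NP) for a nominative NP."""
    return (t, "/", (t, "\\", x))


def forward_compose(left, right):
    """X/Y  Y/Z  ->  X/Z"""
    if (isinstance(left, tuple) and isinstance(right, tuple)
            and left[1] == "/" and right[1] == "/" and left[2] == right[0]):
        return (left[0], "/", right[2])
    return None


raised_subject = forward_type_raise("NP")                        # S/(S\NP)
subj_verb = forward_compose(raised_subject, (VP, "/", "NP"))     # S/NP
verb_prep = forward_compose((VP, "/", "PP"), ("PP", "/", "NP"))  # (S\NP)/NP
print(raised_subject, subj_verb, verb_prep)
```

The category S/NP obtained for a subject plus transitive verb is what lets CCG treat strings such as "John eats" as constituents in extraction and coordination.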

30 Type-raising and Composition
Example derivations: wh-movement and right-node raising.

31 Another CCG derivation
We will only be concerned with canonical “normal-form” derivations, which only use function composition and type-raising when syntactically necessary.

32 CCG: semantics
Every syntactic category and rule has a semantic counterpart.

33 The CCG lexicon
Pairs words with their syntactic categories (and semantic interpretation):
– eats: (S\NP)/NP : λx.λy.eats’xy
– eats: S\NP : λx.eats’x
The main bottleneck for wide-coverage CCG parsing

34 Why use CCG for statistical parsing?
– CCG derivations are binary trees: we can use standard chart parsing techniques.
– CCG derivations represent long-range dependencies and complement-adjunct distinctions directly.

35 A comparison with Penn Treebank parsers
Standard Treebank parsers do not recover the null elements and function tags that are necessary for semantic interpretation.

36 Lexical-Functional Grammar (LFG)

37 Lexical-Functional Grammar LFG
Lexical-Functional Grammar (LFG) (Bresnan & Kaplan 1981, Bresnan 2001, Dalrymple 2001) is a unification- (or constraint-) based theory of grammar.
Two (basic) levels of representation:
– C-structure: represents surface grammatical configurations such as word order; annotated CFG data structures
– F-structure: represents abstract syntactic functions such as SUBJ(ject), OBJ(ect), OBL(ique), PRED(icate), COMP(lement), ADJ(unct) …; AVMs (attribute-value matrices/structures)
F-structure approximates to basic predicate-argument structure, dependency representation, logical form (van Genabith and Crouch, 1996; 1997)

38 Lexical-Functional Grammar LFG

39 Lexical-Functional Grammar LFG

40 Lexical-Functional Grammar LFG

41 LFG Grammar Rules and Lexical Entries

42 LFG Parse Tree (with Equations/Constraints)

43 LFG Constraint Resolution (1/3)

44 LFG Constraint Resolution (2/3)

45 LFG Constraint Resolution (3/3)

46 LFG Subcategorisation & Long Distance Dependencies
Subcategorisation:
– Semantic forms (subcat frames): sign
– Completeness: all GFs in the semantic form are present at the local f-structure
– Coherence: only the GFs in the semantic form are present at the local f-structure
Long Distance Dependencies (LDDs): resolved at f-structure with Functional Uncertainty Equations (regular expressions specifying paths in f-structure).
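Completeness and coherence can be read as two simple checks on a local f-structure. The sketch below assumes that an f-structure is a Python dict, that a subcat frame is a list of grammatical functions, and that the set of governable functions and the SUBJ/OBJ frame are illustrative choices made here, not taken from the slides.

```python
# Assumed inventory of governable (subcategorisable) grammatical functions.
GOVERNABLE = {"SUBJ", "OBJ", "OBJ2", "OBL", "COMP", "XCOMP"}


def is_complete(fstructure, frame):
    """Completeness: every GF in the semantic form is present locally."""
    return all(gf in fstructure for gf in frame)


def is_coherent(fstructure, frame):
    """Coherence: every governable GF present locally is in the semantic form."""
    return all(gf in frame for gf in fstructure if gf in GOVERNABLE)


frame = ["SUBJ", "OBJ"]  # hypothetical subcat frame for a transitive verb

ok = {"PRED": "sign", "SUBJ": {"PRED": "pro"}, "OBJ": {"PRED": "contract"}}
missing_obj = {"PRED": "sign", "SUBJ": {"PRED": "pro"}}
extra_comp = dict(ok, COMP={"PRED": "leave"})

print(is_complete(ok, frame), is_coherent(ok, frame))
print(is_complete(missing_obj, frame), is_coherent(extra_comp, frame))
```

Non-governable attributes such as PRED, TENSE or ADJUNCT are ignored by the coherence check, since adjuncts are not listed in semantic forms.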

47 LFG LDDs: Complement Relative Clause

48 LFG LDDs: Complement Relative Clause

49 LFG LDDs: Complement Relative Clause

50 Head-Driven Phrase Structure Grammar (HPSG)

51 Head-Driven Phrase Structure Grammar HPSG
– HPSG (Pollard and Sag 1994, Sag et al. 2003) is a unification-/constraint-based theory of grammar
– HPSG is a lexicalized grammar formalism
– HPSG aims to explain generic regularities that underlie phrase structures, lexicons, and semantics, as well as language-specific/-independent constraints
– Syntactic/semantic constraints are uniformly denoted by signs, which are represented with feature structures
Two components of HPSG:
– Lexical entries represent word-specific constraints (corresponding to elementary objects)
– Principles express generic grammatical regularities (corresponding to grammatical operations)

52 Sign
A sign is a formal representation of combinations of phonological forms and syntactic and semantic constraints
sign
  PHON string (phonological form)
  SYNSEM synsem (syntactic/semantic constraints)
    LOCAL local (local constraints)
      CAT category (syntactic category)
        HEAD head (syntactic head)
          MOD synsem (modifying constraints)
        VAL valence (subcategorization frames)
          SPR list
          SUBJ list
          COMPS list
      CONT content (semantic representations)
    NONLOCAL nonlocal (non-local dependencies)
      QUE list
      REL list
      SLASH list
  DTRS dtrs (daughter structures)

53 Lexical entries
Lexical entries express word-specific constraints
[AVM for “loves”: PHON “loves”, HEAD verb, SUBJ, COMPS]
We use simplified notations in this lecture

54 Principles
Principles describe generic regularities of grammar
– Not corresponding to construction rules
Head Feature Principle
– The value of HEAD must be percolated from the head daughter
Valence Principle
– Subcats not consumed are percolated to the mother
Immediate Dominance (ID) Principle
– A mother and her immediate daughters must satisfy one of the ID schemas
Many other principles: percolation of NONLOCAL features, semantics construction, etc.
[Figure: Head Feature Principle; the HEAD value (tag 1) is shared between the mother and the head daughter.]

55 ID schemas
ID schemas correspond to construction rules in CFGs and other grammar formalisms
– For subject-head constructions (ex. “John runs”)
– For head-complement constructions (ex. “loves Mary”)
– For filler-head constructions (ex. “what he bought”)
[Figures: AVM schemas for the three constructions, using the SUBJ, COMPS and SLASH features.]

56 Example: HPSG parsing
Lexical entries determine syntactic/semantic constraints of words
[Figure: lexical entries for “John saw Mary”. John: HEAD noun, SUBJ <>, COMPS <>. saw: HEAD verb, SUBJ, COMPS. Mary: HEAD noun, SUBJ <>, COMPS <>.]

57 Example: HPSG parsing
Principles determine generic constraints of grammar
[Figure: a schema with tagged HEAD, SUBJ and COMPS values (tags 1 to 4) is unified with the lexical entries for “saw” and “Mary”.]

58 Example: HPSG parsing
Principle application produces phrasal signs
[Figure: phrasal sign for “saw Mary”: HEAD verb, SUBJ, COMPS <>.]

59 Example: HPSG parsing
Recursive applications of principles produce syntactic/semantic structures of sentences
[Figure: sign for “John saw Mary”: HEAD verb, SUBJ <>, COMPS <>.]
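The parsing example above can be approximated with plain dictionaries. This is a deliberately reduced sketch: signs carry only HEAD, SUBJ and COMPS, valence items are bare head values rather than full synsem objects, and the SUBJ and COMPS values for "saw" (one noun each) are assumptions, since the slide transcript does not preserve them. Real HPSG systems use typed feature structure unification instead of these equality checks.

```python
def saturated(sign):
    """A sign is saturated when it subcategorises for nothing more."""
    return not sign["SUBJ"] and not sign["COMPS"]


def head_complement(head, comp):
    """Head-complement schema: the head daughter consumes its first COMPS item."""
    if head["COMPS"] and head["COMPS"][0] == comp["HEAD"] and saturated(comp):
        return {
            "HEAD": head["HEAD"],        # Head Feature Principle
            "SUBJ": head["SUBJ"],        # Valence Principle: unconsumed subcats percolate
            "COMPS": head["COMPS"][1:],
        }
    return None


def subject_head(subj, head):
    """Subject-head schema: the head daughter consumes its single SUBJ item."""
    if not head["COMPS"] and head["SUBJ"] == [subj["HEAD"]] and saturated(subj):
        return {"HEAD": head["HEAD"], "SUBJ": [], "COMPS": []}
    return None


john = {"HEAD": "noun", "SUBJ": [], "COMPS": []}
mary = {"HEAD": "noun", "SUBJ": [], "COMPS": []}
saw = {"HEAD": "verb", "SUBJ": ["noun"], "COMPS": ["noun"]}  # assumed valence

vp = head_complement(saw, mary)   # "saw Mary"
s = subject_head(john, vp)        # "John saw Mary"
print(vp, s)
```

The two results mirror the phrasal signs on slides 58 and 59: the verb phrase still needs a subject, while the sentence has empty SUBJ and COMPS lists.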

60 Example: LDDs
NONLOCAL features (SLASH, REL, etc.) explain long-distance dependencies
– WH movements
– Topicalization
– Relative clauses
– etc.

61 Brief Intro to Penn Treebank

62 The Penn Treebank
The first large syntactically annotated corpus
Contains text from different domains:
– Wall Street Journal (50,000 sentences, 1 million words)
– Switchboard
– Brown corpus
– ATIS
The annotation:
– POS-tagged (Ratnaparkhi’s MXPOST)
– Manually annotated with phrase-structure trees
– Traces and other null elements used to represent non-local dependencies (movement, PRO, etc.)
– Designed to facilitate extraction of predicate-argument structure

63 A Treebank tree
Relatively flat structures:
– There is no noun level
– VP arguments and adjuncts appear at the same level
Co-indexed null elements indicate long-range dependencies
Function tags indicate complement-adjunct distinction (?)

64 Penn-II Treebank

65 Penn-II Treebank
Until Congress acts, the government hasn't any authority to issue new debt obligations of any kind, the Treasury said.

66 Penn-II Treebank

67 Penn-II Treebank

68 Penn-II Treebank

69 Penn-II Treebank

70 Penn-II Treebank (Simple Transitive Verb)

71 Penn-II Treebank (Simple Coordination)

72 Penn-II Treebank (Passive)

73 Penn-II Treebank (Subject WH-Relative Clause)

74 Penn-II Treebank (WH-Less Complement Relative Cl.)

75 Penn-II Treebank (Control and WH-Compl. Rel. Cl.)

76 Penn-II Treebank (Adv. Relative Clause)

77 Penn-II Treebank (Coord. and Right Node Raising)

78 The Parseval measure
Standard evaluation metric for Treebank parsers.
Two components:
– Precision: how many of the proposed NTs are correct?
– Recall: how many of the correct NTs are proposed?
Measures recovery of nonterminals (span + syntactic category)
Ignores function tags and null elements
Consequence: has biased research towards parsers that produce linguistically shallow output (Collins, Charniak)
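Labelled precision and recall can be computed by comparing multisets of (category, start, end) triples. The sketch below assumes that constituents have already been read off the gold and proposed trees, and it leaves out the usual preprocessing conventions (such as ignoring punctuation); the example brackets are made up.

```python
from collections import Counter


def parseval(gold, proposed):
    """Return (precision, recall) over labelled constituents (category, start, end)."""
    gold_counts, proposed_counts = Counter(gold), Counter(proposed)
    correct = sum((gold_counts & proposed_counts).values())  # multiset intersection
    precision = correct / len(proposed) if proposed else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Toy example: the parser misplaces the left edge of the VP.
gold = [("S", 0, 4), ("NP", 0, 1), ("VP", 1, 4), ("NP", 3, 4)]
proposed = [("S", 0, 4), ("NP", 0, 1), ("VP", 2, 4), ("NP", 3, 4)]

precision, recall = parseval(gold, proposed)
print(precision, recall)
```

Because only category and span are compared, a parser that drops every function tag and null element can still score perfectly on this measure.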

79 Treebank-Based Acquisition of TAG resources

80 Extracting a TAG from the Treebank
Two different approaches:
– F. Xia. Automatic Grammar Generation From Two Different Perspectives. PhD thesis, University of Pennsylvania, 2001.
– J. Chen, S. Bangalore, K. Vijay-Shanker. Automated Extraction of Tree-Adjoining Grammars from Treebanks, Natural Language Engineering (forthcoming)
This lecture: just the basic ideas!

81 Extracting a TAG from the Penn Treebank
Input: a Treebank tree (= the TAG derived tree)
Output: a set of elementary trees (= the TAG lexicon)

82 Extracting a TAG: the head
– Identify the head path (requires a head percolation table)
– Find the arguments of the head (requires an argument table)
– Ignore modifiers (requires an adjunct table)
– Merge unary productions (VP -> VP)
[Figure: head path S, VP, VP, VBG “making”, with arguments NP-SBJ and NP.]
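The first step, identifying the head path, can be sketched as follows. The head table here is a tiny hypothetical stand-in (real head percolation tables also specify a search direction per category), trees are (label, children-or-word) tuples, and the words other than "making" are invented for the example.

```python
# Hypothetical head percolation table: parent category -> preferred head daughters.
HEAD_TABLE = {
    "S": ["VP"],
    "VP": ["VP", "VBG", "VBZ", "VBD"],
    "NP": ["NN", "NNS", "NNP", "NP"],
}


def base(label):
    """Strip function tags such as -SBJ; leave labels like -NONE- untouched."""
    return label if label.startswith("-") else label.split("-")[0]


def head_child(label, children):
    """Pick the head daughter according to the table (fallback: rightmost)."""
    for candidate in HEAD_TABLE.get(base(label), []):
        for child in children:
            if base(child[0]) == candidate:
                return child
    return children[-1]


def head_path(tree):
    """Follow head daughters from the root down to the lexical anchor."""
    label, rest = tree
    path = [label]
    while isinstance(rest, list):
        label, rest = head_child(label, rest)
        path.append(label)
    return path, rest


tree = ("S", [
    ("NP-SBJ", [("NNP", "IBM")]),
    ("VP", [
        ("VBZ", "is"),
        ("VP", [("VBG", "making"), ("NP", [("NNS", "computers")])]),
    ]),
])

path, anchor = head_path(tree)
print(path, anchor)
```

Sisters of the nodes on this path would then be classified as arguments or adjuncts with the argument and adjunct tables, which this sketch does not cover.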

83 Extracting a TAG: the head
This is the elementary tree for the head:

84 Extracting a TAG: arguments
Arguments are combined via substitution
Recurse on the arguments:

85 Extracting a TAG: adjuncts
Adjuncts require auxiliary trees (use adjunction to be combined with the head)
Auxiliary trees require a foot node (with the same label as the root)
[Figure: extracted trees for “is” (VBZ, VP), “officially” (ADVP-MNR) and “the” (DT, NP).]

86 Extracting a TAG: adjuncts
Adjuncts require auxiliary trees (use adjunction to be combined with the head)
Auxiliary trees require a foot node (with the same label as the root)

87 Special cases
– Coordination
– Null elements (e.g. traces for wh-movement): the trace has to be part of the elementary tree of the main verb
– Punctuation marks

88 Wh-movement: relative clauses
(NP (NP a charge) (SBAR (WHNP-2 (-NONE- 0)) (S (NP-SBJ Mr. Coleman) (VP (VBZ denies) (NP (-NONE- *T*-2))))))
[Figure: elementary tree for “denies”, containing NP, SBAR, WHNP (-NONE- 0), S, VP, VBZ and the trace *T*-2.]

89 Evaluating an extracted grammar/lexicon
Grammar/lexicon size?
– Depends on head table, argument/adjunct distinction, treatment of null elements, mapping of Treebank labels/POS tags to categories in extracted grammar etc.
– For TAGs, between 3,000 and 8,500 elementary tree types, and 100,000 to 130,000 lexical entries.
Lexical coverage?
– For TAGs, around 92-93%
Distribution of tree types?
Convergence?
Quality?
– Inspection, comparison with manual grammar

90 References: TAG extraction
TAG:
– A.K. Joshi and Y. Schabes (1996) Tree Adjoining Grammars. In G. Rosenberg and A. Salomaa, Eds., Handbook of Formal Languages
TAG extraction:
– F. Xia. Automatic Grammar Generation From Two Different Perspectives. PhD thesis, University of Pennsylvania, 2001.
– J. Chen, S. Bangalore, K. Vijay-Shanker. Automated Extraction of Tree-Adjoining Grammars from Treebanks, Natural Language Engineering (forthcoming)
– Also: L. Shen and A.K. Joshi, Building an LTAG Treebank, Technical Report MS-CIS-05-15, CIS Department, University of Pennsylvania, 2005
Parsing with extracted TAGs:
– D. Chiang. Statistical parsing with an automatically extracted tree adjoining grammar. In Data Oriented Parsing, CSLI Publications, pages 299–316.
– L. Shen and A.K. Joshi. Incremental LTAG parsing, HLT/EMNLP 2005

91 Penn-II-Based Acquisition of LFG Resources
Lexical-Functional Grammar

92 Penn-II-Based Acquisition of LFG Resources
– Introduction
– Treebank Preprocessing/Clean-Up
– Treebank Annotation/Conversion
– Grammar and Lexicon Extraction
– Parsing (Architectures, Probability Models, Evaluation)
– Generation (Architectures, Probability Models, Evaluation)
– Other (Semantics, Domain Variation, …)

93 Introduction: Penn-II & LFG
If we had an f-structure annotated version of Penn-II, we could use (standard) machine learning methods to extract probabilistic, wide-coverage LFG resources
How do we get an f-structure annotated Penn-II?
– Manually? No: 50,000 trees … !
– Automatically! Yes: f-structure annotation algorithm … !
Penn-II is a 2nd generation treebank. It contains lots of annotations to support derivation of deep meaning representations: trees, Penn-II “functional” tags, traces & coindexation. The f-structure annotation algorithm can exploit those.

94 Introduction: Penn-II & LFG
What is the task?
Given a Penn-II tree, the f-structure annotation algorithm has to traverse the tree and associate all tree nodes with f-structure equations (including lexical equations at the leaves of the tree).
A simple example

95 Introduction: Penn-II & LFG
Factory payrolls fell in September.
S
  NP-SBJ  ↑subj=↓
    NN  ↓∈↑adjunct  Factory
    NNS  ↑=↓  payrolls
  VP  ↑=↓
    VBD  ↑=↓  fell
    PP-TMP  ↓∈↑adjunct
      IN  ↑=↓  in
      NP  ↑obj=↓
        NNP  ↑=↓  September

96 Introduction: Penn-II & LFG
subj : pred : payroll
       num : pl
       pers : 3
       adjunct : 2 : pred : factory
                     num : sg
                     pers : 3
adjunct : 1 : pred : in
              obj : pred : september
                    num : sg
                    pers : 3
pred : fall
tense : past
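To make the link between the annotated tree (slide 95) and the f-structure (slide 96) concrete, here is a small sketch that builds the f-structure by a single recursive pass over the annotations. It is not the constraint solver used in the course: it handles only the three annotation shapes in this example (↑=↓, ↑gf=↓ and ↓∈↑gf), does no clash detection, and the dictionary encoding and lexical features are assumptions modelled on slide 96.

```python
def build(node, fs=None):
    """Add the information under `node` to the f-structure `fs` and return it."""
    fs = {} if fs is None else fs
    fs.update(node.get("lex", {}))              # lexical equations at the leaves
    for child in node.get("children", []):
        kind, attr = child["ann"]
        if kind == "head":                      # ↑=↓ : child shares the mother's f-structure
            build(child, fs)
        elif kind == "gf":                      # ↑attr=↓ : child fills a grammatical function
            build(child, fs.setdefault(attr, {}))
        elif kind == "member":                  # ↓∈↑attr : child joins a set-valued attribute
            fs.setdefault(attr, []).append(build(child))
    return fs


HEAD = ("head", None)
tree = {"children": [
    {"ann": ("gf", "subj"), "children": [
        {"ann": ("member", "adjunct"), "lex": {"pred": "factory", "num": "sg", "pers": 3}},
        {"ann": HEAD, "lex": {"pred": "payroll", "num": "pl", "pers": 3}},
    ]},
    {"ann": HEAD, "children": [
        {"ann": HEAD, "lex": {"pred": "fall", "tense": "past"}},
        {"ann": ("member", "adjunct"), "children": [
            {"ann": HEAD, "lex": {"pred": "in"}},
            {"ann": ("gf", "obj"), "children": [
                {"ann": HEAD, "lex": {"pred": "september", "num": "sg", "pers": 3}},
            ]},
        ]},
    ]},
]}

fstructure = build(tree)
print(fstructure)
```

A general solver would instead collect all equations and unify them, which also covers reentrancies introduced by traces; the direct traversal above only works because this example has none.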

97 Treebank Preprocessing/Clean-Up: Penn-II & LFG
Penn-II treebank: often flat analyses (coordination, NPs …), a certain amount of noise: inconsistent annotations, errors …
No treebank preprocessing or clean-up in the LFG approach
Take Penn-II treebank as is, but:
– Remove all trees with FRAG or X labelled constituents
– FRAG = fragments, X = not known how to annotate
Total of 48,424 trees as they are.
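The single filtering step can be sketched with a regular expression over bracketed tree strings. The example trees and the names `EXCLUDED` and `keep_tree` are illustrative assumptions; the pattern simply looks for a constituent whose label is FRAG or X, optionally followed by a function tag or index.

```python
import re

# Matches "(FRAG" or "(X" when followed by whitespace, a tag separator or a bracket.
EXCLUDED = re.compile(r"\((?:FRAG|X)(?=[\s\-=(])")


def keep_tree(bracketed):
    """Keep a tree only if it contains no FRAG or X constituent."""
    return EXCLUDED.search(bracketed) is None


trees = [
    "(S (NP-SBJ (NNS payrolls)) (VP (VBD fell)))",
    "(FRAG (NP (NN example)))",
    "(S (X (NN stuff)) (VP (VBD fell)))",
]

kept = [t for t in trees if keep_tree(t)]
print(len(kept))
```

The lookahead keeps the pattern from firing on longer labels that merely start with X.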

98 Treebank Annotation: Penn-II & LFG
Annotation-based (rather than conversion-based)
Automatic annotation of nodes in Penn-II treebank trees with f-structure equations: F-structure Annotation Algorithm
The Annotation Algorithm exploits:
– Head information
– Categorial information
– Configurational information
– Penn-II functional tags
– Trace information

99 Treebank Annotation: Penn-II & LFG
Architecture of a modular algorithm to assign LFG f-structure equations to trees in the Penn-II treebank:
– Head-Lexicalisation [Magerman, 1994]
– Left-Right Context Annotation Principles
– Coordination Annotation Principles
– Catch-All and Clean-Up (yields proto f-structures)
– Traces (yields proper f-structures)

100 Treebank Annotation: Penn-II & LFG
Head Lexicalisation: modified rules based on (Magerman, 1994)

101 Treebank Annotation: Penn-II & LFG
Left-Right Context Annotation Principles:
– Head of NP likely to be rightmost noun …
– Mother → Left Context, Head, Right Context

102 Treebank Annotation: Penn-II & LFG
NP: Left-Right Annotation Matrix
Left Context:
– DT: ↑spec:det=↓
– QP: ↑spec:quant=↓
– JJ, ADJP: ↓∈↑adjunct
Head:
– NN, NNS: ↑=↓
Right Context:
– NP: ↓∈↑app
– PP: ↓∈↑adjunct
– S, SBAR: ↓∈↑relmod
Example: (NP (DT a) (ADJP (RB very) (JJ politicized)) (NN deal)) is annotated as DT: ↑spec:det=↓, ADJP: ↓∈↑adjunct, NN: ↑=↓
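Applying the NP matrix amounts to a table lookup keyed on whether a daughter sits to the left of the head, is the head, or sits to the right. The sketch below encodes the matrix from this slide as a dict; the function name, the assumption that the head position is already known from head lexicalisation, and the use of `None` for daughters left to later modules are illustrative choices.

```python
# The NP annotation matrix from the slide, keyed by context and daughter category.
NP_MATRIX = {
    "left": {
        "DT": "↑spec:det=↓",
        "QP": "↑spec:quant=↓",
        "JJ": "↓∈↑adjunct",
        "ADJP": "↓∈↑adjunct",
    },
    "head": {"NN": "↑=↓", "NNS": "↑=↓"},
    "right": {
        "NP": "↓∈↑app",
        "PP": "↓∈↑adjunct",
        "S": "↓∈↑relmod",
        "SBAR": "↓∈↑relmod",
    },
}


def annotate_np(daughters, head_index):
    """Pair each daughter category with its f-structure equation (None if not covered)."""
    annotated = []
    for i, category in enumerate(daughters):
        context = "left" if i < head_index else "head" if i == head_index else "right"
        annotated.append((category, NP_MATRIX[context].get(category)))
    return annotated


# NP -> DT ADJP NN ("a very politicized deal"), with the NN as head.
print(annotate_np(["DT", "ADJP", "NN"], head_index=2))
# NP -> NN PP: the PP to the right of the head becomes an adjunct.
print(annotate_np(["NN", "PP"], head_index=0))
```

Because the lookup depends only on category and position relative to the head, the same matrix also applies to rule types that were never inspected when the matrix was written.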

103 Treebank Annotation: Penn-II & LFG

104 Treebank Annotation: Penn-II & LFG
Build an annotation matrix for each of the monadic categories (without -Fun tags) in Penn-II
Based on analysing the most frequent rule types for each category, such that the sum total of token frequencies of these rule types is greater than 85% of the total number of rule tokens for that category
Number of rule types needed for 100% and for 85% coverage:
Category   100%    85%
NP         6595    102
VP         10239   307
S          2602    20
ADVP       234     6
Apply the annotation matrix to all (i.e. also unseen) rules/sub-trees, including those for NP-LOC, NP-TMP etc.
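Selecting the rule types to analyse by hand is a cumulative-frequency cut-off. The following sketch assumes rule token counts for one category are available in a `Counter`; the counts shown are invented toy numbers, not Penn-II statistics.

```python
from collections import Counter


def most_frequent_rules(rule_counts, threshold=0.85):
    """Return the most frequent rule types whose tokens cover more than `threshold`."""
    total = sum(rule_counts.values())
    selected, covered = [], 0
    for rule, count in rule_counts.most_common():
        if covered > threshold * total:
            break
        selected.append(rule)
        covered += count
    return selected


# Toy counts for NP rule types (made up for illustration).
np_rules = Counter({
    "NP -> DT NN": 50,
    "NP -> NNP": 30,
    "NP -> DT JJ NN": 10,
    "NP -> NP PP": 6,
    "NP -> QP NNS": 4,
})

print(most_frequent_rules(np_rules))
```

With a heavily skewed distribution like the one in the table above, a small fraction of the rule types is enough to pass the threshold, which is what makes manual construction of the matrices feasible.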

105 Treebank Annotation: Penn-II & LFG
Co-ordination Annotation Principles
Often flat Penn-II analysis of coordination
[Figure labels: Co-ordinated Element, Object, Modifier]

106 Treebank Annotation: Penn-II & LFG
Unlike constituents coordination
[Figure label: Co-ordinated Element]

107 Treebank Annotation: Penn-II & LFG
Traces Module: Long Distance Dependencies
– Topicalisation
– Wh- and wh-less questions
– Relative clauses
– Passivisation
– Control constructions
– ICH (interpret constituent here)
– RNR (right node raising)
– …
Translate Penn-II traces and coindexation into corresponding reentrancy in f-structure

108 Treebank Annotation: WH-Relative Clauses

109 Treebank Annotation: Wh-Less Relative Clauses

110 Treebank Annotation: Control & Wh-Rel. LDD

111 Treebank Annotation: Adv. Relative Clause

112 Treebank Annotation: Right Node Raising

113 Treebank Annotation: Right Node Raising

114 Treebank Annotation: Penn-II & LFG
Catch-All and Clean-Up Module:
Penn-II functional tags are used to identify potential errors
–e.g. nodes with the tag -SBJ should be annotated as the subject …
Correction of overgeneralisations
–e.g. change a second OBJ annotation to OBJ2 …
–e.g. change arguments of head nouns erroneously annotated as relative clauses to COMP arguments:
  … signs [that managers expect declines]_RELCL … becomes … signs [that managers expect declines]_COMP …
Unannotated nodes
–Defaults …

115 Treebank Annotation: Penn-II & LFG
[Architecture figure. Components: Left-Right Context Annotation Principles; Coordination Annotation Principles; Catch-All and Clean-Up; Traces; Proto F-Structures; Proper F-Structures; Head-Lexicalisation [Magerman, 1995]; Constraint Solver]

116 Treebank Annotation: Penn-II & LFG
Collect f-structure equations
Send to constraint solver
Generates f-structures
F-structure annotation algorithm implemented in Java, constraint solver in Prolog
–~3 min annotating approx. 50,000 Penn-II trees
–~5 min producing approx. 50,000 f-structures

117 Treebank Annotation: Penn-II & LFG

118 Treebank Annotation: Penn-II & LFG

119 Treebank Annotation: Penn-II & LFG
Evaluation (Quantitative): Burke (2006)
Coverage: over 99.8% of Penn-II sentences (without X and FRAG constituents) receive a single covering and connected f-structure:
F-structures  Sentences  Percentage
0             45         0.093%
1             48329      99.804%
2             50         0.103%

120 Treebank Annotation: Penn-II & LFG
Evaluation (Qualitative): Burke (2006)
F-structure quality evaluation against DCU 105, a manually annotated dependency gold standard of 105 sentences randomly extracted from WSJ section 23.
Triples are extracted from the gold standard and the automatically produced f-structures using the evaluation software from (Crouch et al. 2002) and (Riezler et al. 2002).
Triple format: relation(predicate~0, argument~1)
Results calculated in terms of Precision and Recall
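As a rough illustration of triple-based scoring, the sketch below computes precision, recall and f-score over sets of (relation, predicate, argument) triples. It is not the evaluation software cited on the slide; the function name and the toy triples are assumptions, and treating triples as sets (ignoring duplicates) is a simplification.

```python
def precision_recall_f(gold_triples, test_triples):
    """Score test triples against gold triples.

    precision = matched / test, recall = matched / gold,
    f-score = harmonic mean of the two.
    """
    gold, test = set(gold_triples), set(test_triples)
    matched = len(gold & test)
    precision = matched / len(test) if test else 0.0
    recall = matched / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Toy triples in the relation(predicate~i, argument~j) style, invented here.
gold = {("subj", "sign~1", "U.N.~0"),
        ("obj", "sign~1", "treaty~2"),
        ("tense", "sign~1", "pres")}
test = {("subj", "sign~1", "U.N.~0"),
        ("obj", "sign~1", "treaty~2"),
        ("tense", "sign~1", "past")}
p, r, f = precision_recall_f(gold, test)
```

Here two of the three test triples match the gold standard, so precision and recall are both 2/3.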

121 Treebank Annotation: Penn-II & LFG
Precision and Recall for the DCU 105 Dependency Bank; results are calculated for All Annotations and for Preds-Only
DCU 105    All Annotations  Preds-Only
Precision  97.06%           94.28%
Recall     96.80%           94.28%

122 Treebank Annotation: Penn-II & LFG
DCU 105 results by feature:
Feature   Precision      Recall         F-Score
adjunct   892/968 = 92   892/950 = 94   93
app       16/16 = 100    16/19 = 84     91
comp      88/92 = 96     88/102 = 86    91
coord     153/184 = 83   153/167 = 92   87
obj       442/459 = 96   442/461 = 96   96
obl       50/52 = 96     50/61 = 82     88
oblag     12/12 = 100    12/12 = 100    100
passive   76/79 = 96     76/80 = 95     96
poss      74/79 = 94     74/81 = 91     92
quant     40/64 = 62     40/52 = 77     69
relmod    46/48 = 96     46/50 = 92     94
subj      396/412 = 96   396/414 = 96   96
topic     13/13 = 100    13/13 = 100    100
topicrel  46/49 = 94     46/52 = 88     91
xcomp     145/153 = 95   145/146 = 99   97

123 Treebank Annotation: Penn-II & LFG
Following (Kaplan et al. 2004), Precision and Recall for the PARC 700 Dependency Bank are calculated for PARC features, where: all annotations ⊃ PARC features ⊃ preds-only
Mapping required (Burke 2006)
PARC 700   PARC features
Precision  88.31%
Recall     86.38%

124 Grammar and Lexicon Extraction: Penn-II & LFG
Lexical Resources:
Lexical information is extremely important in modern lexicalised grammar formalisms: LFG, HPSG, CCG, TAG, …
Lexicon development is time consuming and extremely expensive
Rarely if ever complete
Familiar knowledge acquisition bottleneck …
Subcategorisation frame induction (LFG semantic forms) from the f-structure-annotated version of Penn-II and -III
Evaluation against COMLEX

125 Grammar and Lexicon Extraction: Penn-II & LFG
Lexicon Construction
–Manual vs. Automated
Our Approach:
–F-Structure Annotation of Penn-II and Penn-III
–Frames not Predefined
–Functional and Categorial Information
–Parameterised for Prepositions and Particles
–Active and Passive
–Long Distance Dependencies
–Conditional Probabilities

126 Grammar and Lexicon Extraction: Penn-II & LFG
Extraction Methodology
–Automatic F-Structure Annotation of Penn-II & III
–Lexical Extraction Algorithm
–Examples
Evaluation
–Gold Standards (COMLEX, OALD)
–Experimental Architecture
–Results

127 Grammar and Lexicon Extraction: Penn-II & LFG sign

128 Grammar and Lexicon Extraction: Penn-II & LFG
Semantic Forms: PRED
Governable Grammatical Functions (Arguments)
–SUBJ, OBJ, OBJθ, OBL, OBLθ, COMP, XCOMP, PART, …
Non-Governable Grammatical Functions (Adjuncts)
–ADJ, XADJ, APP, RELMOD, …

129 Grammar and Lexicon Extraction: Penn-II & LFG
Penn-II Treebank → Automatic F-Structure Annotation Algorithm → LFG F-Structures → Extraction Algorithm → Semantic Forms

130 Grammar and Lexicon Extraction: Penn-II & LFG
Extraction Algorithm:
For each f-structure F
  For each level of embedding in F
    Determine the local predicate PRED
    Collect all subcategorisable grammatical functions GF1, …, GFn
    Return: PRED⟨GF1, …, GFn⟩
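The loop above can be sketched over f-structures represented as nested Python dictionaries. The dictionary encoding, the list of governable functions, and the treatment of a preposition's `pform` as its local predicate are all assumptions made for this illustration, not the published algorithm.

```python
# Assumed inventory of subcategorisable (governable) grammatical functions.
GOVERNABLE = ("subj", "obj", "obj2", "obl", "comp", "xcomp", "part")


def extract_frames(fs, frames=None):
    """Visit every level of embedding and record (pred, [GF1, ..., GFn])."""
    if frames is None:
        frames = []
    if isinstance(fs, dict):
        # Assumption: a level without "pred" but with "pform" (an oblique)
        # uses the preposition as its local predicate.
        pred = fs.get("pred") or fs.get("pform")
        if isinstance(pred, str):
            gfs = []
            for gf in GOVERNABLE:
                if gf not in fs:
                    continue
                value = fs[gf]
                if gf == "obl" and isinstance(value, dict) and "pform" in value:
                    gfs.append("obl:" + value["pform"])  # parameterise by preposition
                else:
                    gfs.append(gf)
            frames.append((pred, gfs))
        for value in fs.values():
            extract_frames(value, frames)
    elif isinstance(fs, list):  # sets such as adjuncts
        for item in fs:
            extract_frames(item, frames)
    return frames


# "The inquiry soon focused on the judge"
focus_fs = {
    "subj": {"spec": {"det": {"pred": "the"}}, "pred": "inquiry",
             "num": "sg", "pers": "3"},
    "adjunct": [{"pred": "soon"}],
    "pred": "focus",
    "tense": "past",
    "obl": {"pform": "on",
            "obj": {"spec": {"det": {"pred": "the"}}, "pred": "judge",
                    "num": "sg", "pers": "3"}},
}
frames = extract_frames(focus_fs)
```

Predicates without governable functions (for example `inquiry`) are returned with an empty list; a real lexicon builder would presumably filter or count these separately.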

131 Grammar and Lexicon Extraction: Penn-II & LFG
“The inquiry soon focused on the judge” (wsj_0267_72)
subj : spec : det : pred : the
       pred : inquiry
       num : sg
       pers : 3
adjunct : 1 : pred : soon
pred : focus
tense : past
obl : pform : on
      obj : spec : det : pred : the
            pred : judge
            num : sg
            pers : 3
Prepositions and OBLs: focus([subj,obl:on]), on([obj])

132 Grammar and Lexicon Extraction: Penn-II & LFG
“Until Congress acts, the government hasn't any authority to issue new debt obligations of any kind, the Treasury said.” (wsj_0008_2)
topic : index : [1]
        subj : spec : det : pred : the
               num : sing
               pred : government
               pers : 3
        …
        pred : have
        tense : pres
subj : spec : det : pred : the
       pers : 3
       pred : treasury
       num : sing
comp : index : [1]
       subj : spec : det : pred : the
              num : sing
              pred : government
              pers : 3
       …
       pred : have
       tense : pres
pred : say
tense : past
LDDs: say([subj,comp])

133 Grammar and Lexicon Extraction: Penn-II & LFG
“… to be considered as an additional risk for the investor…” (wsj_0018_14)
subj : pred : pro
       pron_form : it
passive : +
to_inf : +
pred : be
xcomp : subj : pred : pro
               pron_form : it
        passive : +
        pred : consider
        tense : past
        obl : pform : as
              obj : spec : det : pred : a
                    ………
                    pred : risk
                    num : sg
                    pers : 3
Passive: consider([subj,obl:as],p)

134 Grammar and Lexicon Extraction: Penn-II & LFG
“The inquiry soon focused on the judge.” (wsj_0267_72)
subj : spec : det : pred : the
                    cat : dt
       pred : inquiry
       num : sg
       pers : 3
       cat : nn
adjunct : 1 : pred : soon
              cat : rb
pred : focus
tense : past
cat : vbd
obl : pform : on
      obj : spec : det : pred : the
                         cat : dt
            pred : judge
            num : sg
            pers : 3
            cat : nn
CFG categories: focus(v,[subj,obl:on]), focus(v,[subj(n),obl:on])

135 Grammar and Lexicon Extraction: Penn-II & LFG
Lexicon extracted from Penn-II (O’Donovan et al 2005):

136 Grammar and Lexicon Extraction: Penn-II & LFG
Evaluation for all active verbs (2992) extracted from Penn-II against COMLEX
Largest evaluation for an English subcat frame extraction system
–Carroll and Rooth (1998): 200 verbs
–Schulte im Walde (2000): over 3000 German verbs
COMLEX entries:
(VERB :ORTH "reimburse" :SUBC ((NP-NP) (NP-PP :PVAL ("for")) (NP)))
(vp-frame np-np :cs ((np 2)(np 3)) :gs (:subject 1 :obj 2 :obj2 3) :ex "she asked him his name")

137 Grammar and Lexicon Extraction: Penn-II & LFG
Following Schulte im Walde (2000):
Experiment 1: Exclude prepositional phrases entirely (e.g. [subj,obl:on] is [subj])
Experiment 2: Include prepositional phrase but not specific preposition (e.g. [subj,obl])
–2a (+ Part value)
Experiment 3: Include details of specific preposition (e.g. [subj,obl:on])
–3a (+ Part value)
Relative Thresholds of 1% and 5%
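A small sketch of the three frame granularities and of relative-threshold filtering. The function names are hypothetical, the particle variants (2a, 3a) are left out, and reading "relative threshold" as a minimum relative frequency of a frame among all frames of the same verb is an assumption.

```python
def map_frame(frame, experiment):
    """Project a frame such as ("subj", "obl:on") to the granularity
    of experiment 1, 2 or 3."""
    mapped = []
    for gf in frame:
        if gf.startswith("obl"):
            if experiment == 1:
                continue          # drop prepositional phrases entirely
            if experiment == 2:
                gf = "obl"        # keep the PP, drop the preposition
        mapped.append(gf)
    return tuple(mapped)


def apply_threshold(frame_counts, threshold):
    """Keep frames whose relative frequency for this verb is >= threshold."""
    total = sum(frame_counts.values())
    return {frame: count for frame, count in frame_counts.items()
            if count / total >= threshold}


# Toy counts for one verb, invented for illustration.
toy_counts = {("subj", "obl:on"): 96, ("subj", "obj"): 4}
kept_1pct = apply_threshold(toy_counts, 0.01)
kept_5pct = apply_threshold(toy_counts, 0.05)
```

With these toy counts the 1% threshold keeps both frames, while the 5% threshold filters out the rarer one.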

138 Grammar and Lexicon Extraction: Penn-II & LFG

139 Grammar and Lexicon Extraction: Penn-II & LFG
Directional Prepositions (about, across, along, around, behind, below, beneath, between, beyond, by, down, from…) are included in COMLEX by “default” for verbs that have at least one p-dir …
(VERB :ORTH "cycle" :SUBC ((PP :PVAL ("p-dir"))))

140 Grammar and Lexicon Extraction: Penn-II & LFG
Penn-III = Penn-II + the parsed section of the Brown Corpus
–About 300,000 of a total of 1 million words of the Brown Corpus
–Balanced corpus (8 genres), e.g. Humour, Science Fiction etc.
Subcategorisation variation across domains
More data, more verbs
-CLR tag (closely related)

141 Grammar and Lexicon Extraction: Penn-II & LFG

142 Grammar and Lexicon Extraction: Penn-II & LFG

143 Grammar and Lexicon Extraction: Penn-II & LFG

144 Grammar and Lexicon Extraction: Penn-II & LFG
Applications:
Porting to other languages
–German (TIGER)
–Spanish (CAST3LB)
–Chinese (CTB-I and II)
LDD resolution in parsing new text (Cahill et al., 2004)

145 Grammar and Lexicon Extraction: Penn-II & LFG
Parsing-Based Subcat Frame Extraction (O’Donovan 2006):
Treebank-based vs. parsing-based subcat frame extraction
We parsed the British National Corpus (BNC, 100 million words) with our automatically induced LFGs
19 days on a single machine: ~5 million words per day
Subcat frame extraction for ~10,000 verb lemmas
Evaluation against COMLEX and OALD
Evaluation against the Korhonen (2002) gold standard
Our method is statistically significantly better

146 Parsing: Penn-II and LFG
Overview
Parsing Architectures: Pipeline & Integrated
Long-Distance Dependency Resolution at F-Structure
Evaluation

147 Parsing: Penn-II and LFG

148 Parsing: Penn-II and LFG
A PCFG consists of CFG rules with associated probabilities
An A-PCFG treats strings consisting of CFG categories followed by 1 or more functional annotation(s) as monadic categories (e.g. NP[up-obj=down])
Probabilistic parsing technology (PCFGs, History-Based and Lexicalised Parsers) produces trees without LDDs
Exceptions: (Collins 1999): wh-relclauses; (Johnson 2002): post-processing; …
In our (standard) architecture new text is parsed into proto f-structures.
LDD resolution at f-structure

149 Parsing: Penn-II and LFG
Penn-II tree with traces and co-indexation for LDDs: “U.N. signs treaty, the paper said”
(S (S-1 (NP (NNP U.N.))
        (VP (VBZ signs) (NP (NN treaty))))
   (NP (DT the) (NN paper))
   (VP (VBD said)
       (S (-NONE- *T*-1))))

150 Parsing: Penn-II and LFG
Trace and coindexation in tree translated into reentrancy at f-structure by annotation algorithm: “U.N. signs treaty, the headline said”

151 Parsing: Penn-II and LFG
Parse tree from PCFG and History-Based Parsers without traces: “U.N. signs treaty, the paper said”
(S (S (NP (NNP U.N.))
      (VP (VBZ signs) (NP (NN treaty))))
   (NP (DT the) (NN paper))
   (VP (VBD said)))

152 Parsing: Penn-II and LFG
Basic, but possibly incomplete, predicate-argument structures (proto-f-structures): “U.N. signs treaty, the headline said”

153 Parsing: Penn-II and LFG
Require:
–subcategorisation frames (O’Donovan et al., 2004, 2005; O’Donovan 2006)
–functional uncertainty equations
Previous Example:
–say([subj,comp])
–↑topic = ↑comp* comp (search along a path of 0 or more comps)

154 Parsing: Penn-II and LFG
Subcat Frames:
Automatically acquired from the automatically f-structure-annotated Penn-II Treebank following (O’Donovan et al. 2004, 2005; O’Donovan 2006)
Distinction between active and passive frames
Associated with probabilities
O’Donovan et al. evaluate against the COMLEX resource
Extracted from sections 02-21
10960 active lemma-frame types (semantic forms/subcat frames), 2241 passive types

155 Parsing: Penn-II and LFG
Functional Uncertainty equations:
Automatically acquire finite approximations of FU-equations
Extract paths between co-indexed material in automatically generated f-structures from sections 02-21 of Penn-II
26 TOPIC, 60 TOPICREL, 13 FOCUS path types
99.69% coverage of paths in section 23
Each path type associated with a probability

156 Parsing: Penn-II and LFG
Sample TOPICREL paths with frequencies:
up-subj             7894
up-obj              1167
up-xcomp            956
up-xcomp:obj        793
up-xcomp:xcomp      161
up-xcomp:xcomp:obj  135
up-comp:subj        119
up-xcomp:subj       92
Sample TOPIC paths with probabilities:
up-topic=up-comp        0.940
up-topic=up-xcomp:comp  0.006
up-topic=up-comp:comp   0.001
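One plausible way to turn path frequencies into probabilities is relative-frequency estimation, sketched below. Whether the original system estimates them exactly this way is an assumption. Because only a sample of the 60 TOPICREL path types is listed, the values computed here are relative to that sample and are not the figures used in the course.

```python
def path_probabilities(path_counts):
    """Relative-frequency estimate: count(path) / sum of all path counts."""
    total = sum(path_counts.values())
    return {path: count / total for path, count in path_counts.items()}


# Sample TOPICREL frequencies as listed on the slide.
topicrel_counts = {
    "up-subj": 7894,
    "up-obj": 1167,
    "up-xcomp": 956,
    "up-xcomp:obj": 793,
    "up-xcomp:xcomp": 161,
    "up-xcomp:xcomp:obj": 135,
    "up-comp:subj": 119,
    "up-xcomp:subj": 92,
}
topicrel_probs = path_probabilities(topicrel_counts)
most_likely = max(topicrel_probs, key=topicrel_probs.get)
```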

157 Parsing: Penn-II and LFG
LDD Resolution Algorithm:
Recursively traverse an f-structure and
–find a TOPIC:T attribute-value pair
–retrieve TOPIC paths
–for each path p of the form GF1:…:GFn:GF, traverse the f-structure along the path GF1:…:GFn to the local sub-f-structure g
–at g retrieve the local PRED:P
–add GF:T to g iff
  –GF is not present at g
  –g together with GF is locally complete and coherent with respect to a semantic form s for P
–multiply the path and semantic form probabilities involved to rank the resolution
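The resolution step can be sketched for a single top-level TOPIC as follows. The dictionary encoding of f-structures, the function names and the way completeness and coherence are checked (the governable functions at g plus GF must equal the functions of a frame) are simplifying assumptions; the recursive traversal over embedded TOPICs is omitted.

```python
GOVERNABLE = {"subj", "obj", "obj2", "obl", "comp", "xcomp"}


def follow(fs, path):
    """Walk `path` (a tuple of GFs) from fs; None if a step is missing."""
    for step in path:
        if not isinstance(fs, dict) or step not in fs:
            return None
        fs = fs[step]
    return fs if isinstance(fs, dict) else None


def rank_topic_resolutions(fs, topic_paths, frames):
    """Return (score, path) candidates for fs["topic"], best first."""
    candidates = []
    for path, path_prob in topic_paths.items():
        *prefix, gf = path
        g = follow(fs, prefix)
        if g is None or gf in g or "pred" not in g:
            continue
        present = {k for k in g if k in GOVERNABLE} | {gf}
        for frame, frame_prob in frames.get(g["pred"], {}).items():
            if set(frame) == present:  # locally complete and coherent
                candidates.append((path_prob * frame_prob, path))
    return sorted(candidates, reverse=True)


def resolve_topic(fs, topic_paths, frames):
    """Add the best-ranked GF:T to the f-structure (reentrancy via sharing)."""
    ranked = rank_topic_resolutions(fs, topic_paths, frames)
    if ranked:
        *prefix, gf = ranked[0][1]
        follow(fs, prefix)[gf] = fs["topic"]
    return ranked


# "U.N. signs treaty, the paper said", with probabilities from the slides.
proto = {
    "topic": {"pred": "sign", "subj": {"pred": "U.N."}, "obj": {"pred": "treaty"}},
    "pred": "say",
    "subj": {"spec": "the", "pred": "paper"},
}
topic_paths = {("comp",): 0.940, ("xcomp", "comp"): 0.006, ("comp", "comp"): 0.001}
frames = {"say": {("subj",): 0.06, ("comp", "subj"): 0.87, ("subj", "xcomp"): 0.02}}
ranked = resolve_topic(proto, topic_paths, frames)
```

In this example only the path `comp` combined with the frame `say([comp,subj])` survives, with score 0.940 × 0.87, and the TOPIC value becomes the COMP of `say`.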

158 Parsing: Penn-II and LFG

159 Parsing: Penn-II and LFG
Proto f-structure:
topic : pred : sign
        subj : pred : U.N.
        obj : pred : treaty
pred : say
subj : spec : the
       pred : paper
Subcategorisation Frames:
say([subj])        0.06
say([comp,subj])   0.87
say([subj,xcomp])  0.02
...
FU-path approximations:
up-topic=up-comp        0.940
up-topic=up-xcomp:comp  0.006
up-topic=up-comp:comp   0.001
...
Resolution: path up-topic=up-comp (0.940) with frame say([comp,subj]) (0.87) adds
comp : pred : sign
       subj : pred : U.N.
       obj : pred : treaty

160 Parsing: Penn-II and LFG
How do treebank-based constraint grammars compare to deep hand-crafted grammars like XLE and RASP?
XLE (Riezler et al. 2002, Kaplan et al. 2004)
–hand-crafted, wide-coverage, deep, state-of-the-art English LFG and XLE parsing system with log-linear-based probability models for disambiguation
–PARC 700 Dependency Bank gold standard (King et al. 2003), Penn-II Section 23-based
RASP (Carroll and Briscoe 2002)
–hand-crafted, wide-coverage, deep, state-of-the-art English probabilistic unification grammar and parsing system (RASP: Rapid Accurate Statistical Parsing)
–CBS 500 Dependency Bank gold standard (Carroll, Briscoe and Sanfilippo 1999), Susanne-based

161 Parsing: Penn-II and LFG

162 Parsing: Penn-II and LFG
Choose best treebank-based LFG system to compare with XLE/RASP:
C-structure engines (state-of-the-art history-based, lexicalised parsers):
–(Collins 1999)
–(Charniak 2000)
–(Bikel 2002)
(Bikel 2002) retrained to retain Penn-II functional tags (-SBJ, -LOC, -TMP, -CLR, etc.)
Pipeline architecture: tagged text → Bikel retrained + f-structure annotation algorithm + LDD resolution → f-structures → automatic conversion → evaluation against XLE/RASP gold standards PARC-700/CBS-500 dependency banks

163 Parsing: Penn-II and LFG
Systematic differences between our f-structures and PARC 700 and CBS 500 dependency representations
Automatic conversion of our f-structures to PARC 700 / CBS 500-like structures (Burke et al. 2004, Burke 2006, Cahill et al. under review)
Best XLE and RASP resources, with better results than those reported in the literature to date
(Crouch et al. 2002) and (Carroll and Briscoe 2002) evaluation software
(Noreen 1989) Approximate Randomisation Test to test for statistical significance of results

164 Parsing: Penn-II and LFG
Result dependency f-scores:
PARC 700, XLE vs. BKR-LFG:
–80.55% XLE
–83.08% BKR-LFG (+2.53%)
CBS 500, RASP vs. BKR-LFG:
–76.57% RASP
–80.23% BKR-LFG (+3.66%)
Results statistically significant at the ≥ 95% level ((Noreen 1989) Approximate Randomisation Test)
BKR-LFG = treebank-induced Lexical-Functional Grammar resources with Bikel retrained (BKR) as c-structure engine in pipeline architecture

165 Parsing: Penn-II and LFG
PARC 700 Evaluation:

166 Parsing: Penn-II and LFG

167 Parsing: Penn-II and LFG

168 Parsing: Penn-II and LFG

169 Parsing: Penn-II and LFG

170 Parsing: Penn-II and LFG

171 Probability Models: Penn-II & LFG
Probability Models:
Our approach does not constitute a proper probability model (Abney, 1996)
Why? The probability model leaks:
–The highest ranking parse tree may feature f-structure equations that cannot be resolved into an f-structure
–The probability associated with that parse tree is lost
This doesn’t happen often in practice (coverage >99.5% on unseen data)
Research on appropriate discriminative, log-linear or maximum entropy models is important (Miyao and Tsujii, 2002) (Riezler et al. 2002)

172 Generation: Penn-II & LFG
Cahill and van Genabith, 2006

173 Generation: Penn-II & LFG

174 Generation: Penn-II & LFG

175 Generation: Penn-II & LFG

176 Generation: Penn-II & LFG

177 Generation: Penn-II & LFG

178 Generation: Penn-II & LFG

179 Generation: Penn-II & LFG

180 Generation: Penn-II & LFG

181 Generation: the Good, the Bad and the Ugly
Orig: Supporters of the legislation view the bill as an effort to add stability and certainty to the airline-acquisition process, and to preserve the safety and fitness of the industry.
Gen: Supporters of the legislation view the bill as an effort to add stability and certainty to the airline-acquisition process, and to preserve the safety and fitness of the industry.

Orig: The upshot of the downshoot is that the A 's go into San Francisco 's Candlestick Park tonight up two games to none in the best-of-seven fest.
Gen: The upshot of the downshoot is that the A 's tonight go into San Francisco 's Candlestick Park up two games to none in the best-of-seven fest.

Orig: By this time, it was 4:30 a.m. in New York, and Mr. Smith fielded a call from a New York customer wanting an opinion on the British stock market, which had been having troubles of its own even before Friday 's New York market break.
Gen: Mr. Smith fielded a call from New a customer York wanting an opinion on the market British stock which had been having troubles of its own even before Friday 's New York market break by this time and in New York, it was 4:30 a.m..

Orig: Only half the usual lunchtime crowd gathered at the tony Corney & Barrow wine bar on Old Broad Street nearby.
Gen: At wine tony Corney & Barrow the bar on Old Broad Street nearby gathered usual, lunchtime only half the crowd,.

182 Domain Variation, Multilingual LFG Resources, etc.
Domain variation: ATIS (Judge et al 2005) and QuestionBank (Judge et al 2006)
F-Str -> (Q)LF: Quasi-Logical Forms (Cahill et al. 2003)
Multilingual treebank-based LFG acquisition:
–German: TIGER treebank (Cahill et al 2003), (Cahill et al 2005)
–Chinese: Chinese Penn Treebank (Burke et al 2004)
–Spanish: Cast3LB (O’Donovan et al 2005), (Chrupala and van Genabith 2006)
GramLab Project at DCU (2005-2008): Chinese, Japanese, Arabic, Spanish, French and German

183 Demo System
http://lfg-demo.computing.dcu.ie/lfgparser.html

184 Publications
A. Cahill and J. van Genabith. Robust PCFG-Based Generation using Automatically Acquired LFG-Approximations. COLING/ACL 2006, Sydney, Australia
J. Judge, A. Cahill and J. van Genabith. QuestionBank: Creating a Corpus of Parse-Annotated Questions. COLING/ACL 2006, Sydney, Australia
G. Chrupala and J. van Genabith. Using Machine-Learning to Assign Function Labels to Parser Output for Spanish. COLING/ACL 2006, Sydney, Australia
M. Burke. Automatic Treebank Annotation for the Acquisition of LFG Resources. Ph.D. Thesis, School of Computing, Dublin City University, Dublin 9, Ireland, 2005
R. O'Donovan. Automatic Extraction of Large-Scale Multilingual Lexical Resources. Ph.D. Thesis, School of Computing, Dublin City University, Dublin 9, Ireland, 2005
R. O'Donovan, M. Burke, A. Cahill, J. van Genabith and A. Way. Large-Scale Induction and Evaluation of Lexical Resources from the Penn-II and Penn-III Treebanks. Computational Linguistics, 2005
A. Cahill, M. Forst, M. Burke, M. McCarthy, R. O'Donovan, C. Rohrer, J. van Genabith and A. Way. Treebank-Based Acquisition of Multilingual Unification Grammar Resources. Journal of Research on Language and Computation, Special Issue on "Shared Representations in Multilingual Grammar Engineering", (eds.) E. Bender, D. Flickinger, F. Fouvry and M. Siegel, Kluwer Academic Press, 2005

185 Publications
R. O'Donovan, A. Cahill, J. van Genabith and A. Way. Automatic Acquisition of Spanish LFG Resources from the CAST3LB Treebank. In Proceedings of the Tenth International Conference on LFG, Bergen, Norway, 2005
J. Judge, M. Burke, A. Cahill, R. O'Donovan, J. van Genabith and A. Way. Strong Domain Variation and Treebank-Induced LFG Resources. In Proceedings of the Tenth International Conference on LFG, Bergen, Norway, 2005
M. Burke, A. Cahill, J. van Genabith and A. Way. Evaluating Automatically Acquired F-Structures against PropBank. In Proceedings of the Tenth International Conference on LFG, Bergen, Norway, 2005
M. Burke, A. Cahill, M. McCarthy, R. O'Donovan, J. van Genabith and A. Way. Evaluating Automatic F-Structure Annotation for the Penn-II Treebank. Journal of Language and Computation, Special Issue on "Treebanks and Linguistic Theories", (eds.) E. Hinrichs and K. Simov, Kluwer Academic Press, pages 523-547, 2005
A. Cahill. Parsing with Automatically Acquired, Wide-Coverage, Robust, Probabilistic LFG Approximations. Ph.D. Thesis, School of Computing, Dublin City University, Dublin 9, Ireland, 2004
M. Burke, O. Lam, A. Cahill, R. Chan, R. O'Donovan, A. Bodomo, J. van Genabith and A. Way. Treebank-Based Acquisition of a Chinese Lexical-Functional Grammar. Proceedings of the PACLIC-18 Conference, Waseda University, Tokyo, Japan, pages 161-172, 2004

186 Publications
M. Burke, A. Cahill, R. O'Donovan, J. van Genabith and A. Way. The Evaluation of an Automatic Annotation Algorithm against the PARC 700 Dependency Bank. In Proceedings of the Ninth International Conference on LFG, Christchurch, New Zealand, pages 101-121, 2004
A. Cahill, M. Burke, R. O'Donovan, J. van Genabith and A. Way. Long-Distance Dependency Resolution in Automatically Acquired Wide-Coverage PCFG-Based LFG Approximations. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04), July 21-26 2004, pages 320-327, Barcelona, Spain, 2004
R. O'Donovan, M. Burke, A. Cahill, J. van Genabith and A. Way. Large-Scale Induction and Evaluation of Lexical Resources from the Penn-II Treebank. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04), July 21-26 2004, pages 368-375, Barcelona, Spain, 2004
M. Burke, A. Cahill, R. O'Donovan, J. van Genabith and A. Way. Treebank-Based Acquisition of Wide-Coverage, Probabilistic LFG Resources: Project Overview, Results and Evaluation. The First International Joint Conference on Natural Language Processing (IJCNLP-04), Workshop "Beyond shallow analyses - Formalisms and statistical modeling for deep analyses", March 22-24, 2004, Sanya City, Hainan Island, China, 2004
A. Cahill, M. Forst, M. McCarthy, R. O'Donovan, C. Rohrer, J. van Genabith and A. Way. Treebank-Based Multilingual Unification-Grammar Development. In Proceedings of the Workshop on Ideas and Strategies for Multilingual Grammar Development, at the 15th European Summer School in Logic, Language and Information, Vienna, Austria, 18th - 29th August 2003

187 Publications
A. Cahill, M. McCarthy, J. van Genabith and A. Way. Quasi-Logical Forms for the Penn Treebank. In (eds.) Harry Bunt, Ielka van der Sluis and Roser Morante, Proceedings of the Fifth International Workshop on Computational Semantics, IWCS-05, January 15-17, 2003, Tilburg, The Netherlands, ISBN: 90-74029-24-8, pp. 55-71, 2003
A. Cahill, M. McCarthy, J. van Genabith and A. Way. Evaluating Automatic F-Structure Annotation for the Penn-II Treebank. TLT 2002, Treebanks and Linguistic Theories 2002, 20th and 21st September 2002, Sozopol, Bulgaria, (eds.) E. Hinrichs and K. Simov, Proceedings of the First Workshop on Treebanks and Linguistic Theories (TLT 2002), pp. 42-60, 2002
A. Cahill, M. McCarthy, J. van Genabith and A. Way. Parsing with PCFGs and Automatic F-Structure Annotation. In M. Butt and T. Holloway-King (eds.): Proceedings of the Seventh International Conference on LFG, CSLI Publications, Stanford, CA, pp. 76-95, 2002
A. Cahill and J. van Genabith. TTS - A Treebank Tool. In LREC 2002, The Third International Conference on Language Resources and Evaluation, Las Palmas de Gran Canaria, Spain, May 27th - June 2nd, 2002, Proceedings of the Conference, Volume V, (eds.) M.G. Rodriguez and C.P. Suarez Arnajo, ISBN 2-9517408-0-8, pp. 1712-1717, 2002
A. Cahill, M. McCarthy, J. van Genabith and A. Way. Automatic Annotation of the Penn-Treebank with LFG F-Structure Information. LREC 2002 workshop on Linguistic Knowledge Acquisition and Representation - Bootstrapping Annotated Language Data, LREC 2002, Third International Conference on Language Resources and Evaluation, post-conference workshop, June 1st, 2002, proceedings of the workshop, (eds.) A. Lenci, S. Montemagni and V. Pirelli, ELRA - European Language Resources Association, Paris, France, pp. 8-15, 2002

188 Penn-II-Based Acquisition of CCG Resources
Combinatory Categorial Grammar

189 This lecture
Recap: CCG
Translating the Penn Treebank to CCG
–The translation algorithm
–CCGbank: the acquired grammar and lexicon
Wide-coverage parsing with CCG

190 CCG: the machinery
Categories: specify subcat lists of words/constituents.
Combinatory rules: specify how constituents can combine.
The lexicon: specifies which categories a word can have.
Derivations: spell out process of combining constituents.

191 CCG categories
Simple categories: NP, S, PP
Complex categories: functions which return a result when combined with an argument:
VP or intransitive verb: S\NP
Transitive verb: (S\NP)/NP
Adverb: (S\NP)\(S\NP)
PPs: ((S\NP)\(S\NP))/NP, (NP\NP)/NP

192 The combinatory rules
Function application: λx.f(x) a ⇒ f(a)
X/Y Y ⇒ X (>)
Y X\Y ⇒ X (<)
Function composition: λx.f(x) λy.g(y) ⇒ λx.f(g(x))
X/Y Y/Z ⇒ X/Z (>B)
Y\Z X\Y ⇒ X\Z (<B)
X/Y Y\Z ⇒ X\Z (>Bx)
Y/Z X\Y ⇒ X/Z (<Bx)
Type-raising: a ⇒ λf.f(a)
X ⇒ T/(T\X) (>T)
X ⇒ T\(T/X) (<T)
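A minimal sketch of some of these rules over a toy category encoding. Representing an atomic category as a string and a complex category as a (result, slash, argument) tuple is an assumption made for illustration, as are the function names; the crossed rules and backward composition are omitted for brevity.

```python
def forward_apply(left, right):
    """X/Y  Y  =>  X   (>)"""
    if isinstance(left, tuple) and left[1] == "/" and left[2] == right:
        return left[0]
    return None


def backward_apply(left, right):
    """Y  X\\Y  =>  X   (<)"""
    if isinstance(right, tuple) and right[1] == "\\" and right[2] == left:
        return right[0]
    return None


def forward_compose(left, right):
    """X/Y  Y/Z  =>  X/Z   (>B)"""
    if (isinstance(left, tuple) and isinstance(right, tuple)
            and left[1] == "/" and right[1] == "/" and left[2] == right[0]):
        return (left[0], "/", right[2])
    return None


def forward_type_raise(x, t):
    """X  =>  T/(T\\X)   (>T)"""
    return (t, "/", (t, "\\", x))


VP = ("S", "\\", "NP")           # S\NP
TRANSITIVE = (VP, "/", "NP")     # (S\NP)/NP

# Canonical derivation: verb + object by (>), then subject by (<).
verb_phrase = forward_apply(TRANSITIVE, "NP")
sentence = backward_apply("NP", verb_phrase)

# Alternative derivation: type-raise the subject, compose with the verb.
raised_subject = forward_type_raise("NP", "S")
subject_verb = forward_compose(raised_subject, TRANSITIVE)
```

The alternative derivation yields S/NP for subject plus verb, the kind of constituent that type-raising and composition make available for extraction and right-node raising.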

193 CCG derivations
Canonical “normal-form” derivations (mostly function application):
Alternative derivations:

194 Type-raising and Composition
Wh-movement:
Right-node raising:

195 CCG: semantics
Every syntactic category and rule has a semantic counterpart:

196 ESSLLI 2006 Treebank-Based Acquisition of LFG, HPSG and CCG Resources196 From the Penn Treebank to CCG The basic translation algorithm Dealing with null elements Type-changing rules in the grammar Preprocessing CCGbank: The extracted lexicon/grammar

197 ESSLLI 2006 Treebank-Based Acquisition of LFG, HPSG and CCG Resources197 Input: Penn Treebank tree Flat phrase-structure tree Traces/null elements and indices represent underlying dependencies Function tags

198 ESSLLI 2006 Treebank-Based Acquisition of LFG, HPSG and CCG Resources198 Output: CCG derivation Binary derivation tree with explicit “deep” dependency structures and subcategorization information. No null elements

199 ESSLLI 2006 Treebank-Based Acquisition of LFG, HPSG and CCG Resources199 I. Identify heads, arguments, adjuncts

200 ESSLLI 2006 Treebank-Based Acquisition of LFG, HPSG and CCG Resources200 II. Binarise the tree

201 ESSLLI 2006 Treebank-Based Acquisition of LFG, HPSG and CCG Resources201 III. Assign CCG categories

202 ESSLLI 2006 Treebank-Based Acquisition of LFG, HPSG and CCG Resources202 Morphosyntactic Features Features on verbal categories: declarative, infinitival, past participle, present participle, passive Sentential features: wh-questions, yes-no questions, embedded questions, embedded declaratives, fragments, etc. CCGbank has no case or number distinction!

203 ESSLLI 2006 Treebank-Based Acquisition of LFG, HPSG and CCG Resources203 III. Assign CCG categories: adjuncts

204 ESSLLI 2006 Treebank-Based Acquisition of LFG, HPSG and CCG Resources204 III. Assign CCG categories: arguments

205 ESSLLI 2006 Treebank-Based Acquisition of LFG, HPSG and CCG Resources205 IV. Assign predicate-argument structure We approximate predicate-argument structure by word-word dependencies. These are defined by the argument slots of functor categories:
  just: (S\NP)/(S\NP) → opened
  opened: (S[dcl]\NP)/NP → doors

206 ESSLLI 2006 Treebank-Based Acquisition of LFG, HPSG and CCG Resources206 IV. Assign predicate-argument structure
Non-local dependencies arise through:
– Binding and control: “He may want you to listen”
– Extraction: “the tapas that he told us she ate”
Both are mediated by lexical categories:
– Control verbs, auxiliaries/modals
– Relative pronouns
We represent this via coindexation: (NP\NP_i)/(S[dcl]/NP_i)
In CCGbank: added automatically to certain category types

207 ESSLLI 2006 Treebank-Based Acquisition of LFG, HPSG and CCG Resources207 Lexical categories that mediate dependencies
– Auxiliaries/modals, raising verbs (will, might, seem): (S[dcl]\NP_i)/(S[b]\NP_i)
– Control verbs (persuade you to go): ((S[dcl]\NP)/(S[to]\NP_i))/NP_i
– Relative pronouns (which, who, that): (NP\NP_i)/(S[dcl]/NP_i)
Many more (listed in the CCGbank manual)

208 ESSLLI 2006 Treebank-Based Acquisition of LFG, HPSG and CCG Resources208 Summary: The basic algorithm
1. Identify heads, complements and adjuncts.
2. Binarize the tree.
3. Assign CCG categories.
4. Add co-indexation to lexical categories.
5. Create predicate-argument structure.
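
Step 3 can be illustrated with a small Python sketch. It assumes that steps 1 and 2 have already produced a binary tree whose non-head daughters are marked as arguments or adjuncts. The `Node` class, the `ATOMIC` label map and the `assign` function are hypothetical names, and features, co-indexation and null elements are ignored.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    label: str                  # Treebank label or POS tag
    role: str = "head"          # "head", "arg" or "adj" relative to the mother
    children: List["Node"] = field(default_factory=list)
    word: Optional[str] = None
    cat: Optional[str] = None   # CCG category, filled in by assign()


# Hypothetical map from Treebank labels of arguments to atomic categories.
ATOMIC = {"NP": "NP", "PP": "PP", "S": "S"}


def wrap(cat):
    """Parenthesise complex categories when they are embedded."""
    return f"({cat})" if "/" in cat or "\\" in cat else cat


def assign(node, cat, lexicon):
    """Assign `cat` to `node` and derive the daughters' categories top-down."""
    node.cat = cat
    if node.word is not None:            # leaf: record a lexical entry
        lexicon.append((node.word, cat))
        return
    if len(node.children) == 1:          # unary node: pass the category down
        assign(node.children[0], cat, lexicon)
        return
    left, right = node.children
    head_is_left = left.role == "head"
    other = right if head_is_left else left
    if other.role == "arg":
        # The head is a function from the argument to the mother category.
        other_cat = ATOMIC[other.label]
        slash = "/" if head_is_left else "\\"
        head_cat = f"{wrap(cat)}{slash}{wrap(other_cat)}"
    else:
        # Adjuncts map the mother category onto itself: X/X before the
        # head, X\X after it. The head keeps the mother category.
        head_cat = cat
        slash = "\\" if head_is_left else "/"
        other_cat = f"{wrap(cat)}{slash}{wrap(cat)}"
    assign(left, head_cat if head_is_left else other_cat, lexicon)
    assign(right, other_cat if head_is_left else head_cat, lexicon)


# "John just opened doors", already binarised and head-marked.
tree = Node("S", children=[
    Node("NP", "arg", word="John"),
    Node("VP", "head", children=[
        Node("ADVP", "adj", word="just"),
        Node("VP", "head", children=[
            Node("VBD", "head", word="opened"),
            Node("NP", "arg", word="doors"),
        ]),
    ]),
])
lexicon = []
assign(tree, "S", lexicon)
for word, cat in lexicon:
    print(word, cat)
```

On this toy tree the sketch yields `(S\NP)/(S\NP)` for "just" and `(S\NP)/NP` for "opened", the same shapes as the categories shown in the slides apart from the `[dcl]` feature.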

209 ESSLLI 2006 Treebank-Based Acquisition of LFG, HPSG and CCG Resources209 Problems with basic algorithm Depends on Treebank markup: –Complement/adjunct distinction –The analyses don’t always correspond to CCG analysis –Errors in Treebank annotation Proliferation of categories:

210 ESSLLI 2006 Treebank-Based Acquisition of LFG, HPSG and CCG Resources210 The need for preprocessing
Eliminating (some of) the noise:
– POS-tagging errors
– Bracketing errors (coordination!)
Changing the Treebank analyses:
– Small clauses
Adding more structure:
– Insert a noun level into NPs
– Analyze QPs, fragments, parentheticals, multiword expressions

211 ESSLLI 2006 Treebank-Based Acquisition of LFG, HPSG and CCG Resources211 Compacting the grammar: Type-changing rules Type-changing rules for adjuncts capture syntactic regularities:

212 ESSLLI 2006 Treebank-Based Acquisition of LFG, HPSG and CCG Resources212 Null elements, traces, and coindexation *-null elements: passive, PRO *T*-traces: wh-movement, tough movement *RNR*-traces: right-node raising Other null elements: –*EXP*: expletive, –*ICH* (“insert constituent here”): extraposition –*U* (units): $ 500 *U* –*PPA* (permanent predictable ambiguity) =-coindexation: argument cluster coordination and gapping

213 ESSLLI 2006 Treebank-Based Acquisition of LFG, HPSG and CCG Resources213 * null elements Used for passive or PRO (arbitrary or controlled): Only the passive * matters for translation: (S with null subject = VP = S\NP)

214 ESSLLI 2006 Treebank-Based Acquisition of LFG, HPSG and CCG Resources214 Unbounded long-range dependencies … arising through extraction (*T*): –Wh-movement (relative clauses and wh-questions): the articles that (you believed he saw that…) I filed –Tough-movement: Peter is easy to please –Parasitic gaps: the articles that I filed without reading … arising through coordination (*RNR* and =): – Right-node raising: [[Mary ordered] and [John ate]] the tapas. – Argument cluster coordination: Mary ordered [[tapas for herself] and [wine for John]]. – Sentential gapping: [[Mary ordered tapas] and [John beer]].

215 ESSLLI 2006 Treebank-Based Acquisition of LFG, HPSG and CCG Resources215 Dealing with extraction Penn Treebank: *T* traces indicate extraction

216 ESSLLI 2006 Treebank-Based Acquisition of LFG, HPSG and CCG Resources216 Dealing with extraction Pass the extracted NP up to relative clause. The relative pronoun subcategorizes for an ‘incomplete’ sentence: (NP\NP)/(S[dcl]\NP) for subject relatives (NP\NP)/(S[dcl]/NP) for object relatives The derivation uses type-raising and composition

217 ESSLLI 2006 Treebank-Based Acquisition of LFG, HPSG and CCG Resources217 Right node raising in the Penn Treebank

218 ESSLLI 2006 Treebank-Based Acquisition of LFG, HPSG and CCG Resources218 Right node raising in CCGbank

219 ESSLLI 2006 Treebank-Based Acquisition of LFG, HPSG and CCG Resources219 Argument-cluster coordination “Template gapping” annotation: Co-indexation between constituents in conjuncts The first conjunct contains the head

220 ESSLLI 2006 Treebank-Based Acquisition of LFG, HPSG and CCG Resources220 Argument-cluster coordination in CCGbank The shared constituents are coordinated (via type-raising and composition):
  X  ⇒  T\(T/X)   (<T)
  NP  ⇒  (S\NP)\((S\NP)/NP)   (<T)

221 ESSLLI 2006 Treebank-Based Acquisition of LFG, HPSG and CCG Resources221 Sentential Gapping In the Treebank: CCG uses decomposition to obtain the types (interpretation is given extragrammatically)

222 ESSLLI 2006 Treebank-Based Acquisition of LFG, HPSG and CCG Resources222 Remaining problems: NP level Lists and appositives are indistinguishable: Compound nouns have no internal structure:

223 ESSLLI 2006 Treebank-Based Acquisition of LFG, HPSG and CCG Resources223 Remaining problems: other constructions Complement-adjunct distinction:

224 ESSLLI 2006 Treebank-Based Acquisition of LFG, HPSG and CCG Resources224 Putting it all together…. Funds that are or soon will be listed in New York or London

225 ESSLLI 2006 Treebank-Based Acquisition of LFG, HPSG and CCG Resources225 The CCG derivation

226 ESSLLI 2006 Treebank-Based Acquisition of LFG, HPSG and CCG Resources226 The relative clause: that: (NP_i\NP_i)/(S[dcl]\NP_i) [Figure: dependencies linking “funds” to “are” and “will”]

227 ESSLLI 2006 Treebank-Based Acquisition of LFG, HPSG and CCG Resources227 The right-node-raising VP

228 ESSLLI 2006 Treebank-Based Acquisition of LFG, HPSG and CCG Resources228 CCGbank Coverage of the translation algorithm: 99.44% of all sentences in the Treebank (main problem: sentential gapping) The lexicon (sec.02-21): –74,669 entries for 44,210 word types –1286 lexical category types (439 appear once, 556 appear 5 times or more) The grammar (sec. 02-21): –3262 rule instantiations (1146 appear once)

229 ESSLLI 2006 Treebank-Based Acquisition of LFG, HPSG and CCG Resources229 The most ambiguous words

230 ESSLLI 2006 Treebank-Based Acquisition of LFG, HPSG and CCG Resources230 Frequency distribution of categories

231 ESSLLI 2006 Treebank-Based Acquisition of LFG, HPSG and CCG Resources231 Lexical coverage How well does our lexicon cover unseen data? “Training” data: sections 02-21 Test data: section 00 The lexicon contains the correct entries for 94.0% of the tokens in section 00. 3.8% of the tokens in section 00 do not appear in sections 02-21. 35% of the unknown tokens are N 29% of the unknown tokens are N/N
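
The two figures reported here (tokens whose correct entry is in the lexicon, and tokens whose word is unseen) could be computed along the following lines. This is a sketch on toy data: `build_lexicon`, `lexical_coverage` and the word/category pairs are illustrative assumptions, not CCGbank tooling.

```python
from collections import defaultdict


def build_lexicon(tagged_tokens):
    """Collect the set of categories observed for each word."""
    lexicon = defaultdict(set)
    for word, cat in tagged_tokens:
        lexicon[word].add(cat)
    return lexicon


def lexical_coverage(lexicon, test_tokens):
    """Return (fraction of tokens with the correct entry, fraction unknown)."""
    covered = unknown = 0
    for word, cat in test_tokens:
        if word not in lexicon:
            unknown += 1              # word never seen in training
        elif cat in lexicon[word]:
            covered += 1              # word seen with this category
    n = len(test_tokens)
    return covered / n, unknown / n


# Toy stand-ins for the training sections and the test section.
train = [("the", "NP/N"), ("dog", "N"), ("barks", "S\\NP")]
test = [("the", "NP/N"), ("dog", "N/N"), ("cat", "N"), ("barks", "S\\NP")]

coverage, unknown_rate = lexical_coverage(build_lexicon(train), test)
print(coverage, unknown_rate)
```

Note that a known word with an unseen category ("dog" as N/N above) counts against coverage without counting as an unknown token, which is why the two percentages on the slide do not sum to 100%.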

232 ESSLLI 2006 Treebank-Based Acquisition of LFG, HPSG and CCG Resources232 Statistical Parsing with CCG The data: CCGbank The algorithms: standard CKY chart parsing (and a supertagger) The models: –Generative: Hockenmaier and Steedman (2002) –Conditional: Clark and Curran (2004)

233 ESSLLI 2006 Treebank-Based Acquisition of LFG, HPSG and CCG Resources233 Parsing algorithms for CCG CCG derivations are binary trees, so standard chart parsing algorithms (e.g. CKY) can be used. Complexity: O(n^6) (or O(n^3) if the category set is fixed). Recovery of “deep” dependencies requires feature structures. Supertagging: assign the most likely categories to words before parsing. This significantly speeds up parsing!
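
As an illustration of why binary derivations fit standard chart parsing, here is a minimal CKY recogniser in Python. It is a sketch under strong assumptions: only forward and backward application are implemented, categories are nested tuples `(result, slash, argument)`, and the `lexicon` dictionary plays the role of a supertagger that has already chosen the candidate categories for each word.

```python
def combine(left, right):
    """Categories derivable from two adjacent categories by application."""
    out = []
    if isinstance(left, tuple) and left[1] == "/" and left[2] == right:
        out.append(left[0])           # X/Y  Y  =>  X
    if isinstance(right, tuple) and right[1] == "\\" and right[2] == left:
        out.append(right[0])          # Y  X\Y  =>  X
    return out


def cky(words, lexicon):
    """Return the set of categories spanning a non-empty sentence."""
    n = len(words)
    chart = {}
    for i, word in enumerate(words):
        chart[i, i + 1] = set(lexicon[word])
    # Three nested loops over span width, start and split point: cubic
    # in sentence length when the category set is fixed.
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            k = i + width
            cell = set()
            for j in range(i + 1, k):
                for l_cat in chart[i, j]:
                    for r_cat in chart[j, k]:
                        cell.update(combine(l_cat, r_cat))
            chart[i, k] = cell
    return chart[0, n]


VP = ("S", "\\", "NP")                 # S\NP
TV = (VP, "/", "NP")                   # (S\NP)/NP
lexicon = {"he": ["NP"], "saw": [TV], "her": ["NP"], "slept": [VP]}

print(cky(["he", "saw", "her"], lexicon))   # {'S'}
```

Restricting `lexicon[word]` to a handful of likely categories per word is what keeps the innermost loops small in practice.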

234 ESSLLI 2006 Treebank-Based Acquisition of LFG, HPSG and CCG Resources234 Parsing models
Generative models: P(s, τ). Model the process which generates the derivation τ
– Advantage: easy to guarantee consistency
– Disadvantage: requires good smoothing techniques, difficult to include complex features
Good baseline
Conditional models: P(τ | s). Given a sentence s, predict the most likely derivation τ
– Advantage: more natural for parsing
– Disadvantage: large model size, difficult to estimate

235 ESSLLI 2006 Treebank-Based Acquisition of LFG, HPSG and CCG Resources235 Evaluation: recovery of dependency structures
                                               Labelled  Unlabelled
Generative (Hockenmaier and Steedman, 2002):   83.3      90.3
Conditional (Clark and Curran, 2004):          84.6      91.2
This includes long-range dependencies

236 ESSLLI 2006 Treebank-Based Acquisition of LFG, HPSG and CCG Resources236 ccg2sem: from CCG to DRT A Prolog package which translates CCGbank derivations into Discourse Representation Theory structures (Bos, 2005)

237 ESSLLI 2006 Treebank-Based Acquisition of LFG, HPSG and CCG Resources237 CCGbanks for other languages German (Hockenmaier, 2006): –Translation of German TIGER corpus into CCG. –Many crossing dependencies, etc.: context-free approximations are inappropriate –Current coverage: 92.4% of all graphs (excluding headlines, fragments etc.) Turkish (Cakici, 2005): –Extracts a CCG lexicon from the METU Sabanci Treebank.

238 ESSLLI 2006 Treebank-Based Acquisition of LFG, HPSG and CCG Resources238 A few references
General CCG references:
M. Steedman (2000). The Syntactic Process. MIT Press.
M. Steedman (1996). Surface Structure and Interpretation. MIT Press.
CCGbank(s) and wide-coverage CCG parsing:
J. Hockenmaier and M. Steedman (2005). CCGbank: User’s Manual. MS-CIS-05-09, Dept. of Computer and Information Science, University of Pennsylvania.
J. Hockenmaier and M. Steedman (2002). Acquiring Compact Lexicalized Grammars from a Cleaner Treebank. LREC, Las Palmas, Spain.
J. Hockenmaier (2003). Data and Models for Statistical Parsing with Combinatory Categorial Grammar. PhD thesis, Informatics, University of Edinburgh.
J. Hockenmaier and M. Steedman (2002). Generative Models for Statistical Parsing with Combinatory Categorial Grammar. ACL ’02, Philadelphia, PA, USA.
S. Clark and J. R. Curran (2004). Parsing the WSJ using CCG and Log-Linear Models. ACL ’04, Barcelona, Spain.
S. Clark and J. R. Curran (2004). The Importance of Supertagging for Wide-Coverage CCG Parsing. Coling ’04, Geneva, Switzerland.
J. Bos (2005). Towards Wide-Coverage Semantic Interpretation. IWCS-6.
R. Cakici (2005). Automatic Induction of a CCG Grammar for Turkish. ACL Student Research Workshop, Ann Arbor, MI, USA.
J. Hockenmaier (2006). Creating a CCGbank and a wide-coverage CCG lexicon for German. ACL/COLING ’06, Sydney, Australia.

239 ESSLLI 2006 Treebank-Based Acquisition of LFG, HPSG and CCG Resources239 More references The CCG website, http://groups.inf.ed.ac.uk/ccg, has lots of general references about CCG (as well as CCGbank, CCG parsing, etc.). CCGbank is available from the Linguistic Data Consortium (LDC) at the University of Pennsylvania.

240 ESSLLI 2006 Treebank-Based Acquisition of LFG, HPSG and CCG Resources240 Penn-II-Based Acquisition of HPSG Resources Head-Driven Phrase Structure Grammar

241 ESSLLI 2006 Treebank-Based Acquisition of LFG, HPSG and CCG Resources241 Penn-II-Based Acquisition of HPSG Resources Introduction Treebank conversion and HPSG annotation Lexicon extraction Probabilistic models –Feature forest model –Design of features Parsing Evaluation Advanced topics

242 ESSLLI 2006 Treebank-Based Acquisition of LFG, HPSG and CCG Resources242 Introduction If we had an HPSG version of Penn-II, we could obtain lexical entries and probabilistic models How do we get HPSG-annotated Penn-II? Converting Penn-II into an HPSG-conformant treebank How do we verify the conformity with the HPSG theory? Principles are exploited for the verification –Implementation of principles is relatively easy, while construction of the lexicon is extremely difficult –Principles are hand-coded, while lexical entries are acquired from a converted treebank

243 ESSLLI 2006 Treebank-Based Acquisition of LFG, HPSG and CCG Resources243 Introduction We develop a treebank rather than a lexicon. A treebank provides more information than a lexicon:
– Verification of the consistency of the grammar
– Statistics
[Figure: diagram relating Principles, Lexicon and Treebank]

244 ESSLLI 2006 Treebank-Based Acquisition of LFG, HPSG and CCG Resources244 Methodology [Figure: workflow diagram with the components Treebank, Treebank conversion, HPSG treebank, Lexicon extraction, Lexicon (e.g. pretty/JJ, database/NN), Principles, and Grammar writer]

245 ESSLLI 2006 Treebank-Based Acquisition of LFG, HPSG and CCG Resources245 Comparison with conventional grammar development [Figure: two workflow diagrams, “Manual development” and “Treebank-based development”, built from the components Grammar writer, Principles, Lexicon, Lexicon extractor, Treebank, Parser and Corpus, with “edit” and “verify” arrows]

246 ESSLLI 2006 Treebank-Based Acquisition of LFG, HPSG and CCG Resources246 Treebank conversion and HPSG annotation Convert Penn-style parse trees into HPSG-style parse trees
– Correcting frequent errors in the Penn Treebank
  Ex. Confusion of VBD/VBN
– Converting tree structures
  Small clauses, passives, NP structures, auxiliary/control verbs, LDDs, etc.
– Mapping into HPSG-style representations
  Head/argument/modifier distinction, schema name assignment
  Mapping into HPSG categories
– Applying HPSG principles/schemas
  Undetermined features are filled
  Violations of feature constraints are detected

247 ESSLLI 2006 Treebank-Based Acquisition of LFG, HPSG and CCG Resources247 Overview [Figure: the Penn Treebank tree for “NL is officially making the offer” passing through three stages: error correction & tree conversion, mapping into HPSG-style representation (head/argument/modifier marking and the schema names subject-head, head-comp, head-mod), and principle application (HEAD, SUBJ, COMPS and MOD values filled in by structure-sharing)]

248 ESSLLI 2006 Treebank-Based Acquisition of LFG, HPSG and CCG Resources248 Tree conversion Coordination, quotation, insertion, and apposition Small clauses, “than” phrases, quantifier phrases, complementizers, etc. Disambiguation of non-/pre-terminal symbols (TO, etc.) HEAD features (CASE, INV, VFORM, etc.) Noun phrase structures Auxiliary/control verbs Subject extraction Long distance dependencies Relative clauses, reduced relatives

249 ESSLLI 2006 Treebank-Based Acquisition of LFG, HPSG and CCG Resources249 Pattern-based tree conversion
  tree_transform_rule("predicative", $Input, $Output) :-
      % Match a node with a daughter S that itself has a PRD-tagged daughter.
      tree_match(TREE_NODE\$Node &
                 TREE_DTRS\[tree_any & ANY_TREES\$LeftTrees,
                            (TREE_NODE\SYM\"S" &
                             TREE_DTRS\($PRDTrees & [tree_any, tree & TREE_NODE\FUNC\"PRD", tree_any])),
                            tree_any & ANY_TREES\$RightTrees],
                 $Input),
      % Splice the daughters of that S in between its left and right sisters.
      append_list([$LeftTrees, $PRDTrees, $RightTrees], $Dtrs),
      $Output = TREE_NODE\$Node & TREE_DTRS\$Dtrs.
[Figure: the tree pattern and its effect on “He considered himself superior”: the embedded S over NP “himself” and ADJP “superior” is removed, and the NP and ADJP attach directly under the VP]

250 ESSLLI 2006 Treebank-Based Acquisition of LFG, HPSG and CCG Resources250 Passive “be + VBN” constructions are assigned “VFORM passive” [Figure: tree for “the details have n’t been worked out” (NP-SBJ-2 … *-2), with worked/VBN marked VFORM passive]

251 ESSLLI 2006 Treebank-Based Acquisition of LFG, HPSG and CCG Resources251 Noun phrase structures Determiners are raised. Possessive structures are explicitly represented. [Figure: “Monsanto ’s director of plant sciences” before and after conversion; the converted tree contains DP and N’ nodes]

252 ESSLLI 2006 Treebank-Based Acquisition of LFG, HPSG and CCG Resources252 Auxiliary/control verbs Auxiliary/control verbs are annotated as taking unsaturated constituents [Figure: “they did n’t have to choose this particular moment” before and after conversion; the embedded S with the null subject *-1 becomes a VP, and the SUBJ values of the verbs are structure-shared]

253 ESSLLI 2006 Treebank-Based Acquisition of LFG, HPSG and CCG Resources253 Subject extraction HPSG does not allow subject extraction. Relativizers are treated as ordinary subjects in relative clauses. [Figure: “The company which has reported net losses” before and after conversion; the S node with the subject trace *T*-1 is removed, so that WHNP-1 “which” combines directly with the VP]

254 ESSLLI 2006 Treebank-Based Acquisition of LFG, HPSG and CCG Resources254 Subject relative Relativizers have a non-empty list in REL. The element of REL is consumed in a head-relative construction and represents the relative-antecedent relation. [Figure: “The company which has reported net losses” with the REL value of “which” structure-shared with the antecedent NP]

255 ESSLLI 2006 Treebank-Based Acquisition of LFG, HPSG and CCG Resources255 LDDs: Object relative SLASH represents moved arguments. REL represents relative-antecedent relations. [Figure: “the energy and ambitions that reformers wanted to reward” (WHNP-3 … *T*-3, NP-2 … *-2) annotated with SLASH and REL values]

256 ESSLLI 2006 Treebank-Based Acquisition of LFG, HPSG and CCG Resources256 Mapping into HPSG-style representations Convert nonterminal symbols into HPSG-style categories. Assign schema names to internal nodes.
  NN → HEAD: noun, AGR: 3sg
  VBD → HEAD: verb, VFORM: finite, TENSE: past

257 ESSLLI 2006 Treebank-Based Acquisition of LFG, HPSG and CCG Resources257 Category mapping & schema name assignment Example: “NL is officially making the offer” [Figure: the converted tree with head/arg/mod marks, and the same tree with HPSG categories (HEAD noun, HEAD verb, HEAD adv) and the schema names subject-head, head-comp and head-mod]

258 ESSLLI 2006 Treebank-Based Acquisition of LFG, HPSG and CCG Resources258 Principle application
  inverse_schema_binary(subj_head_schema, $Mother, $Left, $Right) :-
      $Left = (SYNSEM\($LeftSynsem & LOCAL\CAT\(HEAD\MOD\[] & VAL\(SUBJ\[] & COMPS\[] & SPR\[])))),
      $Right = (SYNSEM\LOCAL\CAT\(HEAD\$Head & VAL\(SUBJ\[$LeftSynsem] & COMPS\[] & SPR\[]))),
      $Mother = (SYNSEM\LOCAL\CAT\(HEAD\$Head & VAL\(SUBJ\[] & COMPS\[] & SPR\[]))).
[Figure: the schema applied to “He considered ...”: the subject daughter (HEAD: noun) is structure-shared with the SUBJ value of the head daughter (HEAD: verb); the mother has HEAD: verb and SUBJ: <>]

259 ESSLLI 2006 Treebank-Based Acquisition of LFG, HPSG and CCG Resources259 Principle application [Figure: “NL is officially making the offer” before and after principle application; the schema-annotated tree (subject-head, head-comp, head-mod) becomes a tree whose HEAD, SUBJ, COMPS and MOD values are filled in and structure-shared]

260 ESSLLI 2006 Treebank-Based Acquisition of LFG, HPSG and CCG Resources260 Complicated example [Figure: “the prices we were charged” with a null relativizer (WHNP-1, 0), a passive trace *-2 and an extraction trace *T*-1, annotated with head/arg marks]

261 ESSLLI 2006 Treebank-Based Acquisition of LFG, HPSG and CCG Resources261 Lexicon extraction Collecting leaf nodes of HPSG parse trees Generalizing leaf nodes into lexical entry templates Applying inverse lexical rules Assigning predicate argument structures

262 ESSLLI 2006 Treebank-Based Acquisition of LFG, HPSG and CCG Resources262 Overview
– Collection of leaf nodes & generalization
– Application of inverse lexical rules
– Assignment of predicate argument structures
[Figure: the three steps applied to the leaf node “making” in “NL is officially making the offer”, ending with a lexical entry for “make” whose CONT make’ links ARG1 and ARG2 to the CONT values of SUBJ and COMPS]

263 ESSLLI 2006 Treebank-Based Acquisition of LFG, HPSG and CCG Resources263 Collecting leaf nodes Leaf nodes of HPSG parse trees are instances of lexical entries [Figure: the HPSG parse tree for “NL is officially making the offer” with its leaf nodes highlighted]

264 ESSLLI 2006 Treebank-Based Acquisition of LFG, HPSG and CCG Resources264 Generalization into lexical entry templates Unnecessary constraints are removed (restriction) [Figure: a leaf node of the HPSG treebank and the lexical entry template obtained from it]
  lexical_entry_template($WordInfo, $Sign, $Template) :-
      copy($Sign, $Template),
      $Template = (SYNSEM\LOCAL\(CAT\HEAD\$Head & VAL\(SUBJ\$Subj & COMPS\$Comps & SPR\$SPR))),
      ...
      restriction($SubjSynsem, [NONLOCAL\]),
      restriction($SubjSynsem, [LOCAL\, CAT\, HEAD\, POSTHEAD\]),
      restriction($SubjSynsem, [LOCAL\, CAT\, HEAD\, AUX\]),
      restriction($SubjSynsem, [LOCAL\, CAT\, HEAD\, TENSE\]),
      ...

265 ESSLLI 2006 Treebank-Based Acquisition of LFG, HPSG and CCG Resources265 Application of inverse lexical rules Converting lexical entries of inflected words into lexical entries of lexemes using inverse lexical rules
– Derivational rules: Ex. passive rule [Figure: HEAD: verb entries differing in their SUBJ and COMPS values]
– Inflectional rules: Ex. past-tense rule [Figure: HEAD: verb, VFORM: finite, TENSE: past is mapped to HEAD: verb, VFORM: base]

266 ESSLLI 2006 Treebank-Based Acquisition of LFG, HPSG and CCG Resources266 Predicate argument structures Create mappings from syntactic arguments into semantic arguments [Figure: example lexical entry for “make”: HEAD verb; the CONT values of the noun-headed SUBJ and COMPS elements are structure-shared with ARG1 and ARG2 of make’]

270 ESSLLI 2006 Treebank-Based Acquisition of LFG, HPSG and CCG Resources270 Probabilistic models Feature forest model –A solution to the problem of the probabilistic modeling of feature structures Design of features –How to represent preferences of HPSG parse trees

271 ESSLLI 2006 Treebank-Based Acquisition of LFG, HPSG and CCG Resources271 Example: PCFG
Training data (observed freq. → estimated prob.):
  [S [NP She] [VP dances]]   0.3 → 0.15
  [S [NP I] [VP dance]]      0.3 → 0.15
  [S [NP She] [VP danced]]   0.2 → 0.2
  [S [NP I] [VP danced]]     0.2 → 0.2
CFG rule probabilities:
  S → NP VP     1.0
  NP → She      0.5
  NP → I        0.5
  VP → dances   0.3
  VP → dance    0.3
  VP → danced   0.4
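
The arithmetic behind this example can be checked with a short Python sketch. The observed frequencies used below (0.3, 0.3, 0.2, 0.2) are the values implied by the rule probabilities on the slide; the variable and function names are illustrative.

```python
from collections import Counter

# Observed relative frequencies of the four training trees
# (subject, verb); values as implied by the rule probabilities.
training = {
    ("She", "dances"): 0.3,
    ("I", "dance"): 0.3,
    ("She", "danced"): 0.2,
    ("I", "danced"): 0.2,
}

# Relative-frequency estimation: accumulate the mass of each rule.
np_mass, vp_mass = Counter(), Counter()
for (subj, verb), freq in training.items():
    np_mass[subj] += freq       # NP -> subj
    vp_mass[verb] += freq       # VP -> verb


def normalise(mass):
    total = sum(mass.values())
    return {rule: m / total for rule, m in mass.items()}


p_np, p_vp = normalise(np_mass), normalise(vp_mass)


def pcfg_prob(subj, verb):
    """P(tree) = P(S -> NP VP) * P(NP -> subj) * P(VP -> verb)."""
    return 1.0 * p_np[subj] * p_vp[verb]


for subj in ("She", "I"):
    for verb in ("dances", "dance", "danced"):
        print(subj, verb, round(pcfg_prob(subj, verb), 3))
```

Because the NP and VP rules are estimated independently, the loop also prints a non-zero probability for "She dance" and "I dances", and the probabilities of the grammatical trees no longer match their observed frequencies.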

272 ESSLLI 2006 Treebank-Based Acquisition of LFG, HPSG and CCG Resources272 What is the problem? PCFG assigns probabilities to ungrammatical structures
– “She dance” (0.15), “I dances” (0.15)
Training data (observed freq. → estimated prob.):
  [S [NP She] [VP dances]]   0.3 → 0.15
  [S [NP I] [VP dance]]      0.3 → 0.15
  [S [NP She] [VP danced]]   0.2 → 0.2
  [S [NP I] [VP danced]]     0.2 → 0.2

273 ESSLLI 2006 Treebank-Based Acquisition of LFG, HPSG and CCG Resources273 Feature structure constraints In HPSG, feature structures explain grammatical constraints: “She dance” and “I dances” are never generated. However, constraints of feature structures violate the “independence assumption” of probabilistic models (Abney 1997).
  S → NP[AGR: 1] VP[AGR: 1]
  NP[AGR: 3sg] → She
  NP[AGR: no3sg] → I
  VP[AGR: 3sg] → dances
  VP[AGR: no3sg] → dance
  VP → danced
How can we estimate probabilities in this situation?

274 ESSLLI 2006 Treebank-Based Acquisition of LFG, HPSG and CCG Resources274 Solution: ME model Probabilities of parse trees are estimated by maximum entropy models (Berger et al. 1996) Probability p(T) of parse tree T Optimal parameters are computed so as to maximize the likelihood of training data feature function parameter (feature weight) normalization factor
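
In the usual notation for maximum entropy models, the three ingredients named on this slide fit together as follows. The symbols are notational assumptions: f_i is a feature function, λ_i its parameter (feature weight), Z the normalization factor, and 𝒯 the set of parse trees under consideration.

```latex
p(T) \;=\; \frac{1}{Z}\,\exp\!\Big(\sum_i \lambda_i f_i(T)\Big)
     \;=\; \frac{1}{Z}\,\prod_i \alpha_i^{\,f_i(T)},
\qquad
Z \;=\; \sum_{T' \in \mathcal{T}} \exp\!\Big(\sum_i \lambda_i f_i(T')\Big),
\qquad
\alpha_i = e^{\lambda_i}
```

The multiplicative form with α_i is convenient when each feature counts the applications of one rule: the unnormalized score of a tree is then simply the product of the weights of the rules it uses.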

275 ESSLLI 2006 Treebank-Based Acquisition of LFG, HPSG and CCG Resources275 ME model of parse trees If feature functions correspond to CFG rules, this model is an extension of the PCFG model. Probabilities of parse trees are estimated without the independence assumption. [Figure: the parse tree [S [NP She] [VP dances]]]

276 ESSLLI 2006 Treebank-Based Acquisition of LFG, HPSG and CCG Resources276 Estimation by a ME model
Training data (observed freq. → estimated prob.):
  [S [NP She] [VP dances]]   0.3 → 0.3
  [S [NP I] [VP dance]]      0.3 → 0.3
  [S [NP She] [VP danced]]   0.2 → 0.2
  [S [NP I] [VP danced]]     0.2 → 0.2
ME parameters:
  S → NP[AGR: 1] VP[AGR: 1]   1.0
  NP[AGR: 3sg] → She          1.0
  NP[AGR: no3sg] → I          1.0
  VP[AGR: 3sg] → dances       1.145
  VP[AGR: no3sg] → dance      1.145
  VP → danced                 0.763
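
The parameters on this slide can be plugged into a few lines of Python to see where the estimated probabilities come from. The sketch assumes that the parameters are multiplicative weights and that the event space contains only the four trees licensed by the agreement constraint; the names are illustrative.

```python
# Multiplicative feature weights, one per rule, as listed on the slide.
weights = {
    "S -> NP VP": 1.0,
    "NP -> She": 1.0,
    "NP -> I": 1.0,
    "VP -> dances": 1.145,
    "VP -> dance": 1.145,
    "VP -> danced": 0.763,
}

# Only trees that satisfy the AGR constraint can be generated at all.
trees = [("She", "dances"), ("I", "dance"), ("She", "danced"), ("I", "danced")]


def unnormalised(subj, verb):
    """Product of the weights of the rules used in the tree."""
    return (weights["S -> NP VP"]
            * weights[f"NP -> {subj}"]
            * weights[f"VP -> {verb}"])


# Normalisation factor: sum over the licensed trees only.
Z = sum(unnormalised(s, v) for s, v in trees)
p = {tree: unnormalised(*tree) / Z for tree in trees}

for tree, prob in p.items():
    print(tree, round(prob, 3))
```

With these weights the model reproduces the observed frequencies (about 0.3 and 0.2), and no probability mass is spent on "She dance" or "I dances", because those trees are never part of the sum.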

277 ESSLLI 2006 Treebank-Based Acquisition of LFG, HPSG and CCG Resources277 Combinatorial explosion of parse trees Exponentially many parse trees are assigned to sentences (i.e., the set of trees T is exponentially large). [Figure: a packed S node whose NP daughter has n alternatives (NP_1, NP_2, …) and whose VP daughter has m alternatives (VP_1, VP_2, …); expanding it yields n × m trees]

278 ESSLLI 2006 Treebank-Based Acquisition of LFG, HPSG and CCG Resources278 Problems by combinatorial explosion Parameter estimation is intractable –Computation of Searching for the most probable parse is intractable –Computation of

279 ESSLLI 2006 Treebank-Based Acquisition of LFG, HPSG and CCG Resources279 Solutions in HMM and PCFG Probabilistic models are divided into independent probabilities, and dynamic programming is applied –Forward-backward probability –Baum-Welch algorithm –Inside-outside probability –Viterbi search Inside/outside probabilities can be computed at a cost proportional to the number of nodes, assuming a forest structure of parse trees

280 ESSLLI 2006 Treebank-Based Acquisition of LFG, HPSG and CCG Resources280 Feature forest model Dynamic programming can also be applied to maximum entropy estimation.
Feature forest:
– Forest structure isomorphic to CFG parse forest
– Assign feature functions to nodes rather than symbols
A ME model is estimated without unpacking feature forests.
[Figure: feature forest with the nodes f(S), f(NP_1), f(NP_2), f(VP_1), f(VP_2); size: n + m]

281 ESSLLI 2006 Treebank-Based Acquisition of LFG, HPSG and CCG Resources281 Feature forest representation of a parse tree A feature forest represents exponentially many trees of features. [Figure: a packed parse forest with n NP alternatives and m VP alternatives, and its feature forest representation with the nodes f(S), f(NP_1), f(NP_2), f(VP_1), f(VP_2); size: n + m]

282 ESSLLI 2006 Treebank-Based Acquisition of LFG, HPSG and CCG Resources282 Inside/outside trees of a feature forest Focus on the set of trees below/above the targeted node.
– Inside trees T_I(n): trees below n
– Outside trees T_O(n): trees above n
[Figure: feature forest with the inside trees T_I(NP_1) and the outside trees T_O(NP_1) marked]

283 ESSLLI 2006 Treebank-Based Acquisition of LFG, HPSG and CCG Resources283 Estimation algorithms for ME models Estimation of parameters requires computation of model expectations (Malouf 2002) Objective function Gradient Computed from training data Recomputed at each iteration
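
For a model of the exponential form, the objective function and its gradient take the standard shape below. This is a sketch in assumed notation: p̃ is the empirical distribution of the training data and p_λ the model with parameters λ.

```latex
L(\lambda) \;=\; \sum_{T} \tilde{p}(T)\,\log p_\lambda(T),
\qquad
\frac{\partial L}{\partial \lambda_i}
  \;=\; \underbrace{\sum_{T} \tilde{p}(T)\, f_i(T)}_{\text{computed from training data}}
  \;-\; \underbrace{\sum_{T} p_\lambda(T)\, f_i(T)}_{\text{recomputed at each iteration}}
```

The second term is the model expectation of f_i. It sums over all parse trees, which is what makes a packed representation necessary when the number of trees is exponential.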

284 ESSLLI 2006 Treebank-Based Acquisition of LFG, HPSG and CCG Resources284 Inside/outside products Unnormalized product; inside product; outside product [Figure: feature forest with the nodes f(S), f(NP_1), f(NP_2), f(VP_1), f(VP_2)]

285 ESSLLI 2006 Treebank-Based Acquisition of LFG, HPSG and CCG Resources285 Computation of inside products The inside product of NP_1 is a product of the inside products of its daughters. [Figure: feature forest with the nodes f(NP_1), f(NP_2), f(N_1), f(N_2), f(N’_1), f(N’_2)]

286 ESSLLI 2006 Treebank-Based Acquisition of LFG, HPSG and CCG Resources286 Computation of outside products The outside product of NP_1 is a product of the mother’s outside product and the sisters’ inside products. [Figure: feature forest with the nodes f(S), f(NP_1), f(NP_2), f(VP_1), f(VP_2)]

287 ESSLLI 2006 Treebank-Based Acquisition of LFG, HPSG and CCG Resources287 Computation of model expectations Expectation of f_i at NP_1: sum of the unnormalized products of the trees including NP_1. [Figure: feature forest with the nodes f(S), f(NP_1), f(NP_2), f(VP_1), f(VP_2)]
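
The inside/outside computation can be sketched in Python on the five-node forest used in these slides. The encoding is an assumption made for illustration: each conjunctive node maps to a weight (the exponentiated sum of its weighted features) and a list of disjunctive daughters, each of which lists its alternative conjunctive nodes. The numeric weights are invented.

```python
from math import prod

# node -> (weight, [disjunctive daughters]); a disjunctive daughter is
# a list of alternative conjunctive nodes. Weights are made up.
FOREST = {
    "S":   (1.0, [["NP1", "NP2"], ["VP1", "VP2"]]),
    "NP1": (2.0, []),
    "NP2": (0.5, []),
    "VP1": (3.0, []),
    "VP2": (1.0, []),
}
ROOT = "S"


def inside_products(forest, root):
    """Inside product: own weight times, per daughter, the sum of the
    alternatives' inside products. Entries are stored in post-order."""
    memo = {}

    def visit(node):
        if node not in memo:
            weight, daughters = forest[node]
            memo[node] = weight * prod(
                sum(visit(c) for c in alts) for alts in daughters)
        return memo[node]

    visit(root)
    return memo


def outside_products(forest, root, inside):
    """Outside product: mother's outside product and weight times the
    summed inside products of the sister daughters."""
    outside = {node: 0.0 for node in inside}
    outside[root] = 1.0
    for node in reversed(list(inside)):      # mothers before daughters
        weight, daughters = forest[node]
        sums = [sum(inside[c] for c in alts) for alts in daughters]
        for j, alts in enumerate(daughters):
            sisters = prod(s for i, s in enumerate(sums) if i != j)
            for c in alts:
                outside[c] += outside[node] * weight * sisters
    return outside


def node_probabilities(forest, root):
    """Probability that a node occurs in a tree drawn from the model."""
    inside = inside_products(forest, root)
    outside = outside_products(forest, root, inside)
    z = inside[root]                          # normalisation factor
    return {n: inside[n] * outside[n] / z for n in inside}


probs = node_probabilities(FOREST, ROOT)
print(probs)
```

The model expectation of a feature is then the sum, over the nodes where the feature fires, of its value times the node probability. The cost is proportional to the size of the forest (n + m here) rather than to the n × m unpacked trees.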

288 ESSLLI 2006 Treebank-Based Acquisition of LFG, HPSG and CCG Resources288 Viterbi search Almost the same as the computation of inside products
– “max” rather than “sum”
[Figure: feature forest with the nodes f(NP_1), f(NP_2), f(N_1), f(N_2), f(N’_1), f(N’_2)]
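
A minimal Python sketch of this search, assuming the same illustrative forest encoding as for the inside products (conjunctive node mapped to a positive weight and a list of disjunctive daughters); the weights are invented:

```python
from functools import lru_cache

# node -> (weight, [disjunctive daughters]); weights are made up.
FOREST = {
    "S":   (1.0, [["NP1", "NP2"], ["VP1", "VP2"]]),
    "NP1": (2.0, []),
    "NP2": (0.5, []),
    "VP1": (3.0, []),
    "VP2": (1.0, []),
}


@lru_cache(maxsize=None)
def viterbi(node):
    """Return (best score, best tree) for the trees rooted in `node`.

    Same recursion as the inside product, except that each disjunctive
    daughter contributes its best alternative instead of the sum."""
    weight, daughters = FOREST[node]
    score, subtrees = weight, []
    for alts in daughters:
        best_score, best_tree = max(
            (viterbi(c) for c in alts), key=lambda r: r[0])
        score *= best_score
        subtrees.append(best_tree)
    return score, (node, *subtrees)


best_score, best_tree = viterbi("S")
print(best_score, best_tree)
```

Memoising on the node name means shared sub-forests are solved once, so the search stays linear in the size of the forest.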

289 ESSLLI 2006 Treebank-Based Acquisition of LFG, HPSG and CCG Resources289 Design of features Feature engineering is important for higher accuracy Feature functions are designed for capturing syntactic/semantic preferences of HPSG parse trees

290 ESSLLI 2006 Treebank-Based Acquisition of LFG, HPSG and CCG Resources290 A chart for HPSG parsing Equivalent signs are packed. [Figure: chart for “he saw a girl with a telescope”, whose cells hold signs such as HEAD noun / SUBCAT <>, HEAD verb, and HEAD prep with MOD NP or MOD VP]

291 ESSLLI 2006 Treebank-Based Acquisition of LFG, HPSG and CCG Resources291 Feature forest representation of a chart Node = each rule application [Figure: feature forest built from the chart, each node pairing a mother sign with its daughter signs]

292 ESSLLI 2006 Treebank-Based Acquisition of LFG, HPSG and CCG Resources292 Feature forest representation of predicate argument structures Node = already-determined predicate argument relations [Figure: feature forest for “She ignored the fact that I wanted to dispute”, with nodes for fact, want and the alternatives dispute1 and dispute2, linked by ARG1/ARG2 relations]

293 ESSLLI 2006 Treebank-Based Acquisition of LFG, HPSG and CCG Resources293 Extraction of probabilistic events
  extract_binary_event("hpsg-forest", "bin", $RuleName, $LDtr, $RDtr, _, _, $Event) :-
      $Event = [$RuleName, $Dist, $Depth|$HDtrFeatures],
      find_head($Rule, $LSign, $RSign, $Head, $NonHead),
      rule_name_mapping($Rule, $Head, $NonHead, $RuleName),
      encode_distance($LSign, $RSign, $Dist),
      encode_depth($LSign, $RSign, $Depth),
      encode_sign($Head, $HDtrFeatures, $NDtrFeatures),
      encode_sign($NonHead, $NDtrFeatures, []).
[Figure: parse tree for “Cool boys never ran” annotated with the extracted information: schema, distance, depth, span, nonterminal symbol, POS, word, lexical entry]

294 ESSLLI 2006 Treebank-Based Acquisition of LFG, HPSG and CCG Resources294 Atomic features
  RULE: name of applied rule
  DIST: distance between head words
  COMMA: whether the phrase includes commas
  SPAN: number of words the phrase dominates
  SYM: nonterminal symbol (e.g. S, VP, …)
  WORD: head word
  POS: part-of-speech
  LE: lexical entry
  ARG: argument label (ARG1, ARG2, ...)

295 ESSLLI 2006 Treebank-Based Acquisition of LFG, HPSG and CCG Resources295 Example: syntactic features Feature for the Head-Modifier construction for “saw a girl” and “with a telescope” [Figure: chart for “he saw a girl with a telescope”]

296 ESSLLI 2006 Treebank-Based Acquisition of LFG, HPSG and CCG Resources296 Example: semantic features Feature for the predicate argument relation between “he” and “saw” [Figure: predicate argument structure of “saw”, with ARG1 “he” and ARG2 “girl”]

297 Feature generation
Features are generated by abstracting descriptions of probabilistic events
feature_mask("hpsg-forest", "bin", [1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0]).
feature_mask("hpsg-forest", "bin", [1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0]).
feature_mask("hpsg-forest", "bin", [1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0]).
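
The masks above can be read as selecting which fields of an event description are kept when a feature is generated. The following is a minimal Python sketch of that idea, not the original implementation: the helper `apply_mask`, the wildcard `"*"`, and the 13-field event layout (rule, distance, depth, then span, symbol, word, POS and lexical entry for the head and non-head daughters) are assumptions made for illustration, as are all field values.

```python
def apply_mask(event, mask):
    """Keep the event fields whose mask bit is 1; replace the rest with '*'."""
    if len(event) != len(mask):
        raise ValueError("event and mask must have the same length")
    return tuple(value if bit == 1 else "*" for value, bit in zip(event, mask))


# Hypothetical event: [rule, dist, depth,
#                      head span, head sym, head word, head POS, head LE,
#                      non-head span, non-head sym, non-head word, non-head POS, non-head LE]
event = ["head_mod", 3, 1,
         3, "VP", "saw", "VBD", "verb_trans",
         3, "PP", "with", "IN", "prep_mod_vp"]

# The three masks listed on the slide.
masks = [
    [1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0],
    [1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0],
    [1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0],
]

# One event yields one feature per mask, from most to least specific.
features = [apply_mask(event, mask) for mask in masks]
for feature in features:
    print(feature)
```

Under this reading, each successive mask keeps fewer fields, so the same event contributes both specific and more general features.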

298 Parsing
Efficient processing of feature structures (details omitted)
–Abstract machines, quick check, CFG filtering, etc.
Efficient search with probabilistic HPSG
–Beam thresholding
–Iterative beam thresholding

299 Beam thresholding
Edges are pruned in each cell of the chart according to their figure of merit (FOM)
–Thresholding by number: for each cell, keep only the best n edges
–Thresholding by width: for each cell, keep only the edges whose FOM is within w of the best FOM in the same cell (FOM > best FOM − w)
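
The two thresholds can be sketched in Python as follows. This is an illustrative sketch rather than the parser's actual code: the function name `threshold_cell` and the representation of an edge as a `(fom, label)` pair are assumptions, and the FOM is assumed to be a score where higher is better (for example a log-probability).

```python
def threshold_cell(edges, n, w):
    """Prune one chart cell.

    edges: list of (fom, label) pairs, higher FOM is better.
    n:     thresholding by number, keep at most the n best edges.
    w:     thresholding by width, keep edges with FOM > best FOM - w.
    """
    if not edges:
        return []
    ranked = sorted(edges, key=lambda edge: edge[0], reverse=True)
    best_fom = ranked[0][0]
    kept_by_number = ranked[:n]
    return [edge for edge in kept_by_number if edge[0] > best_fom - w]


cell = [(-1.0, "a"), (-2.5, "b"), (-1.2, "c"), (-9.0, "d")]
print(threshold_cell(cell, n=3, w=2.0))  # edge "d" is dropped by number
print(threshold_cell(cell, n=3, w=1.0))  # edge "b" is also dropped by width
```

Applying both thresholds to every cell bounds the work per cell, at the cost of possibly discarding an edge that the best full parse needs.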

300 Effect of beam thresholding
Precision and recall as the parameters of beam search are varied
Recall drops, while precision is retained

301 Iterative beam thresholding
Start with a narrow beam width
Keep widening the beam width until parsing succeeds
Iterative_parse(sentence) {
    w := beam_width_start;
    while (w < beam_width_end) {
        parse(sentence, w);
        if (parse succeeds) return;
        w := w + beam_width_step;
    }
}
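
The loop above can be written as a small runnable Python sketch. The `parse` callable is a hypothetical stand-in for a beam-search parser and is assumed to return `None` when no parse is found with the given width; the parameter names mirror the pseudocode and are otherwise not taken from any real parser.

```python
def iterative_parse(sentence, parse, width_start, width_end, width_step):
    """Retry parsing with a wider beam until it succeeds or the limit is reached.

    Returns (result, width_used), or (None, None) if every width failed.
    """
    w = width_start
    while w < width_end:
        result = parse(sentence, w)
        if result is not None:
            return result, w
        w += width_step
    return None, None


# Toy parser for illustration only: it "succeeds" once the beam is wide enough.
def toy_parse(sentence, w):
    return "tree" if w >= 6 else None


print(iterative_parse("he saw a girl", toy_parse, 4, 20, 2))
```

Most sentences are expected to succeed with the narrow starting beam, so the wider and slower beams are paid for only by the sentences that need them.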

302 Efficacy of iterative beam thresholding
Evaluated on Penn Treebank Section 24 (< 15 words)
Method      Precision   Recall   F-score   Avg. time (ms)
Viterbi     88.2%       87.9%    88.1%     103923
Beam        89.0%       82.4%    85.5%     88
Iterative   87.6%       87.2%    87.4%     99

303 Distribution of parsing time
Black: Viterbi, Red: iterative beam thresholding

304 Evaluation
Evaluation of the lexical entries extracted from Penn Treebank
–Investigation of obtained lexical entries
–Coverage
Evaluation of the disambiguation model
–Parsing accuracy

305 Experimental settings
Training data: Sections 2-21 of Penn Treebank II (39,832 sentences)
Test data:
–Development set: Section 22 (1,700 sentences)
–Final test set: Section 23 (2,416 sentences)

306 Number of tree conversion rules
Target of conversion                 Number
Penn-II errors                       102
Category mapping                     85
Head annotation and binarization     63
Difference of phrase structures      15
Predicate argument structures        13
Long distance dependencies           13
Others                               52
Total                                343

307 Result of treebank conversion & lexicon extraction
Treebank conversion and HPSG annotation succeeded for 37,886 sentences
Extracted lexicon:
# words                           34,765
# lexical entries                 1,942
Average # lexical entries/word    1.43

308 Sources of treebank conversion failures
Classification of failures of treebank conversion in Section 02 (67 failures/1989 sentences)
Shortcomings of tree conversion rules    18
Errors in Penn Treebank                  16
Constructions currently unsupported      20
Constructions unsupported by HPSG        13

309 Breakdown of extracted lexical entries
Category       # words   # lexical entries   Avg. # lex. entries
noun           21,925    186                 1.14
verb           4,094     945                 1.94
adjective      8,078     62                  1.28
adverb         1,295     72                  2.75
preposition    159       193                 9.17
particle       58        10                  1.69
determiner     36        33                  3.86
conjunction    94        321                 9.46
punctuation    15        120                 22.00
Total          34,765    1,942               1.43

310 Example lexical entries
Common noun (ex. review/NN), appeared 140,805 times
    HEAD noun  MOD <>  VAL SPR  SUBJ <>  COMPS <>
Transitive verb, appeared 12,244 times
    HEAD verb  MOD <>  VFORM base  VAL SPR <>  SUBJ  COMPS
Pre-head adjective, appeared 55,049 times
    HEAD adj  MOD  POSTHEAD -  VAL SPR <>  SUBJ <>  COMPS <>

311 Evaluation of coverage
The ratio of lexical entries in the test data covered by the grammar is measured
A sentence is covered when all of the lexical entries in the sentence are covered (strong coverage)
                              Lexical entry   Sentence
w/o unknown word handling     96.52%          54.7%
w/ unknown word handling      99.15%          84.8%
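
The two coverage figures can be computed as in the following Python sketch. It assumes that each test sentence is given as a list of (word, lexical entry) pairs taken from its correct analysis, that the extracted lexicon is a set of such pairs, and that the test set is not empty; the function name and the toy data are made up for illustration.

```python
def coverage(sentences, lexicon):
    """Return (lexical entry coverage, strong sentence coverage).

    sentences: non-empty list of sentences, each a non-empty list of
               (word, lexical_entry) pairs.
    lexicon:   set of (word, lexical_entry) pairs known to the grammar.
    """
    total_entries = 0
    covered_entries = 0
    covered_sentences = 0
    for sentence in sentences:
        hits = [pair in lexicon for pair in sentence]
        total_entries += len(hits)
        covered_entries += sum(hits)
        # Strong coverage: every lexical entry in the sentence must be covered.
        covered_sentences += all(hits)
    return covered_entries / total_entries, covered_sentences / len(sentences)


lexicon = {("he", "np"), ("saw", "tv"), ("ran", "iv")}
test_set = [
    [("he", "np"), ("saw", "tv")],
    [("he", "np"), ("ran", "tv")],  # ("ran", "tv") is missing from the lexicon
]
print(coverage(test_set, lexicon))
```

A single missing entry makes the whole sentence uncovered, which is why sentence coverage in the table is much lower than lexical entry coverage.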

312 Treebank size vs. coverage

313 Sentence length vs. coverage

314 Error analysis
Classification of randomly selected uncovered lexical entries
Errors of Penn Treebank                   10
Errors of treebank conversion             48
Lack of lexical entries                   23
Constructions currently unsupported       9
Idioms                                    6
Non-linguistic expressions (ex. list)     4

315 Examples of uncovered lexical entries
Lack of mappings from words into lexical entries because of data sparseness
–Post-noun adjectives (younger, crucial)
–Coordination conjunctions of NP and S’
–Verbs taking present-participle as a complement
Unsupported constructions
–Free relatives, extrapositions
Incorrect lexical entries obtained because of idiomatic expressions
–(ADVP in part) because …

316 Evaluation of parsing accuracy
Empirical evaluation of the probabilistic models
–Overall accuracy
–Treebank size vs. accuracy
–Sentence length vs. accuracy
–Contribution of features
–Coverage and accuracy
–Error analysis
Measure: precision/recall of predicate argument relations
–e.g. "saw" with ARG1 "he" and ARG2 "girl"
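
Precision and recall over relations such as the "saw" example can be computed as in this Python sketch. Representing each relation as a (predicate, label, argument) triple and comparing sets of triples is an assumption made for illustration; the exact matching criteria used in the evaluation are not stated on the slide.

```python
def precision_recall(gold, predicted):
    """Precision and recall of predicted relations against gold relations.

    gold, predicted: sets of (predicate, argument_label, argument) triples.
    """
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


gold = {("saw", "ARG1", "he"), ("saw", "ARG2", "girl")}
predicted = {("saw", "ARG1", "he"), ("saw", "ARG2", "telescope")}
print(precision_recall(gold, predicted))
```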

317 Effect of feature forest models
Accuracy for Section 23 (< 40 words)
                           Precision   Recall
baseline                   78.10       77.39
with syntactic features    86.92       86.28
with semantic features     84.29       83.74
with all features          86.54       86.02

318 Treebank size vs. accuracy

319 Sentence length vs. accuracy

320 Contribution of features (1/2)
Features   precision   recall   # features
All        87.12       85.45    623,173
- RULE     86.98       85.37    620,511
- DIST     86.74       85.09    603,748
- COMMA    86.55       84.77    608,117
- SPAN     86.53       84.98    583,638
- SYM      86.90       85.47    614,975
- WORD     86.67       84.98    116,044
- POS      86.36       84.71    430,876
- LE       87.03       85.37    412,290
None       78.22       76.46    24,847

321 Contribution of features (2/2)
Features                   precision   recall   # features
All                        87.12       85.45    623,173
- DIST,SPAN                85.54       84.02    294,971
- DIST,SPAN,COMMA          83.94       82.44    286,489
- RULE,DIST,SPAN,COMMA     83.61       81.98    283,897
- WORD,LE                  86.48       84.91    50,258
- WORD,POS                 85.56       83.94    64,915
- WORD,POS,LE              84.89       83.43    33,740
- SYM,WORD,POS,LE          82.81       81.48    26,761
None                       78.22       76.46    24,847

322 Coverage and accuracy
Accuracies for strongly covered/uncovered sentences
We can expect accuracy improvements by improving grammar coverage
                       Precision   Recall   # sentences
Covered sentences      89.36       88.96    1,825
Uncovered sentences    75.57       74.04    319

323 Error analysis
Classification of errors in randomly selected sentences (100 sentences)
PP-attachment ambiguity                 76
Distinction of arguments/modifiers      49
Ambiguity of lexical entries            44
Errors in test data                     22
Ambiguity of commas                     32
Others                                  75

324 Examples of errors (1/2)
Antecedent of a relative clause
–It's made only in years when the grapes ripen perfectly (the last was 1979) and comes from a single acre of [NP grapes [S' that yielded a mere 75 cases in 1987 ]].
Argument/modifier distinction of to-phrases
–More than a few CEOs say the red-carpet treatment tempts them [VP-modifier to return to a heartland city for future meetings ].

325 Examples of errors (2/2)
Preposition or verb phrase?
–Mitsui Mining & Smelting Co. posted a 62 % rise in pretax profit to 5.276 billion yen ($ 36.9 million) in its fiscal first half ended Sept. 30 [VP compared with 3.253 billion yen a year earlier ].
Selection of subcategorization frames
–[NP-subject ``Nasty innuendoes,'' ] [VP says [NP-object John Siegal, Mr. Dinkins's issues director, ``designed to prosecute a case of political corruption that simply doesn't exist.'' ]]

326 Advanced topics
Domain adaptation
–Adapting the grammar and/or the disambiguation model to a new domain using a small amount of training data
Generation
–Using the grammar for sentence generation
Semantics construction
–Obtaining representations of formal semantics from HPSG parsing
Applications

327 Domain adaptation (1/2)
Disambiguation models are adapted to the biomedical domain using a small amount of training data
–The original probabilistic model is incorporated into the new model as a reference distribution
–Parameters of the new model are estimated so as to maximize the likelihood of the new training data
[Equation: the new model, with the reference distribution term labeled]
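
A log-linear model with a reference distribution is commonly written as below. This is a sketch in generic notation, not necessarily the notation of the original slide: $p_0$ stands for the original (news-domain) model, $f_i$ for feature functions, $\lambda_i$ for the weights estimated on the new training data, and $T(s)$ for the set of candidate parses of sentence $s$; all of these symbol names are assumptions.

```latex
p(t \mid s) = \frac{1}{Z_s}\, p_0(t \mid s)\,
              \exp\Big( \sum_i \lambda_i f_i(t, s) \Big),
\qquad
Z_s = \sum_{t' \in T(s)} p_0(t' \mid s)\,
      \exp\Big( \sum_i \lambda_i f_i(t', s) \Big)
```

With all $\lambda_i = 0$ the model reduces to $p_0$, so the new parameters only need to capture how the new domain differs from the original one.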

328 Domain adaptation (2/2)
Evaluation with a bio-domain corpus
Training data:
–Penn Treebank (News): 39,832 sentences
–GENIA Treebank (Bio): 3,524 sentences
                               Precision   Recall
News domain                    87.69%      87.16%
Bio domain (w/o adaptation)    85.50%      83.91%
Bio domain (w/ adaptation)     87.19%      85.58%

329 Generation (1/2)
The methods for HPSG parsing are applied to a chart generator of HPSG
–Feature forest model
–Iterative beam thresholding
[Figure: chart generation vs. chart parsing for "He bought the book." In generation, the input he(x), buy(e), the(y), book(z), past(e) fills chart cells indexed by subsets of {0, 1, 2, 3}, from {0} up to {0,1,2,3}; in parsing, chart cells are indexed by spans, from 0-1 up to 0-3]
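
The difference between the two charts can be sketched in Python: a parsing chart has one cell per contiguous span, while a generation chart has one cell per non-empty subset of the input elements. The function names below are hypothetical, and the sketch only enumerates cell indices; it does not build edges.

```python
from itertools import combinations


def parsing_cells(n):
    """Cells of a parsing chart over positions 0..n: all spans (i, j) with i < j."""
    return [(i, j) for i in range(n) for j in range(i + 1, n + 1)]


def generation_cells(m):
    """Cells of a generation chart over m input elements: all non-empty subsets."""
    return [frozenset(c) for k in range(1, m + 1) for c in combinations(range(m), k)]


print(len(parsing_cells(3)))     # spans 0-1, 0-2, 0-3, 1-2, 1-3, 2-3
print(len(generation_cells(4)))  # subsets {0}, {1}, ..., {0,1,2,3}
```

The number of subsets grows exponentially with the input size, whereas the number of spans grows quadratically, which is one reason pruning techniques such as beam thresholding matter for chart generation.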

330 Generation (2/2)
Evaluation on Penn Treebank Section 23
Method                         Beam width   Coverage (%)   Avg. generation time (msec.)   BLEU
Beam thresholding              4            44.76          621                            0.8196
Beam thresholding              8            67.70          1776                           0.8294
Beam thresholding              12           73.12          3074                           0.8327
Beam thresholding              16           72.90          4287                           0.8341
Beam thresholding              20           71.81          5273                           0.8333
Iterative beam thresholding    8-20         82.47          1668                           0.7982

331 Semantics construction (1/2)
Mapping from HPSG parse trees into semantic representations of typed dynamic logic (TDL)
–Typed dynamic logic: a variant of dynamic semantics that includes plural semantics, event semantics, and situation semantics (Bekki, 2005)
–Completely compositional semantics: lambda calculus composes semantic representations of phrases from lexical representations
Example: "Few boys fell. They died." is mapped to few(x)[boy′ x][fall′ x] ∧ ref(x)[die′ x]

332 Semantics construction (2/2)
Approach:
–HPSG lexical entries are mapped into lexical representations of TDL
–Semantic representations of phrases are composed along HPSG parse trees
Coverage: around 90% of the sentences in Penn Treebank Section 23 are assigned well-formed semantic representations
[Figure: HPSG lexical entry for “loves” with PHON, HEAD verb, SUBJ and COMPS]

333 Applications: information extraction
Extraction of protein-protein interactions from biomedical paper abstracts
–Patterns on predicate argument structures are learned from small annotated data
–Precision/recall: 71.8%/48.4%

334 Applications: text retrieval
Retrieval of relational concepts
–All sentences in MEDLINE are parsed into predicate argument structures
–Relational concepts, such as “what causes cancer”, are retrieved by matching with predicate argument structures
–Precision/recall: 60-96%/30-50%

335 Summary
Conversion of Penn Treebank II into an HPSG treebank
–Pattern-based tree conversion and principle application
Extraction of lexical entries from the HPSG treebank
–Generalization, application of inverse lexical rules, and assignment of predicate argument structures
Probabilistic modeling of feature structures
–Feature forest model
Techniques for efficient parsing with probabilistic HPSG
–Iterative beam thresholding
Evaluation
–Coverage and parsing accuracy
Advanced topics
–Domain adaptation, sentence generation, semantics construction, and practical applications

336 Publications
Corpus-oriented development of HPSG
–Y. Miyao, T. Ninomiya, and J. Tsujii. (2003). Lexicalized Grammar Acquisition. In Proc. 10th EACL Companion Volume.
–Y. Miyao, T. Ninomiya, and J. Tsujii. (2004). Corpus-oriented grammar development for acquiring a Head-Driven Phrase Structure Grammar from the Penn Treebank. In Proc. IJCNLP 2004.
–H. Nakanishi, Y. Miyao, and J. Tsujii. (2004). Using Inverse Lexical Rules to Acquire a Wide-coverage Lexicalized Grammar. In the IJCNLP 2004 Workshop on “Beyond Shallow Analyses.”
–H. Nakanishi, Y. Miyao, and J. Tsujii. (2004). An Empirical Investigation of the Effect of Lexical Rules on Parsing with a Treebank Grammar. In Proc. TLT 2004.
–K. Yoshida. (2005). Corpus-Oriented Development of Japanese HPSG Parsers. In 43rd ACL Student Research Workshop.

337 Publications
Feature forest model
–Y. Miyao and J. Tsujii. (2002). Maximum entropy estimation for feature forests. In Proc. HLT 2002.
Probabilistic models for HPSG
–Y. Miyao and J. Tsujii. (2003). A model of syntactic disambiguation based on lexicalized grammars. In Proc. 7th CoNLL.
–Y. Miyao, T. Ninomiya, and J. Tsujii. (2003). Probabilistic modeling of argument structures including non-local dependencies. In Proc. RANLP 2003.
–Y. Miyao and J. Tsujii. (2005). Probabilistic disambiguation models for wide-coverage HPSG parsing. In Proc. ACL 2005.
–T. Ninomiya, T. Matsuzaki, Y. Tsuruoka, Y. Miyao, and J. Tsujii. (2006). Extremely Lexicalized Models for Accurate and Fast HPSG Parsing. In Proc. EMNLP 2006.

338 Publications
Parsing strategies for probabilistic HPSG
–Y. Tsuruoka, Y. Miyao, and J. Tsujii. (2004). Towards efficient probabilistic HPSG parsing: integrating semantic and syntactic preference to guide the parsing. In the IJCNLP-04 Workshop on “Beyond shallow analyses.”
–T. Ninomiya, Y. Tsuruoka, Y. Miyao, and J. Tsujii. (2005). Efficacy of Beam Thresholding, Unification Filtering and Hybrid Parsing in Probabilistic HPSG Parsing. In Proc. IWPT 2005.
–T. Ninomiya, Y. Tsuruoka, Y. Miyao, K. Taura, and J. Tsujii. (2006). Fast and Scalable HPSG Parsing. Traitement automatique des langues (TAL). 46(2).
Domain adaptation
–T. Hara, Y. Miyao, and J. Tsujii. (2005). Adapting a probabilistic disambiguation model of an HPSG parser to a new domain. In Proc. IJCNLP 2005.

339 Publications
Generation
–H. Nakanishi, Y. Miyao, and J. Tsujii. (2005). Probabilistic models for disambiguation of an HPSG-based chart generator. In Proc. IWPT 2005.
Semantics construction
–M. Sato, D. Bekki, Y. Miyao, and J. Tsujii. (2006). Translating HPSG-style Outputs of a Robust Parser into Typed Dynamic Logic. In Proc. COLING-ACL 2006 Poster Session.
Applications
–Y. Miyao, T. Ohta, K. Masuda, Y. Tsuruoka, K. Yoshida, T. Ninomiya, and J. Tsujii. (2006). Semantic Retrieval for the Accurate Identification of Relational Concepts. In Proc. COLING-ACL 2006.
–A. Yakushiji, Y. Miyao, T. Ohta, Y. Tateisi, and J. Tsujii. (2006). Automatic Construction of Predicate-Argument Structure Patterns for Biomedical Information Extraction. In EMNLP 2006 Poster Session.

340 Comparing LFG, CCG, HPSG and TAG Acquisition

341 Comparing LFG, CCG, HPSG and TAG Acquisition

342 Demos

343 Demos

344 Future Work & Discussion

345 Future Work & Discussion

