Issues in Computational Linguistics: Grammar Engineering Dick Crouch and Tracy King.


1 Issues in Computational Linguistics: Grammar Engineering Dick Crouch and Tracy King

2 Outline What is a deep grammar? How to engineer them: –robustness –integrating shallow resources –ambiguity –writing efficient grammars –real world data

3 What is a shallow grammar? Often trained automatically from marked up corpora –part of speech tagging –chunking –trees

4 POS tagging and Chunking Part of speech tagging: I/PRP saw/VBD her/PRP duck/VB ./PUNCT I/PRP saw/VBD her/PRP$ duck/NN ./PUNCT Chunking: –general chunking [I begin] [with an intuition]: [when I read] [a sentence], [I read it] [a chunk] [at a time]. (Abney) –NP chunking [NP President Clinton] visited [NP the Hermitage] in [NP Leningrad]
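NP chunking over tagged input can be sketched in a few lines. This is a minimal illustration, not a trained chunker: the tag set treated as NP-internal and the function name `np_chunks` are assumptions made for the example.

```python
def np_chunks(tagged):
    """Group maximal runs of NP-internal tags into chunks (toy heuristic)."""
    np_tags = {"DT", "JJ", "NN", "NNS", "NNP", "PRP", "PRP$"}  # assumed tag set
    chunks, current = [], []
    for word, tag in tagged:
        if tag in np_tags:
            current.append(word)
        elif current:
            # A non-NP tag closes the chunk being built.
            chunks.append(current)
            current = []
    if current:
        chunks.append(current)
    return chunks


sentence = [("President", "NNP"), ("Clinton", "NNP"), ("visited", "VBD"),
            ("the", "DT"), ("Hermitage", "NNP"), ("in", "IN"),
            ("Leningrad", "NNP")]
print(np_chunks(sentence))
```

The result corresponds to the three bracketed NPs in the slide's example.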

5 Treebank grammars Phrase structure tree (c-structure) Annotations for heads, grammatical functions Collins parser output

6 Deep grammars Provide detailed syntactic/semantic analyses –LFG (ParGram), HPSG (LinGO, Matrix) –Grammatical functions, tense, number, etc. Mary wants to leave. subj(want~1,Mary~3) comp(want~1,leave~2) subj(leave~2,Mary~3) tense(leave~2,present) Usually manually constructed –linguistically motivated rules

7 Why would you want one Meaning sensitive applications –overkill for many NLP applications Applications which use shallow methods for English may not be able to for "free" word order languages –can read many functions off of trees in English SUBJ: NP sister to VP [S [NP Mary] [VP left]] OBJ: first NP sister to V [S [NP Mary] [VP saw [NP John]]] –need other information in German, Japanese, etc.
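The two configurational rules for English can be sketched as follows. This is a rough illustration that assumes the bracket notation used on the slide; `parse`, `words`, and `functions` are hypothetical helper names, and the verb is treated as a bare word inside VP, so "first NP sister to V" becomes "first NP child of VP".

```python
import re


def parse(text):
    """Parse '[S [NP Mary] [VP left]]' into (label, children); words stay strings."""
    tokens = re.findall(r"\[|\]|[^\s\[\]]+", text)
    pos = 0

    def node():
        nonlocal pos
        pos += 1                      # skip '['
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != "]":
            if tokens[pos] == "[":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1                      # skip ']'
        return (label, children)

    return node()


def words(tree):
    """Return the words under a node, left to right."""
    if isinstance(tree, str):
        return [tree]
    return [w for child in tree[1] for w in words(child)]


def functions(tree):
    """Read SUBJ and OBJ off an S node using the two rules from the slide."""
    label, children = tree
    phrases = [c for c in children if not isinstance(c, str)]
    vp = next((c for c in phrases if c[0] == "VP"), None)
    result = {}
    if label != "S" or vp is None:
        return result
    subj = next((c for c in phrases if c[0] == "NP"), None)   # NP sister to VP
    if subj is not None:
        result["SUBJ"] = " ".join(words(subj))
    obj = next((c for c in vp[1] if not isinstance(c, str) and c[0] == "NP"), None)
    if obj is not None:                                       # first NP inside VP
        result["OBJ"] = " ".join(words(obj))
    return result


print(functions(parse("[S [NP Mary] [VP saw [NP John]]]")))
```

Rules of this kind depend on fixed constituent order, which is why they do not carry over to languages such as German or Japanese.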

8 Deep analysis matters… if you care about the answer Example: A delegation led by Vice President Philips, head of the chemical division, flew to Chicago a week after the incident. Question: Who flew to Chicago? Candidate answers: –division: closest noun –head: next closest –V.P. Philips: next (these three are shallow but wrong) –delegation: furthest away, but subject of flew (deep and right)
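The contrast between the two strategies can be made concrete with a small sketch. The word positions are counted from the example sentence (punctuation ignored); the `FACTS` list is a hand-written stand-in for deep parser output, not the output of any real system, and both function names are hypothetical.

```python
# Positions of the candidate nouns and the verb in the example sentence.
NOUNS = {"delegation": 1, "Philips": 6, "head": 7, "division": 11}
VERB_POSITION = 12  # "flew"


def shallow_answer(nouns, verb_position):
    """Pick the noun closest to the left of the verb."""
    before = {noun: pos for noun, pos in nouns.items() if pos < verb_position}
    return max(before, key=before.get)


# Hand-written relation triples standing in for a deep analysis.
FACTS = [("subj", "flew", "delegation"), ("obl", "flew", "Chicago")]


def deep_answer(facts, verb):
    """Read the subject of the verb off the grammatical relations."""
    return next(arg for rel, head, arg in facts if rel == "subj" and head == verb)


print(shallow_answer(NOUNS, VERB_POSITION))  # nearest noun to the verb
print(deep_answer(FACTS, "flew"))            # grammatical subject
```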

9 Applications of Language Engineering [Figure: applications plotted by Functionality (Low to High) against Domain Coverage (Narrow to Broad). Labels in the figure: Alta Vista, AskJeeves, Google, Post-Search Sifting, Autonomous Knowledge Filtering, Natural Dialogue, Knowledge Fusion, Microsoft Paperclip, Manually-tagged Keyword Search, Document Base Management, Restricted Dialogue, Useful Summary, Good Translation]

10 Traditional Problems Time consuming and expensive to write Not robust –want output for any input Ambiguous Slow Other gating items for applications that need deep grammars

11 Why deep analysis is difficult Languages are hard to describe –Meaning depends on complex properties of words and sequences –Different languages rely on different properties –Errors and disfluencies Languages are hard to compute –Expensive to recognize complex patterns –Sentences are ambiguous –Ambiguities multiply: explosion in time and space

12 How to overcome this Engineer the deep grammars –theoretical vs. practical –what is good enough Integrate shallow techniques into deep grammars Experience based on broad-coverage LFG grammars (ParGram project)

13 Robustness: Sources of Brittleness missing vocabulary –you can't list all the proper names in the world missing constructions –there are many constructions theoretical linguistics rarely considers (e.g. dates, company names) –easy to miss even core constructions ungrammatical input –real world text is not always perfect –sometimes it is really horrendous

14 Real world Input Other weak blue-chip issues included Chevron, which went down 2 to 64 7/8 in Big Board composite trading of 1.3 million shares; Goodyear Tire & Rubber, off 1 1/2 to 46 3/4, and American Express, down 3/4 to 37 1/4. (WSJ, section 13) ``The croaker's done gone from the hook – (WSJ, section 13) (SOLUTION 27000 20) Without tag P-248 the W7F3 fuse is located in the rear of the machine by the charge power supply (PL3 C14 item 15. (Eureka copier repair tip)

15 Missing vocabulary Build vocabulary based on the input of shallow methods –fast –extensive –accurate Finite-state morphologies Part of Speech Taggers

16 Finite State Morphologies Finite-state morphologies falls -> fall +Noun +Pl fall +Verb +Pres +3sg Mary -> Mary +Prop +Giv +Fem +Sg vienne -> venir +SubjP +SG {+P1|+P3} +Verb Build lexical entry on-the-fly from the morphological information –have canonicalized stem form –have significant grammatical information –do not have subcategorization

17 Building lexical entries Lexical entries -unknown N @(COMMON-NOUN %stem). +Noun N-SFX @(PERS 3). +Pl N-NUM @(NUM pl). Rule NOUN -> N N-SFX N-NUM. Templates –COMMON-NOUN :: (^ PRED)='%stem' (^ NTYPE)=common –PERS(3) :: (^ PERS)=3 –NUM(pl) :: (^ NUM)=pl

18 Building lexical entries F-structure for falls [ PRED 'fall' NTYPE common PERS 3 NUM pl ] C-structure for falls [NOUN [N fall] [N-SFX +Noun] [N-NUM +Pl]]
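The step from morphological analysis to f-structure can be imitated with a dictionary of templates. This is only a sketch of the idea for the noun reading: `TAG_TEMPLATES` and `build_noun_entry` are hypothetical names, and real entries are written in the grammar's own lexicon and template notation rather than in Python.

```python
# Features contributed by each morphological tag, mirroring the PERS(3)
# and NUM(pl) templates on the previous slide.
TAG_TEMPLATES = {
    "+Noun": {"PERS": 3},
    "+Pl": {"NUM": "pl"},
}


def build_noun_entry(analysis):
    """Turn 'fall +Noun +Pl' into an f-structure-like dict."""
    stem, *tags = analysis.split()
    # COMMON-NOUN template applied to the canonical stem.
    f_structure = {"PRED": stem, "NTYPE": "common"}
    for tag in tags:
        f_structure.update(TAG_TEMPLATES[tag])  # unknown tags raise KeyError
    return f_structure


print(build_noun_entry("fall +Noun +Pl"))
```

Note that nothing here supplies subcategorization information, which matches the limitation stated for entries built from morphology.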

19 Guessing words Use FST guesser if the morphology doesn't know the word –Capitalized words can be proper nouns »Saakashvili -> Saakashvili +Noun +Proper +Guessed –ed words can be past tense verbs or adjectives »fumped -> fump +Verb +Past +Guessed fumped +Adj +Deverbal +Guessed Languages with more morphology allow for better guessers

20 Using the lexicons Rank the lexical lookup 1.overt entry in lexicon 2.entry built from information from morphology 3.entry built from information from guesser Use the most reliable information Fall back only as necessary
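The ranked lookup with a guesser as last resort might look like the sketch below. The two small tables and the suffix rules are toy assumptions standing in for a real lexicon, a finite-state morphology, and an FST guesser; the `dog` entry and its tags are invented for illustration.

```python
LEXICON = {"dog": ["dog +Noun +Sg"]}                       # overt entries (toy)
MORPHOLOGY = {"falls": ["fall +Noun +Pl",                  # morphology output (toy)
                        "fall +Verb +Pres +3sg"]}


def guess(word):
    """Toy guesser: capitalized words and -ed words, as on the slide."""
    analyses = []
    if word[:1].isupper():
        analyses.append(f"{word} +Noun +Proper +Guessed")
    if word.endswith("ed"):
        analyses.append(f"{word[:-2]} +Verb +Past +Guessed")
        analyses.append(f"{word} +Adj +Deverbal +Guessed")
    return analyses


def lookup(word):
    """Return (source, analyses), trying the most reliable source first."""
    for source, table in (("lexicon", LEXICON), ("morphology", MORPHOLOGY)):
        if word in table:
            return source, table[word]
    return "guesser", guess(word)


for w in ("dog", "falls", "fumped", "Saakashvili"):
    print(w, lookup(w))
```

Because the guesser is consulted only when both earlier sources have no entry, its less reliable analyses never compete with known ones.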

21 Missing constructions Even large hand-written grammars are not complete –new constructions, especially with new corpora –unusual constructions Generally longer sentences fail –one error can destroy the parse Build up as much as you can; stitch together the pieces

22 Grammar engineering approach First try to get a complete parse If fail, build up chunks that get complete parses Have a fall back for things without even chunk parses Link these chunks and fall backs together in a single structure

23 Fragment Chunks: Sample output the the dog appears. Split into: –"token" the –sentence "the dog appears" –ignore the period
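One simple way to get this behaviour is to cover the input left to right with the longest span that parses, falling back to single tokens. The greedy strategy is an assumption made for this sketch and is not claimed to be what the actual parser does; `toy_parse` is a hypothetical stand-in that accepts only "the dog appears".

```python
def toy_parse(tokens):
    """Stand-in for the full grammar: accepts only 'the dog appears'."""
    if tokens == ["the", "dog", "appears"]:
        return ("S", tuple(tokens))
    return None


def fragment_parse(tokens, parse):
    """Try a complete parse; otherwise stitch together chunks and bare tokens."""
    whole = parse(tokens)
    if whole is not None:
        return [whole]
    fragments, i = [], 0
    while i < len(tokens):
        for j in range(len(tokens), i, -1):      # longest candidate span first
            chunk = parse(tokens[i:j])
            if chunk is not None:
                fragments.append(chunk)
                i = j
                break
        else:
            # Nothing starting at i parses: fall back to a bare token.
            fragments.append(("TOKEN", tokens[i]))
            i += 1
    return fragments


print(fragment_parse("the the dog appears".split(), toy_parse))
```

On the slide's input this yields a bare token for the first "the" followed by a sentence chunk.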

24 C-structure

25 F-structure

26 Ungrammatical input Real world text contains ungrammatical input –typos –run-ons –cut and paste errors Deep grammars tend to only cover grammatical input Two strategies –robustness techniques: guesser/fragments –dispreferred rules for ungrammatical structures

27 Rules for ungrammatical structures Common errors can be coded in the rules –want to know that error occurred (e.g., feature in f-structure) Disprefer parses of ungrammatical structure –tools for grammar writer to rank rules –two+ pass system 1.standard rules 2.rules for known ungrammatical constructions 3.default fall back rules
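The multi-pass idea can be sketched as an ordered list of rule sets, where a later pass runs only if every earlier pass produced nothing. The three rule functions are toy assumptions operating on a (subject number, verb number) pair; the `ERROR` feature imitates recording that an error occurred.

```python
def standard_rules(clause):
    """Pass 1: accept only clauses with matching subject-verb agreement."""
    subj_num, verb_num = clause
    if subj_num == verb_num:
        return [{"SUBJ_NUM": subj_num}]
    return []


def agreement_error_rules(clause):
    """Pass 2: accept mismatches, but record the error in the analysis."""
    subj_num, _ = clause
    return [{"SUBJ_NUM": subj_num, "ERROR": "BadVAgr"}]


def fallback_rules(clause):
    """Pass 3: default fall back that always returns something."""
    return [{"FRAGMENTS": list(clause)}]


PASSES = [("standard", standard_rules),
          ("ungrammatical", agreement_error_rules),
          ("fallback", fallback_rules)]


def parse_in_passes(clause, passes=PASSES):
    """Return the first pass that yields analyses, with those analyses."""
    for name, rules in passes:
        analyses = rules(clause)
        if analyses:
            return name, analyses
    return None, []


print(parse_in_passes(("sg", "sg")))
print(parse_in_passes(("pl", "sg")))
```

Ordering the passes this way means grammatical input is never analyzed with the error rules, so they add no ambiguity to the normal case.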

28 Sample ungrammatical structures Mismatched subject-verb agreement Verb3Sg = { SUBJ PERS = 3 SUBJ NUM = sg |BadVAgr } Missing copula VPcop ==> { Vcop: ^=! |e: (^ PRED )='NullBe ' MissingCopularVerb} { NP: (^ XCOMP)=! |AP: (^ XCOMP)=! | …}

29 Robustness summary Integrate shallow methods –for lexical items –morphologies –guessers Fall back techniques –for missing constructions –fragment grammar –dispreferred rules

30 Ambiguity Deep grammars are massively ambiguous Example: 700 sentences from section 23 of WSJ –average # of words: 19.6 –average # of optimal parses: 684 »for 1-10 word sentences: 3.8 »for 11-20 word sentences: 25.2 »for 50-60 word sentences: 12,888

31 Managing Ambiguity Use packing to parse and manipulate the ambiguities efficiently (more tomorrow) Trim early with shallow markup –fewer parses to choose from –faster parse time Choose most probable parse for applications that need a single input

32 Shallow markup Part of speech marking as filter I saw her duck/VB. –accuracy of tagger (v. good for English) –can use partial tagging (verbs and nouns) Named entities – Goldman, Sachs & Co. bought IBM. –good for proper names and times –hard to parse internal structure Fall back technique if fail –slows parsing –accuracy vs. speed
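Using a part of speech tag as a filter, with a fall back when the tag rules out everything, can be sketched as below. This simplifies the slide's approach to the lexical level (the fall back there is to reparse without the markup); the analysis dictionaries and `filter_by_tag` are assumptions for illustration.

```python
# Two lexical analyses for "duck", as in "I saw her duck".
DUCK = [{"stem": "duck", "pos": "VB"}, {"stem": "duck", "pos": "NN"}]


def filter_by_tag(analyses, tag=None):
    """Keep analyses matching the tagger's choice; fall back if none match."""
    if tag is None:                 # partial tagging: this word was left untagged
        return analyses
    kept = [a for a in analyses if a["pos"] == tag]
    return kept if kept else analyses   # tagger disagrees with lexicon: ignore tag


print(filter_by_tag(DUCK, "VB"))
print(filter_by_tag(DUCK, "JJ"))
```

The fall back keeps a tagger error from blocking the parse, at the cost of restoring the ambiguity the tag was meant to remove.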

33 Example shallow markup: Named entities Allow tokenizer to accept marked up input: parse { Mr. Thejskt Thejs arrived.} tokenized string: Mr. Thejskt Thejs TB +NEperson Mr( TB ). TB Thejskt TB Thejs TB arrived TB. TB Add lexical entries and rules for NE tags

34 Resulting C-structure

35 Resulting F-structure

36 Results for shallow markup
Slash-separated values are Full/All.
             % Full parses   Optimal sol'ns   Best F-sc   Time %
Unmarked     76              482/1753         82/79       65/100
Named ent    78              263/1477         86/84       60/91
POS tag      62              248/1916         76/72       40/48
Lab brk      65              158/774          85/79       19/31
Kaplan and King 2003

37 Choosing the most probable parse Applications may want one input –or at least just a handful Use stochastic methods to choose –efficient (XLE English grammar: 5% of parse time) Need training data –partially labelled data ok [NP-SBJ They] see [NP-OBJ the girl with the telescope]
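The slide does not say which stochastic model is used, so the sketch below assumes a log-linear model, one common choice for parse selection. Feature names and weights are invented for illustration; in practice the weights would be estimated from the (possibly partially labelled) training data.

```python
import math


def most_probable(parses, weights):
    """Score each parse by a weighted feature sum and normalize with softmax."""
    scores = [sum(weights.get(name, 0.0) * value
                  for name, value in parse["features"].items())
              for parse in parses]
    normalizer = sum(math.exp(s) for s in scores)
    probabilities = [math.exp(s) / normalizer for s in scores]
    best = max(range(len(parses)), key=probabilities.__getitem__)
    return parses[best], probabilities[best]


# Two readings of "They see the girl with the telescope" (toy features).
PARSES = [
    {"reading": "PP attaches to noun", "features": {"pp_on_noun": 1}},
    {"reading": "PP attaches to verb", "features": {"pp_on_verb": 1}},
]
WEIGHTS = {"pp_on_noun": 1.0, "pp_on_verb": 0.2}   # assumed, not learned

best, probability = most_probable(PARSES, WEIGHTS)
print(best["reading"], round(probability, 2))
```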

38 Run-time performance Many deep grammars are slow Techniques depend on the system –LFG: exploit the context-free backbone; ambiguity packing techniques Speed vs. accuracy trade off –remove/disprefer peripheral rules –remove fall backs for shallow markup

39 Development expense Grammar porting Starter grammar Induced grammar bootstrapping How cheap are shallow grammars? –training data can be expensive to produce

40 Grammar porting Use an existing grammar as the base for a new language Languages must be typologically similar –Japanese-Korean –Balkan Lexical porting via bi-lingual dictionaries Main work in testing and evaluation

41 Starter grammar Provide basic rules and templates –including for robustness techniques Grammar writer: –chooses among them –refines them Grammar Matrix for HPSG

42 Grammar induction Induce a core grammar from a treebank –compile rule generalizations –threshold rare rules –hand augment with features and fallback techniques Requires –induction program –existing resources (treebank)
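The first two induction steps (collect rules from the treebank, then threshold rare ones) can be sketched as follows. The tuple-based tree representation, the three-tree toy treebank, and the `min_count` cut-off are assumptions; the hand augmentation with features and fallback techniques is not modelled.

```python
from collections import Counter


def count_rules(tree, counts):
    """Count phrase structure rules in a (label, children) tree; words are strings."""
    label, children = tree
    phrases = [c for c in children if not isinstance(c, str)]
    if phrases:                                   # skip purely lexical nodes
        counts[(label, tuple(c[0] for c in phrases))] += 1
    for child in phrases:
        count_rules(child, counts)


def induce(treebank, min_count=2):
    """Return the rules seen at least min_count times in the treebank."""
    counts = Counter()
    for tree in treebank:
        count_rules(tree, counts)
    return {rule for rule, n in counts.items() if n >= min_count}


TREEBANK = [
    ("S", [("NP", ["Mary"]), ("VP", [("V", ["left"])])]),
    ("S", [("NP", ["Mary"]), ("VP", [("V", ["saw"]), ("NP", ["John"])])]),
    ("S", [("NP", ["John"]), ("VP", [("V", ["left"])])]),
]

print(sorted(induce(TREEBANK)))
```

With the threshold at 2, the transitive rule seen only once is dropped, which shows how thresholding trades coverage for a smaller, cleaner core grammar.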

43 Conclusions Grammar engineering makes deep grammars feasible –robustness techniques –integration of shallow methods Many current applications can use shallow grammars Fast, accurate, broad-coverage deep grammars enable new applications
