
1 Statistical NLP Spring 2010 Lecture 14: PCFGs Dan Klein – UC Berkeley

2 Treebank PCFGs
  - Use PCFGs for broad coverage parsing
  - Can take a grammar right off the trees (doesn’t work well):
      ROOT -> S          1
      S -> NP VP .       1
      NP -> PRP          1
      VP -> VBD ADJP     1
      ...
    Model                     F1
    Baseline [Charniak 96]    72.0
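
Not from the slides, but as a concrete illustration of "taking a grammar right off the trees": the sketch below counts local rewrites in toy treebank trees and turns the counts into relative-frequency rule probabilities. The tuple encoding of trees and the tiny example tree are assumptions made for the example.

```python
from collections import defaultdict

def count_rules(tree, counts):
    """Count local rewrites (parent -> children) in one tree.
    A tree node is (label, [children]); a leaf word is a plain string."""
    label, children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    counts[label][rhs] += 1
    for c in children:
        if not isinstance(c, str):
            count_rules(c, counts)

def treebank_pcfg(trees):
    """Relative-frequency estimates: P(lhs -> rhs) = count(lhs -> rhs) / count(lhs)."""
    counts = defaultdict(lambda: defaultdict(int))
    for t in trees:
        count_rules(t, counts)
    return {(lhs, rhs): c / sum(rules.values())
            for lhs, rules in counts.items()
            for rhs, c in rules.items()}

# Toy tree: (ROOT (S (NP (PRP She)) (VP (VBD was) (ADJP (JJ right))) (. .)))
toy = ("ROOT", [("S", [("NP", [("PRP", ["She"])]),
                       ("VP", [("VBD", ["was"]), ("ADJP", [("JJ", ["right"])])]),
                       (".", ["."])])])
grammar = treebank_pcfg([toy])
print(grammar[("S", ("NP", "VP", "."))])  # 1.0
```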

3 Conditional Independence?  Not every NP expansion can fill every NP slot  A grammar with symbols like “NP” won’t be context-free  Statistically, conditional independence too strong

4 Non-Independence
  - Independence assumptions are often too strong.
  - Example: the expansion of an NP is highly dependent on the parent of the NP (i.e., subjects vs. objects).
  - Also: the subject and object expansions are correlated!
  (Figure: expansion histograms for all NPs, NPs under S, and NPs under VP)

5 Grammar Refinement  Example: PP attachment

6 Grammar Refinement  Structure Annotation [Johnson ’98, Klein&Manning ’03]  Lexicalization [Collins ’99, Charniak ’00]  Latent Variables [Matsuzaki et al. 05, Petrov et al. ’06]

7 The Game of Designing a Grammar  Annotation refines base treebank symbols to improve statistical fit of the grammar  Structural annotation

8 Typical Experimental Setup
  - Corpus: Penn Treebank, WSJ
  - Accuracy – F1: harmonic mean of per-node labeled precision and recall.
  - Here: also size – number of symbols in grammar.
    - Passive / complete symbols: NP, NP^S
    - Active / incomplete symbols: NP -> NP CC •
  - Training: sections 02-21
  - Development: section 22 (here, first 20 files)
  - Test: section 23
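
For reference, the F1 the slides report is just the harmonic mean of labeled precision and recall over (label, span) brackets. A minimal sketch, with hypothetical bracket counts:

```python
def parseval_f1(matched, guessed, gold):
    """Labeled bracket precision, recall, and their harmonic mean (F1)."""
    precision = matched / guessed
    recall = matched / gold
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts: 85 brackets correct out of 95 proposed and 100 in the gold tree.
print(round(parseval_f1(85, 95, 100), 3))  # 0.872
```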

9 Vertical Markovization
  Vertical Markov order: rewrites depend on past k ancestor nodes (cf. parent annotation).
  (Figure: example trees for Order 1 and Order 2)
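
A minimal sketch of order-2 vertical markovization (parent annotation) on the same toy tree encoding used earlier; the NP^S label format follows the slides, everything else is illustrative:

```python
def parent_annotate(tree, parent=None):
    """Append the parent's symbol to each nonterminal label, e.g. NP under S becomes NP^S."""
    label, children = tree
    new_label = label if parent is None else f"{label}^{parent}"
    new_children = [c if isinstance(c, str) else parent_annotate(c, label)
                    for c in children]
    return (new_label, new_children)

toy = ("S", [("NP", [("PRP", ["She"])]), ("VP", [("VBD", ["slept"])])])
print(parent_annotate(toy))
# ('S', [('NP^S', [('PRP^NP', ['She'])]), ('VP^S', [('VBD^VP', ['slept'])])])
```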

10 Horizontal Markovization
  Horizontal Markov order: rewrites depend on the past sibling nodes already generated within a rule.
  (Figure: example grammars for horizontal Markov Order 1 and Order ∞)
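
Horizontal markovization shows up when long right-hand sides are binarized: the intermediate symbols remember only the last h sisters already generated. A sketch under assumed naming conventions (the @-prefixed intermediate symbols are hypothetical, not the course's notation):

```python
def binarize_rhs(lhs, rhs, h=1):
    """Left-to-right binarization whose intermediate symbols keep only the
    last h already-generated sisters (horizontal Markov order h)."""
    rules, prev = [], lhs
    for i in range(len(rhs) - 2):
        context = rhs[max(0, i + 1 - h): i + 1]       # last h generated sisters
        new_sym = f"@{lhs}->..{'_'.join(context)}"    # hypothetical intermediate symbol
        rules.append((prev, (rhs[i], new_sym)))
        prev = new_sym
    rules.append((prev, (rhs[-2], rhs[-1])))
    return rules

for rule in binarize_rhs("VP", ["VBD", "NP", "NP", "PP"], h=1):
    print(rule)
# ('VP', ('VBD', '@VP->..VBD'))
# ('@VP->..VBD', ('NP', '@VP->..NP'))
# ('@VP->..NP', ('NP', 'PP'))
```

With a lower h, intermediate symbols from different rules coincide, which is exactly how the markovized grammar generalizes to right-hand sides never seen in the treebank.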

11 Unary Splits
  Problem: unary rewrites used to transmute categories so a high-probability rule can be used.
  Solution: mark unary rewrite sites with -U.
    Annotation   F1     Size
    Base         77.8   7.5K
    UNARY        78.3   8.0K

12 Tag Splits
  Problem: Treebank tags are too coarse.
  Example: Sentential, PP, and other prepositions are all marked IN.
  Partial solution: subdivide the IN tag.
    Annotation   F1     Size
    Previous     78.3   8.0K
    SPLIT-IN     80.3   8.1K

13 Other Tag Splits (F1 / grammar size after each split)
  - UNARY-DT: mark demonstratives as DT^U ("the X" vs. "those") (80.4 / 8.1K)
  - UNARY-RB: mark phrasal adverbs as RB^U ("quickly" vs. "very") (80.5 / 8.1K)
  - TAG-PA: mark tags with non-canonical parents ("not" is an RB^VP) (81.2 / 8.5K)
  - SPLIT-AUX: mark auxiliary verbs with -AUX [cf. Charniak 97] (81.6 / 9.0K)
  - SPLIT-CC: separate "but" and "&" from other conjunctions (81.7 / 9.1K)
  - SPLIT-%: "%" gets its own tag (81.8 / 9.3K)

14 A Fully Annotated (Unlex) Tree

15 Some Test Set Results
  - Beats "first generation" lexicalized parsers.
  - Lots of room to improve – more complex models next.
    Parser          LP     LR     F1     CB     0 CB
    Magerman 95     84.9   84.6   84.7   1.26   56.6
    Collins 96      86.3   85.8   86.0   1.14   59.9
    Unlexicalized   86.9   85.7   86.3   1.10   60.3
    Charniak 97     87.4   87.5   87.4   1.00   62.1
    Collins 99      88.7   88.6   88.6   0.90   67.1

16 The Game of Designing a Grammar
  Annotation refines base treebank symbols to improve statistical fit of the grammar:
  - Structural annotation [Johnson ’98, Klein and Manning ’03]
  - Head lexicalization [Collins ’99, Charniak ’00]

17 Problems with PCFGs
  - If we do no annotation, these trees differ only in one rule:
      VP -> VP PP
      NP -> NP PP
  - Parse will go one way or the other, regardless of words
  - We addressed this in one way with unlexicalized grammars (how?)
  - Lexicalization allows us to be sensitive to specific words

18 Problems with PCFGs  What’s different between basic PCFG scores here?  What (lexical) correlations need to be scored?

19 Lexicalized Trees
  - Add "headwords" to each phrasal node
    - Syntactic vs. semantic heads
    - Headship not in (most) treebanks
    - Usually use head rules, e.g.:
      NP:
        - Take leftmost NP
        - Take rightmost N*
        - Take rightmost JJ
        - Take right child
      VP:
        - Take leftmost VB*
        - Take leftmost VP
        - Take left child
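
A sketch of how head rules like the ones above get applied: scan the children in the rule's preferred direction for the first matching label, falling back to an edge child. The rule table here is a toy fragment for illustration, not the full head-rule table used in class or in Collins' parser.

```python
# Toy head-rule fragment: parent -> (scan direction, label priorities).
HEAD_RULES = {
    "NP": ("right", ["NN", "NNS", "NNP", "NX", "JJ", "NP"]),
    "VP": ("left",  ["VBD", "VBZ", "VBP", "VB", "VBN", "VBG", "VP"]),
}

def head_child(parent, children):
    """Index of the head child under the toy rules; defaults to the edge child."""
    direction, preferred = HEAD_RULES.get(parent, ("left", []))
    order = list(range(len(children)))
    if direction == "right":
        order.reverse()
    for label in preferred:
        for i in order:
            if children[i] == label:
                return i
    return order[0]

print(head_child("VP", ["VBD", "NP", "PP"]))  # 0: VBD heads the VP
print(head_child("NP", ["DT", "JJ", "NN"]))   # 2: NN heads the NP
```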

20 Lexicalized PCFGs?
  - Problem: we now have to estimate probabilities of whole lexicalized rules, e.g. P(VP[saw] -> VBD[saw] NP[her] NP[today] PP[on])
  - Never going to get these atomically off of a treebank
  - Solution: break up derivation into smaller steps

21 Lexical Derivation Steps
  A derivation of a local tree [Collins 99]:
  1. Choose a head tag and word
  2. Choose a complement bag
  3. Generate children (incl. adjuncts)
  4. Recursively derive children

22 Lexicalized CKY
  (Diagram: X[h] over [i,j] is built from Y[h] over [i,k] and Z[h'] over [k,j]; e.g. combining (VP -> VBD •)[saw] with NP[her] to form (VP -> VBD ... NP •)[saw])

  bestScore(X,i,j,h)
    if (j = i+1)
      return tagScore(X,s[i])
    else
      return the max, over k, h', and X->YZ, of
        score(X[h] -> Y[h] Z[h']) * bestScore(Y,i,k,h)  * bestScore(Z,k,j,h')
        score(X[h] -> Y[h'] Z[h]) * bestScore(Y,i,k,h') * bestScore(Z,k,j,h)

23 Pruning with Beams
  - The Collins parser prunes with per-cell beams [Collins 99]
    - Essentially, run the O(n^5) CKY
    - Remember only a few hypotheses for each span.
    - If we keep K hypotheses at each span, then we do at most O(nK^2) work per span (why?)
    - Keeps things more or less cubic
  - Also: certain spans are forbidden entirely on the basis of punctuation (crucial for speed)
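
A minimal sketch of per-cell beam pruning as described on this slide: once a span's cell has been filled, keep only the K best-scoring entries. The cell representation (a dict from (symbol, head position) to log score) is an assumed simplification.

```python
import heapq

def prune_cell(cell, K=10):
    """Keep only the K best-scoring items in one chart cell.
    `cell` maps items such as (symbol, head_index) to log scores."""
    if len(cell) <= K:
        return cell
    return dict(heapq.nlargest(K, cell.items(), key=lambda item: item[1]))

cell = {("NP", 1): -2.0, ("VP", 2): -3.1, ("S", 2): -7.4, ("ADJP", 3): -9.9}
print(prune_cell(cell, K=2))  # {('NP', 1): -2.0, ('VP', 2): -3.1}
```

Each larger span then combines at most K surviving items from the left part with K from the right at each of O(n) split points, which is the O(nK^2) per-span bound the slide mentions.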

24 Pruning with a PCFG
  - The Charniak parser prunes using a two-pass approach [Charniak 97+]
    - First, parse with the base grammar
    - For each X:[i,j] calculate P(X|i,j,s)
      - This isn't trivial, and there are clever speed ups
    - Second, do the full O(n^5) CKY
      - Skip any X:[i,j] which had low (say, < 0.0001) posterior
    - Avoids almost all work in the second phase!
  - Charniak et al 06: can use more passes
  - Petrov et al 07: can use many more passes
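
A sketch of the pruning step between the two passes, assuming inside and outside scores from the first (coarse) parse have already been computed and stored by (symbol, start, end); the numbers in the example are made up.

```python
def allowed_items(inside, outside, sentence_prob, threshold=1e-4):
    """Keep the (X, i, j) whose coarse posterior
    P(X over [i,j] | sentence) = outside * inside / P(sentence)
    clears the threshold; the expensive second pass skips everything else."""
    keep = set()
    for key, inside_score in inside.items():
        posterior = outside.get(key, 0.0) * inside_score / sentence_prob
        if posterior >= threshold:
            keep.add(key)
    return keep

inside = {("NP", 0, 2): 1e-4, ("VP", 2, 5): 2e-5, ("ADJP", 1, 4): 1e-9}
outside = {("NP", 0, 2): 1e-3, ("VP", 2, 5): 5e-3, ("ADJP", 1, 4): 1e-4}
print(allowed_items(inside, outside, sentence_prob=1e-7))
# {('NP', 0, 2), ('VP', 2, 5)} -- the ADJP item falls below threshold and is skipped
```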

25 Pruning with A*
  - You can also speed up the search without sacrificing optimality
  - For agenda-based parsers:
    - Can select which items to process first
    - Can do with any "figure of merit" [Charniak 98]
    - If your figure-of-merit is a valid A* heuristic, no loss of optimality [Klein and Manning 03]

26 Projection-Based A*
  (Figure: the sentence "Factory payrolls fell in Sept." under three projections: the syntactic projection (S, VP, NP, PP), the lexical projection (payrolls, fell, in), and the combined lexicalized parse (S:fell, VP:fell, NP:payrolls, PP:in))

27 A* Speedup
  Total time dominated by calculation of A* tables in each projection… O(n^3)

28 Results
  Some results:
  - Collins 99 – 88.6 F1 (generative lexical)
  - Charniak and Johnson 05 – 89.7 / 91.3 F1 (generative lexical / reranked)
  - Petrov et al 06 – 90.7 F1 (generative unlexical)
  - McClosky et al 06 – 92.1 F1 (gen + rerank + self-train)
  However:
  - Bilexical counts rarely make a difference (why?)
  - Gildea 01 – Removing bilexical counts costs < 0.5 F1

29 The Game of Designing a Grammar
  Annotation refines base treebank symbols to improve statistical fit of the grammar:
  - Structural annotation
  - Head lexicalization
  - Automatic clustering?

30 Latent Variable Grammars
  (Figure: a sentence, its parse tree, the grammar parameters, and the many derivations over latent subcategories that correspond to one observed parse tree)

31 Learning Latent Annotations
  EM algorithm:
  - Brackets are known
  - Base categories are known
  - Only induce subcategories
  Just like Forward-Backward for HMMs.
  (Figure: a tree over "He was right." with latent node labels X1–X7 and forward/backward passes)

32 Refinement of the DT tag
  (Figure: the DT tag is split into subcategories DT-1 through DT-4)

33 Hierarchical refinement

34 Hierarchical Estimation Results
    Model                   F1
    Flat Training           87.3
    Hierarchical Training   88.4

35 Refinement of the "," tag
  Splitting all categories equally is wasteful:

36 Adaptive Splitting  Want to split complex categories more  Idea: split everything, roll back splits which were least useful

37 Adaptive Splitting Results
    Model              F1
    Previous           88.4
    With 50% Merging   89.5

38 Number of Phrasal Subcategories

39 Number of Lexical Subcategories

40 Learned Splits
  Proper Nouns (NNP):
    NNP-14: Oct.  Nov.  Sept.
    NNP-12: John  Robert  James
    NNP-2:  J.  E.  L.
    NNP-1:  Bush  Noriega  Peters
    NNP-15: New  San  Wall
    NNP-3:  York  Francisco  Street
  Personal pronouns (PRP):
    PRP-0: It  He  I
    PRP-1: it  he  they
    PRP-2: it  them  him

41 Learned Splits
  Relative adverbs (RBR):
    RBR-0: further  lower  higher
    RBR-1: more  less  More
    RBR-2: earlier  Earlier  later
  Cardinal Numbers (CD):
    CD-7:  one  two  Three
    CD-4:  1989  1990  1988
    CD-11: million  billion  trillion
    CD-0:  1  50  100
    CD-3:  1  30  31
    CD-9:  78  58  34

42 Coarse-to-Fine Inference
  Example: PP attachment
  (Figure: a PP-attachment ambiguity with many candidate attachment sites marked "?")

43 Prune?
  For each chart item X[i,j], compute its posterior probability and prune the item if the posterior is below a threshold.
  (Figure: a coarse chart row with symbols … QP NP VP … and the corresponding refined row; e.g., consider the span 5 to 12)
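
In standard inside-outside notation, the posterior being thresholded can be written as follows (a sketch of the usual formula; the slide shows it only pictorially):

```latex
P\bigl(X \text{ spans } [i,j] \mid s\bigr)
  \;=\; \frac{\text{outside}(X,i,j)\,\cdot\,\text{inside}(X,i,j)}{P(s)}
```

where P(s) is the total probability of the sentence under the coarse grammar; the item X[i,j] is pruned when this quantity falls below the threshold.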

44 Bracket Posteriors

45 Hierarchical Pruning
    coarse:         … QP NP VP …
    split in two:   … QP1 QP2 NP1 NP2 VP1 VP2 …
    split in four:  … QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 …
    split in eight: …

46 Final Results (Accuracy)
                                              F1 (≤ 40 words)   F1 (all)
  ENG   Charniak & Johnson ’05 (generative)   90.1              89.6
        Split / Merge                         90.6              90.1
  GER   Dubey ’05                             76.3              -
        Split / Merge                         80.8              80.1
  CHN   Chiang et al. ’02                     80.0              76.6
        Split / Merge                         86.3              83.4
  Still higher numbers from reranking / self-training methods.

47 Bracket Posteriors (after G0)

48 Bracket Posteriors (after G1)

49 Bracket Posteriors (Final)

50 Bracket Posteriors (Best Tree)

51 (Figure: the grammar hierarchy X-Bar = G0, G1, …, G6 = G is used both for learning and for staged parsing; share of parsing time per stage: 60%, 12%, 7%, 6%, 5%, 4%)

52 (Figure: parsing times of 1621 min, 111 min, 35 min, and 15 min, with no search error)

53 Breaking Up the Symbols
  We can relax independence assumptions by encoding dependencies into the PCFG symbols:
  (Examples: parent annotation [Johnson 98]; marking possessive NPs)
  What are the most useful "features" to encode?

54 Non-Independence II
  Who cares?
  - NB, HMMs, all make false assumptions!
  - For generation, consequences would be obvious.
  - For parsing, does it impact accuracy?
  Symptoms of overly strong assumptions:
  - Rewrites get used where they don't belong.
  - Rewrites get used too often or too rarely.
  (Example figure: in the PTB, this construction is for possessives)

55 Lexicalization
  - Lexical heads important for certain classes of ambiguities (e.g., PP attachment)
  - Lexicalizing grammar creates a much larger grammar (cf. next week)
    - Sophisticated smoothing needed
    - Smarter parsing algorithms
    - More data needed
  - How necessary is lexicalization?
    - Bilexical vs. monolexical selection
    - Closed vs. open class lexicalization

56 Lexical Derivation Steps
  Derivation of a local tree [simplified Charniak 97]
  (Figure: the local tree VP[saw] -> VBD[saw] NP[her] NP[today] PP[on] is generated one child at a time, conditioning each step on partial-rule states such as (VP -> VBD •)[saw] and (VP -> VBD ... NP •)[saw])
  Still have to smooth with mono- and non-lexical backoffs.

57 Manual Annotation
  Manually split categories:
  - NP: subject vs object
  - DT: determiners vs demonstratives
  - IN: sentential vs prepositional
  Advantages:
  - Fairly compact grammar
  - Linguistic motivations
  Disadvantages:
  - Performance leveled out
  - Manually annotated
    Model                    F1
    Naïve Treebank Grammar   72.6
    Klein & Manning ’03      86.3

58 Automatic Annotation Induction
  Advantages:
  - Automatically learned: label all nodes with latent variables; same number k of subcategories for all categories.
  Disadvantages:
  - Grammar gets too large
  - Most categories are oversplit while others are undersplit.
    Model                  F1
    Klein & Manning ’03    86.3
    Matsuzaki et al. ’05   86.7
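
A sketch of the "same number k of subcategories for everything" starting point: each base rule is split into all combinations of latent subcategories, with a little random jitter to break symmetry before EM. The encoding of binary rules as (A, B, C) triples and the constants are assumptions for the example.

```python
import itertools, random

def split_all(rule_probs, k=2, noise=0.01, seed=0):
    """Split every nonterminal into k latent subcategories: a binary rule
    (A, B, C) with probability p becomes k**3 refined rules, each starting
    near p / k**2 so each refined parent's rules still sum to roughly 1."""
    rng = random.Random(seed)
    refined = {}
    for (A, B, C), p in rule_probs.items():
        for a, b, c in itertools.product(range(k), repeat=3):
            jitter = 1.0 + noise * (rng.random() - 0.5)
            refined[(f"{A}-{a}", f"{B}-{b}", f"{C}-{c}")] = p / (k * k) * jitter
    return refined

base = {("S", "NP", "VP"): 1.0}
print(len(split_all(base, k=2)))  # 8 refined variants of the one base rule
```

EM then reestimates these refined rule probabilities from the treebank, with brackets and base categories held fixed, as on the "Learning Latent Annotations" slide.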

59 Adaptive Splitting
  Evaluate the loss in likelihood from removing each split:
    loss = (data likelihood with split reversed) / (data likelihood with split)
  No loss in accuracy when 50% of the splits are reversed.

60 Final Results (Accuracy)
                                              F1 (≤ 40 words)   F1 (all)
  ENG   Charniak & Johnson ’05 (generative)   90.1              89.6
        Petrov and Klein 07                   90.6              90.1
  GER   Dubey ’05                             76.3              -
        Petrov and Klein 07                   80.8              80.1
  CHN   Chiang et al. ’02                     80.0              76.6
        Petrov and Klein 07                   86.3              83.4

61 Final Results
    Parser                   F1 (≤ 40 words)   F1 (all words)
    Klein & Manning ’03      86.3              85.7
    Matsuzaki et al. ’05     86.7              86.1
    Collins ’99              88.6              88.2
    Charniak & Johnson ’05   90.1              89.6
    Petrov et al. ’06        90.2              89.7

62 Vertical and Horizontal
  Examples:
  - Raw treebank:  v=1, h=∞
  - Johnson 98:    v=2, h=∞
  - Collins 99:    v=2, h=2
  - Best F1:       v=3, h=2v
    Model          F1     Size
    Base: v=h=2v   77.8   7.5K

63 Problems with PCFGs  Another example of PCFG indifference  Left structure far more common  How to model this?  Really structural: “chicken with potatoes with salad”  Lexical parsers model this effect, but not by virtue of being lexical

64 Naïve Lexicalized Parsing
  - Can, in principle, use CKY on lexicalized PCFGs
    - O(Rn^3) time and O(Sn^2) memory
    - But R = rV^2 and S = sV
    - Result is completely impractical (why?)
    - Memory: 10K rules * 50K words * (40 words)^2 * 8 bytes ≈ 6TB
  - Can modify CKY to exploit lexical sparsity
    - Lexicalized symbols are a base grammar symbol and a pointer into the input sentence, not any arbitrary word
    - Result: O(rn^5) time, O(sn^3) memory
    - Memory: 10K rules * (40 words)^3 * 8 bytes ≈ 5GB
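
A quick check of the back-of-the-envelope memory numbers above, using the same assumed constants (10K base rules, a 50K-word vocabulary, 40-word sentences, 8 bytes per chart entry):

```python
rules, vocab, n, bytes_per = 10_000, 50_000, 40, 8

naive  = rules * vocab * n**2 * bytes_per   # heads treated as arbitrary vocabulary words
sparse = rules * n**3 * bytes_per           # heads restricted to positions in the sentence
print(f"naive:  {naive / 1e12:.1f} TB")     # ~6.4 TB
print(f"sparse: {sparse / 1e9:.1f} GB")     # ~5.1 GB
```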

65 Quartic Parsing
  - Turns out, you can do (a little) better [Eisner 99]
  - Gives an O(n^4) algorithm
  - Still prohibitive in practice if not pruned
  (Diagrams: the standard combination of Y[h] and Z[h'] into X[h] over i, h, k, h', j, vs. the quartic combination of Y[h] with a headless Z into X[h] over i, h, k, j)

