Presentation on theme: "Statistical NLP Winter 2009"— Presentation transcript:
1 Statistical NLP Winter 2009 Lecture 11: Parsing IIRoger LevyThanks to Jason Eisner & Dan Klein for slides
2 PCFGs as language models time flies like an arrow 5NP 3Vst 3NP 10S 8NP 24S 221NP 4VP 4NP 18S 21VP 182P 2V 5PP 12VP 163Det 14N 8What does the goal weight (neg. log-prob) represent?It is the probability of the most probable tree whose yield is the sentenceSuppose we want to do language modelingWe are interested in the probability of all trees“Put the file in the folder” vs. “Put the file and the folder”
3 2-22 2-27 1 S NP VP 6 S Vst NP 2-27 2 S S PP 1 VP V NP 2-22 Could just add up the parse probabilitiestime flies like an arrow 5NP 3Vst 3NP 10S 8S 13NP 24S 22S 271NP 4VP 4NP 18S 21VP 182P 2V 5PP 12VP 163Det 14N 8oops, back to finding exponentially many parses2-222-271 S NP VP6 S Vst NP2 S S PP1 VP V NP2 VP VP PP1 NP Det N2 NP NP PP3 NP NP NP0 PP P NP2-272-222-27
4 1 S NP VP 6 S Vst NP 2-2 S S PP 1 VP V NP 2 VP VP PP Any more efficient way?time flies like an arrow 5NP 3Vst 3NP 10SSNP 24S 22S 271NP 4VP 4NP 18S 21VP 182P 2V 5PP 2-12VP 163Det 14N 82-81 S NP VP6 S Vst NP2-2 S S PP1 VP V NP2 VP VP PP1 NP Det N2 NP NP PP3 NP NP NP0 PP P NP2-222-27
5 1 S NP VP 6 S Vst NP 2-2 S S PP 1 VP V NP 2 VP VP PP Add as we go … (the “inside algorithm”)time flies like an arrow 5NP 3Vst 3NP 10SNP 24S 22S 271NP 4VP 4NP 18S 21VP 182P 2V 5PP 2-12VP 163Det 14N 81 S NP VP6 S Vst NP2-2 S S PP1 VP V NP2 VP VP PP1 NP Det N2 NP NP PP3 NP NP NP0 PP P NP2-22+2-27
6 1 S NP VP 6 S Vst NP 2-2 S S PP 1 VP V NP 2 VP VP PP Add as we go … (the “inside algorithm”)time flies like an arrow 5NP 3Vst 3NP 10SNP1NP 4VP 4NP 18S 21VP 182P 2V 5PP 2-12VP 163Det 14N 82-22+2-272-22+2-271 S NP VP6 S Vst NP2-2 S S PP1 VP V NP2 VP VP PP1 NP Det N2 NP NP PP3 NP NP NP0 PP P NP+2-22+2-27
7 Charts and latticesYou can equivalently represent a parse chart as a lattice constructed over some initial arcsThis will also set the stage for Earley parsing latersaltfliesscratchNPNSVPVS NP VPVP V NPVPVNP NNP N NSNP SNP VP SN NPN NP V VPN NP V VPsaltfliesscratch
8 (Speech) LatticesThere was nothing magical about words spanning exactly one position.When working with speech, we generally don’t know how many words there are, or where they break.We can represent the possibilities as a lattice and parse these just as easily.IvaneyesofaweanIasaw‘vevan
9 Speech parsing mini-example Grammar (negative log-probs):[We’ll do it on the board if there’s time]1 S NP VP1 VP V NP2 VP V PP2 NP DT NN2 NP DT NNS3 NP NNP2 NP PRP0 PP IN NP1 PRP I9 NNP Ivan6 V saw4 V ’ve7 V awe2 DT a2 DT an3 IN of6 NNP eyes9 NN awe6 NN saw5 NN van
10 Better parsing We’ve now studied how to do correct parsing This is the problem of inference given a modelThe other half of the question is how to estimate a good modelThat’s what we’ll spend the rest of the day on
11 Problems with PCFGs?If we do no annotation, these trees differ only in one rule:VP VP PPNP NP PPParse will go one way or the other, regardless of wordsWe’ll look at two ways to address this:Sensitivity to specific words through lexicalizationSensitivity to structural configuration with unlexicalized methods
12 Problems with PCFGs?[insert performance statistics for a vanilla parser]
13 Problems with PCFGs What’s different between basic PCFG scores here? What (lexical) correlations need to be scored?
14 Problems with PCFGs Another example of PCFG indifference Left structure far more commonHow to model this?Really structural: “chicken with potatoes with gravy”Lexical parsers model this effect, though not by virtue of being lexical
15 PCFGs and Independence Symbols in a PCFG define conditional independence assumptions:At any node, the material inside that node is independent of the material outside that node, given the label of that node.Any information that statistically connects behavior inside and outside a node must flow through that node.SS NP VPNP DT NNNPNPVP
16 Solution(s) to problems with PCFGs Two common solutions seen for PCFG badness:Lexicalization: put head-word information into categories, and use it to condition rewrite probabilitiesState-splitting: distinguish sub-instances of more general categories (e.g., NP into NP-under-S vs. NP-under-VP)You can probably see that (1) is a special case of (2)More generally, the solution involves information propagation through PCFG nodes
17 Lexicalized Trees Add “headwords” to each phrasal node Syntactic vs. semantic headsHeadship not in (most) treebanksUsually use head rules, e.g.:NP:Take leftmost NPTake rightmost N*Take rightmost JJTake right childVP:Take leftmost VB*Take leftmost VPTake left childHow is this information propagation?
18 Lexicalized PCFGs? Problem: we now have to estimate probabilities like Never going to get these atomically off of a treebankSolution: break up derivation into smaller steps
19 Lexical Derivation Steps Simple derivation of a local tree [simplified Charniak 97]VP[saw]Still have to smooth with mono- and non-lexical backoffsVBD[saw] NP[her] NP[today] PP[on]VP[saw]It’s markovization again!(VP->VBD...PP )[saw]PP[on]NP[today](VP->VBD...NP )[saw]NP[her](VP->VBD...NP )[saw](VP->VBD )[saw]VBD[saw]
20 Lexical Derivation Steps Another derivation of a local tree [Collins 99]Choose a head tag and wordChoose a complement bagGenerate children (incl. adjuncts)Recursively derive children
21 Naïve Lexicalized Parsing Can, in principle, use CKY on lexicalized PCFGsO(Rn3) time and O(Sn2) memoryBut R = rV2 and S = sVResult is completely impractical (why?)Memory: 10K rules * 50K words * (40 words)2 * 8 bytes 6TBCan modify CKY to exploit lexical sparsityLexicalized symbols are a base grammar symbol and a pointer into the input sentence, not any arbitrary wordResult: O(rn5) time, O(sn3)Memory: 10K rules * (40 words)3 * 8 bytes 5GBNow, why do we get these space & time complexities?
22 Another view of CKY The fundamental operation is edge-combining X Y Z VPXVP(-NP)NPYZbestScore(X,i,j,h)if (j = i+1)return tagScore(X,s[i])elsereturnmax score(X -> Y Z) *bestScore(Y,i,k) *bestScore(Z,k,j)i k jk,X->YZTwo string-position indices required to characterize each edge in memoryThree string-position indices required to characterize each edge combination in time
23 Lexicalized CKYLexicalized CKY has the same fundamental operation, just more edgesX[h]VP [saw]Y[h]Z[h’]VP(-NP)[saw]NP [her]bestScore(X,i,j,h)if (j = i+1)return tagScore(X,s[i])elsereturnmax max score(X[h]->Y[h] Z[h’]) *bestScore(Y,i,k,h) *bestScore(Z,k,j,h’)max score(X[h]->Y[h’] Z[h]) *bestScore(Y,i,k,h’) *bestScore(Z,k,j,h)i h k h’ jk,X->YZThree string positions for each edge in spaceFive string positions for each edge combination in timek,X->YZ
24 Quartic Parsing Turns out, you can do better [Eisner 99] Gives an O(n4) algorithmStill prohibitive in practice if not prunedX[h]X[h]Y[h]Z[h’]Y[h]Zi h k h’ ji h k j
25 Dependency ParsingLexicalized parsers can be seen as producing dependency treesEach local binary tree corresponds to an attachment in the dependency graphquestionedlawyerwitnessthethe
26 Dependency Parsing Pure dependency parsing is only cubic [Eisner 99] Some work on non-projective dependenciesCommon in, e.g. Czech parsingCan do with MST algorithms [McDonald and Pereira 05]Leads to O(n3) or even O(n2) [McDonald et al., 2005]X[h]hY[h]Z[h’]hh’i h k h’ jh k h’
27 Pruning with BeamsThe Collins parser prunes with per-cell beams [Collins 99]Essentially, run the O(n5) CKYRemember only a few hypotheses for each span <i,j>.If we keep K hypotheses at each span, then we do at most O(nK2) work per span (why?)Keeps things more or less cubicSide note/hack: certain spans are forbidden entirely on the basis of punctuation (crucial for speed)X[h]Y[h]Z[h’]i h k h’ j
28 Pruning with a PCFGThe Charniak parser prunes using a two-pass approach [Charniak 97+]First, parse with the base grammarFor each X:[i,j] calculate P(X|i,j,s)This isn’t trivial, and there are clever speed upsSecond, do the full O(n5) CKYSkip any X :[i,j] which had low (say, < ) posteriorAvoids almost all work in the second phase!Currently the fastest lexicalized parserCharniak et al 06: can use more passesPetrov et al 07: can use many more passes
29 Pruning with A*You can also speed up the search without sacrificing optimalityFor agenda-based parsers:Can select which items to process firstCan do with any “figure of merit” [Charniak 98]If your figure-of-merit is a valid A* heuristic, no loss of optimiality [Klein and Manning 03]Xijn
30 Projection-Based A* S:fell VP:fell NP:payrolls PP:in Factory payrolls fell in Sept.NP:payrollsPP:inVP:fellS:fellFactory payrolls fell in Sept.NPPPVPSFactory payrolls fell in Sept.payrollsinfell
31 A* SpeedupTotal time dominated by calculation of A* tables in each projection… O(n3)
32 Typical Experimental Setup Corpus: Penn Treebank, WSJEvaluation by precision, recall, F1 (harmonic mean)Training:sections02-21Development:section22Test:231234567812345678Precision=Recall=7/8[NP,0,2][NP,3,5][NP,6,8][PP,5,8][VP,2,5][VP,2,8][S,1,8][NP,0,2][NP,3,5][NP,6,8][PP,5,8][NP,3,8][VP,2,8][S,1,8]
33 Results Some results However Collins 99 – 88.6 F1 (generative lexical)Petrov et al 06 – 90.7 F1 (generative unlexical)HoweverBilexical counts rarely make a difference (why?)Gildea 01 – Removing bilexical counts costs < 0.5 F1Bilexical vs. monolexical vs. smart smoothing
34 Unlexicalized methods So far we have looked at the use of lexicalized methods to fix PCFG independence assumptionsLexicalization creates new complications of inference (computational complexity) and estimation (sparsity)There are lots of improvements to be made without resorting to lexicalization
35 PCFGs and Independence Symbols in a PCFG define independence assumptions:At any node, the material inside that node is independent of the material outside that node, given the label of that node.Any information that statistically connects behavior inside and outside a node must flow through that node.SS NP VPNP DT NNNPNPVP
36 Non-Independence I Independence assumptions are often too strong. Example: the expansion of an NP is highly dependent on the parent of the NP (i.e., subjects vs. objects).Also: the subject and object expansions are correlated!All NPsNPs under SNPs under VP
37 Non-Independence II Who cares? Symptoms of overly strong assumptions: NB, HMMs, all make false assumptions!For generation, consequences would be obvious.For parsing, does it impact accuracy?Symptoms of overly strong assumptions:Rewrites get used where they don’t belong.Rewrites get used too often or too rarely.In the PTB, this construction is for possessives
38 Breaking Up the Symbols We can relax independence assumptions by encoding dependencies into the PCFG symbols:What are the most useful “features” to encode?Parent annotation[Johnson 98]Marking possessive NPs
39 AnnotationsAnnotations split the grammar categories into sub-categories (in the original sense).Conditioning on history vs. annotatingP(NP^S PRP) is a lot like P(NP PRP | S)Or equivalently, P(PRP | NP, S)P(NP-POS NNP POS) isn’t history conditioning.Feature / unification grammars vs. annotationCan think of a symbol like NP^NP-POS asNP [parent:NP, +POS]After parsing with an annotated grammar, the annotations are then stripped for evaluation.
40 LexicalizationLexical heads important for certain classes of ambiguities (e.g., PP attachment):Lexicalizing grammar creates a much larger grammar. (cf. next week)Sophisticated smoothing neededSmarter parsing algorithmsMore data neededHow necessary is lexicalization?Bilexical vs. monolexical selectionClosed vs. open class lexicalization
41 Unlexicalized PCFGs What is meant by an “unlexicalized” PCFG? Grammar not systematically specified to the level of lexical itemsNP [stocks] is not allowedNP^S-CC is fineClosed vs. open class words (NP^S [the])Long tradition in linguistics of using function words as features or markers for selectionContrary to the bilexical idea of semantic headsOpen-class selection really a proxy for semanticsIt’s kind of a gradual transition from unlexicalized to lexicalized (but heavily smoothed) grammars.
42 Typical Experimental Setup Corpus: Penn Treebank, WSJAccuracy – F1: harmonic mean of per-node labeled precision and recall.Here: also size – number of symbols in grammar.Passive / complete symbols: NP, NP^SActive / incomplete symbols: NP NP CC Training:sections02-21Development:section22 (here, first 20 files)Test:23
43 Multiple Annotations Each annotation done in succession Order does matterToo much annotation and we’ll have sparsity issues
45 Vertical Markovization Order 1Order 2Vertical Markov order: rewrites depend on past k ancestor nodes.(cf. parent annotation)
46 MarkovizationThis leads to a somewhat more general view of generative probabilistic models over treesMain goal: estimate P( )A bit of an interlude: Tree-Insertion Grammars deal with this problem more directly.
48 Data-oriented parsing (Bod 1992) A case of Tree-Insertion GrammarsRewrite large (possibly lexicalized) subtrees in a single stepDerivational ambiguity whether subtrees were generated atomically or compositionallyMost probable parse is NP-complete due to unbounded number of “rules”
49 Markovization, cont.So the question is, how do we estimate these tree probabilitiesWhat type of tree-insertion grammar do we use?Equivalently, what type of independence assuptions do we impose?Traditional PCFGs are only one type of answer to this question
51 Unary SplitsProblem: unary rewrites used to transmute categories so a high-probability rule can be used.Solution: Mark unary rewrite sites with -UAnnotationF1SizeBase77.87.5KUNARY78.38.0K
52 Tag Splits Problem: Treebank tags are too coarse. Example: Sentential, PP, and other prepositions are all marked IN.Partial Solution:Subdivide the IN tag.AnnotationF1SizePrevious78.38.0KSPLIT-IN80.38.1K
53 Other Tag Splits F1 Size 80.4 8.1K 80.5 81.2 8.5K 81.6 9.0K 81.7 9.1K 81.89.3KUNARY-DT: mark demonstratives as DT^U (“the X” vs. “those”)UNARY-RB: mark phrasal adverbs as RB^U (“quickly” vs. “very”)TAG-PA: mark tags with non-canonical parents (“not” is an RB^VP)SPLIT-AUX: mark auxiliary verbs with –AUX [cf. Charniak 97]SPLIT-CC: separate “but” and “&” from other conjunctionsSPLIT-%: “%” gets its own tag.
54 Treebank Splits The treebank comes with some annotations (e.g., -LOC, -SUBJ, etc).Whole set together hurt the baseline.One in particular is very useful (NP-TMP) when pushed down to the head tag (why?).Can mark gapped S nodes as well.AnnotationF1SizePrevious81.89.3KNP-TMP82.29.6KGAPPED-S82.39.7K
55 Yield SplitsProblem: sometimes the behavior of a category depends on something inside its future yield.Examples:Possessive NPsFinite vs. infinite VPsLexical heads!Solution: annotate future elements into nodes.Lexicalized grammars do this (in very careful ways – why?).AnnotationF1SizePrevious82.39.7KPOSS-NP83.19.8KSPLIT-VP85.710.5K
56 Distance / Recursion Splits Problem: vanilla PCFGs cannot distinguish attachment heights.Solution: mark a property of higher or lower sites:Contains a verb.Is (non)-recursive.Base NPs [cf. Collins 99]Right-recursive NPsNP-vVPNPPPvAnnotationF1SizePrevious85.710.5KBASE-NP86.011.7KDOMINATES-V86.914.1KRIGHT-REC-NP87.015.2K
58 Some Test Set Results Parser LP LR F1 CB 0 CB Magerman 95 84.9 84.6 84.71.2656.6Collins 9686.385.886.01.1459.9Klein & Manning 0386.985.71.1060.3Charniak 9787.487.51.0062.1Collins 9988.788.60.9067.1Beats “first generation” lexicalized parsers.Lots of room to improve – more complex models next.
59 Unlexicalized grammars: SOTA Klein & Manning 2003’s “symbol splits” were hand-codedPetrov and Klein (2007) used a hierarchical splitting process to learn symbol inventoriesReminiscent of decision trees/CARTCoarse-to-fine parsing makes it very fastPerformance is state of the art!
60 Parse RerankingNothing we’ve seen so far allows arbitrarily non-local featuresAssume the number of parses is very smallWe can represent each parse T as an arbitrary feature vector (T)Typically, all local rules are featuresAlso non-local features, like how right-branching the overall tree is[Charniak and Johnson 05] gives a rich set of features
61 Parse Reranking Since the number of parses is no longer huge Can enumerate all parses efficientlyCan use simple machine learning methods to score treesE.g. maxent reranking: learn a binary classifier over trees where:The top candidates are positiveAll others are negativeRank trees by P(+|T)The best parsing numbers have mostly been from reranking systemsCharniak and Johnson 05 – 89.7 / 91.3 F1 (generative lexical / reranked)McClosky et al 06 – 92.1 F1 (gen + rerank + self-train)
62 Shift-Reduce Parsers Another way to derive a tree: Parsing No useful dynamic programming searchCan still use beam search [Ratnaparkhi 97]
63 Derivational Representations Generative derivational models:How is a PCFG a generative derivational model?Distinction between parses and parse derivations.How could there be multiple derivations?
64 Tree-adjoining grammar (TAG) Start with local treesCan insert structure with adjunction operatorsMildly context-sensitiveModels long-distance dependencies naturally… as well as other weird stuff that CFGs don’t capture well (e.g. cross-serial dependencies)