
1 Statistical NLP Spring 2011
Lecture 17: Parsing III. Dan Klein – UC Berkeley

2 Treebank PCFGs [Charniak 96]
Use PCFGs for broad coverage parsing
Can take a grammar right off the trees (doesn't work well):
  ROOT → S        1
  S → NP VP .     1
  NP → PRP        1
  VP → VBD ADJP   1
  ...

Model      F1
Baseline   72.0
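For concreteness, here is a minimal sketch of reading a PCFG off treebank trees by relative-frequency estimation; the nested-list tree encoding and the function names are made up for illustration, not taken from any particular toolkit.

```python
from collections import defaultdict

def count_rules(tree, counts):
    """Recursively count label -> children-labels rules in one tree."""
    label, children = tree[0], tree[1:]
    if len(children) == 1 and isinstance(children[0], str):
        return  # preterminal over a word; lexical rules could be counted too
    rhs = tuple(child[0] for child in children)
    counts[label][rhs] += 1
    for child in children:
        count_rules(child, counts)

def estimate_pcfg(treebank):
    counts = defaultdict(lambda: defaultdict(int))
    for tree in treebank:
        count_rules(tree, counts)
    # Normalize counts into conditional probabilities P(rhs | lhs)
    return {lhs: {rhs: c / sum(rhss.values()) for rhs, c in rhss.items()}
            for lhs, rhss in counts.items()}

toy_tree = ["ROOT", ["S", ["NP", ["PRP", "She"]],
                          ["VP", ["VBD", "was"], ["ADJP", ["JJ", "right"]]],
                          [".", "."]]]
print(estimate_pcfg([toy_tree]))   # e.g. {'ROOT': {('S',): 1.0}, ...}
```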

3 Conditional Independence?
Not every NP expansion can fill every NP slot
A grammar with symbols like "NP" won't be context-free
Statistically, conditional independence is too strong

4 The Game of Designing a Grammar
Annotation refines base treebank symbols to improve statistical fit of the grammar Structural annotation

5 A Fully Annotated (Unlex) Tree

6 Some Test Set Results
Parser          LP     LR     F1     CB     0 CB
Magerman 95     84.9   84.6   84.7   1.26   56.6
Collins 96      86.3   85.8   86.0   1.14   59.9
Unlexicalized   86.9   85.7   86.3   1.10   60.3
Charniak 97     87.4   87.5   87.4   1.00   62.1
Collins 99      88.7   88.6   88.6   0.90   67.1

Beats "first generation" lexicalized parsers. Lots of room to improve – more complex models next.

7 The Game of Designing a Grammar
Annotation refines base treebank symbols to improve statistical fit of the grammar
Structural annotation [Johnson ’98, Klein and Manning 03]
Head lexicalization [Collins ’99, Charniak ’00]

8 Problems with PCFGs
If we do no annotation, these trees differ only in one rule: VP → VP PP vs. NP → NP PP
Parse will go one way or the other, regardless of words
We addressed this in one way with unlexicalized grammars (how?)
Lexicalization allows us to be sensitive to specific words

9 Problems with PCFGs What’s different between basic PCFG scores here?
What (lexical) correlations need to be scored?

10 Lexicalized Trees Add “headwords” to each phrasal node
Syntactic vs. semantic heads
Headship not in (most) treebanks
Usually use head rules, e.g.:
  NP:
    Take leftmost NP
    Take rightmost N*
    Take rightmost JJ
    Take right child
  VP:
    Take leftmost VB*
    Take leftmost VP
    Take left child
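As a sketch of how head rules like these get applied, the snippet below scans a local tree's children in a given direction and priority order; the tiny rule table is abbreviated and illustrative, not a full Collins-style head table.

```python
HEAD_RULES = {
    # (direction to scan, labels to look for in priority order)
    "NP": ("right", ["NN", "NNS", "NNP", "JJ"]),   # e.g. rightmost N*, then JJ
    "VP": ("left",  ["VBD", "VBZ", "VB", "VP"]),   # e.g. leftmost VB*, then VP
}

def find_head_child(label, child_labels):
    """Return the index of the head child of a local tree, per the head rules."""
    direction, priorities = HEAD_RULES.get(label, ("left", []))
    order = range(len(child_labels))
    if direction == "right":
        order = reversed(range(len(child_labels)))
    order = list(order)
    for wanted in priorities:
        for i in order:
            if child_labels[i].startswith(wanted):
                return i
    # Fallback: take the first child in scan order
    return order[0]

print(find_head_child("VP", ["VBD", "NP", "PP"]))  # -> 0 (the verb heads the VP)
print(find_head_child("NP", ["DT", "JJ", "NN"]))   # -> 2 (the rightmost noun)
```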

11 Lexicalized PCFGs? Problem: we now have to estimate probabilities like
Never going to get these atomically off of a treebank
Solution: break up the derivation into smaller steps

12 Lexical Derivation Steps
A derivation of a local tree [Collins 99]:
  Choose a head tag and word
  Choose a complement bag
  Generate children (incl. adjuncts)
  Recursively derive children

13 Lexicalized CKY
(Diagram: X[h] over [i, j] built from Y[h] over [i, k] and Z[h'] over [k, j]; e.g. (VP → VBD...NP)[saw] built from (VP → VBD)[saw] and NP[her].)

bestScore(X, i, j, h)
  if (j == i+1)
    return tagScore(X, s[i])
  else
    return max( max over k, h', X→YZ of
                  score(X[h] → Y[h] Z[h']) * bestScore(Y, i, k, h) * bestScore(Z, k, j, h'),
                max over k, h', X→YZ of
                  score(X[h] → Y[h'] Z[h]) * bestScore(Y, i, k, h') * bestScore(Z, k, j, h) )

14 Quartic Parsing
Turns out, you can do (a little) better [Eisner 99]
Gives an O(n^4) algorithm
Still prohibitive in practice if not pruned
(Diagram: the O(n^5) item X[h] → Y[h] Z[h'] over i, h, k, h', j vs. the O(n^4) item X[h] → Y[h] Z over i, h, k, j, where Z's head word is not tracked.)

15 Pruning with Beams
The Collins parser prunes with per-cell beams [Collins 99]:
  Essentially, run the O(n^5) CKY
  Remember only a few hypotheses for each span <i,j>
  If we keep K hypotheses at each span, then we do at most O(nK^2) work per span (why?)
  Keeps things more or less cubic
Also: certain spans are forbidden entirely on the basis of punctuation (crucial for speed)
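A hedged sketch of the per-cell beam idea: each span keeps only its K highest-scoring (symbol, head) hypotheses. The chart-cell layout assumed here is illustrative, not the Collins parser's actual data structures.

```python
import heapq

def prune_cell(cell, K=10):
    """cell maps (symbol, head) -> score; keep only the K highest-scoring."""
    best = heapq.nlargest(K, cell.items(), key=lambda item: item[1])
    return dict(best)

# Usage: after filling chart[(i, j)] during CKY, replace it with the pruned cell
chart_cell = {("NP", 2): 0.03, ("VP", 1): 0.2, ("S", 1): 0.001, ("PP", 3): 0.0004}
print(prune_cell(chart_cell, K=2))   # keeps only VP[1] and NP[2]
```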

16 Pruning with a PCFG
The Charniak parser prunes using a two-pass approach [Charniak 97+]:
  First, parse with the base grammar
    For each X:[i,j] calculate P(X | i, j, s)
    This isn't trivial, and there are clever speed-ups
  Second, do the full O(n^5) CKY
    Skip any X:[i,j] which had low posterior (say, below some small threshold)
Avoids almost all work in the second phase!
Charniak et al 06: can use more passes
Petrov et al 07: can use many more passes
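A minimal sketch of the posterior computation used for this pruning, assuming inside and outside scores from the base-grammar pass are stored in dictionaries keyed by (X, i, j); the names and threshold value are assumptions for illustration, not the Charniak parser's actual code.

```python
def prune_items(inside, outside, sentence_prob, threshold=1e-4):
    """Keep only items X:[i,j] whose posterior P(X | i, j, s) clears the threshold."""
    keep = set()
    for (X, i, j), inside_score in inside.items():
        # posterior = inside * outside / P(sentence)
        posterior = inside_score * outside.get((X, i, j), 0.0) / sentence_prob
        if posterior >= threshold:
            keep.add((X, i, j))
    return keep
```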

17 Prune?
For each chart item X[i,j], compute its posterior probability; prune if it falls below a threshold.
E.g. consider the span 5 to 12, with coarse symbols QP, NP, VP (each mapping to a set of refined symbols).
To make this explicit, what do I mean by pruning? For each item we compute its posterior probability using the inside/outside scores. For example, for a given span we have constituents like QP, NP, VP in our coarse grammar. When we prune, we compute the probability of the coarse symbols. If this probability is below a threshold, we can be pretty confident that there won't be such a constituent in the final parse tree. Each of those symbols corresponds to a set of symbols in the refined grammar. We can therefore safely remove all its refined versions from the chart of the refined grammar. How much does this help?

18 Pruning with A*
You can also speed up the search without sacrificing optimality
For agenda-based parsers:
  Can select which items to process first
  Can do with any "figure of merit" [Charniak 98]
  If your figure of merit is a valid A* heuristic, no loss of optimality [Klein and Manning 03]
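A sketch of the agenda loop this describes, with an assumed figure-of-merit function fom(item) (for A*, an inside score times an admissible outside estimate) and an assumed combine(item, finished) that proposes new items buildable from item and the finished ones; both are stand-ins, not a real parser API.

```python
import heapq
from itertools import count

def agenda_parse(initial_items, is_goal, combine, fom):
    tie = count()                      # tie-breaker so heapq never compares items
    agenda = [(-fom(item), next(tie), item) for item in initial_items]
    heapq.heapify(agenda)
    finished = set()
    while agenda:
        _, _, item = heapq.heappop(agenda)   # best-first by figure of merit
        if is_goal(item):
            return item                      # with a valid A* heuristic, this is optimal
        if item in finished:
            continue
        finished.add(item)
        for new_item in combine(item, finished):
            if new_item not in finished:
                heapq.heappush(agenda, (-fom(new_item), next(tie), new_item))
    return None
```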

19 Projection-Based A*
(Figure: the sentence "Factory payrolls fell in Sept." parsed with lexicalized items S:fell, VP:fell, NP:payrolls, PP:in, alongside its unlexicalized constituency projection (S, VP, NP, PP) and its dependency projection over the head words payrolls, fell, in.)

20 A* Speedup
Total time dominated by calculation of A* tables in each projection… O(n^3)

21 Results
Some results:
  Collins 99 – 88.6 F1 (generative lexical)
  Charniak and Johnson 05 – 89.7 / 91.3 F1 (generative lexical / reranked)
  Petrov et al 06 – 90.7 F1 (generative unlexical)
  McClosky et al 06 – 92.1 F1 (gen + rerank + self-train)
However:
  Bilexical counts rarely make a difference (why?)
  Gildea 01 – Removing bilexical counts costs < 0.5 F1

22 The Game of Designing a Grammar
Annotation refines base treebank symbols to improve statistical fit of the grammar
Structural annotation
Head lexicalization
Automatic clustering?

23 Latent Variable Grammars
So, what are latent variable grammars? The idea is that we take our observed parse trees and augment them with latent variables at each node. Each observed category, like NP, is thereby split into a set of subcategories NP-0 through NP-k. This also creates a set of exponentially many derivations. We then learn distributions over the new subcategories and these derivations. The resulting grammars look like this, where we now have many rules for each of the original rules that we started with. For example ... The learning process determines the exact set of rules, as well as the parameter values.
(Figure: a parse tree over a sentence, the set of derivations it induces, and the grammar parameters.)

24 Learning Latent Annotations
EM algorithm (just like Forward-Backward for HMMs):
  Brackets are known
  Base categories are known
  Only induce subcategories
(Figure: a tree with latent subcategory variables X1–X7 over the sentence "He was right .", with forward and backward passes indicated.)

25 Refinement of the DT tag

26 Hierarchical refinement

27 Hierarchical Estimation Results
As we had hoped, hierarchical training leads to better parameter estimates, which in turn results in a 1% improvement in parsing accuracy.

Model                   F1
Flat Training           87.3
Hierarchical Training   88.4

28 Refinement of the , tag Splitting all categories equally is wasteful:
This being said, there are still some things that are broken in our model. Let’s look at another category, e.g. the , tag. Whether we apply hierarchical training or not, the result will be the same. We are generating several subcategories that are exactly the same.

29 Adaptive Splitting Want to split complex categories more
Idea: split everything, roll back splits which were least useful
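A hedged sketch of the merge step implied here, assuming we can score each split by the (approximate) loss in training likelihood we would incur by merging its two subcategories back together; the numbers below are invented for illustration.

```python
def choose_merges(split_gain, merge_fraction=0.5):
    """Return the splits to undo: the merge_fraction with the smallest likelihood loss."""
    ranked = sorted(split_gain, key=split_gain.get)       # least useful splits first
    n_merge = int(len(ranked) * merge_fraction)
    return set(ranked[:n_merge])

split_gain = {"NP-0/NP-1": 120.0, ",-0/,-1": 0.01, "VP-0/VP-1": 95.0, "DT-0/DT-1": 40.0}
print(choose_merges(split_gain))   # merges back the useless ',' split (and DT here)
```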

30 Adaptive Splitting Results
Here you can see how the grammar size has been dramatically reduced. The arrows connect models that have undergone the same number of split iterations and therefore have the same maximal number of subcategories. Because we have merged some of the subcategories back in, the blue model has far fewer total subcategories. The adaptive splitting allows us to do two more split iterations and further refine the more complicated categories. This additional expressiveness gives us in turn another 1% improvement in parsing accuracy. We are now close to 90% with a grammar that has a total of barely more than 1000 subcategories. This grammar can allocate up to 64 subcategories, and here is where they get allocated.

Model              F1
Previous           88.4
With 50% Merging   89.5

31 Number of Phrasal Subcategories
Here is the number of subcategories for each of the 25 phrasal categories. In general, phrasal categories are split less heavily than lexical categories, a trend that one could have expected.

32 Number of Lexical Subcategories
Or have little variance.

33 Learned Splits
Proper Nouns (NNP):
  NNP-14: Oct. Nov. Sept.
  NNP-12: John Robert James
  NNP-2:  J. E. L.
  NNP-1:  Bush Noriega Peters
  NNP-15: New San Wall
  NNP-3:  York Francisco Street
Personal pronouns (PRP):
  PRP-0: It He I
  PRP-1: it he they
  PRP-2: them him

Before I finish, let me give you some interesting examples of what our grammars learn. These are intended to highlight some interesting observations; the full list of subcategories and more details are in the paper. The subcategories sometimes capture syntactic and sometimes semantic differences. But very often they represent syntactico-semantic relations, which are similar to those found in distributional clustering results. For example, for the proper nouns the system learns a subcategory for months, first names, last names and initials. It also learns which words typically come first and second in multi-word units. For personal pronouns there is a subcategory for accusative case and one for sentence-initial and sentence-medial nominative case.

34 Learned Splits
Relative adverbs (RBR):
  RBR-0: further lower higher
  RBR-1: more less More
  RBR-2: earlier Earlier later
Cardinal Numbers (CD):
  CD-7:  one two Three
  CD-4:  1989 1990 1988
  CD-11: million billion trillion
  CD-0:  1 50 100
  CD-3:  30 31
  CD-9:  78 58 34

Relative adverbs are divided into distance, degree and time. For the cardinal numbers the system learns to distinguish between spelled-out numbers, dates and others.

35 Coarse-to-Fine Inference
Example: PP attachment

36 Hierarchical Pruning
coarse:         QP  NP  VP
split in two:   QP1 QP2  NP1 NP2  VP1 VP2
split in four:  QP1 QP3 QP4  NP1 NP2 NP3 NP4  VP1 VP2 VP3 VP4
split in eight: …

Consider again the same span as before. We had decided that QP is not a valid constituent. In hierarchical pruning we do several pre-parses instead of just one. Instead of going directly to our most refined grammar, we can then use a grammar where the categories have been split in two. We already said that QPs are not valid, so we won't need to consider them. We compute the posterior probability of the different NPs and VPs and we might decide that NP1 is valid but NP2 is not. We can then go to the next, more refined grammar, where some other subcategories will get pruned, and so on. This is just an example, but it is pretty representative. The pruning allows us to keep the number of active chart items roughly constant even though we are doubling the size of the grammar.
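A small sketch of one such coarse-to-fine step, assuming we already have coarse-symbol posteriors per span and a projection from refined symbols to their coarse parents; all names and numbers here are illustrative.

```python
def allowed_refined_items(coarse_posteriors, projection, spans, threshold=1e-4):
    """Keep a refined symbol over a span only if its coarse projection survived pruning."""
    allowed = set()
    for (i, j) in spans:
        for refined, coarse in projection.items():
            if coarse_posteriors.get((coarse, i, j), 0.0) >= threshold:
                allowed.add((refined, i, j))
    return allowed

projection = {"NP-1": "NP", "NP-2": "NP", "VP-1": "VP", "VP-2": "VP", "QP-1": "QP"}
coarse_posteriors = {("NP", 5, 12): 0.3, ("VP", 5, 12): 0.02, ("QP", 5, 12): 1e-9}
print(allowed_refined_items(coarse_posteriors, projection, [(5, 12)]))
# QP-1 is pruned over [5,12]; the NP and VP refinements stay for the next pass
```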

37 Bracket Posteriors
Well, actually every pass is useful. After each pass we double the number of chart items, but by pruning we are able to keep the number of valid chart items roughly constant. You can also see that some ambiguities are resolved early on, while others, like some attachment ambiguities, are not resolved until the end. At the end we have this extremely sparse chart which contains most of the probability mass. What we need to do is extract a single parse tree from it.

38 (Figure: parsing times across configurations: 1621 min, 111 min, 35 min, and 15 min with no search error.)
Wow, that's fast now. 15 min without search error; remember that's at 91.2 F-score. And if you are willing to sacrifice 0.1 in accuracy you can cut this time in half again.

39 Final Results (Accuracy)
                                           ≤ 40 words F1   all F1
ENG   Charniak&Johnson ‘05 (generative)    90.1            89.6
      Split / Merge                        90.6            90.1
GER   Dubey ‘05                            76.3            -
      Split / Merge                        80.8            80.1
CHN   Chiang et al. ‘02                    80.0            76.6
      Split / Merge                        86.3            83.4

In terms of accuracy we are better than the generative component of the Brown parser, but still worse than the reranking parser. However, we could extract n-best lists and rerank them too, if we wanted… The nice thing about our approach is that we do not require any knowledge about the target language; we just need a treebank. We therefore trained some grammars for German and Chinese. We surpass the previous state of the art by a huge margin there, mainly because most work on parsing has been done on English and parsing techniques for other languages are still somewhat behind.
Still higher numbers from reranking / self-training methods.

40 Parse Reranking
Assume the number of parses is very small
We can represent each parse T as an arbitrary feature vector φ(T)
  Typically, all local rules are features
  Also non-local features, like how right-branching the overall tree is
  [Charniak and Johnson 05] gives a rich set of features
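A minimal sketch of linear reranking: score each candidate parse by the dot product of a weight vector with its feature vector φ(T). The toy feature extractor below is a stand-in, not the Charniak & Johnson feature set.

```python
def phi(tree_rules, right_branching_score):
    """Toy feature vector: one indicator per local rule plus one non-local feature."""
    feats = {}
    for rule in tree_rules:
        feats[rule] = feats.get(rule, 0.0) + 1.0
    feats["right_branching"] = right_branching_score
    return feats

def rerank(candidates, weights):
    """candidates: list of (tree, feature dict); return the tree with the best w·φ(T)."""
    def score(feats):
        return sum(weights.get(f, 0.0) * value for f, value in feats.items())
    return max(candidates, key=lambda cand: score(cand[1]))[0]

weights = {("VP", ("VBD", "NP", "PP")): 0.5, ("NP", ("NP", "PP")): -0.2, "right_branching": 0.1}
candidates = [
    ("parse_A", phi([("VP", ("VBD", "NP", "PP"))], 0.7)),
    ("parse_B", phi([("NP", ("NP", "PP"))], 0.9)),
]
print(rerank(candidates, weights))   # -> "parse_A"
```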

41 1-Best Parsing

42 K-Best Parsing [Huang and Chiang 05, Pauls, Klein, Quirk 10]

43 K-Best Parsing

44 Dependency Parsing
Lexicalized parsers can be seen as producing dependency trees
Each local binary tree corresponds to an attachment in the dependency graph
(Figure: dependency tree for "the lawyer questioned the witness", with "questioned" as the root.)
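A short sketch of reading the dependency arcs off a lexicalized tree: at each local tree, the head word of each non-head child attaches to the head word of the head child. The nested-tuple tree encoding here is just for illustration.

```python
def dependencies(tree):
    """tree = (label, head_word, children); children is () for preterminals."""
    label, head, children = tree
    arcs = []
    for child in children:
        _, child_head, _ = child
        if child_head != head:               # the non-head child attaches to the head
            arcs.append((child_head, head))  # (dependent, governor)
        arcs.extend(dependencies(child))
    return arcs

tree = ("S", "questioned",
        (("NP", "lawyer", (("DT", "the", ()), ("NN", "lawyer", ()))),
         ("VP", "questioned",
          (("VBD", "questioned", ()),
           ("NP", "witness", (("DT", "the", ()), ("NN", "witness", ())))))))
print(dependencies(tree))
# [('lawyer', 'questioned'), ('the', 'lawyer'), ('witness', 'questioned'), ('the', 'witness')]
```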

45 Dependency Parsing
Pure dependency parsing is only cubic [Eisner 99]
Some work on non-projective dependencies
  Common in, e.g., Czech parsing
  Can do with MST algorithms [McDonald and Pereira 05]
(Diagram: the lexicalized item X[h] → Y[h] Z[h'] over i, h, k, h', j reduces to a bare head-to-head attachment h → h'.)

46 Shift-Reduce Parsers
Another way to derive a tree: shift-reduce parsing
No useful dynamic programming search
Can still use beam search [Ratnaparkhi 97]
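A toy sketch of shift-reduce parsing with a beam, in the spirit of this slide; score_action stands in for a trained action classifier (e.g. a maxent model), and REDUCE here builds unlabeled binary brackets for simplicity.

```python
def beam_parse(words, score_action, actions=("SHIFT", "REDUCE"), beam_size=8):
    # Each state: (stack, remaining buffer, cumulative model score); assumes words is non-empty
    beam = [((), tuple(words), 0.0)]
    while True:
        candidates = []
        for stack, buffer, total in beam:
            if not buffer and len(stack) == 1:
                candidates.append((stack, buffer, total))   # finished parse, carry it forward
                continue
            for action in actions:
                if action == "SHIFT" and buffer:
                    new = (stack + (buffer[0],), buffer[1:])
                elif action == "REDUCE" and len(stack) >= 2:
                    new = (stack[:-2] + ((stack[-2], stack[-1]),), buffer)
                else:
                    continue
                candidates.append((new[0], new[1], total + score_action(stack, buffer, action)))
        beam = sorted(candidates, key=lambda s: s[2], reverse=True)[:beam_size]
        if all(not b and len(s) == 1 for s, b, _ in beam):
            return beam[0][0]   # best-scoring finished state
```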

47 Data-oriented parsing:
Rewrite large (possibly lexicalized) subtrees in a single step
Formally, a tree-insertion grammar
Derivational ambiguity: whether subtrees were generated atomically or compositionally
Most probable parse is NP-complete

48 TIG: Insertion

49 Tree-adjoining grammars
Start with local trees
Can insert structure with adjunction operators
Mildly context-sensitive
Models long-distance dependencies naturally
… as well as other weird stuff that CFGs don't capture well (e.g. cross-serial dependencies)

50 TAG: Long Distance

51 CCG Parsing Combinatory Categorial Grammar
Fully (mono-) lexicalized grammar
Categories encode argument sequences
Very closely related to the lambda calculus (more later)
Can have spurious ambiguities (why?)

