
Slide 1: Fall 2001 EE669: Natural Language Processing. Lecture 13: Probabilistic CFGs (Chapter 11 of Manning and Schütze). Wen-Hsiang Lu (盧文祥), Department of Computer Science and Information Engineering, National Cheng Kung University, 2004/12/15. (Slides from Dr. Mary P. Harper, http://min.ecn.purdue.edu/~ee669/)

Slide 2: Motivation. N-gram models and HMM tagging only allow us to process sentences linearly; however, the structural analysis of even simple sentences requires a nonlinear model, one that reflects the hierarchical structure of sentences rather than the linear order of words. Context-free grammars can be generalized to include probabilistic information by adding rule-use statistics. Probabilistic context-free grammars (PCFGs) are the simplest and most natural probabilistic model for tree structures, and the algorithms for them are closely related to those for HMMs. Note, however, that there are other ways of building probabilistic models of syntactic structure (see Chapter 12).

Slide 3: CFG Parse Example. Grammar: #1 S → NP VP, #2 VP → V NP PP, #3 VP → V NP, #4 NP → N, #5 NP → N PP, #6 PP → PREP N, #7 N → a_dog, #8 N → a_cat, #9 N → a_telescope, #10 V → saw, #11 PREP → with. [Figure: the two parse trees for "a_dog saw a_cat with a_telescope", one attaching the PP to the VP and one attaching it to the object NP.]

Slide 4: Dependency Parse Example. [Figure: the same example, "a_dog saw a_cat with a_telescope", in a dependency representation with edge labels Sb, Attr, Obj, and Adv_Tool.]

Slide 5: PCFG Notation. G is a PCFG; L is the language generated by G. {N1, …, Nn} is the set of nonterminals of G (there are n nonterminals); {w1, …, wV} is the set of terminals (there are V terminals). w1 … wm is a sequence of words constituting a sentence to be parsed, also denoted w1m. Nonterminal Nj spans wp through wq when Nj ⇒* wp wp+1 … wq.

Slide 6: Formal Definition of a PCFG. A PCFG consists of: a set of terminals {wk}, k = 1, …, V; a set of nonterminals {Ni}, i = 1, …, n; a designated start symbol N1; a set of rules {Ni → ζj} (where ζj is a sequence of terminals and nonterminals); and a corresponding set of probabilities on rules such that, for every i, Σj P(Ni → ζj) = 1.
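
For concreteness, here is a minimal Python sketch of a PCFG as a plain rule table, together with the normalization check above. The dictionary layout is just an illustrative choice; the rules and probabilities are those of the toy grammar used later on slide 27.

    from collections import defaultdict

    # A PCFG as a rule table: (lhs, rhs) -> probability, with rhs a tuple of symbols.
    # These are the rules and probabilities of the toy grammar on slide 27.
    rules = {
        ("S", ("NP", "VP")): 1.0,
        ("VP", ("V", "NP", "PP")): 0.4,
        ("VP", ("V", "NP")): 0.6,
        ("NP", ("N",)): 0.7,
        ("NP", ("N", "PP")): 0.3,
        ("PP", ("PREP", "N")): 1.0,
        ("N", ("a_dog",)): 0.3,
        ("N", ("a_cat",)): 0.5,
        ("N", ("a_telescope",)): 0.2,
        ("V", ("saw",)): 1.0,
        ("PREP", ("with",)): 1.0,
    }

    def check_proper(rules):
        """Check the PCFG constraint: for every nonterminal Ni, its rule probabilities sum to 1."""
        totals = defaultdict(float)
        for (lhs, _), p in rules.items():
            totals[lhs] += p
        for lhs, total in totals.items():
            assert abs(total - 1.0) < 1e-9, f"rules for {lhs} sum to {total}"

    check_proper(rules)  # passes: every left-hand side sums to 1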

Slide 7: Probability of a Derivation Tree and a String. The probability of a derivation (i.e., parse) tree T is P(T) = Π i=1..k p(r(i)), where r(1), …, r(k) are the rules of the CFG used to generate the sentence w1m of which T is a parse. The probability of a sentence (according to grammar G) is P(w1m) = Σt P(w1m, t) = Σ {t: yield(t) = w1m} P(t), where t ranges over parse trees of the sentence. We need dynamic programming to make this efficient!

Slide 8: Example: Probability of a Derivation Tree.

Slide 9: Example: Probability of a Derivation Tree.

Slide 10: The PCFG. Below is a probabilistic CFG (PCFG) with probabilities derived from analyzing a parsed version of Allen's corpus.

Slide 11: Assumptions of the PCFG Model. Place invariance: the probability of a subtree does not depend on where in the string the words it dominates are located (like time invariance in an HMM). Context freeness: the probability of a subtree does not depend on words not dominated by that subtree. Ancestor freeness: the probability of a subtree does not depend on nodes in the derivation outside the subtree.

Slide 12: Probability of a Rule. A rule r has the form Nj → ζ, where ζ is a nonempty sequence of nonterminals and terminals. Let RNj be the set of all rules that have nonterminal Nj on the left-hand side; then define a probability distribution on RNj: Σ r∈RNj p(r) = 1, with 0 ≤ p(r) ≤ 1. Another point of view: p(ζ | Nj) = p(r), where r is Nj → ζ.

Slide 13: Estimating the Probability of a Rule. Use the MLE from a treebank produced using a CFG grammar. Let r be the rule Nj → ζ; then p(r) = c(r) / c(Nj), where c(r) is the number of times r appears in the treebank and c(Nj) = Σζ' c(Nj → ζ'), so p(r) = c(Nj → ζ) / Σζ' c(Nj → ζ'). [Figure: the possible expansions ζ1, ζ2, …, ζk of a nonterminal Nj.]
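
A small sketch of this estimator in Python; the rule occurrences in the example call are invented purely for illustration (they happen to reproduce the 0.7/0.3 split used for NP on slide 27).

    from collections import Counter

    def mle_rule_probs(treebank_rules):
        """MLE p(r) = c(r) / c(Nj): c(r) counts occurrences of rule r = (lhs, rhs) in the
        treebank, and c(Nj) sums the counts of all rules with Nj on the left-hand side."""
        rule_counts = Counter(treebank_rules)
        lhs_counts = Counter(lhs for lhs, _ in treebank_rules)
        return {(lhs, rhs): c / lhs_counts[lhs]
                for (lhs, rhs), c in rule_counts.items()}

    # Hypothetical rule occurrences extracted from a toy treebank:
    observed = [("NP", ("N",))] * 7 + [("NP", ("N", "PP"))] * 3
    print(mle_rule_probs(observed))  # {('NP', ('N',)): 0.7, ('NP', ('N', 'PP')): 0.3}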

Slide 14: Example PCFG.

Slide 15: Some Features of PCFGs. A PCFG gives some idea of the plausibility of different parses; however, the probabilities are based on structural factors, not lexical ones. PCFGs are good for grammar induction. PCFGs are robust. PCFGs give a probabilistic language model for English. In principle the predictive power of a PCFG should be greater than that of an HMM, though in practice it is worse. PCFGs are not good models alone, but they can be combined with a trigram model. PCFGs have certain biases which may not be appropriate; smaller trees are preferred (why?). Estimating PCFG parameters from a corpus yields a proper probability distribution (Chi and Geman, 1998).

Slide 16: Recall the Questions for HMMs. Given a model μ = (A, B, π), how do we efficiently compute how likely a certain observation sequence is, i.e., P(O | μ)? Given the observation sequence O and a model μ, how do we choose a state sequence (X1, …, XT+1) that best explains the observations? Given an observation sequence O, and a space of possible models found by varying the model parameters μ = (A, B, π), how do we find the model that best explains the observed data?

Slide 17: Questions for PCFGs. There are also three basic questions we wish to answer for PCFGs: What is the probability of a sentence w1m according to a grammar G, P(w1m | G)? What is the most likely parse for a sentence, argmax_t P(t | w1m, G)? How can we choose rule probabilities for the grammar G that maximize the probability of a sentence, argmax_G P(w1m | G)?

Slide 18: Restriction to Chomsky Normal Form (CNF) Grammars. For this presentation we only consider Chomsky Normal Form grammars, which have only binary and unary rules of the form Ni → Nj Nk (two nonterminals on the RHS) and Ni → wj (one terminal on the RHS). The parameters of a PCFG in Chomsky Normal Form are P(Nj → Nr Ns | G), an n×n×n (n³) table of parameters, and P(Nj → wk | G), n×V parameters (n is the number of nonterminals and V the number of terminals). For j = 1, …, n: Σ r,s P(Nj → Nr Ns) + Σk P(Nj → wk) = 1. Any CFG can be represented by a weakly equivalent grammar in CNF.

Slide 19: HMMs and PCFGs. An HMM can do its calculations efficiently using forward and backward probabilities: αi(t) = P(w1(t−1), Xt = i) and βi(t) = P(wtT | Xt = i). In a parse tree, the forward probability corresponds to everything above and including a certain node (outside), while the backward probability corresponds to the probability of everything below a certain node (inside). For PCFGs, the outside (αj) and inside (βj) probabilities are defined analogously (see the next slides).

Slide 20: Inside and Outside Probabilities of PCFGs. [Figure: the root N1 dominates the sentence w1 … wm; an internal node Nj spans wp … wq. The inside probability covers wp … wq below Nj; the outside probability covers w1 … wp−1 and wq+1 … wm.]

Slide 21: Inside and Outside Probabilities. The inside probability βj(p,q) is the total probability of generating wpq starting off with nonterminal Nj. The outside probability αj(p,q) is the total probability of beginning with nonterminal N1 and generating Nj spanning positions p through q (written Njpq) together with all the words outside wpq.

Slide 22: From HMMs to Probabilistic Regular Grammars (PRGs). A PRG has start state N1 and rules of the form Ni → wj Nk and Ni → wj. This is like an HMM, except that for an HMM Σ w1n P(w1n) = 1 for each length n, whereas for a PCFG Σ w∈L P(w) = 1, where L is the language generated by the grammar. PRGs are related to HMMs in that a PRG is essentially an HMM to which a start state and a finish (or sink) state have been added.

Slide 23: Inside Probability. βj(p,q) = P(Nj ⇒* wpq). [Figure: Nj dominating wp … wq.]

Slide 24: The Probability of a String: Using Inside Probabilities. We can calculate the probability of a string using the inside algorithm, a dynamic programming algorithm based on the inside probabilities: P(w1m | G) = β1(1,m). Base case, for every j: βj(k,k) = P(Nj → wk).

Slide 25: Induction: Divide the String wpq. [Figure: Nj spanning wp … wq is split by a rule Nj → Nr Ns into Nr spanning wp … wd and Ns spanning wd+1 … wq.]

Slide 26: Induction Step. For all j and 1 ≤ p < q ≤ m: βj(p,q) = Σ r,s Σ d=p..q−1 P(Nj → Nr Ns) βr(p,d) βs(d+1,q). [Figure: the same split of wpq into wp … wd and wd+1 … wq.]

Slide 27: Example PCFG. #1 S → NP VP (1.0), #2 VP → V NP PP (0.4), #3 VP → V NP (0.6), #4 NP → N (0.7), #5 NP → N PP (0.3), #6 PP → PREP N (1.0), #7 N → a_dog (0.3), #8 N → a_cat (0.5), #9 N → a_telescope (0.2), #10 V → saw (1.0), #11 PREP → with (1.0). P(a_dog saw a_cat with a_telescope) = 1 × .7 × .3 × .4 × 1 × .7 × .5 × 1 × 1 × .2 (PP attached to the VP, via rules #2 and #4) + 1 × .7 × .3 × .6 × 1 × .3 × .5 × 1 × 1 × .2 (PP attached to the object NP, via rules #3 and #5) = .00588 + .00378 = .00966.
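
A quick arithmetic check of the two parse probabilities, multiplying out the rule probabilities listed above (a throwaway sketch; math.prod needs Python 3.8+):

    from math import prod

    # Rule probabilities used by each parse of "a_dog saw a_cat with a_telescope".
    vp_attach = prod([1.0, 0.7, 0.3, 0.4, 1.0, 0.7, 0.5, 1.0, 1.0, 0.2])  # PP under the VP
    np_attach = prod([1.0, 0.7, 0.3, 0.6, 1.0, 0.3, 0.5, 1.0, 1.0, 0.2])  # PP under the object NP
    print(round(vp_attach, 5), round(np_attach, 5), round(vp_attach + np_attach, 5))
    # 0.00588 0.00378 0.00966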

Slide 28: Computing Inside Probability. a_dog saw a_cat with a_telescope (positions 1 2 3 4 5). Create an m × m table (m is the length of the string). Initialize the diagonal using the N → w rules, then recursively fill the table from the diagonal towards the upper right corner.
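
The following sketch runs the CNF inside recursion from slides 24-26 in Python. Since the toy grammar above is not in CNF, the code uses a hand-made weakly equivalent CNF version of it (slide 18): the unary chain NP → N → word is collapsed into NP → word, and the ternary rule VP → V NP PP is binarized with a helper nonterminal @NP_PP introduced only for this conversion.

    from collections import defaultdict

    binary = {   # (lhs, left daughter, right daughter) -> probability
        ("S", "NP", "VP"): 1.0,
        ("VP", "V", "@NP_PP"): 0.4, ("@NP_PP", "NP", "PP"): 1.0,
        ("VP", "V", "NP"): 0.6,
        ("NP", "N", "PP"): 0.3,
        ("PP", "PREP", "N"): 1.0,
    }
    lexical = {  # (lhs, word) -> probability (NP entries fold in NP -> N at 0.7)
        ("NP", "a_dog"): 0.21, ("NP", "a_cat"): 0.35, ("NP", "a_telescope"): 0.14,
        ("N", "a_dog"): 0.3, ("N", "a_cat"): 0.5, ("N", "a_telescope"): 0.2,
        ("V", "saw"): 1.0, ("PREP", "with"): 1.0,
    }

    def inside(words):
        m = len(words)
        beta = defaultdict(float)           # beta[(j, p, q)], 1 <= p <= q <= m
        for k, w in enumerate(words, 1):    # base case: beta_j(k, k) = P(Nj -> wk)
            for (j, word), p in lexical.items():
                if word == w:
                    beta[(j, k, k)] = p
        for span in range(2, m + 1):        # induction: shorter spans first
            for p in range(1, m - span + 2):
                q = p + span - 1
                for (j, r, s), prob in binary.items():
                    beta[(j, p, q)] += sum(prob * beta[(r, p, d)] * beta[(s, d + 1, q)]
                                           for d in range(p, q))
        return beta

    words = "a_dog saw a_cat with a_telescope".split()
    beta = inside(words)
    print(round(beta[("S", 1, len(words))], 5))   # 0.00966, matching slide 27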

Slide 29: The Probability of a String: Using Outside Probabilities. We can also calculate the probability of a string using the outside algorithm, based on the outside probabilities: for any k with 1 ≤ k ≤ m, P(w1m | G) = Σj αj(k,k) P(Nj → wk). Outside probabilities are calculated top down and require reference to the inside probabilities, which must be calculated first.

Slide 30: Outside Algorithm. Base case: α1(1,m) = 1 and αj(1,m) = 0 for j ≠ 1 (the root must be the start symbol N1, with nothing outside it). Inductive case: two cases, depicted on the next two slides; sum over both possibilities, but restrict the first sum so that g ≠ j, to avoid double counting for rules of the form Ni → Nj Nj.

Slide 31: Case 1: Left Branch of the Previous Derivation Step. [Figure: Nj spanning wp … wq is the left daughter of a node spanning wp … we; its right sister spans wq+1 … we, with w1 … wp−1 and we+1 … wm outside.]

Slide 32: Case 2: Right Branch of the Previous Derivation Step. [Figure: Nj spanning wp … wq is the right daughter of a node spanning we … wq; its left sister spans we … wp−1, with w1 … we−1 and wq+1 … wm outside.]

Slide 33: Outside Algorithm Induction. αj(p,q) = Σ f, g≠j Σ e=q+1..m αf(p,e) P(Nf → Nj Ng) βg(q+1,e) + Σ f,g Σ e=1..p−1 αf(e,q) P(Nf → Ng Nj) βg(e,p−1), where the first sum covers Case 1 and the second covers Case 2, with the g ≠ j restriction from slide 30.
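
A sketch of this recursion in Python, reusing the CNF grammar and the inside table beta computed in the sketch after slide 28. The g ≠ j restriction from slide 30 only matters for rules with two identical daughters, and this toy grammar has none, so the sketch omits it.

    def outside(words, beta):
        """Outside probabilities alpha_j(p, q), computed top down from the inside table."""
        m = len(words)
        alpha = defaultdict(float)
        alpha[("S", 1, m)] = 1.0                 # base case: the root is the start symbol
        for span in range(m - 1, 0, -1):         # larger spans first
            for p in range(1, m - span + 2):
                q = p + span - 1
                # Case 1: Nj (spanning p..q) is the left daughter; its sister spans q+1..e.
                for (f, j, g), prob in binary.items():
                    for e in range(q + 1, m + 1):
                        alpha[(j, p, q)] += alpha[(f, p, e)] * prob * beta[(g, q + 1, e)]
                # Case 2: Nj is the right daughter; its sister spans e..p-1.
                for (f, g, j), prob in binary.items():
                    for e in range(1, p):
                        alpha[(j, p, q)] += alpha[(f, e, q)] * prob * beta[(g, e, p - 1)]
        return alpha

    alpha = outside(words, beta)
    # Slide 29 check: P(w1m) = sum_j alpha_j(k,k) * P(Nj -> wk), for any k (here k = 1).
    k = 1
    print(round(sum(alpha[(j, k, k)] * p
                    for (j, w), p in lexical.items() if w == words[k - 1]), 5))  # 0.00966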

Slide 34: Product of Inside and Outside Probabilities. Similarly to an HMM, we can form a product of the inside and outside probabilities: αj(p,q) βj(p,q) = P(w1m, Njpq | G). Because the PCFG requires us to postulate a nonterminal node, the probability of the sentence AND of there being some constituent spanning words p through q is P(w1m, Npq | G) = Σj αj(p,q) βj(p,q).

Slide 35: CNF Assumption. Since we know that there will always be some nonterminal spanning the whole tree (the start symbol), as well as each individual terminal (due to the CNF assumption), we obtain the total probability of a string as P(w1m | G) = α1(1,m) β1(1,m) = β1(1,m), or equivalently, for any k, P(w1m | G) = Σj αj(k,k) P(Nj → wk).

Slide 36: Most Likely Parse for a Sentence. The O(m³n³) algorithm works by finding the highest-probability partial parse tree spanning a certain substring and rooted with a certain nonterminal. δi(p,q) is the highest inside probability of a parse of a subtree Nipq. Initialization: δi(p,p) = P(Ni → wp). Induction: δi(p,q) = max over 1 ≤ j,k ≤ n and p ≤ r < q of P(Ni → Nj Nk) δj(p,r) δk(r+1,q). Store backtraces: ψi(p,q) = argmax over (j,k,r) of P(Ni → Nj Nk) δj(p,r) δk(r+1,q). At termination, the probability of the most likely tree is P(t̂) = δ1(1,m); to reconstruct the tree, see the next slide.

Slide 37: Extracting the Viterbi Parse. To reconstruct the maximum-probability tree, we regard the tree as a set of nodes {Xx}. Because the grammar has a start symbol, the root node must be N1 1m, so we only need to construct the left and right daughters of each nonterminal node recursively: if Xx = Ni pq is in the Viterbi parse and ψi(p,q) = (j,k,r), then left(Xx) = Nj pr and right(Xx) = Nk (r+1)q. There can be more than one maximum-probability parse.
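
A compact sketch of the Viterbi (CYK-style) search with backtrace, again reusing the CNF tables binary and lexical from the sketch after slide 28; the printed tree therefore contains the @NP_PP helper nonterminal from that conversion.

    def viterbi_parse(words):
        """Most probable parse under the CNF grammar (slides 36-37)."""
        m = len(words)
        delta, psi = defaultdict(float), {}
        for p, w in enumerate(words, 1):                 # initialization
            for (i, word), prob in lexical.items():
                if word == w:
                    delta[(i, p, p)] = prob
        for span in range(2, m + 1):                     # induction
            for p in range(1, m - span + 2):
                q = p + span - 1
                for (i, j, k), prob in binary.items():
                    for r in range(p, q):
                        score = prob * delta[(j, p, r)] * delta[(k, r + 1, q)]
                        if score > delta[(i, p, q)]:
                            delta[(i, p, q)] = score
                            psi[(i, p, q)] = (j, k, r)

        def build(i, p, q):                              # backtrace (slide 37)
            if (i, p, q) not in psi:
                return (i, words[p - 1])
            j, k, r = psi[(i, p, q)]
            return (i, build(j, p, r), build(k, r + 1, q))

        return delta[("S", 1, m)], build("S", 1, m)

    prob, tree = viterbi_parse("a_dog saw a_cat with a_telescope".split())
    print(round(prob, 5))   # 0.00588 -- the PP-under-VP reading wins
    print(tree)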

Slide 38: Training a PCFG. Purpose of training: limited grammar learning (induction). Restrictions: we must provide the number of terminals and nonterminals, as well as the start symbol, in advance; it is then possible to assume that all possible CNF rules over those symbols exist, or alternatively to provide more structure on the rules, up to specifying the rule set itself. Given a grammar specified in some such way, training attempts to find the optimal probabilities to assign to the different rules. Mechanism: the training process uses an EM algorithm, the Inside-Outside Algorithm, which allows us to train the parameters of a PCFG on unannotated sentences of the language. Basic assumption: a good grammar is one that makes the sentences in the training corpus likely to occur, i.e., we seek the grammar that maximizes the likelihood of the training data.

Slide 39: Inside-Outside Algorithm. If a parsed corpus is available, it is best to train using that information. If one is unavailable, we have a hidden-data problem: we need to determine probability functions on rules while seeing only the sentences. Hence, we must use an iterative algorithm (like Baum-Welch for HMMs) that improves the estimates until some threshold is reached. We begin with a grammar topology and some initial estimates of the rule probabilities (e.g., all rules with the same nonterminal on the LHS are equiprobable). We use the probability of each parse of a training sentence to indicate our confidence in it, and then sum the probabilities of each rule's uses to obtain an expectation of how often the rule is used.

Slide 40: Inside-Outside Algorithm (continued). These expectations are then used to refine the probability estimates of the rules, with the goal of increasing the likelihood of the training set given the grammar. Unlike the HMM case, when dealing with multiple training instances we cannot simply concatenate the data; we must assume that the sentences in the training set are independent, so that the likelihood of the corpus is simply the product of the component sentence probabilities. This complicates the reestimation formulas, which appear on pages 399-401 of Chapter 11.
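
As a hedged sketch of what those reestimation formulas look like for a single training sentence w1m (the multi-sentence version sums the expected counts over all sentences before normalizing), using the inside and outside probabilities defined earlier:

P̂(Nj → Nr Ns) = [Σ 1≤p<q≤m Σ p≤d<q αj(p,q) P(Nj → Nr Ns) βr(p,d) βs(d+1,q)] / [Σ 1≤p≤q≤m αj(p,q) βj(p,q)]

P̂(Nj → wk) = [Σ h: wh = wk αj(h,h) βj(h,h)] / [Σ 1≤p≤q≤m αj(p,q) βj(p,q)]

Both numerator and denominator carry an implicit 1/P(w1m) factor that cancels: the numerator is the expected number of uses of the rule, the denominator the expected number of uses of Nj.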

Slide 41: Problems with the Inside-Outside Algorithm. It is extremely slow: each iteration of training is O(m³n³) per sentence, where m is the sentence length and n is the number of nonterminals. Local maxima are much more of a problem than in HMMs. Satisfactory learning requires many more nonterminals than are theoretically needed to describe the language, and there is no guarantee that the learned nonterminals will be linguistically motivated. Hence, although grammar induction from unannotated corpora is possible, it is extremely difficult to do well.

Slide 42: Another View of Probabilistic Parsing. By assuming that the probability of a constituent being derived by a rule is independent of how the constituent is used as a subconstituent, the inside probability can be used to develop a best-first PCFG parsing algorithm. This assumption says that the probabilities of NP rules are the same whether the NP is used as a subject, an object, or the object of a preposition (even though, for instance, subjects are more likely to be pronouns!). The inside probabilities for lexical categories can be their lexical generation probabilities; for example, P(flower | N) = .063 is the inside probability that the constituent N is realized as the word flower.

Slide 43: Lexical Probability Estimates. The table below gives the lexical probabilities needed for our example.

Slide 44: Word/Tag Counts.

Slide 45: The PCFG. Below is a probabilistic CFG (PCFG) with probabilities derived from analyzing a parsed version of Allen's corpus.

Slide 46: Parsing with a PCFG. Using the lexical probabilities, we can derive the probability that the constituent NP generates a sequence like "a flower". Two rules could generate the string of words:

Slide 47: Parsing with a PCFG. The likelihood of each rule has been estimated from the corpus, so the probability that an NP generates "a flower" is the sum over the two ways it can be derived: P(a flower | NP) = P(R8 | NP) × P(a | ART) × P(flower | N) + P(R6 | NP) × P(a | N) × P(flower | N) = .55 × .36 × .063 + .09 × .001 × .063 = .012. This probability can then be used to compute probabilities for larger constituents, such as the probability of generating the words "A flower wilted" from an S constituent.

Slide 48: Three Possible Trees for an S.

Slide 49: Parsing with a PCFG. The probability of a sentence generating "A flower wilted" is P(a flower wilted | S) = P(R1 | S) × P(a flower | NP) × P(wilted | VP) + P(R1 | S) × P(a | NP) × P(flower wilted | VP). Using this approach, the probability that a given sentence will be generated by the grammar can be computed efficiently; it only requires some way of recording the value of each constituent between each two possible positions, and this requirement can be met by a packed chart structure.

Slide 50: Parsing with a PCFG. The probabilities of specific parse trees can be found using a standard chart-parsing algorithm, where the probability of each constituent is computed from the probabilities of its subconstituents and of the rule used. When entering an entry E of nonterminal Nj built by rule i from n subconstituents E1, E2, …, En: P(E) = P(Rule i | Nj) × P(E1) × … × P(En). A standard chart parser can be used, with an extra step to compute the probability of each entry when it is added to the chart. Using a bottom-up algorithm plus probabilities from a corpus, the chart slide below ("PCFG Chart for A flower") shows the complete chart for the input "a flower".
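
A tiny sketch of that probability step. The rule label R8 and the numbers .55, .36, .063, .09, and .001 are the ones quoted on slides 46-47; the second term's rule is labelled here only as "the other NP rule", since its exact right-hand side is not spelled out in this transcript.

    def entry_prob(rule_prob, subconstituent_probs):
        """Slide 50: P(E) = P(rule | Nj) * P(E1) * ... * P(En)."""
        p = rule_prob
        for q in subconstituent_probs:
            p *= q
        return p

    p_np = (entry_prob(0.55, [0.36, 0.063])       # R8 with "a" read as ART, "flower" as N
            + entry_prob(0.09, [0.001, 0.063]))   # the other NP rule, both words read as N
    print(round(p_np, 3))   # 0.012, the P(a flower | NP) value from slide 47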

Slide 51: Lexical Generation Probabilities. A better estimate of the lexical category for a word is obtained by determining how likely it is that category Ni occurs at position t over all sequences, given the input w1t. In other words, rather than ignoring context or simply searching for the single sequence that yields the maximum probability for the word, we compute the sum of the probabilities over all sequences using the forward algorithm. The forward probability can then be used in the chart as the probability of a lexical item. The forward probabilities strongly prefer the noun reading of flower, in contrast to the context-independent probabilities (as can be seen on the next slide).

Slide 52: PCFG Chart for "A flower".

Slide 53: Accuracy Levels. A parser built using this technique will identify the correct parse about 50% of the time. The drastic independence assumptions may prevent it from doing better: the context-free model assumes, for example, that the probability of a particular verb being used in a VP rule is independent of the rule in question. As a result, any structure that attaches a PP to a verb will have a probability of .22 from that fragment, compared with .093 (.39 × .24) for one that attaches the PP to an NP inside the VP.

Slide 54: Best-First Parsing. Algorithms can be developed that explore high-probability constituents first using a best-first strategy. The hope is that the best parse can be found quickly without using much search space and that low-probability constituents will never be created. The chart parser can be modified by making the agenda a priority queue (where the most probable elements are first in the queue); the parser then operates by removing the highest-ranked constituent first and adding it to the chart. The previous chart parser relied on the fact that it worked from left to right, finishing earlier constituents before later ones; with the best-first strategy this assumption does not hold, so the algorithm must be modified.

Slide 55: Best-First Chart Issues. For example, if the last word in the sentence has the highest score, it will be entered into the chart first in a best-first parser. This means that a constituent needed to extend an arc may already be in the chart, so whenever an active arc is added to the chart we need to check whether it can be extended by anything already in the chart. The arc extension algorithm must therefore be modified.

Slide 56: Arc Extension Algorithm.
To add a constituent C from position p1 to p2:
1. Insert C into the chart from position p1 to p2.
2. For any active arc of the form X → X1 ... • C ... Xn from position p0 to p1, add a new active arc X → X1 ... C • ... Xn from position p0 to p2.
To add an active arc X → X1 ... C • C' ... Xn to the chart from p0 to p2:
1. If C is the last constituent (i.e., the arc is completed), add a new constituent of type X to the agenda.
2. Otherwise, if there is a constituent Y of category C' in the chart from p2 to p3, then recursively add an active arc X → X1 ... C C' • ... Xn from p0 to p3 (which may in turn add further arcs or create further constituents).
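
Below is an illustrative Python skeleton of a best-first probabilistic chart parser along these lines: the agenda is a max priority queue of constituents, each new arc is immediately checked against constituents already in the chart, and the first complete S covering the input is returned. This is a sketch under simplifying assumptions (no packed chart, lower-probability duplicate constituents simply skipped, no unary rule cycles), not Allen's exact bookkeeping; the usage lines at the end reuse the toy rules table defined after slide 6.

    import heapq
    from collections import defaultdict

    def best_first_parse(words, grammar, lexicon, start="S"):
        """grammar maps (lhs, rhs_tuple) -> prob; lexicon maps (cat, word) -> prob."""
        m = len(words)
        chart = defaultdict(list)     # (cat, start_pos) -> [(end_pos, prob), ...]
        arcs = defaultdict(list)      # (needed_cat, pos) -> active arcs waiting there
        agenda = []                   # max-heap of (-prob, cat, p1, p2)
        for i, w in enumerate(words):
            for (cat, word), p in lexicon.items():
                if word == w:
                    heapq.heappush(agenda, (-p, cat, i, i + 1))

        def add_arc(lhs, rhs, dot, p0, p2, prob):
            if dot == len(rhs):                      # completed arc -> new constituent
                heapq.heappush(agenda, (-prob, lhs, p0, p2))
                return
            arcs[(rhs[dot], p2)].append((lhs, rhs, dot, p0, prob))
            for p3, q in chart[(rhs[dot], p2)]:      # extend with constituents already present
                add_arc(lhs, rhs, dot + 1, p0, p3, prob * q)

        done = set()
        while agenda:
            neg_p, cat, p1, p2 = heapq.heappop(agenda)
            if (cat, p1, p2) in done:
                continue                             # a lower-probability duplicate
            done.add((cat, p1, p2))
            prob = -neg_p
            if cat == start and (p1, p2) == (0, m):
                return prob                          # highest-probability parse comes off first
            chart[(cat, p1)].append((p2, prob))
            for lhs, rhs, dot, p0, p_arc in list(arcs.get((cat, p1), [])):
                add_arc(lhs, rhs, dot + 1, p0, p2, p_arc * prob)
            for (lhs, rhs), p_rule in grammar.items():
                if rhs[0] == cat:                    # bottom-up rule introduction
                    add_arc(lhs, rhs, 1, p1, p2, p_rule * prob)
        return 0.0

    grammar = {(l, r): p for (l, r), p in rules.items() if not r[0].islower()}
    lexicon = {(l, r[0]): p for (l, r), p in rules.items() if r[0].islower()}
    print(round(best_first_parse("a_dog saw a_cat with a_telescope".split(),
                                 grammar, lexicon), 5))   # 0.00588, as in the Viterbi sketch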

Slide 57: Best-First Strategy. The best-first strategy improves the efficiency of the parser by creating fewer constituents than a standard chart parser, and it can terminate as soon as a sentence parse is found: even though it does not consider every possible constituent, it is guaranteed to find the highest-probability parse first. Although conceptually simple, the technique has some problems in practice. For example, if we combine scores by multiplying probabilities, the scores of constituents fall as they cover more words; hence, with large grammars, the probabilities may drop off so quickly that the search resembles breadth-first search!

Slide 58: Context-Dependent Best-First Parser. The best-first approach may improve efficiency, but it cannot improve the accuracy of the probabilistic parser. Improvements are possible if we use a simple alternative for computing rule probabilities that uses more context-dependent lexical information. In particular, we exploit the observation that the first word in a constituent is often its head and, as such, exerts an influence on the probabilities of the rules that account for its complements.

Slide 59: Simple Addition of Lexical Context. Hence, we can use the following measure, which takes the first word of the constituent into account: P(R | Ni, w) = C(rule R generates an Ni that starts with w) / C(Ni starts with w). This makes the probabilities sensitive to more lexical information, which is valuable: it is unusual for singular nouns to be used by themselves, and plurals are rarely used to modify other nouns.
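
A small sketch of this context-dependent estimator; the rule occurrences fed to it are invented solely to illustrate the house/peaches contrast mentioned on the next slide.

    from collections import Counter

    def context_rule_probs(rule_occurrences):
        """P(R | Ni, w) = C(rule R builds an Ni starting with w) / C(an Ni starts with w).
        rule_occurrences is a list of (lhs, rhs, first_word) triples from a parsed corpus."""
        rule_w = Counter(rule_occurrences)
        lhs_w = Counter((lhs, w) for lhs, _, w in rule_occurrences)
        return {(lhs, rhs, w): c / lhs_w[(lhs, w)]
                for (lhs, rhs, w), c in rule_w.items()}

    # Hypothetical counts: NPs starting with "peaches" are usually just N,
    # while NPs starting with "house" usually modify another noun (N N).
    occurrences = ([("NP", ("N",), "peaches")] * 8 + [("NP", ("N", "N"), "peaches")] * 2 +
                   [("NP", ("N",), "house")] * 1 + [("NP", ("N", "N"), "house")] * 9)
    probs = context_rule_probs(occurrences)
    print(probs[("NP", ("N",), "peaches")], probs[("NP", ("N", "N"), "house")])  # 0.8 0.9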

Slide 60: Example. Notice the probabilities for the two rules NP → N and NP → N N, given the words house and peaches. Context-sensitive rules can also encode verb subcategorization preferences.

Slide 61: Accuracy and Size.

Slide 62: Source of Improvement. To see why the context-dependent parser does better, consider the attachment decision that must be made for sentences like "The man put the bird in the house" and "The man likes the bird in the house". The basic PCFG will assign the same structure to both, i.e., it will attach the PP to the VP in each case, whereas the context-sensitive approach will exploit the subcategorization preferences of the verbs.

Slide 63: Chart for "put the bird in the house". The probability of attaching the PP to the VP is .54 (.93 × .99 × .76 × .76), compared to .0038 for the alternative.

Slide 64: Chart for "likes the bird in the house". The probability of attaching the PP to the VP is .054, compared to .1 for the alternative.

Slide 65: Room for Improvement. The question is just how accurate this approach can become; 33% error is too high. Additional information could be added to improve performance. One could use probabilities conditioned on a larger fragment of the input (e.g., bigrams or trigrams over the beginnings of rules), though this would require more training data. Attachment decisions depend not only on the word that a PP attaches to but also on the PP itself; hence, we could use a more complex estimate based on the previous category, the head verb, and the preposition. This creates a complication in that one cannot easily compare across rules: VP → V NP PP is evaluated using a verb and a preposition, but VP → V NP has no preposition.

Slide 66: Room for Improvement. In general, the more selective the lexical categories, the more predictive the estimates can be, assuming there is enough data. Clearly, function words such as prepositions, articles, quantifiers, and conjunctions can receive individual treatment; however, open-class words like nouns and verbs are too numerous for individual handling. Words may be grouped based on similarity, or useful classes can be learned by analyzing corpora.

