1 Probabilistic Context Free Grammars
Chris Brew, Ohio State University

2 Context Free Grammars
- HMMs are sophisticated tools for language modelling based on finite state machines.
- Context-free grammars go beyond FSMs.
- They can encode longer range dependencies than FSMs.
- They too can be made probabilistic.

3 An example
s -> np vp
s -> np vp pp
np -> det n
np -> np pp
vp -> v np
pp -> p np
n -> girl
n -> boy
n -> park
n -> telescope
v -> saw
p -> with
p -> in
Sample sentence: "The boy saw the girl in the park with the telescope"

4 Multiple analyses
- 2 of the 5 analyses are shown on the original slide. [Tree diagrams not reproduced in this transcript]

5 How serious is this ambiguity?
- Very serious: ambiguities in different places multiply.
- It is easy to get millions of analyses for simple-seeming sentences.
- Maybe we can use probabilities to disambiguate, just as we chose among exponentially many paths through an FSM.
- Fortunately, similar techniques apply.

6 Probabilistic Context Free Grammars
- Same as context free grammars, with one extension:
  – Where there is a choice of productions for a non-terminal, give each alternative a probability.
  – For each choice point, the probabilities of the available options sum to 1.
  – i.e. the production probability is p(rhs|lhs).

7 An example
s -> np vp : 0.8
s -> np vp pp : 0.2
np -> det n : 0.5
np -> np pp : 0.5
vp -> v np : 1.0
pp -> p np : 1.0
n -> girl : 0.25
n -> boy : 0.25
n -> park : 0.25
n -> telescope : 0.25
v -> saw : 1.0
p -> with : 0.5
p -> in : 0.5
Sample sentence: "The boy saw the girl in the park with the telescope"
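
As a quick check on the definition on slide 6 (the alternatives for each left-hand side must sum to 1), here is a minimal sketch of this grammar as a Python dict from rule to probability. The rule det -> the and its probability 1.0 are an assumption: the determiner rule is needed for the example sentence but is not shown on the slide.

from collections import defaultdict

# The slide's PCFG as (lhs, rhs) -> probability.
PCFG = {
    ("s", ("np", "vp")): 0.8,   ("s", ("np", "vp", "pp")): 0.2,
    ("np", ("det", "n")): 0.5,  ("np", ("np", "pp")): 0.5,
    ("vp", ("v", "np")): 1.0,   ("pp", ("p", "np")): 1.0,
    ("n", ("girl",)): 0.25,     ("n", ("boy",)): 0.25,
    ("n", ("park",)): 0.25,     ("n", ("telescope",)): 0.25,
    ("v", ("saw",)): 1.0,
    ("p", ("with",)): 0.5,      ("p", ("in",)): 0.5,
    ("det", ("the",)): 1.0,     # assumed, not on the slide
}

# For each choice point (lhs), the probabilities of its options sum to 1.
totals = defaultdict(float)
for (lhs, rhs), p in PCFG.items():
    totals[lhs] += p
assert all(abs(total - 1.0) < 1e-9 for total in totals.values())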

8 The "low" attachment
p("np vp"|s) * p("det n"|np) * p("the"|det) * p("boy"|n) * p("v np"|vp) * p("det n"|np) * p("the"|det) * ...

9 The "high" attachment
p("np vp pp"|s) * p("det n"|np) * p("the"|det) * p("boy"|n) * p("v np"|vp) * p("det n"|np) * p("the"|det) * ...
Note: I'm not claiming that this matches any particular set of psycholinguistic claims, only that the formalism allows such distinctions to be made.
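
To make the high/low comparison concrete, here is an illustrative computation rather than anything from the slides: the sentence is shortened to "the boy saw the girl in the park" so that there is exactly one PP to attach, and det -> the is taken to have probability 1.0 as assumed in the sketch above.

from math import prod

# Low attachment: the pp modifies the object np via np -> np pp.
low = prod([
    0.8,               # s -> np vp
    0.5, 1.0, 0.25,    # np -> det n, det -> the, n -> boy
    1.0, 1.0,          # vp -> v np, v -> saw
    0.5,               # np -> np pp
    0.5, 1.0, 0.25,    # np -> det n, det -> the, n -> girl
    1.0, 0.5,          # pp -> p np, p -> in
    0.5, 1.0, 0.25,    # np -> det n, det -> the, n -> park
])

# High attachment: the pp is attached at the top via s -> np vp pp.
high = prod([
    0.2,               # s -> np vp pp
    0.5, 1.0, 0.25,    # np -> det n, det -> the, n -> boy
    1.0, 1.0,          # vp -> v np, v -> saw
    0.5, 1.0, 0.25,    # np -> det n, det -> the, n -> girl
    1.0, 0.5,          # pp -> p np, p -> in
    0.5, 1.0, 0.25,    # np -> det n, det -> the, n -> park
])

print(low, high)       # 0.000390625 vs 0.0001953125: low attachment is twice as likely here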

10 Generating from Probabilistic Context Free Grammars
- Start with the distinguished symbol "s".
- Choose a way of expanding "s".
  – This introduces new non-terminals (e.g. "np", "vp").
- Choose ways of expanding these.
- Carry on until no non-terminals remain.
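
A minimal sketch of that generation procedure, reusing the PCFG dict from the sketch after slide 7 (so the det -> the rule is again an assumption):

import random
from collections import defaultdict

RULES = defaultdict(list)                 # lhs -> [(rhs, probability), ...]
for (lhs, rhs), p in PCFG.items():
    RULES[lhs].append((rhs, p))

def generate(symbol="s"):
    if symbol not in RULES:               # a terminal: emit the word itself
        return [symbol]
    rhss, probs = zip(*RULES[symbol])
    rhs = random.choices(rhss, weights=probs)[0]   # choose one expansion
    return [word for child in rhs for word in generate(child)]

print(" ".join(generate()))               # e.g. "the boy saw the park"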

11 Issues
- The space of possible trees is infinite.
  – But the sum of probabilities over all trees is 1.
- There is a strong assumption built into the model:
  – The expansion probability is independent of the position of the non-terminal within the tree.
  – This assumption is questionable.

12 Training for Probabilistic Context Free Grammars
- Supervised: you have a treebank.
- Unsupervised: you have only words.
- In between: Pereira and Schabes.

13 Supervised Training
- Look at the trees in your corpus.
- Count the number of times each lhs -> rhs occurs.
- Divide these counts by the number of times each lhs occurs.
- Maybe smooth, as described in the lecture on probability estimation from counts.
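
A minimal sketch of that recipe. The two toy trees standing in for a treebank, and the nested-tuple tree representation (label, child, child, ...), are made up here for illustration:

from collections import defaultdict

def rules(tree):
    """Yield (lhs, rhs) for every production used in a tree."""
    if isinstance(tree, str):                     # a word: no production
        return
    lhs, children = tree[0], tree[1:]
    yield lhs, tuple(c if isinstance(c, str) else c[0] for c in children)
    for child in children:
        yield from rules(child)

treebank = [
    ("s", ("np", ("det", "the"), ("n", "boy")),
          ("vp", ("v", "saw"), ("np", ("det", "the"), ("n", "girl")))),
    ("s", ("np", ("det", "the"), ("n", "girl")),
          ("vp", ("v", "saw"), ("np", ("det", "the"), ("n", "park")))),
]

rule_counts, lhs_counts = defaultdict(int), defaultdict(int)
for tree in treebank:
    for lhs, rhs in rules(tree):
        rule_counts[lhs, rhs] += 1                # count each lhs -> rhs
        lhs_counts[lhs] += 1                      # and each lhs

probs = {(lhs, rhs): c / lhs_counts[lhs] for (lhs, rhs), c in rule_counts.items()}
print(probs["n", ("boy",)])                       # 0.25: 1 of the 4 n expansions in this toy treebank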

14 Unsupervised Training
- These are Rabiner's problems, but for PCFGs:
  – Calculate the probability of a corpus given a model.
  – Guess the sequence of states passed through.
  – Adapt the model to the corpus.

15 Hidden Trees
- All you see is the output:
  – "The boy saw the girl in the park"
- But you can't tell which of several trees led to that sentence.
- Each tree may have a different probability, although trees which use the same rules the same number of times must give the same answer.
- You don't know which state you are in.

16 The three problems
- Probability estimation
  – Given a sequence of observations O and a grammar G, find P(O|G).
- Best tree estimation
  – Given a sequence of observations O and a grammar G, find a tree which maximizes P(O,Tree|G).

17 The third problem
- Training
  – Adjust the model parameters so that P(O|G) is as large as possible for the given O. This is hard because there are so many adjustable parameters which could vary; it is worse than for HMMs, with more local maxima.

18 Probability estimation
- Easy in principle: marginalize out the trees, leaving the probability of strings.
- But this involves a sum over exponentially many trees.
- The efficient algorithm keeps track of inside and outside probabilities.

19 Inside Probability
- The probability that non-terminal NT expands to the words between i and j.
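
In standard notation (writing \beta for inside probabilities; the recursion below additionally assumes a grammar in Chomsky normal form, which the slides do not require), the definition and recursion are usually given as:

  \beta_{NT}(i, j) = P(NT \Rightarrow^* w_i \dots w_j)
  \beta_{A}(i, j) = \sum_{A \to B\,C} \sum_{k=i}^{j-1} P(A \to B\,C)\, \beta_B(i, k)\, \beta_C(k+1, j)
  \beta_{A}(i, i) = P(A \to w_i)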

20 Outside probability
- Dual of inside probability.
[Figure: tree diagram with an NP spanning the words "SENT A LETTER" between positions i and j, in the context "A MAN ..."; not reproduced in this transcript]
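
In the same notation, the usual definition of the outside probability (added here, not shown on the slide) is:

  \alpha_{NT}(i, j) = P(S \Rightarrow^* w_1 \dots w_{i-1}\; NT\; w_{j+1} \dots w_n)

i.e. the probability of generating everything outside the span from i to j, leaving a hole for NT.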

21 Corpus probability
- The inside probability of the S node over the entire string is the probability of all ways of making a sentence over that string.
- The product over all strings in the corpus is the corpus probability.
- You can also get the corpus probability from outside probabilities.
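
In the notation above, for a sentence w_1 ... w_n and a corpus of sentences w^{(1)}, ..., w^{(D)}:

  P(w_1 \dots w_n \mid G) = \beta_S(1, n)
  P(\text{corpus} \mid G) = \prod_{d=1}^{D} P(w^{(d)} \mid G)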

22 Training
- Uses inside and outside probabilities.
- Starts from an initial guess.
- Improves the initial guess using data.
- Stops at a (locally) best model.
- A specialization of the EM algorithm.

23 Expected rule counts
- Consider p(uses rule lhs -> rhs to cover i through j).
- Four things need to happen:
  – Generate the outside words, leaving a hole for lhs.
  – Choose the correct rhs.
  – Generate the words seen between i and k from the first item in rhs (inside probability).
  – Generate the words seen between k and j using the other items in rhs (more inside probabilities).
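
For a binary rule A -> B C, putting those four pieces together and summing over the possible spans gives the textbook inside-outside expected count (the formula itself is an addition here, not copied from the slide):

  c(A \to B\,C) = \frac{1}{P(O \mid G)} \sum_{i \le k < j} \alpha_A(i, j)\; P(A \to B\,C)\; \beta_B(i, k)\; \beta_C(k+1, j)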

24 Refinements
- In practice there are very many local maxima, so strategies which involve generating hundreds of thousands of rules may fail badly.
- Pereira and Schabes discovered that letting the system know some limited information about bracketing is enough to guide it to correct answers.
- Different grammar formalisms (TAGs, Categorial Grammars, ...).

25 A basic parsing algorithm
- The simplest statistical parsing algorithm is called CYK or CKY.
- It is a statistical variant of a bottom-up tabular parsing algorithm that you should have seen in 684.01.
- It (somewhat surprisingly) turns out to be closely related to the problem of multiplying matrices.

26 Basic CKY (review)
- Assume we have organized the lexicon as a function
  lexicon: string -> nonterminal set
- Organize these nonterminals into the relevant parts of a two-dimensional array indexed by the left and right end of the item:

for i = 1 to length(sentence) do
  chart[i, i+1] = lexicon(sentence[i])
endfor

27 Basic CKY
- Assume we have organized the grammar as a function
  grammar: nonterminal -> nonterminal -> nonterminal set

28 Basic CKY
- Build up new entries from existing entries, working from shorter entries to longer ones:

for l = 2 to length(sentence) do              // l is the length of the constituent
  for s = 1 to length(sentence) - l + 1 do    // s is the start of the constituent
    for t = 1 to l - 1 do                     // s + t is the split point
      (left, mid, right) = (s, s + t, s + l)
      chart[left, right] = union(chart[left, right], combine(chart[left, mid], chart[mid, right]))
    endfor
  endfor
endfor

29 Basic CKY
- Combine is

fun combine(set1, set2)
  result = empty
  for item1 in set1 do
    for item2 in set2 do
      result = union(result, grammar(item1, item2))
    endfor
  endfor
  return result
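
Pulling slides 26 to 29 together, here is a minimal runnable Python sketch (0-indexed, unlike the pseudocode above). The binary toy grammar and lexicon are assumptions based on slide 3; the ternary rule s -> np vp pp is left out because this basic CKY only ever combines two chart cells at a time.

from collections import defaultdict

# lexicon: word -> set of non-terminals
LEXICON = {
    "the": {"det"}, "boy": {"n"}, "girl": {"n"}, "park": {"n"},
    "telescope": {"n"}, "saw": {"v"}, "in": {"p"}, "with": {"p"},
}

# grammar: (left child, right child) -> set of parent labels
GRAMMAR = defaultdict(set)
for parent, left, right in [
    ("s", "np", "vp"), ("np", "det", "n"), ("np", "np", "pp"),
    ("vp", "v", "np"), ("pp", "p", "np"),
]:
    GRAMMAR[(left, right)].add(parent)

def combine(set1, set2):
    """All labels buildable from a label in set1 followed by a label in set2."""
    result = set()
    for item1 in set1:
        for item2 in set2:
            result |= GRAMMAR[(item1, item2)]
    return result

def cky(words):
    n = len(words)
    chart = defaultdict(set)                 # chart[i, j] = labels spanning words[i:j]
    for i, w in enumerate(words):            # length-1 spans come from the lexicon
        chart[i, i + 1] = set(LEXICON.get(w, ()))
    for length in range(2, n + 1):           # build longer spans from shorter ones
        for start in range(n - length + 1):
            end = start + length
            for mid in range(start + 1, end):
                chart[start, end] |= combine(chart[start, mid], chart[mid, end])
    return chart

chart = cky("the boy saw the girl in the park".split())
print("s" in chart[0, 8])                    # True: the string is accepted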

30 Going statistical
- The basic algorithm tracks labels for each substring of the input.
- The cell contents are sets of labels.
- A statistical version keeps track of labels and their probabilities.
- Now the cell contents must be weighted sets.

31 Going statistical
- Make the grammar and lexicon produce weighted sets:
  lexicon: word -> real*nt set
  grammar: real*nt -> real*nt -> real*nt set
- We now need an operation corresponding to set union for weighted sets.
- {s:0.1, np:0.2} WU {s:0.2, np:0.1} = ???

32 Going statistical (one way)
{s:0.1, np:0.2} WU {s:0.2, np:0.1} = {s:0.3, np:0.3}
If we implement this, we get a parser that calculates the inside probability for each label on each span.

33 Going statistical (another way)
{s:0.1, np:0.2} WU {s:0.2, np:0.1} = {s:0.2, np:0.2}
If we implement this, we get a parser that calculates the best-parse probability for each label on each span.
The difference is that in one case we are combining weights with +, while in the second we use max.
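
A minimal sketch of the two weighted unions, using the numbers from these slides; the cells are Python dicts from label to weight, and the only choice is which operator merges weights for a shared label:

def weighted_union(set1, set2, op):
    """Merge two {label: weight} dicts, combining weights for shared labels with op."""
    result = dict(set1)
    for label, weight in set2.items():
        result[label] = op(result[label], weight) if label in result else weight
    return result

a = {"s": 0.1, "np": 0.2}
b = {"s": 0.2, "np": 0.1}
print(weighted_union(a, b, lambda x, y: x + y))   # + : s ~ 0.3, np ~ 0.3 (inside probabilities)
print(weighted_union(a, b, max))                  # max: s = 0.2, np = 0.2 (best parse)

Swapping the operator is the whole difference between slide 32 and slide 33.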

34 Building trees
- Make the cell contents be sets of trees.
- Make the lexicon be a function from words to little trees.
- Make the grammar be a function from pairs of trees to sets of newly created (bigger) trees.
- Set union is now over sets of trees.
- Nothing else needs to change.

35 Building weighted trees
- Make the cell contents be sets of trees, labelled with probabilities.
- Make the lexicon be a function from words to weighted (little) trees.
- Make the grammar be a function from pairs of weighted trees to sets of newly created (bigger) trees.
- Set union is now over sets of weighted trees.
- Again we have a choice of + or max, to get either the parse forest (inside probabilities) or just the best parse.

36 Where to get more information
- Roark and Sproat, ch. 7
- Charniak, chapters 5 and 6
- Allen, Natural Language Understanding, ch. 7
- Lisp code associated with Natural Language Understanding
- Goodman, "Semiring Parsing" (http://www.aclweb.org/anthology/J99-1004)

