
1 Learning PCFGs: Estimating Parameters, Learning Grammar Rules Many slides are taken or adapted from slides by Dan Klein

2 Treebanks An example tree from the Penn Treebank

3 The Penn Treebank
1 million tokens in 50,000 sentences, each labeled with
– a POS tag for each token
– labeled constituents
– "extra" information: phrase annotations like "TMP", and "empty" constituents for wh-movement traces and empty subjects in raising constructions

4 Supervised PCFG Learning
1. Preprocess the treebank
   1. Remove all "extra" information (empties, extra annotations)
   2. Convert to Chomsky Normal Form
   3. Possibly prune some punctuation, lower-case all words, compute word "shapes", and do other processing to combat sparsity
2. Count the occurrences of each nonterminal, c(N), and of each observed production, c(N -> N_L N_R) and c(N -> w)
3. Set the probability of each rule to its MLE:
   P(N -> N_L N_R) = c(N -> N_L N_R) / c(N)
   P(N -> w) = c(N -> w) / c(N)
Easy, peasy, lemon-squeezy.
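
A minimal sketch of steps 2 and 3 (not from the slides), assuming the treebank is already in CNF; the `Tree` interface used here (`preorder()`, `is_leaf()`, `label`, `children`, `word`) is hypothetical.

```python
from collections import defaultdict

def count_rules(trees):
    """Step 2: count c(N) and c(N -> rhs) over a CNF treebank."""
    nt_counts = defaultdict(int)     # c(N)
    rule_counts = defaultdict(int)   # c(N -> N_L N_R) and c(N -> w)
    for tree in trees:
        for node in tree.preorder():            # hypothetical Tree interface
            if node.is_leaf():
                continue
            nt_counts[node.label] += 1
            if len(node.children) == 2:         # binary rule N -> N_L N_R
                rhs = (node.children[0].label, node.children[1].label)
            else:                               # lexical rule N -> w
                rhs = (node.children[0].word,)
            rule_counts[(node.label, rhs)] += 1
    return nt_counts, rule_counts

def mle_rule_probs(nt_counts, rule_counts):
    """Step 3: P(N -> rhs) = c(N -> rhs) / c(N)."""
    return {(lhs, rhs): c / nt_counts[lhs]
            for (lhs, rhs), c in rule_counts.items()}
```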

5 Complications
Smoothing
– Especially for lexicalized grammars, many test productions will never be observed during training
– We don't necessarily want to assign these productions zero probability
– Instead, define backoff distributions, e.g.:
  P_final(VP[transmogrified] -> V[transmogrified] PP[into]) = α P(VP[transmogrified] -> V[transmogrified] PP[into]) + (1 - α) P(VP -> V PP[into])
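
A small sketch of this interpolation, assuming the lexicalized and unlexicalized rule distributions are given as dictionaries; the rule encodings below are illustrative, and α would typically be tuned on held-out data.

```python
def backoff_rule_prob(lex_probs, backoff_probs, alpha=0.9):
    """Interpolate a lexicalized rule distribution with an unlexicalized backoff:
    P_final(rule) = alpha * P_lex(rule) + (1 - alpha) * P_backoff(backoff rule)."""
    def p_final(lex_rule, backoff_rule):
        return (alpha * lex_probs.get(lex_rule, 0.0)
                + (1.0 - alpha) * backoff_probs.get(backoff_rule, 0.0))
    return p_final

# Usage with the rule from this slide (illustrative encodings):
# p = backoff_rule_prob(lex_probs, backoff_probs)
# p(("VP[transmogrified]", ("V[transmogrified]", "PP[into]")),
#   ("VP", ("V", "PP[into]")))
```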

6 Problems with Supervised PCFG Learning
Coming up with labeled data is hard!
– Time-consuming
– Expensive
– Hard to adapt to new domains, tasks, languages
– Corpus availability drives research (instead of tasks driving the research)
The Penn Treebank took many person-years to annotate manually.

7 Unsupervised Learning of PCFGs: Feasible?

8 Unsupervised Learning
Systems take raw data and automatically detect structure in it
Why?
– More data is available
– Kids learn (some aspects of) language with no supervision
– Insights into machine learning and clustering

9 Grammar Induction and Learnability
Some have argued that learning syntax from positive data alone is impossible:
– Gold, 1967: non-identifiability in the limit
– Chomsky, 1980: poverty of the stimulus
Surprising result: it's possible to get entirely unsupervised parsing to work (reasonably) well.

10 Learnability
Learnability: formal conditions under which a class of languages can be learned
Setup:
– Class of languages Λ
– Algorithm H (the learner)
– H sees a sequence X of strings x_1 ... x_n
– H maps sequences X to languages L in Λ
The question: for what classes Λ do learners H exist?

11 Learnability [Gold, 1967]
Criterion: identification in the limit
– A presentation of L is an infinite sequence of x's from L in which each x occurs at least once
– A learner H identifies L in the limit if, for any presentation of L, from some point n onwards, H always outputs L
– A class Λ is identifiable in the limit if there is some single H which correctly identifies in the limit every L in Λ
Example: the class Λ = {{a}, {a, b}} is identifiable in the limit.
Theorem (Gold, 1967): any Λ which contains all finite languages and at least one infinite language (i.e., is superfinite) is unlearnable in this sense.
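
A toy illustration (not from the slides) of why the two-language class in the example is identifiable in the limit:

```python
def learner(presentation):
    """Learner for the class {{a}, {a, b}}: guess {a} until a 'b' appears,
    then guess {a, b} forever after."""
    guess = {"a"}
    for x in presentation:        # an infinite stream of strings from L
        if x == "b":
            guess = {"a", "b"}
        yield guess               # hypothesis after seeing this example

# On any presentation of {a}, the learner outputs {a} at every step.
# On any presentation of {a, b}, 'b' must eventually occur (every string of L
# appears at least once), after which the learner outputs {a, b} forever.
```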

12 Learnability [Gold, 1967]
Proof sketch
– Assume Λ is superfinite and H identifies Λ in the limit
– Then there exists a chain L_1 ⊂ L_2 ⊂ ... ⊂ L_∞ within Λ
– Construct the following misleading sequence:
  Present strings from L_1 until H outputs L_1
  Present strings from L_2 until H outputs L_2
  ...
– This is a presentation of L_∞, but H never outputs L_∞, a contradiction

13 Learnability [Horning, 1969]
Problem: identification in the limit requires that H succeed on all presentations, even the weird ones
Another criterion: measure-one identification
– Assume a distribution P_L(x) for each L
– Assume P_L(x) puts non-zero probability on all and only the x in L
– Assume an infinite presentation of x drawn i.i.d. from P_L(x)
– H measure-one identifies L if the probability of drawing a sequence X from which H can identify L is 1
Theorem (Horning, 1969): PCFGs can be identified in this sense.
– Note: there can still be misleading sequences, but they have to be (infinitely) unlikely

14 Learnability [Horning, 1969]
Proof sketch
– Assume Λ is a recursively enumerable set of recursive languages (e.g., the set of all PCFGs)
– Assume an ordering on all strings: x_1 < x_2 < ...
– Define: two sequences A and B agree through n iff for all x < x_n, x is in A iff x is in B
– Define the error set E(L, n, m): all sequences whose first m elements do not agree with L through n
  These are the sequences which contain early strings outside of L (can't happen), or which fail to contain all of the early strings in L (happens less as m increases)
– Claim: P(E(L, n, m)) goes to 0 as m goes to ∞
– Let d_L(n) be the smallest m such that P(E) < 2^(-n)
– Let d(n) be the largest d_L(n) among the first n languages
– Learner: after d(n) examples, pick the first L that agrees with the evidence through n
– This can only fail for a sequence X if X keeps showing up in E(L, n, d(n)), which happens with probability zero (the error probabilities 2^(-n) are summable, so with probability 1 only finitely many of these errors occur)

15 Learnability
Gold's results say little about real learners (the requirements are too strong)
Horning's algorithm is completely impractical
– It needs astronomical amounts of data
Even measure-one identification says nothing about tree structures
– It only talks about learning grammatical sets
– Strong generative vs. weak generative capacity

16 Unsupervised POS Tagging
Some (discouraging) experiments [Merialdo 94]
Setup:
– You know the set of allowable tags for each word (but not the frequency of each tag)
– Learn a supervised model on k training sentences: learn P(w|t) and P(t_i | t_{i-1}, t_{i-2}) on these sentences
– On n > k sentences, re-estimate with EM
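
A rough sketch of the supervised half of this setup; the input format (a list of [(word, tag), ...] sentences) is an assumption, and the EM re-estimation on the unlabeled sentences (forward-backward, restricted to each word's allowable tags) is not shown.

```python
from collections import defaultdict

def supervised_estimates(tagged_sentences):
    """Counts for P(w | t) and P(t_i | t_{i-1}, t_{i-2}) from the k labeled
    sentences, plus the dictionary of allowable tags per word."""
    emit = defaultdict(lambda: defaultdict(int))    # emit[t][w]
    trans = defaultdict(lambda: defaultdict(int))   # trans[(t_{i-2}, t_{i-1})][t_i]
    tag_dict = defaultdict(set)                     # allowable tags per word
    for sent in tagged_sentences:
        prev2, prev1 = "<s>", "<s>"
        for w, t in sent:
            emit[t][w] += 1
            trans[(prev2, prev1)][t] += 1
            tag_dict[w].add(t)
            prev2, prev1 = prev1, t
    return emit, trans, tag_dict
```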

17 Merialdo: Results

18 Grammar Induction Unsupervised Learning of Grammars and Parameters

19 Right-Branching Baseline
In English (but not necessarily in other languages), trees tend to be right-branching.
A simple, English-specific baseline is to choose the right-chain structure for each sentence.
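
A minimal sketch of this baseline, assuming a sentence is just a list of word tokens and nested pairs stand in for binary brackets.

```python
def right_branching(words):
    """Right-chain baseline: bracket a sentence as (w1 (w2 (w3 ... wn)))."""
    if len(words) == 1:
        return words[0]
    return (words[0], right_branching(words[1:]))

# right_branching(["the", "cat", "sat", "on", "the", "mat"])
# -> ('the', ('cat', ('sat', ('on', ('the', 'mat')))))
```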

20 Distributional Clustering

21 Nearest Neighbors

22 Learning PCFGs with EM [Lari and Young, 1990]
Setup:
– Full binary grammar with n nonterminals {X_1, ..., X_n} (that is, at the beginning the grammar has all possible rules)
– Parse uniformly/randomly at first
– Re-estimate rule expectations from the parses
– Repeat
Their conclusion: it doesn't really work.

23 EM for PCFGs: Details
1. Start with a "full" grammar, with all possible binary rules over nonterminals N_1 ... N_k. Designate one of them as the start symbol, say N_1
2. Assign some starting distribution to the rules, such as
   1. random
   2. uniform
   3. some "smart" initialization technique (see assigned reading)
3. E-step: take an unannotated sentence S and compute, for all nonterminals N, N_L, N_R and all terminals w:
   E(N used | S), E(N -> N_L N_R used | S), E(N -> w used | S)
4. M-step: reset rule probabilities to the MLE:
   P(N -> N_L N_R) = E(N -> N_L N_R used | S) / E(N used | S)
   P(N -> w) = E(N -> w used | S) / E(N used | S)
5. Repeat steps 3 and 4 until the rule probabilities stabilize ("converge")
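
A skeleton of the loop in steps 3-5 (a sketch, not Lari and Young's implementation); the `expected_counts` argument is a hypothetical E-step routine that returns the inside-outside expectations described on the next slides.

```python
from collections import defaultdict

def em_pcfg(sentences, rules, expected_counts, n_iters=20):
    """EM skeleton for a 'full' binary PCFG.

    rules: dict mapping (lhs, rhs) -> probability.
    expected_counts(sentence, rules): hypothetical E-step (inside-outside)
    returning (E[rule used | S], E[nonterminal used | S]) for one sentence.
    """
    for _ in range(n_iters):
        rule_exp = defaultdict(float)   # accumulated E(N -> rhs used)
        nt_exp = defaultdict(float)     # accumulated E(N used)
        for s in sentences:
            e_rules, e_nts = expected_counts(s, rules)      # E-step
            for r, v in e_rules.items():
                rule_exp[r] += v
            for n, v in e_nts.items():
                nt_exp[n] += v
        # M-step: MLE from expected counts
        rules = {(lhs, rhs): v / nt_exp[lhs]
                 for (lhs, rhs), v in rule_exp.items() if nt_exp[lhs] > 0}
    return rules
```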

24 E-Step
Let π = P(S) be the total probability of the sentence, and for each nonterminal N_j and span (p, q) define
– the inside probability β_j(p, q) = P(w_p ... w_q | N_j spans p..q)
– the outside probability α_j(p, q) = P(w_1 ... w_{p-1}, N_j spans p..q, w_{q+1} ... w_m)
We can define the expectations we want in terms of the π, α, β quantities:
E(N_j used | S) = Σ_{p ≤ q} α_j(p, q) β_j(p, q) / π
E(N_j -> N_l N_r used | S) = Σ_{p < q} Σ_{d=p}^{q-1} α_j(p, q) P(N_j -> N_l N_r) β_l(p, d) β_r(d+1, q) / π
E(N_j -> w used | S) = Σ_{h : w_h = w} α_j(h, h) P(N_j -> w) / π

25 Inside Probabilities
Base case: β_j(p, p) = P(N_j -> w_p)
Induction: β_j(p, q) = Σ_{l, r} Σ_{d=p}^{q-1} P(N_j -> N_l N_r) β_l(p, d) β_r(d+1, q)
(Picture: N_j spans w_p ... w_q, split into N_l over w_p ... w_d and N_r over w_{d+1} ... w_q)
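
A direct, unoptimized transcription of these recurrences (a sketch, with rules stored as `rules[(N, rhs)] = prob` and 1-indexed spans).

```python
from collections import defaultdict

def inside(words, rules, nonterminals):
    """Inside probabilities beta[j, p, q] for a CNF grammar.
    rules maps (N, (N_l, N_r)) or (N, (w,)) to a probability."""
    m = len(words)
    beta = defaultdict(float)
    # Base case: beta_j(p, p) = P(N_j -> w_p)
    for p in range(1, m + 1):
        for j in nonterminals:
            beta[j, p, p] = rules.get((j, (words[p - 1],)), 0.0)
    # Induction: build longer spans from shorter ones
    for span in range(2, m + 1):
        for p in range(1, m - span + 2):
            q = p + span - 1
            for (j, rhs), prob in rules.items():
                if len(rhs) != 2:
                    continue
                l, r = rhs
                for d in range(p, q):
                    beta[j, p, q] += prob * beta[l, p, d] * beta[r, d + 1, q]
    return beta   # sentence probability: pi = beta[start_symbol, 1, m]
```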

26 Outside Probabilities
Base case: α_1(1, m) = 1 and α_j(1, m) = 0 for j ≠ 1 (only the start symbol spans the whole sentence)
Induction: α_j(p, q) = Σ_{f, g} Σ_{e=q+1}^{m} α_f(p, e) P(N_f -> N_j N_g) β_g(q+1, e) + Σ_{f, g} Σ_{e=1}^{p-1} α_f(e, q) P(N_f -> N_g N_j) β_g(e, p-1)
(Picture: N_j as the left or right child of a larger constituent N_f, with its sibling N_g covering the rest of N_f's span)
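
The matching outside computation (same sketch conventions), reusing the inside table from the previous block; the `start` parameter stands in for the designated start symbol N_1.

```python
from collections import defaultdict

def outside(words, rules, nonterminals, beta, start):
    """Outside probabilities alpha[j, p, q], computed top-down (widest spans
    first) from the inside table beta."""
    m = len(words)
    alpha = defaultdict(float)
    for j in nonterminals:                       # base case
        alpha[j, 1, m] = 1.0 if j == start else 0.0
    for span in range(m - 1, 0, -1):             # induction, widest spans first
        for p in range(1, m - span + 2):
            q = p + span - 1
            for (f, rhs), prob in rules.items():
                if len(rhs) != 2:
                    continue
                l, r = rhs
                # N_j is the left child of N_f: parent spans (p, e), sibling (q+1, e)
                for e in range(q + 1, m + 1):
                    alpha[l, p, q] += alpha[f, p, e] * prob * beta[r, q + 1, e]
                # N_j is the right child of N_f: parent spans (e, q), sibling (e, p-1)
                for e in range(1, p):
                    alpha[r, p, q] += alpha[f, e, q] * prob * beta[l, e, p - 1]
    return alpha
```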

27 Problem: Model Symmetries

28 Distributional Syntax?

29 Problem: Identifying Constituents

30 A Nested Distributional Model
We'd like a model that
– ties spans to linear contexts (like distributional clustering)
– considers only proper tree structures (like PCFGs)
– has no symmetries to break (like a dependency model)

31 Constituent Context Model (CCM)

32 Results: Constituency

33 Results: Dependencies

34 Results: Combined Models

35 Multilingual Results

