
1
Learning grammars? Chris Brew The Ohio State University

2
Learning, May 2007 Chris Brew: Grammar Learning
The most pressing problem in NLP (at least for me)
Broad coverage natural language grammars are difficult to create.
They are very difficult to maintain or to tune to new domains.
If possible, we would like to learn most of the knowledge directly from the data.

3
The real world
You need to choose a probabilistic model of the data.
But the number of such models is too large to consider exhaustively.
And the data you have is never enough to reliably make all the distinctions that you would like.

4
Simplify the search problem
Simplify the search space by restricting the class of models.
Reduce the cost of search using heuristics.
Do both.

5
Restricting the class of models
Independence assumptions
(Hidden) Markov models
Stochastic CFGs
Decomposable models
Very general models (e.g. log-linear)

6
Heuristics for model search
Information gain
Description length
Error-driven (Brill)
Use maximum of distribution in place of integral
Use sample of distribution to estimate integral

7
Techniques for model search
Model merging
Improved iterative scaling
Stochastic simulation

8
Bayesian model merging
Stolcke, “Bayesian Learning of Probabilistic Language Models”, Berkeley.
Addresses the language learning issue directly.
Approach applies to HMMs, PCFGs and a simple variety of feature grammar.

9
Priors
You are given a coin…
[Figure panels: Casino Prior; Ordinary Prior; N=0; Posterior (N=50)]
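The coin example can be made concrete with a conjugate Beta prior. The following is only a sketch under stated assumptions, not a reproduction of the slide's figures: the two prior settings (`ordinary`, `casino`) and the 35/15 outcome of the N=50 flips are hypothetical values chosen for illustration.

```python
def beta_posterior(alpha, beta, heads, tails):
    """Conjugate update: Beta(alpha, beta) prior plus observed coin flips."""
    return alpha + heads, beta + tails


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution, i.e. the estimate of P(heads)."""
    return alpha / (alpha + beta)


# Hypothetical priors: an "ordinary" coin is strongly believed to be fair,
# a "casino" coin gets a weak prior that leaves room for heavy bias.
priors = {"ordinary": (20.0, 20.0), "casino": (0.5, 0.5)}
heads, tails = 35, 15  # hypothetical outcome of N=50 flips

for name, (a, b) in priors.items():
    post = beta_posterior(a, b, heads, tails)
    print(f"{name}: prior mean {beta_mean(a, b):.3f}, "
          f"posterior mean {beta_mean(*post):.3f}")
```

With the same data, the weaker prior moves further towards the observed frequency, which is the point of contrasting the two priors.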

10
Baseline Model
One chain per sample (no generalization)
[Figure: state diagram with nodes I and F; labels a, b, b, b, a, a, 0.5]

11
Merged Model
Collapse transitions
[Figure: state diagram with nodes I and F; labels b, b, b, a, 1, a, 0.5]

12
Final Result
[Figure: state diagram with nodes I and F; labels 2, b, 1, a]

13
How was this achieved?
Used a Dirichlet prior (Bayesian smoothing) for the probabilities.
Description length prior for the model topology.
Zeros (usually) go in the topology.
Likelihood drives you towards large models.
Priors drive you towards small models.
Best-first search with lookahead.
Approximate true probabilities with Viterbi paths.

14
SCFG sample incorporation
Close analogy to what happened with HMMs.
Incorporate a sample x1, x2, x3, x4 by:
Introducing S -> X1 X2 X3 X4
Introducing Xi -> xi for i = 1, 2, 3, 4
Keep counts rather than probabilities.
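Sample incorporation as described on this slide can be sketched in a few lines. The representation is an assumption made for illustration: a grammar is a `Counter` mapping `(lhs, rhs)` pairs to counts, and the helper name `incorporate` and the `X1, X2, …` naming scheme are hypothetical.

```python
from collections import Counter
from itertools import count


def incorporate(grammar, sample, ids):
    """Add S -> X1 ... Xn and Xi -> xi for one sample, storing counts."""
    nonterminals = []
    for word in sample:
        nt = f"X{next(ids)}"            # fresh nonterminal for this token
        grammar[(nt, (word,))] += 1     # lexical production Xi -> xi
        nonterminals.append(nt)
    grammar[("S", tuple(nonterminals))] += 1  # flat top-level production
    return grammar


grammar = Counter()
ids = count(1)
incorporate(grammar, ["a", "b"], ids)
incorporate(grammar, ["a", "a", "b", "b"], ids)

for (lhs, rhs), n in grammar.items():
    print(f"{lhs} -> {' '.join(rhs)}  [count {n}]")
```

Keeping raw counts, rather than normalized probabilities, lets later merging steps add counts together directly.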

15
Nonterminal merging
merge(X1, X2) = Y
Replace RHS occurrences of X1, X2 with Y.
Y has the productions of X1 and X2.
Merge productions which have become identical.
Update counts.
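A minimal sketch of the merging operator, assuming the grammar is stored as a `Counter` from `(lhs, rhs)` pairs to counts. The function name `merge_nonterminals` and the toy grammar are hypothetical; the sketch also assumes nonterminal and terminal names never collide.

```python
from collections import Counter


def merge_nonterminals(grammar, x1, x2, y):
    """Rename x1 and x2 to y everywhere; identical productions pool counts."""
    def rename(symbol):
        return y if symbol in (x1, x2) else symbol

    merged = Counter()
    for (lhs, rhs), n in grammar.items():
        # Productions that become identical after renaming share one key,
        # so their counts are summed.
        merged[(rename(lhs), tuple(rename(s) for s in rhs))] += n
    return merged


toy = Counter({
    ("S", ("X1", "X2")): 1,
    ("S", ("X3", "X2")): 1,
    ("X1", ("a",)): 1,
    ("X3", ("a",)): 1,
    ("X2", ("b",)): 2,
})
merged = merge_nonterminals(toy, "X1", "X3", "A")
for (lhs, rhs), n in merged.items():
    print(f"{lhs} -> {' '.join(rhs)}  [count {n}]")
```

In this toy case five productions collapse to three, which is the kind of size reduction the priors reward.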

16
Chunking
chunk(X1 … X2) = Y
Replace RHS ordered sequences by Y.
Introduce new production for Y.
Update counts.
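The chunking operator can be sketched in the same style. This is an illustration under assumptions: the grammar is a `Counter` from `(lhs, rhs)` pairs to counts, `y` is a fresh nonterminal, and the count given to the new production (the total number of replaced occurrences) is a guess at what "update counts" means here.

```python
from collections import Counter


def chunk(grammar, seq, y):
    """Replace each occurrence of seq on a right-hand side by y, add y -> seq."""
    seq = tuple(seq)
    if len(seq) < 2:
        raise ValueError("a chunk needs at least two symbols")
    out = Counter()
    uses = 0
    for (lhs, rhs), n in grammar.items():
        new_rhs, i = [], 0
        while i < len(rhs):
            if rhs[i:i + len(seq)] == seq:
                new_rhs.append(y)   # collapse the matched sequence
                uses += n           # each match is used n times
                i += len(seq)
            else:
                new_rhs.append(rhs[i])
                i += 1
        out[(lhs, tuple(new_rhs))] += n
    if uses:
        out[(y, seq)] += uses       # new production for the chunk
    return out


toy = Counter({("S", ("A", "B", "A", "B")): 3})
chunked = chunk(toy, ("A", "B"), "Y")
for (lhs, rhs), n in chunked.items():
    print(f"{lhs} -> {' '.join(rhs)}  [count {n}]")
```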

17
Priors
Parameter priors are once again Dirichlet.
The obvious description length prior gives production length a geometric distribution, which feels wrong for NL.
An alternative draws production length from a Poisson distribution.
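The difference between the two length priors is easy to see numerically. The parameter values below (`p = 0.4`, `lam = 3.0`) are hypothetical, and the unshifted Poisson is an assumption; the slides do not say how the distribution is parameterized.

```python
import math


def geometric_pmf(n, p):
    """P(length = n) for n >= 1 under a geometric distribution."""
    return (1 - p) ** (n - 1) * p


def poisson_pmf(n, lam):
    """P(length = n) for n >= 0 under a Poisson distribution with mean lam."""
    return math.exp(-lam) * lam ** n / math.factorial(n)


for n in range(1, 7):
    print(f"length {n}: geometric {geometric_pmf(n, 0.4):.3f}, "
          f"Poisson {poisson_pmf(n, 3.0):.3f}")
```

The geometric prior always prefers the shortest production, whereas the Poisson prior can place its peak at a moderate length.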

18
Search regime
In principle chunking + merging can get any SCFG.
In practice the amount of lookahead and non-determinism used in the search matters.
Sample ordering matters (arguably a strength in language acquisition theories).
Partial or complete bracketing information can be used.

19
Experiments
Primarily formal language examples using sample sets previously explored in grammar induction work.
Simple NL grammars from the L0 language acquisition domain.
Preliminary work on a 1200-rule grammar (+ Henry Li M.Sc. project on ATIS).

20
Probabilistic Attribute Grammars
SCFG, but with a scheme for adding attributes to non-terminals:
S --> NP VP
S.tr = NP.f
S.lm = VP.g
S.rel = VP.h
Not as expressive as you think.
Equations must specify an LHS feature, either as a constant or by reference to an RHS value.

21
Derivation in PAG
Generate string top down as for SCFG.
Probabilistically assign features bottom up.
Each non-terminal instance chooses stochastically from a set of constants and a set of RHS features (which have already been assigned).

22
Priors
Feature assignment has a Dirichlet prior, just like other parameter priors.
Equations have a description length prior.

23
Sample incorporation
Allows attribute-value pairs to be specified on the samples:
(a circle is below a square, {tr=circle, lm=square, rel=below})
S -> A1 CIRCLE1 IS1 BELOW1 A2 SQUARE1
S.tr = circle
S.lm = square
S.rel = below
A1 -> a
…

24
Merging operators
Chunking and non-terminal merging as before.
fmerge(f1, f2) = f: the obvious renaming operation.
fattrib(X, v) = f: suppose that an LHS feature is obtained from a place in the RHS.

25
Search
Extra operators complicate search.
Compose common sequences (RISC style).
Use heuristics to choose good operators.

26
Statistical advantages of PAG formalism
Marginal probability of the CFG aspect is independent of the feature part.
Features can’t rule out a tree derived with non-zero probability by the SCFG.
The feature derivation is a product of conditional probabilities, because the feature dependencies form a consistent total order.
Footnote in Stolcke: could move to L-attributed grammars while preserving this. Opportunity?

27
What you can’t do
S --> NP VP
NP.num = VP.num
because then you would need a marginal probability rather than a conditional. And this can’t be done consistently.
You can abandon the connection to derivation, of which more later.

28
Stochastic HPSG
Brew, European ACL 1995
Probabilistic interpretation for Typed Feature Structures.
Stochastic Type Hierarchies.
Re-entrancies.

29
What I assumed
Attribute-value structures as in Carpenter's book.
Principles etc. treated as background theory.
Corpus = multiset of Sentence Description
Job of statistics is to represent regularities in corpus.
Does not distinguish between consequences of the background theory and consequences of the choice of corpus.

30
ALE signatures
bot sub [sign, num].
sign sub [sentence, phrase].
sentence sub [] intro [left:np, right:vp].
phrase sub [np,vp] intro [num:num].
np sub [].
vp sub [].
num sub [sing, pl].
sing sub [].
pl sub [].

31
Maximality
sentence, np, vp, sing and pl are maximal types.
A feature structure is fully specified when all the types which it contains are maximal.
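The maximal types can be read off the signature mechanically: they are the types with no subtypes. The sketch below transcribes the `sub` declarations of the ALE signature into a Python dictionary (the `intro` feature declarations are omitted); the dictionary encoding and the function name `maximal_types` are illustrative assumptions.

```python
# Subtype declarations transcribed from the ALE signature slide.
signature = {
    "bot": ["sign", "num"],
    "sign": ["sentence", "phrase"],
    "sentence": [],
    "phrase": ["np", "vp"],
    "np": [],
    "vp": [],
    "num": ["sing", "pl"],
    "sing": [],
    "pl": [],
}


def maximal_types(sig):
    """A type is maximal when it has no subtypes."""
    return sorted(t for t, subtypes in sig.items() if not subtypes)


print(maximal_types(signature))
```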

32
CFGs and Type Hierarchies
In CFGs:
The expansions of a non-terminal partition the set of partial phrase markers.
It can be determined which partial phrase markers are entailed by a more specified one.
The rules specify accessibility relations.
The probabilities control the costs.

33
CFGs and Type Hierarchies
In type hierarchies:
The sub-types partition the set of attribute-value structures.
It can be determined which structures are entailed by a more specified one.
The features introduced on the sub-types specify accessibility relations.
The probabilities control the costs.
Well, almost: Re-entrancy(!)

34
The idea
We associate probabilities with attribute-value structures, which are sets of maximal and non-maximal nodes generated by beginning from the starting node and successively expanding non-maximal leaves of the partial tree.
Maximally specified attribute-value structures have only maximal leaves.
Probabilities are assigned by an inductive definition parallel to the usual one for CFGs.

35
Independence assumption
A type is refined according to the same probability distribution irrespective of the context in which it is expanded.
Bogus, but…
The number of parameters which must be estimated for a grammar is a linear function of the size of the type hierarchy.
There is a clear route to a more fine-grained account if we allow the expansion probabilities to be conditioned on surrounding context.

36
Treatment of re-entrancies
Simultaneously grow the tree and guess the pattern of re-entrancies.
Treat re-entrancies as equivalence relations over nodes.
Keep all types except the leaves maximal.
Whenever a node is introduced, specify its equations.
When looking for new nodes to expand, only ever pick one element from each equivalence class.

37
Probabilities for re-entrancies
Charge for pairwise equivalences and inequivalences.
Good points: Not too many parameters. Retains stochastic process.
Bad points: Stochastic process generates bogus structures. Don’t really have a training algorithm. How do you do grammar inference?

38
Abney
Pointed out a problem with the ERF training algorithm.
Failed derivations mean that ERF converges to a non-optimal solution.
Not maximum likelihood.

39
The problem
Rules:
S -> A[1] A[1]
S -> B
A -> a, A -> b, B -> a, B -> b
The re-entrancy renders illegal the trees in which the As rewrite differently.
Training step assigns probabilities to each of the legal trees.
But the illegal ones consume some of the probability.
Sum < 1.
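The lost probability mass can be computed directly for this grammar. In the sketch below, `p_saa` (the probability of S -> A A) and `q_a` (the probability of A -> a) are hypothetical parameter names and values; the trees are scored as if the grammar were an ordinary SCFG, and only those where both As rewrite identically are kept.

```python
def legal_mass(p_saa, q_a):
    """Total probability of the trees that respect the re-entrancy A[1] A[1]."""
    p_sb = 1.0 - p_saa
    # S -> A A: only the trees (a, a) and (b, b) are legal.
    legal_aa = p_saa * (q_a ** 2 + (1.0 - q_a) ** 2)
    # S -> B: B -> a and B -> b are both legal and together use all of p_sb.
    return legal_aa + p_sb


for p_saa, q_a in [(0.5, 0.5), (0.9, 0.5), (0.5, 0.9)]:
    print(f"P(S->AA)={p_saa}, P(A->a)={q_a}: "
          f"legal mass = {legal_mass(p_saa, q_a):.3f}")
```

With every rule probability at 0.5, a quarter of the mass goes to the illegal trees (a, b) and (b, a), so the legal trees sum to 0.75.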

40
Pseudo-solution
Normalize the probabilities to cover only the legal DAGs.
Unfortunately the re-estimation process now converges, but to the wrong value.

41
The solution
Moved to random fields.
Define a probability distribution over configurations.

42
Configurations
Configurations have features: any property of the configuration that can be counted.
Log probability is a weighted sum of counts of features.
You could have local trees as features, but this is not required.
Weights for alternative expansions don’t have to sum to 1.
Flexible, but must now do feature selection as well as estimation.
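A small sketch of such a distribution, assuming the configurations are the four legal DAGs of the toy grammar with rules S -> A[1] A[1], S -> B, A -> a, A -> b, B -> a, B -> b. The feature names, the choice to count a shared A node once, and the weights are all hypothetical; normalization is done by brute-force enumeration, which is only feasible for toy cases.

```python
import math


def random_field(configs, weights):
    """Log-linear distribution: P(x) is proportional to exp(sum_f w_f * count_f(x))."""
    unnorm = {
        name: math.exp(sum(weights.get(f, 0.0) * n for f, n in counts.items()))
        for name, counts in configs.items()
    }
    z = sum(unnorm.values())  # normalizing constant over legal configurations only
    return {name: u / z for name, u in unnorm.items()}


# Feature counts per configuration; the re-entrant A node is counted once.
configs = {
    "S(A:a A:a)": {"S->AA": 1, "A->a": 1},
    "S(A:b A:b)": {"S->AA": 1, "A->b": 1},
    "S(B:a)": {"S->B": 1, "B->a": 1},
    "S(B:b)": {"S->B": 1, "B->b": 1},
}
weights = {"A->a": math.log(2.0)}  # hypothetical: doubles the score of A -> a

dist = random_field(configs, weights)
for name, p in dist.items():
    print(f"{name}: {p:.2f}")
```

Because the normalizer ranges over legal configurations only, no probability leaks to illegal structures, and the weights are free parameters rather than rule probabilities.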

43
Field induction for attribute value grammars
Rules:
S -> A[1] A[1]
S -> B
A -> a, A -> b, B -> a, B -> b
[Figure: Atomic Features; node labels a, b, A, B, S]

44
Choosing features
Consider features already in field, plus those formed by combining, plus those formed by combining with an atomic feature.
S A + -> S A etc.

45
Weighting and reweighting
Iterative scheme for selecting features and adjusting weights, too complex to go into.
Originated by Della Pietra, Della Pietra and Lafferty.
Best description is Abney’s 1997 CL paper.

46
Riezler
Generalizes Abney’s idea to incomplete data.
Works with arbitrary CLP programs.
Provides efficient search for best analyses in special cases.
Still requires random sampling in the general case.

47
Summary of sampling approaches
Very general scheme + heuristics for feature selection.
Needs random sampling (scaling is a big issue).
Unknown whether practical.
There is a lot of work in this area in statistics, which we ought to draw on.

48
Conclusion
We can encode linguistic theory in two ways:
As Bayesian prior distributions
As constraints on the form of probabilistic models
As theories become less constrained, the probability models get more general and the estimation problems harder.
How far can we get with techniques for which priors and so on are well known?

49
Where to read more
Stolcke Thesis
Abney CL paper
Recent work by Riezler

50
Two issues
Learning lexical information
Learning “proper grammars”

51
Issue 1: Learning Levin’s verb classes
Verbs are central to most linguistic theories.
And needed for all applications.
Beth Levin, “English Verb Classes and Alternations”.
Systematic and theory-neutral account of verbs and their behaviour.
Coverage necessarily incomplete.

52
Levin’s hypothesis
Verbs with similar semantics show similar alternations.
Load, rub and plaster are like spray (SPRAY/LOAD verbs).
Make, build and knit pattern with carve (BUILD verbs).
Levin uses this as a basis for 200-odd classes.
Jessica sprayed paint on the wall.
Jessica sprayed the wall with paint.
Jessica sprayed water at the baby.
*Jessica sprayed me water.
Martha carved the baby a toy.
Martha carved a toy for the baby.
*Martha carved a toy at the baby.

53
Ambiguity
784 of Levin’s 3,024 verbs are class ambiguous.
Ambiguity correlates with high frequency.
Verbs can be class ambiguous even after the syntactic frame is known.
Remember to write your aunt a thank-you letter. (MESSAGE_TRANSFER)
Our lawyer will write you a Green Card application. (PERFORMANCE)
The attendant will call you a cab. (GET)
The prosecution will call you a liar. (DUB)

54
Lapata and Brew ( )
Statistical model of verb class ambiguity.
Task: Infer class for ambiguous cases.
Goal: Investigate and test Levin’s hypothesis.
Ultimate goal: Infer class for verbs omitted from Levin’s list.

55
The General Approach
Stochastic process generating class, frame and verb.
Express this process as a causal model (Bayes net).
Find reasonable estimates of the conditional probabilities which parameterize the network.
Find the class which maximizes p(class|frame,verb).

56
Problem
We have millions of words of POS-tagged English in the British National Corpus, but we don’t know frames or classes.
We certainly can’t afford to generate complete training data.
If we had a really good broad-coverage parser, we would have frames.
We would still need classes.

57
Hindle and Rooth
Wanted to obtain (automatically) lexical information for use in deciding attachment.
Key idea: may not have a perfect parser, but if we have a reasonable parser, we can use its (statistically filtered) output to make reasonable decisions.
You need a lot of text, but it doesn’t have to be marked up.

58
Verb frames from the BNC
Wrote simple grammars for V NP NP, V NP PP(for) and V NP PP(to).
Filtered to remove noise (compound nouns in particular), obtaining joint frequency distribution of frame and verb.

59
A causal model

60
The causal model
We have P(verb), P(frame), P(frame|verb) but need P(class), P(frame|class).
Approximate furiously.
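One plausible reading of the approximation, offered as an assumption and not confirmed by the slides, is to rank a verb's candidate classes by P(frame|class) · P(class), dropping the terms that do not depend on the class. In the sketch below every number is hypothetical, as are the function and variable names.

```python
def best_class(verb, frame, verb_classes, p_class, p_frame_given_class):
    """Pick the candidate class with the highest P(frame|class) * P(class)."""
    scores = {
        c: p_frame_given_class[c].get(frame, 0.0) * p_class[c]
        for c in verb_classes[verb]   # only classes the verb is listed under
    }
    return max(scores, key=scores.get), scores


# Hypothetical inputs for the ambiguous verb "write".
verb_classes = {"write": ["MESSAGE_TRANSFER", "PERFORMANCE"]}
p_class = {"MESSAGE_TRANSFER": 0.02, "PERFORMANCE": 0.05}
p_frame_given_class = {
    "MESSAGE_TRANSFER": {"NP-V-NP-NP": 1 / 4},
    "PERFORMANCE": {"NP-V-NP-NP": 1 / 6},
}

winner, scores = best_class("write", "NP-V-NP-NP", verb_classes,
                            p_class, p_frame_given_class)
print(winner, scores)
```

Restricting the maximization to the classes a verb is listed under keeps the frame from voting for classes the verb cannot belong to.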

61
P(frame|class)
For each class, counted the syntactic frames listed in Levin.
Fairly coarse and easy classification of frames.
For GIVE there were NP-V-NP-PP(to) and NP-V-NP-NP only.
6 frames for PERFORMANCE.
Assumed uniform distribution:
P(NP-V-NP-NP|GIVE) estimated as 1/2
P(NP-V|PERFORMANCE) estimated as 1/6

62
P(class)
Ambiguity class: a set of verbs which show the same patterns of ambiguity.
For example, all the verbs which can be either of the classes MESSAGE_TRANSFER or PERFORMANCE, but no other.
Ambiguity classes reduce sparse data problems.
We still need a principled way of estimating P(class|ambiguity class).

63
P(class|amb_class)
Key idea: use class size measured on verb types to stand in for true class population, which we don’t know.

Verb   Class   Size   P(class|amb_class)   f(verb,class)
Pass   THROW   27     27/72                [missing]
Pass   SEND    20     20/72                [missing]
Pass   GIVE    15     15/72                [missing]
Pass   MARRY   10     10/72                [missing]

Remaining extracted values: 72 (the sum of the class sizes), 2530.
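The size-based estimate is simple arithmetic. The sketch below uses the class sizes given for "pass"; the reading of f(verb,class) as f(verb) · P(class|amb_class) and the verb frequency of 1000 are assumptions made for illustration.

```python
from fractions import Fraction

# Class sizes (number of verb types) for the ambiguity class of "pass".
class_sizes = {"THROW": 27, "SEND": 20, "GIVE": 15, "MARRY": 10}


def class_given_amb_class(sizes):
    """P(class|amb_class) estimated as class size over total size."""
    total = sum(sizes.values())
    return {c: Fraction(n, total) for c, n in sizes.items()}


def expected_counts(verb_freq, probs):
    """Assumed reading of f(verb, class): share the verb frequency out by class."""
    return {c: float(verb_freq * p) for c, p in probs.items()}


probs = class_given_amb_class(class_sizes)
counts = expected_counts(1000, probs)  # 1000 is a hypothetical verb frequency
for c in class_sizes:
    print(f"{c}: P = {probs[c]} = {float(probs[c]):.3f}, f = {counts[c]:.1f}")
```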

64
Evaluation 1
For some verbs, knowing the frame is sufficient.
We checked whether our model predicts the class that Levin specifies.
Baseline was to use our estimated p(class).

Frame             Verbs       Baseline    Model
NP-V-NP-NP        [missing]   [missing]   87.8%
NP-V-NP-PP(to)    [missing]   [missing]   92%
NP-V-NP-PP(for)   70          70%         98.5%
Combined          [missing]   [missing]   91.8%

65
Evaluation 2
For other verbs, ambiguity persists.
We marked up some instances with our judgements.
Same baseline.

Frame             Verbs       Baseline    Model
NP-V-NP-NP        14          42.8%       85.7%
NP-V-NP-PP(to)    [missing]   [missing]   86.6%
NP-V-NP-PP(for)   2           0%          50%
Combined          31          61.3%       83.9%

66
Evaluation 3
Built a simple verb sense disambiguator using naïve Bayes and straightforward features.
Naïve Bayes uses the product of P(a|class) over all features a to judge the probability of a class.
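A minimal naïve Bayes scorer along these lines, working in log space to avoid underflow. The features (context words), all probabilities, and the `floor` value used for unseen features are hypothetical; the slides do not specify the feature set or any smoothing.

```python
import math


def naive_bayes(prior, likelihood, features, floor=1e-6):
    """Score each class by log P(class) + sum over features of log P(a|class)."""
    scores = {}
    for c, p in prior.items():
        score = math.log(p)
        for a in features:
            # Unseen features get a small floor probability (an assumption).
            score += math.log(likelihood[c].get(a, floor))
        scores[c] = score
    return max(scores, key=scores.get), scores


# Hypothetical model for the two classes of "call".
prior = {"GET": 0.5, "DUB": 0.5}
likelihood = {
    "GET": {"cab": 0.20, "liar": 0.01},
    "DUB": {"cab": 0.01, "liar": 0.20},
}

print(naive_bayes(prior, likelihood, ["cab"])[0])
print(naive_bayes(prior, likelihood, ["liar"])[0])
```

Swapping the uniform `prior` for a class-based one changes only the first term of each score, which is how the different priors on the next slide can be compared.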

67
Disambiguator
Uniform prior: 1/|classes|
Class-based prior: p(class) from corpus-based estimates.
Class-based prior, taking into account frame and verb: p(class,frame,verb). When we do that, we also use P(a|class,frame,verb).
Class-based prior wins.

68
Future
Ought to use prior and contextual class probabilities in a parser.
That way, the parser can do better at finding frames, which will in turn improve the estimates for the contextual classes.
