
1 Introduction to Computational Natural Language Learning. Linguistics 79400 (Under: Topics in Natural Language Processing), Computer Science 83000 (Under: Topics in Artificial Intelligence). The Graduate School of the City University of New York, Fall 2001. William Gregory Sakas, Hunter College, Department of Computer Science; Graduate Center, PhD Programs in Computer Science and Linguistics, The City University of New York.

2 Syntax acquisition can be viewed as a state space search:
— nodes represent grammars, including a start state and a target state;
— arcs represent a possible change from one hypothesized grammar to another.
(Figure: a possible state space for a parameter space with 3 parameters; the nodes are the grammars 000, 001, 010, 011, 100, 101, 110, 111, one of which is the target G_targ.)
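
A minimal sketch of this state space in Python: grammars are bit strings over n binary parameters, and arcs join grammars that differ in a single parameter (matching the single-value changes the TLA makes, per the next slide). The helper names are illustrative, not from the slides.

```python
from itertools import product

def parameter_space(n=3):
    """All 2^n grammars of an n-parameter space, written as bit strings."""
    return ["".join(bits) for bits in product("01", repeat=n)]

def hamming(g1, g2):
    """Number of parameters on which two grammars differ."""
    return sum(b1 != b2 for b1, b2 in zip(g1, g2))

grammars = parameter_space(3)
arcs = [(g1, g2) for g1 in grammars for g2 in grammars if hamming(g1, g2) == 1]

print(grammars)    # ['000', '001', '010', '011', '100', '101', '110', '111']
print(len(arcs))   # 24 directed single-bit arcs (12 undirected edges)
```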

3 Local state space for the TLA (not the TLA-). If G_curr = 010, then G_attempt = a random G ∈ {000, 110, 011}: by the SVC, the attempted grammar differs from G_curr in exactly one parameter. The learner is error-driven: if s ∈ L(G_curr), it stays with G_curr. It is also greedy: when s ∉ L(G_curr), it moves to 110 only if G_attempt = 110 and s ∈ L(G_110), to 011 only if G_attempt = 011 and s ∈ L(G_011), and to 000 only if G_attempt = 000 and s ∈ L(G_000).
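
A hedged sketch of one TLA update step, directly encoding the three constraints above; `can_parse(grammar, sentence)` is a placeholder for the membership test s ∈ L(G), which the slides assume but do not define.

```python
import random

def tla_step(g_curr, sentence, can_parse):
    """One update of the Triggering Learning Algorithm (sketch)."""
    # Error-driven: keep the current grammar if it parses the input.
    if can_parse(g_curr, sentence):
        return g_curr
    # SVC: hypothesize a grammar differing from G_curr in exactly one parameter.
    i = random.randrange(len(g_curr))
    g_attempt = g_curr[:i] + ("1" if g_curr[i] == "0" else "0") + g_curr[i + 1:]
    # Greediness: adopt G_attempt only if it parses the input.
    return g_attempt if can_parse(g_attempt, sentence) else g_curr
```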

4 New probabilistic formulation of TLA performance:
— α_i denotes the ambiguity factor: α_i = Pr(s ∈ L(G_targ) ∩ L(G_i));
— β_i,j denotes the overlap factor: β_i,j = Pr(s ∈ L(G_i) ∩ L(G_j));
— γ_i denotes the probability of picking, or "looking ahead" at, a new hypothesis grammar G_i.

5 The probability that the learner moves from state G_curr to state G_new on a single input s is
Pr(G_curr → G_new) = (1 − α_curr) · γ_new · Pr(G_new can parse s | G_curr can't parse s),
where (1 − α_curr) reflects error-drivenness, γ_new the SVC-constrained look-ahead, and the conditional term greediness. After some algebra:
Pr(G_curr → G_new) = γ_new · (α_new − β_curr,new).
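
A tiny sketch of how the reduced formula yields a row of per-input transition probabilities; the dictionary-based representation and the names are illustrative, and α, β, γ follow the notation above.

```python
def tla_transition_row(curr, neighbors, alpha, beta, gamma):
    """Per-input transition probabilities out of grammar `curr` (sketch).

    alpha[g]     : ambiguity factor of grammar g
    beta[(g, h)] : overlap factor of grammars g and h
    gamma[g]     : probability of looking ahead at grammar g
    `neighbors`  : the single-parameter (SVC) neighbors of `curr`
    """
    row = {new: gamma[new] * (alpha[new] - beta[(curr, new)]) for new in neighbors}
    row[curr] = 1.0 - sum(row.values())  # remaining probability mass: learner stays put
    return row
```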

6 Parameter space H_4 with G_targ = 1111. Each ring, or G-Ring, contains exactly those grammars a certain Hamming distance from the target. For example, ring G_2 contains 0011, 0101, 1100, 1010, 1001 and 0110, all of which differ from the target grammar 1111 by 2 bits. (Figure: the 16 grammars of H_4 arranged in concentric rings around the target, with G-Rings G_2 and G_4 labeled.)
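
The G-Rings are easy to compute: group every grammar of the space by its Hamming distance from the target. A small sketch (function name illustrative):

```python
from itertools import product

def g_rings(target):
    """Group all grammars of the space by Hamming distance from the target."""
    n = len(target)
    rings = {d: [] for d in range(n + 1)}
    for bits in product("01", repeat=n):
        g = "".join(bits)
        rings[sum(a != b for a, b in zip(g, target))].append(g)
    return rings

rings = g_rings("1111")
print(rings[2])                            # the six grammars of ring G_2
print([len(rings[d]) for d in range(5)])   # [1, 4, 6, 4, 1] grammars per ring
```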

7 Weak Smoothness Requirement: all the members of a G-Ring can parse s with equal probability.
Strong Smoothness Requirement: the parameter space is weakly smooth, and the probability that s can be parsed by a member of a G-Ring increases monotonically as distance from the target decreases.
Smoothness: there exists a correlation between the similarity of grammars and the similarity of the languages that they generate.

8 Experimental setup:
1) Adapt the formulas for the transition probabilities to work with G-Rings.
2) Build a generic transition matrix into which varying values of α and β can be plugged.
3) Use a standard Markov technique to calculate the expected number of inputs consumed by the system (construct the fundamental matrix).
Goal: find the 'sweet spot' for TLA performance.
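
Step 3 uses the standard absorbing-Markov-chain technique: if Q is the transient-to-transient block of the transition matrix (the non-target grammar states), the fundamental matrix N = (I − Q)⁻¹ gives expected visit counts, and its row sums give the expected number of inputs consumed before convergence on G_targ. A sketch with an invented 3-state Q, purely for illustration:

```python
import numpy as np

def expected_inputs(Q):
    """Expected number of inputs consumed before absorption (convergence),
    starting from each transient state; Q is the transient-to-transient block."""
    n = Q.shape[0]
    N = np.linalg.inv(np.eye(n) - Q)   # fundamental matrix
    return N.sum(axis=1)

# Illustrative transient block (values invented for the example):
Q = np.array([[0.70, 0.20, 0.05],
              [0.10, 0.60, 0.20],
              [0.00, 0.10, 0.50]])
print(expected_inputs(Q))
```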

9 Three experiments:
1) G-Rings equally likely to parse an input sentence (uniform domain).
2) G-Rings are strongly smooth (smooth domain).
3) Anything-goes domain.
Problem: how to find combinations of α and β that are optimal? Solution: use an optimization algorithm, GRG2 (Lasdon and Waren, 1978).
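
A rough stand-in for this optimization step, with scipy's bounded optimizer playing the role of GRG2. The two-state chain and its parameterization by (α, β) are invented purely to show the loop; the real experiments build the transition block from the G-Ring formulas.

```python
import numpy as np
from scipy.optimize import minimize

def objective(x):
    """Expected inputs to convergence for a toy chain parameterized by (alpha, beta)."""
    alpha, beta = x
    Q = np.array([[1 - alpha, alpha * beta],
                  [0.0,       1 - alpha * (1 - beta)]])   # toy transient block
    N = np.linalg.inv(np.eye(2) - Q)                      # fundamental matrix
    return N[0].sum()

res = minimize(objective, x0=[0.5, 0.5], bounds=[(0.01, 0.99), (0.01, 0.99)])
print(res.x, res.fun)   # (alpha, beta) minimizing expected inputs, and that minimum
```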

10 Result 1: the TLA performs worse than blind guessing in a uniform domain; the number of sentences increases exponentially. (Chart on a logarithmic scale; results obtained employing optimal values of α and β.)

11 Result 2: the TLA performs extremely well in a smooth domain, but the increase is still nonlinear. (Chart on a linear scale; results obtained employing optimal values of α and β.)

12 Result 3: the TLA performs a bit better in the anything-goes scenario; the optimizer chooses 'accelerating' strong smoothness. (Chart on a linear scale; results obtained employing optimal values of α and β.)

13 In summary. The TLA is an infeasible learner: with cross-language ambiguity uniformly distributed across the domain of languages, the number of sentences consumed by the TLA is exponential in the number of parameters. The TLA is a feasible learner: in strongly smooth domains, the number of sentences increases at a rate much closer to linear as the number of parameters increases (i.e. as the number of grammars increases exponentially).

14 A second case study: the Structural Triggers Learner (Fodor, 1998).

15 The Parametric Principle (Fodor 1995, 1998; Sakas and Fodor, 2001): set individual parameters; do not evaluate whole grammars. Each successful learning event halves the size of the grammar pool. E.g. when 5 of 30 parameters are set, only about 3% of the grammar pool remains.
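
Written out, the halving arithmetic behind the 3% figure (r binary parameters, t of them set):

```latex
\[
\frac{2^{\,r-t}}{2^{\,r}} = 2^{-t},
\qquad t = 5 \;\Rightarrow\; 2^{-5} = \tfrac{1}{32} \approx 3.1\%.
\]
```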

16 The Structural Triggers Learner (STL), Fodor (1995, 1998).
Problem: the Parametric Principle requires certainty, but how is the learner to know when a sentence may be parametrically ambiguous?
Solution: for the STL, a parameter value = a structural trigger = a "treelet". E.g. the VO/OV parameter (V before O within VP): on (e.g. English, V O) vs. off (e.g. German, O V).

17 STL algorithm:
1) Input sentence.
2) Parse it with the current grammar G_curr.
Success → keep G_curr.
Failure → parse with G_curr plus all parametric treelets, and adopt the treelets that contributed.
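
A sketch of this loop in Python, with grammars represented as sets of adopted treelets; `parses` and `parse_with_treelets` are placeholders for the parser operations the slides assume but do not define.

```python
def stl_learn(sentences, g_curr, parametric_treelets, parses, parse_with_treelets):
    """Sketch of the STL loop on this slide.

    parses(grammar, s) -> bool
    parse_with_treelets(grammar, treelets, s) -> set of treelets that contributed
    """
    for s in sentences:                          # 1) next input sentence
        if parses(g_curr, s):                    # 2) parse with G_curr: success
            continue                             #    -> keep G_curr
        # failure -> parse with G_curr plus all parametric treelets,
        # then adopt the treelets that contributed to the parse
        adopted = parse_with_treelets(g_curr, parametric_treelets, s)
        g_curr = g_curr | adopted
    return g_curr
```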

18 So, the STL:
— uses the parser to decode the parametric signatures of sentences;
— can detect parametric ambiguity (the waiting-STL variant doesn't learn from sentences that contain a choice point);
— and thus can abide by the Parametric Principle.

19 Computationally modeling STL performance:
— the nodes represent the current number of parameters that have been set, not grammars;
— the arcs represent a possible change in the number of parameters that have been set.
(Figure: a state space for the STL performing in a 3-parameter domain, with states 0 through 3; here, each input may express 0, 1 or 2 new parameters.)

20 Transition probabilities for the waiting-STL depend on:
— the number of parameters that have been set (t), i.e. the learner's state;
— and, formalizing the input space, the number of relevant parameters (r), the expression rate (e), the ambiguity rate (a), and the "effective" expression rate (e′).

21 Transition probabilities for the STL-minus. The Markov transition probability of shifting from state S_t to S_{t+w} is built from:
— the probability of encountering a sentence s with w "new" (unset) parameters;
— the probability that the parameters expressed by s are expressed unambiguously;
i.e. the probability of setting w "new" parameters, given that t have been set (and given values of r, e, e′).
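
The transcript does not reproduce the formulas behind these factors, so the sketch below is only an illustrative reconstruction under explicit simplifying assumptions, not the formulas of Sakas and Fodor: each sentence is assumed to express e of the r relevant parameters chosen uniformly at random (a hypergeometric model), and each expressed parameter is assumed to be ambiguous independently with rate a.

```python
from math import comb

def p_new_params(w, t, r, e):
    """Illustrative Pr(a sentence expresses exactly w not-yet-set parameters),
    given t parameters already set, r relevant parameters, e expressed per input."""
    if e > r or w > e or e - w > t or w > r - t:
        return 0.0
    return comb(r - t, w) * comb(t, e - w) / comb(r, e)

def p_unambiguous(e, a):
    """Illustrative Pr(all e expressed parameters are expressed unambiguously),
    assuming each is ambiguous independently with rate a."""
    return (1 - a) ** e

def transition_prob(t, w, r, e, a):
    """Illustrative Pr(S_t -> S_{t+w}) for w > 0; the leftover mass stays in S_t."""
    return p_new_params(w, t, r, e) * p_unambiguous(e, a)

print(transition_prob(t=12, w=2, r=20, e=10, a=0.3))
```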

22 Results after the Markov analysis for the STL-minus: exponential in % ambiguity; seemingly linear in the number of parameters.

23 Results for the STL (not minus): exponential in % ambiguity; seemingly linear in the number of parameters.

24 Striking effect of ambiguity (r fixed): 20 parameters to be set, 10 parameters expressed per input. (Chart on a logarithmic scale.)

25 Subtle effect of ambiguity on efficiency with respect to r: as ambiguity increases, the cost of the Parametric Principle skyrockets as the domain scales up (r increases). (Chart on a linear scale; x axis = number of parameters in the domain; 10 parameters expressed per input; values shown: 28; 221; 9,878; 9,772,740.)

26 The effect of ambiguity (interacting with e and r): how and where is the cost incurred? By far the greatest amount of damage inflicted by ambiguity occurs at the very earliest stages of learning: the wait for the first fully unambiguous trigger, plus a small additional wait for sentences that express the last few parameters unambiguously.

27 The logarithm of the expected number of sentences consumed by the waiting-STL in each state after learning has started, as the learner gets closer and closer to convergence. e = 10, r = 30, and e′ = 0.2 (a′ = 0.8). (Chart on a logarithmic scale.)

28 STL — Bad News: ambiguity is damaging even to a parametrically-principled learner. Abiding by the Parametric Principle does not, in and of itself, guarantee a merely linear increase in the complexity of the learning task as the number of parameters increases.

29 STL — Good News, Part 1: the learning task might be manageable if there are at least some sentences with low expression to get learning off the ground (Sakas and Fodor, 1998). (Chart: parameters expressed per sentence, split into a 'can learn' region and a 'can't learn' region.)

30 Add a distribution factor to the transition probabilities: Pr(i parameters are expressed by a sentence | distribution D on input text I).
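
A small sketch of that distribution factor, assuming D is uniform over the integers 0 .. e_max (the uniform-expression-rate case used in the following slides); the mixed transition probability is just the fixed-e probability averaged over D, and `fixed_e_prob` can be, for instance, the illustrative sketch given after slide 21.

```python
def uniform_expression_dist(e_max):
    """Pr(a sentence expresses i parameters) under a uniform D on 0 .. e_max."""
    return {i: 1.0 / (e_max + 1) for i in range(e_max + 1)}

def mixed_transition_prob(t, w, r, a, e_max, fixed_e_prob):
    """Average a fixed-e transition probability Pr(S_t -> S_{t+w}) over D."""
    return sum(p * fixed_e_prob(t, w, r, e, a)
               for e, p in uniform_expression_dist(e_max).items())
```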

31 Average number of inputs consumed by the waiting-STL when the expression rate is not fixed per sentence and e varies uniformly from 0 to e_max: still exponential in % ambiguity, but manageable. For comparison, e varying from 0 to 10 requires 430 sentences, while e fixed at 5 requires 3,466 sentences.

32 The effect of ambiguity is still exponential, but not as bad as for fixed e. r = 20; e is uniformly distributed from 0 to 10. (Chart on a logarithmic scale.)

33 Effect of high ambiguity rates with a varying, uniformly distributed rate of expression: still exponential in a, but manageable (a larger domain than in the previous tables).

34 STL — Good News, Part 2: with a uniformly distributed expression rate, the cost of the Parametric Principle is linear (in r) and doesn't skyrocket. (Chart on a linear scale.)

35 In summary: with a uniformly distributed expression rate, the number of sentences required by the STL falls in a manageable range (though still exponential in % ambiguity), and the number of sentences increases only linearly as the number of parameters increases (i.e. as the number of grammars increases exponentially).

36 No Best Strategy Conjecture (roughly in the spirit of Schaffer, 1994): algorithms may be extremely efficient in specific domains but not in others; there is generally no best learning strategy. Recommendation: we have to know the specific facts about the distribution or shape of ambiguity in natural language.

37 Research agenda: a three-fold approach to building a cognitive computational model of human language acquisition:
1) formulate a framework to determine what distributions of ambiguity make for feasible learning;
2) conduct a psycholinguistic study to determine whether the facts of human (child-directed) language are in line with the conducive distributions;
3) conduct a computer simulation to check for performance nuances and potential obstacles (e.g. local maxima based on defaults, or Subset Principle violations).

