1 Modeling Parameter Setting Performance in Domains with a Large Number of Parameters: A Hybrid Approach CUNY / SUNY / NYU Linguistics Mini-conference.

Presentation transcript:

1 Modeling Parameter Setting Performance in Domains with a Large Number of Parameters: A Hybrid Approach CUNY / SUNY / NYU Linguistics Mini-conference March 10, 2001 William Gregory Sakas & Dina Demner-Fushman

2 Primary point (to make [quickly, painlessly] on a Saturday just before lunch): It is not enough to build a series of computer simulations of a cognitive model of human language acquisition and claim that it mirrors the process by which a child acquires language. The (perhaps obvious) fact is that learners are acutely sensitive to cross-language ambiguity. Whether or not a learning model is ultimately successful as a cognitive model is an empirical issue; it depends on the 'fit' of the simulations with the facts about the distribution of ambiguity in human languages.

3 What's coming: 1) some early learnability results; 2) a feasibility case study of one parameter setting model: the Triggering Learning Algorithm (TLA), Gibson and Wexler (1994); 3) conjectures and a proposed research agenda.

4 Why computationally model natural language acquisition? Pinker (1979): "...it may be necessary to find out how language learning could work in order for the developmental data to tell us how it does work." [emphasis mine]

5 Learnability Is the learner guaranteed to converge on the target grammar for every language in a given domain? Gold (1967), Wexler and Culicover (1980), Gibson & Wexler (1994), Kanazawa (1994). An early learnability result (Gold, 1967): Exposed to input strings of an arbitrary target language generated by grammar G_targ, it is impossible to guarantee that any learner can converge on G_targ if G_targ is drawn from any class in the Chomsky hierarchy (e.g. the context-free grammars).

6 Gold’s result is sometimes taken to be strong evidence for a nativist Universal Grammar. 1) Psycholinguistic research indicates that children learn grammar based on positive exemplar sentences. 2) Gold proves that G_reg ⊂ G_cfg ⊂ G_cs ⊂ G_re can’t be learned this way. Conclude: some grammatical competence must be in place before learning commences. Gold’s result is often misapplied, but there has been much discussion.

7 Another learnability result: All classes of grammars possible within the principles and parameters framework are learnable because they are finite. In fact, a simple Blind Guess Learner is guaranteed to succeed in the long run for any finite class of grammars. Blind Guess Learner:
1. randomly hypothesize a current grammar
2. consume and attempt to parse a sentence from the linguistic environment
3. if the sentence is parsable by the current grammar, go to 2; otherwise go to 1
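The three steps above can be sketched directly in runnable form. Everything below the function (3-bit parameter vectors as grammars, inputs that each reveal one parameter value) is a toy domain invented here purely to exercise the loop, not the G&W domain:

```python
import random

def blind_guess_learn(grammars, parses, sample, max_steps=100_000):
    """Blind Guess Learner: re-guess uniformly at random whenever the
    current grammar fails to parse an input sentence."""
    g = random.choice(grammars)          # step 1: random hypothesis
    for _ in range(max_steps):
        s = sample()                     # step 2: consume a sentence
        if not parses(g, s):             # step 3: keep g only while it parses
            g = random.choice(grammars)
    return g

# Toy domain: a sentence is a (parameter, value) pair that a grammar
# parses iff it agrees on that parameter.
random.seed(1)
target = (1, 0, 1)
grammars = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
parses = lambda g, s: g[s[0]] == s[1]
def sample():
    i = random.randrange(3)
    return (i, target[i])

print(blind_guess_learn(grammars, parses, sample))  # converges to (1, 0, 1)
```

Since the target grammar parses every input, the learner stays put once it guesses it; every other grammar is eventually kicked out, which is why success in the long run is guaranteed for any finite class.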

8 Feasibility Is acquisition possible within a reasonable amount of time and/or with a reasonable amount of work? Clark (1994, in press), Niyogi and Berwick (1996), Lightfoot (1989) (degree-0), Sakas (2000), Tesar and Smolensky (1996), and many PAC results concerning induction of FSA's. Feasibility measure (Sakas and Fodor, in press): near-linear increase of the expected number of sentences consumed before a learner converges on the target grammar.

9 Feasibility result: The Blind Guess Learner succeeds only after consuming a number of sentences exponentially correlated with the number of parameters. If # parameters = 30, then # grammars = 2^30 = 1,073,741,824. The search space is huge!

10 A Feasibility Case Study: A three parameter domain (Gibson and Wexler, 1994). Sentences are strings of the symbols: S, V, O1, O2, AUX, ADV. SV / VS - subject precedes verb / verb precedes subject. +V2 / -V2 - verb or aux must (or need not) be in the second position in the sentence. VO / OV - verb precedes object / object precedes verb. Allie will eat the birds → S AUX V O

11 Two example languages (finite, degree-0):
SV OV +V2 (German-like): S V / S V O / O V S / S V O2 O1 / O1 V S O2 / O2 V S O1 / S AUX V / S AUX O V / O AUX S V / S AUX O2 O1 V / O1 AUX S O2 V / O2 AUX S O1 V / ADV V S / ADV V S O / ADV V S O2 O1 / ADV AUX S V / ADV AUX S O V / ADV AUX S O2 O1 V
SV VO -V2 (English-like): S V / S V O / S V O1 O2 / S AUX V / S AUX V O / S AUX V O1 O2 / ADV S V / ADV S V O / ADV S V O1 O2 / ADV S AUX V / ADV S AUX V O / ADV S AUX V O1 O2

12 Surprisingly, G&W’s simple 3-parameter domain presents nontrivial obstacles to several types of learning strategies, but the space is ultimately learnable. Big question: How will the learning process scale up in terms of feasibility as the number of parameters increases? Two problems for most acquisition strategies: 1) Ambiguity 2) Size of the domain

13 Cross-language ambiguity: several strings are licensed by both the German-like (SV OV +V2) and English-like (SV VO -V2) languages of the previous slide (e.g. S V, S V O, S AUX V), so an input string often does not uniquely identify the target grammar.

14 P&P acquisition: How to obtain informative feasibility results while studying linguistically interesting domains with cognitively plausible learning algorithms?

15 So, how to answer questions of feasibility as the number of grammars (exponentially) scales up? One option: simulations (Briscoe (2000), Clark (1992), Elman (1990, 1991, 1996), Yang (2000)). But simulations won't work for large domains. Answer: introduce some formal notions in order to abstract away from the specific linguistic content of the input data, and create an input space for a linguistically plausible domain.

16 A hybrid approach (formal/empirical): 1) formalize the learning process and the input space; 2) use the formalization in a Markov structure to empirically test the learner across a wide range of learning scenarios. The framework gives general data on the expected performance of acquisition algorithms and can answer the question: given learner L, if the input space exhibits characteristics x, y and z, is feasible learning possible?

17 Syntax acquisition can be viewed as a state space search: nodes represent grammars, including a start state and a target state G_targ; arcs represent a possible change from one hypothesized grammar to another. (Diagram: a possible state space for a parameter space with 3 parameters.)

18 The Triggering Learning Algorithm - TLA (Gibson and Wexler, 1994) searches the (huge) grammar space using local heuristics:
repeat until convergence:
  receive a string s from L(G_targ)
  if s can be parsed by G_curr, do nothing [error-driven]
  otherwise, pick a grammar that differs by one parameter value from the current grammar [Single Value Constraint (SVC)]
  if this grammar parses the sentence, make it the current grammar; otherwise do nothing [greediness]
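A minimal runnable sketch of the TLA loop, with the three heuristics marked in comments. The toy domain at the bottom is again hypothetical: each input reveals a single parameter value, so unlike the real G&W space it contains no ambiguity and no local maxima:

```python
import random

def tla_learn(n_params, parses, sample, max_steps=50_000):
    """Triggering Learning Algorithm (Gibson & Wexler 1994):
    error-driven, greedy, single-value-constrained local search."""
    g = tuple(random.randint(0, 1) for _ in range(n_params))
    for _ in range(max_steps):
        s = sample()
        if parses(g, s):                    # error-driven: change only on failure
            continue
        i = random.randrange(n_params)      # SVC: flip exactly one parameter
        g_attempt = g[:i] + (1 - g[i],) + g[i + 1:]
        if parses(g_attempt, s):            # greediness: adopt only if it parses s
            g = g_attempt
    return g

# Toy domain: a sentence (i, v) is parsable by g iff g[i] == v.
random.seed(2)
target = (1, 1, 0, 1)
parses = lambda g, s: g[s[0]] == s[1]
def sample():
    i = random.randrange(4)
    return (i, target[i])

print(tla_learn(4, parses, sample))  # converges to (1, 1, 0, 1)
```

In this benign domain every adopted flip strictly reduces the Hamming distance to the target, so the TLA converges; the slides that follow show how badly this degrades once cross-language ambiguity enters the picture.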

s  L(G curr ) s  L(G attempt ) s  L(G curr )  G attempt = 110  s  L(G 110 ) s  L(G curr )  G attempt = 011  s  L(G 011 ) s  L(G curr )  G attempt = 000  s  L(G 000 ) Error-driven Greediness Local state space for the TLA If G curr = 010, then G attempt = random G  { 000, 110, 011 } SVC

20 Probabilistic formulation of TLA performance:
α_i denotes the ambiguity factor: α_i = Pr(s ∈ L(G_targ) ∩ L(G_i))
ω_{i,j} denotes the overlap factor: ω_{i,j} = Pr(s ∈ L(G_i) ∩ L(G_j))
γ_i denotes the probability of picking, or "looking ahead" at, a new hypothesis grammar G_i

21 The probability that the learner moves from state G_curr to state G_new:
Pr(G_curr → G_new) = (1 - α_curr) [error-driven] × (γ_new) [SVC] × Pr(G_new can parse s | G_curr can't parse s) [greediness]
After some algebra: Pr(G_curr → G_new) = (γ_new)(α_new - ω_{curr,new})
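The algebra can be spot-checked numerically. Writing α_i for the ambiguity factor, ω for the overlap factor, and γ for the look-ahead probability, and drawing s uniformly from L(G_targ), α_i is just the fraction of target sentences G_i parses and ω the fraction both grammars parse. The little languages below are made up solely for the check:

```python
from fractions import Fraction

# Hypothetical toy languages; s is uniform over L_targ.
L_targ = {"a", "b", "c", "d"}
L_curr = {"a", "b", "x"}
L_new  = {"b", "c", "y"}
gamma_new = Fraction(1, 3)    # assumed look-ahead probability for G_new

def pr(event):
    """Probability of an event under s uniform over L_targ."""
    return Fraction(sum(event(s) for s in L_targ), len(L_targ))

alpha_curr = pr(lambda s: s in L_curr)                  # ambiguity factor of G_curr
alpha_new  = pr(lambda s: s in L_new)                   # ambiguity factor of G_new
omega      = pr(lambda s: s in L_curr and s in L_new)   # overlap factor

# Step form: (1 - alpha_curr) * gamma_new * Pr(G_new parses s | G_curr can't)
cond = pr(lambda s: s not in L_curr and s in L_new) / (1 - alpha_curr)
step_form = (1 - alpha_curr) * gamma_new * cond
# Closed form after the algebra: gamma_new * (alpha_new - omega)
closed = gamma_new * (alpha_new - omega)
assert step_form == closed == Fraction(1, 12)
```

The conditioning term (1 - α_curr) cancels against the denominator of the conditional probability, which is exactly the "some algebra" of the slide.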

22 Parameter space H_4 with G_targ = 1111. (Diagram: rings of grammars around the target grammar.) Each ring, or G-Ring, contains exactly those grammars at a certain Hamming distance from the target. For example, ring G_2 contains 0011, 0101, 1100, 1010, 1001 and 0110, all of which differ from the target grammar 1111 by 2 bits.
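The G-Rings are easy to compute; a small sketch (the helper name `g_rings` is ours):

```python
from itertools import product

def g_rings(target):
    """Group every grammar of the parameter space by Hamming distance
    from the target grammar (given as a bit string)."""
    n = len(target)
    rings = {d: [] for d in range(n + 1)}
    for bits in product("01", repeat=n):
        g = "".join(bits)
        d = sum(a != b for a, b in zip(g, target))   # Hamming distance
        rings[d].append(g)
    return rings

rings = g_rings("1111")
print(sorted(rings[2]))
# ['0011', '0101', '0110', '1001', '1010', '1100']
```

Ring G_d of an n-parameter space has C(n, d) members, which is why collapsing grammars into rings shrinks the 2^n-state Markov chain to n+1 states.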

23 Smoothness: there exists a correlation between the similarity of grammars and the similarity of the languages that they generate.
Weak Smoothness Requirement: all the members of a G-Ring can parse s with equal probability.
Strong Smoothness Requirement: the parameter space is weakly smooth, and the probability that s can be parsed by a member of a G-Ring increases monotonically as distance from the target decreases.
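The two requirements can be stated as a small executable check. Parse probabilities are supplied per grammar; the example numbers are invented:

```python
def hamming(g, h):
    return sum(a != b for a, b in zip(g, h))

def smoothness(parse_prob, target):
    """Return (weak, strong) for a domain, where parse_prob maps each
    grammar (bit string) to Pr(an input s is parsable by that grammar)."""
    rings = {}
    for g, p in parse_prob.items():
        rings.setdefault(hamming(g, target), set()).add(p)
    # weak: every grammar in a given G-Ring parses s with equal probability
    weak = all(len(ps) == 1 for ps in rings.values())
    # strong: weak, and the probability rises strictly toward the target
    ps = [rings[d].pop() for d in sorted(rings)] if weak else []
    strong = weak and all(a > b for a, b in zip(ps, ps[1:]))
    return weak, strong

# Invented per-grammar probabilities for a 2-parameter space, target 11:
probs = {"11": 1.0, "10": 0.6, "01": 0.6, "00": 0.3}
assert smoothness(probs, "11") == (True, True)
probs["01"] = 0.5        # unequal within ring G_1: weak smoothness fails
assert smoothness(probs, "11") == (False, False)
```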

24 Experimental Setup:
1) adapt the formulas for the transition probabilities to work with G-Rings
2) build a generic transition matrix into which varying values of α and ω can be plugged
3) use standard Markov techniques to calculate the expected number of inputs consumed by the system (construct the fundamental matrix)
Goal - find the 'sweet spot' for TLA performance
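Step 3 can be sketched: for an absorbing Markov chain with transient-to-transient transition matrix Q, the fundamental matrix is N = (I - Q)^-1, and the expected number of inputs consumed from each start state is the corresponding row sum of N, i.e. the solution t of (I - Q) t = 1. A self-contained version with a tiny hypothetical chain:

```python
def expected_inputs(Q):
    """Expected steps to absorption from each transient state of an
    absorbing Markov chain: solve (I - Q) t = 1 (the row sums of the
    fundamental matrix N = (I - Q)^-1) by Gauss-Jordan elimination."""
    n = len(Q)
    # augmented system [I - Q | 1]
    A = [[(1.0 if i == j else 0.0) - Q[i][j] for j in range(n)] + [1.0]
         for i in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))  # partial pivoting
        A[col], A[piv] = A[piv], A[col]
        for row in range(n):
            if row != col and A[row][col] != 0.0:
                f = A[row][col] / A[col][col]
                for c in range(col, n + 1):
                    A[row][c] -= f * A[col][c]
    return [A[i][n] / A[i][i] for i in range(n)]

# Hypothetical 2-transient-state chain: from state 0 the learner stays
# put w.p. 0.5 and converges w.p. 0.5; from state 1 it moves to 0 w.p. 1.
Q = [[0.5, 0.0],
     [1.0, 0.0]]
t = expected_inputs(Q)
assert abs(t[0] - 2.0) < 1e-9 and abs(t[1] - 3.0) < 1e-9
```

In the actual setup the states would be G-Rings, with entries of Q built from the α and ω values of the previous slides; the same row-sum computation then yields the expected number of sentences before convergence.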

25 Three experiments:
1) G-Rings equally likely to parse an input sentence (uniform domain)
2) G-Rings are strongly smooth (smooth domain)
3) anything-goes domain
Problem: how to find combinations of α and ω that are optimal? Solution: use an optimization algorithm: GRG2 (Lasdon and Waren, 1978).

26 Result 1: The TLA performs worse than blind guessing in a uniform domain - exponential increase in # sentences. (Chart, logarithmic scale; results obtained employing optimal values of α and ω.)

27 Result 2: The TLA performs extremely well in a smooth domain - but still a nonlinear increase. (Chart, linear scale; results obtained employing optimal values of α and ω.)

28 Result 3: The TLA performs a bit better in the anything-goes scenario - the optimizer chooses 'accelerating' strong smoothness. (Chart, linear scale; results obtained employing optimal values of α and ω.)

29 In summary. TLA is an infeasible learner: with cross-language ambiguity uniformly distributed across the domain of languages, the number of sentences consumed by the TLA is exponential in the number of parameters. TLA is a feasible learner: in strongly smooth domains, the number of sentences increases at a rate much closer to linear as the number of parameters increases (i.e. as the number of grammars increases exponentially).

30 No Best Strategy Conjecture (roughly in the spirit of Schaffer, 1994): algorithms may be extremely efficient in specific domains, but not in others; there is generally no best learning strategy. This recommends: we have to know the specific facts about the distribution, or shape, of ambiguity in natural language.

31 Research agenda: a three-fold approach to building a cognitive computational model of human language acquisition: 1) formulate a framework to determine what distributions of ambiguity make for feasible learning; 2) conduct a psycholinguistic study to determine whether the facts of human (child-directed) language are in line with the conducive distributions; 3) conduct a computer simulation to check for performance nuances and potential obstacles (e.g. local maxima based on defaults, or Subset Principle violations).

32 Tag Slides (if time): Why is Gold's theorem so often misapplied?

33 Gold’s result is sometimes taken to be strong evidence for a nativist Universal Grammar. 1) Psycholinguistic research indicates that children learn grammar based on positive exemplar sentences. 2) Gold says G_reg ⊂ G_cfg ⊂ G_cs ⊂ G_re can’t be learned this way. Conclude: some grammatical competence must be in place before learning commences. Gold’s result is often misapplied, but there has been much discussion.

34 Gold misapplied. Tacit assumptions:
1) Human language is a computable process. By Church/Turing, no computational model is more powerful than L_re; hence L_human is a subset of L_re.
2) The Chomsky hierarchy is an appropriate framework in which to examine L_human. There are many, many interesting formal results about CFG's, automata, etc. But where does L_human lie?

35 Given the language-as-computation assumption, and Gold's result, it may be that the class of human languages intersects the classes of the Chomsky hierarchy. (Diagram: nested classes L_reg ⊂ L_cfg ⊂ L_cs ⊂ L_re, with L_human cutting across them.) Angluin's Theorem (1980) provides necessary and sufficient conditions for such a class to be learnable from positive data.