Grammar induction by Bayesian model averaging. Guy Lebanon, LARG meeting, May 2001. Based on Andreas Stolcke's thesis, UC Berkeley, 1994.

Why automatic grammar induction (AGI)?
Enables using domain-dependent grammars without expert intervention. Enables using person-dependent grammars without expert intervention. Can be used on different languages, without requiring a linguist familiar with the particular language. A grammar-induction process with expert guidance may also be more accurate than a hand-written grammar, since computers are better than humans at analyzing large corpora.

Why statistical approaches to AGI?
In practice, languages are not clean logical structures: sentences as actually spoken are often not precisely grammatical. Expanding the grammar to cover them leads to an explosion of grammar rules, and a large grammar produces many parses of the same sentence, some clearly more accurate than others. Statistical approaches make it possible to include a large set of grammar rules while assigning a probability to each parse, and statistics offers well-understood optimality conditions and optimization procedures.

Some Bayesian statistics
Each grammar M (rules plus rule probabilities) is assigned a prior probability p(M); this value may represent an expert's opinion about how likely the grammar is. Given a training set X (an unlabeled corpus), the model posterior is computed by Bayes' law: p(M | X) = p(X | M) p(M) / p(X), i.e. p(M | X) ∝ p(X | M) p(M). Either the grammar maximizing the posterior is kept (as the single best grammar), or the set of all grammars together with their posteriors is kept (better: this is model averaging).
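As a concrete illustration of the two options above, here is a minimal Python sketch (not from the slides; the grammar names M1/M2 and the numbers are invented) that turns prior and likelihood values into a posterior over candidate grammars:

import math

# Two hypothetical grammars with made-up prior and corpus log-likelihood values,
# scored by p(M | X) ∝ p(X | M) p(M), working in log space for numerical stability.
log_prior = {"M1": math.log(0.7), "M2": math.log(0.3)}   # log p(M)
log_lik = {"M1": -120.0, "M2": -115.0}                    # log p(X | M) on the corpus X

log_joint = {m: log_prior[m] + log_lik[m] for m in log_prior}
shift = max(log_joint.values())                           # log-sum-exp trick
log_evidence = shift + math.log(sum(math.exp(v - shift) for v in log_joint.values()))
posterior = {m: math.exp(v - log_evidence) for m, v in log_joint.items()}

map_grammar = max(posterior, key=posterior.get)           # option 1: keep the MAP grammar
# option 2 (model averaging): keep the whole `posterior` dictionary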

Priors for CF grammars
The prior of a grammar, p(M), is split into two parts: a prior over the grammar structure (the set of rules) and a prior over the rule probabilities given that structure. The structure component is chosen to introduce a bias towards short grammars (fewer rules); one way of doing that, though still heuristic, is a minimum description length (MDL) prior that decreases with the grammar's description length. The prior for the rule probabilities is taken to be a uniform Dirichlet prior, which has the effect of smoothing low counts of rule usage.

Grammar posterior
It is too hard to maximize the posterior over both the rules and the rule probabilities jointly. Instead, the search maximizes the posterior of the rule structure M_S alone, with the probabilities θ integrated out: p(M_S | X) ∝ p(M_S) ∫ p(θ | M_S) p(X | M_S, θ) dθ, where the likelihood of each sentence x is approximated by the probability of its Viterbi derivation V. The last integral has a closed-form solution; a sketch follows.
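The closed form referred to here is the standard Dirichlet-multinomial marginal. The sketch below computes its log value from per-non-terminal rule-usage counts (e.g. gathered from the Viterbi derivations); it illustrates the kind of closed form involved and is not copied from the thesis:

from math import lgamma

def log_marginal_rule_counts(counts, alpha=1.0):
    """Closed-form log of the integral over rule probabilities under a symmetric
    Dirichlet(alpha) prior, given rule-usage counts.

    counts: dict mapping each non-terminal to the list of usage counts of its
    productions (e.g. collected from Viterbi derivations of the corpus).
    """
    total = 0.0
    for lhs, cs in counts.items():
        k = len(cs)
        total += lgamma(k * alpha) - k * lgamma(alpha)      # 1 / B(alpha, ..., alpha)
        total += sum(lgamma(c + alpha) for c in cs)          # numerator of B(counts + alpha)
        total -= lgamma(sum(cs) + k * alpha)                 # denominator of B(counts + alpha)
    return total

# Example: non-terminal S has two productions, used 5 and 2 times in the Viterbi parses.
print(log_marginal_rule_counts({"S": [5, 2]}, alpha=1.0))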

Maximizing the posterior
Even though an approximation to the posterior can be computed in closed form, finding a grammar that maximizes it is still a hard problem. A. Stolcke's approach: start with many rules, then apply greedy rule-merging operations so as to maximize the posterior. Model merging was applied to hidden Markov models, probabilistic context-free grammars, and probabilistic attribute grammars (PCFGs with semantic features tied to non-terminals).

A concrete example: PCFG
A specific PCFG consists of a list of rules s and a set of production probabilities. For a given s, it is possible to learn the production probabilities with EM; coming up with an optimal s is still an open problem, and Stolcke's model merging is an attempt to tackle it. Given a corpus (a set of sentences), an initial set of rules is constructed (the slide's concrete construction is not reproduced in this transcript; one possible scheme is sketched below):
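A possible reading of the data-incorporation step, under the assumption that each sentence is added as a flat production from the start symbol to its word sequence (the slide's actual initial rules are not in the transcript):

def initial_grammar(corpus):
    """Build the starting rule set from a corpus of tokenized sentences.

    Assumption: each sentence w1 ... wn is incorporated as a flat production
    S -> w1 ... wn, so the initial grammar simply memorizes the corpus;
    merging then generalizes it.
    """
    rules = set()
    for sentence in corpus:
        rules.add(("S", tuple(sentence)))
    return rules

corpus = [["the", "dog", "barks"], ["the", "cat", "sleeps"]]
print(initial_grammar(corpus))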

Merging operators
Non-terminal merging: replace two existing non-terminals with a single new non-terminal. Non-terminal chunking: given an ordered sequence of non-terminals X1 ... Xk, create a new non-terminal Y that expands to X1 ... Xk, and replace occurrences of X1 ... Xk in right-hand sides with Y.
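A small sketch of the two operators on a rule set represented as (lhs, rhs-tuple) pairs; the function names and the symbols a, b, seq, new are illustrative, not from the slides:

def merge_nonterminals(rules, a, b, new):
    """Non-terminal merging: replace every occurrence of non-terminals a and b
    (on either side of a rule) with the single new non-terminal."""
    sub = lambda s: new if s in (a, b) else s
    return {(sub(lhs), tuple(sub(s) for s in rhs)) for lhs, rhs in rules}

def chunk_sequence(rules, seq, new):
    """Non-terminal chunking: add the rule new -> seq and replace occurrences
    of the sequence seq inside right-hand sides with the new non-terminal."""
    seq = tuple(seq)
    def replace(rhs):
        out, i = [], 0
        while i < len(rhs):
            if tuple(rhs[i:i + len(seq)]) == seq:
                out.append(new)
                i += len(seq)
            else:
                out.append(rhs[i])
                i += 1
        return tuple(out)
    return {(lhs, replace(rhs)) for lhs, rhs in rules} | {(new, seq)}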

PCFG priors
Prior for rules: each rule is assigned a description length, with separate formulas for non-lexical rules (which do not produce a terminal symbol) and lexical rules (which do); the slide's exact formulas are not reproduced in this transcript. The rule (structure) prior was taken to be either exponentially decreasing or Poisson in the description length. Prior for rule probabilities: the uniform Dirichlet prior described above.
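Since the slide's exact description-length formulas are missing from the transcript, the following is only a generic MDL-style stand-in showing how such a structure prior could be computed; `is_terminal` and the constant c are assumptions, not Stolcke's definitions:

import math

def description_length(rules, is_terminal, n_nonterminals, n_terminals):
    """Illustrative MDL-style description length: log2(n_nonterminals) bits for
    the left-hand side, plus, for each right-hand-side symbol, log2 of the size
    of the alphabet it may be drawn from (non-terminals only in a non-lexical
    rule, non-terminals plus terminals in a lexical rule)."""
    bits = 0.0
    for lhs, rhs in rules:
        lexical = any(is_terminal(sym) for sym in rhs)
        alphabet = n_nonterminals + (n_terminals if lexical else 0)
        bits += math.log2(n_nonterminals)            # encode the left-hand side
        bits += len(rhs) * math.log2(alphabet)       # encode the right-hand side
    return bits

def log_structure_prior(rules, is_terminal, n_nonterminals, n_terminals, c=1.0):
    # Exponentially decreasing prior in description length: log p(M_S) = -c * DL(M_S) + const.
    return -c * description_length(rules, is_terminal, n_nonterminals, n_terminals)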

Search strategy
Start with the initial rules. Try applying all possible merge operations; for each resulting grammar compute the posterior, and choose the merge that yields the highest posterior. Search strategies used: best-first search, best-first search with look-ahead, and beam search. A skeleton of this loop is sketched below.
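A skeleton of this greedy loop, with best-first search as the beam_width=1 case (look-ahead is not shown); `score` and `candidate_merges` are hypothetical callbacks standing in for the posterior computation and the merge operators described earlier:

def merge_search(initial_rules, score, candidate_merges, beam_width=1):
    """Repeatedly apply the candidate merge(s) that most improve the
    (approximate) log posterior, stopping when no candidate improves it."""
    best_score, best_rules = score(initial_rules), initial_rules
    beam = [(best_score, best_rules)]
    while True:
        expansions = [(score(r2), r2) for _, r in beam for r2 in candidate_merges(r)]
        if not expansions or max(e[0] for e in expansions) <= best_score:
            return best_rules                       # no merge improves the posterior: stop
        expansions.sort(key=lambda e: e[0], reverse=True)
        beam = expansions[:beam_width]              # keep the top-scoring grammars
        best_score, best_rules = beam[0]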

Now some examples…