CS546: Machine Learning and Natural Language. Latent-Variable Models for Structured Prediction Problems: Syntactic Parsing (slides / figures from Slav Petrov).


1 CS546: Machine Learning and Natural Language
Latent-Variable Models for Structured Prediction Problems: Syntactic Parsing
Slides / figures from Slav Petrov's talk at COLING-ACL 06 are used in this lecture.

2 Parsing Problem
Annotation refines base treebank symbols to improve the statistical fit of the grammar:
– Parent annotation [Johnson 98]

3 Parsing Problem
Annotation refines base treebank symbols to improve the statistical fit of the grammar:
– Parent annotation [Johnson 98]
– Head lexicalization [Collins 99, ...]

4 Parsing Problem
Annotation refines base treebank symbols to improve the statistical fit of the grammar:
– Parent annotation [Johnson 98]
– Head lexicalization [Collins 99, ...]
– Automatic annotation [Matsuzaki et al, 05; ...]
– Manual annotation [Klein and Manning 03]

5 Manual Annotation
Manually split categories:
– NP: subject vs object
– DT: determiners vs demonstratives
– IN: sentential vs prepositional
Advantages:
– Fairly compact grammar
– Linguistic motivations
Disadvantages:
– Performance leveled out
– Manually annotated

Model                  F1
Naïve Treebank PCFG    72.6
Klein & Manning ’03    86.3

6 Automatic Annotation
Use latent variable models:
– Split (“annotate”) each node, e.g., NP -> (NP[1], NP[2], ..., NP[T])
– Each node in the tree is annotated with a latent sub-category
– Latent-Annotated Probabilistic CFG: to obtain the probability of a tree you need to sum over all the latent variables
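To make the last point concrete, here is a small runnable toy example (not from the slides; the tiny grammar, the choice T = 2, and all rule probabilities are invented for illustration). It computes the probability of one fixed tree, (S (NP dogs) (VP bark)), by brute-force summation over latent annotations; in practice this sum is computed with a bottom-up (inside) dynamic-programming pass over the tree.

from itertools import product

T = 2  # latent sub-categories per non-terminal (assumed, for illustration)

# Hypothetical annotated rule probabilities (made-up numbers).
root = {("S", 0): 0.5, ("S", 1): 0.5}          # P(root = S[i])
binary = {                                      # P(A[i] -> B[j] C[k])
    ("S", 0, "NP", 0, "VP", 0): 0.6, ("S", 0, "NP", 1, "VP", 0): 0.4,
    ("S", 1, "NP", 0, "VP", 1): 0.7, ("S", 1, "NP", 1, "VP", 1): 0.3,
}
lexical = {                                     # P(A[i] -> word)
    ("NP", 0, "dogs"): 0.5, ("NP", 1, "dogs"): 0.2,
    ("VP", 0, "bark"): 0.4, ("VP", 1, "bark"): 0.6,
}

def tree_probability():
    # Sum P(tree, a) over all assignments a of sub-categories to S, NP, VP.
    total = 0.0
    for s, np, vp in product(range(T), repeat=3):
        total += (root.get(("S", s), 0.0)
                  * binary.get(("S", s, "NP", np, "VP", vp), 0.0)
                  * lexical.get(("NP", np, "dogs"), 0.0)
                  * lexical.get(("VP", vp, "bark"), 0.0))
    return total

print(tree_probability())  # marginal probability of the unannotated tree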

7 How to perform this clustering?
Estimating model parameters (and model structure):
– Decide how to split each non-terminal (what is T in NP -> (NP[1], NP[2], ..., NP[T])?)
– Estimate probabilities for all annotated rules
Parsing:
– Do you need the most likely ‘annotated’ parse tree (1), or the most likely tree with non-annotated nodes (2)?
– Usually (2), but the inferred latent variables can be useful for other tasks
– Latent-Annotated Probabilistic CFG: to obtain the probability of a tree you need to sum over all the latent variables

8 Estimating the model
Estimating parameters:
– If we decide on the structure of the model (how we split), we can use EM (Matsuzaki et al, 05; Petrov and Klein, 06; ...):
   E-step: estimate the posterior over latent annotations and obtain fractional counts of the annotated rules
   M-step: re-estimate the rule probabilities from these counts (see the sketch below)
– Can also use variational methods (mean-field): [Titov and Henderson, 07; Liang et al, 07]
Recall: we considered the variational methods in the context of LDA
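The E-step and M-step formulas on the original slide did not survive transcription; a standard form for a latent-annotated PCFG, in notation assumed here rather than taken from the slides, is roughly:

\begin{align*}
\text{E-step:}\quad & \hat{c}\big(A[x] \to B[y]\,C[z]\big) \;=\; \sum_{i}\; \mathbb{E}_{P(a \mid T_i,\, \theta^{\text{old}})}\Big[ c_{T_i, a}\big(A[x] \to B[y]\,C[z]\big) \Big] \\
\text{M-step:}\quad & P\big(A[x] \to B[y]\,C[z]\big) \;=\; \frac{\hat{c}\big(A[x] \to B[y]\,C[z]\big)}{\sum_{B',y',C',z'} \hat{c}\big(A[x] \to B'[y']\,C'[z']\big)}
\end{align*}

where the $T_i$ are the observed treebank trees, $a$ ranges over latent annotations of their nodes, and $c_{T_i,a}(\cdot)$ counts occurrences of an annotated rule; the expected (fractional) counts are computed with inside-outside passes over each fixed tree.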

9 Estimating the model
How to decide how many ways to split each node?
– Early models split all the nodes equally [Kurihara and Sato, 04; Matsuzaki et al, 05; Prescher 05, ...], with T selected by hand
– The resulting models are sparse (parameter estimates are not reliable) and parsing time is large

10 Estimating the model
How to decide how many ways to split each node?
Later, different approaches were considered:
– (Petrov and Klein 06): split-and-merge approach: recursively split each node in 2; if the likelihood is (significantly) improved, keep the split, otherwise merge back; continue until no improvement (see the sketch below)
– (Liang et al 07): use Dirichlet Processes to automatically infer the appropriate size of the grammar; the larger the training set, the more fine-grained the annotation
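The split-and-merge control flow as a schematic Python sketch. The actual training routines are passed in as function arguments (split_fn, em_fn, merge_gain_fn, merge_fn) and are not implemented here; this only illustrates the loop described above, not Petrov and Klein's implementation.

def split_merge(grammar, treebank, split_fn, em_fn, merge_gain_fn, merge_fn,
                rounds=6, min_gain=1e-3):
    for _ in range(rounds):
        # Split every latent sub-category in two, then refit with EM.
        grammar = em_fn(split_fn(grammar), treebank)

        # For each new split, estimate how much training likelihood would be
        # lost if the two sub-categories were merged back together.
        for split in list(grammar.splits):   # grammar.splits: assumed attribute
            if merge_gain_fn(grammar, treebank, split) < min_gain:
                # The split did not (significantly) improve likelihood: undo it.
                grammar = merge_fn(grammar, split)

        # Refit the merged, smaller grammar with EM before the next round.
        grammar = em_fn(grammar, treebank)
    return grammar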

11 Estimating the model
How to decide how many ways to split each node?
(Titov and Henderson 07; current work):
– Instead of annotating each node with a single label, annotate it with a binary vector
– Log-linear models for the rule probabilities instead of counts of productions
– The vector can be large: standard Gaussian regularization to avoid overtraining
– Efficient approximate parsing algorithms

12 How to parse?
Do you need the most likely ‘annotated’ parse tree (1), or the most likely tree with non-annotated nodes (2)?
How to parse:
– (1) is easy: just usual parsing with the extended grammar (if all nodes are split in T)
– (2) is not tractable (NP-complete, [Matsuzaki et al, 2005]); instead you can do Minimum Bayes Risk decoding, i.e., output the minimum-loss tree [Goodman 96; Titov and Henderson, 06; Petrov and Klein 07]: instead of predicting the best tree you output the tree with the minimal expected error
– (Not always a great idea, because we often do not know good loss measures: e.g., optimizing the Hamming loss for sequence labeling can lead to linguistically non-plausible structures)
– Latent-Annotated Probabilistic CFG: to obtain the probability of a tree you need to sum over all the latent variables
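In symbols (notation assumed here, not taken from the slides), the two decoding objectives and the Minimum Bayes Risk alternative are roughly:

\begin{align*}
(1)\quad & (\hat{T}, \hat{a}) \;=\; \arg\max_{T,\,a}\; P(T, a \mid w) && \text{(tractable: Viterbi parsing with the annotated grammar)} \\
(2)\quad & \hat{T} \;=\; \arg\max_{T}\; \sum_{a} P(T, a \mid w) && \text{(intractable in general)} \\
\text{MBR:}\quad & \hat{T} \;=\; \arg\min_{T}\; \sum_{T'} P(T' \mid w)\; \ell(T, T') && \text{(minimize expected loss } \ell\text{)}
\end{align*}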

13 Adaptive splitting
(Petrov and Klein, 06), split and merge: number of induced constituent labels
[chart; constituent labels visible in the transcription: PP, VP, NP]

14 Adaptive splitting
(Petrov and Klein, 06), split and merge: number of induced POS tags
[chart; tags visible in the transcription: TO, POS]

15 Adaptive splitting
(Petrov and Klein, 06), split and merge: number of induced POS tags
[chart; tags visible in the transcription: TO, POS, NN, NNS, NNP, JJ]

16 Induced POS-tags
Proper nouns (NNP):
NNP-14: Oct.  Nov.  Sept.
NNP-12: John  Robert  James
NNP-2:  J.  E.  L.
NNP-1:  Bush  Noriega  Peters
NNP-15: New  San  Wall
NNP-3:  York  Francisco  Street
Personal pronouns (PRP):
PRP-0: It  He  I
PRP-1: it  he  they
PRP-2: it  them  him

17 Induced POS tags
Relative adverbs (RBR):
RBR-0: further  lower  higher
RBR-1: more  less  More
RBR-2: earlier  Earlier  later
Cardinal numbers (CD):
CD-7:  one  two  Three
CD-11: million  billion  trillion
(the remaining CD clusters contain numeric examples that were lost in transcription)

18 Results for this model
Parsers compared (columns were F1 on sentences of ≤ 40 words and F1 on all words; the scores and some years were lost in transcription):
– Klein & Manning ’03
– Matsuzaki et al. ’05
– Collins ’99
– Charniak & Johnson
– Petrov & Klein

19 LVs in Parsing
In standard models for parsing (and other structured prediction problems) you need to decide how the structure decomposes into parts (e.g., weighted CFGs / PCFGs).
In latent variable models you relax this assumption: you assume how the structure annotated with latent variables decomposes.
In other words, you learn to construct composite features from the elementary features (parts) -> reduces feature engineering effort.
Latent variable models have become popular in many applications:
– syntactic dependency parsing [Titov and Henderson, 07]: best single-model system in the parsing competition (overall 3rd result out of 22 systems) (CoNLL-2007)
– joint semantic role labeling and parsing [Henderson et al, 09]: again the best single model (1st result in parsing, 3rd result in SRL) (CoNLL-2009)
– hidden (dynamics) CRFs [Quattoni, 09]
– ...

20 Hidden CRFs
CRF (Lafferty et al, 2001): no long-distance statistical dependencies between the y variables.
Latent Dynamic CRF: long-distance dependencies can be encoded using latent vectors.
[the slide's graphical-model figures are not reproduced in the transcript]
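A sketch of the two model forms (notation assumed here; the slide showed them as graphical models): a linear-chain CRF, and a hidden / latent-dynamic CRF that marginalizes over a latent state sequence $h$:

\begin{align*}
\text{CRF:}\quad & P(y \mid x) \;=\; \frac{1}{Z(x)} \exp\Big( \sum_{t} \sum_{k} \lambda_k\, f_k(y_{t-1}, y_t, x, t) \Big) \\
\text{Hidden / latent-dynamic CRF:}\quad & P(y \mid x) \;=\; \sum_{h} P(y, h \mid x), \qquad
P(y, h \mid x) \;=\; \frac{1}{Z(x)} \exp\Big( \sum_{t} \sum_{k} \lambda_k\, f_k(h_{t-1}, h_t, y_t, x, t) \Big)
\end{align*}

so interactions between distant $y_t$'s can be mediated by the latent chain $h$.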

21 Latent Variables
Drawbacks:
– Learning LV models usually involves slower iterative algorithms (EM, variational methods, sampling, ...)
– The optimization problem is often non-convex: many local minima
– Inference (decoding) can be more expensive
Advantages:
– Reduces feature engineering effort
– Especially preferable if little domain knowledge is available and complex features are needed
– The induced representation can be used for other tasks (e.g., LA-PCFGs induce a fine-grained grammar that can be useful, e.g., for SRL)
– Latent variables (= hidden representations) can be useful in multi-task learning: a hidden representation is induced simultaneously for several tasks [Collobert and Weston, 2008; Titov et al, 2009]

22 Conclusions
We considered latent variable models in different contexts:
– Topic modeling
– Structured prediction models
We demonstrated where and why they are useful.
Reviewed basic inference/learning techniques (only a very basic review):
– EM-type algorithms
– Variational approximations
– Sampling
Next time: a guest lecture by Ming-Wei Chang on Domain Adaptation (a really hot and important topic in NLP!)