
Quick Introduction to Structured Prediction for Natural Language Understanding
Dan Roth, Computer and Information Science, University of Pennsylvania
February 2019

Learning with Declarative Representations
We talked about Declarative Representations and Probabilistic Representations. Most of the progress in both paradigms was still made within the Knowledge Representation & Reasoning community; learning was not involved. Clearly, there is room to learn Prolog statements (an entire field, Inductive Logic Programming (ILP), is devoted to it), and there is room to learn probabilistic extensions of Prolog and Bayes Nets; there have been efforts in all these directions. However, most of these were still not mainstream, not integrated and, for the most part, theoretical. The first area where people thought jointly about learning and (some form of) reasoning (though not exactly in the Learning to Reason sense mentioned in the Classical Papers) was Structured Prediction.

Nice to Meet You (the puzzle slide)
- Identify units
- Consider multiple representations and interpretations: pictures, text, layout, spelling, phonetics
- Put it all together: determine the "best" global interpretation
- Satisfy expectations

Joint Inference with General Constraint Structure: Entities and Relations
Example: "Bernie's wife, Jane, is a native of Brooklyn", with entity candidates E1 (Bernie), E2 (Jane), E3 (Brooklyn) and relation candidates R12, R23.
The local entity models give distributions such as:
E1: other 0.05, per 0.85, loc 0.10
E2: other 0.10, per 0.60, loc 0.30
E3: other 0.05, per 0.50, loc 0.45
The local relation models give:
R12: irrelevant 0.05, spouse_of 0.45, born_in 0.50
R23: irrelevant 0.10, spouse_of 0.05, born_in 0.85
Joint inference gives a good improvement: enforcing consistency between entity types and relation types corrects predictions that are locally best but globally inconsistent.
Key questions: How do we learn the model(s)? What is the source of the knowledge? How do we guide the global inference? The answer is an objective function that incorporates learned models with knowledge (expectation constraints): a Constrained Conditional Model.
Why not learn it all jointly? Because you cannot always learn things jointly: some constraining information may only come up at decision time, and some models may already exist, given to you. Moreover, not all learning is done from many examples; outside perceptual tasks, much of what you learn is acquired by being told or by observing single examples. Communication-driven learning is essential. Models can be learned separately or jointly, while constraints may appear only at decision time.
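Grounded in the numbers on this slide, here is a minimal sketch (mine, not from the deck) of what the joint inference does: enumerate joint assignments, score them with the local models, and keep only assignments that satisfy type constraints of the kind the slide appeals to (spouse_of takes two persons, born_in takes a person and a location). A real system would solve this with an ILP rather than by enumeration.

from itertools import product

# Local model scores from the slide (probabilities for each candidate label).
entity_scores = {
    "E1": {"other": 0.05, "per": 0.85, "loc": 0.10},   # Bernie
    "E2": {"other": 0.10, "per": 0.60, "loc": 0.30},   # Jane
    "E3": {"other": 0.05, "per": 0.50, "loc": 0.45},   # Brooklyn
}
relation_scores = {
    ("E1", "E2"): {"irrelevant": 0.05, "spouse_of": 0.45, "born_in": 0.50},  # R12
    ("E2", "E3"): {"irrelevant": 0.10, "spouse_of": 0.05, "born_in": 0.85},  # R23
}

# Declarative knowledge: which (arg1, arg2) entity types each relation allows.
allowed = {
    "spouse_of": {("per", "per")},
    "born_in": {("per", "loc")},
    "irrelevant": None,  # no restriction
}

def consistent(ent, rel):
    """Check that every predicted relation agrees with its arguments' types."""
    for (a, b), r in rel.items():
        if allowed[r] is not None and (ent[a], ent[b]) not in allowed[r]:
            return False
    return True

best, best_score = None, float("-inf")
ent_names, rel_names = list(entity_scores), list(relation_scores)
for ent_labels in product(*(entity_scores[e] for e in ent_names)):
    for rel_labels in product(*(relation_scores[r] for r in rel_names)):
        ent = dict(zip(ent_names, ent_labels))
        rel = dict(zip(rel_names, rel_labels))
        if not consistent(ent, rel):
            continue  # hard constraint: skip globally inconsistent assignments
        score = sum(entity_scores[e][t] for e, t in ent.items()) + \
                sum(relation_scores[r][t] for r, t in rel.items())
        if score > best_score:
            best, best_score = (ent, rel), score

print(best)  # E1=per, E2=per, E3=loc, R12=spouse_of, R23=born_in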

Structured Prediction: Inference
Placing things in context: a crash course in structured prediction.
Inference: given an input x (a document, a sentence), predict the best structure y = (y1, y2, …, yn) ∈ Y (e.g., entities & relations); that is, assign values to y1, y2, …, yn while accounting for the dependencies among the yi.
Inference is expressed as the maximization of a scoring function:
  y' = argmax_{y ∈ Y} w^T φ(x, y)
where φ(x, y) is a joint feature representation of inputs and outputs, Y is the set of allowed structures, and w is the feature weight vector (estimated during learning).
Inference requires, in principle, touching every y ∈ Y at decision time: given x ∈ X, we attempt to determine the best y ∈ Y for it under w. For some structures inference is computationally easy (e.g., using the Viterbi algorithm); in general it is NP-hard, and it can be formulated as an ILP.
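For the "easy" case, here is a minimal Viterbi sketch (my own illustration, not from the slides) for a first-order sequence model whose score decomposes into per-position and label-pair terms; the score matrices stand in for w^T φ on the local parts.

import numpy as np

def viterbi(emission, transition):
    """argmax over label sequences of sum_t emission[t, y_t] + sum_t transition[y_{t-1}, y_t].

    emission:   (T, K) array of per-position label scores
    transition: (K, K) array of label-pair scores
    Returns the highest-scoring sequence without enumerating all K^T candidates.
    """
    T, K = emission.shape
    score = np.zeros((T, K))            # best score of a prefix ending in label k at position t
    back = np.zeros((T, K), dtype=int)  # back-pointers
    score[0] = emission[0]
    for t in range(1, T):
        cand = score[t - 1][:, None] + transition + emission[t][None, :]  # (K, K)
        back[t] = cand.argmax(axis=0)
        score[t] = cand.max(axis=0)
    # Follow back-pointers from the best final label.
    y = [int(score[-1].argmax())]
    for t in range(T - 1, 0, -1):
        y.append(int(back[t][y[-1]]))
    return list(reversed(y))

# Tiny example with T=4 positions and K=3 labels (scores are made up).
rng = np.random.default_rng(0)
print(viterbi(rng.normal(size=(4, 3)), rng.normal(size=(3, 3))))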

Structured Prediction: Learning
Learning: given a set of structured examples {(x, y)}, find a weight vector w for the scoring function that minimizes empirical loss. Learning is thus driven by the attempt to find a weight vector w such that for each annotated example (xi, yi):
  w^T φ(xi, yi) ≥ w^T φ(xi, y) + Δ(yi, y)   ∀ y ∈ Y

Structured Prediction: Learning
Learning: given a set of structured examples {(x, y)}, find a weight vector w for the scoring function that minimizes empirical loss. Learning is thus driven by the attempt to find a weight vector w such that for each annotated example (xi, yi):
  w^T φ(xi, yi) ≥ w^T φ(xi, y) + Δ(yi, y)   ∀ y ∈ Y
(the score of the annotated structure must exceed the score of any other structure by a penalty for predicting that other structure).
We call these conditions the learning constraints. In most learning algorithms used today, the update of the weight vector w is done in an online fashion. Think of it as a Perceptron; the same procedure applies to the Structured Perceptron, CRFs, and linear Structured SVMs. W.l.o.g. (almost), we can thus write the generic structured learning algorithm as follows.

Structured Prediction: Learning Algorithm
In the structured case, prediction (inference) is often intractable, but it needs to be done many times during learning:
For each example (xi, yi) do (with the current weight vector w):
  Predict: perform inference with the current weight vector: yi' = argmax_{y ∈ Y} w^T φ(xi, y)
  Check the learning constraints: is the score of the current prediction better than the score of (xi, yi)?
    If yes: a mistaken prediction; update w.
    Otherwise: no need to update w on this example.
EndFor
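A minimal sketch of this loop as a Structured Perceptron (my own; `phi` and `inference` are assumed to be provided, e.g. the Viterbi routine above, and structures are assumed to be tuples of labels):

import numpy as np

def structured_perceptron(examples, phi, inference, dim, epochs=5, lr=1.0):
    """Generic mistake-driven structured learner (Structured Perceptron flavor).

    examples:  list of (x, y_gold) pairs; structures are label tuples
    phi:       joint feature map, phi(x, y) -> array of shape (dim,)
    inference: inference(x, w) -> argmax_y w . phi(x, y) over allowed structures
    """
    w = np.zeros(dim)
    for _ in range(epochs):
        for x, y_gold in examples:
            y_pred = inference(x, w)   # predict with the current weight vector
            if y_pred != y_gold:
                # The prediction is the argmax, so its score is at least that of the
                # gold structure: the learning constraint is violated -> update w.
                w += lr * (np.asarray(phi(x, y_gold)) - np.asarray(phi(x, y_pred)))
    return w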

Structured Prediction: Learning Algorithm
Solution I: decompose the scoring function into EASY and HARD parts.
For each example (xi, yi) do:
  Predict: perform inference with the current weight vector: yi' = argmax_{y ∈ Y} w_EASY^T φ_EASY(xi, y) + w_HARD^T φ_HARD(xi, y)
  Check the learning constraint: is the score of the current prediction better than the score of (xi, yi)?
    If yes: a mistaken prediction; update w.
    Otherwise: no need to update w on this example.
EndFor
The EASY part could consist of feature functions that correspond to an HMM or a linear CRF, or even φ_EASY(x, y) = φ(x), omitting the dependence on y, which corresponds to independent classifiers. This may not be enough if the HARD part is still part of each inference step.

Structured Prediction: Learning Algorithm
Solution II: disregard some of the dependencies and assume a simpler model throughout.
For each example (xi, yi) do:
  Predict: perform inference with the current weight vector: yi' = argmax_{y ∈ Y} w_EASY^T φ_EASY(xi, y) (the HARD term w_HARD^T φ_HARD(xi, y) is simply dropped)
  Check the learning constraint: is the score of the current prediction better than the score of (xi, yi)?
    If yes: a mistaken prediction; update w.
    Otherwise: no need to update w on this example.
EndFor

Structured Prediction: Learning Algorithm
Solution III: disregard some of the dependencies during learning, and take them into account only at decision time.
For each example (xi, yi) do:
  Predict: perform inference with the current weight vector, using only the simple part during learning, yi' = argmax_{y ∈ Y} w_EASY^T φ_EASY(xi, y), and reintroducing the HARD part, w_HARD^T φ_HARD(x, y), only at decision time.
  Check the learning constraint: is the score of the current prediction better than the score of (xi, yi)?
    If yes: a mistaken prediction; update w.
    Otherwise: no need to update w on this example.
EndFor
This is the most commonly used solution in NLP today.
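A sketch of Solution III under simplifying assumptions of mine: the EASY part is a set of independent per-token classifiers trained with no attention to output dependencies, and the HARD part enters only at decision time as a hard filter over candidate sequences (enumeration stands in for ILP or another global inference procedure).

from itertools import product
import numpy as np

LABELS = ["O", "A", "B"]   # hypothetical label set

def train_local(examples, n_features, epochs=5):
    """EASY part: independent multiclass perceptrons, one score row per label.
    Dependencies among the y_i are ignored during learning."""
    W = np.zeros((len(LABELS), n_features))
    for _ in range(epochs):
        for tokens, labels in examples:        # tokens: feature vectors; labels: ints
            for f, gold in zip(tokens, labels):
                pred = int(np.argmax(W @ f))
                if pred != gold:
                    W[gold] += f
                    W[pred] -= f
    return W

def constrained_decode(W, tokens, constraint):
    """HARD part, used only at decision time: pick the best label sequence
    among those that satisfy the declarative constraint."""
    local = [W @ f for f in tokens]            # local scores per position
    best, best_score = None, float("-inf")
    for seq in product(range(len(LABELS)), repeat=len(tokens)):
        if not constraint(seq):
            continue
        s = sum(local[i][y] for i, y in enumerate(seq))
        if s > best_score:
            best, best_score = seq, s
    return [LABELS[y] for y in best]

# Example constraint (from the CCM slide below): an output sequence cannot
# contain both "A" states (index 1) and "B" states (index 2).
no_mix = lambda seq: not ({1, 2} <= set(seq))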

Constrained Conditional Models
Any MAP problem w.r.t. any probabilistic model can be formulated as an ILP [Roth+ 04, Taskar 04], with indicator variables over the models' decisions:
  y = argmax_y Σ w_{x,y} 1_{φ(x,y)}   subject to constraints C(x, y)
or, with soft constraints,
  y = argmax_{y ∈ Y} w^T φ(x, y) + u^T C(x, y)
Here w is the weight vector for the "local" models, the "magic box(es)": features, classifiers, NNs, log-linear models, or a combination (this is where the non-linearity comes in), e.g., an entities model and a relations model. The term u^T C(x, y) is the knowledge component, the (soft) constraints: the penalty for violating a constraint, i.e., how far y is from a "legal/expected" assignment.
Training: learning the objective function (w, u). Decouple? Decompose? Force u to model hard constraints? There is some understanding of when to do what.
Inference: a way to push the learned model to satisfy our output expectations (or expectations over a latent representation) [CoDL, Chang, Ratinov & Roth (07, 12); Posterior Regularization, Ganchev et al. (10); Unified EM, Samdani & Roth (12); dozens of applications in NLP].
The benefits of thinking about it as an ILP are both conceptual and computational. Decomposition is key for abstraction and transfer.
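To make the two terms of the objective concrete, here is a small sketch of my own (not from the slides): the local model contributes w^T φ(x, y), each violated constraint subtracts a penalty u_k (one reading of the u^T C(x, y) term), and inference picks the best structure under the combined score; brute-force enumeration stands in for ILP inference.

import numpy as np
from itertools import product

def ccm_score(x, y, w, phi, constraints, penalties):
    """CCM objective: local model score plus a (soft) knowledge term
    that subtracts a penalty u_k for each violated constraint."""
    local = float(w @ phi(x, y))
    knowledge = -sum(u for c, u in zip(constraints, penalties) if not c(x, y))
    return local + knowledge

def ccm_inference(x, candidates, w, phi, constraints, penalties):
    """argmax_y w^T phi(x, y) + u^T C(x, y), by enumeration for a small Y."""
    return max(candidates, key=lambda y: ccm_score(x, y, w, phi, constraints, penalties))

# Usage sketch: three-position label sequences over {O, A, B}, one soft constraint.
labels = ["O", "A", "B"]
candidates = list(product(labels, repeat=3))
phi = lambda x, y: np.array([1.0 if y[0] == "A" else 0.0, 1.0 if y[2] == "B" else 0.0])
w = np.array([1.0, 0.8])                          # locally, y1=A and y3=B both look good
no_mix = lambda x, y: not ({"A", "B"} <= set(y))  # knowledge: don't mix A and B states
print(ccm_inference(None, candidates, w, phi, [no_mix], [10.0]))
# -> a sequence with an A but no B; the locally best ('A', ?, 'B') is penalized away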

Examples: CCM Formulations
  y = argmax_{y ∈ Y} w^T φ(x, y) + u^T C(x, y)
While φ(x, y) and C(x, y) could be the same, we want C(x, y) to express high-level declarative knowledge over the statistical models. We formulate NLP problems as ILP problems (inference may be done otherwise):
1. Sequence tagging (HMM/CRF + global constraints). Sequential prediction, HMM/CRF based: argmax Σ λ_{ij} x_{ij}. Knowledge/linguistic constraint: an output sequence cannot contain both A states and B states.
2. Sentence compression/summarization (language model + global constraints). Language-model based: argmax Σ λ_{ijk} x_{ijk}. Knowledge/linguistic constraints: if a modifier is chosen, include its head; if a verb is chosen, include its arguments.
3. SRL (independent classifiers + global constraints); see the next slide.
Constrained Conditional Models allow us to decouple the complexity of the learned model from that of the desired output: learn a simple model (or multiple models; pipelines) and reason with a complex one. This is accomplished by incorporating constraints that bias/re-rank global decisions to satisfy (or minimally violate) our expectations.

Semantic Role Labeling (SRL) [Extended SRL]
Example: "I left my pearls to my daughter in my will." → [I]A0 left [my pearls]A1 [to my daughter]A2 [in my will]AM-LOC, where A0 is the leaver, A1 the things left, A2 the benefactor, and AM-LOC the location.
In the context of SRL, the goal is to predict, for each possible phrase in a given sentence, whether it is an argument and, if so, of what type. The algorithmic approach: learn multiple models identifying arguments and their types, and account for their interdependencies via ILP inference. Let the variable y_{a,t} indicate whether candidate argument a is assigned label t, and let c_{a,t} be the corresponding model score. Inference is then
  argmax Σ_{a,t} c_{a,t} y_{a,t}
  subject to: one label per argument, Σ_t y_{a,t} = 1 for every a; relations between verbs and arguments; ….
This uses the pipeline architecture's simplicity while maintaining uncertainty: keep probability distributions over the decisions and use global inference at decision time.
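A small sketch of this ILP, assuming the PuLP library is available; the candidate arguments, label set, scores, and the extra "each core label used at most once" constraint are made-up illustrations of the kind of constraints the slide alludes to, not the exact set used in the SRL system.

import pulp

# Hypothetical candidate arguments and local model scores c[a][t].
args = ["a1", "a2", "a3"]
labels = ["A0", "A1", "A2", "AM-LOC", "null"]
c = {a: {t: 0.0 for t in labels} for a in args}
c["a1"].update({"A0": 0.9, "A1": 0.2})
c["a2"].update({"A0": 0.7, "A1": 0.6})
c["a3"].update({"AM-LOC": 0.8, "null": 0.3})

prob = pulp.LpProblem("srl_inference", pulp.LpMaximize)
y = pulp.LpVariable.dicts("y", (args, labels), cat="Binary")  # y[a][t] = 1 iff a gets label t

# Objective: argmax sum_{a,t} c_{a,t} * y_{a,t}
prob += pulp.lpSum(c[a][t] * y[a][t] for a in args for t in labels)

# One label per argument: sum_t y_{a,t} = 1
for a in args:
    prob += pulp.lpSum(y[a][t] for t in labels) == 1

# Example structural constraint: each core label (A0, A1, A2) is used at most once.
for t in ["A0", "A1", "A2"]:
    prob += pulp.lpSum(y[a][t] for a in args) <= 1

prob.solve(pulp.PULP_CBC_CMD(msg=0))
print({a: next(t for t in labels if y[a][t].value() == 1) for a in args})
# Expected: a1 -> A0, a2 -> A1 (A0 already taken), a3 -> AM-LOC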