
Global Inference in Learning for Natural Language Processing
Vasin Punyakanok, Department of Computer Science, University of Illinois at Urbana-Champaign
Joint work with Dan Roth, Wen-tau Yih, and Dav Zimak

Page 2 Story Comprehension
1. Who is Christopher Robin?
2. What did Mr. Robin do when Chris was three years old?
3. When was Winnie the Pooh written?
4. Why did Chris write two books of his own?
(ENGLAND, June, 1989) - Christopher Robin is alive and well. He lives in England. He is the same person that you read about in the book, Winnie the Pooh. As a boy, Chris lived in a pretty home called Cotchfield Farm. When Chris was three years old, his father wrote a poem about him. The poem was printed in a magazine for others to read. Mr. Robin then wrote a book. He made up a fairy tale land where Chris lived. His friends were animals. There was a bear called Winnie the Pooh. There was also an owl and a young pig, called a piglet. All the animals were stuffed toys that Chris owned. Mr. Robin made them come to life with his words. The places in the story were all near Cotchfield Farm. Winnie the Pooh was written in Children still love to read about Christopher Robin and his animal friends. Most people don't know he is a real person who is grown now. He has written two books of his own. They tell what it is like to be famous.

Page 3 Stand Alone Ambiguity Resolution
Context Sensitive Spelling Correction: Illinois' bored of education → board
Word Sense Disambiguation: ...Nissan Car and truck plant is... / ...divide life into plant and animal kingdom...
Part of Speech Tagging: (This DT) (can N) (will MD) (rust V) – DT, MD, V, N
Coreference Resolution: The dog bit the kid. He was taken to a hospital. / The dog bit the kid. He was taken to a veterinarian.

Page 4 Textual Entailment
Eyeing the huge market potential, currently led by Google, Yahoo took over search company Overture Services Inc. last year. → Yahoo acquired Overture.
Question Answering: Who acquired Overture?

Page 5 Inference and Learning
Global decisions in which several local decisions play a role, but there are mutual dependencies on their outcomes.
Learned classifiers for different sub-problems.
Incorporate the classifiers' information, along with constraints, in making coherent decisions – decisions that respect the local classifiers as well as domain- and context-specific constraints.
Global inference for the best assignment to all variables of interest.

Page 6 Text Chunking [figure: the sentence x = "The guy standing there is so tall" with its chunk labeling y, using NP, VP, ADVP, and ADJP chunks]

Page 7 Full Parsing [figure: the full parse tree y (S, NP, VP, ADVP, ADJP) for the sentence x = "The guy standing there is so tall"]

Page 8 Outline
- Semantic Role Labeling Problem
- Global Inference with Integer Linear Programming
- Some Issues with Learning and Inference
  - Global vs Local Training
  - Utility of Constraints in the Inference
- Conclusion

Page 9 Semantic Role Labeling
I left my pearls to my daughter in my will.
[I]_A0 left [my pearls]_A1 [to my daughter]_A2 [in my will]_AM-LOC
A0: Leaver; A1: Things left; A2: Benefactor; AM-LOC: Location

Page 10 Semantic Role Labeling
PropBank [Palmer et al. 05] provides a large human-annotated corpus of semantic verb-argument relations.
- It adds a layer of generic semantic labels to Penn Treebank II.
- (Almost) all the labels are on the constituents of the parse trees.
- Core arguments: A0-A5 and AA, with different semantics for each verb, specified in the PropBank frame files.
- 13 types of adjuncts, labeled AM-arg, where arg specifies the adjunct type.

Page 11 Semantic Role Labeling

Page 12 The Approach (running example: I left my nice pearls to her)
- Pruning: use heuristics to reduce the number of candidates (modified from [Xue & Palmer '04]).
- Argument Identification: use a binary classifier to identify arguments.
- Argument Classification: use a multiclass classifier to classify arguments.
- Inference: infer the final output satisfying linguistic and structural constraints.

Page 13 Learning
Both the argument identifier and the argument classifier are trained as phrase-based classifiers.
Features (some examples): voice, phrase type, head word, path, chunk, chunk pattern, etc. [some make use of a full syntactic parse]
Learning algorithm – SNoW: a sparse network of linear functions whose weights are learned by a regularized Winnow multiplicative update rule with averaged weight vectors.
Probability conversion is done via softmax: p_i = exp{act_i} / Σ_j exp{act_j}
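For concreteness, the softmax conversion can be written as below; this is a standard, numerically stable version, not the SNoW-internal implementation.

```python
import numpy as np

def softmax(activations):
    # p_i = exp{act_i} / sum_j exp{act_j}; shifting by the max keeps exp() stable.
    a = np.asarray(activations, dtype=float)
    e = np.exp(a - a.max())
    return e / e.sum()

# Example: softmax([2.0, 1.0, 0.1]) -> array of probabilities summing to 1.
```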

Page 14 Inference
The output of the argument classifier often violates some constraints, especially when the sentence is long.
Finding the best legitimate output is formalized as an optimization problem and solved via Integer Linear Programming.
Input: the probability estimates (from the argument classifier) and the structural and linguistic constraints.
This allows incorporating expressive (non-sequential) constraints on the variables (the argument types).

Page 15 Integer Linear Programming Inference
For each argument a_i and type t (including null), set up a Boolean variable a_{i,t} indicating whether a_i is classified as t.
Goal: maximize Σ_{i,t} score(a_i = t) · a_{i,t} subject to the (linear) constraints.
Any Boolean constraint can be encoded this way.
If score(a_i = t) = P(a_i = t), the objective is to find the assignment that maximizes the expected number of correct arguments and satisfies the constraints.

Page 16 Linear Constraints
No overlapping or embedding arguments:
∀ a_i, a_j that overlap or embed: a_{i,null} + a_{j,null} ≥ 1
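A minimal sketch of this ILP using the off-the-shelf PuLP library; the span representation (start, end), the label set, and the `score` table are illustrative assumptions, not the actual system's interfaces.

```python
from pulp import LpProblem, LpMaximize, LpVariable, LpBinary, lpSum, value

def srl_ilp(spans, labels, score):
    """spans: list of (start, end) argument candidates; labels: argument
    types including "null"; score[i][t]: classifier score for span i, type t."""
    prob = LpProblem("srl_inference", LpMaximize)

    # Boolean indicator x[i, t]: span i is assigned type t.
    x = {(i, t): LpVariable(f"x_{i}_{t}".replace("-", "_"), cat=LpBinary)
         for i in range(len(spans)) for t in labels}

    # Objective: sum of scores of the chosen assignments.
    prob += lpSum(score[i][t] * x[i, t]
                  for i in range(len(spans)) for t in labels)

    # Each candidate receives exactly one type (possibly "null").
    for i in range(len(spans)):
        prob += lpSum(x[i, t] for t in labels) == 1

    # No overlapping or embedding arguments:
    # if spans i and j intersect, at least one of them must be "null".
    for i in range(len(spans)):
        for j in range(i + 1, len(spans)):
            (s1, e1), (s2, e2) = spans[i], spans[j]
            if not (e1 < s2 or e2 < s1):      # overlap or embedding
                prob += x[i, "null"] + x[j, "null"] >= 1

    prob.solve()
    # Read the chosen type for each span off the solved indicators.
    return [max(labels, key=lambda t: value(x[i, t])) for i in range(len(spans))]
```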

Page 17 Constraints
- No overlapping or embedding arguments.
- No duplicate argument classes for A0-A5.
- Exactly one V argument per predicate.
- If there is a C-V, there must be a V-A1-C-V pattern.
- If there is an R-arg, there must be an arg somewhere.
- If there is a C-arg, there must be an arg somewhere before it.
- Each predicate can take only core arguments that appear in its frame file; more specifically, we check only the minimum and maximum IDs.
A few of these are encoded in the sketch below.
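Some of the constraints above translate directly into linear (in)equalities over the same x[i, t] indicators from the previous sketch. The encoding below is illustrative, not the exact one used in the system, and assumes the label set contains "V" and the referential labels "R-A0", ..., "R-A5".

```python
from pulp import lpSum

CORE = ["A0", "A1", "A2", "A3", "A4", "A5"]

def add_structural_constraints(prob, x, n_spans):
    # No duplicate argument classes for A0-A5: each core type used at most once.
    for t in CORE:
        prob += lpSum(x[i, t] for i in range(n_spans)) <= 1

    # Exactly one V argument per predicate.
    prob += lpSum(x[i, "V"] for i in range(n_spans)) == 1

    # If there is an R-arg, there must be an arg somewhere:
    # choosing R-t for any span requires some span to carry t.
    for t in CORE:
        for i in range(n_spans):
            prob += x[i, "R-" + t] <= lpSum(x[j, t] for j in range(n_spans))
```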

Page 18 SRL Results (CoNLL-2005)
Training: WSJ sections; Development: section 24; Test WSJ: section 23; Test Brown: from the Brown corpus (very small).
[Table: Precision, Recall, and F1 on Development, Test WSJ, and Test Brown, using Collins and Charniak parses.]

Page 19 Inference with Multiple SRL Systems
Goal: maximize Σ_{i,t} score(a_i = t) · a_{i,t} subject to the (linear) constraints; any Boolean constraint can be encoded this way.
Here score(a_i = t) = Σ_k f_k(a_i = t); if system k has no opinion on a_i, use a prior instead.
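One way to realize this combination; the dictionary shapes and the prior table are assumptions for illustration.

```python
def combined_score(i, t, system_scores, prior):
    """system_scores: one dict per SRL system, mapping (span index, type) -> f_k;
    prior: dict mapping type -> prior score, used when a system has no opinion."""
    # score(a_i = t) = sum_k f_k(a_i = t), with the prior as a fallback.
    return sum(scores_k.get((i, t), prior[t]) for scores_k in system_scores)
```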

Page 20 Results with Multiple Systems (CoNLL-2005)

Page 21 Outline
- Semantic Role Labeling Problem
- Global Inference with Integer Linear Programming
- Some Issues with Learning and Inference
  - Global vs Local Training
  - Utility of Constraints in the Inference
- Conclusion

Page 22 Learning and Inference
[diagram: local classifiers f_1(x), ..., f_5(x) over inputs x_1, ..., x_7 producing outputs y_1, ..., y_5, connected by constraints between X and Y]
Two strategies: training without constraints, with inference with constraints applied only at testing time; or IBT, Inference-Based Training – learning the components together.
Which one is better? When and why?

Page 23 Comparisons of Learning Approaches
Coupling (IBT): optimizes the true global objective function (this should be better in the limit).
Decoupling (L+I): more efficient; reusability of classifiers; modularity in training; no global examples required.

Page 24 Claims
When the local classification problems are "easy", L+I outperforms IBT.
Only when the local problems become difficult to solve in isolation does IBT outperform L+I, and it needs a large enough number of training examples.
We will show experimental results and theoretical intuition to support these claims.

Page Y’ Local Predictions Perceptron-based Global Learning x1x1 x6x6 x2x2 x5x5 x4x4 x3x3 x7x7 f1(x)f1(x) f2(x)f2(x) f3(x)f3(x) f4(x)f4(x) f5(x)f5(x) X Y 11 Y True Global Labeling 111 Y’ Apply Constraints:

Page 26 Simulation
There are 5 local binary linear classifiers; the global classifier is also linear: h(x) = argmax_{y ∈ C(Y)} Σ_i f_i(x, y_i).
Constraints are randomly generated.
The hypothesis is linearly separable at the global level, given that the constraints are known; the separability level at the local level is varied.

Page 27 Bound Prediction
Local: ε ≤ ε_opt + ((d log m + log 1/δ) / m)^{1/2}
Global: ε ≤ 0 + ((c·d·log m + c²·d + log 1/δ) / m)^{1/2}
[plots: bounds and simulated data, for ε_opt = 0, 0.1, 0.2]
L+I vs. IBT: the more identifiable the individual problems are, the better the overall performance is with L+I.

Page 28 Relative Merits: SRL
[plot: performance vs. difficulty of the learning problem (# features), from easy to hard]
L+I is better; when the problem is artificially made harder, the tradeoff is clearer.

Page 29 Summary
When the local classification problems are "easy", L+I outperforms IBT.
Only when the local problems become difficult to solve in isolation does IBT outperform L+I, and it needs a large enough number of training examples.
Why does inference help at all?

Page 30 About Constraints
We always assume that global coherency is good, and constraints do help in real-world applications.
However, performance is usually measured at the level of local predictions; depending on the performance metric, constraints can hurt.

Page 31 Results: Contribution of Expressive Constraints [Roth & Yih 05]
Basic: learning with statistical constraints only; additional constraints are added at evaluation time (for efficiency).
[Table: F1 and F1 difference for CRF-D and CRF-ML as constraints are added incrementally to the basic (Viterbi) system: disallow, + verb pos, + argument, + cand, + no dup.]

Page 32 Assumptions
y = ⟨y_1, …, y_l⟩
Non-interactive classifiers f_i(x, y_i): each classifier does not use the outputs of other classifiers as inputs.
Inference is a linear summation:
h_un(x) = argmax_{y ∈ Y} Σ_i f_i(x, y_i)
h_con(x) = argmax_{y ∈ C(Y)} Σ_i f_i(x, y_i)
C(Y) always contains the correct outputs.
No assumption on the structure of the constraints.
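These two definitions can be written out by brute force for small output spaces; a toy sketch under the non-interactive-classifier assumption above (real systems solve this with ILP or dynamic programming instead).

```python
from itertools import product

def h_un(x, local_scores, label_sets):
    # Unconstrained: with non-interactive classifiers, the global argmax
    # decomposes into an independent argmax per output variable.
    return tuple(max(labels, key=lambda t: local_scores[i](x, t))
                 for i, labels in enumerate(label_sets))

def h_con(x, local_scores, label_sets, in_C):
    # Constrained: best total score among outputs y with in_C(y) == True.
    feasible = (y for y in product(*label_sets) if in_C(y))
    return max(feasible,
               key=lambda y: sum(f(x, t) for f, t in zip(local_scores, y)))
```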

Page 33 Performance Metrics
Zero-one loss: mistakes are counted in terms of global mistakes; y is wrong if any y_i is wrong.
Hamming loss: mistakes are counted in terms of local mistakes.
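In code, the two metrics (standard definitions):

```python
def zero_one_loss(y_true, y_pred):
    # Global mistake: 1 if any component differs, else 0.
    return int(any(t != p for t, p in zip(y_true, y_pred)))

def hamming_loss(y_true, y_pred):
    # Local mistakes: the number of components that differ.
    return sum(t != p for t, p in zip(y_true, y_pred))
```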

Page 34 Zero-One Loss
Constraints cannot hurt: they never turn a correct global output into an incorrect one (the correct output is always in C(Y)).
This is not true for Hamming loss.

Page 35 Boolean Cube [figure: 4-bit binary outputs arranged by Hamming loss (0, 1, 2, 3, and 4 mistakes) vs. score]

Page 36 Hamming Loss [figure: score vs. Hamming loss, marking the unconstrained prediction h_un (output 0011)]

Page 37 Best Classifiers [figure: score vs. Hamming loss, with the margins δ_1, …, δ_4 and δ_1 + δ_2 marked]

Page 38 When Constraints Cannot Hurt
δ_i: distance between the correct label and the 2nd-best label.
ε_i: distance between the predicted label and the correct label.
F_correct = { i | f_i is correct }, F_wrong = { i | f_i is wrong }.
Constraints cannot hurt if ∀ i ∈ F_correct: δ_i > Σ_{i ∈ F_wrong} ε_i.
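A direct check of this sufficient condition as reconstructed above; the list-based representation of the margins and distances is an assumption for illustration.

```python
def constraints_cannot_hurt(delta, epsilon, correct):
    """delta[i]: margin between the correct and 2nd-best label of classifier i;
    epsilon[i]: distance between classifier i's prediction and the correct label;
    correct[i]: whether classifier i predicted correctly."""
    wrong_total = sum(e for e, c in zip(epsilon, correct) if not c)
    # Every correct classifier's margin must exceed the total distance
    # accumulated by the wrong classifiers.
    return all(d > wrong_total for d, c in zip(delta, correct) if c)
```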

Page 39 An Empirical Investigation
SRL system, CoNLL-2005 WSJ test set.
Local accuracy: 82.38% without constraints, 84.08% with constraints.

Page 40 An Empirical Investigation
[Table: number and percentage of total predictions, incorrect predictions, correct predictions, and predictions that violate the condition.]

Page 41 Good Classifiers [figure: score vs. Hamming loss]

Page 42 Bad Classifiers [figure: score vs. Hamming loss]

Page 43 Average Distance vs. Gain in Hamming Loss [plot] For good classifiers: high loss → low score (low gain).

Page 44 Good Classifiers [figure: score vs. Hamming loss]

Page 45 Bad Classifiers [figure: score vs. Hamming loss]

Page 46 Average Gain in Hamming Loss vs. Distance [plot] For good classifiers: high score → low loss (high gain).

Page 47 Utility of Constraints
Constraints improve performance because the classifiers are good.
Good classifiers: when the classifier is correct, it allows a large margin between the correct label and the 2nd-best label; when the classifier is wrong, the correct label is not far from the predicted one.

Page 48 Conclusions
- Showed how global inference can be used: Semantic Role Labeling.
- Tradeoffs between coupling vs. decoupling learning and inference.
- Investigation of the utility of constraints.
The analyses are still preliminary. Future directions:
- Average-case analysis of the tradeoffs between coupling vs. decoupling learning and inference.
- Better understanding of how to use constraints: more interactive classifiers, different performance metrics (e.g., F1), and the relation with the margin.