Page 1 February 2008 University of Edinburgh With thanks to: Collaborators: Ming-Wei Chang, Vasin Punyakanok, Lev Ratinov, Nick Rizzolo, Mark Sammons, Scott Yih, Dav Zimak Funding: ARDA, under the AQUAINT program NSF: ITR IIS , ITR IIS , ITR IIS , SoD-HCER A DOI grant under the Reflex program, DASH Optimization (Xpress-MP) Constrained Conditional Models for Global Learning and Inference Dan Roth Department of Computer Science University of Illinois at Urbana-Champaign

Page 2 Nice to Meet You

Page 3 Learning and Inference
- Global decisions in which several local decisions play a role, but there are mutual dependencies on their outcome.
- (Learned) models/classifiers for different sub-problems.
- Incorporate the models' information, along with constraints, in making coherent decisions – decisions that respect the local models as well as domain- and context-specific constraints.
- Global inference for the best assignment to all variables of interest.

Page 4 Inference

Page 5 Comprehension
1. Christopher Robin was born in England.
2. Winnie the Pooh is a title of a book.
3. Christopher Robin's dad was a magician.
4. Christopher Robin must be at least 65 now.
A process that maintains and updates a collection of propositions about the state of affairs.
(ENGLAND, June, 1989) - Christopher Robin is alive and well. He lives in England. He is the same person that you read about in the book, Winnie the Pooh. As a boy, Chris lived in a pretty home called Cotchfield Farm. When Chris was three years old, his father wrote a poem about him. The poem was printed in a magazine for others to read. Mr. Robin then wrote a book. He made up a fairy tale land where Chris lived. His friends were animals. There was a bear called Winnie the Pooh. There was also an owl and a young pig, called a piglet. All the animals were stuffed toys that Chris owned. Mr. Robin made them come to life with his words. The places in the story were all near Cotchfield Farm. Winnie the Pooh was written in Children still love to read about Christopher Robin and his animal friends. Most people don't know he is a real person who is grown now. He has written two books of his own. They tell what it is like to be famous.

Page 6 What we Know: Stand Alone Ambiguity Resolution
Examples of ambiguity:
- Illinois' bored of education [board]
- ...Nissan Car and truck plant is... / ...divide life into plant and animal kingdom
- (This Art) (can N) (will MD) (rust V) vs. V, N, N
- The dog bit the kid. He was taken to a veterinarian / a hospital
Learn a function f: X → Y that maps observations in a domain to one of several categories or <

Page 7 Classification is Well Understood
- Theoretically: generalization bounds. How many examples does one need to see in order to guarantee good behavior on previously unobserved examples?
- Algorithmically: good learning algorithms for linear representations. They can deal with very high dimensionality (10^6 features) and are very efficient in terms of computation and number of examples; on-line.
- Key issues remaining:
  - Learning protocols: how to minimize interaction (supervision); how to map domain/task information to supervision; semi-supervised learning; active learning; ranking; adaptation.
  - What are the features? No good theoretical understanding here.
  - How to decompose problems and learn tractable models.
  - Modeling/programming systems that have multiple classifiers.

Page 8 Comprehension
1. Christopher Robin was born in England.
2. Winnie the Pooh is a title of a book.
3. Christopher Robin's dad was a magician.
4. Christopher Robin must be at least 65 now.
A process that maintains and updates a collection of propositions about the state of affairs.
(ENGLAND, June, 1989) - Christopher Robin is alive and well. He lives in England. He is the same person that you read about in the book, Winnie the Pooh. As a boy, Chris lived in a pretty home called Cotchfield Farm. When Chris was three years old, his father wrote a poem about him. The poem was printed in a magazine for others to read. Mr. Robin then wrote a book. He made up a fairy tale land where Chris lived. His friends were animals. There was a bear called Winnie the Pooh. There was also an owl and a young pig, called a piglet. All the animals were stuffed toys that Chris owned. Mr. Robin made them come to life with his words. The places in the story were all near Cotchfield Farm. Winnie the Pooh was written in Children still love to read about Christopher Robin and his animal friends. Most people don't know he is a real person who is grown now. He has written two books of his own. They tell what it is like to be famous.
This is an Inference Problem.

Page 9 This Talk
- Global Inference over Local Models/Classifiers + Expressive Constraints
  - Constrained Conditional Models
  - Generality of the framework
- Training Paradigms
  - Global training, decomposition, and local training
  - Semi-supervised learning
- Examples
  - Semantic parsing
  - Information extraction
  - Pipeline processes

Page 10 Sequential Constraints Structure
Three models for sequential inference with classifiers [Punyakanok & Roth NIPS'01]:
- HMM; HMM with classifiers. Sufficient for easy problems.
- Conditional models (PMM). Allow direct modeling of states as a function of the input; classifiers may vary: SNoW (Winnow; Perceptron), MEMM (MaxEnt), SVM-based.
- Constraint satisfaction models (CSCL: more general constraints). The inference problem is modeled as weighted 2-SAT; with sequential constraints it is shown to have an efficient solution.
Recent work views this as multi-class classification, with emphasis on global training [Collins'02, CRFs, SVMs]; efficiency and performance issues.
[Figure: two sequential graphical models over states s1..s6 and observations o1..o6.]
These sequential models are by far the most popular in applications and allow for dynamic-programming-based inference. What if the structure of the problem/constraints is not sequential?

Page 11 Pipeline
Most problems are not single classification problems.
[Pipeline figure: Raw Data → POS Tagging → Phrases → Semantic Entities → Relations; also Parsing, WSD, Semantic Role Labeling.]
Pipelining is a crude approximation: interactions occur across levels, and downstream decisions often interact with previous decisions. This leads to propagation of errors. Occasionally, later-stage problems are easier, but upstream mistakes will not be corrected.
Looking for:
- Global inference over the outcomes of different local predictors, as a way to break away from this paradigm [between pipeline & fully global].
- A flexible way to incorporate linguistic and structural constraints.

Page 12 Inference with General Constraint Structure [Roth&Yih'04]
Example: "Dole 's wife, Elizabeth, is a native of N.C." with entity variables E1, E2, E3 and relation variables R12, R23.
[Figure: for each entity, a distribution over {other, per, loc}, and for each relation, a distribution over {irrelevant, spouse_of, born_in}; joint inference selects a coherent global assignment.]
Improvement over no inference: 2-5%

Page 13 Problem Setting
- Random variables Y; conditional distributions P (learned by models/classifiers).
- Constraints C: any Boolean function defined on partial assignments (possibly with weights W).
- Goal: find the "best" assignment, i.e. the assignment that achieves the highest global performance.
Y* = argmax_Y P · Y subject to constraints C (+ W · C)
This is an Integer Programming problem.
[Figure: variables y1..y8 over observations, connected by constraints such as C(y1, y4) and C(y2, y3, y6, y7, y8).]
Other, more general ways to incorporate soft constraints exist here [ACL'07].

Page 14 Constrained Conditional Models
y* = argmax_y Σ_i w_i φ_i(x; y) + Σ_i ρ_i C_i(x, y)
- First term (conditional Markov random field): typically linear or log-linear; typically φ(x, y) will be local functions, or φ(x, y) = φ(x).
- Second term (constraints network): optimizes for general constraints; constraints may have weights and may be soft; they are specified declaratively as FOL formulae.
[Figure: two networks over variables y1..y8, a conditional Markov random field and a constraints network.]
Clearly, there is a joint probability distribution that represents this mixed model. We would like to:
- Make decisions with respect to the mixed model, but
- Not necessarily learn this complex model.

Page 15 A General Inference Setting
Linear objective function:
- Essentially all complex models studied today can be viewed as optimizing a linear objective function: HMMs/CRFs [Roth'99; Collins'02; Lafferty et al.'01].
- Linear objective functions can be derived from a probabilistic perspective: Markov Random Field → Optimization Problem (Metric Labeling) [standard; Kleinberg&Tardos] → Linear Programming problems [Chekuri et al.'01] → Inference as Constrained Optimization [Yih&Roth CoNLL'04] …
The probabilistic perspective supports finding the most likely assignment, which is not necessarily what we want.
Our Integer Linear Programming (ILP) formulation:
- Allows the incorporation of more general cost functions
- General (non-sequential) constraint structure
- Better exploitation (computationally) of hard constraints
- Can find the optimal solution if desired

Page 16 Formal Model
The objective combines:
- A collection of classifiers (log-linear models such as HMMs/CRFs, or a combination), with a weight vector for the "local" models.
- A (soft) constraints component: a penalty for violating each constraint, weighted by how far y is from a "legal" assignment.
The score is maximized subject to the constraints.
How to solve? This is an Integer Linear Program; solving it with ILP packages gives an exact solution, and search techniques are also possible.
How to train? How do we decompose the global objective function? Should we incorporate constraints in the learning process?
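The formal model itself appears on the slide only as an image. As a LaTeX reconstruction (an assumed form pieced together from the surrounding slides, not copied from the slide), with w and φ the local models' weights and features, ρ_k the penalty for violating constraint C_k, and d(y, 1_{C_k(x,y)}) measuring how far y is from an assignment that satisfies C_k:

% CCM objective (reconstruction; soft constraints enter as penalties)
y^{*} \;=\; \arg\max_{y}\; \sum_{i} w_i\,\phi_i(x, y) \;-\; \sum_{k} \rho_k\, d\big(y,\, 1_{C_k(x,y)}\big)

Hard constraints correspond to ρ_k = ∞, i.e. assignments violating them are simply infeasible.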

Page 17 Example: Semantic Role Labeling
Who did what to whom, when, where, why, ...
I left my pearls to my daughter in my will.
[I]_A0 left [my pearls]_A1 [to my daughter]_A2 [in my will]_AM-LOC
- A0: Leaver
- A1: Things left
- A2: Benefactor
- AM-LOC: Location
Example constraints: no overlapping arguments; if A2 is present, A1 must also be present.
Special case (structured output problem): here, all the data is available at one time; in general, classifiers might be learned from different sources, at different times, in different contexts. This has implications for training paradigms.

Page 18 Semantic Role Labeling (2/2)
- PropBank [Palmer et al. 05] provides a large human-annotated corpus of semantic verb-argument relations. It adds a layer of generic semantic labels to Penn Treebank II; (almost) all the labels are on the constituents of the parse trees.
- Core arguments: A0-A5 and AA; they have different semantics for each verb, specified in the PropBank frame files.
- 13 types of adjuncts, labeled AM-arg, where arg specifies the adjunct type.

Page 19 Algorithmic Approach
- Identify argument candidates: pruning [Xue&Palmer, EMNLP'04]; an argument identifier performs binary classification (SNoW). (The easy part.)
- Classify argument candidates: an argument classifier performs multi-class classification (SNoW).
- Inference: use the estimated probability distribution given by the argument classifier, use structural and linguistic constraints, and infer the optimal global output.
[Figure: candidate argument brackets over "I left my nice pearls to her"; identify the candidate vocabulary, then run inference over the (old and new) vocabulary of candidate arguments.]

Page 20 Inference
The output of the argument classifier often violates some constraints, especially when the sentence is long. Finding the best legitimate output is formalized as an optimization problem and solved via Integer Linear Programming [Punyakanok et al. 04; Roth & Yih 04, 05].
Input:
- The probability estimates (from the argument classifier)
- Structural and linguistic constraints
This allows incorporating expressive (non-sequential) constraints on the variables (the argument types).

Page 21 Integer Linear Programming Inference
- For each argument a_i, set up a Boolean variable a_{i,t} indicating whether a_i is classified as type t.
- Goal: maximize Σ_{i,t} score(a_i = t) · a_{i,t}, subject to the (linear) constraints.
- If score(a_i = t) = P(a_i = t), the objective is to find the assignment that maximizes the expected number of correct arguments while satisfying the constraints.
Note that this Constrained Conditional Model is completely decomposed during training.

Page 22 No duplicate argument classes  a  P OT A RG x { a = A0 }  1 R-ARG  a2  P OT A RG,  a  P OT A RG x { a = A0 }  x { a2 = R-A0 } C-ARG  a2  P OT A RG,  (a  P OT A RG )  (a is before a2 ) x { a = A0 }  x { a2 = C-A0 } Many other possible constraints: Unique labels No overlapping or embedding Relations between number of arguments If verb is of type A, no argument of type B Any Boolean rule can be encoded as a linear constraint. If there is an R-ARG phrase, there is an ARG Phrase If there is an C-ARG phrase, there is an ARG before it Constraints Joint inference can be used also to combine different SRL Systems. Universally quantified rules In LBJ we allow a programmer to encode their constraints in FOL; these are compiled into linear inequalities automatically.

Page 23 Extracting Relations via Semantic Analysis
[Screenshot from a CCG demo.] Semantic parsing reveals several relations in the sentence along with their arguments.
This approach produces a very good semantic parser: the top-ranked system in the CoNLL'05 shared task (the key difference is the inference), F1 ~ 90%, and it is easy and fast: ~7 sentences/sec (using Xpress-MP).

Page 24 ILP as a Unified Algorithmic Scheme
Consider a common model for sequential inference: HMM/CRF.
- Inference in this model is done via the Viterbi algorithm.
- Viterbi is a special case of Linear Programming based inference: it is a shortest-path problem, which is an LP with a canonical constraint matrix that is totally unimodular, so you get integrality constraints for free.
- One can now incorporate non-sequential/expressive/declarative constraints by modifying this canonical matrix, that is, by modifying the decision-time objective function.
- The extension reduces to a polynomial scheme under some conditions (e.g., when the constraints are sequential, when the solution space does not change, etc.).
- This does not necessarily increase complexity and is very efficient in practice [Roth&Yih, ICML'05].
[Figure: an HMM/CRF chain over y1..y5 with inputs x1..x5, and the corresponding trellis (shortest-path) graph from s to t with states A, B, C at each position.]
The HMM/CRF itself is a CCM that is trained globally (ML, discriminatively). Learn a rather simple model; make decisions with a more expressive model. So far we have used only (deterministic) constraints; the scheme can also be used with statistical constraints.
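To make the "modify the decision-time objective" idea concrete, here is a small illustrative sketch (not from the talk): a plain Viterbi decoder over made-up emission/transition scores, where a hypothetical declarative constraint ("label B may not follow label A") is enforced purely at decision time by pushing the corresponding transition score to minus infinity, leaving the learned model untouched.

import math

states = ["A", "B", "C"]

def viterbi(emit, trans):
    """emit: list of {state: score}; trans: {(prev, cur): score}. Returns the best path."""
    n = len(emit)
    best = [{s: emit[0][s] for s in states}]
    back = [{}]
    for t in range(1, n):
        best.append({})
        back.append({})
        for cur in states:
            prev_score, prev_state = max(
                (best[t - 1][p] + trans[(p, cur)], p) for p in states
            )
            best[t][cur] = prev_score + emit[t][cur]
            back[t][cur] = prev_state
    # Recover the highest-scoring path by backtracking.
    last = max(states, key=lambda s: best[n - 1][s])
    path = [last]
    for t in range(n - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

# Made-up learned scores (log-space); the model itself knows nothing about constraints.
emit = [{"A": 2.0, "B": 0.5, "C": 0.1},
        {"A": 0.2, "B": 1.9, "C": 0.3},
        {"A": 0.4, "B": 0.6, "C": 1.5}]
trans = {(p, c): 0.0 for p in states for c in states}

print("unconstrained:", viterbi(emit, trans))

# Decision-time constraint (hypothetical rule): "B may not follow A".
constrained_trans = dict(trans)
constrained_trans[("A", "B")] = -math.inf
print("constrained:  ", viterbi(emit, constrained_trans))

For constraints that break the sequential structure, the same learned scores can instead be handed to the ILP formulation above.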

Page 25 This Talk
- Global Inference over Local Models/Classifiers + Expressive Constraints
  - Constrained Conditional Models
  - Generality of the framework
- Training Paradigms
  - Global training, decomposition, and local training
  - Semi-supervised learning
- Examples
  - Semantic parsing
  - Information extraction
  - Pipeline processes

Page 26 Textual Entailment
Given: Q: Who acquired Overture?
Determine: A: Eyeing the huge market potential, currently led by Google, Yahoo took over search company Overture Services Inc last year.
Is it true that "Eyeing the huge market potential, currently led by Google, Yahoo took over search company Overture Services Inc. last year" entails "Yahoo acquired Overture"? (Textual Entailment)
Overture is a search company. Google is a search company. ………. Google owns Overture.
Component problems: phrasal verb paraphrasing [Connor&Roth'07]; entity matching [Li et al., AAAI'04, NAACL'04]; Semantic Role Labeling; inference for entailment [AAAI'05; TE'07].

Page 27 Training Paradigms that Support Global Inference
- Incorporating general constraints (algorithmic approach): allow both statistical and expressive declarative constraints; allow non-sequential constraints (generally difficult).
- Coupling vs. decoupling training and inference: incorporating global constraints is important, but should it be done only at evaluation time or also at training time? How do we decompose the objective function and train in parts?
- Issues related to: modularity, efficiency and performance, availability of training data, and problem-specific considerations. May not be relevant in some problems.

Page 28 Phrase Identification Problem
Use the classifiers' outcomes to identify phrases; the final outcome is determined by optimizing the classifiers' outcomes and the constraints.
[Figure: input tokens o1..o10 with open/close bracket predictions from Classifier 1 and Classifier 2, and the phrase structure inferred from them.]
Did this classifier make a mistake? How do we train it?

Page 29 Training in the Presence of Constraints
General training paradigm:
- First term: learning from data (could be further decomposed).
- Second term: guiding the model by constraints.
- One can choose whether the constraints' weights are trained (and when and how), or whether the constraints are taken into account only at evaluation time.

Page 30 L+I: Learning plus Inference / IBT: Inference-based Training
- L+I: training without constraints; testing: inference with constraints.
- IBT: learning the components together.
[Figure (cartoon): local models f1(x)..f5(x) over inputs x1..x7 and outputs y1..y5; each model can be more complex and may have a view on a set of output variables.]

Page 31 Perceptron-based Global Learning
[Figure: local models f1(x)..f5(x) over inputs x1..x7 produce local predictions Y'; applying the constraints yields a global labeling, which is compared against the true global labeling Y.]
Which one is better? When and why?
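As a purely illustrative sketch (not code from the talk), the IBT idea can be written as a constrained structured-perceptron loop: the constraint-respecting inference step sits inside the training loop, and the global weight vector is updated on whole structures. The feature map, inference routine, and data types below are hypothetical placeholders.

from typing import Callable, Dict, List, Tuple

Features = Dict[str, float]

def ibt_perceptron(
    data: List[Tuple[object, object]],                  # (x, true structure y)
    phi: Callable[[object, object], Features],          # joint feature map phi(x, y)
    constrained_argmax: Callable[[Dict[str, float], object], object],  # inference with constraints
    epochs: int = 5,
) -> Dict[str, float]:
    """Inference-based training: constrained inference runs inside the update loop."""
    w: Dict[str, float] = {}
    for _ in range(epochs):
        for x, y_true in data:
            y_hat = constrained_argmax(w, x)            # global, constraint-respecting prediction
            if y_hat != y_true:
                for k, v in phi(x, y_true).items():     # promote features of the true structure
                    w[k] = w.get(k, 0.0) + v
                for k, v in phi(x, y_hat).items():      # demote features of the predicted structure
                    w[k] = w.get(k, 0.0) - v
    return w

L+I, by contrast, would train each local classifier on its own labels and invoke the constrained inference only at test time.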

Page 32 Claims
- When the local models are "easy" to learn, L+I outperforms IBT. In many applications the components are identifiable and easy to learn (e.g., argument, open-close, PER).
- Only when the local problems become difficult to solve in isolation does IBT outperform L+I, and it needs a larger number of training examples.
- When data is scarce and the problems are not easy, constraints can be used, along with a "weak" model, to label unlabeled data and improve the model. Often, you don't want the data to affect your view of the constraints.
- L+I is cheaper computationally and modular; IBT is better in the limit and in other extreme cases. Combinations are possible: L+I, and then IBT.

Page 33 Bound Prediction
- Local: ε ≤ ε_opt + ((d log m + log 1/δ) / m)^{1/2}
- Global: ε ≤ 0 + ((c·d log m + c²·d + log 1/δ) / m)^{1/2}
[Figure: bounds and simulated data for ε_opt = 0, 0.1, 0.2; ε_opt is an indication of the hardness of the problem.]
L+I vs. IBT: the more identifiable the individual problems are, the better the overall performance is with L+I.

Page 34 Relative Merits: SRL
[Figure: performance vs. difficulty of the learning problem (# features), from easy to hard.]
L+I is better; when the problem is artificially made harder, the tradeoff is clearer.
In some cases problems are hard due to lack of training data, which motivates semi-supervised learning.

Page 35 Information Extraction with Background Knowledge (Constraints)
Prediction result of a trained HMM on the citation "Lars Ole Andersen. Program analysis and specialization for the C Programming language. PhD thesis. DIKU, University of Copenhagen, May."
[Figure: the HMM segments this citation incorrectly, scattering the field labels AUTHOR, TITLE, EDITOR, BOOKTITLE, TECH-REPORT, INSTITUTION, and DATE across it.]
It violates lots of constraints!

Page 36 Examples of Constraints
- Each field must be a consecutive list of words, and can appear at most once in a citation.
- State transitions must occur on punctuation marks.
- The citation can only start with AUTHOR or EDITOR.
- The words "pp." and "pages" correspond to PAGE.
- Four-digit numbers starting with 20xx or 19xx are DATE.
- Quotations can appear only in TITLE.
- ……
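Purely as an illustration (using a hypothetical token/label representation, not the talk's actual system), a few of these constraints can be checked mechanically; when constraints are soft, a violation count of this kind is what plays the role of the constraint penalty in the CCM objective:

import re
from typing import List, Tuple

def count_violations(tagged: List[Tuple[str, str]]) -> int:
    """Count violated constraints for a citation given as (token, FIELD) pairs."""
    violations = 0
    labels = [lab for _, lab in tagged]

    # The citation can only start with AUTHOR or EDITOR.
    if labels and labels[0] not in ("AUTHOR", "EDITOR"):
        violations += 1

    # Each field must form a single consecutive block (appear at most once).
    seen_blocks = set()
    for i, lab in enumerate(labels):
        if i == 0 or labels[i - 1] != lab:   # a new block starts here
            if lab in seen_blocks:
                violations += 1
            seen_blocks.add(lab)

    # "pp." / "pages" correspond to PAGE; four digits starting 19xx/20xx are DATE.
    for tok, lab in tagged:
        if tok in ("pp.", "pages") and lab != "PAGE":
            violations += 1
        if re.fullmatch(r"(19|20)\d\d", tok) and lab != "DATE":
            violations += 1
    return violations

# Hypothetical example: TITLE and DATE each appear in two separate blocks -> two violations.
example = [("Andersen", "AUTHOR"), (".", "AUTHOR"), ("Program", "TITLE"),
           ("analysis", "DATE"), ("and", "TITLE"), ("1994", "DATE")]
print(count_violations(example))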

Page 37 Information Extraction with Constraints
Adding constraints, we get the correct result:
[AUTHOR] Lars Ole Andersen. [TITLE] Program analysis and specialization for the C Programming language. [TECH-REPORT] PhD thesis. [INSTITUTION] DIKU, University of Copenhagen, [DATE] May,
If incorporated into semi-supervised training, better results mean better feedback!

Page 38 Semi-Supervised Learning with Constraints
[Figure: the model interacts with unlabeled data and with constraints, both at decision time and during training.]
In traditional semi-supervised learning the model can drift away from the correct one. Constraints can be used:
- At decision time, to bias the objective function towards favoring constraint satisfaction.
- At training time, to improve the labeling of unlabeled data (and thus improve the model).

Page 39 =learn(T) For N iterations do T=  For each x in unlabeled dataset y  Inference(x, ) T=T  {(x, y)} Supervised learning algorithm parameterized by Inference based augmentation of the training set (feedback) (inference with constraints). Inference(x,C, ) Constraint - Driven Learning (CODL) [Chang, Ratinov, Roth, ACL’07]

Page 40 Constraint-Driven Learning (CODL) [Chang, Ratinov, Roth, ACL'07]
λ = learn(T)    (supervised learning algorithm parameterized by λ)
For N iterations do:
    T = ∅
    For each x in the unlabeled dataset:
        {y1, …, yK} ← Top-K-Inference(x, C, λ)    (inference with constraints)
        T = T ∪ {(x, yi)}, i = 1…K    (feedback: inference-based augmentation of the training set)
    λ = γλ + (1 − γ) learn(T)    (learn from the new training data; weight the supervised and the unsupervised model)
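A minimal, purely illustrative Python sketch of this loop (the learn, top_k_inference, and combine callables are hypothetical placeholders standing in for the supervised learner, the constrained Top-K inference, and the γ-weighted parameter interpolation):

from typing import Callable, List, Tuple

def codl(
    labeled: List[Tuple[object, object]],
    unlabeled: List[object],
    learn: Callable[[List[Tuple[object, object]]], object],       # supervised learner -> model λ
    top_k_inference: Callable[[object, object], List[object]],    # constrained Top-K inference
    combine: Callable[[object, object, float], object],           # γ·λ_old + (1-γ)·λ_new
    n_iterations: int = 5,
    gamma: float = 0.9,
) -> object:
    """Constraint-driven self-training loop (CODL-style sketch)."""
    model = learn(labeled)
    for _ in range(n_iterations):
        new_training: List[Tuple[object, object]] = []
        for x in unlabeled:
            for y in top_k_inference(x, model):        # inference with constraints
                new_training.append((x, y))            # feedback: augment the training set
        model = combine(model, learn(new_training), gamma)
    return model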

Page 41 Token-based accuracy (inference with constraints)

Page 42 Semi-Supervised Learning with Constraints
Objective function: a Constrained Conditional Model in which we do not want to let training affect the constraints' part of the objective function.
[Figure: performance vs. the number of available labeled examples; curves for learning with constraints and for learning without constraints (300 examples).]
Constraints are used to bootstrap a semi-supervised learner: a poor model plus constraints are used to annotate unlabeled data, which in turn is used to keep training the model.

Page 43 Conclusions
- Constrained Conditional Models: a general paradigm for learning and inference in the context of natural language understanding tasks.
- A general constraint-optimization approach for integrating learned models with additional (declarative or statistical) expressivity.
- A paradigm for making machine learning practical: allow domain/task-specific constraints.
- How to train? Learn simple local models and make use of them globally (via global inference) [Punyakanok et al. IJCAI'05]; use domain knowledge and constraints to drive supervision [Klementiev & Roth, ACL'06; Chang, Ratinov, Roth, ACL'07].
- LBJ (Learning Based Java): a modeling language for Constrained Conditional Models. It supports programming along with building learned models, high-level specification of constraints, and inference with constraints.

Page 44 Questions? Thank you