
Page 1: Constrained Conditional Models: Learning and Inference for Information Extraction and Natural Language Understanding
Dan Roth, Department of Computer Science, University of Illinois at Urbana-Champaign
ILPNLP Workshop, NAACL-HLT, June 2009
With thanks to collaborators: Ming-Wei Chang, Dan Goldwasser, Vasin Punyakanok, Lev Ratinov, Nick Rizzolo, Mark Sammons, Ivan Titov, Scott Yih, Dav Zimak.
Funding: ARDA, under the AQUAINT program; NSF: ITR IIS, ITR IIS, ITR IIS, SoD-HCER; a DOI grant under the Reflex program; DHS; DASH Optimization (Xpress-MP).

Page 2: Constrained Conditional Models (CCMs)
CCMs assign values to variables in the presence of, and guided by, constraints.
- Informally: everything that has to do with global constraints (and learning models).
- A bit more formally: we typically make decisions based on a learned scoring model over the output variables; with CCMs we make decisions based on that model augmented with declarative constraints (the formal objective appears on Page 12).
- This is a global inference problem (you can solve it in multiple ways).
- We do not dictate how the models are learned, but we will discuss it and make suggestions.

Page 3: Constraints Driven Learning and Decision Making
Why constraints? The goal is building good NLP systems easily. We have prior knowledge at hand; how can we use it?
- Often knowledge can be injected directly and used to: improve decision making; guide learning; simplify the models we need to learn.
How useful are constraints?
- Useful for supervised learning.
- Useful for semi-supervised learning.
- Sometimes more efficient than labeling data directly.

Page 4 Make my day

Page 5: Learning and Inference
- Global decisions in which several local decisions play a role, but there are mutual dependencies on their outcomes.
  - E.g., structured output problems: multiple dependent output variables (the main playground for these methods so far).
- (Learned) models/classifiers for different sub-problems. In some cases, not all local models can be learned simultaneously; key examples in NLP are Textual Entailment and QA. In these cases, constraints may appear only at evaluation time.
- Incorporate the models' information, along with prior knowledge/constraints, in making coherent decisions: decisions that respect the local models as well as domain- and context-specific knowledge/constraints.

Page 6: Comprehension
This is an inference problem: a process that maintains and updates a collection of propositions about the state of affairs.
Passage: (ENGLAND, June, 1989) - Christopher Robin is alive and well. He lives in England. He is the same person that you read about in the book, Winnie the Pooh. As a boy, Chris lived in a pretty home called Cotchfield Farm. When Chris was three years old, his father wrote a poem about him. The poem was printed in a magazine for others to read. Mr. Robin then wrote a book. He made up a fairy tale land where Chris lived. His friends were animals. There was a bear called Winnie the Pooh. There was also an owl and a young pig, called a piglet. All the animals were stuffed toys that Chris owned. Mr. Robin made them come to life with his words. The places in the story were all near Cotchfield Farm. Winnie the Pooh was written in Children still love to read about Christopher Robin and his animal friends. Most people don't know he is a real person who is grown now. He has written two books of his own. They tell what it is like to be famous.
Questions:
1. Christopher Robin was born in England.
2. Winnie the Pooh is a title of a book.
3. Christopher Robin's dad was a magician.
4. Christopher Robin must be at least 65 now.

Page 7: This Talk: Constrained Conditional Models
A general inference framework that combines learning conditional models with using declarative, expressive constraints, within a constrained optimization framework.
- Formulate a decision process as a constrained optimization problem.
- Break up a complex problem into a set of sub-problems and require the components' outcomes to be consistent modulo constraints.
- Has been shown useful in the context of many NLP problems: SRL, summarization, co-reference, information extraction, transliteration [Roth & Yih '04, '07; Punyakanok et al. '05, '08; Chang et al. '07, '08; Clarke & Lapata '06, '07; Denis & Baldridge '07; Goldwasser & Roth '08].
Here: the focus is on learning and inference for structured NLP problems.
Issues to attend to:
- While we formulate the problem as an ILP, inference can be done in multiple ways: search, sampling, dynamic programming, SAT, ILP.
- The focus is on joint global inference; learning may or may not be joint. Decomposing models is often beneficial.

Page 8: Outline
- Constrained Conditional Models: motivation; examples.
- Training paradigms: investigate ways for training models and combining constraints.
  - Joint learning and inference vs. decoupling learning and inference.
  - Training with hard and soft constraints.
  - Guiding semi-supervised learning with constraints.
- Examples: semantic parsing; information extraction; pipeline processes.

Page 9: Pipeline
Most problems are not single classification problems.
Raw Data -> POS Tagging -> Phrases -> Semantic Entities -> Relations (plus Parsing, WSD, Semantic Role Labeling).
Conceptually, pipelining is a crude approximation:
- Interactions occur across levels, and downstream decisions often interact with previous decisions.
- It leads to propagation of errors.
- Occasionally, later-stage problems are easier, but they cannot correct earlier errors.
But there are good reasons to use pipelines: putting everything in one basket may not be right. How about choosing some stages and thinking about them jointly?

Page 10: Inference with General Constraint Structure [Roth & Yih '04]: Recognizing Entities and Relations
Dole 's wife, Elizabeth, is a native of N.C.
(Figure: entity candidates E1 = Dole, E2 = Elizabeth, E3 = N.C., each with classifier scores over {other, per, loc}, and relation candidates R12, R23 with scores over {irrelevant, spouse_of, born_in}; e.g., per 0.85 for one entity and born_in 0.85 for one relation.)
Key components:
1) Write down a (linear) objective function.
2) Write down the constraints as linear inequalities.
x* = argmax_x Σ_v c(x=v)·[x=v] = argmax_x c_{E1=per}·x_{E1=per} + c_{E1=loc}·x_{E1=loc} + ... + c_{R12=spouse_of}·x_{R12=spouse_of} + ... + c_{R12=irrelevant}·x_{R12=irrelevant},  subject to the constraints.
Notes: the constraint structure is non-sequential; models could be learned separately, and constraints may come up only at decision time. Improvement over no inference: 2-5%.
Some questions: How to guide the global inference? Why not learn jointly?
An illustrative ILP encoding of this example appears below.
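To make the formulation concrete, here is a hedged sketch of this entity/relation example as a 0-1 ILP in Python, assuming the PuLP package. The scores are illustrative stand-ins in the spirit of the slide's figure, and the coherence constraints (spouse_of needs two persons, born_in needs a person and a location) are the kind of structure the slide refers to, not a verbatim reproduction of the authors' system.

```python
# Illustrative 0-1 ILP for joint entity/relation inference (Roth & Yih '04 style).
# Not the authors' code; the scores below are hypothetical.
from pulp import LpProblem, LpVariable, LpMaximize, lpSum, LpBinary

entities = ["E1", "E2", "E3"]            # Dole, Elizabeth, N.C.
ent_labels = ["other", "per", "loc"]
relations = {"R12": ("E1", "E2"), "R23": ("E2", "E3")}
rel_labels = ["irrelevant", "spouse_of", "born_in"]

ent_score = {("E1", "per"): 0.85, ("E1", "loc"): 0.10, ("E1", "other"): 0.05,
             ("E2", "per"): 0.60, ("E2", "loc"): 0.30, ("E2", "other"): 0.10,
             ("E3", "per"): 0.50, ("E3", "loc"): 0.45, ("E3", "other"): 0.05}
rel_score = {("R12", "spouse_of"): 0.45, ("R12", "born_in"): 0.50, ("R12", "irrelevant"): 0.05,
             ("R23", "spouse_of"): 0.05, ("R23", "born_in"): 0.85, ("R23", "irrelevant"): 0.10}

prob = LpProblem("entities_and_relations", LpMaximize)
xe = LpVariable.dicts("xe", list(ent_score), cat=LpBinary)
xr = LpVariable.dicts("xr", list(rel_score), cat=LpBinary)

# Objective: total score of the selected assignment
prob += lpSum(ent_score[k] * xe[k] for k in ent_score) + \
        lpSum(rel_score[k] * xr[k] for k in rel_score)

# Each entity / relation variable takes exactly one label
for e in entities:
    prob += lpSum(xe[(e, l)] for l in ent_labels) == 1
for r in relations:
    prob += lpSum(xr[(r, l)] for l in rel_labels) == 1

# Coherence constraints: spouse_of needs two persons; born_in needs a person and a location
for r, (a, b) in relations.items():
    prob += xr[(r, "spouse_of")] <= xe[(a, "per")]
    prob += xr[(r, "spouse_of")] <= xe[(b, "per")]
    prob += xr[(r, "born_in")] <= xe[(a, "per")]
    prob += xr[(r, "born_in")] <= xe[(b, "loc")]

prob.solve()
print([k for k in ent_score if xe[k].value() == 1], [k for k in rel_score if xr[k].value() == 1])
```

With these stand-in scores the solver returns E1 = per, E2 = per, E3 = loc, R12 = spouse_of, R23 = born_in; the constraints flip R12 away from its locally best label, which is the effect the slide illustrates.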

Page 11: Problem Setting
- Random variables Y; conditional distributions P (learned by models/classifiers).
- Constraints C: any Boolean function defined over partial assignments (possibly with weights W).
- Goal: find the "best" assignment, the one that achieves the highest global performance:
  Y* = argmax_Y P(Y), subject to the constraints C (+ W for weighted constraints).
- This is an Integer Programming problem.
(Figure: a constraints network over output variables y1..y8 and the observations, with constraints such as C(y1, y4) and C(y2, y3, y6, y7, y8).)

Page 12: Formal Model
The objective (reconstructed below) combines two parts, subject to constraints:
- A collection of classifiers, log-linear models (HMM, CRF), or a combination, with a weight vector w for the "local" models.
- A (soft) constraints component, with a penalty ρ_i for violating each constraint and a distance d_C measuring how far y is from a "legal" assignment.
How to solve? This is an Integer Linear Program; solving it with ILP packages gives an exact solution, and search techniques are also possible.
How to train? How should we decompose the global objective function, and should we incorporate constraints in the learning process?
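The objective on this slide was an equation image; the following LaTeX is a reconstruction under the assumption that it matches the summary formula on Page 60, with the callouts above mapped onto its two terms:

```latex
y^{*} \;=\; \arg\max_{y}\;
\underbrace{\sum_{i} w_{i}\,\phi_{i}(x,y)}_{\text{weight vector for the ``local'' models}}
\;-\;
\underbrace{\sum_{i} \rho_{i}\, d_{C_{i}}(x,y)}_{\text{(soft) constraints component}}
\qquad \text{subject to constraints.}
```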

Page 13: Example: Semantic Role Labeling
Who did what to whom, when, where, why, ...
I left my pearls to my daughter in my will.
[I]_A0 left [my pearls]_A1 [to my daughter]_A2 [in my will]_AM-LOC.
- A0: Leaver
- A1: Things left
- A2: Benefactor
- AM-LOC: Location
Issues: overlapping arguments; if A2 is present, A1 must also be present.
Special case (structured output problem): here, all the data is available at one time; in general, classifiers might be learned from different sources, at different times, in different contexts. This has implications for the training paradigms.

Page 14: Semantic Role Labeling (2/2)
- PropBank [Palmer et al. '05] provides a large human-annotated corpus of semantic verb-argument relations. It adds a layer of generic semantic labels to Penn Treebank II; (almost) all the labels are on the constituents of the parse trees.
- Core arguments: A0-A5 and AA; different semantics for each verb, specified in the PropBank frame files.
- 13 types of adjuncts, labeled AM-arg, where arg specifies the adjunct type.

Page 15: Algorithmic Approach
- Identify argument candidates (the easy step): pruning [Xue & Palmer, EMNLP'04] and a binary argument identifier (SNoW).
  I left my nice pearls to her   [ [ [ [ [ ] ] ] ] ]   (candidate arguments)
- Classify argument candidates: a multi-class argument classifier (SNoW).
- Inference: use the estimated probability distribution given by the argument classifier, together with structural and linguistic constraints, to infer the optimal global output.
The identification step defines the candidate vocabulary; inference operates over the (old and new) vocabulary.

Page 16: Inference
I left my nice pearls to her
- The output of the argument classifier often violates some constraints, especially when the sentence is long.
- Finding the best legitimate output is formalized as an optimization problem and solved via Integer Linear Programming [Punyakanok et al. '04; Roth & Yih '04, '05, '07].
- Input: the probability estimation (by the argument classifier) and the structural and linguistic constraints.
- Allows incorporating expressive (non-sequential) constraints on the variables (the argument types).

Page 17 Semantic Role Labeling (SRL) I left my pearls to my daughter in my will

Page 18 Semantic Role Labeling (SRL) I left my pearls to my daughter in my will

Page 19 Semantic Role Labeling (SRL) I left my pearls to my daughter in my will One inference problem for each verb predicate.

Page 20: Integer Linear Programming Inference
- For each argument a_i, set up a Boolean variable a_{i,t} indicating whether a_i is classified as type t.
- Goal: maximize Σ_i Σ_t score(a_i = t) · a_{i,t}, subject to the (linear) constraints.
- If score(a_i = t) = P(a_i = t), the objective is to find the assignment that maximizes the expected number of correct arguments and satisfies the constraints.
- The Constrained Conditional Model is completely decomposed during training.

Page 21: Constraints
Any Boolean rule can be encoded as a set of linear constraints; the rules are universally quantified. LBJ allows a developer to encode constraints in FOL, and these are compiled into linear inequalities automatically.
- No duplicate argument classes:  Σ_{a ∈ PotArg} x_{a = A0} ≤ 1
- R-ARG (if there is an R-ARG phrase, there is an ARG phrase):  ∀ a2 ∈ PotArg:  Σ_{a ∈ PotArg} x_{a = A0} ≥ x_{a2 = R-A0}
- C-ARG (if there is a C-ARG phrase, there is an ARG phrase before it):  ∀ a2 ∈ PotArg:  Σ_{a ∈ PotArg, a before a2} x_{a = A0} ≥ x_{a2 = C-A0}
- Many other possible constraints: unique labels; no overlapping or embedding; relations between the numbers of arguments; order constraints; if the verb is of type A, no argument of type B.
Joint inference can also be used to combine different SRL systems.
A sketch of handing such constraints to an off-the-shelf ILP solver appears below.
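Below is the promised sketch of how a few of these constraint families can be encoded for an off-the-shelf ILP solver in Python, assuming the PuLP package; the candidate spans, scores, and label set are hypothetical.

```python
# Hedged sketch (not the authors' code): SRL inference with a few of the
# constraints above, using PuLP. Inputs are hypothetical candidate spans
# (start, end) and classifier scores per label.
from pulp import LpProblem, LpVariable, LpMaximize, lpSum, LpBinary

labels = ["A0", "A1", "A2", "R-A0", "C-A0", "null"]
candidates = [(0, 0), (2, 3), (5, 7), (9, 11)]               # hypothetical spans
scores = {(c, t): 0.1 for c in candidates for t in labels}   # plug in real scores here
scores[((0, 0), "A0")] = 0.9                                 # made-up example values
scores[((2, 3), "A1")] = 0.8

prob = LpProblem("srl_inference", LpMaximize)
x = LpVariable.dicts("x", [(c, t) for c in candidates for t in labels], cat=LpBinary)

# Objective: expected number of correct arguments
prob += lpSum(scores[(c, t)] * x[(c, t)] for c in candidates for t in labels)

# Each candidate gets exactly one label (possibly "null")
for c in candidates:
    prob += lpSum(x[(c, t)] for t in labels) == 1

# No duplicate argument classes (shown here for A0)
prob += lpSum(x[(c, "A0")] for c in candidates) <= 1

# R-A0 requires an A0 somewhere
for c2 in candidates:
    prob += lpSum(x[(c, "A0")] for c in candidates) >= x[(c2, "R-A0")]

# C-A0 requires an A0 that starts before it
for c2 in candidates:
    prob += lpSum(x[(c, "A0")] for c in candidates if c[0] < c2[0]) >= x[(c2, "C-A0")]

# No overlapping or embedding arguments: overlapping spans cannot both be non-null
def overlaps(a, b):
    return not (a[1] < b[0] or b[1] < a[0])

for i, a in enumerate(candidates):
    for b in candidates[i + 1:]:
        if overlaps(a, b):
            prob += (1 - x[(a, "null")]) + (1 - x[(b, "null")]) <= 1

prob.solve()
print([(c, t) for c in candidates for t in labels if x[(c, t)].value() == 1 and t != "null"])
```

Each universally quantified rule becomes one inequality per grounding; per the slide, this is the kind of translation LBJ performs automatically from FOL-style constraint declarations.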

Page 22: Learning Based Java (LBJ): A Modeling Language for Constrained Conditional Models
- Supports programming along with building learned models, high-level specification of constraints, and inference with constraints.
- Learning operator: functions defined in terms of data; learning happens at "compile time".
- Integrated constraint language: declarative, FOL-like syntax defines constraints in terms of your Java objects.
- Compositionality: use any function as a feature extractor; easily combine existing model specifications / learned models with each other.

Page 23: Example: Semantic Role Labeling
Declarative, FOL-style constraints are written in terms of functions applied to Java objects [Rizzolo, Roth '07]; inference produces new functions that respect the constraints. The LBJ site provides example code for NER, a POS tagger, etc.

Page 24: Semantic Role Labeling
Screenshot from a CCG demo: semantic parsing reveals several relations in the sentence along with their arguments.
- Top-ranked system in the CoNLL'05 shared task; the key difference is the inference.
- This approach produces a very good semantic parser: F1 ~ 90%.
- Easy and fast: ~7 sentences/second (using Xpress-MP).

Page 25: Textual Entailment
Text: Eyeing the huge market potential, currently led by Google, Yahoo took over search company Overture Services Inc. last year.
Hypothesis: Yahoo acquired Overture.
Is it true that...? (Textual Entailment)
- Overture is a search company.
- Google is a search company.
- ...
- Google owns Overture.
Component problems: phrasal verb paraphrasing [Connor & Roth '07]; entity matching [Li et al., AAAI'04, NAACL'04]; semantic role labeling [Punyakanok et al. '05, '08]; inference for entailment [Braz et al. '05, '07].

Page 26: Outline
- Constrained Conditional Models: motivation; examples.
- Training paradigms: investigate ways for training models and combining constraints.
  - Joint learning and inference vs. decoupling learning and inference.
  - Training with hard and soft constraints.
  - Guiding semi-supervised learning with constraints.
- Examples: semantic parsing; information extraction; pipeline processes.

Page 27: Training Paradigms that Support Global Inference
- Algorithmic approach: incorporating general constraints.
  - Allow both statistical and expressive declarative constraints [ICML'05].
  - Allow non-sequential constraints (generally difficult) [CoNLL'04].
- Coupling vs. decoupling training and inference.
  - Incorporating global constraints is important, but should it be done only at evaluation time or also at training time?
  - How to decompose the objective function and train in parts?
  - Issues related to: modularity, efficiency and performance, availability of training data, and problem-specific considerations.

Page 28: Training in the Presence of Constraints
General training paradigm:
- First term: learning from data (could be further decomposed).
- Second term: guiding the model by constraints.
- One can choose whether the constraints' weights are trained (and when and how), or taken into account only at evaluation time.
Two decompositions to consider: decomposing the model (the SRL case), and decomposing the model from the constraints.

Page 29: Comparing Training Methods
- Option 1: Learning + Inference (with constraints): ignore the constraints during training.
- Option 2: Inference (with constraints) Based Training: consider the constraints during training.
In both cases: global decision making with constraints.
Question: isn't Option 2 always better? Not so simple... Next, the "local model story".

Page 30: Training Methods
(Cartoon: inputs x1..x7, outputs y1..y5, and local models f1(x)..f5(x); each model can be more complex and may have a view on a set of output variables.)
- Learning + Inference (L+I): learn the models independently.
- Inference Based Training (IBT): learn all the models together.
Intuition: learning with constraints may make learning more difficult.

Page 31: Training with Constraints
Example: Perceptron-based global learning.
(Figure: inputs x1..x7 feed local models f1(x)..f5(x); the true global labeling Y is compared with the local predictions Y' and with Y' after applying the constraints.)
Which one is better? When, and why?

Page 32: Claims [Punyakanok et al., IJCAI 2005]
- When the local models are "easy" to learn, L+I outperforms IBT. In many applications, the components are identifiable and easy to learn (e.g., argument, open-close, PER).
- Only when the local problems become difficult to solve in isolation does IBT outperform L+I, and it needs a larger number of training examples.
- Other training paradigms are possible. Pipeline-like sequential models [Roth, Small, Titov: AI&Stats'09]: identify a preferred ordering among components and learn the k-th model jointly with the previously learned models.
- L+I is cheaper computationally and modular; IBT is better in the limit, and in other extreme cases.

Page 33: Bound Prediction
- Local (L+I):  ε ≤ ε_opt + ( (d·log m + log 1/δ) / m )^{1/2}
- Global (IBT): ε ≤ 0 + ( (c·d·log m + c²·d + log 1/δ) / m )^{1/2}
(Plots: bounds and simulated data, with curves for ε_opt = 0, 0.1, 0.2, an indication of the hardness of the problem.)
L+I vs. IBT: the more identifiable the individual problems are, the better the overall performance is with L+I.

Page 34: Relative Merits: SRL
(Plot: performance vs. difficulty of the learning problem (# features), from easy to hard.)
L+I is better; when the problem is artificially made harder, the tradeoff becomes clearer.

Page 35: Comparing Training Methods (cont.)
- Local models (trained independently) vs. structured models: in many cases, structured models might be better due to expressivity.
- But what if we use constraints? Local models + constraints vs. structured models + constraints:
  - Hard to tell: constraints are expressive.
  - For tractability reasons, structured models have less expressivity than the use of constraints.
  - Local can be better, because local models are easier to learn.
(Recall the two decompositions: decomposing the model (the SRL case) and decomposing the model from the constraints.)

Page 36: Example: CRFs are CCMs
Consider a common model for sequential inference: HMM/CRF. Inference in this model is done via the Viterbi algorithm (a minimal Viterbi sketch follows below).
- Viterbi is a special case of Linear Programming based inference: it is a shortest-path problem, which is an LP with a canonical constraint matrix that is totally unimodular, so you get integrality for free.
- One can now incorporate non-sequential/expressive/declarative constraints by modifying this canonical matrix (e.g., no value can appear twice; a specific value must appear at least once; A → B) and run the inference as an ILP.
(Figure: a trellis from source s to sink t over positions y1..y5, each with states A, B, C.)
Learn a rather simple model; make decisions with a more expressive model. But you can do better.
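For concreteness, here is a minimal, hedged Viterbi sketch in Python for the unconstrained sequential model (hypothetical scores; not the tutorial's code). A non-sequential constraint such as "label B appears at most once in the sequence" does not decompose over adjacent positions, which is what motivates re-casting the same inference as a shortest-path LP/ILP where such constraints are just extra rows.

```python
# Minimal Viterbi decoder over states A, B, C with additive log-scores.
import math

states = ["A", "B", "C"]

def viterbi(obs_scores, trans_scores):
    """obs_scores[t][s]: score of state s at position t;
    trans_scores[(p, s)]: score of moving from state p to state s."""
    n = len(obs_scores)
    best = [{s: obs_scores[0][s] for s in states}]
    back = [{}]
    for t in range(1, n):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + trans_scores[(p, s)]) for p in states),
                key=lambda kv: kv[1])
            best[t][s] = score + obs_scores[t][s]
            back[t][s] = prev
    # Backtrack from the best final state
    last = max(states, key=lambda s: best[n - 1][s])
    path = [last]
    for t in range(n - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

# Toy run with uniform transitions and made-up emission scores
obs = [{"A": 0.2, "B": 1.0, "C": 0.1},
       {"A": 0.9, "B": 0.3, "C": 0.2},
       {"A": 0.1, "B": 0.8, "C": 0.4}]
trans = {(p, s): math.log(1.0 / 3) for p in states for s in states}
print(viterbi(obs, trans))
```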

Page 37: Example: Semantic Role Labeling Revisited
- Sequential models (Conditional Random Field, global perceptron): training is sentence-based; testing finds the shortest path, with constraints.
- Local models (logistic regression, local averaged perceptron): training is token-based; testing finds the best assignment locally, with constraints.
(Figure: the same s-to-t trellis over states A, B, C.)

Page 38: Which Model is Better? Semantic Role Labeling
Experiments on SRL [Roth and Yih, ICML 2005]. Story: inject constraints into conditional random field models.
(Table: columns for the models CRF, CRF-D, CRF-IBT, and Avg. P; rows for Baseline, + Constraints, and Training Time; the column annotations distinguish sequential vs. local models and L+I vs. IBT training.)
- Without constraints: sequential models are better than local models.
- With constraints: local models are now better than sequential models!

Page 39: Summary: Training Methods
Many choices for training a CCM:
- Learning + Inference (training without constraints).
- Inference Based Learning (training with constraints).
- Model decomposition.
Advantages of L+I: requires fewer training examples; more efficient; most of the time, better performance; modularity, so it is easier to incorporate already-learned models.
Advantages of IBT: better in the limit; better when there are strong interactions among the y's.
Learn a rather simple model; make decisions with a more expressive model.

Page 40: Outline
- Constrained Conditional Models: motivation; examples.
- Training paradigms: investigate ways for training models and combining constraints.
  - Joint learning and inference vs. decoupling learning and inference.
  - Training with hard and soft constraints.
  - Guiding semi-supervised learning with constraints.
- Examples: semantic parsing; information extraction; pipeline processes.

Page 41: Constrained Conditional Model: Soft Constraints
The objective again combines the learned models with a (soft) constraints component, where each constraint carries a violation penalty and a measure of how far y is from a "legal" assignment, subject to the constraints. Four questions:
(1) Why use soft constraints?
(2) How to model the "degree of violation"?
(3) How to solve? This is an Integer Linear Program; solving it with ILP packages gives an exact solution, and search techniques are also possible.
(4) How to train? How should we decompose the global objective function, and should we incorporate constraints in the learning process?

Page 42: (1) Why Are Soft Constraints Important?
- Some constraints may be violated by the data.
- Even when the gold data violates no constraints, the model may prefer illegal solutions. If all solutions considered by the model violate constraints, we still want to rank solutions by the level of constraint violation (important when beam search is used): rather than eliminating illegal assignments, re-rank them.
- Working with soft constraints [Chang et al., ACL'07]: we need to define the degree of violation (which may be problem-specific) and to assign penalties to the constraints. A small sketch of one possible violation count appears below.
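As an illustration of one possible (problem-specific) definition of the degree of violation, the sketch below counts how many duplicate core arguments an SRL labeling contains beyond the first occurrence; this is a hypothetical choice, not the tutorial's definition.

```python
# One possible "degree of violation" for the constraint "no duplicate core
# argument classes": count occurrences beyond the first. Illustrative only.
from collections import Counter

CORE = frozenset({"A0", "A1", "A2", "A3", "A4", "A5"})

def duplicate_core_violations(labels, core=CORE):
    counts = Counter(l for l in labels if l in core)
    return sum(c - 1 for c in counts.values() if c > 1)

print(duplicate_core_violations(["A0", "A1", "null", "A0", "A2", "A0"]))  # prints 2
```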

Page 43: Information Extraction without Prior Knowledge
Citation: Lars Ole Andersen. Program analysis and specialization for the C Programming language. PhD thesis. DIKU, University of Copenhagen, May 1994.
Prediction result of a trained HMM over the fields [AUTHOR] [TITLE] [EDITOR] [BOOKTITLE] [TECH-REPORT] [INSTITUTION] [DATE]: the predicted segmentation violates lots of natural constraints!

Page 44: Examples of Constraints
- Each field must be a consecutive list of words and can appear at most once in a citation.
- State transitions must occur on punctuation marks.
- The citation can only start with AUTHOR or EDITOR.
- The words "pp." and "pages" correspond to PAGE.
- Four digits starting with 20xx or 19xx are a DATE.
- Quotations can appear only in TITLE.
- ...
These are easy-to-express pieces of "knowledge"; they are non-propositional and may use quantifiers. An illustration of turning two of them into linear inequalities appears below.
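To show how statements like these become linear inequalities over indicator variables x_{t,l} ("token t has field label l"), here are hedged encodings of two of them; the exact encodings used in the cited work may differ.

```latex
% "The citation can only start with AUTHOR or EDITOR":
x_{0,\mathrm{AUTHOR}} + x_{0,\mathrm{EDITOR}} = 1

% "State transitions must occur on punctuation marks": if token t is not
% followed by a punctuation mark, its label must carry over to token t+1
% (together with \sum_{l} x_{t+1,l} = 1 this forces equality):
x_{t,l} \le x_{t+1,l}
\qquad \forall\, l,\ \forall\, t \text{ with no punctuation after token } t
```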

Page 45: Information Extraction with Constraints
Adding constraints, we get correct results, without changing the model:
[AUTHOR] Lars Ole Andersen. [TITLE] Program analysis and specialization for the C Programming language. [TECH-REPORT] PhD thesis. [INSTITUTION] DIKU, University of Copenhagen, [DATE] May, 1994.

Page 46: Hard Constraints vs. Weighted Constraints
- Constraints are close to perfect.
- Labeled data might not follow the constraints.

Page 47: Training with Soft Constraints
We need to figure out the penalties as well...
- Option 1: Learning + Inference (with constraints): learn the weights and penalties separately, e.g., Penalty(c) = -log P(c is violated).
- Option 2: Inference (with constraints) Based Training: learn the weights and penalties together.
The tradeoff between L+I and IBT is similar to what we saw earlier. A small sketch of estimating such penalties from data appears below.
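A hedged sketch of Option 1's penalty estimation, Penalty(c) = -log P(c is violated), from labeled structures; the smoothing and the example constraint are illustrative choices.

```python
# Estimate a soft-constraint penalty from labeled data as -log P(c is violated).
# Illustrative sketch; `is_violated` is a hypothetical predicate, and add-one
# smoothing avoids log(0) when a constraint is never (or always) violated.
import math

def constraint_penalty(gold_structures, is_violated):
    violations = sum(1 for y in gold_structures if is_violated(y))
    p_violated = (violations + 1) / (len(gold_structures) + 2)
    return -math.log(p_violated)

# Example: the "no duplicate A0" constraint over gold SRL labelings
gold = [["A0", "A1"], ["A0", "A2", "A1"], ["A0", "A0", "A1"]]
print(constraint_penalty(gold, lambda y: y.count("A0") > 1))  # about 0.92
```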

Page 48: Inference Based Training with Soft Constraints
Example: Perceptron; update the penalties as well!
For each iteration:
  For each (X, Y_GOLD) in the training data:
    Y_PRED = argmax_y  λ·F(X, y) - Σ_i ρ_i·d(y, 1_{C_i(X)})
    If Y_PRED != Y_GOLD:
      λ   = λ + F(X, Y_GOLD) - F(X, Y_PRED)
      ρ_i = ρ_i + d(Y_GOLD, 1_{C_i(X)}) - d(Y_PRED, 1_{C_i(X)}),   i = 1, ...
A Python rendering of this update appears below.
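A hedged Python rendering of the update rule above; F, d, and the candidate-structure generator Y are hypothetical stand-ins, and real systems would replace the explicit enumeration with constrained ILP inference.

```python
# Perceptron-style IBT with soft constraints, following the pseudocode above.
# Illustrative sketch: F(x, y) returns a feature vector, d(x, y) returns a
# vector of constraint-violation degrees, and Y(x) enumerates candidate
# structures (in practice this argmax is the constrained inference step).
import numpy as np

def ccm_score(x, y, lam, rho, F, d):
    return float(lam @ F(x, y) - rho @ d(x, y))

def ibt_soft_constraints(data, Y, F, d, n_feats, n_constraints, epochs=10):
    lam = np.zeros(n_feats)
    rho = np.zeros(n_constraints)
    for _ in range(epochs):
        for x, y_gold in data:
            # Inference with (soft) constraints
            y_pred = max(Y(x), key=lambda y: ccm_score(x, y, lam, rho, F, d))
            if y_pred != y_gold:
                lam += F(x, y_gold) - F(x, y_pred)   # update the model weights
                rho += d(x, y_gold) - d(x, y_pred)   # update the penalties as well
    return lam, rho
```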

Page 49: L+I vs. IBT for Soft Constraints
Test on citation recognition: L+I is an HMM + weighted constraints; IBT is a perceptron + weighted constraints; both use the same feature set.
- With constraints: the factored model is better, and the gap is more significant with a small number of examples.
- Without constraints: with few labeled examples, HMM > perceptron; with many labeled examples, perceptron > HMM.

Page 50: Outline
- Constrained Conditional Models: motivation; examples.
- Training paradigms: investigate ways for training models and combining constraints.
  - Joint learning and inference vs. decoupling learning and inference.
  - Training with hard and soft constraints.
  - Guiding semi-supervised learning with constraints.
- Examples: semantic parsing; information extraction; pipeline processes.

Page 51: Outline
- Constrained Conditional Models: motivation; examples.
- Training paradigms: investigate ways for training models and combining constraints.
  - Joint learning and inference vs. decoupling learning and inference.
  - Guiding semi-supervised learning with constraints.
  - Features vs. constraints; hard and soft constraints.
- Examples: semantic parsing; information extraction; pipeline processes.

Page 52: Constraints as a Way to Encode Prior Knowledge
Consider encoding the knowledge that entities of type A and B cannot occur simultaneously in a sentence.
- The "feature" way: requires larger models and more training data.
- The constraints way: keeps the model simple; adds expressive constraints directly; needs only a small set of constraints; allows for decision-time incorporation of constraints. An effective way to inject knowledge.
We can use constraints as a way to replace training data.

Page 53: Guiding Semi-Supervised Learning with Constraints
In traditional semi-supervised learning, the model can drift away from the correct one. Constraints can be used to generate better training data:
- At decision time, to bias the objective function towards favoring constraint satisfaction.
- At training time, to improve the labeling of unlabeled data (and thus improve the model).
(Diagram: the model and decision-time constraints interact; unlabeled data plus constraints feed back into the model.)

Page 54: Semi-Supervised Learning with Constraints [Chang, Ratinov, Roth, ACL'07; ICML'08]
λ = learn(T)                                   (supervised learning algorithm parameterized by λ)
For N iterations do:
  T' = ∅
  For each x in the unlabeled dataset:
    {y_1, ..., y_K} = InferenceWithConstraints(x, C, λ)     (inference-based augmentation of the training set: feedback)
    T' = T' ∪ {(x, y_i)},  i = 1...K
  λ = γ·λ + (1 - γ)·learn(T')                  (learn from the new training data; weigh the supervised and unsupervised models)
A Python sketch of this loop appears below.
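A hedged Python sketch of this loop; learn and inference_with_constraints are hypothetical stand-ins (a supervised trainer and a constrained top-K decoder), and the model is represented as a feature-to-weight dictionary so that the γ-interpolation is explicit.

```python
# Constraint-driven semi-supervised loop (CoDL-style), following the pseudocode
# above. Illustrative sketch; `learn` and `inference_with_constraints` are
# hypothetical stand-ins, and models are feature -> weight dictionaries.
def codl(labeled, unlabeled, constraints, learn, inference_with_constraints,
         iterations=5, gamma=0.9, top_k=5):
    model = learn(labeled)                      # supervised starting point
    for _ in range(iterations):
        new_data = []
        for x in unlabeled:
            # Top-K structures under the current model, biased by the constraints
            for y in inference_with_constraints(x, constraints, model, top_k):
                new_data.append((x, y))
        retrained = learn(new_data)             # learn from the new training data
        # Weigh the previous model against the re-estimated one
        model = {f: gamma * model.get(f, 0.0) + (1 - gamma) * retrained.get(f, 0.0)
                 for f in set(model) | set(retrained)}
    return model
```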

Page 55: Value of Constraints in Semi-Supervised Learning
Constraints are used to bootstrap a semi-supervised learner: a poor model plus constraints is used to annotate unlabeled data, which in turn is used to keep training the model.
(Plot: objective function vs. the number of available labeled examples; curves labeled "learning with 10 constraints" and "learning without constraints: 300 examples"; factored model.)

Page 56: Constraints in a Hidden Layer
(Figure: a single-output problem: inputs x1..x7 and only one output, y1.)
With only one output it is hard to find constraints!?
Intuition: introduce structural hidden variables.

Page 57: Adding Constraints Through Hidden Variables
(Figure: the single-output problem with hidden variables f1..f5 between the inputs x1..x7 and the output y1.)
Use constraints to capture the dependencies.

Page 58: Learning a Good Feature Representation for Discriminative Transliteration
Example pair: (איטליה, Italy) → Yes/No
- Learning the feature representation is a structured learning problem: features are the edges of a character-alignment graph, and the problem is choosing the optimal subset.
- There are many constraints on the legitimacy of the active feature representation; formalize the problem as a constrained optimization problem.
- Subject to: one-to-one mapping; non-crossing edges; length-difference restriction; language-specific constraints.
- A successful solution depends on: an iterative unsupervised learning algorithm; learning a good objective function; a Romanization table; a good initial objective function.
(Figure: the alignment graph between the characters of "Italy" and the Hebrew string, whose edges are the candidate features.)
A constrained-optimization sketch of the alignment step appears below.
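A hedged sketch of the constrained alignment step in Python, assuming the PuLP package; the edge scores are toy values and only the one-to-one and non-crossing conditions named above are encoded, not the paper's full formulation.

```python
# Selecting character-alignment edges (the "features") as a 0-1 ILP with
# one-to-one and non-crossing constraints. Illustrative sketch with toy scores.
from pulp import LpProblem, LpVariable, LpMaximize, lpSum, LpBinary

src, tgt = "italy", "איטליה"                     # the slide's example pair
edges = [(i, j) for i in range(len(src)) for j in range(len(tgt))]
score = {(i, j): 1.0 if i == j else 0.1 for (i, j) in edges}   # toy edge scores

prob = LpProblem("transliteration_alignment", LpMaximize)
x = LpVariable.dicts("edge", edges, cat=LpBinary)

prob += lpSum(score[e] * x[e] for e in edges)   # total score of the active features

# One-to-one mapping: each character participates in at most one active edge
for i in range(len(src)):
    prob += lpSum(x[(i, j)] for j in range(len(tgt))) <= 1
for j in range(len(tgt)):
    prob += lpSum(x[(i, j)] for i in range(len(src))) <= 1

# Non-crossing: edges (i, j) and (k, l) with i < k and j > l cannot both be active
for (i, j) in edges:
    for (k, l) in edges:
        if i < k and j > l:
            prob += x[(i, j)] + x[(k, l)] <= 1

prob.solve()
print(sorted(e for e in edges if x[e].value() == 1))
```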

Page 59: Iterative Objective Function Learning
(Diagram: a loop of Inference, Prediction, and Training: generate features, predict labels for all word pairs, update the weight vector; the Romanization table provides the initial objective function.)
Results (UCDL vs. the previous system):
  Language pair            UCDL    Prev. Sys
  English-Russian (ACC)    73      63
  English-Hebrew (MRR)     89.9    51

Page 60: Summary: Constrained Conditional Models
y* = argmax_y  Σ_i w_i·φ_i(x, y)  -  Σ_i ρ_i·d_C_i(x, y)
- First term: a Conditional (Markov) Random Field over the output variables y1..y8, with linear objective functions; typically φ(x, y) will be local functions, or φ(x, y) = φ(x).
- Second term: a Constraints Network over the same variables: expressive constraints over output variables; soft, weighted constraints; specified declaratively as FOL formulae.
Clearly, there is a joint probability distribution that represents this mixed model. We would like to learn a simple model (or several simple models) and make decisions with respect to a complex model.
Key difference from MLNs: MLNs provide a concise definition of a model, but of the whole joint one.

Page 61: Conclusion
Constrained Conditional Models combine learning conditional models with using declarative, expressive constraints, within a constrained optimization framework. Use constraints!
The framework supports:
- A clean way of incorporating constraints to bias and improve the decisions of supervised learning models, with significant success on several NLP and IE tasks (often with ILP).
- A clean way to use (declarative) prior knowledge to guide semi-supervised learning.
- The training protocol matters; more work is needed here.
LBJ (Learning Based Java): a modeling language for Constrained Conditional Models; it supports programming along with building learned models, high-level specification of constraints, and inference with constraints.

Page 62 Nice to Meet You