
1 Page 1 March 2009 EACL Constrained Conditional Models for Natural Language Processing Ming-Wei Chang, Lev Ratinov, Dan Roth Department of Computer Science University of Illinois at Urbana-Champaign

2 Page 2 Nice to Meet You

3  Informally:  Everything that has to do with constraints (and learning models)  Formally:  We typically make decisions based on models such as:  With CCMs we make decisions based on models such as:  We do not define the learning method  but we’ll discuss it and make suggestions  CCMs make predictions in the presence of / guided by constraints Page 3 Constrained Conditional Models (CCMs)

4 Constraints Driven Learning and Decision Making Why Constraints?  The Goal: Building good NLP systems easily  We have prior knowledge at hand How can we use it? We suggest that often knowledge can be injected directly  Can use it to guide learning  Can use it to improve decision making  Can use it to simplify the models we need to learn How useful are constraints?  Useful for supervised learning  Useful for semi-supervised learning  Sometimes more efficient than labeling data directly

5 Page 5 Inference

6 Page 6 Comprehension 1. Christopher Robin was born in England. 2. Winnie the Pooh is a title of a book. 3. Christopher Robin’s dad was a magician. 4. Christopher Robin must be at least 65 now. A process that maintains and updates a collection of propositions about the state of affairs. (ENGLAND, June, 1989) - Christopher Robin is alive and well. He lives in England. He is the same person that you read about in the book, Winnie the Pooh. As a boy, Chris lived in a pretty home called Cotchfield Farm. When Chris was three years old, his father wrote a poem about him. The poem was printed in a magazine for others to read. Mr. Robin then wrote a book. He made up a fairy tale land where Chris lived. His friends were animals. There was a bear called Winnie the Pooh. There was also an owl and a young pig, called a piglet. All the animals were stuffed toys that Chris owned. Mr. Robin made them come to life with his words. The places in the story were all near Cotchfield Farm. Winnie the Pooh was written in 1925. Children still love to read about Christopher Robin and his animal friends. Most people don't know he is a real person who is grown now. He has written two books of his own. They tell what it is like to be famous. This is an Inference Problem

7 Page 7 This Tutorial: Constrained Conditional Models Part 1: Introduction to CCMs Part 1  Examples: NE + Relations Information extraction – correcting models with CCMS  First summary: why are CCM important  Problem Setting Features and Constraints; Some hints about training issues Part 2: Introduction to Integer Linear Programming Part 2  What is ILP; use of ILP in NLP Part 3: Detailed examples of using CCMs Part 3  Semantic Role Labeling in Details  Coreference Resolution  Sentence Compression BREAK

8 Page 8 This Tutorial: Constrained Conditional Models (2 nd part) Part 4: More on Inference – Part 4  Other inference algorithms Search (SRL); Dynamic Programming (Transliteration); Cutting Planes  Using hard and soft constraints Part 5: Training issues when working with CCMS Part 5  Formalism (again)  Choices of training paradigms -- Tradeoffs Examples in Supervised learning Examples in Semi-Supervised learning Part 6: Conclusion Part 6  Building CCMs  Features and Constraints; Objective functions; Different Learners.  Mixed models vs. Joint models; where is Knowledge coming from THE END

9 Page 9 Learning and Inference  Global decisions in which several local decisions play a role but there are mutual dependencies on their outcome.  E.g. Structured Output Problems – multiple dependent output variables  (Learned) models/classifiers for different sub-problems  In some cases, not all local models can be learned simultaneously  Key examples in NLP are Textual Entailment and QA  In these cases, constraints may appear only at evaluation time  Incorporate models’ information, along with prior knowledge/constraints, in making coherent decisions  decisions that respect the local models as well as domain & context specific knowledge/constraints.

10 Page 10 This Tutorial: Constrained Conditional Models Part 1: Introduction to CCMs  Examples: NE + Relations Information extraction – correcting models with CCMS  First summary: why are CCM important  Problem Setting Features and Constraints; Some hints about training issues Part 2: Introduction to Integer Linear Programming  What is ILP; use of ILP in NLP  Semantic Role Labeling in Details Part 3: Detailed examples of using CCMs  Coreference Resolution  Sentence Compression BREAK

11 Page 11 This Tutorial: Constrained Conditional Models (2 nd part) Part 4: More on Inference –  Other inference algorithms Search (SRL); Dynamic Programming (Transliteration); Cutting Planes  Using hard and soft constraints Part 5: Training issues when working with CCMS  Formalism (again)  Choices of training paradigms -- Tradeoffs Examples in Supervised learning Examples in Semi-Supervised learning Part 6: Conclusion  Building CCMs  Features and Constraints; Objective functions; Different Learners.  Mixed models vs. Joint models; where is Knowledge coming from THE END

12 Page 12 Inference with General Constraint Structure [Roth&Yih’04] Recognizing Entities and Relations Dole ’s wife, Elizabeth, is a native of N.C. E 1 E 2 E 3 R 12 R 23 other 0.05 per 0.85 loc 0.10 other 0.05 per 0.50 loc 0.45 other 0.10 per 0.60 loc 0.30 irrelevant 0.10 spouse_of 0.05 born_in 0.85 irrelevant 0.05 spouse_of 0.45 born_in 0.50 irrelevant 0.05 spouse_of 0.45 born_in 0.50 other 0.05 per 0.85 loc 0.10 other 0.10 per 0.60 loc 0.30 other 0.05 per 0.50 loc 0.45 irrelevant 0.05 spouse_of 0.45 born_in 0.50 irrelevant 0.10 spouse_of 0.05 born_in 0.85 other 0.05 per 0.50 loc 0.45 Improvement over no inference: 2-5% Some Questions: How to guide the global inference? Why not learn Jointly? Models could be learned separately; constraints may come up only at decision time. Note: Non Sequential Model

13 Page 13 Tasks of Interest: Structured Output For each instance, assign values to a set of variables Output variables depend on each other Common tasks in  Natural language processing Parsing; Semantic Parsing; Summarization; Transliteration; Co-reference resolution,…  Information extraction Entities, Relations,… Many pure machine learning approaches exist  Hidden Markov Models (HMMs); CRFs  Structured Perceptrons and SVMs… However, …

14 Page 14 Information Extraction via Hidden Markov Models Lars Ole Andersen. Program analysis and specialization for the C Programming language. PhD thesis. DIKU, University of Copenhagen, May 1994. Prediction result of a trained HMM Lars Ole Andersen. Program analysis and specialization for the C Programming language. PhD thesis. DIKU, University of Copenhagen, May 1994. [AUTHOR] [TITLE] [EDITOR] [BOOKTITLE] [TECH-REPORT] [INSTITUTION] [DATE] Unsatisfactory results ! Motivation II

15 Page 15 Strategies for Improving the Results (Pure) Machine Learning Approaches  Higher Order HMM/CRF?  Increasing the window size?  Adding a lot of new features Requires a lot of labeled examples  What if we only have a few labeled examples? Any other options?  Humans can immediately tell bad outputs  The output does not make sense Increasing the model complexity Can we keep the learned model simple and still make expressive decisions?

16 Page 16 Information extraction without Prior Knowledge Prediction result of a trained HMM Lars Ole Andersen. Program analysis and specialization for the C Programming language. PhD thesis. DIKU, University of Copenhagen, May 1994. [AUTHOR] [TITLE] [EDITOR] [BOOKTITLE] [TECH-REPORT] [INSTITUTION] [DATE] Violates lots of natural constraints! Lars Ole Andersen. Program analysis and specialization for the C Programming language. PhD thesis. DIKU, University of Copenhagen, May 1994.

17 Page 17 Examples of Constraints Each field must be a consecutive list of words and can appear at most once in a citation. State transitions must occur on punctuation marks. The citation can only start with AUTHOR or EDITOR. The words pp., pages correspond to PAGE. Four digits starting with 20xx and 19xx are DATE. Quotations can appear only in TITLE ……. Easy to express pieces of “knowledge” Non Propositional; May use Quantifiers

18 Page 18 Adding constraints, we get correct results!  Without changing the model [AUTHOR] Lars Ole Andersen. [TITLE] Program analysis and specialization for the C Programming language. [TECH-REPORT] PhD thesis. [INSTITUTION] DIKU, University of Copenhagen, [DATE] May, 1994. Information Extraction with Constraints

19 Page 19 Problem Setting Random Variables Y; Conditional Distributions P (learned by models/classifiers); Constraints C – any Boolean function defined over partial assignments (possibly with weights W). Goal: Find the “best” assignment – the assignment that achieves the highest global performance. This is an Integer Programming Problem. (Figure: a network over output variables y1…y8 and the observations, with constraints such as C(y1, y4) and C(y2, y3, y6, y7, y8).) Y* = argmax_Y P·Y subject to constraints C (+ W·C)

20 Page 20 Formal Model How to solve? This is an Integer Linear Program Solving using ILP packages gives an exact solution. Search techniques are also possible (Soft) constraints component Weight Vector for “local” models Penalty for violating the constraint. How far y is from a “legal” assignment Subject to constraints A collection of Classifiers; Log-linear models (HMM, CRF) or a combination How to train? How to decompose the global objective function? Should we incorporate constraints in the learning process?
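The objective function on this slide was an image in the original deck and did not survive the transcript. A hedged reconstruction, based on the summary slide near the end of the tutorial (Page 181), is:

y^* = \arg\max_y \; \sum_i w_i \, \phi_i(x, y) \; - \; \sum_k \rho_k \, d_{C_k}(x, y)

where the first sum is the score of the learned “local” models (weight vector w over features φ) and the second sum is the (soft) constraints component: d_{C_k}(x, y) measures how far y is from a “legal” assignment with respect to constraint C_k, and ρ_k is the penalty for violating it (infinite for hard constraints).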

21 Page 21 Features Versus Constraints Φ_i : X × Y → R; C_i : X × Y → {0,1}; d : X × Y → R;  In principle, constraints and features can encode the same properties  In practice, they are very different Features  Local, short distance properties – to allow tractable inference  Propositional (grounded):  E.g. True if: “the” followed by a Noun occurs in the sentence  Constraints  Global properties  Quantified, first order logic expressions  E.g. True if: “all y_i’s in the sequence y are assigned different values.” Indeed, used differently

22 Page 22 Encoding Prior Knowledge Consider encoding the knowledge that:  Entities of type A and B cannot occur simultaneously in a sentence The “Feature” Way  Results in higher order HMM, CRF  May require designing a model tailored to knowledge/constraints  Large number of new features: might require more labeled data  Wastes parameters to learn indirectly knowledge we have. The Constraints Way  Keeps the model simple; add expressive constraints directly  A small set of constraints  Allows for decision time incorporation of constraints A form of supervision Need more training data

23 Page 23 Constrained Conditional Models – 1 st Summary Everything that has to do with Constraints and Learning models In both examples, we started with learning models  Either for components of the problem Classifiers for Relations and Entities  Or the whole problem Citations We then included constraints on the output  As a way to “correct” the output of the model In both cases this allows us to  Learn simpler models than we would otherwise As presented, global constraints did not take part in training  Global constraints were used only at the output. We will later call this training paradigm L+I

24 Page 24 This Tutorial: Constrained Conditional Models Part 1: Introduction to CCMs  Examples: NE + Relations Information extraction – correcting models with CCMS  First summary: why are CCM important  Problem Setting Features and Constraints; Some hints about training issues Part 2: Introduction to Integer Linear Programming  What is ILP; use of ILP in NLP  Semantic Role Labeling in Details Part 3: Detailed examples of using CCMs  Coreference Resolution  Sentence Compression BREAK

25 Page 25 This Tutorial: Constrained Conditional Models (2 nd part) Part 4: More on Inference –  Other inference algorithms Search (SRL); Dynamic Programming (Transliteration); Cutting Planes  Using hard and soft constraints Part 5: Training issues when working with CCMS  Formalism (again)  Choices of training paradigms -- Tradeoffs Examples in Supervised learning Examples in Semi-Supervised learning Part 6: Conclusion  Building CCMs  Features and Constraints; Objective functions; Different Learners.  Mixed models vs. Joint models; where is Knowledge coming from THE END

26 Page 26 Formal Model How to solve? This is an Integer Linear Program Solving using ILP packages gives an exact solution. Search techniques are also possible (Soft) constraints component Weight Vector for “local” models Penalty for violating the constraint. How far y is from a “legal” assignment A collection of Classifiers; Log-linear models (HMM, CRF) or a combination How to train? How to decompose the global objective function? Should we incorporate constraints in the learning process?

27 Inference with constraints. We start with adding constraints to existing models. It’s a good place to start, because conceptually all you do is to add constraints to what you were doing before and the performance improves. Page 27

28 Constraints and Integer Linear Programming (ILP) ILP is powerful (NP-complete) ILP is popular – inference for many models, such as Viterbi for CRFs, has already been implemented. Powerful off-the-shelf solvers exist. All we need is to write down the objective function and the constraints; there is no need to write code. Page 28

29 Linear Programming Key contributors: Leonid Kantorovich, George B. Dantzig, John von Neumann. Optimization technique with a linear objective function and linear constraints. Note that the word “Integer” is absent. Page 29

30 Example (Thanks James Clarke) Telfa Co. produces tables and chairs  Each table makes 8$ profit, each chair makes 5$ profit. We want to maximize the profit. Page 30

31 Telfa Co. produces tables and chairs  Each table makes $8 profit, each chair makes $5 profit.  A table requires 1 hour of labor and 9 sq. feet of wood.  A chair requires 1 hour of labor and 5 sq. feet of wood.  We have only 6 hours of work and 45 sq. feet of wood. We want to maximize the profit. Example (Thanks James Clarke) Page 31
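A minimal sketch of this example in Python using the PuLP library (the choice of library/solver is an assumption; the tutorial does not prescribe one). Solving the plain LP gives a fractional optimum, while requiring integrality changes the answer – which is what the following slides illustrate graphically.

import pulp

def telfa(integer=True):
    # Decision variables: number of tables (x1) and chairs (x2) to produce.
    cat = pulp.LpInteger if integer else pulp.LpContinuous
    x1 = pulp.LpVariable("tables", lowBound=0, cat=cat)
    x2 = pulp.LpVariable("chairs", lowBound=0, cat=cat)
    prob = pulp.LpProblem("telfa", pulp.LpMaximize)
    prob += 8 * x1 + 5 * x2          # profit: $8 per table, $5 per chair
    prob += x1 + x2 <= 6             # 6 hours of labor
    prob += 9 * x1 + 5 * x2 <= 45    # 45 sq. feet of wood
    prob.solve()
    return x1.varValue, x2.varValue, pulp.value(prob.objective)

print("LP relaxation:", telfa(integer=False))  # fractional optimum (x1=3.75, x2=2.25)
print("ILP:", telfa(integer=True))             # integer optimum is NOT the rounded LP solution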

32–37 Pages 32–37 Solving LP problems (a sequence of figures showing the graphical solution of the LP: the feasible region and the objective function pushed to its optimal vertex).

38 Integer Linear Programming- integer solutions. Page 38

39 Integer Linear Programming. In NLP, we are dealing with discrete outputs, therefore we’re almost always interested in integer solutions. ILP is NP-complete, but often efficient for large NLP problems. In some cases, the solutions to the LP are integral (e.g., a totally unimodular constraint matrix).  Next, we show an example of using ILP with constraints. The matrix is not totally unimodular, but the LP gives integral solutions. Page 39

40 Page 40 Back to Example: Recognizing Entities and Relations Dole ’s wife, Elizabeth, is a native of N.C. E 1 E 2 E 3 R 12 R 23 other 0.05 per 0.85 loc 0.10 other 0.05 per 0.50 loc 0.45 other 0.10 per 0.60 loc 0.30 irrelevant 0.10 spouse_of 0.05 born_in 0.85 irrelevant 0.05 spouse_of 0.45 born_in 0.50 irrelevant 0.05 spouse_of 0.45 born_in 0.50 other 0.05 per 0.85 loc 0.10 other 0.10 per 0.60 loc 0.30 other 0.05 per 0.50 loc 0.45 irrelevant 0.05 spouse_of 0.45 born_in 0.50 irrelevant 0.10 spouse_of 0.05 born_in 0.85 other 0.05 per 0.50 loc 0.45

41 Page 41 Back to Example: Recognizing Entities and Relations Dole ’s wife, Elizabeth, is a native of N.C. E 1 E 2 E 3 R 12 R 23 other 0.05 per 0.85 loc 0.10 other 0.05 per 0.50 loc 0.45 other 0.10 per 0.60 loc 0.30 irrelevant 0.10 spouse_of 0.05 born_in 0.85 irrelevant 0.05 spouse_of 0.45 born_in 0.50 irrelevant 0.05 spouse_of 0.45 born_in 0.50 other 0.05 per 0.85 loc 0.10 other 0.10 per 0.60 loc 0.30 other 0.05 per 0.50 loc 0.45 irrelevant 0.05 spouse_of 0.45 born_in 0.50 irrelevant 0.10 spouse_of 0.05 born_in 0.85 other 0.05 per 0.50 loc 0.45 NLP with ILP- key issues: 1)Write down the objective function. 2)Write down the constraints as linear inequalities

42 Page 42 Back to Example: Recognizing Entities and Relations Dole ’s wife, Elizabeth, is a native of N.C. E 1 E 2 E 3 R 12 R 23 other 0.05 per 0.85 loc 0.10 other 0.05 per 0.50 loc 0.45 other 0.10 per 0.60 loc 0.30 irrelevant 0.10 spouse_of 0.05 born_in 0.85 irrelevant 0.05 spouse_of 0.45 born_in 0.50 irrelevant 0.05 spouse_of 0.45 born_in 0.50 other 0.05 per 0.85 loc 0.10 other 0.10 per 0.60 loc 0.30 other 0.05 per 0.50 loc 0.45 irrelevant 0.05 spouse_of 0.45 born_in 0.50 irrelevant 0.10 spouse_of 0.05 born_in 0.85 other 0.05 per 0.50 loc 0.45

43 Page 43 Back to Example: cost function (Classifier score tables as on the previous slides; entities E1 = Dole, E2 = Elizabeth, E3 = N.C.; relations R12, R21, R23, R32, R13, R31.) Indicator variables: x_{E1 = per}, x_{E1 = loc}, …, x_{R12 = spouse_of}, x_{R12 = born_in}, …, x_{R12 = ∅}, … ∈ {0,1}

44 Page 44 Back to Example: cost function (Classifier score tables as on the previous slides; entities E1 = Dole, E2 = Elizabeth, E3 = N.C.; relations R12, R21, R23, R32, R13, R31.) Cost function: c_{E1 = per} · x_{E1 = per} + c_{E1 = loc} · x_{E1 = loc} + … + c_{R12 = spouse_of} · x_{R12 = spouse_of} + … + c_{R12 = ∅} · x_{R12 = ∅} + …

45 Adding Constraints Each entity is either a person, organization or location: x_{E1 = per} + x_{E1 = loc} + x_{E1 = org} + x_{E1 = ∅} = 1 (R12 = spouse_of) ⇒ (E1 = person) ∧ (E2 = person): x_{R12 = spouse_of} ≤ x_{E1 = per} and x_{R12 = spouse_of} ≤ x_{E2 = per} We need more consistency constraints. Any Boolean constraint can be written as a set of linear inequalities, and an efficient algorithm exists [Rizzolo & Roth 2007]. Page 45
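A hedged sketch of this entity-and-relation inference as an ILP in Python with PuLP (library choice is an assumption). The scores are illustrative values read off the slide – the transcript does not make clear which score table belongs to which variable – and only the small subset of constraints written above is encoded.

import pulp

# Illustrative classifier scores in the spirit of the slide; the exact mapping of the
# score tables to E1/E2/E3 and R12/R23 is not recoverable from the transcript.
ent_scores = {
    "E1": {"per": 0.85, "loc": 0.10, "other": 0.05},   # Dole
    "E2": {"per": 0.50, "loc": 0.45, "other": 0.05},   # Elizabeth
    "E3": {"per": 0.60, "loc": 0.30, "other": 0.10},   # N.C.
}
rel_scores = {
    ("E1", "E2"): {"spouse_of": 0.45, "born_in": 0.50, "irrelevant": 0.05},
    ("E2", "E3"): {"spouse_of": 0.05, "born_in": 0.85, "irrelevant": 0.10},
}

prob = pulp.LpProblem("entities_and_relations", pulp.LpMaximize)
x = {(e, l): pulp.LpVariable(f"x_{e}_{l}", cat=pulp.LpBinary)
     for e, labels in ent_scores.items() for l in labels}
r = {(p, l): pulp.LpVariable(f"r_{p[0]}_{p[1]}_{l}", cat=pulp.LpBinary)
     for p, labels in rel_scores.items() for l in labels}

# Objective: sum of the classifier scores of the chosen labels.
prob += (pulp.lpSum(ent_scores[e][l] * v for (e, l), v in x.items())
         + pulp.lpSum(rel_scores[p][l] * v for (p, l), v in r.items()))

# Each entity and each relation gets exactly one label.
for e, labels in ent_scores.items():
    prob += pulp.lpSum(x[e, l] for l in labels) == 1
for p, labels in rel_scores.items():
    prob += pulp.lpSum(r[p, l] for l in labels) == 1

# (R = spouse_of) => both arguments are persons; (R = born_in) => person born in location.
for (e1, e2) in rel_scores:
    prob += r[(e1, e2), "spouse_of"] <= x[e1, "per"]
    prob += r[(e1, e2), "spouse_of"] <= x[e2, "per"]
    prob += r[(e1, e2), "born_in"] <= x[e1, "per"]
    prob += r[(e1, e2), "born_in"] <= x[e2, "loc"]

prob.solve()
print([k for k, v in list(x.items()) + list(r.items()) if v.varValue > 0.5])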

46 CCM for NE-Relations We showed a CCM formulation of the NE-Relation problem. In this case, the cost of each variable was learned independently, using trained classifiers. Other expressive problems can be formulated as Integer Linear Programs.  For example, HMM/CRF inference and the Viterbi algorithm Page 46 c_{E1 = per} · x_{E1 = per} + c_{E1 = loc} · x_{E1 = loc} + … + c_{R12 = spouse_of} · x_{R12 = spouse_of} + … + c_{R12 = ∅} · x_{R12 = ∅} + …

47 TSP as an Integer Linear Program Page 47  Dantzig et al. were the first to suggest a practical solution to the problem using ILP.

48 Page 48 Reduction from Traveling Salesman to ILP x 12 c 12 3 1 2 x 32 c 32 Every node has ONE outgoing edge

49 Page 49 Reduction from Traveling Salesman to ILP Every node has ONE incoming edge x 12 c 12 3 1 2 x 32 c 32

50 Page 50 Reduction from Traveling Salesman to ILP The solutions are binary (int) x 12 c 12 3 1 2 x 32 c 32

51 Page 51 Reduction from Traveling Salesman to ILP No sub graph contains a cycle x 12 c 12 3 1 2 x 32 c 32
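A sketch of the TSP formulation above in PuLP (an assumption). The slides rule out subtours with “no subgraph contains a cycle”, which needs exponentially many constraints; the sketch substitutes the compact Miller–Tucker–Zemlin ordering variables, which enforce the same thing.

import itertools
import pulp

def tsp_ilp(cost):
    # cost: dict {(i, j): c_ij} on a complete directed graph.
    nodes = sorted({i for i, _ in cost})
    n = len(nodes)
    prob = pulp.LpProblem("tsp", pulp.LpMinimize)
    x = {(i, j): pulp.LpVariable(f"x_{i}_{j}", cat=pulp.LpBinary) for i, j in cost}
    prob += pulp.lpSum(cost[i, j] * x[i, j] for i, j in x)   # total tour cost

    for i in nodes:
        prob += pulp.lpSum(x[i, j] for j in nodes if (i, j) in x) == 1   # ONE outgoing edge
        prob += pulp.lpSum(x[j, i] for j in nodes if (j, i) in x) == 1   # ONE incoming edge

    # Subtour elimination via Miller-Tucker-Zemlin ordering variables u_i
    # (stands in for the slides' "no subgraph contains a cycle" constraints).
    u = {i: pulp.LpVariable(f"u_{i}", lowBound=0, upBound=n - 1) for i in nodes}
    for i, j in x:
        if i != nodes[0] and j != nodes[0]:
            prob += u[i] - u[j] + n * x[i, j] <= n - 1

    prob.solve()
    return sorted((i, j) for (i, j), v in x.items() if v.varValue > 0.5)

costs = {(i, j): abs(i - j) + 1 for i, j in itertools.permutations(range(4), 2)}
print(tsp_ilp(costs))   # a minimum-cost Hamiltonian cycle over 4 nodes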

52 Viterbi as an Integer Linear Program Page 52 y0y0 y1y1 y2y2 yNyN x0x0 x1x1 x2x2 xNxN

53 Viterbi with ILP Page 53 y2y2 y0y0 y1y1 yNyN x2x2 x0x0 x1x1 xNxN y 01 y 02 y 0M S -Log{P(x 0 |y 0 =1) P(y 0 =1)}

54 Viterbi with ILP Page 54 y2y2 y0y0 y1y1 yNyN x2x2 x0x0 x1x1 xNxN y 01 y 02 y 0M S -Log{P(x 0 |y 0 =2)P(y 0 =2)}

55 Viterbi with ILP Page 55 y2y2 y0y0 y1y1 yNyN x2x2 x0x0 x1x1 xNxN y 01 y 02 y 0M S -Log{P(x 0 |y 0 =M) P(y 0 =M)}

56 Viterbi with ILP Page 56 y2y2 y0y0 y1y1 yNyN x2x2 x0x0 x1x1 xNxN y 01 y 02 y 0M y 11 y 12 y 1M -Log{P(x 2 |y 1 =1)P(y 1 =1|y 0 =1)} S

57 Viterbi with ILP Page 57 y2y2 y0y0 y1y1 yNyN x2x2 x0x0 x1x1 xNxN y 01 y 02 y 0M y 11 y 12 y 1M -Log{P(x 2 |y 1 =2)P(y 1 =1|y 0 =1)} S

58 Viterbi with ILP Page 58 y0y0 y1y1 y2y2 yNyN x0x0 x1x1 x2x2 xNxN S y 01 y 02 y 0M y 11 y 12 y 1M y 21 y 22 y 2M y N1 y N2 y NM T

59 Viterbi with ILP Page 59 (Trellis figure: states y0 … yN over observations x0 … xN, each expanded into nodes y_{t,1} … y_{t,M}, with source S and sink T.) Viterbi = shortest path (S → T) with dynamic programming. We saw TSP → ILP. Now show: Shortest Path → ILP.
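Before writing the shortest-path ILP, here is the dynamic-programming view the slide refers to: Viterbi as the shortest S → T path through the trellis, with edge weights −log P. This is a minimal sketch; the toy HMM numbers are hypothetical and chosen only to make it runnable.

import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    # cost[s]: length of the shortest path from S to (current position, state s); back[s]: that path.
    cost = {s: -math.log(start_p[s] * emit_p[s][obs[0]]) for s in states}
    back = {s: [s] for s in states}
    for x in obs[1:]:
        new_cost, new_back = {}, {}
        for s in states:
            # Best predecessor: minimizes previous cost plus edge weight -log{P(x|s) P(s|prev)}.
            prev = min(states, key=lambda q: cost[q] - math.log(trans_p[q][s] * emit_p[s][x]))
            new_cost[s] = cost[prev] - math.log(trans_p[prev][s] * emit_p[s][x])
            new_back[s] = back[prev] + [s]
        cost, back = new_cost, new_back
    best = min(states, key=lambda s: cost[s])
    return back[best], cost[best]

# Hypothetical two-state HMM, just to exercise the function.
states = ["AUTHOR", "TITLE"]
start = {"AUTHOR": 0.9, "TITLE": 0.1}
trans = {"AUTHOR": {"AUTHOR": 0.7, "TITLE": 0.3}, "TITLE": {"AUTHOR": 0.1, "TITLE": 0.9}}
emit = {"AUTHOR": {"Lars": 0.8, "analysis": 0.2}, "TITLE": {"Lars": 0.1, "analysis": 0.9}}
print(viterbi(["Lars", "analysis", "analysis"], states, start, trans, emit))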

60 Shortest path with ILP. For each edge (u,v),  x uv =1 if (u,v) is on the shortest path, 0 otherwise.  c uv = the weight of the edge. Page 60

61 Shortest path with ILP. For each edge (u,v),  x uv =1 if (u,v) is on the shortest path, 0 otherwise.  c uv = the weight of the edge. Page 61

62 Shortest path with ILP. For each edge (u,v),  x uv =1 if (u,v) is on the shortest path, 0 otherwise.  c uv = the weight of the edge. Page 62 All nodes except the S and the T have eq. number of ingoing and outgoing edges

63 Shortest path with ILP. For each edge (u,v),  x uv =1 if (u,v) is on the shortest path, 0 otherwise.  c uv = the weight of the edge. Page 63 All nodes except the S and the T have eq. number of ingoing and outgoing edges The source node has one outgoing edge more than ingoing edges.

64 Shortest path with ILP. For each edge (u,v),  x uv =1 if (u,v) is on the shortest path, 0 otherwise.  c uv = the weight of the edge. Page 64 All nodes except the S and the T have eq. number of ingoing and outgoing edges The source node has one outgoing edge more than ingoing edges. The target node has one ingoing edge more than outgoing edges.

65 Shortest path with ILP. For each edge (u,v),  x_uv = 1 if (u,v) is on the shortest path, 0 otherwise.  c_uv = the weight of the edge. Page 65 Interestingly, in this formulation the solution to the LP relaxation is integral. Thus, we have a reasonably fast alternative to Viterbi via (I)LP.
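A sketch of the shortest-path LP in PuLP (an assumption), with exactly the flow-conservation constraints listed on the slides above. The variables are declared continuous; as the slide notes, the LP optimum comes out integral anyway.

import pulp

def shortest_path_lp(edges, source, target):
    # edges: dict {(u, v): c_uv}. x_uv = 1 if (u, v) is on the shortest path, 0 otherwise.
    nodes = {u for u, _ in edges} | {v for _, v in edges}
    x = {(u, v): pulp.LpVariable(f"x_{u}_{v}", lowBound=0) for (u, v) in edges}  # continuous!
    prob = pulp.LpProblem("shortest_path", pulp.LpMinimize)
    prob += pulp.lpSum(c * x[u, v] for (u, v), c in edges.items())
    for n in nodes:
        out_flow = pulp.lpSum(x[u, v] for (u, v) in edges if u == n)
        in_flow = pulp.lpSum(x[u, v] for (u, v) in edges if v == n)
        if n == source:
            prob += out_flow - in_flow == 1    # one more outgoing than incoming edge
        elif n == target:
            prob += in_flow - out_flow == 1    # one more incoming than outgoing edge
        else:
            prob += out_flow - in_flow == 0    # equal numbers of incoming and outgoing edges
    prob.solve()
    return [e for e, var in x.items() if var.varValue > 0.5]

edges = {("S", "A"): 1, ("S", "B"): 4, ("A", "B"): 1, ("A", "T"): 5, ("B", "T"): 1}
print(shortest_path_lp(edges, "S", "T"))   # expected: [("S","A"), ("A","B"), ("B","T")]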

66 Who cares that Viterbi is an ILP? Assume you want to learn an HMM/CRF model (e.g., extracting fields from citations (IE)) But, you also want to add expressive constraints:  No field can appear twice in the output.  Certain pairs of fields are mutually exclusive.  Certain fields must appear at least once. Do:  Learn the HMM/CRF  Convert it to an LP  Modify the LP constraint matrix A concrete example of adding constraints over a CRF will be shown
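A hedged sketch of the recipe above, again with PuLP (an assumption): Viterbi decoding written as an ILP, so that extra global constraints can simply be appended. The “at most one token labeled DATE” constraint and the toy HMM are hypothetical simplifications of the citation constraints discussed earlier; they are only meant to show where a declarative constraint enters the program.

import math
import pulp

def viterbi_ilp(obs, states, start_p, trans_p, emit_p, extra=None):
    T = len(obs)
    prob = pulp.LpProblem("viterbi_ilp", pulp.LpMinimize)
    y = {(t, s): pulp.LpVariable(f"y_{t}_{s}", cat=pulp.LpBinary)
         for t in range(T) for s in states}                      # token t takes state s
    z = {(t, a, b): pulp.LpVariable(f"z_{t}_{a}_{b}", cat=pulp.LpBinary)
         for t in range(1, T) for a in states for b in states}   # transition a -> b at step t

    # Objective: -log path probability, split into a start/emission term and transition terms.
    prob += (pulp.lpSum(-math.log(start_p[s] * emit_p[s][obs[0]]) * y[0, s] for s in states)
             + pulp.lpSum(-math.log(trans_p[a][b] * emit_p[b][obs[t]]) * z[t, a, b]
                          for t in range(1, T) for a in states for b in states))

    for t in range(T):                          # exactly one state per token
        prob += pulp.lpSum(y[t, s] for s in states) == 1
    for t in range(1, T):                       # transition variables consistent with states
        for a in states:
            prob += pulp.lpSum(z[t, a, b] for b in states) == y[t - 1, a]
        for b in states:
            prob += pulp.lpSum(z[t, a, b] for a in states) == y[t, b]

    if extra is not None:                       # bolt on expressive, non-sequential constraints
        extra(prob, y)
    prob.solve()
    return [next(s for s in states if y[t, s].varValue > 0.5) for t in range(T)]

def at_most_one_date(prob, y):
    # Hypothetical global constraint: at most one token is labeled DATE.
    prob += pulp.lpSum(v for (t, s), v in y.items() if s == "DATE") <= 1

states = ["AUTHOR", "DATE"]
start = {"AUTHOR": 0.8, "DATE": 0.2}
trans = {"AUTHOR": {"AUTHOR": 0.6, "DATE": 0.4}, "DATE": {"AUTHOR": 0.5, "DATE": 0.5}}
emit = {"AUTHOR": {"Lars": 0.7, "1994": 0.3}, "DATE": {"Lars": 0.2, "1994": 0.8}}
print(viterbi_ilp(["Lars", "1994", "1994"], states, start, trans, emit, extra=at_most_one_date))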

67 Page 67 This Tutorial: Constrained Conditional Models Part 1: Introduction to CCMs  Examples: NE + Relations Information extraction – correcting models with CCMS  First summary: why are CCM important  Problem Setting Features and Constraints; Some hints about training issues Part 2: Introduction to Integer Linear Programming  What is ILP; use of ILP in NLP Part 3: Detailed examples of using CCMs  Semantic Role Labeling in Details  Coreference Resolution  Sentence Compression BREAK

68 Page 68 This Tutorial: Constrained Conditional Models (2 nd part) Part 4: More on Inference –  Other inference algorithms Search (SRL); Dynamic Programming (Transliteration); Cutting Planes  Using hard and soft constraints Part 5: Training issues when working with CCMS  Formalism (again)  Choices of training paradigms -- Tradeoffs Examples in Supervised learning Examples in Semi-Supervised learning Part 6: Conclusion  Building CCMs  Features and Constraints; Objective functions; Different Learners.  Mixed models vs. Joint models; where is Knowledge coming from THE END

69 CCM Examples Many works in NLP make use of constrained conditional models, implicitly or explicitly. Next we describe three examples in detail. Example 2: Semantic Role Labeling.  The use of inference with constraints to improve semantic parsing Example 3 (Co-ref):  Combining classifiers through an objective function outperforms a pipeline. Example 4 (Sentence Compression):  A simple language model with constraints outperforms complex models. Page 69

70 Example 2: Semantic Role Labeling Demo: http://L2R.cs.uiuc.edu/~cogcomp Approach: 1) Reveals several relations. Top ranked system in the CoNLL’05 shared task Key difference is the Inference 2) Produces a very good semantic parser. F1~90% 3) Easy and fast: ~7 Sent/Sec (using Xpress-MP) Who did what to whom, when, where, why,…

71 Simple sentence: I left my pearls to my daughter in my will. [ I ] A0 left [ my pearls ] A1 [ to my daughter ] A2 [ in my will ] AM-LOC. A0Leaver A1Things left A2Benefactor AM-LOCLocation I left my pearls to my daughter in my will.

72 Page 72 SRL Dataset PropBank [Palmer et. al. 05] Core arguments: A0 - A5 and AA  different semantics for each verb  specified in the PropBank Frame files 13 types of adjuncts labeled as AM-arg  where arg specifies the adjunct type In this problem, all the information is given, but conceptually, we could train different components on different resources.

73 Algorithmic Approach Identify argument candidates  Pruning [Xue&Palmer, EMNLP’04]  Argument Identifier Binary classification (SNoW) Classify argument candidates  Argument Classifier Multi-class classification (SNoW) Inference  Use the estimated probability distribution given by the argument classifier  Use structural and linguistic constraints  Infer the optimal global output (Figure: “I left my nice pearls to her” with bracketed candidate arguments; identifying the candidate vocabulary is well understood, and inference is over the old and new vocabulary.)

74 Page 74 Learning Both argument identifier and argument classifier are trained using SNoW.  Sparse network of linear functions  Multiclass classifier that produces a probability distribution over output values. Features (some examples)  Voice, phrase type, head word, parse tree path from predicate, chunk sequence, syntactic frame, …  Conjunction of features

75 Page 75 Inference The output of the argument classifier often violates some constraints, especially when the sentence is long. Finding the best legitimate output is formalized as an optimization problem and solved via Integer Linear Programming (ILP). Input:  The scores from the argument classifier  Structural and linguistic constraints ILP allows incorporating expressive (non-sequential) constraints on the variables (the arguments types).

76–78 Pages 76–78 Semantic Role Labeling (SRL) I left my pearls to my daughter in my will. (Figures: candidate arguments with the argument classifier’s score distributions over argument types.) One inference problem for each verb predicate.

79 Page 79 Integer Linear Programming Inference For each argument a_i  Set up a Boolean variable: a_{i,t} indicating whether a_i is classified as t Goal is to maximize  Σ_i Σ_t score(a_i = t) a_{i,t}  Subject to the (linear) constraints If score(a_i = t) = P(a_i = t), the objective is to find the assignment that maximizes the expected number of arguments that are correct and satisfies the constraints.

80 Page 80 Constraints No duplicate argument classes:  Σ_{a ∈ POTARG} x_{a = A0} ≤ 1 R-ARG (if there is an R-ARG phrase, there is an ARG phrase):  ∀ a2 ∈ POTARG, Σ_{a ∈ POTARG} x_{a = A0} ≥ x_{a2 = R-A0} C-ARG (if there is a C-ARG phrase, there is an ARG before it):  ∀ a2 ∈ POTARG, Σ_{(a ∈ POTARG) ∧ (a is before a2)} x_{a = A0} ≥ x_{a2 = C-A0} Many other possible constraints: Unique labels No overlapping or embedding Relations between number of arguments; order constraints If verb is of type A, no argument of type B Any Boolean rule can be encoded as a set of linear inequalities (universally quantified rules). Joint inference can also be used to combine different SRL systems. LBJ: allows a developer to encode constraints in FOL; these are compiled into linear inequalities automatically.
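A sketch of this inference problem in PuLP (an assumption), with a small subset of the constraints above: each candidate takes exactly one label (including a NULL label), no duplicate core arguments, and an R-A0 phrase requires an A0 phrase. The candidate phrases and classifier scores are hypothetical.

import pulp

def srl_inference(scores, labels):
    # scores[a][t]: argument classifier score for candidate phrase a taking label t.
    prob = pulp.LpProblem("srl", pulp.LpMaximize)
    x = {(a, t): pulp.LpVariable(f"x_{a}_{t}", cat=pulp.LpBinary) for a in scores for t in labels}
    prob += pulp.lpSum(scores[a][t] * x[a, t] for a in scores for t in labels)

    for a in scores:                          # each candidate takes exactly one label
        prob += pulp.lpSum(x[a, t] for t in labels) == 1
    for core in ("A0", "A1", "A2"):           # no duplicate core argument classes
        prob += pulp.lpSum(x[a, core] for a in scores) <= 1
    for a2 in scores:                         # an R-A0 phrase requires some A0 phrase
        prob += pulp.lpSum(x[a, "A0"] for a in scores) >= x[a2, "R-A0"]

    prob.solve()
    return {a: next(t for t in labels if x[a, t].varValue > 0.5) for a in scores}

labels = ["A0", "A1", "A2", "R-A0", "NULL"]
scores = {  # hypothetical scores for three candidate phrases
    "c1": {"A0": 0.6, "A1": 0.1, "A2": 0.1, "R-A0": 0.1, "NULL": 0.1},
    "c2": {"A0": 0.5, "A1": 0.4, "A2": 0.0, "R-A0": 0.0, "NULL": 0.1},
    "c3": {"A0": 0.1, "A1": 0.1, "A2": 0.1, "R-A0": 0.5, "NULL": 0.2},
}
print(srl_inference(scores, labels))   # the no-duplicate-A0 constraint pushes c2 to A1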

81 Page 81 Summary: Semantic Role Labeling Demo: http://L2R.cs.uiuc.edu/~cogcomp Who did what to whom, when, where, why,…

82 Example 3: co-ref (thanks Pascal Denis) Page 82

83 Example 3: co-ref (thanks Pascal Denis) Page 83 Traditional approach: train a classifier which predicts the probability that two named entities co-refer: P co-ref (Clinton, NPR) = 0.2 (order sensitive!) P co-ref (Lewinsky, her) = 0.6 Decision Process: E.g., If P co-ref (s 1, s 2 ) > 0.5, label two entities as co-referent or, clustering

84 Example 3: co-ref (thanks Pascal Denis) Page 84 Traditional approach: train a classifier which predicts the probability that two named entities co-refer: P co-ref (Clinton, NPR) = 0.2 P co-ref (Lewinsky, her) = 0.6 Decision Process: E.g., If P co-ref (s 1, s 2 ) > 0.5, label two entities as co-referent or, clustering Evaluation: Cluster all the entities based on co-ref classifier predictions. The evaluation is based on cluster overlap.

85 Example 3: co-ref (thanks Pascal Denis) Page 85 Two types of entities: “Base entities” “Anaphors” (pointers) NPR

86 Example 3: co-ref (thanks Pascal Denis) Page 86 Error analysis: 1) “Base entities” that “point” to anaphors. 2) Anaphors that don’t “point” to anything.

87 Proposed solution. Page 87 Pipelined approach: Identify anaphors. Forbid links of certain types. NPR

88 Proposed solution. Page 88 Doesn’t work, despite 80% accuracy of the anaphoricity classifier. The reason: error propagation in the pipeline. NPR

89 Joint Solution with CCMs Page 89

90 Joint solution using CCMs Page 90

91 Joint solution using CCMs Page 91 This performs considerably better than the pipelined approach, and actually improves (though marginally) over the baseline without the anaphoricity classifier.

92 Joint solution using CCMs Page 92 α β Note: If we have reason to believe one of the classifiers significantly more than the other, we can scale the two terms with tuned parameters α and β.

93 Page 93 Example 4 - Sentence Compression (thanks James Clarke).

94 Page 94 Example Example sentence (words indexed 0–8): Big fish eat small fish in a small pond Compression: Big fish in a pond

95 Page 95 Language model-based compression

96 Page 96 Example 2 : summarization (thanks James Clarke) This formulation requires some additional constraints Big fish eat small fish in a small pond No selection of decision variables can make these trigrams appear consecutively in output. We skip these constraints here.

97 Page 97 Trigram model in action

98 Page 98 Modifier Constraints

99 Page 99 Example

100 Page 100 Example

101 Page 101 Sentential Constraints

102 Page 102 Example

103 Page 103 Example

104 Page 104 More constraints

105 Other CCM Examples: Opinion Recognition Y. Choi, E. Breck, and C. Cardie. Joint Extraction of Entities and Relations for Opinion Recognition. EMNLP-2006 Semantic parsing variation:  Agent=entity  Relation=opinion Constraints:  An agent can have at most two opinions.  An opinion should be linked to only one agent.  The usual non-overlap constraints. Page 105

106 Other CCM Examples: Temporal Ordering N. Chambers and D. Jurafsky. Jointly Combining Implicit Constraints Improves Temporal Ordering. EMNLP-2008. Page 106

107 Other CCM Examples: Temporal Ordering N. Chambers and D. Jurafsky. Jointly Combining Implicit Constraints Improves Temporal Ordering. EMNLP-2008. Page 107 Three types of edges: 1)Annotation relations before/after 2)Transitive closure constraints 3)Time normalization constraints

108 Related Work: Language generation. Regina Barzilay and Mirella Lapata. Aggregation via Set Partitioning for Natural Language Generation. HLT-NAACL-2006. Constraints:  Transitivity: if (e_i, e_j) were aggregated, and (e_j, e_k) were too, then (e_i, e_k) get aggregated.  Max number of facts aggregated, max sentence length. Page 108

109 MT & Alignment Ulrich Germann, Mike Jahr, Kevin Knight, Daniel Marcu, and Kenji Yamada. Fast decoding and optimal decoding for machine translation. ACL 2001. John DeNero and Dan Klein. The Complexity of Phrase Alignment Problems. ACL-HLT-2008. Page 109

110 Summary of Examples We have shown several different NLP solutions that make use of CCMs. Examples vary in the way models are learned. In all cases, constraints can be expressed in a high level language, and then transformed into linear inequalities. Learning Based Java (LBJ) [Rizzolo & Roth ’07] describes an automatic way to compile high level descriptions of constraints into linear inequalities. Page 110

111 Page 111 Solvers All applications presented so far used ILP for inference. People used different solvers  Xpress-MP  GLPK  lpsolve  R  Mosek  CPLEX Next we discuss other inference approaches to CCMs

112 112 This Tutorial: Constrained Conditional Models (2 nd part) Part 4: More on Inference –  Other inference algorithms Search (SRL); Dynamic Programming (Transliteration); Cutting Planes  Using hard and soft constraints Part 5: Training issues when working with CCMS  Formalism (again)  Choices of training paradigms -- Tradeoffs Examples in Supervised learning Examples in Semi-Supervised learning Part 6: Conclusion  Building CCMs  Features and Constraints; Objective functions; Different Learners.  Mixed models vs. Joint models; where is Knowledge coming from THE END

113 113 Where Are We ? We hope we have already convinced you that  Using constraints is a good idea for addressing NLP problems  Constrained conditional models provide a good platform We were talking about using expressive constraints  To improve existing models  Learning + Inference  The problem: inference A powerful inference tool: Integer Linear Programming  SRL, co-ref, summarization, entity-and-relation…  Easy to inject domain knowledge

114 114 Constrained Conditional Model : Inference How to solve? This is an Integer Linear Program Solving using ILP packages gives an exact solution. Search techniques are also possible (Soft) constraints component Weight Vector for “local” models Constraint violation penalty How far y is from a “legal” assignment A collection of Classifiers; Log-linear models (HMM, CRF) or a combination How to train? How to decompose the global objective function? Should we incorporate constraints in the learning process?

115 115 Advantages of ILP Solvers: Review ILP is Expressive: We can solve many inference problems  Converting inference problems into ILP is easy ILP is Easy to Use: Many available packages  (Open Source Packages): LPSolve, GLPK, …  (Commercial Packages): XPressMP, Cplex  No need to write optimization code! Why should we consider other inference options?

116 116 ILP: Speed Can Be an Issue Inference problems in NLP  Sometimes large problems are actually easy for ILP E.g. Entities-Relations Many of them are not “difficult” When ILP isn’t fast enough, one needs to resort to approximate solutions. The Problem: General Solvers vs. Specific Solvers  ILP is a very, very general solver  But, sometimes the structure of the problem allows for simpler inference algorithms. Next we give examples for both cases.

117 117 Example 1: Search based Inference for SRL The objective function Constraints  Unique labels  No overlapping or embedding  If verb is of type A, no argument of type B …… Intuition: check constraints’ violations on partial assignments Maximize summation of the scores subject to linguistic constraints Indicator variable assigns the j-th class for the i-th token Classification confidence

118 118 Inference using Beam Search For each step, discard partial assignments that violate constraints! Shape: argument Color: label Beam size = 2, Constraint: Only one Red Rank them according to classification confidence!

119 119 Heuristic Inference Problems of heuristic inference  Problem 1: Possibly, sub-optimal solution  Problem 2: May not find a feasible solution Drop some constraints, solve it again Using search on SRL gives comparable results to using ILP, but is much faster.
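A minimal sketch of the beam-search inference described on the previous two slides: expand partial assignments left to right, discard those that already violate a hard constraint, and keep the top-k by classification confidence. The toy scores and the “only one Red” constraint mirror the picture on the slide; everything here is illustrative.

def beam_search(token_scores, labels, violates, beam_size=2):
    # token_scores[i][l]: classifier confidence that token i takes label l.
    beam = [((), 0.0)]                              # (partial assignment, accumulated score)
    for scores in token_scores:
        candidates = [(prefix + (l,), total + scores[l])
                      for prefix, total in beam for l in labels]
        # Hard constraints: discard partial assignments that violate a constraint.
        candidates = [c for c in candidates if not violates(c[0])]
        # Keep the top-k partial assignments, ranked by classification confidence.
        beam = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beam[0] if beam else None

violates = lambda ys: ys.count("Red") > 1           # constraint: only one Red
token_scores = [{"Red": 0.9, "Blue": 0.1}, {"Red": 0.8, "Blue": 0.2}, {"Red": 0.3, "Blue": 0.7}]
print(beam_search(token_scores, ["Red", "Blue"], violates))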

120 120 How to get a score for the pair? Previous approaches: Extract features for each source and target entity pair The CCM approach: Introduce an internal structure (characters) Constrain character mappings to “make sense”. Example 2: Exploiting Structure in Inference: Transliteration

121 121 Transliteration Discovery with CCM The problem now: inference  How to find the best mapping that satisfies the constraints? A weight is assigned to each edge. Include it or not? A binary decision. Score = sum of the mappings’ weights, s.t. the mapping satisfies the constraints. Assume the weights are given; more on this later. Natural constraints: Pronunciation constraints One-to-One Non-crossing …

122 122 Finding The Best Character Mappings An Integer Linear Programming Problem Is this the best inference algorithm? Maximize the mapping score Pronunciation constraint One-to-one constraint Non-crossing constraint

123 123 Finding The Best Character Mappings A Dynamic Programming Algorithm Exact and fast! Maximize the mapping score Restricted mapping constraints One-to-one constraint Non-crossing constraint Take Home Message: Although ILP can solve most problems, the fastest inference algorithm depends on the constraints and can be simpler We can decompose the inference problem into two parts
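A sketch of the dynamic program hinted at on this slide: the best one-to-one, non-crossing character mapping reduces to a classic O(nm) alignment recurrence. The weights are hypothetical; restrictions such as the pronunciation constraints could be folded in by setting forbidden mappings to a very negative weight. This is one reading of the slide, not the authors’ exact algorithm.

def best_monotone_mapping(weight):
    # weight[i][j]: score for mapping source character i to target character j.
    # One-to-one and non-crossing => a monotone alignment; characters may stay unmapped (score 0).
    n, m = len(weight), len(weight[0]) if weight else 0
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(dp[i - 1][j],                             # leave source char i unmapped
                           dp[i][j - 1],                             # leave target char j unmapped
                           dp[i - 1][j - 1] + weight[i - 1][j - 1])  # map i to j
    return dp[n][m]

weights = [[0.9, 0.1, 0.0, 0.0],     # hypothetical mapping weights, 3 source x 4 target chars
           [0.0, 0.8, 0.2, 0.0],
           [0.0, 0.0, 0.1, 0.7]]
print(best_monotone_mapping(weights))   # 0.9 + 0.8 + 0.7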

124 124 Other Inference Options Constraint Relaxation Strategies  Try Linear Programming [Roth and Yih, ICML 2005]  Cutting plane algorithms – do not use all constraints at first Dependency Parsing: exponential number of constraints [Riedel and Clarke, EMNLP 2006] Other search algorithms  A-star, Hill Climbing…  Gibbs Sampling Inference [Finkel et al., ACL 2005] Named Entity Recognition: enforce long distance constraints Can be considered as: Learning + Inference One type of constraints only

125 125 Inference Methods – Summary Why ILP? A powerful way to formalize the problems  However, not necessarily the best algorithmic solution Heuristic inference algorithms are useful sometimes!  Beam search  Other approaches: annealing … Sometimes, a specific inference algorithm can be designed  According to your constraints

126 126 This Tutorial: Constrained Conditional Models (2 nd part) Part 4: More on Inference –  Other inference algorithms Search (SRL); Dynamic Programming (Transliteration); Cutting Planes  Using hard and soft constraints Part 5: Training issues when working with CCMS  Formalism (again)  Choices of training paradigms -- Tradeoffs Examples in Supervised learning Examples in Semi-Supervised learning Part 6: Conclusion  Building CCMs  Features and Constraints; Objective functions; Different Learners.  Mixed models vs. Joint models; where is Knowledge coming from THE END

127 127 Constrained Conditional Model: Training How to solve? This is an Integer Linear Program Solving using ILP packages gives an exact solution. Search techniques are also possible (Soft) constraints component Weight Vector for “local” models Constraint violation penalty How far y is from a “legal” assignment A collection of Classifiers; Log-linear models (HMM, CRF) or a combination How to train? How to decompose the global objective function? Should we incorporate constraints in the learning process?

128 128 Where are we? Algorithmic Approach: Incorporating general constraints  Showed that CCMs allow for formalizing many problems  Showed several algorithmic ways to incorporate global constraints in the decision. Training: Coupling vs. Decoupling Training and Inference.  Incorporating global constraints is important but  Should it be done only at evaluation time or also at training time?  How to decompose the objective function and train in parts?  Issues related to: Modularity, efficiency and performance, availability of training data Problem specific considerations

129 129 Training in the presence of Constraints General Training Paradigm:  First Term: Learning from data (could be further decomposed)  Second Term: Guiding the model by constraints  Can choose whether the constraints’ weights are trained, when and how, or whether constraints are taken into account only at evaluation time. Decompose Model (SRL case) Decompose Model from constraints

130 130 Comparing Training Methods Option 1: Learning + Inference (with Constraints)  Ignore constraints during training Option 2: Inference (with Constraints) Based Training  Consider constraints during training In both cases: Decision Making with Constraints Question: Isn’t Option 2 always better? Not so simple…  Next, the “Local model story”

131 131 Training Methods (Figure: local models f1(x) … f5(x) over inputs x1 … x7 and outputs y1 … y5.) Learning + Inference (L+I): learn the models independently. Inference Based Training (IBT): learn all models together! Intuition: learning with constraints may make learning more difficult.

132 132 Training with Constraints Example: Perceptron-based Global Learning (Figure: inputs x1 … x7, local models f1(x) … f5(x), the true global labeling Y, the local predictions Y’, and the result of applying the constraints to Y’.) Which one is better? When and why?

133 133 Claims [Punyakanok et al., IJCAI 2005] The theory applies to the case of local models (no Y in the features) When the local models are “easy” to learn, L+I outperforms IBT.  In many applications, the components are identifiable and easy to learn (e.g., argument, open-close, PER). Only when the local problems become difficult to solve in isolation does IBT outperform L+I, but it needs a larger number of training examples. Other training paradigms are possible Pipeline-like Sequential Models: [Roth, Small, Titov: AI&Stat’09]  Identify a preferred ordering among components  Learn the k-th model jointly with previously learned models L+I: cheaper computationally; modular IBT is better in the limit, and in other extreme cases.

134 134 Bound Prediction: Local: ε ≤ ε_opt + ((d log m + log 1/δ) / m)^{1/2} Global: ε ≤ 0 + ((c·d log m + c²d + log 1/δ) / m)^{1/2} (Plots: bounds and simulated data for ε_opt = 0, 0.1, 0.2; ε_opt is an indication of the hardness of the problem.) L+I vs. IBT: the more identifiable the individual problems are, the better the overall performance is with L+I

135 135 Relative Merits: SRL (Plot: performance vs. difficulty of the learning problem (# features), from easy to hard.) L+I is better. When the problem is artificially made harder, the tradeoff is clearer. In some cases problems are hard due to lack of training data. Semi-supervised learning

136 136 L+I & IBT: General View – Structured Perceptron
For each iteration
  For each (X, Y_GOLD) in the training data
    Y_PRED = argmax_Y λ · F(X, Y)
    If Y_PRED != Y_GOLD
      λ = λ + F(X, Y_GOLD) - F(X, Y_PRED)
    endif
  endfor
The theory applies when F(x,y) = F(x). The difference between L+I and IBT is whether the argmax (the inference step) during training is computed with or without the constraints.

137 137 Comparing Training Methods (Cont.) Local Models (train independently) vs. Structured Models  In many cases, structured models might be better due to expressivity But, what if we use constraints? Local Models+Constraints vs. Structured Models +Constraints  Hard to tell: Constraints are expressive  For tractability reasons, structured models have less expressivity than the use of constraints.  Local can be better, because local models are easier to learn Decompose Model (SRL case) Decompose Model from constraints

138 138 Example: Semantic Role Labeling Revisited (Figure: a lattice from s to t with labels A, B, C at each position.) Sequential Models: Conditional Random Field, Global perceptron; Training: sentence based; Testing: find the shortest path, with constraints. Local Models: Logistic Regression, Local Avg. Perceptron; Training: token based; Testing: find the best assignment locally, with constraints.

139 139 Which Model is Better? Semantic Role Labeling Experiments on SRL: [Roth and Yih, ICML 2005]  Story: Inject constraints into conditional random field models
Model           CRF     CRF-D   CRF-IBT   Avg. P
Baseline        66.46   69.14   –         58.15
+ Constraints   71.94   73.91   69.82     74.49
Training Time   48      38      145       0.8
(CRF and CRF-D are sequential models; Avg. P is the local model trained with L+I; CRF-IBT is trained with IBT.) Sequential Models are better than Local Models when there are no constraints; Local Models are now better than Sequential Models once constraints are added!

140 140 Summary: Training Methods Many choices for training a CCM  Learning + Inference (Training without constraints)  Inference based Learning (Training with constraints)  Based on this, what kind of models you want to use? Advantages of L+I  Require fewer training examples  More efficient; most of the time, better performance  Modularity; easier to incorporate already learned models.

141 141 Constrained Conditional Model: Soft Constraints (3) How to solve? This is an Integer Linear Program Solving using ILP packages gives an exact solution. Search techniques are also possible (Soft) constraints component Constraint violation penalty How far y is from a “legal” assignment Subject to constraints (4) How to train? How to decompose the global objective function? Should we incorporate constraints in the learning process? (1) Why use soft constraint? (2) How to model “degree of violations”

142 142 (1) Why Are Soft Constraints Important? Some constraints may be violated by the data. Even when the gold data violates no constraints, the model may prefer illegal solutions.  We want a way to rank solutions based on the level of constraints’ violation.  Important when beam search is used Working with soft constraints  Need to define the degree of violation  Need to assign penalties for constraints

143 143 Example: Information extraction Prediction result of a trained HMM Lars Ole Andersen. Program analysis and specialization for the C Programming language. PhD thesis. DIKU, University of Copenhagen, May 1994. [AUTHOR] [TITLE] [EDITOR] [BOOKTITLE] [TECH-REPORT] [INSTITUTION] [DATE] Violates lots of natural constraints! Lars Ole Andersen. Program analysis and specialization for the C Programming language. PhD thesis. DIKU, University of Copenhagen, May 1994.

144 144 Examples of Constraints Each field must be a consecutive list of words and can appear at most once in a citation. State transitions must occur on punctuation marks. The citation can only start with AUTHOR or EDITOR. The words pp., pages correspond to PAGE. Four digits starting with 20xx and 19xx are DATE. Quotations can appear only in TITLE …….

145 145 (2) Modeling Constraints’ Degree of Violations Φ_c(y_i) = 1 if assigning y_i to x_i violates the constraint C with respect to the partial assignment (x_1,…,x_{i-1}; y_1,…,y_{i-1}), and 0 otherwise. Count how many times the constraint is violated: d(y, 1_{C(X)}) = ∑ Φ_c(y_i) Example (constraint: state transitions must occur on punctuation): labeling “Lars Ole Andersen .” as AUTH … EDITOR gives Φ_c = (0, 0, 1, 0), so ∑ Φ_c(y_i) = 1; labeling it as AUTH BOOK EDITOR … gives Φ_c = (0, 1, 1, 0), so ∑ Φ_c(y_i) = 2.

146 146 Example: Degree of Violation? It might be the case that all “good” assignments violate some constraints  We then lose the ability to judge which assignment is better Better option: choose the one with fewer violations! For the two labelings of “Lars Ole Andersen .” above (Φ_c = (0,0,1,0) vs. Φ_c = (0,1,1,0)), the first one is better because of d(y, 1_{C(X)})!
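A small sketch of the violation count d(y, 1_{C(X)}) for the constraint “state transitions must occur on punctuation marks”, reproducing the two labelings above. The exact reading of “on punctuation” (the token before the transition must be a punctuation mark) is an assumption.

def transition_violations(tokens, labels):
    # Degree of violation: number of label transitions that do NOT occur on punctuation.
    punctuation = {".", ",", ";", ":"}
    violations = 0
    for i in range(1, len(tokens)):
        transition = labels[i] != labels[i - 1]
        on_punct = tokens[i - 1] in punctuation   # assumed reading of "on punctuation"
        if transition and not on_punct:
            violations += 1
    return violations

tokens = ["Lars", "Ole", "Andersen", "."]
print(transition_violations(tokens, ["AUTH", "AUTH", "EDITOR", "EDITOR"]))  # d = 1
print(transition_violations(tokens, ["AUTH", "BOOK", "EDITOR", "EDITOR"]))  # d = 2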

147 147 Hard Constraints vs. Weighted Constraints Hard constraints: appropriate when the constraints are close to perfect. Weighted (soft) constraints: appropriate when the labeled data might not follow the constraints.

148 148 (3) Constrained Conditional Model with Soft Constraints How to solve? This is an Integer Linear Program Solving using ILP packages gives an exact solution. Search techniques are also possible (Soft) constraints component Constraint violation penalty How far y is from a “legal” assignment How to train? How to decompose the global objective function? Should we incorporate constraints in the learning process?

149 Inference with Beam Search (& Soft Constraints) Example citation: L. Shari et al. 1998. Building semantic Concordances. In C. Fellbaum, ed., WordNet and its applications. Run beam search inference as before; rather than eliminating illegal assignments, re-rank them. (ILP is another option!) 149

150 Assume that We Have Made a Mistake… 150 Labels assigned so far: AU, DATE, TITLE, JOURNAL over “L. Shari et al. 1998. Building semantic Concordances. In”; the remainder “C. Fellbaum, ed., WordNet and its applications.” is still unlabeled. Assumption: this is the best assignment according to the objective function. Start from here.

151 Top Choice from Learned Models Is Not Good: ∑Φ_c1(y_i) = 1; ∑Φ_c2(y_i) = 1+1 151 Labels: AU, DATE, TITLE, JOURNAL, BOOK over “L. Shari et al. 1998. Building semantic Concordances. In”, and AU over “C. Fellbaum, ed., WordNet and its applications.” Each field must be a consecutive list of words and can appear at most once in a citation. State transitions must occur on punctuation marks. Two violations.

152 The Second Choice from Learned Models Is Not Good 152 Labels: AU, DATE, TITLE, JOURNAL, BOOK over “L. Shari et al. 1998. Building semantic Concordances. In”, and EDITOR over “C. Fellbaum, ed., WordNet and its applications.” State transitions must occur on punctuation marks. Two violations. ∑Φ_c2(y_i) = 1+1

153 Use Constraints to Find the Best Assignment Labels: AU, DATE, TITLE, JOURNAL, EDITOR over “L. Shari et al. 1998. Building semantic Concordances. In”, and EDITOR over “C. Fellbaum, ed., WordNet and its applications.” 153 State transitions must occur on punctuation marks. One violation. ∑Φ_c2(y_i) = 1

154 Soft Constraints Help Us Find a Better Assignment Labels: AU, DATE, TITLE, JOURNAL, EDITOR over “L. Shari et al. 1998. Building semantic Concordances. In”, and EDITOR, BOOK over “C. Fellbaum, ed., WordNet and its applications.” 154 We can do this because we use soft constraints. If we used hard constraints, the score of every assignment with this prefix would be negative infinity.

155 155 (4) Constrained Conditional Model with Soft Constraints How to solve? This is an Integer Linear Program Solving using ILP packages gives an exact solution. Search techniques are also possible (Soft) constraints component Constraint violation penalty How far y is from a “legal” assignment How to train? How to decompose the global objective function? Should we incorporate constraints in the learning process?

156 156 Compare Training Methods (with Soft Constraints) Need to figure out the penalty as well… Option 1: Learning + Inference (with Constraints)  Learn the weights and penalties separately Option 2: Inference (with Constraints) Based Training  Learn the weights and penalties together The tradeoff between L+I and IBT is similar to what we have seen earlier.

157 157 Inference Based Training With Soft Constraints Example: Perceptron – update the penalties as well!
For each iteration
  For each (X, Y_GOLD) in the training data
    Y_PRED = argmax_Y λ · F(X, Y) - ∑_I ρ_I d(Y, 1_{C_I(X)})
    If Y_PRED != Y_GOLD
      λ = λ + F(X, Y_GOLD) - F(X, Y_PRED)
      ρ_I = ρ_I + d(Y_GOLD, 1_{C_I(X)}) - d(Y_PRED, 1_{C_I(X)}),  I = 1,…
    endif
  endfor
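A runnable sketch of the update above. The names and the toy task are made up; real systems would replace the exhaustive argmax with ILP or search. The inference score is w·F(x,y) minus the weighted violation counts, and the penalty update below is written for that subtracted-penalty convention (the opposite sign to the transcription above), so that penalties grow when the prediction violates a constraint more than the gold does.

import itertools

def ibt_soft_constraints(data, feat, constraints, outputs_for, epochs=10):
    # feat(x, y) -> dict of feature counts; constraints: list of violation counters d_i(x, y).
    w, rho = {}, [0.0] * len(constraints)

    def score(x, y):
        s = sum(w.get(f, 0.0) * v for f, v in feat(x, y).items())
        return s - sum(r * d(x, y) for r, d in zip(rho, constraints))

    def predict(x):
        return max(outputs_for(x), key=lambda y: score(x, y))   # inference WITH soft constraints

    for _ in range(epochs):
        for x, y_gold in data:
            y_pred = predict(x)
            if y_pred != y_gold:
                for f, v in feat(x, y_gold).items():
                    w[f] = w.get(f, 0.0) + v
                for f, v in feat(x, y_pred).items():
                    w[f] = w.get(f, 0.0) - v
                # Penalty update: push rho_i up when the prediction violates constraint i
                # more than the gold does.
                rho = [r + d(x, y_pred) - d(x, y_gold) for r, d in zip(rho, constraints)]
    return w, rho

# Tiny demo: two-token inputs, outputs are label pairs, one soft constraint ("at most one A").
outputs_for = lambda x: list(itertools.product(["A", "B"], repeat=len(x)))
feat = lambda x, y: {(tok, lab): 1.0 for tok, lab in zip(x, y)}
constraints = [lambda x, y: y.count("A") - 1 if y.count("A") > 1 else 0]
data = [(("w1", "w2"), ("A", "B")), (("w3", "w4"), ("B", "A"))]
print(ibt_soft_constraints(data, feat, constraints, outputs_for))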

158 158 L+I vs. IBT for Soft Constraints Test on citation recognition:  L+I: HMM + weighted constraints  IBT: Perceptron + weighted constraints  Same feature set With constraints  The factored model is better  More significant with a small # of examples Without constraints  Few labeled examples: HMM > perceptron  Many labeled examples: perceptron > HMM Agrees with earlier results in the supervised setting ICML’05, IJCAI’05

159 Summary – Soft Constraints Using soft constraints is sometimes important  Some constraints might be violated in the gold data  We want to have the notion of degree of violation Degree of violation  One approximation: count how many times the constraint was violated How to solve? Beam search, ILP, … How to learn?  L+I vs. IBT 159

160 160 Constrained Conditional Model: Injecting Knowledge How to solve? This is an Integer Linear Program Solving using ILP packages gives an exact solution. Search techniques are also possible (Soft) constraints component Weight Vector for “local” models Constraint violation penalty How far y is from a “legal” assignment A collection of Classifiers; Log-linear models (HMM, CRF) or a combination How to train? How to decompose the global objective function? Should we incorporate constraints in the learning process? Inject Prior Knowledge via Constraints A.K.A. Semi/unsupervised Learning with Constrained Conditional Model

161 161 Constraints As a Way To Encode Prior Knowledge Consider encoding the knowledge that:  Entities of type A and B cannot occur simultaneously in a sentence The “Feature” Way  Requires larger models The Constraints Way  Keeps the model simple; add expressive constraints directly  A small set of constraints  Allows for decision time incorporation of constraints An effective way to inject knowledge Need more training data We can use constraints as a way to replace training data

162 162 Constraint Driven Semi/Unsupervised Learning (CODL) Use constraints to generate better training samples in semi/unsupervised learning. Model Unlabeled Data Prediction Label unlabeled data Feedback Learn from labeled data Seed Examples  Prediction + Constraints Label unlabeled data Better Feedback Learn from labeled data Better Model Resource  In traditional semi/unsupervised learning, models can drift away from the correct model

163 163 Semi-Supervised Learning with Soft Constraints (L+I) Learning model weights  Example: HMM Constraint Penalties  Hard Constraints: infinity  Weighted Constraints: ρ_i = -log P{Constraint C_i is violated in the training data} Only 10 constraints Learn the weights and penalties separately

164 164 Semi-supervised Learning with Constraints [Chang, Ratinov, Roth, ACL’07]
λ = learn(Labeled)        (supervised learning algorithm, parameterized by λ)
For N iterations do
  T = ∅
  For each x in the unlabeled dataset
    {y_1,…,y_K} = InferenceWithConstraints(x, C, λ)   (inference-based augmentation of the training set – the feedback step)
    T = T ∪ {(x, y_i)} i=1…K
  λ = γ·λ + (1-γ)·learn(T)   (learn from the new training data; weigh the supervised and unsupervised models)
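A structural sketch of the loop above (CODL). The learner and the constrained top-K inference routine are left as callables the user supplies; representing models as weight dictionaries and the exact placement of the γ-weighted combination are assumptions made only to keep the sketch concrete.

def codl(labeled, unlabeled, learn, inference_with_constraints, gamma=0.9, iterations=5):
    # learn(pairs) -> model weights (dict); inference_with_constraints(x, model) -> top-K labelings.
    model = learn(labeled)                         # initialize from the seed examples
    for _ in range(iterations):
        t = []
        for x in unlabeled:                        # label unlabeled data with model + constraints
            for y in inference_with_constraints(x, model):
                t.append((x, y))
        new_model = learn(t)                       # learn from the newly labeled data
        # Weigh the supervised and unsupervised models; keeps the model from drifting.
        model = {f: gamma * model.get(f, 0.0) + (1 - gamma) * new_model.get(f, 0.0)
                 for f in set(model) | set(new_model)}
    return model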

165 165 Value of Constraints in Semi-Supervised Learning (Plot: objective function vs. # of available labeled examples; curves for “Learning w 10 Constraints” and “Learning w/o Constraints: 300 examples”.) Constraints are used to bootstrap a semi-supervised learner: a poor model + constraints is used to annotate unlabeled data, which in turn is used to keep training the model.

166 166 Unsupervised Constraint Driven Learning Semi-supervised Constraint Driven Learning  In [Chang et.al. ACL 2007], they use a small labeled dataset Reason: We need a good starting point! Unsupervised Constraint Driven Learning  Sometimes, good resources exist We can use the resource to initialize the model We do not need labeled instances at all! Application: transliteration discovery

167 167 Why Add Constraints? Before talking about transliteration discovery  Let’s revisit why we add constraints  Reason: We want to capture the dependencies between different outputs A new question: What happens if you are trying to solve a single output classification problem?  Can you still inject prior knowledge?

168 168 Why Add Constraints? (Figure: inputs x1 … x7 and multiple outputs y1 … y5.) Structured Output Problem: dependencies between different outputs. Use constraints to capture the dependencies.

169 169 Why Add Constraints? (Figure: inputs x1 … x7 and a single output y1.) Single Output Problem: only one output – hard to find constraints!? Intuition: introduce structural hidden variables.

170 170 Adding Constraints Through Hidden Variables (Figure: inputs x1 … x7, hidden variables f1 … f5, and a single output y1.) Single Output Problem with hidden variables: use constraints to capture the dependencies.

171 171 Example: Transliteration Discovery Hidden variables: character mappings. Character mappings are not our final goal, but they provide an intermediate representation that satisfies natural constraints. The score for each pair depends on the character mappings.

172 172 Transliteration Discovery with Hidden Variables For each source NE, find the best target candidate For each NE pair, the score is equal to  the score of the best hidden set of character mappings (edges) Add constraints when predicting the hidden variables Violation A CCM formulation: use constraints to bias the prediction! Alternative view: finding the best feature representation

173 173 Natural Constraints A Dynamic Programming Algorithm Exact and fast! Maximize the mapping score Pronunciation constraints One-to-one constraint Non-crossing constraint Take Home Message: Although ILP can solve most problems, the fastest inference algorithm depends on the constraints and can be simpler We can decompose the problem into two parts

174 174 Algorithm: High Level View Model  Resource (Romanization Table) Until Convergence  Use model + Constraints to get assignments for both hidden variables (F) and labels (Y)  Update the model with newly-labeled F and y Get feedback from both hidden variables and labels

175 175 Results – Russian (Plot; higher is better.) Compared: (1) no constraints + 20 labeled examples; (2) no constraints + 20 labeled examples + temporal information; (3) unsupervised constraint driven learning – no labeled data, no temporal information. No NER on Russian.

176 176 Results – Hebrew (Plot.) Compared: (1) no constraints + 250 labeled examples; (2) unsupervised constraint driven learning – no labeled data, no temporal information.

177 Summary Adding constraints is an effective way to inject knowledge  Constraints can correct our predictions on unlabeled data Injecting constraints is sometimes more effective than annotating data  300 labeled examples vs. 20 labeled examples + constraints In the case of single output problems  Use hidden variables to capture the constraints Other approaches are possible  [Daumé, EMNLP 2008] Only get feedback from the examples where the constraints are satisfied 177

178 178 Page 178 This Tutorial: Constrained Conditional Models (2 nd part) Part 4: More on Inference –  Other inference algorithms Search (SRL); Dynamic Programming (Transliteration); Cutting Planes  Using hard and soft constraints Part 5: Training issues when working with CCMS  Formalism (again)  Choices of training paradigms -- Tradeoffs Examples in Supervised learning Examples in Semi-Supervised learning Part 6: Conclusion  Building CCMs  Features and Constraints; Objective functions; Different Learners.  Mixed models vs. Joint models; where is Knowledge coming from THE END

179 179 Conclusion Constrained Conditional Models combine  Learning conditional models with using declarative expressive constraints  Within a constrained optimization framework Our goal was to:  Introduce a clean way of incorporating constraints to bias and improve decisions of learned models  A clean way to use (declarative) prior knowledge to guide semi-supervised learning  Provide examples for the diverse usage CCMs have already found in NLP Significant success on several NLP and IE tasks (often, with ILP)

180 180 Technical Conclusions Presented and discussed inference algorithms  How to use constraints to make global decisions  The formulation is an Integer Linear Programming formulation, but algorithmic solutions can employ a variety of algorithms Presented and discussed learning models to be used along with constraints  Training protocol matters  We distinguished two extreme cases – training with/without constraints.  Issues include performance, as well as modularity of the solution and the ability to use previously learned models. We did not attend to the question of “how to find constraints”  This was in order to emphasize that background knowledge is important and exists, and to focus on using it.  But we talked about learning weights for constraints, and it’s clearly possible to learn constraints.

181 181 Summary: Constrained Conditional Models y* = argmax_y Σ_i w_i φ_i(x, y) − Σ_k ρ_k d_{C_k}(x, y) Linear objective functions Typically φ(x, y) will be local functions, or φ(x, y) = φ(x) (Figure: a Conditional Markov Random Field alongside a Constraints Network over y1 … y8.) Expressive constraints over output variables Soft, weighted constraints Specified declaratively as FOL formulae Clearly, there is a joint probability distribution that represents this mixed model. We would like to:  Learn a simple model or several simple models  Make decisions with respect to a complex model Key difference from MLNs, which provide a concise definition of a model – but of the whole joint one.

182 182 Designing CCMs Decide what variables are of interest – learn model(s) Think about constraints among the variables of interest Design your objective function y* = argmax_y Σ_i w_i φ_i(x, y) − Σ_k ρ_k d_{C_k}(x, y) Linear objective functions Typically φ(x, y) will be local functions, or φ(x, y) = φ(x) Expressive constraints over output variables Soft, weighted constraints Specified declaratively as FOL formulae LBJ (Learning Based Java): http://L2R.cs.uiuc.edu/~cogcomp A modeling language for Constrained Conditional Models. Supports programming along with building learned models, high level specification of constraints and inference with constraints

183 183 Questions? Thank you!

