Decomposing Structured Prediction via Constrained Conditional Models
Dan Roth, Department of Computer Science, University of Illinois at Urbana-Champaign
June 2013 SLG Workshop, ICML, Atlanta GA


Slide 1: Decomposing Structured Prediction via Constrained Conditional Models
June 2013 SLG Workshop, ICML, Atlanta GA
Dan Roth, Department of Computer Science, University of Illinois at Urbana-Champaign
With thanks to collaborators: Ming-Wei Chang, Lev Ratinov, Rajhans Samdani, Vivek Srikumar, and many others.
Funding: NSF; DHS; NIH; DARPA. DASH Optimization (Xpress-MP).

Slide 2: Comprehension
Statements to judge:
1. Christopher Robin was born in England.
2. Winnie the Pooh is a title of a book.
3. Christopher Robin's dad was a magician.
4. Christopher Robin must be at least 65 now.
Passage: (ENGLAND, June, 1989) - Christopher Robin is alive and well. He lives in England. He is the same person that you read about in the book, Winnie the Pooh. As a boy, Chris lived in a pretty home called Cotchfield Farm. When Chris was three years old, his father wrote a poem about him. The poem was printed in a magazine for others to read. Mr. Robin then wrote a book. He made up a fairy tale land where Chris lived. His friends were animals. There was a bear called Winnie the Pooh. There was also an owl and a young pig, called a piglet. All the animals were stuffed toys that Chris owned. Mr. Robin made them come to life with his words. The places in the story were all near Cotchfield Farm. Winnie the Pooh was written in [year missing in transcript]. Children still love to read about Christopher Robin and his animal friends. Most people don't know he is a real person who is grown now. He has written two books of his own. They tell what it is like to be famous.
This is an inference problem.

Slide 3: Learning and Inference
- Global decisions in which several local decisions play a role, but there are mutual dependencies on their outcomes.
- In current NLP we often think about simpler structured problems: parsing, information extraction, SRL, etc.
- As we move up the problem hierarchy (textual entailment, QA, ...), not all component models can be learned simultaneously.
- We need to think about (learned) models for different sub-problems, often pipelined.
- Knowledge relating sub-problems (constraints) becomes more essential and may appear only at evaluation time.
- Goal: incorporate the models' information, along with prior knowledge (constraints), in making coherent decisions: decisions that respect the local models as well as domain- and context-specific knowledge/constraints.

Slide 4: Outline
- Constrained Conditional Models: a formulation for global inference with knowledge modeled as expressive structural constraints; a structured prediction perspective.
- Decomposed Learning (DecL): efficient structure learning by reducing the learning-time inference to a small output space; conditions under which DecL is provably identical to global structural learning (GL).

Slide 5: Three Ideas Underlying Constrained Conditional Models (modeling, inference, learning)
- Idea 1: Separate modeling and problem formulation from algorithms; similar to the philosophy of probabilistic modeling.
- Idea 2: Keep models simple; make expressive decisions (via constraints). Unlike probabilistic modeling, where models become more expressive.
- Idea 3: Expressive structured decisions can be supported by simply learned models, amplified and minimally supervised by exploiting dependencies among the models' outcomes.

Slide 6: Inference with General Constraint Structure [Roth & Yih '04, '07]: Recognizing Entities and Relations
- Example: "Dole's wife, Elizabeth, is a native of N.C." with entity variables E1, E2, E3 and relation variables R12, R23.
- Each entity classifier produces scores over {other, per, loc}; the values shown on the slide include (other 0.05, per 0.85, loc 0.10), (other 0.05, per 0.50, loc 0.45), and (other 0.10, per 0.60, loc 0.30). Each relation classifier produces scores over {irrelevant, spouse_of, born_in}, e.g., (irrelevant 0.05, spouse_of 0.45, born_in 0.50) and (irrelevant 0.10, spouse_of 0.05, born_in 0.85).
- Key questions: How to guide the global inference over independently learned or pipelined models? How to learn: independently, pipelined, or jointly?
- The decision is made by an objective function that incorporates the learned models together with knowledge (constraints): a Constrained Conditional Model. Note that this is a non-sequential model:
  y* = argmax_y Σ_v score(y = v) · 1[y = v]
     = argmax [ score(E1 = PER) · 1[E1 = PER] + score(E1 = LOC) · 1[E1 = LOC] + ... + score(R12 = spouse_of) · 1[R12 = spouse_of] + ... ], subject to constraints.
- Models could be learned separately; constraints may come up only at decision time. This yields a significant performance improvement.

Slide 7: Constrained Conditional Models
- The objective combines a learned scoring component with a (soft) constraints component:
  y* = argmax_y  w^T φ(x, y) - Σ_k ρ_k d(y, 1_{C_k(x)})
  where w is the weight vector for the local models, φ collects the features/classifiers (log-linear models such as an HMM or CRF, or a combination), ρ_k is the penalty for violating constraint C_k, and d(y, 1_{C_k(x)}) measures how far y is from a legal assignment.
- How to solve? This is an Integer Linear Program. Solving with ILP packages gives an exact solution; cutting planes, dual decomposition, and other search techniques are also possible.
- How to train? Training is learning the objective function. Decouple? Decompose? How can we exploit the structure to minimize supervision? (cf. the Inferning workshop)
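As a concrete illustration of the ILP view (this is my own sketch, not code from the talk), here is the entities-and-relations inference of slide 6 written as an integer linear program with the open-source PuLP package. The scores and the single coherence constraint (a born_in relation forces its second argument to be a location) are made-up stand-ins for the learned models and the declarative knowledge.

```python
# Hypothetical CCM-style inference as an ILP (sketch, using PuLP).
import pulp

entities = ["E1", "E2", "E3"]
ent_labels = ["per", "loc", "other"]
relations = ["R12", "R23"]
rel_labels = ["spouse_of", "born_in", "irrelevant"]

# Made-up local-classifier scores standing in for the learned models.
ent_score = {("E3", "per"): 0.50, ("E3", "loc"): 0.45}            # etc.
rel_score = {("R23", "born_in"): 0.50, ("R23", "spouse_of"): 0.45}

prob = pulp.LpProblem("ccm_inference", pulp.LpMaximize)
x = {(e, l): pulp.LpVariable(f"x_{e}_{l}", cat="Binary") for e in entities for l in ent_labels}
r = {(q, l): pulp.LpVariable(f"r_{q}_{l}", cat="Binary") for q in relations for l in rel_labels}

# Objective: sum of score * indicator, as in the slide's argmax.
prob += pulp.lpSum(ent_score.get(k, 0.0) * x[k] for k in x) + \
        pulp.lpSum(rel_score.get(k, 0.0) * r[k] for k in r)

# Each variable takes exactly one label.
for e in entities:
    prob += pulp.lpSum(x[(e, l)] for l in ent_labels) == 1
for q in relations:
    prob += pulp.lpSum(r[(q, l)] for l in rel_labels) == 1

# Declarative constraint: born_in(E2, E3) implies that E3 is a location.
prob += r[("R23", "born_in")] <= x[("E3", "loc")]

prob.solve()
best = {e: max(ent_labels, key=lambda l: x[(e, l)].value()) for e in entities}
```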

Slide 8: Structured Prediction: Inference (placing CCMs in context: a crash course in structured prediction)
- Inference: given an input x (a document, a sentence), predict the best structure y = {y1, y2, ..., yn} ∈ Y (entities and relations). Assign values to y1, y2, ..., yn, accounting for the dependencies among the yi.
- Inference is expressed as the maximization of a scoring function: y* = argmax_{y ∈ Y} w^T φ(x, y), where φ(x, y) are joint features on inputs and outputs, w are the feature weights (estimated during learning), and Y is the set of allowed structures.
- In principle, inference requires enumerating all y ∈ Y at decision time, when we are given x ∈ X and attempt to determine the best y ∈ Y for it, given w.
- For some structures inference is computationally easy, e.g., via the Viterbi algorithm; in general it is NP-hard (and can be formulated as an ILP).
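For the tractable case mentioned above, here is a minimal sketch (not from the talk) of Viterbi decoding for a linear-chain scoring function, where the score decomposes into per-position emission terms and label-to-label transition terms.

```python
import numpy as np

def viterbi(emission, transition):
    """Argmax decoding for a linear-chain model.

    emission: (n, L) array, emission[t, y] = local score of label y at position t
    transition: (L, L) array, transition[a, b] = score of label b following label a
    Returns the highest-scoring label sequence as a list of label indices."""
    n, L = emission.shape
    score = np.full((n, L), -np.inf)
    back = np.zeros((n, L), dtype=int)
    score[0] = emission[0]
    for t in range(1, n):
        for y in range(L):
            cand = score[t - 1] + transition[:, y] + emission[t, y]
            back[t, y] = int(np.argmax(cand))
            score[t, y] = cand[back[t, y]]
    # Follow back-pointers from the best final label.
    y_hat = [int(np.argmax(score[n - 1]))]
    for t in range(n - 1, 0, -1):
        y_hat.append(back[t, y_hat[-1]])
    return list(reversed(y_hat))
```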

Slide 9: Structured Prediction: Learning
- Learning: given a set of structured examples {(x, y)}, find a scoring function w that minimizes empirical loss.
- Learning is thus driven by the attempt to find a weight vector w such that for each annotated example (x_i, y_i):
  w^T φ(x_i, y_i) ≥ w^T φ(x_i, y) + Δ(y, y_i), for all y ∈ Y.

Slide 10: Structured Prediction: Learning (continued)
- For each annotated example (x_i, y_i), the score of the annotated structure must beat the score of any other structure by a penalty for predicting that other structure:
  w^T φ(x_i, y_i) ≥ w^T φ(x_i, y) + Δ(y, y_i), for all y.
- We call these conditions the learning constraints.
- In most structured learning algorithms used today, the weight vector w is updated in an online fashion. W.l.o.g. (almost), we can write the generic structured learning algorithm as follows; what follows is a structured Perceptron, but with minor variations the same procedure applies to CRFs and structured linear SVMs.

Slide 11: Structured Prediction: Learning Algorithm
For each example (x_i, y_i), with the current weight vector w:
- Predict: perform inference with the current weight vector, ŷ_i = argmax_{y ∈ Y} w^T φ(x_i, y).
- Check the learning constraints: is the score of the current prediction better than the score of (x_i, y_i)?
- If yes (a mistaken prediction): update w. Otherwise, no update is needed on this example.
Note: in the structured case, the prediction (inference) step is often intractable and needs to be done many times.
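A minimal sketch of this loop as a structured Perceptron (my own code, not the talk's); `argmax_inference` stands in for whatever decoder is available (Viterbi, an ILP solver, etc.) and `phi` is the joint feature map.

```python
import numpy as np

def structured_perceptron(examples, phi, argmax_inference, dim, epochs=5, lr=1.0):
    """examples: list of (x, y_gold); phi(x, y) -> np.ndarray of size dim;
    argmax_inference(x, w) -> highest-scoring structure under weights w."""
    w = np.zeros(dim)
    for _ in range(epochs):
        for x, y_gold in examples:
            y_hat = argmax_inference(x, w)       # inference with the current w
            if y_hat != y_gold:                  # mistaken prediction
                # Move w toward the gold structure and away from the prediction.
                w += lr * (phi(x, y_gold) - phi(x, y_hat))
    return w
```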

Slide 12: Learning Algorithm, Solution I: decompose the scoring function into EASY and HARD parts
- The prediction step becomes ŷ_i = argmax_{y ∈ Y} w_EASY^T φ_EASY(x_i, y) + w_HARD^T φ_HARD(x_i, y); the rest of the loop (check the learning constraint, update w on mistakes) is unchanged.
- EASY could be feature functions that correspond to an HMM, a linear CRF, or a bank of classifiers (omitting the dependence on y at learning time).
- This may not be enough if the HARD part is still part of each inference step.

Slide 13: Learning Algorithm, Solution II: disregard some of the dependencies and assume a simple model; the training loop of slide 11 is then run over the simplified scoring function.

Slide 14: Learning Algorithm, Solution III: disregard some of the dependencies during learning and take them into account only at decision time. That is, train with ŷ_i = argmax_{y ∈ Y} w_EASY^T φ_EASY(x_i, y), but predict at test time with ŷ = argmax_{y ∈ Y} w_EASY^T φ_EASY(x, y) + w_HARD^T φ_HARD(x, y). This is the most commonly used solution in NLP today.
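A sketch of this learning-plus-inference split (my own helper names, not the talk's code): the local models are trained on the EASY features alone, and the HARD knowledge enters only through a constrained decoder at prediction time.

```python
import numpy as np

def train_local_only(examples, phi_easy, argmax_easy, dim, epochs=5):
    """Learn w while ignoring the HARD dependencies, e.g., with a per-token
    classifier or an unconstrained chain model as argmax_easy."""
    w = np.zeros(dim)
    for _ in range(epochs):
        for x, y_gold in examples:
            y_hat = argmax_easy(x, w)            # cheap, unconstrained inference
            if y_hat != y_gold:
                w += phi_easy(x, y_gold) - phi_easy(x, y_hat)
    return w

def predict_with_knowledge(x, w, constrained_argmax):
    """At decision time, decode under the declarative constraints, e.g. by
    wrapping an ILP solver like the earlier PuLP sketch."""
    return constrained_argmax(x, w)
```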

Slide 15: Examples: CCM Formulations
- Formulate NLP problems as ILP problems (inference may also be done otherwise):
  1. Sequence tagging: HMM/CRF + global constraints.
  2. Sentence compression/summarization: language model + global constraints.
  3. SRL: independent classifiers + global constraints.
- Sequential prediction, HMM/CRF based: argmax Σ λ_ij x_ij. Example constraint: cannot have both A states and B states in an output sequence.
- Sentence compression, language-model based: argmax Σ λ_ijk x_ijk. Example linguistic constraints: if a modifier is chosen, include its head; if a verb is chosen, include its arguments.
- The (soft) constraints component is more general, since constraints can be declarative, non-grounded statements.
- CCMs can be viewed as a general interface for easily combining declarative domain knowledge with data-driven statistical models.
- Constrained Conditional Models allow learning a simple model (or multiple models, or pipelines) while making decisions with a more complex model. This is accomplished by directly incorporating constraints to bias/re-rank global decisions composed of the simpler models' decisions.
- More sophisticated algorithmic approaches exist to bias the output [CoDL: Chang et al. '07, '12; PR: Ganchev et al. '10; UEM: Samdani et al. '12].

Slide 16: Outline (recap)
- Constrained Conditional Models: a formulation for global inference with knowledge modeled as expressive structural constraints; a structured prediction perspective.
- Decomposed Learning (DecL): efficient structure learning by reducing the learning-time inference to a small output space; conditions under which DecL is provably identical to global structural learning (GL).

Slide 17: Training Constrained Conditional Models
- Training options: independently of the constraints (L+I); jointly, in the presence of the constraints (IBT, GL); or decomposed into simpler models. One can decompose the model itself, or decompose the model from the constraints.
- Not surprisingly, decomposition is good; see [Chang et al., Machine Learning Journal 2012].
- However, little can be said theoretically about the quality/generalization of predictions made with a decomposed model.
- Next: an algorithmic approach to decomposition that is both good and comes with interesting guarantees.

Slide 18: Decomposed Structured Prediction
- In Global Learning, the output space is exponential in the number of variables, so accurate learning can be intractable.
- Standard ways to decompose it forget some of the structure and bring it back only at decision time.
- Learning is driven by the attempt to find a weight vector w such that for each annotated example (x_i, y_i):
  w^T φ(x_i, y_i) ≥ w^T φ(x_i, y) + Δ(y, y_i), for all y,
  followed at decision time by inference y* = argmax_{y ∈ Y} w^T φ(x, y).
- [Figure: entity-relation graph over output variables y1, ..., y6 with entity weights w_e and relation weights w_r, used for both learning and inference.]

Slide 19: Decomposed Structural Learning (DecL) [Samdani & Roth, ICML 2012]
- Algorithm: restrict the argmax inference to a small subset of the output variables while fixing the remaining variables to their ground-truth values in y_j, and repeat for different subsets of the output variables: a decomposition.
- The resulting set of assignments considered for example j is called a neighborhood, nbr(y_j).
- Key contribution: conditions under which DecL is provably equivalent to Global Learning (GL), together with experiments showing that DecL gives results close to GL even when those conditions do not exactly hold.
- Related work: pseudolikelihood [Besag, '77]; piecewise pseudolikelihood [Sutton & McCallum, '07]; pseudomax [Sontag et al., '10].
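A sketch of the restricted inference step (hypothetical helper names, not the talk's code): for one decomposition set S, every variable outside S is clamped to its gold value and only the assignments to S are enumerated.

```python
from itertools import product

def restricted_argmax(x, y_gold, S, labels, score, w):
    """Search only over assignments that differ from y_gold on the index set S.

    y_gold: list of gold labels; S: indices allowed to vary; labels: label set;
    score(x, y, w): the learned scoring function, e.g. w^T phi(x, y)."""
    best_y, best_s = None, float("-inf")
    for assignment in product(labels, repeat=len(S)):
        y = list(y_gold)
        for idx, lab in zip(S, assignment):
            y[idx] = lab                    # vary only the variables in S
        s = score(x, y, w)
        if s > best_s:
            best_y, best_s = y, s
    return best_y
```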

Slide 20: DecL vs. Global Learning (GL)
- GL: separate the ground truth from every alternative, ∀ y ∈ Y; for the six binary outputs in the running example, that is 2^6 = 64 outputs.
- DecL: separate the ground truth only from ∀ y ∈ nbr(y_j); in the example, 16 outputs.
- Likely scenario: |nbr(y_j)| ≪ |Y|. In both cases the learning constraint is w^T φ(x_j, y_j) ≥ w^T φ(x_j, y) + Δ(y, y_j).
- What are good neighborhoods?

Slide 21: Creating Decompositions
- DecL allows different decompositions S_j for different training instances y_j.
- Example, DecL-k: learning with decompositions in which all subsets of size k are considered. For each k-subset of the variables, enumerate its assignments while keeping the remaining n - k variables at their gold values.
- k = 1 is Pseudomax [Sontag et al., 2010]; k = 2 corresponds to Constraint Classification [Har-Peled, Zimak & Roth, 2002; Crammer & Singer, 2002].
- In practice, neighborhoods should be determined based on domain knowledge: put highly coupled variables in the same set.
- The goal is to get results close to exact inference. Are there small and good neighborhoods?
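Continuing the restricted_argmax sketch above, DecL-k can be expressed by enumerating all k-subsets of the output variables and updating on any violating assignment found in them (again a sketch with hypothetical helper names; a perceptron-style update and a numpy weight vector are assumed).

```python
from itertools import combinations

def decl_k_update(x, y_gold, k, labels, score, phi, w, lr=1.0):
    """One DecL-k pass over a single example: for every k-subset S of the output
    variables, run the restricted argmax and update w on mistakes."""
    n = len(y_gold)
    for S in combinations(range(n), k):
        y_hat = restricted_argmax(x, y_gold, list(S), labels, score, w)
        if y_hat != list(y_gold):
            # Perceptron-style update toward the gold structure.
            w += lr * (phi(x, list(y_gold)) - phi(x, y_hat))
    return w
```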

Slide 22: Exactness of DecL
- Key result: yes. Under reasonable conditions, DecL with small neighborhoods nbr(y_j) gives the same results as Global Learning.
- For analyzing the equivalence between DecL and GL, we need a notion of separability of the data: the existence of weights that separate the score of the ground truth y_j from the score of all non-ground-truth y.
- Separating weights for GL: W* = { w | w · φ(x_j, y_j) ≥ w · φ(x_j, y) + Δ(y_j, y), ∀ y ∈ Y }.
- Separating weights for DecL: W_decl = { w | w · φ(x_j, y_j) ≥ w · φ(x_j, y) + Δ(y_j, y), ∀ y ∈ nbr(y_j) }.
- Naturally, W* ⊆ W_decl.
- Exactness result: the set of separating weights for DecL equals the set of separating weights for GL: W* = W_decl.

Slide 23: Example of Exactness: Pairwise Markov Networks
- The scoring function is defined over a graph with edges E, as a sum of singleton/vertex components and pairwise/edge components: score(x, y; w) = Σ_i φ_i(y_i; w) + Σ_{(i,k) ∈ E} φ_{i,k}(y_i, y_k; w).
- Assume domain knowledge about W*: for a correct (separating) w ∈ W*, we know which of the pairwise potentials φ_{i,k}(·; w) are
  submodular: φ_{i,k}(0,0) + φ_{i,k}(1,1) > φ_{i,k}(0,1) + φ_{i,k}(1,0), and which are
  supermodular: φ_{i,k}(0,0) + φ_{i,k}(1,1) < φ_{i,k}(0,1) + φ_{i,k}(1,0).

Slide 24: Decomposition for Pairwise Markov Networks
- For an example (x_j, y_j), define E_j by removing from E the edges whose gold labels disagree with the corresponding potentials (i.e., a submodular edge is dropped when its endpoints have different gold labels, and a supermodular edge is dropped when its endpoints have equal gold labels; the remaining edges are kept).
- Theorem: decomposing the variables according to the connected components of (V, E_j) yields exactness.
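A sketch of that edge-pruning rule and the resulting decomposition (my own reading of the slide, with hypothetical names; `is_submodular` encodes the assumed domain knowledge about W*).

```python
from collections import defaultdict

def decompose_pairwise_mn(n, edges, is_submodular, y_gold):
    """Return the connected components of (V, E_j) as the DecL decomposition.

    edges: list of (i, k) pairs; is_submodular[(i, k)]: True if the pairwise
    potential prefers equal labels (submodular), False if supermodular;
    y_gold: gold binary labels. An edge is kept only when the gold labels
    agree with the potential's preference."""
    adj = defaultdict(list)
    for (i, k) in edges:
        agree = (y_gold[i] == y_gold[k])
        if (is_submodular[(i, k)] and agree) or (not is_submodular[(i, k)] and not agree):
            adj[i].append(k)
            adj[k].append(i)
    # Connected components via DFS; each component is one decomposition set.
    seen, components = set(), []
    for v in range(n):
        if v in seen:
            continue
        stack, comp = [v], []
        seen.add(v)
        while stack:
            u = stack.pop()
            comp.append(u)
            for nb in adj[u]:
                if nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        components.append(sorted(comp))
    return components
```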

Slide 25: Experiments: Information Extraction
- Input citation: "Lars Ole Andersen. Program analysis and specialization for the C Programming language. PhD thesis. DIKU, University of Copenhagen, May."
- Prediction result of a trained HMM over the fields [AUTHOR], [TITLE], [EDITOR], [BOOKTITLE], [TECH-REPORT], [INSTITUTION], [DATE]: the HMM's segmentation (shown on the slide) violates lots of natural constraints!

Slide 26: Adding Expressivity via Constraints
- Each field must be a consecutive list of words and can appear at most once in a citation.
- State transitions must occur on punctuation marks.
- The citation can only start with AUTHOR or EDITOR.
- The words "pp." and "pages" correspond to PAGE.
- Four-digit numbers starting with 20xx or 19xx are DATE.
- Quotations can appear only in TITLE.
- ...
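To make the connection to the ILP view concrete, here is a sketch (not from the talk; PuLP again, with made-up variable names) of how some of these constraints translate into linear constraints over binary indicators x[(t, f)] meaning "token t gets field f" in a CCM decoder.

```python
# Assumes binary PuLP variables x[(t, f)] for token index t and field name f,
# already added to an LpProblem `prob` along with a "one field per token"
# constraint, as in the earlier inference sketch.

FIELDS = ["AUTHOR", "EDITOR", "TITLE", "DATE", "PAGE"]   # illustrative subset

def add_citation_constraints(prob, x, tokens):
    # "The citation can only start with AUTHOR or EDITOR."
    prob += x[(0, "AUTHOR")] + x[(0, "EDITOR")] == 1

    # "The words pp., pages correspond to PAGE."
    for t, tok in enumerate(tokens):
        if tok in ("pp.", "pages"):
            prob += x[(t, "PAGE")] == 1

    # "Four-digit numbers starting with 20xx or 19xx are DATE."
    for t, tok in enumerate(tokens):
        if len(tok) == 4 and tok.isdigit() and tok[:2] in ("19", "20"):
            prob += x[(t, "DATE")] == 1
```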

Slide 27: Information Extraction with Constraints
- Adding the constraints, we get the correct result: [AUTHOR] Lars Ole Andersen. [TITLE] Program analysis and specialization for the C Programming language. [TECH-REPORT] PhD thesis. [INSTITUTION] DIKU, University of Copenhagen, [DATE] May.
- Experimental goal: investigate DecL with small neighborhoods. Note that the required theoretical conditions hold only approximately: output tokens tend to appear in contiguous blocks. We use neighborhoods similar to the pairwise Markov network construction above.

Slide 28: Typical Results: Information Extraction (Ads Data). [Figure: F1 scores.]

Slide 29: Typical Results: Information Extraction (Ads Data). [Figures: F1 scores and time taken to train, in minutes.]

Slide 30: Conclusion
- Presented Constrained Conditional Models: an ILP formulation for structured prediction that augments statistically learned models with declarative constraints, as a way to incorporate knowledge and support decisions in expressive output spaces. CCMs support joint inference while maintaining the modularity and tractability of training: interdependent components are learned (independently or pipelined) and, via joint inference, support coherent decisions, modulo the declarative constraints.
- Presented Decomposed Learning (DecL): efficient joint learning by reducing the learning-time inference to a small output space, with conditions under which DecL is provably identical to global structural learning (GL).
- Interesting open questions remain in developing a further understanding of how to support efficient joint inference.
- Thank you! Check out our tools, demos, and tutorials.

