1 Efficient Decomposed Learning for Structured Prediction
Rajhans Samdani, joint work with Dan Roth
University of Illinois at Urbana-Champaign

2 Structured Prediction
Structured prediction: predicting a structured output variable y based on an input variable x, where y = {y_1, y_2, …, y_n} and the variables form a structure. The structure comes from interactions between the output variables through mutual correlations and constraints. Such problems occur frequently in:
- NLP – e.g. predicting the tree-structured parse of a sentence, or the entity-relation structure of a document
- Computer vision – scene segmentation, body-part identification
- Speech processing – capturing relations between phonemes
- Computational biology – protein folding and interactions between different sub-structures
- Etc.

3 Example Problem: Information Extraction
Given citation text, extract author, booktitle, title, etc.:
  Marc Shapiro and Susan Horwitz. Fast and accurate flow-insensitive points-to analysis. In Proceedings of the 24th Annual ACM Symposium on Principles of Programming Languages…
Given ad text, extract features, size, neighborhood, etc.:
  Spacious 1 bedroom apt. newly remodeled, includes dishwasher, gated community, near subway lines …
Structure is introduced by correlations between words, e.g. if treated as sequence tagging. Structure is also introduced by declarative constraints that define the set of feasible assignments, e.g.:
- The 'author' tokens are likely to appear together in a single block
- A paper should have at most one 'title'

4 Example Problem: Body-Part Identification
Count the number of people; predict the body parts. Correlations:
- Positions of shoulders and heads are correlated
- Positions of torso and legs are correlated

5 Structured Prediction: Inference
Predict the variables in y = {y_1, y_2, …, y_n} ∈ Y together to leverage the dependencies between them (e.g. entity-relation, shoulders-head, information fields, document labels). Inference constitutes predicting the best-scoring structure (a code sketch follows below): y* = argmax_{y ∈ Y} f(x, y), where
- f(x, y) = w·Φ(x, y) is the scoring function
- w are the weight parameters (to be estimated during learning)
- Φ(x, y) are features on the input-output pair
- The set of allowed structures Y is often specified by constraints
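As a rough illustration (not from the slides), here is a minimal Python sketch of inference with a linear scoring function; `phi` and `satisfies_constraints` are hypothetical hooks, and the brute-force enumeration is only viable for very small n.

```python
import itertools
import numpy as np

def score(w, phi, x, y):
    # Linear scoring function f(x, y) = w . Phi(x, y).
    return np.dot(w, phi(x, y))

def exact_inference(w, phi, x, feasible_ys):
    # Predict the best-scoring structure among the allowed (feasible) outputs.
    return max(feasible_ys, key=lambda y: score(w, phi, x, y))

def enumerate_feasible(n, satisfies_constraints):
    # Brute-force enumeration of all binary assignments of length n that satisfy
    # the constraints; exponential in n, but it makes the argmax over Y explicit.
    for bits in itertools.product([0, 1], repeat=n):
        if satisfies_constraints(bits):
            yield bits
```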

6 Structural Learning: Quick Overview
Consider a big monolithic structured prediction problem. Given labeled data pairs (x_j, y_j = {y_j1, y_j2, …, y_jn}), how do we learn w and perform inference?
[Figure: a graph over output variables y_1–y_6]

7 Learning w: Two Extreme Styles
- Global Learning (GL): consider all the variables together (Collins '02; Taskar et al. '04; Tsochantaridis et al. '04). Expensive.
- Local Learning (LL): ignore hard-to-learn structural aspects, e.g. global constraints, or consider the variables in isolation (Punyakanok et al. '05; Roth and Yih '05; Koo et al. '10…). Inconsistent.
- LL+C: apply constraints, if available, only at test-time inference.

8 Our Contribution: Decomposed Learning
We consider learning with subsets of the output variables at a time. We give conditions under which this decomposed learning is actually identical to global learning, and exhibit the advantage of our learning paradigm experimentally.
[Figure: enumerating assignments to small subsets of the variables y_1–y_6]
Related work: Pseudolikelihood – Besag '77; Piecewise Pseudolikelihood – Sutton and McCallum '07; Pseudomax – Sontag et al. '10

9 Efficient Decomposed Learning (DecL)
This work: learning by decomposing the task into smaller components.
- Previous work: bottom-up learning – learning all the tasks independently of each other and piecing them together during inference (a large body of literature).
- This work: top-down learning – decomposing a joint/monolithic learning task cleverly into smaller, but not independent, components, and providing realistic conditions under which Decomposed Learning is actually identical to Global Learning.
- Related work: Pseudolikelihood – Besag '77; Piecewise Pseudolikelihood – Sutton and McCallum '07; Pseudomax – Sontag et al. '10

10 Outline
- Existing global structural learning algorithms
- Decomposed Learning (DecL): efficient structural learning – intuition, formalization
- Theoretical properties of DecL
- Experimental evaluation

11 Supervised Structural Learning
We focus on structural-SVM-style algorithms, which learn w by minimizing a regularized structured hinge loss (Taskar et al. '04; Tsochantaridis et al. '04). The structured hinge loss compares the score of the ground truth y_j against the score of every non-ground-truth y plus a loss-based margin Δ(y_j, y); computing it requires global (loss-augmented) inference over all the variables:
  L_j(w) = max_{y ∈ Y} [ f(x_j, y; w) + Δ(y_j, y) ] − f(x_j, y_j; w)
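Below is a small sketch of this loss under the definitions above, assuming a Hamming-distance margin; `phi` and `feasible_ys` are placeholders rather than the paper's actual feature functions.

```python
import numpy as np

def hamming(y_gold, y):
    # A common choice for the loss-based margin Delta(y_gold, y).
    return float(np.sum(np.asarray(y_gold) != np.asarray(y)))

def structured_hinge_loss(w, phi, x, y_gold, feasible_ys, delta=hamming):
    # L_j(w) = max_{y in Y} [ f(x, y; w) + Delta(y_gold, y) ] - f(x, y_gold; w),
    # clipped at zero; the max is the global loss-augmented inference step.
    gold_score = np.dot(w, phi(x, y_gold))
    augmented = max(np.dot(w, phi(x, y)) + delta(y_gold, y) for y in feasible_ys)
    return max(0.0, augmented - gold_score)
```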

12 Simple Structural SVM Algorithm
Repeat: run exact (loss-augmented) inference over all the variables, then update w. Because exact global inference is used as an intermediate step, we call this Global Learning (GL).
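The following is an illustrative sketch of that loop using stochastic subgradient updates, not the exact structural SVM solvers cited above; all names and hyperparameters are placeholders.

```python
import numpy as np

def global_learning_sgd(data, phi, dim, feasible_ys, delta, lr=0.1, reg=0.01, epochs=10):
    # Stochastic subgradient descent on the regularized structured hinge loss:
    # each step runs exact loss-augmented inference over the full output space,
    # then updates w -- the "Exact Inference -> Update" loop of Global Learning.
    w = np.zeros(dim)
    for _ in range(epochs):
        for x, y_gold in data:
            y_hat = max(feasible_ys(x),
                        key=lambda y: np.dot(w, phi(x, y)) + delta(y_gold, y))
            w = (1.0 - lr * reg) * w  # shrink for the L2 regularizer
            if np.dot(w, phi(x, y_hat)) + delta(y_gold, y_hat) > np.dot(w, phi(x, y_gold)):
                # Margin violated: move toward the gold features, away from y_hat.
                w = w + lr * (phi(x, y_gold) - phi(x, y_hat))
    return w
```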

13 Limitations of Global Learning
Exact global inference is required as an intermediate step, but expressive models do not admit exact and efficient (poly-time) inference algorithms, e.g.
- HMMs with global constraints
- arbitrary pairwise Markov networks
Hence Global Learning is expensive for expressive features Φ(x, y) and constraints (y ∈ Y). The problem is using inference as a black box during learning. Our proposal: change the inference-during-learning to inference over a smaller output space – decomposed inference for learning.

14 Approximate Learning
Exact inference may be overkill for several learning applications. One possible alternative: use approximate inference as a black box (Kulesza and Pereira '07; Finley and Joachims '08), e.g. LP relaxations or belief propagation. Our proposal: instead of approximating the entire inference during learning, do exact inference over a smaller output space.

15 Outline
- Existing structural learning algorithms
- Decomposed Learning (DecL): efficient structural learning – intuition, formalization
- Theoretical properties of DecL
- Experimental evaluation

16 Decomposed Structural Learning (DecL)
General idea: for each training example (x_j, y_j), reduce the argmax inference from the intractable output space Y to a "neighborhood" around y_j: nbr(y_j) ⊆ Y. A small and tractable nbr(y_j) ⇒ efficient learning. Use domain knowledge to create neighborhoods which preserve the structure of the problem.
[Figure: nbr(y_j) ⊆ Y ⊆ {0,1}^n for n binary outputs in y]

17 Neighborhoods via Decompositions
Generate nbr(y_j) by varying a subset of the output variables while fixing the rest to their gold labels in y_j, and repeat for different subsets of the output variables (see the sketch after this slide). A decomposition is a collection of different (non-inclusive, possibly overlapping) sets of variables which vary together:
  S_j = { s_1, …, s_l | ∀i, s_i ⊆ {1, …, n}; ∀i,k, s_i ⊄ s_k }
Inference can be exponential in the size of the sets:
- Smaller set sizes yield efficient learning
- Under some conditions, DecL with smaller set sizes is identical to Global Learning
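A small sketch, assuming binary output variables as in the earlier slides, of the DecL-k decomposition and the neighborhood it induces; `satisfies_constraints` is again a hypothetical hook.

```python
from itertools import combinations, product

def decl_k_decomposition(n, k):
    # DecL-k: the decomposition containing every subset of k output variables
    # (DecL-1 varies one variable at a time, i.e. the Pseudomax setting).
    return [frozenset(s) for s in combinations(range(n), k)]

def neighborhood(y_gold, decomposition, satisfies_constraints):
    # nbr(y_gold): every feasible output reachable by re-assigning the variables
    # in one set of the decomposition while fixing the rest to their gold labels.
    nbr = set()
    for s in decomposition:
        idx = sorted(s)
        for values in product([0, 1], repeat=len(idx)):
            y = list(y_gold)
            for i, v in zip(idx, values):
                y[i] = v
            if satisfies_constraints(y):
                nbr.add(tuple(y))
    return nbr
```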

18 Decomposed Inference Explained
Instead of inference over the entire output space, DecL performs inference only over the outputs that can be obtained by "wiggling" variables in the subsets of S_j (the decomposition for training instance j):
- Pick a set s from the decomposition
- Consider the different possible instantiations of the variables in s, with the variables outside the set fixed to their ground-truth values
- Keep only instantiations that, combined with the remaining ground-truth variables, give a feasible output, i.e. satisfy the constraints
- Take a MAX over all such outputs (a sketch follows below)
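A sketch of this decomposed, loss-augmented inference for binary variables, reusing the same placeholder hooks as before.

```python
import numpy as np
from itertools import product

def decomposed_loss_augmented_inference(w, phi, x, y_gold, decomposition,
                                        satisfies_constraints, delta):
    # Loss-augmented inference restricted to nbr(y_gold): for each set s in the
    # decomposition S_j, enumerate assignments to the variables in s, keep the
    # remaining variables fixed to their gold values, discard infeasible outputs,
    # and take the max. Cost is exponential only in the largest set size, not in n.
    best_y, best_val = tuple(y_gold), -np.inf
    for s in decomposition:
        idx = sorted(s)
        for values in product([0, 1], repeat=len(idx)):
            y = list(y_gold)
            for i, v in zip(idx, values):
                y[i] = v
            if not satisfies_constraints(y):
                continue
            val = np.dot(w, phi(x, y)) + delta(y_gold, y)
            if val > best_val:
                best_y, best_val = tuple(y), val
    return best_y
```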

19 Creating Decompositions
We allow different decompositions S_j for different training instances y_j. We aim to get results close to exact inference, so we need decompositions that yield exactness (next few slides).
Example: DecL-k – learning with the decomposition in which all subsets of size k are considered. DecL-1 is the same as Pseudomax (Sontag et al. '10), which is similar to Pseudolikelihood (Besag '77) learning.
In practice, decompositions should be based on domain knowledge: put highly coupled variables in the same set.

20 Outline
- Existing structural learning algorithms
- DecL: efficient decomposed structural learning – intuition, formalization
- Theoretical results: exactness
- Experimental evaluation

21 Theoretical Results: Assume Separability
Ideally we want Decomposed Learning with small-set decompositions to give the same results as Global Learning. To analyze the equivalence between DecL and GL, we assume that the training data is separable. Separability is the existence of a set of weights W* that score the ground truth y_j above every non-ground-truth y by a loss-based margin:
  W* = { w* | f(x_j, y_j; w*) ≥ f(x_j, y; w*) + Δ(y_j, y), ∀y ∈ Y }
The separating weights for DecL are defined analogously, but only over the neighborhoods:
  W_decl = { w* | f(x_j, y_j; w*) ≥ f(x_j, y; w*) + Δ(y_j, y), ∀y ∈ nbr(y_j) }
Naturally, W* ⊆ W_decl.

22 Theoretical Results: Exactness
The property we desire is exactness: W_decl = W*, as a property of the constraints, the ground truth y_j, and the globally separating weights W*. This is different from the asymptotic consistency results for Pseudolikelihood/Pseudomax, and it is much more useful: learning with DecL yields the same weights as GL. The main theorem in the paper provides a general exactness condition.

23 Main Theorem
Theorem 1. DecL is exact if ∀w ∈ W*, ∃ε > 0 such that ∀w′ ∈ B(w, ε), ∀(x_j, y_j) ∈ D:
  if ∃y ∈ Y such that f(x_j, y; w′) + Δ(y_j, y) > f(x_j, y_j; w′),
  then ∃y′ ∈ nbr(y_j) with f(x_j, y′; w′) + Δ(y_j, y′) > f(x_j, y_j; w′).
In words: for weights near W*, global inseparability must imply DecL inseparability. Details are in the paper; example corollaries of this theorem are on the next slides.

24 Example Exactness: Linear Scoring Functions with Constraints
Consider a singleton scoring function, structured only via constraints. With relatively symmetric constraints on Y, it is possible to show exactness for decompositions with set sizes independent of the number of variables n.
Theorem: If Y is specified by k OR constraints, then DecL-(k+1) is exact.

25 Example Consequence of Exactness for OR Constraints
Example application of the theorem: when Y is specified by k horn clauses
  y_{1,1} ∧ y_{1,2} ∧ … ∧ y_{1,r} → y_{1,r+1},
  y_{2,1} ∧ y_{2,2} ∧ … ∧ y_{2,r} → y_{2,r+1},
  …
  y_{k,1} ∧ y_{k,2} ∧ … ∧ y_{k,r} → y_{k,r+1},
DecL-(k+1) is exact. We need only decompositions with set sizes independent of the number of variables r in the constraints.

26 One Example of Exactness: Pairwise Markov Networks
The scoring function is defined over a graph with edges E and has singleton/vertex components and pairwise/edge components. Assume domain knowledge on W*: we know that for the correct (separating) w, each edge potential φ_{i,k}(·; w) is either
- Submodular: φ_{i,k}(0,0) + φ_{i,k}(1,1) > φ_{i,k}(0,1) + φ_{i,k}(1,0), or
- Supermodular: φ_{i,k}(0,0) + φ_{i,k}(1,1) < φ_{i,k}(0,1) + φ_{i,k}(1,0).
(A small check of these conditions is sketched below.)
[Figure: a pairwise Markov network over y_1–y_6]
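These two conditions transcribe directly into checks on a 2×2 edge-potential table; the snippet below is illustrative and assumes binary labels.

```python
def is_submodular(phi_ik):
    # phi_ik: 2x2 table of pairwise scores for edge (i, k) over binary labels.
    # Submodular:   phi(0,0) + phi(1,1) > phi(0,1) + phi(1,0)
    return phi_ik[0][0] + phi_ik[1][1] > phi_ik[0][1] + phi_ik[1][0]

def is_supermodular(phi_ik):
    # Supermodular: phi(0,0) + phi(1,1) < phi(0,1) + phi(1,0)
    return phi_ik[0][0] + phi_ik[1][1] < phi_ik[0][1] + phi_ik[1][0]
```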

27 Decomposition for PMNs
Define a subgraph E_j ⊆ E for each training instance, selecting edges according to whether their potentials are submodular or supermodular and to the gold labels of their endpoints.
Theorem: the decomposition S_pair, consisting of the connected components of E_j, yields exactness (a connected-components sketch follows below).
[Table: membership of an edge in E_j as a function of sub(φ)/sup(φ) and the endpoint gold labels]
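A minimal union-find sketch for turning a given edge set E_j into the connected-component sets of S_pair; how the edges of E_j are selected is assumed to be supplied by the rule in the paper's table.

```python
def connected_components(nodes, edges):
    # Union-find over the subgraph E_j; each connected component becomes one set
    # of the S_pair decomposition, so strongly coupled variables vary together.
    parent = {v: v for v in nodes}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for i, k in edges:
        parent[find(i)] = find(k)

    components = {}
    for v in nodes:
        components.setdefault(find(v), set()).add(v)
    return list(components.values())

# Example: edges E_j over 6 variables -> sets of the S_pair decomposition.
# connected_components(range(6), [(0, 1), (1, 2), (4, 5)])
# -> [{0, 1, 2}, {3}, {4, 5}]
```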

28 Outline
- Existing structural learning algorithms
- DecL: efficient decomposed structural learning – intuition, formalization
- Theoretical properties of DecL
- Experimental evaluation

29 Experiments
We experimentally compare Decomposed Learning (DecL) to:
- Global Learning (GL)
- Local Learning (LL)
- Local Learning + Constraints, applied only during test-time inference when available (LL+C)
We also study the robustness of DecL in conditions where our theoretical assumptions may not hold.

30 Synthetic Experiments
Experiments on random synthetic data with 10 binary variables; labels are assigned with random singleton scoring functions and random linear constraints. Compared: Local Learning (LL) baselines, Global Learning (GL), Decomposed Learning DecL-2 and DecL-3, and DecL-1 (a.k.a. Pseudomax).
[Plot: average Hamming loss vs. number of training examples]

31 Multi-label Document Classification
Experiments on multi-label document classification: documents with multiple labels (corn, crude, earn, grain, interest, …). Modeled as a pairwise Markov network over a complete graph over all the labels, with singleton and pairwise components. LL is a local learning baseline that ignores the pairwise interactions.

32 Results: Per-Instance F1 and Training Time (hours)
[Plots: F1 scores and time taken to train, in hours]


34 Example Problem: Information Extraction
Given citation text, extract author, booktitle, title, etc.:
  Marc Shapiro and Susan Horwitz. Fast and accurate flow-insensitive points-to analysis. In Proceedings of the 24th Annual ACM Symposium on Principles of Programming Languages…
Given ad text, extract features, size, neighborhood, etc.:
  Spacious 1 bedroom apt. newly remodeled, includes dishwasher, gated community, near subway lines …
Constraints like:
- The 'title' tokens are likely to appear together in a single block
- A paper should have at most one 'title'

35 Information Extraction: Modeling
Modeled as an HMM with additional constraints; the constraints make inference in the HMM hard. Local Learning (LL) in this case is the HMM with no constraints.
Domain knowledge: the HMM transition matrix is likely to be diagonal-heavy – a generalization of the submodular pairwise potentials used for pairwise Markov networks – so we use the decomposition S_pair (one reading of the diagonal-heavy condition is sketched below).
Bottom line: DecL is 2 to 8 times faster than GL and gives the same accuracies.
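The check below encodes one plausible reading of the diagonal-heavy condition; this interpretation is an assumption, not a definition taken from the slides.

```python
import numpy as np

def is_diagonal_heavy(T):
    # One plausible reading of "diagonal heavy": for every pair of states, staying
    # in the same state scores at least as well as switching between the two, i.e.
    # T[i, i] + T[k, k] >= T[i, k] + T[k, i] -- the sequence-model analogue of the
    # submodularity condition used for pairwise Markov networks.
    n = T.shape[0]
    return all(T[i, i] + T[k, k] >= T[i, k] + T[k, i]
               for i in range(n) for k in range(n) if i != k)
```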

36 Citation Information Extraction: Accuracy and Training Time
[Plots: F1 scores and time taken to train, in hours]

37 Ads Information Extraction: Accuracy and Training Time
[Plots: F1 scores and time taken to train, in hours]

38 Take Home: Efficient Structural Learning with DecL
We presented Decomposed Learning (DecL): efficient learning by reducing inference to a small output space.
- Exactness: we provided conditions under which DecL is provably identical to global structural learning (GL).
- Experiments: DecL performs as well as GL on real-world data at significantly lower cost (50%–90% reduction in training time).
Questions?

