# Learning for Structured Prediction: Overview of the Material


## Outline

- Types of structures considered
- Generative vs discriminative
- Global discriminative vs local discriminative
- Decoding:
  - at testing vs at learning
  - methods for decoding
- Predefined features vs latent features
- Red italics in the slides mark illustrative example methods; some points are oversimplified

## Types of Structures

- Sequences:
  - chain CRFs, HMMs, (chain-type) M3Ns, ...
- Trees:
  - Constituency trees: weighted CFGs (including LA-PCFGs), left-corner / shift-reduce parsers (the MaxEnt parser, the ISBN parser, ...)
  - Dependency structures: the MST parser, Nivre's shift-reduce parser, ...
- Rankings:
  - PRank (today)
- Not considered: DAGs (e.g., some semantic representations), bipartite graphs (machine translation), or more general graphs ...

## Generative vs Discriminative

- Discriminative: CRFs, MEMMs, structured perceptron, Max-Margin Markov Networks (M3Ns), ...
  - Learn a mapping from x to y so that the expected error is minimal
  - Pros:
    - model what you actually care about
    - complex features of x are easy to integrate
    - different error functions can be considered
    - fewer assumptions (and therefore better asymptotic performance)
- Generative: HMMs, PCFGs (including LA-PCFGs), ...
  - Score how likely the combination of input and output is, i.e., model P(x, y)
  - Pros:
    - easier to learn (if everything is observable, the ML parameters are just normalized counts; see the sketch after this list)
    - "cleaner" semi-supervised learning: choose parameters to maximize the marginal likelihood P(x)
    - often better with small datasets
    - some applications care about P(x | y) (speech recognition, statistical machine translation, ...)
    - arguably preferable with latent variables
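A minimal sketch of the "ML parameters are normalized counts" point, assuming a fully observed HMM tagger; the function name and data layout are illustrative, not from the slides:

```python
from collections import Counter, defaultdict

def estimate_hmm(tagged_sentences):
    """Maximum-likelihood HMM estimation: transition and emission probabilities
    are just counts normalized per conditioning context."""
    trans, emit = defaultdict(Counter), defaultdict(Counter)
    for sent in tagged_sentences:            # sent = [(word, tag), ...]
        prev = "<s>"
        for word, tag in sent:
            trans[prev][tag] += 1            # count tag-to-tag transitions
            emit[tag][word] += 1             # count tag-to-word emissions
            prev = tag
        trans[prev]["</s>"] += 1
    normalize = lambda c: {k: v / sum(c.values()) for k, v in c.items()}
    return ({p: normalize(c) for p, c in trans.items()},
            {t: normalize(c) for t, c in emit.items()})

# Usage: trans_probs, emit_probs = estimate_hmm([[("the", "DT"), ("dog", "NN")]])
```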

## Global Discriminative vs Local Discriminative

- Local (distributions over small decisions): MEMMs, the SVM decision classifiers in Nivre's shift-reduce parser
  - Pros:
    - no real decoding at training time (cheap learning)
    - complex features of y (the partial structure built so far) can be integrated easily (this concerns training only; you still need to decode at testing)
  - Cons:
    - mismatch between train and test modes: the model relies on true history features in training and on predicted ones in testing (see the sketch after this list)
    - label bias: the model cannot sufficiently penalize an unlikely transition if the source state has too few outgoing transitions
- Global (distributions over entire structures): structured perceptron, CRFs, M3Ns (model: the MST parser)
  - Pros:
    - theoretically much cleaner, and in practice works better
  - Cons:
    - decoding at training time (plus the partition function for CRFs); but approximate learning methods exist
    - learning can be very problematic if complex features of y are used
- Both kinds of models require decoding at testing. Decoding does not really depend on the training criterion but on how the features decompose over y.
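A minimal sketch of the local (MEMM-style) setup and of the train/test mismatch, assuming placeholder helpers `feats`, `fit_classifier`, and a classifier with a `predict` method; this is illustrative, not the slides' implementation:

```python
def train_local(examples, fit_classifier, feats):
    """Train one multiclass classifier over local decisions.
    During training, features are computed from the GOLD previous tag."""
    X, Y = [], []
    for words, tags in examples:             # examples: list of (words, gold_tags)
        prev = "<s>"
        for i, tag in enumerate(tags):
            X.append(feats(words, i, prev))  # may use the full input and gold history
            Y.append(tag)
            prev = tag                       # gold history at training time
    return fit_classifier(X, Y)

def greedy_decode(words, classifier, feats):
    """At test time the same classifier conditions on its own PREDICTED history."""
    tags, prev = [], "<s>"
    for i in range(len(words)):
        prev = classifier.predict(feats(words, i, prev))
        tags.append(prev)
    return tags
```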

## Specific Learning Criteria

- CRFs
  - Maximize the conditional log-likelihood of the training set, sum_i log P(y_i | x_i)
- Perceptron
  - Ensure separability on the training set (with a large margin in some variants, e.g., ALMA): rank the correct structure above the incorrect ones (see the sketch after this list)
- Max-Margin Markov Networks (M3Ns)
  - Separate the training set with maximal margin (sensitive to the error)
  - For every labeled example (x_i, y_i), require w · f(x_i, y_i) >= w · f(x_i, y) + L(y_i, y) for every structure y, where L is some loss function (e.g., Hamming distance for sequences, measuring how many labels do not match)
  - "Wrong sequences with small errors should be penalized less than those with more errors"
- Also: SVM-Struct, boosting, ...
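A minimal sketch of the structured perceptron update ("rank the correct structure above the incorrect ones"), assuming placeholder `decode` (argmax over structures) and `feats` (joint feature map) functions:

```python
def structured_perceptron(data, feats, decode, epochs=5):
    """data: list of (x, gold_y) pairs; returns a dict of feature weights."""
    w = {}
    for _ in range(epochs):
        for x, gold in data:
            pred = decode(x, w)                    # argmax_y  w . f(x, y)
            if pred != gold:
                for k, v in feats(x, gold).items():
                    w[k] = w.get(k, 0.0) + v       # promote features of the gold structure
                for k, v in feats(x, pred).items():
                    w[k] = w.get(k, 0.0) - v       # demote features of the predicted structure
    return w
```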

## Decoding at Training vs Testing: Examples

| Model | Decoding at training | Decoding at testing |
| --- | --- | --- |
| MEMM (local discr.) | "No" (multiclass classifiers are trained) | Approximate search if there is a complex decomposition over y; Viterbi otherwise |
| "Standard" chain CRF (global discr.) | Full (+ partition function) | Full (Viterbi) |
| HMM (generative) | No | Full (Viterbi) |
| Incremental perceptron (global discr.) | Approximate | Approximate (less approximate) |
| Searn (local discr.) | Approximate (more so) | Approximate |

Different combinations are possible ...

## Inference (argmax)

- Simple dependencies in y:
  - Viterbi to find the most likely sequence (or Chu-Liu-Edmonds for the MST parser); see the sketch after this list
  - or marginal decoding to find the most likely label for every "position"
- Complex dependencies:
  - beam or greedy search (or smarter search methods)
  - reformulate the inference problem as an integer linear program and use known ILP methods
- (We do not care here whether the inference is used at training, at testing, or at both)
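A minimal sketch of Viterbi decoding for a chain model, assuming a placeholder local scoring function `score(i, prev_label, label, x)` (transition plus emission score); illustrative only:

```python
def viterbi(x, labels, score):
    """Dynamic programming over positions: keep the best score and backpointer per label."""
    n = len(x)
    best = [{y: (score(0, "<s>", y, x), "<s>") for y in labels}]
    for i in range(1, n):
        best.append({})
        for y in labels:
            s, p = max((best[i - 1][yp][0] + score(i, yp, y, x), yp) for yp in labels)
            best[i][y] = (s, p)
    # Trace back the highest-scoring path
    y = max(labels, key=lambda lab: best[n - 1][lab][0])
    path = [y]
    for i in range(n - 1, 0, -1):
        y = best[i][y][1]
        path.append(y)
    return list(reversed(path))
```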

## Latent Variables vs Explicit Features

- Explicit features (most of the models we considered: CRFs, MEMMs, etc.):
  - Pros:
    - mostly convex optimization (no local minima)
    - cheaper to learn
  - Cons:
    - the model is only as good as its features: extensive feature engineering is needed
    - non-local dependencies in y are often necessary
- Latent variable models (LA-PCFGs, ISBNs):
  - Pros:
    - learn how to propagate relevant information (learn complex features from simple ones)
    - can learn a model with simple decompositions over an extended y, allowing efficient decoding
    - the latent representation (e.g., extended parsing states or an extended grammar) can potentially be useful in other tasks (multi-task learning)
  - Cons:
    - non-convex optimization: local minima must be avoided (tricky)
    - more expensive to train

## Last Bits

- Term paper: due Mar 31, but send me ideas, outlines, and drafts well before the deadline (soon!)
- Feedback on the content would be very much appreciated (I am preparing a lecture course with a similar set of topics)
- Thanks for participating!!!
