Learning for Structured Prediction: Overview of the Material
Outline
  - Types of structures considered
  - Generative vs discriminative
  - Global discriminative vs local discriminative
  - Decoding: at testing vs at learning; methods for decoding
  - Predefined features vs latent features
  Examples of methods are shown in red italic. Some points are oversimplified.
Types of Structures
  - Sequences: chain CRFs, HMMs, (chain-type) M3Ns, ...
  - Trees:
    - Constituency trees: weighted CFGs (including LA-PCFGs), left-corner/shift-reduce parsers (the MaxEnt parser, the ISBN parser, ...)
    - Dependency structures: MST parser, Nivre's shift-reduce parser, ...
  - Rankings: PRank (today)
  - Not considered: DAGs (e.g., some semantic representations), bipartite graphs (machine translation), or more general graphs, ...
Generative vs Discriminative
  Discriminative: CRFs, MEMMs, Structured Perceptron, Max-Margin Markov Networks (M3Ns), ...
    - Learn a mapping from x to y so that the expected error is minimal
    - Pros:
      - model what you actually care about
      - complex features of x are easy to integrate
      - different error measures can be considered
      - fewer assumptions (and therefore better asymptotic performance)
  Generative: HMMs, PCFGs (including LA-PCFGs), ...
    - Score how likely the combination of input and output is, P(x, y)
    - Pros:
      - easier to learn (if everything is observable, the ML parameters are normalized counts)
      - "cleaner" semi-supervised learning: select the parameters to maximize the likelihood of the data, including unlabeled inputs
      - often better with small datasets
      - some approaches care about (speech recognition, statistical machine translation, ...)
      - arguably preferable with latent variables
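To make the "ML parameters are normalized counts" point concrete, here is a minimal sketch for a fully observed HMM tagger. The function name `estimate_hmm`, the boundary symbols `<s>` and `</s>`, and the toy corpus are assumptions made for illustration; no smoothing is applied.

```python
from collections import Counter, defaultdict


def estimate_hmm(tagged_sentences):
    """Maximum-likelihood HMM parameters from fully labeled data.

    With every tag observed, the ML estimates are relative frequencies:
    P(tag | prev_tag) and P(word | tag) are normalized counts.
    """
    trans_counts = defaultdict(Counter)  # prev_tag -> Counter over next tags
    emit_counts = defaultdict(Counter)   # tag -> Counter over words
    for sentence in tagged_sentences:
        prev = "<s>"  # hypothetical start-of-sentence symbol
        for word, tag in sentence:
            trans_counts[prev][tag] += 1
            emit_counts[tag][word] += 1
            prev = tag
        trans_counts[prev]["</s>"] += 1  # hypothetical end-of-sentence symbol

    def normalize(counts):
        probs = {}
        for context, counter in counts.items():
            total = sum(counter.values())
            probs[context] = {k: v / total for k, v in counter.items()}
        return probs

    return normalize(trans_counts), normalize(emit_counts)


# Toy corpus (made up for illustration).
toy_data = [
    [("the", "D"), ("dog", "N"), ("barks", "V")],
    [("the", "D"), ("cat", "N"), ("sleeps", "V")],
    [("dogs", "N"), ("bark", "V")],
]
transitions, emissions = estimate_hmm(toy_data)
```

No iterative optimization is involved: one pass over the data collects the counts, which is why fully observed generative models are cheap to train.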
Global Discr. vs Local Discr.
  Local (distributions over small decisions): MEMMs, SVM decision classifiers in Nivre's shift-reduce parser
    - Pros:
      - no real decoding at training time (cheap learning)
      - complex features of y can be integrated easily (this is about training; decoding is still needed at testing)
    - Cons:
      - mismatch between test and train modes: the model relies on true features in training and on predicted ones in testing
      - label bias (an unlikely transition cannot be damped if the number of outgoing transitions from a state is not sufficiently large)
  Global (distributions over entire sequences): structured perceptron, CRFs, M3Ns (model: MST parser)
    - Pros:
      - theoretically much cleaner, and in practice works better
    - Cons:
      - decoding at training time (plus the partition function for CRFs), although approximate learning methods exist
      - learning can be very problematic if complex features of y are used
  Both kinds of models require decoding at testing. Decoding does not really depend on the training criterion but on the features of y.
Specific learning criteria
  CRFs
    - Maximize the conditional log-likelihood of the training labels, $\sum_i \log P(y_i \mid x_i)$
  Perceptron
    - Ensure separability on the training set (with a large margin in some variations, e.g., ALMA): rank the correct structure above every incorrect one
  Max-Margin Markov Networks (M3Ns)
    - Separate the training set with maximal margin, where the required margin depends on the error
    - For every labeled example $(x, y)$: $s(x, y) - s(x, y') \ge \Delta(y, y')$, where $y'$ is any structure, $s$ is the model score, and $\Delta$ is some loss function (e.g., Hamming distance for sequences, measuring how many labels do not match)
    - "Wrong sequences with small errors should be penalized less than those with more errors"
  SVM-Struct, Boosting, ...
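The difference between the perceptron criterion (separate the data) and the M3N criterion (separate with a margin that grows with the Hamming distance) can be sketched on a toy tagging problem. Everything below is an assumption for illustration: the two-label set, the feature map `features`, the brute-force `argmax` (only feasible for tiny inputs), and the data. It is not the training procedure of any particular toolkit.

```python
from collections import Counter
from itertools import product

LABELS = ("A", "B")  # hypothetical label set


def features(x, y):
    """Hypothetical feature map: emission and first-order transition counts."""
    f = Counter()
    prev = "<s>"
    for word, label in zip(x, y):
        f[("emit", word, label)] += 1
        f[("trans", prev, label)] += 1
        prev = label
    return f


def score(w, x, y):
    """Linear model score: dot product of weights and feature counts."""
    return sum(w.get(k, 0.0) * v for k, v in features(x, y).items())


def hamming(y, y_other):
    """Number of positions where the two label sequences differ."""
    return sum(a != b for a, b in zip(y, y_other))


def argmax(w, x, gold=None):
    """Brute-force decoding over all label sequences.

    If gold is given, the Hamming loss is added to the score
    (loss-augmented decoding, as needed for the margin criterion).
    """
    def objective(y):
        bonus = hamming(gold, y) if gold is not None else 0
        return score(w, x, y) + bonus
    return max(product(LABELS, repeat=len(x)), key=objective)


def perceptron_epoch(w, data):
    """One pass of structured perceptron updates; returns the mistake count."""
    mistakes = 0
    for x, y in data:
        y_hat = argmax(w, x)
        if y_hat != y:
            mistakes += 1
            for k, v in features(x, y).items():      # promote the gold structure
                w[k] = w.get(k, 0.0) + v
            for k, v in features(x, y_hat).items():  # demote the predicted one
                w[k] = w.get(k, 0.0) - v
    return mistakes


def structured_hinge(w, x, y):
    """Largest violation of score(y) - score(y') >= hamming(y, y')."""
    y_bad = argmax(w, x, gold=y)
    return score(w, x, y_bad) + hamming(y, y_bad) - score(w, x, y)


data = [
    (("a", "b", "a"), ("A", "B", "A")),
    (("b", "b", "a"), ("B", "B", "A")),
]
weights = {}
for _ in range(10):  # upper bound on epochs; stop once the data is separated
    if perceptron_epoch(weights, data) == 0:
        break
violations = [structured_hinge(weights, x, y) for x, y in data]
```

The perceptron stops as soon as every gold sequence is ranked first, so entries of `violations` may still be positive: the data is separated, but not necessarily with the loss-scaled margin that M3Ns ask for.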
Decoding at training vs testing: examples
  Model | Decoding at training | Decoding at testing
  MEMM (local discr.) | "No" (multiclass classifiers are trained) | Approximate search if the decomposition over y is complex; Viterbi otherwise
  "Standard" chain CRF (global discr.) | Full (+ partition function) | Full (Viterbi)
  HMM (generative) | No | Full (Viterbi)
  Incremental Perceptron (global discr.) | Approximate | Approximate (less approximate)
  Searn (local discr.) | Approximate (more than that) | Approximate
  Different combinations are possible.
Inference (argmax)
  Simple dependencies in y:
    - Viterbi to find the most likely sequence (or Chu-Liu-Edmonds for MST)
    - Or marginal decoding to find the most likely label for every "position"
  Complex dependencies:
    - Beam or greedy search (or some smarter search methods)
    - Reformulate the inference problem as an integer linear program and use known ILP methods
  (Here we do not care when the inference is used: at training, at testing, or at both)
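For the simple-dependency case, Viterbi decoding over a first-order chain can be sketched as below. The function signature, the log-space tables, and the two-state toy model are assumptions for illustration; all probabilities are nonzero so that the logarithms are defined.

```python
import math


def viterbi(obs, states, log_start, log_trans, log_emit):
    """Most likely state sequence of a first-order chain model, in log space."""
    # best[t][s]: best log-score of any path that ends in state s at position t
    best = [{s: log_start[s] + log_emit[s][obs[0]] for s in states}]
    back = [{}]  # back[t][s]: predecessor of s on that best path
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            prev, prev_score = max(
                ((p, best[t - 1][p] + log_trans[p][s]) for p in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = prev_score + log_emit[s][obs[t]]
            back[t][s] = prev
    # Follow the back-pointers from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, best[-1][last]


def to_log(table):
    return {k: math.log(v) for k, v in table.items()}


# Toy two-state model (made-up numbers).
states = ["Healthy", "Fever"]
log_start = to_log({"Healthy": 0.6, "Fever": 0.4})
log_trans = {
    "Healthy": to_log({"Healthy": 0.7, "Fever": 0.3}),
    "Fever": to_log({"Healthy": 0.4, "Fever": 0.6}),
}
log_emit = {
    "Healthy": to_log({"normal": 0.5, "cold": 0.4, "dizzy": 0.1}),
    "Fever": to_log({"normal": 0.1, "cold": 0.3, "dizzy": 0.6}),
}
path, log_score = viterbi(
    ["normal", "cold", "dizzy"], states, log_start, log_trans, log_emit
)
```

The cost is linear in the sequence length and quadratic in the number of states, which is what makes exact decoding practical when the dependencies in y are this simple. Replacing the `max` over predecessors with a sum gives the forward pass used for marginal decoding.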
Latent Variables vs Explicit Features
  Explicit features: most of the models we considered (CRFs, MEMMs, etc.)
    - Pros:
      - mostly convex optimization (no local minima)
      - cheaper to learn
    - Cons:
      - the model is only as good as its features: extensive feature engineering is needed
      - non-local dependencies in y are often necessary
  Latent variable models: LA-PCFGs, ISBNs
    - Pros:
      - learn how to propagate relevant information (learn complex features from simple ones)
      - can learn a model with simple decompositions over extended y, which gives efficient decoding
      - the latent representation (e.g., extended parsing states or an extended grammar) can potentially be useful in other tasks (multi-task learning)
    - Cons:
      - non-convex optimization: local minima need to be avoided (tricky)
      - more expensive to train
Last bits
  - Term paper: due Mar 31, but send me ideas, outlines, and drafts well before the deadline (soon!)
  - Feedback on the content would be very much appreciated (I am preparing a lecture class with a similar set of topics)
  - Thanks for participating!