Structured Prediction: A Large Margin Approach Ben Taskar University of Pennsylvania Joint work with: V. Chatalbashev, M. Collins, C. Guestrin, M. Jordan,

Presentation transcript:

Structured Prediction: A Large Margin Approach Ben Taskar University of Pennsylvania Joint work with: V. Chatalbashev, M. Collins, C. Guestrin, M. Jordan, D. Klein, D. Koller, S. Lacoste-Julien, C. Manning

“Don’t worry, Howard. The big questions are multiple choice.”

Handwriting Recognition: x is an image of a handwritten word, y is its letter sequence (“brace”). Sequential structure.

Object Segmentation: x is the scene, y is its segmentation. Spatial structure.

Natural Language Parsing: x is the sentence “The screen was a sea of red”, y is its parse tree. Recursive structure.

Bilingual Word Alignment: x is the sentence pair “What is the anticipated cost of collecting fees under the new proposal?” / “En vertu des nouvelles propositions, quel est le coût prévu de perception des droits?”, y is the alignment between their words. Combinatorial structure.

Protein Structure and Disulfide Bridges Protein: 1IMT AVITGACERDLQCG KGTCCAVSLWIKSV RVCTPVGTSGEDCH PASHKIPFSGQRMH HTCPCAPNLACVQT SPKKFKCLSK

Local Prediction: classify using local information. Ignores correlations & constraints!

Local Prediction building tree shrub ground

Structured Prediction: use local information and exploit correlations.

Structured Prediction building tree shrub ground

Outline Structured prediction models Sequences (CRFs) Trees (CFGs) Associative Markov networks (Special MRFs) Matchings Structured large margin estimation Margins and structure Min-max formulation Linear programming inference Certificate formulation

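Structured Models: $h(x) = \arg\max_{y \in \mathcal{Y}(x)} w^\top f(x, y)$, where $\mathcal{Y}(x)$ is the space of feasible outputs and $w^\top f(x, y)$ is the scoring function. Mild assumption: the score is a linear combination of features.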

Chain Markov Net (aka CRF*) a-z y x *Lafferty et al. 01
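For a chain model, the highest-scoring label sequence can be found by dynamic programming. The sketch below is a plain Viterbi decoder written for illustration; the names `node` and `edge` (per-position label scores and label-to-label transition scores) are assumptions of this example, not notation from the slides.

```python
from typing import List, Sequence


def viterbi(node: Sequence[Sequence[float]],
            edge: Sequence[Sequence[float]]) -> List[int]:
    """Return the highest-scoring label sequence for a chain.

    node[t][a] is the score of label a at position t.
    edge[a][b] is the score of label a being followed by label b.
    """
    n, k = len(node), len(node[0])
    best = list(node[0])          # best[a]: best score of a prefix ending in a
    back = []                     # back-pointers, one list per position t >= 1
    for t in range(1, n):
        prev, best, ptr = best, [], []
        for b in range(k):
            a_star = max(range(k), key=lambda a: prev[a] + edge[a][b])
            ptr.append(a_star)
            best.append(prev[a_star] + edge[a_star][b] + node[t][b])
        back.append(ptr)
    # Trace back from the best final label.
    y = [max(range(k), key=lambda b: best[b])]
    for ptr in reversed(back):
        y.append(ptr[y[-1]])
    return y[::-1]
```

The cost is O(n k^2) for n positions and k labels, compared with k^n for enumerating every sequence.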


Associative Markov Nets. Point features: spin-images, point height. Edge features: length of edge, edge orientation. Node labels $y_j$, $y_k$ with node potentials $\phi_j$ and edge potentials $\phi_{jk}$, under the “associative” restriction.

CFG Parsing: features are rule counts, e.g. #(NP → DT NN), …, #(PP → IN NP), …, #(NN → ‘sea’).

Bilingual Word Alignment position orthography association What is the anticipated cost of collecting fees under the new proposal ? En vertu de les nouvelles propositions, quel est le coût prévu de perception de le droits ? j k

Disulfide Bonds: Non-bipartite Matching RSCCPCYWGGCPWGQNCYPEGCSGPKV Fariselli & Casadio `01, Baldi et al. ‘04

Scoring Function RSCCPCYWGGCPWGQNCYPEGCSGPKV RSCCPCYWGGCPWGQNCYPEGCSGPKV amino acid identities phys/chem properties

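Structured Models: $h(x) = \arg\max_{y \in \mathcal{Y}(x)} w^\top f(x, y)$, where $\mathcal{Y}(x)$ is the space of feasible outputs and $w^\top f(x, y)$ is the scoring function. Mild assumptions: the score is a linear combination of features, and it decomposes into a sum of part scores, $w^\top f(x, y) = \sum_p w^\top f(x_p, y_p)$.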

Supervised Structured Prediction. Learning: estimate w from data. Prediction: weighted matching in the running example; generally, combinatorial optimization. Ways to estimate w: local (ignores structure), likelihood (intractable), margin.

Outline Structured prediction models Sequences (CRFs) Trees (CFGs) Associative Markov networks (Special MRFs) Matchings Structured large margin estimation Margins and structure Min-max formulation Linear programming inference Certificate formulation

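OCR Example. We want: $\arg\max_y w^\top f(x, y) = \text{“brace”}$. Equivalently: $w^\top f(x, \text{“brace”}) > w^\top f(x, \text{“aaaaa”})$, $w^\top f(x, \text{“brace”}) > w^\top f(x, \text{“aaaab”})$, …, $w^\top f(x, \text{“brace”}) > w^\top f(x, \text{“zzzzz”})$. That is a lot of constraints!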

Parsing Example. We want: the correct parse tree of ‘It was red’ is the highest-scoring tree. Equivalently: the correct tree outscores every other parse tree of ‘It was red’. That is a lot of constraints!

Alignment Example. We want: the correct alignment of ‘What is the’ with ‘Quel est le’ is the highest-scoring alignment. Equivalently: the correct alignment outscores every other alignment of ‘What is the’ with ‘Quel est le’. That is a lot of constraints!

Structured Loss b c a r e b r o r e b r o c e b r a c e ‘What is the’ ‘Quel est le’ S AE CD S BE AC S BD AC S AB CD ‘It was red’
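For sequences, a natural structured loss counts the positions where the prediction disagrees with the truth (the Hamming loss used later in the talk). A minimal sketch, with the function name `hamming_loss` chosen for this example:

```python
from typing import Sequence


def hamming_loss(y_true: Sequence, y_pred: Sequence) -> int:
    """Count positions where two equal-length label sequences differ."""
    if len(y_true) != len(y_pred):
        raise ValueError("sequences must have the same length")
    return sum(a != b for a, b in zip(y_true, y_pred))


# The candidate outputs shown on the slide, compared against "brace".
for guess in ("bcare", "brore", "broce", "brace"):
    print(guess, hamming_loss("brace", guess))
```

Because this loss is a sum over positions, it decomposes over the same parts as the scoring function, which is what later allows it to be folded into inference.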

Large margin estimation Given training examples, we want: Maximize margin Mistake weighted margin: # of mistakes in y *Collins 02, Altun et al 03, Taskar 03

Large margin estimation Eliminate Add slacks for inseparable case
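The equations on these two slides did not survive in the transcript. The block below writes out the standard form of the mistake-weighted margin problem from the cited max-margin literature, as a hedged reconstruction; the symbols $\gamma$ (margin), $\ell$ (number of mistakes), $\xi_i$ (slacks), and $C$ (trade-off constant) are assumptions about the notation.

```latex
% Maximize the margin, scaled by the number of mistakes in y (separable case):
\[
\max_{\|w\| \le 1} \; \gamma
\quad \text{s.t.} \quad
w^\top f(x_i, y_i) \;\ge\; w^\top f(x_i, y) + \gamma\, \ell(y_i, y)
\qquad \forall i,\ \forall y \in \mathcal{Y}(x_i)
\]

% Eliminate \gamma and add slacks \xi_i for the inseparable case:
\[
\min_{w,\,\xi} \; \tfrac{1}{2}\|w\|^2 + C \sum_i \xi_i
\quad \text{s.t.} \quad
w^\top f(x_i, y_i) + \xi_i \;\ge\; w^\top f(x_i, y) + \ell(y_i, y)
\qquad \forall i,\ \forall y \in \mathcal{Y}(x_i)
\]
```

Taking $y = y_i$ in the second problem gives $\xi_i \ge 0$, so no separate non-negativity constraint is needed. The number of constraints is still exponential in the size of the output.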

Large margin estimation Brute force enumeration Min-max formulation ‘Plug-in’ linear program for inference

Min-max formulation LP Inference Structured loss (Hamming): Inference discrete optim. Key step: continuous optim.
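The inner maximization asks for the output that maximizes score plus loss. With a Hamming loss on a chain, the loss can be folded into the node scores, so the same inference routine applies. The sketch below checks this on a tiny example by brute-force enumeration; `node`, `edge`, and the function names are assumptions of the example, and a real implementation would replace the enumeration with dynamic programming or an LP.

```python
from itertools import product
from typing import Sequence, Tuple


def score(y: Sequence[int], node, edge) -> float:
    """Sum of node scores and consecutive-pair edge scores along the chain."""
    total = sum(node[t][a] for t, a in enumerate(y))
    total += sum(edge[a][b] for a, b in zip(y, y[1:]))
    return total


def loss_augmented_argmax(y_true: Sequence[int], node, edge) -> Tuple[int, ...]:
    """argmax over y of score(y) + Hamming(y_true, y), by enumeration."""
    k = len(node[0])
    # Fold the loss into the node scores: +1 wherever the label is wrong.
    aug = [[node[t][a] + (a != y_true[t]) for a in range(k)]
           for t in range(len(node))]
    return max(product(range(k), repeat=len(node)),
               key=lambda y: score(y, aug, edge))


def slack(y_true: Sequence[int], node, edge) -> float:
    """max_y [score(y) + loss(y_true, y)] - score(y_true); always >= 0."""
    y_hat = loss_augmented_argmax(y_true, node, edge)
    mistakes = sum(a != b for a, b in zip(y_true, y_hat))
    return score(y_hat, node, edge) + mistakes - score(y_true, node, edge)
```

The value returned by `slack` is the smallest slack that satisfies all constraints for one training example, which is why a single loss-augmented inference can stand in for exponentially many constraints.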

Outline Structured prediction models Sequences (CRFs) Trees (CFGs) Associative Markov networks (Special MRFs) Matchings Structured large margin estimation Margins and structure Min-max formulation Linear programming inference Certificate formulation

y → z Map for Markov Nets [figure: 0/1 indicator tables over the labels a, b, …, z]

Markov Net Inference LP: normalization and agreement constraints. Has integral solutions z for chains and trees. Can be fractional for untriangulated networks.
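The LP itself is missing from the transcript. For a pairwise Markov network it is usually written as below; the feature names $f_j$ and $f_{jk}$ and the indexing are assumptions of this sketch, while the two constraint families are the ones named on the slide.

```latex
\[
\max_{z \ge 0} \;
\sum_{j} \sum_{a} z_j(a)\, w^\top f_j(x, a)
\;+\; \sum_{(j,k)} \sum_{a,b} z_{jk}(a,b)\, w^\top f_{jk}(x, a, b)
\]
\[
\text{s.t.} \quad
\underbrace{\sum_{a} z_j(a) = 1}_{\text{normalization}}
\qquad
\underbrace{\sum_{b} z_{jk}(a,b) = z_j(a), \quad \sum_{a} z_{jk}(a,b) = z_k(b)}_{\text{agreement}}
\]
```

Here $z_j(a)$ indicates that node $j$ takes label $a$ and $z_{jk}(a,b)$ that edge $(j,k)$ takes the label pair $(a,b)$.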

Associative MN Inference LP For K=2, solutions are always integral (optimal) For K>2, within factor of 2 of optimal “associative” restriction

CFG Chart. CNF tree = set of two types of parts: constituents (A, s, e) and CF-rules (A → B C, s, m, e).

CFG Inference LP: root, inside, and outside constraints. Has integral solutions z.

Matching Inference LP Has integral solutions z degree What is the anticipated cost of collecting fees under the new proposal ? En vertu de les nouvelles propositions, quel est le coût prévu de perception de le droits ? j k
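The matching LP on this slide is not preserved in the transcript. A common way to write it, using an assumed edge score $s_{jk} = w^\top f(x_{jk})$ for aligning word $j$ to word $k$, is:

```latex
\[
\max_{z} \; \sum_{j,k} s_{jk}\, z_{jk}
\quad \text{s.t.} \quad
\underbrace{\sum_{k} z_{jk} \le 1 \;\; \forall j, \qquad
\sum_{j} z_{jk} \le 1 \;\; \forall k}_{\text{degree}},
\qquad 0 \le z_{jk} \le 1
\]
```

For bipartite graphs the constraint matrix is totally unimodular, so the LP has an integral optimal solution, matching the claim on the slide.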

LP Duality. Linear programming duality: variables become constraints, and constraints become variables. Optimal values are the same when both feasible regions are bounded.
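As a reminder of the shape of the primal/dual pair, here is the textbook form for a maximization LP; the symbols $A$, $b$, $c$, and $\lambda$ are generic and not taken from the slides.

```latex
\[
\text{Primal:} \quad \max_{z \ge 0} \; c^\top z \quad \text{s.t.} \quad A z \le b
\qquad\qquad
\text{Dual:} \quad \min_{\lambda \ge 0} \; b^\top \lambda \quad \text{s.t.} \quad A^\top \lambda \ge c
\]
```

Each primal constraint gets a dual variable $\lambda$, and each primal variable gets a dual constraint. In the min-max formulation this lets the inner maximization over $z$ be replaced by a minimization over $\lambda$, so the whole problem becomes a single joint minimization.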

Min-max Formulation LP duality

Min-max formulation summary Formulation produces concise QP for Low-treewidth Markov networks Associative MNs (K=2) Context free grammars Bipartite matchings Approximate for untriangulated MNs, AMNs with K>2 *Taskar et al 04

Unfactored Primal/Dual QP duality Exponentially many constraints/variables

Factored Primal/Dual. By QP duality, the dual inherits structure from the problem-specific inference LP. Variables $\mu$ correspond to a decomposition of the $\alpha$ variables of the flat case.

The Connection b c a r e b r o r e b r o c e b r a c e r c a o c r b 1 e 

Duals and Kernels Kernel trick works: Factored dual Local functions (log-potentials) can use kernels

Alternatives: Perceptron. Simple iterative method. Unstable for structured output: fewer instances, big updates. May not converge if non-separable. Noisy. Voted / averaged perceptron [Freund & Schapire 99, Collins 02]: regularize / reduce variance by aggregating over iterations.
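A minimal sketch of the averaged structured perceptron, in the spirit of Collins 02 but not taken from the slides. The callables `features` (sparse feature map) and `candidates` (enumerator of feasible outputs) are hypothetical helpers introduced for this example; in practice the `max` over candidates would be replaced by a proper inference routine.

```python
from typing import Callable, Dict, Hashable, Iterable, List, Tuple

FeatureMap = Dict[Hashable, float]


def averaged_structured_perceptron(
    data: List[Tuple[object, object]],
    features: Callable[[object, object], FeatureMap],
    candidates: Callable[[object], Iterable[object]],
    epochs: int = 5,
) -> FeatureMap:
    """Train weights by mistake-driven updates; return their running average."""
    w: FeatureMap = {}
    total: FeatureMap = {}
    steps = 0

    def score(x, y) -> float:
        return sum(w.get(k, 0.0) * v for k, v in features(x, y).items())

    for _ in range(epochs):
        for x, y in data:
            y_hat = max(candidates(x), key=lambda c: score(x, c))
            if y_hat != y:
                # Move toward the true output and away from the prediction.
                for k, v in features(x, y).items():
                    w[k] = w.get(k, 0.0) + v
                for k, v in features(x, y_hat).items():
                    w[k] = w.get(k, 0.0) - v
            # Accumulate after every example so the average covers all steps.
            steps += 1
            for k, v in w.items():
                total[k] = total.get(k, 0.0) + v
    return {k: v / steps for k, v in total.items()}
```

Averaging over all intermediate weight vectors is the variance-reduction step the slide refers to; returning `w` instead would give the plain perceptron.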

Alternatives: Constraint Generation [Collins 02; Altun et al, 03; Tsochantaridis et al, 04]. Add most violated constraint. Handles more general loss functions. Only polynomial # of constraints needed. Need to re-solve QP many times. Worst case # of constraints larger than factored.

Handwriting Recognition. Length: ~8 chars. Letter: 16x8 pixels. 10-fold train/test: 5000/50000 letters, 600/6000 words. Models: multiclass SVMs*, CRFs, M^3 nets. [Chart: test error (average per-character) for CRFs, MC-SVMs, and M^3 nets with raw pixels, quadratic kernel, and cubic kernel.] 45% error reduction over linear CRFs. 33% error reduction over multiclass SVMs. *Crammer & Singer

Hypertext Classification. WebKB dataset: four CS department websites, 1300 pages / 3500 links. Classify each page: faculty, course, student, project, other. Train on three universities, test on fourth. 53% error reduction over SVMs. 38% error reduction over RMNs*. [Chart annotations: relaxed dual, loopy belief propagation.] *Taskar et al 02

3D Mapping Laser Range Finder GPS IMU Data provided by: Michael Montemerlo & Sebastian Thrun Label: ground, building, tree, shrub Training: 30 thousand points Testing: 3 million points

Segmentation results (hand labeled, 180K test points)

Model    Accuracy
SVM      68%
V-SVM    73%
M3N      93%

Fly-through

Word Alignment Results. Data: [Hansards – Canadian Parliament]. Features induced on about 1 mil unsupervised sentences. Trained on 100 sentences (10,000 edges). Tested on 350 sentences (35,000 edges). [Taskar+al 05] [Lacoste-Julien+Taskar+al 06]

Model                          Error*
Local learning+matching        10.0
Our approach                   8.5
GIZA/IBM4 [Och & Ney 03]       6.5
+Local learning+matching       5.4
+Our approach                  4.9
+Our approach+QAP              4.5

*Error: weighted combination of precision/recall

Outline Structured prediction models Sequences (CRFs) Trees (CFGs) Associative Markov networks (Special MRFs) Matchings Structured large margin estimation Margins and structure Min-max formulation Linear programming inference Certificate formulation

Non-bipartite matchings: $O(n^3)$ combinatorial algorithm; no polynomial-size LP known. Spanning trees: no polynomial-size LP known; simple certificate of optimality. Intuition: verifying optimality is easier than optimizing. Compact optimality condition of wrt ij kl

Certificate for non-bipartite matching. Alternating cycle: every other edge is in the matching. Augmenting alternating cycle: the score of its edges not in the matching is greater than the score of its edges in the matching. Negate the score of edges not in the matching; then augmenting alternating cycle = negative-length alternating cycle. Matching is optimal ⇔ no negative alternating cycles. Edmonds ’65

Certificate for non-bipartite matching. Pick any node r as root. $d_j$ = length of shortest alternating path from r to j. The distances satisfy a triangle inequality. Theorem: no negative-length cycle ⇔ distance function d exists. Can be expressed as linear constraints: $O(n)$ distance variables, $O(n^2)$ constraints.
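The theorem has a familiar analog on ordinary directed graphs: shortest-path distances satisfying the triangle inequality on every edge exist exactly when no negative-length cycle is reachable from the root. The sketch below illustrates that simpler analog with Bellman-Ford; it does not handle the alternating-path structure of the matching case, and the function name is an assumption of this example.

```python
from typing import List, Optional, Tuple


def distance_certificate(
    n: int, edges: List[Tuple[int, int, float]], root: int = 0
) -> Optional[List[float]]:
    """Return distances d with d[v] <= d[u] + length for every edge (u, v),
    or None if a negative-length cycle is reachable from root."""
    inf = float("inf")
    d = [inf] * n
    d[root] = 0.0
    # Relax every edge up to n - 1 times (Bellman-Ford).
    for _ in range(n - 1):
        changed = False
        for u, v, length in edges:
            if d[u] + length < d[v]:
                d[v] = d[u] + length
                changed = True
        if not changed:
            break
    # Any edge that can still be relaxed witnesses a negative cycle.
    for u, v, length in edges:
        if d[u] + length < d[v]:
            return None
    return d
```

The returned vector is the certificate: checking it takes one pass over the edges, which mirrors the slide's point that verifying optimality is easier than optimizing.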

Certificate formulation Formulation produces compact QP for Spanning trees Non-bipartite matchings Any problem with compact optimality condition *Taskar et al. ‘05

Disulfide Bonding Prediction. Data: [Swiss Prot 39], 450 sequences (4-10 cysteines). Features: windows around C-C pair, physical/chemical properties. [Taskar+al 05]

Model                                 Acc*
Local learning+matching               41%
Recursive Neural Net [Baldi+al ’04]   52%
Our approach (certificate)            55%

*Accuracy: % proteins with all correct bonds

Formulation summary. Brute force enumeration. Min-max formulation: ‘plug-in’ convex program for inference. Certificate formulation: directly guarantee optimality of the correct output.

Omissions. Kernels: non-parametric models. Structured generalization bounds: bounds on Hamming loss. Scalable algorithms (no QP solver needed): Structured SMO (works for chains, trees) [Taskar 04]; Structured ExpGrad (works for chains, trees) [Bartlett+al 04]; Structured ExtraGrad (works for matchings, AMNs) [Taskar+al 06].

Open questions Statistical consistency Hinge loss not consistent for non-binary output [See Tewari & Bartlett 05, McAllester 07] Learning with approximate inference Does constant factor approximate inference guarantee anything about learning? No [See Kulesza & Pereira 07] Perhaps other assumptions needed Discriminative structure learning Using sparsifying priors

Conclusion Two general techniques for structured large-margin estimation Exact, compact, convex formulations Allow efficient use of kernels Tractable when other estimation methods are not Efficient learning algorithms Empirical success on many domains

References
Y. Altun, I. Tsochantaridis, and T. Hofmann. Hidden Markov support vector machines. ICML 2003.
M. Collins. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. EMNLP 2002.
K. Crammer and Y. Singer. On the algorithmic implementation of multiclass kernel-based vector machines. JMLR 2001.
J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. ICML 2001.
More papers at

Modeling First Order Effects MonotonicityLocal inversionLocal fertility QAP NP-complete Sentences (  30 words,  1k vars)  few seconds (Mosek) Learning: use LP relaxation Testing: using LP, 83.5% sentences, 99.85% edges integral

Segmentation Model → Min-Cut. Binary labels 0/1, with local evidence and spatial smoothness terms. Computing the best labeling is hard in general, but if edge potentials are attractive it can be found with a min-cut algorithm. Multiway-cut for the multiclass case: use LP relaxation. [Greig+al 89, Boykov+al 99, Kolmogorov & Zabih 02, Taskar+al 04]

Scalable Algorithms Batch and online Linear in the size of the data Iterate until convergence For each example in the training sample Run inference using current parameters (varies by method) Online: Update parameters using computed example values Batch: Update parameters using computed sample values Structured SMO (Taskar et al, 03; Taskar 04) Structured Exponentiated Gradient (Bartlett et al, 04) Structured Extragradient (Taskar et al, 05)

Experimental Setup. Standard Penn treebank split (2-21/22/23). Generative baselines: Klein & Manning 03 and Collins 99. Discriminative: Basic = max-margin version of K&M 03; Lexical & Lexical + Aux. Lexical features (on constituent parts only): $t_{s-1}\,[t_s \dots t_e]\,t_{e+1}$ (predicted tags) and $x_{s-1}\,[x_s \dots x_e]\,x_{e+1}$. Auxiliary features: flat classifier using same features; prediction of K&M 03 on each span.

Results for sentences ≤40 words. Table columns: Model, LP, LR, F1. Rows: Generative, Lexical+Aux*, Collins 99*. *Trained only on sentences ≤20 words. *Taskar et al 04

Example The Egyptian president said he would visit Libya today to resume the talks. Generative model: Libya today is base NP Lexical model: today is a one word constituent