Structured Prediction Cascades. Ben Taskar, David Weiss, Ben Sapp, Alex Toshev. University of Pennsylvania.

Supervised Learning. Learn a predictor h: X → Y from examples {(x_i, y_i)}. Regression: Y = R. Binary classification: Y = {−1, +1}. Multiclass classification: Y = {1, …, K}. Structured prediction: Y is an exponentially large set of structured outputs.

Handwriting Recognition. x: an image of a handwritten word (here, 'structured'); y: the character sequence it spells.

Machine Translation. x: 'Ce n'est pas un autre problème de classification.' y: 'This is not another classification problem.'

Pose Estimation. x: an image; y: the body pose. (Image © Arthur Gretton.)

Structured Models. Predict ŷ = argmax_{y ∈ Y(x)} s(x, y), where Y(x) is the space of feasible outputs and the scoring function decomposes over parts: s(x, y) = Σ_c θᵀ f_c(x, y_c). Parts = cliques (graphical models) or productions (grammars). The complexity of inference depends on the "part" structure.

Supervised Structured Prediction. Learning: discriminative estimation of θ from data. Prediction: argmax inference in the model. Both are intractable or impractical for complex models: a 3rd-order OCR model has ~1 million states per position; the Berkeley English grammar has ~4 million productions, with cost cubic in sentence length; a tree-based pose model costs ~100 million states times the number of joints.

Approximation vs. Computation Usual trade-off: approximation vs. estimation – Complex models need more data – Error from over-fitting New trade-off: approximation vs. computation – Complex models need more time/memory – Error from inexact inference This talk: enable complex models via cascades

An Inspiration: the Viola-Jones Face Detector. A scanning window at every location, scale, and orientation.

Classifier Cascade. Most patches are non-face, so filter out the easy cases fast! Simple features first; low precision, high recall; learned layer by layer; each successive layer more complex. (Diagram: stages C1 → C2 → … → Cn, each rejecting a patch as non-face or passing it on until it is accepted as a face.)

Related Work:
- Global Thresholding and Multiple-Pass Parsing. J. Goodman, 1997.
- A Maximum-Entropy-Inspired Parser. E. Charniak, 2000.
- Coarse-to-Fine n-Best Parsing and MaxEnt Discriminative Reranking. E. Charniak & M. Johnson, 2005.
- TAG, Dynamic Programming, and the Perceptron for Efficient, Feature-Rich Parsing. X. Carreras, M. Collins & T. Koo, 2008.
- Coarse-to-Fine Natural Language Processing. S. Petrov, 2009.
- Coarse-to-Fine Face Detection. F. Fleuret & D. Geman, 2001.
- Robust Real-Time Object Detection. P. Viola & M. Jones, 2002.
- Progressive Search Space Reduction for Human Pose Estimation. V. Ferrari, M. Marin-Jimenez & A. Zisserman, 2008.

What's a Structured Cascade? What to filter? Clique assignments. What are the layers? Higher-order models; higher-resolution models. How to learn the filters? A novel convex loss, a simple online algorithm, and generalization bounds. (Diagram: filters F1 → F2 → … → Fn progressively prune hypotheses for the word 'structured'.)

Trade-off in Learning Cascades. Accuracy: minimize the number of errors incurred by each level. Efficiency: maximize the number of filtered assignments at each level. (Pipeline: Filter 1 → Filter 2 → … → Filter D → Predict.)

Max-marginals (Sequences). Consider a trellis with states {a, b, c, d} at each of four positions. Score of an output: s(y) = Σ_i θᵀ f(x, y_i, y_{i+1}). Max-marginal of a clique assignment, e.g. the edge bc: m(bc) = max over all outputs of the form *bc* of s(y), i.e. the best total score of any sequence passing through that edge.

Filtering with Max-marginals. Set a threshold t and filter a clique assignment y_c if m(y_c) ≤ t. In the trellis this removes edges: if m(bc) ≤ t, the edge bc is removed.

Why Max-marginals? Valid path guarantee: filtering leaves at least one valid global assignment. Faster inference: no exponentiation, O(k) vs. O(k log k) in some cases. Convex estimation: a simple stochastic subgradient algorithm. Generalization bounds for error/efficiency: guarantees on the expected trade-off. (A sketch of the computation follows.)
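To make the mechanics concrete, here is a minimal sketch (not code from the paper) of computing edge max-marginals on a linear chain with a max-product forward-backward pass. The names `node` (an n×k matrix of per-position state scores) and `edge` (an (n−1)×k×k array of transition scores) are illustrative assumptions.

```python
import numpy as np

def max_marginals(node, edge):
    """m[i, s, t] = best total score of any full sequence that passes
    through states (s, t) at positions (i, i+1)."""
    n, k = node.shape
    # Forward pass: fwd[i, s] = best score of a prefix ending in state s at i.
    fwd = np.zeros((n, k))
    fwd[0] = node[0]
    for i in range(1, n):
        fwd[i] = node[i] + np.max(fwd[i - 1][:, None] + edge[i - 1], axis=0)
    # Backward pass: bwd[i, s] = best score of the suffix after position i,
    # given state s at i (node[i] itself excluded; bwd[n-1] = 0).
    bwd = np.zeros((n, k))
    for i in range(n - 2, -1, -1):
        bwd[i] = np.max(edge[i] + (node[i + 1] + bwd[i + 1])[None, :], axis=1)
    # Stitch prefix, edge, and suffix scores together for every edge assignment.
    m = np.empty((n - 1, k, k))
    for i in range(n - 1):
        m[i] = fwd[i][:, None] + edge[i] + (node[i + 1] + bwd[i + 1])[None, :]
    return m
```

The two passes are ordinary Viterbi-style dynamic programming, so max-marginals for every edge cost no more, asymptotically, than a single decoding.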

Choosing a Threshold. The threshold must be specific to the input x, since max-marginal scores are not normalized. Keep the top-K max-marginal assignments? Harder to handle and analyze. Convex alternative, the max-mean-max function: t_α(x) = α · (max score) + (1 − α) · (mean max-marginal), with m = #edges entering the mean.
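A sketch of the max-mean-max function, assuming the mean is taken over all edge-assignment max-marginals (the exact normalization on the slide is not recoverable); `m` is the array produced by `max_marginals` above.

```python
def max_mean_max_threshold(m, alpha):
    """Convex combination of the best total score and the mean max-marginal.
    alpha = 1 keeps only outputs tied with the argmax; alpha = 0 keeps
    everything scoring above the average max-marginal."""
    return alpha * m.max() + (1 - alpha) * m.mean()
```

Since each max-marginal is a max of linear functions of θ, both terms are convex in θ, which is what keeps the learning problem below convex for a fixed α.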

Choosing a Threshold (cont). (Figure: the spread of max-marginal scores, with the mean max-marginal, the max score, and the score of the truth marked; α, the efficiency level, sweeps the range of possible thresholds between the mean and the max.)

Example OCR Cascade. (Figure sequence: a letter trellis for a word is filtered level by level. Level 1, unigram states over the alphabet a, b, c, …: mean max-marginal ≈ 0, max score = 368, α = 0.55, threshold ≈ 204; most letters are pruned, leaving a few states such as b, h, k per position. Level 2, the survivors are expanded to bigram states: mean ≈ 1412, max score = 1575, α = 0.44, threshold ≈ 1502; most bigrams are filtered in turn. Level 3, the surviving bigrams (ha, ka, he, ke, ad, ed, …) are expanded to trigram states for the next round of filtering.)

Example: Pictorial Structure Cascade. (Figure: coarse-to-fine state spaces of 10×10×12, 10×10×24, 40×40×24, 80×80×24, and 110×122×24 locations × orientations; the body model has head (H), torso (T), and upper/lower left and right arm parts (ULA, LLA, URA, LRA). At full resolution k ≈ 320,000 states per part, so k² ≈ 100 million pairwise assignments.)

Quantifying Loss. Filter loss L_f: whether a true clique assignment is filtered; if score(truth) > threshold, all true states are safe. Efficiency loss L_e: the proportion of unfiltered clique assignments. (A sketch of both follows.)
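The two losses in the same illustrative setup (the function names are assumptions, not from the paper):

```python
def filter_loss(truth_score, t):
    """1 if the truth scores at or below the threshold, so one of its
    clique assignments could be filtered; 0 otherwise."""
    return float(truth_score <= t)

def efficiency_loss(m, t):
    """Proportion of edge assignments left unfiltered (max-marginal above
    the threshold): lower means a cheaper next cascade level."""
    return float((m > t).mean())
```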

Learning One Cascade Level. Fix α and solve a convex problem for θ: minimize filter mistakes at efficiency level α. "No filter mistakes" means the score of the truth stays above the threshold; enforcing this with a margin and slack yields a hinge loss of the form max(0, ℓ + t_α(x) − θᵀf(x, y)).

An Online Algorithm. Stochastic subgradient update: θ ← θ + η [ f(x, y) − ( α · f(x, ŷ) + (1 − α) · average features of the max-marginal "witnesses" ) ], i.e. the features of the truth minus a convex combination of the features of the best guess and the average features of the outputs achieving each max-marginal.
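A minimal sketch of one update, assuming hypothetical helpers `feat` (joint feature vector), `best_output` (argmax decoding), and `witnesses` (for each edge assignment, an output achieving its max-marginal); none of these names come from the paper.

```python
def subgradient_step(theta, x, y_true, alpha, eta, feat, best_output, witnesses):
    """One stochastic subgradient step: move toward the features of the truth
    and away from the convex combination of the best guess and the average
    max-marginal witness."""
    y_hat = best_output(theta, x)                  # highest-scoring output
    ws = witnesses(theta, x)                       # list of witness outputs
    avg_w = sum(feat(x, w) for w in ws) / len(ws)  # average witness features
    grad = feat(x, y_true) - (alpha * feat(x, y_hat) + (1 - alpha) * avg_w)
    return theta + eta * grad
```

In practice the step is only taken on examples where the hinge loss of the previous slide is positive.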

Generalization Bounds. With high probability 1 − δ, the filtering and efficiency losses observed on the training set generalize to new data: the expected filter loss on the true distribution is bounded by an empirical ramp-function upper bound (interpolating between 0 and 1 across a margin) plus a complexity term involving n, the number of examples; m, the number of clique assignments; the number of cliques; and a scale constant B. A similar bound holds for L_e and all α ∈ [0, 1].

OCR Experiments Dataset:

Efficiency Experiments. POS tagging (WSJ + CoNLL datasets): English, Bulgarian, Portuguese. Compare 2nd-order model filters: structured perceptron (max-sum); SCP with α ∈ {0, 0.2, 0.4, 0.6, 0.8} (max-sum); CRF log-likelihood (sum-product marginals). Tightly controlled: use a random 40% of the training set; use the development set to fit the regularization parameter λ and α for all methods.

POS Tagging Results. (Three plots, for English, Portuguese, and Bulgarian: filtering efficiency against the error cap (%) on the development set, comparing max-sum (SCP), max-sum (0-1), and sum-product (CRF).)

English POS Cascade (2nd order). The cascade is efficient. (Figure: the surviving tag lattice, with tags such as DT, NN, VRB, ADJ.)

Pose Estimation (Buffy Dataset). "Ferrari score": the percentage of parts correctly localized within a fixed radius.

Pose Estimation

Cascade Efficiency vs. Accuracy

Features: Shape/Segmentation Match

Features: Contour Continuation

Conclusions Novel loss significantly improves efficiency Principled learning of accurate cascades Deep cascades focus structured inference and allow rich models Open questions: – How to learn cascade structure – Dealing with intractable models – Other applications: factored dynamic models, grammars

Thanks! Structured Prediction Cascades, Weiss & Taskar, AISTATS 2010. Cascaded Models for Articulated Pose Estimation, Sapp, Toshev & Taskar, ECCV 2010.

Learning to Filter. Optimization problem: maximize efficiency subject to a hard constraint on filtering accuracy. Not jointly convex due to the coupling of α and θ, so learn θ for a fixed α.

Safe Filtering. When does filtering make mistakes? Lemma: clique assignments whose max-marginal is lower than the score of the truth are safe to filter, since any clique assignment contained in the true output has a max-marginal at least as large as the truth's score. To guarantee no mistakes, keep the score of the ground truth above the threshold, plus a margin.

Structured Prediction Underlying goal of learning large-scale graphical models: prediction of structured outputs Structured: multivariate, correlated, constrained

Object Segmentation. Spatial structure: x is an image, y its segmentation.

Natural Language Parsing. Recursive structure: x is a sentence ('The screen was a sea of red'), y its parse tree.

Word Alignment. Combinatorial structure: x is a sentence pair ('What is the anticipated cost of collecting fees under the new proposal?' / 'En vertu des nouvelles propositions, quel est le coût prévu de perception des droits?'), y the alignment between their words.

Handwriting Recognition. x: an image of the handwritten word 'brace'; y: 'brace'.

Structured Prediction. Input → output examples: protein folding (input: the amino-acid sequence AVITGACERDLQCGKGTCCAVSLWIKSVRVCTPVGTSGEDCHPASHKIPFSGQRMHHTCPCAPNLACVQTSPKKFKCLSK; output: the folded structure); machine translation (input: 'Ce n'est pas un autre problème de classification.'; output: 'This is not another classification problem.'). The output space is huge!

Pose Estimation. (Figure: input image x and output pose y.)

Cascade Model Complexity. (Pipeline: Filter 1 → Filter 2 → … → Filter D → Predict, with model complexity growing along the cascade.)

Example: OCR Task. (Figure sequence, five slides over a five-position letter trellis: first filter states, so each position keeps only a few letters; expand the survivors to bigrams; compute max-marginals and filter edges; expand the surviving bigrams to trigrams, with boundary padding symbols ▪; compute max-marginals and filter edges again.)
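A small sketch of the expansion step between levels, assuming `survivors[i]` holds the set of bigram tuples left unfiltered at position i (an illustrative representation, not the paper's data structure):

```python
def expand_bigrams_to_trigrams(survivors):
    """Keep trigram (a, b, c) at position i only if bigram (a, b) survived
    at i and bigram (b, c) survived at i + 1."""
    return [{(a, b, c)
             for (a, b) in survivors[i]
             for (b2, c) in survivors[i + 1]
             if b == b2}
            for i in range(len(survivors) - 1)]
```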

Visualizing the L_e vs. L_f trade-off. Train w to minimize the loss function on the training set. For each ε ∈ [0, 0.002]: choose α to minimize L_e subject to L_f ≤ ε on the development set; compute L_e on the test set; plot (ε, L_e). 10 replicates to estimate variance. (Plot: error cap (%) on the development set vs. test efficiency; at 0.05% errors on development, 50% of assignments are pruned on test, with α = 0.43.)

Efficiency within the Cascade. SPF outperforms the SP and CRF baselines; SP is unable to achieve a very low error rate. What about the rest of the cascade? Train/test the methods on the sparse lattice output by a unigram SCP model.