Download presentation
Presentation is loading. Please wait.
Published byStuart Barker Modified over 9 years ago
1
Structured Prediction Cascades Ben Taskar David Weiss, Ben Sapp, Alex Toshev University of Pennsylvania
2
Supervised Learning Learn from Regression: Binary Classification: Multiclass Classification: Structured Prediction:
3
Handwriting Recognition `structured’ xy
4
Machine Translation xy ‘Ce n'est pas un autre problème de classification.’ ‘This is not another classification problem.’
5
Pose Estimation xy © Arthur Gretton
6
Structured Models space of feasible outputs scoring function parts = cliques, productions Complexity of inference depends on “part” structure
7
Supervised Structured Prediction Learning Prediction Discriminative estimation of θ Data Model: Intractable/impractical for complex models Intractable/impractical for complex models 3 rd order OCR model = 1 mil states * length Berkeley English grammar = 4 mil productions * length^3 Tree-based pose model = 100 mil states * # joints
8
Approximation vs. Computation Usual trade-off: approximation vs. estimation – Complex models need more data – Error from over-fitting New trade-off: approximation vs. computation – Complex models need more time/memory – Error from inexact inference This talk: enable complex models via cascades
9
An Inspiration: Viola Jones Face Detector Scanning window at every location, scale and orientation
10
Classifier Cascade Most patches are non-face Filter out easy cases fast!! Simple features first Low precision, high recall Learned layer-by-layer Next layer more complex C1C1 C2C2 C3C3 CnCn Non-face Face
11
Related Work Global Thresholding and Multiple-Pass Parsing. J Goodman, 97 A maximum-entropy-inspired parser. E. Charniak, 00. Coarse-to-fine n-best parsing and MaxEnt discriminative reranking, E Charniak & M Johnson, 05 TAG, dynamic pro- gramming, and the perceptron for efficient, feature-rich parsing, X. Carreras, M. Collins, and T. Koo, 08 Coarse-to-Fine Natural Language Processing, S. Petrov, 09 Coarse-to-fine face detection. Fleuret, F., Geman, D, 01 Robust real-time object detection. Viola, P., Jones, M, 02 Progressive search space reduction for human pose estimation, Ferrari, V., Marin-Jimenez, M., Zisserman, A, 08
12
What’s a Structured Cascade? What to filter? – Clique assignments What are the layers? – Higher order models – Higher resol. models How to learn filters? – Novel convex loss – Simple online algorithm – Generalization bounds F1F1 F2F2 F3F3 FnFn ?????? ‘structured’
13
Trade-off in Learning Cascades Accuracy: Minimize the number of errors incurred by each level Efficiency: Maximize the number of filtered assignments at each level Filter 1 Filter 2 Filter D Predict
14
Max-marginals (Sequences) a b c d a b c d a b c d a b c d Score of an output: Compute max (*bc*) Max marginal:
15
Filtering with Max-marginals Set threshold Filter clique assignment if a b c d a b c d a b c d a b c d
16
Filtering with Max-marginals Set threshold Filter clique assignment if a b c d a b c d a b c d a b c d Remove edge bc
17
Why Max-marginals? Valid path guarantee – Filtering leaves at least one valid global assignment Faster inference – No exponentiation, O(k) vs. O(k log k) in some cases Convex estimation – Simple stochastic subgradient algorithm Generalization bounds for error/efficiency – Guarantees on expected trade-off
18
Choosing a Threshold Threshold must be specific to input x – Max-marginal scores are not normalized Keep top-K max-marginal assignments? – Hard(er) to handle and analyze Convex alternative: max-mean-max function: max scoremean max marginal m = #edges
19
Choosing a Threshold (cont) Max marginal scores Mean max marginal Max score Score of truth Range of possible thresholds α = efficiency level
20
Example OCR Cascade abcdefghijkl… 144269-1378052-4249360-27 -774368-24 Mean ≈ 0 Max = 368 α = 0.55 Threshold ≈ 204 acdefgijl 144-1378052-4249-27 -774-24 bhk… b h k
21
Example OCR Cascade b h k a e nu r a b dga c egnoua e mnuwb h k
22
b h k a e nu r a b dga c egnou ▪b▪b▪h▪h▪k▪k bahaka be he…aaabad ag ea…aaacae ag an… a e mnuw aaaeam an au…
23
Example OCR Cascade ▪b▪b▪h▪h▪k▪k bahaka be he…aaabad ag ea…aaacae ag an…aaaeam an au… 144015581575 1440 15581575 1413 1553 1400 14801575 1397 1393 1257 12851302 1294 1356 1336 13061390 1346 1306 Mean ≈ 1412 Max = 1575 α = 0.44 Threshold ≈ 1502
24
▪b▪b ba be aaab ag ea aa acae ag an aaaeam an au ▪b▪b ba be aaab ag ea aa acae ag an aaaeam an au Example OCR Cascade ▪h▪h▪k▪k hakahe… ad… … … 144015581575 1440 15581575 1413 1553 1400 14801575 1397 1393 1257 12851302 1294 1356 1336 13061390 1346 1306 Mean ≈ 1412 Max = 1575 1440 1413 1400 1480 1397 1393 1257 12851302 1294 1356 1336 13061390 1346 1306 α = 0.44 Threshold ≈ 1502
25
Example OCR Cascade ka ▪h▪h▪k▪k hahe… ad… … … Mean ≈ 1412 Max = 1575 ▪h▪h▪k▪k hakaheke aded kn do owom nd α = 0.44 Threshold ≈ 1502
26
Example OCR Cascade ▪h▪h▪k▪k hakaheke aded kn do owom nd … … ▪▪h▪▪k ▪ha▪ka▪he▪ke hadkadhedked adoedondo dowdom
27
10x10x1210x10x2440x40x2480x80x24110×122×24 Example: Pictorial Structure Cascade Upper arm Lower arm H H T T ULA LLA URA LRA k ≈ 320,000 k 2 ≈ 100mil
28
Filter loss If score(truth) > threshold, all true states are safe Efficiency loss Proportion of unfiltered clique assignments Quantifying Loss
29
Learning One Cascade Level Fix α, solve convex problem for θ Minimize filter mistakes at efficiency level α No filter mistakesMargin w/ slack
30
An Online Algorithm Stochastic sub-gradient update: Features of truth Convex combination: Features of best guess + Average features of max marginal “witnesses”
31
Generalization Bounds W.h.p. (1-δ), filtering and efficiency loss observed on training set generalize to new data Expected loss on true distribution
32
) -ramp( Generalization Bounds W.h.p. (1-δ), filtering and efficiency loss observed on training set generalize to new data - Empirical -ramp upper bound 0 1
33
Generalization Bounds W.h.p. (1-δ), filtering and efficiency loss observed on training set generalize to new data n number of examples m number of clique assignments number of cliques B
34
Generalization Bounds W.h.p. (1-δ), filtering and efficiency loss observed on training set generalize to new data Similar bound holds for L e and all ® 2 [0,1]
35
OCR Experiments Dataset: http://www.cis.upenn.edu/~taskar/ocr
36
Efficiency Experiments POS Tagging (WSJ + CONLL datasets) – English, Bulgarian, Portuguese Compare 2 nd order model filters – Structured perceptron (max-sum) – SCP w/ ® 2 [0, 0.2,0.4,0.6,0.8] (max-sum) – CRF log-likelihood (sum-product marginals) Tightly controlled – Use random 40% of training set – Use development set to fit regularization parameter ¸ and α for all methods
37
POS Tagging Results Error Cap (%) on Development Set Max-sum (SCP) Max-sum (0-1) Sum-product (CRF) English
38
POS Tagging Results Error Cap (%) on Development Set Max-sum (SCP) Max-sum (0-1) Sum-product (CRF) Portuguese
39
POS Tagging Results Error Cap (%) on Development Set Max-sum (SCP) Max-sum (0-1) Sum-product (CRF) Bulgarian
40
English POS Cascade (2 nd order) The cascade is efficient. DT NN VRB ADJ
41
Pose Estimation (Buffy Dataset) “Ferrari score:” percent of parts correct within radius
42
Pose Estimation
43
Cascade Efficiency vs. Accuracy
44
Features: Shape/Segmentation Match
45
Features: Contour Continuation
46
Conclusions Novel loss significantly improves efficiency Principled learning of accurate cascades Deep cascades focus structured inference and allow rich models Open questions: – How to learn cascade structure – Dealing with intractable models – Other applications: factored dynamic models, grammars
47
Thanks! Structured Prediction Cascades, Weiss & Taskar, AISTATS10 Cascaded Models for Articulated Pose Estimation, Sapp, Toshev & Taskar, ECCV10
48
Learning to Filter Optimization problem: subject to Maximize efficiency Hard constraint on filtering accuracy Not convex due to coupling ofααand θ Learn θ for fixed ®
49
Safe filtering When does filtering make mistakes? Lemma: a b c d a b c d a b c d a b c d Clique assignments with max-score lower than the truth are safe to filter
50
Safe filtering When does filtering make mistakes? Lemma: To guarantee no mistakes, set score of ground truth > threshold + margin
51
Structured Prediction Underlying goal of learning large-scale graphical models: prediction of structured outputs Structured: multivariate, correlated, constrained
52
Object Segmentation Spatial structure xy
53
Natural Language Parsing The screen was a sea of red Recursive structure xy
54
a b l m n wx
55
Word Alignment What is the anticipated cost of collecting fees under the new proposal? En vertu des nouvelles propositions, quel est le coût prévu de perception des droits? xy What is the anticipated cost of collecting fees under the new proposal ? En vertu de les nouvelles propositions, quel est le coût prévu de perception de les droits ? Combinatorial structure
56
Handwriting Recognition brace xy
57
Structured Prediction Protein folding Machine translation ‘Ce n'est pas un autre problème de classification.’ InputOutput AVITGACERDLQCG KGTCCAVSLWIKSV RVCTPVGTSGEDCH PASHKIPFSGQRMH HTCPCAPNLACVQT SPKKFKCLSK ‘This is not another classification problem.’ huge!
58
Pose Estimation xy
59
Cascade Model Complexity Filter 1 Filter 2 Filter D Predict
60
Example: OCR task abcde…abcde… abcde…abcde… abcde…abcde… abcde…abcde… abcde…abcde… Filter states
61
Example: OCR task abdabd cenrcenr acdeoacdeo aceracer aceoaceo Expand to bigrams…
62
Example: OCR task abdabd ac bc dc ae be de … ca ea na ra cc bc … aa ca da ea oa ac … aa ca ea ra ac cc … Compute max-marginals Filter edges
63
Example: OCR task abab ac be de br ar … ca na ra cd rd … ca da ac dc … ea ac re … Expand to trigrams…
64
Example: OCR task a b Expand to trigrams… ac be de ar br aca ana ara bca bna bra cca rca cda rda rac dac aea aac ace are Compute max-marginals Filter edges
65
Visualizing L e vs. L f trade-off Train w to minimize loss function on training set For each ² 2 [0,0.002]: – Choose ® to minimize L e subject to L f · ² on the development set – Compute L e on test set – Plot ( ²,L e ) 10 replicates to estimate variance Error Cap (%) on Development Set % 0.05% errors on devel 50% pruned on test (Chose ® =0.43)
66
POS Tagging Results Error Cap (%) on Development Set Max-sum (SCP) Max-sum (0-1) Sum-product (CRF) English
67
POS Tagging Results Error Cap (%) on Development Set Max-sum (SCP) Max-sum (0-1) Sum-product (CRF) Portuguese
68
POS Tagging Results Error Cap (%) on Development Set Max-sum (SCP) Max-sum (0-1) Sum-product (CRF) Bulgarian
69
Efficiency within cascade SPF outperforms SP and CRF baselines – SP unable to achieve very low error rate What about the rest of the cascade? – Train/test methods on sparse lattice output by a unigram SCP model
Similar presentations
© 2024 SlidePlayer.com Inc.
All rights reserved.