Learning Tree Conditional Random Fields
Joseph K. Bradley and Carlos Guestrin
We want to model conditional correlations
Example: reading people's minds. X: fMRI voxels; Y: semantic features (Metal? Manmade? Found in house? ...). Predict each Y_i from X independently? No: the Y_i are correlated, e.g., Person? & Live in water?, or Colorful? & Yellow?
(Image from http://en.wikipedia.org/wiki/File:FMRI.jpg; application from Palatucci et al., 2009)
Conditional Random Fields (CRFs) (Lafferty et al., 2001)
Pro: avoid modeling P(X). In fMRI, X ≈ 500 to 10,000 voxels.
Conditional Random Fields (CRFs)
The edges among Y_1, ..., Y_4 encode the conditional independence structure of P(Y|X).
Con: the normalization Z(x) depends on X = x, so Z(x) must be computed anew for each inference. Exact inference is intractable in general, and approximate inference is expensive.
Solution: use tree CRFs. Pro: fast, exact inference (while still avoiding modeling P(X)).
CRF Structure Learning
Tree CRFs give fast, exact inference and avoid modeling P(X); structure learning also performs feature selection.
We score edges with local inputs (scalable) instead of global inputs (not scalable).
This work
Goals: structured conditional models P(Y|X); scalable methods.
Approach: tree structures, local inputs X_ij, max spanning trees.
Outline: gold standard; max spanning trees; generalized edge weights; heuristic weights; experiments (synthetic & fMRI).
Related work
Method                                                        Feature selection?   Tractable models?
Torralba et al. (2004): Boosted Random Fields                 Yes                  No
Schmidt et al. (2008): Block-L1 regularized pseudolikelihood  Yes                  No
Shahaf et al. (2009): Edge weight + low-treewidth model       No                   Yes
Vs. our work: choice of edge weights; local inputs.
Chow-Liu
For generative models: weight each potential edge (Y_i, Y_j) by the mutual information I(Y_i; Y_j), then choose the max-weight spanning tree. This yields the maximum-likelihood tree-structured model of P(Y).
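As a concrete illustration, here is a minimal sketch of the Chow-Liu recipe on discrete samples; the function names, the dictionary-based counting, and the Kruskal-style union-find are illustrative choices, not code from the talk:

```python
import math
from itertools import combinations

def mutual_information(samples, i, j):
    """Empirical mutual information I(Y_i; Y_j) from discrete samples."""
    n = len(samples)
    joint, cnt_i, cnt_j = {}, {}, {}
    for y in samples:
        joint[(y[i], y[j])] = joint.get((y[i], y[j]), 0) + 1
        cnt_i[y[i]] = cnt_i.get(y[i], 0) + 1
        cnt_j[y[j]] = cnt_j.get(y[j], 0) + 1
    return sum((c / n) * math.log((c / n) / ((cnt_i[a] / n) * (cnt_j[b] / n)))
               for (a, b), c in joint.items())

def chow_liu_tree(samples, d):
    """Max-weight spanning tree over pairwise MI (Kruskal with union-find)."""
    edges = sorted(((mutual_information(samples, i, j), i, j)
                    for i, j in combinations(range(d), 2)), reverse=True)
    parent = list(range(d))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x
    tree = []
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:                       # adding (i, j) keeps the forest acyclic
            parent[ri] = rj
            tree.append((i, j))
    return tree
```

The CRF variants discussed next keep this score-then-spanning-tree skeleton and only swap in different edge weights.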
Chow-Liu for CRFs?
For CRFs with global inputs, the analogue is Global CMI (Conditional Mutual Information): weight edge (Y_i, Y_j) by I(Y_i; Y_j | X).
Pro: the "gold standard." Con: I(Y_i; Y_j | X) is intractable for big X.
Where now?
Global CMI is the gold standard but intractable for big X, so we keep its skeleton and swap in tractable scores.
Algorithmic framework: given data {(y^(i), x^(i))} and an input mapping Y_i → X_i, weight each potential edge (Y_i, Y_j) with Score(i,j), then choose the max spanning tree. The key is local inputs.
Generalized edge scores
Key step: weight edge (Y_i, Y_j) with Score(i,j).
Local Linear Entropy Scores: Score(i,j) is a linear combination of entropies over Y_i, Y_j, X_i, X_j, e.g., Local Conditional Mutual Information I(Y_i; Y_j | X_i, X_j).
Theorem: Assume the true P(Y|X) is a tree CRF with non-trivial parameters. No Local Linear Entropy Score can recover all such tree CRFs, even with exact entropies.
Outline: gold standard; max spanning trees; generalized edge weights; heuristic weights; experiments (synthetic & fMRI).
Heuristic edge scores: piecewise likelihood, Local CMI, DCI.
Piecewise likelihood (PWL)
Sutton and McCallum (2005, 2007) introduced PWL for parameter learning; its main idea is to bound Z(X). For tree CRFs, the optimal parameters give an edge score with local inputs X_ij that bounds the log likelihood. However, PWL fails on a simple counterexample and does badly in practice; its value here is that it helps explain the other edge scores.
Piecewise likelihood (PWL)
[Figure: true P(Y,X), a chain Y_1 - ... - Y_n with inputs X_1, ..., X_n and one strong potential; PWL wrongly chooses edges (2,j) over (j,k).]
Local Conditional Mutual Information
Score(i,j) = I(Y_i; Y_j | X_i, X_j): a decomposable score with local inputs X_ij. It does pretty well in practice but can fail when potentials are strong.
Theorem: Local CMI bounds the log likelihood gain.
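To make the score concrete, here is a hedged sketch of estimating the empirical Local CMI I(Y_i; Y_j | X_i, X_j) from samples of (y, x) pairs with discrete entries; `local_cmi` and the counting scheme are illustrative, not the talk's implementation (which regresses P(Y_ij | X_ij)):

```python
import math
from collections import Counter

def local_cmi(samples, i, j):
    """Empirical I(Y_i; Y_j | X_i, X_j) from samples of (y, x) pairs.
    Conditions only on the local inputs X_i, X_j, never on all of X."""
    n = len(samples)
    c_full, c_yi, c_yj, c_x = Counter(), Counter(), Counter(), Counter()
    for y, x in samples:
        ctx = (x[i], x[j])                 # local conditioning context
        c_full[(ctx, y[i], y[j])] += 1
        c_yi[(ctx, y[i])] += 1
        c_yj[(ctx, y[j])] += 1
        c_x[ctx] += 1
    # I = sum p(x,yi,yj) * log[ p(x,yi,yj) p(x) / (p(x,yi) p(x,yj)) ]
    return sum((c / n) * math.log(c * c_x[ctx] / (c_yi[(ctx, yi)] * c_yj[(ctx, yj)]))
               for (ctx, yi, yj), c in c_full.items())
```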
Local Conditional Mutual Information
[Figure: true P(Y,X), a chain Y_1 - ... - Y_n with inputs X_1, ..., X_n and one strong potential, illustrating Local CMI's failure case.]
Decomposable Conditional Influence (DCI)
An exact measure of the likelihood gain for some edges; an edge score with local inputs X_ij. DCI succeeds on the PWL counterexample and does best in practice.
Experiments: algorithmic details
Given: data {(y^(i), x^(i))} and the input mapping Y_i → X_i.
Compute edge scores: regress P(Y_ij | X_ij), with 10-fold CV to choose the regularization; then choose the max spanning tree.
Parameter learning: conjugate gradient on the L2-regularized log likelihood, again with 10-fold CV to choose the regularization.
Synthetic experiments
Binary Y, X with tabular edge factors; the model factors as P(Y|X) P(X). We use the natural input mapping Y_i → X_i.
Synthetic experiments
P(Y|X) and P(X) are chains & trees. Depending on the factors Φ(Y_ij, X_ij), the joint P(Y,X) is tractable or intractable.
Synthetic experiments
Factors Φ(Y_ij, X_ij) with & without cross-factors; associative (all-positive & alternating +/-) & random factors.
Synthetic: vary # training examples
Setting: tree model, intractable P(Y,X), associative Φ (alternating +/-), |Y| = 40, 1000 test examples.
[Results figures.]
Synthetic: vary model size
Fixed 50 training examples, 1000 test examples.
fMRI experiments
X: 500 fMRI voxels; Y: 218 semantic features (Metal? Manmade? Found in house? ...). Data and setup from Palatucci et al. (2009).
Predictions are decoded (via a hand-built map) to objects (60 total), e.g., bear, screwdriver.
Zero-shot learning: we can predict objects not in the training data, given the decoding.
(Image from http://en.wikipedia.org/wiki/File:FMRI.jpg)
fMRI experiments
Y, X are real-valued, so we use Gaussian factors.
Input mapping: regressed Y_i ~ Y_-i, X and chose the top K inputs; added fixed …
Regularized A & C, b separately. Since CV for parameter learning is very expensive, we do CV on subject 0 only.
Two methods: CRF1 (K = 10) & CRF2 (K = 20).
fMRI experiments
Accuracy (for zero-shot learning): hold out objects i, j and predict Y^(i)', Y^(j)'. If ||Y^(i) - Y^(i)'||_2 < ||Y^(j) - Y^(i)'||_2, then we got i right.
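A minimal sketch of this pairwise test (function names are hypothetical, and the real evaluation averages over all held-out object pairs):

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((u - v) ** 2 for u, v in zip(a, b))

def correct_for_pair(y_true, y_pred, i, j):
    """Object i counts as correct if its predicted semantic-feature vector
    is closer to i's true vector than to held-out object j's true vector."""
    return sq_dist(y_true[i], y_pred[i]) < sq_dist(y_true[j], y_pred[i])

def pairwise_accuracy(y_true, y_pred, pairs):
    """Fraction of held-out (i, j) orderings predicted correctly."""
    hits = sum(correct_for_pair(y_true, y_pred, i, j) for i, j in pairs)
    return hits / len(pairs)
```

Note that the comparison never needs i's image in the training set: only the predicted feature vector and the (hand-built) true vectors, which is what makes the zero-shot evaluation possible.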
fMRI results
Accuracy: CRFs a bit worse. Log likelihood: CRFs better. Squared error: CRFs better.
Conclusion
Scalable learning of CRF structure: we analyzed edge scores for spanning-tree methods. Local Linear Entropy Scores are provably imperfect, but the heuristics have pleasing theoretical properties and empirical success; we recommend DCI.
Future work: templated CRFs; learning the edge score; assumptions on the model/factors that give learnability.
Thank you!
References
M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam, S. Slattery. Learning to Extract Symbolic Knowledge from the World Wide Web. AAAI 1998.
J. Lafferty, A. McCallum, F. Pereira. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. ICML 2001.
M. Palatucci, D. Pomerleau, G. Hinton, T. Mitchell. Zero-Shot Learning with Semantic Output Codes. NIPS 2009.
M. Schmidt, K. Murphy, G. Fung, R. Rosales. Structure Learning in Random Fields for Heart Motion Abnormality Detection. CVPR 2008.
D. Shahaf, A. Chechetka, C. Guestrin. Learning Thin Junction Trees via Graph Cuts. AISTATS 2009.
C. Sutton, A. McCallum. Piecewise Training of Undirected Models. UAI 2005.
C. Sutton, A. McCallum. Piecewise Pseudolikelihood for Efficient Training of Conditional Random Fields. ICML 2007.
A. Torralba, K. Murphy, W. Freeman. Contextual Models for Object Detection Using Boosted Random Fields. NIPS 2004.
(extra slides)
B: Score Decay Assumption
B: Example complexity
Future work: Templated CRFs
Learn a template, e.g., Score(i,j) = DCI(i,j) plus a parametrization.
WebKB (Craven et al., 1998): given webpages {(Y_i = page type, X_i = content)}, use the template to choose a tree over pages and to instantiate parameters, giving P(Y|X=x) = P(pages' types | pages' content).
This requires local inputs and is potentially very fast.
Future work: Learn the score
Given training queries (data plus a ground-truth model, e.g., from an expensive structure-learning method), learn a function Score(Y_i, Y_j) for the MST algorithm.