Learning Tree Conditional Random Fields
Joseph K. Bradley, Carlos Guestrin
We want to model conditional correlations.
Reading people's minds: X = fMRI voxels; Y = semantic features (Metal? Manmade? Found in house? ...). Goal: predict Y from X.
Predict independently, Y_i ~ X for each i? But the Y_i are correlated! E.g., Person? & Live in water?; Colorful? & Yellow?
(Application from Palatucci et al., 2009)
Conditional Random Fields (CRFs) (Lafferty et al., 2001)
Pro: Avoid modeling P(X). (In fMRI, X ≈ 500 to 10,000 voxels.)
Edges in the graph over Y1, Y2, Y3, Y4 encode conditional independence structure.
Con: The normalization Z(x) depends on X = x, so Z(x) must be recomputed for each inference. Exact inference is intractable in general, and approximate inference is expensive.
Solution: Use tree CRFs! Pro: Fast, exact inference.
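For concreteness, the tree CRF the slides describe has the standard form (following Lafferty et al., 2001; notation here is mine):

```latex
P(\mathbf{Y} \mid \mathbf{X} = x) \;=\; \frac{1}{Z(x)} \prod_{(i,j) \in T} \phi_{ij}(Y_i, Y_j, x),
\qquad
Z(x) \;=\; \sum_{\mathbf{y}} \prod_{(i,j) \in T} \phi_{ij}(y_i, y_j, x).
```

The normalization Z(x) must be recomputed for every value of x, which is the "con" above; when T is a tree, however, Z(x) and all marginals can be computed exactly by belief propagation in time linear in |Y|.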
CRF Structure Learning
Tree CRFs: fast, exact inference; avoid modeling P(X).
Needed: structure learning and feature selection.
Use local inputs (scalable) instead of global inputs (not scalable).
This work
Goals: structured conditional models P(Y|X); scalable methods. Approach: tree structures, local inputs X_ij, max spanning trees.
Outline: gold standard; max spanning trees; generalized edge weights; heuristic weights; experiments (synthetic & fMRI).
Related work
Torralba et al. (2004), Boosted Random Fields: feature selection yes; tractable models no.
Schmidt et al. (2008), block-L1 regularized pseudolikelihood: tractable models no.
Shahaf et al. (2009), edge weight + low-treewidth model: feature selection no; tractable models yes.
Vs. our work: choice of edge weights; local inputs.
Chow-Liu
For generative models: weight each edge (Y_i, Y_j) by the mutual information I(Y_i; Y_j) and choose the maximum spanning tree; this gives the maximum-likelihood tree-structured model of P(Y).
Chow-Liu for CRFs?
For CRFs with global inputs, the analogous edge weight is the Global CMI (Conditional Mutual Information): Score(i,j) = I(Y_i; Y_j | X).
Pro: "gold standard." Con: I(Y_i; Y_j | X) is intractable for big X.
Where now?
Global CMI is the "gold standard," but I(Y_i; Y_j | X) is intractable for big X. So, an algorithmic framework:
Given: data {(y^(i), x^(i))} and an input mapping Y_i → X_i.
1. Weight each potential edge (Y_i, Y_j) with Score(i,j).
2. Choose the max spanning tree.
Local inputs!
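The framework above (score candidate edges, take a maximum spanning tree) can be sketched as follows. The empirical mutual-information score and the data layout here are illustrative stand-ins; the paper's heuristics (PWL, Local CMI, DCI) would plug in as `score`, conditioning on the local inputs X_ij.

```python
import math
from itertools import combinations

def empirical_mi(samples, i, j):
    """Empirical mutual information I(Y_i; Y_j) from discrete samples.

    Stand-in edge score; a CRF heuristic would also condition on X_ij."""
    n = len(samples)
    joint, pi, pj = {}, {}, {}
    for y in samples:
        joint[(y[i], y[j])] = joint.get((y[i], y[j]), 0) + 1
        pi[y[i]] = pi.get(y[i], 0) + 1
        pj[y[j]] = pj.get(y[j], 0) + 1
    mi = 0.0
    for (a, b), c in joint.items():
        # p(a,b) * log( p(a,b) / (p(a) p(b)) ), with counts folded in
        mi += (c / n) * math.log(c * n / (pi[a] * pj[b]))
    return mi

def max_spanning_tree(num_vars, score):
    """Kruskal's algorithm over all candidate edges, maximizing total score."""
    parent = list(range(num_vars))
    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path compression
            u = parent[u]
        return u
    edges = sorted(combinations(range(num_vars), 2),
                   key=lambda e: score(*e), reverse=True)
    tree = []
    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:       # adding (u,v) creates no cycle
            parent[ru] = rv
            tree.append((u, v))
    return tree
```

For example, with samples in which Y_0 and Y_1 are perfectly correlated and Y_2 is independent, the learned tree contains the edge (0, 1).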
Generalized edge scores
Key step: weight edge (Y_i, Y_j) with Score(i,j).
Local Linear Entropy Scores: Score(i,j) = a linear combination of entropies over Y_i, Y_j, X_i, X_j. E.g., Local Conditional Mutual Information.
Theorem: Assume the true P(Y|X) is a tree CRF (with non-trivial parameters). No Local Linear Entropy Score can recover all such tree CRFs (even with exact entropies).
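In entropy terms (my notation), a Local Linear Entropy Score is any score of the form below, and Local CMI is one instance of it via the standard entropy expansion:

```latex
\mathrm{Score}(i,j) \;=\; \sum_{S \subseteq \{Y_i, Y_j, X_i, X_j\}} \alpha_S \, H(S),
\qquad
I(Y_i; Y_j \mid X_{ij}) \;=\; H(Y_i, X_{ij}) + H(Y_j, X_{ij}) - H(Y_i, Y_j, X_{ij}) - H(X_{ij}),
```

where X_ij = (X_i, X_j). The theorem says that no fixed choice of coefficients α_S recovers every tree CRF.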
Outline: gold standard; max spanning trees; generalized edge weights; heuristic weights; experiments (synthetic & fMRI).
Heuristics: piecewise likelihood (PWL), Local CMI, DCI.
Piecewise likelihood (PWL)
Sutton and McCallum (2005, 2007) used PWL for parameter learning. Main idea: bound Z(X). For tree CRFs, the optimal parameters give an edge score with local inputs X_ij that bounds the log likelihood.
But PWL fails on a simple counterexample and does badly in practice; its main value here is helping explain the other edge scores.
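The "bound Z(X)" step rests on a standard inequality for nonnegative factors (this is the piecewise bound of Sutton & McCallum; notation mine): the partition function is at most the product of per-edge local partition functions,

```latex
Z(x) \;=\; \sum_{\mathbf{y}} \prod_{e} \phi_e(y_e, x)
\;\le\; \prod_{e} \sum_{y_e} \phi_e(y_e, x)
\;=\; \prod_{e} Z_e(x),
```

since expanding the right-hand product yields every term of Z(x) plus additional nonnegative terms. Replacing log Z(x) with Σ_e log Z_e(x) therefore lower-bounds the log likelihood, and the bound decomposes into per-edge terms that need only the local inputs X_ij.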
Piecewise likelihood (PWL) counterexample. [Diagram: true P(Y,X) is a chain Y1-Y2-...-Yn with inputs X1,...,Xn; a strong potential at Y2 makes PWL choose edges (2,j) over (j,k), recovering the wrong tree.]
Local Conditional Mutual Information (Local CMI)
A decomposable score with local inputs X_ij.
Theorem: Local CMI bounds the log likelihood gain.
Does pretty well in practice, but can fail with strong potentials.
Local CMI counterexample. [Diagram: true P(Y,X) is a chain Y1-Y2-...-Yn with inputs X1,...,Xn; a strong potential leads Local CMI to choose incorrect edges among Y1, Y2, Y3.]
Decomposable Conditional Influence (DCI)
Derived from PWL. An exact measure of likelihood gain for some edges; an edge score with local inputs X_ij.
Succeeds on the counterexample and does best in practice.
Experiments: algorithmic details
Given: data {(y^(i), x^(i))}; input mapping Y_i → X_i.
Compute edge scores: regress P(Y_ij | X_ij) (10-fold CV to choose regularization).
Choose the max spanning tree.
Parameter learning: conjugate gradient on the L2-regularized log likelihood (10-fold CV to choose regularization).
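The cross-validation pattern above can be sketched as follows. The paper fits CRF parameters by conjugate gradient; the one-dimensional ridge regression, deterministic fold scheme, and λ grid here are illustrative stand-ins for the regularized fit being tuned.

```python
def ridge_1d(xs, ys, lam):
    """Closed-form L2-regularized least squares for y ~ w*x (no intercept)."""
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    return sxy / (sxx + lam)

def cv_choose_lambda(xs, ys, lambdas, folds=10):
    """Pick lambda minimizing total held-out squared error over k folds."""
    n = len(xs)
    best_lam, best_err = None, float("inf")
    for lam in lambdas:
        err = 0.0
        for k in range(folds):
            test_idx = set(range(k, n, folds))  # simple deterministic folds
            tr_x = [x for i, x in enumerate(xs) if i not in test_idx]
            tr_y = [y for i, y in enumerate(ys) if i not in test_idx]
            w = ridge_1d(tr_x, tr_y, lam)
            err += sum((ys[i] - w * xs[i]) ** 2 for i in test_idx)
        if err < best_err:
            best_lam, best_err = lam, err
    return best_lam
```

On noise-free data the smallest λ in the grid wins, since any shrinkage only hurts held-out error.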
Synthetic experiments
Binary Y, X; tabular edge factors. Data generated from P(Y|X)·P(X); use the natural input mapping Y_i → X_i. [Diagram: chain Y1-Y2-...-Yn with inputs X1,...,Xn.]
Synthetic experiments
P(Y|X), P(X): chains & trees. P(Y,X): tractable & intractable, depending on the factors Φ(Y_ij, X_ij). [Diagrams: tree over Y1,...,Y5 with inputs X1,...,X5, with factor structures making P(Y,X) tractable vs. intractable.]
Synthetic experiments
P(Y|X): chains & trees. P(Y,X): tractable & intractable. Factors Φ(Y_ij, X_ij) with & without cross-factors [diagram: chain Y1-...-Yn with cross factors to X1,...,Xn]. Associative (all positive & alternating +/-) & random factors.
Synthetic: vary # training examples. [Plots: test-set performance vs. number of training examples; setting shown: tree structure, intractable P(Y,X), associative Φ (alternating +/-).]
Synthetic: vary model size. Fixed 50 training examples, 1000 test examples.
fMRI experiments
X (500 fMRI voxels) → predict → Y (218 semantic features: Metal? Manmade? Found in house? ...) → decode (hand-built map) → object (60 total: bear, screwdriver, ...).
Data and setup from Palatucci et al. (2009). Zero-shot learning: can predict objects not in the training data (given the decoding).
fMRI experiments
X → predict → Y; both real-valued, with Gaussian factors.
Input mapping: regressed Y_i ~ Y_-i, X; chose the top K inputs; added fixed
Regularized A & C, b separately. CV for parameter learning is very expensive, so CV was done on subject 0 only.
Two methods: CRF1 (K=10) and CRF2 (K=20).
fMRI experiments: accuracy (for zero-shot learning)
Hold out objects i, j. Predict Y^(i)', Y^(j)'. If ||Y^(i) - Y^(i)'||_2 < ||Y^(j) - Y^(i)'||_2, then we got i right.
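The two-way zero-shot test can be sketched directly from the definition above (the vectors here are toy; in the experiments Y is the 218-dimensional semantic feature vector):

```python
def two_way_correct(y_true_i, y_true_j, y_pred_i):
    """Held-out object i counts as correct when the prediction for i is
    closer (squared L2) to i's true semantic vector than to j's."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return sq_dist(y_true_i, y_pred_i) < sq_dist(y_true_j, y_pred_i)
```

E.g., a prediction near object i's true vector is scored correct for i, and the same prediction compared the other way around is scored incorrect for j.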
fMRI results
Accuracy: CRFs a bit worse.
Log likelihood: CRFs better.
Squared error: CRFs better.
Conclusion
Scalable learning of CRF structure. We analyzed edge scores for spanning-tree methods: Local Linear Entropy Scores are provably imperfect, but the heuristics have pleasing theoretical properties and empirical success; we recommend DCI.
Future work: templated CRFs; learning the edge score; assumptions on the model/factors which give learnability.
Thank you!
References
M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam, S. Slattery. Learning to Extract Symbolic Knowledge from the World Wide Web. AAAI 1998.
J.D. Lafferty, A. McCallum, F.C.N. Pereira. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. ICML 2001.
M. Palatucci, D. Pomerleau, G. Hinton, T. Mitchell. Zero-Shot Learning with Semantic Output Codes. NIPS 2009.
M. Schmidt, K. Murphy, G. Fung, R. Rosales. Structure learning in random fields for heart motion abnormality detection. CVPR 2008.
D. Shahaf, A. Chechetka, C. Guestrin. Learning Thin Junction Trees via Graph Cuts. AISTATS 2009.
C. Sutton, A. McCallum. Piecewise training of undirected models. UAI 2005.
C. Sutton, A. McCallum. Piecewise pseudolikelihood for efficient training of conditional random fields. ICML 2007.
A. Torralba, K. Murphy, W. Freeman. Contextual models for object detection using boosted random fields. NIPS 2004.
(extra slides)
B: Score Decay Assumption
B: Example complexity
Future work: Templated CRFs
Learn a template, e.g., Score(i,j) = DCI(i,j) plus a parametrization.
WebKB (Craven et al., 1998): given webpages {(Y_i = page type, X_i = content)}, use the template to choose a tree over pages and instantiate parameters, giving P(Y|X=x) = P(pages' types | pages' content). Requires local inputs; potentially very fast.
Future work: Learn the score
Given training queries (data plus a ground-truth model, e.g., from an expensive structure-learning method), learn a function Score(Y_i, Y_j) for the MST algorithm.