Learning Tree Conditional Random Fields
Joseph K. Bradley and Carlos Guestrin
We want to model conditional correlations
Example: reading people's minds. X: fMRI voxels; Y: semantic features (Metal? Manmade? Found in house? ...). Predict each Y_i from X independently? No: the Y_i are correlated, e.g., Person? & Live in water?, or Colorful? & Yellow?
(Image from http://en.wikipedia.org/wiki/File:FMRI.jpg; application from Palatucci et al., 2009)
Conditional Random Fields (CRFs) (Lafferty et al., 2001)
Pro: avoid modeling P(X). In fMRI, X ≈ 500 to 10,000 voxels.
Conditional Random Fields (CRFs)
The edges among Y_1, ..., Y_4 encode the conditional independence structure of P(Y|X).
Con: the normalization Z(x) depends on X = x, so Z(x) must be computed anew for each inference. Exact inference is intractable in general, and approximate inference is expensive.
Solution: use tree CRFs. Pro: fast, exact inference (while still avoiding modeling P(X)).
CRF Structure Learning
Tree CRFs give fast, exact inference and avoid modeling P(X); structure learning also performs feature selection.
We score edges with local inputs (scalable) instead of global inputs (not scalable).
This work
Goals: structured conditional models P(Y|X); scalable methods.
Approach: tree structures, local inputs X_ij, max spanning trees.
Outline: gold standard; max spanning trees; generalized edge weights; heuristic weights; experiments (synthetic & fMRI).
Related work
Method                                                        Feature selection?   Tractable models?
Torralba et al. (2004): Boosted Random Fields                 Yes                  No
Schmidt et al. (2008): Block-L1 regularized pseudolikelihood  Yes                  No
Shahaf et al. (2009): Edge weight + low-treewidth model       No                   Yes
Vs. our work: choice of edge weights; local inputs.
Chow-Liu
For generative models: weight each potential edge (Y_i, Y_j) by the mutual information I(Y_i; Y_j), then choose the max-weight spanning tree. This yields the maximum-likelihood tree-structured model of P(Y).
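As a concrete illustration, here is a minimal sketch of the Chow-Liu recipe on discrete samples; the function names, the dictionary-based counting, and the Kruskal-style union-find are illustrative choices, not code from the talk:

```python
import math
from itertools import combinations

def mutual_information(samples, i, j):
    """Empirical mutual information I(Y_i; Y_j) from discrete samples."""
    n = len(samples)
    joint, cnt_i, cnt_j = {}, {}, {}
    for y in samples:
        joint[(y[i], y[j])] = joint.get((y[i], y[j]), 0) + 1
        cnt_i[y[i]] = cnt_i.get(y[i], 0) + 1
        cnt_j[y[j]] = cnt_j.get(y[j], 0) + 1
    return sum((c / n) * math.log((c / n) / ((cnt_i[a] / n) * (cnt_j[b] / n)))
               for (a, b), c in joint.items())

def chow_liu_tree(samples, d):
    """Max-weight spanning tree over pairwise MI (Kruskal with union-find)."""
    edges = sorted(((mutual_information(samples, i, j), i, j)
                    for i, j in combinations(range(d), 2)), reverse=True)
    parent = list(range(d))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x
    tree = []
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:                       # adding (i, j) keeps the forest acyclic
            parent[ri] = rj
            tree.append((i, j))
    return tree
```

The CRF variants discussed next keep this score-then-spanning-tree skeleton and only swap in different edge weights.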
Chow-Liu for CRFs?
For CRFs with global inputs, the analogue is Global CMI (Conditional Mutual Information): weight edge (Y_i, Y_j) by I(Y_i; Y_j | X).
Pro: the "gold standard." Con: I(Y_i; Y_j | X) is intractable for big X.
Where now?
Global CMI is the gold standard but intractable for big X, so we keep its skeleton and swap in tractable scores.
Algorithmic framework: given data {(y^(i), x^(i))} and an input mapping Y_i → X_i, weight each potential edge (Y_i, Y_j) with Score(i,j), then choose the max spanning tree. The key is local inputs.
Generalized edge scores
Key step: weight edge (Y_i, Y_j) with Score(i,j).
Local Linear Entropy Scores: Score(i,j) is a linear combination of entropies over Y_i, Y_j, X_i, X_j, e.g., Local Conditional Mutual Information I(Y_i; Y_j | X_i, X_j).
Theorem: Assume the true P(Y|X) is a tree CRF with non-trivial parameters. No Local Linear Entropy Score can recover all such tree CRFs, even with exact entropies.
Outline: gold standard; max spanning trees; generalized edge weights; heuristic weights; experiments (synthetic & fMRI).
Heuristic edge scores: piecewise likelihood, Local CMI, DCI.
Piecewise likelihood (PWL)
Sutton and McCallum (2005, 2007) introduced PWL for parameter learning; its main idea is to bound Z(X). For tree CRFs, the optimal parameters give an edge score with local inputs X_ij that bounds the log likelihood. However, PWL fails on a simple counterexample and does badly in practice; its value here is that it helps explain the other edge scores.
Piecewise likelihood (PWL)
[Figure: true P(Y,X), a chain Y_1 - ... - Y_n with inputs X_1, ..., X_n and one strong potential; PWL wrongly chooses edges (2,j) over (j,k).]
Local Conditional Mutual Information
Score(i,j) = I(Y_i; Y_j | X_i, X_j): a decomposable score with local inputs X_ij. It does pretty well in practice but can fail when potentials are strong.
Theorem: Local CMI bounds the log likelihood gain.
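To make the score concrete, here is a hedged sketch of estimating the empirical Local CMI I(Y_i; Y_j | X_i, X_j) from samples of (y, x) pairs with discrete entries; `local_cmi` and the counting scheme are illustrative, not the talk's implementation (which regresses P(Y_ij | X_ij)):

```python
import math
from collections import Counter

def local_cmi(samples, i, j):
    """Empirical I(Y_i; Y_j | X_i, X_j) from samples of (y, x) pairs.
    Conditions only on the local inputs X_i, X_j, never on all of X."""
    n = len(samples)
    c_full, c_yi, c_yj, c_x = Counter(), Counter(), Counter(), Counter()
    for y, x in samples:
        ctx = (x[i], x[j])                 # local conditioning context
        c_full[(ctx, y[i], y[j])] += 1
        c_yi[(ctx, y[i])] += 1
        c_yj[(ctx, y[j])] += 1
        c_x[ctx] += 1
    # I = sum p(x,yi,yj) * log[ p(x,yi,yj) p(x) / (p(x,yi) p(x,yj)) ]
    return sum((c / n) * math.log(c * c_x[ctx] / (c_yi[(ctx, yi)] * c_yj[(ctx, yj)]))
               for (ctx, yi, yj), c in c_full.items())
```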
Local Conditional Mutual Information
[Figure: true P(Y,X), a chain Y_1 - ... - Y_n with inputs X_1, ..., X_n and one strong potential, illustrating Local CMI's failure case.]
Decomposable Conditional Influence (DCI)
An exact measure of the likelihood gain for some edges; an edge score with local inputs X_ij. DCI succeeds on the PWL counterexample and does best in practice.
Experiments: algorithmic details
Given: data {(y^(i), x^(i))} and the input mapping Y_i → X_i.
Compute edge scores: regress P(Y_ij | X_ij), with 10-fold CV to choose the regularization; then choose the max spanning tree.
Parameter learning: conjugate gradient on the L2-regularized log likelihood, again with 10-fold CV to choose the regularization.
Synthetic experiments
Binary Y, X with tabular edge factors; the model factors as P(Y|X) P(X). We use the natural input mapping Y_i → X_i.
Synthetic experiments
P(Y|X) and P(X) are chains & trees. Depending on the factors Φ(Y_ij, X_ij), the joint P(Y,X) is tractable or intractable.
Synthetic experiments
Factors Φ(Y_ij, X_ij) with & without cross-factors; associative (all-positive & alternating +/-) & random factors.
Synthetic: vary # training examples
Setting: tree model, intractable P(Y,X), associative Φ (alternating +/-), |Y| = 40, 1000 test examples.
[Results figures.]
Synthetic: vary model size
Fixed 50 training examples, 1000 test examples.
fMRI experiments
X: 500 fMRI voxels; Y: 218 semantic features (Metal? Manmade? Found in house? ...). Data and setup from Palatucci et al. (2009).
Predictions are decoded (via a hand-built map) to objects (60 total), e.g., bear, screwdriver.
Zero-shot learning: we can predict objects not in the training data, given the decoding.
(Image from http://en.wikipedia.org/wiki/File:FMRI.jpg)
fMRI experiments
Y, X are real-valued, so we use Gaussian factors.
Input mapping: regressed Y_i ~ Y_-i, X and chose the top K inputs; added fixed …
Regularized A & C, b separately. Since CV for parameter learning is very expensive, we do CV on subject 0 only.
Two methods: CRF1 (K = 10) & CRF2 (K = 20).
fMRI experiments
Accuracy (for zero-shot learning): hold out objects i, j and predict Y^(i)', Y^(j)'. If ||Y^(i) - Y^(i)'||_2 < ||Y^(j) - Y^(i)'||_2, then we got i right.
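A minimal sketch of this pairwise test (function names are hypothetical, and the real evaluation averages over all held-out object pairs):

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((u - v) ** 2 for u, v in zip(a, b))

def correct_for_pair(y_true, y_pred, i, j):
    """Object i counts as correct if its predicted semantic-feature vector
    is closer to i's true vector than to held-out object j's true vector."""
    return sq_dist(y_true[i], y_pred[i]) < sq_dist(y_true[j], y_pred[i])

def pairwise_accuracy(y_true, y_pred, pairs):
    """Fraction of held-out (i, j) orderings predicted correctly."""
    hits = sum(correct_for_pair(y_true, y_pred, i, j) for i, j in pairs)
    return hits / len(pairs)
```

Note that the comparison never needs i's image in the training set: only the predicted feature vector and the (hand-built) true vectors, which is what makes the zero-shot evaluation possible.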
fMRI results
Accuracy: CRFs a bit worse. Log likelihood: CRFs better. Squared error: CRFs better.
Conclusion
Scalable learning of CRF structure: we analyzed edge scores for spanning-tree methods. Local Linear Entropy Scores are provably imperfect, but the heuristics have pleasing theoretical properties and empirical success; we recommend DCI.
Future work: templated CRFs; learning the edge score; assumptions on the model/factors that give learnability.
Thank you!
References
M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam, S. Slattery. Learning to Extract Symbolic Knowledge from the World Wide Web. AAAI 1998.
J. Lafferty, A. McCallum, F. Pereira. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. ICML 2001.
M. Palatucci, D. Pomerleau, G. Hinton, T. Mitchell. Zero-Shot Learning with Semantic Output Codes. NIPS 2009.
M. Schmidt, K. Murphy, G. Fung, R. Rosales. Structure Learning in Random Fields for Heart Motion Abnormality Detection. CVPR 2008.
D. Shahaf, A. Chechetka, C. Guestrin. Learning Thin Junction Trees via Graph Cuts. AISTATS 2009.
C. Sutton, A. McCallum. Piecewise Training of Undirected Models. UAI 2005.
C. Sutton, A. McCallum. Piecewise Pseudolikelihood for Efficient Training of Conditional Random Fields. ICML 2007.
A. Torralba, K. Murphy, W. Freeman. Contextual Models for Object Detection Using Boosted Random Fields. NIPS 2004.
(extra slides)
B: Score Decay Assumption
B: Example complexity
Future work: Templated CRFs
Learn a template, e.g., Score(i,j) = DCI(i,j) plus a parametrization.
WebKB (Craven et al., 1998): given webpages {(Y_i = page type, X_i = content)}, use the template to choose a tree over pages and to instantiate parameters, giving P(Y|X=x) = P(pages' types | pages' content).
This requires local inputs and is potentially very fast.
Future work: Learn the score
Given training queries (data plus a ground-truth model, e.g., from an expensive structure-learning method), learn a function Score(Y_i, Y_j) for the MST algorithm.