1 Learning Tree Conditional Random Fields. Joseph K. Bradley, Carlos Guestrin.

2 We want to model conditional correlations
Reading people's minds: X = fMRI voxels, Y = semantic features (Metal? Manmade? Found in house? ...); predict Y from X.
Predict independently, Y_i ~ X for all i? But the Y_i are correlated! E.g., Person? and Live in water?; Colorful? and Yellow?
Image from http://en.wikipedia.org/wiki/File:FMRI.jpg (application from Palatucci et al., 2009)

3 Conditional Random Fields (CRFs) (Lafferty et al., 2001)
Pro: avoid modeling P(X). In fMRI, X has roughly 500 to 10,000 voxels.

4 Conditional Random Fields (CRFs)
Pro: avoid modeling P(X).
[Figure: graph over Y_1, ..., Y_4; the edges encode conditional independence structure.]

5 Conditional Random Fields (CRFs)
Pro: avoid modeling P(X).
[Figure: the edges over Y_1, ..., Y_4 encode conditional independence structure.]

6 Conditional Random Fields (CRFs)
Pro: avoid modeling P(X).
Con: the normalization Z(x) depends on X = x, so Z(x) must be computed for each inference.
[Figure: graph over Y_1, ..., Y_4.]
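For reference, the conditional distribution a pairwise CRF defines can be written as below; the notation (edge set E, pairwise factors φ_ij) is standard and assumed here, not copied from the slide.

```latex
P(\mathbf{y} \mid \mathbf{x})
  = \frac{1}{Z(\mathbf{x})} \prod_{(i,j) \in E} \phi_{ij}(y_i, y_j, \mathbf{x}),
\qquad
Z(\mathbf{x}) = \sum_{\mathbf{y}'} \prod_{(i,j) \in E} \phi_{ij}(y'_i, y'_j, \mathbf{x}).
```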

7 Conditional Random Fields (CRFs)
Pro: avoid modeling P(X).
Con: compute Z(x) for each inference. Exact inference is intractable in general, and approximate inference is expensive.
Use tree CRFs!
[Figure: graph over Y_1, ..., Y_4.]

8 Conditional Random Fields (CRFs)
Pro: avoid modeling P(X).
Con: compute Z(x) for each inference.
Use tree CRFs! Pro: fast, exact inference.
[Figure: tree over Y_1, ..., Y_4.]

9 CRF Structure Learning
Tree CRFs: fast, exact inference; avoid modeling P(X).
Structure learning; feature selection.
[Figure: tree over Y_1, ..., Y_4.]

10 CRF Structure Learning
Tree CRFs: fast, exact inference; avoid modeling P(X).
Use local inputs (scalable) instead of global inputs (not scalable).

11 This work
Goals: structured conditional models P(Y|X); scalable methods. Approach: tree structures, local inputs X_ij, max spanning trees.
Outline: gold standard; max spanning trees; generalized edge weights; heuristic weights; experiments on synthetic data and fMRI.

12 Related work
Torralba et al. (2004), boosted random fields: feature selection yes; tractable models no.
Schmidt et al. (2008), block-L1 regularized pseudolikelihood: feature selection yes; tractable models no.
Shahaf et al. (2009), edge weight + low-treewidth model: feature selection no; tractable models yes.
Vs. our work: choice of edge weights; local inputs.

13 Chow-Liu
For generative models, Chow-Liu weights each candidate edge (Y_i, Y_j) by the mutual information I(Y_i; Y_j) and chooses a maximum spanning tree; this maximizes the likelihood over tree-structured distributions.
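In symbols (standard Chow-Liu; notation assumed rather than taken from the slide):

```latex
\text{Score}_{\text{CL}}(i,j) = I(Y_i; Y_j)
  = \sum_{y_i, y_j} P(y_i, y_j)\,\log \frac{P(y_i, y_j)}{P(y_i)\,P(y_j)}.
```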

14 Chow-Liu for CRFs?
For CRFs with global inputs, the analogous edge weight is Global CMI (Conditional Mutual Information).
Pro: a "gold standard". Con: I(Y_i; Y_j | X) is intractable for large X.
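Written out (notation assumed), the Global CMI score conditions on the entire input X, which is what makes it intractable when X has hundreds or thousands of voxels:

```latex
\text{Score}_{\text{Global CMI}}(i,j) = I(Y_i; Y_j \mid X)
  = \mathbb{E}_{X}\!\left[\, \sum_{y_i, y_j} P(y_i, y_j \mid X)\,
      \log \frac{P(y_i, y_j \mid X)}{P(y_i \mid X)\, P(y_j \mid X)} \right].
```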

15 Where now?
Global CMI (Conditional Mutual Information). Pro: a "gold standard". Con: I(Y_i; Y_j | X) is intractable for large X.
Algorithmic framework:
Given: data {(y^(i), x^(i))} and an input mapping Y_i → X_i.
Weight each potential edge (Y_i, Y_j) with Score(i, j).
Choose a max spanning tree.
Local inputs!
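A minimal sketch of this framework in Python, assuming a user-supplied score(i, j) function and using networkx for the maximum spanning tree; the function and variable names are illustrative, not from the authors' code.

```python
# Sketch of the edge-scoring + maximum-spanning-tree framework on slide 15.
# `score(i, j)` stands in for any of the edge scores discussed later
# (piecewise likelihood, Local CMI, DCI).
import itertools
import networkx as nx

def learn_tree_structure(n_labels, score):
    """Weight every candidate edge (Y_i, Y_j) with score(i, j) and
    return the edges of a maximum spanning tree over the labels."""
    g = nx.Graph()
    for i, j in itertools.combinations(range(n_labels), 2):
        g.add_edge(i, j, weight=score(i, j))
    tree = nx.maximum_spanning_tree(g, weight="weight")
    return sorted(tree.edges())

# Example with a dummy score; replace with a real edge score.
if __name__ == "__main__":
    import random
    random.seed(0)
    print(learn_tree_structure(5, score=lambda i, j: random.random()))
```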

16 Generalized edge scores
Key step: weight edge (Y_i, Y_j) with Score(i, j).
Local Linear Entropy Scores: Score(i, j) is a linear combination of entropies over Y_i, Y_j, X_i, X_j. E.g., Local Conditional Mutual Information.
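Concretely, the Local CMI score conditions only on the local inputs X_ij = (X_i, X_j) rather than on all of X (notation assumed):

```latex
\text{Score}_{\text{Local CMI}}(i,j) = I(Y_i; Y_j \mid X_i, X_j).
```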

17 Generalized edge scores
Key step: weight edge (Y_i, Y_j) with Score(i, j). Local Linear Entropy Scores: Score(i, j) is a linear combination of entropies over Y_i, Y_j, X_i, X_j.
Theorem: Assume the true P(Y|X) is a tree CRF (with non-trivial parameters). Then no Local Linear Entropy Score can recover all such tree CRFs, even with exact entropies.

18 Outline
Gold standard; max spanning trees; generalized edge weights; heuristic weights (this section); experiments: synthetic & fMRI.
Heuristics: piecewise likelihood (PWL), Local CMI, DCI.

19 Piecewise likelihood (PWL)
Sutton and McCallum (2005, 2007): PWL for parameter learning. Main idea: bound Z(x).
For tree CRFs, optimal parameters give an edge score with local inputs X_ij that bounds the log likelihood. PWL fails on a simple counterexample and does badly in practice, but it helps explain the other edge scores.
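A common way to write the piecewise objective for a pairwise model, following Sutton and McCallum (the exact parameterization used in the talk is not shown on the slide): each edge factor is normalized locally, and since Z(x) is at most the product of the local normalizers, the piecewise likelihood lower-bounds the true log likelihood.

```latex
\ell_{\text{PW}}(\theta; y, x)
  = \sum_{(i,j)} \log
      \frac{\phi_{ij}(y_i, y_j, x_{ij}; \theta)}
           {\sum_{y'_i, y'_j} \phi_{ij}(y'_i, y'_j, x_{ij}; \theta)}
  \;\le\; \log P_\theta(y \mid x).
```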

20 Piecewise likelihood (PWL)
[Figure: counterexample. The true P(Y,X) is a chain over Y_1, ..., Y_n with inputs X_1, ..., X_n; a strong potential makes PWL choose edges (2, j) over edges (j, k), yielding the wrong tree.]

21 Local Conditional Mutual Information (Local CMI)
A decomposable edge score with local inputs X_ij. Does pretty well in practice, but can fail with strong potentials.
Theorem: Local CMI bounds the log likelihood gain.
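To illustrate why this score is decomposable, here is a plug-in estimator of I(Y_i; Y_j | X_i, X_j) for discrete data, as in the synthetic experiments below; this is an illustrative sketch, not the authors' implementation.

```python
# Plug-in (empirical-count) estimate of the Local CMI edge score for
# discrete variables. All arguments are parallel lists/arrays of samples.
from collections import Counter
from math import log

def local_cmi(yi, yj, xi, xj):
    """Estimate I(Y_i; Y_j | X_i, X_j) from samples of (y_i, y_j, x_i, x_j)."""
    n = len(yi)
    c_xy = Counter(zip(xi, xj, yi, yj))   # counts of (x_i, x_j, y_i, y_j)
    c_x = Counter(zip(xi, xj))            # counts of (x_i, x_j)
    c_xi = Counter(zip(xi, xj, yi))       # counts of (x_i, x_j, y_i)
    c_xj = Counter(zip(xi, xj, yj))       # counts of (x_i, x_j, y_j)
    total = 0.0
    for (a, b, u, v), c in c_xy.items():
        # I = sum P(x, y_i, y_j) * log[ P(x, y_i, y_j) P(x) / (P(x, y_i) P(x, y_j)) ];
        # the sample size n cancels inside the log, so counts can be used directly.
        total += (c / n) * log(c * c_x[(a, b)] / (c_xi[(a, b, u)] * c_xj[(a, b, v)]))
    return total
```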

22 Local Conditional Mutual Information (Local CMI)
[Figure: counterexample. The true P(Y,X) is a chain over Y_1, ..., Y_n with inputs X_1, ..., X_n; a strong potential causes Local CMI to pick the wrong edges among Y_1, Y_2, Y_3.]

23 Decomposable Conditional Influence (DCI)
An exact measure of the gain for some edges; an edge score with local inputs X_ij. Succeeds on the counterexample and does best in practice.
[Figure/equation: obtained from PWL, illustrated on Y_1, Y_2, Y_3.]

24 Experiments: algorithmic details
Given: data {(y^(i), x^(i))} and an input mapping Y_i → X_i.
Compute edge scores: regress P(Y_ij | X_ij), with 10-fold CV to choose the regularization. Choose a max spanning tree.
Parameter learning: conjugate gradient on the L2-regularized log likelihood, with 10-fold CV to choose the regularization.
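A hedged sketch of the per-edge regression step: for binary Y, one way to fit a local conditional model P(Y_i, Y_j | X_i, X_j) is a 4-class logistic regression over the joint states, choosing the L2 regularization by 10-fold CV. This mirrors the description on the slide but is not the authors' code; scikit-learn is assumed.

```python
# Fit P(Y_ij | X_ij) for one candidate edge, with 10-fold CV over regularization.
import numpy as np
from sklearn.linear_model import LogisticRegressionCV

def fit_local_conditional(Xi, Xj, Yi, Yj):
    """Xi, Xj: (n, d_i), (n, d_j) local input arrays; Yi, Yj: binary (n,) labels."""
    X_local = np.hstack([Xi, Xj])    # local inputs X_ij
    y_joint = 2 * Yi + Yj            # encode (Y_i, Y_j) as one of 4 classes
    model = LogisticRegressionCV(Cs=10, cv=10, max_iter=1000)
    model.fit(X_local, y_joint)
    return model                     # model.predict_proba(...) gives P(Y_ij | X_ij)
```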

25 Synthetic experiments
Binary Y and X; tabular edge factors. Natural input mapping: Y_i → X_i.
[Figure: data generated from P(Y|X) P(X) over Y_1, ..., Y_n and X_1, ..., X_n.]

26 Synthetic experiments
P(Y|X) and P(X): chains and trees. P(Y,X): tractable and intractable, depending on the factors Φ(Y_ij, X_ij).
[Figure: example structures over Y_1, ..., Y_5 and X_1, ..., X_5 for tractable and intractable P(Y,X).]

27 Synthetic experiments
P(Y|X): chains and trees. P(Y,X): tractable and intractable. With and without cross-factors Φ(Y_ij, X_ij). Associative factors (all positive, or alternating +/-) and random factors.
[Figure: chain over Y_1, ..., Y_n with inputs X_1, ..., X_n and cross factors.]

28-34 Synthetic: vary the number of training examples.
[Plots: results for tree-structured, intractable P(Y,X); associative Φ (alternating +/-); |Y| = 40; 1000 test examples.]

35 Synthetic: vary model size. Fixed 50 training examples, 1000 test examples.

36 fMRI experiments
X: 500 fMRI voxels. Y: 218 semantic features (Metal? Manmade? Found in house? ...). Predict Y from X, then decode the object (60 total; e.g., bear, screwdriver) via a hand-built map. Data and setup from Palatucci et al. (2009).
Zero-shot learning: we can predict objects not in the training data (given the decoding).
Image from http://en.wikipedia.org/wiki/File:FMRI.jpg

37 fMRI experiments
Predict Y (218 semantic features) from X (500 fMRI voxels). Y and X are real-valued, so the model uses Gaussian factors.
Input mapping: regressed Y_i ~ Y_-i, X and chose the top K inputs; added fixed ...; regularized A and C, b separately.
CV for parameter learning is very expensive, so CV was done on subject 0 only.
Two methods: CRF1 (K = 10) and CRF2 (K = 20).
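A rough sketch of the input-mapping step described on this slide: regress each real-valued Y_i on the remaining Y's and all voxels X, then keep the K voxels with the largest absolute coefficients as that label's local inputs X_i. The choice of ridge regression and the handling of Y_-i are assumptions, not taken from the slides.

```python
# Choose K voxels per label by regressing Y_i on [Y_-i, X] (assumed: ridge regression).
import numpy as np
from sklearn.linear_model import Ridge

def choose_local_inputs(Y, X, i, K=10, alpha=1.0):
    """Y: (n, m) semantic features; X: (n, d) voxels. Returns K voxel indices for Y_i."""
    Y_rest = np.delete(Y, i, axis=1)          # Y_{-i}
    design = np.hstack([Y_rest, X])
    coef = Ridge(alpha=alpha).fit(design, Y[:, i]).coef_
    voxel_coef = coef[Y_rest.shape[1]:]       # keep only the coefficients on X
    return np.argsort(np.abs(voxel_coef))[::-1][:K]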

38 fMRI experiments
Accuracy (for zero-shot learning): hold out objects i and j; predict Y'^(i) and Y'^(j). If ||Y^(i) - Y'^(i)||_2 < ||Y^(j) - Y'^(i)||_2, then we got i right.
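The pairwise zero-shot test above, written out as a small check (function and argument names are illustrative):

```python
# Count object i as correct if its prediction is closer (in L2) to the true
# Y^(i) than to the other held-out object's true Y^(j).
import numpy as np

def pairwise_correct(y_true_i, y_true_j, y_pred_i):
    """Each argument is a 1-D array of semantic features for one object."""
    d_ii = np.linalg.norm(y_true_i - y_pred_i)   # ||Y^(i) - Y'^(i)||_2
    d_ji = np.linalg.norm(y_true_j - y_pred_i)   # ||Y^(j) - Y'^(i)||_2
    return d_ii < d_ji
```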

39-42 fMRI experiments: results.
[Plots, with the "better" direction marked on each: accuracy: CRFs a bit worse. Log likelihood: CRFs better. Squared error: CRFs better.]

43 Conclusion
Scalable learning of CRF structure. We analyzed edge scores for spanning-tree methods: Local Linear Entropy Scores are imperfect, while the heuristics have pleasing theoretical properties and empirical success; we recommend DCI.
Future work: templated CRFs; learning the edge score; assumptions on the model/factors that give learnability.
Thank you!

44 References
M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam, S. Slattery. Learning to Extract Symbolic Knowledge from the World Wide Web. AAAI 1998.
J. D. Lafferty, A. McCallum, F. C. N. Pereira. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. ICML 2001.
M. Palatucci, D. Pomerleau, G. Hinton, T. Mitchell. Zero-Shot Learning with Semantic Output Codes. NIPS 2009.
M. Schmidt, K. Murphy, G. Fung, R. Rosales. Structure Learning in Random Fields for Heart Motion Abnormality Detection. CVPR 2008.
D. Shahaf, A. Chechetka, C. Guestrin. Learning Thin Junction Trees via Graph Cuts. AISTATS 2009.
C. Sutton, A. McCallum. Piecewise Training of Undirected Models. UAI 2005.
C. Sutton, A. McCallum. Piecewise Pseudolikelihood for Efficient Training of Conditional Random Fields. ICML 2007.
A. Torralba, K. Murphy, W. Freeman. Contextual Models for Object Detection Using Boosted Random Fields. NIPS 2004.

45 (extra slides)

46 B: Score Decay Assumption

47 B: Example complexity

48 Future work: Templated CRFs
Learn a template, e.g. Score(i, j) = DCI(i, j) plus a parametrization.
Example: WebKB (Craven et al., 1998). Given webpages {(Y_i = page type, X_i = content)}, use the template to choose a tree over the pages and instantiate parameters, giving P(Y|X=x) = P(pages' types | pages' content).
Requires local inputs; potentially very fast.

49 Future work: Learn the score
Given training queries (data plus a ground-truth model, e.g. from an expensive structure learning method), learn a function Score(Y_i, Y_j) for the MST algorithm.

