Learning Tree Conditional Random Fields
Joseph K. Bradley, Carlos Guestrin


We want to model conditional correlations.
Reading people's minds: predict Y (semantic features: Metal? Manmade? Found in house? ...) from X (fMRI voxels).
Predict independently, Y_i ~ X for each i? But the Y_i are correlated! E.g., Person? & Live in water?; Colorful? & Yellow?
(Application from Palatucci et al., 2009)

Conditional Random Fields (CRFs) (Lafferty et al., 2001)
Pro: avoid modeling P(X). In fMRI, X is roughly 500 to 10,000 voxels.

Conditional Random Fields (CRFs)
Pro: avoid modeling P(X).
The graph over Y_1, Y_2, Y_3, Y_4 encodes conditional independence structure.

Conditional Random Fields (CRFs)
Pro: avoid modeling P(X).
Con: the normalization depends on X = x, so Z(x) must be computed for each inference.
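For reference, the pairwise CRF distribution behind these pros and cons can be written as follows (a standard form; the talk's exact parameterization is not shown in the transcript):

$$P(y \mid x) = \frac{1}{Z(x)} \prod_{(i,j) \in E} \phi_{ij}(y_i, y_j, x), \qquad Z(x) = \sum_{y'} \prod_{(i,j) \in E} \phi_{ij}(y'_i, y'_j, x).$$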

Conditional Random Fields (CRFs)
Exact inference is intractable in general, and approximate inference is expensive.
Solution: use tree CRFs!

Conditional Random Fields (CRFs)
Use tree CRFs: fast, exact inference, and still no need to model P(X).
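Why trees make this cheap, in slightly more detail than the slide (standard sum-product reasoning, included here as background): root the tree, pass one message per edge from the leaves up, and read off Z(x) at the root r:

$$m_{i \to j}(y_j) = \sum_{y_i} \phi_{ij}(y_i, y_j, x) \prod_{k \in \mathrm{ch}(i)} m_{k \to i}(y_i), \qquad Z(x) = \sum_{y_r} \prod_{k \in \mathrm{ch}(r)} m_{k \to r}(y_r),$$

so each exact inference costs time linear in the number of Y variables.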

CRF Structure Learning
Tree CRFs give fast, exact inference and avoid modeling P(X). Two remaining problems: structure learning and feature selection.

CRF Structure Learning
Tree CRFs give fast, exact inference and avoid modeling P(X). Use local inputs X_ij (scalable) instead of global inputs X (not scalable).

This work
Goals: structured conditional models P(Y|X); scalable methods (tree structures, local inputs X_ij, max spanning trees).
Outline: gold standard; max spanning trees; generalized edge weights; heuristic weights; experiments on synthetic data and fMRI.

Related work (feature selection? / tractable models?)
Torralba et al. (2004), Boosted Random Fields: yes / no.
Schmidt et al. (2008), block-L1 regularized pseudolikelihood: no.
Shahaf et al. (2009), edge weight + low-treewidth model: no / yes.
Vs. our work: choice of edge weights; local inputs.

Chow-Liu
For generative models, Chow-Liu learns the maximum-likelihood tree by weighting each edge (Y_i, Y_j) with the mutual information I(Y_i; Y_j) and taking a max spanning tree.

Chow-Liu for CRFs?
For CRFs with global inputs, the analogue weights each edge with the global CMI (Conditional Mutual Information) I(Y_i; Y_j | X).
Pro: a "gold standard". Con: I(Y_i; Y_j | X) is intractable for big X.
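Written out, the two spanning-tree objectives being compared are (standard forms, reconstructed because the slide's equations did not survive the transcript):

$$T_{\text{gen}} = \arg\max_{T} \sum_{(i,j) \in T} I(Y_i; Y_j), \qquad T_{\text{CRF}} = \arg\max_{T} \sum_{(i,j) \in T} I(Y_i; Y_j \mid X).$$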

Where now?
Global CMI: pro, a "gold standard"; con, I(Y_i; Y_j | X) is intractable for big X.
Algorithmic framework: given data {(y^(i), x^(i))} and an input mapping Y_i → X_i, weight each potential edge (Y_i, Y_j) with Score(i,j) and choose a max spanning tree. Local inputs!
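A minimal sketch of this framework in Python, assuming a generic score function and using networkx for the maximum spanning tree; the function and variable names are illustrative, not from the talk:

```python
import networkx as nx

def learn_tree_structure(n_vars, score):
    """Score every candidate edge (i, j) over Y_1..Y_n with a local,
    decomposable edge score and return a maximum-weight spanning tree."""
    g = nx.Graph()
    g.add_nodes_from(range(n_vars))
    for i in range(n_vars):
        for j in range(i + 1, n_vars):
            # score(i, j) could be local CMI, DCI, PWL, etc.,
            # estimated from data using only the local inputs X_ij.
            g.add_edge(i, j, weight=score(i, j))
    tree = nx.maximum_spanning_tree(g, weight="weight")
    return sorted(tree.edges())

# Example (illustrative): plug in any precomputed score matrix.
# tree_edges = learn_tree_structure(4, lambda i, j: score_matrix[i][j])
```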

Generalized edge scores
Key step: weight edge (Y_i, Y_j) with Score(i,j).
Local Linear Entropy Scores: Score(i,j) is a linear combination of entropies over Y_i, Y_j, X_i, X_j. E.g., Local Conditional Mutual Information.
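For concreteness, the Local CMI example is such a linear combination of entropies (writing X_ij for (X_i, X_j); this is the standard identity for conditional mutual information):

$$I(Y_i; Y_j \mid X_{ij}) = H(Y_i, X_{ij}) + H(Y_j, X_{ij}) - H(X_{ij}) - H(Y_i, Y_j, X_{ij}).$$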

Generalized edge scores
Key step: weight edge (Y_i, Y_j) with Score(i,j). Local Linear Entropy Scores: Score(i,j) is a linear combination of entropies over Y_i, Y_j, X_i, X_j.
Theorem: assume the true P(Y|X) is a tree CRF (with non-trivial parameters). Then no Local Linear Entropy Score can recover all such tree CRFs, even with exact entropies.

Outline: gold standard; max spanning trees; generalized edge weights; heuristic weights; experiments on synthetic data and fMRI.
Heuristics: piecewise likelihood, Local CMI, DCI.

Piecewise likelihood (PWL)
Sutton and McCallum (2005, 2007) used PWL for parameter learning; the main idea is to bound Z(X). For tree CRFs, the optimal parameters give an edge score with local inputs X_ij.
Pros: bounds the log likelihood; helps explain the other edge scores. Cons: fails on a simple counterexample; does badly in practice.
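The bound PWL builds on, from Sutton and McCallum's piecewise training work (reconstructed here as background, not from the slide): the global partition function is upper-bounded by the product of per-edge local normalizers, so the piecewise objective lower-bounds the log likelihood:

$$Z(x) \le \prod_{(i,j) \in E} Z_{ij}(x), \qquad Z_{ij}(x) = \sum_{y_i, y_j} \phi_{ij}(y_i, y_j, x).$$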

Piecewise likelihood (PWL): counterexample
True P(Y,X) is a chain over Y_1, ..., Y_n with inputs X_1, ..., X_n. With a strong potential, PWL chooses edges (2, j) over the true edges (j, k).

Local Conditional Mutual Information (Local CMI)
A decomposable score with local inputs X_ij. Does pretty well in practice, but can fail with strong potentials.
Theorem: Local CMI bounds the log likelihood gain.

Local Conditional Mutual Information: failure case
True P(Y,X) is the same chain over Y_1, ..., Y_n with inputs X_1, ..., X_n. With a strong potential, Local CMI can recover an incorrect structure over Y_1, Y_2, Y_3.

Decomposable Conditional Influence (DCI)
An exact measure of the gain for some edges; an edge score with local inputs X_ij, derived from PWL. Succeeds on the counterexample and does best in practice.

Experiments: algorithmic details
Given: data {(y^(i), x^(i))} and an input mapping Y_i → X_i.
Compute edge scores: regress P(Y_ij | X_ij) (10-fold CV to choose regularization), then choose a max spanning tree.
Parameter learning: conjugate gradient on the L2-regularized log likelihood, with 10-fold CV to choose the regularization.
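A rough sketch of the edge-score regression step, assuming binary Y_i, Y_j so that P(Y_i, Y_j | X_ij) is a 4-class conditional model; scikit-learn's cross-validated logistic regression is used here as a stand-in for whatever regression the authors actually used:

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV

def fit_edge_conditional(y_i, y_j, x_ij):
    """Estimate P(Y_i, Y_j | X_ij) for one candidate edge.

    y_i, y_j : binary arrays of shape (n_samples,)
    x_ij     : local inputs of shape (n_samples, d_ij)
    Returns a fitted model; predict_proba gives the probabilities of the
    four joint configurations of (Y_i, Y_j)."""
    joint = (2 * y_i + y_j).astype(int)  # encode the pair as one 4-valued class
    model = LogisticRegressionCV(Cs=10, cv=10, max_iter=1000)  # 10-fold CV picks the L2 strength
    model.fit(x_ij, joint)
    return model
```

Edge scores such as Local CMI or DCI can then be computed from the fitted conditional, for example by comparing the predicted joint to the product of its marginals.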

Synthetic experiments
Binary Y and X with tabular edge factors; the natural input mapping Y_i → X_i. Data are generated from a model of the form P(Y|X) P(X) over Y_1, ..., Y_n and X_1, ..., X_n.

Synthetic experiments
P(Y|X) and P(X): chains and trees. The factors Φ(Y_ij, X_ij) are chosen so that the joint P(Y,X) is either tractable or intractable.

Synthetic experiments
P(Y|X): chains and trees. P(Y,X): tractable and intractable. With and without cross-factors Φ(Y_ij, X_ij). Associative (all positive and alternating +/-) and random factors.

Synthetic: vary # train examples
[Plots omitted in transcript. Setting: tree, intractable P(Y,X), associative Φ (alternating +/-), fixed |Y| and fixed number of test examples.]

Synthetic: vary model size
Fixed 50 training examples, 1000 test examples.

fMRI experiments
Predict Y (218 semantic features: Metal? Manmade? Found in house? ...) from X (500 fMRI voxels), then decode the object (60 total: bear, screwdriver, ...) via a hand-built map. Data and setup from Palatucci et al. (2009).
Zero-shot learning: can predict objects not in the training data (given the decoding).

fMRI experiments
Y and X are real-valued, so we use Gaussian factors.
Input mapping: regressed Y_i ~ Y_{-i}, X; chose the top K inputs; added fixed ... Regularized A and C, b separately.
CV for parameter learning is very expensive, so we do CV on subject 0 only.
Two methods: CRF1 (K=10 & ...) and CRF2 (K=20 & ...).
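A sketch of one way to implement the "regress Y_i ~ Y_{-i}, X and keep the top K inputs" mapping, using ridge regression coefficient magnitudes; the use of ridge and the selection rule are assumptions, not details from the talk:

```python
import numpy as np
from sklearn.linear_model import Ridge

def select_local_inputs(Y, X, i, K=10, alpha=1.0):
    """Pick the K input voxels most useful for predicting Y_i.

    Regress Y_i on (Y_{-i}, X) and rank the X columns by the magnitude
    of their regression coefficients."""
    others = np.delete(Y, i, axis=1)          # Y_{-i}
    design = np.hstack([others, X])           # [Y_{-i}, X]
    coef = Ridge(alpha=alpha).fit(design, Y[:, i]).coef_
    x_coef = np.abs(coef[others.shape[1]:])   # coefficients on X only
    return np.argsort(x_coef)[::-1][:K]       # indices of the top-K voxels
```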

fMRI experiments
Accuracy (for zero-shot learning): hold out objects i, j and predict Y^(i)', Y^(j)'. If ||Y^(i) − Y^(i)'||_2 < ||Y^(j) − Y^(i)'||_2, then we got object i right.
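A small sketch of this leave-two-out accuracy check (variable names are illustrative):

```python
import numpy as np

def zero_shot_correct(y_true_i, y_true_j, y_pred_i):
    """True if the prediction for held-out object i is closer (in L2)
    to object i's true semantic features than to object j's."""
    return (np.linalg.norm(y_true_i - y_pred_i)
            < np.linalg.norm(y_true_j - y_pred_i))
```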

fMRI experiments: results
[Plots omitted in transcript.] Accuracy: CRFs a bit worse. Log likelihood: CRFs better. Squared error: CRFs better.

Conclusion
Scalable learning of CRF structure: we analyzed edge scores for spanning-tree methods. Local Linear Entropy Scores are imperfect; the heuristics have pleasing theoretical properties and empirical success. We recommend DCI.
Future work: templated CRFs; learning the edge score; assumptions on the model/factors that give learnability.
Thank you!

References
M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam, S. Slattery. Learning to Extract Symbolic Knowledge from the World Wide Web. AAAI 1998.
J. D. Lafferty, A. McCallum, F. C. N. Pereira. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. ICML 2001.
M. Palatucci, D. Pomerleau, G. Hinton, T. Mitchell. Zero-Shot Learning with Semantic Output Codes. NIPS 2009.
M. Schmidt, K. Murphy, G. Fung, R. Rosales. Structure Learning in Random Fields for Heart Motion Abnormality Detection. CVPR 2008.
D. Shahaf, A. Chechetka, C. Guestrin. Learning Thin Junction Trees via Graph Cuts. AISTATS 2009.
C. Sutton, A. McCallum. Piecewise Training of Undirected Models. UAI 2005.
C. Sutton, A. McCallum. Piecewise Pseudolikelihood for Efficient Training of Conditional Random Fields. ICML 2007.
A. Torralba, K. Murphy, W. Freeman. Contextual Models for Object Detection Using Boosted Random Fields. NIPS 2004.

(extra slides)

B: Score Decay Assumption

B: Example complexity

Future work: Templated CRFs
Learn a template, e.g. Score(i,j) = DCI(i,j) plus a parametrization. Example: WebKB (Craven et al., 1998). Given webpages {(Y_i = page type, X_i = content)}, use the template to choose a tree over the pages and instantiate the parameters, so that P(Y|X=x) = P(pages' types | pages' content). Requires local inputs; potentially very fast.

Future work: Learn the score
Given training queries (data plus a ground-truth model, e.g. from an expensive structure learning method), learn the function Score(Y_i, Y_j) for the MST algorithm.