Evidence-Specific Structures for Rich Tractable CRFs. Anton Chechetka, Carlos Guestrin (Carnegie Mellon)

Presentation transcript:

Motivation

We want P(structured Query | Evidence): a distribution over the query variables Q given the evidence E. Example applications:
- Webpage classification: webpage text + links -> page labels (professor, student, project, ...)
- Face recognition in image collections: collection of images + face similarities -> face labels
- Collaborative filtering

Conditional Random Fields

P(Q | E) = 1/Z(E) * exp( Σ_i w_i f_i(Q, E) ),
where the f_i are features, the w_i are weights, and Z(E) is the normalization. The features induce a structure over Q (e.g., f_12 couples Q_1 and Q_2, f_34 couples Q_3 and Q_4).
- Features can be arbitrarily correlated
- Convex objective with a unique global optimum
- Intuitive gradient (feature minus expected feature):
  d log P(q | e) / d w_i = f_i(q, e) - E_{P(Q|e)}[ f_i(Q, e) ],
  which requires inference in the induced model.

Model Structure Tradeoffs

Exact inference is #P-complete and approximate inference is NP-complete, so inference is hopeless in large dense models but easy for tree-structured models.

Dense models: capture complex dependencies and extend naturally to relational settings, but inference quality and parameter quality can be arbitrarily bad.

Tree models: efficient exact inference and efficient learning of optimal parameters, but only simple dependencies, and relational settings are not tree-structured.

This work: keep the efficient exact inference and optimal parameter learning of tree models, while enabling rich dependencies and relational extensions.

Intuition: Edge importance depends on evidence

"Battery is good" and "Engine starts" are dependent in general, but for the specific evidence E = {gas tank is empty} there is no dependence: the engine will not start regardless of the battery.

Our Approach: Evidence-Specific Structures

Fixed dense model × evidence-specific tree "mask" = evidence-specific model. The dense model captures all potential dependencies; for each evidence value (E = e_1, e_2, e_3, ...) we select the most important tree specific to that value.

CRF with Evidence-Specific Structure: Formalism

P(Q | E, u, w) = 1/Z(E, u, w) * exp( Σ_i 1[f_i's edge ∈ T(E, u)] * w_i f_i(Q, E) ),
i.e., standard weighted features masked by an evidence-specific structure. T(E, u) encodes the output of a structure selection algorithm, and u are the structure selection parameters. Tree structures are selected, based on evidence, to capture the most important dependencies. This gives a global perspective on structure selection: it is easy to guarantee tree structure by choosing an appropriate algorithm, whereas looking at one edge at a time is not enough, because being a tree is a global property.

Properties:
- Objective still convex in w (but not in u)
- Efficient exact inference
- Efficient learning of optimal parameters w
- Much richer class of models than fixed trees (potential for capturing complex correlations)
- Structure selection decoupled from feature design and weights (an arbitrarily dense model can be used as the basis)

Learning an ESS-CRF model: Algorithm

1. Choose features f
2. Choose a tree learning algorithm T(E, .)
3. Learn u
4. Select evidence-specific trees T(e_i, u) for every datapoint (E = e_i, Q = q_i) [u is fixed at this stage]
5. Given u and the trees T(e_i, u), learn w [L-BFGS, etc.]
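To make the five training steps above concrete, here is a minimal Python sketch of the pipeline. It is an illustration, not the authors' implementation: the names and call signatures (train_ess_crf, learn_tree_params, select_tree, neg_log_lik_and_grad, dim_w) are assumptions made for this example; only the use of L-BFGS in step 5 comes from the poster.

```python
# A minimal sketch of the five-step training procedure above, assuming the
# concrete components are supplied by the caller. All names here
# (train_ess_crf, learn_tree_params, select_tree, neg_log_lik_and_grad)
# are placeholders for this example, not part of the original work.
import numpy as np
from scipy.optimize import minimize

def train_ess_crf(data, features, learn_tree_params, select_tree,
                  neg_log_lik_and_grad, dim_w):
    """data: list of (evidence e_i, query assignment q_i) pairs.
    features: the feature functions f chosen in step 1.
    learn_tree_params(data): learns structure-selection parameters u
        (step 3), e.g. pairwise conditional estimators for Chow-Liu.
    select_tree(e, u): the tree-selection algorithm T(E, u) chosen in
        step 2; returns a tree over the query variables.
    neg_log_lik_and_grad(w, data, trees, features): returns the exact
        negative log-likelihood and its gradient; both are tractable
        because every T(e_i, u) is a tree.
    """
    # Step 3: learn the structure-selection parameters u (w is not involved).
    u = learn_tree_params(data)

    # Step 4: with u fixed, pick one evidence-specific tree per datapoint.
    trees = [select_tree(e_i, u) for e_i, _q_i in data]

    # Step 5: with u and the trees fixed, the objective is convex in w,
    # so L-BFGS finds the unique global optimum.
    w0 = np.zeros(dim_w)
    result = minimize(neg_log_lik_and_grad, w0,
                      args=(data, trees, features), jac=True,
                      method="L-BFGS-B")
    return u, result.x
```

Steps 1 and 2 (choosing the features and the tree-selection algorithm) are design decisions made before this function is called; steps 3-5 are the calls inside it.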
Fewer Sources of Errors

Stage                   | Dense CRFs                  | ESS-CRFs (this work)
Structure selection     | Approximate                 | Approximate
Feature weight learning | Approximate (no guarantees) | Exact
Test-time inference     | Approximate (no guarantees) | Exact

Learning Good Evidence-Specific Trees

Directly generalize existing algorithms for the no-evidence case:
- No evidence, P(Q): P(Q_i, Q_j) (pairwise marginals) + Chow-Liu algorithm = optimal tree
- With evidence, P(Q | E = e): P(Q_i, Q_j | E = e) (pairwise conditionals) + Chow-Liu algorithm = good tree for E = e

Train stage: replace the original high-dimensional problem over (E, Q) with low-dimensional pairwise problems over (E, Q_1, Q_2), (E, Q_1, Q_3), (E, Q_3, Q_4), ...; learning pairwise conditional estimators yields the parameters u.

Test stage (evidence-specific Chow-Liu algorithm): instantiate the evidence in the pairwise estimators, compute mutual information values to use as edge weights over Q_1, ..., Q_4, and return the maximum spanning tree (a code sketch appears at the end of this transcript).

Learning Optimal Feature Weights

We can exactly compute the convex objective and its gradient, and use L-BFGS or conjugate gradient to find the unique global optimum w.r.t. w exactly. The gradient is similar to standard CRFs:
- its sparsity conforms to the evidence-specific structure,
- the structure-related parameters u are fixed from the tree-learning step,
- computation is exact and efficient because T(e, u) is a tree.
Individual datapoints (E = e_1, e_2, e_3, ...) give tree-sparse gradients with different evidence-dependent sparsity patterns; over the whole dataset the gradient is dense, but still tractable.

Relational Extensions

General approach:
1. Ground the model / features
2. Use standard ESS-CRFs + parameter sharing
Parameters are shared for both w and u: one weight per relation, not per grounding. The parameter dimensionality is therefore independent of model size, which reduces overfitting. Structure selection happens only after grounding, so there is no need to worry about the structure being a tree at the relational level.

Results

Face recognition [w/ Denver Dash, Matthai Philipose]: exploit face similarities to propagate labels in collections of images; a semi-supervised relational model; 250-1700 images, 4-24 unique people. Compared against dense discriminative models: equal or better accuracy, 100 times faster.

WebKB [data + features thanks to Ben Taskar]: webpage text + links -> page type (student, project, ...). Same accuracy as dense models, ~10 times faster.

Acknowledgements: this work has been supported by NSF Career award IIS and by ARO MURI W911NF and W911NF.
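The test-stage, evidence-specific Chow-Liu step described under "Learning Good Evidence-Specific Trees" can be sketched as follows. This is a self-contained illustration under assumptions: pairwise_conditional stands in for whatever pairwise conditional estimators were learned at train time (the parameters u), and the function names are made up for the example; the mutual-information scoring and maximum-spanning-tree selection are the steps the poster describes.

```python
# Sketch of the evidence-specific Chow-Liu test stage: instantiate the
# evidence in the learned pairwise estimators, score every candidate edge by
# mutual information, and return the maximum spanning tree over the query
# variables. The estimator interface is an assumption for this example.
from itertools import combinations
import numpy as np

def mutual_information(joint):
    """Mutual information I(Q_i; Q_j) from a 2-D joint probability table."""
    joint = np.asarray(joint, dtype=float)
    pi = joint.sum(axis=1, keepdims=True)   # marginal of Q_i
    pj = joint.sum(axis=0, keepdims=True)   # marginal of Q_j
    nz = joint > 0                          # avoid log(0) terms
    return float((joint[nz] * np.log(joint[nz] / (pi @ pj)[nz])).sum())

def evidence_specific_tree(n_vars, pairwise_conditional, evidence):
    """Return the edges of the maximum spanning tree under MI edge weights.

    pairwise_conditional(i, j, evidence) -> joint table of P(Q_i, Q_j | E=e).
    """
    # Score each candidate edge with the evidence instantiated.
    edges = [(mutual_information(pairwise_conditional(i, j, evidence)), i, j)
             for i, j in combinations(range(n_vars), 2)]
    edges.sort(reverse=True)                # heaviest edges first

    # Kruskal's algorithm with union-find: greedily add the heaviest edge
    # that does not create a cycle; the result is a maximum spanning tree.
    parent = list(range(n_vars))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    tree = []
    for _weight, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j
            tree.append((i, j))
    return tree
```

Because Chow-Liu reduces tree selection to a maximum spanning tree problem over mutual-information edge weights, any MST algorithm would do here; the sketch uses Kruskal's algorithm with union-find.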