Inference Complexity As Learning Bias
Daniel Lowd, Dept. of Computer and Information Science, University of Oregon
Joint work with Pedro Domingos

Don’t use model complexity as your learning bias … Use inference complexity.

The Goal
[Diagram: Data → learning algorithm → Model (A, B, C) → inference algorithm → Answers; answering queries is the actual task]
This talk: how to learn accurate and efficient models by tightly integrating learning and inference.
Experiments: exact inference in 100

Outline
Standard solutions (and why they fail)
Background
– Learning with Bayesian networks
– Inference with arithmetic circuits
Learning arithmetic circuits
– Scoring
– Search
– Efficiency
Experiments
Conclusion

Solution 1: Exact Inference
[Diagram: Data → Model (A, B, C) → Jointree → Answers]
SLOW

Solution 2: Approximate Inference
[Diagram: Data → Model (A, B, C) → Approximation → Answers]
Approximations are often too inaccurate. More accurate algorithms tend to be slower. SLOW

Solution 3: Learn a tractable model
Related work: thin junction trees [e.g., Chechetka & Guestrin, 2007]
[Diagram: Data → Jointree → Answers]
Polynomial in data, but still exponential in treewidth.

Thin junction trees are thin
Maximum effective treewidth is 2-5; we have learned models with treewidth >100.
[Figure: their junction trees vs. our junction trees]

Solution 3: Learn a tractable model
Our work: arithmetic circuits with a penalty on circuit size
[Diagram: Data → Model (A, B, C) → Circuit (sums and products) → Answers]

Outline
Standard solutions (and why they fail)
Background
– Learning with Bayesian networks
– Inference with arithmetic circuits
Learning arithmetic circuits
– Overview
– Optimizations
Experiments
Conclusion

Bayesian networks
Problem: compactly represent a probability distribution over many variables
Solution: conditional independence
P(A,B,C,D) = P(A) P(B|A) P(C|A) P(D|B,C)
[Diagram: network with edges A→B, A→C, B→D, C→D]
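
To make the factorization concrete, here is a minimal sketch (not from the talk) that evaluates one entry of the joint as a product of four CPD lookups; all probability values are invented for illustration.

```python
# Minimal sketch: evaluating P(A,B,C,D) = P(A) P(B|A) P(C|A) P(D|B,C)
# for binary variables. The CPT numbers below are made up.
P_A = {True: 0.3, False: 0.7}
P_B_given_A = {True: {True: 0.8, False: 0.2}, False: {True: 0.1, False: 0.9}}   # P_B_given_A[a][b]
P_C_given_A = {True: {True: 0.5, False: 0.5}, False: {True: 0.6, False: 0.4}}   # P_C_given_A[a][c]
P_D_given_BC = {(True, True): {True: 0.9, False: 0.1},
                (True, False): {True: 0.4, False: 0.6},
                (False, True): {True: 0.3, False: 0.7},
                (False, False): {True: 0.05, False: 0.95}}                      # P_D_given_BC[(b, c)][d]

def joint(a, b, c, d):
    """P(A=a, B=b, C=c, D=d) as a product of local conditional probabilities."""
    return P_A[a] * P_B_given_A[a][b] * P_C_given_A[a][c] * P_D_given_BC[(b, c)][d]

print(joint(True, True, False, True))  # one entry of the 2^4-entry joint from four lookups
```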

… With decision-tree CPDs
Problem: the number of parameters is exponential in the maximum number of parents
Solution: context-specific independence
[Figure: decision tree for P(D|B,C), splitting first on B and then on C]
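
The decision-tree figure did not survive extraction, so here is a hypothetical tree-structured CPD for P(D | B, C) that branches on B and consults C only when B is false, showing how context-specific independence cuts the parameter count:

```python
# Hypothetical tree-structured CPD for P(D = true | B, C): branch on B first and
# only consult C when B is false. The tree needs 3 leaf parameters instead of the
# 4 rows of a full conditional probability table (the numbers are made up).
def p_d_true(b: bool, c: bool) -> float:
    if b:          # context B = true: D is independent of C here
        return 0.9
    elif c:        # context B = false, C = true
        return 0.3
    else:          # context B = false, C = false
        return 0.05
```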

… Compiled to circuits
Problem: inference is exponential in treewidth
Solution: compile to arithmetic circuits
[Figure: arithmetic circuit of sum and product nodes over indicators (A, ¬A, B, ¬B) and parameters]

Arithmetic circuits
Directed, acyclic graph with a single root
– Leaf nodes are inputs
– Interior nodes are addition or multiplication
– Can represent any distribution
Inference is linear in model size!
– Never larger than the junction tree
– Can exploit local structure to save time/space
[Figure: example arithmetic circuit]
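
As a sketch of why inference is linear in circuit size (my own illustration, not the authors' implementation), the following evaluates such a DAG with one cached bottom-up pass:

```python
# A minimal arithmetic-circuit sketch: a DAG of sum and product nodes whose
# leaves hold indicator or parameter values. One cached bottom-up pass evaluates
# it, so inference cost is linear in the number of edges.
class Node:
    def __init__(self, op=None, children=(), value=None):
        self.op = op                  # '+' or '*' for interior nodes, None for leaves
        self.children = list(children)
        self.value = value            # leaf value: an indicator (0/1) or a parameter

def evaluate(root):
    cache = {}                        # reuse results for shared sub-circuits
    def ev(node):
        if id(node) in cache:
            return cache[id(node)]
        if node.op is None:
            result = node.value
        elif node.op == '+':
            result = sum(ev(child) for child in node.children)
        else:                         # '*'
            result = 1.0
            for child in node.children:
                result *= ev(child)
        cache[id(node)] = result
        return result
    return ev(root)
```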

ACs for Inference
Bayesian network: P(A,B,C) = P(A) P(B) P(C|A,B)
Network polynomial: λ_A λ_B λ_C θ_A θ_B θ_C|A,B + λ_¬A λ_B λ_C θ_¬A θ_B θ_C|¬A,B + …
We can compute arbitrary marginal queries by evaluating the network polynomial. Arithmetic circuits (ACs) offer efficient, factored representations of this polynomial and can take advantage of local structure such as context-specific independence.
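
As an illustration of what the polynomial computes, the sketch below enumerates it naively for this three-variable network; an arithmetic circuit yields the same value from a factored DAG instead of the exponential sum. The parameter values and the 0/1 encoding are assumptions for the example.

```python
from itertools import product

# Fully expanded network polynomial for P(A,B,C) = P(A) P(B) P(C|A,B):
# a sum over all complete assignments of (indicators * parameters).
theta_A = {1: 0.6, 0: 0.4}
theta_B = {1: 0.7, 0: 0.3}
theta_C_given_AB = {(a, b): {1: p, 0: 1 - p}
                    for (a, b), p in {(1, 1): 0.9, (1, 0): 0.5,
                                      (0, 1): 0.2, (0, 0): 0.1}.items()}

def network_polynomial(lam_A, lam_B, lam_C):
    """lam_X[x] is the evidence indicator for X=x (1 means consistent with evidence)."""
    total = 0.0
    for a, b, c in product((0, 1), repeat=3):
        total += (lam_A[a] * lam_B[b] * lam_C[c]
                  * theta_A[a] * theta_B[b] * theta_C_given_AB[(a, b)][c])
    return total

# P(B=1): clamp B's indicators, leave A and C summed out.
p_b1 = network_polynomial({0: 1, 1: 1}, {0: 0, 1: 1}, {0: 1, 1: 1})
print(p_b1)   # equals theta_B[1] = 0.7 here, since B is a root
```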

BN Structure Learning
Start with an empty network. Greedily add splits to the decision trees, one at a time, enforcing acyclicity [Chickering et al., 1996].
score(C, T) = log P(T | C) − k_p n_p(C)    (accuracy − # parameters)

Key Idea
For an arithmetic circuit C on an i.i.d. training sample T:
Typical cost function: score(C, T) = log P(T | C) − k_p n_p(C)    (accuracy − # parameters)
Our cost function: score(C, T) = log P(T | C) − k_p n_p(C) − k_e n_e(C)    (accuracy − # parameters − circuit size)
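
A trivial sketch of this cost function; the penalty weights below are placeholder values, and the log-likelihood, parameter count, and edge count are assumed to be supplied by the learner and the compiled circuit.

```python
# Sketch of the cost function above. K_P and K_E are hypothetical weights.
K_P = 0.1    # per-parameter penalty (assumed value)
K_E = 0.01   # per-edge penalty (assumed value)

def score(log_likelihood, num_parameters, num_edges):
    # accuracy - # parameters - circuit size
    return log_likelihood - K_P * num_parameters - K_E * num_edges
```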

Basic algorithm
Following Chickering et al. (1996), we induce our statistical models by greedily selecting splits for the decision-tree CPDs. Our approach has two key differences:
1. We optimize a different objective function.
2. We return a Bayesian network that has already been compiled into a circuit.

Efficiency Compiling each candidate AC from scratch at each step is too expensive. Instead: Incrementally modify AC as we add splits.

A AA ¬A  ¬A B ¬B   Before split

After split [figure: circuit with a new sum over copies of the parameters conditioned on B and ¬B]

Algorithm
Create initial product-of-marginals circuit
Create initial split list
Until convergence:
    For each split in the list:
        Apply split to circuit
        Score result
        Undo split
    Apply highest-scoring split to circuit
    Add new child splits to list
    Remove inconsistent splits from list
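
A Python sketch of this loop under assumed interfaces (apply, undo, score, child_splits, and consistent_with are hypothetical names, not the authors' API); candidate_splits is a set of split objects seeded from the initial product-of-marginals circuit.

```python
# Greedy search sketch: repeatedly try each candidate split on the circuit,
# keep the best-scoring one, and update the candidate set.
def learn_circuit(circuit, candidate_splits):
    current = circuit.score()
    while True:
        best_split, best_score = None, current
        for split in list(candidate_splits):
            circuit.apply(split)              # incrementally modify the AC
            s = circuit.score()
            circuit.undo(split)
            if s > best_score:
                best_split, best_score = split, s
        if best_split is None:                # convergence: no split improves the score
            return circuit
        circuit.apply(best_split)
        current = best_score
        candidate_splits.remove(best_split)
        candidate_splits |= set(best_split.child_splits())          # add new child splits
        candidate_splits = {s for s in candidate_splits
                            if circuit.consistent_with(s)}          # drop inconsistent splits
```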

How to split a circuit
D: parameter nodes to be split
V: indicators for the splitting variable
M: first mutual ancestors of D and V
For each indicator in V, copy all nodes between M and D or V, conditioned on that indicator.
For each m in M, replace the children of m that are ancestors of D or V with a sum over the copies of those ancestors, each multiplied by the indicator it was conditioned on.

Optimizations
We avoid rescoring every split in every iteration by:
1. Noting that a split's likelihood gain never changes; only the number of edges it adds does.
2. Evaluating splits with higher likelihood gain first, since likelihood gain is an upper bound on the score gain.
3. Re-evaluating the number of edges added only when another split may have affected it (AC-Greedy).
4. Assuming the number of edges added by a split only increases as the algorithm progresses (AC-Quick).
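
A sketch of the lazy best-first evaluation behind points 1-3, with hypothetical interfaces (likelihood_gain, params_added, edges_added_by): because a split's likelihood gain is fixed and the penalties only subtract from it, it upper-bounds the score gain, so splits can be popped in that order and the expensive edge-count evaluation skipped once the bound falls below the best gain found.

```python
import heapq

# Lazy rescoring sketch (all interfaces hypothetical). The heap orders splits by
# likelihood gain, an upper bound on the true score gain.
def best_split(circuit, splits, k_p, k_e):
    heap = [(-s.likelihood_gain, i, s) for i, s in enumerate(splits)]
    heapq.heapify(heap)
    best, best_gain = None, 0.0
    while heap:
        neg_bound, _, split = heapq.heappop(heap)
        if -neg_bound <= best_gain:                 # bound can't beat the current best: stop
            break
        edges = circuit.edges_added_by(split)       # the expensive, lazily computed part
        gain = (split.likelihood_gain
                - k_p * split.params_added
                - k_e * edges)
        if gain > best_gain:
            best, best_gain = split, gain
    return best
```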

Experiments
Datasets:
– KDD-Cup 2000: 65 variables
– Anonymous MSWeb: 294 variables
– EachMovie (subset): 500 variables
Algorithms:
– WinMine toolkit + Gibbs sampling
– AC + exact inference
Queries: generated from the test data with varying numbers of evidence and query variables

Learning time

Inference time

Accuracy: EachMovie

Accuracy: MSWeb

Accuracy: KDD Cup

Our method learns complex, accurate models

Log-likelihood of test data:
            AC-Greedy   AC-Quick   WinMine
EachMovie   −55.7       −54.9      −53.7
KDD Cup     −2.16
MSWeb       −9.85       −9.69

Treewidth of learned models:
            AC-Greedy   AC-Quick   WinMine
EachMovie
KDD Cup     38          53
MSWeb
(2^35 ≈ 34 billion)

Conclusion
Problem: learning accurate but intractable models amounts to learning inaccurate models, since answers must then come from approximate inference.
Solution: use inference complexity as the learning bias.
Algorithm: learn arithmetic circuits with a penalty on circuit size.
Result: much faster and more accurate inference than standard Bayesian network learning.