The sample complexity of learning Bayesian Networks
Or Zuk*^, Shiri Margel* and Eytan Domany*
*Dept. of Physics of Complex Systems, Weizmann Inst. of Science
^Broad Inst. of MIT and Harvard

2 Introduction
 Let X_1,...,X_n be binary random variables.
 A Bayesian Network is a pair B ≡ (G, θ).
 G – Directed Acyclic Graph (DAG), G = (V, E). V = {X_1,...,X_n} is the vertex set, and Pa_G(i) is the set of parents of X_i, i.e. the vertices X_j such that (X_j, X_i) ∈ E.
 θ – Parameterization. Represents the conditional probabilities of each X_i given its parents Pa_G(i).
 Together, G and θ define a unique joint probability distribution P_B over the n random variables.
[Figure: example DAG over X_1,...,X_5, encoding e.g. X_5 ⊥ {X_1, X_4} | {X_2, X_3}]
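Concretely, the joint distribution factorizes as P_B(x_1,...,x_n) = Π_i P(x_i | x_{Pa_G(i)}). Below is a minimal Python sketch (node names and CPT values are illustrative, not from the talk) of a small binary network stored as parent lists plus conditional probability tables, and of evaluating the joint probability of a complete assignment.

import itertools

# A toy 3-node binary network X1 -> X3 <- X2 (illustrative example).
parents = {1: [], 2: [], 3: [1, 2]}

# cpt[i][parent_assignment] = P(X_i = 1 | Pa_G(i) = parent_assignment)
cpt = {
    1: {(): 0.3},
    2: {(): 0.6},
    3: {(0, 0): 0.1, (0, 1): 0.7, (1, 0): 0.8, (1, 1): 0.95},
}

def joint_prob(x):
    """P_B(x) = prod_i P(x_i | x_{Pa_G(i)}) for a complete assignment x = {i: 0/1}."""
    p = 1.0
    for i, pa in parents.items():
        p1 = cpt[i][tuple(x[j] for j in pa)]
        p *= p1 if x[i] == 1 else 1.0 - p1
    return p

# Sanity check: the joint sums to 1 over all 2^3 assignments.
total = sum(joint_prob(dict(zip(parents, bits)))
            for bits in itertools.product([0, 1], repeat=len(parents)))
print(total)  # ~1.0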

3 Structure Learning
 We consider a score-based approach:
 For each graph G, one computes a score based on the data D: S(G) ≡ S_N(G; D), where N is the sample size.
 The score is composed of two components:
1. Data fitting (log-likelihood): LL_N(G; D) = max_θ LL_N(G, θ; D)
2. Model complexity: Ψ(N)|G|, where |G| is the dimension, i.e. the number of parameters in (G, θ).
S_N(G) = LL_N(G; D) - Ψ(N)|G|
 This is known as the MDL (Minimum Description Length) score. Assumption: 1 << Ψ(N) << N, which makes the score consistent.
 Of special interest: Ψ(N) = ½ log N. This gives the BIC score (Bayesian Information Criterion), which is asymptotically equivalent to the Bayesian score.
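A minimal sketch of computing the MDL/BIC score of a candidate DAG from binary data (illustrative Python, not the authors' code): the maximized log-likelihood decomposes over families (node, parents), and the dimension |G| counts one free parameter per parent configuration of each binary node.

import numpy as np

def mdl_score(data, parents, psi=None):
    """S_N(G) = LL_N(G;D) - Psi(N)|G| for binary data (N x n array) and
    parents = {i: list of parent column indices}. Default Psi(N) = 0.5*log(N) (BIC)."""
    N, n = data.shape
    psi_N = 0.5 * np.log(N) if psi is None else psi(N)
    ll, dim = 0.0, 0
    for i, pa in parents.items():
        dim += 2 ** len(pa)                       # one free parameter per parent configuration
        # group the samples by their joint parent configuration
        keys = [tuple(row) for row in data[:, pa]] if pa else [()] * N
        counts = {}
        for key, xi in zip(keys, data[:, i]):
            c = counts.setdefault(key, [0, 0])
            c[xi] += 1
        for c0, c1 in counts.values():
            for c in (c0, c1):
                if c > 0:
                    ll += c * np.log(c / (c0 + c1))   # maximum-likelihood plug-in log-likelihood
    return ll - psi_N * dim

# Example: score the empty graph vs. a chain 0 -> 1 -> 2 on random (independent) data.
rng = np.random.default_rng(0)
D = rng.integers(0, 2, size=(5000, 3))
print(mdl_score(D, {0: [], 1: [], 2: []}),
      mdl_score(D, {0: [], 1: [0], 2: [1]}))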

4 Previous Work
 [Friedman & Yakhini 96] Unknown structure, no hidden variables. [Dasgupta 97] Known structure, hidden variables. [Hoeffgen 93] Unknown structure, no hidden variables. [Abbeel et al. 05] Factor graphs. [Greiner et al. 97] Classification error.
 These works concentrated on approximating the generative distribution. Typical result: for N > N_0(ε,δ), D(P_true || P_learned) < ε with probability at least 1-δ, where D is some distance between distributions, usually relative entropy (we use relative entropy from now on).
 We are interested in learning the correct structure.
 Intuition and practice: a difficult problem (both computationally and statistically). Empirical study: [Dai et al. IJCAI 97]. New: [Wainwright et al. 06], [Bresler et al. 08] – undirected graphs.

5 Structure Learning
 Assume the data is generated from B* = (G*, θ*), with generative distribution P_B*. Assume further that G* is minimal with respect to P_B*: |G*| = min {|G| : P_B* ∈ M(G)}.
 An error occurs when some 'wrong' graph G is preferred over G*. There are many possible G's, with complicated relations between them.
 Observation: directed graphical models (with no hidden variables) are curved exponential families [Geiger et al. 01].
 [Haughton 88] – The MDL score is consistent.
 [Haughton 89] – Bounds on the error probabilities: P^(N)(under-fitting) ~ O(e^{-αN}); P^(N)(over-fitting) ~ O(N^{-β}).
Previously: bounds only on β, not on α, nor on the multiplicative constants.

6 Structure Learning
Simulations: 4-node networks. 543 DAGs in total, in 185 equivalence classes.
 Draw a DAG G* at random.
 Draw all parameters θ uniformly from [0,1].
 Generate 5,000 samples from P_B*.
 Give scores S_N(G) to all G's and look at S_N(G*).
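A rough Python sketch of this simulation loop (my own illustration; the random DAG here respects the natural node ordering rather than being uniform over all 543 DAGs, and all_dags / mdl_score are assumed helpers, e.g. the score sketch above).

import numpy as np

rng = np.random.default_rng(1)
n, N = 4, 5000

# Draw a random DAG G* over nodes 0..3: with the natural ordering, include each edge j->i (j<i) w.p. 1/2.
true_parents = {i: [j for j in range(i) if rng.random() < 0.5] for i in range(n)}

# Draw all conditional probabilities theta uniformly from [0,1].
theta = {i: rng.random(2 ** len(pa)) for i, pa in true_parents.items()}

def sample(parents, theta, N):
    """Forward (ancestral) sampling in topological order (here 0,1,...,n-1)."""
    X = np.zeros((N, n), dtype=int)
    for i in range(n):
        pa = parents[i]
        # index of the parent configuration, read as a binary number
        idx = X[:, pa] @ (2 ** np.arange(len(pa))) if pa else np.zeros(N, dtype=int)
        X[:, i] = (rng.random(N) < theta[i][idx]).astype(int)
    return X

D = sample(true_parents, theta, N)
# Score every candidate DAG (all_dags(n) is an assumed enumerator of the 543 DAGs on 4 nodes)
# and check where the true structure ranks:
# scores = {name: mdl_score(D, G) for name, G in all_dags(n)}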

7 Structure Learning
[Simulation results as a function of sample size; plots omitted:]
 Relative entropy between the true and learned distributions.
 Fraction of edges learned correctly.
 Rank of the correct structure (equivalence class).

8 All DAGs and Equivalence Classes for 3 Nodes

9 Two Types of Error
 An error occurs when some 'wrong' graph G is preferred over G*. There are many possible G's; we study them one by one.
 Distinguish between two types of errors:
1. Graphs G which are not I-maps for P_B* ('under-fitting'). These graphs impose too many independence relations, some of which do not hold in P_B*.
2. Graphs G which are I-maps for P_B* ('over-fitting'), yet are over-parameterized: |G| > |G*|.
 Study each error separately.
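For small n, whether a candidate DAG G is an I-map for P_B* can be checked by projecting P_B* onto the model M(G): the closest member of M(G) is Q(x) = Π_i P_B*(x_i | x_{Pa_G(i)}), and G is an I-map exactly when D(P_B* || Q) = 0. A minimal brute-force Python sketch under that assumption (the joint is given as a full table over {0,1}^n; efficiency is ignored):

import itertools
import numpy as np

def dag_projection_kl(P, parents, n):
    """D(P || Q) in bits, where Q(x) = prod_i P(x_i | x_{Pa_G(i)}) is the closest member of M(G).
    P maps each assignment (tuple of 0/1 of length n) to its probability.
    Returns ~0 iff G is an I-map for P; > 0 means G under-fits P."""
    def marg(vars_):
        m = {}
        for x, p in P.items():
            key = tuple(x[v] for v in vars_)
            m[key] = m.get(key, 0.0) + p
        return m
    kl = 0.0
    for x, p in P.items():
        if p == 0.0:
            continue
        q = 1.0
        for i in range(n):
            pa = parents[i]
            num = marg([i] + pa)[tuple(x[v] for v in [i] + pa)]
            den = marg(pa)[tuple(x[v] for v in pa)] if pa else 1.0
            q *= num / den
        kl += p * np.log2(p / q)
    return kl

# Build a toy joint P over 3 binary variables from a chain 0 -> 1 -> 2 (illustrative numbers).
p0, p1, p2 = 0.3, {0: 0.2, 1: 0.9}, {0: 0.4, 1: 0.7}
P = {}
for x in itertools.product([0, 1], repeat=3):
    pr = (p0 if x[0] else 1 - p0)
    pr *= (p1[x[0]] if x[1] else 1 - p1[x[0]])
    pr *= (p2[x[1]] if x[2] else 1 - p2[x[1]])
    P[x] = pr

print(dag_projection_kl(P, {0: [], 1: [0], 2: [1]}, 3))   # ~0: the chain is an I-map
print(dag_projection_kl(P, {0: [], 1: [], 2: []}, 3))     # > 0: the empty graph under-fits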

10 'Under-fitting' Errors
1. Graphs G which are not I-maps for P_B*.
 Intuitively, in order to get S_N(G*) > S_N(G), we need:
a. The sample distribution P^(N) to be closer to P_B* than to any point Q in M(G).
b. The penalty difference Ψ(N)(|G| - |G*|) to be small enough (only relevant for |G*| > |G|).
 For a., use concentration bounds (Sanov). For b., simple algebraic manipulations.

11 'Under-fitting' Errors
1. Graphs G which are not I-maps for P_B*.
 Sanov's Theorem [Sanov 57]: Draw N samples from a probability distribution P over {0,1}^n, and let P^(N) be the empirical (sample) distribution. Then: Pr( D(P^(N) || P) > ε ) < (N+1)^{2^n} 2^{-εN}.
 Used in our case to show that the under-fitting error probability decays exponentially in N, with some rate c > 0 [bound omitted].
 For |G| ≤ |G*|, we are able to bound c [bound omitted].
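A small numerical illustration (my own Python sketch) of the quantity the Sanov bound controls: the relative entropy between the empirical distribution of N samples and the true distribution, which concentrates near 0 as N grows.

import numpy as np

rng = np.random.default_rng(2)

def empirical_kl(P, N):
    """D(P^(N) || P) in bits for N i.i.d. samples from a finite distribution P."""
    counts = np.bincount(rng.choice(len(P), size=N, p=P), minlength=len(P))
    P_hat = counts / N
    mask = P_hat > 0
    return float(np.sum(P_hat[mask] * np.log2(P_hat[mask] / P[mask])))

# An illustrative joint distribution over 3 binary variables (2^3 = 8 outcomes).
P = np.array([0.05, 0.10, 0.15, 0.20, 0.05, 0.10, 0.15, 0.20])
for N in (100, 1000, 10000):
    print(N, np.mean([empirical_kl(P, N) for _ in range(20)]))
# The average D(P^(N)||P) shrinks roughly like O(1/N), well inside the exponential Sanov tail.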

12 'Under-fitting' Errors
 Upper bound on the decay exponent: c ≤ D(G || P_B*) log 2. The decay could be very slow if G is close to P_B*.
 Lower bound: use Chernoff bounds to bound the difference between the true and sample entropies.
 Two important parameters of the network:
a. 'Minimal probability' [definition omitted]
b. 'Minimal edge information' [definition omitted]

13 'Over-fitting' Errors
2. Graphs G which are over-parameterized I-maps for P_B*.
 Here errors are moderate-deviations events, as opposed to the large-deviations events in the previous case.
 The probability of error does not decay exponentially with N, but is O(N^{-β}).
 By [Woodroofe 78], β = ½(|G| - |G*|).
 Therefore, for large enough values of N, the error is dominated by over-fitting.
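One way to see where the polynomial rate comes from (a heuristic sketch, not the derivation in [Woodroofe 78]): for an over-parameterized I-map G with k = |G| - |G*| extra parameters, twice the log-likelihood gap 2(LL_N(G) - LL_N(G*)) behaves asymptotically like a χ²_k variable, and an over-fitting error under the BIC penalty requires it to exceed 2Ψ(N)·k = k·log N. The Python snippet below evaluates this tail and exhibits the ~N^{-k/2} decay.

import numpy as np
from scipy.stats import chi2

# Heuristic over-fitting error estimate: P(chi^2_k > k * log N), with k = |G| - |G*|
# and Psi(N) = 0.5 * log N (natural log assumed throughout).
for k in (1, 2, 4):
    probs = [chi2.sf(k * np.log(N), df=k) for N in (10**2, 10**3, 10**4, 10**5)]
    print(k, ["%.2e" % p for p in probs])
# Successive values shrink roughly like N^{-k/2} (times slowly varying log factors),
# matching beta = (|G| - |G*|) / 2.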

14 Example
What happens for small values of N?
 Perform simulations:
 Take a BN over 4 binary nodes.
 Look at two wrong models.
[Figure: the true DAG G* and two wrong DAGs G_1, G_2, all over X_1,...,X_4 – omitted]

15 Example
Errors become rare events; simulate using importance sampling (30 iterations) [Zuk et al. UAI 06]. [Results plot omitted]
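For intuition only, here is a generic importance-sampling sketch for estimating a rare-event probability (illustrative Python; this is not the specific scheme of [Zuk et al. UAI 06], just the general idea of sampling from a proposal that makes the event common and reweighting by the likelihood ratio).

import numpy as np

rng = np.random.default_rng(3)

def rare_event_is(p=0.5, q=0.8, t=0.8, n=50, iters=30, reps=100_000):
    """Estimate P( mean of n Bernoulli(p) samples >= t ) by drawing from Bernoulli(q)
    and reweighting each draw by the likelihood ratio (p/q)^k ((1-p)/(1-q))^(n-k)."""
    estimates = []
    for _ in range(iters):
        k = rng.binomial(n, q, size=reps)                  # successes under the proposal
        w = (p / q) ** k * ((1 - p) / (1 - q)) ** (n - k)  # likelihood ratio
        estimates.append(np.mean(w * (k >= t * n)))
    return np.mean(estimates), np.std(estimates)

print(rare_event_is())  # a tiny probability, estimated with small relative error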

16 Recent Results / Future Directions
 Want to minimize the sum of errors ('over-fitting' + 'under-fitting'). Change the penalty in the MDL score to Ψ(N) = ½ log N – c log log N.
 Number of variables n >> 1, with small maximal in-degree: # parents ≤ d.
 Simulations for trees (computationally efficient structure learning: Chow-Liu; see the sketch below).
 Hidden variables – even more basic questions (e.g. identifiability, consistency) are generally open.
 Requiring the exact model was perhaps too strict – it may be enough to learn a wrong model which is close to the correct one. If we require only 1-ε of the edges to be learned – how does this reduce the sample complexity?
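For reference, a compact sketch of the Chow-Liu procedure mentioned above (illustrative Python): estimate pairwise mutual information from the data and take a maximum-weight spanning tree; orienting the tree away from any chosen root gives the highest-likelihood tree-structured BN.

import numpy as np

def mutual_information(x, y):
    """Empirical mutual information (bits) between two binary columns."""
    mi, n = 0.0, len(x)
    for a in (0, 1):
        for b in (0, 1):
            p_ab = np.mean((x == a) & (y == b))
            p_a, p_b = np.mean(x == a), np.mean(y == b)
            if p_ab > 0:
                mi += p_ab * np.log2(p_ab / (p_a * p_b))
    return mi

def chow_liu_tree(data):
    """Maximum-weight spanning tree (Prim's algorithm) on pairwise mutual information.
    Returns a list of undirected edges (i, j)."""
    n = data.shape[1]
    w = np.array([[mutual_information(data[:, i], data[:, j]) for j in range(n)]
                  for i in range(n)])
    in_tree, edges = {0}, []
    while len(in_tree) < n:
        i, j = max(((i, j) for i in in_tree for j in range(n) if j not in in_tree),
                   key=lambda e: w[e[0], e[1]])
        in_tree.add(j)
        edges.append((i, j))
    return edges

# Example: on 5,000 samples from a chain 0 -> 1 -> 2 -> 3 (e.g. generated with the
# sampler sketched earlier), chow_liu_tree recovers the chain's skeleton with high probability.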