Learning Bayesian Networks (From David Heckerman’s tutorial)


1 Learning Bayesian Networks (From David Heckerman’s tutorial)

2 Learning Bayes Nets From Data: data (a table of cases over variables such as X1 = true/false, X2 = 1, 5, 3, 2, ..., X3 = 0.7, -1.6, 5.9, 6.3, ...) + prior/expert information → Bayes-net learner → Bayes net(s) over X1...X9

3 Overview: Introduction to Bayesian statistics (learning a probability); Learning probabilities in a Bayes net; Learning Bayes-net structure

4 Learning Probabilities: Classical Approach. Simple case: flipping a thumbtack (outcomes heads/tails). The true probability θ is unknown. Given iid data, estimate θ using an estimator with good properties: low bias, low variance, consistency (e.g., the ML estimate).
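A minimal sketch of the classical approach in code; the flip data below are made up for illustration, and the ML estimate of θ is just the observed fraction of heads.

```python
# Maximum-likelihood estimate of the thumbtack probability from iid flips.
# The data are invented for illustration.
flips = ["heads", "tails", "heads", "heads", "tails", "heads"]

n_heads = sum(1 for f in flips if f == "heads")
theta_ml = n_heads / len(flips)   # ML estimate: observed fraction of heads
print(f"ML estimate of theta: {theta_ml:.3f}")
```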

5 Learning Probabilities: Bayesian Approach. The true probability θ is unknown and is treated as a random quantity; our uncertainty is represented by a Bayesian probability density p(θ) over the interval [0, 1].

6 Bayesian Approach: use Bayes' rule to compute a new density for θ given data d: posterior ∝ prior × likelihood, i.e., p(θ | d) = p(θ) p(d | θ) / p(d).

7 The Likelihood ("binomial distribution"): p(d | θ) = θ^#heads (1 − θ)^#tails.

8 Example: application of Bayes' rule to the observation of a single "heads". The likelihood is p(heads | θ) = θ, so the posterior is p(θ | heads) ∝ p(θ) · θ (the slide shows the prior, likelihood, and posterior densities over [0, 1]).
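A small numerical sketch of this single-observation update; the uniform prior and the grid resolution are illustrative choices of mine, not taken from the slide.

```python
import numpy as np

# Bayes' rule on a grid of theta values after observing a single "heads".
theta = np.linspace(0.0, 1.0, 1001)
dtheta = theta[1] - theta[0]
prior = np.ones_like(theta)            # p(theta): flat over [0, 1] (assumed)
likelihood = theta                     # p(heads | theta) = theta
posterior = prior * likelihood         # posterior is proportional to prior * likelihood
posterior /= posterior.sum() * dtheta  # normalize so it integrates to ~1

print("posterior mean:", (theta * posterior).sum() * dtheta)  # about 2/3
```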

9 A Bayes net for learning probabilities: a parameter node θ with an arc to each observation node X1, X2, ..., XN (toss 1, toss 2, ..., toss N).

10 Sufficient statistics: the counts (#h, #t) are sufficient statistics; the likelihood depends on the data only through them, p(d | θ) = θ^#h (1 − θ)^#t.

11 The probability of heads on the next toss: p(X_{N+1} = heads | d) = ∫ θ p(θ | d) dθ, the posterior mean of θ.

12 Prior Distributions for θ: direct assessment; or parametric distributions – conjugate distributions (for convenience) or mixtures of conjugate distributions.

13 Conjugate Family of Distributions. Beta distribution: p(θ) = Beta(θ | αh, αt) ∝ θ^(αh−1) (1 − θ)^(αt−1). Properties: the posterior is again a Beta, Beta(θ | αh + #h, αt + #t), and the probability of heads on the next toss is (αh + #h) / (αh + αt + #h + #t).
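A sketch of the conjugate update in code; the hyperparameter and count values are invented, and `beta_pdf` is just a small helper written here for illustration.

```python
import math

# Conjugate Beta update: prior Beta(a_h, a_t) plus counts (#h, #t)
# gives posterior Beta(a_h + #h, a_t + #t).
a_h, a_t = 2.0, 2.0        # prior hyperparameters (imaginary counts, assumed)
n_h, n_t = 7, 3            # observed heads / tails (assumed)

post_h, post_t = a_h + n_h, a_t + n_t      # posterior hyperparameters
pred_heads = post_h / (post_h + post_t)    # E[theta | data] = P(next toss = heads)

def beta_pdf(theta, a, b):
    """Density of Beta(a, b) at theta, via log-gamma for numerical stability."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(theta) + (b - 1) * math.log(1 - theta))

print(f"posterior: Beta({post_h}, {post_t}), P(heads next) = {pred_heads:.3f}")
print(f"posterior density at theta=0.7: {beta_pdf(0.7, post_h, post_t):.3f}")
```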

14 Intuition: the hyperparameters αh and αt can be thought of as imaginary counts from our prior experience, starting from "pure ignorance". Equivalent sample size = αh + αt. The larger the equivalent sample size, the more confident we are about the true probability.

15 Beta Distributions: example densities Beta(3, 2), Beta(1, 1), Beta(19, 39), and Beta(0.5, 0.5).

16 Assessment of a Beta Distribution. Method 1: equivalent sample – either assess αh and αt directly, or assess the equivalent sample size αh + αt and the mean αh / (αh + αt). Method 2: imagined future samples.
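A tiny illustration of Method 1: pick an equivalent sample size and a prior mean (both numbers below are invented), then recover αh and αt.

```python
# Assessing a Beta prior from an equivalent sample size and a mean.
equiv_sample_size = 10.0   # a_h + a_t: how much the prior is "worth" (assumed)
prior_mean = 0.3           # a_h / (a_h + a_t): assessed long-run fraction of heads (assumed)

a_h = equiv_sample_size * prior_mean
a_t = equiv_sample_size * (1.0 - prior_mean)
print(f"assessed prior: Beta({a_h}, {a_t})")   # Beta(3.0, 7.0)
```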

17 Generalization to m discrete outcomes ("multinomial distribution"). Dirichlet distribution: p(θ1, ..., θm) ∝ ∏k θk^(αk−1). Properties: the posterior is Dirichlet(α1 + N1, ..., αm + Nm), and the probability that the next observation is outcome k is (αk + Nk) / Σj (αj + Nj).
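The same conjugate update for the Dirichlet case, as a short sketch with invented hyperparameters and counts.

```python
# Dirichlet update for m discrete outcomes, mirroring the Beta case:
# prior Dirichlet(a_1, ..., a_m) + counts (N_1, ..., N_m)
# -> posterior Dirichlet(a_1 + N_1, ..., a_m + N_m).
alphas = [1.0, 1.0, 1.0]   # prior imaginary counts for a 3-outcome variable (assumed)
counts = [12, 5, 3]        # observed counts for each outcome (assumed)

post = [a + n for a, n in zip(alphas, counts)]
total = sum(post)
pred = [p / total for p in post]   # P(next observation = k | data)
print("posterior hyperparameters:", post)
print("predictive distribution:", [round(p, 3) for p in pred])
```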

18 More generalizations (see, e.g., Bernardo and Smith, 1994): likelihoods from the exponential family – binomial, multinomial, Poisson, gamma, normal.

19 Overview: Intro to Bayesian statistics (learning a probability); Learning probabilities in a Bayes net; Learning Bayes-net structure

20 From thumbtacks to Bayes nets: the thumbtack problem can be viewed as learning the probability for a very simple BN with a single node X (heads/tails); the learning network has a parameter node θ with arcs to the observations X1, X2, ..., XN (toss 1 through toss N).

21 The next simplest Bayes net: two variables X (heads/tails) and Y (heads/tails) with no arc between them, i.e., two thumbtacks.

22 The next simplest Bayes net: the learning network has a parameter node θX with arcs to X1, ..., XN and a parameter node θY with arcs to Y1, ..., YN (case 1 through case N); is there a dependency between θX and θY?

23 The next simplest Bayes net: with no arc between θX and θY, the parameters are independent a priori ("parameter independence").

24 The next simplest Bayes net: "parameter independence" ⇒ two separate thumbtack-like learning problems, one for θX and one for θY.

25 A bit more difficult... X (heads/tails) → Y (heads/tails). Three probabilities to learn: θX=heads, θY=heads|X=heads, and θY=heads|X=tails.

26 A bit more difficult... the corresponding learning network: parameter nodes θX, θY|X=heads, and θY|X=tails, with observations X1, Y1 (case 1) and X2, Y2 (case 2); the slide also marks the observed values heads/tails.

27 A bit more difficult... the same learning network: θX, θY|X=heads, and θY|X=tails, with the observations for case 1 and case 2.

28 A bit more difficult... are there dependencies between the parameter nodes θX, θY|X=heads, and θY|X=tails? (each potential dependency is marked with a "?").

29 A bit more difficult... with parameter independence, these are 3 separate thumbtack-like problems.
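A sketch of those three thumbtack-like problems on a toy complete data set; the cases and the uniform Beta(1, 1) priors are illustrative choices, not taken from the slides.

```python
from collections import Counter

# Learning the three probabilities of the X -> Y network from complete data
# as three independent Beta-Bernoulli ("thumbtack") problems.
cases = [("heads", "heads"), ("heads", "tails"), ("tails", "heads"),
         ("heads", "heads"), ("tails", "tails"), ("heads", "heads")]  # (x, y), invented

a_h, a_t = 1.0, 1.0   # uniform Beta(1, 1) prior for every parameter (assumed)

def posterior_mean(n_heads, n_total):
    return (a_h + n_heads) / (a_h + a_t + n_total)

x_counts = Counter(x for x, _ in cases)
y_given = {v: Counter(y for x, y in cases if x == v) for v in ("heads", "tails")}

theta_x = posterior_mean(x_counts["heads"], len(cases))
theta_y_h = posterior_mean(y_given["heads"]["heads"], sum(y_given["heads"].values()))
theta_y_t = posterior_mean(y_given["tails"]["heads"], sum(y_given["tails"].values()))
print(f"theta_X={theta_x:.3f}, theta_Y|X=heads={theta_y_h:.3f}, theta_Y|X=tails={theta_y_t:.3f}")
```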

30 In general… learning probabilities in a BN is straightforward if: local distributions are from the exponential family (binomial, Poisson, gamma, ...); parameter independence holds; priors are conjugate; and the data are complete.

31 Incomplete data makes parameters dependent: in the same X → Y learning network, if some values are missing, the parameter nodes θX, θY|X=heads, and θY|X=tails are no longer independent given the data.

32 Overview: Intro to Bayesian statistics (learning a probability); Learning probabilities in a Bayes net; Learning Bayes-net structure

33 Learning Bayes-net structure. Given data, which model is correct? E.g., model 1 and model 2, two candidate structures over X and Y.

34 Bayesian approach: given data d, ask not which model is correct but which is more likely, i.e., compute the posterior probability of each candidate model.

35 Bayesian approach: Model Averaging. Given data d, weight each candidate model by its posterior probability and average their predictions.

36 Bayesian approach: Model Selection. Given data d, keep only the best model – useful for explanation, understanding, and tractability.

37 To score a model, use Bayes' rule. Given data d, the model score is p(m | d) ∝ p(m) p(d | m), where the "marginal likelihood" p(d | m) = ∫ p(d | θ, m) p(θ | m) dθ averages the likelihood over the parameter prior.

38 Thumbtack example (single node X, heads/tails) with a conjugate Beta(αh, αt) prior: the marginal likelihood has the closed form p(d) = [Γ(α)/Γ(α+N)] · [Γ(αh+#h)/Γ(αh)] · [Γ(αt+#t)/Γ(αt)], where α = αh + αt and N = #h + #t.
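A sketch of this closed-form marginal likelihood in code, using log-gamma for numerical stability; the counts and the Beta(1, 1) prior are illustrative.

```python
from math import lgamma, exp

# Marginal likelihood of thumbtack data under a conjugate Beta(a_h, a_t) prior:
# p(d) = [G(a)/G(a+N)] * [G(a_h+#h)/G(a_h)] * [G(a_t+#t)/G(a_t)],  a = a_h + a_t.
def log_marginal_likelihood(n_h, n_t, a_h=1.0, a_t=1.0):
    a, n = a_h + a_t, n_h + n_t
    return (lgamma(a) - lgamma(a + n)
            + lgamma(a_h + n_h) - lgamma(a_h)
            + lgamma(a_t + n_t) - lgamma(a_t))

# p(d) for 7 heads and 3 tails under a Beta(1, 1) prior (numbers are invented).
print(exp(log_marginal_likelihood(7, 3)))
```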

39 More complicated graphs: for X → Y, there are 3 separate thumbtack-like learning problems, one each for θX, θY|X=heads, and θY|X=tails, and the marginal likelihood factors accordingly.

40 Model score for a discrete BN: with Dirichlet priors, parameter independence, and complete data, p(d | m) = ∏i ∏j [Γ(αij)/Γ(αij + Nij)] ∏k [Γ(αijk + Nijk)/Γ(αijk)], where i ranges over variables, j over configurations of variable i's parents, and k over variable i's values; Nijk are counts, αijk the Dirichlet hyperparameters, αij = Σk αijk, and Nij = Σk Nijk.
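A sketch of that product-of-Gammas score; the nested-list layout of the counts and hyperparameters, and the toy numbers, are my own choices for illustration.

```python
from math import lgamma

# Log marginal likelihood of a discrete BN with Dirichlet priors.
# counts[i][j][k] = N_ijk: cases where variable i takes value k with parents in config j;
# alphas holds the matching Dirichlet hyperparameters a_ijk.
def log_score(counts, alphas):
    score = 0.0
    for n_i, a_i in zip(counts, alphas):               # variables
        for n_ij, a_ij in zip(n_i, a_i):               # parent configurations
            a_sum, n_sum = sum(a_ij), sum(n_ij)
            score += lgamma(a_sum) - lgamma(a_sum + n_sum)
            for n_ijk, a_ijk in zip(n_ij, a_ij):       # values of variable i
                score += lgamma(a_ijk + n_ijk) - lgamma(a_ijk)
    return score

# Toy example: two binary variables X -> Y.  X has one (empty) parent configuration,
# Y has two parent configurations (X=0 and X=1).  All numbers are invented.
counts = [[[4, 2]],            # X: 4 cases with X=0, 2 with X=1
          [[3, 1], [1, 1]]]    # Y: counts of Y=0/1 given X=0, then given X=1
alphas = [[[1.0, 1.0]], [[1.0, 1.0], [1.0, 1.0]]]
print(log_score(counts, alphas))
```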

41 Computation of Marginal Likelihood: an efficient closed form exists if local distributions are from the exponential family (binomial, Poisson, gamma, ...), parameter independence holds, priors are conjugate, and there are no missing data (including no hidden variables).

42 Practical considerations: the number of possible BN structures for n variables is super-exponential in n. How do we find the best graph(s)? How do we assign structure and parameter priors to all possible graphs?
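To make "super-exponential" concrete, a short sketch that counts labeled DAGs with Robinson's recurrence; the recurrence itself is standard, the code around it is mine.

```python
from functools import lru_cache
from math import comb

# Number of labeled DAGs on n nodes (Robinson's recurrence):
# a(n) = sum_{k=1..n} (-1)^(k+1) * C(n, k) * 2^(k*(n-k)) * a(n-k),  a(0) = 1.
@lru_cache(maxsize=None)
def num_dags(n):
    if n == 0:
        return 1
    return sum((-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * num_dags(n - k)
               for k in range(1, n + 1))

for n in range(1, 8):
    print(n, num_dags(n))   # 1, 3, 25, 543, 29281, 3781503, 1138779265
```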

43 Model search: finding the BN structure with the highest score among structures with at most k parents is NP-hard for k > 1 (Chickering, 1995). Heuristic methods: greedy search, greedy search with restarts, MCMC methods. The greedy loop (see the sketch below): initialize a structure; score all possible single changes; if any change improves the score, perform the best change and repeat; otherwise return the saved structure.
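A sketch of that greedy loop; here the neighborhood is restricted to single-arc additions and deletions, and the scoring function is a toy stand-in rather than the Bayesian score from the earlier slides.

```python
from itertools import permutations

# Greedy hill-climbing over DAG structures: score all single-arc changes,
# apply the best one while it improves the score, otherwise stop.
def has_cycle(edges, n):
    adj = {i: [b for a, b in edges if a == i] for i in range(n)}
    def visit(u, stack):
        if u in stack:
            return True
        return any(visit(v, stack | {u}) for v in adj[u])
    return any(visit(i, frozenset()) for i in range(n))

def neighbors(edges, n):
    """All graphs reachable by adding or deleting one arc (cycles excluded)."""
    for a, b in permutations(range(n), 2):
        new = edges ^ {(a, b)}          # toggle the arc a -> b
        if not has_cycle(new, n):
            yield new

def greedy_search(n, score, edges=frozenset()):
    current, current_score = edges, score(edges)
    while True:
        best, best_score = None, current_score
        for cand in neighbors(current, n):         # score all possible single changes
            s = score(cand)
            if s > best_score:
                best, best_score = cand, s
        if best is None:                           # no change improves the score
            return current, current_score
        current, current_score = best, best_score  # perform the best change

# Toy score that prefers the arc 0 -> 1 and penalizes extra arcs (illustrative only).
toy_score = lambda e: (2.0 if (0, 1) in e else 0.0) - 0.5 * len(e)
print(greedy_search(2, toy_score))
```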

44 Structure priors: 1. all possible structures equally likely; 2. partial ordering, required/prohibited arcs; 3. p(m) ∝ similarity(m, prior BN).

45 Parameter priors: all uniform, Beta(1,1); or use a prior BN.

46 Parameter priors. Recall the intuition behind the Beta prior for the thumbtack: the hyperparameters αh and αt can be thought of as imaginary counts from our prior experience, starting from "pure ignorance"; equivalent sample size = αh + αt; the larger the equivalent sample size, the more confident we are about the long-run fraction.

47 Parameter priors: a prior network over X1…Xn plus an equivalent sample size give an imaginary count for any variable configuration, and hence (by parameter modularity) parameter priors for any BN structure over X1…Xn.
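A sketch of turning a prior network plus an equivalent sample size into Dirichlet hyperparameters via αijk = ESS · p(Xi = k, Pai = j | prior network); the two-variable joint table and all the numbers below are invented for illustration.

```python
from itertools import product

# Imaginary counts from a prior network and an equivalent sample size.
prior_joint = {(0, 0): 0.4, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.3}  # p(x, y), assumed
ess = 10.0                                                          # equivalent sample size

# Hyperparameters for the structure X -> Y: one table for X, one for Y given each X.
alpha_x = {x: ess * sum(p for (xv, _), p in prior_joint.items() if xv == x)
           for x in (0, 1)}
alpha_y_given_x = {(x, y): ess * prior_joint[(x, y)] for x, y in product((0, 1), repeat=2)}
print(alpha_x)             # imaginary counts for X
print(alpha_y_given_x)     # imaginary counts for (X, Y) configurations
```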

48 Combine user knowledge and data: the prior network over X1…X9 plus an equivalent sample size, together with data (a table of cases over x1 = true/false, x2, x3, ...), yield improved network(s).

