Still More Uncertainty

Still More Uncertainty

Administrivia Read Ch 14, 15

Axioms of Probability Theory
Just 3 are enough to build entire theory! 1. Range of probabilities: 0 ≤ P(A) ≤ 1 2. P(true) = 1 and P(false) = 0 3. Probability of disjunction of events is: A B A  B True

Bayes rules! Simple proof from def of conditional probability 4

Independence P(AB) = P(A)P(B) A A  B B True © Daniel S. Weld 5

Conditional Independence
A&B not independent, since P(A|B) < P(A) A A  B B True © Daniel S. Weld 6

Conditional Independence
But: A&B are made independent by C A A  B AC P(A|C) = P(A|B,C) B True C B  C © Daniel S. Weld 7

Joint Distribution All you need to know Can answer any question
Inference by enumeration © Daniel S. Weld

Joint Distribution All you need to know Can answer any question
Inference by enumeration But… exponential Both time & space Solution: exploit conditional independence © Daniel S. Weld 9

Sample Bayes Net Earthquake Burglary Radio Alarm Nbr1Calls Nbr2Calls
Pr(B=t) Pr(B=f) Earthquake Burglary Pr(A|E,B) e,b (0.1) e,b (0.8) e,b (0.15) e,b (0.99) Radio Alarm Nbr1Calls Nbr2Calls © Daniel S. Weld 10

Given Parents, X is Independent of Non-Descendants
© Daniel S. Weld 11

Given Markov Blanket, X is Independent of All Other Nodes
MB(X) = Par(X)  Childs(X)  Par(Childs(X)) © Daniel S. Weld 12

Given Markov Blanket, X is Independent of All Other Nodes
Sprinkler On Rained Grass Wet MB(X) = Par(X)  Childs(X)  Par(Childs(X)) © Daniel S. Weld 13

Galaxy Quest 2 astronomers in Chile & Alaska measure (M1, M2) the number (N) of stars Normally there is a chance of error, e, of under or over counting by 1 star But sometimes, with probability f, a telescope can be out of focus (events F1 and F2) in which case the affected scientist will undercount by 3 stars

Structures F1 F2 F1 N F2 M1 M2 M1 M2 M1 M2 N N F1 F2
Two astronomers in Chile & Alaska measure (M1, M2) the number (N) of stars. Normally there is a chance of error, e, of under or over counting by 1 star. But sometimes, with probability f, a telescope can be out of focus (events F1 and F2) in which case the affected scientist will undercount by 3 stars

Inference in BNs We generally want to compute
Pr(X), or Pr(X|E) where E is (conjunctive) evidence The graphical independence representation Yields efficient inference Computations organized by network topology Two simple algorithms: Variable elimination (VE) Junction trees © Daniel S. Weld 16

Variable Elimination Procedure
The initial potentials are the CPTs in BN Repeat until only query variable remains: Choose another variable to eliminate Multiply all potentials that contain the variable If no evidence for the variable then sum the variable out and replace original potential by the new result Else, remove variable based on evidence Normalize remaining potential to get the final distribution over the query variable

Junction Tree Inference Algorithm
Incorporate Evidence: For each evidence variable, go to one table including that variable Set to 0 all entries that disagree with evidence Renormalize this potential Upward Step: Pass message to parents Downward Step: Pass message to children

Learning Parameter Estimation:
Maximum Likelihood (ML) Maximum A Posteriori (MAP) Bayesian Learning Parameters for a Bayesian Network Learning Structure of Bayesian Networks © Daniel S. Weld 19

Topics Bayesian networks overview Infernence Parameter Estimation:
Variable elimination Junction trees Parameter Estimation: Maximum Likelihood (ML) Maximum A Posteriori (MAP) Bayesian Learning Parameters for a Bayesian Network Learning Structure of Bayesian Networks © Daniel S. Weld

Which coin will I use? Coin Flip P(H|C1) = 0.1 C1 C2 P(H|C3) = 0.9 C3
P(C1) = 1/3 P(C2) = 1/3 P(C3) = 1/3 Uniform Prior: All hypothesis are equally likely before we make any observations

What is the probability of heads after two experiments?
Your Estimate? What is the probability of heads after two experiments? Most likely coin: C2 Best estimate for P(H) P(H|C2) = 0.5 C1 C2 C3 P(H|C1) = 0.1 P(H|C2) = 0.5 P(H|C3) = 0.9 P(C1) = 1/3 P(C2) = 1/3 P(C3) = 1/3

Maximum Likelihood Estimate
The best hypothesis that fits observed data assuming uniform prior After 2 Experiments: heads, tails Most likely coin: Best estimate for P(H) P(H|C2) = 0.5 C2 P(H|C2) = 0.5 P(C2) = 1/3

Using Prior Knowledge Should we always use a Uniform Prior ?
Background knowledge: Heads => we have take-home midterm Dan likes take-homes… => Dan is more likely to use a coin biased in his favor P(C1) = 0.05 P(C2) = 0.25 P(C3) = 0.70 P(H|C1) = 0.1 C1 C2 P(H|C3) = 0.9 C3 P(H|C2) = 0.5

Maximum A Posteriori (MAP) Estimate
The best hypothesis that fits observed data assuming a non-uniform prior Most likely coin: Best estimate for P(H) P(H|C3) = 0.9 C3 P(H|C3) = 0.9 P(C3) = 0.70

Summary For Now Prior Hypothesis Maximum Likelihood Estimate
Maximum A Posteriori Estimate Bayesian Estimate Uniform The most likely Any Weighted combination

Topics Parameter Estimation:
Maximum Likelihood (ML) Maximum A Posteriori (MAP) Bayesian Learning Parameters for a Bayesian Network Learning Structure of Bayesian Networks © Daniel S. Weld

Parameter Estimation and Bayesian Networks
J M T F ... We have: - Bayes Net structure and observations - We need: Bayes Net parameters

J M T F ... Prior Now compute either MAP or Bayesian estimate P(B) = ? + data =

What Prior to Use? The following are two common priors
Binary variable Beta Posterior distribution is binomial Easy to compute posterior Discrete variable Dirichlet Posterior distribution is multinomial © Daniel S. Weld

One Prior: Beta Distribution
a,b For any positive integer y, G(y) = (y-1)!

Beta Distribution Example: Flip coin with Beta distribution as prior over p [prob(heads)] Parameterized by two positive numbers: a, b Mode of distribution (E[p]) is a/(a+b) Specify our prior belief for p = a/(a+b) Specify confidence in this belief with high initial values for a and b Updating our prior belief based on data incrementing a for every heads outcome incrementing b for every tails outcome So after h heads out of n flips, our posterior distribution says P(head)=(a+h)/(a+b+n)

J M T F ... Prior .3 B ¬B .7 P(B|data) = ? Beta(1,4) “+ data” = (3,7) Prior P(B)= 1/(1+4) = 20% with equivalent sample size 5

J M T F ... P(A|E,B) = ? P(A|E,¬B) = ? P(A|¬E,B) = ? P(A|¬E,¬B) = ?

J M T F ... P(A|E,B) = ? P(A|E,¬B) = ? P(A|¬E,B) = ? P(A|¬E,¬B) = ? Prior Beta(2,3)

J M T F ... P(A|E,B) = ? P(A|E,¬B) = ? P(A|¬E,B) = ? P(A|¬E,¬B) = ? Prior Beta(2,3) + data= (3,4)

What if we don’t know structure?

Learning The Structure of Bayesian Networks
Search thru the space… of possible network structures! (for now, assume we observe all variables) For each structure, learn parameters Pick the one that fits observed data best Caveat – won’t we end up fully connected???? Problem !?!? When scoring, add a penalty  model complexity

Learning The Structure of Bayesian Networks
Search thru the space For each structure, learn parameters Pick the one that fits observed data best Penalize complex models Problem? Exponential number of networks! And we need to learn parameters for each! Exhaustive search out of the question! So what now?

Structure Learning as Search
Local Search Start with some network structure Try to make a change (add or delete or reverse edge) See if the new network is any better What should the initial state be? Uniform prior over random networks? Based on prior knowledge? Empty network? How do we evaluate networks? © Daniel S. Weld

A E C D B A B C A E C D B A E C D B D E A E C D B

Score Functions Bayesian Information Criteion (BIC) MAP score
P(D | BN) – penalty Penalty = ½ (# parameters) Log (# data points) MAP score P(BN | D) = P(D | BN) P(BN) P(BN) must decay exponentially with # of parameters for this to work well © Daniel S. Weld

Naïve Bayes Class Value … F 1 F 2 F 3 F N-2 F N-1 F N Assume that features are conditionally ind. given class variable Works well in practice Forces probabilities towards 0 and 1

Tree Augmented Naïve Bayes (TAN) [Friedman,Geiger & Goldszmidt 1997]
Class Value … F 1 F 2 F 3 F N-2 F N-1 F N Models limited set of dependencies Guaranteed to find best structure Runs in polynomial time

Still More Uncertainty

Similar presentations

Presentation on theme: "Still More Uncertainty"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Still More Uncertainty

Similar presentations

Presentation on theme: "Still More Uncertainty"— Presentation transcript:

Similar presentations

About project

Feedback