A Tutorial on Learning with Bayesian Networks. Author: David Heckerman. Presented by: Yan Zhang (2006), Jeremy Gould (2013), Chip Galusha (2014).

Outline: Bayesian Approach (Bayesian vs. classical probability methods; Bayes' Theorem; examples); Bayesian Networks (structure; inference; learning probabilities; dealing with unknowns; learning the network structure; two-coin-toss example); Conclusions; Exam Questions.

Bayesian vs. the Classical Approach Bayesian statistical methods start with existing 'prior' beliefs and update these using data to give 'posterior' beliefs, which may be used as the basis for inferential decisions and probability assessments. Classical probability refers to the true or actual probability of the event and is not concerned with observed behavior.

Example – Is this Man a Martian Spy?

Example We start with two concepts: 1. Hypothesis (H) – He either is or is not a Martian spy. 2. Data (D) – Some set of information about the subject. Perhaps financial data, phone records, maybe we bugged his office…

Example The frequentist says: given a hypothesis (he IS a Martian), there is a probability P of seeing this data, P(D | H). (This considers an absolute ground truth; the uncertainty/noise is in the data.) The Bayesian says: given this data, there is a probability P of this hypothesis being true, P(H | D). (This probability indicates our level of belief in the hypothesis.)

Bayesian vs. the Classical Approach The Bayesian approach restricts its prediction to the next (N+1) occurrence of an event, given the N previously observed events. The classical approach predicts the likelihood of any given event regardless of the number of occurrences. NOTE: The Bayesian prediction can be updated as new data are observed.

Bayes Theorem p(θ | y) = p(y | θ) p(θ) / p(y), where p(y) = Σ_θ p(y | θ) p(θ) for discrete θ, or p(y) = ∫ p(y | θ) p(θ) dθ for continuous θ. In both cases, p(y) is a marginal distribution and can be thought of as a normalizing constant, which allows us to rewrite the above as p(θ | y) ∝ p(y | θ) p(θ).
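
As a quick worked instance (the numbers here are invented purely for illustration, tied to the Martian-spy setup from the earlier slides): suppose the prior probability that the man is a spy is P(H) = 0.001, and the observed data are much more likely under the spy hypothesis, P(D | H) = 0.9, than under its negation, P(D | ¬H) = 0.05. Then:

```latex
P(H \mid D)
  = \frac{P(D \mid H)\,P(H)}{P(D \mid H)\,P(H) + P(D \mid \neg H)\,P(\neg H)}
  = \frac{0.9 \times 0.001}{0.9 \times 0.001 + 0.05 \times 0.999}
  \approx 0.018
```

Even strongly suggestive data moves a very small prior only so far; this is exactly the belief updating the following slides formalize.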

Example – Coin Toss I want to toss a coin n = 100 times. Let's denote the random variable X as the outcome of one flip: p(X = head) = θ, p(X = tail) = 1 − θ. Before doing this experiment we have some belief in our mind: the prior probability. Let's assume a Beta distribution for this prior belief about θ (a common choice): p(θ) = Beta(θ; α, β) = [Γ(α + β) / (Γ(α) Γ(β))] θ^(α−1) (1 − θ)^(β−1).

Example – Coin Toss If we assume a fair coin we can fix α = β = 5, which gives a Beta(5, 5) prior centered at θ = 0.5. (Hopefully, what you were expecting!) [Figure: the Beta(5, 5) prior distribution]

Example – Coin Toss Now I can run my experiment. As I go, I can update my beliefs based on the observed heads (h) and tails (t) by applying Bayes' law to obtain the posterior: p(θ | D) = p(D | θ) p(θ) / p(D), with likelihood p(D | θ) = θ^h (1 − θ)^t.

Example – Coin Toss Since we're assuming a Beta distribution, this becomes p(θ | D) = Beta(θ; α + h, β + t) — our posterior probability. Supposing that we observed h = 45 heads and t = 55 tails, we would get p(θ | D) = Beta(θ; 50, 60).
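
A minimal sketch of this conjugate update (assuming SciPy is available; the prior and counts are the ones from the slides):

```python
# Minimal sketch of the Beta-Binomial update from the coin-toss example.
# Assumes a Beta(5, 5) prior and the observed counts h = 45 heads, t = 55 tails.
from scipy import stats

alpha, beta = 5, 5          # prior hyperparameters (fair-coin belief)
h, t = 45, 55               # observed heads and tails

# Conjugacy: Beta prior + Binomial likelihood -> Beta posterior
posterior = stats.beta(alpha + h, beta + t)   # Beta(50, 60)

print("Posterior mean of theta:", posterior.mean())       # (alpha+h)/(alpha+beta+h+t) = 50/110
print("95% credible interval:", posterior.interval(0.95))
```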

Example – Coin Toss [Figure: distribution plot for the coin-toss example]

Integration To find the probability that X_{N+1} = heads, we could also integrate over all possible values of θ, which yields the average (expected) value of θ: p(X_{N+1} = heads | D) = ∫ θ p(θ | D) dθ = E[θ | D]. This might be necessary if we were working with a distribution with a less obvious expected value.
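
Plugging in the Beta posterior from the previous slides (α = β = 5, h = 45, t = 55), the integral reduces to the mean of the Beta distribution:

```latex
p(X_{N+1}=\text{heads}\mid D)
  = \int_0^1 \theta\, p(\theta \mid D)\, d\theta
  = \frac{\alpha + h}{\alpha + \beta + h + t}
  = \frac{50}{110} \approx 0.45
```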

More than Two Outcomes In the previous example, we used a Beta distribution to encode the states of the random variable. This was possible because there were only 2 states/outcomes of the variable X. In general, if the observed variable X is discrete with r possible states {1,…,r}, the likelihood of a data set D containing N_k observations of state k is p(D | θ) = ∏_{k=1}^{r} θ_k^{N_k}. In this general case we can use a Dirichlet distribution instead: p(θ) = Dir(θ; α_1, …, α_r) = [Γ(Σ_k α_k) / ∏_k Γ(α_k)] ∏_{k=1}^{r} θ_k^{α_k − 1}.
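
A minimal sketch of the corresponding conjugate update (assuming NumPy; the prior pseudo-counts and observed counts below are invented for illustration):

```python
# Sketch: Dirichlet-multinomial update for a discrete variable with r states.
# The prior pseudo-counts and observed counts are made up for illustration.
import numpy as np

alpha = np.array([2.0, 2.0, 2.0])   # Dirichlet prior pseudo-counts, r = 3 states
counts = np.array([10, 3, 7])       # observed counts N_k for each state

posterior_alpha = alpha + counts    # conjugacy: Dirichlet prior + multinomial likelihood

# Predictive probability of each state on the next observation:
predictive = posterior_alpha / posterior_alpha.sum()
print("Posterior pseudo-counts:", posterior_alpha)
print("Predictive distribution:", predictive)
```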

Vocabulary Review Prior probability, P(θ): the probability of a particular value of θ given no observed data (our previous "belief"). Posterior probability, P(θ | D): the probability of a particular value of θ given that D has been observed (our updated belief about θ). Observed probability or "likelihood", P(D | θ): the likelihood of the sequence of coin tosses D being observed given that θ has a particular value. P(D): the marginal probability of D.

Bayesian Advantages It turns out that the Bayesian technique permits us to do some very useful things from a mining perspective! 1. We can use the chain rule with Bayesian probabilities, e.g. P(A, B, C) = P(A | B, C) P(B | C) P(C). This isn't something we can easily do with classical probability! 2. As we've already seen, the Bayesian model permits us to update our beliefs based on new data.
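
A tiny sketch that checks this chain-rule factorization numerically on a made-up joint distribution over three binary variables (all numbers are arbitrary):

```python
# Sketch: numerically checking P(A,B,C) = P(A|B,C) * P(B|C) * P(C)
# on a randomly generated joint distribution over three binary variables.
import itertools
import random

random.seed(0)
weights = {abc: random.random() for abc in itertools.product([0, 1], repeat=3)}
total = sum(weights.values())
joint = {abc: w / total for abc, w in weights.items()}        # P(A,B,C)

def marginal(fixed):
    # Sum the joint over all variables not pinned down in `fixed` (index -> value).
    return sum(p for abc, p in joint.items()
               if all(abc[i] == v for i, v in fixed.items()))

a, b, c = 1, 0, 1
p_c = marginal({2: c})                                        # P(C)
p_b_given_c = marginal({1: b, 2: c}) / p_c                    # P(B|C)
p_a_given_bc = joint[(a, b, c)] / marginal({1: b, 2: c})      # P(A|B,C)

print(joint[(a, b, c)], "==", p_a_given_bc * p_b_given_c * p_c)
```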

Outline: Bayesian Approach (Bayes' Theorem; Bayesian vs. classical probability methods; coin toss example); Bayesian Networks (structure; inference; learning probabilities; dealing with unknowns; learning the network structure; two-coin-toss example); Conclusions; Exam Questions.

Bayesian Network To create a Bayesian network we will ultimately need 3 things: a set of variables X = {X_1, …, X_n}; a network structure S; and conditional probability tables (CPTs). Note that when we start we may not have any of these things, or a given element may be incomplete! The probabilities encoded by a Bayesian network may be Bayesian, physical, or both.

A Little Notation S: the network structure. S^h: the hypothesis corresponding to network structure S. S^c: a complete network structure. X_i: a variable and its corresponding node. Pa_i: the variables/nodes corresponding to the parents of node X_i. D: a data set.

Bayesian Network: Detecting Credit Card Fraud Let's start with a simple case where we are given all three things: a credit-fraud network designed to determine the probability of credit card fraud.

Bayesian Network: Setup 1. Correctly identify the goals. 2. Identify many possible relevant observations. 3. Determine what subset of those observations is worth modeling. 4. Organize the observations into variables and choose an ordering.

Set of Variables Each node represents a random variable. (Let's assume discrete for now.)

Network Structure Each edge/arc represents a conditional dependence between variables.

Conditional Probability Table Each entry represents the quantification of a conditional dependency.

Conditional Dependencies Since we've been given the network structure, we can easily read off the conditional dependencies (ordering the variables as F, A, S, G, J): P(A | F) = P(A), P(S | F, A) = P(S), P(G | F, A, S) = P(G | F), P(J | F, A, S, G) = P(J | F, A, S). Need to be careful with the order!
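
A minimal sketch of how this structure might be written down in code, reading F, A, S, G, J as Fraud, Age, Sex, Gas, Jewelry from Heckerman's fraud example (only the parent sets are shown; a full implementation would attach a CPT to each node):

```python
# Structure of the credit-fraud network: each node mapped to its parents.
parents = {
    "Fraud":   [],
    "Age":     [],
    "Sex":     [],
    "Gas":     ["Fraud"],
    "Jewelry": ["Fraud", "Age", "Sex"],
}

# The joint factorizes as the product of p(node | parents) over all nodes:
# p(F, A, S, G, J) = p(F) p(A) p(S) p(G|F) p(J|F,A,S)
for node, pa in parents.items():
    print(f"p({node} | {', '.join(pa) if pa else 'no parents'})")
```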

Note that the absence of an edge indicates conditional independence: P(A | G) = P(A).

Important Note: The presence of a cycle will render one or more of the relationships intractable! (The structure must be a directed acyclic graph.)

Inference Now suppose we want to calculate (infer) our confidence level in a hypothesis on the fraud variable f given some knowledge about the other variables. This can be directly calculated via p(f | a, s, g, j) = p(f, a, s, g, j) / Σ_{f'} p(f', a, s, g, j). (Kind of messy…)

Inference Fortunately, we can use the chain rule to simplify: p(f, a, s, g, j) = p(f) p(a) p(s) p(g | f) p(j | f, a, s). This simplification is especially powerful when the network is sparse, which is frequently the case in real-world problems. This shows how we can use a Bayesian network to infer a probability not stored directly in the model.
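
A runnable sketch of this inference using the factorization above; the numeric CPT values are invented for illustration (they are not the ones from the original slides' figure):

```python
# Sketch: inference in the credit-fraud network via the chain-rule factorization
# p(f,a,s,g,j) = p(f) p(a) p(s) p(g|f) p(j|f,a,s).
# All numeric CPT values below are hypothetical placeholders.
P_F = {True: 0.00001, False: 0.99999}                 # P(Fraud)
P_A = {'<30': 0.25, '30-50': 0.40, '>50': 0.35}       # P(Age)
P_S = {'male': 0.5, 'female': 0.5}                    # P(Sex)
P_G = {True: {True: 0.2, False: 0.8},                 # P(Gas | Fraud)
       False: {True: 0.01, False: 0.99}}

def P_J(j, f, a, s):
    # P(Jewelry | Fraud, Age, Sex): made-up values, higher if fraud.
    p = 0.05 if f else (0.0005 if s == 'male' else 0.002)
    return p if j else 1.0 - p

def joint(f, a, s, g, j):
    return P_F[f] * P_A[a] * P_S[s] * P_G[f][g] * P_J(j, f, a, s)

def prob_fraud_given(a, s, g, j):
    # p(f | a,s,g,j) = p(f,a,s,g,j) / sum over f' of p(f',a,s,g,j)
    num = joint(True, a, s, g, j)
    return num / (num + joint(False, a, s, g, j))

print(prob_fraud_given(a='<30', s='male', g=True, j=True))
```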

Now for the Data Mining! So far we haven't added much value to the data. So let's take advantage of the Bayesian model's ability to update our beliefs and learn from new data. First we'll rewrite our joint probability distribution in a more compact form: p(x | θ_s, S^h) = ∏_{i=1}^{n} p(x_i | pa_i, θ_i, S^h), where θ_i are the parameters of the local distribution p(x_i | pa_i) and θ_s collects all of them.

Learning Probabilities in a Bayesian Network First we need to make two assumptions: 1. There is no missing data (i.e. the data accurately describe the distribution). 2. The parameter vectors are independent (generally a good assumption, at least locally).

Learning Probabilities in a Bayesian Network If these assumptions hold, the parameter posterior factorizes and we can express the probabilities as p(θ_s | D, S^h) = ∏_{i=1}^{n} ∏_{j=1}^{q_i} p(θ_ij | D, S^h); with Dirichlet priors each factor is updated by simple counting: p(θ_ij | D, S^h) = Dir(θ_ij; α_ij1 + N_ij1, …, α_ijr_i + N_ijr_i), where N_ijk is the number of cases in D with X_i in state k and Pa_i in its j-th configuration.
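
A minimal sketch of this counting update for a single node of the fraud network (the data records below are invented, and a uniform Dirichlet prior with pseudo-count 1 is assumed):

```python
# Sketch: learning the CPT for one node (Gas | Fraud) from complete data,
# using Dirichlet priors updated by observed counts (the data rows are invented).
from collections import Counter

data = [  # each record: (fraud, gas) -- hypothetical complete observations
    (False, False), (False, False), (False, True), (True, True),
    (False, False), (True, False), (False, False), (False, True),
]

prior = 1.0   # uniform Dirichlet pseudo-count alpha_ijk = 1 for every cell

counts = Counter(data)   # N_ijk: joint counts of (parent state, child state)
for f in (True, False):
    total = sum(counts[(f, g)] + prior for g in (True, False))
    for g in (True, False):
        # posterior predictive p(Gas=g | Fraud=f, D) = (alpha + N) / sum(alpha + N)
        p = (counts[(f, g)] + prior) / total
        print(f"P(Gas={g} | Fraud={f}) = {p:.3f}")
```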

What if we are missing data? First distinguish whether the missingness depends on the variable states or is independent of state. If it is independent of state, two common approaches are: 1. Monte Carlo methods (e.g., Gibbs sampling); 2. Gaussian approximations.

Dealing with Unknowns Now we know how to use our network to infer conditional relationships and how to update our network with new data. But what if we aren't given a well-defined network? We could start with missing or incomplete: 1. set of variables; 2. conditional relationship data; 3. network structure.

Unknown Variable Set Our goal when choosing variables is to “organize… into variables having mutually exclusive and collectively exhaustive states.” This is a problem shared by all data mining algorithms: what should we measure, and why? There is not, and probably cannot be, an algorithmic solution to this problem, as arriving at any solution requires intelligent and creative thought.

Unknown Conditional Relationships This can be easy. So long as we can generate a plausible initial belief about a conditional relationship, we can simply start with our assumption and let our data refine our model via the mechanism shown in the "Learning Probabilities in a Bayesian Network" slides.

Unknown Conditional Relationships However, when our ignorance becomes serious enough that we no longer even know what is dependent on what, we segue into the unknown-structure scenario.

Learning the Network Structure Sometimes the conditional relationships are not obvious. In this case we are uncertain about the network structure: we don't know where the edges should be.

Learning the Network Structure Theoretically, we can use a Bayesian approach to get the posterior distribution of the network structure: p(S^h | D) = p(S^h) p(D | S^h) / p(D) ∝ p(S^h) p(D | S^h). Unfortunately, the number of possible network structures grows more than exponentially with n, the number of nodes. We're basically asking ourselves to consider every possible directed acyclic graph with n nodes!
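
To get a feel for how fast the search space blows up, here is a short sketch that counts labeled DAGs using Robinson's recurrence (standard combinatorics, not something from the slides):

```python
# Sketch: counting labeled directed acyclic graphs on n nodes via
# Robinson's recurrence: a(n) = sum_{k=1..n} (-1)^(k+1) C(n,k) 2^(k(n-k)) a(n-k).
from math import comb

def num_dags(n, cache={0: 1}):
    if n not in cache:
        cache[n] = sum((-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * num_dags(n - k)
                       for k in range(1, n + 1))
    return cache[n]

for n in range(1, 8):
    print(n, num_dags(n))   # 1, 3, 25, 543, 29281, 3781503, 1138779265
```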

Learning the Network Structure Two main methods shorten the search for a network model. Model selection: select a “good” model (i.e., a network structure) from all possible models, and use it as if it were the correct model. Selective model averaging: select a manageable number of good models from among all possible models and pretend that these models are exhaustive. The math behind both techniques is quite involved, so I'm afraid we'll have to content ourselves with a toy example today.

Two Coin Toss Example Experiment: flip two coins and observe the outcome. Propose two network structures: S^h_1 (X_1 and X_2 independent) or S^h_2 (an arc X_1 → X_2). Assume P(S^h_1) = P(S^h_2) = 0.5. The coins have p(H) = p(T) = 0.5, and under S^h_2 the conditional table P(X_2 | X_1) is: p(H|H) = 0.1, p(T|H) = 0.9, p(H|T) = 0.9, p(T|T) = 0.1. After observing some data, which model is more accurate for this collection of data?

Two Coin Toss Example — observed data:
Toss:  1  2  3  4  5  6  7  8  9  10
X1:    T  T  H  H  T  H  T  T  H  H
X2:    T  H  T  T  H  T  H  H  T  T


Two Coin Toss Example: comparing p(D | S^h_1) and p(D | S^h_2) for the observed data.
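
A sketch of one way to carry out this comparison, taking the parameter values from the model slide as fixed (this is an illustrative reconstruction, not necessarily the exact calculation the original slides performed):

```python
# Sketch: comparing the two structures for the two-coin data, with the parameter
# values from the model slide taken as fixed (p(H)=p(T)=0.5 for each coin under S1;
# under S2, X1 is fair and P(X2 | X1) comes from the slide's CPT).
data = [('T','T'), ('T','H'), ('H','T'), ('H','T'), ('T','H'),
        ('H','T'), ('T','H'), ('T','H'), ('H','T'), ('H','T')]

def likelihood_S1(data):
    # S1: X1 and X2 are independent fair coins.
    return 0.5 ** (2 * len(data))

def likelihood_S2(data):
    # S2: X1 is fair; p(X2 | X1) from the CPT on the model slide, keyed by (x1, x2).
    p_x2_given_x1 = {('H','H'): 0.1, ('H','T'): 0.9,
                     ('T','H'): 0.9, ('T','T'): 0.1}
    p = 1.0
    for x1, x2 in data:
        p *= 0.5 * p_x2_given_x1[(x1, x2)]
    return p

pD_S1, pD_S2 = likelihood_S1(data), likelihood_S2(data)
# Equal structure priors P(S1) = P(S2) = 0.5, so the posterior is proportional
# to the likelihoods:
print("p(D|S1) =", pD_S1, " p(D|S2) =", pD_S2)
print("P(S2|D) =", pD_S2 / (pD_S1 + pD_S2))
```

With 9 of the 10 tosses landing on opposite faces, S^h_2 explains the data far better, so its posterior probability dominates.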

Outline: Bayesian Approach (Bayes' Theorem; Bayesian vs. classical probability methods; coin toss example); Bayesian Networks (structure; inference; learning probabilities; learning the network structure; two-coin-toss example); Conclusions; Exam Questions.

Conclusions: Bayesian method; Bayesian networks (structure, inference); learning with Bayesian networks; dealing with unknowns.

Question 1: What are Bayesian networks? A graphical model that encodes probabilistic relationships among variables of interest.

Question 2: Compare the Bayesian and classical approaches to probability (any one point). Bayesian approach: reflects an expert's knowledge (+); beliefs keep updating as new data items arrive (+); somewhat arbitrary / more subjective (−); wants P(H | D). Classical probability: objective and unbiased (+); needs repeated trials (−); wants P(D | H).

Question 3: Mention at least one advantage of Bayesian networks. They handle incomplete data sets by encoding dependencies; allow learning about causal relationships; combine domain knowledge and data; and help avoid overfitting.

The End Any Questions?