A Tutorial on Learning with Bayesian Networks


A Tutorial on Learning with Bayesian Networks, by David Heckerman

What is a Bayesian Network? “a graphical model for probabilistic relationships among a set of variables.”

Why use Bayesian Networks?
- Can handle incomplete data sets
- Can learn causal relationships
- Combine domain knowledge and data
- Avoid overfitting of the data (no separate test set needed)

Probability: two types, Bayesian and classical.

Bayesian Probability
- ‘Personal’ probability: a degree of belief
- A property of the person who assigns it
- The observations are fixed; imagine all possible values of the parameters from which they could have come
- Example: “I think the coin will land on heads 50% of the time”

Classical Probability
- ‘Physical’ probability: a property of the environment
- Imagine all data sets of size N that could be generated by sampling from the distribution determined by the parameters; each data set occurs with some probability and produces an estimate
- Example: “The probability of getting heads on this particular coin is 50%”

Notation
- Variable: X; state of X: x
- Set of variables: Y; assignment of the variables (configuration): y
- p(X = x | ξ): the probability that X = x for a person with state of information ξ
- Uncertain variable: Θ; its value (parameter): θ
- Outcome of the l-th trial: X_l
- Observations: D = {X_1 = x_1, ..., X_N = x_N}

Example: the thumbtack problem. Flip a thumbtack N times: will it land on the point (heads) or on its flat side (tails)? What will it do on the (N+1)-th flip? In other words, how do we compute p(x_{N+1} | D, ξ) from p(θ | ξ)?

Step 1: Use Bayes’ rule to obtain the probability distribution for Θ given D and ξ: p(θ | D, ξ) = p(θ | ξ) p(D | θ, ξ) / p(D | ξ), where p(D | ξ) = ∫ p(D | θ, ξ) p(θ | ξ) dθ.

Step 2: Expand p(D | θ, ξ), the likelihood function for binomial sampling. The observations in D are mutually independent given θ, with probability θ for heads and 1 - θ for tails, so p(D | θ, ξ) = θ^Nh (1 - θ)^Nt, where Nh and Nt are the numbers of heads and tails in D. Substituting into the equation from Step 1 gives p(θ | D, ξ) = p(θ | ξ) θ^Nh (1 - θ)^Nt / p(D | ξ).

Step 3: Average over the possible values of Θ to determine the probability that the (N+1)-th flip comes up heads: p(X_{N+1} = heads | D, ξ) = ∫ θ p(θ | D, ξ) dθ = E_{p(θ|D,ξ)}(θ), the expectation of θ with respect to the distribution p(θ | D, ξ).

Prior Distribution. The prior is taken to be a beta distribution: p(θ | ξ) = Beta(θ | αh, αt). The quantities αh and αt are called hyperparameters to distinguish them from the parameter θ. Because the beta prior is conjugate to the binomial likelihood, the posterior is also a beta distribution, p(θ | D, ξ) = Beta(θ | αh + Nh, αt + Nt), so the counts Nh and Nt are sufficient statistics for D, and the predictive probability of heads is (αh + Nh) / (αh + αt + N).
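To make Steps 1 to 3 concrete with a beta prior, here is a minimal Python sketch; the data and the hyperparameters αh = αt = 1 are invented for illustration and are not taken from the tutorial.

```python
from scipy.stats import beta

# Hypothetical thumbtack data: 1 = heads (lands on its point), 0 = tails.
D = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]
n_heads = sum(D)
n_tails = len(D) - n_heads

# Beta prior hyperparameters (chosen for illustration, not from the paper).
alpha_h, alpha_t = 1.0, 1.0

# Conjugacy: posterior is Beta(alpha_h + n_heads, alpha_t + n_tails).
posterior = beta(alpha_h + n_heads, alpha_t + n_tails)

# Predictive probability of heads on flip N+1 is the posterior mean of theta,
# i.e. (alpha_h + n_heads) / (alpha_h + alpha_t + N).
p_heads_next = posterior.mean()
print(p_heads_next)   # 8/12 ≈ 0.667 with these numbers
```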

Assessing the prior. Two methods:
- Imagined future data: assess your probability of heads on the first toss, imagine you have then seen the outcomes of k flips, and reassess; repeat until the hyperparameters are pinned down.
- Equivalent samples: start from a Beta(0, 0) prior, the state of minimum information; after observing αh heads and αt tails the posterior would be Beta(αh, αt). So assess αh and αt by deciding how many observations of heads and tails your current knowledge is equivalent to (a hypothetical numeric example follows below).
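As a hypothetical assessment (numbers chosen only for illustration, not from the tutorial): if you judge the probability of heads on the first toss to be 0.6 and consider your knowledge equivalent to having seen about 10 flips, you would set αh = 0.6 × 10 = 6 and αt = 10 - 6 = 4, i.e. a Beta(6, 4) prior.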

Can’t always use a beta prior. What if you bought the thumbtack in a magic shop? It could be heavily biased toward heads or toward tails, and a single beta distribution cannot represent that two-peaked belief. Instead use a mixture of beta distributions, which introduces a hidden variable H indicating which component of the mixture applies.

Distributions
We have only been talking about binomials so far, but observations could come from any physical probability distribution; we can still use Bayesian methods, in the same way as before:
- define variables for the unknown parameters
- assign priors to those variables
- use Bayes’ rule to update beliefs given the data
- average over the possible values of Θ to make predictions

Exponential Family. For distributions in the exponential family, the calculations can be done efficiently and in closed form, e.g. the binomial, multinomial, normal, Gamma, and Poisson distributions.

Exponential Family. Bernardo and Smith (1994) compiled the important quantities and Bayesian computations for commonly used members of the family. The paper focuses on multinomial sampling.

Multinomial sampling. X is discrete with r possible states x^1, ..., x^r. The likelihood function is p(X = x^k | θ, ξ) = θ_k, so for a data set the likelihood is p(D | θ, ξ) = ∏_k θ_k^{N_k}. There are as many parameters as states, and the parameters are the physical probabilities. The sufficient statistics for D = {X_1 = x_1, ..., X_N = x_N} are {N_1, ..., N_r}, where N_k is the number of times X = x^k in D.

Multinomial Sampling. The prior used is a Dirichlet distribution, p(θ | ξ) = Dir(θ | α_1, ..., α_r). The posterior is then also Dirichlet: p(θ | D, ξ) = Dir(θ | α_1 + N_1, ..., α_r + N_r). The hyperparameters can be assessed the same way as for the beta distribution.
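A minimal Python sketch of this conjugate Dirichlet update; the data and hyperparameters are invented for illustration.

```python
import numpy as np

# Hypothetical discrete variable X with r = 3 states, observed N = 8 times.
D = [0, 2, 1, 0, 0, 2, 1, 0]                 # observed state indices
r = 3
counts = np.bincount(D, minlength=r)         # sufficient statistics N_1, ..., N_r

# Dirichlet prior hyperparameters (chosen for illustration).
alpha = np.ones(r)

# Conjugacy: posterior is Dirichlet(alpha_1 + N_1, ..., alpha_r + N_r).
alpha_post = alpha + counts

# Predictive probability of each state on observation N+1:
p_next = alpha_post / alpha_post.sum()
print(p_next)   # [5/11, 3/11, 3/11] with these numbers
```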

Bayesian Network. The network structure of a BN is a directed acyclic graph (DAG):
- each node of the graph represents a variable,
- each arc asserts a dependence relationship between the pair of variables it connects,
- each node has a conditional probability table given its immediate parent nodes.

Bayesian Network (cont’d). A Bayesian network for detecting credit-card fraud. Arcs are directed from a parent to its descendant node. The parents of node X_i are denoted Pa_i; for example, Pa(Jewelry) = {Fraud, Age, Sex}.

Bayesian Network (cont’d). Network structure: S. Set of variables: X = {X_1, ..., X_n}. Parents of X_i: Pa_i. The joint distribution of X factorizes as p(x) = ∏_{i=1}^n p(x_i | pa_i). Markov condition: each X_i is independent of its nondescendant nodes ND(X_i) given its parents Pa_i.
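To make the factorization concrete, here is a minimal Python sketch for the fraud network above. The conditional probability tables (and the two-state simplification of Age) are invented for illustration and are not from the paper.

```python
# Hypothetical CPTs for the credit-card-fraud network; all numbers are invented.
p_F = {True: 0.001, False: 0.999}                      # p(Fraud)
p_A = {"young": 0.4, "old": 0.6}                       # p(Age), collapsed to two states
p_S = {"male": 0.5, "female": 0.5}                     # p(Sex)
p_G = {True: {True: 0.2, False: 0.8},                  # p(Gas | Fraud)
       False: {True: 0.01, False: 0.99}}
p_J = {(True, "young", "male"): 0.05,                  # p(Jewelry = True | Fraud, Age, Sex)
       (True, "young", "female"): 0.05,
       (True, "old", "male"): 0.05,
       (True, "old", "female"): 0.05,
       (False, "young", "male"): 0.0001,
       (False, "young", "female"): 0.0005,
       (False, "old", "male"): 0.0004,
       (False, "old", "female"): 0.002}

def joint(f, a, s, g, j):
    """Joint probability via the factorization p(f) p(a) p(s) p(g | f) p(j | f, a, s)."""
    pj = p_J[(f, a, s)] if j else 1.0 - p_J[(f, a, s)]
    return p_F[f] * p_A[a] * p_S[s] * p_G[f][g] * pj

print(joint(False, "young", "female", False, True))
```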

Constructing BN. Given the set of variables X = {X_1, ..., X_n}, the chain rule of probability gives p(x) = ∏_i p(x_i | x_1, ..., x_{i-1}). Now, for every X_i, choose a set of parents Pa_i ⊆ {X_1, ..., X_{i-1}} such that X_i and {X_1, ..., X_{i-1}} \ Pa_i are conditionally independent given Pa_i; then p(x) = ∏_i p(x_i | pa_i).

Constructing BN (cont’d). Using the variable ordering (F, A, S, G, J) we obtain the fraud network shown earlier, but using the ordering (J, G, S, A, F) we obtain a fully connected structure. To get a sparse, meaningful structure, use prior assumptions about the causal relationships among the variables when choosing the ordering (causes before effects).

Inference in BN. The goal is to compute any probability of interest (probabilistic inference). Exact, and even approximate, inference in an arbitrary BN over discrete variables is NP-hard (Cooper, 1990; Dagum and Luby, 1993). The most commonly used algorithms are those of Lauritzen and Spiegelhalter (1988), Jensen et al. (1990), and Dawid (1992); the basic idea is to transform the BN into a tree and exploit the mathematical properties of that tree.
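For intuition only, here is a minimal brute-force sketch of probabilistic inference by enumeration on a tiny hypothetical network (structure A -> B and A -> C, with invented CPTs). This is not the tree-based algorithms cited above, which are what make inference practical on larger networks.

```python
# Tiny hypothetical network with structure A -> B and A -> C (all binary);
# the CPT numbers are invented for illustration.
p_A = {True: 0.3, False: 0.7}
p_B_given_A = {True: {True: 0.8, False: 0.2}, False: {True: 0.1, False: 0.9}}
p_C_given_A = {True: {True: 0.6, False: 0.4}, False: {True: 0.2, False: 0.8}}

def joint(a, b, c):
    # BN factorization: p(a, b, c) = p(a) p(b | a) p(c | a)
    return p_A[a] * p_B_given_A[a][b] * p_C_given_A[a][c]

def posterior_A_given_C(c):
    # Enumerate all assignments consistent with the evidence C = c,
    # sum out B, then normalize by p(C = c).
    scores = {a: sum(joint(a, b, c) for b in (True, False)) for a in (True, False)}
    z = sum(scores.values())
    return {a: s / z for a, s in scores.items()}

print(posterior_A_given_C(True))   # p(A | C = True)
```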

Inference in BN (cont’d)

Learning in BN. Two problems: learning the parameters from data and learning the structure from data. We begin with learning the parameters, assuming the structure is known and the data are fully observable.

Learning parameters in BN. Recall the thumbtack problem. Step 1: use Bayes’ rule to obtain p(θ | D, ξ). Step 2: expand the likelihood p(D | θ, ξ). Step 3: average over the possible values of Θ to determine the predictive probability.

Learning parameters in BN (cont’d). The joint probability distribution, given the structure hypothesis, is p(x | θ_s, S^h) = ∏_{i=1}^n p(x_i | pa_i, θ_i, S^h), where: S^h is the hypothesis that the joint distribution factors according to structure S; θ_i is the vector of parameters for the local distribution p(x_i | pa_i, θ_i, S^h); θ_s is the vector (θ_1, θ_2, ..., θ_n); and D = {x_1, x_2, ..., x_N} is a random sample. The goal is to calculate the posterior distribution p(θ_s | D, S^h).

Learning parameters in BN (cont’d). Illustration with multinomial distributions: each X_i is discrete, taking values in {x_i^1, ..., x_i^{r_i}}. The local distribution of X_i is a collection of multinomial distributions, one for each configuration pa_i^1, ..., pa_i^{q_i} of Pa_i, with parameters θ_ijk = p(X_i = x_i^k | Pa_i = pa_i^j); the parameter vectors for different configurations of Pa_i are assumed mutually independent.

Learning parameters in BN (cont’d). Parameter independence: p(θ_s | S^h) = ∏_{i=1}^n ∏_{j=1}^{q_i} p(θ_ij | S^h). Therefore we can update each vector θ_ij independently. Assume that the prior distribution of θ_ij is Dir(θ_ij | α_ij1, ..., α_ijr_i). Then the posterior distribution of θ_ij is Dir(θ_ij | α_ij1 + N_ij1, ..., α_ijr_i + N_ijr_i), where N_ijk is the number of cases in D in which X_i = x_i^k and Pa_i = pa_i^j.
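A minimal Python sketch of this per-configuration Dirichlet update for a single node; the data, the number of parent configurations, and the uniform hyperparameters are all invented for illustration.

```python
import numpy as np

# One node X_i with a single binary parent (q_i = 2 configurations) and
# r_i = 3 states; the complete data below are invented for illustration.
# Each case is a pair (j, k): parent configuration index, child state index.
cases = [(0, 0), (0, 2), (1, 1), (1, 1), (0, 0), (1, 2), (0, 1), (1, 1)]
q_i, r_i = 2, 3

# Dirichlet prior hyperparameters alpha_ijk (uniform, chosen for illustration).
alpha = np.ones((q_i, r_i))

# Sufficient statistics N_ijk: number of cases with X_i = x_i^k and Pa_i = pa_i^j.
N = np.zeros((q_i, r_i))
for j, k in cases:
    N[j, k] += 1

# Parameter independence: each theta_ij is updated separately, and the
# posterior for row j is Dirichlet(alpha_ij1 + N_ij1, ..., alpha_ijr + N_ijr).
alpha_post = alpha + N

# Posterior-mean estimate of p(X_i = x_i^k | Pa_i = pa_i^j):
theta_hat = alpha_post / alpha_post.sum(axis=1, keepdims=True)
print(theta_hat)
```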

Learning parameters in BN (cont’d). To compute the predictive probability p(x_{N+1} | D, S^h), we have to average over the possible configurations of θ_s: p(x_{N+1} | D, S^h) = ∫ p(x_{N+1} | θ_s, S^h) p(θ_s | D, S^h) dθ_s. Using parameter independence, we obtain p(x_{N+1} | D, S^h) = ∏_{i=1}^n (α_ijk + N_ijk) / (α_ij + N_ij), where α_ij = Σ_k α_ijk and N_ij = Σ_k N_ijk, and j and k index the configuration of Pa_i and the state of X_i appearing in the case x_{N+1}.