
Bayes Net Learning
Oliver Schulte
Machine Learning 726

Learning Bayes Nets

Structure Learning Example: Sleep Disorder Network. (We generally do not cover structure learning in this course.) Source: Fouron, Anne Gisèle (2006). Development of Bayesian Network Models for Obstructive Sleep Apnea Syndrome Assessment. M.Sc. Thesis, SFU.

Parameter Learning Scenarios
Complete data (today). Later: missing data (EM).
Methods by parent node / child node type:
- Discrete parent, discrete child: maximum likelihood estimates; decision trees.
- Continuous parent, discrete child: logit distribution (logistic regression).
- Discrete parent, continuous child: conditional Gaussian (not discussed).
- Continuous parent, continuous child: linear Gaussian (linear regression).

The Parameter Learning Problem
Input: a data table X_{N x D}, with one column per node (random variable) and one row per instance. How do we fill in the Bayes net parameters?
[Table: the 14-day PlayTennis data set with columns Day, Outlook, Temperature, Humidity, Wind, PlayTennis.]
What is N? What is D? PlayTennis: do you play tennis on Saturday morning?
For now we assume complete data; incomplete data comes later (EM).

Start Small: Single Node
What parameter value would you choose?
[Table: 14 observations of Humidity; 7 are high and 7 are normal.]
Model: P(Humidity = high) = θ.
How about P(Humidity = high) = 50%?

Parameters for Two Nodes
[Table: the same 14 days, now with both Humidity and PlayTennis recorded.]
Model: P(Humidity = high) = θ; P(PlayTennis = yes | Humidity = high) = θ1; P(PlayTennis = yes | Humidity = normal) = θ2.
Is θ the same as in the single-node model? How about θ1 = 3/7? How about θ2 = 6/7?

Maximum Likelihood Estimation

MLE
An important general principle: choose parameter values that maximize the likelihood of the data. Intuition: explain the data as well as possible. Recall from Bayes' theorem that the likelihood is P(data | parameters) = P(D | θ).

Finding the Maximum Likelihood Solution: Single Node
The observations are independent and identically distributed (i.i.d.), so this is the binomial MLE problem.
[Table: each of the 14 Humidity observations contributes a factor P(H_i | θ): θ for a high day, 1 - θ for a normal day.]
Write down the likelihood. In the example, P(D | θ) = θ^7 (1 - θ)^7. Maximize this function over θ.
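As a concrete illustration (not from the slides), here is a minimal Python sketch that evaluates the binomial likelihood for the 7-high / 7-normal Humidity data on a grid and confirms that the maximizer matches the sample frequency; the variable names are my own.

```python
import numpy as np

# Humidity observations: 7 "high" and 7 "normal", as in the example.
data = ["high"] * 7 + ["normal"] * 7
n_high = sum(1 for x in data if x == "high")
n = len(data)

def likelihood(theta):
    """P(D | theta) = theta^n_high * (1 - theta)^(n - n_high)."""
    return theta ** n_high * (1 - theta) ** (n - n_high)

# Evaluate the likelihood on a grid of candidate parameter values.
grid = np.linspace(0.01, 0.99, 99)
theta_hat = grid[np.argmax(likelihood(grid))]
print(theta_hat)     # ~0.5, the grid maximizer
print(n_high / n)    # closed-form MLE: the sample frequency 7/14
```

The grid search is only for illustration; the next slide derives the closed-form answer.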

Solving the Equation
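A sketch of the standard steps, writing n_H for the number of high days and n_N for the number of normal days (this notation is mine):

```latex
L(\theta) = \theta^{n_H}(1-\theta)^{n_N}, \qquad
\log L(\theta) = n_H \log\theta + n_N \log(1-\theta)

\frac{d}{d\theta}\log L(\theta) = \frac{n_H}{\theta} - \frac{n_N}{1-\theta} = 0
\quad\Longrightarrow\quad
\hat{\theta} = \frac{n_H}{n_H + n_N}
```

In the example, n_H = n_N = 7, so the maximum likelihood estimate is θ = 1/2, matching the sample frequency.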

Finding the Maximum Likelihood Solution: Two Nodes
In a Bayes net, we can maximize each parameter separately. Fixing a parent condition reduces the task to a single-node problem.
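A hedged Python sketch of the "fix a parent condition" idea: for each value of the parent (Humidity) we solve a separate single-node problem by taking the sample frequency of the child (PlayTennis). The toy pairs below are made up for illustration, not the slide's table.

```python
from collections import Counter, defaultdict

# Hypothetical (Humidity, PlayTennis) pairs; not the slide's actual data.
pairs = [("high", "no"), ("high", "no"), ("high", "yes"), ("high", "yes"),
         ("normal", "yes"), ("normal", "yes"), ("normal", "no")]

# Count child outcomes separately under each parent condition.
counts = defaultdict(Counter)
for humidity, play in pairs:
    counts[humidity][play] += 1

# The MLE of P(PlayTennis = yes | Humidity = h) is the per-condition frequency.
for h, c in counts.items():
    theta_h = c["yes"] / sum(c.values())
    print(f"P(PlayTennis=yes | Humidity={h}) = {theta_h:.2f}")
```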

Finding the Maximum Likelihood Solution: Single Node, >2 possible values. Lagrange Multipliers
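A sketch of the Lagrange-multiplier argument the title refers to, for a node with K possible values and observed counts n_1, ..., n_K (notation mine):

```latex
\max_{\theta_1,\dots,\theta_K} \sum_{k=1}^{K} n_k \log\theta_k
\quad \text{subject to} \quad \sum_{k=1}^{K} \theta_k = 1

\frac{\partial}{\partial\theta_k}\left( \sum_{j} n_j \log\theta_j
  - \lambda\Big(\sum_{j}\theta_j - 1\Big) \right)
  = \frac{n_k}{\theta_k} - \lambda = 0
\quad\Longrightarrow\quad
\hat{\theta}_k = \frac{n_k}{\sum_{j} n_j}
```

The constraint forces λ to equal the total count, so again the MLE matches the sample frequencies.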

Problems With MLE
The 0/0 problem: what if there are no data for a given parent-child configuration?
Single point estimate: MLE does not quantify uncertainty. Is 6/10 the same as 6000/10000?
With several parents, the number of parent configurations grows exponentially (curse of dimensionality), so sparsely observed configurations are common. Discussion: how can we solve these problems?

Classical Statistics and MLE
To quantify uncertainty, report a confidence interval. For the 0/0 problem, use data smoothing.

Bayesian Parameter Learning

Parameter Probabilities
Intuition: quantify uncertainty about parameter values by assigning a prior probability to parameter values. The prior is not based on data.

Bayesian Prediction/Inference
What probability does the Bayesian assign to PlayTennis = true? That is, how should we bet on PlayTennis = true?
Answer: make a prediction for each parameter value, then average the predictions using the prior as weights.
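A minimal Python sketch of "predict for each parameter value, then average with the prior as weights", using a hypothetical three-point prior over θ; the hypothesis values and weights are mine, not the Russell and Norvig example.

```python
# Hypothetical discrete prior over theta = P(PlayTennis = true).
hypotheses = [0.25, 0.50, 0.75]
prior      = [0.2,  0.5,  0.3]   # prior weights, must sum to 1

# Bayesian prediction = prior-weighted average of each hypothesis's prediction.
p_play = sum(theta * w for theta, w in zip(hypotheses, prior))
print(p_play)   # 0.2*0.25 + 0.5*0.50 + 0.3*0.75 = 0.525
```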

Mean
The Bayesian prediction can be seen as the expected value of a probability distribution P, also known as the average or mean of P. Notation: E(P), or μ. For example, if grades 1, 2, 3 occur with probabilities 0.2, 0.5, 0.3, the mean grade is 0.2·1 + 0.5·2 + 0.3·3 = 2.1.

Variance
The variance measures the spread of a distribution: Var(X) = E[(X - μ)^2]. The variance of a parameter estimate quantifies our uncertainty about the parameter; it decreases with learning.

Continuous Priors
Probabilities usually range over a continuous interval, so probabilities of probabilities are probabilities of continuous variables. The probability of a continuous variable is given by a probability density function. A density p(x) behaves like the probability of a discrete value, but with integrals replacing sums; e.g., a density on [0,1] integrates to 1: ∫_0^1 p(x) dx = 1. Exercise: find the p.d.f. of the uniform distribution over an interval [a,b].

Bayesian Prediction With P.D.F.s
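In formulas, assuming the binary PlayTennis/coin setting of the earlier slides, the prediction averages over the continuous parameter with the density as weight:

```latex
P(X = \text{true}) = \int_0^1 P(X = \text{true}\mid\theta)\, p(\theta)\, d\theta
                   = \int_0^1 \theta\, p(\theta)\, d\theta = E[\theta]
```

So the Bayesian prediction is simply the mean of the parameter density.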

Bayesian Learning

Bayesian Updating
Update the prior using Bayes' theorem. Exercise: find the posterior of the uniform distribution given 10 heads and 20 tails. Answer: p(θ | D) ∝ θ^10 (1 - θ)^20, a Beta(11, 21) density. Notice that the posterior no longer has the uniform form of the prior.
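A small numerical sketch of this update (an assumed example, not from the slides): represent the prior on a grid of θ values, multiply by the likelihood of 10 heads and 20 tails, and renormalize.

```python
import numpy as np

theta = np.linspace(0.001, 0.999, 999)   # grid of candidate parameter values
prior = np.ones_like(theta)              # uniform prior (up to a constant)

h, t = 10, 20                            # observed heads and tails
likelihood = theta**h * (1 - theta)**t

posterior = prior * likelihood           # Bayes' theorem, unnormalized
posterior /= posterior.sum()             # normalize the grid weights

print(theta[np.argmax(posterior)])       # posterior mode, about 10/30
print((theta * posterior).sum())         # posterior mean, about 11/32
```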

The Laplace Correction
Start with the uniform prior: the probability of PlayTennis could be any value in [0,1], with equal prior probability. Suppose we have observed n data points. Find the posterior distribution, then predict the probability of heads using the posterior distribution. The required integral was solved by Laplace; the result is the Laplace correction: after observing h heads in n flips, P(next flip = heads) = (h + 1)/(n + 2).
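A hedged sketch comparing the raw MLE with the Laplace-corrected estimate; the function names are mine.

```python
def mle(heads, n):
    """Raw maximum likelihood estimate; undefined (0/0) when n = 0."""
    return heads / n

def laplace(heads, n):
    """Laplace correction: pretend one extra head and one extra tail were seen."""
    return (heads + 1) / (n + 2)

print(laplace(0, 0))                            # 0.5 -- sensible even with no data
print(mle(6, 10), laplace(6, 10))               # 0.6 vs ~0.583: correction matters
print(mle(6000, 10000), laplace(6000, 10000))   # both ~0.6: correction washes out
```

This also illustrates the earlier point that 6/10 and 6000/10000 should not be treated identically.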

Parametrized Priors
Motivation: suppose I don't want a uniform prior, e.g. to smooth with m > 0 or to express prior knowledge. Use parameters for the prior distribution itself; these are called hyperparameters, and they are chosen so that updating the prior is easy.

Beta Distribution: Definition
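For reference, the standard definition, with hyperparameters a, b > 0:

```latex
p(\theta \mid a, b) = \frac{\Gamma(a+b)}{\Gamma(a)\,\Gamma(b)}\,
\theta^{a-1}(1-\theta)^{b-1}, \qquad \theta \in [0,1], \qquad
E[\theta] = \frac{a}{a+b}
```

The uniform prior is the special case Beta(1, 1).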

Beta Distribution: Examples
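A short Python sketch (my own choice of hyperparameters) of the kind of examples typically shown here: Beta(1,1) is the uniform prior, Beta(2,2) mildly prefers θ near 0.5, and Beta(8,2) strongly favors high θ.

```python
import numpy as np
from scipy.stats import beta
import matplotlib.pyplot as plt

theta = np.linspace(0, 1, 200)
for a, b in [(1, 1), (2, 2), (8, 2)]:            # illustrative hyperparameters
    plt.plot(theta, beta.pdf(theta, a, b), label=f"Beta({a},{b})")

plt.xlabel("theta"); plt.ylabel("density"); plt.legend()
plt.show()
```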

Updating the Beta Distribution
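A sketch of the standard conjugate update for the binary case: with a Beta(a, b) prior and data D containing h heads and t tails,

```latex
p(\theta \mid D) \;\propto\;
\underbrace{\theta^{a-1}(1-\theta)^{b-1}}_{\text{prior}}\;
\underbrace{\theta^{h}(1-\theta)^{t}}_{\text{likelihood}}
\quad\Longrightarrow\quad
\theta \mid D \sim \mathrm{Beta}(a+h,\, b+t), \qquad
P(\text{heads} \mid D) = \frac{a+h}{a+b+h+t}
```

The hyperparameters act like "pseudo-counts" added to the observed counts; with a = b = 1 this recovers the Laplace correction.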

Conjugate Prior for Non-Binary Variables
The Dirichlet distribution generalizes the Beta distribution to variables with more than two values.
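As a sketch, for a variable with K values, hyperparameters α_1, ..., α_K > 0, and observed counts n_1, ..., n_K (notation mine):

```latex
p(\theta_1,\dots,\theta_K \mid \alpha) \;\propto\; \prod_{k=1}^{K} \theta_k^{\alpha_k - 1},
\qquad
\theta \mid D \sim \mathrm{Dirichlet}(\alpha_1 + n_1, \dots, \alpha_K + n_K),
\qquad
P(X = k \mid D) = \frac{\alpha_k + n_k}{\sum_{j}(\alpha_j + n_j)}
```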

Summary
Maximum likelihood is a general parameter estimation method: choose the parameters that make the data as likely as possible. For Bayes net parameters, MLE means matching the sample frequencies (a typical result). Problems: it is not defined in the 0/0 situation, and it does not quantify uncertainty in the estimate.
Bayesian approach: assume a prior probability for the parameters; the prior has hyperparameters (e.g., the Beta distribution). The prior choice is not based on data, and the inferences (averaging) can be hard to compute.