
Bayes Net Learning
Oliver Schulte, Machine Learning 726

Learning Bayes Nets

Structure Learning Example: Sleep Disorder Network
We generally do not cover structure learning in this course.
Source: Fouron, Anne Gisèle (2006). Development of Bayesian Network models for obstructive sleep apnea syndrome assessment. M.Sc. thesis, SFU.

Parameter Learning Scenarios
Complete data (today). Later: missing data (EM).
Which estimator to use depends on the types of the parent and child nodes:
Discrete parent, discrete child: maximum likelihood, decision trees.
Discrete parent, continuous child: conditional Gaussian (not discussed).
Continuous parent, discrete child: logit distribution (logistic regression).
Continuous parent, continuous child: linear Gaussian (linear regression).

The Parameter Learning Problem
Input: a data table X (N x D), with one column per node (random variable) and one row per instance. How do we fill in the Bayes net parameters?
Running example: the 14-day PlayTennis table with columns Day, Outlook, Temperature, Humidity, Wind, PlayTennis (do you play tennis Saturday morning?). What is N? What is D?
For now we assume complete data; incomplete data is handled another day (EM).

Start Small: Single Node
Model: a single node Humidity with parameter θ = P(Humidity = high).
Data: the Humidity column of the 14-day table (7 high, 7 normal).
What value would you choose? How about P(Humidity = high) = 50%?

Parameters for Two Nodes
Model: Humidity → PlayTennis, with parameters
θ = P(Humidity = high),
θ1 = P(PlayTennis = yes | Humidity = high),
θ2 = P(PlayTennis = yes | Humidity = normal).
Data: the Humidity and PlayTennis columns of the 14-day table.
Is θ the same as in the single-node model? How about θ1 = 3/7? How about θ2 = 6/7?

Maximum Likelihood Estimation

MLE
An important general principle: choose parameter values that maximize the likelihood of the data.
Intuition: explain the data as well as possible.
Recall from Bayes' theorem that the likelihood is P(data | parameters) = P(D | θ).

Finding the Maximum Likelihood Solution: Single Node
Model: P(Humidity = high) = θ, so each observed instance has P(high | θ) = θ and P(normal | θ) = 1 - θ.
Write down the likelihood of the data. For independent, identically distributed (i.i.d.) observations this is a binomial likelihood: in the example, P(D | θ) = θ^7 (1 - θ)^7.
Find the value of θ that maximizes this function.

Solving the Equation
It is often convenient to apply logarithms to products: ln P(D | θ) = 7 ln(θ) + 7 ln(1 - θ).
Take the derivative and set it to 0: 7/θ - 7/(1 - θ) = 0, which gives θ = 1/2, the sample frequency of Humidity = high.
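The following is a minimal Python sketch (added for illustration, not part of the original slides) that checks the closed-form result numerically: the Bernoulli MLE is the sample frequency, and it also maximizes the log-likelihood over a grid of candidate values. The 7-high/7-normal sample below is hypothetical but matches the example's counts.

import math

# Hypothetical sample matching the slides: 7 "high" and 7 "normal" observations.
data = ["high"] * 7 + ["normal"] * 7

def log_likelihood(theta, data):
    """Log-likelihood of i.i.d. binary data under P(high) = theta."""
    h = sum(1 for x in data if x == "high")
    t = len(data) - h
    return h * math.log(theta) + t * math.log(1 - theta)

# Closed-form MLE: the sample frequency of "high".
mle = sum(1 for x in data if x == "high") / len(data)

# Numerical check: no theta on a fine grid beats the sample frequency.
grid = [i / 1000 for i in range(1, 1000)]
best = max(grid, key=lambda th: log_likelihood(th, data))
print(mle, best)  # both are (close to) 0.5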

Finding the Maximum Likelihood Solution: Two Nodes
Model: P(Humidity = high) = θ, P(PlayTennis = yes | high) = θ1, P(PlayTennis = yes | normal) = θ2.
Probability of each kind of instance, P(H, P | θ, θ1, θ2):
high, yes: θ · θ1
high, no: θ · (1 - θ1)
normal, yes: (1 - θ) · θ2
normal, no: (1 - θ) · (1 - θ2)

Finding the Maximum Likelihood Solution: Two Nodes (continued)
In the example, P(D | θ, θ1, θ2) = θ^7 (1 - θ)^7 · (θ1)^3 (1 - θ1)^4 · (θ2)^6 (1 - θ2).
Take logs, differentiate, and set to 0.
In a Bayes net, the likelihood factors, so each parameter can be maximized separately: fixing a parent condition reduces the task to a single-node problem. The MLE of each parameter is the corresponding (conditional) sample frequency: θ = 7/14, θ1 = 3/7, θ2 = 6/7.
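To make the factorization concrete, here is a short Python sketch (added, not from the slides) that estimates the two CPTs by conditional sample frequencies. The list of (Humidity, PlayTennis) pairs is hypothetical but has the same counts as the example: 7 high (3 yes) and 7 normal (6 yes).

from collections import Counter

# Hypothetical data with the example's counts:
# 7 "high" (3 yes, 4 no) and 7 "normal" (6 yes, 1 no).
pairs = ([("high", "yes")] * 3 + [("high", "no")] * 4 +
         [("normal", "yes")] * 6 + [("normal", "no")] * 1)

pair_counts = Counter(pairs)
humidity_counts = Counter(h for h, _ in pairs)

# MLE for the root node: sample frequency of Humidity = high.
theta = humidity_counts["high"] / len(pairs)

# MLE for each conditional parameter: frequency within the parent condition.
theta1 = pair_counts[("high", "yes")] / humidity_counts["high"]
theta2 = pair_counts[("normal", "yes")] / humidity_counts["normal"]

print(theta, theta1, theta2)  # 0.5, 3/7 ≈ 0.43, 6/7 ≈ 0.86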

Finding the Maximum Likelihood Solution: Single Node, More Than Two Possible Values
Model: a single node Outlook with P(sunny) = θ1, P(overcast) = θ2, P(rain) = θ3.
Data: the Outlook column of the 14-day table (5 sunny, 4 overcast, 5 rain).
In the example, P(D | θ1, θ2, θ3) = (θ1)^5 (θ2)^4 (θ3)^5.
Take logs and set the derivatives to 0, either replacing θ3 by 1 - θ1 - θ2 or using Lagrange multipliers; the result is again the sample frequencies, as sketched below.
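A short derivation sketch of the Lagrange-multiplier route (added here for completeness; it is not spelled out on the original slide). For counts n_1, ..., n_K that sum to N:

\begin{aligned}
\mathcal{L}(\theta, \lambda) &= \sum_{k=1}^{K} n_k \ln \theta_k + \lambda \Big( 1 - \sum_{k=1}^{K} \theta_k \Big) \\
\frac{\partial \mathcal{L}}{\partial \theta_k} &= \frac{n_k}{\theta_k} - \lambda = 0
  \quad\Rightarrow\quad \theta_k = \frac{n_k}{\lambda} \\
\sum_{k} \theta_k = 1 &\quad\Rightarrow\quad \lambda = \sum_{k} n_k = N
  \quad\Rightarrow\quad \hat{\theta}_k = \frac{n_k}{N}.
\end{aligned}

With the example counts (5, 4, 5) and N = 14, this gives (5/14, 4/14, 5/14).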

Smoothing

Motivation
MLE goes to extreme values on small, unbalanced samples. E.g., observing 5 heads in 5 flips gives the estimate 100% heads.
The zero-count problem: there may not be any data in part of the space. E.g., in the 14-day PlayTennis table there are no instances with Outlook = overcast and PlayTennis = no, so the MLE of P(Outlook = overcast | PlayTennis = no) is 0.
This only gets worse in high dimensions (curse of dimensionality). How can we fix it?

Smoothing Frequency Estimates
Let h = number of heads, t = number of tails, n = h + t. Choose a prior probability estimate p and an equivalent sample size m.
m-estimate: P(heads) ≈ (h + m·p) / (n + m).
Interpretation: we started with a "virtual" sample of m tosses, m·p of which were heads.
With p = 1/2 and m = 2 this is the Laplace correction: (h + 1) / (n + 2).
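A small sketch (added, not from the slides) of the m-estimate as a function; with the defaults p = 0.5 and m = 2 it reduces to the Laplace correction. The function name and the example counts are illustrative.

def m_estimate(h, n, p=0.5, m=2.0):
    """Smoothed frequency estimate (h + m*p) / (n + m).
    With p = 0.5 and m = 2 this is the Laplace correction."""
    return (h + m * p) / (n + m)

# Observing 5 heads in 5 flips: the MLE says 1.0, the smoothed estimate does not.
print(5 / 5)             # 1.0 (maximum likelihood)
print(m_estimate(5, 5))  # 6/7 ≈ 0.857 (Laplace correction)
# A zero count no longer produces a zero probability:
print(m_estimate(0, 4))  # 1/6 ≈ 0.167 instead of 0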

Exercise
Using the Outlook and PlayTennis columns of the 14-day table, apply the Laplace correction to estimate
P(Outlook = overcast | PlayTennis = no),
P(Outlook = sunny | PlayTennis = no),
P(Outlook = rain | PlayTennis = no).

Bayesian Parameter Learning: Short Version

Uncertainty in Estimates
A single point estimate does not quantify uncertainty: is 6/10 the same as 6000/10000?
Classical statistics: specify a confidence interval for the estimate.
Bayesian approach: assign a probability to parameter values.

Parameter Probabilities
Intuition: quantify uncertainty about parameter values by assigning a prior probability to them. The prior is not based on data.
Example: a discrete prior over the chance of heads.
Hypothesis 1: chance of heads 100%, prior probability 10%.
Hypothesis 2: chance of heads 75%, prior probability 20%.
Hypothesis 3: chance of heads 50%, prior probability 40%.
Hypothesis 4: chance of heads 25%, prior probability 20%.
Hypothesis 5: chance of heads 0%, prior probability 10%.
Yes, these are probabilities of probabilities.
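To show how such a prior is updated, here is a small Python sketch (added, not from the slides); the observed data of 3 heads and 2 tails is hypothetical. Bayes' theorem makes the posterior over the five hypotheses proportional to likelihood times prior.

# Discrete prior over the chance of heads (the five hypotheses above).
hypotheses = [1.0, 0.75, 0.5, 0.25, 0.0]
prior      = [0.10, 0.20, 0.40, 0.20, 0.10]

# Hypothetical observations: 3 heads and 2 tails.
h, t = 3, 2

# Posterior is proportional to likelihood times prior; then normalize.
unnormalized = [p * (theta ** h) * ((1 - theta) ** t)
                for theta, p in zip(hypotheses, prior)]
z = sum(unnormalized)
posterior = [u / z for u in unnormalized]

for theta, post in zip(hypotheses, posterior):
    print(f"P(chance of heads = {theta:.2f} | data) = {post:.3f}")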

Maximum Posterior Inference
Recall that the maximum posterior (MAP) estimate for a dataset D satisfies argmax_θ P(θ | D) = argmax_θ P(D | θ) · P(θ).
The prior can be used to smooth estimates, e.g. after observing only 1 head.

Example: Uniform Prior
Suppose we start with a uniform prior distribution for the chance p that X = 1 for a binary variable (think coin flips). What is the resulting Bayesian estimate? Answer: the posterior mean of p (its expected value given the data) is the same as using the Laplace correction. Solved by Laplace in 1814!
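A brief numerical check (added, not on the slides): with a uniform Beta(1, 1) prior, the posterior after h heads and t tails is Beta(h + 1, t + 1), whose mean is (h + 1)/(h + t + 2), i.e. the Laplace correction. The function names below are illustrative.

def posterior_mean_uniform_prior(h, t):
    """Posterior mean of P(heads) under a uniform Beta(1, 1) prior.
    The posterior is Beta(h + 1, t + 1), with mean (h + 1) / (h + t + 2)."""
    return (h + 1) / (h + t + 2)

def laplace_correction(h, n):
    """Laplace-corrected frequency estimate (h + 1) / (n + 2)."""
    return (h + 1) / (n + 2)

# After observing 1 head and 0 tails, the MLE would be 1.0; the Bayesian estimate is not.
print(posterior_mean_uniform_prior(1, 0))  # 2/3 ≈ 0.667
print(laplace_correction(1, 1))            # 2/3 ≈ 0.667, identical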

Summary
Maximum likelihood is a general parameter estimation method: choose the parameters that make the data as likely as possible. For Bayes net parameters, the MLE matches the sample frequencies (a typical result).
Problems: the estimate is not well defined in zero-count situations, and it does not quantify the uncertainty in the estimate.
Bayesian approach: assume a prior probability over the parameters; the prior has hyperparameters (e.g., a beta distribution). The prior choice is not based on data, and the required inferences (averaging over the posterior) can be hard to compute.
Other cases, such as Gaussian nodes without parents, are covered later.