Overview Full Bayesian Learning; MAP learning; Maximum Likelihood Learning; Learning Bayesian Networks: fully observable (complete data) and with hidden (unobservable) variables

Learning Parameters with Hidden Variables So far we have assumed that I can collect data on all variables in the network. What if this is not true, i.e., the network has hidden variables? Clearly I can't use the frequency approach, because I am missing all the counts involving the hidden variable H.

Quick Fix Get rid of the hidden variables. It may work in the simple network given earlier, but what about the following one? Each variable has 3 values (low, moderate, high); the numbers by the nodes represent how many parameters need to be specified for the CPT of that node: 78 probabilities to be specified overall.

Not Necessarily a Good Fix The symptom variables are no longer conditionally independent given their parents. Many more links, and many more probabilities to be specified: 708 overall. We need much more data to properly learn the network.

Another Example We saw how we could use a Naïve Bayes classifier for classifying messages in terms of user reading choice, based on labelled data. The data included the classification, so we can use the frequencies to learn the necessary CPTs.

Another Example But I may want to find groups of users that behave similarly when reading news, without knowing the categories a priori. This is a type of unsupervised learning known as clustering. I can do that by assuming k user categories and making them the values of a hidden variable C in the Naive Bayes classifier. How to compute the parameters?

Expectation-Maximization (EM) If we keep the hidden variables and want to learn the network parameters from data, we have a form of unsupervised learning: the data do not include information on the true nature of each data point. Expectation-Maximization is a general algorithm for learning model parameters from incomplete data. We'll see how it works on learning parameters for Bnets with discrete and continuous variables.

EM: General Idea If we had data for all the variables in the network, we could learn the parameters using ML (or MAP) models, e.g., from the frequencies of the relevant events, as we saw in previous examples. If we had the parameters of the network, we could estimate the posterior probability of any event, including the hidden variables, e.g., P(H|A,B,C).

EM: General Idea The algorithm starts from "invented" (e.g., randomly generated) information to solve the learning problem, e.g., the network parameters. It then refines this initial guess by cycling through two basic steps. Expectation (E): update the data with predictions generated via the current model. Maximization (M): given the updated data, infer the Maximum Likelihood (or MAP) model. This is the same step that we described when learning parameters for fully observable networks. It can be shown that EM increases the log likelihood of the data at each iteration, and often achieves a local maximum.
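As an illustration of this alternation, here is a minimal sketch of the generic EM loop in Python. The e_step and m_step callbacks, the parameter names, and the fixed iteration count are all hypothetical placeholders for the model-specific computations described in the following slides.

```python
def em(data, params, e_step, m_step, iterations=50):
    """Generic EM loop: alternate expectation and maximization steps.

    e_step(data, params) -> expected (augmented) data under the current model
    m_step(data, expected) -> ML (or MAP) parameters given the augmented data
    Both callbacks are hypothetical; their form depends on the model.
    """
    for _ in range(iterations):
        expected = e_step(data, params)   # E: predict the hidden values
        params = m_step(data, expected)   # M: re-estimate parameters by ML/MAP
    return params
```

In practice one would stop when the log likelihood stops improving, since EM increases it at each iteration.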

EM: How it Works on Naive Bayes We start from our network with the hidden variable and incomplete data. We then make up the missing data. Suppose that we want C to have three values, [1,2,3] (categories). What would we need to learn the network parameters? For P(C=i): Count(datapoints with C=i) / Count(all datapoints), for i=1,2,3. For P(Xj=valk|C=i): Count(datapoints with Xj=valk and C=i) / Count(datapoints with C=i), for all values valk of Xj and i=1,2,3.

EM: Augmenting the Data We only have Count(all datapoints). We approximate the other counts with expected counts derived from the model. The expected count for category i is the sum, over all N examples in the dataset, of the probability that each example is in category i: expected count(C=i) = Σ(j=1..N) P(C=i | attribute values of example j), where each term P(C=i | ...) is available from the model.
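A minimal sketch of how these expected counts could be accumulated in Python, assuming a hypothetical posterior(x) function that returns the list [P(C=1|x), ..., P(C=k|x)] under the current model:

```python
def expected_class_counts(data, posterior, k):
    # expected count(C=i) = sum over all examples of P(C=i | example)
    counts = [0.0] * k
    for x in data:
        p = posterior(x)          # hypothetical: list of P(C=i | x), summing to 1
        for i in range(k):
            counts[i] += p[i]
    return counts
```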

EM: Augmenting the Data This process is analogous to creating new tuples of data that include values for the hidden variables. For each tuple in the original data (e.g., [t,f,t,t]), we duplicate it as many times as there are values of C, and add to each new tuple one of the values of C. We then get an expected count for each new tuple from the model.

E-Step For each augmented tuple, compute its expected count as the posterior probability of its value of C given the observed attributes: P(C=i | x1,...,xn) = P(C=i) Πk P(xk|C=i) / Σi' [P(C=i') Πk P(xk|C=i')]. All the probabilities in this expression are given in the Naive Bayes model. The result of the computation replaces the corresponding expected count in the augmented data. Repeat to update all counts in the table.
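For concreteness, here is a sketch of this E-step computation for a Naive Bayes model with boolean attributes. The parameter layout (prior[i] = P(C=i), cond[i][k] = P(Xk = true | C=i)) is an assumption made for the example, not the slide's notation.

```python
def naive_bayes_posterior(x, prior, cond):
    """Return [P(C=i | x) for each category i] for a tuple x of boolean attribute values."""
    k = len(prior)
    joint = []
    for i in range(k):
        p = prior[i]                              # P(C=i)
        for xk, theta in zip(x, cond[i]):         # times P(Xk = xk | C=i)
            p *= theta if xk else (1.0 - theta)
        joint.append(p)
    z = sum(joint)                                # normalize over the categories
    return [p / z for p in joint]
```

Called on a tuple such as [True, False, True, True], this returns one expected count per value of C, exactly the numbers that fill the augmented data table.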

M-Step Use the expected counts to update the model parameters via Maximum Likelihood, i.e., compute the relative frequencies from the augmented data to obtain the new probabilities.
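A matching sketch of the M-step under the same assumed representation (posteriors[j][i] = P(C=i | jth tuple), data[j][f] a boolean attribute value):

```python
def m_step(data, posteriors, k):
    """Re-estimate P(C) and P(X_f = true | C) from expected counts."""
    n = len(data)
    n_features = len(data[0])
    # expected number of datapoints in each category
    class_counts = [sum(posteriors[j][i] for j in range(n)) for i in range(k)]
    prior = [class_counts[i] / n for i in range(k)]
    # expected count of (X_f = true, C = i) divided by expected count of C = i
    cond = [[sum(posteriors[j][i] for j in range(n) if data[j][f]) / class_counts[i]
             for f in range(n_features)]
            for i in range(k)]
    return prior, cond
```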

Repeat Alternate the two steps, recomputing the augmented data (E-step) and the probabilities (M-step), until the parameters converge.

Another Example Back to the cherry/lime candy world. Two bags of candies (1 and 2) have been mixed together. Candies are described by 3 features: Flavor and Wrapper as before, plus Hole (whether they have a hole in the middle). The distribution of candies in each bag is described again by a naive Bayes model, with parameters θ = P(Bag = 1), θFj = P(Flavor = cherry | Bag = j), θWj = P(Wrapper = red | Bag = j), θHj = P(Hole = yes | Bag = j), for j = 1,2.

Another Example Assume that the true parameters are θ = 0.5; θF1 = θW1 = θH1 = 0.8; θF2 = θW2 = θH2 = 0.3. A dataset of 1000 datapoints has been generated by sampling the true model (summarized in a data table by the combinations of Flavor, Wrapper and Hole). We want to re-learn the true parameters using EM.

Start Point This time, we start by directly assigning a guesstimate θ(0) for the parameters. This is usually done randomly; here we select numbers convenient for computation. We'll work through one cycle of EM to compute θ(1).

E-step First, we need the expected count of candies from Bag 1: the sum of the probabilities that each of the N data points comes from Bag 1. Let flavorj, wrapperj, holej be the values of the corresponding attributes for the jth datapoint; then the expected count is Σ(j=1..N) P(Bag=1 | flavorj, wrapperj, holej).

E-step This summation can be broken down into the 8 candy groups in the data table. For instance, the sum over the 273 cherry candies with red wrapper and hole (the first entry in the data table) gives 273 · P(Bag=1 | cherry, red, hole) = 273 · θ θF1 θW1 θH1 / [θ θF1 θW1 θH1 + (1−θ) θF2 θW2 θH2], evaluated at the current parameter estimates θ(0).
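A small numeric sketch of this group's contribution. The initial parameter guesses below are purely illustrative (the slide's actual θ(0) values are not reproduced here); only the 273 count comes from the data table.

```python
# Illustrative (hypothetical) initial guesses theta^(0)
theta = 0.6                        # P(Bag = 1)
thF1, thW1, thH1 = 0.6, 0.6, 0.6   # P(cherry|Bag=1), P(red|Bag=1), P(hole|Bag=1)
thF2, thW2, thH2 = 0.4, 0.4, 0.4   # same attributes given Bag = 2

# Bayes rule in the naive Bayes model: P(Bag=1 | cherry, red, hole)
num = theta * thF1 * thW1 * thH1
den = num + (1 - theta) * thF2 * thW2 * thH2
posterior = num / den

# contribution of the 273 cherry/red/hole candies to the expected Bag-1 count
print(273 * posterior)
```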

M-step If we also compute the sums over the other 7 candy groups, we get the total expected count of candies from Bag 1. At this point, we can perform the M-step to refine θ, by taking the expected frequency of the data points that come from Bag 1: θ(1) = (expected count of candies from Bag 1) / 1000.

One More Parameter If we want to do the same for parameter θF1: E-step: compute the expected count of cherry candies from Bag 1, i.e., Σ over the datapoints with flavorj = cherry of P(Bag=1 | cherry, wrapperj, holej). We can compute each term from the Naïve Bayes model as we did earlier. M-step: refine θF1 by computing the corresponding expected frequency, θF1(1) = (expected count of cherry candies from Bag 1) / (expected count of candies from Bag 1).
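A minimal sketch of this update, assuming the data is a list of (flavor, wrapper, hole) tuples and p_bag1(candy) is a hypothetical function returning P(Bag=1 | candy) under the current parameters (e.g., the Bayes-rule computation shown above):

```python
def update_theta_F1(data, p_bag1):
    # expected count of candies from Bag 1
    n_bag1 = sum(p_bag1(c) for c in data)
    # expected count of cherry candies from Bag 1
    n_cherry_bag1 = sum(p_bag1(c) for c in data if c[0] == "cherry")
    # M-step: theta_F1 = expected frequency of cherry among Bag-1 candies
    return n_cherry_bag1 / n_bag1
```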

Learning Performance After a complete cycle through all the parameters, we get the new parameter set θ(1). For any set of parameters, I can compute the log likelihood of the data as we did in the previous class. It can be shown that the log likelihood increases with each EM iteration, surpassing even the likelihood of the original (true) model after only 3 iterations.

EM: Discussion For more complex Bnets the algorithm is basically the same. In general, I may need to compute the conditional probability parameter for each variable Xi given its parents Pai: θijk = P(Xi = xij | Pai = paik). The expected counts are computed by summing over the examples, after having computed for each example e the posterior P(Xi = xij, Pai = paik | e) using any Bnet inference algorithm. The inference can be intractable, in which case there are variations of EM that use sampling algorithms for the E-step.
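A rough sketch of this general update, assuming a hypothetical inference routine posterior(e, i, j, k) that returns P(Xi = xij, Pai = paik | e) for example e under the current parameters, and that value_index_pairs lists every (i, j, k) combination of variable value and parent configuration:

```python
from collections import defaultdict

def update_cpts(examples, posterior, value_index_pairs):
    joint = defaultdict(float)    # expected count of (Xi = xij, Pai = paik)
    parents = defaultdict(float)  # expected count of (Pai = paik)
    for e in examples:
        for (i, j, k) in value_index_pairs:
            p = posterior(e, i, j, k)   # supplied by any Bnet inference algorithm
            joint[(i, j, k)] += p
            parents[(i, k)] += p        # summing over j gives the parent-configuration count
    # M-step: theta_ijk = expected count of (xij, paik) / expected count of paik
    return {ijk: joint[ijk] / parents[(ijk[0], ijk[2])] for ijk in joint}
```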

EM: Discussion The algorithm is sensitive to "degenerate" local maxima due to extreme configurations; e.g., data with outliers can generate categories that include only one outlier each, because such models have the highest log likelihood. A possible solution is to re-introduce priors over the hypotheses (parameters) and use the MAP version of EM.