
1 Rutgers CS440, Fall 2003 Introduction to Statistical Learning Reading: Ch. 20, Sec. 1-4, AIMA 2nd Ed.

2 Rutgers CS440, Fall 2003 Learning under uncertainty
How do we learn probabilistic models such as Bayesian networks, Markov models, HMMs, …?
Examples:
– Class confusion example: how did we come up with the CPTs?
– Earthquake-burglary network structure?
– How do we learn HMMs for speech recognition?
– Kalman model (e.g., mass, friction) parameters?
– User models encoded as Bayesian networks for HCI?

3 Rutgers CS440, Fall 2003 Hypotheses and Bayesian theory
Problem:
– Two kinds of candy, lemon (L) and chocolate (C)
– Packed in five types of unmarked bags: (100% C, 0% L) 10% of the time, (75% C, 25% L) 20% of the time, (50% C, 50% L) 40% of the time, (25% C, 75% L) 20% of the time, (0% C, 100% L) 10% of the time
– Task: open a bag (unwrap candy, observe it, …), then predict what the next one will be
Formulation:
– H (hypothesis): h_1 = (100, 0), h_2 = (75, 25), h_3 = (50, 50), h_4 = (25, 75), or h_5 = (0, 100)
– d_i (data): the i-th opened candy, L (lemon) or C (chocolate)
– Goal: predict d_{i+1} after seeing D = { d_0, d_1, …, d_i }, i.e., compute P(d_{i+1} | D)

4 Rutgers CS440, Fall 2003 Bayesian learning
Bayesian solution: estimate the probability of each hypothesis (candy bag type), then predict the data (candy type).
– Hypothesis posterior: P(h_i | D) ∝ P(D | h_i) P(h_i)   (data likelihood times hypothesis prior)
– Prediction: P(d_{i+1} | D) = Σ_i P(d_{i+1} | h_i) P(h_i | D)
– Data likelihood, for i.i.d. (independently, identically distributed) data points: P(D | h_i) = P(d_0 | h_i) × … × P(d_i | h_i)

5 Rutgers CS440, Fall 2003 Example
P(h_i) = ?
– P(h_i) = (0.1, 0.2, 0.4, 0.2, 0.1)
P(d_i | h_i) = ?
– e.g., P(chocolate | h_1) = 1, P(lemon | h_3) = 0.5
P(C, C, C, C, C | h_4) = ?
– P(C, C, C, C, C | h_4) = 0.25^5
P(h_5 | C, C, C, C, C) = ?
– P(h_5 | C, C, C, C, C) ∝ P(C, C, C, C, C | h_5) P(h_5) = 0^5 · 0.1 = 0
P(lemon | C, C, C, C, C) = ?
– P(lemon | h_1) P(h_1 | C, C, C, C, C) + … + P(lemon | h_5) P(h_5 | C, C, C, C, C) = 0·0.6244 + 0.25·0.2963 + 0.50·0.0780 + 0.75·0.0012 + 1·0 ≈ 0.1140
P(chocolate | C, C, C, C, …) = ?
– P(chocolate | C, C, C, C, …) → 1 as more chocolates are observed
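A minimal Python sketch of this computation (the encoding of candies and the function names are mine, not from the slides); it reproduces the posterior (0.6244, 0.2963, 0.0780, 0.0012, 0) and the predictive P(lemon | 5 chocolates) ≈ 0.114 used above.

```python
# Bayesian prediction for the candy-bag example (illustrative sketch).
priors = [0.1, 0.2, 0.4, 0.2, 0.1]      # P(h_1) ... P(h_5)
p_choc = [1.0, 0.75, 0.5, 0.25, 0.0]    # P(chocolate | h_i)

def posterior(data):
    """P(h_i | D) for a sequence of observations such as ['C', 'C', 'L']."""
    unnorm = []
    for prior, pc in zip(priors, p_choc):
        like = 1.0
        for d in data:                   # i.i.d. likelihood P(D | h_i)
            like *= pc if d == 'C' else (1.0 - pc)
        unnorm.append(prior * like)
    z = sum(unnorm)
    return [u / z for u in unnorm]

def predict_lemon(data):
    """P(lemon | D) = sum_i P(lemon | h_i) P(h_i | D)."""
    post = posterior(data)
    return sum((1.0 - pc) * p for pc, p in zip(p_choc, post))

print(posterior(['C'] * 5))      # ~ [0.6244, 0.2963, 0.0780, 0.0012, 0.0]
print(predict_lemon(['C'] * 5))  # ~ 0.114
```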

6 Rutgers CS440, Fall 2003 Bayesian prediction properties
– The true hypothesis eventually dominates the posterior.
– Bayesian prediction is optimal (it minimizes prediction error).
– This comes at a price: there are usually many hypotheses, and the summation over them is intractable.

7 Rutgers CS440, Fall 2003 Approximations to Bayesian prediction
MAP – maximum a posteriori:
– P(d | D) ≈ P(d | h_MAP), where h_MAP = arg max_{h_i} P(h_i | D)   (easier to compute)
– Role of the prior P(h_i): it penalizes complex hypotheses.
ML – maximum likelihood:
– P(d | D) ≈ P(d | h_ML), where h_ML = arg max_{h_i} P(D | h_i)
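Continuing the candy sketch above (again illustrative, not from the slides), the two approximations pick a single bag hypothesis instead of summing over all of them:

```python
# MAP and ML approximations on the candy example (reuses posterior() and p_choc above).
def predict_lemon_map(data):
    post = posterior(data)                                # P(h_i | D)
    h_map = max(range(len(post)), key=lambda i: post[i])  # arg max posterior
    return 1.0 - p_choc[h_map]                            # P(lemon | h_MAP)

def predict_lemon_ml(data):
    # ML ignores the prior: maximize the data likelihood P(D | h_i) alone.
    likes = []
    for pc in p_choc:
        like = 1.0
        for d in data:
            like *= pc if d == 'C' else (1.0 - pc)
        likes.append(like)
    h_ml = max(range(len(likes)), key=lambda i: likes[i])
    return 1.0 - p_choc[h_ml]

print(predict_lemon_map(['C'] * 5))  # 0.0 (h_MAP is the all-chocolate bag)
print(predict_lemon_ml(['C'] * 5))   # 0.0 (h_ML is also the all-chocolate bag)
```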

8 Rutgers CS440, Fall 2003 Learning from complete data
Learn the parameters of Bayesian models from data
– e.g., learn the probabilities of C and L for a bag of candy whose proportions of C and L are unknown, by observing opened candy from that bag.
Candy problem parameters:
– θ_u = probability of C in bag u: P(candy = C) = θ_u, P(candy = L) = 1 − θ_u
– π_u = prior probability of bag u, for bags u = 1, …, 5 (parameters π_1, …, π_5)

9 Rutgers CS440, Fall 2003 ML learning from complete data
ML approach: select the model parameters that maximize the likelihood of the seen data.
1. Assume a distribution model that determines how the samples (of candy) are distributed in a bag; here the likelihood model is binomial.
2. Select the parameters of the model that maximize the (log-)likelihood of the seen data.
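Written out in standard notation (the slide's own equations survive only as images): for a bag with chocolate probability θ_u and data containing N_C chocolates and N_L lemons out of N = N_C + N_L candies,

```latex
\[
L(\theta_u) = P(D \mid \theta_u) = \prod_{j=1}^{N} P(d_j \mid \theta_u)
            = \theta_u^{\,N_C}\,(1-\theta_u)^{\,N_L},
\qquad
\log L(\theta_u) = N_C \log \theta_u + N_L \log (1-\theta_u).
\]
```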

10 Rutgers CS440, Fall 2003 Maximum likelihood learning (binomial distribution) How to find a solution to the above problem?

11 Rutgers CS440, Fall 2003 Maximum likelihood learning (cont'd)
Take the first derivative of the (log-)likelihood and set it to zero. The result: counting!

Sample   d=C   d=L
1        1     0
2        0     1
…        …     …
N        1     0
Total:   N_C   N_L
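The derivation behind "counting", in standard notation (a reconstruction consistent with the binomial log-likelihood from slide 9):

```latex
\[
\frac{\partial}{\partial \theta_u} \log L(\theta_u)
  = \frac{N_C}{\theta_u} - \frac{N_L}{1-\theta_u} = 0
\quad\Longrightarrow\quad
\hat{\theta}_u^{\,ML} = \frac{N_C}{N_C + N_L} = \frac{N_C}{N}.
\]
```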

12 Rutgers CS440, Fall 2003 Naïve Bayes model
One cause variable and multiple, conditionally independent sources of evidence.
(Graph: class node C with children E_1, E_2, …, E_N.)
Example: C ∈ {spam, not spam}, E_i ∈ {token i present, token i absent}.
This is a limiting assumption, but it often works well in practice.

13 Rutgers CS440, Fall 2003 Inference & decision in the NB model
Inference: the (log-)posterior of a class combines an evidence score, a hypothesis (class) score, and a prior score.
Decision: compare the two classes via the log-odds ratio.
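A minimal sketch of this inference and decision rule for the spam example; the token probabilities, the prior, and the threshold at zero below are made-up placeholders, not values from the slides:

```python
import math

# Hypothetical NB parameters: P(E_i = 1 | C) for each token, and the class prior P(spam).
p_token_given_spam     = [0.80, 0.10, 0.60]   # P(token i present | spam)
p_token_given_not_spam = [0.20, 0.40, 0.50]   # P(token i present | not spam)
p_spam = 0.3

def log_odds(evidence):
    """log [ P(spam | e_1..e_N) / P(not spam | e_1..e_N) ] for a 0/1 evidence vector."""
    score = math.log(p_spam) - math.log(1.0 - p_spam)          # prior score
    for e, ps, pn in zip(evidence, p_token_given_spam, p_token_given_not_spam):
        ls = ps if e == 1 else (1.0 - ps)                      # P(e_i | spam)
        ln = pn if e == 1 else (1.0 - pn)                      # P(e_i | not spam)
        score += math.log(ls) - math.log(ln)                   # evidence score
    return score

msg = [1, 0, 1]                                                # tokens 1 and 3 present
print("SPAM" if log_odds(msg) > 0 else "NOT_SPAM")
```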

14 Rutgers CS440, Fall 2003 Learning in NB models
Example: given a set of K email messages with tokens D = { d_j = (e_1j, …, e_Nj) }, e_ij ∈ {0, 1} (token i in message j present/absent), and labels C = { c_j } (SPAM or NOT_SPAM, the label of message j), find the best set of CPTs P(E_i | C) and P(C).
Assume: P(E_i | C = c) is binomial with parameter θ_i,c and P(C) is binomial with parameter θ_C, for a total of 2N + 1 parameters.
ML learning: maximize the likelihood of the K messages, each one belonging to one of the two classes.
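A sketch of the resulting counting-based ML estimates (the data layout and values are illustrative; no smoothing yet, which is exactly the issue raised on slide 16):

```python
# ML learning of NB parameters by counting (illustrative data).
# messages: 0/1 token vectors; labels: 1 = SPAM, 0 = NOT_SPAM.
messages = [[1, 0, 1], [1, 1, 1], [0, 0, 1], [0, 1, 0]]
labels   = [1, 1, 0, 0]

N = len(messages[0])                   # number of tokens
K = len(messages)                      # number of messages

theta_C = sum(labels) / K              # ML estimate of P(C = spam)
theta = {}                             # theta[(i, c)] ~ P(E_i = 1 | C = c), 2N entries
for c in (0, 1):
    in_class = [m for m, lab in zip(messages, labels) if lab == c]
    for i in range(N):
        theta[(i, c)] = sum(m[i] for m in in_class) / len(in_class)

print(theta_C)   # 0.5
print(theta)     # e.g. theta[(0, 1)] = 1.0: token 0 appeared in every spam message
```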

15 Rutgers CS440, Fall 2003 Learning of Bayesian network parameters
Naïve Bayes learning can be extended to BNs! How? Model each CPT as a binomial/multinomial distribution and maximize the likelihood of the data given the BN.
(Graph: the earthquake–burglary network with alarm, call, and newscast.)

Sample   E   B   A   N   C
1        1   0   0   1   0
2        1   0   1   1   0
3        1   1   0   0   1
4        0   1   0   1   1
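A sketch of ML parameter learning from the complete-data table above, estimating the CPT entry P(A = 1 | E, B) by counting; it assumes, as in the alarm network used earlier in the course, that earthquake and burglary are the parents of alarm:

```python
# Complete data from the table above: columns E, B, A, N, C.
data = [
    (1, 0, 0, 1, 0),
    (1, 0, 1, 1, 0),
    (1, 1, 0, 0, 1),
    (0, 1, 0, 1, 1),
]

def cpt_alarm(samples):
    """ML estimate of P(A = 1 | E = e, B = b) by counting matching samples."""
    cpt = {}
    for e in (0, 1):
        for b in (0, 1):
            rows = [r for r in samples if r[0] == e and r[1] == b]
            if rows:                            # only parent configurations we observed
                cpt[(e, b)] = sum(r[2] for r in rows) / len(rows)
    return cpt

print(cpt_alarm(data))   # {(0, 1): 0.0, (1, 0): 0.5, (1, 1): 0.0}
```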

16 Rutgers CS440, Fall 2003 BN learning (cont'd)
Issues:
1. Priors on parameters. What if a count is zero, so an estimated probability comes out exactly 0? Should we trust it? Maybe always add some small pseudo-count α?
2. How do we learn a BN graph (structure)? Test all possible structures, then pick the one with the highest data likelihood?
3. What if we do not observe some nodes (evidence is not available on all nodes)?
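One common form of the pseudo-count fix, written for the binomial candy parameter (a Laplace-style smoothing; the specific form is my illustration, not necessarily the one on the slide):

```latex
\[
\hat{\theta}_u = \frac{N_C + \alpha}{N_C + N_L + 2\alpha},
\qquad \alpha > 0 \text{ small, e.g. } \alpha = 1,
\]
```

so that no finite data set ever produces an estimate of exactly 0 or 1.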

17 Rutgers CS440, Fall 2003 Learning from incomplete data
Example:
– In the alarm network, we received data where we only know Newscast, Call, Earthquake, and Burglary, but have no idea what the Alarm state is (A is a hidden variable).
– In the SPAM model, we do not know whether a message is spam or not (missing label).

Sample   E   B   A     N   C
1        1   0   N/A   1   0
2        1   0   N/A   1   0
3        1   1   N/A   0   1
4        0   1   N/A   1   1

Solution? We can still try to find the network parameters that maximize the likelihood of the incomplete data.

18 Rutgers CS440, Fall 2003 Completing the data
Maximizing the incomplete-data likelihood is tricky. If we could somehow complete the data, we would know how to select the model parameters that maximize the likelihood of the completed data.
How do we complete the missing data?
1. Randomly complete it?
2. Estimate the missing data from the evidence, P(h | Evidence):

Sample   E   B   A                              N   C
1.0      1   0   P(a=0 | E=1, B=0, N=1, C=0)    1   0
1.1      1   0   P(a=1 | E=1, B=0, N=1, C=0)    1   0
2        1   0   …                              1   0
3        1   1   …                              0   1
4        0   1   …                              1   1
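A sketch of how the completion weights for sample 1 could be computed. The CPT numbers below are hypothetical placeholders; only the structure (E and B are parents of A, Call depends on A, Newscast depends only on E, so its factor cancels) is taken from the alarm network:

```python
# Completing sample 1: P(A | E=1, B=0, N=1, C=0) under hypothetical CPT values.
p_a_given_eb = {(0, 0): 0.01, (0, 1): 0.8, (1, 0): 0.3, (1, 1): 0.9}  # P(A=1 | E, B), made up
p_c_given_a  = {0: 0.05, 1: 0.7}                                      # P(Call=1 | A), made up

def complete_alarm(e, b, c):
    """Return (P(a=0 | evidence), P(a=1 | evidence)); the Newscast factor cancels."""
    w = []
    for a in (0, 1):
        p_a = p_a_given_eb[(e, b)] if a == 1 else 1.0 - p_a_given_eb[(e, b)]
        p_c = p_c_given_a[a] if c == 1 else 1.0 - p_c_given_a[a]
        w.append(p_a * p_c)              # factors not involving A drop out on normalization
    z = sum(w)
    return w[0] / z, w[1] / z

print(complete_alarm(e=1, b=0, c=0))     # weights for the completed rows 1.0 and 1.1 above
```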

19 Rutgers CS440, Fall 2003 EM algorithm
With completed data D_c, maximize the completed (log-)likelihood, weighting the contribution of each sample by P(h | d).
E(xpectation) M(aximization) algorithm:
1. Pick initial parameter estimates θ_0.
2. error = Inf
3. While (error > max_error):
   a. E-step: complete the data, D_c, based on θ_{k-1}.
   b. M-step: compute new parameters θ_k that maximize the completed-data likelihood.
   c. error = L(D | θ_k) − L(D | θ_{k-1})
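In standard notation (my reconstruction, consistent with "weight each sample's contribution by P(h | d)"), the M-step maximizes the expected complete-data log-likelihood:

```latex
\[
Q(\theta \mid \theta_{k-1})
  = \sum_{d \in D} \sum_{h} P(h \mid d, \theta_{k-1}) \, \log P(d, h \mid \theta),
\qquad
\theta_k = \arg\max_{\theta} Q(\theta \mid \theta_{k-1}).
\]
```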

20 Rutgers CS440, Fall 2003 EM example
Candy problem, but now we do not know which bag each candy came from (the bag label is missing).
E-step: for each candy, compute the posterior probability that it came from bag u, given the current parameters.
M-step: re-estimate the prior probability π_u of each bag and the candy (C) probability θ_u in each bag, weighting each candy by its E-step posterior.
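A minimal EM sketch for this setting, treating each candy as a sample with a hidden bag label (the observed sequence and the initial parameter values are illustrative, not from the slides):

```python
# EM for the candy problem with missing bag labels (illustrative sketch).
candies = ['C', 'C', 'L', 'C', 'L', 'C', 'C', 'L', 'C', 'C']   # observed candies, bags unknown

# Initial guesses (kept strictly inside (0, 1) so every bag gets some responsibility).
pi    = [0.2] * 5                          # pi[u]    ~ P(bag = u)
theta = [0.9, 0.7, 0.5, 0.3, 0.1]          # theta[u] ~ P(C | bag = u)

for iteration in range(50):
    # E-step: responsibilities r[j][u] = P(bag = u | candy j) under current parameters.
    r = []
    for d in candies:
        w = [pi[u] * (theta[u] if d == 'C' else 1.0 - theta[u]) for u in range(5)]
        z = sum(w)
        r.append([wu / z for wu in w])

    # M-step: re-estimate bag priors and per-bag chocolate probabilities from weighted counts.
    for u in range(5):
        resp = sum(r[j][u] for j in range(len(candies)))
        pi[u] = resp / len(candies)
        theta[u] = sum(r[j][u] for j, d in enumerate(candies) if d == 'C') / resp

print(pi)
print(theta)
```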

21 Rutgers CS440, Fall 2003 EM learning of HMM parameters
An HMM needs EM for parameter learning (unless we know the hidden state exactly at every time instance):
– We need to learn the transition and emission parameters.
E.g., learning HMMs for speech modeling:
1. Assume a general (word/language) model.
2. E-step: recognize (your own) speech using this model (Viterbi decoding).
3. M-step: tweak the parameters to recognize your speech a bit better (ML parameter fitting).
4. Go to 2.
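For reference (my addition, not on the slide), the usual Baum–Welch form of the M-step updates, with γ_t(i) = P(state i at time t | observations) and ξ_t(i, j) = P(states i and j at times t and t+1 | observations) computed in the E-step:

```latex
\[
\hat{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)},
\qquad
\hat{b}_i(o) = \frac{\sum_{t=1}^{T} \gamma_t(i)\, \mathbf{1}[o_t = o]}{\sum_{t=1}^{T} \gamma_t(i)}.
\]
```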

