# Statistical Learning (From Data to Distributions)

## Presentation on theme: "Statistical Learning (From Data to Distributions)". Presentation transcript:

### Agenda

- Learning a discrete probability distribution from data
- Maximum likelihood estimation (MLE)
- Maximum a posteriori (MAP) estimation

### Motivation

- An agent has made observations (data) and must now make sense of them (hypotheses).
- Hypotheses are useful:
  - on their own (e.g., in basic science)
  - for inference (e.g., forecasting)
  - for taking sensible actions (decision making)
- A basic component of economics, the social and hard sciences, engineering, …

### Machine Learning vs. Statistics

- Machine learning ≈ automated statistics
- This lecture:
  - Bayesian learning
  - Maximum likelihood (ML) learning
  - Maximum a posteriori (MAP) learning
  - Learning Bayes nets (R&N 20.1-3)
- Future lectures try to do more with even less data:
  - Decision tree learning
  - Neural nets
  - Support vector machines
  - …

### Putting General Principles to Practice

- Note: this lecture covers general principles of statistical learning on toy problems.
- These principles are grounded in some of the most theoretically principled approaches to learning.
- But the techniques are far more general.
- Practical applications to larger problems require a bit of mathematical savvy and "elbow grease".

Beautiful results in math give rise to simple ideas. Or: how to jump through hoops in order to justify simple ideas…

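### Candy Example

- Candy comes in 2 flavors, cherry and lime, with identical wrappers.
- The manufacturer makes 5 kinds of bags that are indistinguishable from the outside.
- Suppose we draw a sequence of candies from one bag. What bag are we holding? What flavor will we draw next?

| Hypothesis | Cherry (C) | Lime (L) |
|---|---|---|
| h1 | 100% | 0% |
| h2 | 75% | 25% |
| h3 | 50% | 50% |
| h4 | 25% | 75% |
| h5 | 0% | 100% |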

### Bayesian Learning

- Main idea: compute the probability of each hypothesis, given the data.
- Data: $\mathbf{d}$ (the observed draws). Hypotheses: $h_1,\dots,h_5$ (the five bag types of the candy example).
- We want $P(h_i \mid \mathbf{d})$, but all we have is $P(\mathbf{d} \mid h_i)$!

### Using Bayes' Rule

- $P(h_i \mid \mathbf{d}) = \alpha\, P(\mathbf{d} \mid h_i)\, P(h_i)$ is the posterior.
  - Recall that $1/\alpha = P(\mathbf{d}) = \sum_i P(\mathbf{d} \mid h_i)\, P(h_i)$.
- $P(\mathbf{d} \mid h_i)$ is the likelihood.
- $P(h_i)$ is the hypothesis prior.

### Likelihood and Prior

- The likelihood is the probability of observing the data, given the hypothesis model.
- The hypothesis prior is the probability of a hypothesis before any data has been observed.

### Computing the Posterior

- Assume the draws are independent.
- Let $(P(h_1),\dots,P(h_5)) = (0.1, 0.2, 0.4, 0.2, 0.1)$.
- Data: $\mathbf{d}$ = 10 limes in a row.

| $h_i$ | $P(\mathbf{d} \mid h_i)$ | $P(\mathbf{d} \mid h_i)P(h_i)$ | $P(h_i \mid \mathbf{d})$ |
|---|---|---|---|
| h1 | $0^{10} = 0$ | 0 | 0 |
| h2 | $0.25^{10}$ | 1.9e-7 | 0.00 |
| h3 | $0.5^{10}$ | 4e-4 | 0.00 |
| h4 | $0.75^{10}$ | 0.011 | 0.10 |
| h5 | $1^{10} = 1$ | 0.1 | 0.90 |

Sum: $1/\alpha = \sum_i P(\mathbf{d} \mid h_i)P(h_i) \approx 0.1117$
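The posterior computation for the candy example can be checked with a few lines of Python. This is a minimal sketch: the priors and lime probabilities are taken from the slides, while the function name `posterior` is just an illustrative choice.

```python
# Sketch of the posterior computation for the candy example.
priors = [0.1, 0.2, 0.4, 0.2, 0.1]      # P(h_1), ..., P(h_5) from the slide
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]    # P(lime | h_i) for each bag type
n_limes = 10                            # data d: 10 limes in a row


def posterior(priors, p_lime, n_limes):
    """Return P(h_i | d) when d consists of n_limes independent lime draws."""
    # Likelihood times prior for each hypothesis.
    unnormalized = [p ** n_limes * prior for p, prior in zip(p_lime, priors)]
    evidence = sum(unnormalized)        # P(d), i.e. 1/alpha
    return [u / evidence for u in unnormalized]


post = posterior(priors, p_lime, n_limes)
print([round(p, 2) for p in post])      # [0.0, 0.0, 0.0, 0.1, 0.9]
```

Dividing by the sum of the unnormalized terms plays the role of multiplying by $\alpha$ in Bayes' rule.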

### Posterior Hypotheses

(Figure slide.)

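### Predicting the Next Draw

- $P(X \mid \mathbf{d}) = \sum_i P(X \mid h_i, \mathbf{d})\,P(h_i \mid \mathbf{d}) = \sum_i P(X \mid h_i)\,P(h_i \mid \mathbf{d})$, because the next draw $X$ is independent of the data $\mathbf{d}$ once the hypothesis is given.

| $h_i$ | $P(h_i \mid \mathbf{d})$ | $P(X = \text{lime} \mid h_i)$ |
|---|---|---|
| h1 | 0 | 0 |
| h2 | 0.00 | 0.25 |
| h3 | 0.00 | 0.5 |
| h4 | 0.10 | 0.75 |
| h5 | 0.90 | 1 |

- Probability that the next candy drawn is a lime: $P(X \mid \mathbf{d}) \approx 0.975$ with the rounded posteriors in the table (about 0.973 without rounding).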
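The Bayesian prediction is a posterior-weighted average of the per-hypothesis predictions. The sketch below recomputes it without rounding the posteriors; `predict_lime` is a hypothetical helper name, and the numbers are the ones from the candy example.

```python
priors = [0.1, 0.2, 0.4, 0.2, 0.1]      # P(h_i) from the candy example
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]    # P(lime | h_i)


def predict_lime(priors, p_lime, n_limes):
    """P(next draw is lime | n_limes limes observed so far)."""
    # Unnormalized posterior weights P(d | h_i) P(h_i).
    weights = [p ** n_limes * prior for p, prior in zip(p_lime, priors)]
    evidence = sum(weights)
    # Sum over hypotheses of P(X | h_i) P(h_i | d).
    return sum(p * w for p, w in zip(p_lime, weights)) / evidence


print(round(predict_lime(priors, p_lime, 10), 3))   # 0.973
print(round(predict_lime(priors, p_lime, 0), 3))    # 0.5 (prior prediction)
```

With no data the prediction is the prior average of 0.5, and it moves toward 1 as more limes are observed.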

### P(Next Candy Is Lime | d)

(Figure slide.)

### Properties of Bayesian Learning

- If exactly one hypothesis is correct, then the posterior probability of the correct hypothesis tends toward 1 as more data is observed.
- The effect of the prior distribution decreases as more data is observed.

### Hypothesis Spaces Often Intractable

- To learn a probability distribution, a hypothesis would have to be a joint probability table over the state variables.
- With $n$ Boolean variables the table has $2^n$ entries, so the hypothesis space is $(2^n - 1)$-dimensional!
- There are $2^{2^n}$ deterministic hypotheses: 6 Boolean variables give $2^{64} \approx 1.8 \times 10^{19}$ hypotheses.
- And what the heck would a prior be?
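The counts on this slide are easy to verify directly. A small sketch, assuming the variables are Boolean and using `n = 6` as in the slide:

```python
n = 6                                          # number of Boolean variables
table_entries = 2 ** n                         # entries in the joint table
free_parameters = table_entries - 1            # entries must sum to 1
deterministic_hypotheses = 2 ** table_entries  # one true/false choice per entry

print(table_entries, free_parameters, deterministic_hypotheses)
# 64 63 18446744073709551616
```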

### Learning Coin Flips

- Let the unknown fraction of cherries be $\theta$ (the hypothesis).
- The probability of drawing a cherry is $\theta$.
- Suppose the draws are independent and identically distributed (i.i.d.).
- Observe that $c$ out of $N$ draws are cherries (the data).
- Intuition: $c/N$ might be a good hypothesis (or it might not, depending on the draw!).

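### Maximum Likelihood

- Likelihood of the data $\mathbf{d} = \{d_1,\dots,d_N\}$ given $\theta$:

$$P(\mathbf{d} \mid \theta) = \prod_j P(d_j \mid \theta) = \theta^c (1-\theta)^{N-c}$$

- The first equality uses the i.i.d. assumption; the second gathers the $c$ cherry terms together, then the $N-c$ lime terms.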

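### Maximum Likelihood

- Likelihood of the data $\mathbf{d} = \{d_1,\dots,d_N\}$ given $\theta$: $P(\mathbf{d} \mid \theta) = \theta^c (1-\theta)^{N-c}$ (plotted as a function of $\theta$ on the original slides).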

### Maximum Likelihood

- The peaks of the likelihood function seem to hover around the fraction of cherries…
- The sharpness of a peak indicates some notion of certainty…

### Maximum Likelihood

- Let $P(\mathbf{d} \mid \theta)$ be the likelihood function.
- The quantity $\arg\max_\theta P(\mathbf{d} \mid \theta)$ is known as the maximum likelihood estimate (MLE).

### Maximum Likelihood

- The maximum of $P(\mathbf{d} \mid \theta)$ is attained at the same $\theta$ as the maximum of $\log P(\mathbf{d} \mid \theta)$, because log is a monotonically increasing function.
- So the MLE can be found by maximizing the log-likelihood instead, which has two advantages:
  - Multiplications turn into additions.
  - We don't have to deal with such tiny numbers.

### Maximum Likelihood

- Expanding the log-likelihood $l(\theta)$:

$$
\begin{aligned}
l(\theta) = \log P(\mathbf{d} \mid \theta) &= \log\left[\theta^c (1-\theta)^{N-c}\right] \\
&= \log\left[\theta^c\right] + \log\left[(1-\theta)^{N-c}\right] \\
&= c \log\theta + (N-c)\log(1-\theta)
\end{aligned}
$$
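To see why the log form matters in practice, the sketch below evaluates both forms on a larger data set. The counts `c = 3000` and `n = 10000` are made-up illustration values, not from the lecture.

```python
import math


def likelihood(theta, c, n):
    """Direct product form: theta^c * (1 - theta)^(n - c)."""
    return theta ** c * (1 - theta) ** (n - c)


def log_likelihood(theta, c, n):
    """Log form: c log(theta) + (n - c) log(1 - theta)."""
    return c * math.log(theta) + (n - c) * math.log(1 - theta)


c, n = 3000, 10000                       # hypothetical counts
print(likelihood(0.3, c, n))             # 0.0: the product underflows
print(log_likelihood(0.3, c, n))         # a finite negative number
print(log_likelihood(0.3, c, n) > log_likelihood(0.5, c, n))   # True
```

The direct product is too small for a double-precision float, so every candidate $\theta$ would look equally bad, while the log form still ranks them correctly.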

### Maximum Likelihood

- The log-likelihood is $l(\theta) = \log P(\mathbf{d} \mid \theta) = c \log\theta + (N-c)\log(1-\theta)$.
- At a maximum of a function, its derivative is 0.
- So $\frac{dl}{d\theta}(\theta) = 0$ at the maximum likelihood estimate:

$$0 = \frac{c}{\theta} - \frac{N-c}{1-\theta} \quad\Rightarrow\quad \theta = \frac{c}{N}$$
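The closed-form result $\theta = c/N$ can be sanity-checked numerically. This sketch uses a brute-force grid search rather than calculus; the counts `c = 7`, `n = 20` and the helper name `grid_mle` are illustrative assumptions.

```python
import math


def log_likelihood(theta, c, n):
    return c * math.log(theta) + (n - c) * math.log(1 - theta)


def grid_mle(c, n, steps=1000):
    """Return the grid point in (0, 1) with the highest log-likelihood."""
    grid = [k / steps for k in range(1, steps)]    # excludes 0 and 1
    return max(grid, key=lambda t: log_likelihood(t, c, n))


c, n = 7, 20                      # hypothetical: 7 cherries in 20 draws
print(grid_mle(c, n), c / n)      # 0.35 0.35
```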

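### Maximum Likelihood for BN

- For any BN, the ML parameters of any CPT are given by the fraction of observed values in the data, conditioned on matching parent values.
- Example: Alarm (A) with parents Earthquake (E) and Burglar (B). With $N = 1000$ samples, E: 500 and B: 200, so $P(E) = 0.5$ and $P(B) = 0.2$.

| E | B | Count of A / count of parent values | $P(A \mid E,B)$ |
|---|---|---|---|
| T | T | 19/20 | 0.95 |
| F | T | 188/200 | 0.94 |
| T | F | 170/500 | 0.34 |
| F | F | 1/380 | 0.003 |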
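As a rough illustration, the CPT entries are just ratios of counts. The sketch below uses the counts from the slide; the dictionary layout keyed by `(E, B)` and the name `ml_cpt` are assumptions made for this example.

```python
# (alarm count, total count) for each parent configuration (E, B).
counts = {
    (True, True): (19, 20),
    (False, True): (188, 200),
    (True, False): (170, 500),
    (False, False): (1, 380),
}


def ml_cpt(counts):
    """ML estimate of P(A | E, B): fraction of alarms per parent configuration."""
    return {parents: a / total for parents, (a, total) in counts.items()}


cpt = ml_cpt(counts)
for parents, p in cpt.items():
    print(parents, round(p, 3))
```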

### Other MLE Results

- Multi-valued variables: take the fraction of counts for each value.
- Continuous Gaussian distributions: take the average value as the mean and the standard deviation of the data as the standard deviation.
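For the Gaussian case, a minimal sketch looks like this. Note that the ML standard deviation divides by $N$ rather than $N-1$; the data values are made up for illustration.

```python
import math


def gaussian_mle(xs):
    """ML estimates (mean, standard deviation) for a 1-D Gaussian."""
    n = len(xs)
    mean = sum(xs) / n
    variance = sum((x - mean) ** 2 for x in xs) / n   # divides by n, not n - 1
    return mean, math.sqrt(variance)


data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]       # hypothetical samples
print(gaussian_mle(data))                              # (5.0, 2.0)
```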

### Maximum Likelihood Properties

- As the number of data points approaches infinity, the MLE approaches the true parameter value.
- With little data, MLEs can vary wildly.

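### Maximum Likelihood in Candy Bag Example

- $h_{ML} = \arg\max_{h_i} P(\mathbf{d} \mid h_i)$
- $P(X \mid \mathbf{d}) \approx P(X \mid h_{ML})$
- Figure: $h_{ML}$ is undefined before any data is seen, then $h_5$; the plot compares $P(X \mid h_{ML})$ with $P(X \mid \mathbf{d})$.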

### Back to Coin Flips

- The MLE is easy to compute… but what about those small sample sizes?
- Motivation:
  - You hand me a coin from your pocket.
  - 1 flip, turns up tails.
  - What's the MLE?

### Maximum a Posteriori Estimation

- Idea: use the hypothesis prior to get a better initial estimate than ML, without resorting to full Bayesian estimation.
  - "Most coins I've seen have been fair coins, so I won't let the first few tails sway my estimate much."
  - "Now that I've seen 100 tails in a row, I'm pretty sure it's not a fair coin anymore."

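### Maximum a Posteriori

- $P(\theta \mid \mathbf{d})$ is the posterior probability of the hypothesis, given the data.
- $\arg\max_\theta P(\theta \mid \mathbf{d})$ is known as the maximum a posteriori (MAP) estimate.
- Posterior of hypothesis $\theta$ given data $\mathbf{d} = \{d_1,\dots,d_N\}$:

$$P(\theta \mid \mathbf{d}) = \frac{1}{P(\mathbf{d})}\, P(\mathbf{d} \mid \theta)\, P(\theta)$$

- The normalizer $P(\mathbf{d})$ does not depend on $\theta$, so it does not affect the maximization.
- So the MAP estimate is $\arg\max_\theta P(\mathbf{d} \mid \theta)\, P(\theta)$.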

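### Maximum a Posteriori

- $h_{MAP} = \arg\max_{h_i} P(h_i \mid \mathbf{d})$
- $P(X \mid \mathbf{d}) \approx P(X \mid h_{MAP})$
- Figure: $h_{MAP}$ is $h_3$, then $h_4$, then $h_5$; the plot compares $P(X \mid h_{MAP})$ with $P(X \mid \mathbf{d})$.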

### Advantages of MAP and MLE over Bayesian Estimation

- They involve an optimization rather than a large summation.
  - Local search techniques can be used.
- For some types of distributions, there are closed-form solutions that are easily computed.

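### Back to Coin Flips

- We need some prior distribution $P(\theta)$.
- $P(\theta \mid \mathbf{d}) \propto P(\mathbf{d} \mid \theta)\,P(\theta) = \theta^c (1-\theta)^{N-c}\, P(\theta)$
- Define, for all $\theta$, the probability that I believe in $\theta$.
- Figure: a prior $P(\theta)$ plotted over $\theta \in [0, 1]$.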

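### MAP Estimate

- We could maximize $\theta^c (1-\theta)^{N-c}\, P(\theta)$ using some optimization method.
- It turns out that for some families of $P(\theta)$, the MAP estimate is easy to compute: Beta distributions.
- Figure: Beta priors $P(\theta)$ plotted over $\theta \in [0, 1]$.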

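### Beta Distribution

- $\mathrm{Beta}_{\alpha,\beta}(\theta) = \gamma\, \theta^{\alpha-1} (1-\theta)^{\beta-1}$
- $\alpha, \beta$ are hyperparameters, both $> 0$.
- $\gamma$ is a normalization constant.
- $\alpha = \beta = 1$ gives the uniform distribution.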

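### Posterior with Beta Prior

- Posterior: $P(\theta \mid \mathbf{d}) \propto \theta^c (1-\theta)^{N-c}\, P(\theta) = \gamma\, \theta^{c+\alpha-1} (1-\theta)^{N-c+\beta-1}$
- MAP estimate: $\theta = \dfrac{c+\alpha-1}{N+\alpha+\beta-2}$
- The posterior is also a beta distribution!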

### Posterior with Beta Prior

- What does this mean? The prior specifies a "virtual count" of $a = \alpha - 1$ heads and $b = \beta - 1$ tails.
- See heads, increment $a$. See tails, increment $b$.
- The MAP estimate is $a/(a+b)$.
- The effect of the prior diminishes with more data.
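The pocket-coin question can be worked through with the MAP formula $(c+\alpha-1)/(N+\alpha+\beta-2)$. In this sketch the prior Beta(11, 11), i.e. 10 virtual heads and 10 virtual tails, is an assumed value chosen only to illustrate a belief that most coins are fair.

```python
def mle(heads, n):
    """Maximum likelihood estimate of P(heads)."""
    return heads / n


def map_beta(heads, n, alpha, beta):
    """MAP estimate of P(heads) under a Beta(alpha, beta) prior."""
    return (heads + alpha - 1) / (n + alpha + beta - 2)


# One flip, tails.
print(mle(0, 1))                            # 0.0
print(round(map_beta(0, 1, 11, 11), 3))     # 0.476

# 100 tails in a row: the data now outweighs the prior.
print(round(map_beta(0, 100, 11, 11), 3))   # 0.083

# A uniform prior (alpha = beta = 1) reproduces the MLE.
print(map_beta(3, 10, 1, 1) == mle(3, 10))  # True
```

After one tail the MLE jumps to 0, while the MAP estimate barely moves from 0.5; after 100 tails the MAP estimate has also dropped far below 0.5.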

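### MAP for BN

- For any BN, the MAP parameters of any CPT are given by the fraction of observed values in the data plus the prior virtual counts.
- Example: Alarm (A) with parents Earthquake (E) and Burglar (B). With virtual counts added, E: 500+100, B: 200+200, N = 1000+1000, so $P(E)$ changes from 0.5 to 0.3 and $P(B)$ stays at 0.2.

| E | B | Data + virtual counts | $P(A \mid E,B)$, ML | $P(A \mid E,B)$, MAP |
|---|---|---|---|---|
| T | T | (19+100)/(20+100) | 0.95 | 0.99 |
| F | T | (188+180)/(200+200) | 0.94 | 0.92 |
| T | F | (170+100)/(500+400) | 0.34 | 0.3 |
| F | F | (1+3)/(380+300) | 0.003 | 0.006 |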
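The virtual-count update can be sketched as follows. The data and prior counts are the ones on the slide; the dictionary layout keyed by `(E, B)` and the name `map_cpt` are assumptions for this example.

```python
# (alarm count, total count) per parent configuration (E, B).
data_counts = {
    (True, True): (19, 20),
    (False, True): (188, 200),
    (True, False): (170, 500),
    (False, False): (1, 380),
}
# Prior "virtual" counts in the same layout.
prior_counts = {
    (True, True): (100, 100),
    (False, True): (180, 200),
    (True, False): (100, 400),
    (False, False): (3, 300),
}


def map_cpt(data_counts, prior_counts):
    """Estimate P(A | E, B) from observed counts plus virtual counts."""
    cpt = {}
    for parents, (alarms, total) in data_counts.items():
        v_alarms, v_total = prior_counts[parents]
        cpt[parents] = (alarms + v_alarms) / (total + v_total)
    return cpt


for parents, p in map_cpt(data_counts, prior_counts).items():
    print(parents, round(p, 3))
```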

### Extensions of Beta Priors

- Multiple discrete values: Dirichlet prior.
  - The mathematical expression is more complex, but in practice it still takes the form of "virtual counts".
- Mean and standard deviation for Gaussian distributions: Gamma prior.
- Conjugate priors preserve the representation of the prior and posterior distributions, but do not necessarily exist for general distributions.

### Recap

- Bayesian learning: infer the entire distribution over hypotheses.
  - Robust, but expensive for large problems.
- Maximum likelihood (ML): optimize the likelihood.
  - Good for large datasets, not robust for small ones.
- Maximum a posteriori (MAP): weight the likelihood by a hypothesis prior during optimization.
  - A compromise solution for small datasets.
  - As with Bayesian learning, the question of how to specify the prior remains.

### Application to Structure Learning

- So far we've assumed a BN structure, which assumes we have enough domain knowledge to declare strict independence relationships.
- Problem: how do we learn a sparse probabilistic model whose structure is unknown?

### Bayesian Structure Learning

- But what if the Bayesian network has unknown structure?
- Are Earthquake and Burglar independent? Is $P(E,B) = P(E)P(B)$?
- Observed samples of (B, E): (0, 0), (1, 1), (0, 1), (1, 0), …

### Bayesian Structure Learning

- Hypothesis 1: dependent. $P(E,B)$ as a probability table has 3 free parameters.
- Hypothesis 2: independent. The individual tables $P(E)$ and $P(B)$ have 2 free parameters in total.
- Maximum likelihood will always prefer hypothesis 1, because its fitted likelihood is never lower than that of hypothesis 2!
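A small experiment makes the point concrete: fit both hypotheses by maximum likelihood and compare their log-likelihoods. The ten `(E, B)` samples below are made up for illustration, and the function names are hypothetical.

```python
import math
from collections import Counter


def loglik_independent(data):
    """Log-likelihood under P(E)P(B) with ML-fitted marginals (2 parameters)."""
    n = len(data)
    p_e = sum(e for e, _ in data) / n
    p_b = sum(b for _, b in data) / n
    total = 0.0
    for e, b in data:
        p = (p_e if e else 1 - p_e) * (p_b if b else 1 - p_b)
        total += math.log(p)
    return total


def loglik_joint(data):
    """Log-likelihood under a full ML-fitted joint table P(E, B) (3 parameters)."""
    n = len(data)
    counts = Counter(data)
    return sum(k * math.log(k / n) for k in counts.values())


# Hypothetical samples of (E, B).
data = [(0, 0)] * 6 + [(1, 1)] * 2 + [(0, 1)] + [(1, 0)]
print(loglik_joint(data) >= loglik_independent(data))   # True
```

The joint table can represent every independent distribution, so its best fit can never score worse. This is why a complexity penalty is needed before the simpler model can win.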

### Bayesian Structure Learning

- Occam's razor: prefer simpler explanations over complex ones.
- Penalize free parameters in probability tables.

### Iterative Greedy Algorithm

- Fit a simple model (ML) and compute its likelihood (usually the log-likelihood).
- Repeat:
  - Pick a candidate edge to add to the BN.
  - Compute the new ML probabilities.
  - If the new likelihood exceeds the old likelihood + threshold, add the edge.
  - Otherwise, repeat with a different edge.
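The loop can be sketched in Python under some simplifying assumptions that are not from the lecture: variables are discrete, candidate edges only go from a lower-indexed variable to a higher-indexed one (which keeps the graph acyclic without an explicit cycle check), and the threshold is an arbitrary constant. All names here are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def family_loglik(data, child, parents):
    """ML log-likelihood contribution of one node given its parent set."""
    joint = Counter((tuple(row[p] for p in parents), row[child]) for row in data)
    parent_counts = Counter(tuple(row[p] for p in parents) for row in data)
    return sum(k * math.log(k / parent_counts[pa]) for (pa, _), k in joint.items())


def loglik(data, parents_of):
    """Log-likelihood of the whole network with ML-fitted CPTs."""
    return sum(family_loglik(data, v, ps) for v, ps in parents_of.items())


def greedy_structure(data, n_vars, threshold):
    parents_of = {v: [] for v in range(n_vars)}     # simple model: no edges
    best = loglik(data, parents_of)
    improved = True
    while improved:
        improved = False
        # Candidate edges u -> v with u < v only (assumed fixed ordering).
        for u, v in combinations(range(n_vars), 2):
            if u in parents_of[v]:
                continue
            parents_of[v].append(u)                 # tentatively add the edge
            new = loglik(data, parents_of)          # refit ML probabilities
            if new > best + threshold:
                best, improved = new, True          # keep the edge
            else:
                parents_of[v].pop()                 # reject, try another edge
    return parents_of, best


# Hypothetical data: variable 1 copies variable 0, variable 2 is unrelated.
data = [(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1)] * 10
structure, score = greedy_structure(data, n_vars=3, threshold=1.0)
print(structure)    # {0: [], 1: [0], 2: []}
```

The threshold acts as a crude complexity penalty: an edge is kept only when it raises the log-likelihood by more than the threshold, so edges that add parameters without explaining the data are rejected.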

### Other Topics in Learning Distributions

- Learning continuous distributions using kernel density estimation / nearest neighbor smoothing
- Expectation Maximization for hidden variables
  - Learning Bayes nets with hidden variables
  - Learning HMMs from observations
  - Learning Gaussian Mixture Models of continuous data

### Next Time

- Introduction to machine learning
- R&N 18.1-2
