Statistical Learning (From data to distributions).

Reminders: HW5 deadline extended to Friday

Agenda
Learning a probability distribution from data
Maximum likelihood estimation (MLE)
Maximum a posteriori (MAP) estimation
Expectation Maximization (EM)

Motivation
An agent has made observations (data) and must now make sense of them (hypotheses).
–Hypotheses alone may be important (e.g., in basic science)
–For inference (e.g., forecasting)
–To take sensible actions (decision making)
A basic component of economics, the social and hard sciences, engineering, …

Candy Example
Candy comes in 2 flavors, cherry (C) and lime (L), with identical wrappers.
The manufacturer makes 5 (indistinguishable) bags:
H1  C: 100%  L: 0%
H2  C: 75%   L: 25%
H3  C: 50%   L: 50%
H4  C: 25%   L: 75%
H5  C: 0%    L: 100%
Suppose we draw some candies from one bag [image: the candies drawn].
What bag are we holding? What flavor will we draw next?

Machine Learning vs. Statistics
Machine Learning ≈ automated statistics
This lecture:
–Bayesian learning, the more “traditional” statistics (R&N 20.1-3)
–Learning Bayes Nets

Bayesian Learning
Main idea: consider the probability of each hypothesis, given the data.
Data: $d$. Hypotheses: $h_1, \dots, h_5$. Quantity of interest: $P(h_i \mid d)$.
h1 C: 100% L: 0%; h2 C: 75% L: 25%; h3 C: 50% L: 50%; h4 C: 25% L: 75%; h5 C: 0% L: 100%

Using Bayes’ Rule
$P(h_i \mid d) = \alpha\, P(d \mid h_i)\, P(h_i)$ is the posterior
–(Recall, $1/\alpha = \sum_i P(d \mid h_i)\, P(h_i)$)
$P(d \mid h_i)$ is the likelihood
$P(h_i)$ is the hypothesis prior

Computing the Posterior
Assume draws are independent.
Let $(P(h_1), \dots, P(h_5)) = (0.1, 0.2, 0.4, 0.2, 0.1)$ and d = {10 × lime}.
i   P(d|h_i)   P(d|h_i)P(h_i)   P(h_i|d)
1   0          0                0
2   0.25^10    1.9e-7           0.00
3   0.5^10     4e-4             0.00
4   0.75^10    0.011            0.10
5   1^10       0.1              0.90
Sum = $1/\alpha \approx 0.1117$
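
The table above can be reproduced in a few lines. This is a minimal sketch, assuming the data are n lime draws in a row; the function name `posterior` is illustrative and not from the lecture.

```python
# Priors P(h_1..h_5) and per-hypothesis probability of a lime, taken from the slide.
priors = [0.1, 0.2, 0.4, 0.2, 0.1]
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]

def posterior(n_limes):
    """Return [P(h_i | d)] when d consists of n_limes independent lime draws."""
    # Unnormalized posterior: likelihood P(d | h_i) times prior P(h_i).
    unnormalized = [p ** n_limes * prior for p, prior in zip(p_lime, priors)]
    evidence = sum(unnormalized)  # this sum is 1/alpha
    return [u / evidence for u in unnormalized]

post = posterior(10)
for i, p in enumerate(post, start=1):
    print(f"P(h{i} | d) = {p:.4f}")
```

With ten limes nearly all of the posterior mass sits on h5 and h4, matching the rounded values on the slide.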

Posterior Hypotheses

Predicting the Next Draw
$P(X \mid d) = \sum_i P(X \mid h_i, d)\, P(h_i \mid d) = \sum_i P(X \mid h_i)\, P(h_i \mid d)$
The second equality holds because X is independent of the data D given the hypothesis H (network: H → D, H → X).
i   P(X|h_i)   P(h_i|d)
1   0          0
2   0.25       0.00
3   0.5        0.00
4   0.75       0.10
5   1          0.90
Probability that the next candy drawn is a lime: $P(X \mid d) \approx 0.975$ with the rounded posteriors (about 0.973 without rounding)
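
As a rough illustration of the weighted sum above, the sketch below computes the Bayesian prediction after n limes in a row, for n = 0 to 10. The function name `predict_lime` is an assumption made for this example.

```python
priors = [0.1, 0.2, 0.4, 0.2, 0.1]
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]

def predict_lime(n_limes):
    """Bayesian P(next candy is lime | d), where d is n_limes limes in a row."""
    # Unnormalized posterior weights P(d | h_i) P(h_i).
    weights = [p ** n_limes * prior for p, prior in zip(p_lime, priors)]
    evidence = sum(weights)
    # Average each hypothesis's prediction, weighted by its posterior.
    return sum(p * w for p, w in zip(p_lime, weights)) / evidence

for n in range(11):
    print(n, round(predict_lime(n), 3))
```

Before any data the prediction is 0.5 (the prior is symmetric), and it climbs toward 1 as more limes are observed.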

P(Next Candy is Lime | d)

Other properties of Bayesian Estimation
Any learning technique trades off between good fit and hypothesis complexity.
The prior can penalize complex hypotheses.
–There are many more complex hypotheses than simple ones
–Ockham’s razor

Hypothesis Spaces often Intractable
A hypothesis is a joint probability table over state variables.
–$2^n$ entries => hypothesis space is $[0,1]^{2^n}$
–$2^{2^n}$ deterministic hypotheses; 6 boolean variables => $2^{64} \approx 1.8 \times 10^{19}$ hypotheses
Summing over hypotheses is expensive!

Some Common Simplifications
Maximum a posteriori estimation (MAP)
–$h_{MAP} = \arg\max_{h_i} P(h_i \mid d)$
–$P(X \mid d) \approx P(X \mid h_{MAP})$
Maximum likelihood estimation (ML)
–$h_{ML} = \arg\max_{h_i} P(d \mid h_i)$
–$P(X \mid d) \approx P(X \mid h_{ML})$
Both approximate the true Bayesian predictions as the amount of data grows large.

Maximum a Posteriori
$h_{MAP} = \arg\max_{h_i} P(h_i \mid d)$
$P(X \mid d) \approx P(X \mid h_{MAP})$
[Figure: $P(X \mid h_{MAP})$ and $P(X \mid d)$ as data accumulate; $h_{MAP}$ is h3, then h4, then h5]

Maximum a Posteriori
For large amounts of data, $P(\text{incorrect hypothesis} \mid d) \to 0$.
For small sample sizes, MAP predictions are “overconfident”.
[Figure: $P(X \mid h_{MAP})$ compared with $P(X \mid d)$]
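
To make the "overconfident" remark concrete, here is a small sketch that compares the MAP prediction with the full Bayesian prediction in the candy example, again assuming the data are n limes in a row. The helper name `compare` is illustrative only.

```python
priors = [0.1, 0.2, 0.4, 0.2, 0.1]
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]

def compare(n_limes):
    """Return (index of h_MAP, MAP prediction, Bayesian prediction) for lime."""
    weights = [p ** n_limes * prior for p, prior in zip(p_lime, priors)]
    total = sum(weights)
    post = [w / total for w in weights]
    # h_MAP is the single hypothesis with the largest posterior.
    i_map = max(range(len(post)), key=post.__getitem__)
    # The Bayesian prediction averages over all hypotheses instead.
    bayes = sum(p * q for p, q in zip(p_lime, post))
    return i_map + 1, p_lime[i_map], bayes

for n in (0, 1, 2, 3, 10):
    h, map_pred, bayes_pred = compare(n)
    print(f"n={n}: h_MAP=h{h}, MAP={map_pred:.3f}, Bayes={bayes_pred:.3f}")
```

After only three limes the MAP hypothesis is already h5, so the MAP prediction is exactly 1 while the Bayesian prediction is still noticeably lower; the two agree closely once more data arrive.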

Maximum Likelihood
$h_{ML} = h_{MAP}$ with a uniform prior
Relevance of the prior diminishes with more data.
Preferred by some statisticians
–Are priors “cheating”?
–What is a prior anyway?

Advantages of MAP and MLE over Bayesian estimation
Involves an optimization rather than a large summation
–Local search techniques
For some types of distributions, there are closed-form solutions that are easily computed.

Learning Coin Flips (Bernoulli distribution)
Let the unknown fraction of cherries be $\theta$.
Suppose draws are independent and identically distributed (i.i.d.).
Observe that c out of N draws are cherries.

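Maximum Likelihood
Likelihood of data $d = \{d_1, \dots, d_N\}$ given $\theta$:
–$P(d \mid \theta) = \prod_j P(d_j \mid \theta) = \theta^c (1-\theta)^{N-c}$
The first equality uses the i.i.d. assumption; the second gathers the c cherries together, then the N-c limes.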

Maximum Likelihood
Same as maximizing the log likelihood:
$L(d \mid \theta) = \log P(d \mid \theta) = c \log\theta + (N-c) \log(1-\theta)$
$\max_\theta L(d \mid \theta) \Rightarrow dL/d\theta = 0 \Rightarrow 0 = c/\theta - (N-c)/(1-\theta) \Rightarrow \theta = c/N$
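
The closed form θ = c/N can be checked numerically. The sketch below maximizes the log likelihood over a grid and compares the result with c/N; the counts (7 cherries out of 20) are hypothetical and chosen only for illustration.

```python
import math

def log_likelihood(theta, c, n):
    """Bernoulli log likelihood: c log(theta) + (n - c) log(1 - theta)."""
    return c * math.log(theta) + (n - c) * math.log(1 - theta)

def mle_closed_form(c, n):
    """Closed-form maximizer derived by setting dL/dtheta = 0."""
    return c / n

def mle_grid(c, n, steps=10_000):
    """Brute-force maximizer over a grid strictly inside (0, 1)."""
    grid = [k / steps for k in range(1, steps)]
    return max(grid, key=lambda t: log_likelihood(t, c, n))

c, n = 7, 20  # hypothetical counts: 7 cherries in 20 draws
print(mle_closed_form(c, n), mle_grid(c, n))
```

The grid search is only a sanity check; in practice the closed form is used directly.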

Maximum Likelihood for BN
For any BN, the ML parameters of any CPT can be derived from the fraction of observed values in the data.
Network: Earthquake → Alarm ← Burglar
Counts (N = 1000): E: 500, B: 200 => P(E) = 0.5, P(B) = 0.2
Alarm counts: A|E,B: 19/20; A|B: 188/200; A|E: 170/500; A|neither: 1/380
E  B  P(A|E,B)
T  T  0.95
F  T  0.94
T  F  0.34
F  F  0.003
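
Estimating a CPT by counting might look like the following sketch. The function `fit_cpt` and the toy records are assumptions made for this example; only the 19/20 count for A given E and B comes from the slide.

```python
from collections import Counter

def fit_cpt(records, child, parents):
    """ML estimate of P(child=True | parents) from a list of boolean records."""
    totals, positives = Counter(), Counter()
    for r in records:
        key = tuple(r[p] for p in parents)  # one CPT row per parent assignment
        totals[key] += 1
        positives[key] += r[child]          # True counts as 1, False as 0
    return {key: positives[key] / totals[key] for key in totals}

# Hypothetical toy data set (not the full data behind the slide).
records = (
    [{"E": True, "B": True, "A": True}] * 19
    + [{"E": True, "B": True, "A": False}] * 1
    + [{"E": False, "B": False, "A": True}] * 1
    + [{"E": False, "B": False, "A": False}] * 3
)
cpt = fit_cpt(records, "A", ["E", "B"])
print(cpt)
```

Parent assignments that never occur in the data get no row at all, which is one practical reason priors (virtual counts) are often added.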

Maximum Likelihood for Gaussian Models
Observe a continuous variable $x_1, \dots, x_N$.
Fit a Gaussian with mean $\mu$, std $\sigma$.
–Standard procedure: write the log likelihood $L = N(C - \log\sigma) - \sum_j (x_j - \mu)^2 / (2\sigma^2)$
–Set derivatives to zero

Maximum Likelihood for Gaussian Models
Observe a continuous variable $x_1, \dots, x_N$.
Results:
$\mu = \frac{1}{N} \sum_j x_j$ (sample mean)
$\sigma^2 = \frac{1}{N} \sum_j (x_j - \mu)^2$ (sample variance)
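
A direct transcription of these two results, as a minimal sketch; the data values are hypothetical. Note that the ML variance divides by N, not N-1.

```python
import math

def gaussian_mle(xs):
    """Return the ML estimates (mu, sigma) for i.i.d. Gaussian samples xs."""
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n  # divides by N (ML), not N - 1
    return mu, math.sqrt(var)

xs = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # hypothetical observations
mu, sigma = gaussian_mle(xs)
print(mu, sigma)
```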

Maximum Likelihood for Conditional Linear Gaussians
Y is a child of X (network: X → Y). Data: $(x_j, y_j)$.
X is Gaussian, Y is a linear Gaussian function of X
–$Y(x) \sim N(ax + b, \sigma)$
The ML estimate of a, b is given by least squares regression, and the ML estimate of $\sigma$ by the standard error of the residuals (their root-mean-square value).
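
The least squares fit can be sketched with the standard closed-form formulas for simple linear regression. The function name and the data below are hypothetical; the data lie exactly on y = 2x + 1, so the residual standard deviation comes out as zero.

```python
import math

def fit_linear_gaussian(xs, ys):
    """ML estimates (a, b, sigma) for the model Y ~ N(a*x + b, sigma)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx                 # least squares slope
    b = mean_y - a * mean_x       # least squares intercept
    residuals = [y - (a * x + b) for x, y in zip(xs, ys)]
    sigma = math.sqrt(sum(r * r for r in residuals) / n)  # RMS residual
    return a, b, sigma

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # hypothetical data on the line y = 2x + 1
a, b, sigma = fit_linear_gaussian(xs, ys)
print(a, b, sigma)
```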

Back to Coin Flips
What about Bayesian or MAP learning?
Motivation
–I pick a coin out of my pocket
–1 flip turns up heads
–What’s the MLE?

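Back to Coin Flips
Need some prior distribution $P(\theta)$.
$P(\theta \mid d) \propto P(d \mid \theta)\, P(\theta) = \theta^c (1-\theta)^{N-c} P(\theta)$
Define, for all $\theta$, the probability that I believe in $\theta$.
[Figure: prior density $P(\theta)$ plotted over $\theta$ from 0 to 1]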

MAP estimate
Could maximize $\theta^c (1-\theta)^{N-c} P(\theta)$ using some optimization.
It turns out that for some families of $P(\theta)$, the MAP estimate is easy to compute: Beta distributions (conjugate prior).
[Figure: Beta prior densities $P(\theta)$ plotted over $\theta$ from 0 to 1]

Beta Distribution
$\mathrm{Beta}_{a,b}(\theta) = \alpha\, \theta^{a-1} (1-\theta)^{b-1}$
–a, b hyperparameters
–$\alpha$ is a normalization constant
–Mean at a/(a+b)

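Posterior with Beta Prior
Posterior $\propto \theta^c (1-\theta)^{N-c} P(\theta) = \alpha\, \theta^{c+a-1} (1-\theta)^{N-c+b-1}$
Posterior is also a beta distribution, $\mathrm{Beta}_{c+a,\,N-c+b}$!
–See heads, increment a
–See tails, increment b
–Prior specifies a “virtual count” of a heads, b tails
Posterior mean: $(c+a)/(N+a+b)$
MAP estimate (the posterior mode, for $c+a > 1$ and $N-c+b > 1$): $\theta = (c+a-1)/(N+a+b-2)$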
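
The conjugate update is just count bookkeeping, as the sketch below shows for the one-flip coin example. The Beta(2, 2) prior is a hypothetical choice for illustration. The code reports both the posterior mean, (c+a)/(N+a+b), and the posterior mode, (c+a-1)/(N+a+b-2); the mode is the quantity that maximizes the posterior density.

```python
def beta_posterior(a, b, heads, tails):
    """Beta(a, b) prior plus observed counts gives a Beta(a + heads, b + tails) posterior."""
    return a + heads, b + tails

def beta_mean(a, b):
    return a / (a + b)

def beta_mode(a, b):
    """Mode of Beta(a, b); only an interior maximum when a > 1 and b > 1."""
    if a <= 1 or b <= 1:
        raise ValueError("mode formula requires a > 1 and b > 1")
    return (a - 1) / (a + b - 2)

# Hypothetical prior Beta(2, 2), then one flip that turns up heads.
a_post, b_post = beta_posterior(2, 2, heads=1, tails=0)
print("MLE:", 1 / 1)
print("posterior mean:", beta_mean(a_post, b_post))
print("posterior mode (MAP):", beta_mode(a_post, b_post))
```

Unlike the MLE, which jumps to 1 after a single head, both posterior summaries stay well inside (0, 1).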

Does this work in general?
Only specific distributions have the right type of prior
–Bernoulli, Poisson, geometric, Gaussian, exponential, …
Otherwise, MAP needs an (often expensive) numerical optimization.

How to deal with missing observations?
A very difficult statistical problem in general.
E.g., surveys
–Did the person leave political affiliation blank at random?
–Or do independents do this more often than someone with a strong affiliation?
The situation is better if a variable is completely hidden (never observed).

Expectation Maximization for Gaussian Mixture models
Each data point has a label saying which Gaussian it belongs to, but the label is a hidden variable.
Clustering: N Gaussian distributions
E step: compute the probability that each datapoint belongs to each Gaussian
M step: compute ML estimates of each Gaussian, weighted by the probability that each sample belongs to it
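
The E and M steps above can be sketched for a one-dimensional mixture as follows. This is an illustrative implementation under several assumptions not stated in the lecture: 1-D data, a fixed number of iterations, mixing weights estimated alongside the Gaussians, and a small floor on sigma to avoid collapse. The data and starting values are hypothetical.

```python
import math

def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def em_gmm_1d(xs, mus, sigmas, weights, iters=50):
    """Run EM on a 1-D Gaussian mixture; returns (mus, sigmas, weights)."""
    mus, sigmas, weights = list(mus), list(sigmas), list(weights)
    k = len(mus)
    for _ in range(iters):
        # E step: probability that each datapoint belongs to each Gaussian.
        resp = []
        for x in xs:
            p = [weights[j] * normal_pdf(x, mus[j], sigmas[j]) for j in range(k)]
            total = sum(p)
            resp.append([pj / total for pj in p])
        # M step: weighted ML estimates of each Gaussian.
        for j in range(k):
            nj = sum(r[j] for r in resp)  # effective number of points
            mus[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - mus[j]) ** 2 for r, x in zip(resp, xs)) / nj
            sigmas[j] = max(math.sqrt(var), 1e-3)  # assumed floor, avoids sigma -> 0
            weights[j] = nj / len(xs)
    return mus, sigmas, weights

# Hypothetical data: two well-separated clusters near -2 and 3.
xs = [-2.1, -1.9, -2.0, -2.2, -1.8, 3.0, 3.1, 2.9, 3.2, 2.8]
mus, sigmas, weights = em_gmm_1d(xs, [-1.0, 1.0], [1.0, 1.0], [0.5, 0.5])
print(mus, sigmas, weights)
```

With clusters this well separated the responsibilities become nearly hard assignments; on overlapping data the result depends on initialization.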

Learning HMMs
Want to find transition and observation probabilities.
Data: many sequences $\{O_{1:t}^{(j)} \text{ for } 1 \le j \le N\}$
Problem: we don’t observe the X’s!
[Figure: HMM with hidden chain X0 → X1 → X2 → X3 and observations O1, O2, O3]

Learning HMMs
[Figure: HMM with hidden chain X0 → X1 → X2 → X3 and observations O1, O2, O3]
Assume a stationary Markov chain with discrete states $x_1, \dots, x_m$.
Transition parameters $\theta_{ij} = P(X_{t+1} = x_j \mid X_t = x_i)$
Observation parameters $\phi_i = P(O \mid X_t = x_i)$

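Learning HMMs
Assume a stationary Markov chain with discrete states $x_1, \dots, x_m$.
Transition parameters $\theta_{ij} = P(X_{t+1} = x_j \mid X_t = x_i)$
Observation parameters $\phi_i = P(O \mid X_t = x_i)$
Initial states $\lambda_i = P(X_0 = x_i)$
[Figure: state-transition diagram over x1, x2, x3 with observation O; arcs labeled with transition parameters such as $\theta_{13}$ and $\theta_{31}$]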

Expectation Maximization
Initialize parameters randomly
E-step: infer expected probabilities of hidden variables over time, given current parameters
M-step: maximize likelihood of data over parameters
Parameters: $\theta = (\lambda_1, \lambda_2, \lambda_3, \theta_{11}, \theta_{12}, \dots, \theta_{32}, \theta_{33}, \phi_1, \phi_2, \phi_3)$, grouped as P(initial state), P(transition ij), P(emission)
[Figure: state-transition diagram over x1, x2, x3 with observation O]

Expectation Maximization
Initialize $\theta^{(0)}$
E: Compute $E[P(Z = z \mid \theta^{(0)}, O)]$
–Z: all combinations of hidden sequences (e.g., x1 x2 x3 x2)
–Result: probability distribution over hidden state at time t
M: compute $\theta^{(1)}$ = ML estimate of transition / obs. distributions
[Figure: state-transition diagram over x1, x2, x3 with observation O]

Expectation Maximization
[Same slide as the previous one, with one added annotation: “This is the hard part…”]

E-Step on HMMs
Computing expectations can be done by:
–Sampling
–Using the forward/backward algorithm on the unrolled HMM (R&N p. 546)
The latter gives the classic Baum-Welch algorithm.
Note that EM can still get stuck in local optima or even saddle points.
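
A compact sketch of one Baum-Welch iteration follows. It makes several simplifying assumptions that are not from the lecture: a single observation sequence, discrete observations coded as integers, no numerical scaling (so it only suits short sequences), and the common convention that the first hidden state emits the first observation (the slides instead start from an unobserved X0). All names and numbers are hypothetical.

```python
def forward_backward(obs, init, trans, emit):
    """E step. Returns (gamma, xi, likelihood) for one observation sequence.

    init[i] = P(first state = i); trans[i][j] = P(next = j | current = i);
    emit[i][o] = P(observation = o | state = i).
    """
    m, T = len(init), len(obs)
    alpha = [[0.0] * m for _ in range(T)]
    beta = [[1.0] * m for _ in range(T)]
    for i in range(m):
        alpha[0][i] = init[i] * emit[i][obs[0]]
    for t in range(1, T):
        for j in range(m):
            alpha[t][j] = emit[j][obs[t]] * sum(
                alpha[t - 1][i] * trans[i][j] for i in range(m))
    for t in range(T - 2, -1, -1):
        for i in range(m):
            beta[t][i] = sum(
                trans[i][j] * emit[j][obs[t + 1]] * beta[t + 1][j] for j in range(m))
    like = sum(alpha[T - 1])
    # gamma[t][i]: posterior probability of state i at time t.
    gamma = [[alpha[t][i] * beta[t][i] / like for i in range(m)] for t in range(T)]
    # xi[t][i][j]: posterior probability of the transition i -> j at time t.
    xi = [[[alpha[t][i] * trans[i][j] * emit[j][obs[t + 1]] * beta[t + 1][j] / like
            for j in range(m)] for i in range(m)] for t in range(T - 1)]
    return gamma, xi, like

def baum_welch_step(obs, init, trans, emit):
    """M step: re-estimate parameters from expected counts."""
    m, k = len(init), len(emit[0])
    gamma, xi, _ = forward_backward(obs, init, trans, emit)
    new_init = gamma[0][:]
    new_trans = [[sum(x[i][j] for x in xi) / sum(g[i] for g in gamma[:-1])
                  for j in range(m)] for i in range(m)]
    new_emit = [[sum(g[i] for g, o in zip(gamma, obs) if o == v)
                 / sum(g[i] for g in gamma)
                 for v in range(k)] for i in range(m)]
    return new_init, new_trans, new_emit

# Hypothetical 2-state, 2-symbol example.
obs = [0, 0, 1, 0, 1, 1, 0, 0]
init = [0.6, 0.4]
trans = [[0.7, 0.3], [0.4, 0.6]]
emit = [[0.9, 0.1], [0.2, 0.8]]
for step in range(5):
    print(step, forward_backward(obs, init, trans, emit)[2])
    init, trans, emit = baum_welch_step(obs, init, trans, emit)
```

The printed likelihood should not decrease from one iteration to the next, which is the usual EM guarantee; it says nothing about reaching a global optimum.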

Next Time: Machine learning
