# Lecture 5: Learning models using EM

Intro to Comp Genomics

Mixtures of Gaussians

We have experimental results of some value, and we want to describe the behavior of the experimental values: essentially one behavior? Two behaviors? More? In one dimension this may look very easy: just looking at the distribution gives us a good idea. If the data is multidimensional, the problem becomes non-trivial.

We can formulate the model probabilistically as a mixture of normal distributions. As a generative model: to generate data from the model, we first select the sub-model by sampling from the mixture variable. We then generate a value using the selected normal distribution.
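The generative view can be sketched in a few lines of Python. This is a minimal illustration, not code from the lecture: the function name `sample_mixture` and the parameter values are assumptions, and the spread parameters are treated as standard deviations.

```python
import random

def sample_mixture(n, weights, means, sds, seed=0):
    """Draw n (component, value) pairs from a 1-D mixture of normals.

    Assumption: `sds` holds standard deviations (not variances).
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        # Step 1: select the sub-model by sampling the mixture variable.
        j = rng.choices(range(len(weights)), weights=weights)[0]
        # Step 2: generate a value from the selected normal distribution.
        x = rng.gauss(means[j], sds[j])
        samples.append((j, x))
    return samples

# Hypothetical parameters: a broad component at 0 and a narrow one at 1.
data = sample_mixture(1000, [0.2, 0.8], [0.0, 1.0], [1.0, 0.2])
```

In real data only the values are observed; the component labels returned here are exactly the hidden variables that inference has to recover.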

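Inference

Let’s represent the model as a mixture of normal distributions with mixture coefficients $p_j$, means $m_j$ and spread parameters $s_j$:

$$\Pr(x) = \sum_j p_j \, N(x;\, m_j, s_j)$$

What is the inference problem in our model? Inference: computing the posterior probability of a hidden variable given the data and the model parameters. For p0=0.2, p1=0.8, m0=0, m1=1, s0=1, s1=0.2, what is Pr(s=0|0.8)?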
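The posterior in the question follows from Bayes' rule: weight each component's density at the observed value by its mixture coefficient, then normalize. The sketch below works this out numerically. It assumes s0 and s1 are standard deviations (the transcript does not say whether they are standard deviations or variances), and the function names are illustrative.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def posterior(x, weights, means, sds):
    """Posterior probability of each hidden component given the observation x."""
    joint = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sds)]
    total = sum(joint)  # this is Pr(x) under the mixture
    return [j / total for j in joint]

# Parameters from the slide; s0, s1 are ASSUMED to be standard deviations.
post = posterior(0.8, [0.2, 0.8], [0.0, 1.0], [1.0, 0.2])
print(post[0])  # Pr(s=0 | x=0.8)
```

Under that assumption the posterior of component 0 comes out at roughly 0.056: the narrow component centered at 1 explains the value 0.8 far better than the broad one centered at 0.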

Estimation/parameter learning

Given data, how can we estimate the model parameters? Transform it into an optimization problem! The likelihood is a function of the parameters, defined given the data. Finding the parameters that maximize the likelihood is the ML problem.

The problem can be approached heuristically, using any optimization technique, but it is a non-linear problem which may be very difficult. Generic optimization techniques include gradient ascent, simulated annealing, genetic algorithms, and more.
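To make "a function of the parameters, defined given the data" concrete, here is a small sketch of the mixture log-likelihood. The data values and the two parameter settings are made up for illustration, and the spread parameters are assumed to be standard deviations.

```python
import math

def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def log_likelihood(data, weights, means, sds):
    """Log-likelihood of the data under a 1-D mixture of normals."""
    total = 0.0
    for x in data:
        # Each sample contributes the log of its mixture density.
        total += math.log(sum(w * normal_pdf(x, m, s)
                              for w, m, s in zip(weights, means, sds)))
    return total

# Hypothetical data; the data is fixed, only the parameters vary.
data = [-0.1, 0.2, 0.9, 1.1, 1.0]
ll_good = log_likelihood(data, [0.2, 0.8], [0.0, 1.0], [1.0, 0.2])
ll_bad = log_likelihood(data, [0.2, 0.8], [5.0, 6.0], [1.0, 0.2])
print(ll_good > ll_bad)
```

Any generic optimizer can be pointed at `log_likelihood`; the difficulty is that this surface is non-linear in the parameters and can have many local optima.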

The EM algorithm for mixtures

We start by guessing parameters. We then go over the samples and compute their posteriors (i.e., inference). We use the posteriors to compute new estimates for the expected sufficient statistics of each distribution, and for the mixture coefficients. Continue iterating until convergence.

The EM theorem: the algorithm will converge and will improve the likelihood monotonically. But there is no guarantee of finding the optimum, or of finding anything meaningful. The initial conditions are critical: think of starting from m0=0, m1=10 with both s parameters equal to 1.

Solutions: start from “reasonable” solutions, and try many starting points.
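The iteration described above can be sketched as follows for a one-dimensional mixture. This is a minimal illustration under stated assumptions, not the lecture's own code: the spread parameters are treated as standard deviations, the synthetic data and starting point are invented, and the loop runs a fixed 50 iterations instead of testing for convergence.

```python
import math
import random

def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def log_likelihood(data, p, m, s):
    return sum(math.log(sum(pj * normal_pdf(x, mj, sj)
                            for pj, mj, sj in zip(p, m, s)))
               for x in data)

def em_step(data, p, m, s):
    """One EM iteration: posteriors (E-step), then re-estimation (M-step)."""
    k = len(p)
    # E-step: posterior of each component for each sample.
    post = []
    for x in data:
        joint = [p[j] * normal_pdf(x, m[j], s[j]) for j in range(k)]
        total = sum(joint)
        post.append([v / total for v in joint])
    # M-step: expected sufficient statistics give the new parameters.
    new_p, new_m, new_s = [], [], []
    for j in range(k):
        nj = sum(r[j] for r in post)  # expected number of samples from j
        mean = sum(r[j] * x for r, x in zip(post, data)) / nj
        var = sum(r[j] * (x - mean) ** 2 for r, x in zip(post, data)) / nj
        new_p.append(nj / len(data))
        new_m.append(mean)
        new_s.append(math.sqrt(var))
    return new_p, new_m, new_s

# Synthetic data (an assumption): 200 samples from N(0, 1), 800 from N(1, 0.2).
rng = random.Random(0)
data = ([rng.gauss(0.0, 1.0) for _ in range(200)]
        + [rng.gauss(1.0, 0.2) for _ in range(800)])

p, m, s = [0.5, 0.5], [-1.0, 2.0], [1.0, 1.0]  # initial guess
history = [log_likelihood(data, p, m, s)]
for _ in range(50):
    p, m, s = em_step(data, p, m, s)
    history.append(log_likelihood(data, p, m, s))
```

The values in `history` should never decrease, which is the monotonic improvement the EM theorem promises. Rerunning from a different initial guess is a cheap way to check whether the run ended in the same local optimum.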

Hidden Markov Models

We observe only the emissions of states into some probability space E (the emission space). Each state is equipped with an emission distribution (x a state, e an emission).

Caution! The state diagram is NOT the HMM Bayes net: 1. it has cycles; 2. its states are NOT random variables!

Simple example: Mixture with “memory”

We sample a sequence of dependent values. At each step, we decide whether to continue sampling from the same distribution or to switch, with probability p, to the other one (the two distributions are labeled A and B).

We can compute the probability directly only given the hidden variables. P(x) is derived by summing over all possible combinations of hidden variables. This is another form of the inference problem (why?). There is an exponential number of h assignments; can we still solve the problem efficiently?
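The naive computation of P(x) can be written down directly, which also makes the exponential cost visible. The sketch below assumes a uniform choice of the first distribution and normal emissions with standard deviations `sds`; neither detail is given in the transcript, and all names are illustrative.

```python
import itertools
import math

def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def joint_prob(xs, hs, p_switch, means, sds):
    """Pr(x, h): easy to compute once the hidden assignment h is given."""
    prob = 0.5  # ASSUMPTION: the first distribution is chosen uniformly
    for i, (x, h) in enumerate(zip(xs, hs)):
        if i > 0:
            # Switch with probability p_switch, otherwise stay.
            prob *= p_switch if h != hs[i - 1] else 1.0 - p_switch
        prob *= normal_pdf(x, means[h], sds[h])
    return prob

def marginal_brute_force(xs, p_switch, means, sds):
    """Pr(x): sum over all 2**len(xs) hidden assignments."""
    return sum(joint_prob(xs, hs, p_switch, means, sds)
               for hs in itertools.product((0, 1), repeat=len(xs)))

# Hypothetical observations and parameters.
xs = [0.1, -0.3, 0.9, 1.1, 1.0, 0.2]
means, sds = [0.0, 1.0], [1.0, 0.2]
px = marginal_brute_force(xs, 0.1, means, sds)
```

Six observations already require 64 terms, and every extra observation doubles the count, so this approach only serves as a reference for checking an efficient algorithm on short sequences.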

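Inference in HMM

Forward formula: the probability of the emissions up to position i together with being in state s at position i, computed from Start onward.

Backward formula: the probability of the emissions after position i given state s at position i, computed from Finish backward.

(Figure: trellis of Start, States, Finish and Emissions.)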
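The forward and backward recursions can be sketched for a small discrete HMM as below. The notation is an assumption, since the slide formulas did not survive: `init[s]` is the probability of starting in state s (standing in for the Start state), `trans[r][s]` the transition probability, and `emit[s][c]` the probability of state s emitting symbol c. There is no explicit Finish state in this sketch.

```python
import itertools

def forward(xs, init, trans, emit):
    """f[i][s] = Pr(x_0..x_i, state at position i is s)."""
    k = len(init)
    f = [[init[s] * emit[s][xs[0]] for s in range(k)]]
    for x in xs[1:]:
        f.append([emit[s][x] * sum(f[-1][r] * trans[r][s] for r in range(k))
                  for s in range(k)])
    return f

def backward(xs, trans, emit):
    """b[i][s] = Pr(x_{i+1}..x_{n-1} | state at position i is s)."""
    k = len(trans)
    b = [[1.0] * k]  # nothing left to emit after the last position
    for x in reversed(xs[1:]):
        b.insert(0, [sum(trans[s][r] * emit[r][x] * b[0][r] for r in range(k))
                     for s in range(k)])
    return b

def brute_force(xs, init, trans, emit):
    """Reference value: sum over every hidden path (exponential cost)."""
    total = 0.0
    for hs in itertools.product(range(len(init)), repeat=len(xs)):
        prob = init[hs[0]] * emit[hs[0]][xs[0]]
        for i in range(1, len(xs)):
            prob *= trans[hs[i - 1]][hs[i]] * emit[hs[i]][xs[i]]
        total += prob
    return total

# Hypothetical two-state model over the symbols 0 and 1.
init = [0.5, 0.5]
trans = [[0.9, 0.1], [0.2, 0.8]]
emit = [[0.7, 0.3], [0.1, 0.9]]
xs = [0, 1, 1, 0, 1]

f = forward(xs, init, trans, emit)
b = backward(xs, trans, emit)
px_forward = sum(f[-1])
px_backward = sum(init[s] * emit[s][xs[0]] * b[0][s] for s in range(2))
```

Both recursions give the same P(x) as the brute-force sum, but their cost grows linearly with the sequence length (times the square of the number of states) instead of exponentially.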


EM for HMMs

(Figure: trellis of Start, States, Finish and Emissions.)

The posterior probability for emitting the i’th character from state s? The posterior probability for a transition from s’ to s after character i? With multiple sequences, assume independence (accumulate statistics).

Claim: HMM EM monotonically improves the likelihood.
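Both posteriors can be read off the forward and backward tables, and summing them over positions and sequences gives the expected counts for re-estimation. The sketch below is one way to write this for a discrete HMM; the names `gamma` and `xi`, the initial-distribution parameterization and the toy model are assumptions, not the lecture's notation.

```python
import math

def forward(xs, init, trans, emit):
    k = len(init)
    f = [[init[s] * emit[s][xs[0]] for s in range(k)]]
    for x in xs[1:]:
        f.append([emit[s][x] * sum(f[-1][r] * trans[r][s] for r in range(k))
                  for s in range(k)])
    return f

def backward(xs, trans, emit):
    k = len(trans)
    b = [[1.0] * k]
    for x in reversed(xs[1:]):
        b.insert(0, [sum(trans[s][r] * emit[r][x] * b[0][r] for r in range(k))
                     for s in range(k)])
    return b

def posteriors(xs, init, trans, emit):
    """gamma[i][s]: character i emitted from state s, given the whole sequence.
    xi[i][r][s]: transition from r to s after character i, given the sequence."""
    k = len(init)
    f, b = forward(xs, init, trans, emit), backward(xs, trans, emit)
    px = sum(f[-1])
    gamma = [[f[i][s] * b[i][s] / px for s in range(k)] for i in range(len(xs))]
    xi = [[[f[i][r] * trans[r][s] * emit[s][xs[i + 1]] * b[i + 1][s] / px
            for s in range(k)] for r in range(k)] for i in range(len(xs) - 1)]
    return gamma, xi, px

def em_step(seqs, init, trans, emit):
    """One EM iteration; statistics are accumulated over independent sequences."""
    k, n_sym = len(init), len(emit[0])
    init_c = [0.0] * k
    trans_c = [[0.0] * k for _ in range(k)]
    emit_c = [[0.0] * n_sym for _ in range(k)]
    for xs in seqs:
        gamma, xi, _ = posteriors(xs, init, trans, emit)
        for s in range(k):
            init_c[s] += gamma[0][s]
            for i, x in enumerate(xs):
                emit_c[s][x] += gamma[i][s]
            for t in range(k):
                trans_c[s][t] += sum(step[s][t] for step in xi)

    def normalize(row):
        return [v / sum(row) for v in row]

    return normalize(init_c), [normalize(r) for r in trans_c], [normalize(r) for r in emit_c]

def log_likelihood(seqs, init, trans, emit):
    return sum(math.log(sum(forward(xs, init, trans, emit)[-1])) for xs in seqs)

# Hypothetical model and data.
init = [0.5, 0.5]
trans = [[0.9, 0.1], [0.2, 0.8]]
emit = [[0.7, 0.3], [0.1, 0.9]]
seqs = [[0, 1, 1, 0, 1], [1, 1, 1, 0, 0, 0], [0, 0, 1]]

before = log_likelihood(seqs, init, trans, emit)
new_init, new_trans, new_emit = em_step(seqs, init, trans, emit)
after = log_likelihood(seqs, new_init, new_trans, new_emit)
```

Comparing `before` and `after` is a direct check of the claim: one re-estimation step should not lower the log-likelihood.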

The EM theorem for mixtures simplified

Assume that we know which distribution generated each sample (samples Si generated from distribution i). We want to maximize the model’s likelihood, given this extra information. The mixture coefficients and the distribution parameters can be solved separately. For the mixture coefficients this is the “multinomial estimator”: solve using Lagrange multipliers.

The EM theorem for mixtures simplified

Assume that we know which distribution generated each sample (samples Si generated from distribution i). We want to maximize the model’s likelihood, given this extra information. Solve separately for each distribution, using the normal distribution estimator: it uses the observed sufficient statistics (the normal distribution is an exponential family). We found the global optimum of the likelihood in the case of full data.
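With full data the maximum-likelihood estimates have a closed form: the mixture coefficient of a distribution is its fraction of the samples, and its mean and variance are the sample mean and the (uncorrected) sample variance. A minimal sketch, with an illustrative function name and toy data:

```python
import math

def ml_full_data(samples_by_component):
    """ML estimates when each sample's generating distribution is known.

    Returns one (weight, mean, standard deviation) triple per distribution.
    """
    n = sum(len(xs) for xs in samples_by_component)
    params = []
    for xs in samples_by_component:
        weight = len(xs) / n                               # multinomial estimator
        mean = sum(xs) / len(xs)                           # sufficient statistic: sum of x
        var = sum((x - mean) ** 2 for x in xs) / len(xs)   # ML variance divides by n
        params.append((weight, mean, math.sqrt(var)))
    return params

# Toy data: S0 = {0, 2}, S1 = {5, 5, 7, 7}.
params = ml_full_data([[0.0, 2.0], [5.0, 5.0, 7.0, 7.0]])
```

No iteration is needed here; each distribution is estimated on its own samples only.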

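The EM theorem for mixtures simplified

Assume now that each sample i is known to be from distribution j with probability Pij. We can write down the same expression, with each sample weighted by Pij, and the same maximization holds: solve separately. In the EM algorithm we used the posteriors computed from the current parameters as the Pij.

Deriving the EM formula: in this case Q depends on the current parameters, so we write it as a function of both the new and the current parameters. What is missing? Q is not L!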

Expectation-Maximization (Dempster)

The derivation on the slide is annotated with two steps: relative entropy >= 0, and the EM maximization.
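The slide's equations are not preserved in this transcript. The following is the standard form of the argument, written in assumed notation: x is the data, h the hidden variables, and θ^k the current parameters. It shows where the two annotated steps enter.

```latex
\begin{align*}
Q(\theta \mid \theta^k) &= \sum_h \Pr(h \mid x, \theta^k)\,\log \Pr(h, x \mid \theta) \\
\log \Pr(x \mid \theta) &= \log \Pr(h, x \mid \theta) - \log \Pr(h \mid x, \theta)
  \quad \text{for every } h \\
\log \Pr(x \mid \theta) &= Q(\theta \mid \theta^k)
  - \sum_h \Pr(h \mid x, \theta^k)\,\log \Pr(h \mid x, \theta) \\
\log \Pr(x \mid \theta) - \log \Pr(x \mid \theta^k)
  &= Q(\theta \mid \theta^k) - Q(\theta^k \mid \theta^k)
  + \sum_h \Pr(h \mid x, \theta^k)\,
    \log \frac{\Pr(h \mid x, \theta^k)}{\Pr(h \mid x, \theta)} \\
  &\ge Q(\theta \mid \theta^k) - Q(\theta^k \mid \theta^k)
\end{align*}
```

The third line averages the second over Pr(h | x, θ^k). The last sum is a relative entropy, hence non-negative, which gives the inequality. Choosing θ to maximize Q(θ | θ^k) makes the right-hand side non-negative, so the likelihood cannot decrease.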

KL-divergence

Entropy (Shannon): $H(P) = -\sum_x P(x) \log P(x)$

Kullback-Leibler divergence: $D_{KL}(P \,\|\, Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}$

Not a metric!!
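A quick numerical check shows one reason the KL divergence is not a metric: it is not symmetric. The sketch below uses base-2 logarithms and two made-up distributions over two outcomes.

```python
import math

def entropy(p):
    """Shannon entropy in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in bits.

    Assumes q[i] > 0 wherever p[i] > 0.
    """
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]
q = [0.9, 0.1]
d_pq = kl_divergence(p, q)
d_qp = kl_divergence(q, p)
print(d_pq, d_qp)  # the two directions give different values
```

The divergence of a distribution from itself is zero and the divergence is never negative, but swapping the arguments changes the value.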

Bayesian learning vs. Maximum likelihood

Maximum likelihood estimator: no prior beliefs.

Bayesian learning: introducing prior beliefs on the process (alternatively: think of virtual evidence), and computing posterior probabilities on the parameters.

(Figure: beliefs over the parameter space, with the MLE, MAP and PME estimates marked.)
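A coin-flip example, which is an illustration chosen here and not part of the lecture, shows how the estimates differ. With a Beta(a, b) prior on the probability of heads, the prior acts like virtual evidence added to the observed counts. PME is read here as the posterior mean estimate, which is an assumption about the slide's abbreviation.

```python
def mle(heads, n):
    """Maximum likelihood estimate: no prior beliefs."""
    return heads / n

def map_estimate(heads, n, a, b):
    """Mode of the Beta(a + heads, b + n - heads) posterior (needs a, b > 1)."""
    return (heads + a - 1) / (n + a + b - 2)

def posterior_mean(heads, n, a, b):
    """Mean of the Beta(a + heads, b + n - heads) posterior."""
    return (heads + a) / (n + a + b)

# 3 heads in 10 flips, with a hypothetical Beta(2, 2) prior.
print(mle(3, 10), map_estimate(3, 10, 2, 2), posterior_mean(3, 10, 2, 2))
```

With a uniform prior (a = b = 1) the MAP estimate coincides with the MLE, and as the number of flips grows all three estimates approach each other.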

Your Task

Preparations: background on ChIP-seq; CTCF and PolII; modeling ChIP-seq, binning.

Your Task

Preparations: Get your hands on the ChIP-seq profiles of CTCF and PolII in hg chr17, bin size = 50bp. Cut the data into segments of 50,000 data points.

Modeling: Use EM to build a probabilistic model for the peak signals and the background. Use heuristics for peak finding to initialize the EM. The model uses k states for the peak and one state for the background (states in the diagram: S, P1, P2, P3, ..., B, F). Use K=40.

Analysis: Test if your model for a single peak structure is as good as the model for two peak structures. Compute the distribution of peaks relative to transcription start sites.
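The preparation step can be sketched as two small helpers. The input format is an assumption: the sketch takes a list of 0-based read positions on the chromosome, which may not match how the profiles are actually distributed, and the function names are illustrative.

```python
def bin_counts(positions, bin_size=50, length=None):
    """Count reads per fixed-size bin along the chromosome.

    ASSUMPTION: `positions` are 0-based coordinates of individual reads.
    """
    if length is None:
        length = max(positions) + 1
    n_bins = (length + bin_size - 1) // bin_size  # round up
    counts = [0] * n_bins
    for pos in positions:
        counts[pos // bin_size] += 1
    return counts

def segments(values, seg_len=50_000):
    """Cut a profile into consecutive segments of seg_len data points.

    The last segment may be shorter.
    """
    return [values[i:i + seg_len] for i in range(0, len(values), seg_len)]

# Toy example with tiny sizes so the result is easy to inspect.
profile = bin_counts([0, 10, 49, 50, 120], bin_size=50)
chunks = segments(list(range(7)), seg_len=3)
```

Each segment can then be treated as one independent emission sequence when accumulating EM statistics.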