1
Hidden Markov Model
Special case of a dynamic Bayesian network
  Single (hidden) state variable
  Single (observed) observation variable
Transition probability P(S’|S) assumed to be sparse
  Usually encoded by a state transition graph
[Figure: network G (nodes S, S’, O’); initial network G_0 (nodes S_0, O_0); unrolled network (nodes S_0, S_1, O_1, S_2, O_2, S_3, O_3)]

2
Hidden Markov Model
State transition representation of P(S’|S) (rows: S, columns: S’)
        s1    s2    s3    s4
  s1    0.2   0.8   0     0
  s2    0     0     1     0
  s3    0.4   0     0     0.6
  s4    0     0.5   0     [missing]
[Figure: state transition graph over states S_1, S_2, S_3, S_4]

3
Joint Probability Distribution
[Figure: unrolled network with nodes S_0, S_1, O_1, S_2, O_2, S_3, O_3]
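The equation on this slide did not survive in the transcript. For the unrolled network shown (a hidden chain starting at S_0, with one observation O_i per state S_i for i >= 1), the standard HMM factorization would read as follows. Writing it for general n rather than n = 3 is an assumption, chosen to match the later inference slides.

```latex
P(S_0, S_1, \dots, S_n, O_1, \dots, O_n)
  = P(S_0) \prod_{i=1}^{n} P(S_i \mid S_{i-1}) \, P(O_i \mid S_i)
```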

4
Exact Inference: Variable Elimination
Inference in a simple chain X_1 -> X_2 -> X_3
Computing P(X_2)
  All the numbers for this computation are in the CPDs of the original Bayesian network
  O(|X_1||X_2|) operations

5
Exact Inference: Variable Elimination
Inference in a simple chain X_1 -> X_2 -> X_3
Computing P(X_2)
Computing P(X_3)
  P(X_3|X_2) is a given CPD
  P(X_2) was computed above
  O(|X_1||X_2| + |X_2||X_3|) operations

6
Exact Inference: Variable Elimination
Inference in a general chain X_1 -> X_2 -> X_3 -> ... -> X_n
Computing P(X_n)
  Compute each P(X_{i+1}) from P(X_i)
  k^2 operations for each computation (assuming |X_i| = k)
  O(nk^2) operations for the inference
  Compare to the k^n operations required to sum over all possible entries in the joint distribution over X_1, ..., X_n
Inference in a general chain can be done in linear time!
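The chain computation above can be sketched in a few lines of Python. This is a minimal illustration, not code from the course: the function name `chain_marginals` and the nested-list layout of the CPDs are assumptions made for the example.

```python
def chain_marginals(prior, cpds):
    """Return [P(X_1), P(X_2), ..., P(X_n)] for a chain X_1 -> ... -> X_n.

    prior: list with prior[a] = P(X_1 = a)
    cpds:  list of n-1 tables; cpds[i][a][b] = P(X_{i+2} = b | X_{i+1} = a)
    (hypothetical data layout, chosen for this sketch)
    """
    marginals = [list(prior)]
    for cpd in cpds:
        prev = marginals[-1]
        # P(X_{i+1} = b) = sum_a P(X_i = a) * P(X_{i+1} = b | X_i = a)
        # This costs |X_i| * |X_{i+1}| multiplications, i.e. k^2 per step.
        nxt = [sum(prev[a] * cpd[a][b] for a in range(len(prev)))
               for b in range(len(cpd[0]))]
        marginals.append(nxt)
    return marginals


# Toy chain with k = 2 values per variable and n = 3 variables.
prior = [0.6, 0.4]
cpd = [[0.7, 0.3], [0.2, 0.8]]
marginals = chain_marginals(prior, [cpd, cpd])
```

Each pass through the loop reuses the previous marginal, which is why the total cost grows as n·k^2 instead of k^n.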

7
Exact Inference: Variable Elimination
Chain X_1 -> X_2 -> X_3 -> X_4
Pushing summations = dynamic programming
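The slide's equation is not in the transcript. For the four-variable chain, pushing the summations inward would look like the following standard rearrangement, given here as a worked illustration and not as the slide's exact notation.

```latex
P(X_4) = \sum_{x_1} \sum_{x_2} \sum_{x_3}
           P(x_1) \, P(x_2 \mid x_1) \, P(x_3 \mid x_2) \, P(X_4 \mid x_3)
       = \sum_{x_3} P(X_4 \mid x_3)
         \sum_{x_2} P(x_3 \mid x_2)
         \sum_{x_1} P(x_2 \mid x_1) \, P(x_1)
```

The innermost sum is P(x_2), the next one is P(x_3), and each intermediate result is computed once and reused, which is the dynamic programming view.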

8
Inference
[Figure: unrolled network with nodes S_0, S_1, O_1, S_2, O_2, S_3, O_3]
Computing P(S_i)

9
Inference
Computing P(S_i)

10
Inference: Forward-Backward Algorithm
Computing P(S_i | O_1, ..., O_n)
  The posterior is the product of a forward term and a backward term, divided by a normalization factor

11
Computing the Forward Step
Define: [equation not preserved]
Initialization: [equation not preserved]
Induction step: [equation not preserved]
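The slide's formulas are not in the transcript. A standard formulation of the forward step, consistent with the unrolled network (S_0 without an observation, O_i attached to S_i for i >= 1), would be the following. The symbol \alpha and the lowercase o_i for observed values are assumptions introduced here.

```latex
\text{Define:} \quad
  \alpha_i(s) = P(S_i = s, \, o_1, \dots, o_i)

\text{Initialization:} \quad
  \alpha_0(s) = P(S_0 = s)

\text{Induction step:} \quad
  \alpha_{i+1}(s') = P(o_{i+1} \mid S_{i+1} = s')
                     \sum_{s} \alpha_i(s) \, P(S_{i+1} = s' \mid S_i = s)
```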

12
Computing the Backward Step
Define: [equation not preserved]
Initialization: [equation not preserved]
Induction step: [equation not preserved]
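As with the forward step, the slide's formulas are missing from the transcript. A standard formulation of the backward step would be the following, where the symbol \beta and the lowercase o_i for observed values are assumptions introduced here.

```latex
\text{Define:} \quad
  \beta_i(s) = P(o_{i+1}, \dots, o_n \mid S_i = s)

\text{Initialization:} \quad
  \beta_n(s) = 1

\text{Induction step:} \quad
  \beta_i(s) = \sum_{s'} P(S_{i+1} = s' \mid S_i = s) \,
               P(o_{i+1} \mid S_{i+1} = s') \, \beta_{i+1}(s')
```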

13
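Computing Evidence Probability
Since: [equation not preserved]
Then: [equation not preserved]
Since: [equation not preserved]
Then: [equation not preserved]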
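The forward step, backward step, and evidence probability can be put together in one short Python sketch. It assumes the usual definitions (forward term P(S_i, o_1..o_i), backward term P(o_{i+1}..o_n | S_i)), since the slide equations are not in the transcript; the function name and the nested-list tables are hypothetical. It works in plain probabilities, so long sequences would need scaling or log-space.

```python
def forward_backward(init, trans, emit, obs):
    """Return (posteriors, evidence) for an HMM with states 0..k-1.

    init[s]     = P(S_0 = s)
    trans[s][t] = P(S_{i+1} = t | S_i = s)
    emit[s][o]  = P(O_i = o | S_i = s)
    obs         = [o_1, ..., o_n]; S_0 has no observation (assumption)
    posteriors[i][s] = P(S_i = s | o_1, ..., o_n) for i = 0..n
    """
    k, n = len(init), len(obs)

    # Forward pass: alpha[i][s] = P(S_i = s, o_1..o_i)
    alpha = [list(init)]
    for o in obs:
        prev = alpha[-1]
        alpha.append([emit[t][o] * sum(prev[s] * trans[s][t] for s in range(k))
                      for t in range(k)])

    # Backward pass: beta[i][s] = P(o_{i+1}..o_n | S_i = s), with beta[n][s] = 1
    beta = [[1.0] * k for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        o = obs[i]  # this is o_{i+1}
        beta[i] = [sum(trans[s][t] * emit[t][o] * beta[i + 1][t] for t in range(k))
                   for s in range(k)]

    # Evidence probability P(o_1..o_n): sum the last forward vector
    evidence = sum(alpha[n])

    posteriors = [[alpha[i][s] * beta[i][s] / evidence for s in range(k)]
                  for i in range(n + 1)]
    return posteriors, evidence


# Toy two-state example with binary observations.
init = [0.5, 0.5]
trans = [[0.9, 0.1], [0.2, 0.8]]
emit = [[0.8, 0.2], [0.3, 0.7]]
obs = [0, 1, 0]
posteriors, evidence = forward_backward(init, trans, emit, obs)
```

The same evidence value is obtained from sum over s of alpha[i][s] * beta[i][s] at any position i, which is a convenient consistency check.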

14
Assignment 3
Part 1: Constructing and evaluating a nucleosome probability model
  Model 1: zero-order Markov model
  Model 2: first-order Markov model
Both models have two components:
  P_N: position-dependent distribution over nucleotides
  P_L: position-independent distribution over nucleotides
  P = P_N / P_L

15
Assignment 3
P_N:
  Markov order 0: [equation not preserved]
  Markov order 1: [equation not preserved]
Estimating P_N
  Create an alignment from all nucleosome reads and the reverse complement of each read
  Estimate P_{N,i} from counts in the data
  Example for Markov order 1: [equation not preserved]
    where #(S_k=i|S_{k-1}=j) is the number of times that the nucleotide at position k in the alignment is i AND the nucleotide at position k-1 in the alignment is j
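The counting procedure for the order-1 case might be sketched as below. Because the slide's formula is missing, two details are assumptions: the counts are normalized over i for each fixed position k and predecessor j, and a pseudocount (default 1.0) is added to avoid zero probabilities. The function names are hypothetical.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
ALPHABET = "ACGT"


def reverse_complement(seq):
    """Reverse the sequence and complement every base."""
    return "".join(COMPLEMENT[b] for b in reversed(seq))


def estimate_pn_order1(reads, pseudocount=1.0):
    """Return pn with pn[k][j][i] ~ P_N(S_k = i | S_{k-1} = j).

    reads: equal-length strings over ACGT (the rows of the alignment).
    pseudocount: smoothing value; an assumption, not stated on the slides.
                 It must be > 0 here, otherwise unseen contexts divide by zero.
    pn[0] is only a uniform placeholder, because position 0 has no
    predecessor and needs an order-0 estimate instead.
    """
    # The alignment holds every read plus its reverse complement.
    aligned = list(reads) + [reverse_complement(r) for r in reads]
    length = len(aligned[0])
    counts = [{j: {i: pseudocount for i in ALPHABET} for j in ALPHABET}
              for _ in range(length)]
    for seq in aligned:
        for k in range(1, length):
            counts[k][seq[k - 1]][seq[k]] += 1  # this is #(S_k=i | S_{k-1}=j)
    pn = []
    for k in range(length):
        pn.append({j: {i: counts[k][j][i] / sum(counts[k][j].values())
                       for i in ALPHABET}
                   for j in ALPHABET})
    return pn


# Tiny example with 3bp "reads" instead of 147bp ones.
pn = estimate_pn_order1(["ACG", "ACG"])
```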

16
Assignment 3
P_L:
  Markov order 0: [equation not preserved]
  Markov order 1: [equation not preserved]
Estimating P_L
  For Markov order 0: compute the average number of reads that cover each of the 4 possible nucleotides in the genome
  For Markov order 1: compute the average number of reads that cover each of the 16 possible dinucleotides in the genome
  Estimate P_L from counts in the data
  Example for Markov order 1: [equation not preserved]
    where A(S_k=i|S_{k-1}=j) is the average coverage of the dinucleotide i,j, computed as explained above

17
Assignment 3
Evaluating the model
  Construct the model in a cross-validation scheme, i.e., create it using only the data of chromosomes 1-8
  Test the model (order 0 & 1) on the held-out chromosomes
    Compute the log-likelihood of all held-out nucleosome reads (work in log-space!)
    Compare to the log-likelihood of a random selection of sequences from the genome
    Compare to the log-likelihood of permutations of the sequences
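A minimal sketch of the log-space scoring for the order-0 model follows. It assumes the per-read score is the ratio P = P_N / P_L from the model definition, so each position contributes log P_N minus log P_L; whether the assignment intends the ratio or P_N alone is not stated. The helper names and dictionary layout are hypothetical.

```python
import math
import random


def log_score_order0(seq, pn, pl):
    """Sum over positions k of log P_N,k(s_k) - log P_L(s_k).

    pn: list of dicts, pn[k][base] = position-dependent probability
    pl: dict, pl[base] = position-independent probability
    """
    return sum(math.log(pn[k][b]) - math.log(pl[b]) for k, b in enumerate(seq))


def total_log_score(reads, pn, pl):
    """Log-score of a set of reads: a sum of logs instead of a product."""
    return sum(log_score_order0(r, pn, pl) for r in reads)


def permuted(seq, rng):
    """Shuffle the bases of one sequence for the permutation baseline."""
    bases = list(seq)
    rng.shuffle(bases)
    return "".join(bases)


# Toy 2bp model: the same distribution at both positions.
dist = {"A": 0.5, "C": 0.25, "G": 0.125, "T": 0.125}
pn = [dist, dist]
pl = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
score = log_score_order0("AC", pn, pl)
baseline = permuted("ACGT", random.Random(0))
```

Summing logs avoids the underflow that multiplying 147 small probabilities per read would cause.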

18
Assignment 3
Evaluating the model (cont.)
  Test the model (order 0 & 1) on the held-out chromosomes
  Create an ROC evaluation
    Select a threshold t, equal to the average number of reads per basepair in the genome
    Define ‘positive’ regions as maximal contiguous regions in which every basepair is above t. Remove regions whose size is <50bp
    Define ‘negative’ regions as maximal contiguous regions in which every basepair is below t. Remove regions whose size is <50bp
    Use the model to score each region as the average score of the basepairs it contains, where the score of each basepair is the average of all 147 scores that cover that basepair
    Create an ROC curve using these positive and negative regions. This is done by ranking all regions according to the model scores (above) and plotting, at each rank, the false positive rate (x-axis) vs. the true positive rate (y-axis)
    Compute the AUC (area under the curve)
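The region definition and the ROC/AUC computation described above could be sketched as follows. Only those two steps are covered; scoring regions with the model is left out. The function names are hypothetical, and tied scores are ranked in arbitrary order, which is a simplification.

```python
def find_regions(coverage, t, min_len=50):
    """Split per-basepair read coverage into positive and negative regions.

    Returns (positives, negatives) as lists of half-open (start, end) intervals.
    A region is a maximal run strictly above t (positive) or strictly below t
    (negative); runs shorter than min_len are dropped.
    """
    def label(x):
        return 1 if x > t else (-1 if x < t else 0)

    positives, negatives = [], []
    start, n = 0, len(coverage)
    while start < n:
        lab = label(coverage[start])
        end = start + 1
        while end < n and label(coverage[end]) == lab:
            end += 1
        if lab != 0 and end - start >= min_len:
            (positives if lab == 1 else negatives).append((start, end))
        start = end
    return positives, negatives


def roc_points(pos_scores, neg_scores):
    """Rank all regions by score (highest first); emit (FPR, TPR) at each rank."""
    ranked = sorted([(s, 1) for s in pos_scores] + [(s, 0) for s in neg_scores],
                    key=lambda item: -item[0])
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, is_positive in ranked:
        tp += is_positive
        fp += 1 - is_positive
        points.append((fp / len(neg_scores), tp / len(pos_scores)))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoid rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


coverage = [5] * 60 + [0] * 10 + [5] * 20 + [0] * 55
positives, negatives = find_regions(coverage, t=2)
area = auc(roc_points([0.9, 0.8, 0.3], [0.5, 0.2]))
```

In the toy example the short 10bp and 20bp runs are discarded by the 50bp filter, leaving one positive and one negative region.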

19
Assignment 3
Use the model in an HMM framework and compute the average nucleosome occupancy at each basepair
Easiest to view as a generalized HMM with two states
  S_i=0: no nucleosome starts at position i
  S_i=1: nucleosome starts at position i
Notes
  Emission probability given S=1 is taken from the nucleosome model
  Emission probability given S=0 is uniform over all basepairs
  Placing a nucleosome ‘emits’ 147 basepairs
  Implement a uniform non-normalized transition probability between the two states, i.e., W(S=0)=1, W(S=1)=1
Compute P(S_i=0|O) and P(S_i=1|O) for every basepair
Compute the average occupancy at each basepair as: [equation not preserved]
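One way to realize the generalized HMM is a forward-backward pass in which state 0 emits one basepair and state 1 emits a whole nucleosome. The sketch below makes several assumptions: `start_scores[i]` is whatever probability the nucleosome model assigns to the 147bp window starting at i, the uniform emission for S=0 is 0.25 per basepair, and, since the slide's final formula is missing, the occupancy of a basepair is taken as the sum of P(S_j=1|O) over the start positions j whose nucleosome covers it. It uses plain probabilities, so genome-scale input would need log-space or scaling.

```python
def nucleosome_occupancy(start_scores, n, nuc_len=147,
                         linker_prob=0.25, w0=1.0, w1=1.0):
    """Return (p_start, occupancy), each a list of length n.

    start_scores[i]: model probability of the nuc_len bases starting at i,
                     for i = 0 .. n - nuc_len (hypothetical input layout)
    p_start[i]   = P(S_i = 1 | O); P(S_i = 0 | O) is 1 - p_start[i]
    occupancy[p] = probability that basepair p is covered by a nucleosome
    """
    # fwd[i]: total weight of all ways to explain the first i basepairs
    fwd = [0.0] * (n + 1)
    fwd[0] = 1.0
    for i in range(1, n + 1):
        fwd[i] = fwd[i - 1] * w0 * linker_prob  # last basepair is linker
        if i >= nuc_len:  # or a nucleosome ends exactly here
            fwd[i] += fwd[i - nuc_len] * w1 * start_scores[i - nuc_len]

    # bwd[i]: total weight of all ways to explain basepairs i .. n-1
    bwd = [0.0] * (n + 1)
    bwd[n] = 1.0
    for i in range(n - 1, -1, -1):
        bwd[i] = w0 * linker_prob * bwd[i + 1]
        if i + nuc_len <= n:
            bwd[i] += w1 * start_scores[i] * bwd[i + nuc_len]

    total = fwd[n]  # weight of all configurations (normalization factor)
    p_start = [fwd[i] * w1 * start_scores[i] * bwd[i + nuc_len] / total
               if i + nuc_len <= n else 0.0
               for i in range(n)]
    # A nucleosome starting at j covers basepairs j .. j + nuc_len - 1.
    occupancy = [sum(p_start[max(0, p - nuc_len + 1): p + 1]) for p in range(n)]
    return p_start, occupancy


# Toy check with 3bp "nucleosomes" on a 3bp sequence: the single nucleosome
# placement has the same weight as the all-linker configuration (0.25 ** 3).
p_start, occupancy = nucleosome_occupancy([0.25 ** 3], n=3, nuc_len=3)
```

In the toy case the two configurations are equally weighted, so every basepair is covered with probability one half.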

20
Assignment 3
Evaluating the HMM model
  Generate a plot of the average occupancy of the real data and of the model predictions over a 2000bp region of your choice
  Perform the same ROC analysis as with the previous model, except that the scores of the positive and negative regions are now computed as the average nucleosome occupancy of those regions according to your genome-wide computation
