Generalizations of Markov model to characterize biological sequences

1 Generalizations of Markov model to characterize biological sequences
Authors: Junwen Wang and Sridhar Hannenhalli
CISC841: Bioinformatics
Presented by: Nikhil Shirude
November 20, 2007

2 Outline
Motivation
Model Implementation
- Training
- Testing
Results
Challenges
Conclusion

3 Motivation
Markov model – a statistical technique for modeling sequences in which the probability of a sequence element depends on a limited context preceding that element
Current kth-order Markov model – generates a single base (model unit size = 1) according to a probability distribution conditioned on the k bases immediately preceding the generated base (gap = 0)
Used in DNA sequence recognition problems such as promoter and gene prediction

4 Motivation cont'd
Longer-range dependencies and joint dependencies of neighboring bases have been observed in protein and DNA sequences
The CG di-nucleotide characterizes CpG islands, so a model with unit size 2 is appropriate to capture this joint dependency
Longer-range dependencies (gap > 0) are useful for modeling the periodicity of the helix pattern

5 Model Implementation
Generalized Markov Model (GMM) – a configurable tool that allows for these generalizations
Posterior bases – bases whose probability is to be computed
Prior bases – bases upon which that probability is conditioned
Six parameters specify the Markov model (next slide)
Other parameters include the type of biological sequence, the threshold on the minimum prior count for k-mer elimination, and the pseudo-count for k-mers absent from the training set

6 Model Implementation cont'd
[Figure: the prior consists of O units U1 … UO, each of length L1 and separated by spacing g1; after a gap G, the posterior consists of bases X1 … XL2 separated by spacing g2]
Parameters:
- L1 – model unit size in the prior
- O – order, i.e., the number of prior units
- g1 – spacing between prior units
- L2 – model unit size in the posterior
- g2 – spacing between posterior bases
- G – gap between prior and posterior

7 Model Implementation cont'd
Examples:
- A gap of length 2 within the posterior of an amino-acid model captures the joint dependency between the first and fourth residues, which are likely to form a hydrogen bond vital to the protein helix structure
- For a model in which each tri-nucleotide depends on the previous 4 bases, the parameters can be set as L1=4, O=1, L2=3, g1=g2=G=0 (see the configuration sketch below)
- To use the 4 bases after ignoring the immediately preceding 3 bases, set G=3
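A minimal sketch of how such configurations could be expressed in code; the GMMConfig class and its field names are illustrative rather than the tool's actual interface, and the window-size formula is the one given on the training slides below.

```python
from dataclasses import dataclass

@dataclass
class GMMConfig:
    L1: int  # model unit size in the prior
    O: int   # order: number of prior units
    g1: int  # spacing between prior units
    L2: int  # model unit size in the posterior
    g2: int  # spacing between posterior bases
    G: int   # gap between prior and posterior

    def window_size(self) -> int:
        # Window size = L1*O + g1*(O-1) + G + L2 + g2*(L2-1)
        return (self.L1 * self.O + self.g1 * (self.O - 1) + self.G
                + self.L2 + self.g2 * (self.L2 - 1))

# Each tri-nucleotide depends on the previous 4 bases:
tri_on_4 = GMMConfig(L1=4, O=1, g1=0, L2=3, g2=0, G=0)
# Same model, but skipping the 3 bases immediately preceding the posterior:
tri_on_4_gapped = GMMConfig(L1=4, O=1, g1=0, L2=3, g2=0, G=3)
```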

8 Training
K-mer – a specific nucleic-acid or amino-acid sequence of length k that can be used to identify particular regions within biomolecules such as DNA or proteins
For statistical robustness, only k-mers above a certain frequency threshold in the positive sequences are considered
For the current model, the default frequency threshold for positive sequences is 300
For nucleosome sequences, the default frequency threshold is 50 because of the smaller data set

9 Training cont'd
Slide a window one base at a time along the training sequence; the window size is determined by the user-defined parameters
For each window, extract the words corresponding to the prior and the posterior
Window size = L1*O + g1*(O-1) + G + L2 + g2*(L2-1)
Example: with the user-defined parameters L1=1, O=6, L2=2, g1=0, G=1, g2=1, the window size is 10
For the window ACTGATGCAG, the prior is ACTGAT and the di-nucleotide CG is the posterior (one base is skipped after the prior and one base within the posterior); see the extraction sketch below
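A small sketch, assuming a Python-style re-implementation rather than the authors' code, of how the prior and posterior words could be pulled out of one window:

```python
# Illustrative only: extract the prior and posterior words from one window.
def extract_words(window, L1, O, g1, L2, g2, G):
    # Prior: O units of length L1, separated by g1 bases.
    prior, pos = "", 0
    for _ in range(O):
        prior += window[pos:pos + L1]
        pos += L1 + g1
    pos -= g1          # no spacing after the last prior unit
    pos += G           # gap between prior and posterior
    # Posterior: L2 bases, separated by g2 bases.
    posterior = ""
    for _ in range(L2):
        posterior += window[pos]
        pos += 1 + g2
    return prior, posterior

# Example from the slide: window ACTGATGCAG with L1=1, O=6, g1=0, L2=2, g2=1, G=1
print(extract_words("ACTGATGCAG", L1=1, O=6, g1=0, L2=2, g2=1, G=1))
# -> ('ACTGAT', 'CG')
```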

10 Training cont'd
Increment the k-mer counts: ACTGATCG (6th order), CTGATCG (5th order), …, down to CG (0th order); a counting sketch follows below
Thus 7 sub-models are present, one for each order
After processing the training sequences, calculate the transition probabilities from the k-mer counts
- for the 0th order, the probability is simply the composition of the L2-mers
- for higher orders, compute the sum of frequencies of all k-mers of that form (e.g., for the 4th-order k-mer TGATCG, compute the sum of frequencies of all hexamers of the form TGAT**)
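A sketch of how the per-order counts might be accumulated; the table layout and function name are illustrative, not taken from the paper.

```python
from collections import defaultdict

ORDER = 6
counts = [defaultdict(int) for _ in range(ORDER + 1)]  # one table per sub-model

def update_counts(prior, posterior):
    # Keeping k prior bases (k = 0 .. ORDER) gives the k-th-order sub-model.
    for k in range(len(prior) + 1):
        kmer = prior[len(prior) - k:] + posterior
        counts[k][kmer] += 1

# Example from the slides: prior "ACTGAT", posterior "CG" increments
# ACTGATCG (6th order), CTGATCG (5th order), ..., CG (0th order).
update_counts("ACTGAT", "CG")
```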

11 Training cont'd
- if (sum > threshold), calculate the probability by dividing the count of that k-mer form by the sum
- else the program automatically uses the (k-1)-mer
Finally, convert the probability for each k-mer into a log-odds score
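A hedged sketch of this step, again assuming a Python re-implementation; the slide does not specify the background used for the log-odds, so scoring against a negative-set model and the exact handling of the pseudo-count are assumptions here.

```python
import math

THRESHOLD = 300      # default minimum prior count (50 for nucleosome data)
PSEUDOCOUNT = 1e-3   # placeholder value for the tool's pseudo-count parameter

def probability(counts, order, prior, posterior):
    """Back off to lower orders until the prior context is frequent enough."""
    kmer = prior[len(prior) - order:] + posterior
    context = kmer[:len(kmer) - len(posterior)]
    # Sum of frequencies of all k-mers with the same prior context (e.g. TGAT**).
    total = sum(c for k, c in counts[order].items() if k.startswith(context))
    if order > 0 and total <= THRESHOLD:
        return probability(counts, order - 1, prior, posterior)
    return (counts[order].get(kmer, 0) + PSEUDOCOUNT) / (total + PSEUDOCOUNT)

def log_odds(pos_counts, neg_counts, order, prior, posterior):
    # Assumed: log ratio of the positive-model and negative-model probabilities.
    return math.log(probability(pos_counts, order, prior, posterior) /
                    probability(neg_counts, order, prior, posterior))
```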

12 Testing
The program reads the model, i.e., the mapping from k-mers to log-odds scores
Scoring proceeds in the same sliding-window fashion
- to score a window, first consider the highest order
- if the string exists in the model, use its score
- else look for the string corresponding to a lower order
The sequence score is obtained by adding all the window scores
Example: to score ACTGATGCAG, first look up the 6th-order dependence, i.e., ACTGATCG, in the 8-mer table; if it is absent, look for the 5th order, and so on down to the 0th order (a scoring sketch follows)
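An illustrative scoring loop; it reuses the hypothetical extract_words and GMMConfig sketches from earlier, and model[k] is assumed to map order-k k-mers to their log-odds scores.

```python
def score_sequence(seq, model, config, max_order):
    # model[k]: dict mapping order-k k-mers to log-odds scores.
    total = 0.0
    w = config.window_size()
    for start in range(len(seq) - w + 1):
        window = seq[start:start + w]
        prior, posterior = extract_words(window, config.L1, config.O, config.g1,
                                         config.L2, config.g2, config.G)
        for order in range(max_order, -1, -1):   # highest order first
            kmer = prior[len(prior) - order:] + posterior
            if kmer in model[order]:
                total += model[order][kmer]      # use this sub-model's score
                break
    return total
```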

13 Results
Tested on:
- Human promoter sequences
  - CpG-poor promoters
  - All promoters
- Human exon data set
- Nucleosome positioning sequences

14 Model Evaluation
10-fold cross-validation was used to train and test the models
Sequences were partitioned into 10 equal parts; each part was tested after training on the other 9 parts
Once the models were trained, scores were calculated on the training set using the models
A cutoff was obtained from the specificity-sensitivity curve: choose the score cutoff that gives the best correlation coefficient (CC) on the training set

15 Model Evaluation cont'd
Score the independent test set and apply this cutoff to obtain the CC values
Calculate the mean and standard deviation over the 10 CC values
Sensitivity: Sn = TP / (TP + FN)
Specificity: Sp = TP / (TP + FP)
Correlation coefficient: CC = (TP*TN - FP*FN) / sqrt((TP+FP)*(TP+FN)*(TN+FP)*(TN+FN))
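The three metrics written out as defined on this slide (an illustrative Python helper, not part of the tool):

```python
import math

def sensitivity(tp, fp, tn, fn):
    return tp / (tp + fn)

def specificity(tp, fp, tn, fn):
    # Note: the slide defines specificity as TP / (TP + FP).
    return tp / (tp + fp)

def correlation_coefficient(tp, fp, tn, fn):
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den
```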

16 Model Evaluation cont'd
The total number of prior bases is 6 for all three models
Classification accuracy for the three sequence classes was tested using the following configurations (also shown as code below):
- 6th-order single-nucleotide model: L1 = L2 = 1, O = 6, g1 = G = g2 = 0
- 3rd-order di-nucleotide model: L1 = L2 = 2, O = 3, g1 = G = g2 = 0
- 2nd-order tri-nucleotide model: L1 = L2 = 3, O = 2, g1 = G = g2 = 0
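The same three configurations written with the hypothetical GMMConfig sketch from the model-implementation section:

```python
# All three models use 6 prior bases in total.
single_nt = GMMConfig(L1=1, O=6, g1=0, L2=1, g2=0, G=0)  # 6th-order single-nucleotide
di_nt     = GMMConfig(L1=2, O=3, g1=0, L2=2, g2=0, G=0)  # 3rd-order di-nucleotide
tri_nt    = GMMConfig(L1=3, O=2, g1=0, L2=3, g2=0, G=0)  # 2nd-order tri-nucleotide
```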

17 Model Evaluation cont'd
Classification of CpG-poor promoters (mean CC ± s.d. over the 10 folds):

Sample (size)               | Single nucleotide | Di-nucleotide | Tri-nucleotide
CpG-poor promoters (1,466)  | 0.24 ± 0.05       | 0.28 ± 0.03   | 0.34 ± 0.04

18 Model Evaluation cont'd
Classification of all promoters:

Sample (size)            | Single nucleotide | Di-nucleotide | Tri-nucleotide
All promoters (12,333)   | 0.54 ± 0.02       | 0.54 ± 0.03   | 0.56 ± 0.02

Classification of exons:

Sample (size)            | Single nucleotide | Di-nucleotide | Tri-nucleotide
All exons (219,624)      | 0.63 ± 0.00       | 0.64 ± 0.00   | 0.67 ± 0.00

19 Model Evaluation cont'd
Classification of nucleosome positioning sequences (112 sequences)
Best classification accuracy at G = 4, 15 and 25
Worst classification accuracy at G = 7 and 18

20 Model Evaluation cont'd
Run-time comparison for the three models:
- Training time for the single-nucleotide model was 55.8 minutes
- Training time was reduced to 23.8 minutes for the di-nucleotide model and to 18.9 minutes for the tri-nucleotide model
- Testing time was reduced from 22.9 minutes to 15.4 and 14.0 minutes for the di-nucleotide and tri-nucleotide models, respectively

21 Conclusion
A configurable tool to explore generalizations of Markov models that incorporate joint and long-range dependencies of sequence elements
Evaluation was performed on 4 classes of sequences
Two special cases, the di-nucleotide model and the tri-nucleotide model, were compared against the traditional single-nucleotide model
The evaluation shows improved classification accuracy for the di- and tri-nucleotide models
The software also runs faster for the di- and tri-nucleotide models

22 Thank You!

