Hidden Markov Models for Sequence Analysis 4

Presentation transcript:

BINF6201/8201 Hidden Markov Models for Sequence Analysis 4, 11-29-2011

Choice of model topology The structure (topology) and the parameters together determine an HMM. The parameters of an HMM can be estimated with the Baum-Welch algorithm and other optimization methods, whereas the design of the topology of an HMM is based on an understanding of the problem and of the data available to solve it.

Profile HMM for sequence families Profile HMMs are a special type of HMM used to model multiple alignments of protein families, capturing both matches and indels. Once a profile HMM is constructed for a protein family, it can be used to evaluate whether a new sequence belongs to the family (the scoring problem). The most probable path of a sequence through the model can be used to align the sequence to the members of the family (the decoding problem).

Profile HMM for sequence families Given a block of an ungapped multiple alignment of a protein family, we can model the block with a linear chain of states Begin → M1 → … → Mj → … → Mn → End. Here, Mj corresponds to the ungapped column j of the alignment and is called a match state; Mj emits amino acid b_i with probability e_j(b_i). The transition probability between two adjacent match states M_{j-1} and M_j is 1, i.e., a_{(j-1)j} = 1, because each match state can transit only to the next match state, not to any other state or back to itself.

Profile HMM for sequence families Since we know the path of a sequence generated by this model, the probability that a sequence x = x1…xn is generated by the model is P(x|M) = e_1(x_1) e_2(x_2) … e_n(x_n). To make this probability more meaningful, we compare it with a background probability: the probability that the sequence x is generated by a random model R is P(x|R) = q_{x_1} q_{x_2} … q_{x_n}, where q_b is the background frequency of amino acid b. The log-odds ratio is S(x) = Σ_j log( e_j(x_j) / q_{x_j} ), which is essentially a position-specific scoring/weight matrix (PSSM). Therefore, this HMM is equivalent to a PSSM, and scoring the sequence x with the PSSM of the block is more sensitive than using a general-purpose scoring matrix such as PAM or BLOSUM in a pairwise alignment.
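To make the PSSM interpretation concrete, here is a minimal Python sketch (not part of the original slides) that scores a sequence against a small block model; the emission table `match_emissions`, the background frequencies, and the four-letter alphabet are made-up toy values used only for illustration.

```python
import math

# Toy example: an ungapped block of length 3 over a four-letter alphabet.
# match_emissions[j][b] = e_j(b); background[b] = q_b. Values are illustrative only.
match_emissions = [
    {"A": 0.6, "C": 0.1, "G": 0.2, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.2, "C": 0.2, "G": 0.5, "T": 0.1},
]
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

def pssm_score(seq):
    """Log-odds score S(x) = sum_j log( e_j(x_j) / q_{x_j} )."""
    assert len(seq) == len(match_emissions)
    return sum(math.log(match_emissions[j][b] / background[b])
               for j, b in enumerate(seq))

print(pssm_score("ACG"))  # sequence matching the block consensus: positive score
print(pssm_score("TTT"))  # sequence unlike the block: negative score
```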

Profile HMM for sequence families To model insertions after the match state Mj, we add an insertion state Ij to the model (Begin → M1 → … → Mj → Ij → Mj+1 → … → Mn → End). In this case, Mj can transit to the next match state Mj+1 or to Ij, and Ij can move on to Mj+1 or remain in Ij. Ij emits an amino acid b with probability e_{Ij}(b), which is usually set to the background frequency q_b of that amino acid. The log-odds score for generating an insertion of length k (shown below) is equivalent to an affine gap penalty, but it is position dependent and therefore more accurate.
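Following the usual profile-HMM treatment (a reconstruction of the missing formula, not copied from the slide), the log-odds score of a length-k insertion after M_j is:

```latex
S_{\mathrm{insert}}(k)
  \;=\; \log a_{M_j I_j} \;+\; (k-1)\,\log a_{I_j I_j} \;+\; \log a_{I_j M_{j+1}}
  \;+\; \sum_{i=1}^{k} \log\frac{e_{I_j}(b_i)}{q_{b_i}} .
```

When e_{I_j}(b) = q_b the emission sum vanishes; the remaining terms act as a gap-open cost (log a_{M_j I_j} + log a_{I_j M_{j+1}}) plus a per-residue gap-extension cost ((k-1) log a_{I_j I_j}), which is why the score has the form of an affine gap penalty.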

Profile HMM for sequence families To model deletions at some match states, we add a deletion state Dj at each position j, in parallel with the match states, so that a run of k deletions follows the path M_{j-1} → D_j → D_{j+1} → … → D_{j+k-1} → M_{j+k}. The deletion state Dj does not emit any symbol, so it is called a silent state. The penalty score for a deletion of length k starting at position j (shown below) is in general not equivalent to an affine penalty function, but again it is position dependent.
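A sketch of the corresponding deletion score, reconstructed in the same way from the transitions along the chain of silent states:

```latex
S_{\mathrm{delete}}(k)
  \;=\; \log a_{M_{j-1} D_j}
  \;+\; \sum_{i=j}^{j+k-2} \log a_{D_i D_{i+1}}
  \;+\; \log a_{D_{j+k-1} M_{j+k}} .
```

Because the D→D transition probabilities can differ at every position, this sum generally does not reduce to an affine penalty.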

Profile HMM for sequence families The complete profile HMM combines the match, insertion, and deletion states. If transitions between insertion and deletion states are not allowed, the model is simpler; leaving them out has little effect on scoring sequences, but it may cause problems when training the model. The full profile model therefore also allows transitions between insertion and deletion states. In this model each Mj, Dj, and Ij has three outgoing transitions, except at the last position, where each state has only two.
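To make the bookkeeping concrete, here is a small Python sketch (my own illustration, not from the slides) that enumerates the allowed transitions of a profile HMM with n match states, including the insert-delete transitions; the state labels "M1", "I0", "D1", etc. are simply names chosen here.

```python
def profile_hmm_transitions(n):
    """List the allowed transitions of a profile HMM with n match states.

    States: Begin (treated as M0), M1..Mn, I0..In, D1..Dn, End.
    Each Mj, Ij, Dj has three outgoing transitions, except at the last
    position (j = n), where D_{n+1} does not exist and only two remain.
    """
    def name(kind, j):
        if kind == "M" and j == 0:
            return "Begin"
        if kind == "M" and j == n + 1:
            return "End"
        return f"{kind}{j}"

    transitions = []
    for j in range(n + 1):                      # columns 0..n
        sources = [name("M", j), name("I", j)]
        if j >= 1:
            sources.append(name("D", j))        # D0 does not exist
        for src in sources:
            transitions.append((src, name("I", j)))      # open/extend an insertion
            transitions.append((src, name("M", j + 1)))  # advance to next match (or End)
            if j + 1 <= n:
                transitions.append((src, name("D", j + 1)))  # advance to next deletion
    return transitions

print(len(profile_hmm_transitions(8)))  # 75 allowed transitions for a length-8 model
```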

Derive profile HMMs from multiple alignments Given a multiple alignment of a protein family, we first determine how many match states should be used to model the family. A general rule is to treat columns in which fewer than 50% of the sequences have a deletion as match states. For the segment of a multiple alignment of hemoglobin proteins shown on the slide, this rule gives a model with eight match states.
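As an illustration of the 50% rule, here is a small Python sketch; the aligned block used here is the well-known globin example from Durbin et al. (which may or may not be the exact segment shown on the slide), and it does yield eight match columns.

```python
# Toy aligned block; '-' denotes a gap. Columns in which fewer than 50% of
# the sequences have a gap are treated as match states.
alignment = [
    "VGA--HAGEY",
    "V----NVDEV",
    "VEA--DVAGH",
    "VKG------D",
    "VYS--TYETS",
    "FNA--NIPKH",
    "IAGADNGAGV",
]

def match_columns(aln, threshold=0.5):
    n_seqs = len(aln)
    cols = []
    for j in range(len(aln[0])):
        gap_fraction = sum(seq[j] == "-" for seq in aln) / n_seqs
        if gap_fraction < threshold:
            cols.append(j)
    return cols

print(match_columns(alignment))  # eight match columns: [0, 1, 2, 5, 6, 7, 8, 9]
```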

Derive profile HMMs from multiple alignments Based on the general design of profile HMMs, this segment of the alignment is modeled by an HMM of length 8. From the alignment we know the path of each sequence, so the transition and emission probabilities can be estimated directly from the observed counts by the general formula shown below.
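The general formula referred to above is the standard count-based (maximum-likelihood) estimate, with A_{kl} the number of observed transitions from state k to state l and E_k(b) the number of times state k emits symbol b:

```latex
a_{kl} \;=\; \frac{A_{kl}}{\sum_{l'} A_{kl'}} ,
\qquad
e_k(b) \;=\; \frac{E_k(b)}{\sum_{b'} E_k(b')} .
```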

Pseudocounts When counting events, to avoid zero probabilities, we usually add pseudocounts to the observed counts. The simplest way is to add one to each count; this is called Laplace's rule. A slightly more sophisticated method is to add a quantity proportional to the background frequency: if we imagine adding A extra sequences to the alignment, we expect A q_b of them to have amino acid b at the position, where q_b is the background frequency of amino acid b in the alignment. The resulting emission probabilities under both schemes are written out below.
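Written out explicitly (a reconstruction consistent with the text, for the 20-letter amino-acid alphabet), the two pseudocount schemes give:

```latex
\text{Laplace's rule:}\qquad
e_{M_k}(b) \;=\; \frac{E_k(b) + 1}{\sum_{b'} E_k(b') + 20} ,
\qquad\qquad
\text{background pseudocounts:}\qquad
e_{M_k}(b) \;=\; \frac{E_k(b) + A\,q_b}{\sum_{b'} E_k(b') + A} .
```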

Dirichlet prior distribution Adding pseudocounts means adding our prior knowledge to the counts; it is equivalent to computing the posterior probability of the theoretical value of e_{M_k}(b) after we have seen n counts of amino acid b out of a total of K counts at that position. To see this, we need a little mathematical derivation. Consider the frequencies θ_1, …, θ_20 of the 20 amino acids in a column of an alignment; these 20 frequencies sum to 1. They change from column to column, so they are random variables, and we model them with a Dirichlet distribution (written out below), where Z is a normalization factor and a_1, …, a_20 are the parameters that determine the shape of the distribution. It can be shown that the mean of θ_i is a_i / (a_1 + … + a_20).
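The Dirichlet density referred to above, written with θ_1, …, θ_20 for the column frequencies:

```latex
\mathcal{D}(\theta_1,\dots,\theta_{20}\mid a_1,\dots,a_{20})
  \;=\; \frac{1}{Z}\prod_{i=1}^{20}\theta_i^{\,a_i-1},
\qquad \sum_{i=1}^{20}\theta_i = 1,
\qquad
\mathbb{E}[\theta_i] \;=\; \frac{a_i}{\sum_{j=1}^{20} a_j} .
```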

Dirichlet prior distribution If we set a_i = A q_i, we obtain a Dirichlet prior distribution whose mean for θ_i is q_i. Therefore, if we do not know the frequencies of the 20 amino acids in a column, we can use such a Dirichlet distribution as a prior on these frequencies, with the average frequency of amino acid i equal to q_i. Although the parameter A does not affect the mean frequencies, it does affect the shape of the distribution. To see the effect of A on the shape of the Dirichlet, consider only one type of amino acid (e.g., the acidic amino acids) with random frequency θ and mean frequency q; the combined frequency of all other amino acids is then 1 − θ, with mean 1 − q.

Dirichlet prior distribution The Dirichlet (here, Beta) distribution of this frequency is P(θ) = θ^{Aq−1} (1 − θ)^{A(1−q)−1} / Z. When the mean frequency of this type of amino acid is q = 0.05, changing A changes the shape of the distribution (shown on the slide). Although the mean of θ is the same in each case, the larger the value of A, the narrower the distribution. In general, when we have high confidence in q we use a large A value; otherwise we should use a small A value.

Dirichlet prior distribution Now consider the posterior distribution after observing data, using this Dirichlet prior. Let K be the total number of observed amino acids in a column, of which n are of the type we are considering. The likelihood of this observation is given by a binomial distribution, and multiplying it by the prior and normalizing gives the posterior distribution, as worked out below.
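Filling in the algebra for this one-frequency case (prior with parameters Aq and A(1 − q), binomial likelihood):

```latex
P(\theta) \;\propto\; \theta^{\,Aq-1}(1-\theta)^{\,A(1-q)-1},
\qquad
P(n\mid\theta) \;=\; \binom{K}{n}\,\theta^{\,n}(1-\theta)^{\,K-n},
```
```latex
P(\theta\mid n) \;\propto\; P(n\mid\theta)\,P(\theta)
  \;\propto\; \theta^{\,n+Aq-1}(1-\theta)^{\,K-n+A(1-q)-1},
\qquad
\mathbb{E}[\theta\mid n] \;=\; \frac{n+Aq}{K+A} .
```

So the posterior is again of Dirichlet (Beta) form, with updated parameters n + Aq and K − n + A(1 − q).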

Dirichlet prior distribution Therefore the posterior also follows a Dirichlet distribution, but with different parameters. The mean of the posterior distribution of θ is (n + Aq) / (K + A). When K is large, adding the prior term Aq has little effect on the estimate, but when K is small, the effect can be large. This justifies using pseudocounts A q_b to estimate the posterior frequency of amino acid b. The figure on the slide shows the posterior distribution p(θ|n) when the prior mean is q = 0.05 but the true frequency is 0.5.

Application of profile HMMs Once a profile HMM is constructed for a protein family, it can be used to score a new sequence. The sequence can also be aligned to the family using the path decoded by the Viterbi algorithm or by the forward and backward algorithms. Two popular profile HMM tools are freely available online: HMMER (http://hmmer.janelia.org/), developed by Sean Eddy and colleagues in the early 1990s, which provides tools for building an HMM from a multiple alignment and for searching an HMM database, and which is associated with the Pfam protein family database at the same site; and SAM (http://compbio.soe.ucsc.edu/sam.html), the first profile HMM toolkit, developed by David Haussler, Anders Krogh, and colleagues in the early 1990s.