A retrospective look at our models. First we saw the finite state automaton. The rigid, non-stochastic nature of these structures ultimately limited their usefulness to us as models of DNA. [Figure: a finite state automaton with start state S that accepts sequences such as "ggggcgctc".] The finite state automaton above is non-deterministic, but is NOT probabilistic: sequences under test are either elements of the set of accepted sequences, or are rejected.

Markov Model. What do we need to probabilistically model DNA sequences? States and transition probabilities. [Figure: a four-state Markov chain over the nucleotides A, C, G, and T, with transition probabilities on the edges.] Here each observation necessarily corresponded to the underlying biological state, which also limited the usefulness and generality of the concept as a model for DNA or amino acid sequences.
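As a concrete aside (not from the slides), such a chain is naturally stored as a "dict-of-dicts" of transition probabilities, the same representation our FastA class will produce at the end of this deck. A minimal sketch in Python; the probabilities are invented placeholders, not estimates from real data:

    import random

    # First-order Markov chain over DNA: one row of transition
    # probabilities per current state.  Values are illustrative only.
    transitions = {
        'A': {'A': 0.30, 'C': 0.20, 'G': 0.30, 'T': 0.20},
        'C': {'A': 0.25, 'C': 0.25, 'G': 0.25, 'T': 0.25},
        'G': {'A': 0.20, 'C': 0.30, 'G': 0.30, 'T': 0.20},
        'T': {'A': 0.25, 'C': 0.25, 'G': 0.25, 'T': 0.25},
    }

    def sample_sequence(start, length):
        """Generate a sequence by walking the chain from `start`."""
        seq = [start]
        for _ in range(length - 1):
            row = transitions[seq[-1]]
            seq.append(random.choices(list(row), weights=row.values())[0])
        return ''.join(seq)

    print(sample_sequence('A', 20))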

HMM Topology. Many different HMM topologies are possible. [Figure: example topologies built from a start state S and additional states such as R.] Careful selection of alternative HMM topologies will allow us to address a variety of different problems using essentially the same algorithmic toolkit.

HMM Topology: duration modeling with HMMs. Consider our CpG island model with two underlying states, CpG+ and CpG-. [Figure: start state S feeding the two-state CpG+/CpG- model.] How long does our model dwell in a particular state? If p is the probability of re-entering the state, then P(L residues) = (1-p)·p^(L-1). This should be familiar as an exponentially decaying value.
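A quick numerical check of this geometric dwell-time distribution (a sketch, not from the slides):

    # Dwell time in a state with self-transition probability p:
    #   P(L) = (1 - p) * p**(L - 1)
    # The mean dwell length is 1 / (1 - p), so p near 1 means long stays.
    def dwell_prob(L, p):
        return (1 - p) * p ** (L - 1)

    p = 0.9  # illustrative value
    print([round(dwell_prob(L, p), 4) for L in range(1, 6)])
    print("mean dwell length:", 1 / (1 - p))  # 10.0 for p = 0.9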

HMM Topology: duration modeling with HMMs. What if this doesn't accurately reflect the real distribution of lengths? [Figure: the same CpG+/CpG- model, alongside a plot of P(L residues) = (1-p)·p^(L-1) against L.]

HMM Topology: duration modeling with HMMs. Assuming that each state shown here has the same emission frequencies, this is an example of a submodel topology that guarantees a minimum of four symbols drawn from the same underlying emission distribution, but with a length distribution that is still geometric; only the duration of the model has been tweaked. There are many more complex possibilities, depending on the properties of the length distribution you seek to model. Note: this and numerous other figures that follow in this presentation are after Durbin et al.

HMM Topology: duration modeling with HMMs. The topology shown in this variation can model any desired distribution of lengths varying between 2 and 6.

HMM Topology: duration modeling with HMMs. Without belaboring the derivation, we'll just mention that this negative binomial distribution offers a great deal of flexibility in modeling length distributions. [Figure: n chained states, each with re-entry probability p and exit probability 1-p, where n = number of repetitive states; here n = 4.] For this topology, P(L residues) = C(L-1, n-1)·p^(L-n)·(1-p)^n.
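A sketch of that calculation in Python (the value of p is chosen purely for illustration):

    from math import comb

    # Negative binomial length distribution for n chained states, each
    # re-entered with probability p.  Emitting L symbols requires L - n
    # self-loops (prob p each) and n exits (prob 1 - p each); the last
    # move must be an exit, giving C(L-1, n-1) possible arrangements.
    def length_prob(L, n, p):
        if L < n:
            return 0.0
        return comb(L - 1, n - 1) * p ** (L - n) * (1 - p) ** n

    # n = 4 repeated states as in the figure; p = 0.5 for illustration
    print([round(length_prob(L, 4, 0.5), 4) for L in range(4, 10)])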

HMM Topology: duration modeling with HMMs. By selecting appropriate values for the number of repeating states and the probability of re-entering each state, you can flexibly model the length distribution. (Reference: math_toolkit/dist/dist_ref/dists/negative_binomial_dist.html)

HMM Topology: duration modeling with HMMs. In practical HMMs we would probably use n much smaller than in these examples, but with a probability of state re-entry much closer to 1.

HMM Topology. A trivial HMM is equivalent to a position-specific scoring matrix (PSSM). [Figure: a linear chain of match states M_1, M_2, …, M_L following the start state S.] The transitions are deterministic, with Pr{a_Mi→Mi+1} = 1, but the emissions correspond to the estimated amino acid or nucleotide frequencies in the columns of a PSSM. We refer collectively to M_1…M_L as match states. This simple model cannot account for indels and does not capture all of the information present in a multiple alignment.
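Because every transition has probability 1, scoring a sequence against this trivial HMM reduces to summing per-column emission log-odds, which is exactly a PSSM lookup. A sketch with invented emission values:

    import math

    background = {'A': 0.25, 'C': 0.25, 'G': 0.25, 'T': 0.25}
    match_emissions = [  # one distribution per match state; illustrative
        {'A': 0.7, 'C': 0.1, 'G': 0.1, 'T': 0.1},
        {'A': 0.1, 'C': 0.1, 'G': 0.7, 'T': 0.1},
    ]

    def pssm_score(seq):
        """Log-odds score: sum of per-column emission terms."""
        return sum(math.log2(e[x] / background[x])
                   for e, x in zip(match_emissions, seq))

    print(pssm_score('AG'))  # 2 * log2(0.7 / 0.25), about 2.97 bits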

HMM Topology. Insertions are handled with special states with background-like emission. [Figure: match states M_j and M_j+1 with an insert state I_j between them, including a self-loop on I_j.] Insertion states correspond to states that do not match anything in the model; they almost always have emission probabilities matching the background distribution. We still need a way to deal with deletions…

HMM Topology. How do insertions affect the overall probability of a sequence? [Figure: M_j → I_j → M_j+1, with a self-loop on I_j.] Assuming log-odds scoring relative to some background distribution, emissions from I_j will not contribute to the overall score, and only the transitions will matter. For an insertion of length k: log a_Mj→Ij + (k-1)·log a_Ij→Ij + log a_Ij→Mj+1.

HMM Topology. How do insertions affect the overall probability of a sequence? For an insertion of length k: log a_Mj→Ij + (k-1)·log a_Ij→Ij + log a_Ij→Mj+1. This is equivalent to the familiar gap-opening + gap-extension penalties used in many sequence alignment methods; this is therefore a form of affine gap scoring.
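The correspondence with affine gaps can be checked directly (a sketch; the transition values are invented):

    import math

    a_MI, a_II, a_IM = 0.1, 0.4, 0.6  # illustrative transition probabilities

    def insertion_score(k):
        """Score of an insertion of length k, as on the slide."""
        return math.log(a_MI) + (k - 1) * math.log(a_II) + math.log(a_IM)

    gap_open = math.log(a_MI) + math.log(a_IM)  # enter and leave I_j once
    gap_extend = math.log(a_II)                 # each additional residue

    # A length-3 insertion costs one gap open plus two gap extensions.
    assert abs(insertion_score(3) - (gap_open + 2 * gap_extend)) < 1e-12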

HMM Topology. How best to handle deletions? The problem with allowing arbitrarily long gaps in any practically sized model is that we would need to estimate far too many transition probabilities. Estimation is the hardest HMM problem, so we should avoid an unnecessary proliferation of transitions wherever possible!

HMM Topology. In general we handle deletions by adding silent states. [Figure: the match-state chain with silent delete states D_j above the match states M_j.] We can use a sequence of silent-state transitions to move from any match state to any other match state! Note that not all D_j→D_j+1 transitions need have the same probability.

The Profile HMM. If we combine all these features, we have the famous profile HMM. [Figure: the full profile HMM topology, with match states M_j, delete states D_j, and insert states I_j.] We can use a sequence of silent-state transitions to move from any match state to any other match state! Profile HMMs fully generalise the concept of pairwise alignment!

The Profile HMM. If we combine all these features, we have the famous profile HMM. Profile HMMs are extensively used for the identification of new members of conserved sequence families. This relies on the ability to estimate parameters for the profile HMM on the basis of multiple sequence alignments.

Basic estimation for profile HMMs. Consider a multiple sequence alignment:

    #pos    1234567890
    >glob1  VGA--HAGEY
    >glob2  V----NVDEV
    >glob3  VEA--DVAGH
    >glob4  VKG------D
    >glob5  VYS--TYETS
    >glob6  FNA--NIPKH
    >glob7  IAGADNGAGV
            ***  *****

First heuristic: positions with more than 50% gaps will be modeled as inserts, and the remainder as matches; in this example only the starred columns will correspond to matches. We can then simply count the transitions and emissions to calculate our maximum likelihood estimators from frequencies (a sketch of the column heuristic follows below). But what about missing observations?
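A sketch of that heuristic in Python, using the alignment above:

    alignment = [
        "VGA--HAGEY",
        "V----NVDEV",
        "VEA--DVAGH",
        "VKG------D",
        "VYS--TYETS",
        "FNA--NIPKH",
        "IAGADNGAGV",
    ]

    def match_columns(seqs, gap_threshold=0.5):
        """Indices of columns modeled as matches: those in which at most
        `gap_threshold` of the rows are gap characters."""
        ncols = len(seqs[0])
        return [c for c in range(ncols)
                if sum(s[c] == '-' for s in seqs) / len(seqs) <= gap_threshold]

    print(match_columns(alignment))  # [0, 1, 2, 5, 6, 7, 8, 9]: the starred columns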

Basic estimation for profile HMMs: parameter estimation from multiple alignments. Second heuristic: use pseudocounts to fill in missing observations. Let's start by simply applying Laplace's rule: add one pseudocount to every frequency calculation. More generally, for observed counts A, a_st = (A_st + B·q_st) / (Σ_t' A_st' + B); here B is the total number of pseudocounts, and q represents the fraction of the total number that have been allocated to that particular transition or emission.

Basic estimation for profile HMMs. These are the estimates for the alignment above when we apply Laplace's rule for adding pseudocounts:

    e_M1(V) = 6/27,  e_M1(I) = e_M1(F) = 2/27,  e_M1(all other aa) = 1/27
    a_M1→M2 = 7/10,  a_M1→D2 = 2/10,  a_M1→I1 = 1/10,  etc.
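The emission estimates are easy to reproduce; a sketch for match state M1 (the first starred column):

    from collections import Counter

    AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues

    column1 = "VVVVVFI"  # first alignment column, read down the rows
    counts = Counter(column1)

    # 7 observations plus one pseudocount for each of the 20 residues
    total = len(column1) + len(AMINO_ACIDS)
    e_M1 = {aa: (counts[aa] + 1) / total for aa in AMINO_ACIDS}

    print(e_M1['V'])              # 6/27, about 0.2222
    print(e_M1['F'], e_M1['I'])   # each 2/27
    print(e_M1['A'])              # 1/27 for every unobserved residue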

Scoring with profile HMMs. The state paths of sequences under test are always unknown, so we have two basic options: employ the Viterbi algorithm to determine the joint probability of the observed sequence and the most probable state path, or employ the forward (or backward) algorithm to determine the full probability summed over all possible state paths. (We could instead calculate separate scores for the profile and random models and then combine them, but this is inefficient.) Either way, one fly in the ointment is that we are usually interested in a score expressed as a log-odds ratio relative to our random model: S(x) = log[ P(x|M) / P(x|R) ].

Scoring with profile HMMs: modifying Viterbi and forward for profile HMMs. We could rewrite the recurrences to accommodate log-odds scoring, with Viterbi, for example, adding log terms rather than multiplying probabilities. This would be a serious nuisance, except we already have a log_float class! So we can instead just alter our emission terms, dividing each emission probability by the corresponding background frequency. In other words, we need only adjust and "log_floatize" our emissions, which is something we can do up front in the self.emissions dict rather than in the algorithms (a sketch follows below).
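A minimal sketch of that adjustment, assuming a log_float class that stores its value as a logarithm and multiplies by adding logs (the stand-in below mimics that behaviour; the real class from the course may differ):

    import math

    class log_float:
        """Minimal stand-in: stores log(x), multiplies by adding logs."""
        def __init__(self, x):
            self.logv = math.log(x)
        def __mul__(self, other):
            out = log_float.__new__(log_float)
            out.logv = self.logv + other.logv
            return out

    def log_odds_emissions(emissions, background):
        """Replace each emission e(x) with log_float(e(x) / q(x)), so the
        unmodified Viterbi/forward recurrences accumulate log-odds scores."""
        return {state: {x: log_float(p / background[x])
                        for x, p in dist.items()}
                for state, dist in emissions.items()}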

Handling multiple sequences in Python. Our goal is to develop a class, FastA, that behaves like a Python list and serves as a container for sequences read from a file. Each element will contain a tuple of the annotation and the sequence stored as a string. It should also be able to use these sequences to calculate "dict-of-dict" transition and emission distributions suitable for use in our increasingly sophisticated HMM class. Inheritance is a powerful tool for deriving new classes. I'll provide more information about the key methods in the form of a Python docstring.

    class FastA(list):  # note derivation from list, not object
        def __init__(self):
            pass  # stuff

        def read_file(self, filepath):
            pass  # stuff

        def calculate_MLE(self, pseudocounts):
            # stuff
            return (transitions, emissions)
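A sketch of what read_file might look like; the parsing details are assumptions, since the slide leaves the method bodies as "# stuff":

    class FastA(list):
        """List of (annotation, sequence) tuples read from a FASTA file."""

        def read_file(self, filepath):
            """Append one (annotation, sequence) tuple per record."""
            with open(filepath) as fh:
                header, chunks = None, []
                for line in fh:
                    line = line.rstrip()
                    if line.startswith('>'):
                        if header is not None:
                            self.append((header, ''.join(chunks)))
                        header, chunks = line[1:], []
                    else:
                        chunks.append(line)
                if header is not None:
                    self.append((header, ''.join(chunks)))

    # Hypothetical usage (the file name is illustrative):
    #   fasta = FastA()
    #   fasta.read_file("globins.fa")
    #   for annotation, sequence in fasta:
    #       print(annotation, len(sequence))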