Hidden Markov Models in Bioinformatics Applications

Hidden Markov Models in Bioinformatics Applications CSC391/691 Bioinformatics Spring 2004 Fetrow/Burg/Miller (Slides by J. Burg)

How Markov Models Work, In Sketch
Observe the way something has occurred in known cases – for example:
- the profile of common patterns of amino acids in a family of proteins
- the way certain combinations of letters are pronounced by most people
- the way certain amino acid sequences typically fold
Then use this statistical information to generate probabilities for future behavior, or for other instances whose structure or behavior is unknown.

The Markov Assumption
Named after the Russian mathematician Andrei Markov. Picture a system abstractly as a sequence of states, with associated probabilities that you’ll move from any given state to any of the others. In a 1st-order Markov model, the probability that the system will move to a given state X depends only on the state immediately preceding X. (In a 2nd-order chain, it depends on the two previous states, and so forth.) This independence of each state transition from the earlier history of the chain is what characterizes the Markov model; it’s called the Markov assumption.

Application of the Markov Assumption to PAM Matrices
The Markov assumption underlies the PAM matrices, which model the evolution of proteins. In the PAM matrix design, the probability that amino acid X will mutate to Y is not affected by X’s previous states in evolutionary time.

Markov Model Defined by a set of states S and the probability of moving from one state to the next. In computer science language, it’s a stochastic finite state machine. A Markov sequence is a sequence of states through a given Markov model. It can also be called a Markov chain, or simply a path.

Markov Model Imagine a weird coin that tends to keep flipping heads once it gets the first head and, not quite so persistently, tails once it gets the first tail.
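This behavior is easy to simulate. The sketch below (not from the original slides) models the weird coin as a first-order Markov chain with two states, H and T; the "stickiness" values 0.9 and 0.7 are invented purely for illustration.

```python
import random

# Invented transition probabilities for the "weird coin": once it shows H it
# tends to stay on H (0.9); once it shows T it stays on T a bit less persistently (0.7).
TRANSITIONS = {
    "H": {"H": 0.9, "T": 0.1},
    "T": {"H": 0.3, "T": 0.7},
}

def flip_sequence(n, start="H"):
    """Generate n flips by walking the Markov chain from the start state."""
    state, flips = start, []
    for _ in range(n):
        flips.append(state)
        # The next flip depends only on the current one (the Markov assumption).
        state = "H" if random.random() < TRANSITIONS[state]["H"] else "T"
    return "".join(flips)

print(flip_sequence(30))   # long runs of the same face, e.g. 'HHHHHHHTTTTTTHHH...'
```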

Review of Probability – Concepts and Notation
- P(A) ≡ the probability of A. Note that 0 ≤ P(A) ≤ 1.
- P(A,B) ≡ the probability that both A and B are true.
- P(A | B) ≡ the probability that A is true given that B is true = P(A,B) / P(B).
- P(A,B) = P(A) * P(B) when A and B are independent events.
- Bayes’ Rule: P(A | B) = [P(B | A) * P(A)] / P(B)
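To make the notation concrete, here is a small worked example of Bayes’ Rule, using the rain/umbrella scenario that appears in a later slide; all of the numbers are invented for illustration.

```python
# Invented numbers: it rains 30% of the time; people carry umbrellas 90% of the
# time when it rains and 20% of the time when it doesn't.
p_rain = 0.3
p_umbrella_given_rain = 0.9
p_umbrella_given_dry = 0.2

# Total probability of seeing an umbrella: P(B) = sum over A of P(B | A) * P(A)
p_umbrella = p_umbrella_given_rain * p_rain + p_umbrella_given_dry * (1 - p_rain)

# Bayes' Rule: P(rain | umbrella) = P(umbrella | rain) * P(rain) / P(umbrella)
p_rain_given_umbrella = p_umbrella_given_rain * p_rain / p_umbrella

print(round(p_umbrella, 3))             # 0.41
print(round(p_rain_given_umbrella, 3))  # 0.659
```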

Hidden Markov Models (HMMs)
A Hidden Markov Model M is defined by:
- a set of states X; the states of X are “hidden” states
- a set A of transition probabilities between the states, an |X| x |X| matrix, where a_ij ≡ P(X_j | X_i) is the probability of going from state i to state j
- an alphabet Σ of symbols emitted in states of X
- a set of emission probabilities E, an |X| x |Σ| matrix, where e_i(b) ≡ P(b | X_i) is the probability that b is emitted in state i (emissions are sometimes called observations); states “emit” symbols according to these probabilities
Again, it’s a stochastic finite state automaton.

Hidden Markov Model Imagine having two coins, one that is fair and one that is biased in favor of heads. Once the thrower of the coin starts using a particular coin, he tends to continue with that coin. We, the observers, never know which coin he is using.
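Written down as an HMM, the two-coin situation looks roughly like the sketch below; how biased the second coin is and how reluctant the thrower is to switch coins are assumptions invented for illustration.

```python
import random

# Hidden states X: which coin the thrower is currently using.
STATES = ["fair", "biased"]

# Transition probabilities A (a_ij = P(X_j | X_i)): the thrower tends to keep the coin he has.
A = {
    "fair":   {"fair": 0.9, "biased": 0.1},
    "biased": {"fair": 0.1, "biased": 0.9},
}

# Emission probabilities E (e_i(b) = P(b | X_i)) over the alphabet {H, T}.
E = {
    "fair":   {"H": 0.5, "T": 0.5},
    "biased": {"H": 0.8, "T": 0.2},   # biased toward heads
}

def generate(n, start="fair"):
    """Return (hidden coin sequence, observed flips); only the flips are visible to us."""
    state, path, flips = start, [], []
    for _ in range(n):
        path.append(state)
        flips.append("H" if random.random() < E[state]["H"] else "T")
        state = "fair" if random.random() < A[state]["fair"] else "biased"
    return path, flips

path, flips = generate(20)
print("".join(flips))   # what the observer sees
print(path)             # what stays hidden
```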

What’s “Hidden” in a Hidden Markov Model?
There’s something you can observe (a sequence of observations O, from Σ) and something you can’t observe directly but that you’ve modeled by the states in M (from the set X).
- Example 1: X consists of the states of “raining or not raining” at different moments in time (assuming you can’t look outside for yourself). O consists of observations of someone carrying an umbrella (or not) into the building.
- Example 2: The states modeled by M constitute meaningful sentences. The observations in O constitute digital signals produced by human speech.
- Example 3: The states in M constitute the structure of a “real” family of proteins. The observations in O constitute the experimentally determined structure.
- Example 4: The states in M model the structure of a “real” family of DNA sequences, divided into genes. The observations in O constitute an experimentally determined DNA sequence.

Steps for Applying HMMs to MSA
1. A model is constructed consisting of states, an emission vocabulary, transition probabilities, and emission probabilities. For multiple sequence alignment:
- The emission vocabulary Σ is the set of 20 amino acids.
- The states correspond to different regions in the structure of a family of proteins, one column for each such region.
- In each column, there are three kinds of states: match states (rectangles), insertion states (diamonds), and deletion states (circles). (See next slide.)
- The match states and insertion states emit amino acids. Deletion states do not.
- Each position has a probability distribution indicating the probability that each of the amino acids will occur in that position.

States in the HMM for MSA “To understand the relevance of this architecture, imagine a family of proteins with different sequences which have a similar 3D structure….The structure imposes severe constraints on the sequences. For example: The structure might start with an α-helix about 30 aa long, followed by a group that binds to TT dimers, followed by about 20 aa with hydrophobic residues, etc. Basically, we walk along the sequence and enter into different regions in which the probabilities to have different amino acids are different (for example, it is very unlikely for members of the family to have an amino acid with hydrophilic residue in the hydrophobic region, or gly and pro are very likely to be present at sharp bends of the polypeptide chain, etc.). Different columns in the graph correspond to different positions in the 3D structure. Each position has its own probability distribution…giving the probabilities for different amino acids to occur at this position. Each position can be skipped by some members of the family. This is accounted for by the delete states. There might also be members of the family that have additional amino acids relative to the consensus structure. This is allowed by the insertion states.” From “Multiple Alignment with Hidden Markov Models” by Kalin Vetsigian, http://guava.physics.uiuc.edu/~nigel/courses/498BIO/498BIOonline-essays/hw3/files/HW3-Vetsigian.pdf
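As a concrete picture of the architecture just described, here is a minimal sketch (not from the slides) of how the states and allowed transitions of a length-L profile HMM can be enumerated, following the standard match/insert/delete layout with Begin and End states.

```python
def profile_hmm_layout(L):
    """Enumerate states and allowed transitions of a length-L profile HMM:
    match (M), insert (I), and delete (D) states per column, plus Begin/End."""
    states = ["Begin", "I0"]
    for j in range(1, L + 1):
        states += [f"M{j}", f"D{j}", f"I{j}"]
    states.append("End")

    def column(j):
        # States from which we can leave column j (Begin acts like a match state M0).
        return (["Begin"] if j == 0 else [f"M{j}", f"D{j}"]) + [f"I{j}"]

    transitions = []
    for j in range(L):
        for src in column(j):
            # advance to the next column's match or delete state, or emit an
            # extra residue through this column's insert state (which can self-loop)
            transitions += [(src, f"M{j+1}"), (src, f"D{j+1}"), (src, f"I{j}")]
    for src in column(L):
        transitions += [(src, f"I{L}"), (src, "End")]
    return states, transitions

states, transitions = profile_hmm_layout(3)
print(states)             # ['Begin', 'I0', 'M1', 'D1', 'I1', ..., 'M3', 'D3', 'I3', 'End']
print(len(transitions))   # each state has a small, fixed set of successors
```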

Steps for Applying HMMs to MSA (continued)
2. If the HMM model corresponding to a family of proteins is given to you, you can use it to:
- Score an observation sequence O, computing the probability that an HMM called M would produce O. That is, calculate P(O | M).
- Find an optimal alignment of an observation sequence O = {O_1…O_k} to the model (i.e., the most likely sequence of states that would produce such a sequence). That is, find the sequence of states Q that maximizes P(Q | O_1…O_k). (See the sketch below.)
- Given an observation sequence O, find the HMM that best fits the sequence. That is, calculate the Hidden Markov Model M that maximizes P(O | M). This is a step in training.
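The optimal-alignment problem is usually solved with the Viterbi dynamic-programming algorithm. Below is a minimal sketch (not from the original lecture) run on the two-coin HMM with the same invented probabilities as before; real implementations work in log space to avoid underflow.

```python
def viterbi(observed, states, start_p, trans_p, emit_p):
    """Return the most probable hidden state path for an observation sequence."""
    # best[t][s] = probability of the best path that ends in state s at time t
    best = [{s: start_p[s] * emit_p[s][observed[0]] for s in states}]
    back = [{}]
    for t in range(1, len(observed)):
        best.append({})
        back.append({})
        for s in states:
            # pick the predecessor that maximizes the probability of reaching s
            prev = max(states, key=lambda r: best[t - 1][r] * trans_p[r][s])
            best[t][s] = best[t - 1][prev] * trans_p[prev][s] * emit_p[s][observed[t]]
            back[t][s] = prev
    # trace the best path backwards from the most probable final state
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(observed) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

states = ["fair", "biased"]
start_p = {"fair": 0.5, "biased": 0.5}
trans_p = {"fair": {"fair": 0.9, "biased": 0.1}, "biased": {"fair": 0.1, "biased": 0.9}}
emit_p = {"fair": {"H": 0.5, "T": 0.5}, "biased": {"H": 0.8, "T": 0.2}}

print(viterbi(list("HTHHHHHHTH"), states, start_p, trans_p, emit_p))
```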

Hidden Markov Model for Multiple Sequence Alignment From “Multiple Alignment with Hidden Markov Models” by Kalin Vetsigian, http://guava.physics.uiuc.edu/~nigel/courses/498BIO/498BIOonline-essays/hw3/files/HW3-Vetsigian.pdf

Steps for Applying HMMs to MSA (continued)
3. If you have to construct the HMM model from the ground up:
- The HMM will have a number of columns equal to the number of amino acids + gaps in the family of proteins.
- One way to get the emission and transition probabilities is to begin with a profile alignment for the family of proteins and build the HMM from the profile.
- Another way to get the probabilities is to start from scratch and train the HMM.

Creating a Hidden Markov Model: Notation and Assumptions
- e_i(a) ≡ P(a | X_i): the probability that in state i amino acid a will be observed.
- P(X_t | X_{0:t-1}) = P(X_t | X_{t-1}): this is the Markov assumption with regard to transition probabilities.
- P(Σ_t | X_{0:t}, Σ_{0:t-1}) = P(Σ_t | X_t): the emission probabilities depend only on the state in which they’re emitted, not on any previous history.

Creating a Hidden Markov Model from a Profile
- The probabilities for emissions in match states of the HMM will be derived from the frequencies of each amino acid in the respective columns of the profile (see the sketch below).
- Probabilities for emissions in insertion states will be based on the probability that each amino acid appears anywhere in the sequence.
- Delete states don’t need emission probabilities.
- “The transition probabilities between matching and insertion states can be defined in the affine gap penalty model by assigning a_MI, a_IM, and a_II in such a way that log(a_MI) + log(a_IM) equals the gap creation penalty and log(a_II) equals the gap extension penalty.”
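Here is a hedged sketch of the first two points: turning the residue counts in one profile column into match-state emission probabilities, and the overall residue frequencies into insert-state emissions. The add-one pseudocount is an assumed smoothing choice, not something specified in the slides.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def match_emissions(column, pseudocount=1.0):
    """Emission probabilities for one match state, from one alignment column.

    `column` is the string of residues observed in that column ('-' for gaps);
    pseudocounts keep unseen residues from getting probability zero."""
    counts = Counter(c for c in column if c in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return {a: (counts[a] + pseudocount) / total for a in AMINO_ACIDS}

def insert_emissions(sequences):
    """Background frequencies of each residue anywhere in the family's sequences."""
    counts = Counter(c for seq in sequences for c in seq if c in AMINO_ACIDS)
    total = sum(counts.values())
    return {a: counts[a] / total for a in AMINO_ACIDS}

# One column of a toy alignment: mostly leucine, one gap.
print(match_emissions("LLLIL-V"))

# For the quoted affine-gap relationship, one possible (assumed) split is
# a_MI = a_IM = exp(gap_open / 2) and a_II = exp(gap_extend), so that
# log(a_MI) + log(a_IM) equals the gap creation penalty and log(a_II) the extension penalty.
```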

How would you initialize a Hidden Markov Model from this?

Creating a Hidden Markov Model from Scratch
- Construct the states as before.
- Begin with arbitrary transition and emission probabilities, or hand-chosen ones.
- Train the model to adjust the probabilities: calculate the score of each sequence in the training set, then adjust the probabilities. Repeat until the training set score can’t be improved any more.
- This is guaranteed to reach a local optimum but not a global one. To increase the chance of finding a global optimum, try again with different starting values.

Computing the Probability that a Certain Emission Sequence Could Be Produced by a Given HMM
Compute P(O | M) for O = {O_1…O_K}. We don’t know the true state sequence, so we must look at all paths that could produce the sequence O, on the order of N^K of them, where N is the number of states. For even small N and K this is too time-consuming: for N = 5 and K = 100, it would require around 10^72 computations.

From http://www.it.iitb.ac.in/vweb/engr/cs/dm/Dm_dw/hmmtut.pdf

Forward Recursion
Forward recursion makes the problem more manageable, reducing the computational complexity to on the order of N^2 * K operations. The basic idea is not to recompute parts that are used in more than one term.

Forward Recursion
Initialization: α_1(i) = P(O_1, Q_1 = S_i) = π_i * e_i(O_1), for 1 ≤ i ≤ N.
In general, α_{t+1}(j) = [Σ_i α_t(i) * a_ij] * e_j(O_{t+1}).
For our example, with H observed at the first two time steps:
α_1(1) = π_1 * e_1(H)
α_1(2) = π_2 * e_2(H)
α_2(1) = α_1(1) * a_11 * e_1(H) + α_1(2) * a_21 * e_1(H) = π_1 * e_1(H) * a_11 * e_1(H) + π_2 * e_2(H) * a_21 * e_1(H)
α_2(2) = α_1(1) * a_12 * e_2(H) + α_1(2) * a_22 * e_2(H) = π_1 * e_1(H) * a_12 * e_2(H) + π_2 * e_2(H) * a_22 * e_2(H)
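For completeness, here is a minimal Python implementation of this forward recursion (not from the original slides), run on the two-coin HMM with the same invented probabilities used in the earlier sketches; summing the final α values gives P(O | M).

```python
def forward(observed, states, start_p, trans_p, emit_p):
    """Forward algorithm: returns P(O | M) in O(N^2 * K) time."""
    # Initialization: alpha_1(i) = pi_i * e_i(O_1)
    alpha = {s: start_p[s] * emit_p[s][observed[0]] for s in states}
    for obs in observed[1:]:
        # Recursion: alpha_{t+1}(j) = (sum_i alpha_t(i) * a_ij) * e_j(O_{t+1})
        alpha = {
            j: sum(alpha[i] * trans_p[i][j] for i in states) * emit_p[j][obs]
            for j in states
        }
    # Termination: P(O | M) = sum_i alpha_K(i)
    return sum(alpha.values())

states = ["fair", "biased"]
start_p = {"fair": 0.5, "biased": 0.5}
trans_p = {"fair": {"fair": 0.9, "biased": 0.1}, "biased": {"fair": 0.1, "biased": 0.9}}
emit_p = {"fair": {"H": 0.5, "T": 0.5}, "biased": {"H": 0.8, "T": 0.2}}

print(forward(list("HH"), states, start_p, trans_p, emit_p))   # alpha_2(1) + alpha_2(2) for O = HH
```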

HMM For Gene Prediction From Bioinformatics by David W. Mount, page 350.