
1 Hidden Markov Models in Bioinformatics Applications
CSC391/691 Bioinformatics Spring 2004 Fetrow/Burg/Miller (Slides by J. Burg)

2 How Markov Models Work, In Sketch
Observe the way something has occurred in known cases – for example:
the profile of common patterns of amino acids in a family of proteins
the way certain combinations of letters are pronounced by most people
the way certain amino acid sequences typically fold
Use this statistical information to generate probabilities for future behavior, or for other instances whose structure or behavior is unknown.

3 The Markov Assumption Named after Russian statistician Andrei Markov.
Picture a system abstractly as a sequence of states, with probabilities of moving from any given state to each of the others. In a 1st-order Markov model, the probability that the system will move to a given state X depends only on the state immediately preceding X. (In a 2nd-order chain, it depends on the two previous states, and so forth.) This independence of the next transition from the rest of the system’s history is what characterizes a Markov model; it is called the Markov assumption.

4 Application of the Markov Assumption to PAM Matrices
The Markov assumption underlies the PAM matrices, which model the evolution of proteins. In the PAM matrix design, the probability that amino acid X will mutate to Y is not affected by X’s previous states in evolutionary time.

5 Markov Model A Markov model is defined by a set of states S and the probabilities of moving from each state to the others. In computer science terms, it is a stochastic finite state machine. A Markov sequence is a sequence of states through a given Markov model; it can also be called a Markov chain, or simply a path.

6 Markov Model Imagine a weird coin that tends to keep flipping heads once it gets the first head and, not quite so persistently, tails once it gets the first tail.
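As a concrete illustration, here is a minimal Python sketch of such a two-state Markov chain. The transition probabilities are illustrative assumptions (the slide gives no numbers); the only point is that each flip depends on the previous flip alone.

```python
import random

# Assumed transition probabilities for the "weird coin" (not from the slides):
# persistent once it flips heads, somewhat less persistent once it flips tails.
TRANSITIONS = {
    "H": {"H": 0.9, "T": 0.1},
    "T": {"H": 0.3, "T": 0.7},
}

def simulate(start, n_flips, seed=None):
    """Generate a Markov sequence (path): each flip depends only on the previous one."""
    rng = random.Random(seed)
    state, path = start, [start]
    for _ in range(n_flips - 1):
        state = "H" if rng.random() < TRANSITIONS[state]["H"] else "T"
        path.append(state)
    return path

print("".join(simulate("H", 20, seed=1)))  # prints one sampled sequence of 20 flips
```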

7 Review of Probability – Concepts and Notation
P(A) ≡ the probability of A. Note that P(A) ≥ 0 and P(A) ≤ 1.
P(A,B) ≡ the probability that both A and B are true.
P(A | B) ≡ the probability that A is true given that B is true = P(A,B) / P(B).
P(A,B) = P(A) * P(B) when A and B are independent events.
Bayes’ Rule: P(A | B) = [P(B | A) * P(A)] / P(B)
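A small worked example of Bayes’ Rule, using my own illustrative numbers tied to the umbrella scenario that appears later in the slides:

```python
# Hypothetical numbers for illustration only (none of these come from the slides):
# P(rain) = 0.3, P(umbrella | rain) = 0.9, P(umbrella | no rain) = 0.2.
p_rain = 0.3
p_umb_given_rain = 0.9
p_umb_given_dry = 0.2

# Total probability of seeing an umbrella: P(B) = P(B|A)P(A) + P(B|not A)P(not A)
p_umb = p_umb_given_rain * p_rain + p_umb_given_dry * (1 - p_rain)

# Bayes' Rule: P(rain | umbrella) = P(umbrella | rain) * P(rain) / P(umbrella)
p_rain_given_umb = p_umb_given_rain * p_rain / p_umb
print(round(p_rain_given_umb, 3))  # 0.27 / 0.41, about 0.659
```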

8 Hidden Markov Models (HMMs)
A Hidden Markov Model M is defined by:
a set of states X. The states of X are the “hidden” states.
a set A of transition probabilities between the states, an |X| x |X| matrix: aij ≡ P(Xj | Xi), the probability of going from state i to state j.
an alphabet Σ of symbols emitted in the states of X.
a set of emission probabilities E, an |X| x |Σ| matrix: ei(b) ≡ P(b | Xi), the probability that b is emitted in state i. (Emissions are sometimes called observations.) States “emit” symbols according to these probabilities.
Again, it’s a stochastic finite state automaton.

9 Hidden Markov Model Imagine having two coins, one that is fair and one that is biased in favor of heads. Once the thrower of the coin starts using a particular coin, he tends to continue with that coin. We, the observers, never know which coin he is using.
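A minimal sketch of this two-coin setup in Python, with assumed probabilities (the slide gives none): the hidden state is which coin is in use, and only the flips are observed.

```python
import random

# Assumed parameters for illustration: the thrower keeps the current coin with
# probability 0.9, the fair coin is 50/50, and the biased coin favors heads.
TRANSITION = {"Fair":   {"Fair": 0.9, "Biased": 0.1},
              "Biased": {"Fair": 0.1, "Biased": 0.9}}
EMISSION = {"Fair":   {"H": 0.5, "T": 0.5},
            "Biased": {"H": 0.8, "T": 0.2}}

def sample_hmm(n, start="Fair", seed=None):
    """Sample a hidden state path and the observation sequence it emits."""
    rng = random.Random(seed)
    state, states, observations = start, [], []
    for _ in range(n):
        states.append(state)
        observations.append("H" if rng.random() < EMISSION[state]["H"] else "T")
        state = "Fair" if rng.random() < TRANSITION[state]["Fair"] else "Biased"
    return states, observations

hidden, observed = sample_hmm(15, seed=2)
print("hidden:  ", "".join(s[0] for s in hidden))  # F/B path, unknown to the observer
print("observed:", "".join(observed))              # the flips we actually see
```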

10 What’s “Hidden” in a Hidden Markov Model?
There’s something you can observe (a sequence of observations O, over the alphabet Σ) and something you can’t observe directly but that you’ve modeled by the states in M (from the set X).
Example 1: X consists of the states “raining” or “not raining” at different moments in time (assuming you can’t look outside for yourself). O consists of observations of someone carrying an umbrella (or not) into the building.
Example 2: The states modeled by M constitute meaningful sentences. The observations in O constitute digital signals produced by human speech.
Example 3: The states in M constitute the structure of a “real” family of proteins. The observations in O constitute the experimentally determined structure.
Example 4: The states in M model the structure of a “real” family of DNA sequences, divided into genes. The observations in O constitute an experimentally determined DNA sequence.

11 Steps for Applying HMMs to MSA
1. A model is constructed consisting of states, an emission vocabulary, transition probabilities, and emission probabilities. For multiple sequence alignment:
The emission vocabulary Σ is the set of 20 amino acids.
The states correspond to different regions in the structure of a family of proteins, one column for each such region.
In each column, there are three kinds of states: match states (rectangles), insertion states (diamonds), and deletion states (circles). (See next slide.)
The match states and insertion states emit amino acids; deletion states do not.
Each position has a probability distribution indicating the probability that each of the amino acids will occur in that position. (A rough data-structure sketch of one such column follows below.)
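As a rough data-structure sketch (my own illustration; the class name, field names, and numbers are not from the slides), one column of such a profile HMM could be represented like this:

```python
from dataclasses import dataclass, field
from typing import Dict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

@dataclass
class ProfileHMMColumn:
    """One column of the profile HMM: a match state, an insert state, and a silent delete state."""
    match_emissions: Dict[str, float]    # e_M(a): amino acid probabilities at this position
    insert_emissions: Dict[str, float]   # e_I(a): often background amino acid frequencies
    transitions: Dict[str, float] = field(default_factory=dict)  # e.g. {"M->M": ..., "M->I": ..., "M->D": ...}

# A column initialized with uniform probabilities, before any training:
uniform = {a: 1.0 / len(AMINO_ACIDS) for a in AMINO_ACIDS}
column = ProfileHMMColumn(match_emissions=dict(uniform),
                          insert_emissions=dict(uniform),
                          transitions={"M->M": 0.9, "M->I": 0.05, "M->D": 0.05})
print(column.transitions)
```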

12 States in the HMM for MSA
“To understand the relevance of this architecture, imagine a family of proteins with different sequences which have a similar 3D structure….The structure imposes severe constraints on the sequences. For example: The structure might start with an α-helix about 30 aa long, followed by a group that binds to TT dimers, followed by about 20 aa with hydrophobic residues, etc. Basically, we walk along the sequence and enter into different regions in which the probabilities to have different amino acids are different (for example, it is very unlikely for members of the family to have an amino acid with hydrophilic residue in the hydrophobic region, or gly and pro are very likely to be present at sharp bends of the polypeptide chain, etc.). Different columns in the graph correspond to different positions in the 3D structure. Each position has its own probability distribution…giving the probabilities for different amino acids to occur at this position. Each position can be skipped by some members of the family. This is accounted for by the delete states. There might also be members of the family that have additional amino acids relative to the consensus structure. This is allowed by the insertion states.” From “Multiple Alignment with Hidden Markov Models” by Kalin Vetsigian,

13 Steps for Applying HMMs to MSA (continued)
2. If the HMM model corresponding to a family of proteins is given to you, you can use it to:
Score an observation sequence O, computing the probability that an HMM called M would produce O. That is, calculate P(O | M).
Find an optimal alignment of an observation sequence O = {O1…Ok} to the model (i.e., the most likely sequence of states that would produce such a sequence). That is, find the sequence of states Q that maximizes P(Q | O1…Ok). (A sketch of this follows below.)
Given an observation sequence O, find the HMM that best fits the sequence. That is, find the Hidden Markov Model M that maximizes P(O | M). This is a step in training.
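For the second task (the most likely state path), the standard approach is the Viterbi algorithm. Below is a minimal sketch on an assumed toy two-state HMM (the fair/biased coin from earlier, with illustrative parameters), not the full profile HMM used for MSA:

```python
import math

# Toy HMM parameters, assumed for illustration only.
states = ["Fair", "Biased"]
start_p = {"Fair": 0.5, "Biased": 0.5}
trans_p = {"Fair":   {"Fair": 0.9, "Biased": 0.1},
           "Biased": {"Fair": 0.1, "Biased": 0.9}}
emit_p = {"Fair":   {"H": 0.5, "T": 0.5},
          "Biased": {"H": 0.8, "T": 0.2}}

def viterbi(obs):
    """Return the state path Q that maximizes P(Q | O) for the observation sequence obs."""
    # v[t][s] = log-probability of the best path ending in state s at time t
    v = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        v.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, v[t - 1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda x: x[1])
            v[t][s] = score + math.log(emit_p[s][obs[t]])
            back[t][s] = prev
    # Trace back from the best final state.
    last = max(states, key=lambda s: v[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

print(viterbi("HHHHTTHHHH"))
```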

14 Hidden Markov Model for Multiple Sequence Alignment
From “Multiple Alignment with Hidden Markov Models” by Kalin Vetsigian,

15 Hidden Markov Model for Multiple Sequence Alignment
From “Multiple Alignment with Hidden Markov Models” by Kalin Vetsigian,

16 Steps for Applying HMMs to MSA (continued)
3. If you have to construct the HMM model from the ground up:
The HMM will have a number of columns equal to the number of amino acid positions plus gaps in the family of proteins.
One way to get the emission and transition probabilities is to begin with a profile alignment for the family of proteins and build the HMM from the profile.
Another way to get the probabilities is to start from scratch and train the HMM.

17 Creating a Hidden Markov Model: Notation and Assumptions
ei(a) ≡ P(a | Xi): the probability that amino acid a will be observed in state i.
P(Xt | X0:t-1) = P(Xt | Xt-1): the Markov assumption with regard to transition probabilities.
P(Ot | X0:t, O0:t-1) = P(Ot | Xt): the emission probabilities depend only on the state in which the symbol is emitted, not on any previous history.

18 Creating a Hidden Markov Model from a Profile
The probabilities for emissions in match states of the HMM will be derived from the frequencies of each amino acid in the respective columns of the profile.
Probabilities for emissions in insertion states will be based on the probability that each amino acid appears anywhere in the sequence.
Delete states don’t need emission probabilities.
“The transition probabilities between matching and insertion states can be defined in the affine gap penalty model by assigning aMI, aIM, and aII in such a way that log(aMI) + log(aIM) equals the gap creation penalty and log(aII) equals the gap extension penalty.” (A sketch of both ideas follows below.)
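A minimal sketch of these two ideas, with an assumed pseudocount and illustrative penalty values (none of the numbers come from the slides):

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def match_emissions(column, pseudocount=1.0):
    """Emission probabilities for a match state, derived from one profile column.
    A pseudocount keeps unseen amino acids at a small nonzero probability."""
    counts = Counter(a for a in column if a != "-")   # ignore gap characters
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return {a: (counts[a] + pseudocount) / total for a in AMINO_ACIDS}

# Affine-gap correspondence quoted above: choose a_MI, a_IM, a_II so that
# log(a_MI) + log(a_IM) is the gap-creation penalty and log(a_II) is the
# gap-extension penalty. The penalty values below are illustrative only.
gap_create, gap_extend = -4.0, -1.0        # in log-probability units
a_II = math.exp(gap_extend)                # log(a_II) = gap extension penalty
a_MI = a_IM = math.exp(gap_create / 2.0)   # split the creation penalty evenly

print(match_emissions("AAAVAG-A"))
print(a_MI, a_IM, a_II)
```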

19 How would you initialize a Hidden Markov Model from this?

20 Creating a Hidden Markov Model from Scratch
Construct the states as before.
Begin with arbitrary transition and emission probabilities, or hand-chosen ones.
Train the model to adjust the probabilities: calculate the score of each sequence in the training set, then adjust the probabilities. Repeat until the training-set score can’t be improved any more.
Training is guaranteed to reach a local optimum but not a global one. To increase the chance of finding a global optimum, try again with different starting values.

21 Computing the Probability that a Certain Emission Sequence Could Be Produced by a Given HMM
Compute P(O | M) for O = {O1…Ok}. We don’t know the true state sequence, so we must look at all paths that could produce the sequence O: on the order of N^K of them, where N is the number of states and K is the length of the sequence. For even small N and K this is too time-consuming. For N = 5 and K = 100, it would require around 10^72 computations.
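To see why this blows up, here is a brute-force sketch on an assumed toy two-state HMM (illustrative parameters, not from the slides). It literally sums P(O, Q) over every one of the N^K state paths, which is only workable for very short sequences, and that is exactly the point.

```python
from itertools import product

# Toy HMM parameters, assumed for illustration only.
states = ["Fair", "Biased"]
start_p = {"Fair": 0.5, "Biased": 0.5}
trans_p = {"Fair":   {"Fair": 0.9, "Biased": 0.1},
           "Biased": {"Fair": 0.1, "Biased": 0.9}}
emit_p = {"Fair":   {"H": 0.5, "T": 0.5},
          "Biased": {"H": 0.8, "T": 0.2}}

def brute_force_likelihood(obs):
    """P(O | M) computed by enumerating every possible state path (exponential in len(obs))."""
    total = 0.0
    for path in product(states, repeat=len(obs)):          # N**K candidate paths
        p = start_p[path[0]] * emit_p[path[0]][obs[0]]
        for t in range(1, len(obs)):
            p *= trans_p[path[t - 1]][path[t]] * emit_p[path[t]][obs[t]]
        total += p
    return total

print(brute_force_likelihood("HHTH"))   # 2**4 = 16 paths, fine at this tiny size
```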

23 Forward Recursion Forward recursion makes the problem more manageable, reducing the computational complexity to the order of N^2 * K operations. The basic idea is not to recompute parts that are used in more than one term.

28 Forward Recursion
αk(i) ≡ P(O1…Ok, Qk = Si), the probability of observing the first k symbols and ending in state Si.
Initialization: α1(i) = P(O1, Q1 = Si) = πi * ei(O1), 1 ≤ i ≤ N
Recursion: αk+1(j) = [ Σi αk(i) * aij ] * ej(Ok+1)
Termination: P(O | M) = Σi αK(i)
For our example (where the first two observations are both H):
α1(1) = π1 * e1(H)
α1(2) = π2 * e2(H)
α2(1) = [α1(1) * a11 + α1(2) * a21] * e1(H) = [π1 * e1(H) * a11 + π2 * e2(H) * a21] * e1(H)
α2(2) = [α1(1) * a12 + α1(2) * a22] * e2(H) = [π1 * e1(H) * a12 + π2 * e2(H) * a22] * e2(H)
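A minimal Python sketch of the same recursion, on an assumed toy two-state HMM (the same illustrative fair/biased-coin parameters used in the earlier sketches, not figures from the slides):

```python
# Toy HMM parameters, assumed for illustration only.
states = ["Fair", "Biased"]
start_p = {"Fair": 0.5, "Biased": 0.5}
trans_p = {"Fair":   {"Fair": 0.9, "Biased": 0.1},
           "Biased": {"Fair": 0.1, "Biased": 0.9}}
emit_p = {"Fair":   {"H": 0.5, "T": 0.5},
          "Biased": {"H": 0.8, "T": 0.2}}

def forward_likelihood(obs):
    """P(O | M) via the forward algorithm, O(N^2 * K) instead of enumerating all paths."""
    # Initialization: alpha_1(i) = pi_i * e_i(O_1)
    alpha = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    # Recursion: alpha_{k+1}(j) = [sum_i alpha_k(i) * a_ij] * e_j(O_{k+1})
    for symbol in obs[1:]:
        alpha = {j: sum(alpha[i] * trans_p[i][j] for i in states) * emit_p[j][symbol]
                 for j in states}
    # Termination: P(O | M) = sum_i alpha_K(i)
    return sum(alpha.values())

print(forward_likelihood("HHTH"))   # matches the brute-force sum computed earlier
```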

34 HMM For Gene Prediction
From Bioinformatics by David W. Mount, page 350.

