Hidden Markov Models: an Introduction by Rachel Karchin.


1 Hidden Markov Models: an Introduction by Rachel Karchin

2 Outline
– Stochastic modeling
– Discrete time series
– Simple Markov models
– Hidden Markov models
– Summary of key HMM algorithms
– Modeling protein families with linear profile HMMs

3 Outline (continued)
– Overfitting and regularization
– To come

4 References: Lectures from David Haussler’s CMPS243 class (Winter 1998)

5 Stochastic Modeling
Stochastic modeling is used for phenomena that exhibit random behavior. Random does not mean arbitrary: random events can be modeled by some probability distribution. They are represented with random variables that take on values (numeric or symbolic) according to event outcomes.

6 General Discrete Time Series
Chain of random variables: X_1, X_2, X_3, ..., X_n
Sequence of observed values: x = x_1, x_2, x_3, ..., x_n
When we observe x, we say that X_1 = x_1, X_2 = x_2, X_3 = x_3, ..., X_n = x_n.

7 Simple Markov model of order k
The probability distribution for X_t depends only on the values of the previous k random variables: X_{t-1}, X_{t-2}, ..., X_{t-k}.

8 Simple Markov model of order k
Example with k = 1 and alphabet {a, b}. Observed sequence: x = abaaababbaa
Model:
  Start probabilities: P(a|S) = 0.5, P(b|S) = 0.5
  Transition probabilities: P(a|a) = 0.7, P(b|a) = 0.3, P(a|b) = 0.5, P(b|b) = 0.5
P(x) = 0.5 * 0.3 * 0.5 * 0.7 * 0.7 * 0.3 * 0.5 * 0.3 * 0.5 * 0.5 * 0.7
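As a sketch, this computation can be written in a few lines of Python (the dictionary layout for the model is illustrative, not from the slides):

```python
# Score a sequence under the first-order Markov model from the slide.
# 'S' is the start state; each entry maps previous symbol -> {next symbol: prob}.
model = {
    "S": {"a": 0.5, "b": 0.5},   # start probabilities
    "a": {"a": 0.7, "b": 0.3},
    "b": {"a": 0.5, "b": 0.5},
}

def markov_prob(x, model):
    """P(x) under a first-order Markov chain: product of transition probabilities."""
    prob = 1.0
    prev = "S"
    for symbol in x:
        prob *= model[prev][symbol]
        prev = symbol
    return prob

print(markov_prob("abaaababbaa", model))  # ~2.89e-04
```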

9 What is a hidden Markov model?
A finite set of hidden states. At each time t, the system is in a hidden state, chosen at random depending on the state at time t-1. At each time t, an observed letter is generated at random, depending only on the current hidden state.

10 HMM for random toss of fair and biased coins
Two hidden states (diagram): Fair, with P(H) = 0.5 and P(T) = 0.5, and Biased, with P(H) = 0.1 and P(T) = 0.9. The Start state enters each coin with probability 0.5; each state loops back to itself with probability 0.8 and switches to the other state with probability 0.2.
Sequence of states: q = FFFFBBBFFFFF
Observed sequence: x = HTTHTTTTHHTH

11 HMM for random toss of fair and biased coins
The sequence of states is a first-order Markov model, but it is usually hidden from us. We observe the effect, which is statistically correlated with the state, and use the correlations to decode the state sequence.

12 HMM for fair and biased coins
Sequence of states: q = FFFFBBBFFFFF
Observed sequence: x = HTTHTTTTHHTH
With complete information, we can compute:
P(x,q) = 0.5 * 0.5 * 0.8 * 0.5 * 0.8 * 0.5 * 0.8 * 0.5 * 0.2 * 0.9 ...
Otherwise, we need algorithms that handle the hidden states (next slide).
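A short sketch of this joint computation, using transition and emission values as read off the coin diagram (state labels and dictionary layout are illustrative):

```python
# Joint probability P(x, q) for the fair/biased coin HMM.
start = {"F": 0.5, "B": 0.5}
trans = {"F": {"F": 0.8, "B": 0.2}, "B": {"F": 0.2, "B": 0.8}}
emit  = {"F": {"H": 0.5, "T": 0.5}, "B": {"H": 0.1, "T": 0.9}}

def joint_prob(x, q):
    """P(x,q) = start(q1)*emit(x1|q1) * product of trans(qt|qt-1)*emit(xt|qt)."""
    prob = start[q[0]] * emit[q[0]][x[0]]
    for t in range(1, len(x)):
        prob *= trans[q[t - 1]][q[t]] * emit[q[t]][x[t]]
    return prob

print(joint_prob("HTTHTTTTHHTH", "FFFFBBBFFFFF"))
```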

13 Three key HMM algorithms
– Forward algorithm. Given an observed sequence x and an HMM M, calculate P(x|M).
– Viterbi algorithm. Given x and M, calculate the most likely state sequence q.
– Forward-backward (Baum-Welch) algorithm. Given many observed sequences, estimate the parameters of the HMM.
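To make the first of these concrete, here is a minimal sketch of the forward algorithm for the coin HMM above (same assumed parameter dictionaries as before):

```python
# Forward algorithm: P(x|M), summing over all hidden state sequences.
start = {"F": 0.5, "B": 0.5}
trans = {"F": {"F": 0.8, "B": 0.2}, "B": {"F": 0.2, "B": 0.8}}
emit  = {"F": {"H": 0.5, "T": 0.5}, "B": {"H": 0.1, "T": 0.9}}

def forward_prob(x):
    """Dynamic programming over the forward variable f[state]."""
    # Initialization: paths of length 1 ending in each state.
    f = {s: start[s] * emit[s][x[0]] for s in start}
    # Induction: extend every partial path by one observation at a time.
    for obs in x[1:]:
        f = {s: sum(f[r] * trans[r][s] for r in f) * emit[s][obs] for s in trans}
    return sum(f.values())  # sum over final states

print(forward_prob("HTTHTTTTHHTH"))
```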

14 Some HMM Topologies

15 Modeling protein families with linear profile HMMs
– The observed sequence is the amino acid sequence of a protein.
– Typically we want to model a group of related proteins.
– Model states and transitions will be based on a multiple alignment of the group.
– No transitions from right to left.

16 From multiple alignment to profile HMM
A good model of these proteins must reflect:
– highly conserved positions in the alignment
– variable regions in the alignment
– varying lengths of protein sequences
Multiple alignment:
  NF.....A-
  DF.....SY
  NYrqsanS-
  NFapistAY
  DFvlamrSF

17 From multiple alignment to profile HMM
Three kinds of states: match, insert, silent.
Emission probabilities in the diagram, derived from the alignment above:
  Match 1: P(N) = 0.6, P(D) = 0.4
  Match 2: P(F) = 0.8, P(Y) = 0.2
  Insert:  P(R) = 0.13, P(Q) = 0.07, P(A) = 0.2, ...
  Match 3: P(S) = 0.6, P(A) = 0.4
  Match 4: P(Y) = 0.67, P(F) = 0.33
Transitions in the diagram include Start -> Match 1 = 1.0, Match 1 -> Match 2 = 1.0, Match 2 -> Match 3 = 0.4, Match 2 -> Insert = 0.6, and Match 3 -> Match 4 = 0.6 (remaining edges not recoverable from the transcript).
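The match-state emission probabilities above are just relative frequencies in the alignment columns. A minimal sketch of that counting (Python assumed; `column_emissions` is an illustrative helper, not from the slides):

```python
from collections import Counter

# Toy alignment from the slides: uppercase = match columns, lowercase = insert,
# '-' = deletion, '.' = no inserted residue.
alignment = ["NF.....A-", "DF.....SY", "NYrqsanS-", "NFapistAY", "DFvlamrSF"]

def column_emissions(alignment, col):
    """Relative frequency of each residue in one column, ignoring gaps."""
    residues = [row[col] for row in alignment if row[col] not in ".-"]
    counts = Counter(residues)
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.items()}

print(column_emissions(alignment, 0))  # {'N': 0.6, 'D': 0.4}        -> Match 1
print(column_emissions(alignment, 8))  # {'Y': 0.67, 'F': 0.33} approx. -> Match 4
```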

18 Finding the probability of a sequence with an HMM
Once we have an HMM for a group of proteins, we are often interested in how well a new sequence fits the model. We want to compute a probability for our sequence with respect to the model.

19 One sequence, many paths
A protein sequence can be represented by many paths through the HMM. (Diagram: the sequence DYAF traced along one path through the profile HMM of slide 17.)

20 One sequence, many paths
The same sequence can also be represented by a different path through the model, with a different probability. (Diagram: DYAF traced along an alternative path through the model.)

21 Finding the probability of a sequence with an HMM
Not knowing the state sequence q, we’ll have to use either the forward or the Viterbi algorithm.
Basic recurrence relation for Viterbi:
  P(v_t): the probability of the most probable path ending in state q_t with observations x_1, ..., x_t
  P(v_0) = 1
  P(v_t) = max over q_{t-1} of P(v_{t-1}) * P(q_t | q_{t-1}) * P(x_t | q_t)
Compute with dynamic programming.
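A compact sketch of this recurrence, applied to the fair/biased coin HMM from slide 10 (the parameter dictionaries are assumptions read off that diagram; backpointers recover the state path):

```python
# Viterbi algorithm: most probable state path for an observed sequence.
start = {"F": 0.5, "B": 0.5}
trans = {"F": {"F": 0.8, "B": 0.2}, "B": {"F": 0.2, "B": 0.8}}
emit  = {"F": {"H": 0.5, "T": 0.5}, "B": {"H": 0.1, "T": 0.9}}

def viterbi(x):
    # v[s] = probability of the best path ending in state s; back stores choices.
    v = {s: start[s] * emit[s][x[0]] for s in start}
    back = []
    for obs in x[1:]:
        choice = {s: max(v, key=lambda r: v[r] * trans[r][s]) for s in trans}
        v = {s: v[choice[s]] * trans[choice[s]][s] * emit[s][obs] for s in trans}
        back.append(choice)
    # Trace back from the best final state.
    state = max(v, key=v.get)
    prob = v[state]
    path = [state]
    for choice in reversed(back):
        state = choice[state]
        path.append(state)
    return "".join(reversed(path)), prob

print(viterbi("HTTHTTTTHHTH"))
```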

22 Most probable path: Viterbi algorithm
for t = 1 to n: P(v_t) = max P(v_{t-1}) * P(q_t | q_{t-1}) * P(x_t | q_t)
Trellis of best-path probabilities (rows: observed letters D, Y, A, F; columns: states of the profile HMM from slide 17):
      M1    I1    M2    I2    M3     I3     M4     I4
  D   0.4   0     0     0     0      0      0      0
  Y   0     0     0.08  0     0      0      0.021  0
  A   0     0     0     0     0.051  0.038  0      0
  F   0     0     0.32  0     0      0      0.01   0
Sample steps of the recurrence:
  P(v_1): D in M1 = 1.0 * 1.0 * 0.4 = 0.4 (all other letter/state pairs are 0)
  P(v_2): Y in M2 = 0.4 * 1.0 * 0.2 = 0.08; F in M2 = 0.4 * 1.0 * 0.8 = 0.32
  P(v_3): A in M3 = 0.32 * 0.4 * 0.4 = 0.051; A in I3 = 0.32 * 0.6 * 0.2 = 0.038
  P(v_4): Y in M4 = 0.051 * 0.6 * 0.67 = 0.021; F in M4 = 0.051 * 0.6 * 0.33 = 0.01

23 Overfitting problems
Our toy example illustrates a problem with estimating probability distributions from small samples: at position 1, P(any amino acid other than D or N) = 0, so family members that don’t begin with D or N can’t be recognized by the model. (Figure: probability distribution in Match State 1.)

24 Model regularization
Use pseudocounts: if an amino acid does not appear in a column of the alignment, give it a fake count.
  NF.....A-
  DF.....SY
  NYrqsanS-
  NFifistAY
  DFvlpmrSF
For example, at column 1:
  P(A) = (observed counts of A + pseudocounts of A) / (observed counts over all amino acids + pseudocounts over all amino acids)
and likewise for N.
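A sketch of this smoothing rule with a uniform pseudocount of 1 for every amino acid; the slides do not specify the pseudocount values, so this choice is an assumption:

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def smoothed_emissions(column, pseudocount=1.0):
    """P(aa) = (observed + pseudo) / (total observed + total pseudo)."""
    counts = Counter(column)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}

# Column 1 of the alignment: every residue now has nonzero probability.
probs = smoothed_emissions("NDNND")
print(round(probs["N"], 3))  # (3 + 1) / (5 + 20) = 0.16
print(round(probs["A"], 3))  # (0 + 1) / (5 + 20) = 0.04
```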

25 Model regularization
Pseudocounts smooth the column probability distributions. In practice, pseudocounts are often added by fitting the column to a set of typical amino acid distributions found in the columns of protein multiple alignments. (Figure: probability distribution in Match State 1.)

26 To come
HMMs can be used to automatically produce a high-quality multiple alignment.
Active areas of research:
– Building HMMs that can recognize very distantly related proteins
– Multi-track HMMs

