CSCE555 Bioinformatics Lecture 6 Hidden Markov Models Meeting: MW 4:00PM-5:15PM SWGN2A21 Instructor: Dr. Jianjun Hu Course page:


1 CSCE555 Bioinformatics Lecture 6 Hidden Markov Models Meeting: MW 4:00PM-5:15PM SWGN2A21 Instructor: Dr. Jianjun Hu Course page: http://www.scigen.org/csce555 University of South Carolina Department of Computer Science and Engineering 2008 www.cse.sc.edu HAPPY CHINESE NEW YEAR

2 Roadmap
◦ Probabilistic Models of Sequences
◦ Introduction to HMM
◦ Profile HMMs as MSA models
◦ Measuring Similarity between Sequence and HMM Profile model
◦ Summary

3 Multiple Sequence Alignment
◦ Alignment containing multiple DNA / protein sequences
◦ Look for conserved regions → similar function
Example:
#Rat      ATGGTGCACCTGACTGATGCTGAGAAGGCTGCTGT
#Mouse    ATGGTGCACCTGACTGATGCTGAGAAGGCTGCTGT
#Rabbit   ATGGTGCATCTGTCCAGT---GAGGAGAAGTCTGC
#Human    ATGGTGCACCTGACTCCT---GAGGAGAAGTCTGC
#Opossum  ATGGTGCACTTGACTTTT---GAGGAGAAGAACTG
#Chicken  ATGGTGCACTGGACTGCT---GAGGAGAAGCAGCT
#Frog     ---ATGGGTTTGACAGCACATGATCGT---CAGCT

4 Probabilistic Model: Position-specific scoring matrices (PSSM) Limitations of PSSM?

5 Difficulty in biological sequences
Variation in a family of sequences
◦ Gaps of variable lengths
◦ Segments conserved to different degrees
◦ PSSM cannot handle variable-length gaps
◦ Need a statistical sequence model

6 Regular Expressions Model
Regular expressions
◦ Protein spelling is much freer than English spelling
◦ [AT] [CG] [AC] [ACGT]* A [TG] [GC]
Limitation of regular expression model?
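As a rough illustration of the regular expression model, the pattern on this slide can be tried directly with Python's `re` module. This is only a sketch: the helper name `matches_motif` and the test sequences are made up for the example and do not come from the lecture.

```python
import re

# Pattern from the slide with the spaces removed: one character class per position,
# and [ACGT]* standing for an insertion of any length.
MOTIF = re.compile(r"[AT][CG][AC][ACGT]*A[TG][GC]")

def matches_motif(seq):
    """Return True if the whole sequence can be spelled by the motif."""
    return MOTIF.fullmatch(seq) is not None

# Hypothetical test sequences (not taken from the lecture).
print(matches_motif("ACAATG"))   # True: no insertion
print(matches_motif("TGCTAGG"))  # True: one inserted base
print(matches_motif("ACAATA"))   # False: the last base must be G or C
```

Note that the answer is all-or-nothing: a sequence either matches or it does not, so the pattern cannot say that one match is more typical of the family than another.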

7 Roadmap
◦ Probabilistic Models of Sequences
◦ Introduction to HMM
◦ Profile HMMs as MSA models
◦ Measuring Similarity between Sequence and HMM Profile model
◦ Summary

8 Hidden Markov Model (HMM)
HMM is:
◦ Statistical model
◦ Well suited for many tasks in molecular biology
Using HMM in molecular biology
◦ Probabilistic profile (profile HMM)
  - From a family of proteins, for searching a database for other members of the family
  - Resembles the profile and weight matrix methods
◦ Grammatical structure
  - Gene finding
  - Recognize signals
  - Prediction (must follow the rules of a gene)

9 Detect Cheating in Coin Toss Game
Fair and biased coins could be used.
Question: is it possible to determine whether a biased coin has been used, based only on the observed sequence of heads and tails?
HTTTHTHTHTTHHHHT HTHTHTHHHHTHT

10 Example: Fair Coin Toss
Consider the single-coin scenario. We could model the process producing the sequence of H's and T's as a Markov model with two states, H and T, in which every transition has probability 0.5. Only one fair coin is used here.

11 Example: Fair and Biased Coins
Consider the scenario where there are two coins: a fair coin and a biased coin. Visible states do not correspond to hidden states.
◦ Visible state: output of H or T
◦ Hidden state: which coin was tossed
HTTTHTHTHTTHHHHTHTHTHTHHHHTHT
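To make the two-coin scenario concrete, here is a minimal sketch of the forward algorithm, which sums the probability of the tosses over every possible hidden sequence of coins. The lecture gives no numbers for this example, so the start, switching, and emission probabilities below are illustrative assumptions, and `forward_probability` is a name chosen for this sketch.

```python
STATES = ("fair", "biased")

# All numbers below are illustrative assumptions, not values from the lecture.
START = {"fair": 0.5, "biased": 0.5}
TRANS = {"fair": {"fair": 0.9, "biased": 0.1},
         "biased": {"fair": 0.1, "biased": 0.9}}
EMIT = {"fair": {"H": 0.5, "T": 0.5},
        "biased": {"H": 0.8, "T": 0.2}}

def forward_probability(tosses):
    """Total probability of a non-empty toss string, summed over all coin paths."""
    # alpha[s] = probability of the tosses seen so far, ending in hidden state s.
    alpha = {s: START[s] * EMIT[s][tosses[0]] for s in STATES}
    for symbol in tosses[1:]:
        alpha = {
            s: EMIT[s][symbol] * sum(alpha[r] * TRANS[r][s] for r in STATES)
            for s in STATES
        }
    return sum(alpha.values())

observed = "HTTTHTHTHTTHHHHTHTHTHTHHHHTHT"  # toss sequence shown on the slide
print(forward_probability(observed))
```

The same quantity under a fair-coin-only model is simply 0.5 raised to the number of tosses, so comparing the two values is one way to ask whether a biased coin is plausible.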

12 Hidden Markov Models

13 Ingredients of an HMM
◦ Collection of states: $\{S_1, S_2, \ldots, S_N\}$
◦ State transition probabilities (transition matrix): $A_{ij} = P(q_{t+1} = S_i \mid q_t = S_j)$
◦ Initial state distribution: $\pi_i = P(q_1 = S_i)$
◦ Observations: $\{O_1, O_2, \ldots, O_M\}$
◦ Observation probabilities: $B_j(k) = P(v_t = O_k \mid q_t = S_j)$

14 Ingredients of Our HMM
◦ States: $\{S_{sunny}, S_{rainy}, S_{snowy}\}$
◦ State transition probabilities (transition matrix): A =
◦ Initial state distribution: $\pi = (.7, .25, .05)$
◦ Observations: $\{O_1, O_2, \ldots, O_M\}$
◦ Observation probabilities (emission matrix): B =
.08  .15  .05
.38  .6   .02
.75  .05  .2

15 Probability of a Sequence of Events
$$P(O) = P(O_{gloves}, O_{gloves}, O_{umbrella}, \ldots, O_{umbrella}) = \sum_{\text{all } Q} P(O \mid Q)\,P(Q) = \sum_{q_1,\ldots,q_7} P(O \mid q_1,\ldots,q_7)\,P(q_1,\ldots,q_7) = 0.7 \times 0.8^6 \times 0.3^2 \times 0.1^4 \times 0.6 + \ldots$$
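The single term written out on this slide can be checked with a few lines of arithmetic. The grouping into initial, transition, and emission factors is my reading of the product (one start probability, six transitions, seven emissions); the variable names are chosen for this sketch.

```python
# Factors of the one path term written out on the slide.
initial = 0.7                          # probability of the path's first state
transition = 0.8 ** 6                  # six transitions between seven consecutive states
emission = 0.3 ** 2 * 0.1 ** 4 * 0.6   # seven emissions, one per observation

path_term = initial * transition * emission
print(f"{path_term:.3e}")

# With 3 weather states and 7 time steps, the sum over all Q has 3**7 terms.
num_paths = 3 ** 7
print(num_paths)  # 2187
```

Even for this small model the explicit sum has thousands of terms, which is why the total probability is normally computed with dynamic programming (the forward algorithm) rather than by enumerating paths.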

16 Typical HMM Problems
◦ Annotation: given a model M and an observed string S, what is the most probable path through M generating S?
◦ Classification: given a model M and an observed string S, what is the total probability of S under M?
◦ Consensus: given a model M, what is the string having the highest probability under M?
◦ Training: given a set of strings and a model structure, find transition and emission probabilities assigning high probabilities to the strings
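The annotation problem is usually solved with the Viterbi algorithm. Below is a minimal sketch on a two-coin model; every probability in it is an illustrative assumption rather than a value from the lecture, and `viterbi` is simply the name used here.

```python
import math

STATES = ("fair", "biased")

# Illustrative parameters only; the lecture does not give numeric values.
START = {"fair": 0.5, "biased": 0.5}
TRANS = {"fair": {"fair": 0.9, "biased": 0.1},
         "biased": {"fair": 0.1, "biased": 0.9}}
EMIT = {"fair": {"H": 0.5, "T": 0.5},
        "biased": {"H": 0.8, "T": 0.2}}

def viterbi(observed):
    """Most probable hidden-state path for a non-empty observed string."""
    # score[s] = best log-probability of any path that ends in state s.
    score = {s: math.log(START[s]) + math.log(EMIT[s][observed[0]]) for s in STATES}
    backpointers = []
    for symbol in observed[1:]:
        new_score, pointer = {}, {}
        for s in STATES:
            best_prev = max(STATES, key=lambda r: score[r] + math.log(TRANS[r][s]))
            pointer[s] = best_prev
            new_score[s] = (score[best_prev] + math.log(TRANS[best_prev][s])
                            + math.log(EMIT[s][symbol]))
        backpointers.append(pointer)
        score = new_score
    # Trace back from the best final state to recover the path.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointer in reversed(backpointers):
        state = pointer[state]
        path.append(state)
    return path[::-1]

print(viterbi("HHHHHHHH"))
print(viterbi("TTTTTTTT"))
```

Working in log space avoids numerical underflow on long strings. Replacing `max` with a sum over previous states (in probability space) turns the same recursion into the forward algorithm for the classification problem.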

17 Roadmap
◦ Probabilistic Models of Sequences
◦ Introduction to HMM
◦ Profile HMMs as MSA models
◦ Measuring Similarity between Sequence and HMM Profile model
◦ Summary

18 HMM Profiles as Sequence Models
◦ Given the multiple alignment of sequences, we can use an HMM to model the sequences
◦ Each column of the alignment may be represented by a hidden state that produced that column
◦ Insertions and deletions may be represented by other states

19 Profile HMMs
HMM with a structure that in a natural way allows position-dependent gap penalties
◦ Main states
  - model the columns of the alignment
◦ Insert states
  - model highly variable regions
◦ Delete states
  - to jump over one or more columns
  - i.e. to model the situation when just a few of the sequences have a “-” in the multiple alignment at a position

20 HMM Sequences Continued

21 Profile HMM Example
Consider the six sequences shown below. A multiple sequence alignment of these sequences is the first step toward inducing the hidden Markov model.
SEQ1 G C C C A
SEQ2 A G C
SEQ3 A A G C
SEQ4 A G A A
SEQ5 A A A C
SEQ6 A G C

22 Profile HMM Topology
The topology of the HMM is established using the consensus sequence. The structure of a profile HMM is shown below:
◦ Square boxes represent match states
◦ Diamonds represent insert states
◦ Circles represent delete states

23 Profile HMM Example Continued
◦ The aligned columns correspond to either emissions from the match state or to emissions from the insert state
◦ The consensus columns are used to define the match states M 1, M 2, M 3 for the HMM
◦ After defining the match states, the corresponding insert and delete states are used to define the complete HMM topology
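One common heuristic for picking consensus columns is to treat a column as a match column when fewer than half of its entries are gaps. Whether the lecture uses exactly this rule is an assumption, and since the aligned figure is not in the transcript, the alignment below is a hypothetical toy example. `match_columns` is a name invented for this sketch.

```python
def match_columns(alignment, max_gap_fraction=0.5):
    """Return indices of columns treated as match (consensus) columns."""
    n_rows = len(alignment)
    columns = []
    for j in range(len(alignment[0])):
        gaps = sum(1 for row in alignment if row[j] == "-")
        # Assumed rule: a column is a match column if fewer than half its rows are gaps.
        if gaps / n_rows < max_gap_fraction:
            columns.append(j)
    return columns

# Hypothetical toy alignment (not the alignment from the lecture).
toy_alignment = ["AG-C",
                 "A-AC",
                 "AG-C",
                 "AGTC"]
print(match_columns(toy_alignment))  # [0, 1, 3]
```

In this toy case three columns qualify, giving three match states; the remaining column is half gaps and would be modeled by an insert state.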

24 Transition Probabilities
◦ The values of the transition probabilities are computed from the frequency of each transition as every sequence is considered
◦ The model parameters are computed using the state transition sequences shown in the figure below:

25 Transition Probabilities Continued
The frequency of each of the transitions and the corresponding emission probabilities are shown below
State   0   1   2   3
M-M     4   5   6   4
M-D     1   0   0   -
M-I     1   0   0   2
I-M, I-D, I-I (eight values in the transcript): 0 0 0 - 0 0 0 2
D-M     -   1   0   0
D-D     -   0   0   -
D-I     -   0   0   0
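A minimal sketch of how such counts become probabilities: divide each count by the total number of transitions leaving the same state. The example uses only the match-state rows, read from the slide as M-M = 4 5 6 4, M-D = 1 0 0 -, M-I = 1 0 0 2; that reading of the flattened table, and the function name, are assumptions of this sketch.

```python
# Match-state rows of the count table; None marks "-" (no such transition).
MATCH_COUNTS = {
    "M-M": [4, 5, 6, 4],
    "M-D": [1, 0, 0, None],
    "M-I": [1, 0, 0, 2],
}

def transition_probabilities(counts, state):
    """Normalize the counts leaving one match state into probabilities."""
    available = {name: row[state] for name, row in counts.items()
                 if row[state] is not None}
    total = sum(available.values())
    return {name: value / total for name, value in available.items()}

for state in range(4):
    print(state, transition_probabilities(MATCH_COUNTS, state))
```

This is a plain frequency estimate, so any transition never seen in the training sequences gets probability zero. In practice pseudocounts are often added to avoid that.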

26 Emission Probabilities
◦ The emission probability is computed using the formula:
◦ The emission probability specifies the probability of emitting each symbol of the alphabet ∑ in state k

27 Emission Probabilities Continued The emission probability for each state is computed as shown below:
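The formula and the worked values did not survive in this transcript. As a hedged stand-in, the usual estimate is the count of each symbol in a match column divided by the total count for that column, optionally with a pseudocount added to every symbol. The sketch below assumes a DNA alphabet and a hypothetical column; it is not necessarily the exact formula from the slide.

```python
from collections import Counter

def emission_probabilities(column, alphabet="ACGT", pseudocount=0.0):
    """Estimate emission probabilities for one match state from its alignment column."""
    # Gap characters are ignored: they correspond to delete states, not emissions.
    counts = Counter(symbol for symbol in column if symbol != "-")
    total = sum(counts[a] for a in alphabet) + pseudocount * len(alphabet)
    return {a: (counts[a] + pseudocount) / total for a in alphabet}

column = "AAAGA-"  # hypothetical match column, not taken from the lecture
print(emission_probabilities(column))
print(emission_probabilities(column, pseudocount=1.0))
```

With `pseudocount=0.0` a column made only of gaps would divide by zero, which is one more reason a small pseudocount is a common choice.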

28 Searching the Profile HMM
◦ Sequences can be searched against the HMM to detect whether or not they belong to the family of sequences described by the profile HMM
◦ Using a global alignment, the most probable alignment of a sequence to the model, and its probability, can be determined using the Viterbi algorithm
◦ The full probability of a sequence aligning to the profile HMM is determined using the forward algorithm

29 How Well Does a Sequence Fit a Model?
◦ Probability depends on the length of the sequence
◦ Not suitable to use as a score

30 Length-independent Score
Log-odds score
◦ The logarithm of the probability of the sequence divided by the probability according to a null model
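A minimal sketch of the log-odds score, assuming the null model emits each base independently from fixed background frequencies (a common choice, but an assumption here). The model log-probability would come from the Viterbi or forward algorithm; the value used below is a made-up placeholder.

```python
import math

def log_odds_score(log_p_model, sequence, background):
    """log P(sequence | model) minus log P(sequence | null model)."""
    # Assumed null model: independent draws from the background frequencies.
    log_p_null = sum(math.log(background[symbol]) for symbol in sequence)
    return log_p_model - log_p_null

uniform = {base: 0.25 for base in "ACGT"}

# Hypothetical model probability for the sequence; not a value from the lecture.
score = log_odds_score(math.log(1e-3), "AGAC", uniform)
print(score)
```

Because both probabilities shrink as the sequence gets longer, their ratio is far less sensitive to length than the raw probability. A positive score means the model explains the sequence better than the null model does.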

31 Length-independent Score HMM using log-odds

32 Summary
◦ HMM
◦ How to build a profile HMM model
◦ Scoring the fit between a sequence and an HMM model

33 Next Lecture Gene-finding Reading: ◦ Textbook (CG) chapter 4 ◦ Textbook (EB) chapter 8

