# Hidden Markov Models: Applications in Bioinformatics Gleb Haynatzki, Ph.D. Creighton University March 31, 2003.

Definition A Hidden Markov Model (HMM) is a discrete-time finite-state Markov chain coupled with a sequence of letters emitted when the Markov chain visits its states. States ($Q$): $q_1, q_2, q_3, \ldots$ Letters ($O$): $O_1, O_2, O_3, \ldots$

Definition (Cont’d) The sequence $O$ of emitted letters is called “the observed sequence” because we often know it while not knowing the state sequence $Q$, which in this case is called “hidden”. The triple $\lambda = (P, B, \pi)$ represents the full set of parameters of the HMM, where $P$ is the transition probability matrix of the Markov chain, $B$ is the emission probability matrix, and $\pi$ denotes the initial distribution vector of the Markov chain.
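As a minimal sketch of the definition, the parameter triple can be written down as plain Python lists and used to draw a hidden state sequence together with its observed letters. The two states, the two letters, and every probability below are hypothetical values chosen only for illustration; they do not come from the presentation.

```python
import random

# Hypothetical two-state, two-letter HMM (illustrative numbers only).
states = ["s1", "s2"]
letters = ["a", "b"]
pi = [0.6, 0.4]                  # initial distribution over states
P = [[0.7, 0.3], [0.4, 0.6]]     # P[i][j] = Pr(next state j | current state i)
B = [[0.9, 0.1], [0.2, 0.8]]     # B[i][k] = Pr(letter k emitted | state i)


def sample(T, rng):
    """Draw a hidden state sequence Q and an observed sequence O of length T."""
    Q, O = [], []
    q = rng.choices(range(len(states)), weights=pi)[0]
    for _ in range(T):
        Q.append(states[q])
        O.append(rng.choices(letters, weights=B[q])[0])    # emit a letter
        q = rng.choices(range(len(states)), weights=P[q])[0]  # move on
    return Q, O


rng = random.Random(0)
Q, O = sample(5, rng)
print("hidden:  ", Q)
print("observed:", O)
```

In practice only the second printed line would be available; the first is what the calculations on the following slides try to recover or sum over.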

Important Calculations Given any observed sequence $O = (O_1, \ldots, O_T)$: (1) given $\lambda$, efficiently calculate $P(O \mid \lambda)$; (2) given $\lambda$, efficiently calculate the hidden sequence $Q = (q_1, \ldots, q_T)$ that is most likely to have occurred, i.e. find $\arg\max_Q P(Q \mid O)$; (3) assuming a fixed graph structure of the underlying Markov chain, find the parameters $\lambda = (P, B, \pi)$ maximizing $P(O \mid \lambda)$.
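The first calculation is usually done with the forward recursion, which the slides do not name explicitly. The sketch below is one plain-Python version under assumed, hypothetical parameters; it also sums over all state paths by brute force to check the result on a short sequence. Observations are given as letter indices.

```python
from itertools import product


def forward(obs, pi, P, B):
    """Probability of the observed sequence, in O(T * N^2) operations."""
    n = len(pi)
    # alpha[i] = Pr(O_1..O_t, q_t = i)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    for o in obs[1:]:
        alpha = [sum(alpha[i] * P[i][j] for i in range(n)) * B[j][o]
                 for j in range(n)]
    return sum(alpha)


def brute_force(obs, pi, P, B):
    """Same quantity by summing over all N^T state paths (tiny inputs only)."""
    total = 0.0
    for path in product(range(len(pi)), repeat=len(obs)):
        p = pi[path[0]] * B[path[0]][obs[0]]
        for t in range(1, len(obs)):
            p *= P[path[t - 1]][path[t]] * B[path[t]][obs[t]]
        total += p
    return total


# Hypothetical parameters, for illustration only.
pi = [0.6, 0.4]
P = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
obs = [0, 1, 1, 0]

p_fast = forward(obs, pi, P, B)
p_slow = brute_force(obs, pi, P, B)
print(p_fast, p_slow)
```

For long sequences the products underflow, so real implementations typically rescale `alpha` at each step or work with log probabilities.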

Applications of HMM Modeling protein families: (1) constructing multiple sequence alignments; (2) determining the family of a query sequence. Gene finding through semi-Hidden Markov Models (semiHMM).

HMM for Sequence Alignment Consider a Markov chain underlying an HMM, with three types of states: “match” ($m$), “insert” ($i$), and “delete” ($d$). (The state diagram from the slide is not reproduced in this transcript.)

HMM for Sequence Alignment (Cont’d) The alphabet $A$ consists of the 20 amino acids and a “delete” symbol. Delete states output only the delete symbol, with probability 1. Each insert and match state has its own distribution over the 20 amino acids and never outputs the delete symbol.

HMM for Sequence Alignment (Cont’d) There are two extreme situations, depending on the HMM parameters: (1) the emission probabilities for the match and insert states are uniform over the 20 amino acids, so the model produces random sequences; (2) each state emits one specific amino acid with probability 1 and the transition $m_i \to m_{i+1}$ occurs with probability 1, so the model always produces the same sequence.

HMM for Sequence Alignment (Cont’d) Between the two extremes, consider a “family” of somewhat similar sequences: a “tight” family of very similar sequences, or a “loose” family with little similarity. Similarity may be confined to certain areas of the sequences: this happens if some match states emit only a few amino acids, while other match states emit all amino acids uniformly (at random).
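One common way to put a number on how “tight” a match state is, not used in the slides themselves, is the Shannon entropy of its emission distribution. The sketch below compares the two extremes with a hypothetical conserved position that mostly emits two amino acids; the `conserved` distribution is invented for illustration.

```python
import math


def entropy_bits(dist):
    """Shannon entropy of a discrete distribution, in bits."""
    return sum(p * math.log2(1 / p) for p in dist if p > 0)


uniform = [1 / 20] * 20                     # extreme 1: random sequences
specific = [1.0] + [0.0] * 19               # extreme 2: one fixed amino acid
conserved = [0.45, 0.45] + [0.1 / 18] * 18  # hypothetical in-between state

for name, dist in [("uniform", uniform),
                   ("specific", specific),
                   ("conserved", conserved)]:
    print(f"{name:10s} {entropy_bits(dist):.2f} bits")
```

The uniform state sits at the maximum of about 4.32 bits (log2 of 20), the fully specific state at 0 bits, and a conserved position falls in between.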

HMM for Sequence Alignments: Procedure (A) Start with “training”, i.e. estimating, the parameters of the model using a set of training sequences from the protein family. (B) Next, compute the path of states most likely to have produced each sequence. (C) Amino acids are aligned if both are produced by the same match state in their paths. (D) Finally, gaps are placed where needed to account for insertions and deletions (indels).
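Step (B) is commonly carried out with the Viterbi algorithm, which the slides do not name. The sketch below handles a plain HMM in which every state emits a letter; a profile HMM with non-emitting delete states needs an adapted recursion that is not shown here. The parameters are hypothetical, and observations are given as letter indices.

```python
def joint(path, obs, pi, P, B):
    """Joint probability of one fully specified state path and the observations."""
    p = pi[path[0]] * B[path[0]][obs[0]]
    for t in range(1, len(obs)):
        p *= P[path[t - 1]][path[t]] * B[path[t]][obs[t]]
    return p


def viterbi(obs, pi, P, B):
    """Return (most likely state path, its joint probability with obs)."""
    n = len(pi)
    # delta[j] = probability of the best path ending in state j so far
    delta = [pi[i] * B[i][obs[0]] for i in range(n)]
    back = []  # back[t][j] = best predecessor of state j at step t + 1
    for o in obs[1:]:
        prev = [max(range(n), key=lambda i: delta[i] * P[i][j])
                for j in range(n)]
        delta = [delta[prev[j]] * P[prev[j]][j] * B[j][o] for j in range(n)]
        back.append(prev)
    last = max(range(n), key=lambda i: delta[i])
    path = [last]
    for prev in reversed(back):      # trace the back-pointers
        path.append(prev[path[-1]])
    path.reverse()
    return path, delta[last]


# Hypothetical parameters, for illustration only.
pi = [0.6, 0.4]
P = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
obs = [0, 1, 1, 0]

best_path, best_prob = viterbi(obs, pi, P, B)
print(best_path, best_prob)
```

Maximizing the joint probability of path and observations gives the same path as maximizing the probability of the path given the observations, because the two differ only by the factor P(O), which does not depend on the path.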

Example Consider the two sequences CAEFDDH and CDAEFPDDH. Suppose the model has length 10, and the most likely paths for the two sequences are $m_0\, m_1\, m_2\, m_3\, m_4\, d_5\, d_6\, m_7\, m_8\, m_9\, m_{10}$ and $m_0\, m_1\, i_1\, m_2\, m_3\, m_4\, d_5\, m_6\, m_7\, m_8\, m_9\, m_{10}$, respectively.

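Example (Cont’d) The alignment induced is found by aligning positions generated by the same match state. Each emitted letter is written after its state; the letter counts imply that $m_0$ and $m_{10}$ emit nothing, and delete states emit no amino acid.

Sequence 1: $m_0$, $m_1$ C, $m_2$ A, $m_3$ E, $m_4$ F, $d_5$, $d_6$, $m_7$ D, $m_8$ D, $m_9$ H, $m_{10}$

Sequence 2: $m_0$, $m_1$ C, $i_1$ D, $m_2$ A, $m_3$ E, $m_4$ F, $d_5$, $m_6$ P, $m_7$ D, $m_8$ D, $m_9$ H, $m_{10}$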

Example (End) This leads to the following alignment:

C–AEF–DDH

CDAEFPDDH
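The example can be reproduced with a short sketch that turns state paths into an alignment. The function names `parse_path` and `align` are hypothetical helpers, not part of any library. The code assumes that the first and last match states (`m0` and `m10`) emit nothing, which is what the letter counts in the example imply, and it drops columns in which every sequence has a delete state. Gaps are written with an ASCII hyphen.

```python
def parse_path(seq, path, begin=0, end=10):
    """Distribute the letters of seq over the states of its path.

    Returns (match, inserts): match[k] is the letter emitted by m_k, or '-'
    for d_k; inserts[k] is the string of letters emitted by i_k.
    Assumes m_begin and m_end are non-emitting (an assumption of this sketch).
    """
    letters = iter(seq)
    match, inserts = {}, {}
    for label in path.split():
        kind, k = label[0], int(label[1:])
        if kind == "d":
            match[k] = "-"
        elif kind == "i":
            inserts[k] = inserts.get(k, "") + next(letters)
        elif k not in (begin, end):          # emitting match state
            match[k] = next(letters)
    assert next(letters, None) is None, "path does not use the whole sequence"
    return match, inserts


def align(pairs, begin=0, end=10):
    """Build one aligned row per (sequence, path) pair."""
    parsed = [parse_path(s, p, begin, end) for s, p in pairs]
    rows = ["" for _ in parsed]
    for k in range(begin, end):
        if k > begin:
            col = [m.get(k, "-") for m, _ in parsed]
            if any(c != "-" for c in col):   # skip all-gap columns
                rows = [r + c for r, c in zip(rows, col)]
        # letters inserted after position k, padded to a common width
        width = max(len(ins.get(k, "")) for _, ins in parsed)
        rows = [r + ins.get(k, "").ljust(width, "-")
                for r, (_, ins) in zip(rows, parsed)]
    return rows


pairs = [
    ("CAEFDDH", "m0 m1 m2 m3 m4 d5 d6 m7 m8 m9 m10"),
    ("CDAEFPDDH", "m0 m1 i1 m2 m3 m4 d5 m6 m7 m8 m9 m10"),
]
rows = align(pairs)
print("\n".join(rows))
# C-AEF-DDH
# CDAEFPDDH
```

Position 5 is a delete state in both paths, so it contributes no column, while position 6 pairs the gap from `d6` in the first sequence with the P emitted by `m6` in the second.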

HMM: Strengths & Weaknesses HMMs align many sequences with little computing power. HMMs allow the sequences themselves to guide the alignment. Alignments by HMM are sometimes ambiguous, and some regions are left unaligned in the end. The weaknesses of HMMs come from the same assumptions as their strengths: the Markov property and stationarity.

Thank you.
