
1 Profile Hidden Markov Models (PHMM)
Mark Stamp

2 Hidden Markov Models
• Here, we assume you know about HMMs
o If not, see "A revealing introduction to hidden Markov models"
• Executive summary of HMMs
o HMM is a machine learning technique…
o …and a discrete hill climb technique
o Train a model based on an observation sequence
o Score any given sequence to determine how closely it matches the model
o Efficient algorithms and many useful applications

3 HMM Notation
• Recall, an HMM model is denoted λ = (A, B, π)
o A = state transition probability matrix
o B = observation probability matrix
o π = initial state distribution
• Observation sequence is O = (O_0, O_1, …, O_{T−1})
o T = length of the observation sequence
o N = number of hidden states, M = number of distinct observation symbols
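
To make this notation concrete, here is a minimal sketch in Python, including naive forward scoring; all dimensions and probability values below are illustrative inventions, not values from the slides:

```python
import numpy as np

# Minimal HMM "container": lambda = (A, B, pi).
# Sizes and values are illustrative: N = 2 hidden states, M = 3 symbols.
A = np.array([[0.7, 0.3],           # A[i, j] = P(state j at t+1 | state i at t)
              [0.4, 0.6]])
B = np.array([[0.1, 0.4, 0.5],      # B[i, k] = P(observe symbol k | state i)
              [0.7, 0.2, 0.1]])
pi = np.array([0.6, 0.4])           # pi[i] = P(initial state is i)

O = [0, 1, 2, 1]                    # observation sequence over symbols {0, 1, 2}

# Naive forward scoring of O against lambda = (A, B, pi):
alpha = pi * B[:, O[0]]
for t in range(1, len(O)):
    alpha = (alpha @ A) * B[:, O[t]]
print(alpha.sum())                  # P(O | lambda)
```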

4 Hidden Markov Models
• Among the many uses for HMMs…
• Speech analysis
• Music search engines
• Malware detection
• Intrusion detection systems (IDS)
• And more all the time

5 Limitations of HMMs
• Positional information is not considered
o HMM has no "memory" beyond the previous state
o Higher-order models have more "memory"
o But no explicit use of positional information
• With an HMM, there are no insertions or deletions
• These limitations are serious problems in some applications
o In bioinformatics string comparison, sequence alignment is critical
o Also, insertions and deletions can occur

6 Profile HMM
• Profile HMM (PHMM) is designed to overcome the limitations on the previous slide
o In some ways, PHMM is simpler than HMM
o In some ways, PHMM is more complex
• The basic idea of PHMM?
o Define multiple B matrices
o Almost like having an HMM for each position in the sequence

7
• In bioinformatics, begin by aligning multiple related sequences
o Multiple sequence alignment (MSA)
o Analogous to the training phase for an HMM
• Generate the PHMM based on the MSA
o This is easy, once the MSA is known
o Again, the hard part is generating the MSA
• Then we can score sequences using the PHMM
o Use the forward algorithm, similar to HMM

8 Training: PHMM vs HMM
• Training a PHMM
o Determine MSA → challenging
o Determine PHMM matrices → easy
• Training an HMM
o Append training sequences → trivial
o Determine HMM matrices → challenging
• PHMM and HMM are, in this sense, almost opposites…

9 Generic View of PHMM
• Have delete, insert, and match states
o Match states correspond to HMM states
• Arrows are possible transitions
o Each transition has a probability
• Transition probabilities form the A matrix
• Emission probabilities form the B matrices
o In a PHMM, observations are called emissions
o Match and insert states have emissions

10 PHMM without Gaps
• If there are no gaps, the PHMM is simple
• Illustrate such a PHMM as a linear chain of match states: M_1 → M_2 → … → M_n
• Here, M_i is the i-th "match state"
o This diagram neglects the B matrices
o Recall that there is a distinct B matrix for each match state

11 PHMM with Insertions
• If we also allow for insertions, the diagram adds an insert state I_i, with a self-loop, between consecutive match states
• The self-loop allows for multiple insertions at each position

12 PHMM with Deletions
• If instead we allow for deletions, the diagram adds delete states D_i, each providing a path that bypasses a match state
• Note that a deletion skips over the corresponding match state

13 Generic View of PHMM
• In the standard PHMM diagram, circles are delete states, diamonds are insert states, and squares are match states
• Note the many possible transitions

14 PHMM Notation
• Notation:
o X = (x_1, x_2, …, x_n) is the sequence of emitted (observed) symbols
o M_i, I_i, D_i are the match, insert, and delete states at position i
o a_{p,q} is the transition probability from state p to state q
o e_{M_i}(k) is the probability of emitting symbol k at match state M_i (similarly, e_{I_i}(k) for insert states)
o λ denotes the PHMM model

15
• Match state probabilities are easily determined from the MSA
o a_{M_i, M_{i+1}} = transition probability between consecutive match states
o e_{M_i}(k) = emission probability of symbol k at match state M_i
• There are many other transition probabilities
o For example, a_{M_i, I_i} and a_{M_i, D_{i+1}}
• Emissions occur at all match and insert states
o Remember, "emission" == "observation"

16 Multiple Sequence Alignment
• First, we show MSA construction
o This is the difficult part
o There are lots of ways to do this
o The "best" way depends on the specific problem
• Then construct the PHMM from the MSA
o This is the easy part
o There is a standard algorithm to generate the PHMM
• How to score a sequence?
o Forward algorithm, similar to HMM

17 MSA
• How to construct an MSA?
o Construct pairwise alignments
o Combine pairwise alignments into the MSA
• Allow gaps to be inserted
o To make better matches
• But gaps tend to weaken PHMM scoring
o So, there is a tradeoff between the number of gaps and the strength of the score

18 Global vs Local Alignment
• For these pairwise alignment examples…
o "-" is a gap
o "|" means the elements are aligned
o "*" marks omitted beginning/ending symbols

19 Global vs Local Alignment
• Global alignment is lossless
o But gaps tend to proliferate
o And gaps increase when we construct the MSA
o With more gaps, more random sequences match…
o …and the result is less useful for scoring
• We usually only consider local alignment
o That is, omit the ends for a better alignment
• For simplicity, we do global alignment in the examples presented here

20 Pairwise Alignment
• Allow gaps when aligning
• How to score an alignment?
o Based on an n x n substitution matrix S
o Where n is the number of symbols
• What algorithm(s) to align sequences?
o Usually, dynamic programming
o Sometimes, an HMM is used
o Other methods exist
• Local alignment? Additional issues arise…

21 Pairwise Alignment
• Example: [alignment figure]
• Tradeoff: gaps vs misaligned elements
o Depends on the matrix S and the gap penalty function

22 Substitution Matrix
• For example, masquerade detection
o Detect an imposter using another user's computer account
• Consider 4 different operations:
o E == send email
o G == play games
o C == C programming
o J == Java programming
• How similar are these operations to each other?

23 Substitution Matrix
• Consider the 4 operations: E, G, C, J
• Possible substitution matrix (see the sketch below for illustrative values)
• Diagonal → matches
o High positive scores
• Which others are most similar?
o J and C, so substituting C for J gets a high score
• Game playing and programming are very different
o So substituting G for C gets a negative score
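
As a concrete sketch, a possible S matrix in Python; the slide's actual values are not reproduced in this transcript, so the scores below are invented to satisfy the stated properties (high positive diagonal, C and J similar, G vs C negative):

```python
# Hypothetical substitution matrix S for operations E, G, C, J.
# Values are illustrative only; they follow the slide's stated properties.
OPS = ["E", "G", "C", "J"]

S = {
    ("E", "E"): 9,  ("E", "G"): -2, ("E", "C"): -1, ("E", "J"): -1,
    ("G", "G"): 9,  ("G", "C"): -4, ("G", "J"): -4,
    ("C", "C"): 9,  ("C", "J"): 5,
    ("J", "J"): 9,
}

def score(a: str, b: str) -> int:
    """Symmetric lookup: the score of substituting b for a (or a for b)."""
    return S.get((a, b), S.get((b, a)))

assert score("J", "C") == score("C", "J") == 5   # C and J are similar
assert score("G", "C") < 0                       # games vs programming differ
```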

24 Substitution Matrix
• Depending on the problem, it might be easy or very difficult to find a useful S matrix
• Consider masquerade detection based on UNIX commands
o Sometimes it is difficult to say how "close" two commands are
• Suppose instead we are aligning DNA sequences
o Then there are biologically valid reasons for the S matrix

25 Gap Penalty
• Generally, we must allow gaps to be inserted
• But gaps make the alignment more generic
o Less useful for scoring, so we penalize gaps
• How to penalize gaps? Two common ways (sketched below)
• Linear gap penalty function: g(x) = ax
o A constant penalty a for each of the x symbols in the gap
• Affine gap penalty function: g(x) = a + b(x − 1)
o Gap-opening penalty a, plus a constant penalty b for each extension of an existing gap
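
A minimal sketch of the two penalty functions; the parameter values a and b are illustrative:

```python
def linear_gap_penalty(x: int, a: float = 2.0) -> float:
    """g(x) = a*x: each of the x gap symbols costs the same amount a."""
    return a * x

def affine_gap_penalty(x: int, a: float = 3.0, b: float = 1.0) -> float:
    """g(x) = a + b*(x - 1): opening a gap costs a, each extension costs b."""
    return a + b * (x - 1)

# The affine form favors one long gap over several short ones:
assert affine_gap_penalty(4) < 2 * affine_gap_penalty(2)   # 6.0 < 8.0
```

Note how the affine form encourages consolidating gaps, which is often the more plausible model of insertions and deletions.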

26 Pairwise Alignment Algorithm
• We use dynamic programming
o Based on the S matrix and a gap penalty function
• Notation:
o x = (x_1, …, x_n) and y = (y_1, …, y_m) are the sequences to align
o F(i, j) is the score of the best alignment of x_1, …, x_i with y_1, …, y_j
o Here, we use a linear gap penalty of a per gap symbol

27 Pairwise Alignment DP
• Initialization: F(0, 0) = 0, F(i, 0) = F(i − 1, 0) − a, F(0, j) = F(0, j − 1) − a
• Recursion (a sketch of this DP follows below):
o F(i, j) = max{ F(i − 1, j − 1) + S(x_i, y_j), F(i − 1, j) − a, F(i, j − 1) − a }
o The three cases: align x_i with y_j, align x_i with a gap, align y_j with a gap
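
A minimal sketch of this dynamic program in Python, assuming the linear gap penalty and a toy match/mismatch substitution score; both are illustrative stand-ins for the S matrix and penalty function on the slides:

```python
def needleman_wunsch(x: str, y: str, S, a: float = 2.0):
    """Global pairwise alignment score via the F(i, j) recursion.

    S(u, v) is the substitution score; a is the per-symbol linear gap penalty.
    Returns the full DP table; F[n][m] is the optimal alignment score.
    """
    n, m = len(x), len(y)
    F = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # initialization: all-gap prefixes of x
        F[i][0] = F[i - 1][0] - a
    for j in range(1, m + 1):          # initialization: all-gap prefixes of y
        F[0][j] = F[0][j - 1] - a
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + S(x[i - 1], y[j - 1]),  # align x_i with y_j
                F[i - 1][j] - a,                          # x_i against a gap
                F[i][j - 1] - a,                          # y_j against a gap
            )
    return F

# Toy substitution score: +3 for a match, -1 for a mismatch (illustrative).
toy_S = lambda u, v: 3 if u == v else -1
table = needleman_wunsch("EGCJ", "ECJ", toy_S)
print(table[-1][-1])   # optimal global alignment score (here, 7.0)
```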

28 MSA from Pairwise Alignments
• Given pairwise alignments…
• How to construct the MSA?
• Generally, use "progressive alignment"
o Select one pairwise alignment
o Select another and combine it with the first
o Continue adding more until all are used
• Relatively easy (good)
• Gaps proliferate, and it's unstable (bad)

29 MSA from Pairwise Alignments
• There are lots of ways to improve on generic progressive alignment
o Here, we mention one such approach
o Not necessarily the "best" or most popular
• Feng-Doolittle progressive alignment
o Compute scores for all pairs of the n sequences
o Select n − 1 alignments that a) "connect" all sequences and b) maximize pairwise scores
o Then generate a minimum spanning tree
o For the MSA, add sequences in the order that they appear in the spanning tree

30 MSA Construction
• Create pairwise alignments
o Generate the substitution matrix S
o Use dynamic programming for the pairwise alignments
• Use the pairwise alignments to make the MSA
o Use pairwise alignment scores to construct a spanning tree, e.g., with Prim's algorithm (a sketch follows below)
o Add sequences in spanning tree order (from highest score, inserting gaps as needed)
o Note: the gap penalty is used here
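
A hedged sketch of the spanning-tree step, assuming complete symmetric pairwise scores are given as a dict; since higher alignment scores are better, this builds a maximum-score spanning tree (the slide's example effectively does the same, seeded from the highest-scoring pair, while this sketch starts from an arbitrary vertex):

```python
def prim_spanning_tree(n_seqs: int, score: dict) -> list:
    """Build a maximum-score spanning tree over sequences 1..n_seqs.

    score[(i, j)] is the pairwise alignment score (symmetric; all pairs
    present). Returns the tree edges in the order added, which suggests
    an order for merging sequences into the MSA.
    """
    def s(i, j):
        return score.get((i, j), score.get((j, i)))

    in_tree = {1}                     # arbitrary start vertex
    edges = []
    while len(in_tree) < n_seqs:
        # Pick the highest-scoring edge leaving the current tree.
        best = max(
            ((i, j) for i in in_tree for j in range(1, n_seqs + 1)
             if j not in in_tree),
            key=lambda e: s(*e),
        )
        edges.append(best)
        in_tree.add(best[1])
    return edges

# Tiny invented example with 3 sequences:
scores = {(1, 2): 10, (1, 3): 4, (2, 3): 7}
print(prim_spanning_tree(3, scores))   # [(1, 2), (2, 3)]
```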

31 MSA Example
• Suppose we have 10 sequences, with the following pairwise alignment scores: [score table]

32 MSA Example: Spanning Tree
• Spanning tree based on the scores: [tree figure]
• So, process pairs in the following order: (5,4), (5,8), (8,3), (3,2), (2,7), (2,1), (1,6), (6,10), (10,9)

33 MSA Snapshot
• Intermediate step and final MSA: [alignment figures]
o Use "+" for the neutral symbol
o Then "-" for gaps in the MSA
• Note the increase in gaps

34 PHMM from MSA
• In a PHMM, determine the match and insert states and their probabilities from the MSA
• "Conservative" columns become match states
o Half or fewer of the symbols are gaps
• Other columns become insert states
o A majority of the symbols are gaps
• Delete states are a separate issue

35 PHMM States from MSA
• Consider a simpler MSA… [figure]
• Columns 1, 2, and 6 are match states 1, 2, and 3, respectively
o Since half or fewer of their symbols are gaps
• Columns 3, 4, and 5 are combined to form insert state 2
o Since more than half of their symbols are gaps
o The insert state sits between the surrounding match states
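
A small sketch of the column-classification rule in Python; the MSA below is invented for illustration, with "-" as the gap symbol:

```python
def classify_columns(msa: list) -> list:
    """Label each MSA column 'match' or 'insert' per the slide's rule:
    a column with half or fewer gaps is conservative (a match state);
    a column with a majority of gaps belongs to an insert state."""
    n_rows = len(msa)
    labels = []
    for col in zip(*msa):                 # iterate over columns
        gaps = sum(1 for c in col if c == "-")
        labels.append("match" if gaps <= n_rows / 2 else "insert")
    return labels

# Invented 4-row, 6-column MSA: columns 3-5 are mostly gaps.
msa = ["ECJ--E",
       "EC-J-E",
       "EC--JE",
       "E----E"]
print(classify_columns(msa))
# ['match', 'match', 'insert', 'insert', 'insert', 'match']
```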

36 Probabilities from MSA
• Emission probabilities
o Based on the symbol distribution in the match and insert states
• State transition probabilities
o Based on the transitions in the MSA

37 Probabilities from MSA
• Emission probabilities: count the occurrences of each symbol in a state's column(s), then divide by the total number of non-gap symbols there
• But probabilities of 0 are bad
o The model overfits the training data
o So, use the "add-one" rule (see the sketch below)
o Add one to each numerator, and add the number of distinct symbols to each denominator
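
A minimal sketch of the add-one rule for one state's emission distribution, assuming the 4-symbol alphabet {E, G, C, J} from the masquerade example; the column contents below are invented:

```python
from collections import Counter

ALPHABET = ["E", "G", "C", "J"]

def emission_probs(column_symbols: list) -> dict:
    """Add-one (Laplace) smoothed emission distribution for one state.

    The raw estimate count(k) / total assigns probability 0 to unseen
    symbols; add-one uses (count(k) + 1) / (total + |alphabet|) instead.
    """
    counts = Counter(s for s in column_symbols if s != "-")  # ignore gaps
    total = sum(counts.values())
    return {k: (counts[k] + 1) / (total + len(ALPHABET)) for k in ALPHABET}

# Invented match-state column: three E's and one C; G and J are unseen.
print(emission_probs(["E", "E", "E", "C"]))
# {'E': 0.5, 'G': 0.125, 'C': 0.25, 'J': 0.125}
```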

38 Probabilities from MSA
• More emission probabilities (insert states): [figure]
• But 0 probabilities are still bad
o The model overfits the data
o Again, use the "add-one" rule
o Add one to each numerator, and add the number of distinct symbols to each denominator

39 Probabilities from MSA
• Transition probabilities: count the transitions out of each state in the MSA, then divide by the total number of transitions out of that state
• We look at some examples
o Note that "-" indicates a delete state
• First, consider the begin state: count how many sequences start in M_1, I_0, or D_1, and divide by the number of sequences
• Again, use the add-one rule

40 Probabilities from MSA
• Transition probabilities
• When there is no information in the MSA, set the probabilities to uniform
• For example, if I_1 does not appear in the MSA, then
o a_{I_1, M_2} = a_{I_1, I_1} = a_{I_1, D_2} = 1/3

41 Probabilities from MSA
• Transition probabilities, another example
• What about transitions from state D_1?
• Suppose that in the MSA, D_1 can only go to M_2; then a_{D_1, M_2} = 1
• Again, use the add-one rule (sketched below):
o a_{D_1, M_2} = 2/4, a_{D_1, I_1} = 1/4, a_{D_1, D_2} = 1/4
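
The same add-one rule applied to transitions, as a small sketch; the counts below mirror the D_1 example above:

```python
def addone_transitions(counts: dict) -> dict:
    """Add-one smoothing over the outgoing transitions of one state.

    counts maps each successor state to its observed transition count;
    smoothed prob = (count + 1) / (total + number of successors).
    """
    total = sum(counts.values())
    denom = total + len(counts)
    return {s: (c + 1) / denom for s, c in counts.items()}

# From D_1 the successors are M_2, I_1, D_2; only D_1 -> M_2 is observed.
print(addone_transitions({"M2": 1, "I1": 0, "D2": 0}))
# {'M2': 0.5, 'I1': 0.25, 'D2': 0.25}
```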

42 PHMM Emission Probabilities
• Emission probabilities for the given MSA: [table]
o Computed using the add-one rule

43 PHMM Transition Probabilities
• Transition probabilities for the given MSA: [table]
o Computed using the add-one rule

44 PHMM Summary
• Construct pairwise alignments
o Usually, using dynamic programming
• Use these to construct the MSA
o Lots of ways to do this
• Using the MSA, determine the probabilities
o Emission probabilities
o State transition probabilities
• Then we have a trained PHMM
o Now what???

45 PHMM Scoring
• We want to score sequences to see how closely they match the PHMM
• How did we score using an HMM?
o Forward algorithm
• How do we score sequences with a PHMM?
o Forward algorithm (surprised?)
• But the algorithm is a little more complex
o Due to the more complex state transitions

46 Forward Algorithm
• Notation
o Index i ranges over the emitted symbols, index j over the state positions (columns of the MSA); note that in a PHMM, i and j may not agree
o x_i is the i-th observation (emission) symbol
o q_{x_i} is the probability of x_i in the "random model" (background distribution)
o F^M_j(i) is the score of x_1, …, x_i, ending at match state M_j; similarly F^I_j(i) and F^D_j(i) for insert and delete states
o The base case is F^M_0(0) = 0
o Some states are undefined
o Undefined states are ignored in the calculation

47 Forward Algorithm
• Compute P(X | λ) recursively; for example, the match scores satisfy
o F^M_j(i) = log( e_{M_j}(x_i) / q_{x_i} )
     + log( a_{M_{j−1},M_j} exp(F^M_{j−1}(i−1))
          + a_{I_{j−1},M_j} exp(F^I_{j−1}(i−1))
          + a_{D_{j−1},M_j} exp(F^D_{j−1}(i−1)) )
o Analogous recursions give the insert scores F^I_j(i) and delete scores F^D_j(i)
• Note that F^M_j(i) depends on F^M_{j−1}(i−1), F^I_{j−1}(i−1), and F^D_{j−1}(i−1)
o And on the corresponding state transition probabilities
• A sketch of the computation follows below
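
A log-space sketch of this recursion in Python. The parameter containers, dicts keyed by state tuples with a designated end state ('M', N+1), are my own invention for illustration; the recursion itself follows the standard log-odds forward algorithm described above:

```python
import math

def logsumexp(vals):
    """log(sum(exp(v))), ignoring undefined (-inf) entries."""
    finite = [v for v in vals if v != -math.inf]
    if not finite:
        return -math.inf
    m = max(finite)
    return m + math.log(sum(math.exp(v - m) for v in finite))

def phmm_forward(x, eM, eI, a, q):
    """Log-odds forward score of sequence x against a PHMM (a sketch).

    eM[j][k], eI[j][k]: emission probs at match/insert state j (eM[0] unused;
        add-one smoothing is assumed, so no emission probability is 0)
    a[(p, s)]: transition probs for states named like ('M', 1), ('I', 0),
        ('D', 1); ('M', 0) is the begin state, ('M', N+1) the end state
    q[k]: background ("random model") probability of symbol k
    """
    N = len(eM) - 1                      # number of match states
    n = len(x)
    F = {('M', 0, 0): 0.0}               # base case: F^M_0(0) = 0

    def get(s, j, i):                    # undefined states score -inf
        return F.get((s, j, i), -math.inf)

    def trans(p, s):
        t = a.get((p, s), 0.0)
        return math.log(t) if t > 0 else -math.inf

    for i in range(0, n + 1):
        for j in range(0, N + 1):
            if i > 0 and j > 0:          # match: consumes a symbol, advances j
                F[('M', j, i)] = math.log(eM[j][x[i-1]] / q[x[i-1]]) + logsumexp([
                    trans(('M', j-1), ('M', j)) + get('M', j-1, i-1),
                    trans(('I', j-1), ('M', j)) + get('I', j-1, i-1),
                    trans(('D', j-1), ('M', j)) + get('D', j-1, i-1)])
            if i > 0:                    # insert: consumes a symbol, same j
                F[('I', j, i)] = math.log(eI[j][x[i-1]] / q[x[i-1]]) + logsumexp([
                    trans(('M', j), ('I', j)) + get('M', j, i-1),
                    trans(('I', j), ('I', j)) + get('I', j, i-1),
                    trans(('D', j), ('I', j)) + get('D', j, i-1)])
            if j > 0:                    # delete: consumes no symbol
                F[('D', j, i)] = logsumexp([
                    trans(('M', j-1), ('D', j)) + get('M', j-1, i),
                    trans(('I', j-1), ('D', j)) + get('I', j-1, i),
                    trans(('D', j-1), ('D', j)) + get('D', j-1, i)])

    # Final score: transition into the end state after emitting all of x.
    return logsumexp([trans(('M', N), ('M', N+1)) + get('M', N, n),
                      trans(('I', N), ('M', N+1)) + get('I', N, n),
                      trans(('D', N), ('M', N+1)) + get('D', N, n)])
```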

48
• We will see examples of PHMMs used in security applications later
• In particular:
o Malware detection based on opcodes
o Masquerade detection based on UNIX commands
o Malware detection based on dynamically extracted API calls

49 References
• R. Durbin, S. Eddy, A. Krogh, and G. Mitchison, Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, Cambridge University Press, 1998
• L. Huang and M. Stamp, Masquerade detection using profile hidden Markov models, Computers & Security, 30(8):732-747, 2011
• S. Attaluri, S. McGhee, and M. Stamp, Profile hidden Markov models for metamorphic virus detection, Journal in Computer Virology, 5(2):151-169, 2009

