Profile Hidden Markov Models (PHMM) — Mark Stamp

Hidden Markov Models
• Here, we assume you know about HMMs
o If not, see "A revealing introduction to hidden Markov models"
• Executive summary of HMMs
o HMM is a machine learning technique…
o …and a discrete hill climb technique
o Train a model based on an observation sequence
o Score any given sequence to determine how closely it matches the model
o Efficient algorithms, and many useful applications

HMM Notation
• Recall, the HMM model is denoted λ = (A, B, π)
• Observation sequence is O
• Notation: [notation table not captured in transcript]

Hidden Markov Models
• Among the many uses for HMMs…
• Speech analysis
• Music search engines
• Malware detection
• Intrusion detection systems (IDS)
• And more all the time

Limitations of HMMs
• Positional information not considered
o HMM has no "memory" beyond the previous state
o Higher-order models have more "memory"
o But no explicit use of positional information
• With an HMM, no insertions or deletions
• These limitations are serious problems in some applications
o In bioinformatics string comparison, sequence alignment is critical
o Also, insertions and deletions can occur

Profile HMM
• Profile HMM (PHMM) is designed to overcome the limitations on the previous slide
o In some ways, PHMM is easier than HMM
o In some ways, PHMM is more complex
• The basic idea of PHMM?
o Define multiple B matrices
o Almost like having an HMM for each position in the sequence

• In bioinformatics, begin by aligning multiple related sequences
o Multiple sequence alignment (MSA)
o Analogous to the training phase for an HMM
• Generate a PHMM based on the given MSA
o This is easy, once the MSA is known
o Again, the hard part is generating the MSA
• Then we can score sequences using the PHMM
o Use the forward algorithm, similar to HMM

Training: PHMM vs HMM
• Training a PHMM
o Determine MSA → nontrivial
o Determine PHMM matrices → trivial
• Training an HMM
o Append training sequences → trivial
o Determine HMM matrices → nontrivial
• PHMM and HMM are, in this sense, opposites…

Generic View of PHMM
• Have delete, insert, and match states
o Match states correspond to HMM states
• Arrows are possible transitions
o Each transition has a probability
• Transition probabilities form the A matrix
• Emission probabilities form the B matrices
o In a PHMM, observations are emissions
o Match and insert states have emissions

Generic View of PHMM
• Circles are delete states, diamonds are insert states, squares are match states
• Also, begin and end states
[state-transition diagram not captured in transcript]

PHMM Notation
• Notation: [notation table not captured in transcript]

• Match state probabilities are easily determined from the MSA
o a_{M_i,M_{i+1}} : transitions between match states
o e_{M_i}(k) : emission probability at match state i
• Many other transition probabilities
o For example, a_{M_i,I_i} and a_{M_i,D_{i+1}}
• Emissions at all match and insert states
o Remember, emission == observation

Multiple Sequence Alignment
• First we show MSA construction
o This is the difficult part
o Lots of ways to do this
o "Best" way depends on the specific problem
• Then construct the PHMM from the MSA
o This is the easy part
o Standard algorithm for this
• How to score a sequence?
o Forward algorithm, similar to HMM

MSA
• How to construct an MSA?
o Construct pairwise alignments
o Combine pairwise alignments into an MSA
• Allow gaps to be inserted
o To make better matches
• Gaps tend to weaken PHMM scoring
o So, there is a tradeoff between the number of gaps and the strength of the score

Global vs Local Alignment
• In these pairwise alignment examples
o "-" is a gap
o "|" means elements are aligned
o "*" is used for omitted beginning/ending symbols
[alignment examples not captured in transcript]

Global vs Local Alignment
• Global alignment is lossless
o But gaps tend to proliferate
o And gaps increase when we do MSA
o More gaps, more random sequences match…
o …and the result is less useful for scoring
• We usually only consider local alignment
o That is, omit the ends for a better alignment
• For simplicity, assume global alignment in the examples presented here

Pairwise Alignment
• Allow gaps when aligning
• How to score an alignment?
o Based on an n x n substitution matrix S
o Where n is the number of symbols
• What algorithm(s) to align sequences?
o Usually, dynamic programming
o Sometimes, an HMM is used
o Others?
• Local alignment? Additional issues arise…

Pairwise Alignment
• Example: [alignment example not captured in transcript]
• Tradeoff: gaps vs misaligned elements
o Depends on the matrix S and the gap penalty

Substitution Matrix
• For example, masquerade detection
o Detect an imposter using a computer account
• Consider 4 different operations:
o E == send
o G == play games
o C == C programming
o J == Java programming
• How similar are these to each other?

Substitution Matrix
• Consider 4 different operations:
o E, G, C, J
• Possible substitution matrix: [matrix not captured in transcript]
• Diagonal → matches
o High positive scores
• Which others are most similar?
o J and C, so substituting C for J gets a high score
• Game playing and programming are very different
o So substituting G for C gets a negative score
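The scoring pattern described above can be sketched in Python. The numeric values below are invented for illustration (the slide's actual matrix is not reproduced in the transcript); they merely follow the stated pattern: high positive scores on the diagonal, a positive score for the similar pair C/J, and negative scores for dissimilar pairs.

```python
# Hypothetical substitution matrix over the four operations E, G, C, J.
# All values are illustrative, not taken from the slides.
S = {
    ("E", "E"): 9, ("G", "G"): 9, ("C", "C"): 9, ("J", "J"): 9,
    ("C", "J"): 5, ("J", "C"): 5,    # C and Java programming: similar
    ("G", "C"): -4, ("C", "G"): -4,  # gaming vs programming: dissimilar
    ("G", "J"): -4, ("J", "G"): -4,
    ("E", "G"): -2, ("G", "E"): -2,
    ("E", "C"): -3, ("C", "E"): -3,
    ("E", "J"): -3, ("J", "E"): -3,
}

def alignment_score(x, y, gap_penalty=5):
    """Score an equal-length gapped alignment using S; '-' is a gap."""
    score = 0
    for a, b in zip(x, y):
        if a == "-" or b == "-":
            score -= gap_penalty   # each gap costs a fixed penalty
        else:
            score += S[(a, b)]     # substitution matrix lookup
    return score

print(alignment_score("CJ-E", "CCGE"))  # 9 + 5 - 5 + 9 = 18
```

Note how the C-for-J substitution still contributes positively, while the gap costs more than a similar-symbol substitution.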

Substitution Matrix
• Depending on the problem, it might be easy or very difficult to find a useful S matrix
• Consider masquerade detection based on UNIX commands
o Sometimes difficult to say how "close" two commands are
• Suppose instead we are aligning DNA sequences
o There are biological reasons for the S matrix

Gap Penalty
• Generally, we must allow gaps to be inserted
• But gaps make an alignment more generic
o Less useful for scoring, so we penalize gaps
• How to penalize gaps?
• Linear gap penalty function: g(x) = ax
o Constant penalty for every gap
• Affine gap penalty function: g(x) = a + b(x − 1)
o Gap-opening penalty a, and a constant penalty of b for each extension of an existing gap
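The two penalty functions above translate directly into code. The parameter values a and b below are arbitrary illustrations, not values from the slides; the point is that with a > b, an affine penalty makes one long gap much cheaper than several short gaps.

```python
def linear_gap_penalty(x, a=3):
    """Linear gap penalty g(x) = a*x: constant penalty a per gap symbol."""
    return a * x

def affine_gap_penalty(x, a=5, b=1):
    """Affine gap penalty g(x) = a + b*(x - 1): cost a to open a gap,
    plus b for each of the (x - 1) extensions of the existing gap."""
    return a + b * (x - 1)

print(linear_gap_penalty(4))      # 3 * 4 = 12
print(affine_gap_penalty(4))      # 5 + 1 * 3 = 8
print(4 * affine_gap_penalty(1))  # four separate one-symbol gaps: 20
```

So under the affine penalty, one gap of length 4 (cost 8) is far cheaper than four isolated gaps (cost 20), which encourages alignments with fewer, longer gaps.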

Pairwise Alignment Algorithm
• We use dynamic programming
o Based on the S matrix and the gap penalty function
• Notation: [notation not captured in transcript]

Pairwise Alignment DP
• Initialization: [equations not captured in transcript]
• Recursion: [equations not captured in transcript]
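Since the slide's initialization and recursion did not survive transcription, here is a sketch of the standard global-alignment dynamic program (Needleman-Wunsch style), assuming a simple linear gap penalty; the scoring function and penalty value are illustrative, not taken from the slides.

```python
def needleman_wunsch(x, y, score, gap=2):
    """Best global pairwise alignment score by dynamic programming.
    F[i][j] = best score aligning x[:i] with y[:j], assuming a linear
    gap penalty of `gap` per gap symbol."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    # Initialization: aligning a prefix entirely against gaps
    for i in range(1, n + 1):
        F[i][0] = F[i - 1][0] - gap
    for j in range(1, m + 1):
        F[0][j] = F[0][j - 1] - gap
    # Recursion: best of substitution, gap in y, or gap in x
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + score(x[i - 1], y[j - 1]),  # align symbols
                F[i - 1][j] - gap,                            # gap in y
                F[i][j - 1] - gap,                            # gap in x
            )
    return F[n][m]

# Toy scoring function: +3 for a match, -1 for a mismatch
s = lambda a, b: 3 if a == b else -1
print(needleman_wunsch("GATTACA", "GATCA", s))  # 5 matches, 2 gaps: 11
```

The optimal alignment here is GATTACA over GAT--CA: five matches (+15) and two gaps (-4).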

MSA from Pairwise Alignments
• Given pairwise alignments…
• How to construct an MSA?
• Generally use "progressive alignment"
o Select one pairwise alignment
o Select another and combine it with the first
o Continue to add more until all are combined
• Relatively easy (good)
• Gaps proliferate, and it's unstable (bad)

MSA from Pairwise Alignments
• Lots of ways to improve on generic progressive alignment
o Here, we mention one such approach
o Not necessarily "best" or most popular
• Feng-Doolittle progressive alignment
o Compute scores for all pairs of n sequences
o Select n−1 alignments that (a) "connect" all sequences and (b) maximize pairwise scores
o Then generate a minimum spanning tree
o For MSA, add sequences in the order that they appear in the spanning tree

MSA Construction
• Create pairwise alignments
o Generate the substitution matrix S
o Dynamic programming for pairwise alignments
• Use pairwise alignments to make the MSA
o Use pairwise alignment scores to construct a spanning tree (e.g., via Prim's algorithm)
o Add sequences in spanning tree order (from high score, inserting gaps as needed)
o Note: the gap penalty is used here
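The spanning-tree step can be sketched as a Prim-style greedy construction over the pairwise scores: since we want to maximize scores, we grow a maximum spanning tree, always adding the highest-scoring edge that connects a new sequence to the tree. The scores below are made up for illustration.

```python
def spanning_tree_order(scores, n):
    """Order in which to add sequences 1..n to the MSA: grow a maximum
    spanning tree over the pairwise-score graph, Prim-style.
    scores[(i, j)] holds the pairwise alignment score for i < j."""
    def score(i, j):
        return scores[(min(i, j), max(i, j))]
    # Start from an endpoint of the single best-scoring pair
    best = max(scores, key=scores.get)
    in_tree = {best[0]}
    order = []
    while len(in_tree) < n:
        # Highest-scoring edge from the tree to a sequence not yet in it
        i, j = max(
            ((i, j) for i in in_tree
                    for j in range(1, n + 1) if j not in in_tree),
            key=lambda e: score(*e),
        )
        order.append((i, j))
        in_tree.add(j)
    return order

# Toy example with 4 sequences (scores invented for illustration)
scores = {(1, 2): 7, (1, 3): 2, (1, 4): 1, (2, 3): 5, (2, 4): 3, (3, 4): 6}
print(spanning_tree_order(scores, 4))  # [(1, 2), (2, 3), (3, 4)]
```

The returned pair order is exactly the order in which sequences would be merged into the growing MSA.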

MSA Example
• Suppose 10 sequences, with the following pairwise alignment scores: [score table not captured in transcript]

MSA Example: Spanning Tree
• Spanning tree based on scores: [tree diagram not captured in transcript]
• So, process pairs in the following order: (5,4), (5,8), (8,3), (3,2), (2,7), (2,1), (1,6), (6,10), (10,9)

MSA Snapshot
• Intermediate step and final result: [snapshots not captured in transcript]
o Use "+" for a neutral symbol
o Then "-" for gaps in the MSA
• Note the increase in gaps

PHMM from MSA
• In a PHMM, determine the match and insert states and their probabilities from the MSA
• "Conservative" columns == match states
o Half or less of the symbols are gaps
• Other columns are insert states
o A majority of the symbols are gaps
• Delete states are a separate issue

PHMM States from MSA
• Consider a simpler MSA… [not captured in transcript]
• Columns 1, 2, and 6 are match states 1, 2, and 3, respectively
o Since they are less than half gaps
• Columns 3, 4, and 5 are combined to form insert state 2
o Since they are more than half gaps
o This insert state lies between match states 2 and 3
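The column-classification rule can be sketched as follows. The toy MSA below is invented for illustration (the slide's MSA is not in the transcript), arranged so that columns 1, 2, and 6 are match states while columns 3-5 have a gap majority and would fold into an insert state.

```python
def classify_columns(msa):
    """Label each MSA column: 'M' if half or fewer of its symbols are
    gaps (a "conservative" column, i.e., a match state), 'I' if gaps
    are in the majority (column folds into an insert state)."""
    ncols = len(msa[0])
    labels = []
    for c in range(ncols):
        gaps = sum(1 for row in msa if row[c] == "-")
        labels.append("M" if gaps <= len(msa) / 2 else "I")
    return labels

# Toy 4-sequence MSA: columns 1, 2, 6 (1-based) are match states
msa = ["AC-A-T",
       "AG---T",
       "A----T",
       "-CA--T"]
print(classify_columns(msa))  # ['M', 'M', 'I', 'I', 'I', 'M']
```

Consecutive 'I' columns are merged into a single insert state, which is why columns 3-5 together form just one insert state between two match states.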

Probabilities from MSA
• Emission probabilities
o Based on the symbol distribution in match and insert states
• State transition probabilities
o Based on transitions in the MSA

Probabilities from MSA
• Emission probabilities: [example not captured in transcript]
• But 0 probabilities are bad
o The model overfits the data
o So, use the "add one" rule
o Add one to each numerator, and add the total to each denominator

Probabilities from MSA
• More emission probabilities: [example not captured in transcript]
• But 0 probabilities are still bad
o The model overfits the data
o Again, use the "add one" rule
o Add one to each numerator, and add the total to each denominator
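The add-one rule for a single match-state column might look like the sketch below; the alphabet and column contents are illustrative. Adding one to each count (and the alphabet size to the denominator) guarantees no symbol gets probability zero.

```python
from collections import Counter

def emission_probs(column_symbols, alphabet):
    """Emission probabilities for one match state, with the add-one
    rule: add 1 to each symbol count and |alphabet| to the denominator
    so that no symbol has probability zero. Gaps are excluded."""
    counts = Counter(s for s in column_symbols if s != "-")
    total = sum(counts.values())
    return {a: (counts[a] + 1) / (total + len(alphabet))
            for a in alphabet}

# Toy MSA column over the alphabet {E, G, C, J}
probs = emission_probs(["C", "C", "J", "-"], "EGCJ")
print(probs)  # C: 3/7, J: 2/7, E and G: 1/7 each
```

Without the rule, E and G would get probability 0 and any sequence containing them would score zero; with it, the probabilities still sum to 1.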

Probabilities from MSA
• Transition probabilities: [examples not captured in transcript]
• We look at some examples
o Note that "-" is the delete state
• First, consider the begin state
• Again, use the add-one rule

Probabilities from MSA
• Transition probabilities
• When there is no information in the MSA, set the probabilities to uniform
• For example, I_1 does not appear in the MSA, so its outgoing transition probabilities are set to uniform

Probabilities from MSA
• Transition probabilities, another example
• What about transitions from state D_1?
o D_1 can only go to M_2, so that transition probability is 1
• Again, use the add-one rule: [result not captured in transcript]

PHMM Emission Probabilities
• Emission probabilities for the given MSA: [table not captured in transcript]
o Using the add-one rule

PHMM Transition Probabilities
• Transition probabilities for the given MSA: [table not captured in transcript]
o Using the add-one rule

PHMM Summary
• Construct pairwise alignments
o Usually, using dynamic programming
• Use these to construct the MSA
o Lots of ways to do this
• Using the MSA, determine probabilities
o Emission probabilities
o State transition probabilities
• Then we have trained a PHMM
o Now what?

PHMM Scoring
• Want to score sequences to see how closely they match the PHMM
• How did we score using an HMM?
o Forward algorithm
• How to score sequences with a PHMM?
o Forward algorithm (surprised?)
• But the algorithm is a little more complex
o Due to more complex state transitions

Forward Algorithm
• Notation
o Indices i and j are columns in the MSA
o x_i is the i-th observation (emission) symbol
o q_{x_i} is the distribution of x_i in the "random model"
o Base case is [not captured in transcript]
o The forward variable is the score of x_1,…,x_i up to state j (note that in a PHMM, i and j may not agree)
o Some states are undefined
o Undefined states are ignored in the calculation

Forward Algorithm
• Compute P(X|λ) recursively
• Note that each forward variable depends on its predecessor variables, and on the corresponding state transition probabilities
[recursion not captured in transcript]
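Since the slide's recursion did not survive transcription, here is a plain-probability sketch of the PHMM forward algorithm. It omits the random-model (q) normalization mentioned in the notation slide, and the state naming and data structures are this sketch's own assumptions, not the slides'. Match and insert states consume a symbol; delete states are silent and only advance the state index, which is why i and j can disagree.

```python
def phmm_forward(seq, N, a, e_match, e_ins):
    """Forward algorithm for a PHMM with N match states, in plain
    probabilities. States are named "M0" (begin) through "M{N}", with
    insert states "I0".."I{N}", delete states "D1".."D{N}", and "End".
    Transition probabilities a[(s, t)] default to 0 when absent."""
    L = len(seq)
    # fM[j][i], fI[j][i], fD[j][i]: probability of having emitted the
    # first i symbols and being in state M_j, I_j, or D_j
    fM = [[0.0] * (L + 1) for _ in range(N + 1)]
    fI = [[0.0] * (L + 1) for _ in range(N + 1)]
    fD = [[0.0] * (L + 1) for _ in range(N + 1)]
    fM[0][0] = 1.0  # begin state, nothing emitted yet
    t = lambda s, u: a.get((s, u), 0.0)
    for i in range(0, L + 1):
        x = seq[i - 1] if i > 0 else None
        for j in range(0, N + 1):
            if j > 0 and i > 0:  # match: consumes a symbol, advances j
                fM[j][i] = e_match.get((j, x), 0.0) * (
                    t(f"M{j-1}", f"M{j}") * fM[j - 1][i - 1]
                    + t(f"I{j-1}", f"M{j}") * fI[j - 1][i - 1]
                    + t(f"D{j-1}", f"M{j}") * fD[j - 1][i - 1])
            if i > 0:            # insert: consumes a symbol, same j
                fI[j][i] = e_ins.get((j, x), 0.0) * (
                    t(f"M{j}", f"I{j}") * fM[j][i - 1]
                    + t(f"I{j}", f"I{j}") * fI[j][i - 1]
                    + t(f"D{j}", f"I{j}") * fD[j][i - 1])
            if j > 0:            # delete: silent, advances j only
                fD[j][i] = (t(f"M{j-1}", f"D{j}") * fM[j - 1][i]
                            + t(f"I{j-1}", f"D{j}") * fI[j - 1][i]
                            + t(f"D{j-1}", f"D{j}") * fD[j - 1][i])
    return (t(f"M{N}", "End") * fM[N][L]
            + t(f"I{N}", "End") * fI[N][L]
            + t(f"D{N}", "End") * fD[N][L])

# Degenerate 2-match-state model that emits exactly "AB": P("AB") = 1
a = {("M0", "M1"): 1.0, ("M1", "M2"): 1.0, ("M2", "End"): 1.0}
e = {(1, "A"): 1.0, (2, "B"): 1.0}
print(phmm_forward("AB", 2, a, e, {}))  # 1.0
```

In practice the computation is done with log-odds scores against the random model to avoid underflow; the structure of the recursion is the same.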

• We will see examples of PHMMs later
• In particular,
o Malware detection based on opcodes
o Masquerade detection based on UNIX commands

References
• Durbin, et al., Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids
• L. Huang and M. Stamp, Masquerade detection using profile hidden Markov models, Computers & Security, 30(8), 2011
• S. Attaluri, S. McGhee, and M. Stamp, Profile hidden Markov models for metamorphic virus detection, Journal in Computer Virology, 5(2), 2009