Introduction to Profile Hidden Markov Models

Introduction to Profile Hidden Markov Models. Mark Stamp.

Hidden Markov Models. Here, we assume you know about HMMs; if not, see "A Revealing Introduction to Hidden Markov Models." Executive summary of HMMs: an HMM is a machine learning technique, and also a discrete hill climb technique. We train a model based on an observation sequence, then score a given sequence to see how closely it matches the model. Efficient algorithms exist, with many useful applications.

HMM Notation. Recall that an HMM is denoted λ = (A, B, π), where A is the state transition probability matrix, B is the observation probability matrix, and π is the initial state distribution. The observation sequence is O.

Hidden Markov Models. Among the many uses for HMMs: speech analysis, music search engines, malware detection, intrusion detection systems (IDS), and many more, with more all the time.

Limitations of HMMs. Positional information is not considered: an HMM has no "memory." Higher order models have some memory, but still make no explicit use of positional information. An HMM also does not handle insertions or deletions. These limitations are serious problems in some applications; in bioinformatics string comparison, sequence alignment is critical, and insertions and deletions do occur.

Profile HMM. The profile HMM (PHMM) is designed to overcome the limitations on the previous slide. In some ways a PHMM is easier than an HMM, and in some ways it is more complex. The basic idea of a PHMM is to define multiple B matrices, almost like having an HMM for each position in the sequence.

PHMM. In bioinformatics, we begin by aligning multiple related sequences, a multiple sequence alignment (MSA); this is like the training phase for an HMM. We then generate the PHMM based on the given MSA, which is easy once the MSA is known; the hard part is generating the MSA. Finally, we can score sequences using the PHMM, via the forward algorithm, as with an HMM.

Generic View of PHMM. In the state diagram, circles are delete states, diamonds are insert states, and rectangles are match states; match states correspond to HMM states. Arrows are possible transitions, and each transition has an associated probability. The transition probabilities form the A matrix and the emission probabilities form the B matrices. In a PHMM, observations are emissions, and both match and insert states have emissions.

Generic View of PHMM. As before, circles are delete states, diamonds are insert states, and rectangles are match states; the diagram also includes begin and end states.

PHMM Notation. The notation parallels that of the HMM: Mi, Ii, and Di denote match, insert, and delete states; a denotes state transition probabilities (for example, aMi,Mi+1); eMi(k) denotes the probability of emitting symbol k at match state Mi; and X = (x1, x2, ...) is the observed (emitted) sequence.

PHMM. Match state probabilities are easily determined from the MSA: aMi,Mi+1 gives the transitions between match states, and eMi(k) gives the emission probability of symbol k at a match state. Note that there are other transition probabilities as well, for example aMi,Ii and aMi,Di+1, and there are emissions at all match and insert states. Remember, emission == observation.

MSA. First we show how to construct the MSA; this is the difficult part, there are lots of ways to do it, and the "best" way depends on the specific problem. Then we construct the PHMM from the MSA; this is the easy part, and there is a standard algorithm for it. How to score a sequence? With the forward algorithm, similar to an HMM.

MSA. How to construct an MSA? Construct pairwise alignments, then combine the pairwise alignments to obtain the MSA. We allow gaps to be inserted, since gaps make for better matches, but gaps also tend to weaken scoring, so there is a tradeoff.

Global vs Local Alignment. In these pairwise alignment examples, "-" is a gap, "|" marks aligned symbols, and "*" marks omitted beginning and ending symbols.

Global vs Local Alignment. Global alignment is lossless, but gaps tend to proliferate, and gaps increase further when we do an MSA. More gaps implies that more sequences match, so the result is less useful for scoring. We usually only consider local alignment, that is, we omit the ends for a better alignment. For simplicity, we assume global alignment here.

Pairwise Alignment. We allow gaps when aligning. How to score an alignment? Based on an n x n substitution matrix S, where n is the number of symbols. What algorithm(s) to use to align sequences? Usually dynamic programming; sometimes an HMM is used; other methods are possible. Local alignment raises more issues.

Pairwise Alignment Example. Note the gaps vs the misaligned elements; which alignment we get depends on S and the gap penalty.

Substitution Matrix. Masquerade detection: detect an imposter using an account. Consider 4 different operations: E == send email, G == play games, C == C programming, J == Java programming. How similar are these to each other?

Substitution Matrix. Consider the 4 operations E, G, C, J and a possible substitution matrix. The diagonal entries are matches and get high positive scores. Which others are most similar? J and C, so substituting C for J gets a high score. Game playing and programming are very different, so substituting G for C gets a negative score.
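
The matrix itself was shown as a table on the slide and is not reproduced here. The sketch below uses illustrative values chosen only to follow the pattern just described: high positive scores on the diagonal, a positive score for the similar pair C and J, and negative scores for dissimilar pairs such as G and C.

```python
# Illustrative substitution matrix over the four operations E, G, C, J.
# These numbers are hypothetical; they only follow the pattern described
# on the slide (high diagonal, C/J similar, G vs. programming dissimilar).
S = {
    ("E", "E"): 9, ("G", "G"): 9, ("C", "C"): 9, ("J", "J"): 9,  # matches
    ("C", "J"): 5,                     # C and Java programming are similar
    ("E", "G"): -2,                    # email vs. games
    ("E", "C"): -3, ("E", "J"): -3,    # email vs. programming
    ("G", "C"): -4, ("G", "J"): -4,    # game playing vs. programming
}

def score(a, b):
    """Symmetric lookup into the substitution matrix."""
    return S.get((a, b), S.get((b, a)))

print(score("J", "C"))  # 5
print(score("C", "G"))  # -4
```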

Substitution Matrix. Depending on the problem, it might be easy or very difficult to get a useful S matrix. Consider masquerade detection based on UNIX commands: it is sometimes difficult to say how "close" two commands are. Now suppose we are aligning DNA sequences: there is a biological rationale for the closeness of symbols.

Gap Penalty. We generally must allow gaps to be inserted, but gaps make an alignment more generic and so less useful for scoring; therefore, we penalize gaps. How to penalize gaps? A linear gap penalty function f(g) = dg (i.e., a constant penalty per gap position), or an affine gap penalty function f(g) = a + e(g - 1), with a gap opening penalty a and then a constant factor of e per additional gap position.
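
The two penalty functions can be written directly as code. This is a minimal sketch; the parameter values d, a, and e below are illustrative, not values taken from the slides.

```python
def linear_gap_penalty(g, d=2):
    """Linear penalty: constant cost d per gap position, f(g) = d*g."""
    return d * g

def affine_gap_penalty(g, a=3, e=1):
    """Affine penalty: opening cost a, then cost e per additional gap
    position, f(g) = a + e*(g - 1) for g >= 1."""
    if g == 0:
        return 0
    return a + e * (g - 1)

# A gap of length 4 under these example parameters:
print(linear_gap_penalty(4))   # 8
print(affine_gap_penalty(4))   # 6
```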

Pairwise Alignment Algorithm. We use dynamic programming, based on the S matrix and the gap penalty function. Notation: F(i, j) is the score of the best alignment of the first i symbols of one sequence with the first j symbols of the other.

Pairwise Alignment DP. Initialization and recursion (see the sketch below).
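
The initialization and recursion on this slide were given as equations in the original graphics and are not reproduced in the transcript. The sketch below follows the standard global-alignment dynamic program (in the style of Durbin et al.) with a linear gap penalty d per gap position; F[i][j] is the best score for aligning the first i symbols of x with the first j symbols of y. The function name, its parameters, and the default d are assumptions made for illustration.

```python
def global_align_score(x, y, score, d=2):
    """Standard global-alignment dynamic program with a linear gap
    penalty of d per gap position.

    score(a, b) returns the substitution-matrix entry for symbols a, b
    (e.g., the score() function sketched earlier).
    """
    n, m = len(x), len(y)
    # Initialization: aligning a prefix against the empty string costs gaps only.
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = -i * d
    for j in range(1, m + 1):
        F[0][j] = -j * d
    # Recursion: match/mismatch, or a gap in one of the two sequences.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + score(x[i - 1], y[j - 1]),  # align x_i with y_j
                F[i - 1][j] - d,                              # gap in y
                F[i][j - 1] - d,                              # gap in x
            )
    return F[n][m]

# Example using the hypothetical E/G/C/J substitution matrix from above:
# print(global_align_score("JJCC", "JCCE", score))
```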

MSA from Pairwise Alignments. Given pairwise alignments, how do we construct the MSA? The generic approach is "progressive alignment": select one pairwise alignment, select another and combine it with the first, and continue adding more until all are combined. This is relatively easy (good), but gaps may proliferate and the process can be unstable (bad).

MSA from Pairwise Alignments. There are lots of ways to improve on generic progressive alignment; here we mention one such approach, not necessarily the "best" or most popular. Feng-Doolittle progressive alignment: compute scores for all pairs of the n sequences, select n-1 alignments that (a) "connect" all sequences and (b) maximize the pairwise scores, then generate a spanning tree. For the MSA, add sequences in the order that they appear in the spanning tree.

MSA Construction. Create pairwise alignments: generate a substitution matrix, then run the dynamic program for the pairwise alignments. Use the pairwise alignments to make the MSA: construct a spanning tree (e.g., with Prim's algorithm), then add sequences to the MSA in spanning tree order (starting from the highest score, inserting gaps as needed). Note that the gap penalty is used here. A sketch of the spanning-tree step appears below.
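
A minimal sketch of the spanning-tree step, using a Prim-style greedy construction over the pairwise scores as the slide suggests. The data layout (frozenset pairs mapped to scores) and the small score table are assumptions for illustration; the 10-sequence table from the slides is not reproduced.

```python
def spanning_tree_order(scores, start):
    """Prim-style construction of a maximum-score spanning tree over the
    sequences, returning edges in the order they are added to the tree.

    scores: dict mapping frozenset({i, j}) -> pairwise alignment score.
    start:  index of the sequence to begin from.
    """
    nodes = {i for pair in scores for i in pair}
    in_tree = {start}
    order = []
    while in_tree != nodes:
        # Pick the highest-scoring edge that joins the tree to a new sequence.
        best = max(
            (pair for pair in scores if len(pair & in_tree) == 1),
            key=lambda pair: scores[pair],
        )
        old = next(iter(best & in_tree))   # sequence already in the tree
        new = next(iter(best - in_tree))   # sequence being added
        order.append((old, new))
        in_tree.add(new)
    return order

# Hypothetical scores among 4 sequences:
scores = {
    frozenset({1, 2}): 10, frozenset({1, 3}): 3, frozenset({1, 4}): 2,
    frozenset({2, 3}): 8,  frozenset({2, 4}): 1, frozenset({3, 4}): 6,
}
print(spanning_tree_order(scores, start=1))  # [(1, 2), (2, 3), (3, 4)]
```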

MSA Example. Suppose we have 10 sequences, with pairwise alignment scores for each pair (the scores table is given on the slide).

MSA Example: Spanning Tree. The spanning tree is based on the scores, so we process the pairs in the following order: (5,4), (5,8), (8,3), (3,2), (2,7), (2,1), (1,6), (6,10), (10,9).

MSA Snapshot. An intermediate step and the final MSA; note the increase in gaps. We use "+" for a neutral symbol, and then "-" for gaps in the MSA.

PHMM from MSA. For the PHMM, we must determine the match and insert states and their probabilities from the MSA. "Conservative" columns, in which half or fewer of the symbols are gaps, are match states; the other columns, in which a majority of the symbols are gaps, are insert states. Delete states are a separate issue.

PHMM States from MSA. Consider a simpler MSA. Columns 1, 2, and 6 are match states 1, 2, and 3, respectively, since less than half of their symbols are gaps. Columns 3, 4, and 5 are combined to form insert state 2, since more than half of their symbols are gaps; this insert state sits between match states. A sketch of the column classification appears below.
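
A minimal sketch of the column classification described above. The toy MSA is hypothetical; it is constructed only so that columns 1, 2, and 6 come out as match states and columns 3, 4, and 5 as an insert region, mirroring the structure of the slide's example.

```python
def classify_columns(msa):
    """Classify each MSA column as a match state or part of an insert region.

    msa: list of equal-length aligned sequences, with '-' for gaps.
    A column is "conservative" (match) if half or fewer of its symbols
    are gaps; otherwise it contributes to an insert state.
    """
    n_seqs = len(msa)
    n_cols = len(msa[0])
    labels = []
    for j in range(n_cols):
        column = [seq[j] for seq in msa]
        gaps = column.count("-")
        labels.append("match" if gaps <= n_seqs / 2 else "insert")
    return labels

# Hypothetical MSA: columns 3, 4, 5 are gap-heavy, so they form an insert
# region between match states 2 and 3.
msa = ["ACA--T",
       "AG--GT",
       "AC-C-T",
       "AG---T"]
print(classify_columns(msa))
# ['match', 'match', 'insert', 'insert', 'insert', 'match']
```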

PHMM Probabilities from MSA. Emission probabilities are based on the symbol distribution in the match and insert states; state transition probabilities are based on the transitions in the MSA.

PHMM Probabilities from MSA. Emission probabilities are estimated from counts in the MSA, but probabilities of 0 are bad, since the model then "overfits" the data. So we use the "add one" rule: add one to each numerator (one per alphabet symbol), and add the total to each denominator.

PHMM Probabilities from MSA. More emission probabilities are computed the same way: again, 0 probabilities are bad, since the model "overfits" the data, so we use the "add one" rule, adding one to each numerator and the total to each denominator. A sketch of this computation appears below.
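
A minimal sketch of the add-one rule for emission probabilities at a single state. The function name, its arguments, and the toy column are assumptions for illustration.

```python
from collections import Counter

def emission_probs(column_symbols, alphabet):
    """Emission probabilities for one match (or insert) state, estimated
    from the non-gap symbols in the corresponding MSA column(s), with the
    add-one rule to avoid zero probabilities."""
    symbols = [s for s in column_symbols if s != "-"]
    counts = Counter(symbols)
    total = len(symbols) + len(alphabet)  # add one per alphabet symbol
    return {a: (counts[a] + 1) / total for a in alphabet}

# Toy column over a 4-letter alphabet: raw counts {A: 3, C: 1} become
# smoothed probabilities instead of assigning 0 to G and T.
print(emission_probs(list("AAC-A"), "ACGT"))
# {'A': 0.5, 'C': 0.25, 'G': 0.125, 'T': 0.125}
```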

PHMM Probabilities from MSA. Transition probabilities: we look at some examples. Note that "-" corresponds to a delete state. First, consider the begin state; again, we use the add-one rule.

PHMM Probabilities from MSA. Transition probabilities: when there is no information in the MSA, we set the probabilities to uniform. For example, I1 does not appear in the MSA, so each of its outgoing transition probabilities is set to 1/3.

PHMM Probabilities from MSA. Transition probabilities, another example: what about transitions from state D1? In the MSA it can only go to M2, so without smoothing that transition would have probability 1 and the others 0; again, we use the add-one rule. A sketch appears below.
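
A minimal sketch of the add-one rule for transition probabilities out of a single state, covering both cases from the slides: a state observed transitioning to only one successor (like D1 to M2) and a state that never appears in the MSA (like I1). The counts layout is an assumption for illustration.

```python
def transition_probs(counts):
    """Transition probabilities out of one PHMM state, estimated from the
    observed transition counts in the MSA, with the add-one rule.

    counts: dict mapping successor kind ('M', 'I', 'D') -> count.
    If the state never occurs in the MSA (no counts at all), the
    probabilities default to uniform."""
    kinds = ("M", "I", "D")
    total = sum(counts.get(k, 0) for k in kinds)
    if total == 0:
        return {k: 1 / len(kinds) for k in kinds}   # unused state, e.g. I1
    return {k: (counts.get(k, 0) + 1) / (total + len(kinds)) for k in kinds}

# Example: from delete state D1 the MSA only ever transitions to a match
# state. Without smoothing this would be {M: 1, I: 0, D: 0}; with add-one:
print(transition_probs({"M": 1}))   # {'M': 0.5, 'I': 0.25, 'D': 0.25}
print(transition_probs({}))         # uniform: 1/3 each
```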

PHMM Emission Probabilities. The emission probabilities for the given MSA are computed using the add-one rule.

PHMM Transition Probabilities. The transition probabilities for the given MSA are computed using the add-one rule.

PHMM Summary. Construct pairwise alignments, usually using dynamic programming. Use these to construct the MSA; there are lots of ways to do this. Using the MSA, determine the probabilities: emission probabilities and state transition probabilities. In effect, we have trained a PHMM. Now what?

PHMM Scoring. We want to score sequences to see how closely they match the PHMM. How did we score sequences with an HMM? The forward algorithm. How do we score sequences with a PHMM? Also with a forward algorithm, but the algorithm is a little more complex, due to the more complex state transitions.

Forward Algorithm Notation. Indices i and j are columns in the MSA; xi is the ith observation symbol; qxi is the distribution of xi in the "random model." A base case initializes the recursion, and FMj(i) is the score of x1, ..., xi up through match state Mj, with analogous quantities FIj(i) and FDj(i) for insert and delete states (note that in a PHMM, i and j may not agree). Some states are undefined, and undefined states are ignored in the calculation.

Forward Algorithm. We compute P(X|λ) recursively. Note that FMj(i) depends on FMj-1(i-1), FIj-1(i-1), and FDj-1(i-1), along with the corresponding state transition probabilities. A sketch of the match-state recursion appears below.
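
A minimal sketch of one step of the match-state recursion, in log space and in the style of Durbin et al.; it is not taken verbatim from the slides. All data layouts (the F, a, eM, and q structures) are assumptions made for illustration, boundary cases are omitted, and the add-one rule is assumed to keep every probability strictly positive.

```python
import math

def log_sum_exp(*xs):
    """log(sum(exp(x))) computed stably, treating -inf terms as zero."""
    finite = [x for x in xs if x != float("-inf")]
    if not finite:
        return float("-inf")
    m = max(finite)
    return m + math.log(sum(math.exp(x - m) for x in finite))

def forward_match(i, j, x, F, a, eM, q):
    """One step of the PHMM forward recursion for match state Mj and the
    i-th observation symbol, in log space.

    Assumed layouts (not from the slides): F is a dict keyed by
    (state_kind, j, i) holding already-computed log scores; a maps pairs
    of states such as (("M", j-1), ("M", j)) to transition probabilities;
    eM[j] is the emission distribution at Mj; q is the background
    ("random model") distribution. All probabilities are assumed to be
    strictly positive (add-one rule), so math.log is safe.
    """
    symbol = x[i - 1]  # x is a string; symbols are numbered from 1
    emit = math.log(eM[j][symbol] / q[symbol])
    return emit + log_sum_exp(
        math.log(a[("M", j - 1), ("M", j)]) + F[("M", j - 1, i - 1)],
        math.log(a[("I", j - 1), ("M", j)]) + F[("I", j - 1, i - 1)],
        math.log(a[("D", j - 1), ("M", j)]) + F[("D", j - 1, i - 1)],
    )
```

The insert- and delete-state recursions have the same shape: each combines the three predecessor scores with the matching transition probabilities, and delete states add no emission term since they emit nothing.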

PHMM. We will see examples of PHMMs later, in particular malware detection based on opcodes and masquerade detection based on UNIX commands.

References
R. Durbin, S. Eddy, A. Krogh, and G. Mitchison, Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids
L. Huang and M. Stamp, Masquerade detection using profile hidden Markov models, to appear in Computers and Security
S. Attaluri, S. McGhee, and M. Stamp, Profile hidden Markov models for metamorphic virus detection, Journal in Computer Virology, Vol. 5, No. 2, May 2009, pp. 151-169