Introduction to Bioinformatics: Lecture XIII. Profile and Other Hidden Markov Models
Jarek Meller
Division of Biomedical Informatics, Children’s Hospital Research Foundation & Department of Biomedical Engineering, UC

Outline of the lecture
- Multiple alignments, family profiles and probabilistic models of biological sequences
- From simple Markov models to Hidden Markov Models (HMMs)
- Profile HMMs: topology and parameter optimization
- Finding optimal alignments: the Viterbi algorithm
- Other applications of HMMs

Web watch: personalized predictive medicine
Targeting a crucial signal transduction pathway in lung cancer: an inhibitor of Epidermal Growth Factor Receptor (EGFR) catalytic activity binds EGFRs carrying specific mutations. Genotyping the EGFR gene appears to be sufficient to predict the outcome of the therapy. Paez JG et al., Science 304

Hidden Markov Models for biological sequences
Problems with grammatical structure, such as gene finding, family profiles, protein function prediction, and transmembrane domain prediction. In general, one may think of different statistical biases in different fragments of a sequence (due, for example, to their functional roles), or of different states emitting these fragments according to different probability distributions. Durbin et al., Chapters 3 to 6.

Example: Markov chain model for CpG islands
Motivation: CpG dinucleotides (not to be confused with C-G base pairs across the two strands) are frequently methylated at C, with methyl-C mutating at a higher rate into T; however, methylation is suppressed around regulatory sequences (e.g. promoters), where CpG islands occur more often.
(Diagram: four states A, C, G, T with transitions between all pairs of nucleotides.)
Transition probabilities: t(T,G) = P(a_i = G | a_{i-1} = T), etc. The overall probability of a sequence is defined as the product of transition probabilities.

Example: Hidden Markov model for CpG islands
Adding four more states (A*, C*, G*, T*) to represent the “island” model, alongside the non-island model (A, C, G, T), with unlikely transitions between the two models, one obtains a “hidden” Markov model for CpG islands. There is no longer a one-to-one correspondence between states and symbols: knowing the sequence, we cannot tell which state the model was in when generating subsequent letters.

Probabilistic models of biological sequences
For any probabilistic model, the total probability of observing a sequence a_1 a_2 ... a_n may be written as:
P(a_1 a_2 ... a_n) = P(a_n | a_{n-1} ... a_1) P(a_{n-1} | a_{n-2} ... a_1) ... P(a_1)
In Markov chain models we simply have:
P(a_1 a_2 ... a_n) = P(a_n | a_{n-1}) P(a_{n-1} | a_{n-2}) ... P(a_1)
HMMs are a generalization of Markov chain models, with “hidden” states that “emit” sequence symbols according to certain probability distributions, and (Markov) transitions between pairs of hidden states.
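The Markov chain factorization above is straightforward to compute directly. A minimal sketch, where the uniform initial and transition probabilities are illustrative placeholders rather than values estimated from data:

```python
# Sketch: total probability of a sequence under a first-order Markov chain,
#   P(a_1 ... a_n) = P(a_1) * prod_i P(a_i | a_{i-1}).
# The uniform probabilities below are placeholders, not estimated from data.

def markov_chain_prob(seq, init, trans):
    p = init[seq[0]]
    for prev, curr in zip(seq, seq[1:]):
        p *= trans[(prev, curr)]
    return p

bases = "ACGT"
init = {b: 0.25 for b in bases}
trans = {(x, y): 0.25 for x in bases for y in bases}

print(markov_chain_prob("CGCG", init, trans))  # 0.25**4 = 0.00390625
```

With transition probabilities estimated separately inside and outside CpG islands, the ratio of the two chain probabilities gives a log-odds score for island membership.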

HMMs as probabilistic linguistic models
HMMs may in fact be regarded as probabilistic finite automata that generate certain “languages”: sets of words (sentences, etc.) with a specific “grammatical” structure. For example, promoter, start, exon, splice junction, intron and stop “states” will appear in a linguistic model of a gene, whereas column (sequence position), insert and deletion states will be employed in a linguistic model of a (protein) family profile.
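The generative, automaton-like view can be made concrete by sampling from a small HMM. The two states and all probabilities below are invented for illustration (a toy island/background model, not fitted values):

```python
# Sketch: an HMM as a probabilistic automaton that generates sequences.
# States and probabilities are illustrative placeholders, not fitted values.
import random

states = ["island", "background"]
trans = {"island": {"island": 0.9, "background": 0.1},
         "background": {"island": 0.1, "background": 0.9}}
emit = {"island": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
        "background": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}}

def generate(n, start="background", seed=0):
    """Emit n symbols, following Markov transitions between hidden states."""
    rng = random.Random(seed)
    s, out = start, []
    for _ in range(n):
        out.append(rng.choices(list(emit[s]), weights=list(emit[s].values()))[0])
        s = rng.choices(states, weights=[trans[s][t] for t in states])[0]
    return "".join(out)

print(generate(20))  # a random A/C/G/T string; the hidden state path is not observed
```

Only the emitted symbols are visible to an observer, which is exactly why the state path is “hidden”.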

HMMs for gene prediction: an exon model

HMMs and the supervised learning approach
Given a training set of aligned sequences, find optimal transition and emission probabilities that maximize the probability of observing the training sequences: the Baum-Welch (Expectation Maximization) or Viterbi training algorithms. In the recognition phase, having the optimized probabilities, we ask what is the likelihood that a new sequence belongs to the family, i.e. that it is generated by the HMM with sufficiently high probability. The Viterbi algorithm, which is in fact dynamic programming in a suitable formulation, is used to find an optimal path through the states, which defines the optimal alignment.

Ungapped profiles and the corresponding HMMs
(Diagram: Beg -> M_1 -> M_2 -> ... -> M_n -> End, a chain of match states.)
Example (three aligned sequences):
AGAAACT
AGGAATT
TGAATCT
Position-specific emission frequencies e_j(a):
pos:  1    2    3    4    5    6    7
A    2/3   0   2/3   1   2/3   0    0
C     0    0    0    0    0   2/3   0
G     0    1   1/3   0    0    0    0
T    1/3   0    0    0   1/3  1/3   1
P(AGAAACT) = (2/3)^4 = 16/81
P(TGGATTT) = (1/3)^4 = 1/81
Each match state “emits” each letter with a certain probability e_j(a), defined by the frequency of a at position j. Typically, pseudo-counts are added in HMMs to avoid zero probabilities.
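The emission table and the two probabilities can be reproduced directly from the three-sequence example. A minimal sketch without pseudo-counts (the helper names are ours):

```python
# Sketch: position-specific emission probabilities of the ungapped profile
# above (no pseudo-counts), and the probability the profile assigns to a query.
from collections import Counter

alignment = ["AGAAACT", "AGGAATT", "TGAATCT"]

def column_emissions(alignment):
    """e_j(a): frequency of letter a in column j of the alignment."""
    n = len(alignment)
    return [{a: c / n for a, c in Counter(col).items()} for col in zip(*alignment)]

def profile_prob(seq, emissions):
    """Product of per-column emission probabilities."""
    p = 1.0
    for a, e in zip(seq, emissions):
        p *= e.get(a, 0.0)
    return p

e = column_emissions(alignment)
print(profile_prob("AGAAACT", e))  # (2/3)**4 = 16/81 ~ 0.1975
print(profile_prob("TGGATTT", e))  # (1/3)**4 = 1/81 ~ 0.0123
```

Note that any letter unseen in a column gets probability zero here, which is exactly the problem pseudo-counts are meant to fix.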

HMMs and likelihood optimization

Likelihood optimization …

Insertions and deletions in profile HMMs
(Diagram: Beg -> match states M_j -> End, with an insert state I_j attached at each position.)
Insert states emit symbols just like match states; however, their emission probabilities are typically assumed to follow the background distribution and thus do not contribute to log-odds scores. Transitions I_j -> I_j are allowed and account for an arbitrary number of inserted residues that are effectively unaligned (their order within an inserted region is arbitrary).

Insertions and deletions in profile HMMs
(Diagram: Beg -> match states M_j -> End, with a silent delete state D_j at each position.)
Deletions are represented by silent states which do not emit any letters. A sequence of deletions (with D -> D transitions) may be used to connect any two match states, accounting for segments of the multiple alignment that are not aligned to any symbol in a query sequence (string). The total cost of a deletion is the sum of the costs of the individual transitions (M -> D, D -> D, D -> M) that define it. As in the case of insertions, both linear and affine gap penalties can easily be incorporated in this scheme.

Gap penalties: evolutionary and computational considerations
Linear gap penalties: γ(k) = -k d for a gap of length k and a constant d
Affine gap penalties: γ(k) = -[d + (k - 1) e], where d is the gap opening penalty and e the gap extension penalty.
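The two penalty schemes can be written down in a few lines (the function names and the numeric values in the example calls are ours, for illustration):

```python
# Sketch: the two gap-penalty functions defined above.

def linear_gap(k, d):
    """gamma(k) = -k*d: every gapped position costs the same."""
    return -k * d

def affine_gap(k, d, e):
    """gamma(k) = -(d + (k - 1)*e): opening cost d, extension cost e."""
    return -(d + (k - 1) * e)

print(linear_gap(3, 2))       # -6
print(affine_gap(3, 2, 0.5))  # -3.0
```

With e < d, the affine scheme penalizes one long gap less than several short ones of the same total length, which matches the evolutionary intuition that a single indel event often inserts or deletes several residues at once.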

Profile HMMs as a model for multiple alignments
(Diagram: full profile HMM with match states M_j, insert states I_j and delete states D_j between Beg and End.)
Example (match columns marked with *):
AG---C
A-AG-C
AG-AA-
--AAAC
AG---C
**   *

Observed emission and transition counts
Example alignment (match columns 1, 2 and 6 correspond to match states M1-M3; dots mark positions with no inserted residue):
AG...C
A-AG.C
AGAA.-
--AAAC
AG...C

Match emission counts (column 0 is the Beg state, which emits nothing):
     0   1   2   3
A    -   4   0   0
C    -   0   0   4
G    -   0   3   0
T    -   0   0   0

Insert emission counts (all insertions fall in I2, between M2 and M3):
     0   1   2   3
A    0   0   6   0
C    0   0   0   0
G    0   0   1   0
T    0   0   0   0

Computing emission and transition probabilities
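Turning observed counts into probabilities while avoiding zeros can be done with add-one (Laplace) pseudo-counts; other pseudo-count schemes exist. A minimal sketch, where the helper name is ours and the example counts are those of match state M1 above:

```python
# Sketch: counts -> emission probabilities with add-one (Laplace) pseudo-counts:
#   e_j(a) = (c_j(a) + 1) / (sum_b c_j(b) + |alphabet|)

def emission_probs(counts, alphabet="ACGT", pseudo=1):
    total = sum(counts.get(a, 0) for a in alphabet) + pseudo * len(alphabet)
    return {a: (counts.get(a, 0) + pseudo) / total for a in alphabet}

# Match state M1 from the alignment above: four sequences align an A
# (the fifth sequence has a deletion in that column).
print(emission_probs({"A": 4}))  # {'A': 0.625, 'C': 0.125, 'G': 0.125, 'T': 0.125}
```

Transition probabilities are normalized the same way, dividing each (pseudo-counted) transition count by the total number of transitions leaving the state.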

Optimal alignment corresponds to a path with the highest probability (or log-odds score)
Problem: Given the above model, with the emission and transition probabilities obtained previously, find the optimal path (alignment) for the query sequence AGAC.
Problem: Find the emission and transition counts assuming that the 4th column in the multiple alignment example on slide 15 corresponds to another match state (and not an insert state).

Outline of the Viterbi algorithm
(Diagram and recursion equations for the profile HMM, with match, insert and delete states, appeared as figures on this slide.)
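The Viterbi recursion can be sketched on a simpler model. Below is a generic log-space implementation run on a two-state island/background model; the states and all probabilities are illustrative placeholders, not the profile HMM from the slides (whose recursion has the same dynamic programming structure, applied per alignment column):

```python
# Sketch: the Viterbi algorithm as dynamic programming over hidden states,
# for a toy two-state model (I = island, B = background). All probabilities
# are illustrative placeholders, not fitted values.
import math

def viterbi(obs, states, start, trans, emit):
    """Return (best log-probability, most likely state path)."""
    # Initialization: first column of the dynamic programming matrix.
    V = [{s: math.log(start[s]) + math.log(emit[s][obs[0]]) for s in states}]
    back = [{}]
    for o in obs[1:]:
        V.append({})
        back.append({})
        for s in states:
            # Best predecessor state for s at this position.
            prev = max(states, key=lambda p: V[-2][p] + math.log(trans[p][s]))
            V[-1][s] = V[-2][prev] + math.log(trans[prev][s]) + math.log(emit[s][o])
            back[-1][s] = prev
    # Traceback from the best final state.
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for bp in reversed(back[1:]):
        path.append(bp[path[-1]])
    return V[-1][last], path[::-1]

states = ["I", "B"]
start = {"I": 0.5, "B": 0.5}
trans = {"I": {"I": 0.8, "B": 0.2}, "B": {"I": 0.2, "B": 0.8}}
emit = {"I": {"A": 0.05, "C": 0.45, "G": 0.45, "T": 0.05},
        "B": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}}

score, path = viterbi("CGCGAATT", states, start, trans, emit)
print("".join(path))  # IIIIBBBB: the CG-rich prefix is assigned to the island state
```

Working in log space avoids numerical underflow, and the per-position maximization plus traceback is exactly the dynamic programming step that, in the profile HMM, yields the optimal alignment.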

Profile HMMs for local alignments
(Diagram: the profile HMM flanked by two additional insert states Q, before Beg and after End.)
The trick consists of adding extra insert states Q that model the flanking, unaligned sequence using background frequencies q_a and a large self-transition probability t_{Q,Q}.

Summary
- In general, when the states generating the training sequences (alignments) are not known, an iterative procedure such as Baum-Welch or Viterbi training must be used
- Problems with local minima and with the choice of topology (length of the profile)
- Excellent results in family assignment (SAM, PFAM), gene prediction, transmembrane domain recognition, etc.
