Hidden Markov Models in Bioinformatics

Hidden Markov Models in Bioinformatics. By Máthé Zoltán and Kőrösi Zoltán, 2006.

Outline
- Markov Chain
- HMM (Hidden Markov Model)
- Hidden Markov Models in Bioinformatics
- Gene Finding
- Gene Finding Model
- Viterbi algorithm
- HMM Advantages
- HMM Disadvantages
- Conclusions

Markov Chain

Definition: A Markov chain is a triplet (Q, {p(x1 = s)}, A), where:
- Q is a finite set of states; each state corresponds to a symbol in the alphabet.
- p gives the initial state probabilities.
- A gives the state transition probabilities, denoted a_st for each s, t ∈ Q; the transition probability is a_st ≡ P(x_i = t | x_{i-1} = s).

Output: the output of the model is the sequence of states at each instant of time, i.e. the states themselves are observable.

Property: the probability of each symbol x_i depends only on the value of the preceding symbol x_{i-1}: P(x_i | x_{i-1}, …, x_1) = P(x_i | x_{i-1}).

Example of a Markov Model
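The figure for this slide is not reproduced in the transcript. As a stand-in, here is a minimal Python sketch of a first-order Markov chain over the DNA alphabet; the transition and initial values are invented purely for illustration and are not taken from the original slide.

# hypothetical transition and initial probabilities for a DNA Markov chain
transition = {
    'A': {'A': 0.3, 'C': 0.2, 'G': 0.2, 'T': 0.3},
    'C': {'A': 0.2, 'C': 0.3, 'G': 0.3, 'T': 0.2},
    'G': {'A': 0.2, 'C': 0.3, 'G': 0.3, 'T': 0.2},
    'T': {'A': 0.3, 'C': 0.2, 'G': 0.2, 'T': 0.3},
}
initial = {'A': 0.25, 'C': 0.25, 'G': 0.25, 'T': 0.25}

def sequence_probability(seq):
    # P(x) = p(x_1) * product over i >= 2 of a_{x_(i-1), x_i}  (the Markov property)
    prob = initial[seq[0]]
    for prev, curr in zip(seq, seq[1:]):
        prob *= transition[prev][curr]
    return prob

print(sequence_probability('ACGT'))  # 0.25 * 0.2 * 0.3 * 0.2 = 0.003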

HMM (Hidden Markov Model)

Definition: An HMM is a 5-tuple (Q, V, p, A, E), where:
- Q is a finite set of states, |Q| = N.
- V is a finite set of observation symbols per state, |V| = M.
- p gives the initial state probabilities.
- A gives the state transition probabilities, denoted a_st for each s, t ∈ Q; the transition probability is a_st ≡ P(x_i = t | x_{i-1} = s).
- E is the emission probability matrix, with e_s(k) ≡ P(v_k at time t | q_t = s).

Output: only the emitted symbols are observable, not the underlying random walk between states, hence "hidden".
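To make the definition concrete, the joint probability of an observation sequence x_1 … x_L and a hidden state path π_1 … π_L factorizes as follows (a standard identity written in the notation above; it is not part of the original slides):

P(x_1 \dots x_L, \pi_1 \dots \pi_L) = p(\pi_1)\, e_{\pi_1}(x_1) \prod_{i=2}^{L} a_{\pi_{i-1} \pi_i}\, e_{\pi_i}(x_i)

Only x_1 … x_L is observed; inferring something about π is what the algorithms discussed later are for.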

Example of a Hidden Markov Model

HMMs can be applied efficiently to well-known biological problems, which is why they gained popularity in bioinformatics. They are used for a variety of biological problems such as:
- protein secondary structure recognition
- multiple sequence alignment
- gene finding

What do HMMs do? An HMM is a statistical model for sequences of discrete symbols. HMMs have been used for many years in speech recognition, and they are well suited to the gene finding task: categorizing nucleotides within a genomic sequence can be interpreted as a classification problem over a set of ordered observations that possess hidden structure, which is exactly the kind of problem hidden Markov models are designed for.

Hidden Markov Models in Bioinformatics One of the most challenging and interesting problems in computational biology at the moment is finding genes in DNA sequences. With so many genomes being sequenced so rapidly, it remains important to begin by identifying genes computationally.

Gene Finding Gene finding refers to identifying stretches of nucleotide sequences in genomic DNA that are biologically functional. Computational gene finding deals with algorithmically identifying protein-coding genes. Gene finding is not an easy task, as gene structure can be very complex.

Objective: to find the coding and non-coding regions of an unlabeled string of DNA nucleotides.

Motivation:
- Assist in the annotation of genomic data produced by genome sequencing methods.
- Gain insight into the mechanisms involved in transcription, splicing and other processes.

Structure of a gene

A gene is discontinuous, comprising both:
- exons (regions that encode a sequence of amino acids), and
- introns (non-coding polynucleotide sequences that interrupt the coding sequences, the exons, of a gene).

In gene finding there are some important biological rules:
- Translation starts with a start codon (ATG).
- Translation ends with a stop codon (TAG, TGA, TAA).
- An exon can never follow another exon without an intron in between.
- Complete genes can never end with an intron.

Gene Finding Models When using HMMs, we first have to specify a model. When choosing the model, we have to take its complexity into consideration, in terms of:
- the number of states and allowed transitions,
- how sophisticated the learning methods are,
- the learning time.

The model consists of a finite set of states, each of which can emit a symbol from a finite alphabet with a fixed probability distribution over those symbols, and a set of transitions between states, which allow the model to change state after each symbol is emitted. Models can differ in complexity and in how much biological knowledge is built in.

The model for the Viterbi algorithm

states = ('Begin', 'Exon', 'Donor', 'Intron')
observations = ('A', 'C', 'G', 'T')

The Model Probabilities

Transition probability:

transition_probability = {
    'Begin'  : {'Begin' : 0.0, 'Exon' : 1.0, 'Donor' : 0.0, 'Intron' : 0.0},
    'Exon'   : {'Begin' : 0.0, 'Exon' : 0.9, 'Donor' : 0.1, 'Intron' : 0.0},
    'Donor'  : {'Begin' : 0.0, 'Exon' : 0.0, 'Donor' : 0.0, 'Intron' : 1.0},
    'Intron' : {'Begin' : 0.0, 'Exon' : 0.0, 'Donor' : 0.0, 'Intron' : 1.0}
}

Emission probability:

emission_probability = {
    'Begin'  : {'A' : 0.00, 'C' : 0.00, 'G' : 0.00, 'T' : 0.00},
    'Exon'   : {'A' : 0.25, 'C' : 0.25, 'G' : 0.25, 'T' : 0.25},
    'Donor'  : {'A' : 0.05, 'C' : 0.00, 'G' : 0.95, 'T' : 0.00},
    'Intron' : {'A' : 0.40, 'C' : 0.10, 'G' : 0.10, 'T' : 0.40}
}
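The start probabilities are not reproduced in the transcript. One plausible assumption (ours, not from the original slides), consistent with the reported Viterbi path later on that begins in the 'Exon' state, is:

start_probability = {'Begin' : 0.0, 'Exon' : 1.0, 'Donor' : 0.0, 'Intron' : 0.0}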

Viterbi algorithm The Viterbi algorithm is a dynamic programming algorithm for finding the most likely sequence of hidden states. This most probable path is called the Viterbi path.

The main idea of the Viterbi algorithm is to build up, for each intermediate state, the most probable path leading to it, until the end of the sequence is reached. At each time step, only the most likely path leading to each state survives.
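In equation form, writing V_t(s) for the probability of the most probable path that accounts for the first t symbols and ends in state s, the textbook recursion is (our notation, not from the original slides):

V_1(s) = p(s)\, e_s(x_1), \qquad V_t(s) = e_s(x_t)\, \max_{s'} V_{t-1}(s')\, a_{s's}, \qquad P(\text{Viterbi path}) = \max_s V_L(s)

The implementation described on the following slides uses a slightly different but closely related bookkeeping (emission and transition factors are accumulated while iterating over source states), but the idea is the same.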

The steps of the Viterbi algorithm

The arguments of the Viterbi algorithm:

viterbi(observations, states, start_probability, transition_probability, emission_probability)

The working of the Viterbi algorithm The algorithm works on two mappings, T and U. For each state it calculates a triple (prob, v_path, v_prob), where prob is the total probability of all paths from the start to the current state, v_path is the Viterbi path to that state, and v_prob is the probability of that Viterbi path. The mapping T holds this information for a given point t in time, and the main loop constructs U, which holds the same information for time t+1.

The algorithm computes the triple (prob, v_path, v_prob) for each possible next state. The total probability of a given next state, total, is obtained by adding up the probabilities of all paths reaching that state. More precisely, the algorithm iterates over all possible source states. For each source state, T holds the total probability of all paths to that state. This probability is multiplied by the emission probability of the current observation and by the transition probability from the source state to the next state. The resulting probability, prob, is then added to total.

For each source state, the probability of the Viterbi path to that state is also known. This too is multiplied by the emission and transition probabilities, and it replaces the running maximum valmax if it is greater than the current value. The Viterbi path itself is computed as the corresponding argmax of that maximization, by extending the Viterbi path that leads to the source state with the next state. The triple (prob, v_path, v_prob) computed in this fashion is stored in U, and once U has been computed for all possible next states, it replaces T, ensuring that the loop invariant holds at the end of the iteration.
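Putting the description above into code, here is a minimal Python sketch in the style of a classic "forward Viterbi" formulation. It is a reconstruction, not the authors' original program; in particular the function name and the convention of applying the emission probability at the source state are assumptions based on the preceding slides.

def forward_viterbi(observations, states, start_probability,
                    transition_probability, emission_probability):
    # T maps each state to a triple (prob, v_path, v_prob):
    #   prob   - total probability of all paths ending in this state
    #   v_path - the Viterbi (most probable) path ending in this state
    #   v_prob - the probability of that Viterbi path
    T = {}
    for state in states:
        T[state] = (start_probability[state], [state], start_probability[state])

    for obs in observations:
        U = {}
        for next_state in states:
            total = 0.0
            argmax = None
            valmax = 0.0
            for source_state in states:
                (prob, v_path, v_prob) = T[source_state]
                # emit the current observation at the source state,
                # then take the transition from source to next state
                p = (emission_probability[source_state][obs] *
                     transition_probability[source_state][next_state])
                prob *= p
                v_prob *= p
                total += prob
                if v_prob > valmax:
                    argmax = v_path + [next_state]
                    valmax = v_prob
            U[next_state] = (total, argmax, valmax)
        T = U  # U becomes the new T for the next observation

    # final step: sum over all end states, and keep the best Viterbi path
    total = 0.0
    argmax = None
    valmax = 0.0
    for state in states:
        (prob, v_path, v_prob) = T[state]
        total += prob
        if v_prob > valmax:
            argmax = v_path
            valmax = v_prob
    return (total, argmax, valmax)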

Example

Input DNA sequence: CTTCATGTGAAAGCAGACGTAAGTCA

Result:
Total: 2.6339193049977711e-17 (the sum of all the calculated path probabilities)

Viterbi path: ['Exon', 'Exon', 'Exon', 'Exon', 'Exon', 'Exon', 'Exon', 'Exon', 'Exon', 'Exon', 'Exon', 'Exon', 'Exon', 'Exon', 'Exon', 'Exon', 'Exon', 'Exon', 'Donor', 'Intron', 'Intron', 'Intron', 'Intron', 'Intron', 'Intron', 'Intron', 'Intron']

Viterbi probability: 7.0825171238258092e-18
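Under the assumptions sketched above, this output corresponds to a call along the following lines (hypothetical; the exact numbers depend on details such as the start probabilities, which the transcript does not show):

total, v_path, v_prob = forward_viterbi('CTTCATGTGAAAGCAGACGTAAGTCA', states,
                                        start_probability, transition_probability,
                                        emission_probability)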

HMM Advantages

Statistics: HMMs are very powerful modeling tools. Statisticians are comfortable with the theory behind hidden Markov models, which permits mathematical and theoretical analysis of the results and processes.

Modularity: HMMs can be combined into larger HMMs.

Transparency: people can read the model and make sense of it, and the model itself can help increase understanding.

Prior knowledge: prior knowledge can be incorporated into the architecture, the model can be initialized close to something believed to be correct, and prior knowledge can be used to constrain the training process.

HMM Disadvantages

State independence: the states are supposed to be independent, so P(y) must be independent of P(x) and vice versa. This usually isn't true. One can get around it when the relationships are local, but HMMs are not good for problems such as RNA folding.

Over-fitting: you are only as good as your training set, and more training is not always good.

Local maxima: the model may not converge to a truly optimal parameter set for a given training set.

Speed: almost everything one does with an HMM involves "enumerating all possible paths through the model", which is still slow in comparison to other methods.

Conclusions HMMs have problems where they excel and problems where they do not. You should consider using one if:
- the problem can be phrased as classification,
- the observations are ordered,
- the observations follow some sort of grammatical structure.
If an HMM does not fit, there are plenty of other methods to try: neural networks and decision trees, for example, have both been applied to bioinformatics.

Bibliography
- Pierre Baldi, Søren Brunak: Bioinformatics: The Machine Learning Approach
- http://www1.imim.es/courses/BioinformaticaUPF/Ttreballs/programming/donorsitemodel/index.html
- http://en.wikipedia.org/wiki/Viterbi_algorithm

Thank you.