# Hidden Markov Models in Bioinformatics

## Presentation on theme: "Hidden Markov Models in Bioinformatics"— Presentation transcript:

Hidden Markov Models in Bioinformatics
Example Domains: Gene Finding & Protein Family Modeling Colin Cherry

My Credentials By day, I do NLP work with Dekang Lin
Worked with HMMs in the 1st term of my 1st year here Did a review paper with Guohui Lin. Today’s presentation is borrowed from that Worked on speeding up sequence alignment to profile HMMs with Martin Müller

5 Second Overview Today’s goal: Introduce HMMs as general tools in bioinformatics I will use the problem of Gene Finding as an example of an “ideal” HMM problem domain I will use the problem of Protein Family Modeling as an example of a clever way to fit HMMs to a problem

Learning Objectives When I’m done you should know:
When is an HMM a good fit for a problem space? What materials are needed before work can begin with an HMM? What are the advantages and disadvantages of using HMMs?

Outline HMMs as Statistical Models The example tasks at a glance

Statistical Models HMMs are just a little more complicated…
Definition: Any mathematical construct that attempts to parameterize a random process Example: A normal distribution Assumptions Parameters Estimation Usage HMMs are just a little more complicated…

HMM Assumptions Observations are ordered
Random process can be represented by a stochastic finite state machine with emitting states.

HMM Parameters Using weather example Modeling daily weather for a year
Ra Ra Su Su Su Ra.. Lots of parameters One for each table entry Represented in two tables. One for emissions One for transitions

HMM Estimation Called training, it falls under machine learning
Feed an architecture (given in advance) a set of observation sequences The training process will iteratively alter its parameters to fit the training set The trained model will assign the training sequences high probability

Evaluate the probability of an observation sequence given the model (Forward) Find the most likely path through the model for a given observation sequence (Viterbi)

Gene Finding (An Ideal HMM Domain)
Our Objective: To find the coding and non-coding regions of an unlabeled string of DNA nucleotides Our Motivation: Assist in the annotation of genomic data produced by genome sequencing methods Gain insight into the mechanisms involved in transcription, splicing and other processes

Gene Finding Terminology
A string of DNA nucleotides containing a gene will have separate regions (lines): Introns – non-coding regions within a gene Exons – coding regions Separated by functional sites (boxes) Start and stop codons Splice sites – acceptors and donors

Gene Finding Challenges
Need the correct reading frame Introns can interrupt an exon in mid-codon There is no hard and fast rule for identifying donor and acceptor splice sites Signals are very weak

Protein Family Modeling (A clever fit of HMMs)
I have a protein sequence. What family is it in? Can you give me a quick alignment to the other members of the family? These amino acids here, do they match the families consensus positions, or are they inserts?

Profile HMM Square: Match (consensus) state
Diamond: Insert state – notice the loops Circle: Delete state – allows you to “jump” a match

What makes a good HMM problem space?
Characteristics: Classification problems There are two main types of output from an HMM: Scoring of sequences (Protein family modeling) Labeling of observations within a sequence (Gene Finding)

HMM Problem Characteristics Continued
The observations in a sequence should have a clear, and meaningful order Unordered observations will not map easily to states It’s beneficial, but not necessary for the observations follow some sort of grammar Makes it easier to design an architecture Gene Finding Protein Family Modeling Fortunately, most materials in bioinformatics can be interpreted as sequences with well-defined order, be they gene sequences, structure sequences or nucleotide sequences

HMM Requirements So you’ve decided you want to build an HMM,
here’s what you need: An architecture Probably the hardest part Should be biologically sound & easy to interpret A well-defined success measure Necessary for any form of machine learning

HMM Requirements Continued
Training data Labeled or unlabeled – it depends You do not always need a labeled training set to do observation labeling, but it helps Amount of training data needed is: Directly proportional to the number of free parameters in the model Inversely proportional to the size of the training sequences

Why HMMs might be a good fit for Gene Finding
Classification: Classifying observations within a sequence Order: A DNA sequence is a set of ordered observations Grammar / Architecture: Our grammatical structure (and the beginnings of our architecture) is right here: Success measure: # of complete exons correctly labeled Training data: Available from various genome annotation projects

Why HMMs can be made to fit Protein Family Modeling
Classification: What model fits a sequence best? Order: An amino acid sequence is well ordered Grammar: Any two matches can be separated by a series of inserts and deletes… okay, maybe the word “grammar” is a bit of a stretch Success Measure: How many sequences can we correctly label after training?

Statisticians are comfortable with the theory behind hidden Markov models Freedom to manipulate the training and verification processes Mathematical / theoretical analysis of the results and processes HMMs are still very powerful modeling tools – far more powerful than many statistical methods

Modularity HMMs can be combined into larger HMMs Transparency of the Model Assuming an architecture with a good design People can read the model and make sense of it The model itself can help increase understanding

Incorporation of Prior Knowledge Incorporate prior knowledge into the architecture Initialize the model close to something believed to be correct Use prior knowledge to constrain training process Start the model close to something that is correct – this takes advantage of model transparency Use prior knowledge to constrain the training process – takes advantage of statistical grounding

How does Gene Finding make use of HMM advantages?
Statistics: Many systems alter the training process to better suit their success measure Modularity: Almost all systems use a combination of models, each individually trained for each gene region Prior Knowledge: A fair amount of prior biological knowledge is built into each architecture Point 1 (statistics): Find a good way to phrase that a state sequence that fits the observations with highest probability is not necessarily the state sequence that assigns the highest probability to complete exons and introns Point 4 (modularity): Each sub-model’s architecture is constructed to encapsulate biological knowledge about that region of the gene

States are supposed to be independent P(y) must be independent of P(x), and vice versa This usually isn’t true Can get around it when relationships are local Not good for RNA folding problems P(x) P(y)

…and then there are the standard Machine Learning Problems Watch out for local maxima Model may not converge to a truly optimal parameter set for a given training set Avoid over-fitting You’re only as good as your training set More training is not always good

Speed!!! Almost everything one does in an HMM involves: “enumerating all possible paths through the model” There are efficient ways to do this Still slow in comparison to other methods

HMM Gene Finders: VEIL A straight HMM Gene Finder
Takes advantage of grammatical structure and modular design Uses many states that can only emit one symbol to get around state independence

HMM Gene Finders: HMMGene
Uses an extended HMM called a CHMM CHMM = HMM with classes Takes full advantage of being able to modify the statistical algorithms Uses high-order states Trains everything at once

HMM Gene Finders: Genie
Uses a generalized HMM (GHMM) Edges in model are complete HMMs States can be any arbitrary program States are actually neural networks specially designed for signal finding

Conclusions HMMs have problems where they excel, and problems where they do not You should consider using one if: Problem can be phrased as classification Observations are ordered The observations follow some sort of grammatical structure (optional)