Profile Hidden Markov Models For Protein Structure Prediction Colin Cherry

Presentation transcript:

1 Profile Hidden Markov Models For Protein Structure Prediction Colin Cherry

2 Biological Motivation: Given a single amino acid target sequence of unknown structure, we want to infer the structure of the resulting protein.

3 But wait, that’s hard! There’s physics, chemistry, secondary structure, tertiary structure and all sorts of other nasty stuff to deal with. Let’s rephrase the problem: Given a target amino acid sequence of unknown structure, we want to identify the structural family of the target sequence through identification of a homologous sequence of known structure.

4 It still sounds hard… In other words: We find a similar protein with a structure that we understand, and we see if it makes sense to fold our target into the same sort of shape. If not, we try again with the second most similar structure, and so on. What we’re doing is taking advantage of the wealth of knowledge that has been collected in protein and structure databases.

5 So, the next question is: How do we find a known protein that is similar to our target sequence? One method happens to be: Hidden Markov Models! (Profile Hidden Markov Models, to be precise)

6 Lecture Objectives Once I’m done, you should know: 1. The standard architecture for a profile HMM. 2. The three major uses in bioinformatics for a profile HMM. 3. The high-level concepts behind the algorithms to train and use a profile HMM. 4. The two different starting points for training an HMM. 5. How to avoid over-fitting a profile HMM. 6. The high-level ideas behind using profile HMMs to determine protein structure. Please feel free to interrupt at any point.

7 Outline: This talk will be broken into the following sections: Methods for Characterizing a Protein Family; The Architecture of a Profile HMM; Alignment and Training with Profile HMMs; Preventing Over-fitting; Determining Structure; Conclusion.

8 Methods for Characterizing a Protein Family: Objective: Given a number of related sequences, encapsulate what they have in common in such a way that we can recognize other members of the family. Some standard methods for characterization: multiple alignments, regular expressions, consensus sequences, hidden Markov models.

9 A Characterization Example: How could we characterize this (hypothetical) family of nucleotide sequences? Keep the multiple alignment. Try a regular expression: [AT] [CG] [AC] [ACTG]* A [TG] [GC]. But what about T G C T - - A G G vs. A C A C - - A T C? Both match the expression, so it cannot distinguish an unusual member from a typical one. Try a consensus sequence: A C A A T C. Membership then depends on the distance measure. Example borrowed from Salzberg, 1998.
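The limitation of the regular expression can be checked directly. The sketch below is only an illustration: it assumes the gap characters are stripped before matching, and the helper name `is_member` is hypothetical.

```python
import re

# Regular expression from the slide, written without the spaces.
FAMILY_PATTERN = re.compile(r"[AT][CG][AC][ACTG]*A[TG][GC]")


def is_member(sequence: str) -> bool:
    """Return True if the ungapped sequence matches the family pattern."""
    # Assumption: gaps ("-") and spaces are layout only and are removed first.
    ungapped = sequence.replace("-", "").replace(" ", "")
    return FAMILY_PATTERN.fullmatch(ungapped) is not None


unusual = "T G C T - - A G G"
typical = "A C A C - - A T C"

# The expression gives a yes/no answer only, so both sequences look the same.
print(is_member(unusual), is_member(typical))
```

Because the answer is binary, the expression says nothing about how typical each sequence is, which is the gap the HMM score fills.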

10 HMMs to the rescue! Transition probabilities Emission Probabilities

11 Insert (Loop) States

12 Scoring our simple HMM: #1 - “T G C T - - A G G” vs. #2 - “A C A C - - A T C”. Regular expression ([AT] [CG] [AC] [ACTG]* A [TG] [GC]): #1 = Member, #2 = Member. HMM (probability): #1 = Score of %, #2 = Score of 4.7%. HMM (log odds): #1 = Score of , #2 = Score of 6.7.
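A rough sketch of how such scores can be computed follows. The figure with the model's parameters is not part of this transcript, so every probability below is an assumption chosen for illustration: three match positions, one looping insert state, three more match positions, and a uniform background of 0.25 per nucleotide for the log-odds score. The function names are hypothetical.

```python
import math

# ASSUMED emission probabilities for the six match positions (not in the transcript).
MATCH_EMISSIONS = [
    {"A": 0.8, "T": 0.2},
    {"C": 0.8, "G": 0.2},
    {"A": 0.8, "C": 0.2},
    {"A": 1.0},
    {"T": 0.8, "G": 0.2},
    {"C": 0.8, "G": 0.2},
]
# ASSUMED parameters for the single insert (loop) state after position 3.
INSERT_EMISSIONS = {"A": 0.2, "C": 0.4, "G": 0.2, "T": 0.2}
ENTER_INSERT, SKIP_INSERT = 0.6, 0.4
STAY_IN_INSERT, LEAVE_INSERT = 0.4, 0.6


def path_probability(seq: str) -> float:
    """Probability of seq: 3 match symbols, any inserts, 3 match symbols."""
    if len(seq) < 6:
        raise ValueError("sequence must cover all six match positions")
    head, middle, tail = seq[:3], seq[3:-3], seq[-3:]
    prob = 1.0
    # All match-to-match transitions are assumed to have probability 1.
    for emissions, symbol in zip(MATCH_EMISSIONS, head + tail):
        prob *= emissions.get(symbol, 0.0)
    if middle:
        prob *= ENTER_INSERT * STAY_IN_INSERT ** (len(middle) - 1) * LEAVE_INSERT
        for symbol in middle:
            prob *= INSERT_EMISSIONS.get(symbol, 0.0)
    else:
        prob *= SKIP_INSERT
    return prob


def log_odds(seq: str) -> float:
    """Natural-log odds of the model versus a uniform background."""
    prob = path_probability(seq)
    if prob == 0.0:
        return float("-inf")
    return math.log(prob) - len(seq) * math.log(0.25)


for name, seq in [("#1", "TGCTAGG"), ("#2", "ACACATC")]:
    print(name, f"{100 * path_probability(seq):.4f}%", f"{log_odds(seq):.2f}")
```

Under these assumed parameters, sequence #2 comes out at about 4.7% and a log-odds score of about 6.7, in line with the values on the slide, while sequence #1 scores far lower. Unlike the regular expression, the model separates the two.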

13 Standard Profile HMM Architecture: Three types of states: match, insert, delete. One delete and one match per position in model. One insert per transition in model. Start and end “dummy” states. Example borrowed from Cline, 1999.
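The state layout described above can be sketched as a small data structure. This is one common convention, not necessarily the one in the missing figure: the names `M1`, `I0`, `D1` and the helper `build_profile_hmm` are hypothetical, and the sketch assumes every state at position k may move to the insert at k or to the match or delete at k+1.

```python
def build_profile_hmm(length: int) -> dict:
    """Map each state name to the list of states it can move to."""

    def successors(k: int) -> list:
        # From any state at position k: loop into insert k, or advance.
        if k == length:
            return [f"I{k}", "END"]
        return [f"I{k}", f"M{k + 1}", f"D{k + 1}"]

    transitions = {"START": successors(0)}
    # One insert per transition between positions (including before the first
    # and after the last position), each with a self-loop.
    for k in range(length + 1):
        transitions[f"I{k}"] = successors(k)
    # One match and one delete state per model position.
    for k in range(1, length + 1):
        transitions[f"M{k}"] = successors(k)
        transitions[f"D{k}"] = successors(k)
    transitions["END"] = []
    return transitions


model = build_profile_hmm(3)
for state, targets in model.items():
    print(state, "->", targets)
```

A model of length L built this way has 3L + 3 states, and each state has at most three outgoing transitions, which is what keeps alignment cost linear in the number of states.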

14 Match States Example borrowed from Cline, 1999

15 Insert States Example borrowed from Cline, 1999

16 Delete States Example borrowed from Cline, 1999

17 Aligning and Training HMMs: Training from a multiple alignment. Aligning a sequence to a model: can be used to create an alignment, to score a sequence, or to interpret a sequence. Training from unaligned sequences.

18 Training from an existing alignment: This process is what we’ve been seeing up to this point. Start with a predetermined number of states in your HMM. For each position in the model, assign a column in the multiple alignment that is relatively conserved. Emission probabilities are set according to amino acid counts in columns. Transition probabilities are set according to how many sequences make use of a given delete or insert state.
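The counting step can be sketched in a few lines. The alignment below is a made-up toy example (nucleotides rather than amino acids, to match the earlier slides), the choice of match columns is assumed to be given, and the function names are hypothetical.

```python
from collections import Counter


def emissions_from_alignment(alignment, match_columns):
    """Relative residue frequencies for each column chosen as a match state."""
    model = []
    for col in match_columns:
        counts = Counter(row[col] for row in alignment if row[col] != "-")
        total = sum(counts.values())
        if total == 0:
            raise ValueError(f"column {col} contains only gaps")
        model.append({res: n / total for res, n in counts.items()})
    return model


def insert_usage(alignment, insert_columns):
    """Fraction of sequences with at least one residue in the insert region."""
    used = sum(
        any(row[col] != "-" for col in insert_columns) for row in alignment
    )
    return used / len(alignment)


# Hypothetical toy alignment: columns 3-5 are too variable to model as matches.
alignment = [
    "ACA---ATG",
    "TCAACTATC",
    "ACAC--AGC",
    "AGA---ATC",
    "ACCG--ATC",
]
match_columns = [0, 1, 2, 6, 7, 8]

print(emissions_from_alignment(alignment, match_columns)[0])
print(insert_usage(alignment, [3, 4, 5]))
```

Raw frequencies like these assign zero probability to anything not seen in a column, which is why the later slide on over-fitting matters.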

19 Remember the simple example Chose six positions in model. Highlighted area was selected to be modeled by an insert due to variability. Can also do neat tricks for picking length of model, such as model pruning.

20 Aligning sequences to a model: Now that we have a profile model, let’s use it! Conceptually, try every possible path through the model that would produce the target sequence, and keep the best one and its probability. The Viterbi algorithm, which has been around for a while, does this with dynamic programming instead of enumerating paths. Time: O(N*M); space: O(N*M) (assuming a constant number of transitions per state), where N = length of sequence and M = number of states in the HMM.
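A minimal sketch of the Viterbi recurrence is given below. It is a simplification: it handles a plain HMM in which every state emits a symbol, so the silent delete states of a full profile HMM are not covered, and it loops over all predecessor states, which costs O(N*M^2) rather than the O(N*M) quoted for a constant number of transitions per state. The two-state toy model and all of its probabilities are assumptions for illustration.

```python
import math


def safe_log(p: float) -> float:
    """Log that maps probability 0 to -infinity instead of raising."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi(seq, states, start, trans, emit):
    """Return the most probable state path for seq and its log probability."""
    # best[i][s]: best log probability of emitting seq[:i+1] and ending in s.
    best = [{s: safe_log(start[s]) + safe_log(emit[s].get(seq[0], 0.0))
             for s in states}]
    back = [{}]  # back[i][s]: predecessor of s on the best path at step i
    for symbol in seq[1:]:
        row, pointers = {}, {}
        for s in states:
            prev = max(states,
                       key=lambda p: best[-1][p] + safe_log(trans[p].get(s, 0.0)))
            row[s] = (best[-1][prev] + safe_log(trans[prev].get(s, 0.0))
                      + safe_log(emit[s].get(symbol, 0.0)))
            pointers[s] = prev
        best.append(row)
        back.append(pointers)
    # Trace back from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for pointers in reversed(back[1:]):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[-1][last]


# Hypothetical two-state toy model.
states = ["conserved", "variable"]
start = {"conserved": 0.5, "variable": 0.5}
trans = {"conserved": {"conserved": 0.9, "variable": 0.1},
         "variable": {"conserved": 0.1, "variable": 0.9}}
emit = {"conserved": {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
        "variable": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}}

print(viterbi("AAAAGCTG", states, start, trans, emit))
```

Working in log space avoids numerical underflow on long sequences, since products of many small probabilities become sums.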

21 So… what do we do with an alignment to a model? (1) Align a bunch of sequences to the model, and get a new multiple alignment. (2) Align a single sequence to the model and get a numerical score stating how well it fits the model: “Find me all sequences in the database that match this family profile X with a log odds score of at least Y.” (3) Align a single sequence to the model, and get a description of its columns: “Columns 124 and 125 map to insert states of family Y, I wonder what that means?”

22 Training from unaligned sequences: One method: Start with a model whose length matches the average length of the sequences and with random emission and transition probabilities. Align all the sequences to the model. Use the alignment to alter the emission and transition probabilities. Repeat, continuing until the model stops changing. By-product: it produces a multiple alignment.

23 Training from unaligned continued: Advantages: You take full advantage of the expressiveness of your HMM. You might not have a multiple alignment on hand. Disadvantages: HMM training methods are local optimizers, so you may not get the best alignment or the best model unless you’re very careful. This can be alleviated by starting from a logical model instead of a random one.

24 For those of you keeping score… Lecture Objectives 1. The standard architecture for a profile HMM. 2. The three major uses in bioinformatics for a profile HMM. 3. The high-level concepts behind the algorithms to train and use a profile HMM. 4. The two different starting points for training an HMM. 5. How to avoid over-fitting a profile HMM. 6. The high-level ideas behind using profile HMMs to determine protein structure.

25 Preventing Over-fitting: Prior information (Dirichlet mixtures): combines prior information regarding amino acid frequencies with the observed counts at each step of training. The prior distribution used at each step depends on the number of examples seen in a given column and on the distribution of examples seen. Sequence weighting: some sequences are more frequent than others, and your model should not reflect this, so give the less frequent sequences more weight. Get more data.
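The two ideas can be combined in one small sketch. It is a deliberate simplification: a single uniform pseudocount stands in for the Dirichlet mixture prior named on the slide, and the sequence weights are assumed to be supplied by some weighting scheme that is not shown. The function name and the toy column are hypothetical.

```python
def regularized_emissions(column, weights, alphabet, pseudocount=1.0):
    """Weighted residue frequencies for one column, smoothed by a pseudocount."""
    # Start every residue at the pseudocount so nothing gets probability zero.
    counts = {res: pseudocount for res in alphabet}
    for res, weight in zip(column, weights):
        if res != "-":
            counts[res] += weight  # each sequence contributes its weight
    total = sum(counts.values())
    return {res: count / total for res, count in counts.items()}


column = "AAAT"  # residues from four sequences at one alignment column

# Unweighted: the three near-identical sequences dominate the column.
print(regularized_emissions(column, [1, 1, 1, 1], "ACGT"))

# Assumed weights: the three similar sequences share the weight of one.
print(regularized_emissions(column, [1 / 3, 1 / 3, 1 / 3, 1], "ACGT"))
```

With few sequences the pseudocount dominates, and with many it fades, which mirrors the slide's point that the prior's influence depends on the number of examples seen.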

26 Finally, Protein Structure Determination! Two good routes to take: use a database of profile HMMs, or make a profile of the target and search. Or, do both.

27 Database of Profiles: Make an HMM for each protein in the database with known structure: collect its homologs from the database, then build a model with the homologs. Match your protein sequence against every model in the database. Predict the structure of whichever model scores the highest.

28 Profile of Target: SAM-T98 was the best method at CASP 3 (Critical Assessment of Structure Prediction) that made use of no direct structural information. Create a model of your target sequence. Search a database of proteins using that model. Whichever sequence scores highest, predict that structure.

29 How do we build a model using only one sequence?

30 Profile HMM Effectiveness Overview: Advantages: Very expressive profiling method. Transparent method: you can view and interpret the model produced. Very effective at detecting remote homologs. Disadvantages: Slow: a full search on a database of 400,000 sequences can take 15 hours. Have to avoid over-fitting and locally optimal models.

31 Score Board: 1. What is the standard architecture for a profile HMM? 2. What are the three major uses in bioinformatics for a profile HMM? Aligning a sequence, scoring a sequence, interpreting a sequence. 3. What are the high-level concepts behind the algorithms to train and use a profile HMM? Dynamic programming and iterative alignments.

32 Score Board: 4. What are the two different starting points for model training? Either aligned or unaligned sequences. 5. How do I avoid over-fitting a profile HMM? Prior values, sequence weighting, getting more data. 6. What are the high-level ideas behind using profile HMMs to determine protein structure? Use a profile of your target, or many profiles of database sequences, to match your target to a known structure.

33 Any questions? ? More info available at: