1 Profile Hidden Markov Models For Protein Structure Prediction Colin Cherry colinc@cs
2 Biological Motivation: Given a single amino acid target sequence of unknown structure, we want to infer the structure of the resulting protein.
3 But wait, that’s hard! There’s physics, chemistry, secondary structure, tertiary structure and all sorts of other nasty stuff to deal with. Let’s rephrase the problem: Given a target amino acid sequence of unknown structure, we want to identify the structural family of the target sequence through identification of a homologous sequence of known structure.
4 It still sounds hard… In other words: We find a similar protein with a structure that we understand, and we see if it makes sense to fold our target into the same sort of shape. If not, we try again with the second most similar structure, and so on. What we’re doing is taking advantage of the wealth of knowledge that has been collected in protein and structure databases.
5 So, the next question is: How do we find a known protein that is similar to our target sequence? One method happens to be: Hidden Markov Models! (Profile Hidden Markov Models, to be precise)
6 Lecture Objectives Once I’m done, you should know: 1. The standard architecture for a profile HMM. 2. The three major uses in bioinformatics for a profile HMM. 3. The high-level concepts behind the algorithms to train and use a profile HMM. 4. The two different starting points for training an HMM. 5. How to avoid over-fitting a profile HMM. 6. The high-level ideas behind using profile HMMs to determine protein structure. Please feel free to interrupt at any point.
7 Outline This talk will be broken into the following sections: Methods for Characterizing a Protein Family; The Architecture of a Profile HMM; Alignment and Training with Profile HMMs; Preventing Over-fitting; Determining Structure; Conclusion
8 Methods for Characterizing a Protein Family Objective: Given a number of related sequences, encapsulate what they have in common in such a way that we can recognize other members of the family. Some standard methods for characterization: Multiple Alignments Regular Expressions Consensus Sequences Hidden Markov Models
9 A Characterization Example How could we characterize this (hypothetical) family of nucleotide sequences? Keep the multiple alignment. Try a regular expression: [AT] [CG] [AC] [ACTG]* A [TG] [GC]. But what about T G C T - - A G G vs. A C A C - - A T C? The expression accepts both, so it cannot tell a poor match from a good one. Try a consensus sequence: A C A - - - A T C. Whether a sequence matches then depends on the distance measure. Example borrowed from Salzberg, 1998
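A minimal sketch of the regular expression approach, assuming gaps are simply stripped before matching (the helper name `is_member` is made up for illustration). It shows that the expression gives only a yes/no answer and accepts both sequences from the slide:

```python
import re

# The slide's regular expression, written without spaces.
FAMILY_PATTERN = re.compile(r"[AT][CG][AC][ACTG]*A[TG][GC]")

def is_member(aligned_row: str) -> bool:
    """Return True if the row, with gaps and spaces removed, matches the pattern."""
    sequence = aligned_row.replace("-", "").replace(" ", "")
    return FAMILY_PATTERN.fullmatch(sequence) is not None

first = "T G C T - - A G G"
second = "A C A C - - A T C"
print(is_member(first), is_member(second))  # prints: True True
```

Both rows come back as members, which is exactly the weakness the slide points at: there is no notion of a better or worse match.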
10 HMMs to the rescue! Transition probabilities Emission Probabilities
12 Scoring our simple HMM #1: “T G C T - - A G G” vs. #2: “A C A C - - A T C”. Regular expression ([AT] [CG] [AC] [ACTG]* A [TG] [GC]): #1 = Member, #2 = Member. HMM (probability): #1 = 0.0023%, #2 = 4.7%. HMM (log odds): #1 = -0.97, #2 = 6.7.
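The slide does not say how the log odds scores are derived from the probabilities. One common convention, assumed here, is the natural log of the model probability divided by the probability under a null model that emits each nucleotide with probability 0.25. With a sequence length of 7 (gaps removed), this sketch lands close to the slide's numbers:

```python
import math

def log_odds(model_probability: float, length: int, background: float = 0.25) -> float:
    """Natural-log odds of the HMM probability against a uniform null model.

    The uniform background of 0.25 per nucleotide is an assumption,
    not something stated on the slide.
    """
    null_probability = background ** length
    return math.log(model_probability / null_probability)

# Probabilities from the slide, converted from percent.
score_1 = log_odds(0.0023 / 100, length=7)
score_2 = log_odds(4.7 / 100, length=7)
print(round(score_1, 2), round(score_2, 2))
```

The small differences from -0.97 and 6.7 are consistent with the slide's percentages being rounded. A score above zero means the model explains the sequence better than the null model does.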
13 Standard Profile HMM Architecture Three types of states: Match Insert Delete One delete and one match per position in model One insert per transition in model Start and end “dummy” states Example borrowed from Cline, 1999
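The slide's figure is not part of this transcript, so the following is only a sketch of the usual profile HMM layout, with made-up names (`State`, `build_states`, `successors`). It assumes one match and one delete state per position, one insert state per gap between positions (including before the first and after the last), and the transition pattern commonly drawn for profile HMMs:

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class State:
    kind: str      # "start", "match", "insert", "delete", or "end"
    position: int  # model position; 0 for start and for the leading insert

def build_states(model_length: int) -> List[State]:
    """List every state of a profile HMM with the given number of positions."""
    states = [State("start", 0), State("insert", 0)]
    for k in range(1, model_length + 1):
        states += [State("match", k), State("delete", k), State("insert", k)]
    states.append(State("end", model_length + 1))
    return states

def successors(state: State, model_length: int) -> List[State]:
    """States reachable in one step (assumed standard topology)."""
    if state.kind == "end":
        return []
    k = state.position
    nxt = [State("insert", k)]  # insert states loop on themselves
    if k < model_length:
        nxt += [State("match", k + 1), State("delete", k + 1)]
    else:
        nxt.append(State("end", model_length + 1))
    return nxt

print(len(build_states(3)))  # 3 match + 3 delete + 4 insert + start + end
```

Because every state has at most three successors, the number of transitions per state is constant no matter how long the model is.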
14 Match States Example borrowed from Cline, 1999
15 Insert States Example borrowed from Cline, 1999
16 Delete States Example borrowed from Cline, 1999
17 Aligning and Training HMMs Training from a Multiple Alignment Aligning a sequence to a model Can be used to create an alignment Can be used to score a sequence Can be used to interpret a sequence Training from unaligned sequences
18 Training from an existing alignment This process is what we’ve been seeing up to this point. Start with a predetermined number of states in your HMM. For each position in the model, assign a column in the multiple alignment that is relatively conserved. Emission probabilities are set according to amino acid counts in columns. Transition probabilities are set according to how many sequences make use of a given delete or insert state.
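A rough sketch of the emission half of this procedure. The toy alignment, the gap-fraction rule for picking conserved columns, and the 0.3 threshold are all assumptions made for illustration; transition counts are left out to keep it short:

```python
from collections import Counter
from typing import Dict, List

def match_columns(alignment: List[str], max_gap_fraction: float) -> List[int]:
    """Pick columns with few gaps to serve as match positions (assumed rule)."""
    chosen = []
    for j in range(len(alignment[0])):
        gaps = sum(1 for row in alignment if row[j] == "-")
        if gaps / len(alignment) <= max_gap_fraction:
            chosen.append(j)
    return chosen

def emission_probabilities(alignment: List[str], columns: List[int]) -> List[Dict[str, float]]:
    """Turn residue counts in each chosen column into relative frequencies."""
    result = []
    for j in columns:
        counts = Counter(row[j] for row in alignment if row[j] != "-")
        total = sum(counts.values())
        result.append({symbol: n / total for symbol, n in counts.items()})
    return result

# Hypothetical toy alignment, not data from the slides.
toy_alignment = [
    "ACA---ATG",
    "TCAACTATC",
    "ACAC--AGC",
    "AGA---ATC",
    "ACCG--ATC",
]
columns = match_columns(toy_alignment, max_gap_fraction=0.3)
emissions = emission_probabilities(toy_alignment, columns)
print(columns)
print(emissions[0])
```

The gappy middle columns are left out of `columns`, so they would be handled by an insert state. Raw frequencies like these assign zero probability to anything unseen, which is the over-fitting problem discussed later in the talk.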
19 Remember the simple example Chose six positions in model. Highlighted area was selected to be modeled by an insert due to variability. Can also do neat tricks for picking length of model, such as model pruning.
20 Aligning sequences to a model Now that we have a profile model, let’s use it! Conceptually: try every possible path through the model that would produce the target sequence, and keep the best one and its probability. The Viterbi algorithm, a long-established dynamic programming method, finds that best path without enumerating them all. Time: O(N*M), Space: O(N*M) (assuming a constant number of transitions per state), where N = length of sequence and M = number of states in the HMM.
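To make the dynamic programming idea concrete, here is a sketch of Viterbi decoding for a plain HMM in which every state emits a symbol. It is not the full profile HMM recurrence, which also has to handle silent delete states. The two-state model and its numbers are invented for the example, and the observation sequence is assumed to be non-empty:

```python
import math

def ln(p):
    """Log that tolerates zero probabilities."""
    return math.log(p) if p > 0 else float("-inf")

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable state path and its log-probability."""
    # best[t][s]: best log-probability of any path ending in state s at step t
    best = [{s: ln(start_p[s]) + ln(emit_p[s][observations[0]]) for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev = max(states, key=lambda p: best[t - 1][p] + ln(trans_p[p][s]))
            best[t][s] = (best[t - 1][prev] + ln(trans_p[prev][s])
                          + ln(emit_p[s][observations[t]]))
            back[t][s] = prev
    # Trace back from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, best[-1][last]

# Invented toy model: one state prefers A/T, the other prefers G/C.
states = ["AT_rich", "GC_rich"]
start_p = {"AT_rich": 0.5, "GC_rich": 0.5}
trans_p = {"AT_rich": {"AT_rich": 0.9, "GC_rich": 0.1},
           "GC_rich": {"AT_rich": 0.1, "GC_rich": 0.9}}
emit_p = {"AT_rich": {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1},
          "GC_rich": {"A": 0.1, "T": 0.1, "G": 0.4, "C": 0.4}}
path, log_prob = viterbi("AATTGGCC", states, start_p, trans_p, emit_p)
print(path, log_prob)
```

For this input the decoded path stays in `AT_rich` for the first four symbols and then switches to `GC_rich`. This general version costs O(N*S^2) for S states because it checks every predecessor; the O(N*M) figure on the slide relies on each profile HMM state having only a constant number of incoming transitions.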
21 So… what do we do with an alignment to a model? Align a bunch of sequences to the model, and get a new multiple alignment. Align a single sequence to the model and get a numerical score stating how well it fits the model “Find me all sequences in the database that match this family profile X with a log odds score of at least Y” Align a single sequence to the model, and get a description of its columns “Columns 124 and 125 map to insert states of family Y, I wonder what that means?”
22 Training from unaligned sequences One method: Start with a model whose length matches the average length of the sequences and with random emission and transition probabilities. Align all the sequences to the model. Use the alignment to alter the emission and transition probabilities. Repeat. Continue until the model stops changing. By-product: it produces a multiple alignment.
23 Training from unaligned continued Advantages: You take full advantage of the expressiveness of your HMM. You might not have a multiple alignment on hand. Disadvantages: HMM training methods are local optimizers; you may not get the best alignment or the best model unless you’re very careful. This can be alleviated by starting from a logical model instead of a random one.
24 For those of you keeping score… Lecture Objectives 1. The standard architecture for a profile HMM. 2. The three major uses in bioinformatics for a profile HMM. 3. The high-level concepts behind the algorithms to train and use a profile HMM. 4. The two different starting points for training an HMM. 5. How to avoid over-fitting a profile HMM. 6. The high-level ideas behind using profile HMMs to determine protein structure.
25 Preventing Over-fitting Prior Information (Dirichlet Mixtures) Combines prior information regarding amino acid frequencies at each step of training Prior distribution used at each step depends on: Number of examples seen in a given column Distribution of examples seen Sequence Weighting Some sequences are more frequent than others Your model should not reflect this Give the less frequent sequences more weight Get more data
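Dirichlet mixtures are too involved for a short example, so this sketch swaps in the simplest relative of that idea: a single fixed prior blended with the observed counts as pseudocounts. The function name, the uniform prior, and the prior weight of 1.0 are assumptions for illustration:

```python
from typing import Dict

def smoothed_emissions(counts: Dict[str, int], alphabet: str,
                       prior: Dict[str, float], prior_weight: float = 1.0) -> Dict[str, float]:
    """Blend observed counts with a prior distribution (which must sum to 1).

    With few observations the prior dominates; with many, the counts dominate.
    """
    total = sum(counts.get(symbol, 0) for symbol in alphabet) + prior_weight
    return {symbol: (counts.get(symbol, 0) + prior_weight * prior[symbol]) / total
            for symbol in alphabet}

alphabet = "ACGT"
uniform_prior = {symbol: 0.25 for symbol in alphabet}
column_counts = {"A": 4, "T": 1}  # a column where C and G were never observed
probabilities = smoothed_emissions(column_counts, alphabet, uniform_prior)
print(probabilities)
```

The unseen symbols C and G now receive a small non-zero probability instead of zero, so a new family member that differs at this position is penalized rather than ruled out. A Dirichlet mixture goes further by choosing the prior based on what has been seen in the column.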
26 Finally, Protein Structure Determination! Two good routes to take: A database of profile HMMs Make a profile of the target and search Or, do both
27 Database of Profiles Make an HMM for each protein in the database with known structure Collect its homologs from the database Build a model with the homologs Match your protein sequence against every model in the database Predict the structure of whichever model scores the highest
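The search loop itself is simple, as this sketch shows. The scorer here is a deliberately crude stand-in (gapless, per-position log odds) rather than a real profile HMM alignment score, and the family names and numbers are invented:

```python
import math

def toy_log_odds(model, target, background=0.25, floor=1e-3):
    """Stand-in scorer: gapless per-position log odds, NOT a full HMM alignment."""
    if len(model) != len(target):
        return float("-inf")
    return sum(math.log(column.get(symbol, floor) / background)
               for column, symbol in zip(model, target))

def rank_models(target, models, scorer):
    """Score the target against every model and sort from best to worst."""
    scored = [(scorer(model, target), name) for name, model in models.items()]
    return sorted(scored, reverse=True)

# Hypothetical database: each model is a list of per-position emission tables.
models = {
    "family_A": [{"A": 0.8, "T": 0.2}, {"C": 0.8, "G": 0.2}, {"A": 0.8, "C": 0.2}],
    "family_B": [{"G": 0.9, "C": 0.1}] * 3,
}
ranking = rank_models("ACA", models, toy_log_odds)
best_score, best_family = ranking[0]
print(best_family, round(best_score, 2))
```

Keeping the full ranking rather than only the top hit fits the earlier point about trying the second most similar structure when the first one does not make sense.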
28 Profile of Target SAM-T98: Best method that made use of no direct structural information at CASP3 (Critical Assessment of Structure Prediction). Create a model of your target sequence. Search a database of proteins using that model. Whichever sequence scores highest, predict that structure.
29 How do we build a model using only one sequence?
30 Profile HMM Effectiveness Overview Advantages: Very expressive profiling method Transparent method: You can view and interpret the model produced Very effective at detecting remote homologs Disadvantages: Slow – full search on a database of 400,000 sequences can take 15 hours Have to avoid over-fitting and locally optimal models
31 Score Board: 1. What is the standard architecture for a profile HMM? 2. What are the three major uses in bioinformatics for a profile HMM? (1) Aligning a sequence (2) Scoring a sequence (3) Interpreting a sequence 3. What are the high-level concepts behind the algorithms to train and use a profile HMM? Dynamic programming and iterative alignments
32 Score Board: 4. What are the two different starting points for model training? Either from aligned or unaligned sequences 5. How do I avoid over-fitting a profile HMM? Prior values, sequence weighting, get more data 6. What are the high-level ideas behind using profile HMMs to determine protein structure? Use a profile of your target or many profiles of your database sequences to match your target to a known structure
33 Any questions? More info available at: www.cs.ualberta.ca/~colinc