1
Protein Folding Initiation Site Motifs Chris Bystroff Dept of Biology Rensselaer Polytechnic Institute, Troy, NY
2
[Slide background: a wall of raw DNA sequence letters] Bioinformatics = sequence analysis. Biological sequences come in two types: DNA and protein. DNA has a four-letter alphabet. Protein has a 20-letter alphabet. Sequences are an abstraction. As such, they are treated abstractly... Sequence alignment. Phylogenetic trees. Gene finding. Data mining.
3
"A free-standing reality" ATGCATCAGG ACTAGCTATCA GAATC Any DNA sequence REPRESENTS a physical object, and some DNA sequences translate to protein serquences, which also REPRESENT physical objects. behind the abstraction...
4
Sequence = Structure Structure = Function Function = Life __________________ Sequence = Life
5
The protein folding problem Unfolded Folded This happens spontaneously (in water). Sequence = Structure
6
The problem with the protein folding problem. Number of amino acid residues in a typical protein: 100. Approximate number of degrees of freedom per residue: 3. Estimated total number of conformations (= 3^100): 10^45. Time required to fold if all conformations are sampled at the rate of 1 per 10^-15 s: 10^20 y. Time since the Big Bang: ~13 x 10^9 y.
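The arithmetic behind this slide can be checked directly. The sketch below uses only the slide's assumptions (100 residues, 3 states per residue, one conformation sampled per 10^-15 s); the constant names are illustrative. The slide quotes rounded order-of-magnitude figures, and a direct calculation from the same assumptions gives somewhat larger values, which only strengthens the argument.

```python
# Levinthal-style estimate using the slide's assumptions.
RESIDUES = 100                 # residues in a typical protein
STATES_PER_RESIDUE = 3         # degrees of freedom (states) per residue
SAMPLE_TIME_S = 1e-15          # seconds spent sampling one conformation
SECONDS_PER_YEAR = 365.25 * 24 * 3600
AGE_OF_UNIVERSE_Y = 13e9       # time since the Big Bang, from the slide

conformations = STATES_PER_RESIDUE ** RESIDUES            # exact integer
search_time_y = conformations * SAMPLE_TIME_S / SECONDS_PER_YEAR
ratio = search_time_y / AGE_OF_UNIVERSE_Y

print(f"conformations     ~ {conformations:.1e}")
print(f"exhaustive search ~ {search_time_y:.1e} years")
print(f"that is {ratio:.1e} times the age of the universe")
```

Whatever the exact exponents, an exhaustive search is out of the question, which is why folding pathways must exist.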
7
pathways
8
folding pathways must exist The protein is unfolded......something happens first......then something else happens.
9
Early events eliminate alternative pathways
10
What happens first? Helix/coil transition: 10-100 ns. Beta-hairpin: 0.1-1.0 µs. Transient intermediates: < 1 ms. Equilibrium: 0.001-1.0 s.
11
Local structure usually isn't stable Helices and turns form quickly but just as quickly fall apart. Most short peptides (<20aa) do not show structural stability in NMR studies. Exceptions: A few short peptides have been shown to be conformationally stable (for example Met-enkephalin = YGGFM)
12
Interesting parallels between bioinformatics and semantics
language | proteins
letters | amino acids
words | motifs
phrases | modules
sentences | whole proteins
meaning | structure
literature | genome
grammar | folding??
13
[Slide background: the same wall of raw DNA sequence letters] Does anyone know the words? What if we use the enormous database of protein sequences to find recurrent short patterns? Those short patterns would be the words. But, are they "meaningful words"? (Does the sequence correlate with the local structure?)
14
Maybe, protein folding pathways can be found in protein sequence "grammar" 1. Letters 2. Words 3. Phrases 4. Sentences
15
Amino acids can be grouped
16
Sequence alignments show evolutionary diversity
17
Sequence alignment: VIVAANRSA VIVSAARTA VIASAVRTA VIVDAGRSA VIASGVRTA VIVAAKRTA VIVSAVRTP VIVSAARTA VIVSAVRTP VIVDAGRTA VIVSGARTP VIVDFGRTP VIVSATRTP VIVGALRTP VIVSATRTP VIASAARTA VIVDAIRTP VIVAAYRTA VIVSAARTP VIVDAIRTP VIVSAVRTA VIVAAHRTA
Sequence profile: $P_{ij} = \frac{\sum_{k \in \mathrm{seqs}} w_k \, \delta(s_{kj} = aa_i)}{\sum_{k} w_k}$, where $s_{kj}$ is the residue at position $j$ of sequence $k$, $aa_i$ is amino acid type $i$, and $w_k$ is the weight of sequence $k$.
Sequence profiles are condensed sequence alignments. Red = high prob ratio (> 3). Green = background prob ratio (~1). Blue = low prob ratio (< 1/3). (Gribskov)
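As a minimal sketch of how an alignment condenses into a profile, the function below computes the weighted frequency of each amino acid in each column. It assumes equal sequence weights unless others are passed in, and the function name is illustrative. It returns raw frequencies only; the color scale on the slide refers to probability ratios, which would additionally divide by background frequencies.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def sequence_profile(seqs, weights=None):
    """Return profile[j][aa]: weighted frequency of amino acid aa in column j."""
    if weights is None:
        weights = [1.0] * len(seqs)          # assumption: equal weights
    length = len(seqs[0])
    assert all(len(s) == length for s in seqs), "alignment must be rectangular"
    total = sum(weights)
    profile = []
    for j in range(length):
        column = dict.fromkeys(AMINO_ACIDS, 0.0)
        for seq, w in zip(seqs, weights):
            column[seq[j]] += w              # the delta(s_kj = aa_i) term
        profile.append({aa: v / total for aa, v in column.items()})
    return profile

# First four sequences of the alignment on the slide.
alignment = ["VIVAANRSA", "VIVSAARTA", "VIASAVRTA", "VIVDAGRSA"]
profile = sequence_profile(alignment)
print(profile[0]["V"])                       # 1.0: column 1 is always V
print(profile[2]["V"], profile[2]["A"])      # 0.75 0.25
```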
18
Clustering profiles. Each dot represents a different 1-residue profile. "Distance" between two points $j$ and $k$ = $\sum_{l=1} \sum_{i=1}^{20} |P_{ijl} - P_{ikl}|$ (a single position, $l = 1$). "Kmeans" clustering did it! Resulting clusters: K Q R A S T A CS W Y F A P G D E N I L V M H Y
19
Protein sequence grammar 1. Letters: amino acid profiles 2. Words 3. Phrases 4. Sentences
21
Clustering profile segments, length L. Each dot represents a different short profile (~120,000 segments). "Distance" from $j$ to $k$ = $\sum_{l=1}^{L} \sum_{i=1}^{20} |P_{ijl} - P_{ikl}|$. ~800 clusters for each L, L = 3,15.
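The distance on this slide is a city-block (L1) distance summed over the 20 amino acids and the L positions of a segment. The sketch below illustrates it with one-hot "profiles" built from plain sequences; `one_hot` and `nearest` are hypothetical helpers added for the example, and `nearest` shows only the assignment step of a k-means-style clustering.

```python
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def profile_distance(p, q):
    """City-block distance between two profile segments of equal length.

    p and q are lists of L positions; each position is a list of 20
    amino-acid probabilities.
    """
    assert len(p) == len(q)
    return sum(abs(a - b)
               for pos_p, pos_q in zip(p, q)
               for a, b in zip(pos_p, pos_q))

def one_hot(seq):
    """Toy profile: probability 1 for the observed residue, 0 elsewhere."""
    return [[1.0 if aa == x else 0.0 for x in ALPHABET] for aa in seq]

def nearest(segment, centroids):
    """Index of the closest centroid (the assignment step of k-means)."""
    return min(range(len(centroids)),
               key=lambda k: profile_distance(segment, centroids[k]))

a, b, c = one_hot("VIV"), one_hot("VIA"), one_hot("GPG")
print(profile_distance(a, b))   # 2.0: one position differs
print(profile_distance(a, c))   # 6.0: all three positions differ
print(nearest(b, [a, c]))       # 0: "VIA" is closer to "VIV" than to "GPG"
```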
22
Learning the structure of each sequence cluster. Iterate: compute the profile of the cluster; search the database for the 400 nearest neighbors, giving a cluster of nearest neighbors; remove all cluster members that do not conform with the paradigm; recompute the profile of the cluster. After convergence, a cross-validation test is done.
23
I-sites library of sequence-structure motifs. 1000's of sequence clusters -> supervised learning -> cross-validation -> 262 motifs. Number of different motifs after removing register variants: 31.
24
Example of a motif Sequences that match sequence profile.......tend to have the same structure......and this is it.
25
Clustering finds previously known sequence-structure motifs: amphipathic α-helix, amphipathic β-strand, α-helix N-cap. Patterns shown: pnppn, nSEnpnn.
26
Many new motifs are found: diverging type-2 turn, serine hairpin, Type-I hairpin, frayed helix, proline helix C-cap, alpha-alpha corner, glycine helix N-cap.
28
Why are there motifs in proteins? Ancient conserved regions? Selection for stability? Folding initiation sites?
29
Structural features seem to drive clustering. 1. glycine at strained angles 2. conserved sidechain contacts 3. negative design against alternative structures (helix)
30
I-sites sequence patterns are distinct (Bystroff & Baker, J. Mol. Biol, 1998)
Columns: # | Motif | Number of clusters | Pattern sites / 100 positions (overall) | Pattern sites / 100 positions (confid. > 0.60) | Average boundaries: mda° | dme | rmsd (len) | Conserved non-polar residues
1 | Amphipathic α-helix | 13 | 3.1 | 0.9 | 56 | 0.71 | 0.78 (15) | 1-4-8, 1-5-8
2 | Non-polar α-helix | 6 | 0.9 | 0.12 | 54 | 0.58 | 0.40 (11) | 1-4-8, 1-5-8
3 | Schellman cap Type 1 | 6 | 0.09 | 0.07 | 81 | 1.01 | 1.02 (15) | 1-6-9-11
4 | Schellman cap Type 2 | 10 | 0.3 | 0.14 | 76 | 0.94 | 0.94 (15) | 1-6-8-9
5 | Proline α-helix C cap | 10 | 1.8 | 0.6 | 92 | 1.07 | 0.89 (13) | 1-2-5-8
6 | Frayed helix | 2 | 1.2 | 0.13 | 75 | 0.96 | 0.69 (15) | 1-5-9-13
7 | Helix N capping box | 10 | 1.1 | 0.6 | 99 | 0.95 | 0.65 (15) | 1-6-9-13
8 | Amphipathic β-strand | 8 | 6.8 | 2.1 | 89 | 0.87 | 0.87 (6) | 1-3, 1-3-5
9 | Hydrophobic β-strand | 5 | 2.3 | 0.3 | 101 | 0.91 | 0.91 (7) | 1-2-3
10 | β-bulge | 2 | 0.5 | 0.15 | 100 | 0.97 | 0.78 (7) | 1-4-6
11 | Serine β-hairpin | 4 | 1.3 | 0.3 | 94 | 0.76 | 0.81 (9) | 1-8
12 | Type-I hairpin | 2 | 0.07 | 0.04 | 80 | 0.94 | 1.23 (13) | 1-7-8
13 | Diverging Type-II turn | 4 | 0.3 | 0.14 | 87 | 1.04 | 1.00 (9) | 1-7-9
31
A hypothesis: I-sites sequence motifs are folding initiation sites. The I-sites sequence patterns are mutually exclusive. Each I-sites motif is found in a variety of contexts. Local structure forms fast. Early-folding units 'initiate' folding. One reason this hypothesis may be wrong: Database statistics may reflect bias in the data.
32
Alpha helices may fold by packing interactions. Dots show positions of alpha-carbons relative to the amphipathic helix motif. The hydrophobic side is up. maybe not...
33
How do we test this hypothesis? See if I-sites peptides fold in isolation from the rest of the protein.... by NMR.... by simulation.
34
NMR structure of a 7-residue I-sites motif in isolation (Yi et al, J. Mol. Biol, 1998) diverging turn
35
Partial literature search of peptide NMR structures
I-sites motif | Authors | Date
glycine helix cap | Viguera | 1995
serine hairpin | Blanco | 1994
Type-I hairpin | deAlba | 1996
diverging turn | Sieber | 1996
36
Molecular dynamics... is a cheap substitute for an NMR spectrometer. What is MD? A simulation of the dynamic behavior of the molecule in water, using "first principles." Advantages? You can observe the system directly. Disadvantages? It's not a real system, just an approximation.
37
Helical peptide simulations. AMBER (parm94) force field. Randomly chosen natural sequences. Initially extended. 800-900 waters added. Ions added (Na, Cl). 7-30 ns at 340 K. Sequences: AAALDRMR AALEALLR AANRSHMP AARYKFIE ADFKAAVA AFDGETEI AKELVVVY AKGVETAD ARFTKRLG ATLEEKLN CNGGHWIA DAVTRYWP DEAIDAYI DELTRHIR DYVRSKIA EDLVERLK EELKQALR EEMVSKLK EKLLESLE EKPFGTSY EQIKAAVK FHMYFMLR FSVMNDAS FYSSYVYL GQLMALKQ HNLIEAFE IEHTLNEK IQNGDWTF KAAIAQLR KKYRPETD KNPDNVVG KPMGPLLV KQAHPDLK KQDKHYGY KSYLRSLR LDLHQTYL NAVWAAIK NETHSGRK NFLEVGEY NPVKESRH PAIISAAE PLQHHNLL PRDANTSH QDDARKLM QGIIDKLD QKMKTYFN QTLAQLSV RDFEERMN RIILDRHR RLLLKAYR RPIARMLS RVLGRDLF SCDVKFPI TEVMKRLV TLNEKRIL YASLRSLV YESHVGCR
38
The MD scheme Select random peptides and predict how much helix they will have, using the I-sites motif pattern. Run LONG simulations. Test to see whether they have reached equilibrium. If they have, find out how much of the time the peptide spent in a helical state. (by cluster analysis) Does the fraction helix correlate with the prediction?
39
Cluster analysis of trajectories 1) Define a node for every step in the trajectory, keep the backbone angles (θ). 2) For each node, draw an edge to every other node for which max(Δθ) < 60°. 3) The node with the most edges defines the first cluster. Remove it and all its neighbors. Then the node with the most edges among those remaining defines the second cluster. Etc.
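The three steps above can be sketched as a greedy neighbor-count clustering. This is a toy version under stated assumptions: each frame is a tuple of backbone angles in degrees, angle differences wrap around at 360°, edge counts are recomputed among the remaining nodes after each removal, and ties go to the earliest frame. The function names and the six example frames are made up for illustration.

```python
def angle_diff(a, b):
    """Smallest absolute difference between two angles, in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)

def max_angle_diff(x, y):
    """Largest per-angle difference between two frames."""
    return max(angle_diff(a, b) for a, b in zip(x, y))

def cluster_trajectory(frames, cutoff=60.0):
    """Return a list of (center_index, member_indices), largest cluster first."""
    remaining = set(range(len(frames)))
    clusters = []
    while remaining:
        # Step 2: edges between nodes whose max angle difference is below cutoff.
        neighbors = {
            i: [j for j in remaining
                if j != i and max_angle_diff(frames[i], frames[j]) < cutoff]
            for i in remaining
        }
        # Step 3: the node with the most edges defines the next cluster.
        center = max(sorted(remaining), key=lambda i: len(neighbors[i]))
        members = [center] + sorted(neighbors[center])
        clusters.append((center, members))
        remaining -= set(members)
    return clusters

# Toy trajectory: three helix-like frames, two strand-like frames, one other.
frames = [(-60, -45), (-65, -40), (-58, -50), (-120, 130), (-125, 135), (60, 45)]
print(cluster_trajectory(frames))
# [(0, [0, 1, 2]), (3, [3, 4]), (5, [5])]
```

The fraction of frames that fall in helical clusters then gives the fraction of time the peptide spent in a helical state.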
40
Clusters in conformational space. Our criteria for good clustering: no two clusters look alike, and no cluster looks like two. RPIARMLS
41
This is what a trajectory looks like if it has reached equilibrium. [Plot axes: ns, cluster number] Both halves of the trajectory have about the same distribution.
42
This is what it looks like if it has not. [Plot axes: ns, cluster number]
43
NAIIQELE movie A rough energy landscape.
44
There is a correlation between I-sites sequence score and the simulations r=0.48 (all peptides) r=0.61 (trajectories > 20ns long)
45
Sampling of sequence space 72 peptides were simulated. Is this a representative sample of the space of amphipathic helix sequences? I-sites motif 72 peptides, weighted by %helix 72 peptides, unweighted
46
What does this mean? The MD experiment separates the local effects from the non-local effects on helix formation. In the simulation, there are only local interactions. So the propensity for amphipathic sequences to form helix is mostly intrinsic.
47
Outliers Simulation too short. We see only meta-stable states. I-sites scoring method is missing something. Using additive probabilities ignores statistical dependence between different positions. Part-helix was not counted as helix in this study. Helix caps are competing motifs. (+-) and (-+) look just like (++) and (--)
48
QVFMRIME (a helix in 1dldA) Predicted to be helix with confidence = 0.86 Zero helix found in 17ns trajectory. What does it fold into? an outlier
49
Protein sequence grammar 1. Letters: Amino acid profiles 2. Words: I-sites motifs 3. Phrases: 4. Sentences
50
Protein sequence grammar 1. Letters: Amino acid profiles 2. Words: I-sites motifs 3. Phrases: a hidden Markov model 4. Sentences
51
Motif “grammar”? Arrangement of I-sites motifs in proteins is highly non-random helix cap beta strand beta turn The dependencies can be modeled as a Markov chain
52
How to make a Markov chain. Sequence data: The dog bit the mailman. The mailman kicked the dog back. Markov model (states): the, mailman, dog, bit, kicked, back. Stochastic output: The dog back. The mailman kicked the mailman kicked the dog bit the dog bit the dog bit the mailman kicked the dog....
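A minimal sketch of the same idea in code: count which word follows which in the two training sentences, then walk the chain at random. The helper names are illustrative. Unlike the slide's output, this toy version simply stops when it reaches a word with no observed successor ("back") instead of starting a new sentence.

```python
import random
from collections import defaultdict

def build_chain(sentences):
    """Map each word to the list of words observed to follow it."""
    chain = defaultdict(list)
    for sentence in sentences:
        words = sentence.lower().rstrip(".").split()
        for current, nxt in zip(words, words[1:]):
            chain[current].append(nxt)
    return chain

def generate(chain, start, n_words, rng):
    """Random walk over the chain, at most n_words long."""
    words = [start]
    while len(words) < n_words and chain.get(words[-1]):
        words.append(rng.choice(chain[words[-1]]))
    return words

data = ["The dog bit the mailman.", "The mailman kicked the dog back."]
chain = build_chain(data)
print(dict(chain))
print(" ".join(generate(chain, "the", 12, random.Random(0))))
```

Every pair of adjacent words in the output was seen in the data, but whole sentences need not be, which is why the output reads as locally plausible nonsense.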
53
A "hidden" Markov model What's "hidden" about it? An HMM is a Markov chain where the meaning of the Markov state is probabilistic.
54
How to make a hidden Markov chain. Sequence alignment data: The dog bit the mailman. The mailman kicked the dog back. The dog attacked the postman. The postman hit the dog. Hidden Markov model (states with emission probabilities): the; mailman 0.5 / postman 0.5; dog; bit 0.3 / attacked 0.7; kicked 0.6 / hit 0.4; back. Stochastic output: The dog back. The mailman kicked the postman kicked the dog bit the dog bit the dog attacked the mailman kicked the dog....
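To illustrate what "hidden" adds, the sketch below fixes one path through the states and lets each state emit a word using the probabilities on the slide. The state names ("article", "carrier", and so on) are invented for the example, and transitions between states are left out to keep it short.

```python
import random

# Each hidden state emits one of several words (probabilities from the slide).
EMISSIONS = {
    "article": {"the": 1.0},
    "carrier": {"mailman": 0.5, "postman": 0.5},
    "animal": {"dog": 1.0},
    "bite": {"bit": 0.3, "attacked": 0.7},
    "strike": {"kicked": 0.6, "hit": 0.4},
}

def emit(path, rng):
    """Emit one word per state along a fixed state path."""
    words = []
    for state in path:
        choices, probs = zip(*EMISSIONS[state].items())
        words.append(rng.choices(choices, weights=probs, k=1)[0])
    return words

PATH = ["article", "animal", "bite", "article", "carrier"]
rng = random.Random(0)
for _ in range(3):
    print(" ".join(emit(PATH, rng)))
```

The same state path can produce "the dog bit the mailman" or "the dog attacked the postman"; an observer sees only the words, not the states.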
55
One Markov state from HMMSTR. Transitions from the previous letter(s) and to the next letter(s): a_hi, a_ij, a_ik. One state emits one letter of each type (b, r, d, c); this is the probabilistic meaning of the state. Sequence profile (amino acid symbols): b_i = {ACDEF...}. Structure symbols: r_i = {HGEBdblLex} (regions), d_i = {HST}, c_i = {mnhd...}.
56
Constructing a HMM by aligning motifs
57
Merging many motifs into one HMM
58
HMMSTR: Hidden Markov Model for local protein STRucture. 282 nodes. 317 transitions. Unified model for 31 distinct sequence-structure motifs (Bystroff & Baker, J. Mol. Biol., 2000)
59
Variations on a motif theme are modeled as parallel paths Multiple state-pathways for the helix N-cap motif
60
Common sub-graphs represent common sub-structures These peptide segments have the same state sequence (except shaded residues)
61
How an HMM works initiation probability transition probability emission probability We have S (the sequence). We want Q (the 1D structure), and P (how well S fits Q)
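One standard way to get Q from S is the Viterbi algorithm, sketched below on a toy two-state model. All parameters here are invented for illustration (a helix-like state H and a coil-like state C over a four-letter alphabet); HMMSTR itself has hundreds of states that emit several symbol types, and nothing here reproduces its parameters. The returned score is the log-probability of S along the best path, one simple reading of "how well S fits Q".

```python
import math

STATES = ("H", "C")                                   # hypothetical states
INIT = {"H": 0.5, "C": 0.5}                           # initiation probabilities
TRANS = {"H": {"H": 0.9, "C": 0.1},                   # transition probabilities
         "C": {"H": 0.1, "C": 0.9}}
EMIT = {"H": {"A": 0.4, "L": 0.4, "G": 0.1, "P": 0.1},   # emission probabilities
        "C": {"A": 0.1, "L": 0.1, "G": 0.4, "P": 0.4}}

def viterbi(seq):
    """Return (best state path Q, log-probability of S along Q)."""
    score = {s: math.log(INIT[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []
    for letter in seq[1:]:
        prev, score, pointers = score, {}, {}
        for s in STATES:
            # Best predecessor state for landing in s at this position.
            best = max(STATES, key=lambda r: prev[r] + math.log(TRANS[r][s]))
            score[s] = (prev[best] + math.log(TRANS[best][s])
                        + math.log(EMIT[s][letter]))
            pointers[s] = best
        back.append(pointers)
    last = max(STATES, key=lambda s: score[s])
    path = [last]
    for pointers in reversed(back):                   # trace back the best path
        path.append(pointers[path[-1]])
    return "".join(reversed(path)), score[last]

q, logp = viterbi("AALLGPGP")
print(q)   # HHHHCCCC
```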
62
3-state secondary structure prediction 74.9% correct 74.6% correct
63
Predicting super-secondary context Results are for the independent test set.
64
Fully-automated tertiary structure prediction (1) Find homologues in the database (Psi-Blast) (2) Predict local structure (HMMSTR) (3) Assemble fragments (ROSETTA, D.Baker) sequence structure Protocol used for CAFASP2 experiment (2000)
65
Rosetta ab initio Scoring function: Bayesian classification of pairwise secondary structure contact types. Search function: Monte Carlo fragment insertion. A move consists of selecting a fragment at random from a set of local structure predictions. Coordinates are re-generated after swapping in the new fragment. (Simons et al, PNAS, 1997)
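The search loop described above can be sketched as a Metropolis Monte Carlo over fragment insertions. This is a deliberately crude toy, not Rosetta: the conformation is a string of per-residue states rather than 3D coordinates, fragments are 3-residue strings, and `toy_score` is a made-up placeholder energy (it penalizes loop residues and changes of state) standing in for the Bayesian scoring function.

```python
import math
import random

def toy_score(conf):
    """Placeholder energy, lower is better (NOT the Rosetta scoring function)."""
    changes = sum(a != b for a, b in zip(conf, conf[1:]))
    loops = sum(1 for x in conf if x == "L")
    return changes + loops

def fragment_insertion(fragments, length, steps, rng, temperature=1.0):
    """fragments[pos] lists candidate 3-residue fragments for window pos."""
    conf = ["L"] * length                       # start as all loop
    energy = toy_score(conf)
    best, best_energy = conf[:], energy
    for _ in range(steps):
        pos = rng.randrange(length - 2)         # pick a window at random
        frag = rng.choice(fragments[pos])       # pick a fragment at random
        trial = conf[:pos] + list(frag) + conf[pos + 3:]
        delta = toy_score(trial) - energy
        # Metropolis criterion: always accept downhill, sometimes uphill.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            conf, energy = trial, energy + delta
            if energy < best_energy:
                best, best_energy = conf[:], energy
    return "".join(best), best_energy

length = 10
fragments = [["HHH", "EEE", "LLL"] for _ in range(length - 2)]
structure, energy = fragment_insertion(fragments, length, 200, random.Random(0))
print(structure, energy)
```

In the real method, each accepted swap changes backbone torsion angles and the coordinates are regenerated before scoring; the accept/reject logic is the part this sketch is meant to show.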
66
CASP3 Prediction results for Target 56: DNA helicase. Predicted structure of 66-residue fragment (23-88). True structure of same fragment.
67
CAFASP Prediction results for Target 122: 1GEQ Tryptophan Synthase Predicted 97-residue fragment True structure of same fragment
68
Protein sequence grammar 1. Alphabet: amino acid profile 2. Words: I-sites motifs 3. Phrases: HMMSTR pathways 4. Sentences: contact maps the next step...
69
In progress: Data mining of contact maps HMMSTR predictions Protein sequences + contact maps Association-rule mining (M. Zaki) Rules for tertiary contacts
70
Predicting tertiary contacts Contact predictions for 2igd overall : 20% coverage w/20% accuracy Can the 2D map be translated to 3D?
71
I-sites/HMMSTR collaborators: David Baker (U. Washington), Karen Han (UCSF), Vesteinn Thorsson (U. Washington), Qian Yi (U. Washington), Edward Thayer (Zymogenetics), Shekhar Garde (RPI), Mohammed Zaki (RPI), Susan Baxter (Wadsworth -> Novartis), Chip Lawrence (Wadsworth/RPI), Bobbie Jo Webb (Wadsworth), Kim Simons (U. Washington -> Harvard). Bystroff Lab: Yu Shao, Xin Yuan, Jerry Huang. isites.bio.rpi.edu