Protein folding Process of folding Modeling the process of folding Evolution vs. folding Impact of function on protein evolution.

1 Protein folding Process of folding Modeling the process of folding Evolution vs. folding Impact of function on protein evolution

2 Process Local Interactions Secondary Structure Elements (SSE) Assembly of SSE Equilibrium Structure

3 Protein folding

4 Protein folding Important thing to note It is possible that residues that are not doing anything in the folded protein were actually critically important to get the peptide folded in the first place.

5 Protein folding Simulation studies are demonstrating that the most common protein folds are those that can withstand the most sequence variation over time without affecting their topologies. The prion protein is a poster-child example of the opposite.

6 Protein Evolution Evolutionary meaning Most common folds are those able to withstand point mutations the best. These are known as designable folds.

7 Protein folding Marginal stability The most stable folds are not necessarily those with the lowest energy, but those that maximally penalize switching to an alternative conformation.

8 Protein Evolution Marginal stability Evolutionary implication(s) There is thus selective pressure on residues in a protein not only to maintain important interactions, but also to make sure that some interactions NEVER happen.

9 Summary Proteins fold into energetically stable conformations. For one chain, there are a large number of possible conformations, however. The biological conformation is selected during folding: not necessarily the “best” conformation.

10 Role of biology on structures A few examples using mapping of rate of evolution. The fitness of a protein is ultimately its biological function, not its structure. We’ll have a look at their structural requirements.

11 Structural Biology

12 Outline How genetics encodes structure. What makes a protein fold. Role of biological function in preserving a fold. Comparing two structures for similarities.

13 Genetic information and proteins 3D information is encoded into (1D) sequences. STKKKPLTQEQLEDARRLKA IYEKKKNELGLSQESVADKM GMGQSGVGALFNGINALNAY NAALLAKILKVSVEEFSPSI AREIYEMYEA Protein structure of the λ repressor (cI) of phage lambda, PDB: 1LMB

14 Genetic information and proteins The encoding can only be indirect, because there is nothing in the DNA that tells each amino acid where to go.

15 Genetic information and proteins However, a few types of physical interactions dominate the process of protein folding.

16 Amino-acids Components Main Chain Side Chains Responsible for the “name”. Can be clustered based on: - chemical properties - structure. This ultimately determines the evolutionary interchangeability.

17 Protein folding Van der Waals forces The electron clouds around the nuclei are more stable if they can lightly interact with other electron clouds. This makes atoms sticky relative to each other.

18 Protein folding Electrostatic forces Long range interactions. Pull/Push over longer distances.

19 Protein folding Hydrogen bonds Electrostatic. Short range, not flexible. Can be seen as the velcro holding proteins together.

20 Protein folding Hydrophobic interactions Water molecules in a liquid pack so as to minimize their energies. This implies that water molecules are, more often than not, H-bonded with their neighbors.

21 Protein folding Hydrophobic interactions If you introduce a droplet of oil in solution, many hydrogen bonds will have to be broken at the interface, at an energy cost. This is why hydrophobic and hydrophilic groups look like they are avoiding each other.

22 Protein folding During folding, the polypeptide has to follow a strict sequence of events in order to find the correct conformation in a timely fashion.

23 Protein folding Secondary Structures Stable because of local H-bonds. They make larger blocks with fewer degrees of freedom.

24 Protein folding Geometry plays a very important role. Because there are only a few angles that can change along the backbone, there is a limited number of ways a protein can fold onto itself.

25 Protein structures are organized in a Hierarchical fashion Secondary structures - Geometry Dihedral Angle Because most main chain atoms are constrained in an “amide bond”, the entire trajectory of the chain can be defined by a pair of angles (φ, ψ) for each amino acid. This can be represented with a “Ramachandran Plot”, from which it is obvious that there is some kind of clustering going on.
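As a minimal sketch of how one such backbone angle can be computed, the function below returns the signed dihedral angle defined by four consecutive atom positions. The function name and the plain-tuple coordinate format are assumptions made for illustration; for φ the four atoms would be C(i-1), N, Cα, C, and for ψ they would be N, Cα, C, N(i+1).

```python
import math

def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees for four 3D points (hypothetical helper)."""
    def sub(a, b):
        return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

    def dot(a, b):
        return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

    def cross(a, b):
        return (a[1] * b[2] - a[2] * b[1],
                a[2] * b[0] - a[0] * b[2],
                a[0] * b[1] - a[1] * b[0])

    b1 = sub(p1, p0)          # first bond vector
    b2 = sub(p2, p1)          # central bond, the rotation axis
    b3 = sub(p3, p2)          # last bond vector
    n1 = cross(b1, b2)        # normal of the plane (p0, p1, p2)
    n2 = cross(b2, b3)        # normal of the plane (p1, p2, p3)
    b2_len = math.sqrt(dot(b2, b2))
    b2_unit = (b2[0] / b2_len, b2[1] / b2_len, b2[2] / b2_len)
    x = dot(n1, n2)
    y = dot(b2_unit, cross(n1, n2))
    return math.degrees(math.atan2(y, x))
```

Computing this pair of angles for every residue of a structure and plotting ψ against φ gives the Ramachandran plot.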

26 Protein structures are organized in a Hierarchical fashion Secondary structures – The alpha helix The Hydrogen Bond Again, a helix is an ideal setup to place our “velcro” H-bond always at the right place. Periodicity To the delight of statisticians and computer scientists.

27 Protein structures are organized in a Hierarchical fashion Secondary structures – The beta strand (beta sheets) Another periodical pattern ( ) Responsible for super-structure rigidity and some truly amazing patterns.

28 Protein structures are organized in a Hierarchical fashion Secondary structures – The myth of “random” coil. Random structures in proteins are extremely rare. Many use the expression anyway to refer to the “rest” of the protein. Other minor secondary structures Turns, loops, bridges. These don’t have the critical periodicity found in α and β structures, however.

29 Protein structures are organized in a Hierarchical fashion Tertiary structures – Why care about secondary structures? Secondary structures are building blocks. Detecting and predicting secondary structures is a key process in structural biology. Other uses: visualization, classification…

30 Protein Diversity The current release of PDB contains 28,000 structure entries. 26,000 are proteins There is an estimated possible unique protein folds.

31 PDB Overview Repository of structures Proteins, nucleotides, complexes, mutants. Quality improves over time. Data validation tools are getting better. More redundant structures are available for cross-reference.

32 Small number of folds Does this mean that all proteins come from a small set of ancestor molecules? Perhaps, but not necessarily.

43 Maximum-Likelihood Site-Rates are Biologically Relevant Rhodopsin-like G-protein receptors, Pfam (dataset 1Tml_7), 69 taxa. [Figure legend: Fast, Slow]

44 Maximum-Likelihood Site-Rates are Biologically Relevant Tubulin: α, 34 taxa; β, 33 taxa. The constraints imposed by co-evolution far outweigh the structural constraints. [Figure legend: Fast, Slow]

45 Phylogenetic mapping of structures Predicting rates of evolution This experiment was conducted to see if we could predict the rate of evolution in the enzyme Enolase.

46 Phylogenetic mapping of structures Predicting rates of evolution The most important factor to predict evolutionary constraints was the presence of the active site. Evolutionarily constrained by the active site.

47 Summary Structures are rigid templates to provide some biological function. It takes a lot of structure to position a few atoms in an enzyme.

48 Structural Homology Because 1 structure is made of thousands of coherent interactions: The probability to see a new structure emerge from a random sequence is close to 0. Therefore: similar structures are likely to be homologous.

49 Use of structural similarity in evolutionary studies Homology can be detected via sequence identity. Structures drift at a much slower rate. In fact, are they drifting at all? Structural similarity can be used to detect homology, although there is evidence that convergence is much more common in structure than in sequence.

50 Structural Convergence There are only so many different ways to fold a dozen secondary structure elements. Some folds are much more likely to evolve because they are more robust to mutations. Designability

51 Protein Similarity VAST Align secondary structures only. Consider the geometric transformation that brings as many helices and strands together as possible. CE Break each structure down into peptides of 8 residues. Find the best match against a reference protein. The final alignment is the transformation that aligns as many continuous residues as possible.

52 Comparing and aligning structures Expanding into detection methods What about remote, yet significant, similarities? Example on the right: there is a significant similarity between a single domain in two distinct proteins (yellow and orange). Are they homologous?

53 Comparing and aligning structures Difficulties in aligning structures. In some cases, the order of the elements that superimpose has been shuffled by circular permutation. There are many cases of structurally similar proteins with no more than a random degree of identity at the sequence level.

54 Comparing and aligning structures VAST (Vector Alignment Search Tool) Probably the most used service for protein alignment, since it runs off the NCBI web site and has already been run on every available structure. 1 – Given two proteins A and B. 2 – Given that each structure has a collection of secondary structure elements (SSE).

55 Comparing and aligning structures VAST (Vector Alignment Search Tool) 3 – Find the rotation and translation to apply to each helix/strand in A to align it with each element in B. These transformations can be summarized by a matrix.

56 Comparing and aligning structures VAST (Vector Alignment Search Tool) 4 – If two structures are identical, each helix/strand will be part of a pair with a common transformation matrix, which would just be the transformation that aligns the whole proteins. In remotely similar structures, not all helices/strands will have a match. The best set of rotation/translation will be the one that is shared by the largest number of secondary structure pairs.

57 Comparing and aligning structures VAST (Vector Alignment Search Tool) 4 – Sharing has to be defined a bit more formally (where alpha would be some kind of tolerance cutoff to determine if two transformations are identical):

58 Comparing and aligning structures VAST (Vector Alignment Search Tool) 4a – Every time we have a “match”, we draw a link between the two transformation matrices. The result is a so-called graph, with connections only between similar sets of rotation/translation.

59 Comparing and aligning structures VAST (Vector Alignment Search Tool) 5 – Once the problem is abstracted into a ‘graph’, it is possible to use the computational bag-of-tricks to figure out which set of connected matrices forms the largest group. The average rotation/translation in this group would best superimpose proteins A and B.
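The graph step can be illustrated with a toy sketch. This is not the VAST implementation: it assumes each candidate SSE pairing has already been reduced to a short tuple of numbers describing its transformation (a hypothetical encoding), links two transformations when their Euclidean distance is within the tolerance alpha, and reports the largest connected group. Averaging the tuples component by component is also a simplification, since real rotations do not average that way.

```python
from collections import deque

def largest_consistent_group(transforms, alpha):
    """Indices of the largest connected group of mutually similar transformations."""
    n = len(transforms)

    def close(a, b):
        # Two transformations "match" when their parameters differ by at most alpha.
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5 <= alpha

    adjacency = {i: [j for j in range(n)
                     if j != i and close(transforms[i], transforms[j])]
                 for i in range(n)}
    best, seen = [], set()
    for start in range(n):
        if start in seen:
            continue
        group, queue = [], deque([start])
        seen.add(start)
        while queue:                      # breadth-first walk of one component
            node = queue.popleft()
            group.append(node)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        if len(group) > len(best):
            best = group
    return sorted(best)

def average_transform(transforms, members):
    """Component-wise mean of the selected transformations (a simplification)."""
    return tuple(sum(transforms[i][d] for i in members) / len(members)
                 for d in range(len(transforms[0])))
```

With four candidate pairings of which three agree, `largest_consistent_group` returns the three agreeing indices and `average_transform` gives the consensus used for the superposition.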

60 Comparing and aligning structures VAST (Vector Alignment Search Tool) 7 – The alignment is performed irrespective of the sequence order of the structural elements. This is good because it can catch circularly permuted proteins. But it also increases the chances of finding a match by accident.

61 Comparing and aligning structures VAST (Vector Alignment Search Tool) 8 – Statistical validation. This is a very important step since there is only a limited number of ways a small number of SSE can interact. Thus, sampling in a large database of random structures would still return a distribution of “hits”. This is second-hand information: the p-value is the probability of observing a similar score by chance, multiplied by the number of possible alternative substructures within the comparison. The default cutoff should be regarded as a noise-reduction cutoff, not a bulletproof jacket.

62 Comparing and aligning structures CE (Combinatorial extension) CE doesn’t use secondary structure elements as the basic aligning unit. Instead, it seeks the optimal path amongst all possible n-mers between two query proteins. 1 – Given two proteins A and B of length n_A and n_B, CE will search for the longest continuous path P of aligned fragment pairs (AFP) of length m.

63 Comparing and aligning structures CE (Combinatorial extension) 4 – Some distance metric has to be made up to score AFP alignments: each residue is counted once; each residue is counted against all; using RMSD.

64 Comparing and aligning structures CE (Combinatorial extension) 4 – Pathfinding There is a substantial decrease in the size of the search space when the value of G is restricted. 1 – Select all possible next AFPs under a certain (self) threshold. 2 – Consider the path to choose the best next AFP. 3 – Choose whether to pursue the extension or not.

65 Comparing and aligning structures CE (Combinatorial extension) 4 – Statistics CE uses a z-score, which compares paths of similar length and score to a random sampling from a reference database. A z-score of 3.5 corresponds to a p-value of 10^-3. So, given about 2000 different protein folds, such a threshold would imply two fortuitous hits. Visual inspection must be done, and a more restrictive threshold should be used.
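The arithmetic behind the "two fortuitous hits" can be sketched as follows. The function names are hypothetical, and the second function assumes a simple Bonferroni-style correction, which the slide does not specify.

```python
def expected_chance_hits(p_value, n_comparisons):
    """Expected number of hits passing the threshold purely by chance."""
    return p_value * n_comparisons

def per_comparison_threshold(target_false_hits, n_comparisons):
    """p-value cutoff needed to keep the expected chance hits at a target (assumption)."""
    return target_false_hits / n_comparisons

# Numbers from the slide: p-value of 1e-3 against about 2000 folds.
hits = expected_chance_hits(1e-3, 2000)
stricter = per_comparison_threshold(0.05, 2000)
```

Here `hits` is the expected count of fortuitous matches at the slide's threshold, and `stricter` is the cutoff that would bring that expectation down to 0.05.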

66 Comparing and aligning structures CE (Combinatorial extension) Structural similarity between Acetylcholinesterase and Calmodulin found using CE (Tsigelny et al, Prot Sci, 2000, 9:180)

67 SCOP database Seen as the gold standard for protein structure classification. Query for structures given a protein sequence. Browse protein architecture organized in a hierarchical fashion. Keyword search for structures. Fold (common topology for secondary structure) → Superfamilies (probable common evolutionary origins, low sequence ID) → Families (common evolutionary origins) → Domains (individual domains)

68 CATH database Involves manual inspection and classification, especially at more abstract levels such as the architecture level. Class (secondary structure composition) → Architecture (what would be known as fold in SCOP) → Topology (what would be known as superfamily) → Sequence-level

69 Summary Aligning protein structures can detect homologous relationships that are deeper than those found by sequence alignment, because structures are more stable over time. VAST abstracts proteins into SSE, or secondary structure elements, and finds the set of rotation/translation that maximizes the number of paired SSE. CE looks for the best alignment frame to superimpose one protein onto another. Statistics are important because it is likely that small unrelated structures will resemble each other.

70 Summary A distribution of random protein scores can be generated by aligning unrelated proteins in the databases. An alignment score must be significantly larger than the score expected from this distribution. This type of analysis is used to classify protein folds and infer relationships between structural evolution and biological activity. Try to find structural neighbors of the protein 1AZT while browsing the NCBI website ( ).

71 Molecular Modeling Lecture 4

72 Why model proteins Examples of applications: Modeling the binding site of the anticodon on eRF3. Modeling substrate binding in the active site of Mandelate racemase. Solving X-ray and NMR structures. The theory behind the calculation: Parametrizing protein models. Molecular mechanics as an optimization problem. Molecular mechanics as a time simulation. Conceptual clash between protein folding and molecular mechanics.

73 Why model proteins Anticodon binding site on eRF3 2 possibilities. From phylogenetic information, a few residues were identified as players. Use molecular mechanics to “see” whether the surface of the protein can accommodate an anticodon.

74 Why model proteins Modeling a weird substrate into an active site. Mandelate racemase can bind a substrate with two rings! Is there room for this in the wild type active site? The answer is yes, although a bit counter-intuitive.

75 How structures are viewed Pre-computer days Sir John Kendrew and his model of myoglobin, 1958

76 How structures are resolved X-ray diffraction Principle Create a lattice of protein in a crystal. Collect thousands of diffraction patterns across all rotational degrees of freedom. Subtract the phase between the layers in the lattice. Compile into a 3D volume based on the density of reflective material (electrons in this case). Thread a model into the density map and optimize the geometry using the density map as an additional criterion.

77 How structures are resolved NMR spectroscopy Principle Use magnetic fields and “radio” frequency photons to detect shifts in nuclear states. Assign shifts to a model along the chain. Correlate the mutual effect amongst elements on each other to come up with a list of constraints (typically distances). Optimize the trajectory of the modeled chain, given this list of constraints.

79 Physical simulation in Molecular Modeling Jensen, F., 1999, Introduction to computational chemistry, Wiley, Chichester, UK Why is it useful to you? Modeling is often used by experimental biochemists and is a staple in structural biology. The complexity of the simulation is far beyond the complexity of the interface. This necessarily conveys a false sense that the “default” settings will do fine.

80 Physical simulation in Molecular Modeling Limitations True atoms and bonds are probabilistic constructs. The computation of the resulting geometries is a very involved process for which the analytical equations are not fully worked out. Luckily, the observable behavior is much more predictable and thus can be modeled under a limiting set of assumptions.

81 Physical simulation in Molecular Modeling Assumptions Newtonian physics is used to simulate molecules under a set of restrictions, which for proteins would be: 1. In solution (or vacuum). 2. Near room temperature. 3. Chemically inert.

82 Physical simulation in Molecular Modeling Abstraction Each atom has a fixed geometry constrained by a somewhat arbitrary energy scoring scheme. The problem thus boils down to finding the best set of coordinates for all atoms to minimize the energy. There is no absolute correspondence between this scoring scheme and experimentally measurable energy values.

83 Molecular Modeling in Bioinformatics Modeling Although only a small subset of all possible atoms ends up in biological molecules, each atom has a set of different states in which it can exist. These states are referred to as types in molecular modeling.

84 Molecular Modeling in Bioinformatics Energy function The energy function is used to evaluate the energy and to calculate the derivatives used to optimize a structure.

85 Molecular Modeling in Bioinformatics Computational efficiency and limitations of the model The energy function is used to evaluate the energy and to calculate the derivatives used to optimize a structure.

86 Molecular Modeling in Bioinformatics Parameterization nightmare Can someone come up with all these numbers? Generalization How robust is the simulation in a range of conditions. Computational cost The longer it takes to perform a single task, the fewer iterations will be computed in the same amount of time.

87 Molecular Modeling in Bioinformatics Parameterization nightmare Can someone come up with all these numbers? For the MM2 force field (71 atom types): Columns: Term, Params (est.), Determined. Rows: E(VdW) 142, E(str), E(bnd), E(tors), E(cross) ?

88 Molecular Modeling in Bioinformatics Generalization How robust is the simulation over a range of conditions? In the example to the left, the Exp-6 model causes nuclear fusion at unrealistic distances. Such unrealistic distances will be encountered in Monte-Carlo, genetic algorithm and simulated annealing experiments.

89 Molecular Modeling in Bioinformatics Lennard-Jones Is actually a computational stunt: there is no need to compute R itself (which takes a square root), only R^n where n is an even power.

90 Molecular Modeling in Bioinformatics Lennard-Jones In practice, Lennard-Jones is optimized to reproduce validated results (and works out satisfactorily).
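A minimal sketch of that trick, assuming the common 12-6 form of the potential with well depth epsilon and zero-crossing distance sigma (the parameter names are assumptions, not taken from the slides). The function takes the squared distance directly, so no square root is ever computed.

```python
def lennard_jones(r_squared, epsilon, sigma):
    """12-6 Lennard-Jones energy computed from the squared interatomic distance."""
    s2 = (sigma * sigma) / r_squared   # (sigma / r)^2, no square root required
    s6 = s2 * s2 * s2                  # (sigma / r)^6, the attractive term
    s12 = s6 * s6                      # (sigma / r)^12, the repulsive term
    return 4.0 * epsilon * (s12 - s6)
```

The squared distance falls out of the coordinate differences as dx*dx + dy*dy + dz*dz, which is why even powers are so convenient in the inner loop of a simulation.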

91 Molecular Modeling in Bioinformatics Electrostatic Models … are really ugly. Why does this matter? Electrostatic interactions decay with 1/distance, which makes them the longest-ranged interactions. Examples Coulomb’s Law

92 Molecular Modeling in Bioinformatics Electrostatic Models … are really ugly. Why does this matter? Electrostatic interactions decay with 1/distance, which makes them the longest-ranged interactions. Examples Dipolar moment interactions

93 Molecular Modeling in Bioinformatics Computational cost of non-bonded energy (VdW, El) ~99.88% of computation in protein-sized models. Most of these terms are very small and do not contribute significantly to the total energy. Computational tricks Cutoff -> blending function -> neighbor list* *must be updated, O(N²) Validation 1. Reproduces geometries. 2. Reproduces relative energies.
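The cutoff and neighbor-list tricks can be sketched as below. This is an illustrative brute-force version (function name and data layout are assumptions): building the list costs O(N²) pair checks, which is why it is rebuilt only occasionally, while energy evaluations in between only visit the stored pairs. The blending function that smooths the energy near the cutoff is left out.

```python
def build_neighbor_list(coords, cutoff):
    """Map each atom index to the higher-indexed atoms lying within the cutoff."""
    cutoff_sq = cutoff * cutoff          # compare squared distances, no square root
    neighbors = {i: [] for i in range(len(coords))}
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):   # each pair is visited once
            d2 = sum((a - b) ** 2 for a, b in zip(coords[i], coords[j]))
            if d2 <= cutoff_sq:
                neighbors[i].append(j)
    return neighbors
```

Storing each pair only under its lower index avoids double counting when the non-bonded energy is summed over the list.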

94 E is not G Real energies are temperature-dependent. The entropic contribution cannot be calculated from a snapshot.

95 Principle of optimization You start with a protein for which you know all coordinates. Evaluate the energy. Find a better structure, usually with small changes. Repeat until no better structure can be found. This task is usually NEVER straightforward, unless the system is made of a small number of atoms.

96 Molecular Modeling in Bioinformatics Optimization (local minima) Straightforward, although computationally expensive. 1 – A clear equation. 2 – A defined set of variables. 3 – “only” three dimensions to worry about. Steepest Descent (Robust, fast) Conjugate Gradient (Improved convergence properties) Newton-Raphson (Saddle points) Pseudo-NR (progressive Hessian estimate)
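As an illustration of the simplest of these methods, here is a hedged sketch of steepest descent with an adaptive step size. The function signature, the step-growth and step-shrink factors and the stopping rule are all assumptions chosen for readability, not settings from any modeling package.

```python
def steepest_descent(energy, gradient, x0, step=0.1, tol=1e-8, max_iter=10000):
    """Follow the negative gradient downhill until the energy stops improving."""
    x = list(x0)
    e = energy(x)
    for _ in range(max_iter):
        g = gradient(x)
        trial = [xi - step * gi for xi, gi in zip(x, g)]
        e_trial = energy(trial)
        if e_trial < e:
            improvement = e - e_trial
            x, e = trial, e_trial
            step *= 1.2            # the move helped: be a little bolder
            if improvement < tol:
                break              # converged to a local minimum
        else:
            step *= 0.5            # overshot: retry with a smaller step
            if step < 1e-12:
                break
    return x, e
```

On a protein the variables would be the 3N atomic coordinates and `energy`/`gradient` would come from the force field; the loop only ever finds the local minimum nearest to the starting structure.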

97 Molecular Modeling in Bioinformatics Optimization (Global minimum) In a simple, circular system with 17 main-chain atoms, there are 262 distinct conformations within 3 kcal/mol of the global minimum (out of ~1.6E13 conformers). Proteins are 1-2 orders of magnitude larger. Stochastic Methods (Monte-Carlo) Molecular Dynamics Simulated Annealing Genetic Algorithms Statistical Mechanics

98 Molecular Modeling in Bioinformatics Time dependent methods (Molecular Dynamics) Make use of classical mechanics equations such as: Verlet Algorithm Numerical solution to Newton’s equations

99 Molecular Modeling in Bioinformatics Verlet Algorithm Numerical solution to Newton’s equations Problems with this method No explicit use of velocity (which is needed to calculate the total energy):

100 Molecular Modeling in Bioinformatics Leapfrog Algorithm Numerical solution to Newton’s equations Timestep Reasonable: femtoseconds. Scope of simulation (ideal): millisecond; (practical): microsecond (10^-6 s)
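The slide's equations are not in the transcript, so the following is a minimal one-dimensional sketch of the standard leapfrog scheme, in which velocities are kept at half time steps and positions at full steps. The function name and the 1D setting are assumptions made to keep the example short.

```python
def leapfrog(x0, v0, force, mass, dt, n_steps):
    """Integrate Newton's equation in 1D; returns the list of positions."""
    x = x0
    # Initial half-step "kick" so the velocity lives at t + dt/2.
    v_half = v0 + 0.5 * dt * force(x) / mass
    trajectory = [x]
    for _ in range(n_steps):
        x = x + dt * v_half                       # drift: advance position a full step
        v_half = v_half + dt * force(x) / mass    # kick: advance velocity a full step
        trajectory.append(x)
    return trajectory
```

For a harmonic spring, `leapfrog(1.0, 0.0, lambda x: -x, 1.0, 0.001, 6283)` follows the cosine solution closely. In a real simulation the time step plays the role of the femtosecond step quoted on the slide, and the velocities at half steps give direct access to the kinetic energy.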

101 Molecular Modeling in Bioinformatics Simulated Annealing Robustness vs. initial solution. Variable contribution of the objective function. Broader sampling. Both help to explore around a minimum. [Figure labels: potential, blending function, kinetic, net movement]

102 Protein folding from Scratch Must be restrained to a limited scope Two miniproteins: TC5b and TC3b. Both have reference structures for validation. Sequences NLYIQWLKDGGPSSGRPPPS (TC5b; 304 atoms) NLFIEWLKNGGPSSGAPPPS (TC3b; 289 atoms) Software: AMBER 6.0 Model: AMBER Solvation: Generalized Born/solvent-accessible surface area. This means that the water molecules are not explicitly defined in the simulation and the effect of the solvent is treated as a macro property.

103 Protein folding from Scratch Must be restrained to a limited scope Understanding folding and design: Replica-exchange simulation of “Trp-Cage” miniproteins. Pitera, JW., Swope, W Proc. Natl. Acad. Sci. USA, 100:

104 Protein folding from Scratch Algorithm Initialization
Input: A protein sequence
Output: A starting structure for the main simulation.
1: Thread each character of the input sequence onto a corresponding 3D model (extended).
2: Minimize with 5000 steps of steepest descent.
3: for i = 1 to do
     Simulate with molecular dynamics
     if i % 1000 == 0 then readjust the temperature to 298 K.
4: Return the equilibrated model.
Equilibration is required to prevent strong “jerking” motion in the first iteration of a simulation.

105 Protein folding from Scratch Algorithm Simulation (simulated annealing variant)
Input: P, an equilibrated protein model
Output: A collection of coordinate snapshots (trajectory) for analysis.
1: T = a list of 23 simulation temperatures from 250K to 603K.
2: E = {}, an empty list of experiments
3: for i = 1 to |T| do
4:   Pi = Copy P
5:   Set the temperature of Pi to Ti
6:   Add Pi to E
7: for i = 1 to 4,000,000 (4 ns) do
8:   Simulate using MD |in parallel|
9:   if i % 250 == 0 then take a snapshot of coordinates.
10:  if i % 5000 == 0 then
11:    Swap temperatures between processes (Metropolis-style probabilities)
12:    Adjust each experiment in E to its new simulation temperature
13: Discard all but the snapshots taken in the last nanosecond of simulation.
14: Pool all 23 experiments for analysis.
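Step 11 can be sketched with the usual replica-exchange acceptance rule. This is an assumption about what "Metropolis-style probabilities" refers to, since the slide gives no formula: two replicas at temperatures T_i and T_j with potential energies E_i and E_j swap with probability min(1, exp[(1/kT_i - 1/kT_j)(E_i - E_j)]). Energies are taken to be in kcal/mol, and the function names are hypothetical.

```python
import math
import random

K_B = 0.0019872041  # Boltzmann constant in kcal/(mol*K)

def swap_probability(e_i, t_i, e_j, t_j):
    """Metropolis acceptance probability for exchanging two replicas' temperatures."""
    beta_i = 1.0 / (K_B * t_i)
    beta_j = 1.0 / (K_B * t_j)
    delta = (beta_i - beta_j) * (e_i - e_j)
    return 1.0 if delta >= 0 else math.exp(delta)

def attempt_swap(e_i, t_i, e_j, t_j, rng=random):
    """Return True when the swap is accepted for this attempt."""
    return rng.random() < swap_probability(e_i, t_i, e_j, t_j)
```

A swap is always accepted when the colder replica currently holds the higher-energy conformation, which is how low-energy structures migrate toward the low temperatures while hot replicas keep crossing barriers.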

106 Protein folding from Scratch Computational cost Ridiculously small protein, no good initial guess. 19 days on MHz IBM POWER3 SP2 processor (R6000 series) Which, on the campus here, approaches the mean time between power outages!

107 Protein folding from Scratch Validation The root mean square deviation RMSD Which is a suitable distance metric for related structures.

108 Comparing and aligning structures Why? There is a need for a distance metric to compare similar protein structures. Simulation analysis. Similarity quantification. Pattern detection. RMSD Works well for closely similar structures.

109 Comparing and aligning structures RMSD Works well for closely similar structures. Absolutely requires some kind of pairwise equivalence between the two compared entities.
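A minimal sketch of the RMSD calculation, which makes the pairwise-equivalence requirement explicit: atom k of one structure is compared with atom k of the other. The function name and input format are assumptions, and the sketch presumes the two structures have already been superimposed; finding the optimal superposition is a separate step not shown here.

```python
import math

def rmsd(coords_a, coords_b):
    """Root mean square deviation between two equally long lists of 3D points."""
    if len(coords_a) != len(coords_b):
        raise ValueError("RMSD needs a one-to-one pairing of atoms")
    if not coords_a:
        raise ValueError("RMSD of empty structures is undefined")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))
```

Because the squared deviations are averaged over all pairs, a few badly placed atoms can dominate the value, which is one reason the metric is most informative for closely related structures.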

110 Comparing and aligning structures RMSD Sequence identity falls quickly. Hard to separate weak hits from purely random proteins.

111 Protein folding from Scratch Validation The root mean square deviation RMSD ≤2.0 RMSD from any of 38 experimental structures ≤2.0 RMSD from the average low temperature structure.

112 Protein folding from Scratch Impact of this paper Makes good use of parallelism to conduct a heuristic search. Sampling-based method. Promising because in many cases the folding of a large protein can be approximated by the folding of its components. (Remember, domains are independent units in most cases)

113 Building a large machine for molecular modeling IBM Blue Gene project Architecture 64K FPU 20K FPU (protein folding) FPU 700 MHz (low cost, low heat) 64 compute nodes (256 MB) per I/O nodes (512MB) MPI library 3D torus network for fast neighbor to neighbor communication.

114 High Performance achievement in MD NAMD Open source University of Illinois, Dept. of theoretical physics Benchmark system (their big one)

115 High Performance achievement in MD NAMD Open source There is no need to use this system to study protein folding. Instead, MD was used in this case to study the conversion of torque into energy that can be stored in molecular batteries: the ATP molecule.

116 Overview Protein folding and parallel computing. Homology modeling and statistical mechanics. Secondary structure prediction and artificial intelligence.

117 Spectrum of strategies Physics Knowledge Quantum mechanics Molecular Mechanics Statistical Mechanics Homology Modeling

118 Parallel computing and Molecular dynamics Folding a protein from an extended conformation is a difficult problem because of the crossing of energy barriers. The following slides describe how barrier crossing can be achieved using a technique called parallelization.

119 Parallel computing It takes 1500 days for one student to complete a thesis. If the student is helped by someone, the work may go 2X as fast: 750 days. What if 1500 students are working on the same thesis? Overhead Communication Load balancing

120 Parallel computing Factors that complicate parallelization: Some work has to be executed in sequence. Communicating the task and the results takes an increasingly important share of the time as the tasks become smaller. Each individual process has to wait for the slowest one to finish, leading to a loss of efficiency.
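The first factor is commonly quantified with Amdahl's law, which the slides do not name; it is brought in here only as an illustration. If a fraction p of the work can be spread over n workers and the rest must run in sequence, the best possible speedup is 1 / ((1 - p) + p / n). Communication and load-balancing costs are ignored in this sketch.

```python
def amdahl_speedup(parallel_fraction, n_workers):
    """Upper bound on speedup when only part of the work can run in parallel."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_workers)

# The thesis example: 1500 days of work shared by 1500 students,
# assuming (hypothetically) that 90% of the work can be split.
days_needed = 1500 / amdahl_speedup(0.9, 1500)
```

Even with 1500 helpers, the sequential 10% alone keeps the thesis at roughly 150 days in this hypothetical split.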

121 Time scale in protein folding On the order of micro- to milliseconds. This is not achievable by modern computers. ~ days for 1 experiment (~28 years) Hundreds of millions of computers are idle at any time. Why not use their unspent cycles? Create a “screen saver”

122 Crossing energy barrier Most of the time is spent waiting for the thermal motion to topple a structure over a barrier. Principle of Ensemble dynamics M CPUs should take M times less time to go over a barrier. k = 1/10,000 ns, M = 10,000, t = 30 ns: f(t) ~ 30 folding events
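The estimate of about 30 folding events follows from multiplying the rate, the number of simulations and the simulated time. The sketch below assumes that barrier crossing is a single-exponential (first-order) process, which is what makes the simple product valid when k·t is small; the function names are hypothetical.

```python
import math

def expected_events_linear(rate_per_ns, n_simulations, time_ns):
    """Short-time estimate: M * k * t folding events."""
    return n_simulations * rate_per_ns * time_ns

def expected_events_exact(rate_per_ns, n_simulations, time_ns):
    """First-order kinetics: each run has folded with probability 1 - exp(-k t)."""
    return n_simulations * (1.0 - math.exp(-rate_per_ns * time_ns))

# Numbers from the slide: k = 1/10,000 per ns, M = 10,000 runs, t = 30 ns.
linear = expected_events_linear(1 / 10000, 10000, 30)
exact = expected_events_exact(1 / 10000, 10000, 30)
```

Both estimates land close to 30, because each individual run has only about a 0.3% chance of folding within 30 ns.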

123 Ensemble Dynamics Start M dynamics calculations with the same initial structure. Once one thread finds a barrier and goes over it, copy the state of this thread into all other M replicate processes. The communication overhead is negligible if the crossings are rare events, which is true in this case.

124 Ensemble Dynamics Detecting a barrier Will be noticeable by a large variance in energy over the duration of the simulation.

125 Ensemble Dynamics Calculation details We simulated folding and unfolding at 300K and pH 7.0, using the OPLS parameter set with the Generalized Born implicit solvent model. Time step: 2 fs. Long-range interactions truncated with a 16 Å cutoff.

126 What are they doing with this technique?

127 A more complex system Note how most of the interactions in the partially folded protein are non-native. This means that in order to resume folding, these must be broken. The Villin headpiece is one of the fastest (known) folding peptides! What about simulating anything else?

128 Energy Landscape It is clear in this figure that there are: 1. One folding pathway. 2. One intermediate. 3. Two energy barriers.

129

130 Statistical Mechanics Practical definition for our purpose: Statistical mechanics can be used to create predictive models in the absence of theoretical models. For example: interactions between amino acids.

131 Statistical mechanics Atom-level simulations are expensive and empirical. Statistical mechanics bridges frequencies of observations with physical forces for chemical systems. The resulting model is thus used to assess the “energy” of a trial conformation and can be used as an objective function to optimize a solution. This technique is increasingly used in bioinformatics since the information in the database can be seen as a collection of observations at equilibrium.

132 Statistical mechanics In other words, if it can’t be seen in the database, the energy state of an observation must be high. If it is common, the energy must be low. Remember, everything is possible; the probability of an observation is related to its relative energy.

133 How does this tie in to bioinformatics? There is a direct relationship between the energy of a state in a system in equilibrium and the probability of observing this state.
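
The relationship referred to here is usually written as the Boltzmann distribution, P(state) proportional to exp(-E / kT). The sketch below assumes that form and works in units where kT = 1 by default; the function name `boltzmann_probabilities` is hypothetical.

```python
import math


def boltzmann_probabilities(energies, kT=1.0):
    """Convert state energies into equilibrium probabilities.

    Assumes the Boltzmann form P_i = exp(-E_i / kT) / Z, where Z is the sum
    over all listed states. Energies are shifted by their minimum first,
    which leaves the probabilities unchanged but avoids overflow.
    """
    e_min = min(energies)
    weights = [math.exp(-(e - e_min) / kT) for e in energies]
    z = sum(weights)
    return [w / z for w in weights]


# Three states, each one kT higher in energy than the previous one.
for energy, p in zip([0.0, 1.0, 2.0], boltzmann_probabilities([0.0, 1.0, 2.0])):
    print(f"E = {energy:.1f} kT  ->  P = {p:.3f}")
```

Low-energy states are the ones observed most often, and each additional kT of energy lowers the probability by a factor of e.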

134 What are “states” in protein structures? There is a lot of freedom in defining states for protein structures. Here is one example: Sequence → Contacts In this plot, if two positions of the 1D sequence are in physical contact, the pair is marked as an orange pixel. It is thus possible to harvest a matrix of observed contacts from a collection of structures.
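
A contact matrix of this kind can be sketched as follows. The 8 Å cutoff, the minimum sequence separation of 3, and the use of one coordinate per residue are illustrative assumptions; the slide does not say how "physical contact" was defined.

```python
import math


def contact_map(coords, cutoff=8.0, min_separation=3):
    """Build a symmetric boolean contact matrix from per-residue coordinates.

    coords         : list of (x, y, z) tuples, one per residue (e.g. C-alpha)
    cutoff         : distance in angstroms below which two residues touch (assumed)
    min_separation : ignore pairs closer than this along the chain (assumed)
    """
    n = len(coords)
    contacts = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + min_separation, n):
            if math.dist(coords[i], coords[j]) <= cutoff:
                contacts[i][j] = contacts[j][i] = True
    return contacts


# Toy chain of five residues folded back on itself in a plane.
chain = [(0, 0, 0), (3.8, 0, 0), (7.6, 0, 0), (3.8, 3.8, 0), (0, 3.8, 0)]
cmap = contact_map(chain)
print(cmap[0][4], cmap[0][1])  # True False
```

Residues 0 and 4 are far apart in sequence but close in space, so they count as a contact; residues 0 and 1 are neighbors along the chain and are skipped.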

135 What are states in protein structures? There is a lot of freedom in defining states for protein structures. Here is one example: Sequence → Contacts In this case the energy for any given pair would be:
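
The formula on this slide did not survive in the transcript. A commonly used form for such a statistical potential, assumed here rather than taken from the slide, is e(a, b) = -ln(N_obs(a, b) / N_exp(a, b)) in units of kT, where N_exp is the count expected if residues paired at random according to their overall frequencies. The function name and the toy data are hypothetical.

```python
import math
from collections import Counter


def contact_energies(contacts, residue_freq):
    """Estimate pair energies (in kT) from a list of observed contacts.

    contacts     : iterable of (a, b) residue-type pairs seen in contact
    residue_freq : dict mapping residue type to its overall frequency
    Assumed form : e(a, b) = -ln(N_obs / N_exp), with N_exp from random pairing.
    """
    counts = Counter(tuple(sorted(pair)) for pair in contacts)
    total = sum(counts.values())
    energies = {}
    for (a, b), n_obs in counts.items():
        # An unordered pair of two different types can arise in two ways.
        multiplicity = 1 if a == b else 2
        n_exp = total * multiplicity * residue_freq[a] * residue_freq[b]
        energies[(a, b)] = -math.log(n_obs / n_exp)
    return energies


# Toy data: leucine-leucine contacts are over-represented.
observed = [("L", "L")] * 6 + [("L", "K")] * 2 + [("K", "K")] * 2
for pair, e in sorted(contact_energies(observed, {"L": 0.5, "K": 0.5}).items()):
    print(pair, round(e, 3))
```

Pairs seen more often than expected get a negative (favorable) energy, and pairs seen less often than expected get a positive one.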

136 What are states in protein structures? There is a lot of freedom in defining states for protein structures. Here is one example: Sequence → Contacts In order for this value to be valid, there is an assumption of equilibrium. Equilibrium: the sampling would not change significantly over time.

137 What are states in protein structures? There is a lot of freedom in defining states for protein structures. Here is one example: Sequence → Contacts Pitfall: In order to be accurate for rare observations, the total number of observations should be infinitely large and derived from sequence-structure pairs in equilibrium. Practically, there should be enough instances of the rarest entry to avoid large errors on the estimate (log(0)).
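
One common practical guard against the log(0) problem is to add a small pseudocount to the counts before taking the logarithm. This is an assumption about how one might handle it, not something the slide prescribes; the function name and the default pseudocount of 1 are hypothetical.

```python
import math


def smoothed_energy(n_obs, n_exp, pseudocount=1.0):
    """Pair energy in kT with additive smoothing.

    Adding the same pseudocount to the observed and expected counts keeps
    the value finite when n_obs is zero and leaves it at zero when the
    observed count equals the expected one.
    """
    return -math.log((n_obs + pseudocount) / (n_exp + pseudocount))


print(smoothed_energy(0, 5.0))   # finite, positive: never observed, but not infinite
print(smoothed_energy(50, 5.0))  # negative: observed far more often than expected
```

The smoothing biases rare entries toward zero energy, which is the price paid for avoiding an infinite penalty from a finite database.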

138 What are states in protein structures? There is a lot of freedom in defining states for protein structures. Here is one example: Sequence → Contacts Miyazawa-Jernigan matrix: Such a matrix has been generated: Miyazawa, S., Protein Eng Apr;6(3): This is particularly useful for threading sequences onto known structures for structure prediction purposes.

139 What are states in protein structures? The implementation of a distance-based energy term is trickier, but boils down to the same thing. Knowledge-based force field: Need to store in 4D matrices the tuple { (a,b), r, k }, where r → distance in Euclidean space and k → distance in sequence space. [Figure: chain positions x1 to x9, with r and k = 6 marked]

140 What are states in protein structures? The energy will be calculated with respect to all parameters considered. Knowledge-based force field. [Figure: chain positions x1 to x9, with r and k = 6 marked]

141 What are states in protein structures? There are some implementations of this technique, such as PROSAII. Knowledge-based force field. [Figure: chain positions x1 to x9, with r and k = 6 marked]

142 What are “states” in protein structures? The exposure of each site to the exterior is an important factor. This is often quantified as the solvent-accessible surface area (ASA). Knowledge-based force field: Need to store in 2D matrices the tuple { a, ASA }

143 What are states in protein structures? Ultimately, the energy of seeing a given sequence adopt a given structure can be computed as follows: Knowledge-based force field Caveats: The finer the parameterization, the larger the reference collection of (appropriate) structures in the database must be, so that every possibility is observed many times. Choosing the minimum set of terms that fully defines a structure is a design-level decision.

144 An example Real-life example of using knowledge-based methods. This enzyme is called enolase. It is a key enzyme in sugar breakdown metabolism. If important terms are forgotten, the energy values may be inadequate.

145 An example Real-life example of using knowledge-based methods. The function and the composition are very tightly related. Red → negatively charged; Blue → positively charged; Tan → hydrophobic. These are the active site residues.

146 An example Real-life example of using knowledge-based methods. The critical region in this protein has radically different properties than expected in an average protein. The knowledge-based system does not account for these properties and thus the positions shown in white were poorly estimated. The way this assessment was done quantitatively goes well beyond the scope of this course.

147 Another example Cubic lattice simulation The dimensionality of the protein folding problem can be reduced by simplifying the geometric properties of the system. Knowledge-based energy evaluation can be used as an objective function that is relevant to the physical world, without the need to fully define a system with the 6 degrees of freedom.

148 Spectrum of strategies From physics to knowledge: Quantum Mechanics, Molecular Mechanics, Statistical Mechanics, Homology Modeling

149 Homology Related by a common ancestor. Sequence identity among homologous structures can be as low as 15%. Why make models? There is a good chance that the structural efforts will never catch up with the sequencing projects. How? Figure out the most probable 3D structure, given a (1D) sequence and a 3D template from a related protein.

150 Homology Modeling Assumptions: Regions of alignable sequence share homologous structures. Loop regions (non-conserved residues) allow insertions and deletions without disrupting the overall structure of a protein. [Flowchart boxes: Query sequence; Sequence similarity to solved structure?; PSI-BLAST / profile MSA; Secondary structure prediction; Fold prediction; Homology modeling; Model validation]

151 Homology Modeling Aligning a sequence and a structure MSA (multiple sequence alignment) between the query and the sequence of the target structure. Profile MSA: the query and an MSA of proteins homologous to the target structure. Threading.

152 Homology Modeling Principle of threading “Pull” a sequence through a structure such that the alignment corresponds to the frame with the best energy score.
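
The "pulling" idea can be sketched as gapless threading: slide the query along the structure, score each frame with a pairwise contact energy, and keep the frame with the lowest score. Everything concrete here is an illustrative assumption: the function name, the toy energy table, and the restriction to ungapped frames (real threading programs also allow insertions and deletions).

```python
def thread_gapless(sequence, n_positions, contacts, pair_energy):
    """Return (best_offset, best_score) for a gapless threading.

    sequence    : query sequence, at least n_positions long
    n_positions : number of residue positions in the template structure
    contacts    : list of (i, j) template positions that are in contact
    pair_energy : dict mapping (a, b) residue pairs to an energy;
                  missing pairs score 0.0 (toy values, not a published table)
    """
    if len(sequence) < n_positions:
        raise ValueError("sequence is shorter than the template")
    best = None
    for offset in range(len(sequence) - n_positions + 1):
        score = 0.0
        for i, j in contacts:
            a, b = sequence[offset + i], sequence[offset + j]
            score += pair_energy.get((a, b), pair_energy.get((b, a), 0.0))
        if best is None or score < best[1]:
            best = (offset, score)
    return best


# Toy template with four positions, where positions 0 and 3 touch.
toy_energy = {("L", "L"): -1.0, ("K", "K"): 0.5, ("K", "L"): 0.2}
print(thread_gapless("KLKKLK", 4, [(0, 3)], toy_energy))  # (1, -1.0)
```

In the toy case only the frame starting at offset 1 places two leucines on the contacting positions, so it receives the best (lowest) score.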

153 Homology Modeling Energy evaluation for threading Statistical mechanics is ideal in this case because physical models would require extensive simulation time to figure out the precise atomic conformation.

154 Homology Modeling Threading to detect correct alignments The application GenTHREADER uses threading to perform protein fold recognition from genomic sequences.

155 Homology Modeling General Principle 1. Align to the sequence of a known structure. 2. Change the side chains of the structure to match the query sequence according to the sequence alignment. 3. Model loops and variable regions. 4. Minimize energy / conformational search. 5. Check models for inconsistencies. Feasibility: > 40% sequence identity is preferable; 25% - 40% is the “Twilight Zone”; < 25% is insufficient similarity in most cases and may work only for one domain out of the whole protein.
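
The feasibility thresholds above can be turned into a small check. The thresholds come from the slide; the way percent identity is computed (matches divided by the number of aligned, non-gap columns) is an assumption, since several definitions are in use, and both function names are hypothetical.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity over columns where neither sequence has a gap ('-')."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    pairs = [(x, y) for x, y in zip(aligned_a, aligned_b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


def modeling_feasibility(identity):
    """Classify a template by the sequence-identity ranges given on the slide."""
    if identity > 40:
        return "preferable"
    if identity >= 25:
        return "twilight zone"
    return "insufficient in most cases"


identity = percent_identity("ACDE", "ACDF")
print(identity, modeling_feasibility(identity))  # 75.0 preferable
```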

156 Neural Network Anatomy of a NN: input parameters, output parameters, weights

157 Neural Network Before a NN can be used, it must be trained: Training compares the output of a NN with a known answer; the weight of each “arrow” is changed to minimize the error.
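
The training loop can be sketched with the smallest possible network, a single sigmoid neuron. This is a minimal illustration under stated assumptions: the learning rate, the number of epochs, the toy AND task, and the function names are all hypothetical, and real secondary structure predictors use much larger networks.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def predict(weights, bias, inputs):
    """Output of one sigmoid neuron for the given inputs."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)


def train_neuron(samples, epochs=2000, learning_rate=0.5):
    """Fit one neuron by repeatedly nudging weights to reduce the error.

    samples : list of (inputs, target) pairs with targets 0 or 1
    The update used is the gradient of the cross-entropy error for a
    sigmoid output: weight -= learning_rate * (output - target) * input.
    """
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            error = predict(weights, bias, inputs) - target
            weights = [w - learning_rate * error * x for w, x in zip(weights, inputs)]
            bias -= learning_rate * error
    return weights, bias


# Toy task: learn the logical AND of two inputs.
and_samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = train_neuron(and_samples)
for inputs, target in and_samples:
    print(inputs, target, round(predict(w, b, inputs), 2))
```

Each pass compares the output with the known answer and moves every weight a little in the direction that shrinks the difference, which is the procedure the slide describes.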

158 Secondary Structure prediction Three generations of methods (generation: method, approach, accuracy). 1 (’60-’70): GOR1, single-character statistical information, ~57% ACC. 2 (’80): GOR3, local interactions, ~63% ACC. 3 (’90+): PHD, homologous protein sequences, ~72% ACC.

159 Secondary Structure prediction 1st generation Making use of compiled frequencies of the different characters for three possible classes: Helix (H), Strand (S), Coil (-). SDFDKILVSTYSPPQARILIVM -----SSSSSSS----HHHHHH
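
A first-generation predictor of this kind can be sketched as a lookup table of per-residue class counts. This is a simplification: it uses raw conditional frequencies rather than the information values of the actual GOR method, and the function names and toy training data are hypothetical.

```python
from collections import Counter, defaultdict


def compile_frequencies(examples):
    """Count how often each residue type is seen in each class (H, S, -).

    examples : list of (sequence, states) pairs of equal length
    """
    table = defaultdict(Counter)
    for sequence, states in examples:
        for residue, state in zip(sequence, states):
            table[residue][state] += 1
    return table


def predict_single_residue(sequence, table, default="-"):
    """Assign each residue its most frequent class, ignoring all neighbors."""
    predicted = []
    for residue in sequence:
        if residue in table:
            predicted.append(table[residue].most_common(1)[0][0])
        else:
            predicted.append(default)  # residue type never seen in training
    return "".join(predicted)


# Toy training set, not real structural data.
training = [("AAAVVVGG", "HHHSSS--"), ("AVG", "HS-")]
table = compile_frequencies(training)
print(predict_single_residue("GAVX", table))  # -HS-
```

Because every position is judged in isolation, such a predictor cannot capture the fact that helices and strands come in runs, which is what the later generations address.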

160 Secondary Structure prediction 2nd generation Making use of compiled frequencies of the different characters for three possible classes, while also considering the periodicity and the neighbors. Sliding-window analyses. SDFDKILVSTYSPPQARILIVM -----SSSSSSS----HHHHHH
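
The sliding-window idea can be sketched by letting the neighbors of each residue vote on its class. This is a deliberately simplified illustration: the real second-generation methods use position-specific statistics within the window, whereas this sketch just sums the same per-residue scores; the score table and function name are hypothetical.

```python
def predict_with_window(sequence, scores, half_window=2, classes="HS-"):
    """Predict one class per residue from the summed scores of its window.

    scores      : dict residue -> dict class -> score (toy values)
    half_window : number of neighbors considered on each side (assumed)
    Ties are resolved in favor of the class listed first in `classes`.
    """
    predicted = []
    for i in range(len(sequence)):
        lo = max(0, i - half_window)
        hi = min(len(sequence), i + half_window + 1)
        totals = {
            c: sum(scores.get(sequence[j], {}).get(c, 0.0) for j in range(lo, hi))
            for c in classes
        }
        predicted.append(max(classes, key=lambda c: totals[c]))
    return "".join(predicted)


toy_scores = {"A": {"H": 1.0}, "V": {"S": 1.0}, "G": {"-": 1.0}}
print(predict_with_window("AAAGAAA", toy_scores, half_window=0))  # HHH-HHH
print(predict_with_window("AAAGAAA", toy_scores, half_window=1))  # HHHHHHH
```

With no neighbors the lone glycine breaks the helix; with a window of one residue on each side, its helix-favoring neighbors outvote it.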

161 Secondary Structure prediction 3rd generation Frequency vectors obtained from multiple sequence alignments, also known as profiles. These MSAs can be generated using BLAST or PSI-BLAST.
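
A profile in this sense can be sketched as one frequency vector per alignment column. The handling of gaps (skipped) and the absence of sequence weighting or pseudocounts are simplifying assumptions, and the function name is hypothetical.

```python
from collections import Counter


def build_profile(msa, alphabet="ACDEFGHIKLMNPQRSTVWY"):
    """Return one dict of residue frequencies per column of the alignment.

    msa : list of aligned sequences of equal length; characters outside
          the alphabet (such as the gap '-') are ignored in each column.
    """
    if len({len(seq) for seq in msa}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for col in range(len(msa[0])):
        column = [seq[col] for seq in msa if seq[col] in alphabet]
        counts = Counter(column)
        total = len(column)
        profile.append({aa: n / total for aa, n in counts.items()} if total else {})
    return profile


alignment = ["ACD", "ACE", "A-E", "GCE"]
for position, freqs in enumerate(build_profile(alignment)):
    print(position, freqs)
```

Each column's vector, rather than a single character, is what gets fed to the predictor, so conserved and variable positions look different to it.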

162 Secondary Structure prediction 3rd generation Best done using neural networks (or HMMs). [Figure: per-residue network outputs H H - … S] The NN output of the profiles gets scanned by a few distinct NNs using a sliding-window strategy. Assignment is made on a “winner takes all” basis.

163 Secondary Structure prediction As alignments grow, secondary structure prediction improves Przybylski, Rost Proteins, 46: Conclusions Using MSA (multiple sequence alignment) significantly improves the predictions (0.72 -> 0.75). The larger the database used, the better. However, there is a point where the information content saturates. PSI-BLAST vs BLAST: BLAST may be better in some cases. Refining the alignment did not help.

164 Secondary Structure prediction Bidirectional Dynamics for protein secondary structure prediction Baldi et al., 2000, in Sequence learning, pp IOHMM model Memory evaluated experimentally at about 15 characters

165 Secondary Structure prediction Bidirectional Dynamics for protein secondary structure prediction Baldi et al., 2000, in Sequence learning, pp Recurrent Neural Network implementation

166 Overview Protein folding and parallel computing. Current simulation works for modest-sized systems. Homology modeling and statistical mechanics. There is a clear advantage in using the information that we already have to solve new problems. Secondary structure prediction and artificial intelligence. Machine learning is appropriate to capture the trends leading to prediction.

