2. Introduction to Rosetta and structural modeling

Outline
The Rosetta framework
Scoring (selecting the structure) and sampling (finding the structure)
Cartesian and polar coordinates

The Rosetta Strategy
Observation: local sequence preferences bias, but do not uniquely define, the local structure of a protein
Goal: mimic the interplay of local and global interactions that determines protein structure

The Rosetta Strategy: local interactions (fragments)
Derived from known structures
Sampled for similar sequences and secondary structure propensity
The fragment library represents the accessible local structures for a short sequence

The Rosetta Strategy: global (non-local) interactions (scoring function)
Buried hydrophobic residues, paired β strands, specific side chain interactions, etc.
Derived from known structures (statistics on preferred conformations)
Boltzmann's principle relates frequency to energy
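The statement that Boltzmann's principle relates frequency to energy can be illustrated with a minimal sketch. It assumes the common inverse-Boltzmann form E = -kT ln(p_observed / p_reference); the function name, the temperature, and the example frequencies are hypothetical and are not taken from Rosetta.

```python
import math

KT = 0.593  # kcal/mol, k_B*T at roughly 298 K (assumed temperature)

def boltzmann_energy(observed: float, reference: float, kt: float = KT) -> float:
    """Pseudo-energy from an observed and a reference frequency.

    Assumed form: E = -kT * ln(p_observed / p_reference).
    """
    if observed <= 0 or reference <= 0:
        raise ValueError("frequencies must be positive")
    return -kt * math.log(observed / reference)

# Hypothetical frequencies of a feature in known structures vs. a reference state
e_favoured = boltzmann_energy(0.30, 0.10)  # seen more often than expected
e_neutral = boltzmann_energy(0.10, 0.10)   # seen exactly as often as expected
e_rare = boltzmann_energy(0.02, 0.10)      # seen less often than expected

print(e_favoured, e_neutral, e_rare)
```

Features that occur more often than in the reference state receive a negative (favourable) energy, and rare features receive a positive one.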

A short history of Rosetta
In the beginning: ab initio modeling of protein structure starting from sequence
Short fragments of known proteins are assembled by a Monte Carlo strategy to yield native-like protein conformations
Reliable fold identification for short proteins
Improved to high resolution (< 2 Å RMSD)

A short history of Rosetta
The success of the ab initio protocol led to extensions to:
  Protein design
  Design of a new fold: TOP7
  Protein loop modeling; homology modeling
  Protein-protein docking; protein interface design
  Protein-ligand docking
  Protein-DNA interactions; RNA modeling
  Many more, e.g. solving the phase problem in X-ray crystallography

Rosetta extensions
BOINC (Rosetta@home)
Foldit
RosettaScripts
PyRosetta

Scoring and Sampling

The basic assumption in structure prediction
The native structure is located at the global minimum (free) energy conformation (GMEC)
A good energy function can select the correct model among decoys
A good sampling technique can find the GMEC in the rugged landscape
[Figure: energy E versus conformation space, with the GMEC marked]

Two-Step Procedure
Low-resolution step locates potential minima (fast)
Cluster analysis identifies the broadest basins in the landscape
High-resolution step can identify the lowest energy minimum in the basins (slow)
[Figure: energy E versus conformation space, with the GMEC marked]

How are scoring terms optimized?
Nature uses one scoring function …
Aim: one generic function for different applications
Optimization of parameters:
  Originally from small molecules (experiments and quantum mechanical calculations)
  Today: use of protein structures solved at high accuracy
Benchmarks:
  Discriminate the ground state from alternative conformations
  Identify the correct side chain conformation
  Predict the effect of point mutations on stability (ΔΔG)
Top-down machine learning approaches optimize several benchmarks simultaneously*
Leaver-Fay, …, & Baker (2013) Methods in Enzymology 523:109
*Park … & DiMaio (2016). Simultaneous Optimization of Biomolecular Energy Function on Features from Small Molecules and Macromolecules. J. Chem. Theory Comput.

Low-Resolution Step: Structure Representation
Equilibrium bonds and angles (Engh & Huber 1991)
Centroid: average location of the center of mass of the side chain (centroid | aa, φ, ψ)
No modeling of side chains
Fast

Low-Resolution Scoring Function
Bayes Theorem: P(str | seq) = P(str) * P(seq | str) / P(seq)
  P(str): structure-dependent features
  P(seq | str): sequence-dependent features
  P(seq): constant
Independent components prevent over-counting
Knowledge-based parameters: based on statistics from high-resolution structures in the PDB

Sequence-Dependent Components
Bayes Theorem: P(str | seq) = P(str) * P(seq | str) / P(seq)
Score = Senv + Spair + …
Two environments: buried and exposed
Buried residue: >= 16 Cβ atoms within 10 Å of the residue
Neighbors: Cβ-Cβ distance < 10 Å
Rohl et al. (2004) Methods in Enzymology 383:66
Origin: Simons et al., JMB 1997; Simons et al., Proteins 1999
Simons KT, Kooperberg C, Huang E, Baker D. Assembly of protein tertiary structures from fragments with similar local sequences using simulated annealing and Bayesian scoring functions. J Mol Biol. 1997;268:
Abstract: We explore the ability of a simple simulated annealing procedure to assemble native-like structures from fragments of unrelated protein structures with similar local sequences using Bayesian scoring functions. Environment and residue pair specific contributions to the scoring functions appear as the first two terms in a series expansion for the residue probability distributions in the protein database; the decoupling of the distance and environment dependencies of the distributions resolves the major problems with current database-derived scoring functions noted by Thomas and Dill. The simulated annealing procedure rapidly and frequently generates native-like structures for small helical proteins and better than random structures for small beta sheet containing proteins. Most of the simulated structures have native-like solvent accessibility and secondary structure patterns, and thus ensembles of these structures provide a particularly challenging set of decoys for evaluating scoring functions. We investigate the effects of multiple sequence information and different types of conformational constraints on the overall performance of the method, and the ability of a variety of recently developed scoring functions to recognize the native-like conformations in the ensembles of simulated structures.
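The buried/exposed environment rule on this slide (at least 16 Cβ atoms within 10 Å) can be sketched as a simple neighbor count. This is an illustrative sketch only: the function names and the toy coordinates are hypothetical, and whether the cutoff comparison is strict is an assumption.

```python
import math

def count_neighbors(cb_coords, cutoff=10.0):
    """For each residue, count other residues whose C-beta lies within `cutoff` angstroms."""
    counts = []
    for i, a in enumerate(cb_coords):
        n = sum(1 for j, b in enumerate(cb_coords)
                if i != j and math.dist(a, b) < cutoff)
        counts.append(n)
    return counts

def classify_burial(cb_coords, min_neighbors=16, cutoff=10.0):
    """Label each residue 'buried' (>= min_neighbors neighbors) or 'exposed'."""
    return ["buried" if n >= min_neighbors else "exposed"
            for n in count_neighbors(cb_coords, cutoff)]

# Toy coordinates (hypothetical): a tight cluster of 20 C-beta atoms plus one distant atom
cluster = [(0.4 * i, 0.0, 0.0) for i in range(20)]
coords = cluster + [(100.0, 0.0, 0.0)]
labels = classify_burial(coords)
print(labels[0], labels[-1])
```

The all-against-all loop is quadratic in the number of residues, which is acceptable for a sketch; a production implementation would typically use a neighbor list.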

Structure-Dependent Components
P(str | seq) = P(str) * P(seq | str) / P(seq)
Score = … + Srg + Scb + Svdw + …
Simons KT, Kooperberg C, Huang E, Baker D. J Mol Biol. 1997;268:

Structure-Dependent Components
P(str | seq) = P(str) * P(seq | str) / P(seq)
Score = … + Srama + …
Simons KT, Kooperberg C, Huang E, Baker D. J Mol Biol. 1997;268:

High-Resolution Step
Slow, exact step
Locates the global energy minimum
Structure representation:
  All-atom (including hydrogens but no water)
  Side chains selected from a "rotamer" library of preferred conformations (Dunbrack 1997)
  Side chain conformation adjusted frequently
Scoring functions: e.g. score12; Talaris2014; Ref2015 …

High-Resolution Step: Rotamer Libraries
Side chains have preferred conformations
They are summarized in rotamer libraries
Select one rotamer for each position
Best conformation: lowest-energy combination of rotamers
Serine χ1 preferences: t = 180°, g+ = +60°, g- = -60°
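The idea of choosing the lowest-energy combination of rotamers can be sketched with an exhaustive search over a toy system. The energies below are made up, and exhaustive enumeration is used only because the example is tiny; it grows exponentially with the number of positions, so real packers rely on stochastic or otherwise smarter search.

```python
import itertools

def best_rotamer_combination(rotamers, self_energy, pair_energy):
    """Enumerate every combination (one rotamer per position) and keep the lowest-energy one."""
    best_combo, best_e = None, float("inf")
    for combo in itertools.product(*rotamers):
        e = sum(self_energy(pos, rot) for pos, rot in enumerate(combo))
        e += sum(pair_energy(i, combo[i], j, combo[j])
                 for i, j in itertools.combinations(range(len(combo)), 2))
        if e < best_e:
            best_combo, best_e = combo, e
    return best_combo, best_e

# Hypothetical toy system: two positions, each with the three chi1 rotamers t, g+, g-
rotamers = [["t", "g+", "g-"], ["t", "g+", "g-"]]
SELF = {"t": 0.0, "g+": 0.5, "g-": 1.0}  # made-up one-body energies

def self_energy(pos, rot):
    return SELF[rot]

def pair_energy(i, rot_i, j, rot_j):
    # made-up clash penalty when both side chains adopt the t rotamer
    return 5.0 if rot_i == "t" and rot_j == "t" else 0.0

best_combo, best_e = best_rotamer_combination(rotamers, self_energy, pair_energy)
print(best_combo, best_e)
```

In this toy case the individually best rotamer (t at both positions) is not part of the best combination, because the pairwise term penalizes it.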

High-Resolution Scoring Function
Major contributions:
  Burial of hydrophobic groups away from water
  Void-free packing of buried groups and atoms
  Buried polar atoms form intra-molecular hydrogen bonds

Important bonds for protein folding and stability
Van der Waals force: dipole moments attract each other (transient and very weak: … kcal/mol)
Hydrophobic interaction: hydrophobic groups or molecules tend to cluster together and shield themselves from the hydrophilic solvent

Packing interactions
Score = SLJ(atr + rep) + …
Lennard-Jones attractive and repulsive terms as a function of the interatomic distance rij
Linearized repulsive part
ε: well depth from CHARMm19
Beta_nov15
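A minimal sketch of a Lennard-Jones term with a linearized repulsive part follows. It assumes the standard 12-6 form written in terms of the minimum-energy distance sigma and the well depth eps, and it assumes the switch to a straight line happens at 0.6 * sigma; both the switch point and the example parameters are assumptions for illustration, not values read from the slide.

```python
def lj_energy(r, sigma, eps, lin_frac=0.6):
    """12-6 Lennard-Jones energy with a linear extrapolation at short distances.

    sigma: distance of the energy minimum; eps: well depth.
    Below lin_frac * sigma (assumed switch point) the curve continues as a
    straight line with the slope of the 12-6 curve at the switch point, so the
    energy stays finite even for overlapping atoms.
    """
    def lj(d):
        x = (sigma / d) ** 6
        return eps * (x * x - 2.0 * x)

    def dlj(d):
        x = (sigma / d) ** 6
        return eps * (-12.0 * x * x + 12.0 * x) / d

    r_lin = lin_frac * sigma
    if r >= r_lin:
        return lj(r)
    return lj(r_lin) + dlj(r_lin) * (r - r_lin)

# Hypothetical atom pair: minimum at 4.0 angstroms, well depth 0.1 kcal/mol
for r in (0.0, 2.0, 2.4, 4.0, 6.0):
    print(r, lj_energy(r, sigma=4.0, eps=0.1))
```

The linear branch keeps the energy and its slope bounded for clashing atoms, which is helpful when a scoring function is used together with gradient-based minimization.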

Coulomb electrostatic energy
Score = … + Selec + …
C0 = 332
Beta_nov15
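The constant C0 = 332 on this slide is the usual Coulomb conversion factor for charges in units of the elementary charge, distances in Å, and energies in kcal/mol. The sketch below assumes the plain Coulomb form E = C0 * qi * qj / (dielectric * r) with a constant dielectric; the dielectric treatment and the example charges are assumptions, not the slide's model.

```python
C0 = 332.0  # kcal * angstrom / (mol * e^2), from the slide

def coulomb_energy(qi, qj, r, dielectric=1.0):
    """Coulomb energy in kcal/mol for partial charges qi, qj (in e) at distance r (in angstroms).

    A constant dielectric is assumed here purely for illustration.
    """
    if r <= 0:
        raise ValueError("distance must be positive")
    return C0 * qi * qj / (dielectric * r)

# Hypothetical pair of opposite unit charges 4 angstroms apart
print(coulomb_energy(1.0, -1.0, 4.0))
```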

Implicit solvation (Gaussian-exclusion Lazaridis-Karplus model)
Score = … + Ssolvation + …
Excluded-volume implicit solvation model: the excluded volume approximates the desolvation penalty; penalizes buried polars
The solvation free energy density $f_i(r)$ is assumed to be approximated by a Gaussian distribution:
  $f_i(r)\,4\pi r^2 = \alpha_i \exp(-x_i^2)$, with $x_i = (r - R_i)/\lambda_i$
  $\lambda_i = 3.5$ Å (6.0 Å for deionized groups): correlation length (width of the first, or first two, solvation shells)
  $\alpha_i = 2\,\Delta G_i^{free} / (\sqrt{\pi}\,\lambda_i)$: proportionality coefficient
The density f(r) is approximated as a Gaussian or as an anisotropic distribution; the anisotropic model takes preferred water positions around polar groups into account
Lazaridis & Karplus, Proteins 1999
Beta_nov15
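As a numerical check of the Gaussian form on this slide, the sketch below evaluates f_i(r) * 4*pi*r^2 = alpha_i * exp(-x_i^2) and integrates it outward from R_i. With alpha_i defined as on the slide, the integral should recover the free solvation energy Delta G_i^free. The radius and free energy used here are hypothetical, and the function names are not from any library.

```python
import math

def lk_shell_density(r, radius, dg_free, lam=3.5):
    """f_i(r) * 4*pi*r^2 = alpha_i * exp(-x_i^2), with x_i = (r - R_i) / lambda_i."""
    alpha = 2.0 * dg_free / (math.sqrt(math.pi) * lam)
    x = (r - radius) / lam
    return alpha * math.exp(-x * x)

def integrate_outward(radius, dg_free, lam=3.5, steps=20000):
    """Trapezoidal integral of the shell density from R_i out to R_i + 10 * lambda_i."""
    upper = radius + 10.0 * lam
    h = (upper - radius) / steps
    total = 0.5 * (lk_shell_density(radius, radius, dg_free, lam)
                   + lk_shell_density(upper, radius, dg_free, lam))
    for k in range(1, steps):
        total += lk_shell_density(radius + k * h, radius, dg_free, lam)
    return total * h

# Hypothetical polar atom: radius 1.9 angstroms, free solvation energy -5.0 kcal/mol
recovered = integrate_outward(radius=1.9, dg_free=-5.0)
print(recovered)
```

The factor 2 / (sqrt(pi) * lambda_i) in alpha_i is what normalizes the half-Gaussian, so the full solvation free energy is distributed over the space outside the atom.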

Hydrogen Bonding Energy
Orientation dependent
Statistics derived from 8000 high-resolution structures
[Figure: example of a histidine imidazole ring acceptor with a backbone amide donor]
Beta_nov15

Rotamer preference
Score = … + Sdunbrack + …
Dunbrack, 1997

Scoring Function: Summary
One long, generic function …
Score = Senv + Spair + Srg + Scb + Svdw + Sss + Ssheet + Shs + Srama + Shb(srbb + lrbb) + docking_score + Sdisulf_cent + Srs + Sco + Scontact_prediction + Sdipolar + Sprojection + Spc + Stether + Sfy + Sw + Ssymmetry + Ssplicemsd + …
docking_score = Sd env + Sd pair + Sd contact + Sd vdw + Sd site constr + Sd + Sfab score
Score = SLJ(atr + rep) + Selec + Ssolvation + Shb(srbb+lrbb+bbsc+sc) + Sdunbrack + Spair – Sref + Sdisulfide_fa13 Sprob1b + Sintrares + Sgb_elec + Sgsolt + Sh2o(solv + hb) + S_plane

Current default Rosetta Energy Function: Ref2015
Alford et al., JCTC 2017

Scoring Function: Summary
One long, generic function …
A weighted sum of different terms:
Score = w1*SLJatr + w2*SLJrep + w3*Selec + w4*Ssolvation + w5*Shb(srbb+lrbb+bbsc+sc) + w6*Sdunbrack + w7*Spair – Sref …
How can it be improved?
  Feature Analysis Tool: improve parameters
  OptE: optimize weights
Leaver-Fay, …, & Baker (2013) Methods in Enzymology 523:109
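The weighted-sum structure of the score can be sketched in a few lines. The term names, raw values, and weights below are hypothetical placeholders and are not the weights of any released Rosetta energy function.

```python
def total_score(terms, weights, reference=0.0):
    """Weighted sum of score terms minus a reference energy."""
    missing = set(terms) - set(weights)
    if missing:
        raise KeyError(f"no weight for terms: {sorted(missing)}")
    return sum(weights[name] * value for name, value in terms.items()) - reference

# Hypothetical raw term values and weights
terms = {"lj_atr": -10.0, "lj_rep": 2.0, "elec": -1.5, "solvation": 3.0}
weights = {"lj_atr": 1.0, "lj_rep": 0.5, "elec": 1.0, "solvation": 1.0}

score = total_score(terms, weights, reference=1.0)
print(score)
```

Keeping the weights separate from the term values is what makes a weight-fitting procedure possible: the terms are computed once per structure, and only the weights change during optimization.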

Feature Analysis: improve scoring term
e.g. H-bond distance H-Oγ in Ser & Thr
Aim: similar distributions in crystal structures and models
Leaver-Fay, …, & Baker (2013) Methods in Enzymology 523:109

Feature Analysis: improve scoring term
e.g. H-bond distance H-Oγ in Ser & Thr
After correction: the distributions in native and model structures overlap
Aim: similar distributions in crystal structures and models
Figure 6.3 H-bond length distributions for hydroxyl donors (SER/THR) to backbone oxygens. The thick curves are kernel density estimations from observed data normalized for equal volume per unit distance. The black curve in the background of each panel represents the Native sample source. (A) Boltzmann distribution for the length term in the Rosetta H-bond model with the Score12 and NewHB parameterizations. (B) Relaxed Natives with the Score12 energy function. The excessive peakiness is due to a discontinuity in the Score12 parametrization of the H-bond model. (C) Relaxed Natives with the NewHB energy function. (D) Relaxed Natives with the NewHB energy function and the Lennard–Jones minima between the acceptor and hydroxyl heavy atoms adjusted from 3.0 to 2.6 Å, and between the acceptor and the hydrogen atoms adjusted from 1.95 to 1.75 Å.
Leaver-Fay, …, & Baker (2013) Methods in Enzymology 523:109

OptE: optimize weights
Score = w1*SLJatr + w2*SLJrep + w3*Ssolvation + w4*Shb(srbb+lrbb+bbsc+sc) + w5*Sdunbrack + w6*Spair – Sref
Maximum likelihood parameter estimation
Benchmarks:
  Discriminate the ground state from alternative conformations
  Identify the correct side chain conformation
  Sequence recovery in design: choose the correct amino acid residue
  Predict the effect of point mutations on stability (ΔΔG)
  & more …
Aim: best score for the correct prediction
Leaver-Fay, …, & Baker (2013) Methods in Enzymology 523:109

DualOptE: parameterization using both small molecule and macromolecule properties (Ref2015)
Optimize 100s of parameters simultaneously, including thermodynamic properties of small molecules
Independent validation is crucial
Park et al. (2016). Simultaneous Optimization of Biomolecular Energy Functions on Features from Small Molecules and Macromolecules. Journal of Chemical Theory and Computation, 12:6201

Scoring and Sampling

The Rosetta sampling strategy: A general overview
Fragment sampling:
  9-residue fragments, then 3-residue fragments
  Gradual addition of parameters to the scoring function
  Quick quenching
  Strategies to keep fragment insertion/perturbation local
Monte Carlo (MC) sampling
MC sampling with minimization: local optimization
Repacking and refinement: side chain rearrangement
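The Monte Carlo step in this overview can be sketched with the standard Metropolis acceptance rule. That the acceptance rule is Metropolis is an assumption here, since the slide only says "Monte Carlo", and the one-dimensional energy function and move set are toy stand-ins for a scoring function and fragment insertions.

```python
import math
import random

def metropolis_accept(delta_e, kt, rng):
    """Always accept downhill moves; accept uphill moves with probability exp(-dE/kT)."""
    if delta_e <= 0:
        return True
    return rng.random() < math.exp(-delta_e / kt)

def monte_carlo(x0, energy, propose, kt=1.0, steps=2000, seed=0):
    """Generic Metropolis Monte Carlo loop that tracks the lowest-energy state seen."""
    rng = random.Random(seed)
    x, e = x0, energy(x0)
    best_x, best_e = x, e
    for _ in range(steps):
        candidate = propose(x, rng)
        e_candidate = energy(candidate)
        if metropolis_accept(e_candidate - e, kt, rng):
            x, e = candidate, e_candidate
            if e < best_e:
                best_x, best_e = x, e
    return best_x, best_e

# Toy problem (hypothetical): one coordinate with a single minimum at x = 3
def energy(x):
    return (x - 3.0) ** 2

def propose(x, rng):
    return x + rng.uniform(-0.5, 0.5)

best_x, best_e = monte_carlo(0.0, energy, propose)
print(best_x, best_e)
```

Accepting some uphill moves is what lets the search leave local minima in a rugged landscape; lowering kt over time turns the same loop into simulated annealing.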

Representations of protein structure: Cartesian and polar coordinates
[Figure: the same structure shown as PDB ATOM records with x, y, z coordinates (e.g. atoms N, CA, C, O of a GLN residue) and as a table of PHI, PSI, OMEGA, CHI1, CHI2, CHI3, CHI4 values per position]

2 ways to represent the protein structure
Cartesian coordinates (x, y, z; PDB format)
  Intuitive: look at molecules in space
  Easy calculation of the energy score (based on atom-atom distances)
  Difficult to change the conformation of the structure (while keeping bond lengths and bond angles unchanged)
Polar coordinates (Φ-Ψ-Ω; equilibrium angles and bond lengths)
  Compact (3 values/residue)
  Easy changes of protein structure (turn around one or more dihedral angles)
  Non-intuitive
  Difficult to evaluate the energy score (calculation of the neighbor matrix is complicated)

A snake in the 2D world: Cartesian representation
points: (0,0), (1,1), (1,2), (2,2), (3,3)
connections (predefined): 1-2, 2-3, 3-4, 4-5
[Figure: the five points and four connections drawn in the x-y plane]

A snake in the 2D world: internal coordinates
bond lengths (predefined): √2, 1, 1, √2
angles: 45°, 90°, 0°, 45°
[Figure: the snake in the x-y plane with bond lengths and angles marked; polar coordinate illustration from Wikipedia]

A snake wiggling in the 2D world
Constraint: keep bond lengths fixed
Move in Cartesian representation: (0,0), (1,1), (1,2), (2,2), (3,3) → (0,0), (1,1), (1,2), (2,2), (3,0)
Bond length changed! The last bond, from (2,2) to (3,0), now has length √5 instead of √2

A snake wiggling in the 2D world
Constraint: keep bond lengths fixed
Move in polar coordinates: 45°, 90°, 0°, 45° → 45°, 90°, 45°, 45°
Bond length unchanged!
Large impact on structure

Polar Cartesian coordinates
Convert r and q to x and y x y √2,1,1,√2 450,90o,0o,45o (0,0),(1,1),(1,2),(2,2),(3,3) From wikipedia
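The polar-to-Cartesian conversion for the 2D snake can be sketched as a cumulative sum of bond vectors. Each angle is taken as the direction of a bond relative to the x axis, which is the convention that reproduces the snake's points; the function name is hypothetical.

```python
import math

def polar_to_cartesian(lengths, angles_deg, origin=(0.0, 0.0)):
    """Build 2D points from bond lengths and bond directions (degrees from the x axis)."""
    x, y = origin
    points = [(x, y)]
    for r, angle in zip(lengths, angles_deg):
        theta = math.radians(angle)
        x += r * math.cos(theta)
        y += r * math.sin(theta)
        points.append((x, y))
    return points

lengths = [math.sqrt(2), 1.0, 1.0, math.sqrt(2)]

# The original snake
snake = polar_to_cartesian(lengths, [45, 90, 0, 45])

# A move in internal coordinates: the third angle changes from 0 to 45 degrees
moved = polar_to_cartesian(lengths, [45, 90, 45, 45])

print([tuple(round(c, 3) for c in p) for p in snake])
print([tuple(round(c, 3) for c in p) for p in moved])
```

Because the bond lengths are inputs to the rebuild, any change of angles leaves them untouched by construction, while every point downstream of the changed angle moves.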

Cartesianpolar coordinates
Convert x and y to r and q x y (0,0),(1,1),(1,2),(2,2),(3,3) √2,1,1,√2 450,90o,0o,45o
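The reverse conversion, from 2D points to bond lengths and bond directions, can be sketched with `math.hypot` and `math.atan2`. As in the snake example, each angle is measured from the x axis; the function name is hypothetical.

```python
import math

def cartesian_to_polar(points):
    """Return (bond lengths, bond directions in degrees from the x axis) for consecutive 2D points."""
    lengths, angles = [], []
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        dx, dy = x1 - x0, y1 - y0
        lengths.append(math.hypot(dx, dy))
        angles.append(math.degrees(math.atan2(dy, dx)))
    return lengths, angles

snake = [(0, 0), (1, 1), (1, 2), (2, 2), (3, 3)]
lengths, angles = cartesian_to_polar(snake)
print(lengths, angles)

# Moving only the last point in Cartesian space changes the last bond length
moved_lengths, _ = cartesian_to_polar([(0, 0), (1, 1), (1, 2), (2, 2), (3, 0)])
print(moved_lengths[-1])
```

Running the conversion on a structure after a Cartesian move is a quick way to see whether the move violated the fixed-bond-length constraint.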

Moving the snake to the 3D world
Cartesian representation:
  points (additional z axis): (0,0,0), (1,1,0), (1,2,0), (2,2,0), (3,3,0)
  connections (predefined): 1-2, 2-3, 3-4, 4-5
Internal coordinates:
  bond lengths (predefined): √2, 1, 1, √2
  angles: 45°, 90°, 0°, 45°
  dihedral angles: 0°, 180°
Proteins: bond lengths and angles are fixed. Only dihedral angles are varied

Dihedral angles χ1-χ4 define the side chain
Dihedral angle: defines the geometry of 4 consecutive atoms (given bond lengths and angles)
[Figure from Wikipedia]

What we learned from our snake
Cartesian representation: easy to look at, difficult to move
  Moves do not preserve bond lengths (and angles in 3D)
Internal coordinates: easy to move, difficult to see
  Calculation of distances between points is not trivial
Proteins: bond lengths and angles are fixed. Only dihedral angles are varied

Solution: toggle
Transform: calculate dihedral angles from coordinates
MOVE STRUCTURE (polar coordinates): introduce changes in the structure by rotating around dihedral angle(s) (change Φ-Ψ values)
Transform: build positions in space according to dihedral angles
CALCULATE ENERGY (Cartesian coordinates): derive the distance matrix (neighbor list) for the energy score calculation
[Figure: PDB ATOM records with x, y, z coordinates on one side and a table of PHI, PSI, OMEGA, CHI1-CHI4 per position on the other]

Cartesian polar coordinates
How to calculate polar from Cartesian coordinates: example F: C’-N-Ca-C define plane perpendicular to N-Ca (b2) vector calculate projection of Ca-C (b3) and C’-N (b1) onto plane calculate angle between projections PDB x y z … ATOM C GLN A N ATOM N GLY A C ATOM CA GLY A C ATOM O GLY A O ….. …. Position PHI PSI OMEGA CHI1 CHI2 CHI3 CHI4 ….. 33 34 …. … (0,0),(1,1),(1,2),(2,2),(3,3) 450,90o,0o,45o
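The projection recipe for a dihedral angle can be sketched in plain Python. The sign convention below (positive when the far bond appears rotated clockwise from the near bond, looking along the central bond) is the usual one and is an assumption here, as are the helper names and the test coordinates.

```python
import math

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def scale(a, s):
    return tuple(x * s for x in a)

def dihedral(p0, p1, p2, p3):
    """Dihedral angle in degrees for four consecutive atoms, e.g. C'-N-CA-C for phi."""
    b0 = sub(p0, p1)  # from the second atom back to the first
    b1 = sub(p2, p1)  # central bond, e.g. N -> CA
    b2 = sub(p3, p2)  # last bond, e.g. CA -> C
    axis = scale(b1, 1.0 / math.sqrt(dot(b1, b1)))
    # Project the outer bonds onto the plane perpendicular to the central bond
    v = sub(b0, scale(axis, dot(b0, axis)))
    w = sub(b2, scale(axis, dot(b2, axis)))
    # Signed angle between the two projections
    x = dot(v, w)
    y = dot(cross(axis, v), w)
    return math.degrees(math.atan2(y, x))

# Hypothetical coordinates with an easily verified geometry
print(dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (1, 0, 1)))  # cis arrangement
print(dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)))  # quarter turn
```

Using `atan2` on the two components, rather than `acos` on a normalized dot product, gives the sign of the angle and avoids numerical trouble near 0° and 180°.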

Polar Cartesian coordinates
Find x,y,z coordinates of C, based on atom positions of C’, N and Ca, and a given F value (F: C’-N-Ca-C) create Ca-C vector: size Ca-C=1.51A (equilibrium bond length) angle N-Ca-C= 111o (equilibrium value for N-Ca-C angle) rotate vector around N-Ca axis to obtain projections of Ca-C and N-C’ with wanted F PDB x y z … ATOM C GLN A N ATOM N GLY A C ATOM CA GLY A C ATOM O GLY A O ….. …. Position PHI PSI OMEGA CHI1 CHI2 CHI3 CHI4 ….. 33 34 …. … (0,0),(1,1),(1,2),(2,2),(3,3) 450,90o,0o,45o
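Placing the C atom from three preceding atoms, a bond length, a bond angle, and a dihedral can be sketched with a local coordinate frame built on the N-Cα bond. The bond length (1.51 Å) and angle (111°) come from the slide; the input coordinates, the helper names, and the frame construction are assumptions for illustration. A small dihedral function is included so the result can be checked by a round trip.

```python
import math

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def unit(a):
    norm = math.sqrt(dot(a, a))
    return tuple(x / norm for x in a)

def place_atom(a, b, c, bond, angle_deg, torsion_deg):
    """Position of atom d so that |c-d| = bond, angle b-c-d = angle_deg, dihedral a-b-c-d = torsion_deg."""
    bc = unit(sub(c, b))
    n = unit(cross(sub(b, a), bc))  # normal of the a-b-c plane
    m = cross(n, bc)                # completes the local right-handed frame
    theta = math.radians(angle_deg)
    phi = math.radians(torsion_deg)
    dx = -bond * math.cos(theta)
    dy = bond * math.sin(theta) * math.cos(phi)
    dz = bond * math.sin(theta) * math.sin(phi)
    return tuple(c[i] + dx * bc[i] + dy * m[i] + dz * n[i] for i in range(3))

def dihedral(p0, p1, p2, p3):
    """Signed dihedral in degrees, used here only to verify the placement."""
    b0, b1, b2 = sub(p0, p1), sub(p2, p1), sub(p3, p2)
    axis = unit(b1)
    v = sub(b0, tuple(x * dot(b0, axis) for x in axis))
    w = sub(b2, tuple(x * dot(b2, axis) for x in axis))
    return math.degrees(math.atan2(dot(cross(axis, v), w), dot(v, w)))

# Hypothetical positions of C' (previous residue), N and CA
c_prev, n_atom, ca = (-0.7, 1.1, 0.0), (0.0, 0.0, 0.0), (1.46, 0.0, 0.0)
c_atom = place_atom(c_prev, n_atom, ca, bond=1.51, angle_deg=111.0, torsion_deg=-60.0)
print(c_atom, dihedral(c_prev, n_atom, ca, c_atom))
```

Repeating this placement atom by atom along the chain is how a full set of Cartesian coordinates can be rebuilt from a list of dihedral angles.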

Representation of protein structure
Rosetta folding: 3 backbone dihedral angles per residue
Build the coordinates of the structure starting from the first atom, according to the dihedral angles (and equilibrium bond lengths and angles)
Sampling and minimization in TORSIONAL space: change an angle and rebuild, starting from the changed angle
[Figure: a chain of residues 1-8; changing one dihedral moves the downstream residues 7-8]
Based on slides by Chu Wang

Representation of protein structure
Rosetta folding: 3 backbone dihedral angles per residue; sampling and minimization in TORSIONAL space
Rosetta docking: backbone dihedral angles fixed (rigid body); sampling and minimization in RIGID-BODY space
6 rigid-body DOFs: 3 translational vectors, 3 rotational angles
[Figure: two chains, residues 1-8 and 1'-8', docked as rigid bodies]
How can those two types of degrees of freedom be combined?

Fold tree representation
Originally developed to improve sampling of strand registers in β-sheet proteins
Allows simultaneous optimization of rigid-body and backbone/side chain torsional degrees of freedom
Example: fold-tree based docking
  "peptide" edge: 3 backbone dihedral angles
  "long-range" edge: 6 rigid-body DOFs
[Figure: two chains, residues 1-8 and 1'-8', each connected by "peptide" edges and joined by one "long-range" edge]
Construct fold trees to treat a variety of protein folding and docking problems
Fold tree: Bradley and Baker, Proteins (2006)
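The fold-tree idea can be sketched as a list of edges over residue numbers, with "peptide" edges covering chain segments and "jump" edges standing in for the long-range rigid-body connections. This is a toy data structure, not the Rosetta or PyRosetta FoldTree API; the edge format, the function names, and the nominal DOF count (3 per residue, 6 per jump) are illustrative assumptions.

```python
from collections import defaultdict

def expand_edges(edges):
    """Turn (start, stop, kind) edges into residue-to-residue connections."""
    bonds = []
    for start, stop, kind in edges:
        if kind == "peptide":
            step = 1 if stop >= start else -1
            bonds += [(i, i + step) for i in range(start, stop, step)]
        elif kind == "jump":
            bonds.append((start, stop))
        else:
            raise ValueError(f"unknown edge kind: {kind}")
    return bonds

def is_valid_tree(edges, n_residues):
    """A tree over n residues has n - 1 connections and reaches every residue from residue 1."""
    bonds = expand_edges(edges)
    if len(bonds) != n_residues - 1:
        return False
    graph = defaultdict(set)
    for a, b in bonds:
        graph[a].add(b)
        graph[b].add(a)
    seen, stack = {1}, [1]
    while stack:
        for nxt in graph[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen == set(range(1, n_residues + 1))

def nominal_dofs(edges, n_residues):
    """Nominal count: 3 backbone dihedrals per residue plus 6 rigid-body DOFs per jump."""
    n_jumps = sum(1 for _, _, kind in edges if kind == "jump")
    return 3 * n_residues + 6 * n_jumps

# Toy docking setup: chain A = residues 1-4, chain B = residues 5-8, one jump between them
docking = [(1, 4, "peptide"), (1, 5, "jump"), (5, 8, "peptide")]
print(is_valid_tree(docking, 8), nominal_dofs(docking, 8))
```

Requiring the edges to form a tree guarantees that every residue has exactly one path back to the root, so coordinates can be rebuilt unambiguously after any torsional or rigid-body change.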

Fold-trees for different modeling tasks
[Figures: fold-tree diagrams for protein folding, loop modeling, fully flexible docking, docking with loop modeling, and docking with hinge motion, followed by diagrams including symmetry, flexible side chains, and design]
Legend:
  Color: flexible backbone; gray: fixed backbone; pale: symmetry operation
  Filled colored circles: flexible side chain; empty colored circles: flexible amino acid (design)
  Edges: flexible "peptide" edge; rigid "peptide" edge; rigid "jump" (1 to 1'); flexible "jump" (1 to 1')
  N: N-terminal; C: C-terminal; X: chain break; O: root of the tree

Rosetta3: Object-oriented architecture
Figure 19.2 Pose architecture: The components of the pose class are illustrated for the case of a simple eight-residue system consisting of a two base-pair DNA duplex (residues 1–4) and a protein segment (residues 5–8). Conformational and chemical information are stored within the Conformation class as Residue objects (coordinates) with pointers to ResidueTypes (chemistry); the AtomTree class records the kinematic connectivity (the mapping between internal and Cartesian coordinates). Energies from the most recent evaluation of the scoring function are stored in the Energies class, which holds residue–residue interactions in the EnergyGraph. Finally, user-defined coordinate restraints are stored in the ConstraintSet, and additional Pose-associated data can be stored in the DataCache, where it will be copied along with the Pose during simulations.
Description of the object-oriented organization in Rosetta3: Leaver-Fay et al. Methods in Enzymology (2013)
