7. (Predicted) residue pair contacts guide ab initio modeling

… and homolog refinement too… Acknowledgments for the slides in this lecture go to Sergey Ovchinnikov!

Restraint function: Contact prediction via correlated mutations
Recent breakthrough: significantly longer proteins can be modeled without a template (ab initio). Ab initio modeling alone is restricted to small (100 aa), single-domain proteins. Adding information about contacts, predicted from co-evolution, gives a dramatic increase of scope (… 500 aa).

What is co-evolution?
Important contacts in proteins are evolutionarily conserved and encoded in a multiple sequence alignment (MSA). Co-evolving contacts can be within a protein, between proteins, mediated by a ligand, or due to conformational change. By measuring co-evolution, we can infer important contacts in proteins!

Contacting residues can be represented as a contact map!
Contact: residue-residue interaction. [Figure: contact map with both axes running from the N to the C terminus. Grey = structural contact; blue = predicted contact; intensity = strength of prediction.]
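As an illustration of the contact-map representation, here is a minimal sketch that derives a boolean contact map from Cβ coordinates. The 8 Å Cβ-Cβ cutoff, the minimum sequence separation of 3, and the toy coordinates are assumptions made for this example; they are not taken from the slides.

```python
import math

def contact_map(cb_coords, cutoff=8.0, min_separation=3):
    """Return an L x L boolean matrix; True where two residues are in contact.

    cutoff (in Angstrom) and min_separation are illustrative defaults,
    not values from the lecture.
    """
    n = len(cb_coords)
    cmap = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + min_separation, n):
            if math.dist(cb_coords[i], cb_coords[j]) < cutoff:
                cmap[i][j] = cmap[j][i] = True  # contact maps are symmetric
    return cmap

# Toy hairpin: residues 0-2 run one way, residues 3-5 run back alongside them.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0),
          (7.6, 3.8, 0.0), (3.8, 3.8, 0.0), (0.0, 3.8, 0.0)]
cmap = contact_map(coords)
contacts = [(i, j) for i in range(len(coords))
            for j in range(i + 1, len(coords)) if cmap[i][j]]
print(contacts)
```

Each True entry corresponds to one dot in the plotted map.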

GREMLIN used to measure Co-evolution
Global statistical model (Lapedes et al. 1990s; Balakrishnan et al. 2010). x = position; vi = one-body energy (conservation); wij = two-body energy (coupling). [Figure: positions X1, X2, X3, X4 with one-body terms V1, V2, V3, V4 and the coupling W14 between X1 and X4.] Hetunandan Kamisetty (Facebook).
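The model equation itself did not survive in this transcript. A global pairwise model of this kind is conventionally written as follows; this is the standard textbook form using the slide's symbols, given as a sketch rather than as the exact expression shown on the slide. Z is the normalizing constant (partition function) and L is the number of positions.

```latex
P(x_1, \dots, x_L) = \frac{1}{Z} \exp\Big( \sum_{i=1}^{L} v_i(x_i) + \sum_{i<j} w_{ij}(x_i, x_j) \Big)
```

The one-body terms v_i capture conservation at each position, and the two-body terms w_ij capture coupling between pairs of positions.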

Global statistical model (Lapedes et al. 1990s). x = position; fi = one-body energy (conservation; written vi on the other slides); ψij = two-body energy (coupling; written wij on the other slides). Learn a pseudo-likelihood model: connectivity (sparse: few significant correlations, i.e. contacts) and parameters (optimize the model of X, the MSA). Balakrishnan et al. 2010
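For reference, a pseudo-likelihood objective for such a model is usually written as below. This is the generic form, not a transcription of the slide: v_i and w_ij denote the one- and two-body terms (fi and ψij on this slide), N is the number of sequences in the MSA, R(v, w) stands for an unspecified sparsity-promoting penalty, and w_ij(a, b) = w_ji(b, a) is assumed.

```latex
\ell_{PL}(v, w) = \sum_{n=1}^{N} \sum_{i=1}^{L} \log P\big(x_i^{n} \mid x_{-i}^{n}\big) - R(v, w),
\qquad
P(x_i = a \mid x_{-i}) =
  \frac{\exp\big( v_i(a) + \sum_{j \neq i} w_{ij}(a, x_j) \big)}
       {\sum_{b} \exp\big( v_i(b) + \sum_{j \neq i} w_{ij}(b, x_j) \big)}
```

Each conditional is normalized over the states of a single position only, which avoids computing the global partition function.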

Global statistical model (Lapedes et al. 1990s; Balakrishnan et al. 2010). GREMLIN score: APC(L2norm(wij)). Pseudo-likelihood learning procedure, with a penalty to promote a sparse network. L2 norm of the matrix: square root of the sum of all squared entries. APC: average product correction. x = position; vi = one-body energy (conservation); wij = two-body energy (coupling). Example: 50S ribosomal protein L6.
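The scoring step APC(L2norm(wij)) can be sketched as follows. This is a minimal pure-Python illustration of the commonly used definitions (L2 norm of each coupling matrix, then subtraction of row mean times column mean divided by overall mean); it is not GREMLIN's actual code, and the tiny 2-state coupling matrices are made up for the example.

```python
import math

def l2_norm_scores(w):
    """w[i][j] is the coupling matrix (list of lists) for positions i and j.
    Returns an L x L matrix of L2 norms with a zero diagonal."""
    n = len(w)
    s = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                s[i][j] = math.sqrt(sum(x * x for row in w[i][j] for x in row))
    return s

def apc(s):
    """Average product correction: s_ij - mean(s_i.) * mean(s_.j) / mean(s)."""
    n = len(s)
    row_mean = [sum(s[i]) / (n - 1) for i in range(n)]  # diagonal is zero
    total_mean = sum(row_mean) / n
    if total_mean == 0.0:
        return [[0.0] * n for _ in range(n)]
    return [[0.0 if i == j else s[i][j] - row_mean[i] * row_mean[j] / total_mean
             for j in range(n)] for i in range(n)]

# Hypothetical couplings for 3 positions with 2 states each.
zero = [[0.0, 0.0], [0.0, 0.0]]
w01 = [[3.0, 0.0], [0.0, 4.0]]   # L2 norm 5
w02 = [[1.0, 0.0], [0.0, 0.0]]   # L2 norm 1
w12 = [[0.0, 1.0], [0.0, 0.0]]   # L2 norm 1
w = [[zero, w01, w02], [w01, zero, w12], [w02, w12, zero]]

raw = l2_norm_scores(w)
corrected = apc(raw)
print(raw[0][1], corrected[0][1], corrected[0][2])
```

The correction removes the background signal that each position contributes to all of its pairs, so that a strongly coupled pair stands out from pairs that merely involve a generally "loud" position.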

Gremlin (Generative REgularized ModeLs of proteINs)
Based on a pseudo-likelihood framework: Markov Random Field (more complex than an HMM, which is a chain).
Optimized for maximum correct contact predictions.
Includes predicted context information: secondary structure (PSIPRED) and contacts (SVMcon).
Informative MSA: number of sequences S (<90% id) > 4-5 x L (protein length). At 4-5L sequence depth, the top 1.5L contacts are ~50% correct.
Reliable modeling: ≥ 1 reliable non-local contact every 12 aa on average -> prediction of longer proteins.
Original paper: Balakrishnan …. Langmead. Proteins 2010. Bakerlab: Kamisetty et al. PNAS 2013; Kim et al. Proteins 2013; Ovchinnikov et al. eLife 2015 & Science 2017
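The rules of thumb above (sequence depth of at least about 4L, keep the top 1.5L contacts) can be turned into a small helper. This is a sketch under stated assumptions: the function names, the dictionary input format, and the minimum sequence separation of 3 are hypothetical choices for illustration, not part of GREMLIN.

```python
def msa_is_informative(n_seqs, length, min_ratio=4.0):
    """Rule of thumb from the slide: sequences per position of about 4-5 or more."""
    return n_seqs / length >= min_ratio

def top_contacts(scores, length, factor=1.5, min_separation=3):
    """Keep the factor * L highest-scoring pairs.

    scores: dict mapping (i, j) residue index pairs to a contact score.
    min_separation is an assumed filter for trivially close residues.
    """
    eligible = [(s, pair) for pair, s in scores.items()
                if abs(pair[0] - pair[1]) >= min_separation]
    eligible.sort(reverse=True)
    n_keep = int(factor * length)
    return [pair for _, pair in eligible[:n_keep]]

# Toy scores for a 6-residue protein (made-up numbers).
toy_scores = {(0, 5): 0.8, (0, 1): 1.5, (1, 4): 1.2, (2, 5): 0.1}
selected = top_contacts(toy_scores, length=6)
print(selected)

# Depth check using the YAAA_ECOLI numbers quoted later in this lecture.
print(msa_is_informative(1208, 258), int(1.5 * 258))
```

For the YAAA_ECOLI alignment (1208 sequences, length 258) the depth is about 4.7 sequences per position, and 1.5L corresponds to 387 contacts.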

When is it useful? GREMLIN needs many sequences -> a structural template is often available -> no need for contact predictions ….
Model discrimination? ΔGREMLIN (written GREMLINΔ in the figure caption): difference between native and model scores (CAMEO dataset, n=329). For 10% (34/329 proteins), GREMLIN discriminates the native from the rest.
Figure caption: Utility of contact prediction for structure modeling. (A) Ranking of alternate models by GREMLINΔ. Three scenarios are illustrated; each represents a distinct protein target, black dots indicate alternate models, red dots indicate native structures. (Left) GREMLINΔ is not useful in selecting the closest model and does not correctly discriminate between native (target pdb:4hwnA) and homology models; (Middle) GREMLINΔ ranks homology models correctly (top five models within 0.05 of best five on average; R² between GREMLIN score and fraction of native contacts > 0.8) but adds no additional information (target pdb:4fn4D); (Right) GREMLINΔ discriminates between best model and native structure (target pdb:4hxtA). In an additional 6% of the targets, GREMLINΔ correctly discriminated the native from the homology models but there were not enough models to reliably establish accuracy of ranking. Kamisetty et al. 2013

When is it useful? GREMLIN needs many sequences -> a structural template is often available -> no need for contact predictions ….
Better information than templates? HHΔ (closeness of template, from ΔHHPred scores): 0 = HHPred query and template alignment identical; 1 = no homolog with known structure (CAMEO dataset, n=339). HHΔ > 0.5 -> GREMLIN is useful for model discrimination (GREMLINΔ > 0).
Figure caption: Utility of contact prediction for structure modeling. (B) HHΔ predicts GREMLINΔ: GREMLINΔ versus structural similarity of homolog to native structure computed by TM-align (14) (for homologs of all targets with high-resolution crystal structures < 2.1 Å). When HHΔ ≤ 0.5 (blue bars), GREMLINΔ is rarely better than random (green bars, constructed by pooling 100 permutations of predicted scores for each target). When HHΔ > 0.5 (red bars), GREMLINΔ is significantly positive and contact scores successfully discriminate between native and homology model even when the homolog is likely to be from the same fold (similarity ∈ [0.5, 0.8)). Error bars show mean and SD of distributions in all cases. Kamisetty et al. 2013

When is it useful? GREMLIN needs many sequences -> a structural template is often available -> no need for contact predictions ….
Analysis of PFAM: GREMLIN could be useful for 14% (422/12,452) of the families. Estimated from: # cases with a distant template (HHΔ > 0.5) and # cases with enough sequences (Sequences/Length > 4).
Figure caption: Frequency of utility of contact prediction. The protein families in the Pfam database were divided into three groups based on the HHsearch P value of the closest protein of known structure (Left, HHsearch P value > 10^-6.5; Middle, HHsearch P value between 10^-40 and 10^-6.5; Right, HHsearch P value < 10^-40). Within each group, the number of families with sequences/length less than 1, between 1 and 5, and greater than 5 are shown in blue, red, and green, respectively (Upper bars). For families with > 5 sequences per position (Upper green bars), the distribution of HHΔ to the closest protein of known structure is shown in the lower panel. In cases where the difference in profiles is large (HHΔ > 0.5: right bar in each group, Lower), these predictions are likely to improve on comparative models. Kamisetty et al. 2013
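The two criteria used for this estimate (distant template, enough sequences) amount to a simple filter over families. The sketch below uses the thresholds quoted on the slide; the function name and the three example families with their numbers are hypothetical.

```python
def contact_prediction_useful(hh_delta, n_seqs, length,
                              min_hh_delta=0.5, min_seqs_per_length=4.0):
    """True if the closest template is distant AND the alignment is deep enough."""
    return hh_delta > min_hh_delta and n_seqs / length > min_seqs_per_length

# Made-up families: (HH delta to closest known structure, #sequences, length)
families = {
    "famA": (0.9, 1200, 250),   # distant template, deep alignment
    "famB": (0.2, 5000, 300),   # close template available
    "famC": (0.8, 300, 200),    # distant template, shallow alignment
}
useful = [name for name, (hhd, n, length) in families.items()
          if contact_prediction_useful(hhd, n, length)]
print(useful)
```

Only families that pass both tests are counted as cases where predicted contacts are expected to add information beyond a comparative model.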

Example: CASP T0806 predicted contacts
YAAA_ECOLI. Seqs: 1208; Length: 258. [Figure: top 1.5L predicted contacts.] HHsearch result for the top hit: Prob = 12.4%, E-value = 20. Improve confidence by combining with GREMLIN contacts.

Not all contacts should be made!
Predicted contacts can belong to: the monomer; a homo-dimer interface; a ligand-mediated interaction; one of multiple states.

Functional form to “de-noise”
[Figure: starting conformation; models obtained with sigmoid vs. harmonic restraints.] Sigmoidal restraints prevent “false” contacts from distorting the structure, maximizing self-consistent contacts. This requires LOTS of sampling, though.
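The difference between the two functional forms can be seen numerically. The sketch below uses generic textbook harmonic and logistic functions with made-up parameters (target distance 8 Å, unit slope and force constant); it is not Rosetta's implementation of either restraint.

```python
import math

def harmonic_penalty(d, d0, k=1.0):
    """Grows quadratically: a violated restraint keeps pulling harder."""
    return k * (d - d0) ** 2

def sigmoid_penalty(d, d0, slope=1.0):
    """Bounded between 0 and 1: a badly violated restraint saturates."""
    return 1.0 / (1.0 + math.exp(-slope * (d - d0)))

d0 = 8.0  # assumed target distance in Angstrom
for d in (6.0, 8.0, 12.0, 30.0):
    print(d, harmonic_penalty(d, d0), round(sigmoid_penalty(d, d0), 4))
```

For a false contact whose residues sit 30 Å apart, the harmonic penalty is 484 and still rising, so it can distort the structure. The sigmoid is already flat there, so it contributes almost no force. The flat region is also why more sampling is needed: a distant pair gets little guidance toward the contact.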

Residue-pair-specific Cβ-Cβ distance
[Figure scale: 2.9 to 9.0.] Maximum Cβ-Cβ distance that allows a contact (< 5 Å between any heavy atoms). Bring residues close enough to form contacts, then let the Rosetta energy function decide whether the contact should be formed. Can be used in centroid mode.

CASP target T0806 - each model made/missed a different subset of contacts
[Figure: contact maps of the top 4 models, showing structure contacts (5 Å) and predicted contacts.]
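The observation that each model makes or misses a different subset of contacts can be checked with set operations. In this sketch the predicted contacts and the per-model contact sets are invented toy data, and the helper name is hypothetical.

```python
def made_and_missed(predicted, model_contacts):
    """Split predicted contacts into those present in a model and those absent."""
    made = predicted & model_contacts
    return made, predicted - made

predicted = {(3, 40), (10, 55), (22, 70), (5, 90)}
models = {
    "model1": {(3, 40), (10, 55), (1, 2)},
    "model2": {(10, 55), (22, 70)},
}

covered = set()
for name, contacts in models.items():
    made, missed = made_and_missed(predicted, contacts)
    covered |= made
    print(name, "made", sorted(made), "missed", sorted(missed))

never_made = predicted - covered
print("missed by every model:", sorted(never_made))
```

Contacts that no model satisfies are candidates for false predictions, or for contacts that belong to another state or interface.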

Pipeline: Contact prediction essential for convergence
Components: hybridize (using RosettaCM); fragment insertion (20 trials); ab initio (using RosettaAB). Repeat until the CASP deadline or convergence. Contact prediction is essential for convergence; iterative refinement is essential for improved model quality.
References: Song, Y., DiMaio, F., Wang, R.Y.R., Kim, D., Miles, C., Brunette, T.J., Thompson, J. and Baker, D. High-resolution comparative modeling with RosettaCM. Kim, D.E., DiMaio, F., Yu‐Ruei Wang, R., Song, Y. and Baker, D. One contact for every twelve residues allows robust and accurate topology‐level protein structure modeling.

Transition ab initio -> Template based modeling
Contact-assisted ab initio prediction using Rosetta; contacts refine the template topology.
Determination of topology: ab initio folding with constraints; find fragment pairs.
Refinement of topology: refine the structure by imposing constraints.
Reference: Kim, D.E., DiMaio, F., Yu‐Ruei Wang, R., Song, Y. and Baker, D. One contact for every twelve residues allows robust and accurate topology‐level protein structure modeling.

Modeling with contact predictions: CASP 12 results for Rosetta
[Figure: examples showing predicted contacts, model, and X-ray structure; contacts grouped by distance: < 5 Å, < 10 Å, > 10 Å.] Bakerlab: Kamisetty et al. PNAS 2013; Kim et al. Proteins 2013; Ovchinnikov et al. eLife 2015 & Science 2017

Modeling with contact predictions: New models for uncharacterized families
Calibrate on 27 proteins with known structure using subsampled alignments.
Approach: generate the GREMLIN matrix based on alignments of increasing length; generate 20K models with constrained RosettaCM and select the top-scoring model (de novo); hybridize-refine the top 20 models (refinement).
Measure: Nf, the number of effective sequences. Nf = (# sequence clusters at <80% seq. id) / √Length. Nf > 64: accurate model; Nf > 16: same fold.
Fig. 2. Metagenome data greatly increased fraction of structures that can be accurately modeled. (A) Dependence of coevolution guided Rosetta structure-prediction accuracy on the effective number of sequences Nf in the protein family. For each of 27 proteins of known structure, the multiple sequence alignment was subsampled, and residue-residue contacts were predicted by using GREMLIN. Rosetta structure-prediction calculations were then used to generate ~20,000 models, and a single model was selected on the basis of the Rosetta energy and the fit to the coevolution constraints; the average TM score of these selected models over all 27 cases is shown on the y axis (dashed line). Hybridization-based refinement of the top 20 models together with the top 10 map_align-based models for each case increases the average accuracy (solid line); models with fold-level accuracy (TM score of >0.5) are obtained for Nf ≥ 16, and models with accuracy typical of comparative modeling, for Nf of 64. (B) Fraction of protein families of unknown structure with at least 64 Nf. Dashed line: including only sequences in UniRef100 database; solid line: including sequences in UniRef100 database together with metagenome sequence data from the Joint Genome Institute (37). (C) Distribution of Nf values for 5211 Pfam families with currently unknown structure, after the addition of metagenomic sequences; 25% of the protein families have Nf > 64, 34% have Nf > 32, and 45% have Nf > 16.
Nf correlates well with accuracy (TM score) and is length-independent. Ovchinnikov et al. eLife 2015 (1) & Science 2017 (2)
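The Nf measure (number of sequence clusters at 80% identity, divided by the square root of the protein length) can be sketched as follows. The greedy clustering and the identity definition used here are simplifying assumptions for illustration; the published pipeline may cluster differently, and the toy alignment is made up.

```python
import math

def identity(a, b):
    """Fraction of aligned columns with the same residue (gaps never match)."""
    assert len(a) == len(b), "sequences must come from the same alignment"
    same = sum(1 for x, y in zip(a, b) if x == y and x != "-")
    return same / len(a)

def count_clusters(msa, threshold=0.8):
    """Greedy clustering: a sequence founds a new cluster only if it is
    below the identity threshold to every existing representative."""
    representatives = []
    for seq in msa:
        if all(identity(seq, rep) < threshold for rep in representatives):
            representatives.append(seq)
    return len(representatives)

def nf(msa, threshold=0.8):
    """Nf = number of clusters / sqrt(alignment length)."""
    return count_clusters(msa, threshold) / math.sqrt(len(msa[0]))

toy_msa = [
    "ACDEFGHIKLMNPQRS",
    "ACDEFGHIKLMNPQRT",  # 15/16 identical to the first: same cluster
    "ACDEFGHIWWWWWWWW",  # 8/16 identical: new cluster
    "YYYYYYYYYYYYYYYY",  # unrelated: new cluster
]
print(count_clusters(toy_msa), nf(toy_msa))
```

Dividing by √Length rather than Length is what makes the measure comparable across proteins of different sizes, in line with the note that Nf is length-independent.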

Modeling with contact predictions: New models for uncharacterized families
Nf > 64: accurate model; Nf > 16: accurate fold.
Large-scale modeling of prokaryotic proteomes: 58/121 prokaryotic protein families with no structural template -> templates for ~400K prokaryotic proteins.
Large scale + metagenomic data: 921/1297 with enough long-range contacts; 1024 domains; 614/1024 with no current structure -> 137 new folds -> templates for ~500K UniProt & 3M metagenomic proteins.
Ovchinnikov et al. eLife 2015 (1) & Science 2017 (2)

Summary: Structure prediction with correlated contacts
Correlated evolution identifies neighboring residue pairs in protein structure.
An informative MSA is critical; enough sequences are available today.
Contacts are used to guide structure prediction, in particular when no template is identified.
Significant increase in proteins with reliable structural models, in particular for transmembrane proteins; helped by metagenomic data.
