Jürgen Sühnel Supplementary Material: -2011- 3D Structures of Biological Macromolecules.


1 Jürgen Sühnel jsuehnel@fli-leibniz.de Supplementary Material: http://www.fli-leibniz.de/www_bioc/3D/ -2011- 3D Structures of Biological Macromolecules Part 4: Protein Structure Prediction Leibniz Institute for Age Research, Fritz Lipmann Institute (FLI) Jena Centre for Bioinformatics (JCB) Jena Centre for Systems Biology of Ageing (JenAge) Jena / Germany

2 PDB Content Growth (experimental structures only)
Year   Yearly   Total
1980       16      70 structures
1993      695    1582 structures (~ 2 new structures per day)
2003     4167   23597 structures (~ 11 new structures per day)
2009     7396   62191 structures (~ 20 new structures per day)
2010     7923   70114 structures (~ 22 new structures per day)
2011     8123   78237 structures (~ 22 new structures per day)

3 January 2, 2012 (Last Update: December 21, 2011)

4 UniProt/SwissProt: Growth Rate 19.01.2011

5 UniProt/TrEMBL: Growth Rate 19.01.2011

6 Swiss-Prot/TrEMBL: Amino Acid Composition (Swiss-Prot, TrEMBL; 15-Jan-2008)

7 Structural Genomics
Structural genomics consists of determining the three-dimensional structures of all proteins of a given organism, by experimental methods such as X-ray crystallography and NMR spectroscopy, or by computational approaches such as homology modelling.
As opposed to traditional structural biology, the determination of a protein structure through a structural genomics effort often (but not always) comes before anything is known regarding the protein function. This raises new challenges in structural bioinformatics, i.e. determining protein function from its 3D structure.
One of the important aspects of structural genomics is the emphasis on high-throughput determination of protein structures. This is performed in dedicated centers of structural genomics.
While most structural biologists pursue structures of individual proteins or protein groups, specialists in structural genomics pursue structures of proteins on a genome-wide scale. This implies large-scale cloning, expression and purification. One main advantage of this approach is economy of scale. On the other hand, the scientific value of some resultant structures is at times questioned.
Source: en.wikipedia.org/wiki/Structural_genomics

8 Structural Genomics

9 Protein Structure Prediction

10

11 A Good Protein Structure
–Minimizes disallowed torsion angles
–Maximizes number of hydrogen bonds
–Minimizes interstitial cavities or spaces
–Minimizes number of “bad” contacts
–Minimizes number of buried charges

12 Protein Structure Prediction – CAFASP Contest http://www.cs.bgu.ac.il/~dfischer/CAFASP5/

13 Protein Structure Prediction – CASP Contest http://predictioncenter.gc.ucdavis.edu/

14 Protein Structure Prediction – CASP Contest http://predictioncenter.gc.ucdavis.edu/

15 Protein Structure Prediction
  –Secondary structure
  –3D structure
Modeling by homology (Comparative modeling)
Fold recognition (Threading)
Ab initio prediction
  –Rule-based approaches
  –Lattice models
  –Simulating the time dependence of folding
Refinement
Exploring the effect of single amino acid substitutions
Ligand effects on protein structure and dynamics (induced fit)

16 Lysozyme

17 Lysozyme – 5lyz

18

19 Lysozyme – 5lyz: Information from the JenaLib Atlas Page

20

21

22 Lysozyme – 5lyz: Information from the JenaLib Atlas Page - ProSite

23 Lysozyme – 5lyz: PROSITE Signature

24 PROMOTIF Secondary Structure Analysis – 5lyz

25 Protein Backbone Torsion Angles D. W. Mount: Bioinformatics, Cold Spring Harbor Laboratory Press, 2001.

26 Sidechain Torsion/Dihedral Angles

27 PROMOTIF Secondary Structure Analysis – 5lyz

28

29

30 Chou-Fasman Secondary Structure Prediction

31 Amino Acid Propensities
From a database of experimental 3D structures, calculate the propensity for a given amino acid to adopt a certain type of secondary structure.
Example: N(Ala) = 2,000; N(tot) = 20,000; N(Ala,helix) = 568; N(helix) = 4,000.
P(Ala,helix) = [N(Ala,helix)/N(helix)] / [N(Ala)/N(tot)]
P(Ala,helix) = [568/4,000] / [2,000/20,000] = 1.42
Used in the Chou-Fasman algorithm.
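The propensity formula on this slide can be checked with a few lines of Python. This is only a sketch: the function name `propensity` and its argument names are chosen here for illustration, and the counts are the example values from the slide.

```python
def propensity(n_aa_in_state, n_state, n_aa, n_total):
    """Propensity of an amino acid for a secondary structure state.

    Ratio of the amino acid's frequency within the state to its
    overall frequency in the database.
    """
    return (n_aa_in_state / n_state) / (n_aa / n_total)


# Example counts from the slide: Ala in helices
p_ala_helix = propensity(n_aa_in_state=568, n_state=4000, n_aa=2000, n_total=20000)
print(f"P(Ala,helix) = {p_ala_helix:.2f}")
```

A value above 1 means the residue occurs in that state more often than expected from its overall abundance.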

32 Chou-Fasman Secondary Structure Prediction
1. Assign all of the residues in the peptide the appropriate set of parameters.
2. Scan through the peptide and identify regions where 4 out of 6 contiguous residues have P(a-helix) > 100. Such a region is declared an alpha-helix.
3. Extend the helix in both directions until a set of four contiguous residues with an average P(a-helix) < 100 is reached. That is declared the end of the helix. If the segment defined by this procedure is longer than 5 residues and the average P(a-helix) > P(b-sheet) for that segment, the segment can be assigned as a helix.
4. Repeat this procedure to locate all of the helical regions in the sequence.
5. Scan through the peptide and identify a region where 3 out of 5 of the residues have a value of P(b-sheet) > 100. That region is declared a beta-sheet.
6. Extend the sheet in both directions until a set of four contiguous residues with an average P(b-sheet) < 100 is reached. That is declared the end of the beta-sheet. Any segment of the region located by this procedure is assigned as a beta-sheet if the average P(b-sheet) > 105 and the average P(b-sheet) > P(a-helix) for that region.
7. Any region containing overlapping alpha-helical and beta-sheet assignments is taken to be helical if the average P(a-helix) > P(b-sheet) for that region. It is a beta-sheet if the average P(b-sheet) > P(a-helix) for that region.
8. To identify a bend at residue number j, calculate p(t) = f(j) f(j+1) f(j+2) f(j+3), where the f(j+1) value for the j+1 residue is used, the f(j+2) value for the j+2 residue is used and the f(j+3) value for the j+3 residue is used. If (1) p(t) > 0.000075; (2) the average value for P(turn) > 1.00 in the tetrapeptide; and (3) the averages for the tetrapeptide obey the inequality P(a-helix) < P(turn) > P(b-sheet), then a beta-turn is predicted at that location.
Note: the helix and sheet thresholds assume parameters scaled by 100, so a threshold of 100 corresponds to a propensity of 1.00 on the scale of slide 31.
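The nucleation scan (4 of 6 residues with P(a-helix) > 100) might be sketched as follows. This is an illustration under stated assumptions, not a full Chou-Fasman implementation: it omits the extension, overlap and turn rules, the function name `find_nuclei` is hypothetical, and the parameter table contains only the six residues whose values appear on slide 34, scaled by 100.

```python
def find_nuclei(seq, propensity, window=6, min_hits=4, threshold=100):
    """Return start indices of windows in which at least `min_hits`
    residues have a propensity above `threshold`."""
    starts = []
    for i in range(len(seq) - window + 1):
        hits = sum(1 for aa in seq[i:i + window] if propensity[aa] > threshold)
        if hits >= min_hits:
            starts.append(i)
    return starts


# Helix parameters for the residues shown on slide 34, scaled by 100.
# Assumption: a real run needs the full 20-residue parameter table.
HELIX_P = {"G": 57, "R": 98, "C": 70, "E": 139, "L": 141, "A": 142}

# Fragment covered by the windows GRCE ... ELAA on slide 34
helix_nuclei = find_nuclei("GRCELAA", HELIX_P)
print(helix_nuclei)
```

The same function covers the beta-sheet scan when called with a sheet parameter table, `window=5` and `min_hits=3`.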

33 Lysozyme – 5lyz: Chou-Fasman Secondary Structure Prediction http://fasta.bioch.virginia.edu/fasta_www/chofas.htm

34 Lysozyme – 5lyz: Chou-Fasman Secondary Structure Prediction http://fasta.bioch.virginia.edu/fasta_www/chofas.htm
Window   Residue values            Average
GRCE     (0.57|0.98|0.70|1.39)     0.91
RCEL     (0.98|0.70|1.39|1.41)     1.12
CELA     (0.70|1.39|1.41|1.42)     1.23
ELAA     (1.39|1.41|1.42|1.42)     1.41
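The four window averages on this slide can be reproduced with a sliding window of four residues. A minimal sketch: the function name `window_averages` is chosen for illustration, and the per-residue values are the ones printed on the slide.

```python
def window_averages(seq, propensity, window=4):
    """Return (window, average propensity) for each sliding window."""
    result = []
    for i in range(len(seq) - window + 1):
        segment = seq[i:i + window]
        average = sum(propensity[aa] for aa in segment) / window
        result.append((segment, round(average, 2)))
    return result


# Per-residue helix values as printed on the slide
HELIX_VALUES = {"G": 0.57, "R": 0.98, "C": 0.70, "E": 1.39, "L": 1.41, "A": 1.42}

averages = window_averages("GRCELAA", HELIX_VALUES)
for segment, average in averages:
    print(segment, average)
```

The average rises above 1.00 once the window moves past the Gly and Cys residues, which is where the extension rule would keep a helix going.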

35 Lysozyme – 5lyz: PhD/PROF Structure Prediction http://cubic.bioc.columbia.edu/predictprotein/submit_def.html#top
PROF_sec: PROF predicted secondary structure: H = helix, E = extended (sheet), blank = other (loop). PROF = Profile network prediction Heidelberg.
Rel_sec: reliability index for the PROF_sec prediction (0 = low to 9 = high).
SUB_sec: subset of the PROF_sec prediction, for all residues with an expected average accuracy > 82% (tables in header). NOTE: for this subset the following symbols are used: L is loop (for which ' ' is used above); '.' means that no prediction is made for this residue, as the reliability is Rel < 5.
O3_acc: observed relative solvent accessibility (acc) in 3 states: b = 0-9%, i = 9-36%, e = 36-100%.
P3_acc: PROF predicted relative solvent accessibility (acc) in 3 states: b = 0-9%, i = 9-36%, e = 36-100%.
Rel_acc: reliability index for the PROF_acc prediction (0 = low to 9 = high).
SUB_acc: subset of the PROF_acc prediction, for all residues with an expected average correlation > 0.69 (tables in header). NOTE: for this subset the following symbols are used: I is intermediate (for which ' ' is used above); '.' means that no prediction is made for this residue, as the reliability is Rel < 4.

36 Lysozyme – 5lyz: PhD/PROF Structure Prediction, BLAST http://cubic.bioc.columbia.edu/predictprotein/submit_def.html#top

37 Lysozyme – 5lyz: PhD/PROF Structure Prediction, BLAST http://cubic.bioc.columbia.edu/predictprotein/submit_def.html#top

38 Lysozyme – 5lyz: PhD/PROF Structure Prediction http://cubic.bioc.columbia.edu/predictprotein/submit_def.html#top
–Perform BLAST search to find local alignments
–Remove alignments that are “too close”
–Perform multiple alignments of sequences
–Construct a profile (PSSM) of amino-acid frequencies at each residue
–Use this profile as input to the neural network
–A second network performs “smoothing”
–The third level computes a jury decision of several different instantiations of the first two levels

39 PSSM A PSSM, or Position-Specific Scoring Matrix, is a type of scoring matrix used in protein BLAST searches in which amino acid substitution scores are given separately for each position in a protein multiple sequence alignment. Thus, a Tyr-Trp substitution at position A of an alignment may receive a very different score than the same substitution at position B. This is in contrast to position-independent matrices such as the PAM and BLOSUM matrices, in which the Tyr-Trp substitution receives the same score no matter at what position it occurs.
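To make the position dependence concrete, the sketch below builds a small log-odds PSSM from a toy alignment. Everything in it is an assumption for illustration: the four aligned sequences are invented, the background frequency is taken as uniform (1/20), and a simple additive pseudocount stands in for the sequence weighting and pseudocount schemes used by real tools such as PSI-BLAST.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """Return one {amino acid: log-odds score} dict per alignment column."""
    n_seqs = len(alignment)
    background = 1.0 / len(AMINO_ACIDS)  # assumed uniform background
    pssm = []
    for column in zip(*alignment):
        scores = {}
        for aa in AMINO_ACIDS:
            freq = (column.count(aa) + pseudocount) / (
                n_seqs + pseudocount * len(AMINO_ACIDS)
            )
            scores[aa] = math.log2(freq / background)
        pssm.append(scores)
    return pssm


# Hypothetical gap-free alignment: column 0 is a conserved Tyr,
# column 1 is a variable aromatic position.
toy_alignment = ["YWKA", "YFRA", "YWKG", "YYRA"]
pssm = build_pssm(toy_alignment)

# The same residue (Trp) scores differently at the two positions.
print(round(pssm[0]["W"], 2), round(pssm[1]["W"], 2))
```

Placing Trp at the conserved Tyr position is penalized, while at the variable aromatic position it scores positively, which is exactly what a position-independent matrix cannot express.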

40 PSI-BLAST Position-specific iterative BLAST (PSI-BLAST) refers to a feature of BLAST 2.0 in which a profile (or position-specific scoring matrix, PSSM) is constructed (automatically) from a multiple alignment of the highest scoring hits in an initial BLAST search. The PSSM is generated by calculating position-specific scores for each position in the alignment. Highly conserved positions receive high scores and weakly conserved positions receive scores near zero. The profile is used to perform a second (etc.) BLAST search and the results of each "iteration" are used to refine the profile. This iterative searching strategy results in increased sensitivity.

41 Conserved Domain Database http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml

42 PSSM – 1tot PSSM – 1tot (Zz Domain Of Cbp: An Unusual Zinc Finger Fold In A Protein Interaction Module) http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml

43 1tot 1tot (Zz Domain Of Cbp: An Unusual Zinc Finger Fold In A Protein Interaction Module)

44 Lysozyme – 5lyz: PsiPred Structure Prediction http://bioinf.cs.ucl.ac.uk/psipred/psiform.html

45 PsiPred PSIPRED is a simple and reliable secondary structure prediction method, incorporating two feed-forward neural networks which perform an analysis on output obtained from PSI-BLAST (Position Specific Iterated - BLAST). Version 2.0 of PSIPRED includes a new algorithm which averages the output from up to 4 separate neural networks in the prediction process to further increase prediction accuracy. Using a very stringent cross-validation method to evaluate the method's performance, PSIPRED 2.0 is capable of achieving an average Q3 score of nearly 78%. Predictions produced by PSIPRED were also submitted to the CASP4 server and assessed during the CASP4 meeting, which took place in December 2000 at Asilomar. PSIPRED 2.0 achieved an average Q3 score of 80.6% across all 40 submitted target domains with no obvious sequence similarity to structures present in PDB, which placed PSIPRED in first place out of 20 evaluated methods (an earlier version of PSIPRED was also ranked first in CASP3 held in 1998).

46 Comparing Secondary Structure Prediction Results PsiPred Chou-Fasman Phd/PROF

47 Comparing Secondary Structure Prediction Results

48 Protein Secondary Structure Prediction - Summary
1st Generation - 1970s: Chou & Fasman, Q3 = 50-55%
2nd Generation - 1980s: Qian & Sejnowski, Q3 = 60-65%
3rd Generation - 1990s: PHD, PSI-PRED, Q3 = 70-80%
Features of the new methods:
–Taking into account evolutionary information
–Neural networks
Failures:
–Nonlocal sequence interactions
–Wrong prediction at the ends of H/E
Q3 – Percentage of correctly assigned amino acids in a test set
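The Q3 score defined on this slide is a per-residue accuracy over three states. A minimal sketch, assuming predicted and observed structures are given as equal-length strings over H (helix), E (strand) and C (other); the two example strings are invented for illustration.

```python
def q3(predicted, observed):
    """Percentage of residues whose predicted state matches the observed one."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not observed:
        raise ValueError("empty input")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)


# Hypothetical 10-residue example: both helix and strand ends are mispredicted
observed = "HHHHCCEEEE"
predicted = "HHHCCCEEEC"
score = q3(predicted, observed)
print(f"Q3 = {score:.1f}%")
```

In this example both errors sit at the ends of a helix or strand, the failure mode named on the slide.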

49 Protein Structure Prediction http://speedy.embl-heidelberg.de/gtsp/flowchart2.html

50 Modeling by Homology (Comparative Modeling) http://salilab.org/modeller/

51 Modeling by Homology (Comparative Modeling) http://modbase.compbio.ucsf.edu/modbase-cgi-new/search_form.cgi

52 Modeling by Homology (Comparative Modeling) http://modbase.compbio.ucsf.edu/modbase-cgi-new/search_form.cgi

53 Modeling by Homology (Comparative Modeling) http://modbase.compbio.ucsf.edu/modbase-cgi-new/search_form.cgi

54 Modeling by Homology (Comparative Modeling) http://swissmodel.expasy.org/

55 Modeling by Homology (Comparative Modeling) http://salilab.org/modeller/ Comparative modeling predicts the three-dimensional structure of a given protein sequence (target) based primarily on its alignment to one or more proteins of known structure (templates). The prediction process consists of fold assignment, target-template alignment, model building, and model evaluation and refinement. The number of protein sequences that can be modeled and the accuracy of the predictions are increasing steadily because of the growth in the number of known protein structures and because of the improvements in the modeling software. Further advances are necessary in recognizing weak sequence-structure similarities, aligning sequences with structures, modeling of rigid body shifts, distortions, loops and side chains, as well as detecting errors in a model. Despite these problems, it is currently possible to model with useful accuracy significant parts of approximately one third of all known protein sequences.

56 Fold Recognition (Threading) Methods of protein fold recognition attempt to detect similarities between protein 3D structures that are not accompanied by any significant sequence similarity. The unifying theme of these approaches is to try to find folds that are compatible with a particular sequence. Unlike sequence-only comparison, these methods take advantage of the extra information made available by 3D structure information. Rather than predicting how a sequence will fold, they predict how well a fold will fit a sequence.

57 Fold Recognition (Threading) – Why?
–Secondary structure is more conserved than primary structure
–Tertiary structure is more conserved than secondary structure
–Therefore very remote relationships can be better detected through 2° or 3° structural homology instead of sequence homology

58 Fold Recognition (Threading)

59 Fold Recognition (Threading) – 2 Types
2D Threading or Prediction Based Methods (PBM)
–Predict secondary structure (SS) or accessible surface area (ASA) of query
–Evaluate on basis of SS and/or ASA matches
3D Threading or Distance Based Methods (DBM)
–Create a 3D model of the structure
–Evaluate using a distance-based “hydrophobicity” or pseudo-thermodynamic potential

60 Fold Recognition
Database of 3D structures and sequences
–Protein Data Bank (or non-redundant subset)
Query sequence
–Sequence < 25% identity to known structures
Alignment protocol
–Dynamic programming
Evaluation protocol
–Distance-based potential or secondary structure
Ranking protocol

61 Fold Recognition http://www.sbg.bio.ic.ac.uk/~3dpssm/index2.html

62 Ab Initio Prediction
–Predicting the 3D structure without any “prior knowledge”
–Used when homology modelling or threading have failed (no homologues are evident)
–Equivalent to solving the “Protein Folding Problem”
–Still a research problem

63 Ab Initio Prediction http://rosettadesign.med.unc.edu/

64 Ab Initio Prediction Simons, Strauss, Baker. J. Mol. Biol. 2001, 306, 1191-1199.

65 Ab Initio Prediction – Lysozyme (5lyz) http://rosettadesign.med.unc.edu/

66 Combining Prediction Procedures http://robetta.bakerlab.org/

67 Protein Model Portal http://www.proteinmodelportal.org/

68 Molecular Mechanics (Force Field) http://cmm.info.nih.gov/modeling/guide_documents/molecular_mechanics_document.html

69 How Do We Get the Parameters?
–Experimental Data (Examples: Geometrical Parameters)
–Quantum-chemical Calculations (Examples: Charges)

70 Geometry Optimization

71 Optimization Methods – Steepest Descent
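As a rough illustration of steepest descent, the sketch below minimizes a made-up two-variable "energy" by repeatedly stepping against the gradient. The energy function, the fixed step size and the convergence threshold are all assumptions chosen for the example; molecular mechanics programs work on many atomic coordinates and typically adjust the step length with a line search.

```python
import math


def energy_gradient(x, y):
    """Gradient of the hypothetical energy E(x, y) = (x - 1)^2 + 2 (y + 0.5)^2."""
    return 2.0 * (x - 1.0), 4.0 * (y + 0.5)


def steepest_descent(x, y, step=0.1, tol=1e-6, max_steps=1000):
    """Move against the gradient until its norm drops below `tol`."""
    for _ in range(max_steps):
        gx, gy = energy_gradient(x, y)
        if math.hypot(gx, gy) < tol:
            break
        x -= step * gx
        y -= step * gy
    return x, y


# Start away from the minimum at (1.0, -0.5)
x_min, y_min = steepest_descent(3.0, 2.0)
print(round(x_min, 4), round(y_min, 4))
```

The method needs only first derivatives, which makes each step cheap, but convergence slows down close to the minimum.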

72 Optimization Methods – Conjugate Gradients Method

73 Optimization Methods – Newton-Raphson Methods (g – gradient, h – Hessian)
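In one dimension the Newton-Raphson update is x_new = x - g(x)/h(x), with g the first and h the second derivative of the energy; in several dimensions the division becomes multiplication by the inverse Hessian matrix. The sketch below applies the one-dimensional update to a made-up anharmonic energy; the function and the starting point are assumptions for illustration only.

```python
def gradient(x):
    """g: first derivative of the hypothetical energy E(x) = (x - 2)^2 + 0.1 (x - 2)^4."""
    d = x - 2.0
    return 2.0 * d + 0.4 * d ** 3


def hessian(x):
    """h: second derivative of the same energy (always positive here)."""
    d = x - 2.0
    return 2.0 + 1.2 * d ** 2


def newton_raphson(x, tol=1e-10, max_iter=50):
    """Iterate x <- x - g/h until the gradient is smaller than `tol`."""
    for _ in range(max_iter):
        g = gradient(x)
        if abs(g) < tol:
            break
        x -= g / hessian(x)
    return x


x_opt = newton_raphson(0.0)
print(round(x_opt, 6))
```

Using curvature information gives fast convergence near a minimum, at the cost of computing (and, in many dimensions, inverting) the Hessian.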

74 FLI Computing Facilities: IBM Linux Cluster, SGI Altix

75 Cluster vs. Grid Computing
Clusters are made up of dedicated components and all components in a cluster are exclusively owned and managed as part of the cluster. All resources are known, fixed and usually uniform in configuration. It is a static environment.
Grids differ from clusters because grids share resources from and among independent system owners. Grids are configured from computer systems that are individually managed and used both as independent systems and as part of the grid. Thus, individual components are not 'fixed' in the grid and the overall configuration of the grid changes over time. This results in a dynamic system that continually assesses and optimises its utilisation of resources.

76 EUROGRID - BioGRID www.eurogrid.org/wp1.html

77 Simulation of Protein Folding

78 Thousand trillion FLOPs

79 IBM Blue Gene Project | System-on-a-Chip Approach: ~ 65,000 processors. Teraflop – a trillion floating point operations per second.

