
Bioinformatics master course DNA/Protein structure-function analysis and prediction Lecture 5: Protein Fold Families Jaap Heringa Integrative Bioinformatics.


1 Bioinformatics master course DNA/Protein structure-function analysis and prediction Lecture 5: Protein Fold Families Jaap Heringa Integrative Bioinformatics Institute VU (IBIVU) Faculty of Sciences / Faculty Earth and Life Sciences Vrije Universiteit Amsterdam

2 Protein structure evolution Insertion/deletion of secondary structural elements can ‘easily’ be done at loop sites

3 Protein structure evolution Insertion/deletion of structural domains can ‘easily’ be done at loop sites [figure labels: N and C termini]

4 Folds: how many? Chothia (1992): approx. 1,000 folds. Estimates vary from 1,000 to 10,000. With 30,000 human genes, ≥3 genes per fold on average. Fold classification: four broad structural protein fold classes: all-α, all-β, α/β, α+β. References: Chothia, C., Proteins. One thousand families for the molecular biologist. Nature, 1992. 357(6379): p. 543-4. Zhang, C. and C. DeLisi, Estimating the number of protein folds. J Mol Biol, 1998. 284(5): p. 1301-5.

5 The first protein structure in 1960: Myoglobin - α fold

6 Tropomyosin Coiled-coil domains This long protein is involved in muscle contraction

7 Flavodoxin fold: 5(βα) fold, an α/β fold Flavodoxin family - TOPS diagrams (Flores et al., 1994) [strands numbered 1-5 in the diagrams]

8 Greek key β-strand motif

9 Plait motif; Alpha-beta barrel

10 αβα 3-layer motifs (2 layers of helices with a β-sheet in between) are often specified as x-y-z (e.g. 4-14-5), where x is the number of helices in the first helical layer, y is the number of strands in the β-sheet, and z is the number of helices in the second helical layer

11 For  proteins, there are no good classification systems. You can only count…

12 How many folds – Chothia 1992 The first explicit estimate of the number of protein families was made by Chothia in 1992, when about 120 structural families were known. Summarizing the results of several genome projects, Chothia found that the chance that a random protein belongs to one of the known sequence families is approximately 1/3. Sequence comparison of the PDB with sequence databases (Sander, Schneider 1991) showed that about 1/4 of all sequences were similar to a PDB entry at the 25% identity level. Assuming that proteins are distributed equally among the families, Chothia concluded that the total number of protein structural families should be about 120 × 3 × 4 = 1440.
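The arithmetic behind this estimate can be written out as a small sketch. The variable and function names are illustrative assumptions; only the three numbers (120, 1/3, 1/4) come from the slide.

```python
from fractions import Fraction

# Numbers quoted on the slide (Chothia, 1992).
known_structural_families = 120
# Chance that a random protein belongs to a known sequence family.
frac_in_known_sequence_family = Fraction(1, 3)
# Share of sequences similar to a PDB entry at the 25% identity level.
frac_with_known_structure = Fraction(1, 4)


def estimate_total_families(known, frac_sequence, frac_structure):
    """Scale the known family count up by both coverage fractions.

    Assumes, as the slide does, that proteins are spread equally
    over the families.
    """
    return known / (frac_sequence * frac_structure)


total_families = estimate_total_families(
    known_structural_families,
    frac_in_known_sequence_family,
    frac_with_known_structure,
)
print(int(total_families))  # 120 * 3 * 4
```

Exact fractions are used so that the result is an integer rather than a rounded float.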

13 How many folds – Alexandrov & Go, 1994, updated The Pfam-2.1 database consists of 101,724 protein domains from SwissProt release 34 (Bairoch & Apweiler, 1996), clustered into 13,816 families. SwissProt-34 also contains 7,694 proteins of 30 or more amino acids that are not present in Pfam and are not similar to other proteins. We have added them to the database, each as its own family, so that it now contains 109,418 domains in 21,510 families. We have then eliminated very similar sequences to make the database more homogeneous. The final classification contains 60,601 domains distributed over 21,510 families. All families were ranked by the number of domains in each family. The resulting distribution fits Zipf’s law well.

14 How many folds Distribution of protein sequences among protein families. The distribution is highly non-uniform. Its shape is described very well by Zipf’s law: n(r) = a r^{-b}, with a = 640 and b = 0.64. The correlation coefficient of this approximation is 0.992. Here r is the rank of a family, n(r) is the number of proteins in the r-th family, a is a scaling constant that depends on the number of proteins in the dataset, and b ≈ 0.64 is a constant that does not depend on the size of the dataset.
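A power law of this form becomes a straight line on a log-log plot, so a and b can be estimated by ordinary least squares on log n(r) versus log r. The sketch below illustrates that idea on synthetic data generated from the slide's constants; it is not the fitting procedure used by Alexandrov & Go, which the slide does not describe, and the function name `fit_zipf` is an assumption.

```python
import math


def fit_zipf(ranks, counts):
    """Fit n(r) = a * r**(-b) by least squares in log-log space.

    Returns the pair (a, b).
    """
    xs = [math.log(r) for r in ranks]
    ys = [math.log(n) for n in counts]
    m = len(xs)
    mean_x = sum(xs) / m
    mean_y = sum(ys) / m
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx                  # equals -b
    intercept = mean_y - slope * mean_x  # equals log(a)
    return math.exp(intercept), -slope


# Synthetic family sizes generated from the constants on the slide.
ranks = list(range(1, 1001))
counts = [640 * r ** -0.64 for r in ranks]

a_fit, b_fit = fit_zipf(ranks, counts)
print(f"a = {a_fit:.2f}, b = {b_fit:.3f}")
```

On real family-size data the points scatter around the line, and the correlation coefficient of the log-log fit measures how well the power law holds.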

15 Fold number according to Alexandrov & Go 60,000 protein sequence families in 14,000 different folds

16 Fold number according to Alexandrov & Go An important feature of the Zipf distribution is its very long tail of clusters with only a few members each. For example, if b = 0.7, half of all proteins are located in 10% of all clusters.
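The 10% figure can be checked numerically. The sketch below assumes family sizes proportional to r^(-b) and uses the 21,510 families mentioned earlier as an illustrative cluster count; the function name is hypothetical. In the continuum approximation the cumulative share of the top R of N clusters is (R/N)^(1-b), so half of the proteins sit in a fraction 0.5^(1/(1-b)) of the clusters.

```python
def fraction_of_clusters_holding(share, n_clusters, b):
    """Smallest fraction of top-ranked clusters containing `share` of all proteins,
    for cluster sizes proportional to rank**(-b)."""
    sizes = [r ** -b for r in range(1, n_clusters + 1)]
    target = share * sum(sizes)
    running = 0.0
    for rank, size in enumerate(sizes, start=1):
        running += size
        if running >= target:
            return rank / n_clusters
    return 1.0


# Discrete sum over an assumed 21,510 clusters.
discrete = fraction_of_clusters_holding(0.5, 21510, 0.7)

# Continuum approximation: (R/N)**(1 - b) = 0.5.
continuum = 0.5 ** (1 / (1 - 0.7))

print(f"discrete: {discrete:.3f}, continuum: {continuum:.3f}")
```

Both values come out at roughly a tenth, in line with the statement on the slide.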

17 General fold classification systems The definitions of four broad structural classes, all-α, all-β, α/β, and α+β, based on secondary structure compositions and β-sheet topologies [Levitt & Chothia, 1976] represented the first step towards a global characterization of the protein fold space. These definitions have been generally accepted and are used by many classification systems to organize the fold hierarchy [Murzin et al., 1995; Orengo et al., 1997]. However, methods are needed that represent the full range of structural relationships among folds, for a better understanding of the organizing principles and features of the protein fold space. Fold family trees such as those built by Efimov [1997], Zhang and Kim [2000] and Taylor [2002] are very informative, but their construction involves extensive manual operations and, sometimes, considerable human judgment. An alternative approach is to apply a uniform measure of structural similarity across all fold types and map the structural relationships into a low-dimensional space. Two such maps have been introduced: one is represented in the CATH database by Orengo and colleagues [1997] and the other in the DALI database by Holm and Sander [1993]. Although the two maps are based on different structural alignment algorithms and multivariate analysis methods, they give similar two-dimensional projections featuring three large clusters corresponding to α, β, and α/β folds, respectively.

18 General fold classification system references
Levitt, M. and C. Chothia, Structural patterns in globular proteins. Nature, 1976. 261(5561): p. 552-8.
Murzin, A.G., et al., SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol, 1995. 247(4): p. 536-40.
Orengo, C.A., et al., CATH--a hierarchic classification of protein domain structures. Structure, 1997. 5(8): p. 1093-108.
Taylor, W.R., A 'periodic table' for protein structures. Nature, 2002. 416(6881): p. 657-60.
Orengo, C.A., et al., Identification and classification of protein fold families. Protein Eng, 1993. 6(5): p. 485-500.
Efimov, A.V., Structural trees for protein superfamilies. Proteins, 1997. 28(2): p. 241-60.
Zhang, C. and S.H. Kim, A comprehensive analysis of the Greek key motifs in protein beta-barrels and beta-sandwiches. Proteins, 2000. 40(3): p. 409-19.
Holm, L. and C. Sander, Protein structure comparison by alignment of distance matrices. J Mol Biol, 1993. 233(1): p. 123-38.

19 Fold distribution A metric matrix distance geometry method was applied to all pair-wise “distances” (structural dissimilarities) to assign three-dimensional coordinates to a set of 498 SCOP folds such that the relative distance between two folds is inversely correlated with the DALI alignment score. The results of the mapping are shown in the figure on the left.
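The embedding step can be illustrated with classical multidimensional scaling, which follows the same metric-matrix idea: square the distances, double-centre them into a Gram (metric) matrix, and read coordinates off its leading eigenvectors. This is a minimal sketch under stated assumptions, not the authors' implementation. How DALI scores were converted into dissimilarities is not given on the slide, so the example uses plain Euclidean distances between five made-up points, and eigenvectors are found by simple power iteration to keep the code dependency-free.

```python
import math


def classical_mds(dist, dim=3, iters=500):
    """Embed points in `dim` dimensions from a symmetric distance matrix."""
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in sq]
    grand_mean = sum(row_mean) / n
    # Metric (Gram) matrix by double centring of the squared distances.
    gram = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
             for j in range(n)] for i in range(n)]
    coords = [[0.0] * dim for _ in range(n)]
    eigenvalues = []
    for k in range(dim):
        # Power iteration for the current leading eigenvector.
        v = [float(i + 1) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _ in range(iters):
            w = [sum(gram[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        gv = [sum(gram[i][j] * v[j] for j in range(n)) for i in range(n)]
        lam = sum(v[i] * gv[i] for i in range(n))
        eigenvalues.append(lam)
        scale = math.sqrt(max(lam, 0.0))
        for i in range(n):
            coords[i][k] = scale * v[i]
        # Deflate so the next pass finds the next eigenvector.
        for i in range(n):
            for j in range(n):
                gram[i][j] -= lam * v[i] * v[j]
    return coords, eigenvalues


# Hypothetical toy input: five points in 3D, reduced to pairwise distances.
points = [(0, 0, 0), (4, 0, 0), (0, 2, 0), (0, 0, 1), (4, 2, 1)]
distances = [[math.dist(p, q) for q in points] for p in points]

embedded, spectrum = classical_mds(distances)
recovered = [[math.dist(p, q) for q in embedded] for p in embedded]
```

The recovered coordinates match the originals only up to rotation and reflection, but the pairwise distances are preserved. The eigenvalue spectrum shows how many dimensions are needed to represent the distances.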

20 The first 20 eigenvalues of the metric matrix calculated from the 498x498 DALI structural alignment scores.

21

22 Comparison of fold usage between two species in the eubacterial domain (Chlamydia versus Aquifex, A) and between species of two different domains (Chlamydia of bacteria versus Halobacterium of archaea, B). The usage of each of the 498 folds by the second organism is subtracted from its usage by the first organism. A contour surface (mesh) is then constructed and set at the values of 0.4% for blue and –0.4% for red. Regions within the blue contour include folds that appear more frequently in the first organism, whereas regions within the red contour include folds that occur more frequently in the second organism.
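The subtraction described above can be sketched as follows. The fold names and counts are hypothetical placeholders, and expressing usage as a percentage of all fold assignments in a genome is an assumption. The slide gives the ±0.4% contour levels but not the normalization.

```python
def usage_percent(fold_counts):
    """Convert raw fold counts into percentages of all assignments in one genome."""
    total = sum(fold_counts.values())
    return {fold: 100.0 * count / total for fold, count in fold_counts.items()}


def usage_difference(first, second):
    """Usage in the first organism minus usage in the second, per fold."""
    p1, p2 = usage_percent(first), usage_percent(second)
    folds = sorted(set(p1) | set(p2))
    return {fold: p1.get(fold, 0.0) - p2.get(fold, 0.0) for fold in folds}


# Hypothetical counts for two organisms (not real genome data).
organism_1 = {"foldA": 50, "foldB": 30, "foldC": 20}
organism_2 = {"foldA": 40, "foldB": 30, "foldC": 25, "foldD": 5}

diff = usage_difference(organism_1, organism_2)

threshold = 0.4  # contour level in percent, as on the slide
more_in_first = [f for f, d in diff.items() if d > threshold]    # "blue" side
more_in_second = [f for f, d in diff.items() if d < -threshold]  # "red" side
```

Because both usage vectors sum to 100%, the differences sum to zero: any fold enriched in one organism is balanced by folds enriched in the other.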

23 CATH database

24 Domain size The size of individual structural domains varies widely, from 36 residues in E-selectin to 692 residues in lipoxygenase-1 (Jones et al., 1998). The majority (90%) have fewer than 200 residues (Siddiqui and Barton, 1995), with an average of about 100 residues (Islam et al., 1995). Small domains (fewer than 40 residues) are often stabilised by metal ions or disulphide bonds. Large domains (more than 300 residues) are likely to consist of multiple hydrophobic cores (Garel, 1992).

