
The CATH domain database and associated resources - DHS, Gene3D


1 The CATH domain database and associated resources - DHS, Gene3D. How do we determine domain boundaries? How do we identify fold groups and evolutionary superfamilies? What is the distribution of the CATH domain families in the PDB and in the genomes? CATH: Class, Architecture, Topology or Fold Group, Homologous Superfamily (Orengo & Thornton 1994)

2 Multidomain proteins: ~20,000 chains from the Protein Data Bank (PDB); ~50,000 domains in the CATH structure database; ~40% of the entries in CATH are multidomain

3 Domains are important evolutionary units. Analysis by Teichmann and others suggests that ~60-80% of genes in genomes may be multidomain

4 Examples: Carboxypeptidase G2 (1cg2A), Carboxypeptidase A (2ctc). ~30% of multidomains in CATH are discontinuous

5 Algorithms for Recognising Domain Boundaries. DETECTIVE (Swindells, 1995): each domain should have a recognisable hydrophobic core. DOMAK (Siddiqui & Barton, 1995): residues comprising a domain make more internal contacts than external ones. PUU (Holm & Sander, 1994): parser for protein folding units, with maximal interaction within domains and minimal interaction between domains. Consensus is sought between the three methods; on average this occurs about 20% of the time
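
The DOMAK criterion on this slide (more internal than external contacts) can be illustrated with a minimal Python sketch. The contact list, the residue-to-domain mapping, and the function name `contact_balance` are hypothetical; this is only the counting idea, not the published DOMAK implementation.

```python
from collections import defaultdict

def contact_balance(contacts, domain_of):
    """Count internal and external contacts per domain.

    contacts:  iterable of (residue_i, residue_j) pairs that are in contact.
    domain_of: dict mapping residue index -> domain label (hypothetical input).
    Returns {domain: (internal_count, external_count)}.
    """
    internal = defaultdict(int)
    external = defaultdict(int)
    for i, j in contacts:
        di, dj = domain_of[i], domain_of[j]
        if di == dj:
            internal[di] += 1
        else:
            # An inter-domain contact counts as external for both domains.
            external[di] += 1
            external[dj] += 1
    return {d: (internal[d], external[d]) for d in set(domain_of.values())}

# Toy example: residues 1-3 in domain "A", residues 4-6 in domain "B".
domain_of = {1: "A", 2: "A", 3: "A", 4: "B", 5: "B", 6: "B"}
contacts = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (3, 4)]
balance = contact_balance(contacts, domain_of)

# A proposed split looks plausible if every domain has more internal contacts.
plausible = all(inner > outer for inner, outer in balance.values())
```

A candidate boundary that leaves some domain with more external than internal contacts would fail this check and could be rejected or re-examined.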

6 Figure: chart with segments labelled 74%, 29%, 21%, 4% and 11% across close homologues, twilight zone, midnight zone, and homologues/analogues

7 Algorithms for Recognising Homologues. Sequence-based methods: close homologues - BLAST (Altschul et al.), SSEARCH (Smith & Waterman); remote homologues - SAM-T99 (Karplus et al.). Structure-based methods: close & remote homologues - CATHEDRAL (Harrison, Thornton, Orengo), SSAP (Taylor & Orengo), CORA (Orengo)

8 Figure: chart with segments labelled 74%, 29%, 21%, 4% and 11% across close homologues, twilight zone, midnight zone, and homologues/analogues, annotated with the methods SSEARCH; HMMs, SSAP; and CATHEDRAL, SSAP

9 Hidden Markov Models (HMMs). Diagram: query sequence, non-redundant GenBank database, hits. These methods can currently identify ~70% of remote homologues (3 times more powerful than BLAST). SAM-T99 (Karplus Group), SAMOSA (Orengo Group)

10 Percentage of PDB structures classified in CATH by different methods over the last 2 years. Chart labels: near-identical (SSEARCH); close homologues, >30% (SSEARCH); remote homologues, <30% (HMMs); remote homologues (8.6) and analogues (1.9) (SSAP); novel folds

11 Percentage of structural genomics PDB structures classified in CATH by different methods over the last 2 years. Chart labels: near-identical (SSEARCH); close homologues, >30% (SSEARCH); remote homologues, <30% (HMMs); remote homologues (SSAP); analogues (SSAP); novel folds

12 Structure-Based Algorithms for Recognising Homologues. CATHEDRAL: pairwise alignment, secondary structure comparison. SSAP: pairwise alignment, residue comparison. CORA: multiple alignment, residue comparison

13 Figure: chart with segments labelled 74%, 29%, 21%, 4% and 11% across close homologues, twilight zone, midnight zone, and homologues/analogues, annotated with the methods SSEARCH; HMMs; and CATHEDRAL, SSAP

14 Structure is much more highly conserved than sequence. Figure: cholera toxin, pertussis toxin, and heat labile enterotoxin, annotated with structure similarity (SSAP) score 97 at 79% sequence identity and SSAP score 81 at 12% sequence identity

15 Pairwise Sequence Identities and Structure Similarity (SSAP) Scores in CATH Domain Families. Plot: structure similarity (SSAP) score against sequence identity (%), with pairs marked as same function or different function

16 Residue insertions in the loops connecting secondary structures; shifts in the orientations of secondary structures

17 Structural variation in the P-loop Hydrolase Superfamily

18 Structural variation in the Galectin Binding Superfamily

19 Fast Structure Comparison Method (CATHEDRAL): ignore the variable loop regions and only compare the secondary structures; derive vectors through secondary structure elements; compare closest approach distances and vector orientations using graph theory. Andrew Harrison et al., JMB, 2002

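20 Geometry between secondary structure vectors $a$ and $b$: closest approach distance $d_{ab}$; angle $\theta$ from $a \cdot b = |a||b|\cos\theta$; plus dihedral angle $\phi$ and chirality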
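
Two of the quantities on this slide, the angle from the dot product and the closest approach distance, can be sketched in Python as follows. This is a minimal illustration under stated assumptions: each secondary structure is treated as an infinite line through a point with a direction vector (real elements are finite segments), the dihedral angle and chirality are omitted, and all function names are hypothetical.

```python
import math

def sub(u, v):
    return tuple(a - b for a, b in zip(u, v))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])

def norm(u):
    return math.sqrt(dot(u, u))

def angle_between(a, b):
    """Angle in degrees, from a . b = |a||b| cos(theta)."""
    cos_theta = dot(a, b) / (norm(a) * norm(b))
    cos_theta = max(-1.0, min(1.0, cos_theta))  # guard against rounding
    return math.degrees(math.acos(cos_theta))

def line_distance(p1, a, p2, b):
    """Closest approach between the lines p1 + s*a and p2 + t*b."""
    n = cross(a, b)
    w = sub(p2, p1)
    if norm(n) < 1e-12:
        # Parallel lines: distance from p2 to the first line.
        return norm(cross(w, a)) / norm(a)
    return abs(dot(w, n)) / norm(n)

# Toy example: one vector along x at the origin, one along y lifted 5 units in z.
theta = angle_between((1, 0, 0), (0, 1, 0))
d_ab = line_distance((0, 0, 0), (1, 0, 0), (0, 0, 5), (0, 1, 0))
```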

21 CATHEDRAL (CATHs Existing Domain Recognition ALgorithm) compares graphs of proteins. Figure: nodes are secondary structures (labelled H); edges are labelled with $d$, $\theta$, $\phi$, and chirality

22 Comparing proteins with similar folds identifies an overlap graph with the largest common structural motif. Figure: one graph with nodes A, B, C and edges I, II, III; a second graph with nodes a, b, c, d and edges I to V; the overlap graph pairs A,a B,c C,d with edges I, II, III and has a structural motif of 3 secondary structures

23 Graphs are compared using the Bron-Kerbosch algorithm to find the largest common graph. In this example the common graph contains 5 nodes. times faster than residue based methods (e.g. SSAP)
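
For reference, a compact Python sketch of the Bron-Kerbosch algorithm (with pivoting) for finding a maximum clique is given below. In common-subgraph problems, the usual approach is to run it on a compatibility graph whose nodes are candidate pairings of elements from the two proteins; how CATHEDRAL builds that graph is not stated on the slide, so the input here is just a generic toy adjacency dictionary.

```python
def max_clique(adj):
    """Return one maximum clique of an undirected graph.

    adj: dict mapping node -> set of neighbouring nodes.
    """
    best = []

    def expand(r, p, x):
        nonlocal best
        if not p and not x:
            # r is a maximal clique; keep it if it is the largest so far.
            if len(r) > len(best):
                best = list(r)
            return
        # Pivot on the vertex with most neighbours in p to prune branches.
        pivot = max(p | x, key=lambda v: len(adj[v] & p))
        for v in list(p - adj[pivot]):
            expand(r + [v], p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand([], set(adj), set())
    return sorted(best)

# Toy graph: nodes 1-4 are mutually connected; 5 and 6 hang off node 4.
adj = {
    1: {2, 3, 4},
    2: {1, 3, 4},
    3: {1, 2, 4},
    4: {1, 2, 3, 5},
    5: {4, 6},
    6: {5},
}
clique = max_clique(adj)
```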

24 Performance

25 $\text{Score} \sim \dfrac{\text{common graph size}}{(\text{size protein1} \cdot \text{size protein2})^{1/2}}$. Statistical significance can be assessed by scanning a protein 'graph' against 'graphs' of all known structures

26 $\text{Score} \sim \dfrac{\text{common graph size}}{(\text{size protein1} \cdot \text{size protein2})^{1/2}}$. Statistical significance can be assessed by scanning a protein 'graph' against 'graphs' of all known structures
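
Reading the score on this slide as the common graph size divided by the square root of the product of the two graph sizes (an assumption about the flattened fraction), a short Python sketch is given below. The function name and the example sizes are hypothetical.

```python
import math

def normalised_score(common_size, size1, size2):
    """Common graph size normalised by the geometric mean of both graph sizes."""
    return common_size / math.sqrt(size1 * size2)

# Toy example: 5 shared secondary structures between graphs of 8 and 10 nodes.
partial = normalised_score(5, 8, 10)

# Two identical graphs of 6 nodes share all 6, giving the maximum score.
identical = normalised_score(6, 6, 6)
```

Normalising by the geometric mean keeps a large protein from scoring well against everything simply because it has many secondary structures to match.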

27 $F = A e^{-b \cdot \text{score}}$, so $\log F = \log A - b \cdot \text{score}$. Scores for unrelated structures exhibit an extreme value distribution, which allows you to calculate the probability (P-value, E-value) of obtaining any score by chance
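
Because the relation on this slide is linear in log F, the constants A and b can be estimated with an ordinary least-squares line fit. The sketch below assumes F is the observed frequency of unrelated structures at a given score (the slide does not define it) and uses synthetic data; the function names are hypothetical.

```python
import math

def fit_log_linear(scores, freqs):
    """Least-squares fit of log F = log A - b * score. Returns (A, b)."""
    n = len(scores)
    ys = [math.log(f) for f in freqs]
    mean_x = sum(scores) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in scores)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(scores, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope

def chance_frequency(score, A, b):
    """Expected frequency of a score arising by chance under the fitted model."""
    return A * math.exp(-b * score)

# Synthetic data generated with A = 2.0 and b = 0.5 (illustrative values only).
scores = [1.0, 2.0, 3.0, 4.0, 5.0]
freqs = [2.0 * math.exp(-0.5 * s) for s in scores]
A_fit, b_fit = fit_log_linear(scores, freqs)
f_at_10 = chance_frequency(10.0, A_fit, b_fit)
```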

28 Using CATHEDRAL to Identify Domain Boundaries. Graph-based secondary structure comparison is very fast: times faster than residue-based methods. New multi-domain structures can be rapidly scanned against the library of CATH domains. E-values can be used to identify significant matches. % of domains in new multi-domain structures have relatives in CATH

29 CATHEDRAL. Figure: a multi-domain structure is matched to Fold A and Fold B by secondary structure graph matching, followed by SSAP residue alignment of residues in the new multi-domain structure against residues in CATH domain family 1 and residues in CATH domain family 2

30 SSAP (Taylor & Orengo, J. Mol. Biol): residue based structure comparison method using dynamic programming. Figure: score matrix for Protein A and Protein B, with residues in protein A on one axis and residues in protein B on the other. Scores range from
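
The dynamic programming step named on this slide can be illustrated with a generic Python sketch that finds the best-scoring path through a residue-by-residue score matrix. This is a plain global alignment with a linear gap penalty, not SSAP itself: how SSAP derives its residue-pair scores is not covered here, and the matrix values and gap penalty are hypothetical.

```python
def align_score(score, gap=-1.0):
    """Best global alignment score through a residue-pair score matrix.

    score[i][j]: similarity of residue i in protein A and residue j in protein B.
    gap:         penalty added for each unaligned residue (hypothetical value).
    """
    n = len(score)
    m = len(score[0]) if n else 0
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j - 1] + score[i - 1][j - 1],  # align i with j
                dp[i - 1][j] + gap,                      # residue i unaligned
                dp[i][j - 1] + gap,                      # residue j unaligned
            )
    return dp[n][m]

# Toy matrices: strong matches on the diagonal, and a case needing one gap.
diagonal = align_score([[2, 0, 0], [0, 2, 0], [0, 0, 2]])
with_gap = align_score([[2, 0, 0], [0, 0, 2]])
```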

31 CATHEDRAL One third of known multi-domain structures are discontinuous

32 Reasons for Structural Similarity. Divergence: similarity arises due to divergent evolution from a common ancestor; structure is much more highly conserved than sequence. Convergence: similarity is due to there being a limited number of ways of packing helices and strands in 3D space

33

34

35 CATH domain structure database (Orengo & Thornton 1994): Class, Architecture, Topology or Fold Group, Homologous Superfamily. ~1500 domain superfamilies in CATH; ~50,000 domains in PDB

36 CATH domain database, ~50,000 domains: Class (3), Architecture (~36), Topology or Fold (~810)

37 CATH: Topology or Fold Group (~810); Homologous Superfamily (Domain Family) (~1500); Sequence Family (35%, 60%, 95%). 40,000 domain entries; ~50,000 domain entries

38 DHS: Dictionary of Homologous Superfamilies. Description of structural and functional characteristics for each superfamily

39 DHS: Dictionary of Homologous Superfamilies. Description of structural and functional characteristics for each superfamily

40 DHS: Dictionary of Homologous Superfamilies. Variation in secondary structures across a superfamily

41 DHS: Dictionary of Homologous Superfamilies. Functional annotations from GO, EC, COGs, KEGG

42 DHS: Dictionary of Homologous Superfamilies. Multiple structure alignments with conserved residues highlighted

43 Population of CATH Families and Structural Groups: ~50,000 structural domains; ~4000 sequence families at 35% (S), which cluster proteins with similar sequences; ~1,500 homologous superfamilies (H), which cluster proteins with similar structures and functions; ~810 fold groups (T), which cluster proteins with similar structures; ~36 architectures (A); 3 major protein classes (C)

44 CATH: nearly one third of the superfamilies belong to <10 fold groups. Figure labels: Rossmann Fold, Jelly Roll, Alpha/Beta Plaits, Arc repressor-like, OB Fold. Chart legend: Rossmann, Alpha-beta plait, TIM barrel, Jelly Roll, Immunoglobulin, OB fold, SH3-like, Up-down, Arc repressor-like

45 CATH numbering scheme, example 2.40.50.100: Class 2 (Mainly beta); Architecture 40 (Barrel); Topology 50 (OB Fold); Homology 100 (Heat labile enterotoxin superfamily)
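
The four-part numbering on this slide lends itself to a small parsing helper. The Python sketch below splits a code such as 2.40.50.100 into its Class, Architecture, Topology and Homology numbers; the function names and dictionary keys are hypothetical, and the second code in the example is an arbitrary illustration rather than a superfamily named in the talk.

```python
LEVELS = ("class", "architecture", "topology", "homologous_superfamily")

def parse_cath_code(code):
    """Split a four-part CATH code into a dict of integer level numbers."""
    parts = code.split(".")
    if len(parts) != len(LEVELS) or not all(p.isdigit() for p in parts):
        raise ValueError(f"expected four dot-separated integers, got {code!r}")
    return dict(zip(LEVELS, (int(p) for p in parts)))

def same_fold(code_a, code_b):
    """True if two domains share Class, Architecture and Topology."""
    a, b = parse_cath_code(code_a), parse_cath_code(code_b)
    return all(a[level] == b[level] for level in LEVELS[:3])

# The example from the slide: heat labile enterotoxin superfamily.
enterotoxin = parse_cath_code("2.40.50.100")

# Hypothetical second code that differs only at the Homology level.
shares_fold = same_fold("2.40.50.100", "2.40.50.140")
```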

46 CATH domain structure database

47 CATH class level

48 CATH architecture level

49 CATH Topology or fold group level

50 CATH homologous superfamilies in each fold group

51 CATH homologous superfamily level

52 CATH sequence families (>=35% identity) in each superfamily

53 CATH classification information for individual domains

54 CATH structural relatives listed for each domain

55 CATH server

56 CATH server

57 CATH server structural matches and statistics listed for query domain

58 Expanding CATH with sequence relatives from genomes. Library of HMMs built for representative sequences from each CATH domain superfamily. Protein sequences from genomes are scanned against the CATH HMM library to assign domains to CATH superfamilies
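
The final step on this slide, turning HMM hits into domain assignments, can be sketched in Python as a greedy selection of non-overlapping hits ordered by E-value. This is one simple strategy under stated assumptions, not the procedure used by the authors: the hit tuples, the E-value threshold, the superfamily codes and the function name are all hypothetical.

```python
def assign_domains(hits, max_evalue=1e-3):
    """Pick non-overlapping hits, best E-value first.

    hits: list of (superfamily, start, end, evalue) for one protein sequence.
    Returns the accepted hits ordered by start position.
    """
    accepted = []
    for hit in sorted(hits, key=lambda h: h[3]):
        _, start, end, evalue = hit
        if evalue > max_evalue:
            break  # remaining hits are even less significant
        # Keep the hit only if it does not overlap any accepted region.
        if all(end < s or start > e for _, s, e, _ in accepted):
            accepted.append(hit)
    return sorted(accepted, key=lambda h: h[1])

# Hypothetical hits for a single genome sequence.
hits = [
    ("3.40.50.300", 1, 180, 1e-40),
    ("2.40.50.100", 150, 260, 1e-5),   # overlaps the first hit
    ("2.40.50.140", 190, 300, 1e-12),
    ("1.10.10.10", 310, 380, 0.5),     # not significant
]
domains = assign_domains(hits)
```

Resolving overlaps greedily is easy to follow but can miss a better combination of weaker hits; a production pipeline would likely need a more careful resolution step.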

59 Expanding CATH: ~1400 domain structure superfamilies. Using CATH-HMMs, sequences are added from GenBank, genomes, and SWPT-TrEMBL, growing a homologous superfamily (H) from sequence families S1-S3 to S1-S5: from ~50,000 sequences in ~4,000 sequence families to ~600,000 sequences in ~24,000 sequence families. Up to 70% of sequences in completed genomes can be assigned to CATH domain superfamilies

60 Gene3D. Figure labels: Rossmann Fold, Jelly Roll, Alpha/Beta Plaits, TIM Barrel, Immunoglobulin-like, Arc repressor-like, OB Fold, Four helix bundle, SH3-type barrel, Alpha horseshoe fold. Chart legend: Rossmann, Alpha-beta plait, TIM barrel, Jelly Roll, Arc repressor-like, Up-down, SH3-like, OB fold, Immunoglobulin, Alpha horseshoe

61 Gene3D CATH domain structure annotations for complete genomes

62 Gene3D Individual genome statistics

63 Gene3D Assignment of sequences to Gene3D protein families

64 Gene3D Functional annotations for individual sequences

65 Gene3D Functional annotations for individual sequences

66 Gene3D Domain annotations for individual sequences

67 Gene3D Domain annotations for individual sequences

68 Summary. CATH currently identifies ~1500 superfamilies in the ~50,000 structural domains from the PDB. These domain families contain over 600,000 domain sequences from the genomes and sequence databases. Up to 70% of genome sequences can be assigned to domain structure families using HMMs and threading

69 Acknowledgements: Frances Pearl, Ian Sillitoe, Oliver Redfern, Mark Dibley, Tony Lewis, Chris Bennett, Andrew Harrison, Gabrielle Reeves, Alastair Grant, David Lee, Janet Thornton; Medical Research Council, Wellcome Trust, NIH, Biotechnology and Biological Sciences Research Council

