CATH: Class, Architecture, Topology or Fold Group, Homologous Superfamily

1 CATH: Class, Architecture, Topology or Fold Group, Homologous Superfamily
CATH domain database (Orengo & Thornton 1994).
The CATH domain database and associated resources: DHS, Gene3D.
How do we determine domain boundaries?
How do we identify fold groups and evolutionary superfamilies?
What is the distribution of the CATH domain families in the PDB and in the genomes?

2 Multidomain proteins
~20,000 chains from the Protein Data Bank (PDB)
~50,000 domains in the CATH structure database
~40% of the entries in CATH are multidomain

3 Domains are important evolutionary units
analysis by Teichmann and others suggests that ~60-80% of genes in genomes may be multidomain

4 ~30% of multidomains in CATH are discontinuous
Figure: Carboxypeptidase A (2ctc) and Carboxypeptidase G2 (1cg2A).

5 Algorithms for Recognising Domain Boundaries
DETECTIVE (Swindells, 1995): each domain should have a recognisable hydrophobic core.
DOMAK (Siddiqui & Barton, 1995): residues comprising a domain make more internal contacts than external ones.
PUU (Holm & Sander, 1994): parser for protein folding units; maximal interaction within domains and minimal interaction between domains.
Consensus is sought between the three methods; on average this occurs about 20% of the time.
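The contact criterion described for DOMAK (more internal than external contacts) can be illustrated with a toy sketch. This is not the published DOMAK scoring: the 8 Å cutoff, the `(internal) / (external + 1)` ratio, the minimum domain size and the function names are all assumptions made for illustration, and only a single continuous cut point is considered.

```python
import math

def count_contacts(coords, cut, cutoff=8.0):
    """Count contacts inside segment A (indices < cut), inside segment B
    (indices >= cut) and between the two segments."""
    internal_a = internal_b = external = 0
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            if math.dist(coords[i], coords[j]) <= cutoff:
                if j < cut:
                    internal_a += 1      # both residues in segment A
                elif i >= cut:
                    internal_b += 1      # both residues in segment B
                else:
                    external += 1        # contact crosses the cut
    return internal_a, internal_b, external

def best_split(coords, min_size=3, cutoff=8.0):
    """Return (cut, ratio) for the cut with the highest internal/external ratio."""
    best = None
    for cut in range(min_size, len(coords) - min_size + 1):
        ia, ib, ext = count_contacts(coords, cut, cutoff)
        ratio = (ia + ib) / (ext + 1)    # +1 avoids division by zero (assumption)
        if best is None or ratio > best[1]:
            best = (cut, ratio)
    return best

# Toy C-alpha coordinates: two compact clusters far apart (hypothetical data).
toy_coords = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (1, 1, 0),
              (100, 0, 0), (101, 0, 0), (100, 1, 0), (101, 1, 0)]
print(best_split(toy_coords))
```

A single cut point cannot describe discontinuous domains, so a realistic boundary finder would need to consider combinations of segments.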

6 Homologues/analogues
Figure: chart dividing relationships into close homologues, twilight zone, midnight zone and homologues/analogues. Values shown on the chart: 74%, 29%, 21%, 4%, 11%.

7 Algorithms for Recognising Homologues
Sequence based methods
Close homologues: BLAST (Altschul et al.), SSEARCH (Smith & Waterman)
Remote homologues: SAM-T99 (Karplus et al.)
Structure based methods (close and remote homologues)
CATHEDRAL (Harrison, Thornton, Orengo), SSAP (Taylor & Orengo), CORA (Orengo)

8 Homologues/analogues
Figure: the same chart annotated with methods. Close homologues: SSEARCH. Twilight zone: HMMs, SSAP. Midnight zone: CATHEDRAL, SSAP. Homologues/analogues: CATHEDRAL, SSAP. Values shown on the chart: 74%, 29%, 21%, 4%, 11%.

9 Hidden Markov Models (HMMs)
SAM-T (Karplus Group), SAMOSA (Orengo Group)
Diagram: a query sequence is searched against the non-redundant GenBank database to produce hits.
These methods can currently identify ~70% of remote homologues (3 times more powerful than BLAST).

10 Percentage of PDB structures classified in CATH by different methods over the last 2 years
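Figure: chart of categories and the methods used to classify them. Near-identical (SSEARCH); close homologues, >30% (SSEARCH); remote homologues, <30% (HMMs); remote homologues (SSAP), 8.6; analogues (SSAP), 1.9; novel folds. Values shown on the chart: 2.0, 1.9, 8.6, 7.6, 20.7, 59.2.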

11 Percentage of structural genomics PDB structures classified in CATH by different methods over the last 2 years
Figure: chart of categories and the methods used to classify them. Near-identical (SSEARCH); close homologues, >30% (SSEARCH); remote homologues, <30% (HMMs); remote homologues (SSAP); analogues (SSAP); novel folds. Values shown on the chart: 22.0, 8.0, 28.4, 7.7, 11.8.

12 Structure Based Algorithms for Recognising Homologues
CATHEDRAL: pairwise alignment, secondary structure comparison
SSAP: pairwise alignment, residue comparison
CORA: multiple alignment, residue comparison

13 Homologues/analogues
Figure: the same chart annotated with methods. Close homologues: SSEARCH. Twilight zone: HMMs. Midnight zone: CATHEDRAL, SSAP. Homologues/analogues: CATHEDRAL, SSAP. Values shown on the chart: 74%, 29%, 21%, 4%, 11%.

14 structure is much more highly conserved than sequence
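Heat labile enterotoxin compared with cholera toxin: structure similarity (SSAP) score 97, sequence identity 79%.
Heat labile enterotoxin compared with pertussis toxin: structure similarity (SSAP) score 81, sequence identity 12%.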

15 Pairwise Sequence Identities and Structure Similarity (SSAP) Scores in CATH Domain Families
Figure: plot of structure similarity (SSAP) score against sequence identity (%), with pairs marked as same function or different function.

16 Residue insertions in the loops connecting secondary structures
Shifts in the orientations of secondary structures

17 Structural variation in the P-loop Hydrolase Superfamily

18 Structural variation in the Galectin Binding Superfamily

19 Fast Structure Comparison Method (CATHEDRAL)
Andrew Harrison et al., JMB, 2002
Ignore the variable loop regions and only compare the secondary structures.
Derive vectors through secondary structure elements.
Compare closest approach distances and vector orientations using graph theory.

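20 Geometry between two secondary structure vectors $\mathbf{a}$ and $\mathbf{b}$: closest approach distance $d$; angle $\theta$ from $\mathbf{a} \cdot \mathbf{b} = |\mathbf{a}||\mathbf{b}| \cos\theta$; plus dihedral angle $\phi$; plus chirality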
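A minimal sketch of two of the quantities on this slide, the angle between two secondary structure vectors (from the dot product) and their closest approach distance. It assumes each secondary structure element is represented as a straight segment between two 3D end points; the function names and the segment representation are illustrative assumptions, and the dihedral angle and chirality are not computed here.

```python
import math

def sub(u, v):
    return (u[0] - v[0], u[1] - v[1], u[2] - v[2])

def dot(u, v):
    return u[0] * v[0] + u[1] * v[1] + u[2] * v[2]

def norm(u):
    return math.sqrt(dot(u, u))

def clamp(x, lo=0.0, hi=1.0):
    return max(lo, min(hi, x))

def angle_deg(a, b):
    """Angle between vectors a and b, from a.b = |a||b| cos(theta)."""
    cos_theta = clamp(dot(a, b) / (norm(a) * norm(b)), -1.0, 1.0)
    return math.degrees(math.acos(cos_theta))

def closest_approach(p1, q1, p2, q2):
    """Minimum distance between segments p1-q1 and p2-q2.
    Assumes both segments have non-zero length."""
    d1, d2, r = sub(q1, p1), sub(q2, p2), sub(p1, p2)
    a, e = dot(d1, d1), dot(d2, d2)
    b, c, f = dot(d1, d2), dot(d1, r), dot(d2, r)
    denom = a * e - b * b                 # zero when the segments are parallel
    s = clamp((b * f - c * e) / denom) if denom > 1e-12 else 0.0
    t = (b * s + f) / e
    if t < 0.0:
        t, s = 0.0, clamp(-c / a)
    elif t > 1.0:
        t, s = 1.0, clamp((b - c) / a)
    c1 = tuple(p + s * d for p, d in zip(p1, d1))
    c2 = tuple(p + t * d for p, d in zip(p2, d2))
    return norm(sub(c1, c2))

def sse_pair_geometry(a_start, a_end, b_start, b_end):
    """Return (closest approach distance, angle in degrees) for two segments."""
    d = closest_approach(a_start, a_end, b_start, b_end)
    theta = angle_deg(sub(a_end, a_start), sub(b_end, b_start))
    return d, theta
```

For example, `sse_pair_geometry((0, 0, 0), (1, 0, 0), (0, 0, 1), (0, 1, 1))` describes two perpendicular segments one unit apart.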

21 Compares graphs of proteins
CATHEDRAL: CATH's Existing Domain Recognition ALgorithm. Compares graphs of proteins.
Figure: graph whose nodes are secondary structure elements (labelled H) and whose edges are each labelled with d, the two angles and chirality.

22 overlap graph has a structural motif of 3 secondary structures
Comparing proteins with similar folds identifies an overlap graph with the largest common structural motif.
Figure: two protein graphs (nodes A, B, C and a, b, c, d; edges labelled I to V) and their overlap graph, with matched node pairs A,a; B,c; C,d. The overlap graph has a structural motif of 3 secondary structures.

23 In this example the common graph contains 5 nodes.
Graphs are compared using the Bron-Kerbosch algorithm to find the largest common graph. The comparison is 1000 times faster than residue based methods (e.g. SSAP).
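The idea of finding the largest common graph with Bron-Kerbosch can be sketched as a maximum clique search on a correspondence graph. This is an illustration under simplifying assumptions, not the CATHEDRAL implementation: edges carry only a distance (no angles or chirality), the tolerance of 1.0 is arbitrary, the basic non-pivoting form of Bron-Kerbosch is used, and the data layout (`types`, `dist`) and all function names are hypothetical.

```python
def edge(graph, i, j):
    """Distance label on the edge between secondary structures i and j."""
    return graph["dist"][(min(i, j), max(i, j))]

def correspondence_graph(g1, g2, tol=1.0):
    """Nodes pair up same-type elements; two pairs are linked when the
    corresponding edges in g1 and g2 have similar distances."""
    nodes = [(i, j) for i, t1 in enumerate(g1["types"])
             for j, t2 in enumerate(g2["types"]) if t1 == t2]
    adj = {n: set() for n in nodes}
    for (i, j) in nodes:
        for (k, l) in nodes:
            if i != k and j != l and abs(edge(g1, i, k) - edge(g2, j, l)) <= tol:
                adj[(i, j)].add((k, l))
    return adj

def bron_kerbosch(r, p, x, adj, best):
    """Enumerate maximal cliques; keep the largest one seen in `best`."""
    if not p and not x:
        if len(r) > len(best):
            best.clear()
            best.update(r)
        return
    for v in list(p):
        bron_kerbosch(r | {v}, p & adj[v], x & adj[v], adj, best)
        p = p - {v}
        x = x | {v}

def largest_common_motif(g1, g2, tol=1.0):
    adj = correspondence_graph(g1, g2, tol)
    best = set()
    bron_kerbosch(set(), set(adj), set(), adj, best)
    return best

# Toy graphs: H = helix, E = strand; distances are made-up values.
protein1 = {"types": ["H", "E", "E", "H"],
            "dist": {(0, 1): 10, (0, 2): 12, (0, 3): 8,
                     (1, 2): 5, (1, 3): 15, (2, 3): 14}}
protein2 = {"types": ["E", "E", "H"],
            "dist": {(0, 1): 5, (0, 2): 10, (1, 2): 12}}
motif = largest_common_motif(protein1, protein2)
print(sorted(motif))
```

Each pair in the result maps an element index in `protein1` to its equivalent in `protein2`; the size of the set is the common graph size.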

24 Performance

25 Statistical significance can be assessed by scanning a protein 'graph' against 'graphs' of all known structures
$\text{Score} \sim \dfrac{\text{common graph size}}{(\text{size}_{\text{protein1}} \cdot \text{size}_{\text{protein2}})^{1/2}}$

27 Scores for unrelated structures exhibit an extreme value distribution
$F = A e^{-b \cdot \text{score}}$, so $\log F = \log A - b \cdot \text{score}$
This allows you to calculate the probability (P-value, E-value) of obtaining any score by chance.
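Because log F is linear in the score, the constants A and b can be estimated with a straight-line fit. The sketch below assumes natural logarithms, an ordinary least-squares fit and made-up data points; the slide does not say how the fit was actually performed, and the function names are hypothetical.

```python
import math

def fit_log_linear(scores, freqs):
    """Fit log F = log A - b * score by least squares; return (A, b)."""
    n = len(scores)
    ys = [math.log(f) for f in freqs]
    mean_x = sum(scores) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in scores)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(scores, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope

def chance_frequency(score, A, b):
    """F = A * exp(-b * score): expected frequency of this score by chance."""
    return A * math.exp(-b * score)

# Synthetic frequencies generated from A = 2.0, b = 0.5 (illustrative only).
example_scores = [1.0, 2.0, 3.0, 4.0, 5.0]
example_freqs = [2.0 * math.exp(-0.5 * s) for s in example_scores]
A_fit, b_fit = fit_log_linear(example_scores, example_freqs)
```

With fitted constants, `chance_frequency(score, A_fit, b_fit)` gives the quantity the slide uses to judge whether a match is significant.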

28 Using CATHEDRAL to Identify Domain Boundaries
Graph based secondary structure comparison is very fast: faster than residue based methods (1000 times faster, per slide 23).
New multi-domain structures can be rapidly scanned against the library of CATH domains.
E-values can be used to identify significant matches.
85-90% of domains in new multi-domain structures have relatives in CATH.

29 CATHEDRAL
Figure: a multi-domain structure is matched to CATH domains by secondary structure graph matching, followed by SSAP residue alignment. The alignment plots show residues in the new multi-domain structure against residues in CATH domain family 1 (Fold A) and against residues in CATH domain family 2 (Fold B).

30 residue based structure comparison method using dynamic programming
SSAP (Taylor & Orengo, J. Mol. Biol. 1989). Scores range from 0 to 100.
Figure: matrix of residues in protein A against residues in protein B.
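The basic dynamic-programming recurrence behind residue-level alignment can be sketched as follows. This is a plain global alignment over a precomputed residue similarity matrix, not the published SSAP procedure, which is more elaborate; the similarity values, the linear gap penalty and the function name are assumptions for illustration.

```python
def align_score(sim, gap=-1.0):
    """Best global alignment score given sim[i][j], the similarity between
    residue i of protein A and residue j of protein B."""
    n, m = len(sim), len(sim[0])
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap                      # leading gaps in protein B
    for j in range(1, m + 1):
        dp[0][j] = j * gap                      # leading gaps in protein A
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j - 1] + sim[i - 1][j - 1],  # align residue i with j
                dp[i - 1][j] + gap,                    # gap in protein B
                dp[i][j - 1] + gap,                    # gap in protein A
            )
    return dp[n][m]

# Hypothetical similarity matrix: protein A has 3 residues, protein B has 2.
example_sim = [[2.0, 0.0],
               [0.0, 0.0],
               [0.0, 2.0]]
```

In a structure comparison the similarity entries would come from comparing the structural environments of the two residues rather than their amino acid types.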

31 CATHEDRAL: one third of known multi-domain structures are discontinuous

32 Reasons for Structural Similarity
Divergence: similarity arises due to divergent evolution from a common ancestor; structure is much more highly conserved than sequence.
Convergence: similarity is due to there being a limited number of ways of packing helices and strands in 3D space.

35 Domain structure database
CATH (Class, Architecture, Topology or Fold Group, Homologous Superfamily) domain structure database, Orengo & Thornton 1994.
~50,000 domains in PDB
~1500 domain superfamilies in CATH

36 CATH domain database
~50,000 domains: Class (3), Architecture (~36), Topology or Fold (~810)

37 Homologous Superfamily (Domain Family)
CATH, ~50,000 domain entries (the slide also shows "40,000 domain entries")
Topology or Fold Group: ~810
Homologous Superfamily (Domain Family): ~1500
Sequence Family (35%, 60%, 95%)

38 DHS: Dictionary of Homologous Superfamilies
Description of structural and functional characteristics for each superfamily

40 Variation in Secondary Structures Across Superfamily
DHS:Dictionary of Homologous Superfamilies

41 Functional annotations from GO, EC, COGs, KEGG
DHS:Dictionary of Homologous Superfamilies

42 Multiple structure alignments with conserved residues highlighted
DHS:Dictionary of Homologous superfamilies

43 Population of CATH Families and Structural Groups
~50,000 structural domains
S: ~4000 sequence families (35%); cluster proteins with similar sequences
H: ~1,500 homologous superfamilies; cluster proteins with similar structures and functions
T: ~810 fold groups; cluster proteins with similar structures
A: ~36 architectures
C: 3 major protein classes

44 nearly one third of the superfamilies belong to <10 fold groups
Fold groups shown: Rossmann fold, Jelly Roll, Alpha/Beta Plaits, Arc repressor-like, OB fold, Up-down, SH3-like, Immunoglobulin, TIM barrel.

45 CATH numbering scheme: 2.40.50.100
Class 2: Mainly beta
Architecture 40: Barrel
Topology 50: OB Fold
Homology 100: Heat labile enterotoxin superfamily
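The four-part numbering on this slide maps directly onto the C, A, T and H levels, which a short parser can make explicit. The `CathCode` type and the function names are hypothetical helpers written for this illustration, not part of any CATH software.

```python
from typing import NamedTuple

class CathCode(NamedTuple):
    cath_class: int
    architecture: int
    topology: int
    homologous_superfamily: int

def parse_cath_code(code: str) -> CathCode:
    """Split a dotted code such as '2.40.50.100' into its four levels."""
    parts = code.split(".")
    if len(parts) != 4 or not all(p.isdigit() for p in parts):
        raise ValueError(f"expected four dot-separated integers, got {code!r}")
    return CathCode(*(int(p) for p in parts))

def level_prefix(code: str, level: str) -> str:
    """Return the code truncated at level 'C', 'A', 'T' or 'H'."""
    depth = "CATH".index(level) + 1
    return ".".join(str(n) for n in parse_cath_code(code)[:depth])

example = parse_cath_code("2.40.50.100")
```

Two domains share a fold group when their codes have the same `level_prefix(code, "T")`, and share a homologous superfamily when all four numbers match.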

46 CATH domain structure database

47 CATH class level

48 CATH architecture level

49 CATH Topology or fold group level

50 CATH homologous superfamilies in each fold group

51 CATH homologous superfamily level

52 CATH sequence families (>=35% identity) in each superfamily

53 CATH classification information for individual domains

54 CATH structural relatives listed for each domain

55 CATH server

56 CATH server

57 CATH server: structural matches and statistics listed for query domain

58 Expanding CATH with sequence relatives from genomes
A library of HMMs is built for representative sequences from each CATH domain superfamily.
Protein sequences from genomes are scanned against the CATH HMM library to assign domains to CATH superfamilies.
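Once a sequence has been scanned against the HMM library, the hits still have to be turned into domain assignments. A simple way to do that is sketched below; the slide does not describe how this step is done, so the greedy strategy, the E-value cutoff of 1e-3, the hit tuple layout and the example hits are all assumptions for illustration.

```python
def assign_domains(hits, max_evalue=1e-3):
    """Pick non-overlapping hits, best E-value first.
    Each hit is (superfamily, start, end, evalue) with inclusive residue positions."""
    chosen = []
    significant = (h for h in hits if h[3] <= max_evalue)
    for hit in sorted(significant, key=lambda h: h[3]):
        _, start, end, _ = hit
        # Keep the hit only if it overlaps none of the regions already chosen.
        if all(end < s or start > e for _, s, e, _ in chosen):
            chosen.append(hit)
    return sorted(chosen, key=lambda h: h[1])

# Made-up hits for one genome sequence.
example_hits = [
    ("3.40.50.300", 1, 150, 1e-30),
    ("2.40.50.100", 140, 250, 1e-5),
    ("2.40.50.140", 160, 260, 1e-12),
    ("1.10.10.10", 300, 360, 0.5),
]
assigned = assign_domains(example_hits)
```

Treating each domain as one continuous region is a simplification; discontinuous domains would need hits made of several segments.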

59 Expanding CATH: ~1400 Domain Structure Superfamilies
Sequences are added from GenBank, genomes and SWISS-PROT/TrEMBL using CATH-HMMs.
Diagram: each homologous superfamily (H) contains sequence families (S1, S2, S3, ...).
Structural data: ~50,000 sequences in ~4,000 sequence families. With sequence relatives added: ~600,000 sequences in ~24,000 sequence families.
Up to 70% of sequences in completed genomes can be assigned to CATH domain superfamilies.

60 Gene3D
Fold groups shown: Rossmann Fold, Jelly Roll, Alpha/Beta Plaits, TIM Barrel, Immunoglobulin-like, Arc repressor-like, OB Fold, Four helix bundle, SH3-type barrel, Alpha horseshoe fold, Up-down.

61 Gene3D: CATH domain structure annotations for complete genomes

62 Gene3D: Individual genome statistics

63 Assignment of sequences to Gene3D protein families

64 Gene3D: Functional annotations for individual sequences

65 Gene3D: Functional annotations for individual sequences

66 Gene3D: Domain annotations for individual sequences

67 Gene3D: Domain annotations for individual sequences

68 Summary
CATH currently identifies ~1500 superfamilies in the ~50,000 structural domains from the PDB.
These domain families contain over 600,000 domain sequences from the genomes and sequence databases.
Up to 70% of genome sequences can be assigned to domain structure families using HMMs and threading.

69 Acknowledgements
Janet Thornton, Frances Pearl, Ian Sillitoe, Oliver Redfern, Mark Dibley, Tony Lewis, Chris Bennett, Andrew Harrison, Gabrielle Reeves, Alastair Grant, David Lee
Medical Research Council, Wellcome Trust, NIH, Biotechnology and Biological Sciences Research Council

