MCSG Site Visit, Argonne, January 30, 2003: Genome Analysis to Select Targets which Probe Fold and Function Space

1 MCSG Site Visit, Argonne, January 30, 2003
Genome Analysis to Select Targets which Probe Fold and Function Space
- How many protein superfamilies and families can we identify in the proteomes?
- How many structures are needed to cover a high fraction of prokaryotic and eukaryotic families?
- Targeting Universal Recurrent Superfamilies (SCOP/CATH/Pfam) to optimise coverage of fold and function space
Russell Marsden, Alastair Grant, David Lee, Annabel Todd, Janet Thornton, Andrzej Joachim
Midwest Consortium

2 Protein Families in Complete Genomes with Structural/Functional Annotations
- 800,000 protein sequences from 120 completed genomes
- 14 eukaryotic genomes, including human, mouse, rat, plant, fly, worm and fugu
- 92 bacterial genomes
- 14 archaeal genomes
Gene3D: Buchan, Thornton, Orengo, Genome Research (2002)

4 Clustering Sequences into Protein Superfamilies of Known Domain Composition
PFscape (Protein Family Landscape)
- BLAST all the sequences from the 120 completed genomes against each other and cluster them into protein families
  TRIBE-MCL, Markov clustering: Enright & Ouzounis, Genome Research, 2002
- For each sequence, identify CATH and Pfam domains
  SAM-T99, sequence mapping of CATH & Pfam: Karplus et al., NAR, 2000
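The Markov clustering step named on this slide can be illustrated with a small sketch. This is not TRIBE-MCL itself: it is a toy, pure-Python version of the general expansion/inflation idea, and the function name `markov_cluster`, the unit edge weights, and the `inflation` and `iterations` defaults are all assumptions made for illustration.

```python
def markov_cluster(similarity, inflation=2.0, iterations=50):
    """Toy Markov clustering on a symmetric similarity matrix (list of lists).

    similarity[i][j] > 0 means sequences i and j share a hit (e.g. from BLAST).
    Hypothetical helper for illustration; not the TRIBE-MCL implementation.
    """
    n = len(similarity)
    # Copy the matrix and add self-loops so every column has positive mass.
    m = [[similarity[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]

    def normalise(mat):
        # Rescale each column so that it sums to 1 (column-stochastic matrix).
        for j in range(n):
            total = sum(mat[i][j] for i in range(n))
            for i in range(n):
                mat[i][j] /= total
        return mat

    m = normalise(m)
    for _ in range(iterations):
        # Expansion: square the matrix (two-step random walks).
        m = [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
             for i in range(n)]
        # Inflation: raise entries to a power, then renormalise columns.
        m = normalise([[value ** inflation for value in row] for row in m])

    # Read clusters from the rows: the non-zero entries of a row form a cluster.
    clusters = set()
    for i in range(n):
        members = frozenset(j for j in range(n) if m[i][j] > 1e-6)
        if members:
            clusters.add(members)
    return sorted(sorted(cluster) for cluster in clusters)


# Two groups of sequences with hits only inside each group.
toy_hits = [
    [0, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 1, 0],
]
print(markov_cluster(toy_hits))
```

On real all-against-all BLAST output the weights would come from the hit scores, and a sparse-matrix implementation would be needed at the scale of ~800,000 sequences.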

5 Clustering ~800,000 genes from 120 complete genomes
[Diagram: PFscape clusters the sequences into Gene Superfamily 1, Gene Superfamily 2, Gene Superfamily 3, Gene Superfamily 4, ...]
Result: ~50,000 gene superfamilies of 2 or more sequences, plus 150,000 singletons

6 Mapping CATH and Pfam Domains onto Genome Sequences
- A library of HMMs is built for representative sequences from each CATH and Pfam domain superfamily
- Protein sequences from the genomes are scanned against the CATH & Pfam SAM-T99 HMM library
- Domains are assigned to CATH and Pfam superfamilies

7 Performance of Sequence Mapping Method
[Figure: 1D-HMM (SAM-T99). Percentage of remote, structurally validated CATH homologues (<35% sequence identity) identified by SAM-T99. Axes: % of homologues found vs. error rate]
The library of 1D-HMM models detects ~80% of remote homologues

8 Use HMMs to annotate Gene Superfamilies with CATH and Pfam domains
[Diagram: the 50,000 Gene Superfamilies (Gene Superfamily 1 to 4 shown) annotated with CATH, Pfam and NewFam domains]

9 Merge superfamilies with the same domain combinations
[Diagram: Gene Superfamily 1, Gene Superfamily 2, Gene Superfamily 3]
Gene3D: 50,000 -> 36,000 superfamilies
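The merging step on this slide could be sketched as a grouping by domain combination. The slide does not say whether domain order matters or how unannotated clusters are handled, so the sketch below assumes an ordered combination and leaves unannotated clusters unmerged. The function name and the domain labels (`CATH_a`, `Pfam_b`, ...) are hypothetical placeholders.

```python
from collections import defaultdict


def merge_by_domain_combination(annotations):
    """Group gene superfamilies that share the same ordered domain combination.

    annotations: dict mapping a gene superfamily id to its list of domain labels.
    Returns a dict mapping each domain combination to the ids merged under it.
    Assumption: clusters with no domain annotation are never merged together.
    """
    merged = defaultdict(list)
    for cluster_id, domains in annotations.items():
        if domains:
            key = tuple(domains)  # order-sensitive combination (an assumption)
        else:
            key = ("unannotated", cluster_id)  # keep each such cluster separate
        merged[key].append(cluster_id)
    return dict(merged)


toy_annotations = {
    "GS1": ["CATH_a", "Pfam_b"],
    "GS2": ["CATH_a", "Pfam_b"],
    "GS3": ["Pfam_c"],
    "GS4": [],
}
merged = merge_by_domain_combination(toy_annotations)
print(len(toy_annotations), "->", len(merged), "superfamilies")
```

Here GS1 and GS2 collapse into one superfamily because they carry the same domain combination, which mirrors the reduction from 50,000 to 36,000 on a small scale.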

10 Superfamilies Further Classified into Families
[Diagram: a Superfamily divided into Families (35% ID)]
- Multi-linkage clustering: relatives in each sequence family have 35% or more sequence identity
- For good homology models, one structure is needed for each family within a superfamily
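A minimal sketch of the family-building step, assuming that "multi-linkage" means a sequence joins a family only if it has at least 35% identity to every existing member. The greedy, order-dependent procedure and the names used here are illustrative assumptions, not the authors' implementation.

```python
def multi_linkage_families(sequence_ids, identity, threshold=35.0):
    """Greedy clustering in which all pairs within a family meet the threshold.

    identity: function (a, b) -> percentage sequence identity between a and b.
    The result depends on input order; this is a sketch, not a tuned method.
    """
    families = []
    for seq in sequence_ids:
        for family in families:
            # Join the first family where every member is similar enough.
            if all(identity(seq, other) >= threshold for other in family):
                family.append(seq)
                break
        else:
            # No compatible family found: start a new one.
            families.append([seq])
    return families


# Toy pairwise identities (percent); unlisted pairs default to 10%.
toy_identity = {
    frozenset(("a", "b")): 80.0,
    frozenset(("a", "c")): 40.0,
    frozenset(("b", "c")): 30.0,
}


def lookup(a, b):
    return toy_identity.get(frozenset((a, b)), 10.0)


families = multi_linkage_families(["a", "b", "c", "d"], lookup)
print(families)
```

Sequence `c` is 40% identical to `a` but only 30% identical to `b`, so it starts its own family; under this assumption each resulting family would need one structure for good homology models.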

11 Percentage of Sequence Families with and without Close Structural Homologues (>35% identity)
Number of domain superfamilies and families with no close structural homologue
[Figure: bar chart for NewFam, CATH and Pfam. Y-axis: Percentage of Families (50, 100). Legend: No close PDB homologue]
- CATH (60,360) + Pfam (53,907) + NewFam (56,973) = 171,240 families
- CATH (1400) + Pfam (4100) + NewFam (46,384) = 51,844 superfamilies

12 Preferentially Target Largest Superfamilies
[Figure: three panels (CATH, Pfam, NewFam). X-axis: Number of Non-identical Relatives. Y-axis: Number of Superfamilies containing a given number of non-identical relatives, as a percentage of the total]
Fitted power laws (with gradients): CATH (-0.4), Pfam (-1.0), NewFam (-1.9)
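The gradients quoted on this slide are slopes of power-law fits. One common way to obtain such a slope, assumed here since the slide does not state the fitting method, is an ordinary least-squares line through the points on log-log axes. The function name and the toy data are illustrative.

```python
import math


def fit_power_law_gradient(sizes, counts):
    """Least-squares slope of log10(count) against log10(size).

    For data following count = c * size**g the returned value approximates g.
    """
    xs = [math.log10(s) for s in sizes]
    ys = [math.log10(c) for c in counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Synthetic data following count = 1000 * size**-1.
toy_sizes = [1, 10, 100, 1000]
toy_counts = [1000, 100, 10, 1]
gradient = fit_power_law_gradient(toy_sizes, toy_counts)
print(round(gradient, 3))
```

A steeper (more negative) gradient, as reported for NewFam, means that large superfamilies are rarer relative to small ones.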

13 Proteome Coverage by Superfamilies
[Figure: X-axis: Superfamilies Ordered by Size. Y-axis: Percentage of Proteomes (number of non-identical proteins in 120 completed genomes), 0 to 100]
~70% of the proteomes are contained in the < 2500 largest CATH + Pfam + NewFam target superfamilies
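The coverage curve on this slide can be thought of as a cumulative sum over superfamilies sorted by size. The sketch below, with a hypothetical function name and made-up sizes, shows how the number of largest superfamilies needed to reach a target coverage could be read off.

```python
def superfamilies_needed(sizes, target_fraction):
    """Number of largest superfamilies needed to cover target_fraction of proteins.

    sizes: number of non-identical proteins in each superfamily.
    """
    ordered = sorted(sizes, reverse=True)  # largest superfamilies first
    total = sum(ordered)
    covered = 0
    for rank, size in enumerate(ordered, start=1):
        covered += size
        if covered / total >= target_fraction:
            return rank
    return len(ordered)


# Toy superfamily sizes (not the real Gene3D data).
toy_sizes = [5, 50, 10, 25, 10]
needed = superfamilies_needed(toy_sizes, 0.70)
print(needed)
```

In this toy case the two largest superfamilies already hold 75 of 100 proteins, so two targets suffice for 70% coverage.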

14 Proteome Coverage by Superfamilies
[Figure: X-axis: Superfamilies Ordered by Size. Y-axis: Percentage of Proteomes (120 completed genomes). Curves: CATH (superfamilies of known fold), Pfam, NewFam]

15 What Fraction of the Proteomes is Covered by Bacterial Family Targets?
[Figure: X-axis: Number of Target Families (0 to 200,000). Y-axis: Percentage of Proteomes (120 completed genomes), 0 to 100. Curves: prokaryotes, eukaryotes, eukaryotes plus prokaryotes]
~100,000 prokaryotic targets cover nearly 60% of the proteomes

16 How many family targets cover a significant proportion of the eukaryotes and/or prokaryotes?
[Figure: X-axis: Number of Target Families (marked at 25,000, 30,000 and 45,000). Y-axis: Percentage of Kingdom Proteomes (120 completed genomes). Curves: prokaryotes, eukaryotes, eukaryotes plus prokaryotes]
25,000 - 45,000 family targets cover 70% of the proteomes (< 2500 largest superfamily targets)

17 Target Selection Strategy
- The < 2500 largest superfamily targets give 70% of the proteomes
- This corresponds to 25,000 - 45,000 family targets
- Accurate homology models are not needed for all families
- Target families of biological interest or containing human homologues with disease association
- Target families from functionally diverse superfamilies to understand how changes in structure can modify function
- For example, Universal, Highly Recurrent Superfamilies are an interesting biological subset with diverse functions

18 Universal CATH Domain Superfamilies
[Figure: 30 representative eukaryotic and prokaryotic organisms. Y-axis: Proportion of CATH domain annotations (0 to 100)]
~60-70% of the CATH domain annotations within each organism come from < 200 universal CATH superfamilies common to all kingdoms of life, some of which are very extensively duplicated

19 Domain Recurrences in the Genomes
[Figure: number of superfamilies vs. occurrences]
Highly Recurrent, Extensively Duplicated Superfamilies

20 56 Universal and Highly Recurrent Superfamilies
Analysis in bacterial genomes showed that 56 Universal Superfamilies recurred in proportion to genome size and accounted for 45% of the CATH domain annotations
[Figure: COG functional annotation (25 functional categories: S R Y V Z W O U T N M D A J L B P Q K I H E F G C), grouped as Poorly characterized, Cellular processes and signalling, Information storage and processing, Metabolism. Highlighted: E (Amino acid metabolism), J (Translation and protein biosynthesis), K (Transcription), T (Signal transduction)]
15,000 bacterial family targets

21 In Functionally Diverse Superfamilies Select More Targets
- Select the relative with the most neighbours for which a homology model can be built or function assigned
- For >95% confidence when inheriting functional properties, homologues should have at least 60% identity (Todd, Valencia, Rost)
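Choosing "the relative with most neighbours" can be read as a greedy covering procedure: repeatedly pick the sequence that covers the most not-yet-covered relatives. The slide does not spell out an algorithm, so the sketch below is one plausible interpretation; the function name and the neighbour sets are hypothetical, and a neighbour is assumed to mean a relative above the chosen identity cut-off (for example 60% for function inheritance).

```python
def select_targets(neighbours):
    """Greedy target selection.

    neighbours: dict mapping each sequence to the set of relatives close enough
    for a homology model to be built or function to be inherited from it.
    Returns the chosen targets in the order they were picked.
    """
    uncovered = set(neighbours)
    targets = []
    while uncovered:
        # Pick the uncovered sequence covering the most uncovered relatives;
        # sorting first makes tie-breaking deterministic (alphabetical).
        best = max(
            sorted(uncovered),
            key=lambda s: len((neighbours[s] | {s}) & uncovered),
        )
        targets.append(best)
        uncovered -= neighbours[best] | {best}
    return targets


toy_neighbours = {
    "a": {"b", "c"},
    "b": {"a"},
    "c": {"a"},
    "d": set(),
}
targets = select_targets(toy_neighbours)
print(targets)
```

A functionally diverse superfamily has smaller neighbour sets at a strict cut-off, so a procedure like this naturally selects more targets from it.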

22 Representative Structures for Superfamilies will help identify Functional Families
- Functional clusters identified by sequence conservation
- Annotations (GO, KEGG, Pfam, EC, COGs, SWISS-PROT) stored in Gene3D
[Diagram: a Superfamily divided into functional clusters S60_1, S60_2, S60_3, S60_4, S60_5]

23 Target Selection Strategy
- Targeting the 2500 largest superfamilies will cover a significant proportion (70%) of the proteomes
- For good homology models, between 25,000 and 45,000 family targets are needed
- Preferentially select targets from medically important and/or structurally and functionally diverse superfamilies
- For example, targeting Universal and Recurrent superfamilies which exhibit significant structural and functional divergence will help to improve function prediction methods

