Classification of protein and domain families Sequence to function Protein Family Resources and Protocols for Structural and Functional Annotation of Genome.


1 Protein Family Resources and Protocols for Structural and Functional Annotation of Genome Sequences. Topics: Classification of protein and domain families; Sequence to function; Structure to function; Domain structures; Domain structure predictions.

2 CATH: Fold Group (1100); Homologous Superfamily (2100); Sequence Family. 40,000 domain entries. ~100,000 domains of known structure in CATH. Gene3D: ~2 million sequences from genomes assigned to CATH superfamilies in Gene3D and functionally annotated.

3 Gene3D: Domain structure annotations in genome sequences. ~5 million protein sequences from 560 completed genomes and UniProt are scanned against a library of HMM models and sequences for CATH, Pfam and NewFam superfamilies. ~2 million domain sequences are assigned to CATH superfamilies.

4 Gene3D: (1) Cluster ~5 million sequences into protein superfamilies. (2) Map domains onto the sequences using HMM technology (CATH & Pfam domains). Totals: >200,000 protein superfamilies; ~10,000 domain superfamilies (2100 of known structure).

5 Proportion of genome sequences which can be assigned to domain families of known structure in CATH or SCOP. Legend: HMM prediction; threading prediction.

6 Annotation levels for an average genome (scale 0 to 100%): predicted to belong to structural superfamilies using HMM or threading techniques; many predicted to be transmembrane; many belonging to small species-specific families.

7 Target selection strategy for PSI-2. [Plot: percentage of domain sequences (0 to 100) against families ordered by size (0 to 6000). Series: known structure (CATH - MEGA); unknown structure (BIG - Pfam).] Adam Godzik – JCSG, Andras Fiser – NYSGC, Burkhard Rost – NESG

8 Correlation of sequence and structural variability of CATH families with the number of different functional groups. Axes: Population in genomes (x 1000); Structural Diversity.

9 Structural diversity in the CATH Domain Superfamily. Labels: P-loop hydrolases; Cutinase; Cocaine esterase; Acetylcholinesterase.

10 Sequence to function. Protein Family Resources and Protocols for Structural and Functional Annotation of Genome Sequences. Domain structures; Domain structure predictions.

11 Sequence identity thresholds for 90% conservation of enzyme function (to 3 EC levels). Plot labels: Sequence identity threshold for 90% conservation; Number of families; Number of sequences; highly variable families.
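The idea of a family-specific identity threshold can be sketched in a few lines. This is only an illustration under stated assumptions, not the procedure behind the slide: the function name `identity_threshold`, the input format (one tuple per sequence pair holding percent identity and whether the pair shares function to 3 EC levels), and the choice to return the lowest qualifying cutoff are all hypothetical.

```python
from typing import Iterable, Optional, Tuple


def identity_threshold(
    pairs: Iterable[Tuple[float, bool]], target: float = 0.90
) -> Optional[float]:
    """Return the lowest identity cutoff at which at least `target` of the
    pairs at or above the cutoff share function, or None if no cutoff works.

    Each pair is (percent sequence identity, same_function). Both the input
    format and the 'lowest qualifying cutoff' rule are assumptions.
    """
    ordered = sorted(pairs, key=lambda p: p[0], reverse=True)
    best = None
    same = 0
    total = 0
    i = 0
    while i < len(ordered):
        cutoff = ordered[i][0]
        # Absorb every pair tied at this identity before testing the cutoff.
        while i < len(ordered) and ordered[i][0] == cutoff:
            same += int(ordered[i][1])
            total += 1
            i += 1
        if same / total >= target:
            best = cutoff
    return best


# Toy family: function is conserved down to 60% identity and lost below it.
toy_pairs = [(80.0, True), (60.0, True), (40.0, False), (30.0, False)]
print(identity_threshold(toy_pairs))
```

A family whose relatives keep the same function at low identity would get a low cutoff from this sketch, while a highly variable family would get a high one or none at all.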

12 N-Fold Increase in Functional Annotation for Sequences in Gene3D. Axis: N-fold increase in coverage. Series: general thresholds; family-specific thresholds.

13 Gene3D: Functional information from GO, COGS, KEGG, EC, FunCat, MINT, IntAct, ComplexDB. Page features: Link to UniProt; Links to GO; Links to different levels in the Gene3D protein family; Link to InterPro; Links to CATH/Pfam; Links to KEGG; “S” indicates you can search the term against Gene3D; Get an XML version of this page.

14 Functional annotation of structures using EC, GO, KEGG, FunCat resources. Series: Non-PSI PDBs; PSI PDBs. Categories: 0 terms; 1 term; 2 terms; 3 terms; 4 terms.

15 Phylogenetic trees derived from multiple sequence alignments can be used to infer functionally related proteins. Methods: Tree Determinants – Valencia; Evolutionary Trace – Lichtarge; Funshift – Sonnhammer; SCI-PHY – Sjolander

16 Methods exploiting information on sequence conserved residue positions: Scorecons – Thornton; Protein Keys – Sander. Score conservation for each position in the multiple sequence alignment of relatives from a functional group using an entropy measure (1 = highly conserved, 0 = unconserved). Figure labels: Putative functional site; Structural model.
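An entropy-based conservation score of the kind described on this slide can be sketched as follows. This is a minimal illustration and not the Scorecons algorithm: it assumes a plain normalised Shannon entropy over 20 amino acid types, ignores gap characters, and uses a hypothetical function name `conservation_scores`.

```python
import math
from collections import Counter
from typing import List


def conservation_scores(alignment: List[str]) -> List[float]:
    """Score each alignment column from 1 (fully conserved) towards 0.

    Assumptions: sequences are pre-aligned and of equal length, '-' marks a
    gap and is ignored, and entropy is normalised by log(20).
    """
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")

    max_entropy = math.log(20)  # entropy of 20 equally frequent residue types
    scores = []
    for col in range(length):
        residues = [seq[col] for seq in alignment if seq[col] != "-"]
        if not residues:
            scores.append(0.0)  # an all-gap column carries no signal
            continue
        n = len(residues)
        entropy = -sum(
            (count / n) * math.log(count / n)
            for count in Counter(residues).values()
        )
        # Clamp at 0 in case non-standard symbols push entropy above log(20).
        scores.append(max(0.0, 1.0 - entropy / max_entropy))
    return scores


toy_alignment = ["ACD", "ACE", "AGF"]
print(conservation_scores(toy_alignment))
```

In the toy alignment the first column is invariant and scores 1, while the second and third columns score progressively lower as more residue types appear.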

17 GeMMA: Compares sequence profiles (HMMs) between subfamilies in a superfamily of known structure (CATH). Starting from sequence subfamilies (80% seq. id), it clusters sequence relatives predicted to have similar structures/functions, even at low levels of sequence identity, into putative structure-function groups.

18 GeMMA v SCI-PHY using gold annotated sequences in Babbitt benchmark. Measures: Purity (high is best); Edit distance (low is best); VI distance (low is best); Deviation from no. singletons (low is best).
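Two of the measures named on this slide can be sketched for a predicted clustering against gold annotations. The definitions below are common textbook ones (majority-label purity and variation of information with natural logarithms); the benchmark itself may define purity differently, and the function names are hypothetical.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def purity(pred: Sequence[Hashable], gold: Sequence[Hashable]) -> float:
    """Fraction of items carrying the majority gold label of their cluster."""
    if len(pred) != len(gold) or not pred:
        raise ValueError("pred and gold must be non-empty and equally long")
    by_cluster = {}
    for p, g in zip(pred, gold):
        by_cluster.setdefault(p, Counter())[g] += 1
    return sum(max(c.values()) for c in by_cluster.values()) / len(pred)


def variation_of_information(
    pred: Sequence[Hashable], gold: Sequence[Hashable]
) -> float:
    """VI = H(pred | gold) + H(gold | pred); 0 means identical partitions."""
    if len(pred) != len(gold) or not pred:
        raise ValueError("pred and gold must be non-empty and equally long")
    n = len(pred)
    joint = Counter(zip(pred, gold))
    pred_sizes = Counter(pred)
    gold_sizes = Counter(gold)
    vi = 0.0
    for (p, g), c in joint.items():
        # -p(x,y) * [log p(x,y)/p(x) + log p(x,y)/p(y)], with the n's cancelled
        vi -= (c / n) * (math.log(c / pred_sizes[p]) + math.log(c / gold_sizes[g]))
    return vi


gold_labels = ["a", "a", "b", "b"]
print(purity([0, 0, 1, 1], gold_labels), variation_of_information([0, 0, 1, 1], gold_labels))
print(purity([0, 0, 0, 0], gold_labels), variation_of_information([0, 0, 0, 0], gold_labels))
```

A clustering that merges the two gold groups into one keeps every sequence but loses purity and gains VI distance, which is why the slide reads purity as "high is best" and VI as "low is best".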

19 Functional annotation coverage using different strategies. Axis: Coverage of superfamily (%). Strategies: experimental annotations; inherit functions at 60% seq. id.; inherit functions by GeMMA.

20 Gene3D Biominer Methods: Protein interactions and gene networks. Phylotuner: Correlation of domain occurrence profiles. GOSS: Semantic similarity calculation between protein pairs. CODA: Domain fusion analysis. HiPPI: Homology inheritance of protein-protein physical interaction data. GECO: Correlation of gene expression data.
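Correlating domain occurrence profiles, as named for Phylotuner above, can be illustrated with a plain Pearson correlation over per-genome domain counts. The slide does not say which statistic Phylotuner uses, so the choice of Pearson, the function name `pearson`, and the toy profiles are all assumptions.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equally long profiles.

    Returns 0.0 when either profile is constant, since the correlation is
    undefined there (a convention chosen for this sketch).
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must be equally long with at least 2 entries")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


# Hypothetical counts of two domain families across the same five genomes.
family_a = [0, 2, 4, 1, 3]
family_b = [0, 4, 8, 2, 6]
print(pearson(family_a, family_b))
```

Families whose copy numbers rise and fall together across genomes score close to 1 in this sketch, which is the kind of signal that could suggest a functional link.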

21 Protein Family Resources and Protocols for Structural and Functional Annotation of Genome Sequences. Structure to function. Domain structures; Domain structure predictions.

22 Methods for Assessing Structural Novelty. CATHEDRAL – structure comparison. Redfern et al., PLoS Comput. Biol. 2007

23 Structural clusters in the Aminoacyl tRNA synthetases-like family: Aminoacyl tRNA synthetases; DNA-binding, stress-related; Argininosuccinate lyases; Gln-hydrolyzing synthases; Nucleotidyl-transferases. Scale: structure similarity score.

24 Galectin binding superfamily (CATH 2.60.120.200): domains 1bkzA00 and 1dypA00

25 Identifying functional groups in domain superfamilies. Aminoacyl tRNA synthetases-like: 1dnpA00 – Deoxyribodipyrimidine photo-lyases; 1ej2A00 – Nucleotidylyl-transferases; 1n3lA01 – AA tRNA synthetase, Class I; 1o97D01 – Electron transfer flavoprotein.

26 Exploiting 3D Templates to Represent Functional Relatives: JESS – Thornton; GASP – Babbitt; SPASM – Kleywegt; PINTS – Russell; DRESPAT – Sarawagi; pvSOAR – Joachimiak

27 SITESEER: Match 3-residue templates and assess relevance of hits by looking at residues within the local environment (green and purple – identical residues; orange and white – similar residues). Laskowski and Thornton

28 FLORA: 3D templates for functional groups. From multiple structure alignments of functional subgroups in the superfamily, identify vectors between amino acids that are highly conserved and distinctive for the functional subgroup.

29 FLORA: 3D templates for functional groups. localFLORA: single site; globalFLORA: multiple sites.

30 FLORA: Performance in recognising functionally related homologues. Benchmark of 36 diverse enzyme groups (from 12 families).

31 Performance of FLORA. Benchmarked on 36 large enzyme families.

32 FLORA: 3D Templates for Structure-Function Groups in Domain Families. 1dnpA01 – Deoxyribodipyrimidine photo-lyases; 1ej2A00 – Nucleotidylyl-transferases; 1q77A00 – Unknown function (MCSG); 1n3lA01 – AA tRNA synthetases; 1o97D01 – Electron transfer flavoprotein

33 ProFunc: http://www.ebi.ac.uk/thornton-srv/databases/ProFunc/ Sequence scans: Sequence search vs PDB; Sequence search vs UniProt; Sequence motifs (PROSITE, BLOCKS, SMART, Pfam, etc); Superfamily HMM library; Gene neighbours. Fold and structural motifs: SSM fold search; Surface clefts; Residue conservation; DNA-binding HTH motifs; Nest analysis. n-residue templates: Enzyme active sites; Ligand binding sites; DNA binding sites; Reverse templates.

34 Function Prediction for Proteins of ‘Putative’ or Unknown Function

Class           Sequence Evidence   Structure Evidence   Sequence + Structure   Neither Successful
Putative (57)   53                  44                   41                     1
Unknown (132)   95*                 69*                  57*                    25

* Numbers refer to results where the top hit is classed as ‘Strong’ or ‘Moderate’.

Structural data provides relatively more information for proteins about which there is less knowledge. These predictions need to be experimentally validated.
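The counts on this slide can be checked for internal consistency with inclusion-exclusion. The sketch below reads the Putative row as 53, 44, 41 and 1 and the Unknown row as 95, 69, 57 and 25, and assumes that "Sequence + Structure" counts proteins for which both kinds of evidence succeeded; the function name `is_consistent` is hypothetical.

```python
def is_consistent(total: int, seq: int, struct: int, both: int, neither: int) -> bool:
    """Check that the four counts partition the class total.

    Proteins with any successful evidence = seq + struct - both
    (inclusion-exclusion); adding 'neither' should give the class size.
    """
    return seq + struct - both + neither == total


# (class size, sequence, structure, sequence + structure, neither)
rows = {
    "Putative": (57, 53, 44, 41, 1),
    "Unknown": (132, 95, 69, 57, 25),
}
checks = {name: is_consistent(*counts) for name, counts in rows.items()}
print(checks)  # {'Putative': True, 'Unknown': True}
```

Under that reading both rows add up to their class sizes of 57 and 132.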

