Family based approaches Methods which also exploit non-homology based information Protein Annotations from Sequence Data Network based analyses.


1 Family based approaches Methods which also exploit non-homology based information Protein Annotations from Sequence Data Network based analyses

2 Homology based inference of protein functions [Figure: an ancestral protein undergoes duplication and speciation, giving paralogues and orthologues in species 1 and species 2] orthologues - often have very similar functions; paralogues - may have related functions

3 MKLNSHHIASNYEASKNFVNILQFEIRENYRSDKDSYKLDMVGSEQYASYP…. search for orthologues search family resources (orthologues and paralogues) analyse residue features to predict transmembrane, localisation etc predict protein interactions search for conserved residues

4 MKLNSHHIASNYEASKNFVNILQFEIRENYRSDKDSYKLDMVGSEQYASYP…. search for orthologues HAMAP, eggNOG, COGs, KOGs search family resources (orthologues and paralogues) analyse residue features predict transmembrane, localisation etc predict protein interactions search for conserved residues

5 HAMAP families Orthologous protein families used for High-quality Automated and Manual Annotation of microbial Proteomes in UniProtKB. 1,448 families, from Bacteria, Archaea and Plastid covering over 180,000 UniProtKB/Swiss-Prot entries, available on: http://www.expasy.org/sprot/hamap/families.html Anne-Lise Veuthey, SIB

6 HAMAP pipeline UniProtKB/ TrEMBL Profile Automated annotation Manual checking of warnings given by the system HAMAP family rules Automatic retrieval of sequences matching the profile

7 MKLNSHHIASNYEASKNFVNILQFEIRENYRSDKDSYKLDMVGSEQYASYP…. search for orthologues analyse residue features predict transmembrane, localisation etc predict protein interactions search for conserved residues search family resources SMART, ProtoNet, Everest, Gene3D, CATH, InterPro Pfam, TIGR, PRINTS, SCOP

8 (1) Cluster 4.5 million sequences (510 completed genomes) into protein superfamilies using APC clustering algorithm (2) Map domains onto the sequences using HMM technology (CATH & Pfam domains) 335,000 protein superfamilies (orthofams) (189,000 have >5 sequences) 19% are singletons ~11,000 domain superfamilies (2100 CATH of known structure – account for ~85% of domains) BLAST, APC CATH, Pfam HMM libraries

9 Gene3D - OrthoFams Functional annotation of selected node Root Node 30% ID 95% ID 335,000 Protein families built using Affinity Propagation Clustering. Annotated with FunCat, HAMAP, EC, KEGG, GO, IntAct, HPRD, and others. Benchmarking – 99.9% map to single HAMAP

10 Functional Catalogue (FunCat) Organized hierarchically with up to six levels. ~1307 categories. Currently 9 organisms incorporated: yeast, human, A. thaliana, … Dmitrij Frishman, GSF

11 ProtoNet and EVEREST family resources ProtoNet 5.1 EVEREST 2.0 2.5M sequences Michal Linial, HUJI

12 [Figure: hierarchical cluster tree descending from Root, with nodes labelled A, B and E] 2.5M sequences from UniProt. UPGMA efficient clustering algorithm. Benchmarked against Pfam, SCOP

13 ProtoName: safe inference of annotation For each cluster, an annotation is assigned an Annotation Score if: (a) proteins achieve p-value <= 0.001 (b) Only clusters with > 5 proteins are considered (c) Purity is >0.9 (TP/(TP+FN)) (d) Combination of functional keywords For each protein, assign the annotations of its cluster and all parents >40% of the clusters and 65% of proteins assigned a safe ProtoName
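The filtering criteria on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the ProtoName implementation: the class and field names are hypothetical, the purity ratio is taken literally from the slide as TP/(TP+FN), and criterion (d), the combination of keywords, is left out.

```python
from dataclasses import dataclass


@dataclass
class KeywordAssignment:
    """One candidate keyword for one cluster (all field names are hypothetical)."""
    keyword: str
    cluster_size: int     # number of proteins in the cluster
    true_positives: int   # the TP term of the slide's purity ratio
    false_negatives: int  # the FN term of the slide's purity ratio
    p_value: float        # significance attached to the keyword


def purity(a: KeywordAssignment) -> float:
    """Purity exactly as written on the slide: TP / (TP + FN)."""
    denominator = a.true_positives + a.false_negatives
    return a.true_positives / denominator if denominator else 0.0


def is_safe(a: KeywordAssignment, max_p: float = 0.001,
            min_size: int = 5, min_purity: float = 0.9) -> bool:
    """Apply criteria (a) to (c): p-value, cluster size, purity."""
    return (a.p_value <= max_p
            and a.cluster_size > min_size
            and purity(a) > min_purity)


def inherited_annotations(cluster, parent_of, safe_keywords):
    """Collect the safe keywords of a cluster and of all its parents."""
    result = []
    node = cluster
    while node is not None:
        result.extend(safe_keywords.get(node, []))
        node = parent_of.get(node)  # None once the root is passed
    return result
```

A protein in cluster `c1` whose parent is `c0` would receive the safe keywords of both clusters via `inherited_annotations("c1", {"c1": "c0"}, {...})`.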

14 protein superfamily ~11% of PROTEIN superfamilies in a genome are common to all kingdoms

15 protein superfamily common domains nearly 60% of domains are from ~200 superfamilies COMMON to all major kingdoms; these have been combined in different ways to modulate function. ~11% of PROTEIN superfamilies in a genome are common to all kingdoms

16 Evolution of functional subfamilies within superfamilies. [Figure: tree descending from Root with nodes labelled A, B and E, marked with + symbols for COG functional categories] Species tree built on the small subunit (SSU) ribosomal RNA superfamily

17 Percentage frequencies of functional shifts within domain superfamilies Function is predominantly conserved within the same COG functional subcategory or major category However, there are clearly cases of major functional shifts parent functions child functions metabolism signal transduction protein biosynthesis poorly characterised

18 Population in genomes (x 1000) Structural Diversity <10% of domain superfamilies (<200) are highly expanded in the genomes and functionally very diverse ~2000 superfamilies

19 N-fold increase in functional annotation using pairwise sequence identity thresholds [Plot: N-fold increase in coverage, general thresholds vs family specific thresholds] >50% sequence identity: 90% probability of having related functions. If the domains have the same multidomain context, >30% sequence identity: 90% probability of having related functions
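The two general thresholds quoted on this slide amount to a simple decision rule. The sketch below is a hypothetical helper, not code from Gene3D; the function name and arguments are assumptions, and family specific thresholds are not modelled.

```python
def can_inherit_function(identity_percent: float,
                         same_multidomain_context: bool) -> bool:
    """Apply the slide's general thresholds for inheriting function.

    >30% identity suffices when both domains share the same multidomain
    context; otherwise >50% identity is required.
    """
    threshold = 30.0 if same_multidomain_context else 50.0
    return identity_percent > threshold
```

For example, a relative at 40% identity would only be used for annotation transfer if it shares the query's multidomain context.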

20 Some superfamilies contain multiple diverse functional subfamilies

21 MKLNSHHIASNYEASKNFVNILQFEIRENYRSDKDSYKLDMVGSEQYASYP…. search for orthologues analyse residue features predict transmembrane, localisation etc predict protein interactions search family resources (orthologues and paralogues) search for conserved residues TreeDet, ScoreCons, GEMMA ETtrace, SCI-PHY, FunShift

22 Score conservation for each position in the alignment using an entropy measure 1 = highly conserved 0 = unconserved Putative functional site Structural model Identify functional subfamilies by using information on sequence conserved residue positions Scorecons –Thornton TreeDet - Valencia multiple sequence alignment of relatives from functional subfamily
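An entropy-based conservation score of the kind described here can be sketched as follows. This is a minimal illustration using plain normalised Shannon entropy, scaled so that 1 means fully conserved and 0 means unconserved; it is an assumption-laden stand-in and not the actual Scorecons scoring scheme. Gaps and non-standard residues are simply ignored.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_conservation(column: str) -> float:
    """Return 1 - normalised Shannon entropy for one alignment column."""
    residues = [c for c in column.upper() if c in AMINO_ACIDS]  # skip gaps
    if not residues:
        return 0.0
    total = len(residues)
    counts = Counter(residues)
    entropy = -sum((n / total) * math.log(n / total) for n in counts.values())
    # Largest entropy reachable with this many residues and 20 residue types.
    max_entropy = math.log(min(total, len(AMINO_ACIDS)))
    if max_entropy == 0:
        return 1.0  # a single residue is trivially conserved
    return 1.0 - entropy / max_entropy


def alignment_conservation(alignment):
    """Score every column of an alignment given as equal-length strings."""
    return [column_conservation("".join(col)) for col in zip(*alignment)]
```

Columns scoring close to 1 across relatives from one functional subfamily would be the candidates for a putative functional site.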

23 Phylogenetic trees derived from multiple sequence alignments can be used to identify functional subfamilies TreeDet - Valencia SCI-PHY – Sjolander FunShift – Sonnhammer ETtrace - Lichtarge

24 TreeDet method for identifying functional subfamilies Alfonso Valencia group, CNIO

25 domain superfamily GEMMA: Compares sequence profiles (HMMs) between subfamilies using COMPASS method sequence subfamily 90% seq. id putative functional subfamily clusters sequence relatives predicted to have related functions

26 GeMMA v SCI-PHY using gold standard Babbitt benchmark of 5 large curated superfamilies Purity (high is best) Edit distance (low) VI distance (low is best) Deviation from no. singletons (low)

27 Coverage of superfamily (%) experimental annotations inherit functions at 50% seq. id. inherit functions by GEMMA Functional annotation coverage using different strategies

28 MKLNSHHIASNYEASKNFVNILQFEIRENYRSDKDSYKLDMVGSEQYASYP…. search for orthologues search family resources (orthologues and paralogues) analyse residue features predict transmembrane MEMSAT, TMHMM, ENSEMBLE, PONGO predict protein interactions search for conserved residues analyse residue features predict disorder, signal peptides, localisation BaCelLo, DisoPred, FFpred

29 A biological hydrophobicity scale (Hessa et al., Nature 433:377 & 450:1026; Bernsel et al. PNAS in press) Gunnar Von Heijne, STO

30 Pongo annotation engine Seven predictors at the core: all-α TM topology; (a) TMHMM 2.0 (b) MEMSAT (c) PRODIV (d) ENSEMBLE (e) ENSEMBLE 2.0 (f) TMHMM DOMFIX signal peptide; (a) SPEP Rita Casadio, UNIBO http://pongo.biocomp.unibo.it/pongo

31 Performance of the high scoring methods on the 121 high-resolved chains (from PDB) Correct Topography: correct position of TM helices along the sequence. Correct Topology: correct position AND correct orientation with respect to the membrane plane

32 The PONGO engine: http://pongo.biocomp.unibo.it Amico M, Finelli M, Rossi I, Zauli A, Elofsson A, Viklund H, von Heijne G, Jones D, Krogh A, Fariselli P, Martelli PL, Casadio R -PONGO: a web server for multiple predictions of all-alpha transmembrane proteins- Nucleic Acids Res 34(Web server issue):169-172 (2006)

33 CBS prediction servers Broad range of prediction servers Amino acid sequence based methods within: Protein sorting Post-translational modifications of proteins Protein function and structure Immunological features Local protein features, e.g. kinase-specific phosphorylation site, nuclear export signal, propeptide cleavage site Global properties, e.g. cell cycle regulated, secreted via a non-classical pathway, member of the nucleolar subproteome, GO categories, EC categories,... Soren Brunak, DTU

34 FFPred: An Integrated Feature based Function Prediction Server for Vertebrate Proteomes Inferring function using patterns of native disorder in proteins. Lobley, A.E., Swindells, M.B., Orengo, C.A. & Jones, D.T. (2007) PLoS Comput. Biol. 3:e162. > 300 GO Term Classifiers for both Molecular Function and Biological Process Categories David Jones, UCL

35 Protein Annotations from Sequence Data Network based analyses

36 CORUM: the comprehensive resource of mammalian protein complexes [Plot: number of proteins per protein complex] consists of 2100 protein complexes; covers ~3000 different proteins, representing 15% of protein coding genes in mammals Dmitrij Frishman, GSF

37 MKLNSHHIASNYEASKNFVNILQFEIRENYRSDKDSYKLDMVGSEQYASYP…. search for orthologues search family resources (orthologues and paralogues) analyse residue features predict transmembrane, disorder etc search interactions resources CORUM, IntAct, HPRD, BIND search for conserved residues Predict interactions STRING, DIMA G3D-BioMiner PROLINKS

38 Gene3D-BioMiner hiPPI: homology inherited Protein-Protein Interactions. CODA: Co-Occurrence of Domains Analysis. GECO: Gene Expression Correlation. PhyloTuner: Domain family co-evolution detection. Visualisation in Cytoscape. Adding known functional associations i.e. from FunCat. Weighted Integration

39 CODA: FUSED DOMAINS Species 1 Species 2 Method adapted from Enright, Ouzounis but a new scoring scheme has been developed BioMiner

40 Homology Inferred Protein Protein Interactions Inherit data provided by HPRD, IntAct, BIND, CORUM HiPPI: Protein-protein physical interaction data Superfamily A Superfamily B

41 Eisenberg Phylogenetic Profiles for Detecting Functional Associations Eisenberg phylogenetic profiles record the presence or absence of a superfamily in each organism; Gene3D Phylogenetic Occurrence Profiles record the number of sequence relatives from the CATH domain superfamily in each organism. Occurrence counts for organisms 1 to 4: Superfamily 1: 35 0 12 60; Superfamily 2: 12 13 14 11; Superfamily 3: 6 0 0 0. Presence/absence values as shown: 1 0 1 0 0 0 1 1. Functionally Linked

42 Eisenberg Phylogenetic Profiles for Detecting Functional Associations Eisenberg phylogenetic profiles record the presence or absence of a superfamily in each organism; Gene3D PhyloTuner Occurrence Profiles record the number of sequence relatives from the CATH domain superfamily in each organism. Occurrence counts for organisms 1 to 4: Superfamily 1: 7 0 3 0; Superfamily 2: 3 6 4 5; Superfamily 3: 6 0 2 0. Presence/absence values as shown: 1 0 1 0 0 0 1 1. Functionally Linked Ranea et al. (2007) PLOS Comp. Biol.

43 Phylo-Tuner algorithm: Phylogenetic Occurrence Profile Matrix
Domain Superfamilies clustered at different levels of sequence identity: Sup., S30, S35, S40, S50 … (S100)
Cluster Level   Genome Occurrence
                Sp1  Sp2  Sp3
Superfam.        8    7    3
s30(a)           6    4    2
s30(b)           2    3    1
s35(a)           6    4    2
s35(b)           2    3    1
s40(a)           6    4    2
s40(b)(i)        0    3    0
s40(b)(ii)       2    0    1
s50(a)           6    4    2
s50(b)(i)        0    3    0
s50(b)(ii)       2    0    1
…                …    …    …

44 Phylo-Tuner: Superfamily X and Superfamily Y are each clustered at Sup., S30, S35, S40, S50 … (S100), giving one occurrence profile per cluster (Cluster 1 … Cluster n) across species Sp1 … Spn. Cluster profiles from the two superfamilies are compared by Euclidean distance; a matching pair satisfies E match <<<< E all_rest. Zs calculations are based on E match. [Figure: two cluster-by-species occurrence matrices, one for Superfamily X and one for Superfamily Y]
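The comparison on this slide can be illustrated with a short Python sketch. It assumes that the Z-score contrasts the distance of the matched pair (E match) with the mean and standard deviation of all remaining distances (E all_rest); the slide's own formula is not legible in the transcript, so this form is an assumption, and the function names are hypothetical.

```python
import math
from statistics import mean, pstdev


def euclidean(profile_x, profile_y):
    """Euclidean distance between two occurrence profiles (one count per species)."""
    if len(profile_x) != len(profile_y):
        raise ValueError("profiles must cover the same species")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(profile_x, profile_y)))


def z_score(match_distance, rest_distances):
    """Assumed Z-score: how far E_match lies below the background distances."""
    sigma = pstdev(rest_distances)
    if sigma == 0:
        return 0.0  # background has no spread, so no meaningful score
    return (match_distance - mean(rest_distances)) / sigma


if __name__ == "__main__":
    # Toy profiles (made-up counts, not taken from the slide).
    cluster_x = [6, 0, 6, 9, 5]
    cluster_y = [6, 0, 5, 9, 5]
    background = [7.5, 9.0, 8.2, 10.1]
    e_match = euclidean(cluster_x, cluster_y)
    print(e_match, z_score(e_match, background))
```

Under this assumption a strongly negative score marks a pair of clusters whose profiles are far closer to each other than to everything else.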

45 Highly similar profiles correspond to pairs of families with significant similarity in GO functions [Plot: true positives, false positives and ratio of true positives to false positives; Biological process] Ranea et al. (2007) PLOS Comp. Biol.

46 Performance of Gene3D-BioMiner integrated methods assessed using a yeast genome dataset and semantic similarity of GO terms

47 Phylogenetic domain profiling PF1 100101110001110001 PF2 011100011101001100 PF3 100101110001110001 PF4 110001000100100000 PPI DDI (DPEA: Riley et al.) 460 completed genomes! 2 versions: Domain interactions derived from PDB Finn et al. 2005 Stein et al. 2005 SIMAP/BOINC for Pfam domain search known PPIs predicted PPIs
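The four binary profiles on this slide can be compared directly. The sketch below uses Hamming distance as a simple stand-in similarity measure; that choice, and the function names, are assumptions for illustration and not the scoring used by the method presented on the slide.

```python
from itertools import combinations

# Presence/absence profiles copied from the slide, one bit per genome.
PROFILES = {
    "PF1": "100101110001110001",
    "PF2": "011100011101001100",
    "PF3": "100101110001110001",
    "PF4": "110001000100100000",
}


def hamming(a: str, b: str) -> int:
    """Number of genomes in which two profiles disagree."""
    if len(a) != len(b):
        raise ValueError("profiles must cover the same genomes")
    return sum(x != y for x, y in zip(a, b))


def linked_pairs(profiles, max_distance=0):
    """Return family pairs whose profiles differ in at most max_distance genomes."""
    return [
        (m, n)
        for m, n in combinations(sorted(profiles), 2)
        if hamming(profiles[m], profiles[n]) <= max_distance
    ]
```

With the default `max_distance=0`, only PF1 and PF3 are paired, since their profiles are identical on the slide.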

48 STRING – functional protein interactions 378 genomes Interaction evidence Genomic context Primary experiments Pathway databases Literature mining New network viewer Confidence view vs. evidence view Miniature protein structures Peer Bork, EMBL

49 Denoising Protein Interaction Networks Protein interaction networks: over 2 million interactions in 184 genomes, previously uncharacterised. Filtering out promiscuous domains, excluding implausible interactions. Kamburov A et al. (2007) Denoising inferred functional association networks obtained by gene fusion analysis. BMC Genomics 8(1):460 Christos Ouzounis, CERTH

50 Evaluation of graph-based clustering algorithms for extracting complexes from protein interaction networks Evaluation protocol o Reference complexes: MIPS database o Test with altered networks: various proportions of random edge addition/removal. o Testing of all parametric conditions. o Definition of assessment statistics (Sensitivity, Positive Predictive Value, Accuracy) Reference network: MIPS complexes. Altered network (100% edge additions, 40% removal) Sylvain Brohée and Jacques van Helden (2006). BMC Bioinformatics 7: 488
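The "altered network" step of this protocol can be sketched as random edge removal and addition on an undirected graph. The function below is a hypothetical illustration: both fractions are taken relative to the original edge count, and removal is applied to original edges only. Those are assumptions and may differ from the published procedure.

```python
import random
from itertools import combinations


def alter_network(nodes, edges, add_fraction, remove_fraction, seed=0):
    """Return a copy of an undirected network with random edges removed and added."""
    rng = random.Random(seed)
    original = {frozenset(e) for e in edges}
    n_remove = round(len(original) * remove_fraction)
    n_add = round(len(original) * add_fraction)

    # Keep a random subset of the original edges (sorted first for reproducibility).
    ordered = sorted(original, key=sorted)
    kept = set(rng.sample(ordered, len(ordered) - n_remove))

    # Draw new edges only from node pairs that were not connected originally.
    candidates = [
        frozenset(pair)
        for pair in combinations(sorted(nodes), 2)
        if frozenset(pair) not in original
    ]
    added = set(rng.sample(candidates, min(n_add, len(candidates))))
    return kept | added
```

Calling `alter_network(nodes, edges, add_fraction=1.0, remove_fraction=0.4)` corresponds to the slide's example of 100% edge additions and 40% removal; a clustering algorithm would then be run on the result and scored against the reference complexes.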

51 Acknowledgements Protein Families: Michal Linial (HUJI, Jerusalem), Anne-Lise Veuthey (SIB, Switzerland), Dmitrij Frishman (GSF, Germany), Alfonso Valencia (CNIO, Spain). Feature Based Prediction: Gunnar von Heijne (STO, Sweden), Rita Casadio (UNIBO, Italy), David Jones (UCL, London), Soren Brunak (DTU, Denmark). Protein Interactions: Christos Ouzounis (CERTH, Greece), Jacques van Helden (ULB, Brussels)

52 Network Analysis Tools (NeAT) A toolbox for the analysis of networks, clusters and pathways o Graph-based clustering o Path finding o Graph comparisons o Graph randomization o Graph alteration o … Web site: http://rsat.scmbb.ulb.ac.be/neat/ Jacques van Helden, ULB

53 Network Analysis of QTLs in Mouse QTL1 QTL2 QTL3 Novel genes can be discovered describing the trait in question Maps protein interaction network to an inferred QTL network Assigns functional roles to protein subnetworks on the basis of the phenotypic traits they are mapped to Christos Ouzounis, CERTH

54 DASMI – Distributed Annotation System for Molecular Interactions Based on the Distributed Annotation System (DAS) Interaction servers and visualization clients DASMI web: Client for integration of protein and domain interactions and function, possible application of quality measures iPfam: Client for graphical visualization of various domain interaction data sets

55 Proportion of genome sequences which can be assigned to 2100 domain families of known structure in CATH

56 Conservation of enzyme function for homologous domains [Plot: conservation of EC number to 3 levels (%) and number of pairs of relatives against sequence identity, for same MDA and different MDA; example MDA: CATH-1, Pfam-1, Pfam-2] >50% sequence identity: 90% probability of having related functions. If the domains have the same multidomain architecture (MDA), >30% sequence identity: 90% probability of having related functions

