1 System approaches to the prediction of protein function. Søren Brunak, Center for Biological Sequence Analysis, Technical University of Denmark
2 40-60% of the proteins in the human genome are of unknown function
3 Diverse functional categories of cell cycle regulated yeast proteins Level 1 GO categories for 349 cell cycle regulated yeast genes. Only 95 of these belong to the ”Cell Cycle” category (biological process).
4 Diverse functional categories for human nucleolus proteins Level 1 GO categories for 148 human genes located in the nucleolus. Only 5 of these belong to the ”Nucleolus” category (cellular component).
8 Transfer of functional information – in what space?
Recognize function in:
- Sequence space – sequence alignment
- Structure space – structural comparison
- Gene expression spaces – array data
- Interaction spaces – network/pathway extraction
- Paper space – text mining
- …
- Protein feature space
9 Predict orphan protein function in feature space
- Orphan sequences have to use the standard cellular machinery for sorting, post-translational modification, etc.
- A similar pattern of modification may imply a similar function
- Predict sequence attributes independently, e.g. local and global properties such as:
  - post-translational modifications
  - localization signals
  - degradation signals
  - structure
  - composition, length, isoelectric point, …
- Then integrate and correlate using neural networks
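As a rough illustration of the "global properties" mentioned above, the sketch below computes sequence length, amino acid composition, and a count of N-X-S/T sequons (X not proline) directly from a sequence. The function name `global_features` and the returned keys are hypothetical. The sequon count is only a crude stand-in for a real N-glycosylation predictor, not the method used in the talk.

```python
from collections import Counter


def global_features(seq: str) -> dict:
    """Compute a few simple global features of a protein sequence.

    Illustrative only: the feature names are assumptions, and the sequon
    count is a crude proxy for a trained N-glycosylation predictor.
    """
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    n = len(seq)
    counts = Counter(seq)
    # Fraction of each residue type present in the sequence.
    composition = {aa: counts[aa] / n for aa in sorted(counts)}
    # N-X-[S/T] sequons, where X is any residue except proline.
    sequons = sum(
        1
        for i in range(n - 2)
        if seq[i] == "N" and seq[i + 1] != "P" and seq[i + 2] in "ST"
    )
    return {"length": n, "composition": composition, "n_glyc_sequons": sequons}
```

Each feature is computed independently of the others, which mirrors the idea of predicting attributes separately before integrating them.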
10 Serine phosphorylation sites
Acceptor site   Pos.    Target
AKKG S EQES     S-10    PKA (1CMK)
GFGD S IEAQ     S-87    Ovalbumin (1OVA)
EVVG S AEAG     S-350   Ovalbumin (1OVA)
GDLG S CEFH     S-80    Cystatin (1CEW)
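The acceptor sites above are shown as nine-residue windows centered on the serine. A minimal sketch of how such windows could be extracted as input for a site predictor follows. The function name `serine_windows`, the flank size of 4, and the `-` padding character are assumptions made for illustration.

```python
def serine_windows(seq: str, flank: int = 4, pad: str = "-") -> list[tuple[int, str]]:
    """Return (1-based position, window) for every serine in seq.

    Each window holds `flank` residues on either side of the serine.
    Windows that run past the sequence ends are padded with `pad`.
    """
    seq = seq.upper()
    padded = pad * flank + seq + pad * flank
    width = 2 * flank + 1
    return [
        (i + 1, padded[i : i + width])
        for i, aa in enumerate(seq)
        if aa == "S"
    ]


# The first acceptor site from the slide, written without spaces.
print(serine_windows("AKKGSEQES"))
```

Fixed-width windows of this kind are a common input encoding for per-residue classifiers. The padding keeps serines near the termini usable.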
13 Length distributions and functional role categories
14 Propeptide cleavage sites
Post-translational processing by limited proteolysis of inactive secretory precursors produces active proteins and peptides.
Furin-specific (a) and other proprotein convertase cleavage sites (b).
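For orientation, candidate furin-type sites can be located with a simple motif scan. The sketch below assumes the commonly cited consensus R-X-[K/R]-R, with cleavage after the final arginine. This consensus is not stated on the slide, and a plain regular expression is far less selective than a trained predictor. The names `FURIN_MOTIF` and `candidate_cleavage_sites` are hypothetical.

```python
import re

# Assumed consensus R-X-[K/R]-R; the lookahead lets overlapping motifs match.
FURIN_MOTIF = re.compile(r"(?=(R.[KR]R))")


def candidate_cleavage_sites(seq: str) -> list[tuple[int, str]]:
    """Return (position, motif) pairs for candidate furin-type sites.

    `position` is the 1-based index of the last motif residue, i.e. the
    residue immediately before the assumed cleavage point.
    """
    seq = seq.upper()
    return [(m.start() + 4, m.group(1)) for m in FURIN_MOTIF.finditer(seq)]
```

A scan like this will over-predict, since many motif matches are not cleaved in vivo. It is best read as a candidate filter.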
15 PCs activate a large variety of proteins
- Peptide hormones, neuropeptides, growth and differentiation factors, adhesion factors, receptors, blood coagulation factors, plasma proteins, extracellular matrix proteins, proteases, exogenous proteins such as coat glycoproteins from infectious viruses (e.g. HIV-1 and Influenza) and bacterial toxins (e.g. diphtheria and anthrax toxin).
- PCs play an essential role in many vital biological processes like embryonic development and neural function, and in viral and bacterial pathogenesis.
- PCs are implicated in pathologies such as cancer and neurodegenerative diseases.
16 Mucin-type O-glycosylation
- N-acetylgalactosamine (GalNAc) α-1 linked to the hydroxyl group of a serine or threonine
- Responsible for the high carbohydrate content of mucin proteins (>50% of the dry weight)
- Mucins, the principal component of mucus, protect epithelial surfaces from dehydration, mechanical injury, proteases and pathogens
- Mucin-type glycosylation contributes to this by changing the protein structure to a stiff, extended one and by charging the protein so that it binds more water
18 Positional preference of N-Glyc sites across cellular role categories
19 Functional classes predicted
- Functional role (Monica Riley categories)
  - The original scheme had 14 categories
  - Reduced to 12 categories by skipping the category ”other” and combining replication and transcription
- Enzyme prediction
  - Enzyme vs non-enzyme
  - Major enzyme class in the EC system
- Gene Ontology
  - A subset of classes can be predicted
- Systems biology related categories
  - For example ’cell cycle regulated’, secreted, nucleolar
20 Predicting Gene Ontology categories
- The GO system is designed for proteins to belong to multiple classes rather than one
- Different kinds of function can be annotated:
  - Molecular function
  - Biological process
  - Cellular component
- GO assigns the ”function” at several levels of detail rather than only one
21 The concept of ProtFun
- Predict as many biologically relevant features as we can from the sequence
- Train artificial neural networks for each category
- Assign a probability for each category from the NN outputs
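The three steps above can be sketched as one independent scorer per category, each mapping the same feature vector to a value between 0 and 1. The sketch uses a single logistic unit per category as a stand-in for a trained neural network. The feature names, category names, and weights are all hypothetical and chosen only to make the example run. They are not taken from ProtFun.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical, hand-picked parameters; a real system would learn these
# by training one network per category on annotated sequences.
CATEGORY_MODELS = {
    "enzyme": {
        "bias": -0.5,
        "weights": {"phos_potential": 0.5, "log_length": 0.8},
    },
    "cell_cycle_regulated": {
        "bias": -1.0,
        "weights": {"phos_potential": 1.2, "n_glyc_potential": -0.7},
    },
}


def category_scores(features: dict[str, float]) -> dict[str, float]:
    """Score every category independently from one feature vector.

    Missing features are treated as 0.0. Scores need not sum to 1,
    because a protein may belong to several categories at once.
    """
    scores = {}
    for category, model in CATEGORY_MODELS.items():
        activation = model["bias"] + sum(
            weight * features.get(name, 0.0)
            for name, weight in model["weights"].items()
        )
        scores[category] = sigmoid(activation)
    return scores
```

Scoring categories independently, rather than with one softmax over all of them, fits classification schemes in which a protein can carry several labels.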
27 Similar structure, different functions
- Many examples exist of structurally similar proteins that have different functions
- Two PDB structures from the cupredoxin superfamily:
  - 1AOZ is an ascorbate oxidase (enzyme)
  - 1PLC performs electron transport (non-enzyme)
- Despite their structural similarity, our method predicts both correctly
28 Performance on Gene Ontology categories (worst case)
29 Systems Biology – Whole system description
- Focus on whole systems, rather than individual units
- Requires identification of all units in the system
- High diversity in biological systems
- Inference of system features/functions from experimental data
- Ultimate goal is in-silico modeling of the temporal aspects of the cell cycle in different organisms
- Example: Eukaryotic Cell Cycle
30 Microarray identification of periodic genes
[Figure: synchronous yeast cells, DNA chips, gene expression, temporal expression profiles marked as periodic or non-periodic]
Look for the genes with a periodic expression
31 Identification of periodically expressed genes
1) Visual inspection of expression profiles (Cho et al., 1998)
2) Fourier analysis and correlation with profiles of known genes (Spellman et al., 1998)
3) Statistical modeling (single pulse model) (Zhao et al., 2001)
Of 104 known genes, the three methods identify 70%, 91% and 47% (in the order listed).
Problems:
- Cho uses non-objective criteria
- Spellman identifies too many genes
- Zhao identifies less than half of the previously identified cell cycle regulated genes
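To make the idea of Fourier analysis concrete, the sketch below scores a single expression time course by the magnitude of its Fourier component at an assumed cell cycle period. It is a simplified illustration, not the procedure of any of the cited papers. The function name `fourier_score` and the example period of 60 minutes are assumptions.

```python
import math


def fourier_score(times: list[float], values: list[float], period: float) -> float:
    """Magnitude of the Fourier component of a profile at a given period.

    The profile is mean-centered first, so a constant (non-periodic)
    profile scores exactly zero.
    """
    if len(times) != len(values) or not values:
        raise ValueError("times and values must be non-empty and equal in length")
    mean = sum(values) / len(values)
    omega = 2.0 * math.pi / period
    sin_sum = sum((v - mean) * math.sin(omega * t) for t, v in zip(times, values))
    cos_sum = sum((v - mean) * math.cos(omega * t) for t, v in zip(times, values))
    return math.hypot(sin_sum, cos_sum)


# Two assumed 60-minute cycles sampled every 10 minutes.
times = [float(t) for t in range(0, 120, 10)]
periodic = [math.sin(2.0 * math.pi * t / 60.0) for t in times]
flat = [1.0] * len(times)
print(fourier_score(times, periodic, 60.0), fourier_score(times, flat, 60.0))
```

Ranking genes by such a score requires a cutoff, and the choice of cutoff determines how many genes are called periodic.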
39 Top 250 genes predicted from the entire genome
Among the ”top 250 predicted” genes not used for training are:
- 75 previously identified as cell cycle regulated genes
- 175 new potentially cell cycle regulated genes
[Figure panels: functional grouping; subcellular localization]
40 Experimental validation results
- More than 100 new periodic genes identified/validated
- For many of them, a role in the cell cycle is supported by other sources of evidence
- About 30% of them have no known functional role
44 S phase feature snapshot
40% into the cell cycle the plots show:
- High isoelectric point
- Many nuclear proteins
- Short proteins
- Low potential for N-glycosylation
- Low potential for Ser/Thr-phosphorylation
- Few PEST regions
- Low aliphatic index
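Of the features listed, the aliphatic index has a simple closed form. The sketch below assumes the standard definition by Ikai, which weights the mole percentages of alanine, valine, isoleucine and leucine. The slide does not say which definition was used, so treat this as one plausible reading. The function name `aliphatic_index` is hypothetical.

```python
def aliphatic_index(seq: str) -> float:
    """Aliphatic index: X(Ala) + 2.9 * X(Val) + 3.9 * (X(Ile) + X(Leu)).

    X(aa) is the mole percentage of residue aa in the sequence.
    """
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    n = len(seq)

    def mole_percent(aa: str) -> float:
        return 100.0 * seq.count(aa) / n

    return (
        mole_percent("A")
        + 2.9 * mole_percent("V")
        + 3.9 * (mole_percent("I") + mole_percent("L"))
    )
```

A low value indicates relatively few aliphatic side chains, which is the property the snapshot reports for proteins peaking in S phase.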
46 Identify areas where prediction approaches can clean up noisy experimental data
- High-throughput proteomics data
- DNA array data
Strength of prediction approaches can indeed be complementary to the experimental data due to experimental constraints.
Generate hypotheses on the dynamics of protein feature space, e.g. the periodicity of the phospho-proteome.
49 Acknowledgements
People at CBS Lars Juhl Jensen Ramneek Gupta Febit AG + 20 others
- Karin Julenius (O-glyc conservation)
- Thomas Skøt Jensen (cell cycle)
- Ulrik de Lichtenberg (cell cycle)
- Rasmus Wernersson (Febit experiments)
- Jannick Bendtsen (SecretomeP)
- Lars Kiemer (NucleolusP)
- Anders Fausbøll (NucleolusP)
- Thomas Schiritz-Ponten (new ProFun method)
Febit AG: Peer Smith
CNB/CSIC, Madrid: Alfonso Valencia, Javier Tamames, Damien Devos
Gunnar von Heijne, Stockholm (SecretomeP)
50 References
www.cbs.dtu.dk/services/Protfun
www.cbs.dtu.dk/cellcycle
L.J. Jensen, R. Gupta, N. Blom, D. Devos, J. Tamames, C. Kesmir, H. Nielsen, H.H. Stærfeldt, K. Rapacki, C. Workman, C.A.F. Andersen, S. Knudsen, A. Krogh, A. Valencia, and S. Brunak, "Prediction of human protein function from post-translational modifications and localization features", J. Mol. Biol., 319, , 2002.
L.J. Jensen, M. Skovgaard, and S. Brunak, "Prediction of novel archaeal enzymes from sequence derived features", Protein Sci., 11, , 2002.
L.J. Jensen, R. Gupta, H.-H. Stærfeldt, and S. Brunak, "Prediction of human protein function according to Gene Ontology categories", Bioinformatics, 19, , 2003.
L.J. Jensen, D.W. Ussery, and S. Brunak, "Functionality of system components: Conservation of protein function in protein feature space", Genome Res., Oct 14, 2003.
U. de Lichtenberg, T.S. Jensen, L.J. Jensen, and S. Brunak, "Protein feature based identification of cell cycle regulated proteins in yeast", J. Mol. Biol., 13, , 2003.