System approaches to the prediction of protein function Søren Brunak Center for Biological Sequence Analysis Technical University of Denmark

1 System approaches to the prediction of protein function Søren Brunak Center for Biological Sequence Analysis Technical University of Denmark brunak@cbs.dtu.dk www.cbs.dtu.dk

2 40-60% of the proteins in the human genome are of unknown function

3 Diverse functional categories of cell cycle regulated yeast proteins Level 1 GO categories for 349 cell cycle regulated yeast genes. Only 95 of these belong to the Cell Cycle category (biological process).

4 Diverse functional categories for human nucleolus proteins Level 1 GO categories for 148 human genes located in the nucleolus. Only 5 of these belong to the Nucleolus category (cellular component).

5 Pairwise alignment
>carp Cyprinus carpio growth hormone, 210 aa vs. >chicken Gallus gallus growth hormone, 216 aa
Scoring matrix: BLOSUM50, gap penalties: -12/-2
40.6% identity; global alignment score: 487

carp    MA--RVLVLLSVVLVSLLVNQGRASDN-----QRLFNNAVIRVQHLHQLAAKMINDFEDSLLPEERRQLSKIFPLSFCNSD
chicken MAPGSWFSPLLIAVVTLGLPQEAAATFPAMPLSNLFANAVLRAQHLHLLAAETYKEFERTYIPEDQRYTNKNSQAAFCYSE

carp    YIEAPAGKDETQKSSMLKLLRISFHLIESWEFPSQSLSGTVSNSLTVGNPNQLTEKLADLKMGISVLIQACLDGQPNMDDN
chicken TIPAPTGKDDAQQKSDMELLRFSLVLIQSWLTPVQYLSKVFTNNLVFGTSDRVFEKLKDLEEGIQALMRELEDRSPR---G

carp    DSLPLP-FEDFYLTM-GENNLRESFRLLACFKKDMHKVETYLRVANCRRSLDSNCTL
chicken PQLLRPTYDKFDIHLRNEDALLKNYGLLSCFKKDLHKVETYLKVMKCRRFGESNCTI
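The identity percentage quoted for a pairwise alignment can be sketched as a simple column count. The function below is a minimal illustration, not the alignment program used for the slide: the name `percent_identity` is hypothetical, and it assumes the denominator is the full alignment length including gap columns, which may differ from the convention behind the reported figure.

```python
def percent_identity(row_a: str, row_b: str, gap: str = "-") -> float:
    """Percentage of alignment columns where both rows carry the same residue.

    Assumption: the denominator is the number of alignment columns,
    gap columns included.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    if not row_a:
        raise ValueError("alignment is empty")
    # A column counts as identical only if both residues match and are not gaps.
    matches = sum(1 for a, b in zip(row_a, row_b) if a == b and a != gap)
    return 100.0 * matches / len(row_a)


# First eight columns of the carp/chicken alignment: only M and A match.
print(percent_identity("MA--RVLV", "MAPGSWFS"))  # 25.0
```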

6 An enzyme (1AOZ) and a non-enzyme (1PLC) from the Cupredoxin superfamily

7 1AOZ (129 aa) vs. 1PLC (99 aa)
Scoring matrix: BLOSUM50, gap penalties: -12/-2
15.5% identity; global alignment score: -23

1AOZ SQIRHYKWEVEYMFWAPNCNENIVMGINGQFPGPTIRANAGDSVVVELTNKLHTEGVVIH
1PLC ---------IDVLLGA---DDGSLAFVPSEFS-----ISPGEKIVFK-NNAGFPHNIVFD

1AOZ WHGILQRGTPWADGTASISQCAINPGETFFYNFTVDNPGTFFYHGHLGMQRSAGLYGSLI
1PLC EDSI-PSGVDASKISMSEEDLLNAKGETFEVALSNKGEYSFYCSPHQG----AGMVGKVT

1AOZ VDPPQGKKE
1PLC VN-------

8 Transfer of functional information – in what space?
Recognize function in:
- Sequence space – sequence alignment
- Structure space – structural comparison
- Gene expression spaces – array data
- Interaction spaces – network/pathway extraction
- Paper space – text mining
- …
- Protein feature space

9 Predict orphan protein function in feature space
Orphan sequences have to use the standard cellular machinery for sorting, post-translational modification, etc.
A similar pattern of modification may imply similar function.
Predict sequence attributes independently, e.g. local and global properties such as
- post-translational modifications
- localization signals
- degradation signals
- structure
- composition, length, isoelectric point, …
Then integrate and correlate using neural networks.

10 Serine phosphorylation sites

Acceptor site   Pos.    Target
AKKG S EQES     S-10    PKA (1CMK)
GFGD S IEAQ     S-87    Ovalbumin (1OVA)
EVVG S AEAG     S-350   Ovalbumin (1OVA)
GDLG S CEFH     S-80    Cystatin (1CEW)

13 Length distributions and functional role categories

14 Propeptide cleavage sites
Post-translational processing by limited proteolysis of inactive secretory precursors produces active proteins and peptides.
Furin specific (a) and other proprotein convertase cleavage sites (b)

15 PCs activate a large variety of proteins
Peptide hormones, neuropeptides, growth and differentiation factors, adhesion factors, receptors, blood coagulation factors, plasma proteins, extracellular matrix proteins, proteases, exogenous proteins such as coat glycoproteins from infectious viruses (e.g. HIV-1 and Influenza) and bacterial toxins (e.g. diphtheria and anthrax toxin).
PCs play an essential role in many vital biological processes like embryonic development and neural function, and in viral and bacterial pathogenesis.
PCs are implicated in pathologies such as cancer and neurodegenerative diseases.

16 Mucin-type O-glycosylation
- N-acetylgalactosamine (GalNAc) α-1 linked to the hydroxyl group of a serine or threonine
- Responsible for the high carbohydrate content of mucin proteins (>50% of the dry weight)
- Mucins, the principal component of mucus, protect epithelial surfaces from dehydration, mechanical injury, proteases and pathogens
- Mucin-type glycosylation contributes to this by changing the structure to a stiff, extended one and by charging the protein so that it binds more water

17 Mucin-type O-glycosylation site conservation

18 Positional preference of N-Glyc sites across cellular role categories

19 Functional classes predicted
Functional role (Monica Riley categories)
- The original scheme had 14 categories
- Reduced to 12 categories by dropping the category "other" and combining replication and transcription
Enzyme prediction
- Enzyme vs non-enzyme
- Major enzyme class in the EC system
Gene Ontology
- A subset of classes can be predicted
Systems biology related categories
- For example cell cycle regulated, secreted, nucleolar

20 Predicting Gene Ontology categories
The GO system is designed for proteins to belong to multiple classes rather than one.
Different kinds of function can be annotated:
- Molecular function
- Biological process
- Cellular component
GO assigns the function at several levels of detail rather than only one.

21 The concept of ProtFun
- Predict as many biologically relevant features as we can from the sequence
- Train artificial neural networks for each category
- Assign a probability for each category from the NN outputs
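The last step, turning network outputs into one score per category, might look roughly like the sketch below. The slide does not say how the outputs are combined, so plain averaging over the networks of a category is an assumption here, and both the function name `category_probabilities` and the example numbers are hypothetical.

```python
from statistics import mean


def category_probabilities(ensemble_outputs: dict[str, list[float]]) -> dict[str, float]:
    """Average each category's outputs over the networks trained for it.

    Assumption: simple averaging; the actual combination rule is not
    given in the slides.
    """
    return {category: mean(outputs) for category, outputs in ensemble_outputs.items()}


# Hypothetical outputs in [0, 1] from three networks per category.
example = {
    "Enzyme": [0.80, 0.75, 0.70],
    "Nonenzyme": [0.20, 0.30, 0.25],
}
print(category_probabilities(example))
```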

25 An enzyme (1AOZ) and a non-enzyme (1PLC) from the Cupredoxin superfamily

26 1AOZ and 1PLC predictions

# Functional category              1AOZ   1PLC
Amino_acid_biosynthesis            0.126  0.070
Biosynthesis_of_cofactors          0.100  0.075
Cell_envelope                      0.429  0.032
Cellular_processes                 0.057  0.059
Central_intermediary_metabolism    0.063  0.041
Energy_metabolism                  0.126  0.268
Fatty_acid_metabolism              0.027  0.072
Purines_and_pyrimidines            0.439  0.088
Regulatory_functions               0.102  0.019
Replication_and_transcription      0.052  0.089
Translation                        0.079  0.150
Transport_and_binding              0.032  0.052

# Enzyme/nonenzyme                 1AOZ   1PLC
Enzyme                             0.773  0.310
Nonenzyme                          0.227  0.690

# Enzyme class                     1AOZ   1PLC
Oxidoreductase (EC 1.-.-.-)        0.077  0.077
Transferase (EC 2.-.-.-)           0.260  0.099
Hydrolase (EC 3.-.-.-)             0.114  0.071
Lyase (EC 4.-.-.-)                 0.025  0.020
Isomerase (EC 5.-.-.-)             0.010  0.068
Ligase (EC 6.-.-.-)                0.017  0.017
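One simple way to read the enzyme/nonenzyme scores is to pick the higher-scoring label for each protein. The sketch below does only that, using the two values per protein from the slide; the helper name `top_category` is hypothetical, and choosing the raw maximum is an assumption about how a call could be derived, not a description of the ProtFun decision rule.

```python
# Enzyme/nonenzyme scores copied from the slide.
scores = {
    "1AOZ": {"Enzyme": 0.773, "Nonenzyme": 0.227},
    "1PLC": {"Enzyme": 0.310, "Nonenzyme": 0.690},
}


def top_category(category_scores: dict[str, float]) -> str:
    """Return the label with the highest score (assumed decision rule)."""
    return max(category_scores, key=category_scores.get)


for protein, category_scores in scores.items():
    print(protein, top_category(category_scores))
# 1AOZ Enzyme
# 1PLC Nonenzyme
```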

27 Similar structure, different functions
Many examples exist of structurally similar proteins that have different functions.
Two PDB structures from the Cupredoxin superfamily:
- 1AOZ is an ascorbate oxidase (enzyme)
- 1PLC performs electron transport (non-enzyme)
Despite their structural similarity, our method predicts both correctly.

28 Performance on Gene Ontology categories (worst case)

29 Example: Eukaryotic Cell Cycle
Systems Biology – whole system description
- Focus on whole systems, rather than individual units
- Requires identification of all units in the system
- High diversity in biological systems
- Inference of system features/functions from experimental data
- Ultimate goal is in-silico modeling of the temporal aspects of the cell cycle in different organisms

30 Microarray identification of periodic genes
Synchronous yeast cells, DNA chips, gene expression, temporal expression.
Look for those with a periodic expression: periodic or non-periodic?

31 Identification of periodically expressed genes
Previous approaches:
1) Visual inspection of expression profiles (Cho et al., 1998)
2) Fourier analysis and correlation with profiles of known genes (Spellman et al., 1998)
3) Statistical modeling (single pulse model) (Zhao et al., 2001)
Figure labels: 70%, 91%, 47%; 104 known genes
Problems:
- Cho uses non-objective criteria
- Spellman identifies too many genes
- Zhao identifies fewer than half of the previously identified cell cycle regulated genes
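The Fourier analysis mentioned for approach 2 can be illustrated with the magnitude of a single Fourier component at the assumed cell cycle period. This is only a sketch of that one ingredient, not the published procedure, which also involves correlation with known profiles. The function name `fourier_score`, the 60-minute period and the 10-minute sampling are assumptions chosen for the example.

```python
import math


def fourier_score(times, values, period):
    """Magnitude of the Fourier component of an expression profile at one period."""
    omega = 2.0 * math.pi / period
    sin_sum = sum(math.sin(omega * t) * x for t, x in zip(times, values))
    cos_sum = sum(math.cos(omega * t) * x for t, x in zip(times, values))
    return math.hypot(sin_sum, cos_sum)


# Hypothetical sampling: every 10 minutes over two 60-minute cycles.
times = list(range(0, 120, 10))
periodic = [math.cos(2.0 * math.pi * t / 60.0) for t in times]
flat = [1.0] * len(times)

# A periodic profile scores high, a constant profile scores close to zero.
print(fourier_score(times, periodic, 60.0) > fourier_score(times, flat, 60.0))  # True
```

A high score says only that the profile carries signal at the chosen period; deciding where to place the cut-off is the step that determines how many genes are called periodic.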

32 Our novel strategy: sequence based machine learning approach
- 6200 genes pass through a consistency filter
- Positive set: periodic genes (97 sequences)
- Negative set: non-periodic genes (556 sequences)
- Grey zone area (~5600 genes): periodic or non-periodic?
- Learn from the positive and negative sets

33 Protein similarity in feature space
Predicted features
- Ser/Thr phosphorylation
- Tyr phosphorylation
- Subcellular localization
- N-linked glycosylation
- O-linked glycosylation
- PEST regions
- Transmembrane helices
- Signal peptides
Calculated features
- Aliphatic index
- Amino acid composition
- Number of positive amino acids
- Number of negative amino acids
- Extinction coefficient
- Instability index
- Isoelectric point
- Sequence length
- Hydrophobicity
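Several of the calculated features can be derived directly from the amino acid sequence. The sketch below covers a subset under stated assumptions: positive residues are taken as Lys and Arg, negative residues as Asp and Glu, the aliphatic index uses Ikai's formula on mole percentages, and hydrophobicity is the mean Kyte-Doolittle value. The function name `calculated_features` is hypothetical, and the slides do not specify which definitions were actually used.

```python
from collections import Counter

# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def calculated_features(sequence: str) -> dict:
    """Compute a few global sequence features (assumed definitions, see above)."""
    seq = sequence.upper()
    if not seq or any(aa not in KYTE_DOOLITTLE for aa in seq):
        raise ValueError("sequence must be non-empty and use the 20 standard residues")
    n = len(seq)
    counts = Counter(seq)

    def mole_percent(aa: str) -> float:
        return 100.0 * counts[aa] / n

    return {
        "length": n,
        "composition": {aa: counts[aa] / n for aa in sorted(counts)},
        "positive": counts["K"] + counts["R"],
        "negative": counts["D"] + counts["E"],
        # Ikai's aliphatic index: Ala + 2.9 * Val + 3.9 * (Ile + Leu), in mole percent.
        "aliphatic_index": mole_percent("A") + 2.9 * mole_percent("V")
        + 3.9 * (mole_percent("I") + mole_percent("L")),
        # Mean Kyte-Doolittle hydropathy over all residues.
        "hydrophobicity": sum(KYTE_DOOLITTLE[aa] for aa in seq) / n,
    }


# The 9-residue PKA acceptor-site window from the phosphorylation slide.
features = calculated_features("AKKGSEQES")
print(features["length"], features["positive"], features["negative"])  # 9 2 2
```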

34 Prediction of cell cycle regulated genes from protein sequence

35 Protein features available
Calculated features
- Aliphatic index
- Amino acid composition
- Number of positive amino acids
- Number of negative amino acids
- Extinction coefficient
- Instability index
- Isoelectric point
- Sequence length
- Hydrophobicity
Predicted features
- Ser/Thr phosphorylation
- Tyr phosphorylation
- Subcellular localization
- N-linked glycosylation
- O-linked glycosylation
- PEST regions
- Transmembrane helices
- Signal peptides
Discriminative protein features
Predicted features
- Ser/Thr phosphorylation
- Subcellular localization
- N-linked glycosylation
- O-linked glycosylation
- PEST regions
Calculated features
- Aliphatic index
- Number of positive amino acids
- Extinction coefficient
- Instability index
- Isoelectric point
- Sequence length

36 Features of cell cycle regulated genes used by neural net ensemble

37 Non-linear function prediction! Responds to single AA change

39 Top 250 genes predicted from the entire genome
Panels: subcellular localization, functional grouping
Among the top 250 predicted genes not used for training are
- 75 previously identified as cell cycle regulated genes
- 175 new, potentially cell cycle regulated genes

40 Experimental validation results
- More than 100 new periodic genes identified/validated
- For many of them, a role in the cell cycle is supported by other sources of evidence
- About 30% of them have no known functional role

41 High confidence set

42 The eukaryotic cell cycle
The cell division process is divided into four phases:
- G1: growth/synthesis
- S: replication of DNA
- G2: growth/synthesis
- M: mitosis/cell division

43 Temporal variation in feature space

44 S phase feature snapshot
S phase? 40% into the cell cycle the plots show:
- High isoelectric point
- Many nuclear proteins
- Short proteins
- Low potential for N-glycosylation
- Low potential for Ser/Thr phosphorylation
- Few PEST regions
- Low aliphatic index

45 S phase peaking genes

46 Identify areas where prediction approaches can clean up noisy experimental data
- High-throughput proteomics data
- DNA array data
Because experiments have their own constraints, prediction approaches can complement the experimental data.
Generate hypotheses on the dynamics of protein feature space, e.g. the periodicity of the phospho-proteome.

49 Acknowledgements
People at CBS
- Lars Juhl Jensen
- Ramneek Gupta
- + 20 others
- Karin Julenius (O-glyc conservation)
- Thomas Skøt Jensen (cell cycle)
- Ulrik de Lichtenberg (cell cycle)
- Rasmus Wernersson (Febit experiments)
- Jannick Bendtsen (SecretomeP)
- Lars Kiemer (NucleolusP)
- Anders Fausbøll (NucleolusP)
- Thomas Sicheritz-Pontén (new ProtFun method)
Febit AG
- Peer Smith
CNB/CSIC, Madrid
- Alfonso Valencia
- Javier Tamames
- Damien Devos
Gunnar von Heijne, Stockholm (SecretomeP)

50 References
www.cbs.dtu.dk/services/Protfun
www.cbs.dtu.dk/cellcycle
- L.J. Jensen, R. Gupta, N. Blom, D. Devos, J. Tamames, C. Kesmir, H. Nielsen, H.H. Stærfeldt, K. Rapacki, C. Workman, C.A.F. Andersen, S. Knudsen, A. Krogh, A. Valencia, and S. Brunak, "Prediction of human protein function from post-translational modifications and localization features", J. Mol. Biol., 319, 1257-1265, 2002.
- L.J. Jensen, M. Skovgaard, and S. Brunak, "Prediction of novel archaeal enzymes from sequence derived features", Protein Sci., 11, 2894-2898, 2002.
- L.J. Jensen, R. Gupta, H.-H. Stærfeldt, and S. Brunak, "Prediction of human protein function according to Gene Ontology categories", Bioinformatics, 19, 635-642, 2003.
- L.J. Jensen, D.W. Ussery, and S. Brunak, "Functionality of system components: Conservation of protein function in protein feature space", Genome Res., Oct 14, 2003.
- U. de Lichtenberg, T.S. Jensen, L.J. Jensen, and S. Brunak, "Protein feature based identification of cell cycle regulated proteins in yeast", J. Mol. Biol., 13, 663-674, 2003.

