System approaches to the prediction of protein function

1 System approaches to the prediction of protein function
Søren Brunak, Center for Biological Sequence Analysis, Technical University of Denmark

2 40-60% proteins of unknown function in the human genome

3 Diverse functional categories of cell cycle regulated yeast proteins
Level 1 GO categories for 349 cell cycle regulated yeast genes. Only 95 of these belong to the "Cell Cycle" category (biological process).

4 Diverse functional categories for human nucleolus proteins
Level 1 GO categories for 148 human genes located in the nucleolus. Only 5 of these belong to the "Nucleolus" category (cellular component).

5 Pairwise alignment >carp Cyprinus carpio growth hormone 210 aa vs.

6 An enzyme (1AOZ) and a non-enzyme (1PLC) from the Cupredoxin superfamily


8 Transfer of functional information: in what space?
Recognize function in:
- Sequence space: sequence alignment
- Structure space: structural comparison
- Gene expression spaces: array data
- Interaction spaces: network/pathway extraction
- Paper space: text mining
- Protein feature space

9 Predict orphan protein function in feature space
Orphan sequences have to use the standard cellular machinery for sorting, post-translational modification, etc.
A similar pattern of modification may imply similar function.
Predict sequence attributes independently, e.g. local and global properties such as
- post-translational modifications
- localization signals
- degradation signals
- structure
- composition, length, isoelectric point, ...
Then integrate and correlate using neural networks.

10 Serine phosphorylation sites
Acceptor site    Pos.     Target
AKKG S EQES      S-10     PKA (1CMK)
GFGD S IEAQ      S-87     Ovalbumin (1OVA)
EVVG S AEAG      S-350    Ovalbumin (1OVA)
GDLG S CEFH      S-80     Cystatin (1CEW)
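The acceptor sites listed above are nine-residue windows centred on the serine. As a minimal sketch of how such windows could be cut out of a sequence before being passed to a site predictor, the following assumes a hypothetical helper `serine_windows` and a hypothetical padding character `X` for windows that run past the sequence ends; neither is taken from the original method.

```python
def serine_windows(sequence, flank=4, pad="X"):
    """Return (position, window) pairs for every serine in `sequence`.

    Positions are 1-based. Windows that extend beyond either end of the
    sequence are filled with `pad` (an illustrative convention only).
    """
    seq = sequence.upper()
    padded = pad * flank + seq + pad * flank
    windows = []
    for i, residue in enumerate(seq):
        if residue == "S":
            # Index i in the original sequence is index i + flank in the
            # padded string, so the window starts at i and spans 2*flank + 1.
            windows.append((i + 1, padded[i:i + 2 * flank + 1]))
    return windows


if __name__ == "__main__":
    for position, window in serine_windows("AKKGSEQES"):
        print(position, window)
```

Each window keeps the serine in the middle column, which is the layout used in the table above.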



13 Length distributions and functional role categories

14 Propeptide cleavage sites
Post-translational processing by limited proteolysis of inactive secretory precursors produces active proteins and peptides.
Furin-specific (a) and other proprotein convertase cleavage sites (b)
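For illustration only, candidate furin sites can be located with a plain motif scan for the minimal furin consensus R-X-[K/R]-R, where cleavage occurs after the final arginine. This is a simple pattern match, not the predictor behind the slide, and the name `furin_candidates` is hypothetical.

```python
import re

# Minimal furin consensus R-X-[K/R]-R. The lookahead lets overlapping
# motifs be reported as separate candidates.
FURIN_MOTIF = re.compile(r"(?=(R.[KR]R))")


def furin_candidates(sequence):
    """Return (p1_position, motif) pairs for each consensus match.

    `p1_position` is the 1-based position of the final arginine, i.e. the
    residue immediately before the putative cleavage point.
    """
    seq = sequence.upper()
    return [(m.start() + 4, m.group(1)) for m in FURIN_MOTIF.finditer(seq)]


if __name__ == "__main__":
    print(furin_candidates("AAARSKRGG"))
```

A motif scan of this kind over-predicts, since many consensus matches are never cleaved, which is one reason to learn the site context instead of matching a fixed pattern.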

15 PCs activate a large variety of proteins
Peptide hormones, neuropeptides, growth and differentiation factors, adhesion factors, receptors, blood coagulation factors, plasma proteins, extracellular matrix proteins, proteases, exogenous proteins such as coat glycoproteins from infectious viruses (e.g. HIV-1 and Influenza) and bacterial toxins (e.g. diphtheria and anthrax toxin). PCs play an essential role in many vital biological processes like embryonic development and neural function, and in viral and bacterial pathogenesis. PCs are implicated in pathologies such as cancer and neurodegenerative diseases.

16 Mucin-type O-glycosylation
N-acetylgalactosamine (GalNAc) α-1 linked to the hydroxyl group of a serine or threonine
Responsible for the high carbohydrate content of mucin proteins (>50% of the dry weight)
Mucins, the principal component of mucus, protect epithelial surfaces from dehydration, mechanical injury, proteases and pathogens
Mucin-type glycosylation contributes to this by changing the structure to a stiff, extended one and by charging the protein so that it binds more water

17 Mucin-type O-glycosylation site conservation

18 Positional preference of N-Glyc sites across cellular role categories

19 Functional classes predicted
Functional role (Monica Riley categories)
- The original scheme had 14 categories
- Reduced to 12 categories by skipping the category "other" and combining replication and transcription
Enzyme prediction
- Enzyme vs. non-enzyme
- Major enzyme class in the EC system
Gene Ontology
- A subset of classes can be predicted
Systems biology related categories
- For example "cell cycle regulated", secreted, nucleolar

20 Predicting Gene Ontology categories
The GO system is designed for proteins to belong to multiple classes rather than one
Different kinds of function can be annotated:
- Molecular function
- Biological process
- Cellular component
GO assigns the "function" at several levels of detail rather than only one

21 The concept of ProtFun
Predict as many biologically relevant features as we can from the sequence
Train artificial neural networks for each category
Assign a probability for each category from the NN outputs
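The last two steps can be sketched as follows: each category has its own small ensemble of networks, and the member outputs are averaged into one score per category. This is a sketch under stated assumptions, not the ProtFun implementation: a single logistic unit stands in for a trained network, all weights are made up, and the names `network_output` and `category_scores` are hypothetical. How raw scores are turned into calibrated probabilities is not shown.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def network_output(features, weights, bias):
    """Single logistic unit used here as a stand-in for a trained network."""
    return sigmoid(sum(w * f for w, f in zip(weights, features)) + bias)


def category_scores(features, ensembles):
    """Average the ensemble member outputs separately for each category.

    `ensembles` maps a category name to a list of (weights, bias) pairs,
    one pair per ensemble member.
    """
    scores = {}
    for category, members in ensembles.items():
        outputs = [network_output(features, w, b) for w, b in members]
        scores[category] = sum(outputs) / len(outputs)
    return scores


if __name__ == "__main__":
    # Hypothetical feature vector and weights, for illustration only.
    features = [0.8, 0.1, 0.4]
    ensembles = {
        "Enzyme": [([1.2, -0.5, 0.3], -0.2), ([0.9, -0.1, 0.6], 0.0)],
        "Transport_and_binding": [([-0.7, 1.1, 0.2], 0.1)],
    }
    for category, score in category_scores(features, ensembles).items():
        print(category, round(score, 3))
```

Because every category is scored independently, a protein can receive high scores in several categories at once, which suits multi-class schemes such as Gene Ontology.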




25 An enzyme (1AOZ) and a non-enzyme (1PLC) from the Cupredoxin superfamily

26 1AOZ and 1PLC predictions
# Functional category (columns: AOZ, PLC)
Amino_acid_biosynthesis
Biosynthesis_of_cofactors
Cell_envelope
Cellular_processes
Central_intermediary_metabolism
Energy_metabolism
Fatty_acid_metabolism
Purines_and_pyrimidines
Regulatory_functions
Replication_and_transcription
Translation
Transport_and_binding
# Enzyme/nonenzyme
Enzyme
Nonenzyme
# Enzyme class
Oxidoreductase (EC )
Transferase (EC )
Hydrolase (EC )
Lyase (EC )
Isomerase (EC )
Ligase (EC )

27 Similar structure different functions
Many structurally similar proteins have different functions
Two PDB structures from the Cupredoxin superfamily:
- 1AOZ is an ascorbate oxidase (enzyme)
- 1PLC performs electron transport (non-enzyme)
Despite their structural similarity, our method predicts both correctly

28 Performance on Gene Ontology categories (worst case)

29 Systems Biology – Whole system description
Focus on whole systems, rather than individual units
Requires identification of all units in the system
High diversity in biological systems
Inference of system features/functions from experimental data
Ultimate goal is in-silico modeling of the temporal aspects of the cell cycle in different organisms
Example: Eukaryotic Cell Cycle

30 Microarray identification of periodic genes
Synchronous yeast cells → DNA chips → gene expression → temporal expression: periodic or non-periodic?
Look for those with a periodic expression

31 Identification of periodically expressed genes
1) Visual inspection of expression profiles (Cho et al., 1998)
2) Fourier analysis and correlation with profiles of known genes (Spellman et al., 1998)
3) Statistical modeling (single pulse model) (Zhao et al., 2001)
Share of 104 known genes identified by the three methods, in the order above: 70%, 91%, 47%
Problems
- Cho uses non-objective criteria
- Spellman identifies too many genes
- Zhao identifies fewer than half of the previously identified cell cycle regulated genes
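The Fourier analysis in method 2 can be illustrated with a small score that measures how much of a mean-centred expression profile oscillates at the cell-cycle period. This is a minimal sketch, not the published scoring scheme: the function name `fourier_score`, the normalisation by the number of time points, and the example period are assumptions made for illustration.

```python
import math


def fourier_score(times, values, period):
    """Amplitude of the profile's Fourier component at the given period.

    `times` and `values` are equally long sequences of sampling times and
    expression values; `period` is the assumed cell-cycle length in the
    same units as `times`.
    """
    omega = 2.0 * math.pi / period
    mean = sum(values) / len(values)
    # Project the mean-centred profile onto cosine and sine at frequency omega.
    a = sum((v - mean) * math.cos(omega * t) for t, v in zip(times, values))
    b = sum((v - mean) * math.sin(omega * t) for t, v in zip(times, values))
    return math.hypot(a, b) / len(values)


if __name__ == "__main__":
    times = list(range(16))
    periodic = [math.sin(2.0 * math.pi * t / 8.0) for t in times]
    flat = [1.0] * len(times)
    print(fourier_score(times, periodic, period=8.0))
    print(fourier_score(times, flat, period=8.0))
```

A threshold on such a score decides which genes are called periodic, and the choice of threshold drives how many genes a method reports.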

32 Our novel strategy: sequence-based "machine learning approach"
6200 genes pass through a consistency filter:
- Periodic genes: positive set (97 sequences)
- Grey zone area (~5600 genes)
- Non-periodic genes: negative set (556 sequences)
Learn from the positive and negative sets

33 Protein similarity in "feature space"
Predicted features: Ser/Thr phosphorylation, Tyr phosphorylation, subcellular localization, N-linked glycosylation, O-linked glycosylation, PEST regions, transmembrane helices, signal peptides
Calculated features: aliphatic index, amino acid composition, number of positive amino acids, number of negative amino acids, extinction coefficient, instability index, isoelectric point, sequence length, hydrophobicity
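Several of the calculated features follow directly from the amino acid counts. The sketch below computes sequence length, the number of positive (Lys, Arg) and negative (Asp, Glu) residues, and the aliphatic index as commonly defined (mole percent of Ala + 2.9 × Val + 3.9 × (Ile + Leu)). The function name `calculated_features` and the choice of which residues count as charged are assumptions made for illustration; the original feature definitions may differ.

```python
from collections import Counter


def calculated_features(sequence):
    """Compute a few composition-derived features of a protein sequence."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    counts = Counter(seq)
    n = len(seq)

    def mole_percent(residue):
        return 100.0 * counts[residue] / n

    aliphatic_index = (
        mole_percent("A")
        + 2.9 * mole_percent("V")
        + 3.9 * (mole_percent("I") + mole_percent("L"))
    )
    return {
        "length": n,
        "positive": counts["K"] + counts["R"],  # Lys + Arg (assumed definition)
        "negative": counts["D"] + counts["E"],  # Asp + Glu (assumed definition)
        "aliphatic_index": aliphatic_index,
    }


if __name__ == "__main__":
    print(calculated_features("AVILKRDE"))
```

Features like these depend only on composition, so they can be computed for any orphan sequence without a homolog of known function.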

34 Prediction of cell cycle regulated genes from protein sequence

35 Protein features available
Predicted features: Ser/Thr phosphorylation, Tyr phosphorylation, subcellular localization, N-linked glycosylation, O-linked glycosylation, PEST regions, transmembrane helices, signal peptides
Calculated features: aliphatic index, amino acid composition, number of positive amino acids, number of negative amino acids, extinction coefficient, instability index, isoelectric point, sequence length, hydrophobicity
Discriminative protein features
Predicted features: Ser/Thr phosphorylation, subcellular localization, N-linked glycosylation, O-linked glycosylation, PEST regions
Calculated features: aliphatic index, number of positive amino acids, extinction coefficient, instability index, isoelectric point, sequence length

36 Features of cell cycle regulated genes used by neural net ensemble

37 Non-linear function prediction! Responds to single AA change


39 Top 250 genes predicted from the entire genome
Among the "top 250 predicted" genes not used for training are
- 75 previously identified as cell cycle regulated genes
- 175 new potentially cell cycle regulated genes
Functional grouping
Subcellular localization

40 Experimental validation results
More than 100 new periodic genes identified/validated
For many of them, a role in the cell cycle is supported by other sources of evidence
About 30% of them have no known functional role

41 High confidence set

42 The eukaryotic cell cycle
The cell division process is divided into four phases:
- G1: growth/synthesis
- S: replication of DNA
- G2: growth/synthesis
- M: mitosis/cell division

43 Temporal variation in feature space

44 S phase feature snapshot
40% into the cell cycle, the plot shows:
- High isoelectric point
- Many nuclear proteins
- Short proteins
- Low potential for N-glycosylation
- Low potential for Ser/Thr-phosphorylation
- Few PEST regions
- Low aliphatic index

45 S phase peaking genes

46 Identify areas where prediction approaches can clean up noisy experimental data
High-throughput proteomics data
DNA array data
Because experiments have their own constraints, prediction approaches can complement the experimental data
Generate hypotheses on the dynamics of protein feature space, e.g. the periodicity of the phospho-proteome



49 Acknowledgements
People at CBS: Lars Juhl Jensen, Ramneek Gupta, + 20 others, Karin Julenius (O-glyc conservation), Thomas Skøt Jensen (cell cycle), Ulrik de Lichtenberg (cell cycle), Rasmus Wernersson (Febit experiments), Jannick Bendtsen (SecretomeP), Lars Kiemer (NucleolusP), Anders Fausbøll (NucleolusP), Thomas Sicheritz-Pontén (new ProtFun method)
Febit AG: Peer Smith
CNB/CSIC, Madrid: Alfonso Valencia, Javier Tamames, Damien Devos
Gunnar von Heijne, Stockholm (SecretomeP)

50 References
L.J. Jensen, R. Gupta, N. Blom, D. Devos, J. Tamames, C. Kesmir, H. Nielsen, H.H. Stærfeldt, K. Rapacki, C. Workman, C.A.F. Andersen, S. Knudsen, A. Krogh, A. Valencia, and S. Brunak, "Prediction of human protein function from post-translational modifications and localization features", J. Mol. Biol., 319, , 2002.
L.J. Jensen, M. Skovgaard, and S. Brunak, "Prediction of novel archaeal enzymes from sequence derived features", Protein Sci., 11, , 2002.
L.J. Jensen, R. Gupta, H.-H. Stærfeldt, and S. Brunak, "Prediction of human protein function according to Gene Ontology categories", Bioinformatics, 19, , 2003.
L.J. Jensen, D.W. Ussery, and S. Brunak, "Functionality of system components: Conservation of protein function in protein feature space", Genome Res., Oct 14, 2003.
U. de Lichtenberg, T.S. Jensen, L.J. Jensen, and S. Brunak, "Protein feature based identification of cell cycle regulated proteins in yeast", J. Mol. Biol., 13, , 2003.
