FunCoup data integration and networks of functional coupling in eukaryotes Andrey Alexeyenko.


1 FunCoup data integration and networks of functional coupling in eukaryotes Andrey Alexeyenko

2 FunCoup is a data integration framework to discover functional coupling in eukaryotic proteomes with data from model organisms. [Diagram: is human protein A coupled to human protein B? Find orthologs* in mouse, worm, fly and yeast and collect their high-throughput evidence] * Remm M, Storm CE, Sonnhammer ELL (2001) Automatic clustering of orthologs and in-paralogs from pairwise species comparisons. J. Mol. Biol. 314:1041-1052.

3 FunCoup is a naïve Bayesian network (NBN). Bayesian inference, with C = "genes A and B are functionally coupled" and E = "genes A and B are co-expressed": P(C|E) = P(C) * P(E|C) / P(E)
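A minimal worked example of the Bayes formula on this slide. All probabilities below are hypothetical numbers chosen for illustration; they are not FunCoup estimates, and the function name `posterior` is an assumption.

```python
def posterior(prior_c: float, p_e_given_c: float, p_e_given_not_c: float) -> float:
    """Bayes' rule: P(C|E) = P(C) * P(E|C) / P(E)."""
    # P(E) by the law of total probability over "coupled" and "not coupled".
    p_e = prior_c * p_e_given_c + (1.0 - prior_c) * p_e_given_not_c
    return prior_c * p_e_given_c / p_e


# Hypothetical inputs: 1% of gene pairs are coupled; co-expression is seen
# in 60% of coupled pairs and in 5% of uncoupled pairs.
p_coupled_given_coexpressed = posterior(0.01, 0.60, 0.05)
print(round(p_coupled_given_coexpressed, 4))
```

With a small prior, even fairly strong evidence yields a modest posterior, which is one reason to accumulate many independent evidence types.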

4 Problem: Absolute probabilities of FC are intractable. The full Bayesian network is impossible. Solution: Naïve Bayesian network. Calculate a belief change instead (likelihood ratios, LR). [Diagram: the full network needs every pairwise conditional among nodes A, B, C, D, i.e. P(B|A), P(A|B), P(B|C), P(C|B), P(B|D), P(D|B), P(A|C), P(C|A), P(A|D), P(D|A), P(D|C), P(C|D); the naïve network scores the link A-B with one ratio P(E|+) / P(E|-) per evidence type]
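The likelihood-ratio idea can be sketched as follows. Under the naïve independence assumption, the ratios P(E|+) / P(E|-) of separate evidence types multiply, so their logarithms add. Using the natural logarithm, the function names, and the numbers below are all assumptions made for illustration; the slides do not state how FunCoup combines or scales the ratios.

```python
import math


def log_lr(p_e_given_pos: float, p_e_given_neg: float) -> float:
    """Log likelihood ratio log(P(E|+) / P(E|-)) for one piece of evidence."""
    return math.log(p_e_given_pos / p_e_given_neg)


def naive_bayes_score(evidence) -> float:
    """Sum the log-LRs of evidence items treated as independent."""
    return sum(log_lr(pos, neg) for pos, neg in evidence)


# Hypothetical (P(E|+), P(E|-)) pairs for three evidence types of one link A-B.
evidence = [(0.60, 0.05), (0.30, 0.10), (0.20, 0.25)]
score = naive_bayes_score(evidence)
print(round(score, 3))
```

A positive score means the evidence raises belief in coupling; the third item has a ratio below 1 and therefore lowers the total.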

5 Problem: How to establish optimal bridges between species? Solution: Via groups of orthologs that emerged through speciation. [Diagram labels: gene evolution, functional link]

6 Problem: In situations with multiple inparalogs, how to deal with alternative evidence? Solution: Treat ALL inparalogs equally, and choose the BEST value.
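One possible reading of "choose the BEST value" is sketched below: enumerate every inparalog pair that has a measurement and keep the maximum. Taking "best" to mean "largest" is an assumption, as are the function name and the gene identifiers.

```python
def best_evidence(inparalogs_a, inparalogs_b, values):
    """Return the best value over all inparalog pairs, or None if no pair has data."""
    candidates = [
        values[(a, b)]
        for a in inparalogs_a
        for b in inparalogs_b
        if (a, b) in values
    ]
    # Assumption: "best" means the largest value of the feature.
    return max(candidates) if candidates else None


# Hypothetical co-expression values for fly inparalogs of human proteins A and B.
values = {("flyA1", "flyB1"): 0.42, ("flyA2", "flyB1"): 0.81}
best = best_evidence(["flyA1", "flyA2"], ["flyB1", "flyB2"], values)
print(best)
```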

7 Problem: Collected features often carry the same information, which is poorly compatible with the NBN (it treats evidence as independent). Solution: Render the data uncorrelated with principal components analysis (PCA): PC1 = α11 X + α12 Y, PC2 = α21 X + α22 Y, where X is feature A, Y is feature B, and each point is a pair of proteins. [Plot: pairs of proteins in the X-Y plane with the principal axes overlaid]
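A minimal sketch of the decorrelation step for two features, using only the standard library. For two variables, PCA reduces to a rotation of the centred data by the angle that zeroes the covariance; the returned loadings play the role of the α coefficients. The feature values and function names are hypothetical, and missing values are not handled here.

```python
import math


def covariance(u, v):
    """Sample covariance of two equally long sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (n - 1)


def pca_2d(xs, ys):
    """Rotate centred (X, Y) onto its principal axes; returns (pc1, pc2, loadings)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cx = [x - mx for x in xs]
    cy = [y - my for y in ys]
    sxx, syy, sxy = covariance(xs, xs), covariance(ys, ys), covariance(xs, ys)
    # Rotation angle of the first principal axis of a 2x2 covariance matrix.
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    c, s = math.cos(theta), math.sin(theta)
    pc1 = [c * a + s * b for a, b in zip(cx, cy)]
    pc2 = [-s * a + c * b for a, b in zip(cx, cy)]
    return pc1, pc2, ((c, s), (-s, c))


# Hypothetical, strongly correlated features measured on six protein pairs.
feature_a = [0.10, 0.35, 0.40, 0.62, 0.80, 0.95]
feature_b = [0.20, 0.30, 0.55, 0.60, 0.85, 0.90]
pc1, pc2, loadings = pca_2d(feature_a, feature_b)
print(covariance(feature_a, feature_b), covariance(pc1, pc2))
```

The second printed covariance is zero up to rounding, so the two components can be fed to the NBN as separate evidence.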

8 Problem: A feature distribution shape may be unpredictable: hard to learn the “feature -> evidence” mapping. Solution: Render features discrete.
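Discretization can be sketched like this: cut the feature range into bins and estimate one likelihood ratio per bin from the positive and negative (or random) training pairs. The bin edges, the pseudocount of 1, and the function names are assumptions for illustration, not FunCoup's actual settings.

```python
from bisect import bisect_right


def bin_index(value, edges):
    """Index of the bin that holds value; edges are ascending inner cut points."""
    return bisect_right(edges, value)


def bin_likelihood_ratios(pos_values, neg_values, edges, pseudocount=1.0):
    """Per-bin P(bin|+) / P(bin|-) with a pseudocount to avoid division by zero."""
    n_bins = len(edges) + 1
    pos_counts = [pseudocount] * n_bins
    neg_counts = [pseudocount] * n_bins
    for v in pos_values:
        pos_counts[bin_index(v, edges)] += 1
    for v in neg_values:
        neg_counts[bin_index(v, edges)] += 1
    pos_total, neg_total = sum(pos_counts), sum(neg_counts)
    return [(p / pos_total) / (n / neg_total) for p, n in zip(pos_counts, neg_counts)]


# Hypothetical Pearson r values for coupled and for randomly picked pairs.
positives = [0.70, 0.80, 0.90, 0.65, 0.20]
negatives = [0.10, 0.20, 0.25, 0.40, 0.70]
lrs = bin_likelihood_ratios(positives, negatives, edges=[0.3, 0.6])
print(lrs)
```

A new pair is then scored by looking up the ratio of the bin its feature value falls into, whatever the shape of the underlying distribution.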

9 Problem: Distribution areas informative of FC may vary. Solution: Find them individually for each data set and FC class, accounting for the joint “feature – class” distribution. [Plot: positive (+) and negative (-) examples scattered along the Pearson r axis from 0 to 1]

10 Problem: Impossible to guarantee absence of FC in negative training sets. Solution: Replace negative sets with randomly picked ones. [Diagram: "positive set (coupled proteins) vs. negative set (not coupled proteins)" becomes "positive set vs. random set"]

11 Problem: Some LR are weak and arise due to non-representative sampling. Solution: Enforce a confidence check and remove insignificant nodes. [Diagram: each P(E|+) / P(E|-) node on the link A-B passes through a test]
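The slide does not say which test is used. As one possible stand-in, the sketch below applies a chi-square test on a 2x2 table (positive vs. negative pairs, inside vs. outside a feature bin) and keeps a node only if the statistic exceeds 3.841, the standard critical value for one degree of freedom at p = 0.05. The choice of test, the threshold, and all counts are assumptions.

```python
CHI2_CRITICAL_DF1_P05 = 3.841  # standard chi-square critical value, df = 1, p = 0.05


def chi_square_2x2(pos_in, pos_out, neg_in, neg_out):
    """Pearson chi-square statistic (no continuity correction) for a 2x2 table."""
    n = pos_in + pos_out + neg_in + neg_out
    row_pos, row_neg = pos_in + pos_out, neg_in + neg_out
    col_in, col_out = pos_in + neg_in, pos_out + neg_out
    if 0 in (row_pos, row_neg, col_in, col_out):
        return 0.0  # degenerate table: no evidence of association
    return n * (pos_in * neg_out - pos_out * neg_in) ** 2 / (
        row_pos * row_neg * col_in * col_out
    )


def keep_node(pos_in, pos_out, neg_in, neg_out):
    """Keep the LR node only if the bin separates the classes significantly."""
    return chi_square_2x2(pos_in, pos_out, neg_in, neg_out) >= CHI2_CRITICAL_DF1_P05


# Hypothetical counts: a clearly informative bin and a bin close to noise.
print(keep_node(40, 60, 10, 90), keep_node(11, 89, 10, 90))
```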

12 Problem: Definitions and notions of FC vary. Solution: Multinet. Decide which types of FC are needed (provide them as positive training sets) and perform the previous steps customized for each type. [Diagram: separate P(E|+) / P(E|-) networks for the link types A <> B, A | B and A || B]

13 FunCoup’s web interface New! Hooper S., Bork P. Medusa: a simple tool for interaction graph analysis. Bioinformatics. 2005 Dec 15;21(24):4432-3. Epub 2005 Sep 27. http://www.sbc.su.se/~andale/funcoup.html

14 Proteins of the Parkinson’s disease pathway (KEGG #05020). Multinet presents several link types in parallel: physical protein-protein interaction, “signaling” link, metabolic “non-signaling” link.

15 Multilateral data transfer. [Diagram: data from human, Ciona, worm, mouse, rat, fly, yeast and Arabidopsis pass through PCA into the NBN] Data from the same species is an important but not indispensable component of the framework. Hence, a network can be constructed for an organism with no experimental datasets at all.

16 FunCoup builds a network for an uncharacterized organism (C. intestinalis)
1. Build multi-species clusters of orthologs (e.g. human + C. intestinalis + D. melanogaster + C. elegans) [*]
2. Extend known metabolic pathway assignments to the novel organism (e.g. Ciona)
3. Collect well-studied organisms’ data
4. Using this data, train FunCoup on the set created in (2)
5. Test each pair of proteins in the novel organism for being coupled
*Alexeyenko A, Tamas I, Liu G, Sonnhammer ELL Automatic clustering of orthologs and inparalogs shared by multiple proteomes. Bioinformatics. 2006 15;22(14):e9-e15.

17 Reconstructing the “regulatory blueprint”* in C. intestinalis. Legend: proteins of the “Regulatory Blueprint for a Chordate Embryo” [*]; 18 links mentioned in [*] AND found by FunCoup; links found by FunCoup (about 140). The remaining 202 links from [*] that FunCoup did not find are not shown. *Imai KS, Levine M, Satoh N, Satou Y (2006) Regulatory blueprint for a chordate embryo. Science, 26:1183-7.

18 Set of outgoing links of the “regulatory blueprint”

19 …and a tight cluster from it. The Ciona genes were not described, but may receive this annotation via orthology.
Inferred KEGG pathways: 00562 Inositol phosphate metabolism; 00632 Benzoate degradation via CoA ligation; 00760 Nicotinate and nicotinamide metabolism; 04310 Wnt signaling pathway; 04330 Notch signaling pathway; 04350 TGF-beta signaling pathway; 04360 Axon guidance; 04510 Focal adhesion; 04512 ECM-receptor interaction; 04514 Cell adhesion molecules; 04520 Adherens junction; 04530 Tight junction; 04630 Jak-STAT signaling pathway; 04640 Hematopoietic cell lineage; 04670 Leukocyte transendothelial migration; 04810 Regulation of actin cytoskeleton
…and annotations of human orthologs: ADAM10; Myosin light chain 2; Cadherin EGF LAG seven-pass G-type receptor 2; Neurotrophic tyrosine kinase, receptor-related 3

20 The limits of data integration

21 Confidence estimation. Sensitivity (from a “gold standard” set of FC): Sens = TP / (TP + FN). Specificity (from a set of “No / not known FC”): Spec = TN / (TN + FP). Positive Predictive Value (from everything predicted by FunCoup): PPV = TP / (TP + FP). PPV answers the question: “How much should we trust the FunCoup predictions?”
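The three measures on this slide can be computed directly from confusion-matrix counts. The counts below are hypothetical, and the function name is an assumption.

```python
def confidence_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Sensitivity, specificity and positive predictive value from raw counts."""
    return {
        "sensitivity": tp / (tp + fn),  # share of true FC links that were found
        "specificity": tn / (tn + fp),  # share of non-FC pairs correctly rejected
        "ppv": tp / (tp + fp),          # share of predicted links that are true
    }


# Hypothetical counts at one score cut-off.
metrics = confidence_metrics(tp=80, fp=20, tn=880, fn=20)
print(metrics)
```

Because PPV depends on the ratio of coupled to uncoupled pairs in the evaluated set, it answers the "how much to trust a prediction" question only for a set with a realistic class balance.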

22

23

24 Correction of confidence by amount of evidence
1. Record the amount of evidence (AOE ~ number of non-empty values) that describes each pair of proteins A-B
2. Correct each final Bayesian score: FBS’(A-B) = FBS(A-B) + beta * (M(AOE) - AOE(A-B)), where beta is the slope of the linear regression FBS = alpha + beta * AOE
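A minimal sketch of the correction, assuming that M(AOE) denotes the mean AOE over all pairs (the slide does not define M) and that beta comes from ordinary least squares. Scores, AOE values, and function names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = alpha + beta * x; returns (alpha, beta)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx
    return mean_y - beta * mean_x, beta


def correct_scores(fbs, aoe):
    """FBS' = FBS + beta * (M(AOE) - AOE), with M taken as the mean (assumption)."""
    _, beta = fit_line(aoe, fbs)
    mean_aoe = sum(aoe) / len(aoe)
    return [score + beta * (mean_aoe - a) for score, a in zip(fbs, aoe)]


# Hypothetical pairs whose score grows purely with the amount of evidence.
fbs = [2.0, 4.0, 6.0, 8.0]
aoe = [1, 2, 3, 4]
corrected = correct_scores(fbs, aoe)
print(corrected)
```

In this toy case the scores differ only because of AOE, so the correction brings all four pairs to the same value; real scores would keep the part of their variation that AOE does not explain.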

25 Confidence saturated at FBS = 12.5

26 How are the yeast complex entities conserved? Log overlap between KEGG and Gavin et al., 2006

27 Conclusions http://FunCoup.sbc.su.se
- After the optimization, the naïve Bayesian network is well suited for collection/evaluation of sparse, diverse, and noisy features, and is, in itself, efficient to discover novel cases of FC
- Orthologs are optimal to transfer information across species
- The multiple class training enabled specific prediction of different types of functional coupling
- Across-species information flow is not symmetrical but reversible – hence the networks of uncharacterized proteomes
- In FunCoup's Bayesian output, no missing values exist, thus a multivariate classification technique may be applied as a post-processor

28 Acknowledgements: Erik Sonnhammer Tomas Ohlson Mats Lindskog Kristoffer Forslund Gabriel Östlund Kevin O’Brien Carsten Daub

29 Validation
Jack-knife procedure:
- Take “positive” and “negative” sets
- Split each randomly as 50:50
- Use the first parts to train the algorithm, the second to test the performance
- Repeat a number of times
Analysis Of VAriance:
- Introduce features A, B, C in the workflow of FunCoup (e.g., using PCA, selecting nodes of BN by relevance, ways of using ortholog data etc.)
- Run FunCoup with all possible combinations of absence/presence of A, B, C to produce a balanced and orthogonal ANOVA design with replicates
- Study effects of A, B, C or their combinations AxB, BxC, ..., AxBxC to see if they influence the performance significantly (whereas all other effects did not exist)
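Both validation set-ups can be sketched with the standard library: a random 50:50 split repeated several times, and a full factorial design over the on/off states of three workflow features. The function names, the fixed random seed, and the placeholder pair identifiers are assumptions; training and scoring themselves are left out.

```python
import itertools
import random


def split_half(items, rng):
    """Randomly split items 50:50 into a training part and a testing part."""
    shuffled = list(items)
    rng.shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]


def factorial_design(factors, replicates):
    """All absence/presence combinations of the factors, each run `replicates` times."""
    runs = []
    for levels in itertools.product([False, True], repeat=len(factors)):
        for replicate in range(replicates):
            runs.append((dict(zip(factors, levels)), replicate))
    return runs


rng = random.Random(0)  # fixed seed only to make the sketch reproducible
positive_pairs = list(range(10))  # placeholders for coupled protein pairs
for _ in range(3):  # "repeat a number of times"
    train, test = split_half(positive_pairs, rng)

design = factorial_design(["A", "B", "C"], replicates=2)
print(len(design))
```

Every factor is switched on in exactly half of the runs, which is what makes the design balanced and orthogonal.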

30 Estimating quality of prediction Sensitivity: TP / (TP + FN) 1 - Specificity: FP / (FP + TN) Individual points represent varying cut-offs
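The curve described on this slide can be traced by sweeping the score cut-off and recording (1 - specificity, sensitivity) at each step. The scores and labels below are hypothetical, and the function name is an assumption.

```python
def roc_points(scores, labels):
    """(1 - specificity, sensitivity) for every distinct score used as a cut-off."""
    n_pos = sum(1 for label in labels if label)
    n_neg = len(labels) - n_pos
    points = []
    for cutoff in sorted(set(scores), reverse=True):
        tp = sum(1 for s, label in zip(scores, labels) if s >= cutoff and label)
        fp = sum(1 for s, label in zip(scores, labels) if s >= cutoff and not label)
        points.append((fp / n_neg, tp / n_pos))
    return points


# Hypothetical scores for four pairs; True marks a known coupled pair.
scores = [0.9, 0.8, 0.7, 0.6]
labels = [True, False, True, False]
points = roc_points(scores, labels)
print(points)
```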

31

