
1 Statistical Data Fusion to Prioritize Lists of Genes
Bert Coessens, Stein Aerts
Departement ESAT - SCD, Katholieke Universiteit Leuven
Promotor: Bart De Moor
Assessor: Yves Moreau
Database Issues in Biological Databases (DBiBD), January 8-9, 2005

2 Context
Linkage Analysis, Positional Cloning, High-throughput technologies
[Figure labels: NEFL, RAB7, GARS, GJB1, LMNA]

3 Concept
Pathology / Biological process / …
- Gene Expression
- Literature
- Anatomical Expression
- Gene Regulation
- Protein Domains
- Functional Annotation
- Evolutionary Conservation
- …

4 Concept
Model with multiple submodels
- Training genes (training set): choose submodels, then TRAIN
- Candidate genes (test set): SCORE gene i, giving one ranking for each submodel
- Order statistics: combined ranking
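As a rough illustration of the "one ranking for each submodel" step, the sketch below turns per-submodel scores into rank ratios (rank divided by the number of ranked genes). The function name, the dictionary layout, and the convention that a higher score is better are assumptions made for illustration; the slides do not specify them.

```python
def rank_ratios(scores_by_submodel):
    """Map {submodel: {gene: score}} to {gene: [rank ratio per submodel]}."""
    ratios = {}
    for submodel, scores in scores_by_submodel.items():
        # Assumption: a higher score means a better match to the trained submodel.
        # Ties keep their input order because Python's sort is stable.
        ordered = sorted(scores, key=scores.get, reverse=True)
        for position, gene in enumerate(ordered, start=1):
            ratios.setdefault(gene, []).append(position / len(ordered))
    return ratios


# Hypothetical scores for two candidate genes under two submodels.
example = {
    "expression": {"geneA": 0.9, "geneB": 0.1},
    "annotation": {"geneA": 0.2, "geneB": 0.8},
}
print(rank_ratios(example))
```

Each gene ends up with one rank ratio per submodel, which is the input the order statistics step expects.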

5 Order Statistics
Given a set of n rank ratios for gene i, what is the probability of getting these ratios by chance alone?
Joint probability density function of all n order statistics:
Complexity: O(n^2)
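The formula on this slide did not survive in the transcript. One known way to evaluate such a probability in O(n^2) is the recursion sketched below, which computes the chance that the sorted rank ratios of n independent uniform variables all fall at or below the observed sorted ratios. Whether this is the exact expression shown on the slide is an assumption, and the names `q_statistic` and `v` are illustrative.

```python
from math import factorial


def q_statistic(rank_ratios):
    """Probability that n uniform order statistics are all <= the observed ones.

    Uses V_0 = 1, V_k = sum_{i=1..k} (-1)^(i-1) * V_(k-i) / i! * r_(n-k+1)^i,
    and Q = n! * V_n, with r sorted in ascending order.
    """
    r = sorted(rank_ratios)
    n = len(r)
    v = [1.0] + [0.0] * n
    for k in range(1, n + 1):
        x = r[n - k]  # r_(n-k+1) in 1-based notation
        v[k] = sum(
            (-1) ** (i - 1) * v[k - i] / factorial(i) * x ** i
            for i in range(1, k + 1)
        )
    return factorial(n) * v[n]


# Two submodels, both ranking the gene in the top half.
print(q_statistic([0.5, 0.5]))
```

For two ratios r1 <= r2 the result reduces to 2*r1*r2 - r1^2, which can be checked by hand. Turning this value into a p-value is a separate step that is not shown here.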

6 Statistical Validation: Setup
- 29 lists of disease genes from OMIM
- 5 lists of random genes from the human genome

Foreach disease or random gene set do:
  Foreach gene in the set do:
    a. Leave one gene out
    b. TRAIN all submodels on the set minus the left-out gene
    c. Create a test set by adding the left-out gene to [9, 49, 99] random genes
    d. SCORE the test set with all trained submodels
    e. RANK the genes in the test set according to their order statistics p-value
end

Calculate for a certain cut-off x the number of
- TP: number of left-out genes ranked above x
- FP: number of genes other than the left-out gene ranked above x
- TN: number of genes other than the left-out gene ranked below x
- FN: number of left-out genes ranked below x

Calculate sensitivity and specificity from these values, plot (1 - specificity) versus sensitivity to obtain a Rank ROC plot, and calculate the area under the curve.
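The TP/FP/TN/FN counting described in this setup can be sketched as follows. The sketch assumes each leave-one-out run is summarized by the rank of the left-out gene (1 = top) in a test set of fixed size, and that "ranked above x" means rank <= x. The function names and the trapezoidal area calculation are illustrative choices, not details from the slides.

```python
def rank_roc(left_out_ranks, test_set_size):
    """Return (1 - specificity, sensitivity) points for cut-offs 0..test_set_size."""
    runs = len(left_out_ranks)
    points = []
    for cutoff in range(test_set_size + 1):
        tp = sum(1 for rank in left_out_ranks if rank <= cutoff)
        fn = runs - tp
        # Every run has `cutoff` genes above the cut-off; those that are not
        # the left-out gene count as false positives.
        fp = runs * cutoff - tp
        tn = runs * (test_set_size - 1) - fp
        sensitivity = tp / (tp + fn)
        specificity = tn / (tn + fp)
        points.append((1 - specificity, sensitivity))
    return points


def area_under_curve(points):
    """Trapezoidal area under a curve given as ordered (x, y) points."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Hypothetical ranks of the left-out gene in four runs with 10 genes each.
print(area_under_curve(rank_roc([1, 2, 1, 5], 10)))
```

An area of 1.0 means the left-out gene was always ranked first; an area near 0.5 is what random ranking would give.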

7 Statistical Validation: Disease genes

8 Statistical Validation: Disease genes
- 29 human diseases (OMIM) = 29 gene sets
- disease genes with Ensembl identifier in total
- average gene set contains 19 genes
- smallest gene set = ALS with 4 genes
- largest gene set = leukemia with 113 genes

9 Statistical Validation: Submodels
Textual data: TXTGate
Sequence similarity: BLAST
- Rank genes according to e-value
- Example: Presenilin 1 vs. Presenilin 2, e-value =

10 Statistical Validation: Submodels
Functional annotation: GO
Functional annotation: KEGG
[Figure labels: set of genes, GO IDs, observed frequencies; full genome, GO IDs, expected frequencies]
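The slide contrasts observed GO term frequencies in a gene set with expected frequencies in the full genome, but the transcript does not name the statistical test. A hypergeometric tail probability is one common choice for this kind of comparison. Using it here is an assumption, and the function and parameter names are hypothetical.

```python
from math import comb


def enrichment_p_value(genome_size, genome_with_term, set_size, set_with_term):
    """P(X >= set_with_term) when drawing set_size genes without replacement."""
    upper = min(set_size, genome_with_term)
    total = comb(genome_size, set_size)
    tail = sum(
        comb(genome_with_term, k)
        * comb(genome_size - genome_with_term, set_size - k)
        for k in range(set_with_term, upper + 1)
    )
    return tail / total


# Toy numbers: 5 of 10 genome genes carry the term, both genes in the set do.
print(enrichment_p_value(10, 5, 2, 2))
```

A small value suggests the term is over-represented in the gene set relative to the genome.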

11 Statistical Validation: Submodels
Protein information: InterPro
Protein information: BIND
[Figure labels: training genes + interaction partners; test gene + interaction partners; overlap?]
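The "Overlap?" question for the BIND submodel can be pictured as a set comparison between the training genes plus their interaction partners and the test gene plus its partners. The sketch below uses the Jaccard index as the overlap measure. That choice, and the gene identifiers in the example, are assumptions for illustration only; the slides do not state how the overlap is scored.

```python
def partner_overlap(training_network, test_network):
    """Jaccard index of two sets of gene identifiers (0.0 if both are empty)."""
    training_network = set(training_network)
    test_network = set(test_network)
    union = training_network | test_network
    if not union:
        return 0.0
    return len(training_network & test_network) / len(union)


# Hypothetical identifiers: training genes with partners vs. test gene with partners.
print(partner_overlap({"g1", "g2", "g3"}, {"g2", "g3", "g4"}))
```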

12 Statistical Validation: Submodels
Gene expression: Microarray data
Gene expression: ESTs
- Model is average expression profile of training genes
- Score test gene by calculating Pearson correlation
Human gene expression atlas: Su et al., 47 normal human tissues
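A minimal sketch of the two bullets above: the model is the per-tissue average of the training genes' expression profiles, and a test gene is scored by its Pearson correlation with that average. The function names and the toy three-tissue profiles are illustrative assumptions; returning 0.0 for a flat profile is a convenience choice, not something the slides specify.

```python
from math import sqrt


def mean_profile(profiles):
    """Average expression per tissue across all training genes."""
    return [sum(values) / len(values) for values in zip(*profiles)]


def pearson(a, b):
    """Pearson correlation of two equal-length profiles (0.0 if either is flat)."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / sqrt(var_a * var_b)


def score_gene(training_profiles, test_profile):
    """Correlate a test gene with the average training profile."""
    return pearson(mean_profile(training_profiles), test_profile)


# Toy profiles over three tissues.
training = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]
print(score_gene(training, [10.0, 20.0, 30.0]))
```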

13 Statistical Validation: Submodels
Cis-regulatory elements: TFBSs
Cis-regulatory elements: TFBS modules
- Check human-mouse CNS blocks in upstream sequence of a test gene
- Compare found motifs with motifs in training set
ModuleSearcher: searches for the best combination of 3 TFs in the 300 bp upstream of genes in the training set
ModuleScanner: scores test gene with model

14 Statistical Validation: Similarity
Statistical meta-analysis: Fisher's method
Assume there are m independent tests of $H_0$.
1. For the i-th test calculate the corresponding p-value, $p_i$.
2. If each $p_i$ has a uniform distribution on [0,1], then $-2\sum_{i=1}^{m} \log p_i$ has a $\chi^2_{2m}$ distribution.
Vector-based similarity (T1, T2, T3):
- Euclidean distance
- Pearson correlation
- Cosine similarity
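Fisher's method as stated on this slide can be sketched in a few lines. Because the chi-square distribution has an even number of degrees of freedom (2m), its upper tail has a closed form, so no statistics library is needed. The function name is an illustrative assumption, and the sketch expects every p-value to lie in (0, 1].

```python
from math import exp, log


def fisher_combined_p(p_values):
    """Combine m independent p-values with Fisher's method.

    X = -2 * sum(log p_i) follows a chi-square distribution with 2m degrees
    of freedom under H0; for even degrees of freedom the upper tail is
    exp(-X/2) * sum_{k=0..m-1} (X/2)^k / k!.
    """
    m = len(p_values)
    half_stat = -sum(log(p) for p in p_values)  # this is X / 2
    term, tail = 1.0, 0.0
    for k in range(m):
        tail += term                 # adds (X/2)^k / k!
        term *= half_stat / (k + 1)  # next term of the series
    return exp(-half_stat) * tail


print(fisher_combined_p([0.05, 0.05]))
```

With a single p-value the combined value equals the input, which is a quick sanity check of the formula.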

15 Statistical Validation: Correlation

16 Statistical Validation: Rank ROC

17 Statistical Validation: Submodel Rank ROC

18 Statistical Validation: Bias towards known genes

19 Endeavour Application: Screenshot

20 Endeavour Application: Architecture
[Figure labels: ESAT Web server, Linux cluster, Java RMI, SOAP messages]

21 Conclusions and Future
Conclusions:
- Allows integration of heterogeneous data
- Solves problem of uncertainty
- Solves multiple testing problem (Bonferroni correction)
- Allows for cut-offs with statistical significance
Future:
- Different weighting for different submodels
- Explore mathematical modeling techniques (neural nets, SVM)
- Add more information models
- Define best combination of submodels

22 Acknowledgements
Bart De Moor, Stein Aerts, Yves Moreau, Patrick Glenisson, Steven Van Vooren, Joke Allemeersch

23 Endeavour Application: Demo - Load training set

24 Endeavour Application: Demo - Add submodels

25 Endeavour Application: Demo - Train submodels

26 Endeavour Application: Demo - Load candidate genes

27 Endeavour Application: Demo - Score candidate genes with all submodels

28 Endeavour Application: Demo - Results of scoring

29 Endeavour Application: Demo - Ranking visualized in sprintplot

