Discovery Challenge
Gene expression datasets
On behalf of Olivier Gandrillon
SAGE data from the Cancer Genome Anatomy Project
Two datasets (public data on human cells)
– 822 * 74
– 27 679 * 90
Questions to answer
– Can we find synexpression groups?
– Are we able « to group » cell types using gene expression profiles?
– Can we obtain bi-sets, i.e., sets of genes associated with sets of cells that denote some relevant biological associations?
– Can we find invariant genes?
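To make the bi-set question concrete, here is a minimal sketch. It assumes the expression matrix has first been discretized into a boolean context, and it reads a bi-set as a set of genes that are all flagged in every cell sample of a set of cells. The toy context, the names `context` and `is_bi_set`, and this particular reading are illustrative assumptions, not something taken from the challenge data or the submissions.

```python
from typing import Dict, Set

# Hypothetical boolean context (assumption): each cell sample maps to the
# set of genes flagged as expressed after some discretization step.
context: Dict[str, Set[str]] = {
    "c1": {"g1", "g2", "g3"},
    "c2": {"g1", "g2"},
    "c3": {"g2", "g3"},
}


def is_bi_set(genes: Set[str], cells: Set[str], ctx: Dict[str, Set[str]]) -> bool:
    """Return True if every gene in `genes` is flagged in every cell of `cells`."""
    return all(genes <= ctx[cell] for cell in cells)


# g1 and g2 are both flagged in c1 and c2, but g1 is missing from c3.
print(is_bi_set({"g1", "g2"}, {"c1", "c2"}, context))
print(is_bi_set({"g1", "g2"}, {"c1", "c2", "c3"}, context))
```

Whether such a rectangle is biologically relevant is a separate question that this check does not address.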
« Quantitative » feedback
Increasing number of submissions analysing SAGE data
– From 2 in Pisa to 7 in Porto
5 on the smaller expression matrix (minimal transcriptome), 2 on the larger
Why is the minimal transcriptome preferred? ;-)
Topics (1)
Association rules (1)
– Gasmi et al.
– Extracting generic bases of association rules from SAGE data
Yet another « cover » of association rules, considering the smaller dataset; the added value w.r.t. previous work is unclear.
Class characterization (CBA-like approach) (1)
– Hebert et al.
– Mining delta-strong characterization rules in large SAGE dataset
Yet another « cover » of association rules, but with class characterization as the targeted application. Some biological validation of the added value … which is also expected given that the « data providers » are involved in the research.
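As a reminder of what an association rule measures on this kind of data, the sketch below computes the standard support and confidence of a rule over a boolean-encoded matrix. The toy libraries and the function names are illustrative assumptions; this is not the generic-basis or delta-strong rule mining used in the two submissions.

```python
from typing import Dict, Set

# Hypothetical toy data (assumption): each SAGE library maps to the set of
# genes considered over-expressed after some boolean discretization.
libraries: Dict[str, Set[str]] = {
    "lib1": {"g1", "g2", "g3"},
    "lib2": {"g1", "g2"},
    "lib3": {"g1", "g3"},
    "lib4": {"g2", "g3"},
}


def support(itemset: Set[str], data: Dict[str, Set[str]]) -> float:
    """Fraction of libraries in which every gene of `itemset` is present."""
    hits = sum(1 for genes in data.values() if itemset <= genes)
    return hits / len(data)


def confidence(body: Set[str], head: Set[str], data: Dict[str, Set[str]]) -> float:
    """Confidence of the rule body -> head; 0.0 if the body never occurs."""
    body_support = support(body, data)
    if body_support == 0:
        return 0.0
    return support(body | head, data) / body_support


# Rule {g1} -> {g2}: g1 occurs in 3 of 4 libraries, g1 and g2 together in 2.
print(support({"g1", "g2"}, libraries))
print(confidence({"g1"}, {"g2"}, libraries))
```

A « cover » of the rules is then a reduced set from which such measures can be derived for the remaining rules.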
Topics (2)
Clustering
– Martinez et al.
– Exploratory analysis of cancer SAGE data
Added value w.r.t. previous work is unclear, including the first attempt to use clustering for global analysis of SAGE data (2001).
Does cleaning improve cluster relevancy from a biological perspective?
Why consider only the minimal transcriptome?
Topics (3)
Supervised classification (4)
– Hsuan-Tien Lin et al.
– Analysis of SAGE results with combined learning techniques
Using Support Vector Machines on the large SAGE dataset for feature extraction and for discriminating cancer libraries. Impossible to assess the added value since the extracted model is not made explicit in the paper.
– Ylirinne
– Analysis of the Gene expression data with 4ft-Miner
This is an application of the GUHA method (descriptive rules) to the small SAGE matrix, without any insight into the added value.
Topics (4)
Supervised classification (cont.)
– Esseghir et al.
– Localizing compact sets of genes involved in cancer diseases using an evolutionary connectionist approach
Predicting cancer class values from the small SAGE dataset by means of neural networks and genetic algorithms. Results on gene selection and classification accuracy are given, but the data providers have not been able to interpret the concrete results.
Topics (5)
Supervised classification (cont.)
– Alves et al.
– Predictive analysis of gene expression data from human SAGE libraries
Studies the impact of dimensionality reduction techniques on classification performance for the small dataset. It leads to an unexpected result: the best classification performance is obtained when selecting the genes with relatively low expression and low variation levels.
Does this remain true for the large dataset, when no selection has been applied beforehand?
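The kind of gene selection discussed on this slide can be sketched as a simple filter on per-gene mean and variance. The toy matrix, the thresholds, and the function name below are illustrative assumptions; the actual selection criteria and cut-offs used by Alves et al. are not given in these slides.

```python
from statistics import mean, pvariance
from typing import Dict, List

# Hypothetical toy expression matrix (assumption): gene -> tag counts
# observed across four SAGE libraries.
expression: Dict[str, List[float]] = {
    "geneA": [1, 2, 1, 2],
    "geneB": [50, 5, 80, 2],
    "geneC": [3, 3, 4, 3],
    "geneD": [40, 42, 41, 39],
}


def select_low_genes(
    data: Dict[str, List[float]], max_mean: float, max_variance: float
) -> List[str]:
    """Return, in sorted order, the genes whose mean expression and
    population variance are both at or below the given thresholds."""
    return sorted(
        gene
        for gene, counts in data.items()
        if mean(counts) <= max_mean and pvariance(counts) <= max_variance
    )


# Arbitrary thresholds chosen for the toy data only.
print(select_low_genes(expression, max_mean=5, max_variance=1))
```

Any classifier would then be trained on the retained genes only, so the outcome depends directly on how the two thresholds are chosen.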
Conclusion
Much better than last year … and we should encourage data miners to work on real-life biological data
– What can be learned from this data … or what should not be learned
Typical problem of false positive patterns
Impact of data preprocessing (feature selection/construction) needs further research
– Nobody has used external sources of knowledge to support the biological interpretation … which is actually needed but also extremely hard
Discussion
Should we drastically reduce the number of genes, and in particular remove those with low expression?
Is it reasonable to try to predict cancerous class values from such datasets?
What to do next?
What can molecular biologists bring to machine learning/data mining researchers in the context of discovery challenges?
– Real data, a nice context for e-science, a need for multiple expertise/collaborative research, etc.
What can machine learning/data mining researchers bring to molecular biologists in the context of discovery challenges?
– New methods for data analysis, new methods for collecting data (e.g., suggesting relevant wet biology experiments to optimize the return on investment), etc.