Discovery Challenge Gene expression datasets On behalf of Olivier Gandrillon.


SAGE data from the Cancer Genome Anatomy Project
Two datasets (public data on human cells)
– 822 * 74
– * 90
Questions to answer
– Can we find synexpression groups?
– Are we able « to group » cell types using gene expression profiles?
– Can we obtain bi-sets, i.e., sets of genes associated with sets of cells that denote some relevant biological associations?
– Can we find invariant genes?
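To make the bi-set and invariant-gene questions concrete, here is a minimal sketch on a toy boolean matrix. Everything in it is an assumption for illustration: the library names, the gene names, the idea that expression has already been discretized to "over-expressed or not", and the brute-force enumeration, which would not scale to the real matrices. A bi-set here is a maximal pair (set of cells, set of genes) such that every gene is over-expressed in every cell of the pair.

```python
from itertools import combinations

# Hypothetical toy data: for each cell library, the genes flagged as over-expressed.
libraries = {
    "c1": {"g1", "g2", "g3"},
    "c2": {"g1", "g2"},
    "c3": {"g2", "g3"},
    "c4": {"g1", "g2", "g3"},
}


def common_genes(cells):
    """Genes over-expressed in every library of `cells`."""
    return set.intersection(*(libraries[c] for c in cells))


def supporting_cells(genes):
    """Libraries in which every gene of `genes` is over-expressed."""
    return {c for c, expressed in libraries.items() if genes <= expressed}


def bi_sets(min_cells=2):
    """Enumerate maximal (cells, genes) pairs by brute force (exponential cost)."""
    found = set()
    names = sorted(libraries)
    for k in range(min_cells, len(names) + 1):
        for cells in combinations(names, k):
            genes = common_genes(cells)
            if not genes:
                continue
            # Extend the cell set to every library that shares these genes,
            # so that the pair cannot be enlarged on either side.
            found.add((frozenset(supporting_cells(genes)), frozenset(genes)))
    return found


def invariant_genes():
    """Genes over-expressed in all libraries (a boolean notion of invariance)."""
    return common_genes(libraries)


for cells, genes in sorted(bi_sets(), key=lambda p: (len(p[0]), sorted(p[1]))):
    print(sorted(cells), sorted(genes))
print("invariant:", sorted(invariant_genes()))
```

On real SAGE data the discretization step and the choice of a scalable mining algorithm matter far more than this enumeration; the sketch only fixes the vocabulary.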

« Quantitative » feedback
Increasing number of submissions for analysing SAGE data
– From 2 in Pisa to 7 in Porto
– 5 on the smaller expression matrix (minimal transcriptome), 2 on the larger
– Why is the minimal transcriptome preferred? ;-)

Topics (1)
Association rules (1)
– Gasmi et al.
– Extracting generic bases of association rules from SAGE data
Yet another « cover » of association rules, considering the smaller data set; the added value w.r.t. previous work is unclear.
Class characterization (CBA-like approach) (1)
– Hebert et al.
– Mining delta-strong characterization rules in large SAGE dataset
Yet another « cover » of association rules, but with class characterization as the targeted application. Some biological validation of the added value … which is also expected, given that the « data providers » are involved in the research.
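As a reminder of what an association rule over SAGE data measures, the sketch below computes the support and confidence of a rule "genes X over-expressed implies genes Y over-expressed" on a toy discretized matrix. The data and the function names are hypothetical; it does not reproduce the generic bases of Gasmi et al. or the delta-strong rules of Hebert et al., which select a compact subset of such rules.

```python
# Hypothetical toy data: each library is the set of genes flagged as over-expressed.
libraries = [
    {"g1", "g2", "g3"},
    {"g1", "g2"},
    {"g2", "g3"},
    {"g1", "g2", "g3"},
]


def support(genes):
    """Fraction of libraries in which all genes of `genes` are over-expressed."""
    genes = set(genes)
    return sum(genes <= lib for lib in libraries) / len(libraries)


def confidence(lhs, rhs):
    """Confidence of the rule lhs -> rhs: support(lhs and rhs) / support(lhs)."""
    base = support(lhs)
    if base == 0:
        raise ValueError("the left-hand side never occurs")
    return support(set(lhs) | set(rhs)) / base


print(support({"g1", "g3"}))       # share of libraries with both g1 and g3
print(confidence({"g1"}, {"g3"}))  # how often g3 accompanies g1
print(confidence({"g3"}, {"g2"}))  # how often g2 accompanies g3
```

A « cover » of the rules is then a smaller set from which all rules above given support and confidence thresholds can be derived.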

Topics (2)
Clustering
– Martinez et al.
– Exploratory analysis of cancer SAGE data
Added value w.r.t. previous work is unclear, including the first attempt to use clustering for global analysis of SAGE data (2001).
Does cleaning improve cluster relevancy from a biological perspective?
Why consider only the minimal transcriptome?

Topics (3)
Supervised classification (4)
– Hsuan-Tien Lin et al.
– Analysis of SAGE results with combined learning techniques
Using Support Vector Machines on the large SAGE data set for feature extraction and for discriminating cancer libraries. Impossible to assess the added value, since the extracted model is not made explicit in the paper.
– Ylirinne
– Analysis of the Gene expression data with 4ft-Miner
This is an application of the GUHA method (descriptive rules) to the small SAGE matrix, without any insight into the added value.

Topics (4)
Supervised classification (cont.)
– Esseghir et al.
– Localizing compact sets of genes involved in cancer diseases using an evolutionary connectionist approach
Predicting cancer class values from the small SAGE dataset by means of neural networks and genetic algorithms. Results on gene selection and classification accuracy have been given, but the data providers have not been able to interpret the concrete results.

Topics (5)
Supervised classification (cont.)
– Alves et al.
– Predictive analysis of gene expression data from human SAGE libraries
Studies the impact of dimensionality reduction techniques on classification performance for the small dataset. It leads to an unexpected result: the best classification performance is obtained when selecting the genes with relatively low expression and low variation levels. Does this remain true for the large dataset, when no selection has been applied beforehand?
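The kind of experiment described here can be sketched as follows: keep only the genes whose mean tag count and standard deviation fall under chosen thresholds, then estimate classification accuracy on the retained genes. The tag counts, labels, thresholds and the 1-nearest-neighbour classifier with leave-one-out evaluation are all assumptions made for illustration; this is not the procedure of Alves et al., and the toy numbers say nothing about the real datasets.

```python
from statistics import mean, pstdev

# Hypothetical toy data: tag counts per library (rows) for four genes (columns).
genes = ["gA", "gB", "gC", "gD"]
counts = [
    [120, 3, 40, 1],
    [90, 6, 38, 9],
    [150, 2, 41, 0],
    [60, 7, 39, 8],
]
labels = ["cancer", "normal", "cancer", "normal"]


def select_low_expression_low_variation(max_mean, max_std):
    """Column indices of genes with mean <= max_mean and std <= max_std."""
    keep = []
    for j in range(len(genes)):
        column = [row[j] for row in counts]
        if mean(column) <= max_mean and pstdev(column) <= max_std:
            keep.append(j)
    return keep


def loo_accuracy(cols):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier on `cols`."""
    correct = 0
    for i, row in enumerate(counts):
        others = [k for k in range(len(counts)) if k != i]
        # Squared Euclidean distance restricted to the selected genes.
        nearest = min(
            others,
            key=lambda k: sum((row[j] - counts[k][j]) ** 2 for j in cols),
        )
        correct += labels[nearest] == labels[i]
    return correct / len(counts)


selected = select_low_expression_low_variation(max_mean=10, max_std=3)
print([genes[j] for j in selected], loo_accuracy(selected))
print(genes, loo_accuracy(range(len(genes))))
```

Running the same comparison on the large matrix, with and without prior selection, is one way to address the open question raised on this slide.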

Conclusion
Much better than last year … and we should encourage data miners to work on real-life biological data
– What can be learned from this data … or what should not be learned
Typical problem of false positive patterns
Impact of data preprocessing (feature selection/construction) needs further research
– Nobody has been using external sources of knowledge in order to support the biological interpretation … which is actually needed but also extremely hard

Discussion
Should we drastically reduce the number of genes, and especially remove the ones with low expression?
Is it reasonable to try to predict cancerous class values from such datasets?

What to do next?
What can molecular biologists bring to machine learning/data mining researchers in the context of discovery challenges?
– Real data, nice context for e-science, need for multiple expertise/collaborative research, etc.
What can machine learning/data mining researchers bring to molecular biologists in the context of discovery challenges?
– New methods for data analysis, new methods for collecting data (e.g., suggestion of relevant wet biology experiments to optimize the return on investment), etc.