Capturing Best Practice for Microarray Gene Expression Data Analysis
Gregory Piatetsky-Shapiro, Tom Khabaza, Sridhar Ramaswamy
Presented briefly by Joey Mudd

What is Microarray Data?
- Microarray devices obtain RNA expression levels from gene samples
- The data obtained can be used for a variety of medical purposes: diagnosis, predicting treatment outcome, etc.
- The data produced are typically large and complex, which makes data mining a useful approach

Standardizing the Data Mining Process
- CRISP-DM: the CRoss-Industry Standard Process for Data Mining
- CRISP-DM standardizes the steps of a data mining process using a high-level structure and shared terminology
- Useful for describing best practice

Microarray Data Analysis Issues
- The typical number of records (samples) is small (<100) due to the difficulty of collecting samples
- The typical number of attributes (genes) is large (many thousands)
- This imbalance can lead to false positives (correlations due to chance) and over-fitting
- The paper suggests reducing the number of genes examined (feature reduction)
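
The false-positive risk is easy to demonstrate with a quick sketch. The data here is pure noise, and the sample/gene counts and labels are illustrative assumptions (not figures from the slides), yet some gene still correlates strongly with the labels by chance:

```python
import numpy as np

# Illustrative sketch: with thousands of genes and few samples, some gene
# will correlate strongly with the class labels purely by chance.
rng = np.random.default_rng(0)
n_samples, n_genes = 30, 5000
X = rng.standard_normal((n_samples, n_genes))  # pure-noise "expression" matrix
y = np.arange(n_samples) % 2                   # arbitrary class labels

# Pearson correlation of every gene with the labels, vectorized.
Xc = X - X.mean(axis=0)
yc = y - y.mean()
corr = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
best = np.abs(corr).max()  # typically well above 0.5 despite zero real signal
```

A classifier built on the "best" gene here would look impressive on the training data while capturing nothing real, which is exactly why the paper stresses feature reduction and careful validation.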

Data Cleaning and Preparation
- Thresholding: determine an appropriate range of values (the authors used min 100, max 16,000 for Affymetrix arrays)
- Normalization: required for clustering (the authors normalized each gene to mean 0, stddev 1)
- Filtering: remove genes G that do not vary enough across samples, e.g. those with MaxValue(G) - MinValue(G) < 500 or MaxValue(G) / MinValue(G) < 5
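
The three steps above can be sketched as a single preprocessing function. The cutoff values follow the slides; the function name, signature, and matrix layout (samples x genes) are illustrative assumptions:

```python
import numpy as np

def preprocess(X, lo=100.0, hi=16000.0, min_diff=500.0, min_ratio=5.0):
    """Threshold, filter, and normalize a (samples x genes) expression matrix.

    Cutoffs follow the values quoted in the slides for Affymetrix arrays;
    the structure of this function is a sketch, not the authors' code.
    """
    # Thresholding: clip expression values into the usable range.
    X = np.clip(X, lo, hi)
    # Filtering: keep only genes that vary enough across samples.
    gmax, gmin = X.max(axis=0), X.min(axis=0)
    keep = (gmax - gmin >= min_diff) & (gmax / gmin >= min_ratio)
    X = X[:, keep]
    # Normalization (needed for clustering): per-gene mean 0, stddev 1.
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    return X, keep
```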

Feature Selection
- Because of the large number of attributes and small number of samples, feature selection is important
- Use statistical measures to determine the "best" genes for each class
- To avoid under-representing some classes, apply the heuristic of selecting an equal number of genes from each class
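
The equal-genes-per-class heuristic might look like the sketch below, using a one-vs-rest signal-to-noise score (mean difference over summed standard deviations). The exact statistic and this API are assumptions for illustration:

```python
import numpy as np

def select_genes_per_class(X, y, k):
    """Pick the top-k genes for each class by a one-vs-rest
    signal-to-noise score: (mu_in - mu_out) / (sd_in + sd_out).
    Sketch only; the paper's exact statistic may differ."""
    selected = set()
    for c in np.unique(y):
        ins, outs = X[y == c], X[y != c]
        score = (ins.mean(axis=0) - outs.mean(axis=0)) / \
                (ins.std(axis=0) + outs.std(axis=0) + 1e-12)
        # Take the k highest-scoring genes for this class.
        selected.update(np.argsort(score)[::-1][:k].tolist())
    return sorted(selected)
```

Because every class contributes k genes, a class with weak overall markers still gets represented in the selected feature set.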

Building Classification Models
- For this data, decision trees work poorly; neural nets work well
- Feature reduction alone is not sufficient
- Test models using a varying number of genes from each class
- Five-fold cross-validation is sufficient; leave-one-out cross-validation is considered most accurate
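
Leave-one-out cross-validation can be sketched as below. A nearest-centroid classifier stands in for the neural nets used in the paper, and the function names are illustrative:

```python
import numpy as np

def loocv_accuracy(X, y, fit_predict):
    """Leave-one-out cross-validation: train on all samples but one,
    predict the held-out sample, repeat for every sample."""
    n = len(y)
    hits = 0
    for i in range(n):
        mask = np.arange(n) != i
        hits += fit_predict(X[mask], y[mask], X[i]) == y[i]
    return hits / n

def nearest_centroid(Xtr, ytr, xte):
    # Illustrative stand-in for the neural net models in the paper:
    # predict the class whose mean training vector is closest.
    classes = np.unique(ytr)
    d = [np.linalg.norm(xte - Xtr[ytr == c].mean(axis=0)) for c in classes]
    return classes[int(np.argmin(d))]
```

With only tens of samples, leave-one-out is affordable and wastes no training data, which is why it is considered the most accurate estimate here.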

Case Study 1
- Leukemia data, 2 classes (AML, ALL); 38 training samples, 34 test samples (a separate set of samples)
- Filter to reduce the number of genes, then select the top 100 by T-value
- Build neural net models; 10 genes turned out to be the best subset size
- 97% accuracy (33/34 test records correctly classified)
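
Ranking genes by T-value might look like the sketch below. The slides only say "T-values"; the Welch form of the statistic and the helper names are assumptions:

```python
import numpy as np

def t_values(X, y):
    """Per-gene two-class t-statistic (Welch form, a sketch; the
    authors' exact variant is not specified in the slides)."""
    a, b = X[y == 0], X[y == 1]
    num = a.mean(axis=0) - b.mean(axis=0)
    den = np.sqrt(a.var(axis=0, ddof=1) / len(a) +
                  b.var(axis=0, ddof=1) / len(b))
    return num / den

def top_genes(X, y, n=100):
    # Keep the n genes with the largest absolute T-value.
    return np.argsort(-np.abs(t_values(X, y)))[:n]
```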

Case Study 2
- Brain data, 5 classes, 42 samples (no separate test set)
- Same preprocessing as Case Study 1
- Select top genes by the signal-to-noise measure, choosing an equal number of genes per class
- Build neural net models; 12 genes per class (60 total) gave the best results
- The lowest average error rate was 15%

Case Study 3
- Cluster analysis, with the goal of discovering natural classes
- Leukemia data with 3 classes: ALL subdivided into ALL-T and ALL-B
- Same preprocessing as before, plus normalization of values for clustering
- Two clustering methods from the Clementine package were used; both discovered the natural classes in the data, to the authors' satisfaction
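
The slides do not say which two Clementine clustering methods were used, so as an illustrative stand-in, here is a minimal k-means sketch operating on data already normalized to mean 0 / stddev 1 as described earlier:

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Minimal k-means sketch (an illustrative stand-in for the
    Clementine clustering methods, which the slides do not name)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct samples as initial centers.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each sample to its nearest center.
        labels = np.argmin(((X[:, None, :] - centers) ** 2).sum(axis=2), axis=1)
        # Move each center to the mean of its assigned samples.
        centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return labels
```

Checking whether the resulting clusters line up with the known AML/ALL-T/ALL-B labels is how one judges that the "natural classes" were recovered.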

Conclusions
- The ideas presented could apply to other domains with a similar balance of attributes to samples (e.g. cheminformatics or drug design)
- Future work could evaluate cost-sensitive classification, which minimizes errors according to the cost they inflict
- A principled methodology can lead to good results