Knowledge Discovery in Microarray Gene Expression Data
Gregory Piatetsky-Shapiro
IMA 2002 Workshop on Data-driven Control and Optimization
Copyright © 2002 KDnuggets

Data Mining Methodology is Critical!
- Data mining is a continuous process!
- Following correct methodology is critical!
- CRISP-DM methodology

Overview
- Molecular Biology Overview
- Microarrays for Gene Expression
- Classification on Microarray Data
  - avoiding false positives
  - wrapper approach
- Microarrays for Modeling Dynamic Processes
  - finding causal networks and clusters

Biology and Cells
- All living organisms consist of cells.
- Humans have trillions of cells. Yeast has one cell.
- Cells are of many different types (blood, skin, nerve), but all arose from a single cell (the fertilized egg).
- Each* cell contains a complete copy of the genome (the program for making the organism), encoded in DNA.

DNA
- DNA molecules are long double-stranded chains; 4 types of bases are attached to the backbone: adenine (A), guanine (G), cytosine (C), and thymine (T). A pairs with T, C with G.
- A gene is a segment of DNA that specifies how to make a protein.
- Human DNA has about 30,000-35,000 genes; rice has about 50,000-60,000, but its genes are shorter.

Exons and Introns: Data and Logic?
- Exons are coding DNA (translated into a protein); they make up only about 2% of the human genome.
- Introns are non-coding DNA, which provides structural integrity and regulatory (control) functions.
- Exons can be thought of as program data, while introns provide the program logic.
- Humans have much more control structure than rice.

Gene Expression
- Cells are different because of differential gene expression.
- About 40% of human genes are expressed at one time.
- A gene is expressed by transcribing DNA into single-stranded mRNA.
- mRNA is later translated into a protein.
- Microarrays measure the level of mRNA expression.

Molecular Biology Overview
[Figure: cell, nucleus, chromosome, gene (DNA), gene (mRNA, single strand), protein. Graphics courtesy of the National Human Genome Research Institute.]

Gene Expression Measurement
- mRNA expression represents dynamic aspects of the cell.
- mRNA expression can be measured with the latest technology.
- mRNA is isolated and labeled with fluorescent protein.
- mRNA is hybridized to the target; the level of hybridization corresponds to light emission, which is measured with a laser.

Gene Expression Microarrays
The main types of gene expression microarrays:
- Short oligonucleotide arrays (Affymetrix)
- cDNA or spotted arrays (Brown/Botstein)
- Long oligonucleotide arrays (Agilent Inkjet)
- Fiber-optic arrays
- ...

Affymetrix Microarrays
[Figure: raw image of a chip; labeled dimensions 50 µm and 1.28 cm]
- ~10^7 oligonucleotides: half Perfectly Match mRNA (PM), half have one Mismatch (MM)
- Raw gene expression is the intensity difference: PM - MM
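
The PM - MM difference can be illustrated with a short sketch. It assumes that a gene is covered by several probe pairs and that the per-gene value is the plain mean of the pairwise differences; the slide does not say how the pairs are combined, so the averaging step, the function name, and the intensities are all hypothetical.

```python
from statistics import mean

def average_difference(pm, mm):
    """Mean of PM - MM over the probe pairs of one gene (assumed summary)."""
    return mean(p - m for p, m in zip(pm, mm))

# Hypothetical probe-pair intensities for two genes.
strong_gene = average_difference([500, 620, 480], [300, 400, 250])
weak_gene = average_difference([100, 120], [150, 200])

print(strong_gene)  # positive: perfect-match probes are brighter
print(weak_gene)    # negative: mismatch probes are brighter
```

A difference of this kind can be negative when the mismatch probes are brighter than the perfect-match probes, which is one way negative expression values can arise.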

Microarray Potential Applications
- Biological discovery
  - new and better molecular diagnostics
  - new molecular targets for therapy
  - finding and refining biological pathways
- Recent examples
  - molecular diagnosis of leukemia, breast cancer, ...
  - appropriate treatment for genetic signature
  - potential new drug targets

Microarray Data Analysis Types
- Gene Selection
  - find genes for therapeutic targets
  - avoid false positives (FDA approval?)
- Classification (Supervised)
  - identify disease
  - predict outcome / select best treatment
- Clustering (Unsupervised)
  - find new biological classes / refine existing ones
  - exploration
- ...

Microarray Data Mining Challenges
- too few records (samples), usually < 100
- too many columns (genes), usually > 1,000
- too many columns are likely to lead to false positives
- for exploration, a large set of all relevant genes is desired
- for diagnostics or identification of therapeutic targets, the smallest set of genes is needed
- model needs to be explainable to biologists

Microarray Data Classification
[Figure: microarray chips, images scanned by laser, datasets, data mining model; a new sample yields a prediction: ALL or AML]

Gene              Value
D26528_at           193
D26561_cds1_at      -70
D26561_cds2_at      144
D26561_cds3_at       33
D26579_at           318
D26598_at          1764
D26599_at          1537
D26600_at          1204
D28114_at           707

Data Preparation Issues (MAS-4)
- Thresholding: usually min 20, max 16,000
  - for older Affy chips (new Affy chips do not have negative values)
- Filtering: remove genes with insufficient variation
  - e.g. MaxVal - MinVal < 500 and MaxVal/MinVal < 5
  - biological reasons
  - feature reduction for algorithmic reasons
- For clustering, normalize each gene (sample) separately to Mean = 0, Std. Dev = 1
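
The three preparation steps can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the function names and toy values are hypothetical, the thresholds are the "usual" ones quoted on the slide, and the population standard deviation is assumed for the normalization (the slide does not say which variant is meant).

```python
from statistics import mean, pstdev

def threshold(values, lo=20, hi=16000):
    """Clip each expression value into the range [lo, hi]."""
    return [min(max(v, lo), hi) for v in values]

def has_enough_variation(values, min_diff=500, min_ratio=5):
    """False when MaxVal - MinVal < min_diff and MaxVal / MinVal < min_ratio."""
    top, bottom = max(values), min(values)
    return not (top - bottom < min_diff and top / bottom < min_ratio)

def standardize(values):
    """Rescale one gene to mean 0 and standard deviation 1."""
    m, s = mean(values), pstdev(values)  # population SD is an assumption
    return [(v - m) / s for v in values]

# Hypothetical toy data: gene name -> expression values across 4 samples.
raw = {
    "gene_a": [-70, 150, 3000, 20000],  # large spread, kept by the filter
    "gene_b": [100, 120, 130, 110],     # nearly flat, removed by the filter
}

clipped = {g: threshold(v) for g, v in raw.items()}
kept = {g: v for g, v in clipped.items() if has_enough_variation(v)}
normalized = {g: standardize(v) for g, v in kept.items()}

print(clipped["gene_a"])  # negative and very large values are clipped
print(list(kept))         # only the gene with enough variation remains
```

Thresholding runs first so that the minimum is at least 20 before the MaxVal/MinVal ratio is computed, which avoids dividing by zero or by a negative value.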

Classification
- desired features:
  - robust in presence of false positives
  - understandable
  - return confidence/probability
  - fast enough
- simplest approaches are most robust
- advanced approaches can be more accurate

False Positives Problem
- Not enough records (samples), usually < 100
- Too many columns (genes), usually >> 1,000
- False positives are very likely because of few records and many columns
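
A rough worked example shows the scale of the problem. It assumes, purely for illustration, that each of m genes is tested separately at significance level α and that none of them truly differs between the classes; the values m = 7,000 and α = 0.01 are illustrative choices, not figures from this slide.

```latex
\mathbb{E}[\text{false positives}] = m \cdot \alpha ,
\qquad
m = 7000,\ \alpha = 0.01
\ \Longrightarrow\
7000 \times 0.01 = 70
```

Under these assumptions about 70 genes would look significant by chance alone, which is why a raw per-gene p-value cannot be taken at face value.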

Controlling False Positives
[Figure: CD37 antigen expression values by class]
- Mean difference between classes: T-value = [value missing in source]
- Significance: p = 0.0007

Controlling False Positives with Randomization
[Figure: CD37 antigen values with the original class column and a randomized class column]
- Randomize the class labels: T-value = -1.1
- Randomization is less conservative
- Preserves inner structure of data

Controlling False Positives with Randomization, II
[Figure: gene values with the class column and randomized class columns]
- Randomize 500 times
- Bottom 1% T-value = [value missing in source]
- Select potentially interesting genes at 1%
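
The randomization idea can be sketched for a single gene as follows. This is a simplified illustration, not the procedure used in the talk: it assumes a Welch-style T-value, shuffles the class labels 500 times, and reports an empirical p-value instead of reading off the 1% cutoff. All names and numbers are hypothetical.

```python
import random
from math import sqrt
from statistics import mean, variance

def t_value(a, b):
    """Welch-style T-value for the difference of two class means (assumed form)."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))

def randomized_t_values(values, labels, n_rounds=500, seed=0):
    """T-values obtained after shuffling the class labels n_rounds times."""
    rng = random.Random(seed)
    shuffled = list(labels)
    result = []
    for _ in range(n_rounds):
        rng.shuffle(shuffled)
        a = [v for v, c in zip(values, shuffled) if c == 0]
        b = [v for v, c in zip(values, shuffled) if c == 1]
        result.append(t_value(a, b))
    return result

def empirical_p(observed, null_values):
    """Share of randomized T-values at least as extreme as the observed one."""
    extreme = sum(1 for t in null_values if abs(t) >= abs(observed))
    return (extreme + 1) / (len(null_values) + 1)

# Hypothetical expression values for one gene in 6 + 6 samples.
values = [310, 295, 340, 325, 300, 330, 120, 150, 135, 110, 140, 125]
labels = [0] * 6 + [1] * 6

observed = t_value(values[:6], values[6:])
null = randomized_t_values(values, labels)
p = empirical_p(observed, null)
print(round(observed, 2), p)
```

Because only the labels are shuffled, the expression values themselves stay untouched, which is what keeps the inner structure of the data intact.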

Controlling False Positives: SAM (Significance Analysis of Microarrays)
- Tusher, Tibshirani, and Chu, Significance analysis of microarrays ..., PNAS, Apr 2001
- SAM software available from Tibshirani web site

Why Separate Feature Selection?
- Most learning algorithms look for non-linear combinations of features and can easily find many spurious combinations given a small # of records and a large # of genes
- We first reduce the number of genes by a linear method, e.g. T-values
- Heuristic: select genes from each class
- Then apply a favorite machine learning algorithm

Feature selection approach
- Rank genes by measure; select top
  - T-test for mean difference = [formula missing in source]
  - Signal to Noise (S2N) = [formula missing in source]
  - Other: information-based, biological?
- Almost any method works well with a good feature selection
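
The formulas for the two measures did not survive in this transcript. For orientation only, the definitions below are the ones commonly used for two-class gene ranking; it is an assumption that the slide showed these exact forms. Here μ_i, σ_i and N_i denote the mean, standard deviation and number of samples of the gene's expression in class i.

```latex
T = \frac{\mu_1 - \mu_2}{\sqrt{\sigma_1^2 / N_1 + \sigma_2^2 / N_2}}
\qquad\qquad
S2N = \frac{\mu_1 - \mu_2}{\sigma_1 + \sigma_2}
```

Both measures grow when the class means are far apart relative to the within-class spread; the sign indicates in which class the gene is more highly expressed.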

Gene Reduction Improves Classification
- Most learning algorithms look for non-linear combinations of features and can easily find many spurious combinations given a small # of records and a large # of genes
- Classification accuracy improves if we first reduce the # of genes by a linear method, e.g. T-values of mean difference
- Heuristic: select an equal # of genes from each class
- Then apply a favorite machine learning algorithm
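
The "equal # of genes from each class" heuristic could look roughly like this. The sketch assumes two classes and a Welch-style T-value as the ranking measure; the function names, gene names and values are hypothetical.

```python
from math import sqrt
from statistics import mean, variance

def t_value(a, b):
    """Welch-style T-value for the difference of two class means (assumed form)."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))

def select_genes_per_class(expr, labels, per_class=1):
    """Return (genes highest in class 0, genes highest in class 1)."""
    scores = {}
    for gene, values in expr.items():
        a = [v for v, c in zip(values, labels) if c == 0]
        b = [v for v, c in zip(values, labels) if c == 1]
        scores[gene] = t_value(a, b)
    ranked = sorted(scores, key=scores.get)  # most negative T-value first
    high_in_1 = ranked[:per_class]           # strongly negative: higher in class 1
    high_in_0 = ranked[::-1][:per_class]     # strongly positive: higher in class 0
    return high_in_0, high_in_1

# Hypothetical expression values for 3 genes in 3 + 3 samples.
expr = {
    "g1": [900, 950, 1000, 100, 120, 90],    # high in class 0
    "g2": [50, 60, 40, 700, 800, 750],       # high in class 1
    "g3": [300, 310, 290, 305, 295, 300],    # uninformative
}
labels = [0, 0, 0, 1, 1, 1]

high0, high1 = select_genes_per_class(expr, labels)
print(high0, high1)
```

Taking the top genes from both ends of the ranking keeps the selected set from being dominated by markers of a single class.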

Wrapper approach to select the best gene set
- Select the best 200 or so genes based on statistical measures
- Test models using 1, 2, 3, ..., 10, 20, 30, 40, ... genes with cross-validation
- Select the gene set with the lowest average error
- Heuristically, at least 10 genes overall
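
A minimal sketch of the wrapper loop, under several assumptions that are not in the talk: the genes are already ranked, the model is a simple nearest-centroid classifier, cross-validation is leave-one-out, and ties are broken in favor of the smaller gene set. The "at least 10 genes" heuristic is not enforced, and all names and data are hypothetical.

```python
from statistics import mean

def nearest_centroid_predict(train_rows, train_labels, row):
    """Predict the class whose mean profile is closest to row."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_labels)):
        members = [r for r, c in zip(train_rows, train_labels) if c == label]
        centroid = [mean(col) for col in zip(*members)]
        dist = sum((x - m) ** 2 for x, m in zip(row, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

def loo_error(rows, labels):
    """Leave-one-out cross-validation error rate."""
    errors = 0
    for i, row in enumerate(rows):
        rest = rows[:i] + rows[i + 1:]
        rest_labels = labels[:i] + labels[i + 1:]
        if nearest_centroid_predict(rest, rest_labels, row) != labels[i]:
            errors += 1
    return errors / len(rows)

def best_gene_count(samples, labels, ranked_genes, sizes):
    """Evaluate the top-k ranked genes for each k; return (best k, errors by k)."""
    errors = {}
    for k in sizes:
        rows = [[s[g] for g in ranked_genes[:k]] for s in samples]
        errors[k] = loo_error(rows, labels)
    best = min(sizes, key=lambda k: (errors[k], k))  # ties go to the smaller set
    return best, errors

# Hypothetical data: 3 samples per class, genes already ranked g1, g2, g3.
samples = [
    {"g1": 10, "g2": 20, "g3": 7},
    {"g1": 11, "g2": 21, "g3": 3},
    {"g1": 6, "g2": 22, "g3": 5},
    {"g1": 2, "g2": 5, "g3": 4},
    {"g1": 3, "g2": 6, "g3": 6},
    {"g1": 1, "g2": 4, "g3": 5},
]
labels = [0, 0, 0, 1, 1, 1]

best, errors = best_gene_count(samples, labels, ["g1", "g2", "g3"], [1, 2, 3])
print(best, errors)
```

In a real analysis the gene ranking would be recomputed inside each cross-validation fold, so that the held-out sample never influences which genes are selected.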

Popular Classification Methods
- Decision Trees/Rules
  - find smallest gene sets, but not robust to false positives
- Neural Nets: work well for reduced # of genes
- K-nearest neighbor: robust for small # of genes
- TreeNet from authors of CART and MARS
  - networks of simple trees; very robust against outliers
- Support Vector Machines (SVM)
  - good accuracy, does its own gene selection, but hard to understand
- ...

Microarrays: An Example
- Leukemia: Acute Lymphoblastic (ALL) vs Acute Myeloid (AML), Golub et al, Science, v.286, 1999
- 72 examples (38 train, 34 test), about 7,000 genes
- well-studied (CAMDA-2000), good test example
[Figure: ALL and AML samples. Visually similar, but genetically very different]

Results on the test data
- Genes selected and model trained on Train set ONLY!
- Best Clementine neural net model used 10 genes per class
- Evaluation on test data (34 samples) gives 1 or 2 errors (94-97% accuracy)
- Note: all methods give an error on sample 66, believed to be mis-classified by a pathologist

Multi-class Data Analysis
- Brain data: Pomeroy et al, Nature (415), Jan 2002
- 42 examples, about 7,000 genes, 5 classes
[Figure: photomicrographs of tumours (400x): a, MD (medulloblastoma) classic; b, MD desmoplastic; c, PNET; d, rhabdoid; e, glioblastoma]
- Analysis also used normal tissue, not shown

Modeling with TreeNet
- Build a model using top 3 genes from each class
- Evaluate using cross-validation
- Results: 95% accuracy
  - 1 error on training data, 1 on test

TreeNet results for multi-class data
- Average cross-validation accuracy over 95%
- Original authors had accuracy of about 85% using a nearest neighbor classifier

Clustering Goals
- Find natural classes in the data
- Identify new classes / gene correlations
- Refine existing taxonomies
- Support biological analysis / discovery
- Different methods
  - Hierarchical clustering, SOMs, etc.

Yeast SOM Clusters
Yeast Cell Cycle SOM.
(a) 6 × 5 SOM. The 828 genes that passed the variation filter were grouped into 30 clusters. Each cluster is represented by the centroid (average pattern) for genes in the cluster. Expression level of each gene was normalized to have mean = 0 and SD = 1 across time points. Expression levels are shown on the y-axis and time points on the x-axis. Error bars indicate the SD of average expression. n indicates the number of genes within each cluster. Note that multiple clusters exhibit periodic behavior and that adjacent clusters have similar behavior.
(b) Cluster 29 detail. Cluster 29 contains 76 genes exhibiting periodic behavior with peak expression in late G1. Normalized expression patterns of the 30 genes nearest the centroid are shown.
(c) Centroids for SOM-derived clusters 29, 14, 1, and 5, corresponding to the G1, S, G2 and M phases of the cell cycle, are shown.

Yeast SOM Clusters
[Figure: yeast SOM clusters]

Discovery of causal processes
- A long-term goal of Systems Biology is to discover the causal processes among genes, proteins, and other molecules in cells
- Can this be done (in part) by using data from high-throughput experiments, such as microarrays?

A Model of Galactose Utilization (manually discovered)
[Figure: T. Ideker, et al., Science 292 (May 4, 2001)]

Bayesian Causal Network Structure
[Figure: network with P(GAL4), P(GAL2 | GAL4), P(Intracellular Galactose | GAL2)]
- Each variable is independent of its distant causes given all of its direct causes.
- Thanks to Greg Cooper, U. Pitt
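
Read as a chain GAL4 → GAL2 → intracellular galactose (an assumption based on the three probabilities listed on the slide), the independence statement gives the following factorization of the joint distribution, writing G for intracellular galactose:

```latex
P(\mathrm{GAL4}, \mathrm{GAL2}, G)
  = P(\mathrm{GAL4}) \, P(\mathrm{GAL2} \mid \mathrm{GAL4}) \, P(G \mid \mathrm{GAL2}),
\qquad
P(G \mid \mathrm{GAL2}, \mathrm{GAL4}) = P(G \mid \mathrm{GAL2})
```

The second equation is the independence statement itself: once the direct cause GAL2 is known, the distant cause GAL4 carries no further information about G.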

Bayesian Network Learned for Yeast
[Figure: Hartemink et al, Combining Location and Expression Data for Principled Discovery of Genetic Regulatory Network Models, PSB 2002, psb.stanford.edu/psb-online]

Future directions for Microarray Analysis
- Algorithms optimized for small samples
- Integration with other data
  - biological networks
  - medical text
  - protein data
- Cost-sensitive classification algorithms
  - error cost depends on outcome (don't want to miss treatable cancer), treatment side effects, etc.
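
Cost-sensitive classification can be illustrated by choosing the decision with the lowest expected cost rather than the most probable class. The sketch below is hypothetical throughout: the class probabilities, the cost table and the function name are invented for illustration.

```python
def min_expected_cost_decision(class_probs, cost):
    """Pick the decision with the lowest expected cost.

    cost[decision][true_class] is the cost of taking `decision`
    when the true class is `true_class`.
    """
    expected = {
        decision: sum(class_probs[c] * cost_row[c] for c in class_probs)
        for decision, cost_row in cost.items()
    }
    return min(expected, key=expected.get), expected

# Hypothetical classifier output and cost table.
probs = {"cancer": 0.2, "healthy": 0.8}
cost = {
    "treat": {"cancer": 0, "healthy": 1},      # side effects of unneeded treatment
    "no_treat": {"cancer": 10, "healthy": 0},  # missing a treatable cancer
}

decision, expected = min_expected_cost_decision(probs, cost)
print(decision, expected)
```

With these made-up numbers the expected cost of treating (0.8) is lower than that of not treating (2.0), so the sketch chooses treatment even though "healthy" is the more probable class.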

Integrate biological knowledge when analyzing microarray data
(from Cheng Li, Harvard SPH)
[Figure, right picture: Gene Ontology: tool for the unification of biology, Nature Genetics, 25, p25]

GeneSpring Demo
- Yeast data
- Zoom all the way to bases
- Yeast Cycle: animation
- Color: expression strength

Acknowledgements
- Sridhar Ramaswamy, MIT Whitehead Institute
- Pablo Tamayo, MIT Whitehead Institute
- Greg Cooper, U. Pittsburgh
- Tom Khabaza, SPSS

Thank you!
Further resources on Data Mining:
Contact: Gregory Piatetsky-Shapiro: