1 Data Mining in Genomics: the dawn of personalized medicine
Gregory Piatetsky-Shapiro, KDnuggets
Connecticut College, October 15, 2003

2 Overview (© 2003 KDnuggets)
- Data Mining and Knowledge Discovery
- Genomics and Microarrays
- Microarray Data Mining

3 Trends leading to Data Flood
- More data is generated:
  - Bank, telecom, other business transactions...
  - Scientific data: astronomy, biology, etc.
  - Web, text, and e-commerce
- More data is captured:
  - Storage technology faster and cheaper
  - DBMS capable of handling bigger DB

4 Knowledge Discovery Process
[Diagram. Data stages: Raw Data, Data Warehouse, Target Data, Transformed Data, Patterns and Rules, Knowledge. Steps: Integration, Selection & Cleaning, Transformation, Data Mining, Interpretation & Evaluation, Understanding.]

5 Major Data Mining Tasks
- Classification: predicting an item class
- Clustering: finding clusters in data
- Associations: e.g. A & B & C occur frequently
- Visualization: to facilitate human discovery
- Summarization: describing a group
- Estimation: predicting a continuous value
- Deviation Detection: finding changes
- Link Analysis: finding relationships

6 Major Application Areas for Data Mining Solutions
- Advertising
- Bioinformatics
- Customer Relationship Management (CRM)
- Database Marketing
- Fraud Detection
- eCommerce
- Health Care
- Investment/Securities
- Manufacturing, Process Control
- Sports and Entertainment
- Telecommunications
- Web

7 Genome, DNA & Gene Expression
- An organism's genome is the program for making the organism, encoded in DNA
  - Human DNA has about 30,000-35,000 genes
  - A gene is a segment of DNA that specifies how to make a protein
- Cells are different because of differential gene expression
  - About 40% of human genes are expressed at one time
- Microarray devices measure gene expression

8 Molecular Biology Overview
[Figure labels: Cell, Nucleus, Chromosome, Gene (DNA), Gene (mRNA) single strand, Protein, Gene expression.]
Graphics courtesy of the National Human Genome Research Institute

9 Affymetrix Microarrays
[Figure: chip image with scale labels 50 µm and 1.28 cm.]
- ~10^7 oligonucleotides: half perfectly match mRNA (PM), half have one mismatch (MM)
- Gene expression computed from PM and MM
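The slide does not say how expression is computed from PM and MM. As a minimal sketch, assuming the simple "average difference" scheme (the mean of PM minus MM over a gene's probe pairs), the calculation could look like the following. The function name and the intensity values are hypothetical and only illustrate the idea.

```python
from statistics import mean

def average_difference(pm, mm):
    """Mean of (PM - MM) over the probe pairs of one gene.

    Assumption: this is one simple summary scheme, not necessarily
    the exact method used for the data in this talk.
    """
    if len(pm) != len(mm) or not pm:
        raise ValueError("pm and mm must be non-empty and of equal length")
    return mean(p - m for p, m in zip(pm, mm))

# Hypothetical intensities for one probe set (illustrative numbers only).
pm = [520.0, 610.0, 480.0, 700.0]
mm = [300.0, 650.0, 200.0, 400.0]
signal = average_difference(pm, mm)
```

Under this scheme a gene whose mismatch probes are brighter than its perfect-match probes gets a negative value, which would be consistent with negative entries such as -70 in the raw data on slide 10.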

10 Affymetrix Microarray Raw Image
[Figure: scanner; enlarged section of raw image; raw data.]
Gene             Value
D26528_at          193
D26561_cds1_at     -70
D26561_cds2_at     144
D26561_cds3_at      33
D26579_at          318
D26598_at         1764
D26599_at         1537
D26600_at         1204
D28114_at          707

11 Microarray Potential Applications
- New and better molecular diagnostics
- New molecular targets for therapy
  - few new drugs, large pipeline, …
- Outcome depends on genetic signature
  - best treatment?
- Fundamental Biological Discovery
  - finding and refining biological pathways
- Personalized medicine ?!

12 Microarray Data Mining Challenges
- Avoiding false positives, due to
  - too few records (samples), usually < 100
  - too many columns (genes), usually > 1,000
- Model needs to be robust in presence of noise
- For reliability need large gene sets; for diagnostics or drug targets, need small gene sets
- Estimate class probability
- Model needs to be explainable to biologists
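To see why many columns invite false positives, a rough worked calculation helps. Assume, purely for illustration, that none of the m genes truly differs between classes and that each gene is tested at significance level α (the symbols m and α are not on the slide). The expected number of genes that pass by chance is then:

```latex
\mathbb{E}[\text{false positives}] = m\,\alpha,
\qquad m > 1000,\ \alpha = 0.01 \;\Rightarrow\; m\,\alpha > 10 .
```

So with more than 1,000 genes, a 1% threshold would be expected to flag more than 10 genes even when no real signal is present.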

13 False Positives in Astronomy
[Cartoon, used with permission.]

14 CATs: Clementine Application Templates
- CATs: examples of complete data mining processes
- Microarray CAT
[Diagram labels: Preparation, 2-Class, Multi-Class, Clustering.]

15 Key Ideas
- Capture the complete process
- Cross-validation loop with feature selection inside
- Randomization to select significant genes
- Internal iterative feature selection loop
- For each class, separate selection of optimal gene sets
- Neural nets: robust in presence of noise
- Bagging of neural nets

16 Microarray Classification
[Diagram labels: Data, Train data, Test data, Feature and Parameter Selection, Model Building, Evaluation.]

17 Classification: External Cross-Validation
[Diagram labels: Gene Data, Train, Final Test; Data, Train data, Test data, Feature and Parameter Selection, Model Building, Evaluation; Final Model, Final Results.]
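The point of the external loop is that gene selection is repeated inside every fold, using only that fold's training samples, so the held-out samples never influence which genes are chosen. Below is a minimal sketch of that idea in plain Python. It is not the Clementine stream from the talk: a nearest-centroid classifier stands in for the neural nets, and every function name and the synthetic data are assumptions made for illustration.

```python
import random
from math import sqrt
from statistics import mean, variance

def t_value(values, labels):
    """Welch t-statistic of the mean difference between class 1 and class 0."""
    a = [v for v, c in zip(values, labels) if c == 1]
    b = [v for v, c in zip(values, labels) if c == 0]
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se if se > 0 else 0.0

def select_genes(X, y, n_genes):
    """Indices of the n_genes columns with the largest |T| on these samples."""
    n_cols = len(X[0])
    scores = [abs(t_value([row[j] for row in X], y)) for j in range(n_cols)]
    return sorted(range(n_cols), key=lambda j: -scores[j])[:n_genes]

def nearest_centroid(X, y, genes):
    """Return predict(row), which picks the class with the closest centroid."""
    cents = {c: [mean(r[j] for r, lab in zip(X, y) if lab == c) for j in genes]
             for c in set(y)}
    def predict(row):
        return min(cents, key=lambda c: sum((row[j] - m) ** 2
                                            for j, m in zip(genes, cents[c])))
    return predict

def external_cv_error(X, y, n_genes, k=5, seed=0):
    """k-fold error rate with gene selection redone inside each fold."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    errors = 0
    for f in range(k):
        test = idx[f::k]
        train = [i for i in idx if i not in test]
        Xtr, ytr = [X[i] for i in train], [y[i] for i in train]
        genes = select_genes(Xtr, ytr, n_genes)  # training fold only
        predict = nearest_centroid(Xtr, ytr, genes)
        errors += sum(predict(X[i]) != y[i] for i in test)
    return errors / len(X)

# Synthetic data: 20 samples, gene 0 separates the classes, 29 noise genes.
rng = random.Random(0)
y = [0, 1] * 10
X = [[rng.gauss(10 * c, 1)] + [rng.gauss(0, 1) for _ in range(29)] for c in y]
err = external_cv_error(X, y, n_genes=1)
```

Selecting genes once on all samples and then cross-validating only the classifier would leak information from the test folds and understate the error. In the diagram, the final test data stays outside this whole loop and is used once, on the final model.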

18 Measuring false positives with randomization
[Diagram labels: Gene, Class, Rand Class; "Bottom 1% T-value =".]
- Randomize the class labels 500 times
- Select potentially interesting genes at 1%
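The randomization idea can be sketched as follows: shuffle the class labels many times, recompute every gene's T-value under each shuffle, and use the resulting null distribution to set a cutoff that only 1% of chance T-values exceed. This is a hedged illustration rather than the exact procedure from the talk. In particular, it uses a two-sided cutoff on |T|, whereas the slide shows a "Bottom 1%" value, and the function names and synthetic data are assumptions.

```python
import random
from math import sqrt
from statistics import mean, variance

def t_value(values, labels):
    """Welch t-statistic of the mean difference between class 1 and class 0."""
    a = [v for v, c in zip(values, labels) if c == 1]
    b = [v for v, c in zip(values, labels) if c == 0]
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se if se > 0 else 0.0

def randomization_threshold(genes, labels, n_rand=500, tail=0.01, seed=0):
    """|T| exceeded by only `tail` of the T-values seen with shuffled labels."""
    rng = random.Random(seed)
    shuffled = list(labels)
    null_t = []
    for _ in range(n_rand):
        rng.shuffle(shuffled)  # break any real gene/class relationship
        null_t.extend(abs(t_value(g, shuffled)) for g in genes)
    null_t.sort()
    return null_t[min(int((1 - tail) * len(null_t)), len(null_t) - 1)]

def select_significant(genes, labels, **kwargs):
    """Indices of genes whose actual |T| beats the randomization cutoff."""
    cutoff = randomization_threshold(genes, labels, **kwargs)
    return [i for i, g in enumerate(genes) if abs(t_value(g, labels)) > cutoff]

# Synthetic illustration: 12 samples, 20 noise genes plus one shifted gene.
rng = random.Random(1)
labels = [0] * 6 + [1] * 6
genes = [[rng.gauss(0, 1) for _ in labels] for _ in range(20)]
genes.insert(0, [rng.gauss(5 * c, 1) for c in labels])  # gene 0 differs by class
selected = select_significant(genes, labels)
```

Because the cutoff comes from the data's own shuffled labels, it reflects the actual sample size and noise level rather than a textbook distribution.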

19 Gene Reduction improves Classification
- Most learning algorithms look for non-linear combinations of features
  - can easily find many spurious combinations given a small # of records and a large # of genes
- Classification accuracy improves if we first reduce the # of genes by a linear method, e.g. T-values of mean difference
- Heuristic: select equal # of genes from each class
- Then apply a favorite machine learning algorithm
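For two classes, the "equal number of genes from each class" heuristic can be read as taking the most positive T-values (genes higher in one class) and the most negative ones (genes higher in the other). The sketch below assumes that reading; the function name and the T-values are hypothetical.

```python
def top_genes_per_class(t_values, n_per_class):
    """Pick n genes most over-expressed in each of two classes by signed T."""
    if not 1 <= n_per_class <= len(t_values) // 2:
        raise ValueError("n_per_class must be between 1 and half the gene count")
    order = sorted(range(len(t_values)), key=lambda i: t_values[i])
    low = order[:n_per_class]            # most negative T: higher in class 0
    high = order[-n_per_class:][::-1]    # most positive T: higher in class 1
    return {"class_1": high, "class_0": low}

# Hypothetical signed T-values for six genes (class 1 mean minus class 0 mean).
t_values = [4.2, -0.3, -5.1, 0.8, 2.9, -3.7]
chosen = top_genes_per_class(t_values, n_per_class=2)
```

Ranking by |T| alone could return genes that all favor the same class; splitting by sign keeps markers for both.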

20 Iterative Wrapper approach to selecting the best gene set
- Test models using 1, 2, 3, …, 10, 20, 30, 40, ..., 100 top genes with cross-validation
- Heuristic 1: evaluate errors from each class; select the number of genes from each class that minimizes error for that class
- For randomized algorithms, average 10+ cross-validation runs!
- Select gene set with lowest average error
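The outer search over gene-set sizes can be sketched as a small loop. Here `cv_error(n_genes, run)` is a hypothetical callback that would run one cross-validation with the given number of top genes; the toy version below only mimics a curve with a minimum so the example runs on its own. The sketch averages overall error and does not implement the per-class variant in Heuristic 1.

```python
from statistics import mean, stdev

def candidate_sizes():
    """1, 2, ..., 10, then 20, 30, ..., 100 genes, as listed on the slide."""
    return list(range(1, 11)) + list(range(20, 101, 10))

def best_gene_count(cv_error, n_runs=10):
    """Average cv_error(n_genes, run) over n_runs; return the best size."""
    summary = {}
    for n in candidate_sizes():
        errs = [cv_error(n, run) for run in range(n_runs)]
        summary[n] = (mean(errs), stdev(errs))  # center point and error bar
    best = min(summary, key=lambda n: summary[n][0])
    return best, summary

def toy_cv_error(n_genes, run):
    """Stand-in for a real cross-validation run (illustrative only)."""
    return abs(n_genes - 10) / 100 + 0.01 * (run % 3)

best, summary = best_gene_count(toy_cv_error)
```

Keeping the standard deviation alongside the mean makes it possible to see whether the apparent winner is clearly better than its neighbors or within run-to-run noise.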

21 Clementine stream for subset selection by cross-validation

22 Microarrays: ALL/AML Example
- Leukemia: Acute Lymphoblastic (ALL) vs Acute Myeloid (AML), Golub et al, Science, v.286
- 72 examples (38 train, 34 test), about 7,000 genes
- well-studied (CAMDA-2000), good test example
[Figure: ALL and AML samples.] Visually similar, but genetically very different

23 Gene subset selection: one cross-validation run
[Figure: single cross-validation run.]

24 Gene subset selection: multiple cross-validation runs
- For ALL/AML data, 10 genes per class had the lowest error (<1%)
- Point in the center is the average error from 10 cross-validation runs
- Bars indicate 1 standard deviation above and below

25 ALL/AML: Results on the test data
- Genes selected and model trained on the train set ONLY!
- Best net with 10 top genes per class (20 overall) was applied to the test data (34 samples):
  - 33 correct predictions (97% accuracy)
  - 1 error, on sample 66: actual class AML, net prediction ALL
  - other methods consistently misclassify this sample; was it misclassified by a pathologist?

26 Pediatric Brain Tumour Data
- 92 samples, 5 classes (MED, EPD, JPA, MGL, RHB) from U. of Chicago Children's Hospital
- Outer cross-validation with gene selection inside the loop
- Ranking by absolute T-test value (selects top positive and negative genes)
- Select best genes by adjusted error for each class
- Bagging of 100 neural nets

27 Selecting Best Gene Set
- Minimizing combined error for all classes is not optimal
[Figure: average, high and low error rate for all classes.]

28 Error rates for each class
[Figure: error rate plotted against genes per class.]

29 Evaluating One Network
Averaged over 100 Networks:
Class   Error rate
MED     2.1%
MGL     17%
RHB     24%
EPD     9%
JPA     19%
*ALL*   8.3%

30 Bagging 100 Networks
Class   Individual Error Rate   Bag Error Rate   Bag Avg Conf
MED     2.1%                    2% (0)*          98%
MGL     17%                     10%              83%
RHB     24%                     11%              76%
EPD     9%                      0                91%
JPA     19%                     0                81%
*ALL*   8.3%                    3% (2)*          92%
* Note: suspected error on one sample (labeled as MED but consistently classified as RHB)
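Bagging combines the individual networks by letting them vote. The slide does not define "Bag Avg Conf"; the sketch below assumes one plausible reading, where the confidence for a sample is the fraction of networks that agree with the winning class. The function name and the vote counts are hypothetical.

```python
from collections import Counter

def bag_predict(votes):
    """Majority class among the networks' votes and the fraction that agreed."""
    label, count = Counter(votes).most_common(1)[0]
    return label, count / len(votes)

# Hypothetical votes from 100 networks for a single sample.
votes = ["MED"] * 83 + ["RHB"] * 17
label, confidence = bag_predict(votes)
```

A sample that most networks classify the same way but that disagrees with its recorded label, like the suspected MED/RHB case in the note, is a candidate for a labeling error rather than a model error.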

31 AF1q: New Marker for Medulloblastoma?
- AF1Q: ALL1-fused gene from chromosome 1q
- transmembrane protein
- Related to leukemia (3 PUBMED entries) but not to Medulloblastoma

32 Future directions for Microarray Analysis
- Algorithms optimized for small samples
- Integration with other data
  - biological networks
  - medical text
  - protein data
- Cost-sensitive classification algorithms
  - error cost depends on outcome (don't want to miss treatable cancer), treatment side effects, etc.
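Cost-sensitive classification can be illustrated with a small expected-cost calculation: instead of predicting the most probable class, predict the class whose mistakes would be cheapest on average. The class names, probabilities, and costs below are invented for illustration and do not come from the talk.

```python
def min_cost_class(probs, cost):
    """Choose the prediction with the lowest expected cost.

    probs: {true_class: probability}
    cost:  cost[predicted][true_class], the cost of that outcome
    """
    expected = {pred: sum(probs[t] * cost[pred][t] for t in probs)
                for pred in cost}
    return min(expected, key=expected.get), expected

# Hypothetical numbers: missing a treatable cancer costs 10x a false alarm.
probs = {"cancer": 0.3, "healthy": 0.7}
cost = {
    "cancer":  {"cancer": 0.0, "healthy": 1.0},   # false alarm
    "healthy": {"cancer": 10.0, "healthy": 0.0},  # missed treatable cancer
}
decision, expected = min_cost_class(probs, cost)
```

Here "healthy" is the more probable class, yet the lower expected cost belongs to predicting "cancer", which is the behavior the slide asks for. This also shows why the earlier challenge of estimating class probability matters.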

33 Acknowledgements
- Eric Bremer, Children's Hospital (Chicago) & Northwestern U.
- Greg Cooper, U. Pittsburgh
- Tom Khabaza, SPSS
- Sridhar Ramaswamy, MIT/Whitehead Institute
- Pablo Tamayo, MIT/Whitehead Institute

34 Thank you
Further resources on Data Mining:
Microarrays:
Contact: Gregory Piatetsky-Shapiro: