Presented by John Quackenbush, Ph.D. at the June 10, 2003 meeting of the Pharmacology Toxicology Subcommittee of the Advisory Committee for Pharmaceutical Science.


1

2 Presented by John Quackenbush, Ph.D. at the June 10, 2003 meeting of the Pharmacology Toxicology Subcommittee of the Advisory Committee for Pharmaceutical Science

3 Challenges in Data Management and Analysis for Microarrays FDA 10 June 2003

4 Selecting the Appropriate Platform

5 Affymetrix GeneChip™ Expression Analysis: Generate DNA sequence (ACGTAGCTAGCTGATCGTAGCTAGCTAGCTAGCTGATC); design and synthesize chips. [Figure: fourteen overlapping 25-mer probes tiled across the sequence one base apart, from ACGTAGCTAGCTGATCGTAGCTAGC to ATCGTAGCTAGCTAGCTAGCTGATC.]

6 Affymetrix GeneChip™ Expression Analysis: Obtain RNA samples (control, test); prepare fluorescently labeled probes; hybridize and wash chips; scan chips; analyze (PM, MM).

7 Microarray Overview I: Microbial ORFs: design PCR primers, PCR products. Eukaryotic genes: select cDNA clones, PCR products. Microtiter plates: many different plates containing different genes. For each plate set, many identical replicas of the microarray slide (with 60,000 or more spotted genes).

8 Microarray Gene Chip Overview II: Obtain RNA samples (control, test); prepare fluorescently labeled probes; hybridize, wash; measure fluorescence in 2 channels (red/green); analyze the data to identify patterns of gene expression.

9 Microarray Expression Analysis: Gene; Spots on an Array; Fluorescence Intensity; Expression Measurement; Tissue Selection; Differential State/Stage Selection; RNA Preparation and Labeling; Competitive Hybridization.

10 Platform-related issues: Lack of standardization makes direct comparison of results a challenge. Lot-to-lot variation in arrays can introduce artifacts: are the results dependent on the biology or on the arrays (or technician, or reagent lots, or ...)? Commercial arrays provide a standard and remove some design considerations (one sample, one array), but cost up to 10x (or more) what in-house arrays cost. Arrays demand good LIMS systems for sample tracking.

11 Microarray Analysis

12 General Microarray Strategy: Choose an experimentally interesting and tractable model system. Design an experiment with comparisons between related variants. Include sufficient biological replication to make good estimates. Hybridize and collect data. Normalize and filter. Mine data for biological patterns of expression. Integrate expression data with other ancillary data, including genotype, phenotype, the genome, and its annotation.

13 Annotating and Comparing Arrays

14 TIGR Gene Indices home page: www.tigr.org/tdb/tgi (~60 species, >16,000,000 sequences)

15 The Mouse Gene Index

16 A TC Example

17 GO Terms and EC Numbers (Babak Parvizi)

18 The TIGR Gene Indices (Dan Lee, Ingeborg Holt)

19 Tentative Orthologues And Paralogues Building TOGs: Reflexive, Transitive Closure Thanks to Woytek Makałowski and Mark Boguski

20 TOGA: A Sample Alignment: bithoraxoid-like protein

21

22 Gene Finding in Humans is easy! Razvan Sultana

23 Gene Finding in Humans is easy? Razvan Sultana

24 Gene Finding in Humans is difficult? Razvan Sultana

25 Gene Finding in Humans is difficult? Razvan Sultana A genome and its annotation is only a hypothesis that must be tested.

26 http://pga.tigr.org/tools.shtml RESOURCERER Jennifer Tsai

27 RESOURCERER: An Example

28 RESOURCERER: Using Genetic Markers Next step: Integrate QTLs

29 Annotation Issues: The “complete” genome is incomplete. Gene names are not yet well defined: one gene may have many names, one gene may have many sequences, and one sequence may have many names. Analysis and interpretation depend on well-annotated gene sets: gene names, Gene Ontology assignments, and pathway information. Cross-species comparisons require good knowledge of orthologues and paralogues.

30 Tools and Techniques for Array Analysis

31 Analysis steps: Design the experiment. Perform the hybridizations and generate images. Analyze images to identify genes and expression levels (hybridization intensities). Normalize expression levels to facilitate comparisons. Analyze expression data to find biologically relevant patterns.

32 MADAM: Microarray Data Manager. Available with OSI source and MySQL. MAGE-ML export by June. Joseph White, Jerry Li, Alexander Saeed, Vasily Sharov, Syntek Inc.

33 Why Normalize Data? The goal is to measure ratios of gene expression levels: ratio_i = R_i / G_i, where R_i and G_i are, respectively, the measured red and green intensities for the ith spot. In a self-self hybridization, we would expect all ratios to be equal to one: R_i / G_i = 1 for all i. But they may not be. Why not? Unequal labeling efficiencies for Cy3/Cy5; noise in the system; differential expression. Normalization brings (appropriate) ratios back to one.
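As a rough illustration of the slide above, the following sketch computes per-spot log2 ratios and applies a simple total-intensity rescaling. The function names, the sample intensities, and the choice of total-intensity scaling are assumptions made for this example; the slides do not say which global method was used.

```python
import math

def log2_ratios(red, green):
    """Return log2(R_i / G_i) for each spot i."""
    return [math.log2(r / g) for r, g in zip(red, green)]

def total_intensity_normalize(red, green):
    """Rescale the green channel so both channels have equal total intensity.

    Assumption for illustration: a single global scale factor. This removes
    an overall dye bias but not intensity-dependent structure.
    """
    scale = sum(red) / sum(green)
    return list(red), [g * scale for g in green]

# Hypothetical self-self hybridization in which the red channel reads 20% high.
green = [100.0, 400.0, 1600.0, 6400.0]
red = [g * 1.2 for g in green]

raw = log2_ratios(red, green)                      # all near log2(1.2)
norm_red, norm_green = total_intensity_normalize(red, green)
normalized = log2_ratios(norm_red, norm_green)     # all near 0, i.e. ratio 1
```

A single scale factor is the simplest possible correction; it cannot remove a bias that changes with spot intensity.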

34 The Starting Point: The R-I Plot. Data exhibit an intensity-dependent structure. Uncertainty in ratio measurements is generally greater at lower intensities. Plot log2(R/G) vs. log2(R*G) [variation: Terry Speed’s M-A plot with (1/2)*log2(R*G)].
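A minimal sketch of the plot coordinates for a single spot follows. The helper names and sample intensities are hypothetical. The intensity axis here uses log10(R*G), following the LOWESS slide later in the deck; changing the base only rescales that axis.

```python
import math

def ri_point(r, g):
    """Return (log10(R*G), log2(R/G)): R-I plot coordinates of one spot."""
    return math.log10(r * g), math.log2(r / g)

def ma_point(r, g):
    """Return (A, M) with A = 0.5 * log2(R*G) and M = log2(R/G)."""
    return 0.5 * math.log2(r * g), math.log2(r / g)

# Hypothetical (red, green) intensities for three spots.
spots = [(1200.0, 1000.0), (150.0, 300.0), (8000.0, 8000.0)]
ri = [ri_point(r, g) for r, g in spots]
ma = [ma_point(r, g) for r, g in spots]
```

Both plots share the same vertical axis (the log ratio), so they show the same intensity-dependent structure.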

35 Lowess Normalization: Why LOWESS? Observations: 1. Intensity-dependent structure. 2. Data not mean centered at log2(ratio) = 0. [Figure: panel A, SD = 0.346]

36 LOWESS (Cont’d): Local linear regression model; tri-cube weight function; least squares. Estimated values of log2(Cy5/Cy3) as a function of log10(Cy3*Cy5). [Figure: panel A, SD = 0.346]
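The three ingredients named on this slide (local linear regression, tri-cube weights, least squares) can be sketched in plain Python as below. This is an illustrative single-pass version with no robustness iterations, not the implementation used in the TIGR tools; `frac` and the function names are assumptions for the example.

```python
def tricube(u):
    """Tri-cube weight: (1 - |u|^3)^3 for |u| < 1, otherwise 0."""
    u = abs(u)
    return (1.0 - u ** 3) ** 3 if u < 1.0 else 0.0

def lowess_fit(x, y, frac=0.5):
    """Fitted y at each x from a weighted local linear least-squares fit."""
    n = len(x)
    k = max(2, int(round(frac * n)))      # points used in each local window
    fitted = []
    for x0 in x:
        dists = sorted(abs(xi - x0) for xi in x)
        h = dists[k - 1] or 1e-12         # bandwidth: distance to k-th nearest point
        w = [tricube((xi - x0) / h) for xi in x]
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted

def lowess_normalize(log_intensity, log_ratio, frac=0.5):
    """Subtract the fitted trend so log ratios are centered on 0 at every intensity."""
    trend = lowess_fit(log_intensity, log_ratio, frac)
    return [m - t for m, t in zip(log_ratio, trend)]

# Hypothetical data: a log ratio that drifts linearly with log intensity.
log_intensity = [float(i) for i in range(10)]
log_ratio = [0.1 * a - 0.4 for a in log_intensity]
corrected = lowess_normalize(log_intensity, log_ratio)
```

Here x plays the role of log10(Cy3*Cy5) and y the role of log2(Cy5/Cy3); the corrected ratio is the residual from the local fit.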

37 LOWESS Results

38 MIDAS: Data Analysis (Wei Liang). Available with source. Variance Stabilization, Adding Error Models, MAANOVA, Automated Reporting.

39 MeV: Data Mining Tools. Available with OSI source. Alexander Saeed, Alexander Sturn, Nirmal Bhagabati, John Braisted, Syntek Inc., Datanaut, Inc.

40 Multiple Experiments? The goal is to identify genes (or experiments) which have “similar” patterns of expression. This is a problem in data mining. “Clustering algorithms” are most widely used. Types: Agglomerative: hierarchical. Divisive: k-means, SOMs. Others: Principal Component Analysis (PCA). All depend on how one measures distance.
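To make the agglomerative case concrete, here is a small average-linkage sketch in plain Python. The function names and the toy expression profiles are hypothetical, and average linkage is one choice among several; the `dist` argument is pluggable because, as the slide notes, everything depends on how distance is measured.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two expression vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def agglomerate(vectors, n_clusters, dist=euclidean):
    """Average-linkage agglomerative clustering; returns lists of row indices."""
    clusters = [[i] for i in range(len(vectors))]

    def linkage(c1, c2):
        # Mean pairwise distance between the members of two clusters.
        total = sum(dist(vectors[i], vectors[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        pairs = ((p, q) for p in range(len(clusters))
                 for q in range(p + 1, len(clusters)))
        a, b = min(pairs, key=lambda pq: linkage(clusters[pq[0]], clusters[pq[1]]))
        clusters[a] = clusters[a] + clusters[b]   # merge the closest pair
        del clusters[b]
    return clusters

# Hypothetical log-ratio profiles: two genes go up late, two go down late.
profiles = [
    (-1.0, -0.9, 1.0),
    (-1.1, -1.0, 0.9),
    (1.0, 1.1, -1.0),
    (0.9, 1.0, -1.1),
]
groups = agglomerate(profiles, n_clusters=2)
```

Stopping at a fixed number of clusters is a simplification; hierarchical methods normally produce a full tree that is cut afterwards.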

41 Expression Vectors: Crucial concept for understanding clustering. Each gene is represented by a vector whose coordinates are its log(ratio) values in each experiment: x = log(ratio) expt1, y = log(ratio) expt2, z = log(ratio) expt3, etc. [Figure: genes with similar expression plotted on x, y, z axes]

42 Expression Vectors: Crucial concept for understanding clustering. Each gene is represented by a vector whose coordinates are its log(ratio) values in each experiment: x = log(ratio) expt1, y = log(ratio) expt2, z = log(ratio) expt3, etc. For example, if we do six experiments: Gene 1 = (-1.2, -0.5, 0, 0.25, 0.75, 1.4); Gene 2 = (0.2, -0.5, 1.2, -0.25, -1.0, 1.5); Gene 3 = (1.2, 0.5, 0, -0.25, -0.75, -1.4); etc.
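The three example vectors from this slide can be compared directly. The sketch below uses Euclidean distance and Pearson correlation; these two metrics are chosen for illustration, and the helper names are not from the slides. Gene 3 is the exact negative of Gene 1, so the two are far apart in Euclidean terms yet perfectly anti-correlated.

```python
import math

genes = {
    "Gene 1": [-1.2, -0.5, 0.0, 0.25, 0.75, 1.4],
    "Gene 2": [0.2, -0.5, 1.2, -0.25, -1.0, 1.5],
    "Gene 3": [1.2, 0.5, 0.0, -0.25, -0.75, -1.4],
}

def euclidean(a, b):
    """Straight-line distance between two points in expression space."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pearson(a, b):
    """Pearson correlation coefficient between two expression vectors."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

d13 = euclidean(genes["Gene 1"], genes["Gene 3"])
r13 = pearson(genes["Gene 1"], genes["Gene 3"])
r12 = pearson(genes["Gene 1"], genes["Gene 2"])
```

A distance such as 1 - |r| would treat Gene 1 and Gene 3 as identical, while Euclidean distance treats them as very different, which is one way a change of metric can change clustering results.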

43 Expression Matrix: These gene expression vectors of log(ratio) values can be used to construct an expression matrix.
         Expt 1  Expt 2  Expt 3  Expt 4  Expt 5  Expt 6
Gene 1    -1.2    -0.5     0      0.25    0.75    1.4
Gene 2     0.2    -0.5     1.2   -0.25   -1.0     1.5
Gene 3     1.2     0.5     0     -0.25   -0.75   -1.4
etc.
This is often represented as a red/green colored matrix.

44 Expression Matrix: The Expression Matrix is a representation of data from multiple microarray experiments (rows Gene 1 to Gene 6, columns Exp 1 to Exp 6). Each element is a log ratio, usually log2(Cy5/Cy3). Red indicates a positive log ratio, i.e., Cy5 > Cy3. Green indicates a negative log ratio, i.e., Cy5 < Cy3. Black indicates a log ratio of zero, i.e., Cy5 and Cy3 are very close in value. Gray indicates missing data.
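The color convention on this slide can be written as a small mapping function. This is only a sketch: the function name, the RGB encoding, the gray level, and the `saturation` cutoff (the log ratio at which the color reaches full brightness) are all assumptions for illustration.

```python
def log_ratio_color(value, saturation=2.0):
    """Map a log2(Cy5/Cy3) value to an (R, G, B) tuple with 0-255 channels.

    None stands for missing data and is drawn gray. Values beyond
    +/- saturation are clipped to full red or full green.
    """
    if value is None:
        return (128, 128, 128)                    # gray: missing data
    level = int(round(255 * min(abs(value), saturation) / saturation))
    if value > 0:
        return (level, 0, 0)                      # red: Cy5 > Cy3
    return (0, level, 0)                          # green (or black at 0): Cy5 <= Cy3

# One hypothetical matrix row with a missing measurement.
row = [-3.0, 0.0, 2.0, None]
colors = [log_ratio_color(v) for v in row]
```

Clipping at a fixed saturation keeps a few extreme ratios from washing out the rest of the matrix.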

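45 Expression Vectors As Points in ‘Expression Space’. [Figure: genes G1 to G5 from an expression matrix (Exp 1, Exp 2, Exp 3) plotted as points on axes x, y, z (Experiment 1, Experiment 2, Experiment 3); genes with similar expression lie close together.]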

46 Analysis Issues: There is no standard method for data analysis. The same algorithm with a small change in parameters (such as distance metric) can produce very different results. Data normalization plays a big role in identifying “differentially expressed” genes. Much of the apparent disparity in microarray datasets can be attributed to differences in data analysis methods, from image processing to normalization to data mining.

47 Data Reporting Standards

48 What data should we collect? EVERYTHING. Nature Genetics 29, December 2001 <http://www.mged.org>. MAGE-ML: XML-based data exchange format.

49 Publications on Microarray Data Exchange Standards. MIAME Standards: Nature family, Cell family, EMBO reports, Bioinformatics, Genome Research, Genome Biology, Science, The Lancet, and others.

50 Standardization Issues: MIAME Standards are a start, but still evolving. Implementation will require further development of ontologies to create standard descriptors. MIAME-Tox represents an attempt to extend this to toxicology. Software must be developed to read/write MAGE-ML. Public databases need to be extended to meet Tox needs.

51 Science

52 Integrating Expression with other data

53 [Figure: LPS signaling pathway. Labels: Innate Immunity; Adaptive Immunity; Pathophysiologic Conditions (Sepsis, ARDS, Asthma); Immunomodulatory Genes; Antigen Presentation; Cytokines and Adhesion Proteins; Inflammatory Cell Recruitment; LPS, LBP, BPI, CD14, MD-2, TLR Proteins, MyD88, IRAK2, TRAF-6, NIK, IκB, NF-κB, Degradation.] Adapted from Godowski. NEJM 1999; 340:1835. David Schwartz

54 Examples: strains C57BL/6, DBA/2, BXD5, BXD29, BXD39, BXD42. [Figure: hybridization design among samples P1, P2, L1, L2, L3, H1, H2, H3, each also hybridized against reference R (P1+P2).] 53 hybridizations. Result: ~425 “significant” genes.

55 Examples: strains C57BL/6, DBA/2, BXD5, BXD29, BXD39, BXD42. IDEA: Build QTL maps and use those to filter expression data. Goal: Find differentially expressed genes genetically linked to response.

56 Microarray Expression-QTL Consensus Candidate Genes: 525 genes in QTL; 426 genes by microarray; 46 candidate genes for follow-up and validation.
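The filtering idea amounts to a set intersection: keep only genes that are both differentially expressed and located inside a QTL interval. The sketch below uses made-up identifiers and a hypothetical function name; it does not reproduce the 525, 426, and 46 gene lists from the slide.

```python
def consensus_candidates(qtl_genes, array_genes):
    """Genes that lie in a QTL interval and were also flagged by microarray."""
    return sorted(set(qtl_genes) & set(array_genes))

# Hypothetical identifiers, not taken from the slides.
genes_in_qtl = ["g01", "g02", "g03", "g07"]
genes_by_microarray = ["g02", "g05", "g07", "g09"]

candidates = consensus_candidates(genes_in_qtl, genes_by_microarray)
```

In practice the hard part is upstream of this step: mapping array elements and QTL intervals onto the same gene identifiers.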

57 Candidate Gene Set for LPS response
BG076932  annexin A1 (Anxa1)
BG085317  arginase type II (Arg2)
BG064781  cytidine 5'-triphosphate synthase (Ctps)
BG085740  ets-related transcription facto
BG063515  ferritin heavy chain (Fth)
BG078398  MARCKS-like protein (Mlp)
AW556835  protein tyrosine phosphatase, non-receptor type 2 (Ptpn2)
BG077485  ring finger protein (C3HC4 type) 19 (Rnf19)
BG085186  surfactant protein-D gene
AW550270  tenascin C (Tnc)
BG065761  tumor necrosis factor, alpha-induced protein 2 (Tnfaip2)
BG074379  co-chaperone mt-GrpE#2 precursor putative
BG080688  CSF-1
BG067349  C-type lectin Mincle
BG073439  DKFZp564O1763
AW551388  E2F-like transcriptional repressor protein
BG076460  glutamate-cysteine ligase catalytic subunit (GLCLC)
BG080666  gly96
BG067921  GTP binding protein
BG072974  DKFZp547B146
BG070296  DKFZp566F164
BG074109  Hsp86-1
BG077487  hypoxia inducible factor 1
BG078274  I kappa B alpha gene
BG084405  IAP-1
BG069214  inhibitor of apoptosis protein 1
BG067127  interferon regulatory factor 1
BG080268  KC
BG070106  lipocalin
BG064651  MAIL
BG063925  metallothionein II
BG077818  metallothionein-I
BG073108  MHC class III region RD
BG064928  mitogen-responsive 96
BG072801  S100A9
BG086320  SDF-1-beta
BG072793  T-cell activating protein
BG073446  TH1 protein
BG072227  TNFa
Unnamed: BG068491, BG071081, BG067341, BG067620, BG067670, BG066678, BG071169

58 Sleep Deprivation Studies in Mouse [Figure: timeline labeled 0, 3, 6, 9, 12]

59 Experimental Paradigm: Compare gene expression between sleeping and sleep-deprived mice in cortex and hypothalamus. Perform 3 biological replicates. Normalize and filter data and use data mining techniques to select distinct patterns of gene expression. Use Gene Ontology (GO) assignments to classify genes by cellular localization, molecular function, and biological process. Use GO analysis to develop an understanding of response.

60 Differential Expression in Cortex Energy Metabolism Transcription; Mitochondrial and Ribosomal Proteins Stress Response Metabolism and Signal Transduction

61 Differential Expression in Hypothalamus Sleep signaling

62 Predicting Outcome

63 The problem: Patients present with tumors, many of which are indistinguishable. Histology can provide some information, but it has little predictive power. Microarrays provide a “fingerprint” that can serve as a phenotypic measure that may be linked to outcome. This is a huge problem in data mining.

64 The problem in pictures: Adenocarcinomas

65 32k Human Arrays

66 cDNA Multi-Organ Cancer Classifier: 77 tumor samples (breast, ovary, lung); 144 hybridization assays. Normalization and flip-dye replica consistency check. Statistical filtering of genes (Kruskal-Wallis H-test, p < 0.05): 685 genes. UNSUPERVISED CLASSIFICATION: hierarchical clustering (Pearson correlation). SUPERVISED CLASSIFICATION: divide experiments into training (75%) and validation (25%) sets; artificial neural network training and validation.
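Two steps of this workflow are easy to sketch: the Kruskal-Wallis H statistic used to filter genes, and the 75%/25% split into training and validation sets. The code below is an illustrative plain-Python version with hypothetical function names; it uses average ranks for ties without the tie correction, and the p-value shortcut exp(-H/2) is valid only for the chi-square approximation with exactly three groups (two degrees of freedom).

```python
import math
import random

def kruskal_wallis_h(groups):
    """Kruskal-Wallis H for a list of groups (average ranks, no tie correction)."""
    pooled = sorted((v, gi) for gi, g in enumerate(groups) for v in g)
    n = len(pooled)
    rank_sums = [0.0] * len(groups)
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0              # mean of ranks i+1 .. j
        for k in range(i, j):
            rank_sums[pooled[k][1]] += avg_rank
        i = j
    spread = sum(r * r / len(g) for r, g in zip(rank_sums, groups))
    return 12.0 / (n * (n + 1)) * spread - 3.0 * (n + 1)

def p_value_three_groups(h):
    """Chi-square survival function with 2 degrees of freedom (three groups only)."""
    return math.exp(-h / 2.0)

def split_train_validation(samples, train_fraction=0.75, seed=0):
    """Shuffle reproducibly, then cut into training and validation lists."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

# Hypothetical expression values for one gene in three tumor classes.
h = kruskal_wallis_h([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
keep_gene = p_value_three_groups(h) < 0.05

# 144 assays, as on the slide; the assay labels are placeholders.
train, validation = split_train_validation(range(144))
```

With small groups the chi-square approximation is rough, so a real filter would more likely rely on a statistics library than on this shortcut.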

67 Neural Networks and Cancer: Input data: a list of genes with expression levels. Output data: a tumor type call. “Hidden layers” allow complex connections.

68 Neural Networks and Cancer: Training adjusts weights and connections. [Figure: example output, Breast Tumor]

69 Tumors in the Universal Classifier: 543 tumor samples, 21 tumor types, 95% of all cancers
Tumor Type                                    Samples  Array Platform
Bladder                                       19       U95, HU6800
Breast                                        42       U95, HU6800, TIGR 32k
Central Nervous - Atypical Teratoid/Rhabdoid  10       HU6800
Central Nervous - Glioma                      10       HU6800
Central Nervous - Medulloblastoma             70       HU6800
Colon                                         41       U95, HU6800, TIGR 32k
Stomach/EG Junction                           30       U95, TIGR 32k
Kidney                                        31       U95, HU6800, TIGR 32k
Leukemia - Acute Lymphocytic B Cell           10       HU6800
Leukemia - Acute Lymphocytic T Cell           10       HU6800
Leukemia - Acute Myelogenous                  10       HU6800
Lung - Adenocarcinoma                         71       U95, HU6800, TIGR 32k
Lung - Squamous Cell Carcinoma                21       U95
Lymphoma - Follicular                         11       HU6800
Lymphoma - Large B Cell                       11       HU6800
Melanoma                                      10       HU6800
Mesothelioma                                  10       HU6800
Ovary                                         44       U95, HU6800, TIGR 32k
Pancreas                                      26
Prostate                                      42       U95, HU6800
Uterus                                        10       HU6800

70 [Figure: classifier workflow.] Microarray Database. Data Acquisition. Normalization and Scaling: all normalized and scaled genes; average across chips, gene-by-gene using reference (platforms U95A, Hu6800, TIGR). Statistical Screening: Kruskal-Wallis, Bonferroni; correlative gene subset (U95A = 124, Hu6800 = 136). Neural Network Training and Validation: training set and test set (Tumor 1, Tumor 2, ... Tumor n). Classifier.

71 Summary: We collected 540 expression profiles (21 tumor types, 95% of all cancers). 10 independent classifiers (75% of data for training, 25% for test); average ~88% accuracy. Web-based classifier available: so far, 7 of 8* in classification; 84% accuracy in classifying primary source of mets. * Bad RNA

72 Further challenges in analysis? Statistical significance is not the same as biological significance. If you perturb a system, many genes change their expression levels. Multiple pathways and features in the data can be revealed through different analysis methods. Genes which are good for classification or prognostics may not be biologically relevant. Extracting meaning from microarrays will require new software and tools. The most important thing we need is more data collected and stored in a standard fashion.

73 Barriers to Toxicology Applications: The “complete” genomes are incomplete. Many of the signatures we see on arrays do not have immediate biological implications. Most often genes are included on the arrays that are used solely for normalization. Larger datasets may reveal diagnostic or prognostic patterns that are not obvious at present. Reported “variation” in the assays must be understood: differences in laboratory and analysis protocols are likely sources, and there is a need to define QC and analysis standards. There is clearly a need for a large database of expression profiles linked to other relevant ancillary information.

74 Science is built with facts as a house is with stones, but a collection of facts is no more a science than a heap of stones is a house. - Jules Henri Poincaré

75 Nobody in the game of football should be called a genius. A genius is somebody like Norman Einstein. - Joe Theismann, former quarterback

76 Acknowledgments
The TIGR Gene Index Team: Foo Cheung, Svetlana Karamycheva, Yudan Lee, Babak Parvizi, Geo Pertea, Razvan Sultana, Jennifer Tsai, John Quackenbush, Joseph White. Funding provided by the Department of Energy and the National Science Foundation.
TIGR Human/Mouse/Arabidopsis Expression Team: Emily Chen, Bryan Frank, Renee Gaspard, Jeremy Hasseman, Heenam Kim, Lara Linford, Simon Kwong, John Quackenbush, Shuibang Wang, Yonghong Wang, Ivana Yang, Yan Yu.
Array Software Hit Team: Nirmal Bhagabati, John Braisted, Tracey Currier, Jerry Li, Wei Liang, John Quackenbush, Alexander I. Saeed, Vasily Sharov, Mathangi Thaiagarjian, Joseph White.
Assistant: Sue Mineo.
Funding provided by the National Cancer Institute, the National Heart, Lung, and Blood Institute, and the National Science Foundation.
H. Lee Moffitt Center/USF: Timothy J. Yeatman, Greg Bloom.
TIGR PGA Collaborators: Norman Lee, Renae Malek, Hong-Ying Wang, Truong Luu, Bobby Behbahani.
TIGR Faculty, IT Group, and Staff.
PGA Collaborators: Gary Churchill (TJL), Greg Evans (NHLBI), Harry Gavaras (BU), Howard Jacob (MCW), Anne Kwitek (MCW), Allan Pack (Penn), Beverly Paigen (TJL), Luanne Peters (TJL), David Schwartz (Duke).
Emeritus: Jennifer Cho (TGI), Ingeborg Holt (TGI), Feng Liang (TGI), Kristie Abernathy (mA), Sonia Dharap (mA), Julie Earle-Hughes (mA), Cheryl Gay (mA), Priti Hegde (mA), Rong Qi (mA), Erik Snesrud (mA).
Contact: <johnq@tigr.org>

