Presentation on theme: "Integrative Colorectal Cancer Omics Data Mining and Knowledge Discovery Jake Y. Chen, Ph.D. IUPUI Indiana Center for Systems Biology & Personalized Medicine."— Presentation transcript:
1 Integrative Colorectal Cancer Omics Data Mining and Knowledge Discovery. Jake Y. Chen, Ph.D., IUPUI, Indiana Center for Systems Biology & Personalized Medicine
2 Polyp and Colorectal Cancer
Polyp vs. Colorectal Cancer: Polyps are benign tumors of the large intestine. They do not invade nearby tissue or spread to other parts of the body. If not removed from the large intestine, they may become malignant (cancerous) over time. Most cancers of the large intestine are believed to have developed from polyps. (Photo courtesy of National Cancer Institute.)
Colon Cancer vs. Rectal Cancer: They share many commonalities, including molecular mechanisms, but tend to be treated differently.
3 Colorectal Cancer Molecular Pathways A. Walther, et al. (2009) Nature Reviews Cancer, 9(7) pp
4 Omics/Clinical Data Source: Proteomics/Metabolomics/Lipidomics/Clinical Data
(H = healthy, PP = polyp, CR = colorectal cancer, N = total subjects)
Data set                 H    PP   CR   N
LC-MS Proteomics         80   72   40   192
NMR Metabolomics         53   35   15   103
Vitamin D                83   81   31   195
GCxGC MS Metabolomics    83   84   30   197
Oxidative Stress         50   32   12   94
Lipidomics               47   35   15   97
Diet                     70   54   29   153
5 Scientific Questions to Answer
Data Analysis: Which omics data set has the best predictive power? Which features in the omics data are important?
Data Mining: Does integrating omics data improve prediction? Which combination of omics data sets has the best predictive power?
Knowledge Discovery: Why do those features in the omics data have the best predictive power?
6 Roadmap (current section: Knowledge Discovery of Proteomics Data)
Knowledge Discovery of Proteomics Data; Knowledge Discovery of Metabolomics Data; Integrative Data Mining
7 Proteomics Data Description
Group: Bindley Bioscience Center at Purdue University
Instruments: Agilent's Chip Cube coupled to the XCT Plus ESI ion trap
Data format at CCE web portal: mzXML
Number of samples: Normal: 80; Polyp: 72; Colorectal: 40
8 LC-MS Proteomics Data Processing
Figure panels: LC/MS data “heat map”; Total Ion Chromatogram (TIC) summarized from the enhanced heat map; image-enhanced LC/MS data “heat map”.
Methods adapted from N. Jeffries (2005) Bioinformatics, vol. 21, no. 14, pp; S.A. Kazmi, et al. (2006) Metabolomics, vol. 2, no. 2, pp
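As a rough illustration of the TIC summary named on slide 8: a total ion chromatogram is conventionally the sum of all ion intensities in each scan. The sketch below assumes the heat map is a plain list of rows, one row per scan (retention time) and one column per m/z bin. The function name and the toy values are hypothetical, and the slide's image-enhancement step is not reproduced.

```python
from typing import List


def total_ion_chromatogram(heat_map: List[List[float]]) -> List[float]:
    """Sum the intensities over all m/z bins for each scan (row)."""
    return [sum(row) for row in heat_map]


# Toy heat map: 3 scans (rows) x 4 m/z bins (columns); values are made up.
toy_heat_map = [
    [0.0, 2.0, 1.0, 0.0],
    [5.0, 1.0, 0.0, 4.0],
    [0.0, 0.0, 3.0, 0.0],
]
tic = total_ion_chromatogram(toy_heat_map)
print(tic)  # [3.0, 10.0, 3.0]
```

Peaks in the resulting one-dimensional trace are what a retention-time grid would be chosen from.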
9 LC-MS Major Protein Identification
~25-28 characteristic proteins per sample identified.
Workflow labels: Identify Most Informative TIC R.T. “Grid”; Use Mascot to Search for Protein ID at R.T. Grid Regions; Apply the R.T. Grid to Original Spectra.
No  Scan  RT      Uniprot_ID    Score/Expect/Evidence (unsplit in transcript)
1   119   139.48  ADAD2_HUMAN   383.3
2   229   265.87  NNMT_HUMAN    431.1
3   372   429.15  ZSA5D_HUMAN   421.2
4   656   749.8   BRAF_HUMAN    402.2479
5   1162  1276.6  RGS7_HUMAN    470.39
6   1310  1407.2  TTC9C_HUMAN   356.3
7   1669  1713.9  CP042_HUMAN   3.1
8   1866  1879.1  HXD11_HUMAN   348.4
9   1987  1980.3  ING4_HUMAN
10  2114  2086    ZN423_HUMAN   331
11  2353  2285.7  CL065_HUMAN   373.9
12  2539  2441.3  CA5BL_HUMAN   0.4
13  2722  2594.7  NPDC1_HUMAN   3.6
14  2874  2722.2  DJC27_HUMAN   3.8
15  3001  2828.5  BORG4_HUMAN
16  3165  2965.1  KC1G1_HUMAN   27
17  3440  3196.1  TPPC5_HUMAN
18  3656  3377.6  UB2D3_HUMAN   0.99
19  3997  3665.5  TM208_HUMAN   8.1
20  4257  3885.4  ZBED3_HUMAN   2923
10 Proteomics Result Interpretation
Proteins Identified from Colon Cancer and Healthy Group
Uniprot_ID    Frequency in Colon (10) / Frequency in Health (10) / Evidence in PubMed (unsplit in transcript)
BRAF_HUMAN    3508
DMP46_HUMAN
NNMT_HUMAN    14
MRP_HUMAN
STK33_HUMAN
Proteins Interacting with High-Frequency Proteins from Colon Cancer Group
Uniprot_ID     Gene    Protein Name                                          Evidence in PubMed
BRAF1_HUMAN    BRAF    Serine/threonine-protein kinase B-raf                 508
P53_HUMAN      TP53    Cellular tumor antigen p53                            443
CD44_HUMAN     CD44    CD44 antigen                                          411
MDM2_HUMAN     MDM2    E3 ubiquitin-protein ligase Mdm2                      131
BCR_HUMAN      BCR     Breakpoint cluster region protein                     59
LCK_HUMAN      LCK     Tyrosine-protein kinase Lck                           29
Q7RTZ3_HUMAN
CAV1_HUMAN     CAV1    Caveolin-1                                            21
PNPH_HUMAN     PNP     Purine nucleoside phosphorylase                       13
CBL_HUMAN      CBL     E3 ubiquitin-protein ligase CBL                       11
RAF1_HUMAN     RAF1    RAF proto-oncogene serine/threonine-protein kinase    10
CD38_HUMAN     CD38    ADP-ribosyl cyclase 1                                 8
NNMT_HUMAN     NNMT    Nicotinamide N-methyltransferase                      4
IRAK1_HUMAN    IRAK1   Interleukin-1 receptor-associated kinase 1            3
DMPK_HUMAN     DMPK    Myotonin-protein kinase                               2
ITA5_HUMAN     ITGA5   Integrin alpha-5                                      1
ITB1_HUMAN     ITGB1   Integrin beta-1
ZAP70_HUMAN    ZAP70   Tyrosine-protein kinase ZAP-70
11 Proteomics Result Interpretation: A Network Biology Context
Protein network constructed from the top 3 differential proteins. Green-circled proteins are frequently (frequency >= 0.3) detected in the colon cancer patient blood samples by LC/MS.
Node: protein with evidence from PubMed, found by searching ("GENE_SYMBOL" AND ("colon" OR "colorectal") AND ("cancer" OR "carcinoma")).
Edge: protein interaction with confidence score from HAPPI 1.31 (4- and 5-star).
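The PubMed search pattern quoted on slide 11 can be generated per gene symbol with a small helper. This is only a sketch of the string construction: the function name is hypothetical, and how the query was actually submitted and counted is not stated in the slides.

```python
def pubmed_query(gene_symbol: str) -> str:
    """Fill the slide's query template with one gene symbol."""
    return (
        f'("{gene_symbol}" AND ("colon" OR "colorectal") '
        f'AND ("cancer" OR "carcinoma"))'
    )


# Gene symbols taken from the slide 10 table.
for symbol in ["BRAF", "TP53", "NNMT"]:
    print(pubmed_query(symbol))
```

The number of records returned for each query would correspond to the "Evidence in PubMed" column.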
12 Proteomics Result Interpretation: A Biological Pathway Context. BRAF (serine/threonine-protein kinase B-raf) plays a major role in the colorectal cancer pathway (KEGG data).
13 Proteomics Result Interpretation: A Biological Pathway Context for NNMT. NNMT (nicotinamide N-methyltransferase) is involved in Biological Oxidations / Phase II Conjugation / Methylation (from Reactome).
14 Roadmap (current section: Knowledge Discovery of Metabolomics Data)
Knowledge Discovery of Proteomics Data; Knowledge Discovery of Metabolomics Data (NMR Data, GCxGC MS Data); Integrative Data Mining
15 Metabolomics Data Description
Group: Daniel Raftery Laboratory at Purdue University
NMR Data. Instruments: Bruker Avance 500 MHz NMR. Data format at CCE web portal: Excel spreadsheet. Number of samples: Normal: 53; Polyp: 35; Colorectal: 15
GCxGC MS Data. Instruments: LECO Pegasus 4D GCxGC-TOF. Number of samples: Normal: 83; Polyp: 84; Colorectal: 30
GCxGC MS: highly sensitive (can identify low-abundance metabolites).
NMR: highly reproducible and quantifiable.
16 NMR Data Analysis Workflow
Workflow steps: signal processing; extract the peaks' ppm values; search against the Human Metabolome Database (2.5) to identify metabolites; report only significant metabolites.
Note: the solvent is not mentioned on the web portal. According to the group's previous publication, water was used in the experiment (I. R. Lanza, S. Zhang, L. E. Ward, H. Karakelides, D. Raftery, and S. Nair, "Quantitative Metabolomics by 1H-NMR and LC-MS/MS Confirms Altered Metabolic Pathways in Diabetes," PLoS ONE, 5, 1-10 (2010)).
Example output (Sample_ID columns 1 and 2; column assignment of single entries is not recoverable):
Top1   Delta-Hexanolactone
Top2   Hypotaurine
Top3   2,3-Diphosphoglyceric acid; Diethanolamine
Top4   3,7-Dimethyluric acid
Top5   3-Phosphoglyceric acid; Methyl isobutyl ketone
Top6   1,3,7-Trimethyluric acid
Top7   Cysteine-S-sulfate
Top8   L-Allothreonine
Top9
Top10
17 NMR Peak Metabolite Identification Using the Human Metabolome Database
1) Input the peak lists.
2) Get the metabolites; leave out those with fewer than 2 matches.
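The two steps on slide 17 can be sketched as a tolerance-based lookup. This is not the database's own search: the reference shifts, the metabolite names, the 0.01 ppm tolerance, and the function name below are all hypothetical, chosen only to show the "fewer than 2 matches" filter.

```python
from typing import Dict, List


def match_metabolites(
    peaks: List[float],
    reference: Dict[str, List[float]],
    tol: float = 0.01,       # assumed ppm tolerance, not from the slides
    min_matches: int = 2,    # slide: leave out those with fewer than 2 matches
) -> Dict[str, int]:
    """Count reference peaks matched by the sample; keep well-supported hits."""
    hits = {}
    for name, ref_peaks in reference.items():
        n_matched = sum(
            1 for ref in ref_peaks if any(abs(ref - p) <= tol for p in peaks)
        )
        if n_matched >= min_matches:
            hits[name] = n_matched
    return hits


# Hypothetical sample peak list and reference library (ppm values are made up).
sample_peaks = [1.20, 3.65, 3.90, 7.10]
library = {
    "metabolite_A": [3.65, 3.90],
    "metabolite_B": [1.20, 5.50],
    "metabolite_C": [2.00],
}
print(match_metabolites(sample_peaks, library))  # {'metabolite_A': 2}
```

Requiring at least two matched peaks trades some sensitivity for fewer spurious single-peak assignments.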
18 Significant Metabolites Identified from NMR Metabolomics Data
Marker metabolites? Shared metabolites
Group                   Metabolites
Polyp vs Health         D-Arabitol, D-Pantethine (2/35 vs 0/53)
Colorectal vs Polyp     None
Colorectal vs Health    D-Arabitol (2/15 vs 0/53)
Population Frequency = (formula not preserved in the transcript)
Heat map of (0,1) for the colorectal group. Clustering method defaults to "complete" (the agglomeration method to be used; this should be an unambiguous abbreviation of one of "ward", "single", "complete", "average", "mcquitty", "median" or "centroid"). Distance defaults to Euclidean.
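The population-frequency formula on slide 18 did not survive in the transcript. The sketch below assumes it is simply the fraction of samples in a group in which a metabolite was detected, which is consistent with the "2/15 vs 0/53" notation in the table. The function name and the 0/1 encoding are assumptions.

```python
from typing import List


def population_frequency(detected: List[int]) -> float:
    """Fraction of samples (1 = detected, 0 = not detected) with the metabolite."""
    if not detected:
        raise ValueError("group has no samples")
    return sum(detected) / len(detected)


# D-Arabitol as reported on the slide: 2 of 15 colorectal, 0 of 53 healthy.
colorectal = [1, 1] + [0] * 13
healthy = [0] * 53
print(population_frequency(colorectal), population_frequency(healthy))
```

A 0/1 matrix of this kind, one row per metabolite and one column per sample, is what a "(0,1)" heat map would display.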
19 D-Arabitol Identified from NMR Results Is Involved in Pentose and Glucuronate Interconversions Pathways
SMPDB: Small Molecule Pathway Database (http://www.smpdb.ca/). Red indicates the hit metabolites in the pathways.
20 Roadmap (current section: Knowledge Discovery of Metabolomics Data)
Knowledge Discovery of Proteomics Data; Knowledge Discovery of Metabolomics Data (NMR Data, GCxGC MS Data); Integrative Data Mining
21 Results from GCxGC MS Data I
Metabolite identification is more straightforward.
Comparisons (table columns): Polyp vs Healthy; Colorectal vs Polyp; Colorectal vs Healthy
Metabolites (column assignment not recoverable from the transcript):
Methanesulfinic acid, trimethylsilyl ester
Acetic acid, (methoxyimino)-, trimethylsilyl ester
Butanoic acid, 2-[(trimethylsilyl)oxy]-, trimethylsilyl ester
Propanoic acid, 2-(methoxyimino)-, trimethylsilyl ester
Pentanoic acid, 2-(methoxyimino)-3-methyl-, trimethylsilyl ester
L-Valine, N-(trimethylsilyl)-, trimethylsilyl ester
Hexanedioic acid, bis(2-ethylhexyl) ester
Cholesterol trimethylsilyl ether
Mefloquine
Pentanedioic acid, 2-(methoxyimino)-, bis(trimethylsilyl) ester
Hexanoic acid, trimethylsilyl ester
Cyclohexane, 1,3,5-trimethyl-2-octadecyl-
Tetradecanoic acid, trimethylsilyl ester
Hexanoic acid, 2-(methoxyimino)-, trimethylsilyl ester
psi,psi.-Carotene, 3,3',4,4'-tetradehydro-1,1',2,2'-tetrahydro-1,1'-dimethoxy-2,2'-dioxo-
3,6-Dioxa-2,7-disilaoctane, 2,2,4,7,7-pentamethyl-
Silanol, trimethyl-, pyrophosphate (4:1)
Butanoic acid, 2-(methoxyimino)-3-methyl-, trimethylsilyl ester
Trimethylsilyl ether of glycerol
L-Asparagine, N,N2-bis(trimethylsilyl)-, trimethylsilyl ester
Ethylbis(trimethylsilyl)amine
Cyclotrisiloxane, 2,4,6-trimethyl-2,4,6-triphenyl-
Benzene, (1-hexadecylheptadecyl)-
22 Results from GCxGC MS Data II
Figure panels: A. Polyp vs Healthy; B. Polyp vs Colorectal; C. Colorectal vs Healthy
Clustering method defaults to "complete" (the agglomeration method to be used; this should be an unambiguous abbreviation of one of "ward", "single", "complete", "average", "mcquitty", "median" or "centroid"). Distance defaults to Euclidean.
23 Comparative Results (Intensity vs. Population)
Figure panels: intensity-based heat map; population-frequency-based heat map. Marker metabolite panel; clustering of three groups.
Here we analyzed 28 samples in total (9 normal, 9 polyp, 10 colorectal) to be consistent with the subjects in the proteomic analysis (i.e., to try to eliminate individual differences when doing the omics data integration).
The heat maps show only 30 metabolites for presentation purposes; the rest of the data is available upon request. Intensity values were log2-transformed when generating the heat map because the raw values are too large.
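A minimal sketch of the log2 transform mentioned on slide 23. The slides do not say how zero intensities were handled; the pseudocount of 1 used here is an assumption, as are the function name and the toy values.

```python
import math
from typing import List


def log2_transform(matrix: List[List[float]], pseudocount: float = 1.0) -> List[List[float]]:
    """Apply log2(x + pseudocount) to every intensity (pseudocount is assumed)."""
    return [[math.log2(x + pseudocount) for x in row] for row in matrix]


# Made-up intensities: one metabolite (row) across three samples (columns).
raw = [[0.0, 1.0, 3.0]]
print(log2_transform(raw))  # [[0.0, 1.0, 2.0]]
```

Compressing the dynamic range this way keeps a few very intense metabolites from dominating the heat map color scale.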
24 Metabolites Identified from GCxGC MS Results Are Involved in Fatty Acid Biosynthesis Pathways
SMPDB: Small Molecule Pathway Database (http://www.smpdb.ca/). Red indicates the hit metabolites in the pathways.
25 Roadmap (current section: Integrative Data Mining)
Knowledge Discovery of Proteomics Data; Knowledge Discovery of Metabolomics Data; Integrative Data Mining
26 Data Set Description
Diet, Lipidomics, Oxidative and VD: the number of features and the total number of subjects vary.
Three classes are balanced to the least common denominator.
Comparisons: Healthy vs. Polyp; Healthy vs. Colorectal; Polyp vs. Colorectal
                  Diet   Lipid   Oxidative   VD
Total Subjects    150    97      94          195
Total Features    384932 (values fused in the transcript)
27 Predictive Modeling Methods
Diagram labels: Raw Dataset; Clean Dataset; Classification Model; Hypothesis.
Data Preprocessing: filtering outliers (three standard deviations away from the mean); data normalization (transforming to the 0-1 range); binning categorical data using the quantile binning method.
Missing Value Treatment: replaced with the mean value of the attribute in the group.
Support Vector Machine (SVM) Classifier Kernel: the Radial Basis Function (RBF) kernel is used.
Feature Selection Methods: Approach #1: two-sample unpaired t-tests at the 5% significance level. Approach #2: SVM Attribute Evaluator with Ranker algorithm. Features from t-tests are filtered using p-values.
K-fold cross-validation.
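Three of the preprocessing steps listed on slide 27 can be sketched for a single attribute within one group. This is an illustration under assumptions, not the authors' pipeline: the function names are hypothetical, the population standard deviation is used for the 3-SD rule, and quantile binning is left out.

```python
from statistics import mean, pstdev
from typing import List, Optional


def impute_group_mean(values: List[Optional[float]]) -> List[float]:
    """Replace missing values (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    group_mean = mean(observed)
    return [group_mean if v is None else v for v in values]


def filter_outliers(values: List[float], n_sd: float = 3.0) -> List[float]:
    """Drop values more than n_sd standard deviations away from the mean."""
    center, spread = mean(values), pstdev(values)
    if spread == 0:
        return list(values)
    return [v for v in values if abs(v - center) <= n_sd * spread]


def min_max_normalize(values: List[float]) -> List[float]:
    """Rescale values linearly to the 0-1 range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Made-up attribute values for one group.
print(impute_group_mean([1.0, None, 3.0]))         # [1.0, 2.0, 3.0]
print(len(filter_outliers([1.0] * 20 + [100.0])))  # 20
print(min_max_normalize([2.0, 4.0, 6.0]))          # [0.0, 0.5, 1.0]
```

With very small groups the 3-SD rule rarely removes anything, since a single extreme value inflates the standard deviation it is judged against.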
28 Dietary Attributes as Predictors
Panels: Colorectal vs. Healthy; Polyp vs. Healthy
P-value (first set): 2.53E-02, 9.57E-01, 3.71E-02, 5.60E-02
P-value (second set): 2.38E-02, 4.21E-01, 4.11E-02, 1.21E-01
Attributes: Salad, Ice cream, Tomato, Rice, Egg, Tea, Milk, Shellfish
SVM Predictor Accuracy = 64%; SVM Predictor Accuracy = 65%
29 Lipidomics T-Test Results
Significant features selected from the t-test, with their corresponding p-values.
Table columns: Features; Polyp vs. Healthy; Polyp vs. Colorectal; Colorectal vs. Healthy (column assignment is recoverable only for rows with three values)
Feature         p-values
16:0/18:1 PE    1.76E-02
24:1 Cer        6.90E-03
LPE 18:1        <1.00E-04
LPE 20:0        1.50E-03, 2.00E-04
An-16:0 LPA     3.23E-02
An-18:1 LPA     3.38E-02, 1.33E-02
AA              1.13E-02
18:2 LPA        4.50E-03
20:4 LPA        2.40E-02
22:6 FA         4.28E-02, 3.24E-02
LPE 16:0        3.08E-02, 3.40E-03
LPE 18:0        3.90E-03, 1.00E-04, 2.18E-02
30 Integrating Lipidomics with Clinical Features: Performance Comparisons
Panels: Without Clinical Features; With Clinical Features
First table:
Comparison                 Accuracy
Polyp vs. Healthy          0.55
Colorectal vs. Healthy*    0.60
Polyp vs. Colorectal*
Second table:
Comparison                 Accuracy (without pre-selection)   Accuracy (with t-test pre-selection)   Accuracy (automatic selection)
Polyp vs. Healthy          0.54                               0.71                                   0.78
Colorectal vs. Healthy*    0.57                               0.63                                   0.73
Polyp vs. Colorectal*      0.70                               0.90                                   0.87
* Since the number of subjects was less than 15, 3-fold cross-validation accuracy was reported.
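The 3-fold cross-validation mentioned in the slide 30 footnote can be sketched as an index split. This is a generic k-fold partition, not the tool the authors used: the function name is hypothetical, and no shuffling or class stratification is applied because the slides do not describe either.

```python
from typing import List, Tuple


def k_fold_indices(n_samples: int, k: int) -> List[Tuple[List[int], List[int]]]:
    """Return k (train, test) index pairs; each sample is tested exactly once."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    splits = []
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits


# 12 hypothetical subjects split into 3 folds.
for train_idx, test_idx in k_fold_indices(12, 3):
    print(len(train_idx), test_idx)
```

The reported accuracy would then be the average over the k held-out folds.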
31 Messages
Individual omics data sets have variable predictive performance.
Thorough statistical filtering plus biological knowledge integration is needed to battle the inherently high level of data noise.
Integration of different omics data with clinical data can improve predictive performance.
32 Acknowledgment. We thank all the members of our team.