Integrative Colorectal Cancer Omics Data Mining and Knowledge Discovery Jake Y. Chen, Ph.D. IUPUI Indiana Center for Systems Biology & Personalized Medicine.


1 Integrative Colorectal Cancer Omics Data Mining and Knowledge Discovery Jake Y. Chen, Ph.D. IUPUI Indiana Center for Systems Biology & Personalized Medicine

2 Polyp and Colorectal Cancer
Polyp vs. Colorectal Cancer
- Polyps are benign tumors of the large intestine.
- They do not invade nearby tissue or spread to other parts of the body.
- If not removed from the large intestine, they may become malignant (cancerous) over time.
- Most cancers of the large intestine are believed to have developed from polyps.
Photo courtesy of National Cancer Institute
Colon Cancer vs. Rectal Cancer
- Share many commonalities, including molecular mechanisms.
- Tend to be treated differently.

3 Colorectal Cancer Molecular Pathways A. Walther, et al. (2009) Nature Reviews Cancer, 9(7) pp

4 Omics/Clinical Data Source
Proteomics/Metabolomics/Lipidomics/Clinical Data (H = healthy, PP = polyp, CR = colorectal cancer, N = total)
Data set              H    PP   CR   N
Diet                  70   54   29   153
Oxidative Stress      50   32   12   94
LC-MS Proteomics      80   72   40   192
Vitamin D             83   81   31   195
GCxGC MS Metabolomics 83   84   30   197
Lipidomics            47   35   15   97
NMR Metabolomics      53   35   15   103

5 Scientific Questions to Answer
Data Analysis
- Which Omics data has the best predictive power?
- Which features in the Omics data are important?
Data Mining
- Does integration of Omics data improve the prediction?
- Which combination of Omics data has the best predictive power?
Knowledge Discovery
- Why do those features in the Omics data have the best predictive power?

6 Roadmap Knowledge Discovery of Proteomics Data Knowledge Discovery of Metabolomics Data Integrative Data Mining

7 Proteomics Data Description
- Group: Bindley Bioscience Center at Purdue University
- Instruments: Agilent's chip cube coupled to the XCT PLUS ESI ion trap
- Data format at CCE webportal: mzXML
- Number of samples: Normal: 80; Polyp: 72; Colorectal: 40

8 LC-MS Proteomics Data Processing
Figure panels:
- LC/MS data "heat map"
- Image-enhanced LC/MS data "heat map"
- Total Ion Chromatogram (TIC) summarized from enhanced heat map
Methods adapted from:
- N. Jeffries (2005) Bioinformatics, vol. 21, (no. 14), pp
- S.A. Kazmi, et al., (2006) Metabolomics, vol. 2, (no. 2), pp
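As a rough illustration of how a TIC relates to the heat map: a total ion chromatogram is the sum of all ion intensities in each scan, so it can be obtained by summing each retention-time row of the intensity matrix. The sketch below assumes the heat map is stored as a list of scans, each a list of intensities per m/z bin; the function name and the toy data are hypothetical, and the image-enhancement step from the cited methods is not reproduced.

```python
from typing import List


def total_ion_chromatogram(heat_map: List[List[float]]) -> List[float]:
    """Sum intensities over all m/z bins for each retention-time scan."""
    return [sum(scan) for scan in heat_map]


# Hypothetical toy data: 3 scans (rows) x 4 m/z bins (columns).
heat_map = [
    [0.0, 2.0, 1.0, 0.0],
    [5.0, 1.0, 0.0, 4.0],
    [0.0, 0.0, 3.0, 0.0],
]
tic = total_ion_chromatogram(heat_map)

# Index of the scan with the highest total ion current.
apex_scan = max(range(len(tic)), key=tic.__getitem__)
```

The apex scan index is the kind of value that could seed a retention-time grid, although the slides do not say how the grid regions were actually chosen.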

9 LC-MS Major Protein Identification
~25-28 characteristic proteins/sample identified
- Identify most informative TIC R.T. "grid"
- Apply the R.T. grid to original spectra
- Use Mascot to search for protein ID at R.T. grid regions
Table columns: No, Scan, RT, Uniprot_ID, Score, Expect, Evidence
Uniprot_ID entries: ADAD2_HUMAN, NNMT_HUMAN, ZSA5D_HUMAN, BRAF_HUMAN, RGS7_HUMAN, TTC9C_HUMAN, CP042_HUMAN, HXD11_HUMAN, ING4_HUMAN, ZN423_HUMAN, CL065_HUMAN, CA5BL_HUMAN, NPDC1_HUMAN, DJC27_HUMAN, BORG4_HUMAN, KC1G1_HUMAN, TPPC5_HUMAN, UB2D3_HUMAN, TM208_HUMAN, ZBED3_HUMAN 29230

10 Proteomics Result Interpretation
Proteins Identified from Colon Cancer and Health Group
Uniprot_ID     Frequency in Colon (10)   Frequency in Health (10)   Evidence in PubMed
BRAF_HUMAN     3                         0                          508
DMP46_HUMAN    3                         0                          0
NNMT_HUMAN     3                         1                          4
MRP_HUMAN      1                         3                          0
STK33_HUMAN    0                         3                          0

Proteins Interacted with High-Frequency Proteins from Colon Cancer Group
Uniprot_ID     Gene    Protein Name                                           Evidence in PubMed
BRAF1_HUMAN    BRAF    Serine/threonine-protein kinase B-raf                  508
P53_HUMAN      TP53    Cellular tumor antigen p53                             443
CD44_HUMAN     CD44    CD44 antigen                                           411
MDM2_HUMAN     MDM2    E3 ubiquitin-protein ligase Mdm2                       131
BCR_HUMAN      BCR     Breakpoint cluster region protein                      59
LCK_HUMAN      LCK     Tyrosine-protein kinase Lck                            29
Q7RTZ3_HUMAN   LCK     Tyrosine-protein kinase Lck                            29
CAV1_HUMAN     CAV1    Caveolin-1                                             21
PNPH_HUMAN     PNP     Purine nucleoside phosphorylase                        13
CBL_HUMAN      CBL     E3 ubiquitin-protein ligase CBL                        11
RAF1_HUMAN     RAF1    RAF proto-oncogene serine/threonine-protein kinase     10
CD38_HUMAN     CD38    ADP-ribosyl cyclase 1                                  8
NNMT_HUMAN     NNMT    Nicotinamide N-methyltransferase                       4
IRAK1_HUMAN    IRAK1   Interleukin-1 receptor-associated kinase 1             3
DMPK_HUMAN     DMPK    Myotonin-protein kinase                                2
ITA5_HUMAN     ITGA5   Integrin alpha-5                                       1
ITB1_HUMAN     ITGB1   Integrin beta-1                                        1
ZAP70_HUMAN    ZAP70   Tyrosine-protein kinase ZAP-70                         1
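The selection of high-frequency proteins can be sketched as a simple frequency filter. The counts below are the first table read as (colon, health) detections out of 10 samples per group: 3/0, 3/0, 3/1, 1/3 and 0/3. The 0.3 cutoff comes from the ">=0.3" note on the network slide; the extra condition that the colon frequency must exceed the health frequency is an assumption added here to express "differential", not something the slides state.

```python
# (Uniprot_ID, detections in colon group, detections in health group),
# as read from the slide table; each group has 10 samples.
detections = [
    ("BRAF_HUMAN", 3, 0),
    ("DMP46_HUMAN", 3, 0),
    ("NNMT_HUMAN", 3, 1),
    ("MRP_HUMAN", 1, 3),
    ("STK33_HUMAN", 0, 3),
]

GROUP_SIZE = 10   # samples per group, from the table header
THRESHOLD = 0.3   # ">=0.3" detection frequency, from the network slide


def frequent_in_colon(rows, group_size=GROUP_SIZE, threshold=THRESHOLD):
    """Keep proteins detected often in the colon group.

    Assumption (not from the slides): a protein must also be detected
    more often in the colon group than in the health group.
    """
    kept = []
    for uniprot_id, colon, health in rows:
        colon_freq = colon / group_size
        health_freq = health / group_size
        if colon_freq >= threshold and colon_freq > health_freq:
            kept.append((uniprot_id, colon_freq, health_freq))
    return kept


top_proteins = [row[0] for row in frequent_in_colon(detections)]
```

Under these assumptions the filter keeps BRAF_HUMAN, DMP46_HUMAN and NNMT_HUMAN, which is consistent with a network built from the top 3 differential proteins.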

11 Proteomics Result Interpretation: A Network Biology Context
Protein network constructed from the top 3 differential proteins
- Green-circled proteins are frequently (>=0.3) detected in colon cancer patient blood samples using LC/MS.
- Node: protein with evidence from PubMed, found by searching ("GENE_SYMBOL" AND ("colon" OR "colorectal") AND ("cancer" OR "carcinoma"))
- Edge: protein interaction with confidence score from HAPPI 1.31 (4- and 5-star)
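The PubMed evidence search can be sketched as a small query builder that substitutes a gene symbol into the template quoted on this slide. The helper names are hypothetical, and the slides do not say how the searches were run; the URL below targets NCBI's E-utilities ESearch endpoint with `rettype=count` as one plausible way to obtain a hit count, and no network request is made here.

```python
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def pubmed_query(gene_symbol: str) -> str:
    """Fill the slide's query template with a gene symbol."""
    return (
        f'("{gene_symbol}" AND ("colon" OR "colorectal") '
        'AND ("cancer" OR "carcinoma"))'
    )


def esearch_url(gene_symbol: str) -> str:
    """Build (but do not send) an ESearch request that returns only a count."""
    params = {
        "db": "pubmed",
        "term": pubmed_query(gene_symbol),
        "rettype": "count",
    }
    return ESEARCH + "?" + urlencode(params)


query = pubmed_query("BRAF")
url = esearch_url("NNMT")
```

Counts obtained this way will drift over time as PubMed grows, so they should not be expected to reproduce the evidence numbers shown in the tables.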

12 Proteomics Result Interpretation A Biological Pathway Context BRAF (Serine/threonine-protein kinase B-raf) plays major roles in Colorectal Cancer Pathway (KEGG data)

13 Proteomics Result Interpretation: A Biological Pathway Context for NNMT
NNMT (Nicotinamide N-methyltransferase) is involved in Biological Oxidations/Phase II Conjugation/Methylation (from Reactome)

14 Roadmap Knowledge Discovery of Proteomics Data Knowledge Discovery of Metabolomics Data NMR Data GCxGC MS Data Integrative Data Mining

15 Metabolomics Data Description
Group: Daniel Raftery Laboratory at Purdue University
1. NMR Data
- Instruments: Bruker Avance 500 MHz NMR
- Data format at CCE webportal: Excel spreadsheet
- Number of samples: Normal: 53; Polyp: 35; Colorectal: 15
2. GCxGC MS Data
- Instruments: LECO Pegasus 4D GCxGC-TOF
- Data format at CCE webportal: Excel spreadsheet
- Number of samples: Normal: 83; Polyp: 84; Colorectal: 30

16 NMR Data Analysis Workflow
- Signal processing
- Extract peaks' ppm
- Search against Human Metabolome Database (2.5) to identify metabolites
- Report only significant metabolites
Example output (Sample_ID 1 / Sample_ID 2):
Top1: Delta-Hexanolactone
Top2: Hypotaurine
Top3: 2,3-Diphosphoglyceric acid / Diethanolamine
Top4: Diethanolamine / 3,7-Dimethyluric acid
Top5: 3-Phosphoglyceric acid / Methyl isobutyl ketone
Top6: 3,7-Dimethyluric acid / 1,3,7-Trimethyluric acid
Top7: 1,3,7-Trimethyluric acid / Cysteine-S-sulfate
Top8: L-Allothreonine
Top9:
Top10:

17 NMR Peak Metabolite Identification using Human Metabolome Database
1) Input the peak lists
2) Get the metabolites; leave out those with fewer than 2 matches
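The two steps above can be sketched as a peak-matching routine: each candidate metabolite is scored by how many of its reference peaks fall near an observed peak, and candidates with fewer than 2 matches are left out. This is only an illustration of the filtering rule; the actual search was done through the Human Metabolome Database, and the tolerance, the function name, and the reference peak lists below are all hypothetical.

```python
from typing import Dict, List


def match_metabolites(
    peaks_ppm: List[float],
    reference: Dict[str, List[float]],
    tolerance: float = 0.01,   # assumed matching window in ppm
    min_matches: int = 2,      # "leave out those with fewer than 2 matches"
) -> Dict[str, int]:
    """Return metabolites whose reference peaks match at least min_matches observed peaks."""
    hits = {}
    for name, ref_peaks in reference.items():
        matched = sum(
            1
            for ref in ref_peaks
            if any(abs(ref - peak) <= tolerance for peak in peaks_ppm)
        )
        if matched >= min_matches:
            hits[name] = matched
    return hits


# Hypothetical reference library and observed peak list (ppm).
reference = {
    "metabolite_A": [1.20, 3.45, 4.10],
    "metabolite_B": [2.30, 7.80],
    "metabolite_C": [5.55],
}
observed = [1.205, 3.449, 7.795, 5.55]
hits = match_metabolites(observed, reference)
```

Here only metabolite_A survives: metabolite_B and metabolite_C each match a single peak and are dropped by the 2-match rule.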

18 Significant Metabolites Identified from NMR Metabolomics Data
Group                  Metabolites
Polyp vs Health        D-Arabitol, D-Pantethine (2/35 vs 0/53)
Colorectal vs Polyp    None
Colorectal vs Health   D-Arabitol (2/15 vs 0/53)
Population Frequency = Marker metabolites?
Shared metabolites

19 D-Arabitol Identified from NMR Results Involved in Pentose and Glucuronate Interconversions Pathways

20 Roadmap Knowledge Discovery of Proteomics Data Knowledge Discovery of Metabolomics Data NMR Data GCxGC MS Data Integrative Data Mining

21 Results from GCxGC MS Data I
Metabolite identification is more straightforward
Table columns: Polyp vs Healthy; Colorectal vs Polyp; Colorectal vs Healthy
Metabolites:
- Methanesulfinic acid, trimethylsilyl ester
- Acetic acid, (methoxyimino)-, trimethylsilyl ester
- Butanoic acid, 2-[(trimethylsilyl)oxy]-, trimethylsilyl ester
- Propanoic acid, 2-(methoxyimino)-, trimethylsilyl ester
- Pentanoic acid, 2-(methoxyimino)-3-methyl-, trimethylsilyl ester
- L-Valine, N-(trimethylsilyl)-, trimethylsilyl ester
- Hexanedioic acid, bis(2-ethylhexyl) ester
- Methanesulfinic acid, trimethylsilyl ester
- Cholesterol trimethylsilyl ether
- Mefloquine
- Pentanedioic acid, 2-(methoxyimino)-, bis(trimethylsilyl) ester
- Hexanoic acid, trimethylsilyl ester
- Cyclohexane, 1,3,5-trimethyl-2-octadecyl-
- L-Valine, N-(trimethylsilyl)-, trimethylsilyl ester
- Pentanoic acid, 2-(methoxyimino)-3-methyl-, trimethylsilyl ester
- Tetradecanoic acid, trimethylsilyl ester
- Butanoic acid, 2-[(trimethylsilyl)oxy]-, trimethylsilyl ester
- Hexanoic acid, 2-(methoxyimino)-, trimethylsilyl ester
- psi,psi.-Carotene, 3,3',4,4'-tetradehydro-1,1',2,2'-tetrahydro-1,1'-dimethoxy-2,2'-dioxo-
- Cyclohexane, 1,3,5-trimethyl-2-octadecyl-
- 3,6-Dioxa-2,7-disilaoctane, 2,2,4,7,7-pentamethyl-
- Silanol, trimethyl-, pyrophosphate (4:1)
- Butanoic acid, 2-(methoxyimino)-3-methyl-, trimethylsilyl ester
- Trimethylsilyl ether of glycerol
- L-Asparagine, N,N2-bis(trimethylsilyl)-, trimethylsilyl ester
- Ethylbis(trimethylsilyl)amine
- Cyclotrisiloxane, 2,4,6-trimethyl-2,4,6-triphenyl-
- Benzene, (1-hexadecylheptadecyl)-
- Pentanedioic acid, 2-(methoxyimino)-, bis(trimethylsilyl) ester

22 Results from GCxGC MS Data II
A. Polyp vs Healthy
B. Polyp vs Colorectal
C. Colorectal vs Healthy

23 Comparative Results (Intensity vs. Population)
Marker metabolite panel: clustering of three groups
Figure panels:
- Intensity-based heat map
- Population frequency-based heat map

24 Metabolites identified from GCxGC MS Results Involved in Fatty Acid Biosynthesis Pathways

25 Roadmap Knowledge Discovery of Proteomics Data Knowledge Discovery of Metabolomics Data Integrative Data Mining

26 Data Set Description
- Diet, Lipidomics, Oxidative and VD (Vitamin D)
- The number of features and the total number of subjects vary
- Three classes are balanced to the least common denominator
  - Healthy vs. Polyp
  - Healthy vs. Colorectal
  - Polyp vs. Colorectal
Table columns: Diet, Lipid, Oxidative, VD
Total Subjects:
Total Features: 384932

27 Predictive Modeling Methods
- Data Preprocessing
  - Filtering outliers (more than three standard deviations away from the mean)
  - Data normalization (transforming to the 0-1 range)
  - Binned categorical data using the quantile binning method
- Missing Value Treatment
  - Replaced with the mean value of the attribute in the group
- Support Vector Machine (SVM) Classifier Kernel
  - Radial Basis Function (RBF) kernel is used
- Feature Selection Methods
  - Approach #1: Two-sample unpaired t-tests at the 5% significance level
  - Approach #2: SVM Attribute Evaluator with Ranker algorithm
  - Features from t-tests are filtered using p-values
- K-fold Cross-validation
Workflow figure labels: Hypothesis, Raw Dataset, Clean Dataset, Classification Model
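The preprocessing steps listed on this slide can be sketched with the Python standard library. This is a minimal illustration, not the authors' pipeline: the function names are hypothetical, the use of the population standard deviation is an assumption, and quantile binning, feature selection and the SVM itself are not shown.

```python
from statistics import mean, pstdev
from typing import List, Optional


def impute_group_mean(values: List[Optional[float]], groups: List[str]) -> List[float]:
    """Replace missing values (None) with the mean of the attribute in the same group.

    Assumes every group has at least one observed value.
    """
    observed = {}
    for value, group in zip(values, groups):
        if value is not None:
            observed.setdefault(group, []).append(value)
    group_means = {group: mean(vals) for group, vals in observed.items()}
    return [group_means[g] if v is None else v for v, g in zip(values, groups)]


def within_three_sd(values: List[float]) -> List[bool]:
    """Keep-mask: True where a value lies within 3 standard deviations of the mean.

    Assumption: population standard deviation (pstdev) is used.
    """
    mu, sd = mean(values), pstdev(values)
    return [abs(v - mu) <= 3 * sd for v in values]


def min_max(values: List[float]) -> List[float]:
    """Rescale values to the 0-1 range; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical feature column with one missing value.
imputed = impute_group_mean([2.0, 4.0, None, 6.0], ["H", "H", "H", "P"])
scaled = min_max(imputed)

# Hypothetical column with one extreme value among 20 samples.
mask = within_three_sd([10.0] * 19 + [100.0])
```

The three-standard-deviation rule needs a reasonable number of samples to flag anything: with the population standard deviation, a single extreme value among n samples can be at most sqrt(n - 1) standard deviations from the mean, so nothing can be filtered when n is 10 or fewer.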

28 Dietary Attributes as Predictors
Figure panels: Polyp vs. Healthy; Colorectal vs. Healthy
SVM predictor accuracy: 64%; 65%
Axis: P-value
Attributes: Ice cream, Rice, Tea, Shellfish, Salad, Tomato, Egg, Milk

29 Lipidomics T-Tests Results
Significant features selected from t-test with their corresponding p-value
Table columns: Features; Polyp vs. Healthy; Polyp vs. Colorectal; Colorectal vs. Healthy
16:0/18:1 PE: 1.76E-02
24:1 Cer: 6.90E-03
LPE 18:1: <1.00E-04
LPE 20:0: 1.50E E-04
An-16:0 LPA: 3.23E-02
An-18:1 LPA: 3.38E E-02
AA: 1.13E-02
18:2 LPA: 1.13E E-03
20:4 LPA: 2.40E-02
22:6 FA: 4.28E E-02
LPE 16:0: 3.08E E-03
LPE 18:0: 3.90E E-04
LPE 18:1: 2.18E-02

30 Integrating lipidomics with clinical features
Performance comparisons
Table columns: Accuracy (without pre-selection); Accuracy (with t-test pre-selection); Accuracy (automatic selection)
Table rows: Polyp vs. Healthy; Colorectal vs. Healthy*; Polyp vs. Colorectal*
Table columns: Without Clinical Features; With Clinical Features
Accuracy:
Polyp vs. Healthy: 0.55
Colorectal vs. Healthy*: 0.60
Polyp vs. Colorectal*: 0.60
* Since the number of subjects was fewer than 15, 3-fold cross-validation accuracy was reported.
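The footnote's rule for small groups can be sketched as a fold-count choice followed by a plain k-fold split. The 3-fold setting for fewer than 15 subjects is from the slide; the default of 10 folds for larger groups is an assumption (the slides only say "K-fold"), and the function names and seed are hypothetical.

```python
import random
from typing import List, Tuple


def choose_k(n_subjects: int, default_k: int = 10, small_k: int = 3, small_n: int = 15) -> int:
    """Use 3 folds when there are fewer than 15 subjects, else the assumed default."""
    return small_k if n_subjects < small_n else default_k


def k_fold_indices(n: int, k: int, seed: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Shuffle sample indices and return k (train, test) index pairs."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    all_indices = set(indices)
    return [(sorted(all_indices - set(fold)), sorted(fold)) for fold in folds]


n_subjects = 12
k = choose_k(n_subjects)
splits = k_fold_indices(n_subjects, k)
```

Each subject appears in exactly one test fold, and the reported accuracy would be the average over the k held-out folds.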

31 Messages
- Individual Omics data sets have variable predictive performance
- Need thorough statistical filtering + biological knowledge integration to battle the inherently high level of data noise
- Integration of different Omics data with clinical data can improve predictive performance

32 Acknowledgment We thank all the members in our team.

