Integrative Colorectal Cancer Omics Data Mining and Knowledge Discovery Jake Y. Chen, Ph.D. IUPUI Indiana Center for Systems Biology & Personalized Medicine.

1 Integrative Colorectal Cancer Omics Data Mining and Knowledge Discovery
Jake Y. Chen, Ph.D. IUPUI Indiana Center for Systems Biology & Personalized Medicine

2 Polyp and Colorectal Cancer
Polyp vs. Colorectal Cancer
Polyps are benign tumors of the large intestine.
They do not invade nearby tissue or spread to other parts of the body.
If not removed from the large intestine, they may become malignant (cancerous) over time.
Most cancers of the large intestine are believed to have developed from polyps.
Photo courtesy of the National Cancer Institute.
Colon Cancer vs. Rectal Cancer
They share many commonalities, including molecular mechanisms.
They tend to be treated differently.

3 Colorectal Cancer Molecular Pathways
A. Walther, et al. (2009) Nature Reviews Cancer, 9(7) pp

4 Omics/Clinical Data Source Proteomics/Metabolomics/Lipidomics/Clinical Data
LC-MS Proteomics: H=80, PP=72, CR=40, N=192
NMR Metabolomics: H=53, PP=35, CR=15, N=103
Vitamin D: H=83, PP=81, CR=31, N=195
GCxGC MS Metabolomics: H=83, PP=84, CR=30, N=197
Oxidative Stress: H=50, PP=32, CR=12, N=94
Lipidomics: H=47, PP=35, CR=15, N=97
Diet: H=70, PP=54, CR=29, N=153

5 Scientific Questions to Answer
Data Analysis
Which Omics data set has the best predictive power?
Which features in the Omics data are important?
Data Mining
Does integration of Omics data improve the prediction?
Which combination of Omics data sets has the best predictive power?
Knowledge Discovery
Why do those features in the Omics data have the best predictive power?

6 Roadmap
Knowledge Discovery of Proteomics Data
Knowledge Discovery of Metabolomics Data
Integrative Data Mining

7 Proteomics Data Description
Group: Bindley Biosciences Center at Purdue University
Instruments: Agilent's chip cube coupled to the XCT PLUS ESI ion trap
Data format at CCE webportal: mzXML
Number of Samples: Normal: 80; Polyp: 72; Colorectal: 40

8 LC-MS Proteomics Data Processing
Figure panels: LC/MS data “heat map”; Image Enhanced LC/MS data “heat map”; Total Ion Chromatogram (TIC) summarized from enhanced heat map
Methods adapted from:
N. Jeffries (2005) Bioinformatics, vol. 21, (no. 14), pp
S.A. Kazmi, et al., (2006) Metabolomics, vol. 2, (no. 2), pp
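
The step from heat map to TIC can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes the heat map is stored as a matrix with one row per retention-time scan and one column per m/z bin, and the function name and the example values are hypothetical.

```python
from typing import List


def total_ion_chromatogram(heat_map: List[List[float]]) -> List[float]:
    """Sum intensities over all m/z bins for each retention-time scan."""
    return [sum(scan) for scan in heat_map]


# Hypothetical heat map: 3 scans (rows) x 4 m/z bins (columns).
heat_map = [
    [0.0, 2.0, 1.0, 0.0],
    [5.0, 3.0, 0.0, 1.0],
    [0.0, 0.5, 0.5, 0.0],
]

tic = total_ion_chromatogram(heat_map)
# Index of the scan with the highest total ion current.
peak_scan = max(range(len(tic)), key=tic.__getitem__)
```

Scans with high TIC values are natural candidates for the retention-time regions that are later searched for protein identifications.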

9 LC-MS Major Protein Identification
~25-28 characteristic proteins identified per sample
Identify Most Informative TIC R.T. “Grid”
Use Mascot to Search for Protein ID at R.T. Grid Regions
Apply the R.T. Grid to Original Spectra
No Scan RT Uniprot_ID Score Expect Evidence
1 119 139.48 ADAD2_HUMAN 38 3.3
2 229 265.87 NNMT_HUMAN 43 1.1
3 372 429.15 ZSA5D_HUMAN 42 1.2
4 656 749.8 BRAF_HUMAN 40 2.2 479
5 1162 1276.6 RGS7_HUMAN 47 0.39
6 1310 1407.2 TTC9C_HUMAN 35 6.3
7 1669 1713.9 CP042_HUMAN 3.1
8 1866 1879.1 HXD11_HUMAN 34 8.4
9 1987 1980.3 ING4_HUMAN
10 2114 2086 ZN423_HUMAN 33
11 2353 2285.7 CL065_HUMAN 37 3.9
12 2539 2441.3 CA5BL_HUMAN 0.4
13 2722 2594.7 NPDC1_HUMAN 3.6
14 2874 2722.2 DJC27_HUMAN 3.8
15 3001 2828.5 BORG4_HUMAN
16 3165 2965.1 KC1G1_HUMAN 27
17 3440 3196.1 TPPC5_HUMAN
18 3656 3377.6 UB2D3_HUMAN 0.99
19 3997 3665.5 TM208_HUMAN 8.1
20 4257 3885.4 ZBED3_HUMAN 29 23

10 Proteomics Result Interpretation
Proteins Identified from Colon Cancer and Healthy Groups
Uniprot_ID Frequency in Colon (10) Frequency in Health (10) Evidence in PubMed
BRAF_HUMAN 3 508
DMP46_HUMAN
NNMT_HUMAN 1 4
MRP_HUMAN
STK33_HUMAN
Proteins Interacting with High-Frequency Proteins from Colon Cancer Group
Uniprot_ID, Gene, Protein Name, Evidence in PubMed
BRAF1_HUMAN, BRAF, Serine/threonine-protein kinase B-raf, 508
P53_HUMAN, TP53, Cellular tumor antigen p53, 443
CD44_HUMAN, CD44, CD44 antigen, 411
MDM2_HUMAN, MDM2, E3 ubiquitin-protein ligase Mdm2, 131
BCR_HUMAN, BCR, Breakpoint cluster region protein, 59
LCK_HUMAN, LCK, Tyrosine-protein kinase Lck, 29
Q7RTZ3_HUMAN
CAV1_HUMAN, CAV1, Caveolin-1, 21
PNPH_HUMAN, PNP, Purine nucleoside phosphorylase, 13
CBL_HUMAN, CBL, E3 ubiquitin-protein ligase CBL, 11
RAF1_HUMAN, RAF1, RAF proto-oncogene serine/threonine-protein kinase, 10
CD38_HUMAN, CD38, ADP-ribosyl cyclase 1, 8
NNMT_HUMAN, NNMT, Nicotinamide N-methyltransferase, 4
IRAK1_HUMAN, IRAK1, Interleukin-1 receptor-associated kinase 1, 3
DMPK_HUMAN, DMPK, Myotonin-protein kinase, 2
ITA5_HUMAN, ITGA5, Integrin alpha-5, 1
ITB1_HUMAN, ITGB1, Integrin beta-1
ZAP70_HUMAN, ZAP70, Tyrosine-protein kinase ZAP-70

11 Proteomics Result Interpretation A Network Biology Context
Protein Network Constructed from the Top 3 Differential Proteins
Green-circled proteins are frequently detected (frequency >= 0.3) in colon cancer patient blood samples using LC/MS.
Node: Protein with evidence from PubMed, found by searching ("GENE_SYMBOL" AND ("colon" OR "colorectal") AND ("cancer" OR "carcinoma"))
Edge: Protein interaction with confidence score from HAPPI 1.31 (4&5-Star)
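
The node criteria above can be sketched in a few lines. The query template and the 0.3 threshold come from the slide; the function names are hypothetical, and how the query is submitted to PubMed is not shown because the slides do not describe it.

```python
def pubmed_query(gene_symbol: str) -> str:
    """Fill the slide's PubMed search template with one gene symbol."""
    return (
        f'("{gene_symbol}" AND ("colon" OR "colorectal") '
        f'AND ("cancer" OR "carcinoma"))'
    )


def frequently_detected(count: int, n_samples: int, threshold: float = 0.3) -> bool:
    """True if a protein is detected in at least `threshold` of the samples."""
    return count / n_samples >= threshold


query = pubmed_query("BRAF")
# BRAF_HUMAN is listed with a frequency of 3 out of 10 colon samples.
braf_is_frequent = frequently_detected(3, 10)
```

The number of PubMed hits returned for each query would then serve as the "Evidence in PubMed" value attached to a node.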

12 Proteomics Result Interpretation A Biological Pathway Context
BRAF (Serine/threonine-protein kinase B-raf) plays a major role in the Colorectal Cancer Pathway (KEGG data)

13 Proteomics Result Interpretation A Biological Pathway Context for NNMT
NNMT (Nicotinamide N-methyltransferase) is involved in Biological Oxidations/Phase II Conjugation/Methylation (from Reactome)

14 Roadmap
Knowledge Discovery of Proteomics Data
Knowledge Discovery of Metabolomics Data: NMR Data; GCxGC MS Data
Integrative Data Mining

15 Metabolomics Data Description
Group: Daniel Raftery Laboratory at Purdue University
NMR Data
Instruments: Bruker Avance 500 MHz NMR
Data format at CCE webportal: Excel spreadsheet
Number of Samples: Normal: 53; Polyp: 35; Colorectal: 15
GCxGC MS Data
Instruments: LECO Pegasus 4D GCxGC-TOF
Number of Samples: Normal: 83; Polyp: 84; Colorectal: 30
GCxGC MS: highly sensitive (can identify low-abundance metabolites)
NMR: highly reproducible and quantifiable

16 NMR Data Analysis Workflow
Signal Processing
Extract peaks’ ppm
Search against the Human Metabolome Database (2.5) to identify metabolites
Report only significant metabolites
Note: The solvent is not mentioned on the web. According to the group’s previous publication, water was used in the experiment (I. R. Lanza, S. Zhang, L. E. Ward, H. Karakelides, D. Raftery, and S. Nair, "Quantitative Metabolomics by 1H-NMR and LC-MS/MS Confirms Altered Metabolic Pathways in Diabetes," PLoS ONE, 5, 1-10 (2010)).
Sample_ID 1 2
Top1 Delta-Hexanolactone
Top2 Hypotaurine
Top3 2,3-Diphosphoglyceric acid Diethanolamine
Top4 3,7-Dimethyluric acid
Top5 3-Phosphoglyceric acid Methyl isobutyl ketone
Top6 1,3,7-Trimethyluric acid
Top7 Cysteine-S-sulfate
Top8 L-Allothreonine
Top9
Top10

17 NMR Peak Metabolite Identification using the Human Metabolome Database
1) Input the peak lists 2) Get the metabolites; leave out those with fewer than 2 matches
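
The two steps above can be sketched as a simple tolerance-based peak match. This is only an illustration of the idea, not the database's actual search algorithm: the ppm tolerance, the function name, and the reference peak lists (placeholder metabolites A and B) are all assumptions.

```python
from typing import Dict, List


def match_metabolites(
    peaks_ppm: List[float],
    reference: Dict[str, List[float]],
    tolerance: float = 0.02,  # assumed ppm tolerance, not from the slides
    min_matches: int = 2,     # the slide drops metabolites with fewer than 2 matches
) -> Dict[str, int]:
    """Count reference peaks matched by the sample; keep metabolites with enough matches."""
    hits = {}
    for name, ref_peaks in reference.items():
        matched = sum(
            1
            for ref in ref_peaks
            if any(abs(ref - peak) <= tolerance for peak in peaks_ppm)
        )
        if matched >= min_matches:
            hits[name] = matched
    return hits


# Hypothetical reference library and sample peak list (placeholder values).
reference = {
    "metabolite_A": [1.20, 3.65, 3.80],
    "metabolite_B": [2.10, 7.40],
}
sample_peaks = [1.21, 3.66, 2.50, 7.40]

hits = match_metabolites(sample_peaks, reference)
```

Here metabolite_A is kept because two of its reference peaks are matched, while metabolite_B is left out because only one is.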

18 Significant Metabolites Identified from NMR Metabolomics Data
Marker metabolites? Shared metabolites
Group: Metabolites
Polyp vs Healthy: D-Arabitol, D-Pantethine (2/35 vs 0/53)
Colorectal vs Polyp: None
Colorectal vs Healthy: D-Arabitol (2/15 vs 0/53)
Population frequency: heat map of (0,1) values for the colorectal group.
Clustering method defaults to "complete" (the agglomeration method to be used; this should be (an unambiguous abbreviation of) one of "ward", "single", "complete", "average", "mcquitty", "median" or "centroid"). Distance defaults to Euclidean.
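
Population frequency, as used in the table above, is the fraction of samples in a group in which a metabolite was detected. A minimal sketch follows; it assumes detection is recorded as 0/1 flags per sample, the function name is hypothetical, and which individual samples are flagged is made up to reproduce the 2/15 and 0/53 counts for D-Arabitol.

```python
from typing import Dict, List


def population_frequency(presence: Dict[str, List[int]]) -> Dict[str, float]:
    """Fraction of samples (0/1 flags) in which each metabolite is detected."""
    return {name: sum(flags) / len(flags) for name, flags in presence.items()}


# D-Arabitol: detected in 2 of 15 colorectal samples and 0 of 53 healthy samples.
colorectal = {"D-Arabitol": [1, 1] + [0] * 13}
healthy = {"D-Arabitol": [0] * 53}

freq_colorectal = population_frequency(colorectal)
freq_healthy = population_frequency(healthy)
```

The 0/1 matrix itself is what the population-frequency heat map displays, and the per-group fractions are what the comparison table reports.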

19 D-Arabitol Identified from NMR Results Involved in Pentose and Glucuronate Interconversions Pathways
SMPDB: Small Molecule Pathway Database (http://www.smpdb.ca/). Red indicates the hit metabolites in the pathways.

20 Roadmap
Knowledge Discovery of Proteomics Data
Knowledge Discovery of Metabolomics Data: NMR Data; GCxGC MS Data
Integrative Data Mining

21 Results from GCxGC MS Data I
Metabolite identification is more straightforward
Comparisons: Polyp vs Healthy; Colorectal vs Polyp; Colorectal vs Healthy
Metabolites:
Methanesulfinic acid, trimethylsilyl ester
Acetic acid, (methoxyimino)-, trimethylsilyl ester
Butanoic acid, 2-[(trimethylsilyl)oxy]-, trimethylsilyl ester
Propanoic acid, 2-(methoxyimino)-, trimethylsilyl ester
Pentanoic acid, 2-(methoxyimino)-3-methyl-, trimethylsilyl ester
L-Valine, N-(trimethylsilyl)-, trimethylsilyl ester
Hexanedioic acid, bis(2-ethylhexyl) ester
Cholesterol trimethylsilyl ether
Mefloquine
Pentanedioic acid, 2-(methoxyimino)-, bis(trimethylsilyl) ester
Hexanoic acid, trimethylsilyl ester
Cyclohexane, 1,3,5-trimethyl-2-octadecyl-
Tetradecanoic acid, trimethylsilyl ester
Hexanoic acid, 2-(methoxyimino)-, trimethylsilyl ester
psi,psi.-Carotene, 3,3',4,4'-tetradehydro-1,1',2,2'-tetrahydro-1,1'-dimethoxy-2,2'-dioxo-
3,6-Dioxa-2,7-disilaoctane, 2,2,4,7,7-pentamethyl-
Silanol, trimethyl-, pyrophosphate (4:1)
Butanoic acid, 2-(methoxyimino)-3-methyl-, trimethylsilyl ester
Trimethylsilyl ether of glycerol
L-Asparagine, N,N2-bis(trimethylsilyl)-, trimethylsilyl ester
Ethylbis(trimethylsilyl)amine
Cyclotrisiloxane, 2,4,6-trimethyl-2,4,6-triphenyl-
Benzene, (1-hexadecylheptadecyl)-

22 Results from GCxGC MS Data II
Panels: A. Polyp vs Healthy; B. Polyp vs Colorectal; C. Colorectal vs Healthy
Clustering method defaults to "complete" (the agglomeration method to be used; this should be (an unambiguous abbreviation of) one of "ward", "single", "complete", "average", "mcquitty", "median" or "centroid"). Distance defaults to Euclidean.
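
The clustering defaults named above (complete linkage, Euclidean distance) can be sketched in plain Python. The slides appear to rely on a standard heat map routine for this step; the code below is only an illustration of the technique, with hypothetical function names and made-up points, and it favors readability over speed.

```python
import math
from typing import List, Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def complete_linkage(points: List[Sequence[float]], n_clusters: int) -> List[List[int]]:
    """Agglomerative clustering; cluster distance is the maximum pairwise distance."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Complete linkage: the farthest pair defines the cluster distance.
                d = max(
                    euclidean(points[a], points[b])
                    for a in clusters[i]
                    for b in clusters[j]
                )
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]


# Hypothetical 2-D profiles for five samples.
points = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0], [10.0, 0.0]]
groups = complete_linkage(points, n_clusters=3)
```

With these points, the two tight pairs are merged first and the isolated fifth point remains its own cluster.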

23 Comparative Results (Intensity vs. Population)
Marker Metabolite Panel
Clustering of three groups
Panels: Intensity-based heat map; Population frequency-based heat map
We analyzed 28 samples in total (9 normal, 9 polyp, 10 colorectal) to be consistent with the subjects in the proteomic analysis (i.e., to try to eliminate individual differences when integrating the omics data).
The heat maps show only 30 metabolites for representation purposes; the rest of the data is available upon request. Intensity values were log2-transformed when generating the heat map because the raw values are too large.
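
The log2 transformation mentioned above can be sketched as follows. The pseudocount added before taking the logarithm is an assumption (the slides do not say how zero intensities were handled), and the function name and intensity values are hypothetical.

```python
import math
from typing import List


def log2_transform(matrix: List[List[float]], pseudocount: float = 1.0) -> List[List[float]]:
    """Apply log2(value + pseudocount) to every intensity in the matrix."""
    return [[math.log2(value + pseudocount) for value in row] for row in matrix]


# Hypothetical raw intensities: rows are metabolites, columns are samples.
raw = [
    [1023.0, 0.0],
    [3.0, 7.0],
]
transformed = log2_transform(raw)
```

Compressing the dynamic range this way keeps a few very large intensities from dominating the heat map color scale.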

24 Metabolites identified from GCxGC MS Results Involved in Fatty Acid Biosynthesis Pathways
SMPDB: Small Molecule Pathway Database (http://www.smpdb.ca/). Red indicates the hit metabolites in the pathways.

25 Roadmap
Knowledge Discovery of Proteomics Data
Knowledge Discovery of Metabolomics Data
Integrative Data Mining

26 Data Set Description Diet, Lipidomics, Oxidative and VD
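The number of features and the total number of subjects vary across data sets.
Three classes are balanced to the least common denominator.
Comparisons: Healthy vs. Polyp; Healthy vs. Colorectal; Polyp vs. Colorectal
Diet: 150 total subjects, 38 total features
Lipid: 97 total subjects, 49 total features
Oxidative: 94 total subjects, 3 total features
VD: 195 total subjects, 2 total features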

27 Predictive Modeling Methods
Diagram labels: Raw Dataset; Data Preprocessing; Clean Dataset; Classification Model; Hypothesis
Data Preprocessing
Filtering outliers (three standard deviations away from the mean)
Data normalization (transforming to the 0-1 range)
Categorical data binned using the quantile binning method
Missing Value Treatment
Replaced with the mean value of the attribute in the group
Classifier
Support vector machines (SVM)
Kernel: Radial Basis Function (RBF) kernel is used
Feature Selection Methods
Approach #1: Two-sample unpaired t-tests at the 5% significance level.
Approach #2: SVM Attribute Evaluator with Ranker algorithm.
Features from the t-tests are filtered using p-values.
K-fold cross-validation
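
Three of the preprocessing steps listed above can be sketched for a single numeric feature. This is a minimal illustration under stated assumptions: the population standard deviation is used for the outlier rule, missing values are represented as None, quantile binning is omitted, and all function names and values are hypothetical.

```python
import statistics
from typing import List, Optional


def impute_group_mean(values: List[Optional[float]]) -> List[float]:
    """Replace missing values (None) with the mean of the observed values in the group."""
    observed = [v for v in values if v is not None]
    mean = statistics.mean(observed)
    return [mean if v is None else v for v in values]


def filter_outliers(values: List[float], n_sd: float = 3.0) -> List[float]:
    """Drop values more than n_sd standard deviations away from the mean."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # assumption: population standard deviation
    return [v for v in values if abs(v - mean) <= n_sd * sd]


def min_max_scale(values: List[float]) -> List[float]:
    """Rescale values linearly to the 0-1 range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical feature values for one group, with one missing entry.
imputed = impute_group_mean([2.0, None, 4.0])
scaled = min_max_scale(imputed)

# Hypothetical feature with one extreme value among 21 measurements.
kept = filter_outliers([10.0] * 20 + [100.0])
```

Imputing within each group rather than across the whole data set keeps class-specific differences from being averaged away, at the cost of using class labels during preprocessing.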

28 Dietary Attributes as Predictors
Comparisons: Colorectal vs. Healthy; Polyp vs. Healthy
P-value: 2.53E-02, 9.57E-01, 3.71E-02, 5.60E-02
P-value: 2.38E-02, 4.21E-01, 4.11E-02, 1.21E-01
Dietary attributes: Salad, Ice cream, Tomato, Rice, Egg, Tea, Milk, Shellfish
SVM Predictor Accuracy = 64%
SVM Predictor Accuracy = 65%

29 Lipidomics T-Tests Results
Significant features selected by t-test, with their corresponding p-values
Features Polyp vs. Healthy Polyp vs. Colorectal Colorectal vs. Healthy
16:0/18:1 PE 1.76E-02
24:1 Cer 6.90E-03
LPE 18:1 <1.00E-04
LPE 20:0 1.50E-03 2.00E-04
An-16:0 LPA 3.23E-02
An-18:1 LPA 3.38E-02 1.33E-02
AA 1.13E-02
18:2 LPA 4.50E-03
20:4 LPA 2.40E-02
22:6 FA 4.28E-02 3.24E-02
LPE 16:0 3.08E-02 3.40E-03
LPE 18:0 3.90E-03 1.00E-04 2.18E-02
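
The statistic behind an unpaired two-sample t-test can be sketched as follows. The slides do not say whether equal variances were assumed, so the pooled-variance (Student) form used here is an assumption; the function name and the example values are hypothetical. Converting the statistic to a p-value requires the t-distribution with the returned degrees of freedom, which is left to a statistics library.

```python
import math
import statistics
from typing import List, Tuple


def unpaired_t_statistic(a: List[float], b: List[float]) -> Tuple[float, int]:
    """Student's two-sample t statistic with pooled variance; returns (t, degrees of freedom)."""
    na, nb = len(a), len(b)
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    df = na + nb - 2
    pooled = ((na - 1) * var_a + (nb - 1) * var_b) / df
    standard_error = math.sqrt(pooled * (1 / na + 1 / nb))
    t = (statistics.mean(a) - statistics.mean(b)) / standard_error
    return t, df


# Hypothetical lipid intensities for two groups.
group_a = [1.0, 2.0, 3.0]
group_b = [4.0, 5.0, 6.0]
t_value, dof = unpaired_t_statistic(group_a, group_b)
```

A feature would be retained when the p-value derived from this statistic falls below the 5% significance level.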

30 Integrating lipidomics with clinical features Performance comparisons
Without Clinical Features
Comparison, Accuracy
Polyp vs. Healthy, 0.55
Colorectal vs. Healthy*, 0.60
Polyp vs. Colorectal*
With Clinical Features
Comparison, Accuracy (without pre-selection), Accuracy (with t-test pre-selection), Accuracy (automatic selection)
Polyp vs. Healthy, 0.54, 0.71, 0.78
Colorectal vs. Healthy*, 0.57, 0.63, 0.73
Polyp vs. Colorectal*, 0.70, 0.90, 0.87
* Since the number of subjects was less than 15, 3-fold cross-validation accuracy was reported.
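
The k-fold cross-validation behind these accuracies can be sketched as an index split. This is a minimal illustration, assuming no shuffling or stratification; the function name and the sample count of 12 are hypothetical, and the classifier itself is not shown.

```python
from typing import List, Tuple


def k_fold_indices(n_samples: int, k: int) -> List[Tuple[List[int], List[int]]]:
    """Split sample indices into k folds; each fold serves once as the test set."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    splits = []
    for i in range(k):
        test = folds[i]
        train = sorted(
            idx for j, fold in enumerate(folds) if j != i for idx in fold
        )
        splits.append((train, test))
    return splits


# Hypothetical small data set: 12 subjects, 3-fold cross-validation.
splits = k_fold_indices(12, 3)
first_train, first_test = splits[0]
```

The reported accuracy would be the average of the per-fold test accuracies, so every subject contributes to the estimate exactly once.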

31 Messages
Individual Omics data sets have variable predictive performance.
Thorough statistical filtering and biological knowledge integration are needed to battle the inherently high level of data noise.
Integration of different Omics data with clinical data can improve predictive performance.

32 Acknowledgment We thank all the members in our team.

