Presentation on theme: "CZ5225: Modeling and Simulation in Biology Lecture 6, Microarray Cancer Classification Prof. Chen Yu Zong Tel: 6874-6877"— Presentation transcript:

1 CZ5225: Modeling and Simulation in Biology. Lecture 6: Microarray Cancer Classification. Prof. Chen Yu Zong. Tel: 6874-6877. Email: csccyz@nus.edu.sg. Web: http://xin.cz3.nus.edu.sg. Room 07-24, Level 7, SOC1, National University of Singapore.

2 Why Microarray? Although there has been some improvement over the past 30 years, there is still no general way of:
–Identifying new cancer classes
–Assigning tumors to known classes
The paper discussed here (Golub et al., 1999) introduces two general approaches:
–Class prediction for a new tumor
–Class discovery of new, unknown subclasses
–without using prior biological information

3 Why Microarray? Why do we need to classify cancers?
–The general way of treating cancer is to categorize the cancers into different classes and use a treatment specific to each class.
–The traditional way of classifying is by morphological appearance.

4 Why Microarray? Why are the traditional ways not enough?
–Some tumors in the same class have completely different clinical courses, so a more accurate classification may be needed.
–Assigning new tumors to known cancer classes is not easy, e.g. assigning an acute leukemia tumor to either AML or ALL.

5 Cancer Classification
–Class discovery: identifying new cancer classes
–Class prediction: assigning tumors to known classes

6 Cancer Genes and Pathways: 15 cancer-related pathways, 291 cancer genes, 34 angiogenesis genes, 12 tumor immune tolerance genes. Nature Medicine 10, 789-799 (2004); Nature Reviews Cancer 4, 177-183 (2004) and 6, 613-625 (2006); Critical Reviews in Oncology/Hematology 59, 40-50 (2006). http://bidd.nus.edu.sg/group/trmp/trmp.asp

7 Disease outcome prediction with microarray. [Figure: expression profiles of patient i and normal person j are fed to an SVM that separates Patient from Normal; the most discriminative genes form the signature of predictor-genes.] Signatures give better predictive power and clues to disease genes and drug targets.
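To make the setup concrete, here is a minimal sketch of this kind of SVM-based prediction using scikit-learn; the sample sizes, gene counts, and random expression values are illustrative assumptions, not data from the lecture.

    # Minimal sketch: SVM classification of microarray expression profiles
    # (hypothetical data; gene counts and sample sizes are illustrative only).
    import numpy as np
    from sklearn.svm import LinearSVC
    from sklearn.preprocessing import StandardScaler
    from sklearn.pipeline import make_pipeline
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    n_patients, n_normals, n_genes = 30, 30, 5000            # illustrative sizes
    X = rng.normal(size=(n_patients + n_normals, n_genes))   # expression matrix
    y = np.array([1] * n_patients + [0] * n_normals)         # 1 = patient, 0 = normal
    X[y == 1, :20] += 1.0                                     # pretend 20 genes are informative

    clf = make_pipeline(StandardScaler(), LinearSVC(C=1.0, dual=False))
    print("5-fold CV accuracy:", cross_val_score(clf, X, y, cv=5).mean())

    # Genes with the largest absolute SVM weights are candidate predictor-genes.
    clf.fit(X, y)
    weights = np.abs(clf.named_steps["linearsvc"].coef_).ravel()
    signature = np.argsort(weights)[::-1][:20]
    print("Top candidate predictor-gene indices:", signature)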

8 Disease outcome prediction with microarray. Expected features of signatures:
–Composition: certain percentages of cancer genes, genes in cancer pathways, and angiogenesis genes
–Stability: a similar set of predictor-genes in different patient compositions measured under the same or similar conditions
How many genes should be in a signature?
Class                                                        | No. of genes or pathways
Cancer genes (oncogenes, tumor-suppressors, stability genes) | 219
Cancer pathways                                              | 15
Angiogenesis                                                 | 34
Cancer immune tolerance                                      | 15

9 Class Prediction. How could one use an initial collection of samples belonging to known classes to create a class predictor?
–Gathering samples
–Hybridizing RNAs to the microarray
–Obtaining the quantitative expression level of each gene
–Identifying informative genes via neighborhood analysis
–Weighted voting

10 Neighborhood Analysis. We want to identify the genes whose expression patterns are strongly correlated with the class distinction to be predicted, and ignore the other genes.
–Each gene is represented by an expression vector consisting of its expression level in each sample.
–Count the number of genes having various levels of correlation with the idealized class vector c.
–Compare with the correlations obtained when c is randomly permuted.
The results show an unusually high density of correlated genes!
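A minimal sketch of this neighborhood analysis, assuming the signal-to-noise measure P(g, c) = (mean0 - mean1) / (std0 + std1) as the gene-class correlation and a permutation test for the reference distribution; the expression matrix and the threshold are illustrative stand-ins.

    # Sketch of neighborhood analysis: count genes whose class correlation exceeds
    # what is seen for randomly permuted class labels (random data for illustration).
    import numpy as np

    rng = np.random.default_rng(1)
    n_samples, n_genes = 38, 2000
    X = rng.normal(size=(n_samples, n_genes))   # expression matrix (samples x genes)
    c = np.array([0] * 27 + [1] * 11)           # idealized class vector (e.g. ALL=0, AML=1)

    def signal_to_noise(X, c):
        """P(g, c) = (mean_class0 - mean_class1) / (std_class0 + std_class1) per gene."""
        m0, m1 = X[c == 0].mean(0), X[c == 1].mean(0)
        s0, s1 = X[c == 0].std(0), X[c == 1].std(0)
        return (m0 - m1) / (s0 + s1)

    observed = signal_to_noise(X, c)

    # Permutation reference: how many genes reach this correlation level by chance?
    n_perm, threshold = 400, 0.8
    perm_counts = [np.sum(np.abs(signal_to_noise(X, rng.permutation(c))) > threshold)
                   for _ in range(n_perm)]

    n_observed = np.sum(np.abs(observed) > threshold)
    print(f"genes with |P| > {threshold}: {n_observed} observed, "
          f"{np.mean(perm_counts):.1f} expected by chance")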

11 Neighborhood Analysis. [Figure: idealized expression pattern for the class distinction.]

12 Class Predictor. The general approach:
–Choosing a set of informative genes based on their correlation with the class distinction
–Each informative gene casts a weighted vote for one of the classes
–Summing up the votes to determine the winning class and the prediction strength

13 Computing Votes. Each gene Gi votes for AML or ALL, depending on whether the expression level xi of that gene in the new tumor is nearer to the mean of Gi in AML or in ALL.
The value of the vote is Wi * Vi, where:
–Wi reflects how well Gi is correlated with the class distinction
–Vi = | xi - (AML mean + ALL mean) / 2 |
The prediction strength reflects the margin of victory:
–PS = (Vwin - Vlose) / (Vwin + Vlose)
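A minimal sketch of the weighted-voting predictor as summarized on this slide, taking the weight Wi to be the signal-to-noise measure from the neighborhood analysis; the number of informative genes and the data are illustrative assumptions.

    # Sketch of the weighted-voting class predictor (illustrative data and parameters).
    import numpy as np

    def train_predictor(X, y, n_informative=50):
        """Per-gene class means and signal-to-noise weights; keep the top-weighted genes."""
        m0, m1 = X[y == 0].mean(0), X[y == 1].mean(0)    # class 0 = ALL, class 1 = AML
        s0, s1 = X[y == 0].std(0), X[y == 1].std(0)
        w = (m0 - m1) / (s0 + s1)                        # correlation with the class distinction
        genes = np.argsort(np.abs(w))[::-1][:n_informative]
        return genes, w[genes], (m0[genes] + m1[genes]) / 2.0

    def predict(x, genes, w, midpoint):
        """Each selected gene casts a weighted vote; return winner and prediction strength."""
        votes = w * (x[genes] - midpoint)        # positive votes favor class 0, negative class 1
        v0 = votes[votes > 0].sum()              # total vote for class 0 (ALL)
        v1 = -votes[votes < 0].sum()             # total vote for class 1 (AML)
        v_win, v_lose = max(v0, v1), min(v0, v1)
        ps = (v_win - v_lose) / (v_win + v_lose) # prediction strength = margin of victory
        return (0 if v0 >= v1 else 1), ps

    # Illustrative use on random data shaped like the leukemia training set.
    rng = np.random.default_rng(2)
    X = rng.normal(size=(38, 2000))
    y = np.array([0] * 27 + [1] * 11)
    X[y == 1, :30] += 1.5                        # pretend 30 genes separate the classes
    genes, w, mid = train_predictor(X, y)
    label, ps = predict(X[0], genes, w, mid)
    print("predicted class:", label, "prediction strength:", round(ps, 2))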

14 Class Predictor

15 Evaluation. Data:
–Initial samples: 38 bone marrow samples (27 ALL, 11 AML) obtained at the time of diagnosis.
–Independent samples: 34 leukemia samples consisting of 24 bone marrow and 10 peripheral blood samples (20 ALL and 14 AML).
Validation of gene voting:
–Initial samples: 36 of the 38 samples were predicted as either AML or ALL and two as uncertain; all 36 predictions agree with the clinical diagnosis.
–Independent samples: 29 of the 34 samples were strongly predicted, with 100% accuracy.

16 Validation of Gene Voting

17 An early kind of analysis: unsupervised learning → learning disease sub-types. [Figure: samples plotted by Rb and p53 expression.]

18 Sub-type learning: seeking 'natural' groupings & hoping that they will be useful… [Figure: clusters of samples in the Rb vs p53 expression plane.]

19 E.g., for treatment. [Figure: Rb vs p53 expression plane; one group responds to treatment Tx1, the other does not.]

20 The 'one solution fits all' trap. [Figure: Rb vs p53 expression plane; groups that respond and do not respond to treatment Tx2.]

21 A more modern view: supervised learning

22 Predictive Biomarkers & Supervised Learning

23 Predictive Biomarkers & Supervised Learning

24 A more modern view 2: Unsupervised learning as structure learning

25 Causative Biomarkers & (structural) unsupervised learning

26 Supervised learning: the geometrical interpretation

27 If 2D looks good, what happens in 3D? Numbers of measured features:
–10,000-50,000 (regular gene expression microarrays, aCGH, and early SNP arrays)
–500,000 (tiled microarrays, SNP arrays)
–10,000-300,000 (regular MS proteomics)
–>10,000,000 (LC-MS proteomics)
This is the 'curse of dimensionality' problem.

28 Problems associated with high dimensionality (especially with small samples):
–Some methods do not run at all (classical regression)
–Some methods give bad results
–Very slow analysis
–Very expensive/cumbersome clinical application

29 Solution 1: dimensionality reduction
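As one common instance of dimensionality reduction, here is a minimal PCA-plus-classifier sketch with scikit-learn; the number of components and the random data are illustrative assumptions.

    # Sketch: reduce thousands of gene-expression features to a few principal
    # components before classification (random data for illustration).
    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import LinearSVC
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(3)
    X = rng.normal(size=(60, 10_000))            # 60 samples, 10,000 genes
    y = rng.integers(0, 2, size=60)              # binary class labels

    model = make_pipeline(StandardScaler(), PCA(n_components=10), LinearSVC(dual=False))
    print("CV accuracy:", cross_val_score(model, X, y, cv=5).mean())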

30 Solution 2: feature selection. [Figure: a network of variables (A, B, C, ..., Q) around a target T, from which a small subset of features is selected.]
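A minimal sketch of simple univariate feature selection wrapped in a pipeline, as one instance of this idea; the filter statistic (ANOVA F), the number of genes kept, and the data are assumptions for illustration.

    # Sketch: keep only the k genes most associated with the class, then classify.
    import numpy as np
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(4)
    X = rng.normal(size=(60, 10_000))
    y = rng.integers(0, 2, size=60)

    # Selection happens inside the pipeline, i.e. inside each CV training fold,
    # which avoids the selection bias discussed on the over-fitting slides.
    model = make_pipeline(SelectKBest(f_classif, k=50), LinearSVC(dual=False))
    print("CV accuracy:", cross_val_score(model, X, y, cv=5).mean())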

31 Another (very real and unpleasant) problem: over-fitting. Over-fitting (a model to your data) = building a model that is good on the original data but fails to generalize well to fresh data.

32 Over-fitting is directly related to the complexity of the decision surface (relative to the complexity of the modeling task).

33 Over-fitting is also caused by multiple validations & small samples

34 Over-fitting is also caused by multiple validations & small samples
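To make this concrete, here is a small illustration (not from the slides) of how selecting genes on the full dataset before cross-validation inflates the accuracy estimate on pure-noise data, while doing the selection inside each training fold does not.

    # Illustration of selection bias: feature selection outside vs inside cross-validation
    # on pure-noise data (any apparent accuracy above ~0.5 is over-fitting).
    import numpy as np
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(5)
    X = rng.normal(size=(40, 5000))              # 40 samples, 5000 noise "genes"
    y = rng.integers(0, 2, size=40)              # random labels: no true signal

    # Wrong: select genes using all samples, then cross-validate the classifier.
    X_sel = SelectKBest(f_classif, k=20).fit_transform(X, y)
    biased = cross_val_score(LinearSVC(dual=False), X_sel, y, cv=5).mean()

    # Right: selection repeated inside every training fold via a pipeline.
    model = make_pipeline(SelectKBest(f_classif, k=20), LinearSVC(dual=False))
    honest = cross_val_score(model, X, y, cv=5).mean()

    print(f"biased estimate: {biased:.2f}, honest estimate: {honest:.2f}")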

35 A method to produce realistic performance estimates: nested n-fold cross-validation
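A minimal sketch of nested cross-validation with scikit-learn: an inner loop tunes a hyper-parameter (here the SVM cost C, chosen as an example) and an outer loop estimates performance; the data are random stand-ins.

    # Sketch of nested cross-validation: inner CV tunes hyper-parameters,
    # outer CV gives a realistic performance estimate (random data for illustration).
    import numpy as np
    from sklearn.model_selection import GridSearchCV, cross_val_score
    from sklearn.svm import SVC

    rng = np.random.default_rng(6)
    X = rng.normal(size=(60, 1000))
    y = rng.integers(0, 2, size=60)

    inner = GridSearchCV(SVC(kernel="linear"), {"C": [0.01, 0.1, 1, 10]}, cv=5)
    outer_scores = cross_val_score(inner, X, y, cv=5)   # outer loop wraps the tuned model
    print("nested CV accuracy: %.2f +/- %.2f" % (outer_scores.mean(), outer_scores.std()))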

36 How well does supervised learning work in practice?

37 Datasets
–Bhattacharjee2 - Lung cancer vs normals [GE/DX]
–Bhattacharjee2_I - Lung cancer vs normals on common genes between Bhattacharjee2 and Beer [GE/DX]
–Bhattacharjee3 - Adenocarcinoma vs squamous [GE/DX]
–Bhattacharjee3_I - Adenocarcinoma vs squamous on common genes between Bhattacharjee3 and Su [GE/DX]
–Savage - Mediastinal large B-cell lymphoma vs diffuse large B-cell lymphoma [GE/DX]
–Rosenwald4 - 3-year lymphoma survival [GE/CO]
–Rosenwald5 - 5-year lymphoma survival [GE/CO]
–Rosenwald6 - 7-year lymphoma survival [GE/CO]
–Adam - Prostate cancer vs benign prostate hyperplasia and normals [MS/DX]
–Yeoh - Classification between 6 types of leukemia [GE/DX-MC]
–Conrads - Ovarian cancer vs normals [MS/DX]
–Beer_I - Lung cancer vs normals (common genes with Bhattacharjee2) [GE/DX]
–Su_I - Adenocarcinoma vs squamous (common genes with Bhattacharjee3) [GE/DX]
–Banez - Prostate cancer vs normals [MS/DX]

38 Methods: Gene Selection Algorithms
–ALL - No feature selection
–LARS - LARS
–HITON_PC - HITON_PC
–HITON_PC_W - HITON_PC + wrapping phase
–HITON_MB - HITON_MB
–HITON_MB_W - HITON_MB + wrapping phase
–GA_KNN - GA/KNN
–RFE - RFE with validation of feature subset with optimized polynomial kernel
–RFE_Guyon - RFE with validation of feature subset with linear kernel (as in Guyon)
–RFE_POLY - RFE (with polynomial kernel) with validation of feature subset with optimized polynomial kernel
–RFE_POLY_Guyon - RFE (with polynomial kernel) with validation of feature subset with linear kernel (as in Guyon)
–SIMCA - SIMCA (Soft Independent Modeling of Class Analogy): PCA-based method
–SIMCA_SVM - SIMCA (Soft Independent Modeling of Class Analogy): PCA-based method with validation of feature subset by SVM
–WFCCM_CCR - Weighted Flexible Compound Covariate Method (WFCCM) applied as in the Clinical Cancer Research paper by Yamagata (analysis of microarray data)
–WFCCM_Lancet - Weighted Flexible Compound Covariate Method (WFCCM) applied as in the Lancet paper by Yanagisawa (analysis of mass-spectrometry data)
–UAF_KW - Univariate with Kruskal-Wallis statistic
–UAF_BW - Univariate with ratio of between-group to within-group sum of squares
–UAF_S2N - Univariate with signal-to-noise statistic
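Of the methods listed above, SVM-based recursive feature elimination (RFE) is easy to sketch with scikit-learn; the kernel, step size, target gene count, and data below are illustrative assumptions rather than the exact settings used in the comparison.

    # Sketch of SVM-based recursive feature elimination (RFE) on illustrative data.
    import numpy as np
    from sklearn.feature_selection import RFE
    from sklearn.svm import LinearSVC
    from sklearn.pipeline import make_pipeline
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(7)
    X = rng.normal(size=(60, 2000))
    y = rng.integers(0, 2, size=60)

    # Repeatedly train a linear SVM, drop the lowest-weight genes, stop at 50 genes.
    selector = RFE(LinearSVC(dual=False), n_features_to_select=50, step=0.1)
    model = make_pipeline(selector, LinearSVC(dual=False))
    print("CV accuracy with 50 RFE-selected genes:",
          cross_val_score(model, X, y, cv=5).mean())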

39 Classification Performance (average over all tasks/datasets)

40 How well do dimensionality reduction and feature selection work in practice?

41 Number of Selected Features (average over all tasks/datasets)

42 Number of Selected Features (zoom on most powerful methods)

43 Number of Selected Features (average over all tasks/datasets)

