Identification of high-quality cancer prognostic markers and metastasis network modules. Edwin Wang Lab. Li et al., Nature Communications, 1:34, 2010

1 Identification of high-quality cancer prognostic markers and metastasis network modules. Edwin Wang Lab. Li et al., Nature Communications, 1:34, 2010

2 At least 60 years of research links the onset of cancer to errors during cell division.

3 How often do cells divide? We lose around 50 million (skin) cells every day (ASU.edu). Erythrocytes (red blood cells) are continuously produced in the red bone marrow of large bones, at a rate of about 2 million per second in a healthy adult (stackexchange/wikipedia). By the time you finish reading this sentence, 50 million of your cells will have died and been replaced by others (Science Museum). Between 50 and 70 billion cells die each day due to apoptosis in the average human adult. For an average child between the ages of 8 and 14, approximately 20 billion to 30 billion cells die a day (wikipedia). Either way, let’s just agree that at any given time, there are a lot of dead cells that need to be replaced by (stem) cell division.

4 What can go wrong? Cell division relies on duplication of DNA and chromosomes, which are error-prone processes. Even with error rates of “one in a million”, there would still be thousands of new, “defective” cells every day. The errors show up as genomic alterations such as rearrangements, chromosomal fragment amplifications and deletions. Tumor suppressor genes (e.g. P53, RB) monitor genomic stability and prevent cell cycle progression if errors are detected. However, if these gatekeepers (tumor suppressors) are themselves mutated, there are no checkpoints left. In the subsequent divisions, new alterations can accumulate without any brakes.
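As a rough back-of-the-envelope check of the “one in a million” remark, the sketch below multiplies an assumed error rate by the daily cell turnover quoted for adults on the previous slide (50 to 70 billion cells per day). Both the error rate and the idea that every replaced cell corresponds to exactly one division are simplifying assumptions made for illustration, not measured values.

```python
# Illustrative arithmetic only: assumes one division per replaced cell and
# a hypothetical error rate of one defective cell per million divisions.
ERRORS_PER_MILLION = 1


def defective_cells_per_day(divisions_per_day, errors_per_million=ERRORS_PER_MILLION):
    """Expected number of defective cells per day at the given error rate."""
    return divisions_per_day * errors_per_million // 1_000_000


low = defective_cells_per_day(50_000_000_000)   # 50 billion divisions per day
high = defective_cells_per_day(70_000_000_000)  # 70 billion divisions per day
print(low, high)
```

Under these assumptions the estimate is in the tens of thousands per day, so “thousands” on the slide is, if anything, conservative.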

5 Examples of chromosomal rearrangement. In gene expression data, these could look like under- or over-expression!

6 Frameshift: “The fat cat ate the wee rat.” becomes “The fat caa tet hew eer at.” In RNA-Seq data, this might look like “no change” … but the transcripts will never become viable.

7

8 Most mutations and rearrangements in cancer cells are random and irrelevant, but they can affect the perceived “expression” of a lot of genes. Tumor cells often have many more 'passenger signals' than other types of cells, which means that the variability of gene expression profiles between individual tumors can be extremely high, and the 'real' cancer gene expression signals may be buried in these highly varied profiles. Unsupervised clustering of gene expression is to some extent just the clustering of noise!

9 Identification of high-quality cancer prognostic markers and metastasis network modules. Edwin Wang Lab. Li et al., Nature Communications, 1:34, 2010

10 Motivation: why do we need prognostic markers? Patients are often overtreated because of a failure to identify low-risk cancer patients. Almost 60–75% of women with early-stage breast cancer undergo a toxic therapy from which they will not receive any benefit, but instead will experience only side effects. We need to improve the capacity to predict whether a patient's cancer is going to recur after surgical removal. To be clinically practicable, low-risk patients should be associated with 10-year overall survival probabilities of at least 88% and 92% for ER+ and ER− tumors, respectively. No existing algorithm reaches this level of accuracy.

11 Datasets

12 Parameters

13 MMP9?!

14 RDS and RGS Random (virtual) datasets Random gene sets

15 How robust are prognostic markers?

16 Stratification of patients into low-, intermediate- and high-risk groups in both the training set and in eight independent testing sets containing 1,375 samples.

17 MSS overview

18 RGS-GO

19 Accuracy ER+

20 Accuracy ER -

21 Kaplan Meier Curves ER+ samples

22 Kaplan Meier Curves ER- samples

23 Mutated and modulated (driver and signature) subnetworks

24 Conclusions In summary, we showed that the concept of 'one-step-clustering' of gene expression profiles, which has been dominantly used in the past decade, is not suitable for generating robust gene signatures.

25 We have developed a Multiple Survival Screening (MSS) algorithm for identifying high-quality cancer prognostic markers from the gene expression profiles of cancer samples. By applying the MSS algorithm to breast cancer samples, we identified several marker sets that showed ~90% prediction accuracy across 8 independent breast cancer cohorts. The algorithm could also be used to find other biomarkers, including drug response markers. We describe the protocol with some comments based on our experience in using the algorithm.

26 The MSS Protocol. The MSS algorithm includes 9 steps.
Step 1: A survival gene pool is generated by performing a genome-wide single-gene survival analysis in a given training dataset. Genes with a survival p-value below 0.05 are generally included. To get more robust markers, we suggest using several such training datasets to obtain several survival gene pools. These gene pools can be merged for the next step.
Step 2: Cancer-hallmark-related Gene Ontology (GO)-term-defined gene sets are generated by functional annotation of the survival genes from Step 1, using GO analysis tools such as DAVID Bioinformatics Resources (http://david.abcc.ncifcrf.gov/). Normally each GO-term-defined gene set contains 50-100 genes. If a GO-term-defined gene set contained few genes (e.g., fewer than 45), we usually discarded it. Alternatively, we combined the genes from a few GO-term-defined gene sets that each contained fewer than 50 genes, to reach a combined size between 50 and 100 genes. If a GO-term-defined gene set contained many genes (e.g., more than 100), we ranked the genes by running Steps 3-7 and took the top 60-80 genes for running the MSS.
Step 3: Random gene sets (RGSs) are generated. We generated 1 million distinct RGSs from each selected GO-term-defined gene set. Each RGS contains 30 genes. More RGSs can be generated if powerful computer clusters are available, which might make the biomarker more robust.
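Step 3 can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name, the seeding, and the use of Python's standard `random` module are all assumptions, and a real run with 1 million sets would likely need something more memory-efficient.

```python
import math
import random


def sample_random_gene_sets(gene_pool, n_sets, set_size=30, seed=0):
    """Draw n_sets distinct random gene sets of set_size genes each.

    gene_pool stands in for one GO-term-defined gene set (hypothetical input).
    """
    genes = sorted(set(gene_pool))
    if set_size > len(genes):
        raise ValueError("set_size exceeds the number of genes in the pool")
    if n_sets > math.comb(len(genes), set_size):
        raise ValueError("not enough distinct gene sets exist for this pool")
    rng = random.Random(seed)
    seen = set()      # frozensets already drawn, used to reject duplicates
    gene_sets = []    # results, kept in the order they were drawn
    while len(gene_sets) < n_sets:
        candidate = frozenset(rng.sample(genes, set_size))
        if candidate not in seen:
            seen.add(candidate)
            gene_sets.append(sorted(candidate))
    return gene_sets


# Example with a made-up pool of 60 gene names and far fewer than 1 million sets.
pool = [f"GENE{i}" for i in range(60)]
rgs = sample_random_gene_sets(pool, n_sets=100)
print(len(rgs), len(rgs[0]))
```

Storing each set as a frozenset while checking for duplicates makes the comparison independent of gene order, which matches the requirement that the RGSs be distinct.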

27 Step 4: Random datasets (RDSs) are generated from the training dataset. We normally generated 36 RDSs, but more can be generated when powerful computer clusters are available. It is critical to maintain the same ratio of “good” and “bad” tumors as in the original training set. Furthermore, each RDS should preferably contain at least 60 samples, and the RDSs should differ from one another in their samples as much as possible.
Steps 5-7: These steps perform survival screenings of the RGSs on the RDSs. Several parameters need to be set. In the MSS algorithm, we selected the “predictive RGSs” whose survival p-value is less than 0.05 in more than 90% of the RDSs (the RDS passing rate). However, we found that the p-value cutoff can also be set to 0.01 and the RDS passing rate anywhere between 75% and 95%. These parameters can be adjusted so that several thousand RGSs are selected. The selected RGSs are then used to obtain the top 30 most frequent genes (a potential gene signature). If more than 30% of the 1 million RGSs have p-values less than 0.05 in more than 35 RDSs, the experiment is discarded, because the results might come from overfitting. On the other hand, if only a few hundred RGSs have p-values less than 0.05 in less than 80% of the RDSs, the experiment is also discarded. In addition, it may be necessary to run more RGSs (4 or 8 million) to get enough selected RGSs. This depends on the datasets and the GO-term-defined gene set.
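The sketch below illustrates Step 4 and the selection rule of Steps 5-7 under stated assumptions. The survival test itself is not implemented: the p-values are assumed to be precomputed (for example by a log-rank test) and passed in as a matrix. Function names, default arguments, and the sample identifiers are hypothetical.

```python
import random


def make_random_datasets(good_ids, bad_ids, n_datasets=36, size=60, seed=0):
    """Draw random sample subsets that keep the original good:bad tumor ratio."""
    rng = random.Random(seed)
    total = len(good_ids) + len(bad_ids)
    n_good = round(size * len(good_ids) / total)  # "good" samples per dataset
    n_bad = size - n_good                         # remainder are "bad" samples
    return [
        rng.sample(good_ids, n_good) + rng.sample(bad_ids, n_bad)
        for _ in range(n_datasets)
    ]


def select_predictive_sets(p_values, p_cutoff=0.05, passing_rate=0.90):
    """Return indices of gene sets significant in more than passing_rate of RDSs.

    p_values[i][j] is the (precomputed) survival p-value of gene set i
    on random dataset j.
    """
    selected = []
    for i, row in enumerate(p_values):
        passed = sum(p < p_cutoff for p in row)
        if passed / len(row) > passing_rate:
            selected.append(i)
    return selected


# Toy example: 80 "good" and 40 "bad" tumors, so each RDS has a 2:1 ratio.
good = list(range(80))
bad = list(range(100, 140))
rds = make_random_datasets(good, bad)

# Three made-up gene sets screened on 36 RDSs.
toy_p = [[0.01] * 36, [0.01] * 32 + [0.2] * 4, [0.01] * 33 + [0.2] * 3]
print(select_predictive_sets(toy_p))
```

In the toy matrix, the second gene set passes in 32 of 36 RDSs (about 89%) and is rejected, while the third passes in 33 of 36 (about 92%) and is kept. The p-value cutoff and passing rate are exposed as parameters because the protocol describes tuning them.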

28 Step 8: A potential gene signature containing the 30 top-ranked most frequent genes is obtained from the selected RGSs. In practice, the size of a potential gene signature can range between 20 and 30 genes, depending on the contents of the GO-term-defined gene set and the training datasets.
Step 9: This step assesses the reproducibility and stability of a potential gene signature. After running 2 distinct sets of 1 million RGSs derived from one GO-term-defined gene set on the same RDS set, we examined how many genes the two top-ranked 30-gene lists have in common. The number of common genes can range from 20 to 30 under different experimental conditions. Furthermore, the top-ranked 30 genes may differ when different RDSs, RGSs, parameters and training datasets are used. However, the performance of the selected gene signatures should be robust in other independent testing datasets; for example, they often have a survival p-value below 0.05 in the testing datasets if the MSS protocol has been followed.
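Steps 8 and 9 amount to a frequency count and a set intersection, which could be sketched as below. The function names are hypothetical, and breaking ties alphabetically is an assumption added only to make the output deterministic; the protocol does not say how ties are handled.

```python
from collections import Counter


def top_frequent_genes(selected_gene_sets, top_n=30):
    """Rank genes by how often they occur in the selected RGSs; keep the top_n."""
    counts = Counter(gene for gene_set in selected_gene_sets for gene in gene_set)
    # Sort by descending count, then by name (assumed tie-break rule).
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gene for gene, _ in ranked[:top_n]]


def signature_overlap(signature_a, signature_b):
    """Number of genes shared by two candidate signatures (Step 9 check)."""
    return len(set(signature_a) & set(signature_b))


# Toy example with made-up gene names and a signature size of 2.
run_1 = [["A", "B", "C"], ["A", "B", "D"], ["A", "E", "F"]]
run_2 = [["A", "C", "D"], ["A", "C", "E"], ["B", "C", "F"]]
sig_1 = top_frequent_genes(run_1, top_n=2)
sig_2 = top_frequent_genes(run_2, top_n=2)
print(sig_1, sig_2, signature_overlap(sig_1, sig_2))
```

With real data, `top_n` would be 30 and the overlap between two independent runs would be compared against the 20 to 30 range mentioned in Step 9.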

29 Kaplan-Meier: For each time interval, survival probability is calculated as the number of subjects surviving divided by the number of subjects at risk. Subjects who have died, dropped out, or moved away are no longer counted as “at risk”; subjects who are lost to follow-up are considered “censored” and are not counted in the denominator. The total probability of survival up to a given time interval is calculated by multiplying the survival probabilities of all time intervals up to that time (applying the multiplication law of probability to obtain the cumulative probability). For example, the probability of a patient surviving two days after a kidney transplant is the probability of surviving the first day multiplied by the probability of surviving the second day given that the patient survived the first day. This second probability is called a conditional probability. Although the probability calculated for any single interval is not very accurate because of the small number of events, the overall probability of surviving to each point is more accurate.
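The product rule described above can be written as a short function. This is a minimal sketch of the standard Kaplan-Meier estimator, assuming the usual convention that a subject censored at time t is still counted as at risk at t; the function name and the toy data are made up for illustration.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival estimate.

    times:  follow-up time for each subject
    events: True if the subject died at that time, False if censored
    Returns a list of (time, survival probability) at each time with a death.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all subjects whose follow-up ends at the same time t.
        while i < len(data) and data[i][0] == t:
            if data[i][1]:
                deaths += 1
            removed += 1
            i += 1
        if deaths:
            # Conditional probability of surviving t, given survival up to t.
            survival *= (at_risk - deaths) / at_risk
            curve.append((t, survival))
        # Both deaths and censored subjects leave the risk set after t.
        at_risk -= removed
    return curve


# Toy data: five subjects, two of them censored (at times 2 and 4).
curve = kaplan_meier([1, 2, 2, 3, 4], [True, True, False, True, False])
for t, s in curve:
    print(t, round(s, 3))
```

In the toy data the survival estimate drops to 4/5 at time 1, then is multiplied by 3/4 at time 2 and by 1/2 at time 3; the censored subjects reduce the risk set without causing a drop.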

