Shibing Deng Pfizer, Inc. Efficient Outlier Identification in Lung Cancer Study
Outline: Background and motivation. COPA statistics: existing methods and a new method. Comparison of COPA statistics. Application to lung cancer data.
What Are COPA Statistics? COPA = Cancer Outlier Profile Analysis. These are statistics designed to identify outliers in cancer gene expression profiles. [Figure: outliers]
Motivation: Differential gene expression (DGE) is widely used to identify over- or under-expressed cancer genes. It assumes two distinct populations: tumor and normal. However, cancer is not a homogeneous disease: it is genetically diverse, and oncogenes have heterogeneous activation patterns. DGE may occur in only a subset of samples. COPA identifies DGE in a subset of cancer patients.
Example of Cancer Heterogeneity: Molecular Subsets of Lung Adenocarcinoma. Source: Pao W, Hutchinson K. 2012 Mar 6;18(3):349-51
COPA Methods in the Literature: Original COPA method (Tomlins et al 2005); Outlier Sum, OS (Tibshirani and Hastie 2007); Outlier Robust T, ORT (Wu 2007); Likelihood Ratio Statistic, LRS (Hu 2008).
Notation: $n_1$ = number of normal samples; $n_2$ = number of tumor samples; $n = n_1 + n_2$ is the total number of samples; $X_{ij}$ is the expression value for sample i and gene j. For gene j (for simplicity the index j is not shown below): $x_1, x_2, x_3, \ldots, x_{n_1}$ are the normal samples ($n_1$) and $x_{n_1+1}, x_{n_1+2}, \ldots, x_i, \ldots, x_n$ are the tumor samples ($n_2$).
The Original COPA Method: Tomlins et al (2005) proposed the original COPA method. Standardize each gene using the median and the median absolute deviation (MAD), both computed over all samples: $x'_i = (x_i - \mathrm{med}) / \mathrm{MAD}$. Define the COPA statistic as the rth percentile (r = 75, 90, 95) of the standardized tumor samples. Limitations: 1) r is fixed. With r = 90, the statistic can only detect outliers whose expression levels are greater than those of 90% of the tumor samples, and it is not efficient in differentiating the number of outliers. 2) MAD is calculated over all samples, so the outliers themselves can affect the estimate of MAD.
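The standardization and percentile step described above can be sketched as follows. This is a minimal illustration, not the published implementation: the function name `copa_statistic` is hypothetical, and the raw MAD (without the 1.4826 consistency factor) is an assumption.

```python
import numpy as np

def copa_statistic(x, is_tumor, r=90):
    """r-th percentile of the tumor values after median/MAD standardization.

    Median and MAD are computed over ALL samples, as in the original COPA
    method. The raw MAD (no 1.4826 factor) is an assumption of this sketch.
    """
    x = np.asarray(x, dtype=float)
    is_tumor = np.asarray(is_tumor, dtype=bool)
    med = np.median(x)
    mad = np.median(np.abs(x - med))
    z = (x - med) / mad
    return np.percentile(z[is_tumor], r)

# Toy data: 20 normal samples, 20 tumor samples, 4 tumors over-expressed.
rng = np.random.default_rng(0)
normal = rng.normal(0, 1, 20)
tumor = rng.normal(0, 1, 20)
tumor[:4] += 5.0
labels = np.r_[np.zeros(20, dtype=bool), np.ones(20, dtype=bool)]
print(copa_statistic(np.r_[normal, tumor], labels, r=90))
```

Because r is fixed, the value depends only on one order statistic of the tumor group, which is the first limitation listed on the slide.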
Outlier Sum (OS): Standardize each gene by median centering and by scaling on the MAD based on the normal samples (MAD1). Outliers are defined as standardized values above Q3 + IQR. Define the OS statistic as the sum of the standardized tumor values that are outliers: $OS = \sum_{i \in \text{tumor}} x'_i \, I(x'_i > Q_3 + \mathrm{IQR})$. Improvements over COPA: 1. Outliers are defined from the data distribution (not a fixed percentile). 2. The number of outliers is taken into account. 3. Better scaling factor: MAD1.
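A minimal sketch of the outlier sum as described on this slide. Two details are assumptions, since the slide does not spell them out: centering uses the median of all samples, and Q3 and IQR are taken over all standardized values. The function name `outlier_sum` is hypothetical.

```python
import numpy as np

def outlier_sum(x, is_tumor):
    """Sum of standardized tumor values lying above Q3 + IQR."""
    x = np.asarray(x, dtype=float)
    is_tumor = np.asarray(is_tumor, dtype=bool)
    normal = x[~is_tumor]
    med = np.median(x)                                     # assumption: overall median
    mad1 = np.median(np.abs(normal - np.median(normal)))   # MAD of normal samples
    z = (x - med) / mad1
    q1, q3 = np.percentile(z, [25, 75])                    # assumption: over all samples
    threshold = q3 + (q3 - q1)
    z_tumor = z[is_tumor]
    return z_tumor[z_tumor > threshold].sum()

labels = np.r_[np.zeros(5, dtype=bool), np.ones(5, dtype=bool)]
x = np.array([-2, -1, 0, 1, 2, -2, -1, 0, 1, 20], dtype=float)
print(outlier_sum(x, labels))
```

Unlike the fixed-percentile COPA statistic, the sum grows with both the number of outliers and their size.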
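Outlier Robust T (ORT): Similar to OS, but with different centering (the normal group median, $\mathrm{med}_1$) and scaling (the pooled MAD). Define ORT as $ORT = \sum_{i \in \text{tumor}} \frac{x_i - \mathrm{med}_1}{\mathrm{MAD}_{\text{pooled}}} \, I(x_i > Q_3 + \mathrm{IQR})$, where the outlier threshold $Q_3 + \mathrm{IQR}$ is based on the normal group data only.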
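The ORT description can be sketched in the same way. This sketch assumes the pooled MAD is the median of the absolute deviations of each sample from its own group median, and that the threshold is Q3 + IQR of the normal samples; the function name `outlier_robust_t` is hypothetical.

```python
import numpy as np

def outlier_robust_t(normal, tumor):
    """Sum of (x - normal median) / pooled MAD over tumor outliers."""
    normal = np.asarray(normal, dtype=float)
    tumor = np.asarray(tumor, dtype=float)
    med1, med2 = np.median(normal), np.median(tumor)
    # Pooled MAD: deviations are taken from each group's own median.
    pooled_mad = np.median(np.r_[np.abs(normal - med1), np.abs(tumor - med2)])
    q1, q3 = np.percentile(normal, [25, 75])   # threshold from normal group only
    threshold = q3 + (q3 - q1)
    outliers = tumor[tumor > threshold]
    return ((outliers - med1) / pooled_mad).sum()

print(outlier_robust_t([-2, -1, 0, 1, 2], [-2, -1, 0, 1, 20]))
```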
Likelihood Ratio Statistic (LRS): Outlier detection is treated as a change-point problem. Group the normal and tumor samples separately and sort them within each group in ascending order: $x_{(1)}, x_{(2)}, x_{(3)}, \ldots, x_{(n_1)}$ (normal samples, $n_1$) followed by $x_{(n_1+1)}, x_{(n_1+2)}, \ldots, x_{(i)}, \ldots, x_{(n)}$ (tumor samples, $n_2$). Separate all the samples into two groups at the k-th sorted sample, $k = n_1+1, n_1+2, \ldots, n-1$, and form a two-sample t statistic $t_k$ between the two groups. Define $LRS = \max_k t_k$, where the standard deviation used in $t_k$ is the sample standard deviation.
Comments on LRS: It is a maximum t statistic. It does not provide an explicit definition of outliers: every gene yields a max(t), so a significance measure (p value) is needed to define outliers.
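The change-point search can be sketched as a loop over split points. The slide's exact t statistic is not preserved in the transcript, so this sketch assumes the ordinary pooled-variance two-sample t; the function name `max_split_t` is hypothetical.

```python
import numpy as np

def max_split_t(normal, tumor):
    """Maximum two-sample t over split points k = n1+1, ..., n-1.

    Samples are sorted within each group and concatenated (normal first).
    Returns (max t, k at which it is attained); samples after position k
    form the upper group.
    """
    x = np.r_[np.sort(np.asarray(normal, dtype=float)),
              np.sort(np.asarray(tumor, dtype=float))]
    n1, n = len(normal), x.size
    best_t, best_k = -np.inf, None
    for k in range(n1 + 1, n):
        lower, upper = x[:k], x[k:]
        ss = ((lower - lower.mean()) ** 2).sum() + ((upper - upper.mean()) ** 2).sum()
        s = np.sqrt(ss / (n - 2))              # pooled sample standard deviation
        t = (upper.mean() - lower.mean()) / (s * np.sqrt(1.0 / k + 1.0 / (n - k)))
        if t > best_t:
            best_t, best_k = t, k
    return best_t, best_k

t_max, k_max = max_split_t([-2, -1, 0, 1, 2], [-2, -1, 0, 19, 21])
print(t_max, k_max)
```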
A New Method: Maximum Square Difference (MSD). Similar to LRS, but instead of a t statistic at each split point we use a squared difference, and define MSD as its maximum over the split points k. It is more sensitive when the number of outliers is small.
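The exact MSD formula is not preserved in this transcript. Purely as an illustration of the idea, the sketch below uses one possible reading: the maximum over split points of the squared difference between the two group means. That form, and the name `msd_sketch`, are assumptions and should not be taken as the author's definition.

```python
import numpy as np

def msd_sketch(normal, tumor):
    """ASSUMED form: max over k of (mean of upper group - mean of lower group)^2.

    Uses the same sorting and split points as the LRS construction.
    Returns (value, k at which the maximum is attained).
    """
    x = np.r_[np.sort(np.asarray(normal, dtype=float)),
              np.sort(np.asarray(tumor, dtype=float))]
    n1, n = len(normal), x.size
    best, best_k = -np.inf, None
    for k in range(n1 + 1, n):
        d2 = (x[k:].mean() - x[:k].mean()) ** 2
        if d2 > best:
            best, best_k = d2, k
    return best, best_k

print(msd_sketch([-2, -1, 0, 1, 2], [-2, -1, 0, 19, 21]))
```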
Comparison of the Methods: ROC. The methods were compared by simulation using ROC curves. With n1=n2=20, we simulate 8000 null genes from the standard normal distribution. We also simulate 2000 up-regulated genes in which k = 2, 5, 10 or 15 of the 20 tumor samples are drawn from N(2,1). The detection threshold for a given false positive rate (FPR) is defined from the percentiles of the COPA statistic among the null genes. The true positive rate (TPR) is TPR = Prob(copa >= threshold | up-regulated genes).
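A minimal sketch of this simulation design, shown for k = 5. To stay self-contained it scores genes with a row-wise Welch t statistic as a stand-in; any of the COPA statistics could be substituted. The function names and the seed are illustrative.

```python
import numpy as np

def welch_t(data, n1):
    """Row-wise two-sample t (tumor minus normal); first n1 columns are normal."""
    a, b = data[:, :n1], data[:, n1:]
    se = np.sqrt(a.var(axis=1, ddof=1) / a.shape[1] + b.var(axis=1, ddof=1) / b.shape[1])
    return (b.mean(axis=1) - a.mean(axis=1)) / se

def roc_points(null_stats, alt_stats, fprs):
    """TPR at each target FPR, with thresholds taken from null-gene percentiles."""
    thresholds = np.quantile(null_stats, 1.0 - np.asarray(fprs, dtype=float))
    return np.array([(alt_stats >= c).mean() for c in thresholds])

rng = np.random.default_rng(1)
n1 = n2 = 20
k = 5
null = rng.normal(0, 1, size=(8000, n1 + n2))          # null genes
alt = rng.normal(0, 1, size=(2000, n1 + n2))           # up-regulated genes
alt[:, n1:n1 + k] = rng.normal(2, 1, size=(2000, k))   # k tumor samples from N(2,1)

fprs = [0.01, 0.05, 0.10]
tprs = roc_points(welch_t(null, n1), welch_t(alt, n1), fprs)
for f, t in zip(fprs, tprs):
    print(f"FPR={f:.2f}  TPR={t:.3f}")
```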
ROC (n1=n2=20, k=2 and 5)
Comparison of the Methods: FDR. The methods can also be compared on false discovery rate (FDR). Simulate n1 = n2 = 20 or 50 samples with 10000 genes, of which 2000 are up-regulated in k tumor samples. For each detection threshold c of the COPA statistic, FDR is the proportion of false positives among all claimed positives: $\mathrm{FDR}(c) = \#\{\text{null genes}: \text{copa} \ge c\} \, / \, \#\{\text{all genes}: \text{copa} \ge c\}$. A plot of FDR against the positive rate is then created.
Comparison of the Methods: Summary. Our new MSD method performs best when a small percentage (≤ 20%) of tumor samples are differentially expressed (DE), that is, are outliers. For a moderate number of DE samples (20-50%), LRS performs better in ROC. For a large number of DE samples (>50% of tumors), the t statistic becomes more efficient. When a relatively large number (>30%) of DE samples exist, MSD, LRS, ORT and T have comparable FDR.
Assess Significance: The distributions of the COPA statistics are not known, and an analytic solution was not readily available. A permutation test does not generate the correct null distribution. Simulation: simulate the COPA statistics under the null and derive the null distribution from a relatively large number of simulations, say 10000.
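The FDR definition above translates directly into code when the simulation truth is known. A small sketch; the function name `fdr_curve` and the toy values are illustrative.

```python
import numpy as np

def fdr_curve(stats, is_null, thresholds):
    """Return (positive rate, FDR) pairs for each detection threshold c."""
    stats = np.asarray(stats, dtype=float)
    is_null = np.asarray(is_null, dtype=bool)
    points = []
    for c in thresholds:
        called = stats >= c                   # all claimed positives
        n_pos = called.sum()
        n_fp = (called & is_null).sum()       # false positives: null genes called
        fdr = n_fp / n_pos if n_pos > 0 else np.nan
        points.append((called.mean(), fdr))
    return points

stats = [1, 2, 3, 4, 5, 6]
is_null = [True, True, True, False, True, False]
print(fdr_curve(stats, is_null, thresholds=[4]))
```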
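A simulated null of this kind can be sketched as below. The statistic `max_shift` is only a placeholder standing in for a COPA statistic, and the add-one p-value estimator is a common convention assumed here rather than something stated on the slide.

```python
import numpy as np

def simulate_null(stat_fn, n1, n2, n_sim=10000, seed=0):
    """Draw n_sim null data sets from N(0,1) and evaluate stat_fn on each."""
    rng = np.random.default_rng(seed)
    return np.array([stat_fn(rng.normal(0, 1, n1), rng.normal(0, 1, n2))
                     for _ in range(n_sim)])

def empirical_pvalue(observed, null_stats):
    """Upper-tail p-value from the simulated null (add-one estimator)."""
    null_stats = np.asarray(null_stats, dtype=float)
    return (1 + np.sum(null_stats >= observed)) / (1 + null_stats.size)

def max_shift(normal, tumor):
    # Placeholder statistic (hypothetical): largest tumor value minus normal median.
    return tumor.max() - np.median(normal)

null = simulate_null(max_shift, n1=20, n2=20, n_sim=2000)
print(empirical_pvalue(6.0, null))
```

The null distribution depends on n1 and n2, so it has to be re-simulated for each sample-size configuration.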
Distribution of COPA statistics Simulated null for n1=n2=20.
MSD Distribution: Simulated under the null with 10000 genes, n1=n2=20, data from N(0,1). The figures display the pdf of both MSD and y = sqrt(MSD). The fitted dashed line is a noncentral chi-square density function for MSD and a normal density for y.
MSD Distribution: Parameters. Both $\mu$ and $\sigma$ are functions of n1 and n2, as well as of the underlying gene expression distribution. If gene expression is assumed to follow a N(0,1) distribution, the MSD parameters are $\mu(n_1, n_2)$ and $\sigma^2(n_1, n_2)$. The plots show that $\mu$ is driven by n2 and that $\sigma$ is driven by the n2/n1 ratio.
Outlier Identification: COPA, OS and ORT define outlier samples as part of the method. MSD and LRS do not provide an explicit definition of outliers. The following procedure can be used for MSD (or LRS) outlier identification: 1) Calculate MSD for all genes. 2) Estimate the p value of MSD from the simulated null. 3) Calculate FDR with the Benjamini-Hochberg method. 4) Define outliers as the samples above the max(MSD) sample index, for genes with FDR<0.05.
Application: Lung Cancer Data. One of the drivers in NSCLC is the EML4-ALK fusion (Soda et al 2007). ALK fusion was associated with high ALK gene expression (Zhang et al 2010). The prevalence of ALK fusion in NSCLC is about 5%. Xalkori® is a highly effective ALK inhibitor for treating NSCLC patients with ALK fusion.
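Step 3 of the procedure above, the Benjamini-Hochberg adjustment, can be sketched with NumPy alone. The function name and the toy p-values are illustrative; in practice the p-values would come from the simulated null of step 2.

```python
import numpy as np

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)
    ranked = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, working down from the largest p-value.
    adjusted_sorted = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.minimum(adjusted_sorted, 1.0)
    return adjusted

pvals = np.array([0.01, 0.04, 0.03, 0.5])
fdr = benjamini_hochberg(pvals)
significant = fdr < 0.05       # genes whose upper-group samples are called outliers
print(fdr, significant)
```

For a gene that passes the cutoff, the outlier samples are those sorted above the split index at which the maximum was attained.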
NSCLC Expression Data The Cancer Genome Atlas (TCGA) has expression data generated from 57 normal lung samples and 355 lung adenocarcinoma samples. Expression data were obtained using RNAseq.
ALK Gene Expression No significant difference using t-test
ALK Outlier Analysis: The LRS method failed to find any outliers; MSD identified 16 outliers (4.5% of the tumor samples).
ALK Gene Fusion: The ALK gene has 29 exons. The fusion break point lies between exons 19 and 20 (E19 and E20). [Figure: the normal ALK transcript, drawn with exons 16-23 around the junction, compared with the EML4-ALK fusion transcript, in which EML4 or another partner is joined at the junction to exons 20-23; labels: upstream of ALK, downstream of ALK, Junction.]
RNAseq ALK Exon Expression: RNAseq provides a way to measure exon-level expression. Exons 20-29 showed high expression while exons 1-19 had very low expression, an indication of a fusion event.
Fusion Samples: Among the 16 outlier samples, 7 showed fusion characteristics in exon expression.
Fusion Samples vs. Outlier Samples: Of all 355 tumor samples, 8 showed fusion characteristics in exon expression (marked by “+”). They are all among the top 20 samples in ALK mRNA expression.
Summary We proposed a new cancer outlier analysis method MSD and compared it to existing methods. MSD was shown to be more sensitive in detecting outliers when the prevalence of outliers was small (<20%).
References
Tomlins SA, Rhodes DR, Perner S, Dhanasekaran SM, Mehra R, Sun XW, Varambally S, Cao X, Tchinda J, Kuefer R, Lee C, Montie JE, Shah RB, Pienta KJ, Rubin MA, Chinnaiyan AM. (2005). Recurrent fusion of TMPRSS2 and ETS transcription factor genes in prostate cancer. Science 310(5748):644-8.
Tibshirani R, Hastie T. (2007). Outlier sums for differential gene expression analysis. Biostatistics 8:2-8.
Wu B. (2007). Cancer outlier differential gene expression detection. Biostatistics 8:566-75.
Hu J. (2008). Cancer outlier detection based on likelihood ratio test. Bioinformatics 24(19):2193-2199.
Soda M, Choi YL, Enomoto M, Takada S, Yamashita Y, Ishikawa S, Fujiwara S, Watanabe H, Kurashina K, Hatanaka H, et al. (2007). Identification of the transforming EML4-ALK fusion gene in non-small-cell lung cancer. Nature 448:561-566.
Zhang X, Zhang S, Yang X, Yang J, Zhou Q, et al. (2010). Fusion of EML4 and ALK is associated with development of lung adenocarcinomas lacking EGFR and KRAS mutations and is correlated with ALK expression. Mol Cancer 9:188.
Acknowledgements Fred Immermann Pfizer Oncology Research Unit at La Jolla, CA Computational Biology Asia Omics Project Team