Shibing Deng Pfizer, Inc. Efficient Outlier Identification in Lung Cancer Study.

Presentation transcript:


Outline: Background and motivation. COPA statistics: existing methods and a new method. Comparison of COPA statistics. Application to lung cancer data.

What Are COPA Statistics? COPA = Cancer Outlier Profile Analysis. These are statistics designed to identify outliers in cancer gene expression profiles.

Motivation: Differential gene expression (DGE) analysis is widely used to identify over- or under-expressed cancer genes. It assumes two distinct populations: tumor and normal. However, cancer is not a homogeneous disease: it is genetically diverse, and oncogenes have heterogeneous activation patterns. DGE may happen only in a subset of samples. COPA identifies DGE in a subset of cancer patients.

Example of Cancer Heterogeneity: Molecular Subsets of Lung Adenocarcinoma (Pao W, Hutchinson K. Mar 6;18(3):349-51).

COPA Methods in Literature: Original COPA method (Tomlins et al 2005); Outlier Sum (OS) (Tibshirani and Hastie 2007); Outlier Robust T (ORT) (Wu 2007); Likelihood Ratio Statistic (LRS) (Hu 2008).

Notation: n1 = number of normal samples; n2 = number of tumor samples; n = n1 + n2 is the total number of samples; x_ij is the expression value for sample i and gene j. For gene j (for simplicity, the index j is not shown below): x_1, x_2, x_3, ..., x_n1 are the normal samples (n1), and x_(n1+1), x_(n1+2), ..., x_i, ..., x_n are the tumor samples (n2).

The Original COPA Method: Tomlins et al (2005) proposed the original COPA method. Standardize each gene based on its median and MAD. Define the COPA statistic as the rth percentile (r = 75, 90, 95) of the standardized tumor samples. Limitations: 1) r is fixed. With r = 90, the method can only detect outliers whose expression levels are greater than those of 90% of the tumor samples, and it is not efficient in differentiating the number of outliers. 2) MAD is calculated over all samples, so outliers can affect the estimate of MAD.
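
A minimal Python sketch of the percentile statistic described on this slide, for a single gene. The function name `copa_statistic` is hypothetical, and two details are assumptions rather than something the slide states: the median is taken over all samples (the slide only says this of the MAD), and the MAD is used without a consistency constant.

```python
import numpy as np

def copa_statistic(normal, tumor, r=90):
    """r-th percentile of the tumor values after median/MAD standardization."""
    normal = np.asarray(normal, dtype=float)
    tumor = np.asarray(tumor, dtype=float)
    pooled = np.concatenate([normal, tumor])
    med = np.median(pooled)                # assumption: median over all samples
    mad = np.median(np.abs(pooled - med))  # MAD over all samples, no scaling constant
    z_tumor = (tumor - med) / mad          # assumes mad > 0
    return float(np.percentile(z_tumor, r))

# One gene: four unremarkable tumor samples and a single strong outlier.
print(copa_statistic([0, 1, 2, 3, 4], [2, 2, 2, 2, 12], r=90))
print(copa_statistic([0, 1, 2, 3, 4], [2, 2, 2, 2, 12], r=50))
```

With r = 50 the single outlier is invisible to the statistic, which illustrates the fixed-r limitation noted on the slide.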

Outlier Sum (OS): Standardize each gene by median centering and by scaling on the MAD of the normal samples. Define the OS statistic as the sum of the standardized data from the outliers, where outliers are defined as data above Q3 + IQR. Improvements over COPA: 1. Outliers are defined based on the data distribution (not fixed). 2. The statistic takes account of the number of outliers. 3. It uses a better scaling factor, MAD1.
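
The OS recipe on this slide can be sketched in Python as follows. The function name `outlier_sum` is hypothetical. The sketch assumes that the centering median and the quartiles Q3 and IQR are computed over all samples, and that only tumor samples contribute to the sum; the slide does not spell out these choices.

```python
import numpy as np

def outlier_sum(normal, tumor):
    """Sum of standardized tumor values lying above Q3 + IQR."""
    normal = np.asarray(normal, dtype=float)
    tumor = np.asarray(tumor, dtype=float)
    pooled = np.concatenate([normal, tumor])
    med = np.median(pooled)                            # assumption: center on the overall median
    mad1 = np.median(np.abs(normal - np.median(normal)))  # MAD of the normal samples only
    z_all = (pooled - med) / mad1                      # assumes mad1 > 0
    q1, q3 = np.percentile(z_all, [25, 75])            # assumption: quartiles over all samples
    cutoff = q3 + (q3 - q1)                            # Q3 + IQR
    z_tumor = z_all[normal.size:]
    return float(z_tumor[z_tumor > cutoff].sum())

print(outlier_sum([0, 1, 2, 3, 4], [2, 2, 2, 2, 12]))  # one tumor outlier contributes
print(outlier_sum([0, 1, 2, 3, 4], [2, 2, 2, 2, 2]))   # no tumor value exceeds the cutoff
```

Because the sum runs over every value above the cutoff, the statistic grows with the number of outliers, unlike a fixed percentile.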

Outlier Robust T (ORT): Similar to OS, but with different centering (normal group median) and scaling (pooled MAD) factors. The outlier threshold is based on normal group data only.

Likelihood Ratio Statistic (LRS): Outlier detection is treated as a change-point problem. Group the normal and tumor samples separately and sort them within each group in ascending order: x_(1), x_(2), x_(3), ..., x_(n1) are the normal samples (n1), and x_(n1+1), x_(n1+2), ..., x_(i), ..., x_(n) are the tumor samples (n2). Separate all the samples into two groups at the k-th sample, k = n1+1, n1+2, ..., n-1, and form a two-sample t statistic for each split, scaled by the sample standard deviation.
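
The change-point search can be sketched in Python as below. This is an illustration under assumptions, not the published LRS formula: the name `max_split_t` is hypothetical, the t statistic uses a pooled standard deviation with n - 2 degrees of freedom, and the statistic is taken as the maximum over all splits.

```python
import numpy as np

def max_split_t(normal, tumor):
    """Largest two-sample t statistic over change points within the sorted tumor samples."""
    normal = np.asarray(normal, dtype=float)
    tumor = np.sort(np.asarray(tumor, dtype=float))  # ascending order within the tumor group
    x = np.concatenate([normal, tumor])
    n1, n = normal.size, x.size
    best = -np.inf
    for k in range(n1 + 1, n):            # k = n1+1, ..., n-1
        low, high = x[:k], x[k:]          # normals plus lower tumors vs. top tumors
        ss = ((low - low.mean()) ** 2).sum() + ((high - high.mean()) ** 2).sum()
        s = np.sqrt(ss / (n - 2))         # assumption: pooled standard deviation
        t = (high.mean() - low.mean()) / (s * np.sqrt(1 / low.size + 1 / high.size))
        best = max(best, t)
    return float(best)

print(max_split_t([0, 1, 2, 3, 4], [0, 1, 2, 3, 4]))    # no outliers
print(max_split_t([0, 1, 2, 3, 4], [0, 1, 2, 13, 14]))  # two tumor outliers
```

Sorting the normal samples does not change the result here, because every split keeps all normal samples in the lower group.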

Comments on LRS: It is a maximum t statistic. It does not provide an explicit definition of outliers, because every gene provides a max(t). A significance measure (p value) is needed to define outliers.

A New Method – Maximum Square Difference (MSD): Similar to LRS, but instead of a t statistic it uses a squared difference. It is more sensitive when the number of outliers is small.

Comparison of the Methods - ROC: The methods were compared by simulation using ROC curves. With n1 = n2 = 20, we simulate 8000 null genes from the standard normal distribution. We also simulate 2000 up-regulated genes in which k of the 20 tumor samples (k = 2, 5, 10 and 15) are drawn from N(2,1). Based on the percentiles of the COPA statistic from the null genes, we define the detection threshold for a given false positive rate (FPR). The true positive rate (TPR) is TPR = Prob(copa >= threshold | up-regulated genes).
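
The simulation design on this slide can be sketched in Python as follows. All function names are hypothetical, and the percentile-based COPA statistic (median and MAD over all samples, r = 90) is used only as a stand-in; any of the statistics could be plugged in. The seed is arbitrary.

```python
import numpy as np

def simulate_genes(n_genes, n1, n2, k, shift, rng):
    """Rows are genes; the first k tumor samples of each gene are shifted by `shift`."""
    normal = rng.standard_normal((n_genes, n1))
    tumor = rng.standard_normal((n_genes, n2))
    tumor[:, :k] += shift                  # N(shift, 1) for the up-regulated samples
    return normal, tumor

def copa_stat(normal, tumor, r=90):
    """Stand-in statistic: r-th percentile of median/MAD-standardized tumor values."""
    pooled = np.concatenate([normal, tumor], axis=1)
    med = np.median(pooled, axis=1, keepdims=True)
    mad = np.median(np.abs(pooled - med), axis=1, keepdims=True)
    return np.percentile((tumor - med) / mad, r, axis=1)

def tpr_at_fpr(null_stats, alt_stats, fpr):
    """Threshold at the (1 - fpr) quantile of the null statistics, then score the alternatives."""
    threshold = np.quantile(null_stats, 1 - fpr)
    return float(np.mean(alt_stats >= threshold))

rng = np.random.default_rng(0)
null_stats = copa_stat(*simulate_genes(8000, 20, 20, k=0, shift=0.0, rng=rng))
for k in (2, 5, 10, 15):
    alt_stats = copa_stat(*simulate_genes(2000, 20, 20, k=k, shift=2.0, rng=rng))
    print(k, tpr_at_fpr(null_stats, alt_stats, fpr=0.05))
```

Sweeping `fpr` over a grid and plotting the resulting TPR values gives one ROC curve per statistic and per k.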

ROC (n1=n2=20, k=2 and 5)

ROC (n1=n2=50,k=5,10)

Comparison of the Methods - FDR: The methods can also be compared based on the false discovery rate (FDR). Simulate n1 = n2 = 20 or 50 samples for a set of genes, among which 2000 are up-regulated in k tumor samples. For each detection threshold c of the COPA statistic, FDR is the proportion of false positives among all positives: FDR = (number of false positives) / (number of all claimed positives) = sum(copa >= c | null genes) / sum(copa >= c | all genes). A plot of FDR versus the positive rate is created.
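
The FDR formula on this slide translates directly into a few lines of Python. The function name `empirical_fdr` is hypothetical, and returning 0 when nothing is declared positive is a convention chosen here, not something the slide specifies.

```python
import numpy as np

def empirical_fdr(null_stats, alt_stats, c):
    """Return (FDR, fraction of genes declared positive) at threshold c."""
    null_stats = np.asarray(null_stats, dtype=float)
    alt_stats = np.asarray(alt_stats, dtype=float)
    false_pos = int(np.sum(null_stats >= c))           # null genes called positive
    all_pos = false_pos + int(np.sum(alt_stats >= c))  # all genes called positive
    fdr = false_pos / all_pos if all_pos > 0 else 0.0  # convention: 0 when no positives
    positive_rate = all_pos / (null_stats.size + alt_stats.size)
    return fdr, positive_rate

# Toy statistics: 4 null genes and 4 truly up-regulated genes.
print(empirical_fdr([0, 1, 2, 3], [2, 3, 4, 5], c=2.5))
```

Evaluating the function over a grid of thresholds c yields the FDR versus positive-rate curve described above.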

FDR (n1=n2=20, k=2 and 5), plotted against the fraction of genes declared positive

FDR (n1=n2=50, k=5 and 10), plotted against the fraction of genes declared positive

Comparison of the Methods - Summary: Our new MSD method performs best when a small percentage (≤ 20%) of tumor samples are differentially expressed (DE), i.e. are outliers. For a moderate number of DE samples (20-50%), LRS performs better in ROC. For a large number of DE samples (> 50% of tumors), the t statistic becomes more efficient. When a relatively large number (> 30%) of DE samples exist, MSD, LRS, ORT and T have comparable FDR.

Assess Significance: The distributions of the COPA statistics are not known, and an analytic solution is not easily available. A permutation test does not generate the correct null distribution. Simulation: simulate COPA statistics under the null and derive the null distribution from a relatively large number of simulations, say 10000.
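
A sketch of the simulation approach in Python. The names `simulate_null`, `empirical_p_value` and `mean_diff` are hypothetical; `mean_diff` is only a placeholder for whichever COPA statistic is being assessed. Drawing the null data from N(0,1) and adding 1 to the numerator and denominator of the p value are assumptions made here.

```python
import numpy as np

def mean_diff(normal, tumor):
    """Placeholder statistic; substitute the COPA statistic of interest."""
    return tumor.mean() - normal.mean()

def simulate_null(stat_fn, n1, n2, n_sim=10000, seed=0):
    """Null distribution of stat_fn when all samples come from N(0, 1)."""
    rng = np.random.default_rng(seed)
    return np.array([stat_fn(rng.standard_normal(n1), rng.standard_normal(n2))
                     for _ in range(n_sim)])

def empirical_p_value(observed, null_stats):
    """Upper-tail p value; the +1 terms keep it strictly positive."""
    null_stats = np.asarray(null_stats, dtype=float)
    return float((1 + np.sum(null_stats >= observed)) / (1 + null_stats.size))

null_stats = simulate_null(mean_diff, n1=20, n2=20, n_sim=10000)
print(empirical_p_value(1.0, null_stats))
```

The null distribution depends on n1 and n2, so it has to be re-simulated for each study design.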

Distribution of COPA statistics Simulated null for n1=n2=20.

MSD Distribution: Simulated under the null with n1=n2=20 and data from N(0,1). The figures display the pdf of both MSD and y = sqrt(MSD). The fitted dashed line is a non-central chi-square density function for MSD and a normal density for y.

MSD Distribution – Parameters: Both μ and σ are functions of n1 and n2, as well as of the underlying gene expression distribution. If gene expression is assumed to follow a N(0,1) distribution, then the MSD parameters are μ(n1,n2) and σ²(n1,n2). Plots show that μ is driven by n2 and σ is driven by the n2/n1 ratio.

Outlier Identification: COPA, OS and ORT define outlier samples as part of their methods. MSD and LRS do not provide an explicit definition of outliers. The following procedure can be used for MSD (or LRS) outlier identification: 1. Calculate MSD for all genes. 2. Estimate the p value of MSD based on the simulated null. 3. Calculate FDR based on the Benjamini-Hochberg method. 4. Define outliers as the samples above the max(MSD) sample index and with FDR < 0.05.
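
The Benjamini-Hochberg step of this procedure can be sketched in Python as follows. The function name `bh_adjust` is hypothetical, and the p values in the usage lines are made up for illustration.

```python
import numpy as np

def bh_adjust(p_values):
    """Benjamini-Hochberg adjusted p values, returned in the input order."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    scaled = p[order] * m / np.arange(1, m + 1)          # p_(i) * m / i
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]   # enforce monotonicity from the top
    adjusted = np.empty(m)
    adjusted[order] = np.minimum(scaled, 1.0)
    return adjusted

# Illustrative p values for four genes; genes with adjusted value < 0.05 are kept.
fdr = bh_adjust([0.01, 0.04, 0.03, 0.5])
print(fdr, fdr < 0.05)
```

For each gene that passes the FDR cutoff, the outlier samples are then read off from the split at which the statistic attains its maximum.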

Application – Lung Cancer Data: One of the drivers in non-small-cell lung cancer (NSCLC) is the EML4-ALK fusion (Soda et al 2007). ALK fusion was associated with high ALK gene expression (Zhang et al 2010). The prevalence of ALK fusion in NSCLC is about 5%. Xalkori® is a highly effective ALK inhibitor for treating NSCLC patients with ALK fusion.

NSCLC Expression Data: The Cancer Genome Atlas (TCGA) has expression data generated from 57 normal lung samples and 355 lung adenocarcinoma samples. Expression data were obtained using RNAseq.

ALK Gene Expression: No significant difference using a t-test.

ALK Outlier Analysis: The LRS method failed to find any outliers; MSD identified 16 outliers (4.5%).

ALK Gene Fusion: The ALK gene has 29 exons. The break point of the fusion is between E19 and E20. (Figure labels: normal ALK transcript; EML4-ALK fusion; EML4 or other partner; upstream of ALK; downstream of ALK; junction.)

RNAseq ALK Exon Expression: RNAseq provides a way to measure exon-level expression. Exons downstream of the break point showed high expression, while exons 1-19 had very low expression, an indication of a fusion event.

Fusion Samples: Among the 16 outlier samples, 7 showed fusion characteristics in exon expression.

Fusion Samples vs. Outlier Samples: Of all 355 tumor samples, 8 showed fusion characteristics in exon expression (marked by "+"); they are among the top 20 samples in ALK mRNA expression.

Summary: We proposed a new cancer outlier analysis method, MSD, and compared it to existing methods. MSD was shown to be more sensitive in detecting outliers when the prevalence of outliers was small (<20%).

References
Tomlins SA, Rhodes DR, Perner S, Dhanasekaran SM, Mehra R, Sun XW, Varambally S, Cao X, Tchinda J, Kuefer R, Lee C, Montie JE, Shah RB, Pienta KJ, Rubin MA, Chinnaiyan AM (2005). Recurrent fusion of TMPRSS2 and ETS transcription factor genes in prostate cancer. Science Oct 28;310(5748):
Tibshirani R, Hastie T (2007). Outlier sums for differential gene expression analysis. Biostatistics 8:2-8.
Wu B (2007). Cancer outlier differential gene expression detection. Biostatistics 8:
Hu J (2008). Cancer outlier detection based on likelihood ratio test. Bioinformatics 24(19):
Soda M, Choi YL, Enomoto M, Takada S, Yamashita Y, Ishikawa S, Fujiwara S, Watanabe H, Kurashina K, Hatanaka H, et al. (2007). Identification of the transforming EML4-ALK fusion gene in non-small-cell lung cancer. Nature 448:
Zhang X, Zhang S, Yang X, Yang J, Zhou Q, et al. (2010). Fusion of EML4 and ALK is associated with development of lung adenocarcinomas lacking EGFR and KRAS mutations and is correlated with ALK expression. Mol Cancer 9:188.

Acknowledgements: Fred Immermann; Pfizer Oncology Research Unit at La Jolla, CA; Computational Biology; Asia Omics Project Team.