The RNA-Seq Big Idea: Statistical Design and Analysis for RNA Sequencing Data

The RNA-Seq Big Idea Team: Yaqing Zhao1,2, Erika Cule1†, Andrew Gehman1, Mark Lennon1, Katja Remlinger1, David Willé1
1 Target Sciences Statistics, Target Sciences, R&D, GSK
2 Department of Statistics, North Carolina State University

The RNA-Seq Big Idea

RNA sequencing (RNA-Seq) is a high-throughput sequencing technology that can measure nucleotide sequences from multiple samples in parallel. As RNA-Seq experiments become more commonplace throughout the drug discovery process, statisticians need to learn more about the technology so that they can advise scientists on design and analysis. This helps to guard against design and analysis errors that could waste resources and lead to erroneous conclusions.

Statistical Design for RNA-Seq

It is generally assumed that the sources of variation are ordered as follows: lane effect < flow cell effect/run effect < library preparation effect << biological effect. Replication is essential for estimating and reducing the experimental error, and thus for detecting the biological (treatment) effect more precisely.

Sequencing depth is a characteristic specific to RNA-Seq. It is often estimated as the total number of mapped reads. From a statistical perspective, designs that maximize biological replication at the cost of reduced sequencing depth per experimental unit are preferred [3].

RNA-Seq experiments are a multiphase process. In the in-vivo phase, experimental units are assigned to different treatments, using basic design principles. In the in-vitro phase, biological samples (e.g. tissues) are collected from each experimental unit and RNA is extracted. In the final phase, cDNA fragments are prepared, RNA-Seq is used to sequence millions of reads from these fragments, and a count matrix is generated. Ideally, the design strategies of all three phases should be coordinated, for example by confounding nuisance factors from the different phases with one another [1].
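The coordination of phases described above can be sketched in code. This is a minimal illustration under stated assumptions, not the team's method: it assumes a hypothetical study with an equal number of experimental units per treatment, and it assumes that each library-preparation batch is barcoded, pooled and sequenced in its own lane, so that batch and lane are confounded with each other while both stay balanced across treatments. The function name `coordinated_design` and the unit and treatment labels are invented for the example.

```python
import random
from collections import Counter


def coordinated_design(units, treatments, seed=0):
    """Return one row per unit: (unit, treatment, prep_batch, lane).

    Hypothetical helper, for illustration only.
    """
    if len(units) % len(treatments) != 0:
        raise ValueError("need an equal number of units per treatment")
    rng = random.Random(seed)
    reps = len(units) // len(treatments)

    # In-vivo phase: completely randomized assignment of units to treatments.
    shuffled = list(units)
    rng.shuffle(shuffled)
    by_treatment = {
        t: shuffled[i * reps:(i + 1) * reps] for i, t in enumerate(treatments)
    }

    rows = []
    for batch in range(reps):
        # In-vitro phase: each library-prep batch holds one unit per
        # treatment, processed in random order.
        members = [(by_treatment[t][batch], t) for t in treatments]
        rng.shuffle(members)
        for unit, treatment in members:
            # Sequencing phase (assumption): the whole batch is barcoded,
            # pooled and run in one lane, so lane is confounded with batch
            # but not with treatment.
            rows.append((unit, treatment, batch + 1, batch + 1))
    return rows


units = [f"animal_{i:02d}" for i in range(1, 13)]
treatments = ["control", "low_dose", "high_dose"]
design = coordinated_design(units, treatments)
for row in design:
    print(row)

# Count how often each treatment appears in each lane (should always be 1).
per_lane = Counter((lane, trt) for _, trt, _, lane in design)
print(all(n == 1 for n in per_lane.values()))
```

Because every batch (and therefore every lane) contains each treatment once, batch and lane differences can be removed as a single blocking factor without being mistaken for a treatment effect.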
Background: Biology & Technology

RNA is converted to cDNA and fragmented. The fragments are sequenced, and the resulting reads are aligned to a reference genome or transcriptome, or assembled de novo. RNA-Seq measures RNA transcripts (the blue line in the figure). The resulting data can comprise up to about 55,000 count measurements per sample.

[Figure source: Figure 6-2, Chapter 6, Molecular Biology of the Cell, 4th edition. Alberts B, Johnson A, Lewis J, et al. New York: Garland Science; 2002.]

Usually, sequencing occurs within the lanes of flow cells (see figure). Either individual or multiple samples (libraries) can be sequenced within each lane. When multiplexing, barcodes enable each read to be attributed to the appropriate library.

[Figure: Example of Sequencer, with flow cells labelled.]

Best Practices for Designing RNA-Seq Experiments

Sources of variation that are specific to RNA-Seq experiments include lane and flow cell effects and run-batch effects. Basic statistical principles such as blocking and randomization can be used to avoid confounding these effects with factors of interest, such as treatment effects [2]. As shown in the figure, multiplexing is one way to eliminate confounding caused by batch or lane effects.

[Figure: sequencing design; visible labels: Treatment, Biological reps, RNA extraction.]

Acknowledgements

The RNA-Seq Big Idea Team would like to thank our colleagues across GSK R&D for giving generously of their time and data to enable us to develop our statistical consulting skills in this rapidly developing area.

Statistical Analysis of Count Data

The count data collected from RNA-Seq experiments are typically modelled by a negative binomial distribution or by a Poisson distribution with overdispersion. Both approaches allow for overdispersion, i.e. count variance exceeding the Poisson variance, which can differ between genes. DESeq2 [4] is a popular R package for analyzing RNA-Seq count data. The statistical model implemented in the package is described in [4] as:

$$K_{ij} \sim \mathrm{NB}(\mu_{ij}, \alpha_i), \qquad \mu_{ij} = s_j q_{ij}, \qquad \log_2 q_{ij} = \sum_r x_{jr} \beta_{ir}$$

Here $K_{ij}$ is the read count for gene $i$ in sample $j$, $\mu_{ij}$ is its mean, $\alpha_i$ is the gene-specific dispersion, $s_j$ is the size factor of sample $j$, and $\beta_{ir}$ are the log2 fold changes for gene $i$. $X$ represents the design matrix with elements $x_{jr}$, with rows $j$ representing samples and columns $r$ representing factor levels or continuous covariates.
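To make the negative binomial model of [4] concrete, here is a minimal sketch for a single gene. It assumes the mean count for sample j is a size factor s_j times 2 raised to a linear combination of that sample's design-matrix row and the gene's coefficients, and that the variance is mu + alpha * mu^2. All numbers (design matrix, coefficients, size factors, dispersion) are invented, and the helper names `expected_counts` and `nb_logpmf` are hypothetical. DESeq2 estimates these quantities from data; this sketch does not, and only evaluates the model.

```python
import math


def expected_counts(design, beta, size_factors):
    """Mean count per sample for one gene: mu_j = s_j * 2 ** (x_j . beta)."""
    mu = []
    for x_row, s in zip(design, size_factors):
        log2_q = sum(x * b for x, b in zip(x_row, beta))
        mu.append(s * 2.0 ** log2_q)
    return mu


def nb_logpmf(k, mu, alpha):
    """Log-probability of count k under a negative binomial with mean mu
    and dispersion alpha, parameterized so Var(K) = mu + alpha * mu**2."""
    r = 1.0 / alpha
    return (
        math.lgamma(k + r) - math.lgamma(k + 1) - math.lgamma(r)
        + r * math.log(r / (r + mu))
        + k * math.log(mu / (r + mu))
    )


# Made-up example: two control and two treated samples.
# Columns of the design matrix: intercept, treatment indicator.
design = [[1, 0], [1, 0], [1, 1], [1, 1]]
beta = [5.0, 1.0]                    # log2 baseline, log2 fold change
size_factors = [1.0, 0.5, 1.0, 2.0]  # relative sequencing depth per sample
alpha = 0.1                          # dispersion for this gene

mu = expected_counts(design, beta, size_factors)
print(mu)

# Numerically check the mean and variance implied by the pmf for sample 0.
probs = [math.exp(nb_logpmf(k, mu[0], alpha)) for k in range(2000)]
total = sum(probs)
mean = sum(k * p for k, p in enumerate(probs))
var = sum((k - mean) ** 2 * p for k, p in enumerate(probs))
print(total, mean, var, mu[0] + alpha * mu[0] ** 2)
```

The size factor rescales the mean for differences in sequencing depth, so two samples in the same treatment group can have different expected counts while sharing the same underlying expression level.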
The DESeq2 package allows for a variety of statistical models, including multi-way ANOVA designs and continuous covariates. One drawback is that it does not allow random effects to be included. As the cost of RNA-Seq has decreased significantly and more complex study designs (e.g. cross-over, repeated measures) are being used, the need for more sophisticated statistical models is growing. Several authors [5,6] have proposed possible solutions, but this remains an active field of research.

[Figure labels: Bar-code and pool; Preparation for sequencing; Sequence technical reps. Source: Figure 4 from Statistical Design and Analysis of RNA Sequencing Data. Paul L. Auer, R. W. Doerge. Genetics, June 1, 2010, vol. 185, no. 2; DOI: /genetics]

References

[1] Datta, S. and Nettleton, D., Statistical Analysis of Next Generation Sequencing Data. Springer.
[2] Auer, P.L. and Doerge, R.W., 2010. Statistical design and analysis of RNA sequencing data. Genetics, 185(2).
[3] Liu, Y., Zhou, J. and White, K.P., RNA-seq differential expression studies: more sequence or more replication? Bioinformatics, 30(3).
[4] Love, M.I., Huber, W. and Anders, S., 2014. Moderated estimation of fold change and dispersion for RNA-Seq data with DESeq2. Genome Biology, 15, p. 550.
[5] Turro, E., Astle, W.J. and Tavaré, S., Flexible analysis of RNA-seq data using mixed effects models. Bioinformatics, 30(2).
[6] Sun, S., Hood, M., Scott, L., Peng, Q., Mukherjee, S., Tung, J. and Zhou, X., Differential expression analysis for RNAseq using Poisson mixed models. bioRxiv.

