The RNA-Seq Big Idea: Statistical Design and Analysis for RNA Sequencing Data

The RNA-Seq Big Idea Team: Yaqing Zhao (1,2), Erika Cule (1,†), Andrew Gehman (1), Mark Lennon (1), Katja Remlinger (1), David Willé (1)
(1) Target Sciences Statistics, Target Sciences, R&D, GSK. yaqing.x.zhao@gsk.com; †erika.j.cule@gsk.com
(2) Department of Statistics, North Carolina State University. yzhao15@ncsu.edu

The RNA-Seq Big Idea

RNA sequencing (RNA-Seq) is a high-throughput sequencing technology noted for its ability to measure nucleotide sequences from multiple samples in parallel. As RNA-Seq experiments become more commonplace throughout the drug discovery process, statisticians need to learn more about the technology so that they can advise scientists on design and analysis. This helps to guard against design and analysis errors that could waste resources and lead to erroneous conclusions.

Statistical Design for RNA-Seq

It is generally assumed that: lane effect < flow cell effect/run effect < library preparation effect << biological effect.

Replication is essential for estimating and reducing the experimental error, and thus for detecting the biological (treatment) effect more precisely.

Sequencing depth is a characteristic specific to RNA-Seq. It is often measured as the total number of mapped reads. From a statistical perspective, designs that maximize biological replication at the expense of sequencing depth per experimental unit are preferred [3].

RNA-Seq experiments are a multiphase process. In the in-vivo phase, experimental units are assigned to different treatments, using basic design principles. In the in-vitro phase, biological samples (e.g. tissues) are collected from each experimental unit and RNA is extracted. Finally, in the sequencing phase, cDNA fragments are prepared, RNA-Seq is used to determine the sequence of millions of reads from the fragments of each sample, and a count matrix is generated.
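The preference for replication over depth stated in the Statistical Design section can be illustrated with a small calculation. The sketch below is an illustration under stated assumptions, not a method from the poster: it assumes that the counts for one gene follow a negative binomial distribution with variance mu + alpha * mu^2, that a fixed total number of reads for that gene is split evenly across the replicates of one treatment group, and that the dispersion alpha does not change with depth. The function name and the example numbers are hypothetical.

```python
def squared_cv_of_group_mean(total_gene_reads, n_replicates, dispersion):
    """Squared coefficient of variation of the mean count of one gene in one group.

    Assumption (not from the poster): counts are negative binomial with
    mean mu and variance mu + dispersion * mu**2, where
    mu = total_gene_reads / n_replicates (the read budget is split evenly).
    """
    if total_gene_reads <= 0 or n_replicates < 1 or dispersion < 0:
        raise ValueError(
            "need total_gene_reads > 0, n_replicates >= 1, dispersion >= 0"
        )
    mean_count = total_gene_reads / n_replicates
    # Var / mean^2 for a single sample: Poisson part plus biological part.
    per_sample_cv2 = 1.0 / mean_count + dispersion
    # Averaging n independent replicates divides the variance by n.
    return per_sample_cv2 / n_replicates


if __name__ == "__main__":
    budget, alpha = 1200, 0.1  # hypothetical read budget and dispersion
    for n in (3, 6, 12):
        print(n, squared_cv_of_group_mean(budget, n, alpha))
```

Under these assumptions the result simplifies to 1/B + alpha/n, where B is the total read budget for the gene and n the number of replicates. The sampling term 1/B depends only on the total budget, whereas the biological term alpha/n shrinks only when replicates are added.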
Ideally, the design strategies of all three phases of an RNA-Seq experiment (in vivo, in vitro and sequencing) should be coordinated, for example by confounding nuisance factors from all three phases with one another [1].

Background: Biology & Technology

RNA is converted to cDNA and fragmented. Reads are sequenced and then either aligned to a reference genome or transcriptome or assembled de novo. RNA-Seq measures RNA transcripts (the blue line in the figure). The resulting data can comprise up to about 55,000 count measurements per sample.

[Figure: Figure 6-2, Chapter 6, Molecular Biology of the Cell, 4th edition. Alberts B, Johnson A, Lewis J, et al. New York: Garland Science; 2002.]

[Figure: Example of Sequencer. Labels: Flow Cell 1; Flow Cell 2.]

Usually, sequencing occurs within the lanes of flow cells (see the sequencer figure). Either a single sample or multiple samples (libraries) can be sequenced within each lane. When multiplexing, barcodes enable each read to be attributed to the appropriate library.

Best Practices for Designing RNA-Seq Experiments

Sources of variation that are specific to RNA-Seq experiments include lane and flow cell effects and run (batch) effects. Basic statistical principles such as blocking and randomization can be used to avoid confounding these effects with factors of interest, such as treatment effects [2]. As shown in the accompanying figure, multiplexing is one way to eliminate confounding caused by batch or lane effects.

[Figure labels: Treatment; Biological reps.]

Acknowledgements

The RNA-Seq Big Idea Team would like to thank our colleagues across GSK R&D for giving generously of their time and data to enable us to develop our statistical consulting skills in this rapidly developing area.

Statistical Analysis of Count Data

The count data collected from RNA-Seq experiments are typically modelled by a negative binomial distribution or by a Poisson distribution with overdispersion. Both approaches allow for overdispersion, i.e. count variance that exceeds the mean, as is typically seen across biological replicates. DESeq2 [4] is a popular R package for analyzing RNA-Seq count data. The package fits a negative binomial generalized linear model to the counts for each gene [4].
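The DESeq2 model formula referred to under Statistical Analysis of Count Data is not preserved in this transcript. The block below is a reconstruction from the DESeq2 paper [4] rather than from the poster, so the notation is that paper's and should be read as an assumption about what the poster displayed: K_ij is the read count for gene i in sample j, s_j is a sample-specific size factor, alpha_i is a gene-specific dispersion, x_jr is an element of the design matrix X, and beta_ir is the coefficient of gene i for column r of X.

```latex
\begin{aligned}
K_{ij} &\sim \mathrm{NB}(\mu_{ij}, \alpha_i), \\
\mu_{ij} &= s_j \, q_{ij}, \\
\log_2 q_{ij} &= \sum_{r} x_{jr} \, \beta_{ir}.
\end{aligned}
```

In this parameterization the variance of K_ij is mu_ij + alpha_i * mu_ij^2, so alpha_i = 0 recovers the Poisson case and any alpha_i > 0 corresponds to overdispersion.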
In the DESeq2 model, X represents the design matrix, with rows j representing samples and columns r representing factor levels or continuous covariates.

The DESeq2 package allows for a variety of statistical models, including multi-way ANOVA designs and continuous covariates. One drawback is that it does not allow random effects to be included. As the cost of RNA-Seq has decreased significantly and more complex study designs (e.g. cross-over, repeated measures) are being used, the need for more sophisticated statistical models is growing. Several authors [5,6] have proposed possible solutions, but this remains an active field of research.

References

[1] Datta, S. and Nettleton, D., 2014. Statistical Analysis of Next Generation Sequencing Data. Springer.
[2] Auer, P.L. and Doerge, R.W., 2010. Statistical design and analysis of RNA sequencing data. Genetics, 185(2), pp.405-416.
[3] Liu, Y., Zhou, J. and White, K.P., 2014. RNA-seq differential expression studies: more sequence or more replication? Bioinformatics, 30(3), pp.301-304.
[4] Love, M.I., Huber, W. and Anders, S., 2014. Moderated estimation of fold change and dispersion for RNA-Seq data with DESeq2. Genome Biology, 15, p.550.
[5] Turro, E., Astle, W.J. and Tavaré, S., 2014. Flexible analysis of RNA-seq data using mixed effects models. Bioinformatics, 30(2), pp.180-188.
[6] Sun, S., Hood, M., Scott, L., Peng, Q., Mukherjee, S., Tung, J. and Zhou, X., 2016. Differential expression analysis for RNAseq using Poisson mixed models. bioRxiv, p.073403.

[Figure: Figure 4 from Statistical Design and Analysis of RNA Sequencing Data. Paul L. Auer, R. W. Doerge. Genetics June 1, 2010 vol. 185 no. 2, 405-416; DOI: 10.1534/genetics. Labels: RNA extraction; Bar-code and pool; Preparation for sequencing; Sequence technical reps.]
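The multiplexing strategy described under Best Practices for Designing RNA-Seq Experiments can be sketched in code. This is a minimal sketch, assuming that the design figure depicts a balanced block layout in which every library receives its own barcode, all libraries are pooled, and the pool is sequenced in every lane, so that each lane acts as a complete block. The function name, the barcode labels and the example treatments are hypothetical and do not come from the poster.

```python
import random
from collections import Counter


def multiplexed_block_design(treatments, n_bio_reps, n_lanes, seed=0):
    """Return one row per (lane, library) for a multiplexed, lane-balanced layout.

    Assumption (not from the poster): every barcoded library is sequenced
    in every lane, so lane effects cannot be confounded with treatment.
    """
    rng = random.Random(seed)
    # One library per biological replicate of each treatment.
    libraries = [(t, rep) for t in treatments for rep in range(1, n_bio_reps + 1)]
    # Hypothetical barcode labels, allocated to libraries at random.
    barcodes = [f"BC{k:02d}" for k in range(1, len(libraries) + 1)]
    rng.shuffle(barcodes)
    design = []
    for lane in range(1, n_lanes + 1):
        for (treatment, rep), barcode in zip(libraries, barcodes):
            design.append(
                {"lane": lane, "barcode": barcode,
                 "treatment": treatment, "bio_rep": rep}
            )
    return design


def treatments_per_lane(design):
    """Count how many libraries of each treatment appear in each lane."""
    counts = {}
    for row in design:
        counts.setdefault(row["lane"], Counter())[row["treatment"]] += 1
    return counts


if __name__ == "__main__":
    layout = multiplexed_block_design(["control", "treated"], n_bio_reps=3, n_lanes=2)
    for lane, counts in treatments_per_lane(layout).items():
        print(lane, dict(counts))
```

Because every lane contains the same number of libraries from each treatment, a lane or flow cell effect shifts all treatments equally and can be separated from the treatment effect in the analysis.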