DEG 2014.10.22 Mi-kyoung Seo.

RNA-seq for DEG
Sequencing: FASTQ
Data quality control: FastQC / FASTX-Toolkit
Mapping: TopHat2
Read counting: HTSeq
Transcript assembly: Cufflinks
Final transcript assembly
Differential expression analysis: Cuffdiff, or in R with DESeq, edgeR, ...
Visualization: CummeRbund

RNA-Seq versus microarrays
Fig. 2. RNA-Seq versus microarrays. Evaluation of the sensitivity of RNA-Seq over microarrays on the same RNA source and based on 13,118 genes represented on the array. (A) Comparison of the number of expressed genes detected by RNA-Seq and microarrays. Values for relaxed (at least one read) and stringent (at least five reads) RNA-Seq parameters are in bold or in brackets, respectively. (B) Distribution of the RNA-Seq NEs and the proportion of genes detected on microarrays. Genes missed by microarrays are shown with gray (HEK) and black (B cells) bars. Genes detected by microarrays are shown with light red (HEK) and dark red (B cells) bars.
Sultan M et al. 2008

RNA-seq vs. microarray From Sonia Tarazona

Definition From Sonia Tarazona

Sources of variation
Between-lane normalization: library size (sequencing depth)
Within-lane normalization: gene-specific biases (length, GC content)
Mappability of reads
Differences in the count distribution among samples
Count data with RNA-seq biases → Normalization → DEG

DEG: differentially expressed gene
A gene is declared differentially expressed if an observed difference or change in read counts between two experimental conditions is statistically significant.
Statistical framework for RNA-seq

Variance depends strongly on the mean
Technical replicates: Poisson distribution, $v = \mu$
Biological replicates: negative binomial distribution
Poisson + constant CV: $v = \mu + \alpha\mu^2$ (edgeR)
Poisson + local regression: $v = \mu + f(\mu^2)$ (DESeq)
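The mean-variance relations on this slide can be checked with a small simulation. The sketch below is only an illustration under stated assumptions: it uses the standard gamma-Poisson mixture as the negative binomial model, and the function names and the values of mu, alpha and n are hypothetical choices, not taken from the slides.

```python
import math
import random


def poisson_draw(lam, rng):
    """Draw one Poisson(lam) value with Knuth's multiplication method (fine for small lam)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def simulate_counts(mu, alpha, n, seed=1):
    """Return (technical, biological) count lists with the same mean mu.

    technical: plain Poisson counts, so variance is about mu.
    biological: Poisson counts whose rate is gamma distributed (shape 1/alpha,
    scale alpha*mu), which gives variance about mu + alpha*mu**2.
    """
    rng = random.Random(seed)
    technical = [poisson_draw(mu, rng) for _ in range(n)]
    shape = 1.0 / alpha
    scale = alpha * mu
    biological = [poisson_draw(rng.gammavariate(shape, scale), rng) for _ in range(n)]
    return technical, biological


def variance_to_mean(counts):
    """Ratio of sample variance to sample mean (about 1 for Poisson data)."""
    n = len(counts)
    mean = sum(counts) / n
    variance = sum((x - mean) ** 2 for x in counts) / n
    return variance / mean


if __name__ == "__main__":
    tech, bio = simulate_counts(mu=20.0, alpha=0.2, n=20000)
    print("technical  variance/mean:", round(variance_to_mean(tech), 2))
    print("biological variance/mean:", round(variance_to_mean(bio), 2))
```

With these assumed values the expected ratio is close to 1 for the Poisson counts and close to 1 + alpha*mu = 5 for the overdispersed counts.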

RNA-seq within a library (sample)
Gene 1: length $L_{g1} = 6$, read count $Y_{g1} = 6$, expression = 1
Gene 2: length $L_{g2} = 3$, read count $Y_{g2} = 3$, expression = 1
Read count ∝ expression of a given gene × transcript length

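RNA-seq within different libraries (comparison of two samples)
For gene 1, with length $L_{g1} = 6$:
Library 1: library size $L_{l1} = 600$, read count $Y_{l1} = 6$, expression = 1
Library 2: library size $L_{l2} = 1200$, read count $Y_{l2} = 12$, expression = 1
Read count ∝ expression of a given gene × transcript length × library size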

RPKM
Reads Per Kilobase per Million mapped reads. FPKM, Fragments Per Kilobase per Million mapped fragments, is the analogous measure for paired-end reads (Garber et al. 2011).
$\mathrm{RPKM} = \dfrac{\text{number of reads in the region}}{(\text{length of region}/10^3) \times (\text{total number of mapped reads}/10^6)}$
$\mathrm{RPKM}(X) = \dfrac{10^9 \times C}{N \times L}$
C is the number of mappable reads on the feature (transcript, exons, ...)
N is the total number of mappable reads in the experiment
L is the sum of the exon lengths, in base pairs
Equivalently, RPKM is C divided by N expressed in millions of reads and by L expressed in kb.
Mortazavi et al. (2008) Nature Methods
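As a worked example of the RPKM definition, here is a minimal sketch that assumes C and N are raw read counts and L is given in base pairs. The function name rpkm and the example numbers are illustrative only.

```python
def rpkm(feature_reads, total_mapped_reads, feature_length_bp):
    """RPKM = 1e9 * C / (N * L), with C and N as raw read counts and L in base pairs."""
    if total_mapped_reads <= 0 or feature_length_bp <= 0:
        raise ValueError("total_mapped_reads and feature_length_bp must be positive")
    return 1e9 * feature_reads / (total_mapped_reads * feature_length_bp)


if __name__ == "__main__":
    # 500 reads on a 2 kb feature in a library of 10 million mapped reads
    print(rpkm(500, 10_000_000, 2000))
    # Toy numbers from the two-library slide: same gene, doubled library size
    print(rpkm(6, 600, 6), rpkm(12, 1200, 6))
```

In the toy two-library case the read count doubles with the library size, so the two RPKM values agree, which is the point of normalizing by library size.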

RPKM's drawback
A small number of highly expressed genes can generate a large portion of the total reads (Bullard et al., 2010), which complicates normalization.
Even after normalization by length (e.g., RPKM), longer transcripts or genes are still more likely to be called differentially expressed than shorter ones when using a t-test (Oshlack and Wakefield, 2009).

Gene length bias
[Figure: differential expression as a function of transcript length, for sequencing and array data; series for the 33% of highest expressed genes and the 33% of lowest expressed genes.]
Oshlack and Wakefield (2009) Biology Direct.

Gene length bias
Let X be the measured number of reads in a library mapping to a specific transcript, modeled as a Poisson random variable:
$\mu = E(X) = cNL$
$\mathrm{Var}(X) = \mu = cNL$
N: the total number of transcripts
L: the length of the gene
c: proportionality constant
DEG between two samples of the same library size: test whether the difference D in counts from a particular gene between the two samples is significantly different from zero using a t-test:
$\delta = E(D) / \mathrm{S.E.}(D)$
Under this Poisson model E(D) scales with L and S.E.(D) with $\sqrt{L}$, so $\delta$ grows as $\sqrt{L}$.

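Gene length bias
Dividing by gene length: the scaled count $X' = X/L$ has mean $\mu' = cN$ but variance $\mathrm{Var}(X') = cN/L$, so the distribution is no longer Poisson and $\mu' \neq \mathrm{Var}(X')$.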
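The length dependence can be made concrete with a short calculation. This sketch assumes two independent Poisson counts with means c1·N·L and c2·N·L, so that D = X1 − X2 has mean (c1 − c2)·N·L and variance (c1 + c2)·N·L. The function names and the numbers are hypothetical.

```python
import math


def delta(c1, c2, n_transcripts, length):
    """E(D)/S.E.(D) for raw counts under the Poisson model with mean c*N*L."""
    mean_diff = (c1 - c2) * n_transcripts * length
    se_diff = math.sqrt((c1 + c2) * n_transcripts * length)
    return mean_diff / se_diff


def delta_length_scaled(c1, c2, n_transcripts, length):
    """Same statistic after dividing each count by the gene length L.

    X/L has mean c*N and variance c*N/L, so the ratio is unchanged.
    """
    mean_diff = (c1 - c2) * n_transcripts
    se_diff = math.sqrt((c1 + c2) * n_transcripts / length)
    return mean_diff / se_diff


if __name__ == "__main__":
    for length in (100, 400, 1600):
        raw = delta(2e-6, 1e-6, 1_000_000, length)
        scaled = delta_length_scaled(2e-6, 1e-6, 1_000_000, length)
        print(length, round(raw, 3), round(scaled, 3))
```

Quadrupling the length doubles the statistic, and dividing the counts by length leaves it unchanged, which is why length normalization alone does not remove the bias in this model.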

Technical and biological replicates
Nagalakshmi et al. (2008) found that counts for the same gene from different technical replicates have a variance equal to the mean (Poisson), whereas counts for the same gene from different biological replicates have a variance exceeding the mean (overdispersion).
Marioni et al. (2008) confirmed the first finding: “We find that the sequencing data are highly reproducible, with few systematic differences among technical replicates. Statistically, we find that the variation across technical replicates can be captured using a Poisson model, with only a small proportion (∼0.5%) of genes showing clear deviations from this model.”

RNA-Seq as draws from an infinite urn
Imagine taking N colored balls from an urn which contains >> N balls.
The colors are genes, and the balls are fragments in the library.
A column of the count matrix is then multinomial(N, p).
[Figure labels: BRCA1, BRCA2, library (sample)]
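The urn picture can be sketched in a few lines. The example below assumes sampling with replacement, which approximates an urn holding far more than N balls. The gene proportions and the function name draw_library are made up for illustration.

```python
import random
from collections import Counter


def draw_library(proportions, n_fragments, seed=0):
    """Draw n_fragments fragments; return one column of the count matrix as a dict.

    proportions maps gene name -> relative abundance p (the ball colors).
    Sampling is with replacement, which approximates an urn with >> N balls.
    """
    rng = random.Random(seed)
    genes = list(proportions)
    weights = [proportions[g] for g in genes]
    draws = rng.choices(genes, weights=weights, k=n_fragments)
    counts = Counter(draws)
    return {gene: counts.get(gene, 0) for gene in genes}


if __name__ == "__main__":
    column = draw_library({"BRCA1": 0.7, "BRCA2": 0.3}, n_fragments=1000)
    print(column)
```

The counts in a column always sum to N, which is why a count for one gene is not independent of the counts for the others.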

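Binomial
The binomial distribution has two parameters, the number of trials n and the success probability p. The statement that X follows a binomial distribution with parameters n and p is written X ~ B(n, p).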
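For reference, the probability mass function of X ~ B(n, p) can be written out directly. This is a minimal sketch using only the standard library. The function name and the example values n = 10, p = 0.3 are arbitrary.

```python
from math import comb


def binomial_pmf(k, n, p):
    """P(X = k) for X ~ B(n, p): C(n, k) * p^k * (1 - p)^(n - k)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


if __name__ == "__main__":
    n, p = 10, 0.3
    pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]
    mean = sum(k * prob for k, prob in enumerate(pmf))
    print("total probability:", round(sum(pmf), 6))
    print("mean:", round(mean, 6), "expected n*p:", n * p)
```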

Problems with Poisson
Poisson distribution: $v = \mu$
Negative binomial distribution:
Poisson + constant CV: $v = \mu + \alpha\mu^2$ (edgeR)
Poisson + local regression: $v = \mu + f(\mu^2)$ (DESeq)

DEG tools
The basic idea is that the count data are over-dispersed and are modeled using a negative binomial distribution.
Poisson distribution (mean = variance) + overdispersion => negative binomial distribution
DEG tools for RNA-seq:
DEGSeq (Wang et al.): Poisson distribution
edgeR (Robinson et al., 2010): exact test based on the negative binomial distribution
DESeq (Anders and Huber, 2010): exact test based on the negative binomial distribution

RNA-seq for DEG

Tuxedo protocol
Align the RNA-seq reads to the genome
Condition A: C1_R1_1.fq C1_R1_2.fq C1_R2_1.fq C1_R2_2.fq C1_R3_1.fq C1_R3_2.fq
Condition B: C2_R1_1.fq C2_R1_2.fq C2_R2_1.fq C2_R2_2.fq C2_R3_1.fq C2_R3_2.fq
1| Map the reads for each sample to the reference genome:
$ tophat -p 8 -G genes.gtf -o C1_R1_thout genome C1_R1_1.fq C1_R1_2.fq
$ tophat -p 8 -G genes.gtf -o C1_R2_thout genome C1_R2_1.fq C1_R2_2.fq
$ tophat -p 8 -G genes.gtf -o C1_R3_thout genome C1_R3_1.fq C1_R3_2.fq
$ tophat -p 8 -G genes.gtf -o C2_R1_thout genome C2_R1_1.fq C2_R1_2.fq
$ tophat -p 8 -G genes.gtf -o C2_R2_thout genome C2_R2_1.fq C2_R2_2.fq
$ tophat -p 8 -G genes.gtf -o C2_R3_thout genome C2_R3_1.fq C2_R3_2.fq
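Since the six mapping commands differ only in the sample name, they can be generated rather than typed. The sketch below only builds the command strings shown on the slide and does not run TopHat. The helper name tophat_commands and its default arguments are assumptions for illustration.

```python
def tophat_commands(conditions=("C1", "C2"), replicates=(1, 2, 3),
                    threads=8, annotation="genes.gtf", index="genome"):
    """Build one tophat command line per sample, following the slide's naming scheme."""
    commands = []
    for condition in conditions:
        for replicate in replicates:
            sample = f"{condition}_R{replicate}"
            commands.append(
                f"tophat -p {threads} -G {annotation} -o {sample}_thout "
                f"{index} {sample}_1.fq {sample}_2.fq"
            )
    return commands


if __name__ == "__main__":
    # Print the commands so they can be reviewed or pasted into a shell script
    for command in tophat_commands():
        print(command)
```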

Tuxedo protocol
Assemble expressed genes and transcripts
2| Assemble transcripts for each sample:
$ cufflinks -p 8 -o C1_R1_clout C1_R1_thout/accepted_hits.bam
$ cufflinks -p 8 -o C1_R2_clout C1_R2_thout/accepted_hits.bam
$ cufflinks -p 8 -o C1_R3_clout C1_R3_thout/accepted_hits.bam
$ cufflinks -p 8 -o C2_R1_clout C2_R1_thout/accepted_hits.bam
$ cufflinks -p 8 -o C2_R2_clout C2_R2_thout/accepted_hits.bam
$ cufflinks -p 8 -o C2_R3_clout C2_R3_thout/accepted_hits.bam
3| Create a file called assemblies.txt that lists the assembly file for each sample. The file should contain the following lines:
./C1_R1_clout/transcripts.gtf
./C2_R2_clout/transcripts.gtf
./C1_R2_clout/transcripts.gtf
./C2_R1_clout/transcripts.gtf
./C1_R3_clout/transcripts.gtf
./C2_R3_clout/transcripts.gtf
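The assemblies.txt file from step 3 can also be written by a short script. This is a sketch under the assumption that the Cufflinks output directories follow the `<sample>_clout` naming used above. The helper names are hypothetical, and the paths are grouped by condition here, which differs from the order in which the slide lists them.

```python
from pathlib import Path
import tempfile


def assembly_paths(conditions=("C1", "C2"), replicates=(1, 2, 3)):
    """Return the transcripts.gtf path for every sample, grouped by condition."""
    return [f"./{c}_R{r}_clout/transcripts.gtf" for c in conditions for r in replicates]


def write_assemblies(path="assemblies.txt"):
    """Write one assembly path per line and return the list that was written."""
    lines = assembly_paths()
    Path(path).write_text("\n".join(lines) + "\n")
    return lines


if __name__ == "__main__":
    # Write into a temporary directory so the demo leaves no files behind
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "assemblies.txt"
        write_assemblies(target)
        print(target.read_text(), end="")
```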

Tuxedo protocol
Assemble expressed genes and transcripts
4| Run Cuffmerge on all your assemblies to create a single merged transcriptome annotation:
$ cuffmerge -g genes.gtf -s genome.fa -p 8 assemblies.txt
Identify differentially expressed genes and transcripts
5| Run Cuffdiff by using the merged transcriptome assembly along with the BAM files from TopHat for each replicate:
$ cuffdiff -o diff_out -b genome.fa -p 8 -L C1,C2 -u merged_asm/merged.gtf ./C1_R1_thout/accepted_hits.bam,./C1_R2_thout/accepted_hits.bam,./C1_R3_thout/accepted_hits.bam ./C2_R1_thout/accepted_hits.bam,./C2_R3_thout/accepted_hits.bam,./C2_R2_thout/accepted_hits.bam
The -L option takes comma-separated condition labels. It is followed by one comma-separated list of BAM files per condition, for example: -L Cancer,Normal C1.bam,C2.bam,C3.bam N1.bam,N2.bam,N3.bam
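The Cuffdiff argument layout (labels joined by commas, then one comma-joined BAM list per condition) is easy to get wrong by hand. The sketch below only assembles the argument list and uses only the options that appear in the command above. The function name and the dictionary layout are assumptions.

```python
def cuffdiff_command(groups, merged_gtf="merged_asm/merged.gtf",
                     genome_fasta="genome.fa", out_dir="diff_out", threads=8):
    """Build the cuffdiff argument list.

    groups maps condition label -> list of BAM files (the replicates).
    Labels are joined with commas for -L; each condition's BAM files are
    joined with commas and passed as one positional argument, in label order.
    """
    labels = ",".join(groups)
    bam_groups = [",".join(bams) for bams in groups.values()]
    return ["cuffdiff", "-o", out_dir, "-b", genome_fasta, "-p", str(threads),
            "-L", labels, "-u", merged_gtf, *bam_groups]


if __name__ == "__main__":
    groups = {
        "C1": [f"./C1_R{r}_thout/accepted_hits.bam" for r in (1, 2, 3)],
        "C2": [f"./C2_R{r}_thout/accepted_hits.bam" for r in (1, 2, 3)],
    }
    print(" ".join(cuffdiff_command(groups)))
```

Keeping labels and BAM groups in a single mapping ensures the label order always matches the order of the BAM lists.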