2 RNA-seq for DEG
Sequencing: FASTQ
Data quality control: FastQC / FASTX-Toolkit
Mapping: TopHat2, HTSeq
Transcript assembly: Cufflinks
Final transcript assembly
Differential expression analysis: Cuffdiff (or R: DESeq, edgeR, ...)
Visualization: cummeRbund
3 RNA-Seq versus microarrays
Fig. 2. RNA-Seq versus microarrays. Evaluation of the sensitivity of RNA-Seq over microarrays on the same RNA source and based on 13,118 genes represented on the array.
(A) Comparison of the number of expressed genes detected by RNA-Seq and microarrays. Values for relaxed (at least one read) and stringent (at least five reads) RNA-Seq parameters are in bold or in brackets, respectively.
(B) Distribution of the RNA-Seq NEs and the proportion of genes detected on microarrays. Genes missed by microarrays are shown with gray (HEK) and black (B cells) bars. Genes detected by microarrays are shown with light red (HEK) and dark red (B cells) bars.
Sultan M et al. 2008
6 Sources of variation
Between-lane normalization: library size (sequencing depth)
Within-lane normalization: gene-specific biases (length, GC content); mappability of reads
Differences in the count distribution among samples
Count data with RNA-seq biases → Normalization → DEG
7 DEG: Differentially expressed gene
A gene is declared differentially expressed if an observed difference or change in read counts between two experimental conditions is statistically significant.
Statistical framework for RNA-seq
8 Variance depends strongly on the mean
Technical replicate: Poisson distribution
Biological replicate: negative binomial distribution
Poisson: v = μ
Poisson + constant CV: v = μ + αμ² (edgeR)
Poisson + local regression: v = μ + f(μ²) (DESeq)
[Figure: Poisson distribution and negative binomial distribution]
9 RNA-seq within a library (sample)
(L: transcript length, Y: read count)
Gene 1: L_g1 = 6, Y_g1 = 6, Expression_g1 = 1
Gene 2: L_g2 = 3, Y_g2 = 3, Expression_g2 = 1
Read count ∝ expression of a given gene
Read count ∝ transcript length
10 RNA-seq within different libraries (comparison of two samples)
For gene 1, L_g1 = 6
Library 1: Y_l1 = 6, L_l1 = 600, Expression_g1,l1 = 1
Library 2: Y_l2 = 12, L_l2 = 1200, Expression_g1,l2 = 1
Read count ∝ expression of a given gene
Read count ∝ transcript length
Read count ∝ library size
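The proportionality on the two slides above can be checked with the slide's own numbers. This is a minimal sketch, assuming the read count is exactly proportional to expression × transcript length × library size; the function name is illustrative and does not come from the slides.

```python
import math

def normalized_count(read_count, transcript_length, library_size):
    """Divide transcript length and library size out of a raw read count."""
    return read_count / (transcript_length * library_size)

# Gene 1 from the slide: transcript length 6 in both libraries.
lib1 = normalized_count(read_count=6, transcript_length=6, library_size=600)
lib2 = normalized_count(read_count=12, transcript_length=6, library_size=1200)

# Equal expression gives equal normalized values,
# even though the raw counts (6 vs 12) differ.
print(lib1, lib2, math.isclose(lib1, lib2))
```

The raw counts differ only because library 2 is twice as deep, which is the motivation for the RPKM-style scaling that follows.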
11 RPKM: Reads Per Kilobase per Million mapped reads
FPKM (fragments per kilobase per million mapped fragments) is the analogous measure suited to paired-end reads (Garber et al. 2011).
RPKM = (number of reads in the region) / ((length of region / 10³) × (total number of mapped reads / 10⁶))
RPKM(X) = (10⁹ × C) / (N × L)
C is the number of mappable reads on the feature (transcript, exons, ...)
N is the total number of mappable reads in the experiment
L is the sum of the exon lengths (in base pairs)
Mortazavi et al. (2008) Nature Methods
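The RPKM formula can be sketched as a small function. The sketch assumes L is given in base pairs and N as a raw read count, so that the factor 10⁹ supplies both the per-kilobase and the per-million scaling; the function name, argument names, and example numbers are illustrative.

```python
def rpkm(read_count, feature_length_bp, total_mapped_reads):
    """RPKM = 1e9 * C / (N * L), with L in base pairs and N in reads."""
    if feature_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("feature length and total mapped reads must be positive")
    return 1e9 * read_count / (total_mapped_reads * feature_length_bp)

# Hypothetical feature: 500 reads on 2,000 bp of exons,
# in an experiment with 10 million mapped reads.
value = rpkm(read_count=500, feature_length_bp=2000, total_mapped_reads=10_000_000)
print(value)  # 500 reads / (2 kb * 10 million reads) = 25.0
```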
12 RPKM's drawback
A small number of highly expressed genes can generate a large portion of the total reads (Bullard et al., 2010), which complicates normalization.
Even after normalization by length (e.g., RPKM), longer transcripts or genes are still more likely to be called differentially expressed than shorter ones when a t-test is used (Oshlack and Wakefield, 2009).
13 Gene length bias
[Figure: differential expression as a function of transcript length. Panels: sequencing, array. Groups: 33% of highest expressed genes, 33% of lowest expressed genes.]
Oshlack and Wakefield (2009) Biology Direct.
14 Gene length bias
Let X be the measured number of reads in a library mapping to a specific transcript.
μ = E(X) = cNL
N: the total number of transcripts
L: the length of the gene
c: proportionality constant
X is a Poisson random variable: Var(X) = μ = cNL
DEG between two samples of the same library size: test whether the difference in counts D for a particular gene between the two samples is significantly different from zero using a t-test.
E(D)/S.E.(D) = δ
15 Gene length bias
Dividing by gene length: X' = X/L
μ' = E(X') = cN, Var(X') = cN/L
The distribution is no longer Poisson and μ' ≠ Var(X').
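The argument on the two gene-length slides can be written out in a few lines. This is a sketch under assumptions not stated on the slides: the two samples give independent Poisson counts X_1 and X_2 for the same gene of length L, with transcript abundances N_1 and N_2 and a shared constant c (equal library size), and D = X_1 − X_2.

```latex
\begin{align*}
E(D) &= cN_1L - cN_2L = cL\,(N_1 - N_2) \\
\mathrm{S.E.}(D) &= \sqrt{\mathrm{Var}(X_1) + \mathrm{Var}(X_2)} = \sqrt{cL\,(N_1 + N_2)} \\
\delta &= \frac{E(D)}{\mathrm{S.E.}(D)}
        = \sqrt{L}\,\frac{c\,(N_1 - N_2)}{\sqrt{c\,(N_1 + N_2)}} \;\propto\; \sqrt{L} \\[6pt]
\text{After dividing by length, } D' &= \frac{X_1}{L} - \frac{X_2}{L}: \\
E(D') &= c\,(N_1 - N_2), \qquad
\mathrm{S.E.}(D') = \sqrt{\frac{c\,(N_1 + N_2)}{L}} \\
\delta' &= \frac{E(D')}{\mathrm{S.E.}(D')}
         = \sqrt{L}\,\frac{c\,(N_1 - N_2)}{\sqrt{c\,(N_1 + N_2)}} = \delta
\end{align*}
```

Under these assumptions the standardized difference grows with √L whether or not the counts are divided by length, which is consistent with the point that length normalization alone does not remove the bias toward longer genes.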
16 Technical and biological replicates
Nagalakshmi et al. (2008) found that
counts for the same gene from different technical replicates have a variance equal to the mean (Poisson);
counts for the same gene from different biological replicates have a variance exceeding the mean (overdispersion).
Marioni et al. (2008) confirmed the first finding:
“We find that the sequencing data are highly reproducible, with few systematic differences among technical replicates. Statistically, we find that the variation across technical replicates can be captured using a Poisson model, with only a small proportion (∼0.5%) of genes showing clear deviations from this model.”
17 RNA-Seq as draws from an infinite urn
Imagine taking N colored balls from an urn that contains >> N balls.
The colors are genes, and the balls are fragments in the library.
A column of the count matrix is then multinomial(N, p).
[Figure labels: BRCA1, BRCA2, library (sample)]
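The urn picture can be simulated directly. This is a minimal sketch using only the standard library; sampling with replacement stands in for the "infinite urn", and the gene proportions (including the catch-all "other" category) are made-up values for illustration.

```python
import random
from collections import Counter

def draw_library(gene_proportions, n_fragments, seed=0):
    """Draw n_fragments fragments with replacement; return {gene: count}."""
    rng = random.Random(seed)
    genes = list(gene_proportions)
    weights = [gene_proportions[g] for g in genes]
    draws = rng.choices(genes, weights=weights, k=n_fragments)
    tally = Counter(draws)
    return {g: tally.get(g, 0) for g in genes}

# Hypothetical proportions p for two named genes and everything else.
proportions = {"BRCA1": 0.02, "BRCA2": 0.01, "other": 0.97}
column = draw_library(proportions, n_fragments=10_000)
print(column)  # one column of the count matrix; the counts sum to N
```

Because the counts in a column always sum to N, a gene's count depends on what else is in the library, which is why library size has to be handled in normalization.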
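18 Binomial
The binomial distribution has two parameters, the number of trials n and the success probability p. That X follows a binomial distribution with parameters n and p is written X ~ B(n, p).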
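For X ~ B(n, p), the probability of exactly k successes is C(n, k) p^k (1 − p)^(n − k). In the urn picture, the count of a single gene among N drawn fragments is marginally binomial with n = N and p equal to that gene's proportion. A short sketch follows; the function name and example values are illustrative.

```python
import math

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ B(n, p)."""
    if not 0 <= k <= n:
        return 0.0
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

# Probability of exactly 5 successes in 10 trials with p = 0.5.
prob = binomial_pmf(5, 10, 0.5)
total = sum(binomial_pmf(k, 10, 0.5) for k in range(11))
print(prob, total)
```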
19 Problems with Poisson
Poisson: v = μ
Poisson + constant CV: v = μ + αμ² (edgeR)
Poisson + local regression: v = μ + f(μ²) (DESeq)
[Figure: Poisson distribution and negative binomial distribution]
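The difference between v = μ and v = μ + αμ² can be seen in a small simulation. This sketch draws negative binomial counts as a gamma-Poisson mixture, which is one standard way to obtain that variance function; it is not the estimation procedure of edgeR or DESeq. The values of μ and α, the sample size, and the function names are arbitrary choices for illustration.

```python
import math
import random
import statistics

def poisson_draw(rate, rng):
    """Knuth's multiplication method; adequate for the modest rates used here."""
    threshold = math.exp(-rate)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def negative_binomial_draw(mu, alpha, rng):
    """Gamma-Poisson mixture: rate ~ Gamma(shape=1/alpha, scale=alpha*mu)."""
    rate = rng.gammavariate(1.0 / alpha, alpha * mu)
    return poisson_draw(rate, rng)

rng = random.Random(1)
mu, alpha = 20.0, 0.5  # theory: Poisson v = 20, NB v = 20 + 0.5 * 20**2 = 220

pois = [poisson_draw(mu, rng) for _ in range(20000)]
nb = [negative_binomial_draw(mu, alpha, rng) for _ in range(20000)]

print("Poisson mean/variance:", statistics.mean(pois), statistics.pvariance(pois))
print("NB mean/variance:     ", statistics.mean(nb), statistics.pvariance(nb))
```

Both samples have a mean near μ, but only the Poisson sample has a variance near the mean; the mixture's variance is far larger, which is the overdispersion seen across biological replicates.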
20 DEG tools
The basic idea is that count data are overdispersed and are modeled with a negative binomial distribution.
Poisson distribution (mean = variance) + overdispersion => negative binomial distribution
DEG tools for RNA-seq:
DEGseq (Wang et al.): Poisson distribution
edgeR (Robinson et al., 2010): exact test based on the negative binomial distribution
DESeq (Anders and Huber, 2010): exact test based on the negative binomial distribution
22 Tuxedo protocol
Align the RNA-seq reads to the genome
Condition A:
C1_R1_1.fq C1_R1_2.fq
C1_R2_1.fq C1_R2_2.fq
C1_R3_1.fq C1_R3_2.fq
Condition B:
C2_R1_1.fq C2_R1_2.fq
C2_R2_1.fq C2_R2_2.fq
C2_R3_1.fq C2_R3_2.fq
1| Map the reads for each sample to the reference genome:
$ tophat -p 8 -G genes.gtf -o C1_R1_thout genome C1_R1_1.fq C1_R1_2.fq
$ tophat -p 8 -G genes.gtf -o C1_R2_thout genome C1_R2_1.fq C1_R2_2.fq
$ tophat -p 8 -G genes.gtf -o C1_R3_thout genome C1_R3_1.fq C1_R3_2.fq
$ tophat -p 8 -G genes.gtf -o C2_R1_thout genome C2_R1_1.fq C2_R1_2.fq
$ tophat -p 8 -G genes.gtf -o C2_R2_thout genome C2_R2_1.fq C2_R2_2.fq
$ tophat -p 8 -G genes.gtf -o C2_R3_thout genome C2_R3_1.fq C2_R3_2.fq
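The six TopHat commands differ only in the sample name, so they can be generated instead of typed. The sketch below only builds the command strings and does not run TopHat; it reuses the flags and file-naming pattern shown on the slide, and the function name is illustrative.

```python
def tophat_commands(conditions=("C1", "C2"), replicates=(1, 2, 3), threads=8):
    """Return one tophat command line per sample, following the slide's naming."""
    commands = []
    for cond in conditions:
        for rep in replicates:
            sample = f"{cond}_R{rep}"
            commands.append(
                f"tophat -p {threads} -G genes.gtf -o {sample}_thout genome "
                f"{sample}_1.fq {sample}_2.fq"
            )
    return commands

commands = tophat_commands()
for line in commands:
    print(line)
```

Generating the lines from one template avoids copy-and-paste slips such as pairing the wrong mate file with a sample.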
23 Tuxedo protocol
Assemble expressed genes and transcripts
2| Assemble transcripts for each sample:
$ cufflinks -p 8 -o C1_R1_clout C1_R1_thout/accepted_hits.bam
$ cufflinks -p 8 -o C1_R2_clout C1_R2_thout/accepted_hits.bam
$ cufflinks -p 8 -o C1_R3_clout C1_R3_thout/accepted_hits.bam
$ cufflinks -p 8 -o C2_R1_clout C2_R1_thout/accepted_hits.bam
$ cufflinks -p 8 -o C2_R2_clout C2_R2_thout/accepted_hits.bam
$ cufflinks -p 8 -o C2_R3_clout C2_R3_thout/accepted_hits.bam
3| Create a file called assemblies.txt that lists the assembly file for each sample. The file should contain the following lines:
./C1_R1_clout/transcripts.gtf
./C2_R2_clout/transcripts.gtf
./C1_R2_clout/transcripts.gtf
./C2_R1_clout/transcripts.gtf
./C1_R3_clout/transcripts.gtf
./C2_R3_clout/transcripts.gtf
24 Tuxedo protocol
Assemble expressed genes and transcripts
4| Run Cuffmerge on all your assemblies to create a single merged transcriptome annotation:
$ cuffmerge -g genes.gtf -s genome.fa -p 8 assemblies.txt
Identify differentially expressed genes and transcripts
5| Run Cuffdiff by using the merged transcriptome assembly along with the BAM files from TopHat for each replicate:
$ cuffdiff -o diff_out -b genome.fa -p 8 -L C1,C2 -u merged_asm/merged.gtf ./C1_R1_thout/accepted_hits.bam,./C1_R2_thout/accepted_hits.bam,./C1_R3_thout/accepted_hits.bam ./C2_R1_thout/accepted_hits.bam,./C2_R3_thout/accepted_hits.bam,./C2_R2_thout/accepted_hits.bam
The labels given to -L follow the order of the BAM groups, e.g.: -L Cancer,Normal C1.bam,C2.bam,C3.bam N1.bam,N2.bam,N3.bam
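The Cuffdiff argument layout is easy to get wrong: replicates of one condition are joined by commas, conditions are separated by spaces, and the -L labels are listed in the same order as the groups. The sketch below assembles the command string from lists without running Cuffdiff. It uses only the flags and paths shown on the slide, and the function and parameter names are illustrative.

```python
def cuffdiff_command(labels, bam_groups, merged_gtf="merged_asm/merged.gtf",
                     genome_fasta="genome.fa", out_dir="diff_out", threads=8):
    """Build a cuffdiff command: commas within a condition, spaces between conditions."""
    if len(labels) != len(bam_groups):
        raise ValueError("need exactly one label per group of BAM files")
    label_arg = ",".join(labels)
    group_args = " ".join(",".join(group) for group in bam_groups)
    return (f"cuffdiff -o {out_dir} -b {genome_fasta} -p {threads} "
            f"-L {label_arg} -u {merged_gtf} {group_args}")

c1 = [f"./C1_R{i}_thout/accepted_hits.bam" for i in (1, 2, 3)]
c2 = [f"./C2_R{i}_thout/accepted_hits.bam" for i in (1, 2, 3)]
command = cuffdiff_command(["C1", "C2"], [c1, c2])
print(command)
```

Keeping labels and BAM groups in two parallel lists makes a mismatch between them an explicit error instead of a silently mislabeled comparison.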