RNA-Seq data analysis Xuhua Xia University of Ottawa

1 RNA-Seq data analysis Xuhua Xia University of Ottawa xxia@uottawa.ca http://dambe.bio.uottawa.ca

2 RNA-Seq
[Figure labels: Genome (Gene1, Gene2, Gene3), Transcriptome, RNA-Seq]
FASTQ files:
@SEQ_ID1.1
GATTTGGGGTTCAAAGCA...
+
!''*((((***+))%%+...
@SEQ_ID2.1
GATTTGGGGTTCAAAGCA...
+
!''*((((***+))%%+...
......
SRA files for data storage, transmission and analysis

3 Next-Generation sequencing
FASTQ files:
@SEQ_ID1.1
NATTTGGGGTTCAAAGCA...
+
!''*((((***+))%%+...
@SEQ_ID2.1
GATTTGGGGTTCAAAGCA...
+
%''*((((***+))%%+...
......
Workflow: submission to one of the three data centers (NCBI, DDBJ, EBI) as SRA (Sequence Read Archive) compressed files; storage, transmission and analysis; download by researchers.
Analysis routes:
- De novo genome assembly
- Sequence reads matching/aligning against a known genome
Key research objectives:
- Differential gene expression
- Ribosomal profiling
- Alternative splicing
- Gene discovery
- Signal at TSS and TTS
- ……
Quality assessment:
Phred quality score: Q = -10 log10(p), where p is the base-calling error probability.
1. Global quality assessment
2. Read-specific quality assessment
3. Site-specific quality assessment
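The Phred formula on this slide can be tried out on the quality strings shown in the FASTQ example. The sketch below is illustrative only: it assumes the common Phred+33 (Sanger) character encoding, which the slide does not state, and the function names are hypothetical. It shows the read-specific and site-specific views; a global view would simply pool all bases.

```python
import math

def phred_to_error_prob(q):
    """Invert Q = -10*log10(p) to get the base-calling error probability p."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred quality score Q = -10*log10(p)."""
    return -10 * math.log10(p)

def decode_quality(quality_line, offset=33):
    """Convert a FASTQ quality string to Phred scores.

    offset=33 is an assumption (Phred+33 encoding); older files may use 64.
    """
    return [ord(ch) - offset for ch in quality_line]

def read_mean_quality(quality_line, offset=33):
    """Read-specific quality: mean Phred score over one read."""
    scores = decode_quality(quality_line, offset)
    return sum(scores) / len(scores)

def site_mean_quality(quality_lines, offset=33):
    """Site-specific quality: mean Phred score at each read position."""
    decoded = [decode_quality(q, offset) for q in quality_lines]
    n_sites = max(len(d) for d in decoded)
    means = []
    for i in range(n_sites):
        column = [d[i] for d in decoded if i < len(d)]
        means.append(sum(column) / len(column))
    return means

# Quality string from the first FASTQ record on the slide (truncated there).
scores = decode_quality("!''*((((***+))%%+")
```

Under the Phred+33 assumption the leading "!" decodes to Q = 0, i.e. an error probability of 1, which fits the unresolved "N" at the start of the first read.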

4 Quality assessment: Nucleotide
[Figure panels]
SRR1536586: Single, ReadLen = 50
SRR892245: Paired, ReadLen = 100
SRR2056426: Paired, ReadLen = 250

5 Read-based quality
[Figure panels]
SRR2056426: Paired, excluding N-containing reads
SRR1536586: Single, ReadLen = 50

6 Site-specific quality by nucleotide
SRR2056426
[Figure panels]
Fully resolved paired reads: Read 1, Read 2
Paired reads containing unresolved nucleotides: Read 1, Read 2

7 Gene expression
[Figure labels: Genome (Gene1, Gene2, Gene3), Transcriptome; read matching by BLAST, FASTA, etc. (more details later)]
Count: N_1 = 6, N_2 = 29, N_3 = 4
N_TMR = 2230000 (TMR: total mapped reads)
L_1 = 500 nt, L_2 = 3000 nt, L_3 = 400 nt
GE:
FPKM_1 = (1000*N_1/L_1)*(1000000/N_TMR) = (1000*6/500)/2.23 = 5.38
FPKM_2 = (1000*29/3000)/2.23 = 4.33
FPKM_3 = (1000*4/400)/2.23 = 4.48
FPKM: Fragments Per Kilobase of exon per Million reads
- "per kilobase": fair comparison among genes
- "per million reads": fair comparison among samples
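The FPKM arithmetic on this slide can be reproduced with a few lines of Python. This is a minimal sketch using only the counts, lengths and N_TMR given on the slide; the function and variable names are illustrative.

```python
def fpkm(count, length_nt, total_mapped_reads):
    """FPKM = (1000 * N / L) * (1,000,000 / N_TMR)."""
    per_kilobase = 1000 * count / length_nt        # normalise by gene length
    per_million = 1_000_000 / total_mapped_reads   # normalise by library size
    return per_kilobase * per_million

N_TMR = 2_230_000  # total mapped reads, from the slide

# (read count, gene length in nt) for the three genes on the slide
genes = {"Gene1": (6, 500), "Gene2": (29, 3000), "Gene3": (4, 400)}

fpkm_values = {name: fpkm(n, length, N_TMR) for name, (n, length) in genes.items()}

for name, value in fpkm_values.items():
    print(f"{name}: FPKM = {value:.2f}")
```

Gene2 has by far the most reads, yet its FPKM is the lowest of the three because it is also the longest gene, which is the point of the "per kilobase" scaling.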

8 Gene expression with duplicated genes
[Figure: Paralogue A and Paralogue B, with segments marked as identical segment; different but with clear homology; homology lost in evolution]
Subscripts: H - different but homologous; I - identical segment; U - unique/divergent segment
Read counts: N_A.H = 6, N_B.H = 3, N_A.U = 4, N_B.U = 3, N_I = 29
P_A = (N_A.H + N_A.U)/(N_A.H + N_B.H + N_A.U + N_B.U) = (6+4)/(6+4+3+3) = 0.625
N_A = N_A.H + N_A.U + N_I*P_A = 6+4+29*0.625 = 28.125
N_B = N_B.H + N_B.U + N_I*(1-P_A) = 3+3+29*0.375 = 16.875
Scale N_A and N_B to FPKM
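The allocation of reads from the identical segment can be written as a small function. This is a sketch of the arithmetic on the slide, not the author's implementation; the function and parameter names are illustrative.

```python
def allocate_identical_reads(n_a_h, n_a_u, n_b_h, n_b_u, n_i):
    """Split reads from the identical segment between two paralogues.

    Reads that can be assigned unambiguously (H and U segments) set the
    proportion P_A; the N_I ambiguous reads are divided in that proportion.
    """
    own_a = n_a_h + n_a_u
    own_b = n_b_h + n_b_u
    p_a = own_a / (own_a + own_b)
    n_a = own_a + n_i * p_a
    n_b = own_b + n_i * (1 - p_a)
    return p_a, n_a, n_b

# Counts from the slide
p_a, n_a, n_b = allocate_identical_reads(n_a_h=6, n_a_u=4, n_b_h=3, n_b_u=3, n_i=29)
print(p_a, n_a, n_b)
```

The two allocated counts always sum to the total number of reads (here 45), so no read is lost or counted twice.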

9 Duplicated genes of different lengths
[Figure: Paralogue A and Paralogue B, with segments marked as identical segment; different but with clear homology; homology lost in evolution; unique segments of lengths L_A.U and L_B.U]
Read counts: N_A.H = 6, N_B.H = 3, N_A.U = 4, N_B.U = 3, N_I = 29
Two alternatives:
1. P_A = N_A.H/(N_A.H + N_B.H) = 6/(6+3) = 0.66667
2. n_A.H = 6/L_H; n_B.H = 3/L_H; n_A.U = 4/L_A.U; n_B.U = 3/L_B.U
   P_A = (n_A.H + n_A.U)/(n_A.H + n_B.H + n_A.U + n_B.U)
N_A = N_A.H + N_A.U + N_I*P_A
N_B = N_B.H + N_B.U + N_I*(1-P_A)
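The two alternatives can be compared numerically. The slide gives no segment lengths, so the values of L_H, L_A.U and L_B.U below are hypothetical and chosen only to show how unequal unique-segment lengths shift P_A; the function names are illustrative as well.

```python
def p_a_homologous_only(n_a_h, n_b_h):
    """Alternative 1: use only reads from the homologous (H) segment,
    which has the same length in both paralogues."""
    return n_a_h / (n_a_h + n_b_h)

def p_a_length_normalized(n_a_h, n_b_h, n_a_u, n_b_u, l_h, l_a_u, l_b_u):
    """Alternative 2: convert counts to per-nucleotide densities first."""
    d_a_h = n_a_h / l_h
    d_b_h = n_b_h / l_h
    d_a_u = n_a_u / l_a_u
    d_b_u = n_b_u / l_b_u
    return (d_a_h + d_a_u) / (d_a_h + d_b_h + d_a_u + d_b_u)

def allocate(n_a_h, n_a_u, n_b_h, n_b_u, n_i, p_a):
    """N_A and N_B given a chosen P_A."""
    n_a = n_a_h + n_a_u + n_i * p_a
    n_b = n_b_h + n_b_u + n_i * (1 - p_a)
    return n_a, n_b

# Counts from the slide
counts = dict(n_a_h=6, n_b_h=3, n_a_u=4, n_b_u=3)

p1 = p_a_homologous_only(counts["n_a_h"], counts["n_b_h"])

# HYPOTHETICAL lengths (not on the slide): B's unique segment is much shorter.
p2 = p_a_length_normalized(**counts, l_h=300, l_a_u=400, l_b_u=100)

print(p1, allocate(6, 4, 3, 3, 29, p1))
print(p2, allocate(6, 4, 3, 3, 29, p2))
```

When all three segment lengths are equal, alternative 2 reduces to the unnormalized P_A of 0.625 from the preceding slide.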

10 Gene expression with duplicated genes
[Figure: Paralogue A, Paralogue B and Paralogue C, with segments marked as identical segment; different but with clear homology; homology lost in evolution]
Subscripts: H - different but homologous; I - identical segment; U - unique/divergent segment
Read counts: N_A.H = 6, N_B.H = 2, N_C.H = 1, N_A.U = 4, N_B.U = 2, N_C.U = 1, N_I = 29
P_A = (N_A.H + N_A.U)/(N_A.H + N_A.U + N_B.H + N_B.U + N_C.H + N_C.U) = (6+4)/(6+4+2+2+1+1) = 0.625
N_A = N_A.H + N_A.U + N_I*P_A = 6+4+29*0.625 = 28.125
N_B = N_B.H + N_B.U + N_I*P_B
N_C = N_C.H + N_C.U + N_I*P_C
where P_B and P_C are defined in the same way as P_A.

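11 Multiple paralogues
[Figure: tree of paralogues PG1, PG2 and PG3 with a table of read counts (columns Gene, N_H, N_I, N_U); the column layout did not survive extraction]
Read counts as used in the formulas below: PG1 = 309; PG2 = 204; PG3 = 102 and 101; 510 reads apportioned between PG1 and PG2; 600 reads apportioned among all three.
P_3 = (102+101)/(102+101+510+204+309) ≈ 0.1656
P_2 = (1-P_3)*204/(204+309) = 0.3318
P_1 = (1-P_3)*309/(204+309) = 0.5026
N_3 = 102+101+600*P_3 = 302.35
N_2 = 204+510*204/(204+309)+600*P_2 = 605.90
N_1 = 309+510*309/(204+309)+600*P_1 = 917.75
More details later on tree reconstruction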
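The three-paralogue calculation follows the tree: reads shared by all three genes are first split between PG3 and the (PG1, PG2) group, and each group's share is then split within the group. The sketch below reproduces the numbers on the slide; reading 510 as "shared by PG1 and PG2 only" and 600 as "shared by all three" is inferred from the formulas, and the function name is illustrative.

```python
def allocate_three_paralogues(own1, own2, own3, shared_12, shared_all):
    """Hierarchical allocation for a tree ((PG1, PG2), PG3).

    own1, own2, own3: reads assignable to one paralogue only
    shared_12: reads matching PG1 and PG2 equally well, but not PG3
    shared_all: reads matching all three paralogues equally well
    """
    # Proportion of the all-shared reads that goes to PG3
    p3 = own3 / (own3 + shared_12 + own1 + own2)
    # Within the (PG1, PG2) group, split by the gene-specific reads
    frac1 = own1 / (own1 + own2)
    frac2 = own2 / (own1 + own2)
    p1 = (1 - p3) * frac1
    p2 = (1 - p3) * frac2
    n1 = own1 + shared_12 * frac1 + shared_all * p1
    n2 = own2 + shared_12 * frac2 + shared_all * p2
    n3 = own3 + shared_all * p3
    return (p1, p2, p3), (n1, n2, n3)

# Counts from the slide; PG3's own reads are 102 + 101
(p1, p2, p3), (n1, n2, n3) = allocate_three_paralogues(
    own1=309, own2=204, own3=102 + 101, shared_12=510, shared_all=600
)
print(round(p1, 4), round(p2, 4), round(p3, 4))
print(round(n1, 2), round(n2, 2), round(n3, 2))
```

As with two paralogues, the three allocated counts sum to the total number of reads (1826 here).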

12 Ribosomal density
[Figure] Mean density adjusted for mRNA length. The confounding effect of elongation efficiency.
Xia et al. 2011 Genetics

13 Poly(A) length & protein synthesis
[Figure] Xia et al. 2011 Genetics

14 Transcription: TSS and TTS
[Figure: a transcript (AUG… …UAA) with two start sites (TSS 1, TSS 2) and two termination sites (TTS 1, TTS 2), shown for Exp. 1 and Exp. 2]

15 Alternative splicing
[Figure: pre-mRNA with exons E1, E2, E3, introns I1, I2 and the 5'SS and 3'SS splice sites; alternative splicing yields two isoforms (E1-E2-E3 and E1-E3), shown for Cell type 1 and Cell type 2]

16 New approaches in data analysis
RNA-Seq data files are too large:
- Among the 4717 RNA-Seq studies on human available at NCBI on Jun. 10, 2015, 141 studies each contributed more than 1 TB of nucleotide bases.
- Even NCBI has found it difficult to keep pace with the explosive growth of RNA-Seq data.
The RNA-Seq data do not need to be so huge:
- SRR1536586.sra (E. coli K12) contains 6,503,557 sequences of 50 nt each, but 195,310 of them are identical, all from sites 929-978 in E. coli 23S rRNA genes. No information is lost if these 195,310 identical sequences are represented by a single sequence with a sequence ID such as SeqID_195310.
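The idea of storing each distinct read once, with its copy number in the ID, can be sketched as follows. This is an illustration, not the author's tool: it discards the quality lines, the helper names are hypothetical, and the ID pattern SeqGroup<k>_<count> is borrowed from the SeqGroup1_3 style records shown on a later slide.

```python
from collections import Counter

def fastq_sequences(lines):
    """Yield the sequence line of each 4-line FASTQ record."""
    for i, line in enumerate(lines):
        if i % 4 == 1:          # line 2 of every record is the sequence
            yield line.strip()

def collapse_reads(sequences):
    """Group identical reads; return (id, sequence) pairs, most frequent first.

    The copy number is appended to the ID, so no count information is lost.
    """
    counts = Counter(sequences)
    return [
        (f"SeqGroup{k}_{n}", seq)
        for k, (seq, n) in enumerate(counts.most_common(), start=1)
    ]

def to_fasta(records):
    """Render collapsed reads as FASTA-style text."""
    return "\n".join(f">{seq_id}\n{seq}" for seq_id, seq in records)

# Tiny made-up FASTQ input (4 reads, 2 distinct sequences)
fastq = [
    "@r1", "GATT", "+", "IIII",
    "@r2", "GATT", "+", "IIII",
    "@r3", "ACGT", "+", "IIII",
    "@r4", "GATT", "+", "IIII",
]
records = collapse_reads(fastq_sequences(fastq))
print(to_fasta(records))
```

Because the grouping is by exact sequence identity, downstream mapping needs to be run once per distinct sequence, and each hit is then weighted by the copy number in the ID.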

17 Most frequent 50-mers in SRR1536586.sra
Gene              N copy    Gene              N copy
LSU rRNA          195310    LSU rRNA           14193
LSU rRNA           86308    hisR (2)           13720
5S rRNA            73440    hisR (2)           13618
LSU rRNA           58400    LSU rRNA           13615
SSU rRNA           47323    LSU rRNA           13012
LSU rRNA           45695    5S rRNA            13001
LSU rRNA           36258    LSU rRNA           12820
5S rRNA            33674    LSU rRNA           12695
SSU rRNA           30417    LSU rRNA           12523
LSU rRNA           29508    SSU rRNA           11696
5S rRNA            28187    LSU rRNA           11298
LSU rRNA           24982    glnX_V (1)         11081
SSU rRNA           23286    5S rRNA            10968
LSU rRNA           19991    5S rRNA            10890
SSU rRNA           19268    5S rRNA            10750
glnX_V (1)         18652    b3555|b3556 (3)    10513
LSU rRNA           18381    LSU rRNA           10362
hisR (2)           18354    LSU rRNA           10164
LSU rRNA           18300    LSU rRNA           10000
LSU rRNA           17113    trpT                9955
glnX_V (1)         16902    rpsE (4)            9877
LSU rRNA           16796    LSU rRNA            9090
LSU rRNA           14642    rplV (4)            9071

18 Next-Generation sequencing
FASTQ files:
@SEQ_ID1.1
NATTTGGGGTTCAAAGCA...
+
!''*((((***+))%%+...
@SEQ_ID2.1
GATTTGGGGTTCAAAGCA...
+
%''*((((***+))%%+...
......
Workflow: submission to one of the three data centers (NCBI, DDBJ, EBI) as SRA (Sequence Read Archive) compressed files; storage, transmission and analysis; download by researchers.
FASTAQ+ file:
>SeqGroup1_3
GATTTGGGGTTCA
>SeqGroup2_391
GATTTGGGGTTCAAAGCA
>SeqGroup3_92
GATTTGGGGTTCAAAGCA
>SeqGroup4_512
GATTTGGGGTTCAAAGCA
......
Analysis routes:
- De novo genome assembly
- Sequence reads matching/aligning against a known genome
Key research objectives:
- Differential gene expression
- Ribosomal profiling
- Alternative splicing
- Gene discovery
- Signal at TSS and TTS
- ……

19 Formatted BLAST output
b0001|190_255,SeqGr49062_16,100.00,49,0,0,18,66,1,49,3e-019,91.6
b0001|190_255,SeqGr382517_1,100.00,48,0,0,19,66,1,48,1e-018,89.8
b0001|190_255,SeqGr536414_1,100.00,46,0,0,21,66,1,46,2e-017,86.1
b0001|190_255,SeqGr181138_10,100.00,45,0,0,22,66,1,45,5e-017,84.2
b0001|190_255,SeqGr138539_1,100.00,44,0,0,23,66,1,44,2e-016,82.4
b0001|190_255,SeqGr297866_1,100.00,42,0,0,25,66,1,42,3e-015,78.7
b0002|337_2799,SeqGr935243_1,100.00,50,0,0,185,234,1,50,4e-018,93.5
b0002|337_2799,SeqGr925087_1,100.00,50,0,0,1398,1447,1,50,4e-018,93.5
b0002|337_2799,SeqGr922536_1,100.00,50,0,0,2050,2099,1,50,4e-018,93.5
b0002|337_2799,SeqGr918509_1,100.00,50,0,0,201,250,1,50,4e-018,93.5
……
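One way to turn output like this into per-gene read counts is to sum, for each gene, the copy numbers carried in the read-group IDs. The sketch below rests on two assumptions that the slide does not spell out: that the 12 comma-separated fields follow the standard BLAST tabular layout (query, subject, % identity, alignment length, mismatches, gap opens, query start/end, subject start/end, E-value, bit score), and that the number after the last underscore in an ID such as SeqGr49062_16 is the number of identical reads in that group. The function name is illustrative.

```python
import csv
from collections import defaultdict

def count_reads_per_gene(lines):
    """Sum read-group copy numbers for each gene in BLAST-style CSV rows."""
    counts = defaultdict(int)
    for row in csv.reader(lines):
        if len(row) < 12:
            continue  # skip blank lines, ellipses and other non-hit rows
        gene, group_id = row[0], row[1]
        try:
            # ASSUMPTION: the suffix after the last "_" is the copy number
            copies = int(group_id.rsplit("_", 1)[1])
        except (IndexError, ValueError):
            continue
        counts[gene] += copies
    return dict(counts)

blast_lines = [
    "b0001|190_255,SeqGr49062_16,100.00,49,0,0,18,66,1,49,3e-019,91.6",
    "b0001|190_255,SeqGr382517_1,100.00,48,0,0,19,66,1,48,1e-018,89.8",
    "b0001|190_255,SeqGr181138_10,100.00,45,0,0,22,66,1,45,5e-017,84.2",
    "b0002|337_2799,SeqGr935243_1,100.00,50,0,0,185,234,1,50,4e-018,93.5",
    "b0002|337_2799,SeqGr925087_1,100.00,50,0,0,1398,1447,1,50,4e-018,93.5",
    "……",
]
gene_counts = count_reads_per_gene(blast_lines)
print(gene_counts)
```

A real pipeline would also need a rule for read groups that hit more than one gene, such as the proportional allocation used for paralogues earlier in the talk; this sketch ignores that case.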

20 Gene expression output
Gene                 SeqLen   Count   Count/Kb       FPKM
thrL|190_255             66      76   1151.515    389.894
thrA|337_2799          2463    2963   1203.004    407.328
thrB|2801_3733          933    1121   1201.501    406.819
thrC|3734_5020         1287    1782   1384.615    468.82
yaaX|5234_5530          297      97    326.599    110.584
yaaA|C5683_6459         777     113    145.431     49.242
yaaJ|C6529_7959        1431     143     99.93      33.836
talB|8238_9191          954    1561   1636.268    554.028
mog|9306_9893           588     289    491.497    166.417
yaaH|C9928_10494        567     100    176.367     59.716
yaaW|C10643_11356       714      13     18.207      6.165
yaaI|C11382_11786       405       2      4.938      1.672
dnaK|12163_14079       1917    6863   3580.073   1212.186
dnaJ|14168_15298       1131    1671   1477.454    500.255
insL1|15445_16557      1113     584    524.708    177.662
mokC|C16751_16960       210      20     95.238     32.247
hokC|C16751_16903       153       6     39.216     13.278
nhaA|17489_18655       1167     518    443.873    150.292
……

