RNA-Seq data analysis Xuhua Xia University of Ottawa



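RNA-Seq
[Figure: genes (Gene1, Gene2, Gene3) on the genome are transcribed into the transcriptome; RNA-Seq reads are recorded in FASTQ format and stored as SRA files for data storage, transmission and analysis.]
FASTQ: GATTTGGGGTTCAAAGCA... + GATTTGGGGTTCAAAGCA... + !''*((((***+))%%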

Next-Generation sequencing
FASTQ: NATTTGGGGTTCAAAGCA... + GATTTGGGGTTCAAAGCA... + %''*((((***+))%%
Workflow: submission to one of the three data centers (NCBI, DDBJ, EBI) as SRA (Sequence Read Archive) compressed files; storage, transmission and analysis; download by researchers.
Use of the reads: de novo genome assembly, or matching/aligning sequence reads against a known genome.
Key research objectives:
– Differential gene expression
– Ribosomal profiling
– Alternative splicing
– Gene discovery
– Signal at TSS and TTS
– ……
Quality assessment: the Phred quality score is $Q = -10 \log_{10} p$, where $p$ is the base-calling error probability.
1. Global quality assessment
2. Read-specific quality assessment
3. Site-specific quality assessment
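The Phred relation on this slide can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the ASCII offset of 33 (the Sanger / Illumina 1.8+ convention) is an assumption, since the slide does not say which encoding the example quality string uses.

```python
import math

def phred_to_error_prob(q):
    """Error probability p for a Phred score Q, from Q = -10*log10(p)."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred score Q for a base-calling error probability p."""
    return -10 * math.log10(p)

def decode_quality(qual, offset=33):
    """Convert a FASTQ quality string to Phred scores.

    offset=33 is an assumption (Sanger / Illumina 1.8+ encoding).
    """
    return [ord(ch) - offset for ch in qual]

# Quality string taken from the FASTQ excerpt on the first RNA-Seq slide.
scores = decode_quality("!''*((((***+))%%")
mean_q = sum(scores) / len(scores)          # read-specific quality
print(scores)
print(round(mean_q, 2), phred_to_error_prob(20))
```

Averaging the scores over one read gives a read-specific measure; averaging the scores at one position over all reads would give a site-specific measure.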

Quality assessment: Nucleotide
[Figure panels: SRR : Single, ReadLen = 50; SRR892245: Paired, ReadLen = 100; SRR : Paired, ReadLen = 250]

Read-based quality
[Figure panels: SRR : Paired, excluding N-containing reads; SRR : Single, ReadLen = 50]

Site-specific quality by nucleotide
[Figure panels (SRR ): fully resolved paired reads (Read 1, Read 2); paired reads containing unresolved nucleotides (Read 1, Read 2)]

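Gene expression
[Figure: reads from the transcriptome are mapped to Gene1, Gene2 and Gene3 on the genome with BLAST, FASTA, etc. (more details later) and counted.]
Counts: $N_1 = 6$, $N_2 = 29$, $N_3 = 4$; $N_{TMR}$ is the total number of mapped reads (TMR), with $N_{TMR}/10^6 = 2.23$ in this example.
Lengths: $L_1 = 500$ nt, $L_2 = 3000$ nt, $L_3 = 400$ nt.
Gene expression (GE):
$\mathrm{FPKM}_1 = (1000 N_1/L_1)(10^6/N_{TMR}) = (1000 \times 6/500)/2.23 = 5.38$
$\mathrm{FPKM}_2 = (1000 \times 29/3000)/2.23 = 4.33$
$\mathrm{FPKM}_3 = (1000 \times 4/400)/2.23 = 4.48$
FPKM: Fragments Per Kilobase of exon per Million reads. "Per kilobase" allows fair comparison among genes; "per million reads" allows fair comparison among samples.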
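The FPKM arithmetic on this slide can be checked with a short Python sketch. The function name `fpkm` is hypothetical, and the total of 2.23 million mapped reads is an assumption inferred from the divisor 2.23 used on the slide.

```python
def fpkm(count, length_nt, total_mapped_reads):
    """Fragments per kilobase of exon per million mapped reads."""
    per_kb = 1000 * count / length_nt            # fair comparison among genes
    return per_kb * 1e6 / total_mapped_reads     # fair comparison among samples

# Assumption: N_TMR = 2.23 million, inferred from the divisor on the slide.
N_TMR = 2.23e6

# (read count, gene length in nt) as given on the slide
genes = {"Gene1": (6, 500), "Gene2": (29, 3000), "Gene3": (4, 400)}

fpkm_values = {g: fpkm(n, length, N_TMR) for g, (n, length) in genes.items()}
for gene, value in fpkm_values.items():
    print(gene, round(value, 2))
```

Gene2 has by far the most reads, yet its FPKM is the lowest of the three, because it is also six times longer than Gene1.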

Gene expression with duplicated genes
[Figure: Paralogue A and Paralogue B aligned, with three kinds of segment: identical; different but with clear homology; homology lost in evolution.]
Read counts: $N_{A.H} = 6$, $N_{B.H} = 3$, $N_{A.U} = 4$, $N_{B.U} = 3$, $N_I = 29$
$P_A = (N_{A.H} + N_{A.U})/(N_{A.H} + N_{B.H} + N_{A.U} + N_{B.U}) = (6+4)/(6+3+4+3) = 0.625$
$N_A = N_{A.H} + N_{A.U} + N_I P_A = 6 + 4 + 29 \times 0.625 = 28.125$
$N_B = N_{B.H} + N_{B.U} + N_I (1-P_A) = 3 + 3 + 29 \times 0.375 = 16.875$
Scale $N_A$ and $N_B$ to FPKM.
Subscripts: H - different but homologous; I - identical segment; U - unique/divergent segment
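The allocation rule on this slide, in which reads from the identical segment are split in proportion to the reads that can be assigned unambiguously, might be sketched as follows. The function name and the dictionary layout are illustrative assumptions, not part of the slides.

```python
def allocate_shared_reads(distinct_counts, n_shared):
    """Split reads from an identical segment among paralogues.

    distinct_counts maps each paralogue to its unambiguous read count
    (N_H + N_U); n_shared is N_I, the reads from the identical segment.
    Each paralogue receives n_shared * P, where P is its share of the
    unambiguous reads.
    """
    total = sum(distinct_counts.values())
    if total == 0:
        raise ValueError("no unambiguous reads to base the proportions on")
    return {gene: n + n_shared * n / total
            for gene, n in distinct_counts.items()}

# Counts from the slide: N_A.H + N_A.U = 6 + 4, N_B.H + N_B.U = 3 + 3, N_I = 29
estimates = allocate_shared_reads({"A": 6 + 4, "B": 3 + 3}, 29)
print(estimates)
```

Because the proportions sum to 1, the estimated counts always add up to the total number of reads (here 45), and the same function works for any number of paralogues.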

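Duplicated genes of different lengths
[Figure: Paralogue A and Paralogue B aligned, with three kinds of segment: identical; different but with clear homology; homology lost in evolution. The unique segments have different lengths, $L_{A.U}$ and $L_{B.U}$.]
Read counts: $N_{A.H} = 6$, $N_{B.H} = 3$, $N_{A.U} = 4$, $N_{B.U} = 3$, $N_I = 29$
Two alternatives:
1. Use only the homologous segment: $P_A = N_{A.H}/(N_{A.H} + N_{B.H}) = 6/(6+3) = 0.667$
2. Use length-normalized counts: $n_{A.H} = 6/L_H$; $n_{B.H} = 3/L_H$; $n_{A.U} = 4/L_{A.U}$; $n_{B.U} = 3/L_{B.U}$
$P_A = (n_{A.H} + n_{A.U})/(n_{A.H} + n_{B.H} + n_{A.U} + n_{B.U})$
With either alternative:
$N_A = N_{A.H} + N_{A.U} + N_I P_A$
$N_B = N_{B.H} + N_{B.U} + N_I (1-P_A)$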
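A minimal Python sketch of the two alternatives follows. The slide gives no segment lengths, so the values of `L_H`, `L_A_U` and `L_B_U` below are purely hypothetical, chosen only to show how unequal unique-segment lengths change the proportion.

```python
def p_a_homologous_only(n_a_h, n_b_h):
    """Alternative 1: proportion based on the homologous segment only."""
    return n_a_h / (n_a_h + n_b_h)

def p_a_length_normalized(n_a_h, n_b_h, n_a_u, n_b_u, l_h, l_a_u, l_b_u):
    """Alternative 2: proportion based on reads per nucleotide."""
    dens_a = n_a_h / l_h + n_a_u / l_a_u
    dens_b = n_b_h / l_h + n_b_u / l_b_u
    return dens_a / (dens_a + dens_b)

# Read counts from the slide
N_A_H, N_B_H, N_A_U, N_B_U, N_I = 6, 3, 4, 3, 29

# Hypothetical segment lengths in nt (not given on the slide)
L_H, L_A_U, L_B_U = 300, 200, 100

p1 = p_a_homologous_only(N_A_H, N_B_H)
p2 = p_a_length_normalized(N_A_H, N_B_H, N_A_U, N_B_U, L_H, L_A_U, L_B_U)

for label, p_a in (("alternative 1", p1), ("alternative 2", p2)):
    n_a = N_A_H + N_A_U + N_I * p_a
    n_b = N_B_H + N_B_U + N_I * (1 - p_a)
    print(label, round(p_a, 3), round(n_a, 2), round(n_b, 2))
```

When all segment lengths are equal, alternative 2 reduces to the raw-count proportion of 0.625 used for paralogues of equal length.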

Gene expression with duplicated genes
[Figure: Paralogue A, Paralogue B and Paralogue C aligned, with three kinds of segment: identical; different but with clear homology; homology lost in evolution.]
Read counts: $N_{A.H} = 6$, $N_{B.H} = 2$, $N_{C.H} = 1$, $N_{A.U} = 4$, $N_{B.U} = 2$, $N_{C.U} = 1$, $N_I = 29$
$P_A = (N_{A.H} + N_{A.U})/(N_{A.H} + N_{B.H} + N_{C.H} + N_{A.U} + N_{B.U} + N_{C.U}) = (6+4)/(6+2+1+4+2+1) = 0.625$
$N_A = N_{A.H} + N_{A.U} + N_I P_A = 6 + 4 + 29 \times 0.625 = 28.125$
$N_B = N_{B.H} + N_{B.U} + N_I P_B$
$N_C = N_{C.H} + N_{C.U} + N_I P_C$
where $P_B$ and $P_C$ are defined in the same way as $P_A$.
Subscripts: H - different but homologous; I - identical segment; U - unique/divergent segment

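Multiple paralogues
[Figure: three paralogues labelled PG1 (309), PG2 (204) and PG3 (101), with a table of $N_H$, $N_I$ and $N_U$ for PG3, PG2 and PG1.]
$N_3 = \ldots \times P_3$
$N_2 = \ldots \times 204/(204+309) + 600 \times P_2$
$N_1 = \ldots \times 309/(204+309) + 600 \times P_1$
$P_3 = (\ldots)/(\ldots)$
$P_2 = (1-P_3) \times 204/(204+309)$
$P_1 = (1-P_3) \times 309/(204+309)$
More details later on tree reconstruction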

Ribosomal density
Mean density adjusted for mRNA length. The confounding effect of elongation efficiency.
Xia et al., Genetics

Poly(A) length & protein synthesis
Xia et al., Genetics

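Transcription: TSS and TTS
[Figure: a gene (AUG… …UAA) with two transcription start sites (TSS 1, TSS 2) and two transcription termination sites (TTS 1, TTS 2), compared between Exp. 1 and Exp. 2.]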

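Alternative splicing
[Figure: a pre-mRNA with exons E1, E2, E3 and introns I1, I2, with the 5'SS and 3'SS marked; alternative splicing yields one mRNA with E1, E2 and E3 and another with E1 and E3 only, in cell type 1 and cell type 2.]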

New approaches in data analysis
RNA-Seq data files are too large:
– Among the 4717 RNA-Seq studies on human available at NCBI on Jun. 10, 2015, 141 studies each contributed more than 1 TB of nucleotide bases.
– Even NCBI has found it difficult to keep pace with the explosive growth of RNA-Seq data.
The RNA-Seq data do not need to be so huge:
– SRR sra (E. coli K12) contains 6,503,557 sequences of 50 nt each, but many of these sequences are identical copies, all from sites in E. coli 23S rRNA genes. No information is lost if each set of identical sequences is represented by a single sequence with a sequence ID such as SeqID_ followed by a number.

Most frequent 50-mers in SRR sra
Gene               N_copy    Gene               N_copy
LSU rRNA           195310    LSU rRNA           14193
LSU rRNA           86308     hisR (2)           …
5S rRNA            73440     hisR (2)           …
LSU rRNA           58400     LSU rRNA           13615
SSU rRNA           47323     LSU rRNA           13012
LSU rRNA           45695     5S rRNA            13001
LSU rRNA           36258     LSU rRNA           …
5S rRNA            33674     LSU rRNA           12695
SSU rRNA           30417     LSU rRNA           12523
LSU rRNA           29508     SSU rRNA           …
5S rRNA            28187     LSU rRNA           11298
LSU rRNA           24982     glnX_V (1)         …
SSU rRNA           23286     5S rRNA            10968
LSU rRNA           19991     5S rRNA            10890
SSU rRNA           19268     5S rRNA            10750
glnX_V (1)         18652     b3555|b3556 (3)    …
LSU rRNA           18381     LSU rRNA           10362
hisR (2)           18354     LSU rRNA           10164
LSU rRNA           18300     LSU rRNA           10000
LSU rRNA           17113     trpT               9955
glnX_V (1)         16902     rpsE (4)           9877
LSU rRNA           16796     LSU rRNA           9090
LSU rRNA           14642     rplV (4)           9071

Next-Generation sequencing
FASTQ: NATTTGGGGTTCAAAGCA... + GATTTGGGGTTCAAAGCA... + %''*((((***+))%%
Workflow: submission to one of the three data centers (NCBI, DDBJ, EBI) as SRA (Sequence Read Archive) compressed files; storage, transmission and analysis; download by researchers.
Use of the reads: de novo genome assembly, or matching/aligning sequence reads against a known genome.
Key research objectives:
– Differential gene expression
– Ribosomal profiling
– Alternative splicing
– Gene discovery
– Signal at TSS and TTS
– ……
FASTAQ+ file (also available for download):
>SeqGroup1_3
GATTTGGGGTTCA
>SeqGroup2_391
GATTTGGGGTTCAAAGCA
>SeqGroup3_92
GATTTGGGGTTCAAAGCA
>SeqGroup4_512
GATTTGGGGTTCAAAGCA
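Collapsing identical reads into one record per distinct sequence can be sketched in Python as below. This is an illustration rather than the author's implementation: the function names are hypothetical, and the header layout `SeqGroup<index>_<copy number>` is an assumption based on the FASTAQ+ excerpt, with groups numbered in order of first appearance.

```python
from collections import Counter

def collapse_reads(reads):
    """Collapse identical reads into (header, sequence) records.

    Headers follow the assumed pattern SeqGroup<index>_<copy number>.
    Groups are numbered in order of first appearance in the input.
    """
    counts = Counter(reads)          # preserves first-seen order (Python 3.7+)
    return [(f"SeqGroup{i}_{n}", seq)
            for i, (seq, n) in enumerate(counts.items(), start=1)]

def to_fasta(records):
    """Render records as FASTA text, one header line and one sequence line each."""
    return "\n".join(f">{header}\n{seq}" for header, seq in records)

# Toy input: six reads, three distinct sequences
reads = ["ACGT", "GGGG", "ACGT", "TTTT", "ACGT", "TTTT"]
records = collapse_reads(reads)
print(to_fasta(records))
```

Downstream tools then search each distinct sequence only once and recover the read count from the header, which is what makes the collapsed file lossless for counting purposes.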

Formatted BLAST output
b0001|190_255,SeqGr49062_16,100.00,49,0,0,18,66,1,49,3e-019,91.6
b0001|190_255,SeqGr382517_1,100.00,48,0,0,19,66,1,48,1e-018,89.8
b0001|190_255,SeqGr536414_1,100.00,46,0,0,21,66,1,46,2e-017,86.1
b0001|190_255,SeqGr181138_10,100.00,45,0,0,22,66,1,45,5e-017,84.2
b0001|190_255,SeqGr138539_1,100.00,44,0,0,23,66,1,44,2e-016,82.4
b0001|190_255,SeqGr297866_1,100.00,42,0,0,25,66,1,42,3e-015,78.7
b0002|337_2799,SeqGr935243_1,100.00,50,0,0,185,234,1,50,4e-018,93.5
b0002|337_2799,SeqGr925087_1,100.00,50,0,0,1398,1447,1,50,4e-018,93.5
b0002|337_2799,SeqGr922536_1,100.00,50,0,0,2050,2099,1,50,4e-018,93.5
b0002|337_2799,SeqGr918509_1,100.00,50,0,0,201,250,1,50,4e-018,93.5
……
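Turning output like this into read counts per gene might look as follows in Python. Two assumptions are made: the columns follow the default BLAST comma-separated tabular order (query, subject, % identity, alignment length, mismatches, gap opens, query start, query end, subject start, subject end, E-value, bit score), and the number after the last underscore in the sequence group ID is the copy number of that read. The function name is hypothetical.

```python
import csv
from collections import defaultdict
from io import StringIO

# The ten hits shown on the slide
BLAST_CSV = """\
b0001|190_255,SeqGr49062_16,100.00,49,0,0,18,66,1,49,3e-019,91.6
b0001|190_255,SeqGr382517_1,100.00,48,0,0,19,66,1,48,1e-018,89.8
b0001|190_255,SeqGr536414_1,100.00,46,0,0,21,66,1,46,2e-017,86.1
b0001|190_255,SeqGr181138_10,100.00,45,0,0,22,66,1,45,5e-017,84.2
b0001|190_255,SeqGr138539_1,100.00,44,0,0,23,66,1,44,2e-016,82.4
b0001|190_255,SeqGr297866_1,100.00,42,0,0,25,66,1,42,3e-015,78.7
b0002|337_2799,SeqGr935243_1,100.00,50,0,0,185,234,1,50,4e-018,93.5
b0002|337_2799,SeqGr925087_1,100.00,50,0,0,1398,1447,1,50,4e-018,93.5
b0002|337_2799,SeqGr922536_1,100.00,50,0,0,2050,2099,1,50,4e-018,93.5
b0002|337_2799,SeqGr918509_1,100.00,50,0,0,201,250,1,50,4e-018,93.5
"""

def count_reads_per_gene(handle):
    """Sum read copy numbers per gene from comma-separated BLAST hits."""
    counts = defaultdict(int)
    for row in csv.reader(handle):
        if len(row) < 12:                 # skip blank or truncated lines
            continue
        gene, seq_group = row[0], row[1]
        # Assumed: the suffix after the last "_" is the copy number.
        counts[gene] += int(seq_group.rsplit("_", 1)[1])
    return dict(counts)

gene_counts = count_reads_per_gene(StringIO(BLAST_CSV))
print(gene_counts)
```

This sketch counts every hit at full weight, so a sequence group matching more than one gene would be counted for each of them; reads shared among paralogues need the proportional allocation described on the duplicated-gene slides.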

Gene expression output
Columns: Gene, SeqLen, Count, Count/Kb, FPKM
thrL|190_
thrA|337_
thrB|2801_
thrC|3734_
yaaX|5234_
yaaA|C5683_
yaaJ|C6529_
talB|8238_
mog|9306_
yaaH|C9928_
yaaW|C10643_
yaaI|C11382_
dnaK|12163_
dnaJ|14168_
insL1|15445_
mokC|C16751_
hokC|C16751_
nhaA|17489_
……………