An Introduction to Studying Expression Data Through RNA-seq By Jason Van Houten
Outline Why do we study RNA? What is RNA-seq? Issues with RNA quality Brief overview of how to make RNA-seq libraries Choices to make (depth, paired-end, cost, strand specificity, etc.) Examples
Before we start… Please ask questions!! ▫If I’m not clear or you have a comment, don’t be afraid to stop me. There is more than one right way to do this type of analysis ▫What I present is only one variation. I am not the expert! I am always learning new things, so if you have a thought or opinion, don’t be afraid to chime in.
Gene Expression The Central Dogma ▫Each level can tell us something different
Why Study RNA over DNA? Functional studies ▫ Genome may be constant but an experimental condition has a pronounced effect on gene expression e.g. Drug treated vs. untreated cell line e.g. Wild type versus knock out mice Some molecular features can only be observed at the RNA level ▫ Alternative isoforms, fusion transcripts, RNA editing Predicting transcript sequence from genome sequence is difficult ▫ Alternative splicing, RNA editing, etc.
Why Study RNA over DNA? Interpreting mutations that do not have an obvious effect on protein sequence – ‘Regulatory’ mutations that affect what mRNA isoform is expressed and how much e.g. splice sites, promoters, exonic/intronic splicing motifs, etc. Prioritizing protein coding somatic mutations (often heterozygous) – If the gene is not expressed, a mutation in that gene would be less interesting – If the gene is expressed but only from the wild type allele, this might suggest loss-of-function (haploinsufficiency) – If the mutant allele itself is expressed, this might suggest a candidate drug target
What is RNA-seq? Whole Transcriptome Shotgun Sequencing ▫High-throughput sequencing of cDNA to gain information about a sample’s RNA content. “Transcription Snap-shot” ▫Know the expression levels of every gene in the genome at that particular point in time.
What is Next Gen Seq? Video! Briefly discusses library prep and how sequencing works
We start by asking a question Condition 1 (normal colon) Condition 2 (colon tumor)
We start by asking a question What genes are turned on or off under these conditions? ▫What about whole gene pathways? A change in the expression of one gene can affect the expression of many. Condition 1 (normal colon) Condition 2 (colon tumor)
RNA sequencing Overview Samples of interest: Condition 1 (normal colon), Condition 2 (colon tumor) → Isolate RNAs → Fragment, generate cDNA, add adapters, size select, PCR amplify → Sequence ends (100s of millions of paired reads, 10s of billions of bases of sequence) → Map to genome, transcriptome, and predicted exon junctions → Downstream analysis
Challenges of Studying RNA RNAs consist of small exons that may be separated by large introns ▫ Mapping reads to the genome is challenging The relative abundance of RNAs varies wildly ▫ Over a 10^5 to 10^7 fold range (5 to 7 orders of magnitude) ▫ Since RNA sequencing works by random sampling, a small fraction of highly expressed genes may consume the majority of reads ▫ e.g. ribosomal and mitochondrial genes RNAs come in a wide range of sizes ▫ Small RNAs must be captured separately ▫ PolyA selection of large RNAs may result in 3’ end bias RNA is fragile compared to DNA (easily degraded)
Quality Very good RNA, RIN of 10 Still good, RIN of 8.9 Starting to get worse, RIN of 6.3 Agilent Bioanalyzer
Quality RIN 3 RIN 2.2
Best Practice RNA is highly susceptible to degradation by RNAse enzymes. RNAse enzymes are present in cells and tissues and can be carried on hands, labware, or even dust. They are very stable and difficult to inactivate. For these reasons, it is important to follow best laboratory practices while preparing and handling RNA samples. ▫ When harvesting total RNA, use a method that quickly disrupts tissue and isolates and stabilizes RNA. ▫ Wear gloves and use sterile technique at all times. ▫ Reserve a set of pipettes for RNA work. Use sterile RNAse-free filter pipette tips to prevent cross-contamination. ▫ Use disposable plasticware that has been certified to be RNAse-free. ▫ All reagents should be prepared from RNAse-free components, including ultrapure water. ▫ Store RNA samples by freezing. Keep samples on ice at all times while working with them. Avoid extended pauses in the protocol until the RNA has been reverse transcribed into DNA. ▫ Use RNAse/DNAse decontamination solution to decontaminate work surfaces and equipment.
Now we are ready to sequence!
Length of Reads/single vs. paired Longer reads give you better alignment confidence Maximizes sequencing coverage on the flow cell ▫Coverage: the average number of sequences representing a particular region of the transcriptome Paired ends help deduce large insertions/deletions/rearrangements Drawback: it costs more
Depth The number of reads per sample/library More depth means you are more likely to see genes that are expressed at very low levels ~200 million reads can be generated per lane on a flowcell Nat Rev Genet , 57-63
How much depth do you need? Depends on application ▫Differential gene expression, variant detection 10x – 30x coverage If you’re interested in lower expressed genes, then you might still need more. ▫For applications like transcriptome assemblies Much more depth needed ▫So we choose how many samples to add to a lane for sequencing depending on how much depth we need.
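The depth arithmetic can be sketched in a few lines of Python. This is a rough illustration only: the transcriptome size and read length are made-up values, coverage is treated as uniform (which real RNA-seq coverage is not, since expression levels vary so much), and the function names are hypothetical.

```python
import math

def mean_coverage(n_reads, read_length, target_size_bp, paired=False):
    """Average number of times each base of the target is sequenced."""
    bases_per_fragment = read_length * (2 if paired else 1)
    return n_reads * bases_per_fragment / target_size_bp

def reads_needed(coverage, read_length, target_size_bp, paired=False):
    """Reads (or read pairs, if paired=True) needed to reach a mean coverage."""
    bases_per_fragment = read_length * (2 if paired else 1)
    return math.ceil(coverage * target_size_bp / bases_per_fragment)

# Hypothetical numbers, for illustration only.
TRANSCRIPTOME_BP = 100_000_000   # assumed size of the expressed transcriptome
READ_LENGTH = 100                # assumed single-end read length
READS_PER_LANE = 200_000_000     # "~200 million reads per lane" from the Depth slide

per_sample = reads_needed(30, READ_LENGTH, TRANSCRIPTOME_BP)
samples_per_lane = READS_PER_LANE // per_sample
print(per_sample, samples_per_lane)
```

Under these assumptions, 30x coverage needs 30 million reads per sample, so about 6 samples fit in one lane.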
Multiplexing Add a “barcode” to each sample/library then mix and sequence ▫A unique string of nucleotides within the adapter ▫Using the barcode, sequenced reads can be traced back to their appropriate sample. (Figure: libraries with barcodes B1, B2, B3, and B4 are mixed, then sequenced together.)
Cost In our lab, it costs us only about $30 per library to construct ourselves. Additionally, you have sequencing costs ▫Depends on length (cycles)/paired end ▫Depends on facility and machine $ per lane HiSeq2000 Again, multiplexing reduces cost per sample
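Tracing reads back to their sample can be sketched as below. This is a simplified illustration, not how any particular sequencer or demultiplexing tool works: it assumes the barcode sits at the start of each read and must match exactly, and the barcode sequences and the `demultiplex` helper are invented for the example.

```python
from collections import defaultdict

def demultiplex(reads, barcode_to_sample, barcode_len):
    """Group reads by sample using an exact match on a leading barcode."""
    bins = defaultdict(list)
    for read in reads:
        barcode, insert = read[:barcode_len], read[barcode_len:]
        # Reads whose barcode is not recognized are set aside, not guessed.
        sample = barcode_to_sample.get(barcode, "unassigned")
        bins[sample].append(insert)
    return dict(bins)

# Hypothetical 4-nt barcodes for the four libraries B1..B4.
barcodes = {"ACGT": "B1", "TGCA": "B2", "GATC": "B3", "CTAG": "B4"}
reads = ["ACGTTTTGGA", "TGCAGGGCCC", "ACGTAAACCC", "NNNNACGTAC"]

result = demultiplex(reads, barcodes, barcode_len=4)
print(result)
```

Real demultiplexers usually tolerate a mismatch or two in the barcode. Exact matching keeps the sketch short.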
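The effect of multiplexing on cost per sample can be shown with a short calculation. The $30 library cost is the figure quoted above. The lane cost below is a placeholder, not a real price, and `cost_per_sample` is a hypothetical helper.

```python
def cost_per_sample(library_cost, lane_cost, samples_per_lane):
    """Library prep cost plus an equal share of one sequencing lane."""
    if samples_per_lane < 1:
        raise ValueError("samples_per_lane must be at least 1")
    return library_cost + lane_cost / samples_per_lane

LIBRARY_COST = 30.0   # per-library cost quoted on the slide
LANE_COST = 2000.0    # placeholder value; actual lane prices depend on facility and machine

for n in (1, 4, 8):
    print(n, cost_per_sample(LIBRARY_COST, LANE_COST, n))
```

The library cost is fixed per sample, so the savings come entirely from splitting the lane, at the price of fewer reads (less depth) per sample.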
Advantages of RNA-Seq compared with other transcriptomics methods
Typical Differential Gene Expression Workflow Raw reads → Filter reads → Align to reference genome/transcriptome → Count reads that map to genes → Run statistical tests → Evaluate genes that are differentially expressed. Also shown: Assemble transcriptome
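The "count reads that map to genes" step can be illustrated with a toy example. In practice this is done with dedicated tools such as htseq-count or featureCounts on real alignment files. The sketch below only shows the idea, using made-up gene coordinates and alignments given as half-open (chromosome, start, end) intervals, and it discards reads that overlap more than one gene as ambiguous.

```python
def count_reads_per_gene(genes, alignments):
    """Count aligned reads that overlap exactly one gene."""
    counts = {name: 0 for name in genes}
    for chrom, start, end in alignments:
        hits = [
            name
            for name, (g_chrom, g_start, g_end) in genes.items()
            if g_chrom == chrom and start < g_end and end > g_start
        ]
        # Reads hitting no gene, or more than one gene, are not counted.
        if len(hits) == 1:
            counts[hits[0]] += 1
    return counts

# Hypothetical gene models and alignments.
genes = {
    "geneA": ("chr1", 100, 500),
    "geneB": ("chr1", 450, 900),
    "geneC": ("chr2", 0, 300),
}
alignments = [
    ("chr1", 120, 220),   # geneA
    ("chr1", 300, 400),   # geneA
    ("chr1", 460, 560),   # overlaps geneA and geneB: ambiguous, skipped
    ("chr1", 600, 700),   # geneB
    ("chr2", 250, 350),   # geneC
    ("chr3", 10, 110),    # no gene
]

counts = count_reads_per_gene(genes, alignments)
print(counts)
```

These raw counts are what the statistical tests in the next step take as input.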
Strand Specificity BMC Genomics 2012, 13:721
FPKM (RPKM): Expression Normalization Fragments (Reads) Per Kilobase of exon model per Million mapped fragments C= the number of reads mapped onto the gene's exons (raw counts) N= total number of reads in the experiment L= the sum of the exons in base pairs (size of gene). Example 1: Large gene #1 with 100 reads and small gene #2 with 100 reads ▫ Gene1
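With C, N, and L as defined above, the standard calculation is RPKM = 10^9 × C / (N × L). A minimal sketch follows. The gene lengths and the total read count are invented for illustration and are not the values from the slide's own example.

```python
def rpkm(c, n, l):
    """Reads per kilobase of exon model per million mapped reads.

    c: reads mapped to the gene's exons (raw count)
    n: total number of mapped reads in the experiment
    l: summed exon length of the gene, in base pairs
    """
    if n <= 0 or l <= 0:
        raise ValueError("n and l must be positive")
    return 1e9 * c / (n * l)

# Hypothetical values: two genes with the same raw count but different lengths.
TOTAL_READS = 10_000_000
large_gene = rpkm(c=100, n=TOTAL_READS, l=4000)
small_gene = rpkm(c=100, n=TOTAL_READS, l=1000)
print(large_gene, small_gene)
```

Both genes have 100 reads, but the shorter gene gets a four times higher value (10.0 versus 2.5), because the same number of reads is packed into a quarter of the length.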
Conclusions We know what RNA-seq is! ▫Library preparation ▫Next Generation Sequencing RNA quality is very important ▫3’ bias ▫Tips to protect Things to consider within the cost versus information balance Introduced some analysis
Acknowledgements I thank HHMI, the van der Knaap lab, Dr. Dean Fraga and everyone involved in this workshop for making this possible