1 RNAseq Library Preparation and Analysis Basics
MGL users group
2 What does RNAseq provide?
- RNAseq is a quantitative experiment
  - RNAs in a sample are fragmented (if large) and sequenced, giving the relative amounts of the different transcripts
  - The number of reads you obtain for a given transcript is proportional to the amount of it present in the original sample
  - The number of reads recovered per transcript scales directly with:
    - Number of copies of that transcript
    - Size of the transcript (longer => more fragments)
- Higher abundance transcripts may provide differential splicing information
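The scaling described above can be illustrated with a minimal sketch. It assumes an idealized experiment in which the expected share of reads for a transcript is simply (copies x length) divided by the total over all transcripts; the transcript names and numbers are made up for illustration.

```python
def expected_read_shares(transcripts):
    """Return the expected fraction of reads per transcript.

    transcripts: dict mapping name -> (copies, length_nt).
    Assumes (idealized) that reads are drawn in proportion to
    copies * length, with no sequence or fragmentation bias.
    """
    weights = {name: copies * length
               for name, (copies, length) in transcripts.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


# Hypothetical transcripts: B is 3x longer than A, C has 3x the copies of A.
shares = expected_read_shares({
    "A": (100, 1000),
    "B": (100, 3000),
    "C": (300, 1000),
})
for name, share in shares.items():
    print(name, round(share, 3))
```

Under this model B and C receive the same share of reads even though C has three times as many copies, which is why raw read counts need length normalization before abundances are compared.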
3 What does RNAseq NOT provide?
- A measure of translational or protein activity
  - Just because a gene is highly transcribed does not mean that its message is fully engaged with ribosomes
  - Ribosome profiling is an excellent supplement to this end
- A good view of mutations
  - Since read recovery scales with the abundance of a given transcript, some messages are highly (or overly) recovered while others are only sparsely sampled
  - Mutations may affect the ability of an RNA to accumulate enough to be sequenced
  - The ability to identify a mutation depends on sampling it many times
  - DNA sequencing is better suited to finding mutations in genes, as it is independent of expression
4 RNAseq Library Preparation Considerations
- Ribosomal depletion
  - Ribo-Zero Gold kit
  - 98-99% of isolated RNA is rRNA, which will otherwise be the most abundant species in a sequencing experiment
- Amount
  - Ideal input for library preparation is 200 ng of ribosome-depleted RNA
  - 1 to 5 µg of total RNA prior to depletion will also work for transcriptome sequencing experiments
- Small RNA requires special handling separate from transcriptome experiments
5 Library Preparation
- Ribosome-depleted RNA is fragmented if large (mRNA, etc.)
  - Tightly controlled endonucleolytic (RNase III) digestion
  - Cleanup
- Small RNA differences
  - Ensure 5' phosphate, 3' hydroxyl
- End ligation of adapter sequences
- Cleanup (small RNA involves additional size selection)
- Reverse transcription
- PCR amplification with barcoded full adapters
- Cleanup & quantitation for multiplexing/bead preparation
6 Sample state after library preparation
- From this point, the sample is almost identical to a DNA fragment library
  - Adapter sequences vary slightly from DNA-based libraries, but have identical barcoding schemes
  - Samples may be read from both ends; for small RNA, reading from one end may be all that is needed
  - If multiplexing, differently barcoded samples are pooled for bead preparation
7 Bead Preparation and Deposition
- Beads are enriched with single fragment sequences by emulsion PCR and deposited in lanes on a flowcell
- Target deposition:
  - ~230k beads per imaging panel
  - ~150 million beads per lane
8 How many reads are needed?
- A common and debated point:
  - Read count affects how much sampling of the RNA population is occurring
  - Highly abundant RNAs are fairly easy to see, and therefore to quantitate, with low read counts
  - Lower abundance species are harder to catch by random sampling of the population, and therefore more difficult to quantitate accurately
  - At very low read counts, you may not see some trace RNAs at all, simply because none were sampled by chance!
- Target read counts are informed by what you are after in the experiment
  - Complexity of transcriptome (bacteria vs human)?
  - Big changes in expression? Low counts needed
  - Precise measurement of low abundance species? Maybe more
  - Appearance of a minor splice isoform species in the presence of WT isoforms? You may need a lot more
- Basic rule of thumb: 40 million reads for a good picture of human transcriptome-sized complexity
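The point about missing trace RNAs by chance can be made concrete with a small sketch. It assumes each read is an independent draw from the RNA population (a simple binomial model, which real libraries only approximate); the fraction and depths used are hypothetical.

```python
import math


def expected_reads(fraction_of_reads, total_reads):
    """Expected number of reads from an RNA that makes up
    `fraction_of_reads` of all sequenceable fragments."""
    return fraction_of_reads * total_reads


def prob_unseen(fraction_of_reads, total_reads):
    """Probability of sampling zero reads from that RNA,
    assuming independent draws: (1 - f) ** N."""
    # log1p keeps the calculation accurate for very small fractions.
    return math.exp(total_reads * math.log1p(-fraction_of_reads))


# Hypothetical trace RNA contributing 1 in 10 million fragments.
trace_fraction = 1e-7
for depth in (1_000_000, 10_000_000, 40_000_000):
    print(depth,
          expected_reads(trace_fraction, depth),
          round(prob_unseen(trace_fraction, depth), 3))
```

In this model the chance of seeing nothing falls off exponentially with depth, but a handful of expected reads is still far too few for precise quantitation, which is why low-abundance targets push read requirements up.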
9 After Sequencing Procedure
BEFORE MAPPING!
- Assess quality of the run
  - Was the balance between signals fairly even?
  - Was there poor imaging or other aberrations in a cycle?
  - Was the number of reads obtained what you expected?
- Scrub the resulting sequences before trying to map to your transcriptome
  - Remove contaminant high abundance sequences: rRNA, tRNA, adapters
    - Particularly for small RNA experiments, adapter sequences may be found joined to the ends of your targets and need to be trimmed off
  - Trim reads based on quality (clip off trailing low quality calls that may lead to poor or incorrect mapping)
- It's critical to map the best possible sequences, free of avoidable mistakes, in order to get the highest quality quantitation in the end!
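The two trimming steps above can be sketched as follows. This is a simplified illustration rather than a replacement for a dedicated trimming tool: it assumes base-space reads with Phred+33 quality strings (SOLiD colorspace data needs different handling), exact adapter matching with no mismatches, and a made-up adapter sequence.

```python
def trim_adapter(seq, qual, adapter, min_overlap=5):
    """Remove a 3' adapter from a read.

    Looks for the full adapter anywhere in the read, then for an
    adapter prefix of at least `min_overlap` bases at the read end.
    Exact matching only (an assumption; real tools allow mismatches).
    """
    idx = seq.find(adapter)
    if idx == -1:
        longest = min(len(adapter), len(seq)) - 1
        for k in range(longest, min_overlap - 1, -1):
            if seq.endswith(adapter[:k]):
                idx = len(seq) - k
                break
    if idx == -1:
        return seq, qual
    return seq[:idx], qual[:idx]


def trim_low_quality_tail(seq, qual, min_q=20, offset=33):
    """Clip the trailing run of base calls whose quality is below min_q."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


# Hypothetical adapter and reads, for illustration only.
ADAPTER = "TGGAATTCTC"
read_seq = "ACGTACGTAC" + ADAPTER[:6]   # insert followed by partial adapter
read_qual = "I" * 14 + "##"
trimmed = trim_adapter(read_seq, read_qual, ADAPTER)
clipped = trim_low_quality_tail("ACGTACGT", "IIIIII##")
print(trimmed)
print(clipped)
```

Adapter removal is done before quality clipping here so that adapter bases with good qualities are not left behind; the thresholds (`min_overlap`, `min_q`) are illustrative defaults, not recommendations from the slides.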
10 Mapping Procedure
- Depends slightly on your data source and state
  - Illumina sequences are often mapped and quantitated through the Tophat/Bowtie -> Cufflinks pipeline
  - SOLiD colorspace data may be mapped through Lifescope or Novocraft, then quantitated by any tool at that point
- RNAseq mapping typically involves multiple passes:
  - First pass: compare your sequence to known spliced RNA sequences for your genome
  - Second pass: if not identified there, map to the genome, first without allowing large gaps, then allowing large gaps
11 Post mapping
- At this point, data is generally in SAM or BAM format
- Information there includes the read ID, where on the genome it mapped, where its mate mapped if it's a paired-end experiment, the sequence, call qualities, and other information
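As a rough illustration of what a SAM record holds, the sketch below splits one alignment line into the eleven mandatory tab-separated SAM fields and decodes a few FLAG bits. The example record is invented; for real work a library that reads SAM/BAM directly is the usual choice.

```python
from collections import namedtuple

# The 11 mandatory SAM columns, plus any optional tags.
SamRecord = namedtuple(
    "SamRecord",
    "qname flag rname pos mapq cigar rnext pnext tlen seq qual tags",
)


def parse_sam_line(line):
    """Parse one SAM alignment line (not a header line starting with '@')."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("SAM alignment lines need at least 11 fields")
    return SamRecord(
        qname=fields[0],        # read ID
        flag=int(fields[1]),    # bitwise flags
        rname=fields[2],        # reference (chromosome) name
        pos=int(fields[3]),     # 1-based leftmost mapping position
        mapq=int(fields[4]),    # mapping quality
        cigar=fields[5],        # alignment description
        rnext=fields[6],        # reference of the mate ('=' means same)
        pnext=int(fields[7]),   # position of the mate
        tlen=int(fields[8]),    # observed template length
        seq=fields[9],          # read sequence
        qual=fields[10],        # base call qualities
        tags=fields[11:],       # optional fields
    )


def is_paired(rec):
    return bool(rec.flag & 0x1)


def is_unmapped(rec):
    return bool(rec.flag & 0x4)


def is_reverse_strand(rec):
    return bool(rec.flag & 0x10)


# Invented example record.
example = "read1\t99\tchr1\t1000\t50\t4M\t=\t1200\t204\tACGT\tIIII\tNM:i:0"
rec = parse_sam_line(example)
print(rec.qname, rec.rname, rec.pos, is_paired(rec), is_reverse_strand(rec))
```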
12 Visualization of Data
- IGV
- UCSC Genome Browser
- Various other tools
13 Comparing abundances
- Abundance of a given species is typically expressed as RPKM or FPKM
  - Reads Per Kilobase per Million; Fragments Per Kilobase per Million
  - Fragments is preferred for paired-end data, as it only counts pairs where both ends map to the same transcript and doesn't "double count" an individual biological fragment
    - One molecule "spot" on the flow cell = 1 fragment, comprised of one forward and one reverse read
    - Added accuracy control
- Normalizes basic counts per species by the size of the transcript (how likely you are to sample it from a pool of fragments) and the total size of the experiment
- Allows for cross-experiment/sample comparisons
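The normalization described above can be written out as a short sketch. It uses the usual definition, fragments x 10^9 / (transcript length in nt x total mapped fragments); the counts in the example are hypothetical.

```python
def fpkm(fragments, transcript_length_nt, total_mapped_fragments):
    """Fragments Per Kilobase of transcript per Million mapped fragments.

    For single-end data, pass read counts instead and the result is RPKM.
    """
    if transcript_length_nt <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and total fragments must be positive")
    kilobases = transcript_length_nt / 1e3
    millions = total_mapped_fragments / 1e6
    return fragments / (kilobases * millions)


# Hypothetical: 500 fragments on a 2 kb transcript, 20 million mapped fragments.
value = fpkm(500, 2000, 20_000_000)
print(value)

# Same transcript abundance sequenced twice as deeply gives the same FPKM.
print(fpkm(1000, 2000, 40_000_000))
```

Dividing by both length and depth is what makes values comparable between transcripts of different sizes and between runs of different sizes.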
14 Scatter Plot FPKM(1) vs FPKM(2)
- Compare abundances of all RNAs directly
- Usually log scale, as FPKM values range greatly
- Should see mainly a 1:1 correspondence of abundance in most RNAs
[Scatter plot axes: Condition 1, Condition 2]
15 Scatter Plot FPKM(1) vs FPKM(2)
- Most analyses will be interested in what is significantly altered between conditions
- Need to identify globally what has changed and how much confidence we have in those determinations!
[Scatter plot axes: Condition 1, Condition 2]
16 Comparing Abundances
- FPKMs for the same species of RNA can be compared for expression level differences
- Typically, a ratio of the FPKM values is taken and log transformed (usually log base 2)
- Log transformation eases statistical analysis of the population of ratios across all species observed in an experiment
  - What's unusual/real vs what's within experimental noise
- Assign p-values to the modeled significance of the changes being observed
  - Basically: how likely is this to be a sampling difference explainable by chance?
- Apply RNAseq-specific corrections to derive a "q-value" (corrected p-value)
  - Size of transcripts can affect sampling, replicate measurements if in the experiment, etc.
- Apply some cutoff value to the q-value and look at RNAs exceeding that threshold
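A minimal sketch of the ratio and correction steps above. Two assumptions are swapped in for illustration: a pseudocount added to each FPKM so zero values do not break the ratio, and the generic Benjamini-Hochberg procedure standing in for the "RNAseq-specific corrections" (tools such as Cufflinks model more than this). The p-values are made up.

```python
import math


def log2_ratio(fpkm_1, fpkm_2, pseudocount=1.0):
    """log2(condition 2 / condition 1), with a pseudocount (an assumption)
    to avoid division by zero for undetected RNAs."""
    return math.log2((fpkm_2 + pseudocount) / (fpkm_1 + pseudocount))


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values (q-values),
    in the same order as the input."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonic q-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues


# Hypothetical p-values for four RNAs.
pvals = [0.01, 0.04, 0.03, 0.005]
qvals = benjamini_hochberg(pvals)
print(qvals)
print(log2_ratio(3.0, 15.0))  # (15 + 1) / (3 + 1) = 4-fold
```

A cutoff such as q <= 0.05 can then be applied to pick the RNAs to examine further; the threshold itself is a choice left to the analyst.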
17 Volcano plot
- Log ratio between FPKMs plotted against q-value
- Even though apparent ratios between FPKM values may be large, they may simply reflect sampling explainable by chance
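The coordinates of a volcano plot can be sketched without any plotting library. The example below assumes the common convention of log2 ratio on the x axis and -log10(q) on the y axis; the cutoffs and gene names are hypothetical.

```python
import math


def volcano_points(results, q_cutoff=0.05, min_abs_log2=1.0):
    """Turn (name, log2_ratio, q_value) tuples into plot-ready points.

    Returns (name, x, y, label) where x is the log2 ratio, y is
    -log10(q), and label is 'up', 'down', or 'ns' (not significant).
    The cutoffs are illustrative assumptions.
    """
    points = []
    for name, ratio, q in results:
        y = -math.log10(max(q, 1e-300))  # guard against q == 0
        if q <= q_cutoff and ratio >= min_abs_log2:
            label = "up"
        elif q <= q_cutoff and ratio <= -min_abs_log2:
            label = "down"
        else:
            label = "ns"
        points.append((name, ratio, y, label))
    return points


# Hypothetical results: geneB has the largest ratio but a poor q-value.
points = volcano_points([
    ("geneA", 3.0, 0.001),
    ("geneB", 4.0, 0.6),
    ("geneC", -2.0, 0.01),
])
for p in points:
    print(p)
```

Here geneB sits far out on the x axis yet low on the y axis, the case the slide warns about: a large apparent ratio that sampling alone could explain.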
18 What to do with the data?
- An open-ended question dependent upon the experimental goals
- Common procedures:
  - Look at altered RNAs exceeding statistical significance
    - Upregulated and downregulated
    - What pathways or targets are represented here?
      - Enrichment analysis
  - Look at similarly expressed gene patterns globally, without the statistical cutoff, or at other RNAs in the same pathway(s)
  - Look at splice isoform differences suggested by the data
    - Mainly only applicable to paired-end data, as pairs are more likely to span one or more exons
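For the enrichment analysis step listed above, one common approach is a one-sided hypergeometric test: given how many genes are in a pathway, how surprising is the number of them found among the significantly altered RNAs? The slides do not name a method, so this choice is an assumption, and the gene counts are invented. It needs Python 3.8+ for `math.comb`.

```python
from math import comb


def hypergeom_enrichment_p(total_genes, pathway_genes, selected_genes, overlap):
    """P(at least `overlap` pathway genes among the selected genes)
    when `selected_genes` are drawn at random from `total_genes`."""
    max_overlap = min(pathway_genes, selected_genes)
    numerator = sum(
        comb(pathway_genes, k)
        * comb(total_genes - pathway_genes, selected_genes - k)
        for k in range(overlap, max_overlap + 1)
    )
    return numerator / comb(total_genes, selected_genes)


# Invented example: 20,000 genes measured, 100 in the pathway,
# 200 significantly altered, 8 of those in the pathway.
p_value = hypergeom_enrichment_p(20_000, 100, 200, 8)
print(p_value)
```

With only 1 pathway gene expected by chance in this example, an overlap of 8 gives a very small p-value; when many pathways are tested, these p-values need the same kind of multiple-testing correction as the per-RNA tests.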
19 Thanks!
- MGL e-mail: (NICHDDIRMolGenomicsLab@mail.nih.gov)
- James Iben
- Location: Building 10, 9D41
- MGL Web page: https://science.nichd.nih.gov/confluence/display/mgl/Home
- MGL listserv: MGL-USERS-L
- MGL phone: (301)
- Next users group meeting: Aug 20th, 12pm
- Topic TBA