Presentation on theme: "MGL USERS GROUP 07-16-14 RNASEQ LIBRARY PREPARATION AND ANALYSIS BASICS."— Presentation transcript:
MGL USERS GROUP RNASEQ LIBRARY PREPARATION AND ANALYSIS BASICS
WHAT DOES RNASEQ PROVIDE? RNAseq is a quantitative experiment. RNA in a sample is fragmented (for large RNA) and sequenced to measure the relative amounts of the different transcripts. The number of reads you obtain for a given transcript is proportional to the amount of it present in the original sample. The number of reads recovered per transcript scales directly with: the number of copies of that transcript; the size of the transcript (longer => more fragments). Higher abundance transcripts may provide differential splicing information.
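As a rough illustration of the scaling described on this slide, the sketch below assumes an idealized model in which a transcript's expected share of reads is proportional to (copies x length). The transcript names, copy numbers, and lengths are invented for the example and do not come from the presentation.

```python
def expected_read_shares(transcripts):
    """Return each transcript's expected fraction of all reads.

    transcripts: dict mapping name -> (copies, length_in_bases).
    Idealized assumption: fragments are sampled uniformly from all
    transcript bases, so the share is proportional to copies * length.
    """
    weights = {name: copies * length
               for name, (copies, length) in transcripts.items()}
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


# Hypothetical sample: equal copy numbers, different lengths.
sample = {
    "short_tx": (100, 1000),  # 100 copies of a 1 kb transcript
    "long_tx": (100, 3000),   # 100 copies of a 3 kb transcript
}
shares = expected_read_shares(sample)
print(shares)
```

Under this model the longer transcript collects three times as many reads despite having the same copy number, which is why abundance measures later in the deck normalize by transcript length.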
WHAT DOES RNASEQ NOT PROVIDE? A measure of translational or protein activity: just because a gene is highly transcribed does not mean that its message is fully engaged with ribosomes. Ribosome profiling is an excellent supplement to this end. A good view of mutations: since read recovery scales with the abundance of a given transcript, some messages are highly (or overly) recovered while others are only sparsely sampled. Mutations may affect the ability of an RNA to accumulate in order to be sequenced. The ability to identify a mutation depends on sampling it many times. DNA sequencing is better suited to finding mutations in genes, as it is not tied to expression.
RNASEQ LIBRARY PREPARATION CONSIDERATIONS Ribosomal depletion: Ribo-Zero Gold kit. 98-99% of isolated RNA is rRNA, which will otherwise be the most abundant species in a sequencing experiment. Amount: ideal input for library preparation is 200 ng of ribosome-depleted RNA; 1 to 5 µg of total RNA prior to depletion will work as well for transcriptome sequencing experiments. Small RNA requires special handling separate from transcriptome experiments.
LIBRARY PREPARATION Ribosome-depleted RNA is fragmented if large (mRNA, etc.): tightly controlled endonucleolytic (RNase III) digestion, then cleanup. Small RNA differences: ensure 5’ phosphate and 3’ hydroxyl. End ligation of adapter sequences, then cleanup (small RNA involves additional size selection). Reverse transcription, then cleanup. PCR amplification with barcoded full adapters. Cleanup and quantitation for multiplexing/bead preparation.
SAMPLE STATE AFTER LIBRARY PREPARATION From this point, the sample is almost identical to a DNA fragment library. Adapter sequences vary slightly from DNA-based libraries, but have identical barcoding schemes. Samples may be read from both ends; for small RNA, reading from one end may be sufficient. If multiplexing, differently barcoded samples are pooled for bead preparation.
BEAD PREPARATION AND DEPOSITION Beads are enriched with single fragment sequences by emulsion PCR and deposited into lanes on a flowcell. Target deposition: ~230k beads per imaging panel; ~150 million beads per lane.
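The deposition targets above can be turned into a quick capacity estimate when planning multiplexing. This is only a back-of-the-envelope sketch: it assumes each bead yields one read, and the `usable_fraction` parameter is a hypothetical knob for beads lost to filtering, not a figure from the presentation. The 40 million reads-per-sample target is the rule of thumb quoted elsewhere in this deck.

```python
BEADS_PER_LANE = 150_000_000  # target deposition from the slide
BEADS_PER_PANEL = 230_000     # target deposition from the slide


def panels_per_lane(beads_per_lane=BEADS_PER_LANE,
                    beads_per_panel=BEADS_PER_PANEL):
    """Approximate number of imaging panels implied by the two targets."""
    return beads_per_lane / beads_per_panel


def max_samples_per_lane(target_reads_per_sample,
                         beads_per_lane=BEADS_PER_LANE,
                         usable_fraction=1.0):
    """How many barcoded samples fit in one lane at a given read target.

    usable_fraction is an assumed fraction of beads that give usable
    reads; 1.0 is the optimistic upper bound.
    """
    usable_beads = int(beads_per_lane * usable_fraction)
    return usable_beads // target_reads_per_sample


print(round(panels_per_lane()))          # panels implied per lane
print(max_samples_per_lane(40_000_000))  # samples at 40 million reads each
```

With the optimistic assumption that every bead is usable, a lane holds about three samples at 40 million reads each; a lower usable fraction reduces that.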
HOW MANY READS ARE NEEDED?? A common and debated point. Read count affects how much sampling of the RNA population is occurring. Highly abundant RNAs are fairly easy to see and therefore quantitate with low read counts. Lower abundance species are harder to catch by random sampling of the population, and therefore more difficult to quantitate accurately. At very low sampling (read) counts, you may not see some trace RNAs at all, simply by not getting them by chance! Target read counts are informed by what you are after in the experiment. Complexity of transcriptome (bacteria vs. human)? Big changes in expression? Low counts needed. Precise measurement of low abundance species? Maybe more. Appearance of a minor splice isoform species in the presence of WT isoforms? You may need a lot more. Basic rule of thumb: 40 million reads for a good picture of human transcriptome-sized complexity.
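The point about missing trace RNAs by chance can be made concrete with a simple sampling model. The sketch below assumes every read is an independent draw and that a transcript contributes a fixed fraction of all reads; real libraries deviate from this (PCR duplicates, bias), and the example fraction is invented for illustration.

```python
import math


def prob_detected(read_fraction, n_reads):
    """Probability of seeing at least one read from a transcript.

    read_fraction: assumed share of all reads that come from the transcript.
    n_reads: total reads sequenced.
    Computes 1 - (1 - f)^N in a numerically stable way.
    """
    if read_fraction <= 0:
        return 0.0
    if read_fraction >= 1:
        return 1.0
    return -math.expm1(n_reads * math.log1p(-read_fraction))


def expected_reads(read_fraction, n_reads):
    """Expected number of reads from the transcript."""
    return read_fraction * n_reads


# Hypothetical trace RNA: one read in every ten million.
trace = 1e-7
for depth in (1_000_000, 10_000_000, 40_000_000):
    print(depth, expected_reads(trace, depth), prob_detected(trace, depth))
```

Seeing a transcript at least once is a much weaker requirement than quantitating it accurately; a usable estimate needs many reads from the species, so the depth required for quantitation is higher than this detection probability alone suggests.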
AFTER SEQUENCING PROCEDURE: BEFORE MAPPING! Assess quality of the run: Was the balance between signals fairly even? Was there poor imaging or other aberrations in a cycle? Was the number of reads obtained what you expected? Scrub resulting sequences before trying to map to your transcriptome. Remove contaminant high abundance sequences: rRNA, tRNA, adapters. Particularly for small RNA experiments, adapter sequences may be found joined to the ends of your targets and need to be trimmed off. Trim reads based on quality (clip off trailing low quality calls that may lead to poor or incorrect mapping). It’s critical to have the best sequences to map without possible mistakes in order to get the highest quality quantitation in the end!
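The two trimming steps named on this slide can be sketched in a few lines. This is a deliberately simplified illustration, not what production trimmers do: it assumes base-space reads with integer Phred-style quality scores, matches the adapter exactly (no mismatches allowed), and uses a made-up adapter sequence and quality threshold.

```python
def trim_trailing_low_quality(seq, quals, min_q=20):
    """Clip bases off the 3' end while their quality is below min_q."""
    end = len(seq)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end], quals[:end]


def trim_3prime_adapter(seq, adapter, min_overlap=5):
    """Remove a 3' adapter, whether complete or cut off by the read end.

    Exact matching only; min_overlap guards against trimming short
    chance matches at the end of the read.
    """
    idx = seq.find(adapter)
    if idx != -1:
        return seq[:idx]  # full adapter found inside the read
    # Otherwise look for a prefix of the adapter at the very end of the read.
    max_k = min(len(adapter) - 1, len(seq))
    for k in range(max_k, max(min_overlap, 1) - 1, -1):
        if seq.endswith(adapter[:k]):
            return seq[:-k]
    return seq


ADAPTER = "ACGTTGCA"  # hypothetical adapter, for illustration only

# Small RNA insert followed by the first six bases of the adapter.
print(trim_3prime_adapter("GATTACAGATTACA" + "ACGTTG", ADAPTER))

# The last three calls fall below the quality threshold and are clipped.
print(trim_trailing_low_quality("ACGTACGT", [30, 30, 30, 30, 30, 12, 8, 2]))
```

The order shown (adapter first, then quality) is one reasonable choice; the presentation does not prescribe an order.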
MAPPING PROCEDURE Depends slightly on your data source and state. Illumina sequences are often mapped and quantitated through the TopHat/Bowtie -> Cufflinks pipeline. SOLiD colorspace data may be mapped through Lifescope or Novocraft, then quantitated by any tool at that point. RNAseq mapping typically involves multiple passes. First pass: compare your sequence to known spliced RNA sequences for your genome. Second pass: if not identified there, map to the genome, first not allowing for large gaps, then allowing for large gaps.
POST MAPPING At this point, generally data is in SAM or BAM format. Information there includes the read ID, where on the genome it mapped, where the pair is if it’s a paired end experiment, sequence, call qualities, and other information.
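To show where the information listed above lives, here is a minimal sketch that splits one SAM alignment line into its 11 mandatory tab-separated fields and checks a few standard flag bits. The example record is invented, and in practice a dedicated tool or library (samtools, pysam) would be used rather than hand parsing.

```python
from collections import namedtuple

# The 11 mandatory SAM columns, in order.
SamRecord = namedtuple(
    "SamRecord",
    "qname flag rname pos mapq cigar rnext pnext tlen seq qual",
)

FLAG_PAIRED = 0x1     # read is part of a pair
FLAG_UNMAPPED = 0x4   # read itself did not map
FLAG_REVERSE = 0x10   # read mapped to the reverse strand


def parse_sam_line(line):
    """Parse one alignment line (not a header line starting with '@')."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("SAM alignment lines need at least 11 fields")
    qname, flag, rname, pos, mapq, cigar, rnext, pnext, tlen, seq, qual = fields[:11]
    return SamRecord(qname, int(flag), rname, int(pos), int(mapq),
                     cigar, rnext, int(pnext), int(tlen), seq, qual)


# Hypothetical paired-end record: read at chr1:1000, mate at 1200.
example = "\t".join([
    "read1", "99", "chr1", "1000", "60", "50M",
    "=", "1200", "250", "A" * 50, "I" * 50,
])
rec = parse_sam_line(example)
print(rec.qname, rec.rname, rec.pos, rec.pnext)
print("paired:", bool(rec.flag & FLAG_PAIRED),
      "unmapped:", bool(rec.flag & FLAG_UNMAPPED))
```

BAM carries the same fields in compressed binary form, so it cannot be parsed as text this way.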
VISUALIZATION OF DATA IGV; UCSC Genome Browser; various other tools.
COMPARING ABUNDANCES Abundance of a given species is typically expressed as RPKM or FPKM: Reads Per Kilobase of transcript per Million mapped reads; Fragments Per Kilobase of transcript per Million mapped fragments. Fragments are preferred for paired-end data, as FPKM only counts cases where both ends of a fragment map to the same transcript and doesn’t “double count” an individual biological fragment. One molecule “spot” on the flow cell = 1 fragment, comprised of one forward and one reverse read. Added accuracy control. Normalizes basic counts per species by the size of the transcript (how likely you are to sample it from a pool of fragments) and the total size of the experiment. Allows for cross-experiment/sample comparisons.
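The normalization described above reduces to one line of arithmetic: counts divided by transcript length in kilobases and by the total mapped fragments in millions. The sketch below uses that standard definition; the counts in the example are invented.

```python
def fpkm(fragment_count, transcript_length_bp, total_mapped_fragments):
    """Fragments Per Kilobase of transcript per Million mapped fragments.

    Pass read counts instead of fragment counts to get RPKM.
    """
    if transcript_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and total fragments must be positive")
    kilobases = transcript_length_bp / 1_000
    millions = total_mapped_fragments / 1_000_000
    return fragment_count / kilobases / millions


# Hypothetical numbers: 500 fragments on a 2 kb transcript,
# in an experiment with 40 million mapped fragments.
value = fpkm(500, 2_000, 40_000_000)
print(value)
```

Doubling the transcript length or the sequencing depth while the count stays fixed halves the value, which is the correction for the two sampling effects named on the slide.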
SCATTER PLOT FPKM(1) VS FPKM(2) Compare abundances of all RNAs directly. Usually log scale, as FPKM values range greatly. Should see mainly a 1:1 correspondence of abundance for most RNAs. [Figure: scatter plot; axes: Condition 1, Condition 2]
SCATTER PLOT FPKM(1) VS FPKM(2) Most analyses will be interested in what is significantly altered between conditions. Need to identify globally what has changed and how much confidence we have in those determinations! [Figure: scatter plot; axes: Condition 1, Condition 2]
COMPARING ABUNDANCES FPKMs for the same species of RNA can be compared for expression level differences. Typically, a ratio of the FPKM values is taken and log transformed (usually log base 2). Log transformation eases statistical analysis of the population of ratios across all species observed in an experiment: what’s unusual/real vs. what’s within experimental noise. Assign p-values to the modeled significance of the changes being observed: basically, how likely is this to be a sampling difference explainable by chance? Apply RNAseq-specific corrections to derive a “q-value” (corrected p-value): size of transcripts can affect sampling, replicate measurements if present in the experiment, etc. Apply some cutoff to the q-value and look at RNAs passing that threshold.
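A minimal sketch of the two calculations on this slide follows. Two things in it are assumptions rather than the presentation's method: the pseudocount added before taking the ratio (so that an FPKM of zero does not break the logarithm), and the use of the generic Benjamini-Hochberg procedure as a stand-in for the RNAseq-specific corrections that dedicated differential expression tools apply. The p-values are invented.

```python
import math


def log2_ratio(fpkm_1, fpkm_2, pseudocount=1.0):
    """log2 of condition 2 over condition 1; pseudocount is an assumption."""
    return math.log2((fpkm_2 + pseudocount) / (fpkm_1 + pseudocount))


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


print(log2_ratio(3.0, 15.0))  # (15 + 1) / (3 + 1) = 4, i.e. a log2 ratio of 2

p_vals = [0.01, 0.04, 0.03, 0.005]  # hypothetical per-transcript p-values
q_vals = benjamini_hochberg(p_vals)
print(q_vals)
```

The correction matters because thousands of transcripts are tested at once; without it, a fixed p-value cutoff would pass many changes that arise purely by chance.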
VOLCANO PLOT Log ratio between FPKMs plotted against q-value. Even though apparent ratios between FPKM values may be large, they may simply reflect sampling differences explainable by chance.
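One way to read a volcano plot is as a two-threshold classification: a point counts as changed only if both its log ratio and its q-value pass a cutoff. The sketch below computes the plotted coordinates and a label for each RNA; the cutoffs (a log2 ratio of 1 and a q-value of 0.05) are common conventions assumed here, not values given in the presentation, and plotting q on a -log10 axis is likewise the usual convention rather than something the slide specifies.

```python
import math


def volcano_point(log2_ratio, q_value, ratio_cutoff=1.0, q_cutoff=0.05):
    """Return (x, y, label) for one RNA on a volcano plot.

    x is the log2 ratio; y is -log10(q), so more significant points sit higher.
    """
    y = -math.log10(max(q_value, 1e-300))  # guard against q == 0
    if q_value <= q_cutoff and abs(log2_ratio) >= ratio_cutoff:
        label = "up" if log2_ratio > 0 else "down"
    else:
        label = "not significant"
    return log2_ratio, y, label


# A large apparent ratio with a poor q-value is still not called.
print(volcano_point(4.0, 0.6))
# A moderate ratio with a strong q-value is.
print(volcano_point(1.5, 0.001))
print(volcano_point(-2.0, 0.01))
```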
WHAT TO DO WITH THE DATA?? An open-ended question dependent upon the experimental goals. Common procedures: Look at altered RNAs exceeding statistical significance, both upregulated and downregulated. What pathways or targets are represented here? Enrichment analysis. Look at similarly expressed gene patterns globally without the statistical cutoff, or at other RNAs in the same pathway(s). Look at splice isoform differences suggested by the data; mainly only applicable to paired-end data, as pairs are more likely to span one or more exons.
THANKS! MGL James Iben Location: Building 10, 9D41 MGL Web page: https://science.nichd.nih.gov/confluence/display/mgl/Home MGL listserv: MGL-USERS-L MGL phone: (301) Next users group meeting: Aug 20th, 12pm Topic TBA