2 Overview
  Uses for the SOLiD system
  Starting Material -> Final Library Material
  Bead Preparation & Deposition (Slide Overview)
  Sequencing Process (‘Colorspace’ vs Basecalls)
  Data Formats & Derivative Data Overview
  Future Topics
3 Uses for SOLiD: Anything where a reference is available
  Resequencing
    SNP or Indel studies
    Exome or other Capture
    Whole Genome
  Abundance studies
    Transcriptome RNAseq
    Ribosomal Profiling
    Microbiome
    Small RNAs (miR or other)
    ChIP-Seq / RIP-Seq
  NOT suitable for de novo sequencing (assembly or unknowns)
    Technically it is possible, but other platforms would likely give FAR better results.
4 Regardless of starting material, we sequence DNA fragments
  Regardless of starting material, we (or you) prepare a short DNA fragment library derived from it.
  Longer polynucleotides are generally sheared to a smaller size.
    Covaris or enzymatic digestion.
    May depend on application! Getting specific ends is important to some applications (ChIP, Protections, etc.).
    Mate Libraries may also be prepared where we want to sequence the ends of very large fragments.
  RNA gets reverse transcribed to DNA.
  Adapter sequences are added on in the process.
    As extendable ligated stranded RT primers for RNA, or by ligation after shearing/cleanup for DNA fragments.
  CRITICAL: Adapter cleanup post ligation! Leftover adapter is a very common major contaminant in poorer library preparations.
5 The Generic Derived Library
  Libraries have two end sequences used for both PCR and sequencing priming.
  “P1” is the universal Forward primer sequence.
  The secondary “P2” may have an embedded barcode sequence where applicable.
  Between the two adapter ends we have the DNA, which will be sequenced from any combination of forward, reverse, and/or Barcode regions (green arrows).
  Note: Adapter sequences DIFFER from Illumina's. This matters if preparations designed for other platforms are to be adapted to this one.
6 Bead Preparation from Libraries
  A library or pool of libraries is subjected to emulsion PCR to populate beads.
  Oil micro-reactors are titrated such that each bead is populated by a single template.
  Unpopulated beads are removed in subsequent cleanup.
7 Slide Deposition of enriched beads
  Beads are prepared and flowed / adhered in the flowcell lanes.
  Low loading: little data.
  Overloading: unable to resolve single beads.
8 Instrument Run
  Identifies single spots in each lane to track for signal.
  Camera images 708 “panels” on each lane.
9 Colorspace
  “Colorspace” refers to the two-nucleotide encoding used by SOLiD.
  Tiled 5-bp steps with resets.
10 Colorspace
  5-bp steps with resets.
  Di-nucleotide reads result in redundancy in calls.
    In practice this translates to a slightly higher accuracy in mutation calls.
  Resets in extensions mean that a mis/non-incorporation or a bad cycle does not kill a read. They also allow targeted cycles to be repeated without rerunning everything.
  Drawback: the resulting sequence is encoded in “colorspace” dinucleotide calls.
    Must use colorspace aligners for the data as-is (Lifescope).
    Possible to use an additional 3bp tiled reading cycle set to disambiguate and produce base-calls (ECC).
    Possible to use the first base knowledge to walk a base sequence out, but any poor read anywhere will then cause a cascade of subsequent errors; better to use colorspace algorithms.
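The error cascade mentioned above can be illustrated with a small sketch. It assumes the commonly described two-base scheme in which each color is the XOR of 2-bit codes for adjacent bases (A=0, C=1, G=2, T=3). The function names and the toy sequence are hypothetical, and real reads begin with a known primer base rather than the first base of the insert, so treat this as a simplification rather than the instrument's actual pipeline.

```python
# Assumed 2-bit base codes: the color of a dinucleotide is the XOR of its two codes.
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_TO_BASE = {bits: base for base, bits in BASE_TO_BITS.items()}


def encode_colorspace(bases):
    """Return the leading base followed by one color (0-3) per adjacent base pair."""
    colors = [
        str(BASE_TO_BITS[a] ^ BASE_TO_BITS[b])
        for a, b in zip(bases, bases[1:])
    ]
    return bases[0] + "".join(colors)


def decode_colorspace(read):
    """Walk a base sequence out from the known first base, one color at a time."""
    bases = [read[0]]
    for color in read[1:]:
        previous = BASE_TO_BITS[bases[-1]]
        bases.append(BITS_TO_BASE[previous ^ int(color)])
    return "".join(bases)


original = "ATGGCA"                      # toy sequence, not real data
encoded = encode_colorspace(original)    # leading base + 5 colors
roundtrip = decode_colorspace(encoded)   # recovers the original bases

# Simulate a single miscalled color (second color changed from 1 to 2).
miscalled = encoded[:2] + "2" + encoded[3:]
corrupted = decode_colorspace(miscalled)

print(encoded, roundtrip, corrupted)
```

One wrong color changes every base decoded after it, which is why aligning in colorspace is preferred over converting reads to bases first.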
11 Data we get
  Data is by default in “XSQ” format.
    A binary file / not human readable.
    Possible to export to ‘CSFASTA’ & ‘CSQUAL’ files, which in combination are similar to FASTQ from Illumina. Some additional meta information is lost when doing so.
    Lifescope is the only existing aligner for XSQ data.
  CSFASTA: (Read ID, then Color calls [0-3 for the 4 dyes]. CSQUAL has quality scores for each read similarly)
    >600_50_31_F3T
    >600_50_63_F3T
    >600_50_100_F3T
  FASTQ: (Read ID, then sequence, then a “+” line that may repeat the sequence ID, then quality scores for the read)
    @SEQ_ID
    GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
    +
    !''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
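As a rough illustration of the CSFASTA layout described above, here is a minimal reader sketch. It assumes each record is a ">" header line followed by one line holding a leading primer base and then the color calls, and that lines starting with "#" are comments. The function name and the example records are hypothetical, not real instrument output.

```python
def parse_csfasta(lines):
    """Yield (read_id, primer_base, colors) tuples from CSFASTA text lines."""
    read_id = None
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        if line.startswith(">"):
            read_id = line[1:]  # header line: drop the leading ">"
        elif read_id is not None:
            # First character is the primer base, the rest are color calls.
            yield read_id, line[0], line[1:]
            read_id = None


# Made-up records for illustration only.
example = [
    "# illustrative records, not real instrument output",
    ">1_10_20_F3",
    "T0120213",
    ">1_10_35_F3",
    "T3001122",
]

records = list(parse_csfasta(example))
for read_id, primer_base, colors in records:
    print(read_id, primer_base, colors)
```

A matching CSQUAL reader would pair quality values to these records by read ID.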
12 Call Quality scores
  Call qualities are in ASCII and represent phred-scale scores (basically, a log-scale error probability).
  Depending on platform, the exact encoding has historically varied.
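To make the phred scale concrete: a score Q corresponds to an error probability of 10^(-Q/10), so Q20 means a 1 in 100 chance that the call is wrong. The sketch below decodes an ASCII quality string assuming an offset of 33 (the Sanger-style convention); older encodings used an offset of 64, so the `offset` parameter is an assumption to check against your data. The function names are hypothetical.

```python
def decode_quality_string(qualities, offset=33):
    """Convert ASCII quality characters to integer phred scores."""
    return [ord(ch) - offset for ch in qualities]


def phred_to_error_probability(q):
    """Phred definition: Q = -10 * log10(p), so p = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


scores = decode_quality_string("!+5I")
probabilities = [phred_to_error_probability(q) for q in scores]

for q, p in zip(scores, probabilities):
    print(f"Q{q}: error probability {p:g}")
```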
13 Aligned data format
  BAM is the most common form of aligned sequencing data. This is a binary version of a SAM file.
    SAM files are text / human readable; BAM files are not.
  BAM files are highly compressed & indexable / optimized for rapid access to reads anywhere within.
    You don’t have to read the whole file if you want to look for reads at a gene in the middle of chromosome 7, for example.
  BAM files are supported by most genomic viewers. I suggest using IGV to visualize your BAM files.
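Because SAM is plain tab-delimited text, a single alignment line can be picked apart with very little code. The sketch below splits one line into the 11 mandatory SAM columns and checks the reverse-strand flag bit (0x10). The helper names and the example record are made up for illustration; for BAM files a dedicated library or samtools would be the practical choice.

```python
# The 11 mandatory SAM columns, in order.
SAM_FIELDS = (
    "qname", "flag", "rname", "pos", "mapq", "cigar",
    "rnext", "pnext", "tlen", "seq", "qual",
)


def parse_sam_line(line):
    """Split one SAM alignment line into a dict of mandatory fields plus tags."""
    cols = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_FIELDS, cols[:11]))
    for key in ("flag", "pos", "mapq", "pnext", "tlen"):
        record[key] = int(record[key])  # numeric columns
    record["tags"] = cols[11:]          # optional TAG:TYPE:VALUE fields
    return record


def is_reverse_strand(record):
    """Flag bit 0x10 marks a read aligned to the reverse strand."""
    return bool(record["flag"] & 0x10)


# Made-up alignment record for illustration only.
example_line = "read1\t16\tchr7\t55000\t60\t8M\t*\t0\t0\tGATTTGGG\tIIIIIIII\tNM:i:0"
record = parse_sam_line(example_line)
print(record["rname"], record["pos"], is_reverse_strand(record))
```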
15 Variant Call Format (VCF)
  Mutations are typically reported in VCF format.
  This is a tab-delimited text format (human readable). Many programs interpret this format.
    Varsifter will crunch the data for you in a filterable format.
  One line per mutation location.
    Position (chromosome, nt position), reference base identity, observed mutation identity, and quality data regarding that call per sample in the VCF file.
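The one-line-per-location layout can be sketched with a minimal reader. It assumes the standard VCF column order (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, then FORMAT and one column per sample), skips "##" metadata lines, and takes sample names from the "#CHROM" header. The function name and the example variant are hypothetical; a real analysis would use an established VCF library.

```python
# The 8 fixed VCF columns, in order.
VCF_FIXED = ("chrom", "pos", "id", "ref", "alt", "qual", "filter", "info")


def parse_vcf(lines):
    """Yield one dict per variant line, with per-sample fields keyed by FORMAT."""
    samples = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and metadata lines
        cols = line.split("\t")
        if line.startswith("#CHROM"):
            samples = cols[9:]  # sample names follow the FORMAT column
            continue
        record = dict(zip(VCF_FIXED, cols[:8]))
        record["pos"] = int(record["pos"])
        record["alt"] = record["alt"].split(",")  # multiple alternate alleles
        if len(cols) > 9:
            keys = cols[8].split(":")
            record["samples"] = {
                name: dict(zip(keys, value.split(":")))
                for name, value in zip(samples, cols[9:])
            }
        yield record


# Made-up VCF content for illustration only.
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsampleA",
    "chr7\t55000\t.\tG\tA\t50\tPASS\tDP=30\tGT:DP\t0/1:30",
]

variants = list(parse_vcf(example))
for v in variants:
    print(v["chrom"], v["pos"], v["ref"], v["alt"], v["samples"])
```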