Short Read Sequencing Analysis Workshop
Day 1: Considerations for Sequencing
Different types of sequencing libraries
  Whole genome sequencing
  RNA Sequencing/GRO-Seq
  ChIP-seq
  DNase I, ATAC-seq
  Exome sequencing
  Methyl-Seq
  Metagenomic/Amplicon (low diversity)
Platform Comparison
Platform | MiniSeq | MiSeq | NextSeq | HiSeq 2500 | HiSeq 3000/4000 | HiSeq X
Output per run | 1.65 Gb – 7.5 Gb | 0.5 Gb – 15 Gb | 16 Gb – 120 Gb | 9 Gb – 500 Gb | 105 Gb – 750 Gb | 800 Gb – 900 Gb
Reads per run | 7M – 25M | 12M – 25M | 130M – 400M | 300M – 4B | 2.1B – 2.5B | 2.6B – 3B
Time per run | 7h – 24h | 5h – 56h | 11h – 30h | 7h – 6d | 1d – 3.5d | <3d
Rows whose cells span several platforms (values in slide order):
Max read length: 2 x 150, 2 x 300, 2 x 250
2 color/4 color: 2 color, 4 color
Flowcell: PE, SR / PE, Pattern
Samples/FC: 1, 2 or 8, 8
How does Illumina sequencing work? Library generation and affixing library to flow cell http://bitesizebio.com/13546/sequencing-by-synthesis-explaining-the-illumina-sequencing-technology/
How does Illumina sequencing work? Cluster Generation
How does Illumina sequencing work? Sequencing by synthesis with reversible terminators
How does Illumina sequencing work?
Output: Millions of short read sequences
  Read 1: ATCGACGGTTAACTGATCG… ATGCGTGCTGCAGTGCCAC… CGTGGACCAAATGGCACAT… CTGTGAAACAATTGGGGAT…
  Index Read 1 (i7): TCAGTGCT ACGTTCTA TCAGTGGG CTCGGCGA ACGTTCTC
  Index Read 2 (i5): ACGTTCAT CAACGTTC ATTCAGTG GCCTCGGC
  Read 2: CTGGTGACAACTGATGCTT… TGACCATTGGGTACAACCC… CCAGTGAACGTGAGCAAGT… GGTTGACCATTGGGGTGAC…
Demultiplexing
Current Illumina kits allow up to 384 unique indexes to be pooled.
Demultiplexing
Pooled reads
  Read 1: ATCGACGGTTAACTGATCG… ATGCGTGCTGCAGTGCCAC… CGTGGACCAAATGGCACAT… CTGTGAAACAATTGGGGAT…
  Index Read 1 (i7): TCAGTGCT ACGTTCTA TCAGTGGG CTCGGCGA
  Index Read 2 (i5): ACGTTCAT CAACGTTC ATTCAGTG GCCTCGGC
  Read 2: CTGGTGACAACTGATGCTT… TGACCATTGGGTACAACCC… CCAGTGAACGTGAGCAAGT… GGTTGACCATTGGGGTGAC…
Sample 1
  Read 1: ATCGACGGTTAACTGATCG…  Read 2: CTGGTGACAACTGATGCTT…
  Read 1: CGTGGACCAAATGGCACAT…  Read 2: CCAGTGAACGTGAGCAAGT…
Sample 2
  Read 1: ATGCGTGCTGCAGTGCCAC…  Read 2: TGACCATTGGGTACAACCC…
Sample 3
  Read 1: CTGTGAAACAATTGGGGAT…  Read 2: GGTTGACCATTGGGGTGAC…
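The sorting step on this slide can be sketched in a few lines of Python. This is only an illustration, not how Illumina's own software does it: the sample sheet below (which index pair belongs to which sample), the function names, and the one-mismatch tolerance are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical sample sheet: (i7, i5) index pair -> sample name.
# The pairing of indexes to samples is assumed for illustration only.
SAMPLE_SHEET = {
    ("TCAGTGCT", "ACGTTCAT"): "Sample 1",
    ("ACGTTCTA", "CAACGTTC"): "Sample 2",
    ("CTCGGCGA", "GCCTCGGC"): "Sample 3",
}


def hamming(a, b):
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def assign_sample(i7, i5, sheet=SAMPLE_SHEET, max_mismatches=1):
    """Return the one sample whose indexes match within max_mismatches each, else None."""
    hits = [
        name
        for (s7, s5), name in sheet.items()
        if len(s7) == len(i7)
        and len(s5) == len(i5)
        and hamming(s7, i7) <= max_mismatches
        and hamming(s5, i5) <= max_mismatches
    ]
    # Ambiguous (more than one hit) or unmatched reads are not assigned.
    return hits[0] if len(hits) == 1 else None


def demultiplex(records, sheet=SAMPLE_SHEET, max_mismatches=1):
    """Group (read1, i7, i5, read2) records by sample name."""
    bins = defaultdict(list)
    for read1, i7, i5, read2 in records:
        sample = assign_sample(i7, i5, sheet, max_mismatches) or "Undetermined"
        bins[sample].append((read1, read2))
    return dict(bins)


# Toy records: (Read 1, Index Read 1, Index Read 2, Read 2).
records = [
    ("ATCGACGG", "TCAGTGCT", "ACGTTCAT", "CTGGTGAC"),  # exact match
    ("ATGCGTGC", "ACGTTCTA", "CAACGTTC", "TGACCATT"),  # exact match
    ("CGTGGACC", "TCAGTGCA", "ACGTTCAT", "CCAGTGAA"),  # one i7 mismatch
    ("CTGTGAAA", "GGGGGGGG", "GGGGGGGG", "GGTTGACC"),  # no match
]

for sample, pairs in sorted(demultiplex(records).items()):
    print(sample, len(pairs))
```

Reads whose indexes match no sample, or match more than one, land in an "Undetermined" bin here, so a sequencing error in an index does not silently assign a read to the wrong sample.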
What to do with the data?
Short Read Sequencing
  Quality Metrics & Trimming
  Assembly
  Align to reference genome
  Variant Calling
  Expression/Read Depth
  Alternative splicing
  Peak/Region identification
  Metagenomics
Quality Assessment & Trimming
  Pinpoint problems with library prep/sequencing
  Identify possible biases
  Improve mapping through trimming
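As a rough illustration of what quality trimming does, the sketch below removes low-quality bases from the 3' end of a read. It assumes Phred+33 quality encoding (the usual FASTQ encoding) and a cutoff of Q20; the function names and the cutoff are choices made for this example, and dedicated trimming tools apply more refined rules.

```python
def phred33_scores(quality_string):
    """Decode a FASTQ quality string (Phred+33) into integer quality scores."""
    return [ord(ch) - 33 for ch in quality_string]


def trim_3prime(sequence, quality_string, min_quality=20):
    """Drop bases from the 3' end until the last base has quality >= min_quality."""
    scores = phred33_scores(quality_string)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_quality:
        end -= 1
    return sequence[:end], quality_string[:end]


# 'I' encodes Q40, '#' encodes Q2, '!' encodes Q0.
print(trim_3prime("ACGTACGTAC", "IIIIIIII#!"))  # → ('ACGTACGT', 'IIIIIIII')
```

Only the tail of the read is examined, so a single low-quality base in the middle of an otherwise good read is left alone.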
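Align to reference genome
[Figure: Sample 1, Sample 2, and Sample 3 reads aligned to reference Chr1 1000-2500]
Aligners: Bowtie2, TopHat2, BWA
Variant Calling
[Figure: reads aligned to reference Chr1 1000-2500; bases shown at the variant position: A C C C C C C]
Differential Expression
[Figure: reads aligned to reference Chr1 1000-2500]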
Alternative Splicing
Peak/Region identification
[Figure: reads aligned to reference Chr1 1000-2500, with a peak marked]
Experimental Design considerations
  Genome Size
  Read Length
  Sequencing Depth
  # of Replicates
  Single-end vs. Paired-end
  Insert Size
Coverage & Read-depth
Coverage = estimate of the average number of reads covering a single base
Avg Coverage = (# reads) x (read length) / (size of genome)
[Figure: reads aligned to a reference, with read depth on the vertical axis]
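To make the difference between average coverage and per-base depth concrete, here is a small sketch on a toy 20 bp "genome". The function names and the toy read positions are made up for the example, and every read is assumed to align without gaps.

```python
def average_coverage(num_reads, read_length, genome_size):
    """Avg Coverage = (# reads) x (read length) / (size of genome)."""
    return num_reads * read_length / genome_size


def per_base_depth(read_starts, read_length, genome_size):
    """Count how many reads overlap each position (0-based start coordinates)."""
    depth = [0] * genome_size
    for start in read_starts:
        for pos in range(start, min(start + read_length, genome_size)):
            depth[pos] += 1
    return depth


# Toy example: five 5 bp reads on a 20 bp reference.
starts = [0, 2, 4, 10, 15]
depth = per_base_depth(starts, read_length=5, genome_size=20)

print(average_coverage(len(starts), 5, 20))  # → 1.25
print(sum(depth) / len(depth))               # → 1.25
print(max(depth), min(depth))                # → 3 0
```

The average is 1.25X, yet one position is covered three times and another not at all. This is why a target such as 30X is an average: individual bases can sit well above or below it.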
Typical Coverage Requirements
  DNA-Resequencing (SNPs, small indels): 30X with paired-end reads
  De novo DNA-Seq: 100X minimum, longest paired-end, multiple insert size runs
  Exome: 100-200X of the exome
What that means in reads...
30X coverage with 2 x 150 bp reads
  For E. coli, ~4.6 Mb: 138 Mbp, 0.46 million reads (each a 2 x 150 bp pair)
    ~3% of a MiSeq run
  For Human, ~3.2 Gb: 96 Gbp, 320 million reads (each a 2 x 150 bp pair)
    80% of a NextSeq High Output run or 1.3 lanes of a HiSeq 2500 run
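The arithmetic behind these numbers can be reproduced with a short sketch. The function name is made up for the example. The run capacities are the figures quoted elsewhere in this deck (15 M reads for a MiSeq run at the low end, 400 M for a NextSeq run, 250 M per HiSeq 2500 lane), and a "read" is counted as one 2 x 150 bp pair.

```python
import math


def read_pairs_needed(genome_size_bp, coverage, read_length=150, paired=True):
    """Number of reads (pairs, if paired) needed to reach the target average coverage."""
    bases_needed = genome_size_bp * coverage
    bases_per_fragment = read_length * (2 if paired else 1)
    return math.ceil(bases_needed / bases_per_fragment)


ecoli = read_pairs_needed(4_600_000, 30)        # ~4.6 Mb genome
human = read_pairs_needed(3_200_000_000, 30)    # ~3.2 Gb genome

print(ecoli)                 # → 460000
print(human)                 # → 320000000
print(human / 400_000_000)   # → 0.8   (share of a 400 M read NextSeq run)
print(human / 250_000_000)   # → 1.28  (HiSeq 2500 lanes at 250 M reads each)
```

The E. coli requirement divided by 15 M reads gives about 0.03, which is the ~3% of a MiSeq run quoted above.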
RNA-Seq Requirements
Can’t use coverage as a measure
  Differential Expression (highly expressed)
    Small genomes: 5 Million reads
    Large genomes: 10-30 Million reads
  De novo Assembly/DE (lowly expressed)
    Small genomes: 30-65 Million reads
    Large genomes: 100-200 Million reads
Note: for RNA-Seq, replicates are typically more powerful than read depth or read length.
Which Sequencer should I use?
  MiSeq: 15-25 M reads/run; 8 h – 4 days/run; 1x50 to 2x300; $$$/bp
  NextSeq: 130-400 M reads/run; 12 – 30 h/run; 1x75 to 2x150; $$/bp
  HiSeq 2500: 250 M reads/lane, 8 lanes/run; 7 h – 3 d/run; 1x36 to 2x125; $$/bp
  HiSeq 4000: 312 M reads/lane, 8 lanes/run; 1 – 3.5 d/run; 1x50 to 2x150; $/bp
  HiSeq X Ten: 350 M reads/lane, 8 lanes/run; 3 d/run; 2x150; $/bp BUT minimums on orders
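One way to use these numbers is to estimate how many samples fit on a run or lane for a given per-sample read target. The sketch below is a back-of-the-envelope estimate only: the function name is made up, the cap of 384 is the index limit mentioned on the demultiplexing slide, and it assumes perfectly even pooling with no reads lost to unassigned indexes or spike-ins.

```python
def samples_per_run(reads_per_run, reads_per_sample, max_indexes=384):
    """Upper bound on samples per run or lane, capped by the number of unique indexes."""
    return min(reads_per_run // reads_per_sample, max_indexes)


# RNA-Seq differential expression at 30 M reads per sample on a 400 M read NextSeq run.
print(samples_per_run(400_000_000, 30_000_000))  # → 13

# Small-genome differential expression at 5 M reads per sample on a 25 M read MiSeq run.
print(samples_per_run(25_000_000, 5_000_000))    # → 5

# Very shallow sequencing: the index limit, not the read count, is the bottleneck.
print(samples_per_run(400_000_000, 1_000_000))   # → 384
```

In practice pooling is never perfectly even, so planning for fewer samples than this upper bound leaves some margin.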
Other considerations
  Base diversity (at each position)
  Custom versus kitted libraries – kit biases
  PCR/PCR-free libraries
  How unique is the run-type you want
  Queue times/Data delivery times
  Many more...
Questions?