2nd (Next) Generation Sequencing 2/2/2018
Introduction
Why do we want to sequence a genome?
To see the sequence (assembly)
To validate an experiment (insert or knockout)
To compare to another genome and find variations (cancer, populations)
The problem: We cannot sequence the genome from start to end. We need to shear the DNA into smaller fragments and sequence those pieces. Sanger sequencing is slow and not high throughput: 13 years for a human genome.
NGS machines
Next Generation Sequencing (NGS) produces millions to billions of short reads in just a few days, at lower cost. You can sequence your genome at 30X depth for 1000-1500 USD.
Example machines: Illumina HiSeq 2500, Roche 454, Ion Torrent
Evolution of NGS Surya Saha, Boyce Thompson Institute, Ithaca, NY (BTI plant bioinformatics course)
Experiment Alex Sanchez, Statistics and Bioinformatics Research Group, Statistics Department, Universitat de Barcelona
Definitions and standards
Reads come from molecule fragments
Read length is the same for an entire dataset (e.g. 101 bases long)
Either single or paired-end reads
Mate reads
Physical coverage and depth
Number of reads
Duplicates (PCR or sequence)
Dark matter (PCR cannot find repeats)
[Figure labels: fragment, pair 1, pair 2]
Lex Nederbragt, SeRC Nordic Assembly Workshop in Stockholm, Sweden, May 14th 2014
Dark matter example chr22:11M-12M RepeatMasker Gap
Design Choice: Fragment Length
Illumina sequencers can only sequence DNA fragments up to ~300nt long
DNA must be size-selected, usually by gel cut
~200-300nt band cut, purified, prepared for sequencing
Fragment length follows a normal distribution around target cut size
Design Choice: Number of Reads
Each sequencing run generates a certain # of total reads
# of reads per sample ~= # of total reads / # of samples
# of reads for one sample: library size
Can choose target library size for your instrument based on:
Desired depth
Desired coverage
For more see https://genohub.com/recommended-sequencing-coverage-by-application/
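The library-size arithmetic can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the relation depth = (number of reads x read length) / genome size is the usual back-of-the-envelope estimate rather than something stated on the slide, and the genome size and run output in the example are assumed round numbers.

```python
import math

def reads_needed(genome_size_bp, target_depth, read_length_bp, paired_end=False):
    """Reads (or read pairs, if paired_end) needed for a target average depth.

    Uses depth ~= sequenced bases / genome size; ignores duplicates,
    unmapped reads and uneven coverage, so treat it as a lower bound.
    """
    bases_per_unit = read_length_bp * (2 if paired_end else 1)
    return math.ceil(target_depth * genome_size_bp / bases_per_unit)

def reads_per_sample(total_reads_per_run, n_samples):
    """Approximate library size when a run is split evenly across samples."""
    return total_reads_per_run // n_samples

# Assumed numbers: ~3.1 Gb genome, 30X depth, 2 x 101 bp paired-end reads.
print(reads_needed(3_100_000_000, 30, 101, paired_end=True))
# Assumed run output of 400 million reads shared by 8 samples.
print(reads_per_sample(400_000_000, 8))
```

In practice one would budget extra reads on top of this estimate, since some reads are duplicates or fail to map.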
Design Choice: Single End vs Paired End
Critical Concept: Read Mapping Question: “Given a read and a reference sequence, where, if anywhere, in the reference does the read sequence occur?” E.g. chr3:2,358,092-2,358,193 More on this next lecture
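To make the mapping question concrete, here is a deliberately naive sketch that reports every exact occurrence of a read in a reference string, on either strand. The function names and the toy sequences are hypothetical. Real read mappers use an index over the reference and tolerate mismatches and gaps, which this sketch does not attempt.

```python
def reverse_complement(seq):
    """Reverse-complement a DNA string (A, C, G, T, N only)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq))

def find_exact_hits(read, reference):
    """Return (0-based start, strand) for each exact occurrence of the read."""
    hits = []
    for strand, query in (("+", read), ("-", reverse_complement(read))):
        start = reference.find(query)
        while start != -1:
            hits.append((start, strand))
            start = reference.find(query, start + 1)
    return hits

# Toy reference and reads, invented for illustration.
reference = "ACGTTTGACCA"
print(find_exact_hits("TTG", reference))   # forward-strand match
print(find_exact_hits("TGG", reference))   # matches only as reverse complement
print(find_exact_hits("AAAA", reference))  # no occurrence: empty list
```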
Mapped Read Terminology
Depth: number of sequenced bases that map to a given location
Coverage: fraction of genomic locus covered by at least one read
[Figure labels: Genome, Locus, Mapped or Aligned reads]
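The two definitions can be checked on a toy locus. The sketch below assumes alignments are given as 0-based, half-open (start, end) intervals relative to the locus. That convention and the function names are illustrative choices, not part of the lecture.

```python
def depth_per_base(locus_length, alignments):
    """Count, for every position in the locus, how many reads cover it."""
    depth = [0] * locus_length
    for start, end in alignments:
        # Clip each alignment to the locus boundaries.
        for pos in range(max(start, 0), min(end, locus_length)):
            depth[pos] += 1
    return depth

def coverage_fraction(depth):
    """Fraction of positions covered by at least one read."""
    return sum(1 for d in depth if d > 0) / len(depth)

def mean_depth(depth):
    """Average depth over the locus."""
    return sum(depth) / len(depth)

# Three toy reads on a 10 bp locus; positions 6 and 7 stay uncovered.
depth = depth_per_base(10, [(0, 4), (2, 6), (8, 10)])
print(depth)
print(coverage_fraction(depth), mean_depth(depth))
```

Note how the two numbers answer different questions: mean depth can look adequate while part of the locus has no reads at all.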
Illumina paired-end reads
Illumina is now the most common sequencer. Its errors are roughly uniformly distributed along the read (~0.1% per base) and are almost all substitutions (indels are rare). Older Illumina machines showed a drop in quality towards the end of the read.
Statistics
Fragment (insert) size follows a truncated normal distribution.
Sequencing depth is defined by the number of fragments covering a base pair of the DNA, not the number of reads. Use "read depth" to refer to the latter.
Physical coverage is the amount of the genome expected to be covered. However, "coverage" is usually used to mean depth!
Coverage follows a Poisson (or, for overdispersed data, Negative Binomial) distribution with lambda = physical depth.
Read length is a fixed number for Illumina reads.
Error is usually higher toward the ends → trimming
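As a rough illustration of the Poisson model, the sketch below computes the expected fraction of positions with a given depth when the mean depth is lambda. It assumes the idealized Poisson case only. Real data are usually overdispersed, which is where the Negative Binomial comes in. The function names are hypothetical.

```python
import math

def poisson_pmf(k, lam):
    """P(depth == k) when depth is Poisson-distributed with mean lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)

def fraction_with_depth_below(threshold, lam):
    """Expected fraction of positions with depth < threshold."""
    return sum(poisson_pmf(k, lam) for k in range(threshold))

# Expected fraction of the genome with no reads at all, for several mean depths.
for lam in (1, 5, 30):
    print(lam, fraction_with_depth_below(1, lam))

# Expected fraction of positions with depth below 10 at 30X mean depth.
print(fraction_with_depth_below(10, 30))
```

The zero-depth fraction is simply exp(-lambda), so it falls off very quickly as mean depth increases.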
Illumina coverage Good coverage Bad coverage
Sequence Data Format: fastq
The machines output files containing short reads in fastq format. For each read there are 4 lines:
@read_header comment
Read_sequence
+[read_header]
Quality_string (in ASCII)
Scores estimate the probability that a base is called incorrectly. Q30 means 99.9% accuracy.
Reads are short, so we need a "reference sequence" to resolve where they come from (resequencing).
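The link between a quality character and an error probability can be sketched as follows. The slide only says the string is ASCII. The offset of 33 (the "Phred+33" encoding used by current Illumina output and by SRA) is an assumption here, and the function names are hypothetical.

```python
def phred_scores(quality_string, offset=33):
    """Decode an ASCII quality string into Phred scores (assumes Phred+33)."""
    return [ord(ch) - offset for ch in quality_string]

def error_probability(q):
    """Probability that a base with Phred score q was called incorrectly."""
    return 10 ** (-q / 10)

# A few characters of the kind seen in Illumina quality strings.
for ch, q in zip("#<BF", phred_scores("#<BF")):
    print(ch, q, error_probability(q))

# Q30 corresponds to a 1 in 1000 error probability, i.e. 99.9% accuracy.
print(error_probability(30))
```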
fastq format
@SRR1997412.1 1 length=125
NTTGTAGCTGAGGAAACTGAGGCTCAGGAGGACAAGTGGCCTGCCAAAGGTACCAGCACTCAGATGGAATGGTTTTGAACTCAGTCCATTTGAACTCAGTTTGAACCTGTCTCTTATACACATCT
+SRR1997412.1 1 length=125
#<<BBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF<FFFFFFFFFFFFBFF
@SRR1997412.2 2 length=125
NTATTTAGTCATGTAAGACTCCTTAACCAGCTAACTTAAGAAAGACTTCTAGGACAGAATAGGTTACACTAGTTATAATTTTATCTTTCTTCTACTCACTTGCTTCTCAATTGAAAGAGCGGAAA
+SRR1997412.2 2 length=125
Annotations, in order:
"@": start new read
"SRR1997412.1": unique read header
"1 length=125": comments, separated by space, could be anything
Second line: sequence of the read
"+": start quality line
"SRR1997412.1 1 length=125" after the "+": repeat of read header and comment, not required
Fourth line: quality sequence of the read, in ASCII
Second "@" line: next read
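The four-line layout walked through above can be read with a short parser. This is a minimal sketch: it assumes strict four-line records (no sequences wrapped over several lines), the function name is hypothetical, and the records in the example are invented rather than taken from SRR1997412.

```python
import io

def read_fastq(handle):
    """Yield (read_id, comment, sequence, quality) from 4-line fastq records."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        header = header.rstrip("\r\n")
        if not header:
            continue  # tolerate blank lines between records
        sequence = handle.readline().rstrip("\r\n")
        plus = handle.readline().rstrip("\r\n")
        quality = handle.readline().rstrip("\r\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed fastq record near: " + header)
        if len(sequence) != len(quality):
            raise ValueError("sequence/quality length mismatch in: " + header)
        # The read id ends at the first space; the rest is a free-form comment.
        read_id, _, comment = header[1:].partition(" ")
        yield read_id, comment, sequence, quality

# Invented two-record example; the second record omits the optional repeat after "+".
toy = io.StringIO("@r1 1 length=4\nACGT\n+r1 1 length=4\nFFFF\n@r2\nNNAC\n+\n#<BF\n")
for record in read_fastq(toy):
    print(record)
```

For real files, an established parser (for example from Biopython) is usually the safer choice.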
Comparison
Quail, Michael A., et al. "A tale of three next generation sequencing platforms: comparison of Ion Torrent, Pacific Biosciences and Illumina MiSeq sequencers." BMC Genomics 13.1 (2012): 341.
Public data and platforms
NCBI (https://www.ncbi.nlm.nih.gov/sra)
Illumina BaseSpace (https://basespace.illumina.com/home/index)
Google genomics cloud (https://console.cloud.google.com/genomics/)
Genome In A Bottle (GIAB) (http://jimb.stanford.edu/giab/)
REPOSITIVE (https://discover.repositive.io/datasets/)
GDC (https://portal.gdc.cancer.gov/)
Seven Bridges (https://igor.sbgenomics.com/)
NCBI sra portal
NCBI ftp
1000 genomes on Google
Illumina BaseSpace
GDC
Simulating sequencing data
ART: WGS simulator
WGSIM: WGS simulator
PBSIM: PacBio simulator
See more on OMIC tools (https://omictools.com/read-simulators-category)
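To give a feel for what such simulators do, here is a toy single-end read simulator. It is not how ART, WGSIM or PBSIM work internally. It simply samples uniform start positions on the forward strand and applies independent substitution errors. The function name, parameters and default error rate are illustrative assumptions.

```python
import random

def simulate_reads(reference, n_reads, read_length, error_rate=0.001, seed=0):
    """Sample fixed-length reads from a reference with substitution errors only.

    Returns a list of (0-based start, read sequence). Assumes the reference
    contains only A, C, G, T.
    """
    if read_length > len(reference):
        raise ValueError("read_length exceeds reference length")
    rng = random.Random(seed)  # seeded for reproducibility
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(reference) - read_length + 1)
        bases = list(reference[start:start + read_length])
        for i, base in enumerate(bases):
            if rng.random() < error_rate:
                # Substitute with one of the three other bases.
                bases[i] = rng.choice([b for b in "ACGT" if b != base])
        reads.append((start, "".join(bases)))
    return reads

# Toy reference, invented for illustration.
reference = "ACGT" * 50
for start, seq in simulate_reads(reference, 3, 25, error_rate=0.05, seed=1):
    print(start, seq)
```

Because the true origin of every read is known, simulated data of this kind is handy for checking whether a mapper or variant caller recovers the expected answer.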