2nd (Next) Generation Sequencing

1 2nd (Next) Generation Sequencing
2/2/2018

2 Introduction Why do we want to sequence a genome?
To see the sequence (assembly). To validate an experiment (insert or knockout). To compare to another genome and find variations (cancer, populations).
The problem: we cannot sequence the genome from start to end. We need to shear the DNA into smaller fragments and sequence the smaller pieces.
Sanger sequencing is slow and not high throughput: 13 years for a human genome.

3 NGS machines
Next Generation Sequencing (NGS) produces millions to billions of short reads in just a few days at lower cost. You can sequence your genome at 30X depth for USD.
Machines: Illumina HiSeq 2500, Roche 454, Ion Torrent

4 Evolution of NGS Surya Saha, Boyce Thompson Institute, Ithaca, NY (BTI plant bioinformatics course)

5 Experiment Alex Sanchez, Statistics and Bioinformatics Research Group, Statistics Department, Universitat de Barcelona

6 Definitions and standards
Reads come from molecule fragments.
Read length is the same for an entire dataset (e.g. 101 bases long).
Reads are either single-end or paired-end (mate reads): pair 1 and pair 2 are read from the two ends of the same fragment.
Physical coverage and depth
Number of reads
Duplicates (PCR or sequence)
Dark matter (PCR cannot find repeats)
Lex Nederbragt, SeRC Nordic Assembly Workshop in Stockholm, Sweden, May 14th 2014

7 Dark matter example chr22:11M-12M
RepeatMasker Gap

8 Design Choice: Fragment Length
Illumina sequencers can only sequence DNA fragments up to ~300 nt long.
DNA must be size-selected, usually by gel cut: ~ nt band cut, purified, prepared for sequencing.
Fragment length follows a normal distribution around the target cut size.

9 Design Choice: Number of Reads
Each sequencing run generates a certain number of total reads.
Number of reads per sample ~= total number of reads / number of samples.
Number of reads for one sample: library size.
You can choose a target library size for your instrument based on: desired depth, desired coverage.
For more see
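The arithmetic behind choosing a library size can be sketched in a few lines. This is only an illustration: the function names and all numbers are hypothetical, and the depth formula (sequenced bases divided by genome size) is the usual average-depth approximation, which ignores unmapped and duplicate reads.

```python
import math


def reads_per_sample(total_reads: int, n_samples: int) -> float:
    """Approximate library size when a run is split evenly across samples."""
    return total_reads / n_samples


def expected_depth(n_reads: int, read_length: int, genome_size: int) -> float:
    """Average depth = total sequenced bases / genome size."""
    return n_reads * read_length / genome_size


def reads_for_depth(target_depth: float, read_length: int, genome_size: int) -> int:
    """Smallest number of reads whose total bases reach the target depth."""
    return math.ceil(target_depth * genome_size / read_length)


# Hypothetical run: 400 million reads shared by 4 samples, 100-base reads,
# and a 3 Gb genome. None of these numbers come from the slides.
library_size = reads_per_sample(total_reads=400_000_000, n_samples=4)
print(library_size)                                      # 100000000.0
print(expected_depth(100_000_000, 100, 3_000_000_000))   # average depth for one sample
print(reads_for_depth(30, 100, 3_000_000_000))           # 900000000
```

Under these assumptions, one quarter of the run gives only about 3X depth per sample, so a 30X target would need roughly nine times more reads.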

10 Design Choice: Single End vs Paired End

11 Critical Concept: Read Mapping
Question: “Given a read and a reference sequence, where, if anywhere, in the reference does the read sequence occur?”
E.g. chr3:2,358,092-2,358,193
More on this next lecture
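The mapping question can be made concrete with a toy exact-match search. This is a minimal sketch, not how production aligners work: real read mappers index the reference and tolerate mismatches and indels, whereas the hypothetical `find_exact_matches` below simply scans a short string.

```python
def find_exact_matches(read: str, reference: str) -> list[int]:
    """Return every 0-based start position where `read` occurs in `reference`."""
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        # Resume one base further on so overlapping occurrences are found too.
        start = reference.find(read, start + 1)
    return positions


# Made-up reference and reads, for illustration only.
reference = "ACGTACGTTAGCACGT"
print(find_exact_matches("ACGT", reference))  # [0, 4, 12]
print(find_exact_matches("GGGG", reference))  # []
```

A read with several hits, like the first one here, is what makes repeats hard to map uniquely.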

12 Mapped Read Terminology
Depth: number of sequenced bases that map to a given location.
Coverage: fraction of the genomic locus covered by at least one read.
Figure labels: Genome Locus; Mapped or Aligned reads
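Both definitions can be computed directly from alignment coordinates. The sketch below assumes each mapped read is given as a 0-based, half-open (start, end) interval on the locus; the function names and the example intervals are made up for illustration.

```python
def depth_per_base(locus_length: int, alignments: list[tuple[int, int]]) -> list[int]:
    """Count how many aligned reads overlap each base of the locus."""
    depth = [0] * locus_length
    for start, end in alignments:
        # Clip each alignment to the locus before counting.
        for pos in range(max(start, 0), min(end, locus_length)):
            depth[pos] += 1
    return depth


def coverage_fraction(depth: list[int]) -> float:
    """Fraction of positions covered by at least one read."""
    return sum(1 for d in depth if d > 0) / len(depth)


# Three hypothetical reads aligned to a 10-base locus.
alignments = [(0, 4), (2, 6), (8, 10)]
depth = depth_per_base(10, alignments)
print(depth)                     # [1, 1, 2, 2, 1, 1, 0, 0, 1, 1]
print(coverage_fraction(depth))  # 0.8
```

Positions 6 and 7 have depth 0, so coverage is 8 out of 10 bases even though the maximum depth is 2.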

13 Illumina paired-end reads
Illumina is now the most common sequencer. Its error rate (~0.1%) is uniformly distributed along the read and consists of substitutions only (no indels). Older Illumina machines showed a drop in quality towards the end of the read.

14 Statistics
Fragment (insert) size follows a truncated normal distribution.
Sequencing depth is defined by the number of fragments covering a base pair of the DNA, not the number of reads; use "read depth" for the latter.
Physical coverage is the amount of the genome expected to be covered. However, "coverage" is usually used to mean depth!
Coverage follows a Poisson (or Negative Binomial) distribution with lambda = physical depth.
Read length is a fixed number for Illumina reads.
Error is usually higher toward the ends of reads, hence trimming.
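The Poisson model gives a quick estimate of how much of the genome is missed at a given mean depth. This is a sketch under the idealized assumption that per-base depth is exactly Poisson with mean lambda; real data are often more dispersed, and the function names are hypothetical.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """P(depth == k) when depth is Poisson-distributed with mean lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)


def fraction_uncovered(mean_depth: float) -> float:
    """Expected fraction of bases with zero depth: P(depth == 0) = exp(-lam)."""
    return poisson_pmf(0, mean_depth)


def fraction_below(min_depth: int, mean_depth: float) -> float:
    """Expected fraction of bases with depth strictly below min_depth."""
    return sum(poisson_pmf(k, mean_depth) for k in range(min_depth))


# Illustrative mean depths only.
for mean_depth in (1, 5, 30):
    print(mean_depth, fraction_uncovered(mean_depth), fraction_below(10, mean_depth))
```

At a mean depth of 1, about exp(-1), or roughly 37%, of bases are expected to have no read at all, which is one reason experiments target much higher depth.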

15 Illumina coverage Good coverage Bad coverage

16 Sequence Data Format: fastq
The machines output files containing short reads in fastq format. For each read there are 4 lines:
@read_header comment
Read_sequence
+[read_header]
Quality_string (in ASCII)
Scores estimate the probability that a base is called incorrectly. Q30 means 99.9% accuracy.
Reads are short, so we need a “reference sequence” to resolve where they come from (resequencing).
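The link between a quality character, its score Q, and the error probability can be sketched as follows. It assumes the common Phred+33 encoding (the score is the ASCII code minus 33) and the standard relation p = 10^(-Q/10); the slides only say the string is ASCII, so the offset is an assumption, and the function names are hypothetical.

```python
def phred_to_error_prob(q: int) -> float:
    """Probability that a base with Phred score q was called incorrectly."""
    return 10 ** (-q / 10)


def decode_quality(quality_string: str, offset: int = 33) -> list[int]:
    """Convert ASCII quality characters to Phred scores (offset 33 is assumed)."""
    return [ord(ch) - offset for ch in quality_string]


scores = decode_quality("#<BF")
print(scores)                   # [2, 27, 33, 37]
print(phred_to_error_prob(30))  # about 0.001, i.e. 99.9% accuracy
```

With this encoding, "#" is a very low score (Q2) while "F" is Q37, an error probability of about 1 in 5000.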

17 fastq format: start new read ("@" begins each record)
@SRR1997412.1 1 length=125
NTTGTAGCTGAGGAAACTGAGGCTCAGGAGGACAAGTGGCCTGCCAAAGGTACCAGCACTCAGATGGAATGGTTTTGAACTCAGTCCATTTGAACTCAGTTTGAACCTGTCTCTTATACACATCT
+SRR length=125
#<<BBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF<FFFFFFFFFFFFBFF
@SRR length=125
NTATTTAGTCATGTAAGACTCCTTAACCAGCTAACTTAAGAAAGACTTCTAGGACAGAATAGGTTACACTAGTTATAATTTTATCTTTCTTCTACTCACTTGCTTCTCAATTGAAAGAGCGGAAA
+SRR length=125

18 fastq format: unique read header
The first token after "@" (here SRR1997412.1) identifies the read.

19 fastq format: comments
Anything after the first space on the header line (here "1 length=125") is a comment and could be anything.

20 fastq format: sequence of the read
The second line of the record holds the base calls.

21 fastq format: start of quality line
The third line begins with "+".

22 fastq format: repeated read header and comment
The header and comment may be repeated after the "+"; this is not required.

23 fastq format: quality sequence of the read, in ASCII
The fourth line holds one quality character per base.

24 fastq format: next read
The next "@" header line starts the next record.
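The record layout walked through above can be turned into a small reader. This is a minimal sketch assuming strict 4-line records (no wrapped sequence lines); `FastqRecord` and `parse_fastq` are hypothetical names, and established libraries handle more edge cases.

```python
import io
from typing import Iterator, NamedTuple


class FastqRecord(NamedTuple):
    read_id: str
    comment: str
    sequence: str
    quality: str


def parse_fastq(handle) -> Iterator[FastqRecord]:
    """Yield records from a text handle, assuming exactly 4 lines per read."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        header = header.rstrip("\n")
        if not header:
            continue  # tolerate blank lines between records
        if not header.startswith("@"):
            raise ValueError(f"expected '@' at start of record, got: {header!r}")
        sequence = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not plus.startswith("+"):
            raise ValueError(f"expected '+' separator line, got: {plus!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence and quality lengths differ for {header!r}")
        # The read id is the first token; everything after the first space is comment.
        read_id, _, comment = header[1:].partition(" ")
        yield FastqRecord(read_id, comment, sequence, quality)


# Tiny made-up record, not taken from the slides.
example = io.StringIO("@read1 1 length=8\nACGTACGT\n+\nFFFFFFFF\n")
for record in parse_fastq(example):
    print(record.read_id, record.comment, len(record.sequence))  # read1 1 length=8 8
```

Checking that the sequence and quality lines have equal length is a cheap way to catch truncated files early.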

25 Comparison
Quail, Michael A., et al. "A tale of three next generation sequencing platforms: comparison of Ion Torrent, Pacific Biosciences and Illumina MiSeq sequencers." BMC Genomics 13.1 (2012): 341.

26 Public data and platforms
NCBI
Illumina BaseSpace
Google genomics cloud
Genome In A Bottle (GIAB)
REPOSITIVE
GDC
Seven Bridges

27 NCBI sra portal

28 NCBI ftp

29 1000 genomes on Google

30 Illumina BaseSpace

31 GDC

32 Simulating sequencing data
ART: WGS simulator
WGSIM: WGS simulator
PBSIM: PacBio simulator
See more on OMIC tools
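To show what such simulators do in principle, here is a toy paired-end sampler. It is not how ART, WGSIM or PBSIM are implemented: it draws fragment lengths from a truncated normal distribution, takes one read from each end of the fragment, and models no sequencing errors or quality scores. All names and parameters are hypothetical.

```python
import random

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse-complement a DNA string containing only A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]


def simulate_pairs(reference, n_pairs, read_length, mean_fragment, sd_fragment, seed=0):
    """Sample error-free read pairs from random fragments of the reference."""
    if read_length > len(reference):
        raise ValueError("read_length exceeds reference length")
    rng = random.Random(seed)
    pairs = []
    while len(pairs) < n_pairs:
        frag_len = round(rng.gauss(mean_fragment, sd_fragment))
        # Truncate the normal: a fragment must hold a full read and fit the reference.
        if frag_len < read_length or frag_len > len(reference):
            continue
        start = rng.randint(0, len(reference) - frag_len)
        fragment = reference[start:start + frag_len]
        mate1 = fragment[:read_length]                       # forward read from the 5' end
        mate2 = reverse_complement(fragment[-read_length:])  # reverse read from the other end
        pairs.append((start, frag_len, mate1, mate2))
    return pairs


# Random 500-base "genome", used only to exercise the function.
rng = random.Random(42)
reference = "".join(rng.choice("ACGT") for _ in range(500))
for start, frag_len, mate1, mate2 in simulate_pairs(reference, 3, 25, 100, 10):
    print(start, frag_len, mate1, mate2)
```

Because the true origin of every simulated pair is known, output like this can be used to check whether a mapping step places reads where they came from.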

