2nd (Next) Generation Sequencing


2nd (Next) Generation Sequencing 2/2/2018

Introduction
Why do we want to sequence a genome?
  To see the sequence (assembly).
  To validate an experiment (insert or knockout).
  To compare to another genome and find variations (cancer, populations).
The problem: we cannot sequence a genome from start to end. We need to shear the DNA into smaller fragments and sequence those smaller pieces.
Sanger sequencing is slow and not high throughput: it took 13 years to sequence a human genome.

NGS machines
Next Generation Sequencing (NGS) produces millions to billions of short reads in just a few days, at lower cost. You can sequence your genome at 30X depth for 1000-1500 USD.
Example machines: Illumina HiSeq 2500, Roche 454, Ion Torrent.

Evolution of NGS (figure). Source: Surya Saha, Boyce Thompson Institute, Ithaca, NY (BTI plant bioinformatics course).

Experiment (figure). Source: Alex Sanchez, Statistics and Bioinformatics Research Group, Statistics Department, Universitat de Barcelona.

Definitions and standards
Reads come from molecule fragments.
Read length is the same for an entire dataset (e.g. 101 bases long).
Reads are either single-end or paired-end (pair 1 and pair 2 are read from the two ends of the same fragment).
Mate reads
Physical coverage and depth
Number of reads
Duplicates (PCR or sequence)
Dark matter (PCR cannot find repeats)
Source: Lex Nederbragt, SeRC Nordic Assembly Workshop in Stockholm, Sweden, May 14th 2014.

Dark matter example (figure): chr22:11M-12M, with RepeatMasker and Gap tracks.

Design Choice: Fragment Length
Illumina sequencers can only sequence DNA fragments up to ~300nt long.
DNA must be size-selected, usually by gel cut: a ~200-300nt band is cut, purified, and prepared for sequencing.
Fragment length follows an approximately normal distribution around the target cut size.

Design Choice: Number of Reads
Each sequencing run generates a certain total number of reads.
Number of reads per sample ~= total number of reads / number of samples.
The number of reads for one sample is its library size.
You can choose a target library size for your instrument based on:
  Desired depth
  Desired coverage
For more see https://genohub.com/recommended-sequencing-coverage-by-application/
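The arithmetic behind a target library size can be sketched as follows. This is a minimal illustration, not a planning tool: it assumes mean depth = (number of reads x read length) / genome size, and the function names, the 3.1 Gbp genome size, and the 400 million reads per run are hypothetical values chosen for the example.

```python
import math

def reads_needed(genome_size_bp, read_length_bp, target_depth, paired_end=False):
    """Reads (or read pairs, if paired_end=True) needed for a target mean depth.

    Assumes mean depth = sequenced bases / genome size.
    """
    bases_needed = genome_size_bp * target_depth
    # A read pair contributes two reads' worth of bases.
    bases_per_unit = read_length_bp * (2 if paired_end else 1)
    return math.ceil(bases_needed / bases_per_unit)

def reads_per_sample(total_reads_per_run, n_samples):
    """Approximate library size when a run is split evenly across samples."""
    return total_reads_per_run // n_samples

# Hypothetical numbers for illustration only.
genome_size = 3_100_000_000  # assumed human genome size in bp
print(reads_needed(genome_size, 101, 30))                   # single-end reads
print(reads_needed(genome_size, 101, 30, paired_end=True))  # read pairs
print(reads_per_sample(400_000_000, 8))                     # reads per sample
```

Real runs lose reads to duplicates, low quality, and unmapped reads, so the number actually ordered is usually higher than this estimate.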

Design Choice: Single End vs Paired End

Critical Concept: Read Mapping
Question: “Given a read and a reference sequence, where, if anywhere, in the reference does the read sequence occur?”
E.g. chr3:2,358,092-2,358,193
More on this next lecture.
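To make the question concrete, here is a toy sketch that searches a reference for exact occurrences of a read on either strand. It is only an illustration of the question being asked: real mappers use indexed data structures and tolerate mismatches. The function names and the `chrToy` sequence are made up for this example.

```python
def reverse_complement(seq):
    """Reverse complement of a DNA string over A, C, G, T, N."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq))

def find_exact_hits(read, reference, chrom="chrToy"):
    """Return 1-based, inclusive locations where the read occurs exactly.

    Both strands are searched; a read equal to its own reverse
    complement is searched only once.
    """
    queries = [("+", read)]
    rc = reverse_complement(read)
    if rc != read:
        queries.append(("-", rc))
    hits = []
    for strand, query in queries:
        start = reference.find(query)
        while start != -1:
            hits.append(f"{chrom}:{start + 1}-{start + len(query)}({strand})")
            start = reference.find(query, start + 1)
    return hits

reference = "ACGTTTGACCA"  # toy reference, not real genomic sequence
print(find_exact_hits("TTGA", reference))
print(find_exact_hits("GGTC", reference))
print(find_exact_hits("AAAA", reference))
```

An empty list corresponds to the "if anywhere" part of the question: the read does not occur in the reference.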

Mapped Read Terminology
Mapped (or aligned) reads are placed on a genome locus.
Depth: number of sequenced bases that map to a given location.
Coverage: fraction of the genomic locus covered by at least one read.

Illumina paired-end reads
Illumina is now the most common sequencer. Its error rate is low (~0.1%) and roughly uniformly distributed; errors are predominantly substitutions, and indels are rare. Older Illumina machines showed a drop in quality towards the end of the read.

Statistics
Fragment (insert) size follows a truncated normal distribution.
Sequencing depth is defined by the number of fragments covering a bp of the DNA, not the number of reads; use "read depth" when counting reads.
Physical coverage is the amount of the genome expected to be covered. However, "coverage" is usually used to mean depth!
Coverage follows a Poisson (or negative binomial) distribution with lambda = physical depth.
Read length is a fixed number for Illumina reads.
Error is usually higher toward the ends of reads, which is why reads are trimmed.
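Under the Poisson model, the mean depth lambda determines how much of the genome is expected to be missed or under-covered. The sketch below assumes a pure Poisson model (no overdispersion, no mappability effects), so treat the numbers as optimistic; the function names are hypothetical.

```python
import math

def poisson_pmf(k, lam):
    """P(depth == k) when depth is Poisson with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def expected_fraction_covered(lam):
    """Expected fraction of positions with depth >= 1, i.e. 1 - P(depth == 0)."""
    return 1.0 - math.exp(-lam)

def prob_depth_below(threshold, lam):
    """P(depth < threshold), e.g. the chance a position is too shallow to call."""
    return sum(poisson_pmf(k, lam) for k in range(threshold))

for lam in (1, 5, 30):
    print(lam, expected_fraction_covered(lam), prob_depth_below(10, lam))
```

This is one way to see why a 30X target is common: at low mean depth a noticeable share of positions is expected to fall below a usable depth even if the average looks acceptable.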

Illumina coverage (figure): examples of good coverage and bad coverage.

Sequence Data Format: fastq
The machines output files containing short reads in fastq format. For each read there are 4 lines:
  @read_header comment
  Read_sequence
  +[read_header]
  Quality_string (in ASCII)
Quality scores estimate the probability that a base is called incorrectly. Q30 means 99.9% accuracy.
Reads are short, so we need a “reference sequence” to resolve where they come from (resequencing).
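The relation between a quality score Q and the error probability is P = 10^(-Q/10), which is how Q30 corresponds to 99.9% accuracy. The sketch below also decodes an ASCII quality string, assuming the Phred+33 offset; the slides do not state the offset, so it is an assumption here, and the function names are hypothetical.

```python
def phred_to_error_prob(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)

def decode_quality(quality_string, offset=33):
    """Convert an ASCII quality string to Phred scores.

    offset=33 (Phred+33) is assumed; some older files use a different offset.
    """
    return [ord(char) - offset for char in quality_string]

for q in (10, 20, 30, 40):
    print(q, phred_to_error_prob(q), 1 - phred_to_error_prob(q))

scores = decode_quality("#<BF")
print(scores)
print([phred_to_error_prob(q) for q in scores])
```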

fastq format (annotated example)

@SRR1997412.1 1 length=125
NTTGTAGCTGAGGAAACTGAGGCTCAGGAGGACAAGTGGCCTGCCAAAGGTACCAGCACTCAGATGGAATGGTTTTGAACTCAGTCCATTTGAACTCAGTTTGAACCTGTCTCTTATACACATCT
+SRR1997412.1 1 length=125
#<<BBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF<FFFFFFFFFFFFBFF
@SRR1997412.2 2 length=125
NTATTTAGTCATGTAAGACTCCTTAACCAGCTAACTTAAGAAAGACTTCTAGGACAGAATAGGTTACACTAGTTATAATTTTATCTTTCTTCTACTCACTTGCTTCTCAATTGAAAGAGCGGAAA
+SRR1997412.2 2 length=125

Reading the first record:
  "@" starts a new read.
  "SRR1997412.1" is the unique read header.
  "1 length=125" is the comment, separated from the header by a space; it could be anything.
  The second line is the sequence of the read.
  "+" starts the quality line; repeating the read header and comment after it is not required.
  The fourth line is the quality sequence of the read, in ASCII.
  The next "@" line starts the next read.
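The four-line record layout can be read with a few lines of code. This is a minimal sketch that assumes strict four-line records with no blank or wrapped lines; the function name `read_fastq` and the short toy records are made up for illustration, and established libraries handle many more edge cases.

```python
import io

def read_fastq(handle):
    """Yield (header, comment, sequence, quality) for each four-line record."""
    while True:
        title = handle.readline().rstrip("\n")
        if not title:
            return  # end of file (a blank line also stops this simple reader)
        sequence = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not title.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record near {title!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {title!r}")
        # The header runs up to the first space; the rest is the comment.
        header, _, comment = title[1:].partition(" ")
        yield header, comment, sequence, quality

# Toy data in the same layout as the slide example, shortened for readability.
toy = io.StringIO(
    "@read.1 1 length=4\nACGT\n+read.1 1 length=4\nFF#<\n"
    "@read.2\nNNNN\n+\n####\n"
)
for record in read_fastq(toy):
    print(record)
```

To read a real file, pass an open text-mode file handle instead of the `io.StringIO` object.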

Comparison. Source: Quail, Michael A., et al. "A tale of three next generation sequencing platforms: comparison of Ion Torrent, Pacific Biosciences and Illumina MiSeq sequencers." BMC Genomics 13.1 (2012): 341.

Public data and platforms
NCBI (https://www.ncbi.nlm.nih.gov/sra)
Illumina BaseSpace (https://basespace.illumina.com/home/index)
Google genomics cloud (https://console.cloud.google.com/genomics/)
Genome In A Bottle (GIAB) (http://jimb.stanford.edu/giab/)
REPOSITIVE (https://discover.repositive.io/datasets/)
GDC (https://portal.gdc.cancer.gov/)
Seven Bridges (https://igor.sbgenomics.com/)

NCBI SRA portal

NCBI FTP

1000 genomes on Google

Illumina BaseSpace

GDC

Simulating sequencing data
ART: WGS simulator
WGSIM: WGS simulator
PBSIM: PacBio simulator
See more on OMIC tools (https://omictools.com/read-simulators-category)
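To show what a read simulator does in principle, here is a toy paired-end simulator. It is not how ART, WGSIM, or PBSIM work internally; it only combines the assumptions stated earlier in these slides: fragment lengths drawn from a truncated normal distribution, fixed read length, and a small uniform substitution error rate with no indels. All function names, parameters, and default values are hypothetical.

```python
import random

def reverse_complement(seq):
    """Reverse complement of a DNA string over A, C, G, T, N."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq))

def add_substitutions(read, error_rate, rng):
    """Replace each base with a different base with probability error_rate."""
    out = []
    for base in read:
        if rng.random() < error_rate:
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
    return "".join(out)

def simulate_paired_reads(reference, n_pairs, read_len=50, frag_mean=200,
                          frag_sd=20, error_rate=0.001, seed=0):
    """Return a list of (fragment_start, fragment_length, read1, read2)."""
    if not read_len <= frag_mean <= len(reference):
        raise ValueError("need read_len <= frag_mean <= len(reference)")
    rng = random.Random(seed)
    pairs = []
    while len(pairs) < n_pairs:
        frag_len = int(round(rng.gauss(frag_mean, frag_sd)))
        # Truncate the normal: reject fragments that are too short or too long.
        if frag_len < read_len or frag_len > len(reference):
            continue
        start = rng.randint(0, len(reference) - frag_len)
        fragment = reference[start:start + frag_len]
        # Read 1 comes from the fragment start, read 2 from the opposite strand.
        read1 = add_substitutions(fragment[:read_len], error_rate, rng)
        read2 = add_substitutions(reverse_complement(fragment[-read_len:]),
                                  error_rate, rng)
        pairs.append((start, frag_len, read1, read2))
    return pairs

# Toy reference made of random bases, for illustration only.
ref_rng = random.Random(42)
toy_reference = "".join(ref_rng.choice("ACGT") for _ in range(1000))
for pair in simulate_paired_reads(toy_reference, 2):
    print(pair)
```

Because the true fragment start is returned with each pair, simulated reads like these can be used to check whether a mapper places reads where they came from.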