Canadian Bioinformatics Workshops www.bioinformatics.ca.

Presentation transcript:

Canadian Bioinformatics Workshops


Module 4: Mapping and Genome Rearrangement
[Figure: a DNA fragment with paired-end reads ATCAA and CTAAG]

Module 4 bioinformatics.ca
Learning Objectives of Module
Understand mapping sequence reads to a reference genome
Understand file formats like FASTA, FASTQ and SAM/BAM
Learn common terminology used to describe alignments
Learn how paired-end reads can be used to find genome rearrangements
Run a mapper and rearrangement caller

Sequencing platforms
[Figure: sequencing platforms arranged by Increasing Run Time and Increasing Data Per Run, with $ markers. Labelled yields: 700Mb/23h, 150Mb/3h, 100Mb/1h, 2Gb/27h, 100Gb/15d, 90Gb/10d, 600Gb/10d, 14TB/run, 120Gb/1d; Proton? GridION?]
Cross-platform data integration needed.

Basecalling
How do we translate the machine data to base calls?
How do we estimate and represent sequencing errors?

Sources of error
Illumina: Pre-phasing & Phasing

What is a base quality?
Base Quality    P error (obs. base)
3               50 %
5               32 %
10              10 %
20              1 %
Phred quality scores:
– Estimate of probability the base call is incorrect
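
The table rows follow from the standard Phred definition, Q = -10 log10(P). A minimal Python sketch of the conversion in both directions; the function names are illustrative and do not come from the slides:

```python
import math

def phred_to_error(q):
    """Probability that a base call with Phred quality q is incorrect."""
    return 10 ** (-q / 10)

def error_to_phred(p):
    """Phred quality corresponding to an error probability p."""
    return -10 * math.log10(p)

# Print the error probability for a few quality values.
for q in (3, 5, 10, 20, 30, 40):
    print(q, f"{phred_to_error(q):.4%}")
```

Q3 and Q5 round to the 50 % and 32 % shown in the table.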

Error Profiles
Illumina – Low error rate (~0.5%), mainly substitutions
454/Ion Torrent – Mainly insertions/deletions in homopolymer runs
PacBio – Higher error rate, mixture of insertions, deletions, substitutions

Mismatch by cycle

Fasta files
[Example files: ASF-1.fa, ASF-2.fa]
Reads are often stored in fasta files
Separate files for the forward and reverse reads of each pair
header line (starts with >): identifier
sequence lines: nucleotides

Fastq files
[Example files: ASF-1.fastq, ASF-2.fastq]
Most reads are stored in fastq
4 lines per read:
– header
– sequence line
– line beginning with +
– encoded quality value line
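
The four-line record layout can be read with a few lines of Python. This is a sketch under stated assumptions: qualities are taken to be Phred+33 encoded (the slide does not say which encoding is used), and the record in `example` is invented for illustration.

```python
import io

def read_fastq(handle):
    """Yield (name, sequence, qualities) for each 4-line FASTQ record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        # Assumption: Phred+33 encoding (ASCII code minus 33).
        yield header[1:], seq, [ord(c) - 33 for c in qual]

# Invented record, used only to exercise the parser.
example = io.StringIO("@read1\nACGT\n+\nII5+\n")
records = list(read_fastq(example))
print(records)
```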

Reference-based Alignment
Goal:
– find position in reference genome from which read was sampled
Issues:
– the human genome is large and repetitive
– NGS instruments produce huge amounts of data
– the sequenced genome will differ from the reference due to SNPs, indels and structural variation

Choosing an Aligner
High accuracy needed
– Misaligned reads are a source of false positive variant calls
High sensitivity needed
– The aligner must allow for differences between the individual and reference to find the correct mapping position
High speed needed
– With large data the informatics cost is significant
We will use the popular aligner bwa in the tutorial

Reference alignments
[Figure: a sequence read of unknown origin (?) is placed on the reference genome; positions in the aligned read are marked x]

Alignment Quality
Most aligners will estimate how reliable the alignment is with a Mapping Quality
– Phred-scaled estimate of the probability that the chosen mapping is wrong
– On average, 1 in 1000 reads with a “Q30” alignment will be placed incorrectly
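
Because mapping quality is Phred-scaled, the expected number of misplaced reads in a set is the sum of the per-read error probabilities. A small sketch; the function name and the input list are illustrative:

```python
def expected_misplaced(mapqs):
    """Expected number of wrongly placed reads, given their mapping qualities."""
    return sum(10 ** (-q / 10) for q in mapqs)

# 1000 reads that all have mapping quality 30.
print(expected_misplaced([30] * 1000))
```

The result is approximately 1, which is the "1 in 1000" figure on the slide.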

What are Paired Reads?
[Figure: a DNA fragment with paired-end reads ATCAA and CTAAG at its ends; the insert size (IS) is marked]
Slides by M. Brudno

Paired Reads
[Figure: a sequence read pair of unknown origin (?) and the reference genome]

Read pair alignment
[Figure: the sequence read pair aligned to the reference genome; positions are marked x]

Working with alignments
SAM/BAM is a standardized format for working with read alignments
SAM is a tab-delimited text representation
BAM is a compressed binary representation
Example record: SRR M = NCCAGCAGCCATAACTGGAATGGGAAATAAACACTATGTTCAAAGCAGAGAAAATAGGAGT GTGCAATAGACTTAT 8AB685C26091:77
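
Since SAM is tab-delimited, an alignment line can be split into its eleven mandatory columns directly. The sketch below uses the column names from the SAM specification; the record in `example` is invented for illustration and is not the record shown on the slide.

```python
SAM_FIELDS = ["qname", "flag", "rname", "pos", "mapq", "cigar",
              "rnext", "pnext", "tlen", "seq", "qual"]
INT_FIELDS = {"flag", "pos", "mapq", "pnext", "tlen"}

def parse_sam_line(line):
    """Split one SAM alignment line into a dict of its mandatory fields."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < len(SAM_FIELDS):
        raise ValueError("a SAM alignment line needs at least 11 columns")
    record = {name: (int(value) if name in INT_FIELDS else value)
              for name, value in zip(SAM_FIELDS, cols)}
    record["tags"] = cols[len(SAM_FIELDS):]  # optional TAG:TYPE:VALUE columns
    return record

# Invented record: read name, flag, chromosome, coordinate, mapping quality,
# CIGAR, mate chromosome, mate position, insert size, sequence, qualities.
example = "read1\t99\tchr1\t100\t60\t4M\t=\t300\t204\tACGT\tIIII"
rec = parse_sam_line(example)
print(rec["rname"], rec["pos"], rec["mapq"], rec["tlen"])
```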

SAM Description
Read name
Flag ➞ Flag indicates the reference strand, pairing information
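
The flag is a bit field, so strand and pairing information is recovered by testing individual bits. A minimal sketch using the bit values defined in the SAM specification (only the eight lowest bits are covered); the label strings are illustrative:

```python
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
}

def decode_flag(flag):
    """Return the names of the bits that are set in a SAM flag."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]

print(decode_flag(99))
print(decode_flag(147))
```

Flags 99 and 147 are a typical combination for the two reads of a properly paired, forward/reverse pair.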

SAM Description
Chromosome
Coordinate

SAM Description
Mapping Quality

SAM Description
CIGAR
REF   ACGATACATAC      REF   GACA-AACC
READ  ACGA-ACATAC      READ  GTCATAACC
CIGAR: 4M1D6M          CIGAR: 4M1I4M
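
A CIGAR string can be broken into (length, operation) pairs to work out how many reference and read bases an alignment covers. This sketch follows the operation letters of the SAM specification and checks itself against the two CIGAR strings on the slide; the function names are illustrative.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_OPS = set("MDN=X")   # operations that consume reference bases
READ_OPS = set("MIS=X")  # operations that consume read bases

def parse_cigar(cigar):
    """Return a list of (length, operation) pairs."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    if "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError(f"invalid CIGAR string: {cigar!r}")
    return ops

def aligned_lengths(cigar):
    """Return (reference bases covered, read bases covered)."""
    ops = parse_cigar(cigar)
    ref_len = sum(n for n, op in ops if op in REF_OPS)
    read_len = sum(n for n, op in ops if op in READ_OPS)
    return ref_len, read_len

print(aligned_lengths("4M1D6M"))  # deletion: 11 reference bases, 10 read bases
print(aligned_lengths("4M1I4M"))  # insertion: 8 reference bases, 9 read bases
```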

SAM Description
Mate chromosome, position
Insert size
[Figure: a DNA fragment with paired-end reads ATCAA and CTAAG; the insert size (IS) is marked]

Resources
samtools: toolkit for working with SAM/BAM files
– Convert between SAM/BAM
– Sort alignments
– Extract alignments for a given genomic location
SAM/BAM specification:
Questions/Help

We are now going to start an exercise in read mapping

We are on a Coffee Break & Networking Session

What kinds of variation are there?
Single Nucleotide Polymorphisms (SNPs)
Short indels (< read length)
Structural variations
– Large insertions and deletions
– Inversions
– Translocations
– Copy number variation

Structural variants
Mate-pair and paired-end reads can be used to detect structural variants
[Figure: library preparation from genomic DNA. Mate-Pairs (kb-scale fragments): fragmentation & circularization to an internal adaptor; shear; isolate internal adaptors and fragment ends; add amplification and sequencing adaptors; sequence. Paired-Ends (200 – 500bp fragments): fragmentation; add amplification and sequencing adaptors; sequence.]

Read pair orientation
[Figure: a sequence read pair on the reference genome]
The expected orientation is one read on the forward strand and one read on the reverse strand for paired-end reads

Read pair alignment
Fragment/insert size is determined by library preparation
Pairs that match the expected orientation and distance are called concordant
Discordant read pairs give evidence of structural variation
[Figure: histogram of fragment number against fragment size]
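
One way to make "expected distance" concrete is to estimate the insert size distribution from the data and accept pairs within a few standard deviations of the mean. This is a sketch under stated assumptions: the cutoff of 3 standard deviations and the sample values are illustrative, and real callers typically use more robust estimates.

```python
import statistics

def insert_size_bounds(sizes, n_sd=3.0):
    """Lower and upper insert size cutoffs: mean +/- n_sd standard deviations."""
    mean = statistics.mean(sizes)
    sd = statistics.stdev(sizes)
    return mean - n_sd * sd, mean + n_sd * sd

def is_concordant(insert_size, forward_reverse, bounds):
    """True if the pair has the expected orientation and an expected distance."""
    low, high = bounds
    return forward_reverse and low <= insert_size <= high

# Invented insert sizes from a library with fragments around 300 bp.
bounds = insert_size_bounds([290, 300, 310, 295, 305])
print(bounds)
print(is_concordant(302, True, bounds))    # expected distance and orientation
print(is_concordant(5000, True, bounds))   # mapped distance far too large
print(is_concordant(300, False, bounds))   # unexpected orientation
```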

SV Signatures: Deletion
[Figure: don vs. ref]
Deletion signature: mapped insert size larger than expected
Slides by M. Brudno

SV Signatures: Insertion
[Figure: don vs. ref]
Insertion signature: mapped insert size smaller than expected
Slides by M. Brudno

SV Signatures: Tandem Duplication
[Figure: don vs. ref]
Tandem duplication signature: wrong orientation

SV Signatures: Inversion
[Figure: don vs. ref]
Inversion signature: wrong orientation of pairs

SV summary
Type                  Mapped Distance         Orientation
Insertion             too small               correct
Deletion              too big                 correct
Inversion*
Tandem duplication*
Interchromosomal      different chromosomes   N/A
Slides by M. Brudno
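
The signatures can be turned into a simple decision rule per read pair. This is only a sketch: the function name, argument layout and cutoffs are hypothetical, and inversions and tandem duplications are reported together because the slides describe both simply as a wrong orientation.

```python
def sv_signature(chrom1, chrom2, mapped_distance, orientation_ok, low, high):
    """Label one read pair with the structural variant signature it suggests.

    low and high are the smallest and largest mapped distances that are
    still considered normal for the library (hypothetical cutoffs).
    """
    if chrom1 != chrom2:
        return "interchromosomal"
    if not orientation_ok:
        return "inversion or tandem duplication"
    if mapped_distance > high:
        return "deletion"
    if mapped_distance < low:
        return "insertion"
    return "concordant"

print(sv_signature("chr1", "chr1", 2500, True, 250, 350))
print(sv_signature("chr1", "chr1", 120, True, 250, 350))
print(sv_signature("chr1", "chr1", 300, False, 250, 350))
print(sv_signature("chr1", "chr7", 0, True, 250, 350))
```

A real caller would cluster many pairs with the same signature before reporting a variant, since a single discordant pair may just be a mapping error.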

Where can we go wrong: missed insertion
[Figure: don vs. ref, with the insert size (IS) marked]
Insertions larger than the insert size cannot be detected this way

Structural Variants and Split Reads
[Figure: paired short reads are aligned to the reference genome]
Most of these pairs can be aligned to the reference genome
For some paired-end reads, one read of the pair may not be mapped because it spans the breakpoint of a structural variant. We call such reads split reads.
Slides by M. Brudno

Deletion: split read signature
[Figure: don vs. ref]
Signature: read aligns in two pieces, one on either side of the breakpoint
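
When the two pieces of a split read are both placed on the reference, the deleted interval is the gap between them. A minimal sketch, assuming each piece is given as (chromosome, start, end) with 1-based inclusive coordinates, in read order and on the same strand; the representation and the coordinates are hypothetical.

```python
def deletion_from_split(left, right):
    """Infer the deleted reference interval from the two pieces of a split read.

    Returns (chromosome, first deleted base, last deleted base), or None
    when the pieces do not leave a gap on the same chromosome.
    """
    if left[0] != right[0] or right[1] <= left[2]:
        return None
    gap = right[1] - left[2] - 1
    if gap <= 0:
        return None
    return (left[0], left[2] + 1, right[1] - 1)

# Invented example: two 50 bp pieces separated by 500 reference bases.
print(deletion_from_split(("chr1", 1000, 1049), ("chr1", 1550, 1599)))
```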

Somatic vs. Germline
tumor vs. normal sequencing
approach 1:
– find SVs separately in two samples
– filter out candidate somatic SVs that overlap germline SVs
approach 2:
– find somatic SVs
– for each somatic SV, find any type of evidence in germline
– filter out anything with germline evidence
Slides by M. Brudno
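
Approach 1 amounts to an interval overlap filter. The sketch below represents each SV as a hypothetical (chromosome, start, end) tuple with inclusive coordinates and ignores SV type and breakpoint uncertainty, which a real comparison would need to handle.

```python
def overlaps(a, b):
    """True if two (chromosome, start, end) intervals share at least one base."""
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]

def somatic_candidates(tumor_svs, normal_svs):
    """Keep the tumor SVs that do not overlap any SV found in the normal sample."""
    return [t for t in tumor_svs
            if not any(overlaps(t, n) for n in normal_svs)]

# Invented calls from the two samples.
tumor = [("chr1", 1000, 2000), ("chr2", 500, 900)]
normal = [("chr1", 1500, 2500)]
print(somatic_candidates(tumor, normal))
```

Approach 2 is stricter: instead of comparing against called germline SVs, it looks for any supporting evidence in the normal sample, which also catches germline variants the caller missed there.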

Gene fusions
if a linking signature connects two genes, this might indicate a gene fusion
[Figure: Gene X on ChrA and Gene Y on ChrB joined into a fused Gene XY and its protein]
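
Checking whether a linking signature connects two genes is a lookup of both ends against gene annotations. A minimal sketch: the gene names follow the figure labels, while the coordinates, the annotation layout and the function names are invented for illustration.

```python
# Hypothetical annotation: gene name -> (chromosome, start, end), inclusive.
GENES = {
    "GeneX": ("chrA", 1000, 5000),
    "GeneY": ("chrB", 20000, 26000),
}

def gene_at(chrom, pos, genes):
    """Return the name of the gene containing this position, or None."""
    for name, (g_chrom, start, end) in genes.items():
        if chrom == g_chrom and start <= pos <= end:
            return name
    return None

def fusion_candidate(end1, end2, genes):
    """Return (gene1, gene2) if the two ends of a link fall in different genes."""
    g1 = gene_at(end1[0], end1[1], genes)
    g2 = gene_at(end2[0], end2[1], genes)
    if g1 and g2 and g1 != g2:
        return (g1, g2)
    return None

print(fusion_candidate(("chrA", 4200), ("chrB", 21000), GENES))
```

A hit is only a candidate: whether the fusion produces a protein depends on strand, reading frame and the exact breakpoints.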

SV Software and Exercise
We will use HYDRA-SV in the tutorial
– Quinlan et al, Genome-wide mapping and assembly of structural variant breakpoints in the mouse genome. Genome Research
Many others exist:
– Breakdancer, GASV, Pindel
– It is worth spending time learning multiple packages and their strengths and weaknesses
– There is rarely one program that fits all needs!

We are now going to start an exercise in structural variant detection

We are on a Coffee Break & Networking Session

Any questions?