Published by Lenard Fields. Modified over 7 years ago.
1
Canadian Bioinformatics Workshops www.bioinformatics.ca
2
Module #: Title of Module
3
Module 4: Mapping and Genome Rearrangement
[Figure: a DNA fragment and its paired-end reads (ATCAA, CTAAG)]
4
Module 4 bioinformatics.ca
Learning Objectives of Module
– Understand mapping sequence reads to a reference genome
– Understand file formats like FASTA, FASTQ and SAM/BAM
– Learn common terminology used to describe alignments
– Learn how paired-end reads can be used to find genome rearrangements
– Run a mapper and rearrangement caller
5
Sequencing platforms
[Figure: sequencing platforms arranged by increasing run time and increasing data per run. Throughput labels: 700Mb/23h, 150Mb/3h, 100Mb/1h, 2Gb/27h, 100Gb/15d, 90Gb/10d, 600Gb/10d, 14TB/run, 120Gb/1d. Proton and GridION are shown with question marks.]
Cross-platform data integration needed.
6
Basecalling
– How do we translate the machine data to base calls?
– How do we estimate and represent sequencing errors?
7
Sources of error
– Illumina: pre-phasing and phasing
8
What is a base quality?
Phred quality scores:
– Estimate of the probability that the base call is incorrect

Base Quality    P error (obs. base)
3               50 %
5               32 %
10              10 %
20              1 %
30              0.1 %
40              0.01 %
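The table follows from the Phred definition Q = -10 * log10(P). A minimal sketch of the conversion in both directions is given here. The function names are illustrative, and the ASCII offset of 33 used to decode a quality string is an assumption (the common Sanger-style encoding) that the slides do not state.

```python
import math

def phred_to_error_prob(q: float) -> float:
    """Probability that a base call with Phred quality q is wrong."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p: float) -> float:
    """Phred quality for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)

def decode_quality_string(qual: str, offset: int = 33) -> list:
    """Turn an encoded quality string into integer Phred scores.

    The offset of 33 is an assumption, not something the slides specify.
    """
    return [ord(ch) - offset for ch in qual]

# Reproduce the table: quality -> error probability in percent.
for q in (3, 5, 10, 20, 30, 40):
    print(q, f"{phred_to_error_prob(q) * 100:.2f} %")
```

Qualities 3 and 5 come out as roughly 50 % and 32 %, which is where the rounded values in the table come from.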
9
Error Profiles
– Illumina: low error rate (~0.5%), mainly substitutions
– 454/Ion Torrent: mainly insertions/deletions in homopolymer runs
– PacBio: higher error rate, mixture of insertions, deletions, substitutions
10
Mismatch by cycle
11
FASTA files
Example files: ASF-1.fa, ASF-2.fa
– Reads are often stored in FASTA files
– Separate files for the forward and reverse reads of each pair
– Header line: identifier
– Sequence lines: nucleotides
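As an illustration of the header-line/sequence-line layout, here is a minimal FASTA reader sketch. It assumes the usual convention that header lines start with ">" and that a sequence may span several lines. The record names and sequences in the example are made up.

```python
from typing import Iterable, Iterator, Tuple

def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (identifier, sequence) pairs from FASTA-formatted lines."""
    name = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                yield name, "".join(chunks)
            parts = line[1:].split()
            name = parts[0] if parts else ""
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)

# Hypothetical two-record example; the second sequence spans two lines.
example = """>read1
ATCAA
>read2
CTA
AG
"""
records = list(read_fasta(example.splitlines()))
print(records)
```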
12
FASTQ files
Example files: ASF-1.fastq, ASF-2.fastq
Most reads are stored in FASTQ
4 lines per read:
– header line: @SEQUENCE_ID
– sequence line
– line beginning with +
– encoded quality value line
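The four-lines-per-read layout can be sketched as a small reader. This is an illustration only: it loads all lines into memory and assumes strictly four lines per record, and the record names and qualities below are invented. Reading in fixed groups of four matters because a quality line may itself begin with "@".

```python
from typing import Iterable, Iterator, NamedTuple

class FastqRecord(NamedTuple):
    name: str
    sequence: str
    quality: str

def read_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield records from FASTQ lines, assuming exactly 4 lines per read."""
    rows = [line.rstrip("\r\n") for line in lines if line.strip()]
    if len(rows) % 4 != 0:
        raise ValueError("FASTQ input is not a multiple of 4 lines")
    for i in range(0, len(rows), 4):
        header, seq, plus, qual = rows[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        yield FastqRecord(header[1:], seq, qual)

# Hypothetical example; the second quality line starts with "@" on purpose.
example = "@read1\nATCAA\n+\nIIIII\n@read2\nCTAAG\n+\n@>A#B\n"
fastq_records = list(read_fastq(example.splitlines()))
print(fastq_records)
```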
13
Reference-based Alignment
Goal:
– find the position in the reference genome from which the read was sampled
Issues:
– the human genome is large and repetitive
– NGS instruments produce huge amounts of data
– the sequenced genome will differ from the reference due to SNPs, indels and structural variation
14
Choosing an Aligner
High accuracy needed
– Misaligned reads are a source of false positive variant calls
High sensitivity needed
– The aligner must allow for differences between the individual and the reference to find the correct mapping position
High speed needed
– With large data the informatics cost is significant
We will use the popular aligner bwa in the tutorial
16
Reference alignments
[Figure: a sequence read whose position on the reference genome is unknown (?)]
17
Reference alignments
[Figure: the sequence read placed on the reference genome, with three positions marked x]
18
Alignment Quality
Most aligners estimate how reliable an alignment is and report it as a Mapping Quality
– Phred-scaled estimate of the probability that the chosen mapping is wrong
– On average, 1 in 1000 reads with a “Q30” alignment will be placed incorrectly
19
What are Paired Reads?
[Figure: a DNA fragment with its paired-end reads (ATCAA, CTAAG) and the insert size (IS) marked]
Slides by M. Brudno
20
Paired Reads
[Figure: a sequence read pair whose position on the reference genome is unknown (?)]
21
Read pair alignment
[Figure: sequence read pairs placed on the reference genome, with several positions marked x]
22
Working with alignments
– SAM/BAM is a standardized format for working with read alignments
– SAM is a tab-delimited text representation
– BAM is a compressed binary representation
Example SAM record:
SRR013667.1 99 19 8882171 60 76M = 8882214 119 NCCAGCAGCCATAACTGGAATGGGAAATAAACACTATGTTCAAAGCAGAGAAAATAGGAGTGTGCAATAGACTTAT #>A@BABAAAAADDEGCEFDHDEDBCFDBCDBCBDCEACB>AC@CDB@>>CB?>BA:D?9>8AB685C26091:77
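To make the tab-delimited layout concrete, the sketch below splits the example record into the eleven mandatory SAM columns. The field names follow the SAM specification; the class and function names are illustrative. The record is rebuilt here with explicit tabs, since the slide text shows the separators only as spaces.

```python
from typing import NamedTuple

class SamRecord(NamedTuple):
    qname: str   # read name
    flag: int    # bitwise flag
    rname: str   # reference (chromosome) name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str
    rnext: str   # mate reference name ("=" means same as rname)
    pnext: int   # mate position
    tlen: int    # observed template length (insert size)
    seq: str
    qual: str

def parse_sam_line(line: str) -> SamRecord:
    """Parse the 11 mandatory columns of one SAM alignment line."""
    f = line.rstrip("\n").split("\t")
    if len(f) < 11:
        raise ValueError("a SAM alignment line needs at least 11 columns")
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]), f[5],
                     f[6], int(f[7]), int(f[8]), f[9], f[10])

# The example record from the slide, joined with tabs.
example_line = "\t".join([
    "SRR013667.1", "99", "19", "8882171", "60", "76M", "=", "8882214", "119",
    "NCCAGCAGCCATAACTGGAATGGGAAATAAACACTATGTTCAAAGCAGAGAAAATAGGAGT"
    "GTGCAATAGACTTAT",
    "#>A@BABAAAAADDEGCEFDHDEDBCFDBCDBCBDCEACB>AC@CDB@>>CB?>BA:D?9>"
    "8AB685C26091:77",
])
rec = parse_sam_line(example_line)
print(rec.qname, rec.rname, rec.pos, rec.mapq, rec.cigar)
```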
23
SAM Description
Highlighted in the example record: read name (SRR013667.1) and flag (99)
➞ The flag indicates the reference strand and pairing information
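The flag is a sum of bit values, so it can be unpacked with bitwise tests. The sketch below uses the bit meanings defined in the SAM specification; the short label strings are illustrative names, not official ones.

```python
# Bit values from the SAM specification; the labels are illustrative.
SAM_FLAGS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}

def decode_flag(flag: int) -> list:
    """Return the labels of all bits that are set in a SAM flag."""
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]

print(decode_flag(99))
# ['paired', 'proper_pair', 'mate_reverse_strand', 'first_in_pair']
```

For the example record, 99 = 64 + 32 + 2 + 1: the read is the first of a properly paired pair, it maps to the forward strand (bit 0x10 is not set), and its mate maps to the reverse strand.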
24
SAM Description
Highlighted in the example record: chromosome (19) and coordinate (8882171)
25
SAM Description
Highlighted in the example record: mapping quality (60)
26
SAM Description
Highlighted in the example record: CIGAR (76M)

REF   ACGATACATAC
READ  ACGA-ACATAC
CIGAR: 4M1D6M

REF   GACA-AACC
READ  GTCATAACC
CIGAR: 4M1I4M
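A CIGAR string can be read as a list of (length, operation) pairs. The sketch below parses one and counts how many reference and read bases it covers, using the operation set from the SAM specification. The function names are illustrative.

```python
import re

_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
_CIGAR_FULL = re.compile(r"(\d+[MIDNSHP=X])+")

def parse_cigar(cigar: str) -> list:
    """Split a CIGAR string into (length, operation) pairs."""
    if not _CIGAR_FULL.fullmatch(cigar):
        raise ValueError(f"not a valid CIGAR string: {cigar!r}")
    return [(int(n), op) for n, op in _CIGAR_OP.findall(cigar)]

def reference_length(cigar: str) -> int:
    """Reference bases covered: M, D, N, = and X consume the reference."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MDN=X")

def read_length(cigar: str) -> int:
    """Read bases covered: M, I, S, = and X consume the read."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MIS=X")

for c in ("4M1D6M", "4M1I4M", "76M"):
    print(c, parse_cigar(c), reference_length(c), read_length(c))
```

For the two slide examples this gives 11 reference bases and 10 read bases for 4M1D6M, and 8 reference bases and 9 read bases for 4M1I4M, matching the REF and READ rows. Note that M covers both matches and mismatches.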
27
SAM Description
Highlighted in the example record: mate chromosome (=, the same chromosome as the read), mate position (8882214) and insert size (119)
[Figure: paired-end reads ATCAA and CTAAG with the insert size (IS) marked]
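The insert size in the example can be checked by hand from the two positions. The sketch below assumes the mate also aligns as 76M; the record does not show the mate's CIGAR, so that is an assumption. It also assumes that the read shown is the leftmost of the pair. The function name is illustrative.

```python
def template_length(pos: int, mate_pos: int, mate_ref_len: int) -> int:
    """Span from the leftmost base of the left read to the rightmost base
    of its mate, both ends included (1-based coordinates)."""
    mate_end = mate_pos + mate_ref_len - 1
    return mate_end - pos + 1

# Values from the example record; the mate length of 76 is an assumption.
tlen = template_length(pos=8882171, mate_pos=8882214, mate_ref_len=76)
print(tlen)  # 119
```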
28
Resources
samtools: toolkit for working with SAM/BAM files
– Convert between SAM/BAM
– Sort alignments
– Extract alignments for a given genomic location
SAM/BAM specification: http://samtools.sourceforge.net/SAM1.pdf
Questions/Help
– https://lists.sourceforge.net/lists/listinfo/samtools-help
– http://www.biostars.org/
– http://seqanswers.com/
29
We are now going to start an exercise in read mapping
30
We are on a Coffee Break & Networking Session
31
What kinds of variation are there?
Single Nucleotide Polymorphisms (SNPs)
Short indels (< read length)
Structural variations
– Large insertions and deletions
– Inversions
– Translocations
– Copy number variation
32
Structural variants
Mate-pair and paired-end reads can be used to detect structural variants
[Figure: library preparation from genomic DNA.
Mate-pairs (1 - 20kb): fragmentation & circularization to an internal adaptor; shear; isolate internal adaptors and fragment ends; add amplification and sequencing adaptors.
Paired-ends (200 – 500bp): fragmentation; add amplification and sequencing adaptors.
Both: sequence.]
33
Read pair orientation
[Figure: a sequence read pair on the reference genome]
For paired-end reads, the expected orientation is one read on the forward strand and one read on the reverse strand
34
Read pair alignment
– Fragment/insert size is determined by library preparation
– Pairs that match the expected orientation and distance are called concordant
– Discordant read pairs give evidence of structural variation
[Figure: fragment size distribution; axes: fragment size, fragment number]
35
SV Signatures: Deletion
[Figure: donor (don) and reference (ref) genomes]
Slides by M. Brudno
36
SV Signatures: Deletion
[Figure: donor (don) and reference (ref) genomes]
Deletion signature: mapped insert size larger than expected
Slides by M. Brudno
37
SV Signatures: Insertion
[Figure: donor (don) and reference (ref) genomes]
Insertion signature: mapped insert size smaller than expected
Slides by M. Brudno
38
SV Signatures: Tandem Duplication
[Figure: donor (don) and reference (ref) genomes]
Tandem duplication signature: wrong orientation
39
SV Signatures: Inversion
[Figure: donor (don) and reference (ref) genomes]
Inversion signature: wrong orientation of pairs
40
SV summary

Type                 Mapped Distance          Orientation
Insertion            too small                correct
Deletion             too big                  correct
Inversion            *                        wrong
Tandem duplication   *                        wrong
Interchromosomal     different chromosomes    N/A

Slides by M. Brudno
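The summary table can be read as a decision procedure for a single read pair. The sketch below is one possible encoding, not the method of any particular caller. The distance cutoff (mean ± 3 standard deviations of the insert size) is a hypothetical choice. The slides only say "wrong orientation" for inversions and tandem duplications, so the specific rule used here to tell them apart (same strand versus outward-facing) is an assumption based on common practice.

```python
from typing import NamedTuple

class Mate(NamedTuple):
    chrom: str
    pos: int        # leftmost mapped coordinate
    reverse: bool   # True if aligned to the reverse strand

def classify_pair(a: Mate, b: Mate, read_len: int,
                  mean_is: float, sd_is: float, n_sd: float = 3.0) -> str:
    """Label one read pair with the SV signature it is consistent with."""
    if a.chrom != b.chrom:
        return "interchromosomal"
    left, right = sorted((a, b), key=lambda m: m.pos)
    # Assumption: both mates on the same strand suggests an inversion.
    if left.reverse == right.reverse:
        return "inversion"
    # Assumption: mates facing away from each other suggest a tandem duplication.
    if left.reverse and not right.reverse:
        return "tandem duplication"
    # Expected orientation (forward then reverse): compare the mapped distance.
    span = right.pos + read_len - left.pos
    if span > mean_is + n_sd * sd_is:
        return "deletion"
    if span < mean_is - n_sd * sd_is:
        return "insertion"
    return "concordant"

# The example pair from the SAM slides, with a hypothetical library of 120 ± 10 bp.
pair = (Mate("19", 8882171, False), Mate("19", 8882214, True))
print(classify_pair(*pair, read_len=76, mean_is=120, sd_is=10))  # concordant
```

A single discordant pair is weak evidence. Callers normally require several pairs that support the same event before reporting it.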
41
Where can we go wrong: missed insertion
[Figure: donor (don) and reference (ref) genomes with the insert size (IS) marked]
Insertions larger than the insert size cannot be detected this way
42
Structural Variants and Split Reads
[Figure: paired short reads being aligned]
Most of these pairs can be aligned to the reference genome
For some paired-end reads, one read of the pair may not be mapped because it spans the breakpoint of a structural variant. We call such reads split reads.
Slides by M. Brudno
43
Deletion: split read signature
[Figure: donor (don) and reference (ref) genomes]
Signature: read aligns in two pieces, one on either side of the breakpoint
44
Somatic vs. Germline
Tumor vs. normal sequencing
Approach 1:
– find SVs separately in the two samples
– filter out candidate somatic (tumor) SVs that overlap germline (normal) SVs
Approach 2:
– find candidate somatic SVs
– for each candidate somatic SV, look for any type of supporting evidence in the germline sample
– filter out anything with germline evidence
Slides by M. Brudno
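Approach 1 amounts to an interval filter. The sketch below is a minimal illustration under stated assumptions: an SV is represented as a chromosome, a start, an end and a type. Two calls "overlap" when they share a chromosome and a type and their intervals intersect, optionally padded by a hypothetical `slop` to allow for imprecise breakpoints. The slides do not define overlap, so these choices are illustrative.

```python
from typing import List, NamedTuple

class SV(NamedTuple):
    chrom: str
    start: int
    end: int
    kind: str   # e.g. "deletion", "inversion"

def overlaps(a: SV, b: SV, slop: int = 0) -> bool:
    """True if two calls of the same type intersect on the same chromosome."""
    return (a.chrom == b.chrom and a.kind == b.kind
            and a.start <= b.end + slop and b.start <= a.end + slop)

def somatic_candidates(tumor: List[SV], normal: List[SV], slop: int = 0) -> List[SV]:
    """Keep tumor calls that do not overlap any call made in the normal sample."""
    return [t for t in tumor if not any(overlaps(t, n, slop) for n in normal)]

# Hypothetical calls for illustration.
tumor_calls = [SV("1", 1000, 5000, "deletion"), SV("2", 300, 900, "inversion")]
normal_calls = [SV("1", 1100, 4900, "deletion")]
print(somatic_candidates(tumor_calls, normal_calls))
```

Approach 2 is stricter than this filter: it looks for any supporting evidence in the normal sample, including evidence too weak to have produced a call there.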
45
Gene fusions
If a linking signature connects two genes, this might indicate a gene fusion
[Figure: Gene X on ChrA and Gene Y on ChrB joined into a fusion gene XY and its protein]
46
SV Software and Exercise
We will use HYDRA-SV in the tutorial
– https://code.google.com/p/hydra-sv/
– Quinlan et al, Genome-wide mapping and assembly of structural variant breakpoints in the mouse genome. Genome Research
Many others exist:
– BreakDancer, GASV, Pindel
– It is worth spending time learning multiple packages and their strengths and weaknesses
– There is rarely one program that fits all needs!
47
We are now going to start an exercise in structural variant detection
48
We are on a Coffee Break & Networking Session
49
Any questions? jared.simpson@oicr.on.ca