Canadian Bioinformatics Workshops www.bioinformatics.ca.

Module 4 Mapping and Genome Rearrangement
[Figure: paired-end reads (ATCAA, CTAAG) sequenced from the two ends of a DNA fragment]

Module 4 bioinformatics.ca Learning Objectives of Module
  – Understand mapping sequence reads to a reference genome
  – Understand file formats like FASTA, FASTQ and SAM/BAM
  – Learn common terminology used to describe alignments
  – Learn how paired-end reads can be used to find genome rearrangements
  – Run a mapper and rearrangement caller

Module 4 bioinformatics.ca Sequencing platforms
[Figure: sequencing platforms arranged by increasing run time and increasing data per run. Throughput labels: 700Mb/23h, 150Mb/3h, 100Mb/1h, 2Gb/27h, 100Gb/15d, 90Gb/10d, 600Gb/10d, 14TB/run, 120Gb/1d. Proton? GridION?]
Cross-platform data integration needed.

Module 4 bioinformatics.ca Basecalling How do we translate the machine data to base calls? How do we estimate and represent sequencing errors?

Module 4 bioinformatics.ca Sources of error Illumina: Pre-phasing & Phasing

Module 4 bioinformatics.ca What is a base quality?
Phred quality scores: estimate of the probability that the base call is incorrect
Base Quality    P error (obs. base)
3               50 %
5               32 %
10              10 %
20              1 %
30              0.1 %
40              0.01 %
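
The table values follow from the Phred definition Q = -10 * log10(P). A minimal Python sketch of the conversion in both directions; the function names are illustrative and do not come from the slides:

```python
import math

def phred_to_error_prob(q):
    """Probability that a base call with Phred quality q is incorrect."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred quality for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)

# Print the error probability for the qualities listed in the table.
for q in (3, 5, 10, 20, 30, 40):
    print(f"Q{q}: P(error) = {phred_to_error_prob(q):.4%}")
```

The slide rounds the results (for example, Q5 is about 31.6 %, shown as 32 %).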

Module 4 bioinformatics.ca Error Profiles
Illumina – Low error rate (~0.5%), mainly substitutions
454/Ion Torrent – Mainly insertions/deletions in homopolymer runs
PacBio – Higher error rate, mixture of insertions, deletions, substitutions

Module 4 bioinformatics.ca Mismatch by cycle

Module 4 bioinformatics.ca Fasta files
Example files: ASF-1.fa, ASF-2.fa
Reads are often stored in fasta files
Separate files for the forward and reverse reads of each pair
header line (begins with >): identifier
sequence lines: nucleotides

Module 4 bioinformatics.ca Fastq files
Example files: ASF-1.fastq, ASF-2.fastq
Most reads are stored in fastq, 4 lines per read:
  – header line: @SEQUENCE_ID
  – sequence line
  – line beginning with +
  – encoded quality value line
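
The four-line layout can be read with a few lines of Python. This is a sketch under stated assumptions: the quality line is taken to use the Phred+33 ASCII encoding (older Illumina files used +64), the example record is made up, and `read_fastq` is a hypothetical helper rather than part of any tool used in the tutorial.

```python
import io

def read_fastq(handle, offset=33):
    """Yield (read_id, sequence, qualities) from a 4-lines-per-read FASTQ stream.

    offset=33 assumes Phred+33 encoding; pass 64 for older Illumina files.
    """
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        # Decode each quality character into its Phred score.
        yield header[1:], seq, [ord(c) - offset for c in qual]

# Made-up record for illustration only.
example = io.StringIO("@read1\nACGT\n+\nII5#\n")
for read_id, seq, quals in read_fastq(example):
    print(read_id, seq, quals)
```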

Module 4 bioinformatics.ca Reference-based Alignment
Goal:
  – find position in reference genome from which read was sampled
Issues:
  – the human genome is large and repetitive
  – NGS instruments produce huge amounts of data
  – the sequenced genome will differ from the reference due to SNPs, indels and structural variation
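
To make the goal concrete, here is a deliberately naive sketch that slides a read along a toy reference and keeps the ungapped placement with the fewest mismatches. It is not how production aligners work (bwa, for instance, uses a Burrows-Wheeler index and handles indels); the reference string, function names and mismatch threshold are all illustrative assumptions.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def naive_align(read, reference, max_mismatches=2):
    """Return (0-based position, mismatches) of the best ungapped placement.

    Ties go to the leftmost position; returns None if no placement has
    at most max_mismatches mismatches.
    """
    best = None
    for pos in range(len(reference) - len(read) + 1):
        mm = hamming(read, reference[pos:pos + len(read)])
        if mm <= max_mismatches and (best is None or mm < best[1]):
            best = (pos, mm)
    return best

reference = "GATTACAGATTTCAGGCATCAAGCTAAGT"  # toy reference, not real data
print(naive_align("CATCAAGC", reference))  # exact copy of part of the reference
print(naive_align("CATGAAGC", reference))  # same read with one substitution
```

The scan costs time proportional to reference length times read length for every read, which is why the issues listed above (genome size, data volume) push real aligners toward indexed search.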

Module 4 bioinformatics.ca Choosing an Aligner
High accuracy needed
  – Misaligned reads are a source of false positive variant calls
High sensitivity needed
  – The aligner must allow for differences between the individual and reference to find the correct mapping position
High speed needed
  – With large data the informatics cost is significant
We will use the popular aligner bwa in the tutorial

Module 4 bioinformatics.ca Reference alignments ? Reference genome Sequence read

Module 4 bioinformatics.ca Reference alignments Reference genome Sequence read x x x

Module 4 bioinformatics.ca Alignment Quality
Most aligners estimate how reliable an alignment is with a Mapping Quality
  – Phred-scaled estimate of the probability that the chosen mapping is wrong
  – About 1 in 1000 reads with a “Q30” alignment is expected to be placed incorrectly

Module 4 bioinformatics.ca What are Paired Reads? ATCAA CTAAG Insert size (IS) DNA fragment Paired-end Reads Slides by M. Brudno

Module 4 bioinformatics.ca Paired Reads Reference genome Sequence read pair ?

Module 4 bioinformatics.ca Read pair alignment Reference genome xx x Sequence read pair xx x xx

Module 4 bioinformatics.ca Working with alignments
SAM/BAM is a standardized format for working with read alignments
SAM is a tab-delimited text representation
BAM is a compressed binary representation
Example SAM record (fields are tab-separated in the file):
SRR013667.1 99 19 8882171 60 76M = 8882214 119 NCCAGCAGCCATAACTGGAATGGGAAATAAACACTATGTTCAAAGCAGAGAAAATAGGAGTGTGCAATAGACTTAT #>A@BABAAAAADDEGCEFDHDEDBCFDBCDBCBDCEACB>AC@CDB@>>CB?>BA:D?9>8AB685C26091:77
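
Because SAM is tab-delimited text, the example record can be split into its eleven mandatory fields with plain Python. The sketch below uses the field names from the SAM specification. It assumes the sequence and quality strings on the slide were wrapped across two lines and belong together (76 characters each, matching the 76M CIGAR). `parse_sam_line` is an illustrative helper, not a samtools function.

```python
from collections import namedtuple

# The 11 mandatory SAM fields, in file order.
SamRecord = namedtuple(
    "SamRecord",
    "qname flag rname pos mapq cigar rnext pnext tlen seq qual",
)

def parse_sam_line(line):
    """Parse the mandatory fields of one SAM alignment line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs at least 11 fields")
    qname, flag, rname, pos, mapq, cigar, rnext, pnext, tlen, seq, qual = fields[:11]
    return SamRecord(qname, int(flag), rname, int(pos), int(mapq),
                     cigar, rnext, int(pnext), int(tlen), seq, qual)

# The record from the slide, rebuilt with tab separators.
line = "\t".join([
    "SRR013667.1", "99", "19", "8882171", "60", "76M", "=", "8882214", "119",
    "NCCAGCAGCCATAACTGGAATGGGAAATAAACACTATGTTCAAAGCAGAGAAAATAGGAGTGTGCAATAGACTTAT",
    "#>A@BABAAAAADDEGCEFDHDEDBCFDBCDBCBDCEACB>AC@CDB@>>CB?>BA:D?9>8AB685C26091:77",
])
record = parse_sam_line(line)
print(record.qname, record.rname, record.pos, record.mapq, record.cigar)
```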

Module 4 bioinformatics.ca SAM Description
Read name (SRR013667.1) and flag (99)
The flag indicates the reference strand and pairing information
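
The flag is a bit field, so a value such as 99 is easiest to read after decoding it bit by bit. The bit meanings below are those defined in the SAM specification; the short names and the `decode_flag` helper are illustrative choices.

```python
# (bit value, short description) pairs from the SAM specification.
FLAG_BITS = [
    (0x1, "paired"),
    (0x2, "proper_pair"),
    (0x4, "unmapped"),
    (0x8, "mate_unmapped"),
    (0x10, "reverse_strand"),
    (0x20, "mate_reverse_strand"),
    (0x40, "first_in_pair"),
    (0x80, "second_in_pair"),
    (0x100, "secondary"),
    (0x200, "qc_fail"),
    (0x400, "duplicate"),
    (0x800, "supplementary"),
]

def decode_flag(flag):
    """Return the names of all bits set in a SAM flag."""
    return [name for bit, name in FLAG_BITS if flag & bit]

# 99 = 1 + 2 + 32 + 64
print(decode_flag(99))
```

For the example record this says the read is the first of a properly aligned pair, mapped to the forward strand (the reverse-strand bit is not set), with its mate on the reverse strand.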

Module 4 bioinformatics.ca SAM Description
Chromosome (19) and coordinate (8882171)

Module 4 bioinformatics.ca SAM Description
Mapping Quality (60)

Module 4 bioinformatics.ca SAM Description
CIGAR (76M)
Example with a deletion in the read:
REF   ACGATACATAC
READ  ACGA-ACATAC
CIGAR: 4M1D6M
Example with an insertion in the read:
REF   GACA-AACC
READ  GTCATAACC
CIGAR: 4M1I4M
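
A CIGAR string can be unpacked into (length, operation) pairs to work out how many reference bases and how many read bases an alignment covers. The sketch below follows the operation semantics in the SAM specification (M, D, N, = and X consume the reference; M, I, S, = and X consume the read); the helper names are illustrative.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
CONSUMES_REF = set("MDN=X")
CONSUMES_READ = set("MIS=X")

def parse_cigar(cigar):
    """Split a CIGAR string into a list of (length, operation) pairs."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Reject strings containing anything the pattern did not account for.
    if "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError(f"invalid CIGAR string: {cigar!r}")
    return ops

def cigar_lengths(cigar):
    """Return (reference bases covered, read bases covered)."""
    ops = parse_cigar(cigar)
    ref_len = sum(n for n, op in ops if op in CONSUMES_REF)
    read_len = sum(n for n, op in ops if op in CONSUMES_READ)
    return ref_len, read_len

for cigar in ("76M", "4M1D6M", "4M1I4M"):
    print(cigar, cigar_lengths(cigar))
```

The deletion example covers 11 reference bases with a 10-base read, and the insertion example covers 8 reference bases with a 9-base read, matching the alignments drawn on the slide.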

Module 4 bioinformatics.ca SAM Description
Mate chromosome (=, the same chromosome as the read), mate position (8882214), insert size (119)
[Figure: paired-end reads ATCAA and CTAAG spanning the insert size (IS)]
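
The insert size of 119 in the example record can be reproduced from the two positions. This sketch assumes the mate also aligns as 76M, so it covers 76 reference bases. The mate's CIGAR is not shown on the slides, so that value is an assumption, and `template_length` is an illustrative helper.

```python
def template_length(pos, mate_pos, mate_ref_len):
    """Observed template length for a forward-strand read at `pos` whose
    reverse-strand mate starts at `mate_pos` (both 1-based leftmost
    coordinates) and covers `mate_ref_len` reference bases."""
    return (mate_pos + mate_ref_len) - pos

# Positions from the example record; mate_ref_len=76 is an assumption.
print(template_length(8882171, 8882214, 76))
```

Under that assumption the calculation gives 8882214 + 76 - 8882171 = 119, the value in the record.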

Module 4 bioinformatics.ca Resources
samtools: toolkit for working with SAM/BAM files
  – Convert between SAM/BAM
  – Sort alignments
  – Extract alignments for a given genomic location
SAM/BAM specification: http://samtools.sourceforge.net/SAM1.pdf
Questions/Help
  – https://lists.sourceforge.net/lists/listinfo/samtools-help
  – http://www.biostars.org/
  – http://seqanswers.com/

Module 4 bioinformatics.ca We are now going to start an exercise in read mapping

Module 4 bioinformatics.ca We are on a Coffee Break & Networking Session

Module 4 bioinformatics.ca What kinds of variation are there?
Single Nucleotide Polymorphisms (SNPs)
Short indels (< read length)
Structural variations
  – Large insertions and deletions
  – Inversions
  – Translocations
  – Copy number variation

Module 4 bioinformatics.ca Structural variants
Mate-pair and paired-end reads can be used to detect structural variants
[Figure: library preparation from genomic DNA]
Mate-Pairs (1 - 20kb): Fragmentation & circularization to an internal adaptor; Shear; Isolate internal adaptors and fragment ends; Add amplification and sequencing adaptors; Sequence
Paired-Ends (200 – 500bp): Fragmentation; Add amplification and sequencing adaptors; Sequence

Module 4 bioinformatics.ca Read pair orientation Reference genome Sequence read pair The expected orientation is one read on the forward strand and one read on the reverse strand for paired-end reads

Module 4 bioinformatics.ca Read pair alignment
Fragment/insert size is determined by library preparation
Pairs that match the expected orientation and distance are called concordant
Discordant read pairs give evidence of structural variation
[Figure: distribution of fragment number by fragment size]
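
One simple way to turn "expected distance" into a rule is to estimate the insert size distribution from the mapped pairs and accept sizes within a few standard deviations of the mean. The sketch below does this with made-up sizes. The 3-standard-deviation window is an assumption for illustration, not a threshold taken from the slides or from any particular caller.

```python
from statistics import mean, stdev

def insert_size_window(sizes, n_sd=3):
    """Return (low, high) bounds as mean +/- n_sd standard deviations."""
    mu, sd = mean(sizes), stdev(sizes)
    return mu - n_sd * sd, mu + n_sd * sd

def is_concordant(size, expected_orientation, window):
    """A pair is concordant if its orientation is the expected one
    (forward/reverse) and its insert size falls inside the window."""
    low, high = window
    return expected_orientation and low <= size <= high

# Made-up insert sizes from a library with fragments near 300 bp.
sizes = [290, 300, 310, 295, 305, 300, 298, 302]
window = insert_size_window(sizes)

print(is_concordant(305, True, window))   # typical pair
print(is_concordant(1200, True, window))  # mapped too far apart
print(is_concordant(300, False, window))  # right distance, wrong orientation
```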

Module 4 bioinformatics.ca SV Signatures: Deletion
[Figure: read pair from the donor (don) mapped to the reference (ref)]
Deletion signature: mapped insert size larger than expected
Slides by M. Brudno

Module 4 bioinformatics.ca SV Signatures: Insertion
[Figure: read pair from the donor (don) mapped to the reference (ref)]
Insertion signature: mapped insert size smaller than expected
Slides by M. Brudno

Module 4 bioinformatics.ca SV Signatures: Tandem Duplication
[Figure: read pair from the donor (don) mapped to the reference (ref)]
Tandem duplication signature: wrong orientation

Module 4 bioinformatics.ca SV Signatures: Inversion
[Figure: read pair from the donor (don) mapped to the reference (ref)]
Inversion signature: wrong orientation of pairs

Module 4 bioinformatics.ca SV summary
Type                 Mapped Distance         Orientation
Insertion            too small               correct
Deletion             too big                 correct
Inversion            *
Tandem duplication   *
Interchromosomal     different chromosomes   N/A
Slides by M. Brudno
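
The summary can be written as a small decision function over a single read pair. This is a sketch of the lookup only: the `low` and `high` insert size bounds are hypothetical inputs, and wrong-orientation pairs are reported as "inversion or tandem duplication" because the summary does not say how to tell the two apart.

```python
def classify_pair(chrom1, chrom2, distance, orientation_ok, low, high):
    """Map one read pair to the SV signature it supports.

    low/high are the bounds of the expected insert size range
    (hypothetical values supplied by the caller).
    """
    if chrom1 != chrom2:
        return "interchromosomal"
    if not orientation_ok:
        return "inversion or tandem duplication"
    if distance > high:
        return "deletion"
    if distance < low:
        return "insertion"
    return "concordant"

# Illustrative pairs with an assumed expected range of 280-320 bp.
print(classify_pair("19", "19", 1500, True, 280, 320))
print(classify_pair("19", "19", 150, True, 280, 320))
print(classify_pair("19", "19", 300, False, 280, 320))
print(classify_pair("19", "7", 0, True, 280, 320))
```

A single discordant pair is weak evidence on its own, so in practice a call would rest on several pairs supporting the same signature at the same locus.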

Module 4 bioinformatics.ca Where can we go wrong: missed insertion
[Figure: donor (don) and reference (ref) with a read pair of insert size IS]
Insertions larger than the insert size cannot be detected this way

Module 4 bioinformatics.ca Structural Variants and Split Reads
[Figure: paired short reads aligned to the reference]
Most of these pairs can be aligned to the reference genome.
For some paired-end reads, one read of the pair may not be mapped because it spans the breakpoint of a structural variant. We call such reads split reads.
Slides by M. Brudno

Module 4 bioinformatics.ca Deletion: split read signature
[Figure: read from the donor (don) aligned to the reference (ref)]
Signature: read aligns in two pieces, one on either side of the breakpoint
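
The two-piece signature can be illustrated with a toy search that tries every split point of a read and looks for exact matches of the left and right pieces, in order, on the reference. It assumes error-free reads and a made-up reference; real split-read callers are far more elaborate, and `split_align` is a hypothetical helper.

```python
def split_align(read, reference, min_piece=5):
    """Find a split of `read` into two pieces that match the reference
    exactly, in order, with a gap between them.

    Returns (left_start, right_start, deleted_bases) using 0-based
    coordinates, or None if no such split exists.
    """
    for split in range(min_piece, len(read) - min_piece + 1):
        left, right = read[:split], read[split:]
        left_start = reference.find(left)
        if left_start == -1:
            continue
        left_end = left_start + split
        # The right piece must start after a gap of at least one base.
        right_start = reference.find(right, left_end + 1)
        if right_start != -1:
            return left_start, right_start, right_start - left_end
    return None

# Toy reference; the donor genome lacks reference[16:36] (20 bases).
reference = "AAGCTTGCATGCCTGCAGGTCGACTCTAGAGGATCCCCGGGTACCGAGCTCGAATTC"
donor = reference[:16] + reference[36:]
read = donor[8:24]  # a donor read that spans the deletion breakpoint

print(split_align(read, reference))
```

The reported breakpoint can shift by a base or two when the sequence on either side of the deletion is identical, but the deletion length stays the same.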

Module 4 bioinformatics.ca Somatic vs. Germline
Tumor vs. normal sequencing
Approach 1:
  – find SVs separately in the two samples
  – filter out candidate somatic SVs that overlap germline SVs
Approach 2:
  – find candidate somatic SVs
  – for each candidate, look for any type of evidence in the germline sample
  – filter out anything with germline evidence
Slides by M. Brudno
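
Approach 1 amounts to an interval-overlap filter. The sketch below represents each SV as a (chromosome, start, end) tuple with half-open coordinates and keeps tumor calls that overlap no call from the normal sample. The representation, coordinates and helper names are assumptions for illustration; a real comparison would also consider SV type and breakpoint uncertainty.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) half-open intervals overlap."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

def somatic_candidates(tumor_svs, normal_svs):
    """Keep tumor SVs that do not overlap any SV called in the normal."""
    return [sv for sv in tumor_svs
            if not any(overlaps(sv, germline) for germline in normal_svs)]

# Made-up calls for illustration.
tumor_svs = [("chr1", 100, 200), ("chr2", 500, 900)]
normal_svs = [("chr1", 150, 250)]
print(somatic_candidates(tumor_svs, normal_svs))
```

The quadratic scan is fine for a sketch; with many calls, sorting the intervals or using an interval tree would avoid comparing every pair.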

Module 4 bioinformatics.ca Gene fusions
If a linking signature connects two genes, this might indicate a gene fusion
[Figure: Gene X on ChrA and Gene Y on ChrB joined into a fusion Gene XY, producing a protein]

Module 4 bioinformatics.ca SV Software and Exercise
We will use HYDRA-SV in the tutorial
  – https://code.google.com/p/hydra-sv/
  – Quinlan et al, Genome-wide mapping and assembly of structural variant breakpoints in the mouse genome. Genome Research
Many others exist:
  – Breakdancer, GASV, Pindel
  – It is worth spending time learning multiple packages and their strengths and weaknesses
  – There is rarely one program that fits all needs!

Module 4 bioinformatics.ca We are now going to start an exercise in structural variant detection

Module 4 bioinformatics.ca We are on a Coffee Break & Networking Session

Module 4 bioinformatics.ca Any questions? jared.simpson@oicr.on.ca
