Canadian Bioinformatics Workshops www.bioinformatics.ca.

Presentation on theme: "Canadian Bioinformatics Workshops www.bioinformatics.ca."— Presentation transcript:

1 Canadian Bioinformatics Workshops www.bioinformatics.ca


3 Module 4: Mapping and Genome Rearrangement
[Figure: paired-end reads (ATCAA, CTAAG) at the two ends of a DNA fragment]

4 Module 4 bioinformatics.ca Learning Objectives of Module
Understand mapping sequence reads to a reference genome
Understand file formats like FASTA, FASTQ and SAM/BAM
Learn common terminology used to describe alignments
Learn how paired-end reads can be used to find genome rearrangements
Run a mapper and rearrangement caller

5 Sequencing platforms
[Figure: sequencing platforms. Axis labels: Increasing Run Time, Increasing Data Per Run, $. Throughput labels: 700Mb/23h, 150Mb/3h, 100Mb/1h, 2Gb/27h, 100Gb/15d, 90Gb/10d, 600Gb/10d, 14TB/run, 120Gb/1d. Proton? GridION?]
Cross-platform data integration needed.

6 Basecalling
How do we translate the machine data to base calls?
How do we estimate and represent sequencing errors?

7 Sources of error
Illumina: Pre-phasing & Phasing

8 What is a base quality?
Phred quality scores: estimate of the probability that the base call is incorrect
Base Quality    P error (obs. base)
3               50 %
5               32 %
10              10 %
20              1 %
30              0.1 %
40              0.01 %
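
The table above follows from the standard Phred relation Q = -10 log10(P). A minimal Python sketch of the conversion in both directions; the function names are illustrative and not taken from the slides:

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Probability that a base call with Phred quality q is incorrect."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Phred quality corresponding to an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


# Reproduce the rows of the base quality table.
for q in (3, 5, 10, 20, 30, 40):
    print(f"Q{q}: P(error) = {phred_to_error_prob(q):.2%}")
```

The values for Q3 and Q5 come out at roughly 50 % and 32 %, which the slide rounds to whole percentages.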

9 Error Profiles
Illumina – Low error rate (~0.5%), mainly substitutions
454/Ion Torrent – Mainly insertions/deletions in homopolymer runs
PacBio – Higher error rate, mixture of insertions, deletions, substitutions

10 Mismatch by cycle

11 FASTA files
Example files: ASF-1.fa, ASF-2.fa
Reads are often stored in FASTA files
Separate files for the forward and reverse reads of each pair
header line: identifier
sequence lines: nucleotides

12 FASTQ files
Example files: ASF-1.fastq, ASF-2.fastq
Most reads are stored in FASTQ
4 lines per read:
– header line: @SEQUENCE_ID
– sequence line
– line beginning with +
– encoded quality value line
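
The four-line record layout can be read with a few lines of Python. This is a sketch only: it assumes each record occupies exactly four lines and that qualities use the Phred+33 ASCII encoding, which the slide does not state. The function name `read_fastq` is illustrative.

```python
import io
from typing import Iterator, List, Tuple


def read_fastq(handle) -> Iterator[Tuple[str, str, List[int]]]:
    """Yield (read_id, sequence, qualities) from a 4-lines-per-read FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file
            return
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        # Assumption: Phred+33 encoding, so quality = ASCII code - 33.
        yield header[1:], seq, [ord(c) - 33 for c in qual]


example = io.StringIO("@read1\nACGT\n+\nII5#\n")
records = list(read_fastq(example))
print(records)
```

For real data a maintained parser is usually the safer choice, since it will also handle compressed input.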

13 Reference-based Alignment
Goal:
– find position in reference genome from which read was sampled
Issues:
– the human genome is large and repetitive
– NGS instruments produce huge amounts of data
– the sequenced genome will differ from the reference due to SNPs, indels and structural variation

14 Choosing an Aligner
High accuracy needed
– Misaligned reads are a source of false positive variant calls
High sensitivity needed
– The aligner must allow for differences between the individual and reference to find the correct mapping position
High speed needed
– With large data the informatics cost is significant
We will use the popular aligner bwa in the tutorial


16 Reference alignments
[Figure: a sequence read whose position on the reference genome is unknown (?)]

17 Reference alignments
[Figure: a sequence read aligned to the reference genome, with positions marked x]

18 Alignment Quality
Most aligners estimate how reliable an alignment is with a Mapping Quality
– Phred-scaled estimate of the probability that the chosen mapping is wrong
– About 1 in 1000 reads with a “Q30” alignment is expected to be placed incorrectly
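
The "1 in 1000" figure follows directly from the Phred scale. Written out as a worked equation, with Q denoting the mapping quality:

```latex
P(\text{mapping is wrong}) = 10^{-Q/10},
\qquad
Q = 30 \;\Rightarrow\; P = 10^{-3} = \frac{1}{1000}
```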

19 What are Paired Reads?
[Figure: paired-end reads (ATCAA, CTAAG) at the two ends of a DNA fragment, with the insert size (IS) marked]
Slides by M. Brudno

20 Paired Reads
[Figure: a sequence read pair whose position on the reference genome is unknown (?)]

21 Read pair alignment
[Figure: a sequence read pair aligned to the reference genome, with positions marked x]

22 Working with alignments
SAM/BAM is a standardized format for working with read alignments
SAM is a tab-delimited text representation
BAM is a compressed binary representation
Example record (one line, tab-delimited):
SRR013667.1 99 19 8882171 60 76M = 8882214 119 NCCAGCAGCCATAACTGGAATGGGAAATAAACACTATGTTCAAAGCAGAGAAAATAGGAGTGTGCAATAGACTTAT #>A@BABAAAAADDEGCEFDHDEDBCFDBCDBCBDCEACB>AC@CDB@>>CB?>BA:D?9>8AB685C26091:77
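
Because SAM is tab-delimited text, the example record can be split into its eleven mandatory columns with plain Python. The field names below are the ones used in the SAM specification; the helper `parse_sam_line` is an illustrative name, and the sketch ignores header lines (those starting with @).

```python
SAM_FIELDS = ("QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL")
INT_FIELDS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}


def parse_sam_line(line: str) -> dict:
    """Split one SAM alignment line into a dict keyed by field name."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) < len(SAM_FIELDS):
        raise ValueError("a SAM alignment line needs at least 11 tab-separated fields")
    record = {}
    for name, value in zip(SAM_FIELDS, columns):
        record[name] = int(value) if name in INT_FIELDS else value
    record["TAGS"] = columns[len(SAM_FIELDS):]  # optional fields, left unparsed
    return record


# The example record from the slide, rebuilt with tab separators.
line = "\t".join([
    "SRR013667.1", "99", "19", "8882171", "60", "76M", "=", "8882214", "119",
    "NCCAGCAGCCATAACTGGAATGGGAAATAAACACTATGTTCAAAGCAGAGAAAATAGGAGTGTGCAATAGACTTAT",
    "#>A@BABAAAAADDEGCEFDHDEDBCFDBCDBCBDCEACB>AC@CDB@>>CB?>BA:D?9>8AB685C26091:77",
])
rec = parse_sam_line(line)
print(rec["QNAME"], rec["RNAME"], rec["POS"], rec["MAPQ"], rec["CIGAR"])
```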

23 SAM Description
Read name: SRR013667.1
Flag: 99 ➞ the flag indicates the reference strand and pairing information
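
The flag is a bit field, so the value 99 can be unpacked into its individual meanings. A short sketch using the bit definitions from the SAM specification; the label strings and the function name `decode_flag` are illustrative:

```python
# Bit values as defined in the SAM specification.
FLAG_BITS = [
    (0x1, "paired"),
    (0x2, "proper_pair"),
    (0x4, "unmapped"),
    (0x8, "mate_unmapped"),
    (0x10, "reverse_strand"),
    (0x20, "mate_reverse_strand"),
    (0x40, "first_in_pair"),
    (0x80, "second_in_pair"),
    (0x100, "secondary"),
    (0x200, "qc_fail"),
    (0x400, "duplicate"),
    (0x800, "supplementary"),
]


def decode_flag(flag: int) -> list:
    """Return the names of all bits set in a SAM flag."""
    return [name for bit, name in FLAG_BITS if flag & bit]


print(decode_flag(99))
```

For the example record, 99 = 1 + 2 + 32 + 64: the read is paired, in a proper pair, its mate maps to the reverse strand, and it is the first read of the pair.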

24 SAM Description
Chromosome: 19
Coordinate: 8882171

25 SAM Description
Mapping Quality: 60

26 SAM Description
CIGAR: 76M
Example with a deletion:
REF  ACGATACATAC
READ ACGA-ACATAC
CIGAR: 4M1D6M
Example with an insertion:
REF  GACA-AACC
READ GTCATAACC
CIGAR: 4M1I4M
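
A CIGAR string can be decoded into (length, operation) pairs, from which the number of reference and read bases covered follows. This sketch uses the operation letters from the SAM specification; the function names are illustrative.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
CONSUMES_REF = set("MDN=X")   # operations that advance along the reference
CONSUMES_READ = set("MIS=X")  # operations that advance along the read


def parse_cigar(cigar: str) -> list:
    """Split a CIGAR string such as '4M1D6M' into [(4, 'M'), (1, 'D'), (6, 'M')]."""
    ops = [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]
    if "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    return ops


def cigar_lengths(cigar: str) -> tuple:
    """Return (reference bases spanned, read bases used) for an alignment."""
    ops = parse_cigar(cigar)
    ref_len = sum(n for n, op in ops if op in CONSUMES_REF)
    read_len = sum(n for n, op in ops if op in CONSUMES_READ)
    return ref_len, read_len


for cigar in ("76M", "4M1D6M", "4M1I4M"):
    print(cigar, cigar_lengths(cigar))
```

The deletion example spans 11 reference bases with a 10-base read, and the insertion example spans 8 reference bases with a 9-base read, matching the alignments shown on the slide.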

27 SAM Description
Mate chromosome, position: = (same chromosome as the read), 8882214
Insert size: 119
[Figure: paired-end reads (ATCAA, CTAAG) with the insert size (IS) marked]
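
The insert size of 119 can be reproduced from the two mapping positions. The sketch assumes the mate also covers 76 reference bases, which the slide does not show but which is consistent with the reported value; `template_length` is an illustrative name.

```python
def template_length(pos: int, mate_pos: int, mate_ref_len: int) -> int:
    """Distance from the leftmost base of the left read to the rightmost base of its mate.

    pos and mate_pos are 1-based leftmost mapping positions; mate_ref_len is the
    number of reference bases the mate's alignment covers.
    """
    return mate_pos + mate_ref_len - pos


# Values from the example record; the mate length of 76 is an assumption.
print(template_length(pos=8882171, mate_pos=8882214, mate_ref_len=76))
```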

28 Resources
samtools: toolkit for working with SAM/BAM files
– Convert between SAM/BAM
– Sort alignments
– Extract alignments for a given genomic location
SAM/BAM specification: http://samtools.sourceforge.net/SAM1.pdf
Questions/Help
– https://lists.sourceforge.net/lists/listinfo/samtools-help
– http://www.biostars.org/
– http://seqanswers.com/

29 We are now going to start an exercise in read mapping

30 We are on a Coffee Break & Networking Session

31 What kinds of variation are there?
Single Nucleotide Polymorphisms (SNPs)
Short indels (< read length)
Structural variations
– Large insertions and deletions
– Inversions
– Translocations
– Copy number variation

32 Structural variants
Mate-pair and paired-end reads can be used to detect structural variants
[Figure: library preparation from genomic DNA.
Mate-Pairs (1 - 20kb): fragmentation & circularization to an internal adaptor; shear; isolate internal adaptors and fragment ends.
Paired-Ends (200 – 500bp): fragmentation.
Both: add amplification and sequencing adaptors; sequence.]

33 Read pair orientation
[Figure: sequence read pair aligned to the reference genome]
For paired-end reads, the expected orientation is one read on the forward strand and one read on the reverse strand

34 Read pair alignment
Fragment/insert size is determined by library preparation
Pairs that match the expected orientation and distance are called concordant
Discordant read pairs give evidence of structural variation
[Figure: histogram of fragment number against fragment size]
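
The concordant/discordant distinction can be expressed as a simple check on chromosome, orientation and distance. This is a sketch under stated assumptions: the function name, its parameters and the 50 to 500 bp window are hypothetical; in practice the window would come from the fragment size distribution of the library.

```python
def is_concordant(chrom1: str, pos1: int, reverse1: bool,
                  chrom2: str, pos2: int, reverse2: bool,
                  insert_size: int, min_insert: int, max_insert: int) -> bool:
    """True if a paired-end read pair has the expected orientation and distance."""
    if chrom1 != chrom2:
        return False
    # Expected paired-end orientation: leftmost read forward, rightmost read reverse.
    if pos1 <= pos2:
        left_reverse, right_reverse = reverse1, reverse2
    else:
        left_reverse, right_reverse = reverse2, reverse1
    if left_reverse or not right_reverse:
        return False
    return min_insert <= insert_size <= max_insert


# Hypothetical insert size window of 50 to 500 bp.
print(is_concordant("19", 8882171, False, "19", 8882214, True, 119, 50, 500))
print(is_concordant("19", 1000, False, "19", 51000, True, 50076, 50, 500))
```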

35 SV Signatures: Deletion
[Figure: read pairs from the donor (don) mapped to the reference (ref)]
Slides by M. Brudno

36 SV Signatures: Deletion
[Figure: read pairs from the donor (don) mapped to the reference (ref)]
Deletion signature: mapped insert size larger than expected
Slides by M. Brudno

37 SV Signatures: Insertion
[Figure: read pairs from the donor (don) mapped to the reference (ref)]
Insertion signature: mapped insert size smaller than expected
Slides by M. Brudno

38 SV Signatures: Tandem Duplication
[Figure: read pairs from the donor (don) mapped to the reference (ref)]
Tandem duplication signature: wrong orientation

39 SV Signatures: Inversion
[Figure: read pairs from the donor (don) mapped to the reference (ref)]
Inversion signature: wrong orientation of pairs

40 SV summary
Type                  Mapped Distance         Orientation
Insertion             too small               correct
Deletion              too big                 correct
Inversion             *
Tandem duplication    *
Interchromosomal      different chromosomes   N/A
Slides by M. Brudno
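
The summary can be read as a decision rule applied to each read pair. The sketch below is one way to encode it, not the method of any particular caller: the function name, argument names and the 50 to 500 bp window are hypothetical, and pairs with the wrong orientation are reported as a combined category because orientation alone does not separate inversions from tandem duplications here.

```python
def classify_pair(same_chromosome: bool, orientation_ok: bool,
                  mapped_distance: int, min_expected: int, max_expected: int) -> str:
    """Assign a read pair to the SV signature it is consistent with."""
    if not same_chromosome:
        return "interchromosomal"
    if not orientation_ok:
        return "inversion or tandem duplication"
    if mapped_distance > max_expected:
        return "deletion"
    if mapped_distance < min_expected:
        return "insertion"
    return "concordant"


# Hypothetical expected insert size window of 50 to 500 bp.
print(classify_pair(True, True, 5000, 50, 500))
print(classify_pair(True, True, 20, 50, 500))
print(classify_pair(True, False, 300, 50, 500))
print(classify_pair(False, True, 0, 50, 500))
```

A single discordant pair is weak evidence; callers typically require several pairs supporting the same signature at the same location.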

41 Where can we go wrong: missed insertion
[Figure: read pairs from the donor (don) mapped to the reference (ref), with the insert size (IS) marked]
Insertions larger than the insert size cannot be detected this way

42 Structural Variants and Split Reads
[Figure: paired short reads are aligned to the reference genome]
Most of these pairs can be aligned to the reference genome
For some paired-end reads, one read of the pair may not be mapped because it spans the breakpoint of a structural variant. We call such reads split reads.
Slides by M. Brudno

43 Deletion: split read signature
[Figure: a read from the donor (don) mapped to the reference (ref)]
Signature: read aligns in two pieces, one on either side of the breakpoint

44 Somatic vs. Germline
tumor vs. normal sequencing
approach 1:
– find SVs separately in the two samples
– filter out candidate somatic SVs (tumor calls) that overlap germline SVs (normal calls)
approach 2:
– find somatic SVs
– for each somatic SV, find any type of evidence in germline
– filter out anything with germline evidence
Slides by M. Brudno
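
Approach 1 amounts to an interval overlap filter between the tumor and normal call sets. A minimal sketch, assuming each SV call is reduced to a (chromosome, start, end) tuple; the names and coordinates are hypothetical, and real pipelines usually also compare SV type and allow some slack around breakpoints.

```python
from typing import List, Tuple

SV = Tuple[str, int, int]  # (chromosome, start, end), coordinates inclusive


def overlaps(a: SV, b: SV) -> bool:
    """True if two calls are on the same chromosome and share at least one base."""
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]


def filter_somatic(tumor_calls: List[SV], normal_calls: List[SV]) -> List[SV]:
    """Keep tumor calls that do not overlap any call made in the normal sample."""
    return [sv for sv in tumor_calls
            if not any(overlaps(sv, germline) for germline in normal_calls)]


# Hypothetical calls for illustration.
tumor = [("chr1", 1000, 2000), ("chr1", 50000, 60000), ("chr2", 1000, 2000)]
normal = [("chr1", 1500, 2500)]
print(filter_somatic(tumor, normal))
```

The quadratic comparison is fine for small call sets; larger ones would normally use sorted intervals or an interval tree.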

45 Gene fusions
If a linking signature connects two genes, this might indicate a gene fusion
[Figure: Gene X on ChrA and Gene Y on ChrB are joined into Gene XY, which yields a fusion protein]
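
Checking whether a linking signature connects two genes comes down to looking up the gene at each end of the link. The sketch below assumes gene annotations are available as simple intervals; the gene names follow the slide's figure, while the coordinates and function names are hypothetical.

```python
from typing import Dict, Optional, Tuple

Gene = Tuple[str, int, int]  # (chromosome, start, end), coordinates inclusive


def gene_at(chrom: str, pos: int, genes: Dict[str, Gene]) -> Optional[str]:
    """Return the name of the first gene containing the position, if any."""
    for name, (g_chrom, start, end) in genes.items():
        if g_chrom == chrom and start <= pos <= end:
            return name
    return None


def fusion_candidate(end1: Tuple[str, int], end2: Tuple[str, int],
                     genes: Dict[str, Gene]) -> Optional[Tuple[str, str]]:
    """Return the pair of distinct genes joined by a link, or None."""
    g1 = gene_at(end1[0], end1[1], genes)
    g2 = gene_at(end2[0], end2[1], genes)
    if g1 is not None and g2 is not None and g1 != g2:
        return g1, g2
    return None


# Hypothetical annotation and link, for illustration only.
genes = {"GeneX": ("ChrA", 1000, 5000), "GeneY": ("ChrB", 20000, 30000)}
print(fusion_candidate(("ChrA", 4200), ("ChrB", 21000), genes))
```

A hit is only a candidate: whether a functional fusion protein results also depends on strand and reading frame, which this lookup does not check.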

46 SV Software and Exercise
We will use HYDRA-SV in the tutorial
– https://code.google.com/p/hydra-sv/
– Quinlan et al, Genome-wide mapping and assembly of structural variant breakpoints in the mouse genome. Genome Research
Many others exist:
– Breakdancer, GASV, Pindel
– It is worth spending time learning multiple packages and their strengths and weaknesses
– There is rarely one program that fits all needs!

47 We are now going to start an exercise in structural variant detection

48 We are on a Coffee Break & Networking Session

49 Any questions?
jared.simpson@oicr.on.ca

