Canadian Bioinformatics Workshops

1 Canadian Bioinformatics Workshops

2 Module #: Title of Module

3 Mapping and Genome Rearrangement
Module 3: Mapping and Genome Rearrangement
Jared Simpson, Ph.D.
Bioinformatics for Cancer Genomics, May 30 – June 3, 2016
from: doi: /nmeth.2258

4 Learning Objectives of Module
Understand mapping reads to a reference genome
Understand the FASTQ and SAM/BAM file formats
Learn common terminology used to describe alignments
Learn how to find genome rearrangements using read pairs
Run a mapper and rearrangement caller

5 Sequencing platforms
[Figure: sequencing platforms compared by data per run and run time. Axis labels: Increasing Data Per Run, Increasing Run Time, $. Platform labels: 14TB/run, 600Gb/10d, 100Gb/15d, 120Gb/1d, 90Gb/10d, 150Mb/3h, 2Gb/27h, 700Mb/23h, 100Mb/1h]

6 Illumina Sequencing

7 Basecalling
Prediction of the DNA sequence from the images

8 Sources of error
Illumina: pre-phasing and phasing

9 Error Profiles
Illumina: low error rate (~0.5%), mainly substitutions
454/Ion Torrent: mainly insertions/deletions in homopolymer runs
PacBio and Oxford Nanopore: single-molecule sequencers; higher error rate, a mixture of insertions, deletions and substitutions

10 Illumina Error Profile

11 What is a FASTQ file?

12 What is a FASTQ file? Read name

13 What is a FASTQ file? Basecalled sequence

14 What is a FASTQ file? Quality separator

15 What is a FASTQ file? Base quality scores
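
The four parts named on the slides above (read name, basecalled sequence, quality separator, base quality scores) correspond to the four lines of a FASTQ record. A minimal Python sketch of a reader follows, assuming each record occupies exactly four lines; the function name `read_fastq` and the example record are made up for illustration and do not come from the slides.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str       # line 1, without the leading '@'
    sequence: str   # line 2, basecalled sequence
    qualities: str  # line 4, one ASCII character per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        qualities = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(qualities):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, qualities)


# Hypothetical record, for illustration only.
example = io.StringIO("@read_1\nACGTN\n+\nIIII#\n")
records = list(read_fastq(example))
print(records[0])
```

The length check reflects the fact that every base has exactly one quality character.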

16 What is a base quality score?
Phred quality scores: estimate of the probability that the base call is incorrect
Base Quality    P_error(obs. base)
3               50 %
5               32 %
10              10 %
20              1 %
30              0.1 %
40              0.01 %
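
The table values follow from the standard Phred relation Q = -10 log10(P_error). The sketch below converts in both directions and decodes a FASTQ quality string; it assumes the Phred+33 ASCII encoding (an assumption, since the slides do not state the offset), and the function names are illustrative.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Probability that a base call is wrong, given its Phred score."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Phred score for an error probability p, with 0 < p <= 1."""
    return -10 * math.log10(p)


def decode_qualities(quality_string: str, offset: int = 33) -> list[int]:
    """Decode a quality string, assuming the Phred+33 ASCII encoding."""
    return [ord(ch) - offset for ch in quality_string]


# Reproduce the rows of the table above.
for q in (3, 5, 10, 20, 30, 40):
    print(f"Q{q}: {phred_to_error_prob(q):.2%}")

print(decode_qualities("I5#"))
```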

17 Reference Mapping

18 Reference Mapping
Why do we map reads to the reference?
By comparing the reads from a sequenced individual to a reference genome, we can identify variants such as SNPs and rearrangements
To do this, we need to identify where in the reference genome each read might have come from

19 Reference Mapping Issues
The genome is very large and repetitive
The mapping program must be efficient and tolerant of repetitive sequences
Mappers like BWA use an index of the reference genome to rapidly identify possible mapping locations

20 Reference Mapping Issues
The reads contain sequencing errors
The mapping program must tolerate differences between the reads and the reference
Typically the mapper will find exact-match seeds, then refine the seed alignments using dynamic programming
Mapping reads with many errors or insertions/deletions is much harder
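
The seed-then-refine idea can be sketched in a few lines of Python. This is a toy illustration only: it uses a dictionary of k-mers as the index, whereas BWA uses an FM-index, and the function names, the `max_edits` threshold and the example sequences are all assumptions made for this sketch.

```python
from collections import defaultdict
from typing import Optional


def build_kmer_index(reference: str, k: int) -> dict[str, list[int]]:
    """Map every k-mer in the reference to the positions where it occurs."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def edit_distance_to_prefix(read: str, window: str) -> int:
    """Fewest edits needed to align the whole read to some prefix of window."""
    previous = list(range(len(window) + 1))  # empty read vs window[:j]
    for i, read_base in enumerate(read, start=1):
        current = [i]
        for j, ref_base in enumerate(window, start=1):
            current.append(min(
                previous[j - 1] + (read_base != ref_base),  # match or mismatch
                previous[j] + 1,                            # insertion in the read
                current[j - 1] + 1,                         # deletion from the read
            ))
        previous = current
    return min(previous)


def map_read(read: str, reference: str, index: dict[str, list[int]],
             k: int, max_edits: int = 2) -> Optional[tuple[int, int]]:
    """Return (start, edits) for the best candidate location, or None."""
    # Seeding: every exact k-mer hit proposes a candidate start position.
    candidates = set()
    for offset in range(len(read) - k + 1):
        for ref_pos in index.get(read[offset:offset + k], []):
            if ref_pos - offset >= 0:
                candidates.add(ref_pos - offset)
    # Refinement: score each candidate with dynamic programming.
    best = None
    for start in sorted(candidates):
        window = reference[start:start + len(read) + max_edits]
        edits = edit_distance_to_prefix(read, window)
        if edits <= max_edits and (best is None or edits < best[1]):
            best = (start, edits)
    return best


# Made-up reference; the read is a copy of positions 8-19 with one substitution.
reference = "TTGACCGTAGGCTAACGTTAGCCATGCA"
read = reference[8:13] + "G" + reference[14:20]
k = 4
index = build_kmer_index(reference, k)
print(map_read(read, reference, index, k))
```

Seeds that fall on the substituted base find no exact match, but the remaining seeds still point to the correct location, which is why a read with a few errors can be placed.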

21 Reference Mapping Issues
Short read sequencers produce huge amounts of data
The mapping algorithm must be extremely efficient while accounting for the issues discussed above

22 Choosing a Mapper
Needs to be accurate: misaligned reads are a source of false positive variant calls
Needs to be sensitive: must allow for differences between the individual and reference
Needs to be fast

23

24 Reference Mapping Reference genome Sequence read ?

25 Reference Mapping Reference genome x x x Sequence read

26 Mapping Quality
Phred-scaled estimate of the probability that the chosen mapping is wrong
On average, 1 in 1000 reads with a “Q30” alignment will be placed incorrectly
What causes mapping errors?
High error rate
Repetitive sequence
Differences between the reference and sequenced sample
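
Mapping quality uses the same Phred scale as base quality, so the "1 in 1000 at Q30" figure can be checked directly. The sketch below is illustrative: the function names, the `(read name, mapping quality)` tuples and the threshold of 30 are assumptions, not recommendations from the slides.

```python
def mapq_to_error_prob(mapq: int) -> float:
    """Estimated probability that a mapping with this quality is wrong."""
    return 10 ** (-mapq / 10)


def expected_misplaced(mapqs: list[int]) -> float:
    """Expected number of incorrectly placed reads among the given mappings."""
    return sum(mapq_to_error_prob(q) for q in mapqs)


def filter_by_mapq(alignments: list[tuple[str, int]],
                   min_mapq: int = 30) -> list[tuple[str, int]]:
    """Keep (read name, mapping quality) pairs at or above a threshold."""
    return [aln for aln in alignments if aln[1] >= min_mapq]


# 1000 reads at Q30 are expected to contain about one misplaced read.
print(expected_misplaced([30] * 1000))

# Hypothetical alignments, for illustration only.
alignments = [("read_1", 60), ("read_2", 3), ("read_3", 30)]
print(filter_by_mapq(alignments))
```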

27 What are Paired Reads?
DNA fragment ATCAAGA CTACATG Insert size (IS)
Slides by M. Brudno

28 Paired Reads
Reference genome ? Sequence read pair

29 Paired Mapping Reference genome x x Sequence read pair

30 Paired Mapping Reference genome x x x x x x x x Sequence read pair

31 Sequence Alignment/Map Format
SAM/BAM is a format for working with mapped reads
SAM is a tab-delimited text representation
BAM is a compressed binary representation
SRR M = NCCAGCAGCCATAACTGGAATGGGAAATAAACACTATGTTCAAAGCAGAGAAAATAGGAGTGTGCAATAGACTTAT

32 SAM Format
Read ID, Flag
Flag indicates the reference strand and pairing information

33 SAM Description
Chromosome, Position

34 SAM Description
Mapping Quality

35 SAM Description
CIGAR
Ref    ACGATACATAC      Ref    GACA-AACC
Read   ACGA-ACATAC      Read   GTCATAACC
CIGAR: 4M1D6M           CIGAR: 4M1I4M
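
The two CIGAR strings on this slide can be checked with a small parser. The sketch below splits a CIGAR string into operations and counts how many read bases and reference bases the alignment covers; which operations consume the read or the reference follows the SAM specification, while the function names are illustrative.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
CONSUMES_READ = set("MIS=X")       # operations that advance along the read
CONSUMES_REFERENCE = set("MDN=X")  # operations that advance along the reference


def parse_cigar(cigar: str) -> list[tuple[int, str]]:
    """Split a CIGAR string into (length, operation) pairs."""
    ops = [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]
    if "".join(f"{length}{op}" for length, op in ops) != cigar:
        raise ValueError(f"invalid CIGAR string: {cigar!r}")
    return ops


def cigar_lengths(cigar: str) -> tuple[int, int]:
    """Return (read bases, reference bases) covered by the alignment."""
    ops = parse_cigar(cigar)
    read_len = sum(n for n, op in ops if op in CONSUMES_READ)
    ref_len = sum(n for n, op in ops if op in CONSUMES_REFERENCE)
    return read_len, ref_len


for cigar in ("4M1D6M", "4M1I4M"):
    print(cigar, parse_cigar(cigar), cigar_lengths(cigar))
```

For `4M1D6M` the read is one base shorter than the reference span, and for `4M1I4M` it is one base longer, matching the two alignments drawn on the slide. Note that `M` covers both matches and mismatches, as in the second example.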

36 SAM Description
Mate chromosome, position
Insert size
ATCAA CTAAG Insert size (IS)
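
Putting the preceding slides together, a SAM alignment line has eleven mandatory tab-separated fields: read ID, flag, chromosome, position, mapping quality, CIGAR, mate chromosome, mate position, insert size, sequence and base qualities. The sketch below parses one line and decodes the flag bits. The bit values come from the SAM specification; the descriptive labels, the function names and the example record are made up for illustration.

```python
from typing import NamedTuple

# Bit values are from the SAM specification; the labels are informal names.
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


class SamRecord(NamedTuple):
    qname: str  # read ID
    flag: int   # strand and pairing information
    rname: str  # chromosome
    pos: int    # 1-based leftmost position
    mapq: int   # mapping quality
    cigar: str
    rnext: str  # mate chromosome ('=' means same as rname)
    pnext: int  # mate position
    tlen: int   # insert size
    seq: str
    qual: str


def parse_sam_line(line: str) -> SamRecord:
    """Parse the eleven mandatory fields; optional tags are ignored."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs at least 11 fields")
    return SamRecord(fields[0], int(fields[1]), fields[2], int(fields[3]),
                     int(fields[4]), fields[5], fields[6], int(fields[7]),
                     int(fields[8]), fields[9], fields[10])


def decode_flag(flag: int) -> list[str]:
    """List the labels of all bits set in a SAM flag."""
    return [label for bit, label in FLAG_BITS.items() if flag & bit]


# Hypothetical record, for illustration only.
line = "read_1\t99\tchr1\t1000\t60\t10M\t=\t1200\t210\tACGTACGTAC\tIIIIIIIIII"
record = parse_sam_line(line)
print(record.rname, record.pos, record.mapq, decode_flag(record.flag))
```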

37 Resources
samtools: toolkit for working with SAM/BAM files
Convert between SAM/BAM
Sort alignments
Extract alignments for a given genomic location
SAM/BAM specification:
Questions/Help

38 Viewing Alignments - IGV

39 Alignment Problems

40 Alignment Problems

41 We are now going to start a read mapping exercise

42 We are on a Coffee Break & Networking Session

43 Types of variation
Single Nucleotide Variants (SNVs)
Insertions/deletions (INDELs)
Structural variations
Large insertions and deletions
Inversions
Translocations
Copy number variation

44 Structural variants using paired-end reads
Genomic DNA
Fragmentation and size selection ( bp)
Add sequencing adaptors
Sequence both ends

45 Read pair orientation
Reference, read pair
Expected orientation: one read on the forward strand, one read on the reverse strand

46 Fragment size distribution
from: doi: /ng.3121
Fragment/insert size is determined by library preparation
Pairs that match the expected orientation and distance are called concordant
Discordant read pairs give evidence of structural variation
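
One way to make "expected distance" concrete is to estimate the insert size range from the data and test each pair against it. The sketch below is an assumption-laden illustration: the median plus or minus five median absolute deviations rule, the function names and the example sizes are all chosen for this sketch and are not taken from the slides or from any particular caller.

```python
from statistics import median


def insert_size_bounds(sizes: list[int], n_mads: float = 5.0) -> tuple[float, float]:
    """Expected insert-size range as median +/- n_mads * MAD (an assumed rule)."""
    centre = median(sizes)
    mad = median(abs(size - centre) for size in sizes)
    return centre - n_mads * mad, centre + n_mads * mad


def is_concordant(same_chromosome: bool, left_strand: str, right_strand: str,
                  insert_size: int, bounds: tuple[float, float]) -> bool:
    """True when a pair has the expected orientation and distance.

    left_strand is the strand of the leftmost read of the pair.
    """
    low, high = bounds
    expected_orientation = left_strand == "+" and right_strand == "-"
    return same_chromosome and expected_orientation and low <= insert_size <= high


# Made-up insert sizes from well-behaved pairs.
observed = [290, 300, 310, 295, 305, 300, 298, 302]
bounds = insert_size_bounds(observed)
print(bounds)
print(is_concordant(True, "+", "-", 300, bounds))   # typical pair
print(is_concordant(True, "+", "-", 5000, bounds))  # distance too large
print(is_concordant(True, "+", "+", 300, bounds))   # unexpected orientation
```

Median-based statistics are used here because a few discordant pairs with very large mapped distances would distort a mean and standard deviation.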

47 SV Signatures: Deletion
sample reference Slides by M. Brudno

48 SV Signatures: Deletion
sample reference Signature: mapped insert size larger than expected Slides by M. Brudno

49 SV Signatures: Insertion
sample reference Signature: mapped insert size smaller than expected Slides by M. Brudno

50 SV Signatures: Tandem Duplication
sample reference Signature: wrong orientation

51 SV Signatures: Inversion
sample reference Signature: wrong orientation

52 SV summary
Type                 Mapped Distance         Orientation
Insertion            too small               correct
Deletion             too big                 correct
Inversion            *                       wrong
Tandem duplication                           wrong
Interchromosomal     different chromosomes   N/A
Slides by M. Brudno
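
The summary can be read as a decision procedure for a single read pair. The sketch below encodes it directly; the function name, the argument names and the example insert-size range are assumptions. Because the slides give the same "wrong orientation" signature for inversions and tandem duplications, the sketch does not try to separate them.

```python
def classify_pair(same_chromosome: bool, orientation_ok: bool,
                  mapped_distance: int, low: int, high: int) -> str:
    """Name the structural-variant signature suggested by one read pair.

    low and high bound the expected insert size for the library.
    """
    if not same_chromosome:
        return "interchromosomal"
    if not orientation_ok:
        return "inversion or tandem duplication"
    if mapped_distance > high:
        return "deletion"   # mapped insert size larger than expected
    if mapped_distance < low:
        return "insertion"  # mapped insert size smaller than expected
    return "concordant"


# Hypothetical library with an expected insert size of 250 to 350 bp.
print(classify_pair(True, True, 2300, 250, 350))
print(classify_pair(True, True, 120, 250, 350))
print(classify_pair(True, False, 300, 250, 350))
print(classify_pair(False, True, 0, 250, 350))
```

A single discordant pair is weak evidence; callers typically require several pairs supporting the same event.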

53 Problems: missed large insertion
sample reference Insertions larger than insert size cannot be detected this way

54 Deletion: split read signature
don ref Signature: read aligns in two pieces, one on either side of the breakpoint
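
The split-read signature locates a deletion more precisely than read pairs, because the two aligned pieces end and begin at the breakpoints. A minimal sketch, assuming the two pieces are already known, lie on the same chromosome and strand, and use 0-based half-open coordinates; the `Piece` type, the function name and the coordinates are hypothetical.

```python
from typing import NamedTuple, Optional


class Piece(NamedTuple):
    read_start: int  # 0-based, half-open interval on the read
    read_end: int
    ref_start: int   # 0-based, half-open interval on the reference
    ref_end: int


def deletion_from_split_read(left: Piece, right: Piece) -> Optional[tuple[int, int]]:
    """Return the deleted reference interval implied by two pieces, if any."""
    contiguous_on_read = left.read_end == right.read_start
    gap_on_reference = right.ref_start - left.ref_end
    if contiguous_on_read and gap_on_reference > 0:
        return left.ref_end, right.ref_start
    return None


# A 100 bp read whose two halves align 2000 bp apart on the reference.
left = Piece(0, 50, 1000, 1050)
right = Piece(50, 100, 3050, 3100)
print(deletion_from_split_read(left, right))
```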

55 Gene fusions
If a linking signature connects two genes, this might indicate a gene fusion
Gene X (ChrA), Gene Y (ChrB), Gene XY, Protein

56 Somatic vs. Germline
When sequencing cancers we want to know about the somatic changes: the mutations that are only in the tumour
We do this by looking for evidence of structural variation that is present in the tumour sample but absent from the normal sample
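
The tumour-versus-normal comparison can be sketched as a simple filter over two lists of calls. Everything here is an assumption for illustration: the `SvCall` type, the 100 bp breakpoint tolerance and the example calls. Real callers usually compare the supporting read evidence in both samples rather than finished call lists.

```python
from typing import NamedTuple


class SvCall(NamedTuple):
    chrom: str
    start: int
    end: int
    sv_type: str


def same_event(a: SvCall, b: SvCall, tolerance: int) -> bool:
    """Treat two calls as the same event if type and breakpoints agree."""
    return (a.chrom == b.chrom and a.sv_type == b.sv_type
            and abs(a.start - b.start) <= tolerance
            and abs(a.end - b.end) <= tolerance)


def somatic_calls(tumour: list[SvCall], normal: list[SvCall],
                  tolerance: int = 100) -> list[SvCall]:
    """Keep tumour calls that have no matching call in the normal sample."""
    return [call for call in tumour
            if not any(same_event(call, germline, tolerance) for germline in normal)]


# Hypothetical calls, for illustration only.
tumour = [SvCall("chr1", 1000, 3000, "deletion"),
          SvCall("chr2", 5000, 9000, "inversion")]
normal = [SvCall("chr1", 1020, 2990, "deletion")]
print(somatic_calls(tumour, normal))
```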

57 We are now going to start an exercise in structural variant detection

58 We are on a Coffee Break & Networking Session

59 Any questions? jared.simpson@oicr.on.ca

