(with many slides adapted from Jim Noonan)


1 Introduction to Short Read Sequencing Analysis
James Knight (with many slides adapted from Jim Noonan)
GENE 760

2 Rule to remember when using bioinformatics tools
Computer scientists and mathematicians/statisticians make simplifications to speed up computations or to create a robust statistical model.
So, every tool solves an approximation of the problem.
The answers you get will largely be correct, but must be taken with a grain of salt:
    What might the tools be missing?
    What quality/confidence score values can be believed?
    Do the downstream steps of the pipeline adjust for the simplifications?

3 Sequence read lengths remain limiting
[Figure: Chr1 (249 Mb) compared with a sequencing read]
Current platforms:
    Illumina: a very large number (2 billion) of short reads ( bp)
    PacBio: a moderate number (~500,000) of long reads (~10 kb)
For most applications, reads are aligned to a reference genome.
Short reads contain inherently limited information.
De novo assembly of short reads is difficult.

4 Determining the identity and location of short sequence reads in the genome/exome/transcriptome
@HWI-ST974:58:C059FACXX:2:1201:10589: :N:0:TGACCA
TGCACACTGAAGGACCTGGAATATGGCGAGAAAACTGAAAATCATGGAAAATGAGAAATACACACTTTAGGACGTG
Aligning short reads to a much larger reference.
Need a computationally efficient method to perform accurate alignments of millions of reads.

5 Fundamental simplification
Fundamental simplification:
    Represent DNA molecules as strings (or sequences)
    4 character alphabet, plus N
Algorithmic simplification:
    Finding exact matches is quickest
        TAGATTAC
        ||||||||
        TAGATTAC
    Comparing strings is slower
        TAGATTACTCAGA
        |||||||| ||||
        TAGATTACACAGA
    “Gapped” alignment is very slow
        TAGATTACTCAGA-TAC
        |||||||| |||| |||
        TAGATTACACAGATTAC

6 First attempt: A “telephone directory”
Suppose the input is 100 bp reads.
Create a “telephone directory” of the 100 bp sections of the genome:
    Sorted list of sequences, with locations
    <Sequence, chromosome, start position>
For each read, look it up in the directory to find the genome location(s) for the read.
This solution does not work:
    Lookup might be slow for hundreds of millions of reads
    Sequencing errors and variation prevent correct lookup
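The failure mode of the directory can be seen in a minimal sketch. The genome string, the section length of 8 (instead of 100 bp), and the function name `build_directory` are toy assumptions chosen only for illustration.

```python
def build_directory(genome, k):
    """Map every k-long section of the genome to its 0-based start positions."""
    directory = {}
    for start in range(len(genome) - k + 1):
        directory.setdefault(genome[start:start + k], []).append(start)
    return directory


# Toy values for illustration only; real reads would be ~100 bp.
genome = "TAGATTACACAGATTACGGCT"
directory = build_directory(genome, k=8)

exact_read = "ACACAGAT"  # copied from the genome without errors
error_read = "ACACTGAT"  # the same read with one substitution

exact_hits = directory.get(exact_read, [])  # found at its true position
error_hits = directory.get(error_read, [])  # a single error defeats the lookup
```

A read that matches the genome exactly is found, while the read with one substitution returns no location at all, which is the problem the seed-based methods address.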

7 Older (and Newer) Algorithms: Seed and Extend
Find shorter exact matches, called seeds:
    Long enough to be mostly unique (20-25 bp for human)
    Short enough to find exact matches for “all” reads
Encode seeds as integer values:
    2-bit encoding: A=00, C=01, G=10, T=11
    A 32 nt seed fits within a 64-bit integer
    Exact seed match when the integer values are the same
Create a “telephone directory” for the reference, called an index:
    Choose forward strand only, or forward and reverse strands?
    Store every seed, or spaced seeds (every X nts)?
Newer algorithms create a Burrows-Wheeler Transform index.
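The 2-bit encoding above can be sketched as follows. The function name `encode_seed` is hypothetical, and the sketch assumes all seeds being compared have the same length, since seeds of different lengths can map to the same integer (for example "A" and "AA" both encode to 0).

```python
# 2-bit codes from the slide: A=00, C=01, G=10, T=11.
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def encode_seed(seed):
    """Pack a seed of up to 32 nt into an integer that fits in 64 bits."""
    if len(seed) > 32:
        raise ValueError("a 64-bit integer holds at most 32 nucleotides")
    value = 0
    for base in seed:
        if base not in CODE:
            # N (or any other symbol) has no 2-bit code in this scheme.
            raise ValueError("cannot encode base: " + base)
        value = (value << 2) | CODE[base]
    return value
```

With this representation, comparing two equal-length seeds reduces to a single integer comparison.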

8 Reference genome indexing using Burrows-Wheeler transform
Reversible encoding scheme
All locations of every suffix are stored in a contiguous region of the index
For any suffix, the transform table can find the region when adding a 1 nt prefix
Lookups become very fast, independent of reference size
Trapnell and Salzberg, Nat Biotechnology 27:455 (2009)
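A toy sketch of the idea: the index is built by sorting suffixes, and a pattern is located by prepending one nucleotide at a time and narrowing a contiguous range of rows. The function names are hypothetical, and the naive suffix sort is an assumption for brevity; real aligners build and compress the index far more efficiently.

```python
def build_index(reference):
    """Build a small Burrows-Wheeler index (suffix array, first-column offsets, counts)."""
    text = reference + "$"  # '$' terminator sorts before A, C, G, T
    sa = sorted(range(len(text)), key=lambda i: text[i:])  # naive suffix array
    bwt = "".join(text[i - 1] for i in sa)  # character preceding each sorted suffix
    alphabet = sorted(set(text))
    first = {}  # first[c] = number of characters in text smaller than c
    total = 0
    for c in alphabet:
        first[c] = total
        total += text.count(c)
    # occ[c][i] = number of occurrences of c in bwt[:i]
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)
    return sa, first, occ


def find(pattern, index):
    """Return sorted 0-based reference positions where pattern occurs exactly."""
    sa, first, occ = index
    lo, hi = 0, len(sa)  # half-open range of suffix-array rows
    for c in reversed(pattern):  # extend the matched suffix by a 1 nt prefix
        if c not in first:
            return []
        lo = first[c] + occ[c][lo]
        hi = first[c] + occ[c][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


index = build_index("TAGATTACACAGATTAC")
```

The loop in `find` runs once per pattern character, so the number of steps depends on the read length and not on the size of the reference.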

9 Older (and Newer) Algorithms: Seed and Extend
Look up the seeds occurring in each read:
    Forward strand only, or forward and reverse?
    Every seed, or spaced seeds?
Extend seeds to calculate the alignment:
    Ungapped alignment: just comparing strings
    Gapped alignment: using dynamic programming
TGCACACTGAAGGACCTGGAATATGGCGAGAAAACTGAAAATCATGGAAAATGAGAAATACACACTTTAGGACGTG
TGCACACTGAAGGTCCTGGAATATGGCGAGAAAACTGAAAATCATGGAAA--GAGAAATACACACTTTAGGACGTG
[Figure labels: Ref, Read]
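The lookup and ungapped extension steps can be sketched with a plain dictionary as the index. All names, the seed length, the step between seeds, and the toy sequences are assumptions for illustration; the sketch handles the forward strand only and does not attempt gapped extension.

```python
def build_seed_index(reference, seed_len):
    """Index every seed_len-mer of the reference by its 0-based start positions."""
    index = {}
    for pos in range(len(reference) - seed_len + 1):
        index.setdefault(reference[pos:pos + seed_len], []).append(pos)
    return index


def seed_and_extend(read, reference, index, seed_len, step, max_mismatches):
    """Return sorted (reference_start, mismatches) pairs for ungapped placements."""
    hits = {}
    for offset in range(0, len(read) - seed_len + 1, step):
        seed = read[offset:offset + seed_len]
        for seed_pos in index.get(seed, []):
            start = seed_pos - offset  # where the whole read would begin
            if start < 0 or start + len(read) > len(reference) or start in hits:
                continue
            window = reference[start:start + len(read)]
            # Ungapped extension: compare the strings position by position.
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= max_mismatches:
                hits[start] = mismatches
    return sorted(hits.items())


reference = "GGCTTAGATTACACAGATTACCGTA"
read = "TAGATTACTCAGATTAC"  # one substitution relative to the reference
index = build_seed_index(reference, seed_len=6)
placements = seed_and_extend(read, reference, index, seed_len=6, step=3, max_mismatches=2)
```

Here the seeds that overlap the substitution find nothing, but the error-free seeds still anchor the read, and extension then counts the single mismatch.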

10 Alignments in Bowtie 2
@HWI-ST974:58:C059FACXX:2:1201:10589: :N:0:TGACCA
TGCACACTGAAGGACCTGGAATATGGCGAGAAAACTGAAAATCATGGAAAATGAGAAATACACACTTTAGGACGTG
Multiseed alignment (ungapped):
    Ref index: BWT, every nt, fwd only
    Read seeds: 16 nt, every 10 nt, fwd+rev
Seeds are extended (gaps allowed) to generate the alignment:
    TGCACACTGAAGGACCTGGAATATGGCGAGAAAACTGAAAATCATGGAAAATGAGAAATACACACTTTAGGACGTG
    TGCACACTGAAGGTCCTGGAATATGGCGAGAAAACTGAAAATCATGGAAA--GAGAAATACACACTTTAGGACGTG
    [Figure labels: Ref, Read]
Scoring:
    Match = 2
    Mismatch = -6
    Gap = -11 (-5 to open, -3 to extend by 1 bp, so the 2 bp gap costs -5 + 2 × -3)
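As a worked check of the scoring values on this slide, the following sketch totals the score of the alignment shown. The function name `alignment_score` is hypothetical, the penalties are the ones given on the slide, and any further scoring options of Bowtie 2 itself are not modeled.

```python
def alignment_score(row1, row2, match=2, mismatch=-6, gap_open=-5, gap_extend=-3):
    """Score two equal-length alignment rows in which '-' marks a gap."""
    if len(row1) != len(row2):
        raise ValueError("alignment rows must have the same length")
    score = 0
    previous = "aligned"  # whether the previous column was a gap, and in which row
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            kind = "gap_in_row1" if a == "-" else "gap_in_row2"
            if kind != previous:
                score += gap_open  # charged once per gap
            score += gap_extend  # charged for every gapped base
            previous = kind
        else:
            score += match if a == b else mismatch
            previous = "aligned"
    return score


row1 = "TGCACACTGAAGGACCTGGAATATGGCGAGAAAACTGAAAATCATGGAAAATGAGAAATACACACTTTAGGACGTG"
row2 = "TGCACACTGAAGGTCCTGGAATATGGCGAGAAAACTGAAAATCATGGAAA--GAGAAATACACACTTTAGGACGTG"
score = alignment_score(row1, row2)  # 73 matches, 1 mismatch, one 2 bp gap
```

The alignment has 73 matching columns, one mismatch, and one gap of length 2, giving 73 × 2 - 6 - 11 = 129.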

11 Scoring alignments
Ungapped:
    Match (+1)
    Mismatch (-1, -2, etc.)
        TAGATTACTCAGATTAC
        |||||||| ||||||||
        TAGATTACACAGATTAC
Gapped:
    Gap penalty: P = a + bN
        a = cost of opening a gap
        b = cost of extending gap by 1
        N = length of gap
        A-TAC
        ATTAC
        A--AC
        TAGATTACTCAGA-TAC
        |||||||| |||| |||
        TAGATTACACAGATTAC
Simpler algorithms allow a fixed number of mismatches.
Complex algorithms pick the best scoring alignment, stopping extension if the score falls below a threshold from the best score so far.
Adapted from Mark Gerstein
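To show how a best scoring gapped alignment can be found, here is a minimal dynamic programming sketch of global alignment. It is a simplification of the scheme above: it assumes a linear gap penalty (a = 0, so P = bN) to keep the recurrence short, and the function name and default scores are illustrative choices.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of a and b with a linear gap penalty."""
    # prev[j] holds the best score for aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # a[:i] aligned against an empty prefix of b
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            gap_in_b = prev[j] + gap  # a[i-1] aligned to a gap
            gap_in_a = curr[j - 1] + gap  # b[j-1] aligned to a gap
            curr.append(max(diagonal, gap_in_b, gap_in_a))
        prev = curr
    return prev[-1]


ungapped = global_alignment_score("TAGATTACTCAGATTAC", "TAGATTACACAGATTAC")
gapped = global_alignment_score("TAGATTACTCAGATAC", "TAGATTACACAGATTAC")
```

For the two example pairs from the slide, the ungapped pair scores 16 matches and 1 mismatch, and the gapped pair scores 15 matches, 1 mismatch, and one gap of length 1. The table has one cell per pair of positions, which is why gapped alignment is the slow step and is applied only around seed hits.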

12 Common algorithms for mapping short reads to a reference genome
Program            Website
ELAND (v2)         N/A – integrated into Illumina pipeline
Bowtie/Bowtie2
BWA
Novoalign
Considerations:
    Alignment scoring method
    Speed
    Quality aware
    Seeding
    Gapped alignment
    Split read alignment

13 Split read alignments for transcriptomes and structural variations
Aligning to transcripts: splice junctions (contiguous)
Aligning to genome: splice junctions (split)

14 Simplifications Aligners Use
Assume the reference is complete
The best scoring match is considered the location (or locations) of the read in the genome
Speed is the priority
Difficult-to-align ends are “soft clipped”
Reads are aligned individually
Indels may not be consistent across reads

15 Quality Scores

16 Scoring using Likelihood Measures
Likelihood-based scoring methods are more robust than empirically derived scoring methods.
Construct a model of the error modes:
    Usually, a training-based Bayesian or HMM model
Common NGS quality scores:
    Basecall quality scores
    Mapping quality scores
    Variant/Genotype quality scores

17 Basecall quality scores (“Phred” scores)
A quality score (or Q-score) expresses the probability that a basecall is incorrect.
Given a basecall, A:
    The estimated probability that A is not correct is P(~A)
    The quality score for A is Q(A) = -10 log10(P(~A))
A quality score of 10 means a probability of 0.1 that A is the wrong basecall.
P(~A) is platform-specific; Q-scores can be compared across platforms.
Quality scores are logarithmic:
    Q-score    Error probability
    10         0.1
    20         0.01
    40         0.0001
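The conversion between error probability and Q-score can be written directly from the formula above. The function names are hypothetical.

```python
import math


def phred_from_error(p_error):
    """Q = -10 * log10(P(~A)) for an error probability in (0, 1]."""
    if not 0 < p_error <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10 * math.log10(p_error)


def error_from_phred(q):
    """Invert the formula: P(~A) = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)
```

Each additional 10 points of Q-score corresponds to a tenfold drop in the estimated error probability.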

18 Errors in Illumina sequencing reads
Sequencing by synthesis with reversible dye terminators
[Figure: 1 cycle = add base, scan flow cell, reverse termination; then add next base, etc.]
Wrong “color” called when scanning image
Wrong nucleotide is added
Terminator not removed for a cycle
Multiple bases added (terminator doesn’t work for a base)

19 Errors in Illumina sequencing reads
Errors are mainly mismatches (substitutions)
Error rates increase with increasing cycle number

20 Errors in single-molecule sequencing
PacBio:
    TAGATTA-ACAG-TT-C
    ||||||| |||| || |
    TAGATTACACAGATTAC
Fluorescent molecules, attached to each nucleotide, captured by direct observation of “zero-mode waveguide” during incorporation by polymerase
Single molecule screening - gaps
    Incorporation happens too quickly to be observed
    Nucleotide sits in polymerase “pocket” (and is detected), but is not incorporated
    Wrong nucleotide incorporated
Errors are mainly insertions and deletions

21 Quality Score Methodology
Training-based calibration of scores:
    Identify key “features” of a platform’s sequencing signals
        Signal intensity/frequency/duration, base position, ...
        Phred method: must be a continuous value range, “monotonic” to error rate
    Train using a set of runs from known samples and genomes
        Correlate errors to feature value ranges
        Phred method: bin combinations of value ranges, then use error counts to set score
    Must recalibrate with changes in instrument or reagent kits
Simplifications:
    The training set characterizes the range of genomes/transcriptomes/...
    Library/sequencing prep PCR errors are not captured by sequencing measurements

22 GATK Recalibration

23 Quality score encoding in FASTQ format
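As a minimal sketch of how a FASTQ quality line maps back to Q-scores, the following assumes the widely used Phred+33 convention, where each character's ASCII code minus 33 is the Q-score. The offset and the function name are assumptions; older files may use a different offset.

```python
def decode_qualities(quality_line, offset=33):
    """Convert a FASTQ quality string into a list of integer Q-scores."""
    # offset=33 is an assumption (Phred+33); pass offset=64 for Phred+64 data.
    return [ord(ch) - offset for ch in quality_line]


example_scores = decode_qualities("!5I")
```

Under this convention the characters `!`, `5`, and `I` decode to Q0, Q20, and Q40.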

24 Mapping Quality

25 Mappability
The genome contains non-unique sequences (repeats, segmental duplications)
Short reads derived from repetitive regions are difficult to map
[Figure: repeat on Chr3 and Chr7; panels: Longer reads, Paired reads]

26 Mappability scores at UCSC
The genome contains non-unique sequences (repeats, segmental duplications)
Short reads derived from repetitive regions are difficult to map
[Figure tracks: 36mers, 2 mismatches; 75mers, 2 mismatches; 100mers, 2 mismatches]

27 Poorly mappable regions of the genome
36mers, 2 mismatches 75mers, 2 mismatches 100mers, 2 mismatches

28 Mapping quality score
Base quality values and mismatch positions in a candidate alignment are used to assign a p value.
P values reflect the probability that the candidate position in the genome would give rise to the observed read if its bases were sequenced at error rates corresponding to the read’s quality values.
The mapping quality score for a read is computed from the p values of all candidate alignments.
If there are two candidates for a read with p values 0.9 and 0.3:
    0.9/(0.9 + 0.3) = 0.75, chance highest scoring alignment is correct
    1 - 0.75 = 0.25, chance highest scoring alignment is wrong
    Mapping quality score = -10 log10(0.25) ≈ 6
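The arithmetic in this example generalizes to any number of candidate alignments. The sketch below follows the slide's calculation only; the function name is hypothetical, and real aligners use their own formulas and typically cap the reported value.

```python
import math


def mapping_quality(p_values):
    """Phred-scaled chance that the highest scoring candidate is wrong."""
    p_correct = max(p_values) / sum(p_values)
    p_wrong = 1 - p_correct
    if p_wrong <= 0:
        return float("inf")  # a single candidate: no competing location
    return -10 * math.log10(p_wrong)


mapq = mapping_quality([0.9, 0.3])  # the two-candidate example from the slide
```

Adding more candidates with comparable p values lowers the score, which is how reads from repetitive regions end up with low mapping quality.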

29 Counting and Frequencies

30 Counting Reads, Converting to Frequencies
NGS reads are all “clonally amplified”:
    Start with a single molecule
    Either sequence it directly, or amplify it into a cluster/spot/well
Post-alignment analysis begins with read counting:
    Variant vs Non-variant
    Transcript
    ChIP-seq peaks
Read counts (depth) converted into frequencies:
    Normalizes across datapoints to permit comparisons
    “Alt frequency”, FPKM/RPKM, fold change
Filtering applied to produce call set:
    Distinguish significant events from random chance or sequencing error
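Two of the count-to-frequency conversions named above can be sketched as follows, using the standard RPKM definition (reads per kilobase of feature per million mapped reads). The function names and the example numbers are illustrative assumptions.

```python
def rpkm(read_count, feature_length_bp, total_mapped_reads):
    """Reads per kilobase of feature per million mapped reads."""
    return read_count * 1e9 / (feature_length_bp * total_mapped_reads)


def alt_frequency(alt_reads, total_reads):
    """Fraction of reads at a position that support the alternate allele."""
    return alt_reads / total_reads if total_reads else 0.0


# Hypothetical numbers: 500 reads on a 2,000 bp transcript, 10 million mapped reads.
example_rpkm = rpkm(500, 2000, 10_000_000)
example_alt = alt_frequency(12, 40)
```

Dividing by feature length and by sequencing depth is what makes values comparable across transcripts and across samples.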

31 Counting Reads, Converting to Frequencies
Confounders:
    Sampling variation (across genome/transcriptome)
    PCR duplicates
    Non-random sequencing error
    GC-bias
    Strand bias
    Amplification bias
    Alignment bias
    Sample purity
    Transcript length

32 Variant Calling Quality Scores
At each location where reads differ from the reference, compute likelihood values:
    Difference from the reference
    Each genotype (0/0, 0/1, 1/1) for diploid genomes
Models are either pre-trained or trained on the reads using a “truth set” of known variants.
Variant quality score:
    Convert likelihood into phred-scaled score
Genotype quality score:
    Most likely genotype minus second most likely genotype
    Phred-scaled score
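One plausible reading of the genotype quality calculation is sketched below: each genotype likelihood is phred-scaled, and the quality is the gap between the best and second-best genotype. The function names and the example likelihoods are assumptions, and actual variant callers may normalize or cap these values differently.

```python
import math


def phred_scaled_likelihoods(likelihoods):
    """Phred-scale genotype likelihoods and shift them so the best genotype is 0."""
    raw = {g: -10 * math.log10(l) for g, l in likelihoods.items()}
    best = min(raw.values())
    return {g: q - best for g, q in raw.items()}


def call_genotype(likelihoods):
    """Return (most likely genotype, phred-scaled margin over the runner-up)."""
    scaled = phred_scaled_likelihoods(likelihoods)
    ranked = sorted(scaled, key=scaled.get)  # lowest phred value = most likely
    return ranked[0], scaled[ranked[1]] - scaled[ranked[0]]


# Hypothetical likelihoods of the observed reads under each diploid genotype.
genotype, gq = call_genotype({"0/0": 1e-6, "0/1": 1e-1, "1/1": 1e-3})
```

In this example the heterozygous genotype is 100 times more likely than the runner-up, which corresponds to a margin of 20 on the phred scale.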

33 Conclusion
First alignment, then counting, and then comes the analysis
Alignment and counting are computation-focused
    Pipelines trade off speed vs. accuracy
Quality/confidence measures are calculated and used in the analysis
    Simple likelihoods or trained models?
Your choices depend on your objectives/budgets

