Introduction to Short Read Sequencing Analysis James Knight (with many slides adapted from Jim Noonan) GENE 760.

1 Introduction to Short Read Sequencing Analysis James Knight (with many slides adapted from Jim Noonan) GENE 760

2 Rule to remember when using bioinformatics tools
Computer scientists and mathematicians/statisticians make simplifications to speed up computations or to create a robust statistical model
– So, every tool solves an approximation of the problem
– The answers you get will largely be correct, but must be taken with a grain of salt
What might the tools be missing?
What quality/confidence score values can be believed?
Do the downstream steps of the pipeline adjust for the simplifications?

3 Sequence read lengths remain limiting
For most applications reads are aligned to a reference genome
Short reads contain inherently limited information
De novo assembly of short reads is difficult
[Figure: a sequencing read compared with Chr1 (249 Mb)]
Current platforms:
– Illumina: A very large number (2 billion) of short reads (75-125 bp)
– PacBio: A moderate number (~500,000) of long reads (~10 kb)

4 Determining the identity and location of short sequence reads in the genome/exome/transcriptome
@HWI-ST974:58:C059FACXX:2:1201:10589:110434 1:N:0:TGACCA
TGCACACTGAAGGACCTGGAATATGGCGAGAAAACTGAAAATCATGGAAAATGAGAAATACACACTTTAGGACGTG
Aligning short reads to much larger reference
Need a computationally efficient method to perform accurate alignments of millions of reads
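The header line on this slide appears to follow the Illumina (CASAVA 1.8 style) naming layout, in which colon-separated fields record where on the flow cell the read came from. A minimal Python sketch, assuming that layout; `parse_header` and its field names are illustrative and not part of any tool:

```python
def parse_header(header: str) -> dict:
    """Split an Illumina-style FASTQ header into named fields (assumed layout)."""
    location, flags = header.lstrip("@").split(" ")
    instrument, run, flowcell, lane, tile, x, y = location.split(":")
    read, is_filtered, control, index = flags.split(":")
    return {
        "instrument": instrument, "run": int(run), "flowcell": flowcell,
        "lane": int(lane), "tile": int(tile), "x": int(x), "y": int(y),
        "read": int(read), "filtered": is_filtered == "Y",
        "control": int(control), "index": index,
    }

fields = parse_header("@HWI-ST974:58:C059FACXX:2:1201:10589:110434 1:N:0:TGACCA")
print(fields["lane"], fields["tile"], fields["index"])
```

Headers from other platforms or older pipelines use different layouts, so a parser like this should fail loudly rather than guess.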

5 Determining the identity and location of short sequence reads in the genome/exome/transcriptome
Fundamental simplification
– Represent DNA molecules as strings (or sequences)
– 4 character alphabet, plus N
Algorithmic simplification
– Finding exact matches is quickest
– Comparing strings is slower
– “Gapped” alignment is very slow
Exact match:
TAGATTAC
||||||||
TAGATTAC
Comparing strings (one mismatch):
TAGATTACTCAGA
|||||||| ||||
TAGATTACACAGA
Gapped alignment:
TAGATTACTCAGA-TAC
|||||||| |||| |||
TAGATTACACAGATTAC

6 First attempt: A “telephone directory”
Suppose input is 100 bp reads
Create “telephone directory” of 100 bp sections of genome
– Sorted list of sequences, with locations
– For each read, lookup in directory to find genome location(s) for the read
This solution does not work
– Lookup might be slow for hundreds of millions of reads
– Sequencing errors and variation prevent correct lookup
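The "telephone directory" idea can be sketched in a few lines of Python. This is a toy illustration, not how any aligner is implemented: it uses a hash table rather than a sorted list, a made-up 20 bp genome, and 6 bp entries instead of 100 bp. The name `build_directory` is hypothetical.

```python
def build_directory(genome: str, k: int) -> dict:
    """Map every k-long section of the genome to the positions where it occurs."""
    directory = {}
    for pos in range(len(genome) - k + 1):
        directory.setdefault(genome[pos:pos + k], []).append(pos)
    return directory

genome = "TAGATTACACAGATTACGGA"          # toy reference (assumption)
directory = build_directory(genome, 6)

exact_hits = directory.get("GATTAC", [])   # error-free read: found at both copies
error_hits = directory.get("GATTAG", [])   # one sequencing error: lookup fails
print(exact_hits, error_hits)
```

The second lookup shows the failure mode named on the slide: a single mismatch anywhere in the read makes the exact-match lookup return nothing.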

7 Older (and Newer) Algorithms: Seed and Extend
Find shorter exact matches, called seeds
– Long enough to be mostly unique (20-25 bp for human)
– Short enough to find exact matches for “all” reads
Encode seeds as integer values
– 2-bit encoding, A=00 C=01 G=10 T=11
– 32 nt seed fits within 64 bit integer
– Exact seed match when integer values the same
Create “telephone directory” for reference, called an index
– Choose forward strand only or forward and reverse strands?
– Store every seed, or spaced seeds (every X nts)?
– Newer algorithms create Burrows-Wheeler Transform index
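The 2-bit encoding on this slide can be written out directly. The following Python sketch uses the slide's code table; the function name `encode_seed` is illustrative, and seeds containing N are simply rejected here (an assumption, since the slide does not say how N is handled).

```python
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}  # table from the slide

def encode_seed(seed: str) -> int:
    """Pack a seed of up to 32 nt into an integer that fits in 64 bits."""
    if len(seed) > 32:
        raise ValueError("seed longer than 32 nt does not fit in 64 bits")
    value = 0
    for base in seed:
        value = (value << 2) | CODE[base]   # KeyError for N or other symbols
    return value

print(encode_seed("ACGT"))                            # binary 00 01 10 11
print(encode_seed("GATTACA") == encode_seed("GATTACA"))  # equal integers = exact seed match
```

Comparing two seeds then costs one integer comparison instead of a character-by-character string comparison.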

8 Reference genome indexing using Burrows-Wheeler transform alignment
Trapnell and Salzberg, Nat Biotechnology 27:455 (2009)
Reversible encoding scheme
All locations of every suffix stored in a contiguous region of the index
For any suffix, transform table can find region when adding 1 nt prefix
Lookups become very fast
– Independent of reference size
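The two properties listed on this slide (suffixes stored contiguously, and the region updated each time one nucleotide is prepended) are what a backward search over the transform exploits. Below is a deliberately naive Python sketch on a toy genome. Real indexes build the suffix array efficiently and precompute the occurrence counts; here both are done the slow way for readability, and the names `build_fm_index` and `find` are hypothetical.

```python
def build_fm_index(text: str):
    """Return (suffix array, BWT string, start row of each character)."""
    text += "$"                                   # end marker, sorts before A/C/G/T
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in suffix_array)
    first_row_start, total = {}, 0
    for ch in sorted(set(text)):
        first_row_start[ch] = total               # rows whose suffix starts with ch begin here
        total += text.count(ch)
    return suffix_array, bwt, first_row_start

def find(pattern: str, index) -> list:
    """Backward search: extend the matched suffix by one prefix character per step."""
    suffix_array, bwt, first_row_start = index
    lo, hi = 0, len(bwt)                          # current contiguous region [lo, hi)
    for ch in reversed(pattern):
        if ch not in first_row_start:
            return []
        lo = first_row_start[ch] + bwt[:lo].count(ch)
        hi = first_row_start[ch] + bwt[:hi].count(ch)
        if lo >= hi:
            return []
    return sorted(suffix_array[lo:hi])

index = build_fm_index("TAGATTACACAGATTACGGA")    # toy reference (assumption)
print(find("GATTAC", index))
```

The number of search steps depends on the pattern length, not on the size of the reference, which is the sense in which lookups are independent of reference size.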

9 Older (and Newer) Algorithms: Seed and Extend
Lookup the seeds occurring in each read
– Forward strand only or forward and reverse?
– Every seed or spaced seeds?
Extend seeds to calculate alignment
– Ungapped alignment just comparing strings
– Gapped alignment using dynamic programming
Ref:  TGCACACTGAAGGACCTGGAATATGGCGAGAAAACTGAAAATCATGGAAAATGAGAAATACACACTTTAGGACGTG
Read: TGCACACTGAAGGTCCTGGAATATGGCGAGAAAACTGAAAATCATGGAAA--GAGAAATACACACTTTAGGACGTG
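The lookup and ungapped-extension steps can be sketched as follows. This is a simplified Python illustration on a toy genome with a plain dictionary index, forward strand only, every seed, and no gaps; `build_index` and `seed_and_extend` are hypothetical names, and gapped extension by dynamic programming is left out.

```python
def build_index(genome: str, k: int) -> dict:
    """Index every k-mer of the reference by its start positions."""
    index = {}
    for pos in range(len(genome) - k + 1):
        index.setdefault(genome[pos:pos + k], []).append(pos)
    return index

def seed_and_extend(read: str, genome: str, index: dict, k: int, max_mismatches: int):
    """Return (start, mismatches) for each candidate placement that passes the threshold."""
    candidates = set()
    for offset in range(len(read) - k + 1):            # look up every seed in the read
        for pos in index.get(read[offset:offset + k], []):
            start = pos - offset                       # where the whole read would begin
            if 0 <= start <= len(genome) - len(read):
                candidates.add(start)
    hits = []
    for start in sorted(candidates):                   # extend: compare strings base by base
        window = genome[start:start + len(read)]
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches:
            hits.append((start, mismatches))
    return hits

genome = "TAGATTACACAGATTACGGA"                        # toy reference (assumption)
index = build_index(genome, 4)
print(seed_and_extend("TAGATTACTCAGA", genome, index, 4, max_mismatches=2))
```

The read contains one mismatch, so a whole-read exact lookup would fail, but several of its 4 bp seeds still match exactly and point to the correct placement.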

10 Alignments in Bowtie 2
@HWI-ST974:58:C059FACXX:2:1201:10589:110434 1:N:0:TGACCA
TGCACACTGAAGGACCTGGAATATGGCGAGAAAACTGAAAATCATGGAAAATGAGAAATACACACTTTAGGACGTG
Multiseed alignment (ungapped)
– Ref index: BWT, every nt, fwd only
– Read seeds: 16 nt, every 10 nt, fwd+rev
Seeds are extended (gaps allowed) to generate alignment
Ref:  TGCACACTGAAGGACCTGGAATATGGCGAGAAAACTGAAAATCATGGAAAATGAGAAATACACACTTTAGGACGTG
Read: TGCACACTGAAGGTCCTGGAATATGGCGAGAAAACTGAAAATCATGGAAA--GAGAAATACACACTTTAGGACGTG
Scoring
– Match = 2
– Mismatch = -6
– Gap = -11 for the 2 bp gap shown: -5 to open, -3 to extend by 1 bp
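The scoring values on this slide can be applied to an already-aligned pair of strings. The Python sketch below is only an arithmetic illustration using the slide's numbers as defaults; it is not Bowtie 2 code, the function name `score_alignment` is hypothetical, and it treats any run of adjacent "-" columns as one gap.

```python
def score_alignment(ref: str, read: str, match: int = 2, mismatch: int = -6,
                    gap_open: int = -5, gap_extend: int = -3) -> int:
    """Score two equal-length aligned strings in which '-' marks a gap column."""
    score, in_gap = 0, False
    for r, q in zip(ref, read):
        if r == "-" or q == "-":
            if not in_gap:
                score += gap_open        # charged once per gap
            score += gap_extend          # charged for every gapped base
            in_gap = True
        else:
            score += match if r == q else mismatch
            in_gap = False
    return score

# 2 bp gap: -5 + 2 * -3 = -11, plus three matches at +2 each
print(score_alignment("ATTAC", "A--AC"))
# 15 matches, 1 mismatch, 1 bp gap
print(score_alignment("TAGATTACACAGATTAC", "TAGATTACTCAGA-TAC"))
```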

11 Scoring alignments (adapted from Mark Gerstein)
Ungapped:
TAGATTACTCAGATTAC
|||||||| ||||||||
TAGATTACACAGATTAC
Gapped:
TAGATTACTCAGA-TAC
|||||||| |||| |||
TAGATTACACAGATTAC
Match (+1): C aligned to C
Mismatch (-1, -2, etc.): C aligned to T
Gap penalty: P = a + bN
– a = cost of opening a gap
– b = cost of extending gap by 1
– N = length of gap
Gap of length 1:
A-TAC
| |||
ATTAC
Gap of length 2:
A--AC
|  ||
ATTAC
Simpler algorithms allow a fixed number of mismatches
Complex algorithms pick best scoring alignment, stopping extension if score falls below threshold from best score so far
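The last point, stopping extension once the score falls too far below the best score so far, can be sketched for the ungapped case. This Python example uses the slide's match/mismatch values (+1, -2); the drop threshold `x_drop` and the function name `extend_right` are assumptions made for illustration.

```python
def extend_right(ref: str, read: str, match: int = 1, mismatch: int = -2,
                 x_drop: int = 3):
    """Extend an ungapped alignment left to right; return (best length, best score)."""
    score = best_score = best_len = 0
    for i, (r, q) in enumerate(zip(ref, read), start=1):
        score += match if r == q else mismatch
        if score > best_score:
            best_score, best_len = score, i       # remember the best prefix seen
        elif best_score - score > x_drop:
            break                                 # fallen too far below the best: stop
    return best_len, best_score

# A single mismatch is absorbed and extension continues to the end.
print(extend_right("TAGATTACACAGATTAC", "TAGATTACTCAGATTAC"))
# A run of mismatches triggers the stop; the alignment is cut back to the best prefix.
print(extend_right("TAGATTACGGGG", "TAGATTACCCCC"))
```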

12 Common algorithms for mapping short reads to a reference genome
Program          Website
ELAND (v2)       N/A – integrated into Illumina pipeline
Bowtie/Bowtie2   http://bowtie-bio.sourceforge.net/
BWA              http://bio-bwa.sourceforge.net/
Novoalign        http://www.novocraft.com/products/novoalign/
Considerations
– Alignment scoring method
– Speed
– Quality aware
– Seeding
– Gapped alignment
– Split read alignment

13 Split read alignments for transcriptomes and structural variations
Aligning to transcripts: Splice junctions (contiguous)
Aligning to genome: Splice junctions (split)

14 Simplifications Aligners Use
Assume the reference is complete
The best scoring match is considered the location (or locations) of the read in the genome
Speed is the priority
Difficult-to-align ends are “soft clipped”
Reads are aligned individually
– Indels may not be consistent across reads

15 QUALITY SCORES

16 Scoring using Likelihood Measures
Likelihood-based scoring methods more robust than empirically derived scoring methods
Construct a model of the error modes
– Usually, a training-based Bayesian or HMM model
Common NGS quality scores
– Basecall quality scores
– Mapping quality scores
– Variant/Genotype quality scores

17 Basecall quality scores (“Phred” scores)
A quality score (or Q-score) expresses the probability that a basecall is incorrect.
Given a basecall, A:
– The estimated probability that A is not correct is P(~A)
– The quality score for A is Q(A) = -10 log10(P(~A))
A quality score of 10 means a probability of 0.1 that A is the wrong basecall.
Quality scores are logarithmic:
Q-score   Error probability
10        0.1
20        0.01
40        0.0001
P(~A) is platform-specific; Q-scores can be compared across platforms.
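The Q-score definition on this slide converts in both directions with one line each. A small Python sketch (the function names are illustrative):

```python
import math

def error_to_phred(p_error: float) -> float:
    """Q = -10 * log10(P(~A))."""
    return -10 * math.log10(p_error)

def phred_to_error(q: float) -> float:
    """Inverse of the definition: P(~A) = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)

for q in (10, 20, 40):
    print(q, phred_to_error(q))
```

Each additional 10 points of Q corresponds to a tenfold lower error probability.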

18 Sequencing by synthesis with reversible dye terminators
1 cycle: add base, scan flow cell, reverse termination; then add next base, etc.
Errors in Illumina sequencing reads
– Wrong “color” called when scanning image
– Wrong nucleotide is added
– Terminator not removed for a cycle
– Multiple bases added (terminator doesn’t work for a base)

19 Errors in Illumina sequencing reads
Errors are mainly mismatches (substitutions)
Error rates increase with increasing cycle number

20 Errors in single-molecule sequencing
PacBio: Fluorescent molecules, attached to each nucleotide, captured by direct observation of “zero-mode waveguide” during incorporation by polymerase
TAGATTA-ACAG-TT-C
||||||| |||| || |
TAGATTACACAGATTAC
Incorporation happens too quickly to be observed
Nucleotide sits in polymerase “pocket” (and is detected), but is not incorporated
Wrong nucleotide incorporated
Errors are mainly insertions and deletions

21 Quality Score Methodology
Training-based calibration of scores
– Identify key “features” of a platform’s sequencing signals
  Signal intensity/frequency/duration, base position, ...
  Phred method: must be continuous value range, “monotonic” to error rate
– Train using set of runs from known samples and genomes
– Correlate errors to feature value ranges
  Phred method: bin combinations of value ranges, then use error counts to set score
– Must recalibrate with changes in instrument or reagent kits
Simplifications
– The training set characterizes the range of genomes/transcriptomes/...
– Library/Sequencing prep PCR errors not captured by sequencing measurements

22 GATK Recalibration https://www.broadinstitute.org/gatk/events/2038/GATKwh0-BP-3-Base_recalibration.pdf

23 Quality score encoding in FASTQ format
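The figure for this slide is not preserved in the transcript. As a minimal sketch, assuming the widely used Phred+33 ASCII encoding (each quality character's code point minus 33 gives the Q-score), a FASTQ quality string can be decoded in Python as below. Older Illumina pipelines used an offset of 64, so the `offset` parameter is left adjustable; the function name is illustrative.

```python
def decode_quality(quality_string: str, offset: int = 33) -> list:
    """Convert a FASTQ quality string to a list of integer Q-scores."""
    return [ord(char) - offset for char in quality_string]

print(decode_quality("!+5I"))   # characters with code points 33, 43, 53, 73
```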

24 MAPPING QUALITY

25 Mappability
The genome contains non-unique sequences (repeats, segmental duplications)
Short reads derived from repetitive regions are difficult to map
[Figure: a repeat shared by Chr3 and Chr7, with panels “Longer reads” and “Paired reads”]

26 Mappability scores at UCSC
The genome contains non-unique sequences (repeats, segmental duplications)
Short reads derived from repetitive regions are difficult to map
[Tracks: 36mers, 2 mismatches; 75mers, 2 mismatches; 100mers, 2 mismatches]
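The effect of read length on mappability can be illustrated on a toy genome by asking what fraction of k-mers occur exactly once. This Python sketch is a simplification: it counts exact matches only, whereas the tracks on the slide allow 2 mismatches, and the genome and the name `unique_fraction` are made up for the example.

```python
from collections import Counter

def unique_fraction(genome: str, k: int) -> float:
    """Fraction of k-mer start positions whose k-mer occurs only once in the genome."""
    kmers = [genome[i:i + k] for i in range(len(genome) - k + 1)]
    counts = Counter(kmers)
    return sum(1 for kmer in kmers if counts[kmer] == 1) / len(kmers)

genome = "TAGATTACACAGATTACGGA"     # toy reference containing a 7 bp repeat
for k in (4, 8):
    print(k, unique_fraction(genome, k))
```

Once k exceeds the repeat length, every position becomes uniquely identifiable, which mirrors the improvement from 36mers to 100mers on the slide.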

27 Poorly mappable regions of the genome
[Tracks: 36mers, 2 mismatches; 75mers, 2 mismatches; 100mers, 2 mismatches]

28 Mapping quality score
Base quality values and mismatch positions in a candidate alignment are used to assign a p value
P values reflect probability that candidate position in genome would give rise to the observed read if its bases were sequenced at error rates corresponding to the read’s quality values
Mapping quality score for a read is computed from p values of all candidate alignments
If there are two candidates for a read with p values 0.9 and 0.3:
– 0.9/(0.9+0.3) = 0.75, chance highest scoring alignment is correct
– 1 - 0.75 = 0.25, chance highest scoring alignment is wrong
– Mapping quality score = -10 log10(0.25) ≈ 6
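The worked example on this slide generalizes to any number of candidate alignments. A short Python sketch of that arithmetic follows; the cap applied when there is only one candidate (`max_score`) is an assumption added to avoid taking the log of zero, and real aligners each use their own formula and scale.

```python
import math

def mapping_quality(p_values: list, max_score: float = 60.0) -> float:
    """Phred-scaled probability that the best candidate alignment is wrong."""
    p_correct = max(p_values) / sum(p_values)
    p_wrong = 1.0 - p_correct
    if p_wrong <= 0.0:
        return max_score                  # hypothetical cap for a unique candidate
    return min(max_score, -10 * math.log10(p_wrong))

print(round(mapping_quality([0.9, 0.3])))   # the slide's example
print(mapping_quality([0.9]))               # single candidate: capped
```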

29 COUNTING AND FREQUENCIES

30 Counting Reads, Converting to Frequencies
NGS reads are all “clonally amplified”
– Start with a single molecule
– Either sequence it directly, or amplify it into a cluster/spot/well
Post-alignment analysis begins with read counting
– Variant vs Non-variant
– Transcript
– ChIP-seq peaks
Read counts (depth) converted into frequencies
– Normalizes across datapoints to permit comparisons
– “Alt frequency”, FPKM/RPKM, fold change
Filtering applied to produce call set
– Distinguish significant events from random chance or sequencing error
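Two of the count-to-frequency conversions named on this slide are simple ratios. The Python sketch below uses the standard definitions (alt frequency as the share of reads supporting the variant; RPKM as reads per kilobase of feature per million mapped reads). The function names and the example numbers are illustrative only.

```python
def alt_frequency(ref_reads: int, alt_reads: int) -> float:
    """Fraction of reads at a site that support the alternate allele."""
    depth = ref_reads + alt_reads
    return alt_reads / depth if depth else 0.0

def rpkm(reads_in_feature: int, feature_length_bp: int, total_mapped_reads: int) -> float:
    """Reads per kilobase of feature per million mapped reads."""
    return reads_in_feature * 1e9 / (feature_length_bp * total_mapped_reads)

print(alt_frequency(ref_reads=30, alt_reads=10))
print(rpkm(reads_in_feature=500, feature_length_bp=2000, total_mapped_reads=10_000_000))
```

Dividing by feature length and by library size is what allows counts from different transcripts and different samples to be compared.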

31 Counting Reads, Converting to Frequencies
Confounders
– Sampling variation (across genome/transcriptome)
– PCR duplicates
– Non-random sequencing error
– GC-bias
– Strand bias
– Amplification bias
– Alignment bias
– Sample purity
– Transcript length

32 Variant Calling Quality Scores
At each location where reads differ from the reference, compute likelihood values
– Difference from the reference
– Each genotype (0/0, 0/1, 1/1) for diploid genomes
– Models either pre-trained or trained on the reads using a “truth set” of known variants
Variant quality score
– Convert likelihood into phred-scaled score
Genotype quality score
– Most likely genotype minus second most likely genotype
– Phred-scaled score
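The genotype quality idea (most likely genotype compared with the second most likely, on a phred scale) can be illustrated with a deliberately simple model. This Python sketch assumes independent reads and a single fixed error rate, which is far cruder than the trained models the slide refers to; `genotype_quality` and the `error_rate` default are hypothetical.

```python
import math

def genotype_quality(ref_reads: int, alt_reads: int, error_rate: float = 0.01):
    """Return (most likely genotype, phred-scaled gap to the runner-up)."""
    # Probability that a single read shows the alt allele under each genotype.
    p_alt = {"0/0": error_rate, "0/1": 0.5, "1/1": 1.0 - error_rate}
    log_likelihood = {
        genotype: alt_reads * math.log10(p) + ref_reads * math.log10(1.0 - p)
        for genotype, p in p_alt.items()
    }
    ranked = sorted(log_likelihood, key=log_likelihood.get, reverse=True)
    best, second = ranked[0], ranked[1]
    quality = 10 * (log_likelihood[best] - log_likelihood[second])
    return best, quality

print(genotype_quality(ref_reads=10, alt_reads=10))   # balanced counts favor 0/1
print(genotype_quality(ref_reads=20, alt_reads=0))    # no alt reads favor 0/0
```

The binomial coefficient is omitted because it is the same for all three genotypes and cancels in the difference.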

33 Conclusion
First alignment, then counting, and then comes the analysis
Alignment and counting are computation-focused
– Pipelines trade off speed vs. accuracy
Quality/confidence measures calculated and used in the analysis
– Simple likelihoods or trained models?
Your choices depend on your objectives/budgets

