Introduction to Short Read Sequencing Analysis


1 Introduction to Short Read Sequencing Analysis
Jim Noonan GENE 760

2 Sequence read lengths remain limiting
Figure: Chr1 (249 Mb) compared with a single sequencing read
Current platforms:
A moderate number (~500,000) of long reads (~10 kb)
A very large number (>200 M) of short reads (100 bp)
For most applications reads are aligned to a reference genome
Short reads contain inherently limited information
De novo assembly of short reads is difficult

3 Determining the identity and location of short sequence reads in the genome/exome/transcriptome
@HWI-ST974:58:C059FACXX:2:1201:10589: :N:0:TGACCA
TGCACACTGAAGGACCTGGAATATGGCGAGAAAACTGAAAATCATGGAAAATGAGAAATACACACTTTAGGACGTG
Aligning short reads to a much larger reference
Need a computationally efficient method to perform accurate alignments of millions of reads

4 Read length requirements vary depending on the feature being studied
Exome: bp
Splice junctions (connectivity)
Transcriptome: 10,000 bp

5 Determining the identity and location of short sequence reads in the genome/exome/transcriptome
@HWI-ST974:58:C059FACXX:2:1201:10589: :N:0:TGACCA
TGCACACTGAAGGACCTGGAATATGGCGAGAAAACTGAAAATCATGGAAAATGAGAAATACACACTTTAGGACGTG
Aligning short reads to a much larger reference: exome or genome, or transcriptome
Considerations:
Alignment scoring
Source of the reads
Sequencing format (PE or SE)
Read length
Error rates

6 Topics
Scoring alignments
Error rates and quality scores for short read sequencing
Mapability
Common algorithms for short read sequence alignment
Scoring short read sequence alignments
Uniform data output formats

7 Scoring alignments
Match (+1)
Mismatch (-1, -2, etc.)
Correct:
TAGATTACACAGATTAC
|||||||||||||||||
TAGATTACACAGATTAC
Wrong:
TAGATTACTCAGA-TAC
|||||||| |||| |||
TAGATTACACAGATTAC
Gap penalty: P = a + bN
a = cost of opening a gap
b = cost of extending gap by 1
N = length of gap
Gap examples against ATTAC: A-TAC (N = 1), A--AC (N = 2)
Many short read alignment algorithms allow a fixed number of mismatches
Adapted from Mark Gerstein

8 Scoring alignments
Match (+1)
Mismatch (-1, -2, etc.)
Correct (polymorphism):
TAGATTACTCAGATTAC
|||||||| ||||||||
TAGATTACACAGATTAC
Wrong:
TAGATTACTCAGA-TAC
|||||||| |||| |||
TAGATTACACAGATTAC
Gap penalty: P = a + bN
a = cost of opening a gap
b = cost of extending gap by 1
N = length of gap
Gap examples against ATTAC: A-TAC (N = 1), A--AC (N = 2)
Many short read alignment algorithms allow a fixed number of mismatches
Adapted from Mark Gerstein
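
The scoring scheme on these two slides can be sketched in a few lines of Python. This is only an illustration: the function name score_alignment is hypothetical, and the gap costs a = 2 and b = 1 are assumed values, since the slides give the formula P = a + bN but no numbers for a and b.

```python
def score_alignment(read, ref, match=1, mismatch=-1, gap_open=2, gap_extend=1):
    """Score two pre-aligned, equal-length strings in which '-' marks a gap.

    A gap of length N costs gap_open + gap_extend * N (P = a + bN).
    gap_open=2 and gap_extend=1 are assumed values, not from the slides.
    """
    if len(read) != len(ref):
        raise ValueError("aligned strings must have the same length")
    score = 0
    in_gap = False
    for r, g in zip(read, ref):
        if r == "-" or g == "-":
            # The first gap column pays the opening cost as well as one extension.
            score -= gap_extend + (0 if in_gap else gap_open)
            in_gap = True
        else:
            in_gap = False
            score += match if r == g else mismatch
    return score


reference = "TAGATTACACAGATTAC"
print(score_alignment("TAGATTACTCAGATTAC", reference))  # one mismatch, no gap
print(score_alignment("TAGATTACTCAGA-TAC", reference))  # one mismatch, one gap
```

Under these assumed costs the ungapped alignment with a single mismatch outscores the gapped one, which matches the slides' labelling of the gapped alignment as the wrong one.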

9 Quality scores
A quality score (or Q-score) expresses the probability that a basecall is incorrect.
Given a basecall, A:
The estimated probability that A is not correct is P(~A);
The quality score for A is Q(A) = -10 log10(P(~A))
A quality score of 10 means a probability of 0.1 that A is the wrong basecall.
The estimate of P(~A) is platform-specific, but Q-scores can be compared across platforms.
Quality scores are logarithmic:
Q-score    Error probability
10         0.1
20         0.01
40         0.0001
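
The relationship between Q-score and error probability can be checked directly. A minimal sketch follows; the function names are illustrative and not taken from any library.

```python
import math


def q_score(p_error):
    """Q = -10 * log10(P(~A)), where p_error is the probability the call is wrong."""
    return -10 * math.log10(p_error)


def error_probability(q):
    """Inverse relationship: P(~A) = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


# Reproduce the table on the slide.
for q in (10, 20, 40):
    print(q, error_probability(q))
```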

10 Error rates in Illumina sequencing reads
Sequencing by synthesis with reversible dye terminators
1 cycle: Add base, Scan flow cell, Reverse termination, Add next base, etc.
Individual synthesis reactions go out of phase

11 Error rates in Illumina sequencing reads
Error rates are mismatch rates relative to reference genome
Error rates increase with increasing cycle number
Contingent on reference genome quality
Reads may be trimmed to improve alignment quality
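
Because error rates rise with cycle number, the low-quality bases tend to sit at the 3' end of a read. The sketch below shows one simple way trimming could work, assuming a fixed Q-score cutoff. The function name trim_3prime and the cutoff of 20 are assumptions for illustration; this is not the algorithm of any particular trimming tool.

```python
def trim_3prime(seq, quals, min_q=20):
    """Drop trailing bases whose Q-score is below min_q.

    seq is the read sequence and quals the per-base Q-scores (same length).
    min_q=20 is an assumed cutoff chosen for illustration.
    """
    end = len(seq)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end], quals[:end]


read = "ACGTACGT"
qualities = [38, 37, 35, 30, 25, 18, 12, 2]  # quality decays along the read
print(trim_3prime(read, qualities))
```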

12 Illumina quality score encoding in FASTQ format (CASAVA v1.8)
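
In FASTQ files written by CASAVA v1.8, each quality character is the Q-score plus 33, stored as an ASCII character (the Phred+33 convention). A minimal decoding sketch follows; the function name is illustrative, and the offset would need to change for older Illumina pipelines that used a different encoding.

```python
def decode_qualities(quality_string, offset=33):
    """Convert a FASTQ quality string to a list of integer Q-scores.

    offset=33 is the Phred+33 encoding; other pipeline versions may differ.
    """
    return [ord(ch) - offset for ch in quality_string]


print(decode_qualities("!+5I"))
```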

13 Illumina basecalling and data analysis pipeline
summary report

14 Sources of error in single-molecule sequencing
Illumina: Consensus signal
TAGATTACACAGATTAC
|||||||||||||||||
TAGATTACACAGATTAC
PacBio: Single molecule screening - gaps
TAGATTA-ACAG-TT-C
||||||| |||| || |
TAGATTACACAGATTAC
One molecule, one read
Sequence templates multiple times

15 Mapability
The genome contains non-unique sequences (repeats, segmental duplications)
Short reads derived from repetitive regions are difficult to map
Figure labels: Chr3, Chr7, repeat; Longer reads; Paired reads

16 Mapability scores at UCSC
The genome contains non-unique sequences (repeats, segmental duplications)
Short reads derived from repetitive regions are difficult to map
Tracks: 36mers, 2 mismatches; 75mers, 2 mismatches; 100mers, 2 mismatches

17 Poorly mappable regions of the genome
Tracks: 36mers, 2 mismatches; 75mers, 2 mismatches; 100mers, 2 mismatches

18 Common algorithms for mapping short reads to a reference genome
Program       Website
ELAND (v2)    N/A (integrated into Illumina pipeline)
Bowtie
BWA
Maq
SOAP2
Mosaik
Novoalign
Considerations:
Alignment scoring method
Speed
Quality aware
Seeding
Gapped alignment

19 Seed-based alignment strategy
Single seed alignments
Critical values are seed length and number of mismatches allowed
In ELAND: seed length = 32, number of mismatches = 2
Multiseed alignments (ELAND v2, others)
Seed interval contingent on read length
Figure labels: Reference, Seed

20 Spaced-seed indexing of the reference genome
Need to break up the genome into manageable segments
Create index of short sequences
Match seeds against genome index
Trapnell and Salzberg, Nat Biotechnology 27:455 (2009)
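
The index-then-match idea can be sketched with a plain hash table. This is a deliberate simplification that uses contiguous, non-overlapping seeds rather than true spaced seeds. The function names, the toy genome, and the seed length of 4 are all assumptions for illustration; real aligners use much longer seeds and far more compact indexes.

```python
from collections import defaultdict


def build_index(genome, k):
    """Map every k-mer in the genome to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(genome) - k + 1):
        index[genome[i:i + k]].append(i)
    return index


def map_read(read, genome, index, k, max_mismatches=2):
    """Look up non-overlapping seeds from the read, then verify each candidate.

    Returns a sorted list of (genome_start, mismatch_count) for ungapped
    placements of the whole read with at most max_mismatches mismatches.
    """
    hits = {}
    for offset in range(0, len(read) - k + 1, k):
        for pos in index.get(read[offset:offset + k], []):
            start = pos - offset  # where the full read would begin
            if start < 0 or start + len(read) > len(genome) or start in hits:
                continue
            window = genome[start:start + len(read)]
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= max_mismatches:
                hits[start] = mismatches
    return sorted(hits.items())


genome = "CCGTTAGATTACACAGATTACGGA"  # toy reference, not real sequence data
index = build_index(genome, 4)
print(map_read("GATTACTCAG", genome, index, 4))
```

Only one seed has to match exactly for a candidate position to be found; the full read is then compared at that position and kept if it stays within the mismatch limit.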

21 Implementation in ELAND v2
A read must have at least one seed with no more than 2 mismatches and no gaps
Gapped alignment: extend each alignment to full length of read, allowing gaps up to 10 bp

22 Resolving ambiguous read alignments with multiple seeds
Reference Seed

23 Resolving ambiguous read alignments with multiple seeds

24 Utility of gapped alignments
RNA-seq
Insertion and deletion variants in exome and whole genome sequencing

25 Mapping paired end reads
Figure labels: Insert size; Insert size within specified range

26 ELAND alignment scoring
Base quality values and mismatch positions in a candidate alignment are used to assign a p value
The p value reflects the probability that the candidate position in the genome would give rise to the observed read if its bases were sequenced at error rates corresponding to the read’s quality values
The alignment score for a read is computed from the p values of all candidate alignments
If there are two candidates for a read with p values 0.9 and 0.3:
0.9/(0.9 + 0.3) = 0.75, chance highest scoring alignment is correct
1 - 0.75 = 0.25, chance highest scoring alignment is wrong
Alignment score = -10 log10(0.25) ≈ 6
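
The worked example on this slide can be reproduced as follows. This sketch covers only the arithmetic shown on the slide (normalise the best p value by the sum over all candidates, then take -10 log10 of the chance of being wrong). The function name is hypothetical, and ELAND's actual implementation may differ in detail.

```python
import math


def alignment_score(p_values):
    """Score the best candidate from the p values of all candidate alignments."""
    p_correct = max(p_values) / sum(p_values)
    p_wrong = 1 - p_correct
    if p_wrong <= 0:
        return float("inf")  # a single candidate: no competing alignment
    return -10 * math.log10(p_wrong)


# Two candidates with p values 0.9 and 0.3, as on the slide.
print(round(alignment_score([0.9, 0.3])))
```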

27 Reference genome indexing using Burrows-Wheeler transform
Need to break up the genome into manageable segments
Transform genome into index that can be searched very rapidly
Match seeds against index
Trapnell and Salzberg, Nat Biotechnology 27:455 (2009)
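
To make the transform-then-search idea concrete, here is a toy Burrows-Wheeler transform with exact-match backward search. It is a teaching sketch under simplifying assumptions: the transform is built naively by sorting all rotations and ranks are recomputed by counting, which would be far too slow and memory-hungry for a real genome. Production aligners use compressed index structures instead. Function names are illustrative.

```python
def bwt(text):
    """Burrows-Wheeler transform via sorted rotations ('$' marks the end)."""
    text = text + "$"
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def count_occurrences(bwt_string, pattern):
    """Count exact occurrences of pattern using backward search on the BWT."""
    # first[c] = number of characters in the text that sort before c
    counts = {}
    for ch in bwt_string:
        counts[ch] = counts.get(ch, 0) + 1
    first, total = {}, 0
    for ch in sorted(counts):
        first[ch] = total
        total += counts[ch]

    def rank(ch, i):
        # occurrences of ch in bwt_string[:i]; recomputed naively each time
        return bwt_string[:i].count(ch)

    lo, hi = 0, len(bwt_string)  # current range of matching sorted rotations
    for ch in reversed(pattern):
        if ch not in first:
            return 0
        lo = first[ch] + rank(ch, lo)
        hi = first[ch] + rank(ch, hi)
        if lo >= hi:
            return 0
    return hi - lo


transformed = bwt("TAGATTACACAGATTAC")
print(count_occurrences(transformed, "GATTAC"))
```

The search processes the pattern from its last character to its first, narrowing a range of sorted rotations at each step, so the cost depends on the pattern length rather than on scanning the whole reference.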

28 Bowtie

29 Scoring alignments: Bowtie 2
@HWI-ST974:58:C059FACXX:2:1201:10589: :N:0:TGACCA
TGCACACTGAAGGACCTGGAATATGGCGAGAAAACTGAAAATCATGGAAAATGAGAAATACACACTTTAGGACGTG
Multiseed alignment
Seed length: 20
# mismatches: 0
Ref:  TGCACACTGAAGGACCTGGAATATGGCGAGAAAACTGAAAATCATGGAAAATGAGAAATACACACTTTAGGACGTG
Read: TGCACACTGAAGGTCCTGGAATATGGCGAGAAAACTGAAAATCATGGAAA--GAGAAATACACACTTTAGGACGTG
Mismatch = -6
Gap = -11 (-5 to open, -3 to extend by 1 bp)
Total score = -17
Best possible score is 0 (perfect match to reference)
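
The total of -17 on this slide can be re-derived from the two aligned sequences. The sketch below uses the penalties stated on the slide (mismatch -6, gap open -5, gap extend -3 per bp, no bonus for a match). It assumes a fixed mismatch penalty regardless of base quality, and the function name is hypothetical rather than part of Bowtie 2.

```python
def penalty_score(ref, read, mismatch=6, gap_open=5, gap_extend=3):
    """Sum penalties over two pre-aligned strings; a perfect match scores 0."""
    if len(ref) != len(read):
        raise ValueError("aligned strings must have the same length")
    score = 0
    in_gap = False
    for a, b in zip(ref, read):
        if a == "-" or b == "-":
            if not in_gap:
                score -= gap_open  # paid once per gap
            score -= gap_extend    # paid for every gapped position
            in_gap = True
        else:
            in_gap = False
            if a != b:
                score -= mismatch
    return score


ref = "TGCACACTGAAGGACCTGGAATATGGCGAGAAAACTGAAAATCATGGAAAATGAGAAATACACACTTTAGGACGTG"
read = "TGCACACTGAAGGTCCTGGAATATGGCGAGAAAACTGAAAATCATGGAAA--GAGAAATACACACTTTAGGACGTG"
print(penalty_score(ref, read))
```

The one mismatch contributes -6 and the 2 bp gap contributes -5 + 2 x -3 = -11, giving the slide's total.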

30 Mapping in highly repetitive regions
ELAND is conservative
Non-unique alignments are flagged; only one is reported in export.txt
Post-alignment CASAVA analyses ignore these
Bowtie will report non-unique alignments
User-specified options determine how these are reported

31 Sequence Alignment/Map (SAM) format
Standard format for reporting short read alignment data
BAM is compressed version
Sections: Header, Alignment info
Description posted on class wiki
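
Each alignment line in a SAM file is tab-separated, with 11 mandatory fields followed by optional tags. The sketch below splits one such line into named fields. The example record is made up for illustration, and the helper names are hypothetical. For real work a dedicated SAM/BAM library is the safer choice.

```python
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
INT_FIELDS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}


def parse_sam_line(line):
    """Split one SAM alignment line into a dict of the 11 mandatory fields."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) < len(SAM_FIELDS):
        raise ValueError("expected at least 11 tab-separated fields")
    record = {}
    for name, value in zip(SAM_FIELDS, columns):
        record[name] = int(value) if name in INT_FIELDS else value
    record["TAGS"] = columns[len(SAM_FIELDS):]  # optional fields, left unparsed
    return record


def is_unmapped(record):
    """Bit 0x4 of FLAG marks a read that did not align."""
    return bool(record["FLAG"] & 0x4)


# A made-up record for illustration only.
example = "read1\t0\tchr1\t100\t42\t10M\t*\t0\t0\tGATTACACAG\tIIIIIIIIII\tNM:i:0"
record = parse_sam_line(example)
print(record["RNAME"], record["POS"], record["MAPQ"], is_unmapped(record))
```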

32 Summary
Read the material posted for this lecture on the class wiki
Next week: first Regulomics lecture

