CS 6293 Advanced Topics: Translational Bioinformatics

CS 6293 Advanced Topics: Translational Bioinformatics
Next-generation sequencing - Mapping short reads

Short read mapping Input: Output:
A reference genome A collection of many bp tags (reads) User-specified parameters Output: One or more genomic coordinates for each tag In practice, only 70-75% of tags successfully map to the reference genome. Why?

Multiple mapping A single tag may occur more than once in the reference genome. The user may choose to ignore tags that appear more than n times. As n gets large, you get more data, but also more noise in the data.

Inexact matching ? An observed tag may not exactly match any position in the reference genome. Sometimes, the tag almost matches one or more positions. Such mismatches may represent a SNP (single-nucleotide polymorphism, see wikipedia) or a bad read-out. The user can specify the maximum number of mismatches, or a phred-style quality score threshold. As the number of allowed mismatches goes up, the number of mapped tags increases, but so does the number of incorrectly mapped tags.

Read Length is Not As Important For Resequencing
Jay Shendure

Mapping Reads Back Hash Table (Lookup table) Array Scanning
FAST, but requires perfect matches. [O(m n + N)] Array Scanning Can handle mismatches, but not gaps. [O(m N)] Dynamic Programming (Smith Waterman) Indels Mathematically optimal solution Slow (most programs use Hash Mapping as a prefilter) [O(mnN)] Burrows-Wheeler Transform (BW Transform) FAST. [O(m + N)] (without mismatch/gap) Memory efficient. But for gaps/mismatches, it lacks sensitivity

Spaced seed alignment Tags and tag-sized pieces of reference are cut into small “seeds.” Pairs of spaced seeds are stored in an index. Look up spaced seeds for each tag. For each “hit,” confirm the remaining positions. Report results to the user.

Burrows-Wheeler Store entire reference genome.
Align tag base by base from the end. When tag is traversed, all active locations are reported. If no match is found, then back up and try a substitution.

Why Burrows-Wheeler? BWT very compact: Linear-time search algorithm
Approximately ½ byte per base As large as the original text, plus a few “extras” Can fit onto a standard computer with 2GB of memory Linear-time search algorithm proportional to length of query for exact matches

Burrows-Wheeler Transform (BWT)
$acaacg aacg$ac acaacg$ acg$aca caacg$a cg$acaa g$acaac acaacg$ gc$aaac Burrows-Wheeler Matrix (BWM)

Burrows-Wheeler Matrix
$acaacg aacg$ac acaacg$ acg$aca caacg$a cg$acaa g$acaac

Burrows-Wheeler Matrix
3 1 4 2 5 6 $acaacg aacg$ac acaacg$ acg$aca caacg$a cg$acaa g$acaac See the suffix array?

Key observation 1$acaacg1 2aacg$ac1 1acaacg$1 3acg$aca2 1caacg$a1
2cg$acaa3 1g$acaac2 a1c1a2a3c2g1$1 “last first (LF) mapping” The i-th occurrence of character X in the last column corresponds to the same text character as the i-th occurrence of X in the first column.

Recover text 5 6 4 3 2 1 6

Exact match 3 1 4 2 5 6

Exact match (another example)
BWT(agcagcagact) = tgcc$ggaaaac Search for pattern: gca gca gca gca gca $agcagcagact act$agcagcag agact$agcagc agcagact$agc agcagcagact$ cagact$agcag cagcagact$ag ct$agcagcaga gact$agcagca gcagact$agca gcagcagact$a t$agcagcagac $agcagcagact act$agcagcag agact$agcagc agcagact$agc agcagcagact$ cagact$agcag cagcagact$ag ct$agcagcaga gact$agcagca gcagact$agca gcagcagact$a t$agcagcagac $agcagcagact act$agcagcag agact$agcagc agcagact$agc agcagcagact$ cagact$agcag cagcagact$ag ct$agcagcaga gact$agcagca gcagact$agca gcagcagact$a t$agcagcagac $agcagcagact act$agcagcag agact$agcagc agcagact$agc agcagcagact$ cagact$agcag cagcagact$ag ct$agcagcaga gact$agcagca gcagact$agca gcagcagact$a t$agcagcagac Test with your own seq and pattern at:

Auxiliary data structures
Key for efficient pattern matching: how to find the corresponding chars in the first column efficiently, in terms of both time and space. a c g T rank 1 5 8 11 BWT 1 2 3 4 5 6 7 8 9 10 11 SA $agcagcagact act$agcagcag agact$agcagc agcagact$agc agcagcagact$ cagact$agcag cagcagact$ag ct$agcagcaga gact$agcagca gcagact$agca gcagcagact$a t$agcagcagac t g c $ a a c g t 1 2 3 4 9 7 4 1 6 3 10 8 5 2 11 FM indices

Key for efficient pattern matching: how to find the corresponding chars in the first column efficiently, in terms of both time and space. a c g t rank 1 5 8 11 BWT gca 1 2 3 4 5 6 7 8 9 10 11 SA $agcagcagact act$agcagcag agact$agcagc agcagact$agc agcagcagact$ cagact$agcag cagcagact$ag ct$agcagcaga gact$agcagca gcagact$agca gcagcagact$a t$agcagcagac t g c $ a a c g t 1 2 3 4 9 7 4 1 6 3 10 8 5 2 11 FM indices Next block: From = 1 to 1 + (4-1) = 4

Key for efficient pattern matching: how to find the corresponding chars in the first column efficiently, in terms of both time and space. a c g T rank 1 5 8 11 BWT gca 1 2 3 4 5 6 7 8 9 10 11 SA $agcagcagact act$agcagcag agact$agcagc agcagact$agc agcagcagact$ cagact$agcag cagcagact$ag ct$agcagcaga gact$agcagca gcagact$agca gcagcagact$a t$agcagcagac t g c $ a a c g t 1 2 3 4 9 7 4 1 6 3 10 8 5 2 11 FM indices Next block: From = 5 to 5 + (2-1) = 6

Key for efficient pattern matching: how to find the corresponding chars in the first column efficiently, in terms of both time and space. a c g T rank 1 5 8 11 BWT gca 1 2 3 4 5 6 7 8 9 10 11 SA $agcagcagact act$agcagcag agact$agcagc agcagact$agc agcagcagact$ cagact$agcag cagcagact$ag ct$agcagcaga gact$agcagca gcagact$agca gcagcagact$a t$agcagcagac t g c $ a a c g t 1 2 3 4 9 7 4 1 6 3 10 8 5 2 11 FM indices Next block: From = 9 to 8 + (3-1) = 10

Inexact match

Main advantage of BWT against suffix array
BWT needs less memory than suffix array For human genome m = 3 * 109 : Suffix array: mlog2(m) bits = 4m bytes = 12GB BWT: m/4 bytes plus extras = GB m/4 bytes to store BWT (2 bits per char) Suffix array and occurrence counts array take 5 m log2 m bits = 20 n bytes In practice, SA and OCC only partially stored, most elements are computed on demand (takes time!) Tradeoff between time and space

Comparison Spaced seeds Requires ~50Gb of memory. Runs 30-fold slower.
Is much simpler to program. MAQ Burrows-Wheeler Requires <2Gb of memory. Runs 30-fold faster. Is much more complicated to program. Bowtie

Short-read mapping software
Software Technique Developer License Eland Hashing reads Illumnia ? SOAP Hashing refs BGI Academic Maq Sanger (Li, Heng) GNUPL Bowtie BWT Salzberg/UMD BWA SOAP2 BWT & hashing

References (Bowtie) Ultrafast and memory-efficient alignment of short DNA sequences to the human genome, Langmead et al, Genome Biology 2009, 10:R25 SOAP: short oligonucleotide alignment, Ruiqiang Li et al. Bioinformatics (2008) 24: 713-4 (BWA) Fast and Accurate Short Read Alignment with Burrows-Wheeler Transform, Li Heng and Richard Durbin, (2009) 25:1754–1760 SOAP2: an improved ultrafast tool for short read alignment, Ruiqiang Li, (2009) 25: 1966–1967 (MAQ) Mapping short DNA sequencing reads and calling variants using mapping quality scores. Li H, Ruan J, Durbin R. Genome Res. (2008) 18: Sense from sequence reads: methods for alignment and assembly, Paul Flicek & Ewan Birney, Nature Methods 6, S6 - S12 (2009)

CS 6293 Advanced Topics: Translational Bioinformatics

Similar presentations

Presentation on theme: "CS 6293 Advanced Topics: Translational Bioinformatics"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

CS 6293 Advanced Topics: Translational Bioinformatics

Similar presentations

Presentation on theme: "CS 6293 Advanced Topics: Translational Bioinformatics"— Presentation transcript:

Similar presentations

About project

Feedback