November 2018 Deep Sequencing Seminar Avia Efrat, Tomer Ronen

Minimizers: Reducing Storage Requirements for Biological Sequence Comparison
M. Roberts et al., 2004

Previously On “Algorithms for Deep Sequencing”
Last week: find identical parts between documents, and reduce the storage the algorithm needs.
Similarities to last week: we again compare documents, and again trade a large storage reduction for a small performance hit.
Differences from last week: the domain is biology, so our “documents” are sequences of nucleic acids (ACGT) and proteins; there are different options for choosing the “representative k-mers”; and we look a bit at similarity, not just identical strings.

Problem
Given two long sequences, find their common substrings (and their locations).
“Common” means not necessarily identical, but “similar”. The similarity is a function of the differences between the two strings, and can be defined by the user.
[Figure: example ACGT strings labeled “Not similar” and “Similar enough”.]

Motivation DNA assembly - find common patterns at the ends of DNA parts

Motivation Homology - similar parts may suggest common ancestry or similar “function”.

k-mers k-mer: a substring of length k
In a string of length L there are L-k+1 k-mers.
Intuition: each of the first L-k positions starts one k-mer, and the last k positions leave room for exactly one more.
[Figure: an example string over positions 1 to 5 with its k-mers marked.]
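The count L-k+1 can be checked with a minimal sketch. The function name `kmers` and the 5-letter example string are assumptions made for illustration, not taken from the slides.

```python
def kmers(s: str, k: int) -> list[str]:
    """Return every length-k substring of s, ordered by starting position."""
    return [s[i:i + k] for i in range(len(s) - k + 1)]

example = "ACGGT"                # hypothetical string, L = 5
three_mers = kmers(example, 3)   # k = 3
print(three_mers)                # ['ACG', 'CGG', 'GGT']
print(len(three_mers) == len(example) - 3 + 1)  # True: L-k+1 k-mers
```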

“Seed and Extend”
We want to find common substrings between two strings. (What about more than two strings? And similarity? Later!)
We take k-mers from both strings; these are the “seeds”. (How do we choose these seeds? Good question!)
Each seed is represented by a 3-tuple <s, i, p>:
s - the letters of the k-mer
i - the index of the string (in our case, ‘1’ or ‘2’)
p - the starting position of the k-mer in the string

Finding Common K-mers Sort your list of k-mers (your “seeds”).
Now identical seeds are one after another, making it easy to find the corresponding strings and try to extend the seed matches. The ability to recognize matches as soon as the database is sorted is called the “collection criterion”.
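A minimal sketch of the sort-then-scan step, assuming every k-mer is stored as a seed and positions are 0-based. The helper names `seeds` and `common_seeds` are hypothetical.

```python
from itertools import groupby

def seeds(s: str, index: int, k: int) -> list[tuple[str, int, int]]:
    """All <s, i, p> seeds of one string: k-mer letters, string index, start."""
    return [(s[p:p + k], index, p) for p in range(len(s) - k + 1)]

def common_seeds(s1: str, s2: str, k: int) -> list[tuple[str, int, int]]:
    """Sort the seeds of both strings, then pair up identical neighbors."""
    all_seeds = sorted(seeds(s1, 1, k) + seeds(s2, 2, k))
    matches = []
    # After sorting, identical k-mers are adjacent, so one pass finds them.
    for kmer, group in groupby(all_seeds, key=lambda seed: seed[0]):
        group = list(group)
        pos1 = [p for _, i, p in group if i == 1]
        pos2 = [p for _, i, p in group if i == 2]
        matches.extend((kmer, p1, p2) for p1 in pos1 for p2 in pos2)
    return matches

print(common_seeds("GATTACA", "TTACG", 3))
# [('TAC', 3, 1), ('TTA', 2, 0)]
```

Each returned triple is a candidate seed match that an extension step could then try to lengthen in both directions.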

Seed and Extend - Simple Process
To find an exact match of length 5, some 3-mer inside that 5-letter “window” must have been chosen as a seed. Exact matches shorter than k cannot be found this way.
[Figure: two strings with a chosen seed (3-mer) highlighted inside a length-5 match.]

Storage Cost: Naïve Choice of K-Mers
If we don’t want to miss any match, we have to store all k-mers. For simplicity, assume $k \ll L$; then a string of length L has about L k-mers.
In 2004, genome assembly for a common rat used about $33 \cdot 10^6$ sequences, with an average length of 600 letters each. That gives about $2 \cdot 10^{10}$ k-mers. A typical choice of k was 20, which comes to $4 \cdot 10^{11}$ letters.
Even at 2 bits per letter, we need 5 bytes for the letters of each k-mer. Add 3 more bytes for the string index (i) and 2 more bytes for the position of the k-mer in the string (p).
The total is about 200 GB, and other tasks can be more demanding than assembling a rat genome.
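The arithmetic behind the 200 GB figure can be reproduced as follows. The read count of 33 million is an assumption: the power of ten is inferred from the stated 600-letter average and the 200 GB total, and the variable names are illustrative.

```python
reads = 33 * 10**6         # assumed read count (power of ten inferred)
avg_len = 600              # average read length in letters
k = 20                     # typical seed length

# With k << L, each read contributes roughly one k-mer per letter.
kmers_total = reads * avg_len

kmer_bytes = k * 2 // 8    # 2 bits per letter -> 5 bytes per k-mer
index_bytes = 3            # string index (i), as stated on the slide
pos_bytes = 2              # position within the string (p)

total_bytes = kmers_total * (kmer_bytes + index_bytes + pos_bytes)
print(total_bytes / 10**9)  # 198.0, i.e. about 200 GB
```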

Reducing Storage Requirements
Store fewer k-mers as seeds. But which ones?
Simplest option: store every G-th k-mer (for some G). Problem: two overlapping strings whose offset is not a multiple of G may share no stored k-mer, so the overlap is missed entirely.
Better option: minimizers. Take a group of adjacent k-mers and choose one representative k-mer, where “representative” means that two strings with a significant overlap choose the same k-mer.
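The weakness of fixed-stride sampling can be seen in a small sketch. The sequence, the reads, and the function name `every_gth_kmer` are all hypothetical; the two reads overlap in 13 letters but start one position apart.

```python
def every_gth_kmer(s: str, k: int, g: int) -> set[str]:
    """Keep only the k-mers that start at positions 0, g, 2g, ..."""
    return {s[i:i + k] for i in range(0, len(s) - k + 1, g)}

genome = "ACGTTGCAAGCTTAGGCATC"   # hypothetical sequence
read_a = genome[0:14]
read_b = genome[1:15]             # same region, shifted by one letter

sampled_a = every_gth_kmer(read_a, k=4, g=3)
sampled_b = every_gth_kmer(read_b, k=4, g=3)
print(sampled_a & sampled_b)      # set(): the overlap is never detected

# Storing every k-mer (g = 1) does find shared seeds.
print(len(every_gth_kmer(read_a, 4, 1) & every_gth_kmer(read_b, 4, 1)) > 0)  # True
```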

Minimizers
Minimizers: a special set of representative k-mers.
The Representation Property (Property 1): if two strings have a significant exact match, then at least one of the minimizers chosen from one will also be chosen from the other.
[Figure: two overlapping strings with minimizers $m_1$, $m_2$, $m_3$ marked; $m_1$ is marked on both.]

Window of K-Mers
$k$: seed length (the length of a k-mer).
$w$: window size, the number of adjacent k-mers in a k-mer group.
$w+k-1$: window coverage, the length of the substring covered by all $w$ k-mers.
Example: $k=3$, $w=5$, so $w+k-1=7$.
[Figure: a window of $w=5$ 3-mers spanning positions 1 to 7.]

Interior Minimizers
$(w,k)$ minimizer: given a window of $w$ consecutive k-mers, the minimizer is the smallest k-mer.
This requires an ordering. Simplest: lexicographic, e.g. “AAAA” < “ABAA” < “CGCG”. Better orderings: later.
The representation property is satisfied (Property 1’): if two strings have a substring of length $w+k-1$ in common, they have a $(w,k)$ minimizer in common.
[Figure: example over the letters A, T, C, G with $k=3$, $w=5$.]
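A minimal sketch of interior minimizer selection under lexicographic order, choosing every tied k-mer in a window. The function name and the two example strings are assumptions; positions are 0-based.

```python
def interior_minimizers(s: str, w: int, k: int) -> set[tuple[int, str]]:
    """(position, k-mer) for the smallest k-mer in every window of w k-mers."""
    kms = [s[i:i + k] for i in range(len(s) - k + 1)]
    chosen = set()
    for start in range(len(kms) - w + 1):
        window = kms[start:start + w]
        smallest = min(window)               # lexicographic order
        for offset, km in enumerate(window):
            if km == smallest:               # on a tie, keep all tied k-mers
                chosen.add((start + offset, km))
    return chosen

# The two strings share "GACCA", which has length w+k-1 = 5.
a = interior_minimizers("TTGACCAGT", w=3, k=3)
b = interior_minimizers("CCGACCAAA", w=3, k=3)
print(sorted(a))  # [(2, 'GAC'), (3, 'ACC'), (6, 'AGT')]
print({km for _, km in a} & {km for _, km in b})  # {'ACC'}
```

The shared minimizer "ACC" lies inside the common substring, which is what Property 1’ promises for a common substring of length w+k-1.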

Interior Minimizers
Example: $k=3$, $w=4$, $|S|=15$, $w+k-1=6$.
All k-mers: $|S|-k+1=13$ seeds.
Interior minimizers: 4 seeds.

Gaps Between Minimizers
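Not every letter in the string must be covered by some minimizer.
Maximal gap size: $w-k = (w+k)-2k$.
Complete coverage: $w \le k$ ($w=k$ is common).
Sparse minimizers: $w \gg k$.
[Figure: $m_1$ at positions 1 to $k$, then a gap, then $m_2$ ending at position $w+k$; position $w+k-1$ is also marked.]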

End Minimizers
Interior minimizers don’t guarantee coverage at the ends of a string: at most $w-1$ letters at each end might be uncovered.
$(u,k)$ end-minimizer: a $(u,k)$ minimizer chosen from a window of size $u$ that is anchored to one end of the string.
If for some $v$ we build the set of all $(u,k)$ end-minimizers for $u \in \{1,\dots,v\}$, we satisfy the end-representation property (Property 2): if the ends of two strings have an exact overlap of at least $k$ letters and at most $k+v-1$ letters, then they share at least one end-minimizer.
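End-minimizers for all u from 1 to v can be collected with a running minimum over the k-mers nearest one end. This is a sketch under assumptions: lexicographic order, 0-based positions, ties resolved in favor of the k-mer closest to the end, and a hypothetical function name.

```python
def end_minimizers(s: str, k: int, v: int, end: str = "left") -> set[tuple[int, str]]:
    """(position, k-mer) of the (u,k) end-minimizer for every u in 1..v."""
    positions = list(range(len(s) - k + 1))
    if end == "right":
        positions.reverse()                  # walk inward from the right end
    chosen = set()
    best = None
    for pos in positions[:v]:                # window grows by one k-mer per step
        km = s[pos:pos + k]
        if best is None or km < best[1]:
            best = (pos, km)
        chosen.add(best)                     # minimizer of the current window
    return chosen

print(sorted(end_minimizers("TTGACCAGT", k=3, v=4, end="right")))
# [(3, 'ACC'), (6, 'AGT')]
```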

End Minimizers
Example: $k=3$, $l=16$, $u \in \{1,\dots,l-k+1\} = \{1,\dots,14\}$.

Mixed Strategy
Ensure complete letter coverage by using both:
$(w,k)$ interior minimizers, with $w \le k$;
$(u,k)$ end-minimizers at both ends, for every $u \in \{1,\dots,w-1\}$.
Every letter in the string is then covered by at least one minimizer.
Example: $w=k=3$, $u \in \{1,2\}$.

Ordering - Effect on Storage
Last week we saw that, on average, about $2/(w+1)$ of the positions give a new minimal value in our window (i.e. a “minimizer”), which is one new minimizer every $(w+1)/2$ letters.
Until now we dealt with numbers. Let’s switch to nucleic acids (ACGT).
We choose minimizers by lexicographic order (i.e. A is the first letter). In case of a tie, we choose all that are tied.
What if we encounter k-mers made only of A’s several times in our window? A k-mer of A’s is the first in our order, so we would have to choose all of those k-mers!

Ordering - Effect on Storage
But this could be said of any run of a single letter: if our “first” letter were G, then runs of G’s would hurt us the same way…
But DNA sequences are not completely random: A is more common than C or G.
So if A is our “first” letter, there are more runs of A’s than of C’s and G’s, the fraction of k-mers chosen as minimizers rises above $2/(w+1)$ (on average), and we store more k-mers than expected.
If we instead choose the “first” letter (or k-mer) to be a rare one, e.g. CGCG, this mitigates the problem!
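One way to make a k-mer like CGCG the smallest is to swap the comparison key. The ranking below is an assumption chosen for illustration (odd-numbered bases rank C < A < T < G, even-numbered bases use the reverse); the slides do not specify the exact scheme.

```python
# Hypothetical "rare-first" ranking: not taken from the slides.
ODD_RANK = {"C": 0, "A": 1, "T": 2, "G": 3}
EVEN_RANK = {"G": 0, "T": 1, "A": 2, "C": 3}

def rare_first_key(kmer: str) -> tuple[int, ...]:
    """Sort key under which CGCG... compares as the smallest k-mer."""
    return tuple(
        (ODD_RANK if i % 2 == 0 else EVEN_RANK)[base]  # i = 0 is the 1st (odd) base
        for i, base in enumerate(kmer)
    )

window = ["AAAA", "ACGT", "CGCG", "AAAA"]
print(min(window))                      # AAAA (lexicographic), tied twice
print(min(window, key=rare_first_key))  # CGCG (rare-first), no tie
```

Under the lexicographic order both copies of AAAA would have to be stored; under the rare-first key the window yields the single k-mer CGCG.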

Ordering - Effect on Match Significance
Ordering by “rarity” does not only help in reducing storage. We also want our matches to be significant.
If we want to see whether two articles are similar, the word “protein” appearing in both is more indicative of a resemblance than multiple co-occurrences of the phrase “this is a”.
The same holds for genes: a match of CGCG is more significant than a match of AAAA.
The order can impact both the storage requirements and the statistical significance of the matches found. The latter is important when minimizers are sparse (not covering the whole string).

A Bit on Similarity
Like matches, not all mismatches are the same.
BLAST: a family of algorithms for sequence matching.
Seed & Extend: it starts from seeds and tries to extend from there.
It can look for similarities, not just exact matches. Similarities are determined by a similarity score matrix.

How Can Minimizer Ordering Help Find Similarities
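One possible feature of BLAST is to extend until the similarity score drops below some threshold.
Assume the seed size (k) is 4, the threshold is 7, and the similarity score matrix shown on the slide.
[Table: similarity score matrix over A, C, G, T; surviving entries, without their row and column positions: 2, -2, -3, 1, 5, -1, 3.]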

Case Study
Faux dataset: a computationally shattered C. elegans genome.
Dataset stats:
Total genome length: 100 Mb = $10^8$ base pairs.
Reads of length $\sim N(\mu=537, \sigma=90)$.
5.7-fold coverage of the genome.
Artificial base errors were inserted, with probabilities taken from actual reads of the human genome.

Case Study The goal: finding overlaps of at least 40 base pairs
Algorithmic pipeline: Seeds were created by using minimizers of different (𝑤, 𝑘) values, including all k-mers (w=1). Seed & Extend algorithm was executed, detecting overlaps using these seeds. In some cases, Symmetrizer was applied to find additional overlaps.

Case Study The goal: finding overlaps of at least 40 base pairs
Measurements:
Run time: in hours, on an average desktop computer.
$t_{ratio}$ (also called recall or sensitivity): the percentage of true overlaps found.
$F/T = \frac{\text{false positives}}{\text{true positives}}$.
$\text{precision} = \frac{\text{true positives}}{\text{true positives} + \text{false positives}} = \frac{1}{1+F/T}$.
False positives are common because repeated regions in the genome match locally.
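The relation between F/T and precision can be checked numerically. The counts below are made up for illustration and are not results from the case study.

```python
import math

def precision_from_ft(f_over_t: float) -> float:
    """Convert a false-positive / true-positive ratio into precision."""
    return 1 / (1 + f_over_t)

true_pos, false_pos = 40, 60          # hypothetical counts
direct = true_pos / (true_pos + false_pos)
via_ratio = precision_from_ft(false_pos / true_pos)
print(direct)                          # 0.4
print(math.isclose(direct, via_ratio)) # True
```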

Symmetrizer A technique for finding missing overlaps
If read X plausibly overlaps reads Y and Z, and the offsets suggest that Y and Z overlap, then Y and Z are sent to the Extender part of the algorithm.
In the example, the overlap between R and B is too short to reliably produce a shared minimizer, but their offsets relative to G suggest that they do in fact overlap.
[Figure: three reads laid out by their offsets, with position ticks R 10 to 80, G 10 to 100, and B 10 to 90.]
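The offset reasoning can be sketched as follows. Everything here is an assumption made for illustration: the data layout (`offsets[(x, y)]` is where y starts relative to the start of x), the function name, and the numbers. It is not the implementation used in the case study.

```python
from itertools import combinations

def symmetrizer_candidates(offsets, lengths, min_overlap):
    """Pairs (y, z, offset of z relative to y) implied by a shared neighbor."""
    neighbors = {}
    for (x, y), off in offsets.items():
        neighbors.setdefault(x, {})[y] = off
        neighbors.setdefault(y, {})[x] = -off
    known = {frozenset(pair) for pair in offsets}
    found = set()
    for x, nbrs in neighbors.items():
        for y, z in combinations(sorted(nbrs), 2):
            if frozenset((y, z)) in known:
                continue                      # overlap already detected
            d = nbrs[z] - nbrs[y]             # implied start of z relative to y
            overlap = min(lengths[y], d + lengths[z]) - max(0, d)
            if overlap >= min_overlap:
                found.add((y, z, d))          # would be sent to the Extender
    return sorted(found)

lengths = {"R": 80, "G": 100, "B": 90}
offsets = {("G", "R"): -50, ("G", "B"): 10}   # hypothetical detected overlaps
print(symmetrizer_candidates(offsets, lengths, min_overlap=15))
# [('B', 'R', -60)]
```

In this made-up layout R and B share only 20 letters, yet both overlap G, so the pair is proposed for extension.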

Results
[Table: precision per configuration: 36%, 38%, 42%, 48%, 45%, 43%, 28%, 29%, 33%.]

Recall-Precision Tradeoff
As $w$ increases, minimizers become less common.
w=20 vs. w=1: recall is somewhat damaged, precision increases significantly, and run time shortens dramatically.

Minimizers VS All K-Mers
w=k=20 vs. w=1, k=30: similar recall, but minimizers have better precision and much better run time.
Notice the recall-precision tradeoff among the all-k-mers runs with different values of k.

With & Without Symmetrizer
In general, Symmetrizer improves recall.
Sym, w=20 vs. NoSym, w=3: similar precision, better recall, better run time.
Window size has a minor effect on recall (notice the scale!) and a big effect on run time, although the minor effect on recall might be because the recall is already so high.

Sym Minimizers VS All K-Mers
For high-recall needs, minimizers with Symmetrizer provide better recall, similar precision, and significantly better run time.
Sym minimizers might not perform well in a high-precision scenario (they tend to find false positives), but the data is insufficient for a solid conclusion.

Recall Equivalence
About $\frac{2}{w+1}$ of k-mers are $(w,k)$ minimizers.
A string of length $l$ has $l-k+1$ k-mers in total.
⇒ A string of length $l$ is expected to have about $\frac{2(l-k+1)}{w+1}$ minimizers.
When is a string expected to have 1 minimizer? Solving for $l$: $l_1 = k + \frac{w-1}{2} \cong k + \frac{w}{2}$.
We expect matching substrings of length $k+\frac{w}{2}$ to be found both by a $(w,k)$ minimizer and by a $(k+\frac{w}{2})$-mer.
Indeed, we see similar recall values for all $(k+\frac{w}{2})$-mers and for $(w,k)$ minimizers.
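The expected-count formula and its solution for one minimizer can be checked numerically. The function names are illustrative.

```python
def expected_minimizers(l: float, w: int, k: int) -> float:
    """Expected number of (w,k) minimizers in a string of length l."""
    return 2 * (l - k + 1) / (w + 1)

def length_for_one_minimizer(w: int, k: int) -> float:
    """Length l_1 at which exactly one minimizer is expected."""
    return k + (w - 1) / 2

l1 = length_for_one_minimizer(w=20, k=20)
print(l1)                                  # 29.5, close to k + w/2 = 30
print(expected_minimizers(l1, w=20, k=20)) # 1.0
```

With w=k=20 the equivalent seed length is about 30, which matches the comparison of w=k=20 minimizers against all 30-mers in the results.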

Minimizers vs. All k-mers: Precision
We saw that with regard to recall, using $(w,k)$ minimizers and using all $(k+\frac{w}{2})$-mers give comparable results.
To analyze precision, we have to consider the following parameters:
L: total length of the database (in letters).
b: number of different letters (the “base” of a sequence): 4 in DNA (ACGT), 20 in proteins.
A sequence of length k is expected to appear in L a total of $\frac{L}{b^k}$ times.
$\frac{L}{b^k}$ is an indicator of precision: if k is chosen such that $\frac{L}{b^k} \ll 1$, then a sequence of length k that appears in our database more than once is unlikely to have matched by chance.
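A small helper makes the indicator concrete. The function name is illustrative, and the example values (a database of $10^8$ letters, as in the case study, with k=20) are chosen here only to show a case where random matches are rare.

```python
def expected_occurrences(L: int, b: int, k: int) -> float:
    """Expected number of times a given length-k sequence appears in L letters."""
    return L / b**k

# DNA (b = 4), database of 10**8 letters, seeds of length 20.
value = expected_occurrences(10**8, b=4, k=20)
print(value < 1e-3)   # True: far below 1, so repeated k-mers are unlikely by chance
```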

Looking for Long Matches? Minimizers!
Assume we are looking for long matches. We will probably choose a large k, so that we won’t waste our time on many small seeds.
In this case we probably get $\frac{L}{b^k} \ll 1$. But to be sure, check your L!
We know that using $(w,k)$ minimizers and using all $(k+\frac{w}{2})$-mers will have similar recall, but the minimizers need only about $\frac{2}{w+1}$ of the storage.
Although the “all k-mers” approach will have slightly better precision, the difference will be negligible.

What About shorter Matches?
Say that short matches are significant enough, so k is bounded by the size of a significant match.
In this case we can get $\frac{L}{b^k} > 1$ (but again, check your L!).
If $k=15$, $b=4$, and $L=10^{10}$, then for the all-k-mers approach we get $\frac{L}{b^k} \approx 10$.
$(w=10, k=10)$ minimizers will yield the same recall, but $\frac{L}{b^k}$ will be about a thousand times larger (k went from 15 to 10). This is a big hit on precision.
And even regarding storage: since every k-mer is expected to appear in L multiple times, we don’t need to save all the k-mers; we can just use a hash table.
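The thousand-fold factor does not depend on L: shortening the seed from 15 to 10 letters multiplies the expected number of random occurrences by $b^5$. In the sketch below the database size of $10^{10}$ letters is an assumption, inferred from the stated result of about 10 for k=15.

```python
b = 4
L = 10**10                      # assumed database size (inferred, see above)

all_15_mers = L / b**15         # all k-mers with k = 15
minimizer_10 = L / b**10        # (w=10, k=10) minimizers use k = 10

print(round(all_15_mers, 1))    # 9.3, i.e. about 10
print(round(minimizer_10))      # 9537
print(b**(15 - 10))             # 1024: the factor between the two
```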

Summary
$(w,k)$ minimizers are guaranteed to find matches of length $\ge w+k-1$.
Minimizers need only about $\frac{2}{w+1}$ of the storage on average.
Choose an ordering that favors rare k-mers.
$w$ and $k$ affect the recall-precision-runtime tradeoff.
Don’t always use minimizers: consider the size of the alphabet (b) and of the database (L).
