Mash: fast genome and metagenome distance estimation using MinHash

Brian D. Ondov, Todd J. Treangen, Páll Melsted, Adam B. Mallonee, Nicholas H. Bergman, Sergey Koren and Adam M. Phillippy, 2016. November 2018, Deep Sequencing Seminar. Dvir Ginzburg, Yuri Klayman. Speaker notes: read out the subtitle. We have already seen MinHash. For sequence comparison, Mash matches the results of existing algorithms in a fraction of the time, and as a consequence succeeds at tasks that were not feasible until now.

Presentation overview
Mash introduction; MinHash reminder; Mash distance and p-value; key points, parameter tuning (k and s sizes) and conceptual issues; experiments and results; improvement suggestions; conclusions. * Terminology and derivations appear at the end of the presentation.

Mash overview
A novel bioinformatics approach to sequence comparison. Turns large sequences and sequence sets into small, representative sketches. Provides reliable distance estimation, which enables clustering and search over massive sequence collections. The estimation error depends only on the sketch size and is independent of the genome size. Based on MinHash (reminder in the next slides). Uses the “Mash distance” instead of the Jaccard index to address domain-specific problems. Speaker notes: start from the points here, then move on to the written part. The whole course is divided into sequence alignment and sequence comparison. Here too we work with k-mers; comparing all the k-mers directly is hard and requires an infeasible amount of memory, hence sketches.

Mash capabilities
Supports parallelism. Works on assembled or unassembled data (alignment-free). Computationally efficient, addressing tasks that were infeasible before. Can run in real time, online. Supports arbitrary alphabets. Free and open source. Helps with privacy concerns. Useful for response teams: generating reads takes time, and Mash can give an assessment online while the reads arrive.

Terminology and Definitions
MinHash sketch: given the $t$ $k$-mers of the multi-sequence $L$ and the MinHash function $h$, a MinHash sketch of size $s$ of $L$ consists of the $s$ numerically smallest elements of the set $h(L) \stackrel{\text{def}}{=} \{h(l) \mid l \in L\}$. Example: $\{AAGT, AAAA, CCGT, AGTG\} \to \{3, 12, 1, 8\}$, so for $s = 2$ the sketch is $\{1, 3\}$. $x$: the number of hash values shared between sketches $S_1$ and $S_2$ (the intersection). Jaccard similarity: the set-similarity measure that MinHash estimates. Jaccard estimate: $j = \frac{x}{s}$.
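As a rough illustration of these definitions, the following Python sketch builds a bottom-s sketch and computes the estimate j = x/s. The hash function (truncated BLAKE2b from `hashlib`) and the helper names `kmers`, `hash64`, `sketch` and `jaccard_estimate` are assumptions made for this example; they are not the hash function or API of Mash itself.

```python
import hashlib

def kmers(seq, k):
    """Return the set of distinct k-mers of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def hash64(kmer):
    """Stand-in 64-bit hash (assumption: any well-mixed hash would do here)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")

def sketch(seq, k, s):
    """Bottom-s sketch: the s smallest hash values, kept sorted."""
    return sorted(hash64(km) for km in kmers(seq, k))[:s]

def jaccard_estimate(sketch_a, sketch_b, s):
    """j = x / s, where x is the number of hash values shared by the sketches."""
    x = len(set(sketch_a) & set(sketch_b))
    return x / s

a = sketch("ACGTACGGTCAGTCAGGCTA", k=4, s=5)
b = sketch("ACGTACGGTCAGTCAGGCTA", k=4, s=5)
print(jaccard_estimate(a, b, s=5))  # identical sequences share every hash value
```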

Locality Sensitive Hash
MinHash is a form of locality-sensitive hashing (LSH). This is in contrast to normal hashes, and especially cryptographic hashes. We explicitly look for a hash function that reduces a large set of options to a smaller set, where similar inputs are more likely to be reduced to similar values. Normal hash: small input change -> large change in output. LSH: small input change -> small change in output.

MinHash
We want to evaluate the similarity of two sets A and B without computing the union and intersection over all the k-mers in the sequences. Our goal is to find a “sketch” (subset) representation S1, S2 of A and B such that: 1) each sketch is small enough that we can fit a sketch for every set in main memory; 2) Sim(A, B) is (almost) the same as the “similarity” of S1 and S2 (using the Jaccard index J).

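MinHash
Pick a random permutation of the rows (the universe U). h(L) = the index of the first element of L in the permuted order. Use s as the sketch size to define the sequence signature. Example: the k-mers are labelled ACGT = 'A', AACG = 'B', ATAG = 'C', ACCC = 'D', GGGT = 'E', GTGT = 'F', TAGA = 'G', and the permuted order is C, D, G, F, A, B, E. The sets are $S(L_1) = \{TAGA, GTGT, ACGT, AACG\}$, $S(L_2) = \{ATAG, ACCC, TAGA\}$, $S(L_3) = \{TAGA, GTGT, ACGT\}$ and $S(L_4) = \{ATAG, ACCC, GGGT, AACG\}$, so $h(L_1) = h(L_3) = 3$ and $h(L_2) = h(L_4) = 1$. [Figure: input matrix of k-mers A to G against sets L1 to L4, before and after the permutation.]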
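The permutation example on this slide can be replayed in a few lines of Python. This is only a sketch of the idea: the names `PERMUTED_ORDER`, `SETS` and `min_hash` are made up for the illustration, and indexes are 1-based to match the slide.

```python
# Permuted row order from the slide: C, D, G, F, A, B, E
PERMUTED_ORDER = ["ATAG", "ACCC", "TAGA", "GTGT", "ACGT", "AACG", "GGGT"]

SETS = {
    "L1": {"TAGA", "GTGT", "ACGT", "AACG"},
    "L2": {"ATAG", "ACCC", "TAGA"},
    "L3": {"TAGA", "GTGT", "ACGT"},
    "L4": {"ATAG", "ACCC", "GGGT", "AACG"},
}

def min_hash(kmer_set, permuted_order):
    """1-based index of the first row of the permuted order that is in the set."""
    for index, kmer in enumerate(permuted_order, start=1):
        if kmer in kmer_set:
            return index
    raise ValueError("set shares no element with the universe")

signatures = {name: min_hash(members, PERMUTED_ORDER) for name, members in SETS.items()}
print(signatures)
```

With a single permutation the signature is one number per set; using several independent permutations gives several rows of the signature matrix.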

MinHash signature matrix
$\Pr(h(L_1) = h(L_2)) = \mathrm{Sim}(L_1, L_2)$. In the permuted order, the first row in which at least one of the two sets has the value 1 belongs to the union (recall that the union consists of the rows with at least one 1). The two hash values are equal exactly when both sets have the value 1 in that row, that is, when the row belongs to the intersection. With multiple signatures (hash functions h1, h2, h3, ...) we get a good approximation of the similarity. [Figure: input matrix (k-mers A to G by sets L1 to L4), the signature matrix (h1 to h3 by L1 to L4), and a table comparing the actual and the signature-based similarity for each pair of sets.]
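The claim that the collision probability equals the Jaccard similarity can be checked by brute force on a small universe. The sketch below enumerates all 5040 permutations of the seven rows A to G; the set contents follow the labels of the earlier example, while the function names are assumptions of this illustration.

```python
from fractions import Fraction
from itertools import permutations

UNIVERSE = "ABCDEFG"
L1 = set("ABFG")
L3 = set("AFG")
L4 = set("BCDE")

def min_index(members, order):
    """Index of the first row of the permuted order that belongs to the set."""
    return next(i for i, row in enumerate(order) if row in members)

def collision_probability(a, b, universe):
    """Exact Pr(h(a) == h(b)) over all permutations of the universe."""
    total = hits = 0
    for order in permutations(universe):
        total += 1
        hits += min_index(a, order) == min_index(b, order)
    return Fraction(hits, total)

def jaccard(a, b):
    return Fraction(len(a & b), len(a | b))

print(collision_probability(L1, L3, UNIVERSE), jaccard(L1, L3))
```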

MinHash – Two Variants
Creating permutations: taking a large number $s$ of hash functions (permutations), it can be shown that the error bound is $O(1/\sqrt{s})$. Is it feasible? No: with $k = 32$ there are $4^{32} \approx 1.84 \times 10^{19}$ rows to permute. Indexing the permutation: instead of permuting the rows, we apply a hash function that maps the rows (k-mers) to a new, possibly larger, space. The sketch of the set is then the $s$ lowest indexes (values of the hash function) in the new space. It can be shown that the error bound stays $O(1/\sqrt{s})$.

Mash stages
For all sequences: compute the k-mers, then keep the $s$ smallest hashes output by $h$ over all k-mers in the sequence. When $n \gg s$ the process is almost linear (discussed later). Sketches are stored sorted. Compare sketches using the “Mash distance”: typical s values are $s < 1000$, making this stage almost negligible. Compute a p-value to determine the statistical significance of the results.

Mash Distance Jaccard pitfall

Gene mutations
A mutation is a permanent alteration of the nucleotide sequence of the genome. When comparing sequences of the same genome for “type” matching (e.g. detecting virus types), we seek a metric that would be invariant to mutations. The Jaccard index is sensitive to genome size and simultaneously captures both point mutations and gene content differences. The Mash distance D seeks to directly estimate the mutation rate under a simple Poisson process to address these problems. Speaker notes: mutations are part of the natural replication process and are random; they can range from a single base in a specific cell to a run of bases, and are defined by a difference from the majority of samples of the same genome. Taxonomy: order -> family -> genus -> species.

Mash distance derivation (1)
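Jaccard estimate: for sketch size $s$, sequences $A$, $B$ and their sorted hash value sets $S_A$, $S_B$, we define $x$ as the number of shared hashes found after processing the first $p$ hash values from both sets, and $p'$ as the number of distinct values among them. The Jaccard estimate is $j = \frac{x}{p'}$, computed in $O(s)$ by merging the sorted sets. Example with $S_A = \{5, 8, 32, 66\}$ and $S_B = \{5, 9, 10, 32\}$: $p = 2$, $p' = 1 \to \{5\} \to j = 1$; $p = 3$, $p' = 2 \to \{5, 8\} \to j = \frac{1}{2}$; $p = 8$, $p' = 6 \to \{5, 8, 9, 10, 32, 66\} \to j = \frac{2}{6} = \frac{1}{3}$.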
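The merge over two sorted sketches can be sketched as follows. This is a minimal illustration, assuming that `limit` is the number of distinct hash values to process (playing the role of p' on the slide); the function name and the use of `Fraction` for exact results are choices made for the example.

```python
from fractions import Fraction

def merged_jaccard_estimate(sketch_a, sketch_b, limit=None):
    """Walk two sorted sketches in parallel, counting shared and distinct values.

    limit: stop after this many distinct values (None means process everything).
    """
    i = j = shared = distinct = 0
    while i < len(sketch_a) and j < len(sketch_b) and (limit is None or distinct < limit):
        if sketch_a[i] == sketch_b[j]:
            shared += 1
            i += 1
            j += 1
        elif sketch_a[i] < sketch_b[j]:
            i += 1
        else:
            j += 1
        distinct += 1
    # Whatever is left in one sketch is distinct and cannot be shared.
    leftover = (len(sketch_a) - i) + (len(sketch_b) - j)
    distinct += leftover if limit is None else min(limit - distinct, leftover)
    return Fraction(shared, distinct) if distinct else Fraction(0)

s_a = [5, 8, 32, 66]
s_b = [5, 9, 10, 32]
print(merged_jaccard_estimate(s_a, s_b, limit=1))  # 1
print(merged_jaccard_estimate(s_a, s_b, limit=2))  # 1/2
print(merged_jaccard_estimate(s_a, s_b))           # 1/3
```

Stopping after s distinct values (here `limit=4`) compares only the s smallest hashes of the union, which is why the comparison stays linear in the sketch size.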

We consider the mutation rate under a simple Poisson process. Given the probability $d$ of a single substitution, the expected number of mutations in a k-mer is $\lambda = kd$, and the probability that no mutation occurs in a k-mer is $e^{-kd}$. Let $w$ be the number of k-mers with no mutations and $t$ the total number of k-mers in the sequence; under the Poisson model the expected fraction of unmutated k-mers, $\frac{w}{t}$, equals $e^{-kd}$. For reference, the Poisson distribution is $P(N = n) = \frac{\lambda^n}{n!} e^{-\lambda}$, so $P(N = 0) = e^{-\lambda}$.

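Solving $e^{-kd} = \frac{w}{t}$ yields $d = -\frac{1}{k} \ln \frac{w}{t}$ (1). Mash sets $t$ to the average genome size $n$, thus $j = \frac{x}{s} = \frac{w}{2n - w}$. Rearranging, $2nj = w(1 + j)$, hence $\frac{w}{n} = \frac{2j}{1 + j}$. Plugging this into (1) gives the Mash distance: $D = -\frac{1}{k} \ln \frac{2j}{1 + j}$.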
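The final formula is short enough to write down directly. The function below is a sketch of it; returning 1.0 when j = 0 is an assumption of this example (the logarithm is undefined there), not something stated on the slide.

```python
import math

def mash_distance(j, k):
    """D = -(1/k) * ln(2j / (1 + j)) for a Jaccard estimate j and k-mer size k."""
    if not 0.0 <= j <= 1.0:
        raise ValueError("j must lie in [0, 1]")
    if j == 0.0:
        return 1.0  # assumption: treat "no shared hashes" as maximal distance
    # ln((1 + j) / (2j)) is the same as -ln(2j / (1 + j))
    return math.log((1 + j) / (2 * j)) / k

print(mash_distance(1.0, 21))  # identical sketches give distance 0.0
print(mash_distance(0.5, 21))
```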

Mash derivation equations
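Jaccard index: $J(A, B) = \frac{|A \cap B|}{|A \cup B|}$. Jaccard estimate: $j = \frac{x}{s}$. Chance of no mutations in a k-mer: $e^{-kd}$ (Poisson process). $w$ and $t$ are the numbers of unmutated and all k-mers respectively: $e^{-kd} = \frac{w}{t} \to d = -\frac{1}{k} \ln \frac{w}{t}$. $j = \frac{w}{2n - w} \to \frac{w}{n} = \frac{2j}{1 + j}$. $D = -\frac{1}{k} \ln \frac{2j}{1 + j}$.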

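Why we use the mean
Other techniques set $t$ to the smaller of the two genomes' k-mer counts. By using the mean, the Mash distance also penalizes size differences and measures resemblance rather than containment (this avoids a zero distance between a phage and a genome containing that phage). Example with $k = 5$: $S_1 = \{ABADA\}$, $n_1 = 1$; $S_2 = \{CCCAB, CCABA, CABAD, ABADA, BADAB, ADABB, DABBB\}$, $n_2 = 7$; one k-mer is shared, so $w = 1$. With the smaller count, $t = 1$, $w = 1$: $j = \frac{w}{2n - w} = 1 \to D = -\frac{1}{k} \ln \frac{2j}{1 + j} = 0$. With the mean, $t = 4$, $w = 1$: $j = \frac{1}{8 - 1} = \frac{1}{7} \to D = -\frac{1}{5} \ln \frac{1}{4} \approx 0.277$.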
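The phage example can be recomputed numerically. The helper names below are hypothetical; the sketch only plugs the slide's numbers (n1 = 1, n2 = 7, w = 1, k = 5) into the two formulas.

```python
import math

def jaccard_from_counts(w, n):
    """j = w / (2n - w), with w shared k-mers and n the assumed genome size."""
    return w / (2 * n - w)

def mash_distance(j, k):
    """D = -(1/k) * ln(2j / (1 + j)), valid for 0 < j <= 1."""
    return math.log((1 + j) / (2 * j)) / k

n1, n2, w, k = 1, 7, 1, 5

# Using the smaller genome: the phage looks identical to its host.
d_min = mash_distance(jaccard_from_counts(w, min(n1, n2)), k)
# Using the mean size, as Mash does: the size difference is penalized.
d_mean = mash_distance(jaccard_from_counts(w, (n1 + n2) / 2), k)

print(d_min, round(d_mean, 3))
```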

P-Value
We need to know the probability of observing x or more matches between two random genomes, in order to judge whether the Jaccard index, and therefore the estimate and the Mash distance we calculated from it, is significant.

P-Value calculation
To calculate the probability of matches appearing at random when sampling without replacement, we would use the hypergeometric cumulative distribution, where $w$ is the number of all matches and $m$ is the number of all distinct k-mers. But since $m \gg s$, we can approximate this by sampling with replacement and use the binomial cumulative distribution with success probability $r = \frac{w}{m}$. It is cumulative because we want the probability of $x$ matches or more.
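A minimal sketch of the binomial approximation: the probability of seeing x or more matches among s sketch entries when each entry matches independently with probability r. The function names are assumptions of this example, and the probability mass is computed in log space with `math.lgamma` only to avoid overflow for large s.

```python
import math

def log_binomial_pmf(i, s, r):
    """log of C(s, i) * r^i * (1 - r)^(s - i), for 0 < r < 1."""
    return (math.lgamma(s + 1) - math.lgamma(i + 1) - math.lgamma(s - i + 1)
            + i * math.log(r) + (s - i) * math.log(1 - r))

def p_value(x, s, r):
    """P(X >= x) for X ~ Binomial(s, r): the chance of x or more random matches."""
    if not 0 < r < 1:
        raise ValueError("r must lie strictly between 0 and 1")
    if x <= 0:
        return 1.0
    tail = sum(math.exp(log_binomial_pmf(i, s, r)) for i in range(x, s + 1))
    return min(1.0, tail)

print(p_value(1, 2, 0.5))  # at least one match in two fair draws
```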

How do we get r?
r calculation: the probability of a given k-mer $K$ appearing in a random genome $X$ of size $n$ is $P(K \in X) = \frac{1}{1 + |\Sigma|^k / n}$.

How do we get m?
$m$ is an estimate of the k-mer count, $m = \frac{2^b s}{v}$, where $v$ is the maximum hash value in the sketch and $b$ is the number of hash bits. You might wonder why we would estimate the k-mer count when we know the genome size, since we are the ones who sketched it. Apparently the authors figured that out too: in recent versions of the code, this calculation is done only if the genome size isn't provided as a parameter.
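The estimate can be tried on a toy example with evenly spread hash values. Everything here is illustrative: the function name, the 8-bit hash space and the synthetic hash values are assumptions, chosen so the arithmetic is easy to follow by hand.

```python
def estimate_kmer_count(sketch, hash_bits):
    """m = 2^b * s / v, with s the sketch size and v its largest hash value."""
    s = len(sketch)
    v = max(sketch)
    return (2 ** hash_bits) * s / v

# 63 distinct "k-mers" whose 8-bit hashes are spread evenly: 4, 8, ..., 252
all_hashes = [4 * i for i in range(1, 64)]
bottom_16 = sorted(all_hashes)[:16]          # sketch of size s = 16, so v = 64

print(estimate_kmer_count(bottom_16, hash_bits=8))  # close to the true count of 63
```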

P-Value Significance
The p-value tells us that the Mash distance we calculated is significant, but what does the distance itself mean? We calculated a significant number, but does that number mean two things are close or far? How close? Is it transitive?

ANI - Average Nucleotide Identity
We know the Mash distance is a useful measure of global sequence similarity because it correlates with ANI, the most widely used genomic relatedness index [1]. ANI is a variation on alignment and scoring; it simulates “in silico” the previously most widely used technique, which was in use for 50 years. [1] A large-scale evaluation of algorithms to calculate average nucleotide identity -

ANI
Due to the high cost of computing ANI via whole-genome alignment, a subset of 500 Escherichia genomes was selected for comparison. The correlation for ANI values of 90–100% is very good. A side note: an ANI of 95% is not too useful for distinguishing between eukaryote species, but values above 99% can be used.

ANI correlation
Mention that we are measuring the difference between genomes (Escherichia). The X-axis is ANI, or 1 – ANI to be exact; the Y-axis is the Mash distance we calculated. The gray line is X = Y; the blue line is a linear regression. The columns are different, increasing sketch sizes; the rows are increasing k-mer lengths. The number at the bottom right of each graph is the root-mean-square error. The trend is that longer k-mers lower the error, and the same holds for larger sketch sizes. For some reason, s = 5000, k = 21 has a lower error than s = 5000, k = 27; it could be that ANI itself isn't exact enough.

Estimate of Jaccard Index
Note this is still the straightforward Jaccard index. Each line is a different value of s, increasing from light gray to black: s = 100 (light gray), s = 1,000, s = 10,000 and s = 100,000 (black).

Impact of S, sketch size
Increasing the sketch size helps with divergent sequences. Increasing s has a negligible impact on the time it takes to sketch, which is O(n·log(s)), where n is the genome size. However, it has a linear impact on how long it takes to calculate a distance, since the sketches are stored sorted and compared by merging. It also has a serious impact on storage size. Note to look into if time permits: further, the probability that the i-th hash of the genome will enter the sketch is s/i, so the expected runtime of the algorithm is O(n + s log s log n) [4], which becomes nearly linear when n >> s.
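One way to get the O(n·log(s)) sketching cost is a bounded max-heap that always holds the s smallest hash values seen so far. The sketch below uses Python's `heapq` with negated values; it illustrates the idea and is not the data structure Mash itself uses, and the function name is made up.

```python
import heapq

def bottom_s(hash_values, s):
    """Return the s smallest distinct values of a stream, sorted ascending."""
    if s <= 0:
        return []
    heap = []        # negated values, so heap[0] is the largest value kept
    members = set()  # values currently in the heap, to skip duplicates
    for value in hash_values:
        if value in members:
            continue
        if len(heap) < s:
            heapq.heappush(heap, -value)
            members.add(value)
        elif value < -heap[0]:
            # Most values fail this test once the heap is full, so the
            # O(log s) replacement below becomes rare as the stream grows.
            evicted = -heapq.heappushpop(heap, -value)
            members.discard(evicted)
            members.add(value)
    return sorted(-v for v in heap)

print(bottom_s([9, 3, 7, 3, 1, 8, 2], s=3))
```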

Jaccard & Mash with different K’s
If we look at how we calculate the Mash distance from the Jaccard estimate (JE), we can see that we divide by the k-mer length. Mathematically the relation is obvious, but it represents two things: longer strings are less likely to match, which is true for the Jaccard index (JI) even without Mash; and longer strings are more likely to have mutated, which is something the Mash distance D takes into account while the JI doesn't. The paper doesn't try to distinguish the two. Also, the choice of s is unknown. k = 15, 21, 27 (top to bottom, red to blue).

Collisions
k = 21, 27, 15 (top to bottom, black to red). Now we can see what happens if we take too small a k-mer. The JI is hypothetical, assuming all k-mers are unique, and is what we test the effect against. Note that $4^{15} \approx 1.07 \times 10^9$. On the Y-axis, Mash was run on actual 1 Gbp genomes, where not all k-mers are unique. The interesting line is the red one: no matter the “actual” similarity or JI, the distance is always below 0.03, since we have so many random matches. I'm not sure how they calculated the hypothetical one; maybe by calculating the JI only for sets of the same size as a 1 Gbp genome, but artificially selecting only unique values.

The importance of a just-right K
A tradeoff between sensitivity and specificity. With a large k, we will miss matches shorter than k; in some areas, a sequence match of length 15 is important. With a smaller k, the larger the genome, the higher the chance of a k-mer appearing randomly. The paper recommends a method for picking a k large enough to avoid random collisions, but not larger than needed, so as not to miss interesting matches.

The limits of K
The probability of a given k-mer $K$ appearing randomly in a genome $X$ of size $n$ is $P(K \in X) = \frac{1}{1 + |\Sigma|^k / n}$. If we wish for this probability to be a particular value $q$, we can calculate the required $k$ by solving for it: $k = \log_{|\Sigma|} \frac{n(1 - q)}{q}$. A suggested default is k = 21, s = 1000, which requires just 8 kB per sketch. How independent are the appearances of n-mers in different genomes? - For details
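The relation between n, q and k can be explored with two small helpers. Both names are hypothetical, and the sketch returns the real-valued solution for k; how to round it to an integer k-mer length is left open here.

```python
import math

def random_kmer_probability(n, k, alphabet_size=4):
    """P(K in X) = 1 / (1 + |Sigma|^k / n) for a genome of size n."""
    return 1 / (1 + alphabet_size ** k / n)

def required_k(n, q, alphabet_size=4):
    """Real-valued k for which a random k-mer appears with probability q."""
    if n <= 0 or not 0 < q < 1:
        raise ValueError("need n > 0 and 0 < q < 1")
    return math.log(n * (1 - q) / q, alphabet_size)

k = required_k(n=5_000_000, q=0.01)
print(k, random_kmer_probability(5_000_000, k))  # the probability comes back as about 0.01
```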

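How did we get that? Simple modeling and experimental verification
We denote by $q$ the probability of a particular k-mer appearing randomly, and we assume independence between k-mers. Then $n \approx |\Sigma|^k (q + q^2 + q^3 + q^4 + \dots) = |\Sigma|^k \frac{q}{1 - q}$, which gives $q \approx \frac{n}{n + |\Sigma|^k} = \frac{1}{1 + |\Sigma|^k / n}$. Solving for $k$ and taking the logarithm gives $k = \log_{|\Sigma|} \frac{n(1 - q)}{q}$. This was confirmed to be a good estimate with experiments.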

Finer points
For DNA sequences, Mash uses canonical k-mers, meaning that of a k-mer and its reverse complement only one (the lexicographically smaller) is used. How to handle read errors? In the next slide. I don't know why they need 5 and not 4 or 10.
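Canonical k-mers are simple to illustrate: for each k-mer, keep whichever of the k-mer and its reverse complement sorts first. The function names below are made up for this example and only handle the plain A, C, G, T alphabet.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(kmer):
    """Complement every base, then reverse the string."""
    return kmer.translate(_COMPLEMENT)[::-1]

def canonical(kmer):
    """The lexicographically smaller of a k-mer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))

def canonical_kmers(seq, k):
    """Set of canonical k-mers of seq; the same for seq and its reverse complement."""
    return {canonical(seq[i:i + k]) for i in range(len(seq) - k + 1)}

print(canonical_kmers("ACGTT", 3))
print(canonical_kmers(reverse_complement("ACGTT"), 3))  # same set, other strand
```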

Two Step Extension
Ignore low-occurrence k-mers when constructing the sketch. Two implementations: (1) a hash map with a counter, so that only a k-mer that appeared more than c times is eligible; (2) to avoid large hash maps, because many k-mers appear fewer than c times, a Bloom filter. We won't explore this, because in practice the authors say the exact method outperforms both extensions in accuracy and memory usage. This is actually interesting. There is related research that shows that one could find a single
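The counter-based variant can be sketched with `collections.Counter`: count every k-mer across the reads and keep only those seen more than c times. This is an illustration of the idea only; the function name and the toy reads are assumptions, and the Bloom filter variant is not shown.

```python
from collections import Counter

def frequent_kmers(reads, k, c):
    """k-mers that occur more than c times across all reads."""
    counts = Counter(
        read[i:i + k]
        for read in reads
        for i in range(len(read) - k + 1)
    )
    return {kmer for kmer, count in counts.items() if count > c}

# Toy reads: ACGT occurs in two reads, every other 4-mer occurs once.
reads = ["ACGTA", "ACGTT", "ACGGA"]
print(frequent_kmers(reads, k=4, c=1))
```

Only the surviving k-mers would then be hashed and considered for the sketch, so k-mers that appear once because of a read error never enter it.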

Conceptual Issues
The correlation with ANI begins to degrade for more divergent genomes because the variance of the Mash estimate grows with distance; increasing the sketch size helps. In some areas, k-mers of length 5 can be important, but this is a limitation of most techniques based on NGS. Mash is fast enough to run in real time alongside sequencing, building the sketch incrementally as data becomes available, but since the method is probabilistic, this suffers from the multiple testing problem. Speaker note: maybe we should move this to the end.

Experiments Results

Clustering NCBI RefSeq
The Reference Sequence (RefSeq) collection is a comprehensive, integrated, non-redundant, well-annotated set of sequences, including genomic DNA, transcripts, and proteins. More than 54,118 organisms, 618 Gbp of genomic sequence.

Whole genome sequencing becomes routine
It has become impractical to manually assign taxonomic labels to all genomes. Automated methods are crucial for constructing groups of related genomes, and this is infeasible with alignment-based approaches. MinION (ONT, Oxford Nanopore Technologies): portable protein nanopore sequencing (2015).

k = 16 and s = 400
These parameters seem not to be ideal for large genomes (k) or for highly divergent ones (s); they are optimal for species-level relationships. 93 MB for all resulting sketches (7000 times less). Sketching process: 26.1 CPU h. Roughly 1.5 billion pairwise distance computations: 6.9 h. Easily parallelized. Additional genomes can be added in 4 CPU min for a 3 GB genome. Speaker note: is s adequate for highly divergent genomes?

An all-pairs comparison is infeasible with prior methods, so a comparison was made using the ANI metric on 500 chosen genomes. The RMSE is reported for all pairs with ANI higher than 90%.

Hierarchical clustering for mutation rate
A comparison of 17 RefSeq primate genomes was computed in just 2.5 CPU h (11 min wall clock on 17 cores) with default parameters (s = 1000 and k = 21), and compared with the UCSC alignment-based model. The trees are topologically consistent for everything except the Homo/Pan split, for which the Mash topology is more similar to past phylogenetic studies. Etymology of phylogenetic: phylon = tribe, clan, race + γενετικός (genetikós) = origin, source, birth. Speaker note: s = 1000, k = 21 because all the genomes are similar.

Real-time genome identification from assemblies or reads
Mash is able to rapidly identify genomes from both assemblies and raw sequencing reads. Procedure: generate sketches and compute Mash distances for multiple E. coli datasets, compared against the RefSeq sketch. For the assembled genomes, the correct strain was identified in a few seconds. For the unassembled data, even with reads such as the E. coli 1D MinION set with a 40% sequencing error rate, the correct species was identified.

Speaker notes: spend more time on the table; talk about the timings, the genome size, whether the input was reads or not, and how long it took.
Reads are a lot more data.

Clustering massive metagenomic datasets
Mash as a metagenomic comparison tool runs in a fraction of the time previously required. Current metagenomic comparison tools such as DSM compute the exact Jaccard index using all k-mers. As we saw previously, the Jaccard index drops rapidly as the mutation rate grows, whereas the Mash distance does not suffer from this. Speaker note: metagenomics means samples that do not come from the lab.

Clustering massive metagenomic datasets
Global Ocean Survey (GOS) data: Mash is tenfold faster and correlates better with the original GOS study. Incremental scalability: sketching happens only once per sample, and pairwise distances are computed in parallel and almost instantaneously. Other techniques take 1 h to add a new GOS sample to the cluster; Mash takes less than 1 min.

Speaker note: temperate climate.

Implementation details & Summary

Actively developed

Input - FASTQ

Usage Reference-ID, Query-ID, Mash-distance, P-value, Matching hashes
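A result line with these five fields can be turned into a typed record as sketched below. The assumptions are that the fields are tab-separated and that the matching hashes are written as `shared/sketch_size`; the sample line, the file names in it and the `MashHit` type are invented for the illustration.

```python
from typing import NamedTuple

class MashHit(NamedTuple):
    reference_id: str
    query_id: str
    distance: float
    p_value: float
    shared_hashes: int
    sketch_size: int

def parse_line(line):
    """Parse one tab-separated result line (assumed field order as on the slide)."""
    reference_id, query_id, distance, p_value, hashes = line.rstrip("\n").split("\t")
    shared, total = hashes.split("/")
    return MashHit(reference_id, query_id, float(distance), float(p_value),
                   int(shared), int(total))

# Hypothetical sample line, not real program output
sample = "reference.fna\tquery.fna\t0.0222766\t0\t456/1000\n"
hit = parse_line(sample)
print(hit.distance, hit.shared_hashes / hit.sketch_size)
```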

Further improvements
Since Mash works on FASTQ, and FASTQ comes with a quality indicator per nucleotide, we could try to pick k-mers of higher quality. The program should probably come with some features to help with the multiple testing problem, such as: delaying querying and buffering new reads as they are being read, until a reasonable number has accumulated; stability when replacing the tested sketch, so that only significant changes in score replace a value in the sketch; a warning when the number of tests performed is high relative to the p-value. Support for containment testing, which MinHash is capable of with some changes.

Conclusions
A highly efficient and scalable method for estimating the distance between sequences, mostly independent of sequence size. Open source and under active development and use.

MinHash sketch: given the $t$ $k$-mers of the multi-sequence $L$ and the MinHash function $h$, a MinHash sketch of size $s$ of $L$ consists of the $s$ smallest elements of the set $h(L) \stackrel{\text{def}}{=} \{h(l) \mid l \in L\}$. $x$: the number of hash values shared between sketches $S_1$ and $S_2$ (the intersection). Jaccard similarity: the set-similarity measure that MinHash estimates.

Jaccard estimate ($j$): $\frac{x}{s}$. Mash distance ($D$): a new, domain-suited distance metric for sequence similarity. Conserved k-mers ($w$): the number of unmutated k-mers for two sequences of the same gene.

Mash derivation equations
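Jaccard index: $J(A, B) = \frac{|A \cap B|}{|A \cup B|}$. Jaccard estimate: $j = \frac{x}{s}$. Chance of no mutations in a k-mer: $e^{-kd}$ (Poisson process). $w$ and $t$ are the numbers of unmutated and all k-mers respectively: $e^{-kd} = \frac{w}{t} \to d = -\frac{1}{k} \ln \frac{w}{t}$. $j = \frac{w}{2n - w} \to \frac{w}{n} = \frac{2j}{1 + j}$. $D = -\frac{1}{k} \ln \frac{2j}{1 + j}$.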
