Precomputing Edit-Distance Specificity of Short Oligonucleotides

Precomputing Edit-Distance Specificity of Short Oligonucleotides
Nathan Edwards Center for Bioinformatics and Computational Biology University of Maryland, College Park

Polymerase Chain Reaction

Primer Specificity Need to ensure that primers hybridize to a specific (specified) locus only Require exactly one occurrence of specified sequence Require no (potential) mis-hybridization loci Bottleneck computation in primer-design Design / check iteration is problematic

k-unique 20-mers Edit-distance as a surrogate for mis-hybridization potential k-unique loci: All non-self genomic loci are require more than k edits in (global) alignment Closest non-self genomic loci requires (k+1) edits in (global) alignment

Find all k-unique 20-mers
Naïve algorithm: O(n2km) Quadratic in size of genome. 0-unique (exact match) 20-mers (Expected) linear time algorithm Achieve expected linear time using a hybrid approach (blastn): Use partial exact match to “seed” expensive dynamic programming alignment Large chunks ) Fast, but miss occurrences Small chunks ) Slow, but correct

Inexact sequence match
Baeza-Yates Perleberg: Correct and O(n) for small k At least 1 chunk is observed with no error. Small k → Large chunks → Fast and correct Largest correct chunk: floor(m/(k+1)) g ≠ = ≠ q

Example worst case alignments
TCCCGC-TAGATTGAGATCT ||||||v||||||*|||||| TCCCGCCTAGATTTAGATCT ACTTGTCCACAGTGCTTAAG ||||||*||||||*|||||| ACTTGTGCACAGTCCTTAAG

Brute-force approach ACTTGTGCACAGTCCTTAAG 2-mer position table AA:18

Brute-force approach ACTTGTGCACAGTCCTTAAG ACTTGTGCACAGTCCTTAAG

Brute-force approach Divide the genome into 10 Mb blocks
For all pairs of blocks: For all l-mer matches: Do all pair-wise DPs containing match If ≤ k edits, mark position non-unique 300 x 300 pairs of blocks For 20-mers: k=1 ) l=10; k=2 ) l=6; k=3 ) l=5 ; k=4 ) l=4.

Brute-force approach Things are looking really, really, bad:
Seeds are too short 90,000 pair-wise block comparisons Actually quite good (seed size 12): Non-uniqueness certificates are dense Almost all positions eliminated early Behaves more like linear time than quadratic

In practice (edit-dist 4)

In practice (edit-dist 3)

In practice (edit-dist 4,3,2)

Edit distance 2 After seed size 12 After seed size 8
~ 27K (0.288%) positions have no match After seed size 8 ~ 3K (0.029%) positions have no match Using seed size 6 is still too slow Need a more sophisticated hashing strategy 6-mers match in too many places!

Spaced seed-set design problem
Given: mer-size: m ( = 20 ) # errors: k ( = 1,2,3) # cares: l ( = 10,12,14 ) Find the smallest set of spaced seeds that will find all alignments.

Solution for (20,2,8) 11111111, 111101111 TCCCGCGTAGATTGAGATCT
||||||*||||||*|||||| TCCCGCCTAGATTTAGATCT How can we find these spaced seed set solutions?

Spaced seed set design set-cover formulation
Set cover instance: Ground set: all possible placements of the k errors (alignments) Covering sets: all possible placements of the l care positions For (m=20,k=2,l=10), 190 elements, 184,756 sets! Need to reduce the number of sets!

Dirty secret of spaced seeds
Spaced seeds take O(# cares) to update! Contiguous seeds are O(1) to update vs 8 steps to update vs 1 step to update Constant time update for spaced seeds? Yes, if they have a certain structure

O(1) spaced seed update ACGTACGTACGTACGTACGT A G A G C T C T G A G A
T C T C ... Spaced seed can be updated in 1 step!

O(1) spaced seed update “Periodic” spaced seeds can be updated in “constant” time steps steps step Need to minimize the number of update steps, not the number of templates , has update cost 5.

TCCCGC-TAGATTGAGATCT ||||||v||||||*|||||| TCCCGCCTAGATTTAGATCT
Edit-distance SS-SDP Position of matching bases might shift! Need ↓ to get CCGCTAGA Need ↑ to get CCGCTAGA Set cover formulation no longer works TCCCGC-TAGATTGAGATCT ||||||v||||||*|||||| TCCCGCCTAGATTTAGATCT

r:TCCCGC-TAGATTGAGATCT ||||||v||||||*|||||| q:TCCCGCCTAGATTTAGATCT
Edit-Distance SS-SDP Use a variation on set cover: q: ,r: covers: Pay for query & reference update costs separately Control size of problem by only enumerating templates with small update cost r:TCCCGC-TAGATTGAGATCT ||||||v||||||*|||||| q:TCCCGCCTAGATTTAGATCT

Solution for (20,2,10) Query Templates:
1: Cost: 1 2: Cost: 5 27: Cost: 5 42: Cost: 5 Text Templates: 32: Cost: 5 37: Cost: 5 Pairs of templates: 1: : Covers: 1274 2: : Covers: 260 2: : Covers: 1218 1: : Covers: 309 42: : Covers: 42 27: : Covers: 319 42: : Covers: 186 27: : Covers: 51 42: : Covers: 287

k-unique human 20-mers No 4-unique 20-mers No 3-unique 20-mers
0. 038% of (forward) human 20-mers are 2-unique in total about 1 every 2638 bases Fast 2-uniquness oracle

F. tularensis 20-mer signatures
Exact match in all six strains No match to bacterial background at edit-distance k No 3-unique 20-mer signatures 263 2-unique 20-mer signatures 0.013% 1.3M 20-mer signatures (no background check) 1.2M 0-unique 20-mer signatures 580K 1-unique 20-mer signatures

Conclusions Precompute of human k-unique 20-mers is now feasible!
Faster for large edit-distance! Need spaced seed-set designs Constant time update for spaced seeds Good integer programming formulation of SS-SDP Limited template enumeration based on update cost Work with integer programming experts to solve effectively

Next Steps Publish! Adapt for Tm and/or hybridization model
Convert to native BOINC-application Integrate with primer-design software

Precomputing Edit-Distance Specificity of Short Oligonucleotides

Similar presentations

Presentation on theme: "Precomputing Edit-Distance Specificity of Short Oligonucleotides"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Precomputing Edit-Distance Specificity of Short Oligonucleotides

Similar presentations

Presentation on theme: "Precomputing Edit-Distance Specificity of Short Oligonucleotides"— Presentation transcript:

Similar presentations

About project

Feedback