Presentation is loading. Please wait.

Presentation is loading. Please wait.

Precomputing Edit-Distance Specificity of Short Oligonucleotides Nathan Edwards Center for Bioinformatics and Computational Biology University of Maryland,

Similar presentations


Presentation on theme: "Precomputing Edit-Distance Specificity of Short Oligonucleotides Nathan Edwards Center for Bioinformatics and Computational Biology University of Maryland,"— Presentation transcript:

1 Precomputing Edit-Distance Specificity of Short Oligonucleotides Nathan Edwards Center for Bioinformatics and Computational Biology University of Maryland, College Park

2 2 Polymerase Chain Reaction

3 3

4 4 Primer Specificity Need to ensure that primers hybridize to a specific (specified) locus only Require exactly one occurrence of specified sequence Require no (potential) mis-hybridization loci Bottleneck computation in primer-design Design / check iteration is problematic

5 5 k-unique 20-mers Edit-distance as a surrogate for mis- hybridization potential k-unique loci: All non-self genomic loci are require more than k edits in (global) alignment Closest non-self genomic loci requires (k+1) edits in (global) alignment

6 6 Find all k-unique 20-mers Naïve algorithm: O(n 2 km) Quadratic in size of genome. 0-unique (exact match) 20-mers (Expected) linear time algorithm Achieve expected linear time using a hybrid approach (blastn): Use partial exact match to “seed” expensive dynamic programming alignment Large chunks ) Fast, but miss occurrences Small chunks ) Slow, but correct

7 7 Baeza-Yates Perleberg: Correct and O(n) for small k At least 1 chunk is observed with no error. Small k → Large chunks → Fast and correct Largest correct chunk: floor(m/(k+1)) Inexact sequence match ≠ = ≠ q g

8 8 Example worst case alignments TCCCGC-TAGATTGAGATCT ||||||v||||||*|||||| TCCCGCCTAGATTTAGATCT ACTTGTCCACAGTGCTTAAG ||||||*||||||*|||||| ACTTGTGCACAGTCCTTAAG

9 9 Brute-force approach ACTTGTGCACAGTCCTTAAG AA:18 AC:1,9 AG:11,19 CA:8,10 CC:14 CT:2,15 GC:7 GT:5,12 TA:17 TC:13 TG:4,6 TT:3,16 2-mer position table

10 10 Brute-force approach ACTTGTGCACAGTCCTTAAG

11 11 Brute-force approach ACTTGTGCACAGTCCTTAAG

12 12 Brute-force approach ACTTGTGCACAGTCCTTAAG

13 13 Brute-force approach ACTTGTGCACAGTCCTTAAG

14 14 Brute-force approach ACTTGTGCACAGTCCTTAAG

15 15 Brute-force approach ACTTGTGCACAGTCCTTAAG

16 16 Brute-force approach Divide the genome into 10 Mb blocks For all pairs of blocks: For all l-mer matches: Do all pair-wise DPs containing match If ≤ k edits, mark position non-unique 300 x 300 pairs of blocks For 20-mers: k=1 ) l=10; k=2 ) l=6; k=3 ) l=5 ; k=4 ) l=4.

17 17 Brute-force approach Things are looking really, really, bad: Seeds are too short 90,000 pair-wise block comparisons Actually quite good (seed size 12): Non-uniqueness certificates are dense Almost all positions eliminated early Behaves more like linear time than quadratic

18 18 In practice (edit-dist 4)

19 19 In practice (edit-dist 4)

20 20 In practice (edit-dist 4)

21 21 In practice (edit-dist 3)

22 22 In practice (edit-dist 3)

23 23 In practice (edit-dist 4,3,2)

24 24 In practice (edit-dist 4,3,2)

25 25 Edit distance 2 After seed size 12 ~ 27K (0.288%) positions have no match After seed size 8 ~ 3K (0.029%) positions have no match Using seed size 6 is still too slow Need a more sophisticated hashing strategy 6-mers match in too many places!

26 26 Spaced seed-set design problem Given: mer-size: m ( = 20 ) # errors: k ( = 1,2,3) # cares: l ( = 10,12,14 ) Find the smallest set of spaced seeds that will find all alignments.

27 27 Solution for (20,2,8) 11111111, 111101111 TCCCGCGTAGATTGAGATCT ||||||*||||||*|||||| TCCCGCCTAGATTTAGATCT How can we find these spaced seed set solutions? One/two table? 2 x false positives

28 28 Spaced seed set design set- cover formulation Set cover instance: Ground set: all possible placements of the k errors (alignments) Covering sets: all possible placements of the l care positions For (m=20,k=2,l=10), 190 elements, 184,756 sets! Need to reduce the number of sets!

29 29 Dirty secret of spaced seeds Spaced seeds take O(# cares) to update! Contiguous seeds are O(1) to update 101010101010101 vs 11111111 8 steps to update vs 1 step to update Constant time update for spaced seeds? Yes, if they have a certain structure(s) Restrict spaced seeds to small update cost

30 30 O(1) spaced seed update ACGTACGTACGTACGTACGT 1: A G A G 2: C T C T... Spaced seed 1010101 can be updated in 1 step!

31 31 O(1) spaced seed update ACGTACGTACGTACGTACGT ACGTACG -> ACGACG CGTACGT -> CGTCGT... Spaced seed 1110111 can be updated in 1 step!

32 32 O(1) spaced seed update “Period” step update 11001100110011 2 steps 1000010000100001 1 step “Runs” step update 11100111111 1 step 11101110111 2 steps Minimize the number of update steps Weighted set-cover instance…

33 33 Edit-distance SS-SDP Position of matching bases might shift! Need 11111111 ↓ to get CCGCTAGA Need 111101111 ↑ to get CCGCTAGA Set cover formulation no longer works TCCCGC-TAGATTGAGATCT ||||||v||||||*|||||| TCCCGCCTAGATTTAGATCT

34 34 Edit-Distance SS-SDP Use a variation on set cover: q:111101111,r:11111111 covers: Pay for query & reference update costs separately Control size of problem by only enumerating templates with small update cost r:TCCCGC-TAGATTGAGATCT ||||||v||||||*|||||| q:TCCCGCCTAGATTTAGATCT

35 35 Correct solutions for 1-unique 20-mers 10 7 random sequence x 10 7 random sequence Seed: 1111111111 (Best single seed solution, weight 10) ~ 9.5 expensive dynamic programs per locus Seed set: 11111111111;111110111111 (weight 11) ~ 8.9 DP/locus Seed set (weight 11) ~ 7.8 DP/locus Seed set: 111111111111;1111101111111 (weight 12) ~ 2.2 DP/locus 11111111111 ~ 111101111111 111101111111 ~ 11111111111 111101111111 ~ 111101111111

36 36 Correct solutions for 1-unique 20-mers 10 7 random sequence x 10 7 random sequence Seed set: 111111111111;1111101111111 (weight 12) ~ 2.5 DP/locus Seed set (weight 12) ~ 1.8 DP/locus Seed set: 1111111111111;11111101111111 (weight 13) ~ 0.56 DP/locus (same specificity as contiguous seed weight 12) Seed set (weight 14) ~ 0.20 1111110111111 x 111111111111 1111110111111 111111001111111 11111111111111 ~ 11111111111111 111110111111111 ~ 11111111111111 111111111110111 ~ 11111111111111 111111111110111 ~ 111111111110111 11111111111111 ~ 111111101111111 111110111111111 ~ 111110111111111

37 Correct solutions for 2-unique 20-mers Seed: 111111 (Best single seed solution, weight 6) ~ 2439 DP/locus Weight 10 – 73 DP/locus (specificity of 8/9 contig seed) Weight 12 – 10 DP/locus (specificity of 10 contig seed) 37 1111111111 ~ 1111111111 11111011111 ~ 1111111111 11111011111 ~ 11111011111 1111100000011111 ~ 1111100000011111 11111000000011111 ~ 1111100000011111 111110000000011111 ~ 1111100000011111 111110000011111 ~ 1111100000011111 111110000000011111 ~ 11111000000000011111

38 38 k-unique human 20-mers No 4-unique 20-mers No 3-unique 20-mers 0. 038% of (forward) human 20-mers are 2-unique 1088322 in total about 1 every 2638 bases Fast 2-uniquness oracle

39 Genome Browser Track Edit Distance: UCSC Track: 1-unique 20-mers UCSC Track: 2-unique 20-mers 39

40 40 Integration with High Performance Computing Break sequence into chunks of size 10 7. Remember which positions have been eliminated. Integrated with (UMIACS) Condor Too unreliable for very large sequences NFS filesystem is unreliable Simultaneous jobs capped at ~ 300 Integrated with hadoop/map-reduce on 80 nodes (Edwards Lab) Reliability improved, DFS solves (some) filesystem woes Much better scalability (in theory, yet to be tested) Explicit synchronization of reduce step is undesirable.

41 41 Other improvements Other groupings are possible: Species designation on FASTA defline can be any suitable partition Constraints on the position of edits: False positive due to mishybridization at 3’ end is unlikely to be observed with some technologies Constraint on valid Tm range: Computed as in Primer3 Can eliminate undesirable mers early

42 42 Conclusions Precompute of human k-unique 20-mers is now feasible! Faster for large edit-distance! Need spaced seed-set designs Constant time update for spaced seeds Good integer programming formulation of SS- SDP Limited template enumeration based on update cost Work with integer programming experts to solve effectively


Download ppt "Precomputing Edit-Distance Specificity of Short Oligonucleotides Nathan Edwards Center for Bioinformatics and Computational Biology University of Maryland,"

Similar presentations


Ads by Google