Tamanna Chhabra, Sukhpal Singh Ghuman, Jorma Tarhio: Tuning Algorithms for Jumbled Matching


1 Tamanna Chhabra, Sukhpal Singh Ghuman, Jorma Tarhio: Tuning Algorithms for Jumbled Matching

2 Jumbled matching An interesting variation of string matching: find all substrings of T that are permutations of P. For example, P = abcb occurs in T = aababcaabc as the substring babc.

3 Jumbled matching Parikh vector: the pattern can be described by its Parikh vector, the vector of multiplicities of the characters. For example, p(S) is (1,2,1,0) for S = abcb over the alphabet {a,b,c,d}.
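As a small illustration of the Parikh vector, the following sketch counts multiplicities in alphabet order. The function name `parikh_vector` and the explicit `alphabet` argument are choices made for this example, and both strings are assumed to contain only characters of that alphabet.

```python
from collections import Counter

def parikh_vector(s, alphabet):
    """Return the multiplicity of each alphabet character in s, in alphabet order.

    Assumption of this sketch: s contains only characters from `alphabet`.
    """
    counts = Counter(s)
    return tuple(counts[c] for c in alphabet)

# Two strings over the same alphabet are permutations of each other
# exactly when their Parikh vectors are equal.
pattern_vector = parikh_vector("abcb", "abcd")
window_vector = parikh_vector("babc", "abcd")
same = pattern_vector == window_vector
```

With this representation, jumbled matching reduces to comparing the vector of each length-m text window with the vector of the pattern.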

4 Approximate Permutation Matching The string P´ is a k-approximate permutation of P, 0 <= k < m, if |P´| = |P| = m and [formula not preserved in this transcript] holds, where set(P´) is the set of characters in P´ and cc(u,c) is the number of occurrences of a character c in a string u.

5 Motivation Alignment of strings. SNP discovery. Discovery of repeated patterns. Interpretation of mass spectrometry data.

6 Previous Algorithms Key idea: scan the text forward while maintaining counts of characters. These algorithms work in linear time. They were developed as filtration methods for online approximate string matching.
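The key idea of scanning forward while maintaining counts can be sketched as a generic sliding-window matcher. This is not the code of any of the cited algorithms; the names `jumbled_matches`, `diff` and `nonzero` are invented for the example. The sketch keeps, for every character, the difference between its count in the window and its count in the pattern, so each text character is handled in constant expected time.

```python
from collections import Counter

def jumbled_matches(text, pattern):
    """Return start positions of windows of text that are permutations of pattern."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    # diff[c] = (count of c in the current window) - (count of c in the pattern)
    diff = Counter(text[:m])
    diff.subtract(Counter(pattern))
    # Number of characters whose window count differs from the pattern count.
    nonzero = sum(1 for v in diff.values() if v != 0)

    def bump(ch, delta):
        nonlocal nonzero
        before = diff[ch]
        diff[ch] = before + delta
        nonzero += (before == 0) - (diff[ch] == 0)

    hits = [0] if nonzero == 0 else []
    for i in range(m, n):
        bump(text[i], +1)       # character entering the window
        bump(text[i - m], -1)   # character leaving the window
        if nonzero == 0:
            hits.append(i - m + 1)
    return hits
```

For P = abcb and T = aababcaabc the only window with all differences at zero starts at position 2 (counting from 0).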

7 Previous Algorithms The solutions of Grossi & Luccio (Information Processing Letters 1989) and Navarro (Proc. WSP 1997) are based on the frequency of characters. Navarro’s counting algorithm uses a sliding-window approach.

8 Previous Algorithms Grossi and Luccio’s (Information Processing Letters 1989) solution maintains a queue of characters, which grows with the acceptable characters. Navarro presented Mcount for multiple patterns (Proc. WSP 1997).

9 Previous Algorithms Cantone and Faro (Proc. PSC 2014) presented the BAM algorithm (Bit-parallel Abelian Matcher). It associates a counter (bin) with each distinct character in P and a single 1-bit counter with the remaining characters of the alphabet.

10 Previous Algorithms At the start of processing a window, every overflow bit is zero. The 1-bit counter reserved for all the characters not occurring in P is initially zero. It is set as soon as any character not in P is encountered in the text window, at which point it is clear that the text window cannot be a permutation of the pattern P.

11 Bit Parallel simulation [Figure: counter fields for c, b, a and for all other characters, P = abbccc]

12 Initialization for state vector [Figure: fields for c, b, a and for all other characters, P = abbccc]

13 Forward Processing

14 Backward Processing

15 New solutions We present solutions for both exact and approximate jumbled matching. Two of the algorithms are modifications of BAM: ABAM (approximate BAM) and BAM2 (enhanced BAM with 2-grams).

16 Key Idea: Counters We use bit fields to store counters: one field for each character that appears in the pattern and one field for all other characters. The highest bit of each field is an overflow indicator. Each field has space to represent the number of times the character appears in the pattern plus the maximum error count k.

17 State Vector D Counters are stored in the state vector D. If they do not fit in one word, we can put several different characters in one field, but then we must verify matches. The initial value of D is fetched from a precomputed word. Each text character t_j is processed using the array entry M[t_j], which has a one in the field for t_j. The value of D is updated by D ← D + M[t_j].

18 Initialization for state vector D and M[ ] for pattern P = abbccc (x stands for all other characters)
I    = 0 0 0 0 0 0 1 0 0
M[a] = 0 0 0 0 0 0 0 0 1
M[b] = 0 0 0 0 0 0 1 0 0
M[c] = 0 0 0 1 0 0 0 0 0
M[x] = 1 0 0 0 0 0 0 0 0
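The counters, the initial word and the update D ← D + M[t_j] can be tried out with the sketch below. The field layout (fields ordered a, b, c from the lowest bit, a 1-bit field for all other characters on top, and an initial value chosen so that the overflow bit is set by the first surplus occurrence) is an assumption of this example; for P = abbccc it gives bit patterns consistent with the initialization slide. The function names are hypothetical, and each window is checked from scratch, so the backward scanning and shifting of BAM are not modelled.

```python
from collections import Counter

def build_tables(pattern):
    """Return (init, M, other_inc, overflow_mask) for the assumed field layout."""
    counts = Counter(pattern)
    init, overflow, shift = 0, 0, 0
    M = {}
    for ch in sorted(counts):
        n = counts[ch]
        value_bits = n.bit_length()              # bits below the overflow bit
        # Start so that the (n + 1)-th occurrence sets the overflow bit.
        init |= ((1 << value_bits) - (n + 1)) << shift
        M[ch] = 1 << shift                       # a one in the field of ch
        overflow |= 1 << (shift + value_bits)    # highest bit of the field
        shift += value_bits + 1
    other_inc = 1 << shift                       # 1-bit field: its only bit is the overflow bit
    overflow |= other_inc
    return init, M, other_inc, overflow

def window_is_permutation(window, tables):
    """Check one window of length m; stop at the first overflow."""
    init, M, other_inc, overflow = tables
    D = init
    for ch in window:
        D += M.get(ch, other_inc)                # D <- D + M[t_j]
        if D & overflow:
            return False                         # some character occurs too often
    return True                                  # m characters and no surplus: a permutation

def find_jumbled(text, pattern):
    m = len(pattern)
    tables = build_tables(pattern)
    return [i for i in range(len(text) - m + 1)
            if window_is_permutation(text[i:i + m], tables)]
```

Because a window has exactly m characters, the absence of any overflow after the whole window implies that every count equals the pattern's count, which is why no separate equality test is needed in this sketch.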

19 Variations of BAM BAMs: some bins are shared if necessary. If bins are shared, each match candidate needs to be verified. BAM2: handles two text characters (a 2-gram) at a time, with separate loops for patterns of even and odd length. It reads four characters before testing D for the first time, hence the minimum width of a field is four bits instead of two.

20 ABAM ABAM: approximate BAM. C is the error counter. F[t_j] is the mask for testing overflow bits.

21 EBL (Exact Backward for Large alphabets) EBL is based on SBNDM2. Instead of representing occurrence vectors, array B states whether a character is present in the pattern. When the alignment window contains only acceptable characters (characters that appear in the pattern), the window is a match candidate. The update step is simply D = D & B[t_{i+j-1}].
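The filtering idea of this slide can be sketched as follows. This is a simplified illustration and not the SBNDM2-based EBL itself: it reads one character at a time instead of 2-grams, models B as a 0/1 membership test, and verifies candidates with plain counting. The name `ebl_like_search` and the verification step are assumptions of the example.

```python
from collections import Counter

def ebl_like_search(text, pattern):
    """Backward filter: skip windows that contain a character absent from the pattern."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    acceptable = set(pattern)
    target = Counter(pattern)

    def B(ch):
        # 1 if ch occurs in the pattern, otherwise 0.
        return 1 if ch in acceptable else 0

    hits = []
    i = 0                                   # start of the alignment window
    while i + m <= n:
        D, j = 1, m
        while j > 0:
            D &= B(text[i + j - 1])         # D = D & B[t_{i+j-1}]
            if D == 0:
                break
            j -= 1
        if D == 0:
            # text[i + j - 1] is not acceptable, so no window covering it can match.
            i += j
        else:
            # Only acceptable characters: the window is a match candidate.
            if Counter(text[i:i + m]) == target:
                hits.append(i)
            i += 1
    return hits
```

On a large alphabet most windows contain an unacceptable character near their right end, so the shift `i += j` skips many positions at once.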

22 EFS (Exact Forward for Small alphabets) AFL (Approximate Forward for Large alphabets) EFS: the update step is D ← D + M[t_i] - M[t_{i-m}]. AFL is a modification of Mcount tuned for a single pattern, with a different initial value of the counter.
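A sliding update of the form D ← D + M[t_i] - M[t_{i-m}] can be illustrated with packed counters. The slide does not give the field layout or the match test, so both are assumptions here: every field is wide enough to hold any count from 0 to m, characters outside the pattern contribute nothing, and a window matches when D equals the packed counts of the pattern. The name `efs_like_search` is hypothetical.

```python
def efs_like_search(text, pattern):
    """Forward scan with one packed word of per-character counters."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    width = m.bit_length()                  # a field can hold any count in 0..m
    chars = sorted(set(pattern))
    M = {ch: 1 << (idx * width) for idx, ch in enumerate(chars)}
    target = sum(M[ch] for ch in pattern)   # packed Parikh vector of the pattern

    # Counters of the first window; characters not in the pattern add 0.
    D = sum(M.get(ch, 0) for ch in text[:m])
    hits = [0] if D == target else []
    for i in range(m, n):
        # D <- D + M[t_i] - M[t_{i-m}]
        D += M.get(text[i], 0) - M.get(text[i - m], 0)
        if D == target:
            hits.append(i - m + 1)
    return hits
```

Since the fields of `target` sum to m, equality with D also guarantees that the window holds no character from outside the pattern.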

23 ABS (Approximate Backward for Small Alphabets) The error count C is updated without conditional code by shifting the corresponding overflow bit to the lowest bit and then masking it. The shift uses array o[ ], which contains the positions of the overflow bits.
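The branch-free update of C might look like the sketch below. Several things are assumptions of the example and not taken from the slides: an error is counted for every occurrence of a character beyond its multiplicity in the pattern, each field is sized so that up to k + 1 surplus occurrences cannot carry into the next field, each window is checked from scratch in forward order, and the names `build_tables` and `approx_jumbled` are hypothetical.

```python
from collections import Counter

def build_tables(pattern, k):
    """Return (init, M, o, (other_inc, other_pos)) for the assumed layout."""
    counts = Counter(pattern)
    init, shift = 0, 0
    M, o = {}, {}

    def add_field(n):
        nonlocal init, shift
        value_bits = max(n, k).bit_length()         # room for n hits plus k errors
        init |= ((1 << value_bits) - (n + 1)) << shift
        inc, pos = 1 << shift, shift + value_bits   # increment and overflow-bit position
        shift = pos + 1
        return inc, pos

    for ch in sorted(counts):
        M[ch], o[ch] = add_field(counts[ch])
    other = add_field(0)                            # field for all other characters
    return init, M, o, other

def approx_jumbled(text, pattern, k):
    """Start positions of windows with at most k surplus characters (assumed error model)."""
    m = len(pattern)
    init, M, o, (other_inc, other_pos) = build_tables(pattern, k)
    hits = []
    for i in range(len(text) - m + 1):
        D, C = init, 0
        for ch in text[i:i + m]:
            D += M.get(ch, other_inc)
            # Shift the overflow bit of this character's field to bit 0 and mask it.
            C += (D >> o.get(ch, other_pos)) & 1
            if C > k:
                break
        if C <= k:
            hits.append(i)
    return hits
```

Once a field has overflowed, its overflow bit stays set, so every further occurrence of that character adds one more error without any comparison on the counter itself.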

24 Execution times of algorithms (in seconds) for English data

25 Execution times of algorithms (in seconds) for DNA data

26 Execution times of algorithms (in seconds) for protein data

27 Experimental Results English data: BAM2a is more than twice as fast as the previous algorithms. DNA data: EFS runs at double the speed of the previous algorithms. Protein data: BAM2a is the fastest and takes less than half the time of the previous algorithms.

28 Concluding remarks We introduced new variations of jumbled matching algorithms. All the forward algorithms are clearly linear. The speed of AFL does not depend on the value of k. The technique of shared bins proved to be useful for jumbled matching.

29 THANK YOU

