
Kazunori Hirashima 1, Hideo Bannai 1, Wataru Matsubara 2, Kazuhiko Kusano 2, Akira Ishino 2, Ayumi Shinohara 2 1 Kyushu University, Japan 2 Tohoku University,


1 Kazunori Hirashima¹, Hideo Bannai¹, Wataru Matsubara², Kazuhiko Kusano², Akira Ishino², Ayumi Shinohara²
¹ Kyushu University, Japan
² Tohoku University, Japan

2
• Runs
• Bit-parallel algorithms for counting runs
  ◦ Counting prefix runs
  ◦ Removing duplicate runs by position
  ◦ Removing duplicates by Sieve
• Computational Experiments
• Conclusion

3
• runs: occurrences of a periodic factor that are
  ◦ non-extendable (maximal)
  ◦ of exponent at least two
  ◦ primitive-rooted
• example: w = abbabbaccbcbcbc
• run(w): number of runs in string w
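The definition above can be checked directly on the example string. The following Python sketch is a plain quadratic-time enumeration, not one of the bit-parallel algorithms of this talk; the function name `runs` and the 0-based `(start, end, period)` tuples are assumptions made for illustration.

```python
def runs(w):
    """Return all runs of w as (start, end, period), 0-based and inclusive."""
    n = len(w)
    found = set()
    for p in range(1, n // 2 + 1):
        i = 0
        while i + p < n:
            if w[i] != w[i + p]:
                i += 1
                continue
            # Extend the maximal stretch of positions k with w[k] == w[k + p].
            j = i
            while j + p < n and w[j] == w[j + p]:
                j += 1
            # The repetition is w[i .. j + p - 1]; exponent >= 2 iff j - i >= p.
            if j - i >= p:
                found.add((i, j + p - 1, p))
            i = j + 1
    # The same interval can show up for multiples of its minimum period;
    # keep only the smallest period per interval.
    best = {}
    for s, e, p in found:
        if (s, e) not in best or p < best[(s, e)]:
            best[(s, e)] = p
    return sorted((s, e, p) for (s, e), p in best.items())


print(len(runs("abbabbaccbcbcbc")))  # run(w) for the example string
```

For w = abbabbaccbcbcbc this finds five runs: abbabba (period 3), bb, bb, cc, and cbcbcbc (period 2).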

4
• Linear-time algorithm [Kolpakov & Kucherov ’99]
  ◦ requires the LZ-factorization of the string
• We present 3 bit-parallel algorithms to calculate run(w)
  ◦ they do not require complicated data structures
  ◦ they are very efficient for short strings

5
• Runs
• Algorithms
  ◦ Counting prefix runs
  ◦ Removing duplicate runs by position
  ◦ Removing duplicates by Sieve
• Computational Experiments
• Discussion

6
For general alphabets:
• Counting prefix runs
For binary alphabets:
• Removing duplicate runs by position
• Removing duplicate runs by Sieve

7
prefix repetition = a repetition that is also a prefix
prefix run = a run that is also a prefix
• Idea
For each suffix:
1. detect right-maximal prefix repetitions of each period
2. count only repetitions with exponent at least 2
3. count only left-maximal repetitions

8
Detect right-maximal prefix repetitions of each period
example: w = aabaabaaaacaac (positions 1 to 14)

occ      aabaabaaaacaac
occ[a]   11011011110110
occ[b]   00100100000000
occ[c]   00000000001001

period   aabaabaaaacaac
1        11000000000000
2        11000000000000
3        11111111000000
4        11111000000000
5        11111000000000
6        11111111000000
7        11111111100000

period   aabaabaaaacaac
1        aa------------
2        --------------
3        aabaabaa------
4        --------------
5        --------------
6        --------------
7        --------------

Labels in the figure: prefix run, ActiveArea; w[1]=w[4], w[2]=w[4]

9
Detect right-maximal prefix repetitions of each period
example: w = aabaabaaaacaac

pseudo code:
nextChar = w[i];
bitmask = (occ[nextChar] >> (Length - i)) | ((~0) << i);
alive = alive & bitmask;

[Figure: for periods 1 to 7, the bitmask obtained from occ[nextChar] shifted by Length - i is ANDed into alive as each character of w is read.]

10
Count only repetitions with exponent at least 2
example: w = aabaabaaaacaac

pseudo code:
nextChar = w[i];
bitmask = (occ[nextChar] >> (Length - i)) | ((~0) << i);
prevAlive = alive;
alive = alive & bitmask;
If ((prevAlive ^ alive) & ActiveArea) ≠ 0 then count++;
If i mod 2 = 1 then ActiveArea := (ActiveArea << 1) | 1;

period   aabaabaaaacaac
1        11000000000000
2        11000000000000
3        11111111000000
4        11111000000000
5        11111000000000
6        11111111000000
7        11111111100000

ActiveArea grows as the prefix gets longer: 1 0 0 0 0 0 0, then 1 1 0 0 0 0 0. Only the bits of prevAlive ^ alive inside ActiveArea are counted.

11
Count only left-maximal repetitions
example: w = aabaabaaaacaac (positions 1 to 14)

occ      aabaabaaaacaac
occ[a]   11011011110110
occ[b]   00100100000000
occ[c]   00000000001001

Suffix starting at position 3, with position 2 compared against each period:
period   aabaabaaaacaac
1        -0100000000000
2        -1110000000000
3        -1111111000000
4        -0111100000000
5        -1111110000000
6        -1111111000000
7        -1111111100000

Suffix starting at position 3:
period   aabaabaaaacaac
1        --100000000000
2        --110000000000
3        --111111000000
4        --111100000000
5        --111110000000
6        --111111000000
7        --111111100000

After removing the periods that can be extended to the left:
period   aabaabaaaacaac
1        --100000000000
2        --000000000000
3        --000000000000
4        --111100000000
5        --000000000000
6        --000000000000
7        --000000000000

w[2]=w[2+3], w[2]=w[2+2], w[2]≠w[2+1]
w[3:8] seems to be a run, but it can be extended to the left, so w[3:8] is not a run.
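The three steps of the prefix-run idea (right-maximal detection, exponent at least 2, left maximality) can be put together as follows. This is a minimal Python sketch under assumptions of its own: positions are 0-based, bit p-1 of `alive` stands for period p, and two occurrence tables (`occ`, and a mirrored `occ_rev`) are used so that every step is a shift and an AND. The bit layout on the slides may differ.

```python
def count_runs_prefix(w):
    """Count the runs of w by counting, for every suffix, its prefix runs."""
    n = len(w)
    occ, occ_rev = {}, {}
    for k, ch in enumerate(w):
        occ[ch] = occ.get(ch, 0) | (1 << k)                    # bit k     <-> w[k] == ch
        occ_rev[ch] = occ_rev.get(ch, 0) | (1 << (n - 1 - k))  # bit n-1-k <-> w[k] == ch
    count = 0
    for start in range(n):
        m = n - start            # length of the current suffix
        half = m // 2            # largest period that can reach exponent 2
        if half == 0:
            break
        alive = (1 << half) - 1  # bit p-1 set: period p still holds for the prefix
        if start > 0:
            # Left maximality: drop every p with w[start-1] == w[start-1+p].
            alive &= ~(occ[w[start - 1]] >> start)
        for i in range(2, m + 1):          # extend the prefix to length i
            c = w[start + i - 1]
            # Bit p-1: w[start+i-1-p] == c for p < i; forced to 1 for p >= i.
            bitmask = (occ_rev[c] >> (m - i + 1)) | (-1 << (i - 1))
            prev_alive = alive
            alive &= bitmask
            active = (1 << ((i - 1) // 2)) - 1   # periods p with 2p <= i-1
            if (prev_alive ^ alive) & active:
                count += 1       # the prefix of length i-1 is a run
            if alive == 0:
                break
        else:
            if alive:
                count += 1       # a run that reaches the end of w
    return count


print(count_runs_prefix("aabaabaaaacaac"))
```

All periods that die at the same step inside the active area belong to the same prefix repetition, which is why a single `count += 1` per step is enough.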

12
Idea
1. detect maximal repetitions for each period 1, 2, ..., |w|/2
2. count only repetitions with exponent at least 2
3. count only repetitions of minimum period

13
Detect maximal repetitions for each period 1, 2, ..., |w|/2.
• v = w ^ ((~w) >> p)
• Example p = 3
  w = 01101100101
  v = w XOR ((~w) >> p) = 11110110
• maximal repetition of period p in w ⇔ stretch of 1's in v
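The slide's example can be reproduced with arbitrary-precision integers. In the sketch below, which is an illustration rather than the authors' code, the first character of the string is the most significant bit; the helper name `period_mask` and the two explicit masks are assumptions needed because Python integers have no fixed word size.

```python
def period_mask(w: int, n: int, p: int) -> int:
    """Bit vector with a 1 wherever a character equals the one p positions before it."""
    full = (1 << n) - 1        # n one-bits: keeps ~w inside the n-bit word
    keep = (1 << (n - p)) - 1  # the first p characters have no partner to compare with
    return (w ^ ((~w & full) >> p)) & keep


v = period_mask(0b01101100101, 11, 3)
print(format(v, "08b"))  # prints 11110110
```

A stretch of k consecutive 1's in the result corresponds to a maximal repetition of period p and length k + p.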

14
Delete repetitions with exponent less than 2.
• v = w ^ ((~w) >> p)
• Example p = 3
  w = 01101100101
  v = 11110110
A stretch of 1's must have length at least p = 3.
• first stretch: length 4 = 7 - 3, a repetition of length 7
• second stretch: length 2 = 5 - 3, a repetition of length 5; this is too short to be a run of period p = 3

15
Delete repetitions with exponent less than 2.
s = v;
While (p > 1)
  p--;
  s = s & (v >> p);
END
v = s;
This calculation ANDs v with its copies shifted by 1, 2, ..., p - 1 and so shortens each stretch of 1's by p - 1.
Example: v = 011110010 ANDed with its copies shifted by 1, 2 and 3 gives 000010000.

16
Delete repetitions with exponent less than 2. O(p) → O(log p).
• selfAND(v, p)
  While p > 1
    s = p >> 1;
    v = v & (v >> s);
    p = p - s;
  END
• Example: v = 00111111110010, p = 7
[Figure: the stretch of eight 1's is shortened step by step, with s = 3, 2, 1, until two 1's remain.]
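Both shortening loops are easy to compare side by side. The sketch below is a direct Python transcription; `self_and_naive` is a name chosen here for the O(p) loop of the previous slide, written with the shifts 1, ..., p - 1.

```python
def self_and_naive(v: int, p: int) -> int:
    """AND v with its copies shifted by 1 .. p-1: O(p) word operations."""
    s = v
    for k in range(1, p):
        s &= v >> k
    return s


def self_and(v: int, p: int) -> int:
    """Same result with O(log p) word operations, halving the remaining work."""
    while p > 1:
        s = p >> 1
        v &= v >> s   # shortens every stretch of 1's by s
        p -= s        # p - 1 - s positions are still to be removed
    return v


print(format(self_and(0b00111111110010, 7), "014b"))  # prints 00000000110000
```

The stretch of eight 1's keeps 8 - (p - 1) = 2 bits, and the isolated 1 disappears.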

17
• Example: w = 00110011111111, p = 4
  ◦ v = w ^ ((~w) >> p) = 000011110001111
  ◦ selfAND(v, p) = 000000010000001
One of the remaining 1's comes from a run with minimum period 1. We need to remove duplicates.
2 approaches to remove repetitions of non-minimum periods:
• Removing duplicates by position
• Removing duplicates by Sieve

18
For period = 1 to length/2 do
  v = (w ^ ((~w) >> period)) & (1^{length} >> period);
  x = SelfAND(v, period);
  While x ≠ 0 do
    begPos = lsb(x);
    y = x + (1 << begPos);
    x = x & y;
    y = y & (-y);
    y = y << ((period - 1) << 1);
    If (runEndsByBegPos[begPos] & y) = 0 then count++;
    runEndsByBegPos[begPos] = runEndsByBegPos[begPos] | y;
  End
End
(1^{length} denotes a word of length consecutive 1 bits.)

Example:
w             = 1001010101011
w^((~w)>>2)   = 0000111111110
w^((~w)>>4)   = 0000001111110
The repetitions found for periods 2 and 4 have the same begin position and end position. Only count maximal repetitions with different begin and end positions.
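The position-based pseudo code translates almost line by line. The following Python sketch assumes that the first character of the binary string is the most significant bit of `w`, and uses a list `seen` in place of runEndsByBegPos; both choices, and the function names, are illustrative assumptions.

```python
def self_and(v: int, p: int) -> int:
    """Shorten every stretch of 1's in v by p - 1."""
    while p > 1:
        s = p >> 1
        v &= v >> s
        p -= s
    return v


def count_runs_by_position(w: int, n: int) -> int:
    """Count the runs of the n-bit binary string w, deduplicating by (begin, end)."""
    full = (1 << n) - 1
    seen = [0] * n   # seen[lo] has bit hi set once the interval [lo, hi] was counted
    count = 0
    for period in range(1, n // 2 + 1):
        v = (w ^ ((~w & full) >> period)) & ((1 << (n - period)) - 1)
        x = self_and(v, period)
        while x:
            low = x & -x                     # lowest bit of the lowest stretch
            lo = low.bit_length() - 1
            y = x + low                      # the carry clears the lowest stretch
            x &= y                           # remove that stretch from x
            # Bit just above the stretch, moved to the far end of the repetition.
            top = (y & -y) << ((period - 1) << 1)
            if not seen[lo] & top:
                count += 1
                seen[lo] |= top
    return count


print(count_runs_by_position(0b00110011111111, 14))  # prints 5
```

Because periods are visited in increasing order, an interval is first met with its minimum period and every later occurrence with a multiple of that period is skipped.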

19
For period = 1 to length/2 do
  pvec[period] = w ^ ((~w) >> period);
End
For period = 1 to length/2 do
  x = SelfAND(pvec[period], period);
  count = count + oneRuns(x);
  For p = 2*period to length/2 step period do
    x = x & (x >> period);
    If x = 0 then break
    pvec[p] = pvec[p] ^ x;
  End
End

Example: w = 11110101010, w^((~w)>>1) = 01110000000
[Figure: the vectors w^((~w)>>1), w^((~w)>>2), w^((~w)>>3) and w^((~w)>>4) (bit rows 00110011111, 00010010000, 00000011111, 00000001111 in the figure) are XORed with the sieve vectors (00000000000, 00000010000) to delete runs in larger periods.]
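A compact way to see the sieve at work is the Python sketch below. It assumes the first character is the most significant bit, masks each period vector to its n - period meaningful bits, and clears bits with `& ~x` (equivalent to the XOR of the pseudo code, since x only marks bits that are set). The helper names are illustrative.

```python
def self_and(v: int, p: int) -> int:
    """Shorten every stretch of 1's in v by p - 1."""
    while p > 1:
        s = p >> 1
        v &= v >> s
        p -= s
    return v


def one_runs(v: int) -> int:
    """Number of stretches of consecutive 1's in v."""
    count = 0
    while v:
        v &= (v | (v - 1)) + 1   # clears the lowest stretch
        count += 1
    return count


def count_runs_by_sieve(w: int, n: int) -> int:
    """Count the runs of the n-bit binary string w, sieving out non-minimum periods."""
    full = (1 << n) - 1
    half = n // 2
    pvec = [0] * (half + 1)
    for period in range(1, half + 1):
        pvec[period] = (w ^ ((~w & full) >> period)) & ((1 << (n - period)) - 1)
    count = 0
    for period in range(1, half + 1):
        x = self_and(pvec[period], period)
        count += one_runs(x)
        for p in range(2 * period, half + 1, period):
            x &= x >> period      # what survives for the next multiple
            if x == 0:
                break
            pvec[p] &= ~x         # delete the same repetitions at period p
    return count


print(count_runs_by_sieve(0b00110011111111, 14))  # prints 5
```

After sieving, whatever is left of a duplicate stretch in pvec[p] is shorter than p, so the later SelfAND step removes it.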

20
Bit operations to count the number of stretches of 1's
• count = 0;
  While (v ≠ 0)
    v = v & ((v | (v - 1)) + 1);
    count++;
  END
• Example v = 100111011
  ◦ v | (v - 1) = 100111011
  ◦ (v | (v - 1)) + 1 = 100111100
  ◦ v & ((v | (v - 1)) + 1) = 100111000
  ◦ v | (v - 1) = 100111111
  ◦ v & ((v | (v - 1)) + 1) = 100000000
  ◦ v & ((v | (v - 1)) + 1) = 000000000
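The loop can be traced step by step in Python. The function name `count_stretches` and the returned trace are assumptions added for illustration; the update itself is the one from the slide.

```python
def count_stretches(v: int):
    """Count stretches of consecutive 1's; also return v after each step."""
    count = 0
    trace = []
    while v:
        # v | (v - 1) fills the zeros below the lowest stretch with 1's;
        # adding 1 carries through that stretch, and the AND erases it.
        v &= (v | (v - 1)) + 1
        trace.append(v)
        count += 1
    return count, trace


count, trace = count_stretches(0b100111011)
print(count, [format(t, "09b") for t in trace])
```

Each iteration removes exactly one stretch, so the number of iterations equals the number of stretches (three for 100111011).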

21
• Runs
• Algorithms
  ◦ Counting prefix runs
  ◦ Removing duplicate runs by position
  ◦ Removing duplicates by Sieve
• Computational Experiments
• Discussion

22
Calculate run(w) for all binary strings of length n
• CPU: 3.2 GHz dual-core Xeon
• GPU: GeForce 8800 GT
• Memory: 18 GB
• OS: Mac OS X 10.5 Leopard

23
count = 0
For period = 1 to length/2 do
  pvec[period] = w ^ ((~w) >> period);
End
For period = 1 to length/2 do
  x = SelfAND(pvec[period], period);
  count = count + oneRuns(x);
  For p = 2 * period to length/2 step period do
    x = x & (x >> period);
    If x = 0 then break
    pvec[p] = pvec[p] & (~x);
  End
End

[Figure: a GPU made of multiprocessors, each containing stream processors.]
Use the programming tool CUDA

24
Running time (seconds) for calculating run(w) for all binary strings of length n

n               20    21    22    23    24    25    26    27    28     29     30
prefix          0.32  0.69  1.49  3.13  8.02  15.6  32.4  66.4  150.2  296.5  625.4
position        0.09  0.18  0.36  0.73  1.49  3.0   6.2   12.6  25.6   52.1   106.0
sieve           0.10  0.18  0.37  0.75  1.50  3.0   6.0   11.9  23.9   48.1   96.7
GPGPU (sieve)   0.01  0.02  0.04  0.08  0.18  0.4   0.7   1.4   3.0    5.9    12.2

25
The maximum number of runs function ρ(n) = max { run(w) : |w| = n } for binary strings, calculated for n up to 47.
Labels in the table: Kolpakov & Kucherov ’99, New!
Values of ρ(n) as extracted, row by row:
n = 2 to 14:  1122345567889
n = 15 to 27: 10 12131415 161718192021
n = 28 to 40: 222324252627 282930 3132
n = 41 to 47: 3335 363738

26
[Figure: bounds on ρ(n), on an axis from 0n to 5n with a zoom on the range 0.90n to 1.05n]
Upper bounds:
• cn [Kolpakov & Kucherov ’99]
• 5n [Rytter ’06]
• 3.48n [Puglisi et al. ’08]
• 3.44n [Rytter ’07]
• 1.6n [Crochemore & Ilie ’08]
• 1.029n [Crochemore et al. ’08]
Lower bounds:
• 0.927n [Franek & Simpson ’06]
• 0.944565n [Matsubara et al. ’08]
• 0.94457571235n [Matsubara et al. ’09] [Simpson ’09]

27
f(n, 2) = f(n - 2, 2) + 72 for n ≥ 9.
f(n, 3) = 2f(n - 2, 3) - f(n - 4, 3) + 234 for n ≥ 16.

Values for n = 2 to 22 (sequences as listed on the slide):
f(n, 1): 2 6 14 18 20
f(n, 2): 0 2 14 38 66 98 138 170 210 242 282 314 354 386 426 458 498 530 570 602
f(n, 3): 0 8 38 102 202 376 596 880 1220 1622 2080 2598 3174 3808 4502 5252 6064 6930
f(n, 4): 0 4 34 130 306 682 1314 2296 3736 5686 8260 11562 15642 20626 26574 33590 41754

n    f(n, 1)  f(n, 2)  f(n, 3)  f(n, 4)
23   20       642      7860     51184
24   20       674      8842     61898
25   20       714      9890     74070
26   20       746      10988    87732
27   20       786      12154    103000
28   20       818      13368    119922
29   20       858      14652    138664
30   20       890      15982    159216
31   20       930      17384    181764
32   20       962      18830    206308
33   20       1002     20350    233012
34   20       1034     21912    261896
35   20       1074     23550    293138
36   20       1106     25228    326696
37   20       1146     26984    362804
38   20       1178     28778    401434
39   20       1218     30652    442762
40   20       1250     32562    486776
41   20       1290     34554    533702
42   20       1322     36580    583470
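The two recurrences can be checked against the tabulated values. The Python sketch below only uses the rows for n = 23 to 42 copied from the table; the list and function names are assumptions made for illustration.

```python
FIRST_N = 23  # the lists below hold the values for n = 23, 24, ..., 42

F2 = [642, 674, 714, 746, 786, 818, 858, 890, 930, 962,
      1002, 1034, 1074, 1106, 1146, 1178, 1218, 1250, 1290, 1322]

F3 = [7860, 8842, 9890, 10988, 12154, 13368, 14652, 15982, 17384, 18830,
      20350, 21912, 23550, 25228, 26984, 28778, 30652, 32562, 34554, 36580]


def holds_f2() -> bool:
    """f(n, 2) = f(n - 2, 2) + 72 on every row where f(n - 2, 2) is listed."""
    return all(F2[k] == F2[k - 2] + 72 for k in range(2, len(F2)))


def holds_f3() -> bool:
    """f(n, 3) = 2 f(n - 2, 3) - f(n - 4, 3) + 234 on every row with enough history."""
    return all(F3[k] == 2 * F3[k - 2] - F3[k - 4] + 234 for k in range(4, len(F3)))


print(holds_f2(), holds_f3())  # prints True True
```

For instance, f(27, 3) = 2 · 9890 - 7860 + 234 = 12154, which matches the table.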

28
• We presented 3 bit-parallel algorithms for efficiently computing all the runs in short strings.
  ◦ O(n²) time if n = O(word size)
  ◦ The first algorithm can be used for strings over larger alphabets at some cost
  ◦ The two latter algorithms are specialized for binary strings* and very efficient
  * We recently noticed that they can be adapted to handle larger alphabets
• Calculated ρ(n) for binary strings of length up to n = 47

