Overcoming Computer Word Size Limitation in Bit-parallel Pattern Matching. M. Oğuzhan Külekci, TÜBİTAK - UEKAE. 5/2/2009, LSD&LAW'09, King's College, London, UK

1 Overcoming Computer Word Size Limitation in Bit-parallel Pattern Matching. M. Oğuzhan Külekci, TÜBİTAK-UEKAE National Research Institute of Electronics & Cryptology, Turkey. kulekci@uekae.tubitak.gov.tr, www.busillis.com/o_kulekci

2 The area of research. Pattern Matching: On-line / Off-line; Exact / Approximate; Using Bit-parallelism / Other techniques...

3 Bit-parallelism? Computers perform bitwise operations very fast. Bit-parallelism means designing algorithms that benefit from this intrinsic property of processors.

4 Previous bit-parallel pattern matching algorithms
Shift-or algorithm
– Shift-or (SO) (Baeza-Yates & Gonnet, 1992)
– Fast (FSO), Average optimal (AOSO), and Fast AOSO (FAOSO) (Fredriksson & Grabowski, 2005)
BNDM algorithm
– BNDM, actually BDM + SO (Navarro & Raffinot, 2000)
– SBNDM (Peltola & Tarhio, 2003)
– SBNDM2 (Holub & Durian, 2005)

5 Problems in Bit-parallel Pattern Matching
Lack of a shift mechanism (in the original idea)
– BNDM solved it.
– Recent SO variants (AOSO, FAOSO) also include shift mechanisms.
Patterns are required to be no longer than the computer word size!

6 What causes that limitation? The way that the bits are used! In previous approaches, each bit marks the position of a character in the pattern. If the pattern is longer than the computer word size, more words are needed, which causes a significant drop in efficiency.

7 Bitmasks in previous algorithms
Mask creation in BNDM (SO is similar), with the pattern indexed from 1 here:
    unsigned long B[ALPHABET_SIZE];
    for (a ∈ Σ) B[a] = 0;
    for j = 1..m: B[p_j] = B[p_j] | (1 << (m-j));
Bits in mask B[c] express the locations of character c in the pattern,
– e.g. for pattern P = abaab: B[a] = 0....10110, B[b] = 0....01001
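
The mask creation above can be sketched in plain C as follows. This is a minimal illustration, not the authors' code: the function name `build_bndm_masks` and the byte alphabet of 256 symbols are assumptions, and the pattern is indexed from 0, so the bit for `pattern[j]` is `m-1-j`. The assertion makes the word-size limitation explicit.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define ALPHABET_SIZE 256   /* assumed byte alphabet */

/* Build BNDM-style masks: bit (m-1-j) of B[c] is set when pattern[j] == c.
   Only valid while the whole pattern fits in one machine word. */
static void build_bndm_masks(const char *pattern, unsigned long B[ALPHABET_SIZE])
{
    size_t m = strlen(pattern);
    assert(m > 0 && m <= sizeof(unsigned long) * 8);   /* the word-size limit */
    memset(B, 0, ALPHABET_SIZE * sizeof B[0]);
    for (size_t j = 0; j < m; j++)
        B[(unsigned char)pattern[j]] |= 1UL << (m - 1 - j);
}
```

For P = abaab this gives B['a'] = 10110 and B['b'] = 01001 in binary, as on the slide; a pattern longer than the word would trip the assertion, which is the case BLIM is designed to handle.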

8 How to overcome? Load a different kind of information, one that is length independent, into a single bit. In the proposed bit-parallel length independent matching (BLIM), each bit carries information about the whole pattern right-shifted by some amount.

9 Basic notation
Text T = t_0 t_1 t_2 t_3 ... t_{n-1}
Pattern P = p_0 p_1 ... p_{m-1}
Computer word size W
Σ denotes the alphabet

10 Off-line Pattern Matching (in general...) Slide a window over the text: Check & Shift. [Figure: the pattern window sliding over the text]

11 Sliding window. The window that is to be slid over T has W rows and ws = W+m-1 columns. The i-th row contains P right-shifted by i characters. [Figure: rows 0 to W-1, each holding p_0 p_1 p_2 ... p_{m-1} one column further to the right than the row above]

12 Sliding window, P = abaab, W = 8
pos     0  1  2  3  4  5  6  7  8  9 10 11
row 0   a  b  a  a  b
row 1      a  b  a  a  b
row 2         a  b  a  a  b
row 3            a  b  a  a  b
row 4               a  b  a  a  b
row 5                  a  b  a  a  b
row 6                     a  b  a  a  b
row 7                        a  b  a  a  b

13 Bitmask Creation. Example with P = abaab and W = 8, window positions 0..11 aligned with text characters t_j ... t_{j+11}: observing character b at position 6 gives Mask[b][6] = 1 0 1 0 0 1 1 1 = A7 (hexadecimal, written b_7 ... b_0), that is, b_0=1, b_1=1, b_2=1, b_3=0, b_4=0, b_5=1, b_6=0, b_7=1.

14 Bitmask. Mask[ch][pos] is a bit vector of W bits, b_{W-1} b_{W-2} ... b_1 b_0, where ch ∈ Σ and 0 ≤ pos < (W+m-1). The bits denote which of the alignments in the investigation window remain possible when one observes character ch at position pos.

15 Bitmask. The i-th bit of Mask[ch][pos] records whether the placement of the pattern right-shifted by i characters matches the observed ch at position pos:
b_i = 0, if (0 ≤ pos-i < m) and (ch ≠ p_{pos-i})
b_i = 1, otherwise
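
The bit rule above translates almost directly into code. The following is a sketch under stated assumptions: W is fixed to 8 so that each mask fits a `uint8_t` (matching the slides' example), the alphabet is all 256 byte values, and the name `blim_build_mask` and the flat `mask[ch * ws + pos]` layout are illustrative choices, not taken from the talk.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define SIGMA 256   /* assumed byte alphabet */
#define W 8         /* word size of the slides' example; an assumption here */

/* Returns a heap array of SIGMA * ws masks, indexed mask[ch * ws + pos].
   The caller frees it. */
static uint8_t *blim_build_mask(const char *p, size_t *ws_out)
{
    size_t m = strlen(p);
    size_t ws = W + m - 1;
    uint8_t *mask = malloc(SIGMA * ws);
    assert(mask != NULL);
    memset(mask, 0xFF, SIGMA * ws);              /* b_i = 1 unless disproved */
    for (size_t ch = 0; ch < SIGMA; ch++)
        for (size_t pos = 0; pos < ws; pos++)
            for (size_t i = 0; i < W && i <= pos; i++) {
                size_t k = pos - i;              /* pattern index under row i */
                if (k < m && (unsigned char)p[k] != ch)
                    mask[ch * ws + pos] &= (uint8_t)~(1u << i);
            }
    *ws_out = ws;
    return mask;
}
```

With P = abaab, `mask['b' * ws + 6]` comes out as 0xA7, the value shown for Mask[b][6] in the bitmask creation example.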

16 Sample Bitmask. P = abaab, W = 8, Σ = {a,b,c,d} (ch), ws = W + m – 1 = 12 (pos). Mask[ch][pos] = ...

17 Up to now, we have created the sliding window and the associated bitmasks. How are those masks used for matching, followed by a shift procedure?

18 Checking... Start with flag = 1111...1 (W bits). For the text character ch observed at window position pos, update flag = flag & Mask[ch][pos]. [Figure: text and pattern window]

19 Checking... Continue until flag becomes zero or all the positions are visited. If all positions have been visited and flag is still nonzero, there are matches: the indices of the bits that are 1 in the flag determine which of the alignments occur. In what order do we visit the positions in the window?
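
Reading the matches off the final flag can be sketched as below. The helper name `flag_to_positions` and the 8-bit flag (W = 8, as in the slides' example) are assumptions made for illustration: bit i set means the alignment shifted right by i characters matched, so the occurrence starts at text offset j + i for a window starting at j.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Given the final flag of a window that starts at text offset j, write the
   start offsets of the matching alignments into out[] (room for 8 entries)
   and return how many there are. */
static size_t flag_to_positions(uint8_t flag, size_t j, size_t out[8])
{
    assert(out != NULL);
    size_t count = 0;
    for (unsigned i = 0; i < 8; i++)
        if (flag & (1u << i))
            out[count++] = j + i;
    return count;
}
```

For example, a flag of 00001001 on a window starting at offset 10 reports occurrences at offsets 10 and 13.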

20 Scan Order. A heuristic approach to visit the minimum number of characters in case of a mismatch:
ScanOrder = {m-i, 2m-i, ..., km-i}
– i = 1, 2, ..., m
– (km-i) < ws (ws = W+m-1)

21 Scan Order. For P = abaab, W = 8 (window positions 0..11): ScanOrder = 4,9,3,8,2,7,1,6,11,0,5,10
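
A short sketch of how this order can be generated, assuming only the definition ScanOrder = {m-i, 2m-i, ..., km-i} for i = 1..m with every position below ws. The function name `blim_scan_order` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Fill order[] with the positions m-i, 2m-i, ..., km-i (all < ws) for
   i = 1, 2, ..., m. Returns the number of positions written, which is ws. */
static size_t blim_scan_order(size_t m, size_t ws, size_t *order)
{
    assert(m > 0 && ws >= m);
    size_t n = 0;
    for (size_t i = 1; i <= m; i++)
        for (size_t pos = m - i; pos < ws; pos += m)
            order[n++] = pos;
    return n;
}
```

Calling it with m = 5 and ws = 12 yields 4,9,3,8,2,7,1,6,11,0,5,10. The first positions visited are the last characters of rows 0 and 5, so a mismatch there can rule out several alignments after very few reads.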

22 Shifting... The amount of shift? [Figure: text and pattern window]

23 Shift Mechanism. Same as Sunday's Quick Search: move right according to the text character immediately succeeding the current window under investigation.

24 Shift Mechanism. Example with P = abaab, W = 8: the window covers t_j ... t_{j+11}, and the shift is chosen by the character at t_{j+12}.
Character   Shift value
a           9
b           8
others      13
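
The table values are consistent with applying the Quick Search rule to the last row of the window: the shift is ws minus the index of the rightmost occurrence of the character in P, or ws + 1 when the character does not occur. The sketch below encodes that reading; the rule is inferred from the example rather than quoted from the talk, and the name `blim_shift_table` and the 256-symbol alphabet are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SIGMA 256   /* assumed byte alphabet */

/* shift[c] = ws - (index of the rightmost occurrence of c in p),
   or ws + 1 when c does not occur in p. */
static void blim_shift_table(const char *p, size_t W, size_t shift[SIGMA])
{
    size_t m = strlen(p);
    assert(m > 0 && W > 0);
    size_t ws = W + m - 1;
    for (size_t c = 0; c < SIGMA; c++)
        shift[c] = ws + 1;
    for (size_t k = 0; k < m; k++)
        shift[(unsigned char)p[k]] = ws - k;   /* later k overwrites earlier */
}
```

For P = abaab and W = 8 this produces 9 for a, 8 for b, and 13 for every other character.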

25 BLIM Algorithm
    ws = W+m-1;
    Compute Mask; Compute ScanOrder; Compute Shift;
    Pad text T with ws number of NULL characters;
    i = 0;
    while (i < n) {
        flag = Mask[T[i+ScanOrder[0]]][ScanOrder[0]];
        for (j = 1; j < ws && flag; j++)
            flag &= Mask[T[i+ScanOrder[j]]][ScanOrder[j]];
        if (flag) { Check bits of the flag to locate occurrences }
        i += Shift[T[i+ws]];
    }
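
Putting the pieces together, the algorithm above might look like this in C. It is a sketch under stated assumptions, not the authors' implementation: W is fixed at 32 with `uint32_t` flags, the alphabet is all byte values, the pattern is a NUL-terminated string, and instead of physically padding the text, the hypothetical helper `text_at` returns NUL for reads past the end. The names `blim_search` and `text_at` are illustrative.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define SIGMA 256
#define W 32   /* bits per flag word (uint32_t); an assumption of this sketch */

/* Text character at idx, or NUL past the end (stands in for the padding). */
static unsigned char text_at(const char *t, size_t n, size_t idx)
{
    return idx < n ? (unsigned char)t[idx] : 0;
}

/* Stores up to max_out match offsets in out[]; returns the total found. */
static size_t blim_search(const char *t, size_t n, const char *p,
                          size_t *out, size_t max_out)
{
    size_t m = strlen(p), ws = W + m - 1, found = 0;
    assert(m > 0);
    uint32_t *mask = malloc(SIGMA * ws * sizeof *mask);
    size_t *order = malloc(ws * sizeof *order);
    size_t shift[SIGMA];
    assert(mask != NULL && order != NULL);

    /* Mask[ch][pos]: bit i is 0 iff row i puts p[pos-i] at pos and it != ch. */
    for (size_t ch = 0; ch < SIGMA; ch++)
        for (size_t pos = 0; pos < ws; pos++) {
            uint32_t bits = UINT32_MAX;
            for (size_t i = 0; i < W && i <= pos; i++)
                if (pos - i < m && (unsigned char)p[pos - i] != ch)
                    bits &= ~(UINT32_C(1) << i);
            mask[ch * ws + pos] = bits;
        }

    /* ScanOrder: m-i, 2m-i, ... (< ws) for i = 1..m. */
    size_t k = 0;
    for (size_t i = 1; i <= m; i++)
        for (size_t pos = m - i; pos < ws; pos += m)
            order[k++] = pos;

    /* Quick Search style shift on the character just past the window. */
    for (size_t c = 0; c < SIGMA; c++) shift[c] = ws + 1;
    for (size_t j = 0; j < m; j++) shift[(unsigned char)p[j]] = ws - j;

    for (size_t i = 0; i < n; i += shift[text_at(t, n, i + ws)]) {
        uint32_t flag = UINT32_MAX;
        for (size_t j = 0; j < ws && flag; j++)
            flag &= mask[(size_t)text_at(t, n, i + order[j]) * ws + order[j]];
        for (size_t b = 0; b < W && flag; b++, flag >>= 1)
            if (flag & 1u) {                 /* alignment i + b matched */
                if (found < max_out) out[found] = i + b;
                found++;
            }
    }
    free(mask);
    free(order);
    return found;
}
```

Searching "abaabaabxxabaab" for abaab reports offsets 0, 3 and 10. Nothing in the code depends on m being at most W, so a 40-character pattern is handled by the same loop with 32-bit flags.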

26 Sample Run

27 Complexity
Best case: minimum number of character comparisons, maximum shift
Worst case: maximum number of character comparisons, minimum shift

28 Experimental Results
On DNA sequences (Manzini's DNA compression corpus)
On natural language text (enwik8.txt)
100 sample patterns tested for each length
gcc -O3
Intel Xeon 2.4 GHz, 3 GB memory

29 BLIM vs. SO Family on DNA

30 BLIM vs. BNDM Family on DNA

31 Overall Performance on DNA

32 BLIM vs. SO Family on Natural Language

33 BLIM vs. BNDM Family on Natural Language

34 Overall Performance on Natural Language

35 Multi-pattern Case. Bit-parallel approaches suffer more in the multi-pattern case, as the total length is more likely to exceed the word size. BLIM serves as a good basis for that case, with its ability to search up to W patterns of any length in a common bit-parallel fashion.

36 Multi-pattern BLIM. [Figure: an 8-row window (rows 0..7) over positions 0..6, holding four shifted copies of abaa and four shifted copies of bba]
P = {abaa, bba}
R = W / |P| = 8/2 = 4
pivot = min{R-1+|P_i|, P_i ∈ P} = 6
ws = max{R-1+|P_i|, P_i ∈ P} = 7

37 Multi-pattern BLIM
Bitmask creation
– straightforward, as before.
ScanOrder
– Let s = min{|P_i|, P_i ∈ P}
– S1 = {s-i, 2s-i, ..., ks-i}, i = 1, 2, ..., s, with (ks-i) < pivot
– ScanOrder = S1 ∪ {pivot, pivot+1, ..., ws-1}

38 Multi-BLIM Scan Order. For P = {abaa, bba}, W = 8 (window positions 0..6): s = min{4,3} = 3, ScanOrder = 2,5,1,4,0,3,6
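
The multi-pattern parameters and scan order can be sketched as below, following the definitions R = W / |P|, pivot = min{R-1+|P_i|}, ws = max{R-1+|P_i|} and ScanOrder = S1 ∪ {pivot, ..., ws-1}. The struct and function names are hypothetical, and W is assumed to be at least the number of patterns.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct multi_blim_params { size_t R, s, pivot, ws; };

/* Derive R, s (shortest length), pivot and ws for a pattern set. */
static struct multi_blim_params multi_blim_setup(const char *const *pats,
                                                 size_t count, size_t W)
{
    assert(count > 0 && count <= W);
    struct multi_blim_params q = { W / count, strlen(pats[0]), 0, 0 };
    size_t longest = q.s;
    for (size_t k = 1; k < count; k++) {
        size_t len = strlen(pats[k]);
        if (len < q.s) q.s = len;
        if (len > longest) longest = len;
    }
    q.pivot = q.R - 1 + q.s;      /* min over patterns of R-1+|P_i| */
    q.ws = q.R - 1 + longest;     /* max over patterns of R-1+|P_i| */
    return q;
}

/* S1 = {s-i, 2s-i, ...} below pivot for i = 1..s, then pivot..ws-1. */
static size_t multi_blim_scan_order(struct multi_blim_params q, size_t *order)
{
    assert(q.s > 0 && q.pivot <= q.ws);
    size_t n = 0;
    for (size_t i = 1; i <= q.s; i++)
        for (size_t pos = q.s - i; pos < q.pivot; pos += q.s)
            order[n++] = pos;
    for (size_t pos = q.pivot; pos < q.ws; pos++)
        order[n++] = pos;
    return n;
}
```

For P = {abaa, bba} and W = 8 this gives R = 4, s = 3, pivot = 6, ws = 7 and the order 2,5,1,4,0,3,6.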

39 Multi-BLIM Shift Mechanism. Example with P = {abaa, bba}, W = 8; the shift is looked up by a pair of text characters.
Characters   Shift value
aa           4
ab           6
ba           6
bb           6
{c,d}a       7
{c,d}b       7
... else     8

40 Experimental Results on Multi_BLIM
Multi_BLIM is compared with the Aho-Corasick and Commentz-Walter algorithms via the SPARE Parts 2003 toolkit.
DNA pattern lengths between 4 and 30.
NL pattern lengths between 2 and 20.
Up to 32 patterns randomly collected for each test.
Intel Xeon 2.4 GHz, 3 GB memory
Manzini's DNA corpus & enwik8.txt

41 Multi_BLIM Performance on DNA

42 Multi_BLIM Performance on Natural Language

43 About q-gram utilization? Instead of reading one character at a time, read more with the help of recent advances in CPU architecture.
– Fredriksson, Shift-or string matching with super-alphabets, 2003
– Durian et al., Tuning BNDM with q-grams, 2009
Unfortunately, there is not much gain, because of BLIM's random access structure.

44 About q-gram utilization. Mainly 2 reasons for the low gain when using q-grams in BLIM:
1. BLIM does not pass over the text sequentially, but instead performs distant reads on the investigation window.
2. Mask is of size |Σ|*(W+m-1). As Σ grows with q-gram usage, Mask becomes too large to fit into the first-level cache.
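
To make the second reason concrete, the arithmetic can be sketched as a small helper. It assumes that with q-grams the effective alphabet has |Σ|^q symbols and that each Mask entry occupies W/8 bytes; both are assumptions of this illustration, and `blim_mask_bytes` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes needed for Mask when q-grams over an alphabet of size sigma are used:
   sigma^q symbols, (W + m - 1) window positions, W/8 bytes per entry. */
static size_t blim_mask_bytes(size_t sigma, unsigned q, size_t W, size_t m)
{
    assert(sigma > 0 && q > 0 && m > 0 && W % 8 == 0);
    size_t symbols = 1;
    for (unsigned k = 0; k < q; k++)
        symbols *= sigma;
    return symbols * (W + m - 1) * (W / 8);
}
```

With W = 32 and m = 20, a 4-letter DNA alphabet needs 816 bytes for q = 1 but 52,224 bytes for q = 4, and a byte alphabet with q = 2 needs about 13 MB, far more than a first-level cache of a few tens of kilobytes.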

45 Conclusion. An initial attempt to solve the computer word size limitation in bit-parallel pattern matching. The speed is in the range of SBNDM and SBNDM2, with the additional advantage that nothing special needs to be done when the input pattern is longer than W.

46 Conclusion. It must be noted that, in general, it is slower than Lecroq's new algorithm, and that for lengths longer than 100, backward (suffix) oracle matching is a better alternative (this applies to all bit-parallel algorithms as well). Multi-pattern BLIM shows good performance and may be a strong alternative to classical multi-pattern search algorithms.

47 Acknowledgement. Thanks to
– Jorma Tarhio,
– Kimmo Fredriksson,
– Thierry Lecroq,
for sharing their code and comments.

48 Thank you! Any questions?

