1 Pattern Matching Using n-grams With Algebraic Signatures Witold Litwin[1], Riad Mokadem[1], Philippe Rigaux[1] & Thomas Schwarz[2] [1] Université Paris Dauphine.


1 1 Pattern Matching Using n-grams With Algebraic Signatures
Witold Litwin[1], Riad Mokadem[1], Philippe Rigaux[1] & Thomas Schwarz[2]
[1] Université Paris Dauphine
[2] Santa Clara University

2 2 n-gram Search
New pattern matching idea
Matches algebraic signatures
Preprocesses both the pattern and the string (record)
–String preprocessing is a new idea, to the best of our knowledge
Provides incidental protection of stored data
–Important for P2P & grid systems
Fast processing
Especially useful for DBs & longer patterns
–ASCII, Unicode, DNA…
–Should then often be faster than Boyer-Moore
–Possibly the fastest known in this context

3 3 Algebraic Signature
Symbols of the alphabet are elements of a Galois Field
–GF(256) usually
We choose one primitive element $\alpha$ of the field
–Usually $\alpha = 2$
The algebraic signature of the string of $i$ symbols $p_1 \dots p_i$ is the sum $p'_i = p_1 \alpha + \dots + p_i \alpha^i$.
Here the addition and the multiplication are the operations in the GF.

4 4 Algebraic Signature
In our GF($2^f$), where $f = 8, 16$: $p + q = p - q = p \oplus q$ (XOR)
One method for multiplying is: $p \cdot q = \mathrm{antilog}_\alpha((\log_\alpha p + \log_\alpha q) \bmod 255)$
The division is then: $p / q = \mathrm{antilog}_\alpha((\log_\alpha p - \log_\alpha q) \bmod 255)$
–The modulus 255 is $2^f - 1$ for $f = 8$
The log and antilog are stored in log and antilog tables with $2^f$ elements each.
–Entry 0 is for element 0 of the GF; its log is by convention set to $2^f - 1$.
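The table-based arithmetic on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: the slides do not name the generator polynomial, so the sketch assumes the common choice $x^8 + x^4 + x^3 + x^2 + 1$ (0x11D), for which $\alpha = 2$ is primitive. The names `build_tables`, `gf_add`, `gf_mul` and `gf_div` are hypothetical.

```python
F = 8
ORDER = (1 << F) - 1  # 2^f - 1 = 255 nonzero elements in GF(256)


def build_tables(poly=0x11D):
    """Build log/antilog tables for alpha = 2.

    Assumption: the field is generated by 0x11D; the slides do not say which
    polynomial the authors use.
    """
    log = [ORDER] * (1 << F)   # log[0] stays 2^f - 1, the slide's convention
    antilog = [0] * ORDER
    x = 1                      # alpha^0
    for i in range(ORDER):
        antilog[i] = x
        log[x] = i
        x <<= 1                # multiply by alpha = 2
        if x >> F:             # reduce modulo the generator polynomial
            x ^= poly
    return log, antilog


LOG, ANTILOG = build_tables()


def gf_add(p, q):
    """Addition and subtraction are both XOR in GF(2^f)."""
    return p ^ q


def gf_mul(p, q):
    """p * q = antilog((log p + log q) mod 255); zero has no logarithm."""
    if p == 0 or q == 0:
        return 0
    return ANTILOG[(LOG[p] + LOG[q]) % ORDER]


def gf_div(p, q):
    """p / q = antilog((log p - log q) mod 255)."""
    if q == 0:
        raise ZeroDivisionError("division by zero in GF(256)")
    if p == 0:
        return 0
    return ANTILOG[(LOG[p] - LOG[q]) % ORDER]
```

The explicit zero checks are needed because the conventional table entry for 0 is only a placeholder, not a real logarithm.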

5 5 Cumulative Algebraic Signature
We encode every symbol $p_i$ of a string into the signature of the prefix $p_1 \dots p_i$
The value of a CAS symbol now also encodes the values of all the previous symbols
Matching a single symbol means prefix matching

6 6 Application of CASs
Protection against involuntary data disclosure
–On P2P & Grid servers especially
Numerous string matching algorithms on CAS-encoded strings
–Prefix match with O(1) complexity
–Pattern match by signature only: Karp-Rabin like, linear O(L) complexity
–Longest common string search
–Longest common prefix search
–…

7 7 CAS Properties
O(K) encoding and decoding speed
–For encoding, for instance: $p'_i = p'_{i-1} + p_i \alpha^i = \mathrm{CAS}(p_{i-1}) + p_i \alpha^i$
Fast n-gram signature calculus
–For $S_{k,l} = p_k \dots p_l$ with $k > 1$ and $l - k + 1 = n$: $AS(S_{k,l}) = AS(S_{l-k+1}) = (p'_l \oplus p'_{k-1}) / \alpha^{k-1}$, where $\oplus$ is XOR
Logarithmic Algebraic Signature (LAS)
–$LAS(S_{k,l}) = \log_\alpha AS(S_{k,l}) = (\log_\alpha(p'_l \oplus p'_{k-1}) - (k-1)) \bmod (2^f - 1)$
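A minimal sketch of the CAS encoding, its decoding, and the n-gram LAS formula above. It assumes GF(256) generated by 0x11D with $\alpha = 2$ (the slides do not state the polynomial), and the names `cas_encode`, `cas_decode` and `las` are hypothetical.

```python
F = 8
ORDER = (1 << F) - 1  # 255


def build_tables(poly=0x11D):
    """Log/antilog tables for alpha = 2; the polynomial is an assumption."""
    log, antilog = [ORDER] * (1 << F), [0] * ORDER
    x = 1
    for i in range(ORDER):
        antilog[i], log[x] = x, i
        x <<= 1
        if x >> F:
            x ^= poly
    return log, antilog


LOG, ANTILOG = build_tables()


def cas_encode(symbols):
    """Return [p'_1, ..., p'_K] with p'_i = p'_{i-1} XOR p_i * alpha^i."""
    cas, acc = [], 0
    for i, p in enumerate(symbols, start=1):
        if p:  # a zero symbol contributes nothing to the sum
            acc ^= ANTILOG[(LOG[p] + i) % ORDER]
        cas.append(acc)
    return cas


def cas_decode(cas):
    """Recover p_i = (p'_i XOR p'_{i-1}) / alpha^i."""
    symbols, prev = [], 0
    for i, c in enumerate(cas, start=1):
        d = c ^ prev
        symbols.append(ANTILOG[(LOG[d] - i) % ORDER] if d else 0)
        prev = c
    return symbols


def las(cas, k, l):
    """LAS of the substring p_k..p_l (1-based, inclusive), read off the CAS."""
    d = cas[l - 1] ^ (cas[k - 2] if k > 1 else 0)
    # LAS(0) = 255 by the table convention for the zero element
    return (LOG[d] - (k - 1)) % ORDER if d else ORDER


cas = cas_encode(b"Dauphine")
print(cas_decode(cas) == list(b"Dauphine"))            # round trip
print(las(cas, 7, 8) == las(cas_encode(b"ne"), 1, 2))  # LAS is position independent
```

The last line illustrates why the division by $\alpha^{k-1}$ matters: it normalizes the signature so that the same n-gram yields the same LAS wherever it occurs in the record.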

8 8 The n-gram Search Key ideas
Design a sublinear pattern match search
–With speed about L / K (L = string length, K = pattern length)
Apply to a CAS-encoded DB
–New idea for a string search algorithm with preprocessing
–Justified for a DB: store once, search many times

9 9 The n-gram Search Key ideas
Preprocess the pattern to create a jump table
–As in Boyer-Moore
Use n-grams with n > 1 to increase the discriminative power of an attempt
–An attempt compares a sample from the pattern: a single symbol for BM, the LAS of an n-gram for a CAS-encoded string

10 10 The n-gram Search Key ideas
If the alphabet uses m symbols, the probability that a symbol matches is 1/m
–Assuming all symbols are equally likely
For usual ASCII pattern matching, m = 20-25
For DNA, m = 4
A single symbol may often match without the whole pattern matching
–E.g., 1/4 of the time on average for DNA
This leads to small jumps
–By m symbols on average

11 11 The n-gram Search Key ideas
The probability of an n-gram matching may be as low as $1 / \min(2^f, m^n)$
In our examples it can reach 1/256
–More discriminative sampling
–Longer jumps: by almost K or 256 symbols in general
Useful for longer strings
–DNA, text, images…
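A small arithmetic sketch of this bound, reading it as $1 / \min(2^f, m^n)$: an n-gram over m equally likely symbols has $m^n$ possible values, but its signature can only take about $2^f$ values. That reading and the function name `ngram_match_probability` are assumptions made for illustration.

```python
def ngram_match_probability(m, n, f=8):
    """Chance that two random n-grams have equal signatures: 1 / min(2^f, m^n).

    Assumes m equally likely symbols and signatures in GF(2^f).
    """
    return 1 / min(2 ** f, m ** n)


for name, m in (("DNA", 4), ("ASCII text", 25)):
    for n in (1, 2, 4):
        p = ngram_match_probability(m, n)
        print(f"{name}: n={n} -> about 1 in {round(1 / p)}")
```

Under these assumptions, DNA needs 4-grams to reach 1/256 in GF(256), while text with about 25 distinct symbols already reaches it with 2-grams.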

12 12 ASCII Example Usual Alphabet 2-grams => 5 jumps 1-gram => 6 jumps

13 13 DNA Example 4-letter Alphabet 3 jumps 4 jumps 11 jumps

14 14 The n-gram Search Preprocessing
Encode every record (string) into its CAS
–Done anyhow for incidental protection in SDDS-2006
Encode the terminal n-gram of the searched pattern $S_K$ into its LAS, kept in variable V
Fill up the jump table T for every other n-gram in $S_K$
–Calculate every LAS
–For each LAS, store in T its rightmost offset with respect to the end of $S_K$

15 15 The n-gram Search Jump Table
For GF(256), for every n-gram $S_{i,i+n-1}$ in the pattern, with $j = LAS(S_{i,i+n-1})$:
–T(j) = the offset
–T(j) = K - n + 1 for every other value j
Reminder: LAS(0) = 255
T can also be a hash table
–See the paper
–Slower to use but possibly more memory efficient
–Probably more useful for a larger GF

16 16 ASCII Example
Pattern: Dauphine
V = ne’’
Jump table T, indexed by LAS: T(0) = 7, T(1) = 7, …, T(in’’) = 1, …, T(au’’) = 5, …, T(ph’’) = 3, …, T(255) = 7
Notation: xy’’ = LAS(xy)
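The jump table for the pattern "Dauphine" with 2-grams can be reproduced with a sketch like the one below. The offsets (1 for "in", 5 for "au", 3 for "ph", 7 by default) follow from the slides; the actual LAS values depend on the field, which is assumed here to be GF(256) generated by 0x11D with $\alpha = 2$. The names `cas_encode`, `las` and `jump_table` are hypothetical.

```python
F = 8
ORDER = (1 << F) - 1  # 255


def build_tables(poly=0x11D):
    """Log/antilog tables for alpha = 2; the polynomial is an assumption."""
    log, antilog = [ORDER] * (1 << F), [0] * ORDER
    x = 1
    for i in range(ORDER):
        antilog[i], log[x] = x, i
        x <<= 1
        if x >> F:
            x ^= poly
    return log, antilog


LOG, ANTILOG = build_tables()


def cas_encode(symbols):
    """p'_i = p'_{i-1} XOR p_i * alpha^i."""
    cas, acc = [], 0
    for i, p in enumerate(symbols, start=1):
        if p:
            acc ^= ANTILOG[(LOG[p] + i) % ORDER]
        cas.append(acc)
    return cas


def las(cas, k, l):
    """LAS of p_k..p_l (1-based, inclusive); LAS(0) = 255."""
    d = cas[l - 1] ^ (cas[k - 2] if k > 1 else 0)
    return (LOG[d] - (k - 1)) % ORDER if d else ORDER


def jump_table(pattern, n=2):
    """Return (V, T): LAS of the terminal n-gram and the jump table."""
    K = len(pattern)
    pcas = cas_encode(pattern)
    T = [K - n + 1] * (ORDER + 1)   # default jump for LAS values not in the pattern
    # Non-terminal n-grams end at positions n..K-1; later ones overwrite
    # earlier ones, so T keeps the rightmost (smallest) offset.
    for end in range(n, K):
        T[las(pcas, end - n + 1, end)] = K - end
    V = las(pcas, K - n + 1, K)
    return V, T


V, T = jump_table(b"Dauphine")
for gram in (b"in", b"au", b"ph"):
    print(gram.decode(), T[las(cas_encode(gram), 1, 2)])
```

A plain 256-entry list indexed by LAS is used here; the hash-table variant mentioned on the previous slide would replace `T` with a dictionary and a default value.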

17 17 The n-gram Search Processing
Calculate the LAS of the current n-gram in the string
–Start with the n-gram $S_{K-n+1,K}$
–Continue depending on the jump calculus
Attempt to match V
–If true, calculate the LAS of the entire current, possibly matching substring of length K ending with the current n-gram, and compare it with the LAS of the pattern
–If that is true too, resolve the possible collision
–Either attempt to match all the K symbols
–Or match enough terminal n-grams or symbols to decrease the probability of collision to a very small value

18 18 The n-gram Search Processing
Otherwise
–Go to T using the LAS of the n-gram
–Jump by the number of symbols found in T
–Update the “current” position of the n-gram to attempt the match
–Re-attempt the match as above, unless the n-gram to attempt is beyond the end of the string
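The processing steps of the two slides above can be put together in a compact sketch. It is an illustration under stated assumptions, not the SDDS-2006 code: GF(256) generated by 0x11D with $\alpha = 2$, a 256-entry jump table, collisions resolved by comparing all K decoded symbols, and a jump by T after every attempt (successful or not) so that all occurrences are reported. All function names are hypothetical.

```python
F = 8
ORDER = (1 << F) - 1  # 255


def build_tables(poly=0x11D):
    """Log/antilog tables for alpha = 2; the polynomial is an assumption."""
    log, antilog = [ORDER] * (1 << F), [0] * ORDER
    x = 1
    for i in range(ORDER):
        antilog[i], log[x] = x, i
        x <<= 1
        if x >> F:
            x ^= poly
    return log, antilog


LOG, ANTILOG = build_tables()


def cas_encode(symbols):
    """p'_i = p'_{i-1} XOR p_i * alpha^i."""
    cas, acc = [], 0
    for i, p in enumerate(symbols, start=1):
        if p:
            acc ^= ANTILOG[(LOG[p] + i) % ORDER]
        cas.append(acc)
    return cas


def las(cas, k, l):
    """LAS of p_k..p_l (1-based, inclusive); LAS(0) = 255."""
    d = cas[l - 1] ^ (cas[k - 2] if k > 1 else 0)
    return (LOG[d] - (k - 1)) % ORDER if d else ORDER


def symbol_at(cas, i):
    """Decode the i-th symbol (1-based) from the CAS-encoded record."""
    d = cas[i - 1] ^ (cas[i - 2] if i > 1 else 0)
    return ANTILOG[(LOG[d] - i) % ORDER] if d else 0


def ngram_search(cas, pattern, n=2):
    """Return 0-based start positions of pattern in the CAS-encoded record."""
    K, L = len(pattern), len(cas)
    if K < n:
        raise ValueError("pattern shorter than n")
    pcas = cas_encode(pattern)
    V = las(pcas, K - n + 1, K)          # LAS of the terminal n-gram
    whole = las(pcas, 1, K)              # LAS of the entire pattern
    T = [K - n + 1] * (ORDER + 1)        # jump table, default jump
    for end in range(n, K):              # rightmost offset of every other n-gram
        T[las(pcas, end - n + 1, end)] = K - end
    hits, end = [], K                    # end = position of the current n-gram's last symbol
    while end <= L:
        cur = las(cas, end - n + 1, end)
        start = end - K + 1
        if (cur == V and las(cas, start, end) == whole
                and all(symbol_at(cas, start + t) == pattern[t] for t in range(K))):
            hits.append(start - 1)
        end += T[cur]                    # jump, then re-attempt
    return hits


record = cas_encode(b"Universite Paris Dauphine, Paris")
print(ngram_search(record, b"Paris"))
```

A LAS collision can only make a jump shorter than necessary, never longer, so the shifts stay safe; the final symbol-by-symbol comparison removes false positives.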

19 19 ASCII Example Again 2-grams => 5 jumps 1-gram => 6 jumps

20 20 DNA Example Again 3 jumps 4 jumps 11 jumps

21 21 n-grams / BM
Average shifts with n-grams can typically be longer
Calculating an attempt and a jump may be more expensive as well
–About twice as long, as a first approximation
–The precise analysis remains to be done
Rule of thumb: if shifts are more than 2 times longer, n-grams with n > 1 should be faster than BM.

22 22 Experimental Results
Searching large data of:
–DNA
–Typical ASCII
–XML documents
Patterns of 6 to 500 symbols (bytes)
1.8 GHz P3 and 2.4 GHz dual-core AMD Turion 64 processors

23 23 Results Compared to BM
DNA: up to 72 times faster
Typical ASCII: up to about 11 times faster
XML documents: up to more than 5 times faster
The search is faster for longer patterns
–Average shifts are longer

24 24 DNA

25 25 ASCII

26 26 XML
                  Boyer-Moore search                        Ngram search
Pattern  Record   Prepr.  Elapsed  Nb      Pos.     Avg.    Prepr.  Elapsed  Nb      Pos.     Avg.    Ratio
size     size     time    time     shifts  shifts   shifts  time    time     shifts  shifts   shifts
5        1119392  11      39684    486654  1105079  4.54    29      33830    560243  1119388  2.00    1.173042
7        1119392  10      29964    363532  1103455  6.07    29      17128    282339  1119388  3.96    1.749416
10       1119392  13      20835    244306  1104955  9.05    43      10102    161595  1119384  6.93    2.062463
11       1119392  11      21537    263710  1092465  8.29    31      9086     143455  1119387  7.80    2.370350
13       1119392  12      20053    223080  1065458  9.55    40      7237     112626  1119387  9.94    2.770900
27       1119392  14      11672    134974  1089086  16.14   36      3496     47727   1119384  23.45   3.338673
51       1119392  19      9719     105588  1089559  20.64   43      2440     27498   1119366  40.71   3.983197
186      1119392  40      4687     34028   1108639  65.16   82      1391     9094    1119298  123.08  3.369518
237      1119392  49      4307     37738   1108658  58.76   95      802      8119    1119327  137.87  5.370324
386      1119392  74      4647     32918   1108691  67.36   133     913      8072    1119024  138.63  5.089814
567      1119392  103     3385     30560   1108574  72.55   183     819      6312    1118932  177.27  4.133089

27 27 Related Work
Implemented in SDDS-2006
Applies best to
–longer patterns where many jumps occur
–alphabets much smaller than the size of the GF used
Instead of shifts of size m on average, one reaches almost $\min(K, 2^f)$ per shift
–Up to almost 256 for DNA or ASCII with GF(256)
–Up to almost 64K for DNA or Unicode with GF(64K)
–Instead of 4 or 25 respectively, for Boyer-Moore especially

28 28 Related Work
In SDDS-2006 and in P2P or Grid systems in general
Wish to hide what is searched for?
–Use the signature-only search
–Usually slower, since it is only linear

29 29 Conclusion
A new pattern matching algorithm
Uses algebraic signatures
Preprocesses both the pattern and the string
Appears particularly efficient
–For databases
–For longer patterns
Possibly faster in this context than any other known algorithm
But all these are only preliminary results

30 30 Future Work
Performance Analysis
–Theoretical: jump length (median, average…)
–Experimental: actual text (non-uniform symbol distribution), DNA (actual DNA strings)

31 31 Future Work
Variants
–Jump table
–Partial signatures of n-grams: symbol $p_i$ encodes the signature of the n-gram $p_{i-n+1} \dots p_i$
–No more XORing & division to find this signature
–Faster unsuccessful attempt to match
–Approximate match: tolerating match errors, e.g., of at most 1 symbol

32 32 Thank You for Your Attention witold.litwin@dauphine.fr

