1 Pattern Matching Using n-grams With Algebraic Signatures. Witold Litwin [1], Riad Mokadem [1], Philippe Rigaux [1] & Thomas Schwarz [2]. [1] Université Paris Dauphine.


1 Pattern Matching Using n-grams With Algebraic Signatures
Witold Litwin [1], Riad Mokadem [1], Philippe Rigaux [1] & Thomas Schwarz [2]
[1] Université Paris Dauphine
[2] Santa Clara University

2 n-gram Search
New pattern matching idea
Matches algebraic signatures
Preprocesses both the pattern and the string (record)
– String preprocessing is a new idea, to the best of our knowledge
Provides incidental protection of stored data
– Important for P2P & grid systems
Fast processing
– Especially useful for DBs & longer patterns
– ASCII, Unicode, DNA…
– Should then often be faster than Boyer-Moore
– Possibly the fastest known in this context

3 Algebraic Signature
Symbols of the alphabet are elements of a Galois field
– Usually GF(256)
We choose one primitive element $\alpha$ of that field
– Usually $\alpha = 2$
The algebraic signature of the string of $i$ symbols $p_1 \dots p_i$ is the sum $p'_i = p_1 \alpha + \dots + p_i \alpha^i$.
Here the addition and the multiplication are the operations in the GF.

4 Algebraic Signature
In our GF($2^f$), where $f = 8$ or $16$: $p + q = p - q = p \oplus q$ (XOR)
One method for multiplying is: $p \cdot q = \mathrm{antilog}((\log_\alpha p + \log_\alpha q) \bmod 255)$
The division is then: $p / q = \mathrm{antilog}((\log_\alpha p - \log_\alpha q) \bmod 255)$
– The modulus 255 is $2^f - 1$ for $f = 8$
The log and antilog values are stored in log and antilog tables with $2^f$ elements each.
– Entry 0 of the log table is for element 0 of the GF and is by convention set to $2^f - 1$.
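
The table-based arithmetic of slides 3 and 4 can be sketched in Python as follows. This is only an illustration, not the authors' implementation. The slides do not name the generator polynomial, so `PRIM_POLY = 0x11D` (a common choice for which $\alpha = 2$ is primitive) is an assumption, and the names `LOG`, `ANTILOG`, `gf_mul`, `gf_div` and `signature` are hypothetical.

```python
# Sketch of GF(256) arithmetic with log/antilog tables, alpha = 2.
# ASSUMPTION: the generator polynomial 0x11D is not given in the slides.
PRIM_POLY = 0x11D
ORDER = 255                      # 2^f - 1 for f = 8

ANTILOG = [1] * 256              # ANTILOG[i] = alpha^i (entry 255 wraps to 1)
LOG = [ORDER] * 256              # LOG[0] = 255 by convention (slide 4)
_x = 1
for _i in range(ORDER):
    ANTILOG[_i], LOG[_x] = _x, _i
    _x <<= 1                     # multiply by alpha = 2
    if _x & 0x100:               # reduce modulo the generator polynomial
        _x ^= PRIM_POLY


def gf_mul(p, q):
    """Product in GF(256); zero operands are handled apart from the tables."""
    if p == 0 or q == 0:
        return 0
    return ANTILOG[(LOG[p] + LOG[q]) % ORDER]


def gf_div(p, q):
    """Quotient in GF(256); division by the zero element is undefined."""
    if q == 0:
        raise ZeroDivisionError("division by the zero element of the GF")
    if p == 0:
        return 0
    return ANTILOG[(LOG[p] - LOG[q]) % ORDER]


def signature(symbols):
    """Algebraic signature p_1*alpha + ... + p_i*alpha^i (addition is XOR)."""
    sig = 0
    for i, p in enumerate(symbols, start=1):
        sig ^= gf_mul(p, ANTILOG[i % ORDER])
    return sig


if __name__ == "__main__":
    print(signature(b"Dauphine"))
```

Addition needs no table, since it is a plain XOR of the two bytes.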

5 Cumulative Algebraic Signature (CAS)
We encode every symbol $p_i$ of a string into the signature $p'_i$ of the prefix $p_1 \dots p_i$
The value of a CAS symbol thus also encodes the values of all the previous symbols
Matching a single symbol means prefix matching
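
A minimal sketch of CAS encoding and decoding, assuming GF(256) with $\alpha = 2$ and the generator polynomial 0x11D (the polynomial is an assumption, as the slides do not state it). The function names `cas_encode` and `cas_decode` are hypothetical. Decoding inverts the encoding step by step: $p_i = (p'_i \oplus p'_{i-1}) / \alpha^i$.

```python
# Sketch of cumulative algebraic signature (CAS) encoding in GF(256).
# ASSUMPTION: alpha = 2 with generator polynomial 0x11D.
ORDER = 255
ANTILOG = [1] * 256
LOG = [ORDER] * 256              # LOG[0] = 255 by convention
_x = 1
for _i in range(ORDER):
    ANTILOG[_i], LOG[_x] = _x, _i
    _x <<= 1
    if _x & 0x100:
        _x ^= 0x11D


def cas_encode(symbols):
    """Replace each symbol p_i by the signature p'_i of the prefix p_1..p_i."""
    cas, acc = [], 0
    for i, p in enumerate(symbols, start=1):
        if p:                    # p * alpha^i, skipped when p is the zero element
            acc ^= ANTILOG[(LOG[p] + i) % ORDER]
        cas.append(acc)
    return cas


def cas_decode(cas):
    """Recover the symbols: p_i = (p'_i XOR p'_{i-1}) / alpha^i."""
    symbols, prev = [], 0
    for i, c in enumerate(cas, start=1):
        x = c ^ prev
        symbols.append(ANTILOG[(LOG[x] - i) % ORDER] if x else 0)
        prev = c
    return symbols


if __name__ == "__main__":
    encoded = cas_encode(b"Dauphine")
    print(encoded)
    print(bytes(cas_decode(encoded)))
```

Both directions make a single pass over the string, and the encoding of a prefix is a prefix of the encoding, which is what makes a single-symbol comparison a prefix comparison (up to signature collisions).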

6 Application of CASs
Protection against involuntary data disclosure
– On P2P & grid servers especially
Numerous string matching algorithms for CAS-encoded strings
– Prefix match with O(1) complexity
– Pattern match by signature only
  Karp-Rabin like, linear O(L) complexity
– Longest common string search
– Longest common prefix search
– …

7 CAS Properties
O(K) encoding and decoding speed
– For encoding, for instance: $p'_i = p'_{i-1} + p_i \alpha^i = \mathrm{CAS}(p_{i-1}) + p_i \alpha^i$
Fast n-gram signature calculus
– For $S_{k,l} = p_k \dots p_l$ with $k > 1$ and $l - k = n - 1$: $\mathrm{AS}(S_{k,l}) = \mathrm{AS}(S_{l-k+1}) = (p'_l \oplus p'_{k-1}) / \alpha^{k-1}$
Logarithmic Algebraic Signature (LAS)
– $\mathrm{LAS}(S_{k,l}) = \log_\alpha \mathrm{AS}(S_{k,l}) = (\log_\alpha(p'_l \oplus p'_{k-1}) - (k-1)) \bmod (2^f - 1)$

8 The n-gram Search: Key Ideas
Design a sublinear pattern match search
– With speed about L / K
Apply it to a CAS-encoded DB
– New idea for a string search algorithm with preprocessing
– Justified for a DB: store once, search many times

9 The n-gram Search: Key Ideas
Preprocess the pattern to create a jump table
– As in Boyer-Moore
Use n-grams with n > 1 to increase the discriminative power of an attempt
– An attempt compares a sample from the pattern:
  a single symbol for BM
  an LAS of an n-gram for a CAS-encoded string

10 The n-gram Search: Key Ideas
If the alphabet uses m symbols, the probability that a symbol matches is 1/m
– Assuming all symbols are equally likely
– For usual ASCII pattern matching, m = …
– For DNA, m = 4
A single symbol may often match without the whole pattern matching
– E.g., 1/4 of the time for DNA on average
This leads to small jumps
– By m symbols on average

11 The n-gram Search: Key Ideas
The probability of an n-gram matching may be: $\min(1/2^f, 1/m^n)$
– In our examples it can reach 1/256
More discriminative sampling
Longer jumps
– By almost K or 256 symbols in general
Useful for longer strings
– DNA, text, images…

12 ASCII Example
Usual alphabet
– 2-grams => 5 jumps
– 1-gram => 6 jumps

13 DNA Example
4-letter alphabet
– 3 jumps
– 4 jumps
– 11 jumps

14 The n-gram Search: Preprocessing
Encode every record (string) into its CAS
– Done for incidental protection anyhow in SDDS-2006
Encode the terminal n-gram of the searched pattern $S_K$ into its LAS, stored in variable V
Fill up the jump table T for every other n-gram in $S_K$
– Calculate every LAS
– For each LAS, store in T its rightmost offset with respect to the end of $S_K$

15 The n-gram Search: Jump Table
For GF(256), for every n-gram $S_{i,i+n-1}$ in the pattern, with $j = \mathrm{LAS}(S_{i,i+n-1})$:
– $T(j)$ = the offset
For every other value $j$:
– $T(j) = K - n + 1$
Reminder: LAS(0) = 255
T can also be a hash table
– See the paper
– Slower to use but possibly more memory efficient
– Probably more useful for a larger GF

16 ASCII Example
Pattern: Dauphine
V = ne''
Jump table T, indexed by LAS (entries shown on the slide):
… in'' : 1 … au'' : 5 … ph'' : 3 …
Notation: xy'' = LAS(xy)
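
The preprocessing of slides 14 to 16 can be sketched as follows for the pattern "Dauphine" with 2-grams. This is an illustration under assumptions: GF(256) with $\alpha = 2$ and generator polynomial 0x11D (not given in the slides), a plain 256-entry list for T, and the hypothetical names `ngram_las` and `build_jump_table`.

```python
# Sketch of the jump table T and the variable V for a pattern, GF(256).
# ASSUMPTION: alpha = 2 with generator polynomial 0x11D.
ORDER = 255
ANTILOG = [1] * 256
LOG = [ORDER] * 256              # LOG[0] = 255: LAS of a zero signature
_x = 1
for _i in range(ORDER):
    ANTILOG[_i], LOG[_x] = _x, _i
    _x <<= 1
    if _x & 0x100:
        _x ^= 0x11D


def ngram_las(ngram):
    """LAS of an n-gram: log of p_1*alpha + ... + p_n*alpha^n."""
    sig = 0
    for j, p in enumerate(ngram, start=1):
        if p:
            sig ^= ANTILOG[(LOG[p] + j) % ORDER]
    return LOG[sig]


def build_jump_table(pattern, n):
    """Return (V, T): LAS of the terminal n-gram and the jump table."""
    K = len(pattern)
    V = ngram_las(pattern[K - n:])
    T = [K - n + 1] * 256        # default jump for LAS values not in the pattern
    # Left-to-right scan of the non-terminal n-grams: a later (more to the
    # right) n-gram with the same LAS overwrites an earlier one, so T keeps
    # the rightmost offset with respect to the end of the pattern.
    for start in range(K - n):
        end = start + n          # number of symbols up to the n-gram's end
        T[ngram_las(pattern[start:end])] = K - end
    return V, T


if __name__ == "__main__":
    V, T = build_jump_table(b"Dauphine", 2)
    for gram in (b"in", b"au", b"ph"):
        print(gram.decode(), T[ngram_las(gram)])
```

For "Dauphine" the entries for in'', au'' and ph'' come out as 1, 5 and 3, in line with the slide, and every LAS value that does not occur in the pattern keeps the default jump K - n + 1 = 7.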

17 The n-gram Search: Processing
Calculate the LAS of the current n-gram in the string
– Start with the n-gram $S_{K-n+1,K}$
– Continue depending on the jump calculus
Attempt to match V
– If the match succeeds, calculate the LAS of the entire current possibly matching substring, of length K and ending with the current n-gram, and compare it with the LAS of the pattern
– If that match also succeeds, resolve the possible collision
  Either attempt to match all the K symbols
  Or match enough terminal n-grams or symbols to decrease the probability of collision to a very small value

18 The n-gram Search: Processing
Otherwise
– Go to T using the LAS of the n-gram
– Jump by the number of symbols found in T
  This updates the "current" position of the n-gram for the next attempt
– Re-attempt the match as above
  Unless the n-gram to attempt is beyond the end of the string
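
Putting the preprocessing and processing slides together, the search loop might look like the sketch below. It is not the authors' code. It assumes GF(256) with $\alpha = 2$ and generator polynomial 0x11D, resolves collisions by decoding and comparing all K symbols (the first of the two options on slide 17), and also jumps by T after a successful attempt so that every occurrence is reported. All function names are hypothetical.

```python
# Sketch of the n-gram search over a CAS-encoded record, GF(256).
# ASSUMPTIONS: alpha = 2, generator polynomial 0x11D, full symbol
# comparison to resolve collisions, jump by T after every attempt.
ORDER = 255
ANTILOG = [1] * 256
LOG = [ORDER] * 256
_x = 1
for _i in range(ORDER):
    ANTILOG[_i], LOG[_x] = _x, _i
    _x <<= 1
    if _x & 0x100:
        _x ^= 0x11D


def cas_encode(symbols):
    cas, acc = [], 0
    for i, p in enumerate(symbols, start=1):
        if p:
            acc ^= ANTILOG[(LOG[p] + i) % ORDER]
        cas.append(acc)
    return cas


def las(cas, k, l):
    """LAS of symbols k..l (1-indexed), read from the CAS symbols only."""
    x = cas[l - 1] ^ (cas[k - 2] if k > 1 else 0)
    return ORDER if x == 0 else (LOG[x] - (k - 1)) % ORDER


def symbol(cas, i):
    """Decode the original symbol p_i (1-indexed) from the CAS."""
    x = cas[i - 1] ^ (cas[i - 2] if i > 1 else 0)
    return ANTILOG[(LOG[x] - i) % ORDER] if x else 0


def ngram_search(record_cas, pattern, n=2):
    """Return the 0-based start positions of pattern in the encoded record."""
    K, L = len(pattern), len(record_cas)
    pcas = cas_encode(pattern)
    V = las(pcas, K - n + 1, K)              # LAS of the terminal n-gram
    whole = las(pcas, 1, K)                  # LAS of the entire pattern
    T = [K - n + 1] * 256                    # jump table, rightmost offsets
    for k in range(1, K - n + 1):
        T[las(pcas, k, k + n - 1)] = K - (k + n - 1)
    hits, end = [], K                        # end = position of the n-gram's last symbol
    while end <= L:
        g = las(record_cas, end - n + 1, end)
        if g == V and las(record_cas, end - K + 1, end) == whole:
            first = end - K + 1
            if all(symbol(record_cas, first + j) == pattern[j] for j in range(K)):
                hits.append(first - 1)
        end += T[g]                          # always at least 1
    return hits


if __name__ == "__main__":
    record = cas_encode(b"Universite Paris Dauphine, Dauphine")
    print(ngram_search(record, b"Dauphine", n=2))
```

The loop never reads the record in clear text except when both signature tests succeed, and with n = 1 it degenerates to a Horspool-style single-symbol jump.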

19 ASCII Example Again
– 2-grams => 5 jumps
– 1-gram => 6 jumps

20 DNA Example Again
– 3 jumps
– 4 jumps
– 11 jumps

21 n-grams / BM
Average shifts with n-grams can typically be longer
Calculating an attempt and a jump may be more expensive as well
– About twice as long at first approach
– The precise analysis remains to be done
Rule of thumb: if shifts are more than 2 times longer, n-grams with n > 1 should be faster than BM.

22 Experimental Results
Searching large data of:
– DNA
– Typical ASCII
– XML documents
Patterns of 6 to 500 symbols (bytes)
1.8 GHz P3 and 2.4 GHz dual-core AMD Turion 64 processors

23 Results Compared to BM
DNA
– Up to 72 times faster
Typical ASCII
– Up to about 11 times faster
XML documents
– Up to more than 5 times faster
Search is faster for longer patterns
– Average shifts are longer

24 DNA

25 ASCII

26 XML
Table comparing Boyer-Moore search and n-gram search. Columns: Pattern size, Record size; then, for each of the two algorithms: Prepr. time, Elapsed time, Nb shifts, Pos. shifts, Avg. shifts; and finally Ratio.

27 Related Work
Implemented in SDDS-2006
Applies best to
– Longer patterns where many jumps occur
– Alphabets much smaller than the size of the GF used
Instead of shifts of size m on average, one reaches almost $\min(K, 2^f)$ per shift
– Up to almost 256 for DNA or ASCII with GF(256)
– Up to almost 64K for DNA or Unicode with GF(64K)
  instead of 4 or 25 respectively
– For Boyer-Moore especially

28 Related Work
In SDDS-2006 & P2P or grid systems in general
Wish to hide what is searched for?
– Use the signature-only search
– Usually slower since linear only

29 Conclusion
A new pattern matching algorithm
– Uses algebraic signatures
– Preprocesses both the pattern and the string
Appears particularly efficient
– For databases
– For longer patterns
Possibly faster in this context than any other known algorithm
But all these are only preliminary results

30 Future Work
Performance analysis
– Theoretical
  Jump length: median, average…
– Experimental
  Actual text: non-uniform symbol distribution
  DNA: actual DNA strings

31 Future Work
Variants
– Jump table
– Partial signatures of n-grams
  Symbol $p_i$ encodes the n-gram signature of $p_{i-n+1} \dots p_i$
  No more XORing & division to find this signature
  Faster unsuccessful attempt to match
– Approximate match
  Tolerating match errors, e.g., at most 1 symbol

32 Thank You for Your Attention