1 Pattern Matching Using n-gram Sampling Of Cumulative Algebraic Signatures : Preliminary Results Witold Litwin[1], Riad Mokadem1, Philippe Rigaux1 & Thomas.

Presentation transcript:

1 Pattern Matching Using n-gram Sampling of Cumulative Algebraic Signatures: Preliminary Results
Witold Litwin [1], Riad Mokadem [1], Philippe Rigaux [1] & Thomas Schwarz [2]
[1] Université Paris Dauphine
[2] Santa Clara University

2 n-gram Search
New pattern matching idea
Matches algebraic signatures
Preprocesses both the pattern and the string (record)
– String preprocessing is a new idea, to the best of our knowledge
Provides incidental protection of stored data
Important for P2P & grid systems
Fast processing
Especially useful for DBs & longer patterns
– ASCII, Unicode, DNA…
– Should then often be faster than Boyer-Moore
– Possibly the fastest known in this context

3 Algebraic Signature
Symbols of the alphabet are elements of a Galois Field
– GF(256) usually
We choose one primitive element $\alpha$ of that field
– Usually $\alpha = 2$
The algebraic signature of the string of $i$ symbols $p_1 \dots p_i$ is the sum: $p'_i = p_1 \alpha + \dots + p_i \alpha^i$
Here the addition and the multiplication are the operations in the GF.
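
As a minimal sketch of this definition, the Python below computes the signature over GF(256). The reduction polynomial 0x11D (x^8 + x^4 + x^3 + x^2 + 1) is an assumption, since the slides do not name one; 2 is a primitive element for it. The names `gf_mul` and `algebraic_signature` are illustrative, not taken from the talk.

```python
POLY = 0x11D   # assumed reduction polynomial for GF(256); not given on the slides
ALPHA = 2      # primitive element, as on the slide

def gf_mul(a: int, b: int) -> int:
    """Multiply two GF(256) elements by shift-and-add with reduction."""
    result = 0
    while b:
        if b & 1:
            result ^= a          # GF addition is XOR
        a <<= 1
        if a & 0x100:            # degree 8 reached: reduce modulo POLY
            a ^= POLY
        b >>= 1
    return result

def algebraic_signature(symbols: bytes) -> int:
    """Return p_1*alpha + p_2*alpha^2 + ... + p_i*alpha^i in GF(256)."""
    sig = 0
    power = 1
    for p in symbols:
        power = gf_mul(power, ALPHA)   # alpha^j for the j-th symbol (1-based)
        sig ^= gf_mul(p, power)
    return sig
```

For instance, `algebraic_signature(bytes([1, 1]))` is `2 XOR 4 = 6`, because both symbols equal 1 and the powers of α are 2 and 4.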

4 Algebraic Signature
In our GF($2^f$), where $f = 8$ or $16$: $p + q = p - q = p \mathbin{\mathrm{XOR}} q$
One method for multiplying is: $p \cdot q = \mathrm{antilog}((\log_\alpha p + \log_\alpha q) \bmod 255)$
The division is then: $p / q = \mathrm{antilog}((\log_\alpha p - \log_\alpha q) \bmod 255)$
The modulus 255 is the case $f = 8$; in general it is $2^f - 1$.
The log and antilog are encoded in log and antilog tables with $2^f$ elements each.
– Entry 0 is for element 0 of the GF and is by convention set to $2^f - 1$.
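
The table-based arithmetic can be sketched as follows for f = 8. This is an illustration under the assumption that the field is built with the polynomial 0x11D and α = 2; the slides fix neither the polynomial nor any function names, so `gf_mul` and `gf_div` are hypothetical.

```python
POLY = 0x11D      # assumed reduction polynomial (not stated on the slides)
ORDER = 255       # 2^8 - 1, the number of non-zero field elements

# Both tables have 2^8 entries. LOG[0] is set to 255 by the convention above;
# ANTILOG[255] is left at 0 so that antilog(log 0) gives back 0.
ANTILOG = [0] * 256
LOG = [ORDER] * 256
x = 1
for e in range(ORDER):
    ANTILOG[e] = x        # alpha^e
    LOG[x] = e
    x <<= 1               # multiply by alpha = 2
    if x & 0x100:
        x ^= POLY

def gf_mul(p: int, q: int) -> int:
    """p * q = antilog((log p + log q) mod 255); zero is handled separately."""
    if p == 0 or q == 0:
        return 0
    return ANTILOG[(LOG[p] + LOG[q]) % ORDER]

def gf_div(p: int, q: int) -> int:
    """p / q = antilog((log p - log q) mod 255); q must be non-zero."""
    if q == 0:
        raise ZeroDivisionError("division by zero in GF(256)")
    if p == 0:
        return 0
    return ANTILOG[(LOG[p] - LOG[q]) % ORDER]
```

The explicit zero checks are a design choice of this sketch: the conventional log entry for 0 cannot take part in the modular addition, so 0 is short-circuited before the tables are consulted.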

5 Cumulative Algebraic Signature
We encode every symbol $p_i$ in a string into the signature of the prefix $p_1 \dots p_i$
The value of a CAS symbol now also encodes the values of all the previous symbols
Matching a single symbol means prefix matching

6 Application of CASs
Incidental stored data protection
– On P2P & Grid servers especially
Numerous CAS-encoded string matching algorithms
– Prefix match with O(1) complexity
– Pattern match by signature only: Karp-Rabin like, linear O(L) complexity
– Longest common string search
– Longest common prefix search
– …

7 CAS Properties
O(K) encoding and decoding speed
For encoding, for instance: $p'_i = p'_{i-1} + p_i \alpha^i = \mathrm{CAS}(p_{i-1}) + p_i \alpha^i$
Fast n-gram signature calculus
– For $S_{k,l} = p_k \dots p_l$ with $k > 1$ and $l - k + 1 = n$: $\mathrm{AS}(S_{k,l}) = \mathrm{AS}(S_{l-k+1}) = (p'_l \mathbin{\mathrm{XOR}} p'_{k-1}) / \alpha^{k-1}$
Logarithmic Algebraic Signature (LAS)
$\mathrm{LAS}(S_{k,l}) = \log_\alpha \mathrm{AS}(S_{k,l}) = (\log_\alpha(p'_l \mathbin{\mathrm{XOR}} p'_{k-1}) - (k-1)) \bmod (2^f - 1)$
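
The encoding recurrence and the n-gram formulas above can be sketched as below. The sketch assumes GF(256) built with the polynomial 0x11D and α = 2, stores a sentinel `cas[0] = 0` so that positions are 1-based as on the slide (which also lets the formula work for k = 1), and uses hypothetical function names.

```python
POLY, ORDER = 0x11D, 255                 # assumed GF(256) setup, alpha = 2
ANTILOG, LOG = [0] * 256, [ORDER] * 256  # LOG[0] = 255 by convention
x = 1
for e in range(ORDER):
    ANTILOG[e], LOG[x] = x, e
    x = (x << 1) ^ (POLY if x & 0x80 else 0)

def gf_mul(p, q):
    return 0 if p == 0 or q == 0 else ANTILOG[(LOG[p] + LOG[q]) % ORDER]

def cas_encode(symbols):
    """cas[i] = cas[i-1] XOR p_i * alpha^i, with cas[0] = 0 as a sentinel."""
    cas = [0]
    for i, p in enumerate(symbols, 1):
        cas.append(cas[-1] ^ gf_mul(p, ANTILOG[i % ORDER]))
    return cas

def cas_decode(cas):
    """Recover p_i = (cas[i] XOR cas[i-1]) / alpha^i for every position."""
    out = []
    for i in range(1, len(cas)):
        diff = cas[i] ^ cas[i - 1]
        out.append(0 if diff == 0 else ANTILOG[(LOG[diff] - i) % ORDER])
    return bytes(out)

def ngram_signature(cas, k, l):
    """AS(p_k ... p_l) = (cas[l] XOR cas[k-1]) / alpha^(k-1)."""
    diff = cas[l] ^ cas[k - 1]
    return 0 if diff == 0 else ANTILOG[(LOG[diff] - (k - 1)) % ORDER]

def ngram_las(cas, k, l):
    """LAS(p_k ... p_l) = (log(cas[l] XOR cas[k-1]) - (k-1)) mod 255."""
    diff = cas[l] ^ cas[k - 1]
    return ORDER if diff == 0 else (LOG[diff] - (k - 1)) % ORDER
```

With `cas = cas_encode(b"Dauphine")`, the call `ngram_signature(cas, 7, 8)` returns the same value as the signature of the standalone 2-gram `ne`, which is the property the search relies on.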

8 The n-gram Search Key ideas
Design a sublinear pattern match search
– With speed about L / K
Apply to a CAS-encoded DB
– New idea for a string search algorithm with preprocessing
– Justified for a DB: store once, search many times

9 The n-gram Search Key ideas
Preprocess the pattern to create a jump table
– As in Boyer-Moore
Use n-grams with n > 1 to increase the discriminative power of an attempt
– Each attempt compares a sample from the pattern: a single symbol for BM, the LAS of an n-gram for a CAS-encoded string

10 The n-gram Search Key ideas
If the alphabet uses m symbols, the probability that a symbol matches is 1/m
– Assuming all symbols equally likely
For usual ASCII pattern matching m =
For DNA m = 4
A single symbol may often match without the whole pattern matching
e.g., ¼ of the time for DNA on average
Leading to small jumps
– By m symbols on average

11 The n-gram Search Key ideas
The probability of an n-gram matching may be as low as $\max(1/2^f,\ 1/m^n)$
In our examples it can reach 1/256
– More discriminative sampling
– Longer jumps: by almost K or 256 symbols in general
Useful for longer strings
– DNA, text, images…
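
A small sketch of this estimate, assuming uniformly distributed symbols and simply counting how many distinct values one sample can take: at most m^n different n-grams, and at most 2^f different signature values. The function name is hypothetical.

```python
from fractions import Fraction

def ngram_match_probability(m: int, n: int, f: int = 8) -> Fraction:
    """Rough chance that a random n-gram sample matches, under uniformity.

    A sample can take at most min(2^f, m^n) distinct values, so the
    chance of an accidental match is about one over that count.
    """
    distinct_values = min(2 ** f, m ** n)
    return Fraction(1, distinct_values)
```

For DNA (m = 4) with GF(256), this gives 1/4 for single symbols, 1/16 for 2-grams, and 1/256 from 4-grams onward, where the field size becomes the limit.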

12 ASCII Example
Usual alphabet: 2-grams => 5 jumps, 1-gram => 6 jumps

13 DNA Example
4-letter alphabet: 3 jumps, 4 jumps, 11 jumps

14 The n-gram Search Preprocessing
Encode every record (string) into its CAS
– Done for incidental protection anyhow for SDDS-2006
Encode the terminal n-gram of the searched pattern $S_K$ into its LAS in variable V
Fill up the jump table T for every other n-gram in $S_K$
– Calculate every LAS
– For each LAS, store in T its rightmost offset with respect to the end of $S_K$

15 The n-gram Search Jump Table
For GF(256), for every n-gram $S_{i,i+n-1}$ in the pattern, with $j = \mathrm{LAS}(S_{i,i+n-1})$:
– $T(j)$ = the offset
– $T(j) = K - n + 1$ for every other value $j$
Reminder: LAS(0) = 255
T can also be a hash table
– See the paper
– Slower to use but possibly more memory efficient
Probably more useful for a larger GF
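
The table construction can be sketched as follows. The sketch assumes GF(256) with the polynomial 0x11D and α = 2, computes each LAS directly from the raw n-gram rather than from a CAS, and uses hypothetical names (`las`, `build_jump_table`).

```python
POLY, ORDER = 0x11D, 255                 # assumed GF(256) setup, alpha = 2
ANTILOG, LOG = [0] * 256, [ORDER] * 256  # LOG[0] = 255, i.e. LAS(0) = 255
x = 1
for e in range(ORDER):
    ANTILOG[e], LOG[x] = x, e
    x = (x << 1) ^ (POLY if x & 0x80 else 0)

def las(ngram: bytes) -> int:
    """Log of the algebraic signature p_1*alpha + ... + p_n*alpha^n."""
    sig = 0
    for j, p in enumerate(ngram, 1):
        if p:
            sig ^= ANTILOG[(LOG[p] + j) % ORDER]
    return LOG[sig]

def build_jump_table(pattern: bytes, n: int):
    """Return (V, T): LAS of the terminal n-gram and the 256-entry jump table."""
    K = len(pattern)
    if K < n:
        raise ValueError("pattern shorter than n")
    V = las(pattern[K - n:])
    T = [K - n + 1] * 256                # default jump for unseen LAS values
    # Scan left to right so a later (more rightward) n-gram overwrites an
    # earlier one with the same LAS, leaving the rightmost offset in T.
    for i in range(K - n):               # every n-gram except the terminal one
        T[las(pattern[i:i + n])] = K - n - i
    return V, T
```

For the pattern `Dauphine` with n = 2 this yields offset 1 for `in`, 3 for `ph`, 5 for `au`, and the default 7 for any LAS value that no 2-gram of the pattern produces.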

16 ASCII Example
Pattern: Dauphine
V = ne''
T: … T(in'') = 1 … T(au'') = 5 … T(ph'') = 3 …
Notation: xy'' = LAS(xy)

17 The n-gram Search Processing
Calculate the LAS of the current n-gram in the string
– Start with the n-gram $S_{K-n+1,K}$
– Continue depending on jump calculus
Attempt to match V
– If it matches, calculate the LAS of the entire current possibly matching substring of length K ending with the current n-gram
If that matches too, resolve the possible collision
– Either attempt to match all the K symbols
– Or match enough terminal n-grams or symbols to decrease the probability of collision to a very small value

18 The n-gram Search Processing
Otherwise
– Go to T using the LAS of the n-gram
– Jump by the number of symbols found in T
Update the “current” position of the n-gram to attempt the match
– Re-attempt the match as above
Unless the n-gram to attempt is beyond the end of the string
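
The two processing slides can be put together in a compact sketch. It assumes GF(256) with the polynomial 0x11D and α = 2, resolves collisions by decoding and comparing all K symbols, and, as a Horspool-style assumption not spelled out on the slides, also jumps by the table entry after an attempt that matched V. All function names are hypothetical.

```python
POLY, ORDER = 0x11D, 255                 # assumed GF(256) setup, alpha = 2
ANTILOG, LOG = [0] * 256, [ORDER] * 256
x = 1
for e in range(ORDER):
    ANTILOG[e], LOG[x] = x, e
    x = (x << 1) ^ (POLY if x & 0x80 else 0)

def cas_encode(s: bytes):
    """cas[i] = cas[i-1] XOR p_i * alpha^i, 1-based, with cas[0] = 0."""
    cas = [0]
    for i, p in enumerate(s, 1):
        cas.append(cas[-1] ^ (ANTILOG[(LOG[p] + i) % ORDER] if p else 0))
    return cas

def las(cas, k, l):
    """LAS of symbols k..l, read from the CAS without decoding."""
    diff = cas[l] ^ cas[k - 1]
    return ORDER if diff == 0 else (LOG[diff] - (k - 1)) % ORDER

def symbol(cas, i):
    """Decode the single symbol p_i from the CAS."""
    diff = cas[i] ^ cas[i - 1]
    return 0 if diff == 0 else ANTILOG[(LOG[diff] - i) % ORDER]

def search(cas, pattern: bytes, n: int = 2):
    """Return 0-based start positions of pattern in the CAS-encoded string."""
    K, L = len(pattern), len(cas) - 1
    if K < n:
        raise ValueError("pattern shorter than n")
    pcas = cas_encode(pattern)
    V = las(pcas, K - n + 1, K)          # LAS of the terminal n-gram
    whole = las(pcas, 1, K)              # LAS of the entire pattern
    T = [K - n + 1] * 256                # jump table, default jump
    for i in range(1, K - n + 1):        # non-terminal n-grams, left to right
        T[las(pcas, i, i + n - 1)] = K - (i + n - 1)
    hits, pos = [], K                    # pos = end of the current n-gram
    while pos <= L:
        v = las(cas, pos - n + 1, pos)
        if v == V and las(cas, pos - K + 1, pos) == whole:
            # Resolve possible collisions by comparing all K symbols.
            if all(symbol(cas, pos - K + j) == pattern[j - 1]
                   for j in range(1, K + 1)):
                hits.append(pos - K)
        pos += T[v]                      # assumed: jump by T in every case
    return hits
```

A usage example: `search(cas_encode(b"Universite Paris Dauphine, Dauphine"), b"Dauphine")` returns `[17, 27]`. Jumping by `T[v]` after a match is safe in the same way as in Horspool's variant of Boyer-Moore, because `T` only records n-grams other than the terminal one.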

19 ASCII Example Again
2-grams => 5 jumps, 1-gram => 6 jumps

20 DNA Example Again
3 jumps, 4 jumps, 11 jumps

21 Related Work
Implemented in SDDS-2006
Applies best to
– longer patterns where many jumps occur
– alphabets much smaller than the size of the GF used
Instead of a jump of size m on average, one reaches almost min(K, $2^f$) per jump
– up to almost 256 for DNA or ASCII with GF(256)
– up to almost 64K for DNA or Unicode with GF(64K)
instead of 4 or 25 respectively
– For Boyer-Moore especially

22 n-grams / BM
Jumps with n-grams can typically be longer
Calculating an attempt and a jump is more expensive as well
– About twice as long at first approach
– The precise analysis remains to be done
Rule of thumb: if jumps are more than 2 times longer, n-grams with n > 1 should be faster than BM.
In both our examples, this should be the case for patterns longer than:
– 50 symbols for ASCII
– 8 symbols for DNA
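
One way to read these thresholds, offered here as an interpretation rather than the authors' stated derivation: assume BM jumps average about $m$ symbols (taking the figures 25 for ASCII text and 4 for DNA quoted on the previous slide) and n-gram jumps approach the pattern length $K$ while $K < 2^f$. The rule of thumb then asks for jumps more than twice as long:

\[
K > 2m \quad\Longrightarrow\quad K > 2 \times 25 = 50 \ \text{(ASCII)}, \qquad K > 2 \times 4 = 8 \ \text{(DNA)}.
\]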

23 Related Work
In SDDS 2006 & P2P or Grid systems in general
Wish to hide what is searched for?
Use the signature-only based search
– Usually slower since linear only

24 Conclusion
A new pattern matching algorithm
Uses algebraic signatures
Preprocesses both the pattern and the string
Appears particularly efficient
– For databases
– For longer patterns
Possibly faster in this context than any other known algorithm
But all these are only preliminary results

25 Future Work
Performance analysis
– Theoretical: jump length (median, average…)
– Experimental: actual text (non-uniform symbol distribution), DNA (actual DNA strings)

26 Future Work
Variants
– Jump table
– Partial signatures of n-grams
Symbol $p_i$ encodes the signature of the n-gram $p_{i-n+1} \dots p_i$
– No more XORing & division to find this signature
– Faster unsuccessful attempt to match
– Approximate match
Tolerating match errors
– E.g., at most 1 symbol

27 Thank You for Your Attention