
An Improved Multi-Pattern Matching Algorithm for Large-Scale Pattern Sets
Authors: Zhan Peng, Yu-Ping Wang and Jin-Feng Xue
Conference: IEEE 10th International Conference on Computational Intelligence and Security (CIS), 2014
Presenter: Kuan-Chieh Feng
Date: 2015/11/18
Department of Computer Science and Information Engineering, National Cheng Kung University

Outline
- Introduction
- Wu-Manber’s algorithm
- The improved algorithm
- Experiment results
National Cheng Kung University CSIE Computer & Internet Architecture Lab

Introduction
For single-pattern matching, the two best-known algorithms are the Knuth-Morris-Pratt (KMP) algorithm and the Boyer-Moore (BM) algorithm.
For multi-pattern matching, the two most widely used algorithms are the Aho-Corasick (AC) algorithm and the Wu-Manber (WM) algorithm.

Introduction
This paper proposes an improved multi-pattern matching algorithm, built on the framework of the Wu-Manber (WM) algorithm, that deals effectively with large pattern sets.

Wu-Manber’s algorithm
- A shift-based algorithm that matches all patterns at the same time, which is why it is called multi-pattern matching.
- It supports a large number of patterns because its data structures do not occupy much memory.
- It needs to preprocess the pattern set to construct its data structures.
[4] Sun Wu, Udi Manber, "A Fast Algorithm for Multi-Pattern Searching," Technical Report TR 94-17, University of Arizona at Tucson, May 1994

Wu-Manber’s algorithm
Contains two stages:
- Preprocessing stage
- Scanning stage

Wu-Manber’s algorithm
Preprocessing stage:
- LSP: length of the shortest pattern in the pattern set (the scanning window size)
- feature-string (f-string): the first LSP characters of each pattern
- feature-string set: the set of all f-strings, denoted F
- B-gram: a block of B characters; B is usually set to 2 or 3 (the block size)
Based on F, three tables are built: the SHIFT table, the HASH table and the PREFIX table.

Wu-Manber’s algorithm
Pattern set = {archer}, window size (LSP) = 4, block size (B) = 2, so the f-string is "arch".

B-gram   Shift value
ch       0
rc       1
ar       2

Wu-Manber’s algorithm
LSP = 4 and B = 2 (on the slides, m denotes the window size and k the block size).

Pattern set
index   Pattern   f-string
1       such      such
2       rich      rich
3       archer    arch
4       check     chec

Shift table
B-gram   Shift value (default -> final)   Patterns containing the B-gram
ar       3 -> 2                           {archer}
ch       3 -> 0                           {archer, such, rich, check}
ec       3 -> 0                           {check}
he       3 -> 1                           {check}
ic       3 -> 1                           {rich}
ri       3 -> 2                           {rich}
su       3 -> 2                           {such}
uc       3 -> 1                           {such}
others   3

The default value of every shift table entry is LSP - B + 1.
For a B-gram that occurs in an f-string, the shift value is LSP - q, where q is the position at which the rightmost occurrence of the B-gram ends in the f-string. When several f-strings contain the same B-gram, the smallest value is kept.
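
The shift-table rules on this slide (default LSP - B + 1, value LSP - q, smallest value wins) can be sketched in Python as follows. This is only an illustrative sketch, not the authors' code: the function name build_shift_table and the dictionary representation are assumptions, and a real implementation would hash each B-gram into a fixed-size array. The sketch also produces an entry for "rc" (from the f-string "arch"), as in the single-pattern example, although the table on this slide does not list it.

```python
def build_shift_table(f_strings, block=2):
    """Return (default_shift, shift) for equal-length feature strings.

    Hypothetical helper for illustration; `shift` maps a B-gram to its
    shift value, and any B-gram not in the dict uses `default_shift`.
    """
    lsp = len(f_strings[0])          # all f-strings have length LSP
    default_shift = lsp - block + 1  # value for B-grams in no f-string
    shift = {}
    for f in f_strings:
        # q is the 1-based position at which the B-gram ends in the f-string.
        for q in range(block, lsp + 1):
            gram = f[q - block:q]
            # Keep the smallest shift seen for this B-gram.
            shift[gram] = min(shift.get(gram, default_shift), lsp - q)
    return default_shift, shift


default, shift = build_shift_table(["such", "rich", "arch", "chec"])
print(default)      # 3
print(shift["ch"])  # 0
print(shift["ar"])  # 2
```

Keeping the minimum is what makes the shift safe: a larger value could skip over an occurrence of another pattern that contains the same B-gram closer to the end of its f-string.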

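Wu-Manber’s algorithm
Scanning example with m = 4 (window size) and k = 2 (block size).
Input text = s u c t i r i c h e c k

Shift table
B-gram        ar  ch  ec  he  ic  ri  su  uc  others
Shift value   2   0   0   1   1   2   2   1   3

Hash table
key      value (pattern indices)
ch       1~3
ec       4
others   0

Pattern table
index   Pattern
1       such
2       rich
3       archer
4       check

Prefix table
Prefix   Pattern
ri       rich
ch       check
……

Scan trace:
1. Window "suct": the B-gram suffix "ct" is not in the shift table, so shift 3 characters.
2. Window "tiri": "ri" has shift value 2, so shift 2 characters.
3. Window "rich": "ch" has shift value 0. The hash entry for "ch" lists patterns 1~3, and the prefix "ri" selects "rich". Matched! After full matching, shift 1 character.
4. Window "iche": "he" has shift value 1, so shift 1 character.
5. Window "chec": "ec" has shift value 0. The hash entry for "ec" lists pattern 4, and the prefix "ch" selects "check". Matched! After full matching, shift 1 character.
6. Window "heck": "ck" is not in the shift table, so shift 3 characters, which moves the window past the end of the text. End scan input text.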
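
The scanning stage on this slide can be sketched in Python as below. It is a minimal sketch under stated assumptions, not the original implementation: the names wm_preprocess and wm_search are hypothetical, plain dictionaries stand in for the hashed SHIFT and HASH tables, and the PREFIX table is replaced by a direct comparison of the first B characters. Positions are 0-based.

```python
def wm_preprocess(patterns, block=2):
    """Build the shift table and hash lists (hypothetical helper)."""
    lsp = min(len(p) for p in patterns)   # window size
    default_shift = lsp - block + 1
    shift = {}
    hash_lists = {}  # B-gram suffix of an f-string -> indices of patterns
    for idx, p in enumerate(patterns):
        f = p[:lsp]                       # original WM: f-string is the prefix
        for q in range(block, lsp + 1):
            gram = f[q - block:q]
            shift[gram] = min(shift.get(gram, default_shift), lsp - q)
        hash_lists.setdefault(f[-block:], []).append(idx)
    return lsp, default_shift, shift, hash_lists


def wm_search(text, patterns, block=2):
    """Return a list of (start_position, pattern) for every match."""
    lsp, default_shift, shift, hash_lists = wm_preprocess(patterns, block)
    matches = []
    pos = 0                               # start of the scan window
    while pos + lsp <= len(text):
        window = text[pos:pos + lsp]
        suffix = window[-block:]
        s = shift.get(suffix, default_shift)
        if s > 0:
            pos += s                      # no f-string can end here
            continue
        for idx in hash_lists.get(suffix, ()):
            p = patterns[idx]
            # Prefix filter first, then full verification of the pattern.
            if p[:block] == window[:block] and text.startswith(p, pos):
                matches.append((pos, p))
        pos += 1                          # after full matching, shift 1 character
    return matches


patterns = ["such", "rich", "archer", "check"]
text = "".join("s u c t i r i c h e c k".split())
print(wm_search(text, patterns))  # [(5, 'rich'), (7, 'check')]
```

The loop follows the same sequence of shifts as the example: 3, 2, a match on "rich", 1, a match on "check", and a final shift of 3 that ends the scan.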

Improved algorithm
Two limitations of the WM algorithm:
- Performance is severely affected by LSP. If LSP is very small, the algorithm has little opportunity to shift far.
- As the pattern set grows, the lists attached to the HASH table may become unbalanced (some lists become much longer than others).

Improved algorithm
Two aspects:
- A selection method for choosing f-strings: reduces the number of candidate patterns for a scan window.
- INDEX table: reduces the time for finding candidate patterns in hash lists.

Improved algorithm - selection method
The original WM algorithm always chooses the first LSP characters of a pattern as its f-string, without considering any characteristics of the pattern set.

Improved algorithm - selection method
Here is a simple selection strategy that depends only on the pattern set itself. It contains two steps:
Step 1: For every possible B-gram, count and record how many patterns in the pattern set contain that B-gram as a substring.
Step 2: For a given pattern, among all its substrings of length LSP, pick the one whose B-gram suffix has the minimum occurrence count and make it the f-string of the pattern.

Improved algorithm - selection method
Step 1: For example, take the pattern set
P = { p1: "abden", p2: "abduct", p3: "abd", p4: "abduce", p5: "abducent" }
The number of patterns in P that contain each 2-gram is:

2-gram   ab  bd  du  uc  ce  en  nt  ct  de  others
count    5   5   3   3   2   2   1   1   1   0

Improved algorithm - selection method
Step 2: The corresponding set F of the given pattern set P (here LSP = 3, the length of p3) is:
F = { fp1: "bde", fp2: "uct", fp3: "abd", fp4: "uce", fp5: "ent" }
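
The two selection steps can be sketched in Python as follows, using the example pattern set P with B = 2. The function names count_bgrams and select_f_strings are hypothetical, and the tie-breaking rule (take the leftmost substring when several B-gram suffixes share the minimum count) is an assumption, since the slides do not say how ties are resolved.

```python
from collections import Counter


def count_bgrams(patterns, block=2):
    """Step 1: for each B-gram, count how many patterns contain it."""
    counts = Counter()
    for p in patterns:
        # A set ensures each pattern contributes at most once per B-gram.
        grams = {p[i:i + block] for i in range(len(p) - block + 1)}
        counts.update(grams)
    return counts


def select_f_strings(patterns, block=2):
    """Step 2: pick, per pattern, the length-LSP substring whose B-gram
    suffix has the minimum count (leftmost on ties: an assumption)."""
    lsp = min(len(p) for p in patterns)
    counts = count_bgrams(patterns, block)
    chosen = []
    for p in patterns:
        candidates = [p[i:i + lsp] for i in range(len(p) - lsp + 1)]
        # min() returns the first candidate with the smallest key.
        chosen.append(min(candidates, key=lambda s: counts[s[-block:]]))
    return chosen


P = ["abden", "abduct", "abd", "abduce", "abducent"]
print(count_bgrams(P)["ab"])  # 5
print(select_f_strings(P))    # ['bde', 'uct', 'abd', 'uce', 'ent']
```

Choosing f-strings whose suffix B-gram is rare spreads the patterns over more hash lists, which is how the selection reduces the number of candidates per scan window.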

Improved algorithm - INDEX table
In the original algorithm, when the B-gram suffix of the current scan window (assume it hashes to i) is encountered during scanning, every pattern in the hash list attached to HASH[i] is checked for candidacy using the PREFIX table. This is inefficient for long hash lists.

Improved algorithm - INDEX table
A simple auxiliary data structure called the "INDEX table" is designed to take the place of the PREFIX table.

Experiment Results

Experiment Results - under various LSPs
The size of each pattern set and the size of the text are fixed at 5x10^5 and 100 MB, respectively.

Experiment Results - under various numbers of patterns
The LSP of each pattern set is fixed at 7, the size of the text is still 100 MB, and B = 3.
