Fast Submatch Extraction using OBDDs
Liu Yang (1), Pratyusa Manadhata (2), William Horne (2), Prasad Rao (2), Vinod Ganapathy (1)
(1) Rutgers University, (2) HP Laboratories

Applications of Regular Expressions
[Diagram: signatures and network traffic feed a NIDS, which emits alerts.]
Network intrusion detection systems (NIDS) employ regular expressions to represent attack signatures.

Applications of Regular Expressions (cont.)
[Diagram: connectors (rule set) feed a SIEM; labels: Web, security, compliance.]
Security information and event management (SIEM) systems employ regular expressions to normalize event logs generated by hardware connectors and software systems.

Submatch Extraction
Rule set: … username=(.*), hostname=(.*) …
Input: username=Bob, hostname=Foo
Submatch extraction: $1 = Bob, $2 = Foo
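The example on this slide can be reproduced with any regex engine that supports capturing groups. The sketch below uses Python's standard `re` module purely as a stand-in (it is a backtracking engine, not the Submatch-OBDD toolchain); the function name `extract` is illustrative.

```python
import re

# Rule from the slide; each parenthesised group is one submatch ($1, $2).
RULE = re.compile(r"username=(.*), hostname=(.*)")

def extract(line):
    """Return the tuple of submatches, or None when the rule does not match."""
    m = RULE.search(line)
    return m.groups() if m else None

if __name__ == "__main__":
    print(extract("username=Bob, hostname=Foo"))
```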

Signature Matching
Non-deterministic finite automata (NFAs)
– Space efficient, time inefficient
Deterministic finite automata (DFAs)
– Time efficient, but prone to state blow-up
Recursive backtracking
– Fast in general
– Vulnerable to algorithmic complexity attacks

Motivation: Time/Space Tradeoff
[Figure: plot with axes Space and Time, positioning the ideal point, DFA (deterministic finite automaton), NFA (non-deterministic finite automaton), backtracking, and our approach.]

Our Contributions
A novel way of annotating capturing groups: tagged NFAs
Design of a novel technique for submatch extraction (called Submatch-OBDD)
– Extending Thompson’s algorithm
– Using Boolean functions to represent tagged NFAs
– Using ordered binary decision diagrams (OBDDs) to improve time efficiency
Evaluation and comparison with RE2 and PCRE
Note: RE2 is a hybrid approach, using a mix of DFA/NFA, while PCRE uses recursive backtracking.

Solution Overview
RegExps with capturing groups → Tagged NFAs → Boolean representations → OBDD representations

NFA Representation of RegExps
E = a*aa
NFA of regexp “a*aa”, with transition table T(x,i,y):
Current state (x) | Input symbol (i) | Next state (y)
1 | a | 1
1 | a | 2
2 | a | 3

Submatch Tagging: tagged NFAs
E = (a*)aa
Tag(E) = (a*)_t1 aa
Tagged NFA of “(a*)aa” with submatch tag t1 (the transition from state 1 to itself is labeled a/t1)
Extended transition table T(x,i,y,t) of the tagged NFA:
Current state (x) | Input symbol (i) | Next state (y) | Output tags (t)
1 | a | 1 | {t1}
1 | a | 2 | {}
2 | a | 3 | {}

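Match Test
RegExp = (a*)aa; Input: aaaa
[Diagram: as the input is consumed the frontier advances through {1}, {1,2}, {1,2,3}; the transition from state 1 to itself outputs {t1}; the run ends in accept.]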
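The frontier-based match test can be sketched directly from the extended transition table. This is a minimal illustration, assuming start state 1 and accept state 3 as shown in the slides; the names `T`, `START`, `ACCEPT`, `frontiers`, and `matches` are hypothetical and not taken from the authors' toolchain.

```python
# Extended transition table T(x, i, y, t) of the tagged NFA for (a*)aa.
# Each row: (current state, input symbol, next state, output tags).
T = [
    (1, "a", 1, frozenset({"t1"})),
    (1, "a", 2, frozenset()),
    (2, "a", 3, frozenset()),
]
START = {1}   # assumed start state set
ACCEPT = {3}  # assumed accept state set

def frontiers(text):
    """Return the frontier before the first symbol and after every symbol."""
    frontier = set(START)
    trace = [frontier]
    for ch in text:
        # A state y joins the new frontier if some row leads to it
        # from a state in the current frontier on symbol ch.
        frontier = {y for (x, i, y, _tags) in T if x in frontier and i == ch}
        trace.append(frontier)
    return trace

def matches(text):
    """The input matches if the final frontier contains an accept state."""
    return bool(frontiers(text)[-1] & ACCEPT)

if __name__ == "__main__":
    print(frontiers("aaaa"))
    print(matches("aaaa"))
```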

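Submatch Extraction
[Diagram: the same run on input aaaa, with frontiers {1}, {1,2}, {1,2,3}, the {t1} output on the transition from state 1 to itself, and accept.]
Any path from an accept state to a start state generates a valid assignment of submatches.
$1 = aa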
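A possible sketch of this backward walk: run the match test forward while remembering every frontier, then step from an accept state back to a start state and collect the characters whose transition carries the tag. It assumes the table for (a*)aa, start state 1, and accept state 3, and it simply joins all tagged characters, which is adequate for this single-group example. All names are illustrative.

```python
T = [
    (1, "a", 1, {"t1"}),
    (1, "a", 2, set()),
    (2, "a", 3, set()),
]
START, ACCEPT = {1}, {3}  # assumed from the slides

def extract(text, tag="t1"):
    """Return the characters tagged with `tag`, or None if there is no match."""
    # Forward pass: trace[k] is the frontier before symbol k is read.
    trace = [set(START)]
    for ch in text:
        trace.append({y for (x, i, y, _t) in T if x in trace[-1] and i == ch})
    final = trace[-1] & ACCEPT
    if not final:
        return None
    state = min(final)  # any reachable accept state is a valid starting point
    tagged = []
    # Backward pass: pick any predecessor that was in the earlier frontier.
    for k in range(len(text) - 1, -1, -1):
        prev, tags = next(
            (x, t) for (x, i, y, t) in T
            if y == state and i == text[k] and x in trace[k]
        )
        if tag in tags:
            tagged.append(text[k])
        state = prev
    return "".join(reversed(tagged))

if __name__ == "__main__":
    print("$1 =", extract("aaaa"))
```

Restricting predecessors to states in the recorded frontier guarantees that the backward walk ends in a start state, so the path found corresponds to a real accepting run.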

Complexity of Tagged NFAs
Match test: (formula missing in the transcript)
Submatch extraction: (formula missing in the transcript)
n: size of the tagged NFA
l: length of the input string
Can we make the operations faster?

Submatch-OBDD
Representing tagged NFAs using Boolean functions
– Updating frontiers in one step using a single Boolean formula
Using OBDDs to manipulate Boolean functions

Transitions as Boolean Functions
RegExp: (a*)aa
Current state (x) | Input symbol (i) | Next state (y) | Output tag (t)
1 | a | 1 | {t1}
1 | a | 2 | {}
2 | a | 3 | {}
T(x,i,y,t) = (1 ∧ a ∧ 1 ∧ t1) ∨ (1 ∧ a ∧ 2 ∧ {}) ∨ (2 ∧ a ∧ 3 ∧ {})
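To make the disjunction above concrete, each table row can be encoded as a fixed-width bit vector, so that T becomes a Boolean function over those bits. The sketch below assumes a plain binary encoding (two bits per state, one bit for the symbol, one bit for the tag); the actual variable encoding and ordering used by Submatch-OBDD are not given in the slides, and `SYMBOL_CODE`, `bits`, `encode`, and `T` are hypothetical names.

```python
from itertools import product

# Rows of the extended transition table; the last field says whether t1 is output.
ROWS = [(1, "a", 1, True), (1, "a", 2, False), (2, "a", 3, False)]
SYMBOL_CODE = {"a": 0, "b": 1}  # hypothetical one-bit alphabet encoding

def bits(value, width):
    """Big-endian bit tuple of `value` using `width` bits."""
    return tuple((value >> k) & 1 for k in reversed(range(width)))

def encode(x, i, y, t):
    """Bit vector for one row: x (2 bits), i (1 bit), y (2 bits), t (1 bit)."""
    return bits(x, 2) + bits(SYMBOL_CODE[i], 1) + bits(y, 2) + (int(t),)

# Each row is one conjunction (minterm); T is the disjunction of all of them.
MINTERMS = {encode(*row) for row in ROWS}

def T(assignment):
    """Characteristic function: True iff the 6-bit assignment is a table row."""
    return tuple(assignment) in MINTERMS

if __name__ == "__main__":
    satisfying = [v for v in product((0, 1), repeat=6) if T(v)]
    print(satisfying)
```

An OBDD would store this same function as a shared decision graph over the six variables rather than as an explicit set of minterms.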

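Match Test using Boolean Functions
Input: aaaa …
Each step conjoins the current states, the input symbol, and the transition table; the result is the set of intermediate transitions, whose next states become the current states of the following step.
Step 1 (start states {1}): {1} ∧ a ∧ T(x,i,y,t) = (1∧a∧1∧t1) ∨ (1∧a∧2∧{}); next states {1,2}
Step 2: {1,2} ∧ a ∧ T(x,i,y,t) = (1∧a∧1∧t1) ∨ (1∧a∧2∧{}) ∨ (2∧a∧3∧{}); next states {1,2,3}
Step 3: {1,2,3} ∧ a ∧ T(x,i,y,t) = (1∧a∧1∧t1) ∨ (1∧a∧2∧{}) ∨ (2∧a∧3∧{})
…
Accept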

Submatch Extraction using Boolean Functions
Input: aaaa. Start from the last symbol, going backwards.
Symbol 4: conjoin the last input symbol and the accept state (a ∧ 3) with intermediate transitions [4] = (1 ∧ a ∧ 1 ∧ t1) ∨ (1 ∧ a ∧ 2 ∧ {}) ∨ (2 ∧ a ∧ 3 ∧ {})
– Result: 2 ∧ a ∧ 3 ∧ {}; the previous state of 3 is 2; no output submatch tag
– Rename the previous state as the current state and continue
Symbol 3: conjoin a ∧ 2 with intermediate transitions [3] = (1 ∧ a ∧ 1 ∧ t1) ∨ (1 ∧ a ∧ 2 ∧ {}) ∨ (2 ∧ a ∧ 3 ∧ {})
– Result: 1 ∧ a ∧ 2 ∧ {}; the previous state of 2 is 1; no output submatch tag

Submatch Extraction using Boolean Functions (cont.)
Symbol 2: conjoin a ∧ 1 with intermediate transitions [2] = (1∧a∧1∧t1) ∨ (1∧a∧2∧{}) ∨ (2∧a∧3∧{})
– Result: 1∧a∧1∧t1; the previous state of 1 is 1; output submatch tag t1
Symbol 1: conjoin a ∧ 1 with intermediate transitions [1] = (1∧a∧1∧t1) ∨ (1∧a∧2∧{})
– Result: 1∧a∧1∧t1; output submatch tag t1
The first two symbols of aaaa carry t1, so $1 = aa.

More Formal: Match Test
Finding new frontiers after processing an input symbol:
Next frontiers = (formula missing in the transcript)
Checking acceptance: (formula missing in the transcript)
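The formulas for this slide did not survive in the transcript. As a hedged sketch only, the frontier update described on the earlier slides has the shape of a standard relational image computation. The symbols F (current frontier), I_σ (encoding of the input symbol σ), A (set of accept states), and M_σ (intermediate transitions) are names assumed here, not necessarily the authors' notation.

```latex
% Assumed notation: F(x) current frontier, I_\sigma(i) input symbol,
% T(x,i,y,t) tagged transition relation, A(x) accept states.
\begin{align*}
M_\sigma(x,i,y,t) &= F(x) \land I_\sigma(i) \land T(x,i,y,t)
  && \text{(intermediate transitions)} \\
F'(x) &= \bigl(\exists x\,\exists i\,\exists t.\; M_\sigma(x,i,y,t)\bigr)[y \mapsto x]
  && \text{(next frontier, after renaming } y \text{ to } x\text{)} \\
\text{accept} &\iff F(x) \land A(x) \not\equiv \mathit{false}
  && \text{(some frontier state is accepting)}
\end{align*}
```

For the running example with F = {1} and σ = a, the first line yields (1∧a∧1∧t1) ∨ (1∧a∧2∧{}) and the second yields the frontier {1,2}, in agreement with the worked match test.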

More Formal: Submatch Extraction
Submatch extraction: the submatch for tag t_i is the last consecutive sequence of characters that are assigned t_i
A backward traversal approach: start from the last input symbol.
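The "last consecutive sequence" rule can be illustrated in isolation. The sketch assumes the backward traversal has already produced, for every input position, the set of tags output on that step; `last_run` and `tags_per_char` are hypothetical names.

```python
def last_run(text, tags_per_char, tag):
    """Return the last maximal run of consecutive characters carrying `tag`.

    tags_per_char[k] is the set of tags assigned to text[k].
    An empty string is returned when no character carries the tag.
    """
    end = len(text)
    # Skip trailing characters that do not carry the tag.
    while end > 0 and tag not in tags_per_char[end - 1]:
        end -= 1
    start = end
    # Extend leftwards while the tag is still present.
    while start > 0 and tag in tags_per_char[start - 1]:
        start -= 1
    return text[start:end]

if __name__ == "__main__":
    # Tag assignment for (a*)aa on input aaaa: the first two symbols carry t1.
    print(last_run("aaaa", [{"t1"}, {"t1"}, set(), set()], "t1"))
```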

Submatch-OBDD
Representation of tagged NFAs, match test, and submatch extraction using OBDDs
OBDD representations for
– Transitions with submatch tags
– Intermediate transitions
– Submatch tags
– Set of start states
– Set of accept states
– Set of frontiers
– Input symbols

Implementation
Toolchain in C++, interfacing with CUDD (a package for the manipulation of binary decision diagrams)
RegExps → RE2TNFA → Tagged NFAs → TNFA2OBDD → OBDDs → PATTERNMATCH
PATTERNMATCH input: input strings / network traffic
PATTERNMATCH output: either “Matched at reg#” with submatches $1 = …, $2 = …, or “No match”

Feasibility Study
Data sets
– Snort-2009
  RegExps: 115 regexps with capturing groups from HTTP rules
  Traces:
  – 1.2GB department network traffic (average packet size 126 bytes)
  – 1.3GB Twitter traffic (average packet size 1202 bytes)
  – 1MB synthetic trace (average string length 311 bytes)
– Snort-2012
  RegExps: 403 regexps with capturing groups from HTTP rules
  Traces:
  – 1.2GB department network traffic (average packet size 126 bytes)
  – 1.3GB Twitter traffic (average packet size 1202 bytes)
  – 1MB synthetic trace (average string length 689 bytes)
– Firewall-504
  RegExps: 504 patterns from a commercial firewall F
  Trace: 87MB of firewall logs (average line size 87 bytes)

Experimental Setup
Platform: Intel Core2 Duo E7500, Linux, 2GB RAM
Two configurations for pattern matching
– Conf.S
  patterns compiled individually
  compiled patterns matched sequentially against input traces
– Conf.C
  patterns combined with UNION and compiled
  combined pattern matched against input traces
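The difference between the two configurations can be sketched with any regex library. The example below uses Python's `re` module as a stand-in, which is a backtracking engine and not one of the systems measured in the talk; the two patterns are borrowed from earlier slides, and the function and group names are illustrative.

```python
import re

PATTERNS = [r"username=(.*), hostname=(.*)", r"(a*)aa"]

# Conf.S: every pattern is compiled on its own and tried in sequence.
COMPILED_S = [re.compile(p) for p in PATTERNS]

def match_sequential(line):
    """Return the index of the first pattern that matches the whole line."""
    for idx, rx in enumerate(COMPILED_S):
        if rx.fullmatch(line):
            return idx
    return None

# Conf.C: all patterns are joined with union and compiled once.
# Each alternative is wrapped in a named group so the matching rule can be identified.
COMBINED = re.compile(
    "|".join(f"(?P<r{idx}>{p})" for idx, p in enumerate(PATTERNS))
)

def match_combined(line):
    """Return the index of the alternative that matched the whole line."""
    m = COMBINED.fullmatch(line)
    if m is None:
        return None
    for idx in range(len(PATTERNS)):
        if m.group(f"r{idx}") is not None:
            return idx
    return None

if __name__ == "__main__":
    for line in ["username=Bob, hostname=Foo", "aaaa", "zzz"]:
        print(line, match_sequential(line), match_combined(line))
```

In Conf.S the input is scanned once per pattern, whereas in Conf.C it is scanned once against a single, larger automaton.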

Performance
[Figure: execution time (cycles/byte) and memory consumption (MB) of RE2, PCRE, and Submatch-OBDD for the Snort-2009 data set]

Performance
[Figure: execution time (cycles/byte) and memory consumption (MB) of RE2, PCRE, and Submatch-OBDD for the Snort-2012 data set]

Performance
[Figure: execution time (cycles/byte) and memory consumption (MB) of RE2, PCRE, and Submatch-OBDD for the Firewall-504 data set]

Related Work
NFA-OBDD [Yang et al., RAID’10; Chasaki and Wolf, ANCS’10]
RE2 [Cox, code.google.com/p/re2]
PCRE [ ]
TNFA [Laurikari et al., SPIRE’00]
MDFA [Yu et al., ANCS’06]
Hybrid FA [Becchi and Crowley, CoNEXT’07]
XFA [Smith et al., Oakland’08]
More: see paper for details

Conclusion
A novel way of annotating capturing groups
Submatch-OBDD: a novel technique for submatch extraction using OBDDs
Feasibility study
– Submatch-OBDD achieves ideal performance when patterns are combined
– Faster than RE2 and PCRE when patterns are combined