Decompression-Free Inspection: DPI for Shared Dictionary Compression over HTTP

Presentation transcript:

Decompression-Free Inspection: DPI for Shared Dictionary Compression over HTTP
Anat Bremler-Barr, Interdisciplinary Center Herzliya
Shimrit Tzur David, Interdisciplinary Center Herzliya & The Hebrew University, Jerusalem
David Hay, The Hebrew University, Jerusalem
Yaron Koral, Tel Aviv University

Outline
Motivation
Background
  ◦ AC algorithm
Our solution
  ◦ The offline phase
  ◦ The online phase
Experimental results

Deep Packet Inspection (DPI)
Search for patterns in the packets' payload
Signature-based NIDS
  ◦ Intrusion prevention
Web-application firewalls
  ◦ Leakage prevention
  ◦ Content filtering
Challenges:
  ◦ Thousands of known malicious patterns
  ◦ Real time, link rate
The performance of security tools is dominated by the pattern matching engine (Fisk & Varghese 2002)

Compressed HTTP
19% increase in 8 months!
84.1% of the top 1,000 sites compress their traffic.
Data compression is done by adding references to repeated data.
There are two types of compression:
  ◦ Intra-response compression: the references point to bytes within the response (Gzip/Deflate)
  ◦ Inter-response/connection compression: the references point to bytes in a separate file, called a dictionary (Google's SDCH)

Example – Intra-Response Compression
File1.html: abcdefgabcd
File2.html: abcdxyzbcdtr
Encode repeated strings by pointer: {distance, length}
[Figure: client-server exchange after TCP connection setup]
GET File1.html → abcdefg(7,4)
GET File2.html → abcdxyz(6,3)tr
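
The pointer scheme on this slide can be sketched in a few lines of Python. This is only an illustration of {distance, length} back-references, not an implementation of Gzip/Deflate; the token-list representation and the name `decode_intra` are assumptions made for the example.

```python
def decode_intra(tokens):
    """Expand literals and (distance, length) back-references into plain text.

    `tokens` is a hypothetical in-memory form of the compressed response:
    strings are literals, tuples are (distance, length) pointers that copy
    `length` bytes starting `distance` bytes back in the output so far.
    """
    out = []
    for tok in tokens:
        if isinstance(tok, str):
            out.extend(tok)
        else:
            distance, length = tok
            start = len(out) - distance
            for i in range(length):  # byte by byte, so overlapping copies also work
                out.append(out[start + i])
    return "".join(out)


# The two responses from the slide.
file1 = decode_intra(["abcdefg", (7, 4)])
file2 = decode_intra(["abcdxyz", (6, 3), "tr"])
print(file1, file2)
```

Note that each pointer can only be resolved once the preceding bytes of the same response have been reconstructed, which is why intra-response compression normally forces decompression before scanning.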

Example – Inter-Response Compression
Dictionary: abcd
File1.html: abcdefgabcd
File2.html: abcdxyzbcdtr
Copy repeated strings from the dictionary: (address, length)
[Figure: client-server exchange after TCP connection setup]
GET File1.html → Delta file: (0,4)efg(0,4)
GET File2.html → Delta file: (0,4)xyz(1,3)tr
GET dictionary → abcd
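
A minimal sketch of how such a delta file expands, assuming a simplified in-memory form where literals are strings and copy instructions are (address, length) tuples. Real SDCH responses use a binary delta encoding, which is not modeled here; `decode_delta` is a hypothetical helper name.

```python
def decode_delta(dictionary, delta):
    """Expand a delta file against a shared dictionary.

    Strings in `delta` are literal bytes; tuples are (address, length)
    copy instructions that refer to positions in `dictionary`.
    """
    parts = []
    for item in delta:
        if isinstance(item, str):
            parts.append(item)
        else:
            address, length = item
            parts.append(dictionary[address:address + length])
    return "".join(parts)


DICTIONARY = "abcd"
file1 = decode_delta(DICTIONARY, [(0, 4), "efg", (0, 4)])
file2 = decode_delta(DICTIONARY, [(0, 4), "xyz", (1, 3), "tr"])
print(file1, file2)
```

Unlike the intra-response case, every copy instruction refers to the dictionary, whose content is the same for all responses. That is what makes it possible to scan the dictionary once in advance.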

Current NIDS Operation (1)
[Figure: Client, NIDS, Server]
Client → NIDS → Server: GET /index.html, Accept-Encoding: SDCH
Server → NIDS → Client: uncompressed HTTP
NIDS: scan for intrusions

Current NIDS Operation (2)
[Figure: Client, NIDS, Server]
Client → NIDS → Server: GET /index.html, Accept-Encoding: SDCH
Server → NIDS → Client: compressed HTTP
NIDS: do not scan, or decompress, scan, and compress again

[Figure: Client, NIDS, Server]
Client → NIDS → Server: GET /index.html, Accept-Encoding: SDCH
Server → NIDS → Client: compressed HTTP
NIDS: scan directly with no decompression

Our Solution: Decompression-Free Scanning
Focused on inter-response compression
Our algorithm works in two phases
  ◦ Offline phase: scanning the dictionary
  ◦ Online phase: scanning the delta files
Works at the rate of the compressed traffic
  ◦ Gains a 56% improvement compared with scanning the plain text directly

Outline
Motivation
Background
  ◦ Aho-Corasick (AC) algorithm
Our solution
  ◦ The offline phase
  ◦ The online phase
Experimental results

Aho-Corasick (AC) Algorithm
Finite State Machine (FSM)
  ◦ Regular states, accepting states
Goto function (black arrows)
  ◦ g(state, symbol) → state
Each state corresponds to a label: the sequence of characters on its goto path from the root.
  ◦ The length of the label is the depth of the state
Failure function (red arrows)
  ◦ f(state) → state
  ◦ Taken when there is no goto transition for the current symbol
  ◦ Goes to the state whose label is the longest proper suffix of the current state's label
[Figure: AC automaton with states s0 to s14 for the patterns below]
Patterns: E, BE, BD, BCAA, BCD, CDBCAB
The label of s14 is BCAA
g(s11, B) = s12
g(s11, A) = ?
f(s11) = s13, so g(s11, A) is resolved as g(s13, A) = s14
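
The goto and failure functions can be illustrated with a deliberately naive Python sketch in which each state is named by its label rather than by a number. This is an assumption made for readability; it is not how a production automaton is built, and the helper names `f`, `g`, and `scan` are chosen only to mirror the slide's notation.

```python
PATTERNS = ["E", "BE", "BD", "BCAA", "BCD", "CDBCAB"]

# One state per distinct pattern prefix; the empty label "" is the root.
STATES = {p[:i] for p in PATTERNS for i in range(len(p) + 1)}


def f(state):
    """Failure function: longest proper suffix of the label that is itself a label."""
    for i in range(1, len(state) + 1):
        if state[i:] in STATES:
            return state[i:]
    return ""  # the root fails to itself


def g(state, symbol):
    """Goto with failure fallback: follow failure links until a goto edge exists."""
    while state + symbol not in STATES:
        if state == "":
            return ""  # no edge from the root: stay at the root
        state = f(state)
    return state + symbol


def scan(text):
    """Return the final state and all (start, pattern) matches in `text`."""
    state, hits = "", []
    for end, ch in enumerate(text, 1):
        state = g(state, ch)
        # Every pattern ending here is a suffix of the current label.
        hits += [(end - len(p), p) for p in PATTERNS if state.endswith(p)]
    return state, hits


print(len(STATES), f("CDBCA"), g("CDBCA", "B"), g("CDBCA", "A"))
```

In this naming, the slide's s11 is the state labeled CDBCA, s13 is BCA, and s14 is BCAA, so the last two calls reproduce g(s11, B) and the failure fallback for g(s11, A).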

Aho-Corasick Insights
The automaton remembers only its current state
  ◦ The input text ends with the label of the current state
  ◦ This label is the longest suffix of the text that is a prefix of some pattern
No future match can begin before this label
[Figure: AC automaton with states s0 to s14]

Outline
Motivation
Background
  ◦ Aho-Corasick (AC) algorithm
Our solution
  ◦ The offline phase
  ◦ The online phase
Experimental results

Accelerator Algorithm Idea
The algorithm operates in two phases:
The offline phase:
  ◦ Scan the dictionary and store information about the pattern matching results
The online phase:
  ◦ Scan the delta file and skip almost all referenced bytes, which were already scanned for patterns

The Offline Phase
The dictionary is scanned using AC (from its first byte and from s0).
We save the state after each byte:
Position:  0   1   2   3   4   5   6   7   8   9   10  11
Symbol:    D   B   E   A   A   C   D   B   C   A   B   C
State:     s0  s2  s3  s0  s0  s7  s8  s9  s10 s11 s12 s5
We also save information about the matched patterns that are found in the dictionary
[Figure: AC automaton with states s0 to s14]

Challenges
Dictionary: DBEAACDBCABC
Delta file: ABDB(5,4)AAB(1,4)
The uncompressed data is: ABDBCDBCAABBEAA
Patterns/signatures: E, BE, BD, BCAA, BCD, CDBCAB
We copy from an arbitrary position in the dictionary while the automaton is in an arbitrary state
  ◦ We show that no matter in what state and at which symbol we start to copy, the resulting state is reachable via failure transitions from the saved state.
Types of matches:
  ◦ Right boundary
  ◦ Internal
  ◦ Left boundary

The Online Phase
Scan the delta file:
Uncompressed bytes: scan using AC.
Copy instruction (p,x)
  ◦ Refers to dictionary data that we already scanned in the offline phase.
  ◦ We skip the scan for almost all of these bytes.
The internal match is trivial; see the paper for details.

The Online Phase - Right Boundary
When encountering a copy instruction (p,x), we want to stop scanning and jump to state[p+x-1]
  ◦ If the label of the state is longer than the copy value
     ▪ The label begins before the copy value
     ▪ The context of this state is not the same as in the online scan
     ▪ We take failure transitions to find a state with a sufficiently short label.
  ◦ Otherwise
     ▪ The label of the state is contained in the copy value
     ▪ This is the longest suffix that can lead to a match

Example – Right Boundary
[Figure: AC automaton with states s0 to s14]
Position:  0   1   2   3   4   5   6   7   8   9   10  11
Symbol:    D   B   E   A   A   C   D   B   C   A   B   C
State:     s0  s2  s3  s0  s0  s7  s8  s9  s10 s11 s12 s5
Uncompressed data: …B
COPY(7,4) copies BCAB:
  ◦ Go to State[10] = s12. depth(s12) > 4.
  ◦ Go to f(s12) = s2. depth(s2) ≤ 4.
  ◦ Current state is s2

The Online Phase – Left Boundary
When encountering a copy instruction (p,x), we want to stop scanning and jump to state[p+x-1]
  ◦ If the number of bytes we have read from the copy value is less than the depth of the current state
     ▪ The label of the state begins before the copied bytes
     ▪ We scan the copy value until we reach a state whose label is no longer than the number of bytes read.
  ◦ Otherwise
     ▪ The label of the state is contained in the copy value
     ▪ Both offline and online scans have the same context

Example – Left Boundary
[Figure: AC automaton with states s0 to s14]
Position:  0   1   2   3   4   5   6   7   8   9   10  11
Symbol:    D   B   E   A   A   C   D   B   C   A   B   C
State:     s0  s2  s3  s0  s0  s7  s8  s9  s10 s11 s12 s5
Uncompressed data: …B
COPY(5,4) copies CDBC:
  ◦ j=0, depth=1: continue
  ◦ j=1, depth=2: continue
  ◦ j=2, depth=3: continue
  ◦ j=3: stop scanning (depth(s9) ≤ 3)
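
The offline phase and the two boundary rules can be combined into one Python sketch. It is a reading of the slides under stated assumptions, not the authors' implementation: states are plain integers (so the numbering differs from s0 to s14), the delta file is a list of literal strings and (p, x) tuples, and internal matches are handled in the simplest way that seemed consistent with the slides, by reusing offline matches that start inside the copied range. All function names are hypothetical.

```python
from collections import deque


def build_ac(patterns):
    """Build goto/fail/depth/out tables for Aho-Corasick; state 0 is the root."""
    goto, fail, depth, out = [{}], [0], [0], [[]]
    for pat in patterns:
        s = 0
        for ch in pat:
            if ch not in goto[s]:
                goto.append({}); fail.append(0)
                depth.append(depth[s] + 1); out.append([])
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].append(pat)
    queue = deque(goto[0].values())
    while queue:  # breadth-first, so shallower failure links are final first
        s = queue.popleft()
        for ch, t in goto[s].items():
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] = out[t] + out[fail[t]]  # patterns that end at this state
            queue.append(t)
    return goto, fail, depth, out


def step(ac, s, ch):
    """One AC transition, following failure links when no goto edge exists."""
    goto, fail = ac[0], ac[1]
    while s and ch not in goto[s]:
        s = fail[s]
    return goto[s].get(ch, 0)


def scan_plain(ac, text):
    """Reference scan of uncompressed text: final state and (start, pattern) matches."""
    s, found = 0, []
    for end, ch in enumerate(text, 1):
        s = step(ac, s, ch)
        found += [(end - len(p), p) for p in ac[3][s]]
    return s, found


def offline(ac, dictionary):
    """Offline phase: save the state and the matches after each dictionary byte."""
    states, matches, s = [], [], 0
    for ch in dictionary:
        s = step(ac, s, ch)
        states.append(s)
        matches.append(ac[3][s])
    return states, matches


def scan_delta(ac, dictionary, states, matches, delta):
    """Online phase: returns (final state, matches, number of bytes actually scanned)."""
    _, fail, depth, out = ac
    s, pos, found, scanned = 0, 0, [], 0
    for item in delta:
        literal = isinstance(item, str)
        p, x = (0, 0) if literal else item
        text = item if literal else dictionary[p:p + x]
        j = 0
        # Literals are scanned in full. For a copy (left boundary), scan only
        # while the current label still starts before the copied bytes.
        while j < len(text) and (literal or depth[s] > j):
            s = step(ac, s, text[j]); j += 1; pos += 1; scanned += 1
            found += [(pos - len(pat), pat) for pat in out[s]]
        if literal or j == x:
            continue
        # Internal matches: offline matches that lie entirely inside the copy.
        for q in range(p + j, p + x):
            pos += 1
            found += [(pos - len(pat), pat) for pat in matches[q]
                      if len(pat) <= q - p + 1]
        # Right boundary: jump to the saved state, then shorten its label.
        s = states[p + x - 1]
        while depth[s] > x:
            s = fail[s]
    return s, found, scanned


PATTERNS = ["E", "BE", "BD", "BCAA", "BCD", "CDBCAB"]
DICTIONARY = "DBEAACDBCABC"
DELTA = ["ABDB", (5, 4), "AAB", (1, 4)]

ac = build_ac(PATTERNS)
states, matches = offline(ac, DICTIONARY)
state, found, scanned = scan_delta(ac, DICTIONARY, states, matches, DELTA)
plain = "".join(d if isinstance(d, str) else DICTIONARY[d[0]:d[0] + d[1]] for d in DELTA)
print(sorted(found) == sorted(scan_plain(ac, plain)[1]), scanned, len(plain))
```

The final lines compare the decompression-free scan with an ordinary AC scan of the decompressed text, which is a convenient sanity check when experimenting with other dictionaries and delta files.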

Outline
Motivation
Background
  ◦ Aho-Corasick (AC) algorithm
Our solution
  ◦ The offline phase
  ◦ The online phase
Experimental results

Experimental Results
Input:
  ◦ google.com dictionary
  ◦ Pages for the 1,000 most popular Google queries
Patterns:
  ◦ Snort
The synthetic case:
  ◦ A pattern file is built for each input file so that the input files have different percentages of matches, from 25% to 100%.

The Algorithm Overheads
1. Traversing the failure transitions
  ◦ In the right boundary
2. Scanning the copy value
  ◦ In the left boundary
3. Memory consumption:
  ◦ The additional information of the offline phase.
  ◦ Total: 420 KB (per dictionary)
     ▪ Can be further reduced by a variable-length pointer encoding.

Failure Transitions – Right Boundaries
If length ≥ depth, no failure transition is taken
In our experiments:
  ◦ The average is 2.35 failure transitions per file
     ▪ (average of 557 copy instructions per file)

Scanning the Copy Value - Left Boundary
Compression ratio: compressed/uncompressed
Scan ratio: scanned/uncompressed
Snort
  ◦ Low percentage of matches
  ◦ Scan ratio ≈ compression ratio
The synthetic case
  ◦ High percentage of matches
  ◦ Unrealistic case
  ◦ Scan ratio is between 1.05 and 1.2 times the compression ratio

Regular Expression Results
Strings were extracted from the regular expressions and added to the pattern set.
When needed, we use an off-the-shelf Perl-compatible regular expression engine to scan additional parts of the text.
The overhead of the regular expressions is around 1%, which is almost negligible

Questions?

Regular Expression
Very common in security-oriented patterns.
  ◦ In Snort, 55% of the rules contain a regular expression.
Composed of anchors and PCRE tokens.
For example, in the pattern: abc[1-9]*xyza{3,7}
The anchors are:
  ◦ abc
  ◦ xyz
The PCRE tokens are:
  ◦ [1-9]*
  ◦ a{3,7}

Dealing with Regular Expression
1. The anchors are extracted from the regular expression offline.
2. The anchors are added to the pattern set.
3. If all the anchors of a regular expression were matched:
  ◦ Run an off-the-shelf regular expression engine until either a mismatch occurs, a full pattern match is found, or the whole (limited) text has been searched.

Regular Expression – Limited Search
In most cases, we can limit the search in at least one direction.
  ◦ If all tokens before the first anchor have a limited size, there is a bounded number of characters we should examine before the matched anchor.
  ◦ If all tokens after the last anchor have a limited size, there is a bounded number of characters we should examine after the matched anchor.
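
A small Python sketch of the anchor prefilter and the limited search, using the example pattern abc[1-9]*xyza{3,7}. Several things here are assumptions for illustration: the anchors and the bound of 7 characters are written down by hand rather than extracted automatically, `str.find` stands in for the anchor matches that the AC scan would report, and Python's `re` module stands in for the off-the-shelf PCRE engine.

```python
import re

REGEX = re.compile(r"abc[1-9]*xyza{3,7}")
ANCHORS = ["abc", "xyz"]  # extracted offline (by hand in this sketch)
TAIL_BOUND = 7            # a{3,7} after the last anchor spans at most 7 characters


def match_regex(text):
    """Run the regex engine only if all anchors occur, and only on a bounded window."""
    first = text.find(ANCHORS[0])    # earliest place a match could start
    last = text.rfind(ANCHORS[-1])   # latest occurrence of the last anchor
    if first < 0 or last < 0:
        return None                  # an anchor is missing: the engine is never invoked
    # Nothing precedes the first anchor in the pattern, and at most TAIL_BOUND
    # characters follow the last anchor, so the search window is bounded on both sides.
    end = min(len(text), last + len(ANCHORS[-1]) + TAIL_BOUND)
    m = REGEX.search(text, first, end)
    return m.group(0) if m else None


print(match_regex("GET /?q=abc123xyzaaaa&x"))
print(match_regex("abc123"))
```

In this pattern the token between the anchors, [1-9]*, is unbounded, but it sits between two anchors, so it does not extend the window beyond the anchor positions.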

Memory Consumption
1. Doubling the size of the dictionary (for saving the offline scan results, one pointer per symbol)
2. Saving the matched list (for internal matches)
Our experiments:
  ◦ Match list size: 40,000
  ◦ Dictionary size: 116K symbols
  ◦ Pointer size: 17 bits
Total memory consumption is 420 KB (per dictionary)
  ◦ Can be further reduced by a variable-length pointer encoding.
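
As a rough worked example, the size of the saved-state array alone follows from the figures above. The calculation assumes that 116K means 116,000 symbols and that the 17-bit pointers are packed without padding; the slides do not give a breakdown of the remaining memory, so the match list is not estimated here.

```python
import math

DICT_SYMBOLS = 116_000  # assumption: "116K" read as 116,000 symbols
POINTER_BITS = 17       # one state pointer per dictionary symbol

# Size of the per-symbol state array when pointers are tightly packed.
state_array_bytes = math.ceil(DICT_SYMBOLS * POINTER_BITS / 8)

# Number of distinct states a 17-bit pointer can address.
max_states = 2 ** POINTER_BITS

print(state_array_bytes, max_states)
```

Under these assumptions the state array accounts for a bit more than half of the reported 420 KB total.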