
Slide 1: Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms. Sailesh Kumar. Advisors: Jon Turner, Patrick Crowley. Committee: Roger Chamberlain, John Lockwood, Bob Morley.

Slide 2 (Sailesh Kumar, 11/24/2015): Focus on 3 Network Features

In this proposal, we focus on three network features:
- Packet payload inspection
  - Network security
- Packet header processing
  - Packet forwarding, classification, etc.
- Packet buffering and queuing
  - QoS

Slide 3: Overview of the Presentation

- Packet payload inspection
  - Previous work: D2FA and CD2FA
  - New ideas to implement regular expressions
  - Initial results
- IP lookup
  - Tries and pipelined tries
  - Previous work: CAMP
  - New direction: HEXA
- Hashing used for packet header processing
  - Why do we need better hashing?
  - Previous work: Segmented Hash
  - New direction: Peacock Hashing
- Packet buffering and queuing
  - Previous work: multichannel packet buffer, aggregated buffer
  - New direction: DRAM-based buffer, NP-based queuing assist

Slide 4: Delayed Input DFA (D2FA), SIGCOMM'06

- Many transitions in a DFA:
  - 256 transitions per state
  - 50+ distinct transitions per state (real-world datasets)
  - Need 50+ words per state
- Can we reduce the number of transitions in a DFA?

[Figure: DFA for the three rules a+, b+c, c*d+ (states 1-5, 4 distinct transitions per state). Looking at state pairs reveals many common transitions. How can we remove them?]

Slide 5: Delayed Input DFA (D2FA), SIGCOMM'06

- Many transitions in a DFA:
  - 256 transitions per state
  - 50+ distinct transitions per state (real-world datasets)
  - Need 50+ words per state
- Can we reduce the number of transitions in a DFA?

[Figure: the same 5-state DFA next to an alternative representation with default transitions: fewer transitions, less memory.]

Slide 6: D2FA Operation

- Heavy edges are called default transitions.
- Take the default transition whenever a labeled transition is missing.

[Figure: the DFA and the equivalent D2FA side by side.]
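The default-transition rule above can be sketched in a few lines. The transition tables below are illustrative assumptions (loosely modeled on the slide's 5-state example for a+, b+c, c*d+), not the exact automaton from the paper:

```python
# Illustrative D2FA tables: labeled[s] holds a state's explicitly stored
# transitions; default[s] is its default transition (None at the root).
labeled = {
    1: {"a": 2, "b": 3, "c": 4, "d": 5},  # root keeps all its transitions
    2: {"a": 2},                           # other states keep only the
    3: {"c": 1},                           # transitions that differ from
    4: {"d": 5},                           # those reachable via defaults
    5: {"d": 5},
}
default = {1: None, 2: 1, 3: 1, 4: 1, 5: 1}

def d2fa_step(state, ch):
    # Take default transitions whenever the labeled transition is missing;
    # each hop may cost an extra memory access (the D2FA's drawback).
    while ch not in labeled[state]:
        state = default[state]
    return labeled[state][ch]

def d2fa_run(text, start=1):
    state = start
    for ch in text:
        state = d2fa_step(state, ch)
    return state
```

Note that only the root stores all four characters; every other state stores a single labeled edge plus one default pointer, which is where the memory saving comes from.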

Slide 7: D2FA versus DFA

- D2FAs are compact but require multiple memory accesses:
  - Up to 20x more memory accesses
  - Not desirable in an off-chip architecture
- Can D2FAs match the performance of DFAs?
  - Yes: Content Addressed D2FAs (CD2FAs)
- CD2FAs require only one memory access per byte:
  - Match the performance of a DFA in a cacheless system
  - In systems with a data cache, CD2FAs are 2-3x faster
- CD2FAs are 10x more compact than DFAs.

Slide 8: Introduction to CD2FA, ANCS'06

- How do we avoid the multiple memory accesses of D2FAs?
  - Avoid the lookup that decides whether the default path must be taken
  - Avoid traversing the default path
- Solution: assign a content label to each state. A label contains:
  - The characters for which the state has labeled transitions
  - Information about all of its default states
  - The characters for which its default states have labeled transitions

[Figure: content labels. Node R is found at a fixed location R; node U at hash(c,d,R); node V at hash(a,b,hash(c,d,R)).]

Slide 9: Introduction to CD2FA

[Figure: a lookup walk-through. The current state is V, with label (ab, cd, R), stored at hash(a,b,hash(c,d,R)). On input character a, the machine jumps directly to state X, with label (pq, lm, Z), stored at hash(p,q,hash(l,m,Z)).]
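The content-addressing idea, deriving a state's memory address from its own label, can be sketched as below. The hash choice (truncated SHA-256) and the table size are assumptions for illustration:

```python
import hashlib

TABLE_SIZE = 1 << 20

def content_address(labeled_chars, default_address):
    # A state's address is a hash of its labeled characters together with
    # the address of its default state, so a state's label tells us where
    # its successors live: no extra lookup, one memory access per byte.
    h = hashlib.sha256()
    h.update("".join(sorted(labeled_chars)).encode())
    h.update(default_address.to_bytes(8, "big"))
    return int.from_bytes(h.digest()[:8], "big") % TABLE_SIZE

# Mirroring the slide: root R sits at a fixed location, U lives at
# hash(c,d,R), and V lives at hash(a,b,hash(c,d,R)).
R = 0
U = content_address("cd", R)
V = content_address("ab", U)
```

Sorting the characters makes the address independent of the order in which labeled transitions are listed.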

Slide 10: Construction of CD2FA

- We seek to keep the content labels small.
- Twin objectives:
  - Ensure that states have few labeled transitions
  - Ensure that default paths are as short as possible
- Proposed a new heuristic, called CRO, to construct CD2FAs:
  - Details in the ANCS'06 paper
  - With a default path bound of 2 edges, the CRO algorithm constructs CD2FAs that are up to 10x more space efficient

Slide 11: Memory Mapping in CD2FA

[Figure: the states of the running example mapped to memory via hash(c,d,R), hash(a,b,hash(c,d,R)), and hash(p,q,hash(l,m,Z)); two labels hash to the same location, causing a collision.]

So far we have assumed that hashing is collision free.

Slide 12: Collision-free Memory Mapping

[Figure: four states and four memory locations. Each state's content label is hashed in several ways (e.g., hash(abc, ...), hash(def, ...), hash(pqr, ...), hash(lmn, ...)), adding an edge for every possible choice.]

Slide 13: Bipartite Graph Matching

- Bipartite graph:
  - Left nodes are state content labels
  - Right nodes are memory locations
  - There is an edge for every choice of content label
  - Mapping state labels to unique memory locations is a perfect matching problem
- With n left and n right nodes:
  - We need O(log n) random edges per node
  - n = 1M implies we need ~20 edges per node
- With slight memory over-provisioning, we can uniquely map state labels with far fewer edges.
- In our experiments, we found perfect matchings without any memory over-provisioning.
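The matching step can be sketched with the classic augmenting-path algorithm (Hopcroft-Karp would be asymptotically faster; this is the simplest correct version). Here `choices[i]` stands in for the candidate memory locations of content label i:

```python
def perfect_matching(choices):
    # choices[i]: candidate memory locations (right nodes) for label i.
    # Returns {label: location} covering every label, or None if no
    # perfect matching exists with these edges.
    owner = {}                         # location -> label currently using it
    def augment(i, visited):
        for loc in choices[i]:
            if loc in visited:
                continue
            visited.add(loc)
            # Take a free location, or evict the owner if it can move.
            if loc not in owner or augment(owner[loc], visited):
                owner[loc] = i
                return True
        return False
    for i in range(len(choices)):
        if not augment(i, set()):
            return None
    return {i: loc for loc, i in owner.items()}
```

For example, `perfect_matching([[0, 1], [0], [1, 2]])` succeeds by moving label 0 off location 0 so that label 1 (which has only one choice) can use it.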

Slide 14: Reg-ex: New Directions

Three key problems with traditional DFA-based reg-ex matching:
1. The complete signature is used to parse the input data.
   - Even though normal data matches only a small prefix portion
   - Full signature => large DFA
2. There is only one active state of execution and no memory of previous matches.
   - Combinations of partial matches require new DFA states
3. Inability to count certain sub-expressions.
   - E.g., a{1024} will require 1024 DFA states

We aim to address each of these problems in the proposed research.

Slide 15: Addressing the First Problem

- Divide the processing into a fast path and a slow path.
- Split each signature into a prefix and a suffix:
  - Employ the signature prefixes in the fast path
  - Upon a match in the fast path, trigger the slow path
  - Appropriate splitting can maintain a low triggering rate
- Benefits:
  - The fast path can employ a composite DFA for all prefixes
    - With small prefixes, the composite DFA remains small
    - Higher parsing rate
  - The slow path uses a separate DFA for each signature
    - No state explosion in the slow path
    - With a low triggering rate, the slow path does not become a bottleneck
  - Reduces per-flow state
    - The fast path uses a composite DFA, so there is one active state per flow

Slide 16: Fast and Slow Path Processing

- Here we assume that an ε fraction of the flows is diverted to the slow path.
- The fast path stores one DFA state per flow.
- The slow path may store multiple active states.

Slide 17: Splitting Reg-exes

- Splitting can be performed based upon data traces.
- Assign a probability to each NFA state and make a cut so that the cumulative probability of the slow path is low.

r1 = .*[gh]d[^g]*ge
r2 = .*fag[^i]*i[^j]*j
r3 = .*a[gh]i[^l]*[ae]c

Cumulative probability of the slow path = 0.05

Slide 18: Splitting Reg-exes

- The fast path contains a composite DFA over the prefixes (14 states); note the start state.
- The slow path comprises three separate DFAs, one per signature.

Prefixes (fast path):
p1 = .*[gh]d[^g]*g
p2 = .*fa
p3 = .*a[gh]i

Full signatures (slow path):
r1 = .*[gh]d[^g]*ge
r2 = .*fag[^i]*i[^j]*j
r3 = .*a[gh]i[^l]*[ae]c
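A minimal sketch of the split, using Python's `re` on the slide's three signatures. The cut points p1-p3 are taken from the slide; using a single alternation as a stand-in for the composite prefix DFA is an assumption for illustration:

```python
import re

signatures = [r".*[gh]d[^g]*ge",      # r1
              r".*fag[^i]*i[^j]*j",   # r2
              r".*a[gh]i[^l]*[ae]c"]  # r3
prefixes   = [r".*[gh]d[^g]*g",       # p1
              r".*fa",                # p2
              r".*a[gh]i"]            # p3

# Fast path: one composite matcher over all prefixes (stand-in for the
# composite DFA). Slow path: a separate matcher per full signature.
fast_path = re.compile("|".join(f"(?:{p})" for p in prefixes))

def inspect(data):
    if not fast_path.match(data):
        return False                  # common case: fast path only
    # A prefix fired: fall back to the per-signature slow path.
    return any(re.match(s, data) for s in signatures)
```

Most traffic never leaves `fast_path.match`; only the small fraction of inputs that match some prefix pays for the per-signature scan.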

Slide 19: Protection against DoS Attacks

- An attacker can attack such a system by sending data that matches the prefixes more often than provisioned.
  - The slow path then becomes the bottleneck.
- Solution: look at the history to determine whether a flow is an attack flow.
  - Compute an anomaly index: a weighted moving average of the number of times a flow has triggered the slow path.
  - If a flow has a high anomaly index, send it to a low-rate queue.
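The anomaly index can be sketched as an exponentially weighted moving average of slow-path triggers per flow. The smoothing factor and the threshold below are illustrative assumptions, not values from the proposal:

```python
class AnomalyIndex:
    """EWMA of slow-path triggers per flow; alpha and threshold are
    illustrative values, to be tuned against real traffic."""

    def __init__(self, alpha=0.25, threshold=0.5):
        self.alpha, self.threshold = alpha, threshold
        self.index = {}

    def record(self, flow, triggered):
        # Blend the new observation (1 if the slow path fired) into the
        # flow's running average.
        old = self.index.get(flow, 0.0)
        obs = 1.0 if triggered else 0.0
        self.index[flow] = (1 - self.alpha) * old + self.alpha * obs

    def is_suspect(self, flow):
        # Flows above the threshold would be moved to a low-rate queue.
        return self.index.get(flow, 0.0) > self.threshold
```

A flow that repeatedly triggers the slow path quickly crosses the threshold, while an occasional legitimate trigger decays away.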

Slide 20: Initial Simulation Results

Slide 21: Addressing the Second Problem

- NFA: compact, but O(n) active states.
- DFA: one active state, but state explosion.
  - How can we avoid state explosion while also keeping the per-flow active state information small?
- We propose a novel machine called a History-based Finite Automaton (H-FA):
  - Augment a DFA with a history buffer
  - Transitions are taken by also looking at the history buffer contents
  - Certain transitions insert items into, or remove items from, the history buffer
- Claim: a small history buffer is sufficient to avoid state explosion while keeping a single active state.

Slide 22: Example of H-FA Construction

[Figure: an NFA and the corresponding DFA.]

NFA state 2 is present in 4 DFA states. If we remove NFA state 2 from those DFA states, we are left with just 6 states.

Slide 23: H-FA

[Figure: the same NFA and DFA.]

NFA state 2 is present in 4 DFA states. If we remove NFA state 2 from those DFA states, we are left with just 6 states. The new machine uses a history flag, in addition to its transitions, to make moves.

Slide 24: H-FA

The new machine uses a history flag, in addition to its transitions, to make moves.

[Figure: example run on the input "c d a b c". Transitions are annotated with flag actions; e.g., a c-transition sets the flag (because of the removed NFA state), a d-transition is taken conditionally, and a later c-transition resets the flag.]

Slide 25: H-FA

- In general, if we maintain a flag for each NFA state that represents a Kleene closure, we can avoid any state explosion.
- k closures require at most k bits in the history buffer.
- There are challenges in implementing conditional transitions efficiently.
  - We plan to work on these in the proposed research.
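The flag-per-closure idea can be sketched on a toy workload. A composite DFA for the two patterns `.*a.*b` and `.*c.*d` needs product states to remember which closures are active; with one history bit per closure, a single active state plus a 2-bit history buffer suffices. The patterns are illustrative assumptions, not the slide's example:

```python
def hfa_multi(text, pairs=(("a", "b"), ("c", "d"))):
    """Toy H-FA for the patterns .*a.*b and .*c.*d: one history bit per
    Kleene closure instead of product DFA states. Returns the set of
    indices of matched patterns."""
    flags = [False] * len(pairs)   # history buffer: one bit per closure
    matched = set()
    for ch in text:
        for i, (start, end) in enumerate(pairs):
            if ch == start:
                flags[i] = True    # enter the closure: set the flag
            elif ch == end and flags[i]:
                matched.add(i)     # conditional transition on (end, flag=1)
    return matched
```

The interleaved input "acbd" matches both patterns even though their partial matches overlap, which is exactly the situation that forces state explosion in a plain composite DFA.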

Slide 26: Addressing the Third Problem

Example signatures: ab[^a]{1024}c, def

- Replace the flag with a counter:
  - Replace the flag=1 condition with ctr=1024
  - Replace the flag=0 condition with ctr=0
  - Increment ctr if ctr>0; reset when ctr reaches 1024
- One of the primary goals of this research is to enable efficient implementation of counter conditions.
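A counter-based sketch of matching `ab[^a]{n}c`: one counter replaces the n DFA states the repetition would otherwise require. Here n=4 instead of the slide's 1024 so the example is easy to trace, and overlapping matches are ignored to keep the sketch short:

```python
def count_match(text, n=4):
    """Match ab[^a]{n}c with a counter instead of n DFA states.
    States: 0 = start, 1 = saw 'a', 2 = counting [^a] characters."""
    state, ctr = 0, 0
    for ch in text:
        if state == 2:
            if ch == "a":
                state, ctr = 1, 0      # 'a' breaks [^a]; may restart match
            elif ctr == n:
                if ch == "c":
                    return True        # counter condition ctr == n met
                state, ctr = 0, 0      # wrong terminator: reset
            else:
                ctr += 1               # one counter replaces n DFA states
        elif state == 1:
            if ch == "b":
                state, ctr = 2, 0
            elif ch != "a":
                state = 0
        else:
            if ch == "a":
                state = 1
    return False
```

With n=1024 the code is unchanged; only the counter's value range grows, while a pure DFA would need 1024 extra states.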

Slide 27: Early Results

Slide 28: Overview of the Presentation

- Packet payload inspection
  - Previous work: D2FA and CD2FA
  - New ideas to implement regular expressions
  - Initial results
- IP lookup
  - Tries and pipelined tries
  - Previous work: CAMP
  - New direction: HEXA
- Hashing used for packet header processing
  - Why do we need better hashing?
  - Previous work: Segmented Hash
  - New direction: Peacock Hashing
- Packet buffering and queuing
  - Previous work: multichannel packet buffer, aggregated buffer
  - New direction: DRAM-based buffer, NP-based queuing assist

Slide 29: IP Address Lookup

- Routing tables at router input ports contain (prefix, next hop) pairs.
- The address in the packet is compared to the stored prefixes, starting from the left.
- The prefix that matches the largest number of address bits is the desired match.
- The packet is forwarded to the specified next hop.

Routing table (prefix -> next hop):
  1*    -> 5
  00*   -> 3
  01*   -> 5
  0*    -> 7
  001*  -> 2
  011*  -> 3
  1011* -> 4

Address: 0110 0100 1000

Slide 30: Address Lookup Using Tries

- Prefixes are stored in "alphabetical order" in a tree.
- Prefixes are "spelled out" by following a path from the top.
  - Green dots mark prefix ends.
- To find the best prefix, spell out the address in the tree.
- The last green dot passed marks the longest matching prefix.

[Figure: binary trie for the routing table above; for address 0110 0100 1000 the last green dot is 011*, next hop 3.]
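Longest-prefix match on a binary trie can be sketched as below. The routing table is the one from the slide; representing trie nodes as nested dicts is an implementation assumption for illustration:

```python
# Routing table from the slide: bit-string prefix -> next hop.
routes = {"1": 5, "00": 3, "01": 5, "0": 7, "001": 2, "011": 3, "1011": 4}

def make_trie(routes):
    root = {}
    for prefix, hop in routes.items():
        node = root
        for bit in prefix:
            node = node.setdefault(bit, {})
        node["hop"] = hop                 # "green dot": a prefix ends here
    return root

def longest_match(trie, addr):
    # Spell the address out bit by bit, remembering the last prefix end
    # (the last green dot) seen along the way.
    node, best = trie, None
    for bit in addr:
        if bit not in node:
            break
        node = node[bit]
        best = node.get("hop", best)
    return best

trie = make_trie(routes)
```

For the slide's address 0110 0100 1000, the walk passes 0* (hop 7), 01* (hop 5), and finally 011* (hop 3), then falls off the trie, so hop 3 is returned.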

Slide 31: Pipelined Trie-based IP Lookup

- Tree data structure with prefixes in the leaves (leaf pushing).
- Process the IP address level by level to find the longest match.
- Place each level in a different pipeline stage, overlapping multiple packets.
- Problem: stages have different sizes.
  - Requires more memory
  - The largest stage becomes the bottleneck

[Figure: pipelined trie containing prefixes P1-P7, e.g. P4 = 10010*.]

Slide 32: Circular Pipeline, ANCS'06

- Use a circular pipeline and allow requests to enter/exit at any stage.
- Mapping:
  - Divide the trie into multiple sub-tries
  - Map each sub-trie with its root starting at a different stage

Slide 33: Mapping in Circular Pipeline

Slide 34: Circular Pipeline

Benefits:
- Uniform stage sizes
- Less memory: no over-provisioning is needed in the face of arbitrary trie shapes
- Higher throughput

Slide 35: New Direction: HEXA

- HEXA (History-based Encoding, eXecution and Addressing)
  - Challenges the assumption that graph structures must store log2(n)-bit pointers to identify successor nodes.
- If the labels on the path leading to every node are unique, then those labels can be used to identify the node.
  - In a trie, every node has a unique path starting at the root.
  - Thus, the labels along the path become the identifier of the node.
  - Note that these labels need not be explicitly stored.

Slide 36: Traditional Implementation

Addr  Data
1     0, 2, 3
2     0, 4, 5
3     1, NULL, 6
4     1, NULL, NULL
5     0, 7, 8
6     1, NULL, NULL
7     0, 9, NULL
8     1, NULL, NULL
9

There are nine nodes, so we need 4-bit node identifiers. Total memory = 9 x 9 bits.

Slide 37: HEXA-based Implementation

Define the HEXA identifier of a node as the path that leads to it from the root:
1. -
2. 0
3. 1
4. 00
5. 01
6. 11
7. 010
8. 011
9. 0100

Notice that these identifiers are unique; thus, they can potentially be mapped to unique memory addresses.

Slide 38: HEXA-based Implementation

Use hashing to map the HEXA identifier to a memory address. If we have a minimal perfect hash function f (a function that maps elements to unique locations), then we can store the trie as shown below.

f(-) = 4, f(0) = 7, f(1) = 9, f(00) = 2, f(01) = 8, f(11) = 1, f(010) = 5, f(011) = 3, f(0100) = 6

Addr  Fast path  Prefix
1     1,0,0      P3
2     1,0,0      P2
3     1,0,0      P4
4     0,1,1
5     0,1,0
6     1,0,0      P5
7     0,1,1
8
9     1,0,1      P1

Here we use only 3 bits per node in the fast path.

Slide 39: Devising a One-to-one Mapping

- Finding a minimal perfect hash function is difficult.
  - A one-to-one mapping is essential for HEXA to work.
- Use discriminator bits:
  - Append c bits, which we are free to modify, to every HEXA identifier.
  - A node then has 2^c choices of identifier.
  - Note that these c bits must be stored, so slightly more than 3 bits per node are needed.
- With multiple choices of HEXA identifier per node, the problem reduces to a bipartite graph matching problem.
  - We need to find a perfect matching in the graph to map nodes to unique memory locations.
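The discriminator idea can be sketched as below. The hash (truncated SHA-256) is an assumption, and for brevity this sketch assigns addresses first-fit rather than by bipartite matching, which is what the proposal actually uses and which succeeds more often for the same number of discriminator bits:

```python
import hashlib

def candidates(path, c_bits, table_size):
    # The 2**c_bits candidate addresses for one node: a hash of the HEXA
    # identifier (the path) combined with each possible discriminator.
    out = []
    for d in range(1 << c_bits):
        h = hashlib.sha256(f"{d}:{path}".encode()).digest()
        out.append(int.from_bytes(h[:8], "big") % table_size)
    return out

def map_nodes(paths, c_bits=2):
    # First-fit sketch: try each node's candidate addresses in
    # discriminator order until a free slot is found.
    table_size = len(paths)
    used, mapping = set(), {}
    for p in paths:
        for d, addr in enumerate(candidates(p, c_bits, table_size)):
            if addr not in used:
                used.add(addr)
                mapping[p] = (d, addr)   # only the c bits need storing
                break
        else:
            return None                  # first-fit failed; use matching
    return mapping
```

Only the chosen discriminator (c bits) is stored per node; the address itself is recomputed from the path during traversal.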

Slide 40: Devising a One-to-one Mapping

Slide 41: Initial Results

- Our initial evaluation suggests that 2-bit discriminators are enough to find a perfect matching.
  - Thus 2 bits per node suffice instead of log2(n) bits.

Slide 42: Initial Results

- Memory comparison to Eatherton's trie.
- In the future:
  - Complete evaluation of HEXA-based IP lookup: throughput, die size, and power estimates
  - Extend HEXA to string matching and finite automata

Slide 43: Overview of the Presentation

- Packet payload inspection
  - Previous work: D2FA and CD2FA
  - New ideas to implement regular expressions
  - Initial results
- IP lookup
  - Tries and pipelined tries
  - Previous work: CAMP
  - New direction: HEXA
- Hashing used for packet header processing
  - Why do we need better hashing?
  - Previous work: Segmented Hash
  - New direction: Peacock Hashing
- Packet buffering and queuing
  - Previous work: multichannel packet buffer, aggregated buffer
  - New direction: DRAM-based buffer, NP-based queuing assist

Slide 44: Hash Tables

Suppose our hash function gave us the following values:
  hash("kiwi") = 0, hash("banana") = 2, hash("watermelon") = 3, hash("apple") = 5, hash("mango") = 6, hash("cantaloupe") = 7, hash("grapes") = 8, hash("strawberry") = 9

Now hash("honeydew") = 6, but slot 6 already holds "mango". This is called a collision. Now what?

[Figure: table slots 0-9 holding kiwi, banana, watermelon, apple, mango, cantaloupe, grapes, and strawberry.]

Slide 45: Collision Resolution Policies

- Linear probing
  - Successively search for the first empty subsequent table entry
- Linear chaining
  - Link all entries that collide at a bucket into a linked list
- Double hashing
  - Use a second hash function to successively index the table
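The first policy, linear probing, can be sketched in a few lines. This toy table has no resizing and assumes it never completely fills (a full table would make the probe loop spin forever):

```python
class LinearProbeTable:
    """Minimal open-addressing hash table with linear probing."""

    def __init__(self, size=11):
        self.slots = [None] * size

    def _probe(self, key):
        # Yield the home slot, then each subsequent slot, wrapping around.
        i = hash(key) % len(self.slots)
        while True:
            yield i
            i = (i + 1) % len(self.slots)

    def insert(self, key):
        for i in self._probe(key):
            if self.slots[i] is None or self.slots[i] == key:
                self.slots[i] = key
                return i              # distance probed = lookup cost later

    def lookup(self, key):
        for i in self._probe(key):
            if self.slots[i] is None:
                return None           # an empty slot ends the probe chain
            if self.slots[i] == key:
                return i
```

The farther a key lands from its home slot at insert time, the more probes every later lookup pays, which is exactly the worst-case behavior the next slide quantifies.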

Slide 46: Performance Analysis

- Average performance is O(1).
- However, worst-case performance is O(n).
- In fact, the likelihood that a key lands at a distance > 1 is quite high.

[Figure: keys at distance 2 take twice as long to probe, and keys at distance 3 three times as long; there is a fairly high probability that throughput is a half or a third of the peak.]

Slide 47: Segmented Hashing, ANCS'05

- Uses the power of multiple choices.
  - Proposed earlier by Azar et al.
- An N-way segmented hash:
  - Logically divides the hash table array into N equal segments
  - Maps each incoming key onto one candidate bucket in each segment
  - Picks the bucket that is empty or has the minimum number of keys

[Figure: a 4-way segmented hash table; key k_i is mapped to the least-loaded of its four candidate buckets, and likewise k_i+1.]

Slide 48: Segmented Hash Performance

- More segments improve the probabilistic performance.
  - With 64 segments, the probability that a key is inserted at distance > 2 is nearly zero, even at 100% load.
  - The improvement in average-case performance is still modest.

Slide 49: Adding per-Segment Filters

- Attach a small Bloom filter (m_b bits, hash functions h_1 ... h_k) to each segment.

[Figure: key k_i can go to any of 3 candidate segments; we can select any of them and insert the key into the corresponding filter.]

Slide 50: Selective Filter Insertion Algorithm

- Insert the key into the segment whose Bloom filter would have the fewest new bits set (segment 4 in the figure).
  - Fewer bits set => lower false-positive rate.
- With more segments (more choices), the algorithm sets far fewer bits in the Bloom filters.

[Figure: key k_i can go to any of 3 candidate buckets; segment 4 is chosen because fewer filter bits would be set.]
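Selective filter insertion can be sketched as below. The filter parameters and the use of truncated SHA-256 for the k Bloom hash functions are assumptions for illustration:

```python
import hashlib

class SegmentFilter:
    """Per-segment Bloom filters with selective insertion: among the
    candidate segments, insert where the filter would set the fewest
    new bits, keeping the filters sparse and false positives low."""

    def __init__(self, n_segments=4, m_bits=64, k=3):
        self.filters = [0] * n_segments   # one m-bit filter per segment
        self.m, self.k = m_bits, k

    def _bits(self, key):
        # The k Bloom filter bit positions for this key, as bit masks.
        out = []
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            out.append(1 << (int.from_bytes(h[:4], "big") % self.m))
        return out

    def insert(self, key, candidate_segments):
        masks = self._bits(key)
        def new_bits(seg):                # bits this key would newly set
            return sum(1 for b in masks if not (self.filters[seg] & b))
        seg = min(candidate_segments, key=new_bits)
        for b in masks:
            self.filters[seg] |= b
        return seg

    def query(self, key, seg):
        return all(self.filters[seg] & b for b in self._bits(key))
```

The key's candidate bucket in the chosen segment would then receive the actual entry; the filters only answer "which segments might hold this key" at lookup time.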

Slide 51: Problem with Segmented Hash

- The Bloom filter size is proportional to the total number of elements.
- An O(1) lookup can be maintained even if we omit the Bloom filter of one segment.
  - With many segments of equal size, this omission does not reduce the Bloom filter size by much.
- An alternative is to use segments of different sizes and omit the Bloom filter of the largest segment.
  - If the largest segment holds, say, 90% of the total memory, this yields a 90% reduction in Bloom filter size.
  - Peacock hashing exploits this property.

Slide 52: Peacock Hashing

[Figure: the universe of keys U and the actual keys K, mapped by hash functions h1 ... h5 into segments of geometrically decreasing size.]

- Size of the 1st segment = 1; size of the 2nd segment = c; size of the i-th segment = c x the size of the (i-1)-th segment.
- No element is discarded until the first segment fills up.

Slide 53: Peacock Hash

- Use a Bloom filter for every segment except the largest one.
  - Thus, for c = 10 say, the Bloom filters are 10x smaller.
- Lookup is straightforward:
  - First consult all the Bloom filters.
  - If none of them reports membership, look up the key in the largest segment.
  - Otherwise, look up the segments that report membership.
- To enable deletes we need counting Bloom filters, but the counters can be kept in the slow path.
- Deletes, however, lead to imbalance in the loading.
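The lookup procedure above can be sketched as below. For brevity, the per-segment Bloom filters are stood in by exact sets (a real Bloom filter may also report false positives, adding an occasional wasted table probe):

```python
def peacock_lookup(key, small_segments, main_segment):
    """small_segments: list of (filter, table) pairs for the small
    segments; main_segment is the largest segment and has no filter."""
    for bloom, table in small_segments:
        if key in bloom:               # filter reports (possible) membership
            hit = table.get(key)
            if hit is not None:
                return hit             # found in a small segment
            # With a real Bloom filter this branch is a false positive;
            # fall through and keep checking.
    # No filter matched: a single probe into the (filter-less) largest
    # segment resolves the lookup, which is the common case.
    return main_segment.get(key)
```

Since the largest segment holds most of the keys, most lookups cost one filter sweep (on-chip) plus one memory probe.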

Slide 54: Peacock Hash

- A series of "delete and insert" operations may lead to overflow of the smaller segments.

[Figure: discard rate (%) vs. simulation time (sampling interval 1000) for segments 1-6; discards rise once the second phase begins.]

Slide 55: Peacock Hash

- Following every delete, we perform a rebalancing: search the smaller segments and move elements to a larger segment when possible.

[Figure: discard rate (%) vs. simulation time (sampling interval 1000) for segments 1-6, with rebalancing enabled.]

Slide 56: Issues and Future Directions

- It is not yet clear how to perform rebalancing efficiently.
  - In the previous simulation, we used a brute-force approach and searched the entire segment, leading to an O(n) rebalancing cost.
- Complicating factors:
  - Collision lengths greater than 1 in some segments
  - The double-hashing collision policy
  - 2-ary hashing may improve efficiency, but again complicates rebalancing
- Future research objectives:
  - Develop an efficient rebalancing algorithm
  - Develop Bloom filters that better utilize the power of multiple choices
  - Extend the scheme to memory segments with different bandwidths and access latencies

Slide 57: Overview of the Presentation

- Packet payload inspection
  - Previous work: D2FA and CD2FA
  - New ideas to implement regular expressions
  - Initial results
- IP lookup
  - Tries and pipelined tries
  - Previous work: CAMP
  - New direction: HEXA
- Hashing used for packet header processing
  - Why do we need better hashing?
  - Previous work: Segmented Hash
  - New direction: Peacock Hashing
- Packet buffering and queuing
  - Previous work: multichannel packet buffer, aggregated buffer
  - New direction: DRAM-based buffer, NP-based queuing assist

Slide 58: Packet Buffering and Queuing

- Our first objective is to extend the multichannel packet buffer architecture to DRAM memories.
- We also plan to consider memories with different sizes, bandwidths, and access latencies.
  - Extension of: Sailesh Kumar, Patrick Crowley, and Jonathan Turner, "Design of Randomized Multichannel Packet Storage for High Performance Routers", Proceedings of the IEEE Symposium on High Performance Interconnects (HotI-13), Stanford, August 17-19, 2005.
- Work on an NP-specific queuing hardware assist.
  - Extension of: Sailesh Kumar, John Maschmeyer, and Patrick Crowley, "Queuing Cache: Exploiting Locality to Ameliorate Packet Queue Contention and Serialization", Proceedings of the ACM International Conference on Computing Frontiers (ICCF), Ischia, Italy, May 2-5, 2006.

Slide 59

- The proposed research is expected to take one year.
- Acknowledgments: Jon Turner, Patrick Crowley, Michela Becchi, Sarang Dharmapurikar, John Lockwood, Roger Chamberlain, Robert Morley, Balakrishnan Chandrasekaran, Michael Mitzenmacher (Harvard Univ.), George Varghese (UCSD), Will Eatherton (Cisco), John Williams (Cisco).

Slide 60: Questions?

