High-Performance Packet Classification on GPU Author: Shijie Zhou, Shreyas G. Singapura and Viktor K. Prasanna Publisher: HPEC 2014 Presenter: Gang Chi.


1 High-Performance Packet Classification on GPU Author: Shijie Zhou, Shreyas G. Singapura and Viktor K. Prasanna Publisher: HPEC 2014 Presenter: Gang Chi Date: 2014/12/2 Department of Computer Science and Information Engineering National Cheng Kung University, Taiwan R.O.C.

2 Introduction (1/2) This paper investigates the GPU’s characteristics in parallelism and memory access, and implements a packet classifier using CUDA. The basic operations of the design are binary range-tree search and bitwise AND. The design is optimized by storing the range-trees in shared memory as compact arrays without explicit pointers. National Cheng Kung University CSIE Computer & Internet Architecture Lab 2

3 Introduction (2/2) With a rule set of 512 rules, the design achieves a throughput of 85 MPPS (million packets per second) and an average processing latency of 4.9 µs per packet. Compared with an implementation on a state-of-the-art multi-core platform, the design improves throughput by 1.9x.

4 CUDA Memory Model
Type            Location   Access cycle   Size
Global memory   Off-chip   >100           1~32GB per GPU
L1 cache        On-chip    1~32           16 or 48KB per SMX
L2 cache        On-chip    1~32           64KB per SMX
Registers       On-chip    n/a            32-bit x 65536 per SMX

5 Algorithm Phase 1: each of the K threads examines N/K of the N rules and produces a local classification result. Phase 2: the highest-priority rule among the K local results is identified in log K steps.
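The two phases can be sketched sequentially as follows. This is a minimal Python illustration, not the authors' CUDA kernel. It assumes that a lower rule index means higher priority, that N is divisible by K, and that K is a power of two. The names `local_results`, `reduce_global`, and `NO_MATCH` are hypothetical; the sentinel value 65536 is taken from the example on the Search slide.

```python
NO_MATCH = 65536  # assumed sentinel for "no rule matched", as in the Search slide example


def local_results(match_flags, k):
    """Phase 1: simulated thread t scans its own N/K rules and keeps its best match."""
    chunk = len(match_flags) // k  # assumes N is divisible by K
    results = []
    for t in range(k):
        local = NO_MATCH
        for i in range(t * chunk, (t + 1) * chunk):
            if match_flags[i]:
                local = i  # first match in the chunk has the lowest index
                break
        results.append(local)
    return results


def reduce_global(results):
    """Phase 2: pairwise min-reduction; takes log2(K) steps for K a power of two."""
    vals = list(results)
    stride = len(vals) // 2
    steps = 0
    while stride > 0:
        # On a GPU these comparisons would run in parallel, one per thread.
        for t in range(stride):
            vals[t] = min(vals[t], vals[t + stride])
        stride //= 2
        steps += 1
    return vals[0], steps


flags = [False] * 16
flags[5] = True
flags[11] = True
partial = local_results(flags, 4)
winner, steps = reduce_global(partial)
```

With N = 16 rules and K = 4 simulated threads, the reduction finishes in two steps, which matches the log K bound stated on the slide.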

6 Pre-process The rules are pre-processed to construct a binary range-tree for each individual field. Every leaf node is assigned a bit vector (BV) that indicates which rules match when the search reaches that leaf.
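One way to picture the pre-processing for a single field is to cut the field's value space into elementary intervals at every rule boundary and attach a bit vector to each interval. The Python sketch below builds only that sorted leaf table, not a full tree, and it is not the authors' code. The function name `build_field_table`, the inclusive `(lo, hi)` range encoding, and the choice of bit i standing for rule i are assumptions made for illustration.

```python
def build_field_table(ranges, field_max):
    """ranges[i] = (lo, hi), the inclusive range of rule i on this field.

    Returns (starts, bvs): starts[j] is the first value of elementary
    interval j, and bvs[j] has bit i set when rule i covers that interval.
    """
    points = {0}  # the first interval always starts at the minimum field value
    for lo, hi in ranges:
        points.add(lo)
        if hi + 1 <= field_max:
            points.add(hi + 1)  # the value just past a rule's upper end opens a new interval
    starts = sorted(points)

    bvs = []
    for start in starts:
        bv = 0
        for rule, (lo, hi) in enumerate(ranges):
            # Every rule boundary is a cut point, so testing the interval's
            # first value is enough to decide coverage of the whole interval.
            if lo <= start <= hi:
                bv |= 1 << rule
        bvs.append(bv)
    return starts, bvs


starts, bvs = build_field_table([(0, 7), (4, 11), (8, 15)], field_max=15)
```

For the three example rules, the values 4 to 7 fall in an interval covered by rules 0 and 1, so that leaf carries the bit vector 0b011.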

7 Search Each thread performs the binary range-tree searches sequentially, field by field. After 5 tree searches, 5 BVs are produced. The 5 BVs are merged by bitwise AND to obtain a final BV. The result is the index of the first non-zero bit. Ex: BV=00100, Result=2. Ex: BV=00000 (no bit set), Result=65536.
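The merge step can be sketched in a few lines of Python. This is an illustration under assumptions rather than the authors' kernel: each BV is held in a Python integer with bit i standing for rule i, and `merge_and_pick` and `NO_MATCH` are hypothetical names. The two checks at the end reproduce the slide's examples.

```python
from functools import reduce

NO_MATCH = 65536  # result reported on the slide when the final BV is all zeros


def merge_and_pick(field_bvs):
    """AND the per-field bit vectors and return the index of the first set bit."""
    final_bv = reduce(lambda a, b: a & b, field_bvs)
    if final_bv == 0:
        return NO_MATCH
    # final_bv & -final_bv isolates the lowest set bit; bit_length() - 1 is its index.
    return (final_bv & -final_bv).bit_length() - 1


# Five fields whose bit vectors share only rule 2.
hit = merge_and_pick([0b00111, 0b01110, 0b11100, 0b00100, 0b10101])
# Five fields with no rule in common.
miss = merge_and_pick([0b00011, 0b11100, 0b11111, 0b11111, 0b11111])
```

A rule survives the AND only if it matched on every field, which is why a single disagreeing field is enough to clear its bit.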

8 Search in Binary Range Tree Ex: Search 4
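A pointer-free search can be sketched by keeping the interval boundaries in a sorted array and treating the midpoint of the current window as the tree node. The Python below is an assumed illustration of that idea, not the layout used in the paper. The arrays `starts` and `bvs`, their contents, and the function name `search_range_tree` are hypothetical, and the sketch assumes `starts[0]` is the smallest possible field value.

```python
def search_range_tree(starts, bvs, value):
    """Return the bit vector of the interval containing value.

    starts is sorted ascending; starts[j] is the first value of interval j.
    Narrowing the [lo, hi] window replaces following a child pointer.
    """
    lo, hi = 0, len(starts) - 1
    leaf = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        if starts[mid] <= value:
            leaf = mid      # value lies in interval mid or in a later one
            lo = mid + 1    # descend into the right half
        else:
            hi = mid - 1    # descend into the left half
    return bvs[leaf]


# Hypothetical field with four intervals starting at 0, 4, 8 and 12.
starts = [0, 4, 8, 12]
bvs = [0b001, 0b011, 0b110, 0b100]
bv_for_4 = search_range_tree(starts, bvs, 4)
```

Searching for the value 4, as in the slide's example, visits two array slots and ends on the interval that starts at 4. Because no child pointers are stored, the whole structure is two flat arrays, which is the kind of compact layout that fits in shared memory.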

9 Identify Global Result

10 Experimental Platform CUDA 5.0. CPU: Intel E5-2665 x2, 2.4 GHz, 8-core. GPU: NVIDIA K20 Kepler, 705.5 MHz, 13 SMX with 2496 CUDA cores in total, 5GB GDDR5.

11 Latency and Throughput

12 Comparison with Implementation on Multi-core [8] S. Zhou, Y. Qu and V. K. Prasanna, “Multi-core implementation of decomposition-based packet classification algorithms,” in Parallel Computing Techniques (PaCT), pp. 105-119, 2013.

