Efficient Software Packet Processing on Heterogeneous and Asymmetric Hardware Architectures
Author: Eva Papadogiannaki, Lazaros Koromilas, Giorgos Vasiliadis, and Sotiris Ioannidis
Publisher/Conference: IEEE/ACM Transactions on Networking (Volume: 25, Issue: 3, June 2017)
Referenced: 4
Presenter: 林宇翔
Date: 2019/05/22
Department of Computer Science and Information Engineering, National Cheng Kung University, Taiwan R.O.C.
CSIE CIAL Lab

Introduction
We propose an adaptive scheduling approach that supports heterogeneous and asymmetric hardware and is tailored for network packet processing applications. Our scheduler responds quickly to dynamic performance fluctuations that occur in real time, such as traffic bursts, application overloads, etc., and provides consistently good performance. The experimental results show that our system matches the peak throughput (meeting the input traffic rate of the packets) of a diverse set of packet processing applications while consuming up to 3.5× less energy.
National Cheng Kung University CSIE Computer & Internet Architecture Lab

Introduction
Overall, the CPU cores are good at handling branch-intensive packet processing workloads, while discrete GPUs tend to operate efficiently on data-parallel workloads. Between those two, the integrated GPU offers high energy efficiency without significantly compromising the processing rate or latency.

Power Instrumentation
We use four high-precision Hall effect current sensors to constantly monitor the four ATX power-supply power lines (+12.0a, +12.0b, +5.0, and +3.3 Volts). To calculate their power consumption, we use a utilization-based model.
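
As a rough illustration of how current readings on the four lines turn into a power figure, the sketch below sums P = V × I over the rails. It assumes each rail sits at its nominal voltage; the sample currents are hypothetical values, not measurements from the paper, and the utilization-based model mentioned above is not shown.

```python
# Nominal voltage of each monitored ATX rail (assumption: rails sit at nominal).
RAIL_VOLTS = {"+12.0a": 12.0, "+12.0b": 12.0, "+5.0": 5.0, "+3.3": 3.3}


def total_power_watts(currents_amps):
    """Sum P = V * I over the monitored rails."""
    return sum(RAIL_VOLTS[rail] * amps for rail, amps in currents_amps.items())


# Hypothetical sensor sample in amperes; not measured data.
sample = {"+12.0a": 4.0, "+12.0b": 2.5, "+5.0": 1.0, "+3.3": 2.0}
print(f"{total_power_watts(sample):.1f} W")
```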

Applications
IPv4 Packet Forwarding: We use the RadixTrie lookup algorithm and a routing table of 17,000 entries.
Deep Packet Inspection: We port a DFA implementation of the Aho-Corasick algorithm for string searching and use the content patterns (about 10,000 fixed strings) of the latest Snort distribution, which we compile into the same state machine.
Packet Hashing: On receiving a new packet, the "packet store" is updated and the "fingerprint table" is checked to determine whether the packet includes a significant fraction of content cached in the packet store; if so, an encoded version that eliminates this (recently observed) content is transmitted. We have implemented the MD5 algorithm.
Encryption: We implement AES-CBC encryption using a different 128-bit key for each communication session.
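
To make the Deep Packet Inspection entry concrete, here is a minimal sketch of an Aho-Corasick DFA: all patterns are compiled into one state machine, and scanning takes one table lookup per payload byte. It is not the ported implementation from the paper; the function names and the four toy patterns are assumptions made for illustration.

```python
from collections import deque


def build_dfa(patterns):
    """Build an Aho-Corasick DFA over bytes: (transition table, matches per state)."""
    goto = [{}]    # trie edges per state
    out = [set()]  # patterns that end at each state
    for pat in patterns:
        state = 0
        for byte in pat:
            if byte not in goto[state]:
                goto.append({})
                out.append(set())
                goto[state][byte] = len(goto) - 1
            state = goto[state][byte]
        out[state].add(pat)

    # Breadth-first pass: compute failure links and fold them into a full
    # transition table, so the scan never has to follow a failure link.
    fail = [0] * len(goto)
    dfa = [[0] * 256 for _ in goto]
    queue = deque()
    for byte, nxt in goto[0].items():
        dfa[0][byte] = nxt
        queue.append(nxt)
    while queue:
        state = queue.popleft()
        out[state] |= out[fail[state]]
        for byte in range(256):
            nxt = goto[state].get(byte)
            if nxt is None:
                dfa[state][byte] = dfa[fail[state]][byte]
            else:
                fail[nxt] = dfa[fail[state]][byte]
                dfa[state][byte] = nxt
                queue.append(nxt)
    return dfa, out


def scan(dfa, out, payload):
    """Return (end_offset, pattern) for every match in the payload."""
    state, hits = 0, []
    for offset, byte in enumerate(payload):
        state = dfa[state][byte]
        for pat in out[state]:
            hits.append((offset, pat))
    return hits


dfa, out = build_dfa([b"he", b"she", b"his", b"hers"])
print(sorted(scan(dfa, out, b"ushers")))
```

With payloads that contain no pattern bytes, the scan stays in or near state 0 and touches only a few rows of the table.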

Architecture
We consider two different models for capturing the network traffic and distributing it to the computational devices for processing: master-worker and shared-nothing.

Architecture
The advantage of the shared-nothing architecture over the master-worker architecture is that it avoids the overhead of the synchronization required to ensure the proper execution of the worker threads. We use only the shared-nothing architecture for packet capturing in the remainder of the paper.
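
The shared-nothing idea can be sketched as follows: every worker owns its input and output, so no lock is needed while packets are processed. The slides do not say how traffic is split between workers; hashing a flow identifier, as done here, is an assumption of this sketch, and `flow_key`, `shared_nothing`, and the packet fields are hypothetical names.

```python
import threading
import zlib


def flow_key(packet):
    """Hypothetical flow identifier: source/destination address and port."""
    return f"{packet['src']}:{packet['sport']}-{packet['dst']}:{packet['dport']}".encode()


def shared_nothing(packets, num_workers, process):
    # Partition up front; each worker owns its input and output lists,
    # so workers never take a lock or touch each other's data.
    inputs = [[] for _ in range(num_workers)]
    for packet in packets:
        inputs[zlib.crc32(flow_key(packet)) % num_workers].append(packet)
    outputs = [[] for _ in range(num_workers)]

    def worker(idx):
        for packet in inputs[idx]:
            outputs[idx].append(process(packet))

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return outputs


packets = [
    {"src": "10.0.0.1", "sport": 1000 + i % 3, "dst": "10.0.0.2", "dport": 80}
    for i in range(9)
]
per_worker = shared_nothing(packets, num_workers=2, process=lambda p: p["sport"])
print([len(out) for out in per_worker])
```

In a master-worker design, by contrast, one thread would hand packets to the workers through a shared queue, and that queue is where the synchronization cost comes from.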

Performance Characterization - solo

Performance Characterization - combo

Performance Characterization
Different applications (or the same application on different devices) require different batch sizes to reach maximum throughput. Computationally intensive applications (e.g., AES) benefit more from large batch sizes, while memory-intensive applications (e.g., IPv4 forwarding) require smaller batch sizes to reach peak throughput. This is mainly an effect of the cache sizes in the memory hierarchy of the specific device. Increasing the batch size after the maximum throughput has been reached results in a linear increase in latency.
The performance of DPI fluctuates widely: when there is no match in the input traffic, the throughput achieved by all devices is much higher. As the number of pattern matches decreases, the DFA algorithm needs to access only a few different states, and these states are stored in the cache memory.
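
The linear latency growth can be illustrated with a simple model: assuming a device that already runs at a fixed peak rate, a batch of B packets takes B divided by that rate to process, so doubling the batch doubles the time. The rate below is a hypothetical number, not a result from the paper.

```python
def batch_latency_us(batch_size, peak_rate_mpps):
    """Time to process one batch at a fixed peak rate, in microseconds.

    One Mpps is one packet per microsecond, so packets / Mpps gives microseconds.
    """
    return batch_size / peak_rate_mpps


# Hypothetical peak rate of 2 Mpps; larger batches add latency, not throughput.
for batch_size in (1024, 2048, 4096):
    print(batch_size, batch_latency_us(batch_size, peak_rate_mpps=2.0))
```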

Energy Efficiency

Adaptation Algorithm
For each combination in our parameter space, we measure the sustained throughput, latency, and power, and store them in a dictionary; the dictionary is used at runtime to look up the most suitable configuration. We use a separate red-black tree to store each measured outcome (i.e., throughput, latency, and power) for each configuration. Each node in the tree holds all the configurations that correspond to the requested result. To avoid overloading the tree, before inserting a new node we check whether its performance differs from that of its parent by more than a threshold δ. If not, we merge them to save space.
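
A minimal sketch of one such per-metric index follows. Python has no built-in red-black tree, so a sorted list with `bisect` stands in for it; both support ordered lookups. Merging is done against the nearest stored values rather than a tree parent, which is a simplification of this sketch. The class name, the device labels, and all numbers are hypothetical.

```python
import bisect


class ProfileIndex:
    """Configurations indexed by one measured metric (e.g. throughput)."""

    def __init__(self, delta):
        self.delta = delta
        self.keys = []     # sorted metric values
        self.configs = []  # configs[i] holds every configuration for keys[i]

    def insert(self, value, config):
        i = bisect.bisect_left(self.keys, value)
        # Merge into a neighbouring entry when the values differ by at most delta.
        for j in (i - 1, i):
            if 0 <= j < len(self.keys) and abs(self.keys[j] - value) <= self.delta:
                self.configs[j].append(config)
                return
        self.keys.insert(i, value)
        self.configs.insert(i, [config])

    def at_least(self, value):
        """Configurations of the smallest stored value >= the requested one."""
        i = bisect.bisect_left(self.keys, value)
        return self.configs[i] if i < len(self.keys) else []


index = ProfileIndex(delta=0.5)
index.insert(10.0, ("cpu", 64))
index.insert(10.3, ("igpu", 1024))  # within delta of 10.0, so it is merged
index.insert(25.0, ("gpu", 8192))
print(index.at_least(12.0))
```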

Adaptation Algorithm
Our scheduling algorithm is laid out as follows.
1. Measure the current traffic rate.
2. Get the best configuration from the red-black tree using the desired requirement (i.e., latency-, throughput-, or energy-aware). Change to this configuration only if it was measured to be better than the current one by a factor of λ.
3. Start creating batches of the specified size. If more than one device is required, create batches for each device accordingly. The batches are inserted into the queue of the corresponding device(s).
4. Measure the performance achieved by each of the devices for the submitted batch(es).
5. If the sustained performance is similar to the one requested from the red-black tree (up to a threshold δ), return to Step 1; otherwise, update the tree accordingly, and:
   a) If the performance achieved by each device is worse, increase the batch size by a factor of 2.
   b) If the performance achieved by each device is better, decrease the batch size by a factor of 2.
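
The two decision rules in the steps above (the λ check in Step 2 and the batch-size adjustment in Step 5) can be sketched as small functions. This sketch assumes a "higher is better" performance score such as throughput and treats δ as an absolute difference; both are assumptions, as are the function names and the numbers in the usage lines.

```python
def should_switch(current_score, candidate_score, lam):
    """Step 2: switch only if the candidate is better by a factor of lam.

    Scores are 'higher is better' (an assumption of this sketch).
    """
    return candidate_score >= lam * current_score


def next_batch_size(batch_size, requested, achieved, delta):
    """Step 5: keep the batch size when the achieved performance is within
    delta of the requested one; otherwise double or halve it."""
    if abs(achieved - requested) <= delta:
        return batch_size
    if achieved < requested:        # worse than the stored profile
        return batch_size * 2
    return max(1, batch_size // 2)  # better than the stored profile


# Hypothetical values: lam = 1.2, delta = 0.5, scores in arbitrary units.
print(should_switch(10.0, 12.5, lam=1.2))
print(next_batch_size(1024, requested=10.0, achieved=7.0, delta=0.5))
```

Requiring an improvement by a factor of λ before switching keeps the scheduler from flipping between two configurations that perform almost the same.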

Evaluation
We use an energy-critical policy, i.e., we handle all input traffic at the maximum energy efficiency.
