NFV Compute Acceleration APIs and Evaluation


NFV Compute Acceleration APIs and Evaluation
draft-perumal-nfvrg-nfv-compute-acceleration-00.txt
Bose Perumal, Wenjing Chu, R. Krishnan, S. Hemalatha (Dell); Peter Willis (BT)

NFV Compute Acceleration – Software Architecture

Compute Acceleration for NFV

Motivation
- Network functions are being virtualized.
- The packet-based architecture of networks provides scope for parallel processing.
- Parallel processing can be done on GPUs, on coprocessors such as Intel Xeon Phi, and on multicore CPUs.
- Parallel processing also happens implicitly in hardware in implementations such as Intel QuickAssist.

Generic Software Architecture and APIs
- A generic software architecture provides the framework components.
- APIs make it easy to add network functions and traffic streams and to direct traffic flow.

Evaluation
- Multi-string matching on all incoming packets using the Aho-Corasick algorithm.
- Implemented in OpenCL to avoid hardware dependency.

Aho-Corasick – Multi-String Matching Algorithm
- The Aho-Corasick algorithm is used for multi-string matching.
- An initial state machine is built from the keyword strings.
- The input string is matched by walking the state machine.
- The number of keyword strings does not have a big impact on processing time.
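As a concrete illustration of the state machine the slide describes, here is a compact host-side sketch in Python; the draft's actual implementation is an OpenCL kernel, so this is for clarity only, not the authors' code.

```python
from collections import deque

def build_automaton(keywords):
    """Build the Aho-Corasick automaton: a trie plus failure links."""
    goto = [{}]      # goto[state][char] -> next state
    fail = [0]       # failure link per state
    out = [set()]    # keywords recognized at each state
    for kw in keywords:
        s = 0
        for ch in kw:
            if ch not in goto[s]:
                goto.append({}); fail.append(0); out.append(set())
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].add(kw)
    # Breadth-first pass to compute failure links and merge outputs.
    queue = deque(goto[0].values())
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f][ch] if ch in goto[f] else 0
            out[t] |= out[fail[t]]
    return goto, fail, out

def match(text, automaton):
    """Walk the state machine once over the input, collecting matches."""
    goto, fail, out = automaton
    s, found = 0, set()
    for ch in text:
        while s and ch not in goto[s]:
            s = fail[s]
        s = goto[s].get(ch, 0)
        found |= out[s]
    return found

print(sorted(match("ushers", build_automaton(["he", "she", "his", "hers"]))))
# ['he', 'hers', 'she']
```

Because the input is scanned exactly once regardless of how many keywords are loaded, the keyword count mainly affects automaton build time and memory, which is why the slide notes it has little impact on matching time.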

Test Setup – Dell R720 with Nvidia Tesla K10

Server
- Intel Xeon E5-2665 @ 2.40 GHz, 2 processors with 8 cores each (16 cores total), 2.5 MB cache per core
- RAM: 96 GB
- 1 PCIe x16 full-length, full-height slot supporting PCIe 3.0

GPU Card
- Nvidia Tesla K10 @ 745 MHz, 2 GPU processors with 1536 cores each
- 8 GB memory (4 GB per processor), 256-bit GDDR5
- PCIe 3.0 x16 interface

Development Environment
- Linux distribution: Fedora 20
- Compiler: OpenCL with Nvidia 6.0 libraries

Complete Packet – Multi-String Search
- Variable packet sizes, average packet size 583 bytes
- Signature database: top 5000 website names

Complete Packet – Worst Case, 0 String Matches
- Different signature databases: 1000 / 2500 / 5000 website names
- Different fixed packet sizes: 64 / 583 / 1450 bytes
- Variable packet size with an average packet size of 583 bytes

Results – GPU Multi-String Matching, 35% of Packets Match

Traffic generation and string matching
- Average packet size: 583 bytes
- 16384 packets batched for processing on the GPU
- Each packet checked against 5000 strings (Aho-Corasick state machine)
- 35% of packets matched

Time for a single batch
- Copy from CPU to GPU: 0.923 milliseconds
- Execution: 4.38 milliseconds
- Result buffer copy from GPU to CPU: 0.168 milliseconds

Operations in one second
- 464 iterations executed per second
- 7.6 million packets processed per second
- 33.02 Gbps processed per second
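The per-second figures follow directly from the batch size and iteration rate. A quick check (in Python) reproduces them; note that the slide's 33.02 Gbps figure appears to count a gigabit as 2^30 bits, an assumption on my part, since decimal gigabits would give about 35.46 Gbps.

```python
BATCH_PACKETS = 16384   # packets per GPU batch (from the slide)
AVG_PKT_BYTES = 583     # average packet size
ITERS_PER_SEC = 464     # batches per second in the 35%-match case

packets_per_sec = BATCH_PACKETS * ITERS_PER_SEC
bits_per_sec = packets_per_sec * AVG_PKT_BYTES * 8

print(round(packets_per_sec / 1e6, 2))   # 7.6   (million packets/s)
print(round(bits_per_sec / 2 ** 30, 2))  # 33.02 (Gbps, 2^30 bits per Gb)
print(round(bits_per_sec / 1e9, 2))      # 35.46 (decimal Gbps)
```

The 0-match numbers on the next slide check out the same way: 209 iterations x 16384 packets is about 3.42 million packets per second.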

Results – GPU Multi-String Matching, 0 Packets Match

Traffic generation and string matching
- Average packet size: 583 bytes
- 16384 packets batched for processing on the GPU
- Each packet checked against 5000 strings (Aho-Corasick state machine)
- 0 packets matched

Time for a single batch
- Copy from CPU to GPU: 0.903 milliseconds
- Execution: 9.784 milliseconds
- Result buffer copy from GPU to CPU: 0.161 milliseconds

Operations in one second
- 209 iterations executed per second
- 3.42 million packets processed per second
- 14.9 Gbps processed per second

Portability
- Code written in OpenCL is portable across platforms with only makefile changes.
- The implementation was tested on several platforms:
  - GPU-based acceleration on a bare-metal Linux server
  - GPU-based acceleration from a virtual machine on a VMware hypervisor
  - Intel Xeon Phi based acceleration on a bare-metal Linux server
  - Intel multicore CPU based acceleration on a bare-metal Linux server

NFV Compute Acceleration API
Having a common API for NFV Compute Acceleration (NCA) can abstract the hardware details and enable NFV applications to use compute acceleration. The API will be a C library that users can compile along with their own code.

1. nca_add_network_function – add a network function
2. nca_traffic_stream_init – add a traffic stream and attach network functions
3. nca_add_packets – add packets to the buffer
4. nca_buffer_ready – process packets
5. notify_callback – event notification
6. nca_read_results – read results
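To show how the six calls might fit together, here is a hypothetical mock of the call flow in Python. Only the function names come from the draft; every signature, parameter, and behavior below is an illustrative stand-in (the real API is a C library whose details the draft does not specify).

```python
# Hypothetical NCA call-flow mock. Results delivered via notify_callback
# are queued and drained by nca_read_results, mimicking the event model.
results = []

def nca_add_network_function(name, kernel):
    # Register a network function; 'kernel' stands in for the real NF code.
    return {"name": name, "kernel": kernel}

def nca_traffic_stream_init(functions):
    # Create a traffic stream and attach network functions to it.
    return {"functions": functions, "buffer": []}

def nca_add_packets(stream, packets):
    # Accumulate packets into the stream's batch buffer.
    stream["buffer"].extend(packets)

def nca_buffer_ready(stream):
    # In the real system this would launch a GPU batch; here we run the
    # first attached NF on the CPU and fire the notification callback.
    batch = [stream["functions"][0]["kernel"](p) for p in stream["buffer"]]
    stream["buffer"].clear()
    notify_callback(batch)

def notify_callback(batch):
    results.append(batch)

def nca_read_results():
    return results.pop(0) if results else None

# Usage: register a toy string-match NF, feed packets, process, read back.
nf = nca_add_network_function("match", lambda pkt: b"attack" in pkt)
stream = nca_traffic_stream_init([nf])
nca_add_packets(stream, [b"benign payload", b"an attack payload"])
nca_buffer_ready(stream)
print(nca_read_results())   # [False, True]
```

The batching step (nca_add_packets followed by nca_buffer_ready) mirrors the evaluation's design of amortizing CPU-to-GPU copy cost over 16384 packets at a time.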

Future Work
- Refining the APIs and adding more algorithms for network functions, such as encryption/decryption and compression/decompression.
- Integration with I/O acceleration such as Intel DPDK and ODP.
- Verifying on additional platforms.
- The final goal is a common API for software- or hardware-based acceleration schemes in an OpenDaylight/OpenStack/OPNFV framework.

Thank You

Backup Slides

Implemented Optimizations

Variable-size packet packing
- Multiple copies between CPU and GPU are costly, so variable-size packets are packed into a single buffer.
- The initial portion of the buffer holds all the packet start offsets; the packet contents follow.

Pinned memory
- Using pinned memory reduces copy time by 3x.
- Pinned memory is used for both CPU-to-GPU and GPU-to-CPU copies.

Scheduler for pipelining
- A scheduler was written to pipeline work across multiple command queues.
- With 3 command queues, 99% of the copy time is hidden.

Reducing GPU global memory access
- Reduce GPU global memory accesses; copy to private memory where needed.
- Instead of byte-by-byte copies, copy 32 bytes at a time using vload8 with the float type.
- Move common structures to constant memory.

Organizing GPU cores
- If there are many memory accesses, launch more kernel instances than cores to hide the latency.
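The packing layout described above (an offset table at the start of the buffer, packet contents after it) can be sketched as follows. This is a host-side Python illustration assuming 32-bit little-endian offsets; the draft does not specify the offset width or byte order.

```python
import struct

def pack_packets(packets):
    """Pack variable-size packets into one buffer: offset table first,
    then the concatenated packet contents."""
    header_len = struct.calcsize("<I") * len(packets)
    offsets, payload, pos = [], b"", header_len
    for p in packets:
        offsets.append(pos)   # absolute start offset of this packet
        payload += p
        pos += len(p)
    return struct.pack("<%dI" % len(packets), *offsets) + payload

def unpack_packets(buf, count):
    """Recover the individual packets from a packed buffer (as the GPU
    kernel would, by indexing into the offset table)."""
    offsets = list(struct.unpack_from("<%dI" % count, buf))
    offsets.append(len(buf))  # sentinel: end of last packet
    return [buf[offsets[i]:offsets[i + 1]] for i in range(count)]

buf = pack_packets([b"abc", b"de"])
print(unpack_packets(buf, 2))   # [b'abc', b'de']
```

A single buffer of this shape needs only one CPU-to-GPU transfer per batch, which is the cost the optimization targets.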