Exploiting Graphics Processors for High-performance IP Lookup in Software Routers
Author: Jin Zhao, Xinya Zhang, Xin Wang, Yangdong Deng, Xiaoming Fu

Exploiting Graphics Processors for High-performance IP Lookup in Software Routers
Author: Jin Zhao, Xinya Zhang, Xin Wang, Yangdong Deng, Xiaoming Fu
Publisher: IEEE INFOCOM 2011
Presenter: Ye-Zhi Chen
Date: 2012/01/11

INTRODUCTION
 This paper presents GALE, a GPU-Accelerated Lookup Engine, which enables high-performance IP lookup in software routers.
 The authors leverage the Compute Unified Device Architecture (CUDA) programming model to run IP lookups in parallel on the many-core GPU.

INTRODUCTION
The key innovations include:
1) IP addresses are translated directly into memory addresses using a large direct table in GPU memory that stores all possible routing prefixes.
2) Route-update operations are also mapped to CUDA in order to exploit the GPU's vast processing power.
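The direct translation in point 1 can be illustrated with a small CPU-side sketch (Python here purely for illustration; in GALE this runs on the GPU): the high-order 24 bits of an IPv4 address serve directly as an index into the table. All names below are ours, not the paper's.

```python
# Model of the direct-table address translation: the top 24 bits of an
# IPv4 address index a 2^24-entry table, so no trie traversal is needed.

def ip_to_index(ip_str):
    """Convert a dotted-quad IPv4 address to its 24-bit direct-table index."""
    a, b, c, d = (int(x) for x in ip_str.split("."))
    ip = (a << 24) | (b << 16) | (c << 8) | d
    return ip >> 8  # drop the low 8 bits, keep the top 24

# All addresses in the same /24 map to the same entry:
print(ip_to_index("10.0.0.1"))
print(ip_to_index("10.0.0.255"))
```

Because the index is computed with a single shift, every lookup touches the table exactly once, which is the O(1) behavior the paper targets.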

Architecture
GALE keeps the traditional trie-based routing table in system memory.
GALE also stores the next-hop information for all possible IP prefixes in a fixed-size memory block on the GPU, referred to as the direct table.
Two routing tables are maintained because of the inherent tradeoff between efficiency and overhead: the direct table and the trie each suit different operations.

Architecture

GALE employs a control thread and a pool of working threads to serve lookup and update requests. Request processing is divided into the following steps:
1) The control thread reads a group of requests of the same type from the request queue, where requests are classified into Lookup, Insertion, Modification, and Deletion operations.
2) The control thread activates one idle working thread in the thread pool to process the group of requests.
3) The working thread invokes the corresponding GPU and CPU code to perform the lookup or update operations.
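The dispatch loop above can be sketched on the CPU side as follows (a minimal Python model under our own assumptions; the "GPU kernel" is stubbed out as a plain function, and all names are hypothetical):

```python
import queue
import threading

requests = queue.Queue()       # request groups, already batched by type
results = []
results_lock = threading.Lock()

def process_group(op, group):
    """Working thread: stands in for launching the corresponding GPU kernel."""
    with results_lock:
        results.append((op, len(group)))

def control_thread():
    """Read same-type request groups and hand each to an idle working thread."""
    workers = []
    while True:
        op, group = requests.get()
        if op == "stop":
            break
        w = threading.Thread(target=process_group, args=(op, group))
        w.start()
        workers.append(w)
    for w in workers:
        w.join()

requests.put(("lookup", ["1.2.3.4", "5.6.7.8"]))
requests.put(("insert", ["10.0.0.0/24"]))
requests.put(("stop", None))
control_thread()
```

Batching same-type requests matters because each working thread launches one kernel, and all threads of a kernel then execute the same code path, which suits the GPU's SIMT execution model.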

Architecture
With GALE, parallelism in request processing is exploited in two ways:
1. Since the CPU is also multi-core, the working threads can execute simultaneously. Each working thread instructs the GPU to launch one kernel; because the GTX 470 supports multiple concurrent kernels on different SMs, different request groups can be scheduled in parallel.
2. Inside one kernel, all requests in the group are of the same type. Parallelism is achieved by mapping these requests to CUDA threads, which in turn are scheduled in parallel onto different SPs on the GPU.

Lookup
The authors explore an IP lookup solution with O(1) memory-access and computational complexity: each IP address has a one-to-one mapping to an entry in the direct table.
They found that over 40% of IP prefixes have a length of 24, and over 99% of IP prefixes are 24 bits or shorter. Leveraging this fact, they propose storing all possible prefixes of up to 24 bits in one direct table (which costs only 2^24 = 16M entries) and using a separate long table to store the prefixes longer than 24 bits.
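The two-table lookup can be modeled on the CPU roughly as follows (Python, with simplifications that are ours alone: the direct table is modeled sparsely as a dict, and the long table is probed by masking at each length from /32 down to /25, since this transcript does not detail the paper's long-table layout):

```python
# CPU-side model of GALE's two-table lookup.
direct_table = {}   # index (top 24 bits of the IP) -> next hop
long_table = {}     # (prefix_value, prefix_len) -> next hop, for /25../32

def lookup(ip):
    # Longer-than-24-bit prefixes first, most specific first (our
    # simplification; GALE consults a dedicated long table).
    for plen in range(32, 24, -1):
        key = (ip >> (32 - plen), plen)
        if key in long_table:
            return long_table[key]
    # Common case: one address translation (ip >> 8), one memory access.
    return direct_table.get(ip >> 8)

# 10.0.0.0/24 -> hop 1 in the direct table
direct_table[(10 << 24) >> 8] = 1
# 10.0.0.128/25 -> hop 2 in the long table
long_table[((10 << 24) | 128) >> 7, 25] = 2
```

For the 99%+ of prefixes that fit in the direct table, the lookup never enters the long-table path, preserving the single-memory-access fast case described on the next slide.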

Lookup
Only one address translation is needed to find the corresponding routing entry, and only one memory access is required to fetch the next-hop destination.

Update
A route update involves operations on both the direct table and the trie structure.
Since the direct table has a fixed size, no entry space needs to be allocated or released on route updates.
To avoid trie-traversal operations on the GPU, a length table is introduced that records the prefix length stored at each direct-table entry.

Update
1) Insertion and Modification: insertion and modification are essentially the same operation for the direct table, since both write the new next-hop information into the range of entries covered by the prefix.
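A CPU-side sketch of this range update (Python; the table is shrunk to 2^16 entries so the example stays light, whereas GALE's direct table has 2^24, and all names are ours):

```python
# plen_tbl plays the role of the "length table" from the previous slide.
TABLE_BITS = 16
next_hop = [0] * (1 << TABLE_BITS)
plen_tbl = [0] * (1 << TABLE_BITS)

def insert(prefix_bits, plen, hop):
    """Write `hop` into every entry covered by prefix_bits/plen, skipping
    entries already claimed by a longer (more specific) prefix. On the
    GPU, each covered entry would be updated by its own CUDA thread."""
    base = prefix_bits << (TABLE_BITS - plen)
    for i in range(base, base + (1 << (TABLE_BITS - plen))):
        if plen_tbl[i] <= plen:   # length table: don't clobber longer prefixes
            next_hop[i], plen_tbl[i] = hop, plen

insert(0b10101010, 8, 1)             # a /8-style prefix: covers 256 entries
insert(0b1010101011110000, 16, 2)    # a more specific /16 inside it
insert(0b10101010, 8, 3)             # modify the /8: the /16 entry must survive
```

The length table is what lets each entry decide independently whether to accept the update, so the per-entry writes need no coordination, which is exactly what makes them map cleanly onto CUDA threads.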

Update
2) Deletion: the deletion operation is similar to modification, except that while updating the range of entries, the next-hop information is replaced with the parent node's next-hop information from the trie.
Deleting an entry thus works as follows: replace the next hop and prefix length of each updated entry with the parent's next hop and prefix length. The parent node is obtained from the trie traversal performed on the CPU while deleting the entry node from the trie.
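Deletion can be sketched in the same shrunken CPU model (Python, 2^16 entries instead of GALE's 2^24; the parent's next hop and prefix length are passed in here, standing in for the values the CPU-side trie traversal would supply):

```python
TABLE_BITS = 16
next_hop = [0] * (1 << TABLE_BITS)
plen_tbl = [0] * (1 << TABLE_BITS)

def covered(prefix_bits, plen):
    """Range of direct-table indices covered by prefix_bits/plen."""
    base = prefix_bits << (TABLE_BITS - plen)
    return range(base, base + (1 << (TABLE_BITS - plen)))

def insert(prefix_bits, plen, hop):
    for i in covered(prefix_bits, plen):
        if plen_tbl[i] <= plen:
            next_hop[i], plen_tbl[i] = hop, plen

def delete(prefix_bits, plen, parent_hop, parent_plen):
    """Revert entries owned by the deleted prefix to the parent's next hop
    and prefix length, as obtained from the trie on the CPU."""
    for i in covered(prefix_bits, plen):
        if plen_tbl[i] == plen:   # only entries this prefix owns
            next_hop[i], plen_tbl[i] = parent_hop, parent_plen

insert(0b10101010, 8, 1)                 # parent /8 -> hop 1
insert(0b1010101011110000, 16, 2)        # child /16 -> hop 2
delete(0b1010101011110000, 16, parent_hop=1, parent_plen=8)
```

After the delete, the child's entry falls back to the parent's next hop, so subsequent lookups behave as if the more specific prefix had never been installed.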

Update

Experiment
The routing datasets used in the experiments are from FUNET and RIS.
Desktop PC equipment ($1122):
Core i5 750 CPU, 2.66 GHz (4 cores)
NVIDIA GeForce GTX 470 with 1280 MB of global memory and 448 stream processors ($428)
OS: 64-bit x86 Fedora 13 Linux distribution with unmodified kernel
CUDA version:

The experiments use the following default configuration: 8 working threads. Every CUDA thread (running on the GPU) takes just one lookup request, or updates one entry while performing update tasks. Every CUDA grid contains 44 thread blocks, and every block contains a fixed number of threads.
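Since one CUDA thread serves one request, the grid size determines how many requests a single kernel launch can process. The per-block thread count is missing from this transcript, so 256 below is purely an illustrative assumption, not the paper's figure:

```python
BLOCKS_PER_GRID = 44       # from the slides
THREADS_PER_BLOCK = 256    # hypothetical value, not from the paper

def batch_capacity(blocks, threads_per_block):
    """With one request per CUDA thread, a grid of blocks * threads_per_block
    threads can serve that many lookup (or update) requests per launch."""
    return blocks * threads_per_block

print(batch_capacity(BLOCKS_PER_GRID, THREADS_PER_BLOCK))
```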
