Hermes: An Integrated CPU/GPU Microarchitecture for IP Routing Yuhao Zhu*,

Hermes: An Integrated CPU/GPU Microarchitecture for IP Routing. Authors: Yuhao Zhu, Yangdong Deng, Yubei Chen. Publisher: DAC'11, June 5-10, 2011, San Diego

Hermes: An Integrated CPU/GPU Microarchitecture for IP Routing
Yuhao Zhu*, Yangdong Deng‡, Yubei Chen‡
Presenters: Abraham Addisie, Vaibhav Gogte
*Electrical and Computer Engineering, University of Texas at Austin
‡Institute of Microelectronics, Tsinghua University

Outline
- Introduction
- Motivation
- Related work
- GPU overview
- Hermes architecture
- Adaptive warp scheduling
- Hardware implementation
- Experimental analysis
- Conclusion

Introduction: IP Packet Processing
Processing of an IP packet at a router, after the packet is received:
1. Checking the IP header
2. Packet classification
3. Routing table lookup
4. Decrementing the Time to Live (TTL) value
5. IP fragmentation (if the packet is larger than the Maximum Transmission Unit)
New processing requirements, such as deep packet inspection, are being added to this list.
Figure: the incoming packet carries a MAC header (source MAC: mx, destination MAC: my), an IP header (source IP: x, destination IP: y), and data. The outgoing packet carries a new MAC header (new source and destination MAC), while the IP source and destination addresses (x, y) are unchanged.
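The header check, routing table lookup, and TTL decrement listed above can be sketched in a few lines of Python. This is a minimal illustration rather than the code evaluated in the talk: it assumes a 20-byte IPv4 header without options, a routing table held as a plain list, and it omits classification and fragmentation. The names `build_header`, `lookup`, and `forward` are hypothetical.

```python
import ipaddress
import struct

def ipv4_checksum(header: bytes) -> int:
    """One's-complement sum over 16-bit words; 0 for a valid header."""
    total = sum(word for (word,) in struct.iter_unpack("!H", header))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

def build_header(src: str, dst: str, ttl: int) -> bytes:
    """Hypothetical helper: a 20-byte IPv4 header with no options."""
    hdr = bytearray(struct.pack(
        "!BBHHHBBH4s4s", 0x45, 0, 20, 0, 0, ttl, 17, 0,
        ipaddress.ip_address(src).packed, ipaddress.ip_address(dst).packed))
    hdr[10:12] = struct.pack("!H", ipv4_checksum(bytes(hdr)))
    return bytes(hdr)

def lookup(table, dst):
    """Longest-prefix match over a list of (network, next_hop) pairs."""
    matches = [(net.prefixlen, hop) for net, hop in table if dst in net]
    return max(matches, key=lambda m: m[0])[1] if matches else None

def forward(header: bytes, table):
    """Return (next_hop, new_header), or None if the packet is dropped."""
    # Step 1: check the IP header (version/length byte and checksum).
    if len(header) != 20 or header[0] != 0x45 or ipv4_checksum(header) != 0:
        return None
    if header[8] <= 1:  # TTL would expire at this hop
        return None
    # Step 3: routing table lookup on the destination address.
    next_hop = lookup(table, ipaddress.ip_address(header[16:20]))
    if next_hop is None:
        return None
    # Step 4: decrement the TTL and recompute the checksum.
    out = bytearray(header)
    out[8] -= 1
    out[10:12] = b"\x00\x00"
    out[10:12] = struct.pack("!H", ipv4_checksum(bytes(out)))
    return next_hop, bytes(out)
```

Each call to `forward` touches only its own header, which is the property that later slides rely on when they treat packets as independent units of parallel work.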

Motivation
- Internet traffic is increasing exponentially (multimedia applications, social networks, Internet of Things), which calls for a high-throughput router.
- Network protocols are being added and modified (for example, the transition from 32-bit IPv4 to 128-bit IPv6), which calls for a highly programmable router.
- New processing-intensive tasks, such as deep packet inspection, are being added.

Related Work
- ASIC-based router: long design turnaround, high non-recurring engineering cost
- Network processor (NP) based router: no effective programming model; Intel discontinued its NP router business
- General-purpose processor (GPP, software) based router: low performance
- GPU-based router: high performance and high programmability

Related Work: CPU vs. GPU Throughput
- GPP (software) based router: low-throughput processor
- GPU-based software router: high-throughput processor
Source: PacketShader, Han et al. [2010]

Exploiting High-Throughput GPUs for IP Routing
- The processing of one packet is independent of the others, so data-level parallelism equals packet-level parallelism.
- Pipeline: packet queue, batching, parallel processing by the GPU.
- A GPU-based router has been shown to outperform a software-based router by 30x in throughput (PacketShader, Han et al. [2010]).
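The queue, batch, and parallel-process pipeline can be mimicked on a CPU to show its shape. In the sketch below a thread pool stands in for the GPU, and the function names `take_batch` and `process_batch` are hypothetical. It illustrates the batching pattern only and makes no claim about performance.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor

def take_batch(queue: deque, batch_size: int) -> list:
    """Remove up to batch_size packets from the front of the queue."""
    batch = []
    while queue and len(batch) < batch_size:
        batch.append(queue.popleft())
    return batch

def process_batch(batch: list, kernel, workers: int = 4) -> list:
    """Apply the same per-packet kernel to every packet in the batch.

    Packets are independent, so the calls can run in parallel;
    pool.map returns results in the original batch order.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(kernel, batch))
```

For example, with a queue of TTL values, `process_batch(take_batch(queue, 4), lambda ttl: ttl - 1)` decrements four packets at once and leaves the rest queued for the next batch.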

Limitations of Existing GPU-Based Routers
- Memory mapping from the CPU's main memory to the GPU's device memory goes through the PCIe bus, which has a peak bandwidth of 8 GB/s.
- GPU throughput is 30x the CPU's without memory mapping, but is reduced to 5x the CPU's once the memory mapping overhead is included.
- A minimum latency cannot be guaranteed for an individual packet.
Solution: Hermes
Figure: architecture of the NVIDIA GTX480

Hermes: Shared Memory Hierarchy
Hermes is an integrated CPU/GPU architecture for IP routing.
- Lower packet transfer overhead, through shared memory
- Lower per-packet latency, through adaptive warp scheduling

Adaptive Warp Issue
- The CPU monitors the packets and updates the task FIFO with the number of packets to be processed; data transfer goes through shared memory.
- The GPU fetches at a minimum granularity of one warp.
- Warp issue depends on the arrival pattern of packets and the available resources in the GPU.
- Tradeoff in updating the FIFO:
  Too large: average packet delay increases
  Too small: complicated GPU fetch scheduling
Figure: CPU, task FIFO, shared memory, and an array of GPU SMPs
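The tradeoff can be made concrete with a toy model of the task FIFO. The sketch below is a simplification under stated assumptions, not the hardware design: each FIFO entry holds a packet count written by the CPU, a warp holds 32 threads with one packet per thread, and warps are never filled across entries. The class name `TaskFifo` and its methods are hypothetical.

```python
from collections import deque

WARP_SIZE = 32  # threads per warp on NVIDIA GPUs

class TaskFifo:
    """Toy model: each entry is a count of packets ready for processing."""

    def __init__(self):
        self.entries = deque()

    def cpu_update(self, n_packets: int) -> None:
        """CPU side: publish how many new packets sit in shared memory."""
        if n_packets > 0:
            self.entries.append(n_packets)

    def gpu_fetch(self, free_warps: int) -> list:
        """GPU side: issue up to free_warps warps.

        Returns the number of active threads in each issued warp.
        """
        issued = []
        while self.entries and len(issued) < free_warps:
            pending = self.entries[0]
            take = min(pending, WARP_SIZE)
            issued.append(take)
            if take == pending:
                self.entries.popleft()
            else:
                self.entries[0] = pending - take
        return issued
```

In this model, frequent small updates such as `cpu_update(5)` followed by `cpu_update(3)` produce two underfilled warps, while a single large update such as `cpu_update(70)` fills warps well but implies that the first packets waited for the last ones to arrive.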

In-Order Commit
- Applications that use UDP expect packets to arrive in order.
- A lookup table (LUT) maps each Delay Commit Queue (DCQ) entry ID to a warp ID.
- The IDs of warps in flight are recorded.
- Warps are committed in order.
Figure: warp allocator, warp scheduler, shader core, write-back stage, DCQ (DCQ entry ID, warp ID), and LUT
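In-order commit behaves much like a reorder buffer: warps may finish in any order, but results leave only from the head of the queue. The sketch below is a software analogy under that assumption. For convenience its lookup table maps a warp ID to its queue entry, and the class and method names are hypothetical.

```python
from collections import deque

class DelayCommitQueue:
    """Toy model of in-order commit for warps that finish out of order."""

    def __init__(self):
        self.queue = deque()  # entries in issue order: [warp_id, done]
        self.lut = {}         # warp_id -> its queue entry (warps in flight)

    def allocate(self, warp_id: int) -> None:
        """Called when a warp is issued; reserves the next queue slot."""
        entry = [warp_id, False]
        self.queue.append(entry)
        self.lut[warp_id] = entry

    def finish(self, warp_id: int) -> list:
        """Called at write-back; returns the warp IDs committed now."""
        self.lut.pop(warp_id)[1] = True
        committed = []
        # Only the head of the queue may commit, which preserves order.
        while self.queue and self.queue[0][1]:
            committed.append(self.queue.popleft()[0])
        return committed
```

If warps 0, 1, and 2 are allocated and warp 2 finishes first, nothing is committed until warp 0 finishes; warps 1 and 2 then commit together once warp 1 is done.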

Hardware and Area Overhead
- Task FIFO: 32-bit entries (area value in mm² missing from the transcript)
- Delay Commit Queue: size depends on the maximally allowed concurrent warps (MCWs) and the number of shader cores; 8-bit, 1028 entries (area value in mm² missing from the transcript)
- DCQ-warp LUT: size depends on the number of MCWs; 16-bit, 32 entries (area value in mm² missing from the transcript)
Hardware overhead: negligible

Experimental Setup
- Performance is evaluated with the cycle-accurate GPGPU-Sim simulator.
- Benchmarks: checking IP header, packet classification, routing table lookup, decrementing TTL, IP fragmentation, and deep packet inspection
- Traffic: both burst and sparse patterns
- QoS parameters: throughput, delay, delay variance
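The three QoS parameters can be computed from per-packet timestamps. The sketch below assumes a hypothetical trace format of (arrival time, departure time, size in bytes) tuples; it is not tied to GPGPU-Sim output, and the function name `qos_metrics` is an illustration only.

```python
from statistics import mean, pvariance

def qos_metrics(packets):
    """packets: non-empty list of (arrival_s, departure_s, size_bytes)."""
    delays = [dep - arr for arr, dep, _ in packets]
    # Observation window: first arrival to last departure.
    span = max(dep for _, dep, _ in packets) - min(arr for arr, _, _ in packets)
    total_bits = 8 * sum(size for _, _, size in packets)
    return {
        "throughput_bps": total_bits / span,
        "mean_delay_s": mean(delays),
        "delay_variance_s2": pvariance(delays),
    }
```

Population variance (`pvariance`) is used here because the trace is treated as the full set of packets rather than a sample; `statistics.variance` would be the alternative for sampled traces.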

Throughput Evaluation
Figures: burst traffic without DCQ; sparse traffic without DCQ; computing rates of benchmark applications
- No packet queueing
- CPU/GPU is still unable to deliver at the input rate
- Hermes outperforms CPU/GPU by a factor of 5
- Better resource utilization with increasing MCW

Delay Analysis
Figures: burst traffic without DCQ; delay with DCQ vs. without DCQ
- Simple processing in the GPU, with CPU-side waiting overlapped with GPU processing
- Packet delay is reduced by 81.2%
- Divergent branches take longer to process, starving the packets

Conclusion
- Lack of QoS and CPU-GPU communication overhead are the major bottlenecks.
- Hermes is a closely coupled CPU-GPU solution that:
  meets stringent delay requirements
  enables QoS through optimized configuration
  requires minimal hardware extension
- It is a novel, high-quality packet processing engine for future software routers.

Discussion Points
- Are GPUs really easy to program for processing packets?
- How do the performance and area overhead compare with ASIC-based routers?
- Is router programmability really a crucial concern?