
Hermes: An Integrated CPU/GPU Microarchitecture for IP Routing


1 Hermes: An Integrated CPU/GPU Microarchitecture for IP Routing
Yuhao Zhu*, Yangdong Deng‡, Yubei Chen‡
Presenters: Abraham Addisie, Vaibhav Gogte
* Electrical and Computer Engineering, University of Texas at Austin
‡ Institute of Microelectronics, Tsinghua University

2 Outline
- Introduction
- Motivation
- Related work
- GPU Overview
- Hermes Architecture
- Adaptive warp scheduling
- Hardware Implementation
- Experimental Analysis
- Conclusion

3 Introduction: IP Packet Processing
Processing of an IP packet at a router, after receiving the packet:
1. Checking IP header
2. Packet classification
3. Routing table lookup
4. Decrementing Time to Live (TTL) value
5. IP fragmentation (if > Maximum Transmission Unit)
New processing requirements are being added to the list: deep packet inspection.
Packet before processing: MAC header (source MAC: mx, dest MAC: my) | IP header (source IP: x, dest IP: y) | Data
Packet after processing: MAC header (source MAC: new, dest MAC: new) | IP header (source IP: x, dest IP: y) | Data
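Steps 1 and 4 above (header check and TTL decrement) can be sketched in a few lines of Python. This is only an illustration, not code from the Hermes work: the function names `ipv4_checksum` and `forward_step` are hypothetical, and the sketch assumes a 20-byte IPv4 header without options.

```python
import struct
from typing import Optional


def ipv4_checksum(header: bytes) -> int:
    """One's-complement checksum over the header's 16-bit words."""
    if len(header) % 2:
        header += b"\x00"
    total = sum(struct.unpack(f"!{len(header) // 2}H", header))
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def forward_step(header: bytes) -> Optional[bytes]:
    """Check the header, decrement TTL, refresh the checksum.

    Returns the updated header, or None if the packet should be dropped.
    Assumption: no IP options, so the header is exactly 20 bytes (IHL = 5).
    """
    if len(header) != 20 or header[0] >> 4 != 4 or header[0] & 0x0F != 5:
        return None
    if ipv4_checksum(header) != 0:  # a valid header checksums to zero
        return None
    ttl = header[8]
    if ttl <= 1:  # TTL would expire at this hop
        return None
    new = bytearray(header)
    new[8] = ttl - 1
    new[10:12] = b"\x00\x00"  # clear the old checksum before recomputing
    new[10:12] = struct.pack("!H", ipv4_checksum(bytes(new)))
    return bytes(new)
```

Classification, routing table lookup, and fragmentation are left out to keep the sketch short.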

4 Motivation
- Internet traffic is increasing exponentially: multimedia applications, social networks, Internet of Things
- Network protocols are being added and modified: transition from IPv4 (32 bit) to IPv6 (128 bit)
- New processing-intensive tasks are being added: deep packet inspection
Requirements: high-throughput router, highly programmable router

5 Related Work
- ASIC based router: long design turnaround, high non-recurring engineering cost
- Network processor (NP) based router: no effective programming model; Intel discontinued its NP router business
- GPP (software) based router: low performance
- GPU based router: high performance + high programmability

6 Related Work: CPU vs GPU Throughput
- GPP (software) based router: low-throughput processor
- GPU based software router: high-throughput processor
PacketShader: Han et al. [2010]

7 Exploiting High Throughput GPU for IP routing
- Each packet is processed independently of the others
- Data-level parallelism = packet-level parallelism
- A GPU based router is shown to outperform a software based router by 30x in throughput (PacketShader: Han et al. [2010])
Pipeline: packet queue -> batching -> parallel processing by GPU
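The queue, batching, and parallel-processing pipeline on this slide can be mimicked in plain Python. This is a rough sketch and not PacketShader's code: `drain_batch` and `process_batch` are hypothetical names, and the list comprehension only stands in for a GPU kernel launch where each packet would get its own thread.

```python
from collections import deque
from typing import Callable, Deque, List


def drain_batch(queue: Deque[bytes], batch_size: int) -> List[bytes]:
    """Pull up to batch_size packets off the front of the packet queue."""
    batch = []
    while queue and len(batch) < batch_size:
        batch.append(queue.popleft())
    return batch


def process_batch(batch: List[bytes], kernel: Callable[[bytes], bytes]) -> List[bytes]:
    """Stand-in for a GPU launch: packets are independent of each other,
    so the kernel could run on all of them at the same time."""
    return [kernel(packet) for packet in batch]


# Usage: ten queued packets, processed four at a time.
queue = deque(bytes([i]) for i in range(10))
batch_sizes = []
while queue:
    batch = drain_batch(queue, 4)
    process_batch(batch, lambda p: p[::-1])
    batch_sizes.append(len(batch))
```

Because no packet depends on another, the batch size is a free parameter: larger batches keep the GPU busier but make early packets wait longer.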

8 Limitation of existing GPU based router
- Memory mapping from the CPU's main memory to the GPU's device memory goes through the PCIe bus, with a peak bandwidth of 8 GB/s
- GPU throughput = 30x the CPU's, without memory mapping
- Reduced to 5x the CPU's, with memory mapping overhead
- Cannot guarantee minimum latency for an individual packet
Solution: Hermes
Figure: Architecture of NVIDIA GTX480
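The 30x and 5x figures above imply how much of the GPU router's time goes to memory mapping, if one assumes that transfer and kernel time simply add up with no overlap. That serial model is an assumption made here for illustration and is not stated on the slide; `transfer_share` is a hypothetical helper.

```python
from fractions import Fraction


def transfer_share(speedup_compute_only: int, speedup_with_transfer: int) -> Fraction:
    """Fraction of total GPU-side time spent on memory mapping,
    assuming total time = kernel time + transfer time (no overlap)."""
    t_cpu = Fraction(1)  # normalise CPU time per batch to 1
    t_kernel = t_cpu / speedup_compute_only
    t_total = t_cpu / speedup_with_transfer
    t_transfer = t_total - t_kernel
    return t_transfer / t_total


share = transfer_share(30, 5)  # Fraction(5, 6)
```

Under that assumption, 1/5 - 1/30 = 1/6 of the CPU time is spent on transfers, which is 5/6 of the GPU router's total time per batch.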

9 Hermes, integrated CPU/GPU IP routing
- Lower packet transferring overhead: shared memory
- Lower per-packet latency: adaptive warp scheduling
Figure: Shared Memory Hierarchy

10 Adaptive Warp Issue
- Factors: arrival pattern of packets, available resources in GPU
- Tradeoff in updating the FIFO:
  - Too large: average packet delay increases
  - Too low: complicated GPU fetch scheduling
- Minimum fetch granularity: 1 warp
Figure labels: CPU (monitor the packets), Task FIFO (no. of packets to be processed), Shared Memory (data transfer), SMPs
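A toy software model of the Task FIFO may help make the tradeoff concrete. It is a sketch under stated assumptions, not the Hermes hardware: the class `TaskFifo`, its methods, and the `update_interval` parameter are hypothetical, and the warp size of 32 threads is the usual NVIDIA value rather than something given on the slide.

```python
from collections import deque

WARP_SIZE = 32  # threads per warp on NVIDIA GPUs (assumed, one packet per thread)


class TaskFifo:
    """CPU side counts arrivals; GPU side fetches work in whole warps."""

    def __init__(self, update_interval: int):
        self.update_interval = update_interval  # packets per FIFO update
        self.pending = 0                         # arrivals not yet published
        self.entries = deque()                   # each entry: no. of packets to process

    def on_packet_arrival(self) -> None:
        """CPU monitors the packets and publishes a count every update_interval."""
        self.pending += 1
        if self.pending >= self.update_interval:
            self.entries.append(self.pending)
            self.pending = 0

    def fetch_warps(self) -> int:
        """GPU pops one entry and returns how many warps it must issue."""
        if not self.entries:
            return 0
        packets = self.entries.popleft()
        return -(-packets // WARP_SIZE)  # ceiling division: at least 1 warp


fifo = TaskFifo(update_interval=48)
for _ in range(100):
    fifo.on_packet_arrival()
warps = fifo.fetch_warps()  # 48 packets need 2 warps
```

A large `update_interval` makes the first packet of each group wait for the rest, while a small one produces many tiny entries that the GPU must fetch and schedule.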

11 In Order Commit
- UDP protocol users expect packets to arrive in order
- Lookup Table (LUT): maps DCQ entry ID to warp ID
- DCQ: records warp IDs in flight
- Warps are committed in order
Figure labels: Warp Allocator, Warp Scheduler, Shader Core, Write Back Stage, DCQ, Lookup Table (LUT)
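In-order commit behaves much like a reorder buffer, which the following sketch models in software. It is an illustration only: the class and method names are hypothetical, and a Python deque plus a set stand in for the hardware DCQ and LUT.

```python
from collections import deque
from typing import List


class DelayCommitQueue:
    """Warps may finish out of order but are committed in issue order."""

    def __init__(self):
        self.in_flight = deque()  # warp ids in the order they were issued
        self.finished = set()     # warp ids done but not yet committed

    def issue(self, warp_id: int) -> None:
        """Record a newly issued warp at the tail of the queue."""
        self.in_flight.append(warp_id)

    def complete(self, warp_id: int) -> List[int]:
        """Mark a warp as finished and return the warps that can now commit."""
        self.finished.add(warp_id)
        committed = []
        # Only the head of the queue may commit; later warps wait behind it.
        while self.in_flight and self.in_flight[0] in self.finished:
            head = self.in_flight.popleft()
            self.finished.remove(head)
            committed.append(head)
        return committed


dcq = DelayCommitQueue()
for w in (0, 1, 2):
    dcq.issue(w)
first = dcq.complete(2)   # warp 2 finished early, nothing commits yet
second = dcq.complete(0)  # head of the queue commits
third = dcq.complete(1)   # releases warp 1 and the waiting warp 2
```

The cost of ordering is visible in the example: warp 2 finishes first but leaves last.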

12 Hardware and Area Overhead
- Task FIFO: 32 bit, 1028 entries; area = 0.053 mm²
- Delay Commit Queue (DCQ): size depends on maximally allowed concurrent warps (MCWs) and shader cores; 8 bit, 1028 entries; area = 0.013 mm²
- DCQ-Warp LUT: size depends on number of MCWs; 16 bit, 32 entries; area = 0.006 mm²
Hardware overhead: negligible!
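The totals behind this slide are easy to check from the listed widths, entry counts, and areas. The snippet below only adds up the slide's own numbers; the dictionary layout and variable names are illustrative.

```python
from decimal import Decimal

# name: (bits per entry, entries, area in mm^2), all taken from the slide
structures = {
    "Task FIFO": (32, 1028, Decimal("0.053")),
    "Delay Commit Queue": (8, 1028, Decimal("0.013")),
    "DCQ-Warp LUT": (16, 32, Decimal("0.006")),
}

total_bits = sum(width * entries for width, entries, _ in structures.values())
total_bytes = total_bits // 8
total_area = sum(area for _, _, area in structures.values())

print(total_bits, total_bytes, total_area)  # 41632 5204 0.072
```

So the three structures together hold about 5 KB of state in 0.072 mm².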

13 Experimental Setup
- Cycle-accurate GPGPU-Sim to evaluate performance
- Benchmarks: checking IP header -> packet classification -> routing table lookup -> decrementing TTL -> IP fragmentation and deep packet inspection
- Both burst and sparse traffic patterns
- QoS parameters: throughput, delay, delay variance
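The three QoS parameters can be computed from per-packet arrival and departure timestamps. This is a minimal sketch of one possible definition, not the measurement code used with GPGPU-Sim: `qos_metrics` is a hypothetical helper, and throughput is taken here as packets divided by the span from first arrival to last departure.

```python
from statistics import mean, pvariance
from typing import Dict, Sequence


def qos_metrics(arrivals: Sequence[float], departures: Sequence[float]) -> Dict[str, float]:
    """Throughput, mean delay, and delay variance for one packet trace.

    arrivals[i] and departures[i] are timestamps of packet i in the same unit.
    """
    delays = [d - a for a, d in zip(arrivals, departures)]
    span = max(departures) - min(arrivals)
    return {
        "throughput": len(delays) / span,      # packets per time unit
        "delay": mean(delays),                 # average per-packet delay
        "delay_variance": pvariance(delays),   # population variance of delay
    }


metrics = qos_metrics([0, 1, 2, 3], [2, 3, 5, 5])
```

In this small trace the delays are 2, 2, 3, and 2 time units over a span of 5.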

14 Throughput evaluation
- No packet queueing
- CPU/GPU still unable to deliver at input rate
- Outperforms CPU/GPU by a factor of 5
- Better resource utilization with increasing MCW
Figure panels: Computing rates of benchmark applications; Burst traffic without DCQ; Sparse traffic without DCQ

15 Delay analysis
- Simple processing in GPU, overlap of CPU-side waiting with GPU processing
- Packet delay reduction of 81.2%!
- Divergent branches take longer to process, starving the packets
Figure panels: Burst traffic without DCQ; Delay with DCQ vs without DCQ

16 Conclusion
- Lack of QoS and CPU-GPU communication overhead are the major bottlenecks
- Hermes: closely coupled CPU-GPU solution
- Meets stringent delay requirements
- Enables QoS through optimized configuration
- Minimal hardware extension
- Novel high-quality packet processing engine for future software routers

17 Discussion points
- Are GPUs really easy to program for processing packets?
- How do the performance and area overhead compare with ASIC based routers?
- Is router programmability really a crucial concern?

