1 Shangri-La: Achieving High Performance from Compiled Network Applications while Enabling Ease of Programming Michael K. Chen, Xiao Feng Li, Ruiqi Lian, Jason H. Lin, Lixia Liu, Tao Liu, Roy Ju Discussion Prepared by Jennifer Chiang

2 Shangri-La: Some Insight A synonym for paradise; the legendary place from James Hilton's novel Lost Horizon. Goal: achieve a perfect compiler.

3 Introduction Problem: Programming network processors is challenging. Tight memory access and instruction budgets must be met to sustain high line rates. Traditionally done with hand-coded assembly. Solution: Researchers have recently proposed high-level programming languages for packet processing. Challenge: Can these languages be compiled into code competitive with hand-tuned assembly?

4 Shangri-La Compiler from a 10,000-foot view Consists of a programming language, compiler, and runtime system targeted at the Intel IXP multi-core network processor. Accepts packet programs written in Baker. Maximizes processor utilization: hot code paths are mapped across processing elements. No hardware caches on the target: delayed-update software-controlled caches hold frequently accessed data. Packet handling optimizations reduce per-packet memory accesses and instruction counts. Custom stack model maps stack frames to the fastest levels of the target processor's memory hierarchy.

5 Baker Programming Language Baker programs are structured as a dataflow of packets from Rx to Tx. Module: container holding related PPFs, wirings, support code, and shared data. PPF (packet processing function): C-like code that performs the actual packet processing; holds temporary local state and accesses global data structures. CC (communication channel): input and output channel endpoints of PPFs wired together; asynchronous FIFO queues.
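The PPF/CC dataflow structure above can be sketched in Python. This is not Baker syntax; the `Channel` class and `classify_ppf` function are hypothetical stand-ins showing how PPFs communicate only through FIFO channels:

```python
import queue

class Channel:
    """Asynchronous FIFO queue connecting PPF endpoints (a CC)."""
    def __init__(self):
        self._q = queue.Queue()

    def send(self, pkt):
        self._q.put(pkt)

    def receive(self):
        return self._q.get()

def classify_ppf(rx: Channel, tx: Channel):
    """A PPF: receive a packet, process it, forward it downstream."""
    pkt = rx.receive()
    pkt["ttl"] -= 1          # illustrative per-packet processing
    tx.send(pkt)

# Wiring: Rx -> classify -> Tx
rx, tx = Channel(), Channel()
rx.send({"ttl": 64, "dst": "10.0.0.1"})
classify_ppf(rx, tx)
out = tx.receive()
```

Because PPFs share no state except through channels, the compiler is free to place them on different processing elements.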

6 Baker Program Example [Figure: example Baker program showing a Module containing PPFs wired by CCs]

7 Packet Support Protocols are specified using Baker's protocol construct. Metadata stores state associated with a packet but not contained in it; useful for state produced by one PPF and used later by another PPF. Packet_handle is used to manipulate packets.
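A minimal sketch of the data/metadata split described above, assuming a hypothetical `PacketHandle` class (the field names are illustrative, not from Baker):

```python
class PacketHandle:
    """Separates on-the-wire packet bytes from per-packet metadata."""
    def __init__(self, data: bytes):
        self.data = data        # bytes actually contained in the packet
        self.metadata = {}      # state associated with, but not in, the packet

# One PPF records state...
pkt = PacketHandle(b"\x45\x00\x00\x54")
pkt.metadata["input_port"] = 3

# ...and a later PPF reads it without touching the packet bytes.
out_queue = pkt.metadata["input_port"] % 2
```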

8 IXP2400 Network Processor Intel XScale core: processes control packets, executes noncritical application code, handles initialization and management of the network processor. 8 MEs (microengines): lightweight, multi-threaded pipelined processors running a special ISA designed for packet processing. 4 levels of memory: Local Memory, Scratch Memory, SRAM, DRAM.

9 Compiler Details

10 Aggregation Throughput model: t = (n / P) × k, where t = throughput, n = number of MEs, P = total number of pipeline stages, and k = throughput of the slowest pipeline stage. Latency of a packet through the system can be tolerated, but minimum forwarding rates must be guaranteed. To maximize throughput, the compiler pipelines or duplicates code across multiple processing elements. Techniques: pipelining, merging, duplication.
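The throughput model above can be written directly as code. The numbers in the example are illustrative, not measurements from the paper:

```python
def pipeline_throughput(n_mes: int, n_stages: int, slowest_stage_tput: float) -> float:
    """t = (n / P) * k: throughput when n_mes MEs are spread over
    n_stages pipeline stages, limited by the slowest stage's rate."""
    return (n_mes / n_stages) * slowest_stage_tput

# Example: 8 MEs, 4 pipeline stages, slowest stage forwards 0.5 Gbps on one ME.
t = pipeline_throughput(8, 4, 0.5)   # 2 MEs per stage double the slowest stage
```

The formula shows why the compiler tries merging (fewer stages, so larger n/P) and duplication (raising the effective k of the bottleneck stage).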

11 Delayed-Update Software-Controlled Caching Caching candidates: frequently read data structures with high hit rates that are infrequently written. Updates to these structures rely only on the coherency of a single atomic write to guarantee correctness, reducing the frequency and cost of coherency checks. Penalty of delaying updates: transient packet delivery errors while stale entries remain cached.
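A hedged sketch of the single-atomic-write update pattern, using a routing table as the assumed cached structure: readers dereference freely with no coherency check, and the writer stages a full replacement privately before publishing it with one reference assignment (atomic in CPython):

```python
route_table = {"10.0.0.0/8": "port0"}    # frequently read, rarely written

def lookup(prefix):
    return route_table.get(prefix)        # read path: no locks, no checks

def update_routes(new_entries):
    global route_table
    staged = dict(route_table)            # build the update off to the side
    staged.update(new_entries)
    route_table = staged                  # single atomic publish

update_routes({"192.168.0.0/16": "port1"})
```

Until the publish lands, readers may still see the old table, which is exactly the tolerated staleness the slide's penalty refers to.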

12 PAC Packet access combining Packet data is always stored in DRAM. If every packet access mapped to a DRAM access, packet forwarding rates would quickly be limited by DRAM bandwidth. In the code generation stage of the compiler, multiple protocol field accesses are combined into a single wide DRAM access. The same can be done for SRAM metadata accesses.
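The effect of combining can be sketched with an IPv4-style header: instead of one narrow memory access per field, fetch the fields' whole byte range once and decode everything from that single wide read. The header bytes below are illustrative:

```python
import struct

header = bytes([0x45, 0x00, 0x00, 0x54,   # version/IHL, TOS, total length
                0x00, 0x00, 0x40, 0x00,   # identification, flags/frag offset
                0x40, 0x06, 0x00, 0x00])  # TTL, protocol, header checksum

# One 12-byte "wide access" replaces eight narrow field accesses:
ver_ihl, tos, total_len, ident, frag, ttl, proto, csum = \
    struct.unpack_from("!BBHHHBBH", header, 0)
```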

13 Stack Layout Optimization Goal: allocate as many stack frames as possible to the limited amount of fast memory. The stack can grow into SRAM, but SRAM has high latency and hurts performance. Assign Local Memory to procedures higher in the program call graph; assign SRAM only when Local Memory is completely exhausted. Utilize physical and virtual stack pointers.
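A minimal sketch of the placement policy just described, assuming an illustrative Local Memory budget and frame sizes (none of these numbers are from the paper): frames are placed shallowest-first in the call graph, spilling to SRAM once fast memory runs out.

```python
LOCAL_MEM_BYTES = 640   # assumed per-ME fast-memory budget (illustrative)

def assign_frames(frames):
    """frames: list of (proc_name, call_graph_depth, frame_bytes).
    Greedily place shallow frames in Local Memory, spill the rest to SRAM."""
    placement, remaining = {}, LOCAL_MEM_BYTES
    for name, _depth, size in sorted(frames, key=lambda f: f[1]):
        if size <= remaining:
            placement[name] = "local"
            remaining -= size
        else:
            placement[name] = "sram"      # fast memory exhausted
    return placement

layout = assign_frames([("main", 0, 256), ("route", 1, 256), ("hash", 2, 256)])
```

Favoring shallow procedures makes sense because their frames stay live for the whole run, so fast memory is occupied by the most persistently used frames.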

14 Experimental Results 3 benchmarks: L3-Switch, Firewall, MPLS. The significant impact of PAC is evident in the large reduction in per-packet SRAM and DRAM accesses. Code generated by Shangri-La for all 3 benchmarks achieved 100% forwarding rates at 2.5 Gbps, meeting the designed spec of the IXP2400. The same throughput target was achieved by hand-coded assembly written specifically for these processors.

15 Conclusions Shangri-La provides a complete framework for aggressively compiling network programs. It reduces both instruction and memory access counts. Achieved the goal of a 100% packet forwarding rate at 2.5 Gbps.

