Shangri-La: Achieving High Performance from Compiled Network Applications while Enabling Ease of Programming
Michael K. Chen, Xiao Feng Li, Ruiqi Lian, Jason H. Lin, Lixia Liu, Tao Liu, Roy Ju
Discussion Prepared by Jennifer Chiang

Shangri-La: Some Insight
A synonym for paradise: the legendary place from James Hilton's novel Lost Horizon.
Goal: achieve a "perfect" compiler for network processors.

Introduction
Problem: Programming network processors is challenging. Tight memory access and instruction budgets must be met to sustain high line rates. Traditionally this is done with hand-coded assembly.
Solution: Researchers have recently proposed high-level programming languages for packet processing.
Challenge: Can these languages be compiled into code competitive with hand-tuned assembly?

Shangri-La Compiler from a 10,000-Foot View
Consists of a programming language, compiler, and runtime system targeted at the Intel IXP multi-core network processor. Accepts packet programs written in Baker.
Maximizes processor utilization: hot code paths are mapped across processing elements.
No hardware caches on the target: delayed-update software-controlled caches are used for frequently accessed data.
Packet handling optimizations reduce per-packet memory accesses and instruction counts.
Custom stack model: maps stack frames to the fastest levels of the target processor's memory hierarchy.

Baker Programming Language
Baker programs are structured as a dataflow of packets from Rx to Tx.
Module: container for holding related PPFs, wirings, support code, and shared data.
PPF (packet processing function): C-like code that performs the actual packet processing; holds temporary local state and accesses global data structures.
CC (communication channel): input and output channel endpoints of PPFs are wired together; channels are asynchronous FIFO-ordered queues.
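The Rx-to-Tx dataflow structure can be sketched as follows (an illustrative Python sketch only; Baker's actual syntax is C-like, and the channel API and PPF here are invented for this example):

```python
from collections import deque

# A communication channel (CC): an asynchronous FIFO queue.
class Channel:
    def __init__(self):
        self.q = deque()
    def send(self, pkt):
        self.q.append(pkt)
    def recv(self):
        return self.q.popleft() if self.q else None

# A packet processing function (PPF): reads packets from its input
# channel, transforms them, and writes survivors to its output channel.
def decrement_ttl_ppf(cin, cout):
    while (pkt := cin.recv()) is not None:
        pkt["ttl"] -= 1
        if pkt["ttl"] > 0:          # drop packets whose TTL expires
            cout.send(pkt)

# Wiring, mimicking a module's wiring section: Rx -> PPF -> Tx.
rx_to_ppf, ppf_to_tx = Channel(), Channel()
rx_to_ppf.send({"dst": "10.0.0.1", "ttl": 2})
rx_to_ppf.send({"dst": "10.0.0.2", "ttl": 1})
decrement_ttl_ppf(rx_to_ppf, ppf_to_tx)
```

The key structural point is that PPFs share no control flow; they communicate only through FIFO channels, which is what lets the compiler place them on different processing elements.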

Baker Program Example
(Figure: an example module, with PPFs connected by CCs.)

Packet Support
Protocols are specified using Baker's protocol construct.
Metadata: stores state associated with a packet but not contained in the packet itself; useful for passing state produced by one PPF to a later PPF.
packet_handle: used to manipulate packets.
(Figure: packet data, metadata, and packet_handle.)
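The split between in-packet fields and out-of-band metadata can be sketched like this (illustrative Python; the field layout and class names are invented, not Baker's real protocol syntax):

```python
# A "protocol" declaration reduced to a field layout: name -> (offset, size).
# Loosely mirrors what Baker's protocol construct would describe for IPv4.
IPV4 = {"version_ihl": (0, 1), "ttl": (8, 1), "src": (12, 4), "dst": (16, 4)}

class PacketHandle:
    """Wraps raw packet bytes plus out-of-band metadata."""
    def __init__(self, data: bytes):
        self.data = bytearray(data)
        self.metadata = {}          # state NOT stored in the packet itself

    def get(self, proto, field):
        off, size = proto[field]
        return int.from_bytes(self.data[off:off + size], "big")

# A 20-byte dummy IPv4 header with ttl=64 at offset 8.
hdr = bytearray(20)
hdr[8] = 64
pkt = PacketHandle(bytes(hdr))
pkt.metadata["input_port"] = 3      # e.g., set by Rx, read by a later PPF
```

Accesses through the handle are what the compiler later optimizes: field reads become memory accesses whose placement and width it controls.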

IXP2400 Network Processor
Intel XScale core: processes control packets, executes noncritical application code, and handles initialization and management of the network processor.
8 MEs (microengines): lightweight, multithreaded, pipelined processors running a special ISA designed for processing packets.
4 levels of memory: Local Memory, Scratch Memory, SRAM, and DRAM.
(Figure: memory hierarchy connecting the XScale core and MEs to Local Memory, Scratch Memory, SRAM, and DRAM.)

Compiler Details

Aggregation
Throughput model: t = (n / P) × k, where
n = number of MEs
P = total number of pipeline stages
k = throughput of the pipeline stage with the lowest throughput
t = overall throughput
Latency of a packet through the system can be tolerated, but minimum forwarding rates must be guaranteed.
To maximize throughput, the compiler pipelines or duplicates code across multiple processing elements.
Techniques: pipelining, merging, duplication.

Delayed-Update Software-Controlled Caching
Caching candidates: frequently read data structures with high hit rates that are infrequently written.
Updates to these structures rely only on the coherency of a single atomic write to guarantee correctness.
This reduces the frequency and cost of coherency checks.
Penalty of a late update: transient packet delivery errors.
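The single-atomic-write idea can be sketched as follows (an illustrative Python sketch with invented names, not the paper's implementation; the reference assignment is effectively atomic under CPython's GIL, standing in for the single atomic write):

```python
# Readers look up entries in a published snapshot of a route table.
# An update builds a complete new table off to the side, then publishes
# it with one reference assignment. Readers never see a half-written
# table, but may briefly keep using the old one -- the "delayed" update.
class RouteCache:
    def __init__(self, routes):
        self._snapshot = dict(routes)   # published snapshot

    def lookup(self, dst):
        return self._snapshot.get(dst)  # read current snapshot, no locks

    def update(self, routes):
        new = dict(routes)              # build new table privately
        self._snapshot = new            # single atomic publish

cache = RouteCache({"10.0.0.0/8": "port1"})
before = cache.lookup("10.0.0.0/8")
cache.update({"10.0.0.0/8": "port2"})
after = cache.lookup("10.0.0.0/8")
```

The window between building the new table and readers picking it up is where the late-update penalty (misdelivered packets) comes from.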

PAC: Packet Access Combining
Packet data is always stored in DRAM.
If every packet access mapped to a DRAM access, packet forwarding rates would quickly be limited by DRAM bandwidth.
In the code generation stage, the compiler combines multiple protocol field accesses into a single wide DRAM access.
The same can be done for SRAM metadata accesses.
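The effect of combining can be sketched by counting simulated DRAM accesses (illustrative Python; the offsets and the `dram_read` helper are invented for this example):

```python
# Instead of one narrow DRAM read per header field, fetch a single wide
# chunk covering every field the code path touches, then slice locally.
dram_reads = 0

def dram_read(packet, offset, size):
    global dram_reads
    dram_reads += 1
    return packet[offset:offset + size]

def parse_fields_naive(packet):
    ttl = dram_read(packet, 8, 1)
    proto = dram_read(packet, 9, 1)
    src = dram_read(packet, 12, 4)
    dst = dram_read(packet, 16, 4)       # 4 separate DRAM accesses
    return ttl, proto, src, dst

def parse_fields_combined(packet):
    chunk = dram_read(packet, 8, 12)     # 1 wide DRAM access
    return chunk[0:1], chunk[1:2], chunk[4:8], chunk[8:12]

pkt = bytes(range(20))
naive = parse_fields_naive(pkt)
n_naive = dram_reads
dram_reads = 0
combined = parse_fields_combined(pkt)
n_combined = dram_reads
```

Both versions return identical field bytes, but the combined version issues a quarter of the DRAM accesses, which is exactly the pressure PAC relieves.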

Stack Layout Optimization
Goal: allocate as many stack frames as possible to the limited amount of fast memory.
The stack can grow into SRAM, but SRAM has high latency and hurts performance.
Assign Local Memory to procedures higher in the program call graph; assign SRAM only when Local Memory is completely exhausted.
Physical and virtual stack pointers are used to manage the split stack.
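A minimal greedy sketch of this placement policy (illustrative Python; the procedure names, frame sizes, and first-fit strategy are assumptions, not the paper's exact algorithm):

```python
# Walk procedures from the top of the call graph down, placing each
# frame in fast Local Memory until it is exhausted, then spilling the
# remaining frames to (slower) SRAM.
def place_frames(frames, local_mem_bytes):
    """frames: [(procedure, frame_size)], ordered top of call graph first."""
    placement, remaining = {}, local_mem_bytes
    for proc, size in frames:
        if size <= remaining:
            placement[proc] = "local"
            remaining -= size
        else:
            placement[proc] = "sram"     # high-latency fallback
    return placement

frames = [("main", 64), ("classify", 128), ("forward", 256), ("log", 512)]
placement = place_frames(frames, local_mem_bytes=512)
```

Ordering by call-graph depth front-loads the procedures executed on every packet, so the hot frames are the ones that win the fast memory.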

Experimental Results
3 benchmarks: L3-Switch, Firewall, MPLS.
The significant impact of PAC is evident in the large reduction in per-packet SRAM and DRAM accesses.
Code generated by Shangri-La for all 3 benchmarks achieved 100% forwarding rates at 2.5 Gbps, meeting the designed spec of the IXP2400.
The same throughput target is achieved by hand-coded assembly written specifically for these processors.

Conclusions
Shangri-La provides a complete framework for aggressively compiling network programs.
It reduces both per-packet instruction and memory access counts.
It achieved the goal of a 100% packet forwarding rate at 2.5 Gbps.