
1 Network Processors A generation of multi-core processors INF5063: Programming Heterogeneous Multi-Core Processors April 15, 2015

2 INF5063, Carsten Griwodz & Pål Halvorsen University of Oslo Agere Payload Plus APP550 (diagram labels: classifier memory, classifier buffer, scheduler buffer, scheduler memory, statistics memory, stream editor memory, PCI bus; data paths: from ingress, from co-processor, to egress, to co-processor)

3 INF5063, Carsten Griwodz & Pål Halvorsen University of Oslo Agere Payload Plus APP550 (diagram as on the previous slide)
Pattern Processing Engine
−patterns specified by the programmer
−programmable using a special high-level language
−only pattern-matching instructions
−parallelism by hardware, using multiple copies and several sets of variables
−access to different memories
State Engine
−gathers information (statistics) for scheduling
−verifies that flows stay within bounds
−provides an interface to the host
−configures and controls the other functional units
Packet (protocol data unit) Assembler
−collects all blocks of a frame
−not programmable
Stream Editor (SED)
−two parallel engines
−modifies outgoing packets (e.g., checksum, TTL, …)
−configurable, but not programmable
Reorder Buffer Manager
−transfers data between classifier and traffic manager
−ensures packet order despite parallelism and variable processing times in the pattern processing
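
To make the classifier's job concrete: a pattern-matching engine applies programmer-specified rules to extracted header fields, many in parallel. Below is a minimal C sketch of such a first-match rule lookup, assuming a hypothetical 5-tuple rule format; the APP550 itself is programmed in Agere's pattern language, not C.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical 5-tuple classification rule; real APP550 patterns are
 * written in Agere's pattern language, not C. */
struct rule {
    uint32_t src_ip, src_mask;
    uint32_t dst_ip, dst_mask;
    uint16_t dst_port;          /* 0 means "any port" */
    uint8_t  proto;             /* 0 means "any protocol" */
    int      action;            /* e.g., queue id for the traffic manager */
};

struct pkt_fields {             /* fields extracted by the parser */
    uint32_t src_ip, dst_ip;
    uint16_t dst_port;
    uint8_t  proto;
};

/* Linear first-match lookup; a hardware engine would run many of these
 * comparisons in parallel against multiple packets. */
int classify(const struct rule *rules, size_t n, const struct pkt_fields *p)
{
    for (size_t i = 0; i < n; i++) {
        const struct rule *r = &rules[i];
        if ((p->src_ip & r->src_mask) != (r->src_ip & r->src_mask)) continue;
        if ((p->dst_ip & r->dst_mask) != (r->dst_ip & r->dst_mask)) continue;
        if (r->dst_port && r->dst_port != p->dst_port) continue;
        if (r->proto && r->proto != p->proto) continue;
        return r->action;       /* first matching rule wins */
    }
    return -1;                  /* no match: default action */
}
```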

4 INF5063, Carsten Griwodz & Pål Halvorsen University of Oslo Embedded processors PowerNP (diagram labels: ingress queue, ingress data store, egress queue, egress data store, 4 interfaces IN from net, 4 interfaces OUT to net, 2 interfaces IN from host, 2 interfaces OUT to host, internal memory, external memory, control store, instruction memory, PowerPC core, hardware classifier, dispatch unit)

5 INF5063, Carsten Griwodz & Pål Halvorsen University of Oslo Embedded processors PowerNP (diagram as on the previous slide)
Embedded PowerPC GPP
−no OS on the NP itself
Coprocessors
−8 embedded processors
−4 Kbytes local memory each
−2 cores/processor
−2 threads/core
Link layer
−framing outside the processor

6 INF5063, Carsten Griwodz & Pål Halvorsen University of Oslo IXP1200 Architecture
RISC processor:
−StrongARM running Linux
−control, higher-layer protocols and exceptions
−232 MHz
Microengines:
−low-level devices with a limited instruction set
−transfers between memory devices
−packet processing
−232 MHz
Access units:
−coordinate access to external units
Scratchpad:
−on-chip memory
−used for IPC and synchronization
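
The scratchpad's IPC role can be pictured as a single-producer/single-consumer ring through which a microengine hands packet handles to the StrongARM. A minimal C sketch under that assumption; the structure and names are illustrative, real IXP code uses microengine intrinsics, and memory barriers are omitted for brevity.

```c
#include <stdint.h>

/* Illustrative SPSC ring such as one might place in on-chip scratchpad
 * memory: a microengine produces packet-buffer handles, the StrongARM
 * consumes them.  Layout and names are hypothetical. */
#define RING_SLOTS 64           /* power of two for cheap wrap-around */

struct scratch_ring {
    volatile uint32_t head;     /* written by producer only */
    volatile uint32_t tail;     /* written by consumer only */
    uint32_t slot[RING_SLOTS];  /* packet buffer handles */
};

int ring_put(struct scratch_ring *r, uint32_t pkt_handle)
{
    uint32_t head = r->head;
    if (head - r->tail == RING_SLOTS)
        return -1;                          /* ring full: drop or retry */
    r->slot[head & (RING_SLOTS - 1)] = pkt_handle;
    r->head = head + 1;                     /* publish after the write */
    return 0;
}

int ring_get(struct scratch_ring *r, uint32_t *pkt_handle)
{
    uint32_t tail = r->tail;
    if (tail == r->head)
        return -1;                          /* ring empty */
    *pkt_handle = r->slot[tail & (RING_SLOTS - 1)];
    r->tail = tail + 1;
    return 0;
}
```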

7 INF5063, Carsten Griwodz & Pål Halvorsen University of Oslo IXP2400 Architecture (diagram labels: microengines 1-8, SRAM coprocessor, FLASH, DRAM, SRAM access, SDRAM access, scratch memory, PCI access, MSF access, embedded RISC CPU (XScale), PCI bus, receive bus, transmit bus, DRAM bus, SRAM bus, multiple independent internal buses, slowport access)
RISC processor:
−StrongARM → XScale
−233 MHz → 600 MHz
Microengines:
−6 → 8
−233 MHz → 600 MHz
Media Switch Fabric:
−forms the fast path for transfers
−interconnect for several IXP2xxx chips
Receive/transmit buses:
−shared bus → separate buses
Slowport:
−shared interface to external units
−used for FlashROM during bootstrap
Coprocessors:
−hash unit
−4 timers
−general-purpose I/O pins
−external JTAG connections (in-circuit tests)
−several bulk ciphers (IXP2850 only)
−checksum (IXP2850 only)
−…

8 Example: SpliceTCP INF5063: Programming Heterogeneous Multi-Core Processors

9 INF5063, Carsten Griwodz & Pål Halvorsen University of Oslo TCP Splicing Internet SYN SYNACK Some client

10 INF5063, Carsten Griwodz & Pål Halvorsen University of Oslo TCP Splicing Internet ACK Some client

11 INF5063, Carsten Griwodz & Pål Halvorsen University of Oslo TCP Splicing Internet HTTP-GET DATA Some client

12 INF5063, Carsten Griwodz & Pål Halvorsen University of Oslo TCP Splicing Internet Some client

13 INF5063, Carsten Griwodz & Pål Halvorsen University of Oslo TCP Splicing
(diagram: protocol stacks from physical to application layer on both sides; application-level proxy loop: accept, connect, while(1) { read, write })
Linux Netfilter:
−establish upstream connection
−receive entire packet
−rewrite headers
−forward packet
IXP 2400:
−establish upstream connection
−parse packet headers
−rewrite headers
−forward packet
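
Rewriting headers for a spliced connection means shifting sequence and acknowledgement numbers between the two TCP connections' number spaces. A hedged sketch, assuming the splicer recorded the initial sequence numbers (ISNs) of both connections at splice time; the struct and helper names are ours, not the paper's.

```c
#include <stdint.h>

/* Per-splice state recorded when the two connections are joined:
 * the initial sequence numbers seen on the client and server sides. */
struct splice_state {
    uint32_t client_isn;   /* ISN of client -> proxy connection */
    uint32_t proxy_isn;    /* ISN the proxy used towards the server */
    uint32_t server_isn;   /* ISN of server -> proxy connection */
    uint32_t proxy2_isn;   /* ISN the proxy used towards the client */
};

/* Translate a segment travelling client -> server.  Unsigned 32-bit
 * arithmetic handles sequence-number wrap-around for free.  The TCP
 * checksum must then be patched as well (see the next slide's sketch). */
void splice_c2s(const struct splice_state *s, uint32_t *seq, uint32_t *ack)
{
    *seq = *seq - s->client_isn + s->proxy_isn;    /* renumber data   */
    *ack = *ack - s->proxy2_isn + s->server_isn;   /* renumber ACKs   */
}
```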

14 INF5063, Carsten Griwodz & Pål Halvorsen University of Oslo TCP Splicing
 kernel-level splicing, e.g., Linux netfilter
 PREROUTING: rewrite destination IP
 POSTROUTING: rewrite source IP
 data is still copied to main memory
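
Rewriting the IP addresses also invalidates the IP and TCP checksums; rather than recomputing them over the whole packet, splicers and NATs typically patch them incrementally per RFC 1624. A minimal sketch (the helper names are ours); because the TCP checksum covers a pseudo-header containing the IP addresses, the same update applies to it as well.

```c
#include <stdint.h>

/* Incremental Internet-checksum update (RFC 1624, eqn. 3):
 * HC' = ~(~HC + ~m + m'), applied per 16-bit word of a changed field.
 * Host-order 16-bit values are used here for clarity. */
static uint16_t csum_update16(uint16_t check, uint16_t old, uint16_t new_)
{
    uint32_t sum = (uint16_t)~check + (uint16_t)~old + new_;
    sum = (sum & 0xffff) + (sum >> 16);    /* fold carries */
    sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}

/* Rewrite a 32-bit IP address and patch a checksum accordingly. */
static uint16_t csum_update32(uint16_t check, uint32_t old, uint32_t new_)
{
    check = csum_update16(check, (uint16_t)(old >> 16), (uint16_t)(new_ >> 16));
    return  csum_update16(check, (uint16_t)old, (uint16_t)new_);
}
```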

15 INF5063, Carsten Griwodz & Pål Halvorsen University of Oslo TCP Splicing
 IXP 2400
 can start rewriting the header as soon as enough bytes have arrived (after the first 64 bytes)
 all data goes through the card, but only once
 no user-space process needs to be scheduled
 comment on the paper: flow-control processing, end-to-end semantics, etc. still deserve attention

16 INF5063, Carsten Griwodz & Pål Halvorsen University of Oslo Throughput vs. Request File Size Graph from the presentation of the paper "SpliceNP: A TCP Splicer using a Network Processor", ANCS 2005, Princeton, NJ, Oct 27-28, 2005, by Li Zhao, Yan Luo, Laxmi Bhuyan (Univ. Calif. Riverside) and Ravi Iyer (Intel) Major performance gain at all request sizes

17 Example: Transparent protocol translation and load balancing in a media streaming scenario slides from an ACM MM 2007 presentation by Espeland, Lunde, Stensland, Griwodz and Halvorsen INF5063: Programming Heterogeneous Multi-Core Processors

18 INF5063, Carsten Griwodz & Pål Halvorsen University of Oslo Load Balancer (diagram: IXP 2400 with ingress/egress, RTSP/RTP video server cluster, mplayer clients; RTSP and RTP/UDP flows) Appears as ONE machine to the outside world
Balancer (RTSP/RTP parser):
1. identify connection
2. if the connection exists, send to the right server (select port to use); else create a new session (select one server) and send the packet
Monitor:
−historic and current loads of the different servers
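
Steps 1 and 2 of the balancer amount to a session-table lookup keyed on the client flow, creating an entry (and asking the monitor for a server) on first contact. A minimal C sketch; the table layout, hash, and round-robin monitor stub are illustrative assumptions, not the IXP implementation.

```c
#include <stdint.h>

/* Hypothetical session table for the balancer sketched above: map a
 * client flow to the back-end server chosen when the session began. */
#define TABLE_SLOTS 1024

struct session {
    uint32_t client_ip;
    uint16_t client_port;
    uint32_t server_id;         /* back-end chosen by the monitor */
    int      in_use;
};

static struct session table[TABLE_SLOTS];

static unsigned hash_flow(uint32_t ip, uint16_t port)
{
    uint32_t h = ip * 2654435761u ^ port;   /* Knuth multiplicative hash */
    return h & (TABLE_SLOTS - 1);
}

/* Stand-in for the monitor component: trivial round-robin over two
 * servers instead of real load statistics. */
static uint32_t pick_least_loaded(void)
{
    static uint32_t next;
    return next++ % 2;
}

/* Return the server for this client flow, creating a session on first
 * contact (hash collisions simply evict the old entry in this sketch). */
uint32_t balance(uint32_t ip, uint16_t port)
{
    struct session *s = &table[hash_flow(ip, port)];
    if (!s->in_use || s->client_ip != ip || s->client_port != port) {
        s->client_ip   = ip;
        s->client_port = port;
        s->server_id   = pick_least_loaded();
        s->in_use      = 1;
    }
    return s->server_id;
}
```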

19 INF5063, Carsten Griwodz & Pål Halvorsen University of Oslo Transport Protocol Translator (diagram: IXP 2400 with ingress/egress, Balancer, Monitor, RTSP/RTP parser; RTSP/RTP video server cluster, mplayer clients; RTSP and HTTP flows) HTTP streaming is frequently used today!

20 INF5063, Carsten Griwodz & Pål Halvorsen University of Oslo Transport Protocol Translator (diagram: IXP 2400 with ingress/egress, Balancer, Monitor, RTSP/RTP parser) Protocol translator: RTSP/RTP and RTP/UDP towards the servers, HTTP towards the clients
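
The translation itself is conceptually simple: strip the 12-byte fixed RTP header from each UDP packet and forward the payload as a byte stream on the client's TCP connection. A user-space C sketch of that idea (on the IXP this happens on the card); it assumes the HTTP response headers were already sent and ignores RTP extensions, CSRC lists, and reordering.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <stdint.h>
#include <stddef.h>

#define RTP_FIXED_HDR 12   /* fixed RTP header length, no CSRC entries */

/* Translate one RTP/UDP packet into bytes on an HTTP/TCP connection. */
int translate_one(int udp_sock, int tcp_sock)
{
    uint8_t buf[2048];
    ssize_t n = recv(udp_sock, buf, sizeof(buf), 0);
    if (n <= RTP_FIXED_HDR)
        return -1;                          /* runt packet or error */

    /* Payload starts after the fixed header (CSRC list assumed absent). */
    const uint8_t *payload = buf + RTP_FIXED_HDR;
    size_t len = (size_t)n - RTP_FIXED_HDR;

    ssize_t w = send(tcp_sock, payload, len, 0);  /* TCP adds reliability */
    return (w == (ssize_t)len) ? 0 : -1;
}
```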

21 INF5063, Carsten Griwodz & Pål Halvorsen University of Oslo Machine setup IXP lab switch … IXP2400 local network (192.168.67.xxx) ssh connection (129.240.67.xxx) To 192.168.67.6 TRANSPARENT MEDIA SERVER LOAD BALANCER and TRANSLATOR switch To 192.168.67.5 media server cluster 192.168.67.5 192.168.67.6

22 INF5063, Carsten Griwodz & Pål Halvorsen University of Oslo Results
 The prototype works: it both load-balances and translates between HTTP/TCP and RTP/UDP
 The protocol translation gives a much more stable bandwidth than using HTTP/TCP all the way from the server (graph legend: protocol translation, HTTP)

23 Example: Booster Boxes slide content and structure mainly from the NetGames 2002 presentation by Bauer, Rooney and Scotton INF5063: Programming Heterogeneous Multi-Core Processors

24 INF5063, Carsten Griwodz & Pål Halvorsen University of Oslo Client-Server backbone network local distribution network local distribution network local distribution network

25 INF5063, Carsten Griwodz & Pål Halvorsen University of Oslo Peer-to-peer backbone network local distribution network local distribution network local distribution network

26 INF5063, Carsten Griwodz & Pål Halvorsen University of Oslo IETF's Middleboxes
 Middlebox
−network intermediate device that implements middlebox services
−a middlebox function requires application-specific intelligence
 Examples
−policy-based packet filtering (a.k.a. firewall)
−network address translation (NAT)
−intrusion detection
−load balancing
−policy-based tunneling
−IPsec security
−…
 RFC 3303 and RFC 3304
−from traditional middleboxes, which embed application intelligence within the device, …
−… to middleboxes supporting the MIDCOM protocol, which externalize application intelligence into MIDCOM agents

27 INF5063, Carsten Griwodz & Pål Halvorsen University of Oslo Booster boxes
 Middleboxes
−attached directly to ISPs' access routers
−less generic than, e.g., firewalls or NAT
 Assist distributed event-driven applications
−improve scalability of client-server and peer-to-peer applications
 Application-specific code: "Boosters"
−caching on behalf of a server
−aggregation of events
−intelligent filtering
−application-level routing

28 INF5063, Carsten Griwodz & Pål Halvorsen University of Oslo Booster boxes  Booster Boxes ≈ Middleboxes −attached directly to ISPs’ access routers −less generic than, e.g., firewalls or NAT  Assist distributed event-driven applications −improve scalability of client-server and P2P applications  Application-specific code: “Boosters”

29 INF5063, Carsten Griwodz & Pål Halvorsen University of Oslo Overlay networks backbone network backbone network backbone network LAN IP link Overlay link Overlay node IP path IP layer Overlay network layer Application layer

30 INF5063, Carsten Griwodz & Pål Halvorsen University of Oslo Booster boxes backbone network local distribution network local distribution network local distribution network

31 INF5063, Carsten Griwodz & Pål Halvorsen University of Oslo Booster boxes backbone network local distribution network local distribution network local distribution network Load redistribution by delegating server functions Process data close to the source

32 INF5063, Carsten Griwodz & Pål Halvorsen University of Oslo Booster boxes
 Application-specific code
−Caching on behalf of a server: non-real-time information is cached, and booster boxes answer on behalf of servers
−Aggregation of events: information from two or more clients within a time window is aggregated into one packet
−Intelligent filtering: outdated or redundant information is dropped
−Application-level routing: packets are forwarded based on packet content, application state, and destination address
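
Event aggregation, for example, can be as simple as buffering events until a time window closes and then flushing them as one packet. A small C sketch; the event format, the 50 ms window, and the flush callback are illustrative assumptions.

```c
#include <stdint.h>
#include <stddef.h>

/* Illustrative aggregation of client events within a time window:
 * events arriving within WINDOW_US of the first one are packed into a
 * single outgoing packet instead of being forwarded individually. */
#define WINDOW_US  50000        /* 50 ms window: an assumed value */
#define MAX_EVENTS 32

struct event { uint32_t client_id; uint32_t payload; };

struct aggregator {
    struct event buf[MAX_EVENTS];
    size_t   count;
    uint64_t window_start_us;
};

/* Caller provides the current time and a flush function that sends one
 * aggregated packet upstream. */
void aggregate(struct aggregator *a, struct event e, uint64_t now_us,
               void (*flush)(const struct event *, size_t))
{
    if (a->count &&
        (now_us - a->window_start_us > WINDOW_US || a->count == MAX_EVENTS)) {
        flush(a->buf, a->count);            /* window closed: send packet */
        a->count = 0;
    }
    if (a->count == 0)
        a->window_start_us = now_us;        /* open a new window */
    a->buf[a->count++] = e;
}
```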

33 INF5063, Carsten Griwodz & Pål Halvorsen University of Oslo Architecture  Data Layer −behaves like a layer-2 switch for the bulk of the traffic −copies or diverts selected traffic −IBM’s booster boxes use the packet capture library (“pcap”) filter specification to select traffic
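
With libpcap, this traffic selection looks roughly as follows; the device name and the filter expression (a game's UDP port) are placeholders.

```c
#include <pcap.h>
#include <stdio.h>

/* Select traffic with a pcap filter expression, as the booster-box data
 * layer does; "eth0" and UDP port 4000 are placeholder values. */
int main(void)
{
    char errbuf[PCAP_ERRBUF_SIZE];
    pcap_t *h = pcap_open_live("eth0", 65535 /*snaplen*/, 1 /*promisc*/,
                               100 /*ms timeout*/, errbuf);
    if (!h) { fprintf(stderr, "pcap: %s\n", errbuf); return 1; }

    struct bpf_program prog;
    /* Divert only the application's UDP traffic to the booster layer. */
    if (pcap_compile(h, &prog, "udp port 4000", 1, PCAP_NETMASK_UNKNOWN) < 0 ||
        pcap_setfilter(h, &prog) < 0) {
        fprintf(stderr, "filter: %s\n", pcap_geterr(h));
        return 1;
    }

    struct pcap_pkthdr *hdr;
    const u_char *pkt;
    int rc;
    while ((rc = pcap_next_ex(h, &hdr, &pkt)) >= 0) {
        if (rc == 0) continue;   /* timeout, no packet captured */
        /* hand pkt to a booster; other traffic stays on the fast path */
    }
    pcap_close(h);
    return 0;
}
```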

34 INF5063, Carsten Griwodz & Pål Halvorsen University of Oslo Architecture
 Booster layer
−Booster: application-specific code, executed either on the host CPU or on the network processor
−Library: boosters can call the data-layer operations
−Generates a QoS-aware overlay network (Booster Overlay Network, BON)

35 INF5063, Carsten Griwodz & Pål Halvorsen University of Oslo Data Aggregation Example: Floating Car Data
Cars transmit: position, speed, driven distance, …
Applications: traffic monitoring/predictions, pay-as-you-drive insurance, car maintenance, car taxes, …
Main booster task
 complex message aggregation: statistics gathering, compression, filtering, …
 statistical computations
 context information
 very low real-time requirements

36 INF5063, Carsten Griwodz & Pål Halvorsen University of Oslo Interactive TV Game Show
Main booster task
 simple message aggregation
 limited real-time requirements
Data flow (diagram): 1. packet generation, 2. packet interception, 3. packet aggregation, 4. packet forwarding

37 INF5063, Carsten Griwodz & Pål Halvorsen University of Oslo Game with large virtual space (diagram: virtual space, regions handled by server 1 and server 2)
Main booster task
 dynamic server selection
−based on current in-game location
−requires application-specific processing
 high real-time requirements
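
A booster implementing such dynamic server selection might map in-game coordinates to the server owning that region of the virtual space, as in the C sketch below; the grid dimensions and server layout are invented for illustration.

```c
#include <stdint.h>

/* Illustrative region-to-server mapping for a large virtual space: the
 * booster reads a player's position from the packet (assumed parseable)
 * and forwards to the server owning that region.  All constants are
 * assumptions. */
#define REGIONS_X 4
#define REGIONS_Y 4
#define WORLD_W   4096.0f
#define WORLD_H   4096.0f

static const uint8_t region_server[REGIONS_Y][REGIONS_X] = {
    {1, 1, 2, 2},
    {1, 1, 2, 2},
    {1, 2, 2, 2},
    {1, 1, 2, 2},
};

uint8_t select_server(float x, float y)
{
    int rx = (int)(x / (WORLD_W / REGIONS_X));
    int ry = (int)(y / (WORLD_H / REGIONS_Y));
    if (rx < 0) rx = 0;
    if (rx >= REGIONS_X) rx = REGIONS_X - 1;   /* clamp to world bounds */
    if (ry < 0) ry = 0;
    if (ry >= REGIONS_Y) ry = REGIONS_Y - 1;
    return region_server[ry][rx];   /* forward packet to this server */
}
```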

38 INF5063, Carsten Griwodz & Pål Halvorsen University of Oslo Summary  Scalability −by application-specific knowledge −by network awareness  Main mechanisms −Caching on behalf of a server −Aggregation of events −Attenuation −Intelligent filtering −Application-level routing  Application of mechanism depends on −Workload −Real-time requirements

39 INF5063, Carsten Griwodz & Pål Halvorsen University of Oslo Auto-configuration and dynamic link metrics

40 Multimedia Examples INF5063: Programming Heterogeneous Multi-Core Processors

41 INF5063, Carsten Griwodz & Pål Halvorsen University of Oslo Multicast Video-Quality Adjustment

42 INF5063, Carsten Griwodz & Pål Halvorsen University of Oslo Multicast Video-Quality Adjustment

43 INF5063, Carsten Griwodz & Pål Halvorsen University of Oslo Multicast Video-Quality Adjustment IO hub memory hub CPU memory

44 INF5063, Carsten Griwodz & Pål Halvorsen University of Oslo Multicast Video-Quality Adjustment
 Several ways to do video-quality adjustment
−frame dropping
−re-quantization
−scalable video codecs
−…
 Yamada et al. 2002: use a low-pass filter to eliminate high-frequency components of the MPEG-2 video signal and thus reduce the data rate
−determine a low-pass parameter for each GOP
−use the low-pass parameter to calculate how many DCT coefficients to remove from each macroblock in a picture
−by eliminating the specified number of DCT coefficients, the video data rate is reduced
−implemented the low-pass filter on an IXP1200
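
The coefficient-dropping step can be pictured with a few lines of C: zero the trailing coefficients of each 8 x 8 block in zigzag (ascending-frequency) order, keeping only the lowest-frequency ones as dictated by the GOP's low-pass parameter. This is a schematic sketch, not the paper's implementation, which operates on the entropy-coded bitstream.

```c
/* Zero all but the 'keep' lowest-frequency coefficients of an 8x8 DCT
 * block, visiting coefficients in zigzag order.  Schematic only: the
 * real filter works on the variable-length-coded MPEG-2 bitstream, not
 * on decoded coefficient arrays. */
static const int zigzag[64] = {
     0,  1,  8, 16,  9,  2,  3, 10,
    17, 24, 32, 25, 18, 11,  4,  5,
    12, 19, 26, 33, 40, 48, 41, 34,
    27, 20, 13,  6,  7, 14, 21, 28,
    35, 42, 49, 56, 57, 50, 43, 36,
    29, 22, 15, 23, 30, 37, 44, 51,
    58, 59, 52, 45, 38, 31, 39, 46,
    53, 60, 61, 54, 47, 55, 62, 63
};

void lowpass_block(short coef[64], int keep)
{
    /* 'keep' (1..64) derives from the per-GOP low-pass parameter. */
    for (int i = keep; i < 64; i++)
        coef[zigzag[i]] = 0;    /* drop high-frequency components */
}
```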

45 INF5063, Carsten Griwodz & Pål Halvorsen University of Oslo Multicast Video-Quality Adjustment (Yamada et al. 2002)
 Low-pass filter on the IXP1200
−parallel execution on the 200 MHz StrongARM and the microengines
−24 MB DRAM devoted to the StrongARM only
−8 MB DRAM and 8 MB SRAM shared
−a test-filtering program on a regular PC determined the work distribution: 75% of the data belongs to the block layer, and 56% of the processing overhead is due to DCT handling
 Five-step algorithm:
1. StrongARM receives a packet and copies it to the shared memory area
2. StrongARM processes headers and generates macroblocks (in shared memory)
3. microengines read data and information from shared memory and perform quality adjustment on each block
4. StrongARM checks whether the last macroblock has been processed (if not, go to 2)
5. StrongARM rebuilds the packet

46 INF5063, Carsten Griwodz & Pål Halvorsen University of Oslo Multicast Video-Quality Adjustment (Yamada et al. 2002)
 Segmentation of MPEG-2 data
−slice = a 16-pixel-high stripe of the picture
−macroblock = a 16 x 16 pixel square: four 8 x 8 luminance blocks and two 8 x 8 chrominance blocks
−DCT-transformed, with coefficients sorted in ascending frequency order
 Data packetization for video filtering
−720 x 576 pixel frames at 30 fps → 36 slices per frame, each with 45 macroblocks
−each slice = one packet
−8 Mbps stream → roughly 7 kbit per packet (8 Mbps / 30 fps / 36 slices ≈ 7.4 kbit)

47 INF5063, Carsten Griwodz & Pål Halvorsen University of Oslo Multicast Video-Quality Adjustment (figure, Yamada et al. 2002)

48 INF5063, Carsten Griwodz & Pål Halvorsen University of Oslo Multicast Video-Quality Adjustment (figure, Yamada et al. 2002)

49 INF5063, Carsten Griwodz & Pål Halvorsen University of Oslo Multicast Video-Quality Adjustment (figure, Yamada et al. 2002)

50 INF5063, Carsten Griwodz & Pål Halvorsen University of Oslo Multicast Video-Quality Adjustment (figure, Yamada et al. 2002)

51 INF5063, Carsten Griwodz & Pål Halvorsen University of Oslo Multicast Video-Quality Adjustment (Yamada et al. 2002)
 Evaluation: three scenarios tested
−StrongARM only → 550 kbps
−StrongARM + 1 microengine → 350 kbps
−StrongARM + all microengines → 1350 kbps
−real-time transcoding achieved; the rates are not enough for practical purposes, but the workload distribution is instructive

52 Parallelism, Pipelining & Workload Partitioning INF5063: Programming Heterogeneous Multi-Core Processors

53 INF5063, Carsten Griwodz & Pål Halvorsen University of Oslo Divide and …
 Divide a problem into parts – but how?
−pipelining (diagram)
−parallelism (diagram)
−hybrid (diagram)

54 INF5063, Carsten Griwodz & Pål Halvorsen University of Oslo Key Considerations
 System topology
−processor capacities: different processors have different capabilities
−memory attachments: different memory types have different rates and access times; different memory banks have different access times
−interconnections: different interconnects/buses have different capabilities
 Requirements of the workload
−dependencies
 Parameters
−width of the pipeline (level of parallelism)
−depth of the pipeline (number of stages)
−number of jobs sharing buses

55 INF5063, Carsten Griwodz & Pål Halvorsen University of Oslo Network Processor Example
 Pipelining vs. multiprocessing, by Ning Weng & Tilman Wolf
−network processor example
−pipelined, parallel, and hybrid topologies are all possible
−packet processing scenario
−what is the performance of the different schemes, taking into account…
… processing dependencies
… processing demands
… contention on memory interfaces
… pipelining and parallelism effects (experimenting with the width and the depth of the pipeline)

56 INF5063, Carsten Griwodz & Pål Halvorsen University of Oslo Simulations  Several application examples in the paper giving different DAGs, e.g.,… −... flow classification: classify flows according to IP addresses and transport protocols  Measuring system throughput varying all the parameters −# processors in parallel (width) −# stages in the pipeline (depth) −# memory interfaces (busses) between each stage in the pipeline −memory access times

57 INF5063, Carsten Griwodz & Pål Halvorsen University of Oslo Results
 Setup: # memory interfaces per stage M = 1, memory service time S = 10
 Throughput increases with the pipeline depth D
−good scalability: proportional to the number of processors
 Throughput increases with the width W initially, but tails off for large W
−poor scalability due to contention on the memory channel
 Efficiency per processing engine…?
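
A back-of-the-envelope model reproduces the qualitative shape of these curves. Suppose each packet needs P cycles of processing and A memory accesses of service time S, and each stage has W parallel engines sharing M memory interfaces: throughput is then capped both by compute and by the shared memory channel. The formula and constants below are our illustration, not Weng & Wolf's exact model.

```c
#include <stdio.h>

/* Toy throughput model for one pipeline stage: W engines sharing M
 * memory interfaces.  Compute caps throughput at W/(P + A*S); the
 * memory channel caps it at M/(A*S).  Illustrative only. */
static double stage_throughput(int W, int M, double P, double A, double S)
{
    double compute_cap = W / (P + A * S);   /* packets per cycle */
    double memory_cap  = M / (A * S);
    return compute_cap < memory_cap ? compute_cap : memory_cap;
}

int main(void)
{
    /* Assumed workload: P = 50 cycles, A = 5 accesses, S = 10, M = 1. */
    for (int W = 1; W <= 8; W++)
        printf("W=%d  throughput=%.4f pkts/cycle\n",
               W, stage_throughput(W, 1, 50.0, 5.0, 10.0));
    /* Output flattens once W/(P+A*S) exceeds M/(A*S): with these
     * constants the channel saturates at W = 2, echoing the lesson that
     * about two processing elements per memory interface suffice. */
    return 0;
}
```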

58 INF5063, Carsten Griwodz & Pål Halvorsen University of Oslo Lessons learned…
 Memory contention can become a severe system bottleneck
−the memory interface saturates with about two processing elements per interface
−off-chip memory accesses cause a significant reduction in throughput and a drastic increase in queuing delay
−performance increases with more memory channels and lower access times
 Most NP applications are sequential in nature, which leads to highly pipelined NP topologies
 Processing tasks must be balanced to avoid slow pipeline stages
 Communication and synchronization are the main contributors to pipeline-stage time, next to memory access delay
 Topology has a significant impact on performance

59 INF5063, Carsten Griwodz & Pål Halvorsen University of Oslo Some References
1. Tatsuya Yamada, Naoki Wakamiya, Masayuki Murata, Hideo Miyahara: "Implementation and Evaluation of Video-Quality Adjustment for Heterogeneous Video Multicast", 8th Asia-Pacific Conference on Communications, Bandung, September 2002, pp. 454-457
2. Daniel Bauer, Sean Rooney, Paolo Scotton: "Network Infrastructure for Massively Distributed Games", NetGames, Braunschweig, Germany, April 2002
3. J. R. Allen, Jr., et al.: "IBM PowerNP network processor: hardware, software, and applications", IBM Journal of Research and Development, 47(2/3), pp. 177-193, March/May 2003
4. Ning Weng, Tilman Wolf: "Profiling and mapping of parallel workloads on network processors", ACM Symposium on Applied Computing (SAC 2005), pp. 890-896
5. Ning Weng, Tilman Wolf: "Analytic modeling of network processors for parallel workload mapping", ACM Transactions on Embedded Computing Systems, 8(3), 2009
6. Li Zhao, Yan Luo, Laxmi Bhuyan, Ravi Iyer: "SpliceNP: A TCP Splicer using a Network Processor", ANCS 2005
7. Håvard Espeland, Carl Henrik Lunde, Håkon Stensland, Carsten Griwodz, Pål Halvorsen: "Transparent Protocol Translation for Streaming", ACM Multimedia 2007

60 INF5063, Carsten Griwodz & Pål Halvorsen University of Oslo Summary  TODO

