Frame Shared Memory: Line-Rate Networking on Commodity Hardware


1 Frame Shared Memory: Line-Rate Networking on Commodity Hardware
John Giacomoni, John K. Bennett, Douglas C. Sicker, and Manish Vachharajani (University of Colorado at Boulder); Alexander L. Wolf (Imperial College London); Antonio Carzaniga (University of Lugano)

2 Problem Description
Link        Mbps        fps          ns/frame
T-1              1.5        2,941      340,000
T-3             45.0       90,909       11,000
OC-3           155.0      333,333        3,000
OC-12          622.0    1,219,512          820
GigE         1,000.0    1,488,095          672
OC-48        2,500.0    5,000,000          200
10 GigE     10,000.0   14,925,373           67
OC-192       9,500.0   19,697,843           51
In the past things were simpler because frame rates were low. We focus on GigE, the most common LAN link type. With IEEE 802.3ba bringing 40 and 100 GigE, the time budgets only get tighter. How do we route? How do we protect? How do we correlate? (A worked example of the GigE per-frame budget follows the table.)
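Where the 672 ns GigE figure comes from (my own back-of-the-envelope illustration, not from the slides): a minimum 64 B Ethernet frame also occupies 8 B of preamble and a 12 B interframe gap on the wire, so each frame takes 84 B = 672 bits, which at 1 Gb/s is 672 ns, about 1.49 Mfps. A tiny C sketch of the same arithmetic:

    #include <stdio.h>

    /* Illustrative sketch only: per-frame time budget for minimum-size
     * Ethernet frames at a given link rate. */
    int main(void)
    {
        const double link_bps       = 1e9;        /* GigE */
        const double frame_bytes    = 64.0;       /* minimum Ethernet frame */
        const double overhead_bytes = 8.0 + 12.0; /* preamble + interframe gap */

        double bits_on_wire = (frame_bytes + overhead_bytes) * 8.0; /* 672 bits */
        double ns_per_frame = bits_on_wire / link_bps * 1e9;        /* 672 ns */
        double fps          = link_bps / bits_on_wire;              /* ~1.49 Mfps */

        printf("%.0f ns/frame, %.0f frames/s\n", ns_per_frame, fps);
        return 0;
    }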

3 ASIC Solutions
(Same link-rate table as Slide 2.) ASICs are expensive to design but cheap to produce in bulk; the MIT RAW tiles are a nice example. How do we route? How do we protect? How do we correlate?

4 Programmable Network Processors
Lower design cost than ASICs. Notice that the design is less complicated than a general-purpose processor, but every component is exposed and must be managed by the programmer. Still a powerful technique: CloudShield built Carnivore for OC-48 using a large array of Intel IXP1200s (the Cyrus story). Still not future-proof, and unit costs do not scale as well as ASICs for large-volume products. Intel® IXP2855 (pictured).

5 :( I’ve painted a rather bleak picture, huh? ;)

6 Multicore Systems
General-purpose (GPP) multicore systems. Individual cores are less powerful than a uniprocessor, but we get tens, hundreds, eventually thousands of cores with full OS and library support. This is the convergence of general-purpose and embedded processors: special-purpose instruction sets have a long history, asymmetric processors have been considered in the past (Alpha), and heterogeneous processors are being considered now (AMD, Intel). Pictured: Intel (2x2-core), MIT RAW (16-core), and hypothetical 100-core and 400-core parts.

7 Moore’s Corollary vs. Moore’s Law
Multicore is not a flash in the pan: we have fallen off Moore's corollary. (Graph: SPEC benchmark suite performance, predicted vs. actual. Courtesy Tipp Moseley.)

8 Soft Network Processing (Soft-NP)
The idea: replace the switch with a commodity machine. Note that replacing full switches is likely to remain the domain of very specialized systems. The question is how we get the necessary performance.

9 Soft-NP Technique: Frame Generation
Pipeline stages: Generate (Gen), Application (App), Output (OP). Examine a frame-generation pipeline; other configurations follow the same pattern. With full pipeline overlap we get 3x the processing time per frame with no loss of throughput. Communication is critical: if the handoff cost is too high, we lose everything. (A minimal sketch of such a pipeline follows.)
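To make the stage structure concrete, here is a minimal sketch (my own illustration, not the paper's code) of a three-stage pipeline in C: one thread per stage, connected by single-slot handoffs. The handoff here is a trivial one-entry mailbox built on C11 atomics, purely to show the shape of the design; FShm's actual queues are FastForward (Slide 17).

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    #define FRAMES 1000

    /* One-entry handoff between adjacent stages; illustration only.
     * NULL means "empty". */
    typedef struct { _Atomic(long *) slot; } link_t;

    static void put(link_t *l, long *f) {
        long *expected = NULL;
        /* spin until the consumer has drained the slot, then publish the frame */
        while (!atomic_compare_exchange_weak(&l->slot, &expected, f))
            expected = NULL;
    }

    static long *get(link_t *l) {
        long *f;
        while ((f = atomic_exchange(&l->slot, (long *)NULL)) == NULL)
            ;                               /* spin until a frame arrives */
        return f;
    }

    static link_t gen_to_app, app_to_op;
    static long frames[FRAMES];             /* stand-in for frame buffers */

    static void *gen_stage(void *arg) {     /* Gen: produce frame descriptors */
        for (long i = 0; i < FRAMES; i++) put(&gen_to_app, &frames[i]);
        return NULL;
    }
    static void *app_stage(void *arg) {     /* App: per-frame work, pass it on */
        for (long i = 0; i < FRAMES; i++) {
            long *f = get(&gen_to_app);
            *f = i;                         /* "application" work */
            put(&app_to_op, f);
        }
        return NULL;
    }
    static void *op_stage(void *arg) {      /* OP: hand frames to the output NIC */
        for (long i = 0; i < FRAMES; i++) (void)get(&app_to_op);
        return NULL;
    }

    int main(void) {
        pthread_t g, a, o;
        pthread_create(&g, NULL, gen_stage, NULL);
        pthread_create(&a, NULL, app_stage, NULL);
        pthread_create(&o, NULL, op_stage, NULL);
        pthread_join(g, NULL); pthread_join(a, NULL); pthread_join(o, NULL);
        puts("pipeline drained");
        return 0;
    }

Each stage runs on its own core, so a frame can spend up to three stage-times in flight while throughput stays at one frame per stage-time, which is the 3x-processing-time point above.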

10 AMD Opteron System Overview
What does a modern commodity platform look like? Multicore, multiprocessor, with lots of processing power per core, a NUMA memory system, and simple network interface devices.

11 Data Flow: Frame Generation
Stages: Gen, App, OP, plus the OS. Assumption: communication is via shared memory instead of specialized hardware. Let's look at how data moves about the system to better understand the communication problem. Notice that all communication goes through the memory subsystem, and in particular the caches.

12 Communication Overhead

13 Communication Overhead
Locks: ~200 ns. Standard lock-based handoffs are expensive: they leave at most about 40% of the frame processing time available per stage for 64 B frames at GigE (presumably because each stage pays for an inbound and an outbound handoff, roughly 2 × 200 ns = 400 ns of the 672 ns budget, leaving about 270 ns).

14 Communication Overhead
Hardware: ~10 ns, Locks: ~200 ns. Hardware queues can give us fantastic performance, but they aren't prevalent and are expensive to implement, and multicore systems may be problematic for hardware queues.

15 Communication Overhead
Hardware: ~10 ns, Lamport: ~160 ns, Locks: ~200 ns. Lamport's queue gives us better performance, but not really good enough, and we can do better. (A sketch of Lamport's queue follows.)
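For reference, a minimal sketch of Lamport's classic single-producer/single-consumer queue (my illustration of the standard formulation, not code from the slides). In real use head and tail must be volatile or atomic, and the algorithm is proven correct only under sequential consistency, which is exactly the limitation the FastForward slide calls out; both endpoints also read the other side's index, so the head/tail cache lines bounce between the two cores.

    #include <errno.h>
    #include <stddef.h>

    #define SLOTS 256
    #define NEXT(i) (((i) + 1) % SLOTS)

    static void  *buffer[SLOTS];
    static size_t head;                    /* advanced by the producer core */
    static size_t tail;                    /* advanced by the consumer core */

    int enqueue_lamport(void *data)        /* producer side */
    {
        if (NEXT(head) == tail)            /* must read the consumer's tail */
            return EWOULDBLOCK;            /* full */
        buffer[head] = data;
        head = NEXT(head);
        return 0;
    }

    int dequeue_lamport(void **data)       /* consumer side */
    {
        if (tail == head)                  /* must read the producer's head */
            return EWOULDBLOCK;            /* empty */
        *data = buffer[tail];
        tail = NEXT(tail);
        return 0;
    }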

16 Communication Overhead
Hardware: ~10 ns, FastForward: ~28 ns, Lamport: ~160 ns, Locks: ~200 ns. FastForward is much better.

17 FastForward
enqueue(data) {
    lock(queue);
    if (NEXT(head) == tail) {
        unlock(queue);
        return EWOULDBLOCK;
    }
    buffer[head] = data;
    head = NEXT(head);
    unlock(queue);
    return 0;
}

enqueue_fastforward(data) {
    if (NULL != buffer[head]) {
        return EWOULDBLOCK;
    }
    buffer[head] = data;
    head = NEXT(head);
    return 0;
}

Lamport's queue does not work on many modern machines with weak memory models; it is proven correct only under sequential consistency. The critical details are discussed in this paper; for more, see our upcoming PPoPP paper, which includes detailed performance characteristics of the queues and a proof of correctness. FastForward provides cache-optimized concurrent lock-free (CLF) queues, works with strong through weak consistency models, and hides die-to-die communication. (The matching dequeue is sketched below.) Giacomoni, Moseley, and Vachharajani. "FastForward for Efficient Pipeline Parallelism: A Cache-Optimized Concurrent Lock-Free Queue." To appear: Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), February 2008.
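The slide shows only the enqueue side; here is the consumer side as I understand the FastForward design (a sketch in the same pseudocode style, not the authors' exact code). The consumer tests the slot it owns rather than a shared index, and writes NULL back to mark the slot empty, so producer and consumer never read each other's index variables.

dequeue_fastforward(datap) {
    if (NULL == buffer[tail]) {       /* slot not yet filled by the producer */
        return EWOULDBLOCK;           /* empty */
    }
    *datap = buffer[tail];
    buffer[tail] = NULL;              /* mark the slot empty for the producer */
    tail = NEXT(tail);
    return 0;
}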

18 Frame Shared Memory (FShm)
A pure software stack communicating via shared memory, abstracted at the driver/NIC boundary. Cross-domain modules (kernel/process, thread/thread, process/process, kernel/kernel). Compatible with existing OS, library, and language services, and able to communicate with any device on the memory interconnect. Notice how this makes upgrading between platform revisions relatively painless. FastForward hides the core-to-core communication.

19 FShm Driver API
struct ifdirect {
    void         (*if_direct_tick)          (void *softc);
    void         (*if_direct_attach)        (struct ifnet *, void *);
    void         (*if_direct_detach)        (struct ifnet *, void *);
    int          (*if_direct_tx)            (void *softc, struct mbuf *txbuf);
    void         (*if_direct_tx_post)       (void *softc);
    void         (*if_direct_tx_clean_pre)  (void *softc);
    struct mbuf *(*if_direct_tx_clean)      (void *softc);
    void         (*if_direct_tx_clean_post) (void *softc);
    void         (*if_direct_rx_pre)        (void *softc);
    struct mbuf *(*if_direct_rx)            (void *, struct mbuf *new_rxbuf);
    void         (*if_direct_rx_post)       (void *softc);
};
A straightforward driver abstraction, similar to the one used in Linux. (A hypothetical example of a driver filling in this structure follows.)
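As a purely hypothetical illustration of how a NIC driver might populate this structure: the callback names below (em_direct_tx, em_direct_rx, em_ifdirect) are invented for the example, and only struct ifdirect itself comes from the slide. The sketch assumes the struct ifdirect definition above and the usual BSD struct ifnet / struct mbuf declarations.

    /* Hypothetical driver glue; names are illustrative only. */
    static int em_direct_tx(void *softc, struct mbuf *txbuf)
    {
        /* place txbuf on the hardware TX ring owned by this softc */
        return 0;
    }

    static struct mbuf *em_direct_rx(void *softc, struct mbuf *new_rxbuf)
    {
        /* swap new_rxbuf into the RX ring; return the received frame, or NULL */
        return NULL;
    }

    static struct ifdirect em_ifdirect = {
        .if_direct_tx = em_direct_tx,
        .if_direct_rx = em_direct_rx,
        /* remaining hooks (tick, attach/detach, pre/post, tx_clean)
         * would be filled in similarly */
    };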

20 FShm Evaluation Methodology
Evaluation platform: AMD Opteron, 2.0 GHz, dual-processor and dual-core. We compute the average time per call using the TSC. (A sketch of the timing approach follows.)
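A minimal sketch of TSC-based timing, assuming x86 and GCC/Clang's __rdtsc intrinsic (my own illustration; the slides only say that the TSC was used to compute the average time per call):

    #include <x86intrin.h>   /* __rdtsc() */
    #include <stdio.h>

    #define ITERS 1000000

    int main(void)
    {
        unsigned long long start = __rdtsc();
        for (long i = 0; i < ITERS; i++) {
            /* call under test goes here, e.g. an enqueue/dequeue pair */
            __asm__ volatile("" ::: "memory");  /* keep the loop from being optimized away */
        }
        unsigned long long cycles = __rdtsc() - start;
        /* at 2.0 GHz, one cycle is 0.5 ns */
        printf("%.1f cycles/call (~%.1f ns at 2.0 GHz)\n",
               (double)cycles / ITERS, (double)cycles / ITERS * 0.5);
        return 0;
    }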

21 Frame Generation Data Flow
Stages: Gen, App, OP, plus the OS. A reminder of what the frame-generation setup looks like.

22 FShm Generate (Linux pktgen)
The send stage is marked as zero time because no application code should execute once the destination NIC has been targeted. A limitation of the evaluation hardware platform, probably PCI-X, prevents frame sizes below 74 B from reaching the theoretical maximum. 64 B*: ~1.36 Mfps.

23 FShm Capture (IDS): 64 B*: ~1.36 Mfps

24 FShm Forward (Bridge): 64 B*: ~1.36 Mfps

25 FShm’s Future
Hardware: ~10 ns, FastForward: ~28 ns, Lamport: ~160 ns, Locks: ~200 ns. How will FShm scale to faster networks? OC-48 allows 200 ns per stage including communication time, leaving roughly 120 ns for application work, which is sufficient for many fast-path applications. Improved performance can be expected, since the 120 ns figure assumes processors, memory, and interconnects remain at today's speeds. Finally, two points: 1) these numbers are for pipelines; data-parallel techniques plus additional processors will let FShm scale stage length, as is done today; 2) we are currently investigating improving performance with cache-forwarding techniques for payload data and shared application state.

26 Questions?

