
Slide 1: von Eicken et al, "Active Messages: a Mechanism for Integrated Communication and Computation" – CS258 lecture by Dan Bonachea


Slide 2: Motivation for AM (review)
How do we make parallel programs fast?
Minimize communication overhead
Overlap communication & computation (shoot for 100% utilization of all resources)
Consider the entire program
–Communication
–Computation
–Interactions between the two

Slide 3: Message-Driven Architectures
Research systems – J-Machine/MDP, Monsoon, etc.
–Defining quality: all significant computation happens within the context of a handler
–Computational model is basically dataflow programming
 »Supports languages with dynamic parallelism, e.g. MultiLISP
–Interesting note: about 1/3 of all handlers in the J-Machine end up blocking and get swapped out by software
Pros:
–Low-overhead communication – a reaction to the lousy performance of the send/recv model traditionally used in message-passing systems
–Tight integration with the network – directly "execute" messages
Cons:
–Typically need hardware support in the NIC to achieve good performance – need more sophisticated buffering & scheduling
–Poor locality of computation => small register sets and degraded raw computational performance (bad cache locality)
–Poor cost/performance ratio, hard to program(?)
–Number of handlers waiting to run at a given time is determined by the excess parallelism in the application, not the arrival rate of messages

Slide 4: Message-Passing Architectures
Commercial systems – nCUBE, CM-5
–Defining feature: all significant computation happens in a dedicated computational thread => good locality and performance
Traditional programming model is blocking, matched send/recv (implemented as a 3-phase rendezvous)
–Inherently a poor programming model for the lowest level:
 »Doesn't match the semantics of the NIC, and performance gets lost in the translation
 »Doesn't allow for overlap without expensive buffering
There's no compelling reason to keep this model as our lowest-level network interface, even for this architecture
–Sometimes easier to program, but we want the lowest-overhead interface possible at the NIC level
–Can easily provide a send/recv abstraction on top of a more efficient interface
–No way to recapture lost performance if the lowest-level interface is slow
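The layering claim above can be made concrete with a toy sketch (not from the paper): a blocking, tag-matched recv is built on top of an asynchronous active-message primitive whose handler merely deposits the payload into a mailbox. All names (`am_request`, `deposit_handler`) are illustrative, and delivery is simulated by a direct in-process call in place of real hardware.

```python
# Toy sketch: layering blocking send/recv atop an AM-style primitive.
# Delivery is simulated in-process; names are hypothetical.

mailboxes = {}  # tag -> list of payloads deposited by handlers

def deposit_handler(tag, payload):
    # Runs "on message arrival": minimal work, no blocking.
    mailboxes.setdefault(tag, []).append(payload)

def am_request(handler, *args):
    # Stand-in for injecting a message into the network.
    handler(*args)

def send(tag, payload):
    am_request(deposit_handler, tag, payload)

def recv(tag):
    # A real implementation would poll the network while waiting.
    while not mailboxes.get(tag):
        pass
    return mailboxes[tag].pop(0)

send(42, "hello")
print(recv(42))  # -> hello
```

The point of the sketch is the slide's last bullet: send/recv is easy to recover on top of a fast primitive, whereas the reverse direction cannot recapture lost performance.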

Slide 5: Active Messages – a new "mechanism"
Main idea: take the best features of the message-driven model and unify them with the capabilities of message-passing hardware
–Get the same or better performance as message-driven systems with little or no special-purpose hardware
–Fix the mismatch between the low-level software interface and hardware capabilities that cripples performance
 »Eliminate all buffering not required by the transport
 »Expose out-of-order, asynchronous delivery
–Need to restrict the allowable behavior of handlers somewhat to make this possible

Slide 6: Active Messages – Handlers
User-provided handlers that "execute" messages
–Handlers run immediately upon message arrival
–Handlers run quickly and to completion (no blocking)
–Handlers run atomically with respect to each other
–These restrictions make it possible to implement handlers with no buffering on simple message-passing hardware
The purpose of AM handlers:
–Quickly extract a message from the network and "integrate" the data into the running computation in an application-specific way, with a small amount of work
–Handlers do NOT perform significant computation themselves
 »Only the minimum functionality required to communicate
 »This is the crucial difference between AM and the message-driven model
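A minimal sketch of the handler discipline described above, with delivery again simulated by a direct call: the handler only extracts the message and integrates its data into the running computation (here, accumulating into a local array); it does no significant computation and never blocks. The names are hypothetical.

```python
# Sketch of an AM handler that "integrates" data, not computes.
# Delivery is a direct call standing in for message-passing hardware.

result = [0.0] * 4  # state owned by the running computation

def accumulate_handler(index, value):
    # Small, non-blocking, runs to completion: just integrate the data.
    result[index] += value

def am_send(dest_handler, *args):
    # Stand-in for injecting a message into the network.
    dest_handler(*args)

am_send(accumulate_handler, 2, 3.5)
am_send(accumulate_handler, 2, 1.5)
print(result)  # -> [0.0, 0.0, 5.0, 0.0]
```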

Slide 7: Active Messages – Handlers (cont.)
Miscellaneous restrictions:
–Communication is strictly request-reply (ensures acyclic protocol dependencies)
–Prevents deadlock with strictly bounded buffer space (assuming 2 virtual networks are available)
Still powerful enough to implement most if not all communication paradigms
–Shared memory, message-passing, message-driven, etc.
AM is especially useful as a compilation target for higher-level languages (Split-C, Titanium, etc.)
–Acceptable to trade off programmability and possibly some protection to maximize performance
–Code is often generated by a compiler anyhow, so guarding against naïve users is less critical
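The request-reply restriction can be illustrated with a hypothetical fetch protocol: a request handler does minimal work and issues at most one reply, and the reply handler communicates no further, so the dependency chain is acyclic by construction. All names are illustrative and delivery is simulated by a call.

```python
# Sketch of the strict request-reply discipline: request handlers may
# reply exactly once; reply handlers may not communicate at all.

log = []  # results integrated on the requesting "node"

def fetch_request_handler(key, table):
    value = table.get(key)                      # minimal work
    am_reply(fetch_reply_handler, key, value)   # exactly one reply

def fetch_reply_handler(key, value):
    log.append((key, value))                    # integrate; sends NOTHING

def am_request(handler, *args):
    handler(*args)  # stand-in for the request virtual network

def am_reply(handler, *args):
    handler(*args)  # stand-in for the reply virtual network

am_request(fetch_request_handler, "x", {"x": 10})
print(log)  # -> [('x', 10)]
```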

Slide 8: Proof of Concept – Split-C
Split-C: an explicitly parallel, SPMD version of C
–Global address space abstraction, with a visible local/remote distinction
–Split-phase, one-sided (asynchronous) remote memory operations
–Sender executes a put or get, then a sync on a local counter for completion of 1 or more ops
–User/compiler explicitly specifies prefetching to get overlap
Write in a shared-memory style, but remote operations are explicit
–The local/global distinction is important for high performance, so expose it to the user
–Can also implement arbitrarily generalized data transfers (scatter-gather, strided)
Important points:
–AM can efficiently provide a global memory space on existing message-passing systems in software, using the right model
–Evolutionary change rather than revolutionary (keep the architecture)
–Works very well for coarse-grained SPMD apps
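The split-phase get with a completion counter can be sketched as follows (a simulation, not Split-C itself): `get` bumps an outstanding-op counter and fires a request; the reply handler deposits the data and decrements the counter; `sync` waits for zero. The window between `get` and `sync` is where communication overlaps computation. All names are hypothetical and delivery is an in-process call.

```python
# Sketch of a Split-C-style split-phase get built on request/reply AM.

outstanding = 0                  # local counter of in-flight ops
remote_mem = {"a": 7, "b": 9}    # pretend this lives on another node
local = {}                       # destination of the gets

def get_reply_handler(dst, value):
    global outstanding
    local[dst] = value           # integrate the data
    outstanding -= 1             # signal completion via the counter

def get_request_handler(src, dst):
    am_reply(get_reply_handler, dst, remote_mem[src])

def am_request(h, *a): h(*a)     # stand-ins for the network
def am_reply(h, *a): h(*a)

def get(src, dst):
    global outstanding
    outstanding += 1
    am_request(get_request_handler, src, dst)
    # ...caller overlaps computation here before calling sync()...

def sync():
    while outstanding:
        pass                     # a real implementation polls the network

get("a", "x")
get("b", "y")
sync()
print(local)  # -> {'x': 7, 'y': 9}
```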

Slide 9: Results
Dramatic reduction in latency on commercial message-passing machines with NO additional hardware
–nCUBE/2:
 »AM send/handle: 11us/15us overhead
 »Blocking message send/recv: 160us overhead
–CM-5:
 »AM: <2us overhead
 »Blocking message send/recv: 86us overhead
About an order of magnitude improvement with no hardware investment

Slide 10: Optional Hardware/Kernel Support for AM
DMA transfer support => large messages
Registers on the NIC for composing messages
–General registers, not FIFOs – allow message reuse
–Ability to compose a request & reply simultaneously
Fast user-level interrupts
–Allow fully user-level interrupts (trap directly to the handler)
–PC injection is one way to do this
–Any protection mechanisms required for the kernel to allow user-level NIC interrupts
Support for efficient polling

Slide 11: Problems with the AM-1 Paper
Handler atomicity wrt. the main computation
–Addressed in von Eicken's thesis
–Solutions:
 »Atomic instructions
 »A mechanism to temporarily disable NIC interrupts using a memory flag or a reserved register
Described as an abstract mechanism, not a solid portable spec
Little support for recv protection, multi-threading, CLUMPs, abstract naming, etc.
AM-2 fixes the above problems
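The memory-flag solution mentioned above can be sketched as follows: the main computation raises a flag around its critical section; the delivery path checks the flag and defers the handler to a queue, which is drained when the flag is lowered. This is a software illustration only; in a real system the check happens on the NIC-interrupt path, and the names here are hypothetical.

```python
# Sketch of handler atomicity wrt. the main computation via a memory flag.

handlers_disabled = False
deferred = []                      # handlers that arrived during a critical section

def deliver(handler, *args):
    if handlers_disabled:
        deferred.append((handler, args))   # cheap software deferral
    else:
        handler(*args)                     # normal immediate execution

def enable_handlers():
    global handlers_disabled
    handlers_disabled = False
    while deferred:                        # drain anything that arrived
        h, a = deferred.pop(0)
        h(*a)

counter = [0]
def incr_handler(n):
    counter[0] += n

handlers_disabled = True           # begin atomic section in main computation
deliver(incr_handler, 5)           # "arrives" mid-section: deferred, not run
assert counter[0] == 0
enable_handlers()                  # end atomic section; deferred handler runs
assert counter[0] == 5
```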

Slide 12: GAM & Active Messages-2
Done at Berkeley by Mainwaring, Culler, et al.
Standardized the API & generalized it somewhat
Adds support missing in AM-1 for:
–Multiple logical endpoints per application (modularity, multi-threading, multi-NIC)
–Non-SPMD configurations
–Recv-side protection mechanisms to catch non-malicious bugs (tags)
–Multi-threaded applications
–A level of indirection on handlers for non-aligned memory spaces (heterogeneous systems)
–Fault-tolerance support for congestion, node failure, etc. (return to sender)
–Opaque endpoint naming (client code portability, transparent multi-protocol implementations)
–Polling may happen implicitly on all calls, so explicit polls are rarely required
–Enforces strict request/reply – eases implementation on some systems (HPAM)

Slide 13: Influence of Active Messages
Many implementations of AM in some form
–Natively on NICs: Myrinet (NOW project), VIA (Buonadonna & Begel), HP Medusa (Richard Martin), Intel Paragon (Liu), Meiko CS-2 (Schauser)
–On other transports: TCP (Liu and Mainwaring), UDP (Bonachea), MPI (Bonachea), LAPI (Yau & Welcome)
–Other interesting work: multi-protocol AM (shared memory & network for CLUMPs) (Lumetta)
Used as a compilation target for many parallel languages/systems:
–Split-C, Id90/TAM, Titanium, PVM, UPC, MPI…
Influenced the design of important systems
–E.g. the IBM SP supercomputer: LAPI, a low-level messaging layer that is basically AM

