AMLAPI: Active Messages over Low-level Application Programming Interface Simon Yau, Tyson Condie,
Background AM is a low-level communication architecture for high-performance parallel computing. LAPI is IBM’s version of AM. The two APIs are very similar, so programs written for an AM platform should be able to run on LAPI. The AMLAPI layer emulates AM using LAPI.
Similarities Both are low-level, message-passing style architectures. Both use active messages: –One node initiates an active message. –The receiving node executes a handler when the active message arrives.
Differences AM virtualizes the network interface with endpoints and bundles, allowing multiple threads at each endpoint. AM requires handlers to execute in the context of the application program; LAPI handlers execute in the context of the polling thread. LAPI separates handlers into header handlers and completion handlers. LAPI uses counters for synchronization, which guarantees that handlers have executed; AM guarantees that the network has accepted the data.
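AM & LAPI Execution Model [Figure: sender and receiver timelines. AM Execution: the sender sends a message and continues working; the receiver gets the message and executes the handler (and sends a reply). LAPI Execution: the sender sends a message, continues working, then sends the footer; the receiver polls, gets the message, executes the header handler, gets the footer, and executes the footer handler.]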
To Emulate AM on LAPI Emulate endpoints and bundles: –Maintain a list of endpoints per box. –Each endpoint is represented by the box id and its position in the list. Associate each endpoint bundle with a task queue: –An AM is sent with a LAPI call that schedules a task on the queue at the remote end.
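The naming scheme above can be sketched in a few lines of Python. This is only an illustration of the idea, not the AMLAPI source: `Box`, `Bundle`, `EndpointName`, `create_endpoint` and `bundle_of` are hypothetical names, and no real AM or LAPI calls are used.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass(frozen=True)
class EndpointName:
    """An endpoint is identified by its box id and its position in the box's list."""
    box_id: int
    index: int


@dataclass
class Bundle:
    """A bundle of endpoints sharing one task queue (hypothetical structure)."""
    tasks: deque = field(default_factory=deque)


class Box:
    """One machine ("box") holding a list of endpoints."""

    def __init__(self, box_id):
        self.box_id = box_id
        self.endpoints = []  # entry i is the bundle that owns endpoint i

    def create_endpoint(self, bundle):
        # The new endpoint's name is (box id, position in the list).
        self.endpoints.append(bundle)
        return EndpointName(self.box_id, len(self.endpoints) - 1)

    def bundle_of(self, name):
        # Look up the bundle (and so the task queue) for a local endpoint.
        if name.box_id != self.box_id:
            raise ValueError("endpoint lives on another box")
        return self.endpoints[name.index]
```

For example, two endpoints created on box 3 in the same bundle would be named `(3, 0)` and `(3, 1)`, and both would resolve to the same task queue.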
Design Sending an AM: –Package a LAPI message and send it to the receiving node. –At the receiving node, demultiplex the message to the appropriate endpoint and put the associated function pointer, with its arguments, onto the task queue. Receiving an AM: –When the user polls, check the task queue and execute a task from it. –Execute only one task per poll, since we do not want the user thread to spend too much time in the handler.
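A minimal sketch of this send/poll design, assuming a single bundle (one task queue) per node and an in-process function call standing in for the LAPI transfer. `Node`, `Endpoint`, `deliver`, `poll` and `send_am` are hypothetical names chosen for illustration.

```python
from collections import deque


class Endpoint:
    def __init__(self, handlers):
        # Handler table: handler index -> function (assumed layout).
        self.handlers = handlers


class Node:
    def __init__(self):
        self.endpoints = []        # endpoints hosted on this node
        self.task_queue = deque()  # one queue, i.e. one bundle, per node (assumption)

    def deliver(self, message):
        """Receiving side: demultiplex to the endpoint and queue the handler."""
        ep_index, handler_index, args = message
        endpoint = self.endpoints[ep_index]
        handler = endpoint.handlers[handler_index]
        self.task_queue.append((handler, args))  # queued, not executed yet

    def poll(self):
        """Run at most one queued task; return True if one was executed."""
        if not self.task_queue:
            return False
        handler, args = self.task_queue.popleft()
        handler(*args)
        return True


def send_am(dest_node, ep_index, handler_index, *args):
    """Sending side: package the request and hand it to the transport."""
    message = (ep_index, handler_index, args)
    dest_node.deliver(message)  # stand-in for the real network transfer
```

Note that `deliver` never runs the handler itself. The handler executes only inside `poll`, in the caller's thread, which is how the sketch keeps handlers in the application's context.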
Picture [Figure: sender and receiver timelines for an AM carried over LAPI, with the labels Send Msg, Get Msg, Header Handler, Send Footer, Get Footer, Footer Handler, Poll and Execute Handler.]
1. Sender executes AM_Send.
2. Sender piggybacks information about the AM call and executes LAPI_Send.
3. Network ships the message to the receiver.
4. Receiver’s network gets the request message and causes the polling thread to execute the header handler.
5. Header handler allocates the buffer space into which the message is copied.
6. LAPI copies the data into the buffer and calls the footer handler.
7. Footer handler posts the AM handler, with its arguments and AM information, on the queue of the destination endpoint.
8. When the user application polls, it pulls the handler out of the task queue and executes it.
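The eight steps can be traced in a small single-process simulation. This is a sketch under stated assumptions, not LAPI code: `transport_deliver` stands in for the network and the polling thread, the header is a plain dictionary, only one destination endpoint exists, and every function name here is hypothetical.

```python
from collections import deque

task_queue = deque()  # queue of the (single, assumed) destination endpoint
am_handlers = {}      # handler id -> user-level AM handler


def header_handler(header):
    """Step 5: allocate a buffer for the payload and name the footer handler."""
    buffer = bytearray(header["length"])
    return buffer, footer_handler


def footer_handler(header, buffer):
    """Step 7: post the AM handler, its arguments and the data on the queue."""
    handler = am_handlers[header["handler_id"]]
    task_queue.append((handler, header["args"], bytes(buffer)))


def transport_deliver(header, payload):
    """Steps 3-6: stand-in for the network and the receiver's polling thread."""
    buffer, on_complete = header_handler(header)  # step 4 triggers step 5
    buffer[:] = payload                           # step 6: copy data into the buffer
    on_complete(header, buffer)                   # step 6: call the footer handler


def am_send(handler_id, args, payload):
    """Steps 1-2: piggyback the AM information on the outgoing message."""
    header = {"handler_id": handler_id, "args": args, "length": len(payload)}
    transport_deliver(header, payload)


def am_poll():
    """Step 8: the application thread pulls one handler off the queue and runs it."""
    if task_queue:
        handler, args, data = task_queue.popleft()
        handler(*args, data)
```

The user handler does not run when the message arrives; it runs only when the application calls `am_poll`. The copy in `transport_deliver` corresponds to the buffer copy in step 6.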
Evaluation Platform: SP3 Interconnect: –Advertised bandwidth = 350 MB/s –Advertised latency = ~17 microseconds. SMPs: –8 X Power3 processor SMPs –4 GB of memory per node Processor: –Superscalar, pipelined 64-bit RISC. –8 instructions per clock at 375 MHz. –64 KB L1 cache, 8 MB L2 cache OS: –AIX with IBM Parallel Environment.
Micro Benchmarks: AMLAPI round trip latency: 473 us LAPI round trip latency: 32 us
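To put the two numbers side by side, the short calculation below derives the added latency and the slowdown factor. It assumes the 473 us figure is the round trip through the AMLAPI layer and the 32 us figure is raw LAPI; the variable names are illustrative.

```python
# Round-trip latencies from the slide, in microseconds.
amlapi_rtt_us = 473.0  # assumed to be the round trip through the AMLAPI layer
lapi_rtt_us = 32.0     # raw LAPI round trip

overhead_us = amlapi_rtt_us - lapi_rtt_us  # latency added by the emulation layer
slowdown = amlapi_rtt_us / lapi_rtt_us     # how many times slower than raw LAPI

print(f"added latency: {overhead_us:.0f} us, slowdown: {slowdown:.1f}x")
```

Under that assumption the emulation layer adds 441 us per round trip, making it roughly 15 times slower than raw LAPI.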
Explanation Copying data from the message buffer to an endpoint’s VM segment accounts for the bulk of the overhead. Context switching and packing the AM information account for the rest. Since each SP3 node is an SMP, the LAPI threads and the application thread run on different processors. Moving data from the LAPI thread’s processor requires invalidating the cache of the processor on which the LAPI thread runs.
Conclusion Using low-level glue-ware is a viable option for making programs portable if the communication layers match. Future work: –Macro benchmarks –Improve short-message latency using the header handler –“Zero copy” to endpoint VM; make the AM handler run in the LAPI context