1 Data Services and Trilinos: A Brief Introduction to Trios Data Services
Ron Oldfield, Sandia National Laboratories
2011 Trilinos User Group Meeting, Nov 2011
Approved for Public Release: SAND2011-8379P
Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.

2 I/O Challenges for Exascale
Storage systems are the slowest, most fragile part of an HPC system.
Current usage models are not appropriate for Petascale, much less Exascale:
– Checkpoints are a huge concern for I/O; they are currently the primary focus of the FS
– The application workflow uses storage as a communication conduit: simulate, store, analyze, store, refine, store, … (most of the data is transient)
– High-level I/O libraries (e.g., HDF5, netCDF) have high overheads
Trios Data Services to the rescue!
1. Reduce the “effective” I/O cost through data staging
2. Reduce the amount of data written to storage (integrated analysis, data services)
Nothing comes for free:
– We use additional compute and memory resources
– Data services introduce issues with resilience (we’re addressing this)

3 Trios Data Services: I/O Software to Reduce I/O
Approach
– Leverage available compute/service node resources for I/O caching and data processing
Application-Level I/O Services
– First used for seismic imaging (mid 90s)
– PnetCDF staging service
– CTH real-time analysis
– SQL Proxy (for NGC)
– Interactive sparse-matrix visualization (for NGC)
Nessie (NEtwork Scalable Service InterfacE)
– Framework for developing data services
– Client and server libs, cmake macros, utilities
– Originally developed for lightweight file systems
[Figure labels: Client Application (compute nodes); I/O Service (compute/service nodes), cache/aggregate/process; Raw Data; Processed Data; Lustre File System; Visualization Client]

4 Some Details on Nessie: Designed for Bulk Data Movement on HPC Platforms
Goals of the data-movement protocol
– Low stress on servers (assume an order of magnitude more clients than servers)
– Efficient use of the network (avoid copies, dropped messages, retransmissions, …)
Features of Nessie
– Asynchronous, RPC-like API
– Uses low-level RDMA transports (Portals, InfiniBand, Gemini)
– Small requests
– Server-directed for bulk data
  Writes: pull from client
  Reads: push to client
[Figure labels: Client; Server; request queue; data buffers; write request; pinned; server-initiated; client-initiated; ok]
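
The server-directed write path above can be illustrated with a small in-process simulation. This is only a sketch of the idea, not Nessie code: the class names, the `rdma_get` method, and the buffer size are all hypothetical stand-ins, and a Python slice takes the place of a real one-sided RDMA get. Clients submit small requests that describe their data; the server decides when to pull each payload, and it never pulls more than one staging buffer's worth at a time.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class WriteRequest:
    """Small control message: describes the data but does not carry it."""
    client_id: int
    length: int


class Client:
    def __init__(self, client_id, payload):
        self.client_id = client_id
        # Stands in for a pinned buffer exposed for remote access (assumption).
        self.pinned = bytes(payload)

    def make_request(self):
        return WriteRequest(self.client_id, len(self.pinned))

    def rdma_get(self, offset, nbytes):
        # Stand-in for the server's one-sided get; the client does no work here.
        return self.pinned[offset:offset + nbytes]


class Server:
    def __init__(self, buffer_bytes):
        self.buffer_bytes = buffer_bytes  # size of one server-side staging buffer
        self.queue = deque()              # pending small requests
        self.stored = {}                  # client_id -> data pulled so far

    def submit(self, request):
        self.queue.append(request)

    def drain(self, clients):
        # The server sets the pace: it pulls each payload in buffer-sized chunks.
        while self.queue:
            req = self.queue.popleft()
            src = clients[req.client_id]
            chunks = []
            for off in range(0, req.length, self.buffer_bytes):
                nbytes = min(self.buffer_bytes, req.length - off)
                chunks.append(src.rdma_get(off, nbytes))
            self.stored[req.client_id] = b"".join(chunks)


# Eight clients, one server: every client only enqueues a small request.
clients = {i: Client(i, bytes([i]) * 1000) for i in range(8)}
server = Server(buffer_bytes=256)
for c in clients.values():
    server.submit(c.make_request())
server.drain(clients)
```

Because the request carries only a client id and a length, a burst of requests from many clients costs the server little memory; the bulk data moves only when the server has a buffer ready for it.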

5 Example: A Simple Transfer Service
Trilinos/packages/trios/examples/xfer-service
Used to test the Nessie API
– xfer_write_encode: client transfers data to server through RPC args
– xfer_write_rdma: server pulls raw data using RDMA get
– xfer_read_encode: server transfers data to client through RPC result
– xfer_read_rdma: server transfers data to client using RDMA put
Used for performance evaluation
– Test low-level network protocols
– Test overhead of XDR encoding
– Test async and sync performance
Creating the Transfer Service
– Define the XDR data structs and API arguments
– Implement the client stubs
– Implement the server
[Figure labels: Client Application; Xfer-Service]
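
To make the difference between the "encode" and "rdma" variants concrete, the sketch below compares the size of the RPC arguments in each case. It is an illustration only: the function names are hypothetical, the element type (32-bit signed integers) is an assumption, and Python's `struct` module stands in for real XDR routines. It does follow the XDR wire convention of 4-byte big-endian integers, with a variable-length array sent as a count followed by its elements.

```python
import struct


def encode_args_with_data(values):
    """'encode' path: the array itself is serialized into the RPC arguments."""
    # XDR-style variable-length array: 4-byte count, then 4-byte big-endian ints.
    return struct.pack(f">I{len(values)}i", len(values), *values)


def encode_args_length_only(values):
    """'rdma' path: the arguments carry only the element count."""
    return struct.pack(">I", len(values))


def decode_args_with_data(message):
    """Server side of the 'encode' path: recover the array from the arguments."""
    (count,) = struct.unpack_from(">I", message, 0)
    return list(struct.unpack_from(f">{count}i", message, 4))


values = list(range(1024))
encoded = encode_args_with_data(values)
small = encode_args_length_only(values)
print(len(encoded), len(small))  # prints: 4100 4
```

In the encode path the request grows with the data and every element is converted on both sides; in the rdma path the request stays a few bytes long and the raw data is moved separately.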

6 Transfer Service: Implementing the Client Stubs
The client stub is the interface between the scientific app and the service.
Steps for a client stub
– Initialize the remote method arguments; in this case, that is just the length of the array
– Call the RPC function, passing the method arguments (args) and a pointer to the data available for RDMA (buf)
The RPC is asynchronous
– The client checks for completion by calling nssi_wait(&req);
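
The call-then-wait pattern of a client stub can be sketched as follows. None of this is the Nessie API: `xfer_write_stub`, `wait`, and `_remote_write` are hypothetical names, and a single-worker thread pool stands in for the remote service. The only point is the shape of the stub: fill in the arguments, issue the call, get a request handle back immediately, and block only when the result is needed.

```python
from concurrent.futures import ThreadPoolExecutor

# A one-thread pool stands in for the remote service (assumption).
_service = ThreadPoolExecutor(max_workers=1)


def _remote_write(args, buf):
    # Hypothetical server-side handler: "pulls" args["len"] elements from buf.
    return sum(buf[:args["len"]])


def xfer_write_stub(buf):
    """Hypothetical client stub: returns a request handle without blocking."""
    args = {"len": len(buf)}                           # step 1: initialize the arguments
    return _service.submit(_remote_write, args, buf)   # step 2: issue the asynchronous call


def wait(req):
    """Block until the request completes, in the spirit of nssi_wait(&req)."""
    return req.result()


data = list(range(100))
req = xfer_write_stub(data)

# The application is free to compute while the request is in flight.
local = sum(x * x for x in range(10))

result = wait(req)
_service.shutdown()
```

Separating the call from the wait is what lets an application overlap its own computation with the transfer.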

7 Transfer Service: Implementing the Server
Implement server stubs
– Using standard stub args
– For xfer_write_rdma_srvr, the server pulls data from the client
Implement server executable
– Initialize Nessie
– Register server stubs/callbacks
– Start the server thread(s)
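
The three steps of the server executable (initialize, register callbacks, start the server thread) can be sketched with a small dispatch loop. This is a generic illustration, not Nessie code: the `Service` class, the opcode string, and the request tuple layout are all hypothetical. Only the stub name `xfer_write_rdma_srvr` comes from the slides, and its body here is a stand-in.

```python
import queue
import threading


class Service:
    def __init__(self):
        # "Initialize": set up the dispatch table and the request queue.
        self.handlers = {}
        self.requests = queue.Queue()
        self.results = {}
        self.thread = None

    def register(self, opcode, callback):
        # "Register server stubs/callbacks": map an operation to its handler.
        self.handlers[opcode] = callback

    def start(self):
        # "Start the server thread(s)": one worker is enough for this sketch.
        self.thread = threading.Thread(target=self._loop)
        self.thread.start()

    def _loop(self):
        while True:
            item = self.requests.get()
            if item is None:  # sentinel: shut down
                break
            req_id, opcode, args, client_buf = item
            self.results[req_id] = self.handlers[opcode](args, client_buf)

    def stop(self):
        self.requests.put(None)
        self.thread.join()


def xfer_write_rdma_srvr(args, client_buf):
    # Stand-in server stub: pull args["len"] elements from the client's buffer
    # and report how many were received.
    pulled = list(client_buf[:args["len"]])
    return len(pulled)


svc = Service()
svc.register("XFER_WRITE_RDMA", xfer_write_rdma_srvr)
svc.start()
svc.requests.put((1, "XFER_WRITE_RDMA", {"len": 4}, [10, 20, 30, 40, 50]))
svc.stop()
```

After `svc.stop()` returns, `svc.results` holds one entry per processed request, keyed by request id.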

8 Evaluating the Transfer Service
[Figure: Performance of xfer_write_rdma on Red Storm]

9 Summary and Staff
Trios Data Services reduce the impact of I/O on applications
– Reduce the “effective” I/O cost through data staging
– Reduce the amount of data written to storage (integrated analysis, data services)
Nessie provides an effective framework for developing services
– Client and server API, macros for XDR processing, utilities for managing services
– Supports most HPC interconnects (SeaStar, Gemini, InfiniBand)
Trilinos provides a great research vehicle
– Common repository, testing support, broad distribution
Trios Data Services Development Team (and current assignment)
– Ron Oldfield: PI, CTH data service, Nessie development
– Todd Kordenbrock: Nessie development, performance analysis
– Gerald Lofstead: PnetCDF/Exodus service, transaction-based resilience
– Craig Ulmer: Data-service APIs for accelerators (GPU, FPGA)
– Ron Minnich: Protocol performance evaluations, Nessie BG/P support

