Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy’s National Nuclear.

Slides:



Advertisements
Similar presentations
Three types of remote process invocation
Advertisements

© 2003 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice Performance Measurements of a User-Space.
The Development of Mellanox - NVIDIA GPUDirect over InfiniBand A New Model for GPU to GPU Communications Gilad Shainer.
1 A GPU Accelerated Storage System NetSysLab The University of British Columbia Abdullah Gharaibeh with: Samer Al-Kiswany Sathish Gopalakrishnan Matei.
Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy’s National Nuclear.
Remote Procedure Call Design issues Implementation RPC programming
Priority Research Direction (I/O Models, Abstractions and Software) Key challenges What will you do to address the challenges? – Develop newer I/O models.
Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy’s National Nuclear.
Tam Vu Remote Procedure Call CISC 879 – Spring 03 Tam Vu March 06, 03.
1 Presentation at SciDAC face-to-face January 2005 Ron A. Oldfield Sandia National Laboratories The Lightweight File System.
PZ13B Programming Language design and Implementation -4th Edition Copyright©Prentice Hall, PZ13B - Client server computing Programming Language.
 Introduction Originally developed by Open Software Foundation (OSF), which is now called The Open Group ( Provides a set of tools and.
Desktop Computing Strategic Project Sandia National Labs May 2, 2009 Jeremy Allison Andy Ambabo James Mcdonald Sandia is a multiprogram laboratory operated.
Distributed Systems Lecture #3: Remote Communication.
Lightweight Remote Procedure Call Brian N. Bershad, Thomas E. Anderson, Edward D. Lazowska, Henry M. Levy ACM Transactions Vol. 8, No. 1, February 1990,
Middleware Technologies compiled by: Thomas M. Cosley.
Update on Sandia’s Portal Project Interlab 2003 November 5–7, 2003 Cara Corey and Tracy Walker Sandia National Laboratories Sandia is a multiprogram laboratory.
1 I/O Management in Representative Operating Systems.
PRASHANTHI NARAYAN NETTEM.
.NET Mobile Application Development Introduction to Mobile and Distributed Applications.
Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy’s National Nuclear.
SAND Number: P Sandia is a multi-program laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department.
Automated Computer Account Management in Active Directory June 2 nd, 2009 Bill Claycomb Systems Analyst Sandia National Laboratories Sandia is a multiprogram.
Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy’s National Nuclear.
Designing Efficient Systems Services and Primitives for Next-Generation Data-Centers K. Vaidyanathan, S. Narravula, P. Balaji and D. K. Panda Network Based.
Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation,
Dax: Rethinking Visualization Frameworks for Extreme-Scale Computing DOECGF 2011 April 28, 2011 Kenneth Moreland Sandia National Laboratories SAND P.
SAINT2002 Towards Next Generation January 31, 2002 Ly Sauer Sandia National Laboratories Sandia is a multiprogram laboratory operated by Sandia Corporation,
Slide 1 Auburn University Computer Science and Software Engineering Scientific Computing in Computer Science and Software Engineering Kai H. Chang Professor.
Principles of Scalable HPC System Design March 6, 2012 Sue Kelly Sandia National Laboratories Abstract: Sandia National.
Data Intensive Computing at Sandia September 15, 2010 Andy Wilson Senior Member of Technical Staff Data Analysis and Visualization Sandia National Laboratories.
Introduction to Distributed Systems Slides for CSCI 3171 Lectures E. W. Grundke.
The Red Storm High Performance Computer March 19, 2008 Sue Kelly Sandia National Laboratories Abstract: Sandia National.
CCGrid 2014 Improving I/O Throughput of Scientific Applications using Transparent Parallel Compression Tekin Bicer, Jian Yin and Gagan Agrawal Ohio State.
SSS Test Results Scalability, Durability, Anomalies Todd Kordenbrock Technology Consultant Scalable Computing Division Sandia is a multiprogram.
Swapping to Remote Memory over InfiniBand: An Approach using a High Performance Network Block Device Shuang LiangRanjit NoronhaDhabaleswar K. Panda IEEE.
Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy’s National Nuclear.
DOE PI Meeting at BNL 1 Lightweight High-performance I/O for Data-intensive Computing Jun Wang Computer Architecture and Storage System Laboratory (CASS)
 Remote Procedure Call (RPC) is a high-level model for client-sever communication.  It provides the programmers with a familiar mechanism for building.
1 Public DAFS Storage for High Performance Computing using MPI-I/O: Design and Experience Arkady Kanevsky & Peter Corbett Network Appliance Vijay Velusamy.
The HDF Group Milestone 5.1: Initial POSIX Function Shipping Demonstration Jerome Soumagne, Quincey Koziol 09/24/2013 © 2013 The HDF Group.
CCGrid 2014 Improving I/O Throughput of Scientific Applications using Transparent Parallel Compression Tekin Bicer, Jian Yin and Gagan Agrawal Ohio State.
Hwajung Lee.  Interprocess Communication (IPC) is at the heart of distributed computing.  Processes and Threads  Process is the execution of a program.
Tevfik Kosar Computer Sciences Department University of Wisconsin-Madison Managing and Scheduling Data.
Intel Research & Development ETA: Experience with an IA processor as a Packet Processing Engine HP Labs Computer Systems Colloquium August 2003 Greg Regnier.
Reconfigurable Computing Aspects of the Cray XD1 Sandia National Laboratories / California Craig Ulmer Cray User Group (CUG 2005) May.
Threading Opportunities in High-Performance Flash-Memory Storage Craig Ulmer Sandia National Laboratories, California Maya GokhaleLawrence Livermore National.
STK (Sierra Toolkit) Update Trilinos User Group meetings, 2014 R&A: SAND PE Sandia National Laboratories is a multi-program laboratory operated.
System-Directed Resilience for Exascale Platforms LDRD Proposal Ron Oldfield (PI)1423 Ron Brightwell1423 Jim Laros1422 Kevin Pedretti1423 Rolf.
Site Report DOECGF April 26, 2011 W. Alan Scott Sandia National Laboratories Sandia National Laboratories is a multi-program laboratory managed and operated.
ParaView III Software Methodology Update Where to now November 29, 2006 John Greenfield Sandia is a multiprogram laboratory operated by Sandia Corporation,
Virtualization Technology and Microsoft Virtual PC 2007 YOU ARE WELCOME By : Osama Tamimi.
CCA Common Component Architecture Insights from Quantum Chemistry Joseph P. Kenny Scalable Computing Research and Design Sandia National Laboratories Livermore,
Computer Science Lecture 3, page 1 CS677: Distributed OS Last Class: Communication in Distributed Systems Structured or unstructured? Addressing? Blocking/non-blocking?
Accelerating High Performance Cluster Computing Through the Reduction of File System Latency David Fellinger Chief Scientist, DDN Storage ©2015 Dartadirect.
LLNL-PRES This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344.
Other Tools HPC Code Development Tools July 29, 2010 Sue Kelly Sandia is a multiprogram laboratory operated by Sandia Corporation, a.
LIOProf: Exposing Lustre File System Behavior for I/O Middleware
Nguyen Thi Thanh Nha HMCL by Roelof Kemp, Nicholas Palmer, Thilo Kielmann, and Henri Bal MOBICASE 2010, LNICST 2012 Cuckoo: A Computation Offloading Framework.
Performing Fault-tolerant, Scalable Data Collection and Analysis James Jolly University of Wisconsin-Madison Visualization and Scientific Computing Dept.
Photos placed in horizontal position with even amount of white space between photos and header Sandia National Laboratories is a multi-program laboratory.
Automated File Server Disk Quota Management May 13 th, 2008 Bill Claycomb Computer Systems Analyst Infrastructure Computing Systems Department Sandia is.
Virtual Directory Services and Directory Synchronization May 13 th, 2008 Bill Claycomb Computer Systems Analyst Infrastructure Computing Systems Department.
Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy’s National Nuclear.
#01 Client/Server Computing
Ray-Cast Rendering in VTK-m
USF Health Informatics Institute (HII)
Integrating DPDK/SPDK with storage application
Trilinos I/O Support (TRIOS)
#01 Client/Server Computing
Presentation transcript:

Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL Data Services and Trilinos A Brief Introduction to Trios Data Services Approved for Public Release: SAND P 2011 Trilinos User Group Meeting Nov, 2011 Ron Oldfield Sandia National Laboratories

I/O Challenges for Exascale Storage systems are the slowest, most fragile, part of an HPC system Current usage models not appropriate for Petascale, much less Exascale –Checkpoints are a HUGE concern for I/O…currently primary focus of FS –App workflow uses storage as a communication conduit Simulate, store, analyze, store, refine, store, … most of the data is transient –High-level I/O libraries (e.g., HDF5, netCDF) have high overheads Trios Data Services to the rescue! 1.Reduce the “effective” I/O cost through data staging 2.Reduce amount of data written to storage (integrated analysis, data services) Nothing comes for free… –We use additional compute and memory resources –Data services introduce issues with resilience (we’re addressing this) 2

Trios Data Services I/O Software to Reduce I/O Approach –Leverage available compute/service node resources for I/O caching and data processing Application-Level I/O Services –First used for seismic imaging (mid 90s) –PnetCDF staging service –CTH real-time analysis –SQL Proxy (for NGC) –Interactive sparse-matrix visualization (for NGC) Nessie (NEtwork Scalable Service InterfacE) –Framework for developing data services –Client and server libs, cmake macros, utilities –Originally developed for lightweight file systems Client Application (compute nodes) I/O Service (compute/service nodes) Raw Data Processed Data Lustre File System Cache/aggregate /process Visualization Client 3

Some Details on Nessie Designed for Bulk Data Movement on HPC Platforms Goals of data-movement protocol –Low stress on servers (assume order of magnitude more clients than servers) –Efficient use of network (avoid copies, dropped messages, retransmissions, … Features of Nessie –Asynchronous, RPC-like API –User low-level RDMA transports Portals, InfiniBand, Gemini –Small requests –Server-directed for bulk data Writes: pull from client Reads: push to client 4 Client Server request queue data buffers write request pinned server-initiated client-initiated ok A B C D A B C D

Example: A Simple Transfer Service Trilinos/packages/trios/examples/xfer-service Used to test Nessie API –xfer_write_encode: client transfers data to server through RPC args –xfer_write_rdma: server pulls raw data using RDMA get –xfer_read_encode: server transfers data to client through RPC result –xfer_read_rdma: server transfers data to client using RDMA put Used for performance evaluation –Test low-level network protocols –Test overhead of XDR encoding –Tests async and sync performance Creating the Transfer Service –Define the XDR data structs and API arguments –Implement the client stubs –Implement the server 5 Client Application Xfer-Service

Transfer Service Implementing the Client Stubs Interface between scientific app and service Steps for client stub –Initialize the remote method arguments, in this case, it’s just the length of the array –Call the rpc function. The RPC function includes method arguments (args), and a pointer to the data available for RDMA (buf) The RPC is asynchronous –The client checks for completion by calling nssi_wait(&req) ; 6

Transfer Service Implementing the Server Implement server stubs –Using standard stub args –For xfer_write_rdma_srvr, the server pulls data from client Implement server executable –Initialize Nessie –Register server stubs/callbacks –Start the server thread(s) 7

Evaluating the Transfer Service 8 Performance of xfer_write_rdma on Red Storm

Summary and Staff Trios Data Services reduce the impact of I/O on applications –Reduce the “effective” I/O cost through data staging –Reduce amount of data written to storage (integrated analysis, data services) Nessie provides an effective framework for developing services –Client and server API, macros for XDR processing, utils for managing svcs –Supports most HPC interconnects (Seastar, Gemini, InfiniBand) Trilinos provides a great research vehicle –Common repository, testing support, broad distribution Trios Data Services Development Team (and current assignment) –Ron Oldfield: PI, CTH data service, Nessie development –Todd Kordenbrock: Nessie development, performance analysis –Gerald Lofstead: PnetCDF/Exodus service, transaction-based resilience –Craig Ulmer: Data-service APIs for accelerators (GPU, FPGA) –Ron Minnich: Protocol performance evaluations, Nessie BG/P support 9

Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL Data Services and Trilinos A Brief Introduction to Trios Data Services 2011 Trilinos User Group Meeting Nov, 2011 Ron Oldfield Sandia National Laboratories