
SciDAC CS ISIC: Scalable Systems Software for Terascale Computer Centers
Al Geist
SciDAC CS ISIC Meeting, February 17, 2005, DOE Headquarters
Research sponsored by the MICS Office of DOE

Scope of the Effort
Improve the productivity of both users and system administrators.
Areas covered: checkpoint/restart, resource & queue management, accounting & user management, system build & configure, job management, system monitoring, security, allocation management, fault tolerance.

Current State of Systems Software for Large-Scale Machines
Both proprietary and open-source systems
–Machine-specific, PBS, LSF, POE, SLURM, COOAE (Collections Of Odds And Ends), …
Many are monolithic "resource management systems," combining multiple functions
–Job queuing, scheduling, process management, node monitoring, job monitoring, accounting, configuration management, etc.
A few established separate components exist
–Maui scheduler
–QBank accounting system
Many home-grown, local pieces of software
Scalability is often a weak point

The Problem
System administrators and managers of terascale computer centers are facing a crisis:
–Computer centers use incompatible, ad hoc sets of systems tools
–Present tools are not designed to scale to multi-teraflop systems, so centers end up rewriting them
–Commercial solutions are not appearing because business forces drive industry toward servers, not HPC

Three Goals
1. Design a modular system software architecture
–Portable across diverse hardware and easy to adopt: allows plug-and-play components, and is language and wire-protocol independent
2. Collectively (with industry) agree on and specify standardized interfaces between system components
–An MPI-like process to promote interoperability, portability, and long-term usability
3. Produce a fully integrated suite of systems software and tools
–A reference implementation for the management and utilization of terascale computational resources

The Team
Coordinator: Al Geist
Participating organizations include DOE labs, NSF supercomputer centers, and vendors:
ORNL, ANL, LBNL, PNNL, NCSA, PSC, SNL, LANL, Ames, IBM, Cray, SGI, Intel
Open to all, like the MPI Forum

Impact
Fundamentally change the way future high-end systems software is developed and distributed
Reduced facility management costs
–Reduce duplication of effort rewriting components
–Reduce the need to support ad hoc software
–Better systems tools available
–Able to get machines up and running faster, and keep them running
More effective use of machines by scientific applications
–Scalable launch of jobs and checkpoint/restart
–Job monitoring and management tools
–Allocation management interface

System Software Architecture (component diagram)
Components: Meta Scheduler, Meta Monitor, Meta Manager, Accounting, Scheduler, System Monitor, User DB, Allocation Management, Queue Manager, Job Manager & Monitor, Node Configuration & Build Manager, Checkpoint/Restart, Data Migration, Usage Reports, User utilities, Testing & Validation, File System, Application Environment, High Performance Communication & I/O
The access control / security manager interacts with all components

Example: How Users Interact with Systems Software Components (diagram)
–Job submission paths: mpiexec (MPI std args), the QM's job submission language, and interactive simple scripts or hairy GUIs using SSS XML
–SSS side: the SSS components (QM, PM, EM, SD, Sched, NSM) exchange SSS XML; other managers could go here instead
–MPD-based implementation side: mpdrun takes an XML file and the MPD's start the application processes
(a sketch of the submission path follows)
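To make the submission path above concrete, here is a minimal, hypothetical Python sketch of a simple script building an XML job request and handing it to a queue manager over a socket. The element names, host, port, and newline framing are illustrative assumptions, not the actual SSS XML schema or wire protocol.

    # Hypothetical sketch: build an XML job request and send it to a queue
    # manager over plain TCP. Element names, host, port, and framing are
    # illustrative assumptions, not the actual SSS XML schema or protocol.
    import socket
    import xml.etree.ElementTree as ET

    def build_job_request(executable, nodes, walltime):
        """Build an illustrative XML job-submission message."""
        root = ET.Element("submit-job")                 # assumed element name
        ET.SubElement(root, "executable").text = executable
        ET.SubElement(root, "nodes").text = str(nodes)
        ET.SubElement(root, "walltime").text = walltime
        return ET.tostring(root, encoding="unicode")

    def send_request(message, host="localhost", port=7000):
        """Send the XML message and return the raw XML reply (newline-framed)."""
        with socket.create_connection((host, port)) as sock:
            sock.sendall(message.encode() + b"\n")
            return sock.makefile().readline()

    if __name__ == "__main__":
        request = build_job_request("./my_app", nodes=64, walltime="01:00:00")
        print("Would send:", request)
        # reply = send_request(request)   # requires a listening queue manager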

Highlights
Designed modular architecture
–Allows a site to plug and play what they need
Defined XML interfaces
–Independent of language and wire protocol
Reference implementation released
–Version 1.0 available at SC2004
Production users
–ANL, Ames, PNNL, NCSA
Adoption of API
–Maui (3000 downloads/month)
–Moab (Amazon.com, Ford, …)

Designed Modular Architecture
Make it easy for sites to adopt
–Easily replace a component that doesn't meet their needs
–Use only the parts of the suite that they need
–Components can be shared across facilities
–Open source, so sites can modify at will
Components have well-defined roles
–Independent of language and wire protocol
–Communicate through XML messages
Service Directory, Event Manager, and Communication Library form the core and interact with all other components (a sketch of the registration pattern follows)
–Provide plug-and-play registration and notification
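A minimal sketch, assuming a hypothetical socket/XML interface, of the registration-and-notification pattern the core components provide: a component registers its location with the Service Directory and subscribes to events from the Event Manager. Element names, hosts, ports, and framing are illustrative, not the actual component APIs.

    # Hypothetical sketch of the register-then-subscribe pattern described
    # above. Element names, hosts, ports, and newline framing are
    # illustrative assumptions, not the actual Service Directory / Event
    # Manager APIs.
    import socket
    import xml.etree.ElementTree as ET

    def xml_message(tag, **attrs):
        """Serialize a one-element XML message with the given attributes."""
        return ET.tostring(ET.Element(tag, attrs), encoding="unicode") + "\n"

    def send(host, port, message):
        """Send one newline-framed XML message and return the reply line."""
        with socket.create_connection((host, port)) as sock:
            sock.sendall(message.encode())
            return sock.makefile().readline()

    def register_component(name, host, port):
        # Tell the (assumed) Service Directory where this component listens.
        msg = xml_message("register", component=name, host=host, port=str(port))
        return send("sd-host", 6000, msg)        # assumed SD location

    def subscribe_to_events(name, event_type):
        # Ask the (assumed) Event Manager to notify us of matching events.
        msg = xml_message("subscribe", component=name, event=event_type)
        return send("em-host", 6001, msg)        # assumed EM location

    if __name__ == "__main__":
        print(xml_message("register", component="my-monitor",
                          host="node01", port="7100"), end="")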

Progress on Integrated Suite (SSS-OSCAR component diagram)
Meta services: Meta Scheduler, Meta Monitor, Meta Manager
Components: Accounting, Event Manager, Service Directory, Scheduler, Node State Manager, Allocation Management, Process Manager, Usage Reports, System & Job Monitor, Job Queue Manager, Node Configuration & Build Manager, Checkpoint/Restart, Hardware Infrastructure Manager, Testing & Validation
Components talk through standard XML interfaces with authentication over the communication infrastructure
Components can be written in any mixture of C, C++, Java, Perl, and Python

Defined XML Interfaces
Components interact by exchanging XML messages
–Fully documented and publicly available
The API imposes no restrictions on the languages used
–Components can be any mixture of C, C++, Perl, Java, and Python
Multiple wire protocols are supported (see the sketch below)
–Components can use one or more of the wire protocols supplied in the communication library: http(s), ssl, tcp, zlib, challenge authentication, more…
–The set of wire protocols is extensible
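A brief sketch of the wire-protocol independence described above: the same XML payload can be carried over plain TCP or over SSL without changing the message itself. This is not the ssslib API; the hosts, ports, payload, and newline framing are assumptions for illustration.

    # Hypothetical sketch of sending the same XML payload over two different
    # wire protocols (plain TCP and SSL), illustrating that message content
    # is independent of the transport. This is not the ssslib API; hosts,
    # ports, and newline framing are assumptions for illustration.
    import socket
    import ssl

    PAYLOAD = '<get-node-state node="node01"/>\n'   # illustrative message

    def over_tcp(host="localhost", port=7000):
        with socket.create_connection((host, port)) as sock:
            sock.sendall(PAYLOAD.encode())
            return sock.makefile().readline()

    def over_ssl(host="localhost", port=7443):
        context = ssl.create_default_context()
        context.check_hostname = False                # demo only
        context.verify_mode = ssl.CERT_NONE           # demo only
        with socket.create_connection((host, port)) as raw:
            with context.wrap_socket(raw, server_hostname=host) as sock:
                sock.sendall(PAYLOAD.encode())
                return sock.makefile().readline()

    if __name__ == "__main__":
        print("Same payload, two transports:", PAYLOAD.strip())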

Fully Integrated Suite Released
SSS-OSCAR (Open Source Cluster Application Resources)
–Version 1.0 available at SC2004
–Source and precompiled for Linux
–Components tested up to 5000 processors on an NCSA cluster
Quarterly updates through the rest of the project
–Improve robustness and support more platforms
–Add capabilities, filling out the architecture components and XML interfaces
Leverage the popularity of OSCAR distributions
–OSCAR has been adopted by many cluster vendors
–Tens of thousands of downloads
–Raises our software suite's profile and availability

Components in Suites (implementations of the architecture boxes)
–Gold and QBank (accounting / allocation management)
–EM (Event Manager), SD (Service Directory), ssslib (communication library)
–Meta services: Grid scheduler, Meta Manager, Warehouse
–Maui sched (scheduler), NSM (Node State Manager), PM (Process Manager)
–Warehouse (superMon, NWPerf) for system and job monitoring
–Bamboo QM (queue manager), BCM (build & configuration manager), HIM (Hardware Infrastructure Manager)
–BLCR (checkpoint/restart), APITest, Usage Reports
Multiple component implementations exist
Compliant with PBS and LoadLeveler job scripts

Production Users
Running the full suite in production for over a year
–Argonne National Lab – 200-node Chiba City
–Ames Lab
Running one or more components in production
–Pacific Northwest National Lab – 11.4 TF cluster + others
–NCSA
Running the full suite on development systems
–Most participants
Discussions with DOD-HPCMP sites
–Use of our scheduler and accounting components

Adoption of the API
Maui Scheduler now uses our API in client and server
–3000 downloads/month
–Used on 75 of the top 100 supercomputers in the TOP500
Commercial Moab Scheduler uses our API
–Amazon.com, Boeing, Ford, Dow Chemical, Lockheed-Martin, more…
New capabilities added to the schedulers due to the API
–Fairness, higher system utilization, improved response time
Discussions with Cray on leadership-class computers
–Don Mason attended our meetings
–Cray plans to use XML messages to connect their system components
–Exchanged info on XML format, API test software, more…

Production Lessons Learned: This Approach Really Works!
Components can use one another's data
Functionality only needs to be implemented once
–E.g., broadcast of messages
Components are more robust, since each focuses on one task
Code volume shrinks because there is less duplication of functionality
Easy to add new functionality
–File staging
–MPISH
–Rich infrastructure on which to build new components: communication, logging, location services
Need not be limited by the subcomponents of existing systems
–Can replace just the functionality needed (solve the problem you want to solve without re-implementing everything)
–E.g., having the queue manager accept requests for rebuilt nodes before starting jobs

View to the Future (diagram)
HW, CS, and science teams all contribute to the science breakthroughs
–Ultrascale hardware (Rainer, Blue Gene, Red Storm) from the OS/HW teams
–Computing environment with a common look & feel across diverse HW, plus software & libs, from the SciDAC CS teams
–Tuned codes for the high-end science problem from the research teams and SciDAC science teams
–Together these yield leadership-class platforms and breakthrough science

SciDAC Phase 2 and CS ISICs
Future CS ISICs need to be mindful of the needs of the National Leadership Computing Facility
–Cray, IBM BG, SGI, clusters, multiple OS
–No one architecture is best for all applications
SciDAC science teams
–Needs depend on the application areas chosen
–End stations? Do they have special SW needs?
FastOS research projects
–Complement, don't duplicate, these efforts
Cray software roadmap
–Making the leadership computers usable, efficient, fast

Gaps and Potential Next Steps
Heterogeneous leadership-class machines
–Science teams need a robust environment that presents similar programming interfaces and tools across the different machines
Fault tolerance requirements in apps and systems software
–Particularly as systems scale up to petascale around 2010
Support for application users submitting interactive jobs
–Computational steering as a means of scientific discovery
High-performance file system and I/O research
–Increasing demands of security, scalability, and fault tolerance
Security
–One-time passwords and their impact on scientific progress

Heterogeneous Machines
Heterogeneous architectures
–Vector, scalar, SMP, hybrids, clusters
–How is a science team to know what is best for them?
Multiple OS, even within one machine (e.g., Blue Gene, Red Storm)
–How to administer such systems effectively and efficiently?
Diverse programming environment
–Science teams need a robust environment that presents similar programming interfaces and tools across the different machines
Diverse system management environment
–Managing and scheduling multiple node types
–System updates, accounting, … everything will be harder in round 2

Fault Tolerance
Holistic fault tolerance
–Research into schemes that take into account the full impact of faults: application, middleware, OS, and hardware
Fault tolerance in systems software
–Research into prediction and prevention
–Survivability and resiliency when faults cannot be avoided
Application recovery and transparent failure recovery (see the sketch below)
–Research into intelligent checkpointing based on active monitoring, sophisticated rule-based recoveries, diskless checkpointing, …
–For petascale systems, research into recovery without checkpointing
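As an illustration of the intelligent-checkpointing idea, the sketch below triggers a checkpoint when a monitored node's health score drops below a threshold, falling back to a fixed interval otherwise. The health_of(), checkpoint(), and threshold rule are hypothetical placeholders, not project APIs.

    # Minimal sketch of "intelligent checkpointing based on active
    # monitoring": checkpoint when monitored node health degrades rather
    # than only on a fixed timer. The health_of(), threshold rule, and
    # checkpoint() hooks are hypothetical placeholders, not project APIs.
    import random
    import time

    CHECKPOINT_INTERVAL = 3600        # fallback: fixed-interval checkpoint (s)
    HEALTH_THRESHOLD = 0.8            # assumed rule: act below this score

    def health_of(node):
        """Placeholder health probe; a real monitor would query sensors/logs."""
        return random.uniform(0.5, 1.0)

    def checkpoint(reason):
        """Placeholder for a real checkpoint call (e.g. via a BLCR-like tool)."""
        print(f"checkpoint triggered: {reason}")

    def monitor(nodes, poll_seconds=10, cycles=3):
        last_ckpt = time.time()
        for _ in range(cycles):                      # bounded loop for the demo
            unhealthy = [n for n in nodes if health_of(n) < HEALTH_THRESHOLD]
            if unhealthy:
                checkpoint(f"degraded nodes: {unhealthy}")
                last_ckpt = time.time()
            elif time.time() - last_ckpt > CHECKPOINT_INTERVAL:
                checkpoint("interval elapsed")
                last_ckpt = time.time()
            time.sleep(poll_seconds)

    if __name__ == "__main__":
        monitor(["node01", "node02", "node03"], poll_seconds=1)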

Interactive Computing
Batch jobs are not always the best for science
–Good for large numbers of users and a wide mix of jobs, but the National Leadership Computing Facility has a different focus
Computational steering as a paradigm for discovery (see the sketch below)
–Break the cycle: simulate, dump results, analyze, rerun simulation
–More efficient use of computer resources
Needed for application development
–Scaling studies on terascale systems
–Debugging applications that only fail at scale
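A minimal sketch of a steering loop, assuming a hypothetical JSON steering file the user edits while the job runs: the simulation periodically re-reads its parameters mid-run instead of being dumped, analyzed, and rerun from scratch. The file format, parameter names, and step() are illustrative only.

    # Minimal sketch of the steering loop contrasted with the batch cycle
    # above: the simulation periodically reads a small steering file and
    # applies updated parameters mid-run instead of being rerun from scratch.
    # The file format, parameter names, and step() are illustrative only.
    import json
    import os

    STEERING_FILE = "steer.json"      # hypothetical file a user/GUI edits

    def read_steering(params):
        """Merge any user-updated parameters into the running configuration."""
        if os.path.exists(STEERING_FILE):
            with open(STEERING_FILE) as f:
                params.update(json.load(f))
        return params

    def step(state, params):
        """Placeholder for one simulation timestep."""
        return state + params["dt"] * params["rate"]

    def run(total_steps=100, check_every=10):
        params = {"dt": 0.01, "rate": 1.0}
        state = 0.0
        for i in range(total_steps):
            if i % check_every == 0:
                params = read_steering(params)   # steer without restarting
            state = step(state, params)
        print("final state:", state)

    if __name__ == "__main__":
        run()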

File System and I/O Research
Lustre is today's answer
–There are already concerns about its capabilities as systems scale up to 100+ TF
What is the answer for 2010?
–Research is needed to explore the file system and I/O requirements of the petascale systems that will be here in five years
I/O continues to be a bottleneck in large systems
–Hitting the memory access wall on a node
–Too expensive to scale I/O bandwidth with teraflops across nodes
–Research is needed to understand how to structure applications or modify I/O so applications can run efficiently (a sketch follows)
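One common way to restructure application I/O, sketched below with mpi4py (assumed available): gather data to a small subset of writer ranks instead of having every process write its own file. The group size and file naming are illustrative choices, not a project recommendation.

    # Illustrative sketch (using mpi4py, assumed available) of one way to
    # "structure applications or modify I/O": ranks gather their data to a
    # small set of writer ranks rather than writing one file per process.
    from mpi4py import MPI

    GROUP_SIZE = 16                      # assumed: one writer per 16 ranks

    def aggregated_write(data):
        comm = MPI.COMM_WORLD
        rank = comm.Get_rank()
        # Split ranks into groups; group rank 0 does the writing.
        group = comm.Split(rank // GROUP_SIZE, rank)
        gathered = group.gather(data, root=0)
        if group.Get_rank() == 0:
            with open(f"out_{rank // GROUP_SIZE}.dat", "w") as f:
                for item in gathered:
                    f.write(f"{item}\n")
        group.Free()

    if __name__ == "__main__":
        rank = MPI.COMM_WORLD.Get_rank()
        aggregated_write(f"payload from rank {rank}")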

Security
New, stricter access policies at computer centers
–Attacks on supercomputer centers have gotten worse
–One-time passwords, PIV?
–Sites are shifting policies, tightening firewalls, going to SecurID tokens
Impact on scientific progress
–Collaboration within international teams
–Foreign-national clearance delays
–Access to data and computational resources
Advances required in system software
–To allow compliance with different site policies and handle the tightest requirements
–Study how to reduce the impact on scientists

Summary
The Scalable Systems Software SciDAC project is addressing the problem of systems software for terascale systems
–Component architecture for systems software
–Definitions of standard interfaces between components
–An infrastructure to support component implementations within this framework
–A set of component implementations, continuing to improve
Reference software suite released
–Quarterly updates planned
Production use of the component architecture and some of the component implementations
–Encourages development of sharable tools and solutions
–ANL Blue Gene will run our suite

Project Management
Quarterly face-to-face meetings
Weekly working group telecons
Four different working groups:
1. Node build, configuration, and information service
2. Resource management, scheduling, and allocation
3. Process management, system monitoring, and checkpointing
4. Validation and integration
Web-based project notebooks (over 300 pages and growing)
–A main notebook for general information & meeting notes
–Individual notebooks for each working group