
1 SciDAC CS ISIC: Scalable Systems Software for Terascale Computer Centers. Al Geist. SciDAC CS ISIC Meeting, February 17, 2005, DOE Headquarters. Research sponsored by the MICS Office of DOE.

2 Scope of the Effort: Improve the productivity of both users and system administrators. Areas covered: checkpoint/restart, resource and queue management, accounting and user management, system build and configure, job management, system monitoring, security, allocation management, and fault tolerance.

3 Current State of Systems Software for Large-Scale Machines
- Both proprietary and open-source systems: machine-specific tools, PBS, LSF, POE, SLURM, COOAE (Collections Of Odds And Ends), ...
- Many are monolithic "resource management systems" combining multiple functions: job queuing, scheduling, process management, node monitoring, job monitoring, accounting, configuration management, etc.
- A few established separate components exist: the Maui scheduler and the QBank accounting system.
- Many home-grown, local pieces of software.
- Scalability is often a weak point.

4 The Problem: System administrators and managers of terascale computer centers are facing a crisis.
- Computer centers use incompatible, ad hoc sets of systems tools.
- Present tools are not designed to scale to multi-teraflop systems, so centers are having to rewrite them.
- Commercial solutions are not forthcoming because business forces drive industry toward servers, not HPC.

5 Three Goals
- Design a modular system software architecture: portable across diverse hardware and easy to adopt; allows plug-and-play components; language and wire-protocol independent.
- Collectively (with industry) agree on and specify standardized interfaces between system components, through an MPI-like process, to promote interoperability, portability, and long-term usability.
- Produce a fully integrated suite of systems software and tools (a reference implementation) for the management and utilization of terascale computational resources.

6 The Team (www.scidac.org/ScalableSystems). Coordinator: Al Geist. Participating organizations include DOE labs, NSF supercomputer centers, and vendors: ORNL, ANL, LBNL, PNNL, NCSA, PSC, SNL, LANL, Ames, IBM, Cray, SGI, Intel. Open to all, like the MPI Forum.

7 Impact: Fundamentally change the way future high-end systems software is developed and distributed.
- Reduced facility management costs: less duplication of effort rewriting components; less need to support ad hoc software; better systems tools available; machines get up and running faster and keep running.
- More effective use of machines by scientific applications: scalable job launch and checkpoint/restart; job monitoring and management tools; allocation management interface.

8 System Software Architecture (diagram). The suite comprises: User DB, Accounting, Scheduler, System Monitor, Meta Scheduler, Meta Monitor, Meta Manager, Node Configuration & Build Manager, Allocation Management, Queue Manager, Job Manager & Monitor, Data Migration, Usage Reports, User Utilities, Checkpoint/Restart, Testing & Validation, File System, Access Control, and a Security Manager that interacts with all components, built over High Performance Communication & I/O within the Application Environment.

9 Example: How Users Interact with Systems Software Components (diagram). The figure shows the SSS side — the Queue Manager (QM), Process Manager (PM), Event Manager (EM), Service Directory (SD), Scheduler (Sched), and Node State Manager (NSM) exchanging SSS XML — and an MPD-based implementation side, where mpiexec (with MPI-standard arguments) or mpdrun uses an XML job file and the MPD daemons to start the application processes. Users submit jobs in the QM's job-submission language, interactively or through simple scripts or elaborate GUIs that emit SSS XML; other process managers could be plugged in instead.
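
To make the XML-message exchange concrete, here is a minimal sketch, assuming a hypothetical queue-manager endpoint that accepts a job description as XML over TCP. The element names, host, and port are illustrative assumptions, not the project's actual schema or wire protocol.

    # Hypothetical sketch: submit a job description as XML to a queue-manager
    # component. Element names, host, and port are assumptions for illustration;
    # the real SSS schema and wire protocols differ.
    import socket
    import xml.etree.ElementTree as ET

    def build_job_request(user, executable, nodes):
        # Build a small XML document describing the job.
        job = ET.Element("submit-job")
        ET.SubElement(job, "user").text = user
        ET.SubElement(job, "executable").text = executable
        ET.SubElement(job, "nodes").text = str(nodes)
        return ET.tostring(job)

    def submit(xml_bytes, host="localhost", port=5150):
        # Send the message over a plain TCP connection and return the raw reply.
        with socket.create_connection((host, port)) as sock:
            sock.sendall(xml_bytes)
            sock.shutdown(socket.SHUT_WR)
            return sock.recv(65536)

    if __name__ == "__main__":
        request = build_job_request("auser", "/bin/hostname", nodes=4)
        print(submit(request).decode())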

10 Highlights
- Designed a modular architecture that lets a site plug and play just what it needs.
- Defined XML interfaces independent of language and wire protocol.
- Reference implementation released: Version 1.0, available at SC2004.
- Production users: ANL, Ames, PNNL, NCSA.
- Adoption of the API: Maui (3000 downloads/month) and Moab (Amazon.com, Ford, ...).

11 Designed Modular Architecture
- Makes it easy for sites to adopt: a component that doesn't meet a site's needs can be replaced; sites can use only the parts of the suite they need; components can be shared across facilities; open source, so sites can modify it at will.
- Components have well-defined roles, are independent of language and wire protocol, and communicate through XML messages.
- The Service Directory, Event Manager, and communication library form the core, interact with all other components, and provide plug-and-play registration and notification.
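
As a purely conceptual illustration of the registration-and-notification role played by the Service Directory and Event Manager, here is a toy in-process sketch; the real components are separate services exchanging XML messages, and the class, service, and event names below are invented for illustration.

    # Toy, in-process sketch of registration and notification as provided by a
    # service directory and an event manager. Conceptual illustration only; the
    # real SSS components are separate processes exchanging XML messages.
    from collections import defaultdict

    class ServiceDirectory:
        # Components register a name -> location mapping; others look it up.
        def __init__(self):
            self._services = {}

        def register(self, name, location):
            self._services[name] = location

        def lookup(self, name):
            return self._services.get(name)

    class EventManager:
        # Components subscribe to named events; publishers notify all subscribers.
        def __init__(self):
            self._subscribers = defaultdict(list)

        def subscribe(self, event, callback):
            self._subscribers[event].append(callback)

        def notify(self, event, payload):
            for callback in self._subscribers[event]:
                callback(payload)

    if __name__ == "__main__":
        sd, em = ServiceDirectory(), EventManager()
        sd.register("queue-manager", "tcp://mgmt01:5150")
        em.subscribe("node-failure", lambda node: print(f"scheduler: drain {node}"))
        em.subscribe("node-failure", lambda node: print(f"queue-manager: requeue jobs on {node}"))
        print("QM is at", sd.lookup("queue-manager"))
        em.notify("node-failure", "n042")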

12 Progress on Integrated Suite (SSS-OSCAR, diagram). Components, connected through standard XML interfaces with authentication and communication services: Accounting, Allocation Management, Scheduler, Job Queue Manager, Process Manager, System & Job Monitor, Node State Manager, Node Configuration & Build Manager, Hardware Infrastructure Manager, Usage Reports, Checkpoint/Restart, Testing & Validation, Event Manager, Service Directory, and the Meta Services (Meta Scheduler, Meta Monitor, Meta Manager). Components can be written in any mixture of C, C++, Java, Perl, and Python.

13 Defined XML Interfaces
- Components interact by exchanging XML messages; the API is fully documented and publicly available.
- The API imposes no restrictions on the languages used: components can be any mixture of C, C++, Perl, Java, and Python.
- Multiple wire protocols are supported; a component can use one or more of the wire protocols supplied in the communication library: http(s), ssl, tcp, zlib, challenge authentication, and more. The set of wire protocols is extensible.
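
The sketch below illustrates the wire-protocol independence described above: the same (hypothetical) XML payload is sent once over plain TCP and once over a TLS-wrapped socket. Hosts, ports, certificate handling, and the message itself are assumptions; the project's communication library supplies its own protocol implementations.

    # Sketch of wire-protocol independence: the same XML payload travels over
    # plain TCP or over a TLS-wrapped socket. Hosts, ports, and the message are
    # illustrative assumptions; certificate setup is omitted for brevity.
    import socket
    import ssl

    PAYLOAD = b"<get-node-state/>"  # hypothetical message body

    def send_tcp(payload, host="localhost", port=5150):
        # Plain TCP: no transport security.
        with socket.create_connection((host, port)) as sock:
            sock.sendall(payload)
            sock.shutdown(socket.SHUT_WR)
            return sock.recv(65536)

    def send_ssl(payload, host="localhost", port=5151):
        # Same payload, wrapped in TLS for confidentiality and integrity.
        context = ssl.create_default_context()
        with socket.create_connection((host, port)) as raw_sock:
            with context.wrap_socket(raw_sock, server_hostname=host) as sock:
                sock.sendall(payload)
                return sock.recv(65536)

    if __name__ == "__main__":
        print(send_tcp(PAYLOAD))
        print(send_ssl(PAYLOAD))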

14 Fully Integrated Suite Released: SSS-OSCAR (Open Source Cluster Application Resources)
- Version 1.0 available at SC2004, as source and precompiled for Linux; components tested up to 5000 processors on an NCSA cluster.
- Quarterly updates through the rest of the project: improved robustness, support for more platforms, and added capabilities, filling out the architecture's components and XML interfaces.
- Leverages the popularity of the OSCAR distribution, which has been adopted by many cluster vendors and has tens of thousands of downloads; this raises our software suite's profile and availability.

15 Components in the Suite (diagram). Multiple component implementations exist: Gold and QBank (accounting/allocation), Maui sched (scheduler), Bamboo QM (queue manager, compliant with PBS and LoadLeveler job scripts), Warehouse with superMon and NWPerf (monitoring), NSM, PM, EM, SD, BCM, HIM, Usage Reports, a grid scheduler, Meta Manager, Meta Services, ssslib (communication library), BLCR (checkpoint/restart), and APITest.

16 Production Users
- Running the full suite in production for over a year: Argonne National Lab (200-node Chiba City cluster) and Ames Lab.
- Running one or more components in production: Pacific Northwest National Lab (11.4 TF cluster plus others) and NCSA.
- Running the full suite on development systems: most participants.
- In discussions with DOD-HPCMP sites about use of our scheduler and accounting components.

17 Adoption of the API
- The Maui scheduler now uses our API in both client and server: 3000 downloads/month; used on 75 of the top 100 supercomputers in the TOP500.
- The commercial Moab scheduler uses our API: Amazon.com, Boeing, Ford, Dow Chemical, Lockheed Martin, and more.
- New capabilities have been added to these schedulers because of the API: fairness, higher system utilization, improved response time.
- Discussions with Cray about leadership-class computers: Don Mason attended our meetings; Cray plans to use XML messages to connect their system components; we exchanged information on XML formats, API test software, and more.

18 Production Lessons Learned: This Approach Really Works!
- Components can use one another's data.
- Functionality only needs to be implemented once (e.g., broadcast of messages).
- Components are more robust, since each focuses on one task.
- Code volume shrinks because there is less duplicated functionality.
- It is easy to add new functionality: file staging, MPISH, and other new components built on a rich infrastructure of communication, logging, and location services.
- You need not be limited by the subcomponents of existing systems: you can replace just the functionality needed and solve the problem you want to solve without re-implementing everything, e.g., having the queue manager accept requests for rebuilt nodes before starting jobs.

19 View to the Future: hardware, CS, and science teams all contribute to the science breakthroughs (diagram). Ultrascale hardware (Rainer, Blue Gene, Red Storm) from the OS/HW teams underpins the leadership-class platforms; the SciDAC CS teams provide software, libraries, and a computing environment with a common look and feel across diverse hardware; the SciDAC science teams and research teams bring high-end science problems and tuned codes, leading to breakthrough science.

20 SciDAC Phase 2 and CS ISICs
- Future CS ISICs need to be mindful of the needs of the National Leadership Computing Facility, with Cray, IBM BG, SGI, and clusters running multiple operating systems; no one architecture is best for all applications.
- SciDAC science teams: needs depend on the application areas chosen. End stations? Do they have special software needs?
- FastOS research projects: complement, don't duplicate, these efforts.
- Cray software roadmap: making the leadership computers usable, efficient, and fast.

21 Gaps and Potential Next Steps
- Heterogeneous leadership-class machines: science teams need a robust environment that presents similar programming interfaces and tools across the different machines.
- Fault tolerance requirements in applications and systems software, particularly as systems scale up to petascale around 2010.
- Support for application users submitting interactive jobs: computational steering as a means of scientific discovery.
- High-performance file system and I/O research: increasing demands of security, scalability, and fault tolerance.
- Security: one-time passwords and their impact on scientific progress.

22 Heterogeneous Machines
- Heterogeneous architectures: vector, scalar, SMP, hybrids, clusters. How is a science team to know what is best for them?
- Multiple operating systems, even within one machine (e.g., Blue Gene, Red Storm). How to effectively and efficiently administer such systems?
- Diverse programming environment: science teams need a robust environment that presents similar programming interfaces and tools across the different machines.
- Diverse system management environment: managing and scheduling multiple node types; system updates, accounting, ... everything will be harder in round 2.

23 Fault Tolerance
- Holistic fault tolerance: research into schemes that take into account the full impact of faults on the application, middleware, OS, and hardware.
- Fault tolerance in systems software: research into prediction and prevention, and into survivability and resiliency when faults cannot be avoided.
- Application recovery: transparent failure recovery; research into intelligent checkpointing based on active monitoring, sophisticated rule-based recovery, diskless checkpointing, and more. For petascale systems, research into recovery without checkpointing.
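
As a back-of-envelope illustration of why checkpoint strategy matters at scale, the sketch below applies Young's classic approximation for the optimal checkpoint interval, sqrt(2 x checkpoint cost x system MTBF); the node count, per-node MTBF, and checkpoint cost are assumed numbers, not measurements from any of these machines.

    # Back-of-envelope checkpoint-interval estimate using Young's approximation:
    #     T_opt ~= sqrt(2 * checkpoint_cost * system_MTBF)
    # The inputs below are assumed, illustrative numbers, not measured values.
    import math

    def optimal_checkpoint_interval(checkpoint_cost_s, system_mtbf_s):
        # Young's first-order approximation for compute time between checkpoints.
        return math.sqrt(2.0 * checkpoint_cost_s * system_mtbf_s)

    if __name__ == "__main__":
        node_mtbf_hours = 5 * 365 * 24          # assume 5 years MTBF per node
        nodes = 10_000                          # assume a 10,000-node system
        system_mtbf_s = node_mtbf_hours * 3600 / nodes
        checkpoint_cost_s = 10 * 60             # assume 10 minutes per checkpoint

        interval = optimal_checkpoint_interval(checkpoint_cost_s, system_mtbf_s)
        print(f"System MTBF: {system_mtbf_s / 3600:.1f} h")
        print(f"Checkpoint roughly every {interval / 3600:.1f} h of computation")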

24 Interactive Computing
- Batch jobs are not always the best for science: good for large numbers of users and a wide mix of jobs, but the National Leadership Computing Facility has a different focus.
- Computational steering as a paradigm for discovery: break the cycle of simulate, dump results, analyze, rerun simulation; more efficient use of the computer resources.
- Needed for application development: scaling studies on terascale systems and debugging applications that only fail at scale.
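
A minimal sketch of the steering idea: between timesteps the simulation polls a small control file and applies parameter changes in place, instead of the dump-analyze-rerun cycle. The file name, format, and parameters are assumptions for illustration.

    # Minimal computational-steering sketch: between timesteps the simulation
    # polls a control file and applies any parameter changes, instead of the
    # usual simulate / dump / analyze / rerun cycle. File name and parameters
    # are illustrative assumptions.
    import json
    import os
    import time

    CONTROL_FILE = "steer.json"   # e.g. {"dt": 0.005, "viscosity": 0.9}

    def read_steering(params):
        # Merge any values found in the control file into the current parameters.
        if os.path.exists(CONTROL_FILE):
            with open(CONTROL_FILE) as f:
                try:
                    params.update(json.load(f))
                except json.JSONDecodeError:
                    pass  # ignore a half-written file; pick it up next step
        return params

    def run(steps=100):
        params = {"dt": 0.01, "viscosity": 1.0}
        state = 0.0
        for step in range(steps):
            params = read_steering(params)
            # Stand-in for one timestep of the real computation.
            state += params["dt"] * params["viscosity"]
            time.sleep(0.05)
            if step % 20 == 0:
                print(f"step {step}: state={state:.3f} params={params}")

    if __name__ == "__main__":
        run()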

25 File System and I/O Research
- Lustre is today's answer, but there are already concerns about its capabilities as systems scale up to 100+ TF. What is the answer for 2010? Research is needed to explore the file system and I/O requirements of the petascale systems that will arrive within five years.
- I/O continues to be a bottleneck in large systems: we are hitting the memory access wall on a node, and it is too expensive to scale I/O bandwidth with teraflops across nodes. Research is needed to understand how to structure applications or modify I/O so that applications run efficiently.
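
To make the bandwidth concern concrete, here is a small worked estimate of how long a full-memory checkpoint takes when node counts grow faster than aggregate I/O bandwidth; all sizes and bandwidths are assumed round numbers, not figures for any real system.

    # Worked estimate of full-memory checkpoint time:
    #     time = total memory / aggregate I/O bandwidth
    # All sizes and bandwidths below are assumed round numbers used only to
    # illustrate the scaling concern.
    def checkpoint_minutes(nodes, mem_per_node_gb, io_bandwidth_gbs):
        total_gb = nodes * mem_per_node_gb
        return total_gb / io_bandwidth_gbs / 60.0

    if __name__ == "__main__":
        for nodes, bw in [(1_000, 10), (10_000, 25), (100_000, 50)]:
            minutes = checkpoint_minutes(nodes, mem_per_node_gb=2, io_bandwidth_gbs=bw)
            print(f"{nodes:>7} nodes, {bw:>3} GB/s aggregate I/O: "
                  f"{minutes:6.1f} min per full-memory checkpoint")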

26 Security
- New, stricter access policies at computer centers: attacks on supercomputer centers have gotten worse; one-time passwords, PIV? Sites are shifting policies, tightening firewalls, and moving to SecurID tokens.
- Impact on scientific progress: collaborations within international teams, clearance delays for foreign nationals, access to data and computational resources.
- Advances required in systems software: allow compliance with different site policies while handling the tightest requirements; study how to reduce the impact on scientists.

27 Summary
The Scalable Systems Software SciDAC project is addressing the problem of systems software for terascale systems:
- A component architecture for systems software.
- Definitions of standard interfaces between components.
- An infrastructure to support component implementations within this framework.
- A set of component implementations that continues to improve.
The reference software suite has been released, with quarterly updates planned. The component architecture and some of the component implementations are in production use, which encourages the development of sharable tools and solutions; the ANL Blue Gene will run our suite.

28 Project Management (www.scidac.org/ScalableSystems)
- Quarterly face-to-face meetings; weekly working-group telecons.
- Four working groups: (1) node build, configuration, and information service; (2) resource management, scheduling, and allocation; (3) process management, system monitoring, and checkpointing; (4) validation and integration.
- Web-based project notebooks (over 300 pages and growing): a main notebook for general information and meeting notes, and individual notebooks for each working group.

