Scalable Systems Software Project
Presented by Al Geist
Computer Science Research Group
Computer Science and Mathematics Division
Research supported by the Department of Energy's Office of Science, Office of Advanced Scientific Computing Research

The team
Coordinator: Al Geist
Participating organizations:
- DOE laboratories
- Vendors
- NSF supercomputer centers
Open to all, like the MPI Forum

The problem
System administrators and managers of terascale computer centers are facing a crisis:
- Computer centers use incompatible, ad hoc sets of systems tools.
- Present tools are not designed to scale to multiteraflop systems; tools must be rewritten.
- Commercial solutions are not happening because business forces drive industry toward servers, not high-performance computing.

Scope of the effort
Improve productivity of both users and system administrators:
- System build and configure
- Job management
- System monitoring
- Resource and queue management
- Accounting and user management
- Allocation management
- Fault tolerance
- Security
- Checkpoint/restart

Impact
Reduced facility management costs:
- Reduce duplication of effort in rewriting components
- Reduce need to support ad hoc software
- Better systems tools available
- Able to get machines up and running faster and keep them running
- Especially important for the LCF
More effective use of machines by scientific applications:
- Scalable launch of jobs and checkpoint/restart
- Job monitoring and management tools
- Allocation management interface
Fundamentally change the way future high-end systems software is developed and distributed.

System software architecture
Architecture diagram; components shown: access control security manager (interacts with all components), meta scheduler, meta monitor, meta manager, scheduler, system monitor, node configuration and build manager, queue manager, job manager and monitor, checkpoint/restart, accounting, allocation management, usage reports, user database, user utilities, application environment, data migration, file system, high-performance communication and I/O, testing and validation.

Highlights
- Designed modular architecture: allows a site to plug and play what it needs
- Defined XML interfaces: independent of language and wire protocol (illustrated in the sketch below)
- Reference implementation released: version 1.0 released at SC2005
- Production users: ANL, Ames, PNNL, NCSA
- Adoption of API: Maui (3,000 downloads/month), Moab (Amazon.com, Ford, …)
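As a rough illustration of what a language-neutral XML interface can look like, the following Python sketch builds and parses a small message. The element names (add-job, user, nodes, executable) are hypothetical stand-ins, not the project's actual schema; the point is only that any language with an XML parser can speak such an interface.

    # A minimal sketch of a component-to-component XML message.
    # The element names here are illustrative assumptions, not the SSS schema.
    import xml.etree.ElementTree as ET

    # Hypothetical request from a user utility to the job queue manager.
    request = ET.Element("add-job")
    ET.SubElement(request, "user").text = "jsmith"
    ET.SubElement(request, "nodes").text = "64"
    ET.SubElement(request, "executable").text = "/home/jsmith/a.out"

    wire_text = ET.tostring(request, encoding="unicode")
    print(wire_text)
    # <add-job><user>jsmith</user><nodes>64</nodes><executable>/home/jsmith/a.out</executable></add-job>

    # The receiving component, written in any language, only needs an XML parser.
    parsed = ET.fromstring(wire_text)
    print(parsed.find("nodes").text)  # 64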

Progress on integrated suite
Components can be written in any mixture of C, C++, Java, Perl, and Python; they communicate through standard XML interfaces (a minimal sketch follows below).
Suite diagram components: service directory, event manager, authentication, communication, meta services (meta scheduler, meta monitor, meta manager), node state manager, scheduler, job queue manager, process manager, checkpoint/restart, system and job monitor, node configuration and build manager, hardware infrastructure manager, allocation management, accounting, usage reports, testing and validation. Packaged as SSS-OSCAR.
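To make the mix-of-languages point concrete, here is a minimal Python sketch of one way a component might announce itself to the service directory over a socket. The service-directory address, the register element and its attributes, and the newline framing are illustrative assumptions, not the suite's actual wire protocol.

    # Sketch: a component registers itself with the service directory.
    # Host, port, message format, and framing are assumed for illustration only.
    import socket
    import xml.etree.ElementTree as ET

    SD_HOST, SD_PORT = "localhost", 5000  # hypothetical service-directory address

    def register(component_name, host, port):
        """Send a one-shot XML registration message and return the directory's reply."""
        msg = ET.Element("register")
        msg.set("component", component_name)
        msg.set("host", host)
        msg.set("port", str(port))
        payload = ET.tostring(msg, encoding="unicode") + "\n"  # newline-framed (assumed)

        with socket.create_connection((SD_HOST, SD_PORT), timeout=5) as sock:
            sock.sendall(payload.encode("utf-8"))
            return sock.recv(4096).decode("utf-8")

    if __name__ == "__main__":
        print(register("job-queue-manager", "node0", 7601))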

Production users
Running a full suite in production for over a year:
- Argonne National Laboratory: 200-node Chiba City and BG/L
- Ames Laboratory
Running one or more components in production:
- Pacific Northwest National Laboratory: 11.4-TF cluster and others
- NCSA
Running a full suite on development systems:
- Most participants
Discussions with DOD-HPCMP sites:
- Use of our scheduler and accounting components

Adoption of API
Maui scheduler now uses our API in client and server:
- 3,000 downloads/month
- 75 of the top 100 supercomputers in the Top 500
Commercial Moab scheduler uses our API:
- Amazon.com, Boeing, Ford, Dow Chemical, Lockheed Martin, and more
New capabilities added to schedulers due to the API:
- Fairness, higher system utilization, improved response time
Discussions with Cray about leadership-class computers:
- Don Mason attended our meetings
- Plan to use XML messages to connect their system components
- Exchanged information on XML format, API test software, use of our scheduler and accounting components, and more

Contact
Al Geist
Computer Science Research Group
Computer Science and Mathematics Division
(865)