Vision for System and Resource Management of the Swiss-Tx class of Supercomputers
Josef Nemecek
ETH Zürich & Supercomputing Systems AG

SOS Workshop 2000 (New Orleans, LA)

Agenda
- The Supercomputer Lifecycle then and now
- The Swiss-T1 Management SW: COSMOS (Commodity Supercomputer Management Operating System)
  - The goals of COSMOS
  - The concept of COSMOS
  - Implementation of COSMOS
- Software Integration with existing Parts
- Roadmap of COSMOS

Supercomputers – Then and Now
- Development by vendor
  - Hardware was hand-made
  - Software was tailored for hardware
- Customers just had to order out of the vendor's catalogue
[Lifecycle diagram. Labels: Test, Manage, Need, Order. Cost: $$$]

Supercomputers – Then and Now
- System looks like a puzzle
  - Commodity parts, multiple vendors
  - Zoo of interacting software components
- Individual system management
  - Millions of lines of code (scripts, daemons)
[Lifecycle diagram. Labels: Simulation, Manage, Thought, Design, Architecture, Topology, Needs, Specification. Cost: $$$ & t]

COSMOS – Goals
- Integrated management for whole lifecycle
  - Design the supercomputer on-line
  - Simulate the supercomputer performance on-line
  - Build the designed and simulated supercomputer
  - Manage the built supercomputer
- Complete run-time system management
  - Fault-tolerance on all (or most) system levels
  - Remote manageability of the whole supercomputer
  - Low run-time overhead for the system management

COSMOS – Supercomputer Design
- Architecture selection
  - SAN technology
  - Nodes technology
- Topology selection
  - Every topology has its +/–
- Resource usage
  - Cost of the supercomputer
  - Space, electrical power
- Performance estimation
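The trade-off behind "every topology has its +/–" can be illustrated with a small sketch. It is not part of COSMOS; the class and function names are hypothetical, and the only inputs are the textbook link counts and diameters of three common interconnect topologies. Link count stands in for cost, and diameter (worst-case hop count) stands in for a crude latency estimate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TopologyEstimate:
    """Hypothetical record used only for this illustration."""
    name: str
    nodes: int
    links: int      # number of point-to-point links (rough cost proxy)
    diameter: int   # worst-case hop count (rough latency proxy)


def ring(n: int) -> TopologyEstimate:
    """Bidirectional ring of n nodes: n links, diameter floor(n / 2)."""
    if n < 3:
        raise ValueError("a ring needs at least 3 nodes")
    return TopologyEstimate("ring", n, n, n // 2)


def torus_2d(k: int) -> TopologyEstimate:
    """k x k torus: every node has 4 links, so 2 * k * k links in total."""
    if k < 3:
        raise ValueError("a torus needs at least 3 nodes per dimension")
    return TopologyEstimate(f"{k}x{k} torus", k * k, 2 * k * k, 2 * (k // 2))


def hypercube(d: int) -> TopologyEstimate:
    """d-dimensional hypercube: 2**d nodes, d * 2**(d - 1) links, diameter d."""
    if d < 1:
        raise ValueError("dimension must be at least 1")
    return TopologyEstimate(f"{d}-cube", 2 ** d, d * 2 ** (d - 1), d)


def compare_64_nodes() -> list[TopologyEstimate]:
    """Three ways to connect 64 nodes, ordered from fewest to most links."""
    return [ring(64), torus_2d(8), hypercube(6)]


if __name__ == "__main__":
    for t in compare_64_nodes():
        print(f"{t.name:12s} nodes={t.nodes} links={t.links} diameter={t.diameter}")
```

For 64 nodes the ring needs the fewest links but has the largest diameter, while the hypercube reverses that trade. A real design tool would also have to weigh switch cost, space, and electrical power, which this sketch ignores.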

COSMOS – Goals
- Single-system view of whole system
  - Allows one-point system management
  - Allows remote system management
- High availability of the system management
  - Allows high over-all system up-times
  - Allows dynamic configuration changes
- Modular software design
  - System-independent concept & design
  - Interfaces to existing management software modules

COSMOS – Concept
- Configuration: Control the system
- Monitoring: Observe the system
- Planning: When? Who? What?
- Security: Stability & independence
- Faults & Traps: Help the system
- Accounting: Charge the usage

- Complete, integrated system management
- Remote management from everywhere
- No administrative programming necessary

COSMOS – Implementation
System Management consists of:
- Node Management: State control and monitoring of the nodes, accounting
- SAN Management: SAN-dependent management and monitoring, accounting
- Process Management: Support of and co-operation with parallel environments such as MPI/FCI
- Resource Management: Priorities, allocation, queues
- Storage Management: Vendor-dependent storage management software
- LAN Management: SNMP-based management of used LAN components
- User Interface: User-privilege-based management and monitoring

COSMOS – Implementation
[Diagram. A Management Center running the COSMOS Center (shown three times); Nodes 0 to 3, each running a COSMOS Agent; Processes 0 to 7.]
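The center and agent layout on this slide can be sketched as a small data model. The slides do not describe the COSMOS protocol, so everything here is an assumption made for illustration: the `NodeReport` and `Center` names, the idea that agents send periodic reports, and the timeout rule for flagging a silent node.

```python
from dataclasses import dataclass, field


@dataclass
class NodeReport:
    """Hypothetical status message an agent might send to the center."""
    node_id: int
    running_processes: list[int]
    timestamp: float  # seconds, supplied by the caller to keep the sketch deterministic


@dataclass
class Center:
    """Hypothetical management center that keeps the latest report per node."""
    timeout: float = 10.0
    last_report: dict[int, NodeReport] = field(default_factory=dict)

    def receive(self, report: NodeReport) -> None:
        """Store the newest report, replacing any older one for the same node."""
        self.last_report[report.node_id] = report

    def silent_nodes(self, now: float) -> list[int]:
        """Nodes whose last report is older than the timeout (assumed fault rule)."""
        return sorted(
            node_id
            for node_id, report in self.last_report.items()
            if now - report.timestamp > self.timeout
        )

    def process_map(self) -> dict[int, list[int]]:
        """Single-system view: which processes were last seen on which node."""
        return {
            node_id: list(report.running_processes)
            for node_id, report in sorted(self.last_report.items())
        }


if __name__ == "__main__":
    center = Center(timeout=10.0)
    center.receive(NodeReport(node_id=0, running_processes=[0, 1], timestamp=0.0))
    center.receive(NodeReport(node_id=1, running_processes=[2, 3], timestamp=8.0))
    print(center.process_map())
    print(center.silent_nodes(now=12.0))
```

Passing the current time in explicitly, rather than reading a clock inside `silent_nodes`, keeps the fault rule easy to test. Several centers could hold the same table to avoid a single point of failure, but how COSMOS replicates its center is not stated in the slides.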

Gridware GRD/Codine
- Powerful resource management
  - Integrates resource and batch management
  - Ticket-based job scheduling scheme
  - Well-defined interfaces
- Some drawbacks at this moment
  - GRD/Codine is not topology-aware
  - GRD/Codine is a commercial product
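The general idea of ticket-based scheduling can be shown with a generic proportional-share sketch. This is not GRD/Codine's actual algorithm or API, which the slides do not describe; the function names and the "largest deficit first" rule are assumptions chosen for illustration. Each owner holds tickets, tickets translate into an entitled share of the machine, and the next job goes to the owner furthest below that share.

```python
def entitled_shares(tickets: dict[str, int]) -> dict[str, float]:
    """Fraction of the machine each owner is entitled to, proportional to tickets."""
    total = sum(tickets.values())
    if total <= 0:
        raise ValueError("at least one ticket must be assigned")
    return {owner: count / total for owner, count in tickets.items()}


def pick_next(tickets: dict[str, int], usage: dict[str, float]) -> str:
    """Return the owner with the largest gap between entitled and consumed share.

    `usage` maps owners to resources consumed so far (for example CPU hours).
    Ties are broken alphabetically so the result is deterministic.
    """
    shares = entitled_shares(tickets)
    total_usage = sum(usage.get(owner, 0.0) for owner in tickets)

    def deficit(owner: str) -> float:
        used = usage.get(owner, 0.0) / total_usage if total_usage else 0.0
        return shares[owner] - used

    return max(sorted(tickets), key=deficit)


if __name__ == "__main__":
    tickets = {"physics": 3, "chemistry": 1}
    print(entitled_shares(tickets))
    print(pick_next(tickets, {"physics": 50.0, "chemistry": 50.0}))
    print(pick_next(tickets, {"physics": 90.0, "chemistry": 10.0}))
```

With three tickets against one, "physics" is entitled to 75% of the machine, so it is picked when usage is split evenly and passed over once it has consumed 90%. A topology-aware scheduler would also need to check which nodes are free and how they are connected, which is the gap the slide attributes to GRD/Codine.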

COSMOS – Interaction with GRD/Codine
[Diagram. Labels, in order of appearance: System Management, Node Management, SAN Management, Process Management, Storage Management, LAN Management, User Interface, GRD/Codine, Node Monitoring, Process Monitoring, Resource Management, User Interface, Accounting, Resource Management.]

Roadmap of COSMOS Development
- Prototype release plan for COSMOS
  - 1Q2000: Centralised process and SAN management
  - 2Q2000: Distributed system management framework
  - 3Q2000: Complete non-interactive management
  - 4Q2000: Complete interactive management
- Interaction between COSMOS & GRD/Codine
  - Transfer of topology and configuration information
  - Exchange of monitoring information
