Cluster Operating System Support For Parallel Autonomic Computing
Andrzej M. Goscinski, J. Silcock, M. Hobbs
School of Information Technology, Deakin University, Geelong, Vic 3217, Australia

June 2004, COSET’2004

A Need for More than Execution Performance
Performance is a critical assessment criterion
Security, reliability, and ease of programming are neglected
Furthermore:
– Parallel computers are seen as user-unfriendly
– Parallel processing is not used on a daily basis
– Ordinary users have to be involved in programming activities that are operating-system in nature
– Ordinary engineers, managers, etc. do not have, and should not need, the specialized knowledge required to program operating-system-oriented activities

Aim of Our Research
IBM has launched a comprehensive program
– “to re-examine an obsession with faster, smaller, and more powerful”
– “to look at the evolution of computing from a more holistic perspective”
IBM’s Autonomic Computing is one of the Grand Challenges
Parallel processing on non-dedicated clusters could benefit from the Autonomic Computing vision
Aim: to show a general design of services and an initial implementation of a system that moves parallel processing on clusters into the computing mainstream, following the Autonomic Computing vision

IBM’s Autonomic Computing
The name “autonomic” has not caught on everywhere, if only because it is IBM’s
– Microsoft: “trustworthy”
– Others prefer a more generic term: “self-managing”
Many see “autonomic computing” as one of the basic parts of a revolutionary technology that
– will start the new .com boom
– will move parallel computing on clusters into the computing mainstream

IBM’s Autonomic Computing
Characteristics of autonomic computing systems; such a system
– knows itself
– configures and reconfigures itself under varying and unpredictable conditions
– optimizes its working
– performs something akin to healing
– provides self-protection
– knows its surrounding environment
– exists in an open (non-hermetic) environment
– anticipates the optimized resources needed while keeping its complexity hidden

Related Work
A number of projects related to Autonomic Computing are listed on the IBM website
While many of the reported projects address some aspects of Autonomic Computing, none aims to develop a system that has all eight required characteristics
None of the projects addresses parallel processing, in particular parallel processing on non-dedicated clusters

Design of Autonomic Elements (Services) Providing Autonomic Computing on Non-dedicated Clusters
We have proposed and designed a set of autonomic elements that must be provided to develop an autonomic computing environment on a non-dedicated cluster
Three component levels:
– Services
– Computers
– Non-dedicated cluster
Note: we have not addressed
– Hardware aspects
– Administration aspects

Cluster Knows Itself
There is a need for resource discovery
This autonomic element runs on each computer
Activities:
– Acquires knowledge of static parameters of computers
  – processor type and speed
  – memory size
  – available software
– Acquires knowledge of dynamic parameters of clusters
  – computers’ load
  – available memory
  – communication pattern and volume
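The split between static and dynamic parameters can be pictured as two kinds of records held in a cluster-wide view. The Python sketch below is only an illustration under assumed names: `StaticInfo`, `DynamicInfo`, `ClusterView`, and all field names and units are hypothetical and do not describe the Holos data structures.

```python
from dataclasses import dataclass, field

# Hypothetical record layout: field names and units are illustrative only.


@dataclass(frozen=True)
class StaticInfo:
    """Parameters that rarely change; gathered once when a computer joins."""
    processor_type: str
    cpu_speed_mhz: int
    memory_mb: int
    software: frozenset = frozenset()


@dataclass
class DynamicInfo:
    """Parameters the resource discovery element re-samples periodically."""
    load: float = 0.0            # e.g. CPU utilisation in [0, 1]
    free_memory_mb: int = 0
    msgs_per_sec: float = 0.0    # communication volume


@dataclass
class ClusterView:
    """Cluster-wide knowledge assembled from every computer's reports."""
    static: dict = field(default_factory=dict)
    dynamic: dict = field(default_factory=dict)

    def register(self, host: str, info: StaticInfo) -> None:
        self.static[host] = info
        self.dynamic.setdefault(host, DynamicInfo())

    def update(self, host: str, sample: DynamicInfo) -> None:
        if host not in self.static:
            raise KeyError(f"unknown computer {host!r}")
        self.dynamic[host] = sample

    def hosts_with(self, package: str) -> list:
        """Which computers have a given software package installed?"""
        return sorted(h for h, s in self.static.items() if package in s.software)


view = ClusterView()
view.register("c1", StaticInfo("x86", 2400, 1024, frozenset({"pvm"})))
view.register("c2", StaticInfo("x86", 1800, 512))
view.update("c1", DynamicInfo(load=0.7, free_memory_mb=300, msgs_per_sec=12.0))
print(view.hosts_with("pvm"))  # ['c1']
```

Static records are frozen because they are collected once; dynamic samples are simply replaced on every update.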

Resource Discovery Service Design
[Figure: a Resource Discovery element on each computer (computer i, computer j) monitors the CPU, main memory, and computation elements, and gathers computational load and parameters, local and remote communication load, and the communication pattern.]

Cluster Configures and Reconfigures Itself under Varying and Unpredictable Conditions
In a non-dedicated cluster there are times when
– some computers are lightly loaded or idle
– some computers cannot be used because
  – their owners have removed them from the shared pool of resources
  – they are heavily loaded
To offer high availability, i.e., to configure and reconfigure itself, the system
– forms parallel virtual clusters adaptively and dynamically
– bases this formation on load and changing resources
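One way to read "forming parallel virtual clusters adaptively" is as a selection that is re-run whenever resource discovery reports new loads. The sketch below assumes a hypothetical `form_virtual_cluster` function, a fixed load threshold, and a least-loaded-first policy. None of these come from the Holos design.

```python
# Minimal sketch of adaptive virtual-cluster formation. The load threshold,
# the "withdrawn" set and the least-loaded-first policy are assumptions made
# for illustration; they are not taken from the Holos sources.

def form_virtual_cluster(loads, withdrawn, wanted, max_load=0.8):
    """Pick up to `wanted` computers for a parallel virtual cluster.

    loads      -- dict host -> current load in [0, 1] (from resource discovery)
    withdrawn  -- hosts whose owners have removed them from the shared pool
    wanted     -- number of computers the application asks for
    max_load   -- hosts above this load are treated as unusable
    """
    usable = [h for h, load in loads.items()
              if h not in withdrawn and load <= max_load]
    # Prefer idle or lightly loaded computers; break ties by name for determinism.
    usable.sort(key=lambda h: (loads[h], h))
    return usable[:wanted]


# Re-running the function as loads change yields the clusters formed at t0 < t1.
t0 = form_virtual_cluster({"c1": 0.1, "c2": 0.9, "c3": 0.0, "c4": 0.4}, set(), 3)
t1 = form_virtual_cluster({"c1": 0.85, "c2": 0.2, "c3": 0.0, "c4": 0.4}, {"c4"}, 3)
print(t0)  # ['c3', 'c1', 'c4']
print(t1)  # ['c3', 'c2']
```

At t1 the cluster shrinks: c1 has become heavily loaded and c4 has been withdrawn by its owner.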

Availability Service Design
[Figure: the Availability Services, using Resource Discovery (RD) information, form a Virtual Parallel Cluster at each of the times t0 < t1 < t2 < t3.]

Cluster Should Optimize Its Working
Application computation elements should be placed optimally
To improve performance, placement must take into account
– computation load
– available memory
– communication costs
To optimize the cluster’s working there is
– static allocation and load balancing
– the ability to change performance indices that reflect user objectives
– computation element migration, creation, and duplication
– setting of the computation priorities of applications
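Static allocation decides where each computation element starts. The sketch below swaps in the textbook longest-processing-time-first greedy heuristic as an illustration. Both the heuristic and the `static_allocation` name are assumptions, and the sketch considers computation load only, not memory or communication costs.

```python
import heapq


def static_allocation(process_costs, host_loads):
    """Place processes before they start, longest-processing-time first.

    process_costs -- dict process -> estimated computation load
    host_loads    -- dict host -> load already present on that computer
    Returns a dict process -> host. Memory and communication costs are ignored.
    """
    heap = [(load, host) for host, load in host_loads.items()]
    heapq.heapify(heap)
    placement = {}
    # Biggest processes first; ties broken by name so the result is repeatable.
    for proc in sorted(process_costs, key=lambda p: (-process_costs[p], p)):
        load, host = heapq.heappop(heap)          # currently least-loaded host
        placement[proc] = host
        heapq.heappush(heap, (load + process_costs[proc], host))
    return placement


placement = static_allocation({"P1": 4, "P2": 3, "P3": 2, "P4": 1},
                              {"C1": 0, "C2": 0, "C3": 1})
print(placement)  # {'P1': 'C1', 'P2': 'C2', 'P3': 'C3', 'P4': 'C2'}
```

As in the design slide's notation {Pi, Pj} → Cn, two processes can end up sharing a computer (here P2 and P4 on C2).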

High Performance Service Design
[Figure: within a Virtual Parallel Cluster of computers C1 … Cn provided by the Availability Services, the Global Scheduler performs Static Allocation (where: P1 → C1, P2 → C2, …, {Pi, Pj} → Cn) and Load Balancing through Migration (where, which, when: Pi: Cn → C3).]
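The "where, which, when" of load balancing can be illustrated with a simple sender-initiated policy. The hypothetical `plan_migration` function below migrates a process only when the gap between the busiest and the idlest computer exceeds a threshold. The policy and the threshold value are assumptions; the Holos global scheduler's actual rules are not given here.

```python
def plan_migration(placement, costs, hosts, threshold=2.0):
    """Sender-initiated load balancing: decide when, which and where.

    placement -- dict process -> host it currently runs on
    costs     -- dict process -> load that process contributes
    hosts     -- all computers of the virtual cluster (idle ones included)
    Returns (process, from_host, to_host), or None if no move is worthwhile.
    """
    loads = dict.fromkeys(hosts, 0.0)
    for proc, host in placement.items():
        loads[host] += costs[proc]
    busiest = max(loads, key=lambda h: (loads[h], h))
    idlest = min(loads, key=lambda h: (loads[h], h))
    gap = loads[busiest] - loads[idlest]
    if gap <= threshold:                      # when: only if imbalance is large
        return None
    # which: a process on the busiest host whose cost is closest to half the gap,
    # so moving it evens the two hosts out; cost < gap keeps the move an improvement.
    candidates = [p for p, h in placement.items() if h == busiest and costs[p] < gap]
    if not candidates:
        return None
    proc = min(candidates, key=lambda p: (abs(gap / 2 - costs[p]), p))
    return proc, busiest, idlest              # where: the least-loaded host


move = plan_migration({"P1": "C1", "P2": "C1", "P4": "C1", "P3": "C2"},
                      {"P1": 3, "P2": 2, "P3": 1, "P4": 1},
                      ["C1", "C2", "C3"])
print(move)  # ('P1', 'C1', 'C3')
```

Requiring the moved cost to be smaller than the gap guarantees that the busier of the two computers ends up less loaded than before, so the policy cannot oscillate on a single pair.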

Cluster Should Perform Something Akin to Healing
Hardware and software faults can occur
Failures lead to the termination of computations
To provide something akin to healing
– faults are identified and reported
– checkpointing of the parallel computation elements of applications is provided
– recovery from failures is employed
– applications are migrated automatically from faulty computers to healthy ones
– redundant/replicated services are provided

Self-Healing Service Design
[Figure: coordinated checkpointing saves the checkpoint of computation element i to disk; after a crash, recovery restores computation element i from that checkpoint. Computers shown: C1, C2, Cj, Ck.]
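Coordinated checkpointing saves a consistent set of states for all computation elements. Recovery then rolls every element back to that set and restarts the ones that lived on the failed computer elsewhere. The sketch below makes simplifying assumptions: elements are paused while the checkpoint is taken, so no messages are in flight, and an in-memory dict stands in for the disk. `CheckpointCoordinator` is a hypothetical name, not a Holos server interface.

```python
import copy


class CheckpointCoordinator:
    """Blocking coordinated checkpointing, simplified: elements are assumed to be
    paused while a checkpoint is taken, so no messages are in transit, and an
    in-memory dict stands in for stable storage on disk."""

    def __init__(self):
        self.stable_storage = {}   # checkpoint number -> {element: (host, state)}
        self.latest = None

    def checkpoint(self, elements):
        """elements: dict name -> {"host": str, "state": dict}."""
        # Snapshot every element first, then commit the whole set at once,
        # so storage never holds a partial (inconsistent) global checkpoint.
        snapshot = {name: (e["host"], copy.deepcopy(e["state"]))
                    for name, e in elements.items()}
        number = 0 if self.latest is None else self.latest + 1
        self.stable_storage[number] = snapshot
        self.latest = number
        return number

    def recover(self, failed_host, healthy_hosts):
        """Roll every element back to the latest checkpoint; elements that ran on
        the failed computer are restarted round-robin on healthy computers."""
        if self.latest is None:
            raise RuntimeError("no checkpoint to recover from")
        restored, spare = {}, 0
        for name, (host, state) in sorted(self.stable_storage[self.latest].items(),
                                          key=lambda item: item[0]):
            if host == failed_host:
                host = healthy_hosts[spare % len(healthy_hosts)]
                spare += 1
            restored[name] = {"host": host, "state": copy.deepcopy(state)}
        return restored


elements = {"e1": {"host": "C1", "state": {"i": 10}},
            "e2": {"host": "C2", "state": {"i": 7}}}
coordinator = CheckpointCoordinator()
coordinator.checkpoint(elements)
elements["e1"]["state"]["i"] = 99        # work done after the checkpoint
recovered = coordinator.recover("C1", ["Cj", "Ck"])
print(recovered["e1"])  # {'host': 'Cj', 'state': {'i': 10}}
```

Work done after the last checkpoint (here `i = 99`) is lost. This is the usual price of rollback recovery and the reason checkpoints are taken periodically.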

Clusters Should Provide Self-Protection
Computation elements of parallel applications are distributed
Computation elements communicate using messages
These messages are subject to passive and active attacks
To provide self-protection:
– virus detection and recovery must be offered
– resource protection should be a mandatory service
– encryption should be used as a countermeasure against passive attacks
– authentication should be used as a countermeasure against active attacks
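Message authentication against active attacks (forgery, modification, replay) can be sketched with Python's standard `hmac` module. This covers only the authentication half of the slide. Encryption against passive attacks would need a cipher, which the Python standard library does not provide. The sketch assumes that the communicating computation elements already share a secret key, and it does not show key distribution. `seal_message` and `open_message` are hypothetical names.

```python
import hashlib
import hmac
import json


def seal_message(key: bytes, sender: str, seq: int, payload: dict) -> dict:
    """Attach an HMAC-SHA256 tag so the receiver can detect forgery or tampering."""
    body = json.dumps({"from": sender, "seq": seq, "data": payload}, sort_keys=True)
    tag = hmac.new(key, body.encode(), hashlib.sha256).hexdigest()
    return {"body": body, "mac": tag}


def open_message(key: bytes, message: dict, last_seq: int) -> dict:
    """Verify the tag and the sequence number before trusting the content."""
    expected = hmac.new(key, message["body"].encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, message["mac"]):
        raise ValueError("authentication failed: message forged or modified")
    body = json.loads(message["body"])
    if body["seq"] <= last_seq:               # an old, replayed message
        raise ValueError("replayed message")
    return body


key = b"hypothetical key shared by P1 and P2"   # assumed pre-shared secret
msg = seal_message(key, "P1", 1, {"row": 3, "sum": 42})
print(open_message(key, msg, last_seq=0)["data"])  # {'row': 3, 'sum': 42}
```

`hmac.compare_digest` is used instead of `==` so that the comparison time does not leak how many characters of a forged tag were correct.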

To Allow a System to Know Its Surrounding Environment and to Prevent a System from Existing in a Hermetic Environment
There are applications that require
– more computation power
– specialized software
– unique peripheral devices, etc.
Many owners cannot afford such resources
Some owners can offer their services and resources to appropriate users

To Allow a System to Know Its Surrounding Environment and to Prevent a System from Existing in a Hermetic Environment
To benefit from existing unique resources
– resource discovery of other clusters is provided
– services are advertised
– systems are able to cooperate
– negotiation is used
– brokerage of resources and services is used
– resources are shared in a distributed manner
– “the move toward a grid” should be in place

Grid-like Service Design
[Figure: the Brokerage Services of each cluster (Cluster 1, Cluster 2, Cluster 3, …, Cluster n) handle advertisement, exporting, withdrawal, and import requests for computational, storage/memory, printer, and information services, and cooperate with the Brokerage Services of other clusters.]
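The brokerage operations in the figure (exporting or advertising, withdrawal, and import requests) can be sketched as a small registry. In the sketch below, `Offer`, `Broker`, and the numeric-attribute matching rule are assumptions for illustration. Negotiation and communication between the brokers of different clusters are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Offer:
    cluster: str        # exporting cluster
    service: str        # e.g. "computation", "storage", "printer", "information"
    attributes: tuple   # (name, numeric value) pairs, e.g. (("cpus", 64),)


class Broker:
    """Hypothetical broker: clusters export (advertise) and withdraw offers,
    and import services by matching a request against current offers."""

    def __init__(self):
        self._offers = set()

    def export(self, offer: Offer) -> None:
        self._offers.add(offer)

    def withdraw(self, offer: Offer) -> None:
        self._offers.discard(offer)

    def import_request(self, service: str, requester: str, **needs) -> list:
        """Offers of `service` from other clusters meeting every numeric need."""
        matches = []
        for offer in self._offers:
            if offer.service != service or offer.cluster == requester:
                continue
            attrs = dict(offer.attributes)
            if all(attrs.get(name, 0) >= value for name, value in needs.items()):
                matches.append(offer)
        return sorted(matches, key=lambda o: o.cluster)


broker = Broker()
big = Offer("cluster2", "computation", (("cpus", 64),))
small = Offer("cluster3", "computation", (("cpus", 8),))
broker.export(big)
broker.export(small)
print([o.cluster for o in broker.import_request("computation", "cluster1", cpus=16)])
# ['cluster2']
broker.withdraw(big)
print(broker.import_request("computation", "cluster1", cpus=16))  # []
```

Offers are immutable and hashable, so withdrawing an advertisement is just removing the same value from the set.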

A Cluster Should Anticipate the Optimized Resources Needed While Keeping Its Complexity Hidden
The scarcity of software that assists ordinary programmers limits how far the computing power of non-dedicated clusters can be harnessed
This implies
– a programming environment that is simple to use
– no need for knowledge of resource distribution
– transparent support for both message-passing and shared-memory programming

Easy Programming Service Design
[Figure: components of the design: a Programming Environment offering Message Passing or PVM/MPI, and Shared Memory via DSM; System Services of an Operating System; Kernel Services of an Operating System; Communication Primitives.]

The Holos Services for Autonomic Computing Clusters
Holos is built to demonstrate that it is possible to develop an autonomic non-dedicated cluster that
– could be routinely employed by ordinary engineers, managers, etc.
– is able to support next-generation application software executing on clusters
We followed IBM’s vision recommendations regarding autonomic elements
We decided to view autonomic elements as processes
– each computer is a multi-process system with its own objectives
– a cluster is a set of multi-process systems with its own objectives

Holos
[Figure: Holos architecture: system servers and kernel servers running on the GENESIS microkernel, with parallel processes (MP/PVM/MPI processes and DSM processes) above them. Servers shown: Global Scheduler, Execution, Migration, Checkpoint, Resource Discovery, DSM, Brokerage, IPC, Process Manager, and Space Manager.]
Holos was developed based on the P2P and microkernel paradigms
The microkernel provides services such as
– local IPC
– basic paging operations
– interrupt handling
– context switching
Three groups of processes:
– kernel servers
– system servers
– application processes
Kernel and system servers are stationary; application processes are mobile
All processes communicate using messages
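The principle that stationary servers are processes interacting only through messages can be illustrated with a send/receive/reply pattern. In the sketch below, Python threads stand in for processes and `queue.Queue` mailboxes stand in for microkernel IPC. `Server`, `call`, and the request format are hypothetical and are not the GENESIS or Holos IPC interface.

```python
import queue
import threading


class Server(threading.Thread):
    """A stationary server: owns a mailbox and handles one request at a time.
    Threads and queue.Queue stand in for processes and microkernel IPC."""

    def __init__(self, name, handler):
        super().__init__(name=name, daemon=True)
        self.mailbox = queue.Queue()
        self.handler = handler

    def run(self):
        while True:
            request, reply_to = self.mailbox.get()
            if request is None:                  # shutdown message
                break
            reply_to.put(self.handler(request))


def call(server, request, timeout=1.0):
    """Send a request message and block until the reply message arrives."""
    reply_box = queue.Queue(maxsize=1)
    server.mailbox.put((request, reply_box))
    return reply_box.get(timeout=timeout)


loads = {"c1": 0.7, "c2": 0.1}                   # hypothetical discovery data
discovery = Server("resource-discovery", lambda req: loads.get(req["host"]))
discovery.start()
answer = call(discovery, {"host": "c2"})
print(answer)  # 0.1
discovery.mailbox.put((None, None))              # ask the server to stop
discovery.join()
```

Because the server owns its data and is reached only through its mailbox, it serializes requests without any locks, which mirrors why message-based servers are easy to reason about.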

System Servers Form a Basis of an Autonomic Operating System for Non-dedicated Clusters
Resource Discovery Server: collects data about computation and communication load
Availability Server: dynamically and adaptively forms a parallel virtual cluster for the application
Global Scheduling Server: maps application processes onto the computers of the virtual parallel cluster using static allocation and dynamic load balancing

System Servers Form a Basis of an Autonomic Operating System for Non-dedicated Clusters
Execution Server: coordinates the single, multiple, and group creation and duplication of application processes on both local and remote computers
Migration Server: coordinates moving application processes to other computers
DSM Server: hides the distributed nature of the cluster’s memory and allows code to be written as though physically shared memory were available
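A common way for a DSM layer to give the illusion of shared memory is page-based write-invalidate coherence: many computers may hold read-only copies of a page, and a write first invalidates all other copies. The single-process simulation below substitutes that textbook protocol for illustration. It is not the Holos DSM server's implementation, and `InvalidationDSM` is a hypothetical name.

```python
class InvalidationDSM:
    """Page-based DSM with single-writer/multiple-reader write-invalidate
    coherence, simulated in one process. Reading a never-written page raises
    KeyError in this sketch."""

    def __init__(self, nodes):
        self.cache = {n: {} for n in nodes}   # node -> {page: value}
        self.owner = {}                        # page -> node with the latest copy
        self.copyset = {}                      # page -> nodes holding a copy

    def read(self, node, page):
        if page not in self.cache[node]:       # read fault: fetch from the owner
            owner = self.owner[page]
            self.cache[node][page] = self.cache[owner][page]
            self.copyset[page].add(node)
        return self.cache[node][page]

    def write(self, node, page, value):
        # write fault: invalidate every other copy, then take ownership
        for other in self.copyset.get(page, set()) - {node}:
            del self.cache[other][page]
        self.owner[page] = node
        self.copyset[page] = {node}
        self.cache[node][page] = value


dsm = InvalidationDSM(["A", "B"])
dsm.write("A", 0, 10)
first = dsm.read("B", 0)     # B faults and fetches a copy from A
dsm.write("A", 0, 11)        # B's copy is invalidated
second = dsm.read("B", 0)    # B faults again and sees the new value
print(first, second)  # 10 11
```

The program only ever calls `read` and `write` on page numbers. Where the page lives, and which copies must be discarded, stays hidden inside the DSM layer, which is the transparency the slide describes.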

System Servers Form a Basis of an Autonomic Operating System for Non-dedicated Clusters
Checkpoint Server: coordinates the creation of checkpoints for an executing application
Fault Recovery Server: recovers application processes or whole applications using checkpoints
IAC Server: supports remote interprocess communication and group communication within sets of application processes
Brokerage Server: supports advertising and sharing services through service exporting, importing, and revoking

Holos Possesses the Autonomic Computing Characteristics
Autonomic computing requirement: cooperating Holos servers (relationships among autonomic elements)
– A system must know itself: Resource Discovery Server
– A system must configure and reconfigure itself under varying and unpredictable conditions: Resource Discovery Server, Global Scheduling Server, Migration Server, Execution Server, and Availability Server
– A system must optimize its working: Global Scheduling Server, Migration Server, and Execution Server
– A system must perform something akin to healing: Checkpoint Server, Recovery Server, Migration Server, and Global Scheduling Server
– A system must provide self-protection: capabilities in the form of system names
– A system must know its surrounding environment: Resource Discovery Server and Brokerage Server
– A system cannot exist in a hermetic environment: Interprocess Communication Server and Brokerage Server
– A system must anticipate the optimized resources needed while keeping its complexity hidden (most critical for the user): DSM Server, Execution Server, DSM programming environment, message-passing programming environment, and PVM/MPI programming environment

Conclusion
Autonomic computing has been shown to be a basic part of a revolutionary technology that
– could move parallel computing on non-dedicated clusters into the computing mainstream
– (will start the new .com boom: yet to be shown)
The development of the Holos cluster operating system demonstrates that it is possible to build an autonomic non-dedicated cluster
The Holos cluster operating system has been built from scratch
