Cluster Reliability Project
ISIS, Vanderbilt University

Purpose
Put together a set of software tools that allows us to easily create “pluggable” components that:
–Monitor our clusters (hardware, utilization, errors, jobs, etc.)
–Recognize anomalous conditions (inference rules, model comparison, probabilities)
–Take actions to correct or alleviate the problems (triggered by other components)
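As a rough illustration of the monitor / recognize / act split described above, here is a minimal sketch. It is not the project's design: the names `Reading`, `Pipeline`, `add_monitor`, `add_rule`, and `add_reactor`, the fake temperature monitor, and the 85 °C threshold are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Reading:
    """One measurement reported by a monitor (hypothetical structure)."""
    node: str
    metric: str
    value: float


# Assumed plug-in shapes: a monitor returns readings, a rule flags an
# anomalous reading, and a reactor returns a description of its action.
Monitor = Callable[[], List[Reading]]
Rule = Callable[[Reading], bool]
Reactor = Callable[[Reading], str]


class Pipeline:
    """Wires pluggable monitors, rules, and reactors together."""

    def __init__(self) -> None:
        self.monitors: List[Monitor] = []
        self.rules: Dict[str, Rule] = {}
        self.reactors: Dict[str, List[Reactor]] = {}

    def add_monitor(self, monitor: Monitor) -> None:
        self.monitors.append(monitor)

    def add_rule(self, name: str, rule: Rule) -> None:
        self.rules[name] = rule
        self.reactors.setdefault(name, [])

    def add_reactor(self, rule_name: str, reactor: Reactor) -> None:
        self.reactors.setdefault(rule_name, []).append(reactor)

    def run_once(self) -> List[str]:
        """Poll every monitor once and trigger reactors for flagged readings."""
        actions: List[str] = []
        for monitor in self.monitors:
            for reading in monitor():
                for name, rule in self.rules.items():
                    if rule(reading):
                        for reactor in self.reactors[name]:
                            actions.append(reactor(reading))
        return actions


def fake_temperature_monitor() -> List[Reading]:
    # Stand-in for a real sensor query; the values are made up.
    return [
        Reading("node01", "cpu_temp_c", 62.0),
        Reading("node02", "cpu_temp_c", 91.5),
    ]


pipeline = Pipeline()
pipeline.add_monitor(fake_temperature_monitor)
pipeline.add_rule("overheat", lambda r: r.metric == "cpu_temp_c" and r.value > 85.0)
pipeline.add_reactor("overheat", lambda r: f"drain {r.node}")
ACTIONS = pipeline.run_once()
print(ACTIONS)
```

Keeping each stage a plain callable means a new monitor or reactor can be registered without touching the others, which is one way to read "pluggable" here.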

Goals
–Increase the availability, utilization, and reliability of the computing clusters
–Reduce the time it takes to diagnose problems
–Reduce the administrative workload associated with operating the clusters
–Automate routine administration tasks
–Monitor the use of the clusters
–Allow other systems and user scripts to easily interact with this system

Some basic features include:
–Identifying types of information that must be collected or communicated
–APIs for communicating information and the creation of monitors and reactors
–Popular programming language bindings
–An environment for writing, testing, and releasing new reactors and monitors
–A basic set of problems that must be addressed, the monitors, reactors and data that they will communicate
–Recording of monitoring information and actions taken
–Administration tools that allow for single-point release distribution and installation, and control of the runtime environment
–A configuration system that allows for uniform parameter setting and allows for tuning to adjust the performance impact on the system
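The last feature, uniform parameter setting with a tuning knob for performance impact, could be sketched as below. This is only an assumption about how such a configuration layer might look: the parameter names (`interval_s`, `enabled`, `log_actions`), the component names, and the `scale` factor are hypothetical.

```python
from collections import ChainMap
from typing import Any, Dict

# Hypothetical defaults applied uniformly to every monitor and reactor.
DEFAULTS: Dict[str, Any] = {"interval_s": 60, "enabled": True, "log_actions": True}

# Hypothetical per-component overrides.
OVERRIDES: Dict[str, Dict[str, Any]] = {
    "disk_monitor": {"interval_s": 300},
    "temp_monitor": {"interval_s": 10},
}


def settings_for(component: str, scale: float = 1.0) -> Dict[str, Any]:
    """Merge overrides over defaults, then stretch the polling interval.

    A scale above 1.0 polls less often, lowering the load that
    monitoring places on the cluster.
    """
    cfg = dict(ChainMap(OVERRIDES.get(component, {}), DEFAULTS))
    cfg["interval_s"] = int(cfg["interval_s"] * scale)
    return cfg


print(settings_for("temp_monitor", scale=2.0))
print(settings_for("unlisted_monitor"))
```

A single global factor is the simplest tuning knob; a real system would likely need per-metric limits as well.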

ISIS Goals
The system should:
–Monitor the health (performance, utilization, state) of all processors and networks in the system, leveraging existing tools and standards.
–Closely monitor the performance and status of each job, and work together with the workflow subsystem to ensure good progress for the larger analysis campaigns being conducted.
–Be coupled to the application, because mitigation actions depend on the properties of the application and its overall workflow.
–Be integrated with workflow planning to allow for resource optimization. This will include interactions with real-time scheduling systems.

ISIS Research
Overall planned deliverables:
–Customizable monitoring and control framework
–Mitigation engine
–Monitoring and mitigation design tool
–Monitoring and mitigation system generator
In this project we will address the critical difficulties in achieving a fault mitigation framework for a large cluster that is configurable and strives to minimally affect the performance of the cluster. For this, we will use a model-based design approach that relies on domain-specific modeling languages and model transformers to enable system design with domain-specific, higher-level abstractions, and that uses the monitoring and control framework.

Current activities
There is a strong desire to leverage existing infrastructures for networked-system health monitoring. The evaluation criteria are:
–Architecture: centralized vs. hierarchical
–Monitoring: available sensors as well as configurability for new sensors
–Scalability: cluster size vs. resource consumption
–Handling: smart data mining and virtual sensors
–Programmability: configuration language
–Report visualization
Currently, we are evaluating two packages:
–Open NMS
–Aware

Near-term major work
Nov/Dec 06:
–Get all job information and some currently measured attributes into a database
–Complete evaluation of products
Jan/Feb 07:
–Additions to current code to record resource utilization, IMPI information, errors, and actions
–Create test system for selected monitoring product
Mar/Apr 07:
–Replicating existing functionality in new framework
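The first milestone above, getting job information and measured attributes into a database, might start from something like the following sketch. The slides do not name a database or schema, so SQLite and every table and column name here (`jobs`, `measurements`, `owner`, `attribute`, and so on) are assumptions, and the rows are made-up sample data.

```python
import sqlite3

# In-memory database so the sketch runs without touching disk.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE jobs (
        job_id     TEXT PRIMARY KEY,
        owner      TEXT NOT NULL,
        node_count INTEGER NOT NULL,
        state      TEXT NOT NULL
    );
    CREATE TABLE measurements (
        job_id    TEXT NOT NULL REFERENCES jobs(job_id),
        node      TEXT NOT NULL,
        attribute TEXT NOT NULL,
        value     REAL NOT NULL,
        taken_at  TEXT NOT NULL
    );
    """
)

# Hypothetical sample rows.
conn.execute("INSERT INTO jobs VALUES (?, ?, ?, ?)", ("job-1", "alice", 2, "running"))
conn.executemany(
    "INSERT INTO measurements VALUES (?, ?, ?, ?, ?)",
    [
        ("job-1", "node01", "cpu_load", 0.5, "2006-11-01T12:00:00"),
        ("job-1", "node02", "cpu_load", 1.5, "2006-11-01T12:00:00"),
    ],
)

# Average CPU load per job across its nodes.
ROWS = conn.execute(
    """
    SELECT j.job_id, AVG(m.value)
    FROM jobs AS j
    JOIN measurements AS m ON j.job_id = m.job_id
    WHERE m.attribute = 'cpu_load'
    GROUP BY j.job_id
    """
).fetchall()
print(ROWS)
```

Storing measurements as (attribute, value) rows rather than one column per attribute lets new sensors be added later without a schema change.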

General architecture [figure]

Backup slides follow

Model-Based Approach [figure]
Diagram labels: System Health Monitoring (Open NMS/Aware); Fault handling; Process dataflow; Hardware Configuration; Modeling in GME; Run Time; Fault Mitigation Engines

Proposed Architecture
System Health Monitoring
–Cluster
–Campaign
–Application
–Leverage existing tools and standards
Mitigation
–Application
–Campaign
–Cluster
Both monitoring and mitigation must be synchronized across related jobs.
Design/modeling environment for deploying campaigns and for defining monitoring and mitigation policies.

Motivation
Jobs on the LQCD cluster are usually long-running and interdependent. A failure on one node can have a domino effect on other nodes. We cannot rely on job-level fault tolerance, because it:
–will be computationally expensive
–will cause a decrease in performance
–will make synchronization between related jobs difficult
We need a cluster-wide fault-tolerant framework that performs monitoring and mitigation and is integrated with the scheduling framework.

Fault Mitigation Engine
Two mitigation schemes:
–Reflex-based scheme, e.g. relocate jobs, shut down jobs, rewire nodes
–Planning-based scheme, e.g. optimize the campaign, reschedule jobs
Mitigation schemes are integrated with the workflow and job scheduling system.
Constraints: deadlines, resource consumption.
Approach: model-based generators transform the designs into components and configurations for the runtime system, making sure that end users can flexibly modify the characteristics of the generated artifacts.
We will integrate with a tool for the definition of workflows, monitoring, and mitigation actions.
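To make the contrast between the two schemes concrete, the sketch below pairs a reflex lookup table with a small planner. It is an illustration under stated assumptions, not the project's engine: the fault names, the `Job` fields, and the greedy earliest-deadline-first placement are all hypothetical choices, and only deadlines are modeled as a constraint.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

# Reflex scheme: a fixed fault -> action table, applied immediately.
# Fault names are hypothetical; actions echo the examples on the slide.
REFLEX_ACTIONS: Dict[str, str] = {
    "node_down": "relocate_jobs",
    "job_hung": "shutdown_job",
    "link_down": "rewire_nodes",
}


def reflex(fault: str) -> Optional[str]:
    """Return the immediate action for a fault, or None if unknown."""
    return REFLEX_ACTIONS.get(fault)


@dataclass
class Job:
    name: str
    hours: float     # estimated remaining run time
    deadline: float  # hours from now


def plan(jobs: List[Job], nodes: List[str]) -> Tuple[Dict[str, List[str]], List[str]]:
    """Planning scheme: greedy earliest-deadline-first rescheduling.

    Each job goes to the node that becomes free soonest; jobs that
    would finish after their deadline are reported as late.
    """
    free_at = {node: 0.0 for node in nodes}
    schedule: Dict[str, List[str]] = {node: [] for node in nodes}
    late: List[str] = []
    for job in sorted(jobs, key=lambda j: j.deadline):
        node = min(free_at, key=free_at.get)
        finish = free_at[node] + job.hours
        if finish > job.deadline:
            late.append(job.name)
            continue
        free_at[node] = finish
        schedule[node].append(job.name)
    return schedule, late


JOBS = [Job("A", 4, 5), Job("B", 3, 6), Job("C", 2, 4), Job("D", 5, 6)]
SCHEDULE, LATE = plan(JOBS, ["n1", "n2"])
print(reflex("node_down"), SCHEDULE, LATE)
```

The reflex path needs no global state and can run on the affected node, while the planner needs a view of all jobs and healthy nodes, which is why it belongs with the workflow and scheduling system.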
