DataGRID WP4 - Fabric Management Status, HEPiX 2002, Catania / IT, 18.04.02, Jan Iven


1 DataGRID WP4 - Fabric Management
http://cern.ch/hep-proj-grid-fabric
Status report @ HEPiX 2002, Catania / IT, 18.04.02, Jan Iven
+ Role and Architecture
+ Status reports
  = Installation
  = Monitoring
  = Others
+ Short-term planning

2 J.Iven - 2 HEPiX 2002, Catania / IT WP4 - Background information
+ WP4's objective: deliver the necessary tools to manage a computing fabric providing grid services on clusters scaling up to thousands of nodes.
+ Main scope:
  = Fabric (system administration) management
  = User job management (Grid and local)
+ Official participants: CERN (leading partner), INFN, NIKHEF, University of Heidelberg, ZIB (Berlin) and University of Edinburgh / PPARC

3 Functionality
+ Enterprise system administration - scalable to O(10K) nodes
  = Automated installation and maintenance of nodes
  = Resource management (batch, interactive)
  = Monitoring of events and performance
  = Fault tolerance & recovery actions
  = Fabric Configuration Management
+ Provision for running Grid jobs
  = Authorization according to local policies
  = Mapping Grid credentials to local ones
  = Publication of fabric resources and job information
+ Provision for running local jobs
  = Sharing of resources according to local policies

4 DataGRID Architecture
[Layered architecture diagram]
+ Local Computing: Local Application, Local Database
+ Grid Application Layer: Data Management, Job Management, Metadata Management, Object to File Mapping
+ Grid Collective Services: Information & Monitoring, Replica Manager, Grid Scheduler
+ Underlying Grid Services: Computing Element Services, Authorization Authentication and Accounting, Replica Catalog, Storage Element Services, SQL Database Services, Service Index
+ Grid Fabric - Fabric services (WP4 tasks): Configuration Management, Node Installation & Management, Monitoring and Fault Tolerance, Resource Management, Fabric Storage Management

5 WP4 Architecture overview
[Diagram: Grid User and Local User; Farm A (LSF), Farm B (PBS); (Mass storage, Disk pools)]
+ WP4 subsystems
  = Fabric Gridification
    - Interface between Grid-wide services and local fabric
    - Provides local authentication, authorization and mapping of grid credentials
  = Resource Management
    - Provides transparent access to different cluster batch systems
    - Enhanced capabilities (extended scheduling policies, advanced reservation, local accounting)
  = Configuration Management
    - Provides central storage and management of all fabric configuration information
    - Central DB and set of protocols and APIs to store and retrieve information
  = Installation & Node Mgmt
    - Provides the tools to install and manage all software running on the fabric nodes
    - Agent to install, upgrade, remove and configure software packages on the nodes
    - Bootstrap services and software repositories
  = Monitoring & Fault Tolerance
    - Provides the tools for gathering monitoring information on fabric nodes
    - Central measurement repository stores all monitoring information
    - Fault tolerance correlation engines detect failures and trigger recovery actions
+ Other WPs
  = Resource Broker (WP1)
  = Data Mgmt (WP2)
  = Grid Info Services (WP3)
  = Grid Data Storage (WP5)

6 Status: Installation
+ Main focus: Linux on i386
+ Prototype available, based on a tool originally developed by Edinburgh University: LCFG (Local ConFiGuration system).
+ Main features:
  = Automatic installation of the O.S.
  = Installation/upgrade/removal of software packages
  = Configuration and management of standard services and custom packages
  = Uses a central, hierarchical configuration database
+ Deployed on EDG testbed 1 in September 2001, interim releases every ~2 months. Currently ~70 nodes (CERN, INFN, NIKHEF, RAL + other UK, IFAE-Barcelona, LIP-Lisbon), +~30 being installed now (ESA-ESRIN, NIKHEF)
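The installation/upgrade/removal feature can be pictured as comparing what a node has with what the central configuration says it should have. The following is only a minimal sketch of that idea, not LCFG code: the function name `plan_package_actions`, the package names and the version strings are all made up for illustration.

```python
def plan_package_actions(installed, desired):
    """Compare installed packages with the desired list and return actions.

    Both arguments map package name -> version string (hypothetical data).
    Any version difference is reported as "upgrade"; this sketch does not
    try to order versions, so it cannot tell an upgrade from a downgrade.
    """
    actions = []
    for name in sorted(desired):
        if name not in installed:
            actions.append(("install", name, desired[name]))
        elif installed[name] != desired[name]:
            actions.append(("upgrade", name, desired[name]))
    for name in sorted(installed):
        if name not in desired:
            actions.append(("remove", name, installed[name]))
    return actions


# Illustrative node state and target configuration (not from the slides).
installed = {"openssh": "2.9", "telnet-server": "0.17", "pbs-mom": "2.3"}
desired = {"openssh": "3.1", "pbs-mom": "2.3", "globus-gatekeeper": "2.0"}

plan = plan_package_actions(installed, desired)
for action, name, version in plan:
    print(action, name, version)
```

Keeping the desired state in one central place and deriving the actions per node is what lets the same description drive first installation and later maintenance.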

7 LCFG overview
+ Abstract configuration parameters for all nodes stored in a central repository
+ A collection of agents read configuration parameters and either generate traditional config files or directly manipulate various services
[LCFG diagram]
+ Server: LCFG Config Files -> Make XML Profile -> Web Server
+ Transport: XML Profile over HTTP
+ Client nodes: rdxprof (Read Profile) -> Local cache -> ldxprof (Load Profile) -> Profile Object -> Generic Component -> LCFG Objects
+ LCFG config files:
    +inet.services telnet login ftp
    +inet.allow telnet login ftp sshd
    +inet.allow_telnet ALLOWED_NETWORKS
    +inet.allow_login ALLOWED_NETWORKS
    +inet.allow_ftp ALLOWED_NETWORKS
    +inet.allow_sshd ALL
    +inet.daemon_sshd yes
    .....
    +auth.users mickey
    +auth.userhome_mickey /home/Mickey
    +auth.usershell_mickey /bin/tcsh
+ XML profiles: 192.168., 192.135.30. ..... /home/Mickey /bin/tcsh ....
+ Profile Object -> inet, auth -> Config files:
  = inet: /etc/services, /etc/inetd.conf, /etc/hosts.allow
      in.telnetd : 192.168., 192.135.30.
      in.rlogind : 192.168., 192.135.30.
      in.ftpd : 192.168., 192.135.30.
      sshd : ALL
  = auth: /etc/shadow, /etc/group, /etc/passwd
      ....
      mickey:x:999:20::/home/Mickey:/bin/tcsh
      ....
Slide prepared by INFN
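The step from abstract parameters to a traditional config file can be sketched in a few lines. This is an illustration only and not how LCFG components are actually written: the function names, the `MACROS` table and the service-to-daemon table `DAEMONS` are assumptions chosen so that the result matches the `/etc/hosts.allow` lines on the slide.

```python
from collections import defaultdict

# Resource lines in the style shown on the slide.
RESOURCES = """\
+inet.services telnet login ftp
+inet.allow telnet login ftp sshd
+inet.allow_telnet ALLOWED_NETWORKS
+inet.allow_login ALLOWED_NETWORKS
+inet.allow_ftp ALLOWED_NETWORKS
+inet.allow_sshd ALL
"""

# Assumed macro expansion and service-to-daemon names (hypothetical tables).
MACROS = {"ALLOWED_NETWORKS": "192.168., 192.135.30."}
DAEMONS = {"telnet": "in.telnetd", "login": "in.rlogind",
           "ftp": "in.ftpd", "sshd": "sshd"}


def parse_resources(text):
    """Turn '+component.resource value' lines into {component: {resource: value}}."""
    profile = defaultdict(dict)
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("+"):
            continue
        key, _, value = line[1:].partition(" ")
        component, _, resource = key.partition(".")
        profile[component][resource] = value.strip()
    return profile


def render_hosts_allow(inet):
    """Generate hosts.allow-style lines from the 'inet' resources."""
    lines = []
    for service in inet.get("allow", "").split():
        value = inet.get("allow_" + service, "")
        value = MACROS.get(value, value)  # expand macros, keep literals like ALL
        lines.append(f"{DAEMONS.get(service, service)} : {value}")
    return "\n".join(lines)


profile = parse_resources(RESOURCES)
hosts_allow = render_hosts_allow(profile["inet"])
print(hosts_allow)
```

In the real system the parameters travel as an XML profile over HTTP between these two steps; the sketch skips the transport and keeps only the parse and generate stages.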

8 Monitoring + Fault Tolerance
+ Principle:
  = A Monitoring Agent running on each node samples the configured metrics via sensors
  = The samples are sent to a central Monitoring Repository and stored. The samples are also stored locally to allow for local fault tolerance if appropriate
  = Correlation engines act on local or central data
    - Trigger actions, e.g. alarms or recovery
    - Create higher-level data
  = User Interface allows querying the Repository and displays alarms
+ Status:
  = Component APIs have been defined
  = First version of Agent available, deployed on EDG (~15) and CERN (~1000) nodes
  = Current prototype uses simple DB based on flat files
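The agent / repository / correlation-engine split described above might look roughly like this. It is a toy sketch under stated assumptions, not the WP4 component APIs: the class names, the sample dictionary layout, the metric names and the thresholds are invented, and the repository is an in-memory list rather than a flat-file DB.

```python
import time
from collections import deque


class Repository:
    """Stand-in for the central Monitoring Repository (in memory only)."""

    def __init__(self):
        self.samples = []

    def store(self, sample):
        self.samples.append(sample)


class Agent:
    """Samples each configured metric via its sensor callable."""

    def __init__(self, node, sensors, repository, cache_size=100):
        self.node = node
        self.sensors = sensors            # metric name -> callable returning a number
        self.repository = repository
        self.local_cache = deque(maxlen=cache_size)  # bounded local copy

    def sample_once(self, now=None):
        timestamp = time.time() if now is None else now
        for metric, sensor in self.sensors.items():
            sample = {"node": self.node, "metric": metric,
                      "value": sensor(), "time": timestamp}
            self.local_cache.append(sample)   # kept locally for local fault tolerance
            self.repository.store(sample)     # and sent to the central store


def correlate(samples, metric, threshold):
    """Very simple correlation rule: alarm on samples above a threshold."""
    return [f"ALARM {s['node']}: {s['metric']}={s['value']}"
            for s in samples
            if s["metric"] == metric and s["value"] > threshold]


# Hypothetical sensors returning fixed values so the example is repeatable.
repo = Repository()
agent = Agent("node001", {"load1": lambda: 7.5, "cpu_temp_c": lambda: 48.0}, repo)
agent.sample_once(now=0.0)

alarms = correlate(repo.samples, "load1", threshold=5.0)                  # central data
local_alarms = correlate(agent.local_cache, "cpu_temp_c", threshold=70.0)  # local data
print(alarms, local_alarms)
```

The same `correlate` function runs against either the central samples or the node's own cache, which mirrors the point that correlation engines act on local or central data.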

9 Other Statuses (Stati?)
+ Fault Tolerance
  = Prototype which periodically checks the CPU/chip set temperatures as well as the fan speeds.
+ Configuration Management
  = High Level Configuration Description Language: declarative way of describing configuration of computer systems. First draft available.
  = High Level Configuration Language to Low Level Configuration Language compiler. Alpha prototype available.
  = Central Configuration Database (CDB) (central store for all fabric configuration information). Being designed.
+ Resource Management
  = Working on first prototype of the Resource Management Subsystem
+ Gridification
  = Enhancing the Globus gatekeeper with plug-in authorization and credential mapping components.
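The Gridification item combines two steps: authorize a Grid credential against local policy, then map it to a local account. A minimal sketch of that sequence follows. It does not reflect the Globus gatekeeper plug-in interface; the class `CredentialMapper`, the policy tables, the certificate subjects and the account names are all hypothetical.

```python
# Hypothetical local policy: which virtual organisations may run jobs,
# which certificate subjects are refused, and which local accounts each
# organisation may be mapped to.
ALLOWED_VOS = {"atlas", "cms"}
BANNED_SUBJECTS = {"/O=Grid/O=Example/CN=Revoked User"}
VO_TO_POOL = {"atlas": ["atlas001", "atlas002"], "cms": ["cms001"]}


class CredentialMapper:
    """Authorize a Grid subject, then lease it a free local account."""

    def __init__(self):
        self.leases = {}  # certificate subject -> local account

    def authorize(self, subject, vo):
        return vo in ALLOWED_VOS and subject not in BANNED_SUBJECTS

    def map_to_local(self, subject, vo):
        if not self.authorize(subject, vo):
            raise PermissionError(f"not authorized by local policy: {subject}")
        if subject in self.leases:            # same user keeps the same account
            return self.leases[subject]
        in_use = set(self.leases.values())
        for account in VO_TO_POOL[vo]:
            if account not in in_use:
                self.leases[subject] = account
                return account
        raise RuntimeError(f"no free local account for VO {vo}")


mapper = CredentialMapper()
alice = mapper.map_to_local("/O=Grid/O=Example/CN=Alice", "atlas")
bob = mapper.map_to_local("/O=Grid/O=Example/CN=Bob", "atlas")
print(alice, bob)
```

Separating `authorize` from `map_to_local` keeps the local policy decision independent of how accounts are handed out, which is the kind of split a plug-in design would allow sites to customise.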

10 Planning up to Release 2 (09/02)
+ Installation: Split "Production" and "Research"
  = Deliver Production quality LCFG in R2
  = Move to latest LCFG: support PXE installations, RedHat 7.2
  = New Configuration Schema
  = Deploy new configuration language compiler
+ Monitoring: complete Prototype in R2
  = Prototype-quality agent, deploy everywhere
  = Simple alarm display / GUI
  = Rework transport layer (UDP -> reliable transport)
  = Integrate SNMP
  = Select and move to "real" database

11 Summary and Links
+ WP4 is well-established: software deployed and used
+ Release 2 will be Evolution, not Revolution
+ @CERN: close collaboration between WP4, farm admins and LCG team
+ DataGRID project: http://cern.ch/eu-datagrid
+ DataGRID WP4: http://cern.ch/hep-proj-grid-fabric
+ LCFG: http://www.lcfg.org

