Developing & Managing A Large Linux Farm – The Brookhaven Experience CHEP2004 – Interlaken September 27, 2004 Tomasz Wlodek - BNL.


1 Developing & Managing A Large Linux Farm – The Brookhaven Experience CHEP2004 – Interlaken September 27, 2004 Tomasz Wlodek - BNL

2 Background
- Brookhaven National Lab (BNL) is a multidisciplinary research laboratory funded by the US government.
- BNL is the site of the Relativistic Heavy Ion Collider (RHIC) and four of its experiments.
- The RHIC Computing Facility (RCF) was formed in the mid-1990s to address the computing needs of the RHIC experiments.

3 Background (cont.)
- BNL has also been chosen as the site of the Tier-1 ATLAS Computing Facility (ACF) for the ATLAS experiment at CERN.
- RCF/ACF supports HENP and HEP scientific computing efforts and various general services (backup, e-mail, web, off-site data transfer, Grid, etc.).

4 Background (cont.)
- The Linux Farm is the main source of CPU (and increasingly storage) resources in the RCF/ACF
- RCF/ACF is transforming itself from a local resource into a national and global resource
- Growing design and operational complexity
- Increasing staffing levels to handle additional responsibilities

5 RCF/ACF Structure

6 Staff Growth at the RCF/ACF

7 The Pre-Grid Era
- Rack-mounted commodity hardware
- Self-contained, localized resources
- Resources available only to local users
- Little interaction with external resources at remote locations
- Considerable freedom to set own usage policies

8 The (Near-Term) Future
- Resources available globally
- Distributed computing architecture
- Extensive interaction with remote resources requires closer software inter-operability and higher network bandwidth
- Constraints on freedom to set own policies

9 How do we get there?
- Change in management philosophy
- Evolution in hardware requirements
- Evolution in software packages
- Different security protocol(s)
- Change in access policy

10 Change in Management Philosophy
- Automated monitoring & management of servers in large clusters is a must
- Remote power management, predictive hardware failure analysis and preventive maintenance are important
- High availability based on a large number of identical servers, not on 24-hour support
- Increasingly large clusters are only manageable if servers are identical → avoid specialized servers
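The point about high availability through many identical servers can be illustrated with a small calculation. The sketch below is not from the talk: it assumes servers fail independently and share one hypothetical availability figure (`p_up`), and the function names are made up for illustration. Under those assumptions, the number of servers that are up follows a binomial distribution.

```python
from math import comb


def prob_at_least(n_servers: int, k_needed: int, p_up: float) -> float:
    """Probability that at least k_needed of n_servers are up.

    Assumes identical servers that fail independently, each being
    available with probability p_up (a hypothetical figure).
    """
    return sum(
        comb(n_servers, k) * p_up**k * (1 - p_up) ** (n_servers - k)
        for k in range(k_needed, n_servers + 1)
    )


def expected_capacity(n_servers: int, p_up: float) -> float:
    """Expected number of servers available at any given moment."""
    return n_servers * p_up


if __name__ == "__main__":
    # Hypothetical numbers: 1000 identical nodes, each up 95% of the time.
    print(expected_capacity(1000, 0.95))
    print(prob_at_least(1000, 900, 0.95))
```

With interchangeable nodes, the loss of any one server only reduces capacity slightly, which is why a farm of this kind can tolerate failures without round-the-clock intervention. A specialized server breaks that assumption, because its failure removes a service rather than a fraction of capacity.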

11 Evolution in Hardware Requirements
- Early acquisitions emphasized CPU power over local storage capacity
- Increasing affordability of local disk storage has changed this philosophy
- Hardware chosen by optimal combination of CPU power, storage capacity, server density and price
- Buy from high-quality vendors to avoid labor-intensive maintenance issues

12 The Growth of the Linux Farm

13 Drop in Server Price as a Function of Performance

14 Drop in Cost of Local Storage

15 Total Distributed Storage Capacity

16 Growth of Storage Capacity per Server

17 Server Reliability

18 The Factors Enforcing Evolution in Software Packages
- Cost
- Farm size / scalability
- Security
- External influences / wide acceptance

19 Cost
- Red Hat Linux → Scientific Linux
- LSF → Condor

20 Farm Size / Scalability
- Home-built batch system for data reconstruction → Condor-based batch system
- Home-built monitoring system → Ganglia

21 Security
- Started with NIS/telnet in the 1990s
- Cyber-security threats prompted the installation of firewalls, gatekeepers and migration to ssh → stricter security standards than in the past
- Ongoing change to Kerberos 5. Ongoing phase-out of NIS passwords.
- Testing GSI → limited support for GSI

22 Security Changes (cont.)
- Authorization & authentication controlled by local site (NIS and Kerberos)
- Migration to GSI requires a central CA and regional VOs for authentication → local site performs final authentication before granting access
- Accept certificates from multiple CAs?
- Difficult transition from complete to partial control over security issues

23 External Influences / Wide Acceptance
- Ganglia – used by RHIC experiments to monitor the RCF and external farms in order to manage their job submission.
- HRM / dCache – used by other labs
- Condor – widely used by the ATLAS community

24 Software Evolution - summary

Package              Old              New               Date
OS                   RedHat Linux     Scientific Linux  2004
Batch                Home-Built/LSF   Condor/LSF        2004/2000
Monitoring           Home-Built       Ganglia           2003
Security             NIS              K5/GSI            2003/2004
Distributed Storage  -----------      HRM/dCache        2004/?

25 Ganglia at the RCF/ACF

26 Condor at the RCF/ACF

27 Summary
- RCF/ACF is going through a transition from a local facility to a regional (global) facility → many changes
- A Linux Farm built with commodity hardware is increasingly affordable and reliable
- Distributed storage is also increasingly affordable → management software issues.

28 Summary (cont.)
- Inter-operability with remote sites (software and services) plays an increasingly important role in our software choices
- Transition with security and access issues
- Migration will take longer and be more difficult than generally expected → change in hardware and software needs to be complemented by a change in management philosophy

