Managing Mature White Box Clusters at CERN LCW: Practical Experience Tim Smith CERN/IT.


1 Managing Mature White Box Clusters at CERN LCW: Practical Experience Tim Smith CERN/IT

2 Contents
2002/10/21 White Box Farms: Tim.Smith@cern.ch
• Scale
• Behind the Scenes
• Hardware
• Complexity
• Dynamics
• Practical Steps
• Software
• Legacy
• Projects

3 Scale
• ~1000 boxes
• 140k jobs/wk
• 2400 interactive users
• 50 parallel reinstalls
• Parallel cmd engines
• 350 kSI2000
• ~7/38 in Top 500 clusters

4 Complexity
• Hardware
  • 12 hardware acquisitions
  • 38 combinations of CPU/Mem/Disk
• Software
  • 4 versions of RedHat OS
  • 37 clusters (indep. configurations)
• User Communities
  • 30 expts/user communities + Public
  • 12,000 users

5 Dynamics
• Hardware Drift
  • e.g. missing after reboot: CPUs, Memory, Disks
  • Ethernet speed wrong
• Volatile configurations
  • e.g. passwd file every couple of hours
• Hardware Failures
  • Up to 4% of farm on holiday
  • Replacements generate new configurations
Monitoring Inventory Tracking
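The hardware drift described on this slide (CPUs, memory, or disks missing after a reboot, wrong Ethernet speed) can be caught by comparing what a node reports against what the inventory says it should have. The following is a minimal sketch of that comparison only. The `HardwareProfile` type, its field names, and the sample values are hypothetical and are not taken from the talk or from any CERN tool.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class HardwareProfile:
    """Hypothetical per-node record; field names are illustrative only."""
    cpus: int
    memory_mb: int
    disks: int
    ethernet_mbps: int


def detect_drift(expected: HardwareProfile, observed: HardwareProfile) -> dict:
    """Return {field_name: (expected, observed)} for every field that differs."""
    drift = {}
    for f in fields(HardwareProfile):
        want = getattr(expected, f.name)
        have = getattr(observed, f.name)
        if want != have:
            drift[f.name] = (want, have)
    return drift


# Made-up example: a node comes back from reboot with one CPU missing
# and its network link negotiated at the wrong speed.
expected = HardwareProfile(cpus=2, memory_mb=1024, disks=2, ethernet_mbps=100)
observed = HardwareProfile(cpus=1, memory_mb=1024, disks=2, ethernet_mbps=10)
report = detect_drift(expected, observed)
```

In practice the observed profile would come from the node itself and the expected one from an inventory database; how either is collected is outside the scope of this sketch.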

6 Vendor Call Analysis
1 every 2 days!

7 Acquisition Cycles

8 Addressing the Challenge
• Interactive: Refresh from uniform batch machines
• Batch: One large production facility
• Shares (and priorities)
• Selectable resources
• Flexibility
• Redundancy to reduce sensitivity to failures
• Remedy Hardware workflows
• But intractable
• Scatter in job return times
• Assumed but undeclared job requirements

9 SW: Legacy from Maturity
[Diagram labels: OS, Applications, Mgmt Tools; KickStart, SUE, ASIS, BIS; /home, /usr/cute, /usr/local, /var, /opt]

10 SW: Legacy from Maturity
[Diagram labels: OS, Applications, Mgmt Tools; KickStart, SUE, ASIS, BIS, BIS DB; Oracle, AFS, Local; acrontabs, crontabs; /home, /usr/cute, /usr/local, /var, /opt]
Multiple owners, methods, formats
Multiple locations

11 A Clean Restart
[Diagram labels: Node Configuration System, Monitoring System, Installation System, Fault Mgmt System]

12 A Clean Restart: SnapShot
[Diagram labels: Node Configuration System, Monitoring System, Installation System, Fault Mgmt System; HW, SW, Function, State; Software Update, Base Installation; RPM, API, PXE, Kickstart]

13 State and Configuration Mgt
• Clean Initial State
• Linux Standard Base, RPM
• Externally Specified
• Configuration System, local cache
• Versioned + Repository
• CVS
• No inherent drift
• No external crontabs
• No unregistered application provider triggered updates
• Update verification nodes + release cycle
• Procedures and Workflows
• Transactions
• Notifications
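One way to read "externally specified" state with "no inherent drift" is that a node's package set is always reconciled against a versioned specification rather than changed ad hoc. The sketch below illustrates only that reconciliation step, assuming the desired and installed states are both available as name-to-version mappings. The function name, the data layout, and the package names and versions are made up for illustration; this is not the configuration system described in the talk.

```python
def plan_changes(desired: dict, installed: dict) -> dict:
    """Compare a desired package spec with what a node has installed.

    Both arguments map package name -> version string (hypothetical layout).
    Returns the actions needed to bring the node back to the desired state.
    """
    to_install = {name: ver for name, ver in desired.items()
                  if name not in installed}
    to_update = {name: (installed[name], ver) for name, ver in desired.items()
                 if name in installed and installed[name] != ver}
    to_remove = sorted(name for name in installed if name not in desired)
    return {"install": to_install, "update": to_update, "remove": to_remove}


# Made-up example: the spec would live in a versioned repository,
# and the installed set would be queried from the node.
desired = {"kernel": "2.4.18", "openssh": "3.4", "batch-client": "4.2"}
installed = {"kernel": "2.4.9", "openssh": "3.4", "oldtool": "1.0"}
plan = plan_changes(desired, installed)
```

An empty plan means the node matches its specification; a non-empty one is the kind of change that would first go through verification nodes and a release cycle before reaching the whole farm.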

14 Conclusions
• Maturity brings…
  • Degradation of initial state definition (HW + SW)
  • Accumulation of innocuous temporary procedures
• Scale brings…
  • Marginal activities become full time
  • Many hands on the systems
• Combat with strong management automation

