
1 The CERN Computer Centres, October 14th 2005. Tony.Cass@CERN.ch

2 Talk Outline
- Where
- Why
- What
- How
- Who

3 Where
- B513
  - Main Computer Room: ~1,500 m² at 1.5 kW/m², built for mainframes in 1970, upgraded for LHC PC clusters 2003-2005.
  - Second ~1,200 m² room created in the basement in 2003 as additional space for LHC clusters and to allow ongoing operations during the main room upgrade. Cooling limited to 500 W/m².
- Tape Robot building, ~50 m from B513
  - Constructed in 2001 so that a single incident in B513 cannot destroy all CERN data.

4 Why
- Support:
  - Laboratory computing infrastructure
    - Campus networks: general purpose and technical
    - Home directory, email and web servers (10k+ users)
    - Administrative computing servers
  - Physics computing services
    - Interactive cluster
    - Batch computing
    - Data recording, storage and management
    - Grid computing infrastructure

5 Physics Computing Requirements
- 25,000k SI2K in 2008, rising to 56,000k in 2010
  - 2,500-3,000 boxes
  - 500 kW-600 kW at 200 W/box
- 2.5 MW at 0.1 W/SI2K (see the arithmetic check below)
- 6,800 TB online disk in 2008, 11,800 TB in 2010
  - 1,200-1,500 boxes
  - 600 kW-750 kW
- 15 PB of data per year
  - 30,000 500 GB cartridges/year
  - Five 6,000-slot robots/year
- Sustained data recording at up to 2 GB/s
  - Over 250 tape drives and associated servers
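A quick arithmetic check of the power and media figures above. The variable names are mine; the numbers are taken from the slide, and the 2.5 MW line is read as the 0.1 W/SI2K estimate applied to the full 2008 CPU capacity.

```python
# Rough sanity check of the capacity and power figures quoted above.
# The split into a per-box estimate and a per-SI2K estimate is taken
# from the slide; everything else here is plain arithmetic.

si2k_2008 = 25_000_000        # 25,000k SI2K in 2008
boxes_cpu = (2_500, 3_000)    # estimated number of CPU boxes
watts_per_box = 200           # W per box
watts_per_si2k = 0.1          # W per SI2K (alternative estimate)

# Per-box power estimate: 500-600 kW
print([n * watts_per_box / 1e3 for n in boxes_cpu])   # -> [500.0, 600.0] kW

# Per-SI2K power estimate for the full 2008 capacity
print(si2k_2008 * watts_per_si2k / 1e6)               # -> 2.5 MW

# Tape: 15 PB/year written to 500 GB cartridges
print(15e15 / 500e9)                                  # -> 30000.0 cartridges/year
```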

6 What are the major issues?
- Commodity equipment from multiple vendors
- Large scale clusters
- Infrastructure issues
  - Power and cooling
  - Limited budget

7 Commodity equipment and many vendors
- Given the requirements, there is significant pressure to limit the cost per SI2K and the cost per TB.
- Open tender purchase process
  - Requirements are specified in terms of box performance.
  - Reliability criteria are seen as subjective and so are difficult to incorporate in the process.
    - Also, as the internal components are similar, are branded boxes intrinsically more reliable?
- Cost requirements and the tender process lead to "white box" equipment, not branded.
- The tender process leads to frequent changes of bidder.
  - Good: there is competition and we are not reliant on a single supplier.
  - Bad: we must deal with many companies, most of whom are remote and subcontract maintenance services.

8 Large Scale Clusters
- The large number of boxes leads to problems in terms of:
  - Maintaining software homogeneity across the clusters
  - Maintaining services despite the inevitable failures
  - Logistics
    - Boxes arrive in batches of O(500).
    - Are vendors respecting the contractual warranty times? (Have they returned the box we sent them last week...)
    - How to manage service upgrades, especially as not all boxes for a service will be up at the time of the upgrade.
  - ...


10 Infrastructure Issues
- Cooling capacity limits the equipment we can install (see the check below).
  - Maximum cooling of 1.5 kW/m²
  - 40 x 1U servers at 200 W/box = 8 kW/m²
- We cannot provide diesel backup for the full computer centre load.
  - Swiss/French auto-transfer covers most failures.
  - Dedicated zone for "critical equipment" with diesel backup and dual power supplies.
    - Limited to 250 kW for networks and the laboratory computing infrastructure...
    - ...and physics services such as Grid and data management servers, but not all of the physics network, so careful planning is needed for switch/router allocations and power connections.
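A quick check of the cooling mismatch above, assuming an effective floor footprint of about 1 m² per rack. That footprint is my assumption, used only to reproduce the 8 kW/m² figure on the slide.

```python
# Compare the heat density of a fully loaded rack of 1U servers with the
# room's cooling limit. The ~1 m^2 footprint per rack is an assumption.

servers_per_rack = 40        # 40 x 1U servers
watts_per_server = 200       # W/box, from the slide
rack_footprint_m2 = 1.0      # assumed effective floor area per rack

rack_density = servers_per_rack * watts_per_server / rack_footprint_m2 / 1e3
cooling_limit = 1.5          # kW/m^2, main room after the upgrade

print(f"rack heat density: {rack_density} kW/m^2")                 # 8.0 kW/m^2
print(f"fraction of rack that can be filled: {cooling_limit / rack_density:.0%}")  # ~19%
```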

11 How

12 How
- Rigorous, centralised control

13 ELFms
- ELFms is the Extremely Large Farm management system. For every node (box) in the fabric it must:
  - deliver the required configuration
  - monitor performance and any deviation from the required state
  - track nodes through hardware and software state changes
- Three components:
  - quattor for configuration, installation and node management
  - Lemon for system and service monitoring
  - Leaf for managing state changes, both hardware (HMS) and software (SMS)

14 quattor
- quattor takes care of the configuration, installation and management of nodes.
  - A Configuration Database holds the "desired state" of all fabric elements:
    - Node setup (CPU, HD, memory, software RPMs/PKGs, network, system services, location, audit info, ...)
    - Cluster (name and type, batch system, load balancing info, ...)
    - Defined in templates arranged in hierarchies, so common properties are set only once (see the sketch below).
  - Autonomous management agents running on the node take care of:
    - Base installation
    - Service (re-)configuration
    - Software installation and management
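quattor's configuration database is actually written in its own template language (Pan). The following Python sketch only illustrates the "templates arranged in hierarchies, common properties set only once" idea; every key and value in it is made up for illustration and is not quattor's real schema.

```python
# Minimal sketch of hierarchical configuration templates: a node profile is
# built by layering cluster-wide defaults and node-specific overrides, so
# common properties are declared only once.

def merge(base: dict, override: dict) -> dict:
    """Return base updated with override, merging nested dicts."""
    result = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = merge(result[key], value)
        else:
            result[key] = value
    return result

common = {                      # properties shared by every node (illustrative)
    "os": {"release": "SLC3", "kernel": "2.4.21"},
    "services": ["sshd", "monitoring-agent"],
}

batch_cluster = {               # properties shared by one cluster (illustrative)
    "cluster": {"name": "lxbatch", "type": "batch"},
    "packages": ["batch-client"],
}

node_profile = merge(merge(common, batch_cluster), {
    "hardware": {"cpu": 2, "memory_gb": 2, "disk_gb": 80},
    "network": {"hostname": "lxb0001.cern.ch"},
})

print(node_profile["os"]["release"])   # "SLC3", inherited from the common layer
```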

15 quattor architecture
[Architecture diagram. Components: a configuration server hosting CDB (SQL and XML backends, accessed from CLI, GUI and scripts via SOAP) serving XML configuration profiles over HTTP; managed nodes running the Node Configuration Manager (NCM) with per-service components and the SW Package Manager (SPMA); SW repository server(s) delivering RPMs/PKGs over HTTP; an install server with the Install Manager driving base OS installation via HTTP/PXE and the system installer.]

16 Lemon
- Lemon (LHC Era Monitoring) is a client-server tool suite for monitoring status and performance, comprising:
  - Sensors to measure the values of various metrics
    - Sensors exist for node performance, processes, hardware and software monitoring, database monitoring, security and alarms.
    - "External" sensors cover metrics such as hardware errors and computer centre power consumption.
  - A monitoring agent running on each node, which manages the sensors and sends data to the central repository (illustrated in the sketch below)
  - A central repository storing the full monitoring history
    - Two implementations: Oracle-based or flat-file based
  - An RRD-based display framework
    - Pre-processes data into RRD files and creates cluster summaries, including "virtual" clusters such as the set of nodes being used by a given experiment.
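Lemon has its own sensor API and transport; purely as an illustration of the sensor / agent / repository split described above, here is a Python sketch in which the metric name, port, address and message format are all made up.

```python
# Illustrative sketch (not Lemon's actual API or wire protocol): sensors
# produce metric samples, the agent on each node collects them periodically
# and ships them to a central repository.

import socket
import time
from dataclasses import dataclass


@dataclass
class Sample:
    node: str
    metric: str
    value: float
    timestamp: float


def load_average_sensor(node: str) -> Sample:
    """A trivial 'sensor': one metric, read from /proc (Linux only)."""
    with open("/proc/loadavg") as f:
        load1 = float(f.read().split()[0])
    return Sample(node, "loadavg.1min", load1, time.time())


def agent_loop(node: str, repository: tuple, period: float = 60.0) -> None:
    """Collect samples from the registered sensors and send them to the
    central repository (UDP here, purely for illustration)."""
    sensors = [load_average_sensor]
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    while True:
        for sensor in sensors:
            s = sensor(node)
            payload = f"{s.node} {s.metric} {s.value} {s.timestamp}".encode()
            sock.sendto(payload, repository)
        time.sleep(period)


if __name__ == "__main__":
    # Hypothetical local repository endpoint, for demonstration only.
    agent_loop("lxb0001", ("127.0.0.1", 9999), period=5.0)
```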

17 Lemon architecture
[Architecture diagram. On each node, a monitoring agent manages the sensors and sends data over TCP/UDP to the monitoring repository (SQL backend); correlation engines, user workstations and the Lemon CLI query the repository via SOAP; an RRDTool/PHP display served by Apache provides web-browser access over HTTP.]

18 LEAF
- LEAF (LHC Era Automated Fabric) is a collection of workflows for high-level node hardware and software state management, built on top of quattor and Lemon.
  - HMS (Hardware Management System)
    - Tracks systems through all physical steps in their lifecycle, e.g. installation, moves, vendor calls, retirement.
    - Automatically issues install, retire, etc. requests to technicians.
    - GUI to locate equipment physically.
    - The HMS implementation is CERN-specific, but the concepts and design should be generic.
  - SMS (State Management System), see the state-machine sketch below
    - Automated handling (and tracking) of high-level configuration steps, e.g. reconfigure and reboot all LXPLUS nodes for a new kernel and/or a physical move, or drain and reconfigure nodes for diagnosis/repair operations.
    - Issues all necessary (re)configuration commands via quattor.
    - Extensible framework; plug-ins for site-specific operations are possible.
  - CCTracker (in development)
    - Shows the location of equipment in the room.
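As an illustration of the kind of high-level state tracking SMS performs, here is a minimal Python state-machine sketch. The state names and allowed transitions are hypothetical, not LEAF's actual model.

```python
# Minimal sketch of node state tracking in the spirit of SMS: each node has a
# high-level state, and only certain transitions are allowed (e.g. a node must
# be drained before it can be taken out of production for repair).

from enum import Enum, auto


class NodeState(Enum):
    STANDBY = auto()
    PRODUCTION = auto()
    DRAINING = auto()
    MAINTENANCE = auto()
    RETIRED = auto()


ALLOWED = {
    NodeState.STANDBY: {NodeState.PRODUCTION, NodeState.RETIRED},
    NodeState.PRODUCTION: {NodeState.DRAINING},
    NodeState.DRAINING: {NodeState.MAINTENANCE, NodeState.STANDBY},
    NodeState.MAINTENANCE: {NodeState.STANDBY, NodeState.RETIRED},
    NodeState.RETIRED: set(),
}


class Node:
    def __init__(self, name: str) -> None:
        self.name = name
        self.state = NodeState.STANDBY

    def set_state(self, target: NodeState) -> None:
        if target not in ALLOWED[self.state]:
            raise ValueError(f"{self.name}: {self.state.name} -> {target.name} not allowed")
        # In a real system this is where (re)configuration commands would be
        # issued via the configuration tool before the new state is recorded.
        self.state = target


node = Node("lxb0001")
node.set_state(NodeState.PRODUCTION)
node.set_state(NodeState.DRAINING)      # drain before repair
node.set_state(NodeState.MAINTENANCE)
```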

19 Use Case: Move a rack of machines
[Workflow diagram involving the node, HMS, the LAN database, SMS, CDB, operations and sysadmins: 1. Import; 2. Set to standby; 3. Update; 4. Refresh; 5. Take out of production; 6. Shutdown work order; 7. Request move; 8. Update; 9. Update; 10. Install work order; 11. Set to production; 12. Update; 13. Refresh; 14. Put into production.]

20-30 [No transcribed content for these slides.]

31 Who
- Contract shift operators: 1 person, 24x7
- Technician-level System Administration Team
  - 10 team members, plus 3 people for machine room operations, plus an engineer-level manager
- Engineer-level teams for physics computing
  - System & hardware support: approx. 10 FTE
  - Service support: approx. 10 FTE
  - ELFms software: 3 FTE plus students and collaborators
    - ~30 FTE-years total investment since 2001

32 Summary
- Physics requirements, budget and the tendering process lead to large-scale clusters of commodity hardware.
- We have developed and deployed tools to install, configure and monitor nodes and to automate hardware and software lifecycle steps.
- Services must cope with individual node failures.
  - This is already the case for simple services such as batch.
  - New data management software is being introduced to reduce reliance on individual servers.
  - The focus is now on Grid-level services.
- We believe we are well prepared for LHC computing, but expect managing this large-scale, complex environment to be an exciting adventure.

