
1 Computing Fabric (CERN), Status and Plans (Bernd Panzer-Steindel, CERN IT)

2 View of different Fabric areas
- Infrastructure: electricity, cooling, space
- Network
- Batch system (LSF, CPU servers)
- Storage system (AFS, CASTOR, disk servers)
- Purchasing, hardware selection, resource planning
- Installation, configuration + monitoring, fault tolerance
- Prototype, testbeds
- Benchmarks, R&D, architecture
- Automation, operation, control
Coupling of the components through hardware and software. GRID services!?

3 Current relationship of the Fabric to other projects
- openlab: 10 Gbit networking, new CPU technology, possibly new storage technology
- EDG WP4: installation, configuration, monitoring, fault tolerance
- GRID technology and deployment: common fabric infrastructure, Fabric <-> GRID interdependencies
- GDB working groups: site coordination, common fabric issues
- LCG: hardware resources, manpower resources
- External network: firewall performance
- Collaboration with India: monitoring, Quality of Service
- SERCO: sysadmin outsourcing
- CERN IT: the main Fabric provider

4 Preparations for the LCG-1 service
Two parallel, coupled approaches:
1. Use the prototype to install pilot LCG-1 production services with the corresponding tools and configurations of the different middleware packages (EDG, VDT, etc.)
2. 'Attach' the Lxbatch production worker nodes carefully, in a non-intrusive way, to the GRID services -> service nodes and worker nodes; the focus here is on the worker nodes
Increasing in size from Pilot 1 (50 nodes, 10 TB) to the service in July (200 nodes, 20 TB).

5 Fabric Milestones for the LCG-1 service
- Production Pilot 1 starts: 15.01.2003
- Production Pilot 2 starts: 17.04.2003
- LCG-1 initial service: 01.07.2003
- 7-day acceptance test: 04.08.2003
- Lxbatch job scheduler pilot: 03.02.2003
- Lxbatch replica manager pilot: 01.09.2003
- Lxbatch merges into LCG-1: 17.10.2003
- 30-day acceptance test: 28.10.2003
- Fully operational LCG-1 service & distributed production environment: 24.11.2003

6 Integration of the milestones with the GD area
- Pilot-1 service, February 1, 2003: 50 machines (CE), 10 TB (SE). Runs the middleware currently on the LCG testbeds. Initial testbed at CERN.
- Add 1 remote site by February 28, 2003.
- Pilot-2 service, March 15, 2003: 100 machines (CE), 10 TB (SE). The CERN service will run a full prototype of the WP4 installation and configuration system.
- Add 1 US site to the pilot, March 30, 2003.
- Add 1 Asian site to the pilot, April 15, 2003.
- Add 2-3 more EU and US sites, April-May 2003.
- Service includes 6-7 sites, June 1, 2003.
- LCG-1 initial production system, July 2003: 200 machines (CE), 20 TB (SE). Uses the full WP4 system with fully integrated fabric infrastructure. The global service has 6-7 sites on 3 continents.
Fabrics project plan: http://lcg.web.cern.ch/LCG/PEB/Planning/PBS/LCG.mpp

7 Status and plans, Fabric area: Architecture (I)
- Benchmark and performance cluster (current architecture and hardware)
- PASTA investigation
- R&D activities (background): iSCSI, SAN, InfiniBand; cluster technologies
- Data Challenges: experiment-specific and IT base figures
- Architecture validation
- Benchmark and analysis framework (a minimal sketch of such a measurement follows this slide)
- Components: Linux, CASTOR, AFS, LSF, EIDE disk servers, Ethernet, etc.
- Computing models of the experiments
- Criteria: reliability, performance, functionality
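
To make the benchmark and analysis framework concrete: a minimal sketch of the kind of sequential-write throughput measurement such a framework collects for a disk server. This is purely illustrative Python, not the actual framework (which the slide does not describe); the file path and sizes are arbitrary choices.

    import os
    import time

    def write_throughput(path, total_mb=1024, block_kb=256):
        """Write total_mb megabytes sequentially and return MB/s."""
        block = b"\x00" * (block_kb * 1024)
        n_blocks = (total_mb * 1024) // block_kb
        start = time.time()
        with open(path, "wb") as f:
            for _ in range(n_blocks):
                f.write(block)
            f.flush()
            os.fsync(f.fileno())  # include the time to flush to disk
        return total_mb / (time.time() - start)

    if __name__ == "__main__":
        # The path is an arbitrary example; point it at the disk under test.
        print("%.1f MB/s" % write_throughput("/tmp/bench.dat"))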

8 Status and plans, Fabric area: Architecture (II)
Regular checkpoints for the architecture verification:
- Computing data challenges (IT, ALICE mass storage)
- Physics data challenges (no real I/O stress from analysis yet)
- Collecting the stability and performance measurements of the commodity hardware in the different fabric areas
- Verifying interdependencies and limits
- Definition of Quality of Service
Regular (mid-2003, 2004, 2005) reports on the status of the architecture; TDR report finished by mid-2005.

9 Status and plans, Fabric area: Infrastructure
- Vault conversion complete; migration of equipment from the centre has started.
- Plans for the upgrade to 2.5 MW cooling and electricity supply are progressing well.
http://ref.cern.ch/CERN/IT/C5/2002/038/topic.html
https://web11.cern.ch/it-support-mrp/B513 Upgrade/
http://lcg.web.cern.ch/LCG/C-RRB/2002-05/RRB2_Report1610.doc
Worries:
- Financing of this exercise
- CPU power consumption development: performance per watt is improving very little

10 Status and plans, Fabric area: Operation, Control
- EDG WP4: the time schedule for delivery of installation, configuration, fault tolerance and monitoring is aligned with the milestones of the LCG-1 service. Integration of the new tools into the Lxbatch service has started.
- Successful introduction of a new Linux certification team (all experiments + IT) -> just released RH 7.3.1. Important also for site coordination (GDB WG4). The Linux team grows next year from 3 to 4 (later 5) FTE.
- The outsourcing contract (SERCO) for system administration ends in December 2003 and will be replaced by insourcing: ~10 technical engineers in the next years.

11 Status and plans, Fabric area: Networking
- 10 Gbit equipment tests until mid-2003; integration into the prototype mid-2003; partial integration into the backbone mid-2004; full 10 Gbit backbone mid-2005
- Network in the computer centre: 3Com and Enterasys equipment, 14 routers, 147 switches (Fast Ethernet and Gigabit)
- Stability: 29 interventions in 6 months (resets, hardware failures, software bugs, etc.)
- Traffic: constant load of ~400 MB/s aggregate, no overload -> ~10% load (see the arithmetic sketch below)
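
A back-of-the-envelope check of the quoted ~10% load figure. The slide gives only the measured traffic; the ~4 GB/s aggregate backbone capacity used here is an assumption for illustration, not a number from the slide.

    # Sanity check of the ~10% load figure on slide 11.
    traffic_mb_s = 400.0    # measured aggregate load (from the slide)
    capacity_mb_s = 4000.0  # ASSUMED aggregate capacity, for illustration
    print("load = %.0f%%" % (100.0 * traffic_mb_s / capacity_mb_s))  # -> 10%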

12 Status and plans, Fabric area: Batch system
- Node stability: 7 reboots per day + 0.7 hardware interventions per day (mostly IBM disk problems), with ~700 nodes running batch jobs at ~65% CPU utilization over the last 6 months (per-node rates are worked out below)
- General survey of batch systems during 2004; based on the recommendations of the survey, a possible installation of a new batch system is scheduled for 2005
- Successful introduction of share queues in LSF -> optimization of the general throughput
- Continuous work on Quality of Service (user interference, problem disentanglement)
- Statistics and monitoring: http://it-div-fio-is.web.cern.ch/it-div-fio-is/Reports/Weekly_lsf_stats(all%20groups).xls
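
The fleet-wide figures translate into per-node rates; this is straight arithmetic on the numbers quoted on slide 12, nothing more.

    # Per-node stability implied by the quoted fleet-wide numbers.
    nodes = 700.0
    reboots_per_day = 7.0
    hw_interventions_per_day = 0.7
    print("reboots per node: %.1f%%/day" % (100 * reboots_per_day / nodes))        # 1.0%/day
    print("mean days between reboots per node: %.0f" % (nodes / reboots_per_day))  # ~100
    print("mean days between HW interventions per node: %.0f"
          % (nodes / hw_interventions_per_day))                                    # ~1000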

13 Status and plans, Fabric area: Storage (I)
- CASTOR HSM system: 8 million files, 1.8 PB of data today (the implied averages are worked out below)
- 20 new tape drives (9940B) have arrived and are in heavy use right now -> IT Computing DCs and ALICE DC
- Hardware stability: the new disk server generation doubles the performance and solves the tape server / disk server 'impedance matching' problem (disk I/O should be much faster than tape I/O)
- ~one intervention per week on one tape drive (STK 9940A)
- ~one tape with recoverable problems per 2 weeks (to be sent to STK HQ)
- ~one disk server reboot per week (out of ~200 disk servers in production)
- ~one disk error per week (out of ~3000 disks in production)
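
What the CASTOR and disk-farm figures on slide 13 imply on average; again, straight arithmetic on the quoted numbers.

    # Averages implied by the quoted storage figures.
    files = 8e6             # files in CASTOR
    data_pb = 1.8           # total data (PB)
    print("average file size: %.0f MB" % (data_pb * 1e9 / files))  # ~225 MB

    disks = 3000.0          # disks in production
    errors_per_week = 1.0   # observed disk error rate
    print("annualized disk error rate: %.1f%%"
          % (100 * errors_per_week * 52 / disks))                  # ~1.7%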

14 Status and plans, Fabric area: Storage (II)
- Details of the storage access methods need to be defined and implemented by March 2003 (application I/O, transport mechanism, CASTOR interfaces, replica management middleware, etc.)
- A survey of common storage solutions will start in July 2003; recommendations will be reported in July 2004
- Tests and prototype installations are planned from July 2004 to June 2005
- Deployment of the storage solution for LHC will start in July 2005
- CASTOR activities are focused on consolidation: stager rewrite, improved error recovery and redundancy, stability -> IT and ALICE DCs very useful

15 Status and plans, Fabric area: Resources
http://doc.cern.ch/AGE/current/askArchive.php?a02155/a02155s1t3/transparencies/slides.ppt
- Common planning for the 2003 resources (CPU, disk) established, combining PEB (physics data challenges), the LCG prototype (computing data challenges) and general resources (COCOTIME)
- Very flexible policy to 'move' resources between the different areas, to achieve the highest possible resource optimization; a non-trivial exercise requiring continuous adaptation, and CERN purchasing procedures don't make it easier
- IT physics base budget for CPU and disk resources: 1.75 million SFr in 2003
- Advancement of the 2004 purchases for the prototype is needed

16 Dual P4 node == 1300 SI2000 == 3000 SFr == 2.3 SFr/SI2000
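
The price point follows directly from the two quoted numbers. The second figure, what the 2003 base budget from slide 15 would buy at this price, is only a rough upper bound, since part of that budget goes to disk rather than CPU.

    # Cost per SI2000 and what the 1.75 MSFr base budget (slide 15)
    # would buy at this price point.
    node_price_sfr = 3000.0
    node_si2000 = 1300.0
    sfr_per_si2000 = node_price_sfr / node_si2000
    print("%.1f SFr/SI2000" % sfr_per_si2000)                      # ~2.3

    budget_sfr = 1.75e6
    # Upper bound: assumes the whole budget goes to CPU, which it does not.
    print("~%.0f kSI2000 if spent on CPU alone"
          % (budget_sfr / sfr_per_si2000 / 1000))                  # ~758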

17 Personnel in the Fabrics area (I)
Focus of the IT personnel is on service.

Fabric area                           LCG(Q402)  LCG(Q103)  EDG   IT
System Management and Operation          2.5        2.0      -   14.3
Development (management automation)      4.5        3.0     3.0   5.3
Data Storage Management                  2.0        2.0      -   10.1
Grid Security                            1.0        2.0      -    1.0
Grid-Fabric Interface                     -         1.0     1.0   0.8

18 Personnel in the Fabrics area (II)
LCG personnel, in more detail: 2 staff, 2 fellows, 6 unpaid associates (5 cooperants/students) (PPARC, IN2P3, Spain, Israel)
- System Management and Operation: 2.5 unpaid associates
  - System administration for the various EDG testbeds (system installation, middleware installation, user support, feedback to the developers, etc.)
  - Design and implementation of an I/O benchmarking framework; detailed disk server benchmarks as preparation for the Data Challenges

19 Personnel in the Fabrics area (III)
- Data Storage Management: 1 fellow + 1 UPAS
  - Design and implementation of specific CASTOR monitoring sensors (the general sensor pattern is sketched below)
  - Interfacing CASTOR to various transfer protocols (sftp, GridFTP)
  - Maintenance and support for the modified GridFTP servers and clients
- Development (management automation): 2 staff + 1 fellow + 1.5 UPAS
  - Pilots and preparation for a large production system of automated remote (secure) access to node consoles and remote reset (basic cluster infrastructure)
  - Evaluation and pilot of a 'diskless' cluster setup (fast installation, configuration simplification)
  - Prototype of a hardware workflow tracking system (preparation for handling large numbers of hardware components)
  - Evaluation and implementation of database solutions for the monitoring storage
  - Various contributions to installation and monitoring tools
- Grid Security: 1 UPAS
  - Replacement of the old CA with an improved version based on a redesigned infrastructure; documentation; new functionality
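
To make "monitoring sensor" concrete: a minimal sketch of the usual pattern (sample a metric, timestamp it, store it in a database). Purely illustrative; the actual CASTOR sensors, their metrics, and the monitoring store are not described on the slide, and the table schema here is invented for the example.

    import sqlite3
    import time

    def sample_load():
        """Read the 1-minute load average from /proc (Linux only)."""
        with open("/proc/loadavg") as f:
            return float(f.read().split()[0])

    def run_sensor(db_path="monitoring.db", interval_s=60, samples=5):
        # Minimal sensor loop: sample, timestamp, store.
        db = sqlite3.connect(db_path)
        db.execute("CREATE TABLE IF NOT EXISTS metrics"
                   " (ts REAL, name TEXT, value REAL)")
        for _ in range(samples):
            db.execute("INSERT INTO metrics VALUES (?, ?, ?)",
                       (time.time(), "loadavg1", sample_load()))
            db.commit()
            time.sleep(interval_s)

    if __name__ == "__main__":
        run_sensor(samples=1, interval_s=0)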

20 Conclusions
- Architecture verification okay so far
- Stability and performance of the commodity equipment are good
- The major 'stress' (I/O) on the systems comes from the Computing DCs and the currently running experiments, not from the LHC physics productions
- Worries: computer centre infrastructure (finance and power); analysis model and facility; Quality of Service measurements; constraints imposed by the middleware
- Remark: things are driven by the market, not by the pure technology -> possible paradigm changes

