Computing Fabric (CERN), Status and Plans
Bernd Panzer-Steindel, CERN/IT

View of different Fabric areas
- Infrastructure: electricity, cooling, space
- Network
- Batch system (LSF, CPU servers)
- Storage system (AFS, CASTOR, disk servers)
- Purchase, hardware selection, resource planning
- Installation, configuration + monitoring, fault tolerance
- Prototype, testbeds
- Benchmarks, R&D, architecture
- Automation, operation, control
Coupling of the components through hardware and software; GRID services !?

Current relationship of the Fabric to other projects
- openlab: Gbit networking, new CPU technology, possibly new storage technology
- EDG WP4: installation, configuration, monitoring, fault tolerance
- GRID technology and deployment: common fabric infrastructure, Fabric <-> GRID interdependencies
- GDB working groups: site coordination, common fabric issues
- LCG: hardware resources, manpower resources
- External network: firewall performance
- Collaboration with India: monitoring, Quality of Service
- SERCO: sysadmin outsourcing
- CERN IT: main Fabric provider

Preparations for the LCG-1 service
Two parallel, coupled approaches:
1. Use the prototype to install pilot LCG-1 production services with the corresponding tools and configurations of the different middleware packages (EDG, VDT, etc.)
2. 'Attach' the Lxbatch production worker nodes carefully, in a non-intrusive way, to the GRID services (service nodes and worker nodes); the focus here is on the worker nodes, increasing in size from Pilot 1 (50 nodes, 10 TB) to the service in July (200 nodes, 20 TB)

Fabric milestones for the LCG-1 service
- Production Pilot 1 starts
- Production Pilot 2 starts
- LCG-1 initial service
- days acceptance test
- Lxbatch job scheduler pilot
- Lxbatch replica manager pilot
- Lxbatch merges into LCG
- days acceptance test
- Fully operational LCG-1 service & distributed production environment

Fabrics project plan: integration of the milestones with the GD area
- Pilot-1 service – February 1: machines (CE), 10 TB (SE). Runs the middleware currently on the LCG testbeds. Initial testbed at CERN.
- Add 1 remote site by February 28.
- Pilot-2 service – March 15: machines (CE), 10 TB (SE). The CERN service will run the full prototype of the WP4 installation and configuration system.
- Add 1 US site to the pilot – March 30, 2003
- Add 1 Asian site to the pilot – April 15, 2003
- Add 2-3 more EU and US sites – April-May 2003
- Service includes 6-7 sites – June 1, 2003
- LCG-1 initial production system – July: machines (CE), 20 TB (SE). Uses the full WP4 system with a fully integrated fabric infrastructure. The global service has 6-7 sites on 3 continents.

Status and plans, Fabric area: Architecture (I)
- Benchmark and performance cluster (current architecture and hardware)
- PASTA investigation
- R&D activities (background): iSCSI, SAN, Infiniband; cluster technologies
- Data Challenges: experiment-specific; IT base figures
- Architecture validation
- Benchmark and analysis framework (a minimal sketch follows below)
- Components: Linux, CASTOR, AFS, LSF, EIDE disk servers, Ethernet, etc.
- Computing model of the experiments
- Criteria: reliability, performance, functionality
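To give a flavour of what the benchmark and analysis framework entry amounts to in practice, here is a minimal, hypothetical sketch of a sequential disk-server throughput test in Python. It is not the framework used at CERN; the file path, file size and block size are illustrative assumptions.

```python
import os
import time

def measure_throughput(path="/tmp/io_bench.dat", size_mb=1024, block_kb=256):
    """Write and read back a test file, returning MB/s for each phase.

    A toy sequential-I/O benchmark; the real fabric benchmarks covered many
    servers, access patterns and concurrency levels.
    """
    block = os.urandom(block_kb * 1024)
    blocks = size_mb * 1024 // block_kb

    # Sequential write, synced to disk so the page cache does not hide the cost.
    t0 = time.time()
    with open(path, "wb") as f:
        for _ in range(blocks):
            f.write(block)
        f.flush()
        os.fsync(f.fileno())
    write_mbps = size_mb / (time.time() - t0)

    # Sequential read back (may be partly served from the page cache).
    t0 = time.time()
    with open(path, "rb") as f:
        while f.read(block_kb * 1024):
            pass
    read_mbps = size_mb / (time.time() - t0)

    os.remove(path)
    return write_mbps, read_mbps

if __name__ == "__main__":
    w, r = measure_throughput()
    print(f"write: {w:.1f} MB/s, read: {r:.1f} MB/s")
```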

Status and plans, Fabric area: Architecture (II)
Regular checkpoints for the architecture verification:
- Computing data challenges (IT, ALICE mass storage); physics data challenges (no real I/O stress yet -- analysis)
- Collecting the stability and performance measurements of the commodity hardware in the different fabric areas
- Verifying interdependencies and limits
- Definition of Quality of Service
Regular (mid 2003, 2004, 2005) reports on the status of the architecture; TDR report finished by mid 2005

Status and plans, Fabric area: Infrastructure
- Vault conversion complete; migration of equipment from the centre has started
- Plans for the upgrade to 2.5 MW cooling and electricity supply are progressing well
Worries:
- Financing of this exercise
- CPU power consumption development: performance per Watt is improving very little
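To make the power worry concrete, a rough back-of-envelope estimate; the per-node power draw and cooling overhead below are assumptions for illustration, not figures from the slides.

```python
# Assumed numbers, purely illustrative: the point is that the 2.5 MW envelope,
# not CPU price, eventually caps the farm size if performance per Watt stalls.
facility_limit_w = 2_500_000   # planned cooling/electricity capacity (from the slide)
node_power_w = 250             # assumed draw of one dual-CPU batch node
cooling_overhead = 1.5         # assumed ratio of total power to IT power

max_nodes = facility_limit_w / (node_power_w * cooling_overhead)
print(f"~{max_nodes:,.0f} nodes fit in the 2.5 MW envelope under these assumptions")
```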

Status and plans, Fabric area: Operation, Control
- EDG WP4: the time schedule for delivery of installation, configuration, fault tolerance and monitoring is aligned with the milestones of the LCG-1 service; integration of the new tools into the Lxbatch service has started
- Successful introduction of a new Linux certification team (all experiments + IT); it has just released a certified RH version; this is also important for the site coordination (GDB WG4)
- The Linux team increases next year from 3 to 4 (later 5) FTE
- The outsourcing contract (SERCO) for system administration ends in December and will be replaced by insourcing: ~10 technical engineers over the next years

Status and plans, Fabric area: Networking
- 10 Gbit equipment tests until mid 2003; integration into the prototype mid 2003; partial integration into the backbone mid 2004; full 10 Gbit backbone mid 2005
- Network in the computer centre: 3Com and Enterasys equipment, 14 routers, 147 switches (Fast Ethernet and Gigabit)
- Stability: 29 interventions in 6 months (resets, hardware failures, software bugs, etc.)
- Traffic: constant load of ~400 MB/s aggregate, no overload (~10% load)
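A quick sanity check of the traffic figures, using only the numbers quoted above (plain unit conversion):

```python
# 400 MB/s aggregate at ~10% load implies roughly 30+ Gbit/s of aggregate capacity.
aggregate_mb_per_s = 400
gbit_per_s = aggregate_mb_per_s * 8 / 1000        # ~3.2 Gbit/s sustained
implied_capacity_gbit = gbit_per_s / 0.10         # ~32 Gbit/s aggregate backbone capacity
print(f"{gbit_per_s:.1f} Gbit/s sustained -> ~{implied_capacity_gbit:.0f} Gbit/s capacity at 10% load")
```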

Status and plans, Fabric area: Batch system
- Node stability: 7 reboots per day and hardware interventions per day (mostly IBM disk problems), with ~700 nodes running batch jobs at ~65% CPU utilization over the last 6 months
- General survey of batch systems during 2004; based on the recommendations of the survey, a possible installation of a new batch system is scheduled for 2005
- Successful introduction of share queues in LSF: optimization of the general throughput
- Continuous work on Quality of Service (user interference, problem disentanglement); statistics and monitoring
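Putting the stability and utilization numbers into perspective, using only the figures quoted above:

```python
nodes = 700
reboots_per_day = 7
cpu_utilization = 0.65

reboot_fraction = reboots_per_day / nodes    # ~1% of the farm reboots on any given day
effective_nodes = nodes * cpu_utilization    # ~455 fully-busy node-equivalents of delivered work
print(f"{reboot_fraction:.1%} of nodes reboot per day; ~{effective_nodes:.0f} node-equivalents delivered")
```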

Status and plans, Fabric area: Storage (I)
- CASTOR HSM system: 8 million files, 1.8 PB of data today
- 20 new tape drives (9940B) have arrived and are in heavy use right now (IT Computing DCs and the ALICE DC)
Hardware stability:
- The new disk server generation doubles the performance and solves the tape server / disk server 'impedance matching' problem (disk I/O should be much faster than tape I/O)
- ~one intervention per week on one tape drive (STK 9940A)
- ~one tape with recoverable problems per 2 weeks (to be sent to STK HQ)
- ~one disk server reboot per week (out of ~200 disk servers in production)
- ~one disk error per week (out of ~3000 disks in production)
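Two derived figures that follow directly from the numbers above (simple arithmetic, no new data):

```python
files = 8_000_000
data_bytes = 1.8e15                               # 1.8 PB
avg_file_mb = data_bytes / files / 1e6            # ~225 MB average file size

disks = 3000
errors_per_week = 1
annual_error_rate = errors_per_week * 52 / disks  # ~1.7% of disks see an error per year
print(f"average file ~{avg_file_mb:.0f} MB; ~{annual_error_rate:.1%} of disks hit an error per year")
```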

Status and plans, Fabric area: Storage (II)
- Details of the storage access methods need to be defined and implemented by March 2003 (application I/O, transport mechanism, CASTOR interfaces, replica-management middleware, etc.); a thin-wrapper illustration follows below
- A survey of common storage solutions will start in July 2003; recommendations will be reported in July 2004
- Tests and prototype installations are planned from July 2004 to June 2005
- Deployment of the storage solution for LHC will start in July 2005
CASTOR activities are focused on consolidation:
- Stager rewrite
- Improved error recovery and redundancy
- Stability (the IT and ALICE DCs are very useful here)
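To illustrate what defining the storage access methods means at the application level, here is a hypothetical sketch of a thin wrapper that picks a transfer tool from the source naming scheme. The rfcp (CASTOR RFIO copy) and globus-url-copy (GridFTP) commands existed in the toolkits of that era, but this wrapper and its interface are invented for illustration and are not a CERN interface.

```python
import subprocess

def fetch(source: str, destination: str) -> None:
    """Copy a file to a local destination, choosing the transfer tool
    from the source 'scheme'. Illustrative only."""
    if source.startswith("gsiftp://"):
        # Wide-area transfer via GridFTP (Globus command-line client).
        cmd = ["globus-url-copy", source, "file://" + destination]
    elif source.startswith("castor:"):
        # CASTOR namespace access via the RFIO copy command.
        cmd = ["rfcp", source[len("castor:"):], destination]
    else:
        # Plain local, NFS or AFS path.
        cmd = ["cp", source, destination]
    subprocess.run(cmd, check=True)

# Example with hypothetical paths:
# fetch("castor:/castor/cern.ch/user/x/file.root", "/tmp/file.root")
```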

Status and plans, Fabric area: Resources
- Common planning for the 2003 resources (CPU, disk) has been established, combining the PEB (physics data challenges), the LCG prototype (computing data challenges) and the general resources (COCOTIME)
- Very flexible policy to 'move' resources between the different areas, to achieve the highest possible resource optimization
- IT physics base budget for CPU and disk resources: 1.75 million SFr in 2003
- Advancement of the 2004 purchases for the prototype is needed: a non-trivial exercise requiring continuous adaptation; CERN purchasing procedures don't make it easier

Dual P4 node == 1300 SI2000 == 3000 SFr == 2.3 SFr/SI2000
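Checking the quoted ratio and setting it against the 1.75 MSFr budget from the previous slide; the all-CPU figure is an illustrative upper bound only, since the budget also covers disk:

```python
node_price_sfr = 3000
node_si2000 = 1300
price_per_si2000 = node_price_sfr / node_si2000   # ~2.3 SFr/SI2000, as quoted

budget_sfr = 1_750_000                            # 2003 base budget for CPU and disk
max_nodes = budget_sfr / node_price_sfr           # ~580 dual-P4 nodes if spent on CPU only
max_si2000 = budget_sfr / price_per_si2000        # ~760 kSI2000
print(f"{price_per_si2000:.2f} SFr/SI2000; up to ~{max_nodes:.0f} nodes (~{max_si2000/1000:.0f} kSI2000)")
```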

Personnel in the Fabrics area (I)
The focus of the IT personnel is on service.
Personnel breakdown per fabric area, across LCG (Q4 2002), LCG (Q1 2003), EDG and IT:
- System Management and Operation
- Development (management automation)
- Data Storage Management
- Grid Security
- Grid-Fabric Interface

Personnel in the Fabrics area (II)
LCG personnel, more details: 2 staff, 2 fellows, 6 unpaid associates (5 cooperants/students) (PPARC, IN2P3, Spain, Israel)
System Management and Operation: 2.5 unpaid associates
- System administration for the various EDG testbeds (system installation, middleware installation, user support, feedback to the developers, etc.)
- Design and implementation of an I/O benchmarking framework; detailed disk server benchmarks as preparation for the Data Challenges

Personnel in the Fabrics area (III)
Data Storage Management: 1 fellow + 1 UPAS
- Design and implementation of specific CASTOR monitoring sensors
- Interfacing CASTOR to various transfer protocols (sftp, GridFTP)
- Maintenance and support for the modified GridFTP servers and clients
Development (management automation): 2 staff + 1 fellow + UPAS
- Pilots and preparation of a large production system for automated remote (secure) access to node consoles and remote reset (basic cluster infrastructure)
- Evaluation and pilot of a 'diskless' cluster setup (fast installation, configuration simplification)
- Prototype of a hardware workflow tracking system (preparation for handling large numbers of hardware components)
- Evaluation and implementation of database solutions for the monitoring storage (a toy sketch follows below)
- Various contributions to installation and monitoring tools
Grid Security: 1 UPAS
- Replacement of the old CA with an improved version based on a redesigned infrastructure; documentation; new functionality
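As a flavour of what a monitoring sensor feeding a database looks like, a minimal hypothetical sketch follows; the metric, table layout, node name and sampling interval are invented for illustration, and this is not the CERN sensor code.

```python
import os
import sqlite3
import time

def sample_disk_usage(path="/"):
    """Return the fraction of the filesystem holding `path` that is in use."""
    stat = os.statvfs(path)
    total = stat.f_blocks * stat.f_frsize
    free = stat.f_bavail * stat.f_frsize
    return 1.0 - free / total

def run_sensor(db_path="monitoring.db", node="diskserver01", samples=3, interval_s=5):
    """Periodically record one metric into a small SQL table."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS metrics (ts REAL, node TEXT, metric TEXT, value REAL)"
    )
    for _ in range(samples):
        conn.execute(
            "INSERT INTO metrics VALUES (?, ?, ?, ?)",
            (time.time(), node, "disk_used_fraction", sample_disk_usage()),
        )
        conn.commit()
        time.sleep(interval_s)
    conn.close()

if __name__ == "__main__":
    run_sensor()
```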

Conclusions
- Architecture verification is okay so far
- Stability and performance of the commodity equipment are good
- The major 'stress' (I/O) on the systems is coming from the Computing DCs and the currently running experiments, not from the LHC physics productions
Worries:
- Computer centre infrastructure (finance and power)
- Analysis model and facility
- Quality of Service measurements
- Constraints imposed by the middleware
Remark: things are driven by the market, not by the pure technology -> possible paradigm changes