Lessons learned administering a larger setup for LHCb

1 Lessons learned administering a larger setup for LHCb
DIRAC User Workshop – Joel Closier – 23 May 2016

2 LHCbDirac in numbers
Sites
  1 T0
  8 T1 (IN2P3, RAL, RRCKI, CNAF, GRIDKA, PIC, SARA, CERN)
  66 LCG, 7 DIRAC, 6 VAC, 14 CLOUD, 3 BOINC
620 users registered in VOMS
100 M pilots run until 1st January 2010
120 M jobs run until 1st January 2010
  71% Simulation
  35% User
  7% Stripping
  3% Merge
  3% Swimming
  2% Reconstruction
  1.5% Reprocessing

3 Storage in LHCb
More than 130 SE (25 PB)
Most common operations
  48.5 PB Replicate and register
  44 PB putAndRegister
  32 PB stage
  28 PB RemovePhysicalReplica

4 LHCbDirac services : Evolution
2010 – 2014: use of physical machines (up to 27)
  Between 16 and 32 CPUs each
  Managed by Quattor
2016 – today: use of virtual machines
  51 instances (3 for tests/certifications, 3 for Jenkins tests)
  150 VCPUs
  305 GB RAM
  10 CEPH volumes
  Managed by Puppet
  4 templates (standard, webportal, rabbitmq and lhcbci)
    Same software installed
    Same rules for firewall and iptables
    Same local user

5 First configuration of VM for Dirac services
39 virtual machines (Puppet managed)
  2 (8 CPUs, 16 GB memory)
  11 (4 CPUs, 8 GB memory)
  26 (2 CPUs, 4 GB memory)
7 CEPH volumes (2.9 TB)
  BOINC
  Sandboxes
  Monitoring
  Transformation
  Log
  Swap
  Failover
/opt/dirac on the ephemeral disk of each VM
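For comparison with the later setup, the aggregate capacity of this first configuration can be worked out from the three VM flavours listed on the slide. The short Python sketch below only does that arithmetic; the variable names are illustrative and nothing here comes from the LHCb tooling.

```python
# (number of VMs, CPUs per VM, GB of memory per VM), as listed on the slide
flavours = [
    (2, 8, 16),
    (11, 4, 8),
    (26, 2, 4),
]

total_vms = sum(n for n, _, _ in flavours)
total_cpus = sum(n * cpus for n, cpus, _ in flavours)
total_ram_gb = sum(n * ram for n, _, ram in flavours)

print(total_vms, total_cpus, total_ram_gb)  # 39 112 224
```

The 39 VMs therefore add up to 112 CPUs and 224 GB of memory, spread mostly over very small machines.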

6 Evaluation of the first configuration
Many machines to manage
Small VMs
  With high load => processes killed
  Small swap
I/O not efficient with ephemeral disk
Update of DIRAC software painful
  No way to do it locally
  Through Web Portal: inefficient
  Through CLI: too long to do it in one single thread
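The single-thread limitation mentioned for the CLI is a generic one: when the same action has to be applied to dozens of hosts, running it concurrently shortens the wall-clock time. The sketch below shows that pattern in Python with a thread pool. It is not the DIRAC CLI and not the procedure used by LHCb; `update_host` is a hypothetical placeholder, and the host names are made up.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def update_host(host):
    """Hypothetical placeholder: a real version would run the site's own
    update procedure on `host`. Here it only reports success."""
    return f"{host}: updated"


def update_all(hosts, action=update_host, max_workers=8):
    """Apply `action` to every host in parallel; collect results and errors."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(action, h): h for h in hosts}
        for fut in as_completed(futures):
            host = futures[fut]
            try:
                results[host] = fut.result()
            except Exception as exc:  # one failing host must not stop the rest
                errors[host] = exc
    return results, errors


if __name__ == "__main__":
    hosts = [f"vobox{i:02d}" for i in range(1, 6)]  # made-up host names
    done, failed = update_all(hosts)
    print(sorted(done), failed)
```

Collecting failures per host instead of aborting on the first one means a partially failed update can be retried on just the affected machines.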

7 Second iteration for the configuration
Bigger VMs
  16 CPUs
  32 GB RAM
/opt/dirac on CEPH volume
Management much easier
Installation of DIRAC faster

8 Why this evolution?
Some services need to have several instances
  BookkeepingManager
  JobStateUpdate
  ResourceStatus
  Optimizers
  TransformationManager
Some services need load balancing (to be tested)
  Configuration server (hammered by all the pilots)
Some services are busy only for a given period
  Better usage of the VM with a big machine

9 VOBOXes - Monitoring
The main entry points are:
  Activity Monitor
  Dashboard
  System Administration

10 Monitoring of the machine

11 VOBOXes - Alarms
Alarms defined so far
  Filesystem full
  /opt full
  Swap space
  High load
Each alarm opens a ticket
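The four alarm conditions listed on this slide can be illustrated with a small local check. The Python sketch below is only an illustration of the idea, not the monitoring actually used on the VOBOXes: the thresholds, function names and metric keys are all assumptions, and the swap reader relies on the Linux `/proc/meminfo` file.

```python
import os
import shutil

# Thresholds are illustrative assumptions, not values from the talk.
DISK_FULL_FRACTION = 0.90
SWAP_USED_FRACTION = 0.50
LOAD_PER_CPU = 2.0


def evaluate(metrics):
    """Return the names of the alarms raised by a dict of metrics."""
    alarms = []
    if metrics["root_used"] >= DISK_FULL_FRACTION:
        alarms.append("filesystem full")
    if metrics["opt_used"] >= DISK_FULL_FRACTION:
        alarms.append("/opt full")
    if metrics["swap_used"] >= SWAP_USED_FRACTION:
        alarms.append("swap space")
    if metrics["load_per_cpu"] >= LOAD_PER_CPU:
        alarms.append("high load")
    return alarms


def used_fraction(path):
    """Fraction of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def swap_used_fraction(meminfo_path="/proc/meminfo"):
    """Linux only: derive swap usage from SwapTotal/SwapFree (in kB)."""
    values = {}
    with open(meminfo_path) as fh:
        for line in fh:
            key, _, rest = line.partition(":")
            if key in ("SwapTotal", "SwapFree"):
                values[key] = int(rest.split()[0])
    total = values.get("SwapTotal", 0)
    return 0.0 if total == 0 else 1 - values.get("SwapFree", 0) / total


def collect():
    """Gather live metrics; assumes Linux and that /opt exists."""
    return {
        "root_used": used_fraction("/"),
        "opt_used": used_fraction("/opt"),
        "swap_used": swap_used_fraction(),
        "load_per_cpu": os.getloadavg()[0] / (os.cpu_count() or 1),
    }
```

On a Linux host, `evaluate(collect())` returns the list of conditions currently exceeded. Keeping `evaluate` separate from `collect` makes the thresholds easy to test without touching the machine, and opening a ticket per alarm would be a further step built on top of the returned list.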

12 LHCbDirac Setups
2 setups (previously 3)
  Production
  Certification
All new versions of DIRAC can be tested with the Certification setup, except a few of them because of the Configuration Server
Testing of this setup is associated with Jenkins to automate most of the steps:
  Consistency of code
  Installation
  Production jobs

13 Databases
2 types of databases used in Production
  ORACLE (Bookkeeping)
  DBOD (DataBase On Demand)
    Lbacc
    Lbprod
    Lbwms
    Lbwmsacc
    dfc
2 types of databases used in Certification
  DBOD (DataBase On Demand): lbcertif, lbprdev

14 Services outside CERN
Most of the services are located at CERN and are duplicated on several instances at CERN
6 machines outside CERN, located in the T1 sites used by LHCb
  RAL
  GRIDKA
  IN2P3
  CNAF
  PIC
  SARA
Machines used for duplication of services
  Configuration Server (slave instance)
  ReqProxy

15 Main issues
With such a configuration and with the tools that we have in place:
  Difficult to spot services/agents/optimizers which are stuck
  Installation of a new version of DIRAC is delicate
  Recovering a VM that is down is not trivial (no live migration)
  Web interface for system administration needs improvement

16 Conclusions
Web portal, with its system administration console, is useful to manage a large set of machines, but is missing functionalities
  DIRAC update not very friendly
  Extension version number for the VO not displayed
  Lots of clicks to get a meaningful error
Duplication of services helps a lot with the load on the machines
Single point of failure ??

