Lessons learned administering a larger setup for LHCb


Lessons learned administering a larger setup for LHCb
DIRAC User Workshop – Joel Closier – 23 May 2016

LHCbDirac in numbers

Sites:
- 1 T0, 8 T1 (IN2P3, RAL, RRC-KI, CNAF, GRIDKA, PIC, SARA, CERN)
- 66 LCG, 7 DIRAC, 6 VAC, 14 CLOUD, 3 BOINC

Users:
- 620 users registered in VOMS

Activity since 1st January 2010:
- 100 M pilots run
- 120 M jobs run: 71% Simulation, 35% User, 7% Stripping, 3% Merge, 3% Swimming, 2% Reconstruction, 1.5% Reprocessing

Storage in LHCb

- More than 130 SEs, 25 PB in total
- Most common operations (by volume):
  - 48.5 PB ReplicateAndRegister
  - 44 PB putAndRegister
  - 32 PB stage
  - 28 PB RemovePhysicalReplica
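To make those operation names concrete: they are data-management requests executed through the DIRAC client API. Below is a minimal sketch of the two most common ones, assuming the DIRAC v6-era DataManager interface; the LFN, local file path and SE names are placeholders, not actual LHCb values.

```python
# A minimal sketch of the data-management operations listed above,
# using the DIRAC v6-era DataManager client. LFN, local path and SE
# names are placeholders.
from DIRAC.Core.Base import Script
Script.parseCommandLine()

from DIRAC.DataManagementSystem.Client.DataManager import DataManager

dm = DataManager()

# putAndRegister: upload a local file to an SE and register it in the catalogue
result = dm.putAndRegister('/lhcb/user/j/jdoe/example.dst',   # placeholder LFN
                           '/tmp/example.dst',                # local file
                           'CERN-USER')                       # placeholder SE
print(result)

# ReplicateAndRegister: copy an existing replica to another SE and register it
result = dm.replicateAndRegister('/lhcb/user/j/jdoe/example.dst', 'RAL-USER')
print(result)
```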

LHCbDirac services: evolution

2010 – 2014: physical machines
- up to 27 machines, with 16 to 32 CPUs each
- managed by Quattor

2016 – today: virtual machines
- 51 instances (3 for tests/certification, 3 for Jenkins tests)
- 150 VCPUs, 305 GB RAM
- 10 CEPH volumes
- managed by Puppet, with 4 templates (standard, webportal, rabbitmq and lhcbci):
  - same software installed
  - same rules for firewall and iptables
  - same local user

First configuration of the VMs for DIRAC services

- 39 virtual machines (Puppet-managed):
  - 2 with 8 CPUs and 16 GB memory
  - 11 with 4 CPUs and 8 GB memory
  - 26 with 2 CPUs and 4 GB memory
- 7 CEPH volumes (2.9 TB): BOINC, sandboxes, monitoring, transformation, log, swap, failover
- /opt/dirac on the ephemeral disk of each VM

Evaluation of the first configuration

- Many machines to manage
- VMs too small:
  - under high load, processes get killed
  - small swap
- I/O not efficient with the ephemeral disk
- Updating the DIRAC software is painful:
  - no way to do it locally
  - through the web portal: inefficient
  - through the CLI: too long when done in a single thread (see the sketch below)
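One way around the single-threaded CLI update is to fan the update out over all hosts in parallel. A minimal sketch, assuming the DIRAC v6 SystemAdministratorClient and its updateSoftware() call; the host list and version string are placeholders.

```python
# A minimal sketch: update DIRAC on many VOBOXes in parallel rather than
# one host at a time. Assumes DIRAC v6's SystemAdministratorClient and
# updateSoftware(); hosts and version are placeholders.
# Python 3 (or the "futures" backport on Python 2).
from concurrent.futures import ThreadPoolExecutor

from DIRAC.Core.Base import Script
Script.parseCommandLine()

from DIRAC.FrameworkSystem.Client.SystemAdministratorClient import SystemAdministratorClient

HOSTS = ['voboxNN.cern.ch']        # placeholder list of VOBOXes
VERSION = 'vXrYpZ'                 # placeholder LHCbDirac version

def update(host):
    # One RPC per host; the server side performs the actual update.
    result = SystemAdministratorClient(host).updateSoftware(VERSION)
    return host, result['OK'], result.get('Message', '')

with ThreadPoolExecutor(max_workers=10) as pool:
    for host, ok, msg in pool.map(update, HOSTS):
        print('%s : %s %s' % (host, 'OK' if ok else 'FAILED', msg))
```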

Second iteration of the configuration

- Bigger VMs: 16 CPUs, 32 GB RAM
- /opt/dirac on a CEPH volume
- Management much easier
- Installation of DIRAC faster

Why this evolution?

- Some services need several instances (see the sketch below): BookkeepingManager, JobStateUpdate, ResourceStatus, Optimizers, TransformationManager
- Some services need load balancing: the Configuration Server (hammered by all the pilots); still to be tested
- Some services are busy only during a given period
- Big machines give better usage of the VMs
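In practice, "several instances" means each duplicated service registers one URL per instance in the Configuration Service, and clients pick one of them. A minimal sketch following the usual DIRAC CS layout; the setup name ("Production") and the naive random choice are assumptions for illustration.

```python
# A minimal sketch of how duplicated instances appear to clients:
# several URLs registered under the service's CS path. The setup name
# and the random selection are illustrative assumptions.
import random

from DIRAC.Core.Base import Script
Script.parseCommandLine()

from DIRAC import gConfig

# All registered instances of the JobStateUpdate service (possibly on several hosts)
urls = gConfig.getValue('/Systems/WorkloadManagement/Production/URLs/JobStateUpdate', [])
print('registered instances: %s' % urls)

# Naive client-side balancing: pick one instance per connection
if urls:
    print('using: %s' % random.choice(urls))
```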

VOBOXes: monitoring

The main entry point is https://lhcb-portal-dirac.cern.ch/DIRAC/:
- Activity Monitor
- Dashboard
- System Administration

Monitoring of the machine

VOBOXes: alarms

Alarms defined so far:
- filesystem full
- /opt full
- swap space
- high load
Each alarm opens a ticket.
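For illustration, the kind of per-host probe that could sit behind these alarms; a minimal sketch in plain Python, with placeholder thresholds and a placeholder ticketing hook, not the actual LHCb implementation.

```python
# A minimal sketch of per-host probes for three of the alarms above
# (filesystem full, /opt full, high load). Thresholds and the ticketing
# hook are placeholders. Python 3.
import os
import shutil

def probe_alarms():
    alarms = []
    # "filesystem full" / "/opt full": warn above 90% usage (placeholder threshold)
    for path in ('/', '/opt'):
        usage = shutil.disk_usage(path)
        percent = 100.0 * usage.used / usage.total
        if percent > 90.0:
            alarms.append('%s is %.0f%% full' % (path, percent))
    # "high load": 1-minute load average well above the core count (placeholder)
    load1 = os.getloadavg()[0]
    if load1 > 2 * os.cpu_count():
        alarms.append('high load: %.1f on %d cores' % (load1, os.cpu_count()))
    return alarms

for alarm in probe_alarms():
    print('ALARM -> open a ticket: %s' % alarm)   # placeholder for the ticketing system
```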

LHCbDirac setups

- 2 setups (previously 3): Production and Certification
- Every new version of DIRAC can be tested in the Certification setup, except a few, because of the Configuration Server
- Testing of this setup is tied to Jenkins, which automates most of the steps:
  - consistency of the code
  - installation
  - production jobs

Databases

- 2 types of databases used in Production:
  - Oracle (Bookkeeping)
  - DBOD (DataBase On Demand): lbacc, lbprod, lbwms, lbwmsacc, dfc
- Databases used in Certification:
  - DBOD: lbcertif, lbprdev

Services outside CERN

- Most services are located at CERN and duplicated over several instances at CERN
- 6 machines outside CERN, at the T1 sites used by LHCb: RAL, GRIDKA, IN2P3, CNAF, PIC, SARA
- These machines duplicate two services:
  - Configuration Server (slave instance)
  - ReqProxy

Main issues

With such a configuration, and with the tools we have in place:
- it is difficult to spot services/agents/optimizers that are stuck (see the sketch below)
- installing a new version of DIRAC is delicate
- recovering a VM that is down is not trivial (no live migration)
- the web interface for system administration needs improvement
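One cheap way to spot stuck services from the outside is simply to ping each of them over DISET and flag the ones that do not answer. A minimal sketch, assuming the DIRAC v6 RPCClient and the built-in ping() that every service exports; the service list is a placeholder, and stuck agents or optimizers would still need a different probe.

```python
# A minimal sketch: ping each service over DISET and flag non-responders.
# Assumes DIRAC v6's RPCClient and the default ping() exported by every
# service; the service list is a placeholder.
from DIRAC.Core.Base import Script
Script.parseCommandLine()

from DIRAC.Core.DISET.RPCClient import RPCClient

SERVICES = [
    'WorkloadManagement/JobStateUpdate',
    'Transformation/TransformationManager',
    'Bookkeeping/BookkeepingManager',
]

for service in SERVICES:
    result = RPCClient(service, timeout=10).ping()
    if not result['OK']:
        print('%s looks stuck or down: %s' % (service, result['Message']))
```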

Conclusions

- The web portal, with its system administration console, is useful to manage a large set of machines, but functionality is missing:
  - updating DIRAC is not very user-friendly
  - the extension version number for the VO is not displayed
  - it takes a lot of clicks to get a meaningful error
- Duplicating services helps a lot with the load on the machines
- Single point of failure??