Using the CMS High Level Trigger Farm as a Cloud Resource
David Colling, Imperial College London

Caveats
- This is still very much work in progress.
- This is the work of several people from within CMS; in some places I have reused parts of their material, and I would like to thank them.
- Any opinions expressed (and errors made) are my own.

Content
- (Gratuitous) What is CMS and what does it do?
- What is the High Level Trigger (HLT) and where does it sit within CMS data taking?
- Why use the HLT as a cloud?
- Technicalities of using the HLT as a cloud: the networking, OpenStack, glideinWMS
- What we have seen/done so far
- What is happening now/soon: reprocessing the 2011 data
- Interaction with other cloud work in CMS
- Conclusions
- Acknowledgements

What is CMS?
- CMS is a detector at the LHC.
- First direct evidence of the Higgs decaying to fermions (actually taus).

CMS during data taking
[Diagram of the CMS data-taking chain and the High Level Trigger nodes; the numerical labels were lost in transcription]

The HLT is...
- Big in CPU terms (almost no storage): 195K HepSpec06 (1264 physical machines, 26.6 TB of RAM). Compare this to:
  - the Tier 0 at CERN (where data is initially reconstructed): 121K HepSpec06;
  - ALL the Tier 1 sites pledged to CMS (where data is stored, reprocessed etc.): 150K HepSpec06;
  - ALL the (~50) Tier 2 sites around the world (where data is analysed): 399K HepSpec06 (a rough comparison is sketched just below).
- Wholly owned by CMS.
- Complex: several different generations of machine with very different specifications (and configurations), dedicated to nothing other than filtering CMS events.
- Other uses of the HLT must NEVER interfere with data taking.
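As a rough indication of scale (using only the HepSpec06 figures quoted above), the HLT on its own is comparable to a sizeable fraction of the whole CMS distributed computing system:

```latex
\frac{\mathrm{HLT}}{\mathrm{Tier\,0} + \mathrm{Tier\,1} + \mathrm{Tier\,2}}
  = \frac{195\ \mathrm{kHS06}}{(121 + 150 + 399)\ \mathrm{kHS06}}
  = \frac{195}{670} \approx 0.29
```

i.e. roughly 30% extra capacity on top of the dedicated resources, whenever it can be harvested.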

So why use it as a cloud?
CMS will be very short of resources in 2015, so we need to use the HLT outside of data taking: in engineering (~week long) breaks and even in the (~12 hour) gaps between fills. Always remembering that DATA TAKING COMES FIRST.
Even so, why turn it into a cloud?
- We need to make minimal changes to the underlying hardware configurations when using the HLT for anything else -> virtualisation of the new tasks.
- We need to be able to make opportunistic use of the HLT, which means migrating on quickly and migrating off quickly (15 minutes warning) -> virtualisation.
- The complex mixture of different physical machines is not a problem, as the cloud infrastructure will only instantiate VMs on resources capable of supporting them.
- Finally, it is potentially a good model for using other opportunistic resources as they become available.

Brief thought on virtualisation (not original to me)

So a cloud was born...
We decided to create the "CMS OpenStack, opportunistic, overlay, online-cluster Cloud", or CMSooooCloud for short.
Why OpenStack?
- Large, growing and dynamic development community
- Open source
- Widely believed to have a solid underlying architecture
- "Speaks EC2"
- Being driven by a huge and growing user community: "People Get Pissed Off About OpenStack. And That's Why It Will Survive"

Setup and initial testing
The plan was to produce an initial installation, test it, and then form a production set-up and use it. What follows is a description of the initial set-up and the tests that were performed.

The Networking
The details are not too important; what is important is that the network is complex and that there is more than one route from CMS to CERN, at different speeds (1 Gb/s and 10 Gb/s links).

The Networking
However, using Open vSwitch and Generic Routing Encapsulation (GRE), this can be simplified so that what the VMs see is a single 1 Gb/s network.
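As an illustration only (not the actual HLT configuration, whose bridge names and tunnel endpoints are not given in the slides), a GRE overlay of the kind described can be built with Open vSwitch roughly as follows; the bridge name, port name and remote IP below are placeholders:

```python
# Minimal sketch of an Open vSwitch + GRE overlay, similar in spirit to the
# setup described on this slide. Bridge name, tunnel port and remote endpoint
# are illustrative placeholders, not the real HLT values.
import subprocess

def run(cmd):
    """Run a command and raise if it fails."""
    print("+", " ".join(cmd))
    subprocess.check_call(cmd)

def build_gre_overlay(bridge="br-vm", tunnel_port="gre0", remote_ip="10.0.0.1"):
    # Create the bridge that the VMs' virtual NICs are attached to.
    run(["ovs-vsctl", "--may-exist", "add-br", bridge])
    # Add a GRE tunnel port: traffic on the bridge is encapsulated and sent
    # to the remote endpoint, hiding the underlying multi-path topology.
    run(["ovs-vsctl", "--may-exist", "add-port", bridge, tunnel_port,
         "--", "set", "interface", tunnel_port,
         "type=gre", "options:remote_ip=" + remote_ip])

if __name__ == "__main__":
    build_gre_overlay()
```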

The OpenStack Cloud Manager
All the components that you would expect; note in particular that different numbers and types of VMs run on the different types of hardware.

Architectural Implementation

Initial Testing
The LHC shut down for two years at the end of 2012. The initial tests had been very positive, so in January we decided to try to use the HLT as a "production quality" opportunistic resource and to give it a large validation task (eventually chosen to be the complete reprocessing of the 2011 dataset).

What is needed to make the HLT (or any opportunistic resource) useful to CMS?
Each of these deserves an entire presentation...
- Access to CMS data. Once a major problem, but no longer: CMS "Any Data, Any Time, Anywhere" (AAA) now serves all of CMS data over xrootd (see Pete Elmer's presentation at this conference). Specifically for the HLT, data is served over xrootd from the EOS disk system at CERN, and output from the jobs is also written over xrootd back to EOS.
- Access to the CMS software. We could put this directly into the VM image, but that would require a new image for each software release, so instead we decided to use CvmFS, which serves software releases through squid caches.
- A mechanism for interacting with the cloud controller to instantiate VMs and then to connect them to a batch system. CMS's main submission system is glideinWMS, and this has been modified to perform this role alongside its more usual Grid submission role (a sketch of the EC2-style provisioning call follows below).
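To make the last point concrete: OpenStack "speaks EC2", so a provisioner such as glideinWMS can request VMs through the EC2-compatible API. A minimal sketch of such a request using the boto library is shown below; the endpoint, credentials, image ID and instance type are made-up placeholders, and this illustrates the style of call, not the actual glideinWMS code.

```python
# Minimal sketch: starting VMs on an OpenStack cloud via its EC2-compatible
# API, in the style a provisioner like glideinWMS uses. Endpoint, credentials,
# image ID and flavour are hypothetical placeholders.
import boto
from boto.ec2.regioninfo import RegionInfo

def start_worker_vms(n_vms=10):
    region = RegionInfo(name="nova",
                        endpoint="openstack-controller.example.org")
    conn = boto.connect_ec2(aws_access_key_id="EC2_ACCESS_KEY",
                            aws_secret_access_key="EC2_SECRET_KEY",
                            is_secure=False,
                            region=region,
                            port=8773,
                            path="/services/Cloud")
    # user_data typically carries the contextualisation that makes the VM
    # join the batch pool behind glideinWMS when it boots.
    reservation = conn.run_instances(image_id="ami-00000001",
                                     instance_type="m1.large",
                                     min_count=1,
                                     max_count=n_vms,
                                     user_data="#!/bin/sh\n# join the pool here\n")
    return [inst.id for inst in reservation.instances]

if __name__ == "__main__":
    print(start_worker_vms(5))
```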

glideinWMS

Experiences so far...
We found many minor but annoying problems, each of which needs to be solved if CMS is going to be able to use (opportunistic) cloud sites in a production manner. These include:
- Permissions problems with xrootd and EOS.
- VMs dying because access to CvmFS was not available fast enough.
- OpenStack EC2 is not Amazon EC2, causing many minor problems, all of which required modifications to glideinWMS.
- Behaviour in clouds is different from behaviour in Grids, so glideinWMS needed to learn how to handle the situations differently.
- The OpenStack controller can be "rather fragile" when asked to do things at scale, so glideinWMS learnt to treat it gently (a sketch of what "gently" can mean is shown after this list).
However, we worked our way through these, worked out that we could actually run ~7000 VMs on the hardware that was there, and started to scale up...
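To illustrate the last point (this is a generic pattern, not the actual glideinWMS change): requests to a controller that struggles at scale can be issued in small batches and retried with an exponential back-off rather than fired in one large burst. A sketch, with hypothetical batch sizes and delays:

```python
# Generic pacing/back-off pattern for talking to a fragile cloud controller:
# request VMs in small batches and back off on errors rather than issuing one
# large burst. Batch size, delays and submit_batch are illustrative.
import time

def provision_gently(submit_batch, total_vms, batch_size=20, pause=5.0, max_retries=5):
    """Ask the controller for total_vms VMs in small batches, backing off
    exponentially when a request fails. submit_batch(n) is whatever call
    actually requests n VMs (e.g. an EC2 run_instances wrapper)."""
    started = 0
    while started < total_vms:
        want = min(batch_size, total_vms - started)
        delay = pause
        for _attempt in range(max_retries):
            try:
                submit_batch(want)
                started += want
                break
            except Exception:
                # Controller overloaded or timed out: wait, then retry with a
                # doubled delay instead of hammering it again immediately.
                time.sleep(delay)
                delay *= 2
        else:
            raise RuntimeError("controller kept failing; stopped at %d VMs" % started)
        time.sleep(pause)  # pace successive batches even when all is well
    return started
```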

Experience so far...
When we scaled up, the limitations of importing data over the 1 Gb/s network link caused timeouts and job failures. So we had a choice: either reprocess while keeping the number of concurrent jobs below ~1000, or reconfigure the network. We are currently reconfiguring the network to use the (now multiple) 10 Gb/s connections, hopefully completed today (CERN time).
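A rough back-of-envelope (the per-job input rate here is my own assumption, not a number from the slides) shows why a 1 Gb/s link caps the farm at around a thousand concurrent jobs: if each reprocessing job reads input at roughly 1 Mb/s, then

```latex
\frac{1\ \mathrm{Gb/s}}{\sim 1\ \mathrm{Mb/s\ per\ job}} \approx 1000\ \text{concurrent jobs}
```

whereas the ~7000 VM slots available would need several times that bandwidth, which is what the move to multiple 10 Gb/s links provides.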

Interactions with other cloud work in CMS
The work on using the HLT as an opportunistic cloud resource doesn't happen in a vacuum. Considering CMS cloud activity in the UK (part of GridPP):
- Unlike the CERN HLT, this spans several sites, so firewalls need to be considered.
- Unlike CERN, this also considers end-user physics analysis.
- Like CERN, it uses glideinWMS (the CMS default).
- It has greater monitoring of the physical and virtual machines than CERN.

UK CMS cloud activity
[Diagram; the surviving labels show software served by CvmFS from CERN and data accessed via AAA]

Analysis Jobs
- Jobs submitted with glideinWMS, configured to run no more than 80 at a given time.
- Complete with monitoring of the VM and inside the VM.

Interactions with the HLT cloud
- All the lessons learnt as part of the HLT activity have fed directly into the CMS analysis cloud work, including improvements to glideinWMS, the handling of OpenStack, etc.
- The monitoring framework developed as part of the CMS analysis cloud work is being added to the CMS HLT cloud work to give better understanding.
- These are examples of the collaboration that will enable CMS to utilise (opportunistic) cloud resources efficiently.

Conclusions
- CMS is developing the HLT as a cloud resource that can be used when it is not running as the HLT.
- This resource has been given a large validation task: reprocessing the 2011 data. However, in order to do that efficiently, the networking needs to be reconfigured.
- The work carried out in enabling the use of the HLT as a cloud helps CMS develop an infrastructure capable of using (opportunistic) cloud resources for a variety of activities.

Thanks
Thanks to all those working on the HLT cloud work, including: Adam, Alison, Andrew, Stephen, Toni, Tony, Wojciech (and anybody I have missed).

Questions