
1 User Analysis Workgroup Discussion
- Understand and document analysis models
  - Best in a way that allows them to be compared easily
- Answer the question of how unpredictable chaotic user analysis will be
- Identify additional requirements coming from the analysis use case
  - Middleware
  - T1/T2 infrastructure
  - Load to be expected
- Total input received:
  - Several summary write-ups and comments on the questions and summaries
  - Roughly 100 reference documents and presentations

2 High Level View
- All experiments have well-established analysis frameworks
  - They are documented and in use by end users
  - Users are shielded from most of the middleware
  - They look similar, but differ when it comes to resource access
- The frameworks are currently in a "tuning" phase
  - Detailed performance measurements: I/O vs CPU, failure rates, network bandwidth (a small measurement sketch follows below)
  - Stress tests
  - Automatic submission frameworks
- Communication channels between experiments and T2s are well organized
  - Different concepts, but communication works
  - Specific site-stability monitoring for the analysis use case
- Some experiments have well-defined resource requests
  - Storage per working group, etc.
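As an illustration of the kind of performance measurement mentioned above, here is a minimal sketch (not taken from any experiment's framework) of how CPU/walltime efficiency and failure rates could be aggregated from job records; the record fields and the sample data are assumptions for the example.

```python
# Minimal sketch: aggregate CPU/walltime efficiency and failure rate
# from a list of job records. Field names are illustrative assumptions,
# not the schema of any experiment's monitoring system.

def summarize(jobs):
    """jobs: list of dicts with 'cpu_time', 'wall_time', 'status'."""
    finished = [j for j in jobs if j["status"] == "done"]
    failed = [j for j in jobs if j["status"] == "failed"]

    total_cpu = sum(j["cpu_time"] for j in finished)
    total_wall = sum(j["wall_time"] for j in finished)

    efficiency = total_cpu / total_wall if total_wall else 0.0
    failure_rate = len(failed) / len(jobs) if jobs else 0.0
    return efficiency, failure_rate


if __name__ == "__main__":
    sample = [
        {"cpu_time": 1200, "wall_time": 3600, "status": "done"},   # I/O-bound job
        {"cpu_time": 3300, "wall_time": 3600, "status": "done"},   # CPU-bound job
        {"cpu_time": 0,    "wall_time": 600,  "status": "failed"},
    ]
    eff, fail = summarize(sample)
    print(f"CPU/walltime efficiency: {eff:.0%}, failure rate: {fail:.0%}")
```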

3 High Level View II
- T2s/T3s
  - Can be of almost any size: several have fewer than 100 cores, one has 6000
- Storage:
  - "Zoo" describes it best: different storage systems with different characteristics
  - The same SE type does NOT guarantee the same performance
    - Pool clustering, network, file systems, configuration
- Experiments take an adaptive approach
  - ATLAS is currently building a "Storage Access Method Map" (see the sketch below)
  - It allows the best-performing access method to be picked at every site
  - Without adaptation, the CPU/walltime ratio varies between 10% and 90%
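To make the "Storage Access Method Map" idea concrete, here is a minimal sketch of how a per-site map could be used to pick an access protocol. The site names, protocols and rates are hypothetical; this is not ATLAS's actual implementation.

```python
# Hypothetical sketch of a per-site storage access method map.
# The sites, protocols and measured rates below are invented for
# illustration; this is not the real ATLAS "Storage Access Method Map".

ACCESS_MAP = {
    # site             (best access method, measured rate in MB/s - illustrative)
    "SITE_A_DCACHE": ("dcap",   20.0),
    "SITE_B_DPM":    ("rfio",   15.0),
    "SITE_C_XROOTD": ("xrootd", 35.0),
}

DEFAULT_METHOD = ("local-copy", None)  # stage the file to the worker node


def access_method(site):
    """Return the preferred access protocol for a site, or the default."""
    return ACCESS_MAP.get(site, DEFAULT_METHOD)


print(access_method("SITE_B_DPM"))    # ('rfio', 15.0)
print(access_method("UNKNOWN_SITE"))  # falls back to staging locally
```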

4 High Level View III
- User activity
  - The large experiments each have > 1000 users who used the grid during 2008
  - Users per week are on the order of … per experiment
  - Expected increase during the summer: a factor of 2-3
  - Current resource usage: 15-30% of the production use (30k/day) (a rough rate estimate follows below)
- Computing resource access
  - Via submission frameworks: WMS, direct CE, pilots
  - ATLAS uses both strategies
  - CMS is investigating the pilot option
  - LHCb and ALICE are pilot-based
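A rough back-of-the-envelope check of the numbers above, assuming the 30k/day figure refers to the production job rate:

```python
# Rough estimate of the analysis job rate implied by the figures on this slide.
# Assumes "30k/day" is the production job rate; all numbers are approximate.

production_jobs_per_day = 30_000
analysis_fraction = (0.15, 0.30)   # analysis is 15-30% of production use
summer_increase = (2, 3)           # expected factor 2-3 increase

current = [f * production_jobs_per_day for f in analysis_fraction]
expected = [c * g for c, g in zip(current, summer_increase)]

print(f"current analysis load : {current[0]:.0f} - {current[1]:.0f} jobs/day")
print(f"expected after summer : {expected[0]:.0f} - {expected[1]:.0f} jobs/day")
# -> roughly 4.5k-9k jobs/day now, up to ~27k jobs/day after the increase
```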

5 Main Issues and Requirements
- Reliability of sites
  - Important, but not part of this work group
- Access to computing resources
  - Some minor problems with the scalability of the WMS and the lcg-CEs
    - Can be addressed by adding more services
    - Must be planned now; the overall job rate will double soon
  - Multi-user pilot job support (see the glexec sketch below)
    - glexec/GUMS on OSG (available)
    - glexec/SCAS (available very, very soon)
    - NDGF: ???
- Access control to shares and fair-share balances
  - Fine-grained control respecting groups and user locality is needed
  - The technical basis is available, but sites have to configure it
  - Share allocation monitoring is needed (CMS)
  - End-user job monitoring (experiments develop specific dashboards)
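For illustration, a minimal sketch of how a multi-user pilot could hand a payload to glexec so that it runs under the mapped identity of the payload owner. The environment variable name and call pattern reflect typical glexec deployments but should be verified against the local setup; the proxy path and payload command are placeholders.

```python
# Sketch of a pilot invoking glexec to run a user payload under the
# payload owner's mapped identity. The environment variable and paths
# are illustrative; consult the site's glexec documentation for the
# exact configuration (e.g. GUMS on OSG, SCAS on EGEE).
import os
import subprocess

def run_payload_via_glexec(user_proxy_path, payload_cmd):
    env = os.environ.copy()
    # glexec reads the payload user's credential from the environment
    # (variable name as in common deployments; verify locally).
    env["GLEXEC_CLIENT_CERT"] = user_proxy_path
    # glexec authorizes the proxy, switches to the mapped account,
    # then executes the payload command.
    return subprocess.call(["/usr/sbin/glexec"] + payload_cmd, env=env)

# Placeholder paths/commands, not real values:
rc = run_payload_via_glexec("/tmp/x509up_payload_user", ["/bin/sh", "run_analysis.sh"])
print("payload exit code:", rc)
```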

6 Main Issues and Requirements
- Access to storage resources
  - Three main problem domains: I/O performance (including network), access control, SRM commands
- Access control
  - ACLs to control staging and data (re)moval
  - VOMS-based ACLs are in dCache, DPM and StoRM
    - Why can't they be used? What is missing?
- Quotas: group-based and user-based quotas are needed
  - Experiments monitor usage, but this will not scale (see the accounting sketch below)
  - A "free space for X" tool is wanted (LHCb)
    - To know how much space is available
- Accounting
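Since experiment-side usage monitoring will not scale, here is a minimal sketch of the kind of group/user accounting a storage-side quota check would need. The dump format and quota limits are invented for the example; a real tool would read the SE's namespace dump or database.

```python
# Hypothetical sketch of group storage accounting against quotas.
# The "dump" entries and quota numbers are invented for illustration.
from collections import defaultdict

# (owner, group, size in GB) - made-up namespace dump entries
dump = [
    ("alice_user1", "alice",       120),
    ("atlas_user7", "atlas/higgs", 800),
    ("atlas_user9", "atlas/higgs", 450),
]

group_quota_gb = {"alice": 1000, "atlas/higgs": 1000}  # illustrative limits

usage = defaultdict(int)
for owner, group, size in dump:
    usage[group] += size

for group, used in usage.items():
    quota = group_quota_gb.get(group)
    if quota is not None and used > quota:
        print(f"group {group}: {used} GB used, OVER quota of {quota} GB")
    else:
        print(f"group {group}: {used} GB used, quota {quota} GB")
```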

7 Main Issues and Requirements
- Storage I/O
  - Xrootd everywhere (ALICE)
    - Mileage varies based on the backend
    - DPM uses a two-year-old version
    - CASTOR is moving to a major upgrade
    - dCache implemented a subset of the xrootd protocol
    - Xrootd is native at most ALICE T2s
  - Access via "POSIX-like" clients: rfio, dcap, WN-local (staged)
    - Very sensitive to file structure (number of ROOT trees)
    - Even local access varies from 2.2 MB/s to 47 MB/s; very site dependent
  - ATLAS and ALICE performed specific measurements; CMS will do the same very soon
- Network: example of an ATLAS muon-style analysis in HammerCloud (see the arithmetic below)
  - 200 CPUs -> 800 MB/s I/O -> 10 Gbit network
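The HammerCloud example above implies roughly 4 MB/s per job; a quick check that 800 MB/s indeed calls for a 10 Gbit link:

```python
# Back-of-the-envelope check of the HammerCloud example on this slide.
cpus = 200
total_io_mb_s = 800.0

per_job_mb_s = total_io_mb_s / cpus        # ~4 MB/s per analysis job
total_gbit_s = total_io_mb_s * 8 / 1000.0  # MB/s -> Gbit/s (decimal units)

print(f"per-job I/O     : {per_job_mb_s:.1f} MB/s")
print(f"aggregate demand: {total_gbit_s:.1f} Gbit/s -> needs a 10 Gbit link")
```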

8 Main Issues and Requirements
- SRM commands
  - SRM client calls should not block SRM services
    - srmLs polling is a good example (see the sketch below)
  - Bulk operations for some commands
    - Which ones?
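To illustrate the non-blocking pattern being asked for, here is a sketch of an asynchronous srmLs client loop: submit the request, get a request token, then poll with an increasing interval instead of holding the service. The method names (`submit_ls`, `status_of_ls`) and status strings stand in for whatever SRM client binding is used and are assumptions, not a real API.

```python
# Sketch of non-blocking srmLs usage: submit the request, then poll its
# status with an increasing interval rather than blocking the SRM service.
# submit_ls() and status_of_ls() are placeholders for the actual SRM
# client calls of the binding in use.
import time

def list_path(srm_client, surl, max_wait=300):
    token = srm_client.submit_ls(surl)          # returns a request token
    interval, waited = 1, 0
    while waited < max_wait:
        status, result = srm_client.status_of_ls(token)
        if status == "SRM_SUCCESS":
            return result
        if status not in ("SRM_REQUEST_QUEUED", "SRM_REQUEST_INPROGRESS"):
            raise RuntimeError(f"srmLs failed with status {status}")
        time.sleep(interval)
        waited += interval
        interval = min(interval * 2, 30)        # back off, cap at 30 s
    raise TimeoutError(f"srmLs for {surl} did not finish within {max_wait} s")
```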

9 What next?
- Documentation of the analysis models does not seem to be an urgent task
- Detailed resource allocation at T2s is best coordinated by the experiments directly
- In my view we should concentrate on the following issues:
  - I/O for analysis
    - Understanding the different access methods
    - How can we help the T2s to optimize their systems?
    - Can the ALICE analysis train model help?
    - Role of xrootd as an access protocol?
  - Clarification of requirements concerning storage ACLs and quotas
    - What can be done now: protect staging, handle ownership, etc.
    - What is really missing? Most SRM implementations have VOMS ACLs.
    - Short- and long-term solutions for quotas
  - SRM calls

10 What next?
- CPU resource access
  - Specify a share layout for each experiment, including pilot queues
  - Pool middleware and batch-system expertise to provide good templates for the most common batch systems
  - Share monitoring?
- The infrastructure needs to understand clearly how many jobs will flow through the system
  - Analysis increases at least by a factor of 3 during the year
  - How many WMS instances are needed?
  - How many CEs should a T2 site with X cores provide? (a rough capacity sketch follows below)
    - The number of FQANs and jobs is the issue
- What else?
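A rough sketch of the kind of estimate the question "how many CEs should a T2 with X cores provide?" calls for. The per-CE submission capacity and the average job length are assumptions for illustration, not measured values.

```python
# Rough capacity estimate for "how many CEs does a T2 with X cores need?".
# The per-CE job-handling capacity and average job length are assumptions;
# real numbers depend on the CE type and on the number of FQANs served.
import math

def ces_needed(cores, avg_job_hours=4.0, jobs_per_ce_per_day=10_000):
    jobs_per_day = cores * 24.0 / avg_job_hours   # jobs the farm can run per day
    return max(1, math.ceil(jobs_per_day / jobs_per_ce_per_day))

for cores in (400, 2000, 6000):
    print(f"{cores:5d} cores -> ~{ces_needed(cores)} CE(s)")
# e.g. 6000 cores with 4 h jobs ~ 36k jobs/day -> about 4 CEs at 10k jobs/CE/day
```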

11 Discussion