The ADC Operations Story

Andrej & Ale
ADC Commissioning & Integration & Operations
ADC Ops, 27 October 2014

Activities
“Standard” commissioning, but not so many things are standard:
- MCORE pile-up reconstruction
- Derivation Framework commissioning
- ProdSys2
- Rucio
- Now also the Event Index
- …and more…!

CPU resource usage
Not good in the past months: many sites are often not fully exploited.
- New workflows are not optimal for the “old” ProdSys1 + PanDA brokering; some need complex job requirements (see the sketch below):
  - HighMem: not clear how much, and not only 2 vs 6
  - MultiCore vs single core
  - Job duration
- Often not enough workload
- Many issues due to the commissioning of other services (e.g. Rucio)
- ADC relies on the ATLAS SW: too often in the past months the workflows we ran were not deeply tested
  - To be addressed with ATLAS Offline
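A toy sketch of the brokering problem behind these bullets. All names (Queue, Task, the sites) are hypothetical, not PanDA code: the point is that a task with complex requirements (MCORE, HighMem, long walltime) matches only a few queues, leaving the other sites unexploited.

```python
# Hypothetical requirement-aware brokering, not the actual PanDA broker.
from dataclasses import dataclass

@dataclass
class Queue:
    name: str
    cores: int           # cores per slot (1 = single core, 8 = MCORE)
    max_rss_mb: int      # memory limit per job
    max_walltime_h: int

@dataclass
class Task:
    name: str
    cores: int
    rss_mb: int
    walltime_h: int

def eligible_queues(task, queues):
    """Return the queues that satisfy all of the task's requirements."""
    return [q for q in queues
            if q.cores == task.cores
            and q.max_rss_mb >= task.rss_mb
            and q.max_walltime_h >= task.walltime_h]

queues = [
    Queue("SITE_A", cores=1, max_rss_mb=2000, max_walltime_h=48),
    Queue("SITE_B_MCORE", cores=8, max_rss_mb=16000, max_walltime_h=96),
]
pile_reco = Task("mcore_pileup_reco", cores=8, rss_mb=12000, walltime_h=72)
print([q.name for q in eligible_queues(pile_reco, queues)])
# -> ['SITE_B_MCORE']: with few MCORE/HighMem queues, single-core sites idle.
```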

Disk resource usage
- Too many primaries on Tier1s
  - Tier1 and good Tier2 spaces should be considered at the same level of reliability; tasks could even be assigned directly to Tier2s
  - Good Tier2 reliability metrics are under construction, in good shape
- Deletion: too much time waiting for green lights; transient data cleaning is manual
- New workflows are changing the way we use the present spacetokens:
  - Merging in Tier2s is great, but we have to dimension the spaces accordingly
  - The goal is to get rid of Proddisk anyway, but we need more time (Rucio migration and better dark-data handling, see next slides)
- The near future will (hopefully) solve most of these issues:
  - ProdSys2 will manage transient data
  - New data placement model: lifetime! (see the sketch below)
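A minimal sketch of the lifetime model mentioned in the last bullet, with invented names, dates, and policy (this is not Rucio's implementation): every dataset carries a lifetime at creation, and expired, unprotected replicas become deletion candidates automatically, so transient data no longer waits for a manual green light.

```python
# Illustrative lifetime-based cleanup; names and policy are assumptions.
from datetime import datetime, timedelta

class Dataset:
    def __init__(self, name, created, lifetime_days, pinned=False):
        self.name = name
        self.expires = created + timedelta(days=lifetime_days)
        self.pinned = pinned  # e.g. a primary copy protected by policy

def expired(datasets, now):
    """Datasets whose lifetime has passed and that are safe to clean."""
    return [d.name for d in datasets if not d.pinned and d.expires < now]

datasets = [
    Dataset("mc14.merge.AOD.r123", datetime(2014, 6, 1), lifetime_days=90),
    Dataset("data14.RAW.x456", datetime(2014, 6, 1), lifetime_days=3650,
            pinned=True),
]
print(expired(datasets, now=datetime(2014, 10, 27)))
# -> ['mc14.merge.AOD.r123']: the transient output cleans itself up.
```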

Job failures
- Need a clear separation between SW and site failures, and automated retries (see the sketch below)
  - SW failures are still too hidden and difficult to debug
  - Site failures today are not clearly disentangled
- Some failures end up as dark data: we need to expose these data and clean them up!
- Proposal: improve the HC (HammerCloud) jobs to better expose pure site failures, and add more HC tests to validate “new” workflows
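A hypothetical sketch of the SW/site separation and automated retries asked for above; the error codes and thresholds are invented for illustration. The key point: a software crash is deterministic and should be reported rather than retried, while a site failure deserves a bounded number of retries so failures do not silently pile up as dark data.

```python
# Hypothetical failure classification and retry policy; the error codes
# and the retry limit are assumptions, not real PanDA error codes.
SW_ERRORS = {"TRF_SEGV", "ATHENA_CRASH", "BAD_XML"}
SITE_ERRORS = {"LOST_HEARTBEAT", "STAGEIN_FAILED", "CVMFS_DOWN"}

def classify(error_code):
    if error_code in SW_ERRORS:
        return "software"
    if error_code in SITE_ERRORS:
        return "site"
    return "unknown"

def retry_decision(error_code, attempt, max_attempts=3):
    """Software bugs should not be retried blindly; site issues should
    be retried, ideally somewhere else (see the latency slide)."""
    kind = classify(error_code)
    if kind == "software":
        return "abort-and-report"   # retrying just reproduces the crash
    if attempt >= max_attempts:
        return "give-up"            # bounded retries avoid dark data
    if kind == "site":
        return "retry-elsewhere"
    return "retry-same-site"

print(retry_decision("STAGEIN_FAILED", attempt=1))  # -> retry-elsewhere
print(retry_decision("ATHENA_CRASH", attempt=1))    # -> abort-and-report
```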

Workflow latency optimization
- Network: direct access vs transfer
- Reduce startup and completion latency
  - Today DQ2 + SS + FTS + callbacks add quite a lot of overhead: looking forward to Rucio
  - Remote I/O may speed some workflows up
- MCP optimization: use the network topology for job assignment (see the sketch below)
- Job retries: can be done at different sites (for non-SW issues)
- Log files: kept in Tier2s for now, moving to an object store in the coming months
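A toy sketch of topology-aware job assignment; the sites and link costs are invented (a real system might derive them from perfSONAR measurements). Jobs are steered toward sites that are cheap to reach from where the input data sits.

```python
# Illustrative topology-aware site ranking; all values are assumptions.
INPUT_SITE = "CERN"

# Assumed link-cost matrix, e.g. derived from network measurements
# (lower is better).
link_cost = {
    ("CERN", "CERN"): 0,
    ("CERN", "BNL"): 5,
    ("CERN", "TRIUMF"): 8,
    ("CERN", "RAL"): 2,
}

def rank_sites(input_site, candidate_sites):
    """Order candidate sites by network distance to the input data."""
    return sorted(candidate_sites,
                  key=lambda s: link_cost.get((input_site, s), 99))

print(rank_sites(INPUT_SITE, ["BNL", "RAL", "TRIUMF"]))
# -> ['RAL', 'BNL', 'TRIUMF']: jobs land where staging the input is cheap.
```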

Opportunistic resources
- PanDA brokering today can be affected by opportunistic resources not working properly:
  - E.g. too many jobs in the transferring state in a cloud can lead to no new assignments (see the sketch below)
  - We need to organize the opportunistic resources in a way that won’t affect the production sites
  - The first priority is to keep the pledged resources fully busy!
- Cloud resource usability: clouds should give us “flexibility”; we didn’t see this in the September MultiCore pile-up reconstruction campaign
- More in one of the next sessions
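A toy model of the brokering pathology in the first bullet; the threshold and job counts are invented. If the per-cloud "transferring" count includes a misbehaving opportunistic resource, the whole cloud can be blocked from new assignments; accounting the opportunistic sites separately protects the production sites.

```python
# Hypothetical per-cloud brokering guard; threshold and counts are toys.
TRANSFERRING_LIMIT = 2000   # assumed per-cloud threshold

cloud_jobs = {   # cloud -> site -> job counts by state
    "US": {"production_site": {"transferring": 300},
           "opportunistic_vm": {"transferring": 2500}},  # broken stage-out
    "UK": {"production_site": {"transferring": 150}},
}

def assignable_clouds(clouds, exclude_opportunistic=False):
    """Clouds still eligible to receive new task assignments."""
    ok = []
    for cloud, sites in clouds.items():
        total = sum(st["transferring"] for name, st in sites.items()
                    if not (exclude_opportunistic and "opportunistic" in name))
        if total <= TRANSFERRING_LIMIT:
            ok.append(cloud)
    return ok

print(assignable_clouds(cloud_jobs))                      # -> ['UK']
print(assignable_clouds(cloud_jobs, exclude_opportunistic=True))
# -> ['US', 'UK']: separate accounting keeps the production sites busy.
```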

Data loss recovery
- One of the features of ProdSys2 is the ability to recover data losses automatically (see the sketch below)
  - This should solve most of the issues for us (the biggest problems are the unfinished / non-replicated tasks)
- What do we do for the “old” data?
  - If possible, ProdSys2 can redo what’s needed
  - We should rely on multiple replicas for critical data (for workflows that are completely different between ProdSys1 and ProdSys2)
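A sketch of what automatic loss recovery could look like, with an assumed bookkeeping structure (none of this is ProdSys2 code): map each lost file back to the task that produced it, and resubmit only those tasks.

```python
# Illustrative loss recovery; file names and the mapping are invented.
lost_files = {"mc14.0042.AOD._0007.pool.root",
              "mc14.0042.AOD._0013.pool.root"}

# Assumed bookkeeping: output file -> producing task ID.
file_to_task = {
    "mc14.0042.AOD._0007.pool.root": 4711,
    "mc14.0042.AOD._0013.pool.root": 4711,
    "mc14.0099.AOD._0001.pool.root": 4712,
}

def tasks_to_resubmit(lost, bookkeeping):
    """Tasks that must be (partially) redone to regenerate lost outputs."""
    return sorted({bookkeeping[f] for f in lost if f in bookkeeping})

for task_id in tasks_to_resubmit(lost_files, file_to_task):
    # In the real system this would trigger a recovery request in
    # ProdSys2; here we just report it.
    print(f"resubmit task {task_id} to regenerate lost outputs")
# -> resubmit task 4711 to regenerate lost outputs
```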

Monitoring
- The monitoring tools are also changing/improving
- BigPanDA is now THE PanDA monitoring
  - People are getting used to it; more help is needed in the near future
- “Campaign” monitoring, e.g. the new reprocessing monitoring: we will need something similar in the near future
- DDM dashboard(s):
  - Rucio is now publishing into the DDM dashboard
  - FTS activity is in a different dashboard
  - Xrootd is in yet another dashboard

ADC Shifts
- ADCoS Senior is under revision; ADCoS Expert is to be revised too
- DAST is under heavy load but very useful
- AMOD is to be reviewed: a plan is under discussion between ADC Ops and the ADC Shifts coordinators
- Today we have nobody on duty over weekends

“New” frameworks: a reminder!
- ProdSys1 has to go: ProdSys2’s first priority is to implement the full ProdSys1 functionality. Almost there!
- DQ2 has to go: Rucio’s first priority is to implement the same DQ2 functionality.

DQ2 → Rucio migration
- More details in the next sessions
- Sooner or later (sooner rather than later) we have to JUMP: the mid/third week of November is the target date (a client-side sketch follows below)
- And then be agile in fixing issues on the fly
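A minimal sketch of what the jump means on the client side, assuming the Rucio Python client; the method names and the RSE expression are best-effort assumptions and should be checked against the installed client. Conceptually: DQ2 datasets become scoped DIDs (scope:name), and subscriptions become replication rules on RSEs.

```python
# Assumed Rucio client usage; verify method signatures against your
# installed rucio client before relying on this.
from rucio.client import Client

client = Client()

# DQ2-style dataset listing -> Rucio: list DIDs within a scope.
for did in client.list_dids(scope="mc14_13TeV",
                            filters={"name": "mc14_13TeV.*AOD*"}):
    print(did)

# DQ2 subscription -> Rucio replication rule: "keep 2 copies on
# Tier1 disk" (the RSE attribute expression here is an assumption).
client.add_replication_rule(
    dids=[{"scope": "mc14_13TeV", "name": "mc14_13TeV.0042.AOD"}],
    copies=2,
    rse_expression="tier=1&type=DATADISK",
)
```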

Agile deployment
- Various teams have developed their frameworks over the past year(s), often quite “on their own”
- Integration has been ongoing for several months: many things/issues/details are coming to the surface only now
- It’s obvious, but better to be clear: we are all ADC

… ongoing …
- Many of the new features will come with Rucio and ProdSys2; we cannot even test them before full integration
- We can’t do all the commissioning/changes at the same time: we need to prioritize!
- There are things that, if we don’t do them *now*, we will keep for the next N years!

N.B. it won’t be so easy…!