
ATLAS Experiment Status, Run2 Plans, Federation Requirements
Ilija Vukotic
XRootD Workshop, UCSD, San Diego, 27 January 2015

2 Part 1 – overview of changes
Run2 brings new challenges: 1 kHz trigger rate, pile-up of 40 events, flat resources. But the long shutdown gave us an opportunity for fundamental changes, and we used it:
– New models and policies
– New DDM system: Rucio
– New distributed production framework: ProdSys2
– Opportunistic resources – clouds, HPC
– New data format: xAOD
– New analysis model
– Large changes in reconstruction code – a 3x speed-up!

3 Run-1 model, briefly
Strict hierarchical model (MONARC):
– Clouds: T1 + T2s (+ T3s)
– No direct transfers between foreign T2s
– Relaxed towards the end of Run-1 (multi-cloud production – T2s can process jobs of many clouds)
Production organization:
– Tasks assigned to T1s
– The T1 is the aggregation point for the output datasets of the tasks
– T2 PRODDISK used for input/output transfers from/to the T1
T2 disk space:
– distributes the final data to be used by analysis
– stores secondary replicas of precious datasets

4 Planning for the Run-2 model – facts
The network has globally improved:
– much higher bandwidth (an order of magnitude increase)
– most links between ATLAS sites provide sufficient throughput: a full mesh can be used for transfers
Many Tier-2 sites provide Tier-1-level stability of computing, storage and WAN:
– many are in LHCONE or other high-throughput networks
– as far as usability for ATLAS is concerned, tape is the only resource that distinguishes Tier-1s from large Tier-2s
CPU-only (opportunistic) centers are fully integrated in ATLAS:
– some run all kinds of tasks, including data reprocessing
– they have good connectivity to geographically close Storage Elements

5 CPU and storage organization
Breaking the barrier between the Storage Element and the Computing Element:
– remote I/O, job overflow, remote fail-over of input or output file staging → storage is no longer strictly bound to the site's computing resource
– the Tier-1/Tier-2/Tier-3 storage classification does not make much sense anymore
ATLAS storage pool:
– TAPE
– STABLE disk storage – T1s + reliable T2s (the former T2Ds), for storing custodial, primary data
– UNSTABLE disk storage – less reliable T2s, for secondary data (for analysis)
– VOLATILE disk storage – unreliable T2s, T3s, opportunistic storage: LOCALGROUPDISK SEs, Rucio cache storage, Tier-3 analysis disk space
– Lifetime-based Storage Model
– disk/tape residency almost 100% algorithmically managed

6 Job optimizations
Production / analysis:
– Run-1: 75% / 25% (slot occupancy ~ CPU-time usage)
– Run-2: 90% / 10% (no estimate yet)
o the bulk of analysis (derivations) is moving to (group) production
o the remaining analysis will be shorter and I/O intensive
Reduce merging:
– avoid it if possible (simulation, reconstruction)
– local merging – merge on the site where the files to be merged are
Jobs will produce bigger outputs:
– good for tape storage
– bigger files transferred – good for efficient transfers (and fewer files to transfer)
Massive multicore for ~80% of production:
– all G4 simulation and all digi+reco
– effective drop in running jobs from 200k to 60k (20k 8-core + 40k single-core)
JEDI dynamic resizing – tune jobs to 6-12 h:
– avoids failures and CPU losses for very long jobs
Automatic healing:
– split jobs that run too long
– increase the memory requirement for jobs failing out-of-memory
A sketch of the resizing and healing logic follows.
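The following is a hypothetical Python sketch of that resizing and healing logic, not the actual JEDI implementation; the 6-12 h target window is from the slide, while the function names, starting values and the 1.5x memory bump are assumptions.

```python
# Hypothetical sketch of JEDI-style dynamic resizing and healing (not the
# actual JEDI code). Only the 6-12 h window comes from the slide.

TARGET_MIN_H = 6.0
TARGET_MAX_H = 12.0

def events_per_job(recent_walltimes_s, recent_event_counts):
    """Pick the number of events for the next job so that the expected
    walltime lands in the middle of the 6-12 h window."""
    total_s = sum(recent_walltimes_s)
    total_events = sum(recent_event_counts)
    if total_events == 0:                # no history yet: start conservatively
        return 1000
    s_per_event = total_s / total_events
    target_s = 3600.0 * (TARGET_MIN_H + TARGET_MAX_H) / 2.0
    return max(1, int(target_s / s_per_event))

def heal(job):
    """Automatic healing: split over-long jobs, bump memory on OOM."""
    if job["walltime_s"] > TARGET_MAX_H * 3600:
        job["n_events"] //= 2                           # retry with half the events
    if job["error"] == "out_of_memory":
        job["memory_mb"] = int(job["memory_mb"] * 1.5)  # retry with more memory
    return job

# Jobs of this task recently averaged ~30 s/event, so the next job
# gets 32400 / 30 = 1080 events.
print(events_per_job([30000, 33000], [1000, 1100]))
```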

7 Specializing the sites for workloads
Not all sites are equal, and not all job types run equally well on all sites:
– some sites are slow for analysis but good for data reprocessing
– some sites are very big but cannot run 100% heavy-I/O jobs
Differentiation was already used during Run-1 by limiting the job types through the fair-share (AGIS settings) – e.g. evgensimul=60%, all=40%.
Not all jobs are EQUALLY important:
– some tasks have a short deadline
– some large activities have a close deadline (physics conferences)
FUTURE:
Dynamic specialization (sketched below):
– I/O-expensive jobs will be automatically throttled by the central system based on recent history – keeping track of the data transferred to a site and reducing its heavy-job assignment
– migration from fixed bamboo queues to per-task/job heaviness estimates
Forced specialization:
– ADC will specialize sites for certain activities if the site provides custom resources (more memory per CPU, GPU availability, ...)
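A minimal sketch of what such throttling could look like, assuming a hypothetical per-site I/O budget; the real central-system bookkeeping and thresholds are not specified on the slide.

```python
# Minimal sketch of I/O-based throttling with an assumed per-site budget.

IO_BUDGET_BYTES_PER_H = 2 * 1024**4     # hypothetical budget: 2 TiB/h per site

def accept_heavy_io_job(staged_bytes_recent, job_input_bytes, window_h=6):
    """staged_bytes_recent: input sizes (bytes) staged to the site during
    the last window_h hours. Reject the job if assigning it would push the
    site's recent staging rate over the budget."""
    rate = (sum(staged_bytes_recent) + job_input_bytes) / window_h
    return rate <= IO_BUDGET_BYTES_PER_H

# Example: a site that staged 10 TiB in the last 6 h can still take a 500 GiB job.
print(accept_heavy_io_job([10 * 1024**4], 500 * 1024**3))
```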

8 Rucio
Run-2 data management model:
– file-level granularity
– multiple ownerships (user/group/activity)
– policy-based replication for space/network optimization
Features:
– plug-in based architecture supporting multiple protocols (SRM/gridFTP/xrootd/HTTP, ...)
– unified dataset/file catalog, supports metadata
Status: in production; ironing out kinks; adding functionality.
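As an illustration of policy-based replication, here is a hedged sketch using the Rucio Python client's rule interface as we understand it; the scope, dataset name, RSE expression and lifetime below are hypothetical.

```python
# Hedged sketch of policy-based replication via the Rucio Python client.
from rucio.client import Client

client = Client()

# State a policy: keep 2 replicas of the dataset on stable Tier-2 disk for
# 30 days; the server decides placement and keeps the rule satisfied.
client.add_replication_rule(
    dids=[{'scope': 'data15_13TeV',                    # hypothetical DID
           'name': 'data15_13TeV.periodA.physics_Main.AOD.example'}],
    copies=2,
    rse_expression='tier=2&type=DATADISK',             # hypothetical RSE expression
    lifetime=30 * 24 * 3600,                           # seconds
)
```

The point of the design is that the user states a policy (how many copies, where, for how long) and Rucio keeps it satisfied, rather than files being moved by hand.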

9 Lifetime-based Storage Model
– Use an updated version of the current model to calculate CPU usage and dataset volumes
– Keep track of the creation date of each dataset
– Remove each dataset from disk and tape when it has exceeded its (type-specific) lifetime
– Adjust the total request for disk and tape to be sufficient to accommodate the requested lifetimes
– Assign an experience-driven fraction of the disk+tape storage to disk
The model is now being refined and will be used to inform the ATLAS resource request for 2016, in February 2015. A minimal sketch of the lifetime rule follows.
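A minimal sketch of the core rule, assuming hypothetical per-type lifetimes; refinements of the model (e.g. extending the lifetime on access) are not shown.

```python
# Minimal sketch of the lifetime rule, with assumed per-type lifetimes.
from datetime import datetime, timedelta

LIFETIME = {                       # hypothetical type-specific lifetimes
    'AOD':  timedelta(days=365),
    'DAOD': timedelta(days=180),
    'log':  timedelta(days=90),
}

def expired(dataset, now=None):
    """A dataset becomes eligible for removal from disk and tape once its
    type-specific lifetime since creation has passed."""
    now = now or datetime.utcnow()
    return now - dataset['created'] > LIFETIME[dataset['type']]

print(expired({'type': 'AOD', 'created': datetime(2014, 1, 1)},
              now=datetime(2015, 1, 27)))    # True: older than one year
```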

10 ProdSys2
DEfT – task request and task definition:
– new web interface for requests, review and submission
– templates for easy task definition
– chaining of steps and requests – series of chains
– post-production interface
– status: adding functionality, validating workflows
JEDI – dynamic job definition and task execution:
– integrated with PanDA
– engine for user analysis tasks
– new features:
o dynamic job definition
o lost-file recovery
o network-aware brokerage
o log-file merging
o output merging
o support for the event service
– status: in production since August, tuning parameters
PanDA – covered later today by Kaushik De
BigPanDA – brand new monitor

11 New data format: xAOD
The "dual-use" xAOD replaces separate Athena-readable and ROOT-readable formats (will be covered in detail by Doug Benjamin):
– more efficient
– more user-friendly
– status: in use, some tuning still possible
New analysis model:
– the Derivation Framework (trains, with carriages provided by groups) replaces the incoherent "group production"; status: in validation
– an Analysis Framework supports standardized use of performance-group recommendations (jet energy scale etc.); status: in use, some tools in validation

12 Part 1 – conclusion
– ATLAS is making excellent progress towards readiness for Run 2
– The new production and data management systems provide many possibilities for further improvements and dynamic optimizations
– Many of the changes can be implemented before Run 2 starts
– A lot of things need tuning
– Even during Run 2 we can afford to bring drastic improvements to our distributed system
– Production STABILITY will be the FIRST PRIORITY during data taking
– Unknown: what will groups and physicists really do?

13 Part 2 – federation role and scale in Run2
Three main roles:
– Enable remote-I/O jobs:
o increase job turnover (shorter wait in queues, faster startup)
o increase wall-time utilization of the CE (no waiting for input data)
o lower the number of replicas needed
o better use the available bandwidth by streaming only what is actually needed
– Enable users to easily and efficiently access much more data than they could possibly have locally; make diskless Tier-3s possible and practical
– Add redundancy to the existing data delivery mechanisms
The ideal scale would be the one where we use all the available bandwidth to keep all the CPUs busy, with a minimal number of rarely accessed files/datasets. The current system is much simpler than that. A sketch of what remote access looks like to a user follows.
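For illustration, a hedged PyROOT sketch of federated read access; the redirector hostname and file path are hypothetical, and CollectionTree is the usual xAOD tree name.

```python
# Hedged sketch: open a file by its global name through an XRootD
# redirector instead of a site-local path. Host and path are hypothetical.
import ROOT

# The redirector finds a replica anywhere in the federation; the client then
# streams over the WAN only the bytes it actually reads.
f = ROOT.TFile.Open(
    "root://fax-redirector.example.org//atlas/rucio/data15_13TeV:AOD.example.pool.root")
if f and not f.IsZombie():
    tree = f.Get("CollectionTree")
    print("entries:", tree.GetEntries())
```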

14 Part 3 – FAX currently
– Deployment completed
– Coverage >96%

15 FAX – stability
– Most sites are running stably
– Glitches do happen, but are usually fixed within a few hours
– No signs of stability issues caused by load

16 FAX usage
Physicists are slowly starting to use it:
– mainly thanks to our offline-software tutorial sessions
– only anecdotal evidence so far – due to both privacy and monitoring issues
Failovers – jobs that could not access their data from local storage try to get it elsewhere:
– a few thousand jobs per day; roughly half of them are saved

17 FAX usage
Overflows – our term for jobs brokered to a site that does not have the input data, sent with the explicit instruction to use FAX to get it.
Goals for the beginning of Run2:
– handle 5-10% of all analysis jobs
– job efficiency roughly the same as for jobs accessing data locally
– CPU efficiency at least 50% of that of locally run jobs
The decision to overflow is made by JEDI (sketched below) based on:
– where the data is
– how busy the destination site is
– what data rate a job can expect between source and destination
In operation since August; initially enabled only in the USA, now including 4 sites in Europe. We started with a suboptimal system; big improvements are still to come.
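An assumed sketch of that decision; the actual JEDI brokerage weights and thresholds are not given on the slide, so the constants and record layout below are hypothetical.

```python
# Assumed sketch of the JEDI overflow decision (not the real brokerage).

MIN_RATE_MBPS = 10.0      # assumed floor for useful remote reads
MAX_QUEUE_RATIO = 2.0     # assumed "too busy": queued > 2x running

def choose_site(job, sites, cost_matrix):
    """sites: candidate site dicts; cost_matrix[(src, dst)] -> MB/s, as
    measured by HammerCloud-style probes (see the cost-matrix slides)."""
    best, best_rate = None, 0.0
    for s in sites:
        if s["name"] == job["data_site"]:
            return s                                  # data is local: no overflow
        if s["queued"] > MAX_QUEUE_RATIO * max(s["running"], 1):
            continue                                  # destination too busy
        rate = cost_matrix.get((job["data_site"], s["name"]), 0.0)
        if rate >= MIN_RATE_MBPS and rate > best_rate:
            best, best_rate = s, rate                 # best remote option so far
    return best
```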

18 Job rates and efficiencies
Analysis jobs per hour:
– local access, all analysis queues: 33k
– local access, queues enabling overflow: 11k
– remote access: 3.2k (9% of all queues / 22% of the overflow-enabled queues)

19 Job efficiency
We were, and still are, debugging the system, and that lowers job efficiency. We will start blacklisting sites accepting/delivering overflow jobs when their FAX endpoint fails the tests. We are confident that the error rate will come down to the level of jobs accessing data locally.
– Last 4 days: local access 89%, remote access 83%
– Since the start: local access 82%, remote access 64%

20 CPU efficiency
Completely depends on the job mix:
– people are still not consistently using TTreeCache (example below)
– the ROOT version that auto-enables TTreeCache is still not in wide use
– Last 4 days: local access 80%, remote access 42%
– Since the start: local access 79%, remote access 34%
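Enabling TTreeCache by hand is only a few lines; this PyROOT sketch uses the standard TTree cache calls, with a hypothetical file name. Without a cache every basket is a separate round-trip, which is what hurts remote (FAX) reads most.

```python
# Enabling TTreeCache by hand in PyROOT; file name is hypothetical.
import ROOT

f = ROOT.TFile.Open(
    "root://fax-redirector.example.org//atlas/rucio/data15_13TeV:AOD.example.pool.root")
tree = f.Get("CollectionTree")

tree.SetCacheSize(100 * 1024 * 1024)   # 100 MB read cache
tree.SetCacheLearnEntries(10)          # learn the used branches on the first entries
for i in range(tree.GetEntries()):
    tree.GetEntry(i)                   # baskets now arrive in bulk vector reads
```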

21 FAX – scaling
We are confident it can scale to all the ATLAS queues. A few important optimizations are still in the queue:
– access the data from the optimal place
– reduce the number of space tokens in which the N2N looks for a file
Data volumes, last 4 days: local access 3 PB (avg 9 GB/s); remote access 1 PB (avg 3 GB/s).

22 Monitoring
The network has to be managed the same way we manage CPU and storage usage, and for that we need accounting. That is the part that has failed us the most:
– sites don't really cooperate in enabling it
o even when obliged by law
o we did not make it easy or clear how to do it
– we don't trust the currently shown MonALISA info
o instances where Ganglia shows 3 GB/s and MonALISA 100 MB/s
o huge differences between the summary and detailed monitoring plots

23 Current monitoring chain
Collector at SLAC:
– a complicated, custom-made construction
– quite difficult to support
– of dubious scalability
MonALISA:
– difficult to support
– not easily customizable
[Diagram of the current chain, with components labeled ActiveMQ, postgresql, MonALISA, Oracle, HDFS and Dashboard; parts marked "in development".]

24 Proposal for a new monitoring chain
Collectors:
– FLUME source(s)
o directly accept UDP messages
– FLUME sink
o aggregates and outputs straight into HDFS
Cleaning, re-summing:
– PIG job running every 10 min
– writes back into HDFS
Dashboard:
– ElasticSearch indexes the data in HDFS
– Kibana to visualize
Scalable, industry-standard tools; very customizable dashboard; easy to do detailed analytics. A sketch of the re-summing step follows.
[Diagram: FLUME → HDFS → ElasticSearch/Kibana.]
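The PIG job itself is not shown here; this is an equivalent Python sketch of the 10-minute cleaning/re-summing step, collapsing raw per-transfer records into per-(site, time-bin) byte totals before indexing. The record layout (timestamp, site, bytes read) is an assumption.

```python
# Python equivalent of the assumed 10-minute re-summing step.
from collections import defaultdict

BIN_S = 600  # 10-minute bins

def resum(records):
    """records: iterable of (unix_ts, site, bytes_read) raw entries."""
    totals = defaultdict(int)
    for ts, site, nbytes in records:
        bin_start = int(ts) // BIN_S * BIN_S      # truncate to the bin start
        totals[(bin_start, site)] += nbytes
    return totals                                  # to be written back into HDFS

raw = [(1422345600, "MWT2", 7000000),
       (1422345900, "MWT2", 5000000),
       (1422346300, "AGLT2", 2000000)]
print(dict(resum(raw)))
```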

25 Federation requirements
– Stability
– Monitoring
– User friendliness

26 Reserve

27 Actions on solving the protocol zoo
1. Short term (?): eliminate the need for space tokens (rely on paths) for accounting
2. Short term: move to gridFTP only for 3rd-party transfers (requires 1)
3. Short term: move to xrootd/HTTP for uploads and downloads (requires 1)
4. Short/medium term: commission HTTP/WebDAV to production quality for deletions
5. Medium term: decommission SRM from non-tape sites (requires all of the above)
6. Short term: move all directIO to xrootd (or file)
7. Short term: decommission the other directIO protocols (requires 6)
8. Medium/long term: commission xrootd and WebDAV for 3rd-party transfers
9. Medium/long term: decommission gridFTP (requires 8)
10. Long term: consolidate Davix and evaluate Davix and xrootd for directIO
11. Long term: keep both WebDAV and xrootd, or decommission one

FAX redirection: the FAX cost matrix
– Data collected between 20 ANALY queues (compute sites) and 58 FAX endpoints
– Jobs submitted by HammerCloud
– Results sent to ActiveMQ and consumed by the SSB, together with network and throughput measurements (perfSONAR and FTS)
A sketch of the publishing step follows.
[Diagram: HammerCloud and FTS results flowing via REST/ActiveMQ into the SSB FAX cost matrix; GAE dashboard.]
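A hedged sketch of how a cost-matrix measurement could be published to ActiveMQ over STOMP with the stomp.py library; the broker host, credentials, topic and message schema are all hypothetical, not the actual HammerCloud format.

```python
# Hedged sketch: publish one cost-matrix measurement to ActiveMQ via STOMP.
import json
import stomp

msg = {
    "src": "ANALY_MWT2",     # ANALY (compute) queue that ran the probe job
    "dst": "MWT2_FAX",       # FAX endpoint the data was read from
    "mbps": 94.2,            # measured read rate
    "ts": 1422345600,
}

conn = stomp.Connection([("mq.example.org", 61613)])   # hypothetical broker
conn.connect("user", "password", wait=True)
conn.send(destination="/topic/fax.cost", body=json.dumps(msg))
conn.disconnect()
```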

Cost matrix – results

Cost matrix – results: GAE-based dashboard