MultiJob pilot on Titan / ATLAS workloads on Titan
Danila Oleynik (UTA), Sergey Panitkin (BNL)
US ATLAS HPC Technical meeting, 18 September 2015


Overview
- Current status
- Workflow and issues
- ATLAS workloads on Titan

MultiJob Pilot on Titan
- First version deployed at the end of May 2015
- Validation started in June
- ~ completed jobs
- 2 simultaneous pilots
- Backfill mode only
- Increasing load from 15 to 300 PanDA jobs per bunch
- No steady stream of jobs from PanDA yet
- Steady stream of jobs from 15 to 300 jobs per bunch, two pilots

How to be efficient with backfill
Be fast and smart enough to grab resources before other users take them, and while the time gap is still long enough to execute the payload.
- Minimize startup time as much as possible
- Minimize post-processing procedures – to be ready for the next round
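On Titan the pilot learns about free backfill slots from the batch system; below is a minimal, hedged sketch of such a query, assuming Moab's showbf command is available and assuming a simplified column layout for its output (partition, tasks, nodes, duration). It illustrates the idea only, not the pilot's actual code.

```python
# Hedged sketch: probing Moab backfill with `showbf`.
# The column layout assumed below (partition, tasks, nodes, duration HH:MM:SS)
# is an approximation for illustration, not the exact Moab output format.
import subprocess

def query_backfill(partition="titan"):
    """Return a list of (free_nodes, seconds_available) tuples."""
    out = subprocess.run(["showbf", "-p", partition],
                         capture_output=True, text=True, check=True).stdout
    slots = []
    for line in out.splitlines():
        cols = line.split()
        if len(cols) >= 4 and cols[0].lower() == partition.lower():
            try:
                nodes = int(cols[2])
                h, m, s = (int(x) for x in cols[3].split(":"))
                slots.append((nodes, h * 3600 + m * 60 + s))
            except ValueError:
                continue  # skip header lines or 'INFINITY' durations
    return slots
```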

Workflow & Issues 1 Pilot requests information about available resources (backfill) According to available resources, pilot requests jobs from PanDA server -Sequential requests works good for small chunks (one request took about 1 sec.), but for chunks with few hundred jobs it take significant time. -Simultaneous requests gave about 10-15% of benefit due to execution time each particular request increasing in depending of number of parallel process -Finally bulk function on PanDA server was implemented - now request for 300 jobs tooks about sec.

Workflow & Issues 2 Pilot stage-in and prepares environment before launching Jobs updates with initial info (space report etc.) -Stage-in of one file take ~15-20 seconds; -Stage-in procedure was modified to decrease number of real remote downloads. It takes now for 300 jobs about 300 seconds, it’s very good, but will be great to decrease this time as well. -Since update of one job on PanDA server consume about 1 sec. I try to minimize number of updates of PanDA server before job really launched. For example, 'space report' was disabled - and now this information come with regular job state update.

Workflow & Issues 3 Request availble backfill and adjust number of jobs for launching. Not fitted jobs will be failed and sub status of jobs will be updated, as only MPI job will be launched.

Workflow & Issues 4 Pilot monitors availble space, output and sends hearbeat during payload execution Procedures, related with checking of space and outputs should be reviewed to avoid significant impacts to FS.

Workflow & Issues 5 Jobs completed, pilot perform stage-out and cleanup…. and it may take up to few hours

Conclusion about workflow
We still get 'surprises' related to scalability. There is still a lot of room for improvement: some gains may be achieved by optimizing the Pilot code, others by extending the architecture, for example optimization of data management, pre-staging of data, etc.

Update on Titan workloads Sergey Panitkin

Validation of the data produced on Titan
- We have run validation jobs since May
  – Started with Rel. … based tasks
- A set of physics validation pages was generated
- No conclusive message: some plots are green, some are red
- Looks like larger statistics are needed
  – A Rel. … based task started recently on Titan
  – Found a curious bug in the release's pacman file (reference to an older release). Fixed now.

Completed ATLAS jobs on Titan, May 1 to Sept 17

ATLAS Simulations with Geant4 v10
- Installed rel. …
- Ran on data suggested by Sasha: tests on events that caused crashes on other platforms
- Interim conclusion: what crashes on Intel runs on AMD
- We would like to run a few tasks with this release to help with Geant4 v10 validation

ATLAS simulation release
- "Small" release fitting on a RAM disk
  – ~12 GB in size right now
- Installed AthSimulationBase 1.0.0, then …
  – Many ROOT 6 persistency-related crashes
- Reported initial crashes to the dev. team (Zach and John Chapman)
- After fixes and patches suggested by Vakho, rel. … ran in single-threaded mode
- AthenaMP still crashes in the merging step
  – Identified as a CLHEP-related issue
  – Devs are working on it
  – Waiting for the next release, 1.0.2