Job submission overview
Marco Mambelli – August 9, 2011
OSG Summer Workshop, TTU, Lubbock, TX
The University of Chicago

Campus and Production Grid
Campus Grid
- Direct trust relationship
- Local authentication (username and password)
- Accessible as a single Condor pool
Production Grid
- OSG and VO mediated trust relationship
- Authentication via grid certificates (X.509)
- Varied resources accessed via the GRAM protocol
- Different tools to submit jobs
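As a rough illustration of the two access modes (not from the slides; host names and the VO name are placeholders), checking a campus pool versus preparing production-grid credentials might look like this:

    # Campus Grid: local authentication, resources appear as a single Condor pool
    ssh user@campus-submit.example.edu     # log in with the local username/password
    condor_status                          # list the slots of the campus pool

    # Production Grid: X.509 / VO-mediated authentication comes first
    voms-proxy-init -voms myvo             # create a VOMS proxy (VO name is a placeholder)
    globusrun -a -r ce.example.edu/jobmanager-pbs   # authentication-only test against a GRAM gatekeeper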

Job classification
HPC - High Performance Computing
- Massively parallel
- Specialized resources
HTC - High Throughput Computing
- Many independent jobs
- Common denominator across resources
HTPC - High Throughput Parallel Computing
- Take advantage of multi-core architectures
- Multiple threads on a single node
All jobs in OSG should have minimal assumptions about the resources
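For the HTPC case, a minimal sketch of a submit description that asks for all the cores of one node; request_cpus, the core count, and the executable name are illustrative assumptions, not taken from the slides:

    # HTPC sketch: one multi-threaded job using a whole (assumed 8-core) node
    Universe     = vanilla
    Executable   = threaded_app      # placeholder multi-threaded binary
    request_cpus = 8                 # ask for a whole multi-core slot
    Log          = threaded_app.log
    Output       = threaded_app.out
    Error        = threaded_app.err
    Queue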

High Throughput Computing
Opposed to High Performance Computing
- Simple serial jobs, Monte Carlo simulations, parameter sweeps (pleasantly parallel)
- Complex workflows with data and/or job dependencies
- ...and ensembles of small node-count MPI jobs (HTPC)
The user is decoupled from the resource:
- Must be able to express the work in a highly portable way
- Cannot care about exactly where the pieces are executed
- May not have SSH access to the resources -> many implications
- Must be able to respect the community rules of a shared infrastructure; "plays well with others"
The goal is to minimize the time between job submit and job complete.

Local vs Distributed
In both cases: submission, wait in queue, execution, completion
The distributed case adds:
- Variety of resources
- Many possibilities for failures (more complex)
- Different ordering
- Big latency
- Black boxes
- Difficult end-to-end monitoring
(Slide diagram: local vs distributed job flow)

Direct submission vs Brokering
- Direct submission: jobs compete for resources
- Brokering: coordinates the scheduling
- There are still other models
(Slide diagram: direct submission vs brokering)

Single job vs Workflow
- Directed acyclic graph (DAG)
- Workflow manager
  - Runs jobs in order
  - Retries failed jobs
(Slide diagram: single job vs workflow)
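To make the workflow-manager idea concrete, here is a minimal DAGMan description; the node names and submit-file names are placeholders:

    # diamond.dag - hypothetical DAG: A runs first, B and C in parallel, then D
    JOB A a.sub
    JOB B b.sub
    JOB C c.sub
    JOB D d.sub
    PARENT A CHILD B C
    PARENT B C CHILD D
    RETRY D 3        # resubmit node D up to 3 times if it fails

    # run the whole workflow under DAGMan:
    #   condor_submit_dag diamond.dag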

Direct execution vs Pilot jobs
Pilots:
- Separate the user job from the grid job
- Have some overhead
- Can perform some initial tests
- Uniform (enhanced) environment
- Delayed scheduling
(Slide diagram: direct execution vs pilot jobs)

Pilot system: Pilot vs Direct job submission
Pilot:
- Submitted by factories: central (GlideinWMS, autopilot) or cluster factories
- Managed by factories
- Scheduled centrally (allows late scheduling)
- Code to support VO job execution
- Submitted continuously
- Partially accounted: no big deal if some fail
Direct job:
- Submitted by users or production managers
- Managed by a workflow manager (or the user)
- Scheduled by local resource managers (cluster queues)
- Code for both support and VO tasks
- Submitted when needed
- Fully accounted: error statistics

Pilot system: Pilot vs VO job
Pilot:
- Submitted by factories: central (GlideinWMS, autopilot) or cluster factories
- Managed by factories
- Code to support job execution
- Submitted continuously
- Partially accounted: no big deal if some fail
VO job (on pilot):
- Submitted by users or production managers
- Managed by the Panda/GlideinWMS server
- Runs VO software
- Submitted when needed
- Fully accounted: error statistics

Low level job management
Tools:
- LRM: Condor, PBS, LSF, SGE, ...
- Globus GRAM
They are:
- More flexible
- A lot of different tools
- Steep learning curve
- Technology may change (CREAM)
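A sketch of what submission looks like at this level, assuming a PBS cluster and a Globus Toolkit installation; the gatekeeper host and job-manager names are placeholders:

    # Local resource manager (PBS): submit directly on the cluster head node
    qsub myjob.pbs

    # Globus GRAM: run a command on a remote gatekeeper, routed to its LRM
    globus-job-run ce.example.edu/jobmanager-pbs /bin/hostname

    # Globus GRAM: batch-style submission returning a job handle for later queries
    globus-job-submit ce.example.edu/jobmanager-pbs /bin/sleep 60
    globus-job-status <handle_returned_by_the_previous_command>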

More advanced tools
- Condor-G: uniform interface, workflow management (DAGMan)
- OSG MM: brokering, uniform interface (Condor-G), workflow management (DAGMan)
- GlideinWMS: brokering, uniform interface (Condor-G), workflow management (DAGMan)
- Panda: monitoring, brokering, uniform interface (Condor-G), workflow management (DAGMan), integrated with VO software
- CRAB: integrated with VO software
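To show how Condor-G gives a uniform interface over GRAM resources, a minimal grid-universe submit description is sketched below; the gatekeeper name and the gt2 GRAM flavor are assumptions:

    # condorg.sub - sketch of a Condor-G submission to a GRAM gatekeeper
    Universe      = grid
    grid_resource = gt2 ce.example.edu/jobmanager-pbs   # placeholder gatekeeper/job manager
    Executable    = serial64ff
    Log           = serial64ff.log
    Output        = serial64ff.stdout
    Error         = serial64ff.stderr
    should_transfer_files   = YES
    when_to_transfer_output = ON_EXIT
    x509userproxy = /tmp/x509up_u20003    # grid proxy, as in the GlideinWMS example later
    Queue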

PANDA
PANDA = Production ANd Distributed Analysis system
- Designed for analysis as well as production
- Project started Aug 2005, prototype Sep 2005, in production Dec 2005
- Works with both OSG and EGEE middleware
A single task queue and pilots:
- Apache-based central server
- Pilots retrieve jobs from the server as soon as CPU is available -> late scheduling
Highly automated, with an integrated monitoring system
Integrated with the ATLAS Distributed Data Management (DDM) system
Not exclusively ATLAS: also CHARMM, OSG ITB

Panda System
(Slide diagram: end users and production (ProdDB/bamboo) submit jobs to the Panda server; condor-g and autopilot submit pilots to the sites; pilots on the worker nodes pull jobs from the server over https; other components shown: job logger, DQ2 and LRC/LFC data management)

Panda Monitor: production dashboard
(Slide screenshot)

Panda Monitor: dataset browser
(Slide screenshot)

Panda Monitor: error reporting
(Slide screenshot)

Glide-in WMS
Provides a simple way to access Grid resources. GlideinWMS is a glidein-based WMS (Workload Management System) that works on top of Condor.
Workflow (illustrated on the next slide):
- Users submit jobs to the schedd process of the user pool.
- The GlideinWMS frontend polls the user pool to make sure that there are enough glideins (worker nodes) to satisfy the user jobs; it submits requests to the glidein factory to launch glideins.
- Alternatively, users can control their workflow with the CorralWMS frontend.
- The glidein factory and WMS pool receive requests from the frontend(s) and submit a condor_startd wrapper to the entry points (grid sites).
- The grid sites receive the jobs and start a condor_startd that joins the user pool, showing up as a resource in the user pool.
End users can then submit regular Condor jobs to the local queue.
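Once glideins join the user pool they look like ordinary Condor slots, so standard Condor commands can be used to inspect them; this sketch relies on the IS_GLIDEIN attribute used in the submit file later in the talk, and assumes the glideins publish it on their startds:

    condor_status -constraint 'IS_GLIDEIN == True'   # list only the glidein slots in the user pool
    condor_q -run                                    # see which of your jobs are running on them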

Glide-in WMS workflow
(Slide diagram illustrating the steps above)

Glide-in WMS system
(Slide diagram)

Submit File for Condor (or Campus Grid)

    Universe = vanilla
    #notify_user =
    Executable = serial64ff
    #transfer_executable = false
    # Files (and directory) in the submit host: log, stdout, stderr, stdin, other files
    InitialDir = run_1
    Log = serial64ff.log
    Output = serial64ff.stdout
    Error = serial64ff.stderr
    #input = serial64.in
    fetch_files=serial64ff.in
    #Arguments =
    should_transfer_files = IF_NEEDED
    when_to_transfer_output = ON_EXIT
    Queue

Submit File for Glidein WMS

    Universe = vanilla
    #notify_user =
    Executable = serial64ff
    #transfer_executable = false
    # Files (and directory) in the submit host: log, stdout, stderr, stdin, other files
    InitialDir = run_1
    Log = serial64ff.log
    Output = serial64ff.stdout
    Error = serial64ff.stderr
    #input = serial64.in
    fetch_files=serial64ff.in
    #Arguments =
    should_transfer_files = IF_NEEDED
    when_to_transfer_output = ON_EXIT
    # This line is required to specify that this job should run on the GlideinWMS system
    requirements = IS_GLIDEIN == True
    x509userproxy = /tmp/x509up_u20003
    Queue
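Assuming the description above is saved as serial64ff.sub (the file name is a placeholder) and a valid grid proxy exists at the path given by x509userproxy, submission and monitoring from the submit host would look roughly like this:

    voms-proxy-init -voms myvo     # or grid-proxy-init; VO name is a placeholder
    condor_submit serial64ff.sub   # queue the job in the user pool
    condor_q                       # watch it wait for a matching glidein slot
    condor_history -limit 1        # inspect the job record after completion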

?!?!