1 Concepts of Condor and Condor-G Guy Warner

2 Harvesting CPU time
Teaching labs: often-idle processors. Researchers: analyses constrained by CPU time!

3 Harvesting CPU time
Teaching lab machines lie idle for most of the time
Harvest spare compute cycles to create a low-cost “high throughput computing” (HTC) platform
–Goal: run many tasks in a week, month, …
–Typically: many similar tasks invoked from a workflow or a script, e.g. Monte-Carlo simulation or parameter sweeps
Pool processors as a batch processing resource
Submit jobs that run when a machine is free
Condor is the most common approach

4 Example: viewshed analyses
Derive viewsheds for all points in a “digital elevation model” (DEM)
Build a database to allow
–Derivation of indices to characterise viewsheds
–Applications to access pre-calculated viewsheds
Viewsheds: what can be seen from the point at “+”
Mineter, Dowers, Caldwell

5 Example: viewshed analyses
A coordinating processor runs the Coordinator, which starts and monitors the Condor jobs and builds the database from the intermediate files
Worker processors (running Windows) run the Condor viewshed-creation jobs against the DEM
Typical run: 39 PCs (Windows), 330 CPU days in 9.94 days elapsed
Mineter, Dowers, Caldwell

6 Condor
Converts collections of distributed workstations and dedicated clusters into a high-throughput computing (HTC) facility – a Condor Pool
Manages both resources (machines) and resource requests (jobs)
Cycle scavenging – using a non-dedicated resource when it would otherwise be idle

7 Terminology
Cluster – a dedicated set of computers not for interactive use (definition by Alain Roy)
Pool – a collection of computers used by Condor
–May or may not be dedicated
–Single administrative domain
Central Manager – one per pool
–Matches resource requests to resources (matchmaking)
–Pool management
Flocking – running jobs submitted from pool A on pool B, i.e. sharing resources across administrative domains, possibly with user prioritization

8 Architecture
All nodes run condor_master
–Responsible for control of a node
The Central Manager additionally runs condor_collector and condor_negotiator
–Responsible for matchmaking
An Execute Node additionally runs condor_startd
–Responsible for starting a job
A Submit Node additionally runs condor_schedd
–Responsible for submitting jobs (and allowing the user to monitor them)
A node must be at least one of Manager/Execute/Submit, but may be more
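For illustration, which daemons a node runs is controlled by the DAEMON_LIST setting in its Condor configuration; a minimal sketch of the three roles (host name hypothetical):

    # Central Manager
    DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR

    # Submit node
    DAEMON_LIST = MASTER, SCHEDD

    # Execute node
    DAEMON_LIST = MASTER, STARTD

    # every node points at the same central manager (hypothetical host)
    CONDOR_HOST = tc12.example.org

A node playing several roles simply lists the union of the daemons, e.g. MASTER, COLLECTOR, NEGOTIATOR, SCHEDD, STARTD for a one-machine Personal Condor.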

9 Example Configurations
Personal Condor – all services on one node
–Gain the benefits of Condor's job management
–E.g. only run jobs on your desktop PC when you are not using it
Dedicated Cluster – Manager/Submit on the head-node, all other nodes are Execute
–Users ssh to the head-node
Shared workstations – one workstation dedicated as Manager, all others as Submit/Execute
–Submission from any workstation (except the Manager)
–Nodes join the pool as Execute nodes when idle and leave the pool when keyboard/mouse activity is detected

10 Our setup
Classroom PCs, reached via gsissh:
–tc12 – MANAGER
–tc03, tc05, tc07 – SUBMIT
–tc13, tc18, … – EXECUTE

11 Job Life-Cycle
–The user submits a job to the schedd (on the Submit machine)
–The schedd advertises the request to the collector (on the Central Manager)
–The startd (on the Execute machine) advertises its resource to the collector
–The negotiator matches the resource to the request and informs the schedd
–The schedd contacts the startd; the schedd spawns a shadow and the startd spawns a starter
–The shadow transfers the job details to the starter, which executes the job
–Results are returned asynchronously
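From the user's point of view this cycle is driven with a few commands on the submit machine (the submit file name job.sub is hypothetical):

    condor_submit job.sub    # hand the job description to the schedd
    condor_q                 # watch the job in the schedd's queue
    condor_status            # list machines the startds have advertised to the collector
    condor_rm 42             # remove job cluster 42 (the id reported by condor_submit)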

12 Job types – “Universes”
A job runs within a particular universe, which provides a particular kind of environment / functionality: vanilla, standard, Globus, Java, MPI, scheduler (e.g. a DAG / workflow submit), …
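The universe is chosen in the job's submit description file; a minimal vanilla-universe sketch (executable and file names hypothetical):

    universe   = vanilla
    executable = analysis.exe
    arguments  = input.dat
    output     = analysis.out
    error      = analysis.err
    log        = analysis.log
    queue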

13 Some Universes
Vanilla
–Any job that does not require the features of the other universes
Standard
–Goal: to reproduce the home Unix environment
–Emulates standard system calls: file I/O (remote access to home files), signal routing, resource management
–Adds checkpointing: the entire state is saved into a checkpoint file, from which the job can restart, possibly on a different machine, allowing migration and backwards error recovery – important for very long-running jobs
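A job enters the standard universe by being relinked with condor_compile, which adds the checkpointing and remote-system-call libraries; a sketch (source and binary names hypothetical):

    condor_compile gcc -o sim sim.c    # relink for checkpointing and remote I/O

The resulting binary is then submitted with universe = standard in its submit file.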

14 Some More Universes
Java
–To provide a standard Java environment
MPI
–To allow use of the Message Passing Interface between component processes
Scheduler
–For running a job that schedules other jobs; the standard example is DAGMan
Globus – “Condor-G”
–Access to a Globus job manager
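As an illustration, a Java-universe submit file names the compiled class file as the executable and passes the main class as the first argument (class and file names hypothetical):

    universe   = java
    executable = Analysis.class
    arguments  = Analysis
    output     = java.out
    error      = java.err
    log        = java.log
    queue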

15 Condor-G
Condor-G is the tool that provides the globus universe
The same Condor tools that access local resources are now able to use the Globus protocols to access resources at multiple sites
–Only one additional line in the submit file
Condor-G
–manages both a queue of jobs and the resources from one or more sites where those jobs can execute
–communicates with these resources and transfers files to and from them using Globus mechanisms
–is more than just a replacement for globusrun
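A sketch of a Condor-G submit file from this generation of Condor, where the extra line is the gatekeeper contact (host name and jobmanager hypothetical):

    universe        = globus
    globusscheduler = gatekeeper.example.org/jobmanager-fork
    executable      = analysis.exe
    output          = analysis.out
    error           = analysis.err
    log             = analysis.log
    queue

(Later Condor releases fold this into the grid universe with a grid_resource line, but the idea is the same.)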

16 Condor & Globus
Condor uses the following Globus protocols:
–GSI – for authentication and authorization
–GRAM – the protocol Condor-G uses to talk to remote Globus jobmanagers
–GASS – to transfer the executable, stdin, stdout and stderr between the submit machine and the remote resource
–RSL – to specify job information
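In practice this means obtaining a GSI proxy before submitting, with any extra RSL attributes passed through the submit file's globusrsl attribute; a hedged sketch (file and queue names hypothetical):

    grid-proxy-init              # create a GSI proxy certificate (authentication/authorization)
    condor_submit analysis.sub   # Condor-G then uses GRAM/GASS to reach the remote jobmanager

with, for example, globusrsl = (queue=short)(maxWallTime=60) added to analysis.sub.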

17 DAGMan
Directed Acyclic Graph Manager
–Specify dependencies as a DAG
–Can re-start a partially completed DAG (it records what has been done)
Each vertex (DAG node) is a normal submit description, for any universe (DAGMan itself is most appropriately run in the scheduler universe)
Each arc is a sequencing constraint – the parent must finish before the child starts
(The slide's example DAG had node A as the parent of nodes B–G.)
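A DAG is described in a plain-text file; the node and submit file names below are illustrative, mirroring a fan-out in which A must finish before B–G start:

    # example.dag
    JOB A a.sub
    JOB B b.sub
    JOB C c.sub
    JOB D d.sub
    JOB E e.sub
    JOB F f.sub
    JOB G g.sub
    PARENT A CHILD B C D E F G

It is submitted with condor_submit_dag example.dag; if the run fails part-way through, DAGMan writes a rescue DAG recording what completed, so the remainder can be resubmitted.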

18 Match Making
Matchmaking is fundamental to Condor
Matchmaking is two-way
–The job describes what it requires: I need Linux && 2 GB of RAM
–The machine describes what it provides: I am a Mac
Matchmaking allows preferences
–I need Linux, and I prefer machines with more memory, but will run on any machine you provide me
Condor conceptually divides people into three groups (who may or may not be the same people):
–Job submitters
–Machine owners
–Pool (cluster) administrators
All three of these groups have preferences
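In a submit file these constraints and preferences are written as ClassAd expressions; the sketch below mirrors the example above (threshold values illustrative; Memory is in MB):

    requirements = (OpSys == "LINUX") && (Memory >= 2048)   # must have: Linux and 2 GB RAM
    rank         = Memory                                   # prefer: machines with more memory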

19 ClassAds
ClassAds state facts
–Submit-side, e.g. my job's executable is analysis.exe
–Execute-side
  Dynamic, e.g. my machine's load average is 5.6
  Static, e.g. my machine's operating system is Linux
ClassAds state preferences
–Job submitter preferences, e.g. I require a computer with Linux
–Machine owner preferences, e.g. I prefer jobs from the physics group
–System administrator preferences, e.g. when can jobs pre-empt other jobs?
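These facts appear as attribute = value pairs in the job and machine ClassAds; a few illustrative lines (values hypothetical):

    # job ClassAd (submit side)
    Cmd     = "analysis.exe"
    Owner   = "gwarner"      # hypothetical user name

    # machine ClassAd (execute side)
    OpSys   = "LINUX"
    LoadAvg = 5.6
    Memory  = 2048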

20 The Tutorial
The tutorial covers:
–Basic Condor usage (vanilla universe)
–Standard universe
–Condor-G
–DAGMan
–Matchmaking
Condor.html