Grid Computing I CONDOR.


Agenda
- What is Condor?
- What is Condor good for?
- How does Condor work?
- How do you submit a job?

What is Condor?
- Condor converts collections of distributively owned workstations and dedicated clusters into a distributed high-throughput computing (HTC) facility.
- Condor manages both resources (machines) and resource requests (jobs).
- Condor has several unique mechanisms, such as:
  - ClassAd matchmaking
  - Process checkpoint / restart / migration
  - Remote system calls
  - Grid awareness

How Condor works
- Condor provides:
  - a job queueing mechanism
  - a scheduling policy
  - a priority scheme
  - resource monitoring
  - resource management
- Users submit their serial or parallel jobs to Condor. Condor then:
  - places them into a queue,
  - chooses when and where to run the jobs based upon a policy,
  - carefully monitors their progress, and
  - ultimately informs the user upon completion.

Condor Architecture

Condor Daemons in action

condor_master
- Starts up all other Condor daemons.
- If a problem occurs and a daemon exits, it restarts the daemon and sends email to the administrator.
- Checks the time stamps on the binaries of the other Condor daemons. If new binaries appear, the master gracefully shuts down the currently running version and starts the new version.
- Also supports various administrative commands, such as starting, stopping, or reconfiguring daemons remotely.

condor_startd
- Represents a machine to the Condor system.
- Advertises information about the node's resources to the Central Manager (condor_collector).
- Responsible for starting, suspending, and stopping jobs.
- Enforces the wishes of the machine owner (the owner's "policy").

condor_starter
- Runs only on the execution host.
- Sets up the execution environment and monitors the job.

condor_schedd
- Represents users to the Condor system.
- Maintains the persistent queue of jobs.
- Responsible for contacting available machines and sending them jobs.
- Services user commands that manipulate the job queue: condor_submit, condor_rm, condor_q, condor_hold, condor_release, condor_prio.

condor_collector
- Collects information from all other Condor daemons in the pool.
- Acts as the "directory service" / database for a Condor pool.
- Each daemon sends a periodic update called a "ClassAd" to the collector.
- Services queries for information:
  - Queries from other Condor daemons
  - Queries from users (condor_status)

condor_negotiator
- Performs "matchmaking" in Condor.
- Gets information from the collector about all available machines and all idle jobs.
- Tries to match jobs with machines that will serve them.
- Both the job and the machine must satisfy each other's requirements.
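The two-sided rule on this slide (job and machine must each satisfy the other's requirements) can be illustrated with a minimal sketch. The `Ad` and `MatchSketch` classes, the attribute names, and the owner policy are all hypothetical simplifications for illustration; they are not Condor's ClassAd language or API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Predicate;

/** Hypothetical, simplified stand-in for a ClassAd: attributes plus a test applied to the other party. */
final class Ad {
    final Map<String, Object> attrs = new HashMap<>();
    Predicate<Ad> requirements = other -> true; // by default, accept anything

    Ad set(String name, Object value) { attrs.put(name, value); return this; }
    Ad require(Predicate<Ad> r) { requirements = r; return this; }

    /** Returns the integer attribute, or 0 when it is missing or not an integer. */
    int intAttr(String name) {
        Object v = attrs.get(name);
        return (v instanceof Integer) ? (Integer) v : 0;
    }

    /** Returns the string attribute, or "" when it is missing or not a string. */
    String strAttr(String name) {
        Object v = attrs.get(name);
        return (v instanceof String) ? (String) v : "";
    }
}

final class MatchSketch {
    /** A match requires that each side accepts the other. */
    static boolean matches(Ad job, Ad machine) {
        return job.requirements.test(machine) && machine.requirements.test(job);
    }

    public static void main(String[] args) {
        // Machine ad: offers 64 MB on Linux; the owner's (assumed) policy only admits jobs from "alice".
        Ad machine = new Ad().set("Memory", 64).set("OpSys", "LINUX")
                .require(j -> j.strAttr("Owner").equals("alice"));
        // Job ad: needs at least 32 MB on Linux.
        Ad job = new Ad().set("Owner", "alice")
                .require(m -> m.intAttr("Memory") >= 32 && m.strAttr("OpSys").equals("LINUX"));
        System.out.println(matches(job, machine));
    }
}
```

Representing requirements as predicates keeps the sketch short; a real matchmaker would also rank candidate machines and respect user priorities, which this sketch leaves out.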

Job Life Cycle in Condor
- Job submission: A job is submitted from a host using the condor_submit command.
- Job request advertising: On receiving a job request, the condor_schedd daemon on the submission host advertises the request to the condor_collector.
- Resource advertising: Each condor_startd daemon running on an execution host advertises the host's available resources to the condor_collector.

Job life cycle (Cont…)
- Resource matching: The condor_negotiator daemon queries the condor_collector daemon to match a resource to a user's job request. It then informs the condor_schedd on the submission host of the matched host.
- Job execution: The condor_schedd on the submission host contacts the condor_startd daemon running on the matched host, which spawns a condor_starter daemon. The condor_schedd on the submission host spawns a condor_shadow daemon.
- Return output: When the job completes, the results are sent back to the submission host.

Condor Universes
- A universe in Condor defines an execution environment.
- Condor can support various combinations of features and environments in different universes.
- Different universes provide different functionality for your job.

Condor Universes
- Serial jobs:
  - Vanilla Universe
  - Standard Universe
  - Scheduler Universe
- Parallel jobs:
  - PVM Universe
  - MPI Universe
- Java Universe
- Globus Universe

Vanilla universe
- Intended for programs that cannot be relinked.
- The existing executable can be used without re-compiling or re-linking.
- Cannot use remote system calls.
- No checkpointing and no migration.
- Condor can suspend or restart the job.

Standard universe
- Provides checkpointing and automatic migration for sequential jobs.
- The existing program must be re-linked with the Condor instrumentation library.
- The application cannot use some system calls (fork, socket, alarm).
- Intercepts file operations and passes them back to the shadow process.
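The idea behind checkpoint, restart, and migration can be sketched conceptually as follows. Note that this is an application-level illustration with hypothetical names (`CheckpointSketch`, `State`, `run`); in the standard universe the re-linked library checkpoints the job transparently, so the job itself contains no such code.

```java
final class CheckpointSketch {
    /** The complete job state: the next loop index and the running total. */
    static final class State {
        final int next;
        final long total;

        State(int next, long total) {
            this.next = next;
            this.total = total;
        }
    }

    /**
     * Continues summing next..end from the given state, but stops after at most
     * maxSteps iterations, as if the machine had been reclaimed by its owner.
     * The returned state is the "checkpoint" from which work can resume.
     */
    static State run(State from, int end, int maxSteps) {
        int i = from.next;
        long total = from.total;
        int done = 0;
        while (i <= end && done < maxSteps) {
            total += i;
            i++;
            done++;
        }
        return new State(i, total);
    }

    public static void main(String[] args) {
        State start = new State(1, 0L);
        // "Machine A" is interrupted after 40 iterations; the state is saved.
        State checkpoint = run(start, 100, 40);
        // "Machine B" resumes from the checkpoint instead of starting over.
        State finished = run(checkpoint, 100, Integer.MAX_VALUE);
        System.out.println(finished.total);
    }
}
```

The resumed run reaches the same result as an uninterrupted one, which is the property that lets work already done on a reclaimed machine carry over to another machine.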

Scheduler Universe
- The job does not wait to be matched to a machine. Instead, it executes right away on the machine where the job is submitted.
- Machine requirements are not considered.

PVM universe
- Used to run parallel jobs written for PVM 3.4.

MPI universe
- MPICH programs can be used without any changes.
- Dynamic changes are not supported.
- The application cannot be suspended.

Java Universe
- The submitted program runs on any sort of machine with a JVM, regardless of its location, owner, or JVM version.
- Condor takes care of all the details, such as finding the JVM binary and setting the classpath.

Globus Universe
- Provides the standard Condor interface to Globus users.
- Each job submission file is translated into Globus RSL.
- Jobs are submitted to Globus via the GRAM protocol.

Submitting a job
Write a Java class and compile it.

    public class Simple {
        public static void main(String[] args) {
            // ...
        }
    }
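The slide elides the body of the class. As a hypothetical stand-in only, the sketch below assumes the program takes two integer arguments and prints the sum of the integers between them; the `sumRange` helper and that behaviour are assumptions for illustration, not part of the original presentation. The class is left package-private so the sketch compiles regardless of file name.

```java
/** Hypothetical body for the slide's Simple class: sums the integers from..to. */
class Simple {
    /** Returns from + (from + 1) + ... + to, or 0 when from is greater than to. */
    static long sumRange(int from, int to) {
        long total = 0;
        for (int i = from; i <= to; i++) {
            total += i;
        }
        return total;
    }

    public static void main(String[] args) {
        if (args.length != 2) {
            System.err.println("usage: Simple <from> <to>");
            System.exit(1);
        }
        int from = Integer.parseInt(args[0]);
        int to = Integer.parseInt(args[1]);
        // Standard output is what a batch system would capture in the job's output file.
        System.out.println(sumRange(from, to));
    }
}
```

Compiling with `javac Simple.java` produces `Simple.class`, the file a Java universe job would name as its executable.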

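Submitting a job (Cont…)
Create a submit file. Name this file submit.java

    Universe   = java
    Executable = Simple.class
    Arguments  = Simple 4 10
    Log        = simple.log
    Output     = simple.out
    Error      = simple.error
    Queue

In the Java universe, the first argument is the name of the class that contains main; the remaining arguments are passed to the program.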

Submitting a job (Cont…)

Example job description file

    Universe     = vanilla
    Executable   = foo
    Requirements = Memory >= 32 && OpSys == "LINUX" && Arch == "x86"
    Image_Size   = 28 Meg
    Error        = err.$(Process)
    Input        = in.$(Process)
    Output       = out.$(Process)
    Log          = foo.log
    Queue 150
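In the job description above, Queue 150 creates 150 jobs, and $(Process) takes the values 0 through 149, so each job gets its own input, output, and error files. The following sketch only illustrates that naming scheme; `ProcessMacroSketch` and its methods are hypothetical and do not reproduce Condor's actual macro expansion.

```java
import java.util.ArrayList;
import java.util.List;

final class ProcessMacroSketch {
    /** Replaces every literal "$(Process)" in the template with the process number. */
    static String expand(String template, int process) {
        return template.replace("$(Process)", Integer.toString(process));
    }

    /** Expands the template once per queued job, numbering processes from 0. */
    static List<String> expandAll(String template, int queueCount) {
        List<String> names = new ArrayList<>();
        for (int p = 0; p < queueCount; p++) {
            names.add(expand(template, p));
        }
        return names;
    }

    public static void main(String[] args) {
        List<String> inputs = expandAll("in.$(Process)", 150);
        // Print the first and last input file names of the 150-job cluster.
        System.out.println(inputs.get(0) + " ... " + inputs.get(inputs.size() - 1));
    }
}
```

Before submitting such a cluster, the user would therefore need to prepare the files in.0 through in.149 in the submit directory.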

Current Limitations
- There are limitations on which jobs can be checkpointed.
- Jobs need to be re-linked to get checkpointing and remote system calls.

Summary
- Condor is a special resource management (batch) system.
- It is a distributed, heterogeneous system.
- Goal: exploitation of spare computing cycles.
- It can migrate jobs from one machine to another.
- The ClassAds mechanism is used to match resource requirements with resources.

References This presentation was prepared from the material provided by the Condor Project Team http://www.cs.wisc.edu/condor/