Condor Project, Computer Sciences Department, University of Wisconsin-Madison. Advanced Condor Mechanisms. CERN, Feb 14, 2011.



2 A better title… “Condor Potpourri” › Igor's feedback: “Could be useful to people, but not Monday” › If a topic is not of interest, a new one starts in 1 minute

3 Central Manager Failover › The Condor Central Manager has two services › condor_collector  A list of collectors is now supported › condor_negotiator (matchmaker)  If it fails, an election process lets another take over  Contributed technology from the Technion
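A minimal configuration sketch for central manager failover, assuming two hypothetical central managers cm1 and cm2; the knob names follow the condor_had setup in the manual and should be checked against your Condor version:

  # On every machine in the pool: know about both collectors
  COLLECTOR_HOST = cm1.example.edu, cm2.example.edu

  # On each central manager: run HAD and REPLICATION next to the usual daemons
  DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, HAD, REPLICATION
  HAD_LIST = cm1.example.edu:$(HAD_PORT), cm2.example.edu:$(HAD_PORT)
  REPLICATION_LIST = cm1.example.edu:$(REPLICATION_PORT), cm2.example.edu:$(REPLICATION_PORT)
  HAD_USE_PRIMARY = True
  HAD_USE_REPLICATION = True
  MASTER_NEGOTIATOR_CONTROLLER = HAD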

4 Submit node robustness: job progress continues if the connection is interrupted › Condor supports re-establishment of the connection between the submitting and executing machines  If there is a network outage between the execute and submit machine  If the submit machine restarts › To take advantage of this feature, put the following line into the job’s submit description file: job_lease_duration = <number of seconds>  For example: job_lease_duration = 1200

5 Submit node robustness: job progress continues if the submit machine fails › Automatic Schedd Failover › Condor can support a submit machine “hot spare”  If your submit machine A is down for longer than N minutes, a second machine B can take over  Requires a shared filesystem (or just DRBD*?) between machines A and B  *Distributed Replicated Block Device
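A sketch of the matching configuration on submit machines A and B, assuming a hypothetical shared path; the knob names are the standard high-availability schedd settings:

  # Spool directory shared between A and B (shared filesystem or DRBD-backed)
  SPOOL = /shared/condor/spool
  HA_LOCK_URL = file:$(SPOOL)
  VALID_SPOOL_FILES = $(VALID_SPOOL_FILES) SCHEDD.lock
  # The master starts the schedd only on the machine holding the lock
  MASTER_HA_LIST = SCHEDD
  HA_POLL_PERIOD = 300
  HA_LOCK_HOLD_TIME = 300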

6 DRBD

7 Interactive Debugging › Why is my job still running? Is it stuck accessing a file? Is it in an infinite loop? › condor_ssh_to_job  Interactive debugging in UNIX  Use ps, top, gdb, strace, lsof, …  Forward ports, X, transfer files, etc.

8 condor_ssh_to_job Example
% condor_q
-- Submitter: perdita.cs.wisc.edu : … : perdita.cs.wisc.edu
 ID      OWNER    SUBMITTED     RUN_TIME  ST PRI SIZE CMD
 1.0     einstein 4/15  06:…   …:10:05    R  …   …    cosmos
1 jobs; 0 idle, 1 running, 0 held
% condor_ssh_to_job 1.0
Welcome to …!
Your condor job is running with pid(s) ….
$ gdb -p …

9 How it works › ssh keys are created for each invocation › ssh  Uses OpenSSH ProxyCommand to use the connection created by condor_ssh_to_job › sshd  Runs as the same user id as the job  Receives the connection in inetd mode, so nothing new is listening on the network  Works with CCB and shared_port

10 What?? Ssh to my worker nodes?? › Why would any sysadmin allow this? › Because the process tree is managed  Cleanup at end of job  Cleanup at logout › Can be disabled by nonbelievers
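For the nonbelievers, a one-line sketch of the execute-side configuration (ENABLE_SSH_TO_JOB is the documented knob):

  # Refuse condor_ssh_to_job connections on this execute machine
  ENABLE_SSH_TO_JOB = False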

11 Concurrency Limits › Limit job execution based on admin-defined consumable resources  E.g. licenses › Can have many different limits › Jobs say what resources they need › The negotiator enforces limits pool-wide

12 Concurrency Example
› Negotiator config file:
   MATLAB_LIMIT = 5
   NFS_LIMIT = 20
› Job submit file:
   concurrency_limits = matlab,nfs:3
   This requests 1 Matlab token and 3 NFS tokens

13 Green Computing › The startd has the ability to place a machine into a low power state (Standby, Hibernate, Soft-Off, etc.)  HIBERNATE, HIBERNATE_CHECK_INTERVAL  If all slots return non-zero, the machine can be powered down via the condor_power hook  A final acked classad containing wake-up information is sent to the collector › Machine ads in “Offline State”  Stored persistently to disk  Ad updated with “demand” information: if this machine were around, would it be matched?
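A minimal startd sketch, assuming an illustrative idle-time policy (the threshold and the ClassAd expression are examples, not the talk's recommendation):

  # Check every 5 minutes whether the machine may power down
  HIBERNATE_CHECK_INTERVAL = 300
  # Each slot evaluates HIBERNATE; if every slot returns a non-NONE state,
  # the machine enters the lowest common power state (here S3 / suspend-to-RAM)
  HIBERNATE = ifThenElse( (State == "Unclaimed") && (KeyboardIdle > 3600), "S3", "NONE" )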

14 Now what?

15 condor_rooster › Periodically wake up based on ClassAd expression (Rooster_UnHibernate) › Throttling controls › Hook callouts make for interesting possibilities…
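A sketch of enabling condor_rooster on one machine, with a default-style wake expression and an illustrative throttle; knob names should be verified against the manual:

  # Run condor_rooster under the master
  DAEMON_LIST = $(DAEMON_LIST), ROOSTER
  # Wake offline machines whose ad says they are wanted
  ROOSTER_UNHIBERNATE = Offline && Unhibernate
  # Throttle: wake at most 4 machines per cycle
  ROOSTER_MAX_UNHIBERNATE = 4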

16 Job Router › Automated way to let jobs run on a wider array of resources  Transform jobs into different forms  Reroute jobs to different destinations

17 What is “job routing”?
The JobRouter consults its routing table (Site 1 …, Site 2 …) and transforms the original (vanilla) job:
  Universe = "vanilla"
  Executable = "sim"
  Arguments = "seed=345"
  Output = "stdout.345"
  Error = "stderr.345"
  ShouldTransferFiles = True
  WhenToTransferOutput = "ON_EXIT"
into a routed (grid) job:
  Universe = "grid"
  GridType = "gt2"
  GridResource = "cmsgrid01.hep.wisc.edu/jobmanager-condor"
  Executable = "sim"
  Arguments = "seed=345"
  Output = "stdout"
  Error = "stderr"
  ShouldTransferFiles = True
  WhenToTransferOutput = "ON_EXIT"
The final status of the routed job is reported back to the original job.

18 Routing is just site-level matchmaking › With feedback from the job queue  number of jobs currently routed to site X  number of idle jobs routed to site X  rate of recent success/failure at site X › And with power to modify the job ad  change attribute values (e.g. Universe)  insert new attributes (e.g. GridResource)  add a “portal” grid proxy if desired
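A minimal routing-table sketch for the schedd's configuration; the first grid resource mirrors the example above, the second site is hypothetical, and real tables usually add constraints:

  # Run the JobRouter alongside the schedd
  DAEMON_LIST = $(DAEMON_LIST), JOB_ROUTER
  JOB_ROUTER_ENTRIES = \
    [ name = "Site 1"; \
      GridResource = "gt2 cmsgrid01.hep.wisc.edu/jobmanager-condor"; \
      MaxIdleJobs = 10; MaxJobs = 200; ] \
    [ name = "Site 2"; \
      GridResource = "gt2 cmsgrid02.hep.wisc.edu/jobmanager-condor"; ]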

19 Condor-G Matchmaking › Use Condor’s match-making to select sites to send grid universe jobs to › You must create the “machine” ads yourself using condor_advertise › No claiming protocol › Each machine ad can match to multiple jobs
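An illustrative sketch of the pattern (the ad file, attribute names, and submit lines are assumptions showing the shape of grid-universe matchmaking, not text from the talk):

  # site.ad - a hand-written "machine" ad describing a grid site
  MyType = "Machine"
  TargetType = "Job"
  Name = "cmsgrid01.hep.wisc.edu"
  resource_name = "gt2 cmsgrid01.hep.wisc.edu/jobmanager-condor"
  Requirements = True

  # push it into the collector (repeat periodically, since ads expire)
  #   condor_advertise UPDATE_STARTD_AD site.ad

  # in the grid-universe submit file, let matchmaking pick the site:
  #   universe      = grid
  #   grid_resource = $$(resource_name)
  #   requirements  = TARGET.resource_name =!= UNDEFINED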

20 Dynamic Slot Partitioning › Divide slots into chunks sized for matched jobs › Readvertise remaining resources › Partitionable resources are cpus, memory, and disk › See Matt Farrellee’s talk
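A sketch of both sides, with illustrative values:

  # startd: one partitionable slot that owns the whole machine
  SLOT_TYPE_1 = cpus=100%, memory=100%, disk=100%
  SLOT_TYPE_1_PARTITIONABLE = True
  NUM_SLOTS_TYPE_1 = 1

  # submit file: a dynamic slot is carved out to match these requests
  #   request_cpus   = 1
  #   request_memory = 1024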

21 Dynamic Partitioning Caveats › Cannot preempt original slot or group of sub-slots  Potential starvation of jobs with large resource requirements › Partitioning happens once per slot each negotiation cycle  Scheduling of large slots may be slow

22 High Throughput Parallel Computing › Parallel jobs that run on a single machine  Today 8-16 cores, tomorrow 32+ cores › Use whatever parallel software you want  It ships with the job  MPI, OpenMP, your own scripts  Optimize for on-board memory access

23 Configuring Condor for HTPC › Two strategies:  Suspend/drain jobs to open HTPC slots  Hold empty cores until an HTPC slot is open › We have a recipe for the former on the Condor Wiki › User accounting is enabled by Condor’s notion of “Slot Weights”
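A sketch of the second strategy combined with slot-weight accounting, assuming an 8-core machine; see the wiki recipe for the full suspend/drain variant:

  # advertise one 8-core slot for HTPC jobs
  SLOT_TYPE_1 = cpus=8
  NUM_SLOTS_TYPE_1 = 1
  # charge users for all the cores in the slot, not just one
  SLOT_WEIGHT = Cpus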

24 CPU Affinity
[Diagram: a four-core machine (core1–core4) running four jobs without affinity; jobs j1, j2, j3, j4 and job j3's processes j3a, j3b, j3c, j3d are spread across all the cores]

25 CPU Affinity to the rescue
SLOT1_CPU_AFFINITY = 0
SLOT2_CPU_AFFINITY = 1
SLOT3_CPU_AFFINITY = 2
SLOT4_CPU_AFFINITY = 3

26 [Diagram: the same four-core machine running four jobs with affinity; jobs j1–j4 are each bound to their own core, so job j3's processes j3a, j3b, j3c, j3d no longer spill across the other jobs' cores]

27 Condor + Hadoop FS (HDFS)  Condor + HDFS: a synergy exists (next slide)  Hadoop as a distributed storage system  Condor as a cluster management system  Large number of distributed disks in a compute cluster  Managing disk as a resource

28 condor_hdfs daemon › Main integration point of HDFS within Condor › Configures the HDFS cluster based on existing condor_config files › Runs under condor_master and can be controlled by existing Condor utilities › Publishes interesting parameters to the Collector, e.g. IP address, node type, disk activity › Currently deployed at UW-Madison
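A configuration sketch; the knob names come from the condor_hdfs contrib module of that era and should be treated as assumptions, and the namenode host is hypothetical:

  # run the HDFS daemon under the condor_master
  DAEMON_LIST = $(DAEMON_LIST), HDFS
  # role of this node and location of the namenode (knob names assumed)
  HDFS_NODETYPE = HDFS_DATANODE
  HDFS_NAMENODE = hdfs://namenode.example.edu:9000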

29 Condor + HDFS : Next Steps? › Integrate with the File Transfer Mechanism › NameNode failover › Management of HDFS › What about HDFS in a GlideIn environment? › Online transparent access to HDFS?

30 Remote I/O Socket › A job can request that the condor_starter process on the execute machine create a Remote I/O Socket › Used for online access of files on the submit machine – without the Standard Universe  Use in Vanilla, Java, … › Libraries provided for Java and for C, e.g.:  Java: FileInputStream -> ChirpInputStream  C: open() -> chirp_open() › Or use Parrot!
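A minimal submit-file sketch: the WantIOProxy attribute asks the condor_starter to create the remote I/O socket (attribute name per the Condor manual; the executable is hypothetical):

  universe     = vanilla
  executable   = my_analysis
  # ask the starter to create the remote I/O (chirp) socket for this job
  +WantIOProxy = True
  queue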

31 [Diagram: at the Execution Site, the job (and its forked children) links against an I/O library whose local I/O calls (Chirp) go to an I/O proxy in the condor_starter; these are carried over a Secure Remote I/O channel to the I/O server in the condor_shadow at the Submission Site, which performs local system calls on the home file system]

32

33 DMTCP › Written at Northeastern U. and MIT › User-level process checkpoint/restart library › Fewer restrictions than Condor’s Standard Universe  Handles threads and multiple processes  No re-link of executable › DMTCP and Condor Vanilla Universe integration exists via a job wrapper script

34 Questions? Thank You!