Greg Thain Computer Sciences Department University of Wisconsin-Madison Condor Parallel Universe.


1 Greg Thain Computer Sciences Department University of Wisconsin-Madison Condor Parallel Universe

2 Overview › Task vs. Job Parallelism › New Condor support for Task-Parallelism › Other goodies

3 The Talk in One Slide › Parallel Universe can run any* task-parallel job › Not just MPICH › Not just MPI…

4 Job vs. Task Parallelism › Condor has historically focused on job parallelism › Job parallelism is done either manually or via DAGMan › Rest of the talk is on task parallelism › Can also get task parallelism via PVM or MW

5 Parallel Universe › Adaptation of MPI universe › Modifications based on experience with MPI › User feedback › But, more than just MPI

6 MPI lifecycle without Condor › LAM version
    1. lamboot
        lamboot -ssi boot ssh machine_file
    2. mpirun
        mpirun -np 8 exe arg1 arg2
    3. lamhalt
        lamhalt
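The boot, run, halt sequence on this slide can be sketched as a single shell function. This is a minimal sketch, not something shipped with Condor or LAM: the function name `lam_lifecycle` and the `RUN` hook are hypothetical, and the only commands used are the three shown on the slide. Setting `RUN=echo` prints the commands instead of executing them, which is handy on a machine where LAM is not installed.

```shell
#!/bin/sh
# Sketch of the lamboot / mpirun / lamhalt sequence from the slide.
# RUN is a hypothetical hook: set RUN=echo to print each command
# instead of executing it.
RUN=${RUN:-}

lam_lifecycle() {
    # $1 = machine file, $2 = process count, remaining args = program and its arguments
    machine_file=$1
    np=$2
    shift 2

    # Step 1: start the LAM daemons on every machine in the file.
    $RUN lamboot -ssi boot ssh "$machine_file" || return 1

    # Step 2: run the MPI program and remember its exit status.
    $RUN mpirun -np "$np" "$@"
    status=$?

    # Step 3: always halt the LAM daemons, even if mpirun failed.
    $RUN lamhalt
    return $status
}
```

For example, `RUN=echo; lam_lifecycle machine_file 8 exe arg1` prints the three commands in order without running them.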

7 Scheduling › Need “Dedicated Scheduler” › "Dedicated" has a specific Condor meaning › Nodes running MPI require a dedicated scheduler › A given machine can have many opportunistic schedulers ... but only 1 dedicated scheduler

8 DedicatedScheduler surprises › DedicatedScheduler co-opts the normal negotiation cycle › Preemption and scheduling work differently than for opportunistic jobs › DedicatedScheduler schedules First-Fit, sorted by UserJobPrio › condor_q -analyze mystery!

9 Job startup › Same file transfer, etc. as Vanilla › One shadow, many starters › Starter runs sshd on all machines, does key exchange › Starter runs the exe on first machine (head node, Rank0)

10 Your Script Here › Script on the head node has contact file › We provide samples for LAM, MPICH › We try to mimic “by hand” startup › Use condor_ssh to start remote jobs › When the script exits, Condor cleans up

11 Parallel Example › [Diagram: Submit Machine (Schedd, Shadow) and Execute Machines (Startd, Starter, Sshd, Script, Job)]

12 Example submit file
    Universe = Parallel
    # executable is a script
    executable = script
    # the real binary
    transfer_input_files = executable
    arguments = arg1 arg2 arg3
    machine_count = 8
    output = out.$(Cluster).$(NODE)
    queue

13 Example Script
    chmod 755 simple
    lamboot -ssi boot rsh $MACHINE_FILE
    mpirun -np $NO_MACHINES simple
    lamhalt

14 Example submit file 2
    Universe = Parallel
    Requirements = (Hostname == "somemachine")
    queue
    Requirements = (Hostname != "somemachine")
    queue 7

15 Example Script 2
    mach1=`sed -n 1p $MACHINE_FILE`
    mach2=`sed -n 2p $MACHINE_FILE`
    ./server &
    ssh $mach1 client_app
    ssh $mach2 client_app
    wait
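The per-line lookups in Example Script 2 can be wrapped in small helpers. This is a minimal sketch, assuming the machine file is plain text with one hostname per line, as the `sed -n Np` calls on the slide imply. The function names `nth_machine` and `machine_count` are hypothetical and not part of Condor.

```shell
#!/bin/sh
# Hypothetical helpers (not part of Condor) for a machine file
# that lists one hostname per line.

# Print the Nth hostname.
# $1 = 1-based line number, $2 = path to the machine file
nth_machine() {
    sed -n "${1}p" "$2"
}

# Print the number of machines listed, ignoring blank lines.
# $1 = path to the machine file
machine_count() {
    grep -c . "$1"
}
```

In a script like the one above, `mach1=$(nth_machine 1 "$MACHINE_FILE")` would replace the inline `sed` call.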

16 Summary › With the Parallel Universe in Condor 6.8 comes: › Support for most MPI implementations (some scripting required) › Somewhat better MPI scheduling › Better node placement via Condor matchmaking

17 Questions? › Thank you!

