PBS Job Management and Taskfarming Joachim Wagner 2008-07-24.

1 PBS Job Management and Taskfarming Joachim Wagner 2008-07-24

2 Outline Why do we need a cluster? Architecture Machines Software Job Management Running jobs Commands PBS Job descriptions Taskfarming Plans

3 Why do we need a cluster? Resource conflicts Waiting for colleague’s job to finish Trouble, e.g. disk full Medium-size jobs Too big for desktop PC Too small for ICHEC Preparation of ICHEC runs Learning

4 Cluster Architecture School Network maia.computing.dcu.ie Separate Network Logins Home Software Job Queue Nodes

5 Installed Software OpenMPI SRILM MaTrEx, Moses, GIZA++ XLE, Sicstus Johnson & Charniak’s reranking parser In progress: LFG AA, incl. function labeller

6 PBS Job Management Job Queue Job Submission User: Job Description Job Execution Job Scheduler. Nodes are allocated exclusively to one job for the duration of the job.

7 PBS Job Management Commands: qsub myjob.pbs submits a job (PBS description: shell script with #PBS commands, ignored by the shell, see next slide); qstat, qstat -f jobnumber; qdel jobnumber; pbsnodes -a lists all nodes with status and properties

8 PBS Job Description Number of nodes CPUs/node Notification: end, begin and abort Maximum runtime
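The slide's annotated example is not preserved in this transcript. The four annotations correspond to common PBS/Torque directives, so the following Python sketch shows how such a job description might be generated. The function name, the e-mail address and the command are hypothetical placeholders, and the directive set is an assumption about a typical Torque setup, not a copy of the original slide.

```python
def pbs_job_description(nodes, cpus_per_node, walltime, email, command):
    """Return the text of a PBS job script (a shell script with #PBS lines)."""
    lines = [
        "#!/bin/bash",
        f"#PBS -l nodes={nodes}:ppn={cpus_per_node}",  # number of nodes, CPUs/node
        f"#PBS -l walltime={walltime}",                # maximum runtime (HH:MM:SS)
        "#PBS -m abe",                                 # mail on abort, begin and end
        f"#PBS -M {email}",                            # notification address
        "cd $PBS_O_WORKDIR",                           # directory qsub was called from
        command,                                       # hypothetical payload
    ]
    return "\n".join(lines) + "\n"


# Example values are made up for illustration.
print(pbs_job_description(2, 4, "24:00:00", "user@example.org", "./my-program"))
```

The printed text could be saved as myjob.pbs and submitted with qsub myjob.pbs.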

9 Example: Memory-Intensive Job

10 Node Properties min4GB, min8GB: at least this much memory; mem4GB, mem8GB: exactly this much memory; long: please use this property for long jobs (this leaves nodes that do not have this property available for other jobs; no limit is enforced, but 24 h seems reasonable). Future: 16 and 32 GB, CPU, local disk space (/tmp and swap)
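A minimal Python sketch of how these properties might be requested, assuming the Torque-style syntax in which node properties are appended to the node specification with colons. The helper name is hypothetical, and the exact syntax accepted on this cluster is not shown in the transcript.

```python
def node_request(nodes, cpus_per_node, properties=()):
    """Build a '#PBS -l nodes=...' line that also requests node properties."""
    parts = [str(nodes), f"ppn={cpus_per_node}", *properties]
    return "#PBS -l nodes=" + ":".join(parts)


# A long, memory-intensive job on one node with at least 8 GB:
print(node_request(1, 4, ["min8GB", "long"]))
```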

11 CPU-Intensive Jobs Parallelisable, for example Sentence by sentence processing Cross-validation runs Parameter search Split into parts Run each part on a different CPU
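For sentence-by-sentence processing, "split into parts" can be sketched as below. The function name and the package size are illustrative assumptions; the transcript does not say how the packages were actually created.

```python
def split_into_packages(sentences, package_size):
    """Split a list of sentences into consecutive packages of at most package_size."""
    if package_size < 1:
        raise ValueError("package_size must be at least 1")
    return [
        sentences[start:start + package_size]
        for start in range(0, len(sentences), package_size)
    ]


# Ten sentences in packages of four: the last package is smaller.
packages = split_into_packages([f"sentence {i}" for i in range(10)], 4)
print([len(p) for p in packages])
```

Each package can then be processed as one task on a different CPU.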

12 Taskfarming PBS Job Description Taskfarming Executable (n instances): 1 Master, n-1 Workers. Task file (.tfm): one task per line, read by the master. MPI communication between master and workers. Task execution as child process.

13 Taskfarming Executable If instance ID == 0: run master loop: read .tfm file (arg 1); send lines to workers; exit if there are no more tasks and all workers have finished. Else: run worker loop: ask master for a task; execute task; exit if master has no more tasks.
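The master and worker loops above can be sketched in Python as follows. This is not taskfarm.py: to keep the example self-contained, threads and queues from the standard library stand in for the MPI instances and messages, and all function names are hypothetical. Only the protocol (a worker asks, the master answers with a task line or with "no more tasks") follows the slide.

```python
import queue
import subprocess
import threading


def run_shell_task(task):
    """Execute one task line as a child process and return its exit status."""
    return subprocess.run(task, shell=True).returncode


def master(task_lines, requests, replies):
    """Answer each request with the next task, or None once tasks run out."""
    tasks = iter(task_lines)
    active = len(replies)
    while active > 0:
        worker_id = requests.get()       # a worker asks for a task
        task = next(tasks, None)
        replies[worker_id].put(task)     # None means: no more tasks
        if task is None:
            active -= 1                  # that worker will exit now


def worker(worker_id, requests, replies, execute, results):
    """Ask the master for tasks and execute them until told to stop."""
    while True:
        requests.put(worker_id)
        task = replies[worker_id].get()
        if task is None:
            break
        results.append((worker_id, task, execute(task)))


def run_taskfarm(task_lines, n_instances, execute=run_shell_task):
    """Run 1 master and n_instances - 1 workers; return (worker, task, result) tuples."""
    if n_instances < 2:
        raise ValueError("need one master and at least one worker")
    requests = queue.Queue()
    replies = {i: queue.Queue() for i in range(1, n_instances)}  # ID 0 is the master
    results = []
    threads = [
        threading.Thread(target=worker, args=(i, requests, replies, execute, results))
        for i in replies
    ]
    for thread in threads:
        thread.start()
    master(task_lines, requests, replies)
    for thread in threads:
        thread.join()
    return results
```

In a real run the task lines would come from the .tfm file, for example run_taskfarm(open(path).read().splitlines(), 4). As in the slides, one instance only distributes tasks, so n instances give n-1 CPUs of actual work.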

14 Example: Taskfarming PBS File

15 Example: Taskfarming TFM File
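The slide's example file is not preserved in this transcript. As a purely illustrative sketch, a .tfm file with one task per line could be generated as below, assuming that each task calls the helper script run-package.sh with a three-digit package number. The argument format and the function name are assumptions; only the one-task-per-line layout comes from the slides.

```python
def tfm_lines(n_packages, helper="./run-package.sh"):
    """Return one task (a shell command line) per package."""
    # Assumption: the helper script takes the package number as its only argument.
    return [f"{helper} {number:03d}" for number in range(n_packages)]


def write_tfm(path, n_packages):
    """Write the task file: one task per line."""
    with open(path, "w") as handle:
        handle.write("\n".join(tfm_lines(n_packages)) + "\n")


print("\n".join(tfm_lines(3)))
```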

16 Example: Taskfarming Helper Script run-package.sh

17 Example: Taskfarming in Action [Figure: timeline of tasks 000 to 012 distributed over CPUs 1 to 4; the master reads the .tfm file and distributes tasks; the label "idle" marks unused CPU time.]

18 Example: Non-Terminating Task [Figure: timeline of tasks 000 to 012 distributed over CPUs 1 to 4; the master reads the .tfm file and distributes tasks; task 006 does not terminate and is killed at the walltime limit; the label "idle" marks unused CPU time.]

19 Estimating the PBS Walltime Parameter Collect durations from test run Usually high variance of execution time Long sentences Parameters Don’t use #packages x avg. time per package High risk (~50 %) that more time is needed Instead: Random sampling with observed package durations: /home/jwagner/tools/walltime.py
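The contents of walltime.py are not shown in the transcript, so the following is only a sketch of what "random sampling with observed package durations" could look like. It assumes that tasks are handed to whichever worker becomes free first, and that a high quantile of the simulated total runtimes is taken as the walltime. The function names, the 95 % quantile and the number of trials are assumptions.

```python
import heapq
import random


def simulate_makespan(durations, n_workers):
    """Total runtime if each task starts on the worker that becomes free first."""
    free_at = [0.0] * n_workers
    heapq.heapify(free_at)
    for duration in durations:
        start = heapq.heappop(free_at)          # earliest available worker
        heapq.heappush(free_at, start + duration)
    return max(free_at)


def estimate_walltime(observed, n_packages, n_workers,
                      quantile=0.95, trials=1000, seed=0):
    """Resample observed durations and return a high quantile of the makespan."""
    rng = random.Random(seed)
    makespans = sorted(
        simulate_makespan(rng.choices(observed, k=n_packages), n_workers)
        for _ in range(trials)
    )
    index = min(trials - 1, int(quantile * trials))
    return makespans[index]
```

With highly variable durations the estimate lies well above #packages x avg. time per package divided by the number of workers, which is the point of the slide: the average-based figure is exceeded in roughly half of all runs.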

20 Effect of Task Size Job will wait for last task to finish (or be killed when walltime limit is reached). What if a task crashes? Results are incomplete; the next task is executed. What if a task does not terminate? Results are incomplete; fewer CPUs are available for remaining tasks. Overhead of starting tasks.

21 Considering Multiple CPUs per Node 4 CPU cores per node; 8 GB node -> 2 GB per core. CPUs compete for RAM: swapping of one task affects the 3 other tasks. Relatively slow CPUs in nodes compared to new desktop PCs. Optimise throughput of cluster / EUR, not throughput of node or CPU. Depends on application.
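The memory arithmetic on this slide can be turned into a small check of how many tasks to place on one node without swapping. This is a simplified sketch: the function name is hypothetical, and it ignores memory used by the operating system and by the master instance.

```python
def tasks_per_node(node_ram_gb, cores, ram_per_task_gb):
    """Number of tasks that fit on one node, limited by cores and by RAM."""
    by_memory = int(node_ram_gb // ram_per_task_gb)
    return max(0, min(cores, by_memory))


# 8 GB node with 4 cores: 2 GB tasks use every core, 3 GB tasks only two.
print(tasks_per_node(8, 4, 2), tasks_per_node(8, 4, 3))
```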

22 Plans Fix sporadic errors of taskfarm.py re-implementation in C XML-RPC-based taskfarming http-based Run master on maia Run workers also outside the cluster Set parameters at runtime Add more nodes (CNGL) Install additional software

23 Questions ? Contact: Joachim Wagner CNGL System Administrator jwagner@computing.dcu.ie (01) 700 6915

