Portable Batch System – Definition and 3 Primary Roles


1 Portable Batch System: Definition and 3 Primary Roles

Definition: PBS is a distributed workload management system. It manages and monitors the computational workload on a set of computers.

Queuing: Users submit tasks, or "jobs", to the resource management system, where they are queued until the system is ready to run them.

Scheduling: Selecting which jobs to run, when, and where, according to a predetermined policy. The aim is to balance competing needs and goals on the system(s) to maximize efficient use of resources.

Monitoring: Tracking and reserving system resources and enforcing usage policy. This includes both software enforcement of usage limits and user or administrator monitoring of scheduling policies.

2 Submitting jobs to PBS: qsub command

The qsub command submits a batch job to PBS. It is executed on aluf (the login node). Submitting a PBS job specifies a task, requests resources, and sets job attributes, all of which can be defined in an executable script file.

Recommended syntax of the qsub command:
  > qsub [options] scriptfile

PBS script files (PBS shell scripts, see the next slide) should be created in the user's directory.

For detailed information about qsub options, use:
  > man qsub

Job identifier (JOB_ID): upon successful submission of a batch job, PBS returns a job identifier in the format sequence_number.server_name, for example:
  12345.aluf01
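Because the JOB_ID has the form sequence_number.server_name, a script can split it with plain shell parameter expansion. The snippet below is a minimal sketch that uses the example identifier from this slide as a literal value; on a real system the identifier would normally be captured from qsub's output, as the comment indicates. The variable names are illustrative only.

```shell
#!/bin/sh
# Example identifier from the slide. On the cluster it could be captured with:
#   job_id=$(qsub scriptfile)
job_id="12345.aluf01"

seq_num=${job_id%%.*}   # everything before the first dot
server=${job_id#*.}     # everything after the first dot

echo "sequence number: $seq_num"
echo "server name:     $server"
```

Keeping the two parts separately is convenient later, for example when looking for the job's output files.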

3 ALUF Queues Description

all_q      default routing queue; routes jobs to the appropriate destination queue according to the wall time and number of CPUs (ncpus) requested in the PBS script
multicore  parallel jobs, up to 4 CPUs, time limit 24 hours
short      serial jobs (1 CPU), time limit 3 hours
main       serial jobs (1 CPU), time limit 24 hours
long       serial jobs (1 CPU), time limit 72 hours

For detailed, up-to-date information on queue limits, type:
  > qstat -fQ queue_name
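The queue limits above can be read as a simple decision rule. The function below is only a sketch of that rule for whole-hour wall times: pick_queue is a hypothetical helper, not part of PBS, and the real routing is performed by the PBS server through all_q and may apply further criteria. It assumes that any request for more than one CPU counts as a parallel job.

```shell
#!/bin/sh
# pick_queue NCPUS HOURS
# Prints the destination queue suggested by the limits listed on the slide,
# or "none" if the request exceeds every queue's limits.
pick_queue() {
    ncpus=$1
    hours=$2
    if [ "$ncpus" -gt 4 ] || [ "$hours" -gt 72 ]; then
        echo "none"
    elif [ "$ncpus" -gt 1 ]; then
        # parallel job: only the multicore queue applies (24-hour limit)
        if [ "$hours" -le 24 ]; then echo "multicore"; else echo "none"; fi
    elif [ "$hours" -le 3 ]; then
        echo "short"
    elif [ "$hours" -le 24 ]; then
        echo "main"
    else
        echo "long"
    fi
}

pick_queue 4 24   # parallel job within the multicore limits
pick_queue 1 2    # short serial job
pick_queue 1 48   # serial job longer than 24 hours
```

The current limits should always be confirmed with qstat -fQ queue_name, since administrators can change them.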

4 The PBS shell script sections

Shell specification: #!/bin/sh

PBS directives: used to request resources or set attributes. A directive begins with the default string "#PBS".

Tasks (programs or commands):
  - environment definitions
  - I/O specifications
  - executable specifications

NB! Other lines starting with # are comments.

5 PBS script example for multicore user code

  #!/bin/sh
  #PBS -N job_name
  #PBS -q queue_name
  #PBS -M user@technion.ac.il
  #PBS -l select=1:ncpus=4:mem=8gb
  #PBS -l walltime=24:00:00
  PBS_O_WORKDIR=$HOME/mydir
  cd $PBS_O_WORKDIR
  ./program.exe output.file

For other examples, see http://tx.technion.ac.il/doc/aluf/PBS-scripts/
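For comparison with the multicore example, a serial job follows the same three-part layout (shell line, directives, tasks). The sketch below writes such a script to a file with a here-document so the result can be inspected before submission. The file name serial_job.sh, the job name serial_test, and program.exe are placeholders, and the resource values are assumptions chosen to fit the short queue described on slide 3.

```shell
#!/bin/sh
# Write a minimal serial PBS script; the quoted 'EOF' keeps $PBS_O_WORKDIR
# unexpanded so that it is evaluated when the job runs, not now.
script=serial_job.sh

cat > "$script" <<'EOF'
#!/bin/sh
#PBS -N serial_test
#PBS -q short
#PBS -l select=1:ncpus=1
#PBS -l walltime=03:00:00
cd "$PBS_O_WORKDIR"
./program.exe
EOF

echo "wrote $script:"
cat "$script"
# On the login node the script would then be submitted with:
#   qsub "$script"
```

Here PBS_O_WORKDIR is left at the value PBS sets itself, which is the directory qsub was run from.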

6 Checking job/queue status: qstat command

The qstat command requests the status of batch jobs and queues.

Detailed information:
  > man qstat

qstat output structure (see on Tamnun)

Useful commands:
  > qstat -a                all users in all queues (default)
  > qstat -1n               all jobs in the system with node names
  > qstat -1nu username     all of the user's jobs with node names
  > qstat -f JOB_ID         extended output for the job
  > qstat -Q                list of all queues in the system
  > qstat -Qf queue_name    extended queue details
  > qstat -1Gn queue_name   all jobs in the queue with node names

7 Removing a job from a queue: qdel command

The qdel command deletes queued or running jobs. The job's running processes are killed. A PBS job may be deleted by its owner or by the administrator.

Detailed information:
  > man qdel

Useful commands:
  > qdel JOB_ID             deletes the job from a queue
  > qdel -W force JOB_ID    force-deletes the job

8 Checking a job's results and troubleshooting

Save the JOB_ID for further inspection.

Check the error and output files: job_name.eJOB_ID and job_name.oJOB_ID

Inspect the job's details (looking back N days):
  > ssh aluf01
  > tracejob [-n N] JOB_ID

Running an interactive batch job:
  > qsub -I pbs_script
The job is sent to an execution node, the PBS directives are executed, shell control is passed to the user, and the job awaits the user's commands.

Checking a job on an execution node:
  > ssh node_name           (aluf01, aluf02, or aluf03)
  > hostname
  > top                     inside top: u + user name shows that user's processes; 1 shows per-CPU usage
  > kill -9 PID             removes the job's process from the node
  > ls -rtl /gtmp           checks files under the user's ownership
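The check of the error and output files can be scripted. The sketch below builds both file names from a job name and a JOB_ID and reports whether each file exists and has content. It assumes the suffix is the numeric sequence part of the JOB_ID, which is the usual PBS default; the job name and identifier are the placeholder values used elsewhere in these slides.

```shell
#!/bin/sh
job_name="job_name"       # value given to #PBS -N
job_id="12345.aluf01"     # identifier returned by qsub

seq_num=${job_id%%.*}     # numeric part before the first dot
out_file="$job_name.o$seq_num"
err_file="$job_name.e$seq_num"

# Report the state of each file in the current directory.
for f in "$out_file" "$err_file"; do
    if [ -s "$f" ]; then
        echo "$f: present, non-empty"
    elif [ -e "$f" ]; then
        echo "$f: present, empty"
    else
        echo "$f: not found"
    fi
done
```

A non-empty error file is usually the first thing to read when a job ends unexpectedly.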

