Setting up the Condor Scheduler on a Computing Cluster
Raman Sehgal, NPD-BARC
Outline
- Software required for the cluster
- Network topology
- Condor and the different roles of machines in a Condor pool
- Various Condor universes
- Prerequisites for Condor
- Configuration of Condor on our LAN
- Running jobs using Condor and some commonly used Condor commands
- MPI
- Conclusion
Software required for the cluster
The cluster requires the following software:
- Operating system: Scientific Linux CERN 5.4, 64-bit version
- Cluster management: done through IPMI (built in)
- Cluster usage and statistics: Ganglia
- Cluster middleware: Condor
- Parallel programming environment: MPI
Network Topology
- One head node holding all users' home directories
- 16 worker nodes that provide the computational power
- The head node is connected to both the public and the private network
- Public network: allows users to log in to the head node
- Private network: connects all worker nodes to the head node over Gigabit Ethernet and InfiniBand; used for job submission and execution
- File system: a Network File System (NFS) provides a shared area between the head node and the worker nodes
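The shared home area described above can be provided over NFS; a minimal sketch, assuming the head node exports /home to the private subnet 192.168.1.0/24 (the subnet and the hostname "headnode" are illustrative):

```
# /etc/exports on the head node (subnet is illustrative)
/home  192.168.1.0/24(rw,sync,no_root_squash)

# /etc/fstab entry on each worker node ("headnode" is a placeholder hostname)
headnode:/home  /home  nfs  defaults  0  0
```

With this in place, a job submitted from a user's home directory sees the same files on every worker node.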
Prototype distributed and parallel computing environment
A prototype distributed and parallel computing environment for the cluster has been set up on a LAN of 4 computers.
- Distributed computing environment: using Condor
- Parallel computing environment: using MPI

CONDOR
Condor is an open-source high-throughput computing software package for distributed parallelization of computationally intensive tasks.
- Used to manage the workload on a cluster of computing nodes.
- It can integrate both dedicated resources (rack-mounted clusters) and non-dedicated desktop machines by making use of cycle scavenging.
- Can run both sequential and parallel jobs.
- Provides different universes to run jobs (vanilla, standard, MPI, Java, etc.).
Condor's exceptional features:
- Checkpointing and migration
- Remote system calls
- No changes to user source code are necessary
- Sensitive to the desires of the machine owner (in the case of non-dedicated machines)

Different roles of a machine in a Condor pool
- Central manager: the main administration machine.
- Execute: the machines where jobs execute.
- Submit: the machines used to submit jobs.
Various Condor daemons
The following Condor daemons run on the different machines in the Condor pool:
- condor_master: takes care of the rest of the daemons running on a machine
- condor_collector: responsible for collecting information about the status of the pool
- condor_negotiator: responsible for all the match-making within the Condor system
- condor_schedd: represents resource requests to the Condor pool
- condor_startd: represents a given resource to the Condor pool
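Which of these daemons run on a machine determines its role in the pool; Condor's DAEMON_LIST configuration knob selects them. A sketch of the local configuration entries matching the roles above:

```
# condor_config.local on the central manager (here also a submit machine)
DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD

# condor_config.local on each execute (worker) machine
DAEMON_LIST = MASTER, STARTD
```

condor_master then starts and supervises the listed daemons on each machine.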
Various Condor universes to run different types of jobs
Condor provides several universes to run different types of jobs; some are as follows:

Standard: this universe provides Condor's full power. It offers the following features:
1. Checkpointing
2. Migration
3. Remote system calls
The job needs to be relinked with the Condor libraries in order to run in the standard universe. This is easily achieved by putting condor_compile in front of the usual link command.
Normal linking of a job:
  gcc -o my_prog my_prog.c
For the standard universe the job is prepared with condor_compile:
  condor_compile gcc -o my_prog my_prog.c
Now this job can utilize the power of the standard universe.
Vanilla: this universe is intended for programs that cannot be relinked with the Condor libraries.
1. Shell scripts are one example of jobs where the vanilla universe is useful.
2. Jobs that run under the vanilla universe cannot utilize checkpointing or remote system calls.
3. Since the remote system call feature is not available, a shared file system such as NFS or AFS is needed.

Parallel: this universe allows parallel programs, such as MPI jobs, to be run in the Condor environment.
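A vanilla-universe job is described by a submit file much like any other; a minimal sketch (the script name run_analysis.sh and the output file names are illustrative):

```
# Vanilla universe: no relinking needed, but the job's files
# must live on the shared NFS area visible to all worker nodes
universe   = vanilla
executable = run_analysis.sh
output     = analysis.out
error      = analysis.err
log        = analysis.log
queue
```

The file is then handed to condor_submit in the same way as a standard-universe job.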
Prerequisites of the Condor configuration
- Setup of a private network of the machines in the computing pool
- Passwordless login from the submit machines to all execute machines (rsh or ssh)

Configuration of Condor on our small LAN of 4 computers
On our LAN of four machines we have one head node and 3 worker nodes. Condor is installed and configured on our pool, and the role of each machine is as follows:
1. Head node: central manager, submit
2. Worker nodes: execute
The home directories of all users reside on the head node, in a shared area (using NFS) that can be accessed by all the worker nodes (required for the vanilla universe). Users can therefore submit jobs from their home directories.
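Passwordless ssh can be set up by generating a key pair on the submit machine and installing the public key on each execute machine; a sketch, with the key path and the hostname "node01" as placeholders:

```shell
# create a scratch directory and a passphrase-less RSA key pair in it
KEYDIR=$(mktemp -d)
ssh-keygen -q -t rsa -N '' -f "$KEYDIR/id_rsa"

# install the public key on each execute machine (hostname is illustrative):
#   ssh-copy-id -i "$KEYDIR/id_rsa.pub" user@node01
```

After the public key is installed on every worker, `ssh user@node01` logs in without prompting for a password.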
Running jobs using Condor
The following are the steps to run a Condor job:
1. Prepare the code.
2. Choose the Condor universe.
3. Make the submit description file (submit.ip); a sample file is shown below:

  # Sample submit description file
  Executable = getIp
  Universe   = standard
  Output     = getIp.out
  Error      = getIp.err
  Log        = hello.log
  Queue 15

4. Submit the job: the job can now be submitted with the following Condor command:
  condor_submit submit.ip
Commonly used Condor commands
- condor_submit: used to submit a job
- condor_q: displays information about jobs in the Condor job queue
- condor_status: used to monitor, query and display the status of the Condor pool
- condor_history: helps users view the log of Condor jobs completed to date
- condor_rm: removes one or more jobs from the Condor job queue
- condor_compile: used to relink a job with the Condor libraries, so that it can be executed in the standard universe
MPI
- MPI is a language-independent communication protocol used for parallel programming.
- Different languages provide their own wrapper compilers for MPI programming. Here we have installed MPICH, which allows parallel programming in C, C++, Fortran, etc.
- Computation vs. communication.
- SISD, SIMD, MISD, MIMD (Flynn's classification).
- MPI requires the executable to be present on all the machines in the pool; this is achieved via the NFS shared area.
- Testing was done with a matrix multiplication program; a considerable reduction in execution time was observed.
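Under Condor's parallel universe, an MPI job is described by a submit file that also states how many machines to claim; a sketch, assuming the executable name mpi_matmul and a count of 4 machines are illustrative (real MPICH setups typically launch the binary through a site-provided wrapper script):

```
# Parallel universe: Condor claims machine_count slots before starting the job
universe      = parallel
executable    = mpi_matmul
machine_count = 4
output        = matmul.out
error         = matmul.err
log           = matmul.log
queue
```

Because the executable must be visible on every claimed machine, placing it in the NFS-shared home area satisfies the requirement noted above.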
Conclusion:
Condor has been installed and configured on a small LAN of 4 computers; it is working properly and giving the expected results. Later on, this prototype setup will be replicated on a computing cluster with 16 worker nodes, which will provide a processing power of 1.3 TFlops plus a storage of 20 TBytes. The setup is also ready to run parallel jobs, so if parallel job applications arise in the future, we are ready for them.