Using MPI to run parallel jobs on the Grid


1 Using MPI to run parallel jobs on the Grid
Marcello Iacono Manno, INFN Catania – Italy
First Grid International School for Industrial Applications, 2007, June 30th – July 7th

2 Outline
Requirements & Settings
Overview
How to create an MPI job
How to submit an MPI job to the Grid
The OpenFOAM use case

3 Overview
Currently, parallel applications use "special" HW/SW
Parallel applications are "normal" on a Grid; many are trivially parallelizable
Grid middleware offers several parallel job types (DAG, collection)
A common solution for non-trivial parallelism is the Message Passing Interface (MPI):
based on send() and receive() primitives
a "master" node starts some "slave" processes by establishing SSH sessions
all processes can share a common workspace

4 MPI & Grid
There are several MPI implementations, but only two of them are currently supported by the Grid middleware: MPICH and MPICH2
Both "old" GigaBit Ethernet and "new" low-latency InfiniBand networks are supported
NEW! The Cometa infrastructure will run MPI jobs on either GigaBit (MPICH, MPICH2) or InfiniBand (MVAPICH, MVAPICH2)
Currently, MPI parallel jobs can run inside a single Computing Element (CE) only
several projects are studying the possibility of executing parallel jobs on Worker Nodes (WNs) belonging to different CEs

5 JDL (1/3)
From the user's point of view, MPI jobs are specified by:
setting the JDL JobType attribute to MPICH, MPICH2, MVAPICH or MVAPICH2
specifying the NodeNumber attribute as well
JobType = "MPICH";
NodeNumber = 2;
NodeNumber defines the required number of CPU cores (Processing Elements, PEs)
Matchmaking: the Resource Broker (RB) chooses a CE (if any!) with enough free Processing Elements (PE = CPU core), i.e. free PE# ≥ NodeNumber (otherwise the job waits!)

6 JDL (2/3)
When these two attributes are included in a JDL script, the following expression:
(other.GlueCEInfoTotalCPUs >= NodeNumber) && Member("MPICH", other.GlueHostApplicationSoftwareRunTimeEnvironment)
is automatically added to the JDL Requirements expression, in order to find the best resource where the job can be executed.

7 JDL (3/3) - NEW!
Executable specifies the MPI executable
NodeNumber specifies the number of cores
Arguments specifies the WN command line
Executable + Arguments form the command line on the WN
mpi.pre.sh is a special script file that is sourced before launching the MPI executable
warning: it runs only on the master node
the actual mpirun command is issued by the middleware (… what if the application needs a proprietary launch script/binary?)
mpi.post.sh is a special script file that is sourced after the MPI executable terminates
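To make the two hooks concrete, here is a minimal sketch of what they might contain, assuming they are shipped in the InputSandbox as in the mpi.jdl example further on; the echo lines match the "preprocessing script" / "postprocessing script" markers visible in the CPI test output, everything else is purely illustrative.

  # mpi.pre.sh - sourced on the master WN before the middleware issues mpirun
  echo "preprocessing script"
  hostname
  # illustrative: unpack input data shipped in the InputSandbox
  # tar xzf input_data.tar.gz

  # mpi.post.sh - sourced on the master WN after the MPI executable terminates
  echo "postprocessing script"
  # illustrative: pack results so they can be returned via the OutputSandbox
  # tar czf results.tar.gz results/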

8 Requirements (1/3)
In order to ensure that an MPI job can run, the following requirements MUST be satisfied:
the MPICH/MPICH2/MVAPICH/MVAPICH2 software must be installed and placed in the PATH environment variable on all the WNs of the CE
some MPI applications require a file system shared among the WNs: no shared area is available at the moment
the application may access the area of the master node (requires modifications to the application)
middleware solutions are also possible (as soon as required)
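A quick way to check the first requirement is a trivial (non-MPI) test job run on the target CE; the following is only a sketch, and the warning text is made up:

  #!/bin/sh
  # sanity check: is mpirun visible on this WN's PATH?
  echo "PATH=$PATH"
  which mpirun || echo "WARNING: mpirun not found on $(hostname)"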

9 Requirements (2/3)
The job wrapper copies all the files indicated in the InputSandbox onto ALL of the "slave" nodes
host-based SSH authentication MUST be correctly configured between all the WNs
If some environment variables are needed ONLY on the "master" node, they can be set by mpi.pre.sh
If some environment variables are needed ON ALL THE NODES, a static installation is currently required (a middleware extension is under consideration)
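For instance, a variable needed only by the master process can be exported from mpi.pre.sh; MY_APP_HOME below is a hypothetical name, not something defined by the middleware:

  # mpi.pre.sh fragment - these exports are seen by the master process only,
  # not by the slave processes started over SSH on the other WNs
  export MY_APP_HOME=$PWD/myapp
  export LD_LIBRARY_PATH=$MY_APP_HOME/lib:$LD_LIBRARY_PATH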

10 Requirements (3/3)
Two ways to install the software:
Statically (public installation)
Advantage: speeds up job execution (less work to do at run time)
Disadvantages: possible compatibility problems with the middleware and/or other applications; modifying and updating the application is more complex; requires Software Manager (SWM) role privileges
Usage: large and/or stable SW packages; only an MPI binary executable available (hard to modify)
Dynamically ("local" installation during job pre-processing)
Advantages: more robust and flexible jobs; easier to modify and update; no SWM privileges required
Disadvantage: slows down job execution (more work to do at run time)
Usage: small and/or frequently modified SW packages; MPI launching script (easily modifiable)
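In practice, a dynamic ("local") installation often reduces to a few lines in mpi.pre.sh; a sketch assuming a hypothetical package my_app.tar.gz shipped in the InputSandbox (remember that mpi.pre.sh runs on the master node only):

  # mpi.pre.sh fragment - "dynamic" installation at pre-processing time
  tar xzf my_app.tar.gz              # unpack into the job's working directory
  export PATH=$PWD/my_app/bin:$PATH  # make the unpacked binaries visible to the master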

11 mpi.jdl - NEW
[
Type = "Job";
JobType = "MPICH";
Executable = "MPIparallel_exec";
NodeNumber = 2;
Arguments = "arg1 arg2 arg3";
StdOutput = "test.out";
StdError = "test.err";
InputSandbox = {"mpi.pre.sh", "mpi.post.sh", "MPIparallel_exec"};
OutputSandbox = {"test.err", "test.out", "executable.out"};
Requirements = other.GlueCEInfoLRMSType == "PBS" || other.GlueCEInfoLRMSType == "LSF";
]
Slide callouts: the Executable, the pre- and post-processing scripts, and the Local Resource Manager (LRMS) requirement (PBS/LSF only)

12 GigaBit vs InfiniBand
InfiniBand is a switched-fabric communications link primarily used in high-performance computing. Its features include quality of service and failover, and it is designed to be scalable.
The InfiniBand architecture specification defines a connection between processor nodes and high-performance I/O nodes such as storage devices.
The advantage of using a low-latency network becomes more evident the greater the number of nodes.

13 CPI Test (1/4)
[marcello@infn-ui-01 mpi-0.13]$ edg-job-submit mpi.jdl
Selected Virtual Organisation name (from proxy certificate extension): cometa
Connecting to host infn-rb-01.ct.pi2s2.it, port 7772
Logging to host infn-rb-01.ct.trigrid.it, port 9002
*********************************************************************
JOB SUBMIT OUTCOME
The job has been successfully submitted to the Network Server.
Use edg-job-status command to check job current status.
Your job identifier (edg_jobId) is: -

14 CPI Test (2/4)
mpi-0.13]$ edg-job-status
*************************************************************
BOOKKEEPING INFORMATION:
Status info for the Job :
Current Status: Done (Success)
Exit code:
Status Reason: Job terminated successfully
Destination: infn-ce-01.ct.pi2s2.it:2119/jobmanager-lcglsf-short
reached on: Sun Jul 1 15:08:

15 CPI Test (3/4)
mpi-0.13]$ edg-job-get-output --dir /home/marcello/JobOutput/
Retrieving files from host: infn-rb-01.ct.pi2s2.it ( for )
*********************************************************************
JOB GET OUTPUT OUTCOME
Output sandbox files for the job: -
have been successfully retrieved and stored in the directory:
/home/marcello/JobOutput/marcello_vYGU1UUfRnSktGODcwEjMw

16 CPI Test (4/4)
mpi-0.13]$ cat /home/marcello/JobOutput/marcello_vYGU1UUfRnSktGODcwEjMw/test.out
preprocessing script
infn-wn-01.ct.pi2s2.it
Process 0 of 4 on infn-wn-01.ct.pi2s2.it
pi is approximately , Error is
wall clock time =
Process 1 of 4 on infn-wn-01.ct.pi2s2.it
Process 3 of 4 on infn-wn-01.ct.pi2s2.it
Process 2 of 4 on infn-wn-01.ct.pi2s2.it
TID   HOST_NAME   COMMAND_LINE      STATUS   TERMINATION_TIME
====  ==========  ================  =======  ==================
0001  infn-wn-01  /opt/lsf/6.1/lin  Done     /01/ :04:23
0002  infn-wn-01  /opt/lsf/6.1/lin  Done     /01/ :04:23
0003  infn-wn-01  /opt/lsf/6.1/lin  Done     /01/ :04:23
0004  infn-wn-01  /opt/lsf/6.1/lin  Done     /01/ :04:23
P4 procgroup file is /home/cometa005/.lsf_6826_genmpi_pifile.
postprocessing script
temporary
mpi-0.13]$

17 OpenFOAM (1/14) - a "forerunner" experience
OpenFOAM is an open-source SW package for fluid dynamics
it is a parallel, multi-stage application
several solvers, C++-based libraries
PORTING OPENFOAM to the Grid: a "forerunner" experience
static installation
the gcc 4.1.2 compiler
the 32/64-bit issue
LAM / MPICH
which mpirun
machines file
preparing parallel execution

18 OpenFOAM (2/14) - Installation
the compressed package is about 300 MB
dynamic installation is not feasible:
"stable" package
jobs can also be "short" (tests)
environment problem not solved (… but some further testing is required on this particular issue)
static installation:
by the software manager, directly on the WNs
or via an installation job (lcg-asis)
check for compatibility with the middleware and with other applications

19 OpenFOAM (3/14) - gcc 4.1.2, 32/64 bit, different middleware versions
gcc 4.1.2:
OpenFOAM-1.4 requires at least gcc 3.4.6 (4.1.2 suggested)
"old" WNs run 32-bit Scientific Linux (SLC) with gcc 3.2.6 … INCOMPATIBILITY
64-bit installation:
a full 64-bit OpenFOAM-1.4 is available
"new" WNs have a 64-bit architecture with SLC4 and gcc 4.x.x
32/64 bit:
no 64-bit user interface is available yet
pre- and post-processing must be run (as jobs) on the WN: autonomous jobs or mpi.pre.sh / mpi.post.sh?
alternatively, cross-pre-processing on a 64-bit machine, then transferring the OpenFOAM "user" directory

20 OpenFOAM (4/14) - LAM / MPICH, which mpirun, machines files
LAM / MPICH:
different protocols
OpenFOAM-1.4 is declared MPICH-compliant (by the official docs)
only libPstream needed to be recompiled (thanks to the forum!)
no "native" mpirun allowed (Grid-incompatible)
Which mpirun:
a new 64-bit (Grid-compliant) mpirun script
different libraries for MPICH/MPICH2, InfiniBand/GigaBit
device: p4 → ch_p4
Machines files:
specified by the middleware (transparent to the user)
list all the WNs of a CE
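Conceptually, the command that the wrapper script ends up issuing is the familiar MPICH launch line; the following is only an illustration (the variable names are invented, the real values come from the JDL attributes and from the batch system):

  # illustration only - the user never types this, the middleware's mpirun wrapper does
  mpirun -np $NODE_NUMBER -machinefile $MACHINES_FILE ./MPIparallel_exec arg1 arg2 arg3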

21 OpenFOAM (5/14) - Preparing parallel execution
the OpenFOAM user workspace is "local": a sub-directory called user-1.4 under $PWD
$PWD/user-1.4/run/tutorials/<root>/<case>/system
<root> → solver (icoFoam)
<case> → geometry (cavity)
system → running system information
(file) decomposeParDict → mesh decomposition: decomposePar icoFoam cavity
(file) proc0…n → processing element information
the current pre-processing needs to become a job or, alternatively, a 64-bit UI
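Put together, the pre-processing for the cavity tutorial boils down to a couple of commands; a sketch using the OpenFOAM-1.4 "<root> <case>" argument style shown on this slide (on the Grid, the final solver line is what the middleware launches through mpirun):

  cd user-1.4/run/tutorials
  decomposePar icoFoam cavity   # split the cavity mesh according to decomposeParDict,
                                # creating one processor directory per PE
  # the parallel solver itself is then launched by the middleware, equivalent to:
  #   icoFoam <path-to>/run/tutorials/icoFoam cavity -parallel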

22 OpenFOAM (6/14)
OpenFOAM.grid]$ ll /home/marcello/JobOutput/marcello_-fGr9naNAp6Phj3P_GacQw
total 88
-rw-rw-r marcello marcello Jul 2 16:02 openfoam.err
-rw-rw-r marcello marcello Jul 2 16:02 openfoam.out
OpenFOAM.grid]$ head -n 100 /home/marcello/JobOutput/marcello_-fGr9naNAp6Phj3P_GacQw/openfoam.out
Mon Jul 2 15:54:03 CEST 2007
openfoam pre-processing file
infn-wn-10.ct.pi2s2.it
WM_LINK_LANGUAGE=c++
WM_ARCH=linux
WM_JAVAC_OPTION=Opt
WM_PROJECT_VERSION=1.4
WM_COMPILER_LIB_ARCH=
WM_PROJECT_INST_DIR=/opt/exp_soft/cometa/OpenFOAM
WM_PROJECT_DIR=/opt/exp_soft/cometa/OpenFOAM/OpenFOAM-1.4
WM_COMPILER_ARCH=-64
Slide callouts: shared installation space; 64-bit architecture

23 OpenFOAM (7/14)
WM_PROJECT=OpenFOAM
WM_COMPILER=Gcc4
WM_MPLIB=MPICH
WM_COMPILER_DIR=/opt/exp_soft/cometa/OpenFOAM/linux/gcc
WM_COMPILE_OPTION=Opt
WM_SHELL=bash
WM_DIR=/opt/exp_soft/cometa/OpenFOAM/OpenFOAM-1.4/wmake
WM_PROJECT_USER_DIR=/home/cometa005/globus-tmp.infn-wn /.mpi/https_3a_2f_2finfn-rb-01.ct.pi2s2.it_3a9000_2f-fGr9naNAp6Phj3P_5fGacQw/user-1.4
WM_OPTIONS=linuxGcc4DPOpt
WM_PROJECT_LANGUAGE=c++
WM_PRECISION_OPTION=DP
Slide callouts: "native" gcc compiler; double precision

24 OpenFOAM (8/14)
content of /home/cometa005/globus-tmp.infn-wn /.mpi/https_3a_2f_2finfn-rb-01.ct.pi2s2.it_3a9000_2f-fGr9naNAp6Phj3P_5fGacQw
total 5536
lrwxrwxrwx 1 cometa005 cometa Jul 2 15:54 cometa -> /home/cometa005/globus-tmp.infn-wn /.mpi/https_3a_2f_2finfn-rb-01.ct.pi2s2.it_3a9000_2f-fGr9naNAp6Phj3P_5fGacQw/user-1.4
-rw-r--r-- 1 cometa005 cometa Jul 2 15:53 host14808
-rw-r--r-- 1 cometa005 cometa Jul 2 15:53 mpi.post.sh
-rw-r--r-- 1 cometa005 cometa Jul 2 15:53 mpi.pre.sh
-rw-r--r-- 1 cometa005 cometa Jul 2 15:54 openfoam.err
-rw-r--r-- 1 cometa005 cometa Jul 2 15:54 openfoam.out
drwxr-xr-x 5 cometa005 cometa Jun 18 13:15 user-1.4
-rw-r--r-- 1 cometa005 cometa Jul 2 15:53 user-1.4.tar
content of /opt/exp_soft/cometa/OpenFOAM
total 12
-rw-rw-rw- 1 cometa001 cometa 382 Jun 20 10:09 env_config
drwxrwxrwx 3 cometa001 cometa 4096 Jun 6 08:48 linux
drwxrwxrwx 10 cometa001 cometa 4096 Jun 8 18:23 OpenFOAM-1.4
Slide callouts: home dir on the "master" WN; pre- and post-processing scripts; user dir

25 OpenFOAM (9/14) - mesh decomposition
content of /home/cometa005/globus-tmp.infn-wn /.mpi/https_3a_2f_2finfn-rb01.ct.pi2s2.it_3a9000_2f-fGr9naNAp6Phj3P_5fGacQw/user-1.4/run/tutorials
total 16
-rwxr-xr-x 1 cometa005 cometa 1308 Jun 19 16:31 controlDict
-rwxr-xr-x 1 cometa005 cometa 1363 Jun 29 13:53 decomposeParDict
-rwxr-xr-x 1 cometa005 cometa 1479 Jun 19 16:31 fvSchemes
-rwxr-xr-x 1 cometa005 cometa 1305 Jun 19 16:31 fvSolution
[standard OpenFOAM dictionary header banner: "F ield | O peration | A nd | M anipulation - OpenFOAM: The Open Source CFD Toolbox"]
[…]

26 OpenFOAM (10/14)
[1] Date : Jul 02 2007
[1] Time : 15:54:22
[1] Host : infn-wn-10.ct.pi2s2.it
[1] PID : 29680
[4] Date : Jul
[4] Time : 15:54:22
[4] Host : infn-wn-06.ct.pi2s2.it
[4] PID : 25510
[3] Date : Jul
[3] Time : 15:54:22
[3] Host : infn-wn-10.ct.pi2s2.it
[…]
MPI Pstream initialized with:
floatTransfer : 1
nProcsSimpleSum : 0
scheduledTransfer : 0

27 OpenFOAM (11/14) - Process IDs
Exec : /opt/exp_soft/cometa/OpenFOAM/OpenFOAM-1.4/applications/bin/linuxGcc4DPOpt/icoFoam /home/cometa005/globus-tmp.infn-wn /.mpi/https_3a_2f_2finfn-rb-01.ct.pi2s2.it_3a9000_2f-fGr9naNAp6Phj3P_5fGacQw/cometa /run/tutorials/icoFoam cavity -parallel
[0] 7
[0] (
[0] infn-wn-10.ct.pi2s2.it.29680
[0] infn-wn-10.ct.pi2s2.it.30265
[0] infn-wn-10.ct.pi2s2.it.30851
[0] infn-wn-06.ct.pi2s2.it.25510
[0] infn-wn-06.ct.pi2s2.it.26093
[0] infn-wn-06.ct.pi2s2.it.26676
[0] infn-wn-06.ct.pi2s2.it.27259
[0] )
[0] Create time

28 OpenFOAM (12/14)
Create mesh for time = 0
[7] Date : Jul 02 2007
[7] Host : infn-wn-06.ct.pi2s2.it
[7] PID : 27259
[7] Root : /home/cometa005/globus-tmp.infn-wn /.mpi/https_3a_2f_2finfn-rb-01.ct.pi2s2.it_3a9000_2f-fGr9naNAp6Phj3P_5fGacQw/cometa /run/tutorials/icoFoam
[7] Case : cavity
[7] Nprocs : 8
Reading transportProperties
Reading field p
Reading field U
Reading/calculating face flux field phi
Starting time loop

29 OpenFOAM (13/14)
Time = 0.005
Courant Number mean: 0 max: 0
DILUPBiCG: Solving for Ux, Initial residual = 1, Final residual = e-06, No Iterations 11
DILUPBiCG: Solving for Uy, Initial residual = 0, Final residual = 0, No Iterations 0
DICPCG: Solving for p, Initial residual = 1, Final residual = e-07, No Iterations 47
time step continuity errors : sum local = e-09, global = e-10, cumulative = e-10
DICPCG: Solving for p, Initial residual = , Final residual = e-07, No Iterations 46
time step continuity errors : sum local = e-08, global = e-10, cumulative = e-12
ExecutionTime = 0.06 s ClockTime = 0 s

30 OpenFOAM (14/14) - and then…?
pre- and post-processing:
environment export in mpi.pre.sh
mesh decomposition as a job
data collection from the slaves
the "project" as a whole:
GUI integration
data storage
……

31 MPI on the web…
https://edms.cern.ch/file/454439/LCG-2-UserGuide.pdf

32 Questions…

