
1 NERSC Users’ Group, Oct. 3, 2005: Interconnect and MPI
Bill Saphir

2 NERSC Users’ Group, Oct. 3, 2005: What this talk will cover
Infiniband fabric
–Overview of Infiniband: past, present, future
–Configuration on Jacquard
MPI
–How to use it
–Limitations/workarounds
–Plans

3 NERSC Users’ Group, Oct. 3, 2005: Infiniband
Industry-standard high-performance network
–Many years in development; near death in 2000, but has come roaring back
–Originally seen as a PCI replacement
–Retains the ability to connect directly to disk controllers
High performance
–Direct user-space access to hardware: the kernel is not involved in the transfer, no memory-to-memory copies are needed, and protected access ensures security and safety
–Supports both RDMA (put/get) and Send/Receive models (see the MPI-level sketch below)
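The two models on this slide have familiar MPI-level counterparts: Send/Receive is two-sided, while RDMA put/get is one-sided. The sketch below is illustrative only, not code from the talk; it uses MPI-2 one-sided calls (MPI_Win_create, MPI_Put) that the MPI-1-level MVAPICH described later (slide 11) may not provide.

    /* Illustrative contrast of the two communication models; run with at
     * least 2 processes.  Not code from the talk. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, value = 0, update = 99;
        MPI_Win win;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Two-sided model: both sides participate in the transfer. */
        if (rank == 0) {
            value = 42;
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }

        /* One-sided (put/get) model: rank 0 writes directly into a window
         * exposed by rank 1; rank 1 posts no matching receive. */
        MPI_Win_create(&value, sizeof(int), sizeof(int), MPI_INFO_NULL,
                       MPI_COMM_WORLD, &win);
        MPI_Win_fence(0, win);
        if (rank == 0)
            MPI_Put(&update, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
        MPI_Win_fence(0, win);

        if (rank == 1)
            printf("rank 1 sees value = %d\n", value);

        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }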

4 NERSC Users’ Group, Oct. 3, 2005: IB Link Speeds
Single channel: 2.5 Gb/s in each direction, simultaneously
“4X” link is 10 Gb/s
–10 bits encode 8 (error correction/detection)
–1 GB/s of data bandwidth in each direction per 4X link (worked calculation below)
–PCI-X is 1 GB/s total bandwidth, so it is not possible to fully utilize IB 4X with PCI-X
12X link is three 4X links
–One fat pipe to the routing logic rather than three separate links
Double Data Rate (DDR) is 2X the speed (5 Gb/s channels)
–Just now becoming available
–Very short cable lengths (3 m) [XXX]
Quad Data Rate (QDR) is envisioned
–Cable length issues with copper
–Optical possible but expensive
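As a back-of-the-envelope check of the 4X numbers above (a sketch, not from the talk): four 2.5 Gb/s channels give 10 Gb/s of signaling, 8b/10b encoding leaves 8 Gb/s of data, or 1 GB/s in each direction.

    /* Check of the 4X link arithmetic on the slide. */
    #include <stdio.h>

    int main(void)
    {
        double channel_gbps = 2.5;        /* signaling rate per channel, per direction */
        int    width        = 4;          /* "4X" link                                  */
        double signal_gbps  = width * channel_gbps;       /* 10 Gb/s                    */
        double data_gbps    = signal_gbps * 8.0 / 10.0;   /* 8b/10b leaves 8 Gb/s       */
        double data_GBps    = data_gbps / 8.0;            /* 1 GB/s per direction       */

        printf("4X link: %.0f Gb/s signaling, %.0f Gb/s data, %.1f GB/s per direction\n",
               signal_gbps, data_gbps, data_GBps);
        printf("Bidirectional data bandwidth: %.1f GB/s (PCI-X tops out near 1 GB/s total)\n",
               2.0 * data_GBps);
        return 0;
    }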

5 NERSC Users’ Group, Oct. 3, 2005: IB Switching
Large switches are created by connecting smaller switches together
Current small-switch building block is 24 ports
Usual configuration is a fat tree
–An N-port switch can be used to build an N^2/2-port fat tree using 3N/2 switches (max 288 ports for N = 24); see the check below
–Larger switches available from some vendors are actually a 2-level fat tree built from 24-port switches
–A fat tree has “full bisection bandwidth”: it supports all nodes communicating at the same time
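The N^2/2-port and 3N/2-switch figures follow from the standard construction in which each leaf switch devotes half its ports to nodes and half to the spine; that split is assumed here rather than stated on the slide. A quick check for N = 24:

    /* Fat-tree port/switch count check for the formula on the slide. */
    #include <stdio.h>

    int main(void)
    {
        int N = 24;                      /* ports per building-block switch      */
        int leaf_switches  = N;          /* each leaf: N/2 ports down, N/2 up    */
        int spine_switches = N / 2;      /* each spine: all N ports down         */
        int node_ports     = leaf_switches * (N / 2);         /* N^2/2 = 288     */
        int total_switches = leaf_switches + spine_switches;  /* 3N/2  = 36      */

        printf("N = %d: %d node ports from %d switches\n",
               N, node_ports, total_switches);
        return 0;
    }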

6 NERSC Users’ Group, Oct. 3, 2005: Example: 120-port thin tree
[Figure: one L2 switch above six L1 switches]
All small switches have 24 ports (7 switches total)
Each L1 switch has 4 “up” connections to the L2 switch and 20 “down” connections to nodes
120 connections to nodes in total

7 NERSC Users’ Group, Oct. 3, 2005: Example: 96-port fat tree (Clos)
[Figure: four L2 switches above eight L1 switches]
All small switches have 24 ports (12 switches total)
Each L1 switch has 12 “up” ports (3 connections to each L2 switch) and 12 “down” ports (to nodes)
96 connections to nodes in total (see the comparison sketch below)
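Putting slides 6 and 7 side by side, the difference is the oversubscription at the L1 level (down-ports divided by up-ports, a term used here for convenience, not on the slide). A small check of both examples:

    /* Compare the 120-port thin tree and the 96-port Clos examples. */
    #include <stdio.h>

    static void report(const char *name, int l1, int l2, int down, int up)
    {
        printf("%s: %d node ports, %d switches, oversubscription %.1f:1\n",
               name, l1 * down, l1 + l2, (double)down / up);
    }

    int main(void)
    {
        report("120-port thin tree", 6, 1, 20, 4);   /* slide 6: 5:1 oversubscribed */
        report("96-port Clos",       8, 4, 12, 12);  /* slide 7: full bisection     */
        return 0;
    }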

8 NERSC Users’ Group, Oct. 3, 2005: Infiniband Routing
Infiniband is “destination routed”
–Switches make routing decisions based on the destination of a packet
–Even though a fat tree has full bisection bandwidth, hot spots are possible
–The routing scheme makes it more difficult to avoid network “hot spots” (not yet clear whether Jacquard users are impacted)
–Workarounds are available and will be addressed in future versions of MPI

9 NERSC Users’ Group, Oct. 3, 2005: Jacquard configuration
Jacquard is a “2-level” fat tree
–24-port switches at L1 (to nodes)
–96-port switches at L2
–Really a 3-level tree, because the 96-port switches are 2-level trees internally
–4X connections (1 GB/s) to all nodes
–Innovation: 12X uplinks from L1 to L2, i.e. a smaller number of fat pipes
Full bisection bandwidth
–Supports all nodes communicating at the same time
–The network supports 2X what the PCI-X busses can sustain

10 NERSC Users’ Group, Oct. 3, 2005: Infiniband Software
The IB software interface was originally called the “Virtual Interface Architecture” (VI Architecture or VIA)
NERSC wrote the first MPI for VIA (MVICH), the basis for the current MPI implementation on Jacquard
Microsoft derailed the API in the standard
The de-facto current standard is VAPI, from Mellanox (part of OpenIB generation 1 software)
OpenIB Gen 2 will have a slightly different interface

11 NERSC Users’ Group, Oct. 3, 2005: MPI for Infiniband
Jacquard uses MVAPICH (MPICH + VAPI)
–Based on MVICH from NERSC (MPICH + VIA) and MPICH from ANL
–OSU: porting to VAPI plus performance improvements
–Support path is OSU -> Mellanox -> LNXI -> NERSC; support mechanisms and responsibilities are being discussed
MPI-1 functionality
NERSC is tracking OpenMPI for Infiniband

12 NERSC Users’ Group, Oct. 3, 2005: Compiling/Linking MPI
MPI versioning is controlled by modules
–“module load mvapich” is in the default startup files
–The compiler is loaded independently
mpicc/mpif90 (a minimal example follows this slide)
–mpicc -o myprog myprog.c
–mpif90 -o myprog myprog.f
–Uses the currently loaded pathscale module
–Automatically finds MPI include files
–Automatically finds MPI libraries
–The latest version uses shared libraries
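For concreteness, a minimal program of the sort the mpicc line above would build; this example is illustrative, and “myprog.c” simply reuses the file name from the slide.

    /* Minimal MPI program: each process reports its rank. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        printf("Hello from rank %d of %d\n", rank, size);

        MPI_Finalize();
        return 0;
    }

Compile it with “mpicc -o myprog myprog.c” after “module load mvapich”, and launch it with mpirun as described on the next slide.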

13 NERSC Users’ Group, Oct. 3, 2005: Running MPI programs
Always use the “mpirun” command
–Written by NERSC
–Integrates PBS and MPI
–Runs with processor affinity enabled
Inside a PBS job:
–“mpirun ./a.out” runs a.out on all processors allocated by PBS; there is no need for the “$PBS_NODEFILE” hack
–Make sure to request ppn=2 with PBS
–“-np N” is optional; it can be used to run on fewer processors
On a login node:
–“mpirun -np 32 ./a.out” just works
–Internally it creates a PBS script (on 32 processors) and runs the script interactively using “qsub -I” and expect
–Max wallclock time: 30 minutes

14 NERSC Users’ Group, Oct. 3, 2005: mpirun current limitations
Currently propagates only these environment variables:
–FILENV, LD_LIBRARY_PATH, LD_PRELOAD
–To propagate other variables, ask NERSC
Does not directly support MPMD
–To run different binaries on different nodes, use a starter script that “execs” the correct binary based on the value of MPIRUN_RANK (a sketch of the idea follows this slide)
Does not allow redirection of standard input, e.g.
–mpirun a.out < file
Does not propagate $PATH, so “./a.out” is needed even if “.” is in $PATH
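The MPMD workaround above calls for a starter that execs the right binary based on MPIRUN_RANK. A sketch of that idea follows, written in C rather than as a shell script; MPIRUN_RANK is the variable named on the slide, while “./master” and “./worker” are hypothetical binary names.

    /* MPMD starter sketch: exec a different binary depending on the rank
     * that mpirun exports in MPIRUN_RANK.  "./master" and "./worker" are
     * placeholder names for this illustration. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        const char *rank_str = getenv("MPIRUN_RANK");
        int rank = rank_str ? atoi(rank_str) : 0;

        (void)argc;
        /* Rank 0 becomes the "master" binary, everyone else a "worker". */
        if (rank == 0)
            execv("./master", argv);
        else
            execv("./worker", argv);

        perror("execv");            /* reached only if exec fails */
        return 1;
    }

The starter itself is what gets launched (e.g. “mpirun ./starter”), so that each rank execs the binary appropriate to it.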

15 NERSC Users’ Group, Oct. 3, 2005: Orphan processes
mpirun (using ssh) has a habit of leaving “orphan” processes on nodes when a program fails
PBS (with NERSC additions) goes to great lengths to clean these up between jobs
mpirun detects whether it has been previously called in the same PBS job; if so, it first tries to clean up orphan processes in case the previous run failed

16 NERSC Users’ Group, Oct. 3, 2005: Peeking inside mpirun
mpirun currently uses ssh to start up processes (the internal starter is called “mpirun_rsh”; do not use it yourself)
NERSC expects to move to PBS-based startup (internal starter called “mpiexec”)
–May help with orphan processes, accounting, the ability to redirect standard input, and direct MPMD support
Do not use mpirun_rsh or mpiexec directly; they are not supported by NERSC

17 NERSC Users’ Group, Oct. 3, 2005: MPI Memory Use
Current MVAPICH uses a lot of memory per process, linear in the number of MPI processes
Per process (see the worked estimate below):
–64 MB, plus
–276 KB per process up to 64 processes
–1.2 MB per process above 64
Due to a limitation in the VI Architecture that does not exist in Infiniband but was carried forward
Future implementations of MPI will have lower memory use
Note: getrusage() doesn’t report memory use under Linux
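A worked estimate of the formula above, under one plausible reading: 64 MB base, plus 276 KB for each of the first 64 processes and 1.2 MB for each process beyond 64. That reading is an assumption; the constants are the slide’s.

    /* Estimate MVAPICH memory per MPI process from the slide's constants. */
    #include <stdio.h>

    static double mvapich_mb_per_process(int nprocs)
    {
        double mb = 64.0;                               /* fixed base         */
        int first  = nprocs < 64 ? nprocs : 64;
        int beyond = nprocs > 64 ? nprocs - 64 : 0;
        mb += first  * 276.0 / 1024.0;                  /* 276 KB per process */
        mb += beyond * 1.2;                             /* 1.2 MB per process */
        return mb;
    }

    int main(void)
    {
        int sizes[] = { 64, 128, 256, 512 };
        for (int i = 0; i < 4; i++)
            printf("%4d processes: ~%.0f MB per MPI process\n",
                   sizes[i], mvapich_mb_per_process(sizes[i]));
        return 0;
    }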

18 NERSC Users’ Group, Oct. 3, 2005: MPI Performance
Ping-pong bandwidth (a minimal ping-pong sketch follows this slide):
–800 MB/s (Seaborg: 320 MB/s); drops to 500 MB/s for messages above 200 KB
–Theoretical peak 1000 MB/s
Ping-pong latency:
–5.6 us between nodes (Seaborg: 24 us default; 21 us with MPI_SINGLE_THREAD)
–0.6 us within a node
“Random ring bandwidth”:
–184 MB/s (Seaborg: ~43 MB/s at 4 nodes)
–Measures contention in the network
–Theoretical peak 250 MB/s
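For reference, a minimal ping-pong loop of the kind behind the bandwidth numbers above (latency tests use the same pattern with very small messages). This is an illustrative sketch, not the benchmark actually run on Jacquard or Seaborg.

    /* Ping-pong sketch: run with two processes, e.g. "mpirun -np 2 ./pingpong". */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(int argc, char **argv)
    {
        const int iters = 1000;
        const int bytes = 1 << 20;              /* 1 MB message */
        char *buf = malloc(bytes);
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        memset(buf, 0, bytes);

        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++) {
            if (rank == 0) {
                MPI_Send(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double one_way = (MPI_Wtime() - t0) / (2.0 * iters);

        if (rank == 0)
            printf("one-way time %.2f us, bandwidth %.0f MB/s\n",
                   one_way * 1e6, bytes / one_way / 1e6);

        free(buf);
        MPI_Finalize();
        return 0;
    }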

19 NERSC Users’ Group, Oct. 3, 2005: MPI Futures
Different startup mechanism: fewer orphans, faster startup, full environment propagated
Lower memory use
More control over the memory registration cache
Higher bandwidth

20 NERSC Users’ Group, Oct. 3, 2005: Summary
All you need to know:
–mpicc/mpif77/mpif90/mpicxx
–mpirun -np N ./a.out

