Load Balancing Hybrid Programming Models for SMP Clusters and Fully Permutable Loops
Nikolaos Drosinos and Nectarios Koziris
National Technical University of Athens, Computing Systems Laboratory
Oslo, June 15, 2005 (ICPP-HPSEC)

Motivation
- fully permutable loops remain a computational challenge for HPC
- hybrid parallelization is attractive for DSM architectures
- currently, popular free message passing libraries provide only limited multi-threading support
- SPMD hybrid parallelization suffers from intrinsic load imbalance
Contribution
- two static thread load balancing schemes (constant and variable) for coarse-grain funneled hybrid parallelization of fully permutable loops
  - generic
  - simple to implement
- experimental evaluation against micro-kernel benchmarks for different programming models
  - message passing
  - fine-grain hybrid
  - coarse-grain hybrid (unbalanced, balanced)
Algorithmic model

foracross tile_1 do
  …
  foracross tile_{n-1} do
    for tile_n do
      Receive(tile);
      Compute(A, tile);
      Send(tile);

Restrictions:
- fully permutable loops
- unitary inter-process dependencies
Message passing parallelization
- tiling transformation
- (overlapped?) computation and communication phases
- pipelined execution
- portable
- scalable
- highly optimized
Hybrid parallelization
So… why bother?
Hybrid parallelization: why bother I
- on a shared memory architecture, the shared memory programming model competes with the message passing programming model
Hybrid parallelization: why bother II
- DSM architectures are popular!
Fine-grain hybrid parallelization
- incremental parallelization of computational loops
- relatively easy to implement
- popular
- Amdahl's law restricts parallel efficiency
- overhead of thread structure re-initialization
- restrictive programming model for many applications
Coarse-grain hybrid parallelization
- generic SPMD programming style
- good parallelization efficiency
- no thread re-initialization overhead
- more difficult to implement
- intrinsic load imbalance, assuming the common funneled thread support level
MPI thread support levels
- single
- masteronly (fine-grain hybrid: alternating communication and computation phases)
- funneled (coarse-grain hybrid: communication funneled through the master thread)
- serialized
- multiple
Load balancing
- Idea / Consequence: the master thread assumes a smaller fraction of the process tile computational load compared to the other threads
Load balancing (2)
- T … total number of threads
- p … current process id
Load balancing (3)
Experimental results
- 8-node dual SMP Linux cluster (800 MHz Pentium III, 256 MB RAM)
- MPICH ( --with-device=ch_p4, --with-comm=shared, P4_SOCKBUFSIZE=104KB )
- Intel C++ compiler 8.1 ( -O3 -static -mcpu=pentiumpro )
- Fast Ethernet interconnection network
Alternating Direction Implicit (ADI)
- stencil computation used for solving partial differential equations
- unitary data dependencies
- 3D iteration space (X × Y × Z)
ADI (results)
Synthetic benchmark (results)
Conclusions
- fine-grain hybrid parallelization is inefficient
- unbalanced coarse-grain hybrid parallelization is also inefficient
- balancing improves hybrid model performance
- the variable balanced coarse-grain hybrid model is the most efficient approach overall
- the relative performance improvement grows as communication needs increase relative to computation
Thank You! Questions?