Parallel Programming On the IUCAA Clusters Sunu Engineer.

2 Parallel Programming On the IUCAA Clusters Sunu Engineer

3 IUCAA Clusters  The Cluster – Cluster of Intel Machines on Linux  Hercules – Cluster of HP ES45 quad processor nodes  References: http://www.iucaa.ernet.in/

4 The Cluster  Four Single Processor Nodes with 100 Mbps Ethernet interconnect.  1.4 GHz, Intel Pentium 4  512 MB RAM  Linux 2.4 Kernel (Redhat 7.2 Distribution)  MPI – LAM 6.5.9  PVM – 3.4.3

5 Hercules  Four quad processor nodes with Memory Channel interconnect  1.25 GHz Alpha 21264D RISC Processor  4 GB RAM  Tru64 5.1A with TruCluster software  Native MPI  LAM 7.0  PVM 3.4.3

6 Expected Computational Performance  Intel Cluster  Processor ~ 512/590 (SPECint/SPECfp)  System GFLOPS ~ 2 (HPL benchmark)  ES45 Cluster  Processor ~ 679/960 (SPECint/SPECfp)  System GFLOPS ~ 30 (HPL benchmark)

7 Parallel Programs  Move towards large scale distributed programs  Larger class of problems with higher resolution  Enhanced levels of details to be explored  …

8 The Starting Point  Model  Single Processor Program  Multi Processor Program  Model  Multiprocessor Program

9 Decomposition of a Single Processor Program  Temporal  Initialization  Control  Termination  Spatial  Functional  Modular  Object based

10 Multi Processor Programs  Spatial delocalization – Dissolving the boundary  Single spatial coordinate - Invalid  Single time coordinate - Invalid  Temporal multiplicity  Multiple streams at different rates w.r.t an external clock.

11 In comparison  Multiple points of initialization  Distributed control  Multiple points and times of termination  Distribution of the activity in space and time

12 Breaking up a problem

13 Yet Another way

14 And another

15 Amdahl’s Law
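The law the slide names can be stated explicitly: if a fraction P of a program parallelizes perfectly over N processors and the remaining 1 - P stays serial, the speedup is

```latex
S(N) = \frac{1}{(1 - P) + P/N}, \qquad \lim_{N \to \infty} S(N) = \frac{1}{1 - P}
```

so the serial fraction caps the achievable speedup no matter how many processors are added; for example, P = 0.9 bounds the speedup at 10x.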

16 Degrees of refinement  Fine parallelism  Instruction level  Program statement level  Loop level  Coarse parallelism  Process level  Task level  Region level

17 Patterns and Frameworks  Patterns - Documented solutions to recurring design problems.  Frameworks – Software and hardware structures implementing the infrastructure

18 Processes and Threads  From heavy multitasking to lightweight multitasking on a single processor  Isolated memory spaces to shared memory space

19 Posix Threads in Brief  pthread_create(pthread_t *thread, const pthread_attr_t *attr, void *(*start_routine)(void *), void *arg)  pthread_exit  pthread_join  pthread_self  pthread_mutex_init  pthread_mutex_lock/unlock  Link with -lpthread
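The calls listed above fit together as in the following minimal sketch; `struct task`, `partial_sum`, and `threaded_sum` are illustrative names, not from the slides:

```c
#include <pthread.h>
#include <stddef.h>

/* Argument/result record passed to one worker thread. */
struct task { int start, end; long sum; };

/* Thread function: sum the integers in [start, end).
   Returning from the function is equivalent to pthread_exit(NULL). */
static void *partial_sum(void *arg)
{
    struct task *t = arg;
    t->sum = 0;
    for (int i = t->start; i < t->end; i++)
        t->sum += i;
    return NULL;
}

/* Split the sum 0..n-1 over two threads, then join them. */
long threaded_sum(int n)
{
    pthread_t tid[2];
    struct task t[2] = { {0, n / 2, 0}, {n / 2, n, 0} };

    for (int i = 0; i < 2; i++)
        pthread_create(&tid[i], NULL, partial_sum, &t[i]);
    for (int i = 0; i < 2; i++)
        pthread_join(tid[i], NULL);   /* wait for each worker to finish */

    return t[0].sum + t[1].sum;
}
```

As the slide notes, link with -lpthread (e.g. `cc sum.c -lpthread`).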

20 Multiprocessing architectures  Symmetric Multiprocessing  Shared memory  Space Unified  Different temporal streams  OpenMP standard

21 OpenMP Programming  Set of directives to the compiler to express shared memory parallelism  Small library of functions  Environment variables.  Standard language bindings defined for FORTRAN, C and C++

22 OpenMP example

A C OpenMP program:

#include <stdio.h>
#include <omp.h>

int main(int argc, char **argv)
{
  #pragma omp parallel
  {
    printf("Hello World from %d\n", omp_get_thread_num());
  }
  return 0;
}

The same program in Fortran:

      program openmp
!$OMP PARALLEL
      print *, "Hello world from", omp_get_thread_num()
!$OMP END PARALLEL
      stop
      end

23 OpenMP directives – Parallel and Work sharing  OMP parallel [clauses]  OMP do [clauses]  OMP sections [clauses]  OMP section  OMP single

24 Combined work sharing and Synchronization  Combined work sharing: OMP parallel do, OMP parallel sections  Synchronization: OMP master, OMP critical, OMP barrier, OMP atomic, OMP flush, OMP ordered  Data: OMP threadprivate
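A small sketch of the `critical` construct, protecting a shared counter inside a combined `parallel for`; `parallel_count` is an illustrative name. The pragmas are ignored by a non-OpenMP compiler, so the code also runs (serially) without -fopenmp:

```c
/* Count to n with one increment per iteration. The critical section
   serializes the read-modify-write of the shared counter; without it,
   concurrent threads would race on count++. */
int parallel_count(int n)
{
    int count = 0;
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        #pragma omp critical
        count++;
    }
    return count;
}
```

With the critical section the result is always n; `atomic` would serve here too, at lower cost for a single update.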

25 OpenMP Directive clauses  shared(list)  private(list)/threadprivate  firstprivate/lastprivate(list)  default(private|shared|none) (Fortran)  default(shared|none) (C/C++)  reduction(operator|intrinsic : list)  copyin(list)  if (expr)  schedule(type[,chunk])  ordered/nowait
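A sketch combining several of these clauses on one directive; `fill` is an illustrative name. Again, the pragma is ignored without an OpenMP compiler, so the serial result is identical:

```c
/* tmp is per-iteration scratch, so each thread needs its own copy
   (private); a is written at distinct indices, so it can be shared.
   schedule(static,2) hands out chunks of 2 iterations round-robin. */
void fill(int *a, int n)
{
    int tmp;
    #pragma omp parallel for private(tmp) shared(a) schedule(static, 2)
    for (int i = 0; i < n; i++) {
        tmp = i * i;
        a[i] = tmp + 1;
    }
}
```

Note that the loop index i, declared in the for statement, is automatically private.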

26 OpenMP Library functions  omp_get/set_num_threads()  omp_get_max_threads()  omp_get_thread_num()  omp_get_num_procs()  omp_in_parallel()  omp_get/set_(dynamic/nested)()  omp_init/destroy/test_lock()  omp_set/unset_lock()

27 OpenMP environment variables  OMP_SCHEDULE  OMP_NUM_THREADS  OMP_DYNAMIC  OMP_NESTED

28 OpenMP Reduction and Atomic Operators  Reduction : +, -, *, &, ^, |, &&, ||  Atomic : ++, --, +, *, -, /, &, ^, >>, <<, |
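A sketch of the + reduction operator in use; `dot` is an illustrative name, and the pragma degrades gracefully to a serial loop without OpenMP:

```c
/* Dot product with a + reduction: each thread accumulates a private
   partial sum, and OpenMP combines the partials when the loop ends.
   Without the clause, concurrent updates to sum would race. */
long dot(const int *x, const int *y, int n)
{
    long sum = 0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += (long)x[i] * y[i];
    return sum;
}
```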

29 Simple loops

Serial:

      do I = 1, N
        z(I) = a * x(I) + y
      end do

Parallel:

!$OMP parallel do
      do I = 1, N
        z(I) = a * x(I) + y
      end do
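The same loop written with the C bindings, keeping the slide's scalar y; `saxpy_like` is an illustrative name:

```c
/* z(i) = a*x(i) + y for each i: every iteration writes a distinct
   element of z, so the iterations are independent and the loop
   parallelizes with a single directive. */
void saxpy_like(int n, float a, const float *x, float y, float *z)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        z[i] = a * x[i] + y;
}
```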

30 Data Scoping  Loop index private by default  Declare as shared, private or reduction

31 Private variables

!$OMP parallel do private(a,b,c)
      do I = 1, m
        do j = 1, n
          b = f(I)
          c = k(j)
          call abc(a, b, c)
        end do
      end do

The equivalent C/C++ directive: #pragma omp parallel for private(a,b,c)

32 Dependencies  Data dependencies (Lexical/dynamic extent)  Flow dependencies  Classifying and removing the dependencies  Non-removable dependencies  Examples:

A flow dependence – each iteration reads the value written by the previous one, so the loop cannot be parallelized as written:

      do I = 2, n
        a(I) = a(I) + a(I-1)
      end do

With stride 2 the dependence disappears – each iteration writes an even element and reads only an odd one, which no iteration writes:

      do I = 2, N, 2
        a(I) = a(I) + a(I-1)
      end do
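The stride-2 case can be checked in a small C sketch; `smooth_even` is an illustrative name, and the C loop mirrors the Fortran one with 0-based indexing:

```c
/* Add each odd-indexed neighbor into the following even-indexed
   element. Writes touch only even indices and reads only odd ones,
   so no iteration depends on another and the loop parallelizes. */
void smooth_even(double *a, int n)
{
    #pragma omp parallel for
    for (int i = 2; i < n; i += 2)
        a[i] = a[i] + a[i - 1];
}
```

The stride-1 version of the same statement is a prefix sum, whose loop-carried flow dependence cannot be removed this way.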

33 Making sure everyone has enough work  Parallel overhead – creation of threads and synchronization vs. work done in the loop

!$OMP parallel do schedule(dynamic,3)

 Schedule types – static, dynamic, guided, runtime
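A sketch of when `schedule(dynamic,3)` pays off; `uneven` is an illustrative name. With iteration cost growing with i, a static split would leave one thread with most of the work, while dynamic scheduling hands out chunks of 3 iterations to whichever thread is free:

```c
/* Iterations with very uneven cost: iteration i does i*i + 1 units
   of (dummy) work. dynamic,3 balances the load across threads; the
   reduction combines the per-thread totals. */
long uneven(int n)
{
    long total = 0;
    #pragma omp parallel for schedule(dynamic, 3) reduction(+:total)
    for (int i = 0; i < n; i++) {
        long work = 0;                  /* declared in the loop: private */
        for (int j = 0; j <= i * i; j++)
            work += 1;
        total += work;
    }
    return total;
}
```

The trade-off: dynamic scheduling adds per-chunk synchronization overhead, so it only helps when iteration costs genuinely differ.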

34 Parallel regions – from fine to coarse parallelism  !$OMP parallel  threadprivate and copyin  Work sharing constructs – do, sections, section, single  Synchronization – critical, atomic, barrier, ordered, master

35 To distributed memory systems  MPI, PVM, BSP …

36 Some Parallel Libraries  Existing parallel libraries and toolkits include:  PUL, the Parallel Utilities Library from EPCC.  The Multicomputer Toolbox from Tony Skjellum and colleagues at LLNL and MSU.  PETSc, the Portable, Extensible Toolkit for Scientific computation from ANL.  ScaLAPACK from ORNL and UTK.  ESSL and PESSL on AIX.  PBLAS, PLAPACK, ARPACK.


