Parallel Programming On the IUCAA Clusters Sunu Engineer.

2 Parallel Programming On the IUCAA Clusters Sunu Engineer

3 IUCAA Clusters  The Cluster – Cluster of Intel Machines on Linux  Hercules – Cluster of HP ES45 quad processor nodes  References: http://www.iucaa.ernet.in/

4 The Cluster  Four Single Processor Nodes with 100 Mbps Ethernet interconnect.  1.4 GHz, Intel Pentium 4  512 MB RAM  Linux 2.4 Kernel (Redhat 7.2 Distribution)  MPI – LAM 6.5.9  PVM – 3.4.3

5 Hercules  Four quad processor nodes with Memory Channel interconnect  1.25 GHz Alpha 21264D RISC Processor  4 GB RAM  Tru64 5.1A with TruCluster software  Native MPI  LAM 7.0  PVM 3.4.3

6 Expected Computational Performance  Intel Cluster  Processor ~ 512/590 (SPECint/SPECfp)  System GFLOPS ~ 2 (HPL benchmark)  ES45 Cluster  Processor ~ 679/960 (SPECint/SPECfp)  System GFLOPS ~ 30 (HPL benchmark)

7 Parallel Programs  Move towards large scale distributed programs  Larger class of problems with higher resolution  Enhanced levels of details to be explored  …

8 The Starting Point  Model  Single Processor Program  Multi Processor Program  Model  Multiprocessor Program

9 Decomposition of a Single Processor Program  Temporal  Initialization  Control  Termination  Spatial  Functional  Modular  Object based

10 Multi Processor Programs  Spatial delocalization – Dissolving the boundary  Single spatial coordinate - Invalid  Single time coordinate - Invalid  Temporal multiplicity  Multiple streams at different rates w.r.t an external clock.

11 In comparison  Multiple points of initialization  Distributed control  Multiple points and times of termination  Distribution of the activity in space and time

12 Breaking up a problem

13 Yet Another way

14 And another

15 Amdahl’s Law
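The law the slide names can be stated explicitly: if a fraction P of a program parallelizes perfectly over N processors and the remaining 1 - P stays serial, the speedup is

```latex
S(N) = \frac{1}{(1 - P) + P/N}, \qquad \lim_{N \to \infty} S(N) = \frac{1}{1 - P}
```

so the serial fraction caps the achievable speedup no matter how many processors are added; for example, P = 0.9 bounds the speedup at 10x.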

16 Degrees of refinement  Fine parallelism  Instruction level  Program statement level  Loop level  Coarse parallelism  Process level  Task level  Region level

17 Patterns and Frameworks  Patterns - Documented solutions to recurring design problems.  Frameworks – Software and hardware structures implementing the infrastructure

18 Processes and Threads  From heavy multitasking to lightweight multitasking on a single processor  Isolated memory spaces to shared memory space

19 Posix Threads in Brief  pthread_create(pthread_t *thread, const pthread_attr_t *attr, void *(*start_routine)(void *), void *arg)  pthread_exit  pthread_join  pthread_self  pthread_mutex_init  pthread_mutex_lock/unlock  Link with -lpthread
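The calls listed above fit together as in the following minimal sketch; `struct task`, `partial_sum`, and `threaded_sum` are illustrative names, not from the slides:

```c
#include <pthread.h>
#include <stddef.h>

/* Argument/result record passed to one worker thread. */
struct task { int start, end; long sum; };

/* Thread function: sum the integers in [start, end).
   Returning from the function is equivalent to pthread_exit(NULL). */
static void *partial_sum(void *arg)
{
    struct task *t = arg;
    t->sum = 0;
    for (int i = t->start; i < t->end; i++)
        t->sum += i;
    return NULL;
}

/* Split the sum 0..n-1 over two threads, then join them. */
long threaded_sum(int n)
{
    pthread_t tid[2];
    struct task t[2] = { {0, n / 2, 0}, {n / 2, n, 0} };

    for (int i = 0; i < 2; i++)
        pthread_create(&tid[i], NULL, partial_sum, &t[i]);
    for (int i = 0; i < 2; i++)
        pthread_join(tid[i], NULL);   /* wait for each worker to finish */

    return t[0].sum + t[1].sum;
}
```

As the slide notes, link with -lpthread (e.g. `cc sum.c -lpthread`).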

20 Multiprocessing architectures  Symmetric Multiprocessing  Shared memory  Space Unified  Different temporal streams  OpenMP standard

21 OpenMP Programming  Set of directives to the compiler to express shared memory parallelism  Small library of functions  Environment variables.  Standard language bindings defined for FORTRAN, C and C++

22 OpenMP example

A C OpenMP program:

#include <stdio.h>
#include <omp.h>

int main(int argc, char **argv)
{
  #pragma omp parallel
  {
    printf("Hello World from %d\n", omp_get_thread_num());
  }
  return 0;
}

The same program in Fortran:

      program openmp
!$OMP PARALLEL
      print *, "Hello world from", omp_get_thread_num()
!$OMP END PARALLEL
      stop
      end

23 OpenMP directives – Parallel and Work sharing  OMP parallel [clauses]  OMP do [clauses]  OMP sections [clauses]  OMP section  OMP single

24 Combined work sharing and Synchronization  Combined work sharing: OMP parallel do, OMP parallel sections  Synchronization: OMP master, OMP critical, OMP barrier, OMP atomic, OMP flush, OMP ordered  Data: OMP threadprivate
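A small sketch of the `critical` construct, protecting a shared counter inside a combined `parallel for`; `parallel_count` is an illustrative name. The pragmas are ignored by a non-OpenMP compiler, so the code also runs (serially) without -fopenmp:

```c
/* Count to n with one increment per iteration. The critical section
   serializes the read-modify-write of the shared counter; without it,
   concurrent threads would race on count++. */
int parallel_count(int n)
{
    int count = 0;
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        #pragma omp critical
        count++;
    }
    return count;
}
```

With the critical section the result is always n; `atomic` would serve here too, at lower cost for a single update.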

25 OpenMP Directive clauses  shared(list)  private(list)/threadprivate  firstprivate/lastprivate(list)  default(private|shared|none) (Fortran)  default(shared|none) (C/C++)  reduction(operator|intrinsic : list)  copyin(list)  if (expr)  schedule(type[,chunk])  ordered/nowait
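A sketch combining several of these clauses on one directive; `fill` is an illustrative name. Again, the pragma is ignored without an OpenMP compiler, so the serial result is identical:

```c
/* tmp is per-iteration scratch, so each thread needs its own copy
   (private); a is written at distinct indices, so it can be shared.
   schedule(static,2) hands out chunks of 2 iterations round-robin. */
void fill(int *a, int n)
{
    int tmp;
    #pragma omp parallel for private(tmp) shared(a) schedule(static, 2)
    for (int i = 0; i < n; i++) {
        tmp = i * i;
        a[i] = tmp + 1;
    }
}
```

Note that the loop index i, declared in the for statement, is automatically private.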

26 OpenMP Library functions  omp_get/set_num_threads()  omp_get_max_threads()  omp_get_thread_num()  omp_get_num_procs()  omp_in_parallel()  omp_get/set_(dynamic/nested)()  omp_init/destroy/test_lock()  omp_set/unset_lock()

27 OpenMP environment variables  OMP_SCHEDULE  OMP_NUM_THREADS  OMP_DYNAMIC  OMP_NESTED

28 OpenMP Reduction and Atomic Operators  Reduction : +, -, *, &, ^, |, &&, ||  Atomic : ++, --, +, *, -, /, &, ^, >>, <<, |
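A sketch of the + reduction operator in use; `dot` is an illustrative name, and the pragma degrades gracefully to a serial loop without OpenMP:

```c
/* Dot product with a + reduction: each thread accumulates a private
   partial sum, and OpenMP combines the partials when the loop ends.
   Without the clause, concurrent updates to sum would race. */
long dot(const int *x, const int *y, int n)
{
    long sum = 0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += (long)x[i] * y[i];
    return sum;
}
```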

29 Simple loops

Serial:

      do I = 1, N
        z(I) = a * x(I) + y
      end do

Parallel:

!$OMP parallel do
      do I = 1, N
        z(I) = a * x(I) + y
      end do
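The same loop written with the C bindings, keeping the slide's scalar y; `saxpy_like` is an illustrative name:

```c
/* z(i) = a*x(i) + y for each i: every iteration writes a distinct
   element of z, so the iterations are independent and the loop
   parallelizes with a single directive. */
void saxpy_like(int n, float a, const float *x, float y, float *z)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        z[i] = a * x[i] + y;
}
```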

30 Data Scoping  Loop index private by default  Declare as shared, private or reduction

31 Private variables

!$OMP parallel do private(a,b,c)
      do I = 1, m
        do j = 1, n
          b = f(I)
          c = k(j)
          call abc(a, b, c)
        end do
      end do

The equivalent C/C++ directive: #pragma omp parallel for private(a,b,c)

32 Dependencies  Data dependencies (Lexical/dynamic extent)  Flow dependencies  Classifying and removing the dependencies  Non-removable dependencies  Examples:

A flow dependence – each iteration reads the value written by the previous one, so the loop cannot be parallelized as written:

      do I = 2, n
        a(I) = a(I) + a(I-1)
      end do

With stride 2 the dependence disappears – each iteration writes an even element and reads only an odd one, which no iteration writes:

      do I = 2, N, 2
        a(I) = a(I) + a(I-1)
      end do
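The stride-2 case can be checked in a small C sketch; `smooth_even` is an illustrative name, and the C loop mirrors the Fortran one with 0-based indexing:

```c
/* Add each odd-indexed neighbor into the following even-indexed
   element. Writes touch only even indices and reads only odd ones,
   so no iteration depends on another and the loop parallelizes. */
void smooth_even(double *a, int n)
{
    #pragma omp parallel for
    for (int i = 2; i < n; i += 2)
        a[i] = a[i] + a[i - 1];
}
```

The stride-1 version of the same statement is a prefix sum, whose loop-carried flow dependence cannot be removed this way.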

33 Making sure everyone has enough work  Parallel overhead – creation of threads and synchronization vs. work done in the loop

!$OMP parallel do schedule(dynamic,3)

 Schedule types – static, dynamic, guided, runtime
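A sketch of when `schedule(dynamic,3)` pays off; `uneven` is an illustrative name. With iteration cost growing with i, a static split would leave one thread with most of the work, while dynamic scheduling hands out chunks of 3 iterations to whichever thread is free:

```c
/* Iterations with very uneven cost: iteration i does i*i + 1 units
   of (dummy) work. dynamic,3 balances the load across threads; the
   reduction combines the per-thread totals. */
long uneven(int n)
{
    long total = 0;
    #pragma omp parallel for schedule(dynamic, 3) reduction(+:total)
    for (int i = 0; i < n; i++) {
        long work = 0;                  /* declared in the loop: private */
        for (int j = 0; j <= i * i; j++)
            work += 1;
        total += work;
    }
    return total;
}
```

The trade-off: dynamic scheduling adds per-chunk synchronization overhead, so it only helps when iteration costs genuinely differ.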

34 Parallel regions – from fine to coarse parallelism  !$OMP parallel  threadprivate and copyin  Work sharing constructs – do, sections, section, single  Synchronization – critical, atomic, barrier, ordered, master

35 To distributed memory systems  MPI, PVM, BSP …

36 Some Parallel Libraries  Existing parallel libraries and toolkits include:  PUL, the Parallel Utilities Library from EPCC.  The Multicomputer Toolbox from Tony Skjellum and colleagues at LLNL and MSU.  PETSc, the Portable, Extensible Toolkit for Scientific computation from ANL.  ScaLAPACK from ORNL and UTK.  ESSL and PESSL on AIX.  PBLAS, PLAPACK, ARPACK.


