
1 Parallel Programming on the SGI Origin2000 Anne Weill-Zrahia, Taub Computer Center, Technion, Mar 2005. With thanks to Moshe Goldberg (TCC) and Igor Zacharov (SGI)

2 Parallel Programming on the SGI Origin2000
1) Parallelization Concepts
2) SGI Computer Design
3) Efficient Scalar Design
4) Parallel Programming - OpenMP
5) Parallel Programming - MPI

3 4) Parallel Programming - OpenMP

4 Is this your joint bank account? Limor in Haifa and Shimon in Tel Aviv both read the initial amount, IL500, at the same time. Limor takes IL150 and writes back IL350; Shimon takes IL400 and writes back IL100. The final amount depends on whose write lands last (IL350 or IL100), not on what was actually withdrawn.

5 Introduction - Parallelization instruction to the compiler: f77 -o prog -mp prog.f or: f77 -o prog -pfa prog.f
- Now try to understand what a compiler has to determine when deciding how to parallelize
- Note that when we talk loosely about parallelization, what is meant is: "Is the program as presented here parallelizable?"
- This is an important distinction, because sometimes rewriting can transform non-parallelizable code into a parallelizable form, as we will see...

6 Data dependency types
1) Iteration i depends on values calculated in the previous iteration i-1 (loop-carried dependence):
      do i=2,n
         a(i) = a(i-1)      ! cannot be parallelized
      enddo
2) Data dependence within a single iteration (non-loop-carried dependence):
      do i=2,n
         c = ...
         a(i) = ... c ...   ! parallelizable
      enddo
3) Reduction:
      do i=1,n
         s = s + x          ! parallelizable
      enddo
All data dependencies in programs are variations on these fundamental types.

7 Data dependency analysis Question: Are the following loops parallelizable?
      do i=2,n
         a(i) = b(i-1)   ! YES!
      enddo
      do i=2,n
         a(i) = a(i-1)   ! NO!
      enddo
Why?

8 Data dependency analysis
      do i=2,n
         a(i) = b(i-1)
      enddo
YES! With three CPUs:
cycle1: CPU1 A(2)=B(1), CPU2 A(3)=B(2), CPU3 A(4)=B(3)
cycle2: CPU1 A(5)=B(4), CPU2 A(6)=B(5), CPU3 A(7)=B(6)

9 Data dependency analysis
      do i=2,n
         a(i) = a(i-1)
      enddo
Scalar (non-parallel) run on one CPU:
cycle1: A(2)=A(1)
cycle2: A(3)=A(2)
cycle3: A(4)=A(3)
cycle4: A(5)=A(4)
In each cycle, NEW data from the previous cycle is read.

10 Data dependency analysis
      do i=2,n
         a(i) = a(i-1)
      enddo
No! With three CPUs:
cycle1: CPU1 A(2)=A(1), CPU2 A(3)=A(2), CPU3 A(4)=A(3)
The reads of A(2) and A(3) will probably pick up OLD data.

11 Data dependency analysis
      do i=2,n
         a(i) = a(i-1)
      enddo
No! With three CPUs:
cycle1: CPU1 A(2)=A(1), CPU2 A(3)=A(2), CPU3 A(4)=A(3)
cycle2: CPU1 A(5)=A(4), CPU2 A(6)=A(5), CPU3 A(7)=A(6)
Some reads may see NEW data, but most will probably read OLD data.

12 Data dependency analysis Another question: Are the following loops parallelizable?
      do i=3,n,2
         a(i) = a(i-1)   ! YES!
      enddo
      do i=1,n
         s = s + a(i)    ! Depends!
      enddo

13 Data dependency analysis
      do i=3,n,2
         a(i) = a(i-1)
      enddo
YES! With three CPUs:
cycle1: CPU1 A(3)=A(2), CPU2 A(5)=A(4), CPU3 A(7)=A(6)
cycle2: CPU1 A(9)=A(8), CPU2 A(11)=A(10), CPU3 A(13)=A(12)

14 Data dependency analysis
      do i=1,n
         s = s + a(i)
      enddo
Depends! With three CPUs:
cycle1: CPU1 S=S+A(1), CPU2 S=S+A(2), CPU3 S=S+A(3)
cycle2: CPU1 S=S+A(4), CPU2 S=S+A(5), CPU3 S=S+A(6)
- The value of S will be undetermined, and typically it will vary from one run to the next
- This bug in parallel programming is called a "race condition"

15 Data dependency analysis What is the principle involved here? The examples shown fall into two categories:
1) Data being read is independent of data that is written:
      a(i) = b(i-1)   i=2,3,4...
      a(i) = a(i-1)   i=3,5,7...
2) Data being read depends on data that is written:
      a(i) = a(i-1)   i=2,3,4...
      s = s + a(i)    i=1,2,3...

16 Data dependency analysis Here is a typical situation: Is there a data dependency in the following loop?
      do i = 1,n
         a(i) = sin(x(i))
         result = a(i) + b(i)
         c(i) = result * c(i)
      enddo
No! Clearly, "result" is a temporary variable that is reassigned in every iteration. Note: "result" must be a "private" variable (this will be discussed later).
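To anticipate the point about privatization, here is a minimal sketch of how this loop could be written with the OpenMP directives introduced later in the course; the array names and size n follow the slide, while the surrounding program setup and initial values are assumptions added only for illustration.

      program privdemo
      integer n, i
      parameter (n=100)
      real a(n), b(n), c(n), x(n), result
c     illustrative initialization of the input arrays
      do i = 1,n
         x(i) = 0.01*i
         b(i) = 1.0
         c(i) = 2.0
      enddo
c     "result" is declared private so each thread gets its own copy
c$omp parallel do private(i,result) shared(a,b,c,x)
      do i = 1,n
         a(i) = sin(x(i))
         result = a(i) + b(i)
         c(i) = result * c(i)
      enddo
c$omp end parallel do
      print *, c(1), c(n)
      end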

17 Data dependency analysis Here is a (slightly different) typical situation: Is there a data dependency in the following loop?
      do i = 1,n
         a(i) = sin(result)
         result = a(i) + b(i)
         c(i) = result * c(i)
      enddo
Yes! The value of "result" is carried over from one iteration to the next. This is the classical read/write situation, but now it is somewhat hidden.

18 Data dependency analysis The loop could (symbolically) be rewritten:
      do i = 1,n
         a(i) = sin(result(i-1))
         result(i) = a(i) + b(i)
         c(i) = result(i) * c(i)
      enddo
Now substitute the expression for a(i):
      do i = 1,n
         a(i) = sin(result(i-1))
         result(i) = sin(result(i-1)) + b(i)
         c(i) = result(i) * c(i)
      enddo
This is really of the type "a(i)=a(i-1)"!

19 Data dependency analysis One more: Can the following loop be parallelized?
      do i = 3,n
         a(i) = a(i-2)
      enddo
If this is parallelized, there will probably be different answers from one run to another. Why?

20 Data dependency analysis
      do i = 3,n
         a(i) = a(i-2)
      enddo
With two CPUs:
cycle1: CPU1 A(3)=A(1), CPU2 A(4)=A(2)
cycle2: CPU1 A(5)=A(3), CPU2 A(6)=A(4)
This looks like it will be safe.

21 Data dependency analysis
      do i = 3,n
         a(i) = a(i-2)
      enddo
HOWEVER: what if there are 3 CPUs and not 2?
cycle1: CPU1 A(3)=A(1), CPU2 A(4)=A(2), CPU3 A(5)=A(3)
In this case, a(3) is read and written in two threads at once.

22-23 RISC memory levels (diagram): a single CPU with its cache, connected to main memory

24-26 RISC memory levels (diagram): multiple CPUs (CPU 0, CPU 1), each with its own cache (Cache 0, Cache 1), sharing main memory

27 Definition of OpenMP
- Application Program Interface (API) for shared-memory parallel programming
- Directive-based approach with library support
- Targets existing applications and widely used languages:
  * Fortran API first released October 1997
  * C, C++ API first released October 1998
- Multi-vendor/platform support

28 Why was OpenMP developed?
- Parallel programming before OpenMP
  * Standards existed for distributed memory (MPI and PVM)
  * No standard for shared-memory programming
- Vendors had different directive-based APIs for SMP
  * SGI, Cray, Kuck & Assoc, DEC
  * Vendor proprietary, similar but not the same
  * Most were targeted at loop-level parallelism
- Commercial users and high-end software vendors have a big investment in existing codes
- End result: users wanting portability were forced to use MPI even for shared memory
  * This sacrifices the built-in SMP hardware benefits
  * Requires major effort

29 The Spread of OpenMP
Organization: Architecture Review Board
Web site: www.openmp.org
Hardware: HP/DEC, IBM, Intel, SGI, Sun
Software: Portland (PGI), NAG, Intel, Kuck & Assoc (KAI), Absoft

30 OpenMP Interface model
Directives and pragmas:
  * control structures
  * work sharing
  * data scope attributes: private, firstprivate, lastprivate, shared, reduction
Runtime library routines:
  * control and query: number of threads, nested parallel?, throughput mode
  * lock API
Environment variables:
  * schedule type
  * max number of threads
  * nested parallelism
  * throughput mode

31 OpenMP execution model An OpenMP program starts in a single thread, in sequential mode. To create additional threads, the user opens a parallel region:
  * additional slave threads are launched
  * the master thread is part of the team
  * the threads "disappear" at the end of the parallel region
This model is repeated as needed (in the figure: a master thread forking parallel regions of 4, 2, and 3 threads in turn).

32 Creating parallel threads
Fortran:
c$omp parallel [clause,clause]
      code to run in parallel
c$omp end parallel
C/C++:
#pragma omp parallel [clause,clause]
{ code to run in parallel }
The parallel region replicates execution, e.g.:
      i=0
c$omp parallel
      call foo(i,a,b)
c$omp end parallel
      print*,i
Each thread in the team calls foo(i,a,b). Number of threads: set by a library call or an environment variable.
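As a concrete illustration of the execution model, here is a minimal sketch of a program that opens one parallel region and has every thread report its identity. The routines omp_get_thread_num and omp_get_num_threads are from the runtime library listed later (slide 49); the program framing itself is an assumption added for illustration.

      program hello
      integer iam, nthreads
      integer omp_get_thread_num, omp_get_num_threads
      print *, 'before the parallel region: one thread'
c$omp parallel private(iam) shared(nthreads)
c     every thread executes this block
      iam = omp_get_thread_num()
c$omp single
      nthreads = omp_get_num_threads()
c$omp end single
      print *, 'hello from thread', iam, 'of', nthreads
c$omp end parallel
      print *, 'after the parallel region: one thread again'
      end

Run with the number of threads set in the environment (e.g. OMP_NUM_THREADS=4); the order of the "hello" lines will vary from run to run.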


36 OpenMP on the Origin 2000 Switches, formats
Compile with: f77 -mp
Directive format (with continuation line):
c$omp parallel do
c$omp+shared(a,b,c)
OR
c$omp parallel do shared(a,b,c)
Conditional compilation: lines starting with c$ are compiled only when -mp is in effect, e.g.
c$    iam = omp_get_thread_num()+1
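A minimal sketch of how the conditional-compilation sentinel lets the same source file work both with and without -mp; the variable name iam and the program framing are illustrative assumptions.

      program condcomp
      integer iam
      integer omp_get_thread_num
c     without -mp the c$ line below is an ordinary comment and iam stays 1;
c     with -mp it is compiled and each thread gets its own number
      iam = 1
c$omp parallel private(iam)
c$    iam = omp_get_thread_num()+1
      print *, 'I am number', iam
c$omp end parallel
      end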

37 OpenMP on the Origin 2000 - C Switches, formats
Compile with: cc -mp
#pragma omp parallel for\
      shared(a,b,c)
OR
#pragma omp parallel for shared(a,b,c)

38 OpenMP on the Origin 2000 Parallel Do Directive Topics: Clauses, Detailed construct
c$omp parallel do private(i)
      do i=1,n
         a(i) = i+1
      enddo
c$omp end parallel do     ! --> optional

39 OpenMP on the Origin 2000 Parallel Do Directive - Clauses
shared
private
default(private|shared|none)
firstprivate
lastprivate
reduction({operator|intrinsic}:var)
schedule(type[,chunk])
if(scalar_logical_expression)
ordered
copyin(var)

40 Allocating private and shared variables (diagram): a shared variable S exists in the single thread before the parallel region, throughout the parallel region, and in the single thread afterwards; each thread in the parallel region gets its own copy of a private variable P. (S = shared variable, P = private variable)

41 Clauses in OpenMP - 1 Clauses for the "parallel" directive specify data association rules and conditional computation:
shared (list) - data accessible by all threads, which all refer to the same storage
private (list) - data private to each thread; a new storage location is created with that name for each thread, and the contents of that storage are not available outside the parallel region
default (private | shared | none) - default association for variables not otherwise mentioned
firstprivate (list) - same as private(list), but the contents are given an initial value from the variable with the same name outside the parallel region
lastprivate (list) - available only for work-sharing constructs; a shared variable with that name is set to the last computed value of a thread-private variable in the work-sharing construct
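To make firstprivate and lastprivate concrete, here is a minimal sketch; the variable names and the loop are illustrative assumptions, not from the slides. Each thread's x starts from the value set before the loop, and ilast after the loop holds the value from the sequentially last iteration.

      program fplpdemo
      integer n, i, ilast
      parameter (n=10)
      real x, a(n)
      x = 100.0
c$omp parallel do firstprivate(x) lastprivate(ilast) shared(a)
      do i = 1,n
c        each thread starts with its own copy of x initialized to 100.0
         a(i) = x + i
c        ilast is private inside the loop; after the loop it holds
c        the value from iteration i = n
         ilast = i
      enddo
c$omp end parallel do
      print *, 'a(n) =', a(n), '  ilast =', ilast
      end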

42 Clauses in OpenMP - 2
reduction ({op|intrinsic}:list)
- variables in the list are named scalars of intrinsic type
- a private copy of each variable will be made in each thread and initialized according to the intended operation
- at the end of the parallel region or other synchronization point all private copies will be combined
- the operation must be of one of the forms (where expr does not contain x):
  x = x op expr
  x = intrinsic(x,expr)
  if (x.LT.expr) x = expr
  x++; x--; ++x; --x
- initial values (C): + or - : 0, * : 1, & : ~0, | : 0, ^ : 0, && : 1, || : 0
- initial values (Fortran): + or - : 0, * : 1, .AND. : .TRUE., .OR. : .FALSE., .EQV. : .TRUE., .NEQV. : .FALSE., MAX : smallest number, MIN : largest number, IAND : all bits on, IOR or IEOR : 0
- example: c$omp parallel do reduction(+:a,y) reduction(.OR.:s)
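Tying this back to the sum loop of slide 14, here is a minimal sketch of how the race condition is removed with a reduction clause; the array contents and size are illustrative assumptions.

      program reddemo
      integer n, i
      parameter (n=1000)
      real a(n), s
      do i = 1,n
         a(i) = 1.0
      enddo
      s = 0.0
c     each thread accumulates into a private copy of s (initialized to 0),
c     and the private copies are combined at the end of the loop
c$omp parallel do reduction(+:s) shared(a)
      do i = 1,n
         s = s + a(i)
      enddo
c$omp end parallel do
      print *, 's =', s, ' (expected', real(n), ')'
      end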

43 Clauses in OpenMP - 3
copyin(list) - the list must contain common block (or global) names that have been declared threadprivate; data in the master thread's copy of that common block will be copied to the thread-private storage at the beginning of the parallel region; there is no "copyout" clause - data in a private common block is not available outside of that thread
if (scalar_logical_expression) - when an "if" clause is present, the enclosed code block is executed in parallel only if the scalar_logical_expression is .TRUE.
ordered - only for do/for work-sharing constructs; the code in the ORDERED block will be executed in the same sequence as sequential execution
schedule (kind[,chunk]) - only for do/for work-sharing constructs; specifies the scheduling discipline for loop iterations
nowait - the end of a work-sharing construct and the SINGLE directive imply a synchronization point unless "nowait" is specified
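A minimal sketch of threadprivate and copyin with a named common block; the block name /work/ and the variable names are illustrative assumptions.

      program copyindemo
      integer iam, nval
      integer omp_get_thread_num
      common /work/ nval
c$omp threadprivate(/work/)
      nval = 42
c     copyin gives every thread's private copy of /work/ the
c     master thread's value (42) at the start of the region
c$omp parallel private(iam) copyin(/work/)
      iam = omp_get_thread_num()
      print *, 'thread', iam, ' sees nval =', nval
c$omp end parallel
      end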

44 OpenMP on the Origin 2000 Parallel Sections Directive Topics: Clauses, Detailed construct
c$omp parallel sections private(i)
c$omp section
      block1
c$omp section
      block2
c$omp end parallel sections

45 OpenMP on the Origin 2000 Parallel Sections Directive - Clauses
shared
private
default(private|shared|none)
firstprivate
lastprivate
reduction({operator|intrinsic}:var)
if(scalar_logical_expression)
copyin(var)
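A minimal sketch of two independent pieces of work run as parallel sections; the loop bodies and array names are illustrative assumptions.

      program secdemo
      integer n, i
      parameter (n=8)
      real a(n), b(n)
c     the two sections are independent, so one thread can fill a
c     while another fills b
c$omp parallel sections private(i) shared(a,b)
c$omp section
      do i = 1,n
         a(i) = 2.0*i
      enddo
c$omp section
      do i = 1,n
         b(i) = i*i
      enddo
c$omp end parallel sections
      print *, a(n), b(n)
      end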

46 OpenMP on the Origin 2000 Defining a Parallel Region - Individual Do Loops
c$omp parallel shared(a,b)
c$omp do private(j)
      do j=1,n
         a(j)=j
      enddo
c$omp end do nowait
c$omp do private(k)
      do k=1,n
         b(k)=k
      enddo
c$omp end do
c$omp end parallel

47 OpenMP on the Origin 2000 Defining a Parallel Region - Explicit Sections
c$omp parallel shared(a,b)
c$omp sections
c$omp section
      block1
c$omp section
      block3
c$omp end sections
c$omp single
      block2
c$omp end single
c$omp end parallel

48 OpenMP on the Origin 2000 Synchronization Constructs
master/end master
critical/end critical
barrier
atomic
flush
ordered/end ordered
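A minimal sketch of the critical construct protecting an update of a shared variable; the variable names are illustrative assumptions, and for a simple sum the reduction clause shown earlier is usually the better choice.

      program critdemo
      integer n, i
      parameter (n=100)
      real a(n), s
      do i = 1,n
         a(i) = 1.0
      enddo
      s = 0.0
c$omp parallel do shared(a,s) private(i)
      do i = 1,n
c        only one thread at a time may execute the critical block,
c        so the read-modify-write of s is not a race
c$omp critical
         s = s + a(i)
c$omp end critical
      enddo
c$omp end parallel do
      print *, 's =', s
      end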

49 OpenMP on the Origin 2000 Run-Time Library Routines - Execution environment
omp_set_num_threads
omp_get_num_threads
omp_get_max_threads
omp_get_thread_num
omp_get_num_procs
omp_in_parallel
omp_set_dynamic / omp_get_dynamic
omp_set_nested / omp_get_nested
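A minimal sketch exercising a few of these execution-environment routines; the printed labels and program framing are illustrative assumptions.

      program envdemo
      integer omp_get_num_procs, omp_get_max_threads
      integer omp_get_num_threads
      logical omp_in_parallel
c     request 2 threads for subsequent parallel regions
      call omp_set_num_threads(2)
      print *, 'processors available:', omp_get_num_procs()
      print *, 'max threads:         ', omp_get_max_threads()
      print *, 'in parallel region?  ', omp_in_parallel()
c$omp parallel
c$omp single
      print *, 'threads in this region:', omp_get_num_threads()
      print *, 'in parallel region?    ', omp_in_parallel()
c$omp end single
c$omp end parallel
      end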

50 OpenMP on the Origin 2000 Run-Time Library Routines - Lock routines
omp_init_lock
omp_destroy_lock
omp_set_lock
omp_unset_lock
omp_test_lock
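A minimal sketch of the lock routines guarding a shared counter; the counter and loop are illustrative assumptions. The lock variable must be an integer wide enough to hold an address; integer*8 is an assumption for a 64-bit system (the omp_lock_kind type from the OpenMP Fortran include file is the portable choice).

      program lockdemo
      integer n, i, count
c     lock variable sized to hold an address (assumption: 64-bit system)
      integer*8 lck
      parameter (n=50)
      count = 0
      call omp_init_lock(lck)
c$omp parallel do shared(count,lck) private(i)
      do i = 1,n
c        acquire the lock so only one thread updates count at a time
         call omp_set_lock(lck)
         count = count + 1
         call omp_unset_lock(lck)
      enddo
c$omp end parallel do
      call omp_destroy_lock(lck)
      print *, 'count =', count, ' (expected', n, ')'
      end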

51 OpenMP on the Origin 2000 Environment Variables
OMP_NUM_THREADS (or MP_SET_NUMTHREADS)
OMP_DYNAMIC
OMP_NESTED

52 Exercise 5 – OpenMP to parallelize a loop


54 (code figure for Exercise 5: main loop and initial values)


57 Enhancing Performance
- Ensuring sufficient work: running a loop in parallel adds runtime costs, so each loop must contain enough work to justify them
- Scheduling loops for load balancing

58 The SCHEDULE clause SCHEDULE (TYPE[,CHUNK])
Static: each thread is assigned one chunk of iterations, of the size given by CHUNK or, by default, of equal size
Dynamic: at runtime, chunks are assigned to threads dynamically as they become free
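A minimal sketch contrasting the schedule types on a loop whose iterations do unequal amounts of work; the workload (an inner loop whose length grows with i) is an illustrative assumption.

      program scheddemo
      integer n, i, j
      parameter (n=16)
      real a(n), t
c     iteration i does roughly i units of work, so a static schedule
c     gives the last thread much more to do; dynamic with a small
c     chunk balances the load at the cost of more scheduling overhead
c$omp parallel do schedule(dynamic,2) private(i,j,t) shared(a)
      do i = 1,n
         t = 0.0
         do j = 1,1000*i
            t = t + 1.0/real(j)
         enddo
         a(i) = t
      enddo
c$omp end parallel do
      print *, a(1), a(n)
      end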

59 OpenMP summary
- A small number of compiler directives to set up parallel execution of code, plus a runtime library (including locking functions)
- Portable directives (supported by different vendors in the same way)
- Parallelization follows the SMP programming model - the machine should have a global address space
- The number of execution threads is controlled outside the program
- A correct OpenMP program should not depend on the exact number of execution threads nor on the scheduling mechanism for work distribution
- In addition, a correct OpenMP program should be (weakly) serially equivalent - that is, the results of the computation should agree with the sequential program to within rounding accuracy
- On SGI, OpenMP programming can be mixed with the MPI library, so that it is possible to have "hierarchical parallelism":
  * OpenMP parallelism within a single node (global address space)
  * MPI parallelism between nodes in a cluster (network connection)

