1  Open Multiprocessing
Dr. Bo Yuan
E-mail: yuanb@sz.tsinghua.edu.cn

2  Note on Parallel Programming
An incorrect program may produce correct results.
–The order of execution of processes/threads is unpredictable.
–It may depend on your luck!
A program that always produces correct results may not make sense.
–The outputs of a program are just part of the story.
–Efficiency matters!

3  OpenMP
An API for shared-memory multiprocessing (parallel) programming in C, C++ and Fortran.
–Supports multiple platforms (processor architectures and operating systems).
–Higher-level implementation: you mark a block of code that should be executed in parallel.
A method of parallelizing whereby a master thread forks a number of slave threads and a task is divided among them.
Based on preprocessor directives (pragmas).
–Requires compiler support.
–omp.h
References
–http://openmp.org/
–https://computing.llnl.gov/tutorials/openMP/
–http://supercomputingblog.com/openmp/

4  Hello, World!

  #include <stdio.h>
  #include <stdlib.h>
  #include <omp.h>

  void Hello(void);

  int main(int argc, char* argv[]) {
    /* Get the number of threads from the command line */
    int thread_count = strtol(argv[1], NULL, 10);

  # pragma omp parallel num_threads(thread_count)
    Hello();

    return 0;
  }

  void Hello(void) {
    int my_rank = omp_get_thread_num();
    int thread_count = omp_get_num_threads();
    printf("Hello from thread %d of %d\n", my_rank, thread_count);
  }
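A usage sketch (the file and executable names here are illustrative, not from the slides): with GCC, OpenMP support is enabled with the -fopenmp flag, and the thread count is passed on the command line, as the program expects.

  gcc -g -Wall -fopenmp -o omp_hello omp_hello.c
  ./omp_hello 4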

5  Definitions

  # pragma omp parallel [clauses]
  {
    code_block
  }

clauses: text that modifies the directive.
There is an implicit barrier at the end of the parallel block.
Thread Team = Master + Slaves

Error Checking

  #ifdef _OPENMP
  # include <omp.h>
  #endif

  #ifdef _OPENMP
    int my_rank = omp_get_thread_num();
    int thread_count = omp_get_num_threads();
  #else
    int my_rank = 0;
    int thread_count = 1;
  #endif

6  The Trapezoidal Rule

Serial code:

  /* Input: a, b, n */
  h = (b-a)/n;
  approx = (f(a)+f(b))/2.0;
  for (i = 1; i <= n-1; i++) {
    x_i = a + i*h;
    approx += f(x_i);
  }
  approx = h*approx;

Combining the threads' partial results:

  # pragma omp critical
  global_result += my_result;

Shared Memory → Shared Variables → Race Condition

7  The critical Directive

Blocks protected by unnamed critical directives cannot be executed simultaneously:

  # pragma omp critical
  y = f(x);
  ...
  double f(double x) {
  # pragma omp critical
    z = g(x);
    ...
  }

Giving the critical sections different names allows them to be executed concurrently:

  # pragma omp critical(one)
  y = f(x);
  ...
  double f(double x) {
  # pragma omp critical(two)
    z = g(x);
    ...
  }

8  The atomic Directive

  # pragma omp atomic
  x <op>= <expression>;

<op> can be one of the binary operators: +, *, -, /, &, ^, |, <<, >>
Also allowed: x++;  ++x;  x--;  --x;
Higher performance than the critical directive.
Only a single C assignment statement is protected.
Only the load and store of x are protected.
<expression> must not reference x.
Note that atomic and critical do not exclude each other; these two statements can be executed simultaneously:

  # pragma omp atomic
  x += f(y);

  # pragma omp critical
  x = g(x);
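As a minimal sketch of the typical use of atomic, a shared counter incremented by several threads (the variable names and thread count are illustrative, not from the slides):

  #include <stdio.h>
  #include <omp.h>

  int main(void) {
    int hits = 0;                      /* shared counter */
  # pragma omp parallel num_threads(4)
    {
      int i;
      for (i = 0; i < 1000; i++) {
  #     pragma omp atomic
        hits++;                        /* only the update of hits is protected */
      }
    }
    printf("hits = %d\n", hits);       /* expected: 4000 */
    return 0;
  }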

9  Locks

  /* Executed by one thread */
  Initialize the lock data structure;
  ...
  /* Executed by multiple threads */
  Attempt to lock or set the lock data structure;
  Critical section;
  Unlock or unset the lock data structure;
  ...
  /* Executed by one thread */
  Destroy the lock data structure;

  void omp_init_lock(omp_lock_t* lock_p);
  void omp_set_lock(omp_lock_t* lock_p);
  void omp_unset_lock(omp_lock_t* lock_p);
  void omp_destroy_lock(omp_lock_t* lock_p);
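A minimal sketch of the pattern above in C, protecting a shared sum with a lock (variable names and the per-thread "work" are illustrative):

  #include <stdio.h>
  #include <omp.h>

  int main(void) {
    omp_lock_t lock;
    double total = 0.0;

    omp_init_lock(&lock);                           /* executed once, by one thread */

  # pragma omp parallel num_threads(4)
    {
      double my_part = 1.0 + omp_get_thread_num();  /* stand-in for real work */
      omp_set_lock(&lock);                          /* enter the critical section */
      total += my_part;
      omp_unset_lock(&lock);                        /* leave the critical section */
    }

    omp_destroy_lock(&lock);                        /* executed once, by one thread */
    printf("total = %f\n", total);
    return 0;
  }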

10  Trapezoidal Rule in OpenMP

  #include <stdio.h>
  #include <stdlib.h>
  #include <omp.h>

  void Trap(double a, double b, int n, double* global_result_p);

  int main(int argc, char* argv[]) {
    double global_result = 0.0;
    double a, b;
    int n, thread_count;

    thread_count = strtol(argv[1], NULL, 10);
    printf("Enter a, b, and n\n");
    scanf("%lf %lf %d", &a, &b, &n);

  # pragma omp parallel num_threads(thread_count)
    Trap(a, b, n, &global_result);

    printf("With n = %d trapezoids, our estimate\n", n);
    printf("of the integral from %f to %f = %.15e\n", a, b, global_result);
    return 0;
  }

11  Trapezoidal Rule in OpenMP

  void Trap(double a, double b, int n, double* global_result_p) {
    double h, x, my_result;
    double local_a, local_b;
    int i, local_n;
    int my_rank = omp_get_thread_num();
    int thread_count = omp_get_num_threads();

    h = (b-a)/n;
    local_n = n/thread_count;
    local_a = a + my_rank*local_n*h;
    local_b = local_a + local_n*h;

    my_result = (f(local_a) + f(local_b))/2.0;
    for (i = 1; i <= local_n-1; i++) {
      x = local_a + i*h;
      my_result += f(x);
    }
    my_result = my_result*h;

  # pragma omp critical
    *global_result_p += my_result;
  }

12  Scope of Variables
Private scope
–Only accessible by a single thread.
–Declared in the code block.
–Examples: my_rank, my_result
Shared scope
–Accessible by all threads in a team.
–Declared before a parallel directive.
–Examples: a, b, n, global_result, thread_count
global_result_p is private to each thread, but *global_result_p refers to the shared global_result.
In serial programming:
–Function-wide scope
–File-wide scope

13  Another Trap Function

  double Local_trap(double a, double b, int n);

Version 1: the critical section encloses the whole call, so the threads execute Local_trap one at a time.

  global_result = 0.0;
  # pragma omp parallel num_threads(thread_count)
  {
  # pragma omp critical
    global_result += Local_trap(a, b, n);
  }

Version 2: each thread computes a private my_result in parallel; only the final addition is serialized.

  global_result = 0.0;
  # pragma omp parallel num_threads(thread_count)
  {
    double my_result = 0.0;  /* Private */
    my_result = Local_trap(a, b, n);
  # pragma omp critical
    global_result += my_result;
  }

14  The Reduction Clause
Reduction: a computation (binary operation) that repeatedly applies the same reduction operator (e.g., addition or multiplication) to a sequence of operands in order to get a single result.

  reduction(<operator>: <variable list>)

  global_result = 0.0;
  # pragma omp parallel num_threads(thread_count) \
          reduction(+: global_result)
  global_result = Local_trap(a, b, n);

Note:
–The reduction variable itself is shared.
–A private copy is created for each thread in the team.
–The private copies are initialized to 0 for the addition operator.

15  The parallel for Directive

Serial:

  h = (b-a)/n;
  approx = (f(a)+f(b))/2.0;
  for (i = 1; i <= n-1; i++) {
    approx += f(a + i*h);
  }
  approx = h*approx;

Parallel:

  h = (b-a)/n;
  approx = (f(a)+f(b))/2.0;
  # pragma omp parallel for num_threads(thread_count) \
          reduction(+: approx)
  for (i = 1; i <= n-1; i++) {
    approx += f(a + i*h);
  }
  approx = h*approx;

The code block must be a for loop.
Iterations of the for loop are divided among the threads.
approx is a reduction variable.
i is a private variable.

16  The parallel for Directive
Sounds like a truly wonderful approach to parallelizing serial programs, but it has limitations.
Does not work with while or do-while loops.
–How about converting them into for loops? See the sketch below.
The number of iterations must be determined in advance, so loops like these cannot be parallelized:

  for (; ;) {
    ...
  }

  for (i = 0; i < n; i++) {
    if (...) break;
    ...
  }

A nested-loop example; the inner loop variable must be declared private:

  int x, y;
  # pragma omp parallel for num_threads(thread_count) private(y)
  for (x = 0; x < width; x++) {
    for (y = 0; y < height; y++) {
      finalImage[x][y] = f(x, y);
    }
  }
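As a sketch of the conversion mentioned above (illustrative code, not from the slides): a while loop whose trip count is known before it starts can be rewritten as a canonical for loop, which parallel for accepts.

  /* The same work written two ways; only the second can carry a parallel for. */
  void Fill(double a[], int n, int thread_count) {
    int i = 0;
    while (i < n) {               /* trip count is n, but not in canonical form */
      a[i] = 2.0*i;
      i++;
    }

  # pragma omp parallel for num_threads(thread_count)
    for (i = 0; i < n; i++)       /* equivalent canonical for loop */
      a[i] = 2.0*i;
  }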

17  Estimating π

Serial:

  double factor = 1.0;
  double sum = 0.0;
  for (k = 0; k < n; k++) {
    sum += factor/(2*k+1);
    factor = -factor;
  }
  pi_approx = 4.0*sum;

A straightforward parallel for version; is it correct?

  double factor = 1.0;
  double sum = 0.0;
  # pragma omp parallel for \
          num_threads(thread_count) \
          reduction(+: sum)
  for (k = 0; k < n; k++) {
    sum += factor/(2*k+1);
    factor = -factor;
  }
  pi_approx = 4.0*sum;

No: the value of factor in one iteration depends on the previous iteration, a loop-carried dependence.

18  Estimating π

Remove the dependence by computing factor from k:

  if (k % 2 == 0)
    factor = 1.0;
  else
    factor = -1.0;
  sum += factor/(2*k+1);

or, equivalently:

  factor = (k % 2 == 0) ? 1.0 : -1.0;
  sum += factor/(2*k+1);

  double factor = 1.0;
  double sum = 0.0;
  # pragma omp parallel for num_threads(thread_count) \
          reduction(+: sum) private(factor)
  for (k = 0; k < n; k++) {
    if (k % 2 == 0)
      factor = 1.0;
    else
      factor = -1.0;
    sum += factor/(2*k+1);
  }
  pi_approx = 4.0*sum;

19  Scope Matters

  double factor = 1.0;
  double sum = 0.0;
  # pragma omp parallel for num_threads(thread_count) \
          default(none) reduction(+: sum) private(k, factor) shared(n)
  for (k = 0; k < n; k++) {
    if (k % 2 == 0)
      factor = 1.0;
    else
      factor = -1.0;
    sum += factor/(2*k+1);
  }
  pi_approx = 4.0*sum;

With the default(none) clause, we must specify the scope of every variable used in the block that was declared outside the block.
The value of a variable with private scope is unspecified at the beginning (and after completion) of a parallel or parallel for block.
In particular, the private factor does not inherit the value assigned before the loop, so it must be assigned inside each iteration.

20  Bubble Sort

  for (len = n; len >= 2; len--)
    for (i = 0; i < len-1; i++)
      if (a[i] > a[i+1]) {
        tmp = a[i];
        a[i] = a[i+1];
        a[i+1] = tmp;
      }

Can we make it faster?
Can we parallelize the outer loop?
Can we parallelize the inner loop?

21  Odd-Even Sort

Sorting the array 9, 7, 8, 6 (subscripts 0 to 3):

  Phase 0 (even):  9 7 8 6  →  7 9 6 8
  Phase 1 (odd):   7 9 6 8  →  7 6 9 8
  Phase 2 (even):  7 6 9 8  →  6 7 8 9
  Phase 3 (odd):   6 7 8 9  →  6 7 8 9

Any opportunities for parallelism?

22  Odd-Even Sort

  void Odd_even_sort(int a[], int n) {
    int phase, i, temp;
    for (phase = 0; phase < n; phase++)
      if (phase % 2 == 0) {    /* Even phase */
        for (i = 1; i < n; i += 2)
          if (a[i-1] > a[i]) {
            temp = a[i];
            a[i] = a[i-1];
            a[i-1] = temp;
          }
      } else {                 /* Odd phase */
        for (i = 1; i < n-1; i += 2)
          if (a[i] > a[i+1]) {
            temp = a[i];
            a[i] = a[i+1];
            a[i+1] = temp;
          }
      }
  }

23  Odd-Even Sort in OpenMP

  for (phase = 0; phase < n; phase++) {
    if (phase % 2 == 0) {      /* Even phase */
  #   pragma omp parallel for num_threads(thread_count) \
            default(none) shared(a, n) private(i, temp)
      for (i = 1; i < n; i += 2)
        if (a[i-1] > a[i]) {
          temp = a[i];
          a[i] = a[i-1];
          a[i-1] = temp;
        }
    } else {                   /* Odd phase */
  #   pragma omp parallel for num_threads(thread_count) \
            default(none) shared(a, n) private(i, temp)
      for (i = 1; i < n-1; i += 2)
        if (a[i] > a[i+1]) {
          temp = a[i];
          a[i] = a[i+1];
          a[i+1] = temp;
        }
    }
  }

24  Odd-Even Sort in OpenMP

  # pragma omp parallel num_threads(thread_count) \
          default(none) shared(a, n) private(i, temp, phase)
  for (phase = 0; phase < n; phase++) {
    if (phase % 2 == 0) {      /* Even phase */
  #   pragma omp for
      for (i = 1; i < n; i += 2)
        if (a[i-1] > a[i]) {
          temp = a[i];
          a[i] = a[i-1];
          a[i-1] = temp;
        }
    } else {                   /* Odd phase */
  #   pragma omp for
      for (i = 1; i < n-1; i += 2)
        if (a[i] > a[i+1]) {
          temp = a[i];
          a[i] = a[i+1];
          a[i+1] = temp;
        }
    }
  }

25  Data Partitioning
[Figure: two ways of assigning iterations 0 to 8 to threads 0 to 2.]
Block: thread 0 gets iterations 0, 1, 2; thread 1 gets 3, 4, 5; thread 2 gets 6, 7, 8.
Cyclic: thread 0 gets iterations 0, 3, 6; thread 1 gets 1, 4, 7; thread 2 gets 2, 5, 8.

26  Scheduling Loops

  double Z[N][N];
  ...
  sum = 0.0;
  for (i = 0; i < N; i++)
    sum += f(i);

  double f(int r) {
    int i;
    double val = 0.0;
    for (i = r+1; i < N; i++) {
      val += sin(Z[r][i]);
    }
    return val;
  }

The cost of f(i) decreases as i grows (its inner loop runs N-i-1 times), so assigning blocks of consecutive iterations to threads gives an uneven load.

27  The schedule clause

  sum = 0.0;
  # pragma omp parallel for num_threads(thread_count) \
          reduction(+: sum) schedule(static, 1)
  for (i = 0; i < n; i++)
    sum += f(i);

The second argument of schedule(static, chunksize) is the chunksize. With n = 12 iterations and t = 3 threads:

  schedule(static, 1)   Thread 0: 0, 3, 6, 9    Thread 1: 1, 4, 7, 10   Thread 2: 2, 5, 8, 11
  schedule(static, 2)   Thread 0: 0, 1, 6, 7    Thread 1: 2, 3, 8, 9    Thread 2: 4, 5, 10, 11
  schedule(static, 4)   Thread 0: 0, 1, 2, 3    Thread 1: 4, 5, 6, 7    Thread 2: 8, 9, 10, 11

The default schedule is approximately schedule(static, total_iterations/thread_count).

28  The dynamic and guided Types
In a dynamic schedule:
–Iterations are broken into chunks of chunksize consecutive iterations (default chunksize: 1).
–Each thread executes a chunk; when a thread finishes a chunk, it requests another one.
In a guided schedule:
–Each thread executes a chunk; when a thread finishes a chunk, it requests another one.
–As chunks are completed, the size of the new chunks decreases.
–Each new chunk is approximately the number of remaining iterations divided by the number of threads.
–The chunk size decreases down to chunksize, or to 1 if no chunksize is given.
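As a sketch of how these schedule types are requested, here are two alternative versions of the loop from the schedule(static, 1) example on the previous slide (the chunksize values 100 and 50 are illustrative):

  /* dynamic: chunks of 100 consecutive iterations are handed out on demand */
  # pragma omp parallel for num_threads(thread_count) \
          reduction(+: sum) schedule(dynamic, 100)
  for (i = 0; i < n; i++)
    sum += f(i);

  /* guided: chunks start large and shrink, but never below 50 iterations */
  # pragma omp parallel for num_threads(thread_count) \
          reduction(+: sum) schedule(guided, 50)
  for (i = 0; i < n; i++)
    sum += f(i);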

29  Example of guided Schedule

  Thread   Chunk        Size of Chunk   Remaining Iterations
  0        1-5000       5000            4999
  1        5001-7500    2500            2499
  1        7501-8750    1250            1249
  1        8751-9375    625             624
  0        9376-9687    312             312
  1        9688-9843    156             156
  0        9844-9921    78              78
  1        9922-9960    39              39
  1        9961-9980    20              19
  1        9981-9990    10              9
  1        9991-9995    5               4
  0        9996-9997    2               2
  1        9998-9998    1               1
  0        9999-9999    1               0

30  Which schedule?
The optimal schedule depends on:
–The type of problem
–The number of iterations
–The number of threads
Overhead:
–guided > dynamic > static
–If you are getting satisfactory results (e.g., close to the theoretical maximum speedup) without a schedule clause, go no further.
The cost of the iterations:
–If it is roughly the same for all iterations, use the default schedule.
–If it decreases or increases linearly as the loop executes, a static schedule with a small chunksize will be good.
–If it cannot be determined in advance, try different options.

31  Performance Issue

Matrix-vector multiplication y = A x:

  # pragma omp parallel for num_threads(thread_count) \
          default(none) private(i, j) shared(A, x, y, m, n)
  for (i = 0; i < m; i++) {
    y[i] = 0.0;
    for (j = 0; j < n; j++)
      y[i] += A[i][j]*x[j];
  }

32  Performance Issue

Run-times and efficiencies for different matrix dimensions:

  Number of   8,000,000 x 8        8,000 x 8,000        8 x 8,000,000
  Threads     Time    Efficiency   Time    Efficiency   Time    Efficiency
  1           0.322   1.000        0.264   1.000        0.333   1.000
  2           0.219   0.735        0.189   0.698        0.300   0.555
  4           0.141   0.571        0.119   0.555        0.303   0.275

33  Performance Issue
8,000,000-by-8:
–y has 8,000,000 elements → potentially a large number of write misses.
8-by-8,000,000:
–x has 8,000,000 elements → potentially a large number of read misses.
–y has 8 elements (8 doubles), which could be stored in the same cache line (64 bytes).
–Potentially serious false-sharing effect for multiple processors.
8,000-by-8,000:
–y has 8,000 elements (8,000 doubles).
–Thread 2: elements 4000 to 5999. Thread 3: elements 6000 to 7999.
–Only {y[5996], y[5997], y[5998], y[5999], y[6000], y[6001], y[6002], y[6003]} could fall in the same cache line across these two threads.
–The effect of false sharing is highly unlikely to matter here.

34  Thread Safety
How do we generate random numbers in C?
–First, call srand() with an integer seed.
–Second, call rand() to create a sequence of random numbers.
–This is a Pseudorandom Number Generator (PRNG).
Is it thread safe?
–Can it be executed simultaneously by multiple threads without causing problems?
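The slide leaves the question open; rand() relies on hidden internal state, so one common thread-safe alternative on POSIX systems is rand_r(), which keeps its state in a caller-supplied variable. A minimal sketch (the seeding scheme is illustrative):

  #include <stdio.h>
  #include <stdlib.h>
  #include <omp.h>

  int main(void) {
  # pragma omp parallel num_threads(4)
    {
      int my_rank = omp_get_thread_num();
      unsigned int seed = 100 + my_rank;   /* per-thread state, so nothing is shared */
      int r = rand_r(&seed);               /* POSIX re-entrant variant of rand() */
      printf("Thread %d drew %d\n", my_rank, r);
    }
    return 0;
  }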

35 Foster’s Methodology Partitioning –Divide the computation and the data into small tasks. –Identify tasks that can be executed in parallel. Communication –Determine what communication needs to be carried out. –Local Communication vs. Global Communication Agglomeration –Group tasks into larger tasks. –Reduce communication. –Task Dependence Mapping –Assign the composite tasks to processes/threads. 35

36 Foster’s Methodology 36

37  The n-body Problem
To predict the motion of a group of objects that interact with each other gravitationally over a period of time.
–Inputs: mass, position and velocity of each object.
Astrophysicist:
–The positions and velocities of a collection of stars.
Chemist:
–The positions and velocities of a collection of molecules.

38 Newton’s Law 38

39  The Basic Algorithm

  Get input data;
  for each timestep {
    if (timestep output)
      Print positions and velocities of particles;
    for each particle q
      Compute total force on q;
    for each particle q
      Compute position and velocity of q;
  }

  for each particle q {
    forces[q][0] = forces[q][1] = 0;
    for each particle k != q {
      x_diff = pos[q][0] - pos[k][0];
      y_diff = pos[q][1] - pos[k][1];
      dist = sqrt(x_diff*x_diff + y_diff*y_diff);
      dist_cubed = dist*dist*dist;
      forces[q][0] -= G*masses[q]*masses[k]/dist_cubed*x_diff;
      forces[q][1] -= G*masses[q]*masses[k]/dist_cubed*y_diff;
    }
  }

40 Newton’s 3 rd Law of Motion 40 f 38 f 58 f 83 f 85 n=12 q=8 r=3

41  The Reduced Algorithm

  for each particle q
    forces[q][0] = forces[q][1] = 0;
  for each particle q {
    for each particle k > q {
      x_diff = pos[q][0] - pos[k][0];
      y_diff = pos[q][1] - pos[k][1];
      dist = sqrt(x_diff*x_diff + y_diff*y_diff);
      dist_cubed = dist*dist*dist;
      force_qk[0] = -G*masses[q]*masses[k]/dist_cubed*x_diff;
      force_qk[1] = -G*masses[q]*masses[k]/dist_cubed*y_diff;
      forces[q][0] += force_qk[0];
      forces[q][1] += force_qk[1];
      forces[k][0] -= force_qk[0];
      forces[k][1] -= force_qk[1];
    }
  }

42  Euler Method
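The formula on this slide is not reproduced in the transcript. Euler's method advances a quantity y(t) with derivative y'(t) by one step of size △t as

  y(t + △t) ≈ y(t) + △t * y'(t)

which the next slide applies twice: the position is advanced using the velocity, and the velocity is advanced using the acceleration (force divided by mass).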

43  Position and Velocity

  for each particle q {
    pos[q][0] += delta_t*vel[q][0];
    pos[q][1] += delta_t*vel[q][1];
    vel[q][0] += delta_t*forces[q][0]/masses[q];
    vel[q][1] += delta_t*forces[q][1]/masses[q];
  }

44  Communications: Basic
[Figure: task graph for one timestep. Computing F_q(t) and F_r(t) requires the current positions s_q(t), s_r(t); the updated s_q(t + △t), v_q(t + △t) and s_r(t + △t), v_r(t + △t) are then computed from F_q(t), F_r(t) and the current positions and velocities, and the pattern repeats for F_q(t + △t), F_r(t + △t).]

45  Agglomeration: Basic
[Figure: tasks agglomerated per particle. Each particle q carries s_q, v_q, F_q; between timesteps t and t + △t the tasks for q and r exchange their positions s_q and s_r.]

46  Agglomeration: Reduced
[Figure: in the reduced solver, for q < r the task for particle q sends the partial force f_qr to the task for r, while the task for r sends its position s_r to the task for q, at each timestep t, t + △t, ...]

47  Parallelizing the Basic Solver

  # pragma omp parallel
  for each timestep {
    if (timestep output) {
  #   pragma omp single nowait
      Print positions and velocities of particles;
    }
  # pragma omp for
    for each particle q
      Compute total force on q;
  # pragma omp for
    for each particle q
      Compute position and velocity of q;
  }

48  Parallelizing the Reduced Solver

  # pragma omp for
  for each particle q
    forces[q][0] = forces[q][1] = 0;
  # pragma omp for
  for each particle q {
    for each particle k > q {
      x_diff = pos[q][0] - pos[k][0];
      y_diff = pos[q][1] - pos[k][1];
      dist = sqrt(x_diff*x_diff + y_diff*y_diff);
      dist_cubed = dist*dist*dist;
      force_qk[0] = -G*masses[q]*masses[k]/dist_cubed*x_diff;
      force_qk[1] = -G*masses[q]*masses[k]/dist_cubed*y_diff;
      forces[q][0] += force_qk[0];
      forces[q][1] += force_qk[1];
      forces[k][0] -= force_qk[0];
      forces[k][1] -= force_qk[1];
    }
  }

49  Does it work properly?
Consider 2 threads and 4 particles.
Thread 1 is assigned particle 0 and particle 1.
Thread 2 is assigned particle 2 and particle 3.
F_3 = -f_03 - f_13 - f_23
Who will calculate f_03 and f_13? Who will calculate f_23?
Any race conditions?

50  Thread Contributions
Contributions of each thread to the total force on each particle (3 threads, 6 particles, block partition):

  Thread   Particle   Thread 0                      Thread 1                 Thread 2
  0        0          f_01+f_02+f_03+f_04+f_05      0                        0
  0        1          -f_01+f_12+f_13+f_14+f_15     0                        0
  1        2          -f_02-f_12                    f_23+f_24+f_25           0
  1        3          -f_03-f_13                    -f_23+f_34+f_35          0
  2        4          -f_04-f_14                    -f_24-f_34               f_45
  2        5          -f_05-f_15                    -f_25-f_35               -f_45

51  Thread Contributions
Contributions of each thread to the total force on each particle (3 threads, 6 particles, cyclic partition):

  Thread   Particle   Thread 0                      Thread 1                 Thread 2
  0        0          f_01+f_02+f_03+f_04+f_05      0                        0
  1        1          -f_01                         f_12+f_13+f_14+f_15      0
  2        2          -f_02                         -f_12                    f_23+f_24+f_25
  0        3          -f_03+f_34+f_35               -f_13                    -f_23
  1        4          -f_04-f_34                    -f_14+f_45               -f_24
  2        5          -f_05-f_35                    -f_15-f_45               -f_25

52  First Phase

  # pragma omp for
  for each particle q {
    for each particle k > q {
      x_diff = pos[q][0] - pos[k][0];
      y_diff = pos[q][1] - pos[k][1];
      dist = sqrt(x_diff*x_diff + y_diff*y_diff);
      dist_cubed = dist*dist*dist;
      force_qk[0] = -G*masses[q]*masses[k]/dist_cubed*x_diff;
      force_qk[1] = -G*masses[q]*masses[k]/dist_cubed*y_diff;
      loc_forces[my_rank][q][0] += force_qk[0];
      loc_forces[my_rank][q][1] += force_qk[1];
      loc_forces[my_rank][k][0] -= force_qk[0];
      loc_forces[my_rank][k][1] -= force_qk[1];
    }
  }

53  Second Phase

  # pragma omp for
  for (q = 0; q < n; q++) {
    forces[q][0] = forces[q][1] = 0;
    for (thread = 0; thread < thread_count; thread++) {
      forces[q][0] += loc_forces[thread][q][0];
      forces[q][1] += loc_forces[thread][q][1];
    }
  }

In the first phase, each thread carries out the same calculations as before, but the values are stored in its own array of forces (loc_forces).
In the second phase, the thread that has been assigned particle q adds the contributions that have been computed by the different threads.

54  Evaluating the OpenMP Codes
In the reduced code:
–Loop 1: Initialization of the loc_forces array
–Loop 2: The first phase of the computation of forces
–Loop 3: The second phase of the computation of forces
–Loop 4: The updating of positions and velocities
Which schedule should be used?

Run-times for 1, 2, 4 and 8 threads:

  Threads   Basic   Reduced Default   Reduced Forces Cyclic   Reduced All Cyclic
  1         7.71    3.90
  2         3.87    2.94              1.98                    2.01
  4         1.95    1.73              1.01                    1.08
  8         0.99    0.95              0.54                    0.61

55  Review
What are the major differences between MPI and OpenMP?
What is the scope of a variable?
What is a reduction variable?
How to ensure mutual exclusion in a critical section?
What are the common loop scheduling options?
What is a thread safe function?
What factors may affect the performance of an OpenMP program?

