
Lecture 7. Performance
Prof. Taeweon Suh, Computer Science Education, Korea University
COM503 Parallel Computer Architecture & Programming


2. Parallel Performance of OpenMP

Performance is influenced by at least the following factors:
- Memory access pattern of the individual threads: if each thread consistently accesses a distinct portion of the data throughout the program, it probably makes excellent use of the memory hierarchy.
- Overhead of OpenMP constructs: when a parallel region is created, threads may have to be created or woken up, and data structures must be set up to carry information needed by the runtime system.
- Load imbalance between synchronization points: threads may have to wait for a member of the team to carry out the work of a single construct.
- Other synchronization costs: threads typically waste time waiting for access to a critical region (or to acquire a lock).

3. #threads on Performance

When running a parallel application, make sure that the load (#threads) on the system does not exceed the number of processors:
- If it does, the system is said to be oversubscribed.
- Oversubscription not only degrades performance, but also makes it hard to analyze the program's behavior.

On an SMP system, a program should use fewer threads than the number of processors:
- OS daemons and services need to run on a processor, too.
- If all processors are in use by the application, even a relatively lightweight daemon disrupts the execution of the user program, because one thread has to give way to this process.

4. Performance

Sequential performance of an application is still a major concern when creating a parallel program:
- Poor sequential performance is often caused by suboptimal use of the caches found in contemporary computers.
- In particular, a cache miss is expensive because it implies that the data must be fetched from main memory.
- If cache misses happen frequently, they can severely reduce program performance.
- On an SMP system, the impact of cache misses can be even stronger because of the limited bandwidth and latency of the interconnection network.

5. Cache

A major goal is to organize data accesses so that data are used as often as possible while they are still in the cache. The most common strategies are based on the fact that programming languages typically specify that the elements of arrays be stored contiguously in memory:
- Take advantage of temporal and spatial locality.

6. Cache-friendly Code

In C, a 2-dimensional array is stored in row-major order. Example: int A[10][8]. Making the rightmost index the inner loop index walks memory contiguously, so each cache line fetched is fully used:

```c
for (i = 0; i < 10; i++)
    for (j = 0; j < 8; j++)
        sum += A[i][j];
```

7. Cache-friendly Code

In Fortran, a 2-dimensional array is stored in column-major order. Example: INTEGER A(10,8). Here the leftmost index varies fastest in memory, so it should be the inner loop index:

```fortran
DO J = 1, 8
   DO I = 1, 10
      sum = sum + A(I,J)
   END DO
END DO
```

(The slide's figure shows the column-major memory layout of A, with a cache line holding consecutive elements of one column.)

8. TLB Considerations

The page size is determined by what the CPU supports, plus the choices offered by the operating system. Typically, the page size is 4KB.

The TLB is on the critical path for performance (think about a PIPT cache: the address translation must complete before the cache access can finish). Just as with the data cache, it is important to make good use of the TLB entries.

9. Loop Optimizations

Both the programmer and the compiler can improve the use of memory. A simple reordering of the statements inside the body of a loop nest may make a difference:
- Loop Interchange (or Loop Exchange)
- Loop Unrolling
- Unroll and Jam
- Loop Fusion
- Loop Fission
- Loop Tiling (or Blocking)

10. Loop Interchange

```c
/* Before: the inner loop strides down a column; poor locality
   with row-major ordering */
for (j = 0; j < 100; j++)
    for (i = 0; i < 5000; i++)
        x[i][j] = 2 * x[i][j];

/* After: the inner loop walks a row contiguously; improved
   cache efficiency */
for (i = 0; i < 5000; i++)
    for (j = 0; j < 100; j++)
        x[i][j] = 2 * x[i][j];
```

What is the worst that could happen?

(Prof. Sean Lee's slide, Georgia Tech)

11. Loop Unrolling

The loop overhead consists of incrementing the loop variable, testing its value, and branching to the start of the loop. Unrolling the loop (here with an unroll factor of 2) brings:
- Overall loop overhead roughly halved.
- Improved data reuse: the value of a[i] just computed can be used immediately (in the statement for c[i+1]).
- Potentially increased ILP.

Nowadays, a programmer seldom needs to apply this transformation manually, since compilers are very good at doing it.

```c
/* Before (starting at i = 1 so that a[i-1] and b[i-1] stay in
   bounds): little work per iteration relative to the overhead */
for (int i = 1; i < 99; i++) {
    a[i] = b[i] + 1;
    c[i] = b[i] + a[i-1] + b[i-1];
}

/* After: unrolled by a factor of 2 (98 iterations, an even count,
   so no remainder iteration is needed) */
for (int i = 1; i < 99; i += 2) {
    a[i]   = b[i]   + 1;
    c[i]   = b[i]   + a[i-1] + b[i-1];
    a[i+1] = b[i+1] + 1;
    c[i+1] = b[i+1] + a[i]   + b[i];   /* reuses the a[i] just computed */
}
```

12. Unroll and Jam

Unroll and Jam is an extension of loop unrolling that is appropriate for some loop nests with multiple loops: the outer loop is unrolled by some factor, and the resulting copies of the inner loop are then fused ("jammed") back into a single inner loop.

13. Loop Fusion

Loop Fusion merges two or more loops to create one bigger loop:
- May improve cache efficiency.
- Could increase the amount of computation per iteration, improving ILP.
- Lowers loop overheads.

14. Loop Fission

Loop Fission breaks up a loop into several loops:
- May improve use of cache, or isolate a part that inhibits full optimization of the loop.
- Likely to be most useful if a loop nest is large and its data does not fit into cache.

15. Why Loop Blocking?

When a loop nest repeatedly sweeps over arrays that are larger than the cache, the data fetched during one pass of the inner loops is evicted before the outer loop can reuse it, so every pass pays the full cost of fetching the data from memory again.

16. Loop Blocking (Loop Tiling)

Partition the loop's iteration space into many smaller chunks and ensure that the data stays in the cache until it is reused. The slide's example (modified from Prof. Sean Lee's slides, Georgia Tech) blocks the matrix product x = y * z over tiles of the j and k dimensions, so that a tile of z[k][j] is reused across iterations of i.

17. Use of Pointers and Contiguous Memory in C

Pointers pose a serious challenge for performance tuning: the pointer aliasing problem.
- The memory model in C is such that, without additional information, one must assume that all pointers may reference any memory address.
- This prevents a compiler from performing many program optimizations, since it cannot determine that they are safe.
- If pointers are guaranteed to point to portions of non-overlapping memory (for example, because each pointer targets memory allocated through a separate call to malloc()), more aggressive techniques can be applied.
- In general, only the programmer knows which memory locations a pointer may refer to.

18. Use of Pointers and Contiguous Memory in C

The restrict keyword, introduced in C99, informs the compiler that the memory referenced through one pointer does not overlap with the memory referenced through another restrict-qualified pointer:

```c
void mxv(int m, int n, double * restrict a,
         double * restrict b, double * restrict c)
{
    int i, j;
    for (i = 0; i < m; i++) {          /* a = B * c, with B stored in b[] */
        a[i] = b[i * n] * c[0];
        for (j = 1; j < n; j++)
            a[i] += b[i * n + j] * c[j];
    }
}
```

19. Using Compilers

Modern compilers implement most, if not all, of these loop optimizations:
- They perform a variety of analyses (such as data dependence analysis) to determine whether a transformation may be applied.
- Check what compiler options are available.

However, the compiler's ability to transform code is limited by its ability to analyze the program:
- It may be hindered by the presence of pointers.

So the programmer has to take action: some rewriting of the source code may lead to better results.

20. Best Practices

General recommendations for efficient OpenMP programs:
- Optimize barrier use
- Avoid the ordered construct
- Avoid large critical regions
- Maximize parallel regions
- Avoid parallel regions in inner loops
- Load balancing

Additional performance considerations:
- single vs. master construct
- Private vs. shared data
- Avoid false sharing

21. Optimize Barrier Use

No matter how efficiently barriers are implemented, they are expensive operations:
- It is always worthwhile to reduce their use to the minimum.
- The nowait clause makes it easy to eliminate the barrier that is implied on several constructs.

22. Optimize Barrier Use: Example

```c
#pragma omp parallel default(none) \
        shared(n, a, b, c, d, sum) private(i)
{
    #pragma omp for nowait
    for (i = 0; i < n; i++)
        a[i] += b[i];

    #pragma omp for nowait
    for (i = 0; i < n; i++)
        c[i] += d[i];

    #pragma omp barrier        /* a[] and c[] must be complete below */
    #pragma omp for nowait reduction(+:sum)
    for (i = 0; i < n; i++)
        sum += a[i] + c[i];
}
```

The first two loops update independent arrays, so their implied barriers can be dropped with nowait; an explicit barrier is then needed before the reduction loop, which reads both a and c.

23. Avoid the Ordered Construct

The ordered construct ensures that the corresponding block of code within a parallel loop is executed in the order of the loop iterations:
- It is expensive to implement.
- The runtime system has to keep track of which iterations have finished, and possibly keep threads in a wait state until their results are needed.
- This inevitably slows program execution.

24. Avoid Large Critical Regions

A critical region is used to ensure that no two threads execute a piece of code simultaneously:
- The more code contained in the critical region, the greater the likelihood that threads have to wait to enter it.
- Thus, the programmer should minimize the amount of code enclosed within a critical region.

If possible, an atomic update is to be preferred:
- Whereas a critical region forces threads to perform all of the code enclosed within it one at a time, an atomic update enforces exclusive access to just one memory location.

25. Maximize Parallel Regions

Indiscriminate use of parallel regions may give rise to suboptimal performance: overheads are associated with starting and terminating each parallel region. Large parallel regions offer more opportunities for using data in cache and provide a bigger context for other compiler optimizations.

```c
/* Multiple small parallel regions */
#pragma omp parallel for
for ( ... ) { /* Work-sharing loop 1 */ }
#pragma omp parallel for
for ( ... ) { /* Work-sharing loop 2 */ }
#pragma omp parallel for
for ( ... ) { /* Work-sharing loop 3 */ }

/* One large parallel region */
#pragma omp parallel
{
    #pragma omp for
    for ( ... ) { /* Work-sharing loop 1 */ }
    #pragma omp for
    for ( ... ) { /* Work-sharing loop 2 */ }
    #pragma omp for
    for ( ... ) { /* Work-sharing loop 3 */ }
}
```

The combined version has fewer implied barriers and potential for cache data reuse between loops. Downside: the number of threads cannot be adjusted on a per-loop basis.

26. Avoid Parallel Regions in Inner Loops

Another common technique to improve performance is to move parallel regions out of innermost loops; otherwise, we repeatedly incur the overheads of the parallel construct.

27. Load Balancing

In some parallel algorithms, threads have different amounts of work to do:
- One solution is to use the schedule clause with a non-static schedule.
- The caveat is that the dynamic and guided schedules have higher overheads than the static schedule.
- However, if the load imbalance is severe enough, this cost is offset by the more flexible allocation of work to threads.

28. Pipelined Processing

29. Single vs. Master

The functionality of the single and master constructs is similar. The difference is that a single region can be executed by any thread (typically the first to encounter it), whereas this is not the case for the master region.

The efficiency is implementation- and application-dependent. In general, the master construct is more efficient, as the single construct requires more work in the OpenMP library.

30. Private vs. Shared

The programmer may often choose whether data should be shared or private. Either choice might lead to a correct application, but the performance impact can be substantial if the wrong choice is made.

As an example, if threads need unique read/write access to a 1-dimensional array, there are two options:
- Declare a 2-dimensional shared array with one row accessed by each thread, or have each thread allocate its own 1-dimensional private array.
- In general, the latter is to be preferred: with the shared array, an element modified by one thread might sit in the same cache line as data modified by another thread, so performance degrades because of false sharing.

31. Private vs. Shared

If data is only read in a parallel region, it can be shared. It could also be privatized so that each thread has a local copy, using the firstprivate clause to initialize it to the values held prior to the parallel region.

Both approaches work, but the performance can differ:
- Sharing the data seems the reasonable choice: there is no risk of false sharing because the data is not modified, memory usage does not increase, and there is no runtime overhead to copy the data.
- How about on a ccNUMA system?

32. Avoid False Sharing

One of the factors limiting scalable performance is false sharing:
- It is a side effect of the cache-line granularity of cache coherence.
- When threads running on different processors update different words in the same cache line, the cache coherence protocol maintains data consistency by invalidating the entire cache line.
- If some or all of the threads update the same cache line frequently, performance degrades.

33. False Sharing Example

Assume that:
- the cache line size is 8 words, and
- there are 8 threads (Nthreads = 8).

```c
#pragma omp parallel for shared(Nthreads, a) schedule(static, 1)
for (i = 0; i < Nthreads; i++)
    a[i] += i;
```

With a chunk size of 1, thread t updates a[t]. All eight elements fit in one cache line, so every update invalidates that line in the other processors' caches: false sharing.

34. Avoid False Sharing

In general, using private data instead of shared data significantly reduces the risk of false sharing. In contrast to array padding, it is also a portable optimization.

35. Binding Threads to CPUs

Use an environment variable:

    export GOMP_CPU_AFFINITY="0 4 1 5"

- Thread 0 attached to CPU 0
- Thread 1 attached to CPU 4
- Thread 2 attached to CPU 1
- Thread 3 attached to CPU 5

Try the lstopo command (shows the topology of the system).

36. Our Server Config.

37. Single-Thread Overhead

Single-thread overhead measures how effective the parallel version is when executed on a single thread:
- Ideally, the execution time of the OpenMP version equals that of the sequential version.
- In many cases, the sequential version is faster.
- However, the OpenMP version on a single thread might even be faster, because of differences in compiler optimizations.

    Overhead(single thread) = 100 x (T_elapsed(OpenMP, 1 thread) / T_elapsed(sequential) - 1) %

so that the overhead is positive when the single-threaded OpenMP version is slower than the sequential version.

38. Case Study: Matrix x Vector Product

Experiment environment:
- Sun Fire E6900 (2006), a NUMA system from the Sun Fire series of servers, with UltraSPARC IV (dual-core) processors
- CPU and memory boards (each board, SB#, can hold up to 4 UltraSPARC chips); 24 processors (= 48 cores)
- Solaris 9 OS

In general, performance results are significantly influenced by:
- The application developer's coding style
- The compiler, compiler options, and runtime libraries
- OS features, including support for memory allocation and thread scheduling
- Hardware characteristics: memory hierarchy, cache coherence mechanisms, support for atomic operations, and more

39. Case Study: Single-Thread Overhead (chart)

40. Case Study: Performance (chart)

41. Superlinear Speedup

With a parallel program, there can be a positive effect offsetting some of the performance loss caused by sequential code and the various overheads: a parallel program has more aggregate cache capacity at its disposal, since each thread has some amount of local cache. This can result in a superlinear speedup, where the speedup exceeds the number of processors used.

42. Backup

43. Overheads of the OpenMP Translation

A cost is associated with the creation of OpenMP parallel regions, with the sharing of work among threads, and with all kinds of synchronization. The sources of these overheads include:
- The cost of starting up threads and creating their execution environment
- The potential additional expense incurred by the encapsulation of a parallel region in a separate function
- The cost of computing the schedule
- The time taken to block and unblock threads, and the time for them to fetch work and signal that they are ready

44. Overheads of the OpenMP Translation

Minor overheads are incurred by the firstprivate and lastprivate clauses. In most cases, however, these are relatively modest compared to the cost of barriers and other forms of thread synchronization, as well as the loss in speedup whenever one or more threads are idle.

Dynamic forms of scheduling lead to much more thread interaction than static schedules do, and therefore inevitably incur higher overheads. On the other hand, they may reduce thread idle time in the presence of load imbalance.

45. Overheads of the OpenMP Translation

The EPCC microbenchmarks were created to help programmers estimate the relative cost of using different OpenMP constructs. The next slide shows the overheads of the major OpenMP constructs as measured by the EPCC microbenchmarks for the first version of the OpenUH compiler. A few results:
- Overheads for the for directive and for the barrier are almost identical.
- Overheads for the parallel loop consist of invoking the static loop schedule plus the barrier.
- Overheads for parallel for are just slightly higher than those for parallel. This is accounted for by the cost of sharing the work, which is negligible under the default static scheduling policy.
- The single directive has higher overheads than a barrier. This is not surprising: they consist of a call to a runtime library routine that ensures one thread executes the region, plus a barrier at the end.
- The reduction clause is costly because it is implemented via a critical region.

46. Overheads of the OpenMP Translation

(Chart: measured overheads of the single, reduction, parallel for, parallel, barrier, and for constructs.)

47. Overheads of the OpenMP Translation

Overheads for the different kinds of loop schedules: the chart clearly shows the performance benefit of a static schedule, and the penalties incurred by a dynamic schedule, where threads must grab chunks of work (especially small chunks) at run time.

(Chart: overheads of the dynamic,n; guided,n; static,n; and static schedules.)
