
1 Programming with Shared Memory - 3: Recognizing parallelism, Performance issues. ITCS4145/5145, Parallel Programming. B. Wilkinson, Jan 22, 2016.

2 Recognizing when parallelism can be used. We have seen OpenMP for specifying parallelism. The programmer decides which parts of the code should be parallelized and inserts compiler directives (pragmas). The issue for the programmer is deciding what can safely be done in parallel. Here we use generic language constructs for parallelism.

3 par Construct. For specifying concurrent statements:

par {
  S1;
  S2;
  ...
  Sn;
}

Says one can execute statements S1 to Sn simultaneously if resources are available, or execute them in any order, and still get the correct result.
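As a point of comparison, a minimal sketch (assumed, not from the slides) of how a par-like block of independent statements might be written with OpenMP's sections construct; the functions work1, work2 and work3 are placeholder names.

#include <omp.h>

void work1(void); void work2(void); void work3(void);   /* placeholder tasks */

void do_in_parallel(void)
{
    /* Each section may run on a different thread, in any order. */
    #pragma omp parallel sections
    {
        #pragma omp section
        work1();
        #pragma omp section
        work2();
        #pragma omp section
        work3();
    }
}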

4 Question: How is this specified in OpenMP?

5 forall Construct. To start multiple similar processes together:

forall (i = 0; i < n; i++) {
  S1;
  S2;
  ...
  Sm;
}

Says each iteration of the body can be executed simultaneously if resources are available, or in any order, and still get the correct result. The statements within each instance of the body are executed in the order given. Each instance of the body uses a different value of i.

6 Example

forall (i = 0; i < 5; i++)
  a[i] = 0;

clears a[0], a[1], a[2], a[3], and a[4] to zero concurrently.
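Again for comparison with the question that follows, a minimal sketch (assumed, not from the slides) of the same array-clearing example written with OpenMP's parallel for construct:

#include <omp.h>

#define N 5

void clear_array(int a[N])
{
    int i;
    /* Iterations are divided among the threads; each iteration is independent. */
    #pragma omp parallel for
    for (i = 0; i < N; i++)
        a[i] = 0;
}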

7 Question: How is this specified in OpenMP?

8 Dependency Analysis. Used to identify which processes could be executed together. Example: one can see immediately in the code

forall (i = 0; i < 5; i++)
  a[i] = 0;

that every instance of the body is independent of the other instances, so all instances can be executed simultaneously. However, it may not be that obvious. We need an algorithmic way of recognizing dependencies, especially for a parallelizing compiler.

9-10 [Image-only slides in the original, presenting Bernstein's conditions: two processes P1 and P2, with input sets I1 and I2 and output sets O1 and O2, can be executed in parallel if I1 ∩ O2 = ∅, I2 ∩ O1 = ∅, and O1 ∩ O2 = ∅.]

11 Can use Bernstein's conditions at: the machine instruction level inside the processor – logic detects whether the conditions are satisfied (see a computer architecture course); and at the process level, to detect whether two processes can be executed simultaneously (using the inputs and outputs of the processes). Can be extended to more than two processes, but the number of conditions rises – every input/output combination must be checked. For three statements, how many conditions need to be checked? For four statements, how many conditions need to be checked?

12 Question. Use Bernstein's conditions to determine whether the statements in the sequence

a = b + c;
y = 3;
a = 3;
x = y + z;

can be executed in parallel. Clearly show how you got your answer. (No marks for just yes or no!)

13 Question. Use Bernstein's conditions to determine whether the two code sequences

forall (i = 0; i < 4; i++)
  a[i] = a[i+3];

and

for (i = 0; i < 4; i++)
  a[i] = a[i+3];

always produce the same results. Clearly show how you got your answer. (No marks for just yes or no!)

14 Shared Memory Programming: Some Performance Issues

15 Performance Issues with Threads. A program might actually go slower when parallelized! Too many threads can significantly reduce program performance.

16 Reasons:

Too little work – work split among too many threads gives each thread too little work, so the overhead of starting and terminating threads swamps the useful work.

Having to share fixed hardware resources – the OS typically schedules threads round robin with a time slice, and time slicing incurs overhead (saving registers, effects on cache memory, virtual memory management, ...).

Waiting to acquire a lock – when a thread is suspended while holding a lock, all threads waiting for the lock have to wait for that thread to restart.

Critical sections can serialize code (see earlier).

Source: Multi-core Programming by S. Akhter and J. Roberts, Intel Press.

17 Some Strategies

Limit the number of runnable threads to the number of hardware threads. (As we will see later, we do not do this with GPUs.) For an n-core machine, have n runnable threads; if hyper-threaded (with 2 virtual threads per core), double this. Can have more threads in total, but the others may be blocked.

Separate I/O threads from compute threads. I/O threads wait for external events.

Never hard-code the number of threads – leave it as a tuning parameter, or let OpenMP optimize the number of threads (see the sketch below).

Implement a thread pool.

Implement a work-stealing approach in which each thread has a work queue; threads with no work take work from other threads' queues.
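A minimal sketch (assumed, not from the slides) of treating the thread count as a tuning parameter rather than hard-coding it, using standard OpenMP calls:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    /* OpenMP takes the default thread count from the OMP_NUM_THREADS
       environment variable, so it can be tuned without recompiling, e.g.
           OMP_NUM_THREADS=4 ./a.out                                        */
    printf("Default number of threads: %d\n", omp_get_max_threads());

    /* A program-level override is also possible (e.g. from a command-line
       argument), but it still should not be a hard-coded constant:
           omp_set_num_threads(tuned_value);                               */

    #pragma omp parallel
    {
        if (omp_get_thread_num() == 0)
            printf("Running with %d threads\n", omp_get_num_threads());
    }
    return 0;
}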

18 Shared Data in Systems with Caches. All modern computer systems have cache memory: high-speed memory closely attached to each processor for holding recently referenced data and code. [Diagram: each processor has its own cache memory between it and the shared main memory.]

19 Cache Coherence Protocols. Update policy – copies of the data in all caches are updated at the time one copy is altered. Invalidate policy – when one copy of the data is altered, the same data in any other cache is invalidated (by resetting a valid bit in the cache); these copies are updated only when the associated processor makes a reference to the data. A protocol is needed even on a single-processor system (why?). More details in a computer architecture class.

20 False Sharing. Different parts of a cache block are required by different processors, but not the same bytes. If one processor writes to one part of the block, copies of the complete block in other caches must be updated or invalidated, even though the actual data is not shared.

21 Solution for False Sharing. The compiler can alter the layout of the data stored in main memory, separating data altered only by one processor into different blocks. It is also possible (?) for a programmer to pad out data so that items used by different processors fall in different cache lines (if the programmer knows the cache details – these will change from system to system).
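A minimal sketch (assumed, not from the slides) of the padding idea: each thread updates its own counter, and padding each per-thread element up to an assumed 64-byte cache line keeps the counters in separate lines.

#include <omp.h>

#define CACHE_LINE 64          /* assumed cache line size in bytes */
#define NTHREADS   4

/* Each counter is padded so consecutive array elements fall in
   different cache lines, avoiding false sharing between threads. */
struct padded_counter {
    long value;
    char pad[CACHE_LINE - sizeof(long)];
};

static _Alignas(CACHE_LINE) struct padded_counter counts[NTHREADS];

void count_work(long iterations)
{
    #pragma omp parallel num_threads(NTHREADS)
    {
        int tid = omp_get_thread_num();
        for (long i = 0; i < iterations; i++)
            counts[tid].value++;     /* each thread touches only its own line */
    }
}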

22 Question. Explain the potential false sharing effect in the following code:

int main()
{
   int x, y;
   ...
   #pragma omp parallel shared(x, y)
   {
      int tid = omp_get_thread_num();
      if (tid == 0) x++;
      if (tid == 1) y++;
   }
   ...
}

Suggest how false sharing could be prevented in this code.

23 Interleaved Statements. Instructions of processes/threads are interleaved in time. Example:

Process/Thread 1:
  Instruction 1.1
  Instruction 1.2
  Instruction 1.3

Process/Thread 2:
  Instruction 2.1
  Instruction 2.2
  Instruction 2.3

Many possible orderings, e.g.:

Instruction 1.1
Instruction 1.2
Instruction 2.1
Instruction 1.3
Instruction 2.2
Instruction 2.3

assuming instructions cannot be divided into smaller steps. Each process/thread must achieve the desired results irrespective of the interleaving order.

24 Thread-Safe Routines (calling the same routine from multiple threads). A routine is thread safe if it can be called from multiple threads simultaneously and always produces correct results. Standard I/O is thread safe (it prints messages without interleaving the characters, assuming the print buffer is not exceeded). System routines that return the time may not be thread safe. Routines that access shared data may require special care to be made thread safe.
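A minimal sketch (assumed, not from the slides) of one common way to make a routine that updates shared data thread safe: protect the update with an OpenMP critical section so only one thread at a time executes it.

#include <omp.h>

static long call_count = 0;     /* shared state accessed by the routine */

/* Thread safe: concurrent callers cannot interleave inside the update. */
long record_call(void)
{
    long my_count;
    #pragma omp critical
    {
        call_count++;
        my_count = call_count;
    }
    return my_count;
}

For a simple increment like this, an OpenMP atomic directive would be a lighter-weight alternative to the critical section.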

25 Sequential Consistency. Formally defined by Lamport (1979): a multiprocessor is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor occur in this sequence in the order specified by its program. I.e., the overall effect of a parallel program is not changed by any arbitrary interleaving of instruction execution in time.

26 Program Order. Sequential consistency refers to the “operations of each individual processor... occur in the order specified in its program”, i.e. program order. In the previous interleaving example, this order is that of the stored machine instructions to be executed.

27 Code Re-ordering. The order of execution is not necessarily the same as the order of the corresponding high-level statements in the source program.

1. Static re-ordering (done before execution) – the compiler may reorder statements for improved performance, for example to space out dependent instructions, allowing more instructions to be executed at the same time (compiler optimizations).

2. Dynamic re-ordering (done during execution) – high-performance processors also usually reorder machine instructions internally during execution for increased performance.

In both cases, the objective is to best utilize the available computer resources and minimize execution time.

28 Compiler/Processor Optimizations. The compiler and processor reorder instructions to improve performance. Example: suppose one had the code

a = b + 5;
x = y * 4;
p = x + 9;

and the processor can perform, as is usual, multiple arithmetic operations at the same time. It can reorder to

x = y * 4;
a = b + 5;
p = x + 9;

and still be logically correct. This gives the multiply operation longer to complete before its result (x) is needed in the last statement. It is very common for processors to execute machine instructions “out of program order” for increased speed.

29 Processors Retiring Values in Program Order. Processors use renamed temporary registers internally to store values, moving results back to the actual registers specified in the machine instructions later, to achieve higher performance (see an architecture course for more details). This does not stop a multiprocessor from being sequentially consistent, provided the processor produces final results in program order – that is, retires values to registers in program order. Processors have the option of operating under the sequential consistency model, i.e. retiring values to registers in program order. However, this can severely limit compiler optimizations and processor performance.

30 Example

Process P1:
  data = new;
  flag = TRUE;

Process P2:
  while (flag != TRUE) { };
  data_copy = data;

We expect data_copy to be set to new, because we expect data = new to be executed before flag = TRUE, and while (flag != TRUE) { } to be executed before data_copy = data. This ensures that process P2 reads the new data from process P1; process P2 will simply wait for the new data to be produced. Writing a parallel program for a system known to be sequentially consistent – that is, each program is executed in program order and any interleaving of the instructions gives the correct answer – enables us to reason about the result of the program.

31 Example of Processor Re-ordering

Process P1:
  new = a * b;
  data = new;
  flag = TRUE;

Process P2:
  while (flag != TRUE) { };
  data_copy = data;

The multiply machine instruction corresponding to new = a * b is issued for execution. The next instruction, corresponding to data = new, cannot be issued until the multiply has produced its result. However, the following statement, flag = TRUE, is completely independent, and a clever processor could start this operation before the multiply has completed, leading to the re-ordered sequence on the next slide.

32

Process P1:
  new = a * b;
  flag = TRUE;
  data = new;

Process P2:
  while (flag != TRUE) { };
  data_copy = data;

Now the while statement might complete before new is assigned to data, and the code would fail. To achieve the desired result, operate under the sequential consistency model, i.e. do not reorder instructions, forcing the multiply instruction above to complete before starting subsequent instructions that depend upon its result.

33 Relaxing Read/Write Orders. Processors may be able to relax the consistency in terms of the order of reads and writes of one processor with respect to those of another processor, to obtain higher performance, and provide instructions to enforce consistency when needed. Examples of machine instructions: Memory barrier (MB) instruction – waits for all previously issued memory access instructions to complete before issuing any new memory operations. Write memory barrier (WMB) instruction – as MB, but only for memory write operations, i.e. it waits for all previously issued memory write instructions to complete before issuing any new memory write operations; memory reads issued after a memory write operation may overtake it and complete before the write operation.
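A minimal sketch (assumed, not from the slides) of how the flag/data example from slide 30 can be made safe at the source level on a machine with relaxed ordering, here using C11 atomics with pthreads; the release store and acquire load play the role of the write and read barriers described above.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>
#include <pthread.h>

int data;                       /* ordinary shared data  */
atomic_bool flag = false;       /* synchronization flag  */

void *producer(void *arg)
{
    data = 42;                                                  /* data = new;  */
    atomic_store_explicit(&flag, true, memory_order_release);   /* flag = TRUE; */
    return NULL;
}

void *consumer(void *arg)
{
    while (!atomic_load_explicit(&flag, memory_order_acquire))
        ;                                                       /* wait for flag */
    int data_copy = data;                   /* guaranteed to see the value 42    */
    printf("data_copy = %d\n", data_copy);
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t2, NULL, consumer, NULL);
    pthread_create(&t1, NULL, producer, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}

In OpenMP, the flush directive plays a similar barrier-like role for enforcing a consistent view of shared variables.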

34 Questions

