Presentation on theme: "Advanced Computer Architecture Programming Level"— Presentation transcript:

1 Advanced Computer Architecture Programming Level
A.R. Hurson 128 EECH Building, Missouri S&T

2 Advanced Computer Architecture
Instruction Set Design A computer with a storage component which may contain both data to be manipulated and instructions to manipulate the data is called a stored program machine — the user can change the sequence of operations on the data. A program is a finite sequence of instructions that specify the operands, the operations, and the sequence in which processing has to occur.

3 Advanced Computer Architecture
Compiler and Compilation A compiler translates a high-level program (source code) into a machine-level program (object code). This translation should be performed correctly and efficiently — valid source code should be translated correctly and efficiently into efficient object code (optimization).

4 Advanced Computer Architecture
Compiler and Compilation Performance Efficient translation: one pass vs. multi pass compilers. Efficient execution: performance and hardware utilization.

5 Advanced Computer Architecture
Compiler and Compilation Compiler Optimization — Optimization can be performed at different levels: High level optimizations, Local optimizations, Global optimizations — global common sub-expression elimination, Register allocation — graph coloring and no. of registers available, Machine dependent optimization — exploitation of concurrency.

6 Advanced Computer Architecture
Concurrent Processing To move concurrent processing into the mainstream of computation, one needs to develop: a computational model for concurrent processing, proper means for interconnection in concurrent systems, and proper means to integrate concurrent processing into the general computing environment. [Figure: Abstract Model, Flow of Operations and Control, Mapping]

7 Advanced Computer Architecture
Concurrent Processing These can be attributed to issues such as: level of concurrency, computational granularity, time and space complexity, communication latencies, and scheduling and load balancing.

8 Advanced Computer Architecture
Summary: Stored Program Machine; Program; Instruction: definition and format; Addressing Mode; Compiler; Concurrent Processing.

9 Advanced Computer Architecture
Concurrent Processing Independence among segments of a program is a necessary condition to execute them concurrently. In general, two independent segments could be executed in any order without affecting each other — a segment can be an instruction or a sequence of instructions.

10 Advanced Computer Architecture
Concurrent Processing A dependence graph is used to determine the dependence relations among the program segments.

11 Advanced Computer Architecture
Concurrent Processing Dependence Graph — A dependence graph is a directed graph G ≡ (N, A) in which the set of nodes (N) represents the program segments and the set of directed arcs (A) shows the order of dependence among the segments.

12 Advanced Computer Architecture
Concurrent Processing Dependence Graph Dependence comes in various forms and kinds: Data dependence Control dependence Resource dependence

13 Advanced Computer Architecture
Concurrent Processing Data Dependence: If an instruction uses a value produced by a previous instruction, then the second instruction is data dependent on the first instruction. Data dependence comes in different forms:

14 Advanced Computer Architecture
Data dependence Flow dependence: At least one output of S1 is an input of S2 (Read-After-Write: RAW). Anti dependence: Output of S2 is overlapped with the input to S1 (Write-After-Read: WAR). Output dependence: S1 and S2 write to the same location (Write-After-Write: WAW). S1 → S2

15 Advanced Computer Architecture
Data dependence I/O dependence: The same file is referred to by two I/O statements. Unknown dependence: The dependence relation can not be determined.

16 Advanced Computer Architecture
Example Assume the following sequence of instructions: S1: R1 ← (A), S2: R2 ← (R1) + (R2), S3: R1 ← (R3), S4: B ← (R1)

17 Advanced Computer Architecture
Example [Figure: dependence graph of S1–S4] S1: R1 ← (A), S2: R2 ← (R1) + (R2), S3: R1 ← (R3), S4: B ← (R1)
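
The dependence relations in this example can be derived mechanically from each instruction's read and write sets. The following is a minimal sketch; the dictionary encoding and helper name are illustrative and not part of the original slides, and the simple pairwise check ignores intervening redefinitions (so it also reports the S1-to-S4 flow dependence that S3 actually kills).

```python
# Each instruction is modeled by the registers/locations it reads and writes.
instrs = {
    "S1": {"reads": {"A"},        "writes": {"R1"}},
    "S2": {"reads": {"R1", "R2"}, "writes": {"R2"}},
    "S3": {"reads": {"R3"},       "writes": {"R1"}},
    "S4": {"reads": {"R1"},       "writes": {"B"}},
}

def dependences(instrs):
    """Report flow (RAW), anti (WAR), and output (WAW) dependences
    between every earlier/later instruction pair."""
    names = list(instrs)
    deps = []
    for i, s1 in enumerate(names):
        for s2 in names[i + 1:]:
            a, b = instrs[s1], instrs[s2]
            if a["writes"] & b["reads"]:
                deps.append((s1, s2, "flow (RAW)"))
            if a["reads"] & b["writes"]:
                deps.append((s1, s2, "anti (WAR)"))
            if a["writes"] & b["writes"]:
                deps.append((s1, s2, "output (WAW)"))
    return deps

for dep in dependences(instrs):
    print(dep)   # ('S1', 'S2', 'flow (RAW)'), ('S2', 'S3', 'anti (WAR)'), ...
```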

18 Advanced Computer Architecture
Control dependence The order of execution is determined during run-time — conditional statements. Control dependence could also exist between operations performed in successive iterations of a loop:
Do I = 1, N
  If (A(I-1) = 0) then A(I) = 0
End

19 Advanced Computer Architecture
Control dependence Control dependence often does not allow efficient exploitation of parallelism.

20 Advanced Computer Architecture
Resource dependence Conflict in using shared resources such as concurrent request for the same functional unit. A resource conflict arises when two instructions attempt to use the same resource at the same time.

21 Advanced Computer Architecture
Resource dependence Within the scope of resource dependence, we can talk about storage dependence, ALU dependence, ...

22 Advanced Computer Architecture
Question What is “true dependence”? What is “false dependence”? What is the source of “false dependence”? How can one eliminate/moderate “resource dependence”?

23 Advanced Computer Architecture
Concurrent Processing Bernstein's Conditions Let Ii and Oi be the input and output sets of process Pi, respectively. The two processes P1 and P2 can be executed in parallel (P1 || P2) iff: I1 ∩ O2 = ∅ (no WAR), I2 ∩ O1 = ∅ (no RAW), and O1 ∩ O2 = ∅ (no WAW).

24 Advanced Computer Architecture
Concurrent Processing Bernstein's Conditions In general, P1, P2, ..., Pk can be executed in parallel if Bernstein's condition holds for every pair of processes: P1 || P2 || P3 ... || Pk iff Pi || Pj ∀ i ≠ j

25 Advanced Computer Architecture
Concurrent Processing Bernstein's Conditions The parallelism relation (||) is commutative: Pi || Pj ⇒ Pj || Pi. The parallelism relation (||) is not transitive: Pi || Pj and Pj || Pk do not necessarily guarantee Pi || Pk.

26 Advanced Computer Architecture
Concurrent Processing Bernstein's Conditions The parallelism relation (||) is therefore not an equivalence relation. The parallelism relation (||) is associative: Pi || Pj || Pk ⇒ (Pi || Pj) || Pk = Pi || (Pj || Pk)

27 Advanced Computer Architecture
Concurrent Processing Bernstein's Conditions — Example Detect parallelism in the following program, assume a uniform execution time: P1: C = D * E P2: M = G + C P3: A = B + C P4: C = L + M P5: F = G / E
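
Bernstein's conditions for this example can be checked directly from the input and output sets of each statement. The following sketch is illustrative; the set encoding and helper name are not from the slides.

```python
from itertools import combinations

# Input (read) and output (write) sets of P1..P5.
procs = {
    "P1": ({"D", "E"}, {"C"}),   # C = D * E
    "P2": ({"G", "C"}, {"M"}),   # M = G + C
    "P3": ({"B", "C"}, {"A"}),   # A = B + C
    "P4": ({"L", "M"}, {"C"}),   # C = L + M
    "P5": ({"G", "E"}, {"F"}),   # F = G / E
}

def parallel(p, q):
    """Bernstein: I1 ∩ O2 = I2 ∩ O1 = O1 ∩ O2 = empty set."""
    (i1, o1), (i2, o2) = procs[p], procs[q]
    return not (i1 & o2) and not (i2 & o1) and not (o1 & o2)

for p, q in combinations(procs, 2):
    if parallel(p, q):
        print(p, "||", q)   # P1 || P5, P2 || P3, P2 || P5, P3 || P5, P4 || P5
```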

28 Advanced Computer Architecture
Concurrent Processing — Bernstein's Condition [Figure: data-flow graph of the example] P1: C = D * E, P2: M = G + C, P3: A = B + C, P4: C = L + M, P5: F = G / E

29 Advanced Computer Architecture
Concurrent Processing — Bernstein's Condition Example: If two adders are available then, P1: C = D * E P2: M = G + C P3: A = B + C P4: C = L + M P5: F = G / E

30 Advanced Computer Architecture
Concurrent Processing Bernstein's Conditions — Example [Figure: execution graph with two adders, showing both resource dependence and data dependence] P1: C = D * E, P2: M = G + C, P3: A = B + C, P4: C = L + M, P5: F = G / E

31 Advanced Computer Architecture
Concurrent Processing Hardware parallelism refers to the type and degree of parallelism defined by the architecture and hardware multiplicity — a k-issue processor is a processor with the hardware capability to issue k instructions per machine cycle.

32 Advanced Computer Architecture
Concurrent Processing Software parallelism is defined by the control and data dependence of programs. It is a function of algorithm, programming style, and compiler optimization.

33 Advanced Computer Architecture
Concurrent Processing Hardware vs. software parallelism N = (A * B) + (C * D) M = (A * B) - (C * D)

34 Advanced Computer Architecture
Concurrent Processing Hardware vs. software parallelism In machine code we have:
Load R1, A
Load R2, B
Load R3, C
Load R4, D
Mult Rx, R1, R2
Mult Ry, R3, R4
Add R1, Rx, Ry
Sub R2, Rx, Ry
Store N, R1
Store M, R2

35 Advanced Computer Architecture
Concurrent Processing Hardware vs. software parallelism A machine which allows parallel multiplications, add/subtract, and simultaneous load/store operations gives an average software parallelism of 10/4 = 2.5 instructions per cycle.

36 Advanced Computer Architecture
Concurrent Processing Hardware vs. software parallelism Load R1, A Load R2, B Load R3, C Load R4, D Mult Rx, R1, R2 Mult Ry, R3, R4 Add R1, Rx, Ry Sub R2, Rx, Ry Store N, R1 Store M, R2

37 Advanced Computer Architecture
Concurrent Processing Hardware vs. software parallelism For a machine which does not allow simultaneous load/store and arithmetic operations, we have an average software parallelism of 10/8 = 1.25 instructions per cycle.
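
The two averages can be reproduced by grouping the ten instructions into machine cycles under each model. The groupings below are one plausible schedule consistent with the cycle counts above (4 cycles vs. 8 cycles); they are illustrative rather than copied from the original figures.

```python
# Model 1: loads/stores and arithmetic operations may all proceed in parallel.
model1_cycles = [
    ["Load R1,A", "Load R2,B", "Load R3,C", "Load R4,D"],
    ["Mult Rx,R1,R2", "Mult Ry,R3,R4"],
    ["Add R1,Rx,Ry", "Sub R2,Rx,Ry"],
    ["Store N,R1", "Store M,R2"],
]

# Model 2: memory operations and arithmetic cannot overlap (two memory
# accesses per cycle are assumed here so that the total comes to 8 cycles).
model2_cycles = [
    ["Load R1,A", "Load R2,B"],
    ["Load R3,C", "Load R4,D"],
    ["Mult Rx,R1,R2"],
    ["Mult Ry,R3,R4"],
    ["Add R1,Rx,Ry"],
    ["Sub R2,Rx,Ry"],
    ["Store N,R1"],
    ["Store M,R2"],
]

def avg_parallelism(cycles):
    """Average software parallelism = instructions issued / cycles used."""
    return sum(len(c) for c in cycles) / len(cycles)

print(avg_parallelism(model1_cycles))  # 10 / 4 = 2.5
print(avg_parallelism(model2_cycles))  # 10 / 8 = 1.25
```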

38 Advanced Computer Architecture
Concurrent Processing Hardware vs. software parallelism Load R1, A Load R2, B Load R3, C Load R4, D Mult Rx, R1, R2 Mult Ry, R3, R4 Add R1, Rx, Ry Sub R2, Rx, Ry Store N, R1 Store M, R2

39 Advanced Computer Architecture
Concurrent Processing Hardware vs. software parallelism Now assume a multiprocessor composed of two processors:

40 Advanced Computer Architecture
Concurrent Processing Hardware vs. software parallelism Load R1, A Load R2, B Load R3, C Load R4, D Mult Rx, R1, R2 Mult Ry, R3, R4 Add R1, Rx, Ry Sub R2, Rx, Ry Store N, R1 Store M, R2

41 Advanced Computer Architecture
Summary Dependence Graph Different types of dependencies Bernstein's Conditions Hardware Parallelism Software Parallelism

42 Advanced Computer Architecture
Concurrent Processing Compilation support is one way to solve the mismatch between software and hardware parallelism. A suitable compiler could exploit hardware features in order to improve performance (machine dependent optimization).

43 Advanced Computer Architecture
Concurrent Processing Detection of concurrency in a program at the instruction level using techniques like Bernstein's conditions is not practical, especially in the case of large programs. So we will look at the detection of concurrency at a higher level. This brings us to the issues of partitioning, scheduling, and load balancing.

44 Advanced Computer Architecture
Concurrent Processing Two issues of concern: How can we partition a program into concurrent branches, program modules, or grains to yield the shortest execution time, and What is the optimal size of concurrent grains in a computation?

45 Advanced Computer Architecture
Concurrent Processing Partitioning is defined as the ability to partition a program into subprograms that can be executed in parallel. Within the scope of partitioning, two major issues are of concern: Grain Size, and Latency

46 Advanced Computer Architecture
Partitioning and Scheduling Grain Size Granularity or grain size is a measure of the amount of computation involved in a process — It determines the basic program segment chosen for parallel processing. Grain sizes are commonly named as: fine, medium, and coarse.

47 Advanced Computer Architecture
Partitioning and Scheduling Latency Latency imposes a limiting factor on the scalability of the underlying platform. Communication latency — inter-processor communication — is a major factor of concern to a system designer.

48 Advanced Computer Architecture
Partitioning and Scheduling In general, n tasks communicating with each other may require n (n-1) / 2 communication links among them. This leads to a communication bound which limits the number of processors allowed in a computer system.

49 Advanced Computer Architecture
Partitioning and Scheduling Parallelism can be exploited at various levels: Job or Program Subprogram Procedure, Task, or Subroutine Loops or Iteration Instruction or Statement

50 Advanced Computer Architecture
Partitioning and Scheduling The lower the level, the finer the granularity. The finer the granularity, the higher the communication and scheduling overheads. The finer the granularity, the higher the degree of parallelism.

51 Advanced Computer Architecture
Partitioning and Scheduling Instruction and loop levels represent fine grain size, the procedure level represents medium grain size, and the subprogram and job levels represent coarse grain size.

52 Advanced Computer Architecture
Partitioning and Scheduling Instruction Level An instruction-level granularity represents a grain size consisting of up to 20 instructions. This level offers a high degree of parallelism in common programs. It is expected that this parallelism will be exploited automatically by the compiler.

53 Advanced Computer Architecture
Partitioning and Scheduling Loop Level Here typically, we are concerned about iterative loops with less than 500 instructions. At this level, one can distinguish two classes of loop: Loops with independent iterations and, Loops with dependent iterations.

54 Advanced Computer Architecture
Partitioning and Scheduling Procedure Level A typical grain at this level contains less than 2,000 instructions. The communication requirement and penalty at this level are lower than at the fine-grain levels, at the expense of more complexity in the detection of parallelism — inter-procedural dependence.

55 Advanced Computer Architecture
Partitioning and Scheduling Subprogram Level Multiprogramming on a uni-processor or on a multiprocessor platform represents this level. In the past, parallelism at this level was exploited by the programmers or algorithm designers rather than by compilers.

56 Advanced Computer Architecture
Partitioning and Scheduling Job Level This level corresponds to the parallel execution of independent jobs (programs) on concurrent computers. Supercomputers with a small number of powerful processors are the best platform for this level of parallelism. In general, parallelism at this level is exploitable by the program loader and the operating system.

57 Advanced Computer Architecture
Grain Packing Let us look at the following example to motivate the effect of grain size on the performance. In the following discussion, we make a reference to the term program graph.

58 Advanced Computer Architecture
Grain Packing A program graph is a dependence graph in which: Each operation is labeled as (n, s) where n is the node identifier and s is the execution time of the node. Each edge is labeled as (v, d) where v is the edge identifier and d is the communication delay. Consider the following program graph:

59 Advanced Computer Architecture
Grain Packing — Fine grain size [Figure: fine-grain program graph with nodes 1–17 labeled (n, s) and edges labeled (v, d)]

60 Advanced Computer Architecture
Grain Packing — Fine grain size Nodes 1–6 are all memory references: 1 cycle to calculate the address and 6 cycles to fetch data from memory. The other nodes are CPU operations requiring 2 cycles each.

61 Advanced Computer Architecture
Grain Packing — Fine grain size The idea behind grain packing is to apply fine-grain partitioning first to achieve a higher degree of parallelism, and then to combine multiple fine-grain nodes into a coarse-grain node whenever doing so eliminates unnecessary communication delays or reduces the overall cost. A small sketch of this idea follows.
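
The sketch below uses a pessimistic cost model in which every cross-node edge contributes its full communication delay to the path length; the four-node graph is illustrative, not the 17-node graph from the slides.

```python
import functools

# A program graph: node -> execution time, edge (u, v) -> communication delay.
nodes = {"n1": 1, "n2": 1, "n3": 2, "n4": 2}
edges = {("n1", "n3"): 6, ("n2", "n3"): 6, ("n3", "n4"): 4}

def critical_path(nodes, edges):
    """Longest path through the graph, counting node execution times and,
    pessimistically, every edge's communication delay."""
    succs = {}
    for (u, v), d in edges.items():
        succs.setdefault(u, []).append((v, d))

    @functools.lru_cache(maxsize=None)
    def longest_from(u):
        tails = [d + longest_from(v) for v, d in succs.get(u, [])]
        return nodes[u] + max(tails, default=0)

    return max(longest_from(u) for u in nodes)

def pack(nodes, edges, group, name):
    """Merge a set of fine-grain nodes into one coarse-grain node:
    edges internal to the group (and their delays) disappear."""
    new_nodes = {n: t for n, t in nodes.items() if n not in group}
    new_nodes[name] = sum(nodes[n] for n in group)
    new_edges = {}
    for (u, v), d in edges.items():
        u2 = name if u in group else u
        v2 = name if v in group else v
        if u2 != v2:
            new_edges[(u2, v2)] = max(new_edges.get((u2, v2), 0), d)
    return new_nodes, new_edges

print(critical_path(nodes, edges))                                   # 15
print(critical_path(*pack(nodes, edges, {"n1", "n2", "n3"}, "A")))    # 10
```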

62 Advanced Computer Architecture
Grain Packing — Partitioning [Figure: the fine-grain program graph partitioned into five coarse-grain nodes A–E]

63 Advanced Computer Architecture
Grain Packing — Coarse Grain [Figure: coarse-grain program graph with nodes A,8; B,4; C,4; D,6; E,6 and the remaining communication edges]

64 Advanced Computer Architecture
Grain Packing — Scheduling Fine Grain Size [Figure: Gantt chart of the fine-grain schedule on P1 and P2, showing busy, communication, and idle periods]

65 Advanced Computer Architecture
Grain Packing — Scheduling Coarse Grain Size [Figure: Gantt chart of the coarse-grain schedule (nodes A–E) on P1 and P2, showing busy, communication, and idle periods]

66 Advanced Computer Architecture
Grain Packing As can be noted, through grain packing we were able to reduce the overall execution time by reducing the communication overhead. The concept of grain packing can be recursively applied.

67 Advanced Computer Architecture
Task Duplication Task duplication is another way to reduce the communication overhead and hence the execution time. Consider the following program graph:

68 Advanced Computer Architecture
Task Duplication [Figure: program graph with nodes A,4; B,1; C,1; D,2; E,2 and labeled communication edges]

69 Advanced Computer Architecture
Task Duplication [Figure: Gantt chart of the schedule without duplication; P1 runs A, B, D, and E, while P2 runs C]

70 Advanced Computer Architecture
Task Duplication Now let us duplicate tasks A and C: [Figure: program graph after duplication, with copies A' and C' added]

71 Advanced Computer Architecture
Task Duplication [Figure: Gantt chart of the schedule after duplicating tasks A and C on both processors]

72 Advanced Computer Architecture
Performance improvement: Advances in Technology, Architectural Advances, Better Resource Management, Program Behavior, Extraction of Concurrency/Parallelism.

73 Advanced Computer Architecture
Summary: Dependence Graph; Bernstein's conditions; in partitioning a task into subtasks, two issues must be taken into consideration: grain size and latency; Program Graph; Grain Packing; Task Duplication.

74 Advanced Computer Architecture
Loop Scheduling Loops are the largest source of parallelism in a program. Therefore, there is a need to pay attention to the partitioning and allocation of loop iterations among processors in a parallel environment.

75 Advanced Computer Architecture
Loop Scheduling Practically we can talk about three classes of loops: Sequential Loops, Doall Loops — vector (parallel) loops, Doacross Loops — Loop with intermediate degree of parallelism.

76 Advanced Computer Architecture
Loop Scheduling — Doall Loops Static Scheduling Dynamic Scheduling

77 Advanced Computer Architecture
Loop Scheduling — Doall Loops Static Scheduling schemes assign a fixed number of iterations to each processor. If N is the number of iterations and P is the number of processors, then:

78 Advanced Computer Architecture
Loop Scheduling — Doall Loops Block Scheduling (static chunking) assigns iterations (1, …, N/P), (N/P+1, …, 2*N/P), … to processors 1, 2, …, respectively. Cyclic Scheduling assigns iterations (i, i+P, i+2P, …) to processor i (1 ≤ i ≤ P).
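
The two static assignments can be written down directly. A minimal sketch, assuming the last processor simply absorbs any remainder when N is not a multiple of P:

```python
def block_schedule(n, p):
    """Block scheduling (static chunking): processor i gets the contiguous
    iterations (i-1)*n/p + 1 .. i*n/p; the last processor takes any remainder."""
    size = n // p
    return {i: list(range((i - 1) * size + 1,
                          (i * size if i < p else n) + 1))
            for i in range(1, p + 1)}

def cyclic_schedule(n, p):
    """Cyclic scheduling: processor i gets iterations i, i+P, i+2P, ..."""
    return {i: list(range(i, n + 1, p)) for i in range(1, p + 1)}

print(block_schedule(8, 2))   # {1: [1, 2, 3, 4], 2: [5, 6, 7, 8]}
print(cyclic_schedule(8, 2))  # {1: [1, 3, 5, 7], 2: [2, 4, 6, 8]}
```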

79 Advanced Computer Architecture
Loop Scheduling — Doall Loops In practice, cyclic scheduling offers a better load balancing than block scheduling if the computation performed by each iteration varies significantly.

80 Advanced Computer Architecture
Loop Scheduling — Doall Loops Dynamic scheduling schemes have been proposed to respond to the imbalance work-load of the static scheduling schemes: Self Scheduling Scheme Fixed size chunking Scheme Guided Self Scheduling Scheme Factoring Scheme Trapezoid Self Scheduling Scheme

81 Advanced Computer Architecture
Loop Scheduling — Doall Loops In the self-scheduling scheme, one iteration at a time is assigned, on demand, to a processor (too many scheduling steps). Fixed-size chunking assigns a fixed-size chunk of iterations on demand (which could result in an imbalanced load if the execution times of iterations vary).

82 Advanced Computer Architecture
Loop Scheduling — Doall Loops Guided self scheduling, Factoring, and Trapezoid Self Scheduling schemes assign chunks of varying size based on the remaining number of iterations.

83 Advanced Computer Architecture
Loop Scheduling — Doall Loops Guided self-scheduling assigns chunks of size ⌈R/P⌉ on demand to each idle processor, where P is the number of processors and R is the remaining number of iterations. Larger chunks are assigned at the earlier stages and smaller chunks at the later stages to smooth load imbalance. This scheme does not perform well if the execution times of iterations vary. If N = 100 and P = 4, then the chunk sizes are: 25, 19, 14, 11, 8, 6, …

84 Advanced Computer Architecture
Loop Scheduling — Doall Loops Factoring, at each allocation cycle, assigns chunks of equal size ⌈R/(2P)⌉ to the processors, where P is the number of processors and R is the remaining number of iterations. Relative to GSS, smaller chunks are assigned at the earlier stages. If N = 100 and P = 4, then the chunk sizes are: (13, 13, 13, 13), (6, 6, 6, 6), …

85 Advanced Computer Architecture
Loop Scheduling — Doall Loops Trapezoid Self-Scheduling fixes the first (f) and last (l) chunk sizes and assigns successive chunks of linearly decreasing size: C1 = f, Ci+1 = Ci - s, where s = (f - l)/(C - 1) and C = 2N/(f + l) is the number of chunks. If N = 100, P = 4, f = 25, and l = 1, then the chunk sizes are: 25, 25-3 = 22, 22-3 = 19, …
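
The chunk-size rules of the three dynamic schemes can be generated directly. The sketch below reproduces the N = 100, P = 4 sequences quoted above; the rounding choices (ceilings for GSS and factoring, an integer step with a clipped last chunk for TSS) are assumptions consistent with those numbers.

```python
import math

def gss_chunks(n, p):
    """Guided self-scheduling: each chunk is ceil(R / P),
    where R is the number of remaining iterations."""
    r, chunks = n, []
    while r > 0:
        c = math.ceil(r / p)
        chunks.append(c)
        r -= c
    return chunks

def factoring_chunks(n, p):
    """Factoring: in each allocation cycle, P equal chunks of ceil(R / (2P))."""
    r, chunks = n, []
    while r > 0:
        c = max(1, math.ceil(r / (2 * p)))
        for _ in range(p):
            if r == 0:
                break
            take = min(c, r)
            chunks.append(take)
            r -= take
    return chunks

def tss_chunks(n, f, l):
    """Trapezoid self-scheduling: C = ceil(2N/(f+l)) chunks whose sizes
    decrease linearly from f toward l (step s); the last chunk is clipped."""
    c = math.ceil(2 * n / (f + l))
    s = (f - l) // (c - 1)
    r, size, chunks = n, f, []
    while r > 0:
        take = min(size, r)
        chunks.append(take)
        r -= take
        size = max(size - s, l)
    return chunks

print(gss_chunks(100, 4))        # [25, 19, 14, 11, 8, 6, 5, 3, 3, 2, 1, 1, 1, 1]
print(factoring_chunks(100, 4))  # 13 x4, 6 x4, 3 x4, 2 x4, 1 x4
print(tss_chunks(100, 25, 1))    # [25, 22, 19, 16, 13, 5]
```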

86 Advanced Computer Architecture
Assume N = 1000 and 4 processors. GSS assigns chunks of sizes 250, 188, 141, 106, 79, 59, 45, 33, 25, 19, 14, 11, 8, 6, 4, 3, 3, 2, 1, 1, 1, 1 to processors P1–P4 in turn.

87 Advanced Computer Architecture
Factoring assigns chunk sizes 125, 63, 31, 16, 8, 4, 2, 1, each repeated once per processor P1–P4. Trapezoid self-scheduling (f = 125, l = 1) assigns chunk sizes 125, 117, 109, 101, 93, 85, 77, 69, 61, 53, 45, 37, 28.

88 Advanced Computer Architecture
Loop Scheduling — Doall Loops (Shared Memory, Uniform Access) [Figure: speedup vs. number of PEs for FS, GSS, Factoring, and TSS on matrix multiplication (N = 300)] With small variances in iteration execution times, all schemes behave about the same when the number of iterations is large and the variance is small.

89 Advanced Computer Architecture
Loop Scheduling — Doall Loops (Shared Memory, Uniform Access) [Figure: speedup vs. number of PEs for FS, GSS, Factoring, and TSS on adjoint convolution (N = 100)] With larger variances in iteration execution times, Factoring and TSS perform much better than FS and GSS.

90 Advanced Computer Architecture
[Table: number of iterations assigned to a processor at each scheduling step, with N = 1000 and P = 4, for FS, GSS, Factoring, and TSS (f = 125, l = 1); the chunk sequences are those listed on the previous two slides.]

91 Advanced Computer Architecture
Loop Scheduling — Doall Loops (Shared Memory Non-Uniform Memory Access) Affinity Scheduling Dynamic Partitioned Affinity Scheduling Wrapped Partitioned Affinity Scheduling Locality Based Dynamic Scheduling

92 Advanced Computer Architecture
Loop Scheduling — Doall Loops (Shared Memory, Non-Uniform Memory Access) In this scenario, access cost increases with distance, so the algorithms should take the location of data into consideration. As a result, iterations are scheduled, at least initially, on the processor that holds the required data. Affinity scheduling attempts to balance the load, minimize the number of synchronization operations, and exploit affinity. Initially, a fixed N/P iterations are assigned to each processor; when a processor becomes idle, it takes 1/P of the iterations remaining in its local queue. If its local queue is empty, it takes 1/P of the iterations from a heavily loaded processor.
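
A minimal sketch of this affinity-scheduling policy; the queue representation, the round-robin treatment of idle processors, and the choice of the most loaded queue as the steal victim are illustrative assumptions rather than details from the slides.

```python
import math

def affinity_schedule(n, p):
    """Trace which processor executes which iterations under affinity
    scheduling: initial N/P blocks, then 1/P of the local queue per grab,
    stealing 1/P of the most loaded queue when the local queue is empty."""
    size = math.ceil(n / p)
    queues = [list(range(i * size, min((i + 1) * size, n))) for i in range(p)]
    executed = [[] for _ in range(p)]
    while any(queues):
        for i in range(p):                    # idealized round-robin of idle PEs
            src = i
            if not queues[src]:               # local queue empty: steal
                src = max(range(p), key=lambda j: len(queues[j]))
                if not queues[src]:
                    continue
            grab = max(1, math.ceil(len(queues[src]) / p))
            chunk, queues[src] = queues[src][:grab], queues[src][grab:]
            executed[i].extend(chunk)
    return executed

for i, its in enumerate(affinity_schedule(16, 4)):
    print(f"P{i}: {its}")
```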

93 Advanced Computer Architecture
Loop Scheduling — Doall Loops (Shared Memory, Non-Uniform Memory Access) Dynamic Partitioned Affinity Scheduling, Wrapped Partitioned Affinity Scheduling, and Locality-Based Dynamic Scheduling are intended for loops with widely varying execution times; they dynamically determine the chunk sizes in subsequent allocations and assume that the iteration size is proportional to the iteration index. Wrapped partitioning is similar to dynamic partitioning, except that a processor is assigned iterations that are at a distance of P from each other. In locality-based dynamic scheduling, a processor first executes the iterations for which its data is locally available.

94 Advanced Computer Architecture
Loop Scheduling — Doall Loops (Shared Memory, Non-Uniform Memory Access) [Figure: completion time vs. number of PEs for GSS, AFS, DPAS, and WPAS on a rectangular workload (N = 128)] In a rectangular workload, the iterations can be partitioned into large and small ones. WPAS performs the best, since the heavy iterations are not all assigned to the same processor.

95 Advanced Computer Architecture
Loop Scheduling — Doall Loops (Shared Memory, Non-Uniform Memory Access) [Figure: completion time vs. number of PEs for GSS, AFS, DPAS, and WPAS on the Jacobi algorithm (128 × 128 matrix)] In the Jacobi algorithm (a triangular workload), the iteration execution time decreases linearly; WPAS still performs better because of a more uniform distribution of the workload. In the case of a balanced workload (fixed iteration execution times), all affinity algorithms offer the same performance.

96 Advanced Computer Architecture
Summary Why Loops? Different types of Loops Scheduling of Doall Loops Static allocation vs. dynamic allocation Scheduling Doall Loops in UMA and NUMA

97 Advanced Computer Architecture
Loop Scheduling Practice has shown that the loss of parallelism from serializing Doacross loops is very significant. There is a need to develop good schemes for the parallel execution of Doacross loops.

98 Advanced Computer Architecture
Loop Scheduling — Doacross Loops In a Doacross loop, iterations may be either data or control dependent on each other: Control dependence is caused by conditional statements. Data dependence appears in the form of sharing computational results. Data dependence can be either lexically forward or lexically backward.

99 Advanced Computer Architecture
Loop Scheduling — Doacross Loops Doacross loops can be either: regular (dependence distance is fixed) or irregular (dependence distance varies from iteration to iteration). Regular Doacross loops are more amenable to parallel execution than irregular loops.

100 Advanced Computer Architecture
Loop Scheduling — Doacross Loops Data dependence appears in the form of lexically forward or lexically backward dependencies.

101 Advanced Computer Architecture
Lexically Forward Dependency

102 Advanced Computer Architecture
Lexically Backward Dependency

103 Advanced Computer Architecture
Loop Scheduling — Doacross Loops DOACROSS Model — Cytron; Pre-synchronization Scheduling — Krothapalli & Sadayappan; Staggered Distribution Scheme — Lim et al.

104 Advanced Computer Architecture
Loop Scheduling DOACROSS Model Each iteration is assigned to a virtual processor, and the execution of two successive virtual processors is delayed by a time period d — d can range from zero (DOALL) to T (the sequential loop).

105 Advanced Computer Architecture
Loop Scheduling DOACROSS Model Assume: T = the execution time of one iteration of the loop, d = the delay due to the lexically-backward dependency, C = the inter-processor communication cost, and T - d = the portion of the loop iteration that can be executed in parallel.

106 Advanced Computer Architecture
Loop Scheduling DOACROSS Model — Example Assume we have the following loop, which is iterated 8 times. Further, assume the loop execution time is 10 and d is 4. Finally, assume that we have three processors: DOACROSS I = 1, 8 { d = 4, T = 10 } END

107 Advanced Computer Architecture
Loop Scheduling DOACROSS Model — Example

108 Advanced Computer Architecture
Loop Scheduling — Doacross Loops The total execution time of a DOACROSS loop L of n iterations, for an unbounded number of processors, is TE(L) = (n-1)d + T; the model is extended for a limited number of processors (P) and for the case in which inter-processor communication (C) is taken into account.
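
For the bounded cases the finish time can be obtained by simulating the model: iteration i may start only after iteration i-1 has finished its serial portion d (plus a communication delay C when the two iterations sit on different processors), and a processor runs its own iterations back to back. This event-driven sketch is an interpretation of the model under a cyclic assignment of iterations, not a formula from the slides.

```python
def doacross_time(n, p, T, d, C=0):
    """Finish time of n DOACROSS iterations on p processors.
    Iterations are assigned cyclically (iteration i runs on processor i mod p).
    Iteration i may start only after iteration i-1 has completed its serial
    portion d (plus communication cost C if i-1 ran on a different processor),
    and only when its own processor is free."""
    proc_free = [0] * p          # time at which each processor becomes idle
    serial_done = 0              # time the previous iteration's d-portion ends
    finish = 0
    for i in range(n):
        pe = i % p
        ready = serial_done + (C if i > 0 and (i - 1) % p != pe else 0)
        start = max(proc_free[pe], ready)
        serial_done = start + d  # the next iteration may begin after this point
        finish = start + T
        proc_free[pe] = finish
    return finish

print(doacross_time(8, 3, T=10, d=4))    # the slides' example with C = 0
print(doacross_time(8, 100, T=10, d=4))  # effectively unbounded PEs: (8-1)*4 + 10 = 38
```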

109 Advanced Computer Architecture
Loop Scheduling The DOACROSS model was intended to model the execution of sequential loops, vector loops, and loops with intermediate parallelism by considering the control dependencies and the data dependencies.

110 Advanced Computer Architecture
Loop Scheduling Control dependencies are caused by conditional and unconditional branches. Data dependencies are due to the lexically forward and lexically backward dependencies. This simple model, however, does not consider the effect of inter-processor communication cost.

111 Advanced Computer Architecture
Loop Scheduling Staggered Distribution The staggered distribution uses heuristics to distribute loop iterations unevenly among processors to mask the delay due to the data dependencies and inter-processor communications.

112 Advanced Computer Architecture
Staggered Distribution Assume: T = the execution time of one iteration of the loop, d = the delay due to the lexically-backward dependency, C = the inter-processor communication cost, and T - d = the portion of the loop iteration that can be executed in parallel.

113 Advanced Computer Architecture
Staggered Distribution The iterations assigned to PEi succeed the iterations assigned to PEi-1, with PEi having mi more iterations than PEi-1. The additional iterations (mi) are used to mask out the communication delay.

114 Advanced Computer Architecture
Staggered Distribution The number of iterations allocated to PEi would be:

115 Advanced Computer Architecture
Refer to the earlier example. For C = 0, the staggered scheme makes a 2, 3, 3 assignment of the iterations among the three processors, which results in an execution time of 2*10 + 3*4 + 3*4 = 44 and hence a speedup of 100/44. For C = 6, it makes a 1, 2, 5 assignment, which results in an execution time of (1*10 + 6) + (2*4 + 6) + 5*4 = 50 and hence a speedup of 160/50.

116 Advanced Computer Architecture
Staggered Distribution Random Loops For random loops, the number of iterations is not known in advance and termination of the loop is based on a parameter that is being modified during each loop iteration. In this case, the probability of the successful continuation is used to calculate the expected number of iterations.

117 Advanced Computer Architecture
Staggered Distribution Random Loops Based on the expected number of iterations, a copy of the loop along with the number of iterations is assigned to each processor based on the staggered scheme.

118 Advanced Computer Architecture
Staggered Distribution ─ Evaluation To evaluate the effectiveness of the staggered allocation scheme, a representative loop with T = 50 and loops 3, 4, 5, 11, 13, and 19 of the Livermore Loops have been simulated and compared using the staggered scheme, even distribution, and cyclic distribution.

119 Advanced Computer Architecture
Staggered Distribution ─ Evaluation [Table: maximum speed-up using the staggered scheme, N = 300] Notation: k = d/T; C/T = communication time / iteration execution time; AP = average parallelism (the ratio of the total execution time to the critical-path length); MS = maximum speedup. The staggered distribution offers a speedup close to the average parallelism.

120 Advanced Computer Architecture
Staggered Distribution ─ Evaluation [Figure: speedup and number of processors needed to attain maximum speedup vs. k = d/T, for equal distribution and staggered distribution, with C/T = 1.0 and N = 1000] The staggered distribution offers a better speedup while using fewer processors across the range of k.

121 Advanced Computer Architecture
Staggered Distribution ─ Evaluation [Table: speedup of the staggered distribution relative to Static Chunking (SC) and Cyclic Scheduling (CYC) for k = 0.1 to 0.9 and C/T = 3.0 to 6.0, with the number of processors used shown in parentheses; for example, at k = 0.1 and C/T = 3.0 the staggered scheme achieves a speedup of 1.28 over SC and 22.54 over CYC using 17 processors.]

122 Advanced Computer Architecture
Staggered Distribution ─ Evaluation Speedup (SU) of the Staggered Distribution vs. Static Chunking (SC) and Cyclic (CYC) Scheduling on the Livermore Loops; the number of processors used by the staggered distribution is shown in parentheses.
Loop #   k      C/T     PE = 4: SU(SC)  SU(CYC)    PE = 8: SU(SC)  SU(CYC)
3        0.25   3.75    1.20            10.72      1.21 (7)        13.10 (7)
5        0.30   3.00    1.21            8.22       1.16 (6)        9.35 (6)
11       -      10.50   -               -          -               12.18 (7)
13       0.05   0.71    1.07            2.82       1.14            5.05
19(1)    0.33   3.33    1.24            7.53       1.34 (4)        7.53 (4)
19(2)    0.27   2.73    1.23            6.86       1.28 (5)        6.93 (5)

123 Advanced Computer Architecture
Staggered Distribution The staggered distribution is effective for loops with a large degree of parallelism among iterations. The maximum speedup attained is very close to the optimal speedup, and it is achieved while using a smaller number of processors. The staggered scheme, however, distributes an unbalanced load among the processors.

124 Advanced Computer Architecture
Cyclic Staggered Distribution Staggered distribution was extended (i.e., Cyclic Staggered distribution) to address: Uneven distribution of the loop iterations among the processors, Loops with varying iteration execution times, and Availability of free processors below maxpe.

125 Advanced Computer Architecture
Cyclic Staggered Distribution Initially, loop iterations are distributed in a staggered fashion among the P processors. In the second phase, the remaining iterations are redistributed based on the staggered concept, taking into account nP, the number of iterations already allocated to PEi. When the execution times of loop iterations vary, this scheme uses an estimated worst-case iteration execution time (possibly augmented by runtime support) in determining the distribution for the second and subsequent passes.

126 Advanced Computer Architecture
Comparative Analysis of Doacross Scheduling [Table: number of iterations assigned to a processor at each scheduling step under Cyclic, SC, SD, and CSD scheduling, with T = 10, d = 2, n = 500, C = 5, and P = 4; scheduling steps 1–19]

127 Advanced Computer Architecture
Comparative Analysis of Doacross Scheduling (continued) [Table continued: scheduling steps 20–36]

128 Advanced Computer Architecture
Comparative Analysis of Doacross Scheduling (continued) [Table continued: scheduling steps 37–46] Resulting execution times: Cyclic 3,503; SC 2,015; SD 1,710; CSD 1,298.

129 Advanced Computer Architecture
Cyclic Staggered Distribution ─ Evaluation (C/T = 3.0) [Figure: speedup vs. number of PEs for CS1, CS2, and SD with k = 0.2] The cyclic staggered distribution performs better than the staggered distribution regardless of the number of processors, the C/T ratio, and the k factor.

130 Advanced Computer Architecture
Cyclic Staggered Distribution ─ Evaluation (C/T = 5.0) [Figure: speedup vs. number of PEs for CS1, CS2, and SD with k = 0.2]

131 Advanced Computer Architecture
Staggered Scheme in a control-flow environment [Figure: speedup vs. number of PEs for SD, SC, and CYC] The loop was run on the nCUBE 2 multiprocessor using 2 to 16 processors (PEs).

132 Advanced Computer Architecture
Staggered Scheme in a control-flow environment [Figure: speedup vs. number of PEs for CS1, CS2, and SD] CS1 and CS2 performed better than SD only when the number of processors is between 3 and 5. For an environment with two processors, SD performed better than CS1 and CS2 because of the additional overhead incurred to implement the cyclic staggered schemes.

133 Advanced Computer Architecture
Irregular Doacross Loop Scheduling For irregular Doacross loops, the dependence patterns are complicated and usually are not predictable at compile time.
DO I = 1, N
  Sp: A(B(I)) := .....
  Sq: := A(C(I))
END

134 Advanced Computer Architecture
Irregular Doacross Loop Scheduling
DO I = 1, 6
  A(I) := A(B(I))
END
[Figure: iteration space graph for iterations 1–6] With the given index array B, the operations performed are: A(1) ← A(1), A(2) ← A(3), A(3) ← A(1), A(4) ← A(5), A(5) ← A(3), A(6) ← A(4).

135 Advanced Computer Architecture
Irregular DoAcross Loop Scheduling Pre-synchronized Scheduling Runtime Parallelizing Schemes The Zhu-Yew Scheme Chen’s Scheme

136 Advanced Computer Architecture
Summary Different types of loop Sequential Parallel Loops with intermediate degree of parallelism Loop scheduling Static Dynamic Staggered Allocation

137 Advanced Computer Architecture
Partitioning and Scheduling The goal of scheduling (scheduler) is to determine an assignment of tasks to processing elements to optimize certain performance metrics ─ mainly performance and efficiency. Scheduling algorithms are judged on their time complexity and efficiency of the produced schedule.

138 Advanced Computer Architecture
Partitioning and Scheduling A proper allocation scheme should partition a task (program graph) into subtasks (sub-graphs) and distribute the subtasks among processors in an attempt to minimize: processor contention, and inter-processor communication.

139 Advanced Computer Architecture
Partitioning and Scheduling There are two main allocation schemes: Static, and Dynamic.

140 Advanced Computer Architecture
Partitioning and Scheduling In a static scheme, the subtasks are allocated at compile time to the processors using global information about the task and system organization.

141 Advanced Computer Architecture
Partitioning and Scheduling In a dynamic scheme, the behavior of the task during run time is used to measure the loads. In other words, tasks are scheduled on the fly: assign activated subtasks to the least loaded processor, and balance the load by migrating subtasks.

142 Advanced Computer Architecture
Partitioning and Scheduling In a static allocation scheme, the cost of allocating subtasks is incurred only once, even though the program may be executed repeatedly. However, it can be inefficient when estimates of run-time-dependent characteristics are inaccurate.

143 Advanced Computer Architecture
Partitioning and Scheduling The dynamic allocation scheme has a disadvantage due to its overhead involved in determining processor loads and allocation of subtasks at run time.

144 Advanced Computer Architecture
Partitioning and Scheduling Despite the conceptual differences between static and dynamic policies, they have the same goals: exploit the maximum inherent concurrency in a task and minimize contention for resources. It has been shown that such a problem is NP-complete.

145 Advanced Computer Architecture
Partitioning and Scheduling It is possible to generate optimal solutions when restrictions are imposed on program behavior or system configuration:

146 Advanced Computer Architecture
Partitioning and Scheduling When subtasks are of the same weight and the system is composed of two processors. When the weights of the subtasks are mutually commensurable — a set of subtasks is said to be mutually commensurable if there exists a weight t such that each subtask weight is an integer multiple of t.

147 Advanced Computer Architecture
Partitioning and Scheduling Heuristic solutions are a promising approach to solving the allocation problem.

148 Advanced Computer Architecture
Summary Partitioning and Scheduling Load Balancing Static Scheduling Dynamic Scheduling Static Scheduling vs. Dynamic Scheduling Heuristic Based Scheduling and Load Balancing

149 Advanced Computer Architecture
Partitioning and Scheduling In the following we will look at several static, heuristic-based load balancing algorithms. Note that these algorithms are heuristic based — in other words, they make sense and offer a reasonable solution. Also note that these algorithms build on each other.

150 Advanced Computer Architecture
Partitioning and Scheduling A scheduling system consists of: Program tasks, Target machine, and A scheduler.

151 Advanced Computer Architecture
Partitioning and Scheduling The characteristics of a parallel program are defined as (T, p, D, A), where: T = {t1, t2, …, tn} is the set of tasks; p is a partial order on T which specifies operational precedence constraints, i.e., ti p tj means that ti must be completed before tj can start execution (ti, tj ∈ T); D is an n × n matrix of communication data, where Dij ≥ 0 is the amount of data required to be transferred from task ti to task tj; and A is a vector of size n representing the computations, i.e., Ai > 0 is the computation time of task ti. Traditionally, a program is represented as a directed graph called the program (task) graph (G ≡ (T, E)).

152 Advanced Computer Architecture
Partitioning and Scheduling T is the set of the subtasks, and E is the set of directed edges among elements of T. An edge represents a partial ordering (p) among the subtasks — ti → tj implies that ti must be executed before tj (ti, tj ∈ T). In addition, nodes and edges are labeled by numbers representing the execution time and the communication cost.

153 Advanced Computer Architecture
Partitioning and Scheduling The target machine is a set of m heterogeneous/homogeneous interconnected processing elements. Associated with each processing element Pi is its speed Si. The connectivity of the processing elements can be represented by an undirected graph called the network graph. Associated with each edge (i, j) between processing elements Pi and Pj in the network graph is the transfer rate Rij, that is, how many units of data can be transmitted per unit of time over the link.

154 Advanced Computer Architecture
Partitioning and Scheduling A schedule is a function that maps the task graph (G ≡ (T, E)) onto a target machine in order to satisfy a performance metric. The schedule can be illustrated as a Gantt chart in which the start and finish times of all tasks are easily shown. With respect to our earlier notation: Ai/Sj is the execution time of task ti when executed on processor Pj, and Dij/Rkl is the communication delay between tasks ti and tj when they are executed on adjacent processing elements Pk and Pl.
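
A small sketch of this model; the three-task instance, the speeds, and the transfer rates are made-up illustrative values, not data from the slides.

```python
# A small, illustrative instance of the (T, p, D, A) model.
tasks = ["t1", "t2", "t3"]
A = {"t1": 4, "t2": 6, "t3": 2}                 # computation amounts
D = {("t1", "t3"): 8, ("t2", "t3"): 3}          # data to transfer along edges
S = {"P1": 1.0, "P2": 2.0}                      # processor speeds
R = {("P1", "P2"): 4.0}                         # link transfer rates

def exec_time(task, proc):
    """A_i / S_j: execution time of task t_i on processor P_j."""
    return A[task] / S[proc]

def comm_delay(t_from, t_to, p_from, p_to):
    """D_ij / R_kl: communication delay when the two tasks run on
    adjacent processors P_k and P_l (zero if they share a processor)."""
    if p_from == p_to:
        return 0.0
    rate = R.get((p_from, p_to)) or R.get((p_to, p_from))
    return D.get((t_from, t_to), 0) / rate

print(exec_time("t1", "P2"))                    # 4 / 2.0 = 2.0
print(comm_delay("t1", "t3", "P1", "P2"))       # 8 / 4.0 = 2.0
```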

155 Advanced Computer Architecture
Partitioning and Scheduling — Example

156 Advanced Computer Architecture
Partitioning and Scheduling Static Scheme — List Scheduling Make a priority list of the subtasks to be assigned to the processors. Repeatedly remove the topmost element from the priority list and allocate it to the "most suitable" processor for execution.

157 Advanced Computer Architecture
Partitioning and Scheduling Static Scheme — List Scheduling How do we define the priority criteria and the "most suitable" processor? How do we account for the communication cost?

158 Advanced Computer Architecture
Partitioning and Scheduling No restriction on the number of processors, but constraints on the task graph. An in-forest is a task graph where each node has at most one immediate successor. An out-forest is a task graph where each node has at most one immediate predecessor. An interval-ordered task graph describes the precedence relations among the system tasks in an interval order: a task graph is interval ordered when its related elements can be mapped onto non-overlapping time intervals. The interval order has a special property: for any interval-ordered pair of tasks u and v, either the successors of u are also successors of v, or vice versa.

159 Advanced Computer Architecture
Partitioning and Scheduling Assumptions: The task graph consists of n tasks, the target machine is composed of m processors, the execution of each task takes one unit of time, and the communication cost between any pair of tasks is zero.

160 Advanced Computer Architecture
Partitioning and Scheduling In-forest/Out-forest task graphs: Assign the highest level first: Calculate and assign the node level to each node (node priority) ─ the node level of node x is the maximum number of nodes (including x) on any path from x to a terminal node. Whenever a processor becomes available, assign it the unexecuted (unscheduled) ready task with the highest priority.
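
A minimal sketch of this highest-level-first policy for unit-time tasks with zero communication cost; the small in-forest below (each node points to its single successor) is illustrative.

```python
# In-forest: each node has at most one immediate successor (None = terminal).
succ = {"a": None, "b": "a", "c": "a", "d": "b", "e": "b", "f": "c"}

def level(node):
    """Number of nodes (including this one) on the path to the terminal node."""
    return 1 + (level(succ[node]) if succ[node] else 0)

def hlf_schedule(succ, m):
    """Greedy schedule on m processors: at each time step, run the m ready
    tasks (all predecessors done) with the highest levels."""
    preds = {n: set() for n in succ}
    for n, s in succ.items():
        if s:
            preds[s].add(n)
    done, schedule = set(), []
    while len(done) < len(succ):
        ready = [n for n in succ if n not in done and preds[n] <= done]
        step = sorted(ready, key=level, reverse=True)[:m]
        schedule.append(step)
        done.update(step)
    return schedule

print(hlf_schedule(succ, 2))   # [['d', 'e'], ['f', 'b'], ['c'], ['a']]
```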

161 Advanced Computer Architecture
Partitioning and Scheduling ─ In-forest [Figure: an in-forest task graph with nodes a–m]

162 Advanced Computer Architecture
Partitioning and Scheduling ─ In-forest [Figure: node levels (1–5) for nodes a–m and the resulting highest-level-first schedule on four processors P1–P4]

163 Advanced Computer Architecture
Partitioning and Scheduling ─ Interval ordered Here we use the number of all successors of a node as its priority. Calculate priority of each node, Whenever a processor becomes available, assign it the unexecuted ready task with the highest priority.

164 Advanced Computer Architecture
Partitioning and Scheduling ─ Interval ordered [Figure: an interval-ordered task graph with nodes a–j]

165 Advanced Computer Architecture
Partitioning and Scheduling ─ Interval ordered [Figure: number of successors for each of the nodes a–j and the resulting schedule on three processors P1–P3]

166 Advanced Computer Architecture
Partitioning and Scheduling No restriction on the task graph, but two processors. Each node is labeled, and the labels are used as node priorities. Assign 1 to one of the terminal tasks. Repeat until all nodes are labeled: suppose labels 1, 2, …, j-1 have been assigned, and let S be the set of unlabeled tasks with no unlabeled successors. For each node x ∈ S define l(x) as follows: let L(y1), L(y2), …, L(yk) be the labels of the immediate successors of x; then l(x) is the decreasing sequence of integers formed by ordering the set {L(y1), L(y2), …, L(yk)}. Let x ∈ S be such that for all x' ∈ S, l(x) ≤ l(x') (lexicographically), and assign j to that x (i.e., L(x) = j). Whenever a processor becomes available, assign it the unexecuted (unscheduled) ready task with the highest priority.

167 Advanced Computer Architecture
Partitioning and Scheduling [Figure: task graph with nodes a–k for the two-processor labeling example]

168 Advanced Computer Architecture
Partitioning and Scheduling [Table: labels assigned to the nodes (a: 11, b: 10, c: 9, d: 8, e: 6, f: 7, g: 5, h: 4, i: 3, j: 1, k: 2) and the resulting two-processor schedule on P1 and P2]

169 Advanced Computer Architecture
Partitioning and Scheduling Static Scheme — Heavy Node First Partition the DAG into horizontal levels. Prioritize the DAG nodes level by level. At each level, nodes with a higher weight get a higher priority. Load balancing is achieved by assigning the heaviest nodes to the least loaded processor.
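
A minimal sketch of the heavy-node-first rule; the level grouping and node weights below are illustrative (not the DAG from the following slides), and communication costs are ignored.

```python
# Heavy Node First: schedule level by level, heaviest node to the least
# loaded processor.
levels = [                      # nodes grouped into horizontal levels
    [("A", 4), ("B", 2), ("G", 3)],
    [("C", 5), ("D", 1), ("E", 2), ("F", 3)],
    [("H", 6), ("I", 2)],
]

def heavy_node_first(levels, m):
    """Return the per-processor load and task list."""
    load = [0] * m
    assignment = [[] for _ in range(m)]
    for level in levels:
        for name, weight in sorted(level, key=lambda t: t[1], reverse=True):
            pe = min(range(m), key=lambda i: load[i])   # least loaded PE
            load[pe] += weight
            assignment[pe].append(name)
    return load, assignment

print(heavy_node_first(levels, 3))
# ([9, 12, 7], [['A', 'E', 'D', 'I'], ['G', 'F', 'H'], ['B', 'C']])
```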

170 Advanced Computer Architecture
Partitioning and Scheduling Static Scheme — Heavy Node First Assume the following DAG and a platform composed of three processors. Note that each node in DAG has a name and an integer representing its execution time.

171 Advanced Computer Architecture
Partitioning and Scheduling Static Scheme — Heavy Node First [Figure: example DAG with nodes A–O and the resulting heavy-node-first schedule on the three processors]

172 Advanced Computer Architecture
Partitioning and Scheduling Static Scheme — Critical Path Method A classical method for task allocation. The subtasks on the critical path determine the shortest possible execution time of the program. Prioritize the subtasks according to the length of their critical path. For an n-node DAG, the complexity of the algorithm is O(n²).
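
A minimal sketch of computing critical-path priorities: each node's priority is the length of the longest path (by node weight) from it to a terminal node. The five-node DAG is illustrative and communication costs are ignored.

```python
import functools

weights = {"A": 4, "B": 2, "C": 5, "D": 1, "E": 3}
succs = {"A": ["C", "D"], "B": ["D"], "C": ["E"], "D": ["E"], "E": []}

@functools.lru_cache(maxsize=None)
def cp_length(node):
    """Longest weight-sum path from `node` to a terminal (its priority)."""
    return weights[node] + max((cp_length(s) for s in succs[node]), default=0)

priorities = {n: cp_length(n) for n in weights}
print(priorities)   # {'A': 12, 'B': 6, 'C': 8, 'D': 4, 'E': 3}
# Subtasks are then considered in decreasing order of priority.
```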

173 Advanced Computer Architecture
Partitioning and Scheduling Static Scheme — Critical Path Method

174 Advanced Computer Architecture
Summary Partitioning and Scheduling Static, Dynamic Heuristic Solutions Heavy node first Critical Path Method

175 Advanced Computer Architecture
Partitioning and Scheduling Static Scheme — Weighted Length Algorithm Consider the weight of a node, its critical path, and the weight of its successors. Let:

176 Advanced Computer Architecture
Partitioning and Scheduling WT(P) represents the execution time of node P, U(P) represents the maximum weighted length of the children of P, and V(P) represents the sum of the weighted lengths of the children of P; the weighted length (WL) of P is then defined as: WL(P) = WT(P) + U(P) + V(P)/U(P)
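
A recursive reading of this definition; the four-node DAG is illustrative, and treating leaf nodes as WL(leaf) = WT(leaf) is an assumption, since the slide does not state the base case.

```python
import functools

# Illustrative DAG: node weights and children.
WT = {"P1": 3, "P2": 2, "P3": 4, "P4": 1}
children = {"P1": ["P2", "P3"], "P2": ["P4"], "P3": ["P4"], "P4": []}

@functools.lru_cache(maxsize=None)
def WL(p):
    """WL(P) = WT(P) + U(P) + V(P)/U(P); leaves get WL = WT (assumed)."""
    kids = children[p]
    if not kids:
        return WT[p]
    lengths = [WL(c) for c in kids]
    u, v = max(lengths), sum(lengths)
    return WT[p] + u + v / u

for p in WT:
    print(p, WL(p))   # P1: 3 + 6 + 10/6 = 10.67, P2: 4.0, P3: 6.0, P4: 1
```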

177 Advanced Computer Architecture
Partitioning and Scheduling Static Scheme — Weighted Length Algorithm

178 Advanced Computer Architecture
Partitioning and Scheduling — Communication Models Program Completion Time = Execution Time + Total Communication Delay, where Total Communication Delay = Total Number of Messages × Communication Delay per Message.

179 Advanced Computer Architecture
Partitioning and Scheduling — Communication Models Model A: the total number of messages is equal to the number of node pairs (u, v) such that (u, v) ∈ E and Proc(u) ≠ Proc(v). Model B: the total number of messages is defined as the number of processor/task pairs (P, v) such that processor P does not compute v but computes at least one of its immediate successors. Model C: a processor can compute a task and communicate with another processor simultaneously.

180 Advanced Computer Architecture
Partitioning and Scheduling — Communication Models Consider the following task graph [Figure: five tasks a–e assigned to processors P1 and P2]: if we use Model A for scheduling this program, then Completion Time = 3 + 2 = 5; if we use Model B, then Completion Time = 3 + 1 = 4.

181 Advanced Computer Architecture
Partitioning and Scheduling — Communication Models If we use Model C for scheduling this program, then: [Figure: the resulting schedule of tasks a–e on P1 and P2]

182 Advanced Computer Architecture
Partitioning and Scheduling Static Scheme — Heavy Node First + Communication Delay Define DSMi,t as the Desirable Starting Moment — the moment at which a subtask has received all of its input messages — of subtask t on processor i. Define Load(Pi) to be the accumulated execution time (including the communication delays) of processor i.

183 Advanced Computer Architecture
Partitioning and Scheduling Static Scheme — Heavy Node First + Communication Delay Define ASMi,t as the Actual Starting Moment — the moment that a subtask can begin its execution — of subtask t on processor i, then: ASMi,t = MAX (LOAD(Pi), DSMi,t)

184 Advanced Computer Architecture
Partitioning and Scheduling Static Scheme — Heavy Node First + Communication Delay Assume DSM0,3 = 5 and Load(P0) = 15, so ASM0,3 = 15; and DSM1,3 = 10 and Load(P1) = 0, so ASM1,3 = 10.

185 Advanced Computer Architecture
Partitioning and Scheduling Static Scheme — Heavy Node First + Communication Delay The following relation defines the most suitable PE(i) for task t: min over i = 1, 2, 3, …, n of ASMi,t.
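
A minimal sketch of this selection rule, using the numbers from the task-3 example two slides back; the function and variable names are illustrative.

```python
def asm(load, dsm):
    """Actual Starting Moment: ASM(i, t) = max(Load(P_i), DSM(i, t))."""
    return max(load, dsm)

def most_suitable_pe(candidates):
    """Pick the PE with the smallest ASM. `candidates` maps PE -> (load, dsm)."""
    return min(candidates, key=lambda pe: asm(*candidates[pe]))

# The numbers from the task-3 example: (Load(P_i), DSM_{i,3}).
candidates = {"P0": (15, 5), "P1": (0, 10)}
print({pe: asm(*v) for pe, v in candidates.items()})  # {'P0': 15, 'P1': 10}
print(most_suitable_pe(candidates))                   # 'P1'
```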

186 Advanced Computer Architecture
Partitioning and Scheduling Static Scheme — Heavy Node First + Communication Delay In our case then PE1 is the most suitable PE for task 3. To reduce the overhead, one can just investigate the nearest neighbors for the smallest ASM value.

187 Advanced Computer Architecture
Partitioning and Scheduling Static Scheme — Heavy Node First + Communication Delay Assume the following configuration:

188 Advanced Computer Architecture
Partitioning and Scheduling Static Scheme — Heavy Node First + Communication Delay

189 Advanced Computer Architecture
Partitioning and Scheduling Static Scheme — Heavy Node First + Communication Delay

190 Advanced Computer Architecture
Partitioning and Scheduling Static Scheme — Heavy Node First + Communication Delay For task D: DSM0,D = 10, Load(P0) = 10, so ASM0,D = 10; DSM1,D = 5, Load(P1) = 5, so ASM1,D = 5; DSM2,D = 15, Load(P2) = 5, so ASM2,D = 15; DSM3,D = 10, Load(P3) = 0, so ASM3,D = 10. For task E: DSM0,E = 15, Load(P0) = 10, so ASM0,E = 15; DSM1,E = 15, Load(P1) = 25, so ASM1,E = 25; DSM2,E = 25, Load(P2) = 5, so ASM2,E = 25; DSM3,E = 20, Load(P3) = 0, so ASM3,E = 20.

