Presentation on theme: "Advanced Computer Architecture Programming Level"— Presentation transcript:

1 Advanced Computer Architecture Programming Level
A.R. Hurson 128 EECH Building, Missouri S&T

2 Advanced Computer Architecture
Instruction Set Design A computer with a storage component which may contain both data to be manipulated and instructions to manipulate the data is called a stored program machine — the user can change the sequence of operations on the data. A program is a finite sequence of instructions that specify the operands, the operations, and the sequence in which processing has to occur.

3 Advanced Computer Architecture
Compiler and Compilation A compiler translates a high-level program (source code) into a machine-level program (object code). This translation should be performed correctly and efficiently — valid source code should be translated correctly and efficiently into efficient object code (optimization).

4 Advanced Computer Architecture
Compiler and Compilation Performance Efficient translation: one pass vs. multi pass compilers. Efficient execution: performance and hardware utilization.

5 Advanced Computer Architecture
Compiler and Compilation Compiler Optimization — Optimization can be performed at different levels: High level optimizations, Local optimizations, Global optimizations — global common sub-expression elimination, Register allocation — graph coloring and no. of registers available, Machine dependent optimization — exploitation of concurrency.

6 Advanced Computer Architecture
Concurrent Processing To move concurrent processing into the mainstream of computation, one needs to develop: a computational model for concurrent processing, proper means for interconnection in concurrent systems, and proper means to integrate concurrent processing into the general computing environment. [Figure: Abstract Model, Flow of Operations and Control, Mapping]

7 Advanced Computer Architecture
Concurrent Processing These can be attributed to issues such as: level of concurrency, computational granularity, time and space complexity, communication latencies, and scheduling and load balancing.

8 Advanced Computer Architecture
Summary: Stored Program Machine; Program; Instruction: definition and format; Addressing Mode; Compiler; Concurrent Processing.

9 Advanced Computer Architecture
Concurrent Processing Independence among segments of a program is a necessary condition to execute them concurrently. In general, two independent segments could be executed in any order without affecting each other — a segment can be an instruction or a sequence of instructions.

10 Advanced Computer Architecture
Concurrent Processing A dependence graph is used to determine the dependence relations among the program segments.

11 Advanced Computer Architecture
Concurrent Processing Dependence Graph — A dependence graph is a directed graph G ≡ (N, A) in which the set of nodes (N) represents the program segments and the set of directed arcs (A) shows the order of dependence among the segments.

12 Advanced Computer Architecture
Concurrent Processing Dependence Graph Dependence comes in various forms and kinds: Data dependence Control dependence Resource dependence

13 Advanced Computer Architecture
Concurrent Processing Data Dependence: If an instruction uses a value produced by a previous instruction, then the second instruction is data dependent on the first instruction. Data dependence comes in different forms:

14 Advanced Computer Architecture
Data dependence Flow dependence: At least one output of S1 is an input of S2 (Read-After-Write: RAW). Anti dependence: Output of S2 is overlapped with the input to S1 (Write-After-Read: WAR). Output dependence: S1 and S2 write to the same location (Write-After-Write: WAW). S1 → S2

15 Advanced Computer Architecture
Data dependence I/O dependence: The same file is referred to by two I/O statements. Unknown dependence: The dependence relation can not be determined.

16 Advanced Computer Architecture
Example Assume the following sequence of instructions: S1: R1 ← (A), S2: R2 ← (R1) + (R2), S3: R1 ← (R3), S4: B ← (R1)

17 Advanced Computer Architecture
Example [Figure: dependence graph of S1–S4] S1: R1 ← (A), S2: R2 ← (R1) + (R2), S3: R1 ← (R3), S4: B ← (R1)
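
The dependence relations in this example can be derived mechanically from each instruction's read and write sets. The following is a minimal sketch; the dictionary encoding and helper name are illustrative and not part of the original slides, and the simple pairwise check ignores intervening redefinitions (so it also reports the S1-to-S4 flow dependence that S3 actually kills).

```python
# Each instruction is modeled by the registers/locations it reads and writes.
instrs = {
    "S1": {"reads": {"A"},        "writes": {"R1"}},
    "S2": {"reads": {"R1", "R2"}, "writes": {"R2"}},
    "S3": {"reads": {"R3"},       "writes": {"R1"}},
    "S4": {"reads": {"R1"},       "writes": {"B"}},
}

def dependences(instrs):
    """Report flow (RAW), anti (WAR), and output (WAW) dependences
    between every earlier/later instruction pair."""
    names = list(instrs)
    deps = []
    for i, s1 in enumerate(names):
        for s2 in names[i + 1:]:
            a, b = instrs[s1], instrs[s2]
            if a["writes"] & b["reads"]:
                deps.append((s1, s2, "flow (RAW)"))
            if a["reads"] & b["writes"]:
                deps.append((s1, s2, "anti (WAR)"))
            if a["writes"] & b["writes"]:
                deps.append((s1, s2, "output (WAW)"))
    return deps

for dep in dependences(instrs):
    print(dep)   # ('S1', 'S2', 'flow (RAW)'), ('S2', 'S3', 'anti (WAR)'), ...
```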

18 Advanced Computer Architecture
Control dependence The order of execution is determined during run-time — conditional statements. Control dependence could also exist between operations performed in successive iterations of a loop:
Do I = 1, N
  If (A(I-1) = 0) then A(I) = 0
End

19 Advanced Computer Architecture
Control dependence Control dependence often does not allow efficient exploitation of parallelism.

20 Advanced Computer Architecture
Resource dependence Conflict in using shared resources such as concurrent request for the same functional unit. A resource conflict arises when two instructions attempt to use the same resource at the same time.

21 Advanced Computer Architecture
Resource dependence Within the scope of resource dependence, we can talk about storage dependence, ALU dependence, ...

22 Advanced Computer Architecture
Question What is “true dependence”? What is “false dependence”? What is the source of “false dependence”? How can one eliminate/moderate “resource dependence”?

23 Advanced Computer Architecture
Concurrent Processing Bernstein's Conditions Let Ii and Oi be the input and output sets of process Pi, respectively. The two processes P1 and P2 can be executed in parallel (P1 || P2) iff: I1 ∩ O2 = ∅ (no WAR), I2 ∩ O1 = ∅ (no RAW), and O1 ∩ O2 = ∅ (no WAW).

24 Advanced Computer Architecture
Concurrent Processing Bernstein's Conditions In general, P1, P2, ..., Pk can be executed in parallel if Bernstein's condition holds for every pair of processes: P1 || P2 || P3 ... || Pk iff Pi || Pj ∀ i ≠ j

25 Advanced Computer Architecture
Concurrent Processing Bernstein's Conditions The parallelism relation (||) is commutative: Pi || Pj ⇒ Pj || Pi. The parallelism relation (||) is not transitive: Pi || Pj and Pj || Pk do not necessarily guarantee Pi || Pk.

26 Advanced Computer Architecture
Concurrent Processing Bernstein's Conditions The parallelism relation (||) is therefore not an equivalence relation. The parallelism relation (||) is associative: Pi || Pj || Pk ⇒ (Pi || Pj) || Pk = Pi || (Pj || Pk)

27 Advanced Computer Architecture
Concurrent Processing Bernstein's Conditions — Example Detect parallelism in the following program, assume a uniform execution time: P1: C = D * E P2: M = G + C P3: A = B + C P4: C = L + M P5: F = G / E
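
Bernstein's conditions for this example can be checked directly from the input and output sets of each statement. The following sketch is illustrative; the set encoding and helper name are not from the slides.

```python
from itertools import combinations

# Input (read) and output (write) sets of P1..P5.
procs = {
    "P1": ({"D", "E"}, {"C"}),   # C = D * E
    "P2": ({"G", "C"}, {"M"}),   # M = G + C
    "P3": ({"B", "C"}, {"A"}),   # A = B + C
    "P4": ({"L", "M"}, {"C"}),   # C = L + M
    "P5": ({"G", "E"}, {"F"}),   # F = G / E
}

def parallel(p, q):
    """Bernstein: I1 ∩ O2 = I2 ∩ O1 = O1 ∩ O2 = empty set."""
    (i1, o1), (i2, o2) = procs[p], procs[q]
    return not (i1 & o2) and not (i2 & o1) and not (o1 & o2)

for p, q in combinations(procs, 2):
    if parallel(p, q):
        print(p, "||", q)   # P1 || P5, P2 || P3, P2 || P5, P3 || P5, P4 || P5
```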

28 Advanced Computer Architecture
Concurrent Processing — Bernstein's Condition [Figure: data-flow graph of the example] P1: C = D * E, P2: M = G + C, P3: A = B + C, P4: C = L + M, P5: F = G / E

29 Advanced Computer Architecture
Concurrent Processing — Bernstein's Condition Example: If two adders are available then, P1: C = D * E P2: M = G + C P3: A = B + C P4: C = L + M P5: F = G / E

30 Advanced Computer Architecture
Concurrent Processing Bernstein's Conditions — Example [Figure: execution graph with two adders, showing both resource dependence and data dependence] P1: C = D * E, P2: M = G + C, P3: A = B + C, P4: C = L + M, P5: F = G / E

31 Advanced Computer Architecture
Concurrent Processing Hardware parallelism refers to the type and degree of parallelism defined by the architecture and hardware multiplicity — a k-issue processor is a processor with the hardware capability to issue k instructions per machine cycle.

32 Advanced Computer Architecture
Concurrent Processing Software parallelism is defined by the control and data dependence of programs. It is a function of algorithm, programming style, and compiler optimization.

33 Advanced Computer Architecture
Concurrent Processing Hardware vs. software parallelism N = (A * B) + (C * D) M = (A * B) - (C * D)

34 Advanced Computer Architecture
Concurrent Processing Hardware vs. software parallelism In machine code we have:
Load R1, A
Load R2, B
Load R3, C
Load R4, D
Mult Rx, R1, R2
Mult Ry, R3, R4
Add R1, Rx, Ry
Sub R2, Rx, Ry
Store N, R1
Store M, R2

35 Advanced Computer Architecture
Concurrent Processing Hardware vs. software parallelism A machine which allows parallel multiplications, add/subtract, and simultaneous load/store operations gives an average software parallelism of 10/4 = 2.5 instructions per cycle.

36 Advanced Computer Architecture
Concurrent Processing Hardware vs. software parallelism Load R1, A Load R2, B Load R3, C Load R4, D Mult Rx, R1, R2 Mult Ry, R3, R4 Add R1, Rx, Ry Sub R2, Rx, Ry Store N, R1 Store M, R2

37 Advanced Computer Architecture
Concurrent Processing Hardware vs. software parallelism For a machine which does not allow simultaneous load/store and arithmetic operations, we have an average software parallelism of 10/8 = 1.25 instructions per cycle.
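
The two averages can be reproduced by grouping the ten instructions into machine cycles under each model. The groupings below are one plausible schedule consistent with the cycle counts above (4 cycles vs. 8 cycles); they are illustrative rather than copied from the original figures.

```python
# Model 1: loads/stores and arithmetic operations may all proceed in parallel.
model1_cycles = [
    ["Load R1,A", "Load R2,B", "Load R3,C", "Load R4,D"],
    ["Mult Rx,R1,R2", "Mult Ry,R3,R4"],
    ["Add R1,Rx,Ry", "Sub R2,Rx,Ry"],
    ["Store N,R1", "Store M,R2"],
]

# Model 2: memory operations and arithmetic cannot overlap (two memory
# accesses per cycle are assumed here so that the total comes to 8 cycles).
model2_cycles = [
    ["Load R1,A", "Load R2,B"],
    ["Load R3,C", "Load R4,D"],
    ["Mult Rx,R1,R2"],
    ["Mult Ry,R3,R4"],
    ["Add R1,Rx,Ry"],
    ["Sub R2,Rx,Ry"],
    ["Store N,R1"],
    ["Store M,R2"],
]

def avg_parallelism(cycles):
    """Average software parallelism = instructions issued / cycles used."""
    return sum(len(c) for c in cycles) / len(cycles)

print(avg_parallelism(model1_cycles))  # 10 / 4 = 2.5
print(avg_parallelism(model2_cycles))  # 10 / 8 = 1.25
```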

38 Advanced Computer Architecture
Concurrent Processing Hardware vs. software parallelism Load R1, A Load R2, B Load R3, C Load R4, D Mult Rx, R1, R2 Mult Ry, R3, R4 Add R1, Rx, Ry Sub R2, Rx, Ry Store N, R1 Store M, R2

39 Advanced Computer Architecture
Concurrent Processing Hardware vs. software parallelism Now assume a multiprocessor composed of two processors:

40 Advanced Computer Architecture
Concurrent Processing Hardware vs. software parallelism Load R1, A Load R2, B Load R3, C Load R4, D Mult Rx, R1, R2 Mult Ry, R3, R4 Add R1, Rx, Ry Sub R2, Rx, Ry Store N, R1 Store M, R2

41 Advanced Computer Architecture
Summary Dependence Graph Different types of dependencies Bernstein's Conditions Hardware Parallelism Software Parallelism

42 Advanced Computer Architecture
Concurrent Processing Compilation support is one way to solve the mismatch between software and hardware parallelism. A suitable compiler could exploit hardware features in order to improve performance (machine dependent optimization).

43 Advanced Computer Architecture
Concurrent Processing Detection of concurrency in a program at the instruction level using techniques like Bernstein's conditions is not practical, especially in the case of large programs. So we will look at the detection of concurrency at a higher level. This brings us to the issues of partitioning, scheduling, and load balancing.

44 Advanced Computer Architecture
Concurrent Processing Two issues of concern: How can we partition a program into concurrent branches, program modules, or grains to yield the shortest execution time, and What is the optimal size of concurrent grains in a computation?

45 Advanced Computer Architecture
Concurrent Processing Partitioning is defined as the ability to partition a program into subprograms that can be executed in parallel. Within the scope of partitioning, two major issues are of concern: Grain Size, and Latency

46 Advanced Computer Architecture
Partitioning and Scheduling Grain Size Granularity or grain size is a measure of the amount of computation involved in a process — It determines the basic program segment chosen for parallel processing. Grain sizes are commonly named as: fine, medium, and coarse.

47 Advanced Computer Architecture
Partitioning and Scheduling Latency Latency imposes a limiting factor on the scalability of the underlying platform. Communication latency — inter-processor communication — is a major factor of concern to a system designer.

48 Advanced Computer Architecture
Partitioning and Scheduling In general, n tasks communicating with each other may require n (n-1) / 2 communication links among them. This leads to a communication bound which limits the number of processors allowed in a computer system.

49 Advanced Computer Architecture
Partitioning and Scheduling Parallelism can be exploited at various levels: Job or Program Subprogram Procedure, Task, or Subroutine Loops or Iteration Instruction or Statement

50 Advanced Computer Architecture
Partitioning and Scheduling The lower the level, the finer the granularity. The finer the granularity, the higher the communication and scheduling overheads. The finer the granularity, the higher the degree of parallelism.

51 Advanced Computer Architecture
Partitioning and Scheduling Instruction and loop levels represent fine grain size, the procedure level represents medium grain size, and the subprogram and job levels represent coarse grain size.

52 Advanced Computer Architecture
Partitioning and Scheduling Instruction Level An instruction-level granularity represents a grain size consisting of up to 20 instructions. This level offers a high degree of parallelism in common programs. It is expected that this parallelism will be exploited automatically by the compiler.

53 Advanced Computer Architecture
Partitioning and Scheduling Loop Level Here typically, we are concerned about iterative loops with less than 500 instructions. At this level, one can distinguish two classes of loop: Loops with independent iterations and, Loops with dependent iterations.

54 Advanced Computer Architecture
Partitioning and Scheduling Procedure Level A typical grain at this level contains less than 2,000 instructions. The communication requirement and penalty at this level are lower than at the fine-grain levels, at the expense of more complexity in the detection of parallelism — inter-procedural dependence.

55 Advanced Computer Architecture
Partitioning and Scheduling Subprogram Level Multiprogramming on a uni-processor or on a multiprocessor platform represents this level. In the past, parallelism at this level was exploited by the programmers or algorithm designers rather than by compilers.

56 Advanced Computer Architecture
Partitioning and Scheduling Job Level This level corresponds to the parallel execution of independent jobs (programs) on concurrent computers. Supercomputers with a small number of powerful processors are the best platform for this level of parallelism. In general, parallelism at this level is exploitable by the program loader and the operating system.

57 Advanced Computer Architecture
Grain Packing Let us look at the following example to motivate the effect of grain size on the performance. In the following discussion, we make a reference to the term program graph.

58 Advanced Computer Architecture
Grain Packing A program graph is a dependence graph in which: Each operation is labeled as (n, s) where n is the node identifier and s is the execution time of the node. Each edge is labeled as (v, d) where v is the edge identifier and d is the communication delay. Consider the following program graph:

59 Advanced Computer Architecture
Grain Packing — Fine grain size [Figure: fine-grain program graph with nodes 1–17 labeled (n, s) and edges labeled (v, d)]

60 Advanced Computer Architecture
Grain Packing — Fine grain size Nodes 1–6 are all memory references: 1 cycle to calculate the address and 6 cycles to fetch data from memory. The other nodes are CPU operations requiring 2 cycles each.

61 Advanced Computer Architecture
Grain Packing — Fine grain size The idea behind grain packing is to apply fine-grain partitioning first to achieve a higher degree of parallelism, and then to combine multiple fine-grain nodes into a coarse-grain node whenever doing so eliminates unnecessary communication delays or reduces the overall cost. A small sketch of this idea follows.
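
The sketch below uses a pessimistic cost model in which every cross-node edge contributes its full communication delay to the path length; the four-node graph is illustrative, not the 17-node graph from the slides.

```python
import functools

# A program graph: node -> execution time, edge (u, v) -> communication delay.
nodes = {"n1": 1, "n2": 1, "n3": 2, "n4": 2}
edges = {("n1", "n3"): 6, ("n2", "n3"): 6, ("n3", "n4"): 4}

def critical_path(nodes, edges):
    """Longest path through the graph, counting node execution times and,
    pessimistically, every edge's communication delay."""
    succs = {}
    for (u, v), d in edges.items():
        succs.setdefault(u, []).append((v, d))

    @functools.lru_cache(maxsize=None)
    def longest_from(u):
        tails = [d + longest_from(v) for v, d in succs.get(u, [])]
        return nodes[u] + max(tails, default=0)

    return max(longest_from(u) for u in nodes)

def pack(nodes, edges, group, name):
    """Merge a set of fine-grain nodes into one coarse-grain node:
    edges internal to the group (and their delays) disappear."""
    new_nodes = {n: t for n, t in nodes.items() if n not in group}
    new_nodes[name] = sum(nodes[n] for n in group)
    new_edges = {}
    for (u, v), d in edges.items():
        u2 = name if u in group else u
        v2 = name if v in group else v
        if u2 != v2:
            new_edges[(u2, v2)] = max(new_edges.get((u2, v2), 0), d)
    return new_nodes, new_edges

print(critical_path(nodes, edges))                                   # 15
print(critical_path(*pack(nodes, edges, {"n1", "n2", "n3"}, "A")))    # 10
```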

62 Advanced Computer Architecture
Grain Packing — Partitioning [Figure: the fine-grain program graph partitioned into five coarse-grain nodes A–E]

63 Advanced Computer Architecture
Grain Packing — Coarse Grain [Figure: coarse-grain program graph with nodes A,8; B,4; C,4; D,6; E,6 and the remaining communication edges]

64 Advanced Computer Architecture
Grain Packing — Scheduling Fine Grain Size [Figure: Gantt chart of the fine-grain schedule on P1 and P2, showing busy, communication, and idle periods]

65 Advanced Computer Architecture
Grain Packing — Scheduling Coarse Grain Size [Figure: Gantt chart of the coarse-grain schedule (nodes A–E) on P1 and P2, showing busy, communication, and idle periods]

66 Advanced Computer Architecture
Grain Packing As can be noted, through grain packing we were able to reduce the overall execution time by reducing the communication overhead. The concept of grain packing can be recursively applied.

67 Advanced Computer Architecture
Task Duplication Task duplication is another way to reduce the communication overhead and hence the execution time. Consider the following program graph:

68 Advanced Computer Architecture
Task Duplication [Figure: program graph with nodes A,4; B,1; C,1; D,2; E,2 and labeled communication edges]

69 Advanced Computer Architecture
Task Duplication [Figure: Gantt chart of the schedule without duplication; P1 runs A, B, D, and E, while P2 runs C]

70 Advanced Computer Architecture
Task Duplication Now let us duplicate tasks A and C: [Figure: program graph after duplication, with copies A' and C' added]

71 Advanced Computer Architecture
Task Duplication [Figure: Gantt chart of the schedule after duplicating tasks A and C on both processors]

72 Advanced Computer Architecture
Performance improvement: Advances in Technology, Architectural Advances, Better Resource Management, Program Behavior, Extraction of Concurrency/Parallelism.

73 Advanced Computer Architecture
Summary: Dependence Graph; Bernstein's conditions; in partitioning a task into subtasks, two issues must be taken into consideration: grain size and latency; Program Graph; Grain Packing; Task Duplication.

74 Advanced Computer Architecture
Loop Scheduling Loops are the largest source of parallelism in a program. Therefore, there is a need to pay attention to the partitioning and allocation of loop iterations among processors in a parallel environment.

75 Advanced Computer Architecture
Loop Scheduling Practically we can talk about three classes of loops: Sequential Loops, Doall Loops — vector (parallel) loops, Doacross Loops — Loop with intermediate degree of parallelism.

76 Advanced Computer Architecture
Loop Scheduling — Doall Loops Static Scheduling Dynamic Scheduling

77 Advanced Computer Architecture
Loop Scheduling — Doall Loops Static Scheduling schemes assign a fixed number of iterations to each processor. If N is the number of iterations and P is the number of processors, then:

78 Advanced Computer Architecture
Loop Scheduling — Doall Loops Block Scheduling (static chunking) assigns iterations (1, …, N/P), (N/P+1, …, 2*N/P), … to processors 1, 2, …, respectively. Cyclic Scheduling assigns iterations (i, i+P, i+2P, …) to processor i (1 ≤ i ≤ P).
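
The two static assignments can be written down directly. A minimal sketch, assuming the last processor simply absorbs any remainder when N is not a multiple of P:

```python
def block_schedule(n, p):
    """Block scheduling (static chunking): processor i gets the contiguous
    iterations (i-1)*n/p + 1 .. i*n/p; the last processor takes any remainder."""
    size = n // p
    return {i: list(range((i - 1) * size + 1,
                          (i * size if i < p else n) + 1))
            for i in range(1, p + 1)}

def cyclic_schedule(n, p):
    """Cyclic scheduling: processor i gets iterations i, i+P, i+2P, ..."""
    return {i: list(range(i, n + 1, p)) for i in range(1, p + 1)}

print(block_schedule(8, 2))   # {1: [1, 2, 3, 4], 2: [5, 6, 7, 8]}
print(cyclic_schedule(8, 2))  # {1: [1, 3, 5, 7], 2: [2, 4, 6, 8]}
```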

79 Advanced Computer Architecture
Loop Scheduling — Doall Loops In practice, cyclic scheduling offers a better load balancing than block scheduling if the computation performed by each iteration varies significantly.

80 Advanced Computer Architecture
Loop Scheduling — Doall Loops Dynamic scheduling schemes have been proposed to respond to the imbalance work-load of the static scheduling schemes: Self Scheduling Scheme Fixed size chunking Scheme Guided Self Scheduling Scheme Factoring Scheme Trapezoid Self Scheduling Scheme

81 Advanced Computer Architecture
Loop Scheduling — Doall Loops In the self-scheduling scheme, one iteration at a time is assigned, on demand, to a processor (too many scheduling steps). Fixed-size chunking assigns a fixed-size chunk of iterations on demand (which could result in an imbalanced load if the execution times of iterations vary).

82 Advanced Computer Architecture
Loop Scheduling — Doall Loops Guided self scheduling, Factoring, and Trapezoid Self Scheduling schemes assign chunks of varying size based on the remaining number of iterations.

83 Advanced Computer Architecture
Loop Scheduling — Doall Loops Guided self-scheduling assigns chunks of size ⌈R/P⌉ on demand to each idle processor, where P is the number of processors and R is the remaining number of iterations. Larger chunks are assigned at the earlier stages and smaller chunks at the later stages to smooth load imbalance. This scheme does not perform well if the execution times of iterations vary. If N = 100 and P = 4, then the chunk sizes are: 25, 19, 14, 11, 8, 6, …

84 Advanced Computer Architecture
Loop Scheduling — Doall Loops Factoring, at each allocation cycle, assigns chunks of equal size ⌈R/(2P)⌉ to the processors, where P is the number of processors and R is the remaining number of iterations. Relative to GSS, smaller chunks are assigned at the earlier stages. If N = 100 and P = 4, then the chunk sizes are: (13, 13, 13, 13), (6, 6, 6, 6), …

85 Advanced Computer Architecture
Loop Scheduling — Doall Loops Trapezoid Self-Scheduling fixes the first (f) and last (l) chunk sizes and assigns successive chunks of linearly decreasing size: C1 = f, Ci+1 = Ci - s, where s = (f - l)/(C - 1) and C = 2N/(f + l) is the number of chunks. If N = 100, P = 4, f = 25, and l = 1, then the chunk sizes are: 25, 25-3 = 22, 22-3 = 19, …
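
The chunk-size rules of the three dynamic schemes can be generated directly. The sketch below reproduces the N = 100, P = 4 sequences quoted above; the rounding choices (ceilings for GSS and factoring, an integer step with a clipped last chunk for TSS) are assumptions consistent with those numbers.

```python
import math

def gss_chunks(n, p):
    """Guided self-scheduling: each chunk is ceil(R / P),
    where R is the number of remaining iterations."""
    r, chunks = n, []
    while r > 0:
        c = math.ceil(r / p)
        chunks.append(c)
        r -= c
    return chunks

def factoring_chunks(n, p):
    """Factoring: in each allocation cycle, P equal chunks of ceil(R / (2P))."""
    r, chunks = n, []
    while r > 0:
        c = max(1, math.ceil(r / (2 * p)))
        for _ in range(p):
            if r == 0:
                break
            take = min(c, r)
            chunks.append(take)
            r -= take
    return chunks

def tss_chunks(n, f, l):
    """Trapezoid self-scheduling: C = ceil(2N/(f+l)) chunks whose sizes
    decrease linearly from f toward l (step s); the last chunk is clipped."""
    c = math.ceil(2 * n / (f + l))
    s = (f - l) // (c - 1)
    r, size, chunks = n, f, []
    while r > 0:
        take = min(size, r)
        chunks.append(take)
        r -= take
        size = max(size - s, l)
    return chunks

print(gss_chunks(100, 4))        # [25, 19, 14, 11, 8, 6, 5, 3, 3, 2, 1, 1, 1, 1]
print(factoring_chunks(100, 4))  # 13 x4, 6 x4, 3 x4, 2 x4, 1 x4
print(tss_chunks(100, 25, 1))    # [25, 22, 19, 16, 13, 5]
```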

86 Advanced Computer Architecture
Assume N = 1000 and 4 processors. GSS assigns chunks of sizes 250, 188, 141, 106, 79, 59, 45, 33, 25, 19, 14, 11, 8, 6, 4, 3, 3, 2, 1, 1, 1, 1 to processors P1–P4 in turn.

87 Advanced Computer Architecture
Factoring assigns chunk sizes 125, 63, 31, 16, 8, 4, 2, 1, each repeated once per processor P1–P4. Trapezoid self-scheduling (f = 125, l = 1) assigns chunk sizes 125, 117, 109, 101, 93, 85, 77, 69, 61, 53, 45, 37, 28.

88 Advanced Computer Architecture
Loop Scheduling — Doall Loops (Shared Memory, Uniform Access) [Figure: speedup vs. number of PEs for FS, GSS, Factoring, and TSS on matrix multiplication (N = 300)] With small variances in iteration execution times, all schemes behave about the same when the number of iterations is large and the variance is small.

89 Advanced Computer Architecture
Loop Scheduling — Doall Loops (Shared Memory, Uniform Access) [Figure: speedup vs. number of PEs for FS, GSS, Factoring, and TSS on adjoint convolution (N = 100)] With larger variances in iteration execution times, Factoring and TSS perform much better than FS and GSS.

90 Advanced Computer Architecture
[Table: number of iterations assigned to a processor at each scheduling step, with N = 1000 and P = 4, for FS, GSS, Factoring, and TSS (f = 125, l = 1); the chunk sequences are those listed on the previous two slides.]

91 Advanced Computer Architecture
Loop Scheduling — Doall Loops (Shared Memory Non-Uniform Memory Access) Affinity Scheduling Dynamic Partitioned Affinity Scheduling Wrapped Partitioned Affinity Scheduling Locality Based Dynamic Scheduling

92 Advanced Computer Architecture
Loop Scheduling — Doall Loops (Shared Memory, Non-Uniform Memory Access) In this scenario, access cost increases with distance, so the algorithms should take the location of data into consideration. As a result, iterations are scheduled, at least initially, on the processor that holds the required data. Affinity scheduling attempts to balance the load, minimize the number of synchronization operations, and exploit affinity. Initially, a fixed N/P iterations are assigned to each processor; when a processor becomes idle, it takes 1/P of the iterations remaining in its local queue. If its local queue is empty, it takes 1/P of the iterations from a heavily loaded processor.
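
A minimal sketch of this affinity-scheduling policy; the queue representation, the round-robin treatment of idle processors, and the choice of the most loaded queue as the steal victim are illustrative assumptions rather than details from the slides.

```python
import math

def affinity_schedule(n, p):
    """Trace which processor executes which iterations under affinity
    scheduling: initial N/P blocks, then 1/P of the local queue per grab,
    stealing 1/P of the most loaded queue when the local queue is empty."""
    size = math.ceil(n / p)
    queues = [list(range(i * size, min((i + 1) * size, n))) for i in range(p)]
    executed = [[] for _ in range(p)]
    while any(queues):
        for i in range(p):                    # idealized round-robin of idle PEs
            src = i
            if not queues[src]:               # local queue empty: steal
                src = max(range(p), key=lambda j: len(queues[j]))
                if not queues[src]:
                    continue
            grab = max(1, math.ceil(len(queues[src]) / p))
            chunk, queues[src] = queues[src][:grab], queues[src][grab:]
            executed[i].extend(chunk)
    return executed

for i, its in enumerate(affinity_schedule(16, 4)):
    print(f"P{i}: {its}")
```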

93 Advanced Computer Architecture
Loop Scheduling — Doall Loops (Shared Memory, Non-Uniform Memory Access) Dynamic Partitioned Affinity Scheduling, Wrapped Partitioned Affinity Scheduling, and Locality-Based Dynamic Scheduling are intended for loops with widely varying execution times; they dynamically determine the chunk sizes in subsequent allocations and assume that the iteration size is proportional to the iteration index. Wrapped partitioning is similar to dynamic partitioning, except that a processor is assigned iterations that are at a distance of P from each other. In locality-based dynamic scheduling, a processor first executes the iterations for which its data is locally available.

94 Advanced Computer Architecture
Loop Scheduling — Doall Loops (Shared Memory, Non-Uniform Memory Access) [Figure: completion time vs. number of PEs for GSS, AFS, DPAS, and WPAS on a rectangular workload (N = 128)] In a rectangular workload, the iterations can be partitioned into large and small ones. WPAS performs the best, since the heavy iterations are not all assigned to the same processor.

95 Advanced Computer Architecture
Loop Scheduling — Doall Loops (Shared Memory, Non-Uniform Memory Access) [Figure: completion time vs. number of PEs for GSS, AFS, DPAS, and WPAS on the Jacobi algorithm (128 × 128 matrix)] In the Jacobi algorithm (a triangular workload), the iteration execution time decreases linearly; WPAS still performs better because of a more uniform distribution of the workload. In the case of a balanced workload (fixed iteration execution times), all affinity algorithms offer the same performance.

96 Advanced Computer Architecture
Summary Why Loops? Different types of Loops Scheduling of Doall Loops Static allocation vs. dynamic allocation Scheduling Doall Loops in UMA and NUMA

97 Advanced Computer Architecture
Loop Scheduling Practice has shown that the loss of parallelism from serializing Doacross loops is very significant. There is a need to develop good schemes for the parallel execution of Doacross loops.

98 Advanced Computer Architecture
Loop Scheduling — Doacross Loops In a Doacross loop, iterations may be either data or control dependent on each other: Control dependence is caused by conditional statements. Data dependence appears in the form of sharing computational results. Data dependence can be either lexically forward or lexically backward.

99 Advanced Computer Architecture
Loop Scheduling — Doacross Loops Doacross loops can be either: regular (dependence distance is fixed) or irregular (dependence distance varies from iteration to iteration). Regular Doacross loops are more amenable to parallel execution than irregular loops.

100 Advanced Computer Architecture
Loop Scheduling — Doacross Loops Data dependence appears in the form of lexically forward or lexically backward dependencies.

101 Advanced Computer Architecture
Lexically Forward Dependency

102 Advanced Computer Architecture
Lexically Backward Dependency

103 Advanced Computer Architecture
Loop Scheduling — Doacross Loops DOACROSS Model — Cytron; Pre-synchronization Scheduling — Krothapalli & Sadayappan; Staggered Distribution Scheme — Lim et al.

104 Advanced Computer Architecture
Loop Scheduling DOACROSS Model Each iteration is assigned to a virtual processor, and the execution of two successive virtual processors is delayed by a time period d — d can range from zero (DOALL) to T (the sequential loop).

105 Advanced Computer Architecture
Loop Scheduling DOACROSS Model Assume: T = the execution time of one iteration of the loop, d = the delay due to the lexically-backward dependency, C = the inter-processor communication cost, and T - d = the portion of the loop iteration that can be executed in parallel.

106 Advanced Computer Architecture
Loop Scheduling DOACROSS Model — Example Assume we have the following loop, which is iterated 8 times. Further, assume the loop execution time is 10 and d is 4. Finally, assume that we have three processors: DOACROSS I = 1, 8 { d = 4, T = 10 } END

107 Advanced Computer Architecture
Loop Scheduling DOACROSS Model — Example

108 Advanced Computer Architecture
Loop Scheduling — Doacross Loops The total execution time of a DOACROSS loop L of n iterations, for an unbounded number of processors, is TE(L) = (n-1)d + T; the model is extended for a limited number of processors (P) and for the case in which inter-processor communication (C) is taken into account.
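
For the bounded cases the finish time can be obtained by simulating the model: iteration i may start only after iteration i-1 has finished its serial portion d (plus a communication delay C when the two iterations sit on different processors), and a processor runs its own iterations back to back. This event-driven sketch is an interpretation of the model under a cyclic assignment of iterations, not a formula from the slides.

```python
def doacross_time(n, p, T, d, C=0):
    """Finish time of n DOACROSS iterations on p processors.
    Iterations are assigned cyclically (iteration i runs on processor i mod p).
    Iteration i may start only after iteration i-1 has completed its serial
    portion d (plus communication cost C if i-1 ran on a different processor),
    and only when its own processor is free."""
    proc_free = [0] * p          # time at which each processor becomes idle
    serial_done = 0              # time the previous iteration's d-portion ends
    finish = 0
    for i in range(n):
        pe = i % p
        ready = serial_done + (C if i > 0 and (i - 1) % p != pe else 0)
        start = max(proc_free[pe], ready)
        serial_done = start + d  # the next iteration may begin after this point
        finish = start + T
        proc_free[pe] = finish
    return finish

print(doacross_time(8, 3, T=10, d=4))    # the slides' example with C = 0
print(doacross_time(8, 100, T=10, d=4))  # effectively unbounded PEs: (8-1)*4 + 10 = 38
```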

109 Advanced Computer Architecture
Loop Scheduling The DOACROSS model was intended to model the execution of sequential loops, vector loops, and loops with intermediate parallelism by considering the control dependencies and the data dependencies.

110 Advanced Computer Architecture
Loop Scheduling Control dependencies are caused by conditional and unconditional branches. Data dependencies are due to the lexically forward and lexically backward dependencies. This simple model, however, does not consider the effect of inter-processor communication cost.

111 Advanced Computer Architecture
Loop Scheduling Staggered Distribution The staggered distribution uses heuristics to distribute loop iterations unevenly among processors to mask the delay due to the data dependencies and inter-processor communications.

112 Advanced Computer Architecture
Staggered Distribution Assume: T = the execution time of one iteration of the loop, d = the delay due to the lexically-backward dependency, C = the inter-processor communication cost, and T - d = the portion of the loop iteration that can be executed in parallel.

113 Advanced Computer Architecture
Staggered Distribution The iterations assigned to PEi succeed the iterations assigned to PEi-1, with PEi having mi more iterations than PEi-1. The additional iterations (mi) are used to mask out the communication delay.

114 Advanced Computer Architecture
Staggered Distribution The number of iterations allocated to PEi would be:

115 Advanced Computer Architecture
Refer to the earlier example. For C = 0, the staggered scheme makes a 2, 3, 3 assignment of the iterations among the three processors, which results in an execution time of 2*10 + 3*4 + 3*4 = 44 and hence a speedup of 100/44. For C = 6, it makes a 1, 2, 5 assignment, which results in an execution time of (1*10 + 6) + (2*4 + 6) + 5*4 = 50 and hence a speedup of 160/50.

116 Advanced Computer Architecture
Staggered Distribution Random Loops For random loops, the number of iterations is not known in advance and termination of the loop is based on a parameter that is being modified during each loop iteration. In this case, the probability of the successful continuation is used to calculate the expected number of iterations.

117 Advanced Computer Architecture
Staggered Distribution Random Loops Based on the expected number of iterations, a copy of the loop along with the number of iterations is assigned to each processor based on the staggered scheme.

118 Advanced Computer Architecture
Staggered Distribution ─ Evaluation To evaluate the effectiveness of the staggered allocation scheme, a representative loop with T = 50 and loops 3, 4, 5, 11, 13, and 19 of the Livermore Loops have been simulated and compared using the staggered scheme, even distribution, and cyclic distribution.

119 Advanced Computer Architecture
Staggered Distribution ─ Evaluation [Table: maximum speed-up using the staggered scheme, N = 300] Notation: k = d/T; C/T = communication time / iteration execution time; AP = average parallelism (the ratio of the total execution time to the critical-path length); MS = maximum speedup. The staggered distribution offers a speedup close to the average parallelism.

120 Advanced Computer Architecture
Staggered Distribution ─ Evaluation [Figure: speedup and number of processors needed to attain maximum speedup vs. k = d/T, for equal distribution and staggered distribution, with C/T = 1.0 and N = 1000] The staggered distribution offers a better speedup while using fewer processors across the range of k.

121 Advanced Computer Architecture
Staggered Distribution ─ Evaluation [Table: speedup of the staggered distribution relative to Static Chunking (SC) and Cyclic Scheduling (CYC) for k = 0.1 to 0.9 and C/T = 3.0 to 6.0, with the number of processors used shown in parentheses; for example, at k = 0.1 and C/T = 3.0 the staggered scheme achieves a speedup of 1.28 over SC and 22.54 over CYC using 17 processors.]

122 Advanced Computer Architecture
Staggered Distribution ─ Evaluation Speedup (SU) of the Staggered Distribution vs. Static Chunking (SC) and Cyclic (CYC) Scheduling on the Livermore Loops; the number of processors used by the staggered distribution is shown in parentheses.
Loop #   k      C/T     PE = 4: SU(SC)  SU(CYC)    PE = 8: SU(SC)  SU(CYC)
3        0.25   3.75    1.20            10.72      1.21 (7)        13.10 (7)
5        0.30   3.00    1.21            8.22       1.16 (6)        9.35 (6)
11       -      10.50   -               -          -               12.18 (7)
13       0.05   0.71    1.07            2.82       1.14            5.05
19(1)    0.33   3.33    1.24            7.53       1.34 (4)        7.53 (4)
19(2)    0.27   2.73    1.23            6.86       1.28 (5)        6.93 (5)

123 Advanced Computer Architecture
Staggered Distribution The staggered distribution is effective for loops with a large degree of parallelism among iterations. The maximum speedup attained is very close to the optimal speedup, and it is achieved while using a smaller number of processors. The staggered scheme, however, distributes an unbalanced load among the processors.

124 Advanced Computer Architecture
Cyclic Staggered Distribution Staggered distribution was extended (i.e., Cyclic Staggered distribution) to address: Uneven distribution of the loop iterations among the processors, Loops with varying iteration execution times, and Availability of free processors below maxpe.

125 Advanced Computer Architecture
Cyclic Staggered Distribution Initially, loop iterations are distributed in a staggered fashion among the P processors. In the second phase, the remaining iterations are redistributed based on the staggered concept, taking into account nP, the number of iterations already allocated to PEi. When the execution times of loop iterations vary, this scheme uses an estimated worst-case iteration execution time (possibly augmented by runtime support) in determining the distribution for the second and subsequent passes.

126 Advanced Computer Architecture
Comparative Analysis of Doacross Scheduling [Table: number of iterations assigned to a processor at each scheduling step under Cyclic, SC, SD, and CSD scheduling, with T = 10, d = 2, n = 500, C = 5, and P = 4; scheduling steps 1–19]

127 Advanced Computer Architecture
Comparative Analysis of Doacross Scheduling (continued) [Table continued: scheduling steps 20–36]

128 Advanced Computer Architecture
Comparative Analysis of Doacross Scheduling (continued) [Table continued: scheduling steps 37–46] Resulting execution times: Cyclic 3,503; SC 2,015; SD 1,710; CSD 1,298.

129 Advanced Computer Architecture
Cyclic Staggered Distribution ─ Evaluation (C/T = 3.0) [Figure: speedup vs. number of PEs for CS1, CS2, and SD with k = 0.2] The cyclic staggered distribution performs better than the staggered distribution regardless of the number of processors, the C/T ratio, and the k factor.

130 Advanced Computer Architecture
Cyclic Staggered Distribution ─ Evaluation (C/T = 5.0) [Figure: speedup vs. number of PEs for CS1, CS2, and SD with k = 0.2]

131 Advanced Computer Architecture
Staggered Scheme in a control-flow environment [Figure: speedup vs. number of PEs for SD, SC, and CYC] The loop was run on the nCUBE 2 multiprocessor using 2 to 16 processors (PEs).

132 Advanced Computer Architecture
Staggered Scheme in a control-flow environment [Figure: speedup vs. number of PEs for CS1, CS2, and SD] CS1 and CS2 performed better than SD only when the number of processors is between 3 and 5. For an environment with two processors, SD performed better than CS1 and CS2 because of the additional overhead incurred to implement the cyclic staggered schemes.

133 Advanced Computer Architecture
Irregular Doacross Loop Scheduling For irregular Doacross loops, the dependence patterns are complicated and usually are not predictable at compile time.
DO I = 1, N
  Sp: A(B(I)) := .....
  Sq: := A(C(I))
END

134 Advanced Computer Architecture
Irregular Doacross Loop Scheduling
DO I = 1, 6
  A(I) := A(B(I))
END
[Figure: iteration space graph for iterations 1–6] With the given index array B, the operations performed are: A(1) ← A(1), A(2) ← A(3), A(3) ← A(1), A(4) ← A(5), A(5) ← A(3), A(6) ← A(4).

135 Advanced Computer Architecture
Irregular DoAcross Loop Scheduling Pre-synchronized Scheduling Runtime Parallelizing Schemes The Zhu-Yew Scheme Chen’s Scheme

136 Advanced Computer Architecture
Summary Different types of loop Sequential Parallel Loops with intermediate degree of parallelism Loop scheduling Static Dynamic Staggered Allocation

137 Advanced Computer Architecture
Partitioning and Scheduling The goal of scheduling (scheduler) is to determine an assignment of tasks to processing elements to optimize certain performance metrics ─ mainly performance and efficiency. Scheduling algorithms are judged on their time complexity and efficiency of the produced schedule.

138 Advanced Computer Architecture
Partitioning and Scheduling A proper allocation scheme should partition a task (program graph) into subtasks (sub-graphs) and distribute the subtasks among processors in an attempt to minimize: processor contention, and inter-processor communication.

139 Advanced Computer Architecture
Partitioning and Scheduling There are two main allocation schemes: Static, and Dynamic.

140 Advanced Computer Architecture
Partitioning and Scheduling In a static scheme, the subtasks are allocated at compile time to the processors using global information about the task and system organization.

141 Advanced Computer Architecture
Partitioning and Scheduling In a dynamic scheme, the behavior of the task during run time is used to measure the loads. In other words, tasks are scheduled on the fly: assign activated subtasks to the least loaded processor, and balance the load by migrating subtasks.

142 Advanced Computer Architecture
Partitioning and Scheduling In a static allocation scheme, the cost of allocating subtasks is incurred only once, even though the program may be executed repeatedly. However, it can be inefficient when estimates of run-time-dependent characteristics are inaccurate.

143 Advanced Computer Architecture
Partitioning and Scheduling The dynamic allocation scheme has a disadvantage due to its overhead involved in determining processor loads and allocation of subtasks at run time.

144 Advanced Computer Architecture
Partitioning and Scheduling Despite the conceptual differences between static and dynamic policies, they have the same goals: exploit the maximum inherent concurrency in a task and minimize contention for resources. It has been shown that such a problem is NP-complete.

145 Advanced Computer Architecture
Partitioning and Scheduling It is possible to generate optimal solutions when restrictions are imposed on program behavior or system configuration:

146 Advanced Computer Architecture
Partitioning and Scheduling When subtasks are of the same weight and the system is composed of two processors. When the weights of the subtasks are mutually commensurable — a set of subtasks is said to be mutually commensurable if there exists a weight t such that each subtask weight is an integer multiple of t.

147 Advanced Computer Architecture
Partitioning and Scheduling Heuristic solutions are a promising approach to solving the allocation problem.

148 Advanced Computer Architecture
Summary Partitioning and Scheduling Load Balancing Static Scheduling Dynamic Scheduling Static Scheduling vs. Dynamic Scheduling Heuristic Based Scheduling and Load Balancing

149 Advanced Computer Architecture
Partitioning and Scheduling In the following we will look at several static, heuristic-based load balancing algorithms. Note that these algorithms are heuristic based — in other words, they make sense and offer a reasonable solution. Also note that these algorithms build on each other.

150 Advanced Computer Architecture
Partitioning and Scheduling A scheduling system consists of: Program tasks, Target machine, and A scheduler.

151 Advanced Computer Architecture
Partitioning and Scheduling The characteristics of a parallel program are defined as (T, p, D, A), where: T = {t1, t2, …, tn} is the set of tasks; p is a partial order on T which specifies operational precedence constraints, i.e., ti p tj means that ti must be completed before tj can start execution (ti, tj ∈ T); D is an n × n matrix of communication data, where Dij ≥ 0 is the amount of data required to be transferred from task ti to task tj; and A is a vector of size n representing the computations, i.e., Ai > 0 is the computation time of task ti. Traditionally, a program is represented as a directed graph called the program (task) graph (G ≡ (T, E)).

152 Advanced Computer Architecture
Partitioning and Scheduling T is the set of the subtasks, and E is the set of directed edges among elements of T. An edge represents a partial ordering (p) among the subtasks — ti → tj implies that ti must be executed before tj (ti, tj ∈ T). In addition, nodes and edges are labeled by numbers representing the execution time and the communication cost.

153 Advanced Computer Architecture
Partitioning and Scheduling The target machine is a set of m heterogeneous/homogeneous interconnected processing elements. Associated with each processing element Pi is its speed Si. The connectivity of the processing elements can be represented by an undirected graph called the network graph. Associated with each edge (i, j) between processing elements Pi and Pj in the network graph is the transfer rate Rij, that is, how many units of data can be transmitted per unit of time over the link.

154 Advanced Computer Architecture
Partitioning and Scheduling A schedule is a function that maps the task graph (G ≡ (T, E)) onto a target machine in order to satisfy a performance metric. The schedule can be illustrated as a Gantt chart in which the start and finish times of all tasks are easily shown. With respect to our earlier notation: Ai/Sj is the execution time of task ti when executed on processor Pj, and Dij/Rkl is the communication delay between tasks ti and tj when they are executed on adjacent processing elements Pk and Pl.
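
A small sketch of this model; the three-task instance, the speeds, and the transfer rates are made-up illustrative values, not data from the slides.

```python
# A small, illustrative instance of the (T, p, D, A) model.
tasks = ["t1", "t2", "t3"]
A = {"t1": 4, "t2": 6, "t3": 2}                 # computation amounts
D = {("t1", "t3"): 8, ("t2", "t3"): 3}          # data to transfer along edges
S = {"P1": 1.0, "P2": 2.0}                      # processor speeds
R = {("P1", "P2"): 4.0}                         # link transfer rates

def exec_time(task, proc):
    """A_i / S_j: execution time of task t_i on processor P_j."""
    return A[task] / S[proc]

def comm_delay(t_from, t_to, p_from, p_to):
    """D_ij / R_kl: communication delay when the two tasks run on
    adjacent processors P_k and P_l (zero if they share a processor)."""
    if p_from == p_to:
        return 0.0
    rate = R.get((p_from, p_to)) or R.get((p_to, p_from))
    return D.get((t_from, t_to), 0) / rate

print(exec_time("t1", "P2"))                    # 4 / 2.0 = 2.0
print(comm_delay("t1", "t3", "P1", "P2"))       # 8 / 4.0 = 2.0
```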

155 Advanced Computer Architecture
Partitioning and Scheduling — Example

156 Advanced Computer Architecture
Partitioning and Scheduling Static Scheme — List Scheduling Make a priority list of the subtasks to be assigned to the processors. Repeatedly remove the topmost element from the priority list and allocate it to the "most suitable" processor for execution.

157 Advanced Computer Architecture
Partitioning and Scheduling Static Scheme — List Scheduling How do we define the priority criteria and the "most suitable" processor? How do we account for the communication cost?

158 Advanced Computer Architecture
Partitioning and Scheduling No restriction on the number of processors, but constraints on the task graph. An in-forest is a task graph where each node has at most one immediate successor. An out-forest is a task graph where each node has at most one immediate predecessor. An interval-ordered task graph describes the precedence relations among the system tasks in an interval order: a task graph is interval ordered when its related elements can be mapped onto non-overlapping time intervals. The interval order has a special property: for any interval-ordered pair of tasks u and v, either the successors of u are also successors of v, or vice versa.

159 Advanced Computer Architecture
Partitioning and Scheduling Assumptions: The task graph consists of n tasks, the target machine is composed of m processors, the execution of each task takes one unit of time, and the communication cost between any pair of tasks is zero.

160 Advanced Computer Architecture
Partitioning and Scheduling In-forest/Out-forest task graphs: Assign the highest level first: Calculate and assign the node level to each node (node priority) ─ the node level of node x is the maximum number of nodes (including x) on any path from x to a terminal node. Whenever a processor becomes available, assign it the unexecuted (unscheduled) ready task with the highest priority.
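
A minimal sketch of this highest-level-first policy for unit-time tasks with zero communication cost; the small in-forest below (each node points to its single successor) is illustrative.

```python
# In-forest: each node has at most one immediate successor (None = terminal).
succ = {"a": None, "b": "a", "c": "a", "d": "b", "e": "b", "f": "c"}

def level(node):
    """Number of nodes (including this one) on the path to the terminal node."""
    return 1 + (level(succ[node]) if succ[node] else 0)

def hlf_schedule(succ, m):
    """Greedy schedule on m processors: at each time step, run the m ready
    tasks (all predecessors done) with the highest levels."""
    preds = {n: set() for n in succ}
    for n, s in succ.items():
        if s:
            preds[s].add(n)
    done, schedule = set(), []
    while len(done) < len(succ):
        ready = [n for n in succ if n not in done and preds[n] <= done]
        step = sorted(ready, key=level, reverse=True)[:m]
        schedule.append(step)
        done.update(step)
    return schedule

print(hlf_schedule(succ, 2))   # [['d', 'e'], ['f', 'b'], ['c'], ['a']]
```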

161 Advanced Computer Architecture
Partitioning and Scheduling ─ In-forest [Figure: an in-forest task graph with nodes a–m]

162 Advanced Computer Architecture
Partitioning and Scheduling ─ In-forest [Figure: node levels (1–5) for nodes a–m and the resulting highest-level-first schedule on four processors P1–P4]

163 Advanced Computer Architecture
Partitioning and Scheduling ─ Interval ordered Here we use the number of all successors of a node as its priority. Calculate priority of each node, Whenever a processor becomes available, assign it the unexecuted ready task with the highest priority.

164 Advanced Computer Architecture
Partitioning and Scheduling ─ Interval ordered [Figure: an interval-ordered task graph with nodes a–j]

165 Advanced Computer Architecture
Partitioning and Scheduling ─ Interval ordered [Figure: number of successors for each of the nodes a–j and the resulting schedule on three processors P1–P3]

166 Advanced Computer Architecture
Partitioning and Scheduling No restriction on the task graph, but two processors. Each node is labeled, and the labels are used as node priorities. Assign 1 to one of the terminal tasks. Repeat until all nodes are labeled: suppose labels 1, 2, …, j-1 have been assigned, and let S be the set of unlabeled tasks with no unlabeled successors. For each node x ∈ S define l(x) as follows: let L(y1), L(y2), …, L(yk) be the labels of the immediate successors of x; then l(x) is the decreasing sequence of integers formed by ordering the set {L(y1), L(y2), …, L(yk)}. Let x ∈ S be such that for all x' ∈ S, l(x) ≤ l(x') (lexicographically), and assign j to that x (i.e., L(x) = j). Whenever a processor becomes available, assign it the unexecuted (unscheduled) ready task with the highest priority.

167 Advanced Computer Architecture
Partitioning and Scheduling [Figure: task graph with nodes a–k for the two-processor labeling example]

168 Advanced Computer Architecture
Partitioning and Scheduling [Table: labels assigned to the nodes (a: 11, b: 10, c: 9, d: 8, e: 6, f: 7, g: 5, h: 4, i: 3, j: 1, k: 2) and the resulting two-processor schedule on P1 and P2]

169 Advanced Computer Architecture
Partitioning and Scheduling Static Scheme — Heavy Node First Partition the DAG into horizontal levels. Prioritize the DAG nodes level by level. At each level, nodes with a higher weight get a higher priority. Load balancing is achieved by assigning the heaviest nodes to the least loaded processor.
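
A minimal sketch of the heavy-node-first rule; the level grouping and node weights below are illustrative (not the DAG from the following slides), and communication costs are ignored.

```python
# Heavy Node First: schedule level by level, heaviest node to the least
# loaded processor.
levels = [                      # nodes grouped into horizontal levels
    [("A", 4), ("B", 2), ("G", 3)],
    [("C", 5), ("D", 1), ("E", 2), ("F", 3)],
    [("H", 6), ("I", 2)],
]

def heavy_node_first(levels, m):
    """Return the per-processor load and task list."""
    load = [0] * m
    assignment = [[] for _ in range(m)]
    for level in levels:
        for name, weight in sorted(level, key=lambda t: t[1], reverse=True):
            pe = min(range(m), key=lambda i: load[i])   # least loaded PE
            load[pe] += weight
            assignment[pe].append(name)
    return load, assignment

print(heavy_node_first(levels, 3))
# ([9, 12, 7], [['A', 'E', 'D', 'I'], ['G', 'F', 'H'], ['B', 'C']])
```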

170 Advanced Computer Architecture
Partitioning and Scheduling Static Scheme — Heavy Node First Assume the following DAG and a platform composed of three processors. Note that each node in DAG has a name and an integer representing its execution time.

171 Advanced Computer Architecture
Partitioning and Scheduling Static Scheme — Heavy Node First [Figure: example DAG with nodes A–O and the resulting heavy-node-first schedule on the three processors]

172 Advanced Computer Architecture
Partitioning and Scheduling Static Scheme — Critical Path Method A classical method for task allocation. The subtasks on the critical path determine the shortest possible execution time of the program. Prioritize the subtasks according to the length of their critical path. For an n-node DAG, the complexity of the algorithm is O(n²).
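
A minimal sketch of computing critical-path priorities: each node's priority is the length of the longest path (by node weight) from it to a terminal node. The five-node DAG is illustrative and communication costs are ignored.

```python
import functools

weights = {"A": 4, "B": 2, "C": 5, "D": 1, "E": 3}
succs = {"A": ["C", "D"], "B": ["D"], "C": ["E"], "D": ["E"], "E": []}

@functools.lru_cache(maxsize=None)
def cp_length(node):
    """Longest weight-sum path from `node` to a terminal (its priority)."""
    return weights[node] + max((cp_length(s) for s in succs[node]), default=0)

priorities = {n: cp_length(n) for n in weights}
print(priorities)   # {'A': 12, 'B': 6, 'C': 8, 'D': 4, 'E': 3}
# Subtasks are then considered in decreasing order of priority.
```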

173 Advanced Computer Architecture
Partitioning and Scheduling Static Scheme — Critical Path Method

174 Advanced Computer Architecture
Summary Partitioning and Scheduling Static, Dynamic Heuristic Solutions Heavy node first Critical Path Method

175 Advanced Computer Architecture
Partitioning and Scheduling Static Scheme — Weighted Length Algorithm Consider the weight of a node, its critical path, and the weight of its successors. Let:

176 Advanced Computer Architecture
Partitioning and Scheduling WT(P) represents the execution time of node P, U(P) represents the maximum weighted length of the children of P, and V(P) represents the sum of the weighted lengths of the children of P; the weighted length (WL) of P is then defined as: WL(P) = WT(P) + U(P) + V(P)/U(P)
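
A recursive reading of this definition; the four-node DAG is illustrative, and treating leaf nodes as WL(leaf) = WT(leaf) is an assumption, since the slide does not state the base case.

```python
import functools

# Illustrative DAG: node weights and children.
WT = {"P1": 3, "P2": 2, "P3": 4, "P4": 1}
children = {"P1": ["P2", "P3"], "P2": ["P4"], "P3": ["P4"], "P4": []}

@functools.lru_cache(maxsize=None)
def WL(p):
    """WL(P) = WT(P) + U(P) + V(P)/U(P); leaves get WL = WT (assumed)."""
    kids = children[p]
    if not kids:
        return WT[p]
    lengths = [WL(c) for c in kids]
    u, v = max(lengths), sum(lengths)
    return WT[p] + u + v / u

for p in WT:
    print(p, WL(p))   # P1: 3 + 6 + 10/6 = 10.67, P2: 4.0, P3: 6.0, P4: 1
```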

177 Advanced Computer Architecture
Partitioning and Scheduling Static Scheme — Weighted Length Algorithm

178 Advanced Computer Architecture
Partitioning and Scheduling — Communication Models Program Completion Time = Execution Time + Total Communication Delay, where Total Communication Delay = Total Number of Messages × Communication Delay per Message.

179 Advanced Computer Architecture
Partitioning and Scheduling — Communication Models Model A: the total number of messages is equal to the number of node pairs (u, v) such that (u, v) ∈ E and Proc(u) ≠ Proc(v). Model B: the total number of messages is defined as the number of processor/task pairs (P, v) such that processor P does not compute v but computes at least one of its immediate successors. Model C: a processor can compute a task and communicate with another processor simultaneously.

180 Advanced Computer Architecture
Partitioning and Scheduling — Communication Models Consider the following task graph [Figure: five tasks a–e assigned to processors P1 and P2]: if we use Model A for scheduling this program, then Completion Time = 3 + 2 = 5; if we use Model B, then Completion Time = 3 + 1 = 4.

181 Advanced Computer Architecture
Partitioning and Scheduling — Communication Models If we use Model C for scheduling this program, then: [Figure: the resulting schedule of tasks a–e on P1 and P2]

182 Advanced Computer Architecture
Partitioning and Scheduling Static Scheme — Heavy Node First + Communication Delay Define DSMi,t as the Desirable Starting Moment — the moment at which a subtask has received all of its input messages — of subtask t on processor i. Define Load(Pi) to be the accumulated execution time (including the communication delays) of processor i.

183 Advanced Computer Architecture
Partitioning and Scheduling Static Scheme — Heavy Node First + Communication Delay Define ASMi,t as the Actual Starting Moment — the moment that a subtask can begin its execution — of subtask t on processor i, then: ASMi,t = MAX (LOAD(Pi), DSMi,t)

184 Advanced Computer Architecture
Partitioning and Scheduling Static Scheme — Heavy Node First + Communication Delay Assume DSM0,3 = 5 and Load(P0) = 15, so ASM0,3 = 15; and DSM1,3 = 10 and Load(P1) = 0, so ASM1,3 = 10.

185 Advanced Computer Architecture
Partitioning and Scheduling Static Scheme — Heavy Node First + Communication Delay The following relation defines the most suitable PE(i) for task t: min over i = 1, 2, 3, …, n of ASMi,t.
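
A minimal sketch of this selection rule, using the numbers from the task-3 example two slides back; the function and variable names are illustrative.

```python
def asm(load, dsm):
    """Actual Starting Moment: ASM(i, t) = max(Load(P_i), DSM(i, t))."""
    return max(load, dsm)

def most_suitable_pe(candidates):
    """Pick the PE with the smallest ASM. `candidates` maps PE -> (load, dsm)."""
    return min(candidates, key=lambda pe: asm(*candidates[pe]))

# The numbers from the task-3 example: (Load(P_i), DSM_{i,3}).
candidates = {"P0": (15, 5), "P1": (0, 10)}
print({pe: asm(*v) for pe, v in candidates.items()})  # {'P0': 15, 'P1': 10}
print(most_suitable_pe(candidates))                   # 'P1'
```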

186 Advanced Computer Architecture
Partitioning and Scheduling Static Scheme — Heavy Node First + Communication Delay In our case then PE1 is the most suitable PE for task 3. To reduce the overhead, one can just investigate the nearest neighbors for the smallest ASM value.

187 Advanced Computer Architecture
Partitioning and Scheduling Static Scheme — Heavy Node First + Communication Delay Assume the following configuration:

188 Advanced Computer Architecture
Partitioning and Scheduling Static Scheme — Heavy Node First + Communication Delay

189 Advanced Computer Architecture
Partitioning and Scheduling Static Scheme — Heavy Node First + Communication Delay

190 Advanced Computer Architecture
Partitioning and Scheduling Static Scheme — Heavy Node First + Communication Delay For task D: DSM0,D = 10, Load(P0) = 10, so ASM0,D = 10; DSM1,D = 5, Load(P1) = 5, so ASM1,D = 5; DSM2,D = 15, Load(P2) = 5, so ASM2,D = 15; DSM3,D = 10, Load(P3) = 0, so ASM3,D = 10. For task E: DSM0,E = 15, Load(P0) = 10, so ASM0,E = 15; DSM1,E = 15, Load(P1) = 25, so ASM1,E = 25; DSM2,E = 25, Load(P2) = 5, so ASM2,E = 25; DSM3,E = 20, Load(P3) = 0, so ASM3,E = 20.

