Optimizing Compilers for Modern Architectures Coarse-Grain Parallelism Chapter 6 of Allen and Kennedy.

Introduction

Previously, our transformations targeted vector and superscalar architectures. In this lecture, we turn to transformations for symmetric multiprocessor (SMP) machines. The main difference between the two kinds of transformation is granularity: multiprocessors favor larger units of parallel work.

Review

SMP machines have multiple processors that all access a central memory. The processors operate independently and can run separate processes. Starting processes and synchronizing between them is expensive.

Synchronization

A basic synchronization element is the barrier. A barrier in a program forces all processes to reach a certain point before any of them continues past it. Contention for the shared bus can also cause slowdowns.
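To make the barrier concrete, here is a minimal sketch in Python using the standard `threading.Barrier`. The function name `run_two_phases` and the two-phase workload are assumptions made for illustration; they do not come from the lecture.

```python
import threading

def run_two_phases(n_workers):
    """Phase 1 writes a value, then a barrier, then phase 2 reads a neighbour's value."""
    stage1 = [None] * n_workers
    stage2 = [None] * n_workers
    barrier = threading.Barrier(n_workers)

    def worker(i):
        stage1[i] = i * i                          # phase 1: produce a value
        barrier.wait()                             # wait until every worker has written
        stage2[i] = stage1[(i + 1) % n_workers]    # phase 2: neighbour's value is ready

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return stage2
```

Without the `barrier.wait()` call, a worker could read its neighbour's slot before that slot has been written.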

Single Loops

The analog of scalar expansion is privatization. Temporaries can be given a separate namespace for each iteration.

    DO I = 1,N
    S1    T = A(I)
    S2    A(I) = B(I)
    S3    B(I) = T
    ENDDO

With T privatized, the loop can run in parallel:

    PARALLEL DO I = 1,N
       PRIVATE t
    S1    t = A(I)
    S2    A(I) = B(I)
    S3    B(I) = t
    ENDDO
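The same idea can be sketched in Python, where a variable local to the loop-body function plays the role of the private temporary. The name `parallel_swap` and the use of a thread pool are assumptions for illustration only.

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_swap(a, b):
    """Swap the contents of a and b element by element, one task per iteration."""
    def body(i):
        t = a[i]        # t is local to this call: one private copy per iteration
        a[i] = b[i]
        b[i] = t

    with ThreadPoolExecutor(max_workers=4) as pool:
        list(pool.map(body, range(len(a))))   # list() waits for every iteration
    return a, b
```

If `t` were a single shared variable instead, two iterations running at once could overwrite each other's saved value.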

Privatization

Definition: A scalar variable x in a loop L is said to be privatizable if every path from the loop entry to a use of x inside the loop passes through a definition of x.

Privatizability can be stated as a data-flow problem. Alternatively, we can declare a variable x private if its SSA graph doesn’t contain a phi function at the loop entry.
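One way to phrase the data-flow problem is as a backward computation of upward-exposed uses over the blocks of the loop body. The sketch below is an assumed formulation, not necessarily the one on the original slide. Each block is described by a pair (variables used before any definition in the block, variables defined in the block), and `succ` lists successors inside one iteration of the loop body. All names are hypothetical.

```python
def upward_exposed(blocks, succ, entry):
    """Variables that may be used before being defined on some path from entry."""
    up_in = {b: set() for b in blocks}
    changed = True
    while changed:                      # iterate to a fixed point
        changed = False
        for b, (use, defs) in blocks.items():
            out = set()
            for s in succ.get(b, ()):
                out |= up_in[s]
            new = use | (out - defs)    # exposed here, or exposed later and not defined here
            if new != up_in[b]:
                up_in[b] = new
                changed = True
    return up_in[entry]

def privatizable(blocks, succ, entry):
    """Variables defined in the loop body that are not upward exposed at its entry."""
    defined = set().union(*(defs for _, defs in blocks.values()))
    return defined - upward_exposed(blocks, succ, entry)
```

For a body that always defines T before using it, T is reported as privatizable. If T is defined on only one branch of a conditional and used after the join, it is not.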

Array Privatization

We also need to privatize array variables. For iteration J of a loop, the upward-exposed variables are those exposed by the loop body, excluding any defined in earlier iterations.

    DO I = 1,100
    S0    T(1) = X
    L1    DO J = 2,N
    S1       T(J) = T(J-1)+B(I,J)
    S2       A(I,J) = T(J)
          ENDDO
    ENDDO

So for this fragment, T(1) is the only upward-exposed variable of the inner loop L1. Because S0 defines T(1) before L1 in every iteration of the I loop, no element of T is upward exposed at the top of the outer loop body.

Array Privatization

Using this analysis, we get the following code:

    PARALLEL DO I = 1,100
       PRIVATE t
    S0    t(1) = X
    L1    DO J = 2,N
    S1       t(J) = t(J-1)+B(I,J)
    S2       A(I,J) = t(J)
          ENDDO
    ENDDO
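A Python sketch of the privatized loop, checked against a sequential version that shares one T. The function names, the thread pool, and the 1-based padding of the lists are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

def sequential(x, b, n_outer, n):
    """Shared temporary array t; rows use 1-based column indices (index 0 is padding)."""
    t = [0.0] * (n + 1)
    a = [[0.0] * (n + 1) for _ in range(n_outer)]
    for i in range(n_outer):
        t[1] = x
        for j in range(2, n + 1):
            t[j] = t[j - 1] + b[i][j]
            a[i][j] = t[j]
    return a

def privatized(x, b, n_outer, n):
    """Each outer iteration allocates its own t, so outer iterations are independent."""
    a = [[0.0] * (n + 1) for _ in range(n_outer)]

    def outer_iteration(i):
        t = [0.0] * (n + 1)          # private copy of T for this iteration
        t[1] = x
        for j in range(2, n + 1):
            t[j] = t[j - 1] + b[i][j]
            a[i][j] = t[j]

    with ThreadPoolExecutor(max_workers=4) as pool:
        list(pool.map(outer_iteration, range(n_outer)))
    return a
```

Each task writes only its own row of `a`, so no synchronization is needed inside the outer loop.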

Loop Distribution

Loop distribution eliminates carried dependencies by moving the dependent statements into separate loops. Consequently, it often creates opportunity for outer-loop parallelism. We must add extra barriers to keep dependent loops from executing out of order, so the overhead may outweigh the parallel savings. Attempt other transformations before attempting this one.
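As a small illustration, the sketch below distributes a hypothetical two-statement loop with one carried dependence into two loops. The second loop is deliberately run in reverse order to show that its iterations no longer depend on each other. The loop and function names are assumptions, and lists are padded so that indices are 1-based.

```python
def original(a, b, c, n):
    a = a[:]
    d = [0.0] * (n + 1)
    for i in range(2, n + 1):
        a[i] = b[i] + c[i]
        d[i] = a[i - 1] * 2.0        # reads the value written one iteration earlier
    return a, d

def distributed(a, b, c, n):
    a = a[:]
    d = [0.0] * (n + 1)
    for i in range(2, n + 1):        # loop 1: no carried dependence
        a[i] = b[i] + c[i]
    # in a parallel run, a barrier would separate the two loops here
    for i in range(n, 1, -1):        # loop 2: any iteration order gives the same result
        d[i] = a[i - 1] * 2.0
    return a, d
```

The barrier between the two loops is the overhead that the text warns about.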

Alignment

Many carried dependencies are due to array alignment issues. If we can align all references, so that each value is produced and consumed in the same iteration, the carried dependencies go away and parallelism is possible.

    DO I = 2,N
       A(I) = B(I)+C(I)
       D(I) = A(I-1)*2.0
    ENDDO

Aligned version:

    DO I = 1,N
       IF (I .GT. 1) A(I) = B(I)+C(I)
       IF (I .LT. N) D(I+1) = A(I)*2.0
    ENDDO
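A quick way to check an aligned loop is to run it in an arbitrary iteration order and compare it with the original. The sketch below does that for a guarded loop over I = 1..N that assigns A(I) for I > 1 and D(I+1) for I < N, which are the bounds that reproduce exactly the assignments of the original 2..N loop. Function names are assumptions, and lists are padded for 1-based indexing.

```python
def original(a, b, c, n):
    a = a[:]
    d = [0.0] * (n + 2)
    for i in range(2, n + 1):
        a[i] = b[i] + c[i]
        d[i] = a[i - 1] * 2.0
    return a, d

def aligned(a, b, c, n, order):
    """order is any permutation of 1..n; the result should not depend on it."""
    a = a[:]
    d = [0.0] * (n + 2)
    for i in order:
        if i > 1:
            a[i] = b[i] + c[i]
        if i < n:
            d[i + 1] = a[i] * 2.0    # consumes the value produced in this same iteration
    return a, d
```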

Alignment

There are other ways to align the loop. The first wraps the extra iteration around with MOD; the second peels the first and last statements out of the loop.

    DO I = 2,N
       J = MOD(I+N-4,N-1)+2
       A(J) = B(J)+C(J)
       D(I) = A(I-1)*2.0
    ENDDO

    D(2) = A(1)*2.0
    DO I = 2,N-1
       A(I) = B(I)+C(I)
       D(I+1) = A(I)*2.0
    ENDDO
    A(N) = B(N)+C(N)
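Both variants can be checked the same way: run the loop in reverse iteration order and compare against the original sequential loop. This is a sketch under the assumption that N is at least 3 and that the first statement reads C(J). Function names are hypothetical and lists are padded for 1-based indexing.

```python
def original(a, b, c, n):
    a = a[:]
    d = [0.0] * (n + 1)
    for i in range(2, n + 1):
        a[i] = b[i] + c[i]
        d[i] = a[i - 1] * 2.0
    return a, d

def mod_aligned(a, b, c, n, order):
    """order is any permutation of 2..n."""
    a = a[:]
    d = [0.0] * (n + 1)
    for i in order:
        j = (i + n - 4) % (n - 1) + 2    # j = n when i = 2, otherwise j = i - 1
        a[j] = b[j] + c[j]
        d[i] = a[i - 1] * 2.0
    return a, d

def peeled(a, b, c, n, order):
    """order is any permutation of 2..n-1."""
    a = a[:]
    d = [0.0] * (n + 1)
    d[2] = a[1] * 2.0                    # peeled first use
    for i in order:
        a[i] = b[i] + c[i]
        d[i + 1] = a[i] * 2.0
    a[n] = b[n] + c[n]                   # peeled last definition
    return a, d
```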

Alignment

If an array is involved in a recurrence, then alignment isn’t possible. If two dependencies between the same statements have different dependence distances, then alignment alone doesn’t work. We can fix the second case by replicating code:

    DO I = 1,N
       A(I+1) = B(I)+C
       X(I) = A(I+1)+A(I)
    ENDDO

    DO I = 1,N
       A(I+1) = B(I)+C
       ! Replicated Statement
       IF (I .EQ. 1) THEN
          t = A(I)
       ELSE
          t = B(I-1)+C
       END IF
       X(I) = A(I+1)+t
    ENDDO
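The replicated loop can be checked against the original in the same way, by running it in an arbitrary order. In this sketch C is a scalar, as in the loop above. Function names are assumptions and lists are padded for 1-based indexing.

```python
def original(a, b, c, n):
    a = a[:]
    x = [0.0] * (n + 1)
    for i in range(1, n + 1):
        a[i + 1] = b[i] + c
        x[i] = a[i + 1] + a[i]       # a[i] was written by the previous iteration
    return a, x

def replicated(a, b, c, n, order):
    """order is any permutation of 1..n."""
    a = a[:]
    x = [0.0] * (n + 1)
    for i in order:
        a[i + 1] = b[i] + c
        # replicated statement: recompute the value instead of reading a[i]
        t = a[i] if i == 1 else b[i - 1] + c
        x[i] = a[i + 1] + t
    return a, x
```

A(1) is never written by the loop, which is why iteration 1 can still read it directly.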

Alignment

Theorem: Alignment, replication, and statement reordering are sufficient to eliminate all carried dependencies in a single loop containing no recurrence, and in which the distance of each dependence is a constant independent of the loop index.

We can establish this constructively. Let G = (V, E, δ) be a weighted graph. Each v ∈ V is a statement, and δ(v1, v2) is the dependence distance between v1 and v2. Let o: V → Z give the offset of each vertex. G is said to be carry free if o(v1) + δ(v1, v2) = o(v2) for every edge (v1, v2) ∈ E.

Alignment Procedure

    procedure Align(V, E, δ, o)
       while V is not empty
          remove element v from V
          for each (w,v) ∈ E
             if w ∈ V
                W ← W ∪ {w}
                o(w) ← o(v) - δ(w,v)
             else if o(w) != o(v) - δ(w,v)
                create vertex w’
                replace (w,v) with (w’,v)
                replicate all edges into w onto w’
                W ← W ∪ {w’}
                o(w’) ← o(v) - δ(w,v)
          for each (v,w) ∈ E
             if w ∈ V
                W ← W ∪ {w}
                o(w) ← o(v) + δ(v,w)
             else if o(w) != o(v) + δ(v,w)
                create vertex v’
                replace (v,w) with (v’,w)
                replicate edges into v onto v’
                W ← W ∪ {v’}
                o(v’) ← o(w) - δ(v,w)
    end align
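The carry-free condition and the offset assignment at the heart of the procedure can be sketched in a few lines of Python. This is a simplified illustration, not the Align procedure itself: it propagates offsets across edges and merely reports a conflict where Align would replicate a vertex. Edges are assumed to be triples (v1, v2, distance), and all names are hypothetical.

```python
from collections import deque

def is_carry_free(edges, offset):
    """True if o(v1) + distance == o(v2) holds for every edge."""
    return all(offset[v1] + dist == offset[v2] for v1, v2, dist in edges)

def propagate_offsets(vertices, edges):
    """Assign offsets that satisfy every edge, or return None on a conflict."""
    adjacent = {v: [] for v in vertices}
    for v1, v2, dist in edges:
        adjacent[v1].append((v2, dist))     # forward across the edge: add the distance
        adjacent[v2].append((v1, -dist))    # backward across the edge: subtract it
    offset = {}
    for root in vertices:
        if root in offset:
            continue
        offset[root] = 0
        queue = deque([root])
        while queue:
            v = queue.popleft()
            for w, dist in adjacent[v]:
                if w not in offset:
                    offset[w] = offset[v] + dist
                    queue.append(w)
                elif offset[w] != offset[v] + dist:
                    return None             # two distances disagree: replication needed
    return offset
```

A single edge of distance 1 yields offsets 0 and 1, meaning the second statement is shifted by one iteration. Two edges between the same statements with distances 0 and 1 produce a conflict.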

Loop Fusion

Loop distribution was a method for separating parallel parts of a loop. Our solution attempted to find the maximal loop distribution. The maximal distribution often finds parallelizable components too small for efficient parallelization. Two obvious solutions:
- Strip mine large loops to create larger granularity.
- Perform maximal distribution, and fuse together parallelizable loops.

Fusion Safety

Definition: A loop-independent dependence between statements S1 and S2 in loops L1 and L2 respectively is fusion-preventing if fusing L1 and L2 causes the dependence to be carried by the combined loop in the opposite direction.

    DO I = 1,N
    S1    A(I) = B(I)+C
    ENDDO
    DO I = 1,N
    S2    D(I) = A(I+1)+E
    ENDDO

Fused:

    DO I = 1,N
    S1    A(I) = B(I)+C
    S2    D(I) = A(I+1)+E
    ENDDO

In the original loops, S2 reads values of A(I+1) that S1 has already written. After fusion, S2 reads A(I+1) one iteration before S1 writes it, so the dependence is reversed and the fused loop computes different results.
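Running both versions shows the effect of a fusion-preventing dependence directly. This is a sketch with hypothetical function names; C and E are scalars and lists are padded for 1-based indexing.

```python
def separate(a, b, c, e, n):
    a = a[:]
    d = [0.0] * (n + 1)
    for i in range(1, n + 1):
        a[i] = b[i] + c
    for i in range(1, n + 1):
        d[i] = a[i + 1] + e          # sees the values written by the first loop
    return d

def fused(a, b, c, e, n):
    a = a[:]
    d = [0.0] * (n + 1)
    for i in range(1, n + 1):
        a[i] = b[i] + c
        d[i] = a[i + 1] + e          # a[i + 1] has not been written yet
    return d
```

Only D(N) agrees between the two versions, because A(N+1) is never written by either.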

Fusion Safety

We shouldn’t fuse loops if the fusion would violate the ordering imposed by the dependence graph.

Ordering Constraint: Two loops can’t be validly fused if there exists a path of loop-independent dependencies between them containing a loop or statement not being fused with them.

In the example graph, fusing L1 with L3 violates the ordering constraint: the fused node {L1,L3} would have to occur both before and after the node L2.
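The ordering constraint for a pair of loops reduces to a reachability check: is there a dependence path from the first loop to the second that passes through some third node? The sketch below assumes the graph is given as a successor map and that the first loop precedes the second. The graph used in the usage note (L1 to L2, L2 to L3, L1 to L3) is an assumed reading of the example, since the figure itself is not reproduced here.

```python
def violates_ordering(succ, first, second):
    """True if some path first -> ... -> second goes through a third node."""
    stack = [m for m in succ.get(first, ()) if m != second]
    seen = set(stack)
    while stack:
        n = stack.pop()
        for m in succ.get(n, ()):
            if m == second:
                return True              # reached second via an intermediate node
            if m not in seen:
                seen.add(m)
                stack.append(m)
    return False
```

With `{"L1": ["L2", "L3"], "L2": ["L3"]}`, fusing L1 and L3 is rejected because of the path through L2, while a graph containing only the direct edge from L1 to L3 passes the check.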

Fusion Profitability

Parallel loops should generally not be merged with sequential loops.

Definition: An edge between two statements in loops L1 and L2 respectively is said to be parallelism-inhibiting if after merging L1 and L2, the dependence is carried by the combined loop.

    DO I = 1,N
    S1    A(I+1) = B(I) + C
    ENDDO
    DO I = 1,N
    S2    D(I) = A(I) + E
    ENDDO

Fused:

    DO I = 1,N
    S1    A(I+1) = B(I) + C
    S2    D(I) = A(I) + E
    ENDDO

Each loop is parallel on its own. After fusion, S2 in iteration I reads the A(I) that S1 wrote in iteration I-1, so the fused loop carries a dependence and must run sequentially.
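For the simple case where the first loop writes A(I+w) and the second reads A(I+r) over the same index range, the kind of edge follows from the sign of w - r. The helper below is a sketch under that assumption only; the function name and return strings are hypothetical.

```python
def classify_fused_dependence(write_offset, read_offset):
    """Classify the dependence from 'A(I+w) = ...' to '... = A(I+r)' after fusion."""
    distance = write_offset - read_offset   # iterations from the write to the matching read
    if distance < 0:
        return "fusion-preventing"           # read would come before the write
    if distance > 0:
        return "parallelism-inhibiting"      # legal, but carried by the fused loop
    return "loop-independent"                # write and read in the same iteration
```

Writing A(I) and reading A(I+1) gives a negative distance (fusion-preventing), while writing A(I+1) and reading A(I) gives a positive one (parallelism-inhibiting).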

Typed Fusion

We start off by classifying loops into two types: parallel and sequential. We next gather together all edges that inhibit efficient fusion, and call them bad edges. Given a loop dependency graph (V,E), we want to obtain a graph (V’,E’) by merging vertices of V subject to the following constraints:
- Bad Edge Constraint: vertices joined by a bad edge aren’t fused.
- Ordering Constraint: vertices joined by a path containing a non-parallel vertex aren’t fused.

Typed Fusion Procedure

    procedure TypedFusion(G, T, type, B, t0)
       Initialize all variables to zero
       Set count[n] to be the in-degree of node n
       Initialize W with all nodes with in-degree zero
       while W isn’t empty
          remove element n with type t from W
          if t = t0
             if maxBadPrev[n] = 0 then p ← fused
             else p ← next[maxBadPrev[n]]
             if p != 0 then
                x ← node[p]
                num[n] ← num[x]
                update_successors(n,t)
                fuse x and n and call the result n
             else
                create_new_fused_node(n)
                update_successors(n,t)
          else
             create_new_node(n)
             update_successors(n,t)
    end TypedFusion
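The following Python sketch is a simplified variant of the procedure above, not a transcription of it. It visits loops in topological order and places each loop of type t0 into the earliest fused group after the last group it may not join. The update rule for `max_bad_prev` is an assumption about what update_successors does, and the bookkeeping arrays of the original (num, node, next) are replaced by a plain list of groups.

```python
from collections import deque

def typed_fusion(nodes, edges, node_type, bad, t0):
    """Return the fused groups of type-t0 loops, in order of creation."""
    succ = {n: [] for n in nodes}
    indegree = {n: 0 for n in nodes}
    for a, b in edges:
        succ[a].append(b)
        indegree[b] += 1

    max_bad_prev = {n: 0 for n in nodes}   # highest group n may not join (0 = none)
    group_of = {}
    groups = []                            # groups[k - 1] holds the loops of group k
    work = deque(n for n in nodes if indegree[n] == 0)

    while work:
        n = work.popleft()
        if node_type[n] == t0:
            k = max_bad_prev[n] + 1        # earliest group that is legal for n
            if k > len(groups):
                groups.append([])
            groups[k - 1].append(n)
            group_of[n] = k
        for m in succ[n]:
            if node_type[n] == t0 and (node_type[m] != t0 or (n, m) in bad):
                limit = group_of[n]        # m must not be fused with n's group
            else:
                limit = max_bad_prev[n]    # m inherits n's restriction
            max_bad_prev[m] = max(max_bad_prev[m], limit)
            indegree[m] -= 1
            if indegree[m] == 0:
                work.append(m)
    return groups
```

In a hypothetical graph with parallel loops L1, L3, L4, a sequential loop L2, and edges L1 to L2, L2 to L3, and L1 to L4, fusing the parallel type groups L1 with L4 and leaves L3 in a later group because of the path through L2.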

Typed Fusion Example

[Figure, four panels: original loop graph; graph annotated with (maxBadPrev, p) → num; after fusing parallel loops; after fusing sequential loops.]

Cohort Fusion

Given an outer loop containing some number of inner loops, we want to be able to run some inner loops in parallel. We can do this as follows:
1. Run TypedFusion with B = {fusion-preventing edges, parallelism-inhibiting edges, and edges between a parallel loop and a sequential loop}.
2. Put a barrier at the end of each identified cohort.
3. Run TypedFusion again to fuse the parallel loops in each cohort.
