Using the Iteration Space Visualizer in Loop Parallelization Yijun YU
Overview: ISV, a 3D Iteration Space Visualizer. View the dependence in the iteration space (iteration: one instance of the loop body; space: the grid of all index values). Detect the parallelism. Estimate the speedup. Derive a loop transformation. Find statement-level parallelism. Future development.
1.2 Visualize the Dependence: A dependence is visualized in an iteration space dependence graph. Node: an iteration. Edge: the dependence order between nodes. Color: the dependence type (FLOW: Write → Read; ANTI: Read → Write; OUTPUT: Write → Write).
1.3 Parallelism? Stepwise view of sequential execution: no parallelism found. However, many programs have parallelism…
2. Potential Parallelism: Time(sequential) = number of iterations. Dataflow: each iteration is executed as soon as its data are ready. Time(dataflow) = number of iterations on the critical (longest) dependence path. The potential parallelism is denoted by speedup = Time(sequential)/Time(dataflow).
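The two time measures above can be sketched in a few lines. This is a minimal illustration, not ISV's implementation: the function names and the toy graph are hypothetical, and the dependence graph is assumed to be given as a list of iterations plus (source, sink) edges.

```python
from collections import defaultdict

def dataflow_time(nodes, edges):
    """Number of dataflow steps = iterations on the longest dependence path."""
    succs = defaultdict(list)
    indeg = {n: 0 for n in nodes}
    for src, dst in edges:
        succs[src].append(dst)
        indeg[dst] += 1
    # step[n] is the earliest (1-based) time step at which iteration n can run.
    step = {n: 1 for n in nodes}
    ready = [n for n in nodes if indeg[n] == 0]
    visited = 0
    while ready:
        n = ready.pop()
        visited += 1
        for m in succs[n]:
            step[m] = max(step[m], step[n] + 1)
            indeg[m] -= 1
            if indeg[m] == 0:
                ready.append(m)
    if visited != len(nodes):
        raise ValueError("dependence graph contains a cycle")
    return max(step.values()) if step else 0

def potential_speedup(nodes, edges):
    """Time(sequential) / Time(dataflow) for a non-empty iteration space."""
    return len(nodes) / dataflow_time(nodes, edges)

# Hypothetical toy graph: a chain a -> b -> c, a pair d -> e, and a free node f.
toy_nodes = ["a", "b", "c", "d", "e", "f"]
toy_edges = [("a", "b"), ("b", "c"), ("d", "e")]
print(dataflow_time(toy_nodes, toy_edges))      # 3
print(potential_speedup(toy_nodes, toy_edges))  # 2.0
```

Six iterations run sequentially in 6 steps; the longest chain has 3 iterations, so the potential speedup is 6/3 = 2.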
2.2 Irregular dependence: Dependences have non-uniform distances. Parallelism analysis: 200 iterations over 15 dataflow steps. Speedup: 200/15 = 13.3. Problem: how to exploit it?
3. Visualize parallelism: Find answers to these questions. What is the dependence pattern? Is there a parallel loop? (How to find it?) What is the maximal parallelism? (How to exploit it?) Is the load of parallel tasks balanced?
3.1 Example 3
3.2 3D Space
3.3 Loop parallelizable? The I, J, K loops are in the 3D space: 32 iterations. Simulate sequential execution. Which loop can be parallel?
3.4 Loop parallelization: Interactively try the parallelization by checking whether loop I is parallel. The blinking dependence edges prevent the parallelization of the given loop I.
3.5 Parallel execution: Let ISV find the correct parallelization by automatically checking for the parallel loop. Simulate parallel execution: it takes 16 time steps.
3.6 Dataflow execution: Sequential execution takes 32 time steps. Simulate dataflow execution: it takes only 4 time steps. Potential speedup = 32/4 = 8.
3.7 Graph partitioning: Dataflow speedup = 8. Iterate through the partitions, which are the connected components of the dependence graph. All the partitions are load balanced.
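Partitioning into connected components can be sketched with a small union-find. The iteration space below is hypothetical (a 4×2×4 grid with dependences along the first index only); it is not the dependence pattern of Example 3 and was chosen only because its numbers (32 iterations, 8 balanced partitions) are easy to check by hand.

```python
def partitions(nodes, edges):
    """Group iterations into connected components of the dependence graph."""
    parent = {n: n for n in nodes}

    def find(x):
        # Follow parent links to the representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)  # direction is ignored for connectivity

    groups = {}
    for n in nodes:
        groups.setdefault(find(n), []).append(n)
    return list(groups.values())

# Hypothetical space: 4 x 2 x 4 iterations, each depending on its
# predecessor along the first index.
grid = [(i, j, k) for i in range(4) for j in range(2) for k in range(4)]
deps = [((i - 1, j, k), (i, j, k)) for (i, j, k) in grid if i > 0]

parts = partitions(grid, deps)
print(len(parts))                        # 8
print(sorted({len(p) for p in parts}))   # [4]
```

Every partition holds 4 iterations, so the load is balanced and the partitions can run as independent tasks.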
4. Loop Transformation: A transformation turns potential parallelism into real parallelism.
4.1 Example 4
4.2 The iteration space Sequentially 25 iterations
4.3 Loop Parallelizable? check loop I check loop J
4.4 Dataflow execution: 9 steps in total. Potential speedup: 25/9 = 2.78. Wavefront effect: all iterations on the same wave lie on the same line.
4.5 Zoom-in on the I-space
4.6 Speedup vs program size: Zoom-in previews the parallelism in part of a loop without modifying the program. Executing the program at different sizes n gives an estimated speedup of n^2/(2n-1).
4.7 How to obtain the potential parallelism: Here we already have these metrics. Sequential time steps = N^2. Dataflow time steps = 2N-1. Potential speedup = N^2/(2N-1). Transformation: how to obtain the potential speedup of a loop?
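These metrics can be derived in a few lines. The derivation assumes a dependence pattern that the slides do not state explicitly: each iteration (i, j) depends on (i-1, j) and (i, j-1), with missing predecessors at the boundary counted as step 0. This assumption is consistent with the wavefront lines and with the 25 iterations and 9 steps reported for N = 5.

```latex
\begin{align*}
t(i,j) &= 1 + \max\bigl(t(i-1,j),\ t(i,j-1)\bigr), \qquad t(0,j) = t(i,0) = 0 \\
\Rightarrow\quad t(i,j) &= i + j - 1, \qquad 1 \le i, j \le N \\
T_{\text{dataflow}} &= t(N,N) = 2N - 1, \qquad T_{\text{sequential}} = N^2 \\
\text{speedup} &= \frac{N^2}{2N-1} \approx \frac{N}{2} \quad \text{for large } N
\end{align*}
```

Under this assumption all iterations with the same i + j run in the same step, which is the wavefront, and N = 5 gives 25/9 ≈ 2.78.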
4.8 Unimodular transformation (UT): A unimodular matrix is a square integer matrix with determinant +1 or -1. It can be obtained from the identity matrix by three kinds of basic transformations: reversal, interchange, and skewing. The new loop execution order is determined by the transformed index. The iteration space keeps unit step size. Finding a suitable UT reorders the iterations so that the new loop nest has a parallel loop. [Figure: new loop index = unimodular matrix × old loop index, with example matrices for reversal, interchange, and skewing]
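The three basic transformations can be illustrated for a 2-deep loop nest. The matrices below are standard textbook instances rather than the ones shown on the slide, and the helper names are hypothetical.

```python
def det2(m):
    """Determinant of a 2x2 integer matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def apply(m, index):
    """New loop index = matrix x old loop index."""
    return tuple(sum(a * x for a, x in zip(row, index)) for row in m)

def matmul(a, b):
    """Product of two 2x2 matrices (composition of transformations)."""
    return [[sum(a[r][k] * b[k][c] for k in range(2)) for c in range(2)]
            for r in range(2)]

REVERSAL = [[-1, 0], [0, 1]]    # run the outer loop backwards: i' = -i
INTERCHANGE = [[0, 1], [1, 0]]  # swap the two loops: (i', j') = (j, i)
SKEWING = [[1, 0], [1, 1]]      # skew the inner loop: j' = i + j

old = (2, 3)
print(apply(REVERSAL, old))     # (-2, 3)
print(apply(INTERCHANGE, old))  # (3, 2)
print(apply(SKEWING, old))      # (2, 5)

# Each basic matrix has determinant +1 or -1, and so does any product of them.
combined = matmul(INTERCHANGE, SKEWING)
print(det2(REVERSAL), det2(INTERCHANGE), det2(SKEWING), det2(combined))  # -1 -1 1 -1
```

Because the determinant is ±1, the transformation maps integer points one-to-one onto integer points, which is why the transformed space keeps unit step size.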
4.9 Hyperplane transformation: Interactively define a hyperplane. Observe that the plane-by-plane iteration matches the dataflow simulation (plane = dataflow). Based on the plane, ISV calculates a unimodular transformation.
4.10 The derived UT: The transformed iteration space and the generated loop.
4.11 Verify the UT: ISV checks whether the transformation is valid. Observe that the parallel loop execution in the transformed loop matches the plane execution (parallel = plane).
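One common validity criterion, which may or may not be the exact check ISV performs, is that every transformed dependence distance vector stays lexicographically positive. The sketch below applies it to an assumed 2D pattern with distances (1,0) and (0,1); the pattern, the matrices, and the function names are illustrative assumptions.

```python
def transform(matrix, vector):
    """Apply an integer matrix to a distance vector."""
    return tuple(sum(m * v for m, v in zip(row, vector)) for row in matrix)

def is_lex_positive(vector):
    """True if the first nonzero component is positive."""
    for x in vector:
        if x > 0:
            return True
        if x < 0:
            return False
    return False  # an all-zero vector is not a loop-carried dependence

def is_valid(matrix, distances):
    """A transformation is valid if it keeps every dependence pointing forward."""
    return all(is_lex_positive(transform(matrix, d)) for d in distances)

def parallel_levels(distances, depth):
    """Loop levels that carry no dependence (level 0 is the outermost loop)."""
    carried = set()
    for d in distances:
        for level, x in enumerate(d):
            if x != 0:
                carried.add(level)
                break
    return [lvl for lvl in range(depth) if lvl not in carried]

distances = [(1, 0), (0, 1)]   # assumed dependence pattern
wavefront = [[1, 1], [0, 1]]   # new outer index = i + j
reversal = [[-1, 0], [0, 1]]   # reverses the outer loop

print(is_valid(wavefront, distances))   # True
print(is_valid(reversal, distances))    # False
print(parallel_levels(distances, 2))    # []
new_distances = [transform(wavefront, d) for d in distances]
print(new_distances)                      # [(1, 0), (1, 1)]
print(parallel_levels(new_distances, 2))  # [1]
```

Before the transformation both loops carry a dependence; afterwards only the outer loop does, so the inner loop (level 1) can run in parallel, one plane at a time.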
5. Statement-level parallelism: Unimodular transformations work at the iteration level. The statement dependences within the loop body are hidden in the iteration space graph. How to exploit parallelism at the statement level? Statement to iteration.
5.1 Example 5 SSV: statement space visualization
5.2 Iteration-level parallelism: The iteration space is 2D. There are N^2 = 16 iterations. The dataflow execution has 2N-1 = 7 time steps. The potential speedup is 16/7 = 2.29.
5.3 Parallelism in statements: The (statement) iteration space is 3D. There are 2N^2 = 32 statements. The dataflow execution still has 2N-1 = 7 time steps. The potential speedup is 32/7 = 4.57.
5.4 Comparison: Here, statement-level parallelism doubles the potential speedup obtained at the iteration level.
5.5 Define the partition planes partitions hyper-planes
What is validity? Show the execution order on top of the dependence arrows. (for 1 plane or all together, depending on the density of the slide)
5.6 Invalid UT: The invalid unimodular transformation derived from the hyperplane is refused by ISV. Alternatively, ISV calculates the unimodular transformation from the dependence distance vectors available in the dependence graph.
6. Pseudo distance method: Extract base vectors from the dependent iterations. Examine whether the base vectors generate all the distances. Calculate the unimodular transformation based on the base vectors. [Figure: the base vectors and the unimodular matrix]
Another way to find parallelism automatically: The iteration space is a grid, and non-uniform dependences are members of a uniform dependence grid with unknown base vectors. Finding these base vectors allows us to extend existing parallelization to the non-uniform case.
6.1 Dependence distance (1,0,-1) (0,1,1)
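The step "examine whether the base vectors generate all the distances" can be sketched as a brute-force search for integer coefficients. This is only an illustration: the search bound, the function name, and the sample distances are hypothetical, and the slides do not say how ISV performs this test. The base vectors are the two listed in 6.1.

```python
from itertools import product

def generated_by(distance, base, bound=4):
    """Return integer coefficients c with sum(c[k] * base[k]) == distance,
    searching only |c[k]| <= bound, or None if no such combination exists."""
    for coeffs in product(range(-bound, bound + 1), repeat=len(base)):
        combo = tuple(
            sum(c * b[i] for c, b in zip(coeffs, base))
            for i in range(len(distance))
        )
        if combo == tuple(distance):
            return coeffs
    return None

base = [(1, 0, -1), (0, 1, 1)]   # base vectors from 6.1

# Hypothetical non-uniform distances, for illustration only.
print(generated_by((2, 1, -1), base))    # (2, 1)
print(generated_by((3, -2, -5), base))   # (3, -2)
print(generated_by((1, 1, 1), base))     # None
```

A distance that is an integer combination of the base vectors lies on the uniform dependence grid; (1,1,1) does not, so it would have to be handled separately.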
6.2 The Transformation: The transforming matrix is discovered by the pseudo distance method. The distance vectors are transformed: (1,0,-1) → (0,1,0) and (0,1,1) → (0,0,1). The dependent iterations have the same first index, which implies that the outermost loop is parallel.
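The matrix itself is not reproduced in this transcript. The sketch below uses one hypothetical unimodular matrix that maps the two base vectors as described, (1,0,-1) to (0,1,0) and (0,1,1) to (0,0,1); it is not necessarily the matrix ISV derived.

```python
def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

def transform(matrix, vector):
    """Apply an integer matrix to a distance vector."""
    return tuple(sum(m * v for m, v in zip(row, vector)) for row in matrix)

# Hypothetical matrix: its first row is orthogonal to both base vectors,
# so every dependence keeps the same first (outermost) index.
T = [[1, -1, 1],
     [1,  0, 0],
     [0,  1, 0]]

print(det3(T))                     # 1
print(transform(T, (1, 0, -1)))    # (0, 1, 0)
print(transform(T, (0, 1, 1)))     # (0, 0, 1)
# An integer combination of the base vectors, 2*(1,0,-1) + 1*(0,1,1):
print(transform(T, (2, 1, -1)))    # (0, 2, 1)
```

Every distance generated by the base vectors has a zero first component after the transformation, so no dependence crosses iterations of the outermost loop.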
6.3 Compare the UT matrices: The transforming matrix discovered by the pseudo distance method versus an invalid transforming matrix discovered by the hyperplane method. The same first column means the transformed outermost loops have the same index.
6.4 The transformed space: The outermost loop is parallel. There are 8 parallel tasks. The load of the tasks is not balanced. The longest task takes 7 time steps.
7. Non-perfectly nested loop: What is it? Unimodular transformations only work for perfectly nested loops. For a non-perfectly nested loop, the iteration space is constructed with extended indices: an N-fold non-perfectly nested loop becomes an (N+1)-fold perfectly nested loop.
7.1 Perfectly nested loop?
Non-perfectly nested loop:
      DO I1 = 1,3
        ! statement outside the inner loop
        A(I1) = A(I1-1)
        DO I2 = 1,4
          B(I1,I2) = B(I1-1,I2)+B(I1,I2-1)
        ENDDO
      ENDDO
Perfectly nested loop:
      DO I1 = 1,3
        DO I2 = 1,5
          DO I3 = 0,1
            ! I3 is the extended index that selects the statement
            IF (I2.EQ.1.AND.I3.EQ.0) THEN
              A(I1) = A(I1-1)
            ELSE IF (I3.EQ.1) THEN
              B(I1-1,I2) = B(I1-2,I2)+B(I1-1,I2-1)
            ENDIF
          ENDDO
        ENDDO
      ENDDO
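The idea of an extended index can be checked with a small Python sketch. The embedding below is an assumption made for this illustration: it keeps the original bounds and subscripts and adds an index i3 that selects the statement, so it differs from the index ranges and shifted subscripts in the listing above. It only shows that a 2-fold non-perfect nest can be run as a 3-fold perfect nest with the same result.

```python
def non_perfect(A, B):
    """2-fold non-perfectly nested loop: one statement sits outside the inner loop."""
    for i1 in range(1, 4):
        A[i1] = A[i1 - 1]
        for i2 in range(1, 5):
            B[i1][i2] = B[i1 - 1][i2] + B[i1][i2 - 1]

def perfect(A, B):
    """3-fold perfectly nested loop; i3 is the extended (statement) index."""
    for i1 in range(1, 4):
        for i2 in range(1, 5):
            for i3 in range(0, 2):
                if i2 == 1 and i3 == 0:
                    A[i1] = A[i1 - 1]      # runs once per i1, before any B update
                elif i3 == 1:
                    B[i1][i2] = B[i1 - 1][i2] + B[i1][i2 - 1]

def fresh_data():
    """Arrays with index 0 as boundary: A(0:3) and B(0:3, 0:4)."""
    A = [7, 0, 0, 0]
    B = [[1] * 5 for _ in range(4)]
    return A, B

A1, B1 = fresh_data()
A2, B2 = fresh_data()
non_perfect(A1, B1)
perfect(A2, B2)
print(A1 == A2 and B1 == B2)   # True
```

Once every statement lives in the innermost body, the nest has a regular 3D iteration space, which is the form unimodular transformations require.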
7.2 Exploit parallelism with UT
8. Applications
Program                 Category    Depth   Form          Pattern       Transformation
Example 1               Tutorial    1       Perfect       Uniform       N/A
Example 2               Tutorial    2       Perfect       Non-uniform   N/A
Example 3               Tutorial    3       Perfect       Uniform       Wavefront UT
Example 4               Tutorial    2       Perfect       Uniform       Wavefront UT
Example 5               Tutorial    2+1     Perfect       Uniform       Stmt Partitioning UT
Example 6               Tutorial    2+1     Non-perfect   Uniform       Wavefront UT
Matrix multiplication   Algorithm   3       Perfect       Uniform       Parallelization
Gauss-Jordan            Algorithm   3       Perfect       Non-uniform   Parallelization
FFT                     Algorithm   3       Perfect       Non-uniform   Parallelization
Cholesky                Benchmark   4       Non-perfect   Non-uniform   Partitioning UT
TOMCATV                 Benchmark   3       Non-perfect   Uniform       Parallelization
Flow3D                  CFD App.    3       Perfect       Uniform       Wavefront UT
9. Future considerations: Weighted dependence graph. More semantics on data locality: data space graph, data communication graph, data reuse iteration space graph. More loop transformations: affine (statement) iteration space mappings, automatic statement distribution. Integration with the Omega library.