Using the Iteration Space Visualizer in Loop Parallelization Yijun YU

Overview
- ISV, a 3D Iteration Space Visualizer: view the dependences in the iteration space
  - iteration: one instance of the loop body
  - space: the grid of all index values
- Detect the parallelism
- Estimate the speedup
- Derive a loop transformation
- Find statement-level parallelism
- Future development

1. Dependence

Sequential program:
  DO I = 1,3
    A(I) = A(I-1)
  ENDDO

Sequential execution trace:
  A(1) = A(0)
  A(2) = A(1)
  A(3) = A(2)

Naively parallelized loop (shared memory):
  DOALL I = 1,3
    A(I) = A(I-1)
  ENDDO

One possible parallel execution trace, which violates the dependence:
  A(2) = A(1)
  A(1) = A(0)
  A(3) = A(2)

1.1 Example 1: an ISV directive marks the loop to visualize.

1.2 Visualize the Dependence
A dependence is visualized in the iteration space dependence graph:
- Node: an iteration
- Edge: the dependence order between nodes
- Color: the dependence type
  - FLOW: write then read
  - ANTI: read then write
  - OUTPUT: write then write
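For readers of the transcript, a minimal sketch (not from the original slides) of the three dependence types, written as Python loops with hypothetical array accesses:

# Hypothetical loops showing the three dependence types that ISV colours differently.
n = 8
a = list(range(n + 2))

# FLOW dependence: iteration i writes a[i], iteration i+1 reads it (write -> read).
for i in range(1, n):
    a[i] = a[i - 1] + 1

# ANTI dependence: iteration i reads a[i+1] before iteration i+1 overwrites it (read -> write).
for i in range(1, n):
    a[i] = a[i + 1] + 1

# OUTPUT dependence: every iteration writes the same location (write -> write).
for i in range(1, n):
    a[0] = i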

1.3 Parallelism?
- Stepwise view of the sequential execution
- No parallelism found in this loop
- However, many programs do have parallelism…

2. Potential Parallelism
- Time(sequential) = number of iterations
- Dataflow: an iteration executes as soon as its data are ready
- Time(dataflow) = number of iterations on the longest (critical) dependence path
- The potential parallelism is expressed as speedup = Time(sequential) / Time(dataflow)
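A small sketch of how the dataflow time could be computed outside ISV; the graph encoding and the function name are illustrative, not part of the tool:

def dataflow_time(iterations, deps):
    # deps[i] lists the iterations that iteration i depends on.
    # The dataflow time is the length of the longest dependence chain.
    level = {}
    def step(it):
        if it not in level:
            level[it] = 1 + max((step(p) for p in deps.get(it, ())), default=0)
        return level[it]
    return max(step(it) for it in iterations)

# 1D recurrence A(I) = A(I-1): every iteration waits for the previous one,
# so Time(dataflow) = Time(sequential) and the potential speedup is 1.
n = 10
iters = list(range(1, n + 1))
deps = {i: [i - 1] for i in iters if i > 1}
print(dataflow_time(iters, deps))   # 10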

2.1 Example 2

Dependences in the iteration space are found by solving the Diophantine equations from the array subscripts within the loop bounds (the polytope).

2.2 Irregular dependence
- Dependences have non-uniform distances
- Parallelism analysis: 200 iterations execute in 15 dataflow steps
- Potential speedup: 200/15 ≈ 13.3
- Problem: how to exploit it?
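The source of Example 2 is not reproduced in the transcript; as an assumption, a loop of the following shape gives non-uniform distances, because the producing and consuming iterations drift apart as i grows:

# Hypothetical loop with non-uniform dependence distances:
# iteration i writes a[2*i] and reads a[i]; the flow dependence from the writer
# of a[i] (iteration i/2, when i is even) to the reader at iteration i has
# distance i - i/2 = i/2, which varies with i.
n = 8
a = [0] * (2 * n + 1)
for i in range(1, n):
    a[2 * i] = a[i] + 1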

3. Visualize parallelism
Find answers to these questions:
- What is the dependence pattern?
- Is there a parallel loop? (How to find it?)
- What is the maximal parallelism? (How to exploit it?)
- Is the load of the parallel tasks balanced?

3.1 Example 3

3.2 3D Space

3.3 Loop parallelizable?
- The I, J, K loops form a 3D iteration space of 32 iterations
- Simulate the sequential execution
- Which loop can be made parallel?

3.4 Loop parallelization
- Interactively try a parallelization: check whether loop I is parallel
- The blinking dependence edges show the dependences that prevent parallelizing loop I

3.5 Parallel execution
- Let ISV find a correct parallelization: automatically check which loop is parallel
- Simulate the parallel execution: it takes 16 time steps

3.6 Dataflow execution
- Sequential execution takes 32 time steps
- Simulated dataflow execution takes only 4 time steps
- Potential speedup = 32/4 = 8

3.7 Graph partitioning
- Dataflow speedup = 8
- Iterate through the partitions: the connected components of the dependence graph
- All partitions are load balanced
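A minimal sketch of the partitioning step: treat the dependence graph as undirected and group iterations into connected components; the data layout below is assumed, not ISV's internal representation:

def partitions(iterations, edges):
    # Connected components of the undirected dependence graph:
    # iterations in different components share no data and can run independently.
    adj = {it: set() for it in iterations}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, comps = set(), []
    for it in iterations:
        if it in seen:
            continue
        stack, comp = [it], []
        seen.add(it)
        while stack:
            x = stack.pop()
            comp.append(x)
            for y in adj[x]:
                if y not in seen:
                    seen.add(y)
                    stack.append(y)
        comps.append(comp)
    return comps

print(partitions([1, 2, 3, 4], [(1, 2), (3, 4)]))   # [[1, 2], [3, 4]]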

4. Loop Transformation
A loop transformation turns potential parallelism into real parallelism.

4.1 Example 4

4.2 The iteration space
- Sequential execution: 25 iterations

4.3 Loop parallelizable?
- Check loop I
- Check loop J

4.4 Dataflow execution
- In total 9 dataflow steps
- Potential speedup: 25/9 ≈ 2.78
- Wavefront effect: all iterations executed in the same step lie on the same line

4.5 Zoom-in on the I-space

4.6 Speedup vs. program size
- Zooming in previews the parallelism in part of a loop without modifying the program
- Executing the program for different sizes n gives an estimated speedup of n²/(2n-1)
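A quick check of that estimate, assuming the N x N wavefront pattern of this example (N*N sequential steps and 2N-1 dataflow steps, one per anti-diagonal):

# Estimated speedup n*n / (2n-1) for a few problem sizes; it approaches n/2.
for n in (4, 5, 10, 100):
    print(n, round(n * n / (2 * n - 1), 2))
# n = 5 reproduces the 25/9 = 2.78 figure of the previous slide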

4.7 How to obtain the potential parallelism
Here we already have these metrics:
- Sequential time steps = N²
- Dataflow time steps = 2N-1
- Potential speedup = N²/(2N-1)
How to actually obtain this potential speedup of the loop? By transformation.

4.8 Unimodular transformation (UT)
- A unimodular matrix is a square integer matrix with determinant ±1
- It can be obtained from the identity matrix by three kinds of elementary transformations: reversal, interchange, and skewing
- The new loop index is the unimodular matrix applied to the old loop index; the new execution order is determined by the transformed index
- The iteration space keeps its unit step size
- A suitable UT reorders the iterations so that the new loop nest contains a parallel loop
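A minimal sketch, assuming the classic skewing matrix [[1,1],[0,1]] for a doubly nested loop with distance vectors (1,0) and (0,1); this matrix is an illustration, not the one ISV derives for Example 4:

# New indices (t, p) = U * (i, j); |det U| = 1, so the mapping is a bijection
# on the integer grid and keeps the unit step size.
U = [[1, 1],
     [0, 1]]
det = U[0][0] * U[1][1] - U[0][1] * U[1][0]
assert abs(det) == 1                        # unimodular

def transform(i, j):
    t = U[0][0] * i + U[0][1] * j           # t = i + j: outer wavefront (time) loop
    p = U[1][0] * i + U[1][1] * j           # p = j: inner loop
    return t, p

# Both distance vectors get a positive first component, so every dependence is
# carried by the outer t loop and the inner p loop can run as a DOALL.
for d in [(1, 0), (0, 1)]:
    print(d, "->", transform(*d))           # (1,0) -> (1,0), (0,1) -> (1,1)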

4.9 Hyperplane transformation
- Interactively define a hyperplane
- Observe that the plane-by-plane iteration matches the dataflow simulation (plane = dataflow)
- Based on the plane, ISV calculates a unimodular transformation

4.10 The derived UT
- The transformed iteration space and the generated loop

4.11 Verify the UT
- ISV checks whether the transformation is valid
- Observe that the parallel execution of the transformed loop matches the plane execution (parallel = plane)

5. Statement-level parallelism
- Unimodular transformations work at the iteration level
- Statement dependences within the loop body are hidden in the iteration space graph
- How to exploit parallelism at the statement level? Map statements to iterations (see the sketch below)
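The source of Example 5 is not shown in the transcript; the statement-to-iteration mapping can be sketched on a hypothetical two-statement loop body: a statement index s becomes an extra dimension, so an N x N iteration space turns into a 2*N*N statement space.

# Hypothetical mapping of a 2-statement loop body to a 3D statement space:
# every point (i, j, s) is one statement instance, with s in {0, 1}.
N = 4
statement_space = [(i, j, s)
                   for i in range(1, N + 1)
                   for j in range(1, N + 1)
                   for s in (0, 1)]
print(len(statement_space))   # 2*N*N = 32, as on slide 5.3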

5.1 Example 5 (SSV: statement space visualization)

5.2 Iteration-level parallelism
- The iteration space is 2D: N² = 16 iterations (N = 4)
- The dataflow execution takes 2N-1 = 7 time steps
- Potential speedup: 16/7 ≈ 2.29

5.3 Parallelism in statements
- The statement iteration space is 3D: 2N² = 32 statement instances
- The dataflow execution still takes 2N-1 = 7 time steps
- Potential speedup: 32/7 ≈ 4.57

5.4 Comparison: statement-level analysis here doubles the potential speedup obtained at the iteration level.

5.5 Define the partition planes (partitions and hyperplanes)

What is validity? Show the execution order on top of the dependence arrows (for one plane or all planes together, depending on the density of the slide).

5.6 Invalid UT
- An invalid unimodular transformation derived from the hyperplane is refused by ISV
- Alternatively, ISV calculates the unimodular transformation from the dependence distance vectors available in the dependence graph

6. Pseudo distance method
The pseudo distance method:
- Extract base vectors from the dependent iterations
- Examine whether the base vectors generate all the distances (see the sketch below)
- Calculate the unimodular transformation from the base vectors
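A sketch of the generation check, under the simplifying assumption that the base vectors form a square, full-rank matrix; a distance is generated exactly when its coordinates in that basis are integers:

import numpy as np

def generated_by(base_vectors, distances, tol=1e-9):
    # Columns of B are the base vectors; a distance d is generated by the basis
    # iff the solution x of B*x = d is an integer vector.
    B = np.array(base_vectors, dtype=float).T
    for d in distances:
        x = np.linalg.solve(B, np.array(d, dtype=float))
        if not np.allclose(x, np.round(x), atol=tol):
            return False
    return True

# (1,2), (2,3) and (3,5) are all integer combinations of (1,1) and (0,1)
print(generated_by([(1, 1), (0, 1)], [(1, 2), (2, 3), (3, 5)]))   # True
print(generated_by([(2, 0), (0, 2)], [(1, 1)]))                   # False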

Another way to find parallelism automatically: the iteration space is a grid, and non-uniform dependences can be viewed as members of a uniform dependence grid with unknown base vectors. Finding these base vectors lets existing parallelization techniques be extended to the non-uniform case.

6.1 Dependence distance: the distance vectors are (1,0,-1) and (0,1,1)

6.2 The transformation
- The transforming matrix is discovered by the pseudo distance method
- The distance vectors are transformed: (1,0,-1) → (0,1,0) and (0,1,1) → (0,0,1)
- The dependent iterations have the same first index, which implies that the outermost loop is parallel

6.3 Compare the UT matrices
- The transforming matrix discovered by the pseudo distance method vs. the invalid transforming matrix discovered by the hyperplane method
- The same first column means the transformed outermost loops have the same index

6.4 The transformed space
- The outermost loop is parallel: there are 8 parallel tasks
- The load of the tasks is not balanced; the longest task takes 7 time steps

7. Non-perfectly nested loops
- Unimodular transformations only work for perfectly nested loops
- For a non-perfectly nested loop, the iteration space is constructed with extended indices: an N-fold non-perfectly nested loop becomes an (N+1)-fold perfectly nested loop

7.1 Perfectly nested loop?

Non-perfectly nested loop:
  DO I1 = 1,3
    A(I1) = A(I1-1)
    DO I2 = 1,4
      B(I1,I2) = B(I1-1,I2)+B(I1,I2-1)
    ENDDO
  ENDDO

Perfectly nested loop (with extended index I3):
  DO I1 = 1,3
    DO I2 = 1,5
      DO I3 = 0,1
        IF (I2.EQ.1.AND.I3.EQ.0) THEN
          A(I1) = A(I1-1)
        ELSE IF (I3.EQ.1) THEN
          B(I1-1,I2) = B(I1-2,I2)+B(I1-1,I2-1)
        ENDIF
      ENDDO
    ENDDO
  ENDDO

7.2 Exploit parallelism with UT

8. Applications

Program               | Category  | Depth | Form        | Pattern     | Transformation
----------------------|-----------|-------|-------------|-------------|---------------------
Example 1             | Tutorial  | 1     | Perfect     | Uniform     | N/A
Example 2             | Tutorial  | 2     | Perfect     | Non-uniform | N/A
Example 3             | Tutorial  | 3     | Perfect     | Uniform     | Wavefront UT
Example 4             | Tutorial  | 2     | Perfect     | Uniform     | Wavefront UT
Example 5             | Tutorial  | 2+1   | Perfect     | Uniform     | Stmt partitioning UT
Example 6             | Tutorial  | 2+1   | Non-perfect | Uniform     | Wavefront UT
Matrix multiplication | Algorithm | 3     | Perfect     | Uniform     | Parallelization
Gauss-Jordan          | Algorithm | 3     | Perfect     | Non-uniform | Parallelization
FFT                   | Algorithm | 3     | Perfect     | Non-uniform | Parallelization
Cholesky              | Benchmark | 4     | Non-perfect | Non-uniform | Partitioning UT
TOMCATV               | Benchmark | 3     | Non-perfect | Uniform     | Parallelization
Flow3D                | CFD App.  | 3     | Perfect     | Uniform     | Wavefront UT

9. Future considerations
- Weighted dependence graph
- More semantics on data locality: data space graph, data communication graph, data-reuse iteration space graph
- More loop transformations: affine (statement) iteration space mappings, automatic statement distribution
- Integration with the Omega library