
1 An Evaluation of Data-Parallel Compiler Support for Line-Sweep Applications
Daniel Chavarría-Miranda, John Mellor-Crummey
Dept. of Computer Science, Rice University

2 High-Performance Fortran (HPF)
Industry-standard data-parallel language
Sequential Fortran program + data partitioning → HPF program → compilation → parallel machine
Compiler responsibilities:
– partition computation
– insert comm / sync
– manage storage
– produce the same answers as the Fortran program
Partitioning of data drives partitioning of computation, …
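For concreteness, a minimal sketch of what "sequential Fortran + data partitioning" looks like in standard HPF (the array, its size, and the processor arrangement are illustrative, not taken from the benchmarks):

!HPF$ PROCESSORS p(4)
      real a(1024, 1024)
!HPF$ DISTRIBUTE a(BLOCK, *) ONTO p
      do j = 2, 1024
         do i = 1, 1024
            ! each processor owns a block of rows, so this j-recurrence
            ! stays local; the compiler derives the iteration partitioning
            ! from the data layout and inserts any comm / sync needed
            a(i, j) = a(i, j - 1) + 1.0
         enddo
      enddo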

3 Motivation
Obtaining high performance from applications written in high-level parallel languages has been elusive
Tightly-coupled applications are particularly hard
Data dependences serialize computation
– induce tradeoffs between parallelism, communication granularity, and communication frequency
– traditional HPF partitionings limit scalability and performance
Communication might be needed inside loops

4 Contributions
A set of compilation techniques that enable us to match hand-coded performance for tightly-coupled applications
An analysis of their performance impact

5 dHPF Compiler
Based on an abstract equational framework
– manipulates sets of processors, array elements, iterations, and pairwise mappings between these sets
– optimizations and code generation are implemented as operations on these sets and mappings
Sophisticated computation-partitioning model
– enables partial replication of computation to reduce communication
Support for the multipartitioning distribution
– MULTI distribution specifier
– suited for line-sweep computations
Innovative optimizations
– reduce communication
– improve locality
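As a rough illustration of the equational style (the notation below is ours, not dHPF's internal representation): for a statement computed ON_HOME a(i, j) that also reads a(i, j - 2), the elements processor p must receive can be written as a set difference

    Recv(p) = { a(i, j - 2) : (i, j) ∈ Iter(p) } − Own(p)

where Iter(p) is the set of iterations assigned to p and Own(p) is the set of array elements it owns. Communication generation, coalescing, and placement then reduce to operations (union, difference, projection) on such sets and mappings.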

6 Overview
Introduction
 Line-Sweep Computations
Performance Comparison
Optimization Evaluation
– Partially Replicated Computation
– Interprocedural Communication Elimination
– Communication Coalescing
– Direct Access Buffers
Conclusions

7 Line-Sweep Computations
1D recurrences on a multidimensional domain
Recurrences order the computation along each dimension
Compiler-based parallelization is hard: loop-carried dependences, fine-grained parallelism
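A minimal sketch of the dependence pattern (array names and coefficients are illustrative, not the NAS kernels): a forward sweep along x carries a first-order recurrence, so the i iterations are ordered, while j and k remain parallel.

      do k = 1, nz
         do j = 1, ny
            do i = 2, nx
               ! loop-carried dependence on i: a 1D recurrence swept along x
               a(i, j, k) = b(i, j, k) - c(i, j, k) * a(i - 1, j, k)
            enddo
         enddo
      enddo

A backward sweep runs the same recurrence with i descending; ADI solvers apply such forward and backward sweeps along each of the three dimensions.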

8 Partitioning Choices (Transpose)
Local sweeps along x and z
Transpose
Local sweep along y
Transpose back

9 Partitioning Choices (block + CGP)
Partial wavefront-type parallelism via coarse-grain pipelining (see the sketch below)
(figure: block partition of the domain across Processors 0–3)
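A hedged sketch of coarse-grain pipelining under a block partition (2D view; strip, jlo/jhi, and the other names are illustrative, and this is the generic pattern rather than dHPF's generated code): strip-mining the independent dimension trades fewer, larger messages against pipeline fill time.

      include 'mpif.h'
      integer status(MPI_STATUS_SIZE), ierr, ib, ie, i, j
      ! a, b, c, nx, strip, jlo, jhi, myrank, nprocs are set up elsewhere
      ! columns jlo:jhi are local; the recurrence runs along j
      do ib = 1, nx, strip
         ie = min(ib + strip - 1, nx)
         if (myrank .gt. 0) then
            ! wait for the upstream boundary segment a(ib:ie, jlo-1)
            call MPI_Recv(a(ib, jlo - 1), ie - ib + 1, MPI_DOUBLE_PRECISION, &
                          myrank - 1, 0, MPI_COMM_WORLD, status, ierr)
         endif
         do j = jlo, jhi
            do i = ib, ie
               a(i, j) = b(i, j) - c(i, j) * a(i, j - 1)
            enddo
         enddo
         if (myrank .lt. nprocs - 1) then
            ! forward our boundary so the downstream processor can start
            call MPI_Send(a(ib, jhi), ie - ib + 1, MPI_DOUBLE_PRECISION, &
                          myrank + 1, 0, MPI_COMM_WORLD, ierr)
         endif
      enddo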

10 Partitioning Choices (multipartitioning)
Full parallelism for sweeping along any partitioned dimension
(figure: diagonal multipartitioning of the domain among Processors 0–3)
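A hedged sketch of requesting multipartitioning in dHPF source (MULTI is the specifier named on the dHPF slide above, but the exact directive spelling here is our assumption, following HPF's DISTRIBUTE form):

      real u(nx, ny, nz)
!HPF$ DISTRIBUTE u(*, MULTI, MULTI)
      ! each processor owns one tile in every slab along y and along z
      ! (a diagonal layout), so a sweep along any partitioned dimension
      ! keeps all processors busy with no wavefront startup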

11 NAS SP & BT Benchmarks
Benchmarks from NASA Ames
– use ADI to solve the Navier-Stokes equations in 3D
– forward & backward line sweeps on each dimension, for each time step
SP solves scalar penta-diagonal systems
BT solves block-tridiagonal systems
SP has double the communication volume and frequency

12 Experimental Setup
2 versions from NASA, each written in Fortran 77
– parallel, hand-coded MPI version
– sequential version (3500 lines)
dHPF input: sequential version + HPF directives (including MULTI; 2% line-count increase)
Inlined several procedures manually
– enables dHPF to overlap local computation with communication without interprocedural tiling
Platform: SGI Origin 2000 (128 250-MHz processors), SGI's MPI implementation, SGI's compilers

13 Performance Comparison
Compare four versions of NAS SP & BT
Multipartitioned, hand-coded MPI version from NASA
– different executables for each number of processors
Multipartitioned dHPF-generated version
– single executable for all numbers of processors
Block-partitioned dHPF-generated version (with coarse-grain pipelining, using a 2D partition)
– single executable for all numbers of processors
Block-partitioned pghpf-compiled version from PGI's source code (using a full transpose with a 1D partition)
– single executable for all numbers of processors

14 Efficiency for NAS SP (102³, 'B' size)
(figure: efficiency curves; annotations: "> 2x multipartitioning comm. volume", "similar comm. volume, more serialization")

15 Efficiency for NAS BT (102³, 'B' size)
(figure: efficiency curves; annotation: "> 2x multipartitioning comm. volume")

16 Overview
Introduction
Line-Sweep Computations
Performance Comparison
 Optimization Evaluation
– Partially Replicated Computation
– Interprocedural Communication Elimination
– Communication Coalescing
– Direct Access Buffers
Conclusions

17 Evaluation Methodology
All versions are dHPF-generated using multipartitioning
Turn off a particular optimization ("n - 1" approach)
– determine the overhead without it (% over fully optimized)
Measure its contribution to overall performance
– total execution time
– total communication volume
– L2 data cache misses (where appropriate)
Class A (64³) and class B (102³) problem sizes on two different processor counts (16 & 64 processors)

18 Partially Replicated Computation
SHADOW a(2, 2)
ON_HOME a(i-2, j) ∪ ON_HOME a(i+2, j) ∪ ON_HOME a(i, j-2) ∪ ON_HOME a(i-1, j+1) ∪ ON_HOME a(i, j) → ON_EXT_HOME a(i, j)
Partial computation replication is used to reduce communication

19 Impact of Partial Replication
BT: eliminates comm. for the 5D arrays fjac and njac in lhs
Both: eliminates comm. for six 3D arrays in compute_rhs

20 Impact of Partial Replication (cont.)

21 Interprocedural Communication Reduction
Extensions to HPF/JA directives:
– REFLECT: placement of near-neighbor communication
– LOCAL: communication not needed for a scope
– extended ON HOME: partial computation replication
With these, the compiler doesn't need full interprocedural communication and availability analyses to determine whether data in overlap regions & comm. buffers is fresh

22 Interprocedural Communication Reduction (cont.)
SHADOW a(2, 1)
REFLECT (a(0:0, 1:0), a(1:0, 0:0))
(updates only the sections from the left neighbor and the top neighbor)
versus the full shadow update:
SHADOW a(2, 1)
REFLECT (a)
The combination of REFLECT, extended ON HOME and LOCAL reduces communication volume by ~13%, resulting in a ~9% reduction in execution time
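A hedged sketch of how LOCAL composes with REFLECT across a call (the subroutine and array names are illustrative, and the exact directive spelling is our assumption): after REFLECT refreshes the shadow region at the call site, LOCAL asserts that the callee needs no further communication for a, sparing the compiler a full interprocedural freshness analysis.

!HPF$ SHADOW a(2, 1)
!HPF$ REFLECT (a)
      call smooth(a, n)       ! shadow region of a is already fresh here
      ...
      subroutine smooth(a, n)
!HPF$ LOCAL (a)
      ! assertion: all accesses to a in this scope are satisfied locally
      ...
      end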

23 Normalizing Communication
do i = 1, n
   do j = 2, n - 2
      a(i, j) = a(i, j - 2)        ! ON_HOME a(i, j)
      a(i, j + 2) = a(i, j)        ! ON_HOME a(i, j + 2)
   enddo
enddo
Both statements need the same non-local data (figure: arrays on P0 and P1 with the references a(i, j - 2) and a(i, j) crossing the partition boundary)
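A worked sketch of the normalization (our notation, not dHPF output): substituting the home iteration j' = j + 2 into the second statement puts both references into the same form, which is what lets the compiler recognize that they fetch the same non-local elements.

      ! statement 1, home a(i, j):   reads a(i, j - 2)    for j  = 2 .. n - 2
      ! statement 2, home a(i, j'):  after j' = j + 2, it reads a(i, j' - 2)
      !                                                   for j' = 4 .. n
      ! both fetch elements two columns left of the home column, so the
      ! overlapping boundary section can be communicated in one message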

24 Coalescing Communication
(figure: two sections of array A combined into a single coalesced message)

25 Impact of Normalized Coalescing

26 Key optimization for scalability

27 Direct Access Buffers
Choices for receiving complex coalesced messages:
Unpack them into the shadow regions
– two simultaneous live copies in cache
– unpacking can be costly
– uniform access to non-local & local data
Reference them directly out of the receive buffer
– introduces two modes of access for data (non-local & interior)
– the overhead of a single loop with both modes is high
– loops should be split into non-local & interior portions, according to the data they reference (see the sketch below)
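A hedged sketch of the resulting loop splitting (the recv_buf layout and the bounds are illustrative assumptions, not dHPF's actual buffer format): boundary iterations read their non-local operand straight from the receive buffer, interior iterations stay purely local, and neither loop pays for the other's access mode.

      ! boundary portion: the j = jlo iterations read the a(i, jlo - 1)
      ! values directly out of recv_buf instead of an unpacked shadow region
      do i = 1, nx
         a(i, jlo) = a(i, jlo) + recv_buf(i)
      enddo
      ! interior portion: purely local accesses, no buffer indexing
      do j = jlo + 1, jhi
         do i = 1, nx
            a(i, j) = a(i, j) + a(i, j - 1)
         enddo
      enddo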

28 Impact of Direct Access Buffers
Use direct access buffers for the main swept arrays
Direct access buffers + loop splitting reduce L2 data cache misses by ~11%, resulting in a ~11% reduction in execution time

29 Conclusions
Compiler-generated code can match the performance of sophisticated hand-coded parallelizations
High performance comes from the aggregate benefit of multiple optimizations
Everything affects scalability: good parallel algorithms are only the starting point; excellent resource utilization on the target machine is also needed
Data-parallel compilers should target each potential source of inefficiency in the generated code if they want to deliver the performance scientific users demand

30 Efficiency for NAS SP (‘A’)

31 Efficiency for NAS BT (‘A’)

32 Data Partitioning

33 Data Partitioning (cont.)

34 Partially Replicated Computation
do i = 1, n
   do j = 2, n
      a(i, j) = u(i, j-1) + 1.0           ! ON_HOME a(i, j) ∪ ON_HOME a(i, j+1)
      b(i, j) = u(i, j-1) + a(i, j-1)     ! ON_HOME a(i, j)
   enddo
enddo
(figure: processors p and p + 1; local portions of a, u, and b with shadow regions; communication for u; replicated computation of a on the boundary)

35 Using HPF/JA for Comm. Elimination


37 Normalized Comm. Coalescing (cont.)
do timestep = 1, T
   ! coalesce communication at this point
   do j = 1, n
      do i = 3, n
         a(i, j) = a(i + 1, j) + b(i - 1, j)         ! ON_HOME a(i, j)
      enddo
   enddo
   do j = 1, n
      do i = 1, n - 2
         a(i + 2, j) = a(i + 3, j) + b(i + 1, j)     ! ON_HOME a(i + 2, j)
      enddo
   enddo
   do j = 1, n
      do i = 1, n - 1
         a(i + 1, j) = a(i + 2, j) + b(i + 1, j)     ! ON_HOME b(i + 1, j)
      enddo
   enddo
enddo

38 Impact of Direct Access Buffers


40 Direct Access Buffers
(figure: Processor 0 and Processor 1; pack, send, receive & unpack into shadow regions)

41 Direct Access Buffers
(figure: Processor 0 and Processor 1; pack, send & receive, then use data directly from the receive buffer)

