An Evaluation of Data-Parallel Compiler Support for Line-Sweep Applications Daniel Chavarría-Miranda John Mellor-Crummey Dept. of Computer Science Rice University

High-Performance Fortran (HPF)
Industry-standard data parallel language
– partitioning of data drives partitioning of computation, …
Compilation: a sequential Fortran program + data partitioning (the HPF program) is compiled for a parallel machine; the compiler must
– partition the computation
– insert communication / synchronization
– manage storage
while producing the same answers as the sequential Fortran program
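As a minimal illustration (the array sizes, names, and processor arrangement below are illustrative, not taken from the benchmarks), an HPF program is the sequential Fortran code plus data-distribution directives; the compiler derives the computation partitioning and the communication from the distribution:

program hpf_sketch
  real a(1024, 1024), b(1024, 1024)
  integer i, j
!HPF$ PROCESSORS p(4)
!HPF$ DISTRIBUTE a(BLOCK, *) ONTO p
!HPF$ ALIGN b(i, j) WITH a(i, j)
  a = 0.0
  b = 1.0
  do j = 1, 1024
    do i = 2, 1024
      ! the compiler assigns each iteration to the owner of a(i, j) and
      ! inserts communication for the non-local b(i - 1, j) at block boundaries
      a(i, j) = a(i, j) + b(i - 1, j)
    enddo
  enddo
end program hpf_sketch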

Motivation
Obtaining high performance from applications written using high-level parallel languages has been elusive
Tightly-coupled applications are particularly hard
Data dependences serialize computation
– this induces tradeoffs between parallelism, communication granularity, and frequency
– traditional HPF partitionings limit scalability and performance
Communication might be needed inside loops

Contributions
A set of compilation techniques that enable us to match hand-coded performance for tightly-coupled applications
An analysis of their performance impact

dHPF Compiler
Based on an abstract equational framework
– manipulates sets of processors, array elements, iterations, and pairwise mappings between these sets
– optimizations and code generation are implemented as operations on these sets and mappings
Sophisticated computation partitioning model
– enables partial replication of computation to reduce communication
Support for the multipartitioning distribution
– MULTI distribution specifier
– suited for line-sweep computations
Innovative optimizations
– reduce communication
– improve locality
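For illustration only, a multipartitioned 3D array in the dHPF input might be declared as below; the MULTI keyword comes from this slide, but the exact directive form and the choice of distributed dimensions are assumptions, not verbatim dHPF syntax:

real u(102, 102, 102)
! assumed directive form (only the MULTI keyword is from this slide): MULTI
! requests a multipartitioned, diagonal tile layout so that a sweep along any
! distributed dimension keeps all processors busy
!HPF$ DISTRIBUTE u(*, MULTI, MULTI)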

Overview
Introduction
→ Line-Sweep Computations
Performance Comparison
Optimization Evaluation
– Partially Replicated Computation
– Interprocedural Communication Elimination
– Communication Coalescing
– Direct Access Buffers
Conclusions

Line-Sweep Computations
1D recurrences on a multidimensional domain
Recurrences order the computation along each dimension
Compiler-based parallelization is hard: loop-carried dependences, fine-grained parallelism
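A sketch of the dependence pattern (the loop below is illustrative, not code from SP or BT): a forward sweep along the first dimension carries a recurrence on i, so only the j and k loops offer parallelism.

subroutine forward_sweep_x(u, c, n)
  integer n, i, j, k
  real u(n, n, n), c(n, n, n)
  do k = 1, n
    do j = 1, n
      do i = 2, n
        ! loop-carried dependence: u(i, j, k) needs the freshly updated
        ! u(i - 1, j, k), which serializes the sweep direction
        u(i, j, k) = u(i, j, k) - c(i, j, k) * u(i - 1, j, k)
      enddo
    enddo
  enddo
end subroutine forward_sweep_x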

Partitioning Choices (Transpose)
Local sweeps along x and z, transpose, local sweep along y, transpose back

Partitioning Choices (block + CGP)
Partial wavefront-type parallelism [figure: sweep over a block partition across processors 0–3]

Partitioning Choices (multipartitioning)
Full parallelism for sweeping along any partitioned dimension [figure: diagonal multipartitioning tiles across processors 0–3]

NAS SP & BT Benchmarks
Benchmarks from NASA Ames
– use ADI to solve the Navier-Stokes equations in 3D
– forward & backward line sweeps along each dimension, for each time step
SP solves scalar penta-diagonal systems
BT solves block-tridiagonal systems
SP has double the communication volume and frequency of BT

Experimental Setup
2 versions from NASA, each written in Fortran 77
– parallel MPI hand-coded version
– sequential version (3500 lines)
dHPF input: the sequential version + HPF directives (including MULTI; 2% line-count increase)
Inlined several procedures manually
– enables dHPF to overlap local computation with communication without interprocedural tiling
Platform: SGI Origin 2000 ( MHz procs.), SGI's MPI implementation, SGI's compilers

Performance Comparison
Compare four versions of NAS SP & BT
Multipartitioned MPI hand-coded version from NASA
– different executables for each number of processors
Multipartitioned dHPF-generated version
– single executable for all numbers of processors
Block-partitioned dHPF-generated version (with coarse-grain pipelining, using a 2D partition)
– single executable for all numbers of processors
Block-partitioned pghpf-compiled version from PGI's source code (using a full transpose with a 1D partition)
– single executable for all numbers of processors

Efficiency for NAS SP (102³, class 'B' size) [chart annotations: "> 2x multipartitioning comm. volume"; "similar comm. volume, more serialization"]

Efficiency for NAS BT (102³, class 'B' size) [chart annotation: "> 2x multipartitioning comm. volume"]

Overview
Introduction
Line-Sweep Computations
Performance Comparison
→ Optimization Evaluation
– Partially Replicated Computation
– Interprocedural Communication Elimination
– Communication Coalescing
– Direct Access Buffers
Conclusions

Evaluation Methodology
All versions are dHPF-generated using multipartitioning
Turn off a particular optimization ("n - 1" approach)
– determine the overhead without it (% over fully optimized)
Measure its contribution to overall performance
– total execution time
– total communication volume
– L2 data cache misses (where appropriate)
Class A (64³) and class B (102³) problem sizes on two processor counts (16 & 64 processors)

Partially Replicated Computation
Partial computation replication is used to reduce communication
[figure: a statement whose computation partitioning is the union ON_HOME a(i-2, j) ∪ ON_HOME a(i+2, j) ∪ ON_HOME a(i, j-2) ∪ ON_HOME a(i-1, j+1) ∪ ON_HOME a(i, j); also shown: ON_EXT_HOME a(i, j) and SHADOW a(2, 2)]
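To make the annotation style concrete, the loop below is the example from the backup slide "Partially Replicated Computation" near the end of this deck, with explanatory comments added (the interpretation in the comments assumes a distribution that splits the j dimension across processors):

do i = 1, n
  do j = 2, n
    ! replicated: ON_HOME a(i,j) ∪ ON_HOME a(i,j+1), so the neighboring
    ! processor also computes the a value just outside its own block (into its
    ! shadow region) from its shadow copy of u(i,j-1)
    a(i,j) = u(i,j-1)
    ! not replicated: ON_HOME a(i,j); a(i,j-1) is now always available locally,
    ! so no communication is generated for a
    b(i,j) = u(i,j-1) + a(i,j-1)
  enddo
enddo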

Impact of Partial Replication
BT: eliminates communication for the 5D arrays fjac and njac in lhs
Both: eliminates communication for six 3D arrays in compute_rhs

Impact of Partial Replication (cont.)

Interprocedural Communication Reduction
Extensions to the HPF/JA directives:
– REFLECT: placement of near-neighbor communication
– LOCAL: communication not needed within a scope
– extended ON HOME: partial computation replication
With these, the compiler doesn't need full interprocedural communication and availability analyses to determine whether the data in overlap regions & communication buffers is fresh

Interprocedural Communication Reduction (cont.)
[figure: with SHADOW a(2, 1), the directive REFLECT (a(0:0, 1:0), a(1:0, 0:0)) refreshes only the shadow sections filled from the left and top neighbors, whereas REFLECT (a) refreshes the entire shadow region of a]
The combination of REFLECT, extended ON HOME, and LOCAL reduces communication volume by ~13%, resulting in a ~9% reduction in execution time

Normalizing Communication
Both statements need the same non-local data [figure: boundary elements of a exchanged between processors P0 and P1]
do i = 1, n
  do j = 2, n - 2
    a(i, j) = a(i, j - 2)     ! ON_HOME a(i, j)
    a(i, j + 2) = a(i, j)     ! ON_HOME a(i, j + 2)
  enddo
enddo

Coalescing Communication
[figure: the messages carrying elements of A for the two normalized references are combined into a single coalesced message]

Impact of Normalized Coalescing

Key optimization for scalability

Direct Access Buffers
Choices for receiving complex coalesced messages:
Unpack them into the shadow regions
– two simultaneous live copies in cache
– unpacking can be costly
– uniform access to non-local & local data
Reference them directly out of the receive buffer
– introduces two modes of access for data (non-local & interior)
– the overhead of having a single loop with these two modes is high
– loops should be split into non-local & interior portions, according to the data they reference (see the sketch below)
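A minimal sketch of that loop splitting (all names here, including recv_buf, are hypothetical; this is not the dHPF-generated code): the boundary portion of the sweep reads non-local values straight out of the receive buffer, while the interior portion touches only local data.

subroutine sweep_with_direct_buffer(a, recv_buf, c, n)
  integer n, i, j
  real a(n, n), recv_buf(n), c
  ! boundary portion: the first swept column reads its non-local neighbors
  ! directly out of the receive buffer, with no unpacking step
  do i = 1, n
    a(i, 1) = a(i, 1) + c * recv_buf(i)
  enddo
  ! interior portion: uniform, purely local accesses
  do j = 2, n
    do i = 1, n
      a(i, j) = a(i, j) + c * a(i, j - 1)
    enddo
  enddo
end subroutine sweep_with_direct_buffer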

Impact of Direct Access Buffers
Use direct access buffers for the main swept arrays
Direct access buffers + loop splitting reduce L2 data cache misses by ~11%, resulting in a ~11% reduction in execution time

Conclusions
Compiler-generated code can match the performance of sophisticated hand-coded parallelizations
High performance comes from the aggregate benefit of multiple optimizations
Everything affects scalability: good parallel algorithms are only the starting point; excellent resource utilization on the target machine is also needed
Data-parallel compilers must target each potential source of inefficiency in the generated code if they are to deliver the performance scientific users demand

Efficiency for NAS SP (‘A’)

Efficiency for NAS BT (‘A’)

Data Partitioning

Data Partitioning (cont.)

Partially Replicated Computation
do i = 1, n
  do j = 2, n
    a(i,j) = u(i,j-1)              ! ON_HOME a(i,j) ∪ ON_HOME a(i,j+1)
    b(i,j) = u(i,j-1) + a(i,j-1)   ! ON_HOME a(i,j)
  enddo
enddo
[figure: processors p and p + 1, showing the local portions of A and U plus shadow regions, the communication for u, the replicated computation, and the local portions of U/B plus shadow regions]

Using HPF/JA for Comm. Elimination

Normalized Comm. Coalescing (cont.)
do timestep = 1, T
  do j = 1, n
    do i = 3, n
      a(i, j) = a(i + 1, j) + b(i - 1, j)        ! ON_HOME a(i, j)
    enddo
  enddo
  do j = 1, n
    do i = 1, n - 2
      a(i + 2, j) = a(i + 3, j) + b(i + 1, j)    ! ON_HOME a(i + 2, j)
    enddo
  enddo
  do j = 1, n
    do i = 1, n - 1
      a(i + 1, j) = a(i + 2, j) + b(i + 1, j)    ! ON_HOME b(i + 1, j)
    enddo
  enddo
enddo
[slide annotation pointing into the timestep loop: "Coalesce communication at this point"]

Impact of Direct Access Buffers

Direct Access Buffers
[figure: Processor 0 and Processor 1 exchange data via pack, send, receive & unpack]

Direct Access Buffers
[figure: Processor 0 and Processor 1 exchange data via pack, send & receive; the data is then used directly from the receive buffer]