High Performance Fortran (HPF) Source: Chapter 7 of "Designing and Building Parallel Programs" (Ian Foster, 1995)

Question
Can't we just have a clever compiler generate a parallel program from a sequential program?
Fine-grained parallelism:
  x = a*b + c*d
Trivial parallelism:
  for i := 1 to 100 do
    for j := 1 to 100 do
      C[i,j] := dotproduct(A[i,*], B[*,j]);
    od
  od

Automatic parallelism
Automatic parallelization of any program is extremely hard
Solutions:
- Make restrictions on the source program
- Restrict the kind of parallelism used
- Use a semi-automatic approach
- Use application-domain oriented languages

High Performance Fortran (HPF)
- Designed by a forum from industry, government, and universities
- Extends Fortran 90
- Intended for computationally expensive numerical applications
- Portable to SIMD machines, vector processors, shared-memory MIMD, and distributed-memory MIMD

Fortran 90 - Base language of HPF
Extends Fortran 77 with 'modern' features:
- abstract data types, modules
- recursion
- pointers, dynamic storage
Array operators:
  A = B + C
  A = A
  A(1:7) = B(1:7) + B(2:8)
  WHERE (X /= 0) X = 1.0/X

Data parallelism
Data parallelism: same operation applied to different data elements in parallel
Data parallel program: sequence of data parallel operations
Overall approach:
- Programmer does domain decomposition
- Compiler partitions operations automatically
Data may be regular (array) or irregular (tree, sparse matrix)
Most data parallel languages only deal with arrays

Data parallelism - Concurrency
Explicit parallel operations:
  A = B + C        ! A, B, and C are arrays
Implicit parallelism:
  do i = 1,m
    do j = 1,n
      A(i,j) = B(i,j) + C(i,j)
    enddo
  enddo
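HPF also provides the FORALL statement (later adopted into Fortran 95) as an explicitly data-parallel form of the same loop nest; a minimal sketch:
  FORALL (i = 1:m, j = 1:n)
    A(i,j) = B(i,j) + C(i,j)    ! all iterations may execute concurrently
  END FORALL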

Compiling data parallel programs
Programs are translated automatically into parallel SPMD (Single Program Multiple Data) programs
Each processor executes the same program on a subset of the data
Owner computes rule:
- Each processor owns a subset of the data structures
- Operations required for an element are executed by the owner
- Each processor may read (but not modify) other elements
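As an illustration only (not actual compiler output), the owner computes rule turns a global array assignment into guarded local code roughly like the following; my_lb, my_ub and exchange_ghost_cells are hypothetical names for this processor's owned index range and a communication routine:
  ! sketch of SPMD code a compiler might generate for  A(2:99) = B(1:98) + B(3:100)
  call exchange_ghost_cells(B)             ! hypothetical: fetch neighbouring B elements
  do i = max(2, my_lb), min(99, my_ub)     ! loop only over locally owned elements of A
    A(i) = B(i-1) + B(i+1)
  enddo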

Example
  real s, X(100), Y(100)        ! s is scalar, X and Y are arrays
  X = X * 3.0                   ! Multiply each X(i) by 3.0
  do i = 2,99
    Y(i) = (X(i-1) + X(i+1))/2  ! Communication required
  enddo
  s = SUM(X)                    ! Communication required
Arrays X and Y are distributed (partitioned); scalar s is replicated on each machine
[Figure: distribution of arrays X and Y over the processors]

HPF primitives for data distribution
Directives:
- PROCESSORS: shape & size of abstract processors
- ALIGN: align elements of different arrays
- DISTRIBUTE: distribute (partition) an array
Directives affect the performance of the program, not its result
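As a hedged sketch (array and processor names chosen for illustration), the three directives are typically used together like this:
  real A(1000), B(1000)
!HPF$ PROCESSORS P(8)                ! 8 abstract processors
!HPF$ ALIGN B(:) WITH A(:)           ! keep B(i) on the same processor as A(i)
!HPF$ DISTRIBUTE A(BLOCK) ONTO P     ! A (and, via the alignment, B) split into 8 blocks
  A = A + B                          ! element-wise and local: no communication needed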

Processors directive
!HPF$ PROCESSORS P(32)
!HPF$ PROCESSORS Q(4,8)
Mapping of abstract to physical processors is not specified in HPF (implementation-dependent)
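HPF also defines the NUMBER_OF_PROCESSORS() system inquiry intrinsic, so an abstract processor arrangement can be sized to match the machine the program runs on; a small sketch:
!HPF$ PROCESSORS R(NUMBER_OF_PROCESSORS())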

Alignment directive
Aligns an array with another array
Specifies that specific elements should be mapped to the same processor
  real A(50), B(50), C(50,50)
!HPF$ ALIGN A(I) WITH B(I)
!HPF$ ALIGN A(I) WITH B(I+2)
!HPF$ ALIGN A(:) WITH C(*,:)
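ALIGN can also permute dimensions; a small sketch with illustrative arrays D and E:
  real D(10,20), E(20,10)
!HPF$ ALIGN D(i,j) WITH E(j,i)       ! D is aligned with the transpose of E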

Figure 7.6 from Foster's book

Distribution directive
Specifies how elements should be partitioned among the local memories
Each dimension can be distributed as follows:
- *         : no distribution
- BLOCK(n)  : block distribution
- CYCLIC(n) : cyclic distribution
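A sketch of these options applied to two-dimensional arrays (array names chosen for illustration):
  real D(100,100), E(100,100), F(100,100)
!HPF$ PROCESSORS P(4)
!HPF$ DISTRIBUTE D(BLOCK, *) ONTO P     ! contiguous groups of 25 rows per processor
!HPF$ DISTRIBUTE E(CYCLIC, *) ONTO P    ! rows dealt out round-robin over the processors
!HPF$ DISTRIBUTE F(*, BLOCK) ONTO P     ! contiguous groups of 25 columns per processor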

Figure 7.7 from Foster's book

Example: Successive Over-Relaxation (SOR)
Recall the algorithm discussed in the Introduction:
  float G[1:N, 1:M], Gnew[1:N, 1:M];
  for (step = 0; step < NSTEPS; step++)
    for (i = 2; i < N; i++)        /* update grid */
      for (j = 2; j < M; j++)
        Gnew[i,j] = f(G[i,j], G[i-1,j], G[i+1,j], G[i,j-1], G[i,j+1]);
    G = Gnew;

Parallel SOR with message passing
  float G[lb-1:ub+1, 1:M], Gnew[lb-1:ub+1, 1:M];
  for (step = 0; step < NSTEPS; step++)
    SEND(cpuid-1, G[lb]);          /* send 1st row left */
    SEND(cpuid+1, G[ub]);          /* send last row right */
    RECEIVE(cpuid-1, G[lb-1]);     /* receive from left */
    RECEIVE(cpuid+1, G[ub+1]);     /* receive from right */
    for (i = lb; i <= ub; i++)     /* update my rows */
      for (j = 2; j < M; j++)
        Gnew[i,j] = f(G[i,j], G[i-1,j], G[i+1,j], G[i,j-1], G[i,j+1]);
    G = Gnew;

Finite differencing (~ SOR) in HPF
See Ian Foster, Program 7.2; uses a convergence criterion instead of a fixed number of steps
  program hpf_finite_difference
!HPF$ PROCESSORS pr(4)                    ! use 4 CPUs
  real X(100, 100), New(100, 100)         ! data arrays
!HPF$ ALIGN New(:,:) WITH X(:,:)
!HPF$ DISTRIBUTE X(BLOCK,*) ONTO pr       ! row-wise distribution
  New(2:99, 2:99) = (X(1:98, 2:99) + X(3:100, 2:99) + X(2:99, 1:98) + X(2:99, 3:100))/4
  diffmax = MAXVAL (ABS (New-X))
  end
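The single update above corresponds to one iteration; a hedged sketch of the convergence loop that Program 7.2 wraps around it (the tolerance value is illustrative):
  diffmax = 1.0
  do while (diffmax > 0.001)              ! tolerance chosen for illustration
    New(2:99, 2:99) = (X(1:98, 2:99) + X(3:100, 2:99) + X(2:99, 1:98) + X(2:99, 3:100))/4
    diffmax = MAXVAL(ABS(New - X))
    X(2:99, 2:99) = New(2:99, 2:99)       ! copy back for the next iteration
  enddo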

Changing the distribution
Use a block distribution instead of a row distribution
  program hpf_finite_difference
!HPF$ PROCESSORS pr(2,2)                  ! use a 2x2 processor grid
  real X(100, 100), New(100, 100)         ! data arrays
!HPF$ ALIGN New(:,:) WITH X(:,:)
!HPF$ DISTRIBUTE X(BLOCK, BLOCK) ONTO pr  ! block-wise distribution
  New(2:99, 2:99) = (X(1:98, 2:99) + X(3:100, 2:99) + X(2:99, 1:98) + X(2:99, 3:100))/4
  diffmax = MAXVAL (ABS (New-X))
  end

Performance
Distribution affects:
- load balance
- amount of communication
Example (communication costs):
!HPF$ PROCESSORS pr(3)
  integer A(8), B(8), C(8)
!HPF$ ALIGN B(:) WITH A(:)
!HPF$ DISTRIBUTE A(BLOCK) ONTO pr
!HPF$ DISTRIBUTE C(CYCLIC) ONTO pr
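With these distributions, assignments like the following (chosen for illustration, not taken from the original slide) differ sharply in communication cost:
  A = B      ! no communication: each B(i) is aligned with A(i)
  A = C      ! communication: C's CYCLIC layout does not match A's BLOCK layout,
             ! so most elements must move between processors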

Figure 7.9 from Foster's book

Conclusions
- High-level model
- User specifies the data distribution
- Compiler generates the parallel program + communication
- More restrictive than the general message passing model (only data parallelism)
- Restricted to array-based data structures
- HPF programs will be easy to modify, which enhances portability
- Changing the data distribution only requires changing directives