Data Parallel SPMD Programming Environments: Fortran to Java

1 Data Parallel SPMD Programming Environments: Fortran to Java
Han-Ku Lee Department of Computer Science Florida State University

2 Outline
Background – historical review of data-parallel languages, message-passing frameworks, and high-level libraries for distributed arrays
The HPspmd programming language model – HPJava
Compilation strategies for HPJava
Related systems
Conclusions

3 Acknowledgements This work was supported in part by the National Science Foundation (NSF), Division of Advanced Computational Infrastructure and Research, contract number –

4 Research Objectives
Data-parallel programming and languages have played a major role in high-performance computing
HPF – difficult to compile
Library-based, lower-level SPMD programming – successful
The HPspmd programming language model – a flexible hybrid of an HPF-like data-parallel language and the popular, library-oriented SPMD style
The base language for the HPspmd model should have clean and simple object semantics, cross-platform portability, security, and popularity – Java
Goal: to equip Java for data-parallel SPMD programming

5 Data Parallel Languages
Large data structures, typically arrays, are split across nodes
Each node performs similar computations on a different part of the data structure
SIMD – Illiac IV and ICL DAP introduced a new concept: distributed arrays
MIMD – asynchronous, flexible, hard to program
SPMD – loosely synchronous model (SIMD + MIMD); each node runs its own local copy of the program

6 HPF (High Performance Fortran)
By the early 90s, the value of portable, standardized languages was universally acknowledged. The goal of the HPF Forum was a single language for high-performance programming, effective across architectures – vector, SIMD, MIMD – though SPMD was the focus.
HPF – an extension of Fortran 90 to support the data-parallel programming model on distributed-memory parallel computers
Supported by Cray, DEC, Fujitsu, HP, IBM, Intel, MasPar, Meiko, nCube, Sun, and Thinking Machines

7 HPF
[Figure: ideal data distribution of a memory area across processors]
Multi-processing and data distribution raise two concerns – communication and load balance
Introduced processor arrangements, templates, and data alignment

8 Message-passing for HPC
Processes explicitly communicate through messages on parallel machines with distributed memory
Early message-passing frameworks – p4, PARMACS, PVM, and Express
The Message Passing Interface Forum established a standard API for message-passing library routines – MPI
Portability and scalability

9 High Level Libraries for Distributed Arrays
Distributed array – a collective object shared by a number of processes
PARTI, the Global Array (GA) Toolkit
Adlib – a high-level runtime library designed to support translation of data-parallel languages
Implemented in 1994 in the shpf project at Southampton University and much improved during the Parallel Compiler Runtime Consortium (PCRC) project at Syracuse University
Initially designed for HPF
Currently used in the HPJava project at Florida State University and Indiana University

10 Adlib Built-in model of distributed arrays and sections.
Equivalent to HPF 1.0 model, plus ghost extensions and general block distribution from HPF 2.0 Collective communication library. Direct support for array section assignments, ghost region updates, F90 array intrinsics, general gather/scatter. Implemented on top of MPI. Adlib kernel implemented in C++. Object-based distributed array descriptor (DAD) Interfaces – shpf Fortran interface, PCRC Fortran interface, ad++ interface, and HPJava interface

11 Features of HPJava A language for parallel programming, especially suitable for massively parallel, distributed-memory computers. Takes various ideas from High Performance Fortran. HPJava has a distributed array model very similar to the HPF model, with an almost identical set of distribution and alignment options. In other respects, HPJava is a lower-level parallel programming language than HPF. The programming model is explicit SPMD, needing explicit calls to communication libraries such as MPI. The HPJava system is built on Java technology; the HPJava programming language is an extension of the Java programming language.

12 Benefits of HPspmd Model
Translators are much easier to implement than HPF compilers. No compiler magic needed Attractive framework for library development, avoiding inconsistent parameterizations of distributed array arguments Better prospects for handling irregular problems – easier to fall back on specialized libraries as required Can directly call MPI functions from within an HPspmd program

13 HPspmd Architecture

14 Multidimensional Arrays
Java is an attractive language, but needs improvement for large computational tasks
Java provides an array of arrays => a disadvantage:
  time spent on out-of-bounds checking
  the ability to alias rows of an array
  the cost of accessing an element
HPJava introduces true multidimensional arrays and regular array sections
For example:
  int [[*,*]] a = new int [[5, 5]] ;
  for (int i = 0; i < 4; i++)
    a [i, i + 1] = 19 ;
  foo ( a [[:, 0]] ) ;
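The array-of-arrays drawbacks can be made concrete in plain standard Java (this is a minimal sketch, not HPJava; class and method names are illustrative). Because each row is an independent object, rows can be aliased or even resized, so the runtime cannot assume a rectangular layout and must bounds-check each dimension separately:

```java
// Plain-Java sketch of why an "array of arrays" is weaker than a true
// multidimensional array: rows are independent, aliasable objects.
public class JaggedArrayDemo {
    // two logical rows of the same matrix can be one and the same object
    public static int aliasedValue() {
        int[][] a = new int[5][5];
        a[2] = a[0];          // row 2 now aliases row 0
        a[0][3] = 19;
        return a[2][3];       // observed through the alias: 19
    }
    // rows may even have different lengths ("ragged" arrays)
    public static int raggedRowLength() {
        int[][] a = new int[5][5];
        a[4] = new int[2];    // replace one row with a shorter one
        return a[4].length;   // 2, not 5
    }
    public static void main(String[] args) {
        System.out.println(aliasedValue());     // 19
        System.out.println(raggedRowLength());  // 2
    }
}
```

A true multidimensional array, as in HPJava, rules out both situations by construction, which is what enables the compiler optimizations mentioned above.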

15 Processes
  Procs p = new Procs(2, 3) ;
  on (p) {
    Range x = new BlockRange(N, p.dim(0)) ;
    Range y = new BlockRange(N, p.dim(1)) ;
    float [[-,-]] a = new float [[x, y]] ;
    float [[-,-]] b = new float [[x, y]] ;
    float [[-,-]] c = new float [[x, y]] ;
    … initialize ‘a’, ‘b’
    overall (i = x for :)
      overall (j = y for :)
        c [i, j] = a [i, j] + b [i, j] ;
  }
An HPJava program is started concurrently on all members of some process collection – process groups
The on construct limits control to the active process group (APG), p
The class BlockRange is a subclass of Range, representing an index range block-distributed over the process dimension passed to its constructor
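The index arithmetic behind a BlockRange-style distribution can be sketched in plain Java (this is not the actual Adlib/HPJava API; the method names are hypothetical). For a global extent N divided over a process dimension of size P, process p owns the contiguous block of global indices [p*b, min((p+1)*b, N)) with b = ceil(N/P):

```java
// Plain-Java sketch of block-distribution index arithmetic.
public class BlockRangeDemo {
    // first global index owned by process p
    public static int lo(int N, int P, int p) {
        int b = (N + P - 1) / P;          // block size, rounded up
        return Math.min(p * b, N);
    }
    // one past the last global index owned by process p
    public static int hi(int N, int P, int p) {
        int b = (N + P - 1) / P;
        return Math.min((p + 1) * b, N);
    }
    public static void main(String[] args) {
        // e.g. N = 8 elements over a process dimension of size 3
        for (int p = 0; p < 3; p++)
            System.out.println("process " + p + " owns ["
                + lo(8, 3, p) + ", " + hi(8, 3, p) + ")");
    }
}
```

Note the last process may hold a shorter block when P does not divide N evenly.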

16 Distributed arrays The most important feature of HPJava
A collective object shared by a number of processes
The elements of a distributed array are distributed among those processes
A true multidimensional array
Can form regular sections
When N = 8 in the previous example code, the distributed array ‘a’ is distributed blockwise over the 2 × 3 process grid: each process holds a 4 × 3 or 4 × 2 local block

17 Overall construct overall (i = x for l : u : s) { … }
A distributed parallel loop
i – a distributed index whose value is a Location, a particular element of a particular distributed range
The index triplet represents a lower bound, an upper bound, and a step – all integer expressions
The step is optional – the default is 1
The lower bound may be omitted – the default is 0
The upper bound may be omitted – the default is N-1
An HPJava range object => a collection of locations
With a few exceptions, the subscript of a distributed array must be a distributed index, and the location should be an element of the range associated with the array dimension
This restriction is an important feature, ensuring that referenced array elements are locally held
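The triplet defaults described above can be checked with a small plain-Java sketch (illustrative names, not the HPJava runtime): a triplet l : u : s with inclusive bounds visits (u - l) / s + 1 indices.

```java
// Plain-Java sketch of how the overall triplet l : u : s expands.
public class TripletDemo {
    // number of global indices visited by "l : u : s" (inclusive bounds)
    public static int count(int l, int u, int s) {
        return u >= l ? (u - l) / s + 1 : 0;
    }
    public static void main(String[] args) {
        int N = 8;
        System.out.println(count(0, N - 1, 1)); // "for :"     -> all 8 indices
        System.out.println(count(2, N - 1, 1)); // "for 2 :"   -> 6 indices
        System.out.println(count(0, N - 1, 2)); // "for : : 2" -> 4 indices
    }
}
```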

18 At construct at (i = x [4]) { … }
HPJava provides the at construct for when we want to update or access a single element of a distributed array, rather than accessing a whole set of elements in parallel
When we want to update a [1, 4]:
  float [[-,-]] a = new float [[x, y]] ;
  // a [1, 4] = 19 ;  <-- not allowed: 1 and 4 are not distributed indices,
  //                      therefore not legal subscripts
  at (i = x [1])
    at (j = y [4])
      a [i, j] = 19 ;
The operational semantics of the at construct are similar to those of the on construct
i` – the backquote symbol is used as a postfix operator on a distributed index, yielding its integer global index

19 Distribution format
[Class hierarchy: Range, with subclasses BlockRange, CyclicRange, ExtBlockRange, IrregRange, CollapsedRange, Dimension]
HPJava provides further distribution formats for dimensions of distributed arrays without further extensions to the syntax
Instead, the Range class hierarchy is extended: BlockRange, CyclicRange, IrregRange, Dimension
ExtBlockRange – a BlockRange distribution extended with ghost regions
CollapsedRange – a range that is not distributed, i.e. all elements of the range are mapped to a single process

20 Ghost regions Ghost region – extra space “around the edges” of the locally held block of distributed array elements These extra space can cache some of the element values properly belonging to adjacent processors With ghost regions, the inner loop of algorithms for stencil updates can be written in a simple way, since the edges of the block don’t need special treatment in accessing neighboring elements Shifted indices can locate the proper values cached in the ghost region e.g … a [i, j+1] …

21 Array Sections HPJava supports subarrays modeled on the array sections of Fortran 90
Whereas an element reference is a variable, an array section is an expression that represents a new distributed array object
The new array section is a subset of the elements of the parent array
Triplet subscripts
The rank of an array section equals the number of triplet subscripts, e.g.
  float [[-,-]] a = new float [[x, y]] ;
  float [[-]] b = a [[0, :]] ;
Subrange – the range of an array section, e.g.
  Range u = x [0 : N-1 : 2] ;
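The descriptor view behind such sections can be sketched in plain Java (field and method names here are hypothetical, loosely following the base/stride descriptor idea used later in the translation scheme): a section l : u : s shares the parent's data, and only the base offset, extent, and stride of the descriptor change.

```java
// Plain-Java sketch of a 1-D array-section descriptor: no data is copied.
public class SectionDemo {
    public final double[] dat;              // shared backing data
    public final int bas, count, str;       // base offset, extent, stride
    public SectionDemo(double[] dat, int bas, int count, int str) {
        this.dat = dat; this.bas = bas; this.count = count; this.str = str;
    }
    public double get(int i) { return dat[bas + i * str]; }
    // section l : u : s of this array -- same backing data, new descriptor
    public SectionDemo section(int l, int u, int s) {
        return new SectionDemo(dat, bas + l * str, (u - l) / s + 1, str * s);
    }
    public static void main(String[] args) {
        SectionDemo a = new SectionDemo(
            new double[]{0, 1, 2, 3, 4, 5, 6, 7}, 0, 8, 1);
        SectionDemo b = a.section(0, 7, 2);  // like x [0 : N-1 : 2]
        System.out.println(b.count + " elements; b.get(3) = " + b.get(3));
    }
}
```

Because the section is only a new descriptor over old data, assigning through the section updates the parent array too.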

22 Distributed Array Type
The type signature of a distributed array: T [[attr0, …, attrR-1]] bras
where R is the rank of the array, each term attr_r is either a single hyphen, -, or a single asterisk, *, and the term bras is a string of zero or more bracket pairs, []
T can be any Java type other than an array type. This signature represents the type of a distributed array whose elements have Java type T bras
A distributed array type is not treated as a class type
This means a distributed array cannot be an element of an ordinary Java array, nor can a distributed array reference be stored in a standard library class like Vector, which expects an Object
If we said “distributed arrays have a class”, it would commit us to either extending the definition of class in the Java language, or creating genuine Java classes for each type of HPJava array that might be needed – impractical

23 HPspmd classes and APG The HPJava translator tries to distinguish HPJava code from Java code
It introduces a special interface, hpjava.lang.HPspmd, which must be implemented by any class that uses the special syntax
An HPspmd class is a class that implements the hpjava.lang.HPspmd interface. Any other class is a non-HPspmd class
Many of the special operations in HPJava rely on the active process group – the APG
The APG changes during the course of the program as distributed control constructs limit control to different subsets of the processors
In the current HPJava translator, the value of the APG is passed as a hidden argument to methods and constructors of HPspmd classes (like the “this” reference)

24 Basic Translation Scheme
The HPJava system is not exactly a high-level parallel programming language – it is more like a tool to assist programmers in generating SPMD parallel code
This suggests the transformations the system applies should be relatively simple and well-documented, so programmers can exploit the tool more effectively
We don’t expect the generated code to be human-readable or modifiable, but at least the programmer should be able to work out what is going on
The HPJava specification defines the basic translation scheme as a series of schemas

25 Translation of a distributed array declaration
SOURCE:
  T [[attr0, …, attrR-1]] a ;
TRANSLATION:
  T [] a'dat ;
  ArrayBase a'bas ;
  DIMENSION_TYPE (attr0) a'0 ;
  …
  DIMENSION_TYPE (attrR-1) a'R-1 ;
where DIMENSION_TYPE (attr_r) ≡ ArrayDim if attr_r is a hyphen, or DIMENSION_TYPE (attr_r) ≡ SeqArrayDim if attr_r is an asterisk
e.g. float [[-,*]] var ; translates to
  float [] var__$DS ;
  ArrayBase var__$bas ;
  ArrayDim var__$0 ;
  SeqArrayDim var__$1 ;

26 Translation of the overall construct
SOURCE:
  overall (i = x for e_lo : e_hi : e_stp) S
TRANSLATION:
  Block b = x.localBlock(T[e_lo], T[e_hi], T[e_stp]) ;
  Group p = ((Group) apg.clone()).restrict(x.dim()) ;
  for (int l = 0; l < b.count; l++) {
    int sub = b.sub_bas + b.sub_stp * l ;
    int glb = b.glb_bas + b.glb_stp * l ;
    T[S | p]
  }
where:
  i is an index name in the source program,
  x is a simple expression in the source program,
  e_lo, e_hi, and e_stp are expressions in the source,
  S is a statement in the source program, and
  b, p, l, sub, and glb are names of new variables

27 Important features of translation scheme
As the last slide shows, the basic translation scheme reduces overall constructs to simple local for loops
Inside these loops, the only overheads relative to hand-coded local for loops are a proliferation of references to fields of simple classes like Block and ArrayDim
These references can easily be lifted outside loops, strength-reduction optimizations can be applied to the local subscript expressions, loops can be unrolled, redundant run-time checks can be removed, etc
These things can all be done by a slightly more optimizing form of the translator

28 Optimization Strategies
Here we only consider strength-reduction optimizations on the index expressions
Consider the nested overall and for constructs:
  overall (i = x for :)
    overall (j = y for :) {
      float sum = 0 ;
      for (int k = 0; k < N; k++)
        sum += a [i, k] * b [k, j] ;
      c [i, j] = sum ;
    }

29 A correct but naive translation
Block bi = x.localBlock() ;
for (int lx = 0; lx < bi.count; lx++) {
  Block bj = y.localBlock() ;
  for (int ly = 0; ly < bj.count; ly++) {
    float sum = 0 ;
    for (int k = 0; k < N; k++)
      sum += a.dat() [a.bas() + (bi.sub_bas + bi.sub_stp * lx) * a.str(0)
                      + k * a.str(1)]
           * b.dat() [b.bas() + (bj.sub_bas + bj.sub_stp * ly) * b.str(1)
                      + k * b.str(0)] ;
    c.dat() [c.bas() + (bi.sub_bas + bi.sub_stp * lx) * c.str(0)
             + (bj.sub_bas + bj.sub_stp * ly) * c.str(1)] = sum ;
  }
}

30 Strength-Reduction Optimization
Note the complexity of the terms in the subscript expressions
The subscript expressions can be greatly simplified by applying strength-reduction optimization
Eliminate complicated expressions involving multiplication from inner loops by introducing the induction variables:
  vai_ ≡ a.bas() + (bi.sub_bas + bi.sub_stp * lx) * a.str(0)
  vci_ ≡ c.bas() + (bi.sub_bas + bi.sub_stp * lx) * c.str(0)
  vb_j ≡ b.bas() + (bj.sub_bas + bj.sub_stp * ly) * b.str(1)
  vcij ≡ vci_ + (bj.sub_bas + bj.sub_stp * ly) * c.str(1)
These can be computed efficiently by incrementing them at suitable points with the loop-invariant induction increments:
  sia0 ≡ bi.sub_stp * a.str(0)
  sic0 ≡ bi.sub_stp * c.str(0)
  sjb1 ≡ bj.sub_stp * b.str(1)
  sjc1 ≡ bj.sub_stp * c.str(1)
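The invariant this transformation relies on can be checked with a small plain-Java sketch (illustrative names): instead of recomputing bas + (sub_bas + sub_stp * l) * str on every iteration, an induction variable starts at bas + sub_bas * str and is bumped by the loop-invariant increment sub_stp * str, and the two forms must agree at every l.

```java
// Plain-Java sketch of the strength-reduction invariant for one subscript.
public class StrengthReductionDemo {
    // true iff the induction-variable form reproduces the naive subscript
    // bas + (subBas + subStp * l) * str for every l in [0, count)
    public static boolean matches(int bas, int subBas, int subStp,
                                  int str, int count) {
        int v = bas + subBas * str;   // induction variable
        int inc = subStp * str;       // loop-invariant increment
        for (int l = 0; l < count; l++) {
            if (v != bas + (subBas + subStp * l) * str) return false;
            v += inc;                 // cheap add replaces the multiply
        }
        return true;
    }
    public static void main(String[] args) {
        System.out.println(matches(100, 3, 2, 5, 8));
    }
}
```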

31 Translation of overall after applying strength reduction to the distributed-index subscript expressions:
  Block bi = x.localBlock() ;
  int vai_ = a.bas() + bi.sub_bas * a.str(0) ;
  int vci_ = c.bas() + bi.sub_bas * c.str(0) ;
  final int sia0 = bi.sub_stp * a.str(0), sic0 = bi.sub_stp * c.str(0) ;
  for (int lx = 0; lx < bi.count; lx++) {
    Block bj = y.localBlock() ;
    int vb_j = b.bas() + bj.sub_bas * b.str(1) ;
    int vcij = vci_ + bj.sub_bas * c.str(1) ;
    final int sjb1 = bj.sub_stp * b.str(1), sjc1 = bj.sub_stp * c.str(1) ;
    for (int ly = 0; ly < bj.count; ly++) {
      float sum = 0 ;
      for (int k = 0; k < N; k++)
        sum += a.dat() [vai_ + k * a.str(1)] * b.dat() [vb_j + k * b.str(0)] ;
      c.dat() [vcij] = sum ;
      vb_j += sjb1 ;
      vcij += sjc1 ;
    }
    vai_ += sia0 ;
    vci_ += sic0 ;
  }

32 Related Systems (1)
Co-Array Fortran (formerly called F--)
A simple and small set of extensions to Fortran 95 for SPMD processing
The logical model of communication is built in; HPJava follows the MPI philosophy, i.e. no built-in communication primitives
ZPL
An array programming language designed from first principles for fast execution on both sequential and parallel computers
A := A + B ; (where A and B are two-dimensional arrays)
Parallelism and communication are more implicit than in HPJava
HPJava provides lower-level access to the parallel machine using mpiJava

33 Related Systems (2)
Spar
A Java-based programming language for semi-automatic array-parallel programming
Multidimensional arrays, array sections, and parallel loops
Similar in syntax to HPJava, but semantically different
Suited to shared-memory computing systems; HPJava targets massively parallel distributed-memory computing
STAPL
A parallel C++ library designed as a superset of the ANSI C++ STL, executed on uni- or multi-processors for SPMD programming
While STAPL and HPJava share an SPMD programming model, HPJava is more naturally suited to distributed-memory systems since it adopts the philosophy of distributed arrays

34 Java Performance
Benchmarked on Red Hat Linux 7.2 (Pentium IV, 1.5 GHz), Linpack
Compared Java with GNU cc and g77 (Fortran 77)
It seems we don’t need manual loop unrolling for Java

  MFLOPS                   Unrolled   Rolled
  cc -O5                   403.68     376.45
  g77 -O5                  218.83     215.76
  IBM Developer Kit 1.3    13.78      233.81
  Sun JDK 1.4 beta         193.35     175.73
  Blackdown JDK 1.3.1      199.11     158.08

35 Why is Fortran slower than C?
One could say the performance of Fortran and C is the same
But it depends on the compilers
The GNU Fortran 77 compiler generates more machine code than the GNU cc compiler does for the main loop in Linpack

36 Conclusions Historical review of data-parallel languages such as HPF
Message-passing frameworks – p4, PARMACS, PVM, and the MPI standard
High-level libraries for distributed arrays – PARTI, GA, and Adlib
The HPspmd programming language model – an SPMD framework for using libraries based on distributed arrays
Specific syntax, new control constructs, basic translation schemes, and basic optimization strategies for HPJava
Related systems – Co-Array Fortran, ZPL, Spar, and STAPL
Current status of HPJava
In collaboration with Bryan Carpenter, Geoffrey Fox, Guansong Zhang, Sang Lim, and Zheng Qiang
The first fully functional HPJava translator (written in Java) is now operational
Parser – JavaCC and JTB tools
Has been tested and debugged against a small test suite and an 800-line multigrid code
Next stage – implement the optimizations

