MPJ Express and Lessons from Java Gadget. Bryan Carpenter, OMII, University of Southampton, Southampton SO17 1BJ, UK. February 7, 2007.

Presentation transcript:

1 MPJ Express and Lessons from Java Gadget. Bryan Carpenter, OMII, University of Southampton, Southampton SO17 1BJ, UK. February 7, 2007.

2 Contents MPJ Express overview; the Gadget-2 code; making a Java version of Gadget.

3 Acknowledgements Work done in collaboration with Mark Baker and Aamir Shafi. Most work done by the latter!

4 MPJ Express

5 Java for High Performance Computing? Various people have argued that Java provides a good programming platform for large scale computation (e.g. Java Grande, ~2003). Parallel computing has a long history of novel language concepts to support parallelism, multithreading etc. Java seems to incorporate at least some of these ideas, and has the benefit that it is a “mainstream” language. Java encourages better software engineering and is more portable than, say, Fortran or C. There is a large body of supporting software for Java.

6 Performance Modern JVMs perform compilation from byte code to native machine code as the Java program is executed. Typical JVMs implement many of the optimizations used by compilers for “traditional” programming languages. Adaptive compilation may even allow optimizations that are impractical for static compilers, because run-time information is available. Quality of compilation is comparable to many C or Fortran compilers. But Java has safety features that may limit performance: difficulty of managing memory explicitly, exact exception processing, …

7 MPI Background MPI was introduced in 1994 as a standard message passing API for parallel scientific computing: language bindings for C, C++, and Fortran. Original programming model was SPMD (Single Program Multiple Data) ― P copies of a single program run on a fixed number, P, of processors. Only interaction is by message exchange ― no sharing of memory between processors (cf. CSP). Introduced the idea of the communicator as the basic communication abstraction (cf. channels). A communicator is shared by a group of processors. Supports library development and collective operations.

8 Communicators [Diagram: Proc 0, Proc 1 and Proc 2 belong to communicator C; Proc 0 calls C.Send(…, 2) and Proc 2 calls C.Recv(…, 0).]

9 Collective Communications [Diagram: Proc 0, Proc 1 and Proc 2 in communicator C all call C.Bcast(…, 0), with rank 0 as the root.]

10 MPJ Background The Java Grande Message Passing Workgroup defined Java bindings for MPI. Previous efforts followed two approaches: pure Java approaches – Remote Method Invocation (RMI), Java sockets; Java Native Interface (JNI) approaches – mpiJava, a Java wrapper to the platform's native MPI.

11 Minimal mpiJava Program

import mpi.*;

class Hello {
  static public void main(String[] args) throws Exception {
    MPI.Init(args);
    int myrank = MPI.COMM_WORLD.Rank();
    if (myrank == 0) {
      char[] message = "Hello, there".toCharArray();
      MPI.COMM_WORLD.Send(message, 0, message.length, MPI.CHAR, 1, 99);
    } else {
      char[] message = new char[20];
      MPI.COMM_WORLD.Recv(message, 0, 20, MPI.CHAR, 0, 99);
      System.out.println("received:" + new String(message) + ":");
    }
    MPI.Finalize();
  }
}
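The collective operations pictured on slide 9 take a similar form in this API. Below is a minimal broadcast sketch (not from the original talk; it assumes the mpiJava-style Bcast signature that MPJ Express also provides):

import mpi.*;

class BcastDemo {
  public static void main(String[] args) throws Exception {
    MPI.Init(args);
    int rank = MPI.COMM_WORLD.Rank();
    int[] data = new int[4];
    if (rank == 0) {                                // root fills the buffer
      for (int i = 0; i < data.length; i++) data[i] = i * i;
    }
    // Every process calls Bcast with the same root (rank 0);
    // afterwards all ranks hold the root's data.
    MPI.COMM_WORLD.Bcast(data, 0, data.length, MPI.INT, 0);
    System.out.println("rank " + rank + ": " + java.util.Arrays.toString(data));
    MPI.Finalize();
  }
}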

12 MPJ Express Goals Modular design: unify earlier approaches by pluggable communication devices. Support high-level functionality of MPI-1.2 in pure Java code, as far as possible: point-to-point communications; collective communications; groups, communicators, and contexts; derived datatypes – buffering API. Thread safety ― support Java threads for parallelism within multicore nodes. Open source distribution.

13 MPJ Design

14 MPJ Demos There were a few existing demos for MPJ, inherited from the earlier mpiJava project, e.g. a 2d fluid flow example, and some Monte Carlo simulations of 2d spin systems from condensed matter physics. These have nice GUIs, but aren't very representative of real codes used by computational scientists.

15 MPJ demo: CFD – inviscid flow

16 MPJ demo: Q-state Potts model

17 Gadget

18 Gadget-2 Gadget-2 is a free-software, production code for cosmological N-body (and hydrodynamic) computations. Written by Volker Springel, of the Max Planck Institute for Astrophysics, Garching. It is written in the C language – already parallelized using MPI. Versions have been used in various research papers in the astrophysics literature, including the Millennium Simulation.

19 Millennium Simulation See the paper Simulating the Joint Evolution of Quasars, Galaxies and their Large-Scale Distribution by Springel et al on the Gadget home page. Follows the evolution of dark matter “particles” from the early Universe (z = 127) to the current day. Performed on 512 nodes of an IBM p690. Used 1 TB of distributed memory. 350,000 CPU hours – 28 days elapsed time. Floating point performance around 0.2 TFLOPS. Around 20 TB of data generated in total.

20 Dynamics in Gadget Gadget is “just” simulating the movement of (a lot of) representative particles under the influence of Newton’s law of gravity. Plus some hydrodynamic forces, but these don’t affect dominant Dark Matter. Co-moving coordinates take account of expansion of Universe according to General Relativity, but otherwise basically classical mechanics. Classical N-body problem.

21 Gadget Main Loop This is a slightly simplified view of the Gadget code:

  … Initialize …
  while (not done) {
    move_particles() ;              // update positions
    domain_Decomposition() ;
    compute_accelerations() ;
    advance_and_find_timesteps() ;  // update velocities
  }

Most of the interesting work happens in compute_accelerations() (and domain_Decomposition() – see later).

22 Computing Forces The function compute_accelerations() must compute the forces experienced by all particles. In particular it must compute the gravitational force. Because gravity is a long range force, every particle affects every other. Naively, the total cost is O(N^2). With N ≈ 10^10, this is quite infeasible. Need some kind of approximation. Intuitive approach: when computing the force on particle i, group together particles in selected regions distant from i, and treat groups as single particles, located at their centres of mass.
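To make the O(N^2) cost concrete, a direct-summation acceleration loop looks roughly like the sketch below (illustrative Java only, not Gadget code; the pos/mass layout, the constant G and the softening length eps are assumptions):

class DirectGravity {
  // Naive O(N^2) gravity: every particle feels every other particle.
  static void accelerations(double[][] pos, double[] mass, double[][] acc,
                            double G, double eps) {
    int n = mass.length;
    for (int i = 0; i < n; i++) {
      acc[i][0] = acc[i][1] = acc[i][2] = 0.0;
      for (int j = 0; j < n; j++) {
        if (j == i) continue;
        double dx = pos[j][0] - pos[i][0];
        double dy = pos[j][1] - pos[i][1];
        double dz = pos[j][2] - pos[i][2];
        double r2 = dx * dx + dy * dy + dz * dz + eps * eps;  // softened distance^2
        double inv = 1.0 / Math.sqrt(r2);
        double f = G * mass[j] * inv * inv * inv;             // G * m_j / r^3
        acc[i][0] += f * dx;
        acc[i][1] += f * dy;
        acc[i][2] += f * dz;
      }
    }
  }
}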

23 Barnes Hut Tree First divide the cubical region of 3d space into 2^3 = 8 regions, halving each dimension. For every sub-region that contains any particles, divide again recursively to make an oct-tree, until “leaf” regions have at most one particle.

24 Two Dimensional Example [Picture borrowed from an external source.]

25 Barnes Hut Force Computation To compute the force on a particle i, traverse the tree starting from the root: if a node n is “distant from” i, just add the contribution to the force on i from the centre of mass of n – no need to visit the children of n; if node n is “close to” i, visit the children of n and recurse. Everything hinges on the definition of distant from/close to. The basic idea is that a node representing some region of space is distant from a particle i if the angle θ it subtends at i is smaller than a threshold opening angle.
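In code, the opening-angle test drives a recursive walk roughly like the sketch below (illustrative Java; the TreeNode fields and the theta threshold are assumptions, not Gadget's actual data structures):

class TreeNode {
  double comX, comY, comZ;   // centre of mass of the node's particles
  double totalMass;
  double size;               // side length of the node's cubic region
  TreeNode[] children;       // up to 8 children; null for a leaf
  boolean isLeaf() { return children == null; }
}

class BarnesHutWalk {
  // Accumulate the force on a particle at pos[] by walking the tree from node n.
  static void addForce(TreeNode n, double[] pos, double[] acc,
                       double G, double theta, double eps) {
    if (n == null || n.totalMass == 0.0) return;
    double dx = n.comX - pos[0], dy = n.comY - pos[1], dz = n.comZ - pos[2];
    double r2 = dx * dx + dy * dy + dz * dz;
    if (r2 == 0.0) return;                       // the leaf holding the particle itself
    double r = Math.sqrt(r2 + eps * eps);
    if (n.isLeaf() || n.size / r < theta) {
      double f = G * n.totalMass / (r * r * r);  // "distant": use the centre of mass
      acc[0] += f * dx; acc[1] += f * dy; acc[2] += f * dz;
    } else {
      for (TreeNode child : n.children)          // "close": open the node and recurse
        addForce(child, pos, acc, G, theta, eps);
    }
  }
}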

26 Complexity On average, the number of nodes “opened” to compute the force on i is O(log N), as opposed to visiting O(N) particles in the naïve algorithm. A huge win when N ≈ 10^10.

27 Domain Decomposition Need to divide space and/or the particle set into domains, where each domain is handled by a single processor. The problem is that we can't just divide space evenly, because some regions will have many more particles than others – poor load balancing. Conversely, we can't just divide particles evenly, because particles move throughout space, and we want to keep physically close particles on the same processor, as far as practical – a communication problem.

28 Peano-Hilbert Curve Warren and Salmon originally suggested using a space-filling curve. [Picture borrowed from an external source.]

29 Peano-Hilbert Key Gadget applies the recursion 20 times, logically dividing space into up to 2^20 × 2^20 × 2^20 cells on the Peano-Hilbert curve. Then each cell can be labelled by its location along the Peano-Hilbert curve – 2^60 possible locations comfortably fit into a 64-bit word.
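The key construction can be pictured with the simpler Morton (bit-interleaving) ordering sketched below. Gadget itself uses the Peano-Hilbert mapping, which preserves locality better but is more intricate to code; the point of the sketch is just the packing of three 20-bit cell coordinates into one 64-bit long:

class CellKeys {
  // Stand-in sketch: interleave the bits of three 20-bit cell coordinates into a
  // 60-bit Morton key. Gadget's real code computes a Peano-Hilbert key instead,
  // but it fits into a 64-bit word in the same way.
  static long mortonKey(int x, int y, int z) {
    long key = 0L;
    for (int bit = 19; bit >= 0; bit--) {        // 20 levels of refinement per axis
      key = (key << 3)
          | (((long) (x >> bit) & 1L) << 2)
          | (((long) (y >> bit) & 1L) << 1)
          |  ((long) (z >> bit) & 1L);
    }
    return key;                                  // 60 significant bits
  }
}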

30 Decomposition based on P-H Curve Because the number of cells >> the number of particles, segments of the linear Peano-Hilbert curve are sparsely populated. Sort particles by their Peano-Hilbert key, then divide evenly into P domains. Intuitively – stretch out the P-H curve with particles dotted along it; segment it into P parts where each part has the same number of particles. Characteristics of this decomposition: Good load balancing. Domains are simply connected and quite “compact” in real space, because particles that are close along the P-H curve are close in real space (the converse is often but not always true). Domains have a relatively simple mapping to BH oct-tree nodes.
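A minimal sketch of the splitting step, assuming every particle already carries its Peano-Hilbert key (a single-node illustration; as slide 34 notes, Gadget actually approximates a distributed sort):

import java.util.Arrays;

class CurveDecomposition {
  // Sort particles along the space-filling curve and cut the list into P domains
  // of (nearly) equal particle count; returns the first key of each domain.
  static long[] domainBoundaries(long[] keys, int P) {
    long[] sorted = keys.clone();
    Arrays.sort(sorted);                               // order along the P-H curve
    long[] firstKey = new long[P];
    int n = sorted.length;
    for (int d = 0; d < P; d++) {
      firstKey[d] = sorted[(int) ((long) d * n / P)];  // ~n/P particles per domain
    }
    return firstKey;
  }
}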

31 Distribution of BH Tree in Gadget Ibid.

32 Distributed Representation of Tree Every processor holds a copy of the root nodes, and a copy of all child nodes down to the point where all particles of a node are held on a single remote processor. Remotely held nodes are called pseudo-particles. To compute the force on a single local target particle, traverse the tree from the root as usual, and accumulate force contributions from locally held particles. Build an export list containing the target particle and the hosts of pseudo-particles encountered in the walk.

33 Communication After computation for all locally held target particles, process export list and send list of local target particles to all hosts that own pseudo-particle nodes needed for those particles. All processors do another tree walk to compute their contributions to remotely owned (from their point of view) target particles. These contributions are returned to the original processor, and added into the accelerations for the target particles.
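The first step of this exchange can be pictured as an all-to-all of counts, followed by a variable-length exchange of the particle data itself. A hedged sketch in the MPJ API (the Alltoall signature is assumed to follow the mpiJava bindings; Gadget's actual exchange logic is more involved):

import mpi.*;

class ExportExchange {
  // sendCount[d] = number of local target particles to export to rank d, tallied
  // during the local tree walk whenever a pseudo-particle owned by d is met.
  // Returns recvCount[s] = number of particles rank s will send to this rank.
  static int[] exchangeExportCounts(int[] sendCount) throws Exception {
    int P = MPI.COMM_WORLD.Size();
    int[] recvCount = new int[P];
    MPI.COMM_WORLD.Alltoall(sendCount, 0, 1, MPI.INT,
                            recvCount, 0, 1, MPI.INT);
    // A second, variable-sized exchange (e.g. Alltoallv) then moves the particle
    // coordinates themselves, sized by these counts.
    return recvCount;
  }
}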

34 Other Communications Notably: the domain decomposition itself requires (in principle) a distributed sort of the particle list. Gadget only approximates this sort, but the communications are still fairly intricate. In Gadget all communication uses the standard MPI library; it makes fairly extensive use of collective communication routines.

35 Java Gadget

36 Timeline MPJ Express released. Realized we needed a realistic exemplar code: both as a demo, and to drive further improvements of our software. First thoughts that an N-body simulation code would make sense – summer 2005. Learned about Gadget in December that year. Effort to make a Java version of Gadget started seriously in early February 2006 – expecting many months of work. Non-trivial simulations running in March 2006.

37 Gadget Code Statistics Public distribution is around 17,000 lines of C, distributed over about 30 source files. Dependencies on: MPI library for parallelization. GNU scientific library (but only a handful of functions) FFTW – library for parallel Fourier transforms. Comes with a set of initial conditions files that we can use to test our code, including “colliding galaxies” and “cluster formation”.

38 Translation of Gadget Code Manually translated, but in the first cut we deliberately keep the Java code as similar to the C as possible. E.g. nearly all of Java Gadget is currently implemented as static methods on one enormous class called Main. Small supporting classes correspond to the structs of the C code. Currently not supporting a number of features, including: the PMTree algorithm (an optimization using Fourier transforms on a mesh to compute the long-range part of the gravitational force), multiple Initial Conditions files, COMPUTE_POTENTIAL_ENERGY, ISOTHERM_EQS, FLEXSTEPS, OUTPUTPOTENTIAL, FORCETEST, MAKEGLASS, PSEUDOSYMMETRIC, …

39 Handling Dependencies Replace MPI calls with MPJ calls (!). Use of FFTW is avoided by disabling the PMTree algorithm (but this may slow down some simulations a lot). The few GNU Scientific Library functions that were needed (numerical quadrature, etc.) were hand-translated to Java.
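The MPI-to-MPJ translation is mostly mechanical. As a hedged illustration (the variable names are made up, but the Allreduce call shape follows the mpiJava bindings used by MPJ Express):

import mpi.*;

class GlobalOps {
  // C Gadget:  MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
  // Java:      buffers become arrays, with an explicit offset argument.
  static double globalSum(double localValue) throws Exception {
    double[] send = new double[] { localValue };
    double[] recv = new double[1];
    MPI.COMM_WORLD.Allreduce(send, 0, recv, 0, 1, MPI.DOUBLE, MPI.SUM);
    return recv[0];
  }
}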

40 Test Cases Restrictions aside, we successfully ran the Colliding Galaxies and Cluster Formation example simulations on varying numbers of processors. These use pure Dark Matter – the hydrodynamics code has not yet been tested. Results are indistinguishable from running the original Gadget.

41 Colliding Galaxies (Java Gadget)

42 Cluster Formation (Java Gadget)

43 Optimizing Java Gadget The initial Java version of Gadget was about 3 times slower than C. Optimizations: Replace communication of Object arrays (corresponding to C structs) with communication of their component primitive elements (floats, ints, etc.). Still wasn't enough – the final version actually expunges large Object arrays from the program, flattening all struct arrays into arrays of primitive types (see later comments). Extended the MPJ Express API to manipulate Java ByteBuffers directly.

44 Flattening Object Arrays To preserve memory locality, we decided to “flatten” large arrays of small objects (ParticleData) – replace an array of objects with an array of doubles containing object component fields, for example. This makes the code less readable and maintainable than the original C code – clearly not acceptable. But we wanted speed. Need a much better approach here. One possibility is some kind of macro pre-processing (it would be better if the Java language had structs…).
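Concretely, the flattening amounts to something like the before/after sketch below (the field names are illustrative, not Gadget's actual ParticleData layout):

// Before: an array of small objects; each particle is a separate heap object,
// so consecutive particles need not be adjacent in memory.
class ParticleData {
  double posX, posY, posZ, mass;
}
// used as:  ParticleData[] particles = new ParticleData[n];

// After: one primitive array per field ("structure of arrays"), so a loop over
// particles streams through contiguous memory.
class FlattenedParticles {
  final double[] posX, posY, posZ, mass;
  FlattenedParticles(int n) {
    posX = new double[n]; posY = new double[n];
    posZ = new double[n]; mass = new double[n];
  }
}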

45 Benchmarks Graphs taken from Aamir Shafi's thesis – you will find many more results there. Some benchmarks are on up to 8 processors of the Starbug cluster in the Portsmouth Distributed Systems Group. Other benchmarks are on up to 32 processors of the NW-GRID cluster at Daresbury Laboratory.

46 Starbug: Colliding Galaxies with 60,000 Particles

47 NW-Grid: Structure Formation with two million Particles

48 Conclusions Possible to do large-scale numerical simulations in Java. Speed is OK. Memory consumption is still an issue (Java Gadget needs perhaps three times more memory than C Gadget). Made various improvements to the MPJ Express implementation and API along the way. As things stand, the Java Gadget code is not very maintainable – C-like, and then optimized some. Needs more work to learn how to write maintainable code in Java that is also fast – e.g. how to preserve memory locality in arrays of objects.