Programming Parallel Hardware using MPJ Express

Programming Parallel Hardware using MPJ Express Aamir Shafi ashafi@mit.edu http://mpj-express.org

Contributors Aamir Shafi Jawad Manzoor Kamran Hamid Mohsan Jameel Rizwan Hanif Amjad Aziz Bryan Carpenter Mark Baker Hong Ong Guillermo Taboada Sabela Ramos http://mpj-express.org

Programming Languages in the HPC World HPC programmers have two choices for writing large-scale scientific applications: Support parallelism by using message passing between processors in existing languages like Fortran, C, and C++ Adopt a new HPC language that provides parallelism constructs (e.g., UPC) The first approach led to the development of the Message Passing Interface (MPI) standard: Bindings for Fortran, C, and C++ Interest in developing bindings for other languages like Java: The Java Grande Forum—formed in the late 1990s—came up with an API called mpiJava 1.2 http://mpj-express.org

Why Java? Portability A popular language in colleges and software industry: Large pool of software developers A useful educational tool Higher programming abstractions including OO features Improved compile and runtime checking of the code Automatic garbage collection Support for multithreading Rich collection of support libraries

MPJ Express MPJ Express is an MPI-like library that supports execution of parallel Java applications Three existing approaches to Java messaging: Pure Java (Sockets based) Java Native Interface (JNI) Remote Method Invocation (RMI) Motivation for a new Java messaging system: Maintain compatibility with Java threads by providing thread-safety Handle contradicting issues of high-performance and portability Requires no change to the native standard JVM

MPJ Express Design http://mpj-express.org

“Hello World” MPJ Express Program

import mpi.*;

public class HelloWorld {
  public static void main(String args[]) throws Exception {
    MPI.Init(args);
    int size = MPI.COMM_WORLD.Size();
    int rank = MPI.COMM_WORLD.Rank();

    System.out.println("I am process <" + rank + ">");

    MPI.Finalize();
  }
}

aamirshafi@velour:~/work/mpj-user$ mpjrun.sh -np 4 HelloWorld
MPJ Express (0.38) is started in the multicore configuration
I am process <1>
I am process <0>
I am process <3>
I am process <2>

An Embarrassingly Parallel Toy Example
[Diagram: a master process distributing work to Worker 0 through Worker 3]

aamirshafi@velour:~/work/mpj-user$ mpjrun.sh -np 5 ToyExample
MPJ Express (0.38) is started in the multicore configuration
1 1 1 1 2 2 2 2 3 3 3 3 4 4 4 4
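
The ToyExample source itself is not reproduced in the transcript. Below is a minimal sketch, assuming the mpiJava 1.2-style API implemented by MPJ Express, of what such a master/worker program might look like; the tag value and the four work items per worker are assumptions chosen to match the output above.

import mpi.*;

public class ToyExample {
  public static void main(String[] args) throws Exception {
    MPI.Init(args);
    int rank = MPI.COMM_WORLD.Rank();
    int size = MPI.COMM_WORLD.Size();
    int tag = 10;               // assumed message tag
    int[] data = new int[4];    // assumed four work items per worker

    if (rank == 0) {
      // Master: send each worker an array filled with that worker's rank
      for (int worker = 1; worker < size; worker++) {
        for (int i = 0; i < data.length; i++)
          data[i] = worker;
        MPI.COMM_WORLD.Send(data, 0, data.length, MPI.INT, worker, tag);
      }
    } else {
      // Worker: receive the array from the master and print it
      MPI.COMM_WORLD.Recv(data, 0, data.length, MPI.INT, 0, tag);
      StringBuilder line = new StringBuilder();
      for (int i = 0; i < data.length; i++)
        line.append(data[i]).append(" ");
      System.out.println(line.toString().trim());
    }

    MPI.Finalize();
  }
}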

Outline of the Presentation MPJ Express: Thread-safe point-to-point communication Collective communication Memory management The runtime system A case study from computational astrophysics Summary

Point-to-point Communication

              Standard            Synchronous   Ready      Buffered
Blocking      Send() / Recv()     Ssend()       Rsend()    Bsend()
Non-blocking  Isend() / Irecv()   Issend()      Irsend()   Ibsend()

Non-blocking methods return a Request object:
Wait()  // waits until the communication completes
Test()  // tests whether the communication has finished
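
As an illustration, the following minimal sketch, assuming the mpiJava 1.2-style API, overlaps computation with a non-blocking transfer and then completes it with Wait() and Test(); the buffer size, tag, and class name are arbitrary, and the program expects at least two processes.

import mpi.*;

public class NonBlockingExample {
  public static void main(String[] args) throws Exception {
    MPI.Init(args);
    int rank = MPI.COMM_WORLD.Rank();
    int tag = 99;                         // arbitrary tag
    double[] buf = new double[1024];      // arbitrary message size

    if (rank == 0) {
      Request req = MPI.COMM_WORLD.Isend(buf, 0, buf.length, MPI.DOUBLE, 1, tag);
      // ... do useful computation here while the message is in flight ...
      req.Wait();                         // block until the send has completed
    } else if (rank == 1) {
      Request req = MPI.COMM_WORLD.Irecv(buf, 0, buf.length, MPI.DOUBLE, 0, tag);
      while (req.Test() == null) {
        // message not complete yet: keep computing instead of waiting
      }
    }

    MPI.Finalize();
  }
}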

“Blocking” vs “Non-blocking”
[Diagram: sender/receiver timelines for Send()/Recv() (blocking), where the CPU waits, and for Isend()/Irecv() (non-blocking), where the CPU does computation and later calls Wait()]

Thread-safe Communication
Thread-safe MPI libraries allow communication from multiple user threads inside a single process. Such an implementation requires fine-grained locking: incorrect implementations can deadlock.
Levels of thread safety in MPI libraries: MPI_THREAD_SINGLE, MPI_THREAD_FUNNELED, MPI_THREAD_SERIALIZED, MPI_THREAD_MULTIPLE
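
To make this concrete, here is a minimal sketch, assuming the mpiJava 1.2-style API, of two user threads inside each MPJ Express process communicating concurrently; distinct tags keep the two message streams apart, and the class name and tag values are illustrative only (run with two processes).

import mpi.*;

public class ThreadedSendRecv {
  public static void main(String[] args) throws Exception {
    MPI.Init(args);
    final int rank = MPI.COMM_WORLD.Rank();

    Thread[] threads = new Thread[2];
    for (int t = 0; t < 2; t++) {
      final int tag = t;   // one tag per thread keeps the message streams apart
      threads[t] = new Thread(new Runnable() {
        public void run() {
          try {
            int[] buf = new int[1];
            if (rank == 0) {
              buf[0] = tag;
              MPI.COMM_WORLD.Send(buf, 0, 1, MPI.INT, 1, tag);
            } else if (rank == 1) {
              MPI.COMM_WORLD.Recv(buf, 0, 1, MPI.INT, 0, tag);
              System.out.println("Thread " + tag + " received " + buf[0]);
            }
          } catch (Exception e) {
            e.printStackTrace();
          }
        }
      });
      threads[t].start();
    }
    for (Thread th : threads)
      th.join();

    MPI.Finalize();
  }
}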

Implementation of point-to-point communication The various modes of blocking and non-blocking communication primitives are implemented using two protocols: Eager Send and Rendezvous
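
As a rough, hypothetical illustration of the idea (not MPJ Express internals), a sender typically picks the protocol from the message size; the class name, threshold, and comments below are assumptions.

// Hypothetical sketch of size-based protocol selection; the threshold and
// class name are assumptions, not MPJ Express internals.
public class ProtocolChoiceSketch {
  static final int EAGER_LIMIT = 128 * 1024;   // assumed cutoff in bytes

  static String chooseProtocol(int messageSizeInBytes) {
    // Eager: push the data immediately; the receiver buffers it if no
    // matching receive has been posted yet.
    // Rendezvous: exchange control messages first so the payload can land
    // directly in the user's receive buffer without an extra copy.
    return messageSizeInBytes <= EAGER_LIMIT ? "eager send" : "rendezvous";
  }

  public static void main(String[] args) {
    System.out.println(chooseProtocol(1024));          // eager send
    System.out.println(chooseProtocol(1024 * 1024));   // rendezvous
  }
}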

Implementation of the sockets communication device
[Diagram: Process 0 and Process 1 connected by blocking and non-blocking channels, each side maintaining send queues and receive queues]

Eager Protocol (two cases)
[Diagram: sender/receiver timelines for isend/irecv. First case: the message is received directly into the user buffer. Second case: the message is received into a temporary buffer before being copied to the user buffer.]

Myrinet and Multithreaded Comm Devices
MPJ Express works over Myrinet by providing JNI wrappers to the Myrinet eXpress library. The software also runs in a multithreaded mode where each MPJ process is represented by a thread inside a single JVM.
[Diagram: four MPJ processes (Proc 0 to Proc 3) mapped onto CPU 0 to CPU 3 and sharing main memory]

Performance Evaluation of Point-to-Point Communication
Normally ping-pong benchmarks are used to measure:
Latency: how long it takes to send N bytes from sender to receiver
Throughput: how much bandwidth is achieved
Latency is a useful measure for studying the performance of “small” messages; throughput is a useful measure for studying the performance of “large” messages.
[Diagram: round-trip time (RTT) measured between Node A and Node B]
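
A minimal ping-pong sketch in the mpiJava 1.2-style API is shown below; it is not the benchmark used for the plots that follow, and the message size and repetition count are arbitrary (run with two processes).

import mpi.*;

public class PingPong {
  public static void main(String[] args) throws Exception {
    MPI.Init(args);
    int rank = MPI.COMM_WORLD.Rank();
    final int REPS = 1000;         // arbitrary repetition count
    final int BYTES = 1024;        // arbitrary message size
    byte[] buf = new byte[BYTES];
    int tag = 1;

    long start = System.nanoTime();
    for (int i = 0; i < REPS; i++) {
      if (rank == 0) {
        MPI.COMM_WORLD.Send(buf, 0, BYTES, MPI.BYTE, 1, tag);
        MPI.COMM_WORLD.Recv(buf, 0, BYTES, MPI.BYTE, 1, tag);
      } else if (rank == 1) {
        MPI.COMM_WORLD.Recv(buf, 0, BYTES, MPI.BYTE, 0, tag);
        MPI.COMM_WORLD.Send(buf, 0, BYTES, MPI.BYTE, 0, tag);
      }
    }
    long elapsed = System.nanoTime() - start;

    if (rank == 0) {
      double rttMicros = (elapsed / 1e3) / REPS;     // average round-trip time
      double latencyMicros = rttMicros / 2.0;        // one-way latency estimate
      double throughputMBps = (2.0 * BYTES * REPS) / (elapsed / 1e9) / 1e6;
      System.out.println("latency (us): " + latencyMicros
          + ", throughput (MB/s): " + throughputMBps);
    }
    MPI.Finalize();
  }
}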

Latency Comparison on GigE

Throughput Comparison on GigE

Latency Comparison on Myrinet

Throughput Comparison on Myrinet

Latency Comparison on a multicore machine

Throughput Comparison on a multicore machine

Collective communications Provided as a convenience for application developers: Save significant development time Efficient algorithms may be used Stable (tested) Built on top of point-to-point communications

Image from MPI standard doc

Reduce collective operations
Predefined reduction operations: MPI.PROD, MPI.SUM, MPI.MIN, MPI.MAX, MPI.LAND, MPI.BAND, MPI.LOR, MPI.BOR, MPI.LXOR, MPI.BXOR, MPI.MINLOC, MPI.MAXLOC
[Diagram: values contributed by all processes are combined into a single result]
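
For example, a sum reduction over all processes looks roughly like this in the mpiJava 1.2-style API; the contributed values and the choice of root are arbitrary.

import mpi.*;

public class ReduceExample {
  public static void main(String[] args) throws Exception {
    MPI.Init(args);
    int rank = MPI.COMM_WORLD.Rank();

    int[] sendBuf = new int[] { rank };   // each process contributes its rank
    int[] recvBuf = new int[1];

    // Combine contributions with MPI.SUM; only the root (rank 0) receives the result
    MPI.COMM_WORLD.Reduce(sendBuf, 0, recvBuf, 0, 1, MPI.INT, MPI.SUM, 0);

    if (rank == 0)
      System.out.println("Sum of ranks = " + recvBuf[0]);

    MPI.Finalize();
  }
}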

Toy Example with Collectives
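
The collective version of the toy example is not reproduced in the transcript; a minimal sketch, assuming the mpiJava 1.2-style API, might scatter the work from the master and gather the results back as follows (the chunk size and class name are illustrative).

import mpi.*;

public class ToyExampleCollectives {
  public static void main(String[] args) throws Exception {
    MPI.Init(args);
    int rank = MPI.COMM_WORLD.Rank();
    int size = MPI.COMM_WORLD.Size();
    final int CHUNK = 4;                  // assumed work items per process

    int[] work = new int[size * CHUNK];
    if (rank == 0)
      for (int i = 0; i < work.length; i++)
        work[i] = i / CHUNK;              // process p gets CHUNK copies of p

    // Distribute one chunk to every process (including the root itself)
    int[] myChunk = new int[CHUNK];
    MPI.COMM_WORLD.Scatter(work, 0, CHUNK, MPI.INT, myChunk, 0, CHUNK, MPI.INT, 0);

    // "Process" the chunk (left as-is here), then collect results at the root
    int[] results = new int[size * CHUNK];
    MPI.COMM_WORLD.Gather(myChunk, 0, CHUNK, MPI.INT, results, 0, CHUNK, MPI.INT, 0);

    if (rank == 0)
      System.out.println("Gathered " + results.length + " results at the master");

    MPI.Finalize();
  }
}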

Barrier with Tree Algorithm
[Diagram: processes synchronizing over a tree]

Alternate Barrier Implementation

Broadcasting algorithm, total processes=8, root=0
[Diagram: tree-based broadcast from root 0 to processes 1 through 7]
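
The textbook binomial-tree broadcast, built on point-to-point calls, can be sketched as follows, assuming the mpiJava 1.2-style API; this illustrates the algorithm rather than MPJ Express's actual Bcast() implementation, and the root is fixed at 0.

import mpi.*;

public class TreeBcast {
  // Binomial-tree broadcast from rank 0: in each round the set of processes
  // that already hold the data doubles.
  static void treeBcast(int[] buf, int count, int tag) throws Exception {
    int rank = MPI.COMM_WORLD.Rank();
    int size = MPI.COMM_WORLD.Size();

    for (int step = 1; step < size; step <<= 1) {
      if (rank < step) {
        int dest = rank + step;
        if (dest < size)
          MPI.COMM_WORLD.Send(buf, 0, count, MPI.INT, dest, tag);
      } else if (rank < 2 * step) {
        MPI.COMM_WORLD.Recv(buf, 0, count, MPI.INT, rank - step, tag);
      }
    }
  }

  public static void main(String[] args) throws Exception {
    MPI.Init(args);
    int[] data = new int[1];
    if (MPI.COMM_WORLD.Rank() == 0)
      data[0] = 42;                       // value to broadcast (arbitrary)
    treeBcast(data, 1, 5);
    System.out.println("Rank " + MPI.COMM_WORLD.Rank() + " has " + data[0]);
    MPI.Finalize();
  }
}

With 8 processes and root 0 this takes three rounds, matching the diagram above.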

Memory Management in MPJ Express
MPJ Express explicitly manages memory for internal usage. Each Send() and Recv() method internally creates a buffer: such constant creation of buffers can be detrimental to performance.

Review of Buddy Algorithm

A buffer pooling strategy implemented
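
The pooling code itself is not shown in the transcript; the following is a minimal, hypothetical sketch of the general idea of recycling fixed-size buffers instead of allocating a fresh one for every message (the class and method names are illustrative, not MPJ Express internals).

import java.nio.ByteBuffer;
import java.util.ArrayDeque;

// Hypothetical sketch: buffers of one fixed capacity are recycled rather than
// allocated anew for every Send()/Recv().
public class SimpleBufferPool {
  private final int capacity;                 // size of each pooled buffer
  private final ArrayDeque<ByteBuffer> free = new ArrayDeque<ByteBuffer>();

  public SimpleBufferPool(int capacity) {
    this.capacity = capacity;
  }

  public synchronized ByteBuffer acquire() {
    ByteBuffer buf = free.poll();
    if (buf == null)
      buf = ByteBuffer.allocateDirect(capacity);  // grow the pool on demand
    buf.clear();
    return buf;
  }

  public synchronized void release(ByteBuffer buf) {
    free.push(buf);                             // return the buffer for reuse
  }
}

A buddy allocator such as the one reviewed above goes further by splitting and coalescing regions of one large block so that requests of different sizes can be served.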

The Runtime System

Outline of the Presentation MPJ Express: Thread-safe point-to-point communication Collective communication Memory management The runtime system A case study from computational astrophysics Summary

A Case Study from Computational Astrophysics
Various publications [BSPF01][FF97][NNS03][MMG00] suggest that Java is a good candidate for HPC. This argument does not convince many, perhaps due to the scarcity of high-profile number-crunching codes in Java. Gadget-2 is a production-quality code for cosmological N-body (and hydrodynamic) computations, written by Volker Springel of the Max Planck Institute for Astrophysics, Garching.

Porting Gadget-2 to Java
Dependencies on:
MPI library for parallelization: replaced MPI calls with MPJ Express
GNU Scientific Library (only a handful of functions): the required methods were hand-translated to Java
FFTW, a library for parallel Fourier transforms: not needed because we disabled the TreePM algorithm
We have successfully run the Colliding Galaxies and Cluster Formation example simulations: these use pure dark matter; the hydrodynamics code has not yet been tested.

Java Optimizations
Initial benchmarking revealed that the Java code was slower by a factor of 2-3. The optimizations applied were: custom serialization and de-serialization (replacing Java object communication with primitive datatypes), flattening performance-sensitive data structures in the hope of exploiting the processor cache efficiently (many data structures were Java objects), avoiding expensive array operations, and improved collective algorithms in MPJ Express. For example, particle data can be stored either as objects or as a flat double array:

if (USE_P_OBJ_ARRAY) {
  P[i].Pos[k] = LowerBound[k] + Range[k] * drandom;
  P[i].Vel[k] = 0;
} else {
  P_doubles[(P_Po + PDS*i) + k] = LowerBound[k] + Range[k] * drandom;
  P_doubles[(P_Ve + PDS*i) + k] = 0;
}

Execution Time for the Cluster Formation Simulation

Summary
MPJ Express (www.mpj-express.org) is an environment for MPI-like parallel programming in Java. It was conceived as having an expandable set of “devices”, allowing different underlying implementations of message passing. The software explicitly manages internal memory used for sending and receiving messages. We parallelized Gadget-2 using MPJ Express and managed to get good performance.

Future Work
MPJ Express performance can be improved: improving or removing the intermediate buffering layer (this layer implies additional copying)
Develop JNI wrappers to native MPI libraries
Debugging and profiling tools for MPJ Express
Support for other high-performance interconnects including Quadrics, InfiniBand, and 10 GigE
A portable runtime system

References
[SCB09a] Shafi A, Carpenter B, Baker M. Nested parallelism for multi-core HPC systems using Java. Journal of Parallel and Distributed Computing 2009; 69(6):532–545.
[SCB09b] Shafi A, Carpenter B, Baker M, Hussain A. A comparative study of Java and C performance in two large scale parallel applications. Concurrency and Computation: Practice and Experience 2009; 21(15):1882–1906.
[BSPF01] Bull JM, Smith LA, Pottage L, Freeman R. Benchmarking Java against C and Fortran for scientific applications. ACM: New York, NY, U.S.A., 2001; 97–105.
[FF97] Fox GC, Furmanski W. Java for parallel computing and as a general language for scientific and engineering simulation and modeling. Concurrency: Practice and Experience 1997; 9(6):415–425.
[NNS03] Nikishkov GP, Nikishkov YG, Savchenko VV. Comparison of C and Java performance in finite element computations. Computers and Structures 2003; 81(24–25):2401–2408.
[MMG00] Moreira JE, Midkiff SP, Gupta M. From flop to megaflops: Java for technical computing. ACM Transactions on Programming Languages and Systems 2000; 22(2):265–295.
http://mpj-express.org