CCA Common Component Architecture Distributed Array Component based on Global Arrays Manoj Krishnan, Jarek Nieplocha High Performance Computing Group Pacific.

Presentation transcript:

CCA Common Component Architecture
Distributed Array Component based on Global Arrays
Manoj Krishnan, Jarek Nieplocha
High Performance Computing Group
Pacific Northwest National Laboratory
CCA Forum

Overview
• Global Arrays
• Distributed Array Component
• Core Capabilities
• Applications
• Future Work

Global Arrays
• shared memory model in context of distributed dense arrays
• complete environment for parallel code development
• compatible with MPI
• ~140 functions
• data locality control similar to distributed memory/message passing model
Figure labels: physically distributed dense array; single, shared data structure; global indexing, e.g., A(4,3) rather than buf(7) on task 2
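The contrast between A(4,3) and buf(7) on task 2 comes down to index translation: the library, not the application, maps a global index to an owning task and a local offset. The block below is a minimal sketch of that mapping for a regular block distribution. It is not the Global Arrays API; `BlockLayout`, `Location`, and `locate` are hypothetical names, and the layout (equal blocks, row-major local storage, zero-based indices) is an assumption made for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical illustration only: not the Global Arrays API.
// A rows x cols array is split into a pr x pc grid of equal blocks
// (rows divisible by pr, cols divisible by pc), one block per task.
struct BlockLayout {
    std::size_t rows, cols;  // global array shape
    std::size_t pr, pc;      // process grid shape
};

struct Location {
    std::size_t task;    // rank that owns the element
    std::size_t offset;  // index into that task's local buffer (row-major)
};

// Translate a zero-based global index (i, j) into (owning task, local offset).
inline Location locate(const BlockLayout& l, std::size_t i, std::size_t j) {
    assert(l.pr > 0 && l.pc > 0);
    assert(l.rows % l.pr == 0 && l.cols % l.pc == 0);
    assert(i < l.rows && j < l.cols);
    const std::size_t br = l.rows / l.pr;  // block height
    const std::size_t bc = l.cols / l.pc;  // block width
    const std::size_t task = (i / br) * l.pc + (j / bc);
    const std::size_t offset = (i % br) * bc + (j % bc);
    return {task, offset};
}
```

With a global view, application code only ever writes the (i, j) pair; the task and offset stay inside the library.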

Global Array Model of Computations
Figure labels:
• get: copy from the shared object to local memory (1-sided communication)
• compute/update in local memory
• put: copy from local memory to the shared object (1-sided communication)
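The get, compute, put cycle can be sketched with an in-memory stand-in for the shared object. Everything here is hypothetical: `ToySharedArray` and `add_to_patch` are illustrative names, the array is 1-D with equal chunks per task, and no real communication takes place. In Global Arrays itself the transfers would go through the one-sided communication layer.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy stand-in for a shared 1-D array split into equal chunks, one per task.
// Hypothetical type: no real one-sided communication happens here.
class ToySharedArray {
public:
    ToySharedArray(std::size_t ntasks, std::size_t chunk)
        : chunk_(chunk), parts_(ntasks, std::vector<double>(chunk, 0.0)) {
        assert(chunk > 0);
    }

    std::size_t size() const { return chunk_ * parts_.size(); }

    // "get": copy global elements [lo, hi) into a caller-owned local buffer.
    std::vector<double> get(std::size_t lo, std::size_t hi) const {
        assert(lo <= hi && hi <= size());
        std::vector<double> local;
        local.reserve(hi - lo);
        for (std::size_t g = lo; g < hi; ++g)
            local.push_back(parts_[g / chunk_][g % chunk_]);
        return local;
    }

    // "put": copy a local buffer back into global elements starting at lo.
    void put(std::size_t lo, const std::vector<double>& local) {
        assert(lo + local.size() <= size());
        for (std::size_t k = 0; k < local.size(); ++k) {
            const std::size_t g = lo + k;
            parts_[g / chunk_][g % chunk_] = local[k];
        }
    }

private:
    std::size_t chunk_;
    std::vector<std::vector<double>> parts_;
};

// One get / compute / put cycle: add `delta` to global elements [lo, hi).
inline void add_to_patch(ToySharedArray& a, std::size_t lo, std::size_t hi,
                         double delta) {
    std::vector<double> local = a.get(lo, hi);  // get
    for (double& x : local) x += delta;         // compute on local memory
    a.put(lo, local);                           // put
}
```

The caller names a global range; which tasks own the data is resolved inside `get` and `put`, so a patch may span several owners.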

Structure of GA
• application interfaces: Fortran 77, C, C++, Python, SIDL
• distributed arrays layer: memory management, index translation
• Message Passing: process creation, run-time environment
• ARMCI: portable 1-sided communication (put, get, locks, etc.)
• system specific interfaces: LAPI, GM/Myrinet, Elan/Quadrics, threads, VIA, ..

Distributed Array Component
• GAComponent
• GAComponent: Classic and SIDL Interfaces
• (direct+indirect) global arrays classic methods are available through GAClassicPort
• GADADFPort provides methods, proposed by the Data Working Group of the CCA Forum, for creating array descriptors and templates
Figure labels: GA; Linear Algebra; DADF; GA Classic

GA Classic Port
– provides public interfaces for creating and accessing distributed arrays, i.e., GlobalArray objects
GlobalArray
– encapsulates all details of the data distribution, addressing, and data access
– offers a set of operations for:
    one-sided data transfer (get, put, scatter, gather, etc.)
    collective array operations
    supportive operations for data locality control and queries

class GAClassicPort : public virtual ::classic::gov::cca::Port {
    /* array creation methods, for example */
    virtual GlobalArray* createGA(…) = 0;
    virtual GlobalArray* createGA_Ghosts(…) = 0;
    /* utility operations like reduce, broadcast, etc. */
    virtual void brdcst(void *buf, int lenbuf, int root) = 0;
    /* cluster & process information, e.g. rank, size */
    nnodes(), clusterNnodes(), clusterNodeid(), clusterNprocs()
    /* interprocess synchronization: locks, barrier */
    lock(), unlock(), sync(), fence(), createMutexes(), …
}
/* Total: 36 methods available thru’ this port */

class GlobalArray {
    /* one-sided communication operations */
    put(), get(), accumulate(), scatter(), gather(), ...
    /* collective array operations (whole and patch) */
    copy(), scale(), add(), gemm(), update_ghosts(), ...
    /* element-wise operations, ghost cell methods, matrix operations, etc. */
}
/* Total: 98 methods available */

Core Capabilities
• Distributed array
    – dense arrays, 1-7 dimensions
    – four data types: integer, real, double precision, double complex
    – global rather than per-task view of data structures
    – user control over data distribution: regular and irregular
• Collective and shared-memory style operations
• Support for ghost cells
• Interfaces to third party parallel numerical libraries: PeIGS, ScaLAPACK, SUMMA, TAO
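Ghost cells are local copies of the boundary values owned by neighbouring tasks, refreshed before a stencil-style computation. The sketch below shows one possible 1-D refresh. It is not GA code: `Chunk` and `update_ghosts_1d` are hypothetical names, each task is assumed to hold one ghost cell per side, and the domain is assumed non-periodic so the outermost ghosts are left untouched.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch, not GA code: each task holds its owned cells plus one
// ghost cell on each side: [left ghost | owned cells | right ghost].
using Chunk = std::vector<double>;

// Refresh ghost cells from the neighbouring chunks' boundary values.
// Assumption: non-periodic domain, so the outermost ghosts are left untouched.
inline void update_ghosts_1d(std::vector<Chunk>& chunks) {
    for (const Chunk& c : chunks)
        assert(c.size() >= 3);  // at least one owned cell plus two ghosts

    for (std::size_t t = 0; t < chunks.size(); ++t) {
        Chunk& c = chunks[t];
        if (t > 0) {
            const Chunk& left = chunks[t - 1];
            c.front() = left[left.size() - 2];  // neighbour's last owned cell
        }
        if (t + 1 < chunks.size()) {
            c.back() = chunks[t + 1][1];        // neighbour's first owned cell
        }
    }
}
```

Only owned cells are read and only ghost cells are written, so the refresh can run in place without a temporary copy.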

GA DADF Port
• Provides standard interface for defining, creating and querying distributed arrays
    – Supports creating, cloning and destruction of arrays, array templates and descriptors
    – DADF: Distributed Array Descriptor Factory, by the Data Working Group of the CCA Forum
• DADF Array
    – creates a distributed array
• DADF Template
    – virtual multi-dimensional array to which one or more actual distributed arrays may be aligned
• DADF Descriptor
    – to query an existing distributed array

class DADFPort : public virtual ::classic::gov::cca::Port {
    /* methods to create/clone/destroy descriptors, arrays, templates */
    virtual DistArrayDescriptor* createDescriptor(..) = 0;
    virtual DistArray* createArray(…) = 0;
    virtual DistArrayTemplate* createTemplate(…) = 0;
    ...
}

class DistArray {
    /** Set data type. */
    virtual int setDataType(const enum DataType type) = 0;
    /** Associate this data object with distribution template. */
    virtual int setTemplate(DistArrayTemplate * & templ) = 0;
    /** Sets this process's location in the process topology. */
    virtual int setMyProcCoords(const int procCoords[]) = 0;
    /** Align object to template with identity mapping. */
    virtual int setIdentityAlignmentMap() = 0;
    /** Signal that data object is completely defined. */
    virtual int commit() = 0;
    ...
    /* set of query & miscellaneous functions */
}
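The DistArray interface follows a define-then-commit pattern: attributes are set one by one, and commit() signals that the definition is complete. The toy class below sketches that pattern under stated assumptions. The slide shows only the abstract interface, so `ToyDistArray`, `ToyTemplate`, `define_toy_array`, the enum values, the simplified argument types, and the convention that 0 means success are all hypothetical.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-ins: the enum values, the template type, and the
// "0 = success" return convention are assumptions, not the DADF definitions.
enum DataType { dt_unset, dt_int, dt_double };
struct ToyTemplate { std::vector<int> shape; };

class ToyDistArray {
public:
    int setDataType(DataType type) {
        if (committed_ || type == dt_unset) return -1;
        type_ = type;
        return 0;
    }
    int setTemplate(ToyTemplate* templ) {
        if (committed_ || templ == nullptr) return -1;
        templ_ = templ;
        return 0;
    }
    int setMyProcCoords(const std::vector<int>& coords) {
        if (committed_) return -1;
        coords_ = coords;
        return 0;
    }
    int setIdentityAlignmentMap() {
        if (committed_ || templ_ == nullptr) return -1;  // needs a template
        aligned_ = true;
        return 0;
    }
    // Freeze the definition; refuse if a required piece is missing.
    int commit() {
        if (committed_ || type_ == dt_unset || templ_ == nullptr || !aligned_)
            return -1;
        committed_ = true;
        return 0;
    }
    bool committed() const { return committed_; }

private:
    DataType type_ = dt_unset;
    ToyTemplate* templ_ = nullptr;
    std::vector<int> coords_;
    bool aligned_ = false;
    bool committed_ = false;
};

// Typical call sequence: define every attribute, then commit once.
inline bool define_toy_array(ToyDistArray& a, ToyTemplate& t) {
    assert(!a.committed());  // precondition: array is still being defined
    return a.setDataType(dt_double) == 0 && a.setTemplate(&t) == 0 &&
           a.setMyProcCoords({0, 0}) == 0 &&
           a.setIdentityAlignmentMap() == 0 && a.commit() == 0;
}
```

Rejecting setters after commit() is one way to keep a committed array's description stable; whether the actual DADF implementation enforces this is not stated on the slide.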

Class Hierarchy
Figure labels: DistArrayTemplate, DistArray; DADFDescriptor, DADFArray, DADFTemplate; GAX; Example DAs

GA/TAO Interoperability
Figure labels: GA component (addProvidesPort; ports GA, DADF, LA); TAO component (registerUsesPort; getPort(“ga”)); CCA Services on both sides
• TAO: optimization component (Toolkit for Advanced Optimization, ANL) that provides advanced optimization algorithms
• GA provides TAO with core linear algebra support for manipulating vectors, matrices, and linear solvers thru’ LinearAlgebraPort (LA)
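The provides/uses wiring between the two components can be illustrated with a miniature registry. This is a sketch, not the CCA Services interface: only the names addProvidesPort, registerUsesPort, getPort, and LinearAlgebraPort come from the slide. `ToyServices`, `ToyLinearAlgebra`, the `dot` method, and the use of a single shared registry are assumptions made to keep the example short.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <set>
#include <string>
#include <utility>

// Hypothetical miniature of the provides/uses pattern. A real framework
// connects separate per-component services objects; this toy uses one.
struct Port {
    virtual ~Port() = default;
};

// Only the port name comes from the slide; `dot` is an illustrative method.
struct LinearAlgebraPort : Port {
    virtual double dot(const double* x, const double* y, int n) const = 0;
};

// Provider-side implementation (stands in for the GA component).
struct ToyLinearAlgebra : LinearAlgebraPort {
    double dot(const double* x, const double* y, int n) const override {
        double s = 0.0;
        for (int i = 0; i < n; ++i) s += x[i] * y[i];
        return s;
    }
};

class ToyServices {
public:
    // Provider announces an implementation under a port name.
    void addProvidesPort(const std::string& name, std::shared_ptr<Port> port) {
        provided_[name] = std::move(port);
    }
    // User declares that it intends to call through this port name.
    void registerUsesPort(const std::string& name) { used_.insert(name); }

    // Returns the port only if it was both registered as used and provided.
    std::shared_ptr<Port> getPort(const std::string& name) const {
        if (used_.count(name) == 0) return nullptr;
        auto it = provided_.find(name);
        if (it == provided_.end()) return nullptr;
        return it->second;
    }

private:
    std::map<std::string, std::shared_ptr<Port>> provided_;
    std::set<std::string> used_;
};
```

The user component never names the provider's concrete class; it sees only the abstract port obtained by name, which is what lets one component be swapped for another.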

GA Component in Applications (I)
Figure labels: GA component (addProvidesPort; ports GA, DADF, LA); LJMD component (registerUsesPort; getPort(“ga”)); CCA Services on both sides
• Lennard Jones Molecular Dynamics
• Force decomposition method & dynamic load balancing (improves performance over the traditional message-passing version by S. Plimpton, Sandia)
• Component overhead is negligible (<1%)
• Good scaling (simulation of 12,000 atoms yields a speedup of 7.86 on 8 processors)
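As a worked check on the scaling figure, the speedup can be converted to parallel efficiency using the standard definition (speedup divided by processor count). The symbols S_p and E_p are introduced here for illustration and do not appear on the slide.

```latex
E_p = \frac{S_p}{p} = \frac{7.86}{8} = 0.9825 \approx 98\%
```

An efficiency of about 98% on 8 processors is in line with the "good scaling" claim for the 12,000-atom run.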

CCA Common Component Architecture  Chemistry: Molecular geometry optimization (between GA and TAO) GA Component in Applications (II)

Figure labels: GA component (addProvidesPort; ports GA, DADF, LA); Solver, CFD, and Visualization components (each with registerUsesPort and getPort(“ga”)); CCA Services on every component

Application Areas
• thermal flow simulation
• visualization and image analysis
• electronic structure
• glass flow simulation
• material sciences
• molecular dynamics
• Others: financial security forecasting, astrophysics, geosciences, biology

Future Work
• Additional capabilities in the GA component, including operations necessary for supporting more TAO optimization algorithms. This will also involve new nonblocking communication interfaces.
• Implementation of a component that interfaces with secondary storage (parallel I/O)
• Verify component usability for large apps
• Study performance and overhead associated with CCA
• ESI (or any generic solver) interfaces to the distributed array component

Feedback
• Provide a generic distributed array component
• We would like to know
    – Applications that need distributed array components
    – Functionalities expected from apps
    – Additions/modifications required
    – Suggestions to make it more generic
    – Communication interfaces in DADF (put/get)..?
• Setting up priorities based on feedback