Presentation transcript:

Presented by:
High Productivity Language Systems: Next-Generation Petascale Programming
Wael R. Elwasif, David E. Bernholdt, and Robert J. Harrison
Computer Science and Mathematics Division, Oak Ridge National Laboratory

Slide 2: Guiding the development of the next-generation HPC programming environments

Candidate languages:
- Chapel (Cray)
- Fortress (Sun)
- X10 (IBM)

Goal: chart the course toward high-productivity, high-performance scientific programming on petascale machines.

Part of the DARPA HPCS program. Joint project with:
- Argonne National Laboratory
- Lawrence Berkeley National Laboratory
- Rice University

Slide 3: A revolutionary approach to large-scale parallel programming
- Emphasis on performance and productivity
- Not SPMD:
  - Lightweight "threads", lots of them
  - Different approaches to locality awareness and management
- High-level (sequential) language constructs
- Rich array data types (part of the base languages)
- Strongly typed, object-oriented base design
- Extensible language model
- Generic programming
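
The slides only name these features, so a rough analogue in plain C++ (the language of the MADNESS code later in this deck) may help fix ideas. This is a minimal sketch written for this transcript: a generic array defined over explicit index ranges, loosely in the spirit of the "rich array data types" and index sets mentioned above. The Range and Grid names and the 1-based bounds are invented here; they are not constructs of Chapel, Fortress, or X10.

#include <cstddef>
#include <iostream>
#include <vector>

// Illustrative stand-in for a 1-D index set (a "region"/"domain" in the HPLS languages).
struct Range {
    std::ptrdiff_t lo, hi;   // inclusive bounds
    std::size_t size() const { return static_cast<std::size_t>(hi - lo + 1); }
};

// Generic dense 2-D array defined over two index ranges.
template <typename T>
class Grid {
public:
    Grid(Range rows, Range cols)
        : rows_(rows), cols_(cols), data_(rows.size() * cols.size()) {}

    T& operator()(std::ptrdiff_t i, std::ptrdiff_t j) {
        return data_[static_cast<std::size_t>(i - rows_.lo) * cols_.size()
                     + static_cast<std::size_t>(j - cols_.lo)];
    }

private:
    Range rows_, cols_;
    std::vector<T> data_;
};

int main() {
    Grid<double> a(Range{1, 4}, Range{1, 4});   // indices run 1..4
    a(2, 3) = 1.5;
    std::cout << a(2, 3) << "\n";
}

In the HPLS languages the index set is a first-class object that can also carry a distribution across the machine; here it only carries bounds, which is precisely the gap those languages aim to close.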

Slide 4: Concurrency: the next generation
- Single initial thread of control; parallelism through language constructs
- True global view of memory; one-sided access model
- Support for both task and data parallelism
- "Threads" grouped by "memory locality"
- Extensible, rich distributed-array capability
- Advanced concurrency constructs:
  - Parallel loops
  - Generator-based looping and distributions
  - Local and remote futures
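
The spawn-and-future style listed above can be suggested, shared-memory only, with standard C++ tasks. A hedged sketch follows, invented for this transcript; it shows lightweight tasks synchronizing through futures, but says nothing about the remote futures, locality grouping, or global view of memory that the HPLS languages add.

#include <future>
#include <iostream>
#include <vector>

// Toy recursive task tree: each call spawns its children as asynchronous tasks
// and collects their results through futures, echoing the spawn/future style
// of the candidate languages (shared memory only; no remote activities here).
long count_nodes(int depth) {
    if (depth == 0) return 1;
    std::vector<std::future<long>> children;
    for (int c = 0; c < 2; ++c)
        children.push_back(std::async(std::launch::async, count_nodes, depth - 1));
    long total = 1;                              // count this node
    for (auto& f : children) total += f.get();   // futures join parent and children
    return total;
}

int main() {
    std::cout << count_nodes(4) << " nodes\n";   // full binary tree: 2^5 - 1 = 31
}

Note, however, that each std::launch::async task runs on its own operating-system thread, whereas the slide's point is that these languages make "threads" cheap enough to create in very large numbers.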

Slide 5: What about productivity?
- Index sets/regions for arrays
- "Array language" (Chapel, X10)
- Safe(r) and more powerful language constructs:
  - Atomic sections vs. locks
  - Sync variables and futures
  - Clocks (X10)
- Type inference
- Leverage advanced IDE capabilities
- Units and dimensions (Fortress)
- Component management, testing, contracts (Fortress)
- Math/science-based presentation (Fortress)
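
Of these, atomic sections are perhaps easiest to appreciate by contrast with what they replace. Below is a minimal C++ sketch of the explicit-lock baseline, invented for this transcript: the programmer must choose the mutex, hold it around every access, and avoid deadlock by hand, whereas an atomic section in the HPLS languages states only which statements must appear indivisible.

#include <iostream>
#include <mutex>
#include <thread>
#include <vector>

// Lock-based critical section: the bookkeeping an atomic section would hide.
std::mutex m;
long counter = 0;

void add_many(int n) {
    for (int i = 0; i < n; ++i) {
        std::lock_guard<std::mutex> guard(m);   // explicit critical section
        ++counter;
    }
}

int main() {
    std::vector<std::thread> workers;
    for (int t = 0; t < 4; ++t) workers.emplace_back(add_many, 100000);
    for (auto& w : workers) w.join();
    std::cout << counter << "\n";               // 400000
}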

Slide 6: An exercise in MADNESS!
- MADNESS: Multiresolution ADaptive NumErical Scientific Simulation
- Multiresolution, multiwavelet representation for quantum chemistry and other domains
- Adaptive mesh in 1-6+ dimensions with very dynamic refinement
- Distribution by subtrees == spatial decomposition
- The sequential code is written very compactly using recursion
- The initial parallel code using MPI is about 10x larger

Slide 7: The MADNESS continues

Explicit MPI version. The complexity of the MPI code arises mostly from using an SPMD model to maintain a consistent distributed data structure: the explicitly distributed code must work on both active and inactive nodes, which is not necessary with a global view and remote activity creation.

template <typename T>
void Function<T>::_refine(OctTreeTPtr& tree) {
    if (tree->islocalsubtreeparent() && isremote(tree)) {
        FOREACH_CHILD(OctTreeTPtr, tree,
            bool dorefine;
            comm()->Recv(dorefine, tree->rank(), REFINE_TAG);
            if (dorefine) {
                set_active(child);
                set_active(tree);
                _project(child);
            }
            _refine(child););
    } else {
        TensorT* t = coeff(tree);
        if (t) {
            TensorT d = filter(*t);
            d(data->cdata->s0) = 0.0;
            bool dorefine = (d.normf() > truncate_tol(data->thresh, tree->n()));
            if (dorefine) unset_coeff(tree);
            FOREACH_REMOTE_CHILD(OctTreeTPtr, tree,
                comm()->Send(dorefine, child->rank(), REFINE_TAG);
                set_active(child););
            FORIJK(OctTreeTPtr child = tree->child(i,j,k);
                if (!child && dorefine) child = tree->insert_local_child(i,j,k);
                if (child && islocal(child)) {
                    if (dorefine) {
                        _project(child);
                        set_active(child);
                    }
                    _refine(child);
                });
        } else {
            FOREACH_REMOTE_CHILD(OctTreeTPtr, tree,
                comm()->Send(false, child->rank(), REFINE_TAG););
            FOREACH_LOCAL_CHILD(OctTreeTPtr, tree, _refine(child););
        }
    }
}

Cilk-like multithreaded version; all of the HPLS solutions look like this:

template <typename T>
void Function<T>::_refine(OctTreeTPtr tree) {
    FOREACH_CHILD(OctTreeTPtr, tree, _project(child););
    TensorT* t = coeff(tree);
    TensorT d = filter(*t);
    if (d.normf() > truncate_tol(data->thresh, tree->n())) {
        unset_coeff(tree);
        FOREACH_CHILD(OctTreeTPtr, tree, spawn _refine(child););
    }
}

Slide 8: Tradeoffs in HPLS language design
- Emphasis on parallel safety (X10) vs. expressivity (Chapel, Fortress)
- Locality control and awareness:
  - X10: explicit placement and access
  - Chapel: user-controlled placement, transparent access
  - Fortress: placement "guidance" only; local vs. remote access is blurry (data may move!)
  - What about mental performance models?
- Programming-language representation:
  - Fortress: allows a math-like representation
  - Chapel, X10: traditional programming-language front end
  - How much do developers gain from a mathematical representation?
- Productivity/performance tradeoff: different users have different "sweet spots"

Slide 9: Remaining challenges
- (Parallel) I/O model
- Interoperability with existing languages and programming models
- Better (preferably portable) performance models and scalable memory models, especially for machines with 1M+ processors
- Other considerations:
  - A viable gradual-adoption strategy
  - Building a complete development ecosystem

Slide 10: Contacts
- Wael R. Elwasif, Computer Science and Mathematics Division, (865)
- David E. Bernholdt, Computer Science and Mathematics Division, (865)
- Robert J. Harrison, Computer Science and Mathematics Division, (865)