1 A node-level programming model framework for exascale computing*
Chunhua (Leo) Liao, Stephen Guzik, Dan Quinlan
Lawrence Livermore National Laboratory
* Proposed for LDRD FY'12, initially funded by ASC/FRIC and now being moved back to LDRD. LLNL-PRES

2 We are building a framework for creating node-level parallel programming models for exascale

- Problem: exascale machines pose ever more challenges to programming models, and parallel programming models, though essential, increasingly lag behind node-level architectures.
- Goal: speed up the design, evolution, and adoption of programming models for exascale.
- Approach: identify and implement the common building blocks of node-level programming models, so that both researchers and developers can quickly construct or customize their own models.
- Deliverables: a node-level programming model framework (PMF) with building blocks at the language, compiler, and library levels, plus example programming models built using the PMF.

3 Programming models bridge algorithms and machines and are implemented through components of the software stack

Measures of success: expressiveness, performance, programmability, portability, efficiency, ...

[Diagram: an algorithm is expressed as an application written against a programming model's abstract machine; the software stack (language, compiler, library) compiles and links the application into an executable, which executes on the real machine.]

4 Parallel programming models are built on top of sequential ones and use a combination of language, compiler, and library support

Abstract machines (overly simplified): sequential (one CPU and its memory); shared memory (multiple CPUs sharing one memory); distributed memory (multiple CPU-memory nodes connected by an interconnect).

Software stack (language / compiler / library) for each model:
- Sequential: general-purpose language (GPL: C/C++/Fortran); sequential compiler; optional sequential libraries.
- Shared memory (e.g. OpenMP): GPL + directives; sequential compiler with OpenMP support; OpenMP runtime library.
- Distributed memory (e.g. MPI): GPL + calls to the MPI library; sequential compiler; MPI library.
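To make the table concrete, here is a minimal sketch, using standard OpenMP and MPI, of the same loop expressed under each row. The models come from the slide; this particular loop and the chunked decomposition are our own illustration.

```cpp
#include <mpi.h>
#include <vector>

// Sequential row: general-purpose language only.
void add_seq(const double* a, const double* b, double* c, int n) {
    for (int i = 0; i < n; ++i) c[i] = a[i] + b[i];
}

// Shared-memory row: same language plus a directive; needs a compiler
// with OpenMP support and the OpenMP runtime library.
void add_omp(const double* a, const double* b, double* c, int n) {
#pragma omp parallel for
    for (int i = 0; i < n; ++i) c[i] = a[i] + b[i];
}

// Distributed-memory row: same language plus explicit library calls; a
// plain sequential compiler linked against the MPI library suffices.
void add_mpi(const double* a, const double* b, double* c, int n) {
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    int chunk = n / size, lo = rank * chunk;   // assumes size divides n
    std::vector<double> local(chunk);
    for (int i = 0; i < chunk; ++i) local[i] = a[lo + i] + b[lo + i];
    MPI_Allgather(local.data(), chunk, MPI_DOUBLE,
                  c, chunk, MPI_DOUBLE, MPI_COMM_WORLD);
}
```

Note the contrast the slide draws: the OpenMP version changes the program only through a directive interpreted by the compiler, while the MPI version leaves the language untouched and moves all parallelism into library calls.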

5 Problem: programming models will become a limiting factor for exascale computing if no drastic measures are taken

- Future exascale architectures: clusters of many-core nodes with abundant threads; deep memory hierarchies and CPU+GPU heterogeneity; power and resilience constraints; ...
- Node-level programming models: an increasingly complex design space, with conflicting goals of performance, power, productivity, and expressiveness.
- Current situation: programming model researchers struggle, designing and building individual models one at a time, to find the right one in a huge design space; application developers are stuck with stale models, caught between insufficient high-level models and tedious low-level ones.

6 Solution: we are building a programming model framework (PMF) to address exascale challenges

A three-level, open framework to facilitate building node-level programming models for exascale architectures.

[Diagram: Level 1, language extensions (directive 1 ... directive n); Level 2, compiler support based on ROSE (tool 1 ... tool n); Level 3, runtime library (function 1 ... function n). Programming models 1 through n are each assembled by reusing and customizing language extensions, compiler support, and runtime library pieces from the three levels.]

7 We will serve both researchers and developers, engage lab applications, and target heterogeneous architectures

- Users: programming model researchers exploring the design space, and experienced application developers building custom models that target current and future machines.
- Scope of this project: DOE/LLNL applications; heterogeneous architectures (CPUs + GPUs); example building blocks for parallelism, heterogeneity, data locality, power efficiency, thread scheduling, etc.; two major example programming models built using the PMF.

The programming model framework vastly increases the flexibility in how the HPC software stack can be used for application development.

8 Example 1: researchers use the programming model framework to extend a higher-level model (OpenMP) to support GPUs

- OpenMP: a popular, high-level node programming model for shared-memory programming, with high demand for GPU support within a node.
- The PMF provides a set of selectable, customizable building blocks:
  - Language: directives such as #acc_region, #data_region, #acc_loop, #data_copy, #device, etc.
  - Compiler: parser builder, outliner, loop tiling, loop collapsing, dependence analysis, etc., based on ROSE.
  - Runtime: thread management, task scheduling, data transfer, load balancing, etc.
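As a hedged illustration, user code under such an extension might look like the sketch below. Only the directive names (acc_region, acc_loop) come from the slides; the exact placement and clause-free syntax are assumptions, not the project's defined grammar.

```cpp
// Hypothetical user code under the extended OpenMP model.
void axpy(int n, double alpha, const double* x, double* y) {
#pragma omp acc_region          // assumed: offload the enclosed region to the GPU
    {
#pragma omp acc_loop            // assumed: map loop iterations to GPU threads
        for (int i = 0; i < n; ++i)
            y[i] = alpha * x[i] + y[i];
    }
}
```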

9 Using the PMF to extend OpenMP for GPUs

[Diagram: the three framework levels instantiated for this extension. Level 1, language extensions: #pragma omp acc_region, #pragma omp acc_loop, #pragma omp acc_region_loop. Level 2, compiler support (ROSE): Pragma_parsing(), Outlining_for_GPU(), Insert_runtime_call(), Optimize_memory(). Level 3, runtime library: Dispatch_tasks(), Balancing_load(), Transfer_data(). Reused and customized together, these building blocks yield "OpenMP extended for GPUs".]
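To make the pipeline concrete, here is a purely illustrative sketch of what ROSE-based outlining might emit for the acc_loop example above: the loop body becomes a CUDA kernel, and plain CUDA calls stand in for the runtime's Transfer_data() and Dispatch_tasks() (only those function names appear on the slide; their real signatures are unknown, so none are invented here).

```cuda
#include <cuda_runtime.h>

// Outlined kernel: the former loop body, one iteration per GPU thread.
__global__ void axpy_outlined(int n, double alpha, const double* x, double* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = alpha * x[i] + y[i];
}

// Rewritten host function: explicit data movement and kernel dispatch play
// the roles the slide assigns to Transfer_data() and Dispatch_tasks().
void axpy(int n, double alpha, const double* x, double* y) {
    double *dx, *dy;
    cudaMalloc(&dx, n * sizeof(double));
    cudaMalloc(&dy, n * sizeof(double));
    cudaMemcpy(dx, x, n * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(dy, y, n * sizeof(double), cudaMemcpyHostToDevice);
    int threads = 256, blocks = (n + threads - 1) / threads;
    axpy_outlined<<<blocks, threads>>>(n, alpha, dx, dy);
    cudaMemcpy(y, dy, n * sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(dx);
    cudaFree(dy);
}
```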

10 Example 2: application developers use the PMF to explore a lower-level, domain-specific programming model

- Target lab application: a Lattice-Boltzmann algorithm with adaptive mesh refinement, used in direct numerical simulation studies of how wall roughness affects the transition to turbulence. Its core is stencil operations on structured arrays.
- Requirements: concurrent, balanced execution on CPUs and GPUs; users do not want to hand-translate OpenMP to GPU code, yet want the power to express lower-level details such as data decomposition; exploit domain features, namely a box-based approach for describing data layout and the regions numerical solvers act on; target current and future architectures.
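The box-based approach can be sketched as follows. Everything here, the Box struct and the function name, is hypothetical and intended only to show why boxes suit stencil codes; the project's actual API is not shown on the slides.

```cpp
// Hypothetical sketch: a Box names a rectangular index region of a
// structured array, and solvers sweep stencils box by box.
struct Box {
    int lo[2], hi[2];   // inclusive index bounds of the region
};

// One 5-point stencil sweep over a box (assumes the box lies in the
// array's interior, so the +/-1 neighbors exist).
void stencil_on_box(const Box& b, int stride, const double* in, double* out) {
    for (int j = b.lo[1]; j <= b.hi[1]; ++j)
        for (int i = b.lo[0]; i <= b.hi[0]; ++i)
            out[j * stride + i] = 0.25 * (in[j * stride + i - 1] +
                                          in[j * stride + i + 1] +
                                          in[(j - 1) * stride + i] +
                                          in[(j + 1) * stride + i]);
}
```

Because each box is an independent unit of work over an explicit region, a runtime can assign some boxes to CPU cores and others to the GPU, which is exactly the concurrent, balanced execution the requirements call for.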

11 Using the PMF to implement the domain-specific programming model (ongoing work with many unknown details)

- Language features: algorithms are described in a sequential language plus CUDA and pragmas: C++ for the main algorithm infrastructure, pragmas for gluing and supplemental semantics, and CUDA to describe kernels.
- Compiler support: a first compilation, assembled from the framework's building blocks, performs custom code generation for multiple architectures (architecture A, architecture B, ...) and generates code to help with chores.
- A final compilation with native compilers, linked against a runtime library that handles scheduling among CPUs and GPUs, turns the generated source into the executable.
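A speculative sketch of what one such mixed source file might look like before the first compilation. The #pragma pmf spelling and every name below are invented for illustration; the slide says only that C++, pragmas, and CUDA are combined.

```cuda
#include <cstddef>
#include <vector>

struct Box { int lo[2], hi[2]; };          // as in the previous sketch

// CUDA piece: the user describes a kernel for one box's worth of cells.
__global__ void lbm_collide(double* f, int ncells);

// C++ piece: main algorithm infrastructure.
void process_box(const Box& b, double* f);

// Pragma piece: gluing and supplemental semantics for the first compiler
// pass; "#pragma pmf ..." is a made-up spelling, not the project's syntax.
void advance(std::vector<Box>& boxes, double* f) {
#pragma pmf distribute(boxes) schedule(cpu, gpu)
    for (std::size_t b = 0; b < boxes.size(); ++b)
        process_box(boxes[b], f);          // rewritten per architecture by
                                           // the first compilation
}
```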

12 Summary

- We are building a framework, rather than a single programming model, for exascale node architectures: building blocks at the language, compiler, and runtime levels, plus two major example programming models.
- Programming model researchers can quickly design and implement solutions to exascale challenges, e.g. exploring OpenMP extensions for GPUs.
- Experienced application developers gain the ability to directly change the software stack, e.g. to compose domain-specific programming models.

13 Thank you!