Higher-Level Languages on an Adaptive Run-Time System
Chao Huang
Parallel Programming Lab, University of Illinois at Urbana-Champaign
Motivation
Productivity and performance: different algorithms call for different tools. Charm++/AMPI provides a powerful adaptive run-time system, and a spectrum of models runs on top of it: Charm++ and MPI (AMPI) give local data and local control; MSA gives a global view of data; Charisma and OpenMP give a global view of control.
When the appropriate tool is used, development time can be cut without losing much of the performance. However, in some very complicated parallel programs under this framework, we observed an obscure overall flow of control, due to the huge number of parallel objects and asynchronous method invocations. For some other parallel algorithms, a global view of data is more convenient for programming.
5/21/2019 Charm++ Workshop
Outline
Motivation
Expressing Flow of Control
Charisma
Multiphase Shared Array (MSA)
Conclusions
Example: MD
Structure of a simple MD simulation on patches (cells), with a fixed receive order:

  MPI_Recv(angle_buf, ..., ANGLE_SRC, ANGLE_TAG, ...);
  /* calculate angle forces */
  MPI_Recv(pair_left_buf, ..., PAIR_LEFT_SRC, PAIR_LEFT_TAG, ...);
  MPI_Recv(pair_right_buf, ..., PAIR_RIGHT_SRC, PAIR_RIGHT_TAG, ...);
  /* calculate pairwise forces */

The same structure with wildcard receives, dispatching on message type:

  MPI_Recv(buf, ..., MPI_ANY_SOURCE, MPI_ANY_TAG, ...);
  switch (GET_TYPE(buf)) {
    case FOR_ANGLE:      /* calculate angle forces */
    case FOR_PAIR_LEFT:
    case FOR_PAIR_RIGHT: /* calculate pairwise forces */
  }

A few hundred patches (cells), several hundred computes (cellpairs), and thousands of PME objects, with different numbers of different objects on each PE. Manual allocation makes load balancing hard, and wildcards break down when there is deeper flow of control (e.g., PME). Charm++ gives a clean abstraction and separation of entities for load balancing.
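To make the wildcard-receive pattern concrete, here is a minimal plain-C++ sketch (no MPI) of type-dispatched message handling: messages carry a type tag and are processed in whatever order they arrive, as in the switch-based version above. The `Msg`/`dispatchAll` names and the string log are invented for illustration.

```cpp
#include <queue>
#include <string>
#include <vector>

// Message types standing in for the tags on the slide.
enum MsgType { FOR_ANGLE, FOR_PAIR_LEFT, FOR_PAIR_RIGHT };

struct Msg { MsgType type; double payload; };

// Process messages in arrival order, dispatching on type, as the
// wildcard-receive version of the MD loop does.
std::vector<std::string> dispatchAll(std::queue<Msg> inbox) {
    std::vector<std::string> log;
    while (!inbox.empty()) {
        Msg m = inbox.front();
        inbox.pop();
        switch (m.type) {
        case FOR_ANGLE:      log.push_back("angle");      break;
        case FOR_PAIR_LEFT:  log.push_back("pair_left");  break;
        case FOR_PAIR_RIGHT: log.push_back("pair_right"); break;
        }
    }
    return log;
}
```

The point of the sketch is that the handler no longer encodes an expected arrival order; that flexibility is exactly what becomes hard to follow once the dispatch logic grows deeper.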
Expressing Flow of Control
Charm++: the flow of control is fragmented across object code.

  MainChare::MainChare() {
    cells.sendCoords();
  }
  MainChare::reduceEnergy(energy) {
    totalEnergy += energy;
    if (iter++ < MAX_ITER) cells.sendCoords();
    else CkExit();
  }
  Cell::sendCoords() {
    for index in 26 neighbor cellpairs
      cellpairs(index).recvCoords(coords);
  }
  Cellpair::recvCoords(coords) {
    if not coords from both cells received
      buffer(coords);
      return;
    else  // all coords ready
      forces = calcForces();
      for index in 2 cells
        cells(index).recvForces(forces);
  }
  Cell::recvForces(forces) {
    totalforces += forces;
    if not all forces from all cellpairs received
      return;
    else  // neighborhood reduction completed
      integrate();
      mainProxy.reduceEnergy(energy);
  }

Different communication patterns appear: broadcast, multicast, point-to-point, and reduction.
Expressing Flow of Control (2)
SDag: restricted to an object's life cycle.

  array [3D] Cell {
    ...
    entry void runSim(..) {
      atomic { init(); }
      for (timeStep = 1 to MAX_ITER) {
        atomic { sendCoords(); }
        for (forceCnt = 1 to numForceMsg) {
          when recvForces(..) {
            atomic { addForces(..); }
          }
        }
        atomic { integrate(); }
      }
    };
  };

It still takes effort to express the control flow, and only from the perspective of one object array (still a local view).
Expressing Flow of Control (3)
Charisma: global view of control.

  foreach i,j,k in cells
    <coords[i,j,k]> := cells[i,j,k].produceCoords();
  end-foreach
  for iter := 1 to MAX_ITER
    foreach i1,j1,k1,i2,j2,k2 in cellpairs
      <+cforces[i1,j1,k1], +cforces[i2,j2,k2]> :=
        cellpairs[i1,j1,k1,i2,j2,k2].calcForces(coords[i1,j1,k1], coords[i2,j2,k2]);
    end-foreach
    foreach i,j,k in cells
      <coords[i,j,k], +energy> := cells[i,j,k].integrate(cforces[i,j,k]);
    end-foreach
    MDMain.updateEnergy(energy);
  end-for
Charisma
Expresses a global view of control: the parallel constructs live in the orchestration code, while the sequential code lives separately in user C++ code.
Features:
  Object-level parallelism
  Producer-consumer model
  Rich communication patterns
Implementation:
  Static dependence analysis
  Generates parallel code and integrates the sequential code
  Output is a Charm++ program
User code has minimal parallel involvement.
Charisma (2)
foreach statement: invokes a method on all elements of an object array (object-level parallelism).

  foreach i in myWorkers
    myWorkers[i].doWork();
  end-foreach

Producer-consumer model: sequential code is unaware of the source of its input values and the destination of its output values, and data is sent out as soon as it becomes available. This model can express various communication patterns: point-to-point, broadcast, reduction, multicast, scatter, gather, and permutation operations.

  foreach i in myWorkers
    <p[i]> := workers[i].foo();
    workers[i].bar(p[i+1]);
  end-foreach
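The producer-consumer snippet above can be sketched in plain sequential C++: each worker's foo() produces p[i], and the orchestration layer then delivers a neighbor's value p[i+1] to bar(), so the worker's sequential code never names its communication partner. The `Worker` class and the id*10 payload are invented for illustration (with wraparound at the array end, which the slide's snippet leaves unspecified).

```cpp
#include <vector>

// Each worker produces a value and later consumes a value, without
// knowing where its output goes or where its input came from.
struct Worker {
    int id;
    int produced = 0;
    int received = -1;
    int foo() { produced = id * 10; return produced; }  // producer
    void bar(int v) { received = v; }                   // consumer
};

// Stand-in for the Charisma-generated orchestration: collect every
// <p[i]> := workers[i].foo(), then invoke workers[i].bar(p[i+1]).
void orchestrate(std::vector<Worker>& workers) {
    int n = static_cast<int>(workers.size());
    std::vector<int> p(n);
    for (int i = 0; i < n; ++i) p[i] = workers[i].foo();
    for (int i = 0; i < n; ++i) workers[i].bar(p[(i + 1) % n]);
}
```

In the real system the two loops overlap: a value is forwarded as soon as its producer returns, rather than after a global barrier as in this sequential sketch.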
Charisma (3)
Example: parallel 3D FFT.

  foreach x in planes1
    <pencils[x,*]> := planes1[x].fft1d();
  end-foreach
  foreach y in planes2
    planes2[y].fft2d(pencils[*,y]);
  end-foreach
Charisma (4)
Experimental results:
  2D Jacobi (size 16384^2 on 4096 objects)
  3D FFT (size 512^3 on 256 objects)
The overheads come from synchronization that could be eliminated in hand-written Charm++.
Multiphase Shared Array (MSA)
Provides a global view of data.
Features:
  Phases: ReadOnly, WriteByOne, AccumulateOnly
  Explicit synchronization between phases
Implementation:
  An object array of pages (virtualized global storage)
  A per-processor object array managing the local buffer, serving as the interface between worker arrays and the page array
Can be used with Charm++, AMPI, etc., so the control flow stays local.
Vs. Global Arrays: GA has (remote) get, put, floating-point accumulate, and read-increment operations, but no phases.
MSA (2)
Sample code.

Template instantiation (in the .ci file):

  array [1D] MSA_PageArray<double, DefaultEntry<double>, ENTRIES_PER_PAGE>;
  group MSA_CacheGroup<double, DefaultEntry<double>, ENTRIES_PER_PAGE>;

Declaration and creation:

  MSA1D<...>* msa1d = new MSA1D<...>(NUM_ENTRIES, NUM_ENROLLERS, LOCAL_CACHE_SIZE);

Enrolling, accessing, and synchronizing:

  msa1d->enroll(NUM_ENROLLERS);     // init
  double d = msa1d->get(2);         // read
  double e = msa2d(i, j);           // read (2D variant)
  msa1d->sync();                    // sync (phase change)
  msa1d->set(3) = d;                // write
  msa1d->accumulate(i, newValueI);  // accumulate
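The phase discipline can be illustrated with a toy single-process model: the array is in exactly one mode at a time, out-of-phase accesses are errors, and sync() flips the phase. The `ToyMSA` class is invented for illustration (it models only two of the three modes and none of the paging or caching); it is not the real MSA template.

```cpp
#include <stdexcept>
#include <vector>

// Toy model of MSA phase semantics: accumulate-only until sync(),
// then read-only, alternating. Real MSA adds WriteByOne, enrollment,
// paging, and per-processor caching.
class ToyMSA {
public:
    enum Mode { ReadOnly, AccumulateOnly };

    explicit ToyMSA(std::size_t n) : data_(n, 0.0), mode_(AccumulateOnly) {}

    void accumulate(std::size_t i, double v) {
        if (mode_ != AccumulateOnly) throw std::logic_error("wrong phase");
        data_[i] += v;  // contributions combine; order does not matter
    }

    double get(std::size_t i) const {
        if (mode_ != ReadOnly) throw std::logic_error("wrong phase");
        return data_[i];
    }

    // Phase change: in the real system this is a collective among
    // all enrolled workers; here it just flips the mode.
    void sync() { mode_ = (mode_ == ReadOnly) ? AccumulateOnly : ReadOnly; }

private:
    std::vector<double> data_;
    Mode mode_;
};
```

Making the phase explicit is what lets the runtime combine accumulations locally and defer communication until sync(), instead of serializing every remote update.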
MSA (3)
Example: ParFUM.
A global hashtable stores the elements on shared edges:
  Partitions contribute elements on a particular edge: accumulate mode
  Partitions read elements on a shared edge: multiple-read mode
  When edges collide under the hash function, their element lists are combined by accumulation
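A minimal sketch of this use case, with std::map standing in for the MSA hashtable: partitions accumulate element ids into per-edge lists during the accumulate phase, keyed by the edge's node pair. The `Edge`/`contribute` names are invented; storing the smaller node id first makes both partitions key a shared edge identically.

```cpp
#include <map>
#include <utility>
#include <vector>

// An edge is identified by its two mesh-node ids, smaller id first,
// so the two partitions sharing the edge agree on the key.
using Edge = std::pair<int, int>;
using EdgeTable = std::map<Edge, std::vector<int>>;

Edge makeEdge(int a, int b) { return a < b ? Edge{a, b} : Edge{b, a}; }

// Accumulate phase: append, never overwrite, so contributions from
// different partitions to the same edge all survive.
void contribute(EdgeTable& table, int a, int b, int elem) {
    table[makeEdge(a, b)].push_back(elem);
}
```

In the real MSA version the table lives in the shared array, each partition contributes in accumulate mode, and a sync() separates contribution from the multiple-read phase in which partitions look up their shared edges.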
Conclusions
Multi-paradigm programming helps improve productivity.
Higher-level languages on the adaptive run-time system are interoperable.
Supporting multiple paradigms on a common run-time retains the performance benefits.
The marriage of migratable objects and higher-level languages provides greater flexibility.