1 Workshop on Charm++ and Applications Welcome and Introduction
Laxmikant Kale, Parallel Programming Laboratory, Department of Computer Science, University of Illinois at Urbana-Champaign

2 PPL Mission and Approach
Goal: enhance performance and productivity in programming complex parallel applications. Performance: scalable to thousands of processors. Productivity: of human programmers. Complex: irregular structure, dynamic variations.
Approach: application-oriented yet CS-centered research. Develop enabling technology for a wide collection of apps; develop, use, and test it in the context of real applications.
How? Develop novel parallel programming techniques and embody them in easy-to-use abstractions, so that application scientists can use advanced techniques with ease and the enabling technology is reused across many apps.

3 Migratable Objects (aka Processor Virtualization)
Programmer: [over]decomposition into virtual processors. Runtime: assigns VPs to processors, enabling adaptive runtime strategies. Implementations: Charm++, AMPI.
Benefits:
Software engineering: the number of virtual processors can be controlled independently; separate VPs for different modules.
Message-driven execution: adaptive overlap of communication; predictability (automatic out-of-core); asynchronous reductions.
Dynamic mapping: heterogeneous clusters (vacate, adjust to speed, share); automatic checkpointing (change the set of processors used); automatic dynamic load balancing; communication optimization.
(Figure contrasts the user view with the system implementation.)

4 Migratable Objects (aka Processor Virtualization)
(Same decomposition picture as the previous slide, with the mapping figure added: MPI processes in the user's view are implemented as virtual processors, i.e., user-level migratable threads, which the runtime maps onto real processors.)
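
To make the over-decomposition idea concrete, here is a minimal Charm++ sketch (hypothetical file, module, and class names; not code from the talk): a 1D chare array is created with many more elements than physical processors, and the runtime is free to place and migrate those elements.

```cpp
// hello.ci (Charm++ interface file; hypothetical module)
//   mainmodule hello {
//     readonly CProxy_Main mainProxy;
//     mainchare Main   { entry Main(CkArgMsg*); entry void done(); };
//     array [1D] Worker { entry Worker(); entry void work(); };
//   };

// hello.C
#include "hello.decl.h"

/*readonly*/ CProxy_Main mainProxy;

class Main : public CBase_Main {
  int nElems, count;
public:
  Main(CkArgMsg* m) : count(0) {
    nElems = 8 * CkNumPes();                 // over-decompose: 8 VPs per processor
    mainProxy = thisProxy;
    CProxy_Worker workers = CProxy_Worker::ckNew(nElems);
    workers.work();                          // asynchronous broadcast; no blocking
    delete m;
  }
  void done() { if (++count == nElems) CkExit(); }
};

class Worker : public CBase_Worker {
public:
  Worker() {}
  Worker(CkMigrateMessage*) {}               // needed for migratable elements
  void work() {
    CkPrintf("VP %d running on PE %d\n", thisIndex, CkMyPe());
    mainProxy.done();                        // message-driven completion
  }
};

#include "hello.def.h"
```

The point is simply that the number of Worker elements (virtual processors) is chosen by the programmer independently of the number of physical processors reported by CkNumPes(); the runtime handles placement.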

5 Adaptive overlap and modules
This is nicely illustrated in a figure borrowed from Attila Gursoy's thesis. Module A needs services from modules B and C. In a message-passing paradigm these modules cannot interleave their execution, so idle time in one module cannot be filled with computation from another; in a message-driven paradigm it can. We therefore base our component architecture on a message-driven runtime system called Converse.
References: A. Gursoy, Simplified Expression of Message-Driven Programs and Quantification of Their Impact on Performance, Ph.D. thesis, Apr. 1994; Modularity, Reuse, and Efficiency with Message-Driven Libraries, Proc. of the Seventh SIAM Conference on Parallel Processing for Scientific Computing, San Francisco, 1995.

6 Processor Utilization against Time on 128 and 1024 processors
On 128 processors a single load-balancing step suffices, but on 1024 processors a "refinement" step is also needed.
(Figure: processor utilization over time on 128 and 1024 processors, annotated with the aggressive and refinement load-balancing steps.)
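
As a rough illustration of how an application hands control to the load balancer, here is a fragment in the style of a Charm++ array element (hypothetical class; the usual .ci declarations and termination logic are omitted). The AtSync()/ResumeFromSync() hooks and the +balancer launch option are the standard Charm++ mechanism; RefineLB and GreedyLB roughly correspond to the "refinement" and "aggressive" strategies mentioned above.

```cpp
// Fragment, not a complete program: a migratable array element that
// periodically pauses at a load-balancing barrier.
class Block : public CBase_Block {
  int iter;
public:
  Block() : iter(0) { usesAtSync = true; }   // opt in to measurement-based LB
  Block(CkMigrateMessage*) {}
  void pup(PUP::er& p) { p | iter; /* ...plus the element's real data... */ }

  void step() {
    // ... this element's share of one timestep of work ...
    ++iter;
    if (iter % 20 == 0) AtSync();            // let the runtime migrate elements
    else thisProxy[thisIndex].step();        // otherwise continue immediately
  }
  void ResumeFromSync() { thisProxy[thisIndex].step(); }  // balancing finished
};

// The strategy is chosen when the job is launched, e.g.:
//   ./charmrun +p1024 ./app +balancer RefineLB   (refinement step)
//   ./charmrun +p1024 ./app +balancer GreedyLB   (aggressive rebalance)
```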

7 Shrink/Expand Problem: Availability of computing platform may change
Fitting applications onto the platform by object migration.
(Figure: time per step for the million-row CG solver on a 16-node cluster; an additional 16 nodes become available at step 600.)

8 Optimizing all-to-all via Mesh
Organize processors in a 2D (virtual) grid. Phase 1: each processor sends messages within its row. Phase 2: each processor sends messages within its column. A message from (x1,y1) to (x2,y2) goes via (x1,y2). Each processor sends about 2(√P − 1) messages instead of P − 1.
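
A small self-contained sketch (not the library implementation) of the routing rule: with P processors viewed as a √P x √P grid, the intermediate hop for a message is the processor in the sender's row and the receiver's column.

```cpp
// Two-phase mesh schedule for all-to-all personalized communication (AAPC).
// A message from (x1,y1) to (x2,y2) travels via (x1,y2): phase 1 moves it
// along the row, phase 2 along the column. Illustrative only.
#include <cmath>
#include <cstdio>

struct Coord { int row, col; };

Coord toCoord(int pe, int cols) { return { pe / cols, pe % cols }; }
int   toPe(Coord c, int cols)   { return c.row * cols + c.col; }

// Intermediate hop for a message from 'src' to 'dst'.
int intermediate(int src, int dst, int cols) {
  Coord s = toCoord(src, cols), d = toCoord(dst, cols);
  return toPe({ s.row, d.col }, cols);   // sender's row, receiver's column
}

int main() {
  int P = 16, cols = (int)std::sqrt((double)P);
  // Each processor sends (cols - 1) messages in phase 1 and (P/cols - 1) in
  // phase 2: about 2*(sqrt(P) - 1) messages instead of P - 1 for direct
  // all-to-all, at the cost of forwarding each byte twice.
  int src = 5, dst = 14;
  std::printf("message %d -> %d routed via %d\n", src, dst, intermediate(src, dst, cols));
  return 0;
}
```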

9 76 bytes all-to-all on Lemieux
Optimized all-to-all "surprise": completion time vs. computation overhead for a 76-byte all-to-all on Lemieux. The CPU is free during most of the time taken by a collective operation.
(Figure: AAPC completion time (ms) versus message size, comparing the Mesh and Direct strategies and a radix sort benchmark.)
This observation led to the development of asynchronous collectives, now supported in AMPI.
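
The same idea in MPI terms: post the all-to-all, compute, and only wait when the results are actually needed. The sketch below uses the standard MPI-3 nonblocking collective MPI_Ialltoall; AMPI's asynchronous collectives, which predate MPI-3, expose the same pattern. Buffer sizes and the work function are illustrative.

```cpp
#include <mpi.h>
#include <vector>

// Placeholder for application work that does not depend on the exchanged data.
static void compute_unrelated_work() { /* e.g. local force computation */ }

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int nprocs;
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

  const int chunk = 19;                       // 19 ints = 76 bytes per destination
  std::vector<int> sendbuf(chunk * nprocs, 1), recvbuf(chunk * nprocs);

  MPI_Request req;
  MPI_Ialltoall(sendbuf.data(), chunk, MPI_INT,
                recvbuf.data(), chunk, MPI_INT, MPI_COMM_WORLD, &req);

  compute_unrelated_work();                   // CPU stays busy while data moves

  MPI_Wait(&req, MPI_STATUS_IGNORE);          // block only when results are needed
  MPI_Finalize();
  return 0;
}
```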

10 Performance on Large Machines
Problem: how to predict the performance of applications on future machines (e.g., BG/L)? How to do performance tuning without continuous access to a large machine?
Solution: leverage virtualization. Develop a machine emulator and a simulator with accurate time modeling; run the program on "100,000 processors" using only hundreds of processors. Analysis: use the performance visualization suite (Projections).
(Example: molecular dynamics benchmark ER-GRE; 1.6 million objects; 8-step simulation; 16K BG processors.)

11 Blue Gene Emulator/Simulator
Actually BigSim: simulation of any large machine using smaller parallel machines. Emulator: allows development of the programming environment and algorithms before the machine is built; it allowed us to port Charm++ to the real BG/L in 1-2 days. Simulator: allows performance analysis of programs running on large machines without access to those machines; uses parallel discrete event simulation.

12 Big Network Simulation
Simulate network behavior: packetization, routing, contention, etc. Incorporated with post-mortem simulation; switches are connected in a torus network.
(Pipeline: the BigSim emulator produces BG log files (tasks and dependencies), POSE performs timestamp correction, and the timestamp-corrected tasks feed BigNetSim.)

13 Projections: Performance visualization

14 Multi-Cluster Co-Scheduling
A job is co-scheduled to run across two clusters to provide access to a large number of processors, but cross-cluster latencies are large: intra-cluster latency is microseconds, inter-cluster latency milliseconds. Virtualization within Charm++ masks the high inter-cluster latency by allowing overlap of communication with computation.

15 Multi-Cluster Co-Scheduling

16 Fault Tolerance
Automatic checkpointing: migrate objects to disk, with in-memory checkpointing as an option; automatic fault detection and restart.
"Impending fault" response: migrate objects to other processors; adjust processor-level parallel data structures.
Scalable fault tolerance: when one processor out of 100,000 fails, the other 99,999 shouldn't have to roll back to their checkpoints! Sender-side message logging; latency tolerance helps mitigate costs; restart can be sped up by spreading out the objects from the failed processor.
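
A brief sketch of how checkpointing reuses the migration machinery (hypothetical class and entry-method names; pup(), CkStartCheckpoint, and CkStartMemCheckpoint are the Charm++ mechanisms being illustrated):

```cpp
#include <vector>
#include "pup.h"
#include "pup_stl.h"

// The same pup() routine that lets an object migrate also lets the runtime
// write it to a disk checkpoint or copy it to a buddy processor's memory.
class Patch /* : public CBase_Patch in a full Charm++ program */ {
  int step;
  std::vector<double> coords;
public:
  Patch() : step(0) {}
  void pup(PUP::er& p) {
    p | step;
    p | coords;
  }
};

// Requesting a coordinated checkpoint from the main chare (assumed names):
//   CkCallback cb(CkIndex_Main::resumeFromCheckpoint(), mainProxy);
//   CkStartCheckpoint("ckpt", cb);       // checkpoint to disk directory "ckpt"
//   // CkStartMemCheckpoint(cb);         // or: in-memory (double) checkpoint
```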

17 Grid: Progress on Faucets
Cluster bartering via Faucets: allows single-point job submission, grid-level resource allocation, and quality-of-service contracts; clusters participate in bartering mode.
Cluster scheduler: AMPI jobs can be made to change the number of processors used at runtime; the scheduler exploits such adaptive jobs to maximize throughput and reduce response time (especially for high-priority jobs). Tutorial on Wednesday.

18 AQS: Adaptive Queuing System
Multithreaded, reliable, and robust. Supports most features of standard queuing systems. Exploits adaptive jobs (Charm++ and MPI) by shrinking/expanding their set of processors; also handles regular (non-adaptive) jobs. In routine use on the "architecture cluster".

19 Faucets: Optimizing Utilization Within/across Clusters
(Diagram: a job submission with job specs and file upload flows to clusters, which return bids and job IDs; a job monitor tracks the running job.) Efficiency metrics are profit and utilization.

20 Higher level programming
Multiphase Shared Arrays: provide disciplined use of a shared address space. Each array can be accessed only in one of the following modes: read-only, write-by-one-thread, or accumulate-only. The access mode can change from phase to phase; phases are delineated by a per-array "sync".
Orchestration language: allows expressing global control flow in a Charm program; HPF-like flavor, but with Charm++-like processor virtualization and explicit communication.
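
To make the access-mode discipline concrete, here is a deliberately simplified, sequential stand-in; the class and its methods are hypothetical and are not the actual MSA API, but they show the rule that each phase permits exactly one mode and that a sync ends the phase.

```cpp
#include <cassert>
#include <cstdio>
#include <vector>

// Hypothetical stand-in for a Multiphase Shared Array -- names and methods
// are illustrative only, NOT the Charm++ MSA interface. In the real library,
// sync() is a collective operation across the threads sharing the array.
enum class Mode { Read, Write, Accumulate };

class PhasedArray {
  std::vector<double> data;
  Mode mode;
public:
  explicit PhasedArray(int n) : data(n, 0.0), mode(Mode::Write) {}
  double read(int i) const           { assert(mode == Mode::Read);       return data[i]; }
  void   write(int i, double v)      { assert(mode == Mode::Write);      data[i] = v; }
  void   accumulate(int i, double v) { assert(mode == Mode::Accumulate); data[i] += v; }
  void   sync(Mode next)             { mode = next; }   // end of phase
};

int main() {
  PhasedArray a(4);
  for (int i = 0; i < 4; ++i) a.write(i, i + 1.0);   // write-by-one-thread phase
  a.sync(Mode::Read);
  double s = 0;
  for (int i = 0; i < 4; ++i) s += a.read(i);        // read-only phase
  a.sync(Mode::Accumulate);
  a.accumulate(0, s);                                // accumulate-only phase
  std::printf("sum = %g\n", s);
  return 0;
}
```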

21 Other Ongoing and Planned System Projects
Parallel languages (JADE, Orchestration), parallel debugger, automatic out-of-core execution, parallel multigrid support.

22
Enabling CS technology of parallel objects and intelligent runtime systems (Charm++ and AMPI) has led to several collaborative applications in CSE.
(Diagram: parallel objects and the adaptive runtime system, with libraries and tools, supporting quantum chemistry (QM/MM), protein folding, molecular dynamics, computational cosmology, crack propagation, space-time meshes, rocket simulation, and dendritic growth.)

23 Some Active Collaborations
Biophysics: molecular dynamics (NIH, ...). Long-standing (since 1991) with Klaus Schulten and Bob Skeel; Gordon Bell award in 2002; production program used by biophysicists.
Quantum chemistry (NSF): QM/MM via the Car-Parrinello method, with Roberto Car, Mike Klein, Glenn Martyna, Mark Tuckerman, Nick Nystrom, Josep Torrellas, Laxmikant Kale.
Material simulation (NSF): dendritic growth, quenching, space-time meshes, QM/FEM, with R. Haber, D. Johnson, J. Dantzig, and others.
Rocket simulation (DOE): DOE-funded ASCI center; Mike Heath and 30+ faculty.
Computational cosmology (NSF, NASA): simulation, scalable visualization.
Others: simulation of plasma, electromagnetics.

24 NAMD: A Production MD program
A fully featured program with NIH-funded development. Distributed free of charge (~5000 downloads so far) as binaries and source code. Installed at NSF centers, with user training and support. Large published simulations.

25

26 Profile and Timeline views of a 3000 processor run.
White area: idle time. Blue shades: force computations; red: integration. Observation: load balancing is still a possible issue, so even further performance improvement may be possible (future work).

27 BlueGene/L Performance
Two modes of operation. Virtual node mode: each processor is responsible for all of its communication. Co-processor mode: one of the processors in a node becomes a co-processor and handles communication.

28 Scaling NAMD to BlueGene/L
In virtual node mode we saw cascading network blocking, requiring frequent flushes of the receive buffers (every 16K processor cycles); we added network flushes to NAMD's serial loops. Benchmark: Apolipoprotein A1 (92K atoms).

29 NAMD Performance
(Figure: step time (ms) versus number of processors.)

30 Domain Specific Frameworks
Motivation: reduce the tedium of parallel programming for commonly used paradigms and parallel data structures; encapsulate parallel data structures and algorithms; provide an easy-to-use interface; build concurrently composable parallel modules.
Frameworks:
Unstructured grids: generalized ghost regions; used in Rocfrac and Rocflu at the rocket center and outside CSAR; fast collision detection.
Multiblock framework: structured grids; automates communication.
AMR: common to both of the above.
Particles: multiphase flows, MD, tree codes.

31 Rocket Simulation
Dynamic, coupled physics simulation in 3D: finite-element solids on an unstructured tet mesh, finite-volume fluids on a structured hex mesh, coupled every timestep via a least-squares data transfer. Challenges: multiple modules; dynamic behavior (burning surface, mesh adaptation).
(Image: Robert Fiedler, Center for Simulation of Advanced Rockets.) Collaboration with M. Heath, P. Geubelle, and others.

32 Dynamic load balancing in Crack Propagation

33 CPSD: Spacetime Discontinuous Galerkin Method
Collaboration with Bob Haber, Jeff Erickson, Mike Garland, ... (NSF-funded center). The space-time mesh is generated at runtime; mesh generation is an advancing-front algorithm that adds an independent set of elements, called patches, to the mesh; each patch depends only on inflow elements (cone constraint). Completed: sequential mesh generation interleaved with parallel solution. Ongoing: parallel mesh generation. Planned: nonlinear cone constraints, adaptive refinement.

34 CPSD: Dendritic Growth
Studies the evolution of solidification microstructures using a phase-field model computed on an adaptive finite-element grid. Adaptive refinement and coarsening of the grid involves re-partitioning. Jon Dantzig et al. with O. Lawlor and others from PPL.

35 Parallelization of Level Set Method for Solving Solidification Problems
Laxmikant Kale (1), Jon Dantzig (2), Kai Wang (1), and Anthony Chang (2), University of Illinois; (1) Department of Computer Science, (2) Department of Mechanical and Industrial Engineering.
Research objective: when simulating solidification, the level set method is used to track the interface. We parallelize it because the process is long (the simulation usually needs several hundred time steps) and the computational cost is high (accurate simulation requires more grid points).
Approach: the virtualization concept is applied during parallelization.
Significant results: parallel performance is improved thanks to adaptive overlap of communication and computation and better cache performance.
Broader impacts: virtualization may benefit moving-boundary problems that need automatic load balancing. A paper is in progress.
(Figures: the simulation of a solidification problem using the level set method; CPU time as a function of the degree of virtualization for a 500x500 grid on 16 and 32 processors.)

36 Computational Cosmology
N-body simulation: N particles (1 million to 1 billion) in a periodic box, moving under gravitation, organized in a tree (oct-tree, binary/k-d, ...). Output data analysis is done in parallel: particles are read in parallel; interactive analysis. Issues: load balancing, fine-grained communication, tolerating communication latencies, multiple time stepping. Collaboration with T. Quinn, Y. Staedel, M. Winslett, and others.
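
As a generic illustration (not the collaboration's code) of the kind of tree decomposition involved, the sketch below recursively bisects particles into contiguous pieces that could then be assigned to virtual processors.

```cpp
// Illustrative only: recursive k-d style median bisection of particles into
// 'pieces' groups, the sort of spatial decomposition used to organize N-body
// particles for parallel tree codes.
#include <algorithm>
#include <cstdio>
#include <vector>

struct Particle { double x[3]; };

void bisect(std::vector<Particle>& p, int lo, int hi, int pieces, int depth,
            std::vector<int>& cuts) {
  if (pieces == 1) { cuts.push_back(hi); return; }
  int mid = lo + (hi - lo) * (pieces / 2) / pieces;   // proportional split point
  int axis = depth % 3;                               // cycle the split axis
  std::nth_element(p.begin() + lo, p.begin() + mid, p.begin() + hi,
                   [axis](const Particle& a, const Particle& b) {
                     return a.x[axis] < b.x[axis];
                   });
  bisect(p, lo, mid, pieces / 2, depth + 1, cuts);
  bisect(p, mid, hi, pieces - pieces / 2, depth + 1, cuts);
}

int main() {
  std::vector<Particle> p(1000);
  for (int i = 0; i < 1000; ++i) p[i] = {{ i * 0.7, i * 0.3, i * 0.1 }};
  std::vector<int> cuts;
  bisect(p, 0, (int)p.size(), 8, 0, cuts);   // e.g. 8 pieces for 8 virtual processors
  std::printf("produced %zu pieces\n", cuts.size());
  return 0;
}
```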

37 QM/MM: Quantum Chemistry (NSF)
QM/MM via the Car-Parrinello method, with Roberto Car, Mike Klein, Glenn Martyna, Mark Tuckerman, Nick Nystrom, Josep Torrellas, Laxmikant Kale.
Current steps: take the core methods in PinyMD (Martyna/Tuckerman), reimplement them in Charm++, and study effective parallelization techniques.
Planned: LeanMD (classical MD core), full QM/MM, integration with other codes.

38 Summary and Messages
We at PPL have advanced migratable-objects technology. We are committed to supporting applications, and we grow our base of reusable techniques via such collaborations. Try using our technology: AMPI, Charm++, Faucets, the FEM framework, and more, all available via the web.

39 Over the next 3 days
Keynote: Manish Gupta, IBM, on the IBM BlueGene/L computer.
Applications: molecular dynamics, quantum chemistry (QM/MM), space-time meshes, rocket simulation, computational cosmology, graphics animation.
System progress talks: Adaptive MPI; BigSim (performance prediction); performance visualization; fault tolerance; communication optimizations; multi-cluster applications.
Tutorials: Charm++, AMPI, Projections, FEM framework, Faucets, BigSim.

