Parallelizing Spacetime Discontinuous Galerkin Methods
Jonathan Booth, University of Illinois at Urbana-Champaign
In conjunction with: L. Kale, R. Haber, S. Thite, J. Palaniappan
This research was made possible via NSF grant DMR

Parallel Programming Lab
Led by Professor Laxmikant Kale
Application-oriented
–Research is driven by real applications and their needs: NAMD, CSAR Rocket Simulation (Roc*), Spacetime Discontinuous Galerkin, Petaflops Performance Prediction (Blue Gene)
–Focus on scalable performance for real applications

Charm++ Overview
In development for roughly ten years
Based on C++
Runs on many platforms
–Desktops
–Clusters
–Supercomputers
Built on top of a C runtime layer called Converse
–Allows multiple languages to work together

Charm++: Programmer View
System of objects
Asynchronous communication via method invocation
An object identifier is used to refer to an object
The user sees each object execute its methods atomically
–As if on its own processor
(diagram legend: Processor, Object/Task)
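As an illustration of this view, here is a minimal sketch of a Charm++ chare whose method is invoked asynchronously through its proxy (the object identifier). The module, class, and method names are illustrative, not taken from the talk's code; the usual Charm++ .ci interface file is shown as a comment.

```cpp
// hello.ci (interface file, shown here as a comment):
//   mainmodule hello {
//     mainchare Main { entry Main(CkArgMsg *m); };
//     array [1D] Worker {
//       entry Worker();
//       entry void doWork(int n, double data[n]);
//     };
//   };

#include "hello.decl.h"   // generated from hello.ci by charmc

class Main : public CBase_Main {
public:
  Main(CkArgMsg *m) {
    delete m;
    // Create a collection of chare objects; the runtime places them on processors.
    CProxy_Worker workers = CProxy_Worker::ckNew(8);
    double d[3] = {1.0, 2.0, 3.0};
    workers[0].doWork(3, d);   // asynchronous method invocation via the proxy
  }
};

class Worker : public CBase_Worker {
public:
  Worker() {}
  void doWork(int n, double *data) {
    // Each object executes its entry methods one at a time, as if it had
    // its own processor.
    CkPrintf("Worker %d got %d values on PE %d\n", thisIndex, n, CkMyPe());
    CkExit();
  }
};

#include "hello.def.h"
```

The caller does not block: the invocation becomes a message that the runtime delivers to whichever processor currently hosts workers[0].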

Charm++: System View
Set of objects invoked by messages
Set of processors of the physical machine
The runtime keeps track of the object-to-processor mapping
The runtime routes messages between objects
(diagram legend: Processor, Object/Task)

Charm++ Benefits
The program is not tied to a fixed number of processors
–No problem if the program needs 128 processors and only 45 are available
–Called processor virtualization
Load balancing is accomplished automatically
–The user writes a short routine to transfer an object between processors
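In Charm++ that "short routine" is typically a PUP (pack/unpack) method that serializes the object's state so the runtime can move it. A minimal sketch, with a hypothetical Patch chare and member names assumed for illustration:

```cpp
#include <vector>
#include "pup_stl.h"        // PUP support for std::vector
#include "patch.decl.h"     // hypothetical generated declarations

// Hypothetical migratable chare; the class and member names are illustrative.
class Patch : public CBase_Patch {
  int patchId;
  std::vector<double> solution;   // per-element solution data
public:
  Patch() {}
  Patch(CkMigrateMessage *m) {}   // constructor used when the object arrives after migration

  // The "short routine": serialize the object's state so the runtime can
  // pack it, ship it to another processor (or to disk), and unpack it there.
  void pup(PUP::er &p) {
    CBase_Patch::pup(p);
    p | patchId;
    p | solution;
  }
};
```

Because the same routine also drives writing the object's state to disk, the checkpoint/restart and out-of-core behavior mentioned below come essentially for free.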

Load Balancing: Green Process Starts Heavy Computation
(diagram: objects on processors A, B, C)

Yellow Processes Migrate Away – System Handles Message Routing
(diagram: processors A, B, C before and after migration)

Load Balancing
Load balancing isn't solely dependent on CPU usage
Balancers consider network usage as well
–Objects can be moved to lessen network bandwidth usage
Migrating an object to disk instead of to another processor gives checkpoint/restart and out-of-core execution

Parallel Spacetime Discontinuous Galerkin
Mesh generation is an advancing-front algorithm
–It adds an independent set of elements, called patches, to the mesh
Spacetime methods are set up in such a way that they are easy to parallelize
–Each patch depends only on its inflow elements; the cone constraint ensures there are no other dependencies
–The amount of data per patch is small, so it is inexpensive to send a patch and its inflow elements to another processor
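To make the dependency structure concrete, here is a toy readiness check (the struct, fields, and function are assumptions for illustration, not from the talk's code): a patch becomes solvable as soon as all of its inflow elements are solved, and the cone constraint guarantees there is nothing else to wait for.

```cpp
#include <unordered_set>
#include <vector>

// Illustrative only: minimal bookkeeping for patch dependencies.
struct Patch {
    int id;
    std::vector<int> inflowElements;   // IDs of the elements this patch reads
};

// A patch is ready once every inflow element has been solved; no other
// dependencies exist, so it can be shipped to any processor immediately.
bool readyToSolve(const Patch &p, const std::unordered_set<int> &solvedElements) {
    for (int e : p.inflowElements)
        if (!solvedElements.count(e))
            return false;
    return true;
}
```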

Mesh Generation (diagram: advancing front with unsolved patches)

Mesh Generation (diagram: solved and unsolved patches)

Mesh Generation (diagram: solved and unsolved patches, with refinement)

Parallelization Method (1D)
Master-slave method
–Centralized mesh generation
–Distributed physics solver code
–Simplistic implementation, but fast to get running, and it provides a sanity check for object migration
No “time-step”
–As soon as a patch returns, the master generates any new patches it can and sends them off to be solved (see the sketch below)
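Below is a plain C++11 threads sketch of this master-slave scheme, for illustration only: the real implementation dispatches patches as Charm++ messages to slave processors, and newPatchesAfter() is a stand-in for the centralized advancing-front mesh generator (merged into the worker loop here for brevity). What it mirrors is the absence of a time-step: solving one patch immediately releases its successors.

```cpp
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

std::mutex mtx;
std::condition_variable cv;
std::queue<int> ready;      // patches whose inflow elements are already solved
int unsolved = 0;           // patches generated but not yet solved

void solvePatch(int id) { /* the physics solver would run here */ }

// Stand-in for the advancing-front mesh generator: solving a patch may
// expose new solvable patches (bounded here so the demo terminates).
std::vector<int> newPatchesAfter(int id) {
    std::vector<int> next;
    if (id < 64) { next.push_back(2 * id); next.push_back(2 * id + 1); }
    return next;
}

void slave() {
    for (;;) {
        int id;
        {
            std::unique_lock<std::mutex> lk(mtx);
            cv.wait(lk, [] { return !ready.empty() || unsolved == 0; });
            if (ready.empty()) return;          // no work left anywhere
            id = ready.front();
            ready.pop();
        }
        solvePatch(id);                          // slave-side work
        {
            std::lock_guard<std::mutex> lk(mtx); // master-side reaction: new
            for (int p : newPatchesAfter(id)) {  // patches go out immediately,
                ready.push(p);                   // with no global time-step
                ++unsolved;
            }
            --unsolved;
        }
        cv.notify_all();
    }
}

int main() {
    { std::lock_guard<std::mutex> lk(mtx); ready.push(1); unsolved = 1; }
    std::vector<std::thread> slaves;
    for (int i = 0; i < 4; ++i) slaves.emplace_back(slave);
    for (auto &t : slaves) t.join();
    std::printf("all patches solved\n");
    return 0;
}
```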

Results: Patches per Second (chart)

Scaling Problems
Speedup is ideal up to 4 slave processors
After 4 slaves, diminishing speedup occurs
Possible sources:
–Network bandwidth overload
–Charm++ system overhead (grainsize control)
–Mesh generator overload
The problem doesn't scale down, though
–More processors don't slow the computation down

Network Bandwidth
The size of a patch sent both ways is 2048 bytes (a very conservative estimate)
Each CPU can compute 36 patches/second
So each CPU needs 72 KB/second
100 Mbit Ethernet provides about 10 MB/second
The network can therefore support ~130 CPUs
–So the bottleneck must not be a lack of network bandwidth
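A back-of-envelope check of this arithmetic, restating the slide's numbers as constants:

```cpp
#include <cstdio>

int main() {
    const double bytesPerPatch = 2048.0;   // patch sent both ways (conservative)
    const double patchesPerCpu = 36.0;     // patches solved per second per CPU
    const double networkBytes  = 10.0e6;   // ~10 MB/s on 100 Mbit Ethernet

    double bytesPerCpu = bytesPerPatch * patchesPerCpu;   // ~72 KB/s per CPU
    std::printf("per-CPU traffic : %.0f bytes/s\n", bytesPerCpu);
    std::printf("CPUs supported  : ~%.0f\n", networkBytes / bytesPerCpu);
    return 0;
}
```

This prints 73728 bytes/s per CPU (the 72 KB/s above) and ~136 CPUs, consistent with the ~130 quoted.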

Charm++ System Overhead (Grainsize Control)
Grainsize is a measure of the smallest unit of work
If it is too small, overhead dominates
–Network latency overhead
–Object creation overhead
Each patch takes 1.7 ms to set up the connection to send (both ways)
So only ~550 patches/sec can be sent to remote processors
–Again, higher than the observed patches/second rate
The per-patch overhead can be reduced by sending multiple patches at once
–This speeds up the computation, but speedup still flattens out after 8 processors
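A similar view of the per-message overhead, including how batching several patches per message amortizes it (the batch sizes are arbitrary illustrations, not values from the talk):

```cpp
#include <cstdio>

int main() {
    const double sendOverheadSec = 1.7e-3;   // per-message setup cost, both ways
    const int batchSizes[] = {1, 4, 16};     // illustrative batch sizes
    for (int batch : batchSizes) {
        // With `batch` patches per message, the overhead-limited send rate is:
        double patchesPerSec = batch / sendOverheadSec;
        std::printf("batch of %2d patch(es) -> overhead limit ~%.0f patches/s\n",
                    batch, patchesPerSec);
    }
    return 0;
}
```

For a batch of one this gives roughly 588 patches/s, in line with the ~550 figure above; larger batches raise the ceiling, which matches the observation that batching helps but speedup still flattens out.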

Mesh Generation
With 0 slave processors: 31 ms/patch
With 1 slave processor: 27 ms/patch
The geometry code takes 4 ms to generate a patch
–The mesh generator needs a bit more time due to Charm++ message-sending overhead
This limits the system to fewer than 250 patches/second
This can't trivially be sped up
–Mesh generation itself would have to be parallelized
–Parallel mesh generation would also lighten the network load if the mesh were fully distributed to the slave nodes

Testing the Mesh Generator Bottleneck
Does speeding up the mesh generator give better results?
That leaves the question of how to speed up the mesh generator
–The cluster used has P3 Xeon 500 MHz nodes
–So run the mesh generator on something faster (a P4 at 2.8 GHz)
–Everything stays on the 100 Mbit network

Fast Mesh Generator Results (chart)

Future Directions
Parallelize geometry/mesh generation
–Easy to do in theory
–More complex in practice with refinement and coarsening
–Lessens network bandwidth consumption: only the border elements of the distributed meshes need to be sent, compared to all elements right now
–Better cache performance

More Future Directions
Send only necessary data
–Currently everything is sent, needed or not
Use migration to balance load, rather than the master-slave scheme
–This also gives us checkpoint/restart and out-of-core execution for free
–It also means some of the network communication can be load-balanced away
Integrate the 2D mesh generation/physics code
–Nothing in the parallel code knows the dimensionality