Welcome and Introduction “State of Charm++” Laxmikant Kale 6th Annual Workshop on Charm++ and Applications.

Presentation transcript:

1 Welcome and Introduction “State of Charm++” Laxmikant Kale 6th Annual Workshop on Charm++ and Applications

2 Parallel Programming Laboratory
(Overview diagram of PPL grants, senior staff, enabling projects, and applications)
Grants and application projects:
– DOE/ORNL (NSF: ITR): Chemistry – Car-Parrinello MD, QM/MM
– NASA: Computational Cosmology and Visualization
– DOE HPC-Colony: Services and Interfaces for Large Computers
– DOE CSAR: Rocket Simulation
– NSF ITR: CSE ParFUM Applications
– NIH: Biophysics – NAMD
– NSF + Blue Waters: BigSim
Enabling projects:
– Charm++ and Converse
– AMPI: Adaptive MPI
– Fault tolerance: checkpointing, fault recovery, processor evacuation
– Load balancing: centralized, distributed, hybrid
– Faucets: dynamic resource management for grids
– ParFUM: supporting unstructured meshes (computational geometry)
– Projections: performance analysis
– Higher-level (deterministic) parallel languages
– BigSim: simulating big machines and networks

3 A Glance at History
– 1987: Chare Kernel arose from parallel Prolog work (dynamic load balancing for state-space search, Prolog, ...)
– 1992: Charm++
– 1994: Position paper – application-oriented yet CS-centered research; NAMD: 1994, 1996
– 1996–1998: Charm++ in almost its current form (chare arrays, measurement-based dynamic load balancing)
– 1997: Rocket Center – a trigger for AMPI
– 2001: Era of ITRs – quantum chemistry collaboration; computational astronomy collaboration: ChaNGa
– 2008: Multicore meets PetaFLOPs

4 PPL Mission and Approach
To enhance performance and productivity in programming complex parallel applications
– Performance: scalable to thousands of processors
– Productivity: of human programmers
– Complex: irregular structure, dynamic variations
Approach: application-oriented yet CS-centered research
– Develop enabling technology for a wide collection of apps
– Develop, use, and test it in the context of real applications
How?
– Develop novel parallel programming techniques
– Embody them into easy-to-use abstractions
– So application scientists can use advanced techniques with ease
– Enabling technology: reused across many apps

5 Migratable Objects (aka Processor Virtualization)
User view: the programmer [over]decomposes the computation into virtual processors (VPs)
System view: the runtime assigns VPs to physical processors, enabling adaptive runtime strategies
Implementations: Charm++, AMPI
Benefits:
– Software engineering: the number of virtual processors can be controlled independently; separate VPs for different modules
– Message-driven execution: adaptive overlap of communication and computation; predictability enables automatic out-of-core execution; asynchronous reductions
– Dynamic mapping: heterogeneous clusters (vacate, adjust to speed, share); automatic checkpointing; change the set of processors used; automatic dynamic load balancing; communication optimization
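Migration is possible because each object can serialize its own state. The sketch below is a minimal, illustrative example, not code from the talk: the class name Patch, its fields, and the file names are hypothetical, and it assumes a matching chare-array declaration in a Charm++ interface file. A chare defines a single pup() routine that the runtime uses both for checkpointing and for moving the object between processors.

```cpp
// Minimal sketch of a migratable chare (hypothetical names; assumes a
// corresponding "array [1D] Patch" declaration in a patch.ci interface file,
// from which patch.decl.h / patch.def.h are generated).
#include "patch.decl.h"
#include "pup_stl.h"              // PUP operators for STL containers
#include <vector>

class Patch : public CBase_Patch {
  int step;                       // per-object state that must travel with the object
  std::vector<double> forces;
public:
  Patch() : step(0) {}
  Patch(CkMigrateMessage *m) {}   // migration constructor, invoked on the destination
  void pup(PUP::er &p) {          // one routine sizes, packs, and unpacks the state
    CBase_Patch::pup(p);          // include the framework's own state (array index, ...)
    p | step;
    p | forces;
  }
};

#include "patch.def.h"
```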

6 Adaptive overlap and modules
SPMD and message-driven modules. (From A. Gursoy, "Simplified Expression of Message-Driven Programs and Quantification of Their Impact on Performance," Ph.D. thesis, Apr. 1994; and "Modularity, Reuse, and Efficiency with Message-Driven Libraries," Proc. of the Seventh SIAM Conference on Parallel Processing for Scientific Computing, San Francisco, 1995.)

7 Realization: Charm++'s Object Arrays
A collection of data-driven objects
– With a single global name for the collection
– Each member addressed by an index: [sparse] 1D, 2D, 3D, tree, string, ...
– Mapping of element objects to processors handled by the system
(Figure: user's view – elements A[0], A[1], A[2], A[3], A[..])

8 Realization: Charm++'s Object Arrays
A collection of data-driven objects
– With a single global name for the collection
– Each member addressed by an index: [sparse] 1D, 2D, 3D, tree, string, ...
– Mapping of element objects to processors handled by the system
(Figure: user's view of the array A[0]..A[..], and the system view showing elements such as A[3] and A[0] placed on particular processors)
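To make the object-array model concrete, here is a minimal, hypothetical Charm++ program (the module, class, and method names are invented for the example, not taken from the talk): a 1-D chare array is created with a single call, elements are addressed by index, and the runtime decides where each element lives.

```cpp
// hello.ci -- Charm++ interface file (shown as a comment for reference):
//   mainmodule hello {
//     mainchare Main { entry Main(CkArgMsg *m); };
//     array [1D] Hello { entry Hello(); entry void sayHi(int from); };
//   };

// hello.C
#include "hello.decl.h"

class Main : public CBase_Main {
public:
  Main(CkArgMsg *m) {
    // Create 16 array elements; the runtime maps them onto processors.
    CProxy_Hello arr = CProxy_Hello::ckNew(16);
    arr[0].sayHi(-1);                 // address a member by its index
    delete m;
  }
};

class Hello : public CBase_Hello {
public:
  Hello() {}
  Hello(CkMigrateMessage *m) {}
  void sayHi(int from) {
    CkPrintf("Hi from element %d (previous: %d)\n", thisIndex, from);
    if (thisIndex < 15)
      thisProxy[thisIndex + 1].sayHi(thisIndex);  // forward along the array
    else
      CkExit();
  }
};

#include "hello.def.h"
```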

10 AMPI: Adaptive MPI

11 So, What's new?

12 A large number of collaborations and grants got started

13 New projects
Those of you who attended last year may remember:
– A major concern was that many grants were ending last year
Good news: we have secured adequate funding
New grants, projects, and collaborations:
– DOE/FAST-OS
– ORNL/DOE
– NSF: BigSim
– NASA: Cosmology
– IACAT
– UPCRC
– Blue Waters: NCSA/NSF

14 Blue Waters
NSF Track 1 system at NCSA/UIUC
Headed by Thom Dunning and Rob Pennington
Over a year and a half of effort: pre-proposal, proposal, contracts and sub-contract
PPL participation:
– NAMD scaling
– BigSim deployment
– Early application development
– Performance prediction

15 IBM Power7-based machine
It is an exciting machine; let me tell you about some details.

16 IACAT
Institute for Advanced Computing Applications and Technologies
Headed by Thom Dunning
Three themes funded this year
PPL is participating in and co-leading a theme on petascale applications
– Participation by UIUC/CS computer scientists: Vikram Adve, Ralph Johnson, Sanjay Kale, David Padua, ...
– Application scientists: Duane Johnson (co-lead), David Ceperley, Klaus Schulten

17 BigSim
Title: Performance Prediction for Petascale Machines and Applications
Funded by NSF
Goal: develop a parallel simulator that can be used for petascale machines consisting of over a million processors

18 Colony Project Overview
Title: Services and Interfaces to Support Systems with Very Large Numbers of Processors
Collaborators:
– Lawrence Livermore National Laboratory: Terry Jones
– University of Illinois at Urbana-Champaign: Laxmikant Kale, Celso Mendes, Sayantan Chakravorty
– International Business Machines: Jose Moreira, Andrew Tauferner, Todd Inglett; Haifa Lab: Gregory Chockler, Eliezer Dekel, Roie Melamed
Topics:
– Parallel resource instrumentation framework
– Scalable load balancing
– OS mechanisms for migration
– Processor virtualization for fault tolerance
– Single system management space
– Parallel awareness and coordinated scheduling of services
– Linux OS for cellular architecture
– Overlay networks

19 Terascale Impostors
Support: NASA
Participants:
– Prof. Thomas Quinn (Dept. of Astronomy, Univ. of Washington)
– Prof. Orion Lawlor (Dept. of CS, Univ. of Alaska)
Goal: use innovations from parallel computing and from graphics computing to enable the interactive rendering of very large cosmological datasets
Example: screenshot of an 800-million-particle dataset. The client runs Salsa at Univ. of Washington, connected to a server running on 256 processors of Lemieux at PSC. The GUI allows interactively moving across the space and renders a frame every few seconds, but this is well below the "fusion frequency" required to get a 3D "feel" of the simulation. (Source: Tom Quinn)

20 QM/MM = NAMD + OpenAtom + ORNL LCF
Scalable Atomistic Modeling Tools with Chemical Reactivity for Life Sciences
– M. Tuckerman (NYU)
– G. Martyna (IBM)
– L. Kale (UIUC)
– K. Schulten (UIUC)
– J. Dongarra (UTK)
DOE two-year grant
Objectives:
– Tune NAMD @ ORNL
– Tune OpenAtom @ ORNL
– Combine OpenAtom with NAMD
– ATLAS for Cray Baker
– Integrate an optimized collective communication library with Charm++

21 UPCRC
Intel- and Microsoft-funded center
Involves 20 faculty from CS/ECE at UIUC
– Co-directors: Marc Snir, Wen-Mei Hwu
– Research director: Sarita Adve
PPL participation: being decided
– Managed execution: adaptive runtime system
But our past work is relevant to:
– Deterministic languages to simplify programming
– Domain-specific frameworks (ParFUM)
– Applications!

22 Our Applications Achieved Unprecedented Speedups

23 Applications and Charm++
(Diagram: Application, Charm++, and Other Applications, connected by "Issues" and "Techniques & libraries")
Synergy between computer science research and biophysics has been beneficial to both.

24 Charm++ and Applications
Synergy between Computer Science Research and Biophysics has been beneficial to both

25 Parallel Objects, Adaptive Runtime System, Libraries and Tools
The enabling CS technology of parallel objects and intelligent runtime systems has led to several collaborative applications in CSE:
– Crack propagation
– Space-time meshes
– Computational cosmology
– Rocket simulation
– Protein folding
– Dendritic growth
– Quantum chemistry (LeanCP)
– NAMD: molecular dynamics (STMV virus simulation)
Develop abstractions in the context of full-scale applications.

26 (figure-only slide)

27 NAMD XT4 at ORNL

28 Charm++'s "Projections" analysis tool
Time intervals on the x axis; activity added across processors on the y axis.
Apo-A1 on BlueGene/L, 1024 processors: 94% efficiency.
Shallow valleys, high peaks, nicely overlapped PME.
Color key: green – communication; red – integration; blue/purple – electrostatics; turquoise – angle/dihedral; orange – PME.

29 ChaNGa

30 CSE: Wave Propagation and Dynamic Fracturing
Support: UIUC-CSE Fellowship
Participants:
– Prof. Glaucio Paulino (Dept. of Civil and Environmental Engineering)
– Kyoungsoo Park (CEE grad student, CSE fellow)
Goal: integrate ParFUM and TopS and apply them to the study of graded materials
(Figure: scaling on NCSA's Abe. Source: Kyoungsoo Park)

31 Load Balancing on Very Large Machines
Existing load balancing strategies don't scale on extremely large machines
– Consider an application with 1M objects on 64K processors
Centralized:
– Object load data are sent to processor 0
– Integrated into a complete object graph
– Migration decisions are broadcast from processor 0
– Global barrier
Distributed:
– Load balancing among neighboring processors
– Builds partial object graphs
– Migration decisions are sent to neighbors
– No global barrier
Topology-aware:
– On 3D torus/mesh topologies
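To make the centralized case concrete, here is an illustrative sketch (not PPL's actual load balancer) of the kind of greedy algorithm such a strategy can run on processor 0 once the object loads have been gathered: the heaviest objects are assigned first, each to the currently least-loaded processor, and the resulting map is then broadcast as the migration decision.

```cpp
// Illustrative greedy centralized mapping: objects sorted by measured load,
// each assigned to the currently least-loaded processor.
#include <algorithm>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

struct Obj { int id; double load; };   // id assumed to be in [0, objs.size())

std::vector<int> greedyMap(std::vector<Obj> objs, int nprocs) {
  std::vector<int> assignment(objs.size());
  std::sort(objs.begin(), objs.end(),
            [](const Obj &a, const Obj &b) { return a.load > b.load; });
  // Min-heap of (current processor load, processor id).
  using PL = std::pair<double, int>;
  std::priority_queue<PL, std::vector<PL>, std::greater<PL>> heap;
  for (int p = 0; p < nprocs; ++p) heap.push({0.0, p});
  for (const Obj &o : objs) {
    PL least = heap.top(); heap.pop();
    assignment[o.id] = least.second;   // heaviest remaining object -> least-loaded processor
    least.first += o.load;
    heap.push(least);
  }
  return assignment;                   // migration decisions, broadcast from processor 0
}
```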

32 Fault Tolerance
Automatic checkpointing:
– Migrate objects to disk
– In-memory checkpointing as an option
– Automatic fault detection and restart
Proactive fault tolerance:
– "Impending fault" response
– Migrate objects to other processors
– Adjust processor-level parallel data structures
Scalable fault tolerance:
– When one processor out of 100,000 fails, all 99,999 others shouldn't have to roll back to their checkpoints!
– Sender-side message logging
– Latency tolerance helps mitigate costs
– Restart can be sped up by spreading the objects of the failed processor across the others

33 Fault Tolerance with Fast Recovery
Recovery time:
– Time for a crashed processor to regain its pre-crash state
– Existing solutions: the time between the last checkpoint and the crash
– Aim: lower recovery time
Advantages of a lower recovery time:
– Lower execution time
– Tolerates a higher rate of faults
Approach: reduce the work on the recovering processor by distributing it among the others
– The other processors should not themselves have to re-execute, i.e., we should not roll back all processors
– Object-based virtualization is used to divide the work
Leverage message logging and object-based virtualization to obtain fast recovery.

34 Virtualization and Message Logging
– Charm++ objects are the communicating entities
– After a crash, objects must process messages in the same sequence as before
– The sender-side pessimistic protocol is modified to work with objects
– At restart, the objects of the recovering processor are distributed among other processors to parallelize the restart
Message meta-data: each message carries a sequence number (SN, e.g. 33) and a ticket number (TN, e.g. 121); together they identify a message uniquely before and after a crash, and the receiver processes messages in increasing, consecutive order of TN.
(Diagram: sender and receiver exchanging a message with SN, TN, and data fields. Slide: Sayantan Chakravorty)
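The sketch below is an illustrative reconstruction, not the PPL implementation: it shows the two pieces of meta-data just described and the receiver-side rule that messages are only processed in increasing, consecutive ticket-number order, so a replay after a crash follows exactly the original sequence. Type and field names are invented for the example.

```cpp
// Message meta-data and in-order delivery by ticket number (illustrative only).
#include <cstdint>
#include <cstdio>
#include <map>

struct MsgMeta {
  uint64_t sn;   // sequence number assigned by the sender (per sender-receiver pair)
  uint64_t tn;   // ticket number assigned by the receiver, fixing the processing order
};

class Receiver {
  uint64_t nextTN = 1;
  std::map<uint64_t, const char *> pending;   // out-of-order messages buffered by TN
public:
  void deliver(MsgMeta m, const char *payload) {
    pending[m.tn] = payload;
    // Process strictly in TN order so re-execution after a crash replays the
    // messages in exactly the same sequence as before.
    while (!pending.empty() && pending.begin()->first == nextTN) {
      std::printf("processing TN %llu: %s\n",
                  (unsigned long long)nextTN, pending.begin()->second);
      pending.erase(pending.begin());
      ++nextTN;
    }
  }
};

int main() {
  Receiver r;
  r.deliver({33, 2}, "late message");    // buffered until TN 1 arrives
  r.deliver({12, 1}, "first message");   // unblocks TN 1 and then TN 2
}
```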

35 Load Balancing and Fault Tolerance
Load balance is important for scaling applications to large numbers of processors
Load balancing needs to work with message logging:
– Effect of object migration on reliability
– Crashes during the load balancing step
– Load balancing and the fast restart protocol (fast restart creates a load imbalance of its own)

36 Performance: Progress
– 2D stencil in Charm++: 4096 VPs on 512 processors
– Load balance/checkpoint every 200 timesteps
– Recovery with in-memory checkpoint takes 280 s and 204 s
– Recovery with fast restart takes 68.6 s and 65 s
(Slide: Sayantan Chakravorty)

37 BigSim
Simulating very large parallel machines using smaller parallel machines
Reasons:
– Predict performance on future machines
– Predict performance obstacles for future machines
– Do performance tuning on existing machines that are difficult to get allocations on
Idea:
– Emulation run using virtual processors (AMPI) to get traces
– Detailed machine simulation using the traces

38 Objectives and Simulation Model
Objectives:
– Develop techniques to facilitate the development of efficient petascale applications
– Based on performance prediction of applications on large simulated parallel machines
Simulation-based performance prediction:
– Focus on the Charm++ and AMPI programming models
– Performance prediction based on PDES
– Supports varying levels of fidelity: processor prediction, network prediction
– Modes of execution: online and post-mortem

39 Big Network Simulation
Simulate network behavior: packetization, routing, contention, etc.
Incorporated with post-mortem simulation
Switches are connected in a torus network
(Pipeline: BigSim emulator → BG log files (tasks & dependencies) → POSE timestamp correction → timestamp-corrected tasks → BigNetSim)

40 Projections: Performance visualization

41 Architecture of BigNetSim

42 Performance Prediction (contd.)
Predicting the time of sequential code:
– User-supplied time for every code block
– Wall-clock measurements on the simulating machine, scaled by a suitable multiplier
– Hardware performance counters to count floating-point, integer, branch instructions, etc.; cache performance and memory footprint are approximated by the percentage of memory accesses and the cache hit/miss ratio
– Instruction-level simulation (not implemented)
Predicting network performance:
– No contention: time based on topology and other network parameters
– Back-patching: modifies communication time using the amount of communication activity
– Network simulation: modeling the network entirely
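As a deliberately simplified illustration of the wall-clock-multiplier option above (this is not the BigSim API), the sketch below times a sequential block on the simulating host and scales the measurement by a user-supplied factor to estimate its duration on the target machine.

```cpp
// Predict target-machine time of a sequential block from a host measurement
// times a user-supplied multiplier (illustrative only).
#include <chrono>
#include <cstdio>

template <typename Fn>
double predictedTime(Fn &&block, double multiplier) {
  auto t0 = std::chrono::steady_clock::now();
  block();                                            // run the block on the simulating host
  auto t1 = std::chrono::steady_clock::now();
  double host = std::chrono::duration<double>(t1 - t0).count();
  return host * multiplier;   // e.g. 0.5 if the target core is assumed 2x faster
}

int main() {
  double t = predictedTime([] {
    volatile double s = 0;
    for (int i = 0; i < 1000000; ++i) s += i * 1e-6;  // stand-in for a sequential code block
  }, 0.5);
  std::printf("predicted target time: %g s\n", t);
}
```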

43 Multicore/SMP issues
Charm++ has supported multicore/SMP for a long time
– However, practically no one was using it!
– It was always faster to use a separate process for each core
– But this is not sustainable as the number of cores increases
– A shared software cache is needed for ChaNGa, for example
– Unnecessary communication (in multicasts) needs to be avoided
During the last year, we paid attention to the anatomy of these performance problems.

44 Multicore/SMP issues
Charm++ advantage:
– The user program involves no locking
– The Charm model minimizes false sharing
Issues identified in the RTS:
– Locks: strength reduction
– Locks vs. fences
– False sharing
– Performance of k-way sends and replies
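For readers unfamiliar with the false-sharing item above, here is a generic illustration (unrelated to the Charm++ RTS code itself) of the problem and the usual fix: per-thread counters that sit on the same cache line force that line to bounce between cores, whereas padding each counter to its own cache line avoids the ping-ponging.

```cpp
// False sharing and the padding fix (generic illustration).
#include <cstddef>

constexpr std::size_t CACHE_LINE = 64;

// Bad: adjacent counters for different threads land on the same cache line,
// so writes from different cores invalidate each other's cached copies.
struct CountersShared {
  long c[8];
};

// Better: each counter is aligned and padded out to a full cache line.
struct alignas(CACHE_LINE) PaddedCounter {
  long value;
  char pad[CACHE_LINE - sizeof(long)];   // keep neighbors on separate lines
};

PaddedCounter counters[8];   // one slot per worker thread; no false sharing on updates

int main() {
  for (int t = 0; t < 8; ++t) counters[t].value = 0;   // each thread would update its own slot
}
```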

46 Domain Specific Frameworks
Motivation:
– Reduce the tedium of parallel programming for commonly used paradigms and parallel data structures
– Encapsulate parallel data structures and algorithms
– Provide an easy-to-use interface
– Used to build concurrently composable parallel modules
Frameworks:
– Unstructured meshes: ParFUM – generalized ghost regions; used in Rocfrac and Rocflu at the rocket center, and outside CSAR; fast collision detection
– Multiblock framework – structured meshes; automates communication
– AMR – common for both of the above
– Particles – multiphase flows; MD, tree codes

47 Other Ongoing Projects
– Parallel debugger
– Scalable performance analysis
– Automatic out-of-core execution
– Challenges of exploiting multicores
– New collaborations being explored

48 Summary and Messages
We at PPL have advanced migratable-objects technology
– We are committed to supporting applications
– We grow our base of reusable techniques via such collaborations
Try using our technology:
– AMPI, Charm++, Faucets, ParFUM, ...
– Available via the web: http://charm.cs.uiuc.edu

49 Workshop Overview
System progress talks:
– Adaptive MPI
– BigSim: performance prediction
– Scalable performance analysis
– Fault tolerance
– Cell processor
– Grid and multi-cluster applications
Applications:
– Molecular dynamics
– Quantum chemistry
– Computational cosmology
– Structural mechanics
Tutorials: Charm++, AMPI, ParFUM, BigSim
Keynote: Rick Stevens, "Exascale Computational Science"
Invited talk: Bill Gropp, "The Evolution of MPI"

