Novel and “Alternative” Parallel Programming Paradigms Laxmikant Kale CS433 Spring 2000

Parallel Programming Models
We studied:
- MPI/message passing, shared memory, Charm++/shared objects
- Loop parallelism: OpenMP
Other languages/paradigms:
- Loop parallelism on distributed memory machines: HPF
- Linda, Cid, Chant
- Several others: acceptance barrier
I will assign reading assignments:
- Papers on the above languages, available on the web
- Pointers on the course web page soon

High Performance Fortran (HPF)
Loop parallelism (mostly explicit) on distributed memory machines:
- Arrays are the primary data structure (one- or multi-dimensional)
- How to decide which data lives where? HPF provides "distribute" and "align" primitives:
  - distribute A[block, cyclic] (notation differs)
  - Align B with A: same distribution
- Who does which part of a loop iteration? "Owner computes": the processor that owns A(I,J) executes the assignment A(I,J) = E (see the sketch below)
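As an illustration of the owner-computes rule, here is a minimal C++ sketch (not HPF) that mimics the SPMD code a compiler would generate for a 1-D block distribution; the variable names and the assumption that `a_local` holds this PE's block are mine:

```cpp
// Minimal sketch of the "owner computes" rule for a 1-D block-distributed
// array. This is not HPF; it imitates the generated SPMD code. `my_rank` and
// `num_procs` are assumed to come from the runtime, and `a_local` holds this
// processor's block (chunk elements).
#include <vector>

int owner_block(int i, int n, int num_procs) {
  int chunk = (n + num_procs - 1) / num_procs;   // elements per processor
  return i / chunk;
}

void update(std::vector<double>& a_local, int n, int my_rank, int num_procs) {
  int chunk = (n + num_procs - 1) / num_procs;
  for (int i = 0; i < n; ++i) {
    if (owner_block(i, n, num_procs) != my_rank) continue;  // owner computes
    a_local[i - my_rank * chunk] = 2.0 * i;                 // A(i) = E
  }
}
```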

Linda
Shared tuple space:
- A specialization of shared memory
Operations:
- read (rd), in, out [and eval]
- Pattern matching: in [2, ?x] matches a tuple whose first field is 2, reads its second field into x, and removes the tuple
Tuple analysis
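A toy, single-process C++ sketch of the tuple-space operations and pattern matching (illustration only; real Linda distributes the tuple space among processors and blocks on in rather than returning "not found"):

```cpp
// Toy illustration of Linda-style tuple-space operations (out/in) with
// pattern matching. Tuples are integer vectors; a pattern uses nullopt as a
// formal (wildcard) field. Names and representation are hypothetical.
#include <iostream>
#include <list>
#include <optional>
#include <vector>

using Tuple = std::vector<int>;
using Pattern = std::vector<std::optional<int>>;  // nullopt = formal (wildcard)

class TupleSpace {
  std::list<Tuple> tuples_;
 public:
  void out(const Tuple& t) { tuples_.push_back(t); }   // add a tuple to the space
  // "in": find a matching tuple, bind wildcards, and remove it from the space.
  std::optional<Tuple> in(const Pattern& p) {
    for (auto it = tuples_.begin(); it != tuples_.end(); ++it) {
      if (matches(*it, p)) { Tuple t = *it; tuples_.erase(it); return t; }
    }
    return std::nullopt;   // real Linda would block here until a match appears
  }
 private:
  static bool matches(const Tuple& t, const Pattern& p) {
    if (t.size() != p.size()) return false;
    for (std::size_t i = 0; i < t.size(); ++i)
      if (p[i] && *p[i] != t[i]) return false;  // actual fields must be equal
    return true;                                 // formal fields match anything
  }
};

int main() {
  TupleSpace ts;
  ts.out({2, 42});                         // out [2, 42]
  auto t = ts.in({2, std::nullopt});       // in [2, ?x] -> binds x = 42, removes tuple
  if (t) std::cout << "x = " << (*t)[1] << "\n";
}
```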

Cid
Derived from Id, a data-flow language
Basic constructs:
- Threads
- Create new threads
- Wait for data from other threads
User-level vs. system-level threads:
- What is a thread? A stack, a program counter, ...
- Preemptive vs. non-preemptive

Cid (continued)
Multiple threads on each processor:
- Benefit: adaptive overlap of communication and computation
- Need a scheduler: use the OS scheduler?
- All threads on one PE share an address space
Thread mapping:
- At creation time, one may ask the system to map the thread to a particular PE
- No migration after a thread starts running
Global pointers:
- Threads on different processors can exchange data via these
- (In addition to fork/join data exchange)

Cid (continued)
Global pointers:
- Register any C structure as a global object (to get a global ID)
- A "get" operation fetches a local copy of a given object, in read or write mode
- Asynchronous "get"s are also supported: the get doesn't wait for the data to arrive
HPF-style global arrays
Grainsize control (see the sketch below):
- Especially for tree-structured computations
- Create a thread only if other processors are idle (for example)
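A hedged C++ sketch of the grainsize-control idea for a tree-structured computation (not Cid syntax): fork a child task only when the subproblem is large enough, using a size threshold as a stand-in for "create a thread if other processors are idle":

```cpp
// Sketch of grainsize control in a tree-structured computation (C++ threads,
// not Cid): fork a child task only when the subproblem is big enough, so
// that each thread gets a reasonably coarse grain of work.
#include <cstdint>
#include <future>
#include <iostream>

std::int64_t tree_sum(std::int64_t n, std::int64_t grain) {
  if (n < 2) return n;                       // leaf of the call tree
  if (n < grain)                             // too small: evaluate inline
    return tree_sum(n - 1, grain) + tree_sum(n - 2, grain);
  // Large enough: fork one branch as a new task and do the other branch here.
  auto left = std::async(std::launch::async, tree_sum, n - 1, grain);
  std::int64_t right = tree_sum(n - 2, grain);
  return left.get() + right;                 // join: wait for the child's result
}

int main() { std::cout << tree_sum(25, 20) << "\n"; }
```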

Chant
Threads that send messages to each other:
- Message passing can be MPI-style
- User-level threads
A simple implementation in Charm++ is available
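A rough illustration of the "talking threads" idea using plain C++ threads over MPI; this is not Chant's API, just a sketch in which the MPI message tag addresses a particular thread on a rank (it assumes an MPI implementation that supports MPI_THREAD_MULTIPLE):

```cpp
// Illustration of "talking threads" (Chant-like), not Chant's actual API:
// two threads per MPI rank, with the MPI tag used as the destination
// thread's id. Requires MPI_THREAD_MULTIPLE support.
#include <mpi.h>
#include <cstdio>
#include <thread>
#include <vector>

void thread_body(int rank, int tid, int nranks) {
  int peer = (rank + 1) % nranks;   // talk to the same thread id on the next rank
  int msg = rank * 100 + tid, recv = -1;
  MPI_Sendrecv(&msg, 1, MPI_INT, peer, /*sendtag=*/tid,
               &recv, 1, MPI_INT, MPI_ANY_SOURCE, /*recvtag=*/tid,
               MPI_COMM_WORLD, MPI_STATUS_IGNORE);
  std::printf("rank %d thread %d got %d\n", rank, tid, recv);
}

int main(int argc, char** argv) {
  int provided, rank, nranks;
  MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nranks);
  std::vector<std::thread> ts;
  for (int tid = 0; tid < 2; ++tid)   // two "talking threads" per rank
    ts.emplace_back(thread_body, rank, tid, nranks);
  for (auto& t : ts) t.join();
  MPI_Finalize();
}
```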

CRL
Cache-coherence techniques with software-only support:
- Release consistency
- Usage pattern: get(Read/Write, data), work on data, release(data)
- get makes a local copy
- Data-exchange protocols underneath provide the (simplified) consistency
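A minimal sketch of the get / work / release usage pattern described above; the class and function names are hypothetical stand-ins, not the actual CRL interface, and the coherence actions are stubbed out so the sketch runs on one process:

```cpp
// Sketch of the get / work / release pattern (names are hypothetical, not
// the real CRL interface). In a real software-DSM library, coherence traffic
// happens only inside get() and release(); here they are stubs.
#include <cstddef>
#include <vector>

enum class Mode { Read, Write };

class Region {
  std::vector<double> local_copy_;             // stand-in for the mapped region
 public:
  explicit Region(std::size_t n) : local_copy_(n, 1.0) {}
  double* get(Mode /*m*/) { return local_copy_.data(); }  // real: fetch/validate a local copy
  void release() { /* real: propagate writes, per release consistency */ }
  std::size_t size() const { return local_copy_.size(); }
};

void scale_region(Region& r, double factor) {
  double* data = r.get(Mode::Write);            // local copy, write mode
  for (std::size_t i = 0; i < r.size(); ++i) data[i] *= factor;  // work on the data
  r.release();                                  // make updates visible to others
}

int main() { Region r(8); scale_region(r, 2.0); }
```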

Multi-paradigm interoperability
Which of these paradigms is "the best"?
- Depends on the application, algorithm, or module
- Doesn't matter anyway, as we must use MPI (or OpenMP): the acceptance barrier
Idea:
- Allow multiple modules to be written in different paradigms
Difficulty:
- Each paradigm has its own view of how to schedule the processors
- It comes down to the scheduler
Solution: have a common scheduler

Converse
Common scheduler
Components for easily implementing new paradigms:
- User-level threads (separates the 3 functions of a thread package)
- Message-passing support
- "Futures" (origin: Halstead's MultiLisp)
  - What is a "future"? Data that may or may not be ready yet; the caller blocks on access
- Several other features
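A minimal C++ sketch of the "future" abstraction as described here (a value plus a ready flag, with the caller blocking on access); an illustration only, not Converse's interface:

```cpp
// Minimal sketch of a "future": a value that may not be ready yet, where a
// consumer blocks on access until a producer fills it in.
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <thread>

template <typename T>
class Future {
  T value_{};
  bool ready_ = false;
  std::mutex m_;
  std::condition_variable cv_;
 public:
  void set(const T& v) {                        // producer: fill in the value
    { std::lock_guard<std::mutex> lk(m_); value_ = v; ready_ = true; }
    cv_.notify_all();
  }
  T get() {                                     // consumer: block until ready
    std::unique_lock<std::mutex> lk(m_);
    cv_.wait(lk, [this] { return ready_; });
    return value_;
  }
};

int main() {
  Future<int> f;
  std::thread producer([&] { f.set(42); });     // fills the future asynchronously
  std::cout << "value = " << f.get() << "\n";   // blocks until set() runs
  producer.join();
}
```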

Other models

Object-based load balancing
Load balancing is a resource-management problem
Two sources of imbalance:
- Intrinsic: application-induced
- External: environment-induced

Object-based load balancing (continued)
Application-induced imbalances:
- Abrupt but infrequent, or
- Slow and cumulative
- Rarely: frequent, large changes
Principle of persistence:
- An extension of the principle of locality
- The behavior of objects, including their computational load and communication patterns, tends to persist over time
We have implemented strategies that exploit this automatically!

Crack propagation example: decomposition into 16 chunks (left) and 128 chunks, 8 per PE (right). The middle area contains cohesive elements. Both decompositions were obtained using Metis. Pictures: S. Breitenfeld and P. Geubelle.

Cross-approach comparison (figure): MPI-F90 original vs. Charm++ framework (all C++) vs. F90 + Charm++ library.

Load balancer in action

Cluster: handling intrusion

Applying to other languages
Need: MPI on Charm++
- Threaded MPI: multiple threads run on each PE
- Threads can be migrated!
- Uses the load balancer framework
Non-threaded irecv/waitall library:
- More work, but more efficient
Currently, rocket simulation program components (rocflo, rocsolid) are being ported via this approach

What next?
Timeshared parallel clusters
Web submission via appspector, and extension to "faucets"
New applications:
- CSE simulations
- Operations research
- Biological problems
- New applications??
More info:

Using global loads
Idea:
- For even a moderately large number of processors, collecting a vector of the load on each PE is not much more expensive than collecting the total (the per-message cost dominates)
- How can we use this vector without creating a serial bottleneck?
- Each processor knows whether it is overloaded compared with the average, and also knows which PEs are underloaded
- But we need an algorithm that allows each processor to decide whom to send work to without global coordination, beyond obtaining the vector
  - Insight: everyone has the same vector
  - Also, an assumption: there are sufficiently many fine-grained work pieces

Global vector scheme (continued)
The global algorithm, if we were able to make the decision centrally:

  receiver = nextUnderLoaded(0);
  for (i = 0; i < P; i++) {
    if (load[i] > average) {
      assign the excess work of PE i to receiver,
      advancing receiver to the next underloaded PE as needed;
    }
  }

To make it a distributed algorithm, run the same algorithm on each processor, except ignore any reassignment that doesn't involve me (a sketch follows below).
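A hedged C++ sketch of that distributed version, assuming the load vector has already been gathered on every PE (e.g. with an allgather) and abstracting work units to plain integer counts; the function and type names are mine, not from the slides:

```cpp
// Sketch of the "everyone has the same vector" load-balancing scheme. Each
// processor runs the identical deterministic scan over the load vector and
// records only the transfers that involve itself.
#include <vector>

struct Transfer { int from, to, amount; };

// Returns the transfers this PE must take part in (as sender or receiver).
std::vector<Transfer> plan_transfers(const std::vector<int>& load, int my_pe) {
  int P = static_cast<int>(load.size());
  long total = 0;
  for (int l : load) total += l;
  int average = static_cast<int>(total / P);

  std::vector<int> deficit(P);                 // how much each PE can still accept
  for (int i = 0; i < P; ++i)
    deficit[i] = (load[i] < average) ? average - load[i] : 0;

  std::vector<Transfer> mine;
  int receiver = 0;
  for (int i = 0; i < P; ++i) {                // identical scan on every PE
    int excess = load[i] - average;
    while (excess > 0) {
      while (receiver < P && deficit[receiver] == 0) ++receiver;  // next underloaded PE
      if (receiver == P) return mine;          // nothing left to fill
      int amount = (excess < deficit[receiver]) ? excess : deficit[receiver];
      if (i == my_pe || receiver == my_pe)     // ignore moves not involving me
        mine.push_back({i, receiver, amount});
      excess -= amount;
      deficit[receiver] -= amount;
    }
  }
  return mine;
}
```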

Tree-structured computations
Examples:
- Divide-and-conquer
- State-space search
- Game-tree search
- Bidirectional search
- Branch-and-bound
Issues:
- Grainsize control
- Dynamic load balancing
- Prioritization

State-space search
Definition:
- A start state, operators, and a goal state (implicit or explicit)
- Either search for a goal state or for a path leading to one
If we are looking for all solutions:
- Same as divide-and-conquer, except there is no backward communication
Search for any solution:
- Use the same algorithm as above?
- Problems: inconsistent and not monotonically increasing speedups (the parallel search may explore different parts of the tree than the sequential one, so speculative work sometimes helps and sometimes is wasted)

State-space search (continued)
Using priorities:
- Bitvector priorities
- Let the root have priority 0
- Priority of a child: the parent's priority with my rank appended (see the sketch below)
(Figure: priority tree with nodes labeled p, p01, p02, p03.)
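A small sketch of how such bitvector priorities can be generated (an illustration of the scheme described above, not any particular runtime's API); priorities are represented as strings of bits so that ordinary lexicographic comparison prefers nodes further to the left of the tree:

```cpp
// Sketch of bitvector priorities for tree-structured search: a child's
// priority is its parent's priority with bits encoding the child's rank
// appended. Lexicographic comparison then favors nodes further left.
#include <iostream>
#include <string>

// Append `bits_per_level` bits encoding `rank` (0-based child index).
std::string child_priority(const std::string& parent, int rank, int bits_per_level) {
  std::string p = parent;
  for (int b = bits_per_level - 1; b >= 0; --b)
    p.push_back(((rank >> b) & 1) ? '1' : '0');
  return p;
}

int main() {
  std::string root = "";                         // root: empty bitvector
  std::string c0 = child_priority(root, 0, 2);   // "00"
  std::string c1 = child_priority(root, 1, 2);   // "01"
  std::string gc = child_priority(c1, 2, 2);     // "0110"
  std::cout << c0 << " " << c1 << " " << gc << "\n";
  std::cout << (c0 < gc) << "\n";   // leftmost subtree compares smaller (higher priority)
}
```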

Effect of prioritization
Let us consider shared-memory machines for simplicity:
- The search is directed to the left part of the tree
- Memory usage: let B be the branching factor of the tree and D its depth; O(D*B + P) nodes are in the queue at a time
  - With a stack: O(D*P*B)
- Consistent and monotonic speedups

Need prioritized load balancing
On non-shared-memory machines?
Centralized solution:
- Memory bottleneck too!
Fully distributed solutions
Hierarchical solution:
- Token idea

Bidirectional search
The goal state is explicitly known and operators can be inverted
- Sequential:
- Parallel?

Game-tree search
A tricky problem: alpha-beta pruning, negamax

Scalability
The program should scale up to use a large number of processors
- But what does that mean? An individual simulation isn't truly scalable
A better definition of scalability:
- If I double the number of processors, I should be able to retain parallel efficiency by increasing the problem size

Isoefficiency
Quantifies scalability: how much must the problem size grow to retain the same efficiency on a larger machine?
Efficiency = Sequential time / (P * Parallel time)
- Parallel time = computation + communication + idle
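Spelled out as formulas (the symbols W, T_s, T_p, and T_o for problem size, sequential time, parallel time, and total overhead are my notation, not from the slide; this follows the standard isoefficiency formulation):

```latex
% Efficiency and the isoefficiency relation (notation W, T_s, T_p, T_o
% introduced here, consistent with the slide's verbal definitions).
\begin{align*}
  E &= \frac{T_s(W)}{P \cdot T_p}, \qquad
      T_p = \frac{T_s(W) + T_o(W,P)}{P}
      \quad\text{(computation + communication + idle)},\\[4pt]
  E &= \frac{1}{1 + T_o(W,P)/T_s(W)}
      \quad\Longrightarrow\quad
      \text{constant } E \iff T_s(W) = \frac{E}{1-E}\, T_o(W,P).
\end{align*}
```

Keeping E constant as P grows therefore requires the problem size W (measured by T_s) to grow in proportion to the overhead T_o(W,P); how fast it must grow is the isoefficiency function.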