
Parallel Programming
Todd C. Mowry
CS 740, October 16 & 18, 2000

Topics
Motivating Examples
Parallel Programming for High Performance
Impact of the Programming Model
Case Studies
–Ocean simulation
–Barnes-Hut N-body simulation

CS 740 F’00 – 2 – Motivating Problems
Simulating Ocean Currents
–regular structure, scientific computing
Simulating the Evolution of Galaxies
–irregular structure, scientific computing
Rendering Scenes by Ray Tracing
–irregular structure, computer graphics
–not discussed here (read in book)

CS 740 F’00 – 3 – Simulating Ocean Currents
Model as two-dimensional grids
Discretize in space and time
–finer spatial and temporal resolution => greater accuracy
Many different computations per time step
–set up and solve equations
Concurrency across and within grid computations
[Figure: (a) cross sections; (b) spatial discretization of a cross section]

CS 740 F’00 – 4 – Simulating Galaxy Evolution
Simulate the interactions of many stars evolving over time
Computing forces is expensive
–O(n^2) brute force approach
Hierarchical methods take advantage of the force law: F = G m1 m2 / r^2
Many time-steps, plenty of concurrency across stars within one

CS 740 F’00 – 5 – Rendering Scenes by Ray Tracing
Shoot rays into scene through pixels in image plane
Follow their paths
–they bounce around as they strike objects
–they generate new rays: ray tree per input ray
Result is color and opacity for that pixel
Parallelism across rays
All case studies have abundant concurrency

CS 740 F’00 – 6 – Parallel Programming Task
Break up computation into tasks
–assign tasks to processors
Break up data into chunks
–assign chunks to memories
Introduce synchronization for:
–mutual exclusion
–event ordering

CS 740 F’00 – 7 – Steps in Creating a Parallel Program
4 steps: Decomposition, Assignment, Orchestration, Mapping
Done by programmer or system software (compiler, runtime, ...)
Issues are the same, so assume programmer does it all explicitly

CS 740 F’00 – 8 – Partitioning for Performance
Balancing the workload and reducing wait time at synch points
Reducing inherent communication
Reducing extra work
Even these algorithmic issues trade off:
–minimize comm. => run on 1 processor => extreme load imbalance
–maximize load balance => random assignment of tiny tasks => no control over communication
–good partition may imply extra work to compute or manage it
Goal is to compromise
–fortunately, often not difficult in practice

CS 740 F’00 – 9 – Load Balance and Synch Wait Time
Limit on speedup:
–Speedup_problem(p) < Sequential Work / Max Work on any Processor
Work includes data access and other costs
Not just equal work, but must be busy at same time
Four parts to load balance and reducing synch wait time:
1. Identify enough concurrency
2. Decide how to manage it
3. Determine the granularity at which to exploit it
4. Reduce serialization and cost of synchronization
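To make the bound concrete, a small worked instance (the numbers are illustrative, not from the lecture): if 100 units of sequential work end up distributed across 4 processors as 40/20/20/20, the busiest processor caps speedup at 100/40 = 2.5, whereas a perfectly balanced 25/25/25/25 assignment would allow a speedup of 4.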

CS 740 F’00 – 10 – Deciding How to Manage Concurrency
Static versus Dynamic techniques
Static:
–algorithmic assignment based on input; won’t change
–low runtime overhead
–computation must be predictable
–preferable when applicable (except in multiprogrammed/heterogeneous environment)
Dynamic:
–adapt at runtime to balance load
–can increase communication and reduce locality
–can increase task management overheads

CS 740 F’00 – 11 – Dynamic Assignment
Profile-based (semi-static):
–profile work distribution at runtime, and repartition dynamically
–applicable in many computations, e.g. Barnes-Hut, some graphics
Dynamic Tasking:
–deal with unpredictability in program or environment (e.g. Raytrace)
–computation, communication, and memory system interactions
–multiprogramming and heterogeneity
–used by runtime systems and OS too
–pool of tasks; take and add tasks until done
–e.g. “self-scheduling” of loop iterations (shared loop counter)
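As a minimal sketch of the “self-scheduling” idea (not the lecture’s code; NITER, NTHREADS, and do_iteration are illustrative placeholders), each thread repeatedly claims the next iteration from a shared counter via an atomic fetch-and-add, so faster threads automatically pick up more iterations:

/*
 * Self-scheduling of loop iterations from a shared counter (C11 atomics + pthreads).
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define NITER    1000
#define NTHREADS 4

static atomic_int next_iter;                  /* shared loop counter */

static void do_iteration(int i) { (void)i; }  /* placeholder for real per-iteration work */

static void *worker(void *arg) {
    (void)arg;
    for (;;) {
        int i = atomic_fetch_add(&next_iter, 1);  /* claim the next unassigned iteration */
        if (i >= NITER)
            break;
        do_iteration(i);
    }
    return NULL;
}

int main(void) {
    pthread_t tid[NTHREADS];
    for (int t = 0; t < NTHREADS; t++)
        pthread_create(&tid[t], NULL, worker, NULL);
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);
    printf("all %d iterations done\n", NITER);
    return 0;
}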

CS 740 F’00 – 12 – Dynamic Tasking with Task Queues
Centralized versus distributed queues
Task stealing with distributed queues
–can compromise comm and locality, and increase synchronization
–whom to steal from, how many tasks to steal, ...
–termination detection
–maximum imbalance related to size of task
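Below is a minimal pthreads sketch of per-processor task queues with stealing, under simplifying assumptions (one mutex per queue, all tasks seeded up front, no real termination detection); Task, QCAP, and run_task are illustrative names, not the lecture’s implementation:

/*
 * Distributed task queues with simple stealing (pthreads).
 */
#include <pthread.h>

#define NTHREADS 4
#define QCAP     1024

typedef struct { int work_item; } Task;

typedef struct {
    pthread_mutex_t lock;
    Task            tasks[QCAP];
    int             count;
} TaskQueue;

static TaskQueue queue[NTHREADS];

static int pop_task(TaskQueue *q, Task *out) {       /* returns 1 if a task was taken */
    int ok = 0;
    pthread_mutex_lock(&q->lock);
    if (q->count > 0) { *out = q->tasks[--q->count]; ok = 1; }
    pthread_mutex_unlock(&q->lock);
    return ok;
}

static void push_task(TaskQueue *q, Task t) {
    pthread_mutex_lock(&q->lock);
    if (q->count < QCAP) q->tasks[q->count++] = t;
    pthread_mutex_unlock(&q->lock);
}

static void run_task(Task t) { (void)t; }             /* placeholder for real work */

static void *worker(void *arg) {
    int me = (int)(long)arg;
    Task t;
    for (;;) {
        if (pop_task(&queue[me], &t)) { run_task(t); continue; }
        /* local queue empty: try to steal one task from another processor's queue */
        int stole = 0;
        for (int v = 0; v < NTHREADS && !stole; v++)
            if (v != me && pop_task(&queue[v], &t)) { run_task(t); stole = 1; }
        if (!stole)
            break;  /* safe here because tasks create no new tasks; real systems need termination detection */
    }
    return NULL;
}

int main(void) {
    pthread_t tid[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_mutex_init(&queue[i].lock, NULL);
    for (int w = 0; w < 100; w++)                     /* seed everything into queue 0 to force stealing */
        push_task(&queue[0], (Task){ .work_item = w });
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, worker, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    return 0;
}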

CS 740 F’00 – 13 – Determining Task Granularity
Task granularity: amount of work associated with a task
General rule:
–coarse-grained => often less load balance
–fine-grained => more overhead; often more communication and contention
Communication and contention actually affected by assignment, not size
Overhead by size itself too, particularly with task queues

CS 740 F’00 – 14 – Reducing Serialization
Careful about assignment and orchestration (including scheduling)
Event synchronization
–reduce use of conservative synchronization, e.g. point-to-point instead of barriers, or granularity of pt-to-pt
–but fine-grained synch more difficult to program, more synch ops
Mutual exclusion
–separate locks for separate data
–e.g. locking records in a database: lock per process, record, or field
–lock per task in task queue, not per queue
–finer grain => less contention/serialization, more space, less reuse
–smaller, less frequent critical sections
–don’t do reading/testing in critical section, only modification
–e.g. searching for task to dequeue in task queue, building tree
–stagger critical sections in time
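A tiny sketch of the lock-granularity point about database records (the record layout and names are illustrative): one lock per record rather than one global lock, and only the modification itself inside the critical section:

/*
 * Lock per record, not per table; keep critical sections small.
 */
#include <pthread.h>

#define NRECORDS 1024

typedef struct {
    pthread_mutex_t lock;     /* finer grain => less contention than a single table-wide lock */
    double          balance;
} Record;

static Record db[NRECORDS];

static void db_init(void) {
    for (int i = 0; i < NRECORDS; i++)
        pthread_mutex_init(&db[i].lock, NULL);
}

/* Any searching/testing happens before the lock; only the update is serialized. */
static void credit(int rec, double amount) {
    pthread_mutex_lock(&db[rec].lock);
    db[rec].balance += amount;
    pthread_mutex_unlock(&db[rec].lock);
}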

CS 740 F’00 – 15 – Reducing Inherent Communication
Communication is expensive!
Measure: communication to computation ratio
Focus here on inherent communication
–determined by assignment of tasks to processes
–later see that actual communication can be greater
Assign tasks that access same data to same process
Solving communication and load balance NP-hard in general case
–but simple heuristic solutions work well in practice
Applications have structure!

CS 740 F’00 – 16 – Domain Decomposition
Works well for scientific, engineering, graphics, ... applications
Exploits local-biased nature of physical problems
–information requirements often short-range
–or long-range but fall off with distance
Simple example: nearest-neighbor grid computation
–perimeter-to-area comm-to-comp ratio (area-to-volume in 3D)
–depends on n, p: decreases with n, increases with p
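Making the ratio explicit for the nearest-neighbor example (standard back-of-the-envelope estimates, not numbers from the lecture):
–block (subgrid) partition: each processor owns an (n/√p) x (n/√p) block, communicating about 4n/√p perimeter elements while computing n^2/p points, a ratio of roughly 4√p / n
–strip (row) partition: each processor owns n/p rows, communicating about 2n elements for the same n^2/p points, a ratio of roughly 2p / n
Both ratios fall as n grows and rise as p grows, but the block ratio grows only with √p, which foreshadows the strip-versus-block comparison later in the Ocean case study.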

CS 740 F’00 – 17 – Reducing Extra Work
Speedup < Sequential Work / Max (Work + Synch Wait Time + Comm Cost + Extra Work)
Common sources of extra work:
–computing a good partition, e.g. partitioning in Barnes-Hut or sparse matrix
–using redundant computation to avoid communication
–task, data and process management overhead: applications, languages, runtime systems, OS
–imposing structure on communication: coalescing messages, allowing effective naming
Architectural implications:
–reduce need by making communication and orchestration efficient

CS 740 F’00 – 18 – Summary of Tradeoffs
Different goals often have conflicting demands
Load Balance
–fine-grain tasks
–random or dynamic assignment
Communication
–usually coarse grain tasks
–decompose to obtain locality: not random/dynamic
Extra Work
–coarse grain tasks
–simple assignment
Communication Cost
–big transfers: amortize overhead and latency
–small transfers: reduce contention

CS 740 F’00 – 19 – Impact of Programming Model
Example: LocusRoute (standard cell router)

while (route_density_improvement > threshold) {
    for (i = 1 to num_wires) do {
        - rip old wire route out
        - explore new routes
        - place wire using best new route
    }
}

[Figure: standard cell circuit with rows of standard cells separated by routing channels; the COST ARRAY gives the cost of routing through each routing-channel cell]

CS 740 F’00 – 20 – Shared-Memory Implementation
Shared memory algorithm:
–divide cost-array into regions (assign regions to PEs)
–assign wires to PEs based on the region in which their center lies
–do load balancing using stealing when local queue empty
Good points:
–good load balancing
–mostly local accesses
–high cache-hit ratio

CS 740 F’00 – 21 – Message-Passing Implementations
Solution 1:
–distribute wires and cost-array regions as in shared-memory implementation
–big overhead when wire path crosses to a remote region: send computation to remote PE, or send messages to access remote data
Solution 2:
–wires distributed as in shared-memory implementation
–each PE has a copy of the full cost array: one owned region, plus potentially stale copies of others
–send frequent updates so that copies are not too stale
Consequences:
–waste of memory in replication
–stale data => poorer quality results or more iterations
=> In either case, lots of thinking needed on the programmer's part

CS 740 F’00 – 22 – Case Studies
Simulating Ocean Currents
–regular structure, scientific computing
Simulating the Evolution of Galaxies
–irregular structure, scientific computing

CS 740 F’00 – 23 – Case 1: Simulating Ocean Currents
Model as two-dimensional grids
Discretize in space and time
–finer spatial and temporal resolution => greater accuracy
Many different computations per time step
–set up and solve equations
Concurrency across and within grid computations

CS 740 F’00– 24 – Time Step in Ocean Simulation

CS 740 F’00 – 25 – Partitioning
Exploit data parallelism
–function parallelism only to reduce synchronization
Static partitioning within a grid computation
Block versus strip
–inherent communication versus spatial locality in communication
Load imbalance due to border elements and number of boundaries
Solver has greater overheads than other computations

CS 740 F’00 – 26 – Two Static Partitioning Schemes
[Figure: strip partitioning vs. block partitioning of the grid]
Which approach is better?
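A minimal sketch of the block (subgrid) option for a nearest-neighbor sweep, assuming p is a perfect square and √p divides n (the function and variable names are illustrative, not the Ocean code):

/*
 * One processor's sweep over its own (n/sqrt(p)) x (n/sqrt(p)) block of an n x n grid.
 * The grid is assumed to carry a one-element boundary halo, so the interior starts at index 1.
 */
#include <math.h>

void sweep_my_block(double **A, double **Anew, int n, int p, int pid) {
    int q  = (int)sqrt((double)p);     /* processors per grid dimension */
    int bs = n / q;                    /* block side length per processor */
    int r0 = 1 + (pid / q) * bs;       /* first owned row */
    int c0 = 1 + (pid % q) * bs;       /* first owned column */

    /* Each point reads its four nearest neighbors; only the block's
       perimeter touches data owned by other processors. */
    for (int i = r0; i < r0 + bs; i++)
        for (int j = c0; j < c0 + bs; j++)
            Anew[i][j] = 0.25 * (A[i-1][j] + A[i+1][j] + A[i][j-1] + A[i][j+1]);
}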

CS 740 F’00 – 27 – Impact of Memory Locality
Algorithmic = perfect memory system; No Locality = dynamic assignment of columns to processors; Locality = static subgrid assignment (infinite caches)

CS 740 F’00 – 28 – Impact of Line Size & Data Distribution
no-alloc = round-robin page allocation; otherwise, data assigned to local memory. L = cache line size.

CS 740 F’00 – 29 – Case 2: Simulating Galaxy Evolution
Simulate the interactions of many stars evolving over time
Computing forces is expensive
–O(n^2) brute force approach
Hierarchical methods take advantage of the force law: F = G m1 m2 / r^2
Many time-steps, plenty of concurrency across stars within one
[Figure legend: star on which forces are being computed; star too close to approximate; small group far enough away to approximate to center of mass; large group far enough away to approximate]

CS 740 F’00 – 30 – Barnes-Hut Locality
Goal: particles close together in space should be on same processor
Difficulties: nonuniform, dynamically changing

CS 740 F’00 – 31 – Application Structure
Main data structures: array of bodies, of cells, and of pointers to them
–each body/cell has several fields: mass, position, pointers to others
–pointers are assigned to processes

CS 740 F’00 – 32 – Partitioning
Decomposition: bodies in most phases, cells in computing moments
Challenges for assignment:
Nonuniform body distribution => work and comm. nonuniform
–cannot assign by inspection
Distribution changes dynamically across time-steps
–cannot assign statically
Information needs fall off with distance from body
–partitions should be spatially contiguous for locality
Different phases have different work distributions across bodies
–no single assignment ideal for all
–focus on force calculation phase
Communication needs naturally fine-grained and irregular

CS 740 F’00 – 33 – Load Balancing
Equal particles ≠ equal work
–Solution: assign costs to particles based on the work they do
Work unknown and changes with time-steps
–Insight: system evolves slowly
–Solution: count work per particle, and use as cost for next time-step
Powerful technique for evolving physical systems
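A minimal sketch of the counting idea (field and function names are illustrative, not the actual Barnes-Hut code): each body carries a cost counter that the force-calculation phase increments, and the next time-step's partitioner balances accumulated cost rather than body count:

/*
 * Per-particle work counting for semi-static load balancing.
 */
#define NBODIES 100000

typedef struct {
    double pos[3], vel[3], mass;
    long   cost;                     /* interactions counted during the previous time-step */
} Body;

static Body bodies[NBODIES];

/* Called once per body-body (or body-cell) interaction in the force phase. */
static void record_interaction(Body *b) { b->cost++; }

/* Before the next time-step: partition so each of the p processors gets roughly
   total_cost()/p worth of spatially contiguous bodies, then reset the counters. */
static long total_cost(void) {
    long sum = 0;
    for (int i = 0; i < NBODIES; i++)
        sum += bodies[i].cost;
    return sum;
}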

CS 740 F’00 – 34 – A Partitioning Approach: ORB
Orthogonal Recursive Bisection:
–recursively bisect space into subspaces with equal work
–work is associated with bodies, as before
–continue until one partition per processor
High overhead for large number of processors
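A rough sketch of ORB under simplifying assumptions (2-D positions, qsort standing in for true median finding, power-of-two processor counts work best; the Body layout and function names are illustrative):

/*
 * Orthogonal Recursive Bisection: recursively split space, alternating dimensions,
 * so each half carries roughly equal accumulated cost.
 */
#include <stdlib.h>

typedef struct { double pos[2]; long cost; int owner; } Body;

static int split_dim;                      /* dimension used by the comparator below */

static int cmp_pos(const void *a, const void *b) {
    double da = ((const Body *)a)->pos[split_dim];
    double db = ((const Body *)b)->pos[split_dim];
    return (da > db) - (da < db);
}

/* Assign bodies[lo..hi) to processors [p0, p0+np) by recursive bisection. */
static void orb(Body *bodies, int lo, int hi, int p0, int np, int dim) {
    if (np == 1) {
        for (int i = lo; i < hi; i++)
            bodies[i].owner = p0;
        return;
    }
    split_dim = dim;
    qsort(bodies + lo, (size_t)(hi - lo), sizeof(Body), cmp_pos);

    /* find the split point where roughly half the total cost has accumulated */
    long total = 0, half = 0;
    for (int i = lo; i < hi; i++)
        total += bodies[i].cost;
    int mid = lo;
    while (mid < hi && half < total / 2)
        half += bodies[mid++].cost;

    orb(bodies, lo,  mid, p0,          np / 2,      (dim + 1) % 2);
    orb(bodies, mid, hi,  p0 + np / 2, np - np / 2, (dim + 1) % 2);
}

The repeated splitting and sorting is what makes the overhead high for large numbers of processors, as the slide notes.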

CS 740 F’00 – 35 – Another Approach: Costzones
Insight: the tree already contains an encoding of spatial locality
Costzones is low-overhead and very easy to program

CS 740 F’00 – 36 – Barnes-Hut Performance
Speedups on simulated multiprocessor
Extra work in ORB is the key difference
[Figure: speedup curves for Ideal, Costzones, and ORB]