1
Perspective on Parallel Programming
CS 258, Spring 99. David E. Culler, Computer Science Division, U.C. Berkeley
2
Outline for Today
Motivating problems (application case studies)
Process of creating a parallel program
What a simple parallel program looks like in three major programming models
What primitives must a system support?
Later: performance issues and architectural interactions
3
Simulating Ocean Currents
[Figure: (a) cross sections, (b) spatial discretization of a cross section]
Model as two-dimensional grids
Discretize in space and time: finer spatial and temporal resolution => greater accuracy
Many different computations per time step: set up and solve equations
Concurrency across and within grid computations
Static and regular
4
Simulating Galaxy Evolution
Simulate the interactions of many stars evolving over time
Computing forces is expensive: O(n^2) for the brute-force approach
Hierarchical methods take advantage of the force law: F = G·m1·m2 / r^2
Many time-steps, plenty of concurrency across stars within one
5
Rendering Scenes by Ray Tracing
Shoot rays into scene through pixels in the image plane
Follow their paths: they bounce around as they strike objects, and they generate new rays (a ray tree per input ray)
Result is color and opacity for that pixel
Parallelism across rays
How much concurrency in these examples?
6
Creating a Parallel Program
Pieces of the job:
Identify work that can be done in parallel (work includes computation, data access, and I/O)
Partition work and perhaps data among processes
Manage data access, communication, and synchronization
7
Definitions
Task: arbitrary piece of work in a parallel computation; executed sequentially, so concurrency is only across tasks. E.g. a particle/cell in Barnes-Hut, a ray or ray group in Raytrace. Fine-grained versus coarse-grained tasks.
Process (thread): abstract entity that performs the tasks assigned to it. Processes communicate and synchronize to perform their tasks.
Processor: physical engine on which a process executes. Processes virtualize the machine to the programmer: write the program in terms of processes, then map to processors.
8
4 Steps in Creating a Parallel Program
Decomposition of computation into tasks
Assignment of tasks to processes
Orchestration of data access, communication, and synchronization
Mapping of processes to processors
9
Decomposition
Identify concurrency and decide the level at which to exploit it
Break up computation into tasks to be divided among processes
  Tasks may become available dynamically
  Number of available tasks may vary with time
Goal: enough tasks to keep processes busy, but not too many
  Number of tasks available at a time is an upper bound on achievable speedup
10
Limited Concurrency: Amdahl’s Law
Most fundamental limitation on parallel speedup
If a fraction s of sequential execution is inherently serial, speedup <= 1/s
Example: 2-phase calculation
  sweep over n-by-n grid and do some independent computation
  sweep again and add each value into a global sum
Time for first phase = n^2/p
Second phase serialized at the global variable, so time = n^2
Speedup <= 2n^2 / (n^2/p + n^2), or at most 2
Trick: divide the second phase into two (sketched below)
  accumulate into a private sum during the sweep
  add per-process private sums into the global sum
Parallel time is n^2/p + n^2/p + p, and speedup is at best 2n^2·p / (2n^2 + p^2)
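A minimal sketch of the trick, in the deck's C-like pseudocode: each process accumulates into a private sum during its part of the sweep and adds it into the global sum exactly once. mymin/mymax, mysum, gsum, and sum_lock are illustrative names, not taken from the slides.

    /* Phase 2 rewritten so only p additions are serialized instead of n*n. */
    float mysum = 0;
    for (i = mymin; i <= mymax; i++)       /* this process's rows: n^2/p work */
        for (j = 1; j <= n; j++)
            mysum += A[i][j];              /* accumulate into private sum      */
    LOCK(sum_lock);
    gsum += mysum;                         /* one serialized addition per process */
    UNLOCK(sum_lock);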
11
Understanding Amdahl’s Law
[Figure: work done concurrently versus time for cases (a), (b), and (c); concurrency axis marked 1 and p, time axis marked n^2/p and n^2]
12
Concurrency Profiles
Area under the curve is total work done, or time with 1 processor
Horizontal extent is a lower bound on time (infinite processors)
Speedup is the ratio: (Σk fk·k) / (Σk fk·⌈k/p⌉); base case: 1 / (s + (1-s)/p)
Amdahl's law applies to any overhead, not just limited concurrency
13
Assignment
Specify the mechanism to divide work up among processes
  e.g. which process computes forces on which stars, or which rays
  Balance workload, reduce communication and management cost
Structured approaches usually work well
  Code inspection (parallel loops) or understanding of the application
  Well-known heuristics
  Static versus dynamic assignment
As programmers, we worry about partitioning first
  Usually independent of architecture or programming model
  But cost and complexity of using primitives may affect decisions
14
Orchestration
Naming data
Structuring communication
Synchronization
Organizing data structures and scheduling tasks temporally
Goals:
  Reduce cost of communication and synchronization
  Preserve locality of data reference
  Schedule tasks to satisfy dependences early
  Reduce overhead of parallelism management
Choices depend on the programming model, communication abstraction, and efficiency of primitives
Architects should provide appropriate primitives efficiently
15
Mapping
Two aspects:
  Which process runs on which particular processor? (mapping to a network topology)
  Will multiple processes run on the same processor?
Space-sharing
  Machine divided into subsets, only one app at a time in a subset
  Processes can be pinned to processors, or left to the OS
System allocation
Real world: user specifies desires in some aspects, system handles the rest
Usually adopt the view: process <-> processor
16
Parallelizing Computation vs. Data
Computation is decomposed and assigned (partitioned)
Partitioning data is often a natural view too
  Computation follows data: owner computes
  Grid example; data mining
Distinction between computation and data is stronger in many applications
  Barnes-Hut
  Raytrace
17
Architect’s Perspective
What can be addressed by better hardware design?
What is fundamentally a programming issue?
18
High-level Goals
High performance (speedup over sequential program)
But low resource usage and development effort
Implications for algorithm designers and architects?
19
What Parallel Programs Look Like
20
Example: iterative equation solver
Simplified version of a piece of the Ocean simulation
Illustrate the program in a low-level parallel language
  C-like pseudocode with simple extensions for parallelism
  Expose basic communication and synchronization primitives
  State of most real parallel programming today
21
Grid Solver
Gauss-Seidel (near-neighbor) sweeps to convergence
  interior n-by-n points of an (n+2)-by-(n+2) grid updated in each sweep
  updates done in-place in the grid
  difference from the previous value computed
  partial diffs accumulated into a global diff at the end of every sweep
  check if it has converged to within a tolerance parameter
22
Sequential Version
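A minimal C-like sketch of the sequential solver described on the previous slide; the nearest-neighbor weighting (0.2), the tolerance name TOL, and the variable names are assumptions.

    /* Sweep the interior n-by-n points of the (n+2)-by-(n+2) grid in place,
       accumulate per-point differences into diff, repeat until converged. */
    float **A;                 /* (n+2) x (n+2) grid, including border rows/cols */
    int n;
    float diff;

    void Solve(float **A)
    {
        int i, j, done = 0;
        float temp;
        while (!done) {
            diff = 0;
            for (i = 1; i <= n; i++)
                for (j = 1; j <= n; j++) {
                    temp = A[i][j];                        /* previous value        */
                    A[i][j] = 0.2 * (A[i][j] + A[i][j-1] + A[i-1][j]
                                     + A[i][j+1] + A[i+1][j]);   /* in-place update */
                    diff += fabs(A[i][j] - temp);          /* accumulate difference */
                }
            if (diff / (n*n) < TOL) done = 1;              /* converged within tolerance? */
        }
    }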
23
Decomposition
Simple way to identify concurrency is to look at loop iterations
  dependence analysis; if not enough concurrency, then look further
Not much concurrency here at this level (all loops sequential)
Examine fundamental dependences:
  Concurrency O(n) along anti-diagonals, serialization O(n) along the diagonal
  Retain loop structure, use point-to-point synch; problem: too many synch operations
  Restructure loops, use global synch; imbalance and too much synch
24
Exploit Application Knowledge
Reorder grid traversal: red-black ordering (see the sketch below)
Different ordering of updates: may converge quicker or slower
Red sweep and black sweep are each fully parallel
  Global synch between them (conservative but convenient)
Ocean uses red-black
We use a simpler, asynchronous one to illustrate
  no red-black; simply ignore dependences within a sweep
  parallel program is nondeterministic
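A sketch of the red-black alternative (not the asynchronous version used in the rest of the deck): two fully parallel half-sweeps per iteration, one per color, separated by a global barrier. The parity test for color and the names are illustrative.

    for (color = 0; color < 2; color++) {          /* 0 = red, 1 = black */
        for_all (i = 1; i <= n; i++)
            for_all (j = 1; j <= n; j++)
                if ((i + j) % 2 == color)          /* points of this color only */
                    A[i][j] = 0.2 * (A[i][j] + A[i][j-1] + A[i-1][j]
                                     + A[i][j+1] + A[i+1][j]);
        BARRIER(bar, nprocs);                      /* conservative global synch */
    }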
25
Decomposition
Decomposition into elements: degree of concurrency n^2
Decompose into rows? Degree of concurrency?
for_all assignment??
26
Assignment
Static assignment (given decomposition into rows), sketched below:
  block assignment of rows: row i is assigned to process ⌊i / (n/p)⌋
  cyclic assignment of rows: process i is assigned rows i, i+p, ...
Dynamic assignment:
  get a row index, work on the row, get a new row, ...
What is the mechanism? Concurrency? Volume of communication?
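A sketch of the two static assignments for a process with id pid (0 <= pid < p), assuming n is a multiple of p; sweep_row and the other names are illustrative.

    /* Block assignment: a contiguous chunk of n/p rows per process. */
    mymin = 1 + pid * (n / p);
    mymax = mymin + (n / p) - 1;
    for (i = mymin; i <= mymax; i++)
        sweep_row(i);

    /* Cyclic assignment: process pid gets rows pid+1, pid+1+p, pid+1+2p, ... */
    for (i = pid + 1; i <= n; i += p)
        sweep_row(i);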
27
Data Parallel Solver
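A minimal sketch of what the data-parallel version might look like: for_all declares that all interior points may be updated in parallel, and the partial differences are combined with a global reduction (REDUCE is the library call mentioned later in the deck; mydiff and TOL are illustrative).

    while (!done) {
        mydiff = 0;
        for_all (i = 1; i <= n; i++)               /* all iterations independent */
            for_all (j = 1; j <= n; j++) {
                temp = A[i][j];
                A[i][j] = 0.2 * (A[i][j] + A[i][j-1] + A[i-1][j]
                                 + A[i][j+1] + A[i+1][j]);
                mydiff += fabs(A[i][j] - temp);
            }
        REDUCE(mydiff, diff, ADD);                 /* global sum of differences  */
        if (diff / (n*n) < TOL) done = 1;
    }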
28
Shared Address Space Solver
Single Program Multiple Data (SPMD)
Assignment controlled by values of variables used as loop bounds
29
Generating Threads
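A sketch of how the SPMD processes might be created. CREATE and G_MALLOC are not named on the slides and are assumptions in the style of the deck's pseudocode; WAIT_FOR_END appears later on the barrier slide.

    main()
    {
        read(n); read(nprocs);                         /* problem size, process count     */
        A = G_MALLOC((n+2) * (n+2) * sizeof(float));   /* grid in the shared address space */
        initialize(A);
        CREATE(nprocs - 1, Solve, A);                  /* spawn nprocs-1 worker processes  */
        Solve(A);                                      /* the parent participates too      */
        WAIT_FOR_END(nprocs - 1);                      /* wait for the workers to finish   */
    }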
30
Assignment Mechanism
31
SAS Program
SPMD: not lockstep; not necessarily the same instructions
Assignment controlled by values of variables used as loop bounds
  unique pid per process, used to control assignment
done condition evaluated redundantly by all
Code that does the update is identical to the sequential program
  each process has a private mydiff variable
Most interesting special operations are for synchronization
  accumulations into the shared diff have to be mutually exclusive
  why the need for all the barriers? (see the sketch below)
Good global reduction?
Utility of this parallel accumulate???
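A minimal sketch of the per-process loop body these bullets describe: a private mydiff, a lock-protected accumulation into the shared diff, and barriers around the convergence test. bar1, diff_lock, and mymin/mymax are illustrative names.

    while (!done) {
        mydiff = 0;
        if (pid == 0) diff = 0;              /* one process resets the shared diff     */
        BARRIER(bar1, nprocs);               /* ...before anyone starts sweeping       */
        for (i = mymin; i <= mymax; i++)     /* only this process's rows               */
            for (j = 1; j <= n; j++) {
                temp = A[i][j];
                A[i][j] = 0.2 * (A[i][j] + A[i][j-1] + A[i-1][j]
                                 + A[i][j+1] + A[i+1][j]);
                mydiff += fabs(A[i][j] - temp);
            }
        LOCK(diff_lock);
        diff += mydiff;                      /* mutually exclusive accumulation        */
        UNLOCK(diff_lock);
        BARRIER(bar1, nprocs);               /* all contributions in before testing    */
        if (diff / (n*n) < TOL) done = 1;    /* done evaluated redundantly by all      */
        BARRIER(bar1, nprocs);               /* keep diff stable until all have tested */
    }

The barriers answer the slide's question: they keep the shared diff from being reset or reused while other processes are still accumulating into it or testing it.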
32
Mutual Exclusion
Why is it needed?
Provided by LOCK-UNLOCK around a critical section
  set of operations we want to execute atomically
  implementation of LOCK/UNLOCK must guarantee mutual exclusion
Serialization? Contention? Non-local accesses in the critical section?
  use private mydiff for partial accumulation!
33
Global Event Synchronization
BARRIER(nprocs): wait here till nprocs processes get here
  Built using lower-level primitives
Global sum example: wait for all to accumulate before using the sum
Often used to separate phases of computation:

  Process P_1             Process P_2             Process P_nprocs
  set up eqn system       set up eqn system       set up eqn system
  Barrier (name, nprocs)  Barrier (name, nprocs)  Barrier (name, nprocs)
  solve eqn system        solve eqn system        solve eqn system
  apply results           apply results           apply results

Conservative form of preserving dependences, but easy to use
WAIT_FOR_END (nprocs-1)
34
Pt-to-pt Event Synch (Not Used Here)
One process notifies another of an event so it can proceed
Common example: producer-consumer (bounded buffer)
Concurrent programming on a uniprocessor: semaphores
Shared address space parallel programs: semaphores, or use ordinary variables as flags

  P1                  P2
  A = 1;              a: while (flag is 0) do nothing;
  b: flag = 1;        print A;

Busy-waiting or spinning
35
Group Event Synchronization
Subset of processes involved
Can use flags or barriers (involving only the subset)
Concept of producers and consumers
Major types:
  Single-producer, multiple-consumer
  Multiple-producer, single-consumer
36
Message Passing Grid Solver
Cannot declare A to be a globally shared array
  compose it logically from per-process private arrays
  usually allocated in accordance with the assignment of work
    a process assigned a set of rows allocates them locally
Transfers of entire rows between traversals
Structurally similar to the SPMD SAS program, but orchestration is different
  data structures and data access/naming
  communication
  synchronization
Ghost rows
37
Data Layout and Orchestration
[Figure: grid data partition allocated per processor, with edge rows exchanged between neighboring partitions]
Data partition allocated per processor
Add ghost rows to hold boundary data
Send edges to neighbors
Receive into ghost rows
Compute as in the sequential program
38
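A minimal sketch of the message-passing solver's main loop, assuming each process owns myA (its block of n/p rows plus two ghost rows) and that SEND/RECEIVE block until the message is delivered; the ROW tag, n_local, and the use of REDUCE are illustrative.

    while (!done) {
        mydiff = 0;
        /* exchange edge rows with neighbors; received rows land in the ghost rows */
        if (pid != 0)        SEND(&myA[1][1],        n*sizeof(float), pid-1, ROW);
        if (pid != nprocs-1) SEND(&myA[n_local][1],  n*sizeof(float), pid+1, ROW);
        if (pid != 0)        RECEIVE(&myA[0][1],         n*sizeof(float), pid-1, ROW);
        if (pid != nprocs-1) RECEIVE(&myA[n_local+1][1], n*sizeof(float), pid+1, ROW);

        for (i = 1; i <= n_local; i++)       /* local row indices, not global */
            for (j = 1; j <= n; j++) {
                temp = myA[i][j];
                myA[i][j] = 0.2 * (myA[i][j] + myA[i][j-1] + myA[i-1][j]
                                   + myA[i][j+1] + myA[i+1][j]);
                mydiff += fabs(myA[i][j] - temp);
            }

        REDUCE(mydiff, diff, ADD);           /* combine partial diffs into diff on every
                                                process (or a REDUCE + BROADCAST pair)  */
        if (diff / (n*n) < TOL) done = 1;    /* all test the same reduced value         */
    }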
39
Notes on Message Passing Program
Use of ghost rows
Receive does not transfer data, send does
  unlike SAS, which is usually receiver-initiated (a load fetches data)
Communication done at the beginning of an iteration, so no asynchrony
Communication in whole rows, not an element at a time
Core loop similar, but indices/bounds in local rather than global space
Synchronization through sends and receives
  update of global diff and event synch for the done condition
  could implement locks and barriers with messages
Can use REDUCE and BROADCAST library calls to simplify code
40
Send and Receive Alternatives
Can extend functionality: stride, scatter-gather, groups
Semantic flavors: based on when control is returned
  Affect when data structures or buffers can be reused at either end

  Send/Receive
    Synchronous
    Asynchronous
      Blocking asynchronous
      Nonblocking asynchronous

Affect event synch (mutual exclusion by fiat: only one process touches the data)
Affect ease of programming and performance
Synchronous messages provide built-in synch through the match
  Separate event synchronization needed with asynchronous messages
With synchronous messages, our code is deadlocked. Fix? (see the sketch below)
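One standard fix, shown as a sketch: break the symmetry so that on every edge one side sends while the other receives first, here by pid parity (using asynchronous sends is another option). Names follow the earlier message-passing sketch and are illustrative.

    if (pid % 2 == 0) {                      /* even processes: send first      */
        if (pid != nprocs-1) SEND(&myA[n_local][1],  n*sizeof(float), pid+1, ROW);
        if (pid != 0)        SEND(&myA[1][1],        n*sizeof(float), pid-1, ROW);
        if (pid != nprocs-1) RECEIVE(&myA[n_local+1][1], n*sizeof(float), pid+1, ROW);
        if (pid != 0)        RECEIVE(&myA[0][1],         n*sizeof(float), pid-1, ROW);
    } else {                                 /* odd processes: receive first    */
        if (pid != 0)        RECEIVE(&myA[0][1],         n*sizeof(float), pid-1, ROW);
        if (pid != nprocs-1) RECEIVE(&myA[n_local+1][1], n*sizeof(float), pid+1, ROW);
        if (pid != 0)        SEND(&myA[1][1],        n*sizeof(float), pid-1, ROW);
        if (pid != nprocs-1) SEND(&myA[n_local][1],  n*sizeof(float), pid+1, ROW);
    }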
41
Orchestration: Summary
Shared address space
  Shared and private data explicitly separate
  Communication implicit in access patterns
  No correctness need for data distribution
  Synchronization via atomic operations on shared data
  Synchronization explicit and distinct from data communication
Message passing
  Data distribution among local address spaces needed
  No explicit shared structures (implicit in communication patterns)
  Communication is explicit
  Synchronization implicit in communication (at least in the synchronous case)
    mutual exclusion by fiat
42
Correctness in Grid Solver Program
Decomposition and assignment similar in SAS and message-passing
Orchestration is different
  data structures, data access/naming, communication, synchronization
Performance?