
Slide 1: Parallel Computing 5 – Parallel Application Design
Ondřej Jakl, Institute of Geonics, Academy of Sciences of the CR

Slide 2: Outline of the lecture
- Task/channel model
- Foster's design methodology
- Partitioning
- Communication analysis
- Agglomeration
- Mapping to processors
- Examples

Slide 3: Design of parallel algorithms
- In general a very creative process
- Only methodical frameworks are available
- Usually more than one alternative has to be considered
- The best parallel solution may differ from what the sequential approach suggests

Slide 4: Task/channel model (1)
- Introduced in Ian Foster's Designing and Building Parallel Programs [Foster 1995]
  - http://www-unix.mcs.anl.gov/dbpp
- Represents a parallel computation as a set of tasks
  - a task is a program, its local memory and a collection of I/O ports
  - a task can send local data values to other tasks via output ports
  - a task can receive data values from other tasks via input ports
- The tasks may interact with each other by sending messages through channels
  - a channel is a message queue that connects one task's output port with another task's input port
  - a nonblocking asynchronous send and a blocking receive are assumed
- An abstraction close to the message passing model
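Because the abstraction is so close to message passing, a channel can be illustrated directly in MPI. The following is a minimal sketch, not part of the original lecture: the "channel" is represented by a (communicator, tag) pair, rank 0 plays the sending task and rank 1 the receiving task; the tag name and the transferred value are illustrative.

#include <mpi.h>
#include <stdio.h>

#define CHANNEL_TAG 1            /* identifies one conceptual channel */

int main(int argc, char *argv[])
{
    int rank, value = 42;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {                       /* producer task: output port */
        MPI_Request req;
        /* nonblocking (asynchronous) send, as assumed by the model */
        MPI_Isend(&value, 1, MPI_INT, 1, CHANNEL_TAG, MPI_COMM_WORLD, &req);
        /* ... the task may continue computing here ... */
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    } else if (rank == 1) {                /* consumer task: input port */
        int received;
        /* blocking receive: the task waits until data arrives on the channel */
        MPI_Recv(&received, 1, MPI_INT, 0, CHANNEL_TAG, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("task 1 received %d\n", received);
    }

    MPI_Finalize();
    return 0;
}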

Slide 5: Task/channel model (2)
- Directed graph of tasks (vertices) and channels (edges)
- [Figure after [Quinn 2004]: a program as tasks with input and output ports connected by channels]

Slide 6: Foster's methodology [Foster 1995]
- Design stages:
  1. partitioning into concurrent tasks
  2. communication analysis to coordinate the tasks
  3. agglomeration into larger tasks with respect to the target platform
  4. mapping of tasks to processors
- Stages 1 and 2 are at the conceptual level; stages 3 and 4 are implementation dependent
- In practice the stages are often considered simultaneously

Slide 7: Partitioning (decomposition)
- Process of dividing the computation and the data into pieces – primitive tasks
- Goal: expose the opportunities for parallel processing
- Maximal (fine-grained) decomposition for greater flexibility
- Complementary techniques:
  - domain decomposition (data-centric approach)
  - functional decomposition (computation-centric approach)
- Combinations possible
  - usual scenario: primary decomposition – functional; secondary decomposition – domain

Slide 8: Domain (data) decomposition
- Primary object of decomposition: the processed data
  - first, the data associated with the problem is divided into pieces
    - focus on the largest and/or most frequently accessed data
    - pieces should be of comparable size
  - next, the computation is partitioned according to the data on which it operates
    - usually the same code for each task (SPMD – Single Program Multiple Data)
    - may be non-trivial and may bring up complex mathematical problems
- The most often used technique in parallel programming
- [Figure: 3-D grid data – one-, two- and three-dimensional decompositions, [Foster 1995]]
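For the common 1-D case, dividing n data items into p pieces of comparable size is often expressed with small helper formulas like the ones below. This is a sketch of the usual balanced-block scheme (e.g. the one popularized in [Quinn 2004]); the function names are illustrative.

/* Balanced 1-D block decomposition of n items among p tasks (ranks 0..p-1). */
int block_low(int rank, int p, int n)   /* first item owned by 'rank' */
{
    return (int)((long long)rank * n / p);
}

int block_high(int rank, int p, int n)  /* last item owned by 'rank' */
{
    return block_low(rank + 1, p, n) - 1;
}

int block_size(int rank, int p, int n)  /* number of items owned by 'rank' */
{
    return block_high(rank, p, n) - block_low(rank, p, n) + 1;
}

With these formulas the sizes of any two pieces differ by at most one item, which keeps the pieces of comparable size as required above.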

Slide 9: Functional (task) decomposition
- Primary object of decomposition: the computation
  - first, the computation is decomposed into disjoint tasks
    - different codes for the tasks (MPMD – Multiple Program Multiple Data)
    - methodological benefits: implies program structuring, gives rise to simpler modules with interfaces (cf. object-oriented programming, etc.)
  - next, the data is partitioned according to the requirements of the tasks
    - data requirements may be disjoint, or overlap (⇒ communication)
- Sources of parallelism:
  - concurrent processing of independent tasks
  - concurrent processing of a stream of data through pipelining
    - a stream of data is passed through a succession of tasks, each of which performs some operation on it (MPSD – Multiple Program Single Data)
- The number of tasks usually does not scale with the problem size – for greater scalability, combine with domain decomposition on the subtasks
- [Figure: climate model, [Foster 1995]]

Slide 10: Good decomposition
- More tasks (by at least an order of magnitude) than processors
  - if not: little flexibility
- No redundancy in processing and data
  - if not: little scalability
- Comparable size of tasks
  - if not: difficult load balancing
- Number of tasks proportional to the size of the problem
  - if not: problems utilizing additional processors
- Are alternative partitions available?

Slide 11: Example – PI calculation
- Calculation of π by the standard numerical integration formula
- Consider numerical integration based on the rectangle (midpoint) method:
  - the integral is approximated by the area of evenly spaced rectangular strips
  - the height of each strip is the value of the integrated function at the strip's midpoint
- [Figure: plot of F(x) = 4/(1+x^2) over [0, 1], divided into strips]
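The figure plots F(x) = 4/(1+x^2) over [0, 1]; the underlying identity and its midpoint-rule approximation with n strips can be written as

\pi \;=\; \int_0^1 \frac{4}{1+x^2}\,dx
    \;\approx\; \frac{1}{n}\sum_{i=0}^{n-1}\frac{4}{1+\bigl(\tfrac{i+0.5}{n}\bigr)^{2}}

where 1/n is the width of one strip and the summand is the height of strip i at its midpoint.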

Slide 12: PI calculation – sequential algorithm
Sequential pseudocode:
  set n (number of strips)
  for each strip
    calculate the height y of the strip (rectangle) at its midpoint
    add y to the running sum S
  endfor
  multiply S by the width of the strips
  print the result
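A direct C rendering of this pseudocode might look as follows (a sketch; the value of n is illustrative).

#include <stdio.h>

int main(void)
{
    int n = 1000000;              /* number of strips (illustrative value) */
    double width = 1.0 / n;       /* width of each strip */
    double sum = 0.0;

    for (int i = 0; i < n; i++) {
        double x = (i + 0.5) * width;      /* midpoint of strip i */
        sum += 4.0 / (1.0 + x * x);        /* height of strip i */
    }
    printf("pi ~ %.15f\n", sum * width);
    return 0;
}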

Slide 13: PI calculation – parallel algorithm
Parallel pseudocode (for the task/channel model):
  if master then
    set n (number of strips)
    send n to the workers
  else  // worker
    receive n from the master
  endif
  for each strip assigned to this task
    calculate the height y of the strip (rectangle) at its midpoint
    add y to the (partial) sum S
  endfor
  if master then
    receive S from all workers
    sum all S and multiply by the width of the strips
    print the result
  else  // worker
    send S to the master
  endif
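One possible MPI realization of this pseudocode is sketched below. It replaces the explicit master-to-worker and worker-to-master channels with the collectives MPI_Bcast and MPI_Reduce, and assigns strips to tasks cyclically; these choices, and the value of n, are assumptions of the sketch, not prescribed by the lecture.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size, n = 1000000;          /* n: number of strips (set by the master) */
    double local = 0.0, pi = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* the master "sends n to the workers": here a broadcast from rank 0 */
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);

    double width = 1.0 / n;
    for (int i = rank; i < n; i += size) {      /* strips assigned cyclically */
        double x = (i + 0.5) * width;           /* midpoint of strip i */
        local += 4.0 / (1.0 + x * x);           /* partial sum of heights */
    }

    /* the workers "send S to the master": here a sum reduction onto rank 0 */
    MPI_Reduce(&local, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("pi ~ %.15f\n", pi * width);

    MPI_Finalize();
    return 0;
}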

Slide 14: Parallel PI calculation – partitioning
- Domain decomposition:
  - primitive task – calculation of one strip height
- Functional decomposition:
  - manager task: controls the computation; worker task(s): perform the main calculation
  - manager/worker technique (also called control decomposition)
  - a more or less technical decomposition
- A perfectly/embarrassingly parallel problem: the (worker) processes are (almost) independent

Slide 15: Communication analysis
- Determination of the communication pattern among the primitive tasks
- Goal: expose the information flow
- The tasks generated by partitioning are, as a rule, not independent – they cooperate by exchanging data
- Communication means overhead – minimize it!
  - not included in the sequential algorithm
- Efficient communication may be difficult to organize
  - especially in domain-decomposed problems

Slide 16: Parallel communication
Categorization:
- local: only a small number of "neighbouring" tasks is involved  vs.  global: many "distant" tasks participate
- structured: regular, repeated communication patterns in space and time  vs.  unstructured: the communication networks are arbitrary graphs
- static: communication partners do not change over time  vs.  dynamic: communication depends on the computation history and changes at runtime
- synchronous: communication partners cooperate in data transfer operations  vs.  asynchronous: producers cannot determine the data requests of consumers
The first member of each pair is to be preferred in parallel programs.

Slide 17: Good communication
- Preferably no communication involved in the parallel algorithm
  - if not: overhead decreasing parallel efficiency
- Tasks have comparable communication demands
  - if not: little scalability
- Tasks communicate only with a small number of neighbours
  - if not: loss of parallel efficiency
- Communication operations and computation in different tasks can proceed concurrently; communication and computation can overlap
  - if not: inefficient and nonscalable algorithm

Slide 18: Example – Jacobi finite differences
- Jacobi finite difference method
- Repeated update (in timesteps) of values assigned to the points of a multidimensional grid
- In 2-D, grid point (i, j) gets in timestep t+1 a value given by a weighted mean of its own value and its neighbours' values at timestep t [Foster 1995]
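The formula itself appears only as an image in the original slide; a typical form of the weighted mean used in such examples (e.g. the one in [Foster 1995]), stated here as an assumption about the elided figure, is

X^{(t+1)}_{i,j} \;=\; \frac{4\,X^{(t)}_{i,j} + X^{(t)}_{i-1,j} + X^{(t)}_{i+1,j} + X^{(t)}_{i,j-1} + X^{(t)}_{i,j+1}}{8}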

Slide 19: Jacobi – parallel algorithm
Decomposition (domain):
- primitive task – calculation of the weighted mean in one grid point
Parallel code, main loop:
  for each timestep t
    send X_{i,j}(t) to each neighbour
    receive X_{i-1,j}(t), X_{i+1,j}(t), X_{i,j-1}(t), X_{i,j+1}(t) from the neighbours
    calculate X_{i,j}(t+1)
  endfor
Communication:
- communication channels between neighbours
- local, structured, static, synchronous
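After agglomeration into strips of rows (one strip per process), one timestep of this scheme is commonly coded as a halo exchange followed by the local update. The sketch below assumes MPI, a 1-D row decomposition, halo rows u[0] and u[nlocal+1], neighbour ranks up/down (MPI_PROC_NULL at the grid boundary), and the weighted mean assumed above; all names and sizes are illustrative.

#include <mpi.h>

#define NCOLS 128   /* illustrative number of grid columns */

/* One Jacobi timestep on a strip of nlocal rows; u holds the current values
   (rows 0 and nlocal+1 are halo rows), unew receives the updated values. */
void jacobi_step(double u[][NCOLS], double unew[][NCOLS],
                 int nlocal, int up, int down, MPI_Comm comm)
{
    /* exchange boundary rows with the two neighbouring tasks (halo exchange) */
    MPI_Sendrecv(u[1],        NCOLS, MPI_DOUBLE, up,   0,
                 u[nlocal+1], NCOLS, MPI_DOUBLE, down, 0,
                 comm, MPI_STATUS_IGNORE);
    MPI_Sendrecv(u[nlocal],   NCOLS, MPI_DOUBLE, down, 1,
                 u[0],        NCOLS, MPI_DOUBLE, up,   1,
                 comm, MPI_STATUS_IGNORE);

    /* local update: weighted mean over the 5-point stencil */
    for (int i = 1; i <= nlocal; i++)
        for (int j = 1; j < NCOLS - 1; j++)
            unew[i][j] = (4.0 * u[i][j] + u[i-1][j] + u[i+1][j]
                          + u[i][j-1] + u[i][j+1]) / 8.0;
}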

Slide 20: Example – Gauss-Seidel scheme
- More efficient in sequential computing
- Not easy to parallelize [Foster 1995]

Slide 21: Agglomeration
- Process of grouping primitive tasks into larger tasks
- Goal: revise the (abstract, conceptual) partitioning and communication to improve performance
  - choose a granularity appropriate to the target parallel computer
- Large numbers of fine-grained tasks tend to be inefficient because of
  - high communication cost
  - task creation cost – the spawn operation is rather expensive (and also to simplify programming demands)
- Agglomeration increases granularity
  - potential conflict with retaining flexibility and scalability [next slides]
- Closely related to mapping to processors

Slide 22: Agglomeration & granularity
- Granularity: a measure characterizing the size and quantity of tasks
- Increasing granularity by combining several tasks into larger ones
  - reduces communication cost: less communication (a), fewer but larger messages (b)
  - reduces task creation cost: fewer processes
- Agglomerate tasks that
  - frequently communicate with each other (increases locality)
  - cannot execute concurrently
- Consider also [next slides]
  - surface-to-volume effects
  - replication of computation/data
- [Figure after [Quinn 2004]]

Slide 23: Surface-to-volume effects (1)
- The communication/computation ratio decreases with increasing granularity:
  - computation cost is proportional to the "volume" of the subdomain
  - communication cost is proportional to its "surface"
- Agglomeration in all dimensions is most efficient
  - reduces the surface for a given volume
  - in practice more difficult to code
- Difficult with unstructured communication
- Ex.: Jacobi finite differences [next slide]
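As a concrete illustration, assume a 2-D N x N grid agglomerated into P square subdomains, one per task. Up to constant factors, the per-task costs then scale as

\text{computation} \propto \frac{N^2}{P}, \qquad
\text{communication} \propto \frac{4N}{\sqrt{P}}, \qquad
\frac{\text{communication}}{\text{computation}} \propto \frac{4\sqrt{P}}{N},

so the ratio falls as the side of a block, N/sqrt(P), grows, i.e. with increasing granularity.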

Slide 24: Surface-to-volume effects (2)
- Ex.: Jacobi finite differences – agglomeration
- [Figure after [Foster 1995]: no agglomeration vs. 4 x 4 agglomeration]

Slide 25: Agglomeration & flexibility
- Ability to make use of diverse computing environments
  - good parallel programs are resilient to changes in processor count
  - scalability – ability to employ an increasing number of tasks
- Too coarse a granularity reduces flexibility
- Usual practical design: agglomerate one task per processor
  - can be controlled by a compile-time or runtime parameter
  - with some MPS (PVM, MPI-2) on the fly (dynamic spawn)
- But consider also creating more tasks than processors:
  - when tasks often wait for remote data: several tasks mapped to one processor permit overlapping computation and communication
  - greater scope for mapping strategies that balance the computational load over the available processors
  - a rule of thumb: an order of magnitude more tasks
- Optimal number of tasks: determined by a combination of analytic modelling and empirical studies

Slide 26: Replicating computation
- To reduce communication requirements, the same computation is repeated in several tasks
  - compute once & distribute vs. compute repeatedly & don't communicate – a trade-off
- Redundant computation pays off when its computational cost is less than the communication cost
  - moreover, it removes dependences
- Ex.: summation of n numbers (located on separate processors) with distribution of the result
  - without replication: 2(n-1) steps, (n-1) additions (the necessary minimum)
  - with replication: (n-1) steps, n(n-1) additions, of which (n-1)^2 are redundant
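In message-passing terms, the two alternatives of the summation example correspond roughly to a reduction followed by a broadcast versus an all-reduce in which every task ends up computing the sum itself. The sketch below expresses this with MPI collectives; the choice of collectives and the sample values are assumptions of the sketch, not the lecture's own code.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank;
    double local, total;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    local = rank + 1.0;                     /* each task holds one number */

    /* (a) compute once & distribute: reduce onto one task, then broadcast */
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    MPI_Bcast(&total, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* (b) replicate the computation: every task obtains the sum directly,
           trading redundant additions for fewer communication phases */
    MPI_Allreduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    printf("task %d: sum = %g\n", rank, total);
    MPI_Finalize();
    return 0;
}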

Slide 27: Good agglomeration
- Increased locality of communication
- Beneficial replication of computation
- Replication of data does not compromise scalability
- Similar computation and communication costs of the agglomerated tasks
- Number of tasks can still scale with the problem size
- Fewer larger-grained tasks are usually more efficient than many fine-grained tasks

Slide 28: Mapping
- Process of assigning (agglomerated) tasks to processors for execution
- Goal: maximize processor utilization, minimize interprocessor communication
  - load balancing
- Concerns multicomputers only
  - on multiprocessors: automatic task scheduling
- Guidelines to minimize execution time (they conflict with each other):
  - place concurrent tasks on different processors (increase concurrency)
  - place tasks that communicate frequently on the same processor (enhance locality)
- Optimal mapping is in general an NP-complete problem
  - strategies and heuristics for special classes of problems are available

Slide 29: Basic mapping strategies
- [Figure after [Quinn 2004]]

Slide 30: Load balancing
- Mapping strategy with the aim of keeping all processors busy during the execution of the parallel program
  - minimization of the idle time
- In a heterogeneous computing environment every parallel application may need (dynamic) load balancing
- Static load balancing
  - performed before the program enters the solution phase
- Dynamic load balancing
  - needed when tasks are created/destroyed at run time and/or the communication/computation requirements of tasks vary widely
  - invoked occasionally during the execution of the parallel program: analyses the current computation and rebalances it
  - may imply significant overhead!
- [Figure: bad load balancing – tasks waiting at a barrier, [LLNL 2010]]

Slide 31: Load-balancing algorithms
- Most appropriate for domain-decomposed problems
- Representative examples [next slides]:
  - recursive bisection
  - probabilistic methods
  - local algorithms

Slide 32: Recursive bisection (1/2)
- Recursive cuts into subdomains of nearly equal computational cost while attempting to minimize communication
  - allows the partitioning algorithm itself to be executed in parallel
- Coordinate bisection: for irregular grids with local communication
  - cuts into halves based on the physical coordinates of the grid points
  - simple, but does not take communication into account
  - unbalanced bisection: does not necessarily divide into halves, in order to reduce communication
- A lot of variants, e.g. recursive graph bisection
- [Figure: irregular grid for a superconductivity simulation, [Foster 1995]]

Slide 33: Probabilistic methods
- Allocate tasks randomly to processors
  - about the same computational load can be expected for a large number of tasks
  - typically at least ten times as many tasks as processors are required
- Communication is usually not considered
  - appropriate for tasks with little communication and/or little locality in communication
- Simple, low cost, scalable
- Variant: cyclic mapping, for spatial locality in load levels
  - each of p processors is allocated every p-th task
- Variant: block-cyclic distributions
  - blocks of tasks are allocated to processors
- [Figure: the PI-calculation strips F(x) = 4/(1+x^2) mapped cyclically to processors #1, #2, #3]
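A cyclic or block-cyclic assignment is just a strided loop over the task indices, as in the sketch below (the parallel PI code earlier already used the cyclic form); the block size B and the placeholder process_task are illustrative.

#include <stdio.h>

#define B 4   /* illustrative block size */

static void process_task(int t) { printf("task %d\n", t); }  /* placeholder work */

/* cyclic mapping: processor 'rank' (of p) takes every p-th task */
static void cyclic(int rank, int p, int n)
{
    for (int t = rank; t < n; t += p)
        process_task(t);
}

/* block-cyclic mapping: blocks of B consecutive tasks are dealt out cyclically */
static void block_cyclic(int rank, int p, int n)
{
    for (int start = rank * B; start < n; start += p * B)
        for (int t = start; t < start + B && t < n; t++)
            process_task(t);
}

int main(void)
{
    cyclic(0, 3, 10);         /* tasks 0, 3, 6, 9 go to processor #1 (rank 0) */
    block_cyclic(1, 3, 24);   /* blocks 4-7 and 16-19 go to processor #2 (rank 1) */
    return 0;
}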

Slide 34: Local algorithms
- Compensate for changes in computational load using only local information obtained from a small number of "neighbouring" tasks
  - do not require expensive global knowledge of the computational state
- If an imbalance exists (exceeds a threshold), some computational load is transferred to the less loaded neighbour
- Simple, but less efficient than global algorithms
  - slow when adjusting to major changes in load characteristics
- Advantageous for dynamic load balancing
- [Figure: local algorithm for a grid problem, [Foster 1995]]

Slide 35: Task-scheduling algorithms
- Suitable for a pool of independent tasks
  - the tasks represent stand-alone problems and contain solution code + data
  - they can be conceived as a special kind of data
- Often obtained from functional decomposition
  - many tasks with weak locality
- Centralized or distributed variants
- Dynamic load balancing by default
- Examples:
  - (hierarchical) manager/worker
  - decentralized schemes

Slide 36: Manager/worker
- Simple task-scheduling scheme
  - sometimes called "master/slave"
- A central manager task is responsible for problem allocation
  - maintains a pool (queue) of problems, e.g. a search in a particular tree branch
- Workers run on separate processors and repeatedly request and solve assigned problems
  - may also send new problems to the manager
- Efficiency:
  - consider the cost of problem transfer – prefetching and caching are applicable
  - the manager must not become a bottleneck
- Hierarchical manager/worker variant
  - introduces a layer of submanagers, each responsible for a subset of workers
- [Figure after [Wilkinson 1999]]
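A minimal MPI skeleton of this scheme is sketched below. The tags, the WORK/STOP protocol, the representation of a problem and of a result as a single int, and the placeholder solve() are all assumptions made for the sketch, not the lecture's own code.

#include <mpi.h>
#include <stdio.h>

#define TAG_WORK 1
#define TAG_STOP 2

static int solve(int problem) { return problem * problem; }   /* placeholder work */

int main(int argc, char *argv[])
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                                   /* manager */
        int nproblems = 100, next = 0, result;
        MPI_Status status;

        /* prime every worker with one problem (or stop it right away) */
        for (int w = 1; w < size; w++) {
            if (next < nproblems) {
                MPI_Send(&next, 1, MPI_INT, w, TAG_WORK, MPI_COMM_WORLD);
                next++;
            } else {
                MPI_Send(&next, 1, MPI_INT, w, TAG_STOP, MPI_COMM_WORLD);
            }
        }
        /* collect results; keep workers busy until the pool is exhausted */
        for (int received = 0; received < next; received++) {
            MPI_Recv(&result, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &status);
            if (next < nproblems) {                    /* hand out the next problem */
                MPI_Send(&next, 1, MPI_INT, status.MPI_SOURCE, TAG_WORK,
                         MPI_COMM_WORLD);
                next++;
            } else {                                   /* pool empty: stop this worker */
                MPI_Send(&result, 1, MPI_INT, status.MPI_SOURCE, TAG_STOP,
                         MPI_COMM_WORLD);
            }
        }
    } else {                                           /* worker */
        int problem, result;
        MPI_Status status;
        while (1) {
            MPI_Recv(&problem, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
            if (status.MPI_TAG == TAG_STOP)
                break;
            result = solve(problem);                   /* solve the assigned problem */
            MPI_Send(&result, 1, MPI_INT, 0, TAG_WORK, MPI_COMM_WORLD);
        }
    }

    MPI_Finalize();
    return 0;
}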

Slide 37: Decentralized schemes
- Task scheduling without global management
- The task pool is a data structure distributed among many processors
- The pool is accessed asynchronously by idle workers
  - various access policies: neighbours, at random, etc.
- Termination detection may be difficult

Slide 38: Good mapping
- In general: try to balance the conflicting requirements for equitable load distribution and low communication cost
- When possible, use a static mapping allocating each process to a single processor
- Dynamic load balancing / task scheduling can be appropriate when the number or size of tasks is variable or not known until runtime
- With centralized load-balancing schemes, verify that the manager will not become a bottleneck
- Consider the implementation cost

Slide 39: Conclusions
- Foster's design methodology is conveniently applicable
  - in [Quinn 2004] it is used for the design of many parallel programs in MPI (OpenMP)
- In practice, all phases are often considered in parallel
- In bad practice, the conceptual phases are skipped
  - machine-dependent design from the very beginning
- The methodology serves as a kind of "life-belt" ("fixed point") when the development runs into trouble

Slide 40: Further study
- [Foster 1995] Designing and Building Parallel Programs
- [Quinn 2004] Parallel Programming in C with MPI and OpenMP
- In most textbooks there is a chapter like "Principles of parallel algorithm design"
  - often concentrated on the mapping step


Slide 42: Comments on the lecture
- Hopefully OK

Slide 43: Basic mapping strategies
- Domain decomposition, fixed and equal-sized tasks, structured communication:
  - minimize interprocess communication
  - agglomerate tasks mapped to the same processor
- Domain decomposition, variable work per task and/or structured communication:
  - employ load-balancing algorithms (seek an efficient agglomeration and mapping)
  - e.g. based on recursive bisection or probabilistic methods
  - weigh the overhead of load balancing against the benefits of reduced execution time
- Domain decomposition, variable number of tasks and/or dynamic changes in communication or computation per task:
  - dynamic load balancing (periodically revise the agglomeration and mapping)
  - local algorithms preferred (not as expensive)
- Functional decomposition:
  - often short-lived tasks interacting at the start/end of execution
  - task-scheduling algorithms (allocate tasks to idle processors)

Slide 44: Recursive bisection (2/2)
- Graph bisection: for complex unstructured grids
  - based on connectivity graphs
  - assigns vertices to subdomains according to the graph distance
- Many variants possible
- [Figure: coordinate bisection vs. graph bisection]

Slide 45: Example – tree search

