DISTRIBUTED AND HIGH-PERFORMANCE COMPUTING
CHAPTER 3: Programming and Performance Analysis of Parallel Computers
Dr. Nor Asilah Wati Abdul Hamid
Room 2.15, Ext: 6532
FSKTM, UPM

Programming Models for Parallel Computing
First look at an example of parallelizing a real-world task, taken from Solving Problems on Concurrent Processors, Fox et al.
Hadrian's Wall was built by the ancient Romans to keep the marauding Scottish barbarians out of Roman England. It was originally 120 km long and 2 meters high.
How would you build such a huge structure in the shortest possible time?
Clearly the sequential approach (a single bricklayer building the entire wall) would be too slow.
We can have modest task parallelism by having different specialist workers concurrently:
- making the bricks
- delivering the bricks
- laying the bricks
But unless we can replicate these workers, this gives us at most a threefold speedup of the process.

Data Parallelism
To really speed up completion of the task, we need many workers laying the bricks concurrently. In general there are two ways to do this:
- pipelining (vectorization)
- replication (parallelization)
Concurrent execution of a task requires assigning different sections of the problem (processes and data) to different processors.
We will concentrate on data parallelism, where the processors work concurrently on their own section of the whole data set, or domain.
Splitting up the domain between processors is known as domain decomposition. Each processor has its own sub-domain, or grain, to work on.
Parallelism may be described as fine-grained (lots of very small domains) or coarse-grained (a smaller number of larger domains).
Deciding how the domain decomposition is done is a key issue in implementing efficient parallel processing.
For the most efficient (least time) execution of the task, we need to:
- minimize communication of data between processors
- distribute workload equally among the processors (known as load balancing)

Vectorization (Pipelining)
One possible approach is to allocate a different bricklayer to each row of the wall, i.e. a horizontal decomposition of the problem domain.
This is a pipelined approach - each bricklayer has to wait until the row underneath them has been started, so there is some inherent inefficiency.
Once all rows have been started (the pipeline is full), all the bricklayers (processors) are working efficiently at the same time. This lasts until the end of the task, when there is some overhead (idle workers) while the upper levels are completed (the pipeline is flushed).
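The fill and flush overhead of the pipelined wall can be estimated with a small model. This is only a sketch under stated assumptions that are not in the original slides: every bricklayer lays one brick per time step, and each row may start exactly one step after the row beneath it. The function names are hypothetical.

```python
def pipeline_steps(rows: int, bricks_per_row: int) -> int:
    """Time steps to finish the wall when row r starts one step after row r-1.

    Assumption (not from the slides): one brick per worker per step.
    The top row starts at step rows-1 and then needs bricks_per_row steps.
    """
    return bricks_per_row + rows - 1


def pipeline_efficiency(rows: int, bricks_per_row: int) -> float:
    """Fraction of worker-steps spent laying bricks rather than idling."""
    useful = rows * bricks_per_row                      # bricks actually laid
    available = rows * pipeline_steps(rows, bricks_per_row)  # worker-steps paid for
    return useful / available


# A longer wall amortizes the fill/flush overhead over more useful work.
for bricks in (10, 100, 10_000):
    print(bricks, pipeline_steps(4, bricks), round(pipeline_efficiency(4, bricks), 4))
```

In this model the efficiency is bricks_per_row / (bricks_per_row + rows - 1), so the idle time at the start and end matters only when the rows are short relative to the number of rows.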

Parallelization (Replication)
Another approach is to do a vertical decomposition of the problem domain, so each bricklayer gets a vertical section of the wall to complete.
In this case, the workers must communicate and synchronize their actions at the edges where the sub-domains meet. In general this communication and synchronization will incur some overhead, so there is some inefficiency.
However, each worker has an inner section of wall within their sub-domain that is completely independent of the others, which they can build just as efficiently as if there were no other workers.
As long as the time taken to build this inner section is much longer than the time taken up by the communication and synchronization overhead, the parallelization will be efficient and give good speedup over using a single worker.
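The condition above (inner work much longer than edge overhead) can be made concrete with a rough cost model. This is a sketch, assuming the work divides evenly and that each worker pays a fixed overhead for communication and synchronization at the edges; both the model and the names are illustrative assumptions rather than anything stated in the slides.

```python
def replicated_speedup(n_workers: int, total_work: float, overhead: float) -> float:
    """Speedup of a vertical decomposition under a simple cost model.

    Assumptions (hypothetical): the work splits evenly, and every worker
    pays the same fixed `overhead` for edge communication/synchronization.
    """
    time_one_worker = total_work
    time_n_workers = total_work / n_workers + overhead
    return time_one_worker / time_n_workers


def replicated_efficiency(n_workers: int, total_work: float, overhead: float) -> float:
    """Speedup divided by the number of workers (1.0 means perfect)."""
    return replicated_speedup(n_workers, total_work, overhead) / n_workers


# With a fixed wall, more workers means a smaller inner section per worker,
# so the fixed edge overhead takes a growing share of each worker's time.
for n in (2, 10, 100):
    print(n, round(replicated_efficiency(n, total_work=1000.0, overhead=1.0), 3))
```

Under this model the efficiency equals inner_time / (inner_time + overhead), where inner_time is total_work / n_workers, which matches the qualitative statement on the slide.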

Parallel I/O
For large tasks with lots of data, we need an efficient means to pass the appropriate data to each processor. In building a wall, we need to keep each bricklayer supplied with bricks.
The simple approach has a single host processor connected to the outside network and handling all I/O. It passes data to the other processors through the internal communications network of the machine.
The host processor is then a sequential bottleneck for I/O, and bandwidth is limited to a single network connection.
A better approach is to do I/O in parallel, so each node (or group of nodes) has a direct I/O channel to a disk array.

Domain Decompositions
Some standard domain decompositions of a regular 2D grid (array) include:
- BLOCK - contiguous chunks of rows or columns of data on each processor.
- BLOCK-BLOCK - block decomposition in both dimensions.
- CYCLIC - data is assigned to processors like cards dealt to poker players, so neighbouring points are on different processors. This can be good for load balancing applications with varying workloads that have certain types of communication, e.g. very little, or a lot (global sums or all-to-all), or strided.
- BLOCK-CYCLIC - a block decomposition in one dimension, cyclic in the other.
- SCATTERED - points are scattered randomly across processors. This can be good for load balancing applications with little (or lots of) communication. The human brain seems to work this way - neighbouring sections may control widely separated parts of the body.
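The regular decompositions listed above can be expressed as a mapping from a grid point to the processor that owns it. The following is a minimal sketch, assuming a square n x n grid and processors arranged in a p_rows x p_cols grid with row-major ranks; the function names and the rank numbering are illustrative choices, not part of the original slides.

```python
def owner_1d(i: int, n: int, p: int, dist: str) -> int:
    """Owner (0..p-1) of index i along one dimension of length n."""
    if dist == "block":
        chunk = -(-n // p)      # ceiling division: size of each contiguous chunk
        return i // chunk
    if dist == "cyclic":
        return i % p            # dealt out like cards, one index at a time
    raise ValueError(f"unknown distribution: {dist}")


def owner_grid(n: int, p_rows: int, p_cols: int, row_dist: str, col_dist: str):
    """n x n table of processor ranks (row-major rank = row_owner * p_cols + col_owner)."""
    return [
        [owner_1d(r, n, p_rows, row_dist) * p_cols + owner_1d(c, n, p_cols, col_dist)
         for c in range(n)]
        for r in range(n)
    ]


layouts = {
    "BLOCK (rows)": owner_grid(4, 4, 1, "block", "block"),
    "BLOCK-BLOCK":  owner_grid(4, 2, 2, "block", "block"),
    "CYCLIC":       owner_grid(4, 2, 2, "cyclic", "cyclic"),
    "BLOCK-CYCLIC": owner_grid(4, 2, 2, "block", "cyclic"),
}
for name, grid in layouts.items():
    print(name)
    for row in grid:
        print(" ", row)
```

A SCATTERED decomposition would replace the owner function with a random (or hashed) assignment, which is omitted here because it has no regular index formula.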

Domain Decompositions Illustrated
The squares represent the 2D data array, the colours represent the processors where the data elements are stored.
[Figure panels: BLOCK, BLOCK-BLOCK, CYCLIC, BLOCK-CYCLIC]

Static Load Balancing
For maximum efficiency, domain decomposition should give equal work to each processor. In building the wall, we can just give each bricklayer an equal length segment.
But things can become much more complicated:
- What if some bricklayers are faster than others? (this is like an inhomogeneous cluster of different workstations)
- What if there are guard towers every few hundred meters, which require more work to construct? (in some applications, more work is required in certain parts of the domain)
If we know in advance
1. the relative speed of the processors, and
2. the relative amount of processing required for each part of the problem
then we can do a domain decomposition that takes this into account, so that different processors may have different sized domains, but the time to process them will be about the same.
This is static load balancing, and can be done at compile-time.
For some applications, maintaining load balance while simultaneously minimising communication can be a very difficult optimisation problem.
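For the first case (workers of different speeds), a static decomposition can simply give each worker a share proportional to its speed. This is a sketch only, assuming the work per unit length of wall is uniform (no guard towers) and that relative speeds are known in advance; the function names and the example speeds are made up for illustration.

```python
def static_partition(total_length: float, speeds: list[float]) -> list[float]:
    """Segment length for each worker, proportional to that worker's speed.

    Assumption (hypothetical): work per unit length is the same everywhere.
    """
    total_speed = sum(speeds)
    return [total_length * s / total_speed for s in speeds]


def finish_times(lengths: list[float], speeds: list[float]) -> list[float]:
    """Time each worker needs for its own segment."""
    return [length / speed for length, speed in zip(lengths, speeds)]


speeds = [1.0, 2.0, 3.0]                    # illustrative relative speeds
segments = static_partition(120.0, speeds)  # 120 km of wall
print(segments)                             # unequal segment lengths
print(finish_times(segments, speeds))       # (nearly) equal finish times
```

An equal split of the same wall would leave the fastest worker idle while the slowest finishes, which is the load imbalance the proportional split avoids.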

Irregular Domain Decomposition
In this figure the airflow over an aeroplane wing is modeled on an irregular triangulated mesh.
The grid is finest in areas where there is the most change in the airflow (e.g. turbulent regions) and coarsest where the flow is more regular (laminar).
The domain is distributed among processors, which are indicated by different colours.

Dynamic Load Balancing
In some cases we do not know in advance one (or both) of:
- the effective performance of the processors - we may be sharing the processors with other applications, so the load and available CPU may vary
- the amount of work required for each part of the domain - many applications are adaptive or dynamic, and the workload is only known at runtime (e.g. dynamic irregular mesh for CFD, or varying convergence rates for PDE solvers in different sections of a regular grid)
In this case we need to dynamically change the domain decomposition by periodically repartitioning the data between processors.
This is dynamic load balancing, and it can involve substantial overheads in:
- Figuring out how to best repartition the data - may need to use a fast method that gives a good (but not optimal) domain decomposition, to reduce computation.
- Moving the data between processors - could restrict this to local (neighbouring processor) moves to reduce communication.
Usually we repartition as infrequently as possible (e.g. every few iterations instead of every iteration). There is a tradeoff between performance improvement and repartitioning overhead.
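The tradeoff between performance improvement and repartitioning overhead can be sketched as a simple decision rule. This is one possible heuristic, not a method given in the slides: it assumes each iteration takes as long as the slowest processor, that a repartition would bring every processor to the mean time, and that the measured imbalance would otherwise persist. All names are hypothetical.

```python
def imbalance(step_times: list[float]) -> float:
    """Ratio of slowest processor time to mean time (1.0 means balanced)."""
    mean = sum(step_times) / len(step_times)
    return max(step_times) / mean


def should_repartition(step_times: list[float],
                       iterations_ahead: int,
                       repartition_cost: float) -> bool:
    """True if the projected saving outweighs the cost of repartitioning.

    Assumptions (hypothetical): an iteration lasts as long as the slowest
    processor, and a perfect repartition would bring everyone to the mean.
    """
    mean = sum(step_times) / len(step_times)
    saving_per_iteration = max(step_times) - mean
    return saving_per_iteration * iterations_ahead > repartition_cost


measured = [1.0, 1.0, 2.0]   # illustrative per-processor times for one iteration
print(imbalance(measured))
print(should_repartition(measured, iterations_ahead=10, repartition_cost=5.0))
print(should_repartition(measured, iterations_ahead=10, repartition_cost=10.0))
```

Checking the rule only every few iterations, rather than every iteration, keeps the cost of the check itself small.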

Causes of Inefficiency in Parallel Programs
- Communication Overhead: In some cases communication can be overlapped with computation, i.e. needed data can be prefetched at the same time as useful computation is being done. But often processors must wait for data, causing inefficiency (like a cache miss on a sequential machine, but worse).
- Load Imbalance Overhead: Any load imbalance will cause some processors to be idle some of the time. In some problems the domain will change during the computation, requiring expensive dynamic load balancing.
- Algorithmic Overhead: The best algorithm to solve a problem in parallel is often slightly different (or in some cases very different) to the best sequential algorithm. Any excess cycles caused by the difference in algorithm (more operations or more iterations) give an algorithmic overhead. Note that to calculate speedup honestly, we should not use the time for the parallel algorithm on 1 processor, but the time for the best sequential algorithm.
- Sequential Overhead: The program may have parts that are not parallelizable, so that each processor replicates the same calculation.

Parallel Speedup
If N workers or processors are used, we expect to be able to finish the task N times faster.
Speedup = (Time on 1 processor) / (Time on N processors)
Speedup will (usually) be at most N on N processors. (Question: How can it be more?)
N.B. It is usually better to plot speedup vs number of processors, rather than time taken (proportional to 1/speedup) vs number of processors, since it is much harder to judge deviation from a 1/x curve than from a straight line.
Any problem of fixed size will have a maximum speedup, beyond which adding more processors will not reduce (and will usually increase) the time taken.
[Figure: speedup vs number of processors, showing linear (perfect) speedup and actual speedup]
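The speedup definition can be applied directly to a table of measured run times. The sketch below uses made-up timings purely for illustration (they are not measurements from the slides), and also reports efficiency, a commonly derived quantity defined here as speedup divided by the number of processors.

```python
def speedup(time_1: float, time_n: float) -> float:
    """Speedup = time on 1 processor / time on N processors."""
    return time_1 / time_n


def efficiency(time_1: float, time_n: float, n: int) -> float:
    """Speedup per processor; 1.0 corresponds to linear (perfect) speedup."""
    return speedup(time_1, time_n) / n


# Hypothetical timings in seconds for a fixed-size problem.
timings = {1: 100.0, 2: 52.0, 4: 28.0, 8: 18.0, 16: 20.0}

for n, t in timings.items():
    print(n, round(speedup(timings[1], t), 2), round(efficiency(timings[1], t, n), 2))

# Processor count with the largest speedup; past it, extra processors do not help.
best_n = max(timings, key=lambda n: speedup(timings[1], timings[n]))
print("maximum speedup at N =", best_n)
```

In this made-up table the time rises again between 8 and 16 processors, illustrating the maximum speedup of a fixed-size problem. For an honest figure, the single-processor time should come from the best sequential algorithm rather than the parallel code run on one processor.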

Superlinear Speedup
Answer to the previous question:
It is possible to get speedups greater than N on N processors if the parallel algorithm is very efficient (no load imbalance or algorithmic overhead, and any communication is overlapped with computation), and if splitting the data among processors allows a greater proportion of the data to fit into cache memory on each processor, resulting in faster sequential performance of the program on each processor.
The same idea applies to programs that would be out-of-core on 1 processor (i.e. require costly paging to disk) but can have all data in core memory if they are spread over multiple processors.
Some parallel implementations of tree search applications (e.g. branch and bound optimisation algorithms) can also have superlinear speedup, by exploring multiple branches in parallel, allowing better pruning and thus finding the solution faster. However, the same effect can be achieved by running multiple processes (or threads) on a single processor.
