
1 Parallel Programming Sathish S. Vadhiyar Course Web Page: http://www.serc.iisc.ernet.in/~vss/courses/PPP2009

2 Motivation for Parallel Programming
- Faster execution time, due to non-dependencies between regions of code
- Presents a level of modularity
- Resource constraints, e.g. large databases
- Certain classes of algorithms lend themselves naturally to parallelism
- Aggregate bandwidth to memory/disk; increase in data throughput
- Clock rate improvement in the past decade – 40%
- Memory access time improvement in the past decade – 10%
- Grand challenge problems (more later)

3 Challenges / Problems in Parallel Algorithms
- Building efficient algorithms
- Avoiding: communication delay, idling, synchronization

4 Challenges (figure: timelines of two processes P0 and P1 showing computation, communication, synchronization, and idle time)

5 How do we evaluate a parallel program?
- Execution time, T_p
- Speedup, S: S(p, n) = T(1, n) / T(p, n). Usually S(p, n) < p; sometimes S(p, n) > p (superlinear speedup)
- Efficiency, E: E(p, n) = S(p, n) / p. Usually E(p, n) < 1; sometimes greater than 1
- Scalability – limitations in parallel computing, relation to n and p
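As a quick worked example (the numbers are illustrative, not from the slides): if T(1, n) = 100 s and T(16, n) = 8 s, then S(16, n) = 100/8 = 12.5 and E(16, n) = 12.5/16 ≈ 0.78.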

6 Speedups and efficiency (figures: speedup S vs. p for the ideal and practical cases; efficiency E vs. p)

7 Limitations on speedup – Amdahl’s law
- Amdahl’s law states that the performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time the faster mode can be used.
- Expresses overall speedup in terms of the fractions of computation time with and without the enhancement, and the improvement given by the enhancement.
- Places a limit on the speedup due to parallelism:
  Speedup = 1 / (f_s + f_p / P)
  where f_s and f_p are the serial and parallel fractions of the computation and P is the number of processors.

8 Amdahl’s law illustration: S = 1 / (s + (1 - s)/p), where s is the serial fraction. Courtesy: http://www.metz.supelec.fr/~dedu/docs/kohPaper/node2.html and http://nereida.deioc.ull.es/html/openmp/pdp2002/sld008.htm

9 Amdahl’s law analysis

f      P=1   P=4   P=8   P=16   P=32
1.00   1.0   4.00  8.00  16.00  32.00
0.99   1.0   3.88  7.48  13.91  24.43
0.98   1.0   3.77  7.02  12.31  19.75
0.96   1.0   3.57  6.25  10.00  14.29

For a fixed parallel fraction f, the speedup falls further and further behind the processor count as P grows. Thus Amdahl’s law is a bit depressing for parallel programming. In practice, the parallel portion of the work has to be large enough to match a given number of processors.
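A minimal sketch (written for this transcript, not course code) that reproduces the table above from Amdahl’s formula S = 1 / ((1 - f) + f/P), where f is the parallel fraction:

    #include <stdio.h>

    /* Amdahl's law: serial fraction (1 - f) limits the speedup on p processors. */
    static double amdahl_speedup(double f, int p)
    {
        return 1.0 / ((1.0 - f) + f / p);
    }

    int main(void)
    {
        const double fractions[] = { 1.00, 0.99, 0.98, 0.96 };
        const int procs[] = { 1, 4, 8, 16, 32 };

        printf("f     ");
        for (int j = 0; j < 5; j++)
            printf("P=%-6d", procs[j]);
        printf("\n");

        for (int i = 0; i < 4; i++) {
            printf("%.2f  ", fractions[i]);
            for (int j = 0; j < 5; j++)
                printf("%-7.2f", amdahl_speedup(fractions[i], procs[j]));
            printf("\n");
        }
        return 0;
    }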

10 Gustafson’s Law
- Amdahl’s law – keep the parallel work fixed
- Gustafson’s law – keep the computation time on the parallel processors fixed, and change the problem size (the fraction of parallel/sequential work) to match that computation time
- For a particular number of processors, find the problem size for which the parallel time equals the constant time
- For that problem size, find the sequential time and the corresponding speedup
- The speedup obtained this way is called scaled speedup
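In its usual closed form (the standard statement, not shown on this slide), if s is the serial fraction of the time spent on P processors, the scaled speedup is S_scaled(P) = s + (1 - s) * P. For example, with s = 0.05 and P = 32, S_scaled = 0.05 + 0.95 * 32 = 30.45, far less pessimistic than the Amdahl figures above for a comparable machine size.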

11 Metrics (Contd..)

Table 5.1: Efficiency as a function of n and p.

N      P=1   P=4   P=8   P=16   P=32
64     1.0   0.80  0.57  0.33
192    1.0   0.92  0.80  0.60
512    1.0   0.97  0.91  0.80

12 Scalability
- Efficiency decreases with increasing P; increases with increasing N
- Scalability: how effectively the parallel algorithm can use an increasing number of processors
- How the amount of computation performed must scale with P to keep E constant; this function of computation in terms of P is called the isoefficiency function
- An algorithm with an isoefficiency function of O(P) is highly scalable, while an algorithm with a quadratic or exponential isoefficiency function is poorly scalable

13 Scalability Analysis – Finite Difference algorithm with 1D decomposition
For constant efficiency, a function of P, when substituted for N, must satisfy the efficiency relation for increasing P and constant E (the relation itself appears as an equation on the original slide). This can be satisfied with N = P, except for small P. Hence the isoefficiency function is O(P^2), since the computation is O(N^2).
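A sketch of the standard 1D analysis (the cost parameters are assumed here, not taken from the slide): for an N x N grid split into column blocks, with per-point compute cost t_c, message startup cost t_s and per-word transfer cost t_w, the per-iteration costs are roughly

    T_1 = N^2 t_c
    T_P = (N^2 / P) t_c + 2 (t_s + N t_w)
    E   = T_1 / (P T_P) = N^2 t_c / (N^2 t_c + 2 P t_s + 2 P N t_w)

Holding E constant requires N^2 to grow in proportion to P N, i.e. N proportional to P, so the computation N^2 must grow as O(P^2), which is the isoefficiency function quoted above.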

14 Scalability Analysis – Finite Difference algorithm with 2D decomposition
Hence the isoefficiency function is O(P), which can be satisfied with N = sqrt(P). The 2D algorithm is therefore more scalable than the 1D algorithm.

15 Parallel Algorithm Design

16 Steps
- Decomposition – splitting the problem into tasks or modules
- Mapping – assigning tasks to processors
- Mapping has contradictory objectives: minimizing idle times and reducing communication

17 Mapping
- Static mapping
  - Mapping based on data partitioning – applicable to dense matrix computations: block distribution, block-cyclic distribution (figure: example assignments of array pieces to processes 0, 1, 2; a sketch of the two maps follows below)
  - Graph-partitioning based mapping – applicable to sparse matrix computations
  - Mapping based on task partitioning
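A minimal sketch (hypothetical helper functions, not course code) of the index-to-process maps behind the block and block-cyclic distributions named above, for a one-dimensional array of n elements over p processes:

    #include <stdio.h>

    /* Block distribution: each process owns one contiguous chunk of ~n/p elements. */
    static int block_owner(int i, int n, int p)
    {
        int chunk = (n + p - 1) / p;     /* ceiling(n/p) */
        return i / chunk;
    }

    /* Block-cyclic distribution: blocks of size b are dealt out round-robin. */
    static int block_cyclic_owner(int i, int b, int p)
    {
        return (i / b) % p;
    }

    int main(void)
    {
        int n = 16, p = 4, b = 2;
        printf("i   block   block-cyclic (b=%d)\n", b);
        for (int i = 0; i < n; i++)
            printf("%2d    %d          %d\n", i,
                   block_owner(i, n, p), block_cyclic_owner(i, b, p));
        return 0;
    }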

18 Based on Task Partitioning
- Based on the task dependency graph (figure: a task dependency tree whose nodes are mapped onto processors 0–7)
- In general, the mapping problem is NP-complete

19 Mapping
- Dynamic mapping
  - A process/global memory can hold a set of tasks
  - Distribute some tasks to all processes
  - Once a process completes its tasks, it asks the coordinator process for more tasks
  - Referred to as self-scheduling or work-stealing (a sketch follows below)
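A minimal self-scheduling sketch in MPI (the structure, tags and task count are assumed, not the course's code): rank 0 acts as the coordinator handing out task indices one at a time; a worker sends a request whenever it finishes and receives either a new task or a stop signal (-1). Compile with mpicc, run with mpirun.

    #include <mpi.h>

    #define NTASKS 100

    static void do_task(int task) { (void)task; /* placeholder for real work */ }

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) {                           /* coordinator */
            int next = 0, done = 0;
            while (done < size - 1) {
                int dummy;
                MPI_Status st;
                MPI_Recv(&dummy, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &st);
                int task = (next < NTASKS) ? next++ : -1;   /* -1 = no more work */
                if (task == -1) done++;
                MPI_Send(&task, 1, MPI_INT, st.MPI_SOURCE, 1, MPI_COMM_WORLD);
            }
        } else {                                   /* worker */
            for (;;) {
                int request = 0, task;
                MPI_Send(&request, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
                MPI_Recv(&task, 1, MPI_INT, 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                if (task == -1) break;
                do_task(task);
            }
        }
        MPI_Finalize();
        return 0;
    }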

20 Interaction Overheads
- In spite of the best efforts in mapping, there can be interaction overheads
- Due to frequent communications, exchanging large volumes of data, interacting with the farthest processors, etc.
- Some techniques can be used to minimize interactions

21 Parallel Algorithm Design - Containing Interaction Overheads
- Maximizing data locality
  - Minimizing the volume of data exchange: using higher-dimensional mappings, not communicating intermediate results
  - Minimizing the frequency of interactions
- Minimizing contention and hot spots
  - Avoid having every process use the same communication pattern with the other processes

22 Parallel Algorithm Design - Containing Interaction Overheads
- Overlapping computations with interactions
  - Split computations into phases: those that depend on communicated data (type 1) and those that do not (type 2)
  - Initiate the communication needed by type 1; while it is in progress, perform the type 2 computations (see the sketch below)
- Overlapping interactions with interactions
- Replicating data or computations
  - Balancing the extra computation or storage cost against the gain due to less communication
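A minimal sketch (function and variable names are assumed) of the phase-splitting idea using MPI nonblocking calls: start the halo exchange, do the interior work that needs no remote data, then wait and finish the boundary work that does.

    #include <mpi.h>

    /* Placeholders for the two computation phases (assumed names). */
    static void compute_interior(void) { /* type 2: needs no remote data */ }
    static void compute_boundary(const double *halo, int n) { (void)halo; (void)n; }

    static void exchange_and_compute(double *halo_in, double *halo_out, int n,
                                     int left, int right, MPI_Comm comm)
    {
        MPI_Request reqs[2];

        /* Initiate the communication that the boundary (type 1) phase depends on. */
        MPI_Irecv(halo_in,  n, MPI_DOUBLE, left,  0, comm, &reqs[0]);
        MPI_Isend(halo_out, n, MPI_DOUBLE, right, 0, comm, &reqs[1]);

        /* Type 2 work overlaps with the transfer. */
        compute_interior();

        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
        compute_boundary(halo_in, n);      /* type 1: needs the received halo */
    }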

23 Parallel Algorithm Classification – Types - Models

24 Parallel Algorithm Types
- Divide and conquer
- Data partitioning / decomposition
- Pipelining

25 Divide-and-Conquer
- Recursive in structure
  - Divide the problem into sub-problems that are similar to the original but smaller in size
  - Conquer the sub-problems by solving them recursively; if small enough, solve them in a straightforward manner
  - Combine the solutions to create a solution to the original problem

26 Divide-and-Conquer Example: Merge Sort
- Problem: sort a sequence of n elements
- Divide the sequence into two subsequences of n/2 elements each
- Conquer: sort the two subsequences recursively using merge sort
- Combine: merge the two sorted subsequences to produce the sorted answer
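A small serial sketch of these three steps (illustrative code, not from the course); the two recursive calls are independent, which is exactly what a parallel version would exploit.

    #include <stdio.h>
    #include <string.h>

    /* Combine step: merge two sorted halves a[lo..mid) and a[mid..hi). */
    static void merge(int *a, int lo, int mid, int hi, int *tmp)
    {
        int i = lo, j = mid, k = lo;
        while (i < mid && j < hi)
            tmp[k++] = (a[i] <= a[j]) ? a[i++] : a[j++];
        while (i < mid) tmp[k++] = a[i++];
        while (j < hi)  tmp[k++] = a[j++];
        memcpy(a + lo, tmp + lo, (hi - lo) * sizeof(int));
    }

    static void merge_sort(int *a, int lo, int hi, int *tmp)
    {
        if (hi - lo < 2) return;            /* small enough: already sorted */
        int mid = lo + (hi - lo) / 2;       /* divide */
        merge_sort(a, lo, mid, tmp);        /* conquer: independent subproblems */
        merge_sort(a, mid, hi, tmp);
        merge(a, lo, mid, hi, tmp);         /* combine */
    }

    int main(void)
    {
        int a[] = { 5, 2, 9, 1, 7, 3 }, tmp[6];
        merge_sort(a, 0, 6, tmp);
        for (int i = 0; i < 6; i++) printf("%d ", a[i]);
        printf("\n");
        return 0;
    }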

27 Partitioning
1. Breaking up the given problem into p independent subproblems of almost equal sizes
2. Solving the p subproblems concurrently
- Mostly splitting the input or output into non-overlapping pieces
- Example: matrix multiplication – either the inputs (A or B) or the output (C) can be partitioned (see the sketch below)
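A minimal sketch of output partitioning for matrix multiplication (array names and sizes are assumed): the rows of C are divided among OpenMP threads, so each thread writes a disjoint, non-overlapping piece. Compile with -fopenmp.

    #include <stdio.h>
    #include <omp.h>

    #define N 256

    static double A[N][N], B[N][N], C[N][N];

    int main(void)
    {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                A[i][j] = 1.0;
                B[i][j] = 2.0;
            }

        /* Rows of the output C are partitioned among the threads. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                double sum = 0.0;
                for (int k = 0; k < N; k++)
                    sum += A[i][k] * B[k][j];
                C[i][j] = sum;
            }

        printf("C[0][0] = %g\n", C[0][0]);   /* expect 2*N = 512 */
        return 0;
    }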

28 Pipelining
Occurs, for example, in image processing applications where a number of images undergo a sequence of transformations.

29 Parallel Algorithm Models
- Data parallel model: processes perform identical tasks on different data
- Task parallel model: different processes perform different tasks on the same or different data, based on a task dependency graph
- Work pool model: any task can be performed by any process; tasks are added to a work pool dynamically
- Pipeline model: a stream of data passes through a chain of processes (stream parallelism)

30 Parallel Program Classification - Models - Structure - Paradigms

31 Parallel Program Models
- Single Program Multiple Data (SPMD)
- Multiple Program Multiple Data (MPMD)
Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/

32 Parallel Program Structure Types
- Master-worker / parameter sweep / task farming
- Embarrassingly / pleasingly parallel
- Pipeline / systolic / wavefront
- Tightly coupled
- Workflow

33 Programming Paradigms
- Shared memory model – threads, OpenMP
- Message passing model – MPI
- Data parallel model – HPF
Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/
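As a minimal illustration of the message passing model (a generic MPI sketch, not course material), rank 0 sends one integer to rank 1; compile with mpicc and run with mpirun -np 2.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, value = 42;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            /* Explicit message passing: no shared variables between ranks. */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", value);
        }

        MPI_Finalize();
        return 0;
    }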

34 Parallel Architectures
- Classification
- Cache coherence in shared memory platforms
- Interconnection networks

35 Classification of Architectures – Flynn’s classification
- Single Instruction Single Data (SISD): serial computers
- Single Instruction Multiple Data (SIMD): vector processors and processor arrays. Examples: CM-2, Cray-90, Cray YMP, Hitachi 3600
Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/

36 Classification of Architectures – Flynn’s classification
- Multiple Instruction Single Data (MISD): not popular
- Multiple Instruction Multiple Data (MIMD): the most popular; IBM SP and most other supercomputers, clusters, computational Grids, etc.
Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/

37 Classification of Architectures – Based on Memory
- Shared memory, 2 types: UMA and NUMA (figures: UMA and NUMA organizations)
- Examples: HP-Exemplar, SGI Origin, Sequent NUMA-Q
Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/

38 Classification of Architectures – Based on Memory
- Distributed memory
- Recently, multi-cores
- Yet another classification – MPPs, NOW (Berkeley), COW, Computational Grids
Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/

39 Cache Coherence – for details, read 2.4.6 of the book
   Interconnection networks – for details, read 2.4.2–2.4.5 of the book

40 Cache Coherence in SMPs (figure: main memory with cache line ‘a’ replicated in the caches of CPU0–CPU3)
- All processes read variable ‘x’ residing in cache line ‘a’
- Each process updates ‘x’ at different points of time
- Challenge: to maintain a consistent view of the data
- Protocols: write update, write invalidate

41 Cache Coherence Protocols and Implementations
- Write update – propagate the updated cache line to the other processors on every write
- Write invalidate – a write invalidates the copies in other caches; a processor fetches the updated cache line when it next reads the stale data
- Which is better?

42 Caches – False sharing (figure: CPU0 updates A0, A2, A4, … while CPU1 updates A1, A3, A5, …, with the lines A0–A8 and A9–A15 present in both caches)
- Different processors update different parts of the same cache line
- Leads to ping-pong of cache lines between processors
- Fix: modify the algorithm to change the stride (see the sketch below)
- The situation is better with update protocols than with invalidate protocols. Why?
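A minimal sketch of the false sharing situation (array names and the 64-byte line size are assumptions): two threads update alternating elements that sit in the same cache line, even though they never touch the same element; padding each thread's accumulator onto its own line (or changing the stride, as the slide suggests) removes the ping-pong. Compile with -fopenmp.

    #include <omp.h>
    #include <stdio.h>

    #define ITERS 10000000L

    int main(void)
    {
        /* Bad: sum[0] and sum[1] share a cache line (false sharing). */
        long sum[2] = { 0, 0 };

        /* Better: pad so each counter sits in its own 64-byte line. */
        struct { long value; char pad[64 - sizeof(long)]; } padded[2] = { {0}, {0} };

        #pragma omp parallel num_threads(2)
        {
            int t = omp_get_thread_num();
            for (long i = 0; i < ITERS; i++) {
                sum[t] += 1;            /* suffers false sharing */
                padded[t].value += 1;   /* no false sharing */
            }
        }
        printf("%ld %ld\n", sum[0] + sum[1], padded[0].value + padded[1].value);
        return 0;
    }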

43 Cache Coherence using invalidate protocols
- 3 states associated with data items
  - Shared – a variable shared by 2 caches
  - Invalid – another processor (say P0) has updated the data item
  - Dirty – state of the data item in P0
- Implementations
  - Snoopy: for bus-based architectures; memory operations are propagated over the bus and snooped
  - Instead of broadcasting memory operations to all processors, propagate coherence operations only to the relevant processors
  - Directory-based: a central directory maintains the states of cache blocks and the associated processors, implemented with presence bits

44 Interconnection Networks
- An interconnection network is defined by switches, links and interfaces
  - Switches – provide mapping between input and output ports, buffering, routing, etc.
  - Interfaces – connect nodes to the network
- Network topologies
  - Static – point-to-point communication links among processing nodes
  - Dynamic – communication links are formed dynamically by switches

45 Interconnection Networks
- Static
  - Bus – SGI Challenge
  - Completely connected
  - Star
  - Linear array, ring (1-D torus)
  - Mesh – Intel ASCI Red (2-D), Cray T3E (3-D), 2-D torus
  - k-d mesh: d dimensions with k nodes in each dimension
  - Hypercubes – 2-log p mesh – e.g. many MIMD machines
  - Trees – our campus network
- Dynamic – communication links are formed dynamically by switches
  - Crossbar – Cray X series – non-blocking network
  - Multistage – SP2 – blocking network
- For more details, and evaluation of topologies, refer to the book

46 Evaluating Interconnection topologies
- Diameter – maximum distance between any two processing nodes
  - Fully connected – 1
  - Star – 2
  - Ring – p/2
  - Hypercube – log P
- Connectivity – multiplicity of paths between 2 nodes; the minimum number of arcs to be removed from the network to break it into two disconnected networks
  - Linear array – 1
  - Ring – 2
  - 2-D mesh – 2
  - 2-D mesh with wraparound – 4
  - d-dimensional hypercube – d

47 Evaluating Interconnection topologies
- Bisection width – minimum number of links to be removed from the network to partition it into 2 equal halves
  - Ring – 2
  - P-node 2-D mesh – sqrt(P)
  - Tree – 1
  - Star – 1
  - Completely connected – P^2/4
  - Hypercube – P/2
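As a quick worked comparison (the machine size is chosen only for illustration): for P = 64, the bisection width is 2 for a ring, sqrt(64) = 8 for a 2-D mesh, and 64/2 = 32 for a hypercube, so a hypercube can carry far more traffic across the middle of the machine than a ring or mesh of the same size.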

48 Evaluating Interconnection topologies
- Channel width – number of bits that can be communicated simultaneously over a link, i.e. the number of physical wires between 2 nodes
- Channel rate – performance of a single physical wire
- Channel bandwidth – channel rate times channel width
- Bisection bandwidth – maximum volume of communication between two halves of the network, i.e. bisection width times channel bandwidth

49  END

