Presentation on theme: "CIS 270 - December '99 Introduction to Parallel Architectures Dr. Laurence Boxer Niagara University."— Presentation transcript:
CIS 270 - December '99 Introduction to Parallel Architectures Dr. Laurence Boxer Niagara University
CIS 270 - December '99 Parallel Computers Purpose - speed Divide a problem among processors Let each processor work on its portion of problem in parallel (simultaneously) with other processors Ideal - if p is the number of processors, get solution in 1/p of the time used by a computer of 1 processor Actual - rarely get that much speedup, due to delays for interprocessor communications
CIS 270 - December '99 Graphs of relevant functions
CIS 270 - December '99 Architectural issues Communications diameter - how many communication steps are necessary to send data from processor that has it to processor that needs it - large is bad Bisection width - how many wires must be cut to cut network in half - measure of how fast massive amounts of data can be moved through network - large is good Degree of network - important to scalability (ability to expand number of processors) - large is bad Limitations on speed: Limitation on expansion:
CIS 270 - December '99 PRAM - Parallel Random Access Machine Shared memory yields fast communications Source processor writes data to memory Destination processor reads data from memory Fast communications make this model theoretical ideal for fastest possible parallel algorithms for given # of processors Impractical - too many wires if lots of processors Any processor can send data to any other processor in time as follows:
Notice the tree structure of the previous algorithm:
CIS 270 - December '99 Linear array architecture Degree of network: 2 - easily expanded Bisection width: 1 - can’t move large amounts of data efficiently across network Communication diameter: n-1 - won’t perform global communication operations efficiently
CIS 270 - December '99 Total on linear array: Assume 1 item per processor Communications diameter implies Since this is the time required to total n items on a RAM, there is no asymptotic benefit to using a linear array for this problem
CIS 270 - December '99 Input-based sorting on a linear array The algorithm illustrated is a version of Selection Sort - each processor selects the smallest value it sees and passes others to the right. Time is proportional to communication diameter,
CIS 270 - December '99 Mesh architecture Square grid of processors Each processor connected by communication link to N, S, E, W neighbors Degree of network: 4 - makes expansion easy - can introduce adjacent meshes and connect border processors
CIS 270 - December '99 Application: sorting Could have initial data all in “wrong half” of mesh, as shown. Since all n items must get to correct half-mesh, time required to sort is In 1 time unit, amt. of data that can cross into correct half of mesh:
CIS 270 - December '99 In a mesh, each of these steps takes time. Hence, time for broadcast is
CIS 270 - December '99 Semigroup operation (e.g., total) in mesh 1. “Roll up” columns in parallel, totaling each column in last row by sending data downward. 2. Roll up last row to get total in a corner. 3. Broadcast total from corner to all processors. Time:
CIS 270 - December '99 Previous algorithm could run in approximately half the time by gathering total in a center, than corner, processor. Mesh total algorithm - continued However, running time is still i.e., still approximately proportional to communication diameter (with smaller constant of proportionality).
CIS 270 - December '99 Hypercube Number n of processors is a power of 2 Processors are numbered from 0 to n-1 Connected processors are those whose binary labels differ in exactly 1 bit.
CIS 270 - December '99 Illustration of total operation in hypercube. Reverse direction of arrows to broadcast result Time:
CIS 270 - December '99 Coarse-grained parallelism Most of previous discussion was of fine-grained parallelism - # of processors comparable to # of data items Realistically, few budgets accommodate such expensive computers - more likely to use coarse-grained parallelism with relatively few processors compared with # of data items. Coarse grained algorithms often based on each processor boiling its share of data down to single partial result, then using fine-grained algorithm to combine these partial results
CIS 270 - December '99 Example: coarse-grained total Suppose n data are distributed evenly (n/p per processor among p processors) 1. In parallel, each processor totals its share of the data. Time: Θ(n/p) 2. Use a fine-grained algorithm to add the partial sums (total residing in one processor) and broadcast result to all processors. In case of mesh, time: Total time for mesh: Since, this is Θ(n/p) - optimal.
CIS 270 - December '99 More info: Algorithms Sequential and Parallel by Russ Miller and Laurence Boxer Prentice-Hall, 2000 (available December, 1999)