
FIT5174 Parallel & Distributed Systems, Dr. Ronald Pose, Lecture 7 (2013): Parallel Computer System Architectures


1 FIT5174 Parallel & Distributed Systems, Dr. Ronald Pose, Lecture 7 (2013): Parallel Computer System Architectures

2 Acknowledgement
These slides are based on slides and material by: Carlo Kopp

3 Parallel Computing
Parallel computing is a form of computation in which many instructions are carried out simultaneously.
It operates on the principle that large problems can often be divided into smaller ones, which are then solved concurrently (i.e. at the same time).
There are several different forms of parallel computing: bit-level parallelism, instruction-level parallelism, data parallelism, and task parallelism.
(Figure: serial computing vs. parallel computing.)

4 Parallel Computing
Contemporary computer applications require the processing of large amounts of data in sophisticated ways. Examples include:
–parallel databases, data mining
–oil exploration
–web search engines, web based business services
–computer-aided diagnosis in medicine
–management of national and multi-national corporations
–advanced graphics and virtual reality, particularly in the entertainment industry
–networked video and multi-media technologies
–collaborative work environments
Ultimately, parallel computing is an attempt to minimise the time required to compute a problem, despite the performance limitations of individual CPUs / cores.

5 Parallel Computing Terminology
There are different ways to classify parallel computers. One of the more widely used classifications, in use since 1966, is called Flynn's Taxonomy.
Flynn's taxonomy distinguishes multi-processor computer architectures according to two independent dimensions: Instruction and Data. Each of these dimensions can have only one of two possible states: Single or Multiple.
The 4 possible classifications according to Flynn:
–SISD: Single Instruction, Single Data
–SIMD: Single Instruction, Multiple Data
–MISD: Multiple Instruction, Single Data
–MIMD: Multiple Instruction, Multiple Data

6 Concepts and Terminology
At the executable machine code level, programs are seen by the processor or core as a series of machine instructions, in some machine-specific binary code.
The common format of any instruction is that of an “operation code” or “opcode” and some “operands”, which are arguments the processor/core can understand.
Typically, operands are held in registers in the processor/core which store several bytes of data, or memory addresses pointing to locations in the machine’s main memory.
In a “conventional” or “general purpose” processor/core a single instruction combines one opcode with two or three operands, e.g.
ADD R1, R2, R3 – add contents of R1 and R2, put result into R3
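As an illustration of the opcode/operand format, here is a toy Python sketch (a hypothetical instruction interpreter, not any real instruction set) that decodes and executes the ADD example above:

```python
# A hypothetical three-operand machine: "ADD R1, R2, R3" adds the contents
# of R1 and R2 and puts the result into R3 (names and format are illustrative).
registers = {"R1": 5, "R2": 7, "R3": 0}

def execute(instruction, regs):
    # Decode the textual instruction into an opcode and its operands.
    opcode, rest = instruction.split(maxsplit=1)
    operands = [r.strip() for r in rest.split(",")]
    if opcode == "ADD":
        src1, src2, dest = operands
        regs[dest] = regs[src1] + regs[src2]
    else:
        raise ValueError("unknown opcode: " + opcode)

execute("ADD R1, R2, R3", registers)   # registers["R3"] is now 12
```

A real decoder works on binary fields rather than text, but the split into opcode and operands is the same idea.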

7 Flynn’s Classification

8 Flynn’s Classification - SISD
Single Instruction, Single Data (SISD): a serial (non-parallel or “conventional”) computer.
–Single instruction: only one instruction stream is being acted on by the CPU during any one clock cycle
–Single data: only one data stream is being used as input during any one clock cycle
–Deterministic execution
This is the oldest and, until recently, the most prevalent form of computer.
Examples: most PCs, single CPU workstations and mainframes.

9 Flynn’s Classification - SIMD
Single Instruction, Multiple Data (SIMD): a type of parallel computer.
–Single instruction: all processing units execute the same instruction at any given clock cycle
–Multiple data: each processing unit can operate on a different data element
This type of machine typically has an instruction dispatcher, a very high-bandwidth internal network, and a very large array of very small-capacity processing units.
Best suited for specialized problems characterized by a high degree of regularity, such as image processing, matrix algebra etc.
Synchronous (lockstep) and deterministic execution.
Two varieties, with examples:
–Processor Arrays: Connection Machine CM-2, MasPar MP-1, MP-2
–Vector Pipelines: IBM 9000, Cray C90, Fujitsu VP, NEC SX-2, Hitachi S820

10 Flynn’s Classification - SIMD

11 Flynn’s Classification - MISD
Multiple Instruction, Single Data (MISD): a single data stream is fed into multiple processing units.
Each processing unit operates on the data independently via independent instruction streams.
Few actual examples of this class of parallel computer have ever existed. One was the experimental Carnegie-Mellon computer.
Some conceivable uses might be:
–multiple frequency filters operating on a single signal stream
–multiple cryptography algorithms attempting to crack a single coded message.

12 Flynn’s Classification - MIMD
Multiple Instruction, Multiple Data (MIMD): currently the most common type of parallel computer. Most modern computers fall into this category.
–Multiple instruction: every processor may be executing a different instruction stream
–Multiple data: every processor may be working with a different data stream
Execution can be synchronous or asynchronous, deterministic or non-deterministic.
Examples: most current supercomputers, networked parallel computer "grids" and multi-processor SMP computers, including some types of PCs.

13 Parallel Computer Memory Architectures
Broadly divided into three categories:
–Shared memory
–Distributed memory
–Hybrid
Shared Memory
Shared memory parallel computers vary widely, but generally have in common the ability for all processors to access all memory as a global address space.
Multiple processors can operate independently but share the same memory resources. Changes in a memory location effected by one processor are visible to all other processors.
Shared memory machines can be divided into two main classes based upon memory access times: UMA and NUMA (Uniform Memory Access vs Non-Uniform Memory Access models).

14 Parallel Computer - Shared Memory

15 Parallel Computer - Distributed Memory
Distributed memory systems require a communication network to connect inter-processor memory.
Processors have their own local memory. There is no concept of a global address space across all processors.
Because each processor has its own local memory, it operates independently. Changes it makes to its local memory have no effect on the memory of other processors. Hence, the concept of “cache coherency” does not apply.
When a processor needs access to data in another processor, it is usually the task of the programmer to explicitly define how and when data is communicated. Synchronization between tasks is likewise the programmer's responsibility.
The network “fabric” used for data transfers varies widely; it can be as simple as Ethernet, or as complex as a specialised bus or switching device.
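The explicit send/receive discipline described above can be sketched in Python, with threads standing in for processors with private local memory and a pair of queues standing in for the network fabric (a real distributed-memory program would typically use a message-passing library such as MPI; all names here are illustrative):

```python
import threading
import queue

# Each "processor" keeps its data in local variables; the only way to share
# anything is an explicit send (put) and a matching receive (get) over the
# "network", modelled here by two one-way queues.
to_b = queue.Queue()
to_a = queue.Queue()
results = {}

def proc_a():
    local = [1, 2, 3]               # A's local memory, invisible to B
    to_b.put(sum(local))            # explicit send of A's partial sum
    results["a"] = to_a.get()       # explicit receive of the combined total

def proc_b():
    local = [10, 20]                # B's local memory, invisible to A
    received = to_b.get()           # receive A's partial sum
    to_a.put(received + sum(local)) # send the combined total back

ta = threading.Thread(target=proc_a)
tb = threading.Thread(target=proc_b)
ta.start(); tb.start(); ta.join(); tb.join()
```

Note there is no shared array anywhere: every data movement is a visible, programmer-written communication, which is exactly the burden (and the control) the slide describes.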

16 Parallel Computer - Distributed Memory

17 Parallel Computer - Hybrid Memory
The largest and fastest computers in the world today employ both shared and distributed memory architectures.
–The shared memory component is usually a cache coherent SMP machine. Processors on a given SMP can address that machine's memory as global.
–The distributed memory component is the networking of multiple SMPs. SMPs know only about their own memory, not the memory on another SMP. Therefore, network communications are required to move data from one SMP to another.
Current trends seem to indicate that this type of memory architecture will continue to prevail and increase at the high end of computing for the foreseeable future.
Advantages and disadvantages: whatever is common to both shared and distributed memory architectures.

18 Parallel Computer - Hybrid Memory

19 Parallel Programming Models Overview
There are several parallel programming models in common use:
–Shared Memory
–Threads
–Message Passing
–Data Parallel
–Hybrid
Parallel programming models exist as an abstraction above hardware and memory architectures. Although it might not seem apparent, these models are NOT specific to a particular type of machine or memory architecture. In fact, any of these models can (theoretically) be implemented on any underlying hardware.

20 Parallel Computing Performance
General speed-up formula: execution time components are
–Inherently sequential computations: σ(n)
–Potentially parallel computations: φ(n)
–Communication operations: κ(n,p)

21 Speed-up Formula
ψ(n,p) ≤ (σ(n) + φ(n)) / (σ(n) + φ(n)/p + κ(n,p))
Speed-up is the sequential time (the computations σ(n) + φ(n)) divided by the parallel time: the parallelised computations φ(n)/p plus the communications κ(n,p).

22 Amdahl’s Law of Speed-up
It states that a small portion of the program which cannot be parallelized will limit the overall speed-up available from parallelization. Any large mathematical or engineering problem will typically consist of several parallelizable parts and several non-parallelizable (sequential) parts. This relationship is given by the equation:
S = 1 / ((1 − P) + P/N)
where S is the speed-up of the program (as a factor of its original sequential runtime), P is the fraction that is parallelizable, and N is the number of processors. As N grows without bound, S approaches 1 / (1 − P).

23 Interesting Amdahl Observation
If the sequential portion of a program is 10% of the runtime, we can get no more than a 10x speed-up, regardless of how many processors are added. This puts an upper limit on the usefulness of adding more parallel execution units.

24 Amdahl’s Law

25 Parallel Efficiency
Efficiency: ε(n,p) = ψ(n,p) / p, with 0 ≤ ε(n,p) ≤ 1.
Amdahl’s law: let f = σ(n) / (σ(n) + φ(n)); i.e., f is the fraction of the code which is inherently sequential. Then ψ ≤ 1 / (f + (1 − f)/p).
(Figure: speedup and efficiency as the number of processors grows.)

26 Examples
–95% of a program’s execution time occurs inside a loop that can be executed in parallel. What is the maximum speedup we should expect from a parallel version of the program executing on 8 CPUs?
–20% of a program’s execution time is spent within inherently sequential code. What is the limit to the speedup achievable by a parallel version of the program?
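Both exercises can be answered with a minimal Python sketch of Amdahl's law (variable names are illustrative):

```python
def amdahl(p, n):
    # Amdahl's law: speed-up with parallelizable fraction p on n processors.
    return 1.0 / ((1.0 - p) + p / n)

# Exercise 1: 95% parallel loop on 8 CPUs.
ex1 = amdahl(0.95, 8)        # ~5.9x
# Exercise 2: 20% inherently sequential, so the limit as n grows is 1/0.2 = 5x.
ex2_limit = 1.0 / 0.2
```

For exercise 1, 1 / (0.05 + 0.95/8) = 1 / 0.16875, roughly a 5.9x speed-up; for exercise 2 no number of processors can exceed 5x.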

27 Amdahl’s Law Limitations
Limitations of Amdahl’s law:
–Ignores κ(n,p), so overestimates speedup
–Assumes f is constant, so underestimates the speedup achievable
Amdahl Effect:
–Typically κ(n,p) has lower complexity than φ(n)/p
–As n increases, φ(n)/p dominates κ(n,p)
–As n increases, speedup increases
–As n increases, the sequential fraction f decreases
(Figure: speedup vs. processors for n = 100, n = 1,000, n = 10,000.)

28 Gustafson’s Law
Gustafson's Law (also known as Gustafson-Barsis' law, 1988) states that any sufficiently large problem can be efficiently parallelized. It is closely related to Amdahl's law, which gives a limit to the degree to which a program can be sped up by parallelization.
S(P) = P − α * (P − 1)
where P is the number of processors, S is the speedup, and α is the non-parallelizable part of the process.
Gustafson's law addresses a shortcoming of Amdahl's law, which cannot scale to match the availability of computing power as the machine size increases.
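Gustafson's scaled speed-up formula can be evaluated directly; a small sketch (the processor count and serial fraction below are illustrative):

```python
def gustafson(p_procs, alpha):
    # Scaled speed-up S(P) = P - alpha * (P - 1),
    # where alpha is the non-parallelizable fraction of the work.
    return p_procs - alpha * (p_procs - 1)

# Illustrative values: 64 processors, 5% serial work in the scaled problem.
s64 = gustafson(64, 0.05)    # 64 - 0.05*63 = 60.85
```

Unlike Amdahl's fixed-size bound, S(P) keeps growing almost linearly with P as long as the problem is scaled up with the machine.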

29 Gustafson’s Law
It also removes the fixed problem size or fixed computation load on the parallel processors: instead, Gustafson proposes a fixed-time concept, which leads to scaled speed-up.
Amdahl's law is based on a fixed workload or fixed problem size. It implies that the sequential part of a program does not change with machine size (i.e., the number of processors), while the parallel part is evenly distributed over n processors.

30 Performance Summary
Performance terms:
–Speedup
–Efficiency
What prevents linear speedup?
–Serial operations
–Communication operations
–Process start-up
–Imbalanced workloads
–Architectural limitations
Analyzing parallel performance:
–Amdahl’s Law
–Gustafson-Barsis’ Law

31 Parallel Programming Examples
This example demonstrates calculations on 2-dimensional array elements, with the computation on each array element being independent of the other array elements.
The serial program calculates one element at a time in sequential order. Serial code could be of the form:
do j = 1, n
  do i = 1, m
    a(i,j) = fcn(i,j)
  end do
end do
The calculation of elements is independent of one another, which leads to an embarrassingly parallel situation. The problem should be computationally intensive.

32 Parallel Programming - 2D Example
Array elements are distributed so that each processor owns a portion of the array (a subarray).
Independent calculation of array elements ensures there is no need for communication between tasks.
The distribution scheme is chosen by other criteria, e.g. unit stride (stride of 1) through the subarrays. Unit stride maximizes cache/memory usage.
After the array is distributed, each task executes the portion of the loop corresponding to the data it owns. For example:
do j = mystart, myend
  do i = 1, m
    a(i,j) = fcn(i,j)
  end do
end do
Notice that only the outer loop variables are different from the serial solution.
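The column-block decomposition above can be sketched in serial Python, where a loop over task ids stands in for the SPMD tasks (my_columns and fcn are illustrative stand-ins, and the array sizes are arbitrary):

```python
m, n, p = 4, 8, 4      # rows, columns, number of tasks (illustrative sizes)

def fcn(i, j):
    # Stand-in for the independent per-element computation.
    return i * 10 + j

def my_columns(task_id):
    # Block decomposition: each task owns a contiguous run of n/p columns,
    # giving the mystart/myend bounds of the pseudo code.
    cols = n // p
    return task_id * cols, (task_id + 1) * cols

a = [[None] * n for _ in range(m)]
for task in range(p):  # in a real SPMD run, each task executes only its slice
    mystart, myend = my_columns(task)
    for j in range(mystart, myend):
        for i in range(m):
            a[i][j] = fcn(i, j)
```

Because no element depends on any other, the per-task loops could run in any order, or all at once, and produce the same array.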

33 Pseudo-code
find out if I am MASTER or WORKER
if I am MASTER
  initialize the array
  send each WORKER info on part of array it owns
  send each WORKER its portion of initial array
  receive from each WORKER results
else if I am WORKER
  receive from MASTER info on part of array I own
  receive from MASTER my portion of initial array
  # calculate my portion of array
  do j = my_first_column, my_last_column
    do i = 1, n
      a(i,j) = fcn(i,j)
    end do
  end do
  send MASTER results
endif

34 Pi Calculation: Serial Solution
The value of PI can be calculated in a number of ways. Consider the following method of approximating PI:
–Inscribe a circle in a square
–Randomly generate points in the square
–Determine the number of points in the square that are also in the circle
–Let r be the number of points in the circle divided by the number of points in the square
–PI ~ 4 r
–Note that the more points generated, the better the approximation

35 Pi Calculation: Serial Solution
Serial pseudo code for this procedure:
npoints = (number of points to generate)
circle_count = 0
do j = 1, npoints
  generate 2 random numbers between 0 and 1
  xcoordinate = random1 ; ycoordinate = random2
  if (xcoordinate, ycoordinate) inside circle then circle_count = circle_count + 1
end do
PI = 4.0*circle_count/npoints
Note that most of the time in running this program would be spent executing the loop. This leads to an embarrassingly parallel solution:
–Computationally intensive
–Minimal communication
–Minimal I/O
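The serial pseudo code corresponds to the following runnable Python sketch (seeded so the estimate is repeatable; the point count is illustrative):

```python
import random

def estimate_pi(npoints, seed=42):
    # Monte Carlo: the fraction of random points in the unit square that
    # fall inside the quarter circle of radius 1 approximates pi/4.
    rng = random.Random(seed)
    circle_count = 0
    for _ in range(npoints):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            circle_count += 1
    return 4.0 * circle_count / npoints

pi_est = estimate_pi(100_000)
```

With 100,000 points the estimate lands near 3.14; as the slide notes, more points give a better approximation, and essentially all the time goes into the loop.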

36 Pi Calculation: Parallel Solution
Parallel strategy: break the loop into portions that can be executed by the tasks. For the task of approximating PI:
–Each task executes its portion of the loop a number of times.
–Each task can do its work without requiring any information from the other tasks (there are no data dependencies).
–Uses the SPMD** model. One task acts as master and collects the results.
A pseudo code solution follows (in the original slides, red highlighted the changes for parallelism).
[**SPMD: Single Program (or Single Process), Multiple Data. Tasks are split up and run simultaneously on multiple processors with different input in order to obtain results faster. SPMD is the most common style of parallel programming. It is a subcategory of MIMD in Flynn’s Taxonomy.]

37 Pi Calculation: Parallel Solution Pseudocode
npoints = (number of points to generate)
circle_count = 0
p = number of tasks
num = npoints/p
find out if I am MASTER or WORKER
do j = 1, num
  generate 2 random numbers between 0 and 1
  xcoordinate = random1 ; ycoordinate = random2
  if (xcoordinate, ycoordinate) inside circle then circle_count = circle_count + 1
end do
if I am MASTER
  receive from WORKERS their circle_counts
  compute PI (use MASTER and WORKER calculations)
else if I am WORKER
  send to MASTER circle_count
endif
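A Python sketch of the same master/worker split, using a thread pool to play the role of the tasks (in CPython the GIL means threads will not actually speed this loop up; a real implementation would use processes or MPI, but the SPMD structure is identical; names and sizes are illustrative):

```python
import random
from concurrent.futures import ThreadPoolExecutor

def worker(args):
    # Every task runs the same loop on its own share of the points (SPMD);
    # a per-task seed keeps the random streams independent.
    num, seed = args
    rng = random.Random(seed)
    return sum(1 for _ in range(num)
               if rng.random() ** 2 + rng.random() ** 2 <= 1.0)

npoints, p = 100_000, 4
num = npoints // p
with ThreadPoolExecutor(max_workers=p) as pool:
    # The "master" collects each worker's circle_count and combines them.
    counts = list(pool.map(worker, [(num, seed) for seed in range(p)]))
pi_est = 4.0 * sum(counts) / npoints
```

The only communication is the final collection of the per-task counts, matching the "minimal communication" property claimed for this problem.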

38 1-D Wave Equation Parallel Solution
Implement as an SPMD model.
The entire amplitude array is partitioned and distributed as sub-arrays to all tasks. Each task owns a portion of the total array.
Load balancing: all points require equal work, so the points should be divided equally.
A block decomposition would have the work partitioned into the number of tasks as chunks, allowing each task to own mostly contiguous data points.

39 1-D Wave Equation Parallel Solution
Communication need only occur on data borders. The larger the block size, the less the communication.
The equation to be solved is the one-dimensional wave equation:
A(i, t+1) = (2.0 * A(i, t)) - A(i, t-1) + (c * (A(i-1, t) - (2.0 * A(i, t)) + A(i+1, t)))
where c is a constant.
Note that the amplitude depends on previous timesteps (t, t-1) and neighboring points (i-1, i+1). This data dependence means that a parallel solution will involve communications.

40 1-D Wave Equation Parallel Solution
find out number of tasks and task identities
# Identify left and right neighbors
left_neighbor = mytaskid - 1 ; right_neighbor = mytaskid + 1
if mytaskid = first then left_neighbor = last
if mytaskid = last then right_neighbor = first
find out if I am MASTER or WORKER
if I am MASTER
  initialize array ; send each WORKER starting info and subarray
else if I am WORKER
  receive starting info and subarray from MASTER
endif
# Update values for each point along string
# In this example the master participates in calculations
do t = 1, nsteps
  send left endpoint to left neighbor ; receive left endpoint from right neighbor
  send right endpoint to right neighbor ; receive right endpoint from left neighbor
  # Update points along line
  do i = 1, npoints
    newval(i) = (2.0 * values(i)) - oldval(i) + (sqtau * (values(i-1) - (2.0 * values(i)) + values(i+1)))
  end do
end do
# Collect results and write to file
if I am MASTER
  receive results from each WORKER
  write results to file
else if I am WORKER
  send results to MASTER
endif
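A serial Python sketch of the update rule above (endpoints held fixed; the initial shape, sizes, and the value of sqtau are illustrative; in the parallel version each task would first exchange its border points with its neighbours before each timestep):

```python
import math

npoints, nsteps, sqtau = 50, 10, 0.1
# Initial pluck: one sine period along the string; the string starts at rest,
# so the previous timestep equals the current one.
values = [math.sin(2 * math.pi * i / (npoints - 1)) for i in range(npoints)]
oldval = values[:]

for t in range(nsteps):
    newval = values[:]                 # endpoints stay fixed
    for i in range(1, npoints - 1):
        # A(i,t+1) = 2A(i,t) - A(i,t-1) + sqtau*(A(i-1,t) - 2A(i,t) + A(i+1,t))
        newval[i] = (2.0 * values[i] - oldval[i]
                     + sqtau * (values[i - 1] - 2.0 * values[i] + values[i + 1]))
    oldval, values = values, newval
```

Each interior point reads its two neighbours from the previous timestep, which is exactly why a block-decomposed parallel version must communicate at the block borders.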

