PARALLEL AND DISTRIBUTED COMPUTING OVERVIEW Fall 2003


1 PARALLEL AND DISTRIBUTED COMPUTING OVERVIEW Fall 2003
TOPICS: Parallel computing requires an understanding of parallel algorithms, parallel languages, and parallel architectures, all of which are covered in this course under the following topics:
- Fundamental concepts in parallel computation
- Synchronous computation: SIMD, vector, and pipeline computing; associative computing; Fortran 90; the ASC language; the MultiC language
- Asynchronous (MIMD or multiprocessor) computation: shared-memory computation with the OpenMP language; distributed-memory MIMD/multiprocessor computation (sometimes called multicomputers), programmed using message passing with the MPI language
- Interconnection networks (SIMD and MIMD)
- Comparison of MIMD and SIMD computation in real-time computation applications
GRADING FOR PDC: Five or six homework assignments, a midterm, and a final examination. Grading: homework 50%, midterm 20%, final 30%.

2 Introduction to Parallel Computing (Chapter One)
References: [1]-[4] given below.
- Chapter 1, "Parallel Programming" by Wilkinson et al.
- Chapter 1, "Parallel Computation" by Akl
- Chapters 1-2, "Parallel Computing" by Quinn, 1994
- Chapter 2, "Parallel Processing & Parallel Algorithms" by Roosta
Need for parallelism:
- Numerical modeling and simulation of scientific and engineering problems.
- Solutions for problems with deadlines, e.g., command and control problems such as air traffic control (ATC).
- Grand challenge problems, whose sequential solutions may take months or years.
Weather prediction: a grand challenge problem.
- The atmosphere is divided into 3D cells. Data such as temperature, pressure, humidity, wind speed, and wind direction are recorded at regular time intervals in each cell.
- There are about 5 x 10^8 cells of (1 mile)^3. It would take a modern computer over 100 days to perform the necessary calculations for a 10-day forecast.
Parallel programming is a viable way to increase computational speed: the overall problem can be split into parts, each of which is solved by a single processor.

3 Such gains in computational power are rare, due to reasons such as
Ideally, n processors would have n times the computational power of one processor, with each doing 1/n of the computation. Such gains in computational power are rare, due to reasons such as:
- Inability to partition the problem perfectly into n parts of the same computational size.
- Necessary data transfer between processors.
- Necessary synchronization of processors.
Two major styles of partitioning problems:
- (Job) Control parallel programming: The problem is divided into different, non-identical tasks that have to be performed. The tasks are divided among the processors so that their workload is roughly balanced. This is considered coarse-grained parallelism.
- Data parallel programming: Each processor performs the same computation on different data sets. Computations do not necessarily have to be synchronous. This is considered fine-grained parallelism.
(A small sketch contrasting the two styles follows.)
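The C/OpenMP sketch below is an illustration added here, not part of the original slides; the array names and sizes are made up. It assumes a compiler with OpenMP support (e.g., gcc with -fopenmp). The `sections` construct assigns different, non-identical tasks to different threads (control parallelism), while the `parallel for` construct applies the same operation to different array elements (data parallelism).

```c
#include <stdio.h>
#include <omp.h>

#define N 1000

int main(void) {
    static double a[N], b[N], c[N];

    /* Control parallelism: different tasks run on different threads, one per section. */
    #pragma omp parallel sections
    {
        #pragma omp section
        { for (int i = 0; i < N; i++) a[i] = i * 0.5; }   /* task 1: build a */
        #pragma omp section
        { for (int i = 0; i < N; i++) b[i] = N - i; }     /* task 2: build b */
    }

    /* Data parallelism: every thread performs the same operation on its own
       chunk of the index range. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("c[0] = %f, c[N-1] = %f\n", c[0], c[N - 1]);
    return 0;
}
```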

4 Shared Memory Multiprocessors (SMPs)
- All processors have access to all memory locations. The processors access memory through some type of interconnection network. This type of memory access is called uniform memory access (UMA).
- A data-parallel programming language, based on a language such as Fortran or C/C++, may be available. Alternatively, programming using threads is sometimes used. More programming details will be discussed later.
- The difficulty of providing fast access to all memory locations in an SMP architecture results in most SMPs having hierarchical or distributed memory systems. This type of memory access is called nonuniform memory access (NUMA).
- Normally, fast cache is used with NUMA systems to reduce the problem of different memory access times for PEs. This creates the problem of ensuring that all copies of the same data in different memory locations are identical (cache coherence). Numerous complex algorithms have been designed for this problem.

5 Message-Passing Multiprocessors (Multicomputers)
- Processors are connected by an interconnection network (discussed later in the chapter).
- Each processor has a local memory and can only access its own local memory. Data is passed between processors using messages, as dictated by the program.
- Note: If the processors run in SIMD mode (i.e., synchronously), then data movement over the network can also be synchronous: movement of the data can be controlled by program steps, and much of the message-passing overhead (e.g., routing, hot spots, headers) can be avoided.
- A common approach to programming multicomputers is to use message-passing library routines (e.g., MPI, PVM) in addition to a conventional sequential language. The problem is divided into processes that can be executed concurrently on individual processors. A processor is normally assigned multiple processes.
- Multicomputers can be scaled to larger sizes much more easily than shared-memory multiprocessors.
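A minimal MPI sketch in C, added here as an illustration (not from the course notes), showing the message-passing style just described: each process has only its own local data, and data moves between processes via explicit send/receive calls.

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int rank, size, value;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's id   */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* number of processes */

    if (rank == 0) {
        value = 42;                          /* data in process 0's local memory */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* process 1 cannot read process 0's memory; it must receive a message */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        printf("process 1 received %d from process 0\n", value);
    }

    MPI_Finalize();
    return 0;
}
```

Compile with mpicc and run with at least two processes, e.g., `mpirun -np 2 ./a.out`.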

6 Multicomputers (cont.)
Programming disadvantages of message passing:
- Programmers must make explicit message-passing calls in the code. This is low-level programming and is error prone.
- Data is not shared but copied, which increases the total data size.
- Data integrity: difficulty in maintaining the correctness of multiple copies of a data item.
Programming advantages of message passing:
- No problem with simultaneous access to data.
- Allows different PCs to operate on the same data independently.
- Allows PCs on a network to be easily upgraded when faster processors become available.
Mixed "distributed shared memory" systems:
- There is a lot of current interest in clusters of SMPs. See Dr. David Bader's or Dr. Joseph JaJa's website.
- Other mixed systems have been developed.

7 Flynn’s Classification Scheme
SISD - single instruction stream, single data stream
- Primarily sequential processors.
MIMD - multiple instruction stream, multiple data stream
- Includes SMPs and multicomputers.
- Processors are asynchronous, since they can independently execute different programs on different data sets.
- Considered by most researchers to contain the most powerful, least restricted computers.
- Have very serious message-passing (or shared-memory) overheads that are often ignored when their algorithmic complexity is compared to that of SIMDs.
- May be programmed using a multiple-program, multiple-data (MPMD) technique.
- A common way to program MIMDs is the single-program, multiple-data (SPMD) method, the normal technique when the number of processors is large. SPMD is the data-parallel programming style for MIMDs.
SIMD - single instruction stream, multiple data streams
- One instruction stream is broadcast to all processors.

8 Flynn’s Taxonomy (cont.)
SIMD (cont.)
- Each processor (also called a processing element or PE) is very simple and is essentially an ALU; PEs do not store a copy of the program nor have a program control unit.
- Individual processors can be inhibited from participating in an instruction (based on a data test).
- All active processors execute the same instruction synchronously, but on different data.
- On a memory access, all active processors must access the same location in their local memory.
- The data items form an array, and an instruction can act on the complete array in one cycle.
MISD - multiple instruction streams, single data stream
- This category is not used very often. Some include pipelined architectures in this category.
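The following sequential C sketch is my own illustration (not from the slides) of how a data test can inhibit individual PEs: every PE sees the same broadcast instruction, but only those whose mask bit is set actually execute it. Each loop iteration stands in for one PE acting in lockstep.

```c
#include <stdio.h>

#define NUM_PES 8

int main(void) {
    int data[NUM_PES] = {3, -1, 4, -1, 5, -9, 2, 6};  /* one value per PE's local memory */
    int active[NUM_PES];                              /* activity mask, one flag per PE  */

    /* Broadcast instruction 1: "set mask where data < 0" (the data test). */
    for (int pe = 0; pe < NUM_PES; pe++)
        active[pe] = (data[pe] < 0);

    /* Broadcast instruction 2: "data = 0", executed only by active PEs;
       inhibited PEs keep their old value. */
    for (int pe = 0; pe < NUM_PES; pe++)
        if (active[pe])
            data[pe] = 0;

    for (int pe = 0; pe < NUM_PES; pe++)
        printf("%d ", data[pe]);                      /* prints 3 0 4 0 5 0 2 6 */
    printf("\n");
    return 0;
}
```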

9 Interconnection Network Overview
References: Texts [1]-[4] discuss these network examples, but reference [3] (Quinn) is particularly good. Only an overview of interconnection networks is included here; they will be covered in greater depth later.
- The PEs (processing elements) are called nodes.
- A link is the connection between two nodes. Links are either bidirectional or implemented as two directional links. Either one wire carrying one bit or parallel wires (one wire for each bit in a word) can be used. These choices do not have a major impact on the concepts presented in this course.
- The diameter is the minimal number of links between the two farthest nodes in the network. The diameter of a network gives the maximal distance a single message may have to travel.
Completely Connected Network
- Each of n nodes has a link to every other node. Requires n(n-1)/2 links. Impractical unless there are very few processors.
Line/Ring Network
- A line consists of a row of n nodes, with connections to adjacent nodes. It is called a ring when a link is added to connect the two end nodes of a line. The line/ring networks have many applications.
- The diameter of a line is n-1 and of a ring is ⌊n/2⌋.

10 The Mesh Interconnection Network
- The nodes are arranged in rows and columns of a rectangle and are connected by links that form a 2D mesh. (Diagram given on board.)
- Each interior node in a 2D mesh is connected to its four nearest neighbors.
- A square mesh with n nodes has √n rows and √n columns.
- The diameter of a √n x √n mesh is 2(√n - 1).
- If the horizontal and vertical ends of a mesh are connected to the opposite sides, the network is called a torus.
- Meshes have been used on actual computers more than any other network. A 3D mesh is a generalization of a 2D mesh and has been used in several computers. The fact that 2D and 3D meshes model physical space makes them useful for many scientific and engineering problems.
Binary Tree Network
- A binary tree network is normally assumed to be a complete binary tree. It has a root node, and each interior node has two links connecting it to nodes in the level below it.
- The height of the tree is lg n and its diameter is 2 lg n.
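These small C helpers are my own additions (not from the notes) that evaluate the diameter formulas above, assuming n is the total number of nodes and, for the mesh, that n is a perfect square.

```c
#include <stdio.h>
#include <math.h>

/* Diameter of a line of n nodes: n - 1 links between the two ends. */
int line_diameter(int n)  { return n - 1; }

/* Diameter of a ring of n nodes: floor(n / 2). */
int ring_diameter(int n)  { return n / 2; }

/* Diameter of a sqrt(n) x sqrt(n) square mesh: 2 * (sqrt(n) - 1). */
int mesh_diameter(int n)  { int s = (int)round(sqrt((double)n)); return 2 * (s - 1); }

/* Approximate diameter of a complete binary tree with n nodes: 2 * lg(n). */
int tree_diameter(int n)  { return (int)(2 * floor(log2((double)n))); }

int main(void) {
    int n = 64;   /* example size */
    printf("n = %d: line %d, ring %d, mesh %d, tree %d\n",
           n, line_diameter(n), ring_diameter(n), mesh_diameter(n), tree_diameter(n));
    return 0;
}
```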

11 Metrics for Evaluating Parallelism
References: All references cover most topics in this section and contain useful information not found in the others. Reference [2, Akl] includes new research and is the main reference used, although others (especially [3, Quinn] and [1, Wilkinson]) are also used.
Granularity: the amount of computation done between communication or synchronization steps, ranked as fine, intermediate, or coarse.
- SIMDs are built for efficient communication and handle fine-grained solutions well.
- SMPs or message-passing MIMDs handle communication less efficiently than SIMDs but more efficiently than clusters, and can handle intermediate-grained solutions well.
- Clusters of workstations or distributed systems have slower communication among PEs and are better suited for coarse-grained computations.
- For asynchronous computations, increasing the granularity reduces expensive communication and reduces the cost of process creation, but it also reduces the number of concurrent processes.

12 Parallel Metrics (continued)
Speedup
- A measure of the reduction in running time due to parallelism.
- Based on measured running times, S(n) = ts/tp, where
  - ts is the execution time on a single processor, using the fastest known sequential algorithm, and
  - tp is the execution time using a parallel processor.
- In theoretical analysis, S(n) = ts/tp, where
  - ts is the worst-case running time of the fastest known sequential algorithm for the problem, and
  - tp is the worst-case running time of the parallel algorithm using n PEs.

13 Parallel Metrics (continued)
Linear speedup is optimal for most problems.
Claim: The maximum possible speedup for parallel computers with n PEs for "normal problems" is n.
Proof of claim:
- Assume a computation is partitioned perfectly into n processes of equal duration.
- Assume no overhead is incurred as a result of this partitioning of the computation.
- Then, under these ideal conditions, the parallel computation will execute n times faster than the sequential computation: the parallel running time is ts/n, so the parallel speedup of this computation is S(n) = ts/(ts/n) = n.
We shall later see that this "proof" is not valid for certain types of nontraditional problems.
Unfortunately, the best speedup possible for most applications is much less than n, since the above assumptions are usually invalid:
- Usually some parts of programs are sequential, and only one PE is active.
- Sometimes a large number of processors are idle for certain portions of the program; e.g., during parts of the execution, many PEs may be waiting to receive or to send data.

14 Parallel Metrics (cont)
Superlinear speedup (i.e., when S(n) > n):
- Most texts besides [2, 3] argue that linear speedup is the maximum speedup obtainable, using the preceding "proof" to argue that superlinearity is impossible.
- Occasionally speedup that appears to be superlinear may occur, but it can usually be explained by other reasons, such as:
  - the extra memory in the parallel system;
  - a sub-optimal sequential algorithm being used as the baseline;
  - luck, in the case of an algorithm that has a random aspect in its design (e.g., random selection).
- Selim Akl has shown that for some less standard problems, superlinear algorithms can be given:
  - Some problems cannot be solved without the use of parallel computation.
  - Some problems are natural to solve using parallelism, and sequential solutions are inefficient.
- The final chapter of Akl's textbook and several journal papers have been written to establish that these claims are valid, but it may still be a long time before they are fully accepted. Superlinearity has been a hotly debated topic for too long to be accepted quickly.

15 Amdahl’s Law Assumes that the speedup is not superlinear; i.e.,
S(n) = ts/ tp  n Assumption only valid for traditional problems. By Figure 1.29 in [1] (or slide #40), if f denotes the fraction of the computation that must be sequential, tp  f ts + (1-f) ts /n Substituting above inequality into the above equation for S(n) and simplifying (see slide #41 or book) yields Amdahl’s “law”: S(n)  1/f, where f is as above. See Slide #41 or Fig for related details. Note that S(n) never exceed 1/f and approaches 1/f as n increases. Example: If only 5% of the computation is serial, the maximum speedup is 20, no matter how many processors are used. Observations: Amdahl’s law limitations to parallelism: For a long time, Amdahl’s law was viewed as a fatal limit to the usefulness of parallelism. Parallel Computers

16 Other limitations in applying Amdahl’s Law:
Amdahl’s law is valid, and some textbooks discuss how it can be used to increase the efficiency of many parallel algorithms:
- It shows that efforts to further reduce the fraction of the code that is sequential may pay off in large performance gains.
- Hardware that allows even a small decrease in the percentage of work executed sequentially may be considerably more efficient.
A key flaw in past arguments that Amdahl's law is a fatal limit to the future of parallelism is Gustafson's Law: the proportion of the computation that is sequential normally decreases as the problem size increases.
Other limitations in applying Amdahl's Law:
- Its proof focuses on the steps in a particular algorithm and does not consider that other algorithms with more parallelism may exist.
- Amdahl's law applies only to "standard" problems where superlinearity does not occur. For more details on superlinearity, see [2], "Parallel Computation: Models and Methods" by Selim Akl (the Speedup Folklore Theorem) and Chapter 12.

17 More Metrics for Parallelism
Efficiency is defined by E(n) = S(n)/n = ts/(n·tp).
- Efficiency gives the percentage of full utilization of the parallel processors on the computation, assuming a speedup of n is the best possible.
Cost: The cost of a parallel algorithm or parallel execution is defined by
Cost = (running time) x (number of PEs) = tp x n.
- Cost allows the quality of parallel algorithms to be compared to that of sequential algorithms: compare the cost of the parallel algorithm to the running time of the sequential algorithm.
- The advantage that parallel algorithms have in using multiple processors is removed by multiplying their running time by the number n of processors they are using.
- If a parallel algorithm requires exactly 1/n the running time of a sequential algorithm, then the parallel cost is the same as the sequential running time.
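A few throwaway C helpers, added as an illustration (not from the notes), that evaluate these metrics for measured or modeled running times; the example timings are made up.

```c
#include <stdio.h>

/* ts: sequential time, tp: parallel time on n PEs. */
double speedup(double ts, double tp)           { return ts / tp; }
double efficiency(double ts, double tp, int n) { return speedup(ts, tp) / n; }
double cost(double tp, int n)                  { return tp * n; }

int main(void) {
    double ts = 100.0, tp = 7.0;   /* example timings in seconds */
    int n = 16;
    printf("speedup    = %.2f\n", speedup(ts, tp));       /* 14.29              */
    printf("efficiency = %.2f\n", efficiency(ts, tp, n)); /* 0.89, i.e., 89%    */
    printf("cost       = %.2f\n", cost(tp, n));           /* 112.00 PE-seconds  */
    return 0;
}
```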

18 More Metrics (cont.)
More Metrics (cont.)
Cost-Optimal Parallel Algorithm: A parallel algorithm for a problem is said to be cost-optimal if its cost is proportional to the running time of an optimal sequential algorithm for the same problem.
- By proportional, we mean that cost = tp x n = k x ts, where k is a constant. (See p. 67 of [1].)
- Equivalently, a parallel algorithm is optimal if parallel cost = O(f(t)), where f(t) is the running time of an optimal sequential algorithm.
- In cases where no optimal sequential algorithm is known, the "fastest known" sequential algorithm is often used instead.

19 Sieve of Eratosthenes (A Data-Parallel vs Control-Parallel Example)
(Instructor notes: Today uses outside material, Quinn 1.4. Bring in Figures 1-9, 1-11, 1-12, and 1-14 from Quinn and a photocopy of this section. Show an example using n = 30, √30 ≈ 5.5. Not the best of algorithms, but good for illustrative purposes.)
Reference: [3, Quinn, Ch. 1], pages 10-17.
- A prime number is a positive integer with exactly two factors, itself and 1.
- The Sieve (siv) algorithm finds the prime numbers less than or equal to some positive integer n:
  - Begin with a list of natural numbers 2, 3, 4, ..., n.
  - Remove composite numbers from the list by striking multiples of 2, 3, 5, and successive primes.
  - After each striking, the next unmarked natural number is prime.
  - The sieve terminates after multiples of the largest prime less than or equal to √n have been struck from the list.
- The sequential implementation uses 3 data structures:
  - a Boolean array indexed by the numbers being sieved;
  - an integer holding the latest prime found so far;
  - a loop index that is incremented as multiples of the current prime are marked as composite numbers.
(A sequential sketch follows.)
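A minimal sequential sieve in C, a sketch of the three-data-structure implementation described above (my own code, not Quinn's):

```c
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    const int n = 30;
    /* Boolean array indexed by the numbers being sieved: 0 = unmarked, 1 = composite. */
    char *marked = calloc(n + 1, 1);

    int prime = 2;                      /* latest prime found so far */
    while (prime * prime <= n) {        /* stop once prime > sqrt(n) */
        /* loop index striking multiples of the current prime */
        for (int i = prime * prime; i <= n; i += prime)
            marked[i] = 1;
        /* the next unmarked number is the next prime */
        do { prime++; } while (marked[prime]);
    }

    for (int i = 2; i <= n; i++)
        if (!marked[i]) printf("%d ", i);   /* prints 2 3 5 7 11 13 17 19 23 29 */
    printf("\n");
    free(marked);
    return 0;
}
```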

20 A Control-Parallel Approach
- Control parallelism involves applying a different sequence of operations to different data elements.
- Useful for: shared-memory MIMD, distributed-memory MIMD, asynchronous PRAM.
Control-parallel sieve:
- Each processor works with a different prime and is responsible for striking multiples of that prime and for identifying a new prime number.
- Each processor starts marking at the square of its prime: one processor marks multiples of 2 beginning at 4, another marks multiples of 3 beginning at 9, etc.
- Shared memory contains the Boolean array of the numbers being sieved and an integer holding the largest prime found so far.
- Each PE's local memory contains a local loop index keeping track of multiples of its current prime (since each PE is working with a different prime).

21 A Control-Parallel Approach (cont.)
- The algorithm still works, but it is wasteful: one processor is sieving 2, another starts sieving 3, and yet another starts sieving 4 (because the first processor has not yet marked it). Again, wasteful. See Figure 1-9 in Quinn, p. 13.
Problems and inefficiencies (algorithm for shared-memory MIMD):
- A processor accesses the variable holding the current prime, searches for the next unmarked value, and updates the variable containing the current prime. We must avoid having two processors doing this at the same time.
- A processor could waste time sieving multiples of a composite number.
How much speedup can we get? Suppose n = 1000.
- Sequential algorithm: the time to strike out the multiples of prime p (starting at p²) is ⌈(n+1-p²)/p⌉ steps.
  - Multiples of 2: ⌈((1000+1)-4)/2⌉ = ⌈997/2⌉ = 499
  - Multiples of 3: ⌈((1000+1)-9)/3⌉ = ⌈992/3⌉ = 331
  - Total over all primes ≤ √1000: 1411 strikeout "steps"
- 2 PEs give speedup 1411/706 = 2.00.
- 3 PEs give speedup 1411/499 = 2.83.
- 3 PEs require 499 strikeout time units, so no more speedup is possible using additional PEs: the multiples of 2 dominate with 499 strikeout steps.
(A small program verifying these counts follows.)
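This small C program is my addition; it reproduces the step counts in the model above by counting, for each prime p ≤ √n, the multiples struck starting at p², then reporting the total and the largest single-prime count.

```c
#include <stdio.h>

int main(void) {
    const int n = 1000;
    int primes[] = {2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31};  /* primes <= sqrt(1000) */
    int total = 0, max_single = 0;

    for (int i = 0; i < (int)(sizeof primes / sizeof primes[0]); i++) {
        int p = primes[i];
        int steps = 0;
        for (int m = p * p; m <= n; m += p)   /* strikeouts for this prime */
            steps++;
        printf("p = %2d: %3d strikeout steps\n", p, steps);
        total += steps;
        if (steps > max_single) max_single = steps;
    }
    /* Prints total = 1411; the 499 steps for p = 2 bound the 3-PE schedule. */
    printf("total = %d, largest single-prime count = %d\n", total, max_single);
    return 0;
}
```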

22 A Data-Parallel Approach
(Instructor note: n = 1,000,000, so √n = 1,000. With, say, k = 50 processors, each gets 1,000,000/50 = 20,000 numbers.)
- Data parallelism refers to using multiple PEs to apply the same sequence of operations to different data elements.
- Useful for the following types of parallel operation: SIMD, PRAM, shared-memory MIMD, distributed-memory MIMD. Generally not useful for pipelined operations.
A data-parallel sieve algorithm:
- Each processor works with the same prime and is responsible for striking multiples of that prime from its own segment of the array of natural numbers.
- Assume we have k processors, where k << √n (i.e., k is much less than √n).
- Each processor gets no more than ⌈n/k⌉ natural numbers.
- All primes less than or equal to √n, as well as the first prime greater than √n, are in the list controlled by the first processor.

23 A Data-Parallel Approach (cont.)
(Instructor note: this analysis seems to assume sequential communication.)
Data-parallel sieve (cont.): distributed-memory MIMD algorithm.
- Processor 1 finds the next prime and broadcasts it to all PEs.
- Each PE goes through its part of the array, striking multiples of that prime (performing the same operation).
- This continues until the first processor reaches a prime greater than √n.
How much speedup can we get? Suppose n = 1,000,000 and we have k PEs.
- There are 168 primes less than 1,000, the largest of which is 997.
- Maximum computation time to strike out multiples ≈ (⌈⌈1,000,000/k⌉/2⌉ + ⌈⌈1,000,000/k⌉/3⌉ + ⌈⌈1,000,000/k⌉/5⌉ + ... + ⌈⌈1,000,000/k⌉/997⌉) x etime, where etime is the time for one strikeout.
- The sequential execution time is the above sum with k = 1.
- The communication time = 168(k-1) x ctime. We assume that each communication takes 100 times longer than a strikeout, i.e., ctime = 100 x etime.
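A sketch of the distributed-memory data-parallel sieve in C with MPI; this is my illustration of the algorithm above, not Quinn's code. It assumes a block decomposition in which process 0 holds the low block containing every prime up to √n (true when k << √n); each broadcast prime is struck from every process's local block.

```c
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    const long n = 1000000;
    int rank, k;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &k);

    /* Block decomposition: this process owns the values low..high (list starts at 2). */
    long low  = 2 + rank * (n - 1) / k;
    long high = 2 + (rank + 1) * (n - 1) / k - 1;
    long size = high - low + 1;
    char *marked = calloc(size, 1);            /* 0 = unmarked, 1 = composite */

    long prime = 2;
    while (prime * prime <= n) {
        /* Strike local multiples of the broadcast prime, starting at prime^2. */
        long first = (low % prime == 0) ? low : low + (prime - low % prime);
        if (first < prime * prime) first = prime * prime;
        for (long m = first; m <= high; m += prime)
            marked[m - low] = 1;

        /* Process 0 finds the next unmarked value > prime and broadcasts it. */
        if (rank == 0) {
            do { prime++; } while (marked[prime - low]);
        }
        MPI_Bcast(&prime, 1, MPI_LONG, 0, MPI_COMM_WORLD);
    }

    long local_count = 0, global_count = 0;
    for (long i = 0; i < size; i++)
        if (!marked[i]) local_count++;
    MPI_Reduce(&local_count, &global_count, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("primes up to %ld: %ld\n", n, global_count);   /* expect 78498 */

    MPI_Finalize();
    free(marked);
    return 0;
}
```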

24 A Data-Parallel Approach (cont.)
(Instructor notes: See Figures 1-11, 1-12, and 1-14 in Quinn, p. 13. Go over the material on page 17 if possible.)
How much speedup can we get? (cont.)
- Speedup is not directly proportional to the number of PEs; in this model it is highest at 11 PEs.
- Computation time is inversely proportional to the number of processors used, while communication time increases linearly.
- Beyond 11 processors, the increase in communication time exceeds the decrease in computation time, and the total execution time increases. Study Figures 1-11 and 1-12 on pg. 15 of [3].
What about parallel I/O time?
- Practically, the primes generated must be stored on an external device. Assume access to the device is sequential.
- I/O time is constant because output must be performed sequentially. This sequential code severely limits the parallel speedup: according to Amdahl's law, the fraction of operations that must be performed sequentially limits the maximum speedup possible.
- Parallel I/O is an important topic in parallel computing.
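The following C snippet is my own; it evaluates the cost model from the previous slide (strikeout time as the unit, one broadcast per prime costing 100(k-1) units) for a range of k and reports where the total is smallest, which under these assumptions lands near k = 11.

```c
#include <stdio.h>

/* Model from the slides: computation = sum over primes p <= 997 of
   ceil(ceil(n/k)/p) strikeout units; communication = 168 * (k-1) * 100 units. */
int main(void) {
    const long n = 1000000;
    char composite[1001] = {0};

    /* Small sieve to list the 168 primes below 1000. */
    for (int p = 2; p * p <= 1000; p++)
        if (!composite[p])
            for (int m = p * p; m <= 1000; m += p) composite[m] = 1;

    long best_k = 1, best_time = -1;
    for (long k = 1; k <= 30; k++) {
        long chunk = (n + k - 1) / k;            /* ceil(n/k) numbers per PE       */
        long comp = 0, nprimes = 0;
        for (int p = 2; p < 1000; p++)
            if (!composite[p]) { comp += (chunk + p - 1) / p; nprimes++; }
        long comm  = nprimes * (k - 1) * 100;    /* 168 broadcasts, cost 100(k-1)  */
        long total = comp + comm;
        printf("k = %2ld: compute %8ld + communicate %7ld = %8ld\n",
               k, comp, comm, total);
        if (best_time < 0 || total < best_time) { best_time = total; best_k = k; }
    }
    printf("minimum total time at k = %ld\n", best_k);
    return 0;
}
```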

25 Future Additions Needed
To be added:
- The work metric and the work-optimal concept.
- The speedup and slowdown results from [2].
- A data-parallel vs. control-parallel example, possibly simpler than the sieve example in Quinn.
- Look at Chapter 2 of [8, Jordan] for possible new information.


