Parallel Programs Conditions of Parallelism: Data Dependence

Parallel Programs
- Conditions of Parallelism: Data Dependence, Control Dependence, Resource Dependence, Bernstein's Conditions
- Asymptotic Notations for Algorithm Analysis
- Parallel Random-Access Machine (PRAM)
  - Example: sum algorithm on a P-processor PRAM
- Network Model of Message-Passing Multicomputers
  - Example: Asynchronous Matrix-Vector Product on a Ring
- Levels of Parallelism in Program Execution
- Hardware vs. Software Parallelism
- Parallel Task Grain Size
- Example Motivating Problems with high levels of concurrency
- Limited Concurrency: Amdahl's Law
- Parallel Performance Metrics: Degree of Parallelism (DOP), Concurrency Profile
- Steps in Creating a Parallel Program: Decomposition, Assignment, Orchestration, Mapping
- Program Partitioning Example
- Static Multiprocessor Scheduling Example

Conditions of Parallelism: Data Dependence
- True data (flow) dependence: a statement S2 is data dependent on statement S1 if an execution path exists from S1 to S2 and at least one output variable of S1 feeds in as an input operand of S2; denoted S1 → S2.
- Antidependence: statement S2 is antidependent on S1 if S2 follows S1 in program order and the output of S2 overlaps the input of S1; denoted S1 +→ S2.
- Output dependence: two statements are output dependent if they produce the same output variable; denoted S1 o→ S2.
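A minimal sketch (the statements and variable names are illustrative, not from the slides) showing all three dependence types among simple assignments:

```python
# Illustrative only: three assignments exhibiting flow, anti and output dependence.
b, c, e = 2, 3, 4

a = b + c      # S1: writes a
d = a * 2      # S2: reads a       -> true (flow) dependence  S1 -> S2
a = e - 1      # S3: rewrites a    -> antidependence S2 +-> S3 (S3 overwrites what S2 read)
               #                      and output dependence S1 o-> S3 (S1 and S3 both write a)
print(d, a)    # 10 3
```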

Conditions of Parallelism: Data Dependence
- I/O dependence: read and write are I/O statements. I/O dependence occurs not because the same variable is involved but because the same file is referenced by both I/O statements.
- Unknown dependence: a dependence that cannot be determined at compile time, e.g. when (see the sketch below):
  - The subscript of a variable is itself subscripted (indirect addressing).
  - The subscript does not contain the loop index.
  - A variable appears more than once with subscripts having different coefficients of the loop variable.
  - The subscript is nonlinear in the loop index variable.
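A small sketch (the index values are hypothetical, chosen for illustration) of an unknown dependence caused by indirect addressing, where the compiler cannot tell whether two iterations touch the same element:

```python
# The subscript of A is itself subscripted, so the dependence pattern is
# unknown until the contents of idx are known at runtime.
idx = [2, 0, 2, 1]               # not known at compile time
A = [10, 20, 30, 40]
for i in range(len(idx)):
    A[idx[i]] = A[idx[i]] + 1    # iterations 0 and 2 both write A[2]: a real dependence
print(A)                         # [11, 21, 32, 40]
```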

Data and I/O Dependence: Examples
Example 1 (register-level code):
  S1: Load R1, A
  S2: Add R2, R1
  S3: Move R1, R3
  S4: Store B, R1
Dependence graph (shown in the original figure): S1 → S2 (flow), S3 → S4 (flow), S2 +→ S3 (antidependence through R1), S1 o→ S3 (output dependence, both write R1).
Example 2 (I/O statements):
  S1: Read (4), A(I)   /Read array A from tape unit 4/
  S2: Rewind (4)       /Rewind tape unit 4/
  S3: Write (4), B(I)  /Write array B into tape unit 4/
  S4: Rewind (4)       /Rewind tape unit 4/
An I/O dependence exists between S1 and S3, caused by the read and write statements accessing the same file (tape unit 4).

Conditions of Parallelism
- Control dependence: the order of execution cannot be determined before runtime due to conditional statements.
- Resource dependence: concerned with conflicts in using shared resources, including functional units (integer, floating point) and memory areas, among parallel tasks.
- Bernstein's conditions: two processes P1, P2 with input sets I1, I2 and output sets O1, O2 can execute in parallel (denoted P1 || P2) if:
    I1 ∩ O2 = ∅
    I2 ∩ O1 = ∅
    O1 ∩ O2 = ∅

Bernstein's Conditions: An Example
For the following five instructions P1, ..., P5 in program order, assume each instruction requires one step to execute and that two adders are available:
  P1: C = D x E
  P2: M = G + C
  P3: A = B + C
  P4: C = L + M
  P5: F = G ÷ E
Checking statement pairs with Bernstein's conditions gives the parallelizable pairs:
  P1 || P5, P2 || P3, P2 || P5, P5 || P3, P4 || P5
A short checker that reproduces this list follows.
(Original figures: the dependence graph, with data dependences as solid lines and resource dependences as dashed lines; the sequential execution of the five instructions; and the parallel execution in three steps, assuming two adders are available per step.)
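A brief sketch (not part of the original slides) that checks Bernstein's conditions for every pair of the five statements, with the input and output sets read directly off the assignments:

```python
from itertools import combinations

stmts = {
    "P1": ({"D", "E"}, {"C"}),   # C = D x E
    "P2": ({"G", "C"}, {"M"}),   # M = G + C
    "P3": ({"B", "C"}, {"A"}),   # A = B + C
    "P4": ({"L", "M"}, {"C"}),   # C = L + M
    "P5": ({"G", "E"}, {"F"}),   # F = G / E
}

def parallel(I1, O1, I2, O2):
    # Bernstein's three conditions: I1 ∩ O2 = ∅, I2 ∩ O1 = ∅, O1 ∩ O2 = ∅
    return not (I1 & O2) and not (I2 & O1) and not (O1 & O2)

for (a, (Ia, Oa)), (b, (Ib, Ob)) in combinations(stmts.items(), 2):
    if parallel(Ia, Oa, Ib, Ob):
        print(f"{a} || {b}")
# Prints the five pairs listed above: P1||P5, P2||P3, P2||P5, P3||P5, P4||P5
```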

Asymptotic Notations for Algorithm Analysis
Asymptotic analysis of the computing time f(n) of an algorithm ignores constant execution factors and concentrates on determining the order of magnitude of algorithm performance.
Upper bound: used in worst-case analysis of algorithm performance.
  f(n) = O(g(n)) iff there exist two positive constants c and n0 such that |f(n)| ≤ c |g(n)| for all n > n0,
  i.e. g(n) is an upper bound on f(n).
  O(1) < O(log n) < O(n) < O(n log n) < O(n²) < O(n³) < O(2ⁿ)

Asymptotic Notations for Algorithm Analysis
Lower bound: used in the analysis of the lower limit of algorithm performance.
  f(n) = Ω(g(n)) if there exist positive constants c and n0 such that |f(n)| ≥ c |g(n)| for all n > n0,
  i.e. g(n) is a lower bound on f(n).
Tight bound: used in finding a tight limit on algorithm performance.
  f(n) = Θ(g(n)) if there exist positive constants c1, c2 and n0 such that c1 |g(n)| ≤ |f(n)| ≤ c2 |g(n)| for all n > n0,
  i.e. g(n) is both an upper and a lower bound on f(n).

The Growth Rate of Common Computing Functions

  log n     n    n log n      n^2       n^3          2^n
    0       1        0          1         1            2
    1       2        2          4         8            4
    2       4        8         16        64           16
    3       8       24         64       512          256
    4      16       64        256      4096        65536
    5      32      160       1024     32768   4294967296
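A short sketch (not from the slides) that regenerates the table above and illustrates how quickly the common complexity functions grow:

```python
# Print log n, n, n log n, n^2, n^3 and 2^n for n = 1, 2, 4, ..., 32.
print(f"{'log n':>6} {'n':>4} {'n log n':>8} {'n^2':>6} {'n^3':>6} {'2^n':>11}")
for k in range(6):
    n = 2 ** k                    # so log2(n) = k and n log n = n * k
    print(f"{k:>6} {n:>4} {n * k:>8} {n ** 2:>6} {n ** 3:>6} {2 ** n:>11}")
```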

Theoretical Models of Parallel Computers
Parallel Random-Access Machine (PRAM):
- n-processor, global shared-memory model.
- Models idealized parallel computers with zero synchronization or memory access overhead.
- Used in parallel algorithm development and in scalability and complexity analysis.
PRAM variants (more realistic models than the pure PRAM):
- EREW-PRAM: neither simultaneous reads nor simultaneous writes to the same memory location are allowed.
- CREW-PRAM: simultaneous reads are allowed, but simultaneous writes to the same location are not.
- ERCW-PRAM: simultaneous writes are allowed, but simultaneous reads from the same memory location are not.
- CRCW-PRAM: both concurrent reads and concurrent writes to the same memory location are allowed.

Example: Sum Algorithm on a P-Processor PRAM
Input: array A of size n = 2^k in shared memory.
Initialized local variables: the order n; the number of processors p = 2^q ≤ n; the processor number s.
Output: the sum S of the elements of A, stored in shared memory.
Algorithm (executed by processor s):
begin
  1. for j = 1 to l (= n/p) do
       Set B(l(s-1) + j) := A(l(s-1) + j)
  2. for h = 1 to log n do
     2.1 if (k - h - q ≥ 0) then
           for j = 2^(k-h-q)(s-1) + 1 to 2^(k-h-q)s do
             Set B(j) := B(2j-1) + B(2j)
     2.2 else {if (s ≤ 2^(k-h)) then Set B(s) := B(2s-1) + B(2s)}
  3. if (s = 1) then set S := B(1)
end
Running time analysis:
- Step 1 takes O(n/p): each processor executes n/p operations.
- The h-th iteration of step 2 takes O(n/(2^h p)), since each processor performs at most n/(2^h p) (but at least one) operations.
- Step 3 takes O(1).
- Total running time: T(n) = O(n/p + log n).
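A compact sketch (assumption: a serial Python simulation rather than a real PRAM) of the algorithm above, reducing n = 2^k values with p = 2^q processors in log n rounds:

```python
from math import log2

def pram_sum(A, p):
    n = len(A)                     # n = 2**k, p = 2**q <= n
    k, q = int(log2(n)), int(log2(p))
    l = n // p
    B = [0] * (n + 1)              # 1-indexed working array in "shared memory"
    # Step 1: each processor s copies its block of l elements into B.
    for s in range(1, p + 1):
        for j in range(1, l + 1):
            B[l * (s - 1) + j] = A[l * (s - 1) + j - 1]
    # Step 2: log n rounds of pairwise additions.
    for h in range(1, k + 1):
        for s in range(1, p + 1):
            if k - h - q >= 0:          # each processor still owns a block of pairs
                for j in range(2 ** (k - h - q) * (s - 1) + 1,
                               2 ** (k - h - q) * s + 1):
                    B[j] = B[2 * j - 1] + B[2 * j]
            elif s <= 2 ** (k - h):     # fewer pairs than processors remain
                B[s] = B[2 * s - 1] + B[2 * s]
    return B[1]                    # Step 3: processor 1 holds the sum

print(pram_sum(list(range(1, 9)), 4))   # 36 = 1 + 2 + ... + 8
```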

Example: Sum Algorithm on a P-Processor PRAM, for n = 8 and p = 4
(Figure: processor allocation for computing the sum of 8 elements on a 4-processor PRAM. Processors P1-P4 first copy A(1)-A(8) into B(1)-B(8); partial sums are then combined pairwise over five time units until S = B(1) holds the total. The operation represented by each node is executed by the processor indicated below the node.)

The Power of the PRAM Model
- Well-developed techniques and algorithms exist to handle many computational problems in the PRAM model.
- It removes algorithmic details regarding synchronization and communication, concentrating on the structural properties of the problem.
- It captures several important parameters of parallel computations: operations performed in unit time, as well as processor allocation.
- The PRAM design paradigms are robust, and many network algorithms can be directly derived from PRAM algorithms.
- It is possible to incorporate synchronization and communication into the shared-memory PRAM model.

Performance of Parallel Algorithms
The performance of a parallel algorithm is typically measured in terms of worst-case analysis.
For a problem Q with a PRAM algorithm that runs in time T(n) using P(n) processors, for an instance of size n:
- The time-processor product C(n) = T(n) · P(n) represents the cost of the parallel algorithm.
- For p < P(n) processors, each of the T(n) parallel steps is simulated in O(P(n)/p) substeps, so the total simulation takes O(T(n)P(n)/p).
- The following four measures of performance are asymptotically equivalent:
  - P(n) processors and T(n) time
  - C(n) = P(n)T(n) cost and T(n) time
  - O(T(n)P(n)/p) time for any number of processors p < P(n)
  - O(C(n)/p + T(n)) time for any number of processors
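A small numeric sketch (illustrative assumption: T(n) = log2 n and P(n) = n, as for the PRAM sum algorithm above) of the cost and of the equivalent running time on fewer processors:

```python
from math import log2

def measures(n, p):
    T, P = log2(n), n            # time and processor count of the PRAM algorithm
    C = T * P                    # cost = time-processor product C(n) = T(n) * P(n)
    T_p = C / p + T              # O(C(n)/p + T(n)) time on p < P(n) processors
    return C, T_p

print(measures(1024, 16))        # (10240.0, 650.0): 10240 total steps, ~650 on 16 processors
```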

Network Model of Message-Passing Multicomputers
- A network of processors can be viewed as a graph G(N, E):
  - Each node i ∈ N represents a processor.
  - Each edge (i, j) ∈ E represents a two-way communication link between processors i and j.
- Each processor is assumed to have its own local memory; no shared memory is available.
- Operation is synchronous or asynchronous (message passing).
- Typical message-passing communication constructs:
  - send(X, i): a copy of X is sent to processor Pi; execution continues.
  - receive(Y, j): execution is suspended until the data from processor Pj is received and stored in Y; then execution resumes.

Network Model of Multicomputers
- Routing is concerned with delivering each message from source to destination over the network.
- Additional important network topology parameters:
  - The network diameter: the maximum distance between any pair of nodes.
  - The maximum degree of any node in G.
- Example: linear array: p processors P1, ..., Pp are connected in a linear array where:
  - Processor Pi is connected to Pi-1 and Pi+1 if they exist.
  - The diameter is p-1; the maximum degree is 2.
- A ring is a linear array of processors in which P1 and Pp are also directly connected (both cases are illustrated in the sketch below).
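A minimal sketch (not from the slides) computing the diameter and maximum degree of a p-processor linear array and ring, matching the values quoted above:

```python
from collections import deque

def linear_array(p):
    return {i: [j for j in (i - 1, i + 1) if 1 <= j <= p] for i in range(1, p + 1)}

def ring(p):
    g = linear_array(p)
    g[1].append(p); g[p].append(1)       # the extra P1 <-> Pp link closes the ring
    return g

def diameter(g):
    best = 0
    for src in g:                        # breadth-first search from every node
        dist = {src: 0}
        q = deque([src])
        while q:
            u = q.popleft()
            for v in g[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        best = max(best, max(dist.values()))
    return best

p = 8
for name, g in (("linear array", linear_array(p)), ("ring", ring(p))):
    print(name, "diameter:", diameter(g), "max degree:", max(len(v) for v in g.values()))
# linear array diameter: 7 (= p - 1), max degree: 2; ring diameter: 4, max degree: 2
```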

A Four-Dimensional Hypercube
(Figure: a 16-node, four-dimensional hypercube.) Two processors are connected if their binary indices differ in one bit position.
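A brief sketch (not from the slides): in a d-dimensional hypercube, processors i and j are neighbours exactly when i XOR j has a single 1 bit.

```python
def connected(i, j):
    x = i ^ j
    return x != 0 and (x & (x - 1)) == 0     # true iff x is a power of two

d = 4
nodes = range(2 ** d)
edges = [(i, j) for i in nodes for j in nodes if i < j and connected(i, j)]
print(len(edges))                            # 32 links = 16 nodes x 4 neighbours / 2
print(connected(0b0101, 0b0111))             # True: indices differ only in bit 1
print(connected(0b0101, 0b0110))             # False: indices differ in two bits
```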

Example: Asynchronous Matrix-Vector Product on a Ring
Input (to processor Pi):
- the n x n matrix A and the vector x of order n;
- the processor number i and the number of processors p;
- the i-th submatrix B = A(1:n, (i-1)r+1 : ir) of size n x r, where r = n/p;
- the i-th subvector w = x((i-1)r+1 : ir) of size r.
Output: processor Pi computes the vector y = A1x1 + ... + Aixi and passes the result to the right; upon completion, P1 holds the product Ax.
Begin
  1. Compute the matrix-vector product z = Bw
  2. If i = 1 then set y := 0 else receive(y, left)
  3. Set y := y + z
  4. send(y, right)
  5. if i = 1 then receive(y, left)
End
Performance: Tcomp = k(n²/p), Tcomm = p(λ + μn), so T = Tcomp + Tcomm = k(n²/p) + p(λ + μn), where k, λ and μ are machine constants (computation time per operation, message startup latency, and per-word transfer time).
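A serial sketch (assumption: a plain Python loop stands in for the p ring processors, and send/receive are replaced by carrying y from one stage to the next) of the algorithm above; each stage multiplies its n x r column block by its subvector and adds the result to the running sum, so the value returned is what P1 holds after the final receive, i.e. Ax:

```python
def ring_matvec(A, x, p):
    n = len(A)
    r = n // p
    y = [0.0] * n                                   # the message circulating on the ring
    for i in range(1, p + 1):                       # ring stages P1, ..., Pp
        cols = range((i - 1) * r, i * r)            # this stage's column block B
        z = [sum(A[row][c] * x[c] for c in cols) for row in range(n)]   # z = B w
        y = [y[row] + z[row] for row in range(n)]                       # y := y + z
    return y

A = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
x = [1, 1, 1, 1]
print(ring_matvec(A, x, p=2))                       # [10.0, 26.0, 42.0, 58.0]
```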

Creating a Parallel Program
Assumption: a sequential algorithm to solve the problem is given, or a different algorithm with more inherent parallelism is devised.
- Most programming problems have several parallel solutions; the best solution may differ from that suggested by existing sequential algorithms.
One must:
- Identify work that can be done in parallel.
- Partition work, and perhaps data, among processes.
- Manage data access, communication and synchronization.
- Note: work includes computation, data access and I/O.
Main goal: speedup (plus low programming effort and resource needs):
  Speedup(p) = Performance(p) / Performance(1)
For a fixed problem:
  Speedup(p) = Time(1) / Time(p)

Some Important Concepts
- Task: an arbitrary piece of undecomposed work in a parallel computation, executed sequentially on a single processor; concurrency is only across tasks. E.g. a particle/cell in Barnes-Hut, or a ray or ray group in Raytrace. Tasks may be fine-grained or coarse-grained.
- Process (thread): an abstract entity that performs the tasks assigned to it. Processes communicate and synchronize to perform their tasks.
- Processor: the physical engine on which a process executes. Processes virtualize the machine to the programmer: first write the program in terms of processes, then map processes to processors.

Levels of Parallelism in Program Execution
- Level 5 (coarse grain): jobs or programs (multiprogramming).
- Level 4: subprograms, job steps or related parts of a program.
- Level 3 (medium grain): procedures, subroutines, or co-routines.
- Level 2: non-recursive loops or unfolded iterations.
- Level 1 (fine grain): instructions or statements.
Moving from coarse grain (level 5) toward fine grain (level 1) offers a higher degree of parallelism but increasing communication demand and mapping/scheduling overhead.

Hardware and Software Parallelism
Hardware parallelism:
- Defined by machine architecture, hardware multiplicity (number of processors available) and connectivity.
- Often a function of cost/performance tradeoffs.
- Characterized in a single processor by the number of instructions k issued in a single cycle (k-issue processor).
- A multiprocessor system with n k-issue processors can handle at most nk threads.
Software parallelism:
- Defined by the control and data dependences of programs.
- Revealed in program profiling or in the program flow graph.
- A function of the algorithm, programming style and compiler optimization.

Computational Parallelism and Grain Size
Grain size (granularity) is a measure of the amount of computation involved in a task in a parallel computation:
- Instruction level: parallelism at the instruction or statement level, with a grain size of 20 instructions or less. For scientific applications, parallelism at this level ranges from 500 to 3000 concurrent statements. Manual parallelism detection is difficult but is assisted by parallelizing compilers.
- Loop level: iterative loop operations, typically 500 instructions or less per iteration. Optimized on vector parallel computers; independent successive loop operations can be vectorized or run in SIMD mode.

Computational Parallelism and Grain Size
- Procedure level: medium-size grain at the task, procedure and subroutine levels; less than 2000 instructions. Detection of parallelism is more difficult than at finer-grain levels, but communication requirements are lower than for fine-grain parallelism. Relies heavily on effective operating system support.
- Subprogram level: job steps and subprograms; thousands of instructions per grain. Often scheduled on message-passing multicomputers.
- Job (program) level, or multiprogramming: independent programs executed on a parallel computer; grain size in the tens of thousands of instructions.

Example Motivating Problems: Simulating Ocean Currents
(Figures: (a) cross sections; (b) spatial discretization of a cross section.)
- Model the ocean as two-dimensional grids, discretized in space and time; finer spatial and temporal resolution gives greater accuracy.
- Many different computations per time step: set up and solve equations.
- Concurrency exists across and within the grid computations.

Example Motivating Problems: Simulating Galaxy Evolution
- Simulate the interactions of many stars evolving over time.
- Computing forces is expensive: O(n²) for the brute-force approach.
- Hierarchical methods take advantage of the force law F = G m1 m2 / r².
- Many time steps, and plenty of concurrency across stars within each one.

Example Motivating Problems: Rendering Scenes by Ray Tracing
- Shoot rays into the scene through pixels in the image plane and follow their paths:
  - Rays bounce around as they strike objects.
  - They generate new rays: a ray tree per input ray.
- The result is a color and opacity for each pixel.
- Parallelism is across rays.
All of the above case studies have abundant concurrency.

Limited Concurrency: Amdahl's Law
- The most fundamental limitation on parallel speedup: if a fraction s of the sequential execution is inherently serial, then speedup ≤ 1/s.
- Example: a 2-phase calculation:
  - Sweep over an n-by-n grid and do some independent computation.
  - Sweep again and add each value into a global sum.
  - Time for the first phase = n²/p; the second phase is serialized at the global variable, so its time = n².
  - Speedup ≤ 2n² / (n²/p + n²), i.e. at most 2.
- Possible trick: divide the second phase into two:
  - Accumulate into a private sum during the sweep.
  - Add the per-process private sums into the global sum.
  - Parallel time is n²/p + n²/p + p, and speedup is at best 2n² / (2n²/p + p) = 2pn² / (2n² + p²).
A numeric sketch of both versions follows.
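A small numeric sketch (not from the slides) of the two-phase example above: serializing the global sum caps the speedup near 2, while accumulating into per-process private sums first keeps the speedup close to p.

```python
def speedup_serial_sum(n, p):
    return 2 * n**2 / (n**2 / p + n**2)        # phase 1 parallel, phase 2 serial

def speedup_private_sums(n, p):
    return 2 * n**2 / (2 * n**2 / p + p)       # both phases parallel + p-step reduction

n = 1000
for p in (4, 16, 64):
    print(p, round(speedup_serial_sum(n, p), 2), round(speedup_private_sums(n, p), 2))
# p=4: 1.6 vs 4.0;  p=16: 1.88 vs 16.0;  p=64: 1.97 vs 63.87
```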

Amdahl's Law Example: A Pictorial Depiction
(Figure: work done concurrently plotted against time for (a) the serial execution on one processor, (b) the version whose second phase is serialized at the global sum, taking n²/p followed by n², and (c) the version using private sums, with both n²/p phases concurrent followed by a p-step accumulation.)

Parallel Performance Metrics: Degree of Parallelism (DOP)
- For a given time period, DOP reflects the number of processors in a specific parallel computer actually executing a particular parallel program.
- Average parallelism: assume maximum parallelism m, n homogeneous processors, and computing capacity Δ for a single processor.
- Total amount of work W (instructions or computations) performed between times t1 and t2:
    W = Δ ∫ from t1 to t2 of DOP(t) dt
  or, as a discrete summation,
    W = Δ Σ (i=1..m) i·ti
  where ti is the total time during which DOP = i, and Σ (i=1..m) ti = t2 - t1.
- The average parallelism A:
    A = (1/(t2 - t1)) ∫ from t1 to t2 of DOP(t) dt
  or, in discrete form,
    A = [Σ (i=1..m) i·ti] / [Σ (i=1..m) ti]

Example: Concurrency Profile of a Divide-and-Conquer Algorithm
- Execution observed from t1 = 2 to t2 = 27.
- Peak parallelism m = 8.
- A = (1x5 + 2x3 + 3x4 + 4x6 + 5x2 + 6x2 + 8x3) / (5 + 3 + 4 + 6 + 2 + 2 + 3) = 93/25 = 3.72
(Figure: degree of parallelism (DOP) plotted against time over the interval t1 to t2.)
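A tiny sketch (not from the slides) reproducing the average-parallelism computation for the concurrency profile above:

```python
profile = {1: 5, 2: 3, 3: 4, 4: 6, 5: 2, 6: 2, 8: 3}   # DOP -> total time at that DOP

work = sum(dop * t for dop, t in profile.items())       # 93, the area under the profile
span = sum(profile.values())                            # 25 observed time units
print(work, span, work / span)                          # 93 25 3.72
```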

Concurrency Profile & Speedup
- For a parallel program, DOP may range from 1 (serial) to a maximum m.
- The area under the concurrency profile is the total work done, i.e. the time with 1 processor.
- The horizontal extent is a lower bound on the time (with infinitely many processors).
- With fk denoting the time spent at DOP = k, the speedup on p processors is bounded by
    Speedup(p) ≤ [Σ (k=1..m) fk·k] / [Σ (k=1..m) fk·⌈k/p⌉]
  The base case of a serial fraction s with the remainder fully parallel gives Amdahl's law:
    Speedup(p) ≤ 1 / (s + (1 - s)/p)
- Amdahl's law applies to any overhead, not just limited concurrency.

Parallel Performance Example
The execution time T of three parallel programs is given below in terms of the processor count P and problem size N. In each case, we assume that the total computation work performed by an optimal sequential algorithm scales as N + N².
- First parallel algorithm: T = N + N²/P. This algorithm partitions the computationally demanding O(N²) component but replicates the O(N) component on every processor. There are no other sources of overhead.
- Second parallel algorithm: T = (N + N²)/P + 100. This algorithm optimally divides all the computation among all processors but introduces an additional cost of 100.
- Third parallel algorithm: T = (N + N²)/P + 0.6P². This algorithm also partitions all the computation optimally but introduces an additional cost of 0.6P².
All three algorithms achieve a speedup of about 10.8 when P = 12 and N = 100. However, they behave differently in other situations, as shown next. With N = 100, all three algorithms perform poorly for larger P, although Algorithm 3 does noticeably worse than the other two. When N = 1000, Algorithm 2 is much better than Algorithm 1 for larger P. The sketch below evaluates these cases.
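A short numeric sketch (not from the slides) evaluating the three execution-time models, with the sequential time taken as N + N² as stated above:

```python
def speedups(N, P):
    seq = N + N**2
    T1 = N + N**2 / P                       # Algorithm 1: O(N) part replicated
    T2 = (N + N**2) / P + 100               # Algorithm 2: fixed overhead of 100
    T3 = (N + N**2) / P + 0.6 * P**2        # Algorithm 3: overhead growing as 0.6 P^2
    return [round(seq / T, 1) for T in (T1, T2, T3)]

print(speedups(100, 12))     # [10.8, 10.7, 10.9] - all about 10.8
print(speedups(100, 128))    # Algorithm 3 is clearly the worst at large P
print(speedups(1000, 128))   # Algorithm 2 now beats Algorithm 1
```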

Parallel Performance Example (continued)
- All three algorithms achieve a speedup of about 10.8 when P = 12 and N = 100.
- For N = 1000, Algorithm 2 performs much better than Algorithm 1 for larger P.
  Algorithm 1: T = N + N²/P
  Algorithm 2: T = (N + N²)/P + 100
  Algorithm 3: T = (N + N²)/P + 0.6P²

Steps in Creating a Parallel Program
- Four steps: decomposition, assignment, orchestration, mapping.
- Done by the programmer or by system software (compiler, runtime, ...).
- The issues are the same, so assume the programmer does it all explicitly.

Decomposition
Break up the computation into concurrent tasks to be divided among processes:
- Tasks may become available dynamically, and the number of available tasks may vary with time.
- Together with assignment, also called partitioning: identify concurrency and decide the level at which to exploit it.
- Grain-size problem: determine the number and size of grains or tasks in a parallel program. This is problem- and machine-dependent; solutions involve tradeoffs between parallelism, communication and scheduling/synchronization overhead.
- Grain packing: combine multiple fine-grain nodes into a coarse-grain node (task) to reduce communication delays and overall scheduling overhead.
Goal: enough tasks to keep processes busy, but not too many. The number of tasks available at a time is an upper bound on the achievable speedup.

Assignment
Specifying the mechanisms to divide work up among processes:
- Together with decomposition, also called partitioning.
- Balance the workload; reduce communication and management cost.
- Partitioning problem: partition a program into parallel branches or modules to give the shortest possible execution time on a specific parallel architecture.
Structured approaches usually work well:
- Code inspection (parallel loops) or understanding of the application.
- Well-known heuristics.
- Static versus dynamic assignment.
As programmers, we worry about partitioning first: it is usually independent of the architecture or programming model, but the cost and complexity of using primitives may affect decisions.

Orchestration
- Naming data, structuring communication, synchronization, organizing data structures, and scheduling tasks temporally.
Goals:
- Reduce the cost of communication and synchronization as seen by processors.
- Preserve locality of data reference (including data structure organization).
- Schedule tasks to satisfy dependences early.
- Reduce the overhead of parallelism management.
Orchestration is closest to the architecture (and to the programming model and language): choices depend heavily on the communication abstraction and the efficiency of primitives, so architects should provide appropriate primitives efficiently.

Mapping
Each task is assigned to a processor in a manner that attempts to satisfy the competing goals of maximizing processor utilization and minimizing communication costs. Mapping can be specified statically or determined at runtime by load-balancing algorithms (dynamic scheduling).
Two aspects of mapping:
- Which processes will run on the same processor, if necessary.
- Which process runs on which particular processor (mapping to a network topology).
One extreme: space-sharing:
- The machine is divided into subsets, with only one application at a time in a subset.
- Processes can be pinned to processors, or left to the OS.
Another extreme: complete resource-management control is given to the OS, which uses the performance techniques we will discuss later.
The real world is between the two: the user specifies desires in some aspects, and the system may ignore them.

Program Partitioning Example
Example 2.4, page 64; Fig. 2.6, page 65; Fig. 2.7, page 66 in Advanced Computer Architecture, Hwang.

Static Multiprocessor Scheduling
- Dynamic multiprocessor scheduling is an NP-hard problem.
- Node duplication: to eliminate idle time and communication delays, some nodes may be duplicated in more than one processor.
- Fig. 2.8, page 67 and Example 2.5, page 68 in Advanced Computer Architecture, Hwang.
