1 CS4230 Parallel Programming, Lecture 11: Breaking Dependences and Task Parallel Algorithms. Mary Hall, October 2, 2012.

2 Administrative
-Assignment deadline extended to Friday night.
-C++ validation code added to the website, in case Octave/Matlab is too hard to use (you must update the file names accordingly).
-Have you figured out how to validate? How to time?

3 How to Increase Parallelism Granularity When There Are Dependences?
-Your assignment has a number of dependences that prevent outer-loop parallelization.
-Parallelizing the inner "k" loops leads to poor performance.
-How do we break dependences to parallelize the outer loops?

4 Two Transformations for Breaking Dependences
Scalar expansion:
-A scalar is replaced by an array so that the value it computes in each iteration of a loop (nest) is preserved. This transformation may make it possible to perform loop distribution (below) and parallelize the resulting loops differently. Array expansion is analogous, increasing the dimensionality of an array.
Loop distribution (fission):
-The loop header is replicated and distributed across the statements in a loop, so that subsets of statements execute in separate loops. The resulting loops can be parallelized differently.

5 Scalar Expansion Example

  for (i=0; i<m; i++) {
    a[i] = temp;
    temp = b[i] + c[i];
  }

Dependence? Can you parallelize some of the computation? How?
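One possible rewrite, shown as a sketch only (the array temp_x and the entry value temp_init are assumed names, not part of the assignment): temp is expanded into an array so that every iteration owns its own copy, and the loop is then distributed so both pieces can be parallelized.

  /* temp carries a value from iteration i-1 into iteration i.         */
  /* Expand temp into temp_x[] so every iteration writes its own slot. */
  #pragma omp parallel for
  for (i = 0; i < m; i++)
     temp_x[i] = b[i] + c[i];

  a[0] = temp_init;               /* the value temp held before the loop      */
  #pragma omp parallel for
  for (i = 1; i < m; i++)
     a[i] = temp_x[i-1];          /* the value temp had entering iteration i  */
  /* if temp is live after the loop: temp = temp_x[m-1]; */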

6 Loop Distribution Example

  for (j=0; j<100; j++) {
    a[j] = b[j] + c[j];
    d[j] = a[j-1] * 2.0;
  }

Dependence? Can you parallelize some of the computation? How?
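A sketch of what distribution buys you here (illustrative, not the official solution): once the two statements sit in separate loops, every a[j] is computed before any d[j] reads it, so each loop can be parallelized on its own.

  #pragma omp parallel for
  for (j = 0; j < 100; j++)
     a[j] = b[j] + c[j];

  #pragma omp parallel for
  for (j = 0; j < 100; j++)       /* all needed a[] values already exist */
     d[j] = a[j-1] * 2.0;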

7 Back to the Assignment
-Where are the dependences?
-Can we identify where scalar expansion and/or loop distribution might be beneficial?

8 Summary of Stencil Discussion
A stencil examines the input data in the neighborhood of an output point to compute a result. Stencils are common in scientific codes, and many image and signal processing algorithms have the same structure. A minimal sketch of the access pattern appears after this list.
Core performance issues in parallel stencils:
-Multi-dimensional stencils walk all over memory, which can lead to poor performance.
-May benefit from tiling.
-The amount of data reuse, and therefore the benefit of locality optimization, depends on how much of the neighborhood is examined (higher-order stencils look at more of the neighborhood).
-May cause very expensive TLB misses for large data sets in 3 or more dimensions (if accesses cross page boundaries).
-"Ghost zones" for replicating input/output, as in red-blue.
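The sketch below is illustrative only (the array names in and out and the size N are assumptions): a 5-point 2D stencil in which each output point reads four neighbors of the input, so adjacent iterations reuse data and tiling can improve locality.

  for (i = 1; i < N-1; i++)
     for (j = 1; j < N-1; j++)
        /* each output point reads the 4-neighborhood of the input point */
        out[i][j] = 0.25 * (in[i-1][j] + in[i+1][j] + in[i][j-1] + in[i][j+1]);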

9 General: Task Parallelism
Recall definition:
-A task parallel computation is one in which parallelism is applied by performing distinct computations, or tasks, at the same time. Since the number of tasks is fixed, the parallelism is not scalable.
OpenMP support for task parallelism:
-Parallel sections: different threads execute different code.
-Tasks (NEW): tasks are created and executed at separate times (a tiny sketch follows this list).
Common use of task parallelism = producer/consumer:
-A producer creates work that is then processed by the consumer.
-You can think of this as a form of pipelining, as in an assembly line.
-The "producer" writes data to a FIFO queue (or similar) to be accessed by the "consumer".
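A minimal sketch of the task form (illustrative only; the function names are assumptions, and tasks are covered in more depth next time): one thread creates the tasks, and any thread in the team may execute them.

  #pragma omp parallel
  {
     #pragma omp single             /* one thread creates the tasks      */
     {
        #pragma omp task
        do_work_A();                /* assumed function names            */
        #pragma omp task
        do_work_B();
     }
  }                                 /* tasks complete by the implicit barrier */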

10 Simple: OpenMP sections Directive

  #pragma omp parallel
  {
    #pragma omp sections
    {
      #pragma omp section
      { a=...; b=...; }
      #pragma omp section
      { c=...; d=...; }
      #pragma omp section
      { e=...; f=...; }
      #pragma omp section
      { g=...; h=...; }
    } /* omp end sections */
  } /* omp end parallel */

11 Parallel Sections, Example

  #pragma omp parallel shared(n,a,b,c,d) private(i)
  {
    #pragma omp sections nowait
    {
      #pragma omp section
      for (i=0; i<n; i++)
        d[i] = 1.0/c[i];
      #pragma omp section
      for (i=0; i<n-1; i++)
        b[i] = (a[i] + a[i+1])/2;
    } /*-- End of sections --*/
  } /*-- End of parallel region --*/

12 Simple Producer-Consumer Example

  // PRODUCER: initialize A with random data
  void fill_rand(int nval, double *A)
  {
     int i;
     for (i=0; i<nval; i++)
        A[i] = (double) rand()/1111111111;
  }

  // CONSUMER: sum the data in A
  double Sum_array(int nval, double *A)
  {
     int i;
     double sum = 0.0;
     for (i=0; i<nval; i++)
        sum = sum + A[i];
     return sum;
  }

13 Key Issues in Producer-Consumer Parallelism
-The producer needs to tell the consumer that the data is ready.
-The consumer needs to wait until the data is ready.
-Producer and consumer need a way to communicate data: the output of the producer is the input to the consumer.
-Producer and consumer often communicate through a first-in-first-out (FIFO) queue.

14 One Solution to Read/Write a FIFO
The FIFO is in global memory and is shared between the parallel threads. How do you make sure the data is updated? We need a construct that guarantees a consistent view of memory.
-Flush: make sure data is written all the way back to global memory.
Example:

  double A;
  A = compute();
  #pragma omp flush(A)

A fuller producer/consumer sketch using flush follows.
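Below is a hedged sketch, not the textbook's code (the names N, A, flag, and sum are assumptions, with flag shared and initialized to 0): the producer and consumer hand off data through flushes, reusing fill_rand and Sum_array from the earlier slide.

  #pragma omp parallel sections shared(A, flag, sum)
  {
     #pragma omp section                 /* producer */
     {
        fill_rand(N, A);                 /* produce the data            */
        #pragma omp flush                /* make A visible              */
        flag = 1;                        /* signal that A is ready      */
        #pragma omp flush(flag)
     }
     #pragma omp section                 /* consumer */
     {
        while (1) {                      /* spin until producer signals */
           #pragma omp flush(flag)
           if (flag == 1) break;
        }
        #pragma omp flush                /* make sure we see A          */
        sum = Sum_array(N, A);
     }
  }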

15 Another Example from the Textbook: Implement Message-Passing on a Shared-Memory System
-A FIFO queue holds the messages.
-A thread has explicit functions to Send and Receive.
-Send a message by enqueuing on a queue in shared memory.
-Receive a message by grabbing it from the queue.
-Ensure safe access.

16 Message-Passing (code figure from the textbook; Copyright © 2010, Elsevier Inc. All rights reserved)

17 Sending Messages (figure from the textbook; Copyright © 2010, Elsevier Inc. All rights reserved)
-Use synchronization mechanisms to update the FIFO.
-"Flush" happens implicitly.
-What is the implementation of Enqueue? A hedged sketch follows.
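One way the send side could look (a sketch only; the names Send_msg, Enqueue, and msg_queues are assumptions, and the textbook's actual code may differ): the enqueue is wrapped in a critical section, and the critical section's implicit flush makes the update visible to other threads.

  void Send_msg(struct queue *msg_queues[], int my_rank, int dest, int msg)
  {
     #pragma omp critical               /* implicit flush on entry and exit */
     Enqueue(msg_queues[dest], my_rank, msg);
  }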

18 Receiving Messages (figure from the textbook; Copyright © 2010, Elsevier Inc. All rights reserved)
-This thread is the only one to dequeue its messages; other threads may only add more messages.
-Messages are added to the end and removed from the front.
-Therefore, synchronization is needed only when we are on the last entry. A sketch of this case analysis appears below.
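A hedged sketch (the names Try_receive, Dequeue, Print_message, and the enqueued/dequeued counters are assumptions, not necessarily the textbook's exact code): when more than one message is queued, the head and tail are different nodes, so the receiver can dequeue without locking; only the single-entry case can race with a concurrent enqueue.

  void Try_receive(struct queue *q, int my_rank)
  {
     int src, msg;
     int queue_size = q->enqueued - q->dequeued;

     if (queue_size == 0) {
        return;                          /* nothing to receive                  */
     } else if (queue_size == 1) {
        #pragma omp critical             /* a sender may be touching this node  */
        Dequeue(q, &src, &msg);
     } else {
        Dequeue(q, &src, &msg);          /* head != tail, safe without locking  */
     }
     Print_message(src, msg);
  }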

19 Termination Detection (figure from the textbook; Copyright © 2010, Elsevier Inc. All rights reserved)
-Each thread increments "done_sending" after completing its for loop.
-More synchronization is needed on "done_sending". A sketch follows.
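A hedged sketch of the idea (names are assumptions; done_sending is a shared counter of threads that have finished sending): a thread may stop receiving only when its own queue is empty and every thread has finished sending, and the shared increment must itself be synchronized.

  /* after a thread finishes its send loop: */
  #pragma omp atomic
  done_sending++;

  /* a thread is done when nothing is queued and no one will send again */
  int Done(struct queue *q, int done_sending, int thread_count)
  {
     int queue_size = q->enqueued - q->dequeued;
     return (queue_size == 0 && done_sending == thread_count);
  }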

20 What's Coming?
-Next time we will finish task parallelism in OpenMP and look at actual executing code.
-Homework to be used as midterm review, due before class on Thursday, October 18.
-In-class midterm on October 18.

