CS4230 Parallel Programming Lecture 11: Breaking Dependences and Task Parallel Algorithms
Mary Hall, October 2, 2012

Presentation transcript:


Administrative
-Assignment deadline extended to Friday night.
-C++ validation code added to the website, in case Octave/Matlab is too hard to use (you must update file names accordingly).
-Have you figured out how to validate? How to time?

How to Increase Parallelism Granularity when there are Dependences?
-Your assignment has a number of dependences that prevent outer loop parallelization
-Parallelizing the inner "k" loops leads to poor performance
-How do we break dependences to parallelize the outer loops?

Two Transformations for Breaking Dependences
Scalar expansion:
-A scalar is replaced by an array so that the value it computes in each iteration of a loop (nest) is maintained. This transformation may make it possible to perform loop distribution (below) and parallelize the resulting loops differently. Array expansion is analogous: it increases the dimensionality of an array used inside the loop nest.
Loop distribution (fission):
-The loop header is replicated and distributed across the statements in a loop, so that subsets of the statements are executed in separate loops. The resulting loops can be parallelized differently.

Scalar Expansion Example

  for (i=0; i<m; i++) {
    a[i] = temp;
    temp = b[i] + c[i];
  }

Dependence? Can you parallelize some of the computation? How?
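One possible answer, as a hedged sketch rather than the official solution: expand temp into a scratch array (called temp_x here; that name, the saved initial value temp_0, and the OpenMP pragmas are illustrative assumptions), then distribute the loop. Both resulting loops are free of loop-carried dependences and can run in parallel. The sketch assumes m > 0 and a temp_x array of length m.

  /* Sketch only: scalar expansion followed by loop distribution. */
  double temp_0 = temp;              /* value temp held before the loop */
  #pragma omp parallel for
  for (i = 0; i < m; i++)
      temp_x[i] = b[i] + c[i];       /* value temp holds after iteration i */

  a[0] = temp_0;
  #pragma omp parallel for
  for (i = 1; i < m; i++)
      a[i] = temp_x[i-1];            /* the value produced in the previous iteration */

  temp = temp_x[m-1];                /* restore the scalar's final value if it is used later */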

Loop Distribution Example

  for (j=0; j<100; j++) {
    a[j] = b[j] + c[j];
    d[j] = a[j-1] * 2.0;
  }

Dependence? Can you parallelize some of the computation? How?
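Again as a hedged sketch (the pragmas and the split are illustrative, not the assignment's required answer): the loop-carried dependence flows from the write of a[j] to the read of a[j-1] in the next iteration. Distributing the loop places all writes to a before all reads of a[j-1], and each resulting loop is then fully parallel.

  /* Sketch only: loop distribution (fission) applied to the loop above.
     As in the original snippet, j starts at 0, so the second loop still
     reads a[-1] in its first iteration; adjust the bounds if that is not intended. */
  #pragma omp parallel for
  for (j = 0; j < 100; j++)
      a[j] = b[j] + c[j];

  #pragma omp parallel for
  for (j = 0; j < 100; j++)
      d[j] = a[j-1] * 2.0;           /* for j >= 1, a[j-1] was produced by the first loop */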

Back to the Assignment
-Where are the dependences?
-Can we identify where scalar expansion and/or loop distribution might be beneficial?

Summary of Stencil Discussion
-A stencil examines the input data in the neighborhood of an output point to compute a result. Stencils are common in scientific codes and also resemble many image and signal processing algorithms.
Core performance issues in parallel stencils:
-Multi-dimensional stencils walk all over memory, which can lead to poor performance
-May benefit from tiling
-The amount of data reuse, and therefore the benefit of locality optimization, depends on how much of the neighborhood is examined (higher-order stencils look at more of the neighborhood)
-May cause very expensive TLB misses for large data sets and 3 or more dimensions (if accesses cross page boundaries)
-"Ghost zones" replicate input/output at boundaries, as in the red-blue computation
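As a concrete illustration of the pattern (a hypothetical example, not taken from the assignment): a simple 4-neighbor averaging stencil over the interior of an n x n grid, parallelized over rows with OpenMP. The names in, out, and n are assumptions.

  /* Sketch only: 2D 4-neighbor stencil; in and out are n*n row-major arrays,
     and boundary points are left untouched. */
  #pragma omp parallel for
  for (int i = 1; i < n-1; i++)
      for (int j = 1; j < n-1; j++)
          out[i*n + j] = 0.25 * (in[(i-1)*n + j] + in[(i+1)*n + j]
                               + in[i*n + (j-1)] + in[i*n + (j+1)]);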

General: Task Parallelism
Recall definition:
-A task parallel computation is one in which parallelism is applied by performing distinct computations, or tasks, at the same time. Since the number of tasks is fixed, the parallelism is not scalable.
OpenMP support for task parallelism:
-Parallel sections: different threads execute different code
-Tasks (NEW): tasks are created and executed at separate times
Common use of task parallelism = Producer/Consumer:
-A producer creates work that is then processed by the consumer
-You can think of this as a form of pipelining, as in an assembly line
-The "producer" writes data to a FIFO queue (or similar) to be accessed by the "consumer"
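To make the "created and executed at separate times" bullet concrete, here is a minimal hedged sketch of the OpenMP task construct (this example is not from the slides; process() and item[] are assumed to exist): one thread creates the tasks, and any thread in the team may execute them later.

  /* Sketch only: one thread generates a task per item; the whole team executes them. */
  #pragma omp parallel
  {
      #pragma omp single                  /* only one thread creates tasks */
      for (int i = 0; i < n; i++) {
          #pragma omp task firstprivate(i)
          process(item[i]);
      }
  }                                       /* implicit barrier: all tasks done here */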

Simple: OpenMP sections directive

  #pragma omp parallel
  {
    #pragma omp sections
    {
      #pragma omp section
      { a=...; b=...; }
      #pragma omp section
      { c=...; d=...; }
      #pragma omp section
      { e=...; f=...; }
      #pragma omp section
      { g=...; h=...; }
    } /*omp end sections*/
  } /*omp end parallel*/

Parallel Sections, Example

  #pragma omp parallel shared(n,a,b,c,d) private(i)
  {
    #pragma omp sections nowait
    {
      #pragma omp section
      for (i=0; i<n; i++)
        d[i] = 1.0/c[i];

      #pragma omp section
      for (i=0; i<n-1; i++)
        b[i] = (a[i] + a[i+1])/2;
    } /*-- End of sections --*/
  } /*-- End of parallel region --*/

Simple Producer-Consumer Example

  // PRODUCER: initialize A with random data
  void fill_rand(int nval, double *A) {
    int i;
    for (i=0; i<nval; i++)
      A[i] = (double) rand() / RAND_MAX;  /* divisor garbled in the transcript; RAND_MAX is an assumption */
  }

  // CONSUMER: sum the data in A
  double Sum_array(int nval, double *A) {
    int i;
    double sum = 0.0;
    for (i=0; i<nval; i++)
      sum = sum + A[i];
    return sum;
  }

Key Issues in Producer-Consumer Parallelism
-Producer needs to tell consumer that the data is ready
-Consumer needs to wait until data is ready
-Producer and consumer need a way to communicate data
 -output of producer is input to consumer
-Producer and consumer often communicate through a first-in-first-out (FIFO) queue

One Solution to Read/Write a FIFO
-The FIFO is in global memory and is shared between the parallel threads
-How do you make sure the data is updated?
-Need a construct to guarantee a consistent view of memory
 -Flush: make sure data is written all the way back to global memory

Example:
  double A;
  A = compute();
  Flush(A);        /* i.e., #pragma omp flush(A) in OpenMP */
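Putting the pieces together, here is a hedged sketch of how fill_rand and Sum_array might be paired using sections, a shared flag, and flush (adapted from the common OpenMP producer-consumer idiom; the flag variable and the spin-wait loop are assumptions, not code from the slides):

  /* Sketch only: the producer fills A and then raises flag; the consumer spins
     on flag and then sums A. A, N, flag, and sum are shared. */
  int flag = 0;
  #pragma omp parallel sections
  {
      #pragma omp section                 /* producer */
      {
          fill_rand(N, A);
          #pragma omp flush               /* make A visible before the flag */
          flag = 1;
          #pragma omp flush(flag)
      }
      #pragma omp section                 /* consumer */
      {
          while (1) {
              #pragma omp flush(flag)
              if (flag) break;            /* wait until the producer signals */
          }
          #pragma omp flush               /* make sure the producer's A is visible */
          sum = Sum_array(N, A);
      }
  }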

Another Example from the Textbook: Implement Message-Passing on a Shared-Memory System
-A FIFO queue holds messages
-A thread has explicit functions to Send and Receive
 -Send a message by enqueuing it on a queue in shared memory
 -Receive a message by grabbing it from the queue
 -Ensure safe access

Message-Passing
(Figure from the textbook; Copyright © 2010, Elsevier Inc. All rights reserved.)

Sending Messages
(Figure from the textbook; Copyright © 2010, Elsevier Inc. All rights reserved.)
-Use synchronization mechanisms to update the FIFO
-"Flush" happens implicitly
-What is the implementation of Enqueue?
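To sketch an answer to the Enqueue question (names such as queue_t, Enqueue, and msg_queues are illustrative assumptions, not the textbook's exact code): the enqueue is protected by a critical section, and the flush implied on entry to and exit from the critical section is what makes the message visible to the receiver.

  /* Sketch only: each thread owns one queue; Send enqueues onto the
     destination thread's queue under a critical section. */
  void Send_msg(queue_t *msg_queues[], int my_rank, int dest, int mesg) {
      #pragma omp critical                /* implies a flush on entry and exit */
      Enqueue(msg_queues[dest], my_rank, mesg);
  }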

Receiving Messages
(Figure from the textbook; Copyright © 2010, Elsevier Inc. All rights reserved.)
-This thread is the only one that dequeues its messages; other threads may only add more messages.
-Messages are added at the end and removed from the front.
-Therefore, synchronization is needed only when we are down to the last entry.
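A hedged sketch of the receive side under the same assumptions (the enqueued/dequeued counters are illustrative names): only when exactly one message remains could the receiver race with a sender working on the same node, so only that case takes the critical section.

  /* Sketch only: dequeue without locking unless a single message is left. */
  int Try_receive(queue_t *q, int *src, int *mesg) {
      int queue_size = q->enqueued - q->dequeued;
      if (queue_size == 0)
          return 0;                       /* nothing to receive */
      else if (queue_size == 1) {
          #pragma omp critical            /* a sender may be touching the last node */
          Dequeue(q, src, mesg);
      } else {
          Dequeue(q, src, mesg);          /* safe: senders only work at the other end */
      }
      return 1;
  }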

Termination Detection
(Figure from the textbook; Copyright © 2010, Elsevier Inc. All rights reserved.)
-Each thread increments the shared "done_sending" counter after completing its send loop.
-More synchronization is needed on "done_sending".
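One way this might be realized, as a sketch under the assumption of a shared done_sending counter and thread_count threads (not the textbook's exact code):

  /* Sketch only: announce that this thread will send no more messages. */
  #pragma omp atomic
  done_sending++;

  /* A thread may stop receiving once its own queue is empty and every
     thread has finished sending; as the slide notes, the read of
     done_sending itself needs additional synchronization (e.g., a flush). */
  int done = (q->enqueued - q->dequeued == 0) && (done_sending == thread_count);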

What's coming?
-Next time we will finish task parallelism in OpenMP and look at actual executing code
-Homework (to be used as midterm review) is due before class on Thursday, October 18
-In-class midterm on October 18