CS4961 Parallel Programming
Lecture 13: Task Parallelism in OpenMP
Mary Hall
October 5, 2010



Programming Assignment 2: Due 11:59 PM, Friday October 8
Combining Locality, Thread and SIMD Parallelism:
The following code excerpt is representative of a common signal processing technique called convolution. Convolution combines two signals to form a third signal. In this example, we slide a small (32x32) signal around in a larger (4128x4128) signal to look for regions where the two signals have the most overlap.

    for (l=0; l<N; l++) {
      for (k=0; k<N; k++) {
        C[k][l] = 0.0;
        for (j=0; j<W; j++) {
          for (i=0; i<W; i++) {
            C[k][l] += A[k+i][l+j]*B[i][j];
          }
        }
      }
    }
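As a point of reference for the thread-parallel variant, the following is a minimal sketch and not the assignment's expected solution. The function name convolve, the flat row-major array layout, and the choice to make k the outer loop are all assumptions made for illustration. It accumulates each output element in a local scalar and splits the outer loop across threads with a parallel for.

```c
/* Hypothetical helper (not from the lecture).
 * Layout assumptions: C is n x n, A is (n+w) x (n+w), B is w x w,
 * all stored row-major in flat arrays. */
void convolve(int n, int w, const double *A, const double *B, double *C)
{
    int stride = n + w;  /* row length of A */

    /* Each value of k writes a distinct row of C, so the outer
     * iterations are independent and can be shared among threads. */
    #pragma omp parallel for
    for (int k = 0; k < n; k++) {
        for (int l = 0; l < n; l++) {
            double acc = 0.0;  /* local accumulator, private per iteration */
            for (int i = 0; i < w; i++)
                for (int j = 0; j < w; j++)
                    acc += A[(k + i) * stride + (l + j)] * B[i * w + j];
            C[k * n + l] = acc;
        }
    }
}
```

If the compiler is invoked without OpenMP support, the pragma is ignored and the function runs serially with the same result.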

Programming Assignment 2, cont.
Your goal is to take the code as written, and by either changing the code or changing the way you invoke the compiler, improve its performance.
You will need to use an Intel platform that has the "icc" compiler installed.
  - You can use the CADE Windows lab, or another Intel platform with the appropriate software.
  - The Intel compiler is available for free on Linux platforms for non-commercial use.
  - The version of the compiler, the specific architecture and the flags you give to the compiler will drastically impact performance.
You can discuss strategies with other classmates, but do not copy code. Also, do not copy solutions from the web. This must be your own work.

Programming Assignment 2, cont.
How to compile
  - OpenMP (all versions): icc -openmp conv-assign.c
Measure and report performance for five versions of the code, and turn in all variants:
  - Baseline: compile and run code as provided (5 points)
  - Thread parallelism only (using OpenMP): icc -openmp conv-omp.c (5 points)
  - SSE-3 only: icc -openmp -msse3 -vec-report=3 conv-sse.c (10 points)
  - Locality only: icc -openmp conv-loc.c (10 points)
  - Combined: icc -openmp -msse3 -vec-report=3 conv-all.c (15 points)
Explain results and observations (15 points)
Extra credit (10 points): improve results by optimizing for register reuse

If you are using Linux (from mailing list)
Download from Intel for noncommercial use
  - software-download/
Large data structures on the stack may cause a segmentation fault
  - Try "ulimit -s "

Hints and Suggestions on the Assignment
There are no absolute answers
  - And the only incorrect answers are when you change the program and get the wrong answer
The goal is to improve performance through optimization
You can only speculate on what is happening based on measured performance
  - Observe changes in performance
  - You may even be wrong, but the important thing is to have a reasoned argument about why the performance changed

Today’s Lecture
Go over questions on programming assignment
Discussion of Task Parallelism in OpenMP 2.x and 3.0
Sources for Lecture:
  - OpenMP Tutorial by Ruud van der Pas
  - Recent OpenMP Tutorial SC08.pdf
  - OpenMP 3.0 specification (May 2008): documents/spec30.pdf

OpenMP Data Parallel Summary
Work sharing
  - parallel, parallel for, TBD
  - scheduling directives: static(CHUNK), dynamic(), guided()
Data sharing
  - shared, private, reduction
Environment variables
  - OMP_NUM_THREADS, OMP_DYNAMIC, OMP_NESTED, OMP_SCHEDULE
Library
  - E.g., omp_get_num_threads(), omp_get_thread_num()
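The work-sharing and data-sharing pieces in this summary can be combined in a single loop. The sketch below is an illustration only: the function name dot and the chunk size of 4 are arbitrary choices, not taken from the lecture. It uses parallel for with a static schedule and a reduction clause so each thread accumulates a private partial sum that is combined at the end.

```c
/* Hypothetical example: dot product of x and y, both of length n. */
double dot(int n, const double *x, const double *y)
{
    double sum = 0.0;

    /* schedule(static, 4): iterations are dealt out in chunks of 4.
     * reduction(+:sum):    each thread sums into a private copy of sum;
     *                      the copies are added together after the loop. */
    #pragma omp parallel for schedule(static, 4) reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += x[i] * y[i];

    return sum;
}
```

Setting OMP_NUM_THREADS in the environment before running changes how many threads share the loop without recompiling.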

Conditional Parallelization
if (scalar expression)
  - Only execute in parallel if expression evaluates to true
  - Otherwise, execute serially

    #pragma omp parallel if (n > threshold) \
            shared(n,x,y) private(i)
    {
      #pragma omp for
      for (i=0; i<n; i++)
        x[i] += y[i];
    } /*-- End of parallel region --*/

Review: parallel for, private, shared
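Wrapped in a function, the conditional region above might look like the following sketch. The function name vec_add and passing threshold as a parameter are assumptions for illustration; the directive and clauses are the ones from the slide.

```c
/* Hypothetical wrapper around the slide's loop.
 * The region runs with a team of threads only when n > threshold;
 * otherwise a single thread executes it, avoiding thread start-up
 * cost for small problem sizes. */
void vec_add(int n, int threshold, double *x, const double *y)
{
    int i;

    #pragma omp parallel if (n > threshold) shared(n, x, y) private(i)
    {
        #pragma omp for
        for (i = 0; i < n; i++)
            x[i] += y[i];
    } /* end of parallel region */
}
```

The result is the same on both paths; only the execution strategy differs, so a reasonable threshold has to be found by measurement.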

SINGLE and MASTER constructs
single: only one thread in team executes code enclosed
  - Useful for things like I/O or initialization
  - No barrier on entry; implicit barrier on exit unless nowait is specified
master: similarly, only the master thread executes code enclosed
  - No implicit barrier on entry or exit

    #pragma omp single
    { }

    #pragma omp master
    { }
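A small sketch of how the two constructs interact, with hypothetical names (single_demo, init_count, seen) chosen for illustration. One thread performs the initialization inside single; the barrier at the end of single makes the update visible before the master thread reads it.

```c
/* Hypothetical demo: returns the value of init_count observed by the
 * master thread after the single block has completed. */
int single_demo(void)
{
    int init_count = 0;  /* shared by all threads in the team */
    int seen = 0;

    #pragma omp parallel shared(init_count, seen)
    {
        #pragma omp single
        {
            init_count++;  /* executed by exactly one thread */
        }
        /* Implicit barrier at the end of single (no nowait given):
         * every thread now sees init_count == 1. */

        #pragma omp master
        {
            seen = init_count;  /* only the master thread runs this */
        }
        /* No barrier after master; the end of the parallel region
         * supplies one before seen is returned. */
    }
    return seen;
}
```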

Also, more control on synchronization

    #pragma omp parallel shared(A,B,C) private(tid)
    {
      tid = omp_get_thread_num();
      A[tid] = big_calc1(tid);
      #pragma omp barrier
      #pragma omp for
      for (i=0; i<N; i++)
        C[i] = big_calc2(tid);
      #pragma omp for nowait
      for (i=0; i<N; i++)
        B[i] = big_calc3(tid);
      A[tid] = big_calc4(tid);
    }
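The same barrier and nowait pattern, reduced to something that can be compiled and checked. This is a sketch under stated assumptions: stage_demo, MAX_THREADS, and the request for four threads are illustrative choices, and the stub functions exist only so the file also builds when OpenMP is disabled.

```c
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial fallbacks so the sketch builds without OpenMP. */
static int omp_get_thread_num(void)  { return 0; }
static int omp_get_num_threads(void) { return 1; }
#endif

#define MAX_THREADS 64  /* assumed upper bound on team size */

/* Hypothetical demo: fills C with 1.0 and B with 2.0 (both length n). */
void stage_demo(int n, double *B, double *C)
{
    double A[MAX_THREADS] = {0.0};

    #pragma omp parallel shared(A, B, C) num_threads(4)
    {
        int tid = omp_get_thread_num();  /* declared inside: private */
        int nt  = omp_get_num_threads();

        A[tid] = 1.0;          /* phase 1: each thread writes its own slot */
        #pragma omp barrier    /* wait until every slot has been written */

        #pragma omp for
        for (int i = 0; i < n; i++) {
            double s = 0.0;    /* phase 2 reads slots written by other threads */
            for (int t = 0; t < nt; t++)
                s += A[t];
            C[i] = s / nt;
        }
        /* implicit barrier here: C is complete */

        #pragma omp for nowait
        for (int i = 0; i < n; i++)
            B[i] = 2.0 * C[i];
        /* nowait: threads leave this loop without waiting for each other */
    }
}
```

Removing the explicit barrier would let a fast thread read a slot of A that its owner has not yet written, which is the situation the barrier on the slide guards against.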

General: Task Parallelism
Recall definition (p. 88 in textbook):
  - A task parallel computation is one in which parallelism is applied by performing distinct computations (or tasks) at the same time. Since the number of tasks is fixed, the parallelism is not scalable.
Common use of task parallelism = Producer/consumer (see textbook pages )
  - A producer creates work that is then processed by the consumer
  - You can think of this as a form of pipelining, like in an assembly line
  - The “producer” writes data to a FIFO queue (or similar) to be accessed by the “consumer”

OpenMP sections directive

    #pragma omp parallel
    {
      #pragma omp sections
      {
        #pragma omp section
        { a=...; b=...; }
        #pragma omp section
        { c=...; d=...; }
        #pragma omp section
        { e=...; f=...; }
        #pragma omp section
        { g=...; h=...; }
      } /*omp end sections*/
    } /*omp end parallel*/

Parallel Sections, Example

    #pragma omp parallel default(none)\
            shared(n,a,b,c,d) private(i)
    {
      #pragma omp sections nowait
      {
        #pragma omp section
        for (i=0; i<n; i++)
          d[i] = 1.0/c[i];
        #pragma omp section
        for (i=0; i<n-1; i++)
          b[i] = (a[i] + a[i+1])/2;
      } /*-- End of sections --*/
    } /*-- End of parallel region --*/
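To try the sections example, it can be placed in a function as in the sketch below. The name two_sections and the parameter list are assumptions for illustration. The two sections are legal to run concurrently because one reads c and writes d while the other reads a and writes b, so they touch disjoint data.

```c
/* Hypothetical wrapper: d[i] = 1/c[i] for i in [0, n),
 * b[i] = average of a[i] and a[i+1] for i in [0, n-1). */
void two_sections(int n, const double *a, double *b,
                  const double *c, double *d)
{
    int i;

    #pragma omp parallel default(none) shared(n, a, b, c, d) private(i)
    {
        #pragma omp sections nowait
        {
            #pragma omp section
            for (i = 0; i < n; i++)
                d[i] = 1.0 / c[i];

            #pragma omp section
            for (i = 0; i < n - 1; i++)
                b[i] = (a[i] + a[i + 1]) / 2;
        } /* end of sections: no barrier because of nowait */
    } /* end of parallel region: implied barrier */
}
```

With only two sections, at most two threads do useful work no matter how large the team is, which is the scalability limit noted in the task parallelism definition.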

Tasks in OpenMP 3.0
A task has
  - Code to execute
  - A data environment (shared, private, reduction)
  - An assigned thread that executes the code and uses the data
Two activities: packaging and execution
  - Each encountering thread packages a new instance of a task
  - Some thread in the team executes the task at a later time

Last Year’s Programming Assignment

    // PRODUCER: initialize A with random data
    void fill_rand(int nval, double *A) {
      for (i=0; i<nval; i++)
        A[i] = (double) rand()/ ;
    }

    // CONSUMER: Sum the data in A
    double Sum_array(int nval, double *A) {
      double sum = 0.0;
      for (i=0; i<nval; i++)
        sum = sum + A[i];
      return sum;
    }
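A compilable version of the producer and consumer routines could look like the sketch below. The divisor is missing from the slide text; dividing by RAND_MAX to scale values into [0, 1] is an assumption made here, not the original constant. The loop index declarations and the const qualifier are also additions.

```c
#include <stdlib.h>

/* PRODUCER: initialize A with random data.
 * Assumption: values are scaled by RAND_MAX into [0, 1];
 * the slide's actual divisor is not recoverable. */
void fill_rand(int nval, double *A)
{
    for (int i = 0; i < nval; i++)
        A[i] = (double) rand() / RAND_MAX;
}

/* CONSUMER: sum the data in A. */
double Sum_array(int nval, const double *A)
{
    double sum = 0.0;
    for (int i = 0; i < nval; i++)
        sum = sum + A[i];
    return sum;
}
```

Run serially, the consumer is simply called after the producer returns; the parallel versions need some way to signal that A is ready.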

Aside: How do you Read/Write a FIFO?
The FIFO is in global memory and is shared between the parallel threads
How do you make sure the data is updated?
Construct to guarantee consistent view of memory
  - Flush: make sure data is written all the way back to global memory

Example:

    double A;
    A = compute();
    #pragma omp flush(A)

Solution to Producer/Consumer

    #pragma omp parallel sections
    {
      #pragma omp section
      {
        fill_rand(N,A);
        #pragma omp flush
        flag = 1;
        #pragma omp flush(flag)
      }
      #pragma omp section
      {
        while (!flag) {
          #pragma omp flush(flag)
        }
        #pragma omp flush
        sum = Sum_array(N,A);
      }
    }

Motivating Example: Linked List Traversal

    while(my_pointer) {
      (void) do_independent_work (my_pointer);
      my_pointer = my_pointer->next ;
    } // End of while loop

How to express with parallel for?
  - Must have fixed number of iterations
  - Loop-invariant loop condition and no early exits
Convert to parallel for
  - A priori count number of iterations (if possible)
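One way to carry out the "count iterations a priori" conversion is sketched below. The node layout, the body of do_independent_work, and the function name process_list are hypothetical; the lecture does not define them. The list is walked serially twice, once to count nodes and once to collect pointers into an array, and the array is then processed with an ordinary parallel for.

```c
#include <stdlib.h>

/* Hypothetical node type and work function for illustration. */
struct node { int value; struct node *next; };

static void do_independent_work(struct node *p) { p->value *= 2; }

/* Returns the number of nodes processed, or -1 if allocation fails. */
int process_list(struct node *head)
{
    /* Pass 1 (serial): count the nodes so the trip count is known. */
    int count = 0;
    for (struct node *p = head; p; p = p->next)
        count++;
    if (count == 0)
        return 0;

    /* Pass 2 (serial): collect node pointers into an indexable array. */
    struct node **items = malloc(count * sizeof *items);
    if (!items)
        return -1;
    int k = 0;
    for (struct node *p = head; p; p = p->next)
        items[k++] = p;

    /* Fixed trip count, no early exit: eligible for parallel for. */
    #pragma omp parallel for
    for (int i = 0; i < count; i++)
        do_independent_work(items[i]);

    free(items);
    return count;
}
```

The two serial passes and the temporary array are pure overhead, so this only pays off when the work per node is large compared with a pointer chase.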

OpenMP 3.0: Tasks!

    my_pointer = listhead;
    #pragma omp parallel
    {
      #pragma omp single nowait
      {
        while(my_pointer) {
          #pragma omp task firstprivate(my_pointer)
          {
            (void) do_independent_work (my_pointer);
          }
          my_pointer = my_pointer->next ;
        }
      } // End of single - no implied barrier (nowait)
    } // End of parallel region - implied barrier here

firstprivate = private and copy initial value from global variable
lastprivate = private and copy back final value to global variable
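The task version can be exercised with a small list, as in this sketch. The node type, the body of do_independent_work, and the function name process_list_tasks are hypothetical stand-ins. One thread walks the list and packages a task per node; any thread in the team may execute each task, and firstprivate gives every task its own copy of the pointer value at creation time.

```c
#include <stddef.h>

/* Hypothetical node type and work function for illustration. */
struct node { int value; struct node *next; };

static void do_independent_work(struct node *p) { p->value += 100; }

void process_list_tasks(struct node *listhead)
{
    #pragma omp parallel
    {
        #pragma omp single nowait
        {
            /* Only one thread traverses the list and creates tasks. */
            for (struct node *p = listhead; p != NULL; p = p->next) {
                /* firstprivate(p): the task captures the current node,
                 * so advancing p afterwards does not affect it. */
                #pragma omp task firstprivate(p)
                do_independent_work(p);
            }
        } /* no barrier here (nowait): other threads start on tasks */
    } /* implied barrier: all tasks have completed by this point */
}
```

Unlike the array-based conversion, no counting pass is needed, and the traversal can overlap with execution of tasks that have already been created.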

Summary
Completed coverage of OpenMP
  - Locks
  - Conditional execution
  - Single/Master
  - Task parallelism
    - Pre-3.0: parallel sections
    - OpenMP 3.0: tasks
Next time:
  - OpenMP programming assignment