Parallel Programming – Barriers, Locks, and Continued Discussion of Parallel Decomposition David Monismith Jan. 27, 2015 Based upon notes from the LLNL OpenMP Tutorial and from Introduction to Parallel Computing by Grama, Karypis, Kumar, and Gupta

Last Time
Parallel Decomposition
Review of OpenMP Directives
– parallel for
– shared
– private
– reduction

This Time
OpenMP Barriers
OpenMP Locks
An example with the Dining Philosophers problem
More decompositions
– Exploratory Decomposition
– Speculative Decomposition
Tasks and Interactions
Load balancing
Handling overhead
Parallel Algorithm Models

OpenMP Barriers
A barrier is a parallel programming primitive that forces every thread to wait until all threads in the team have reached the barrier. Barriers are used to enforce synchronization within programs.
OpenMP barriers are written with the following syntax:
– #pragma omp barrier
Important – a barrier cannot be placed inside a #pragma omp parallel for loop; it may only appear directly within a #pragma omp parallel block.
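
As an illustration (not part of the original course materials), here is a minimal C/OpenMP sketch in which a barrier separates two phases of work; the thread count, the partial[] array, and the phase-1 computation are arbitrary choices for the example.

    #include <omp.h>
    #include <stdio.h>

    int main(void) {
        double partial[8] = {0.0};   /* one slot per thread (assumes at most 8 threads) */

        #pragma omp parallel num_threads(4)
        {
            int tid = omp_get_thread_num();

            /* Phase 1: each thread computes its own partial result. */
            partial[tid] = tid * 10.0;

            /* No thread continues until every thread in the team has
               finished phase 1 and reached this point. */
            #pragma omp barrier

            /* Phase 2: thread 0 may now safely read every other thread's result. */
            if (tid == 0) {
                double sum = 0.0;
                for (int i = 0; i < omp_get_num_threads(); i++)
                    sum += partial[i];
                printf("sum = %f\n", sum);
            }
        }
        return 0;
    }

Without the barrier, thread 0 could read slots that other threads have not yet written; the barrier guarantees phase 1 is complete everywhere before phase 2 begins.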

OpenMP Lock Variables
Lock variables allow finer-grained control over synchronization.
In OpenMP the lock variable type is omp_lock_t:
– omp_lock_t myLock;
Locks must be initialized using the omp_init_lock function:
– omp_init_lock(&myLock);
After initializing a lock, the lock may be acquired using the omp_set_lock function and released using the omp_unset_lock function:
– omp_set_lock(&myLock);
– omp_unset_lock(&myLock);
Locks must be destroyed using the omp_destroy_lock function:
– omp_destroy_lock(&myLock);
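
A minimal sketch of the lock lifecycle described above (initialize, set/unset, destroy); the shared counter and the thread count are illustrative assumptions, not the course example.

    #include <omp.h>
    #include <stdio.h>

    int main(void) {
        omp_lock_t myLock;
        int counter = 0;

        omp_init_lock(&myLock);          /* locks must be initialized before use */

        #pragma omp parallel num_threads(4)
        {
            /* Only one thread at a time may hold the lock, so the
               increments of the shared counter cannot be interleaved. */
            omp_set_lock(&myLock);
            counter++;
            omp_unset_lock(&myLock);
        }

        omp_destroy_lock(&myLock);       /* release resources held by the lock */
        printf("counter = %d\n", counter);
        return 0;
    }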

Example: Dining Philosophers
Five students are eating at Joy Wok with chopsticks, but only five chopsticks are available. A student needs two chopsticks to eat. Chopsticks are situated between students. Students can't move from their positions at the table. Students alternate between thinking and eating.

Example: Dining Philosophers (continued)
[Diagram: the five students (S1–S5) seated around the bowl of rice, with one chopstick (R1–R5) between each adjacent pair of students.]
How deadlock can arise: if every student reaches for the right chopstick first, the resource-allocation graph forms a cycle: S1 requests R2, which is held by S2; S2 requests R3, held by S3; S3 requests R4, held by S4; S4 requests R5, held by S5; and S5 requests R1, held by S1. No student can ever acquire a second chopstick, so no one eats.

Philosopher Pseudocode
1. Check if the index value of the left chopstick is less than the index value of the right chopstick. If true, do 2 then 3. Otherwise, do 3 then 2.
2. Attempt to acquire the left chopstick.
3. Attempt to acquire the right chopstick.
4. Eat for a specified time period.
5. Release chopsticks.
6. Think for a specified time period.
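
One possible C/OpenMP rendering of this pseudocode, using one omp_lock_t per chopstick. It is a sketch rather than the course's own example: the eat/think placeholders, the chopstick numbering, and the iteration count are assumptions. Acquiring the lower-indexed chopstick first (step 1) prevents a cycle in the resource-allocation graph, so deadlock cannot occur.

    #include <omp.h>

    #define NUM_STUDENTS 5

    omp_lock_t chopstick[NUM_STUDENTS];    /* one lock per chopstick */

    void eat(int id)   { /* eat for a specified time period   */ }
    void think(int id) { /* think for a specified time period */ }

    void philosopher(int id, int iterations) {
        int left  = id;                        /* chopstick to the left  */
        int right = (id + 1) % NUM_STUDENTS;   /* chopstick to the right */

        /* Step 1: always acquire the lower-indexed chopstick first so the
           resource-allocation graph can never contain a cycle. */
        int first  = (left < right) ? left : right;
        int second = (left < right) ? right : left;

        for (int i = 0; i < iterations; i++) {
            omp_set_lock(&chopstick[first]);     /* steps 2 and 3 */
            omp_set_lock(&chopstick[second]);
            eat(id);                             /* step 4 */
            omp_unset_lock(&chopstick[second]);  /* step 5 */
            omp_unset_lock(&chopstick[first]);
            think(id);                           /* step 6 */
        }
    }

    int main(void) {
        for (int i = 0; i < NUM_STUDENTS; i++)
            omp_init_lock(&chopstick[i]);

        #pragma omp parallel num_threads(NUM_STUDENTS)
        philosopher(omp_get_thread_num(), 10);

        for (int i = 0; i < NUM_STUDENTS; i++)
            omp_destroy_lock(&chopstick[i]);
        return 0;
    }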

Review of OpenMP Dining Philosophers
See the OpenMP Dining Philosophers Example on the course website. Be sure to make note of the following items:
– OpenMP Parallel Section
– OpenMP Barriers
– OpenMP Locks
– Deadlock Prevention by disallowing a cycle in the resource-allocation graph

Exploratory Decomposition
Exploratory decomposition – used to decompose problems corresponding to a solution space that must be searched. Partition the search space into smaller parts. Search each part concurrently until solutions are found.
Examples
– 15 puzzle
– Chess AI (tree search)
– Checkers
We will return to this later when we investigate parallel depth-first search.

Speculative Decomposition
Speculative decomposition – used when a program may take one of many computationally significant branches based upon the computation that preceded it. Discrete event simulations are an example of such computations and will be covered in following lectures.

Task Properties
Four properties play a major role in mapping tasks:
– Task generation
– Task size
– Prior knowledge of task sizes
– Amount of data associated with each task

Task Creation
Tasks (i.e. threads) may be created dynamically or statically.
– Static creation – all tasks are known before beginning an algorithm
– Examples – matrix multiplication and LU factorization
– Dynamic generation – tasks and a dependency graph may not be available before starting an algorithm
– Examples include the 15-puzzle, chess, and other exploratory decomposition algorithms

Task Sizes
Uniform – time complexity (and often data complexity) is similar across tasks.
– Example – matrix multiplication
Non-uniform – significant variation in time and space complexity across tasks.
– Example – performing a parameter sweep
Prior knowledge of the sizes of tasks can drastically affect how tasks are handled and load balanced. Similarly, prior knowledge of the amount of data to be processed by each task can affect such handling.

Task Interaction
Static vs. Dynamic
– Static – the interaction pattern for each task occurs at predetermined times/intervals
– Dynamic – the timing of interactions cannot be determined prior to execution of the algorithm
Regular vs. Irregular
– Regular – structure can be exploited for efficient computation (e.g. matrix multiplication)
– Irregular – no such pattern exists (e.g. sparse matrix-vector multiplication)
Read-only vs. Read-write
One-way vs. Two-way
– One-way – one task provides work to other tasks without being interrupted (e.g. read-only data)
– Two-way – data/work needed by tasks is provided by another task, and access is coordinated by the pair/group of tasks (e.g. producer-consumer)

Load Balancing
Overhead – task idle time, context switching time, and time spent initiating interactions with other tasks
– Want to reduce the amount of time processes and threads spend interacting with each other
– Want to reduce idle time
Static Mapping – distribute tasks among processes or threads prior to running the algorithm
– E.g. array_length/num_threads elements per thread (see the sketch below)
Dynamic Mapping – distribute work during algorithm execution
– Task sizes may be unknown and may cause load imbalances
– Data may need to be moved between processes to balance the load
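
A minimal sketch of the array_length/num_threads static mapping mentioned above; block_range is a hypothetical helper (not from the course materials) that gives thread tid its contiguous index range, spreading any remainder one element at a time over the first few threads.

    /* Static block mapping of array_length elements onto num_threads threads:
       thread tid handles indices [*start, *end). */
    void block_range(int array_length, int num_threads, int tid,
                     int *start, int *end) {
        int chunk = array_length / num_threads;
        int rem   = array_length % num_threads;
        *start = tid * chunk + (tid < rem ? tid : rem);
        *end   = *start + chunk + (tid < rem ? 1 : 0);
    }

For example, 10 elements over 3 threads gives ranges [0,4), [4,7), and [7,10).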

Data Partitioning for Static Mapping
Array Distribution – assume tasks are associated with data such that distributing data is equivalent to distributing tasks
Block Distribution – assign uniform contiguous portions of an array to different processes
– Example: assign n/p rows of an n by n matrix to each of p processes (1-D block distribution)
– Example: assign one n/sqrt(p) by n/sqrt(p) block of an n by n matrix to each of p processes (2-D block distribution)
– Useful to ensure data reuse in cache
Cyclic and Block-Cyclic Distribution – partition an array into many more blocks than available processes
– Assign partitions and associated tasks to processes in a round-robin fashion so each process gets several non-adjacent blocks
– Used to reduce idling in operations such as LU factorization by ensuring each process gets an equal sampling of the data structure
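
Illustrative 1-D owner formulas for these distributions, assuming n elements over p processes; the helper functions are hypothetical, not taken from the course materials.

    /* Block: contiguous chunks of size ceil(n/p). */
    int block_owner(int i, int n, int p) {
        int block_size = (n + p - 1) / p;   /* ceil(n/p) */
        return i / block_size;
    }

    /* Cyclic: elements dealt out round-robin, one at a time. */
    int cyclic_owner(int i, int p) {
        return i % p;
    }

    /* Block-cyclic: blocks of b elements dealt out round-robin, so each
       process receives several non-adjacent blocks. */
    int block_cyclic_owner(int i, int b, int p) {
        return (i / b) % p;
    }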

Other Mapping and Partitioning Schemes
– Randomized Block Distributions
– Graph Partitioning
– Task Partitioning
– Hierarchical Mapping
All are somewhat more complex and will be discussed at a later date.

Dynamic Load Balancing
Centralized
– All executable tasks are maintained in a common data structure.
– The process designated to manage the pool of available tasks is called the master.
– Worker processes are called slaves.
– Processes that have no work to do take work from the master.
– Newly generated tasks are added to the data structure.
– Example: processes could process small chunks of data and request more to process once they become idle (see the OpenMP sketch below).
Distributed
– Executable tasks are divided between processes.
– Processes are allowed to send and receive work from any other process.
– Important to take care in how much work is transferred, how often, how processes are paired, and how work transfer is initiated (e.g. by sender or receiver).
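
OpenMP's dynamic loop schedule behaves like a small centralized work pool: the runtime hands out chunks of iterations to threads as they become idle. A minimal sketch follows; process_chunk is an illustrative stand-in for a task whose cost varies a lot from index to index.

    #include <omp.h>
    #include <stdio.h>

    /* Deliberately uneven amount of work per task. */
    static void process_chunk(int i) {
        volatile double x = 0.0;
        for (int k = 0; k < (i % 7) * 100000; k++)
            x += k * 0.5;
    }

    int main(void) {
        int num_tasks = 1000;

        /* schedule(dynamic, 4): the runtime keeps a shared pool of
           4-iteration chunks; whenever a thread becomes idle it takes
           the next chunk, mimicking the centralized work-pool scheme. */
        #pragma omp parallel for schedule(dynamic, 4)
        for (int i = 0; i < num_tasks; i++)
            process_chunk(i);

        printf("done\n");
        return 0;
    }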

Controlling Overhead
Maximize data locality
– Promote and maximize use of local data or recently fetched memory (i.e. cache lines)
– Use row-major operations in C and column-major operations in Fortran
Minimize data exchange volume
– Maximize temporal data locality – make as many consecutive references to the same data as possible
– Example: use block operations to perform matrix multiplication
– Store intermediate results in local data, and perform shared data access only to store the final result (see the sketch below)
Minimize frequency of interactions
– Restructure your algorithm so that shared data are accessed and used in large chunks
Minimize contention
– Avoid having multiple processes access the same resources at the same time
– Examples: sending multiple messages to the same process at the same time, outputting data from multiple processes to one file at the same time, etc.
Overlap computation and interaction
– Perform computations while waiting for shared data or messages to arrive
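
A small sketch of the "store intermediate results in local data" advice: each thread accumulates into a private variable and performs exactly one shared update, instead of synchronizing on every element. The function and its inputs are illustrative, not from the course materials.

    #include <stdio.h>

    double sum_array(const double *data, int n) {
        double total = 0.0;
        #pragma omp parallel
        {
            double local = 0.0;              /* thread-private accumulator */
            #pragma omp for
            for (int i = 0; i < n; i++)
                local += data[i];            /* no shared-data traffic here */

            #pragma omp atomic               /* one shared update per thread */
            total += local;
        }
        return total;
    }

    int main(void) {
        double data[100];
        for (int i = 0; i < 100; i++) data[i] = 1.0;
        printf("%f\n", sum_array(data, 100));   /* prints 100.000000 */
        return 0;
    }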

Controlling Overhead (continued)
Replicate data or computations
– Replicate copies of commonly used data structures on each process as memory permits.
– This avoids communication overhead between processes, especially if the data structures are read-only.
– Additionally, performing redundant computations may cost less time than performing message passing. Use carefully, as appropriate.
Use optimized collective interaction operations
– Broadcast, All-to-All, and other shared-data operations have been implemented in a highly optimized fashion in the MPI library.
– We will make use of such functions later in the course.
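
For example, a small read-only table can be replicated on every process with a single MPI broadcast rather than many point-to-point sends. The table size and contents below are illustrative assumptions; MPI_Bcast and MPI_Comm_rank are standard MPI calls.

    #include <mpi.h>

    #define TABLE_SIZE 1024   /* illustrative size for a small read-only table */

    /* Fill the table on rank 0 and replicate it on every process with one
       optimized collective call. Must be called by all processes in comm,
       between MPI_Init and MPI_Finalize. */
    void replicate_table(double *table, MPI_Comm comm) {
        int rank;
        MPI_Comm_rank(comm, &rank);
        if (rank == 0) {
            for (int i = 0; i < TABLE_SIZE; i++)
                table[i] = i * 0.5;              /* root builds the data */
        }
        /* After this call every process holds an identical copy. */
        MPI_Bcast(table, TABLE_SIZE, MPI_DOUBLE, 0, comm);
    }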

Parallel Algorithm Models
– Data Parallelism
– Task Graph Model
– Work Pool Model
– Master-Slave (i.e. Client-Server)
– Producer-Consumer