CS4230 Parallel Programming Lecture 7: Loop Scheduling cont., and Data Dependences. Mary Hall, September 7, 2012.

Administrative
I will be on travel on Tuesday, September 11, so there will be no lecture. Instead, you should watch the following two videos:
-“Intro to Parallel Programming. Lesson 3, pt. 1 - Implementing Domain Decompositions with OpenMP”, by Clay Breshears, Intel (9:53), which reinforces OpenMP data parallelism
-“Dense Linear Algebra” by Jim Demmel, CS237 at UC Berkeley (start at 18:15, watch until 30:40), which establishes bounds on dense linear algebra performance, taking parallelism and locality into account
-There will be a quiz on this material on Thursday, September 13 (participation grade)

Homework 2: Mapping to Architecture
Due before class, Thursday, September 6
Objective: Begin thinking about architecture mapping issues
Turn in electronically on the CADE machines using the handin program: “handin cs4230 hw2 ”
-Problem 1: (2.3 in text) [Locality]
-Problem 2: (2.8 in text) [Caches and multithreading]
-Problem 3: [Amdahl’s Law] A multiprocessor consists of 100 processors, each capable of a peak execution rate of 20 Gflops. What is the performance of the system, measured in Gflops, when 20% of the code is sequential and 80% is parallelizable?
-Problem 4: (2.16 in text) [Parallelization scaling]
-Problem 5: [Buses and crossbars] Suppose you have a computation that uses two vector inputs to compute a vector output, where each vector is stored in consecutive memory locations. Each input and output location is unique, but data is loaded/stored from cache in 4-word transfers. Suppose you have P processors and N data elements, and execution time is a function of time L for a load from memory and time C for the computation. Compare parallel execution time for a shared memory architecture with a bus (Nehalem) versus a full crossbar (Niagara) from Lecture 3, assuming a write-back cache that is larger than the data footprint.
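For reference (not on the slide), Amdahl's Law gives the bound behind Problem 3, where $f$ is the sequential fraction of the code and $p$ is the number of processors:

$$\mathrm{Speedup}(p) = \frac{1}{f + \frac{1-f}{p}}$$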

Programming Assignment 1: Due Friday, Sept. 14, 11PM MDT
To be done on water.eng.utah.edu (you all have accounts; passwords are available if your CS account doesn’t work)
1. Write a program to calculate π in OpenMP for a problem size and data set to be provided. Use a block data distribution.
2. Write the same computation in Pthreads.
Report your results in a separate README file:
-What is the parallel speedup of your code? To compute parallel speedup, you will need to time the execution of both the sequential and parallel code, and report speedup = Time(seq) / Time(parallel)
-If your code does not speed up, you will need to adjust the parallelism granularity, the amount of work each processor does between synchronization points. You can do this by either decreasing the number of threads or increasing the number of iterations.
-Report results for two different numbers of threads, holding iterations fixed, and for two different numbers of iterations, holding threads fixed. Also report lines of code for the two solutions.
Extra credit: Rewrite both codes using a cyclic distribution and measure performance for the same configurations.

Programming Assignment 1, cont.
A test harness is provided in pi-test-harness.c that provides a sequential π calculation, validation, speedup timing, and substantial instructions on what you need to do to complete the assignment. Here are the key points:
-You’ll need to write the parallel code, and the things needed to support that. Read the top of the file, and search for “TODO”.
-Compile with OpenMP: cc -o pi-openmp -O3 -xopenmp pi-openmp.c
-Compile with Pthreads: cc -o pi-pthreads -O3 pi-pthreads.c -lpthread
-Run OpenMP version: ./pi-openmp > openmp.out
-Run Pthreads version: ./pi-pthreads > pthreads.out
Note that editing on water is somewhat primitive; I’m using vim. You may want to edit on a different CADE machine.

Estimating π (slide figure not reproduced in the transcript)
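Because the slide's figure is unavailable, here is a minimal sketch, assuming the common approach of numerically integrating 4/(1+x^2) over [0,1] with a block distribution of iterations; the course's pi-test-harness.c may define the computation differently:

  /* Hypothetical sketch (not the course harness): estimate pi by integrating
   * 4/(1+x^2) over [0,1] with the midpoint rule, iterations distributed in
   * blocks across OpenMP threads. */
  #include <stdio.h>
  #include <omp.h>

  int main(void) {
      const long n = 100000000;            /* assumed problem size */
      const double h = 1.0 / n;
      double sum = 0.0;

      #pragma omp parallel
      {
          int tid = omp_get_thread_num();
          int nthreads = omp_get_num_threads();
          long blocksize = n / nthreads;   /* block distribution */
          long start = tid * blocksize;
          long end = (tid == nthreads - 1) ? n : start + blocksize;
          double mysum = 0.0;

          for (long i = start; i < end; i++) {
              double x = (i + 0.5) * h;    /* midpoint of interval i */
              mysum += 4.0 / (1.0 + x * x);
          }
          #pragma omp critical
          sum += mysum;                    /* combine partial sums */
      }
      printf("pi ~= %.15f\n", sum * h);
      return 0;
  }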

Today’s Lecture
-OpenMP parallel, parallel for, and for constructs from last time
-OpenMP loop scheduling demonstration
-Data Dependences
  -How compilers reason about them
  -Formal definition of reordering transformations that preserve program meaning
  -Informal determination of parallelization safety
Sources for this lecture:
-Notes on website

OpenMP Sum Examples: Improvement

Last time (sum v3):
  int sum, mysum[64];
  sum = 0;
  #pragma omp parallel
  {
    int my_id = omp_get_thread_num();
    mysum[my_id] = 0;
  }
  #pragma omp parallel for
  for (i=0; i<size; i++) {
    int my_id = omp_get_thread_num();
    mysum[my_id] += _iplist[i];
  }
  #pragma omp parallel
  {
    int my_id = omp_get_thread_num();
    #pragma omp critical
    sum += mysum[my_id];
  }
  return sum;

Improved, posted (sum v3):
  int sum, mysum[64];
  sum = 0;
  #pragma omp parallel
  {
    int my_id = omp_get_thread_num();
    mysum[my_id] = 0;
    #pragma omp for
    for (i=0; i<size; i++) {
      mysum[my_id] += _iplist[i];
    }
    #pragma omp critical
    sum += mysum[my_id];
  }
  return sum;
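For comparison (not on the slide), the same sum can be written with OpenMP’s reduction clause, which creates the per-thread partial sums and the final combining step automatically (assuming the same size and _iplist as above):

  /* Sketch: reduction clause handles the partial sums and combining. */
  int sum = 0;
  #pragma omp parallel for reduction(+:sum)
  for (int i = 0; i < size; i++)
    sum += _iplist[i];
  return sum;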

Common Data Distributions
Consider a 1-dimensional array to solve the global sum problem: 16 elements, 4 threads.

CYCLIC (chunk = 1) (version 2):
  for (i = 0; i < blocksize; i++)
    … in[i*blocksize + tid];

BLOCK (chunk = 4) (default, version 1):
  for (i = tid*blocksize; i < (tid+1)*blocksize; i++)
    … in[i];

BLOCK-CYCLIC (chunk = 2) (version 3): see the sketch below.
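The slide gives no code for the block-cyclic case; the following is a hedged sketch (the names tid, nthreads, N, and in are assumptions carried over from the other versions) of how a block-cyclic distribution with chunk = 2 could be indexed:

  /* Sketch: thread tid handles chunks tid, tid+nthreads, tid+2*nthreads, ... */
  int chunk = 2;
  for (int c = tid; c * chunk < N; c += nthreads) {
      for (int i = c * chunk; i < (c + 1) * chunk && i < N; i++) {
          /* ... use in[i] ... */
      }
  }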

The Schedule Clause
-Default schedule (see the sketch below)
-Cyclic schedule (see the sketch below)
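The slide’s code is not in the transcript; the following is a minimal sketch of the two schedules using OpenMP’s schedule clause, assuming a placeholder loop body out[i] = f(in[i]) (out, in, and f are illustrative names):

  /* Default (block) schedule: contiguous chunks of roughly size/nthreads. */
  #pragma omp parallel for schedule(static)
  for (int i = 0; i < size; i++)
      out[i] = f(in[i]);

  /* Cyclic schedule: chunk size 1, iterations dealt out round-robin. */
  #pragma omp parallel for schedule(static, 1)
  for (int i = 0; i < size; i++)
      out[i] = f(in[i]);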

OpenMP environment variables

OMP_NUM_THREADS
-sets the number of threads to use during execution
-when dynamic adjustment of the number of threads is enabled, the value of this environment variable is the maximum number of threads to use
-For example:
  setenv OMP_NUM_THREADS 16   [csh, tcsh]
  export OMP_NUM_THREADS=16   [sh, ksh, bash]

OMP_SCHEDULE (version 4)
-applies only to do/for and parallel do/for directives that have the schedule type RUNTIME
-sets schedule type and chunk size for all such loops
-For example:
  setenv OMP_SCHEDULE GUIDED,4   [csh, tcsh]
  export OMP_SCHEDULE=GUIDED,4   [sh, ksh, bash]
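As a small illustration (not from the slide), OMP_SCHEDULE only takes effect for loops that request their schedule at run time:

  /* Sketch: schedule chosen at run time from the OMP_SCHEDULE variable. */
  #pragma omp parallel for schedule(runtime)
  for (int i = 0; i < size; i++)
      out[i] = f(in[i]);   /* out, in, f as in the earlier sketch */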

Race Condition or Data Dependence
-A race condition exists when the result of an execution depends on the timing of two or more events.
-A data dependence is an ordering on a pair of memory operations that must be preserved to maintain correctness.
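A minimal illustration (not from the slide), reusing the sum and _iplist names from the earlier example: removing the critical section leaves the dependence on sum unprotected and creates a race condition:

  /* Sketch: unsynchronized read-modify-write on the shared variable sum. */
  #pragma omp parallel for
  for (int i = 0; i < size; i++)
      sum += _iplist[i];   /* result depends on thread interleaving */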

Key Control Concept: Data Dependence
Question: When is parallelization guaranteed to be safe?
Answer: If there are no data dependences across reordered computations.
Definition: Two memory accesses are involved in a data dependence if they may refer to the same memory location and one of the accesses is a write.
Bernstein’s conditions (1966): Let I_j be the set of memory locations read by process P_j, and O_j the set updated by P_j. To execute P_j and another process P_k in parallel:
  I_j ∩ O_k = ∅   (write after read)
  I_k ∩ O_j = ∅   (read after write)
  O_j ∩ O_k = ∅   (write after write)
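A small illustrative example (not from the slide; arrays a, b, c, d and bound n are assumed): the two loops below satisfy Bernstein’s conditions and may execute in parallel, since each loop’s read set is disjoint from the other’s write set and the write sets are disjoint:

  for (int i = 0; i < n; i++) c[i] = a[i] + 1;   /* P_j: reads a, writes c */
  for (int i = 0; i < n; i++) d[i] = b[i] * 2;   /* P_k: reads b, writes d */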

Data Dependence and Related Definitions
-Actually, parallelizing compilers must formalize this to guarantee correct code. Let’s look at how they do it. It will help us understand how to reason about correctness as programmers.
-Definition: Two memory accesses are involved in a data dependence if they may refer to the same memory location and one of the references is a write.
-A data dependence can either be between two distinct program statements or two different dynamic executions of the same program statement.
Source: “Optimizing Compilers for Modern Architectures: A Dependence-Based Approach”, Allen and Kennedy, 2002, Ch. 2.

Data Dependence of Scalar Variables

True (flow) dependence:
  a =
    = a
Anti-dependence:
    = a
  a =
Output dependence:
  a =
  a =
Input dependence (for locality):
    = a
    = a

Definition: A data dependence exists from a reference instance i to i’ iff
-either i or i’ is a write operation
-i and i’ refer to the same variable
-i executes before i’
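The same four cases with concrete statements (an illustrative sketch; variables b through f are assumed):

  a = b + 1;   /* S1 writes a */
  c = a * 2;   /* S2 reads a:  true (flow) dependence from S1 to S2 */

  d = a + c;   /* S3 reads a */
  a = 0;       /* S4 writes a: anti-dependence from S3 to S4 */

  a = c + 1;   /* S5 writes a */
  a = d + 2;   /* S6 writes a: output dependence from S5 to S6 */

  e = a + 1;   /* S7 reads a */
  f = a + 2;   /* S8 reads a:  input dependence from S7 to S8 (locality only) */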

Some Definitions (from Allen & Kennedy)
Definition 2.5: Two computations are equivalent if, on the same inputs,
-they produce identical outputs
-the outputs are executed in the same order
Definition 2.6: A reordering transformation
-changes the order of statement execution
-without adding or deleting any statement executions.
Definition 2.7: A reordering transformation preserves a dependence if
-it preserves the relative execution order of the dependence’s source and sink.

Fundamental Theorem of Dependence
Theorem 2.2: Any reordering transformation that preserves every dependence in a program preserves the meaning of that program.

In this course, we consider two kinds of reordering transformations:
-Parallelization: Computations that execute in parallel between synchronization points are potentially reordered. Is that reordering safe? According to our definition, it is safe if it preserves the dependences in the code.
-Locality optimizations: Suppose we want to modify the order in which a computation accesses memory so that it is more likely to be in cache. This is also a reordering transformation, and it is safe if it preserves the dependences in the code (see the sketch after this list).
-Reduction computations: We have to relax this rule for reductions. It is safe to reorder reductions for commutative and associative operations.
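As a sketch of the locality case (not from the slide; arrays a, b, c and bound n are assumed), loop interchange is a reordering transformation; it is safe here because each iteration writes a distinct a[i][j] and reads only the matching b and c elements, so no dependence is reversed:

  /* Before: column-order traversal (poor spatial locality in row-major C). */
  for (int j = 0; j < n; j++)
      for (int i = 0; i < n; i++)
          a[i][j] = b[i][j] + c[i][j];

  /* After interchange: row-order traversal (better spatial locality). */
  for (int i = 0; i < n; i++)
      for (int j = 0; j < n; j++)
          a[i][j] = b[i][j] + c[i][j];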

How to determine safety of reordering transformations
-Informally
  -Must preserve relative order of dependence source and sink
  -So, cannot reverse order
-Formally
  -Tracking dependences
  -A simple abstraction: distance vectors

Brief Detour on Parallelizable Loops as a Reordering Transformation
Forall or Doall loops: loops whose iterations can execute in parallel (a particular reordering transformation).
Example:
  forall (i=1; i<=n; i++)
    A[i] = B[i] + C[i];
Meaning?
-Each iteration can execute independently of the others
-Free to schedule iterations in any order (e.g., #pragma omp parallel for)
-Source of scalable, balanced work
-Common to scientific, multimedia, graphics & other domains
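Expressed in OpenMP (an equivalent form, assuming A, B, C, and n are declared elsewhere): each iteration writes a distinct A[i] and reads only B[i] and C[i], so there is no cross-iteration dependence and any iteration order is valid.

  #pragma omp parallel for
  for (int i = 1; i <= n; i++)
      A[i] = B[i] + C[i];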

Data Dependence for Arrays
Recognizing parallel loops (intuitively):
-Find data dependences in the loop
-If no dependences cross an iteration boundary, parallelization of the loop’s iterations is safe

Loop-carried dependence:
  for (i=2; i<5; i++)
    A[i] = A[i-2]+1;

Loop-independent dependence:
  for (i=1; i<=3; i++)
    A[i] = A[i]+1;
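A sketch of the consequence (not from the slide): the loop-independent case may be parallelized directly, while the loop-carried case may not be parallelized as written.

  /* Safe: each iteration reads and writes only its own A[i]. */
  #pragma omp parallel for
  for (int i = 1; i <= 3; i++)
      A[i] = A[i] + 1;

  /* Not safe as written: A[i] depends on A[i-2] from an earlier iteration
   * (loop-carried dependence of distance 2), so the iterations cannot all
   * be reordered freely. */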