ME964 High Performance Computing for Engineering Applications Parallel Programming Patterns Oct. 16, 2008.

Before we get started… Last time: CUDA arithmetic support (performance implications, accuracy implications); a software design exercise: prefix scan; preamble to the topic of software design patterns. Today: parallel programming patterns. Other issues: HW due on Saturday, 11:59 PM. I will be in Milwaukee next Thursday; Mik will be the lecturer, tracing the evolution of the hardware support for the graphics pipeline.

“If you build it, they will come.” “And so we built them. Multiprocessor workstations, massively parallel supercomputers, a cluster in every department … and they haven’t come. Programmers haven’t come to program these wonderful machines. … The computer industry is ready to flood the market with hardware that will only run at full speed with parallel programs. But who will write these programs?” - Mattson, Sanders, Massingill (2005) 3 HK-UIUC

Objective: Outline a framework that draws on techniques and best practices used by experienced parallel programmers, in order to enable you to: think about the problem of parallel programming; address functionality and performance issues in your parallel program design; discuss your design decisions with others; select an appropriate implementation platform that will facilitate application development.

Fundamentals of Parallel Computing: Parallel computing requires that (1) the problem can be decomposed into sub-problems that can be safely solved at the same time, and (2) the programmer structures the code and data to solve these sub-problems concurrently. The goals of parallel computing are to solve problems in less time and/or to solve bigger problems. The problems must be large enough to justify parallel computing and to exhibit exploitable concurrency. HK-UIUC

Recommended Reading: Mattson, Sanders, Massingill, Patterns for Parallel Programming, Addison-Wesley, 2005. Used for this presentation. A good overview of challenges, best practices, and common techniques in all aspects of parallel programming. The book is on reserve at Wendt Library. HK-UIUC

Parallel Programming: Key Steps 1) Find the concurrency in the problem 2) Structure the algorithm so that concurrency can be exploited 3) Implement the algorithm in a suitable programming environment 4) Execute and tune the performance of the code on a parallel system NOTE: The reality is that these have not been separated into levels of abstractions that can be dealt with independently. 7 HK-UIUC

What’s Next? Focus on this for a while 8 From “Patterns for Parallel Programming”

Challenges of Parallel Programming: Finding and exploiting concurrency often requires looking at the problem from a non-obvious angle. Dependences need to be identified and managed. The order of task execution may change the answers. Obvious case: one step feeds its result to the next steps. Subtle case: numeric accuracy may be affected by the ordering of steps that are logically parallel with each other. HK-UIUC
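
The subtle case on the slide above can be made concrete with a minimal C sketch. It is not from the lecture; the function names sum_forward and sum_backward are hypothetical, and it assumes IEEE 754 single precision with each partial sum rounded to float. Both functions add the same numbers, only in a different order.

```c
#include <assert.h>
#include <stddef.h>

/* Sum the array from the first element to the last. */
float sum_forward(const float *x, size_t n) {
    assert(x != NULL);
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* Sum the same array from the last element to the first. */
float sum_backward(const float *x, size_t n) {
    assert(x != NULL);
    float s = 0.0f;
    for (size_t i = n; i > 0; i--)
        s += x[i - 1];
    return s;
}
```

For the input {1e8f, -1e8f, 1.0f} the forward order cancels the two large values first and keeps the 1, while the backward order absorbs the 1 into -1e8 before the cancellation. A parallel reduction that regroups additions can shift results in the same way.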

Challenges of Parallel Programming (cont.): Performance can be drastically reduced by many factors: overhead of parallel processing; load imbalance among processor elements; inefficient data sharing patterns; saturation of critical resources such as memory bandwidth. HK-UIUC

Shared Memory vs. Message Passing: We will focus on shared memory parallel programming. This is what CUDA is based on. Future massively parallel microprocessors are expected to support shared memory at the chip level. The programming considerations of the message passing model are quite different! HK-UIUC

Finding Concurrency in Problems: Identify a decomposition of the problem into sub-problems that can be solved simultaneously. This involves: a task decomposition that identifies tasks that can execute concurrently; a data decomposition that identifies data local to each task; a way of grouping tasks and ordering the groups to satisfy temporal constraints; an analysis of the data sharing patterns among the concurrent tasks; a design evaluation that assesses the quality of the choices made in all the steps. HK-UIUC

Finding Concurrency – The Process: Decomposition (Task Decomposition, Data Decomposition) → Dependence Analysis (Group Tasks, Order Tasks, Data Sharing) → Design Evaluation. This is typically an iterative process. Opportunities exist for dependence analysis to play an earlier role in decomposition. HK-UIUC

Find Concr. 1: Decomp. Stage: Task Decomposition. Many large problems have natural independent tasks. The number of tasks used should be adjustable to the execution resources available. Each task must include sufficient work to compensate for the overhead of managing its parallel execution. Tasks should maximize reuse of sequential program code to minimize design/implementation effort. “In an ideal world, the compiler would find tasks for the programmer. Unfortunately, this almost never happens.” - Mattson, Sanders, Massingill. HK-UIUC

Example: Task Decomposition. Square matrix multiplication P = M * N, with matrices of size WIDTH × WIDTH. One natural task (sub-problem) produces one element of P. All tasks can execute in parallel. HK-UIUC
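
The decomposition on the slide above can be sketched in plain C as follows. This is a sequential illustration, not the lecture's code: the names compute_p_element and matmul_by_tasks and the row-major flat-array layout are assumptions. Each call to compute_p_element is one task, and no call depends on another.

```c
#include <assert.h>

/* One task: compute the single element P[row][col] of P = M * N.
   Matrices are square, width x width, stored row-major in flat arrays. */
float compute_p_element(const float *M, const float *N,
                        int width, int row, int col) {
    assert(width > 0 && row >= 0 && row < width && col >= 0 && col < width);
    float sum = 0.0f;
    for (int k = 0; k < width; k++)
        sum += M[row * width + k] * N[k * width + col];
    return sum;
}

/* Sequential driver: the width * width calls are independent of each
   other, so each one could be handed to a different thread. */
void matmul_by_tasks(const float *M, const float *N, float *P, int width) {
    for (int row = 0; row < width; row++)
        for (int col = 0; col < width; col++)
            P[row * width + col] = compute_p_element(M, N, width, row, col);
}
```

In a CUDA setting the two driver loops would typically disappear, with the row and column derived from the thread and block indices instead.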

Task Decomposition Example: Molecular Dynamics. Simulation of motions of a large molecular system. A natural task is to perform the calculations for individual atoms in parallel. For each atom, there are several tasks to complete: covalent forces acting on the atom; neighbors that must be considered in non-bonded forces; non-bonded forces; update of position and velocity; miscellaneous physical properties based on motions. Some of these can go in parallel for an atom. It is common that there are multiple ways to decompose any given problem. HK-UIUC

Find Concr. 2: Decomp. Stage: Data Decomposition. The most compute-intensive parts of many large problems manipulate a large data structure. Similar operations are applied to different parts of the data structure, in a mostly independent manner. This is what CUDA is optimized for. The data decomposition should lead to: efficient data usage by each UE (unit of execution) within the partition; few dependencies between the UEs that work on different partitions; adjustable partitions that can be varied according to the hardware characteristics. HK-UIUC

Data Decomposition Example - Square Matrix Multiplication: P can be partitioned into row blocks (adjacent full rows), in which case computing each partition requires access to the entire N array, or into square sub-blocks, in which case only bands of M and N are needed. Think about how data is stored in the Matrix type: computing things row-wise or column-wise might actually lead to different performance.

Find Concr. 3: Dependence Analysis - Tasks Grouping. Sometimes natural tasks of a problem can be grouped together to improve efficiency. Reduced synchronization overhead, because tasks within a group supposedly need no synchronization among themselves. All tasks in the group efficiently share data loaded into a common on-chip, shared storage (Shared Memory). HK-UIUC

Task Grouping Example - Square Matrix Multiplication: Tasks calculating a P sub-block form one group: extensive input data sharing; reduced memory bandwidth using Shared Memory; all synched in execution. HK-UIUC
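
A CPU-side sketch of the sub-block grouping, under stated assumptions: the function name matmul_tiled, the tile parameter, and the requirement that width be a multiple of tile are all hypothetical choices for illustration. The three outer loops pick a P sub-block and one band of M and N; the inner loops are the tasks of that group reusing the same band.

```c
#include <assert.h>

/* Tiled multiply: P = M * N for square width x width row-major matrices.
   P is processed in tile x tile sub-blocks; width must be a multiple of tile. */
void matmul_tiled(const float *M, const float *N, float *P,
                  int width, int tile) {
    assert(width > 0 && tile > 0 && width % tile == 0);
    for (int i = 0; i < width * width; i++)
        P[i] = 0.0f;
    for (int by = 0; by < width; by += tile)             /* P sub-block row    */
        for (int bx = 0; bx < width; bx += tile)         /* P sub-block column */
            for (int bk = 0; bk < width; bk += tile)     /* band of M and N    */
                for (int y = by; y < by + tile; y++)
                    for (int x = bx; x < bx + tile; x++) {
                        float sum = P[y * width + x];
                        for (int k = bk; k < bk + tile; k++)
                            sum += M[y * width + k] * N[k * width + x];
                        P[y * width + x] = sum;
                    }
}
```

On a GPU the band selected by bk is what a thread block would stage in Shared Memory before all of its threads consume it.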

Find Concr. 4: Dependence Analysis - Task Ordering. Identify the data required by a group of tasks before they can execute. Find the task group that creates it (look upwind). Determine a temporal order that satisfies all data constraints. HK-UIUC

Task Ordering Example: Molecular Dynamics. Task groups in each time step: Neighbor List; Covalent Forces and Non-bonded Force; Update atomic positions and velocities; Next Time Step. HK-UIUC

Find Concr. 5: Dependence Analysis - Data Sharing. Data sharing can be a double-edged sword. An algorithm that calls for excessive data sharing can drastically reduce the advantage of parallel execution. Localized sharing can improve memory bandwidth efficiency. Use the execution of task groups to interleave with (mask) global memory data accesses. Read-only sharing can usually be done at much higher efficiency than read-write sharing, which often requires a higher level of synchronization. HK-UIUC

Data Sharing Example – Matrix Multiplication: Each task group will finish usage of each sub-block of N and M before moving on. N and M sub-blocks are loaded into Shared Memory for use by all threads of a P sub-block. The amount of on-chip Shared Memory strictly limits the number of threads working on a P sub-block. Read-only shared data can be efficiently accessed as Constant or Texture data (on the GPU), which frees up the shared memory for other uses.

Data Sharing Example – Molecular Dynamics. The atomic coordinates: read-only access by the neighbor list, bonded force, and non-bonded force task groups; read-write access for the position update task group. The force array: read-only access by the position update group; accumulate access by the bonded and non-bonded task groups. The neighbor list: read-only access by the non-bonded force task groups; generated by the neighbor list task group. HK-UIUC

Find Concr. 6: Design Evaluation. Key questions to ask: How many UEs can be used? How are the data structures shared? Is there a lot of data dependency that leads to excessive synchronization needs? Is there enough work in each UE between synchronizations to make parallel execution worthwhile? HK-UIUC

Key Parallel Programming Steps 1) Find the concurrency in the problem 2) Structure the algorithm so that concurrency can be exploited 3) Implement the algorithm in a suitable programming environment 4) Execute and tune the performance of the code on a parallel system 27

What’s Next? Focus on this for a while 28 From “Patterns for Parallel Programming”

Algorithm: Definition. A step-by-step procedure that is guaranteed to terminate, such that each step is precisely stated and can be carried out by a computer. Definiteness: each step is precisely stated. Effective computability: each step can be carried out by a computer. Finiteness: the procedure terminates. Multiple algorithms can be used to solve the same problem; some require fewer steps and some exhibit more parallelism. HK-UIUC

Algorithm Structure - Strategies: Organize by Tasks (Task Parallelism, Divide and Conquer); Organize by Data (Geometric Decomposition, Recursive Data Decomposition); Organize by Flow (Pipeline, Event Condition). HK-UIUC

Choosing Algorithm Structure: Algorithm Design → Organize by Task (Linear: Task Parallelism; Recursive: Divide and Conquer); Organize by Data (Linear: Geometric Decomposition; Recursive: Recursive Data); Organize by Data Flow (Regular: Pipeline; Irregular: Event Driven). HK-UIUC

Alg. Struct. 1: Organize by Structure Linear Parallel Tasks Common in the context of distributed memory models These are the algorithms where the work needed can be represented as a collection of decoupled or loosely coupled tasks that can be executed in parallel The tasks don’t even have to be identical Load balancing between UE can be an issue (dynamic load balancing?) If there is no data dependency involved there is no need for synchronization: the so called “embarrassingly parallel” scenario Example: ray tracing, a good portion of the N-body problem, Monte Carlo type simulation 32
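
A minimal sketch of the Monte Carlo case named on the slide above, written sequentially in C. Everything here is an illustrative assumption: the helper next_uniform is a small linear congruential generator chosen only to keep the example self-contained, and the names count_hits and estimate_pi are hypothetical. The point is that each task owns its seed and its counter, so tasks need no synchronization until the final sum.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: tiny linear congruential generator, one state per
   task, returning a value in [0, 1). */
static double next_uniform(uint64_t *state) {
    *state = *state * 6364136223846793005ULL + 1442695040888963407ULL;
    return (double)(*state >> 40) / 16777216.0;   /* top 24 bits / 2^24 */
}

/* One independent task: count random points that land inside the unit
   quarter circle. Nothing is shared with other tasks. */
long count_hits(uint64_t seed, long samples) {
    assert(samples >= 0);
    uint64_t state = seed;
    long hits = 0;
    for (long i = 0; i < samples; i++) {
        double x = next_uniform(&state);
        double y = next_uniform(&state);
        if (x * x + y * y <= 1.0)
            hits++;
    }
    return hits;
}

/* Combine the per-task counts into an estimate of pi. */
double estimate_pi(int num_tasks, long samples_per_task) {
    assert(num_tasks > 0 && samples_per_task > 0);
    long total = 0;
    for (int t = 0; t < num_tasks; t++)   /* each call could run on its own UE */
        total += count_hits((uint64_t)t + 1, samples_per_task);
    return 4.0 * (double)total / ((double)num_tasks * (double)samples_per_task);
}
```

A production simulation would want a generator with properly independent streams per task; the seeding used here is only good enough for a demonstration.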

Alg. Struct. 2: Organize by Structure Recursive Parallel Tasks (Divide & Conquer). Valid when you can break a problem into a set of decoupled or loosely coupled smaller problems, which in turn… This pattern is applicable when you can solve the small problems (the leaves) concurrently and with little synchronization. Chances are you need synchronization when you have this balanced-tree type of algorithm; it is often required by the merging step (assembling the result from the “subresults”). Examples: FFT, linear algebra problems (see the FLAME project), the vector reduce operation.
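
The vector reduce example mentioned above can be sketched recursively in C. This is a sequential illustration with a hypothetical function name, reduce_sum; it only shows where the independent sub-problems and the merging step sit.

```c
#include <assert.h>
#include <stddef.h>

/* Recursive sum: split the range in half, solve both halves, then merge
   with one addition. The two recursive calls touch disjoint data, so they
   could run concurrently; the final addition is the merging step that
   must wait for both results. */
long reduce_sum(const int *x, size_t n) {
    assert(x != NULL || n == 0);
    if (n == 0)
        return 0;
    if (n == 1)
        return x[0];
    size_t half = n / 2;
    long left  = reduce_sum(x, half);
    long right = reduce_sum(x + half, n - half);
    return left + right;
}
```

The recursion depth grows with log2(n), which is also the number of synchronization points a parallel version would need along any path from a leaf to the root.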

Computational ordering can have major effects on memory bank conflicts and control flow divergence.

Alg. Struct. 3: Organize by Data ~ Linear (Geometric Decomposition) ~ This is the case where the UEs gather and work around a big chunk of data with no or little synchronization Imagine a car that needs to be painted: One robot paints the front left door, another one the rear left door, another one the hood, etc. The car is the “data” and it’s parceled up with a collection of UEs acting on these chunks of data NOTE: this is exactly the algorithmic approach enabled best by the GPU & CUDA Examples: Matrix multiplication, matrix convolution 35

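Tiled (Stenciled) Algorithms are Important for G80. [Figure: tiled matrix multiplication P = M * N with matrices of size WIDTH × WIDTH; P is divided into BLOCK_WIDTH × BLOCK_WIDTH sub-blocks (Psub), selected by block indices bx, by and traversed by thread indices tx, ty.] HK-UIUC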

Alg. Struct. 4: Organize by Data ~ Recursive Data Scenario ~ This scenario comes into play when the data that you operate on is structured in a recursive fashion: balanced trees, graphs, lists. Sometimes you don’t even have to have the data represented as such (see the example with prefix scan): you choose to look at the data as having a balanced tree structure. Many problems that seem inherently sequential can be approached in this framework. This is typically associated with a net increase in the amount of work you have to do: work goes up from O(n) to O(n log(n)) (see for instance the Hillis and Steele algorithm). The key question is whether the parallelism gained brings you ahead of the sequential alternative.
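
To make the work increase tangible, here is a sequential C sketch in the style of the Hillis and Steele scan. It is an illustration under assumptions, not the lecture's code: the function name scan_hillis_steele, the scratch buffer tmp, and the addition counter are all introduced here. The inner loop stands in for the updates that a parallel machine would perform at once.

```c
#include <assert.h>
#include <string.h>

/* Inclusive prefix sum in the Hillis-Steele style.
   Each pass adds the element `offset` positions to the left; all updates
   in a pass are independent of each other. `tmp` is scratch space of n
   ints. Returns the number of additions performed. */
long scan_hillis_steele(int *x, int *tmp, int n) {
    assert(x != NULL && tmp != NULL && n >= 0);
    long adds = 0;
    for (int offset = 1; offset < n; offset *= 2) {
        for (int i = 0; i < n; i++) {        /* conceptually "in parallel" */
            if (i >= offset) {
                tmp[i] = x[i] + x[i - offset];
                adds++;
            } else {
                tmp[i] = x[i];
            }
        }
        memcpy(x, tmp, (size_t)n * sizeof *x);
    }
    return adds;
}
```

For 8 elements this performs 7 + 6 + 4 = 17 additions in 3 passes, against 7 additions for the plain sequential loop: more total work, but far fewer dependent steps.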

Example: The prefix scan ~ Reduction Step ~
    for k = 0 to M-1
        offset = 2^k
        for j = 1 to 2^(M-k-1) in parallel do
            x[j·2^(k+1) - 1] = x[j·2^(k+1) - 1] + x[j·2^(k+1) - 2^k - 1]
        endfor
    endfor
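
The reduction step above translates to C almost line by line. This is a sequential sketch under assumptions: the function name scan_reduction_step is hypothetical, the array length is taken to be exactly 2^m, and the inner loop runs serially where the slide says "in parallel".

```c
#include <assert.h>
#include <stddef.h>

/* Reduction (up-sweep) step of a tree-based prefix scan on n = 2^m
   elements. After the call, x[n-1] holds the sum of all original
   elements, and the other touched slots hold partial sums. */
void scan_reduction_step(int *x, int m) {
    assert(x != NULL && m >= 0 && m < 31);
    for (int k = 0; k < m; k++) {
        int offset = 1 << k;                 /* 2^k                          */
        int stride = offset << 1;            /* 2^(k+1)                      */
        int count  = 1 << (m - k - 1);       /* 2^(m-k-1) independent updates */
        for (int j = 1; j <= count; j++) {   /* conceptually "in parallel"    */
            int i = j * stride - 1;
            x[i] = x[i] + x[i - offset];
        }
    }
}
```

Each level halves the number of active updates, so the whole step performs n - 1 additions over m levels.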

Example: The array reduction (the bad choice). [Figure: array elements versus iterations.] HK-UIUC

Alg. Struct. 5: Organize by Data Flow ~ Regular: Pipeline ~ Tasks are parallel, but there is a high degree of synchronization (coordination) between the UEs. The input of one UE is the output of an upwind UE (the “pipeline”). The time elapsed between two pipeline ticks is dictated by the slowest stage of the pipe (the bottleneck). Commonly employed by sequential chips. For complex problems, having a deep pipeline is attractive. At each pipeline tick you output one set of results. Examples: on the CPU, instruction fetch, decode, data fetch, instruction execution, and data write are pipelined. The Cray vector architecture draws heavily on this algorithmic structure when performing linear algebra operations.
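
The bottleneck statement above can be turned into a small timing model in C. This is an idealized sketch, assuming a lock-step pipeline in which every stage advances once per tick; the function names pipeline_tick and pipeline_total_time are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Tick length of a synchronous pipeline: the slowest stage sets the pace. */
double pipeline_tick(const double *stage_time, int n_stages) {
    assert(stage_time != NULL && n_stages > 0);
    double tick = stage_time[0];
    for (int s = 1; s < n_stages; s++)
        if (stage_time[s] > tick)
            tick = stage_time[s];
    return tick;
}

/* Total time to push n_items through: n_stages ticks until the first
   result appears, then one further tick per additional item. */
double pipeline_total_time(const double *stage_time, int n_stages,
                           int n_items) {
    assert(n_items > 0);
    return (n_stages + n_items - 1) * pipeline_tick(stage_time, n_stages);
}
```

With stage times of 1, 3, and 2 time units, the tick is 3 units, so 10 items take (3 + 10 - 1) × 3 = 36 units in this model, whereas processing them one after another without overlap would take 10 × 6 = 60.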

Alg. Struct. 6: Organize by Data Flow ~ Irregular: Event Driven Scenarios ~ The furthest away from what the GPU can support today. Well supported by MIMD architectures. You coordinate UEs through asynchronous events. Critical aspects: load balancing, which should be dynamic; communication overhead, particularly in real-time applications. Suitable for action-reaction type simulations. Examples: computer games, traffic control algorithms.

Key Parallel Programming Steps 1) Find the concurrency in the problem 2) Structure the algorithm so that concurrency can be exploited 3) Implement the algorithm in a suitable programming environment 4) Execute and tune the performance of the code on a parallel system 42

What’s Next? Focus on this for a while 43

Objective: To understand SPMD and the role of the CUDA supporting structures: how SPMD relates to other common parallel programming structures; what the supporting structures entail for application development; how the supporting structures could be used to implement parameterized applications for different levels of efficiency and complexity.

Parallel Programming Coding Styles - Program and Data Models. Program Models: SPMD, Master/Worker, Loop Parallelism, Fork/Join. Data Models: Shared Data, Shared Queue, Distributed Array. These are not necessarily mutually exclusive.

Program Models: SPMD (Single Program, Multiple Data). All PEs (Processor Elements) execute the same program in parallel. Each PE has its own data. Each PE uses a unique ID to access its portion of data. Different PEs can follow different paths through the same code. This is essentially the CUDA Grid model; SIMD is a special case (WARP). The other program models are Master/Worker, Loop Parallelism, and Fork/Join.

Program Models: SPMD (Single Program, Multiple Data). Master/Worker: a master thread sets up a pool of worker threads and a bag of tasks; workers execute concurrently, removing tasks until done. Loop Parallelism: loop iterations execute in parallel; FORTRAN do-all (truly parallel), do-across (with dependence). Fork/Join: the most general, generic way of creating threads.

Review: Algorithm Structure. Algorithm Design → Organize by Task (Linear: Task Parallelism; Recursive: Divide and Conquer); Organize by Data (Linear: Geometric Decomposition; Recursive: Recursive Data); Organize by Data Flow (Regular: Pipeline; Irregular: Event Driven).

Algorithm Structures vs. Coding Styles (four smiles is best). Columns: Task Parallel., Divide/Conquer, Geometric Decomp., Recursive Data, Pipeline, Event-based. Rows: SPMD ☺☺☺☺☺☺☺☺☺☺☺☺☺☺☺☺☺☺; Loop Parallel ☺☺☺☺☺☺☺☺☺; Master/Worker ☺☺☺☺☺☺☺☺☺☺; Fork/Join ☺☺☺☺☺☺☺☺☺☺☺☺. Source: Mattson, et al.

Supporting Structures vs. Program Models. Columns: OpenMP, MPI, CUDA. Rows: SPMD ☺☺☺☺☺☺☺; Loop Parallel ☺☺☺☺☺; Master/Slave ☺☺☺☺☺; Fork/Join ☺☺☺.

More on SPMD: The dominant coding style of scalable parallel computing. MPI code is mostly developed in SPMD style. Much OpenMP code is also in SPMD style (next to loop parallelism). Particularly suitable for algorithms based on data parallelism, geometric decomposition, and divide and conquer. Main advantage: tasks and their interactions are visible in one piece of source code, with no need to correlate multiple sources. SPMD is by far the most commonly used pattern for structuring parallel programs.

Typical SPMD Program Phases. Initialize: establish localized data structures and communication channels. Obtain a unique identifier: each thread acquires a unique identifier, typically ranging from 0 to N-1, where N is the number of threads; both OpenMP and CUDA have built-in support for this. Distribute data: decompose global data into chunks and localize them, or share/replicate major data structures, using the thread ID to associate subsets of the data with threads. Run the core computation (more details in the next slide). Finalize: reconcile global data structures and prepare for the next major iteration.

Core Computation Phase: Thread IDs are used to differentiate the behavior of threads. Use the thread ID in loop index calculations to split loop iterations among threads, or use the thread ID (or conditions based on it) to branch to thread-specific actions. Both can have very different performance results and code complexity depending on the way they are done.

A Simple Example: Assume the computation being parallelized has 1,000,000 iterations. Sequential code:
    num_steps = 1000000;
    for (i = 0; i < num_steps; i++) { … }

SPMD Code Version 1: Assign a chunk of iterations to each thread. The last thread also finishes up the remainder iterations.
    num_steps = 1000000;
    i_start = my_id * (num_steps / num_threads);
    i_end = i_start + (num_steps / num_threads);
    if (my_id == (num_threads - 1)) i_end = num_steps;  /* last thread takes the remainder */
    for (i = i_start; i < i_end; i++) { … }
Reconciliation of results across threads if necessary.

Problems with Version 1 The last thread executes more iterations than others The number of extra iterations is up to the total number of threads – 1 This is not a big problem when the number of threads is small When there are thousands of threads, this can create serious load imbalance problems. Also, the extra if statement is a typical source of “branch divergence” in CUDA programs. 56
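
The imbalance described above is easy to quantify with a small C sketch. The function name block_range_v1 and the pointer-based interface are assumptions made for illustration; the arithmetic is that of Version 1.

```c
#include <assert.h>
#include <stddef.h>

/* Version 1 block distribution: thread my_id gets the half-open range
   [*i_start, *i_end); the last thread also takes the remainder. */
void block_range_v1(long num_steps, int num_threads, int my_id,
                    long *i_start, long *i_end) {
    assert(i_start != NULL && i_end != NULL);
    assert(num_threads > 0 && my_id >= 0 && my_id < num_threads);
    long chunk = num_steps / num_threads;
    *i_start = my_id * chunk;
    *i_end   = *i_start + chunk;
    if (my_id == num_threads - 1)
        *i_end = num_steps;
}
```

With 1,000,000 iterations and 1024 threads, each thread gets 976 iterations and the remainder is 576, so the last thread runs 976 + 576 = 1552 iterations, roughly 59% more than any other thread.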

SPMD Code Version 2: Assign one more iteration to some of the threads.
    int rem = num_steps % num_threads;
    i_start = my_id * (num_steps / num_threads);
    i_end = i_start + (num_steps / num_threads);
    if (rem != 0) {
        if (my_id < rem) {   /* the first rem threads get one extra iteration each */
            i_start += my_id;
            i_end += (my_id + 1);
        } else {             /* the remaining threads are shifted past the extras */
            i_start += rem;
            i_end += rem;
        }
    }
Less load imbalance. More branch divergence.
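
To check that Version 2 really covers every iteration exactly once, the index arithmetic can be wrapped in a function and walked over all thread IDs. A sketch in C, with the hypothetical name block_range_v2; the outer test on the remainder is dropped here because the else branch adds zero when the remainder is zero.

```c
#include <assert.h>
#include <stddef.h>

/* Version 2 block distribution: the first (num_steps % num_threads)
   threads get one extra iteration each. Thread my_id gets the half-open
   range [*i_start, *i_end). */
void block_range_v2(long num_steps, int num_threads, int my_id,
                    long *i_start, long *i_end) {
    assert(i_start != NULL && i_end != NULL);
    assert(num_threads > 0 && my_id >= 0 && my_id < num_threads);
    long chunk = num_steps / num_threads;
    long rem   = num_steps % num_threads;
    *i_start = my_id * chunk;
    *i_end   = *i_start + chunk;
    if (my_id < rem) {          /* one extra iteration for this thread  */
        *i_start += my_id;
        *i_end   += my_id + 1;
    } else {                    /* shifted past all the extra iterations */
        *i_start += rem;
        *i_end   += rem;
    }
}
```

For 10 iterations on 4 threads the ranges come out as [0,3), [3,6), [6,8), [8,10): contiguous, non-overlapping, and differing in length by at most one.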

SPMD Code Version 3: Use cyclic distribution of iterations.
    num_steps = 1000000;
    for (i = my_id; i < num_steps; i += num_threads) { … }
Less load imbalance. Loop branch divergence in the last warp. Data padding further eliminates divergence.
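
The effect of padding on the cyclic distribution can be sketched in C. The helper names cyclic_count and padded_steps are hypothetical, and padding is modeled here simply as rounding the iteration count up to a multiple of the thread count.

```c
#include <assert.h>

/* Number of iterations thread my_id executes under the cyclic
   distribution of Version 3. */
long cyclic_count(long num_steps, int num_threads, int my_id) {
    assert(num_threads > 0 && my_id >= 0 && my_id < num_threads);
    long count = 0;
    for (long i = my_id; i < num_steps; i += num_threads)
        count++;
    return count;
}

/* Data padding: round num_steps up to the next multiple of num_threads
   so that every thread executes the same number of iterations. */
long padded_steps(long num_steps, int num_threads) {
    assert(num_steps >= 0 && num_threads > 0);
    return ((num_steps + num_threads - 1) / num_threads) * num_threads;
}
```

With 10 iterations on 4 threads the counts are 3, 3, 2, 2, so threads 2 and 3 leave the loop one trip early. Padding to 12 iterations gives every thread 3 trips, at the cost of processing two dummy elements.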

Comparing the Three Versions. [Figure: assignment of iterations to thread IDs 0-3 under Version 1, Version 2, and Version 3.] Padded Version 1 may be best for some data access patterns.

Data Styles. Shared Data: all threads share a major data structure; this is what CUDA supports. Shared Queue: all threads see a “thread safe” queue that maintains ordering of data communication. Distributed Array: decomposed and distributed among threads; limited support in CUDA Shared Memory.