Structuring Parallel Algorithms
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign

Slides from: David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign. Podcasts of their lectures (recommended).

Key Parallel Programming Steps
1) Find the concurrency in the problem
2) Structure the algorithm to translate concurrency into performance
3) Implement the algorithm in a suitable programming environment
4) Execute and tune the performance of the code on a parallel system
Unfortunately, these steps have not been separated into levels of abstraction that can be dealt with independently.

Algorithm
A step-by-step procedure that is guaranteed to terminate, such that each step is precisely stated and can be carried out by a computer:
– Definiteness: each step is precisely stated
– Effective computability: each step can be carried out by a computer
– Finiteness: the procedure terminates
Multiple algorithms can be used to solve the same problem:
– Some require fewer steps
– Some exhibit more parallelism
– Some have a larger memory footprint than others

Choosing Algorithm Structure
Start from the dominant organization of the problem, then refine by whether the structure is linear or recursive:
– Task-centric: Linear → Task Parallelism; Recursive → Divide and Conquer
– Data-centric: Linear → Geometric Decomposition; Recursive → Recursive Data
– Data-flow-centric: Regular → Pipeline; Irregular → Event Driven

A Simple Implementation
Assume the input has already been loaded into __shared__ float partialSum[]:

    unsigned int t = threadIdx.x;
    for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) {
        __syncthreads();
        if (t % (2*stride) == 0)
            partialSum[t] += partialSum[t+stride];
    }
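For context, here is how that loop might sit inside a complete kernel. This is a minimal sketch, not the course's code: the kernel name reduceSum, the blockSums output array, and the assumption that the input length is an exact multiple of blockDim.x (a power of two) are all illustrative.

    __global__ void reduceSum(const float *input, float *blockSums) {
        extern __shared__ float partialSum[];      // one float per thread
        unsigned int t = threadIdx.x;
        unsigned int i = blockIdx.x * blockDim.x + t;
        partialSum[t] = input[i];                  // the load the slide assumes
        for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) {
            __syncthreads();                       // wait for the previous round's writes
            if (t % (2*stride) == 0)
                partialSum[t] += partialSum[t + stride];
        }
        if (t == 0)                                // thread 0 holds this block's sum
            blockSums[blockIdx.x] = partialSum[0];
    }

A launch passes the shared-memory size explicitly, e.g. reduceSum<<<numBlocks, 256, 256 * sizeof(float)>>>(d_in, d_blockSums); the per-block sums are then reduced again, or summed on the host.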

Mapping a Divide and Conquer Algorithm
[Figure: the reduction tree drawn over the array elements; in each iteration, the even-numbered threads (Thread 0, Thread 2, Thread 4, Thread 6, Thread 8, Thread 10, ...) each add a pair of elements, and half of the active threads drop out per iteration.]

Tiled (Stenciled) Algorithms are Important for Geometric Decomposition
[Figure: tiled matrix multiplication; a thread block indexed by (bx, by), with threads (tx, ty), computes one BLOCK_WIDTH × BLOCK_WIDTH tile Psub of the WIDTH × WIDTH output matrix P from inputs M and N.]
A framework for memory data sharing and reuse by increasing data access locality:
– Tiled access patterns allow small cache/scratchpad memories to hold on to data for reuse.
– For matrix multiplication, a 16×16 thread block performs 2*256 = 512 float loads from device memory for 256 * (2*16) = 8,192 mul/add operations.
A convenient framework for organizing threads (tasks), as sketched below.
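A sketch of the tiled kernel this arithmetic describes, using the slides' Md/Nd/Pd naming. The body is a reconstruction under the usual simplifying assumptions (square WIDTH × WIDTH matrices, WIDTH an exact multiple of TILE_WIDTH), not the course's exact code.

    #define TILE_WIDTH 16

    __global__ void matrixMulTiled(const float *Md, const float *Nd, float *Pd, int Width) {
        __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
        __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];
        int row = blockIdx.y * TILE_WIDTH + threadIdx.y;
        int col = blockIdx.x * TILE_WIDTH + threadIdx.x;
        float pValue = 0.0f;
        // Per tile phase, each of the 256 threads does 2 global loads and
        // 2*16 mul/add operations: the 512 to 8,192 ratio quoted above.
        for (int m = 0; m < Width / TILE_WIDTH; ++m) {
            Mds[threadIdx.y][threadIdx.x] = Md[row * Width + m * TILE_WIDTH + threadIdx.x];
            Nds[threadIdx.y][threadIdx.x] = Nd[(m * TILE_WIDTH + threadIdx.y) * Width + col];
            __syncthreads();                       // tile fully loaded before use
            for (int k = 0; k < TILE_WIDTH; ++k)
                pValue += Mds[threadIdx.y][k] * Nds[k][threadIdx.x];
            __syncthreads();                       // done with tile before it is overwritten
        }
        Pd[row * Width + col] = pValue;
    }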

Increased Work per Thread for Even More Locality
[Figure: the same tiling diagram with Md, Nd, Pd; each thread block now covers two TILE_WIDTH × TILE_WIDTH output tiles Pdsub.]
Each thread computes two elements of Pdsub:
– Reduced loads from global memory (Md) to shared memory
– Reduced instruction overhead: more work is done in each iteration
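One way to realize this, sketched below: the block covers two horizontally adjacent output tiles and each thread keeps two accumulators, so every value fetched from the Md tile is used twice. The kernel name and the side-by-side tile layout are illustrative choices (TILE_WIDTH as defined above), not the slides' exact code.

    __global__ void matrixMulTwoPerThread(const float *Md, const float *Nd, float *Pd, int Width) {
        __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
        __shared__ float Nds0[TILE_WIDTH][TILE_WIDTH];
        __shared__ float Nds1[TILE_WIDTH][TILE_WIDTH];
        int row  = blockIdx.y * TILE_WIDTH + threadIdx.y;
        int col0 = blockIdx.x * (2 * TILE_WIDTH) + threadIdx.x;  // first output column
        int col1 = col0 + TILE_WIDTH;                            // second output column
        float p0 = 0.0f, p1 = 0.0f;
        for (int m = 0; m < Width / TILE_WIDTH; ++m) {
            // The Md tile is loaded from global memory once but feeds both outputs.
            Mds[threadIdx.y][threadIdx.x]  = Md[row * Width + m * TILE_WIDTH + threadIdx.x];
            Nds0[threadIdx.y][threadIdx.x] = Nd[(m * TILE_WIDTH + threadIdx.y) * Width + col0];
            Nds1[threadIdx.y][threadIdx.x] = Nd[(m * TILE_WIDTH + threadIdx.y) * Width + col1];
            __syncthreads();
            for (int k = 0; k < TILE_WIDTH; ++k) {
                float mval = Mds[threadIdx.y][k];   // read from shared memory once, used twice
                p0 += mval * Nds0[k][threadIdx.x];
                p1 += mval * Nds1[k][threadIdx.x];
            }
            __syncthreads();
        }
        Pd[row * Width + col0] = p0;
        Pd[row * Width + col1] = p1;
    }

The grid shrinks accordingly: dim3 grid(Width / (2 * TILE_WIDTH), Width / TILE_WIDTH) with the same TILE_WIDTH × TILE_WIDTH blocks.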

Double Buffering: a Frequently Used Algorithm Pattern
One can double-buffer the computation, getting a better instruction mix within each thread. This is classic software pipelining from ILP compilers.

Single-buffered:

    Loop {
        Load current tile to shared memory
        syncthreads()
        Compute current tile
        syncthreads()
    }

Double-buffered:

    Load first tile from global memory into registers
    Loop {
        Deposit current tile to shared memory
        syncthreads()
        Load next tile from global memory
        Compute current tile
        syncthreads()
    }
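A sketch of the double-buffered loop applied to the tiled kernel above; the register staging through mReg/nReg and the epilogue guard are a reconstruction of the pattern, not code from the slides.

    __global__ void matrixMulDoubleBuffered(const float *Md, const float *Nd, float *Pd, int Width) {
        __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
        __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];
        int row = blockIdx.y * TILE_WIDTH + threadIdx.y;
        int col = blockIdx.x * TILE_WIDTH + threadIdx.x;
        int numTiles = Width / TILE_WIDTH;
        float pValue = 0.0f;
        // Prologue: load the first tile from global memory into registers.
        float mReg = Md[row * Width + threadIdx.x];
        float nReg = Nd[threadIdx.y * Width + col];
        for (int m = 0; m < numTiles; ++m) {
            // Deposit the current tile from registers into shared memory.
            Mds[threadIdx.y][threadIdx.x] = mReg;
            Nds[threadIdx.y][threadIdx.x] = nReg;
            __syncthreads();
            // Start loading the next tile; these loads are independent of the
            // mul/add work below, giving the better instruction mix noted above.
            if (m + 1 < numTiles) {
                mReg = Md[row * Width + (m + 1) * TILE_WIDTH + threadIdx.x];
                nReg = Nd[((m + 1) * TILE_WIDTH + threadIdx.y) * Width + col];
            }
            for (int k = 0; k < TILE_WIDTH; ++k)
                pValue += Mds[threadIdx.y][k] * Nds[k][threadIdx.x];
            __syncthreads();
        }
        Pd[row * Width + col] = pValue;
    }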

Double Buffering (continued)
[Figure: the Md/Nd/Pd tiling diagram, with the current ("blue") and next ("orange") tiles highlighted.]
– Deposit blue tile from registers into shared memory
– syncthreads
– Load orange tile into registers
– Compute blue tile
– Deposit orange tile into shared memory
– …

One Can Trade More Work for Increased Parallelism
(motion estimation example)
– Diamond search is work-efficient but sequential; popular in traditional CPU implementations.
– Exhaustive search is totally parallel but work-inefficient; popular in hardware and parallel implementations.
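To make the contrast concrete, a minimal sketch of the exhaustive approach: one thread per candidate motion vector, each computing a sum of absolute differences (SAD) for one 16×16 macroblock. All names and parameters here (sadKernel, the search range R, the frame layout) are illustrative assumptions, and border handling is omitted: the macroblock is assumed to be at least R pixels from the frame edge.

    #define R  16   // candidate displacements in [-R, R-1], so a 32x32 thread block
    #define MB 16   // macroblock size

    // Every candidate is evaluated: fully parallel, but far more total work
    // than diamond search, which prunes most candidates sequentially.
    __global__ void sadKernel(const unsigned char *cur, const unsigned char *ref,
                              int width, int mbx, int mby, int *sadOut) {
        int dx = (int)threadIdx.x - R;        // this thread's candidate vector
        int dy = (int)threadIdx.y - R;
        int sad = 0;
        for (int y = 0; y < MB; ++y)
            for (int x = 0; x < MB; ++x) {
                int c = cur[(mby + y) * width + (mbx + x)];
                int r = ref[(mby + dy + y) * width + (mbx + dx + x)];
                sad += abs(c - r);
            }
        sadOut[threadIdx.y * (2 * R) + threadIdx.x] = sad;
    }

A follow-up reduction (or a host-side scan of sadOut) picks the minimum-SAD vector; diamond search instead walks a small pattern of candidates sequentially, refining the best one found so far.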

An MPEG Algorithm Based on Data Parallelism
Loops distributed, DOALL style:
– Replicates instructions and tables across accelerators
If instructions and data are too large for local memory…
– Large memory transfers are required to preserve data
– Saturation of, and contention for, communication resources can leave computation resources idle
Memory bandwidth constrains performance.

Loop Fusion & Memory Privatization
– Stage loops fused into a single DOALL macroblock loop: memory privatization reduces main-memory accesses (see the sketch below)
– Replicates instructions and tables across processors: local memory constraints may prevent this technique
Novel dimensions of parallelism reduce communication.
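A schematic of the transformation in host-side C (names and stage bodies are placeholders, not from the slides): two stage loops over macroblocks become one fused DOALL loop, and the whole-frame intermediate array in main memory shrinks to a per-iteration private buffer.

    typedef struct { unsigned char px[16 * 16]; } Block;
    void stage1(const Block *src, Block *dst);   // e.g., motion compensation (assumed)
    void stage2(const Block *src, Block *dst);   // e.g., reconstruction (assumed)

    // Before fusion (sketch): stage1 writes a whole-frame tmp[] array that
    // stage2 later reads back from main memory.
    //   for (mb = 0; mb < numMB; ++mb) stage1(&in[mb], &tmp[mb]);
    //   for (mb = 0; mb < numMB; ++mb) stage2(&tmp[mb], &out[mb]);

    // After fusion: one DOALL loop whose iterations are independent; the
    // privatized buffer can live in fast local memory instead of DRAM.
    void fusedMacroblockLoop(const Block *in, Block *out, int numMB) {
        for (int mb = 0; mb < numMB; ++mb) {
            Block local;                          // privatized intermediate
            stage1(&in[mb], &local);
            stage2(&local, &out[mb]);
        }
    }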

Pipeline or “Spatial Computing” Model
– Each PE performs as one pipeline stage in macroblock processing
– Imbalanced stages result in idle resources
– Takes advantage of direct, accelerator-to-accelerator communication
– Not very effective in CUDA, but can be effective for Cell
Efficient point-to-point communication can enable new models.

Small Decisions in Algorithms Can Have a Major Effect on Parallelism
(H.263 motion estimation example)
– Different algorithms may expose different levels of parallelism while achieving the desired result
– In motion estimation, previous vectors (from either space or time) can be used as guess vectors