CS/EE 217 GPU Architecture and Parallel Programming Lecture 9: Tiled Convolution Analysis © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2012.


Objective
To learn more about the analysis of tiled algorithms

If we used a larger (8-element) tile
With Mask_Width = 5, we load 8 + 5 - 1 = 12 elements of N into N_ds (12 memory loads).

Each output P element uses 5 N elements (in N_ds), with Mask_Width = 5:
P[8] uses N[6], N[7], N[8], N[9], N[10]
P[9] uses N[7], N[8], N[9], N[10], N[11]
P[10] uses N[8], N[9], N[10], N[11], N[12]
…
P[14] uses N[12], N[13], N[14], N[15], N[16]
P[15] uses N[13], N[14], N[15], N[16], N[17]
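For reference, a minimal sketch of the untiled 1D convolution this access pattern comes from; the kernel itself is not on the slide, so the signature and the bounds check are assumptions, while N, M, P, and Mask_Width follow the slides:

__global__ void convolution_1D_basic(float *N, float *M, float *P,
                                     int Mask_Width, int Width) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // one output element per thread
    if (i < Width) {
        float Pvalue = 0.0f;
        int N_start = i - Mask_Width / 2;            // left edge of the 5-element window
        for (int j = 0; j < Mask_Width; j++) {       // each P[i] reads Mask_Width N elements
            int idx = N_start + j;
            if (idx >= 0 && idx < Width)             // ghost elements contribute 0
                Pvalue += N[idx] * M[j];
        }
        P[i] = Pvalue;
    }
}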

A simple way to calculate tiling benefit
(8+5-1) = 12 elements loaded
8*5 = 40 global memory accesses replaced by shared memory accesses
This gives a bandwidth reduction of 40/12 = 3.3

In General
Tile_Width + Mask_Width - 1 elements loaded
Tile_Width * Mask_Width global memory accesses replaced by shared memory accesses
This gives a bandwidth reduction of (Tile_Width * Mask_Width) / (Tile_Width + Mask_Width - 1)

Another Way to Look at Reuse (Mask_Width is 5)
N[6] is used by P[8] (1X)
N[7] is used by P[8], P[9] (2X)
N[8] is used by P[8], P[9], P[10] (3X)
N[9] is used by P[8], P[9], P[10], P[11] (4X)
N[10] is used by P[8], P[9], P[10], P[11], P[12] (5X)
… (5X)
N[14] is used by P[12], P[13], P[14], P[15] (4X)
N[15] is used by P[13], P[14], P[15] (3X)

Another Way to Look at Reuse
The total number of global memory accesses (to the (8+5-1) = 12 N elements) replaced by shared memory accesses is
1 + 2 + 3 + 4 + 5*(8-5+1) + 4 + 3 + 2 + 1 = 10 + 20 + 10 = 40
So the reduction is 40/12 = 3.3
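A small host-side check of that sum; the helper name and loop are ours, not from the slides:

/* Counts how many of the Tile_Width outputs read each of the
   (Tile_Width + Mask_Width - 1) N_ds entries and sums the counts;
   for Tile_Width = 8, Mask_Width = 5 the per-element counts are
   1,2,3,4,5,5,5,5,4,3,2,1 and the total is 40, matching the slide. */
int count_reuse_total(int Tile_Width, int Mask_Width) {
    int loaded = Tile_Width + Mask_Width - 1;        // 12 N_ds elements in the example
    int total = 0;
    for (int k = 0; k < loaded; k++)                 // k indexes N_ds
        for (int out = 0; out < Tile_Width; out++)   // output 'out' reads N_ds[out .. out+Mask_Width-1]
            if (k >= out && k < out + Mask_Width)
                total++;
    return total;
}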

Ghost elements change ratios
For a boundary tile, we load Tile_Width + (Mask_Width-1)/2 elements – 10 in our example of Tile_Width = 8 and Mask_Width = 5
Computation of boundary elements does not access global memory for ghost cells – total accesses are 3 + 4 + 6*5 = 37
The reduction is 37/10 = 3.7
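A small check of the 37, assuming the leftmost boundary tile (ghosts only on the left); the helper name is ours:

/* Counts the non-ghost N accesses made while computing the leftmost
   output tile; indices below 0 are ghost cells and are skipped,
   giving 3 + 4 + 6*5 = 37 for Tile_Width = 8, Mask_Width = 5. */
int count_boundary_accesses(int Tile_Width, int Mask_Width) {
    int halo = (Mask_Width - 1) / 2;
    int total = 0;
    for (int out = 0; out < Tile_Width; out++)      // outputs P[0 .. Tile_Width-1]
        for (int j = 0; j < Mask_Width; j++) {
            int idx = out - halo + j;               // N index this mask tap reads
            if (idx >= 0) total++;                  // ghost cells need no global access
        }
    return total;
}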

In General for 1D
The total number of global memory accesses to the (Tile_Width + Mask_Width - 1) N elements replaced by shared memory accesses is
1 + 2 + … + (Mask_Width-1) + Mask_Width*(Tile_Width - Mask_Width + 1) + (Mask_Width-1) + … + 2 + 1
= (Mask_Width-1)*Mask_Width + Mask_Width*(Tile_Width - Mask_Width + 1)
= Mask_Width * Tile_Width

Bandwidth Reduction for 1D
The reduction is Mask_Width * Tile_Width / (Tile_Width + Mask_Width - 1)
[Table: reduction ratios for Mask_Width = 5 and Mask_Width = 9 over a range of Tile_Width values]
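The numeric table entries follow directly from the formula; a small host-side sketch that regenerates them, where the particular Tile_Width values are an assumption:

#include <stdio.h>

/* Prints 1D reduction ratios from the formula above,
   e.g. Mask_Width = 5, Tile_Width = 16 gives 5*16/20 = 4.0. */
int main(void) {
    int tiles[] = {16, 32, 64, 128, 256};
    int masks[] = {5, 9};
    for (int m = 0; m < 2; m++)
        for (int t = 0; t < 5; t++) {
            int Mask_Width = masks[m], Tile_Width = tiles[t];
            double r = (double)(Mask_Width * Tile_Width)
                       / (Tile_Width + Mask_Width - 1);
            printf("Mask_Width=%d Tile_Width=%3d reduction=%.1f\n",
                   Mask_Width, Tile_Width, r);
        }
    return 0;
}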

2D Output Tiling and Indexing (P)
Use a thread block to calculate a tile of P – each output tile is TILE_WIDTH elements in both x and y

col_o = blockIdx.x * TILE_WIDTH + tx;
row_o = blockIdx.y * TILE_WIDTH + ty;
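These two lines give the output element each thread produces. A common companion step, sketched here and not shown on the slide (col_i/row_i are assumed names, and the mask is assumed odd-sized), shifts to the matching input-tile origin so the loaded tile covers the halo on the left and top:

col_i = col_o - (Mask_Width - 1) / 2;   // shift left by the halo width
row_i = row_o - (Mask_Width - 1) / 2;   // shift up by the halo width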

Input tiles need to cover halo elements
[Figure: an output tile inside its larger input tile, Mask_Width = 5]
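A minimal sketch of how the input tile, halo included, is typically staged into shared memory using these indices; N_ds follows the earlier slides, while Height, Width, MAX_MASK_WIDTH, and sizing the thread block to the input tile (TILE_WIDTH + Mask_Width - 1 threads per dimension) are assumptions:

__shared__ float N_ds[TILE_WIDTH + MAX_MASK_WIDTH - 1][TILE_WIDTH + MAX_MASK_WIDTH - 1];

if (row_i >= 0 && row_i < Height && col_i >= 0 && col_i < Width)
    N_ds[ty][tx] = N[row_i * Width + col_i];   // interior or halo element
else
    N_ds[ty][tx] = 0.0f;                       // ghost element contributes zero
__syncthreads();                               // whole tile loaded before any thread computes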

A Simple Analysis for a small 8X8 output tile example
12 x 12 = 144 N elements need to be loaded into shared memory
The calculation of each P element needs to access 25 N elements
8 x 8 x 25 = 1600 global memory accesses are converted into shared memory accesses
A reduction of 1600/144 = 11X

In General
(Tile_Width + Mask_Width - 1)² elements from N need to be loaded into shared memory for each tile
The calculation of each P element needs to access Mask_Width² elements
– Tile_Width² * Mask_Width² global memory accesses are converted into shared memory accesses
The reduction is Tile_Width² * Mask_Width² / (Tile_Width + Mask_Width - 1)²

Bandwidth Reduction for 2D
The reduction is Mask_Width² * Tile_Width² / (Tile_Width + Mask_Width - 1)²
[Table: 2D reduction ratios for Mask_Width = 5 and Mask_Width = 9 over a range of Tile_Width values]
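As with the 1D table, the numeric entries follow from the formula; a small sketch that computes them (the helper name is ours):

/* 2D reduction ratio from the formula above;
   e.g. Tile_Width = 8, Mask_Width = 5 gives 1600/144 = 11.1. */
double reduction_2d(int Tile_Width, int Mask_Width) {
    double loads    = (double)(Tile_Width + Mask_Width - 1)
                    * (Tile_Width + Mask_Width - 1);
    double replaced = (double)Tile_Width * Tile_Width
                    * Mask_Width * Mask_Width;
    return replaced / loads;
}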

Ghost elements change ratios – left as homework.

ANY MORE QUESTIONS? READ CHAPTER 8
© David Kirk/NVIDIA and Wen-mei W. Hwu, University of Illinois