Introduction to CUDA Programming: Textures
Andreas Moshovos, Winter 2009
Some material from Matthew Bolitho's slides.

Memory Hierarchy Overview
Registers – very fast
Shared memory – very fast
Local memory – several hundred cycles
Global memory – several hundred cycles
Constant memory – several hundred cycles on a miss; cached
Texture memory – several hundred cycles on a miss; cached (8 KB cache)

What is Texture Memory?
A block of read-only memory shared by all multiprocessors:
– A 1D, 2D, or 3D array
– Texels: up to 4-element vectors (x, y, z, w)
Reads from texture memory can be "samples" of multiple texels.
Slow to access – several hundred clock cycles of latency.
But it is cached – 8 KB per multiprocessor – so access is fast on a cache hit.
Good if you have random accesses to a large read-only data structure.

Overview: Benefits & Limitations of CUDA Textures
Texture fetches are cached:
– Optimized for 2D locality (we'll talk about this at the end)
Addressing:
– 1D, 2D, or 3D
Coordinates:
– Integer or normalized
– Fewer addressing calculations in code
Provide filtering for free.
Free out-of-bounds handling via addressing modes:
– Clamp to edge / wrap around
Limitations of CUDA textures:
– Read-only from within a kernel

Texture Abstract Structure
A 1D, 2D, or 3D array; example: 4x4.
(Figure: the values, set by the program.)

Regular Indexing
Indexes are floating-point numbers:
– Think of the texture as a continuous surface, as opposed to a grid for which you have only a grid of samples.
– An index can point between samples, to a location that is "not there" in the underlying grid.

Normalized Indexing
An N x M texture is indexed with coordinates in [0.0, 1.0) x [0.0, 1.0):
– (0.0, 0.0) is one corner, (1.0, 1.0) the opposite corner, (0.5, 0.5) the center
– The center of texel (i, j) sits at ((i + 0.5)/N, (j + 0.5)/M)
Convenient if you want to express the computation in size-independent terms.

How to Think About Program Values
(Figure: the values held by the texture.)

What Value Does a Texture Reference Return?
Nearest-point sampling:
– Comes for "free"
– Works for any texel type (linear filtering, covered next, requires float-valued elements)

Nearest-Point Sampling
In this filtering mode, the value returned by the texture fetch is:
– tex(x) = T[i] for a one-dimensional texture
– tex(x, y) = T[i, j] for a two-dimensional texture
– tex(x, y, z) = T[i, j, k] for a three-dimensional texture
where i = floor(x), j = floor(y), and k = floor(z). For example, tex(2.7) returns T[2].

Nearest-Point Sampling: 4-Element 1D Texture
Behaves much like a conventional array.

Another Filtering Option: Linear Filtering
See Appendix D of the Programming Guide.

Linear-Filtering Detail
Good luck with this one: effectively, the value read is a weighted average of the neighboring texels.
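Concretely, Appendix D's formulas reduce to a weighted average of the two (1D) or four (2D) nearest texels. With x_B = x − 0.5, i = floor(x_B), α = frac(x_B), and similarly y_B, j, β for the y coordinate:

  tex(x) = (1 − α) T[i] + α T[i+1]
  tex(x, y) = (1 − α)(1 − β) T[i, j] + α(1 − β) T[i+1, j] + (1 − α)β T[i, j+1] + αβ T[i+1, j+1]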

Linear-Filtering: 4-Element 1D Texture

Dealing with Out-of-Bounds References
Clamping – gets stuck at the edge:
– i < 0 → actual i = 0
– i > N − 1 → actual i = N − 1
Wrapping – wraps around:
– actual i = i MOD N
– Useful when the texture is a periodic signal
For example, with N = 4: clamping maps i = 5 to 3, while wrapping maps it to 1.

Texture Addressing Explained

Texels: Texture Elements
– All elemental datatypes: integer, char, short, float (and their unsigned variants)
– CUDA vectors: 1, 2, or 4 elements:
  char1, uchar1, char2, uchar2, char4, uchar4,
  short1, ushort1, short2, ushort2, short4, ushort4,
  int1, uint1, int2, uint2, int4, uint4,
  long1, ulong1, long2, ulong2, long4, ulong4,
  float1, float2, float4
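As a sketch (texRef4 and the kernel are illustrative, not from the slides; tex1Dfetch is covered a few slides below), a texture over 4-element texels and per-component access look like this:

  texture<uchar4, 1, cudaReadModeElementType> texRef4;  // each texel is a 4-component vector

  __global__ void splitChannels(unsigned char *red, int n) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) {
          uchar4 t = tex1Dfetch(texRef4, i);  // fetch one 4-component texel
          red[i] = t.x;                       // components are .x, .y, .z, .w
      }
  }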

Programmer's View of Textures
Texture reference object:
– Use it to access the elements
– Tells CUDA what the texture looks like
Space to hold the values:
– Linear memory (a portion of global memory) – only for 1D textures
– CUDA array – a special, opaque CUDA structure used for textures
Then you bind the two: space and reference.

Texture Reference Object
texture<Type, Dim, ReadMode> texRef;
– Type: the texel datatype
– Dim: 1, 2, or 3
– ReadMode: what values are returned
  cudaReadModeElementType – just the elements: what you write is what you get
  cudaReadModeNormalizedFloat – works for chars and shorts (signed or unsigned); values normalized to [0.0, 1.0]
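For instance (the names are illustrative):

  texture<float, 2, cudaReadModeElementType> tex2DFloat;             // 2D float texels, returned as-is
  texture<unsigned char, 1, cudaReadModeNormalizedFloat> tex1DByte;  // 8-bit texels, returned in [0.0, 1.0]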

CUDA Containers: Linear Memory
Bound to linear memory:
– Global memory (allocated with cudaMalloc()) is bound to a texture
– Only 1D
– Integer addressing
– No filtering, no addressing modes
– Returns either the element type or a normalized float

CUDA Containers: CUDA Arrays
Bound to CUDA arrays:
– A CUDA array is bound to a texture
– 1D, 2D, or 3D
– Float addressing (size-based or normalized)
– Filtering
– Addressing modes (clamping, wrapping)
– Returns either the element type or a normalized float

CUDA Texturing Steps
Host (CPU) code:
– Allocate/obtain memory (global linear memory or a CUDA array)
– Create a texture reference object (currently must be at file scope)
– Bind the texture reference to the memory/array
– When done: unbind the texture reference, free resources
Device (kernel) code:
– Fetch using the texture reference
– Linear memory textures: tex1Dfetch()
– Array textures: tex1D(), tex2D(), tex3D()

Texture Reference Parameters: Immutable
Specified at compile time:
– Type: texel type
  Basic int and float types
  CUDA 1-, 2-, or 4-element vectors
– Dimensionality: 1, 2, or 3
– Read mode:
  cudaReadModeElementType
  cudaReadModeNormalizedFloat – valid for 8- or 16-bit ints; returns [-1, 1] for signed, [0, 1] for unsigned

Texture Reference Parameters: Mutable
Can be changed at run time, and only for array textures:
– normalized: non-zero means the addressing range is [0, 1]
– filterMode: cudaFilterModePoint or cudaFilterModeLinear
– addressMode: cudaAddressModeClamp or cudaAddressModeWrap

Example: Linear Memory

  // declare texture reference (must be at file scope)
  // template arguments: texel type, dimensionality, read mode
  texture<unsigned short, 1, cudaReadModeElementType> texRef;

  // set up linear memory on the device
  unsigned short *dA = 0;
  cudaMalloc((void**)&dA, numBytes);

  // copy data from host to device
  cudaMemcpy(dA, hA, numBytes, cudaMemcpyHostToDevice);

  // bind texture reference to the linear memory
  cudaBindTexture(NULL, texRef, dA, numBytes);

How to Access Texels in Linear-Memory-Bound Textures
Type tex1Dfetch(texRef, int x);
where Type is the texel datatype.
Previous example:
– unsigned short value = tex1Dfetch(texRef, 10);
– Returns element 10
You can write to the memory holding the texture (dA was allocated with cudaMalloc), but it's a bad idea – no hardware guarantees.
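A minimal kernel sketch using the texRef bound in the previous example (the kernel name and output pointer are assumptions):

  __global__ void copyThroughTexture(unsigned short *out, int n) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n)
          out[i] = tex1Dfetch(texRef, i);  // read element i through the texture cache
  }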

CUDA Array Type
You have to specify two things:
– Channel format
– Dimensions
cudaMallocArray – 1D and 2D arrays
cudaMalloc3DArray – 3D arrays
Management functions:
– cudaMallocArray, cudaFreeArray,
– cudaMemcpyToArray, cudaMemcpyFromArray, ...

Channel Descriptors
Describe what data appears in each element:
– Think of images, for example: every element is an RGB value
The cudaChannelFormatDesc structure:
– int x, y, z, w: number of bits for each component (e.g., 8)
– enum cudaChannelFormatKind – one of:
  cudaChannelFormatKindSigned
  cudaChannelFormatKindUnsigned
  cudaChannelFormatKindFloat
Predefined constructors:
– cudaCreateChannelDesc<T>(void);
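As a sketch, a descriptor for 4-component, 8-bit unsigned texels (e.g., RGBA) can be built either way:

  // explicit constructor: bits per component (x, y, z, w) plus the format kind
  cudaChannelFormatDesc cf = cudaCreateChannelDesc(8, 8, 8, 8, cudaChannelFormatKindUnsigned);

  // or the templated helper, which fills in the same fields
  cudaChannelFormatDesc cf2 = cudaCreateChannelDesc<uchar4>();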

Example: Host Code for a 2D Array

  // declare texture reference (must be at file scope)
  texture<float, 2, cudaReadModeElementType> texRef;

  // set up the CUDA array
  cudaChannelFormatDesc cf = cudaCreateChannelDesc<float>();
  cudaArray *texArray = 0;
  cudaMallocArray(&texArray, &cf, dimX, dimY);
  cudaMemcpyToArray(texArray, 0, 0, hA, numBytes, cudaMemcpyHostToDevice);

  // specify mutable texture reference parameters
  texRef.normalized = 0;
  texRef.filterMode = cudaFilterModeLinear;
  texRef.addressMode[0] = cudaAddressModeClamp;
  texRef.addressMode[1] = cudaAddressModeClamp;

  // bind texture reference to the array
  cudaBindTextureToArray(texRef, texArray);

Accessing Texels in CUDA Arrays
Type tex1D(texRef, float x);
Type tex2D(texRef, float x, float y);
Type tex3D(texRef, float x, float y, float z);
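A minimal kernel sketch reading the 2D texture bound in the previous example (the kernel name and output pointer are assumptions; the +0.5f offsets address texel centers):

  __global__ void sampleTexture(float *out, int width, int height) {
      int x = blockIdx.x * blockDim.x + threadIdx.x;
      int y = blockIdx.y * blockDim.y + threadIdx.y;
      if (x < width && y < height)
          out[y * width + x] = tex2D(texRef, x + 0.5f, y + 0.5f);  // texel centers; other coords are filtered
  }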

At the End
cudaUnbindTexture(texRef);
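A teardown sketch matching the two earlier examples (dA and texArray are the names used there):

  cudaUnbindTexture(texRef);  // release the binding
  cudaFree(dA);               // linear-memory example
  cudaFreeArray(texArray);    // CUDA-array example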

Dimension Limits
In elements, not bytes:
– CUDA arrays: 1D: 8K; 2D: 64K x 32K; 3D: 2K x 2K x 2K
– Linear memory: 2^27, i.e., 128M elements
  For floats: 128M x 4 bytes = 512 MB
Not verified; info from Cyril Zeller of NVIDIA (forum post).

Textures are Optimized for 2D Locality
Regular array allocation is row-major.
Because of filtering, neighboring texels are accessed close together in time, so textures are laid out to preserve 2D locality.


Using Textures
Textures are read-only within a kernel.
A kernel can produce an array in linear memory (it cannot write CUDA arrays); that output can then be copied into a CUDA array and bound to a texture for the next kernel, as sketched below.
Linear memory can be copied to and from CUDA arrays:
– cudaMemcpyToArray(): copies linear memory to a CUDA array
– cudaMemcpyFromArray(): copies a CUDA array to linear memory
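A sketch of that two-kernel pattern (dOut is a hypothetical device buffer; texRef, texArray, and numBytes come from the 2D example):

  // ... kernel 1 has written its output to linear memory dOut ...
  cudaMemcpyToArray(texArray, 0, 0, dOut, numBytes, cudaMemcpyDeviceToDevice);  // into the CUDA array
  cudaBindTextureToArray(texRef, texArray);                                     // bind for the next kernel
  // ... launch kernel 2, which reads the data via tex2D() ...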

An Example: GPU Acceleration of Scalar Advection

CUDA Arrays
Read the CUDA Reference Manual; the relevant functions are the ones with "Array" in their names.
Remember:
– The array format is opaque
Pitch:
– Padding added to achieve good locality
– Some functions require this pitch to be passed as an argument
– Prefer those that take it from the array structure directly