ATI Stream Computing OpenCL™ Histogram Optimization Illustration Marc Romankewicz April 5, 2010.


1 ATI Stream Computing OpenCL™ Histogram Optimization Illustration Marc Romankewicz April 5, 2010

2 | ATI Stream Computing Update | Confidential | ATI Stream Computing – OpenCL™ Histogram Optimization Illustration
The problem
    for( many input values ) {
        histogram[ value ]++;
    }
Many scattered read-modify-write accesses into a small data structure
On CPU, scattered r-m-w goes to cache by default -> fast
On GPU, goes to __global by default -> worst case
Solution: use __local memory & parallelize histogram compute
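
For reference, the loop on this slide can be written out as a serial CPU baseline. This is a minimal sketch, assuming 8-bit input values and 256 bins; the name `histogram_serial` is illustrative and does not come from the slides.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NBINS 256  /* assumption: one bin per possible byte value */

/* Serial reference: one scattered read-modify-write per input byte. */
void histogram_serial(const uint8_t *input, size_t n, uint32_t hist[NBINS])
{
    assert(input != NULL || n == 0);
    memset(hist, 0, NBINS * sizeof hist[0]);
    for (size_t i = 0; i < n; i++)
        hist[input[i]]++;
}
```

A baseline like this is useful mainly for checking the output of an optimized version.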

3 SIMD GPU Algorithm
1. Thread fetches input data from __global to __private (registers)
2. Scatter into __local sub-histograms in group (multiple LDS banks per bin)
3. Reduce __local bins into single histogram per group, flush to __global
4. Reduce __global histograms (2nd kernel for global sync point)
[Diagram labels: Input Buffer, __global, SIMD, __local Histograms, __global]

4 SIMD GPU Algorithm
1. Thread fetches input data from __global to __private (registers)
2. Scatter into __local sub-histograms in group (multiple LDS banks per bin)
3. Reduce __local bins into single histogram per group, flush to __global
4. Reduce __global histograms (2nd kernel for global sync point)
[Diagram labels: Input Buffer, __global, SIMD, __local Histograms, __global]
Generic reduction performance
Input bytes processed, approximate numbers; ATI Radeon™ HD 5870, ATI Stream SDK v2.01
    Steps      Data reduction        Throughput
    1                                145 GB/s
    1+2        256 MB to 320 KB      109 GB/s
    1+2+3      320 KB to 256 KB      107 GB/s
    1+2+3+4    256 KB to 1 KB        103 GB/s
Configuration: AMD Phenom™ 9950 X4 processor @ 2.60 GHz, MSI K9A2 Platinum, 8 GB RAM, Windows® 7 32-bit, ATI Radeon™ HD 5870 GPU, ATI Stream SDK v2.01, ATI Catalyst™ 10.2
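
The four steps can be emulated on the CPU to check that per-group sub-histograms followed by a final reduction give the same counts as a single pass. The sketch below is illustrative only: the function names are hypothetical, each emulated group processes a contiguous chunk for simplicity, and the LDS banking of step 2 is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NBINS 256

/* Steps 1-3 for one emulated work-group: build a sub-histogram in a
 * local array (standing in for __local memory) and flush it to the
 * group's slot in the shared buffer (standing in for __global). */
static void group_histogram(const uint8_t *input, size_t begin, size_t end,
                            uint32_t *group_out)
{
    uint32_t local_hist[NBINS] = {0};
    for (size_t i = begin; i < end; i++)
        local_hist[input[i]]++;
    memcpy(group_out, local_hist, sizeof local_hist);
}

/* Step 4: sum the per-group histograms into the first NBINS entries,
 * as a second kernel would do after the global sync point. */
static void reduce_groups(uint32_t *hists, size_t n_groups)
{
    for (size_t bin = 0; bin < NBINS; bin++) {
        uint32_t sum = 0;
        for (size_t g = 0; g < n_groups; g++)
            sum += hists[g * NBINS + bin];
        hists[bin] = sum;
    }
}

/* hists must hold n_groups * NBINS counters; the final histogram
 * ends up in hists[0 .. NBINS-1]. */
void histogram_two_level(const uint8_t *input, size_t n,
                         uint32_t *hists, size_t n_groups)
{
    assert(n_groups > 0);
    size_t chunk = (n + n_groups - 1) / n_groups;  /* ceil(n / n_groups) */
    for (size_t g = 0; g < n_groups; g++) {
        size_t begin = g * chunk < n ? g * chunk : n;
        size_t end = begin + chunk < n ? begin + chunk : n;
        group_histogram(input, begin, end, hists + g * NBINS);
    }
    reduce_groups(hists, n_groups);
}
```

Writing the final result over the first sub-histogram mirrors the in-place reduction used by the second kernel later in the deck.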

5 Kernel launch setup
At least as many threads as needed to optimally fetch input:
[Chart: Group size]

6 Launch setup: assorted lore
At least 1 group per SIMD
3-4 wavefronts per SIMD to keep SIMD stages busy (2 ALU, 1 fetch, 1 export)
For memory-bound kernels: >= 7 wavefronts per SIMD for __global latency hiding (> 8k threads on AMD “Cypress” GPU)
Per-thread and per-group costs become noticeable at high thread counts (i.e. 1 thread per DWORD 4-vec)
Good experimental starting point: 64 and/or 128 threads/group, >= 16k threads (on AMD “Cypress” GPU)
On CPU: as few threads as possible, e.g. 1x to 2x the number of compute units
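
The "> 8k threads" figure can be reproduced with a little arithmetic. The sketch below assumes 20 SIMD engines and 64-thread wavefronts for the Radeon HD 5870 ("Cypress"); neither number is stated on the slides, and `min_threads` is a hypothetical helper.

```c
#include <assert.h>

/* Threads needed to keep wavefronts_per_simd wavefronts resident
 * on every SIMD of the device. */
unsigned min_threads(unsigned n_simds, unsigned wavefronts_per_simd,
                     unsigned wavefront_size)
{
    assert(n_simds > 0 && wavefront_size > 0);
    return n_simds * wavefronts_per_simd * wavefront_size;
}
```

Under those assumptions, `min_threads(20, 7, 64)` gives 8960, which is consistent with the slide's "> 8k threads".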

7 Launch setup, histogram
Larger group size: better __local sharing between threads
Smaller group size: __local reduction gets more expensive
Experimental peak at 256 threads/group, 64k threads

8 Launch setup, histogram, cont’d
    #define NBINS 256
    main() {
        nThreads = 64 * 1024;
        nThreadsPerGroup = 256;
        nGroups = nThreads / nThreadsPerGroup;              /* 256 groups */
        n4Vectors = 4096 * 4096;
        n4VectorsPerThread = n4Vectors / nThreads;          /* 256 4-vectors per thread */
        inputNBytes = n4Vectors * sizeof(cl_uint4);         /* 256 MB */
        outputNBytes = nGroups * NBINS * sizeof(cl_uint);   /* 256 KB */
        …
(Static setup for benchmarking purposes only; a real app will take into account the image size and GPU type: wavefront size, number of compute units.)
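
The setup arithmetic on this slide can be checked with a small host-side helper. This is a sketch in plain C; `launch_config` and `make_launch_config` are hypothetical names, and `4 * sizeof(uint32_t)` stands in for `sizeof(cl_uint4)` so that no OpenCL headers are needed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NBINS 256

typedef struct {
    size_t n_groups;               /* work-groups launched */
    size_t n4_vectors_per_thread;  /* loop trip count per thread */
    size_t input_nbytes;           /* size of the input buffer */
    size_t output_nbytes;          /* one NBINS histogram per group */
} launch_config;

launch_config make_launch_config(size_t n_threads, size_t n_threads_per_group,
                                 size_t n4_vectors)
{
    /* The slide's static setup divides evenly; this sketch requires that. */
    assert(n_threads > 0 && n_threads_per_group > 0);
    assert(n_threads % n_threads_per_group == 0);
    assert(n4_vectors % n_threads == 0);

    launch_config cfg;
    cfg.n_groups = n_threads / n_threads_per_group;
    cfg.n4_vectors_per_thread = n4_vectors / n_threads;
    cfg.input_nbytes = n4_vectors * 4 * sizeof(uint32_t);
    cfg.output_nbytes = cfg.n_groups * NBINS * sizeof(uint32_t);
    return cfg;
}
```

With the slide's values (64 * 1024 threads, 256 threads per group, 4096 * 4096 4-vectors) this yields 256 groups, 256 4-vectors per thread, a 256 MB input buffer, and a 256 KB output buffer.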

9 Kernel
    __kernel void histogramKernel( __global uint4 *Image,
                                   __global uint *Histogram,
                                   uint n4VectorsPerThread )
    {
        __local uint subhists[NBANKS * NBINS];
        …
Input buffer is processed as 4-vectors
Output buffer holds sub- and final histograms (256 bins * 256 groups * cl_uint = 256 KB)
__local buffer holds work-group sub-histograms (256 bins * 16 banks * cl_uint = 16 KB per SIMD)

10 Kernel, parallel LDS clear
    __local uint2 *p = (__local uint2 *) subhists;
    if( ltid < lmem_max_threads ) {
        for( … )
            p[idx] = 0;
    }
    barrier( CLK_LOCAL_MEM_FENCE );
Significant difference compared to a single-thread clear (4.5x)
Slightly faster as uint2 vs. uint (2x more LDS requests per instruction)

11 Kernel, coalesced access
    uint tid = get_global_id(0);
    uint Stride = get_global_size(0);
    uint4 temp;
    for( i=0, idx = tid; i < n4VectorsPerThread; i++, idx += Stride ) {
        temp = Image[idx];
        …
Each thread starts at its global thread ID
Stride is the number of threads
Resulting pattern over all threads is optimally coalesced
[Diagram: Loop 0, Loop 1, Loop 2; labeled get_global_size(0)]

12 Kernel, serial access
    uint tid = get_global_id(0);
    uint4 temp;
    for( i=0, idx = tid*n4VectorsPerThread; i<n4VectorsPerThread; i++, idx++ ) {
        temp = Image[idx];
Each thread reads a block with stride 1
Resulting pattern is bad for uncached __global
OK on CPU and on GPU when cached
[Diagram: Loop 0, Loop 1, Loop 2; labeled n4VectorsPerThread]
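
The difference between the coalesced and the serial loop is easiest to see by writing out which element a given thread touches in a given iteration. A minimal sketch in plain C follows; both function names are illustrative and not from the slides.

```c
#include <assert.h>
#include <stddef.h>

/* Coalesced: in iteration i, thread tid reads element tid + i * n_threads,
 * so neighbouring threads touch neighbouring elements. */
size_t coalesced_index(size_t tid, size_t i, size_t n_threads)
{
    assert(tid < n_threads);
    return tid + i * n_threads;
}

/* Serial: thread tid owns a contiguous block of per_thread elements,
 * so neighbouring threads are per_thread elements apart. */
size_t serial_index(size_t tid, size_t i, size_t per_thread)
{
    assert(i < per_thread);
    return tid * per_thread + i;
}
```

With 4 threads, the coalesced pattern has threads 0..3 read elements 0..3 in the first iteration. With 3 elements per thread, the serial pattern has threads 0 and 1 read elements 0 and 3 at the same time.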

13 Coalesced vs. serial access
[Chart: group size 64]

14 Kernel: 4-vector pixel mask & shift
1. fetch: XYZWXYZWXYZWXYZW
2. mask:  ___W___W___W___W
3. shift: _XYZ_XYZ_XYZ_XYZ
4. mask:  ___Z___Z___Z___Z
5. shift: __XY__XY__XY__XY
6. mask:  ___Y___Y___Y___Y
7. …
Performs better than generic uchar4/uchar16
    #define NBANKS 16
    uint offset = (uint) ltid % (uint) (NBANKS);
    for( i=0, idx=tid; i<n4VectorsPerThread; i++, idx += Stride ) {
        temp = Image[idx];
        temp2 = (temp & msk) * (uint4) NBANKS + offset;
        …
        temp = temp >> shft;
        temp2 = (temp & msk) * (uint4) NBANKS + offset;
        …
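
The mask-and-shift sequence can be tried on a single 32-bit word. In this sketch the values of `msk` (0xFF) and `shft` (8) are assumptions that match 256 bins and byte-sized pixels; the slides do not give them. `unpack_bytes` is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Split one 32-bit word into its four bytes, lowest byte first,
 * by repeating "mask, then shift" as in the list above. */
void unpack_bytes(uint32_t word, uint8_t out[4])
{
    const uint32_t msk = 0xFFu;  /* assumed mask: keep the low 8 bits */
    const unsigned shft = 8;     /* assumed shift: one byte per step */

    assert(out != NULL);
    for (int k = 0; k < 4; k++) {
        out[k] = (uint8_t)(word & msk);
        word >>= shft;
    }
}
```

In the kernel the same two operations are applied to all four components of a `uint4` at once, so each mask step yields four bin values.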

15 Kernel, atomic scatter
    #define NBANKS 16
    uint offset = (uint) ltid % (uint) (NBANKS);
    for( i=0, idx=tid; i<n4VectorsPerThread; i++, idx += Stride ) {
        temp = Image[idx];
        temp2 = (temp & msk) * (uint4) NBANKS + offset;
        (void) atom_inc( subhists + temp2.x );
        (void) atom_inc( subhists + temp2.y );
        (void) atom_inc( subhists + temp2.z );
        (void) atom_inc( subhists + temp2.w );
        …

16 Kernel, LDS banks
    #define NBANKS 16
    uint offset = (uint) ltid % (uint) (NBANKS);
    for( i=0, idx=tid; i<n4VectorsPerThread; i++, idx += Stride ) {
        temp = Image[idx];
        temp2 = (temp & msk) * (uint4) NBANKS + offset;
        …
[Diagram: bin number stored at each LDS address]
    NBANKS = 1
        LDS addr 0:     0123456789ABCDEF
    NBANKS = 16
        LDS addr 0:     0000000000000000
        LDS addr 0x10:  1111111111111111
        LDS addr 0x20:  2222222222222222
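
The address computation `bin * NBANKS + offset` can be isolated as a small function to see how threads that hit the same bin are spread over different counters. This is a plain C sketch; `banked_index` is a hypothetical name and `NBINS` is assumed to be 256 as elsewhere in the deck.

```c
#include <assert.h>
#include <stdint.h>

#define NBINS  256
#define NBANKS 16

/* Counter slot for a bin and a local thread ID: the NBANKS copies of each
 * bin sit next to each other, and a thread always uses copy ltid % NBANKS. */
uint32_t banked_index(uint32_t bin, uint32_t ltid)
{
    assert(bin < NBINS);
    uint32_t offset = ltid % NBANKS;
    return bin * NBANKS + offset;
}
```

Threads 0 to 15 incrementing bin 0 use slots 0 to 15 rather than all contending for slot 0; thread 16 wraps around to slot 0 again.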

17 LDS banking performance
Effective LDS rate: > 900 GB/sec

18 Kernel, LDS reduction
    barrier( CLK_LOCAL_MEM_FENCE );
    if( ltid < NBINS ) {
        uint bin = 0;
        for( i=0; i<NBANKS; i++ )
            bin += subhists[ (ltid * NBANKS) + i ];
        Histogram[ (get_group_id(0) * NBINS) + ltid ] = bin;
    }
[Diagram: LDS addr 0 (0000000000000000) and LDS addr 0x10 (1111111111111111) reduce into __global bins 0123456789]
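
The per-group reduction collapses the NBANKS counters of each bin into one value. A serial sketch of the same index arithmetic in plain C, with the hypothetical name `reduce_banks`; in the kernel each of the first NBINS threads handles one bin in parallel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NBINS  256
#define NBANKS 16

/* Sum the NBANKS counters of every bin in subhists (NBINS * NBANKS
 * entries, bin-major) into hist (NBINS entries). */
void reduce_banks(const uint32_t *subhists, uint32_t *hist)
{
    assert(subhists != NULL && hist != NULL);
    for (size_t bin = 0; bin < NBINS; bin++) {
        uint32_t sum = 0;
        for (size_t i = 0; i < NBANKS; i++)
            sum += subhists[bin * NBANKS + i];
        hist[bin] = sum;
    }
}
```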

19 Kernel 2, __global reduction
    __kernel void reduceKernel( __global uint *Histogram, uint nSubHists )
    {
        uint tid = get_global_id(0);
        uint bin = 0;
        for( int i=0; i < nSubHists; i++ )
            bin += Histogram[ (i * NBINS) + tid ];
        Histogram[ tid ] = bin;
    }
[Diagram: several __global sub-histograms (bins 0123456789 each) reduce into one __global histogram]

20 Single component vs. 4-vector
4-vectors work best for many cases.
Some corner cases can be faster using single component access.
For absolute peak performance, it’s worth trying both.

21 Single component vs. 4-vector, histogram
    for( i=0, idx=tid; i<n4VectorsPerThread; i++, idx += Stride ) {
        temp = Image[idx];
        temp2 = (temp & msk) * (uint4) NBANKS + offset;
        (void) atom_inc( subhists + temp2.x );
        (void) atom_inc( subhists + temp2.y );
        (void) atom_inc( subhists + temp2.z );
        (void) atom_inc( subhists + temp2.w );
        temp = temp >> shft;

22 Single component vs. 4-vector, histogram, cont’d
    for( i=0, idx=tid; i<n4VectorsPerThread; i++, idx += Stride ) {
        temp.x = Image[idx].x;
        temp.y = Image[idx].y;
        temp.z = Image[idx].z;
        temp.w = Image[idx].w;
        temp2.x = (temp.x & msk) * (uint) NBANKS + offset;
        temp2.y = (temp.y & msk) * (uint) NBANKS + offset;
        temp2.z = (temp.z & msk) * (uint) NBANKS + offset;
        temp2.w = (temp.w & msk) * (uint) NBANKS + offset;
        (void) atom_inc( subhists + temp2.x );
        (void) atom_inc( subhists + temp2.y );
        (void) atom_inc( subhists + temp2.z );
        (void) atom_inc( subhists + temp2.w );
        temp.x = temp.x >> shft;
        temp.y = temp.y >> shft;
        temp.z = temp.z >> shft;
        temp.w = temp.w >> shft;
10% faster!

23 Disclaimer & Attribution
DISCLAIMER
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
ATTRIBUTION
© 2010 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo, ATI, the ATI Logo, AMD Phenom, Catalyst, Radeon, and combinations thereof are trademarks of Advanced Micro Devices, Inc. Microsoft, Windows, and Windows Vista are trademarks of Microsoft Corporation in the United States and/or other jurisdictions. Other names are for informational purposes only and may be trademarks of their respective owners. OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos.

