
1 ATI Stream Computing ACML-GPU – SGEMM Optimization Illustration
Micah Villmow, May 30, 2008

2 AMD Core Math Library
Full implementation of Level 1, 2, and 3 Basic Linear Algebra Subprograms (BLAS), with optimizations for AMD Opteron™ processors
A full suite of Linear Algebra (LAPACK) routines
A comprehensive suite of Fast Fourier Transforms (FFTs) in single, double, single-complex, and double-complex data types
Fast scalar, vector, and array transcendental math routines optimized for AMD Opteron processors
Single- and double-precision random number generators
Load balancing between multiple CPU cores and GPU cores

3 SGEMM – C = α * A * B + β * C
Similar to a simple matrix multiply, but:
The A * B result must be multiplied by alpha
The existing C must be multiplied by beta
Theoretical algorithmic peak is ~½ of the theoretical ALU peak
Issues:
Kernel performance != real-world performance
Memory-bound algorithm vs. texture-bound kernel
Balancing register count versus wavefronts in flight
(a C reference for the operation follows below)
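As a reference point, here is a minimal single-threaded SGEMM in C. This is only a sketch of the semantics being computed, assuming row-major storage with no leading-dimension or transpose options; it is not the GPU implementation discussed in these slides.

#include <stddef.h>

/* Naive reference SGEMM: C = alpha * A * B + beta * C
 * A is M x K, B is K x N, C is M x N, all row-major. */
void sgemm_ref(size_t M, size_t N, size_t K,
               float alpha, const float *A, const float *B,
               float beta, float *C)
{
    for (size_t i = 0; i < M; ++i) {
        for (size_t j = 0; j < N; ++j) {
            float acc = 0.0f;
            for (size_t k = 0; k < K; ++k)
                acc += A[i * K + k] * B[k * N + j];
            C[i * N + j] = alpha * acc + beta * C[i * N + j];
        }
    }
}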

4 Case Study
Study the kernels to find the bottlenecks
Determine the actions needed to fix the bottlenecks
Determine how close the fixes come to theoretical peak
Work on improving system time
Memory/execution parallelization

5 Kernel 1 - Setup
il_ps_2_0
dcl_input_position_interp(linear_noperspective) vWinCoord.xy__
dcl_cb cb0[2]
dcl_output_generic o0
dcl_output_generic o1
dcl_output_generic o2
dcl_output_generic o3
dcl_resource_id(0)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
dcl_resource_id(1)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
dcl_resource_id(2)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
dcl_resource_id(3)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
dcl_resource_id(4)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
dcl_resource_id(5)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
dcl_resource_id(6)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
dcl_resource_id(7)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
dcl_literal l0, 0x00000000, 0x00000000, 0x00000000, 0x00000000
dcl_literal l3, 0x3F800000, 0x3F800000, 0x3F800000, 0x3F800000
add r25.xy__, vWinCoord.xyxx, cb0[0].xyxx
mov r25.__zw, l0
mov r8, l0
mov r9, l0
mov r10, l0
mov r11, l0
mov r12, l0
mov r13, l0
mov r14, l0
mov r15, l0
mov r16, l0
mov r17, l0
mov r18, l0
mov r19, l0
mov r20, l0
mov r21, l0
mov r22, l0
mov r23, l0

6 Kernel 1 – Loop & Output
whileloop
ge r2._y__, r25.wwww, cb0[0].zzzz
break_logicalnz r2.y
sample_resource(0)_sampler(0) r0, r25.wyzz
sample_resource(1)_sampler(1) r1, r25.wyzz
sample_resource(2)_sampler(2) r2, r25.wyzz
sample_resource(3)_sampler(3) r3, r25.wyzz
sample_resource(4)_sampler(4) r4, r25.wxzz
sample_resource(5)_sampler(5) r5, r25.wxzz
sample_resource(6)_sampler(6) r6, r25.wxzz
sample_resource(7)_sampler(7) r7, r25.wxzz
mad r8, r0, r4, r8
mad r9, r0, r5, r9
mad r10, r0, r6, r10
mad r11, r0, r7, r11
mad r12, r1, r4, r12
mad r13, r1, r5, r13
mad r14, r1, r6, r14
mad r15, r1, r7, r15
mad r16, r2, r4, r16
mad r17, r2, r5, r17
mad r18, r2, r6, r18
mad r19, r2, r7, r19
mad r20, r3, r4, r20
mad r21, r3, r5, r21
mad r22, r3, r6, r22
mad r23, r3, r7, r23
add r25.___w, r25.w, l3
endloop
dp4 o0.x___, r8, cb0[0].wwww
dp4 o0._y__, r9, cb0[0].wwww
dp4 o0.__z_, r10, cb0[0].wwww
dp4 o0.___w, r11, cb0[0].wwww
dp4 o1.x___, r12, cb0[0].wwww
dp4 o1._y__, r13, cb0[0].wwww
dp4 o1.__z_, r14, cb0[0].wwww
dp4 o1.___w, r15, cb0[0].wwww
dp4 o2.x___, r16, cb0[0].wwww
dp4 o2._y__, r17, cb0[0].wwww
dp4 o2.__z_, r18, cb0[0].wwww
dp4 o2.___w, r19, cb0[0].wwww
dp4 o3.x___, r20, cb0[0].wwww
dp4 o3._y__, r21, cb0[0].wwww
dp4 o3.__z_, r22, cb0[0].wwww
dp4 o3.___w, r23, cb0[0].wwww
ret_dyn
end
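In C-like terms, each thread accumulates a 4x4 block of C in sixteen float4 registers (r8..r23) and collapses each register with a dot product scaled by alpha (cb0[0].w) at the end; the beta*C term is not part of this kernel. The sketch below is one interpretation of the IL above: the fetch_a()/fetch_b() helpers, the gA/gB globals, and the row/column layout are assumptions standing in for the eight texture samples.

#include <stddef.h>

typedef struct { float x, y, z, w; } float4;

/* Hypothetical stand-ins for the eight texture fetches: A is M x K and
 * B is K x N, both row-major; K is assumed to be a multiple of 4. */
static const float *gA, *gB;
static int gK, gN;

static float4 fetch_a(int i, int k)   /* A[i][4k .. 4k+3] */
{
    const float *p = gA + (size_t)i * gK + 4 * k;
    return (float4){ p[0], p[1], p[2], p[3] };
}

static float4 fetch_b(int j, int k)   /* B[4k .. 4k+3][j] */
{
    const float *p = gB + (size_t)(4 * k) * gN + j;
    return (float4){ p[0], p[gN], p[2 * gN], p[3 * gN] };
}

/* One "thread": computes the 4x4 block of C at (row, col). */
void kernel1_thread(int row, int col, int loop_count, float alpha,
                    float out[4][4])
{
    float4 acc[4][4] = {0};                 /* r8..r23 */
    for (int k = 0; k < loop_count; ++k) {  /* whileloop */
        float4 a[4], b[4];
        for (int i = 0; i < 4; ++i) a[i] = fetch_a(4 * row + i, k);
        for (int j = 0; j < 4; ++j) b[j] = fetch_b(4 * col + j, k);
        for (int i = 0; i < 4; ++i)
            for (int j = 0; j < 4; ++j) {   /* the 16 float4 mads */
                acc[i][j].x += a[i].x * b[j].x;
                acc[i][j].y += a[i].y * b[j].y;
                acc[i][j].z += a[i].z * b[j].z;
                acc[i][j].w += a[i].w * b[j].w;
            }
    }
    for (int i = 0; i < 4; ++i)             /* dp4 by cb0[0].wwww */
        for (int j = 0; j < 4; ++j)
            out[i][j] = alpha * (acc[i][j].x + acc[i][j].y +
                                 acc[i][j].z + acc[i][j].w);
}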

7 Kernel 1 Stats – ATI Radeon™ HD 3870 GPU
GSA:
ALU: 52 ALU
TEX: 8 Fetches
OUT: 4 Color Writes
REG: 25 Registers
ISA:
ALU: Setup: 14 ALU; Loop: 18 ALU * LoopCount; Output: 20 ALU
TEX: 8 Fetches * LoopCount
LoopCount: (Width/4)
Perf Workbook (1k*1k):
Total Pixels:
Total ALU: 4642
Total Tex: 2048
Total Out: 4
ALU Time: ms
Tex Time: ms
0% Bandwidth: ms
75% Bandwidth: ms
Bottleneck:
0% cache hit rate: bandwidth; Theoretical Peak: Gflops
75% cache hit rate: TEX; Theoretical Peak: Gflops
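The workbook totals follow directly from the ISA counts: for a 1024-wide matrix, LoopCount = 1024/4 = 256. A quick check of the arithmetic:

#include <assert.h>

int main(void)
{
    int loop_count = 1024 / 4;                  /* Width/4 = 256 */
    int total_alu = 14 + 18 * loop_count + 20;  /* setup + loop + output */
    int total_tex = 8 * loop_count;             /* 8 fetches per iteration */
    assert(total_alu == 4642);                  /* matches the workbook */
    assert(total_tex == 2048);
    return 0;
}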

8 Kernel 2 - Setup
il_ps_2_0
dcl_input_position_interp(linear_noperspective) v0.xy__
dcl_cb cb0[2]
dcl_output_generic o0
dcl_output_generic o1
dcl_output_generic o2
dcl_output_generic o3
dcl_resource_id(0)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
dcl_resource_id(1)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
dcl_resource_id(2)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
dcl_resource_id(3)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
dcl_resource_id(4)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
dcl_resource_id(5)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
dcl_resource_id(6)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
dcl_resource_id(7)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
dcl_literal l0, 0x00000000, 0x00000000, 0x00000000, 0x00000000
dcl_literal l3, 0x3F800000, 0x3F800000, 0x3F800000, 0x3F800000
add r25.xy__, v0.xyxx, cb0[0].xyxx
mov r25.__zw, l0
mov r10, l0
mov r11, l0
mov r12, l0
mov r13, l0
mov r14, l0
mov r15, l0
mov r16, l0
mov r17, l0
mov r18, l0
mov r19, l0
mov r20, l0
mov r21, l0
mov r22, l0
mov r23, l0

9 Kernel 2 - Loop
whileloop
ge r2._y__, r25.w, cb0[0].z
break_logicalnz r2.y
sample_resource(0)_sampler(0) r0, r25.wyzz
sample_resource(1)_sampler(1) r1, r25.wyzz
sample_resource(2)_sampler(2) r2, r25.wyzz
sample_resource(3)_sampler(3) r3, r25.wyzz
sample_resource(4)_sampler(4) r4, r25.wxzz
sample_resource(5)_sampler(5) r5, r25.wxzz
sample_resource(6)_sampler(6) r6, r25.wxzz
sample_resource(7)_sampler(7) r7, r25.wxzz
add r25.___w, r25.w, l3
mad r10.x___, r0.x, r4.x, r10.x
mad r10.x___, r0.y, r4.y, r10.x
mad r10.x___, r0.z, r4.z, r10.x
mad r10.x___, r0.w, r4.w, r10.x
mad r10._y__, r0.x, r5.x, r10.y
mad r10._y__, r0.y, r5.y, r10.y
mad r10._y__, r0.z, r5.z, r10.y
mad r10._y__, r0.w, r5.w, r10.y
mad r10.__z_, r0.x, r6.x, r10.z
mad r10.__z_, r0.y, r6.y, r10.z
mad r10.__z_, r0.z, r6.z, r10.z
mad r10.__z_, r0.w, r6.w, r10.z
mad r10.___w, r0.x, r7.x, r10.w
mad r10.___w, r0.y, r7.y, r10.w
mad r10.___w, r0.z, r7.z, r10.w
mad r10.___w, r0.w, r7.w, r10.w
mad r11.x___, r1.x, r4.x, r11.x
mad r11.x___, r1.y, r4.y, r11.x
mad r11.x___, r1.z, r4.z, r11.x
mad r11.x___, r1.w, r4.w, r11.x
mad r11._y__, r1.x, r5.x, r11.y
mad r11._y__, r1.y, r5.y, r11.y
mad r11._y__, r1.z, r5.z, r11.y
mad r11._y__, r1.w, r5.w, r11.y
mad r11.__z_, r1.x, r6.x, r11.z
mad r11.__z_, r1.y, r6.y, r11.z
mad r11.__z_, r1.z, r6.z, r11.z
mad r11.__z_, r1.w, r6.w, r11.z
mad r11.___w, r1.x, r7.x, r11.w
mad r11.___w, r1.y, r7.y, r11.w
mad r11.___w, r1.z, r7.z, r11.w
mad r11.___w, r1.w, r7.w, r11.w
mad r12.x___, r2.x, r4.x, r12.x
mad r12.x___, r2.y, r4.y, r12.x
mad r12.x___, r2.z, r4.z, r12.x
mad r12.x___, r2.w, r4.w, r12.x
mad r12._y__, r2.x, r5.x, r12.y
mad r12._y__, r2.y, r5.y, r12.y
mad r12._y__, r2.z, r5.z, r12.y
mad r12._y__, r2.w, r5.w, r12.y
mad r12.__z_, r2.x, r6.x, r12.z
mad r12.__z_, r2.y, r6.y, r12.z
mad r12.__z_, r2.z, r6.z, r12.z
mad r12.__z_, r2.w, r6.w, r12.z
mad r12.___w, r2.x, r7.x, r12.w
mad r12.___w, r2.y, r7.y, r12.w
mad r12.___w, r2.z, r7.z, r12.w
mad r12.___w, r2.w, r7.w, r12.w
mad r13.x___, r3.x, r4.x, r13.x
mad r13.x___, r3.y, r4.y, r13.x
mad r13.x___, r3.z, r4.z, r13.x
mad r13.x___, r3.w, r4.w, r13.x
mad r13._y__, r3.x, r5.x, r13.y
mad r13._y__, r3.y, r5.y, r13.y
mad r13._y__, r3.z, r5.z, r13.y
mad r13._y__, r3.w, r5.w, r13.y
mad r13.__z_, r3.x, r6.x, r13.z
mad r13.__z_, r3.y, r6.y, r13.z
mad r13.__z_, r3.z, r6.z, r13.z
mad r13.__z_, r3.w, r6.w, r13.z
mad r13.___w, r3.x, r7.x, r13.w
mad r13.___w, r3.y, r7.y, r13.w
mad r13.___w, r3.z, r7.z, r13.w
mad r13.___w, r3.w, r7.w, r13.w
endloop

10 Kernel 2 - Output
mul o0.x___, r10.x, cb0[0].w
mul o0._y__, r10.y, cb0[0].w
mul o0.__z_, r10.z, cb0[0].w
mul o0.___w, r10.w, cb0[0].w
mul o1.x___, r11.x, cb0[0].w
mul o1._y__, r11.y, cb0[0].w
mul o1.__z_, r11.z, cb0[0].w
mul o1.___w, r11.w, cb0[0].w
mul o2.x___, r12.x, cb0[0].w
mul o2._y__, r12.y, cb0[0].w
mul o2.__z_, r12.z, cb0[0].w
mul o2.___w, r12.w, cb0[0].w
mul o3.x___, r13.x, cb0[0].w
mul o3._y__, r13.y, cb0[0].w
mul o3.__z_, r13.z, cb0[0].w
mul o3.___w, r13.w, cb0[0].w
ret_dyn
end
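The difference from Kernel 1 is where the lane reduction happens: Kernel 2 reduces each float4 product to a scalar inside the loop (four scalar mads per element), so each result lives in one register component and the epilogue is a plain multiply by alpha instead of a dp4. A minimal C sketch of a single output element, reusing the hypothetical float4/fetch_a/fetch_b definitions from the Kernel 1 sketch above:

/* Kernel 2 style: reduce inside the loop
 * (cf. mad r10.x___, r0.x, r4.x, r10.x ...). */
void kernel2_element(int row, int col, int loop_count, float alpha,
                     float *c00)
{
    float acc = 0.0f;   /* r10.x */
    for (int k = 0; k < loop_count; ++k) {
        float4 a0 = fetch_a(4 * row, k);
        float4 b0 = fetch_b(4 * col, k);
        acc += a0.x * b0.x;   /* four scalar mads per element */
        acc += a0.y * b0.y;
        acc += a0.z * b0.z;
        acc += a0.w * b0.w;
    }
    *c00 = alpha * acc;       /* mul o0.x___, r10.x, cb0[0].w */
}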

11 Kernel 2 Stats
GSA:
ALU: 31 ALU
TEX: 8 Fetches
OUT: 4 Color Writes
REG: 13 Registers
ISA:
ALU: Setup: 4 ALU; Loop: 19 ALU * LoopCount; Output: 8 ALU
TEX: 8 Fetches * LoopCount
LoopCount: (Width/4)
Perf Workbook (1k*1k):
Total Pixels:
Total ALU: 4876
Total Tex: 2048
Total Out: 4
ALU Time: ms
Tex Time: ms
0% CacheHit: ms
75% CacheHit: ms
Bottleneck:
0% cache hit rate: bandwidth; Theoretical Peak: Gflops
75% cache hit rate: TEX; Theoretical Peak: Gflops
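The same sanity check for Kernel 2's totals:

#include <assert.h>

int main(void)
{
    int loop_count = 1024 / 4;                 /* Width/4 = 256 */
    int total_alu = 4 + 19 * loop_count + 8;   /* setup + loop + output */
    assert(total_alu == 4876);                 /* matches the workbook */
    return 0;
}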

12 Kernel 1 vs. Kernel 2 Relative Performance
Why the performance difference?
Configuration: AMD Phenom™ 9950 X GHz, RD790 reference motherboard, 8 GB RAM, Windows® XP Professional x64 edition, ATI Radeon™ HD 3870, ATI Stream SDK v1.01-beta, ATI Catalyst™ 8.5

13 Unexpected Results
Must reexamine the kernel implementation:
Memory access pattern
Register/wavefront count
Timing issues
Cache pressure
Is this the fastest possible? It is possible to calculate the theoretical algorithmic peak:
2 * M * K * N / tex_time / (2^30)
4 outputs = Gflops
8 outputs = Gflops
What is causing the slowdown? Memory? (a worked version of the peak formula follows below)
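A sketch of that peak calculation in C. The factor of 2 counts one multiply and one add per inner-loop step; the tex_time value in main() is a made-up placeholder, not a measurement from the slides.

#include <stdio.h>

/* Theoretical algorithmic peak in Gflops: 2*M*K*N / tex_time / 2^30 */
double algorithmic_peak_gflops(double M, double K, double N,
                               double tex_time_seconds)
{
    return 2.0 * M * K * N / tex_time_seconds / (double)(1 << 30);
}

int main(void)
{
    /* Example 1024^3 problem; 0.016 s is purely illustrative. */
    printf("%.1f Gflops\n",
           algorithmic_peak_gflops(1024, 1024, 1024, 0.016));
    return 0;
}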

14 Memory Access Pattern
(Figure: matrices A, B, and C mapped across textures 1–4; the per-loop fetch addresses walk a row of A and a column strip of B.)
sample_resource(0)_sampler(0) r0, r25.wyzz
sample_resource(1)_sampler(1) r1, r25.wyzz
sample_resource(2)_sampler(2) r2, r25.wyzz
sample_resource(3)_sampler(3) r3, r25.wyzz
sample_resource(4)_sampler(4) r4, r25.wxzz
sample_resource(5)_sampler(5) r5, r25.wxzz
sample_resource(6)_sampler(6) r6, r25.wxzz
sample_resource(7)_sampler(7) r7, r25.wxzz
Loop 0: A data (Row, 0) and B data (0, Cols); fetch: one float4 per texture, 128B total
Loop 1: A data (Row, 1) and B data (1, Cols), 128B
Loop 2: A data (Row, 2) and B data (2, Cols), 128B
...
Loop N-1: A data (Row, N-1) and B data (N-1, Cols), 128B
For a width of 1024, N = 256. Total size: N * 128B = 32KB, so each thread fills up the L1 cache once.

15 Register/Wavefront Count
Kernel 1 uses 25 registers; Kernel 2 uses 13 registers
In CAL, if the register count is > 12, 160 physical registers are made available
Wavefront count = floor(Available / Used)
Kernel 1 gets floor(160 / 25) = 6 wavefronts
Kernel 2 gets floor(160 / 13) = 12 wavefronts
Kernel 1 peaks at 139.6; Kernel 2 peaks at 107.9 (see the sketch below)
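The wavefront arithmetic from this slide, written out in C:

#include <stdio.h>

/* Wavefronts in flight = floor(available / used). Per the slide, CAL
 * exposes 160 physical registers when a kernel uses more than 12. */
int wavefronts_in_flight(int regs_used)
{
    const int available = 160;
    return available / regs_used;   /* integer division == floor */
}

int main(void)
{
    printf("Kernel 1: %d wavefronts\n", wavefronts_in_flight(25)); /* 6  */
    printf("Kernel 2: %d wavefronts\n", wavefronts_in_flight(13)); /* 12 */
    return 0;
}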

16 Timing Issues
How many wavefronts are required to cover TEX latency?
TEX issues 16 instructions per cycle, so issuing across a whole 64-thread wavefront takes 4 cycles
An L1 cache hit is ~40 cycles; a TEX fetch is ~80 cycles; an L2 cache hit is … cycles
Assuming a 100% cache hit rate: 8 fetches * 4 cycles/fetch = 32 issue cycles, plus TEX (~80) and L1 (~40) latency = 152 cycles to hide
Kernel 1: 18 ALU * 6 wavefronts = 108 cycles, not enough to cover 152
Kernel 2: 19 ALU * 12 wavefronts = 228 cycles, enough (see the check below)
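A quick check of the cover calculation; the decomposition of the 152 cycles (issue cycles plus TEX and L1 latency) is one interpretation of the slide's arithmetic:

#include <stdio.h>

int main(void)
{
    int to_hide  = 8 * 4 + 80 + 40;  /* fetch issue + TEX + L1 = 152 */
    int k1_cover = 18 * 6;           /* ALU/iteration * wavefronts = 108 */
    int k2_cover = 19 * 12;          /* = 228 */
    printf("need %d cycles: kernel1 covers %d, kernel2 covers %d\n",
           to_hide, k1_cover, k2_cover);
    return 0;
}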

17 Cache Pattern – Per Texture
(Figure: cacheline footprint of a tiled wavefront per texture, comparing B and B^T: 4 CLs and 4 CLs versus 2 CLs.)

18 Cache Pressure – Kernel 1
(Figure: a 4x8K L1 cache against the A and B cachelines fetched through TEX. Each of Kernel 1's 6 wavefronts touches 16x16B (16 CLs) of A and 16x16B (16 CLs) of B per iteration, so all wavefronts fit in the cache at once.)

19 Cache Pressure – Kernel 2
(Figure: the same 4x8K L1 cache with Kernel 2's 12 wavefronts. Wavefronts 0–7 touch 16x16B (16 CLs) each of A and B, while wavefronts 8–11 show 32x16B (32 CL) footprints; the combined working set overflows the cache and wavefronts 0–3 are evicted.)

20 ACML-GPU DGEMM/SGEMM
(Figure: execution timeline; in the white spaces the CPU is idle while the GPU executes.)

21 Overall Performance - Memory
Memory is the bottleneck, not the kernel; it keeps overall performance below kernel peak
Compute smaller strips and overlap the memory copy with computation (see the sketch below)
Splitting the matrix over multiple GPUs is also a possibility
Use calResCreate2D to remove the user->kernel copy
Find the optimal computation strips for the hardware
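A minimal sketch of the strip-mining overlap, with hypothetical upload_strip_async(), wait_for(), and run_kernel() helpers standing in for the real CAL resource and event calls; this shows the pattern, not the ACML-GPU implementation.

/* Hypothetical helpers; real code would use CAL resources and events. */
void upload_strip_async(int strip);   /* start copying a strip to the GPU */
void wait_for(int strip);             /* block until that copy finishes   */
void run_kernel(int strip);           /* SGEMM on one resident strip      */

void sgemm_strips(int num_strips)
{
    upload_strip_async(0);
    for (int s = 0; s < num_strips; ++s) {
        if (s + 1 < num_strips)
            upload_strip_async(s + 1);   /* copy the next strip...        */
        wait_for(s);
        run_kernel(s);                   /* ...while computing this one   */
    }
}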

22 Solutions?
As seen on the previous slide, 8 wavefronts fit perfectly in cache: 16 CLs from A + 16 CLs from B = 32 CLs per wavefront, and 256 CLs in L1 / 32 CLs per wavefront = 8 wavefronts
Create garbage calculations to pad the register count to 20 registers (160 registers / 8 wavefronts)
Transposing the B matrix removes 8 CL requests per iteration, for a total of 24 CL requests per iteration
256 CLs / 24 CLs = 10 wavefronts, a 25% improvement
160 registers / 10 wavefronts = 16 registers max
(the arithmetic is worked through below)
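The occupancy arithmetic from this slide, worked through in C:

#include <stdio.h>

int main(void)
{
    const int l1_cls = 256;            /* cachelines in L1 */
    int per_wf    = 16 + 16;           /* A CLs + B CLs = 32 */
    int per_wf_bt = 16 + 8;            /* with B transposed = 24 */
    int wfs    = l1_cls / per_wf;      /* 8 wavefronts  */
    int wfs_bt = l1_cls / per_wf_bt;   /* 10 wavefronts */
    printf("wavefronts: %d -> %d (%d%% improvement)\n",
           wfs, wfs_bt, 100 * (wfs_bt - wfs) / wfs);
    printf("register budget: %d -> %d registers max\n",
           160 / wfs, 160 / wfs_bt);
    return 0;
}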

23 Final Results: 94% of Theoretical Peak
Configuration: AMD Phenom™ 9950 X GHz, RD790 reference motherboard, 8 GB RAM, Windows® XP Professional x64 edition, ATI Radeon™ HD 3870, ATI Stream SDK v1.01-beta, ATI Catalyst™ 8.5

24 Disclaimer & Attribution
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes. AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTION
© 2009 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo, ATI, the ATI Logo, AMD Opteron, AMD Phenom, Catalyst, Radeon, and combinations thereof are trademarks of Advanced Micro Devices, Inc. Microsoft, Windows, and Windows Vista are trademarks of Microsoft Corporation in the United States and/or other jurisdictions. Other names are for informational purposes only and may be trademarks of their respective owners.

