
ATI Stream Computing ACML-GPU – SGEMM Optimization Illustration Micah Villmow May 30, 2008.


1 ATI Stream Computing ACML-GPU – SGEMM Optimization Illustration Micah Villmow May 30, 2008

2 | ATI Stream Computing Update | Confidential | ATI Stream Computing – ACML-GPU – SGEMM Optimization Illustration

AMD Core Math Library
- Full implementation of Level 1, 2, and 3 Basic Linear Algebra Subprograms (BLAS), with optimizations for AMD Opteron™ processors.
- A full suite of Linear Algebra (LAPACK) routines.
- Comprehensive suite of Fast Fourier Transforms (FFTs) in single, double, single-complex, and double-complex data types.
- Fast scalar, vector, and array transcendental math routines optimized for AMD Opteron processors.
- Single- and double-precision random number generators.
- Load balancing between multiple CPU cores and GPU cores.

3 | SGEMM – C = α * A * B + β * C

- Similar to a simple matrix multiply, but the product must be scaled by alpha and the existing C must be scaled by beta.
- Theoretical algorithmic peak is ~½ of the theoretical ALU peak.
- Issues:
  - Kernel performance != real-world performance
  - Memory-bound algorithm vs. texture-bound kernel
  - Balancing register count against wavefronts in flight
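To make the operation above concrete, here is a minimal scalar sketch of SGEMM in Python. This is a reference for the math only, not the GPU kernel: plain lists of lists, dimensions M×K, K×N, M×N.

```python
def sgemm(alpha, A, B, beta, C):
    """Reference C = alpha * A * B + beta * C on lists of lists."""
    M, K, N = len(A), len(B), len(B[0])
    for i in range(M):
        for j in range(N):
            acc = 0.0
            for k in range(K):
                acc += A[i][k] * B[k][j]      # the mad chain in the kernel loop
            C[i][j] = alpha * acc + beta * C[i][j]  # alpha/beta applied at output
    return C
```

The kernels below compute exactly this, but 4×4 blocks of C at a time with float4 fetches.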

4 | Case Study

- Study the kernels to find the bottlenecks.
- Determine the actions needed to fix the bottlenecks.
- Determine how close to theoretical peak the fixes come.
- Work on improving system time: memory/execution parallelization.

5 | Kernel 1 – Setup

il_ps_2_0
dcl_input_position_interp(linear_noperspective) vWinCoord.xy__
dcl_cb cb0[2]
dcl_output_generic o0
dcl_output_generic o1
dcl_output_generic o2
dcl_output_generic o3
dcl_resource_id(0)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
dcl_resource_id(1)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
dcl_resource_id(2)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
dcl_resource_id(3)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
dcl_resource_id(4)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
dcl_resource_id(5)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
dcl_resource_id(6)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
dcl_resource_id(7)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
dcl_literal l0, 0x , 0x , 0x , 0x
dcl_literal l3, 0x3F800000, 0x3F800000, 0x3F800000, 0x3F
add r25.xy__, vWinCoord.xyxx, cb0[0].xyxx
mov r25.__zw, l0
mov r8, l0
mov r9, l0
mov r10, l0
mov r11, l0
mov r12, l0
mov r13, l0
mov r14, l0
mov r15, l0
mov r16, l0
mov r17, l0
mov r18, l0
mov r19, l0
mov r20, l0
mov r21, l0
mov r22, l0
mov r23, l0

6 | Kernel 1 – Loop & Output

whileloop
    ge r2._y__, r25.wwww, cb0[0].zzzz
    break_logicalnz r2.y
    sample_resource(0)_sampler(0) r0, r25.wyzz
    sample_resource(1)_sampler(1) r1, r25.wyzz
    sample_resource(2)_sampler(2) r2, r25.wyzz
    sample_resource(3)_sampler(3) r3, r25.wyzz
    sample_resource(4)_sampler(4) r4, r25.wxzz
    sample_resource(5)_sampler(5) r5, r25.wxzz
    sample_resource(6)_sampler(6) r6, r25.wxzz
    sample_resource(7)_sampler(7) r7, r25.wxzz
    mad r8, r0, r4, r8
    mad r9, r0, r5, r9
    mad r10, r0, r6, r10
    mad r11, r0, r7, r11
    mad r12, r1, r4, r12
    mad r13, r1, r5, r13
    mad r14, r1, r6, r14
    mad r15, r1, r7, r15
    mad r16, r2, r4, r16
    mad r17, r2, r5, r17
    mad r18, r2, r6, r18
    mad r19, r2, r7, r19
    mad r20, r3, r4, r20
    mad r21, r3, r5, r21
    mad r22, r3, r6, r22
    mad r23, r3, r7, r23
    add r25.___w, r25.w, l3
endloop
dp4 o0.x___, r8, cb0[0].wwww
dp4 o0._y__, r9, cb0[0].wwww
dp4 o0.__z_, r10, cb0[0].wwww
dp4 o0.___w, r11, cb0[0].wwww
dp4 o1.x___, r12, cb0[0].wwww
dp4 o1._y__, r13, cb0[0].wwww
dp4 o1.__z_, r14, cb0[0].wwww
dp4 o1.___w, r15, cb0[0].wwww
dp4 o2.x___, r16, cb0[0].wwww
dp4 o2._y__, r17, cb0[0].wwww
dp4 o2.__z_, r18, cb0[0].wwww
dp4 o2.___w, r19, cb0[0].wwww
dp4 o3.x___, r20, cb0[0].wwww
dp4 o3._y__, r21, cb0[0].wwww
dp4 o3.__z_, r22, cb0[0].wwww
dp4 o3.___w, r23, cb0[0].wwww
ret_dyn
end
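As a reading aid (my scalar rendering, not AMD-provided code): each Kernel 1 loop iteration takes 4 float4 fetches from A (r0–r3) and 4 from B (r4–r7) and feeds them to 16 vec4 mads that accumulate a 4×4 grid of vec4 partial sums; the dp4s at the end reduce each vec4 to one scalar while multiplying by alpha (broadcast from cb0[0].w).

```python
def k1_mad_step(a, b, acc):
    """One loop iteration: a, b are 4 vectors of 4 floats (r0-r3, r4-r7);
    acc is a 4x4 grid of 4-float accumulators (r8-r23)."""
    for i in range(4):
        for j in range(4):
            acc[i][j] = [acc[i][j][c] + a[i][c] * b[j][c] for c in range(4)]
    return acc

def dp4(v, scale):
    # output stage: dp4 against cb0[0].wwww (alpha broadcast to all lanes)
    return sum(x * scale for x in v)
```

This is why the kernel's algorithmic structure is a dot product striped across vector lanes: the per-lane products stay separate until the final dp4.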

7 | Kernel 1 Stats – ATI Radeon™ HD 3870 GPU

GSA:
- ALU: 52 ALU
- TEX: 8 fetches
- OUT: 4 color writes
- REG: 25 registers

ISA:
- ALU: Setup 14 ALU, Loop 18 ALU * LoopCount, Output 20 ALU
- TEX: 8 fetches * LoopCount
- LoopCount: Width / 4

Perf workbook (1k*1k):
- Total pixels:
- Total ALU: 4642
- Total TEX: 2048
- Total OUT: 4
- ALU time: ms
- TEX time: ms
- 0% bandwidth: ms
- 75% bandwidth: ms

Bottleneck:
- 0% cache hit rate – bandwidth; theoretical peak: Gflops
- 75% cache hit rate – TEX; theoretical peak: Gflops

8 | Kernel 2 – Setup

il_ps_2_0
dcl_input_position_interp(linear_noperspective) v0.xy__
dcl_cb cb0[2]
dcl_output_generic o0
dcl_output_generic o1
dcl_output_generic o2
dcl_output_generic o3
dcl_resource_id(0)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
dcl_resource_id(1)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
dcl_resource_id(2)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
dcl_resource_id(3)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
dcl_resource_id(4)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
dcl_resource_id(5)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
dcl_resource_id(6)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
dcl_resource_id(7)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
dcl_literal l0, 0x , 0x , 0x , 0x
dcl_literal l3, 0x3F800000, 0x3F800000, 0x3F800000, 0x3F
add r25.xy__, v0.xyxx, cb0[0].xyxx
mov r25.__zw, l0
mov r10, l0
mov r11, l0
mov r12, l0
mov r13, l0
mov r14, l0
mov r15, l0
mov r16, l0
mov r17, l0
mov r18, l0
mov r19, l0
mov r20, l0
mov r21, l0
mov r22, l0
mov r23, l0

9 | Kernel 2 – Loop

whileloop
    ge r2._y__, r25.w, cb0[0].z
    break_logicalnz r2.y
    sample_resource(0)_sampler(0) r0, r25.wyzz
    sample_resource(1)_sampler(1) r1, r25.wyzz
    sample_resource(2)_sampler(2) r2, r25.wyzz
    sample_resource(3)_sampler(3) r3, r25.wyzz
    sample_resource(4)_sampler(4) r4, r25.wxzz
    sample_resource(5)_sampler(5) r5, r25.wxzz
    sample_resource(6)_sampler(6) r6, r25.wxzz
    sample_resource(7)_sampler(7) r7, r25.wxzz
    add r25.___w, r25.w, l3
    mad r10.x___, r0.x, r4.x, r10.x
    mad r10.x___, r0.y, r4.y, r10.x
    mad r10.x___, r0.z, r4.z, r10.x
    mad r10.x___, r0.w, r4.w, r10.x
    mad r10._y__, r0.x, r5.x, r10.y
    mad r10._y__, r0.y, r5.y, r10.y
    mad r10._y__, r0.z, r5.z, r10.y
    mad r10._y__, r0.w, r5.w, r10.y
    mad r10.__z_, r0.x, r6.x, r10.z
    mad r10.__z_, r0.y, r6.y, r10.z
    mad r10.__z_, r0.z, r6.z, r10.z
    mad r10.__z_, r0.w, r6.w, r10.z
    mad r10.___w, r0.x, r7.x, r10.w
    mad r10.___w, r0.y, r7.y, r10.w
    mad r10.___w, r0.z, r7.z, r10.w
    mad r10.___w, r0.w, r7.w, r10.w
    mad r11.x___, r1.x, r4.x, r11.x
    mad r11.x___, r1.y, r4.y, r11.x
    mad r11.x___, r1.z, r4.z, r11.x
    mad r11.x___, r1.w, r4.w, r11.x
    mad r11._y__, r1.x, r5.x, r11.y
    mad r11._y__, r1.y, r5.y, r11.y
    mad r11._y__, r1.z, r5.z, r11.y
    mad r11._y__, r1.w, r5.w, r11.y
    mad r11.__z_, r1.x, r6.x, r11.z
    mad r11.__z_, r1.y, r6.y, r11.z
    mad r11.__z_, r1.z, r6.z, r11.z
    mad r11.__z_, r1.w, r6.w, r11.z
    mad r11.___w, r1.x, r7.x, r11.w
    mad r11.___w, r1.y, r7.y, r11.w
    mad r11.___w, r1.z, r7.z, r11.w
    mad r11.___w, r1.w, r7.w, r11.w
    mad r12.x___, r2.x, r4.x, r12.x
    mad r12.x___, r2.y, r4.y, r12.x
    mad r12.x___, r2.z, r4.z, r12.x
    mad r12.x___, r2.w, r4.w, r12.x
    mad r12._y__, r2.x, r5.x, r12.y
    mad r12._y__, r2.y, r5.y, r12.y
    mad r12._y__, r2.z, r5.z, r12.y
    mad r12._y__, r2.w, r5.w, r12.y
    mad r12.__z_, r2.x, r6.x, r12.z
    mad r12.__z_, r2.y, r6.y, r12.z
    mad r12.__z_, r2.z, r6.z, r12.z
    mad r12.__z_, r2.w, r6.w, r12.z
    mad r12.___w, r2.x, r7.x, r12.w
    mad r12.___w, r2.y, r7.y, r12.w
    mad r12.___w, r2.z, r7.z, r12.w
    mad r12.___w, r2.w, r7.w, r12.w
    mad r13.x___, r3.x, r4.x, r13.x
    mad r13.x___, r3.y, r4.y, r13.x
    mad r13.x___, r3.z, r4.z, r13.x
    mad r13.x___, r3.w, r4.w, r13.x
    mad r13._y__, r3.x, r5.x, r13.y
    mad r13._y__, r3.y, r5.y, r13.y
    mad r13._y__, r3.z, r5.z, r13.y
    mad r13._y__, r3.w, r5.w, r13.y
    mad r13.__z_, r3.x, r6.x, r13.z
    mad r13.__z_, r3.y, r6.y, r13.z
    mad r13.__z_, r3.z, r6.z, r13.z
    mad r13.__z_, r3.w, r6.w, r13.z
    mad r13.___w, r3.x, r7.x, r13.w
    mad r13.___w, r3.y, r7.y, r13.w
    mad r13.___w, r3.z, r7.z, r13.w
    mad r13.___w, r3.w, r7.w, r13.w
endloop

10 | Kernel 2 – Output

mul o0.x___, r10.x, cb0[0].w
mul o0._y__, r10.y, cb0[0].w
mul o0.__z_, r10.z, cb0[0].w
mul o0.___w, r10.w, cb0[0].w
mul o1.x___, r11.x, cb0[0].w
mul o1._y__, r11.y, cb0[0].w
mul o1.__z_, r11.z, cb0[0].w
mul o1.___w, r11.w, cb0[0].w
mul o2.x___, r12.x, cb0[0].w
mul o2._y__, r12.y, cb0[0].w
mul o2.__z_, r12.z, cb0[0].w
mul o2.___w, r12.w, cb0[0].w
mul o3.x___, r13.x, cb0[0].w
mul o3._y__, r13.y, cb0[0].w
mul o3.__z_, r13.z, cb0[0].w
mul o3.___w, r13.w, cb0[0].w
ret_dyn
end

11 | Kernel 2 Stats

GSA:
- ALU: 31 ALU
- TEX: 8 fetches
- OUT: 4 color writes
- REG: 13 registers

ISA:
- ALU: Setup 4 ALU, Loop 19 ALU * LoopCount, Output 8 ALU
- TEX: 8 fetches * LoopCount
- LoopCount: Width / 4

Perf workbook (1k*1k):
- Total pixels:
- Total ALU: 4876
- Total TEX: 2048
- Total OUT: 4
- ALU time: ms
- TEX time: ms
- 0% cache hit: ms
- 75% cache hit: ms

Bottleneck:
- 0% cache hit rate – bandwidth; theoretical peak: Gflops
- 75% cache hit rate – TEX; theoretical peak: Gflops

12 | Kernel 1 vs. Kernel 2 Relative Performance

Why the performance difference?

Configuration: AMD Phenom™ 9950 X GHz, RD790 reference motherboard, 8 GB RAM, Windows® XP Professional x64 Edition, ATI Radeon™ HD 3870, ATI Stream SDK v1.01-beta, ATI Catalyst™ 8.5

13 | Unexpected Results

Must reexamine the kernel implementation:
- Memory access pattern
- Register/wavefront count
- Timing issues
- Cache pressure

Is this the fastest possible? It is possible to calculate the theoretical algorithmic peak:
- 2 * M * K * N / tex_time / (2^30)
- 4 outputs = GFlops
- 8 outputs = GFlops

What is causing the slowdown? Memory?
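The algorithmic-peak formula above can be written directly as a function (a sketch of the slide's arithmetic; `tex_time_s` is the texture-fetch-bound time in seconds): an SGEMM of dimensions M, K, N performs 2·M·K·N floating-point operations, and dividing by time and 2^30 yields GFLOPS.

```python
def algorithmic_peak_gflops(M, K, N, tex_time_s):
    """GFLOPS = 2*M*K*N / tex_time / 2**30, per the slide's formula."""
    return 2 * M * K * N / tex_time_s / 2**30
```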

14 | Memory Access Pattern

[Figure: C accumulates from a row strip of A (textures 0–3) and a column strip of B (textures 4–7).]

sample_resource(0)_sampler(0) r0, r25.wyzz
sample_resource(1)_sampler(1) r1, r25.wyzz
sample_resource(2)_sampler(2) r2, r25.wyzz
sample_resource(3)_sampler(3) r3, r25.wyzz
sample_resource(4)_sampler(4) r4, r25.wxzz
sample_resource(5)_sampler(5) r5, r25.wxzz
sample_resource(6)_sampler(6) r6, r25.wxzz
sample_resource(7)_sampler(7) r7, r25.wxzz

Per-iteration fetches (one float4 per texture):
- Loop 0: A data (Row, 0) and B data (0, Cols) – 128 B
- Loop 1: A data (Row, 1) and B data (1, Cols) – 128 B
- Loop 2: A data (Row, 2) and B data (2, Cols) – 128 B
- ...
- Loop N-1: A data (Row, N-1) and B data (N-1, Cols) – 128 B

For a width of 1024, N = 256, so the total size is N * 128 B = 32 KB: each thread fills the L1 cache once.
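The streaming footprint implied by this pattern is simple arithmetic, sketched below (8 float4 fetches of 16 B per iteration; the iteration count is the matrix width divided by 4 since each fetch covers four elements):

```python
FETCHES_PER_ITER = 8
BYTES_PER_FETCH = 16                                   # one float4
per_iter_bytes = FETCHES_PER_ITER * BYTES_PER_FETCH    # 128 B per loop trip
iterations = 1024 // 4                                 # 256 trips at width 1024
total_bytes = iterations * per_iter_bytes              # one full 32 KB L1
```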

15 | Register/Wavefront Count

- Kernel 1 uses 25 registers; Kernel 2 uses 13 registers.
- In CAL, if the register count is > 12, 160 physical registers are made available.
- Wavefront count = floor(Available / Used)
- Kernel 1 gets 6 wavefronts; Kernel 2 gets 12 wavefronts.
- Kernel 1 peaks at
- Kernel 2 peaks at 107.9
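The wavefront arithmetic above can be sketched as a one-liner (using the slide's numbers; the 160-register pool applies once a kernel uses more than 12 registers):

```python
def wavefronts_in_flight(regs_used, available=160):
    """floor(Available / Used), per the slide's formula."""
    return available // regs_used
```

With 25 registers Kernel 1 gets 6 wavefronts; with 13, Kernel 2 gets 12 — twice the latitude for latency hiding.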

16 | Timing Issues

- How many wavefronts are required to cover TEX?
- TEX issues 16 instructions a cycle for 4 cycles over a whole wavefront.
- An L1 cache hit is ~40 cycles; TEX is ~80 cycles; an L2 cache hit is cycles.
- Assuming a 100% cache hit rate: 8 fetches * 4 cycles/fetch TEX/L1 = 152 cycles to hide.
- Kernel 1: 18 ALU * 6 wavefronts = 108 cycles – not enough to cover the fetch latency.
- Kernel 2: 19 ALU * 12 wavefronts = 228 cycles – enough to cover it.
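The latency-hiding test on this slide reduces to one comparison, sketched here with the slide's 152-cycle figure as the default: the ALU cycles contributed by all resident wavefronts must meet or exceed the cycles to hide.

```python
def hides_tex_latency(alu_per_loop, wavefronts, cycles_to_hide=152):
    """True if the resident wavefronts supply enough ALU work to cover TEX."""
    return alu_per_loop * wavefronts >= cycles_to_hide
```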

17 | Cache Pattern – Per Texture

[Figure: tiled wavefront cache pattern per texture – a wavefront touches 4 cachelines of A and 4 cachelines of B, but only 2 cachelines of B^T.]

18 | Cache Pressure – Kernel 1

[Figure: 4x8K L1 cache, B cachelines. Fetch A – WF 0: 4x16B (16 CL). Fetch B – WF 0: 16x16B (16 CL), WF 1: 16x16B (16 CL), WF 2: 16x16B (16 CL), WF 3: 16x16B (16 CL), WF 4: 16x16B (16 CL), WF 5: 16x16B (16 CL).]

19 | Cache Pressure – Kernel 2

[Figure: 4x8K L1 cache, B cachelines. Fetch A – WF 0: 4x16B (16 CL). Fetch B – WF 0–5: 16x16B (16 CL) each; WF 6: 16x16B (32 CL); WF 7: 32x16B (32 CL); WF 8–11: 32x16B (32 CL) each. Wavefronts 9–11 …?? Wavefronts 0–3 evicted!!!]

20 | ACML-GPU DGEMM/SGEMM

[Chart: in the white spaces the CPU is idle while the GPU executes.]

21 | Overall Performance – Memory

- Memory is the bottleneck over the kernel, keeping overall performance below kernel peak.
- Compute smaller strips and overlap the memory copy with computation.
- Splitting the matrix over multiple GPUs is also a possibility.
- Use calResCreate2D to remove the user->kernel copy.
- Find the optimal computation strips for the hardware.
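A schematic cost model (not the CAL API — just an illustration of the overlap idea above) shows why strip pipelining helps: once copy and compute overlap, each strip costs max(copy, compute) instead of copy + compute, with only the first copy left exposed.

```python
def serial_ms(strips, copy_ms, compute_ms):
    """No overlap: every strip pays copy + compute."""
    return strips * (copy_ms + compute_ms)

def pipelined_ms(strips, copy_ms, compute_ms):
    """Copy of strip i+1 overlaps compute of strip i; first copy exposed."""
    return copy_ms + strips * max(copy_ms, compute_ms)
```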

22 | Solutions?

- As seen on the previous slide, 8 wavefronts fit perfectly in cache: 16 CL from A + 16 CL from B = 32 CL per wavefront, and 256 CL in L1 / 32 CL per wavefront = 8 wavefronts.
- Create garbage calculations to pad the register count to registers.
- Transposing the B matrix saves 8 CL requests per iteration, for a total of 24 CL requests per iteration: 256 CL / 24 CL = 10 wavefronts, a 25% improvement.
- 160 registers / 10 wavefronts = 16 registers max.
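The cache-budget arithmetic above, collected in one place (the slide's numbers: L1 holds 256 cachelines, each wavefront needs 32 CLs untransposed but only 24 with B transposed):

```python
L1_CACHELINES = 256
wf_fit_now = L1_CACHELINES // 32           # 8 wavefronts fit today
wf_fit_transposed = L1_CACHELINES // 24    # 10 wavefronts with B transposed
reg_budget = 160 // wf_fit_transposed      # 16-register ceiling per thread
```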

23 | Final Results

94% of theoretical peak.

Configuration: AMD Phenom™ 9950 X GHz, RD790 reference motherboard, 8 GB RAM, Windows® XP Professional x64 Edition, ATI Radeon™ HD 3870, ATI Stream SDK v1.01-beta, ATI Catalyst™ 8.5

24 | Disclaimer & Attribution

DISCLAIMER

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes. AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTION

© 2009 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo, ATI, the ATI Logo, AMD Opteron, AMD Phenom, Catalyst, Radeon, and combinations thereof are trademarks of Advanced Micro Devices, Inc. Microsoft, Windows, and Windows Vista are trademarks of Microsoft Corporation in the United States and/or other jurisdictions. Other names are for informational purposes only and may be trademarks of their respective owners.

