ATI Stream Computing ACML-GPU – SGEMM Optimization Illustration


ATI Stream Computing ACML-GPU – SGEMM Optimization Illustration Micah Villmow May 30, 2008

AMD Core Math Library
- Full implementation of Level 1, 2 and 3 Basic Linear Algebra Subroutines (BLAS), with some optimizations for AMD Opteron™ processors
- A full suite of Linear Algebra (LAPACK) routines
- Comprehensive suite of Fast Fourier Transforms (FFTs) in single, double, single-complex, and double-complex data types
- Fast scalar, vector, and array math transcendental library routines optimized for AMD Opteron processors
- Single- and double-precision random number generators
- Load balancing between multiple CPU cores and GPU cores

SGEMM – C = α * A * B + β * C
- Similar to a simple matrix multiply, but the product must be scaled by alpha and C must be scaled by beta
- Theoretical algorithmic peak is ~½ of the theoretical ALU peak
- Issues:
  - Kernel performance != real-world performance
  - Memory-bound algorithm vs. texture-bound kernel
  - Balancing register count against wavefronts in flight
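The operation above can be written as a short reference implementation. This is a sketch of the SGEMM definition the IL kernels below implement, not ACML code; the function name and use of plain Python lists are ours.

```python
# Reference SGEMM: C = alpha * A * B + beta * C, on plain Python lists.
# A is M x K, B is K x N, C is M x N.
def sgemm(alpha, A, B, beta, C):
    M, K, N = len(A), len(B), len(B[0])
    out = [[0.0] * N for _ in range(M)]
    for i in range(M):
        for j in range(N):
            acc = 0.0
            for k in range(K):
                acc += A[i][k] * B[k][j]
            # Both scalings are required: alpha on the product, beta on C.
            out[i][j] = alpha * acc + beta * C[i][j]
    return out
```

Each output element costs one multiply and one add per k step, which is where the 2 * M * K * N flop count used later comes from.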

Case Study
- Study the kernels to find the bottlenecks
- Determine the actions needed to fix the bottlenecks
- Determine how close to theoretical peak the fixes come
- Work on improving system time
  - Memory/execution parallelization

Kernel 1 - Setup

il_ps_2_0
dcl_input_position_interp(linear_noperspective) vWinCoord.xy__
dcl_cb cb0[2]
dcl_output_generic o0
dcl_output_generic o1
dcl_output_generic o2
dcl_output_generic o3
dcl_resource_id(0)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
dcl_resource_id(1)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
dcl_resource_id(2)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
dcl_resource_id(3)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
dcl_resource_id(4)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
dcl_resource_id(5)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
dcl_resource_id(6)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
dcl_resource_id(7)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
dcl_literal l0, 0x00000000, 0x00000000, 0x00000000, 0x00000000
dcl_literal l3, 0x3F800000, 0x3F800000, 0x3F800000, 0x3F800000
add r25.xy__, vWinCoord.xyxx, cb0[0].xyxx
mov r25.__zw, l0
mov r8, l0
mov r9, l0
mov r10, l0
mov r11, l0
mov r12, l0
mov r13, l0
mov r14, l0
mov r15, l0
mov r16, l0
mov r17, l0
mov r18, l0
mov r19, l0
mov r20, l0
mov r21, l0
mov r22, l0
mov r23, l0

Kernel 1 – Loop & Output

whileloop
ge r2._y__, r25.wwww, cb0[0].zzzz
break_logicalnz r2.y
sample_resource(0)_sampler(0) r0, r25.wyzz
sample_resource(1)_sampler(1) r1, r25.wyzz
sample_resource(2)_sampler(2) r2, r25.wyzz
sample_resource(3)_sampler(3) r3, r25.wyzz
sample_resource(4)_sampler(4) r4, r25.wxzz
sample_resource(5)_sampler(5) r5, r25.wxzz
sample_resource(6)_sampler(6) r6, r25.wxzz
sample_resource(7)_sampler(7) r7, r25.wxzz
mad r8, r0, r4, r8
mad r9, r0, r5, r9
mad r10, r0, r6, r10
mad r11, r0, r7, r11
mad r12, r1, r4, r12
mad r13, r1, r5, r13
mad r14, r1, r6, r14
mad r15, r1, r7, r15
mad r16, r2, r4, r16
mad r17, r2, r5, r17
mad r18, r2, r6, r18
mad r19, r2, r7, r19
mad r20, r3, r4, r20
mad r21, r3, r5, r21
mad r22, r3, r6, r22
mad r23, r3, r7, r23
add r25.___w, r25.w, l3
endloop
dp4 o0.x___, r8, cb0[0].wwww
dp4 o0._y__, r9, cb0[0].wwww
dp4 o0.__z_, r10, cb0[0].wwww
dp4 o0.___w, r11, cb0[0].wwww
dp4 o1.x___, r12, cb0[0].wwww
dp4 o1._y__, r13, cb0[0].wwww
dp4 o1.__z_, r14, cb0[0].wwww
dp4 o1.___w, r15, cb0[0].wwww
dp4 o2.x___, r16, cb0[0].wwww
dp4 o2._y__, r17, cb0[0].wwww
dp4 o2.__z_, r18, cb0[0].wwww
dp4 o2.___w, r19, cb0[0].wwww
dp4 o3.x___, r20, cb0[0].wwww
dp4 o3._y__, r21, cb0[0].wwww
dp4 o3.__z_, r22, cb0[0].wwww
dp4 o3.___w, r23, cb0[0].wwww
ret_dyn
end

Kernel 1 Stats – ATI Radeon™ HD 3870 GPU
GSA:
- ALU: 52 ALU
- TEX: 8 Fetches
- OUT: 4 Color Writes
- REG: 25 Registers
ISA:
- ALU: Setup: 14 ALU, Loop: 18 ALU * LoopCount, Output: 20 ALU
- TEX: 8 Fetches * LoopCount
- LoopCount: (Width/4)
Perf Workbook (1k*1k):
- Total Pixels: 262144
- Total ALU: 4642
- Total Tex: 2048
- Total Out: 4
- ALU Time: 25.3515 ms
- Tex Time: 44.7392 ms
- 0% Bandwidth: 116.9390 ms
- 75% Bandwidth: 52.0302 ms
Bottleneck:
- 0% cache hit rate – Bandwidth; Theoretical Peak: 40.7193 Gflops
- 75% cache hit rate – Tex; Theoretical Peak: 140.0146 Gflops
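The workbook totals above follow directly from the per-kernel instruction counts. A sketch of that arithmetic (function and variable names are ours, not from the workbook):

```python
# Per-thread instruction totals from the ISA counts on the stats slides:
# total ALU = setup + loop body * loop count + output, with one loop
# iteration per float4, i.e. LoopCount = Width / 4.
def kernel_counts(setup_alu, loop_alu, output_alu, fetches_per_loop, width):
    loop_count = width // 4
    total_alu = setup_alu + loop_alu * loop_count + output_alu
    total_tex = fetches_per_loop * loop_count
    return total_alu, total_tex

k1 = kernel_counts(14, 18, 20, 8, 1024)  # Kernel 1 on a 1k*1k matrix
k2 = kernel_counts(4, 19, 8, 8, 1024)    # Kernel 2 (stats slide below)
```

This reproduces the slides' Total ALU of 4642 (Kernel 1) and 4876 (Kernel 2), and Total Tex of 2048 for both.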

Kernel 2 - Setup

il_ps_2_0
dcl_input_position_interp(linear_noperspective) v0.xy__
dcl_cb cb0[2]
dcl_output_generic o0
dcl_output_generic o1
dcl_output_generic o2
dcl_output_generic o3
dcl_resource_id(0)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
dcl_resource_id(1)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
dcl_resource_id(2)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
dcl_resource_id(3)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
dcl_resource_id(4)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
dcl_resource_id(5)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
dcl_resource_id(6)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
dcl_resource_id(7)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
dcl_literal l0, 0x00000000, 0x00000000, 0x00000000, 0x00000000
dcl_literal l3, 0x3F800000, 0x3F800000, 0x3F800000, 0x3F800000
add r25.xy__, v0.xyxx, cb0[0].xyxx
mov r25.__zw, l0
mov r10, l0
mov r11, l0
mov r12, l0
mov r13, l0
mov r14, l0
mov r15, l0
mov r16, l0
mov r17, l0
mov r18, l0
mov r19, l0
mov r20, l0
mov r21, l0
mov r22, l0
mov r23, l0

Kernel 2 - Loop

whileloop
ge r2._y__, r25.w, cb0[0].z
break_logicalnz r2.y
sample_resource(0)_sampler(0) r0, r25.wyzz
sample_resource(1)_sampler(1) r1, r25.wyzz
sample_resource(2)_sampler(2) r2, r25.wyzz
sample_resource(3)_sampler(3) r3, r25.wyzz
sample_resource(4)_sampler(4) r4, r25.wxzz
sample_resource(5)_sampler(5) r5, r25.wxzz
sample_resource(6)_sampler(6) r6, r25.wxzz
sample_resource(7)_sampler(7) r7, r25.wxzz
add r25.___w, r25.w, l3
mad r10.x___, r0.x, r4.x, r10.x
mad r10.x___, r0.y, r4.y, r10.x
mad r10.x___, r0.z, r4.z, r10.x
mad r10.x___, r0.w, r4.w, r10.x
mad r10._y__, r0.x, r5.x, r10.y
mad r10._y__, r0.y, r5.y, r10.y
mad r10._y__, r0.z, r5.z, r10.y
mad r10._y__, r0.w, r5.w, r10.y
mad r10.__z_, r0.x, r6.x, r10.z
mad r10.__z_, r0.y, r6.y, r10.z
mad r10.__z_, r0.z, r6.z, r10.z
mad r10.__z_, r0.w, r6.w, r10.z
mad r10.___w, r0.x, r7.x, r10.w
mad r10.___w, r0.y, r7.y, r10.w
mad r10.___w, r0.z, r7.z, r10.w
mad r10.___w, r0.w, r7.w, r10.w
mad r11.x___, r1.x, r4.x, r11.x
mad r11.x___, r1.y, r4.y, r11.x
mad r11.x___, r1.z, r4.z, r11.x
mad r11.x___, r1.w, r4.w, r11.x
mad r11._y__, r1.x, r5.x, r11.y
mad r11._y__, r1.y, r5.y, r11.y
mad r11._y__, r1.z, r5.z, r11.y
mad r11._y__, r1.w, r5.w, r11.y
mad r11.__z_, r1.x, r6.x, r11.z
mad r11.__z_, r1.y, r6.y, r11.z
mad r11.__z_, r1.z, r6.z, r11.z
mad r11.__z_, r1.w, r6.w, r11.z
mad r11.___w, r1.x, r7.x, r11.w
mad r11.___w, r1.y, r7.y, r11.w
mad r11.___w, r1.z, r7.z, r11.w
mad r11.___w, r1.w, r7.w, r11.w
mad r12.x___, r2.x, r4.x, r12.x
mad r12.x___, r2.y, r4.y, r12.x
mad r12.x___, r2.z, r4.z, r12.x
mad r12.x___, r2.w, r4.w, r12.x
mad r12._y__, r2.x, r5.x, r12.y
mad r12._y__, r2.y, r5.y, r12.y
mad r12._y__, r2.z, r5.z, r12.y
mad r12._y__, r2.w, r5.w, r12.y
mad r12.__z_, r2.x, r6.x, r12.z
mad r12.__z_, r2.y, r6.y, r12.z
mad r12.__z_, r2.z, r6.z, r12.z
mad r12.__z_, r2.w, r6.w, r12.z
mad r12.___w, r2.x, r7.x, r12.w
mad r12.___w, r2.y, r7.y, r12.w
mad r12.___w, r2.z, r7.z, r12.w
mad r12.___w, r2.w, r7.w, r12.w
mad r13.x___, r3.x, r4.x, r13.x
mad r13.x___, r3.y, r4.y, r13.x
mad r13.x___, r3.z, r4.z, r13.x
mad r13.x___, r3.w, r4.w, r13.x
mad r13._y__, r3.x, r5.x, r13.y
mad r13._y__, r3.y, r5.y, r13.y
mad r13._y__, r3.z, r5.z, r13.y
mad r13._y__, r3.w, r5.w, r13.y
mad r13.__z_, r3.x, r6.x, r13.z
mad r13.__z_, r3.y, r6.y, r13.z
mad r13.__z_, r3.z, r6.z, r13.z
mad r13.__z_, r3.w, r6.w, r13.z
mad r13.___w, r3.x, r7.x, r13.w
mad r13.___w, r3.y, r7.y, r13.w
mad r13.___w, r3.z, r7.z, r13.w
mad r13.___w, r3.w, r7.w, r13.w
endloop

Kernel 2 - Output

mul o0.x___, r10.x, cb0[0].w
mul o0._y__, r10.y, cb0[0].w
mul o0.__z_, r10.z, cb0[0].w
mul o0.___w, r10.w, cb0[0].w
mul o1.x___, r11.x, cb0[0].w
mul o1._y__, r11.y, cb0[0].w
mul o1.__z_, r11.z, cb0[0].w
mul o1.___w, r11.w, cb0[0].w
mul o2.x___, r12.x, cb0[0].w
mul o2._y__, r12.y, cb0[0].w
mul o2.__z_, r12.z, cb0[0].w
mul o2.___w, r12.w, cb0[0].w
mul o3.x___, r13.x, cb0[0].w
mul o3._y__, r13.y, cb0[0].w
mul o3.__z_, r13.z, cb0[0].w
mul o3.___w, r13.w, cb0[0].w
ret_dyn
end

Kernel 2 Stats
GSA:
- ALU: 31 ALU
- TEX: 8 Fetches
- OUT: 4 Color Writes
- REG: 13 Registers
ISA:
- ALU: Setup: 4 ALU, Loop: 19 ALU * LoopCount, Output: 8 ALU
- TEX: 8 Fetches * LoopCount
- LoopCount: (Width/4)
Perf Workbook (1k*1k):
- Total Pixels: 262144
- Total ALU: 4876
- Total Tex: 2048
- Total Out: 4
- ALU Time: 26.6295 ms
- Tex Time: 44.7392 ms
- 0% CacheHit: 116.9390 ms
- 75% CacheHit: 29.2348 ms
Bottleneck:
- 0% cache hit rate – Bandwidth; Theoretical Peak: 54.6530 Gflops
- 75% cache hit rate – Tex; Theoretical Peak: 147.4316 Gflops

Kernel1 vs. Kernel2 Relative Performance
Why the performance difference?
[Chart: Kernel1 vs. Kernel2 relative performance]
Configuration: AMD Phenom™ 9950 X4 processor @ 2.60 GHz, RD790 reference motherboard, 8 GB RAM, Windows® XP Professional x64 edition, ATI Radeon™ HD 3870, ATI Stream SDK v1.01-beta, ATI Catalyst™ 8.5

Unexpected Results
- Must reexamine:
  - Kernel implementation
  - Memory access pattern
  - Register/wavefront count
  - Timing issues
  - Cache pressure
- Is this the fastest possible?
- Possible to calculate the theoretical algorithmic peak: 2 * M * K * N / tex_time / (2^30)
  - 4 outputs = 198.4 GFlops
  - 8 outputs = 264.5 GFlops
- What is causing the slowdown? Memory?
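The peak formula in the list above can be checked with a one-line sketch. The function name is ours; the slide does not give the texture times behind its 198.4/264.5 GFlop figures, so only the formula itself is shown.

```python
# Theoretical algorithmic peak from the slide's formula:
#   2 * M * K * N / tex_time / 2^30   (Gflops, with tex_time in seconds).
# SGEMM does one multiply and one add per (i, j, k) triple, hence the 2.
def algorithmic_peak_gflops(M, K, N, tex_time_s):
    return 2 * M * K * N / tex_time_s / 2**30
```

For a 1k*1k*1k SGEMM the numerator is exactly 2 * 2^30 flops, so a texture-bound time of one second would correspond to 2.0 Gflops.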

Memory Access Pattern
[Diagram: matrices A, B, C; textures 1–4 cover A and B, with each loop iteration sampling along a row of A and a column of B]

sample_resource(0)_sampler(0) r0, r25.wyzz
sample_resource(1)_sampler(1) r1, r25.wyzz
sample_resource(2)_sampler(2) r2, r25.wyzz
sample_resource(3)_sampler(3) r3, r25.wyzz
sample_resource(4)_sampler(4) r4, r25.wxzz
sample_resource(5)_sampler(5) r5, r25.wxzz
sample_resource(6)_sampler(6) r6, r25.wxzz
sample_resource(7)_sampler(7) r7, r25.wxzz

- Loop i: fetch Data(Row, i) from A and Data(i, Cols) from B, one float4 per texture, 128B per iteration
- For a width of 1024, N = 256
- Total size: N * 128B = 32KB, so each thread fills up the L1 cache once
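The L1-fill claim above is simple arithmetic. A sketch (names ours; the transcript prints the per-iteration size garbled, and 128-byte cache lines are assumed because they give the stated 32 KB total):

```python
# One thread's streamed footprint: N = Width/4 loop iterations, each
# touching one cache line's worth of data per matrix pass.
def l1_fill_bytes(width, line_bytes=128):
    n = width // 4
    return n * line_bytes
```

For width 1024 this gives 256 * 128 B = 32 KB, exactly the size of the L1 cache on the slide, so a single thread evicts the whole cache once per pass.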

Register/Wavefront Count
- Kernel 1 uses 25 registers; Kernel 2 uses 13 registers
- In CAL, if the register count is > 12, 160 physical registers are made available
- Wavefront count = floor(Available / Used)
  - Kernel 1 gets 6 wavefronts
  - Kernel 2 gets 12 wavefronts
- Kernel 1 peaks at 139.6 Gflops; Kernel 2 peaks at 107.9 Gflops
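The wavefront counts above come straight from the floor rule on the slide. A minimal sketch (function name is ours):

```python
# Wavefronts in flight = floor(available physical registers / registers
# used per thread); per the slide, 160 registers are available in CAL
# once a kernel uses more than 12.
def wavefronts_in_flight(regs_used, available=160):
    return available // regs_used

k1_wavefronts = wavefronts_in_flight(25)  # Kernel 1
k2_wavefronts = wavefronts_in_flight(13)  # Kernel 2
```

floor(160/25) = 6 and floor(160/13) = 12, matching the slide: halving register use roughly doubles the wavefronts available to hide latency.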

Timing Issues
- How many wavefronts are required to cover TEX?
- TEX issues 16 instructions a cycle for 4 cycles over a whole wavefront
- L1 cache hit is ~40 cycles; TEX is ~80 cycles; L2 cache hit is 256-512 cycles
- Assuming a 100% cache hit rate: 8 fetches * 4 cycles/fetch + 120 cycles TEX/L1 = 152 cycles to hide
- Kernel 1: 18 ALU * 6 wavefronts = 108 cycles
- Kernel 2: 19 ALU * 12 wavefronts = 228 cycles
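The latency-hiding comparison above can be sketched in two small helpers (names and parameter defaults are ours, taken from the slide's figures):

```python
# Cycles of texture latency that must be hidden: fetch issue time plus the
# combined TEX (~80) + L1 hit (~40) latency, assuming 100% cache hits.
def tex_cycles_to_hide(fetches, issue_cycles_per_fetch=4, tex_plus_l1=120):
    return fetches * issue_cycles_per_fetch + tex_plus_l1

# ALU cycles the resident wavefronts can execute while a fetch is pending.
def alu_cover_cycles(loop_alu, wavefronts):
    return loop_alu * wavefronts
```

With 8 fetches there are 152 cycles to hide. Kernel 1's 18 ALU * 6 wavefronts = 108 cycles falls short, while Kernel 2's 19 ALU * 12 wavefronts = 228 cycles covers it, which is why Kernel 2's lower register count helps despite its slower measured peak.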

Cache Pattern – Per Texture
[Diagram: per-texture cache-line footprint of a tiled wavefront accessing B vs. B^T (4 CLs vs. 2 CLs)]

Cache Pressure – Kernel 1
[Diagram: 4x8K L1 cache with 256 128B cachelines; each of wavefronts 0–5 fetches 16x16B (16 CLs) from A and 16x16B (16 CLs) from B through the TEX unit]

Cache Pressure – Kernel 2
[Diagram: 4x8K L1 cache with 256 128B cachelines; wavefronts 0–3 fetch 16x16B (16 CLs) each from A and B, later wavefronts fetch 32x16B (32 CLs); by the time wavefronts 8–11 issue their fetches, the lines for wavefronts 0–3 have been evicted]

ACML-GPU DGEMM/SGEMM
[Chart: DGEMM/SGEMM execution timeline; in the white spaces the CPU is idle while the GPU executes]

Overall Performance – Memory
- Memory is the bottleneck over the kernel, keeping overall performance below kernel peak
- Compute smaller strips and overlap memory copies with computation
- Splitting the matrix over multiple GPUs is also a possibility
- Use calResCreate2D to remove the user->kernel copy
- Find the optimal computation strips for the hardware

Solutions?
- As seen on the previous slide, 8 wavefronts fit perfectly in cache: 16 CLs from A + 16 CLs from B = 32 CLs per wavefront, and 256 CLs in L1 / 32 CLs per wavefront = 8 wavefronts
- Create garbage calculations to pad the register count to 18-20 registers
- Transposing the B matrix saves 8 CL requests per iteration, for a total of 24 CL requests per iteration
  - 256 CLs / 24 CLs = 10 wavefronts, a 25% improvement
  - 160 registers / 10 wavefronts = 16 registers max
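The cache-line budget above reduces to integer division. A sketch of the slide's arithmetic (names ours):

```python
# The 4x8K L1 holds 256 cachelines of 128B (per the cache-pressure slides).
L1_CACHE_LINES = 256

# Wavefronts whose working sets fit in L1 simultaneously.
def max_resident_wavefronts(cls_per_wavefront):
    return L1_CACHE_LINES // cls_per_wavefront

# Register budget per thread once the wavefront count is fixed.
def max_registers(total_regs, wavefronts):
    return total_regs // wavefronts
```

With 32 CLs per wavefront (16 from A, 16 from B) 8 wavefronts fit; transposing B cuts the demand to 24 CLs, allowing 10 wavefronts, which in turn caps each thread at 160/10 = 16 registers.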

Final Results
94% of theoretical peak
Configuration: AMD Phenom™ 9950 X4 processor @ 2.60 GHz, RD790 reference motherboard, 8 GB RAM, Windows® XP Professional x64 edition, ATI Radeon™ HD 3870, ATI Stream SDK v1.01-beta, ATI Catalyst™ 8.5

Disclaimer & Attribution

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.

AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTION
© 2009 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo, ATI, the ATI Logo, AMD Opteron, AMD Phenom, Catalyst, Radeon, and combinations thereof are trademarks of Advanced Micro Devices, Inc. Microsoft, Windows, and Windows Vista are trademarks of Microsoft Corporation in the United States and/or other jurisdictions. Other names are for informational purposes only and may be trademarks of their respective owners.