ATI Stream Computing: ATI Radeon™ HD 2900 Series GPU Hardware Overview
Micah Villmow
May 30, 2008

ATI Radeon™ HD 2900 Series GPU Hardware Overview
- Graphics View
- Compute View
- ATI Radeon™ HD 2900 Series GPU Hardware
- ATI Radeon™ HD 2400/2600 Series GPU Hardware

ATI Radeon™ HD 2900 Series GPU - Graphics Overview
- Created for graphics
- Not optimal for compute
- Various functions have specific use cases
- Overhead caused by graphics pipeline
- Graphics APIs do not allow very direct control

ATI Radeon™ HD 2900 Series GPU - Compute Overview
Hides non-compute items:
- Geometry Shader
- Tessellation Unit
- Vertex Shader
- Vertex Cache
- Z/Stencil Cache
- Etc.
Exposes only what is required

ATI Radeon™ HD 2900 Series GPU Hardware
ALU Hardware
- Streaming Core
- Thread processor
- Flow Control
- Thread Creation
- ALU Scheduling
Memory Hardware
- Memory Controller
- Texture Unit
- Texture Unit Scheduling
- Tiling
- Render Backends

ALU Hardware - Thread Processors
- 5 Streaming Cores
- Four thin SCs [X,Y,Z,W]
- One fat SC [T]
- Branch execution unit
- Single cycle dispatch
- Four cycle latency
- 16 Threads/Cycle

    00 ALU: ADDR(32) CNT(5)
          0  x: MOV R1.x, 0.0f
             y: MOV R1.y, 0.0f
             z: MOV R1.z, 0.0f
             w: MOV R1.w, 0.0f
             t: MOV R0.x, 0.0f

ALU Hardware - Flow Control
- Predication to mask state updates
- Writes only occur when mask not set

    01 LOOP_DX10 i0 FAIL_JUMP_ADDR(5) VALID_PIX
    02 ALU_BREAK: ADDR(37) CNT(2) KCACHE0(CB0:0-15)
          1  y: SETE_INT R0.y, R0.x, KC0[1].x
          2  x: PREDE_INT ____, R0.y, 0.0f UPDATE_EXEC_MASK UPDATE_PRED
    03 ALU: ADDR(39) CNT(5) KCACHE0(CB0:0-15)
          3  x: ADD R1.x, R1.x, KC0[0].x
             y: ADD R1.y, R1.y, KC0[0].y
             z: ADD R1.z, R1.z, KC0[0].z
             w: ADD R1.w, R1.w, KC0[0].w
             t: ADD_INT R0.x, R0.x, 1
    04 ENDLOOP i0 PASS_JUMP_ADDR(2)
    05 EXP_DONE: PIX0, R1
    END_OF_PROGRAM
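To make the idea of predication concrete, here is a minimal Python sketch of a group of threads running one loop in lockstep while an execution mask suppresses writes for threads that have already left it. It is loosely modeled on the listing above, reduced to one component, and is not a description of the actual hardware. The function name `run_predicated_loop` and the per-thread trip counts are assumptions made for illustration (in the listing the loop bound is a constant). The slide describes a mask that blocks writes when set, while the sketch tracks the inverse, an `active` flag.

```python
def run_predicated_loop(trip_counts, increment):
    """Simulate threads executing the same loop in lockstep.

    trip_counts[i] stands in for the loop bound (KC0[1].x in the listing);
    giving each thread its own bound is a hypothetical twist so that the
    threads diverge. increment stands in for KC0[0].x.
    """
    n = len(trip_counts)
    counter = [0] * n      # plays the role of R0.x
    acc = [0.0] * n        # plays the role of R1.x
    active = [True] * n    # True means the thread may still write state
    issued = 0             # how many times the ALU clause was issued

    while any(active):
        # Compare-and-predicate step: a thread drops out once its counter
        # reaches its bound, and stays masked from then on.
        for i in range(n):
            if active[i] and counter[i] == trip_counts[i]:
                active[i] = False
        if not any(active):
            break
        # The ALU clause is issued once for the whole group; only the
        # threads that are still active commit their writes.
        for i in range(n):
            if active[i]:
                acc[i] += increment
                counter[i] += 1
        issued += 1
    return acc, issued


result, issued = run_predicated_loop([1, 3, 0, 2], 0.5)
print(result, issued)
```

In this model the group issues the loop body as many times as its longest-running thread needs, which is why divergent trip counts waste ALU slots.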

ALU Hardware - DPP Array
- 4 SIMD Engines (SE)
- 4 Quads/SE
- 4 TP/Quad
- 5 Streaming Cores/TP
- 320 Streaming Cores
- 2 Wavefronts/SE
- 512 Threads concurrently processed
- 256 Registers per SIMD
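The totals on this slide follow from multiplying the per-level counts. The short Python check below reproduces them; the constant names are invented for this sketch, and the 64 threads per wavefront is derived from figures stated elsewhere in the deck (16 quads per wavefront, 4 threads per quad).

```python
# Per-level counts as listed on the slide.
SIMD_ENGINES = 4
QUADS_PER_SIMD = 4
TPS_PER_QUAD = 4
CORES_PER_TP = 5
WAVEFRONTS_PER_SIMD = 2

# Derived from other slides: 16 quads per wavefront, 4 threads per quad.
THREADS_PER_WAVEFRONT = 16 * 4

thread_processors = SIMD_ENGINES * QUADS_PER_SIMD * TPS_PER_QUAD
streaming_cores = thread_processors * CORES_PER_TP
threads_in_flight = SIMD_ENGINES * WAVEFRONTS_PER_SIMD * THREADS_PER_WAVEFRONT

print(thread_processors, streaming_cores, threads_in_flight)
```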

ALU Hardware - Wavefront Execution
[Figure: quads issued on cycles 0 through 7 for an even wavefront and an odd wavefront, with the IL instructions "imul r22, r22, r10" and "and r22, r22, r11"; repeat ad nauseam for ALU]
- 1 square represents a quad (4 sequential threads)
- 4 quads execute per cycle on a SIMD
- Two wavefronts (WFs) execute in parallel
- Even/odd WFs interleave quads every other cycle
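One way to read the bullets above is as a fixed issue schedule: a 16-quad wavefront needs four issue cycles at 4 quads per cycle, and alternating between two wavefronts stretches one instruction pair over eight cycles (cycles 0 through 7). The Python sketch below builds that schedule. The function name and the choice of which quads issue on which cycle are assumptions; the slide only states that the two wavefronts alternate every other cycle.

```python
def interleaved_schedule(quads_per_wavefront=16, quads_per_cycle=4):
    """Return (cycle, wavefront, quad_indices) tuples for one instruction pair.

    Even cycles issue quads from the even wavefront and odd cycles from the
    odd wavefront. The quad ordering within a wavefront is assumed.
    """
    issues_per_wavefront = quads_per_wavefront // quads_per_cycle
    schedule = []
    for cycle in range(2 * issues_per_wavefront):
        wavefront = "even" if cycle % 2 == 0 else "odd"
        first_quad = (cycle // 2) * quads_per_cycle
        quads = list(range(first_quad, first_quad + quads_per_cycle))
        schedule.append((cycle, wavefront, quads))
    return schedule


for cycle, wavefront, quads in interleaved_schedule():
    print(f"cycle {cycle}: {wavefront:4s} wavefront, quads {quads}")
```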

ALU Hardware - Thread Creation
- Stamps out 16 quads per wavefront in preset order
- Dispatched to SEs in round-robin fashion by the Ultra-Threaded Dispatch Processor
- Affects memory access performance

ALU Hardware - Wavefront Scheduling
[Figure panels: "SIMD Engine is 100% busy"; "SIMD Engine has stalls"]

Memory Hardware - Memory Controller
- Fully distributed memory interface
- Stacked I/O pad design
- Runs independently of compute and texture units
Highlights:
- Over 100 GB/s memory bandwidth
- Achieved via last generation technology
- Eight 64-bit memory channels
- Kilobit ring bus
- Lower frequencies required
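The bandwidth figure can be related to the channel count with simple arithmetic: eight 64-bit channels form a 512-bit interface, and peak bandwidth is the bus width in bytes times the transfer rate. In the Python sketch below, the transfer rate of 1.65 billion transfers per second is an assumption chosen for illustration and does not appear in the slides; the function name is likewise invented.

```python
def peak_bandwidth_gb_s(channels, channel_width_bits, transfers_per_second):
    """Peak bandwidth in GB/s (1 GB = 1e9 bytes) for a multi-channel bus."""
    bus_bytes = channels * channel_width_bits // 8
    return bus_bytes * transfers_per_second / 1e9


bus_width_bits = 8 * 64  # eight 64-bit channels, from the slide

# ASSUMPTION: 1.65e9 transfers/s is an illustrative data rate, not a
# number taken from this presentation.
bandwidth = peak_bandwidth_gb_s(8, 64, 1.65e9)
print(bus_width_bits, round(bandwidth, 1))
```

Under that assumed rate the result lands just above 100 GB/s, which is consistent with the slide's point that a wide bus reaches this bandwidth at lower memory frequencies.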

Memory Hardware - Texture Unit
- Four 32KB four-way associative L1 caches
- L1 cache size is 4x8KB per SIMD Engine
- Data is split across all four 8KB L1 caches
- L1 cache line is 256 bytes or 4 quads of data
- 256KB unified cache over all SIMDs
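A quick arithmetic check ties these cache numbers together. If a 256-byte line holds 4 quads of data, each of the 16 threads gets 16 bytes, which is the size of one four-component 32-bit value. The Python sketch below works this out; it assumes 1KB = 1024 bytes, and the variable names are invented for illustration.

```python
KB = 1024  # assumption: the slide's KB means 1024 bytes

CACHE_LINE_BYTES = 256
QUADS_PER_LINE = 4
THREADS_PER_QUAD = 4

threads_per_line = QUADS_PER_LINE * THREADS_PER_QUAD
bytes_per_thread = CACHE_LINE_BYTES // threads_per_line

l1_bytes_per_simd = 4 * 8 * KB          # "4x8KB per SIMD Engine"
lines_per_8kb_cache = (8 * KB) // CACHE_LINE_BYTES
total_l1_bytes = 4 * l1_bytes_per_simd  # four SIMD engines

print(threads_per_line, bytes_per_thread, lines_per_8kb_cache, total_l1_bytes // KB)
```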

Memory Hardware - TEX Scheduling
- Run independently of ALU units
- Run on core/engine clocks
- Process multiple wavefronts sequentially to hide latency
- Transfers data from cache to registers
- Latency is predictable for L1 cache hits

Memory Hardware - Tiling
- Multiple tiling formats
- Micro-tiling and macro-tiling
- CAL tiled format is micro-tiled, macro-tiled
  - Quad based hierarchical Z pattern
- CAL linear format is micro-tiled, macro-linear
  - Tiled quad based linear format
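To illustrate what a hierarchical Z pattern means in general, the Python sketch below computes a Z-order (Morton) index by interleaving the bits of the x and y coordinates and contrasts it with a plain row-major index. This is a generic textbook construction, not the hardware's actual address swizzle: the slides do not give the tile sizes or bit order, and both function names are invented for this example.

```python
def morton_index(x, y, bits=16):
    """Interleave the low `bits` bits of x and y into a Z-order index.

    Bit b of x goes to position 2*b and bit b of y to position 2*b + 1,
    so neighbouring 2x2 blocks (quads) stay contiguous in memory.
    """
    index = 0
    for b in range(bits):
        index |= ((x >> b) & 1) << (2 * b)
        index |= ((y >> b) & 1) << (2 * b + 1)
    return index


def linear_index(x, y, width):
    """Row-major index, for comparison with the Z-order layout."""
    return y * width + x


# Visit a 4x4 block and show where each element lands in each layout.
for y in range(4):
    print([morton_index(x, y) for x in range(4)],
          [linear_index(x, y, 4) for x in range(4)])
```

In the Z-order layout the four elements of each 2x2 quad occupy consecutive indices, whereas in row-major order they are split across two rows.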

Memory Hardware - Backends
- Also called ROPs (Raster Operator)
- Outputs data to memory via color registers
- Maximum 8 Outputs
- 4 Backend units
- 256B output width
- 32KB Write cache/unit
- 32 Pixels/Clk
[Figure labels: Memory Controller, DPP Array]

ATI Radeon™ HD 2400 / 2600 Series GPUs
ATI Radeon™ HD 2400 Series GPU
- 40 Stream Processors
- 2 SIMD Engines
- 4 Thread Processors/SIMD
- 1 Texture Unit
- 1 Render Backend
ATI Radeon™ HD 2600 Series GPU
- 120 Stream Processors
- 3 SIMD Engines
- 8 Thread Processors/SIMD
- 2 Texture Units
- 1 Render Backend

Disclaimer & Attribution

DISCLAIMER
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.

AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTION
© 2010 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo, ATI, the ATI Logo, Radeon, and combinations thereof are trademarks of Advanced Micro Devices, Inc. Other names are for informational purposes only and may be trademarks of their respective owners.