1
GPGPUs/DPAs Dezső Sima April 2011 (v1.0, Last updated 04/15/2011)
2
1. Introduction (1) Aim: a brief introduction and overview.
3
Contents
1. Introduction
2. Basics of the SIMT execution
3. Overview of GPGPUs
4. Overview of data parallel accelerators
5. References
4
1. Introduction
5
1. Introduction (2) Representation of objects by triangles (vertices, edges, surfaces)
Vertices have three spatial coordinates and carry the supplementary information necessary to render the object, such as color, texture, reflectance properties, etc.
6
1. Introduction (3) Main types of shaders in GPUs
Vertex shaders: transform each vertex's 3D position in the virtual space to the 2D coordinate at which it appears on the screen.
Geometry shaders: can add or remove vertices from a mesh.
Pixel shaders (fragment shaders): calculate the color of the pixels.
7
1. Introduction (4) Table: Pixel/vertex shader models (SM) supported by subsequent versions of DirectX and MS's OSs [18], [21]

DirectX version | Shader models (pixel/vertex) | Supporting OS
8.0 (11/2000) | Pixel SM 1.0, 1.1 | Windows 2000
8.1 (10/2001) | Pixel SM 1.2, 1.3, 1.4 | Windows XP / Windows Server 2003
9.0 (12/2002) | SM 2.0 | Windows XP / Windows Server 2003
9.0a (3/2003) | Pixel SM 2_A, 2_B; Vertex SM 2.x | Windows XP / Windows Server 2003
9.0c (8/2004) | SM 3.0 | Windows XP SP2
10.0 (11/2006) | SM 4.0 | Windows Vista
10.1 (2/2008) | SM 4.1 | Windows Vista SP1 / Windows Server 2008
11 (in development) | SM 5.0 | -

DirectX: Microsoft's API set for MM/3D
8
1. Introduction (3) Convergence of important features of the vertex and pixel shader models
Subsequent shader models typically introduce a number of new or enhanced features. The vertex and pixel shader models of a given shader model differ in their precision requirements, instruction sets and programming resources.
Shader model 2 [19]: different precision requirements (vertex shader: FP32 (coordinates); pixel shader: FX24 (3 colors x 8)); different instructions; different resources (e.g. registers).
Shader model 3 [19]: unified precision requirements for both shaders (FP32), with the option to specify partial precision (FP16 or FP24) by adding a modifier to the shader code; different instructions; different resources (e.g. registers).
9
1. Introduction (3) Shader model 4 (introduced with DirectX 10) [20]
Unified precision requirements for both shaders (FP32), with the possibility to use new data formats; unified instruction set; unified resources (e.g. temporary and constant registers).
Shader architectures of GPUs prior to SM4: GPUs prior to SM4 (DirectX 10) have separate vertex and pixel units with different features.
Drawback of having separate units for vertex and pixel shading: inefficiency of the hardware implementation, since vertex shaders and pixel shaders often have complementary load patterns [21].
10
1. Introduction (5) Unified shader model (introduced in SM 4.0 of DirectX 10.0)
Unified, programmable shader architecture: the same (programmable) processor can be used to implement all shaders: the vertex shader, the pixel shader and the geometry shader (a new feature of SM 4).
11
1. Introduction (6) Figure: Principle of the unified shader architecture [22]
12
1. Introduction (7) Based on its FP32 computing capability and the large number of FP units available, the unified shader is a prospective candidate for speeding up HPC! GPUs with unified shader architectures are also termed GPGPUs (General Purpose GPUs) or cGPUs (computational GPUs).
13
1. Introduction (8) Peak FP32/FP64 performance of Nvidia's GPUs vs. Intel's P4 and Core2 processors [43]
14
1. Introduction (9) Evolution of the FP32 performance of GPGPUs [44]
15
1. Introduction (9) Evolution of the bandwidth of Nvidia's GPUs vs. Intel's P4 and Core2 processors [43]
16
1. Introduction (10) Figure: Contrasting the utilization of the silicon area in CPUs and GPUs [11]
17
1. Introduction (9) Background slides to Introduction
18
1. Introduction Figure: Peak SP FP performance of Nvidia's GPUs vs. Intel's P4 and Core2 processors [11]
19
1. Introduction Figure: Bandwidth values of Nvidia's GPUs vs. Intel's P4 and Core2 processors [11]
20
2. Basics of the SIMT execution
21
2. Basics of the SIMT execution (1) Main alternatives of data parallel execution
SIMD execution: one-dimensional data parallel execution, i.e. it performs the same operation on all elements of given FX/FP input vectors. Needs an FX/FP SIMD extension of the ISA. E.g. 2nd and 3rd generation superscalars.
SIMT execution: two-dimensional data parallel execution, i.e. it performs the same operation on all elements of given FX/FP input arrays (matrices); it is massively multithreaded and provides data-dependent flow control as well as barrier synchronization. Needs an FX/FP SIMT extension of the ISA and the API. E.g. GPGPUs, data parallel accelerators.
Figure: Main alternatives of data parallel execution
22
2. Basics of the SIMT execution (2)
Scalar, SIMD and SIMT execution
Scalar execution: the domain of execution is single data elements.
SIMD execution: the domain of execution is the elements of vectors.
SIMT execution: the domain of execution is the elements of matrices (at the programming level).
Figure: Domains of execution in case of scalar, SIMD and SIMT execution
Remark: SIMT execution is also termed SPMD (Single Program Multiple Data) execution (Nvidia).
23
2. Basics of the SIMT execution (3)
Key components of the implementation of SIMT execution:
Data parallel execution
Massive multithreading
Data dependent flow control
Barrier synchronization
24
2. Basics of the SIMT execution (4)
Data parallel execution is performed by SIMT cores. A SIMT core executes the same instruction stream on a number of ALUs (i.e. all ALUs of a SIMT core typically perform the same operation).
Figure: Basic layout of a SIMT core (a common Fetch/Decode unit feeding a set of ALUs)
SIMT cores are the basic building blocks of GPGPUs and data parallel accelerators. During SIMT execution, 2-dimensional matrices are mapped to blocks of SIMT cores, as sketched below.
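A minimal CUDA sketch of this mapping (an assumed illustration, not taken from the slides; the kernel and parameter names are made up): every thread runs the same instruction stream, each on one element of a 2D matrix.

// Minimal CUDA sketch: each thread operates on one matrix element.
__global__ void scaleMatrix(float *m, int width, int height, float s)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // column index
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // row index
    if (x < width && y < height)
        m[y * width + x] *= s;                       // same operation everywhere
}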
25
2. Basics of the SIMT execution (5)
Remark 1: Different manufacturers designate SIMT cores differently, e.g. streaming multiprocessor (Nvidia), superscalar shader processor (AMD), wide SIMD processor or CPU core (Intel).
26
2. Basics of the SIMT execution (6)
Each ALU is allocated a working register set (RF).
Figure: Main functional blocks of a SIMT core (each ALU with its own RF)
27
2. Basics of the SIMT execution (7)
SIMT ALUs typically perform RRR operations, that is, the ALUs take their operands from, and write the calculated results to, the register set (RF) allocated to them.
Figure: Principle of operation of the SIMT ALUs
28
2. Basics of the SIMT execution (8)
Remark 2: Actually, the register sets (RF) allocated to the individual ALUs are given parts of a single, large enough register file.
Figure: Allocation of distinct parts of a large register set as workspaces of the ALUs
29
2. Basics of the SIMT execution (9)
Basic operation of recent SIMT ALUs. They
execute basically SP FP MADD (single precision, i.e. 32-bit, multiply-add) instructions of the form a×b+c,
are pipelined, i.e. capable of starting a new operation every new clock cycle (more precisely, every shader clock cycle); thus, without further enhancements, their peak performance is 2 SP FP operations/cycle,
need a few clock cycles, e.g. 2 or 4 shader cycles, to present the results of the SP FMADD operations to the RF.
A minimal code sketch follows below.
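In CUDA C the MADD pattern can be written explicitly; a sketch (kernel name made up; fmaf() is the standard CUDA device function for a single-precision multiply-add, though on pre-Fermi hardware the MAD instruction is not an IEEE-fused operation, so this is only an approximation of what the compiler emits):

// Sketch: one SP multiply-add (a*b+c) per thread; one MADD counts as
// 2 FP operations when quoting peak GFLOPS figures.
__global__ void madd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = fmaf(a[i], b[i], c[i]);   // a*b + c in one instruction
}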
30
2. Basics of the SIMT execution (10)
Additional operations provided by SIMT ALUs:
FX operations and FX/FP conversions,
DP FP operations,
trigonometric functions (usually supported by special functional units).
31
2. Basics of the SIMT execution (11)
Massive multithreading
Aim: to speed up computations by increasing the utilization of available computing resources in case of stalls (e.g. due to cache misses).
Principle: suspend stalled threads from execution and allocate ready-to-run threads for execution. When a large enough number of threads is available, long stalls can be hidden.
32
2. Basics of the SIMT execution (12)
Multithreading is implemented by creating and managing parallel executable threads for each data element of the execution domain (the same instructions for all data elements).
Figure: Parallel executable threads for each element of the execution domain
33
2. Basics of the SIMT execution (13)
Multithreading is implemented effectively if thread switches, called context switches, do not cause cycle penalties. This is achieved by providing separate contexts (register space) for each thread, and by implementing a zero-cycle context switch mechanism.
34
2. Basics of the SIMT execution (14)
Figure: Providing separate thread contexts (CTX) in the register file (RF) for each thread allocated for execution on a SIMT core; a context switch merely selects the actual context.
35
2. Basics of the SIMT execution (15)
Data dependent flow control
Implemented by SIMT branch processing. In SIMT processing both paths of a branch are executed one after the other, such that for each path the prescribed operations are executed only on those data elements which fulfill the data condition given for that path (e.g. xi > 0). An example follows below.
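A minimal CUDA sketch of such a data-dependent branch (hypothetical kernel, assuming the condition x[i] > 0 from above): the hardware serializes the two paths, masking off the threads whose data does not satisfy the respective condition.

__global__ void branchy(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (x[i] > 0.0f)
        x[i] = sqrtf(x[i]);   // first pass: only threads with x[i] > 0 are active
    else
        x[i] = 0.0f;          // second pass: only the remaining threads are active
}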
36
2. Basics of the SIMT execution (16)
Figure: Execution of branches [24]; the given condition is checked separately for each thread.
37
2. Basics of the SIMT execution (17)
First, all ALUs meeting the condition execute the prescribed three operations; then all ALUs missing the condition execute the next two operations.
Figure: Execution of branches [24]
38
2. Basics of the SIMT execution (18)
Figure: Resuming instruction stream processing after executing a branch [24]
39
2. Basics of the SIMT execution (19)
Barrier synchronization: makes all threads wait until all of them have completed all prior instructions, before any of them executes the next instruction. Implemented e.g. in AMD's Intermediate Language (IL) by the fence threads instruction [10].
Remark: In the R600 ISA this instruction is encoded by setting the BARRIER field of the Control Flow (CF) instruction format [7].
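In CUDA C the per-block barrier is the __syncthreads() intrinsic; a minimal sketch (assumed example: launched with one block of n threads and n*sizeof(float) bytes of dynamic shared memory):

// Reverse an array via shared memory; the barrier guarantees that all
// stores to s[] have completed before any thread reads from it.
__global__ void reverse(float *d, int n)
{
    extern __shared__ float s[];
    int t = threadIdx.x;
    s[t] = d[t];
    __syncthreads();          // barrier synchronization within the thread block
    d[t] = s[n - 1 - t];
}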
40
2. Basics of the SIMT execution (20)
Principle of SIMT execution, assuming serial kernel processing (host/device): each kernel invocation executes all thread blocks (Block(i,j)) belonging to the related grid.
Remark: In the figure CUDA terminology is used.
Figure: Hierarchy of threads [25]
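A host-side sketch of one such kernel invocation in CUDA terms (reusing the hypothetical scaleMatrix kernel sketched in Section 2; the grid/block dimensions are illustrative):

// Launch one grid of Block(i,j) thread blocks covering the matrix.
void launchScale(float *d_m, int width, int height)
{
    dim3 block(16, 16);                                  // 256 threads per block
    dim3 grid((width + 15) / 16, (height + 15) / 16);    // one block per 16x16 tile
    scaleMatrix<<<grid, block>>>(d_m, width, height, 2.0f);
    cudaDeviceSynchronize();                             // host waits for the whole grid
}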
41
2. Basics of the SIMT execution (21)
Remark: Parallel kernel processing is also possible, assuming advanced GPGPU devices (such as Nvidia's Fermi or AMD's HD 69xx GPGPUs) and appropriate software support.
42
3. Overview of GPGPUs
43
3. Overview of GPGPUs (1) Basic implementation alternatives of the SIMT execution
GPGPUs: programmable GPUs with appropriate programming environments; have display outputs. E.g. Nvidia's 8800 and GTX lines, AMD's HD 38xx and HD 48xx lines.
Data parallel accelerators: dedicated units supporting data parallel execution, with appropriate programming environments; have no display outputs, but larger memories than GPGPUs. E.g. Nvidia's Tesla lines, AMD's FireStream lines.
Figure: Basic implementation alternatives of the SIMT execution
44
3. Overview of GPGPUs (2) GPGPU lines
Nvidia: G80 (90 nm) -> shrink -> G92 (65 nm) -> enhanced arch. -> G200 (65 nm) -> shrink + enhanced arch. -> GF100 (Fermi, 40 nm).
AMD/ATI: R600 (80 nm) -> shrink -> RV670 (55 nm) -> enhanced arch. -> RV770 (55 nm) -> shrink + enhanced arch. -> RV870 (40 nm) -> enhanced arch. -> Cayman (40 nm).
Figure: Overview of Nvidia's and AMD/ATI's GPGPU lines
45
3. Overview of GPGPUs (3)
Nvidia:
G80 (11/06, 90 nm, 681 mtrs): 8800 GTS (96 ALUs, 320-bit), 8800 GTX (128 ALUs, 384-bit)
G92 (10/07, 65 nm, 754 mtrs): 8800 GT (112 ALUs, 256-bit)
GT200 (6/08, 65 nm, 1400 mtrs): GTX260 (192 ALUs, 448-bit), GTX280 (240 ALUs, 512-bit)
CUDA: Version 1.0 (6/07), Version 1.1 (11/07), Version 2.0 (6/08), Version 2.1 (11/08); OpenCL standard (12/08)
AMD/ATI:
R500 (11/05, 48 ALUs): Xbox
R600 (5/07, 80 nm, 681 mtrs): HD 2900XT (320 ALUs, 512-bit)
R670 (11/07, 55 nm, 666 mtrs): HD 3850, HD 3870 (320 ALUs, 256-bit)
RV770 (5/08, 55 nm, 956 mtrs): HD 4850, HD 4870 (800 ALUs, 256-bit)
Brook+ (11/07, SDK v.1.0), Brook+ 1.2 (9/08, SDK v.1.2), Brook+ 1.3 (12/08, SDK v.1.3); OpenCL standard (12/08); RapidMind (support for the 3870 from 6/08)
Figure: Overview of GPGPUs and their basic software support (1)
46
3. Overview of GPGPUs (4)
Nvidia:
GF100 (Fermi) (3/10, 40 nm, 3000 mtrs): GTX 470 (448 ALUs, 320-bit), GTX 480 (480 ALUs, 384-bit)
GF104 (Fermi) (7/10, 40 nm, 1950 mtrs): GTX 460 (336 ALUs, 192/256-bit)
GF110 (Fermi) (11/10, 40 nm, 3000 mtrs): GTX 580 (512 ALUs, 384-bit), GTX 570 (480 ALUs, 320-bit), GTX 560 Ti (1/11)
OpenCL: OpenCL 1.0 (6/09, SDK 1.0 early release; 10/09, SDK 1.0), OpenCL 1.1 (6/10, SDK 1.1)
CUDA: Version 2.2 (5/09), 2.3 (6/09), 3.0 (3/10), 3.1 (6/10), 3.2 (1/11), 4.0 Beta (3/11)
AMD/ATI:
RV870 (Cypress) (9/09, 40 nm, 2100 mtrs): HD 5850/70 (1440/1600 ALUs, 256-bit)
Barts Pro/XT (10/10, 40 nm, 1700 mtrs): HD 6850/70 (960/1120 ALUs, 256-bit)
Cayman Pro/XT (12/10, 40 nm, 2640 mtrs): HD 6950/70 (1408/1536 ALUs, 256-bit)
Brook+: Brook+ 1.4 (3/09, SDK v.1.4 Beta)
OpenCL: OpenCL 1.0 (11/09, SDK v.2.0; 03/10, SDK v.2.01), OpenCL 1.1 (08/10, SDK v.2.2)
RapidMind: Intel bought RapidMind (8/09)
Figure: Overview of GPGPUs and their basic software support (2)
47
3. Overview of GPGPUs (5) Remarks on AMD-based graphics cards [45], [66]
Beginning with their Cypress-based HD 5xxx line and SDK v.2.0, AMD abandoned Brook+ and started supporting OpenCL as the basis of their high-level GPGPU programming.
As a consequence, AMD also changed both the microarchitecture of their GPGPUs (by introducing Local and Global Data Share memories) and their terminology, by introducing pre-OpenCL and OpenCL terminology, as discussed in Section 5.2.
48
3. Overview of GPGPUs (6) Remarks on Fermi-based graphics cards [45], [66]
FP64 speed: 1/2 of the FP32 speed for the Tesla 20-series; 1/8 of the FP32 speed for the GeForce GTX 470/480/570/580 cards; 1/12 for other GeForce GTX 4xx cards.
ECC: available only on the Tesla 20-series.
Number of DMA engines: the Tesla 20-series has 2 DMA engines (copy engines); GeForce cards have 1 DMA engine. This means that CUDA applications can overlap computation and communication on Tesla, using bi-directional communication over PCIe, as sketched below.
Memory size: Tesla 20 products have larger on-board memory (3 GB and 6 GB).
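A host-side CUDA sketch of what the two copy engines enable (hypothetical buffer and kernel names; the host buffers h_in/h_out must be page-locked, e.g. allocated with cudaHostAlloc(), for the copies to be truly asynchronous):

cudaStream_t s0, s1;
cudaStreamCreate(&s0);
cudaStreamCreate(&s1);
cudaMemcpyAsync(d_in, h_in, bytes, cudaMemcpyHostToDevice, s0);    // upload next chunk
process<<<grid, block, 0, s1>>>(d_prev, n);                        // compute on previous chunk
cudaMemcpyAsync(h_out, d_prev, bytes, cudaMemcpyDeviceToHost, s1); // download its result
cudaStreamSynchronize(s0);
cudaStreamSynchronize(s1);

With two DMA engines the upload in s0 and the download in s1 can proceed simultaneously with the kernel; with a single engine (GeForce) the two copies are serialized.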
49
3. Overview of GPGPUs (7) Positioning Nvidia’s discussed GPGPU cards in their entire product portfolio [82]
50
3. Overview of GPGPUs (8) Nvidia’s compute capability concept
Nvidia manages this continuous evolution by a) defining sets of capabilities and features, designated as compute capability versions, b) specifying which compute capability version is supported by their programming environments (represented by their SDKs) and by their GPGPU lines, and c) specifying compatibility rules among them.
51
3. Overview of GPGPUs (9) a) Sets of compute capability versions defined by Nvidia (1) [81]
52
3. Overview of GPGPUs (10) a) Sets of compute capability versions defined by Nvidia (2) [81]
53
3. Overview of GPGPUs (11)
b1) Compute capability versions of the PTX ISAs generated by different releases of CUDA SDKs [50]
54
3. Overview of GPGPUs (12) b2) Support of the compute capability versions by Nvidia's GPGPU cards [81]

Capability | GPGPU cores | GPGPU devices
1.0 | G80 | GeForce 8800GTX/Ultra/GTS, Tesla C/D/S870, FX4/5600, 360M
1.1 | G86, G84, G98, G96, G96b, G94, G94b, G92, G92b | GeForce 8400GS/GT, 8600GT/GTS, 8800GT/GTS, 9600GT/GSO, 9800GT/GTX/GX2, GTS 250, GT 120/30, FX 4/570, 3/580, 17/18/3700, 4700x2, 1xxM, 32/370M, 3/5/770M, 16/17/27/28/36/37/3800M, NVS420/50
1.2 | GT218, GT216, GT215 | GeForce 210, GT 220/40, FX380 LP, 1800M, 370/380M, NVS 2/3100M
1.3 | GT200, GT200b | GTX 260/75/80/85, 295, Tesla C/M1060, S1070, CX, FX 3/4/5800
2.0 | GF100, GF110 | GTX 465, 470/80, GTX 570, GTX 580, Tesla C2050/70, S/M2050/70, Quadro 600, 4/5/6000, Plex 7000
2.1 | GF108, GF106, GF104, GF114 | GT 420/30/40, GTS 450, GTX 460, 500M
55
3. Overview of GPGPUs (13) c) Compatibility rules related to compute capability versions [50]
The basic rule is forward compatibility within the main versions (versions 1.x and 2.x), but not across main versions. This is interpreted as follows: object files (called CUBIN files) compiled for a particular compute capability are supported on all devices having the same or a higher version number within the same main version. E.g. object files compiled for compute capability 1.0 are supported on all 1.x devices, but not on compute capability 2.0 (Fermi) devices. For more details see [52].
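At run time the compute capability of a device can be queried through the CUDA runtime API, e.g. to select the appropriate code path; a minimal sketch (device 0 assumed):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);                  // query device 0
    printf("compute capability %d.%d\n", prop.major, prop.minor);
    if (prop.major >= 2) { /* Fermi (2.x) code path */ }
    return 0;
}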
56
3. Overview of GPGPUs (14) Table: Main features of Nvidia’s GPGPUs-1
| 8800 GTS | 8800 GTX | 8800 GT | GTX 260 | GTX 280
Core | G80 | G80 | G92 | GT200 | GT200
Introduction | 11/06 | 11/06 | 10/07 | 6/08 | 6/08
IC technology | 90 nm | 90 nm | 65 nm | 65 nm | 65 nm
No. of transistors | 681 mtrs | 681 mtrs | 754 mtrs | 1400 mtrs | 1400 mtrs
Die area | 480 mm2 | 480 mm2 | 324 mm2 | 576 mm2 | 576 mm2
Core frequency | 500 MHz | 575 MHz | 600 MHz | 576 MHz | 602 MHz
No. of SMs (cores) | 12 | 16 | 14 | 24 | 30
No. of FP32 EUs | 96 | 128 | 112 | 192 | 240
Shader frequency | 1.2 GHz | 1.35 GHz | 1.512 GHz | 1.242 GHz | 1.296 GHz
No. of FP32 operations/cycle | 2 | 2 | 3 (1) | 3 (1) | 3 (1)
Peak FP32 performance | 230.4 GFLOPS | 345.6 GFLOPS | 508 GFLOPS | 715 GFLOPS | 933 GFLOPS
Peak FP64 performance | - | - | - | 59.62 GFLOPS | 77.76 GFLOPS
Mem. transfer rate (eff.) | 1600 Mb/s | 1800 Mb/s | 1800 Mb/s | 1998 Mb/s | 2214 Mb/s
Mem. interface | 320-bit | 384-bit | 256-bit | 448-bit | 512-bit
Mem. bandwidth | 64 GB/s | 86.4 GB/s | 57.6 GB/s | 111.9 GB/s | 141.7 GB/s
Mem. size | 320 MB | 768 MB | 512 MB | 896 MB | 1.0 GB
Mem. type | GDDR3 (all models)
Mem. channels | 5*64-bit | 6*64-bit | 4*64-bit | 7*64-bit | 8*64-bit
Multi-GPU techn. | SLI (all models)
Interface | PCIe x16 | PCIe x16 | PCIe 2.0 x16 | PCIe 2.0 x16 | PCIe 2.0 x16
MS DirectX | 10 | 10 | 10 | 10.1 subset | 10.1 subset
TDP | 146 W | 155 W | 105 W | 182 W | 236 W

(1): Nvidia also takes the FP32-capable texture processing units into consideration and calculates with 3 FP32 operations/cycle.
57
3. Overview of GPGPUs (15) Remarks
In publications there are conflicting statements about whether or not the G80 makes use of dual issue (a MAD and a MUL operation) within a period of four shader cycles. Official specifications [22] declare dual-issue capability, but other literature sources [64], and even a textbook co-authored by one of the chief developers of the G80 (D. Kirk [65]), deny it. A clarification can be found in a blog [66], revealing that the higher figure given in Nvidia's specifications includes calculations made both by the ALUs in the SMs and by the texture processing units (TPUs). Nevertheless, the TPUs cannot be directly accessed by CUDA except for graphical tasks, such as texture filtering. Accordingly, in our discussion focusing on numerical calculations it is fair to take only the MAD operations into account when specifying the peak numerical performance.
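As a worked check of the figures in the table above (values taken from that table):

Peak FP32 performance = no. of FP32 EUs × shader frequency × FP32 operations/cycle
8800 GTS (MAD only): 96 × 1.2 GHz × 2 = 230.4 GFLOPS
GTX 280 (MAD plus the contested MUL counted): 240 × 1.296 GHz × 3 = 933 GFLOPS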
58
3. Overview of GPGPUs (16) Structure of an SM of the G80 architecture
Texture processing units, consisting of TA (Texture Address) units and TF (Texture Filter) units. They are FP32 or FP16 capable [46].
59
3. Overview of GPGPUs (17) Table: Main features of Nvidia’s GPGPUs-2
| GTX 470 | GTX 480 | GTX 460 | GTX 570 | GTX 580
Core | GF100 | GF100 | GF104 | GF110 | GF110
Introduction | 3/10 | 3/10 | 7/10 | 12/10 | 11/10
IC technology | 40 nm (all models)
No. of transistors | 3200 mtrs | 3200 mtrs | 1950 mtrs | 3000 mtrs | 3000 mtrs
Die area | 529 mm2 | 529 mm2 | 367 mm2 | 520 mm2 | 520 mm2
Core frequency | 607 MHz | 700 MHz | 675 MHz | 732 MHz | 772 MHz
No. of SMs (cores) | 14 | 15 | 7 | 15 | 16
No. of FP32 EUs | 448 | 480 | 336 | 480 | 512
Shader frequency | 1215 MHz | 1401 MHz | 1350 MHz | 1464 MHz | 1544 MHz
No. of FP32 operations/cycle | 2 (for the GF104 a value of 3 is also quoted)
Peak FP32 performance | 1088 GFLOPS | 1345 GFLOPS | 907.2 GFLOPS | 1405 GFLOPS | 1581 GFLOPS
Peak FP64 performance | 136 GFLOPS | 168 GFLOPS | 75.6 GFLOPS | 175.6 GFLOPS | 197.6 GFLOPS
Mem. transfer rate (eff.) | 3348 Mb/s | 3698 Mb/s | 3600 Mb/s | 3800 Mb/s | 4008 Mb/s
Mem. interface | 320-bit | 384-bit | 192/256-bit | 320-bit | 384-bit
Mem. bandwidth | 133.9 GB/s | 177.4 GB/s | 86.4/115.2 GB/s | 152 GB/s | 192.4 GB/s
Mem. size | 1.28 GB | 1.536 GB | 0.768/1.024 GB | 1.28 GB | 1.536/3.072 GB
Mem. type | GDDR5 (all models)
Mem. channels | 5*64-bit | 6*64-bit | 3/4*64-bit | 5*64-bit | 6*64-bit
Multi-GPU techn. | SLI (all models)
Interface | PCIe 2.0 x16 (all models)
MS DirectX | 11 (all models)
TDP | 215 W | 250 W | 150/160 W | 219 W | 244 W
60
3. Overview of GPGPUs (18) Remarks
1) The GDDR3 memory has a double-clocked data transfer: effective memory transfer rate = 2 × memory frequency. The GDDR5 memory has a quad-clocked data transfer: effective memory transfer rate = 4 × memory frequency.
2) Both GDDR3 and GDDR5 memories are 32-bit devices. Nevertheless, memory controllers of GPGPUs may be designed either to control a single 32-bit memory channel or dual memory channels, providing a 64-bit channel width.
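The bandwidth figures in the tables follow directly from these rates (a worked example, using values from the table above):

Mem. bandwidth = effective transfer rate × interface width / 8
GTX 580 (GDDR5): 4008 Mb/s × 384 bit / 8 = 192.4 GB/s
8800 GTX (GDDR3, double clocked): 2 × 900 MHz × 384 bit / 8 = 86.4 GB/s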
61
3. Overview of GPGPUs (19) Examples for Nvidia cards
Nvidia GeForce GTX 480 (GF 100 based) [47]
62
3. Overview of GPGPUs (20) Nvidia GeForce GTX 480 and 580 cards [77]
GTX 480 (GF100 based), GTX 580 (GF110 based)
63
3. Overview of GPGPUs (21) A pair of GeForce GTX 480 cards [47]
(GF100 based)
64
3. Overview of GPGPUs (22) Table: Main features of AMD/ATIs GPGPUs-1
| HD 2900XT | HD 3850 | HD 3870 | HD 4850 | HD 4870
Core | R600 | R670 | R670 | RV770 (R700-based) | RV770 (R700-based)
Introduction | 5/07 | 11/07 | 11/07 | 5/08 | 5/08
IC technology | 80 nm | 55 nm | 55 nm | 55 nm | 55 nm
No. of transistors | 700 mtrs | 666 mtrs | 666 mtrs | 956 mtrs | 956 mtrs
Die area | 408 mm2 | 192 mm2 | 192 mm2 | 260 mm2 | 260 mm2
Core frequency | 740 MHz | 670 MHz | 775 MHz | 625 MHz | 750 MHz
No. of ALUs | 320 | 320 | 320 | 800 | 800
No. of FP32 operations/cycle | 2 (all models)
Peak FP32 performance | 471.6 GFLOPS | 429 GFLOPS | 496 GFLOPS | 1000 GFLOPS | 1200 GFLOPS
Peak FP64 performance | - | - | - | 200 GFLOPS | 240 GFLOPS
Mem. transfer rate (eff.) | 1600 Mb/s | 1660 Mb/s | 2250 Mb/s | 2000 Mb/s | 3600 Mb/s (GDDR5)
Mem. interface | 512-bit | 256-bit | 256-bit | 256-bit | 256-bit
Mem. bandwidth | 105.6 GB/s | 53.1 GB/s | 72.0 GB/s | 64 GB/s | 118 GB/s
Mem. size | 512 MB | 256 MB | 512 MB | 512 MB | 512 MB
Mem. type | GDDR3 | GDDR3 | GDDR4 | GDDR3 | GDDR3/GDDR5
Mem. channels | 8*64-bit | 8*32-bit | 8*32-bit | 4*64-bit | 4*64-bit
Mem. controller | Ring bus | Ring bus | Ring bus | Crossbar | Crossbar
Multi-GPU techn. | CrossFire X (all models)
Interface | PCIe x16 | PCIe 2.0 x16 | PCIe 2.0 x16 | PCIe 2.0 x16 | PCIe 2.0 x16
MS DirectX | 10 | 10.1 | 10.1 | 10.1 | 10.1
TDP Max. | - | 75 W | 105 W | 110 W | 150 W
65
3. Overview of GPGPUs (23) Table: Main features of AMD/ATI's GPGPUs-2

Evergreen series | HD 5850 | HD 5870 | HD 5970
Core | Cypress PRO (RV870-based) | Cypress XT (RV870-based) | Hemlock XT (RV870-based)
Introduction | 9/09 | 9/09 | 11/09
IC technology | 40 nm (all models)
No. of transistors | 2154 mtrs | 2154 mtrs | 2*2154 mtrs
Die area | 334 mm2 | 334 mm2 | 2*334 mm2
Core frequency | 725 MHz | 850 MHz | 725 MHz
No. of SIMD cores / VLIW5 ALUs | 18/16 | 20/16 | 2*20/16
No. of EUs | 1440 | 1600 | 2*1600
No. of FP32 instructions/cycle | 2 (all models)
Peak FP32 performance | 2088 GFLOPS | 2720 GFLOPS | 4640 GFLOPS
Peak FP64 performance | 417.6 GFLOPS | 544 GFLOPS | 928 GFLOPS
Mem. transfer rate (eff.) | 4000 Mb/s | 4800 Mb/s | 4000 Mb/s
Mem. interface | 256-bit | 256-bit | 2*256-bit
Mem. bandwidth | 128 GB/s | 153.6 GB/s | 2*128 GB/s
Mem. size | 1.0 GB | 1.0/2.0 GB | 2*(1.0/2.0) GB
Mem. type | GDDR5 (all models)
Mem. channels | 8*32-bit | 8*32-bit | 2*8*32-bit
Multi-GPU techn. | CrossFire X (all models)
Interface | PCIe 2.1 x16 (all models)
MS DirectX | 11 (all models)
TDP Max./Idle | 151/27 W | 188/27 W | 294/51 W
66
3. Overview of GPGPUs (24) Table: Main features of AMD/ATI’s GPGPUs-3
Northern Islands series | HD 6850 | HD 6870
Core | Barts Pro | Barts XT
Introduction | 10/10 | 10/10
IC technology | 40 nm
No. of transistors | 1700 mtrs
Die area | 255 mm2
Core frequency | 775 MHz | 900 MHz
No. of SIMD cores / VLIW5 ALUs | 12/16 | 14/16
No. of EUs | 960 | 1120
No. of FP32 instructions/cycle | 2
Peak FP32 performance | 1488 GFLOPS | 2016 GFLOPS
Peak FP64 performance | - | -
Mem. transfer rate (eff.) | 4000 Mb/s | 4200 Mb/s
Mem. interface | 256-bit
Mem. bandwidth | 128 GB/s | 134.4 GB/s
Mem. size | 1 GB
Mem. type | GDDR5
Mem. channels | 8*32-bit
Multi-GPU techn. | CrossFire X
Interface | PCIe 2.1 x16
MS DirectX | 11
TDP Max./Idle | 127/19 W | 151/19 W
67
3. Overview of GPGPUs (25) Table: Main features of AMD/ATIs GPGPUs-4
Northern Islands series | HD 6950 | HD 6970 | HD 6990 | HD 6990 unlocked
Core | Cayman Pro | Cayman XT | Antilles | Antilles
Introduction | 12/10 | 12/10 | 3/11 | 3/11
IC technology | 40 nm (all models)
No. of transistors | 2.64 billion | 2.64 billion | 2*2.64 billion | 2*2.64 billion
Die area | 389 mm2 | 389 mm2 | 2*389 mm2 | 2*389 mm2
Core frequency | 800 MHz | 880 MHz | 830 MHz | 880 MHz
No. of SIMD cores / VLIW4 ALUs | 22/16 | 24/16 | 2*24/16 | 2*24/16
No. of EUs | 1408 | 1536 | 2*1536 | 2*1536
No. of FP32 instructions/cycle/ALU | 4 (all models)
Peak FP32 performance | 2.25 TFLOPS | 2.7 TFLOPS | 5.1 TFLOPS | 5.4 TFLOPS
Peak FP64 performance | 0.563 TFLOPS | 0.683 TFLOPS | 1.275 TFLOPS | 1.35 TFLOPS
Mem. transfer rate (eff.) | 5000 Mb/s | 5500 Mb/s | 5000 Mb/s | 5000 Mb/s
Mem. interface | 256-bit | 256-bit | 2*256-bit | 2*256-bit
Mem. bandwidth | 160 GB/s | 176 GB/s | 2*160 GB/s | 2*160 GB/s
Mem. size | 2 GB | 2 GB | 2*2 GB | 2*2 GB
Mem. type | GDDR5 (all models)
Mem. channels | 8*32-bit | 8*32-bit | 2*8*32-bit | 2*8*32-bit
ECC | - (all models)
Multi-GPU techn. | CrossFireX (all models)
Interface | PCIe 2.1 x16 (all models)
MS DirectX | 11 (all models)
TDP Max./Idle | 200/20 W | 250/20 W | 350/37 W | 415/37 W
68
3. Overview of GPGPUs (26) Remark
The Radeon HD 5xxx line of cards is also designated as the Evergreen series, and the Radeon HD 6xxx line as the Northern Islands series.
69
3. Overview of GPGPUs (27) Examples for AMD cards
HD 5870 (RV870 based) [41]
70
3. Overview of GPGPUs (28) HD 5970 (actually RV870 based) [80]
ATI HD 5970: 2 x ATI HD 5870 with slightly reduced memory clock
71
3. Overview of GPGPUs (29) HD 5970 (actually RV870 based) [79]
ATI HD 5970: 2 x ATI HD 5870 with slightly reduced memory clock
72
3. Overview of GPGPUs (30) AMD HD 6990 (actually Cayman based) [78]
AMD HD 6990: 2 x ATI HD 6970 with slightly reduced memory and shader clock
73
3. Overview of GPGPUs (31) Price relations (as of 01/2011) Nvidia
GTX ~ 350 $ GTX ~ 500 $ AMD HD ~ 400 $ HD ~ 700 $ (Dual 6970)
74
4. Overview of data parallel accelerators
75
4. Overview of data parallel accelerators (1)
Implementation alternatives of data parallel accelerators
On-card implementation (recent implementations): e.g. GPU cards, data parallel accelerator cards.
On-die integration (emerging implementations, 2010/2011): e.g. Intel's Havendale, Intel's Sandy Bridge (2011), AMD's Torrenza integration technology, AMD's Fusion (2008) integration technology.
Trend: from on-card implementations towards on-die integration.
Figure: Implementation alternatives of dedicated data parallel accelerators
76
4. Overview of data parallel accelerators (2)
On-card accelerators
Card implementations: single cards fitting into a free PCIe x16 slot of the host computer. E.g. Nvidia Tesla C870, C1060, C2070; AMD FireStream 9170, 9250, 9370.
Desktop implementations: usually dual cards mounted into a box, connected to an adapter card that is inserted into a free PCIe x16 slot of the host PC through a cable. E.g. Nvidia Tesla D870.
1U server implementations: usually 4 cards mounted into a 1U server rack, connected to two adapter cards that are inserted into two free PCIe x16 slots of a server through two switches and two cables. E.g. Nvidia Tesla S870, S1070, S2050/S2070.
Figure: Implementation alternatives of on-card accelerators
77
4. Overview of data parallel accelerators (3)
Nvidia Tesla-1
G80-based:
C870 card (6/07): 1.5 GB GDDR3, SP: 345.6 GFLOPS, DP: -
D870 desktop (6/07): 2*C870 incl., 3 GB GDDR3, DP: -
S870 1U server (6/07): 4*C870 incl., 6 GB GDDR3, SP: 1382 GFLOPS, DP: -
GT200-based:
C1060 card (6/08): 4 GB GDDR3
S1070 1U server (6/08): 4*C1060, 16 GB GDDR3, SP: 3732 GFLOPS
CUDA: Version 1.0 (6/07), Version 1.01 (11/07), Version 2.0 (6/08)
Figure: Overview of Nvidia's G80/G200-based Tesla family-1
78
4. Overview of data parallel accelerators (4)
FB: Frame Buffer Figure: Main functional units of Nvidia’s Tesla C870 card [2]
79
4. Overview of data parallel accelerators (5)
Figure: Nvidia's Tesla C870 and AMD's FireStream 9170 cards [2], [3]
80
4. Overview of data parallel accelerators (6)
Figure: Tesla D870 desktop implementation [4]
81
4. Overview of data parallel accelerators (7)
Figure: Nvidia’s Tesla D870 desktop implementation [4]
82
4. Overview of data parallel accelerators (8)
Figure: PCI-E x16 host adapter card of Nvidia’s Tesla D870 desktop [4]
83
4. Overview of data parallel accelerators (9)
Figure: Concept of Nvidia’s Tesla S870 1U rack server [5]
84
4. Overview of data parallel accelerators (10)
Figure: Internal layout of Nvidia’s Tesla S870 1U rack [6]
85
4. Overview of data parallel accelerators (11)
Figure: Connection cable between Nvidia’s Tesla S870 1U rack and the adapter cards inserted into PCI-E x16 slots of the host server [6]
86
4. Overview of data parallel accelerators (12)
Nvidia Tesla-2, GF100 (Fermi)-based
C2050/C2070 card (11/09): 3/6 GB GDDR5, SP: 1.03 TFLOPS (1), DP: 0.515 TFLOPS
M2050/M2070 module (04/10): 3/6 GB GDDR5; M2070Q (08/10): 6 GB GDDR5
S2050/S2070 1U server (11/09): 4*C2050/C2070, 12/24 GB GDDR5, SP: 4.1 TFLOPS, DP: 2.06 TFLOPS
CUDA: Version 2.2 (5/09), 2.3 (6/09), 3.0 (3/10), 3.1 (6/10), 3.2 (1/11)
OpenCL: OpenCL 1.1 (6/10)
(1): Without SF (Special Function) operations
Figure: Overview of Nvidia's GF100 (Fermi)-based Tesla family
87
4. Overview of data parallel accelerators (13)
Fermi-based Tesla devices
Tesla C2050/C2070 card [71] (11/2009): single GPU, 3/6 GB GDDR5, 515 GFLOPS DP, ECC.
Tesla S2050/S2070 1U [72] (11/2009): four GPUs, 12/24 GB GDDR5, 2060 GFLOPS DP, ECC.
88
4. Overview of data parallel accelerators (14)
Tesla M2050/M2070/M2070Q Processor Module (Dual slot board with PCIe Gen. 2 x16 interface) (04/2010) Figure: Tesla M2050/M2070/M2070Q Processor Module [74] Used in the Tianhe-1A Chinese supercomputer (10/2010) Remark The M2070Q is an upgrade of the M2070 providing higher memory clock (introduced 08/2010)
89
4. Overview of data parallel accelerators (15)
Tianhe-1A (10/2010) [48]: upgraded version of the Tianhe-1 (China). 2.6 PetaFLOPS (fastest supercomputer in the world in 2010), built of Intel Xeon 5670 processors and 7,168 Nvidia Tesla M2050 modules.
90
4. Overview of data parallel accelerators (16)
Specification data of the Tesla M2050/M2070/M2070Q modules [74] (448 ALUs) Remark The M2070Q is an upgrade of the M2070, providing higher memory clock (introduced 08/2010)
91
4. Overview of data parallel accelerators (17)
Support of ECC: Fermi-based Tesla devices introduced the support of ECC. By contrast, at present neither Nvidia's consumer GPGPU cards nor AMD's GPGPU or DPA devices support ECC [76].
92
4. Overview of data parallel accelerators (18)
Tesla S2050/S2070 1U: the S2050 and S2070 differ only in memory size; the S2050 includes 12 GB, the S2070 24 GB.
GPU specification: number of processor cores: 448; processor core clock: 1.15 GHz; memory clock: - GHz; memory interface: 384 bit.
System specification: four Fermi GPUs; 12.0/24.0 GB of GDDR5, configured as 3.0/6.0 GB per GPU; when ECC is turned on, available memory is ~10.5 GB; typical power consumption: 900 W.
Figure: Block diagram and technical specifications of Tesla S2050/S2070 [75]
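The ~10.5 GB figure quoted above is consistent with the ECC check bits occupying one eighth of the DRAM (a worked check, assuming the usual 8:1 data-to-check-bit ratio of ECC memory):

12.0 GB × (1 - 1/8) = 10.5 GB available per S2050 when ECC is turned on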
93
4. Overview of data parallel accelerators (19)
AMD FireStream-1
RV670-based: 9170 card (11/07, shipped 6/08): 2 GB GDDR3, FP32: 500 GFLOPS, FP64: ~200 GFLOPS
RV770-based: 9250 card (6/08, shipped 10/08): 1 GB GDDR3, FP32: 1000 GFLOPS, FP64: ~300 GFLOPS
Stream Computing SDK: Version 1.0 (12/07): Brook+, ACML (AMD Core Math Library), CAL (Compute Abstraction Layer); Version 1.2 (09/08): Brook+, ACML, CAL
RapidMind
Figure: Overview of AMD/ATI's FireStream family-1
94
4. Overview of data parallel accelerators (20)
AMD FireStream-2
RV870-based: 9350/9370 cards (06/10, shipped 10/10): 2/4 GB GDDR5, FP32: 2016/2640 GFLOPS, FP64: 403/528 GFLOPS
Stream Computing SDK: Version 1.4 (3/09): Brook+; Version 2.01 (03/10): OpenCL 1.0; Version 2.1 (05/10): OpenCL 1.0; Version 2.2 (08/10): OpenCL 1.1; Version 2.3 (12/10): OpenCL 1.1. In 01/11 Version 2.3 was renamed to APP (Accelerated Parallel Processing).
RapidMind: Intel bought RapidMind (8/09)
Figure: Overview of AMD/ATI's FireStream family-2
95
4. Overview of data parallel accelerators (21)
Nvidia Tesla cards

Core type | C870 | C1060 | C2050 | C2070
Based on | G80 | GT200 | T20 (GF100-based) | T20 (GF100-based)
Introduction | 6/07 | 6/08 | 11/09 | 11/09
Core frequency | 600 MHz | 602 MHz | 575 MHz | 575 MHz
ALU frequency | 1350 MHz | 1296 MHz | 1150 MHz | 1150 MHz
No. of SMs (cores) | 16 | 30 | 14 | 14
No. of ALUs | 128 | 240 | 448 | 448
Peak FP32 performance | 345.6 GFLOPS | 933 GFLOPS | 1030 GFLOPS | 1030 GFLOPS
Peak FP64 performance | - | 77.76 GFLOPS | 515.2 GFLOPS | 515.2 GFLOPS
Mem. transfer rate (eff.) | 1600 Mb/s | 1600 Mb/s | 3000 Mb/s | 3000 Mb/s
Mem. interface | 384-bit | 512-bit | 384-bit | 384-bit
Mem. bandwidth | 76.8 GB/s | 102 GB/s | 144 GB/s | 144 GB/s
Mem. size | 1.5 GB | 4 GB | 3 GB | 6 GB
Mem. type | GDDR3 | GDDR3 | GDDR5 | GDDR5
ECC | - | - | supported | supported
Interface | PCIe x16 | PCIe 2.0 x16 | PCIe 2.0 x16 | PCIe 2.0 x16
Power (max) | 171 W | 200 W | 238 W | 247 W

Table: Main features of Nvidia's data parallel accelerator cards (Tesla line) [73]
96
4. Overview of data parallel accelerators (22)
AMD FireStream cards

Core type | 9170 | 9250 | 9350 | 9370
Based on | RV670 | RV770 | RV870 | RV870
Introduction | 11/07 | 6/08 | 10/10 | 10/10
Core frequency | 800 MHz | 625 MHz | 700 MHz | 825 MHz
ALU frequency | 325 MHz
No. of EUs | 320 | 800 | 1440 | 1600
Peak FP32 performance | 512 GFLOPS | 1 TFLOPS | 2016 GFLOPS | 2640 GFLOPS
Peak FP64 performance | ~200 GFLOPS | ~250 GFLOPS | 403.2 GFLOPS | 528 GFLOPS
Mem. transfer rate (eff.) | 1600 Mb/s | 1986 Mb/s | 4000 Mb/s | 4600 Mb/s
Mem. interface | 256-bit (all models)
Mem. bandwidth | 51.2 GB/s | 63.5 GB/s | 128 GB/s | 147.2 GB/s
Mem. size | 2 GB | 1 GB | 2 GB | 4 GB
Mem. type | GDDR3 | GDDR3 | GDDR5 | GDDR5
ECC | - (all models)
Interface | PCIe 2.0 x16 (all models)
Power (max) | 150 W (9170/9250/9350) | 225 W (9370)

Table: Main features of AMD/ATI's data parallel accelerator cards (FireStream line) [67]
97
4. Overview of data parallel accelerators (23)
Price relations (as of 1/2011) Nvidia Tesla C ~ $ C ~ $ S ~ $ S ~ $ NVidia GTX GTX ~ $
98
1. Introduction (8) Background slides for intro to SIMT processing
99
1. Introduction (8) Figure: Peak SP FP performance of Nvidia's GPUs vs. Intel's P4 and Core2 processors [11]
100
1. Introduction (9) Figure: Bandwidth values of Nvidia's GPUs vs. Intel's P4 and Core2 processors [11]
101
5. References
102
5. References (1) (to all four sections)
[1]: Torricelli F., AMD in HPC, HPC07, 2007
[2]: NVIDIA Tesla C870 GPU Computing Board, Board Specification, Jan. 2008, Nvidia
[3]: AMD FireStream 9170, 2008
[4]: NVIDIA Tesla D870 Deskside GPU Computing System, System Specification, Jan. 2008, Nvidia
[5]: Tesla S870 GPU Computing System, Specification, Nvidia, March
[6]: Torres G., Nvidia Tesla Technology, Nov. 2007
[7]: R600-Family Instruction Set Architecture, Revision 0.31, May 2007, AMD
[8]: Zheng B., Gladding D., Villmow M., Building a High Level Language Compiler for GPGPU, ASPLOS 2006, June 2008
[9]: Huddy R., ATI Radeon HD2000 Series Technology Overview, AMD Technology Day, 2007
103
5. References (2)
[10]: Compute Abstraction Layer (CAL) Technology - Intermediate Language (IL), Version 2.0, AMD, Oct. 2008
[11]: Nvidia CUDA Compute Unified Device Architecture Programming Guide, Version 2.0, June 2008, Nvidia
[12]: Kirk D. & Hwu W. W., ECE498AL Lectures 7: Threading Hardware in G80, 2007, University of Illinois, Urbana-Champaign, lectures/lecture7-threading%20hardware.ppt
[13]: Kogo H., R600 (Radeon HD2900 XT), PC Watch, June 2007
[14]: Goto H., Nvidia G80, PC Watch, April 2007
[15]: Goto H., GeForce 8800GT (G92), PC Watch, Oct. 2007
[16]: Goto H., NVIDIA GT200 and AMD RV770, PC Watch, July 2008
[17]: Shrout R., Nvidia GT200 Revealed - GeForce GTX 280 and GTX 260 Review, PC Perspective, June 2008
104
5. References (3) [18]: http://en.wikipedia.org/wiki/DirectX
[19]: Dietrich S., "Shader Model 3.0", April 2004, Nvidia
[20]: Microsoft DirectX 10: The Next-Generation Graphics API, Technical Brief, Nov. 2006, Nvidia
[21]: Patidar S. et al., "Exploiting the Shader Model 4.0 Architecture", Center for Visual Information Technology, IIIT Hyderabad, March 2007
[22]: Nvidia GeForce 8800 GPU Architecture Overview, Vers. 0.1, Nov. 2006, Nvidia
[23]: Goto H., Graphics Pipeline Rendering History, PC Watch, Aug. 2004
[24]: Fatahalian K., "From Shader Code to a Teraflop: How Shader Cores Work", Workshop: Beyond Programmable Shading: Fundamentals, SIGGRAPH 2008
[25]: Kanter D., "NVIDIA's GT200: Inside a Parallel Processor", Real World Technologies, Sept. 2008
[26]: Nvidia CUDA Compute Unified Device Architecture Programming Guide, Version 1.1, Nov. 2007, Nvidia
105
5. References (4)
[27]: Seiler L. et al., "Larrabee: A Many-Core x86 Architecture for Visual Computing", ACM Transactions on Graphics, Vol. 27, No. 3, Article No. 18, Aug. 2008
[28]: Kogo H., "Larrabee", PC Watch, Oct. 17, 2008
[29]: Shrout R., IDF Fall 2007 Keynote, PC Perspective, Sept. 18, 2007
[30]: Stokes J., "Larrabee: Intel's biggest leap ahead since the Pentium Pro", Ars Technica, Aug. 2008, intels-biggest-leap-ahead-since-the-pentium-pro.html
[31]: Shimpi A. L. & Wilson D., "Intel's Larrabee Architecture Disclosure: A Calculated First Move", AnandTech, Aug. 2008
[32]: Hester P., "Multi_Core and Beyond: Evolving the x86 Architecture", Hot Chips 19, Aug. 2007
[33]: AMD Stream Computing, User Guide, Oct. 2008, Stream_Computing_User_Guide.pdf
[34]: Doggett M., Radeon HD 2900, Graphics Hardware Conf., Aug. 2007, doggett-radeon2900-gh07.pdf
106
5. References (5)
[35]: Mantor M., "AMD's Radeon HD 2900", Hot Chips 19, Aug. 2007
[36]: Houston M., "Anatomy of AMD's TeraScale Graphics Engine", SIGGRAPH 2008
[37]: Mantor M., "Entering the Golden Age of Heterogeneous Computing", PEEP 2008
[38]: Kogo H., RV770 Overview, PC Watch, July 2008
[39]: Kanter D., Inside Fermi: Nvidia's HPC Push, Real World Technologies, Sept. 2009
[40]: Wasson S., Inside Fermi: Nvidia's 'Fermi' GPU architecture revealed, Tech Report, Sept. 2009
[41]: Wasson S., AMD's Radeon HD 5870 graphics processor, Tech Report, Sept. 2009
[42]: Bell B., ATI Radeon HD 5870 Performance Preview, Firing Squad, Sept. 2009, ati_radeon_hd_5870_performance_preview/default.asp
107
5. References (6)
[43]: Nvidia CUDA C Programming Guide, Version 3.2, Oct. 2010, CUDA_C_Programming_Guide.pdf
[44]: Hwu W., Kirk D. (Nvidia), Advanced Algorithmic Techniques for GPUs, Berkeley, Jan. 2011, Berkeley_2011.pdf
[45]: Wasson S., Nvidia's GeForce GTX 580 graphics processor, Tech Report, Nov. 2010
[46]: Shrout R., Nvidia GeForce 8800 GTX Review - DX10 and Unified Architecture, PC Perspective, Nov. 2006, and-unified-architecture/g80-architecture
[47]: Wasson S., Nvidia's GeForce GTX 480 and 470 graphics processors, Tech Report, March 2010
[48]: Gangar K., Tianhe-1A from China is world's fastest Supercomputer, Tech Ticker, Oct. 2010, from-china-is-worlds-fastest-supercomputer/
[49]: Smalley T., ATI Radeon HD 5870 Architecture Analysis, Bit-tech, Sept. 2009, architecture-analysis/8
108
5. References (7)
[50]: Nvidia Compute PTX: Parallel Thread Execution, ISA, Version 2.2, Oct. 2010, ptx_isa_2.2.pdf
[51]: Kanter D., Intel's Sandy Bridge Microarchitecture, Real World Technologies, Sept. 2010
[52]: Nvidia CUDA Fermi Compatibility Guide for CUDA Applications, Version 1.0, Feb. 2010, docs/NVIDIA_FermiCompatibilityGuide.pdf
[53]: Hallock R., Dissecting Fermi, NVIDIA's next generation GPU, Icrontic, Sept. 2009
[54]: Kirsch N., NVIDIA GF100 Fermi Architecture and Performance Preview, Legit Reviews, Jan. 2010
[55]: Hoenig M., NVIDIA GeForce GTX 460 SE 1GB Review, Hardware Canucks, Nov. 2010, nvidia-geforce-gtx-460-se-1gb-review-2.html
[56]: Glaskowsky P. N., Nvidia's Fermi: The First Complete GPU Computing Architecture, Sept. 2009, P.Glaskowsky_NVIDIA's_Fermi-The_First_Complete_GPU_Architecture.pdf
[57]: Kirk D. & Hwu W. W., ECE498AL Lectures 4: CUDA Threads - Part 2, University of Illinois, Urbana-Champaign, al/lectures/lecture4%20cuda%20threads%20part2%20spring%20.ppt
109
5. References (8)
[58]: Nvidia's Next Generation CUDA Compute Architecture: Fermi, Version 1.1, 2009, Architecture_Whitepaper.pdf
[59]: Kirk D. & Hwu W. W., ECE498AL Lectures 8: Threading Hardware in G80, University of Illinois, Urbana-Champaign, al/lectures/lecture8-threading-hardware-spring-2009.ppt
[60]: Wong H., Papadopoulou M.M., Sadooghi-Alvandi M., Moshovos A., Demystifying GPU Microarchitecture through Microbenchmarking, University of Toronto, 2010
[61]: Pettersson J., Wainwright I., Radar Signal Processing with Graphics Processors (GPUs), SAAB Technologies, Jan. 2010
[62]: Smith R., NVIDIA's GeForce GTX 460: The $200 King, AnandTech, July 2010
[63]: Angelini C., GeForce GTX 580 And GF110: The Way Nvidia Meant It To Be Played, Tom's Hardware, Nov. 2010, gtx-580-gf110-geforce-gtx-480,2781.html
[64]: NVIDIA G80: Architecture and GPU Analysis, Beyond3D, Nov. 2006
[65]: Kirk D. and Hwu W., Programming Massively Parallel Processors, 2008, Chapter 3: CUDA Threads, Chapter3-CudaThreadingModel.pdf
110
5. References (9)
[66]: NVIDIA Forums: General CUDA GPU Computing Discussion, 2008
[67]: Wikipedia: Comparison of AMD graphics processing units, 2011
[68]: Nvidia OpenCL Overview, 2009
[69]: Chester E., Nvidia GeForce GTX 460 1GB Fermi Review, Trusted Reviews, July 2010, Nvidia-GeForce-GTX-460-1GB-Fermi/p1
[70]: NVIDIA GF100 Architecture Details, Geeks3D, 2010
[71]: Murad A., Nvidia Tesla C2050 and C2070 Cards, Science and Technology Zone, 17 Nov. 2009
[72]: New NVIDIA Tesla GPUs Reduce Cost Of Supercomputing By A Factor Of 10, Nvidia, Nov. 2009
[73]: Nvidia Tesla, Wikipedia
[74]: Tesla M2050 and Tesla M2070/M2070Q Dual-Slot Computing Processor Modules, Board Specification, v. 03, Nvidia, Aug. 2010
111
5. References (10)
[75]: Tesla 1U GPU Computing System, Product Specification, v. 04, Nvidia, June 2009
[76]: Kanter D., The Case for ECC Memory in Nvidia's Next GPU, Real World Technologies, 19 Aug. 2009
[77]: Hoenig M., Nvidia GeForce 580 Review, Hardware Canucks, Nov. 8, 2010, 37789-nvidia-geforce-gtx-580-review-5.html
[78]: Angelini C., AMD Radeon HD 6990 Review, Tom's Hardware, March 8, 2011
[79]: Tom's Hardware Gallery
[80]: Tom's Hardware Gallery
[81]: CUDA, Wikipedia
[82]: GeForce Graphics Processors, Nvidia
[83]: Next Gen CUDA GPU Architecture, Code-Named "Fermi", Press Presentation at Nvidia's 2009 GPU Technology Conference (GTC), Sept. 2009
112
5. References (11)
[84]: Tom's Hardware Gallery
[85]: Butler M., Bulldozer, a new approach to multithreaded compute performance, Hot Chips 22, Aug. 2010
[86]: Voicu A., NVIDIA Fermi GPU and Architecture Analysis, Beyond3D, 23rd Oct. 2010
[87]: Chu M. M., GPU Computing: Past, Present and Future with ATI Stream Technology, AMD, March 2010, Present%20and%20Future%20with%20ATI%20Stream%20Technology.pdf
[88]: Smith R., AMD's Radeon HD 6970 & Radeon HD 6950: Paving The Future For AMD, AnandTech, Dec. 2010
[89]: Christian, AMD renames ATI Stream SDK, updates it with APU, OpenCL 1.1 support, Jan. 2011, amd-renames-ati-stream-sdk-updates-its-apu-opencl-11-support
[90]: User Guide: AMD Stream Computing, Revision 1.3.0, Dec. 2008
[91]: Programming Guide: ATI Stream Computing Compute Abstraction Layer (CAL), Revision 2.01, AMD, March 2010, SDK_CAL_Programming_Guide_v2.0.pdf
113
5. References (12)
[92]: Technical Overview: AMD Stream Computing, Revision 1.2.1, Oct. 2008
[93]: AMD Accelerated Parallel Processing OpenCL Programming Guide, Revision 1.2, AMD, Jan. 2011, AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide.pdf
[94]: An Introduction to OpenCL, AMD, stream-technology/opencl/pages/opencl-intro.aspx
[95]: Behr D., Introduction to OpenCL, PPAM 2009, Sept. 2009
[96]: Gohara D.W., OpenCL Episode 2 - OpenCL Fundamentals, MacResearch, Aug. 2009
[97]: Kanter D., AMD's Cayman GPU Architecture, Real World Technologies, Dec. 2010
[98]: Hoenig M., AMD Radeon HD 6970 and HD 6950 Review, Hardware Canucks, Dec. 2010, 38899-amd-radeon-hd-6970-hd-6950-review-3.html
[99]: Reference Guide: AMD HD 6900 Series Instruction Set Architecture, Revision 1.0, Feb. 2011, AMD_HD_6900_Series_Instruction_Set_Architecture.pdf
[100]: Howes L., AMD and OpenCL, AMD Application Engineering, Dec. 2010
114
5. References (13)
[101]: ATI R700-Family Instruction Set Architecture Reference Guide, Revision 1.0a, AMD, Feb. 2011, Set_Architecture.pdf
[102]: Piazza T., Jiang H., Microarchitecture Codename Sandy Bridge: Processor Graphics, Presentation ARCS002, IDF San Francisco, Sept. 2010
[103]: Bhaniramka P., Introduction to Compute Abstraction Layer (CAL), %20Introduction%20to%20CAL.pdf
[104]: Villmow M., ATI Stream Computing, ATI Intermediate Language (IL), May 2009, %20Computing%20-%20ATI%20Intermediate%20Language.ppt
[105]: Reference Guide: AMD Accelerated Parallel Processing Technology, AMD Intermediate Language (IL), Revision 2.0e, March 2011, _(IL)_Specification_v2.pdf
[106]: Hensley J., Hardware and Compute Abstraction Layers for Accelerated Computing Using Graphics Hardware and Conventional CPUs, AMD, 2007
[107]: Hensley J., Yang J., Compute Abstraction Layer, AMD, Feb.
[108]: AMD Accelerated Parallel Processing (APP) SDK, AMD Developer Central
115
5. References (14)
[109]: OpenCL™ and the AMD APP SDK v2.4, AMD Developer Central, April 2011, SDK.aspx
[110]: Stone J., An Introduction to OpenCL, U. of Illinois at Urbana-Champaign, Dec. 2009
[111]: Introduction to OpenCL Programming, AMD, Rev. A, May 2010, to_OpenCL_Programming%20Training_Guide%20(201005).pdf
[112]: Evergreen Family Instruction Set Architecture, Instructions and Microcode Reference Guide, AMD, Feb. 2011, AMD_Evergreen-Family_Instruction_Set_Architecture.pdf
[113]: Intel 810 Chipset: Intel 82810/82810-DC100 Graphics and Memory Controller Hub (GMCH) Datasheet, June 1999, ftp://download.intel.com/design/chipsets/datashts/
[114]: Huynh A.T., AMD Announces "Fusion" CPU/GPU Program, DailyTech, Oct. 2006
[115]: Grim B., AMD Fusion Family of APUs, Dec. 2010, content/uploads/2011/01/AMD-Fusion-Press-Tour_EMEA.pdf
[116]: Newell D., AMD Financial Analyst Day, Nov. 2010
[117]: De Maesschalck T., AMD starts shipping Ontario and Zacate CPUs, DarkVision Hardware, Nov. 2010
116
5. References (15)
[118]: AMD Accelerated Parallel Processing (APP) SDK (formerly ATI Stream) with OpenCL 1.1 Support
[119]: Burgess B., "Bobcat": AMD's New Low Power x86 Core Architecture, Aug. 2010, Bobcat-x86.pdf
[120]: AMD Ontario APU pictures, Xtreme Systems, Sept. 2010
[121]: Stokes J., AMD reveals Fusion CPU+GPU, to challenge Intel in laptops, Feb., fusion-cpugpu-to-challege-intel-in-laptops.ars
[122]: AMD Unveils Future of Computing at Annual Financial Analyst Day, CDRinfo, Nov. 2010
[123]: Shimpi A. L., The Intel Core i3 530 Review - Great for Overclockers & Gamers, AnandTech, Jan. 2010
[124]: Hagedoorn H., Mohammad S., Barling I. R., Core i5 2500K and Core i7 2600K review, Jan. 2011
[125]: Wikipedia: Intel GMA, 2011
[126]: Shimpi A. L., The Sandy Bridge Review: Intel Core i7-2600K, i5-2500K and Core i3-2100 Tested, AnandTech, Jan. 2011, 2600k-i5-2500k-core-i tested/11
117
5. References (16)
[127]: Marques T., AMD Ontario, Zacate Die Sizes - Take 2, Sept. 2010, take-2.html
[128]: De Vries H., AMD Bulldozer, 8 core processor, Nov. 2010
[129]: Intel 845G/845GL/845GV Chipset Datasheet: Intel 82845G/82845GL/82845GV Graphics and Memory Controller Hub (GMCH), May 2002
[130]: Huynh A. T., Final AMD "Stars" Models Unveiled, DailyTech, May 2007
[131]: AMD Fusion, Wikipedia
[132]: Nita S., AMD Llano APU to Get Dual-GPU Technology Similar to Hybrid CrossFire, Softpedia, Jan. 2011, Get-Dual-GPU-Technology-Similar-to-Hybrid-CrossFire.shtml
[133]: Jotwani R., Sundaram S., Kosonocky S., Schaefer A., Andrade V. F., Novak A., Naffziger S., An x86-64 Core in 32 nm SOI CMOS, IEEE Xplore, Jan. 2011
[134]: Karmehed A., The graphical performance of the AMD A-series APUs, Nordic Hardware, March 2011, performance-of-the-amd-a-series-apus.html
118
5. References (17)
[135]: Butler M., "Bulldozer": A new approach to multithreaded compute performance, Aug. 2010, -AMD-Bulldozer.pdf
[136]: "Bulldozer" and "Bobcat": AMD's Latest x86 Core Innovations, Hot Chips 22, -presentation
[137]: Altavilla D., Intel Arrandale Core i5 and Core i3 Mobile Unveiled, Hot Hardware, Jan. 2010
[138]: Dodeja A., Intel Arrandale, High Performance for the Masses, Hot Hardware, Review of the IDF San Francisco, Sept. 2009
[139]: Shimpi A., Intel Arrandale: 32nm for Notebooks, Core i5 540M Reviewed, AnandTech, 1/4/2010
[140]: Chiappetta M., Intel Clarkdale Core i5 Desktop Processor Debuts, Hot Hardware, Jan. 2010
[141]: Thomas S. L., Desktop Platform Design Overview for Intel Microarchitecture (Nehalem) Based Platform, Presentation ARCS001, IDF 2009
[142]: Kahn O., Valentine B., Microarchitecture Codename Sandy Bridge: New Processor Innovations, Presentation ARCS001, IDF San Francisco, Sept. 2010
119
5. References (18)
[143]: Valich T., Intel's "Anti AMD Fusion" Sandy Bridge CPU tapes out, July, bridge-cpu-tapes-out.aspx