1 ITCS 6/8010 CUDA Programming, UNC-Charlotte, B. Wilkinson, Jan 19, 2011 Emergence of GPU systems and clusters for general purpose High Performance Computing.
These notes will introduce: the development of GPU devices from the 1970s to the present day, and their use in high performance computers today.
2 Graphics Processing Units (GPUs) GPUs have developed from graphics cards into a platform for high performance computing, and are perhaps the most important development in HPC for many years.
3 Brief History Timeline, 1970 to 2010 (source of information: http://en.wikipedia.org/wiki/Graphics_Processing_Unit), in roughly chronological order: Atari 8-bit computer text/graphics chip; IBM PC Professional Graphics Controller card; S3 graphics cards (single chip 2D accelerator); OpenGL graphics API; hardware-accelerated 3D graphics; DirectX graphics API; Playstation; GPUs with programmable shading, e.g. Nvidia GeForce 3 (2001); general-purpose computing on graphics processing units (GPGPU); GPU computing.
4 CPU-GPU architecture evolution Co-processors -- a very old idea that appeared in the 1970s and 1980s, with floating point co-processors attached to microprocessors that did not then have floating point capability. These co-processors simply executed floating point instructions that were fetched from memory. Around the same time, there was interest in providing hardware support for displays, especially with the increasing use of graphics and PC games. This led to graphics processing units (GPUs) attached to the CPU to create the video display. Figure (early design): CPU, memory, graphics card, display.
5 GPUs with dedicated pipelines (late 1990s to early 2000s) By the late 1990s, graphics chips needed to support 3-D graphics, especially for games and graphics APIs such as DirectX and OpenGL. Graphics chips generally had a pipeline structure with individual stages performing specialized operations, finally leading to loading the frame buffer for display. Individual stages may have access to graphics memory for storing intermediate computed data. Figure (pipeline): input stage, vertex shader stage, geometry shader stage, rasterizer stage, pixel shading stage, frame buffer; stages connect to graphics memory.
6 General-Purpose GPU designs High performance pipelines call for high-speed (IEEE) floating point operations. People had been trying to use GPU cards to speed up scientific computations, an approach known as GPGPU (general-purpose computing on graphics processing units). This was difficult to do with specialized graphics pipelines, but possible. By the mid 2000s, it was recognized that the individual stages of the graphics pipeline could be implemented by a more general purpose processor core (although with a data-parallel paradigm).
7 GPU design for general high performance computing 2006 -- First GPU for general high performance computing as well as graphics processing: the NVIDIA G80 chip/GeForce 8800 card. Unified processors that could perform vertex, geometry, pixel, and general computing operations. Programs could now be written in C rather than through graphics APIs. Single-instruction multiple-thread (SIMT) programming model.
NVIDIA products NVIDIA Corp. is the leader in GPUs for high performance computing. Timeline, 1993 to 2010 (source: http://en.wikipedia.org/wiki/GeForce): established by Jen-Hsun Huang, Chris Malachowsky, and Curtis Priem; NV1; GeForce 1; GeForce 2 series; GeForce FX series; GeForce 8 series, including the GeForce 8800 (G80), NVIDIA's first GPU with general purpose processors; GeForce 200 series (GTX260/275/280/285/295); GeForce 400 series (GTX460/465/470/475/480/485). Other product lines: Quadro; Tesla (C870, S870, C1060, S1070, C2050, …; the Tesla 2050 GPU has 448 thread processors). Architectures: Fermi, Kepler (2011), Maxwell (2013).
12 Evolving GPU design NVIDIA Fermi architecture (announced Sept 2009): 512 stream processing engines (SPEs, or cores), organized as 16 streaming multiprocessors (SMs), each having 32 cores. 3 GB or 6 GB GDDR5 memory. Many innovations including L1/L2 caches, unified device memory addressing, ECC memory, … First implementation: Tesla 20 series (single chip C2050/2070, 4 chip S2050/2070). 3 billion transistor chip? New Fermi chips planned (GT 300, GeForce 400 series).
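Several of the properties listed on this slide (number of multiprocessors, memory size, L2 cache, ECC) can be read back from an installed card. The following is a minimal sketch, assuming the CUDA runtime API is available; the helper name printDeviceSummary is hypothetical and not part of the original notes. The runtime does not report the number of cores per multiprocessor, so that figure still has to come from the architecture documentation.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (name chosen for illustration): prints a few
// architectural properties of CUDA device `dev`.
// Returns false if the device cannot be queried.
bool printDeviceSummary(int dev)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess)
        return false;

    std::printf("%s: compute capability %d.%d\n",
                prop.name, prop.major, prop.minor);
    std::printf("  streaming multiprocessors: %d\n", prop.multiProcessorCount);
    std::printf("  warp size: %d threads\n", prop.warpSize);
    std::printf("  global memory: %.1f GB\n",
                prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));
    std::printf("  L2 cache: %d KB\n", prop.l2CacheSize / 1024);
    std::printf("  ECC enabled: %s\n", prop.ECCEnabled ? "yes" : "no");
    return true;
}
```

Calling printDeviceSummary(0) from main reports on the first GPU in the system; the values printed depend entirely on the hardware present.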
13 Fermi Streaming Multiprocessor (SM) Figure from the whitepaper "NVIDIA’s Next Generation CUDA Compute Architecture: Fermi", NVIDIA, 2009.
14 CUDA (Compute Unified Device Architecture) Architecture and programming model, introduced by NVIDIA in 2007. Enables GPUs to execute programs written in C. Within C programs, one calls SIMT "kernel" routines that are executed on the GPU. A CUDA syntax extension to C identifies a routine as a kernel. Very easy to learn, although getting the highest possible execution performance requires an understanding of the hardware architecture.
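As an illustration of a kernel routine called from a C program, here is a minimal sketch of element-wise vector addition. The kernel name vecAdd and the host helper vecAddOnGpu are hypothetical names chosen for this example; the block size of 256 threads is an arbitrary assumption, not a recommendation from the notes.

```cuda
#include <cassert>
#include <cuda_runtime.h>
#include <vector>

// __global__ is the CUDA extension that marks a routine as a kernel:
// it is called from host (CPU) code and executed on the GPU.
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    // Each thread computes one element, selected by its global index.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)              // guard: the last block may have spare threads
        c[i] = a[i] + b[i];
}

// Hypothetical host helper: copies the inputs to the GPU, launches the
// kernel, and copies the result back. Returns true on success.
bool vecAddOnGpu(const std::vector<float> &a, const std::vector<float> &b,
                 std::vector<float> &c)
{
    assert(a.size() == b.size());
    const int n = static_cast<int>(a.size());
    const size_t bytes = n * sizeof(float);
    c.resize(n);
    if (n == 0) return true;

    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    bool ok =
        cudaMalloc(&dA, bytes) == cudaSuccess &&
        cudaMalloc(&dB, bytes) == cudaSuccess &&
        cudaMalloc(&dC, bytes) == cudaSuccess &&
        cudaMemcpy(dA, a.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess &&
        cudaMemcpy(dB, b.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;

    if (ok) {
        const int threadsPerBlock = 256;   // assumed block size
        const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
        // The <<<blocks, threads>>> launch syntax is the other CUDA extension.
        vecAdd<<<blocks, threadsPerBlock>>>(dA, dB, dC, n);
        ok = cudaGetLastError() == cudaSuccess &&
             cudaMemcpy(c.data(), dC, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }

    cudaFree(dA);   // cudaFree on a null pointer is a no-op
    cudaFree(dB);
    cudaFree(dC);
    return ok;
}
```

A caller fills two std::vector<float> inputs of equal length and passes an output vector; apart from the __global__ qualifier and the launch syntax, the code is ordinary C/C++.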
15 CUDA SIMT Thread Structure A grid of thread blocks can have 1 or 2 dimensions; a block of threads can have 1, 2 or 3 dimensions (CUDA C Programming Guide, v 3.2, 2010, NVIDIA). This allows flexibility and efficiency in processing 1-D, 2-D, and 3-D data on the GPU. The structure is linked to the internal organization: threads in one block execute together.
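To make the grid and block dimensions concrete, the sketch below launches a 2-D grid of 2-D blocks over a width x height array, with each thread writing its own linear index. The names fillIndex2D and fillIndex2DOnGpu are hypothetical, and the 16 x 16 block shape is an assumption made for the example.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each thread derives its (row, col) position from its block and thread
// indices in the x and y dimensions.
__global__ void fillIndex2D(int *out, int width, int height)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col < width && row < height)       // edge blocks may overhang the array
        out[row * width + col] = row * width + col;
}

// Hypothetical host helper; assumes width > 0 and height > 0.
// Returns true on success.
bool fillIndex2DOnGpu(int width, int height, std::vector<int> &out)
{
    out.assign(static_cast<size_t>(width) * height, -1);
    const size_t bytes = out.size() * sizeof(int);

    int *dOut = nullptr;
    if (cudaMalloc(&dOut, bytes) != cudaSuccess)
        return false;

    dim3 block(16, 16);                              // 2-D block: 256 threads
    dim3 grid((width + block.x - 1) / block.x,       // 2-D grid covering
              (height + block.y - 1) / block.y);     // the whole array
    fillIndex2D<<<grid, block>>>(dOut, width, height);

    bool ok = cudaGetLastError() == cudaSuccess &&
              cudaMemcpy(out.data(), dOut, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(dOut);
    return ok;
}
```

The same pattern extends to 3-D blocks by also using threadIdx.z and blockDim.z.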
16 Memory Structure within GPU Local private memory -- per thread. Shared memory -- per block. Global memory -- per application. The GPU executes kernel grids. A streaming multiprocessor (SM) executes one or more thread blocks. CUDA cores and other execution units in the SM execute threads. The SM executes threads in groups of 32 threads called a warp.* * Whitepaper "NVIDIA’s Next Generation CUDA Compute Architecture: Fermi", NVIDIA, 2009
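The three memory levels can be seen together in a per-block sum: each thread uses private local variables, the threads of a block cooperate through shared memory, and input and output arrays live in global memory. This is a minimal sketch, not an optimized reduction; the names blockSum, sumOnGpu and BLOCK are hypothetical, and the block size of 256 (a power of two and a multiple of the 32-thread warp) is an assumption.

```cuda
#include <cuda_runtime.h>
#include <vector>

const int BLOCK = 256;   // threads per block; must be a power of two here

__global__ void blockSum(const float *in, float *blockSums, int n)
{
    __shared__ float partial[BLOCK];     // shared memory: one copy per block

    int tid = threadIdx.x;               // local variables: private per thread
    int i = blockIdx.x * blockDim.x + tid;

    partial[tid] = (i < n) ? in[i] : 0.0f;   // in[] resides in global memory
    __syncthreads();                          // wait for the whole block

    // Tree reduction within the block, halving the active threads each step.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            partial[tid] += partial[tid + stride];
        __syncthreads();
    }

    if (tid == 0)
        blockSums[blockIdx.x] = partial[0];   // one result per block, global memory
}

// Hypothetical host helper: sums `in` by combining per-block results on the CPU.
bool sumOnGpu(const std::vector<float> &in, float &total)
{
    total = 0.0f;
    const int n = static_cast<int>(in.size());
    if (n == 0) return true;

    const int blocks = (n + BLOCK - 1) / BLOCK;
    std::vector<float> sums(blocks);
    float *dIn = nullptr, *dSums = nullptr;

    bool ok =
        cudaMalloc(&dIn, n * sizeof(float)) == cudaSuccess &&
        cudaMalloc(&dSums, blocks * sizeof(float)) == cudaSuccess &&
        cudaMemcpy(dIn, in.data(), n * sizeof(float), cudaMemcpyHostToDevice) == cudaSuccess;

    if (ok) {
        blockSum<<<blocks, BLOCK>>>(dIn, dSums, n);
        ok = cudaGetLastError() == cudaSuccess &&
             cudaMemcpy(sums.data(), dSums, blocks * sizeof(float),
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dIn);
    cudaFree(dSums);

    if (ok)
        for (float s : sums) total += s;
    return ok;
}
```

Because shared memory is visible only within a block, the per-block results are written to global memory and combined on the host; a second kernel launch would be another way to finish the sum.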
17 GPU Clusters GPU systems for HPC: GPU clusters, GPU grids, GPU clouds. With the advent of GPUs for scientific high performance computing, compute clusters can now incorporate GPUs, greatly increasing their compute capability.