COMPUTER ARCHITECTURE (for Erasmus students)


COMPUTER ARCHITECTURE (for Erasmus students) Assoc. Prof. Stasys Maciulevičius, Computer Dept. stasys@ecdl.lt stasys.maciulevicius@ktu.lt

3D graphics cards It wasn't so long ago that 3D graphics cards were only expected to deliver higher frames per second in your favorite 3D games. The graphics companies fought over image-quality issues such as internal color-processing precision and the quality of anti-aliasing or anisotropic filtering, but even that was targeted at game performance and quality. Of course, there have also been graphics cards designed for years for the professional 3D market (CAD/CAM, industrial design). 2009 ©S.Maciulevičius

GPU vs CPU

Graphics and CPU Physical effects that arise when working with images are computed much faster by the graphics processor than by the central processing unit, for example:

GPGPU GPGPU stands for General-Purpose computation on Graphics Processing Units, also known as GPU Computing. Graphics Processing Units (GPUs) are high-performance many-core processors capable of very high computation and data throughput. Once specially designed for computer graphics and difficult to program, today's GPUs are general-purpose parallel processors with support for accessible programming interfaces and industry-standard languages such as C

GPGPU GPGPU technology is simple: it increases the speed of many types of tasks consumers do every day by using the GPU and the CPU in tandem for "general purpose" computations (number crunching) that were once handled by the CPU alone. When this technology fully matures, consumers will see noticeable performance increases when they convert audio and video files, play graphics-intensive games, and perform other daily tasks

GP GPU

GPGPU The first GPGPU system was created in 1978 (a programmable raster display system). Before 2006, only a handful of other systems incorporated GPGPU technology. In November 2006, AMD's website stated that they had started the "GPGPU revolution" with the first iteration of their GPGPU technology, which has since evolved into ATI Stream

Stream computing Stream computing (or stream processing) refers to a class of compute problems, applications, or tasks that can be broken down into parallel, identical operations and run simultaneously on a single processor device. The parallel data streams entering the processor device, the computations taking place on it, and the output leaving the device together define stream computing
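The defining property of stream computing, identical and independent operations over every element of a data stream, can be sketched in plain C. The scale-and-offset kernel below is a hypothetical example, not from the slides; on a GPU each iteration would run in parallel on its own stream processor.

```c
#include <stddef.h>

/* The identical operation applied to every stream element
   (a hypothetical scale-and-offset kernel). */
static float kernel_op(float x) {
    return 2.0f * x + 1.0f;
}

/* Stream processing: the same kernel runs over every element.
   There is no dependence between iterations, so a GPU can run
   them all simultaneously; the serial loop here only expresses
   the data-parallel structure. */
void process_stream(const float *in, float *out, size_t n) {
    for (size_t i = 0; i < n; i++)
        out[i] = kernel_op(in[i]);
}
```

Because no iteration reads another iteration's output, the loop can be split across any number of processing units without synchronization.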

Applications suited to Stream Computing Applications best suited to stream computing possess two fundamental characteristics:
- A high degree of arithmetic computation per system memory fetch
- Computational independence - arithmetic occurs on each processing unit without needing to be checked or verified by, or coordinated with, arithmetic occurring on any other processing unit
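The first characteristic is often quantified as arithmetic intensity: floating-point operations per byte fetched from memory. A small illustrative calculation for SAXPY (y[i] = a*x[i] + y[i], used here as an assumed example, not a figure from the slides) shows how a kernel can fail this criterion:

```c
/* Arithmetic intensity = floating-point operations per byte of
   memory traffic, estimated for SAXPY assuming 4-byte floats
   and no cache reuse (an illustrative assumption). */
double saxpy_intensity(void) {
    double flops_per_elem = 2.0;            /* one multiply + one add  */
    double bytes_per_elem = 3.0 * 4.0;      /* read x, read y, write y */
    return flops_per_elem / bytes_per_elem; /* about 0.17 FLOP/byte    */
}
```

A kernel this memory-bound gains little from raw compute power; workloads like dense matrix multiplication, with far more arithmetic per fetch, fit stream hardware much better.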

Applications suited to Stream Computing Examples include:
- Engineering - fluid dynamics
- Mathematics - linear equations, matrix calculations
- Simulations - Monte Carlo, molecular modeling, etc.
- Financial - options pricing
- Biological - protein structure calculations
- Imaging - medical image processing

GP-GPU development Today there are two major competing standards for GP-GPU development:
- CTM - ATI's first attempt at exposing the GPU to general-purpose computing tasks, through a "Close to the Metal" driver
- CUDA - Nvidia's "Compute Unified Device Architecture", a programming model based on C

Using the GPU's stream processors Both ATI Stream and CUDA focus on using the GPU's stream processors in tandem with the CPU, enabling the entire system to handle computing-intensive applications

ATI Stream ATI Stream technology is a set of advanced hardware and software technologies that enable AMD graphics processors (GPUs), working in concert with the system's CPU, to accelerate many applications beyond just graphics. This enables better-balanced platforms capable of running demanding computing tasks faster than ever

ATI Stream ATI Stream uses a parallel computing architecture that takes advantage of the graphics card's stream processors to compute problems, applications, or tasks that can be broken down into parallel, identical operations and run simultaneously on a single processor device. Stream computing takes advantage of a SIMD (single instruction, multiple data) methodology, whereas a CPU follows a modified SISD (single instruction, single data) methodology
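The SIMD/SISD distinction can be sketched in C: a SISD core executes one instruction per data element, while a SIMD unit applies one instruction across a whole vector of lanes. The 4-lane width below is illustrative, and the loop only models lanes that real SIMD hardware processes simultaneously.

```c
#define LANES 4  /* illustrative vector width */

/* SISD: one instruction operates on one data element. */
float add_sisd(float a, float b) {
    return a + b;
}

/* SIMD: one (conceptual) instruction operates on LANES elements.
   Real SIMD hardware adds all lanes at once; the loop below
   only models the lanes. */
void add_simd(const float a[LANES], const float b[LANES],
              float out[LANES]) {
    for (int lane = 0; lane < LANES; lane++)
        out[lane] = a[lane] + b[lane];
}
```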

ATI Stream ATI has evolved their stream computing efforts into the Stream SDK, which includes a number of features. Low-level access is provided through the more accessible CAL (Compute Abstraction Layer), and code libraries include the AMD Core Math Library (ACML), the AMD Performance Library (APL), and a video-transcode library called COBRA

AMD FireStream™ 9270 The AMD FireStream product line provides the industry's first double-precision floating-point capability on a GPU. The AMD FireStream 9270 is a next-generation double-precision product. With 2 GB of GDDR5 memory on board, single-precision performance of 1.2 TFLOPS, and double-precision performance of 240 GFLOPS, the FireStream 9270 is ideal for the most demanding compute-intensive, data-parallel tasks

AMD FireStream™ 9270 Using 55 nm process technology, even this large-memory board consumes less than 220 watts peak (160 watts typical). System requirements:
- PCIe 2.0-based server or workstation with an available x16-lane graphics slot
- 600 W or greater power supply
- 512 MB of system memory

AMD FireStream™ 9270 AMD FireStream 9270 specifications:
- Number of GPUs – 1
- Stream cores – 800
- Floating-point formats – IEEE single & double precision
- GPU local memory – 2 GB GDDR5 SDRAM
- Memory interface – 256-bit @ 850 MHz
- Peak memory bandwidth – 108.8 GB/s
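The quoted peak bandwidth follows directly from the memory interface figures: GDDR5 performs four data transfers per memory clock, so a 256-bit (32-byte) interface at 850 MHz moves 32 B × 850 MHz × 4. A quick check of the arithmetic:

```c
/* Peak memory bandwidth of the FireStream 9270 derived from its
   listed memory interface: 256-bit bus at 850 MHz, with GDDR5
   moving four transfers per memory clock. */
double peak_bandwidth_gbs(void) {
    double bus_bytes           = 256.0 / 8.0; /* 32 bytes per transfer */
    double clock_hz            = 850e6;
    double transfers_per_clock = 4.0;         /* GDDR5 quad data rate  */
    return bus_bytes * clock_hz * transfers_per_clock / 1e9;
    /* 32 * 850e6 * 4 / 1e9 = 108.8 GB/s, matching the spec */
}
```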

NVidia's CUDA NVidia's Compute Unified Device Architecture, or "CUDA", platform was announced together with the G80 in November 2006. A public beta version of the CUDA SDK was released in February 2007. The first version of CUDA rolled out with Tesla in June 2007, which was based on G80 and designed for high-performance computing

NVidia's CUDA NVidia CUDA is a general-purpose parallel computing architecture that leverages the parallel compute engine in NVidia graphics processing units to solve many complex computational problems in a fraction of the time required on a CPU. It includes the CUDA Instruction Set Architecture and the parallel compute engine in the GPU. CUDA performs two major functions that consumers should be aware of: it helps reduce or offload CPU usage by engaging the GPU's stream processors, and it can accelerate any computing process where CUDA is enabled

NVidia's CUDA CUDA is supported on all GeForce 8-, 9-, and GTX 200-series cards, and is tied fairly closely to Nvidia's GPU architecture, though it should be possible to make CUDA drivers for other hardware. CUDA is relatively simple as stream-processing languages go: it is based on C with some extensions, so it is familiar to developers. Writing code that is highly parallel and manages data to work optimally in a GPU's memory systems is tricky, but the payoffs are great
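Since CUDA is C with a few extensions, its core idea, assigning one array element to one thread identified by block and thread indices, can be modeled in plain C. The variable names below deliberately mirror CUDA's built-ins, but this is only a serial sketch: a real CUDA kernel contains just the innermost body, and the hardware launches all (block, thread) pairs in parallel.

```c
/* Serial model of CUDA's grid/block indexing: each (block, thread)
   pair handles one element, i = blockIdx * blockDim + threadIdx.
   In real CUDA the two loops disappear; every thread runs the
   body concurrently on the GPU. */
void vector_add_grid(const float *a, const float *b, float *c,
                     int n, int blockDim) {
    int gridDim = (n + blockDim - 1) / blockDim; /* blocks to cover n */
    for (int blockIdx = 0; blockIdx < gridDim; blockIdx++) {
        for (int threadIdx = 0; threadIdx < blockDim; threadIdx++) {
            int i = blockIdx * blockDim + threadIdx;
            if (i < n)            /* bounds guard, as in a real kernel */
                c[i] = a[i] + b[i];
        }
    }
}
```

The bounds guard is needed because the grid is rounded up to whole blocks, so the last block may contain threads with no element to process.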

NVidia's CUDA In the high-performance computing (HPC) environment, where large clusters or supercomputers are purpose-built to perform specific tasks with custom software, CUDA has gained a lot of traction. Financial analysis models, oil and gas exploration software, medical imaging, fluid dynamics, and other tough "big iron" tasks are already using CUDA together with Nvidia GPUs to realize speed improvements an order of magnitude or two greater than when running on CPUs

Folding@Home Stanford University runs one of the most popular distributed computing applications around, Folding@Home. It calculates protein folding on a massive scale, using thousands of computers and PlayStation 3s around the world. For a while now, the labs at Stanford have been working with ATI to produce a GPU-accelerated version of their folding software. Now, the second generation of this GPU folding app uses the more reliable and better-performing CUDA for Nvidia GPUs, or CAL for ATI cards

Folding@Home A quick look at the Client Stats page shows that, at the time of this writing, about 7,600 active GPUs are running the FAH app, generating around 840 teraflops of computational power (not a theoretical peak number, but real work running the FAH computations). That's somewhere around 110 gigaflops per GPU, on average. To put that in perspective, the regular Windows CPU client averages about one gigaflop per client (a mix of the single-threaded client and the multi-core SMP version)

GPGPU in supercomputers In 2009 China unveiled its fastest supercomputer, Tianhe (Milky Way). The supercomputer reaches a performance of 1.2 petaflops, and with a Linpack result of 563.1 teraflops it would climb to fourth place on the current Top500 list. Tianhe is equipped with 6,144 Intel Xeon CPUs and 5,120 AMD GPUs (Radeon HD 4870 X2 graphics cards)

Intel Larrabee Intel has developed a new Intel® microarchitecture, codenamed Larrabee, to meet the increasing compute- and memory-intensive demands of the latest PC games and of high-performance computing applications such as image processing, physical simulation, and medical and financial analytics. The Larrabee architecture's programmability creates new opportunities for developers to innovate in the visual computing realm. The Larrabee microarchitecture features a number of hardware advances, including a many-core throughput design, for a wide range of highly parallel visual computing applications

Intel Larrabee

Intel Larrabee A conceptual model of the Larrabee architecture. The actual numbers of cores, texture units, memory controllers, and so on will vary considerably. Also, the structure of the bus and the placement of devices on the ring are more complex than shown

Intel Larrabee: vector data types

Intel Larrabee Larrabee can be considered a hybrid between a multi-core CPU and a GPU, and has similarities to both: its coherent cache hierarchy and x86 architecture compatibility are CPU-like, while its wide SIMD vector units and texture-sampling hardware are GPU-like. As a GPU, Larrabee will support traditional rasterized 3D graphics (Direct3D & OpenGL) for games. However, Larrabee's hybrid of CPU and GPU features should also be suitable for general-purpose GPU (GPGPU) or stream-processing tasks

Intel Larrabee Larrabee will differ from other discrete GPUs currently on the market, such as the GeForce 200 series and the Radeon 4000 series, in three major ways:
- Larrabee will use the x86 instruction set with Larrabee-specific extensions
- Larrabee will feature cache coherency across all its cores
- Larrabee will include very little specialized graphics hardware, instead performing tasks like z-buffering, clipping, and blending in software, using a tile-based rendering approach

Intel Larrabee The x86 processor cores in Larrabee will differ in several ways from the cores in current Intel CPUs such as the Core 2 Duo or Core i7. Larrabee's x86 cores will be based on the much simpler Pentium P54C design, which is still maintained for use in embedded applications. The P54C-derived core is superscalar but does not include out-of-order execution, though it has been updated with modern features such as x86-64 support, similar to Intel Atom. In-order execution means lower performance for individual cores, but since they are smaller, more can fit on a single chip, increasing overall throughput

Intel Larrabee
- Each Larrabee core contains a 512-bit vector processing unit, able to process 16 single-precision floating-point numbers at a time
- Larrabee includes one major fixed-function graphics hardware feature: texture-sampling units, which perform trilinear and anisotropic filtering and texture decompression
- Larrabee has a 1024-bit (512-bit each way) ring bus for communication between cores and to memory
- Larrabee includes explicit cache-control instructions to reduce cache thrashing during streaming operations that only read/write data once; explicit prefetching into the L2 or L1 cache is also supported
- Each core supports 4-way simultaneous multithreading, with four copies of each processor register
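The figure of 16 single-precision floats per vector unit is simply the register width divided by the element width: 512 bits / 32 bits per float = 16 lanes. A small check of that arithmetic:

```c
/* SIMD lane count: register width divided by element width.
   Larrabee: 512-bit vector registers, 32-bit single-precision floats. */
int simd_lanes(int register_bits, int element_bits) {
    return register_bits / element_bits;
}
```

simd_lanes(512, 32) gives 16; for 64-bit double-precision elements the same unit would be 8 lanes wide.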