COMPUTER ARCHITECTURE (for Erasmus students)

Assoc. Prof. Stasys Maciulevičius, Computer Dept.

3D graphics cards
Not long ago, 3D graphics cards were expected only to deliver more frames per second in your favorite 3D games.
Graphics companies competed on image-quality issues such as internal color-processing precision and the quality of anti-aliasing or anisotropic filtering, but even that was aimed at game performance and quality.
Graphics cards designed for the professional 3D market (CAD/CAM, industrial design) have, of course, existed for years.
2009 ©S.Maciulevičius

GPU vs CPU

Graphics and CPU
A graphics processor computes the physical effects needed when working with images much faster than a central processing unit does, for example:

GPGPU
GPGPU stands for General-Purpose computation on Graphics Processing Units, also known as GPU computing.
Graphics Processing Units (GPUs) are high-performance many-core processors capable of very high computation and data throughput.
Once specially designed for computer graphics and difficult to program, today's GPUs are general-purpose parallel processors with support for accessible programming interfaces and industry-standard languages such as C.

GPGPU
The idea behind GPGPU technology is simple: it speeds up many types of everyday consumer tasks by using the GPU and the CPU in tandem for "general purpose" computations (number crunching) that were once handled by the CPU alone.
When this technology fully matures, consumers will see noticeable performance increases when they convert audio and video files, play graphics-intensive games, and carry out other daily tasks.

GPGPU

GPGPU
The first GPGPU (a programmable raster display system) was created in 1978.
Before 2006, only a handful of other systems incorporated GPGPU technology.
According to AMD's website, the company started the "GPGPU revolution" in November 2006 with the first iteration of its GPGPU technology, which has since evolved into ATI Stream.

Stream computing
Stream computing (or stream processing) refers to a class of compute problems, applications or tasks that can be broken down into parallel, identical operations and run simultaneously on a single processor device.
The parallel data streams entering the processor device, the computations taking place, and the output from the device together define stream computing.
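As a rough illustration of the stream model described above, the following Python sketch applies one identical operation to every element of an input stream. The names `kernel` and `run_stream` are assumptions made for this example, not part of any GPU toolkit; Python runs the elements one after another, whereas a stream processor would handle many of them at once.

```python
from typing import Callable, Iterable, List


def kernel(x: float) -> float:
    """Hypothetical per-element operation: the same arithmetic for every item."""
    return 2.0 * x + 1.0


def run_stream(op: Callable[[float], float], stream: Iterable[float]) -> List[float]:
    """Apply `op` independently to each element of the input stream.

    No element depends on another, so the iterations could run in any
    order, or all at once on parallel hardware.
    """
    return [op(x) for x in stream]


# Input stream -> identical computation per element -> output stream.
output_stream = run_stream(kernel, [0.0, 1.0, 2.0, 3.0])
```

The point of the sketch is only the shape of the computation: data streams in, the same operation is applied everywhere, and results stream out.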

Applications suited to Stream Computing
Applications best suited to stream computing possess two fundamental characteristics:
A high degree of arithmetic computation per system memory fetch.
Computational independence: arithmetic occurs on each processing unit without needing to be checked or verified by or with arithmetic occurring on any other processing unit.
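The first characteristic can be made concrete as a ratio of arithmetic operations to bytes moved. The sketch below is a simplified estimate under stated assumptions: 4-byte single-precision values, every operand moved exactly once, and the function names invented for this example. It compares a vector addition with an idealised matrix multiplication.

```python
def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """Floating-point operations performed per byte moved to or from memory."""
    return flops / bytes_moved


def vector_add_intensity(n: int, elem_bytes: int = 4) -> float:
    # n additions; two input vectors are read and one output vector is written.
    return arithmetic_intensity(n, 3 * n * elem_bytes)


def matmul_intensity(n: int, elem_bytes: int = 4) -> float:
    # Idealised n x n matrix product: 2*n**3 operations (multiply and add),
    # assuming each of the three matrices is moved exactly once.
    return arithmetic_intensity(2 * n**3, 3 * n * n * elem_bytes)


low = vector_add_intensity(1024)   # about 0.08 operations per byte
high = matmul_intensity(1024)      # about 171 operations per byte
```

Under these assumptions the matrix product does far more arithmetic per memory fetch, which is why matrix calculations are a better fit for stream computing than simple element-wise additions.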

Applications suited to Stream Computing
Examples include:
Engineering - fluid dynamics
Mathematics - linear equations, matrix calculations
Simulations - Monte Carlo, molecular modeling, etc.
Financial - options pricing
Biological - protein structure calculations
Imaging - medical image processing
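A Monte Carlo simulation shows why such workloads fit the model: each batch of samples can be computed without consulting any other batch. The following sketch estimates pi in independent chunks; the function names, the chunk count and the sample sizes are arbitrary choices for illustration, and the chunks run sequentially here rather than on separate processing units.

```python
import random


def count_hits(seed: int, samples: int) -> int:
    """Count random points in the unit square that fall inside the quarter circle.

    Each chunk has its own seeded generator, so it shares no state with
    any other chunk.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits


def estimate_pi(chunks: int = 4, samples_per_chunk: int = 10_000) -> float:
    # The chunks are independent; only the final sum combines their results.
    total_hits = sum(count_hits(seed, samples_per_chunk) for seed in range(chunks))
    return 4.0 * total_hits / (chunks * samples_per_chunk)


pi_estimate = estimate_pi()
```

Only the last step, adding up the per-chunk hit counts, requires any communication between the pieces of work.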

GP-GPU development
Today, there are essentially two major competing standards for GP-GPU development:
ATI's first attempt at exposing the GPU to general-purpose computing tasks was through a "Close to the Metal" (CTM) driver.
CUDA is Nvidia's "Compute Unified Device Architecture", a programming model based on C.

Using the GPU's stream processors
ATI Stream and CUDA focus on using the GPU's stream processors in tandem with the CPU to enable the entire system to handle computing-intensive applications.

ATI Stream
ATI Stream technology is a set of advanced hardware and software technologies that enable AMD graphics processors (GPUs), working in concert with the system's CPU, to accelerate many applications beyond just graphics.
This enables better-balanced platforms capable of running demanding computing tasks faster than ever.

ATI Stream
ATI Stream uses a parallel computing architecture that takes advantage of the graphics card's stream processors to compute problems, applications or tasks that can be broken down into parallel, identical operations and run simultaneously on a single processor device.
Stream computing uses a SIMD (single instruction, multiple data) methodology, whereas a CPU uses a modified SISD (single instruction, single data) methodology.
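The difference between the two methodologies can be sketched by counting how many operations have to be issued for the same work. This is a toy model, not a description of any real instruction set: the functions `sisd_add` and `simd_add` and the lane width of 4 are assumptions for illustration.

```python
from typing import List, Sequence, Tuple


def sisd_add(a: Sequence[float], b: Sequence[float]) -> Tuple[List[float], int]:
    """Scalar model: one issued operation handles one pair of operands."""
    out: List[float] = []
    issued = 0
    for x, y in zip(a, b):
        out.append(x + y)
        issued += 1
    return out, issued


def simd_add(a: Sequence[float], b: Sequence[float], lanes: int = 4) -> Tuple[List[float], int]:
    """Vector model: one issued operation handles up to `lanes` pairs at once."""
    out: List[float] = []
    issued = 0
    for start in range(0, len(a), lanes):
        stop = start + lanes
        out.extend(x + y for x, y in zip(a[start:stop], b[start:stop]))
        issued += 1
    return out, issued


a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
b = [10.0] * 8
scalar_result, scalar_issued = sisd_add(a, b)          # 8 operations issued
vector_result, vector_issued = simd_add(a, b, lanes=4)  # 2 operations issued
```

Both versions produce the same sums; the SIMD model simply needs fewer issued operations because each one covers several data elements.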

ATI Stream
ATI has evolved its stream computing efforts into the Stream SDK, which includes a number of features.
Low-level access is provided through the more accessible CAL (Compute Abstraction Layer), and the code libraries include the AMD Core Math Library (ACML), the AMD Performance Library (APL), and a video transcode library called COBRA.

AMD FireStream™ 9270
The AMD FireStream product line provides the industry's first double-precision floating-point capability on a GPU.
The AMD FireStream 9270 is the next-generation double-precision floating-point (DP-FP) product.
With 2 GB of GDDR5 memory on board, single-precision performance of 1.2 TFLOPS and double-precision performance of 240 GFLOPS, the FireStream 9270 is ideal for the most demanding compute-intensive, data-parallel tasks.

AMD FireStream™ 9270
Built on 55 nm process technology, even this large-memory board consumes less than 220 watts peak (160 watts typical).
System requirements:
PCIe 2.0 based server or workstation with an available x16 lane graphics slot
600 W or greater power supply
512 MB of system memory

AMD FireStream™ 9270
AMD FireStream 9270 specifications:
Number of GPUs – 1
Stream Cores – 800
Floating point formats – IEEE single & double precision
GPU local memory – 2 GB GDDR5 SDRAM
Memory interface – 850 MHz
Peak memory bandwidth – GB/s
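A few ratios follow directly from the figures quoted on these slides (1.2 TFLOPS single precision, 240 GFLOPS double precision, 800 stream cores, 220 W peak). The short calculation below uses only those numbers; the variable names are chosen for this example.

```python
# Figures taken from the FireStream 9270 slides above.
SP_PEAK_FLOPS = 1.2e12   # single-precision peak, FLOPS
DP_PEAK_FLOPS = 240e9    # double-precision peak, FLOPS
STREAM_CORES = 800
PEAK_POWER_WATTS = 220   # stated upper bound on peak board power

# Double precision runs at one fifth of the single-precision rate.
dp_to_sp_ratio = DP_PEAK_FLOPS / SP_PEAK_FLOPS

# Average single-precision throughput attributable to each stream core.
sp_flops_per_core = SP_PEAK_FLOPS / STREAM_CORES

# Single-precision GFLOPS per watt at the stated peak power bound.
sp_gflops_per_watt = SP_PEAK_FLOPS / 1e9 / PEAK_POWER_WATTS
```

The results are a double- to single-precision ratio of 0.2, about 1.5 GFLOPS per stream core, and roughly 5.5 GFLOPS per watt at peak power.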

NVidia's CUDA
Nvidia's Compute Unified Device Architecture ("CUDA") platform was announced together with G80 in November 2006.
A public beta version of the CUDA SDK was released in February 2007.
The first version of CUDA rolled out in June 2007 with Tesla, which was based on G80 and designed for high-performance computing.

NVidia's CUDA
Nvidia CUDA is a general-purpose parallel computing architecture that leverages the parallel compute engine in Nvidia graphics processing units to solve many complex computational problems in a fraction of the time required on a CPU.
It includes the CUDA Instruction Set Architecture and the parallel compute engine in the GPU.
CUDA performs two major functions that consumers should be aware of: it helps reduce or match CPU usage by engaging the GPU's stream processors, and it can accelerate any computing process where CUDA is enabled.

NVidia's CUDA
CUDA is supported on all GeForce 8-, 9-, and GTX200-series cards, and is tied fairly closely to Nvidia's GPU architecture. It should be possible to make CUDA drivers for other hardware.
CUDA is relatively simple, as stream processing languages go. It is based on C, with some extensions, so it is fairly familiar to developers.
Writing code that is highly parallel and manages data to work optimally in a GPU's memory systems is tricky, but the payoffs are great.
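To give a feel for the programming model without requiring a GPU, the sketch below emulates in plain Python how a CUDA-style kernel is launched over a grid of thread blocks. It is not CUDA code: the functions `saxpy_thread` and `launch` are hypothetical, and the nested loops run sequentially where a GPU would run the threads in parallel. The only element borrowed from CUDA is the usual way a thread derives its global index from its block index, block size and thread index.

```python
from typing import List


def saxpy_thread(block_idx: int, block_dim: int, thread_idx: int,
                 a: float, x: List[float], y: List[float], out: List[float]) -> None:
    """Work done by one emulated thread: out[i] = a * x[i] + y[i]."""
    # Global element index, as in CUDA's blockIdx.x * blockDim.x + threadIdx.x.
    i = block_idx * block_dim + thread_idx
    # Guard: the last block may contain threads beyond the end of the data.
    if i < len(x):
        out[i] = a * x[i] + y[i]


def launch(grid_dim: int, block_dim: int,
           a: float, x: List[float], y: List[float]) -> List[float]:
    """Sequential stand-in for a kernel launch over grid_dim blocks of block_dim threads."""
    out = [0.0] * len(x)
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            saxpy_thread(block_idx, block_dim, thread_idx, a, x, y, out)
    return out


x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [10.0, 20.0, 30.0, 40.0, 50.0]
block_dim = 4
grid_dim = (len(x) + block_dim - 1) // block_dim  # enough blocks to cover all elements
result = launch(grid_dim, block_dim, 2.0, x, y)
```

Each emulated thread touches exactly one element, which is the kind of decomposition that makes code highly parallel; arranging data so that it suits the GPU's memory systems is the harder part and is not modelled here.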

NVidia's CUDA
In the high-performance computing (HPC) environment, where large clusters or supercomputers are purpose-built to perform specific tasks with custom software, CUDA has gained a lot of traction.
Financial analysis models, oil and gas exploration software, medical imaging, fluid dynamics, and other tough "big iron" tasks already use CUDA together with Nvidia GPUs to achieve speed improvements of an order of magnitude or two over running on CPUs.

Stanford University runs one of the most popular distributed computing applications around, Folding@home (FAH).
It calculates protein folding on a massive scale, using thousands of computers and PlayStation 3s around the world.
For a while now the labs at Stanford have been working with ATI to produce a GPU-accelerated version of their folding software.
Now, the second generation of this GPU folding app uses the more reliable and better-performing CUDA for Nvidia GPUs, or CAL for ATI cards.

A quick look at the Client Stats page shows that there are, at the time of this writing, about 7600 active GPUs running the FAH app, generating around 840 teraflops of computational power (this is not a theoretical peak number; it is real work running the FAH computations).
That is somewhere around 110 gigaflops per GPU, on average.
To put that in perspective, the regular Windows CPU client delivers about one gigaflop per client (a mix of the single-threaded client and the multi-core SMP version).
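The per-GPU figure can be checked from the two totals quoted above. The calculation below uses only the numbers on this slide; the variable names are chosen for this example.

```python
# Figures quoted from the FAH Client Stats page on the slide.
ACTIVE_GPUS = 7600
TOTAL_TFLOPS = 840
CPU_CLIENT_GFLOPS = 1.0  # "about one gigaflop per client"

# Convert teraflops to gigaflops, then average over the active GPUs.
gflops_per_gpu = TOTAL_TFLOPS * 1000 / ACTIVE_GPUS

# How many regular CPU clients one average GPU client is worth.
gpu_to_cpu_ratio = gflops_per_gpu / CPU_CLIENT_GFLOPS
```

The average comes out at about 110.5 gigaflops per GPU, so one average GPU client does the work of roughly 110 regular CPU clients.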

GPGPU in supercomputer
In 2009 China unveiled its fastest supercomputer, Tianhe (Milky Way).
The supercomputer reaches a performance of 1.2 petaflops, and its Linpack result, in the teraflops range, would have placed it fourth on the current Top500 list.
Tianhe is equipped with 6,144 Intel Xeon CPUs and 5,120 AMD GPUs (Radeon HD 4870X2 graphics cards).

Intel Larrabee
Intel has developed a new Intel® microarchitecture, codenamed Larrabee, to meet the increasingly compute- and memory-intensive demands of the latest PC games and high-performance computing applications, such as image processing, physical simulation, and medical and financial analytics.
The programmability of the Larrabee architecture creates new opportunities for developers to innovate in the visual computing realm.
The Larrabee microarchitecture features a number of hardware advances, including a many-core throughput design, for a wide range of highly parallel visual computing applications.

Intel Larrabee
Figure: a conceptual model of the Larrabee architecture. The actual numbers of cores, texture units, memory controllers, and so on will vary a lot. Also, the structure of the bus and the placement of devices on the ring are more complex than shown.

Intel Larrabee
Larrabee can be considered a hybrid between a multi-core CPU and a GPU, and has similarities to both. Its coherent cache hierarchy and x86 architecture compatibility are CPU-like, while its wide SIMD vector units and texture sampling hardware are GPU-like.
As a GPU, Larrabee will support traditional rasterized 3D graphics (Direct3D & OpenGL) for games. However, Larrabee's hybrid of CPU and GPU features should also be suitable for general-purpose GPU (GPGPU) or stream processing tasks.

Intel Larrabee
Larrabee will differ from other discrete GPUs currently on the market, such as the GeForce 200 Series and the Radeon 4000 series, in three major ways:
Larrabee will use the x86 instruction set with Larrabee-specific extensions.[10]
Larrabee will feature cache coherency across all its cores.[10]
Larrabee will include very little specialized graphics hardware, instead performing tasks like z-buffering, clipping, and blending in software, using a tile-based rendering approach.

Intel Larrabee
The x86 processor cores in Larrabee will differ in several ways from the cores in current Intel CPUs such as the Core 2 Duo or Core i7:
Larrabee's x86 cores will be based on the much simpler Pentium P54C design, which is still being maintained for use in embedded applications.
The P54C-derived core is superscalar but does not include out-of-order execution, though it has been updated with modern features such as x86-64 support, similar to Intel Atom.
In-order execution means lower performance for individual cores, but since they are smaller, more can fit on a single chip, increasing overall throughput.

Intel Larrabee
Each Larrabee core contains a 512-bit vector processing unit, able to process 16 single-precision floating-point numbers at a time.
Larrabee includes one major fixed-function graphics hardware feature: texture sampling units. These perform trilinear and anisotropic filtering and texture decompression.
Larrabee has a 1024-bit (512 bits each way) ring bus for communication between cores and to memory.
Larrabee includes explicit cache control instructions to reduce cache thrashing during streaming operations that read or write data only once. Explicit prefetching into L2 or L1 cache is also supported.
Each core supports 4-way simultaneous multithreading, with 4 copies of each processor register.
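The figures on this slide are tied together by simple arithmetic, sketched below. The constants come from the slide, except the 32-bit size of a single-precision value, which is standard IEEE 754; the core count passed to `hardware_threads` is purely hypothetical, since the slides state that the actual number of cores will vary.

```python
VECTOR_WIDTH_BITS = 512        # vector unit width per core (from the slide)
SINGLE_PRECISION_BITS = 32     # IEEE 754 single-precision size
RING_BUS_BITS_EACH_WAY = 512   # ring bus width per direction (from the slide)
THREADS_PER_CORE = 4           # 4-way simultaneous multithreading (from the slide)

# Number of single-precision values one vector operation can hold.
sp_lanes = VECTOR_WIDTH_BITS // SINGLE_PRECISION_BITS

# Total ring bus width across both directions.
ring_bus_total_bits = 2 * RING_BUS_BITS_EACH_WAY


def hardware_threads(cores: int) -> int:
    """Hardware threads available for a given (hypothetical) core count."""
    return cores * THREADS_PER_CORE
```

This reproduces the 16 single-precision lanes and the 1024-bit total ring width quoted above; for example, `hardware_threads(8)` gives 32 threads for an assumed 8-core part.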
