Vectors, SIMD Extensions and GPUs. COMP 4611 Tutorial 11, Nov. 26-27, 2014.



SIMD Introduction
SIMD (Single Instruction, Multiple Data) architectures can exploit significant data-level parallelism for:
– Matrix-oriented scientific computing
– Media-oriented image and sound processing
SIMD has a simpler hardware design and execution flow than MIMD (Multiple Instruction, Multiple Data)
– It only needs to fetch one instruction per data operation
SIMD allows the programmer to continue to think sequentially

SIMD Parallelism
– Vector architectures
– SIMD extensions
– Graphics Processing Units (GPUs)

Vector Architectures
Basic idea:
– Read sets of data elements into “vector registers”
– Operate on those registers
  A single instruction -> dozens of operations on independent data elements
– Disperse the results back into memory
Registers are controlled by the compiler
– Used to hide memory latency

VMIPS
Example architecture: VMIPS
– Loosely based on the Cray-1
– Vector registers
  Each register holds a 64-element vector, 64 bits/element
  The register file has 16 read ports and 8 write ports
– Vector functional units
  Fully pipelined
  Data and control hazards are detected
– Vector load-store unit
  Fully pipelined
  One word per clock cycle after initial latency
– Scalar registers
  32 general-purpose registers
  32 floating-point registers

The basic structure of a vector architecture, VMIPS. This processor has a scalar architecture just like MIPS. There are also eight 64-element vector registers, and all the functional units are vector functional units.

VMIPS Instructions
ADDVV.D: add two vectors
MULVS.D: multiply each element of a vector by a scalar
LV/SV: vector load and vector store from an address
Example: DAXPY
– Y = a x X + Y, where X and Y are vectors and a is a scalar

  L.D      F0,a       ; load scalar a
  LV       V1,Rx      ; load vector X
  MULVS.D  V2,V1,F0   ; vector-scalar multiply
  LV       V3,Ry      ; load vector Y
  ADDVV.D  V4,V2,V3   ; add
  SV       Ry,V4      ; store the result

Requires 6 instructions
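For reference, the DAXPY computation that the six-instruction VMIPS sequence above performs can be sketched in Python (the function name and the 64-element vector length are illustrative, matching one VMIPS vector register):

```python
def daxpy(a, x, y):
    """Y = a*X + Y element-wise: the whole-vector computation the
    LV / MULVS.D / LV / ADDVV.D / SV sequence carries out."""
    return [a * xi + yi for xi, yi in zip(x, y)]

# 64 elements, matching one VMIPS vector register
x = [float(i) for i in range(64)]
y = [1.0] * 64
y = daxpy(2.0, x, y)
```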

DAXPY in MIPS Instructions
Example: DAXPY (double precision a*X+Y)

        L.D     F0,a        ; load scalar a
        DADDIU  R4,Rx,#512  ; last address to load
  Loop: L.D     F2,0(Rx)    ; load X[i]
        MUL.D   F2,F2,F0    ; a x X[i]
        L.D     F4,0(Ry)    ; load Y[i]
        ADD.D   F4,F4,F2    ; a x X[i] + Y[i]
        S.D     F4,0(Ry)    ; store into Y[i]
        DADDIU  Rx,Rx,#8    ; increment index to X
        DADDIU  Ry,Ry,#8    ; increment index to Y
        DSUBU   R20,R4,Rx   ; compute bound
        BNEZ    R20,Loop    ; check if done

Requires almost 600 MIPS instructions
Copyright © 2012, Elsevier Inc. All rights reserved.

SIMD Extensions
Media applications operate on data types narrower than the native word size
– Example: “partition” a 64-bit adder into eight 8-bit lanes
Limitations compared with vector instructions:
– The number of data operands is encoded in the opcode
– No sophisticated addressing modes
– No mask registers
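The partitioned-adder idea can be illustrated in software. The sketch below (a standard SWAR technique, not code from the slides) performs eight independent 8-bit additions packed inside one 64-bit word, masking off each lane's high bit so that carries cannot cross lane boundaries:

```python
HI = 0x8080808080808080  # high bit of each 8-bit lane
LO = 0x7F7F7F7F7F7F7F7F  # low seven bits of each lane

def add_8x8(x, y):
    """Eight independent 8-bit adds inside a 64-bit integer."""
    low = (x & LO) + (y & LO)    # carries stay inside each lane
    return low ^ ((x ^ y) & HI)  # restore high bits without a carry out

# lane 0: 0xFF + 0x01 wraps to 0x00; lane 1: 0x01 + 0x01 = 0x02,
# undisturbed by its neighbor's overflow
result = add_8x8(0x01FF, 0x0101)  # -> 0x0200
```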

Example SIMD Code
Example DAXPY:

        L.D     F0,a        ; load scalar a
        MOV     F1,F0       ; copy a into F1 for SIMD MUL
        MOV     F2,F0       ; copy a into F2 for SIMD MUL
        MOV     F3,F0       ; copy a into F3 for SIMD MUL
        DADDIU  R4,Rx,#512  ; last address to load
  Loop: L.4D    F4,0(Rx)    ; load X[i], X[i+1], X[i+2], X[i+3]
        MUL.4D  F4,F4,F0    ; a*X[i], a*X[i+1], a*X[i+2], a*X[i+3]
        L.4D    F8,0(Ry)    ; load Y[i], Y[i+1], Y[i+2], Y[i+3]
        ADD.4D  F8,F8,F4    ; a*X[i]+Y[i], ..., a*X[i+3]+Y[i+3]
        S.4D    F8,0(Ry)    ; store into Y[i], Y[i+1], Y[i+2], Y[i+3]
        DADDIU  Rx,Rx,#32   ; increment index to X
        DADDIU  Ry,Ry,#32   ; increment index to Y
        DSUBU   R20,R4,Rx   ; compute bound
        BNEZ    R20,Loop    ; check if done

SIMD Implementations
Implementations:
– Intel MMX (1996)
  Eight 8-bit integer ops or four 16-bit integer ops
– Streaming SIMD Extensions (SSE) (1999)
  Eight 16-bit integer ops
  Four 32-bit integer/fp ops or two 64-bit integer/fp ops
– Advanced Vector Extensions (AVX) (2010)
  Four 64-bit integer/fp ops

Graphics Processing Units
Given the hardware invested to do graphics well, how can we supplement it to improve the performance of a wider range of applications?

GPU Accelerated or Many-core Computation
NVIDIA GPU
– Reliable performance; communication between CPU and GPU over PCIe channels
– Tesla K
  CUDA cores
  4.29 TFLOPS (single precision)
    TFLOPS: trillion floating-point operations per second
  1.43 TFLOPS (double precision)
  12GB memory with 288GB/sec bandwidth
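A back-of-the-envelope check on these figures, using only the 288GB/sec bandwidth quoted above: DAXPY moves 24 bytes per element for just 2 floating-point operations, so on this card it is memory-bound far below the 1.43 TFLOPS double-precision peak:

```python
bandwidth = 288e9            # bytes/sec, from the slide above
bytes_per_element = 3 * 8    # load X[i], load Y[i], store Y[i]; 8-byte doubles
flops_per_element = 2        # one multiply, one add

elements_per_sec = bandwidth / bytes_per_element
daxpy_flops = elements_per_sec * flops_per_element
print(daxpy_flops / 1e9)     # about 24 GFLOPS, under 2% of the DP peak
```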


Best commercially available GPU at the time: AMD FirePro S
– Hawaii cores
– 5.07 TFLOPS (single precision)
  TFLOPS: trillion floating-point operations per second
– 2.53 TFLOPS (double precision)
– 16GB memory with 320GB/sec bandwidth

GPU Accelerated or Many-core Computation (cont.)
APU (Accelerated Processing Unit), from AMD
– CPU and GPU functionality on a single chip
– AMD FirePro A320
  4 CPU cores and 384 GPU processors
  736 GFLOPS (single precision) at 100W
  Up to 2GB dedicated memory

Programming the NVIDIA GPU
We use the NVIDIA GPU as the example for discussion
Heterogeneous execution model
– The CPU is the host, the GPU is the device
Develop a programming language (CUDA) for the GPU
Unify all forms of GPU parallelism as the CUDA thread
The programming model is “Single Instruction, Multiple Thread” (SIMT)
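The SIMT idea can be sketched in plain Python: every CUDA thread executes the same kernel body and differs only in its block and thread indices. Below, a hypothetical DAXPY kernel is written in CUDA style, with the parallel launch emulated by sequential loops (names such as block_dim mirror CUDA's blockDim; this is an illustration, not CUDA code):

```python
def daxpy_kernel(block_idx, thread_idx, block_dim, a, x, y):
    """Body executed by every thread; only the indices differ."""
    i = block_idx * block_dim + thread_idx  # global element index
    if i < len(x):                          # guard for a partial last block
        y[i] = a * x[i] + y[i]

def launch(grid_dim, block_dim, a, x, y):
    """Sequential emulation of a grid_dim x block_dim kernel launch."""
    for b in range(grid_dim):
        for t in range(block_dim):
            daxpy_kernel(b, t, block_dim, a, x, y)

x = [1.0] * 100
y = [2.0] * 100
launch(grid_dim=4, block_dim=32, a=3.0, x=x, y=y)  # 128 threads cover 100 elements
```

Note how the per-thread guard lets the launch over-provision threads: the grid is sized up to a multiple of the block size, and surplus threads simply do nothing.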

Fermi GPU: one streaming multiprocessor (SM)

NVIDIA GPU Memory Structures
Each SIMD lane has a private section of off-chip DRAM
– “Private memory”
Each multithreaded SIMD processor also has local memory
– Shared by the SIMD lanes within that processor
Memory shared by all SIMD processors is GPU memory
– The host can read and write GPU memory

Fermi Multithreaded SIMD Processor
Each streaming multiprocessor (SM) has
– 32 SIMD lanes
– 16 load/store units
– 32,768 registers
Each SIMD thread has up to
– 64 vector registers of 32 32-bit elements
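These figures imply a bound on concurrency per SM, assuming the standard Fermi description of 64 vector registers of 32 32-bit elements per SIMD thread: a SIMD thread using its full complement holds 2,048 of the SM's 32,768 registers, so at most 16 such SIMD threads can be resident at once:

```python
registers_per_sm = 32768
regs_per_simd_thread = 64 * 32   # 64 vector registers x 32 elements each
max_resident_simd_threads = registers_per_sm // regs_per_simd_thread
print(max_resident_simd_threads)  # 16
```

In practice kernels that use fewer registers per thread allow more resident SIMD threads, which is one way the GPU hides memory latency.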

GPU Compared to Vector
Similarities to vector machines:
– Works well with data-level parallel problems
– Mask registers
– Large register files
Differences:
– No scalar processor
– Uses multithreading to hide memory latency
– Has many functional units, as opposed to a few deeply pipelined units like a vector processor