CIS 501: Computer Architecture
Unit 11: Data-Level Parallelism: Vectors & GPUs

Slides developed by Joe Devietti, Milo Martin & Amir Roth at UPenn, with sources that included University of Wisconsin slides by Mark Hill, Guri Sohi, Jim Smith, and David Wood.

How to Compute This Fast?
- Performing the same operations on many data items
- Example: SAXPY

    for (int I = 0; I < 1024; I++) {
        Z[I] = A*X[I] + Y[I];
    }

    L1: ldf  [X+r1]->f1   // I is in r1
        mulf f0,f1->f2    // A is in f0
        ldf  [Y+r1]->f3
        addf f2,f3->f4
        stf  f4->[Z+r1]
        addi r1,4->r1
        blti r1,4096,L1

- Instruction-level parallelism (ILP) - fine grained
  - Loop unrolling with static scheduling (sketch below) - or - dynamic scheduling
  - Wide-issue superscalar (non-)scaling limits benefits
- Thread-level parallelism (TLP) - coarse grained
  - Multicore
- Can we do some “medium-grained” parallelism?
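A sketch (not from the slides) of the loop-unrolling option named above: four independent iterations per trip expose ILP to the scheduler, but each element still costs its own instructions.

    /* 4x-unrolled SAXPY: the four statements are independent, so a
     * superscalar core can overlap them; instruction count is unchanged. */
    void saxpy_unrolled(const float *X, const float *Y, float *Z, float A) {
        for (int i = 0; i < 1024; i += 4) {
            Z[i]   = A * X[i]   + Y[i];
            Z[i+1] = A * X[i+1] + Y[i+1];
            Z[i+2] = A * X[i+2] + Y[i+2];
            Z[i+3] = A * X[i+3] + Y[i+3];
        }
    }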

Data-Level Parallelism
- Data-level parallelism (DLP)
  - Single operation repeated on multiple data elements
    - SIMD (Single-Instruction, Multiple-Data)
  - Less general than ILP: parallel insns are all the same operation
  - Exploit with vectors
- Old idea: Cray-1 supercomputer from late 1970s
  - Eight 64-entry x 64-bit floating point “vector registers”
    - 4096 bits (0.5KB) in each register! 4KB for vector register file
  - Special vector instructions to perform vector operations
    - Load vector, store vector (wide memory operation)
    - Vector+Vector or Vector+Scalar addition, subtraction, multiply, etc.
    - In Cray-1, each instruction specifies 64 operations!
    - ALUs were expensive, so one operation per cycle (not parallel)

Example Vector ISA Extensions (SIMD)
- Extend ISA with floating point (FP) vector storage ...
  - Vector register: fixed-size array of 32- or 64-bit FP elements
  - Vector length: for example, 4, 8, 16, 64, ...
- ... and example operations for vector length of 4 (scalar-semantics sketch below)
  - Load vector: ldf.v [X+r1]->v1

        ldf [X+r1+0]->v1[0]
        ldf [X+r1+1]->v1[1]
        ldf [X+r1+2]->v1[2]
        ldf [X+r1+3]->v1[3]

  - Add two vectors: addf.vv v1,v2->v3
        addf v1[i],v2[i]->v3[i]  (where i is 0,1,2,3)
  - Add vector to scalar: addf.vs v1,f2->v3
        addf v1[i],f2->v3[i]  (where i is 0,1,2,3)
- Today’s vectors: short (128 or 256 bits), but fully parallel
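The scalar semantics of the 4-wide operations above, sketched in plain C (function names are our own, not part of any ISA): one vector instruction stands for four independent element operations.

    /* addf.vv: element-wise vector+vector add */
    void addf_vv(const float v1[4], const float v2[4], float v3[4]) {
        for (int i = 0; i < 4; i++) v3[i] = v1[i] + v2[i];
    }

    /* addf.vs: vector+scalar add (scalar f2 applied to every element) */
    void addf_vs(const float v1[4], float f2, float v3[4]) {
        for (int i = 0; i < 4; i++) v3[i] = v1[i] + f2;
    }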

Example Use of Vectors – 4-wide
- Load vector: ldf.v [X+r1]->v1
- Multiply vector by scalar: mulf.vs v1,f2->v3
- Add two vectors: addf.vv v1,v2->v3
- Store vector: stf.v v1->[X+r1]

    Scalar:                    Vectorized:
    ldf  [X+r1]->f1            ldf.v   [X+r1]->v1
    mulf f0,f1->f2             mulf.vs v1,f0->v2
    ldf  [Y+r1]->f3            ldf.v   [Y+r1]->v3
    addf f2,f3->f4             addf.vv v2,v3->v4
    stf  f4->[Z+r1]            stf.v   v4->[Z+r1]
    addi r1,4->r1              addi    r1,16->r1
    blti r1,4096,L1            blti    r1,4096,L1

    7x1024 instructions vs 7x256 instructions (4x fewer instructions)

- Performance?
  - Best case: 4x speedup
  - But vector instructions don’t always have single-cycle throughput
    - Execution width (implementation) vs vector width (ISA)

Vector Datapath & Implementation
- Vector insns are just like normal insns... only “wider”
  - Single instruction fetch (no extra N² checks)
  - Wide register read & write (not multiple ports)
  - Wide execute: replicate floating point unit (same as superscalar)
  - Wide bypass (avoid N² bypass problem)
  - Wide cache read & write (single cache tag check)
- Execution width (implementation) vs vector width (ISA)
  - Example: Pentium 4 and “Core 1” execute vector ops at half width
  - “Core 2” executes them at full width
- Because they are just instructions...
  - ...superscalar execution of vector instructions
  - Multiple n-wide vector instructions per cycle

Intel’s SSE2/SSE3/SSE4/AVX/AVX2...
- Intel SSE2 (Streaming SIMD Extensions 2)
  - Sixteen 128-bit floating point registers (xmm0–xmm15)
  - Each can be treated as 2x64b FP or 4x32b FP (“packed FP”); intrinsics sketch below
    - Or 2x64b or 4x32b or 8x16b or 16x8b ints (“packed integer”)
    - Or 1x64b or 1x32b FP (just normal scalar floating point)
  - Original SSE: only 8 registers, no packed integer support
- Other vector extensions
  - AMD 3DNow!: 64b (2x32b)
  - PowerPC AltiVec/VMX: 128b (2x64b or 4x32b)
  - ARM NEON: 128b (2x64b, 4x32b, 8x16b)
- Looking forward for x86
  - Intel’s “Sandy Bridge” brought 256-bit vectors to x86
  - Intel’s “Xeon Phi” multicore brings 512-bit vectors to x86
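A minimal intrinsics sketch of the “packed FP” idea (our own example, assuming 16-byte-aligned arrays whose length is a multiple of 4; the intrinsics themselves are real SSE):

    #include <xmmintrin.h>   /* SSE: __m128, _mm_*_ps */

    void vec_add(const float *x, const float *y, float *z, int n) {
        for (int i = 0; i < n; i += 4) {
            __m128 vx = _mm_load_ps(&x[i]);   /* load 4 packed floats */
            __m128 vy = _mm_load_ps(&y[i]);
            __m128 vz = _mm_add_ps(vx, vy);   /* one insn: 4 parallel adds */
            _mm_store_ps(&z[i], vz);          /* store 4 packed floats */
        }
    }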

Other Vector Instructions
- These target specific domains: e.g., image processing, crypto
  - Vector reduction (sum all elements of a vector)
  - Geometry processing: 4x4 translation/rotation matrices
  - Saturating (non-overflowing) subword add/sub: image processing (see the sketch after this list)
  - Byte asymmetric operations: blending and composition in graphics
  - Byte shuffle/permute: crypto
  - Population (bit) count: crypto
  - Max/min/argmax/argmin: video codec
  - Absolute differences: video codec
  - Multiply-accumulate: digital-signal processing
  - Special instructions for AES encryption
- More advanced (but in Intel’s Xeon Phi)
  - Scatter/gather loads: indirect store (or load) from a vector of pointers
  - Vector mask: predication (conditional execution) of specific elements
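To make the saturating-add bullet concrete, a hedged SSE2 sketch (brighten() and its arguments are invented for illustration; _mm_adds_epu8 is a real SSE2 intrinsic):

    #include <emmintrin.h>   /* SSE2 packed-integer intrinsics */

    /* Brighten an image: 255 + anything clamps to 255 instead of wrapping
     * around to a dark value, which is why image code wants saturation. */
    void brighten(unsigned char *pix, int n, unsigned char amount) {
        __m128i inc = _mm_set1_epi8((char)amount);   /* broadcast to 16 bytes */
        for (int i = 0; i + 16 <= n; i += 16) {
            __m128i p = _mm_loadu_si128((__m128i *)&pix[i]);
            p = _mm_adds_epu8(p, inc);               /* 16 saturating adds at once */
            _mm_storeu_si128((__m128i *)&pix[i], p);
        }
        /* a scalar loop would handle the last n % 16 pixels */
    }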

Vector Masks (Predication)
- Vector masks: 1 bit per vector element
  - Implicit predicate in all vector operations:
        for (I=0; I<N; I++) if (mask[I]) { vop... }
  - Usually stored in a “scalar” register (up to 64 bits)
  - Used to vectorize loops with conditionals in them (scalar sketch below)
    - cmp_eq.v, cmp_lt.v, etc.: sets vector predicates

    for (I=0; I<32; I++)
        if (X[I] != 0.0) Z[I] = A/X[I];

    ldf.v    [X+r1] -> v1
    cmp_ne.v v1,f0 -> r2       // 0.0 is in f0
    divf.sv  {r2} v1,f1 -> v2  // A is in f1
    stf.v    {r2} v2 -> [Z+r1]
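What the masked sequence above computes, sketched in plain C for intuition (our own rendering, not ISA semantics):

    void masked_divide(const float X[32], float Z[32], float A) {
        int mask[32];
        for (int i = 0; i < 32; i++)
            mask[i] = (X[i] != 0.0f);   /* cmp_ne.v sets the predicate bits */
        for (int i = 0; i < 32; i++)
            if (mask[i])                /* only masked-on elements execute */
                Z[i] = A / X[i];        /* divf.sv + stf.v under the mask  */
    }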

Scatter Stores & Gather Loads
- How to vectorize:

    for (int i = 1; i < N; i++) {
        int bucket = val[i] / scalefactor;
        found[bucket] = 1;
    }

  - Easy to vectorize the divide, but what about the load/store?
- Solution: hardware support for vector “scatter stores”
  - stf.v v2->[r1+v1]
  - Each address calculated from r1+v1[i]:
        stf v2[0]->[r1+v1[0]], stf v2[1]->[r1+v1[1]],
        stf v2[2]->[r1+v1[2]], stf v2[3]->[r1+v1[3]]
- Vector “gather loads” defined analogously (AVX2 gather sketch below)
  - ldf.v [r1+v1]->v2
- Scatter/gathers slower than regular vector load/store ops
  - Still provide a throughput advantage over the non-vector version
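x86 did eventually gain hardware gathers; a hedged sketch (assumes AVX2 hardware and 32-bit element indices; _mm256_i32gather_ps is a real intrinsic, the function around it is ours):

    #include <immintrin.h>   /* AVX2 */

    /* out[i] = base[idx[i]] for 8 elements, via one gather instruction */
    void gather8(const float *base, const int *idx, float *out) {
        __m256i vidx = _mm256_loadu_si256((const __m256i *)idx);
        __m256 v = _mm256_i32gather_ps(base, vidx, 4);  /* scale 4 = sizeof(float) */
        _mm256_storeu_ps(out, v);
    }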

Using Vectors in Your Code
- Write in assembly
  - Ugh
- Use “intrinsic” functions and data types
  - For example: _mm_mul_ps() and the __m128 datatype
- Use vector data types (example below)
  - typedef double v2df __attribute__ ((vector_size (16)));
- Use a library someone else wrote
  - Let them do the hard work
  - Matrix and linear algebra packages
- Let the compiler do it (automatic vectorization, with feedback)
  - GCC’s -ftree-vectorize option, -ftree-vectorizer-verbose=n
  - Limited impact for C/C++ code (old, hard problem)
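A short sketch of the vector-data-types option (GCC’s generic vector extension; the v4sf typedef and saxpy4 are our own):

    typedef float v4sf __attribute__ ((vector_size (16)));  /* 4 floats = 16 bytes */

    /* Ordinary operators on vector types compile to SIMD instructions,
     * e.g. mulps/addps on SSE hardware. */
    v4sf saxpy4(float a, v4sf x, v4sf y) {
        v4sf va = {a, a, a, a};   /* broadcast the scalar by hand */
        return va * x + y;
    }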

Recap: Vectors for Exploiting DLP
- Vectors are an efficient way of capturing parallelism
  - Data-level parallelism
  - Avoid the N² problems of superscalar
  - Avoid the difficult fetch problem of superscalar
  - Area efficient, power efficient
- The catch?
  - Need code that is “vector-izable”
  - Need to modify program (unlike dynamically scheduled superscalar)
  - Requires some help from the programmer
- Looking forward: Intel “Xeon Phi” (aka Larrabee) vectors
  - More flexible (vector “masks”, scatter, gather) and wider
  - Should be easier to exploit, more bang for the buck

Graphics Processing Units (GPU)
- Killer app for parallelism: graphics (3D games)

    [photo: Nvidia Tesla S870]

GPUs and SIMD/Vector Data Parallelism
- How do GPUs have such high peak FLOPS & FLOPS/Joule?
  - Exploit massive data parallelism – focus on total throughput
  - Remove hardware structures that accelerate single threads
  - Specialized for graphics: e.g., data-types & dedicated texture units
- “SIMT” execution model
  - Single instruction multiple threads
  - Similar to both “vectors” and “SIMD”
  - A key difference: better support for conditional control flow
- Program it with CUDA or OpenCL (CUDA sketch below)
  - Extensions to C
  - Perform a “shader task” (a snippet of scalar computation) over many elements
  - Internally, GPU uses scatter/gather and vector mask operations
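For flavor, a minimal CUDA C sketch (not from the slides) of the earlier SAXPY loop written as a “shader task”: each thread computes one scalar element, and the hardware groups threads into SIMD lanes behind the scenes.

    __global__ void saxpy(int n, float a,
                          const float *x, const float *y, float *z) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  /* this thread's element */
        if (i < n)                  /* control flow, not an explicit vector mask */
            z[i] = a * x[i] + y[i];
    }
    /* launch, e.g.: saxpy<<<(n + 255) / 256, 256>>>(n, a, x, y, z); */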

Following slides c/o Kayvon Fatahalian’s “Beyond Programmable Shading” course.

Diffuse Reflection Example

    [figure: diffuse reflection vs. specular reflection]

SIMD vs SIMT
- SIMD: single insn multiple data
  - Write 1 insn that operates on a vector of data
  - Handle control flow via explicit masking operations
- SIMT: single insn multiple thread
  - Write 1 insn that operates on scalar data
  - Each of many threads runs this insn
  - Compiler+hw aggregate threads into groups that execute on SIMD hardware
  - Compiler+hw handle masking for control flow

Intel Larrabee → Xeon Phi
- GPU-like manycore accelerator
  - 61 in-order x86 cores running at ~1 GHz
  - 2-way superscalar
  - 4 hyperthreads per core
  - 512-bit vector operations
- Leverages CPU programming models like OpenMP
- Powers China’s Tianhe-2 supercomputer
  - 33.9 PFlops, the biggest in the world!
  - no GPUs, just Xeon multicores and Xeon Phis

By the numbers: CPUs vs GPUs

                      Intel Xeon E v3     Nvidia Tesla         Intel Xeon Phi
                      “Haswell”           K80                  7120X
    frequency         2.6 GHz             745 MHz              1.23 GHz
    cores / threads   14 / 28             26 (“4992”) / 10Ks   61 / 244
    RAM               1 TB+               24 GB                16 GB
    DP GFLOPS         120                 1,870                1,200
    transistors       5.7 B               14 B                 ~5 B
    price             $2,700              $5,000               $4,000

Data Parallelism Summary
- Data-level parallelism (DLP)
  - “Medium-grained” parallelism between ILP and TLP
  - Still one flow of execution (unlike TLP)
  - Compiler/programmer must explicitly express it (unlike ILP)
- Hardware support: new “wide” instructions (SIMD)
  - Wide registers, perform multiple operations in parallel
- Trends
  - Wider: 64-bit (MMX, 1996), 128-bit (SSE2, 2000), 256-bit (AVX, 2011), 512-bit (Xeon Phi, 2012?)
  - More advanced and specialized instructions
- GPUs
  - Embrace data parallelism via the “SIMT” execution model
  - Becoming more programmable all the time
- Today’s chips exploit parallelism at all levels: ILP, DLP, TLP


Tomorrow’s “CPU” Vectors

Beyond Today’s Vectors
- Today’s vectors are limited
  - Wide compute
  - Wide load/store of consecutive addresses
  - Allows for “SOA” (structure of arrays) style parallelism
- Looking forward (and backward)...
  - Vector masks
    - Conditional execution on a per-element basis
    - Allows vectorization of conditionals
  - Scatter/gather
    - Gather: a[i] = b[y[i]]; scatter: b[y[i]] = a[i]
    - Helps with sparse matrices, “AOS” (array of structures) parallelism (layout sketch below)
  - Together, enables a different style of vectorization
    - Translate arbitrary (parallel) loop bodies into vectorized code (later)
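A sketch of the two layouts named above (type names invented for illustration):

    /* SOA: each field contiguous, so one vector load grabs 4 consecutive x's */
    struct points_soa { float x[1024], y[1024], z[1024]; };

    /* AOS: fields interleaved; vectorizing across points needs gather/scatter */
    struct point { float x, y, z; };
    struct point points_aos[1024];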


SAXPY Example: Best Case

    void saxpy(float* x, float* y, float* z,
               float a, int length) {
      for (int i = 0; i < length; i++) {
        z[i] = a*x[i] + y[i];
      }
    }

    Scalar:                            Auto-vectorized:
    .L3:                               .L6:
        movss (%rdi,%rax), %xmm1           movaps (%rdi,%rax), %xmm1
        mulss %xmm0, %xmm1                 mulps  %xmm2, %xmm1
        addss (%rsi,%rax), %xmm1           addps  (%rsi,%rax), %xmm1
        movss %xmm1, (%rdx,%rax)           movaps %xmm1, (%rdx,%rax)
        addq  $4, %rax                     addq   $16, %rax
        cmpq  %rcx, %rax                   incl   %r8d
        jne   .L3                          cmpl   %r8d, %r9d
                                           ja     .L6

+ Scalar loop to handle last few iterations (if length % 4 != 0)
“mulps”: multiply packed ‘single’

SAXPY Example: Actual Code

    void saxpy(float* x, float* y, float* z,
               float a, int length) {
      for (int i = 0; i < length; i++) {
        z[i] = a*x[i] + y[i];
      }
    }

    Scalar: (same .L3 loop as on the previous slide)

    Auto-vectorized:
    .L8:
        movaps %xmm3, %xmm1
        movaps %xmm3, %xmm2
        movlps (%rdi,%rax), %xmm1
        movlps (%rsi,%rax), %xmm2
        movhps 8(%rdi,%rax), %xmm1
        movhps 8(%rsi,%rax), %xmm2
        mulps  %xmm4, %xmm1
        incl   %r8d
        addps  %xmm2, %xmm1
        movaps %xmm1, (%rdx,%rax)
        addq   $16, %rax
        cmpl   %r9d, %r8d
        jb     .L8

+ Explicit alignment test
+ Explicit aliasing test

Bridging “Best Case” and “Actual”
- Align arrays:

    typedef float afloat __attribute__ ((__aligned__(16)));

    void saxpy(afloat* x, afloat* y, afloat* z,
               float a, int length) {
      for (int i = 0; i < length; i++) {
        z[i] = a*x[i] + y[i];
      }
    }

- Avoid the aliasing check:

    void saxpy(afloat* __restrict__ x, afloat* __restrict__ y,
               afloat* __restrict__ z, float a, int length)

- Even with both, still has the “last few iterations” code

Reduction Example

    float diff = 0.0;
    for (int i = 0; i < N; i++) {
      diff += (a[i] - b[i]);
    }
    return diff;

    Scalar:                            Auto-vectorized:
    .L4:                               .L7:
        movss (%rdi,%rax), %xmm1           movaps (%rdi,%rax), %xmm0
        subss (%rsi,%rax), %xmm1           incl   %ecx
        addq  $4, %rax                     subps  (%rsi,%rax), %xmm0
        addss %xmm1, %xmm0                 addq   $16, %rax
        cmpq  %rdx, %rax                   addps  %xmm0, %xmm1
        jne   .L4                          cmpl   %ecx, %r8d
                                           ja     .L7
                                           haddps %xmm1, %xmm1
                                           movaps %xmm1, %xmm0
                                           je     .L3

“haddps”: Packed Single-FP Horizontal Add (intrinsics version below)
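The same reduction written with intrinsics, as a hedged sketch (real SSE/SSE3 intrinsics, our own function around them; note it reassociates the FP additions, which is why compilers typically need a flag like -ffast-math to vectorize reductions automatically):

    #include <pmmintrin.h>   /* SSE3: _mm_hadd_ps */

    float sum_diff(const float *a, const float *b, int n) {
        __m128 acc = _mm_setzero_ps();                  /* 4 partial sums */
        for (int i = 0; i + 4 <= n; i += 4) {
            __m128 va = _mm_loadu_ps(&a[i]);
            __m128 vb = _mm_loadu_ps(&b[i]);
            acc = _mm_add_ps(acc, _mm_sub_ps(va, vb));  /* subps + addps */
        }
        acc = _mm_hadd_ps(acc, acc);                    /* fold 4 -> 2 */
        acc = _mm_hadd_ps(acc, acc);                    /* fold 2 -> 1 */
        float sum = _mm_cvtss_f32(acc);                 /* extract lane 0 */
        for (int i = n - (n % 4); i < n; i++)           /* scalar tail loop */
            sum += a[i] - b[i];
        return sum;
    }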

SSE2 on Pentium 4

Using Vectors in Your Code

Today’s GPU’s “SIMT” Model