
Slide 1: CSE 431 Computer Architecture, Fall 2008
Chapter 7B: SIMDs, Vectors, and GPUs
Mary Jane Irwin (www.cse.psu.edu/~mji)
[Adapted from Computer Organization and Design, 4th Edition, Patterson & Hennessy, © 2008, MK]

Slide 2: Flynn's Classification Scheme
• Now obsolete terminology except for...
• SISD – single instruction, single data stream
  - aka uniprocessor – what we have been talking about all semester
• SIMD – single instruction, multiple data streams
  - single control unit broadcasting operations to multiple datapaths
• MISD – multiple instruction, single data
  - no such machine (although some people put vector machines in this category)
• MIMD – multiple instructions, multiple data streams
  - aka multiprocessors (SMPs, MPPs, clusters, NOWs)

Slide 3: SIMD Processors
• Single control unit (one copy of the code)
• Multiple datapaths (Processing Elements – PEs) running in parallel
  - Q1 – PEs are interconnected (usually via a mesh or torus) and exchange/share data as directed by the control unit
  - Q2 – Each PE performs the same operation on its own local data
[Figure: a control unit broadcasting operations to an array of PEs]
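A minimal C sketch (not from the slides) of this execution model: the loop body below stands in for the single operation the control unit broadcasts, and each iteration plays the role of one PE applying that operation to its own local data.

    #include <stdio.h>

    #define NUM_PES 8   /* number of processing elements (illustrative) */

    int main(void) {
        /* Each PE holds its own local operands. */
        int local_a[NUM_PES] = {1, 2, 3, 4, 5, 6, 7, 8};
        int local_b[NUM_PES] = {10, 20, 30, 40, 50, 60, 70, 80};
        int result[NUM_PES];

        /* One "instruction" (add), many data streams: the loop body models
           what all PEs do simultaneously under the single control unit. */
        for (int pe = 0; pe < NUM_PES; pe++)
            result[pe] = local_a[pe] + local_b[pe];

        for (int pe = 0; pe < NUM_PES; pe++)
            printf("PE %d: %d\n", pe, result[pe]);
        return 0;
    }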

Slide 4: Example SIMD Machines

Machine    Maker              Year   # PEs    # b/PE   Max memory (MB)   PE clock (MHz)   System BW (MB/s)
Illiac IV  UIUC               1972   64       64       1                 13               2,560
DAP        ICL                1980   4,096    1        2                 5                2,560
MPP        Goodyear           1982   16,384   1        2                 10               20,480
CM-2       Thinking Machines  1987   65,536   1        512               7                16,384
MP-1216    MasPar             1989   16,384   4        1,024             25               23,000

• Did SIMDs die out in the early 1990s??

Slide 5: Multimedia SIMD Extensions
• The most widely used variation of SIMD is found in almost every microprocessor today – as the basis of the MMX and SSE instructions added to improve the performance of multimedia programs
  - A single, wide ALU is partitioned into many smaller ALUs that operate in parallel
  - Loads and stores are simply as wide as the widest ALU, so the same data transfer can move one 32-bit value, two 16-bit values, or four 8-bit values
• There are now hundreds of SSE instructions in the x86 to support multimedia operations
[Figure: a 32-bit adder partitioned into two 16-bit adders or four 8-bit adders]
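As an illustration of the partitioned-ALU idea, here is a hedged C sketch using the x86 SSE2 intrinsics (a later, 128-bit-wide descendant of the 32-bit adder in the figure): _mm_add_epi16 performs eight 16-bit additions with one instruction, and the same registers could instead be treated as sixteen 8-bit or four 32-bit lanes. Compile with SSE2 enabled (e.g., gcc -msse2).

    #include <stdio.h>
    #include <emmintrin.h>   /* SSE2 intrinsics */

    int main(void) {
        short a16[8] = {1, 2, 3, 4, 5, 6, 7, 8};
        short b16[8] = {10, 20, 30, 40, 50, 60, 70, 80};
        short r16[8];

        /* Load two 128-bit registers' worth of data. */
        __m128i va = _mm_loadu_si128((const __m128i *)a16);
        __m128i vb = _mm_loadu_si128((const __m128i *)b16);

        /* One instruction does eight 16-bit adds in parallel;
           _mm_add_epi8 / _mm_add_epi32 would instead do sixteen 8-bit
           or four 32-bit adds on the same registers. */
        __m128i vr = _mm_add_epi16(va, vb);
        _mm_storeu_si128((__m128i *)r16, vr);

        for (int i = 0; i < 8; i++)
            printf("%d ", r16[i]);
        printf("\n");
        return 0;
    }

On current x86 compilers the equivalent scalar loop may be auto-vectorized into these same partitioned instructions.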

Slide 6: Vector Processors
• A vector processor (e.g., Cray) pipelines the ALUs to get good performance at lower cost. A key feature is a set of vector registers to hold the operands and results.
  - Collect the data elements from memory, put them in order into a large set of registers, operate on them sequentially in registers, and then write the results back to memory
  - They formed the basis of supercomputers in the 1980s and 90s
• Consider extending the MIPS instruction set (VMIPS) to include vector instructions, e.g.:
  - addv.d: add two double-precision vector register values
  - addvs.d and mulvs.d: add (or multiply) a scalar register to (by) each element in a vector register
  - lv and sv: vector load and vector store of an entire vector of double-precision data

Slide 7: MIPS vs VMIPS DAXPY Codes: Y = a × X + Y

MIPS code:
            l.d    $f0,a($sp)      ;load scalar a
            addiu  $t1,$s0,#512    ;upper bound of what to load
    loop:   l.d    $f2,0($s0)      ;load X(i)
            mul.d  $f2,$f2,$f0     ;a × X(i)
            l.d    $f4,0($s1)      ;load Y(i)
            add.d  $f4,$f4,$f2     ;a × X(i) + Y(i)
            s.d    $f4,0($s1)      ;store into Y(i)
            addiu  $s0,$s0,#8      ;increment X index
            addiu  $s1,$s1,#8      ;increment Y index
            subu   $t0,$t1,$s0     ;compute bound
            bne    $t0,$zero,loop  ;check if done

Slide 8: MIPS vs VMIPS DAXPY Codes: Y = a × X + Y

MIPS code (as on the previous slide):
            l.d    $f0,a($sp)      ;load scalar a
            addiu  $t1,$s0,#512    ;upper bound of what to load
    loop:   l.d    $f2,0($s0)      ;load X(i)
            mul.d  $f2,$f2,$f0     ;a × X(i)
            l.d    $f4,0($s1)      ;load Y(i)
            add.d  $f4,$f4,$f2     ;a × X(i) + Y(i)
            s.d    $f4,0($s1)      ;store into Y(i)
            addiu  $s0,$s0,#8      ;increment X index
            addiu  $s1,$s1,#8      ;increment Y index
            subu   $t0,$t1,$s0     ;compute bound
            bne    $t0,$zero,loop  ;check if done

VMIPS code:
            l.d     $f0,a($sp)     ;load scalar a
            lv      $v1,0($s0)     ;load vector X
            mulvs.d $v2,$v1,$f0    ;vector-scalar multiply
            lv      $v3,0($s1)     ;load vector Y
            addv.d  $v4,$v2,$v3    ;add Y to a × X
            sv      $v4,0($s1)     ;store vector result
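For reference, both listings compute the plain-C loop below (a minimal sketch; the #512 bound in the MIPS code corresponds to 64 double-precision elements × 8 bytes):

    #include <stddef.h>

    /* DAXPY: Y = a * X + Y over n double-precision elements.
       The MIPS loop above performs one iteration per pass around the loop;
       the VMIPS sequence covers all n = 64 elements with six instructions. */
    void daxpy(size_t n, double a, const double *x, double *y) {
        for (size_t i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }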

Slide 9: Vector versus Scalar
• Instruction fetch and decode bandwidth is dramatically reduced (also saves power)
  - Only six instructions in VMIPS versus almost 600 in MIPS for the 64-element DAXPY (2 setup instructions plus 9 loop instructions × 64 iterations ≈ 578)
• Hardware doesn't have to check for data hazards within a vector instruction. A vector instruction will only stall for the first element; subsequent elements flow smoothly down the pipeline. And control hazards are nonexistent.
  - MIPS stall frequency is about 64 times higher than VMIPS for DAXPY
• Easier to write code for data-level parallel applications
• Have a known access pattern to memory, so heavily interleaved memory banks work well. The cost of latency to memory is seen only once for the entire vector.

Slide 10: Example Vector Machines

Machine          Maker   Year   Peak perf.          # vector processors   PE clock (MHz)
STAR-100         CDC     1970   ??                  –                     –
ASC              TI      1971   – MFLOPS            1, 2, or 4            16
Cray 1           Cray    1976   80 to 240 MFLOPS    1                     80
Cray Y-MP        Cray    1988   333 MFLOPS          2, 4, or 8            167
Earth Simulator  NEC     2002   35.86 TFLOPS        8                     500

• Did vector machines die out in the late 1990s??

Slide 11: The PS3 "Cell" Processor Architecture
• Composed of a non-SMP architecture
  - 234M transistors at 4 GHz
  - 1 Power Processing Element (PPE) "control" processor. The PPE is similar to a Xenon core
    · Slight ISA differences, and fine-grained MT instead of real SMT
  - 8 "Synergistic" (SIMD) Processing Elements (SPEs). The real compute power and differences lie in the SPEs (21M transistors each)
    · An attempt to 'fix' the memory latency problem by giving each SPE complete control over its own 256KB "scratchpad" memory (14M transistors), direct mapped for low latency
    · 4 vector units per SPE, 1 of everything else (7M transistors)
  - 512KB L2$ and a massively high-bandwidth (200GB/s) processor-memory bus

Slide 12: How to make use of the SPEs

Slide 13: What about the Software?
• Uses a special IBM "Hypervisor"
  - Like an OS for OSes
  - Runs both a real-time OS (for sound) and a non-real-time OS (for things like AI)
• Software must be specially coded to run well
  - The single PPE will be quickly bogged down
  - Must make use of the SPEs wherever possible
  - This isn't easy, by any standard
• What about Microsoft?
  - The development suite identifies which 6 threads you're expected to run
  - Four of them are DirectX based, and handled by the OS
  - Only need to write two threads, functionally

Slide 14: Graphics Processing Units (GPUs)
• GPUs are accelerators that supplement a CPU, so they do not need to be able to perform all of the tasks of a CPU. They dedicate all of their resources to graphics.
  - CPU-GPU combination – heterogeneous multiprocessing
• GPU programming interfaces are free from backward binary compatibility constraints, resulting in more rapid innovation in GPUs than in CPUs
  - Application programming interfaces (APIs) such as OpenGL and DirectX, coupled with high-level graphics shading languages such as NVIDIA's Cg and CUDA and Microsoft's HLSL
• GPU data types are vertices (x, y, z, w coordinates) and pixels (red, green, blue, alpha color components)
• GPUs execute many threads (e.g., vertex and pixel shading) in parallel – lots of data-level parallelism, as sketched below
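A conceptual C sketch of that data-level parallelism (hypothetical code, not any GPU API): each pixel is an independent (red, green, blue, alpha) work item, and the per-pixel function is what a GPU would launch as thousands of parallel threads; the plain loop stands in for the thread grid.

    /* A pixel as the GPU sees it: four color components. */
    typedef struct { float r, g, b, a; } pixel;

    /* Per-pixel "shader": on a GPU this body would run as one of
       thousands of parallel threads, one per pixel. */
    static pixel darken(pixel p, float factor) {
        pixel out = { p.r * factor, p.g * factor, p.b * factor, p.a };
        return out;
    }

    /* The loop stands in for the GPU's thread grid: every iteration is
       independent, which is exactly the data-level parallelism GPUs exploit. */
    void darken_image(pixel *img, int n, float factor) {
        for (int i = 0; i < n; i++)
            img[i] = darken(img[i], factor);
    }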

Slide 15: Typical GPU Architecture Features
• Rely on having enough threads to hide the latency to memory (not caches as in CPUs)
  - Each GPU is highly multithreaded
• Use extensive parallelism to get high performance
  - Have an extensive set of SIMD instructions; moving towards multicore
• Main memory is bandwidth, not latency, driven
  - GPU DRAMs are wider and have higher bandwidth, but are typically smaller, than CPU memories
• Leaders in the marketplace (in 2008)
  - NVIDIA GeForce 8800 GTX (16 multiprocessors, each with 8 multithreaded processing units)
  - AMD's ATI Radeon and ATI FireGL
  - Watch out for Intel's Larrabee

Slide 16: Multicore Xbox 360 – "Xenon" Processor
• Goal: provide game developers with a balanced and powerful platform
  - Three SMT processors, 32KB L1 D$ & I$, 1MB UL2 cache
  - 165M transistors total
  - 3.2 GHz "near" POWER ISA
  - 2-issue, 21-stage pipeline, with 128-bit VMX registers
  - Weak branch prediction – supported by software hinting
  - In-order instruction execution
  - Narrow cores – 2 INT units, 128-bit VMX units, 1 of anything else
• An ATI-designed 500 MHz GPU with 512MB of DDR3 DRAM
  - 337M transistors, 10MB framebuffer
  - 48 pixel shader cores, each with 4 ALUs

Slide 17: Xenon Block Diagram
[Block diagram: Cores 0–2, each with L1 I$ and D$, share a 1MB UL2 cache; a BIU/IO interface connects to the GPU (3D core, 10MB EDRAM, memory controllers MC0/MC1, video out) and 512MB DRAM; an analog chip with XMA decoder and SMC connects DVD, HDD port, front USBs (2), wireless MU ports (2 USBs), rear USB (1), Ethernet, IR, audio out, flash, systems control, and video out.]

Slide 18: Next Lecture and Reminders
• Next lecture
  - Multiprocessor network topologies
    · Reading assignment – PH, Chapter …
• Reminders
  - HW6 out November 13th and due December 11th
  - Check the grade posting on-line (by your midterm exam number) for correctness
  - Second evening midterm exam scheduled
    · Tuesday, November 18, 20:15 to 22:15, Location 262 Willard
    · Please let me know ASAP (via email) if you have a conflict

