1 Prepared 6/4/2011 by T. O’Neil for 3460:677, Fall 2011, The University of Akron.

2 A quiet revolution and potential buildup. Computation: TFLOPs (GPU) vs. 100 GFLOPs (CPU). A GPU in every PC – massive volume and potential impact. [Chart: peak floating-point throughput over time for NVIDIA GPUs (NV30, NV40, G70, G80, GT200, T12) vs. Intel CPUs (3 GHz Core 2 Duo, 3 GHz Xeon Quad, Westmere).] Introduction to Massively Parallel Processing – Slide 2

3 Introduction to Massively Parallel Processing – Slide 3

4 Solve computation-intensive problems faster (make infeasible problems feasible, reduce design time). Solve larger problems in the same amount of time (improve an answer’s precision, reduce design time). Gain competitive advantage. Introduction to Massively Parallel Processing – Slide 4

5 Another factor: the modern scientific method. [Diagram: Nature → Observation → Theory, with the theory tested by both Physical Experimentation and Numerical Simulation.] Introduction to Massively Parallel Processing – Slide 5

6 A growing number of applications in science, engineering, business and medicine require computing speeds that conventional computers cannot deliver, because the calculations are too computationally intensive, the systems are too complicated to model, or the experiments are too hard, expensive, slow or dangerous for the laboratory. Parallel processing is an approach that helps make these computations feasible. Introduction to Massively Parallel Processing – Slide 6

7 One example: grand challenge problems that cannot be solved in a reasonable amount of time with today’s computers, such as modeling large DNA structures, global weather forecasting, and modeling the motion of astronomical bodies. Another example: real-time applications, which have time constraints; if the data are not processed within a certain time, the results become meaningless. Introduction to Massively Parallel Processing – Slide 7

8 The atmosphere is modeled by dividing it into 3-dimensional cells, and the calculation for each cell is repeated many times to model the passage of time. For example, suppose the whole global atmosphere is divided into cells of size 1 mile × 1 mile × 1 mile to a height of 10 miles (10 cells high) – about 5 × 10^8 cells. If each calculation requires 200 floating point operations, 10^11 floating point operations are necessary in one time step. Introduction to Massively Parallel Processing – Slide 8

9 Example continued: to forecast the weather over 10 days using 10-minute intervals, a computer operating at 100 megaflops (10^8 floating point operations per second) would take 10^7 seconds, or over 100 days. To perform the calculation in 10 minutes would require a computer operating at 1.7 teraflops (1.7 × 10^12 floating point operations per second). Introduction to Massively Parallel Processing – Slide 9
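As a quick check, the 1.7 teraflops figure follows directly from the totals quoted on the slide (the total work implied by 10^7 seconds at 10^8 flop/s, redistributed over a 10-minute window):

```latex
% Arithmetic implied by the slide's own figures
\[
  10^{8}\,\tfrac{\mathrm{flop}}{\mathrm{s}} \times 10^{7}\,\mathrm{s} \approx 10^{15}\ \mathrm{flop}
  \quad\Longrightarrow\quad
  \frac{10^{15}\ \mathrm{flop}}{600\ \mathrm{s}} \approx 1.7\times10^{12}\,\tfrac{\mathrm{flop}}{\mathrm{s}}
  = 1.7\ \mathrm{Tflops}.
\]
```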

10 Two main types: parallel computing, using a computer with more than one processor to solve a problem; and distributed computing, using more than one computer to solve a problem. The usual motive is faster computation – the very simple idea that n computers operating simultaneously can achieve a result n times faster (in practice it will not be n times faster, for various reasons). Other motives include fault tolerance, the larger amount of memory available, … Introduction to Massively Parallel Processing – Slide 10

11 “... There is therefore nothing new in the idea of parallel programming, but its application to computers. The author cannot believe that there will be any insuperable difficulty in extending it to computers. It is not to be expected that the necessary programming techniques will be worked out overnight. Much experimenting remains to be done. After all, the techniques that are commonly used in programming today were only won at considerable toil several years ago. In fact the advent of parallel programming may do something to revive the pioneering spirit in programming which seems at the present to be degenerating into a rather dull and routine occupation … ” - S. Gill, “Parallel Programming”, The Computer Journal, April 1958 Introduction to Massively Parallel Processing – Slide 11

12 Applications are typically written from scratch (or manually adapted from a sequential program) assuming a simple model, in a high-level language (e.g. C) with explicit parallel computing primitives. This means components are difficult to reuse, so similar components get coded again and again; code is difficult to develop, maintain or debug; and we are not really developing long-term software solutions. Introduction to Massively Parallel Processing – Slide 12

13 [Diagram: the memory hierarchy – registers and processors at the highest level (small, fast, expensive), then memory levels 1 through 3, down to main memory at the lowest level (large, slow, cheap).] Introduction to Massively Parallel Processing – Slide 13

14 Execution speed relies on exploiting data locality. Temporal locality: a data item just accessed is likely to be used again in the near future, so keep it in the cache. Spatial locality: neighboring data is also likely to be used soon, so load it into the cache at the same time using a ‘wide’ bus (like a multi-lane motorway). From the programmer’s point of view this is all handled automatically – good for simplicity, but maybe not for best performance. Introduction to Massively Parallel Processing – Slide 14
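As a small illustration (not from the slides) of why spatial locality matters, the two host-code loops below do the same arithmetic but traverse a row-major matrix in different orders; only the first makes good use of the cache lines that the wide bus brings in.

```cuda
// Illustration only: row-major traversal exploits spatial locality,
// because consecutive j values touch consecutive memory addresses.
float sum_row_order(const float* a, int n)        // a is an n x n matrix, row-major
{
    float s = 0.0f;
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            s += a[i * n + j];                    // stride-1 accesses: cache friendly
    return s;
}

// Same arithmetic, but column by column: each access jumps n floats ahead,
// so most of every cache line that is loaded goes unused.
float sum_column_order(const float* a, int n)
{
    float s = 0.0f;
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < n; ++i)
            s += a[i * n + j];                    // stride-n accesses: cache unfriendly
    return s;
}
```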

15 Useful work (i.e. floating point operations) can only be done at the top of the hierarchy, so data stored lower down must be transferred to the registers before we can work on it, possibly displacing other data already there. This transfer among levels is slow, and it becomes the bottleneck in most computations: we spend more time moving data than doing useful work. The result: speed depends to a large extent on problem size. Introduction to Massively Parallel Processing – Slide 15

16 Good algorithm design then requires keeping active data at the top of the hierarchy as long as possible and minimizing movement between levels. There must be enough work to do at the top of the hierarchy to mask the transfer time at the lower levels, and the more processors you have to work with, the larger the problem has to be to accomplish this. Introduction to Massively Parallel Processing – Slide 16

17 Introduction to Massively Parallel Processing – Slide 17

18 Each cycle, the CPU takes data from registers, does an operation, and puts the result back. Load/store operations (between memory and registers) also take one cycle. The CPU can do different operations each cycle, and the output of one operation can be the input to the next. CPUs haven’t been this simple for a long time! Introduction to Massively Parallel Processing – Slide 18

19 “The complexity for minimum component costs has increased at a rate of roughly a factor of two per year... Certainly over the short term this rate can be expected to continue, if not to increase. Over the longer term, the rate of increase is a bit more uncertain, although there is no reason to believe it will not remain nearly constant for at least 10 years. That means by 1975, the number of components per integrated circuit for minimum cost will be 65,000. I believe that such a large circuit can be built on a single wafer.” - Gordon E. Moore, co-founder of Intel, in Electronics Magazine, 4/19/1965 Introduction to Massively Parallel Processing – Slide 19 Photo credit: Wikipedia

20 Translation: The number of transistors on an integrated circuit doubles every 1½ - 2 years. Data credit: Wikipedia Introduction to Massively Parallel Processing – Slide 20

21 As the chip size (amount of circuitry) continues to double every 18–24 months, the speed of a basic microprocessor also doubles over that time frame. This happens because, as designers develop new tricks, manufacturers build them into the newest generation of CPUs. Introduction to Massively Parallel Processing – Slide 21

22 Instruction-level parallelism (ILP): out-of-order execution, speculation, …; vanishing opportunities in a power-constrained world. Data-level parallelism: vector units, SIMD execution, …; increasing opportunities – SSE, AVX, Cell SPE, ClearSpeed, GPU, … Thread-level parallelism: increasing opportunities – multithreading, multicore, …; Intel Core 2, AMD Phenom, Sun Niagara, STI Cell, NVIDIA Fermi, … Introduction to Massively Parallel Processing – Slide 22

23 Most processors have multiple pipelines for different tasks and can start a number of different operations each cycle. As an example, consider the Intel Core architecture. Introduction to Massively Parallel Processing – Slide 23

24 Each core in an Intel Core 2 Duo chip has a 14-stage pipeline, 3 integer units (ALUs), 1 floating-point addition unit and 1 floating-point multiplication unit (FP division is very slow), and 2 load/store units. In principle it is capable of producing 3 integer and 2 FP results per cycle. Introduction to Massively Parallel Processing – Slide 24

25 Technical challenges: a compiler to extract the best performance, reordering instructions if necessary; out-of-order CPU execution to avoid delays waiting for read/write or earlier operations; branch prediction to minimize delays due to conditional branching (loops, if-then-else); and a memory hierarchy to deliver data to the registers fast enough to feed the processor. These all limit the number of pipelines that can be used and increase chip complexity – roughly 90% of the Intel chip is devoted to control and data. Introduction to Massively Parallel Processing – Slide 25

26 [Chart: processor power and performance trends (courtesy Mark Horowitz and Kevin Skadron).] Introduction to Massively Parallel Processing – Slide 26

27 The CPU clock has been stuck at about 3 GHz since 2006 due to problems with power consumption (up to 130 W per chip), but chip circuitry is still doubling every 18–24 months: more on-chip memory and memory management units (MMUs), specialized hardware (e.g. multimedia, encryption), and multi-core (multiple CPUs on one chip). Thus the peak performance of a chip is still doubling every 18–24 months. Introduction to Massively Parallel Processing – Slide 27

28 A handful of processors, each supporting ~1 hardware thread; on-chip memory near the processors (cache, RAM or both); a shared global memory space (external DRAM). [Diagram: a few processors, each with local memory, attached to a shared global memory.] Introduction to Massively Parallel Processing – Slide 28

29 Many processors, each supporting many hardware threads; on-chip memory near the processors (cache, RAM or both); a shared global memory space (external DRAM). [Diagram: many processors, each with local memory, attached to a shared global memory.] Introduction to Massively Parallel Processing – Slide 29

30 Four 3.33 GHz cores, each of which can run 2 threads Integrated MMU Introduction to Massively Parallel Processing – Slide 30

31 Cannot continue to scale processor frequencies No 10 GHz chips Cannot continue to increase power consumption Can’t melt chips Can continue to increase transistor density As per Moore’s Law Introduction to Massively Parallel Processing – Slide 31

32 Exciting applications in the future mass-computing market have traditionally been considered “supercomputing applications”: molecular dynamics simulation, video and audio coding and manipulation, 3D imaging and visualization, consumer game physics, virtual reality products, … These “super apps” represent and model the physical, concurrent world. Various granularities of parallelism exist, but the programming model must not hinder parallel implementation, and data delivery needs careful management. Introduction to Massively Parallel Processing – Slide 32

33 Computers no longer get faster, just wider, so you must rethink your algorithms to be parallel! Data-parallel computing is the most scalable solution; otherwise you refactor code for 2 cores, 4 cores, 8 cores, 16 cores, … You will always have more data than cores – build the computation around the data. Introduction to Massively Parallel Processing – Slide 33

34 Traditional parallel architectures cover some super apps DSP, GPU, network and scientific applications, … The game is to grow mainstream architectures “out” or domain-specific architectures “in” CUDA is the latter Introduction to Massively Parallel Processing – Slide 34

35 Introduction to Massively Parallel Processing – Slide 35

36 Massive economies of scale Massively parallel Introduction to Massively Parallel Processing – Slide 36

37 Produced in vast numbers for computer graphics Increasingly being used for Computer game “physics” Video (e.g. HD video decoding) Audio (e.g. MP3 encoding) Multimedia (e.g. Adobe software) Computational finance Oil and gas Medical imaging Computational science Introduction to Massively Parallel Processing – Slide 37

38 The GPU sits on a PCIe graphics card inside a standard PC/server with one or two multicore CPUs Introduction to Massively Parallel Processing – Slide 38

39 Up to 448 cores on a single chip Simplified logic (no out-of-order execution, no branch prediction) means much more of the chip is devoted to computation Arranged as multiple units with each unit being effectively a vector unit, all cores doing the same thing at the same time Very high bandwidth (up to 140GB/s) to graphics memory (up to 4GB) Not general purpose – for parallel applications like graphics and Monte Carlo simulations Can also build big clusters out of GPUs Introduction to Massively Parallel Processing – Slide 39

40 Four major vendors: NVIDIA AMD: bought ATI several years ago IBM: co-developed Cell processor with Sony and Toshiba for Sony Playstation, but now dropped it for high-performance computing Intel: was developing “Larrabee” chip as GPU, but now aimed for high-performance computing Introduction to Massively Parallel Processing – Slide 40

41 A quiet revolution and potential buildup. Computation: TFLOPs (GPU) vs. 100 GFLOPs (CPU). A GPU in every PC – massive volume and potential impact. [Chart: peak floating-point throughput over time for NVIDIA GPUs (NV30, NV40, G70, G80, GT200, T12) vs. Intel CPUs (3 GHz Core 2 Duo, 3 GHz Xeon Quad, Westmere).] Introduction to Massively Parallel Processing – Slide 41

42 A quiet revolution and potential buildup. Bandwidth: ~10x that of CPUs. A GPU in every PC – massive volume and potential impact. [Chart: memory bandwidth over time for NVIDIA GPUs (NV30, NV40, G70, G80, GT200, T12) vs. Intel CPUs (3 GHz Core 2 Duo, 3 GHz Xeon Quad, Westmere).] Introduction to Massively Parallel Processing – Slide 42

43 High-throughput computation: GeForce GTX 280, 933 GFLOP/s. High-bandwidth memory: GeForce GTX 280, 140 GB/s. High availability to all: 180M+ CUDA-capable GPUs in the wild. Introduction to Massively Parallel Processing – Slide 43

44 [Timeline, 1995–2010: RIVA 128 (3M transistors), GeForce 256 (23M), GeForce 3 (60M), GeForce FX (125M), GeForce 8800 (681M), “Fermi” (3B).] Introduction to Massively Parallel Processing – Slide 44

45 Throughput is paramount: every pixel must be painted within the frame time. Scalability: create, run and retire lots of threads very rapidly (14.8 Gthreads/sec measured on an increment() kernel). Use multithreading to hide latency: one stalled thread is OK if 100 are ready to run. Introduction to Massively Parallel Processing – Slide 45

46 Different goals produce different designs: the GPU assumes the workload is highly parallel, while the CPU must be good at everything, parallel or not. CPU: minimize the latency experienced by one thread – big on-chip caches, sophisticated control logic. GPU: maximize the throughput of all threads – the number of threads in flight is limited by resources, so provide lots of resources (registers, bandwidth, etc.); multithreading can hide latency, so skip the big caches; share control logic across many threads. Introduction to Massively Parallel Processing – Slide 46

47 GeForce 8800 (2007) Introduction to Massively Parallel Processing – Slide 47

48 16 highly threaded streaming multiprocessors (SMs); >128 floating point units (FPUs); 367 GFLOPs peak performance (25–50 times that of current high-end microprocessors); 265 GFLOPs sustained for applications such as visual molecular dynamics (VMD); 768 MB DRAM; 86.4 GB/sec memory bandwidth; 4 GB/sec bandwidth to the CPU. Introduction to Massively Parallel Processing – Slide 48

49 Massively parallel, 128 cores, 90 watts Massively threaded, sustains 1000s of threads per application 30-100 times speedup over high-end microprocessors on scientific and media applications Medical imaging, molecular dynamics, … Introduction to Massively Parallel Processing – Slide 49

50 Fermi GF100 (2010) ~ 1.5 TFLOPs (single precision), ~800 GFLOPs (double precision) 230 GB/sec DRAM bandwidth Introduction to Massively Parallel Processing – Slide 50

51 Introduction to Massively Parallel Processing – Slide 51

52 32 CUDA cores per SM (512 total). 8x peak FP64 performance; 50% of peak FP32 performance. Direct load/store to memory: the usual linear sequence of bytes, high bandwidth (hundreds of GB/sec). 64 KB of fast on-chip RAM: software or hardware managed, shared amongst CUDA cores, enables thread communication. Introduction to Massively Parallel Processing – Slide 52

53 SIMT (Single Instruction Multiple Thread) execution Threads run in groups of 32 called warps Minimum of 32 threads all doing the same thing at (almost) the same time Threads in a warp share instruction unit (IU) All cores execute the same instructions simultaneously, but with different data Similar to vector computing on CRAY supercomputers Hardware automatically handles divergence Natural for graphics processing and much scientific computing; also a natural choice for many-core chips to simplify each core Introduction to Massively Parallel Processing – Slide 53

54 Hardware multithreading Hardware resource allocation and thread scheduling Hardware relies on threads to hide latency Threads have all resources needed to run Any warp not waiting for something can run Context switching is (basically) free Introduction to Massively Parallel Processing – Slide 54

55 Lots of active threads are the key to high performance: there is no “context switching”, and each thread has its own registers, which limits the number of active threads. Threads on each SM execute in warps; execution alternates between “active” warps, with warps becoming temporarily “inactive” when waiting for data. For each thread, one operation completes long before the next starts, which avoids the complexity of the pipeline overlaps that can limit the performance of modern processors. Memory access from device memory has a delay of 400–600 cycles; with 40 threads this is equivalent to 10–15 operations, so hopefully there’s enough computation to hide the latency. Introduction to Massively Parallel Processing – Slide 55

56 Intel Core 2 / Xeon / i7 4-6 MIMD (Multiple Instruction / Multiple Data) cores each core operates independently each can be working with a different code, performing different operations with entirely different data few registers, multilevel caches 10-30 GB/s bandwidth to main memory Introduction to Massively Parallel Processing – Slide 56

57 NVIDIA GTX 280: 240 cores, arranged as 30 units each with 8 SIMD (Single Instruction / Multiple Data) cores. All cores execute the same instruction at the same time but work on different data, so only one instruction decoder is needed to control all the cores, and each unit functions like a vector unit. Lots of registers, almost no cache. 5 GB/s bandwidth to the host processor (PCIe x16 gen 2); 140 GB/s bandwidth to graphics memory. Introduction to Massively Parallel Processing – Slide 57

58 One issue with SIMD / vector execution: what happens with if-then-else branches? Standard vector treatment: execute both branches but only store the results for the correct branch. NVIDIA treatment: works with warps of length 32. If a particular warp only goes one way, then that is all that is executed. Introduction to Massively Parallel Processing – Slide 58
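A minimal sketch of the kind of branch in question (illustrative only, not course code): if every thread in a 32-thread warp takes the same side of the inner if, only that side is executed; otherwise the hardware runs both sides and masks off the inactive lanes.

```cuda
// Hypothetical kernel illustrating warp divergence.
// Lanes of the same warp that disagree on the condition force the
// hardware to execute both branches, masking the inactive threads.
__global__ void clamp_or_sqrt(float* x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (x[i] < 0.0f)          // divergent when a warp holds mixed signs
            x[i] = 0.0f;          // branch A
        else
            x[i] = sqrtf(x[i]);   // branch B
    }
}
```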

59 How do NVIDIA GPUs deal with the same architectural challenges as CPUs? They are very heavily multi-threaded, with at least 8 threads per core. This solves the pipeline problem, because the output of one calculation can be used as an input for the next calculation of the same thread, and the hardware switches from one set of threads to another when waiting for data from graphics memory. Introduction to Massively Parallel Processing – Slide 59

60 Introduction to Massively Parallel Processing – Slide 60

61 Scalable programming model Minimal extensions to familiar C/C++ environment Heterogeneous serial-parallel computing Introduction to Massively Parallel Processing – Slide 61

62 [Chart: reported CUDA application speedups – 110–240x, 45x, 100x, 35x, 17x; overall range 13–457x.] Introduction to Massively Parallel Processing – Slide 62

63 Big breakthrough in GPU computing has been NVIDIA’s development of CUDA programming environment Initially driven by needs of computer games developers Now being driven by new markets (e.g. HD video decoding) C plus some extensions and some C++ features (e.g. templates) Introduction to Massively Parallel Processing – Slide 63

64 Host code runs on CPU, CUDA code runs on GPU Explicit movement of data across the PCIe connection Very straightforward for Monte Carlo applications, once you have a random number generator Harder for finite difference applications Introduction to Massively Parallel Processing – Slide 64

65 At the top level, we have a master process which runs on the CPU and performs the following steps:
1. initialize the card
2. allocate memory on the host and on the device
3. repeat as needed:
   a. copy data from host to device memory
   b. launch multiple copies of the execution “kernel” on the device
   c. copy data from device memory back to the host
4. de-allocate all memory and terminate
Introduction to Massively Parallel Processing – Slide 65
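For concreteness, here is a hedged sketch of what those steps commonly look like with the CUDA runtime API; the kernel, function names and sizes are illustrative, not code from the course.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel standing in for the real computation.
__global__ void my_kernel(float* d_data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_data[i] += 1.0f;               // placeholder work
}

// Master process on the CPU, following the steps on the slide.
void run(float* h_data, int n, int num_steps)
{
    size_t bytes = n * sizeof(float);

    // Step 1: with the runtime API the card is initialized implicitly
    // by the first CUDA call, so no explicit init appears here.

    // Step 2: allocate device memory (h_data is assumed already allocated on the host).
    float* d_data = nullptr;
    cudaMalloc(&d_data, bytes);

    // Step 3: repeat as needed.
    for (int step = 0; step < num_steps; ++step) {
        // 3a: copy data from host to device memory.
        cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);

        // 3b: launch many copies ("blocks") of the kernel on the device.
        int threads_per_block = 256;
        int blocks = (n + threads_per_block - 1) / threads_per_block;
        my_kernel<<<blocks, threads_per_block>>>(d_data, n);

        // 3c: copy results from device memory back to the host.
        cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);
    }

    // Step 4: de-allocate device memory.
    cudaFree(d_data);
}
```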

66 At a lower level, within the GPU: each copy of the execution kernel executes on an SM. If the number of copies exceeds the number of SMs, more than one will run at a time on each SM provided there are enough registers and shared memory, and the others will wait in a queue and execute later. All threads within one copy can access local shared memory, but they can’t see what the other copies are doing (even if they are on the same SM), and there are no guarantees on the order in which the copies will execute. Introduction to Massively Parallel Processing – Slide 66
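To make that scoping concrete, here is a hedged sketch (not course code) of threads within one block cooperating through __shared__ memory and __syncthreads(); each block reduces its own slice of the input and never sees another block's shared array.

```cuda
// Hypothetical block-level reduction. Assumes the kernel is launched
// with exactly 256 threads per block (a power of two).
__global__ void block_sums(const float* in, float* block_out, int n)
{
    __shared__ float partial[256];                // visible only within this block

    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;
    partial[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                              // wait until every thread has stored

    // Tree reduction inside the block; synchronization is block-wide only.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            partial[tid] += partial[tid + stride];
        __syncthreads();
    }

    if (tid == 0)
        block_out[blockIdx.x] = partial[0];       // one result per block
}
```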

67 Augment C/C++ with minimalist abstractions Let programmers focus on parallel algorithms, not mechanics of parallel programming language Provide straightforward mapping onto hardware Good fit to GPU architecture Maps well to multi-core CPUs too Scale to 100s of cores and 10000s of parallel threads GPU threads are lightweight – create/switch is free GPU needs 1000s of threads for full utilization Introduction to Massively Parallel Processing – Slide 67

68 Hierarchy of concurrent threads Lightweight synchronization primitives Shared memory model for cooperating threads Introduction to Massively Parallel Processing – Slide 68

69 Parallel kernels are composed of many threads, and all threads execute the same sequential program. Threads are grouped into thread blocks; threads in the same block can cooperate. Threads and blocks have unique IDs. [Diagram: a single thread t; a block b containing threads t0, t1, …, tB.] Introduction to Massively Parallel Processing – Slide 69
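A minimal, illustrative kernel showing how those IDs are typically used: every thread runs the same program and combines its block and thread indices into a unique global index.

```cuda
// Each thread computes one element; blockIdx, blockDim and threadIdx
// are the built-in IDs referred to on the slide.
__global__ void vector_add(const float* a, const float* b, float* c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // unique global index
    if (i < n)                                       // guard the partial last block
        c[i] = a[i] + b[i];
}

// Typical launch, one thread per element, e.g.:
//   int threads = 256;
//   vector_add<<<(n + threads - 1) / threads, threads>>>(d_a, d_b, d_c, n);
```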

70 CUDA virtualizes the physical hardware: a thread is a virtualized scalar processor (registers, PC, state) and a block is a virtualized multiprocessor (threads, shared memory). These are scheduled onto the physical hardware without pre-emption; threads and blocks launch and run to completion, and blocks should be independent. Introduction to Massively Parallel Processing – Slide 70

71 [Diagram: the CUDA memory model – each block has its own per-block memory, and all blocks share a global memory.] Introduction to Massively Parallel Processing – Slide 71

72 Cf. the PRAM (Parallel Random Access Machine) model: global synchronization isn’t cheap, and global memory access times are expensive. [Diagram: processors connected directly to a single global memory.] Introduction to Massively Parallel Processing – Slide 72

73 Cf. the BSP (Bulk Synchronous Parallel) model and MPI: distributed computing is a different setting. [Diagram: processors, each with private memory, connected by an interconnection network.] Introduction to Massively Parallel Processing – Slide 73

74 Multicore CPU Manycore GPU Introduction to Massively Parallel Processing – Slide 74

75 Philosophy: provide the minimal set of extensions necessary to expose the hardware’s power – function and variable qualifiers, the execution configuration, and built-in variables and functions valid in device code (sketched below). Introduction to Massively Parallel Processing – Slide 75
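Roughly what those extensions look like in practice (an illustrative sketch, not an exhaustive list; the names and sizes are made up):

```cuda
// Function qualifiers: where code runs and where it can be called from.
__global__ void scale_kernel(float* data);   // runs on the device, launched from the host
__device__ float device_helper(float x);     // runs on the device, callable from device code
__host__   float host_helper(float x);       // ordinary CPU function (the default)

// Variable qualifiers: where data lives.
__device__   float d_global_value;           // device global memory
__constant__ float d_coeffs[16];             // device constant memory

__global__ void scale_kernel(float* data)    // assumes the launch shown below (64 x 128 elements)
{
    __shared__ float tile[128];              // per-block shared memory

    // Built-in variables, valid only in device code:
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = data[i];
    __syncthreads();                         // built-in block-wide synchronization
    data[i] = tile[threadIdx.x] * d_coeffs[0];
}

// Execution configuration: <<<number of blocks, threads per block>>>, e.g.
//   scale_kernel<<<64, 128>>>(d_ptr);
```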

76 Introduction to Massively Parallel Processing – Slide 76

77 Application | Description | Source (lines) | Kernel (lines) | % time
H.264 | SPEC ’06 version, change in guess vector | 34,811 | 194 | 35%
LBM | SPEC ’06 version, change to single precision and print fewer reports | 1,481 | 285 | >99%
RC5-72 | Distributed.net RC5-72 challenge client code | 1,979 | 218 | >99%
FEM | Finite element modeling, simulation of 3D graded materials | 1,874 | 146 | 99%
RPES | Rye Polynomial Equation Solver, quantum chemistry, 2-electron repulsion | 1,104 | 281 | 99%
PNS | Petri Net simulation of a distributed system | 322 | 160 | >99%
SAXPY | Single-precision implementation of saxpy, used in Linpack’s Gaussian elimination routine | 952 | 31 | >99%
TPACF | Two Point Angular Correlation Function | 536 | 98 | 96%
FDTD | Finite-Difference Time Domain analysis of 2D electromagnetic wave propagation | 1,365 | 93 | 16%
MRI-Q | Computing a matrix Q, a scanner’s configuration in MRI reconstruction | 490 | 33 | >99%
Introduction to Massively Parallel Processing – Slide 77
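Since SAXPY appears in the table, here is a generic sketch of a CUDA SAXPY kernel for orientation; it is not the 31-line kernel that was actually measured.

```cuda
// y = a*x + y, one thread per element.
__global__ void saxpy(int n, float a, const float* x, float* y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}

// Launch, e.g.:  saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, d_x, d_y);
```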

78 GeForce 8800 GTX vs. 2.2 GHz Opteron 248 Introduction to Massively Parallel Processing – Slide 78

79 10x speedup in a kernel is typical, as long as the kernel can occupy enough parallel threads 25x to 400x speedup if the function’s data requirements and control flow suit the GPU and the application is optimized Introduction to Massively Parallel Processing – Slide 79

80 Parallel hardware is here to stay GPUs are massively parallel manycore processors Easily available and fully programmable Parallelism and scalability are crucial for success This presents many important research challenges Not to mention the educational challenges Introduction to Massively Parallel Processing – Slide 80

81 Reading: Chapter 1, “Programming Massively Parallel Processors” by Kirk and Hwu. Based on original material from The University of Akron: Tim O’Neil, Kathy Liszka Hiram College: Irena Lomonosov The University of Illinois at Urbana-Champaign David Kirk, Wen-mei W. Hwu The University of North Carolina at Charlotte Barry Wilkinson, Michael Allen Oregon State University: Michael Quinn Oxford University: Mike Giles Stanford University: Jared Hoberock, David Tarjan Revision history: last updated 6/4/2011. Introduction to Massively Parallel Processing – Slide 81

