6 Course Objectives
- Understand fundamental concepts and programming principles for the development of high-performance applications
- Be able to program a range of parallel computers: PC clusters, supercomputers
- Make efficient use of high-performance parallel computing in your own research
9 What & Why
What is high performance computing (HPC)? The use of the most efficient algorithms on computers capable of the highest performance to solve the most demanding problems.
Why HPC?
- Large problems, spatially and/or temporally: a 10,000 x 10,000 x 10,000 grid has 10^12 grid points; with 4 double variables per point, that is 4x10^12 doubles, or 32x10^12 bytes = 32 Terabytes. Simulations usually need tens of millions of time steps.
- On-demand/urgent computing; real-time computing
- Weather forecasting; protein folding; turbulence simulations/CFD; aerospace structures; full-body simulation/digital human ...
10 HPC Examples: Blood Flow in Human Vascular Network
- Cardiovascular disease accounts for about 50% of deaths in the western world
- Formation of arterial disease is strongly correlated with blood flow patterns
- Computational challenge: enormous problem size. In one minute, the heart pumps the entire blood supply of 5 quarts through 60,000 miles of vessels, a quarter of the distance between the earth and the moon
- Computational challenge: blood flow involves multiple scales
11 HPC Examples
- Earthquake simulation: surface velocity 75 sec after the earthquake
- Flu pandemic simulation: 300 million people tracked; density of infected population, 45 days after outbreak
12 HPC Example: Homogeneous Turbulence
Direct Numerical Simulation of homogeneous turbulence on a 4096^3 grid; zoom-in of vorticity iso-surface
13 How HPC Fits into Scientific Computing
Physical processes (e.g. air flow around an airplane) -> mathematical models (Navier-Stokes equations) -> numerical solutions (algorithms, BCs, solvers, application codes, supercomputers; this is where HPC enters) -> data visualization, validation, physical insight (viz software)
14 Performance Metrics
FLOPS, or FLOP/S: FLoating-point Operations Per Second
- MFLOPS: MegaFLOPS, 10^6 FLOPS
- GFLOPS: GigaFLOPS, 10^9 FLOPS; a home PC
- TFLOPS: TeraFLOPS, 10^12 FLOPS; present-day supercomputers (www.top500.org)
- PFLOPS: PetaFLOPS, 10^15 FLOPS; expected by 2011
- EFLOPS: ExaFLOPS, 10^18 FLOPS; expected by 2020
MIPS = Million Instructions Per Second = MegaHertz (if 1 instruction per cycle)
Note: MIPS is the natural metric for the von Neumann computer.
15 Performance Metrics
- Theoretical peak performance R_theor: maximum FLOPS a machine can reach in theory. R_theor = clock_rate x no_cpus x no_FPUs_per_CPU. Example: 3 GHz, 2 CPUs, 1 FPU/CPU gives R_theor = 3x10^9 x 2 x 1 = 6 GFLOPS
- Real performance R_real: FLOPS for specific operations, e.g. vector multiplication
- Sustained performance R_sustained: performance on a full application, e.g. CFD
R_sustained << R_real << R_theor; it is not uncommon that R_sustained < 10% of R_theor.
16 Top 10 Supercomputers, November 2007 (LINPACK performance; table lists R_real and R_theor for each machine)
17 Number of Processors
18 Fastest Supercomputers
(Chart: performance of the fastest supercomputers at present, with projections; the Japanese Earth Simulator and my laptop are marked for comparison.)
(Chart: Top500 performance growth in FLOPS, from 1 KFLOPS to 1 PFLOPS, spanning EDSAC 1, UNIVAC 1, IBM 7090, CDC 6600, IBM 360/195, CDC 7600, Cray 1, Cray X-MP, Cray 2, TMC CM-2, TMC CM-5, Cray T3D, ASCI Red, ASCI White Pacific, and IBM BG/L at 131 TFLOPS; architectural eras labeled scalar, super scalar, vector, parallel, and super scalar/vector/parallel. 2X transistors per chip every 1.5 years: a growth factor of a billion in performance in a career.)
Japanese "Life Simulator" Effort for a 10 PFLOPS System
From the Nikkei newspaper, May 30th morning edition. A collaboration of industry, academia and government organized by NEC, Hitachi, U of Tokyo, Kyusyu U, and RIKEN, with a competition component similar to the DARPA HPCS program. This year about $4M was allocated to each participant for advanced development towards petascale. A total of ¥100,000M ($909M) will be invested in this development. Planned to be operational in 2011.
Japan's Life Simulator: original concept design in 2005
The needs of multi-scale, multi-physics simulation with multiple computation components drive the integration of multiple architectures into a tightly-coupled heterogeneous computer. (Diagram: present design has vector, scalar and MD nodes on a faster interconnect, joined by a slower connection and switch; the proposed architecture adds an FPGA node and faster interconnects throughout.)
Major Applications of Next Generation Supercomputer Targeted as grand challenges
Basic Concept for Simulations in Nano-Science
Basic Concept for Simulations in Life Sciences
Scales from micro to meso to macro: genes (genome), protein (bio-MD, chemical process), cell, tissue (tissue structure, multi-physics), organ, vascular system (blood circulation), organism. Applications: DDS, gene therapy, HIFU, micro-machine, catheter. (RIKEN)
25 Petascale Era: NCSA Blue Waters, 1 PFLOPS, 2011
26 Bell versus Moore
27 Grand Challenge Applications
28 The von Neumann Computer
Walk-through: c = a + b
1. Get next instruction
2. Decode: fetch a
3. Fetch a to internal register
4. Get next instruction
5. Decode: fetch b
6. Fetch b to internal register
7. Get next instruction
8. Decode: add a and b (c in register)
9. Do the addition in the ALU
10. Get next instruction
11. Decode: store c in main memory
12. Move c from internal register to main memory
Note: some units are idle while others are working, a waste of cycles. Pipelining (modularization) & caching (advance decoding) introduce parallelism.
30 Computer Performance
The CPU operates on data; if no data is available, the CPU has to wait and performance degrades.
- Typical workstation: 3.2 GHz CPU, 667 MHz memory; the memory is about 5 times slower
- Moore's law: CPU speed doubles every 18 months, while memory speed increases much more slowly; a fast CPU requires sufficiently fast memory
- Rule of thumb: memory size in GB = R_theor in GFLOPS. 1 CPU cycle (1 FLOP) handles 1 byte of data, so 1 MFLOPS needs 1 MB of memory and 1 GFLOPS needs 1 GB
Many "tricks" designed for performance improvement target the memory.
31 CPU Performance
Computer time is measured in CPU cycles; the minimum time to execute 1 instruction is 1 CPU cycle.
Time to execute a given program: t = n_c * t_c = n_i * CPI * t_c, where
- n_c: total number of CPU cycles
- n_i: total number of instructions
- CPI = n_c/n_i, average cycles per instruction
- t_c: cycle time; at 1 GHz, t_c = 1/(10^9 Hz) = 10^(-9) s = 1 ns
32 To Make a Program/Computer Faster...
- Reduce cycle time t_c: increase the clock frequency; however, there is a physical limit. In 1 ns light travels 30 cm; at 3 GHz, light travels only 10 cm within 1 CPU cycle, so the length/size of the machine must be < 10 cm (and 1 atom is about 0.2 nm). Current clocks are a few GHz.
- Reduce the number of instructions n_i: more efficient algorithms; better compilers
- Reduce CPI: the key is parallelism
  - Instruction-level parallelism: pipelining technology
  - Internal parallelism: multiple functional units; superscalar processors; multi-core processors
  - External parallelism: multiple CPUs, parallel machines
33 Processor Types
- Vector processors: Cray X1/T90; NEC SX series; Japan Earth Simulator; early Cray machines; Japan Life Simulator (hybrid)
- Scalar processors:
  - CISC (Complex Instruction Set Computer): Intel 80x86 (IA32)
  - RISC (Reduced Instruction Set Computer): Sun SPARC, IBM Power series, SGI MIPS
- VLIW (Very Long Instruction Word) / Explicitly Parallel Instruction Computing (EPIC), probably dying: Intel IA64 (Itanium)
34 CISC Processor
- Complex instructions; a large number of instructions; can complete more complicated functions at the instruction level
- An instruction actually invokes microcode; microcodes are small programs in processor memory
- Slower: many instructions access memory, and the varying instruction length allows no pipelining
35 RISC Processor
- No microcode; simple instructions; fewer instructions; fast
- Only load and store instructions access memory
- Common instruction word length, which allows pipelining
Almost all present-day high performance computers use RISC processors.
36 Locality of References
Spatial/temporal locality:
- If the processor executes an instruction at time t, it is likely to execute an adjacent/next instruction at t + delta_t
- If the processor accesses a memory location/data item x at time t, it is likely to access an adjacent memory location/data item x + delta_x at t + delta_t
Pipelining, caching and many other techniques are all based on the locality of references.
37 Pipelining
Overlapping execution of multiple instructions, approaching 1 instruction per cycle:
- Sub-divide each instruction into multiple stages
- The processor handles different stages of adjacent instructions simultaneously
Suppose 4 stages in an instruction:
- Instruction fetch and decode (IF)
- Read data (RD)
- Execute (EX)
- Write back results (WB)
38 Instruction Pipeline
(Diagram: each cycle a new instruction enters the pipeline, so at any moment up to 4 instructions are in flight, each at a different stage: IF, RD, EX, WB.)
Depth of pipeline: the number of stages in an instruction. After the pipeline is full, 1 result per cycle!
CPI = (n + depth - 1)/n
With the pipeline, 7 instructions take 7 + 4 - 1 = 10 cycles; without it, they take 7 x 4 = 28 cycles.
39 Inhibitors of Pipelining
Dependencies between instructions interrupt the pipeline, degrading performance:
- Control dependence
- Data dependence
40 Control Dependence
Branching: when an instruction occurs after a conditional branch, it is unknown beforehand whether that instruction will be executed.
Examples: a loop, for(i=0;i<n;i++){...}, or a conditional, if(x>y) n=5;
Branching in programs interrupts the pipeline and degrades performance. Avoid excessive branching!
41 Data Dependence
When an instruction depends on data from a previous instruction:
x = 3*j;
y = x + 5.0; // depends on the previous instruction
42 Vector Pipeline
Vector processors have vector registers which can hold an entire vector, e.g. of 128 elements (commonly encountered processors, e.g. in a home PC, are scalar processors). Efficient for loops involving vectors:
for (i=0;i<128;i++) z[i] = x[i] + y[i];
Instructions:
Vector Load X(1:128)
Vector Load Y(1:128)
Vector Add Z=X+Y
Vector Store Z
43 Vector Pipeline
(Diagram: each vector instruction streams its 128 elements through the pipeline one per cycle; the loads of X and Y, the add, and the store of Z overlap, so the whole operation completes in about 133 cycles rather than 4 x 128.)
45 Exceeding Cache Size
Cache: 32 KB; cache line: 64 bytes.
Note: performance drops to an asymptotic 5 MFLOPS, one result every 15 clocks: the time to reload a cache line following a miss.
46 Internal Parallelism
Functional units: components in the processor that actually do the work:
- Memory operations (MU): load, store
- Integer arithmetic (IU): integer add, bit shift, ...
- Floating point arithmetic (FPU): floating-point add, multiply, ...
Typical instruction latencies:
Instruction type          Latency (cycles)
Integer add                      1
Floating-point add               3
Floating-point multiply          3
Floating-point divide           31
Division is much slower than add/multiply: minimize or avoid divisions!
47 Internal Parallelism
Superscalar RISC processors: multiple functional units in the processor, e.g. multiple FPUs; capable of executing more than one instruction (producing more than one result) per cycle. The units share registers, L1 cache, etc.
Need faster memory access to provide data to multiple functional units. Limiting factor: memory-processor bandwidth.
48 Internal Parallelism
Multi-core processors (Intel dual-core, quad-core):
- Multiple execution cores on one chip, each with its own functional units, registers and L1 cache
- Cores share the L2 cache and memory
- Lower energy consumption
- Need FAST memory access to provide data to multiple cores; effective memory bandwidth per core is reduced
Limiting factor: memory-processor bandwidth. (Diagram: CPU chip with per-core functional units + L1 cache and a shared L2 cache between cores.)
49 Heat Flux also Increases with Speed!
50 New Processors are Too Hot!
52 Your Next PC?
53 External Parallelism Parallel machines: Will be discussed later
54 Memory: Next Lecture
Bit: 0 or 1; byte: 8 bits.
Memory sizes: PB = 10^15 bytes; TB = 10^12 bytes; GB = 10^9 bytes; MB = 10^6 bytes.
Memory performance measures:
- Access time (response time, latency): the interval between the time a memory request is issued (t0) and the time the request is satisfied (t1); access time = t1 - t0
- Cycle time: the minimum time between two successive memory requests; if a request is issued at t0, the next can be issued no earlier than t2; cycle time = t2 - t0. The memory is busy for t0 < t < t2, so a request arriving in that interval must wait.