# Parallel Scientific Computing: Algorithms and Tools, Lecture #1
APMA 2821A, Spring 2008. Instructors: George Em Karniadakis, Leopold Grinberg


2 Logistics
- Contact:
  - Office hours: GK: M 2-4 pm; LG: W 2-4 pm
  - Email: {gk,lgrinb}@dam.brown.edu
  - Web: www.cfm.brown.edu/people/gk/APMA2821A
- Textbook: Karniadakis & Kirby, “Parallel Scientific Computing in C++/MPI”
- Other books:
  - Shonkwiler & Lefton, “Parallel and Vector Scientific Computing”
  - Wadleigh & Crawford, “Software Optimization for High Performance Computing”
  - Foster, “Designing and Building Parallel Programs” (available online)

3 Logistics
- CCV accounts: email Sharon_King@brown.edu
- Prerequisite: C/Fortran programming
- Grading:
  - 5 assignments/mini-projects: 50%
  - 1 final project/presentation: 50%

4 History

6 Course Objectives
- Understand fundamental concepts and programming principles for developing high-performance applications
- Be able to program a range of parallel computers: PCs, clusters, supercomputers
- Make efficient use of high-performance parallel computing in your own research

7 Course Objectives

8 Content Overview
- Parallel computer architecture: 2-3 weeks
  - CPU, memory; shared-/distributed-memory parallel machines; network connections
- Parallel programming: 5 weeks
  - MPI; OpenMP; UPC
- Parallel numerical algorithms: 4 weeks
  - Matrix algorithms; direct/iterative solvers; eigensolvers; Monte Carlo methods (simulated annealing, genetic algorithms)
- Grid computing: 1 week
  - Globus, MPICH-G2

9 What & Why
- What is high performance computing (HPC)? The use of the most efficient algorithms on computers capable of the highest performance to solve the most demanding problems.
- Why HPC?
  - Large problems, spatially and/or temporally: a 10,000 x 10,000 x 10,000 grid has 10^12 grid points; with 4 double variables per point, that is 4x10^12 doubles = 32x10^12 bytes = 32 terabytes. Such simulations usually need tens of millions of time steps.
  - On-demand/urgent computing; real-time computing
  - Weather forecasting; protein folding; turbulence simulations/CFD; aerospace structures; full-body simulation/digital human ...

10 HPC Examples: Blood Flow in the Human Vascular Network
- Cardiovascular disease accounts for about 50% of deaths in the western world
- Formation of arterial disease is strongly correlated with blood flow patterns
- Computational challenges:
  - Enormous problem size: in one minute, the heart pumps the entire blood supply of 5 quarts through 60,000 miles of vessels, a quarter of the distance between the earth and the moon
  - Blood flow involves multiple scales

11 HPC Examples
- Earthquake simulation: surface velocity 75 seconds after the earthquake
- Flu pandemic simulation: 300 million people tracked; density of infected population 45 days after the outbreak

12 HPC Example: Homogeneous Turbulence
Direct numerical simulation of homogeneous turbulence on a 4096^3 grid; zoom-in of a vorticity iso-surface.

13 How HPC Fits into Scientific Computing
Physical processes (e.g. air flow around an airplane) -> mathematical models (Navier-Stokes equations) -> numerical solutions (algorithms, BCs, solvers, application codes, supercomputers) -> data (visualization software). HPC enters at the numerical-solution stage; the results feed back through visualization and validation into physical insight.

14 Performance Metrics
- FLOPS, or FLOP/S: FLoating-point Operations Per Second
  - MFLOPS: MegaFLOPS, 10^6 FLOPS
  - GFLOPS: GigaFLOPS, 10^9 FLOPS (home PC)
  - TFLOPS: TeraFLOPS, 10^12 FLOPS (present-day supercomputers, www.top500.org)
  - PFLOPS: PetaFLOPS, 10^15 FLOPS (expected by 2011)
  - EFLOPS: ExaFLOPS, 10^18 FLOPS (expected by 2020)
- MIPS: Millions of Instructions Per Second; equals the clock rate in MHz if the machine executes 1 instruction per cycle
- Note: the original von Neumann computer ran at about 0.00083 MIPS

15 Performance Metrics
- Theoretical peak performance R_theor: the maximum FLOPS a machine can reach in theory:
  R_theor = clock_rate x no_cpus x no_FPUs_per_CPU
  Example: 3 GHz, 2 CPUs, 1 FPU/CPU -> R_theor = 3x10^9 x 2 = 6 GFLOPS
- Real performance R_real: FLOPS achieved on specific operations, e.g. vector multiplication
- Sustained performance R_sustained: performance on a full application, e.g. a CFD code
- In general R_sustained << R_real << R_theor; it is not uncommon that R_sustained < 10% of R_theor

16 Top 10 Supercomputers
(Table from www.top500.org, November 2007: LINPACK performance, R_real vs R_theor.)

17 Number of Processors

18 Fastest Supercomputers
(Chart from www.top500.org: measured performance at present and projections, marking the Japanese Earth Simulator and, for scale, a laptop.)

A Growth-Factor of a Billion in Performance in a Career
(Chart: 2x transistors/chip every 1.5 years. Machine eras along the curve: scalar (EDSAC 1, UNIVAC 1, IBM 7090), super scalar (CDC 6600, IBM 360/195, CDC 7600), vector (Cray 1, Cray X-MP, Cray 2), parallel (TMC CM-2, TMC CM-5, Cray T3D, ASCI Red, ASCI White Pacific, IBM BG/L), through super scalar/vector/parallel.)

| Year | Performance (floating-point operations / second) |
|------|--------------------------------------------------|
| 1941 | 1 |
| 1945 | 100 |
| 1949 | 1,000 (1 KFlop/s) |
| 1951 | 10,000 |
| 1961 | 100,000 |
| 1964 | 1,000,000 (1 MFlop/s) |
| 1968 | 10,000,000 |
| 1975 | 100,000,000 |
| 1987 | 1,000,000,000 (1 GFlop/s) |
| 1992 | 10,000,000,000 |
| 1993 | 100,000,000,000 |
| 1997 | 1,000,000,000,000 (1 TFlop/s) |
| 2000 | 10,000,000,000,000 |
| 2005 | 131,000,000,000,000 (131 TFlop/s) |

Japanese “Life Simulator” Effort for a 10 Pflop/s System
- From the Nikkei newspaper, May 30th morning edition
- Collaboration of industry, academia and government, organized by NEC, Hitachi, U. of Tokyo, Kyushu U., and RIKEN
- Competition component similar to the DARPA HPCS program
- This year about $4M was allocated to each participant for advanced development toward petascale
- A total of ¥100,000M (about $909M) will be invested in this development
- Planned to be operational in 2011

Japan’s Life Simulator: Original Concept Design in 2005
(Diagram: the need for multi-scale, multi-physics simulation with multiple computation components motivates integrating multiple architectures into one tightly-coupled heterogeneous computer. Present systems connect vector, scalar, and MD nodes through a switch over a slower connection; the proposed architecture joins vector, scalar, FPGA, and MD nodes by a faster interconnect.)

Major Applications of the Next-Generation Supercomputer, targeted as grand challenges

Basic Concept for Simulations in Nano-Science

Basic Concept for Simulations in Life Sciences
(Diagram: scales from micro to meso to macro: genome, protein, cell, tissue, organ, organism, vascular system; simulation approaches at each level: genome analysis, bio-MD, chemical processes, tissue structure, multi-physics, blood circulation; applications include drug delivery systems (DDS), gene therapy, HIFU, micro-machines, and catheters. Image sources: http://ridge.icu.ac.jp, RIKEN.)

25 Petascale Era: 2008-
NCSA Blue Waters: 1 PFlop/s, 2011

26 Bell versus Moore

27 Grand Challenge Applications

28 The von Neumann Computer
Walk-through of c = a + b:
1. Get next instruction
2. Decode: fetch a
3. Fetch a to internal register
4. Get next instruction
5. Decode: fetch b
6. Fetch b to internal register
7. Get next instruction
8. Decode: add a and b (c in register)
9. Do the addition in the ALU
10. Get next instruction
11. Decode: store c in main memory
12. Move c from internal register to main memory

Note: some units are idle while others are working, wasting cycles. Pipelining (modularization) and caching (advance decoding) introduce parallelism.

29 Basic Architecture
- CPU, pipelining
- Memory hierarchy, cache

30 Computer Performance
- The CPU operates on data; if no data is available, the CPU has to wait and performance degrades
  - Typical workstation: 3.2 GHz CPU, 667 MHz memory; memory is about 5 times slower
- Moore’s law: CPU speed doubles every 18 months; memory speed increases much more slowly
- A fast CPU requires sufficiently fast memory
- Rule of thumb: memory size in GB = R_theor in GFLOPS
  - 1 CPU cycle (1 floating-point operation) handles 1 byte of data
  - 1 MFLOPS needs 1 MB of data/memory; 1 GFLOPS needs 1 GB
- Many “tricks” designed for performance improvement target the memory

31 CPU Performance
- Computer time is measured in CPU cycles; the minimum time to execute one instruction is 1 cycle
- Time to execute a given program: t = n_c x t_c = n_i x CPI x t_c, where
  - n_c: total number of CPU cycles
  - n_i: total number of instructions
  - CPI = n_c/n_i: average cycles per instruction
  - t_c: cycle time; at 1 GHz, t_c = 1/(10^9 Hz) = 10^-9 s = 1 ns

32 To Make a Program/Computer Faster...
- Reduce the cycle time t_c: increase the clock frequency; however, there is a physical limit
  - In 1 ns, light travels 30 cm; at 3 GHz, light travels only 10 cm within one CPU cycle, so the processor’s length/size must be < 10 cm (for comparison, 1 atom is about 0.2 nm)
- Reduce the number of instructions n_i: more efficient algorithms; better compilers
- Reduce CPI: the key is parallelism
  - Instruction-level parallelism: pipelining
  - Internal parallelism: multiple functional units; superscalar processors; multi-core processors
  - External parallelism: multiple CPUs, parallel machines

33 Processor Types
- Vector processors: Cray X1/T90; NEC SX series; Japan’s Earth Simulator; early Cray machines; Japan’s Life Simulator (hybrid)
- Scalar processors
  - CISC (Complex Instruction Set Computer): Intel 80x86 (IA32)
  - RISC (Reduced Instruction Set Computer): Sun SPARC, IBM Power series, SGI MIPS
  - VLIW (Very Long Instruction Word) / Explicitly Parallel Instruction Computing (EPIC), probably dying: Intel IA64 (Itanium)

34 CISC Processors
- Complex instructions; a large number of instructions; can complete more complicated functions at the instruction level
- An instruction actually invokes microcode: small programs stored in processor memory
- Slower: many instructions access memory; varying instruction lengths allow no pipelining

35 RISC Processors
- No microcode
- Simple instructions; fewer instructions; fast
- Only load and store instructions access memory
- Common instruction word length, which allows pipelining

Almost all present-day high performance computers use RISC processors.

36 Locality of References
- Spatial/temporal locality:
  - If the processor executes an instruction at time t, it is likely to execute an adjacent/next instruction at t + delta_t
  - If the processor accesses memory location/data item x at time t, it is likely to access an adjacent location/data item x + delta_x at t + delta_t

Pipelining, caching and many other techniques are all based on the locality of references.

37 Pipelining
- Overlapping execution of multiple instructions, achieving 1 instruction per cycle
- Sub-divide each instruction into multiple stages; the processor handles different stages of adjacent instructions simultaneously
- Suppose an instruction has 4 stages:
  - Instruction fetch and decode (IF)
  - Read data (RD)
  - Execute (EX)
  - Write back results (WB)

38 Instruction Pipeline

| Instruction | 1  | 2  | 3  | 4  | 5  | 6  | 7  | 8  | 9  | 10 |
|-------------|----|----|----|----|----|----|----|----|----|----|
| 1           | IF | RD | EX | WB |    |    |    |    |    |    |
| 2           |    | IF | RD | EX | WB |    |    |    |    |    |
| 3           |    |    | IF | RD | EX | WB |    |    |    |    |
| 4           |    |    |    | IF | RD | EX | WB |    |    |    |
| 5           |    |    |    |    | IF | RD | EX | WB |    |    |
| 6           |    |    |    |    |    | IF | RD | EX | WB |    |
| 7           |    |    |    |    |    |    | IF | RD | EX | WB |

- Depth of pipeline: the number of stages in an instruction
- After the pipeline is full, 1 result per cycle! CPI = (n + depth - 1)/n
- With the pipeline, 7 instructions take 10 cycles; without it, they would take 28 cycles

39 Inhibitors of Pipelining
- Dependencies between instructions interrupt pipelining, degrading performance:
  - Control dependence
  - Data dependence

40 Control Dependence
- Branching: when an instruction follows a conditional branch, it is unknown beforehand whether that instruction will be executed
  - Loop: for (i=0; i<n; i++) { ... }
  - Conditional: if (x > y) n = 5;
- Branching in programs interrupts the pipeline and degrades performance. Avoid excessive branching!

41 Data Dependence
- When an instruction depends on data from a previous instruction:
  x = 3*j;
  y = x + 5.0; // depends on the previous instruction

42 Vector Pipeline
- Vector processors have vector registers which can hold a whole vector, e.g. of 128 elements; commonly encountered processors (e.g. a home PC) are scalar processors
- Efficient for loops involving vectors:
  for (i=0; i<128; i++) z[i] = x[i] + y[i];
  Instructions: Vector Load X(1:128); Vector Load Y(1:128); Vector Add Z=X+Y; Vector Store Z

43 Vector Pipeline
(Diagram: the four vector instructions Load X(1:128), Load Y(1:128), Add Z=X+Y, and Store Z each stream their 128 elements through the pipeline, overlapping in time; the whole sequence completes in about 133 cycles.)

44 Vector Operations: Hockney’s Formulas
(Plot; cache: 64 KB.)

45 Exceeding Cache Size
(Plot; cache: 32 KB, cache line: 64 bytes.)
Note the asymptote at 5 MFLOPS: one result every 15 clocks, the time to reload a cache line following a miss.

46 Internal Parallelism
- Functional units: the components in a processor that actually do the work
  - Memory operations (MU): load, store
  - Integer arithmetic (IU): integer add, bit shift ...
  - Floating-point arithmetic (FPU): floating-point add, multiply ...
- Typical instruction latencies:

| Instruction type        | Latency (cycles) |
|-------------------------|------------------|
| Integer add             | 1                |
| Floating-point add      | 3                |
| Floating-point multiply | 3                |
| Floating-point divide   | 31               |

Division is much slower than add/multiply: minimize or avoid divisions!

47 Internal Parallelism
- Superscalar RISC processors: multiple functional units per processor, e.g. multiple FPUs
  - Capable of executing more than one instruction (producing more than one result) per cycle
  - Shared registers, L1 cache, etc.
- Need faster memory access to provide data to multiple functional units; the limiting factor is memory-processor bandwidth

48 Internal Parallelism
- Multi-core processors: e.g. Intel dual-core, quad-core
  - Multiple execution cores on one chip, each with its own functional units, registers, and L1 cache
  - Cores share the L2 cache and memory
  - Lower energy consumption
- Need FAST memory access to provide data to multiple cores: the effective memory bandwidth per core is reduced, so the limiting factor is again memory-processor bandwidth

49 Heat Flux also Increases with Speed!

50 New Processors are Too Hot!
