1 Parallel Scientific Computing: Algorithms and Tools Lecture #1 APMA 2821A, Spring 2008 Instructors: George Em Karniadakis Leopold Grinberg.


1 Parallel Scientific Computing: Algorithms and Tools. Lecture #1, APMA 2821A, Spring 2008. Instructors: George Em Karniadakis, Leopold Grinberg

2 Logistics
- Contact:
- Office hours: GK: M 2-4 pm; LG: W 2-4 pm
- Web:
- Textbook: Karniadakis & Kirby, "Parallel Scientific Computing in C++/MPI"
- Other books:
  - Shonkwiler & Lefton, "Parallel and Vector Scientific Computing"
  - Wadleigh & Crawford, "Software Optimization for High Performance Computing"
  - Foster, "Designing and Building Parallel Programs" (available online)

3 Logistics
- CCV accounts:
- Prerequisite: C/Fortran programming
- Grading:
  - 5 assignments/mini-projects: 50%
  - 1 final project/presentation: 50%

4 History


6 Course Objectives
- Understand fundamental concepts and programming principles for developing high-performance applications
- Be able to program a range of parallel computers, from PCs to clusters to supercomputers
- Make efficient use of high-performance parallel computing in your own research

7 Course Objectives

8 Content Overview
- Parallel computer architecture: 2-3 weeks
  - CPU, memory; shared-/distributed-memory parallel machines; network connections
- Parallel programming: 5 weeks
  - MPI; OpenMP; UPC
- Parallel numerical algorithms: 4 weeks
  - Matrix algorithms; direct/iterative solvers; eigensolvers; Monte Carlo methods (simulated annealing, genetic algorithms)
- Grid computing: 1 week
  - Globus, MPICH-G2

9 What & Why
- What is high performance computing (HPC)?
  - The use of the most efficient algorithms on computers capable of the highest performance to solve the most demanding problems.
- Why HPC?
  - Large problems, spatially and temporally: a 10,000 x 10,000 x 10,000 grid has 10^12 grid points; with 4 double-precision variables per point, that is 4x10^12 doubles, or 32x10^12 bytes = 32 terabytes. Simulations usually also need tens of millions of time steps.
  - On-demand/urgent computing; real-time computing
  - Weather forecasting; protein folding; turbulence simulations/CFD; aerospace structures; full-body simulation/digital human ...

10 HPC Examples: Blood Flow in the Human Vascular Network
- Cardiovascular disease accounts for about 50% of deaths in the western world
- Formation of arterial disease is strongly correlated with blood flow patterns
- Computational challenges:
  - Enormous problem size: in one minute, the heart pumps the entire blood supply of 5 quarts through 60,000 miles of vessels, a quarter of the distance between the earth and the moon
  - Blood flow involves multiple scales

11 HPC Examples
- Earthquake simulation: surface velocity 75 seconds after the earthquake
- Flu pandemic simulation: 300 million people tracked; density of infected population 45 days after outbreak

12 HPC Example: Homogeneous Turbulence
Direct numerical simulation of homogeneous turbulence on a 4096^3 grid; zoom-in on a vorticity iso-surface.

13 How HPC Fits into Scientific Computing
Physical processes -> mathematical models -> numerical solutions -> data visualization, validation, physical insight.
Example: air flow around an airplane -> Navier-Stokes equations -> algorithms, boundary conditions, solvers, application codes, supercomputers -> visualization software. HPC enters at the numerical-solution stage.

14 Performance Metrics
- FLOPS (FLOP/S): FLoating-point Operations Per Second
- MFLOPS: MegaFLOPS, 10^6 FLOPS
- GFLOPS: GigaFLOPS, 10^9 FLOPS (a home PC)
- TFLOPS: TeraFLOPS, 10^12 FLOPS (present-day supercomputers)
- PFLOPS: PetaFLOPS, 10^15 FLOPS (expected by 2011)
- EFLOPS: ExaFLOPS, 10^18 FLOPS (expected by 2020)
- MIPS = Mega Instructions Per Second = megahertz, if one instruction completes per cycle. Note: the von Neumann computer is measured in MIPS.

15 Performance Metrics
- Theoretical peak performance R_theor: the maximum FLOPS a machine can reach in theory.
  - R_theor = clock_rate * no_cpus * no_FPUs_per_CPU
  - Example: 3 GHz, 2 CPUs, 1 FPU/CPU -> R_theor = 3x10^9 * 2 = 6 GFLOPS
- Real performance R_real: FLOPS for specific operations, e.g. vector multiplication
- Sustained performance R_sustained: performance on a full application, e.g. CFD
- R_sustained << R_real << R_theor; it is not uncommon that R_sustained < 10% of R_theor

16 Top 10 Supercomputers
November 2007 LINPACK performance: R_real vs. R_theor.

17 Number of Processors

18 Fastest Supercomputers
Performance over time: present machines and projections, with the Japanese Earth Simulator and a laptop marked for comparison.

19 A Growth-Factor of a Billion in Performance in a Career
Chart: peak performance (FLOP/s) from 1 KFlop/s (EDSAC 1, UNIVAC 1) through 1 MFlop/s (IBM 7090), 1 GFlop/s (CDC 6600, CDC 7600, IBM 360/195, Cray 1, Cray X-MP, Cray 2), 1 TFlop/s (TMC CM-2, TMC CM-5, Cray T3D, ASCI Red), to 1 PFlop/s (ASCI White Pacific, IBM BG/L at 131 TFlop/s), spanning scalar, vector, super scalar, and super scalar/vector/parallel designs. Transistors per chip double every 1.5 years.

20 Japanese "Life Simulator": Effort for a 10 PFlop/s System
- From the Nikkei newspaper, May 30th morning edition
- Collaboration of industry, academia and government, organized by NEC, Hitachi, U. of Tokyo, Kyushu U., and RIKEN
- Competition component similar to the DARPA HPCS program
- This year about $4M was allocated to each partner for advanced development toward petascale
- A total of ¥100,000M ($909M) will be invested in this development
- Planned to be operational in 2011

21 Japan's Life Simulator: Original Concept Design in 2005
Proposed architecture: a tightly-coupled heterogeneous computer integrating multiple architectures, motivated by the needs of multi-scale, multi-physics simulation and of multiple computation components. Vector, scalar, FPGA, and MD nodes sit on a faster interconnect, with a slower connection between node groups (vs. the present design of vector, scalar, and MD nodes on a switch).

22 Major Applications of the Next-Generation Supercomputer, targeted as grand challenges

23 Basic Concept for Simulations in Nano-Science

24 Basic Concept for Simulations in Life Sciences (RIKEN)
Scales from micro to macro: genome and genes; proteins (bio-MD, chemical processes); cells; tissues (tissue structure, multi-physics); organs; the vascular system (blood circulation); the organism. Applications include drug delivery systems (DDS), gene therapy, HIFU, micro-machines, and catheters.

25 Petascale Era: NCSA Blue Waters, 1 PFlop/s, 2011

26 Bell versus Moore

27 Grand Challenge Applications

28 The von Neumann Computer
Walk-through of c = a + b:
1. Get next instruction
2. Decode: fetch a
3. Fetch a to internal register
4. Get next instruction
5. Decode: fetch b
6. Fetch b to internal register
7. Get next instruction
8. Decode: add a and b (c in register)
9. Do the addition in the ALU
10. Get next instruction
11. Decode: store c in main memory
12. Move c from internal register to main memory
Note: some units are idle while others are working, wasting cycles. Pipelining (modularization) and caching (advance decoding) introduce parallelism.

29 Basic Architecture
- CPU, pipelining
- Memory hierarchy, cache

30 Computer Performance
- The CPU operates on data; if no data is available, the CPU has to wait and performance degrades.
- Typical workstation: 3.2 GHz CPU, 667 MHz memory, so memory is about 5 times slower.
- Moore's law: CPU speed doubles every 18 months; memory speed increases much more slowly.
- A fast CPU requires sufficiently fast memory.
- Rule of thumb: memory size in GB = R_theor in GFLOPS
  - 1 CPU cycle (1 FLOP) handles 1 byte of data
  - 1 MFLOPS needs 1 MB of memory; 1 GFLOPS needs 1 GB of memory
- Many "tricks" designed for performance improvement target the memory.

31 CPU Performance
- Computer time is measured in CPU cycles; the minimum time to execute one instruction is 1 CPU cycle.
- Time to execute a given program: t = n_i * CPI * t_c, where
  - n_c: total number of CPU cycles
  - n_i: total number of instructions
  - CPI = n_c/n_i: average cycles per instruction
  - t_c: cycle time; at 1 GHz, t_c = 1/(10^9 Hz) = 10^-9 s = 1 ns

32 To Make a Program/Computer Faster...
- Reduce cycle time t_c:
  - Increase clock frequency; however, there is a physical limit. In 1 ns, light travels 30 cm; at the current ~3 GHz, light travels 10 cm within 1 CPU cycle, so the machine's length/size must be < 10 cm. (For scale, 1 atom is about 0.2 nm.)
- Reduce the number of instructions n_i:
  - More efficient algorithms
  - Better compilers
- Reduce CPI; the key is parallelism:
  - Instruction-level parallelism: pipelining
  - Internal parallelism: multiple functional units, superscalar processors, multi-core processors
  - External parallelism: multiple CPUs, parallel machines

33 Processor Types
- Vector processors: Cray X1/T90; NEC SX series; Japan Earth Simulator; early Cray machines; Japan Life Simulator (hybrid)
- Scalar processors:
  - CISC (Complex Instruction Set Computer): Intel 80x86 (IA32)
  - RISC (Reduced Instruction Set Computer): Sun SPARC, IBM POWER series, SGI MIPS
  - VLIW (Very Long Instruction Word), also called Explicitly Parallel Instruction Computing (EPIC): Intel IA64 (Itanium); probably dying

34 CISC Processors
- Complex instructions; a large number of instructions; can complete more complicated functions at the instruction level
- An instruction actually invokes microcode: small programs stored in processor memory
- Slower: many instructions access memory, and varying instruction lengths allow no pipelining

35 RISC Processors
- No microcode
- Simple instructions; fewer instructions; fast
- Only load and store instructions access memory
- Common instruction word length, which allows pipelining
Almost all present-day high performance computers use RISC processors.

36 Locality of References
- Spatial/temporal locality:
  - If the processor executes an instruction at time t, it is likely to execute an adjacent/next instruction at t + delta_t.
  - If the processor accesses a memory location/data item x at time t, it is likely to access an adjacent location/data item x + delta_x at t + delta_t.
Pipelining, caching and many other techniques are all based on the locality of references.

37 Pipelining
- Overlapping execution of multiple instructions; 1 instruction per cycle
- Sub-divide each instruction into multiple stages; the processor handles different stages of adjacent instructions simultaneously
- Suppose 4 stages per instruction:
  - Instruction fetch and decode (IF)
  - Read data (RD)
  - Execute (EX)
  - Write back results (WB)

38 Instruction Pipeline
Diagram: successive instructions move through the IF, RD, EX, WB stages, each instruction offset by one cycle from the previous one.
- Depth of pipeline: the number of stages in an instruction
- After the pipeline is full, 1 result per cycle
- CPI = (n + depth - 1)/n for n instructions
- With the pipeline, 7 instructions take 10 cycles; without it, they take 28 cycles.

39 Inhibitors of Pipelining
- Dependencies between instructions interrupt pipelining, degrading performance:
  - Control dependence
  - Data dependence

40 Control Dependence
- Branching: an instruction occurs after a conditional branch, so it is unknown beforehand whether that instruction will be executed, e.g.
  if (x > y) n = 5;
- Loops: for (i = 0; i < n; i++) { ... }
Branching in programs interrupts the pipeline and degrades performance. Avoid excessive branching!

41 Data Dependence
- An instruction depends on data produced by a previous instruction:
  x = 3*j;
  y = x + 5.0; // depends on the previous instruction

42 Vector Pipeline
- Vector processors have vector registers, each holding a whole vector, e.g. of 128 elements; commonly encountered processors (e.g. a home PC) are scalar processors.
- Efficient for loops involving vectors:
  for (i = 0; i < 128; i++)
      z[i] = x[i] + y[i];
  Instructions:
  Vector Load  X(1:128)
  Vector Load  Y(1:128)
  Vector Add   Z = X + Y
  Vector Store Z

43 Vector Pipeline
Diagram: the four vector instructions stream through the pipeline over cycles 1 to 133; Load X reads X(1) through X(128) on successive cycles, with Load Y, Add Z = X + Y, and Store Z following one cycle behind each other.

44 Vector Operations: Hockney's Formulas
Plot of performance vs. vector length; cache: 64 KB.

45 Exceeding Cache Size
Cache: 32 KB; cache line: 64 bytes.
Note: asymptotic 5 MFLOPS, i.e. a result every 15 clocks, which is the time to reload a cache line following a miss.

46 Internal Parallelism
- Functional units: the components in a processor that actually do the work
  - Memory operations (MU): load, store
  - Integer arithmetic (IU): integer add, bit shift, ...
  - Floating-point arithmetic (FPU): floating-point add, multiply, ...
- Typical instruction latencies (cycles):
  - Integer add: 1
  - Floating-point add: 3
  - Floating-point multiply: 3
  - Floating-point divide: 31
Division is much slower than add/multiply: minimize or avoid divisions!

47 Internal Parallelism
- Superscalar RISC processors: multiple functional units per processor, e.g. multiple FPUs
- Capable of executing more than one instruction (producing more than one result) per cycle
- Shared registers, L1 cache, etc.
- Need faster memory access to feed data to multiple functional units
- Limiting factor: memory-processor bandwidth

48 Internal Parallelism
- Multi-core processors: Intel dual-core, quad-core
- Multiple execution cores (functional units, registers, L1 cache) on one CPU chip
- The cores share the L2 cache and memory
- Lower energy consumption
- Need FAST memory access to feed data to multiple cores; effective memory bandwidth per core is reduced
- Limiting factor: memory-processor bandwidth

49 Heat Flux Also Increases with Speed!

50 New Processors Are Too Hot!


52 Your Next PC?

53 External Parallelism
- Parallel machines: to be discussed later

54 Memory: Next Lecture
- Bit: 0 or 1; byte: 8 bits
- Memory sizes: PB = 10^15 bytes; TB = 10^12 bytes; GB = 10^9 bytes; MB = 10^6 bytes
- Memory performance measures:
  - Access time (response time, latency): the interval between the issuance of a memory request and the time the request is satisfied.
  - Cycle time: the minimum time between two successive memory requests.
- Timeline: a request issued at t0 is satisfied at t1; access time = t1 - t0, cycle time = t2 - t0. If another request arrives at t0 < t < t2, the memory is busy.
