1 CS591x - Cluster Computing and Parallel Programming
Parallel Computer Architecture and Software Models

2 It's all about performance
Greater performance is the reason for parallel computing. Many scientific and engineering programs are too large and too complex for traditional uniprocessors. Such large problems are common in ocean modeling, weather modeling, astrophysics, solid state physics, power systems, …

3 FLOPS – a measure of performance
FLOPS – Floating Point Operations per Second … a measure of how much computation can be done in a certain amount of time
MegaFLOPS – MFLOPS – 10^6 FLOPS
GigaFLOPS – GFLOPS – 10^9 FLOPS
TeraFLOPS – TFLOPS – 10^12 FLOPS
PetaFLOPS – PFLOPS – 10^15 FLOPS
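To make the units concrete, here is a rough micro-benchmark sketch in C (my own illustration, not from the slides; timer resolution, compiler optimization, and the single dependent loop make this a crude estimate, not a real measurement):

#include <stdio.h>
#include <time.h>

int main(void) {
    const long n = 100000000L;          /* 10^8 iterations                      */
    double a = 1.0, b = 1.000000001;
    clock_t start = clock();
    for (long i = 0; i < n; i++) {
        a = a * b + 1.0e-9;             /* one multiply + one add per iteration */
    }
    double seconds = (double)(clock() - start) / CLOCKS_PER_SEC;
    double flops = 2.0 * n / seconds;   /* 2 floating point ops per iteration   */
    printf("result %.6f, roughly %.1f MFLOPS\n", a, flops / 1.0e6);
    return 0;
}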

4 How fast …
Cray 1 – ~150 MFLOPS
Pentium 4 – 3-6 GFLOPS
IBM's BlueGene/L – 70 TFLOPS
PSC's Big Ben – 10 TFLOPS
Humans – it depends:
  as calculators – MFLOPS
  as information processors – ~10 PFLOPS

5 FLOPS vs. MIPS
FLOPS is only concerned with floating point calculations; other performance issues include memory latency, cache performance, and I/O capacity.

6 See… www.Top500.org
Biannual performance reports and rankings of the fastest computers in the world

7 Performance
Speedup(n processors) = time(1 processor) / time(n processors) **
** Culler, Singh and Gupta, Parallel Computer Architecture: A Hardware/Software Approach
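As a small worked example (the timings below are made-up placeholders, not measurements), speedup can be computed directly from wall-clock times:

#include <stdio.h>

/* Speedup(n) = time on 1 processor / time on n processors */
double speedup(double time_1proc, double time_nproc) {
    return time_1proc / time_nproc;
}

int main(void) {
    double t1  = 3600.0;   /* hypothetical serial run: 1 hour   */
    double t16 = 300.0;    /* hypothetical run on 16 processors */
    printf("speedup on 16 processors: %.1fx\n", speedup(t1, t16));
    return 0;
}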

8 Consider… from:

9 … a model of the Indian Ocean –
73,000,000 square kilometers of surface
One data point per 100 meters – 7,300,000,000 surface points
Model the ocean at depth, say every 10 meters down to 200 meters – 20 depth data points
Every 10 minutes for 4 hours – 24 time steps

10 So –
73 x 10^6 (sq. km of surface) x 10^2 (points per sq. km) x 20 (depth points) x 24 (time steps) = 3,504,000,000,000 data points in the model grid
Suppose 100 instructions per grid point – 350,400,000,000,000 instructions in the model

11 Then –
Imagine that you have a computer that can run 1 billion (10^9) instructions per second
3.504 x 10^14 / 10^9 = 3.504 x 10^5 seconds – about 97 hours

12 But –
On a 10 teraflops computer: 3.504 x 10^14 / 10^13 = 35.0 seconds
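A short sketch that reproduces the arithmetic of slides 9-12 in C (the 100 instructions per grid point and the two machine speeds are the slides' own assumptions; everything else follows from them):

#include <stdio.h>

int main(void) {
    double surface_km2  = 73.0e6;   /* Indian Ocean surface area (sq. km) */
    double pts_per_km2  = 100.0;    /* one data point every 100 m         */
    double depth_points = 20.0;     /* every 10 m down to 200 m           */
    double time_steps   = 24.0;     /* every 10 minutes for 4 hours       */
    double instr_per_pt = 100.0;    /* assumed work per grid point        */

    double grid_points  = surface_km2 * pts_per_km2 * depth_points * time_steps;
    double instructions = grid_points * instr_per_pt;

    printf("grid points:  %.4g\n", grid_points);    /* ~3.504e12 */
    printf("instructions: %.4g\n", instructions);   /* ~3.504e14 */
    printf("at 1 GFLOPS:  %.0f s (~%.0f hours)\n",
           instructions / 1.0e9, instructions / 1.0e9 / 3600.0);
    printf("at 10 TFLOPS: %.1f s\n", instructions / 1.0e13);
    return 0;
}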

13 Gaining performance: Pipelining
More instructions in execution at the same time in a single processor – more instructions complete per unit time, so faster
Not usually an attractive strategy these days – why?

14 Instruction Level Parallelism (ILP)
Based on the fact that many instructions do not depend on the instructions that come before them
The processor has extra hardware – multiple adders, for example – to execute several instructions at the same time
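A tiny illustration in C (hypothetical values, not from the slides) of what "independent instructions" means to the hardware:

#include <stdio.h>

int main(void) {
    double width = 2.0, height = 3.0, length = 4.0;
    int hits = 7, misses = 2;

    /* independent: none of these statements reads another's result,
       so a superscalar core can execute them at the same time        */
    double area   = width * height;
    double volume = length * length * length;
    int    total  = hits + misses;

    /* dependent: each statement needs the previous result,
       so these must execute one after another               */
    double scaled  = area * 2.0;
    double rounded = scaled + 0.5;

    printf("%f %f %d %f\n", area, volume, total, rounded);
    return 0;
}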

15 Pipelining and ILP are not the solution to our problem – why?
They give only incremental improvements in performance
They have largely been exploited already
We need orders-of-magnitude improvements in performance

16 Gaining Performance: Vector Processors
Scientific and engineering computations are often vector and matrix operations – graphic transformations, for example, such as shifting an object x to the right
Redundant arithmetic hardware and vector registers operate on an entire vector in one step (SIMD) – see the sketch below
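A minimal sketch of the kind of loop that vector (SIMD) hardware handles well – the classic saxpy operation y = a*x + y (the array size and values are arbitrary, and saxpy itself is my example, not the slides'):

#include <stdio.h>

#define N 1000000

/* y[i] = a * x[i] + y[i] -- every iteration is independent,
   so the loop maps directly onto SIMD/vector hardware        */
void saxpy(int n, float a, const float *x, float *y) {
    for (int i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }
}

static float x[N], y[N];

int main(void) {
    for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }
    saxpy(N, 3.0f, x, y);
    printf("y[0] = %f\n", y[0]);   /* 3*1 + 2 = 5 */
    return 0;
}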

17 Gaining Performance: Vector Processors
Declining popularity for a while – the hardware was expensive
Popularity returning – applications in science, engineering, cryptography, media/graphics
Example: the Earth Simulator

18 Parallel Computer Architecture
Shared memory architectures
Distributed memory architectures

19 Shared Memory Systems
Multiple processors connected to, and sharing, the same pool of memory (SMP)
Every processor has, potentially, access to and control of every memory location – see the shared-memory programming sketch below
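A minimal sketch of shared-memory programming in C using OpenMP (OpenMP is not named on these slides – take it as one common way to program an SMP; compiling requires an OpenMP-capable compiler, e.g. gcc -fopenmp, which is an assumption about your toolchain):

#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void) {
    static double a[N];
    double sum = 0.0;

    /* every thread sees the same array 'a' in the shared memory pool */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        a[i] = i * 0.5;
        sum += a[i];
    }

    printf("threads available: %d, sum = %f\n", omp_get_max_threads(), sum);
    return 0;
}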

20 Shared Memory Computers
[Diagram: several processors connected to a single shared memory]

21 Shared Memory Computers
[Diagram: several processors sharing one memory]

22 Shared Memory Computer
[Diagram: processors connected to shared memory through a switch]

23 Shared Memory Computers
SGI Origin2000 (Balder) at NCSA – R10000 processors, 128 GBytes of memory

24 Shared Memory Computers
Rachel at PSC – EV7 processors, 256 GBytes of shared memory

25 Distributed Memory Systems
Multiple processors, each with its own memory
Interconnected to share/exchange data and processing
The modern architectural approach to supercomputers
Supercomputers and clusters are similar – see the message-passing sketch below
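A minimal sketch of distributed-memory programming in C with MPI (MPI is the usual model for machines like these, though this particular global-sum example is mine, not the slides'):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* each process owns its data in its own local memory */
    int local = rank * 10;
    int total = 0;

    /* data moves only through explicit messages; here, a global sum */
    MPI_Reduce(&local, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        printf("processes: %d, sum of local values: %d\n", size, total);
    }
    MPI_Finalize();
    return 0;
}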

26 Clusters – distributed memory
[Diagram: processors, each with its own memory, connected by an interconnect]

27 Cluster – Distributed Memory with SMP
[Diagram: two-processor SMP nodes, each with its own memory, connected by an interconnect]
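For this kind of machine a common programming style (not prescribed by the slides) is hybrid MPI + OpenMP: MPI between the nodes, OpenMP threads among the processors inside each node. A minimal sketch, assuming both an MPI library and OpenMP support are available:

#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char **argv) {
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* MPI spans the distributed memory between nodes;
       OpenMP threads share memory inside each SMP node */
    #pragma omp parallel
    {
        printf("MPI rank %d, thread %d of %d\n",
               rank, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}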

28 Distributed Memory Supercomputer
BlueGene/L – DOE/IBM
0.7 GHz PowerPC 440 processors
32,768 processors
70 TeraFLOPS

29 Distributed Memory Supercomputer
Thunder at LLNL – number 5 on the Top500 list
20 TeraFLOPS
1.4 GHz Itanium processors
4096 processors

30 Grid Computing Systems
What is a Grid? It means different things to different people
Distributed processors – around campus, around the state, around the world

31 Grid Computing Systems
Widely distributed
Loosely connected (i.e., over the Internet)
No central management

32 Grid Computing Systems
Connected clusters and other dedicated scientific computers
I2/Abilene

33 Grid Computing Systems – Harvested Idle Cycles
[Diagram: a control/scheduler coordinating idle machines over the Internet]

34 Grid Computing Systems
Dedicated Grids – TeraGrid, Sabre, NASA Information Power Grid
Cycle Harvesting Grids – Condor, *GlobalGridForum, (Parabon)

35 Let’s revisit speedup…
We can achieve speedup (theoretically) by using more processors, but a number of factors may limit speedup:
Interprocessor communications
Interprocess synchronization
Load balance

36 Amdahl’s Law
According to Amdahl’s Law: Speedup = 1 / (S + (1 - S)/N)
where S is the purely sequential fraction of the program and N is the number of processors
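A worked sketch of Amdahl's Law in C (the sequential fractions and processor counts below are illustrative values, not from the slides):

#include <stdio.h>

/* Amdahl's Law: speedup = 1 / (S + (1 - S) / N)
   S = purely sequential fraction, N = number of processors */
double amdahl(double s, int n) {
    return 1.0 / (s + (1.0 - s) / n);
}

int main(void) {
    double fractions[] = { 0.5, 0.1, 0.01 };   /* 50%, 10%, 1% sequential */
    int    procs[]     = { 4, 64, 1024 };

    for (int i = 0; i < 3; i++) {
        for (int j = 0; j < 3; j++) {
            printf("S = %.2f, N = %4d  ->  speedup = %6.1f\n",
                   fractions[i], procs[j], amdahl(fractions[i], procs[j]));
        }
    }
    return 0;
}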

37 Amdahl’s Law
What does it mean? Amdahl’s law says:
Part of a program is parallelizable; part of the program must remain sequential (S)
Speedup is constrained by the portion of the program that must remain sequential relative to the part that is parallelized
Note: if S is very small, the problem is “embarrassingly parallel”

38 Software models for parallel computing
Shared memory
Distributed memory
Data parallel

39 Flynn’s Taxonomy
Single Instruction/Single Data – SISD
Multiple Instruction/Single Data – MISD
Single Instruction/Multiple Data – SIMD
Multiple Instruction/Multiple Data – MIMD
Single Program/Multiple Data – SPMD

40 Next: Cluster Computer Architecture, Linux

