Lecture 1: Parallel Processing for Scientific Applications

Parallel Computing
Multiple processes cooperating to solve a single problem.

Why Parallel Computing?
It is easy to pose huge computational problems:
* Physical simulation in 3D: a 100 x 100 x 100 grid already has 10^6 points.
* Oceanography example: 48 M cells, several variables per cell; one time step = 30 Gflop (30,000,000,000 floating-point operations).

Why Parallel Computing?
Numerical prototyping:
* Real phenomena are too complicated to model analytically.
* Real experiments are too hard, too expensive, or too dangerous for a laboratory.
Examples: simulating aging effects on nuclear weapons (ASCI project), oil reservoir simulation, large wind tunnels, galactic evolution, whole-factory or product life-cycle design and optimization, DNA matching (bioinformatics).

An Example -- Climate Prediction
(Figure: the domain divided into grid points.)

An Example -- Climate Prediction
What is climate? Climate(longitude, latitude, height, time)
* returns a vector of 6 values: temperature, pressure, humidity, and wind velocity (3 components).
Discretize: evaluate only at grid points:
* Climate(i, j, k, n), where t = n*dt, dt is a fixed time step, n is an integer, and i, j, k are integers indexing the grid cells.

An Example -- Climate Prediction
Area: 3000 x 3000 miles, height: 11 miles -- a 3000 x 3000 x 11 cubic-mile domain.
Segment size: 0.1 x 0.1 x 0.1 cubic miles -- about 10^11 different segments.
Two-day period, dt = 0.5 hours (2 x 24 x 2 = 96 time steps).
100 instructions per segment:
* the computation of the parameters inside a segment uses the initial values and the values from neighboring segments.
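
To make "uses the values from neighboring segments" concrete, here is a minimal sketch in C (my illustration; the grid size, the 7-point averaging formula, and the variable layout are placeholders, not the slide's actual climate model) of a neighbor-based update swept 96 times over a small 3-D grid:

    #include <stdlib.h>

    #define NX 100   /* illustrative grid dimensions, far smaller than 10^11 cells */
    #define NY 100
    #define NZ 11

    /* One sweep of a toy neighbour-based update: the new value of a cell is an
       average of its old value and its six face neighbours.  A real climate code
       updates several variables per cell with far more arithmetic; only the
       access pattern is the point here. */
    static void update(double new_grid[NX][NY][NZ], double old[NX][NY][NZ]) {
        for (int i = 1; i < NX - 1; i++)
            for (int j = 1; j < NY - 1; j++)
                for (int k = 1; k < NZ - 1; k++)
                    new_grid[i][j][k] = (old[i][j][k] +
                                         old[i-1][j][k] + old[i+1][j][k] +
                                         old[i][j-1][k] + old[i][j+1][k] +
                                         old[i][j][k-1] + old[i][j][k+1]) / 7.0;
    }

    int main(void) {
        double (*a)[NY][NZ] = calloc(NX, sizeof *a);   /* "old" time level   */
        double (*b)[NY][NZ] = calloc(NX, sizeof *b);   /* "new" time level   */
        for (int n = 0; n < 96; n++) {                 /* 96 half-hour steps */
            update(b, a);
            double (*t)[NY][NZ] = a; a = b; b = t;     /* swap time levels   */
        }
        free(a); free(b);
        return 0;
    }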

An Example -- Climate Prediction
A single update of the parameters over the entire domain requires 10^11 x 100, or 10^13 instructions (10 trillion instructions).
Updating 96 times -- about 10^15 instructions.
Single-CPU supercomputer:
* 1000 MHz RISC CPU
* Execution time: about 280 hours.
??? It takes 280 hours to predict the weather for the next 48 hours.
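
The 280-hour figure is just the instruction count divided by the instruction rate. A quick check in C, assuming the 1000 MHz CPU retires roughly one instruction per cycle (an assumption; the slide does not say):

    #include <stdio.h>

    int main(void) {
        double per_update = 1e11 * 100.0;      /* ~10^13 instructions per domain sweep */
        double total      = per_update * 96.0; /* 96 half-hour steps over two days     */
        double rate       = 1e9;               /* 1000 MHz, ~1 instruction per cycle   */
        printf("%.1e instructions, %.0f hours\n", total, total / rate / 3600.0);
        /* prints 9.6e+14 instructions, ~267 hours; rounding the count up to 10^15
           gives the ~280 hours quoted on the slide */
        return 0;
    }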

Issues in Parallel Computing
* Design of Parallel Computers
* Design of Efficient Algorithms
* Methods for Evaluating Parallel Algorithms
* Parallel Programming Languages
* Parallel Programming Tools
* Portable Parallel Programs
* Automatic Programming of Parallel Computers

Some Basic Studies

Design of Parallel Computers
Parallel computing is information processing that emphasizes the concurrent manipulation of data elements belonging to one or more processes solving a single problem [Quinn 1994].
Parallel computer: a multiple-processor computer capable of parallel computing.

Efficient Algorithms
Throughput: the number of results per second.
Speedup: S = T_1 / T_p, where T_1 is the serial time and T_p is the time on p processors.
Efficiency: E = S / P (P = number of processors).
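
A quick worked example (the timings are made up, not from the slides): a job that takes 100 s on one processor and 14 s on 8 processors has speedup of about 7.1 and efficiency of about 0.89.

    #include <stdio.h>

    int main(void) {
        double t1 = 100.0;          /* serial time in seconds (assumed)        */
        double tp = 14.0;           /* parallel time on p processors (assumed) */
        int    p  = 8;
        double s  = t1 / tp;        /* speedup  S = T_1 / T_p                  */
        printf("speedup = %.2f, efficiency = %.2f\n", s, s / p);   /* 7.14, 0.89 */
        return 0;
    }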

Scalability
Algorithmic scalability: an algorithm is scalable if the available parallelism increases at least linearly with problem size.
Architectural scalability: an architecture is scalable if it continues to yield the same performance per processor as the number of processors and the problem size are increased.
* Solve larger problems in the same amount of time by buying a parallel computer with more processors. ($$$$$ ??)

Parallel Architectures
SMP: Symmetric Multiprocessor (SGI Power Challenge, Sun Enterprise 6000)
MPP: Massively Parallel Processors
* Intel ASCI Red: 9152 processors (1997)
* SGI/Cray T3E 1200 LC1080-512: 1080 nodes (1998)
Cluster: true distributed systems -- tightly-coupled software on loosely-coupled (LAN-based) hardware.
* NOW (Network of Workstations), COW (Cluster of Workstations), Pile-of-PCs (PoPC)

Levels of Abstraction
Applications (sequential? parallel?)
Programming models (shared memory? message passing?)
Address space (shared memory? distributed memory?)
Hardware architecture

Is Parallel Computing Simple?

A Simple Example
Take a paper and pen. Algorithm:
* Step 1: Write a number on your paper.
* Step 2: Compute the sum of your neighbors' values.
* Step 3: Write the sum on the paper.
(The code sketches after the Shared Memory Model and Message Passing Model slides below show one way to carry out these steps.)

** Question 1
How do you get the values from your neighbors?

Shared Memory Model
(Figure: the numbers 5, 0, 4 sit on a shared pad that everyone can see; neighbors' values are read directly from shared memory.)
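
A minimal shared-memory sketch of the exercise (my illustration, not from the slides), using POSIX threads: every thread writes its number into a shared array, then reads its neighbours' entries directly. The barrier between the two steps matters; it prevents exactly the synchronization problem shown a few slides later.

    #include <pthread.h>
    #include <stdio.h>

    #define N 4                       /* number of "people" (threads), assumed */

    static int pad[N];                /* the shared pad everyone can read      */
    static pthread_barrier_t barrier;

    static void *worker(void *arg) {
        int me = (int)(long)arg;
        pad[me] = me * 2 + 1;                 /* Step 1: write a number          */
        pthread_barrier_wait(&barrier);       /* wait until everyone has written */
        int left  = (me + N - 1) % N;         /* neighbours on a ring            */
        int right = (me + 1) % N;
        int sum = pad[left] + pad[me] + pad[right];    /* Step 2: read and add   */
        printf("thread %d: sum = %d\n", me, sum);      /* Step 3: write it down  */
        return NULL;
    }

    int main(void) {
        pthread_t t[N];
        pthread_barrier_init(&barrier, NULL, N);
        for (long i = 0; i < N; i++)
            pthread_create(&t[i], NULL, worker, (void *)i);
        for (int i = 0; i < N; i++)
            pthread_join(t[i], NULL);
        pthread_barrier_destroy(&barrier);
        return 0;
    }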

Message Passing Model
(Figure: each person has to ask a neighbor -- "Hey!! What's your number?" -- and wait for a reply; neighbors' values arrive as messages.)
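
The same exercise in a message-passing style, sketched with MPI (again my illustration; the lecture does not prescribe a particular library). Each process keeps its number in private memory and exchanges it with its ring neighbours; the combined send/receive avoids deadlock.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int mine  = rank * 2 + 1;              /* Step 1: "write a number"       */
        int left  = (rank + size - 1) % size;  /* ring neighbours                */
        int right = (rank + 1) % size;
        int from_left, from_right;

        /* Step 2: ask the neighbours for their values */
        MPI_Sendrecv(&mine, 1, MPI_INT, right, 0, &from_left, 1, MPI_INT, left, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Sendrecv(&mine, 1, MPI_INT, left, 1, &from_right, 1, MPI_INT, right, 1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* Step 3: write down the sum */
        printf("rank %d: %d + %d + %d = %d\n", rank, from_left, mine, from_right,
               from_left + mine + from_right);
        MPI_Finalize();
        return 0;
    }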

** Question 2
Are you sure the sum is correct?

Some processor starts earlier
(Figure: one processor has already computed its sum, 5 + 0 + 4 = 9, while the others are still writing their numbers.)

Synchronization Problem!!
(Figure: the early processor has already written its sum 9 over its number (Step 3), so a neighbor still doing Step 2 reads the 9 and computes 9 + 5 + 0 = 14 -- the wrong sum.)

** Question 3
How do you decide when you are done (and can throw away the paper)?

Some processor finished earlier
(Figure: one processor computes 5 + 0 + 4 = 9 and finishes.)

Some processor finished earlier
(Figure: it writes down 9 and considers itself done.)

Some processor finished earlier
(Figure: the finished processor throws its paper away: "Sorry!! We closed!!")

Some processor finished earlier
(Figure: a neighbor still on Step 2 now asks for the discarded value and gets no answer: ? + 5 + 0 = ? -- "Sorry!! We closed!!")

Classification of Parallel Architectures

1. Based on Control Mechanism
Flynn's classification, by instruction and data streams:
* SISD: single instruction stream, single data stream
* SIMD: single instruction stream, multiple data streams
* MIMD: multiple instruction streams, multiple data streams
* MISD: multiple instruction streams, single data stream

SIMD
Examples:
* Thinking Machines CM-1, CM-2
* MasPar MP-1 and MP-2
Simple processors: e.g., 1- or 4-bit CPUs
Fast global synchronization (global clock)
Fast neighborhood communication
Applications: image/signal processing, numerical analysis, data compression, ...

2. Based on Address-Space Organization
Bell's classification of MIMD architectures:
* Message-passing architecture: local or private memory; multicomputer = MIMD message-passing computer (or distributed-memory computer)
* Shared-address-space architecture: hardware support for one-sided communication (read/write); multiprocessor = MIMD shared-address-space computer

Address Space
A region of a computer's total memory within which addresses are contiguous and may refer to one another directly in hardware.
A shared-memory computer has only one user-visible address space; a disjoint-memory computer can have several.
Disjoint memory is more commonly called distributed memory, but the memory of many shared-memory computers (multiprocessors) is also physically distributed.

Multiprocessors vs. Multicomputers
Shared-memory multiprocessor models:
* UMA: uniform memory access (all SMP servers)
* NUMA: non-uniform memory access (DASH, T3E)
* COMA: cache-only memory architecture (KSR)
Distributed-memory multicomputer model:
* message-passing network
* NORMA model (no remote memory access)
* IBM SP2, Intel Paragon, TMC CM-5, Intel ASCI Red, clusters

Parallel Computers at HKU
* Symmetric Multiprocessors (SMPs): SGI PowerChallenge
* Cluster: IBM PowerPC clusters
* Distributed Memory Machine: IBM SP2
(Locations shown: CYC 807, CYC LG 102, Computer Center.)

Symmetric Multiprocessors (SMPs)
Processors are connected to a shared memory module through a shared bus.
Each processor has an equal right to access:
* the shared memory
* all I/O devices
A single copy of the OS.

(Figure: block diagram of a Pentium Pro (P6) based SMP node, CYC 414 SRG Lab: P6 processors on the Pentium Pro processor bus (32-bit address, 64-bit data, 533 MB/s); a DRAM controller with data path and memory interface controller (MIC) to interleaved memory (288-bit interleave data, 72-bit memory data); and a PCI bridge to the PCI bus (32-bit address, 32-bit data, 132 MB/s) carrying PCI devices and a NIC to the network.)

SMP Machine: SGI POWER CHALLENGE
POWER CHALLENGE XL:
– 2-36 CPUs
– 16 GB memory (for 36 CPUs)
– bus performance: up to 1.2 GB/s
Runs a 64-bit OS (IRIX 6.2).
Memory is shared, which makes it suitable for single-address-space programming.

Distributed Memory Machine
Consists of multiple computers (nodes).
Nodes communicate by message passing.
Each node is an autonomous computer:
* processor(s) (may be an SMP)
* local memory
* disks, network adapter, and other I/O peripherals
No remote memory access (NORMA).

Distributed Memory Machine: IBM SP2
SP2: Scalable POWERparallel System.
Developed from the RISC System/6000 workstation.
POWER2 processor, 66.6 MHz, 266 Mflops.

SP2 - Message Passing

SP2 - High Performance Switch
8x8 switch elements; switches among the nodes simultaneously and quickly.
Maximum 40 MB/s point-to-point bandwidth.

SP2 - Nodes (POWER2 processor)
Two types of nodes:
– Thin node (smaller capacity, used to run individual jobs): 4 Micro Channel slots, 96 KB cache, 64-512 MB memory, 1-4 GB disk
– Wide node (larger capacity, used as servers of the system): 8 Micro Channel slots, 288 KB cache, 64-2048 MB memory, 1-8 GB disk

SP2 - The largest SP (P2SC, 120 MHz) machine: Pacific Northwest National Laboratory, U.S., 512 processors, ranked 26th in the TOP500, 1998.

What's a Cluster?
A cluster is a group of whole computers that works cooperatively as a single system to provide fast and efficient computing services.

(Figure: a cluster of four nodes -- Node 1 to Node 4 -- connected by switched Ethernet. One node asks, "I need variable A from Node 2!", and the value comes back as messages: "OK!", "Thank you!")

Clusters
Advantages:
* cheaper
* easy to scale
* coarse-grain parallelism
Disadvantages:
* poor communication performance (typically the latency) compared with other parallel systems

TOP500 (1997)
* TOP 1: Intel ASCI Red, Sandia National Laboratories, USA, June 1997
* TOP 2: Hitachi/Tsukuba CP-PACS (2048 processors), 0.368 Tflops, Univ. of Tsukuba, Japan, 1996
* TOP 3: SGI/Cray T3E 900 LC696-128 (696 processors), 0.264 Tflops, UK Meteorological Office, UK, 1997

TOP500 (June 1998)
* TOP 1: Intel ASCI Red (9152 Pentium Pro processors, 200 MHz), 1.3 Tflops, Sandia National Laboratories, U.S., since June 1997
* TOP 2: SGI/Cray T3E 1200 LC1080-512, 1080 processors, 0.891 Tflops, U.S. government, installed 1998
* TOP 3: SGI/Cray T3E 900 LC1248-128, 1248 processors, 0.634 Tflops, U.S. government

Intel ASCI Red
Compute nodes: 4536 (dual Pentium Pro 200 MHz sharing a 533 MB/s bus)
Peak speed: 1.8 Teraflops (trillion = 10^12)
1,600 square feet, 85 cabinets

Intel ASCI Red (Network)
Split 2-D mesh interconnect.
Node-to-node bidirectional bandwidth: 800 MB/s -- about 10 times faster than the SP2 (one-way: 40 MB/s).

Cray T3E 1200
Processor performance: 600 MHz, 1200 Mflops.
Overall system peak performance: 7.2 gigaflops to 2.5 teraflops, scaling to thousands of processors.
Interconnect: a three-dimensional bidirectional torus (peak interconnect speed of 650 MB/s).
Cray UNICOS/mk distributed OS.
Scalable GigaRing I/O system.

Cray T3E Interconnect
(Figure: the 3-D torus network.)

CP-PACS/2048, Japan
Peak performance: 0.614 Tflops
CPU: PA-RISC 1.1, 150 MHz

CP-PACS Interconnect
Communication bandwidth: 300 MB/s per link.

TOP500 (Asia)
1996:
* Japan: (1) SR2201/1024 (1996)
* Taiwan: (76) SP2/80
* Korea: (97) Cray Y-MP/16
* China: (231) SP2/32
* Hong Kong: (232) SP2/32 (HKU/CC)
1997:
* Japan: (2) CP-PACS/2048 (1996), (5) SR2201/1024 (1996)
* Korea: (34) T3E 900 LC128-128, (154) Ultra HPC 1000
* Taiwan: (167) SP2/80
* Hong Kong: (426) SGI Origin 2000 (CU)
* (500) SP2/38 (UCLA)

TOP500 Asia (1998)
Japan:
* TOP 6: CP-PACS/2048 (1997: TOP 2)
* TOP 12: NEC SX-4/128M4
* TOP 13: NEC SX-4/128H4
* TOP 14: Hitachi SR2201/1024
* ... more
Korea: SGI/Cray T3E 900 LC128-128 (TOP 52), ...
Taiwan: IBM SP2/110, 1998 (TOP 241)

More Information
TOP500: http://www.top500.org/
ASCI Red: http://www.sandia.gov/ASCI/Red.htm
Cray T3E 1200: http://www.cray.com/products/systems/crayt3e/1200/
Reading: Chapters 1.2-1.4, 2.1-2.4.1

