SE-292 High Performance Computing: Introduction to Concurrent Programming & Parallel Architecture (R. Govindarajan)


1 SE-292 High Performance Computing: Introduction to Concurrent Programming & Parallel Architecture. R. Govindarajan.

2 PARALLEL ARCHITECTURE
Parallel machine: a computer system with more than one processor.
Questions: What about multicores? What about a network of machines?
Both qualify, but in a network of machines the time involved in interaction (communication) may be high, since the system is designed assuming the machines are more or less independent. Special parallel machines are designed to make this interaction overhead smaller.

3 Classification of Parallel Machines
Flynn's classification: in terms of the number of instruction streams and data streams.
Instruction stream: path to instruction memory (PC). Data stream: path to data memory.
SISD: single instruction stream, single data stream.
SIMD: single instruction stream, multiple data streams.
MIMD: multiple instruction streams, multiple data streams.

4 PARALLEL PROGRAMMING
Recall Flynn: SIMD, MIMD. On the programming side: SPMD, MPMD.
Shared memory: threads. Message passing: using messages.
Speedup = (execution time on one processor) / (execution time on n processors).
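To make the shared-memory threads style concrete, here is a minimal SPMD sketch in C using POSIX threads (not from the original slides; the array size, thread count, and names are illustrative). Every thread runs the same code on its own slice of the data and leaves a partial result in a private slot; a message-passing counterpart appears with the cluster slides near the end of the deck.

#include <pthread.h>
#include <stdio.h>

#define N 1000000
#define NTHREADS 4

static double a[N];
static double partial[NTHREADS];

/* SPMD: each thread executes the same function on its own data slice. */
static void *worker(void *arg) {
    long id = (long)arg;
    long lo = id * (N / NTHREADS);
    long hi = (id == NTHREADS - 1) ? N : lo + N / NTHREADS;
    double s = 0.0;
    for (long i = lo; i < hi; i++)
        s += a[i];
    partial[id] = s;    /* private slot per thread: no lock needed */
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    for (long i = 0; i < N; i++) a[i] = 1.0;
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    double sum = 0.0;
    for (long i = 0; i < NTHREADS; i++) {
        pthread_join(t[i], NULL);
        sum += partial[i];   /* sequential reduction of partial sums */
    }
    printf("sum = %f\n", sum);
    return 0;
}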

5 How Much Speedup is Possible?
Let s be the fraction of the sequential execution time of a given program that cannot be parallelized. Assume the program is parallelized so that the remaining part (1 - s) is perfectly divided to run in parallel across n processors. Then
Speedup = 1 / (s + (1 - s)/n)
The maximum speedup achievable is limited by the sequential fraction of the sequential program: as n grows, the speedup approaches 1/s. This is Amdahl's Law.
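A quick worked instance of the formula (numbers chosen only for illustration), taking s = 0.1 and n = 10:

\[
\text{Speedup} = \frac{1}{s + \frac{1-s}{n}}
= \frac{1}{0.1 + \frac{0.9}{10}}
= \frac{1}{0.19} \approx 5.26,
\qquad
\lim_{n \to \infty} \text{Speedup} = \frac{1}{s} = 10.
\]

Even with unlimited processors, a 10% sequential fraction caps the speedup at 10.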

6 Understanding Amdahl's Law
[Figure: concurrency-versus-time execution profiles for computing min(|A[i,j] - B[i,j]|) over an n x n array: (a) serial execution, total time n^2; (b) naïve parallel execution, a parallel phase of n^2/p on p processors followed by a serial phase; (c) parallel execution, time n^2/p across p processors.]

7 Classification 2: Shared Memory vs Message Passing
Shared memory machine: the n processors share a physical address space; communication can be done through this shared memory.
The alternative is sometimes referred to as a message passing machine or a distributed memory machine.
[Figure: left, processors connected through an interconnect to a shared main memory; right, processors each with a private memory, connected through an interconnect.]

8 Shared Memory Machines
The shared memory could itself be distributed among the processor nodes: each processor might have some portion of the shared physical address space that is physically close to it and therefore accessible in less time.
Terms: shared vs private; local vs remote; centralized vs distributed shared memory; UMA (Uniform Memory Access) vs NUMA (Non-Uniform Memory Access) architecture.

9 Shared Memory Architecture
[Figure: two organizations. Centralized shared memory, Uniform Memory Access (UMA): processor-cache pairs connected through a network to shared memory modules. Distributed shared memory, Non-Uniform Memory Access (NUMA): each processor-cache node has its own memory, and the nodes are connected through a network.]

10 MultiCore Structure
[Figure: an eight-core chip (C0 through C7); each core has a private L1 cache, each pair of cores shares an L2 cache, and the L2 caches connect to memory.]

11 NUMA Architecture
[Figure: two sockets, each with four cores having private L1 caches, pairs of cores sharing L2 caches, a shared L3 cache, and an integrated memory controller (IMC) to local memory; the sockets are connected by QPI links.]

12 Distributed Memory Architecture
Message passing architecture: memory is private to each node; processes communicate by messages. An example is a cluster.
[Figure: nodes, each with a processor, cache, and private memory, connected by a network.]

13 Parallel Architecture: Interconnections
Indirect interconnects: nodes are connected to an interconnection medium, not directly to each other. Examples: shared bus, multiple buses, crossbar, MIN (multistage interconnection network).
Direct interconnects: nodes are connected directly to each other. Topology: linear, ring, star, mesh, torus, hypercube. Routing techniques: how the route taken by a message from source to destination is decided.

14 Indirect Interconnects
[Figure: a shared bus, multiple buses, a crossbar switch, and a multistage interconnection network built from 2x2 crossbar switches.]

15 Direct Interconnect Topologies
[Figure: linear array, ring, star, 2D mesh, 2D torus, and hypercubes (binary n-cubes) for n = 2 and n = 3.]
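The hypercube in particular admits a simple deterministic routing scheme, often called e-cube routing: correct the differing address bits one dimension at a time. A minimal C sketch (my illustration, not from the slides), assuming node IDs are the binary labels of the n-cube:

#include <stdio.h>

/* E-cube routing on a binary n-cube: at each hop, flip the lowest-order
 * bit in which the current node's ID differs from the destination's.
 * The path length equals the Hamming distance between src and dst. */
static void ecube_route(unsigned src, unsigned dst) {
    unsigned cur = src;
    printf("%u", cur);
    while (cur != dst) {
        unsigned diff = cur ^ dst;
        unsigned bit = diff & (~diff + 1u); /* isolate lowest set bit */
        cur ^= bit;                         /* traverse that dimension */
        printf(" -> %u", cur);
    }
    printf("\n");
}

int main(void) {
    ecube_route(0u, 5u);   /* on a 3-cube: prints 0 -> 1 -> 5 */
    return 0;
}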

16 Shared Memory Architecture: Caches
[Figure: P1 and P2 each read X (value 0) into their private caches; P1 then writes X = 1. A later read of X by P2 hits in its own cache and returns the stale value 0: cache hit, wrong data!]

17 Cache Coherence Problem
If each processor in a shared memory multiprocessor machine has a data cache, there is a potential data consistency problem, the cache coherence problem: a shared variable is modified while copies of it sit in private caches.
Objective: processes should not read stale data.
Solutions: hardware cache coherence mechanisms, or software (compiler-assisted) cache coherence.

18 Example: Write Once Protocol
Assumption: a shared bus interconnect where all cache controllers monitor all bus activity, called snooping.
Because there is only one operation on the bus at a time, cache controllers can be built to take corrective action and enforce coherence in the caches. Corrective action could involve updating or invalidating a cache block.

19 Invalidation Based Cache Coherence
[Figure: P1 and P2 both read X (value 0) into their caches; when P1 writes X = 1, an invalidate goes out on the bus and P2's copy of X is invalidated, so P2's next read misses and fetches the new value.]
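To make the snooping idea concrete, here is a toy software model of an invalidation-based protocol. This is a sketch for intuition only: the three-state machine, write-through memory, and function names are my simplifications, not the slides' write-once protocol, which has more states (Invalid, Valid, Reserved, Dirty) and avoids some memory writes.

#include <stdio.h>

#define NCACHES 2

typedef enum { INVALID, VALID, DIRTY } State;

typedef struct { State st; int val; } Line;  /* one cached copy of X */

static Line cache[NCACHES];
static int memory_X = 0;

/* Bus write: before a cache writes X, it broadcasts an invalidate,
 * and every other (snooping) cache that holds a copy drops it. */
static void bus_invalidate(int writer) {
    for (int i = 0; i < NCACHES; i++)
        if (i != writer) cache[i].st = INVALID;
}

static int read_X(int c) {
    if (cache[c].st == INVALID) {            /* miss: fetch from memory */
        /* a real protocol would also fetch a DIRTY copy from its owner */
        cache[c].val = memory_X;
        cache[c].st = VALID;
    }
    return cache[c].val;
}

static void write_X(int c, int v) {
    bus_invalidate(c);               /* other caches invalidate their copies */
    cache[c].val = v;
    cache[c].st = DIRTY;
    memory_X = v;                    /* write-through, for simplicity */
}

int main(void) {
    read_X(0); read_X(1);                    /* both caches hold X = 0 */
    write_X(0, 1);                           /* P1 writes: P2's copy invalidated */
    printf("P2 reads X = %d\n", read_X(1));  /* miss, refetch: prints 1, not stale 0 */
    return 0;
}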

20 Snoopy Cache Coherence and Locks
[Figure: two processors repeatedly execute Test&Set(L); each attempt writes L = 1, so the lock variable bounces between the caches. One processor holds the lock in its critical section; when it releases L (writes L = 0), the other processor's Test&Set succeeds.]
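A lock built directly on this primitive might look like the following C sketch (my illustration using the GCC/Clang __atomic builtins, not code from the slides). It uses the standard test-and-test-and-set refinement: spin on a plain read and only issue the bus-invalidating atomic when the lock looks free, which cuts down exactly the coherence traffic the figure illustrates.

#include <pthread.h>
#include <stdio.h>

static volatile int L = 0;          /* 0 = free, 1 = held */
static long counter = 0;

static void lock(void) {
    /* Atomic exchange acts as Test&Set: returns old value, stores 1. */
    while (__atomic_exchange_n(&L, 1, __ATOMIC_ACQUIRE)) {
        while (L)
            ;                       /* local read-only spin: cache hits, no bus writes */
    }
}

static void unlock(void) {
    __atomic_store_n(&L, 0, __ATOMIC_RELEASE);  /* release L */
}

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        lock();
        counter++;                  /* critical section */
        unlock();
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);   /* expect 200000 */
    return 0;
}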

21 Space of Parallel Computing
Programming models: what the programmer uses in coding applications; they specify synchronization and communication. Models: shared address space, message passing, data parallel.
Parallel architecture: shared memory, either centralized shared memory (UMA) or distributed shared memory (NUMA); or distributed memory, a.k.a. message passing, e.g. clusters.

22 Memory Consistency Model
The memory consistency model defines the order in which memory operations appear to execute, i.e., what value a read can return. It is a contract between application software and the system, and it affects both ease of programming and performance.

23 Implicit Memory Model
Sequential consistency (SC) [Lamport]: the result of an execution appears as if operations from different processors were executed in some sequential (interleaved) order, with the memory operations of each processor appearing in program order.
[Figure: processors P1 through Pn issuing operations one at a time to a single shared memory.]
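A classic way to see what SC rules out is the "store buffering" litmus test below (my sketch, not from the slides). C11 seq_cst atomics are used because they guarantee SC for these accesses; with relaxed ordering, real hardware could expose the non-SC outcome.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

/* Store-buffering litmus test. Under sequential consistency the outcome
 * r1 == 0 && r2 == 0 is impossible: in any interleaving of the four
 * operations, at least one thread's read follows the other's write. */
static atomic_int X = 0, Y = 0;
static int r1, r2;

static void *t1(void *arg) {
    (void)arg;
    atomic_store(&X, 1);       /* seq_cst by default */
    r1 = atomic_load(&Y);
    return NULL;
}

static void *t2(void *arg) {
    (void)arg;
    atomic_store(&Y, 1);
    r2 = atomic_load(&X);
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, t1, NULL);
    pthread_create(&b, NULL, t2, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("r1=%d r2=%d\n", r1, r2);  /* never 0,0 under SC */
    return 0;
}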

24 Distributed Memory Architecture
Message passing architecture: memory is private to each node; processes communicate by messages.
[Figure: processing nodes, each with a processor, cache, and private memory, connected by a network.]

25 NUMA Architecture
[Figure: the two-socket NUMA node of slide 11, extended with an I/O hub and a NIC (network interface card) so the node can be connected to a network.]

26 Using MultiCores to Build a Cluster
[Figure: four multicore nodes (Node 0 through Node 3), each with its own memory and NIC, connected through a network switch.]

27 Using MultiCores to Build a Cluster
[Figure: the same four-node cluster; one node sends an array to another over the switch, e.g. Send A(1:N).]
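A message such as Send A(1:N) maps directly onto MPI, the standard message-passing interface for such clusters. A minimal C sketch (the array size and tag are chosen arbitrarily), assuming the program is launched with at least two ranks:

#include <mpi.h>
#include <stdio.h>

#define N 1024

int main(int argc, char **argv) {
    double A[N];
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        for (int i = 0; i < N; i++) A[i] = i;        /* fill A(1:N) */
        MPI_Send(A, N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(A, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("node 1 received A, A[N-1] = %f\n", A[N - 1]);
    }

    MPI_Finalize();
    return 0;
}

Run with, e.g., mpirun -np 2 ./a.out; when the two ranks land on different nodes, the NIC and switch in the figure carry the MPI_Send traffic.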

