1 INTRODUCTION-2 (Computer Engg, IIT(BHU), 3/12/2013)

2 Parallel Computing ● Parallel computing: the use of multiple computers or processors working together on a common task ● Parallel computer: a computer that contains multiple processors: ➔ each processor works on its section of the problem ➔ processors are allowed to exchange information with other processors

3 Parallel vs. Serial Computers Two big advantages of parallel computers: 1. total performance 2. total memory ● Parallel computers enable us to solve problems that: ➔ benefit from, or require, fast solution ➔ require large amounts of memory ➔ example that requires both: weather forecasting

4 Parallel vs. Serial Computers Some benefits of parallel computing include: ● more data points ➔ bigger domains ➔ better spatial resolution ➔ more particles ● more time steps ➔ longer runs ➔ better temporal resolution ● faster execution ➔ faster time to solution ➔ more solutions in same time ➔ larger simulations in real time

5 Serial Processor Performance Although Moore’s Law ‘predicts’ that single processor performance doubles every 18 months, eventually physical limits on manufacturing technology will be reached

6 Types of Parallel Processor The simplest and most useful way to classify modern parallel computers is by their memory model: ➔ shared memory ➔ distributed memory

7 Shared vs Distributed memory Shared memory - single address space. All processors have access to a pool of shared memory. (Ex: SGI Origin, Sun E10000) Distributed memory - each processor has its own local memory. Must do message passing to exchange data between processors. (Ex: CRAY T3E, IBM SP, clusters)

8 Shared Memory Uniform memory access (UMA): Each processor has uniform access to memory. Also known as symmetric multiprocessors, or SMPs (Sun E10000) Non-uniform memory access (NUMA): Time for memory access depends on location of data. Local access is faster than non-local access. Easier to scale than SMPs (SGI Origin)

9 Distributed Memory Processor-memory nodes are connected by some type of interconnect network ➔ Massively Parallel Processor (MPP): tightly integrated, single system image. ➔ Cluster: individual computers connected by s/w

10 Processor, Memory & Network Both shared and distributed memory systems have: ➔ processors: now generally commodity RISC processors ➔ memory: now generally commodity DRAM ➔ network/interconnect: between the processors and memory (bus, crossbar, fat tree, torus, hypercube, etc.)

11 Processor-Related Terms Clock period (cp): the minimum time interval between successive actions in the processor. Fixed: depends on design of processor. Measured in nanoseconds (~1-5 for fastest processors). Inverse of frequency (MHz). Instruction: an action executed by a processor, such as a mathematical operation or a memory operation. Register: a small, extremely fast location for storing data or instructions in the processor.
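As a quick illustration of the clock period/frequency relationship above, a minimal C calculation (the 500 MHz figure is an arbitrary example, not taken from the slides):

```c
#include <stdio.h>

/* Clock period is the inverse of clock frequency.
 * The frequency below is a hypothetical example. */
int main(void) {
    double freq_mhz = 500.0;           /* hypothetical 500 MHz processor */
    double cp_ns = 1000.0 / freq_mhz;  /* period in ns = 1000 / frequency in MHz */
    printf("%.0f MHz -> clock period %.1f ns\n", freq_mhz, cp_ns);
    return 0;
}
```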

12 Processor-Related Terms Functional Unit (FU): a hardware element that performs an operation on an operand or pair of operands. Common FUs are ADD, MULT, INV, SQRT, etc. Pipeline: technique enabling multiple instructions to be overlapped in execution. Superscalar: multiple instructions are possible per clock period. Flops: floating point operations per second.
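A hedged back-of-the-envelope sketch of how clock rate and superscalar issue combine into a peak flop rate (both numbers are assumptions for illustration, not figures from the slides):

```c
#include <stdio.h>

/* Peak flop rate = clock frequency * floating point operations per clock period. */
int main(void) {
    double freq_hz = 500e6;     /* hypothetical 500 MHz clock */
    double flops_per_cp = 2.0;  /* e.g., one pipelined ADD and one MULT per cp */
    printf("peak = %.1e flops\n", freq_hz * flops_per_cp);  /* 1.0e9 flops */
    return 0;
}
```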

13 Processor-Related Terms Cache: fast memory (SRAM) near the processor. Helps keep instructions and data close to functional units so the processor can execute more instructions more rapidly. Translation-Lookaside Buffer (TLB): keeps addresses of pages (blocks of memory) in main memory that have recently been accessed (a cache for memory addresses).

14 Memory-Related Terms SRAM: Static Random Access Memory. Very fast (~10 nanoseconds), made using the same kind of circuitry as the processors, so speed is comparable. DRAM: Dynamic RAM. Longer access times (~100 nanoseconds), but holds more bits and is much less expensive (10x cheaper). Memory hierarchy: the hierarchy of memory in a parallel system, from registers to cache to local memory to remote memory. More later.

15 Interconnect-Related Terms Latency: Networks: how long does it take to start sending a "message"? Measured in microseconds. Processors: how long does it take to output the result of an operation (such as a pipelined floating point add or divide)? Bandwidth: what data rate can be sustained once the message is started? Measured in Mbytes/sec or Gbytes/sec.
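Latency and bandwidth are often combined into a first-order message-time model, time(n) = latency + n / bandwidth; a minimal sketch in C, with made-up constants (10 microseconds, 100 Mbytes/sec) purely for illustration:

```c
#include <stdio.h>

/* First-order model: time to move n bytes = startup latency + n / bandwidth. */
double message_time(double latency_s, double bandwidth_Bps, double n_bytes) {
    return latency_s + n_bytes / bandwidth_Bps;
}

int main(void) {
    double latency = 10e-6;    /* hypothetical 10 microsecond startup cost */
    double bandwidth = 100e6;  /* hypothetical 100 Mbytes/sec sustained rate */
    /* Small messages are latency-dominated, large ones bandwidth-dominated. */
    printf("1 KB : %.2e s\n", message_time(latency, bandwidth, 1e3));
    printf("10 MB: %.2e s\n", message_time(latency, bandwidth, 1e7));
    return 0;
}
```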

16 Interconnect-Related Terms ● Topology: the manner in which the nodes are connected. The ideal would be a fully connected network (every processor linked to every other), but this is infeasible for cost and scaling reasons. Instead, processors are arranged in some variation of a grid, torus, or hypercube (see the link-count sketch below).
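The link-count sketch below quantifies why full connectivity does not scale: a fully connected network needs p(p-1)/2 links, while a hypercube needs only (p/2)·log2(p). The node counts are arbitrary examples.

```c
#include <stdio.h>
#include <math.h>

/* Compare wiring cost of a fully connected network with a hypercube. */
int main(void) {
    long sizes[] = {16, 256, 4096};   /* arbitrary example node counts */
    for (int i = 0; i < 3; i++) {
        long p = sizes[i];
        long full  = p * (p - 1) / 2;                /* every node to every other */
        long hyper = p / 2 * (long)log2((double)p);  /* log2(p) links per node, each shared */
        printf("p=%5ld  fully connected: %9ld links   hypercube: %6ld links\n",
               p, full, hyper);
    }
    return 0;
}
```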

17 Putting the Pieces Together ● Shared memory architectures: ➔ Uniform Memory Access (UMA): Symmetric Multi-Processors (SMP). Ex: Sun E10000 ➔ Non-Uniform Memory Access (NUMA): Most common are Distributed Shared Memory (DSM), or cc-NUMA (cache coherent NUMA) systems. Ex: SGI Origin 2000 ● Distributed memory architectures: ➔ Massively Parallel Processor (MPP): tightly integrated system, single system image. Ex: CRAY T3E, IBM SP ➔ Clusters: commodity nodes connected by interconnect. Example: Beowulf clusters.

18 Symmetric Multiprocessors (SMPs) ● SMPs connect processors to global shared memory using one of: ➔ bus ➔ crossbar ● Provides simple programming model, but has problems: ➔ buses can become saturated ➔ crossbar size must increase with # processors ● Problem grows with number of processors, limiting maximum size of SMPs

19 Shared Memory Programming Programming models are easier since message passing is not necessary. Techniques: ➔ autoparallelization via compiler options ➔ loop-level parallelism via compiler directives ➔ OpenMP ➔ pthreads
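A minimal OpenMP sketch of the loop-level parallelism mentioned above; the vector-add loop is an invented example (compile with an OpenMP flag such as -fopenmp):

```c
#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void) {
    static double a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }

    /* The compiler directive splits loop iterations among threads that all
     * read and write the same shared arrays -- no message passing needed. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("c[N-1] = %f (up to %d threads)\n", c[N-1], omp_get_max_threads());
    return 0;
}
```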

20 Massively Parallel Processors ● Each processor has its own memory: ➔ memory is not shared globally ➔ adds another layer to memory hierarchy (remote memory) ● Processor/memory nodes are connected by interconnect network ➔ many possible topologies ➔ processors must pass data via messages ➔ communication overhead must be minimized

21 Types of Interconnections ● Fully connected ➔ not feasible ● Array and torus ➔ Intel Paragon (2D array), CRAY T3E (3D torus) ● Crossbar ➔ IBM SP (8 nodes) ● Hypercube and fat tree ➔ SGI Origin 2000 (hypercube), Meiko CS-2 (fat tree) ● Combinations of some of the above ➔ IBM SP (crossbar & fully connected for up to 80 nodes) ➔ IBM SP (fat tree for > 80 nodes)
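One attraction of the hypercube topology: a node's neighbours can be computed directly from its binary address by flipping one bit at a time. A small sketch (not any vendor's actual routing code):

```c
#include <stdio.h>

/* In a d-dimensional hypercube of 2^d nodes, node i is linked to the d nodes
 * whose addresses differ from i in exactly one bit. */
void print_neighbours(unsigned node, unsigned dims) {
    printf("node %u:", node);
    for (unsigned k = 0; k < dims; k++)
        printf(" %u", node ^ (1u << k));   /* flip bit k */
    printf("\n");
}

int main(void) {
    for (unsigned node = 0; node < 8; node++)
        print_neighbours(node, 3);         /* 3-D hypercube: 8 nodes, 3 links each */
    return 0;
}
```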

22 Distributed Memory Programming ● Message passing is most efficient: ➔ MPI ➔ MPI-2 ➔ Active/one-sided messages: vendor libraries such as SHMEM (T3E) and LAPI (SP); coming in MPI-2 ● Shared memory models can be implemented in software, but are not as efficient.
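A minimal MPI message-passing sketch in C; the single-value send/receive is an invented example, not part of the original slides (build with mpicc, launch with mpirun):

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double value = 3.14;
    if (rank == 0 && size > 1) {
        /* Data moves between address spaces only via explicit messages. */
        MPI_Send(&value, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %f from rank 0\n", value);
    }

    MPI_Finalize();
    return 0;
}
```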

23 Distributed Shared Memory ● More generally called cc-NUMA (cache coherent NUMA) ● Consists of m SMPs with n processors in a global address space: ➔ Each processor has some local memory (SMP) ➔ All processors can access all memory: extra “directory” hardware on each SMP tracks values stored in all SMPs ➔ Hardware guarantees cache coherency ➔ Access to memory on other SMPs slower (NUMA)

24 Distributed Shared Memory ● Easier to build than a large SMP (no expensive bus/crossbar), at the cost of slower access to remote memory ● Similar cache problems ● Code writers should be aware of data distribution ● Load balance: minimize accesses to "far" memory (see the first-touch sketch below)
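The first-touch sketch referenced above: on many cc-NUMA systems a page is placed on the node that first touches it, so initializing data with the same threads (and schedule) that later use it keeps most accesses "near". This assumes a first-touch placement policy, which is common but not universal; compile with OpenMP enabled.

```c
#include <stdlib.h>

#define N 10000000

int main(void) {
    double *x = malloc(N * sizeof *x);
    if (!x) return 1;

    /* First touch in parallel: each thread's chunk of x tends to be placed
     * in memory local to the SMP node running that thread. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++)
        x[i] = 0.0;

    /* Later loops reuse the same static schedule, so each thread mostly
     * touches "near" memory instead of "far" memory on other SMP nodes. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++)
        x[i] = 2.0 * i;

    free(x);
    return 0;
}
```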

