1 Parallel Scientific Computing: Algorithms and Tools Lecture #3 APMA 2821A, Spring 2008 Instructors: George Em Karniadakis Leopold Grinberg.

2 Levels of Parallelism
- Job level parallelism: capacity computing
  - Goal: run as many jobs as possible on a system in a given time period. Concerned with throughput; an individual user’s jobs may not run faster.
  - Of interest to administrators
- Program/task level parallelism: capability computing
  - Use multiple processors to solve a single problem.
  - Controlled by users.
- Instruction level parallelism:
  - Pipeline, multiple functional units, multiple cores.
  - Invisible to users.
- Bit-level parallelism:
  - Of concern to hardware designers of arithmetic-logic units

3 Granularity of Parallel Tasks
- Large/coarse grain parallelism:
  - The number of operations that run in parallel is fairly large
  - e.g., on the order of an entire program
- Small/fine grain parallelism:
  - The number of operations that run in parallel is relatively small
  - e.g., on the order of a single loop
- Coarse/large grains usually result in more favorable parallel performance

4 Flynn’s Taxonomy of Computers
- SISD: Single instruction stream, single data stream
- MISD: Multiple instruction streams, single data stream
- SIMD: Single instruction stream, multiple data streams
- MIMD: Multiple instruction streams, multiple data streams

5 Classification of Computers
- SISD: single instruction single data
  - Conventional computers
  - CPU fetches from one instruction stream and works on one data stream.
  - Instructions may run in parallel (superscalar).
- MISD: multiple instruction single data
  - No real-world implementation.

6 Classification of Computers
- SIMD: single instruction multiple data
  - Controller + processing elements (PEs)
  - Controller dispatches an instruction to the PEs; all PEs execute the same instruction, but on different data
  - e.g., MasPar MP-1, Thinking Machines CM-1, vector computers (?)
- MIMD: multiple instruction multiple data
  - Processors execute their own instructions on different data streams
  - Processors communicate with one another directly, or through shared memory.
  - Usual parallel computers, clusters of workstations

7 Flynn’s Taxonomy

8 Programming Model
- SPMD: Single program multiple data
- MPMD: Multiple programs multiple data

9 Programming Model
- SPMD: Single program multiple data
  - Usual parallel programming model
  - All processors execute the same program, on multiple data sets (domain decomposition)
  - Each processor knows its own ID and can branch on it:
    if (my_cpu_id == N) { } else { }
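As a rough illustration of the SPMD idea, the sketch below runs one function for every processor ID and lets the ID select both the data block and the branch taken. The names `block_range` and `spmd_program` are hypothetical, and the processors are only simulated by calling the function once per ID; a real code would obtain the ID from the parallel runtime.

```python
def block_range(my_cpu_id, num_cpus, n):
    """Half-open index range [lo, hi) of the block owned by one processor.

    The n items are split as evenly as possible; the first n % num_cpus
    processors get one extra item.
    """
    base, extra = divmod(n, num_cpus)
    lo = my_cpu_id * base + min(my_cpu_id, extra)
    hi = lo + base + (1 if my_cpu_id < extra else 0)
    return lo, hi


def spmd_program(my_cpu_id, num_cpus, data):
    """The single program that every processor executes."""
    lo, hi = block_range(my_cpu_id, num_cpus, len(data))
    partial = sum(data[lo:hi])      # each processor works on its own block
    if my_cpu_id == 0:              # same program, ID-dependent branch
        role = "root"
    else:
        role = "worker"
    return role, partial


# Simulate three processors serially (an assumption made for illustration).
data = list(range(10))
results = [spmd_program(cpu, 3, data) for cpu in range(3)]
total = sum(partial for _, partial in results)
```

The block decomposition here is one simple form of domain decomposition; the per-processor partial sums add up to the sum of the whole array.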

10 Programming Model
- MPMD: Multiple programs multiple data
  - Different processors execute different programs, on different data
  - Usually a master-slave model is used: the master CPU spawns and dispatches computations to slave CPUs running a different program.
  - Can be converted into the SPMD model:
    if (my_cpu_id == 0) run function_containing_program_1; else run function_containing_program_2;
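The conversion from MPMD to SPMD can be sketched as a single program that branches on the processor ID into a master function or a worker function. In this minimal sketch, Python threads stand in for processors and queues stand in for the communication between them; `master_program`, `worker_program`, `single_program`, and `run` are hypothetical names, and squaring a number is a placeholder for real work.

```python
import queue
import threading


def master_program(task_q, result_q, num_workers, tasks):
    """Program 1: hand out tasks, then collect one result per task."""
    for t in tasks:
        task_q.put(t)
    for _ in range(num_workers):
        task_q.put(None)                    # one stop signal per worker
    return sorted(result_q.get() for _ in tasks)


def worker_program(task_q, result_q):
    """Program 2: process tasks until the stop signal arrives."""
    while True:
        t = task_q.get()
        if t is None:
            break
        result_q.put(t * t)                 # placeholder computation


def single_program(my_cpu_id, num_cpus, task_q, result_q, tasks, out):
    """SPMD wrapper: every 'processor' runs this, the ID picks the role."""
    if my_cpu_id == 0:
        out.append(master_program(task_q, result_q, num_cpus - 1, tasks))
    else:
        worker_program(task_q, result_q)


def run(num_cpus, tasks):
    if num_cpus < 2:
        raise ValueError("need one master and at least one worker")
    task_q, result_q, out = queue.Queue(), queue.Queue(), []
    threads = [
        threading.Thread(
            target=single_program,
            args=(cpu, num_cpus, task_q, result_q, tasks, out),
        )
        for cpu in range(num_cpus)
    ]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return out[0]
```

Calling `run(4, [1, 2, 3, 4, 5])` starts one master and three workers and returns the sorted squares.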

11 Classification of Parallel Computers
- Flynn’s MIMD computers contain a wide variety of parallel computers
- Based on memory organization (address space):
  - Shared-memory parallel computers
    - Processors can access all memories
  - Distributed-memory parallel computers
    - Processor can only access local memory
    - Remote memory access through explicit communication

12 Shared-Memory Parallel Computer
- Superscalar processors with L2 cache connected to memory modules through a bus or crossbar
- All processors have access to all machine resources, including memory and I/O devices
- SMP (symmetric multiprocessor): the processors are all the same and have equal access to machine resources, i.e. the machine is symmetric.
- SMPs are UMA (Uniform Memory Access) machines
- e.g., a node of an IBM SP machine; Sun Ultra Enterprise 10000
[Figure: prototype shared-memory parallel computer. Processors P1 … Pn, each with a cache C, connect through a bus or crossbar to memory modules M1 … Mn. P: processor; C: cache; M: memory.]

13 Shared-Memory Parallel Computer
- If bus,
  - Only one processor can access the memory at a time.
  - Processors contend for the bus to access memory
- If crossbar,
  - Multiple processors can access memory through independent paths
  - Contention when different processors access the same memory module
  - Crossbar can be very expensive.
- Processor count limited by memory contention and bandwidth
  - Max usually 64 or 128
[Figure: two layouts of processors P1 … Pn with caches C and memory modules M1 … Mn, one connected by a bus and one by a crossbar.]
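The difference between the two interconnects can be made concrete with a toy cost model. The sketch below assumes that every memory access takes one cycle, that a bus serves one request per cycle, and that a crossbar serves requests to different memory modules simultaneously while serializing requests to the same module. These assumptions and the function names are illustrative only and ignore arbitration, caches, and bandwidth limits.

```python
from collections import Counter


def bus_cycles(requests):
    """Cycles to serve all requests when only one may use the bus at a time.

    `requests` lists the memory-module index each processor wants to access.
    """
    return len(requests)


def crossbar_cycles(requests):
    """Cycles with a crossbar: only requests to the same module collide."""
    return max(Counter(requests).values(), default=0)


spread_out = [0, 1, 2, 3]   # four processors, four different modules
hot_module = [0, 0, 0, 1]   # three processors contend for module 0
```

With `spread_out`, the bus needs four cycles and the crossbar one; with `hot_module`, the crossbar still needs three cycles because module 0 is contended.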

14 Shared-Memory Parallel Computer
- Data flows from memory to cache, then to processors
- Performance depends dramatically on reuse of data in cache
- Fetching data from memory, with potential memory contention, can be expensive
- L2 cache plays the role of local fast memory; shared memory is analogous to extended memory accessed in blocks
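To see why reuse of cached data matters, the following sketch counts misses in a toy direct-mapped cache for two traversal orders of the same 8 x 8 array stored row by row. The cache geometry (4 lines of 4 words) and the function name `count_misses` are assumptions chosen to keep the example small; real caches are far larger and usually set-associative.

```python
def count_misses(addresses, num_lines=4, line_size=4):
    """Count misses of a toy direct-mapped cache for a stream of addresses."""
    lines = [None] * num_lines          # block currently held by each line
    misses = 0
    for addr in addresses:
        block = addr // line_size       # memory block containing the word
        slot = block % num_lines        # the one line this block may occupy
        if lines[slot] != block:        # miss: fetch the block from memory
            lines[slot] = block
            misses += 1
    return misses


n = 8                                   # n x n array, row-major storage
row_order = [i * n + j for i in range(n) for j in range(n)]
col_order = [i * n + j for j in range(n) for i in range(n)]

row_misses = count_misses(row_order)    # consecutive words share a block
col_misses = count_misses(col_order)    # stride-n access defeats reuse
```

In this model the row-wise sweep misses once per block (16 misses in 64 accesses), while the column-wise sweep misses on every access because each fetched block is evicted before its other words are used.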

15 Cache Coherency
- If a piece of data in one processor’s cache is modified, then all other processors’ caches that contain that data must be updated.
- Cache coherency: the state that is achieved by maintaining consistent values of the same data in all processors’ caches.
- Usually hardware maintains cache coherency; system software can also do this, but it is more difficult.
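One common way to keep caches consistent is to invalidate other processors' copies on a write, so that their next read fetches the new value. The sketch below models that idea only; the class `ToySystem`, its write-through policy, and the use of dictionaries as caches are simplifying assumptions, not a description of any particular hardware protocol.

```python
class ToySystem:
    """Shared memory plus one private cache per processor (toy model)."""

    def __init__(self, num_cpus):
        self.memory = {}
        self.caches = [dict() for _ in range(num_cpus)]

    def read(self, cpu, addr):
        cache = self.caches[cpu]
        if addr not in cache:                   # miss: load from memory
            cache[addr] = self.memory.get(addr, 0)
        return cache[addr]

    def write(self, cpu, addr, value):
        for other, cache in enumerate(self.caches):
            if other != cpu:
                cache.pop(addr, None)           # invalidate stale copies
        self.caches[cpu][addr] = value
        self.memory[addr] = value               # write-through to memory


system = ToySystem(num_cpus=2)
system.write(0, "x", 1)
first = system.read(1, "x")     # CPU 1 caches the value 1
system.write(0, "x", 2)         # CPU 1's copy is invalidated
second = system.read(1, "x")    # CPU 1 reloads and sees 2
```

Without the invalidation loop, the second read by CPU 1 would return the stale value 1 from its own cache.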

16 Programming Shared-Memory Parallel Computers
- All memory modules have the same global address space.
  - Closest to a single-processor computer
  - Relatively easy to program.
- Multi-threaded programming:
  - Auto-parallelizing compilers can extract fine-grain (loop-level) parallelism automatically;
  - Or use OpenMP;
  - Or use explicit POSIX (Portable Operating System Interface) threads or other thread libraries.
- Message passing:
  - MPI (Message Passing Interface).

17 Distributed-Memory Parallel Computer
- Superscalar processors with local memory, connected through a communication network.
- Each processor can only work on data in local memory
- Access to remote memory requires explicit communication.
- Present-day large supercomputers are all some sort of distributed-memory machine
- e.g. IBM SP, BlueGene; Cray XT3/XT4
[Figure: prototype distributed-memory computer. Processors P1 … Pn, each with its own memory M, connected by a communication network.]

18 Distributed-Memory Parallel Computer
- High scalability
  - No memory contention such as that in shared-memory machines
  - Now scaled to > 100,000 processors.
- Performance of the network connection is crucial to the performance of applications.
  - Ideal: low latency, high bandwidth
- Communication is much slower than local memory read/write
- Data locality is important: keep frequently used data in local memory

19 Programming Distributed-Memory Parallel Computer
- “Owner computes” rule
  - Problem needs to be broken up into independent tasks with independent memory
  - Each task assigned to a processor
  - Naturally matches data-based decomposition such as a domain decomposition
- Message passing: tasks explicitly exchange data by message passing.
  - Transfers all data using explicit send/receive instructions
  - User must optimize communications
  - Usually MPI (used to be PVM), portable, high performance
- Parallelization mostly at large granularity level, controlled by user
  - Difficult for compilers/auto-parallelization tools
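The "owner computes" rule combined with explicit send/receive can be sketched with a 1D domain decomposition: each task owns a block of an array, sends its edge values to its neighbors, and then updates only its own cells. In this illustration a dictionary named `mailbox` stands in for the network, the tasks run one after another, and the three-point average is an arbitrary example computation; a real program would use a message-passing library such as MPI. The sketch assumes the array has at least as many entries as there are tasks.

```python
def serial_smooth(u):
    """Reference: 3-point average of interior points, end points kept."""
    inner = [(u[i - 1] + u[i] + u[i + 1]) / 3 for i in range(1, len(u) - 1)]
    return [u[0]] + inner + [u[-1]]


def distributed_smooth(u, num_tasks):
    n = len(u)
    bounds = [(r * n // num_tasks, (r + 1) * n // num_tasks)
              for r in range(num_tasks)]
    local = [u[lo:hi] for lo, hi in bounds]     # each task's private memory

    # Send phase: every task posts its edge values to its neighbors.
    mailbox = {}                                # key: (sender, receiver)
    for r in range(num_tasks):
        if r > 0:
            mailbox[(r, r - 1)] = local[r][0]
        if r < num_tasks - 1:
            mailbox[(r, r + 1)] = local[r][-1]

    # Receive + compute phase: each task updates only the cells it owns.
    result = []
    for r in range(num_tasks):
        left = mailbox.get((r - 1, r))          # None at the global boundary
        right = mailbox.get((r + 1, r))
        padded = [left] + local[r] + [right]    # block plus ghost cells
        for k, value in enumerate(local[r]):
            lo_nb, hi_nb = padded[k], padded[k + 2]
            if lo_nb is None or hi_nb is None:
                result.append(value)            # global end point: unchanged
            else:
                result.append((lo_nb + value + hi_nb) / 3)
    return result


u = [0, 3, 6, 9, 12, 15, 18]
```

For this input, `distributed_smooth(u, 3)` gives the same list as `serial_smooth(u)`, and the only data crossing task boundaries are the single edge values placed in `mailbox`.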

20 Programming Distributed-Memory Parallel Computer
- A global address space is provided on some distributed-memory machines
  - Memory is physically distributed, but globally addressable; the machine can be treated as a “shared-memory” machine; so-called distributed shared memory.
  - Cray T3E; SGI Altix, Origin.
- Multi-threaded programs (OpenMP, POSIX threads) can also be used on such machines
  - User accesses remote memory as if it were local; OS/compilers translate such accesses to fetch/store over the communication network.
  - But data locality is difficult to control; performance may suffer.
- NUMA (non-uniform memory access); ccNUMA (cache coherent non-uniform memory access); overhead

21 Hybrid Parallel Computer
- Overall distributed memory, SMP nodes
- Most modern supercomputers and workstation clusters are of this type
- Message passing; or hybrid message passing/threading.
- e.g. IBM SP, Cray XT3
[Figure: hybrid parallel computer. SMP nodes, each with processors P and memories M joined by a bus or crossbar, are connected by a communication network.]

22 Interconnection Network/Topology
- Nodes, links
- Neighbors: nodes with a link between them
- Degree of a node: number of neighbors it has
- Scalability: increase in complexity when more nodes are added.
[Figure: ring; fully connected network]
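The definitions of neighbors, degree, and scalability can be checked on the two pictured topologies. The sketch below represents a network as a dictionary from node to its set of neighbors; the helper names are hypothetical, and the ring builder assumes at least three nodes.

```python
def ring(p):
    """Ring of p >= 3 nodes: each node links to its two adjacent nodes."""
    return {i: {(i - 1) % p, (i + 1) % p} for i in range(p)}


def fully_connected(p):
    """Every node links to every other node."""
    return {i: set(range(p)) - {i} for i in range(p)}


def degree(adj, node):
    return len(adj[node])


def num_links(adj):
    """Each link is stored at both of its end points, so halve the total."""
    return sum(len(nbrs) for nbrs in adj.values()) // 2


ring8 = ring(8)
full8 = fully_connected(8)
```

For 8 nodes the ring has degree 2 and 8 links, while the fully connected network has degree 7 and 28 links. The ring's link count grows linearly with the node count and the fully connected network's grows quadratically, which is the scalability concern named above.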

23 Topology
[Figure: hypercube]
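In a hypercube of dimension d, the 2^d nodes can be labeled with d-bit numbers so that two nodes are linked exactly when their labels differ in one bit. A minimal sketch of that labeling follows; the function names are hypothetical.

```python
def hypercube_neighbors(node, dim):
    """Neighbors of `node` in a dim-dimensional hypercube: flip each bit."""
    return [node ^ (1 << k) for k in range(dim)]


def hypercube_distance(a, b):
    """Minimum number of hops between a and b: count of differing bits."""
    return bin(a ^ b).count("1")


neighbors_of_5 = hypercube_neighbors(5, 3)   # 5 is 101 in binary
```

Node 5 in a 3-dimensional hypercube has neighbors 4, 7, and 1, so its degree equals the dimension, and the farthest node (2, binary 010) is three hops away.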

24 Topology
[Figure: 1D/2D mesh/torus; 3D mesh/torus]
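The difference between a mesh and a torus is the wrap-around links at the edges. The sketch below lists the neighbors of a node at grid position (i, j) in a 2D mesh and in a 2D torus; the function names and the (row, column) indexing are assumptions made for illustration.

```python
def mesh_neighbors(i, j, rows, cols):
    """2D mesh: up/down/left/right neighbors that lie inside the grid."""
    candidates = [(i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)]
    return [(a, b) for a, b in candidates if 0 <= a < rows and 0 <= b < cols]


def torus_neighbors(i, j, rows, cols):
    """2D torus: same four directions, with wrap-around at the edges."""
    return [((i - 1) % rows, j), ((i + 1) % rows, j),
            (i, (j - 1) % cols), (i, (j + 1) % cols)]


corner_mesh = mesh_neighbors(0, 0, 4, 4)
corner_torus = torus_neighbors(0, 0, 4, 4)
```

A corner node of a 4 x 4 mesh has only two neighbors, whereas in the torus every node, including the corner, has four.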

25 Topology
[Figure: tree; star]

26 Topology
- Bisection width: minimum number of links that must be cut in order to divide the topology into two independent networks of the same size (plus/minus one node)
- Bisection bandwidth: communication bandwidth across the links that are cut in defining the bisection width
- The larger the bisection bandwidth, the better
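The definition of bisection width can be applied directly to small networks by trying every split into two halves and counting the links that cross. The brute-force sketch below does exactly that; its cost grows exponentially with the number of nodes, so it is only meant for checking small examples. The function names and the adjacency-dictionary representation are assumptions made for illustration.

```python
from itertools import combinations


def ring(p):
    return {i: {(i - 1) % p, (i + 1) % p} for i in range(p)}


def fully_connected(p):
    return {i: set(range(p)) - {i} for i in range(p)}


def hypercube(dim):
    return {i: {i ^ (1 << k) for k in range(dim)} for i in range(2 ** dim)}


def bisection_width(adj):
    """Minimum number of links cut over all splits into (near-)equal halves."""
    nodes = sorted(adj)
    half = len(nodes) // 2
    best = None
    for group in combinations(nodes, half):
        side = set(group)
        # Count links with one end point inside `side` and one outside.
        cut = sum(1 for u in side for v in adj[u] if v not in side)
        best = cut if best is None else min(best, cut)
    return best
```

For these small cases the search gives 2 for a 6-node ring, 4 for a 3-dimensional hypercube (8 nodes), and 4 for a fully connected network of 4 nodes.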
