
1 Course Outline
– Introduction to algorithms and applications
– Parallel machines and architectures: overview of parallel machines, trends in the top-500, cluster computers, BlueGene
– Programming methods, languages, and environments: message passing (SR, MPI, Java); higher-level language: HPF
– Applications: N-body problems, search algorithms, bioinformatics
– Grid computing: multimedia content analysis on Grids (guest lecture Frank Seinstra)

2 Parallel Machines
Reference: Barry Wilkinson and Michael Allen, Parallel Computing – Techniques and Applications Using Networked Workstations and Parallel Computers (2/e), Pearson, 2005; Section 1.3 + (part of) 1.4

3 Overview
– Processor organizations
– Types of parallel machines: processor arrays, shared-memory multiprocessors, distributed-memory multicomputers
– Cluster computers
– Blue Gene

4 Processor Organization
Network topology is a graph:
– A node is a processor
– An edge is a communication path
Evaluation criteria:
– Diameter (maximum distance between any two nodes)
– Bisection width (minimum number of edges that must be removed to split the graph into two almost equal halves)
– Number of edges per node
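
To make these criteria concrete, here is a minimal Python sketch that computes the diameter with BFS and, for tiny graphs only, the bisection width by brute force. The adjacency-list representation and the 4-node ring example are illustrative assumptions, not from the slides.

```python
from collections import deque
from itertools import combinations

def diameter(adj):
    """Maximum shortest-path distance between any two nodes (BFS from each node)."""
    def eccentricity(src):
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        return max(dist.values())
    return max(eccentricity(v) for v in adj)

def bisection_width(adj):
    """Minimum number of edges crossing a split into two almost equal halves.
    Brute force over all balanced splits -- only feasible for tiny graphs
    (the general problem is NP-hard)."""
    nodes = list(adj)
    best = float("inf")
    for half in combinations(nodes, len(nodes) // 2):
        half = set(half)
        crossing = sum(1 for u in half for v in adj[u] if v not in half)
        best = min(best, crossing)
    return best

# Tiny example: a 4-node ring.
ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}
print(diameter(ring))                            # 2
print(bisection_width(ring))                     # 2
print(max(len(nbrs) for nbrs in ring.values()))  # edges per node: 2
```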

5 Mesh
q-dimensional lattice; q=2 -> 2-D grid
– Number of nodes: k²
– Diameter: 2(k - 1)
– Bisection width: k
– Edges per node: 4
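
As a sanity check of these formulas, a small sketch that builds a k×k grid as an adjacency dict and verifies the node count, diameter, and degree for k = 4 (the BFS helper is repeated so the snippet runs stand-alone):

```python
from collections import deque

def diameter(adj):
    """BFS-based diameter, as in the earlier sketch."""
    def eccentricity(src):
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        return max(dist.values())
    return max(eccentricity(v) for v in adj)

def mesh(k):
    """k x k 2-D grid: node (r, c) connects to its up/down/left/right neighbours."""
    adj = {}
    for r in range(k):
        for c in range(k):
            steps = ((1, 0), (-1, 0), (0, 1), (0, -1))
            adj[(r, c)] = [(r + dr, c + dc) for dr, dc in steps
                           if 0 <= r + dr < k and 0 <= c + dc < k]
    return adj

k = 4
grid = mesh(k)
print(len(grid) == k * k)                       # number of nodes: k²
print(diameter(grid) == 2 * (k - 1))            # diameter: 2(k - 1)
print(max(len(n) for n in grid.values()) == 4)  # edges per (interior) node: 4
```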

6 Binary Tree
– Number of nodes: 2^k - 1
– Diameter: 2(k - 1)
– Bisection width: 1
– Edges per node: 3

7 Hypertree
Tree with multiple roots (see Figure 3-3), gives better bisection width. 4-ary tree:
– Number of nodes: 2^k (2^(k+1) - 1)
– Diameter: 2k
– Bisection width: 2^(k+1)
– Edges per node: 6

8 Engineering solution: fat tree
Tree with more bandwidth at links near the root
Example: CM-5

9 Hypercube
k-dimensional cube; each node has a binary label, and nodes that differ in 1 bit are connected
– Number of nodes: 2^k
– Diameter: k
– Bisection width: 2^(k-1)
– Edges per node: k

10 Hypercube
Label nodes with a binary value; connect nodes that differ in 1 coordinate
– Number of nodes: 2^k
– Diameter: k
– Bisection width: 2^(k-1)
– Edges per node: k
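
The bit-label construction makes a hypercube easy to generate: node v is adjacent to v XOR (1 << b) for every bit position b. A small stand-alone sketch (helper names are illustrative) verifying the metrics above for k = 4:

```python
from collections import deque

def diameter(adj):
    """BFS-based diameter, repeated so the snippet runs stand-alone."""
    def eccentricity(src):
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        return max(dist.values())
    return max(eccentricity(v) for v in adj)

def hypercube(k):
    """k-dimensional hypercube: node v connects to v XOR (1 << b) for every bit b."""
    return {v: [v ^ (1 << b) for b in range(k)] for v in range(2 ** k)}

k = 4
cube = hypercube(k)
print(len(cube) == 2 ** k)                            # number of nodes: 2^k
print(diameter(cube) == k)                            # diameter: k
print(all(len(nbrs) == k for nbrs in cube.values()))  # edges per node: k
```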

11 Types of parallel machines
– Processor arrays
– Shared-memory multiprocessors
– Distributed-memory multicomputers

12 Processor Arrays
Instructions operate on scalars or vectors. Processor array = front-end + synchronized processing elements.
Front-end:
– Sequential machine that executes the program
– Vector operations are broadcast to the PEs
Processing element:
– Performs the operation on its part of the vector
– Communicates with other PEs through a network
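
A toy simulation of this front-end/PE split; the FrontEnd and PE classes are illustrative assumptions, not modeled on any real machine:

```python
import operator

class PE:
    """Processing element: owns one slice of every vector."""
    def __init__(self, slice_):
        self.slice = list(slice_)

    def apply(self, op, operand_slice):
        # Each PE performs the broadcast operation on its own part of the vector.
        self.slice = [op(a, b) for a, b in zip(self.slice, operand_slice)]

class FrontEnd:
    """Sequential machine that runs the program and broadcasts vector operations."""
    def __init__(self, vector, n_pes):
        chunk = len(vector) // n_pes
        self.pes = [PE(vector[i * chunk:(i + 1) * chunk]) for i in range(n_pes)]

    def vector_op(self, op, operand):
        chunk = len(operand) // len(self.pes)
        # Broadcast: every PE receives the same operation in lockstep.
        for i, pe in enumerate(self.pes):
            pe.apply(op, operand[i * chunk:(i + 1) * chunk])

    def gather(self):
        return [x for pe in self.pes for x in pe.slice]

fe = FrontEnd([1, 2, 3, 4, 5, 6, 7, 8], n_pes=4)
fe.vector_op(operator.add, [10] * 8)
print(fe.gather())  # [11, 12, 13, 14, 15, 16, 17, 18]
```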

13 Examples of Processor Arrays
– CM-200, MasPar MP-1 and MP-2, ICL DAP (~1970s)
– Japanese Earth Simulator (2002, former #1 of the top-500)

14 Shared-Memory Multiprocessors
The bus easily gets saturated => add caches to the CPUs
Central problem: cache coherency
– Snooping cache: monitor the bus, invalidate copy on a write
– Write-through or copy-back
Bus-based multiprocessors do not scale
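
A minimal sketch of the write-invalidate idea behind snooping caches, using a toy write-through scheme; the Bus and Cache classes are illustrative, not a real coherence protocol such as MESI:

```python
class Bus:
    def __init__(self):
        self.caches = []
        self.memory = {}

    def broadcast_invalidate(self, writer, addr):
        # Snooping: every other cache drops its copy when it sees the write.
        for c in self.caches:
            if c is not writer:
                c.data.pop(addr, None)

class Cache:
    def __init__(self, bus):
        self.data = {}  # address -> value: the lines this cache holds
        self.bus = bus
        bus.caches.append(self)

    def read(self, addr):
        if addr not in self.data:  # miss: fetch from memory
            self.data[addr] = self.bus.memory.get(addr, 0)
        return self.data[addr]

    def write(self, addr, value):
        self.bus.broadcast_invalidate(self, addr)
        self.data[addr] = value
        self.bus.memory[addr] = value  # write-through to memory

bus = Bus()
c1, c2 = Cache(bus), Cache(bus)
c1.write(0x10, 42)
print(c2.read(0x10))  # 42: c2 misses and fetches the fresh value
c2.write(0x10, 99)
print(c1.read(0x10))  # 99: c1's copy was invalidated, so it re-fetches
```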

15 Other Multiprocessor Designs (1/2)
Switch-based multiprocessors (e.g., crossbar)
Expensive: requires many very fast components

16 Other Multiprocessor Designs (2/2)
Non-Uniform Memory Access (NUMA) multiprocessors:
– Memory is distributed
– Some memory is faster to access than other memory
Example: Teras at SARA, the Dutch national supercomputer (1024-node SGI)

17 Distributed-Memory Multicomputers
Each processor only has a local memory. Processors communicate by sending messages over a network.
Routing of messages:
– Packet-switched message routing: split the message into packets, buffered at intermediate nodes (store-and-forward, cut-through routing, wormhole routing)
– Circuit-switched message routing: establish a path between source and destination

18 Store-and-forward Routing
– Messages are forwarded one node at a time
– Forwarding is done in software
– Every processor on the path from source to destination is involved
– Latency is proportional to distance × message length
Examples: Parsytec GCel (T800 transputers), Intel iPSC

19 Circuit-switched Message Routing
– Each node has a routing module
– A circuit is set up between source and destination
– Latency is proportional to distance + message length
Example: Intel iPSC/2

20 Modern routing techniques
– Circuit switching: needs to reserve all links in the path (cf. the old telephone system)
– Packet switching: high latency, needs buffering space (cf. postal mail)
– Cut-through routing: packet switching, but packets are immediately forwarded (without buffering) if the outgoing link is available
– Wormhole routing: transmit the head (a few bits) of the message; the rest follows like a worm
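
A simplified latency model contrasting store-and-forward with cut-through routing; the per-hop delay, per-byte time, and header size below are made-up illustrative parameters, not measurements of any machine:

```python
def store_and_forward(hops, msg_bytes, per_hop=1e-6, byte_time=1e-9):
    """The whole message is received and re-sent at every node on the path,
    so latency grows with hops x message length."""
    return hops * (per_hop + msg_bytes * byte_time)

def cut_through(hops, msg_bytes, header_bytes=8, per_hop=1e-6, byte_time=1e-9):
    """Only the header pays the per-hop cost; the body streams behind it,
    so latency grows with hops + message length."""
    return hops * (per_hop + header_bytes * byte_time) + msg_bytes * byte_time

for msg_bytes in (1_000, 100_000):
    print(msg_bytes,
          store_and_forward(hops=10, msg_bytes=msg_bytes),
          cut_through(hops=10, msg_bytes=msg_bytes))
```

For the large message the store-and-forward figure comes out roughly ten times larger, matching the distance × message length behaviour of slide 18, while cut-through stays close to distance + message length as on slide 19.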

21 Distributed Shared Memory
Shared memory is relatively easy to program, but doesn't scale. Distributed memory is hard to program, but does scale.
Distributed Shared Memory (DSM): provide a shared-memory programming model on top of distributed-memory hardware
– Shared Virtual Memory (SVM): use the memory management hardware (paging) and copy pages over the network
– Object-based: provide replicated shared objects (Orca language)
DSM was a hot research topic in the 1990s, but performance remained the bottleneck

22 Flynn's Taxonomy
Instruction stream: sequence of instructions
Data stream: sequence of data manipulated by the instructions

                      Single Data   Multiple Data
Single Instruction    SISD          SIMD
Multiple Instruction  MISD          MIMD

– SISD (Single Instruction, Single Data): traditional uniprocessors
– SIMD (Single Instruction, Multiple Data): processor arrays
– MISD (Multiple Instruction, Single Data): nonexistent?
– MIMD (Multiple Instruction, Multiple Data): multiprocessors and multicomputers

