1 EECS 570: Fall 2003 -- rev3. Chapter 7 (excl. 7.9): Scalable Multiprocessors

2 Outline
Scalability
–bus doesn't hack it: need a scalable interconnect (network)
Realizing Programming Models on Scalable Systems
–network transactions
–protocols: shared address space; message passing (synchronous, asynchronous, etc.); active messages
–safety: buffer overflow, fetch deadlock
Communication Architecture Design Space
–where does the network attach to the processor node?
–how much hardware interpretation of the network transaction?
–impact on cost & performance

3 Scalability
How do bandwidth, latency, cost, and packaging scale with P?
Ideal:
–latency, per-processor bandwidth, and per-processor cost are constants
–packaging does not create an upper bound and does not exacerbate the others
Bus:
–per-processor BW scales as 1/P
–latency increases with P: queuing delays for fixed bus length (linear at saturation?); as bus length is increased to accommodate more CPUs, the clock must slow
Reality:
–"scalable" may just mean sub-linear dependence (e.g., logarithmic)
–practical limits ($/customer), sweet spot + growth path
–switched interconnect (network)

4 Aside on Cost-Effective Computing
Traditional view: efficiency = speedup(P)/P
Efficiency < 1 → parallelism not worthwhile?
But much of a computer's cost is NOT in the processor (memory, packaging, interconnect, etc.) [Wood & Hill, IEEE Computer 2/95]
Let Costup(P) = Cost(P)/Cost(1)
Parallel computing is cost-effective when Speedup(P) > Costup(P) (worked example below)
E.g., for an SGI PowerChallenge w/500MB: Costup(32) = 8.6
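As a worked instance of the criterion, in the Wood & Hill formulation cited above (the 31% figure below follows from the slide's costup number):

```latex
% Criterion (Wood & Hill): a parallel run is cost-effective
% iff its speedup exceeds its costup.
\[
\mathrm{Costup}(P) = \frac{\mathrm{Cost}(P)}{\mathrm{Cost}(1)},
\qquad
\text{cost-effective} \iff \mathrm{Speedup}(P) > \mathrm{Costup}(P).
\]
% With Costup(32) = 8.6, a speedup of just 10 on 32 processors is
% already cost-effective, even though efficiency is only 10/32 (~31%).
```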

5 Network Transaction Primitive
One-way transfer of information from source to destination that causes some action at the destination:
–process info and/or deposit in a buffer
–state change (e.g., set flag, interrupt program)
–maybe initiate a reply (a separate network transaction)

6 Network Transaction Correctness Issues
Protection
–user/user, user/system; what if VM doesn't apply?
–fault containment (large-machine component MTBF may be low)
Format
–variable length? header info?
–affects efficiency and the ability to handle in HW
Buffering/flow control
–finite buffering in the network itself
–messages show up unannounced (e.g., many-to-one pattern)
Deadlock avoidance
–if you're not careful -- details later
Action
–what happens on delivery? how many options are provided?
System guarantees
–delivery, ordering

7 Performance Issues
Key parameters: latency, overhead, bandwidth
LogP: latency / overhead / gap (BW), as a function of P
[Figure: sending CPU and NI, network, receiving NI and CPU]

8 Programming Models
What is the user's view of a network transaction?
–depends on system architecture
–remember the layered approach: OS/compiler/library may implement an alternative model on top of the HW-provided model
We'll look at three basic ones:
–Active Messages: "assembly language" for msg-passing systems
–Message Passing: MPI-style interface, as seen by application programmers
–Shared Address Space: ignoring cache coherence for now

9 Active Messages
User-level analog of the network transaction
–invoke a handler function at the receiver to extract the packet from the network (dispatch sketch below)
–grew out of attempts to do dataflow on msg-passing machines & remote procedure calls
–handler may send a reply, but no other messages
–event notification: interrupts, polling, events?
–may also perform memory-to-memory transfer
Flexible (can do almost any action on msg reception), but requires tight cooperation between CPU and network for high performance
–may be better to have HW do a few things faster
[Figure: request invokes handler at receiver; handler may send a reply]
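A minimal dispatch-loop sketch of the active-message idea; the packet format, handler table, and ni_try_recv callback are hypothetical, not any particular machine's interface:

```c
/* Active-message sketch: each packet names a handler; the receiver's
 * dispatch loop pulls packets off the NI and invokes the named handler
 * immediately to drain the network. */
#include <stdint.h>

#define AM_PAYLOAD_WORDS 4

typedef struct {
    uint32_t handler_id;                 /* index into handler table */
    uint32_t src_node;                   /* so the handler can reply */
    uint64_t payload[AM_PAYLOAD_WORDS];
} am_packet_t;

typedef void (*am_handler_t)(const am_packet_t *pkt);

static am_handler_t handler_table[256];  /* registered at init time */

/* Runs on message arrival (poll or interrupt). A handler may send one
 * reply but no other messages, and must not block. */
void am_poll(am_packet_t *(*ni_try_recv)(void)) {
    am_packet_t *pkt;
    while ((pkt = ni_try_recv()) != NULL)
        handler_table[pkt->handler_id & 0xff](pkt);
}
```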

10 Message Passing
Basic idea:
–Send(dest, tag, buffer) -- tag is an arbitrary integer
–Recv(src, tag, buffer) -- src/tag may be a wildcard ("any")
Completion semantics:
–receive completes after the data transfer from the matching send is complete
–synchronous send completes after the matching receive and the data have been sent
–asynchronous send completes once the send buffer may be reused; the msg may simply be copied into an alternate buffer, on the src or dest node
Blocking vs. non-blocking:
–does the function wait for "completion" before returning?
–non-blocking: extra function calls to check for completion
–assume blocking for now (MPI sketch below)
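A minimal blocking send/recv pair in MPI, matching the Send(dest, tag, buffer) / Recv(src, tag, buffer) model above; MPI_ANY_SOURCE and MPI_ANY_TAG are the wildcards:

```c
/* Blocking MPI send/recv: rank 0 sends four ints with tag 7, rank 1
 * receives with wildcards and inspects the status. Compile with mpicc. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, buf[4] = {1, 2, 3, 4};
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        MPI_Send(buf, 4, MPI_INT, 1, /*tag=*/7, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Status st;
        MPI_Recv(buf, 4, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                 MPI_COMM_WORLD, &st);   /* completes after data arrives */
        printf("got tag %d from rank %d\n", st.MPI_TAG, st.MPI_SOURCE);
    }
    MPI_Finalize();
    return 0;
}
```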

11 Synchronous Message Passing
Three-phase operation: ready-to-send, ready-to-receive, transfer
–can skip the 1st phase if the receiver initiates & specifies the source
Overhead and latency tend to be high
Transfer can achieve high bandwidth with sufficient msg length
Programmer must avoid deadlock (e.g., pairwise exchange; sketch below)
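A sketch of the pairwise-exchange pitfall: if both ranks issue a synchronous send first, each blocks waiting for the other's receive. MPI_Sendrecv is one standard way out:

```c
/* Pairwise exchange. DEADLOCK-PRONE version: both ranks do
 *   MPI_Ssend(sendbuf, ...); MPI_Recv(recvbuf, ...);
 * and block forever on the Ssend. MPI_Sendrecv lets the library
 * interleave the two transfers safely. */
#include <mpi.h>

void exchange(int partner, double *sendbuf, double *recvbuf, int n) {
    MPI_Sendrecv(sendbuf, n, MPI_DOUBLE, partner, 0,
                 recvbuf, n, MPI_DOUBLE, partner, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}
```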

12 Asynch. Msg Passing: Conservative
Same as synchronous, except the msg can be buffered on the sender
–allows computation to continue sooner
–deadlock is still an issue (though less of one) -- buffering is finite

13 Asynch. Message Passing: Optimistic
Sender just ships the data and hopes the receiver can handle it
Benefit: lower latency
Problems:
–receive was posted: need fast lookup while data streams in (sketch below)
–receive not posted: buffer? nack? discard?
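A sketch of the "receive was posted" fast path, assuming a small table of posted receives keyed by (src, tag); the structure and names are illustrative:

```c
/* As a message arrives, a fast lookup decides between streaming data
 * straight into the posted user buffer and falling back to the
 * "buffer? nack? discard?" choice above. */
#define MAX_POSTED 64
#define ANY (-1)

typedef struct { int src, tag; void *buf; int active; } posted_t;
static posted_t posted[MAX_POSTED];

void *match_posted(int src, int tag) {
    for (int i = 0; i < MAX_POSTED; i++) {
        posted_t *p = &posted[i];
        if (p->active &&
            (p->src == ANY || p->src == src) &&
            (p->tag == ANY || p->tag == tag)) {
            p->active = 0;
            return p->buf;              /* stream data directly here */
        }
    }
    return NULL;                        /* not posted: buffer or nack */
}
```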

14 Key Features of Msg Passing Abstraction
Source knows the send data address, dest. knows the receive data address
–after the handshake they both know both
Arbitrary storage for the asynchronous protocol
–may post many sends before any receives
Fundamentally a 3-phase transaction
–can use optimistic 1-phase in limited cases
Latency and overhead tend to be higher than SAS; high BW is easier
Hardware support?
–DMA: physical or virtual (better/harder)

15 Shared Address Space Abstraction
Two-way request/response protocol
–reads require a data response
–writes have an acknowledgment (for consistency)
Issues
–virtual or physical address on the net? (where does translation happen?)
–coherence, consistency, etc. (later)

16 Key Properties of SAS Abstraction
Data addresses are specified by the source of the request
–no dynamic buffer allocation
–protection achieved through virtual memory translation
Low-overhead initiation: one instruction (load or store)
High bandwidth is more challenging
–may require prefetching or a separate "block transfer engine"
Synchronization is less straightforward (no explicit event notification; flag sketch below)
Simple request-response pairs
–few fixed message types
–practical to implement in hardware w/o remote CPU involvement
Input buffering / flow control issue
–what if the request rate exceeds local memory bandwidth?
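A minimal sketch of SAS-style communication and its synchronization gap: the producer's plain store is the whole transfer, and the consumer must poll a flag because there is no event notification. C11 atomics stand in for the consistency machinery deferred to later:

```c
/* Producer writes data then sets a flag with release semantics;
 * consumer spins on the flag with acquire semantics. The variables are
 * assumed to live in memory visible to both nodes. */
#include <stdatomic.h>

double shared_data;            /* assume mapped at both nodes */
atomic_int ready = 0;

void producer(double v) {
    shared_data = v;           /* one store initiates the transfer */
    atomic_store_explicit(&ready, 1, memory_order_release);
}

double consumer(void) {
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;                      /* no event notification: must poll */
    return shared_data;
}
```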

17 Challenge 1: Input Buffer Overflow
Options:
Refuse input when full
–creates "back pressure" (in a reliable network)
–to avoid deadlock: low-level ack/nack
–assumes a dedicated network path for ack/nack (common in rings)
–retry on nack
Drop packets
–retry on timeout
Avoid by reserving space per source ("credit-based"; sketch below)
–when is the space available for reuse?
–scalability?
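A minimal credit-based sketch, assuming a hypothetical ni_send() primitive and a fixed per-destination reservation:

```c
/* Each sender pre-reserves CREDITS input-buffer slots at every
 * destination; a send consumes a credit and the receiver returns it
 * when the slot is freed, so the input buffer can never overflow. */
#include <stdbool.h>

#define MAX_NODES 1024
#define CREDITS   8

extern void ni_send(int dest, const void *pkt); /* assumed NI primitive */

static int credits_to[MAX_NODES];               /* init to CREDITS each */

bool try_send(int dest, const void *pkt) {
    if (credits_to[dest] == 0)
        return false;        /* would overflow dest's reserved space */
    credits_to[dest]--;
    ni_send(dest, pkt);
    return true;
}

void on_credit_return(int src) {                /* receiver freed a slot */
    credits_to[src]++;
}
```

Note the scalability question from the slide: the reservation costs CREDITS slots per source, so total buffering grows with P.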

18 Challenge 2: Fetch Deadlock
Processing a message may require sending a reply; what if the reply can't be sent due to input buffer overflow?
–step 1: guarantee that replies can be sunk at the destination: the requester reserves buffer space for the reply
–step 2: guarantee that replies can be sent into the network
  backpressure: logically independent request/reply networks (physical networks or virtual channels)
  credit-based: bound outgoing requests to K per node; buffer space for K(P-1) requests + K responses at each node (sketch below)
  low-level ack/nack or packet dropping: guarantee that replies will never be nacked or dropped
For cache coherence protocols, some requests may require more: forwarding the request to another node, sending multiple invalidations
–must extend these techniques or nack such requests up front
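A sketch of the bounded-request variant: at most K outstanding requests per node, each with a pre-reserved reply slot so replies can always be sunk; vc_send() and the types are illustrative, not a real interface:

```c
/* Requests and replies travel on logically separate channels, and every
 * request carries the index of a reply slot reserved in advance. */
#include <stdbool.h>
#include <stdint.h>

#define K 4
enum { VC_REQUEST, VC_REPLY };                 /* virtual channels */

typedef struct { int reply_slot; uint64_t addr; } req_t;
typedef struct { int reply_slot; uint64_t data; } reply_t;

extern void vc_send(int vc, int dest, const void *msg); /* assumed */

static bool    slot_busy[K];
static reply_t reply_slots[K];       /* guaranteed home for replies */

bool issue_request(int dest, req_t *r) {
    for (int s = 0; s < K; s++) {
        if (!slot_busy[s]) {
            slot_busy[s] = true;
            r->reply_slot = s;
            vc_send(VC_REQUEST, dest, r);
            return true;
        }
    }
    return false;                    /* K outstanding: caller retries */
}

void on_reply(const reply_t *rep) {
    reply_slots[rep->reply_slot] = *rep;  /* always room: slot reserved */
    slot_busy[rep->reply_slot] = false;
}
```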

19 Outline
Scalability [7.1 -- read]
Realizing Programming Models
–network transactions
–protocols: SAS, MP, Active Messages
–safety: buffer overflow, fetch deadlock
Communication Architecture Design Space
–where does hardware fit into the node architecture?
–how much hardware interpretation of the network transaction?
–how much gap between hardware and user semantics? the remainder must be done in software: increased flexibility, increased latency & overhead
–main CPU or dedicated/specialized processor?

20 Massively Parallel Processor (MPP) Architectures
[Figure: node with processor, cache, memory bus, main memory, I/O bridge, I/O bus with disk controller and disks, and a network interface attached near the processor]
Network interface typically close to the processor
–memory bus: locked to a specific processor architecture/bus protocol
–registers/cache: only in research machines
Time-to-market is long
–processor already available, or work closely with the processor designers
Maximizes performance (and cost)

21 Network of Workstations
Network interface on the I/O bus
Standards (e.g., PCI) => longer life, faster to market
Slow (microseconds) to access the network interface
"System Area Network" (SAN): between LAN & MPP
[Figure: node with processor, cache, core chip set, main memory, and an I/O bus hosting a disk controller, graphics controller, and network interface; the NI raises interrupts]

22 Transaction Interpretation
Simple: HW doesn't interpret much, if anything
–DMA from/to buffer; interrupt or set flag on completion
–nCUBE, conventional LAN
–requires the OS for address translation, often a user/kernel copy
User-level messaging: get the OS out of the way
–HW does protection checks to allow direct user access to the network
–may have minimal interpretation otherwise
–may be on the I/O bus (Myrinet), memory bus (CM-5), or in registers (J-Machine, *T)
–may require CPU involvement in all data transfers (explicit memory-to-network copy)

23 Transaction Interpretation (cont'd)
Virtual DMA: get the CPU out of the way (maybe)
–basic protection plus address translation: user-level bulk DMA
–usually to a limited region of the address space (pinned)
–can be done in hardware (VIA, Meiko CS-2) or software (some Myrinet, Intel Paragon)
Reflective memory
–DEC Memory Channel, Princeton SHRIMP
Global physical address space (NUMA): everything in hardware
–complexity increases, but performance does too (if done right)
Cache coherence: even more so
–stay tuned

24 Net Transactions: Physical DMA
Physical addresses: the OS must initiate transfers
–system call per message on both ends: ouch
Sending OS copies data to a kernel buffer w/ header/trailer
–can avoid the copy if the interface does scatter/gather
Receiver copies the packet into an OS buffer, then interprets it
–the user message is then copied (or mapped) into user space

25 nCUBE/2 Network Interface
Independent DMA channel per link direction
Segmented messages: can inspect the header to direct the remainder of the DMA directly to the user buffer
–avoids a copy at the expense of an extra interrupt + DMA setup cost
–can't let the buffer be paged out (did nCUBE have VM?)

26 Conventional LAN Network Interface
[Figure: NIC with controller and DMA engine (addr/len, transmit/receive registers); TX and RX descriptor rings (addr/len/status/next) and data buffers in host memory; NIC sits on the I/O bus, behind the memory bus and processor]

27 User Level Messaging
Map the network hardware into the user's address space (mmap sketch below)
–talk directly to the network via loads & stores
User-to-user communication without OS intervention: low latency
Protection: user/user & user/system
DMA hard... CPU involvement (copying) becomes the bottleneck
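A sketch of the mapping idea, assuming a hypothetical /dev/ni0 device whose queue registers can be mmap'd; after the one-time setup, sending is a plain store with no system call:

```c
/* Map the NI's message queue into the process's address space, then
 * send by storing into the mapped ring. Device name and queue layout
 * are illustrative, not a real driver interface. */
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

typedef struct {
    volatile uint32_t head, tail;      /* status words; NI consumes at head */
    volatile uint64_t slots[64];       /* message slots */
} ni_queue_t;

ni_queue_t *ni_map(void) {
    int fd = open("/dev/ni0", O_RDWR); /* protection checked once, here */
    if (fd < 0) return NULL;
    void *p = mmap(NULL, sizeof(ni_queue_t), PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);
    close(fd);
    return p == MAP_FAILED ? NULL : (ni_queue_t *)p;
}

void ni_send_word(ni_queue_t *q, uint64_t msg) {
    while ((q->tail + 1) % 64 == q->head)
        ;                              /* queue full: spin */
    q->slots[q->tail] = msg;           /* store = send */
    q->tail = (q->tail + 1) % 64;
}
```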

28 User Level Network Ports
Appears to the user as logical message queues plus status

29 Example: CM-5
Input and output FIFO for each network
Two data networks
Save/restore network buffers on context switch

30 User Level Handlers
[Figure: user/system boundary; the message carries destination, data, and handler address between two processor-memory nodes]
Hardware support to vector to the address specified in the message
–message ports in registers
–alternate register set for the handler?
Examples: J-Machine, Monsoon, *T (MIT), iWARP (CMU)

31 J-Machine
Each node is a small message-driven processor
HW support to queue msgs and dispatch to a msg handler task

32 Dedicated Message Processing Without Specialized Hardware
[Figure: each node has a user processor P and a message processor MP sharing memory and an NI; MPs communicate via system-level network transactions]
General-purpose processor performs arbitrary output processing (at system level)
General-purpose processor interprets incoming network transactions (in system)
User processor and msg processor share memory
Msg processor talks to msg processor via system network transactions

33 Levels of Network Transaction
User processor stores cmd / msg / data into the shared output queue
–must still check for output queue full (or grow it dynamically)
Communication assists make the transaction happen
–checking, translation, scheduling, transport, interpretation
Avoids system call overhead
Multiple bus crossings are the likely bottleneck
[Figure: two nodes (P, MP, memory, NI) connected through the network]

34 Example: Intel Paragon


36 Dedicated MP w/ Specialized NI: Meiko CS-2
Integrates the message processor into the network interface
–active-messages-like capability
–dedicated threads for DMA, reply handling, simple remote memory access
–supports user-level virtual DMA: own page table; can take a page fault, signal the OS, and restart; meanwhile, nacks the other node
Problem: the processor is slow and time-slices threads
–a fundamental issue with building your own CPU

37 Myricom Myrinet (Berkeley NOW)
Programmable network interface on the I/O bus (Sun SBus or PCI)
–embedded custom CPU ("LANai", ~40 MHz RISC)
–256KB SRAM
–3 DMA engines: to network, from network, to/from host memory
Downloadable firmware executes in kernel mode
–includes a source-based routing protocol
SRAM pages can be mapped into user space
–separate pages for separate processes
–firmware can define status words, queues, etc.: data for short messages or pointers for long ones
–firmware can do address translation too... w/ OS help
–polls to check for sends from the user
Bottom line: the I/O bus is still the bottleneck, and the CPU could be faster

38 Shared Physical Address Space
Implement the SAS model in hardware w/o caching
–actual caching must be done by copying from remote memory to local
–the programming paradigm looks more like message passing than Pthreads; yet transfers have low latency & low overhead thanks to HW interpretation, and high bandwidth too if done right
–result: a great platform for MPI & compiled data-parallel codes
Implementation (pseudo-CPU sketch below):
–"pseudo-memory" acts as a memory controller for remote memory, converting accesses into network transactions (requests)
–"pseudo-CPU" on the remote node receives requests, performs them on local memory, and sends replies
–a split-transaction or retry-capable bus is required (or dual-ported memory)
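A sketch of the pseudo-CPU side, with illustrative ni_send()/transaction types; note the few fixed message types and the absence of remote-CPU involvement:

```c
/* The remote node's assist receives a read or write request, performs
 * the access on local memory, and sends the fixed-format reply. */
#include <stdint.h>

typedef struct { uint8_t op; uint64_t addr; uint64_t data; int src; } txn_t;

enum { OP_READ_REQ, OP_READ_REPLY, OP_WRITE_REQ, OP_WRITE_ACK };

extern int  my_node_id;                       /* assumed node identity */
extern void ni_send(int dest, const void *msg); /* assumed NI primitive */

void pseudo_cpu_serve(const txn_t *req) {
    txn_t rep = { .addr = req->addr, .src = my_node_id };
    switch (req->op) {
    case OP_READ_REQ:                 /* read local memory, reply with data */
        rep.op = OP_READ_REPLY;
        rep.data = *(volatile uint64_t *)(uintptr_t)req->addr;
        break;
    case OP_WRITE_REQ:                /* perform write, ack for consistency */
        *(volatile uint64_t *)(uintptr_t)req->addr = req->data;
        rep.op = OP_WRITE_ACK;
        break;
    }
    ni_send(req->src, &rep);
}
```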

39 Example: Cray T3D
Up to 2,048 Alpha 21064s
–no off-chip L2, to avoid its inherent latency
In addition to remote memory ops, includes:
–prefetch buffer (hides remote latency)
–DMA engine (requires an OS trap)
–synchronization operations (swap, fetch&inc, global AND/OR)
–message queue (requires an OS trap on the receiver)
Big problem: physical address space
–the 21064 supports only 32 bits
–a 2K-node machine is limited to 2M per node
–an external "DTB Annex" provides segment-like registers for extended addressing, but management is expensive & ugly

40 Cray T3E
Similar to the T3D, but uses the Alpha 21164 instead of the 21064 (on-chip L2)
–still has physical address space problems
E-registers for remote communication and synchronization
–512 user, 128 system; 64 bits each
–replace/unify the DTB Annex, prefetch queue, block transfer engine, remote load/store, and message queue
–the address specifies the source or destination E-register and a command
–the data contains a pointer to a block of 4 E-regs and an index for the centrifuge
Centrifuge
–supports data distributions used in data-parallel languages (HPF)
–4 E-regs per global memory operation: mask, base, two arguments
Get & Put operations

41 T3E (continued)
Atomic memory operations (software analogue sketched below)
–E-registers & centrifuge used
–F&I, F&Add, Compare&Swap, Masked_Swap
Messaging
–arbitrary number of queues (user or system)
–64-byte messages
–create a msg queue by storing a message control word to a memory location
Msg send
–construct the data in an aligned block of 8 E-regs
–send like a put, but the destination must be a message control word
–the processor is responsible for queue space (buffer management)
Barrier and eureka synchronization
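A software analogue of these atomic memory operations using C11 atomics; this is not the T3E's E-register interface, just the same primitives expressed portably (Masked_Swap built from compare&swap):

```c
/* F&I via atomic fetch-add; Masked_Swap as a CAS loop that replaces
 * only the bits selected by the mask and returns the old value. */
#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint64_t counter;

uint64_t fetch_and_inc(void) {
    return atomic_fetch_add(&counter, 1);    /* F&I */
}

uint64_t masked_swap(_Atomic uint64_t *w, uint64_t mask, uint64_t val) {
    uint64_t old = atomic_load(w), desired;
    do {                                     /* retry until CAS succeeds */
        desired = (old & ~mask) | (val & mask);
    } while (!atomic_compare_exchange_weak(w, &old, desired));
    return old;
}
```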

42 DEC Memory Channel (Princeton SHRIMP)
Reflective memory: writes on the sender appear in the receiver's memory
–send & receive regions
–page control table
–the receive region is pinned in memory
Requires duplicate writes; the regions are really just message buffers
[Figure: sender's virtual page maps through a physical page to the receiver's virtual page]

43 Performance of Distributed Memory Machines
Microbenchmarking
One-way latency of a small (five-word) message
–echo test: round-trip time divided by 2 (ping-pong sketch below)
Shared memory: remote read
Message passing operations
–see text
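A sketch of the echo test in MPI: bounce a five-word message between two ranks many times and report round-trip/2 as the one-way latency:

```c
/* Ping-pong latency microbenchmark: rank 0 sends, rank 1 echoes;
 * one-way latency is the averaged round-trip time divided by 2. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, i, iters = 10000;
    long msg[5] = {0};                          /* five-word message */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    double t0 = MPI_Wtime();
    for (i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(msg, 5, MPI_LONG, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(msg, 5, MPI_LONG, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(msg, 5, MPI_LONG, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(msg, 5, MPI_LONG, 0, 0, MPI_COMM_WORLD);
        }
    }
    if (rank == 0)
        printf("one-way latency: %.2f us\n",
               (MPI_Wtime() - t0) / iters / 2 * 1e6);
    MPI_Finalize();
    return 0;
}
```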

44 Network Transaction Performance (Figure 7.31)

45 Remote Read Performance (Figure 7.32)

46 Summary of Distributed Memory Machines
Convergence of architectures
–everything "looks basically the same": processor, cache, memory, communication assist
Communication assist
–where is it? (I/O bus, memory bus, processor registers)
–what does it know? does it just move bytes, or does it perform some functions?
–is it programmable?
–does it run user code?
Network transaction
–input & output buffering
–action on the remote node

