Node-to-Network Interface in Scalable Multiprocessors CS 258, Spring 99 David E. Culler Computer Science Division U.C. Berkeley.


1 Node-to-Network Interface in Scalable Multiprocessors CS 258, Spring 99 David E. Culler Computer Science Division U.C. Berkeley

2 Recap: Common Challenges (3/10/99, CS258 S99)
Input buffer overflow
– N-1 queue over-commitment => must slow sources
– reserve space per source (credit)
» when available for reuse? Ack or higher level
– refuse input when full
» backpressure in reliable network
» tree saturation
» deadlock free
» what happens to traffic not bound for congested dest?
– reserve ack back channel
– drop packets
– utilize higher-level semantics of programming model
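The per-source credit idea above can be sketched as a toy model. This is only an illustration under stated assumptions: the class names `CreditedReceiver` and `CreditedSender`, the explicit `ack()` call, and the choice to return a credit when the receiver consumes a message are all hypothetical, not taken from the slides.

```python
from collections import deque


class CreditedReceiver:
    """Toy receiver whose input buffer is sized to the sum of all credits."""

    def __init__(self, num_sources, credits_per_source):
        self.queue = deque()
        self.capacity = num_sources * credits_per_source

    def accept(self, src, payload):
        # Cannot overflow: each queued message consumed a credit held by its sender.
        assert len(self.queue) < self.capacity
        self.queue.append((src, payload))

    def consume(self):
        """Remove the oldest message; its buffer slot is now reusable."""
        return self.queue.popleft()


class CreditedSender:
    """Toy sender that may only inject while it holds a credit."""

    def __init__(self, name, receiver, credits):
        self.name = name
        self.receiver = receiver
        self.credits = credits

    def try_send(self, payload):
        if self.credits == 0:
            return False  # source is slowed down instead of overflowing the input buffer
        self.credits -= 1
        self.receiver.accept(self.name, payload)
        return True

    def ack(self):
        # Assumed policy: the receiver acknowledges once the slot has been freed.
        self.credits += 1


rx = CreditedReceiver(num_sources=2, credits_per_source=2)
a = CreditedSender("a", rx, credits=2)
results = [a.try_send("m0"), a.try_send("m1"), a.try_send("m2")]
src, payload = rx.consume()
a.ack()
resent = a.try_send("m2")
```

In this sketch the third send is refused until a slot is freed and acknowledged, which is the "when available for reuse?" question the slide raises.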

3 Recap: Challenges (cont)
Fetch deadlock
– for network to remain deadlock free, nodes must continue accepting messages, even when they cannot source msgs
– what if incoming transaction is a request?
» each may generate a response, which cannot be sent!
» what happens when internal buffering is full?
Logically independent request/reply networks
– physical networks
– virtual channels with separate input/output queues
Bound requests and reserve input buffer space
– K(P-1) requests + K responses per node
– service discipline to avoid fetch deadlock?
NACK on input buffer full
– NACK delivery?
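The buffer bound above can be worked through numerically. The sketch below assumes one reading of the slide: each of the P nodes may have at most K requests outstanding, so a node can receive up to K requests from each of the other P-1 nodes plus K responses to its own requests. The function name `input_buffer_slots` is hypothetical.

```python
def input_buffer_slots(num_nodes, max_outstanding):
    """Input buffer slots a node must reserve under the assumed reading of K(P-1) + K."""
    request_slots = max_outstanding * (num_nodes - 1)  # K requests from each other node
    response_slots = max_outstanding                   # K responses to our own requests
    return request_slots + response_slots


# Example: P = 64 nodes, K = 4 outstanding requests per node.
slots = input_buffer_slots(64, 4)
```

Under this reading the reservation grows linearly with machine size (K*P in total), which is why the slide also lists NACKs as an alternative to reserving space.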

4 Network Transaction Processing
Key design issue: How much interpretation of the message? How much dedicated processing in the Comm. Assist?
[Figure: node architecture; each node has a processor (P), memory (M), and communication assist (CA) attached to a scalable network; a message travels between nodes]
Output processing
– checks
– translation
– formatting
– scheduling
Input processing
– checks
– translation
– buffering
– action

5 Spectrum of Designs
None: physical bit stream
– blind, physical DMA: nCUBE, iPSC, ...
User/System
– user-level port: CM-5, *T
– user-level handler: J-Machine, Monsoon, ...
Remote virtual address
– processing, translation: Paragon, Meiko CS-2
Global physical address
– proc + memory controller: RP3, BBN, T3D
Cache-to-cache
– cache controller: Dash, KSR, Flash
Increasing HW support, specialization, intrusiveness, performance (???)

6 Net Transactions: Physical DMA
DMA controlled by regs, generates interrupts
Physical => OS initiates transfers
Send-side
– construct system “envelope” around user data in kernel area
Receive
– must receive into system buffer, since no interpretation in CA
[Figure labels: sender, auth, dest addr]

7 nCUBE Network Interface
independent DMA channel per link direction
– leave input buffers always open
– segmented messages
routing interprets envelope
– dimension-order routing on hypercube
– bit-serial with 36 bit cut-through
Os: 16 ins, 260 cy, 13 us
Or: 18, 200 cy, 15 us (includes interrupt)
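Dimension-order routing on a hypercube can be sketched in a few lines. This is a generic illustration, not the nCUBE hardware's actual logic: the function name `ecube_route` is hypothetical, and correcting differing address bits from lowest to highest dimension is an assumed ordering.

```python
def ecube_route(src, dst, dims):
    """Return the list of nodes visited when routing src -> dst on a dims-cube.

    Node addresses are integers; neighbors differ in exactly one bit.
    Dimensions are corrected in a fixed order (lowest first, by assumption),
    which is what makes the routing deterministic and dimension-ordered.
    """
    path = [src]
    node = src
    for d in range(dims):
        bit = 1 << d
        if (node ^ dst) & bit:   # addresses differ in dimension d
            node ^= bit          # traverse the link in that dimension
            path.append(node)
    return path


route = ecube_route(0b000, 0b101, dims=3)
```

The number of hops equals the number of bits in which the source and destination addresses differ.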

8 Conventional LAN NI
[Figure: NIC on the IO bus containing the NIC controller, DMA (addr, len), trncv, TX, and RX; host memory holds chains of descriptors (Addr, Len, Status, Next) that point to data; Proc attached via the mem bus]
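The descriptor chains in the figure (Addr, Len, Status, Next) can be modeled as a small ring shared by host and NIC. This is a minimal sketch under assumptions: the ownership encoding in `status`, the helper names `make_ring`, `host_post_tx`, and `nic_service_tx`, and the base address are all hypothetical and do not describe any particular NIC.

```python
from dataclasses import dataclass

OWNED_BY_HOST = 0  # assumed status encoding: host may fill this descriptor
OWNED_BY_NIC = 1   # assumed status encoding: NIC may DMA from this descriptor


@dataclass
class Descriptor:
    addr: int     # buffer address in host memory
    length: int   # number of valid bytes in the buffer
    status: int   # ownership flag
    next: int     # index of the next descriptor in the ring


def make_ring(n, buf_size, base=0x1000):
    """Build a circular chain of n descriptors over contiguous buffers."""
    return [Descriptor(base + i * buf_size, buf_size, OWNED_BY_HOST, (i + 1) % n)
            for i in range(n)]


def host_post_tx(ring, head, length):
    """Host hands one buffer to the NIC; returns the new head, or None if the ring is full."""
    d = ring[head]
    if d.status != OWNED_BY_HOST:
        return None
    d.length = length
    d.status = OWNED_BY_NIC
    return d.next


def nic_service_tx(ring, tail):
    """NIC model: 'transmit' every descriptor it owns and return them to the host."""
    sent = []
    while ring[tail].status == OWNED_BY_NIC:
        d = ring[tail]
        sent.append((d.addr, d.length))
        d.status = OWNED_BY_HOST
        tail = d.next
    return sent, tail


ring = make_ring(2, 256)
head = host_post_tx(ring, 0, 100)
head = host_post_tx(ring, head, 50)
full = host_post_tx(ring, head, 10)   # both descriptors are still owned by the NIC
sent, tail = nic_service_tx(ring, 0)
```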

9 User Level Ports
initiate transaction at user level
deliver to user without OS intervention
network port in user space
User/system flag in envelope
– protection check, translation, routing, media access in src CA
– user/sys check in dest CA, interrupt on system

10 User Level Network Ports
Appears to user as logical message queues plus status
What happens if no user pop?

11 Example: CM-5
Input and output FIFO for each network
2 data networks
tag per message
– index NI mapping table
context switching?
*T integrated NI on chip
iWARP also
Os: 50 cy, 1.5 us
Or: 53 cy, 1.6 us
interrupt: 10 us

12 User Level Handlers
Hardware support to vector to address specified in message
– message ports in registers
[Figure: message with User/system, Dest, Data, and Address fields sent between two nodes, each with P and Mem]
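Vectoring to an address carried in the message can be sketched in software as a table lookup followed by a call. Everything here is illustrative: the message layout (a dict with `system`, `handler`, and `data` keys), the handler "address" 0x40, and the `remote_write` handler are hypothetical stand-ins for what hardware would do directly.

```python
def remote_write(memory, addr, value):
    """Example user-level handler: store a value into local memory."""
    memory[addr] = value


def dispatch(msg, handlers, memory, system_queue):
    """Run the handler named in the message, or defer system messages.

    msg["system"]  : user/system flag from the envelope
    msg["handler"] : the 'address' to vector to (a key into handlers here)
    msg["data"]    : arguments passed to the handler
    """
    if msg["system"]:
        system_queue.append(msg)   # stands in for raising an interrupt to the OS
        return
    handlers[msg["handler"]](memory, *msg["data"])


handlers = {0x40: remote_write}    # hypothetical handler address
memory = {}
system_queue = []

dispatch({"system": False, "handler": 0x40, "data": (8, 123)}, handlers, memory, system_queue)
dispatch({"system": True, "handler": 0x00, "data": ()}, handlers, memory, system_queue)
```

The user message is handled without any system-level involvement, while the system message is set aside for privileged processing.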

13 J-Machine: Msg-Driven Processor
Each node a small msg-driven processor
HW support to queue msgs and dispatch to msg handler task

14 Monsoon
Explicit Token-Store

15 *T: Network Co-Processor

16 iWARP: Systolic Computation
Nodes integrate communication with computation on systolic basis
Msg data direct to register
Stream into memory
[Figure labels: Interface unit, Host]

17 Dedicated processing without dedicated hardware design

18 Dedicated Message Processor
General purpose processor performs arbitrary output processing (at system level)
General purpose processor interprets incoming network transactions (at system level)
User Processor <–> Msg Processor share memory
Msg Processor <–> Msg Processor via system network transaction
[Figure: two nodes on a network, each with Mem, P (user), MP (system), and NI; message directed to dest]

19 Levels of Network Transaction
User Processor stores cmd / msg / data into shared output queue
– must still check for output queue full (or make elastic)
Communication assists make transaction happen
– checking, translation, scheduling, transport, interpretation
Effect observed on destination address space and/or events
Protocol divided between two layers
[Figure: two nodes on a network, each with Mem, P (user), MP (system), and NI; message directed to dest]

20 Example: Intel Paragon

21 User Level Abstraction (Lok Liu)
Any user process can post a transaction for any other in protection domain
– communication layer moves OQ src –> IQ dest
– may involve indirection: VAS src –> VAS dest
[Figure: four processes, each with Proc, OQ, IQ, and VAS]

22 Msg Processor Events
[Figure: Dispatcher servicing events: User Output Queues, Send FIFO ~Empty, Rcv FIFO ~Full, Send DMA, Rcv DMA, DMA done, Compute Processor Kernel, System Event]

23 Basic Implementation Costs: Scalar
Cache-to-cache transfer (two 32B lines, quad word ops)
– producer: read(miss,S), chk, write(S,WT), write(I,WT), write(S,WT)
– consumer: read(miss,S), chk, read(H), read(miss,S), read(H), write(S,WT)
to NI FIFO: read status, chk, write, ...
from NI FIFO: read status, chk, dispatch, read, read, ...
[Figure: CP, User OQ, MP, Registers, Cache, Net FIFO, Net, MP, User IQ, CP; labels: 21.52, 4.4 µs, 5.4 µs, 10.5 µs, 7 wds, 222, 250 ns + H*40 ns]

24 Virtual DMA -> Virtual DMA
Send MP segments into 8K pages and does VA –> PA
Recv MP reassembles, does dispatch and VA –> PA per page
[Figure: CP, User OQ, MP, Registers, Cache, Net FIFO, Net, MP, User IQ, CP, with Memory, sDMA, hdr, rDMA; labels: 21.52, 7 wds, 222, 2048, 400 MB/s, 175 MB/s, 400 MB/s]
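The send-side step above (segment at page boundaries, translate each piece) can be sketched as follows. This is an illustration under assumptions: the function name `segment_transfer` and the page table modeled as a dict from virtual page number to physical frame number are hypothetical; only the 8K page size comes from the slide.

```python
PAGE_SIZE = 8192  # 8K pages, as stated on the slide


def segment_transfer(vaddr, nbytes, page_table):
    """Split a virtual-address transfer into per-page (physical address, length) pieces.

    page_table maps virtual page number -> physical frame number (assumed model).
    No piece crosses a page boundary, so each can be handed to a physical DMA engine.
    """
    segments = []
    while nbytes > 0:
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        chunk = min(nbytes, PAGE_SIZE - offset)        # stop at the end of this page
        paddr = page_table[vpn] * PAGE_SIZE + offset   # VA -> PA for this page
        segments.append((paddr, chunk))
        vaddr += chunk
        nbytes -= chunk
    return segments


page_table = {0: 7, 1: 3}
segments = segment_transfer(8000, 1000, page_table)
```

A 1000-byte transfer starting 192 bytes before a page boundary becomes two pieces, each translated independently; the receive side would do the matching translation per page.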

25 Single Page Transfer Rate
Actual buffer size: 2048
Effective buffer size: 3232

26 Msg Processor Assessment
Concurrency intensive
– need to keep inbound flows moving while outbound flows stalled
– large transfers segmented
Reduces overhead but adds latency
[Figure: Dispatcher servicing events: User Output Queues, Send FIFO ~Empty, Rcv FIFO ~Full, Send DMA, Rcv DMA, DMA done, Compute Processor Kernel, System Event, User Input Queues, VAS]

27 Case Study: Meiko CS2 Concept
Circuit-switched network transaction
– source-dest circuit held open for request response
– limited cmd set executed directly on NI
Dedicated communication processor for each step in flow

28 Case Study: Meiko CS2 Organization

29 Shared Physical Address Space
NI emulates memory controller at source
NI emulates processor at dest
– must be deadlock free

30 Case Study: Cray T3D
Build up info in ‘shell’
Remote memory operations encoded in address

31 Case Study: NOW
General purpose processor embedded in NIC

32 Message Time Breakdown
Communication pipeline

33 Message Time Comparison

34 SAS Time Comparison

35 Message-Passing Time vs Size

36 Message-Passing Bandwidth vs Size

37 Application Performance on LU

38 Application Performance on BT

39 Message Profile on BT

40 Reflective Memory
Writes to local region reflected to remote

41 Case Study: DEC Memory Channel
See also Shrimp

