CS 505: Thu D. Nguyen Rutgers University, Spring 2005 1 CS 505: Computer Structures Networks Thu D. Nguyen Spring 2005 Computer Science Rutgers University.


1 CS 505: Computer Structures
Networks
Thu D. Nguyen
Spring 2005
Computer Science, Rutgers University

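2 Basic Message Passing
[Figure: two panels. In the first, processes P0 and P1 on a single node N0 communicate via Send/Receive. In the second, P0 and P1 on separate nodes N0 and N1 communicate via Send/Receive across a Communication Fabric.]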

3 Terminology
Basic Message Passing:
– Send: Analogous to mailing a letter
– Receive: Analogous to picking up a letter from the mailbox
– Scatter-gather: Ability to “scatter” data items in a message into multiple memory locations and “gather” data items from multiple memory locations into one message
Network performance:
– Latency: The time from when a Send is initiated until the first byte is received by a Receive.
– Bandwidth: The rate at which a sender is able to send data to a receiver.

4 Scatter-Gather
[Figure: two panels, each showing Memory and a Message. Gather (Send) collects items from several memory locations into one message; Scatter (Receive) places the items of a message into several memory locations.]
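The scatter and gather operations on this slide can be sketched in a few lines of Python. This is only an illustration: the function names, the `(offset, length)` region format, and the sample buffers are assumptions made for the example, not part of the lecture.

```python
def gather(memory, regions):
    """Build one message from several (offset, length) regions of memory."""
    message = bytearray()
    for offset, length in regions:
        message += memory[offset:offset + length]
    return bytes(message)


def scatter(message, memory, regions):
    """Copy consecutive pieces of a message into several regions of memory."""
    position = 0
    for offset, length in regions:
        memory[offset:offset + length] = message[position:position + length]
        position += length


# Sender side: three non-contiguous items become one message.
sender_memory = bytearray(b"AAAA....BB......CCC")
message = gather(sender_memory, [(0, 4), (8, 2), (16, 3)])

# Receiver side: the message is spread over three separate locations.
receiver_memory = bytearray(12)
scatter(message, receiver_memory, [(0, 4), (6, 2), (9, 3)])
```

The point of scatter-gather support is that the application does not have to pack its data into a contiguous buffer before sending, or unpack it after receiving.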

5 Network Topologies

6 Terminology
Network partition: When a network is broken into two or more components that cannot communicate with each other.
Diameter: Maximum length of the shortest path between any two processors.
Connectivity: Measure of the multiplicity of paths between any two processors; the minimum number of links that must be removed to partition the network.
Bisection width: Minimum number of links that must be removed to partition the network into two equal halves.
Bisection bandwidth: Minimum volume of communication allowed between any two halves of the network with an equal number of processors.

7 Bisection Bandwidth = Bisection Width * Link Bandwidth
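As a rough worked example of this formula, the sketch below plugs in the standard textbook bisection widths for three topologies with p processors (ring: 2, hypercube: p/2, completely connected: p^2/4). Those widths, the function names, and the sample link bandwidth are assumptions added for illustration; the slide states only the product rule.

```python
def bisection_width(topology, p):
    """Bisection width (in links) for an even number of processors p."""
    if p < 2 or p % 2 != 0:
        raise ValueError("p must be an even number >= 2")
    if topology == "ring":
        return 2
    if topology == "hypercube":
        if (p & (p - 1)) != 0:
            raise ValueError("a hypercube needs p to be a power of two")
        return p // 2
    if topology == "completely-connected":
        return (p // 2) ** 2
    raise ValueError("unknown topology: " + topology)


def bisection_bandwidth(topology, p, link_bandwidth):
    """Bisection bandwidth = bisection width * link bandwidth."""
    return bisection_width(topology, p) * link_bandwidth


# Hypothetical 16-processor machines with 1.0 Gb/s links.
ring_bw = bisection_bandwidth("ring", 16, 1.0)
cube_bw = bisection_bandwidth("hypercube", 16, 1.0)
full_bw = bisection_bandwidth("completely-connected", 16, 1.0)
```

With the same links, the ring's bisection bandwidth stays constant as p grows, while the hypercube's grows linearly and the completely-connected network's grows quadratically.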

8 Typical Network Diagram

9 Typical Node
[Figure: a node made up of CPU, Memory, NIC, and Router.]

10 Bus-Based Network
Advantages
– Simple
– Diameter = 1
Disadvantages
– Blocking
– Bandwidth does not scale with p
– Easy to partition network
[Figure label: Bus]

11 Completely-Connected Network
Advantages
– Diameter = 1
– Bandwidth scales with p
– Non-blocking
– Difficult to partition network
Disadvantages
– Number of links grows O(p^2)
– Fan-in (and out) at each node grows linearly with p

12 Star Network
Essentially the same as a Bus-Based Network

13 Ring Network

14 Mesh and Torus Network

15 Multistage Network

16 Perfect Shuffle

17 Omega Network - log(p) Stages
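The figures for the perfect shuffle and the Omega network did not survive in this transcript. As a stand-in, here is a minimal sketch of the usual textbook construction: each of the log2(p) stages applies a perfect shuffle (a one-bit left rotation of the line number) followed by a 2x2 switch that selects its output from one bit of the destination address. The function names and the destination-tag routing rule are assumptions drawn from the standard description, not from the slides.

```python
def perfect_shuffle(i, n_bits):
    """Rotate the n_bits-wide line number i left by one bit."""
    mask = (1 << n_bits) - 1
    return ((i << 1) | (i >> (n_bits - 1))) & mask


def omega_route(src, dst, n_bits):
    """Return the output line used after each of the n_bits stages."""
    position = src
    path = []
    for stage in range(n_bits):
        position = perfect_shuffle(position, n_bits)
        # The switch sets the low bit from the destination, most significant bit first.
        dst_bit = (dst >> (n_bits - 1 - stage)) & 1
        position = (position & ~1) | dst_bit
        path.append(position)
    return path


def routes_block(path_a, path_b):
    """Two messages block each other if they need the same line after the same stage."""
    return any(a == b for a, b in zip(path_a, path_b))


# 8 processors, 3 stages: two messages that collide after the first stage.
path_a = omega_route(0b010, 0b111, 3)
path_b = omega_route(0b110, 0b100, 3)
conflict = routes_block(path_a, path_b)
```

Because there is exactly one path between any source and destination, two messages that need the same intermediate line cannot both proceed, which is the blocking behavior the deck's next slide refers to.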

18 Blocking in Omega Network

19 Tree Network

20 Fat Tree Network

21 Hypercube Network

22 Hypercube Network

23 k-ary d-cube Networks
k: radix of the network, the number of processors in each dimension
d: dimension of the network
A k-ary d-cube can be constructed from k k-ary (d-1)-cubes by connecting the nodes occupying identical positions into rings
Examples:
– Hypercube: binary d-cube
– Ring: p-ary 1-cube
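One way to make the k-ary d-cube definition concrete is to label each node with d coordinates in the range 0..k-1 and connect nodes that differ by one (modulo k) in exactly one coordinate. The sketch below follows that reading; the function names and the tuple representation are assumptions for the example.

```python
from itertools import product


def kary_dcube_neighbors(node, k):
    """Neighbors of node, a tuple of d coordinates, each in range(k)."""
    neighbors = set()
    for dim, coord in enumerate(node):
        for step in (-1, 1):
            other = list(node)
            other[dim] = (coord + step) % k  # ring wraparound in this dimension
            other = tuple(other)
            if other != node:
                neighbors.add(other)
    return neighbors


def kary_dcube_links(k, d):
    """All undirected links of a k-ary d-cube."""
    links = set()
    for node in product(range(k), repeat=d):
        for other in kary_dcube_neighbors(node, k):
            links.add(frozenset((node, other)))
    return links


hypercube_links = kary_dcube_links(2, 3)   # binary 3-cube: 8 nodes
ring_links = kary_dcube_links(5, 1)        # 5-ary 1-cube: a 5-node ring
```

With k = 2 the two ring directions coincide, so each node has d neighbors, which is the hypercube; with d = 1 the structure is a single ring of k nodes.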

24 Arbitrary Topology Networks
[Figure label: Switch]

25 Network Characteristics

26 Packet vs. Wormhole Routing
[Figure labels: Message, Packets, Worm]

27 Store-and-Forward vs. Cut-Through Routing
Store-and-Forward:
– Cannot route/forward a packet until the entire packet has been received
Cut-Through:
– Can route/forward a packet as soon as the router has received and processed the header
Wormhole routing is always cut-through because routers do not have enough buffer space to hold an entire message
Packet routing is almost always cut-through as well
Difference: when blocked, a worm can span multiple routers, while a packet will fit entirely into the buffer of a single router
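The practical difference between the two schemes shows up in a simple, contention-free latency model. The sketch below is an assumption-laden illustration, not a formula from the lecture: it charges each hop a fixed routing delay, and assumes store-and-forward waits for the whole packet at every hop while cut-through waits only for the header.

```python
def store_and_forward_time(hops, header_bytes, payload_bytes,
                           bytes_per_ns, routing_delay_ns):
    """Every hop receives the entire packet before forwarding it."""
    packet_time = (header_bytes + payload_bytes) / bytes_per_ns
    return hops * (routing_delay_ns + packet_time)


def cut_through_time(hops, header_bytes, payload_bytes,
                     bytes_per_ns, routing_delay_ns):
    """Every hop forwards as soon as the header has been received and processed."""
    header_time = header_bytes / bytes_per_ns
    payload_time = payload_bytes / bytes_per_ns
    return hops * (routing_delay_ns + header_time) + payload_time


# Hypothetical numbers: 4 hops, 8-byte header, 1016-byte payload,
# 1 byte/ns links, 10 ns routing delay per hop.
sf_ns = store_and_forward_time(4, 8, 1016, 1.0, 10.0)
ct_ns = cut_through_time(4, 8, 1016, 1.0, 10.0)
```

Under these assumptions the store-and-forward time grows with hops times packet size, whereas the cut-through time pays the payload transfer only once, so the gap widens with distance.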

28 Collective Communication Primitives
Send/Receive necessary and sufficient
Broadcast, multicast
– one-to-all, all-to-all, one-to-all personalized, all-to-all personalized
– flood
Reduction
– all-to-one, all-to-all
Scatter, gather
Barrier

29 Broadcast and Multicast
[Figure: two panels over processes P0, P1, P2, P3, labeled Broadcast and Multicast, each showing the delivery of one Message.]

30 All-to-All
[Figure: Message exchange among processes P0, P1, P2, P3.]

31 Reduction
sum ← 0
for i ← 1 to p do
    sum ← sum + A[i]
[Figure: reduction across processes P0, P1, P2, P3 holding A[0], A[1], A[2], A[3]. The partial sums A[0] + A[1] and A[2] + A[3] are formed first and then combined into A[0] + A[1] + A[2] + A[3].]
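The pairwise combining pattern in the reduction figure can be sketched as follows. The list stands in for one value per process, and the function name and step counting are assumptions made for this illustration; the input is assumed to be non-empty.

```python
def tree_reduce(values):
    """Sum one value per process by pairwise combining.

    Returns (total, steps), where steps is the number of parallel rounds.
    """
    partial = list(values)
    steps = 0
    stride = 1
    while stride < len(partial):
        # In one round, process i receives the partial sum held by process i + stride.
        for i in range(0, len(partial) - stride, 2 * stride):
            partial[i] += partial[i + stride]
        stride *= 2
        steps += 1
    return partial[0], steps


total, steps = tree_reduce([1, 2, 3, 4])
```

The sequential loop on the slide needs p additions one after another, while the pairwise scheme finishes in about log2(p) rounds because the additions within a round are independent.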

32 Ring Broadcast O(p)

33 Ring Broadcast O(log p)
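The figure for the logarithmic ring broadcast is missing from this transcript. A common way to reach O(log p) steps, and the one assumed here, is recursive doubling: in each step every node that already holds the message sends it to a node a fixed distance away, and the distance halves from p/2 down to 1. The function name and the restriction to p being a power of two are assumptions for the sketch.

```python
def ring_broadcast_schedule(p, root=0):
    """Return one list of (sender, receiver) pairs per step."""
    if p < 1 or (p & (p - 1)) != 0:
        raise ValueError("this sketch assumes p is a power of two")
    have_message = [root]
    schedule = []
    distance = p // 2
    while distance >= 1:
        step = [(s, (s + distance) % p) for s in have_message]
        have_message += [receiver for _, receiver in step]
        schedule.append(step)
        distance //= 2
    return schedule


schedule = ring_broadcast_schedule(8)
receivers = {receiver for step in schedule for _, receiver in step}
```

The count here is in communication steps, not link traversals: early messages cross several ring links, so the scheme relies on cut-through style routing to keep each step cheap. Passing the message from neighbor to neighbor instead takes p - 1 steps.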

34 Mesh Broadcast

35 Computation vs. Communication Cost
2 GHz clock => 1/2 ns instruction cycle
Memory access:
– L1: ~2-4 cycles => 1-2 ns
– L2: ~5-10 cycles => 2.5-5 ns
– Memory: ~120-300 cycles => 60-150 ns
Message roundtrip latency:
– ~20 µs
– Suppose 75% hit ratio in L1, no L2, 1 ns L1 access time, 200 ns memory access time => average memory access time 51 ns
– 1 message roundtrip latency = ~400 memory accesses
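The arithmetic behind the last two bullets can be checked directly. The sketch assumes the slide uses the usual average-memory-access-time formula (hit time plus miss ratio times miss penalty), which reproduces the 51 ns figure; the function and variable names are made up for the example.

```python
def average_memory_access_ns(hit_ratio, hit_ns, miss_penalty_ns):
    """Hit time plus miss ratio times miss penalty."""
    return hit_ns + (1.0 - hit_ratio) * miss_penalty_ns


# 75% L1 hit ratio, 1 ns L1 access, 200 ns memory access, no L2.
amat_ns = average_memory_access_ns(0.75, 1.0, 200.0)

# One ~20 microsecond message roundtrip, expressed in memory accesses.
roundtrip_ns = 20_000.0
accesses_per_roundtrip = roundtrip_ns / amat_ns
```

The ratio comes out a little under 400, which matches the slide's rounded claim that one message roundtrip costs about as much as 400 memory accesses.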

36 Performance … Always Performance!
So … obviously, when we talk about message passing, we want to know how to optimize for performance
But … which aspects of message passing should we optimize?
– We could try to optimize everything
  » Optimizing the wrong thing wastes precious resources, e.g., optimizing leaving mail for the mail-person does not increase overall “speed” of mail delivery significantly

37 Martin et al.: LogP Model

38 Sensitivity to LogGP Parameters
LogGP parameters:
– L = delay incurred in passing a short msg from source to dest
– o = processor overhead involved in sending or receiving a msg
– g = min time between msg transmissions or receptions (msg bandwidth)
– G = bulk gap = time per byte transferred for long transfers (byte bandwidth)
Workstations connected with Myrinet network and Generic Active Messages layer
Delay insertion technique
Applications written in Split-C but perform their own data caching
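To see how the four parameters combine, the sketch below uses the cost expressions commonly given for the LogGP model: a single n-byte message takes o + (n - 1)G + L + o, and a burst of short messages is paced by the larger of g and o. The function names and the sample parameter values are hypothetical and are not measurements from the study.

```python
def one_way_time(o, L, G, n_bytes=1):
    """Send overhead, per-byte gap for the remaining bytes, latency, receive overhead."""
    return o + (n_bytes - 1) * G + L + o


def burst_time(o, L, g, n_messages):
    """Time until the last of n_messages short messages has been received."""
    return o + (n_messages - 1) * max(g, o) + L + o


# Hypothetical parameters in microseconds.
short_msg = one_way_time(o=2.0, L=5.0, G=0.5)
long_msg = one_way_time(o=2.0, L=5.0, G=0.5, n_bytes=1001)
burst = burst_time(o=2.0, L=5.0, g=4.0, n_messages=10)
```

In the burst expression L is paid once while the gap is paid per message, which is one way to see why bursty applications can be far more sensitive to g and o than to L.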

39 Sensitivity to Overhead

40 Sensitivity to Gap

41 Sensitivity to Latency

42 Sensitivity to Bulk Gap

43 Summary
Runtime strongly dependent on overhead and gap
Strong dependence on gap because of burstiness of communication
Not so sensitive to latency => can effectively overlap computation and communication with non-blocking reads (writes usually do not stall the processor)
Not sensitive to bulk gap => got more bandwidth than we know what to do with

44 What’s the Point?
What can we take away from Martin et al.’s study?
– It’s extremely important to reduce overhead because it may affect both “o” and “g”
– All the “action” is currently in the OS and the Network Interface Card (NIC)
Subject of von Eicken et al., “Active Messages: A Mechanism for Integrated Communication and Computation,” ISCA 1992.

45 User-Level Access to NIC
Basic idea: allow protected user access to the NIC for implementing communication protocols at user level

46 User-level Communication
Basic idea: remove the kernel from the critical path of sending and receiving messages
– user-memory to user-memory: zero copy
– permission is checked once when the mapping is established
– buffer management left to the application
Advantages
– low communication latency
– low processor overhead
– approach raw latency and bandwidth provided by the network
One approach: U-Net

47 U-Net Abstraction

48 U-Net Endpoints

49 U-Net Basics
Protection provided by endpoints and communication channels
– Endpoints, communication segments, and message queues are only accessible by the owning process (all allocated in user memory)
– Outgoing messages are tagged with the originating endpoint address, and incoming messages are demultiplexed and delivered only to the correct endpoints
For ideal performance, firmware on the NIC should implement the actual messaging and NI multiplexing (including tag checking).
Protection must be implemented by the OS by validating requests for the creation of endpoints. Channel registration should also be implemented by the OS.
Message queues can be placed in different memories to optimize polling
– Receive queue allocated in host memory
– Send and free queues allocated in NIC memory

50 U-Net Performance on ATM

51 U-Net UDP Performance

52 U-Net TCP Performance

53 U-Net Latency

54 Virtual Memory-Mapped Communication
Receiver exports the receive buffers
Sender must import a receive buffer before sending
The permission of the sender to write into the receive buffer is checked once, when the export/import handshake is performed (usually at the beginning of the program)
Sender can directly communicate with the network interface to send data into imported buffers without kernel intervention
At the receiver, the network interface stores the received data directly into the exported receive buffer with no kernel intervention

55 Virtual-to-physical address
In order to store data directly into the application address space (exported buffers), the NI must know the virtual-to-physical translations
What to do?
receiver:
    int rec_buffer[1024];
    exp_id = export(rec_buffer, sender);
    recv(exp_id);
sender:
    int send_buffer[1024];
    recv_id = import(receiver, exp_id);
    send(recv_id, send_buffer);

56 Software TLB in Network Interface
The network interface must incorporate a TLB (NI-TLB) which is kept consistent with the virtual memory system
When a message arrives, the NI attempts a virtual-to-physical translation using the NI-TLB
If a translation is found, the NI transfers the data to the physical address in the NI-TLB entry
If a translation is missing in the NI-TLB, the processor is interrupted to provide the translation. If the page is not currently in memory, the processor will bring the page in. In any case, the kernel increments the reference count for that page to avoid swapping
When a page entry is evicted from the NI-TLB, the kernel is informed to decrement the reference count
Swapping prevented while DMA in progress
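The NI-TLB protocol described on this slide can be mimicked with a small simulation. Everything below is a hypothetical sketch: the class and method names are invented, the least-recently-used eviction policy is an assumption (the slide does not name one), the capacity is assumed to be at least 1, and paging a missing page in from disk is not modeled.

```python
from collections import OrderedDict


class Kernel:
    """Stands in for the host OS: owns the page table and the reference counts."""

    def __init__(self, page_table):
        self.page_table = dict(page_table)  # virtual page -> physical frame
        self.ref_counts = {}
        self.interrupts = 0

    def translate_and_pin(self, vpage):
        # Called on an NI-TLB miss: provide the translation and pin the page.
        self.interrupts += 1
        self.ref_counts[vpage] = self.ref_counts.get(vpage, 0) + 1
        return self.page_table[vpage]

    def unpin(self, vpage):
        # Called when the NI evicts an entry: the page may be swapped again.
        self.ref_counts[vpage] -= 1


class NiTlb:
    """Translation cache on the network interface (assumed LRU eviction)."""

    def __init__(self, kernel, capacity):
        self.kernel = kernel
        self.capacity = capacity
        self.entries = OrderedDict()

    def translate(self, vpage):
        if vpage in self.entries:
            self.entries.move_to_end(vpage)  # hit: no kernel involvement
            return self.entries[vpage]
        frame = self.kernel.translate_and_pin(vpage)  # miss: interrupt the host
        if len(self.entries) >= self.capacity:
            evicted, _ = self.entries.popitem(last=False)
            self.kernel.unpin(evicted)
        self.entries[vpage] = frame
        return frame


kernel = Kernel({0: 10, 1: 11, 2: 12})
tlb = NiTlb(kernel, capacity=2)
frames = [tlb.translate(v) for v in (0, 1, 0, 2)]
```

In this run the third lookup hits in the NI-TLB, so the host is interrupted only three times, and the eviction of page 1 drops its reference count back to zero while pages 0 and 2 stay pinned.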

