CS 505: Computer Structures, Networks
Thu D. Nguyen
Computer Science, Rutgers University, Spring 2005


Basic Message Passing
[Diagram: processes P0 and P1, on nodes N0 and N1, connected by a communication fabric; P0 issues a Send and P1 a Receive.]

Terminology
Basic message passing:
- Send: analogous to mailing a letter
- Receive: analogous to picking up a letter from the mailbox
- Scatter-gather: the ability to "scatter" the data items in a message into multiple memory locations, and to "gather" data items from multiple memory locations into one message
Network performance:
- Latency: the time from when a Send is initiated until the first byte is received by a Receive
- Bandwidth: the rate at which a sender is able to send data to a receiver
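The latency and bandwidth definitions above combine into a common first-order cost model for a single message: time = latency + size / bandwidth. A minimal sketch (the numbers are illustrative, not from the slides):

```python
def transfer_time(msg_bytes, latency_s, bandwidth_bytes_per_s):
    """First-order model: delivery time is the one-way latency plus
    the time to push the message's bytes onto the link."""
    return latency_s + msg_bytes / bandwidth_bytes_per_s

# Illustrative numbers: 1 KiB message, 20 us latency, 100 MB/s bandwidth.
# Latency dominates for small messages; bandwidth dominates for large ones.
t_small = transfer_time(1024, 20e-6, 100e6)
t_large = transfer_time(10 * 1024 * 1024, 20e-6, 100e6)
```
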

Scatter-Gather
[Diagram: on Receive, scatter spreads the items of one message across multiple memory locations; on Send, gather collects items from multiple memory locations into one message.]

Network Topologies

Terminology
- Network partition: when a network is broken into two or more components that cannot communicate with each other.
- Diameter: the maximum length of the shortest path between any two processors.
- Connectivity: a measure of the multiplicity of paths between any two processors; the minimum number of links that must be removed to partition the network.
- Bisection width: the minimum number of links that must be removed to partition the network into two equal halves.
- Bisection bandwidth: the minimum volume of communication allowed between any two halves of the network with an equal number of processors.
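The diameter definition above can be checked mechanically: a small sketch that computes the maximum shortest-path length by breadth-first search over an adjacency map (the 8-node ring example is ours, not from the slides):

```python
from collections import deque

def diameter(adj):
    """Diameter = max over all node pairs of the shortest-path length.
    adj maps each node to the list of its neighbors."""
    def eccentricity(src):
        dist = {src: 0}
        q = deque([src])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        return max(dist.values())
    return max(eccentricity(n) for n in adj)

# 8-node ring: the farthest pair is halfway around, so diameter = 8 // 2 = 4.
ring = {i: [(i - 1) % 8, (i + 1) % 8] for i in range(8)}
assert diameter(ring) == 4
```
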

Bisection Bandwidth = Bisection Width * Link Bandwidth

Typical Network Diagram

Typical Node
[Diagram: a node containing a CPU and memory, attached to the network through a NIC and a router.]

Bus-Based Network
Advantages:
- Simple
- Diameter = 1
Disadvantages:
- Blocking
- Bandwidth does not scale with p
- Easy to partition the network

Completely-Connected Network
Advantages:
- Diameter = 1
- Bandwidth scales with p
- Non-blocking
- Difficult to partition the network
Disadvantages:
- Number of links grows as O(p^2)
- Fan-in (and fan-out) at each node grows linearly with p
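The O(p^2) growth follows from counting processor pairs: a completely-connected network needs one link per pair, i.e., p(p-1)/2 links. A one-line sketch (the helper name is ours):

```python
def num_links(p):
    """Links in a completely-connected network: one per processor pair."""
    return p * (p - 1) // 2

assert num_links(4) == 6
assert num_links(16) == 120  # quadrupling p roughly 16x-es the link count
```
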

Star Network
Essentially the same as a bus-based network: the central node plays the role of the shared bus.

Ring Network

Mesh and Torus Network

Multistage Network

Perfect Shuffle

Omega Network: log(p) Stages

Blocking in Omega Network

Tree Network

Fat Tree Network

Hypercube Network

Hypercube Network

k-ary d-cube Networks
- k: radix of the network, i.e., the number of processors in each dimension
- d: dimension of the network
- A k-ary d-cube can be constructed from k k-ary (d-1)-cubes by connecting the nodes occupying identical positions into rings
Examples:
- Hypercube: binary d-cube
- Ring: p-ary 1-cube
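The definitions above fix the node count: k processors per dimension across d dimensions gives p = k^d. A small sketch checking the two examples from the slide:

```python
def num_nodes(k, d):
    """A k-ary d-cube has k processors per dimension over d dimensions."""
    return k ** d

assert num_nodes(2, 4) == 16  # binary 4-cube: a 16-node hypercube
assert num_nodes(8, 1) == 8   # 8-ary 1-cube: an 8-node ring
assert num_nodes(4, 2) == 16  # 4-ary 2-cube: a 4x4 torus
```
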

Arbitrary Topology Networks
[Diagram: nodes connected through switches in an arbitrary topology.]

Network Characteristics

Packet vs. Wormhole Routing
[Diagram: the same message transmitted as a sequence of packets vs. as a single worm.]

Store-and-Forward vs. Cut-Through Routing
- Store-and-forward: cannot route/forward a packet until the entire packet has been received
- Cut-through: can route/forward a packet as soon as the router has received and processed the header
- Wormhole routing is always cut-through, because there is not enough buffer space to hold an entire message
- Packet routing is almost always cut-through as well
- Difference: when blocked, a worm can span multiple routers, while a packet fits entirely into the buffer of a single router
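The two schemes imply different latency formulas under a simple model (uniform link bandwidth, negligible per-hop routing delay; the parameter values below are illustrative): store-and-forward pays the full serialization time at every hop, while cut-through pays it once and only delays the header per hop.

```python
def store_and_forward_time(msg_bytes, hops, link_bw):
    """Each router receives the whole packet before forwarding,
    so the full serialization time is paid at every hop."""
    return hops * (msg_bytes / link_bw)

def cut_through_time(msg_bytes, header_bytes, hops, link_bw):
    """Only the header is delayed per hop; the body pipelines behind it."""
    return hops * (header_bytes / link_bw) + msg_bytes / link_bw

# 1 KiB packet, 16-byte header, 4 hops, 100 MB/s links:
# cut-through pays the body's serialization time once, not per hop.
assert cut_through_time(1024, 16, 4, 100e6) < store_and_forward_time(1024, 4, 100e6)
```
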

Collective Communication Primitives
- Send/Receive: necessary and sufficient
- Broadcast, multicast:
  - one-to-all, all-to-all, one-to-all personalized, all-to-all personalized
  - flood
- Reduction: all-to-one, all-to-all
- Scatter, gather
- Barrier

Broadcast and Multicast
[Diagram: broadcast delivers one message from P0 to all of P1-P3; multicast delivers it only to a chosen subset.]

All-to-All
[Diagram: each of P0-P3 sends a message to every other processor.]

Reduction
Sequential sum:
  sum <- 0
  for i <- 1 to p do
    sum <- sum + A[i]
[Diagram: tree reduction across P0-P3: A[0]+A[1] and A[2]+A[3] are computed in parallel, then combined into A[0]+A[1]+A[2]+A[3].]
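The tree reduction in the diagram can be sketched as follows: the number of active partners halves each round, so p values are summed in about log2(p) communication steps rather than p-1.

```python
def tree_reduce(values):
    """Pairwise tree reduction, as in the P0..P3 diagram: in each round,
    every surviving rank adds in its partner's value at distance `stride`."""
    vals = list(values)
    stride = 1
    while stride < len(vals):
        for i in range(0, len(vals), 2 * stride):
            if i + stride < len(vals):
                vals[i] += vals[i + stride]  # partner sends to the lower rank
        stride *= 2
    return vals[0]

assert tree_reduce([1, 2, 3, 4]) == 10  # (1+2) and (3+4) in parallel, then 3+7
```
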

Ring Broadcast O(p)

Ring Broadcast O(log p)
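The O(log p) bound comes from recursive doubling: in each round, every processor that already holds the message forwards it to one that does not, so the number of informed processors doubles. A sketch that just counts rounds:

```python
import math

def broadcast_rounds(p):
    """Rounds needed to inform p processors when the informed set
    doubles each round (recursive doubling)."""
    have, rounds = 1, 0
    while have < p:
        have = min(2 * have, p)
        rounds += 1
    return rounds

assert broadcast_rounds(8) == 3   # 1 -> 2 -> 4 -> 8
assert broadcast_rounds(5) == math.ceil(math.log2(5))  # non-power-of-two: 3
```
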

Mesh Broadcast

Computation vs. Communication Cost
- 2 GHz clock => 1/2 ns instruction cycle
- Memory access:
  - L1: ~2-4 cycles => 1-2 ns
  - L2: ~5-10 cycles => 2.5-5 ns
  - Memory: ~400 cycles => 200 ns
- Message roundtrip latency: ~20 µs
- Suppose a 75% hit ratio in L1, no L2, 1 ns L1 access time, and 200 ns memory access time => average memory access time of ~51 ns
- One message roundtrip latency = ~400 memory accesses
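The slide's arithmetic can be reproduced directly: average access time is the hit-ratio-weighted mix of L1 and memory latencies, and the roundtrip-to-access ratio follows from the 20 µs roundtrip.

```python
def avg_access_time(l1_hit_ratio, l1_ns, mem_ns):
    """Average memory access time under the slide's two-level model
    (L1 plus main memory, no L2)."""
    return l1_hit_ratio * l1_ns + (1 - l1_hit_ratio) * mem_ns

avg = avg_access_time(0.75, 1.0, 200.0)  # 0.75*1 + 0.25*200 = 50.75 ns
roundtrip_ns = 20_000                     # 20 us message roundtrip
accesses_per_roundtrip = roundtrip_ns / avg
assert round(avg) == 51
assert 390 < accesses_per_roundtrip < 400  # the slide's "~400 accesses"
```
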

Performance ... Always Performance!
So, obviously, when we talk about message passing, we want to know how to optimize for performance.
But which aspects of message passing should we optimize?
- We could try to optimize everything
  - Optimizing the wrong thing wastes precious resources; e.g., optimizing how we leave mail for the mail-person does not significantly increase the overall "speed" of mail delivery

Martin et al.: LogP Model

Sensitivity to LogGP Parameters
LogGP parameters:
- L = delay incurred in passing a short message from source to destination
- o = processor overhead involved in sending or receiving a message
- g = minimum time between message transmissions or receptions (message bandwidth)
- G = bulk gap = time per byte transferred for long transfers (byte bandwidth)
Experimental setup:
- Workstations connected by a Myrinet network and the Generic Active Messages layer
- Delay-insertion technique used to vary the parameters
- Applications written in Split-C, but performing their own data caching
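One common way to combine these parameters is to cost a burst of n short messages from one processor to another as o + (n-1)g + L + o: send overhead, injection at the gap rate, wire delay for the last message, and receive overhead. A sketch with hypothetical parameter values (not the paper's measurements), assuming g >= o:

```python
def loggp_n_messages(n, L, o, g):
    """Time for one processor to send n short messages to another under
    LogGP (g >= o assumed): the sender injects one message every g seconds;
    the last message takes L to arrive and o to be received."""
    return o + (n - 1) * g + L + o

# Hypothetical parameters, in seconds: L = 5 us, o = 2 us, g = 6 us.
t = loggp_n_messages(10, 5e-6, 2e-6, 6e-6)
assert abs(t - (2e-6 + 9 * 6e-6 + 5e-6 + 2e-6)) < 1e-12
```

Note how the model separates overhead (o, which burns CPU cycles) from latency (L, which can be overlapped with computation); this separation is exactly what the sensitivity study varies.
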

Sensitivity to Overhead

Sensitivity to Gap

Sensitivity to Latency

Sensitivity to Bulk Gap

Summary
- Runtime is strongly dependent on overhead and gap
- Strong dependence on gap because of the burstiness of communication
- Not so sensitive to latency => computation and communication can be effectively overlapped with non-blocking reads (writes usually do not stall the processor)
- Not sensitive to bulk gap => we have more bandwidth than we know what to do with

What's the Point?
What can we take away from Martin et al.'s study?
- It is extremely important to reduce overhead, because it may affect both "o" and "g"
- All the "action" is currently in the OS and the network interface card (NIC)
This is the subject of von Eicken et al., "Active Messages: a Mechanism for Integrated Communication and Computation," ISCA 1992.

User-Level Access to NIC
Basic idea: allow protected user-level access to the NIC so that communication protocols can be implemented at user level.

User-Level Communication
Basic idea: remove the kernel from the critical path of sending and receiving messages
- user memory to user memory: zero copy
- permission is checked once, when the mapping is established
- buffer management is left to the application
Advantages:
- low communication latency
- low processor overhead
- approaches the raw latency and bandwidth provided by the network
One approach: U-Net

U-Net Abstraction

U-Net Endpoints

U-Net Basics
Protection is provided by endpoints and communication channels:
- Endpoints, communication segments, and message queues are accessible only by the owning process (all are allocated in user memory)
- Outgoing messages are tagged with the originating endpoint address; incoming messages are demultiplexed and delivered only to the correct endpoints
For ideal performance, firmware on the NIC should implement the actual messaging and NI multiplexing (including tag checking). Protection must be implemented by the OS, which validates requests to create endpoints; channel registration should also be implemented by the OS.
Message queues can be placed in different memories to optimize polling:
- Receive queue allocated in host memory
- Send and free queues allocated in NIC memory

U-Net Performance on ATM

U-Net UDP Performance

U-Net TCP Performance

U-Net Latency

Virtual Memory-Mapped Communication
- The receiver exports its receive buffers
- The sender must import a receive buffer before sending
- The sender's permission to write into the receive buffer is checked once, when the export/import handshake is performed (usually at the beginning of the program)
- The sender can communicate directly with the network interface to send data into imported buffers without kernel intervention
- At the receiver, the network interface stores the received data directly into the exported receive buffer with no kernel intervention

Virtual-to-Physical Address Translation
To store data directly into the application address space (exported buffers), the NI must know the virtual-to-physical translations. What to do?

  receiver:
    int rec_buffer[1024];
    exp_id = export(rec_buffer, sender);
    recv(exp_id);

  sender:
    int send_buffer[1024];
    recv_id = import(receiver, exp_id);
    send(recv_id, send_buffer);

Software TLB in Network Interface
- The network interface must incorporate a TLB (NI-TLB) that is kept consistent with the virtual memory system
- When a message arrives, the NI attempts a virtual-to-physical translation using the NI-TLB
- If a translation is found, the NI transfers the data to the physical address in the NI-TLB entry
- If the translation is missing from the NI-TLB, the processor is interrupted to provide it; if the page is not currently in memory, the processor brings the page in. In either case, the kernel increments the reference count on the page to prevent it from being swapped out
- When an entry is evicted from the NI-TLB, the kernel is informed so that it can decrement the reference count
- Swapping is prevented while a DMA is in progress
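The NI-TLB lookup path above can be sketched as follows. This is a minimal simulation, not U-Net's or any real NIC's code: the host page table, the pin/unpin reference counts, and the miss interrupt are all modeled with plain Python objects.

```python
class NiTlb:
    """Toy model of a network-interface TLB kept consistent with the host
    VM system via reference counts (pinned pages cannot be swapped out)."""

    def __init__(self, page_table, refcounts):
        self.entries = {}             # vpn -> physical frame (the NI-TLB)
        self.page_table = page_table  # stands in for the host VM system
        self.refcounts = refcounts    # per-frame pin counts held by the NI

    def translate(self, vpn):
        if vpn in self.entries:       # NI-TLB hit: DMA proceeds directly
            return self.entries[vpn]
        # NI-TLB miss: "interrupt" the host processor for the translation
        # (a real host would page the data in if it were not resident).
        frame = self.page_table[vpn]
        self.refcounts[frame] = self.refcounts.get(frame, 0) + 1  # pin page
        self.entries[vpn] = frame
        return frame

    def evict(self, vpn):
        frame = self.entries.pop(vpn)
        self.refcounts[frame] -= 1    # unpin: kernel may swap the page again

tlb = NiTlb(page_table={7: 42}, refcounts={})
assert tlb.translate(7) == 42   # miss: fetch translation and pin the page
assert tlb.refcounts[42] == 1
assert tlb.translate(7) == 42   # hit: no interrupt, no extra pin
assert tlb.refcounts[42] == 1
tlb.evict(7)
assert tlb.refcounts[42] == 0   # kernel told to decrement the count
```
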