Make Hosts Ready for Gigabit Networks

Hardware Requirement

To allow a host to fully utilize Gbps bandwidth, its hardware system must be ready for Gbps. For example:
– CPU speed: Is a 100 MHz Pentium PC fast enough to process a large number of packets per second? (10 bits/Hz?)
– Memory throughput: Is SDRAM's sustained throughput large enough to move data in and out of it at Gbps?
– I/O bus bandwidth: Is a 32-bit 33 MHz PCI bus fast enough to move data at Gbps?
– Network interface: Is the firmware on the NIC fast enough to process packets at Gbps?
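A back-of-the-envelope check for the I/O bus question (theoretical peak only, ignoring bus arbitration and protocol overhead):

    32 bits/cycle x 33 x 10^6 cycles/s = 1.056 x 10^9 bits/s ~ 1.06 Gbps

So a 32-bit 33 MHz PCI bus can, at best, barely carry 1 Gbps in one direction; the sustained rate falls well below that once overhead is counted.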

Software Design

If a host's hardware system can barely support Gbps bandwidth, its software system must be carefully designed so that Gbps can still be achieved for an application. For example:
– NIC device driver in the OS
– TCP/IP protocol stack in the OS
– Routing table lookup in the OS
– Buffer system in the OS
– API between the OS and application programs
– Networking services (e.g., NAT, firewall)
– (Improving the design and implementation of software systems is the focus of this course.)

The Path of Moving Data

What networking does is basically move data from a networking application on one machine to a networking application on another machine. The path of moving data is:
– Application -> operating system -> network interface -> network -> network interface -> operating system -> application.
Therefore, to achieve Gbps, moving data between the application and the operating system, and between the operating system and the network interface, must be performed at least at Gbps.

The Cost of Moving Data

The cost of moving data is very high.
– CPU speed has improved continuously and has now reached 2 GHz. However, the throughput and access speed of memory (e.g., SDRAM) remain about the same as they were a few years ago.
– Therefore, the CPU now wastes more and more clock cycles waiting to access a word in memory.
– The cost of moving data thus becomes increasingly high, and memory becomes the performance bottleneck.
Therefore, the goal is to minimize the need for moving data or to hide the cost of moving data.

Hide Memory Access Cost (1)

Scoreboarding processor
– Instructions that load data into a register do not wait for the data to come back from memory; instead, they mark the register as awaiting data. (single instruction stream)
– The processor can then continue execution.
– Only if an instruction accesses the register before the memory access has completed does the processor need to stall.

Hide Memory Access Cost (2)

Super-scalar processor
– Permits independent instructions to be executed in the same clock cycle. (multiple instruction streams)
– Therefore, an instruction that is loading data from memory can execute in parallel with an instruction that does not need that data.
Both the scoreboarding and super-scalar methods benefit reads of small amounts of data; they are not very useful for reading large amounts of data. Therefore, the operating system should be designed to minimize the number of times a large amount of data has to be copied.

Host Memory Hierarchy

Good cache performance depends on good locality. However, networking code often violates the locality assumptions.
– Example: when a packet arrives, it interrupts the execution of the processor, which forces the processor to load new instructions. Furthermore, because the packet's data is not in the data cache, it must be fetched from memory.

The Problem with Layered Code

Layering is a useful concept that lets network researchers work concurrently on different aspects of a networking problem. However, an implementation of a protocol stack based on strict layering often results in bad performance.
– Because the upper layer does not know which format the lower layer wants, packets copied into the lower layer often need to be reformatted and recopied.
– Nowadays, more and more implementations violate the layering concept for higher performance. E.g., content-aware (URL) Web switching at an IP router.

Reduce Memory Copy Operations

Currently, on a UNIX host, two data copy operations are needed to move data from an application to the network interface:
– Application -> OS
– OS -> network interface

One-Copy Techniques

Virtual page remapping:
– The first copy can be eliminated by using the virtual memory mechanism to map the pages used by the application onto the pages used by the buffer in the OS.
– For this mechanism to work, the buffers in the application must start and end at page boundaries.
Copy-on-write:
– The first copy can also be avoided by COW.
– If a packet needs to be copied from one domain to another, copy-on-write can be used to reduce or eliminate the copy operation.
– The packet's pages are physically copied only when the packet is modified; otherwise, the same pages are shared by the different domains.
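A minimal sketch of the alignment constraint these techniques impose (the buffer size is illustrative; whether the kernel actually remaps or COW-shares the pages instead of copying depends on the OS):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        long page = sysconf(_SC_PAGESIZE);   /* page size, e.g., 4096 */
        size_t len = 16 * (size_t)page;      /* length must be a page multiple */
        void *buf;

        /* Page-aligned, page-multiple buffer: a precondition for the
         * kernel to remap (or COW-share) these pages instead of copying. */
        if (posix_memalign(&buf, (size_t)page, len) != 0) {
            perror("posix_memalign");
            return 1;
        }
        memset(buf, 0, len);                 /* fill with application data */
        printf("buffer at %p, %zu bytes, page size %ld\n", buf, len, page);
        free(buf);
        return 0;
    }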

One-Copy Techniques

Memory-mapped buffer:
– The second copy can be eliminated by mapping the memory on the NIC into a part of the system memory address space.
– The OS can then use the mapped system memory as its buffer area.
– Therefore, when the application's data is copied to the buffer in the OS, it is effectively copied into the memory on the NIC.
– (The PCI specification allows memory on a PCI card to be mapped into a part of the system memory address space.)
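A sketch of what a memory-mapped NIC buffer could look like from user space; /dev/mynic and the 64 KB window are hypothetical, and real drivers map PCI memory inside the kernel, but the mmap() pattern conveys the idea:

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        /* Hypothetical character device exported by the NIC driver. */
        int fd = open("/dev/mynic", O_RDWR);
        if (fd < 0) { perror("open"); return 1; }

        /* Map 64 KB of on-NIC buffer memory into our address space.
         * Stores into 'nicbuf' now land directly in NIC memory, so the
         * usual OS-to-NIC copy disappears. */
        size_t len = 64 * 1024;
        void *nicbuf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_SHARED, fd, 0);
        if (nicbuf == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

        memcpy(nicbuf, "payload", 7);   /* the "copy" writes straight to the NIC */

        munmap(nicbuf, len);
        close(fd);
        return 0;
    }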

Zero-Copy Technique

Memory-mapped buffer + virtual page remapping:
– We first map the NIC's buffer to the buffer in the OS (a PCI hardware map operation).
– We then map the buffer in the application to the buffer in the OS (an OS software map operation).
– The buffer in the application is then effectively mapped to the NIC's buffer.
– This results in a zero-copy operation.
Although zero-copy is good from the network performance viewpoint, it is very difficult for the application to use.
– The application now needs to know hardware details, which should be hidden by the OS.

DMA Technique

To avoid the data copy operation between the OS and the NIC, we can use DMA instead of normal programmed I/O (PIO). Using DMA, a NIC can transfer data directly from/to memory without involving the CPU. This enables the CPU to execute in parallel with the data transfer. (However, the CPU may still be stalled.) Generally, DMA's performance is better than PIO's. However, there are some situations where PIO is preferred (e.g., checksumming, since with PIO the data passes through the CPU anyway). Scatter-gather capability in a DMA-based NIC is important because it avoids data copy operations.
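A sketch of why scatter-gather matters: the NIC walks a descriptor list, so a packet whose header and payload live in separate buffers can be sent without first copying them into one contiguous region. The descriptor layout below is illustrative, not any particular NIC's format:

    #include <stdint.h>
    #include <stdio.h>

    /* Illustrative scatter-gather descriptor: each entry names one
     * contiguous fragment; the NIC's DMA engine walks the list and
     * concatenates the fragments on the wire. (A real driver would
     * store DMA/bus addresses here, not virtual addresses.) */
    struct sg_desc {
        uint64_t addr;   /* address of the fragment */
        uint32_t len;    /* fragment length in bytes */
        uint32_t flags;  /* e.g., end-of-packet marker */
    };

    #define SG_EOP 0x1   /* last fragment of the packet */

    int main(void)
    {
        uint8_t header[54];      /* Ethernet + IP + TCP headers */
        uint8_t payload[1406];   /* application data, left in place */

        /* Two fragments, zero copies: header and payload stay where
         * they are, and the descriptors stitch them together. */
        struct sg_desc ring[2] = {
            { (uint64_t)(uintptr_t)header,  sizeof header,  0      },
            { (uint64_t)(uintptr_t)payload, sizeof payload, SG_EOP },
        };

        for (int i = 0; i < 2; i++)
            printf("desc %d: %u bytes%s\n", i, (unsigned)ring[i].len,
                   (ring[i].flags & SG_EOP) ? " (EOP)" : "");
        return 0;
    }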

Buffer Editing

To support Gbps, the design of a buffer system should allow buffers to be created, clipped, shared, split, concatenated, and destroyed with little overhead. Otherwise, a packet may need to be copied to a new buffer again and again while traversing the layers of a protocol stack.
– E.g., as a packet goes down/up a protocol stack when it is sent/received, more and more headers need to be prepended to/stripped from it.
Generally, lists or tree structures are used as the data structure, because they easily support the above operations.
– E.g., the mbuf used in BSD UNIX.
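A minimal sketch in the spirit of BSD's mbuf (greatly simplified, with no bounds checking): each buffer keeps headroom in front of the data so a layer can prepend its header by moving a pointer instead of copying the packet:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Simplified chained buffer in the spirit of BSD's mbuf. */
    struct buf {
        struct buf *next;    /* next buffer in the chain */
        char  *data;         /* start of valid data */
        size_t len;          /* bytes of valid data */
        char   storage[256]; /* data grows backward from the end */
    };

    /* Create a buffer whose data sits at the tail of 'storage',
     * leaving headroom in front for headers added by lower layers. */
    static struct buf *buf_create(const void *p, size_t n)
    {
        struct buf *b = calloc(1, sizeof *b);
        b->data = b->storage + sizeof b->storage - n;
        b->len = n;
        memcpy(b->data, p, n);
        return b;
    }

    /* Prepend a header by moving the data pointer: the payload is
     * never copied, no matter how many layers add headers. */
    static void buf_prepend(struct buf *b, const void *hdr, size_t n)
    {
        b->data -= n;
        b->len += n;
        memcpy(b->data, hdr, n);
    }

    int main(void)
    {
        struct buf *b = buf_create("payload", 7);
        buf_prepend(b, "TCP:", 4);   /* transport header */
        buf_prepend(b, "IP:", 3);    /* network header */
        printf("%.*s\n", (int)b->len, b->data);   /* IP:TCP:payload */
        free(b);
        return 0;
    }

Splitting, clipping, and concatenation fall out of the same idea: they manipulate pointers and chain links rather than the data itself.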

API Design

The design of the application programming interface (API) can significantly affect the data-passing performance between the user application and the OS. Currently, the read() and write() system calls provided on UNIX allow the user to choose a buffer with arbitrary address, size, and alignment, and to access that buffer without constraint.
– This makes it difficult for the OS to avoid the data copy operation between the application and the OS.
Suppose instead that UNIX required the buffer to start and end at page boundaries and its length to be a multiple of the page size; then the copy-on-write technique could be used to avoid one copy operation.
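A sketch of the kind of check such a constrained API would enable; cow_eligible() is hypothetical, not a real system call:

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Hypothetical check an OS could apply on entry to write():
     * only a page-aligned, page-multiple buffer is eligible for
     * copy-on-write; anything else must be copied into a kernel
     * buffer the old way. */
    static int cow_eligible(const void *buf, size_t len)
    {
        long page = sysconf(_SC_PAGESIZE);
        return ((uintptr_t)buf % (uintptr_t)page == 0) &&
               (len % (size_t)page == 0);
    }

    int main(void)
    {
        char unaligned[100];
        printf("stack buffer COW-eligible? %s\n",
               cow_eligible(unaligned, sizeof unaligned) ? "yes" : "no");
        return 0;
    }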

Data Manipulation

Data manipulations are computations that inspect and possibly modify every word of data in a network packet.
– E.g., encryption, compression, checksumming, presentation conversion, etc.
Typically, different network layers manipulate data independently of each other. Each data manipulation requires the CPU to load potentially uncached data from memory and store the inspected/modified data back to memory. Therefore, the data must cross the CPU/memory path multiple times, which limits the achievable maximum throughput.
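As a toy illustration (a byte-wise sum and an XOR "encryption", neither of which is a real protocol's algorithm), two independent manipulation layers each sweep the entire packet across the CPU/memory path:

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Layer 1: checksum pass -- reads every byte of the packet. */
    static uint32_t checksum(const uint8_t *p, size_t n)
    {
        uint32_t sum = 0;
        for (size_t i = 0; i < n; i++)
            sum += p[i];
        return sum;
    }

    /* Layer 2: toy "encryption" pass -- reads and writes every byte again. */
    static void xor_encrypt(uint8_t *p, size_t n, uint8_t key)
    {
        for (size_t i = 0; i < n; i++)
            p[i] ^= key;
    }

    int main(void)
    {
        uint8_t pkt[1500] = { 1, 2, 3 };
        uint32_t sum = checksum(pkt, sizeof pkt);   /* memory pass #1 */
        xor_encrypt(pkt, sizeof pkt, 0x5A);         /* memory pass #2 */
        printf("sum=%u first=%u\n", (unsigned)sum, pkt[0]);
        return 0;
    }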

Integrated Layer Processing

The integrated layer processing (ILP) technique can be used to minimize the number of data transfers. The data manipulation steps from different protocol layers are combined into a pipeline: a word of data is loaded into a register, manipulated by multiple layers while it remains in a register, and finally stored – all before the next word of data is processed. In this way, a combined series of data manipulations transfers the data between memory and the CPU only once, instead of once per distinct layer. The difficulty is that different manipulations cannot always be easily integrated.
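The same two toy manipulations from the previous sketch, integrated into one pass in the spirit of ILP: each byte is loaded once, handled by both layers while in a register, and stored once:

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    /* One combined pass: load each byte once, apply both layers'
     * manipulations while it sits in a register, store it once. */
    static uint32_t checksum_and_encrypt(uint8_t *p, size_t n, uint8_t key)
    {
        uint32_t sum = 0;
        for (size_t i = 0; i < n; i++) {
            uint8_t b = p[i];   /* single load from memory */
            sum += b;           /* layer 1: checksum */
            p[i] = b ^ key;     /* layer 2: toy encryption, single store */
        }
        return sum;
    }

    int main(void)
    {
        uint8_t pkt[1500] = { 1, 2, 3 };
        uint32_t sum = checksum_and_encrypt(pkt, sizeof pkt, 0x5A);
        printf("sum=%u first=%u\n", (unsigned)sum, pkt[0]);
        return 0;
    }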

Copy-Avoiding Techniques Relationship

NIC-to-NIC Transfers

What we have discussed so far reduces the number of copy operations required for sending data from the user application, through the OS, to the NIC. Here, we discuss methods that can reduce the number of copy operations required for forwarding data from one NIC to another NIC (i.e., when the system functions as a routing or switching device).

Techniques for NIC-to-NIC (1)

Hardware streaming (peer-to-peer)
– The maximum achievable forwarding throughput is the I/O bus bandwidth.
– The problems are that special hardware is required and the OS has no chance to inspect/modify packets. As a result, some processing (e.g., routing table lookup) must be performed by the CPU on the NIC. However, for cost reasons, the CPUs on NICs are often much slower than the system CPU.
DMA-DMA streaming
– The maximum achievable throughput is only ½ of the I/O bus bandwidth, because every forwarded packet crosses the bus twice (inbound NIC -> memory, then memory -> outbound NIC).
– However, packets can be inspected/modified by the OS. E.g., routing table lookup.

Techniques for NIC-to-NIC (2)

OS kernel streaming
– Packets are first DMA'ed into memory.
– Packets are then read from memory into the CPU for inspection or modification.
– Depending on the number of inspections/modifications and the memory system's read/write throughput, the achievable maximum forwarding throughput is further limited to (memory throughput / number of reads and writes).
User-level streaming
– In some applications, packets may need to go up to the user level for inspection/modification. E.g., a Web proxy system, a relay system, NATD.
– The throughput is then limited even further.

Latency of Small Packets

For large packets, we care about the cost of copying them (i.e., transfer throughput). For small packets, however, what we care about is the latency of their transmission. The following three interactions between the processor and memory can affect latency:
– Branch misses
– Context switches
– Interrupts

Branch Misses

– To keep the instruction pipeline full (some processors have up to 13 pipeline stages), most processors today fetch instructions continuously.
– Conditional branches, however, present a problem, because the target instruction cannot be determined until the condition has been computed.
– If the CPU waited for the condition test to complete before fetching the next instruction, the pipeline could not be kept full most of the time, resulting in low CPU utilization.
– To solve this problem, most processors today try to predict the next instruction to execute.
– If the guess is wrong, the instructions that were already fetched must be abandoned. This also results in low CPU utilization.
Do not put too many if-then-else branches in your networking code.
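A small illustration of keeping the common path branch-friendly; __builtin_expect is a GCC/Clang hint, and the packet checks are only examples:

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    #define likely(x)   __builtin_expect(!!(x), 1)
    #define unlikely(x) __builtin_expect(!!(x), 0)

    /* Fast path first, rare cases hinted as unlikely, so the branch
     * predictor (and the compiler's code layout) favors the
     * straight-line path through the packet handler. */
    static int handle_packet(const uint8_t *pkt, size_t len)
    {
        if (unlikely(len < 20))        /* runt packet: rare */
            return -1;
        if (unlikely(pkt[0] != 0x45))  /* not plain IPv4: rare here */
            return -1;
        return 0;                      /* common case: accept */
    }

    int main(void)
    {
        uint8_t pkt[64] = { 0x45 };
        printf("%d\n", handle_packet(pkt, sizeof pkt));
        return 0;
    }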

Context Switches

– Context switches are very expensive, because they require that both new code and new data be fetched from slow memory and loaded into the processor cache.
– In a perfect system, no more than one context switch should be needed to send a packet, and one context switch plus an interrupt to receive a packet.
– In a micro-kernel OS, sending and receiving a packet need more context switches than in a traditional UNIX kernel (because the packet must traverse the application program, network server, and micro-kernel domains).
For a high-speed system, macros are preferred over function calls, and function calls are preferred over threads (a thread switch must save the PC and stack) for processing an incoming packet.

Interrupts

Interrupts are very expensive.
– They cause context switches, which in turn cause many code and data cache misses.
– Sometimes the host's priority mode must be changed from user mode to privileged mode when an interrupt occurs. Changing mode, however, is a very slow operation.
One solution is to minimize the number of interrupts.
– Do not issue a receive interrupt for every incoming packet. Issue an interrupt only when a certain number of packets have been received or a timer has expired.
– When a receive interrupt occurs, the device driver retrieves and processes as many packets from the NIC as possible.
– Do not issue a transmit interrupt for every sent packet. Issue a transmit interrupt only after a certain number of packets have been sent or a timer has expired.
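A sketch of the "drain the ring per interrupt" rule; the NIC hooks are hypothetical and stubbed here so the sketch runs:

    #include <stdio.h>

    /* Hypothetical driver hooks; stubbed so the sketch runs. */
    struct packet { int id; };
    static struct packet ring[4] = { {1}, {2}, {3}, {4} };
    static int rx_head = 0, rx_tail = 4;

    static struct packet *nic_next_rx(void)   /* NULL when RX ring is empty */
    {
        return (rx_head < rx_tail) ? &ring[rx_head++] : NULL;
    }

    static void deliver(struct packet *p)     /* hand packet to the stack */
    {
        printf("delivered packet %d\n", p->id);
    }

    /* One receive interrupt drains the whole RX ring, amortizing the
     * interrupt's cost (context switch, cache disruption) over many
     * packets instead of paying it once per packet. */
    static void rx_interrupt_handler(void)
    {
        struct packet *p;
        while ((p = nic_next_rx()) != NULL)
            deliver(p);
    }

    int main(void)
    {
        rx_interrupt_handler();   /* one "interrupt", four packets */
        return 0;
    }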

Receive Livelock Problem

Receive livelock can happen in an interrupt-driven kernel. It occurs when packets arrive at the system at high rates. When it occurs, the system spends all of its time processing interrupts, to the exclusion of other necessary tasks. The result is that, under extreme conditions, no packets are delivered to the user application or to the output of the system. To avoid this problem, tasks and interrupts must be carefully scheduled.

Techniques to Avoid Livelock

Limit the interrupt arrival rate
– For example, when the ipintr queue is about to be full and packets are about to be dropped, we can temporarily disable interrupts. Interrupts can be re-enabled when the buffer occupancy of the ipintr queue drops below a certain threshold.
Use polling
– Poll each NIC at a fixed rate. This limits the packet processing rate and also provides fair resource allocation among multiple interfaces.
Avoid preemption
– Let higher-level protocol processing (e.g., TCP/IP) execute at the same priority level as the interrupt service routine, so that it cannot be starved by a flood of further interrupts.
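A sketch of the disable-interrupts-then-poll pattern (in the spirit of what later became Linux's NAPI); the NIC hooks and the budget of 64 packets per poll are illustrative:

    #include <stdio.h>

    /* Hypothetical NIC hooks, stubbed so the sketch runs. */
    static int pending = 200;                 /* packets waiting in the RX ring */
    static void nic_disable_rx_irq(void) { puts("rx irq off"); }
    static void nic_enable_rx_irq(void)  { puts("rx irq on"); }
    static int  nic_rx_one(void) { return pending > 0 ? (pending--, 1) : 0; }

    /* On interrupt: switch the NIC to polled mode, then let a scheduled
     * poll routine process at most 'budget' packets per call, so packet
     * processing can never starve the rest of the system (the livelock). */
    static void rx_interrupt(void) { nic_disable_rx_irq(); }

    static void poll_nic(int budget)
    {
        int done = 0;
        while (done < budget && nic_rx_one())
            done++;
        if (done < budget)        /* ring drained: go back to interrupts */
            nic_enable_rx_irq();
        printf("polled %d packets\n", done);
    }

    int main(void)
    {
        rx_interrupt();           /* first packet arrival */
        while (pending > 0)
            poll_nic(64);         /* scheduler invokes the poll at a fixed rate */
        return 0;
    }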