1
Low Overhead Fault Tolerant Networking (in Myrinet)
Architecture and Real-Time Systems (ARTS) Lab. Department of Electrical and Computer Engineering University of Massachusetts Amherst MA 01003
2
Motivation
- The increasing use of COTS components in systems has been motivated by the need to
  - reduce cost in design and maintenance
  - reduce software complexity
- Low-cost, high-performance COTS networking solutions have emerged, e.g., Myrinet, SCI, Fibre Channel
- The increasing complexity of network interfaces has renewed concerns about their reliability
  - The amount of silicon used has increased tremendously
- We need to use what is available to provide fault detection and recovery
  - Nothing fancy is typically available
3
The Basic Question
How can we incorporate fault tolerance into a COTS network technology without greatly compromising its performance?
4
Microprocessor-based Networks
- Most modern network technologies have processors in their interface cards that help to achieve superior network performance
- Many of these technologies allow changes to the program running on the network processor
- Such programmable interfaces offer numerous benefits:
  - Developing different fault tolerance techniques
  - Validating fault recovery using fault injection
  - Experimenting with different communication protocols
- We use Myrinet as the platform for our study
You have to tell here that we now look at one side of the relationship: the host processor coming to the rescue of the network interface. We will first take a sample high-speed network and study its vulnerability to failures. What are the drawbacks of the current network software and how we have
5
Myrinet
- Myrinet is a cost-effective, high-performance (2.2 Gb/s) packet-switching technology
- At its core is a powerful RISC processor
- It is scalable to thousands of nodes
- Low-latency communication (8 µs) is achieved through direct interaction with the network interface (“OS bypass”)
- Flow control, error control and simple “heartbeat mechanisms” are incorporated in hardware
- Link and routing specifications are public and standard
- Myrinet support software is supplied “open source”
6
Myrinet Configuration
[Figure: block diagram of a Myrinet host node. Host side: host processor, system memory, system bridge, I/O bus. LANai 9 interface card: RISC processor, SRAM, timers, PCI bridge, PCIDMA, DMA engine, host interface, packet interface, SAN/LAN conversion.]
7
Myrinet Control Program
[Figure: hardware and software layers. Host: application, middleware (e.g., MPI), TCP/IP interface, OS driver, host processor, system memory. Across the I/O bus, the Myrinet card (programmable interface): network processor, local memory, Myrinet Control Program.]
8
Susceptibility to Failures
- Dependability evaluation was carried out using software-implemented fault injection
  - Faults were injected in the Myrinet Control Program (MCP)
- A wide range of failures was observed
  - Unexpected latencies and reduction of bandwidth
  - The network processor can hang and stop responding
  - A host system can crash or hang
  - A remote network interface can be affected
- Similar types of failures can be expected from other high-speed networks
- Such failures can greatly impact the reliability and availability of the system
9
Summary of Experiments
Failure Category              Count   % of Injections
Host Interface Hang             514   24.6
Messages Dropped/Corrupted      264   12.7
MCP Restart                      65    3.1
Host Computer Crash               9    0.43
Other Errors                     23    1.15
No Impact                      1205   57.9
Total                          2080   100
More than 50% of the failures were host interface hangs
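The closing claim on this slide can be checked from the counts alone. The short sketch below recomputes the percentages and the share of host interface hangs among failures. The variable names are ours, and it assumes that every category other than "No Impact" counts as a failure.

```python
# Counts copied from the Summary of Experiments slide.
counts = {
    "Host Interface Hang": 514,
    "Messages Dropped/Corrupted": 264,
    "MCP Restart": 65,
    "Host Computer Crash": 9,
    "Other Errors": 23,
    "No Impact": 1205,
}

total = sum(counts.values())
# Assumption: anything that is not "No Impact" is treated as a failure.
failures = total - counts["No Impact"]

for category, n in counts.items():
    print(f"{category:28s} {n:5d} {100 * n / total:6.2f}%")

hang_share = 100 * counts["Host Interface Hang"] / failures
print(f"Host interface hangs as a share of failures: {hang_share:.1f}%")
```

Under that assumption, 514 of 875 failures are host interface hangs, which is consistent with the "more than 50%" statement.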
10
Design Considerations
- The faults must be detected and diagnosed as quickly as possible
- The network interface must be up and running as soon as possible
- The recovery process must ensure that no messages are lost or improperly received/sent
  - Complete correctness should be achieved
- The overhead on the normal running of the system must be minimal
- The fault tolerance should be made as transparent to the user as possible
11
Fault Detection
- Continuously polling the card can be very costly
- We use a spare interval timer to implement watchdog timer functionality for fault detection
  - We set the LANai to raise an interrupt when the timer expires
  - A routine (L_timer) that the LANai is supposed to execute every so often resets this interval timer
  - If the interface hangs, L_timer is not executed, so our interval timer expires and raises a FATAL interrupt
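The watchdog idea can be illustrated with a small simulation. This is only a sketch in Python, not the LANai firmware: the class `WatchdogTimer`, the function `simulate`, and the tick-based time model are all hypothetical names and simplifications introduced here. The periodic reset stands in for L_timer, and the expiry callback stands in for the FATAL interrupt.

```python
class WatchdogTimer:
    """Toy countdown timer that calls on_expire once when it reaches zero."""

    def __init__(self, timeout_ticks, on_expire):
        self.timeout_ticks = timeout_ticks
        self.remaining = timeout_ticks
        self.on_expire = on_expire
        self.expired = False

    def reset(self):
        # Called by the periodic routine while the interface is healthy.
        self.remaining = self.timeout_ticks

    def tick(self):
        # One unit of time passes; fire the callback only once.
        if self.expired:
            return
        self.remaining -= 1
        if self.remaining <= 0:
            self.expired = True
            self.on_expire()


def simulate(total_ticks, l_timer_period, hang_at, timeout_ticks):
    """Return the tick at which the watchdog fires, or None if it never does.

    hang_at is the tick from which the periodic routine stops running
    (None means the interface never hangs).
    """
    fired_at = []
    clock = {"now": 0}
    watchdog = WatchdogTimer(timeout_ticks, lambda: fired_at.append(clock["now"]))
    for now in range(1, total_ticks + 1):
        clock["now"] = now
        hung = hang_at is not None and now >= hang_at
        if not hung and now % l_timer_period == 0:
            watchdog.reset()  # healthy firmware keeps pushing the deadline out
        watchdog.tick()
    return fired_at[0] if fired_at else None


print(simulate(100, 5, None, 8))  # healthy run: watchdog never fires
print(simulate(100, 5, 23, 8))    # hang at tick 23: watchdog fires a few ticks later
```

The timeout has to be longer than the reset period, otherwise a healthy interface would trip the watchdog. The host pays nothing while the interface is healthy, which is the point of avoiding polling.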
12
Fault Recovery Summary
- The FATAL interrupt signal is picked up by the fault recovery daemon on the host
  - The failure is verified through numerous probing messages
- The control program is reloaded into the LANai SRAM
- Any process that was accessing the board prior to the failure is also restored to its original state
- Simply reloading the MCP will not ensure correctness
13
Myrinet Programming Model
- Flow control is achieved through send and receive tokens
- Myrinet software (GM) provides reliable in-order delivery of messages
  - A modified form of the “Go-Back-N” protocol is used
  - Sequence numbers for the protocol are provided by the MCP
  - One stream of sequence numbers exists per destination
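For readers unfamiliar with Go-Back-N, the sketch below shows the textbook form with one sequence-number stream per destination. It is not GM's modified variant, whose details the slides do not give. Class names, node names, and the absence of a send window and of send/receive tokens are simplifications assumed here.

```python
from collections import defaultdict


class GbnReceiver:
    """Accepts packets strictly in order, one expected sequence number per sender."""

    def __init__(self):
        self.expected = defaultdict(int)  # sender -> next sequence number wanted
        self.delivered = []

    def on_packet(self, sender, seq, payload):
        if seq == self.expected[sender]:
            self.delivered.append(payload)
            self.expected[sender] += 1
        # Cumulative ACK: last in-order sequence number (-1 if nothing yet).
        return self.expected[sender] - 1


class GbnSender:
    """Keeps one sequence-number stream and one unacked queue per destination."""

    def __init__(self):
        self.next_seq = defaultdict(int)
        self.unacked = defaultdict(list)  # destination -> [(seq, payload), ...]

    def send(self, dest, payload):
        seq = self.next_seq[dest]
        self.next_seq[dest] += 1
        self.unacked[dest].append((seq, payload))
        return seq

    def on_ack(self, dest, ack):
        # Everything up to and including `ack` has been received in order.
        self.unacked[dest] = [(s, p) for s, p in self.unacked[dest] if s > ack]

    def retransmit(self, dest):
        # "Go back": resend every packet that is still unacknowledged, in order.
        return list(self.unacked[dest])


sender, receiver = GbnSender(), GbnReceiver()
for payload in ["a", "b", "c"]:
    sender.send("nodeB", payload)

# Packet 1 ("b") is lost in transit; packets 0 and 2 arrive.
ack = receiver.on_packet("nodeA", 0, "a")
ack = receiver.on_packet("nodeA", 2, "c")  # out of order, discarded
sender.on_ack("nodeB", ack)

for seq, payload in sender.retransmit("nodeB"):
    ack = receiver.on_packet("nodeA", seq, payload)
sender.on_ack("nodeB", ack)

print(receiver.delivered)
```

Both ends depend on their sequence-number state to stay correct, which is why losing that state when the control program is reloaded matters for recovery.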
14
Typical Control Flow
Sender:
- User process prepares message
- User process sets send token
- LANai sdmas message
- LANai sends message
- LANai receives ACK
- LANai sends event to process
- User process handles notification event
- User process reuses buffer
Receiver:
- User process provides receive buffer
- User process sets recv token
- LANai recvs message
- LANai sends ACK
- LANai rdmas message
- LANai sends event to process
- User process handles notification event
- User process reuses buffer
15
Duplicate Messages
Sender:
- User process prepares message
- User process sets send token
- LANai sdmas message
- LANai sends message
- LANai goes down (lost ACK)
- Driver reloads MCP into board
- Driver resends all unacked messages
- LANai sdmas message
- LANai sends message
Receiver:
- User process provides receive buffer
- User process sets recv token
- LANai recvs message
- LANai sends ACK
- LANai rdmas message
- LANai sends event to process
- User process handles notification event
- User process reuses buffer
- LANai recvs message: duplicate message, ERROR!
Lack of redundant state information is the cause of this problem
16
Lost Messages
Sender:
- User process prepares message
- User process sets send token
- LANai sdmas message
- LANai sends message
- LANai receives ACK
- LANai sends event to process
- User process handles notification event
- User process reuses buffer
Receiver:
- User process provides receive buffer
- User process sets recv token
- LANai recvs message
- LANai sends ACK
- LANai goes down
- Driver reloads MCP into board
- Driver sets all recv tokens again
- LANai waits for message: ERROR!
Incorrect commit point is the cause of this problem
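The commit-point problem comes down to the order of two steps on the receiving interface: acknowledging the message and copying it to host memory. The sketch below is a deliberately tiny model of that ordering. The function names and the single-crash-point simplification are ours, not part of GM or the MCP.

```python
def receive(ack_before_dma, crash_between_steps):
    """Model one message arriving at the receiving interface.

    Returns (sender_got_ack, message_in_host_memory).
    crash_between_steps means the interface goes down after the first
    of the two steps (ACK, DMA to host) and before the second.
    """
    sender_got_ack = False
    in_host_memory = False

    if ack_before_dma:
        sender_got_ack = True   # sender may now release its copy
        if crash_between_steps:
            return sender_got_ack, in_host_memory
        in_host_memory = True
    else:
        in_host_memory = True   # message is safely in host memory first
        if crash_between_steps:
            return sender_got_ack, in_host_memory
        sender_got_ack = True

    return sender_got_ack, in_host_memory


def is_lost(sender_got_ack, in_host_memory):
    # A message is lost when the sender believes it was delivered
    # but the receiving host never obtained it.
    return sender_got_ack and not in_host_memory


print(is_lost(*receive(ack_before_dma=True, crash_between_steps=True)))
print(is_lost(*receive(ack_before_dma=False, crash_between_steps=True)))
```

With the ACK sent first, a crash between the two steps loses the message. With the DMA first, the same crash leaves the message unacknowledged, so the sender still holds it and can resend.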
17
Fault Recovery
- We need to keep a copy of the state information
  - Checkpointing can be a big overhead
  - Logging critical message information is enough
- GM functions are modified so that
  - A copy of the send tokens and the receive tokens is made with every send and receive call
  - The host processes provide the sequence numbers, one per (destination node, local port) pair
  - The copy of a send or receive token is removed when the send/receive completes successfully
- The MCP is modified
  - The ACK is sent out only after a message is DMAed to host memory
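A host-side log of outstanding sends, with sequence numbers assigned by the host, might look like the sketch below. This is an illustration under assumptions, not the FTGM code: `SendLog`, `Receiver`, and their methods are hypothetical names, and the receiver is a plain in-order filter rather than GM's actual protocol.

```python
class SendLog:
    """Host-side copy of outstanding send tokens.

    Sequence numbers are assigned here, one stream per
    (destination node, local port) pair, so they survive a reload
    of the interface's control program.
    """

    def __init__(self):
        self.next_seq = {}  # (dest, port) -> next sequence number
        self.pending = {}   # (dest, port, seq) -> payload

    def record_send(self, dest, port, payload):
        key = (dest, port)
        seq = self.next_seq.get(key, 0)
        self.next_seq[key] = seq + 1
        self.pending[(dest, port, seq)] = payload  # copy kept until completion
        return seq

    def complete(self, dest, port, seq):
        # The send finished successfully, so the copy is no longer needed.
        self.pending.pop((dest, port, seq), None)

    def replay(self):
        # After recovery: everything still pending, with its original number.
        return sorted(self.pending.items())


class Receiver:
    """Delivers a message only if it carries the next expected number."""

    def __init__(self):
        self.expected = {}
        self.delivered = []

    def on_message(self, source, seq, payload):
        expected = self.expected.get(source, 0)
        if seq != expected:
            return False  # duplicate or out of order: not delivered again
        self.delivered.append(payload)
        self.expected[source] = expected + 1
        return True


log, receiver = SendLog(), Receiver()
seq = log.record_send("nodeB", 1, "m0")
receiver.on_message(("nodeA", 1), seq, "m0")  # delivered, but the ACK is lost

# Interface fails and is reloaded; the host replays its pending sends.
for (dest, port, old_seq), payload in log.replay():
    accepted = receiver.on_message(("nodeA", port), old_seq, payload)
    log.complete(dest, port, old_seq)

print(receiver.delivered, accepted)
```

Because the resent message keeps its original sequence number, the receiver can recognize it as a duplicate instead of delivering it twice. The cost is one small bookkeeping entry per outstanding send rather than a full checkpoint.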
18
Performance Impact
- The scheme has been integrated successfully into GM
  - Over 1 man-year for complete implementation
- How much of the system's performance has been compromised?
  - After all, one can't get a free lunch these days!
- Performance is measured using two key parameters
  - Bandwidth obtained with large messages
  - Latency of small messages
19
Latency
20
Bandwidth
21
Summary of Results
Performance Metric                 GM          FTGM
Bandwidth                          92.4 MB/s   92 MB/s
Latency                            11.5 µs     13.0 µs
Host-CPU utilization for send      0.3 µs      0.55 µs
Host-CPU utilization for receive   0.75 µs     1.15 µs
LANai-CPU utilization              6.0 µs      6.8 µs
Host Platform: Pentium III with 256MB, RedHat Linux 7.2
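The relative cost of FTGM over GM follows from the figures on this slide. The sketch below computes the percentage change for each metric. The values are copied from the slide, the helper name `relative_change` is ours, and units are left out because the ratio does not depend on them.

```python
# (GM value, FTGM value) for each metric, as reported on the slide.
metrics = {
    "Bandwidth": (92.4, 92.0),
    "Latency": (11.5, 13.0),
    "Host-CPU utilization for send": (0.3, 0.55),
    "Host-CPU utilization for receive": (0.75, 1.15),
    "LANai-CPU utilization": (6.0, 6.8),
}


def relative_change(gm, ftgm):
    """Percentage change of the FTGM value relative to the GM value."""
    return 100 * (ftgm - gm) / gm


for name, (gm, ftgm) in metrics.items():
    print(f"{name:34s} {relative_change(gm, ftgm):+7.2f}%")
```

The bandwidth loss works out to well under 1%, while small-message latency grows by roughly 13%, so the fixed per-message cost matters mainly for short messages.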
22
Summary of Results
Fault Detection Latency = 50 ms
Fault Recovery Latency = s
Per-Process Latency = 0.50 s
23
Our Contributions
- We have devised smart ways to detect and recover from network interface failures
- Our fault detection technique for “network processor hangs” uses software-implemented watchdog timers
- Fault recovery time (including reloading of the network control program) is ~2 seconds
- Performance impact is under 1% for messages over 1KB
- Complete user transparency was achieved