TCP Throughput Collapse in Cluster-based Storage Systems

Presentation transcript:

TCP Throughput Collapse in Cluster-based Storage Systems
Amar Phanishayee, Elie Krevat, Vijay Vasudevan, David Andersen, Greg Ganger, Garth Gibson, Srini Seshan
Carnegie Mellon University

Cluster-based Storage Systems
[Diagram: a client connected through a switch to storage servers; a data block is striped into Server Request Units (SRUs) 1-4, fetched with a synchronized read from all servers.]
Cluster-based storage systems are becoming increasingly popular, both in research and in industry. Data is striped across multiple servers for reliability (coding/replication) and performance; striping also aids incremental scalability. The client is separated from the servers by a hierarchy of switches (one switch here for simplicity) on a high-bandwidth (1 Gbps), low-latency (tens to hundreds of microseconds) network. Reads are synchronized: the client requests a data block by asking every server for its Server Request Unit (SRU), and only after receiving all SRUs does it send the next batch of requests. This setting is deliberately simplistic; real deployments can have multiple clients and multiple outstanding blocks.

TCP Throughput Collapse: Setup
- Test on an Ethernet-based storage cluster
- Client performs synchronized reads
- Increase the number of servers involved in the transfer
- SRU size is fixed
- TCP is used as the data transfer protocol

TCP Throughput Collapse: Incast
Setup: client---HP switch---servers, SRU size = 256 KB, increasing number of servers. The graph plots the number of servers on the x-axis against goodput (throughput as seen by the application) on the y-axis. Goodput drops by an order of magnitude with as few as 7 servers. This collapse was initially reported by Nagle et al. (Panasas), who called it Incast [Nagle04]; it had also been observed earlier in research systems (NASD). The cause is TCP timeouts (due to limited switch buffer space) combined with synchronized reads. With the popularity of iSCSI devices and of companies selling cluster-based storage file systems, this throughput collapse is a serious concern. If we want to play the blame game: wearing the systems hat, one can say "this is the network's fault; networking folks, fix it!" Wearing the networking hat, one can reply "TCP has been tried and tested in the wide area and was designed to saturate the available bandwidth in settings like this one, so you must be doing something wrong; perhaps tune your TCP stack for performance." In fact, the problem shows up only in synchronized-read settings: Nagle et al. ran netperf and the collapse did not appear. In this paper, we perform an in-depth analysis of the effectiveness of possible network-level solutions.

Hurdle for Ethernet Networks
- FibreChannel, InfiniBand: specialized high-throughput networks, but expensive
- Commodity Ethernet: 10 Gbps rolling out, 40/100 Gbps being drafted; low cost; shared routing infrastructure (LAN, SAN, HPC); but suffers TCP throughput collapse with synchronized reads
FibreChannel and InfiniBand offer high throughput (10 to 100 Gbps) and high performance: RDMA support for direct memory-to-memory transfer without interrupting the CPU, and link-level flow control. But they are costly. With Ethernet, a shared network infrastructure can be used by both storage and compute clusters, 10 Gbit is rolling out and 40/100 Gbit is being drafted, and existing protocols designed for the wide area can be reused. With all these advantages, commodity Ethernet seems to be the way to go, but one of the major hurdles to cross is the TCP throughput collapse observed in these networks. We consider Ethernet for the rest of this talk.

Our Contributions
- Study the network conditions that cause TCP throughput collapse
- Analyze the effectiveness of various network-level solutions to mitigate this collapse

Outline
- Motivation: TCP throughput collapse
- High-level overview of TCP
- Characterizing Incast
- Conclusion and ongoing work

TCP overview
- Reliable, in-order byte stream: sequence numbers, cumulative acknowledgements (ACKs), retransmission of lost packets
- Adaptive: discovers and utilizes available link bandwidth; assumes loss is an indication of congestion and slows down the sending rate
TCP provides reliable, in-order delivery of data and fairness among flows with the same RTT, so applications using TCP need not worry about losses in the network; TCP takes care of retransmissions. Bottleneck link bandwidth is shared by all flows. Congestion control is an adaptive mechanism: it adapts to a changing number of flows and varying network conditions. In slow start (at connection start or after a timeout, until the ssthresh threshold is reached), the congestion window (CWND) doubles every RTT, growing exponentially to quickly discover link capacity. In congestion avoidance, additive increase grows CWND by one segment per RTT, and loss triggers a multiplicative decrease (AIMD).
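A minimal sketch of these window dynamics (illustrative Python, not code from the talk; byte-counting details differ across real stacks):

```python
def on_new_ack(cwnd, ssthresh, mss=1460):
    """Grow the congestion window on each new ACK (all sizes in bytes)."""
    if cwnd < ssthresh:
        return cwnd + mss                # slow start: window roughly doubles per RTT
    return cwnd + mss * mss // cwnd      # congestion avoidance: about +1 MSS per RTT

def on_triple_dup_ack(cwnd, mss=1460):
    """Multiplicative decrease: loss signalled by duplicate ACKs halves the window."""
    ssthresh = max(cwnd // 2, 2 * mss)
    return ssthresh, ssthresh            # (new ssthresh, new cwnd)

def on_timeout(cwnd, mss=1460):
    """A retransmission timeout is harsher: back to slow start from one segment."""
    return max(cwnd // 2, 2 * mss), mss  # (new ssthresh, new cwnd)
```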

TCP: data-driven loss recovery
[Diagram: the sender transmits packets 1-5; packet 2 is lost, so packets 3, 4, and 5 each elicit a duplicate ACK for 1; after 3 duplicate ACKs the sender retransmits packet 2 and eventually receives ACK 5.]
Three duplicate ACKs for packet 1 indicate that packet 2 is probably lost, so the sender retransmits packet 2 immediately. In SANs this recovery takes microseconds after the loss. The sender waits for 3 duplicate ACKs because the packets might merely have been reordered in the network (the receiver might have received packet 2 after 3 and 4), but it cannot wait forever, hence the limit of 3 duplicate ACKs.
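A sketch of the duplicate-ACK counting behind fast retransmit (illustrative only; the variable names are assumptions, not a real stack's state):

```python
DUP_ACK_THRESHOLD = 3  # standard fast-retransmit trigger

def on_ack(state, ack_seq):
    """Count duplicate ACKs and trigger fast retransmit (data-driven recovery).

    `state` is a plain dict with 'last_ack' and 'dup_acks'.
    """
    if ack_seq == state["last_ack"]:
        state["dup_acks"] += 1
        if state["dup_acks"] == DUP_ACK_THRESHOLD:
            # The segment following ack_seq is presumed lost: retransmit it
            # now (microseconds after the loss) instead of waiting for the RTO.
            return ("fast_retransmit", ack_seq)
    else:
        state["last_ack"] = ack_seq
        state["dup_acks"] = 0
    return ("wait", None)

state = {"last_ack": 1, "dup_acks": 0}
for ack in (1, 1, 1):             # dup ACKs for packet 1, triggered by packets 3, 4, 5
    print(on_ack(state, ack))     # third call reports ('fast_retransmit', 1)
```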

TCP: timeout-driven loss recovery
[Diagram: the sender transmits packets 1-5, no ACKs return, and the sender must wait out the retransmission timeout (RTO) before resending packet 1.]
Timeouts are expensive (milliseconds to recover after a loss) because the sender must wait a full RTO before realizing a retransmission is required. The RTO is estimated from the measured round-trip time, and estimating it is tricky: the protocol must balance timely response against premature timeouts. The minimum RTO (minRTO) is on the order of milliseconds, orders of magnitude greater than the round-trip time in a SAN.
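For reference, a sketch of the standard smoothed-RTT/RTT-variance RTO estimator with a minimum-RTO floor (illustrative; the constants follow the common RFC 6298 values, not necessarily the stack used in the paper):

```python
def update_rto(srtt, rttvar, rtt_sample, rto_min=0.200, alpha=1/8, beta=1/4):
    """One step of the standard RTO estimator (times in seconds).

    rto_min models the minimum-RTO floor (commonly 200 ms), which is what
    makes timeouts so costly relative to SAN round-trip times.
    """
    if srtt is None:                                   # first RTT measurement
        srtt, rttvar = rtt_sample, rtt_sample / 2
    else:
        rttvar = (1 - beta) * rttvar + beta * abs(srtt - rtt_sample)
        srtt = (1 - alpha) * srtt + alpha * rtt_sample
    return srtt, rttvar, max(rto_min, srtt + 4 * rttvar)

# With SAN round trips of ~100 us, the floor dominates: RTO stays at 200 ms.
print(update_rto(None, None, 100e-6))
```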

TCP: Loss recovery comparison
[Diagram: the two timelines from the previous slides side by side, timeout-driven recovery vs. data-driven recovery.]
Timeout-driven recovery is slow (milliseconds); data-driven recovery is very fast (microseconds) in SANs. As before, the sender waits for 3 duplicate ACKs before retransmitting because the packets might only have been reordered, but it cannot wait forever.

Outline
- Motivation: TCP throughput collapse
- High-level overview of TCP
- Characterizing Incast
  - Comparing real-world and simulation results
  - Analysis of possible solutions
- Conclusion and ongoing work

Link idle time due to timeouts
[Diagram: the synchronized-read setup again; the client requests SRUs 1-4 and the packet carrying SRU 4 is dropped at the switch.]
Given this background on TCP timeouts, let us revisit the synchronized-read scenario to understand why timeouts cause link idle time (and hence throughput collapse). Setting: each SRU contains only one packet's worth of data. If packet 4 is dropped, then while server 4 waits out its retransmission timeout the link is idle; no one is utilizing the available bandwidth. The link stays idle until the stalled server's timeout fires.
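A back-of-envelope calculation, under assumed numbers (1500-byte packets, 1 Gbps link, 200 ms minimum RTO; not measurements from the talk), of how much of the block transfer time the link sits idle in this one-packet-per-SRU scenario:

```python
PKT_BYTES = 1500          # assumed MTU-sized packet per SRU
LINK_BPS = 1e9            # 1 Gbps client link
RTO = 0.200               # default minimum retransmission timeout (200 ms)

servers = 4
send_time = servers * PKT_BYTES * 8 / LINK_BPS   # ~48 us to drain the whole block
idle_fraction = RTO / (RTO + send_time)
print(f"block on wire: {send_time * 1e6:.0f} us; link idle {idle_fraction:.3%} of the block time")
```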

Client Link Utilization

Characterizing Incast
- Incast on storage clusters
- Simulation in a network simulator (ns-2), where we can easily vary:
  - Number of servers
  - Switch buffer size
  - SRU size
  - TCP parameters
  - TCP implementations

Incast on a storage testbed
SRU = 256 KB; ~32 KB output buffer per port; storage nodes run a Linux 2.6.18 SMP kernel.

Simulating Incast: comparison
SRU = 256 KB. Simulation closely matches the real-world result. The slight difference between the two curves can be explained by two factors: some non-determinism in the real world (real servers can be slower at pumping data into the network than in simulation, where hosts are infinitely fast), and the fact that we do not know the exact buffer size on the ProCurve switches (though we believe it is close to 32 KB).

Outline
- Motivation: TCP throughput collapse
- High-level overview of TCP
- Characterizing Incast
  - Comparing real-world and simulation results
  - Analysis of possible solutions
    - Varying system parameters: increasing switch buffer size, increasing SRU size
    - TCP-level solutions
    - Ethernet flow control
- Conclusion and ongoing work

Increasing switch buffer size
Timeouts occur due to losses, and losses occur due to limited switch buffer space. Hypothesis: increasing the switch buffer size delays throughput collapse. How effective is increasing the buffer size at mitigating throughput collapse? Larger buffer size (output buffer per port) -> fewer losses -> fewer timeouts.
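A rough sizing sketch of this hypothesis, with hypothetical buffer and burst sizes (not measurements from the paper):

```python
def servers_before_overflow(port_buffer_bytes, burst_per_server_bytes):
    """Rule of thumb: the client's switch port overflows roughly when the
    servers' simultaneous bursts exceed the per-port output buffer.
    (Ignores draining during the burst, so it is pessimistic.)"""
    return port_buffer_bytes // burst_per_server_bytes

# Hypothetical numbers: ~32 KB per-port buffer, ~8 KB burst per server.
print(servers_before_overflow(32 * 1024, 8 * 1024))   # -> 4 servers
```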

Increasing switch buffer size: results
[Graph: goodput vs. number of servers for different per-port output buffer sizes; SRU = 256 KB.]
This is the first graph with the x-axis on a log scale; the curve shown before looks different here only because of that log scale.

Increasing switch buffer size: results
[Graph: goodput vs. number of servers, varying the per-port output buffer size.]

Increasing switch buffer size: results
[Graph: goodput vs. number of servers, varying the per-port output buffer size.]
Larger buffers support more servers before collapse, but fast (SRAM) buffers are expensive, so very high-capacity switches are vastly more costly. When we tried this experiment on a switch with a large per-port output buffer (> 1 MB), we saw no throughput collapse with as many as 87 servers, but that switch cost about $0.5M. By comparison, a commodity Dell switch costs about $1,100; a Force10 S50 offers 600+ ports with 1 to 5 MB per port (~$1,000/port), while an HP ProCurve 24-port switch costs ~$100/port.

Increasing SRU size
There is no throughput collapse when using netperf (a tool for measuring network throughput and latency), because netperf does not perform synchronized reads. Hypothesis: a larger SRU (Server Request Unit) size means less link idle time. Servers have more data to send per data block, so while one server waits out a timeout, the others continue to send and can utilize the available link bandwidth.
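A toy model of this hypothesis (assumed link speed, RTO, and server count; it ignores TCP windows, further losses, and queueing):

```python
def idle_fraction(sru_bytes, servers=8, link_bps=1e9, rto=0.200):
    """Very rough model: while one server sits out an RTO, the remaining
    servers keep the client link busy only as long as they still have SRU
    data left to send."""
    others_busy = (servers - 1) * sru_bytes * 8 / link_bps
    idle = max(0.0, rto - others_busy)
    block_time = rto + sru_bytes * 8 / link_bps   # the stalled server finishes after the RTO
    return idle / block_time

print(idle_fraction(10 * 1024))     # small SRU: link idle for almost the whole RTO
print(idle_fraction(8 * 1024**2))   # large SRU: the other servers fill the RTO
```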

Increasing SRU size: results
[Graph: goodput vs. number of servers, SRU = 10 KB; buffer space = 64 KB output buffer per port.]

Increasing SRU size: results
[Graph: goodput vs. number of servers for SRU = 10 KB and SRU = 1 MB; buffer space = 64 KB output buffer per port.]

Increasing SRU size: results
[Graph: goodput vs. number of servers for SRU = 10 KB, 1 MB, and 8 MB; buffer space = 64 KB output buffer per port.]
Larger SRUs significantly reduce the throughput collapse, but they require more pre-fetching and more pinned kernel memory on the client, which can lead to failures (Ric thinks this is not an issue; Garth and Panasas think it is!).

Fixed Block Size
[Graph: goodput vs. number of servers; buffer space = 64 KB output buffer per port.]

Outline
- Motivation: TCP throughput collapse
- High-level overview of TCP
- Characterizing Incast
  - Comparing real-world and simulation results
  - Analysis of possible solutions
    - Varying system parameters
    - TCP-level solutions: avoiding timeouts (alternative TCP implementations, aggressive data-driven recovery), reducing the penalty of a timeout
    - Ethernet flow control

Avoiding Timeouts: Alternative TCP implementations
[Graph: goodput vs. number of servers for different TCP variants; SRU = 256 KB, buffer = 64 KB.]
NewReno performs better than Reno and SACK (at 8 servers), but throughput collapse is still inevitable.

Timeouts are inevitable
[Diagram: sender/receiver timelines showing the cases that force a timeout: an entire window of data lost, a loss with fewer than 3 duplicate ACKs of feedback, and a retransmitted packet itself being lost.]
Why do timeouts still occur when NewReno and SACK were specifically designed to reduce them? This was perplexing, so we categorized the timeouts and found that the timeouts occurring were inevitable: in most cases a complete window of data is lost, or the retransmitted packets themselves are lost. The categorization also showed that some timeouts occur even with limited feedback (fewer than 3 duplicate ACKs); reducing the duplicate-ACK threshold to 1 is safe since there is no reordering in storage networks, but this aggressive data-driven recovery did not help either.

Reducing the penalty of timeouts
[Graph: goodput vs. number of servers for NewReno with RTOmin = 200 ms and with RTOmin = 200 us.]
We can reduce the penalty by reducing the retransmission timeout (RTO) period. Estimating the RTO is tricky: if you are too aggressive, you end up retransmitting packets that may already have been received (the ACK might still be on its way back), paying for a spurious retransmission plus a return to slow start; if you overestimate the RTO, you hurt timely response to losses. RTOmin guards against premature timeouts. Its default of 200 ms made sense in the wide area, where RTT variation due to router buffering can be milliseconds, but it is three orders of magnitude greater than the RTT in SANs (~100 us). We reduce RTOmin in simulation: the reduced RTOmin helps, but goodput still drops about 30% at 64 servers.
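A quick illustration, with an assumed block transfer time rather than measured data, of why the RTOmin floor dominates the penalty of a single timeout:

```python
BLOCK_ON_WIRE = 0.015   # assumed: ~15 ms to transfer one data block at 1 Gbps

for rto_min in (0.200, 0.000200):              # default 200 ms vs. reduced 200 us
    block_time = BLOCK_ON_WIRE + rto_min       # a single timeout stalls the block
    print(f"RTOmin={rto_min:g}s: useful fraction {BLOCK_ON_WIRE / block_time:.1%}")
```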

Issues with Reduced RTOmin
Implementation hurdle: it requires fine-grained OS timers (microseconds) and a very high interrupt rate, while current OS timers have millisecond granularity and soft timers are not available on all platforms. It is also unsafe when servers talk to other clients over the wide area, where the overhead is unnecessary timeouts and retransmissions. To reduce RTOmin to 200 us we need a TCP clock granularity of about 100 us; Linux TCP uses a 10 ms TCP clock, and BSD provides two coarse-grained timers (200 ms and 500 ms) for its internal per-connection timers. Allman et al. show that aggressively low RTO values lead to unnecessary timeouts and retransmissions.

Outline
- Motivation: TCP throughput collapse
- High-level overview of TCP
- Characterizing Incast
  - Comparing real-world and simulation results
  - Analysis of possible solutions
    - Varying system parameters
    - TCP-level solutions
    - Ethernet flow control
- Conclusion and ongoing work

Ethernet Flow Control
[Graph: goodput vs. number of servers with Ethernet flow control (EFC) disabled and enabled.]
Ethernet flow control operates at the link level: an overloaded port sends "pause" frames to all sending interfaces. We ran these tests on a storage cluster with the client and servers separated by a single switch; SRU = 256 KB.

Issues with Ethernet Flow Control
- Can result in head-of-line blocking
- Pause frames are not forwarded across a switch hierarchy
- Switch implementations are inconsistent
- Flow agnostic: all flows are asked to halt, irrespective of their send rate
New Ethernet flow control standards (Datacenter Ethernet) are trying to solve these problems, but it is unclear when they will be implemented in switches.

Summary
Synchronized reads and TCP timeouts cause TCP throughput collapse, and there is no single convincing network-level solution. Current options: increase the buffer size (costly), reduce RTOmin (unsafe), or use Ethernet flow control (limited applicability). In conclusion, most of the solutions we have considered have drawbacks; reducing RTOmin and using EFC behind a single switch seem to be the most effective, with Datacenter Ethernet (enhanced EFC) on the horizon. Ongoing work explores application-level solutions: limiting the number of servers or throttling transfers, and globally scheduling data transfers.

No throughput collapse in InfiniBand
[Graph: throughput (Mbps) vs. number of servers. Results obtained from Wittawat Tantisiriroj.]

Varying RTOmin
[Graph: goodput (Mbps) vs. RTOmin (seconds).]