The DataTAG Project, Olivier H. Martin / CERN (CHEP'2003, UCSD/La Jolla, USA)


1 The DataTAG Project
CHEP'2003 Conference, 25 March 2003, UCSD/La Jolla, USA. Olivier H. Martin / CERN

2 Funding agencies and cooperating networks

3 EU collaborators
Brunel University, CERN, CLRC, CNAF, DANTE, INFN, INRIA, NIKHEF, PPARC, UvA, University of Manchester, University of Padova, University of Milano, University of Torino, UCL

4 US collaborators
Northwestern University, ANL, UIC, Caltech, Fermilab, University of Chicago, University of Michigan, SLAC, StarLight, FSU, Globus, Indiana, Wisconsin

5 Project information & goals
- Two-year project started on 1/1/2002; following a successful first-year review, an extension until 1Q04 is likely
- 3.9 MEUR budget: circuit cost and hardware (about 50%), plus manpower
- Grid-related network research: high-performance transport protocols, inter-domain QoS, advance bandwidth reservation
- Interoperability between European and US Grids

6 Workplan
- WP1: Establishment of a high-performance intercontinental Grid testbed (CERN)
- WP2: High-performance networking (PPARC)
- WP3: Bulk data transfer validation and application performance monitoring (UvA)
- WP4: Interoperability between Grid domains (INFN)
- WP5 & WP6: Dissemination and project management (CERN)

7 Interoperability framework
[Diagram: the EU part (DataGRID, DataTAG WP4) and the US part (GriPhyN, PPDG, iVDGL) linked through HICB and the GLUE effort, feeding the HEP experiments, LCG and the LCG middleware selection.]

8 DataTAG testbed status

9 Evolution of the testbed
- 2.5G circuit in operation since August 20, 2002
- On request from the partners, the testbed evolved from a simple layer 3 testbed into an extremely rich, probably unique, multi-vendor layer 2 & layer 3 testbed (Alcatel, Cisco, Juniper)
- Direct extensions to Amsterdam (UvA)/SURFnet (10G) and Lyon (INRIA)/VTHD (2.5G)
- VPN layer 2 extension to INFN/CNAF over GEANT & GARR using Juniper's MPLS
- To guarantee exclusive access to the testbed, a reservation application has been developed; it has proved essential

10 Major 2.5/10 Gbps circuits between Europe & USA
[Map of DataTAG connectivity: the CERN - StarLight (Chicago) circuit, 2.5G being upgraded to 10G, with 3*2.5G layer 2 VPNs; connections via New York and StarLight to GEANT, SURFnet (NL), GARR-B (IT), SuperJANET4 (UK), INRIA/ATRIUM-VTHD (FR), Abilene, ESnet, MREN and STAR-TAP.]

11 Multi-vendor layer 2/3 testbed
[Diagram: CERN (Geneva), StarLight (Chicago), INFN (Bologna) and INRIA (Lyon) interconnected by 2.5 and 10 Gbps waves; Alcatel, Cisco, Juniper and Extreme Summit5i equipment; M = Alcatel 1670 (layer 2 over SDH multiplexer); connections to GEANT, SURFnet, Abilene, Canarie and ESnet.]

12 Phase I (iGRID2002), layer 2

13 Phase II: generic layer 3 configuration (Oct. 2002 - Feb. 2003)
[Diagram: servers on GigE switches at CERN and at StarLight, each behind a Cisco 7606 router, interconnected by the 2.5 Gbps circuit.]

14 Phase III: layer 2/3 (March 2003)
[Diagram: at CERN, servers on GigE switches behind a Cisco 7606, a Juniper M10 and an Alcatel 7770, with an Alcatel 1670 multiplexer and a Cisco ONS 15454 feeding the 2.5G/10G circuits to StarLight (2*GigE and 8*GigE attachments); similar equipment at StarLight; layer 1/layer 2 extensions to INRIA (Lyon) over VTHD, to UvA over GEANT/SURFnet and to INFN/CNAF over GEANT/GARR; peerings with Abilene, ESnet and Canarie.]

15 Main achievements
- GLUE interoperability effort with DataGrid, iVDGL & Globus: GLUE testbed & demos
- VOMS design and implementation in collaboration with DataGrid; VOMS evaluation within iVDGL underway
- Integration of GLUE-compliant components in DataGrid and VDT middleware
- Internet land speed records have been beaten one after the other by DataTAG project members and/or teams closely associated with DataTAG:
  - Atlas Canada lightpath experiment (iGRID2002)
  - New Internet2 land speed record (I2 LSR) by the NIKHEF/Caltech team (SC2002)
  - Scalable TCP, HSTCP, GridDT & FAST experiments (DataTAG partners & Caltech)
  - Intel 10GigE tests between CERN (Geneva) and SLAC (Sunnyvale) (Caltech, CERN, Los Alamos NL, SLAC): 2.38 Gbps sustained rate, single TCP/IP flow, 1 TB in one hour (S. Ravot/Caltech)

16 10GigE Data Transfer Trial
In February 2003, a terabyte of data was transferred in 3700 seconds by S. Ravot of Caltech between the Level3 PoP in Sunnyvale, near SLAC, and CERN, through the TeraGrid router at StarLight, from memory to memory with a single TCP/IP stream. This achievement translates to an average rate of 2.38 Gbps (using large windows and 9 KB "jumbo frames"). It beat the former record by a factor of ~2.5 and used the US-CERN link at 99% efficiency.
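
A quick sanity check of the quoted rate, assuming the terabyte is counted as 2^40 bytes (which is how the 2.38 Gbps figure appears to be derived):

```python
# Back-of-the-envelope check of the quoted transfer rate.
# Assumption (not stated on the slide): the "terabyte" is 2**40 bytes.
bytes_transferred = 2 ** 40      # 1 TiB
seconds = 3700
rate_gbps = bytes_transferred * 8 / seconds / 1e9
print(f"{rate_gbps:.2f} Gbps")   # prints 2.38, matching the slide
```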

17 PFLDnet workshop (CERN, Feb 3-4)
- 1st workshop on protocols for fast long-distance networks
- Co-organized by Caltech & DataTAG, sponsored by Cisco
- 65 attendees; most key actors were present, e.g. S. Floyd, T. Kelly, S. Low
- Headlines: High Speed TCP (HSTCP), Limited Slow-Start, QuickStart, XCP, Tsunami (UDP), Grid DT, Scalable TCP, FAST (Fast AQM (Active Queue Management) Scalable TCP)

18 Main TCP issues
- Does not scale to some environments: high speed, high latency, noisy links
- Unfair behaviour with respect to RTT, MSS and bandwidth
- Widespread use of multiple streams to compensate for TCP/IP's inherent unfairness (e.g. GridFTP, bbFTP): a bandage rather than a cure
- New TCP/IP proposals aim to restore performance in single-stream environments

19 TCP dynamics (10 Gbps, 100 ms RTT, 1500-byte packets)
- Window size (W) = bandwidth * round-trip time
  - Wbits = 10 Gbps * 100 ms = 1 Gb
  - Wpackets = 1 Gb / (8*1500) = ~83,333 packets
- Standard Additive Increase Multiplicative Decrease (AIMD) mechanism:
  - W = W/2 (halving the congestion window on a loss event)
  - W = W + 1 (increasing the congestion window by one packet every RTT)
- Time to recover from W/2 to W (congestion avoidance) at 1 packet per RTT: RTT * Wpackets/2 = ~1.2 hours
  - In practice, 1 packet per 2 RTTs because of delayed ACKs, i.e. ~2.3 hours
- Packets per second: Wpackets / RTT = ~833,333 packets
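
A small sketch reproducing the arithmetic above, using only the slide's 10 Gbps / 100 ms / 1500-byte figures:

```python
# Bandwidth-delay product and AIMD recovery time for the slide's example.
bandwidth_bps = 10e9        # 10 Gbps
rtt_s = 0.100               # 100 ms
packet_bytes = 1500

window_bits = bandwidth_bps * rtt_s                  # 1 Gb
window_packets = window_bits / (8 * packet_bytes)    # ~83,333 packets

# Congestion avoidance grows the window by 1 packet per RTT,
# so recovering from W/2 to W takes W/2 RTTs.
recovery_s = (window_packets / 2) * rtt_s
print(f"window: {window_packets:,.0f} packets")
print(f"recovery: {recovery_s / 3600:.2f} h "
      f"({2 * recovery_s / 3600:.2f} h with delayed ACKs)")
print(f"packet rate: {window_packets / rtt_s:,.0f} packets/s")
```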

20 Maximum throughput with standard window sizes as a function of the RTT
Throughput is bounded by window size / RTT. With the standard 64 KB window, for example, this gives 2.56 MB/s at 25 ms RTT, 1.28 MB/s at 50 ms and 640 KB/s at 100 ms, with smaller windows scaling down proportionally. The best throughput one can hope for, on a standard intra-European path with 50 ms RTT, is only about 10 Mb/s!
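
A sketch of the window/RTT arithmetic behind these figures (the 64 KB window is the conventional maximum without TCP window scaling; decimal KB/MB as on the slide):

```python
# Window-limited TCP throughput: at most one full window per round trip.
def throughput_bytes_per_s(window_bytes: float, rtt_s: float) -> float:
    return window_bytes / rtt_s

for rtt_ms in (25, 50, 100, 200):
    t = throughput_bytes_per_s(64_000, rtt_ms / 1000)   # standard 64 KB window
    print(f"RTT {rtt_ms:3d} ms -> {t / 1e6:.2f} MB/s ({t * 8 / 1e6:.1f} Mb/s)")
```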

21 HSTCP (IETF draft, August 2002)
- Modifies TCP's response function to allow high performance in high-speed environments, in the presence of packet losses
- Target: 10 Gbps performance at 100 ms round-trip times (RTT)
- Acceptable fairness when competing with standard TCP in environments with packet loss rates of 10^-4 or 10^-5
- For reference, the standard TCP response function is W (in MSS) = 1.2/sqrt(p), equivalent to W/1.5 RTTs between losses

22 HSTCP response function (additive increase: HSTCP vs standard TCP)
  Packet drop rate   Congestion window (standard TCP)   RTTs between losses, HSTCP (standard TCP)
  10^-4              120                                38 (80)
  10^-5              379                                57 (252)
  10^-6              1200                               83 (800)
  10^-7              3795                               123 (2530)
  ...                ...                                ...
  10^-10             120000                             388 (80000)
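
A small sketch reproducing the standard-TCP columns of this table from the response function on the previous slide (W = 1.2/sqrt(p), about W/1.5 RTTs between losses); the HSTCP figures come from the draft's modified response function and are not recomputed here:

```python
import math

# Standard TCP response function: average window W = 1.2/sqrt(p),
# with roughly W/1.5 round trips between loss events.
for exp in (4, 5, 6, 7, 10):
    p = 10.0 ** -exp
    w = 1.2 / math.sqrt(p)
    print(f"p = 1e-{exp}: W ~ {w:,.0f} MSS, ~{w / 1.5:,.0f} RTTs between losses")
```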

23 Relative fairness (HSTCP vs standard TCP)
[Table: relative fairness, aggregate window and aggregate bandwidth (ranging from a few Mb/s to several Gb/s) as a function of the packet drop rate.]
N.B. Aggregate bandwidth used by one standard TCP plus one HSTCP connection, with 100 ms RTT and 1500-byte MSS.

24 Limited Slow-Start (IETF draft, August 2002)
- The current slow-start procedure can increase the congestion window by thousands of packets in a single RTT: massive packet losses, counter-productive
- Limited slow-start introduces a new parameter, max_ssthresh, to limit the growth of the congestion window (recommended value: 100 MSS), applied while max_ssthresh < cwnd < ssthresh
- When cwnd > max_ssthresh, let K = int(cwnd / (0.5*max_ssthresh)); then cwnd += int(MSS/K) for each received ACK, instead of cwnd += MSS
- This ensures that cwnd increases by at most max_ssthresh/2 per RTT, i.e. 1/2 MSS per ACK when cwnd = max_ssthresh, 1/3 MSS per ACK when cwnd = 1.5*max_ssthresh, etc.
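
A minimal sketch of the per-ACK increment described above (units of MSS; max_ssthresh at its recommended value of 100):

```python
# Limited slow-start increment per received ACK (sketch, units of MSS).
def ls_increment(cwnd: float, max_ssthresh: int = 100) -> float:
    if cwnd < max_ssthresh:
        return 1.0                           # classic slow-start: +1 MSS per ACK
    k = int(cwnd / (0.5 * max_ssthresh))     # K from the rule above
    return 1.0 / k                           # caps growth at ~max_ssthresh/2 per RTT

for w in (50, 100, 150, 200, 1000):
    print(w, "->", round(ls_increment(w), 3), "MSS per ACK")
```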

25 Limited Slow-Start (cont.)
- With limited slow-start it takes:
  - log2(max_ssthresh) RTTs to reach the point where cwnd = max_ssthresh
  - log2(max_ssthresh) + (cwnd - max_ssthresh)/(max_ssthresh/2) RTTs to reach a congestion window of cwnd, when cwnd > max_ssthresh
- Thus with max_ssthresh = 100 MSS, it would take 836 RTTs to reach a congestion window of roughly 42,000 packets, compared to 16 RTTs with standard slow-start (assuming NO packet drops)
- The transient queue is limited to 100 packets, against tens of thousands of packets otherwise!
- Limited slow-start could be used in conjunction with rate-based pacing
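
A sketch evaluating the formula above, run backwards to see which congestion window the quoted 836 RTTs corresponds to (assuming a base-2 logarithm and max_ssthresh = 100 MSS):

```python
import math

max_ssthresh = 100          # MSS, recommended value from the draft
rtts = 836                  # figure quoted on the slide

# RTTs to reach max_ssthresh with classic doubling, then +max_ssthresh/2 per RTT.
ramp = math.log2(max_ssthresh)
cwnd = max_ssthresh + (rtts - ramp) * (max_ssthresh / 2)
print(f"after {rtts} RTTs: cwnd ~ {cwnd:,.0f} MSS")       # roughly 41,500 MSS

# Standard slow-start (doubling every RTT, no losses) reaches the same window in:
print(f"standard slow-start: ~{math.ceil(math.log2(cwnd))} RTTs")   # ~16 RTTs
```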

26 Slow-start vs Limited Slow-start
[Plot: congestion window size in MSS (log scale, up to ssthresh = 83,333) versus time in RTTs (log scale, 16 to 16,000) for standard slow-start and limited slow-start with max_ssthresh = 100; 10 Gbps bandwidth, RTT = 100 ms, MSS = 1500 B.]

27 Scalable TCP (Tom Kelly, Cambridge)
- The responsiveness of a traditional TCP connection to loss events is proportional to both the window size and the round-trip time; with Scalable TCP it is proportional to the round-trip time only.
- Scalable TCP alters the congestion window, cwnd, as follows: for each acknowledgement received in an RTT without loss, cwnd -> cwnd + 0.01, and for each window experiencing loss, cwnd -> cwnd - 0.125*cwnd
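
A minimal sketch of this update rule (window in units of MSS; constants as quoted above):

```python
# Scalable TCP congestion-window update (sketch, window in units of MSS).
A, B = 0.01, 0.125   # per-ACK additive step and multiplicative decrease

def on_ack(cwnd: float) -> float:
    """ACK received in an RTT without loss."""
    return cwnd + A            # ~1% growth per RTT (about one ACK per segment)

def on_loss(cwnd: float) -> float:
    """First loss detected in a window."""
    return cwnd - B * cwnd     # back off by 12.5%
```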

28 Scalable TCP (2)
- As a result, the responsiveness of a connection with 200 ms RTT changes as follows:
  - Standard TCP: packet loss recovery time is nearly 3 minutes at 100 Mbit/s and 28 minutes at 1 Gbit/s
  - Scalable TCP: packet loss recovery time is about 2.7 s at any rate
- Scalable TCP has been implemented in the Linux kernel, together with gigabit kernel modifications: removing the copying of small packets in the SysKonnect driver and scaling the device-driver decoupling buffers to suit Gigabit Ethernet devices
- Initial performance results suggest the variant can provide high speed in a robust manner using only sender-side modifications: up to 400% improvement over standard Linux
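
A sketch of the arithmetic behind these recovery times: standard TCP needs about W/2 round trips to regain a window of W segments, while the multiplicative rule above needs ln(1/(1-b))/ln(1+a) round trips regardless of the window (a derivation from the update rule, not from the slide, but it reproduces the quoted figures):

```python
import math

rtt = 0.200                     # 200 ms
mss_bits = 1500 * 8
a, b = 0.01, 0.125              # Scalable TCP constants quoted above

for rate_bps in (100e6, 1e9):
    w = rate_bps * rtt / mss_bits            # window in segments
    std = (w / 2) * rtt                      # standard TCP: +1 segment per RTT
    print(f"{rate_bps / 1e6:>5.0f} Mbit/s: standard TCP ~{std / 60:.1f} min")

scalable = math.log(1 / (1 - b)) / math.log(1 + a) * rtt
print(f"Scalable TCP: ~{scalable:.1f} s at any rate")
```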

29 Scalable TCP (3)
[Plot: packet loss recovery time versus window size for standard TCP/IP and Scalable TCP/IP.]

30 QuickStart
- Initial assumption: routers are able to determine whether the destination link is significantly under-utilized
  - Similar capabilities are also assumed for Active Queue Management (AQM) and Explicit Congestion Notification (ECN) techniques
- Coarse-grain mechanism, focusing only on the initial window size
- Incremental deployment
- New IP & TCP options: QS request (IP) & QS response (TCP)
- Initial window size = rate * RTT * MSS

31 QuickStart (cont.)
- Carried in SYN/SYN-ACK IP packets
- New IP option: Quick Start Request (QSR)
  - Two TTLs (Time To Live): IP & QSR
  - Sending rate expressed in packets per 100 ms, therefore the maximum rate is 2560 packets/second
  - Rate-based pacing assumed
- A non-participating router ignores the QSR option and therefore does not decrease the QSR TTL
- A participating router either deletes the QSR option or resets the initial sending rate, i.e. accepts or reduces the initial rate
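
A sketch of the resulting initial window, assuming the granted rate is encoded as packets per 100 ms as described above (the function name is illustrative, not from any QuickStart implementation):

```python
# Initial window granted by Quick Start (sketch).
def quickstart_initial_window(rate_pkts_per_100ms: int, rtt_s: float,
                              mss: int = 1500) -> int:
    """Initial window in bytes: granted rate * RTT, converted to bytes."""
    rate_pkts_per_s = rate_pkts_per_100ms * 10
    return int(rate_pkts_per_s * rtt_s * mss)

# e.g. the maximum rate of 2560 packets/s on a 100 ms path:
print(quickstart_initial_window(256, 0.100))   # 384000 bytes (256 packets)
```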

32 XCP Congestion Window Set by Bottleneck Router

33 FAST intellectual advances (S. Low/Caltech)
- New mathematical theory of large-scale networks
- FAST = Fast Active-queue-managed Scalable TCP
- Innovative implementation: TCP stack in Linux
- Experimental facilities: high energy physics networks; Caltech and CERN/DataTAG site equipment (switches, routers, servers); Level(3) SNV-CHI OC192 link; DataTAG link; donated Cisco 12406 with GbE and 10 GbE port cards; Abilene, CalREN2, ...
- Unique features: delay (RTT) as the congestion measure; feedback loop for a resilient window and stable throughput
netlab.caltech.edu

34 SCinet Bandwidth Challenge (FAST)
- SC2002, Baltimore, Nov 2002. Highlights (FAST TCP, standard MTU):
  - peak window = 14,100 pkts
  - 940 Mbps with a single flow/GE card: 9.4 petabit-meter/sec, 1.9 times the I2 LSR
  - 9.4 Gbps with 10 flows: 37.0 petabit-meter/sec, 6.9 times the I2 LSR
  - 16 TB in 6 hours with 7 flows
- Implementation: sender-side modification, delay based, stabilized Vegas
- Testbed spanning Sunnyvale, Baltimore, Chicago and Geneva (segments of roughly 3000 km, 1000 km and 7000 km)
- [Chart comparing the SC2002 1-, 2- and 10-flow results with earlier I2 LSR marks, including the 9.4.02 single-flow IPv6 record.]
C. Jin, D. Wei, S. Low, FAST team & partners (netlab.caltech.edu/FAST)

35 Grid DT (Sylvain Ravot/Caltech)
- A set of patches to Linux (Red Hat) allowing control of:
  - the slow-start threshold & behaviour
  - the AIMD parameters
- Parameter tuning:
  - a new parameter to better start a TCP transfer: set the value of the initial SSTHRESH
  - smaller backoff: reduces the strong penalty imposed by a loss

36 GRID DT (cont.): modifications of the TCP algorithms (RFC 2001)
- Modification of the well-known congestion avoidance algorithm: during congestion avoidance, for every acknowledgement received, cwnd increases by A * (segment size) * (segment size) / cwnd. This is equivalent to increasing cwnd by A segments each RTT; A is called the additive increment.
- Modification of the slow-start algorithm: during slow start, for every acknowledgement received, cwnd increases by M segments; M is called the multiplicative increment.
- Note: A = 1 and M = 1 in TCP Reno.
- A single stream with the modified backoff policy and a different increment allows us to simulate multi-streaming. The single-stream implementation differs from multi-stream in some important ways: it is simpler (lower CPU utilization, to be quantified); startup and shutdown are faster (a performance benefit for short transfers, to be quantified); and there are fewer keys to manage if it is secured.
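
A sketch of the modified per-ACK updates described above, with cwnd and segment size in bytes (standard Reno corresponds to A = M = 1):

```python
# Grid DT-style tunable AIMD increments (sketch, cwnd and MSS in bytes).
def congestion_avoidance_ack(cwnd: float, mss: int, A: float = 1.0) -> float:
    """Per-ACK growth during congestion avoidance: ~A segments per RTT."""
    return cwnd + A * mss * mss / cwnd

def slow_start_ack(cwnd: float, mss: int, M: float = 1.0) -> float:
    """Per-ACK growth during slow start: M segments per ACK."""
    return cwnd + M * mss
```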

37 Comments on the above proposals
- Recent Internet history shows that any modification to the Internet standards can take years before being accepted and widely deployed, especially if it involves router modifications (e.g. RED, ECN).
- Therefore, the chances of getting QuickStart- or XCP-type proposals implemented and widely deployed soon are somewhat limited!
- Proposals requiring only TCP sender-stack modifications are much easier to deploy.

38 Conclusions
- TCP/IP's performance limitations in long-distance, high-speed networks have been known for many years.
- What is new, however, is the widespread availability of 10 Gbps A&R backbones as well as the emergence of 10GigE technology.
- Thus, awareness that the problem requires quick resolution has grown rapidly during the last two years, hence the flurry of proposals.
- It is hard to predict which will win!

