Network-aware OS
DOE/MICS Project Review, August 18, 2003
Tom Dunigan, Matt Mathis, Brian Tierney
CSM lunch seminar, August 14, 2003

Network research at ORNL
- Net100 (Dunigan, Florence Fowler, Steven Carter, Nagi Rao, Bill Wing): TCP tuning
- SSFNet (Bill Wing, Jim Rome): large-scale network simulation
- Multipath routing, TCP dynamics, stable transport protocol (Nagi Rao, Qishi Wu)
- Coming soon:
  - ETF/TeraGrid (40 Gb/s) ($3M)
  - DOE Science Ultranet testbed, lambda switching ($4.5M)
  - sub-lambda provisioning for visualization (NSF, $3.5M)
* Much help from Susan Hicks and Chuck Fisher

Roadmap
- Motivation & background
- Net100 project components
  - Web100
  - network probes & sensors
  - protocol analysis and tuning
- Results
  - TCP tuning daemon
  - tuning experiments
- Ongoing & future research

Net100 is a DOE-funded project (Office of Science): $2.6M, 3 years beginning 9/01; LBNL, ORNL, PSC, NCAR.
Project objectives (network-aware operating systems):
- measure, understand, and improve end-to-end network/application performance
- tune network protocols and applications (grid and bulk transfer)
- emphasis: TCP bulk transfer over high delay/bandwidth nets

Motivation
- Poor network application performance
  - high-bandwidth paths, but applications are slow
  - Is it the application? The OS? The network? ... Yes
  - often need a network "wizard"
- Changing: bandwidths (9.6 Kb/s ... 1.5 Mb/s ... 45 ... 100 ... 1000 ... ? Gb/s)
- Unchanging: TCP
  - speed of light (RTT)
  - packet size (MSS/MTU), still 1500 bytes
  - TCP congestion control
- TCP is lossy by design!
  - 2x overshoot at startup, sawtooth behavior
  - recovery proportional to MSS/RTT^2
  - recovery after a loss can be very slow on today's high delay/bandwidth links, and unacceptable on tomorrow's: at 10 Gb/s cross country, recovery time is over an hour (see the sketch below)

[Figure: ORNL to NERSC ftp over GigE/OC12 (600 Mb/s), 80 ms RTT; instantaneous and average bandwidth over 40 seconds show early startup losses, linear recovery at 0.5 Mb/s, and an average of only 8 Mb/s]
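To make the recovery claim concrete, a back-of-the-envelope sketch (the 10 Gb/s rate is the slide's example; the 100 ms cross-country RTT is an illustrative assumption): after a loss, standard TCP halves cwnd and then adds back one MSS per RTT, so climbing back takes cwnd/2 round trips.

    /* Rough AIMD recovery-time estimate for standard TCP.
     * After a loss, cwnd is halved and grows by one MSS per RTT,
     * so recovery takes (cwnd/2) round trips. */
    #include <stdio.h>

    int main(void) {
        double bw  = 10e9;       /* 10 Gb/s cross-country link */
        double rtt = 0.1;        /* 100 ms round-trip time */
        double mss = 1500 * 8;   /* 1500-byte segments, in bits */

        double pipe_pkts = bw * rtt / mss;        /* window needed to fill the pipe */
        double recovery  = (pipe_pkts / 2) * rtt; /* one MSS added back per RTT */

        printf("pipe = %.0f packets, recovery = %.0f s (%.1f min)\n",
               pipe_pkts, recovery, recovery / 60);
        return 0;
    }
    /* pipe = 83333 packets, recovery = 4167 s (69.4 min): over an hour. */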

TCP 101
- adaptable and fair
- flow-controlled by sender/receiver buffer sizes
- self-clocking, with positive ACKs of in-sequence data
- sensitive to packet size (MTU) and RTT
- slow start: one additional packet per packet ACK'd (exponential growth)
- congestion window (cwnd): the maximum number of packets that can be in flight
- packet loss (3 dup ACKs or a timeout) triggers AIMD:
  - cut cwnd in half (Multiplicative Decrease)
  - add 1 packet to cwnd per RTT (Additive Increase)
- Workarounds: parallel streams; non-TCP (UDP) applications; Net100 (no changes to applications)
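The whole control law above fits in a few lines. A toy model of cwnd evolution only (no ACK clocking, timeouts, or SACK; the loss every 50 RTTs is invented for illustration):

    /* Toy model of TCP cwnd evolution: exponential slow start up to
     * ssthresh, then additive increase; multiplicative decrease on loss. */
    #include <stdio.h>

    int main(void) {
        double cwnd = 1, ssthresh = 64;             /* in segments */
        for (int rtt = 0; rtt < 200; rtt++) {
            int loss = (rtt > 0 && rtt % 50 == 0);  /* a loss every 50 RTTs */
            if (loss)
                cwnd = ssthresh = cwnd / 2;         /* MD: halve the window */
            else if (cwnd < ssthresh)
                cwnd *= 2;                          /* slow start: double per RTT */
            else
                cwnd += 1;                          /* AI: +1 segment per RTT */
            printf("rtt %3d  cwnd %.0f\n", rtt, cwnd);
        }
        return 0;
    }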

Net100 components
- Web100 Linux kernel (NSF): instrumented TCP stack (IETF MIB draft)
- Path characterization
  - Network Tuning and Analysis Framework (NTAF)
  - both active and passive measurement tools
  - database of measurements
- TCP protocol analysis and tuning
  - simulation/emulation: ns, TCP-over-UDP (atou), NISTNet
  - kernel tuning extensions
  - tuning daemon

Web100
- NSF funded (PSC/NCAR/NCSA), web100.org
- Modified Linux kernel: instrumented to read/set TCP variables for a specific flow
  - readable: RTT, counts (bytes, packets, retransmits, dups), state (SACKs, window scale, cwnd, ssthresh)
  - settable: buffer sizes
  - 100+ TCP variables (IETF MIB), exposed under /proc/web100/
- GUI to display/modify a flow's TCP variables in real time
- API for network-aware applications or a tuning daemon (see the sketch below)
- Net100 extensions:
  - additional tuning variables and algorithms
  - event notification
  - Java bandwidth tester (firebird.ccs.ornl.gov:7123)
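A minimal sketch of poking at the /proc/web100/ interface mentioned above, assuming one numbered directory per instrumented flow (that layout is an assumption here; the per-flow variable files are not plain text, so real tools go through the Web100 userland API or the GUI):

    /* Enumerate instrumented TCP connections under /proc/web100.
     * Assumes one numeric directory per flow; to read or set the
     * 100+ TCP variables, use the Web100 library instead. */
    #include <stdio.h>
    #include <dirent.h>
    #include <ctype.h>

    int main(void) {
        DIR *d = opendir("/proc/web100");
        if (!d) { perror("opendir /proc/web100"); return 1; }
        struct dirent *e;
        while ((e = readdir(d)) != NULL)
            if (isdigit((unsigned char)e->d_name[0]))
                printf("instrumented connection id: %s\n", e->d_name);
        closedir(d);
        return 0;
    }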

Network Tool Analysis Framework (NTAF)
- Configure and launch network tools
  - measure bandwidth/latency (iperf, pchar, pipechar)
  - augment tools to report Web100 data
- Collect and transform tool results
  - use NetLogger to transform results into a common format
- Save results for short-term auto-tuning, and archive them for later analysis
  - compare predicted to actual performance
  - measure the effectiveness of tools and auto-tuning
  - provide data that can be used to predict future performance
  - invaluable for comparing tools (pathload/pchar/netest)
- Net100 hosts at: LBNL, ORNL, PSC, NCAR, NERSC, SLAC, UT, CERN, Amsterdam, ANL

TCP flow visualization
Web interface to the data archive and visualization

UT-BATTELLE U.S. Department of Energy Oak Ridge National Laboratory Monitoring Tool Comparison

ORNL wide-area traffic
By volume: measurement 50%, ftp 19%, http 13%, nntp 5%, ssh 5%

TCP tuning
- "Enable" high speed
  - need buffer = bandwidth x RTT (autotune); ORNL/NERSC (80 ms, OC12) needs about 6 MB (see the sketch below)
  - faster slow-start
- Avoid losses
  - modified slow-start
  - reduce bursts
  - anticipate loss (ECN, Vegas?)
  - reorder threshold
- Speed recovery
  - bigger MTU or "virtual MSS"
  - modified AIMD (0.5, 1) (Floyd, Kelly)
  - delayed ACKs, initial window, slow-start increment
- Avoid congestion collapse, be fair (?) ... intranets, QoS
- Net100 tools: ns simulation, NISTNet emulation, "almost TCP over UDP" (atou), WAD/Internet

[Figure: ns simulation, 500 Mb/s link, 80 ms RTT; packet loss early in slow start, and standard TCP with delayed ACKs takes 10 minutes to recover]
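The buffer arithmetic above, plus the standard way an application requests it (a sketch; the OC12 rate and 80 ms RTT match the ORNL/NERSC example, and the kernel may clamp the request to its configured maximums):

    /* Compute bandwidth*RTT and request matching socket buffers.
     * This approximates, in-application, what WAD does per flow. */
    #include <stdio.h>
    #include <sys/socket.h>

    int main(void) {
        double bw_bits = 622e6;              /* OC12, roughly 622 Mb/s */
        double rtt     = 0.080;              /* 80 ms */
        int bdp = (int)(bw_bits * rtt / 8);  /* ~6.2 MB: the "6 MB" above */
        printf("bandwidth*RTT = %d bytes\n", bdp);

        int s = socket(AF_INET, SOCK_STREAM, 0);
        if (s < 0) return 1;
        /* Must be set before connect(); Linux clamps the request to
         * net.core.rmem_max / wmem_max, so those may need raising too. */
        setsockopt(s, SOL_SOCKET, SO_SNDBUF, &bdp, sizeof(bdp));
        setsockopt(s, SOL_SOCKET, SO_RCVBUF, &bdp, sizeof(bdp));
        return 0;
    }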

TCP Tuning Daemon
- Work-around Daemon (WAD)
  - tunes an unknowing sender/receiver at startup and/or during the flow
  - Web100 kernel extensions:
    - pre-set window scale to allow dynamic tuning
    - netlink alerts the daemon of socket open/close (or it can poll)
    - beyond existing Web100 buffer tuning: new tuning parameters and algorithms
    - knobs to disable Linux 2.4 caching, burst management, and sendstall
  - config file with static tuning data; mode specifies dynamic tuning (AIMD options, NTAF buffer size, concurrent streams)
  - daemon periodically polls NTAF for fresh tuning data
  - can do out-of-kernel tuning (e.g., Floyd)
  - written in C (also a Python version)

WAD config file:

    [bob]
    src_addr:
    src_port: 0
    dst_addr:
    dst_port: 0
    mode: 1
    sndbuf:
    rcvbuf:
    wadai: 6
    wadmd: 0.3
    maxssth: 100
    divide: 1
    reorder: 9
    sendstall: 0
    delack: 0
    floyd: 1
    kellyai: 0

Experimental results
- Evaluating the tuning daemon in the wild
  - emphasis: bulk transfers over high delay/bandwidth nets (Internet2, ESnet)
  - tests over 10GigE/OC192, OC48, OC12, OC3, ATM/VBR, GigE, FDDI, 100/10T, cable, ISDN, wireless (802.11b), dialup
  - tests over a NISTNet testbed (speed, loss, delay)
- Various TCP tuning options
  - buffer tuning (static and dynamic/NTAF)
  - AIMD mods (including Floyd, Kelly, static, and autotuning)
  - slow-start mods
  - parallel streams vs. a single tuned stream

Buffer tuning
- Classic buffer tuning
  - a network-challenged application gets 10 Mb/s
  - the same application with a WAD/NTAF-tuned buffer gets 143 Mb/s
- Autotuning buffers (in-kernel): Linux 2.4, Feng's Dynamic Right-Sizing
- Net100 autotuning (see the sketch below)
  - receiver estimates the RTT
  - receiver advertises a window of 2x the data received in an RTT
  - buffer size grows dynamically to 2x bandwidth x RTT
  - separates application buffers from kernel buffers

[Plots: ORNL to PSC, OC192, 30 ms RTT; ORNL to PSC, OC12, 80 ms RTT]
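The autotuning rule above in miniature (a sketch only; the function name, starting window, and cap are illustrative, not the kernel code):

    /* Net100-style receive-window autotuning: grow the advertised
     * window toward 2 * (bytes received in one RTT) = 2 * bandwidth*RTT. */
    #include <stdio.h>

    static unsigned long advertise(unsigned long bytes_this_rtt,
                                   unsigned long window,
                                   unsigned long cap) {
        unsigned long target = 2 * bytes_this_rtt;  /* 2x bandwidth*RTT */
        if (target > window) window = target;       /* grow, never shrink */
        if (window > cap) window = cap;
        return window;
    }

    int main(void) {
        unsigned long win = 64 * 1024;              /* start at 64 KB */
        unsigned long per_rtt[] = {60000, 120000, 240000, 480000};
        for (int i = 0; i < 4; i++) {
            win = advertise(per_rtt[i], win, 8UL << 20);
            printf("RTT %d: advertise %lu bytes\n", i + 1, win);
        }
        return 0;
    }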

Speeding recovery
- Selectable TCP AIMD algorithms (see the sketch below):
  - Floyd HighSpeed TCP: as cwnd grows, increase AI and decrease MD; do the reverse when cwnd shrinks
  - Kelly Scalable TCP: use an MD of 1/8 instead of 1/2, and add a percentage of cwnd (e.g., 1%) each RTT
- Virtual MSS: tune TCP's additive increase (WAD_AI) to add k segments per RTT during recovery
  - k = 6 is like a GigE jumbo frame, but the interrupt rate is not reduced, and the initial window doesn't get k segments

[Plot: Amsterdam to Chicago, GigE via 10GigE, 100 ms RTT, with a UDP burst]
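The Kelly update rule, as stated above, side by side with standard AIMD (a sketch in per-RTT form; Floyd's HighSpeed TCP response is table-driven, so it is omitted here, and the single loss at RTT 50 is invented for illustration):

    /* Kelly Scalable TCP vs. standard TCP AIMD, per-RTT form:
     * standard:  +1 segment per RTT,   cwnd/2 on loss
     * scalable:  +1% of cwnd per RTT,  cwnd - cwnd/8 on loss */
    #include <stdio.h>

    static double kelly_rtt(double c)  { return c + 0.01 * c; }
    static double kelly_loss(double c) { return c - c / 8.0; }
    static double std_rtt(double c)    { return c + 1.0; }
    static double std_loss(double c)   { return c / 2.0; }

    int main(void) {
        double k = 1000, s = 1000;           /* cwnd in segments */
        for (int rtt = 0; rtt < 100; rtt++) {
            if (rtt == 50) { k = kelly_loss(k); s = std_loss(s); }
            else           { k = kelly_rtt(k);  s = std_rtt(s);  }
        }
        printf("after 100 RTTs (one loss): scalable %.0f, standard %.0f\n", k, s);
        return 0;
    }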

WAD tuning
- Modified slow-start and AI
  - there are often losses in slow-start
  - WAD-tuned Floyd slow-start and fixed AI (6)
  - (ORNL to NERSC, OC12, 80 ms RTT)
- WAD-tuned AIMD and slow-start
  - parallel streams with AIMD (1/(2k), k) exploit TCP's fairness
  - WAD-tuned single stream (0.125, 4); the same plus Floyd slow-start
  - (ORNL to CERN, OC12, 150 ms RTT)

Workaround: parallel streams
- Takes advantage of TCP's fairness
- Faster startup: k buffers
- Faster recovery
  - often only 1 stream loses a packet
  - MD: 1/(2k) rather than 1/2
  - AI: k times faster in the linear phase
- BUT:
  - requires rewriting applications
  - how many streams? what buffer size?
- Used by GridFTP, bbftp, the psocket library

[Cartoon: Alice and Bob sharing a link; clever Alice grabs 3 streams. Bad girl...]
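The "equivalence" view of the MD and AI bullets above: one flow tuned with AI = k and MD = 1/(2k) behaves, in aggregate, like k standard streams; k = 4 gives the (0.125, 4) single-stream tuning on the previous slide. A small table generator:

    /* Equivalence tuning: parameters for one flow to mimic k parallel
     * TCP streams (AI = k segments/RTT, MD = 1/(2k)). k = 4 reproduces
     * the (0.125, 4) single-stream tuning shown earlier. */
    #include <stdio.h>

    int main(void) {
        for (int k = 1; k <= 8; k *= 2)
            printf("k = %d streams  ->  AI = %d seg/RTT, MD = %.4f\n",
                   k, k, 1.0 / (2 * k));
        return 0;
    }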

GridFTP tuning
- Can a tuned single stream compete with parallel streams? Mostly not with "equivalence" tuning, but sometimes...
- Parallel streams have a slow-start advantage.
- WAD can divide the buffer among concurrent flows: fairer/faster? Tests so far are inconclusive.
- Testing on the real Internet is problematic. Is there a "congestion metric"? Per unit of time?
- Data/plots from the Web100 tracer. Buffers: 64 KB I/O, 4 MB TCP.

[Table: Mb/s, congestion, and retransmits for untuned, tuned, and parallel flows]

Ongoing Net100 research
- More user-friendly WAD
- Invited to submit the Web100/Net100 mods to Linux 2.6
- Port of Web100 to FreeBSD (Web100 team): a base for AIX, SGI, Solaris, OSF
- Port to the ORNL Cray X1
  - added the Net100 kernel to the Linux network front-end: 4x improvement in wide-area TCP!
- TCP Vegas
  - Vegas avoids loss (if the RTT is increasing, Vegas backs off)
  - can be configured to compete with standard TCP (Feng)
  - CalTech's FAST
- Comparison with other workarounds
  - parallel streams
  - non-TCP (SABUL, FOBS, TSUNAMI, RBUDP, SCTP)
- Additional accelerants: slow-start initial/increment, reorder resilience, delayed ACKs

TCP tuning for other OS's
- Reorder threshold
  - seeing more out-of-order packets (future: multipath?)
  - WAD tunes a bigger reorder threshold for the path: 40x improvement!
  - LBL to ORNL (using our TCP-over-UDP): the dup3 case had 289 retransmits, but all were unneeded!
  - Linux 2.4 already does a good job: it adjusts and caches the reorder threshold and can "undo" congestion avoidance
- Delayed ACKs
  - WAD could turn off delayed ACKs: 2x improvement in recovery rate and slow-start
  - Linux 2.4 already turns off delayed ACKs for the initial slow-start
  - NOTE: aggressive static AIMD (Floyd pre-tune)

[Plot: ns simulation, 500 Mb/s link, 80 ms RTT; packet loss early in slow-start, and standard TCP with delayed ACKs takes 10 minutes to recover]
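A reorder-threshold knob does exist on stock Linux, though only system-wide rather than per-flow: net.ipv4.tcp_reordering (default 3 duplicate ACKs). A sketch of a WAD-style out-of-kernel tune raising it (the value 9 echoes the reorder: 9 entry in the WAD config shown earlier):

    /* Raise the Linux reordering threshold via procfs (needs root).
     * Unlike the Web100 per-flow hook, this affects every TCP flow. */
    #include <stdio.h>

    int main(void) {
        FILE *f = fopen("/proc/sys/net/ipv4/tcp_reordering", "w");
        if (!f) { perror("tcp_reordering"); return 1; }
        fprintf(f, "9\n");   /* tolerate more out-of-order packets */
        fclose(f);
        return 0;
    }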

Planned Net100 research
- Improve ease of use (WAD to WAND)
- Analyze the effectiveness/fairness of current tuning options: simulation, emulation, on the net (systematic tests)
- NTAF probes: characterizing a path to tune a flow
  - integration with SCNM
  - monitoring applications with Web100
  - latest probe tools
- Additional tuning algorithms
  - identify non-congestive loss; ECN?
  - tuning for a dedicated path (lambda/10GigE)
- Parallel/multipath selection and tuning
- WAD-to-WAD tuning
- WAD caching
- SGI/Linux
- Jumbo frame experiments... the quest for bigger and bigger MTUs

Interactions
- Scientific applications
  - SciDAC supernova and global climate
  - data grids (CERN, SLAC)
- Middleware
  - Globus/GridFTP
  - HSI/HPSS
- Network measurement
  - Internet2 end-to-end
  - PingER (Cottrell)
  - Claffy/Dovrolis pathload
  - netest (Guojun)
  - SCNM
- Protocol research
  - Dynamic Right-Sizing (Feng)
  - HighSpeed TCP (Floyd)
  - Scalable TCP (Kelly)
  - TCP Vegas (Feng, Low)
  - Tsunami/SABUL/FOBS/RBUDP
  - parallel streams (Hacker)
- OS vendors
  - Linux
  - IBM AIX/Linux
  - Cray X1
- Talks/papers/software/ t100

Summary
- Novel approaches
  - non-invasive dynamic tuning of legacy applications
  - out-of-kernel tuning
  - using TCP to tune TCP
  - tuning per flow/destination based on recent path metrics or policy (QoS)
- Effective evaluation framework
  - protocol analysis and tuning
  - network/application/OS debugging
  - path characterization tools, archive, and visualization tools
- Performance improvements, WAD-tuned:
  - buffers: 10x
  - AIMD: 2x to 10x
  - delayed ACK: 2x
  - slow-start: 3x
  - reorder: 40x
- Timely: needed for science on today's and tomorrow's networks