Storage System Integration with High Performance Networks Jon Bakken and Don Petravick FNAL
Overview Review of some salient characteristics of wide-area networks. Describe initial investigations at Fermilab for optimizing wide area file transfers –integrated with production WAN/LAN and storage systems.
Wide Area Characteristics Most prominent characteristic, compared to LAN, is the very large bandwidth*delay product. Underlying structure – it's a packet world! Possible to use pipes between specific sites. These circuits can be both static and dynamic, and both IP and non-IP (for example, Fibre Channel over SONET) –FNAL has proposed investigations and has just begun studies with its storage systems to optimize WAN file transfers using pipes.
Bandwidth*Delay At least bandwidth*delay bytes must be kept in flight on the network to maintain bandwidth. –This fact is independent of protocol. –Current practice uses more than this lower limit. For example, US CMS used ~2x for their DC04. CERN <-> FNAL has a measured ~60 ms delay. Using the 2x factor, 120 ms delay gives: –30 MB/sec: ~3-4 MB in flight –1000 MB/sec: ~120 MB in flight
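The arithmetic on this slide can be checked with a minimal sketch. The function name `mb_in_flight` and its parameters are illustrative and not from the presentation; the 60 ms delay and the 2x factor are the values quoted above.

```python
def mb_in_flight(bandwidth_mb_per_s: float, delay_ms: float, factor: float = 1.0) -> float:
    """Return the MB that must be in flight: bandwidth * delay * safety factor."""
    return bandwidth_mb_per_s * (delay_ms / 1000.0) * factor


# 60 ms measured delay with the ~2x factor, i.e. an effective 120 ms.
for rate in (30, 1000):
    print(f"{rate} MB/sec needs about {mb_in_flight(rate, 60, factor=2):.1f} MB in flight")
```

The two rates give roughly 3.6 MB and 120 MB, matching the "~3-4 MB" and "~120 MB" figures on the slide.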
Bandwidth*Delay and IP Given a single lost packet and a standard MTU size of 1500 bytes, the host will receive many out-of-order packets before receiving the retransmitted missing packet. –Must incur at least 2 delays' worth of packets. FNAL <-> CERN (2*60 ms delay): –30 MB/sec: more than 2400 packets –1000 MB/sec: more than 80000 packets
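The packet counts follow from dividing the bytes sent during two delays by the MTU. A small sketch of that calculation, using integer arithmetic to avoid rounding; the function name and parameters are assumptions for illustration only:

```python
def packets_before_retransmit(rate_mb_per_s: int, delay_ms: int,
                              mtu_bytes: int = 1500, delays: int = 2) -> int:
    """Packets that arrive while the receiver waits `delays` delays for a retransmit."""
    # bytes sent = rate (bytes/sec) * waiting time (sec); kept in integers throughout
    bytes_sent = rate_mb_per_s * 1_000_000 * delays * delay_ms // 1000
    return bytes_sent // mtu_bytes


for rate in (30, 1000):
    print(rate, "MB/sec:", packets_before_retransmit(rate, 60), "packets")
```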
Knee-Cliff-Collapse Model When load on a segment approaches a threshold, a modest increase in throughput is accompanied by a great increase in delay. Even more throughput results in congestion collapse. Cannot load a network arbitrarily. TCP tries to avoid collapse, but its solution has problems at large bandwidth*delay.
Bandwidth and Delay and TCP Stream model of TCP implies packet buffering is in the kernel - this leads to kernel efficiency issues. Vanilla TCP behaves as if all packet loss is caused by congestion. –TCP's solution is to back off throughput in AIMD fashion to avoid congestion collapse: Lost packet? Cut packets in flight by ½. Success? Open the window next time by one more packet. –This leads to a very large recovery time at high bandwidth*delay: 30 MB/sec drops to 15 MB/sec with just 1 lost packet –Recovery must regain 15 MB/sec * 120 ms = 1.8 MB of window, which is 1.8 MB / 1500 byte MTU = 1200 packets, at one packet per 120 ms –Recovery time is 1200 * 120 ms = 144 sec, more than 2 minutes!
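The AIMD rule on this slide (halve on loss, add one packet per successful round trip) can be sketched as a short loop. This is a simplified model that counts the window in whole packets and ignores slow start and timeouts; the function names and the 100-packet example window are illustrative assumptions.

```python
def rtts_to_recover(window_packets: int) -> int:
    """Round trips AIMD needs to regain `window_packets` after a single loss."""
    cwnd = window_packets // 2   # multiplicative decrease: cut packets in flight by half
    rtts = 0
    while cwnd < window_packets:
        cwnd += 1                # additive increase: one more packet per round trip
        rtts += 1
    return rtts


def recovery_seconds(window_packets: int, rtt_ms: float) -> float:
    """Recovery time in seconds for a given window and round-trip time."""
    return rtts_to_recover(window_packets) * rtt_ms / 1000.0


print(rtts_to_recover(100), "round trips")
print(recovery_seconds(100, 120), "seconds at 120 ms")
```

Because the number of round trips grows with the window and each round trip takes longer on a long path, recovery time in this model grows with both bandwidth and delay.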
Strategies Smaller, lower bandwidth TCP streams in parallel –Examples of these are GridFTP and bbFTP. Tweak AIMD algorithm –Logic is in the sender's kernel stack only (congestion window) –FAST, and others –USCMS used an FNAL kernel mod in DC04 –May not be fair to others using shared network resources. Break the stream model, use UDP and cleverness, especially for file transfers. But: –You have to be careful and avoid congestion collapse. –You need to be fair to other traffic, and be very certain of it –Isolate strategy by confining transfer to a pipe
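One way to see why parallel streams help is to look at the aggregate rate right after a single lost packet. The sketch below assumes the total rate is split evenly across streams and that only the stream that lost the packet halves its rate; both are simplifying assumptions, and the function name is illustrative.

```python
def rate_after_single_loss(total_rate_mb_per_s: float, streams: int) -> float:
    """Aggregate rate just after one stream halves its window."""
    per_stream = total_rate_mb_per_s / streams
    # Only the affected stream backs off; the others keep their rate.
    return total_rate_mb_per_s - per_stream / 2


for n in (1, 10):
    print(f"{n} stream(s): 30 MB/sec drops to {rate_after_single_loss(30, n)} MB/sec")
```

With one stream the transfer loses half its rate; with ten streams the same loss costs 5% of the aggregate.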
Pipes and File Transfer Primitives Tell the network the bandwidth of your stream using RSVP, the Resource Reservation Protocol. Network will forward the packets/sec you reserved and drop the rest (QoS). Network will not oversubscribe the total bandwidth. Network leaves some bandwidth out of the QoS for others. Unused bandwidth is not available to others at high QoS.
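"Forward what you reserved and drop the rest" is commonly implemented with a token bucket. The sketch below is a generic policer under that assumption, not a description of any particular router or of RSVP itself; the class name, the packets/sec rate, and the burst size are all hypothetical.

```python
class TokenBucketPolicer:
    """Admit packets up to a reserved rate; drop anything beyond it."""

    def __init__(self, rate_pps: float, burst: float) -> None:
        self.rate_pps = rate_pps   # reserved packets per second
        self.burst = burst         # maximum tokens that can accumulate
        self.tokens = burst
        self.last = 0.0            # time of the previous packet, in seconds

    def admit(self, now: float) -> bool:
        """Return True if a packet arriving at time `now` is forwarded."""
        elapsed = now - self.last
        self.last = now
        # Refill tokens for the elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate_pps)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False               # over the reservation: packet is dropped


policer = TokenBucketPolicer(rate_pps=10, burst=5)
forwarded = sum(policer.admit(0.0) for _ in range(20))
print(forwarded, "of 20 back-to-back packets forwarded")
```

A sender that exceeds its reservation sees drops, which is why an underestimated reservation interacts badly with AIMD.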
[Diagram: Storage Element. Labels: Grid Side (WAN), Worker Node Side (LAN), File Srv, worker, File Stage In (POSIX style I/O), File Stage Out.]
Storage System and Bandwidth Storage Element does not know the bandwidth of an individual stream very well at all –For example, a disk may have many simultaneous accessors, or the file may be in memory cache and transferred immediately –Bandwidth depends on the fileserver disk and your disk. Requested bandwidth too small? –If QoS tosses a packet, AIMD will drastically affect transfer rate. Requested bandwidth too high? –Bandwidth at QoS level wasted, overall experimental rate suffers. Storage Element may know the aggregate bandwidth better than individual stream bandwidth. –Storage Element therefore needs to aggregate flows onto a pipe between sites, not deal with QoS on a single flow. –This means the local network will be involved in aggregation.
FNAL investigations Investigate support of static and dynamic pipes by storage systems in WAN transfers. –Fiber to Starlight optical exchange at Northwestern University. –Local improvements to forward traffic flows onto the pipe from our LAN –Local improvements to admit traffic flows onto our LAN from the pipe –Need changes to Storage System to exploit the WAN changes.
Fiber to Starlight FNAL's fiber pair has the potential for 33 channels between FNAL and Starlight (3 to be activated soon). Starlight provides FNAL's access to Research and Education Networks: –ESnet –DOE Science Ultranet –Abilene –LHCnet (DOE-funded link to CERN) –SurfNet –UKLight –CA*Net –National Lambda Rail
LAN – Pipe investigation Starlight path bypasses FNAL border router Aggregation of many flows to fill a (dynamic) pipe. We believe that pipes will be owned by a VO. Forwarding to the pipe is done on a per flow basis Starlight path ties directly to production LAN and production Storage Element (no dual NICs).
[Diagram: Forwarding Server. Labels: File server, Router and Core Network, Forwarding server, ESnet, Starlight.]
Flow-by-flow Strategy Storage element identifies flows to the forwarding server by using layer 5 information –Host IP, Dest IP, Host Port, Dest Port and Transfer Protocol –And VO information. Forwarding server informs peer site to allow admission. Forwarding server configures local router to forward flow over DWDM link, or the flow takes the default route –A 1 GB/sec pipe is about 30 flows at 30 MB/sec. –If flows are 1 GB files, each lasts about 30 sec, which yields about 1 flow change/sec. Forwarding server allows flows to take alternate path when dynamic path is torn down. –Firewalls may have issues with this. Incoming flows are analogous. Flow-by-flow solution seems to suit the problem well, but there are plenty of implementation issues.
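The per-flow bookkeeping described here can be sketched as a small table keyed on the flow identifiers listed on the slide. This is only an illustration of the idea: `Flow`, `ForwardingTable`, the "pipe"/"default" route labels, and the example addresses are hypothetical, and a real forwarding server would also have to configure the router and notify the peer site.

```python
from typing import NamedTuple, Set


class Flow(NamedTuple):
    """Identifiers the storage element reports for one transfer."""
    host_ip: str
    dest_ip: str
    host_port: int
    dest_port: int
    protocol: str
    vo: str


class ForwardingTable:
    """Track which flows are forwarded over the pipe; others take the default route."""

    def __init__(self, max_flows: int) -> None:
        self.max_flows = max_flows
        self.on_pipe: Set[Flow] = set()

    def admit(self, flow: Flow) -> str:
        """Place the flow on the pipe if there is room, else leave it on the default route."""
        if flow in self.on_pipe or len(self.on_pipe) < self.max_flows:
            self.on_pipe.add(flow)
            return "pipe"
        return "default"

    def route(self, flow: Flow) -> str:
        return "pipe" if flow in self.on_pipe else "default"

    def teardown(self) -> None:
        """Dynamic pipe is torn down: every flow falls back to the default route."""
        self.on_pipe.clear()


# A 1000 MB/sec pipe shared by 30 MB/sec flows holds about 33 of them.
table = ForwardingTable(max_flows=1000 // 30)
f = Flow("192.0.2.10", "198.51.100.20", 40000, 2811, "gridftp", "cms")
print(table.admit(f))
table.teardown()
print(table.route(f))
```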
Changes to Storage Element to exploit dynamic pipes Build semantics into bulk copy interfaces that allow for batching transfers to use bandwidth when available. Based on bandwidth availability, dynamically change the number of files transferred in parallel. Based on bandwidth availability, change the layer-5 (FTP) protocols used –Switch from FTP to a UDP blaster (SABUL), for example. –Or change the parameters used to tune layer-5 protocols, for example parallelism within FTP. Deal with flows which have not completed when the dynamic pipe is de-allocated.
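Adjusting the number of parallel file transfers to the available bandwidth could look like the sketch below. The function name, the per-file rate, and the cap on parallelism are assumptions for illustration; the presentation does not specify a policy.

```python
def parallel_transfers(available_mb_per_s: float, per_file_mb_per_s: float,
                       max_parallel: int) -> int:
    """Pick how many files to move at once for the bandwidth currently available."""
    if per_file_mb_per_s <= 0:
        raise ValueError("per-file rate must be positive")
    wanted = int(available_mb_per_s // per_file_mb_per_s)
    # Always keep at least one transfer going, and never exceed the configured cap.
    return max(1, min(wanted, max_parallel))


print(parallel_transfers(1000, 30, max_parallel=64))  # pipe available
print(parallel_transfers(10, 30, max_parallel=64))    # pipe gone, little bandwidth left
```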
Summary There are conventional and research approaches to wide area networks. The interactions in the wide area are interesting and important to grid-based data systems. FNAL now has the facilities in place to investigate a number of these issues. Storage Elements are important parts of the investigation and require changes to achieve high throughput and reliable transfers over WAN.