
1 Modeling and Adaptive Scheduling of Large-Scale Wide-Area Data Transfers Raj Kettimuthu Advisors: Gagan Agrawal, P. Sadayappan

2 Exploding data volumes
Astronomy (survey data volumes, chart scale up to 100,000 TB): MACHO et al. 1 TB; Palomar 3 TB; 2MASS 10 TB; GALEX 30 TB; Sloan 40 TB; Pan-STARRS 40,000 TB
Climate: 36 TB in 2004 to 3,300 TB in 2014
Genomics: 10^5 increase in data volumes in 6 years

3 Data movement (diagram: Data Transfer Nodes fronting Storage at each end of the wide-area transfer)

4 Current work
 Understand characteristics of, control, and optimize transfers
 Efficient scheduling of wide-area transfers
 Model – predict and control throughput
–Characterize transfers, identify key features
–Data-driven modeling using experimental data
 Adaptive scheduling
–Algorithm to minimize slowdown
–Experimental evaluation using real transfer logs

5 GridFTP
 High-performance, secure data transfer protocol optimized for high-bandwidth wide-area networks
 Parallel TCP streams; PKI security for authentication, integrity, and encryption; checkpointing for transfer restarts
 Based on the FTP protocol – defines extensions for high-performance operation and security
 The Globus implementation of GridFTP is widely used
 Globus GridFTP servers support usage statistics collection
–Transfer type, size in bytes, start time, transfer duration, etc. are collected for each transfer

6 GridFTP usage log
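The usage-log contents on this slide appear only as an image. As a minimal Python sketch, the snippet below assumes a simple comma-separated layout carrying the fields slide 5 says are collected (transfer type, size in bytes, start time, duration); the field names, ordering, and values are assumptions, not the actual Globus log format.

```python
import csv
from io import StringIO

# Hypothetical usage-log records with the fields slide 5 lists
# (transfer type, size in bytes, start time, duration in seconds).
# The real GridFTP usage-log layout is not reproduced here.
raw = StringIO(
    "RETR,1073741824,2014-03-01T12:00:00,42.5\n"
    "STOR,536870912,2014-03-01T12:05:10,18.2\n"
)

for ttype, nbytes, start, duration in csv.reader(raw):
    # Average throughput in Gbps for each logged transfer.
    gbps = int(nbytes) * 8 / float(duration) / 1e9
    print(f"{ttype} starting {start}: {gbps:.3f} Gbps")
```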

7 Parallelism vs concurrency in GridFTP (diagram: the GridFTP client opens control channels to the GridFTP daemon on port 2811 at each site; with concurrency = 2, two GridFTP server processes run at each Data Transfer Node, each backed by a Parallel File System, and with parallelism = 3, each transfer between Site A and Site B uses three TCP connections)

8 Parallelism vs concurrency

9 Model throughput and control bandwidth allocation
 Objective – control bandwidth allocation for transfer(s) from a source to its destination(s)
 Most large transfers are between supercomputing sites
–They can both store and process large amounts of data
 When a site is heavily loaded, most of its bandwidth is consumed by a small number of sites
 Goal – develop a simple model for GridFTP in terms of:
–Source concurrency (SC) – total number of ongoing transfers between endpoint A and all of its major transfer endpoints
–Destination concurrency (DC) – total number of ongoing transfers between endpoint A and endpoint B
–External load (EL) – all other activity on the endpoints, including transfers to other sites

10 Modeling throughput
 Model destination throughput (DT) using source and destination concurrency (SC, DC)
 Data to train and validate models – load-variation experiments
 Linear models (general form Y' = a1*X1 + a2*X2 + … + ak*Xk + b):
–DT = a1*DC + a2*SC + b1
–DT = a3*(DC/SC) + b2
 Errors >15% for most cases
 Log model:
–log(DT) = a4*log(SC) + a5*log(DC) + b3
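As a concrete illustration of fitting the log model above, here is a minimal Python sketch using NumPy least squares; the training rows are invented stand-ins for the load-variation experiments, not measured data.

```python
import numpy as np

# Illustrative (made-up) training rows: source concurrency, destination
# concurrency, observed destination throughput in Gbps.
data = np.array([
    [1, 1, 0.9],
    [2, 1, 1.1],
    [2, 2, 1.8],
    [4, 2, 1.6],
    [4, 4, 2.4],
    [8, 4, 2.1],
])
SC, DC, DT = data[:, 0], data[:, 1], data[:, 2]

# log(DT) = a4*log(SC) + a5*log(DC) + b3 becomes ordinary linear
# least squares after taking logs of both sides.
X = np.column_stack([np.log(SC), np.log(DC), np.ones(len(SC))])
(a4, a5, b3), *_ = np.linalg.lstsq(X, np.log(DT), rcond=None)

def predict_dt(sc, dc):
    """Predicted destination throughput from the fitted log model."""
    return np.exp(a4 * np.log(sc) + a5 * np.log(dc) + b3)

print(predict_dt(4, 2))  # predicted Gbps at SC=4, DC=2
```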

11 Modeling throughput
 Log model is better than the linear models, but errors are still high
 A model based on just SC and DC is too simplistic
 Incorporate external load
–External load – network, disk, and CPU activity outside the transfers
–How to measure the external load?
–How to include external load in the model(s)?

12 External load
 Multiple training data points with the same SC, DC – collected on different days and at different times
 EL captures the throughput differences for the same SC, DC
 Three different functions for external load (EL):
–EL1 = T − AT, where T is the throughput of transfer t and AT is the average throughput of all transfers with the same SC, DC as t
–EL2 = T − MT, where MT is the maximum throughput with the same SC, DC as t
–EL3 = T / MT
 Models with external load:
–Linear: DT = a6*DC + a7*SC + a8*EL + b4
–Log: DT = SC^a9 * DC^a10 * AEL{a11} * 2^b5, where AEL{a11} = EL^a11 if EL > 0, and |EL|^(−a11) otherwise
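A small Python sketch of the three external-load functions and the AEL transform, transcribing the definitions above; the history list and the EL = 0 fallback are assumptions.

```python
def external_load(t, history, kind="EL1"):
    """External load for a transfer with observed throughput t, relative to
    past transfers with the same SC, DC (their throughputs in `history`)."""
    at = sum(history) / len(history)  # AT: average throughput, same SC/DC
    mt = max(history)                 # MT: max throughput, same SC/DC
    if kind == "EL1":
        return t - at
    if kind == "EL2":
        return t - mt
    return t / mt                     # EL3

def ael(el, a11):
    """AEL{a11} = EL^a11 if EL > 0, |EL|^(-a11) otherwise."""
    if el > 0:
        return el ** a11
    if el < 0:
        return abs(el) ** (-a11)
    return 1.0  # EL == 0 is not covered by the slide; treated as neutral here
```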

13 Models with external load
 In DT = a6*DC + a7*SC + a8*EL + b4: DT is the predicted quantity, DC and SC are controllable, and EL is uncontrollable
 Unlike SC and DC, external load cannot be controlled
 Training the models requires multiple data points with the same SC, DC
 In practice, some recent transfers are available, but all combinations of SC, DC are unlikely to appear

14 Calculating external load in practice
In DT = a6*DC + a7*SC + a8*EL + b4, the DT, DC, and SC of completed transfers are known, so EL can be back-computed. Three methods (sketched below):
 Previous Transfer (PT) method – compute EL from the single previous transfer
 Recent Transfers (RT) method – compute EL from transfers in the past 30 minutes
 Recent Transfers with Error Correction (RTEC) method – RT plus an error term fitted from historic transfers: DT = a6*DC + a7*SC + a8*EL + b4 + e
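A sketch of the RT and RTEC methods as described above; the history record layout and the residual-averaging step for e are assumptions, not the dissertation's exact procedure.

```python
import time

def estimate_el(history, now=None, window=30 * 60):
    """RT method: average the external-load values back-computed from
    transfers that finished within the past 30 minutes."""
    now = time.time() if now is None else now
    recent = [h["el"] for h in history if now - h["end"] <= window]
    return sum(recent) / len(recent) if recent else 0.0

def predict_rtec(dc, sc, el, coeffs, residuals):
    """RTEC method: DT = a6*DC + a7*SC + a8*EL + b4 + e, where e is taken
    here as the mean of recent (observed - predicted) errors."""
    a6, a7, a8, b4 = coeffs
    e = sum(residuals) / len(residuals) if residuals else 0.0
    return a6 * dc + a7 * sc + a8 * el + b4 + e
```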

15 Applying models to control bandwidth
 Find DC, SC that achieve a target throughput: in DT = a6*DC + a7*SC + a8*EL + b4, the target DT is given, EL is known (computed with PT, RT, or RTEC), and DC, SC are computed
 Limit DC to 20 to narrow the search space
–Even then, there is a large number of possible DC combinations (20^n for n destinations)
 SCmax (maximum source concurrency allowed) is the number of possible values for SC
–Heuristics limit the search space to SCmax * #destinations (see the sketch below)
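One way to realize the SCmax * #destinations heuristic is sketched below in Python; `predict` stands in for whichever fitted model is in use, and the SC-consistency penalty is an assumption, not the dissertation's heuristic.

```python
def pick_concurrencies(targets, predict, sc_max=20, dc_max=20):
    """Heuristic sketch: for each candidate source concurrency SC, pick
    each destination's DC independently to best match its target
    throughput, so roughly sc_max * len(targets) choices are explored
    instead of dc_max**n combinations."""
    best, best_err = None, float("inf")
    for sc in range(1, sc_max + 1):
        choice, err = {}, 0.0
        for dest, target in targets.items():
            dc = min(range(1, dc_max + 1),
                     key=lambda d: abs(predict(d, sc, dest) - target))
            choice[dest] = dc
            err += abs(predict(dc, sc, dest) - target)
        # Assumed consistency penalty: SC should roughly equal the sum of
        # the per-destination concurrencies it implies.
        err += abs(sum(choice.values()) - sc)
        if err < best_err:
            best, best_err = (sc, choice), err
    return best
```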

16 Experimental setup (sites: TACC, NCAR, SDSC, Indiana, NICS, PSC)

17 Experiments
 Ratio experiments – allocate the available bandwidth at the source to the destinations using a predefined ratio (see the sketch below)
–Available bandwidth at Stampede is 9 Gbps
–For ratio 2:1:2:3:3 across Kraken, Mason, Blacklight, Gordon, Yellowstone: Kraken = 2*9 Gbps/(2+1+2+3+3) = 2*9 Gbps/9 = 2 Gbps; Mason = 1 Gbps; Blacklight = 2 Gbps; Gordon = 3 Gbps; Yellowstone = 3 Gbps
–Alternatively, fix Kraken = 3 Gbps and determine the remaining allocations X1, X2, X3, X4 Gbps for Mason, Blacklight, Gordon, Yellowstone
 Factoring experiments – increase a destination's throughput by a given factor when the source is saturated
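The ratio arithmetic above generalizes to a one-liner; this Python sketch reproduces the slide's 2:1:2:3:3 example exactly.

```python
def ratio_allocation(total_gbps, ratios):
    """Split available source bandwidth by a predefined ratio."""
    parts = sum(ratios.values())
    return {dest: total_gbps * r / parts for dest, r in ratios.items()}

print(ratio_allocation(9, {"Kraken": 2, "Mason": 1, "Blacklight": 2,
                           "Gordon": 3, "Yellowstone": 3}))
# {'Kraken': 2.0, 'Mason': 1.0, 'Blacklight': 2.0, 'Gordon': 3.0, 'Yellowstone': 3.0}
```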

18 Results – Ratio experiments
Ratios are 4:5:6:8:9 for Kraken, Mason, Blacklight, Gordon, and Yellowstone. In one run the concurrencies picked by the algorithm were {1,3,3,1,1} (model: log with EL1; method: RTEC); in the other they were {1,4,3,1,1} (model: log with EL3; method: RT).

19 Results – Factoring experiments
Increasing Gordon's baseline throughput by 2x: the concurrency picked by the algorithm for Gordon was 5.
Increasing Yellowstone's baseline throughput by 1.5x: the concurrency picked by the algorithm for Yellowstone was 3.

20 Adaptive scheduling of data transfers (diagram: Data Transfer Nodes and Storage, as on slide 3)

21 Adaptive scheduling of data transfers

22  Bursty transfers → opportunity for adaptive scheduling
 Goals – optimize throughput, improve response times
 Challenge – adaptive concurrency
–Low load – increase concurrency (on unsaturated destinations) to maximize utilization
–New requests → queue them or adjust the concurrency of ongoing transfers
 Data transfer scheduling is analogous to parallel job scheduling
–Data transfers ≅ compute jobs, wide-area bandwidth ≅ compute resources, transfer concurrency ≅ job parallelism
 But CPU, storage, and network differ between source and destination
 And the wide-area network is shared
 Scheduling wide-area data transfers is therefore challenging
–Heterogeneous resources, shared network, dynamic nature of the load
–Scheduling decisions cannot be based on resource availability at one site alone

23 Metrics
 Turnaround time – time a job spends in the system: completion time − arrival time
 Job slowdown – factor by which a job is slowed relative to an unloaded system: turnaround time / processing time
 Bounded slowdown in parallel job scheduling (standard form reproduced below)
 Bounded slowdown for wide-area transfers
 Job priority for wide-area transfers
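The formulas on this slide appear only as images. For reference, the standard bounded slowdown from the parallel job scheduling literature is reproduced below; the wide-area variants and the priority definition are thesis-specific and are not reconstructed here.

```latex
% Bounded slowdown as commonly defined in parallel job scheduling:
% T_w = wait time, T_r = run time, \tau = threshold that keeps very
% short jobs from dominating the metric.
\mathrm{BSLD} = \max\!\left(\frac{T_w + T_r}{\max(T_r,\,\tau)},\; 1\right)
```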

24 Scheduling algorithm
 Maximize resource utilization and reduce slowdown
–Adaptively queue requests and adjust concurrency based on load
 Preemption/restart
–The only state required is the missing-block information; no migration
–Still has overhead (authentication, checkpoint restart); a p-factor limits how often preemption occurs
 Four key decision points (sketched in code below)
–Upon task arrival – schedule or queue?
–If scheduled, at what concurrency value?
–When to preempt (and schedule a waiting job)?
–When to change the concurrency of a running job?
 Use both models and recently observed behavior
–Models to predict throughput and determine the concurrency value
–5-second averages of observed throughput to determine saturation
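A structural sketch of the four decision points in Python; every method name here (destination_saturated, model_concurrency, and so on) is hypothetical glue to show the control flow, not the dissertation's implementation.

```python
def on_arrival(task, sched):
    """Decisions 1 and 2: start the new task at a model-chosen
    concurrency if its destination has headroom, else consider
    preemption (decision 3), else queue it."""
    if not sched.destination_saturated(task.dest):
        sched.start(task, concurrency=sched.model_concurrency(task))
    elif sched.should_preempt_for(task):
        sched.preempt_lowest_priority(task.dest)
        sched.start(task, concurrency=sched.model_concurrency(task))
    else:
        sched.queue.append(task)

def periodic_adjust(sched):
    """Decision 4: use 5-second throughput averages to detect
    saturation and grow or shrink running jobs' concurrency."""
    for job in sched.running:
        observed = sched.avg_throughput(job, seconds=5)
        if sched.destination_saturated(job.dest) and observed < sched.predicted(job):
            sched.lower_concurrency(job)
        elif not sched.destination_saturated(job.dest):
            sched.raise_concurrency(job)
```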

25 Illustrative example
Average turnaround time with the adaptive algorithm is 10.92, versus 12.04 for the baseline.

26 Workload traces
 Traces from actual executions
–Anonymized GridFTP usage statistics
 Busiest day in a one-month period; busiest server log on that day
 Log length limited because replay runs in a production environment
 Three 15-minute logs – 25%, 45%, and 60% load traces
–"Load" is total bytes transferred / maximum that could be transferred
 Destinations are anonymized in the logs
–Reassigned by a weighted random split based on site capacities (see the sketch below)
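A sketch of the weighted random split in Python; the capacity weights are placeholders, since the actual site capacities are not given on the slide.

```python
import random

def assign_destinations(n_transfers, capacities, seed=0):
    """Assign anonymized transfers to destinations with probability
    proportional to site capacity (placeholder weights)."""
    rng = random.Random(seed)
    dests = list(capacities)
    weights = [capacities[d] for d in dests]
    return rng.choices(dests, weights=weights, k=n_transfers)

print(assign_destinations(5, {"Kraken": 2, "Mason": 1, "Gordon": 3}))
```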

27 Experimental results – turnaround time, 60% load

28 Experimental results – worst case, 60% load

29 Experimental results – 60% load, improved baseline

30 Related work
 Several models for predicting behavior and finding the optimal number of parallel TCP streams
–Assume uncongested networks or rely on simulations
 Many studies on bandwidth allocation at routers
–Our focus is application-level control
 Adaptive replica selection and algorithms that utilize multiple paths
–Assume the ability to control the network path
–Overlay networks
 Workflow schedulers – handle dependencies between computation and data movement
 Adaptive file transfer scheduling with preemption in production environments has not been studied

31 Summary of current work
 Models for wide-area data transfer throughput in terms of a few key parameters
 Log models that combine total source concurrency, destination concurrency, and a measure of external load are effective
 Methods that use both recent and historical experimental data are better at estimating external load
 Adaptive scheduling algorithm to improve the overall user experience
 Evaluated using real traces on a production system
 Significant improvements over the current state of the art

32 Proposed work
 File transfers have different time constraints
–From near real time to highly flexible
 Objective – account for time requirements to improve the overall user experience
 Consider two job types – batch and interactive
–First, exploit the relaxed deadlines of batch jobs
–Next, exploit knowledge about future arrival times
 Finally, maximize the utility value of jobs
–Each job has a utility function

33 Batch jobs
 As its deadline approaches, a batch job gets the highest priority
–Scheduled with a concurrency of 2, no preemption
 Otherwise, batch jobs get the lowest priority
 Interactive jobs are measured by turnaround time and slowdown; batch jobs by deadline satisfaction rate

34 Knowledge about future jobs
Setup: source bandwidth 1 GB/s; destination d1 1 GB/s; destination d2 0.5 GB/s. Transfers: T1 – 1 GB to d2, T2 – 1 GB to d1, T3 – 0.5 GB to d2.
Schedule A – no knowledge of future jobs: average slowdown is (1.5 + 1 + 2)/3 = 1.5
Schedule B – with knowledge of future jobs: average slowdown is (1 + 2 + 1)/3 = 1.33
(The original slide shows the two schedules as throughput-vs-time plots over 0–5 seconds, with T2 held in the wait queue under Schedule A.)

35 Utility based scheduling
 Both interactive and batch jobs have deadlines
 Each job has an associated utility function
–Captures the impact of missing the deadline
 Decay – linear, exponential, step, or a combination
 Each transfer request R is defined by the tuple R = (d, A, S, D, U), as sketched below:
–d = destination
–A = arrival time of R
–S = size of the file to be transferred
–D = deadline of R
–U = utility function of R
 Objective – maximize the aggregate utility value of jobs
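The request tuple maps naturally onto a small Python data class; the linear-decay function below is one example of the decay shapes listed above, with made-up parameters.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class TransferRequest:
    """R = (d, A, S, D, U) from the slide, with field names spelled out."""
    dest: str                            # d: destination
    arrival: float                       # A: arrival time
    size: float                          # S: file size in bytes
    deadline: float                      # D: deadline
    utility: Callable[[float], float]    # U: utility vs. completion time

def linear_decay(deadline, max_utility=1.0, grace=60.0):
    """Example utility: full value up to the deadline, then a linear
    drop to zero over `grace` seconds (parameters are assumptions)."""
    def u(completion_time):
        late = max(0.0, completion_time - deadline)
        return max(0.0, max_utility * (1.0 - late / grace))
    return u
```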

36 Utility based scheduling
 The inverse of a job's instantaneous utility value serves as its priority
 Instantaneous utility value is computed from the job's utility function (see the sketch below)
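The slide's formula for the instantaneous utility value is only shown as an image; the sketch below assumes one plausible reading, namely the utility the job would obtain if it completed now, and builds on the TransferRequest class above.

```python
def priority(request, now):
    """Assumed reading: priority is the inverse of the utility the job
    would accrue if it finished at time `now`; the epsilon guards
    against zero utility. Not the dissertation's exact formula."""
    u = request.utility(now)
    return 1.0 / max(u, 1e-9)
```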

37 Questions

