
1 Resource Predictors in HEP Applications. John Huth (Harvard), Sebastian Grinstein (Harvard), Peter Hurst (Harvard), Jennifer M. Schopf (ANL/NeSC)

2 The Problem Large data sets get recreated, and scientists want to know whether they should –Fetch a copy of the data –Recreate it locally This problem can be considered in the context of a virtual data system that tracks how data is created, so recreation is feasible

3 To make this decision you need 1) An estimate of the time to recreate the data –Info about data provenance, machine types, etc. 2) An estimate of the data transfer time 3) A framework that lets you take advantage of these choices by adapting the workflow accordingly

4 To make this decision you need 1) An estimate of the time to recreate the data –Info about data provenance, machine types, etc. 2) An estimate of the data transfer time 3) A framework that lets you take advantage of these choices by adapting the workflow accordingly –OUR AREA OF CONCENTRATION
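As a minimal sketch, the decision reduces to comparing the two estimates. The function name and the assumption that both estimates are in seconds are illustrative, not the project's actual interface:

```python
# Minimal sketch of the fetch-vs-recreate decision; names and units are
# illustrative assumptions, not the project's actual interface.
def choose_plan(regen_estimate_s: float, transfer_estimate_s: float) -> str:
    """Return the cheaper way to instantiate a file."""
    return "transfer" if transfer_estimate_s < regen_estimate_s else "recreate"
```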

5 Regeneration Time Estimates Previous work (CHEP 2004, “Resource Predictors in HEP Applications”) estimated the runtime of an ATLAS application –End-to-end estimation, since no low-level application model is available –Used data about input parameters (number of events, versioning, debug on/off, etc.) and benchmark data (using nbench) Estimates are accurate to within 10% for event generation and reconstruction, and 25% for event simulation
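A hypothetical sketch of such an end-to-end predictor (not the actual CHEP 2004 model): a per-event cost and a fixed overhead fitted from historical runs, scaled by the ratio of nbench scores between the reference host and the target host. All names and the scaling rule here are illustrative assumptions:

```python
# Hypothetical end-to-end runtime predictor: linear in event count,
# scaled by relative CPU benchmark score. Coefficients would be fitted
# from historical runs on a reference host.
def estimate_runtime_s(n_events: int,
                       per_event_s: float,    # fitted per-event cost on the reference host
                       startup_s: float,      # fitted fixed overhead (initialization, I/O)
                       ref_nbench: float,     # nbench score of the reference host
                       target_nbench: float   # nbench score of the target host
                       ) -> float:
    """Estimate end-to-end runtime on the target host, in seconds."""
    cpu_scale = ref_nbench / target_nbench    # slower target => scale > 1
    return cpu_scale * (startup_s + per_event_s * n_events)
```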

6 Regeneration Time Estimate Accuracy

7 File Transfer Time Estimates Much previous work exists (e.g. Vazhkudai and Schopf, IJHPCA Vol. 17, No. 3, August 2003) We use simple end-to-end history data from GridFTP logs to estimate behavior –The simple approach works well on our networks/machines –Average bandwidth is used, with no file-size filtering
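A minimal sketch of this history-based estimator: average the end-to-end bandwidths observed in past GridFTP transfers on a link (no file-size filtering) and divide the new file's size by that average. The history format is an assumption for illustration, and it must contain at least one completed transfer:

```python
# History-based transfer-time estimator: mean observed bandwidth over past
# transfers, applied to the new file's size.
def predict_transfer_s(history: list[tuple[float, float]], size_bytes: float) -> float:
    """history: (bytes_transferred, seconds) pairs taken from GridFTP logs."""
    bandwidths = [b / t for b, t in history if t > 0]
    avg_bw = sum(bandwidths) / len(bandwidths)   # bytes per second
    return size_bytes / avg_bw
```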

8 Testbed The following slides describe the testbed: the sizes of the files transferred (fixed vs. varying), the three sites (Boston, CERN, BNL), the machines (OS, architecture, memory), typical latencies, the network bottlenecks, and the variance seen

9 Testbed Files transferred from BNL to Harvard and from CERN to Harvard –BNL (aftpexp01.bnl.gov): 4x 3GHz Xeon, Linux 2.4.21-37.ELsmp, 2.0GB RAM, 1.0 Gbit/s NIC –Harvard: 2x 3.4GHz P4, Linux 2.4.20-21.EL.cernsmp, 1.5GB RAM, 1.0 Gbit/s NIC Typical network routes: –Harvard – NoX – ManLan – ESNet – BNL Typical latency 7.8 ms –Harvard – NoX – ManLan – Chicago (Abilene) – CERN Typical latency 148 ms Bottlenecks are in the machines at each end (e.g. disk access)

10 Initial Transfer History Transferred 20 files each of 25MB, 50MB, 100MB, 250MB, 500MB, 1GB from BNL to Harvard (similar tests for the CERN to Harvard link) –BNL: aftpexp01.bnl.gov (4x 3.0GHz Xeon, Linux 2.4.21-37.ELsmp, 2.0GB RAM, 1 Gbit/s NIC) –Harvard: heplatlas3.physics.harvard.edu (2x 3.4GHz P4, Linux 2.4.20-21.EL.cernsmp, 1.5GB RAM, 1 Gbit/s NIC) –CERN: castorgrid.cern.ch (Linux) Typical routes: –Harvard – NoX – ManLan – ESNet – BNL Typical latency 7.8ms –Harvard – NoX – ManLan – Chicago (Abilene) – CERN Typical latency 148ms Bottlenecks are apparently within the internal Harvard and BNL networks The Harvard and BNL machines and the network were very quiet during this initial phase Transfer times are linear with file size During this quiet period, transfers of typical 100MB ATLAS files show a variance of approximately 5%

11 Network Routing

12 Transfer Benchmarking Transfer files from BNL to Harvard –20 files each of 25MB, 50MB, 100MB, 250MB, 500MB, 1GB Average file transfer times are linear with file size Initially quiet machines and network –Transfers of 100MB files have a variance of ~5%
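A hedged illustration of checking the linearity claim: a least-squares fit of mean transfer time against file size, time = intercept + slope * size. Variable names are placeholders; the measured (size, time) pairs would be fed in:

```python
# Degree-1 least-squares fit of transfer time vs. file size.
import numpy as np

def fit_transfer_model(sizes_mb: np.ndarray, times_s: np.ndarray) -> tuple[float, float]:
    """Return (slope in s/MB, intercept in s) from a linear fit."""
    slope, intercept = np.polyfit(sizes_mb, times_s, 1)
    return slope, intercept
```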

13 Time vs File Size, BNL (Quiet network)

14 Transfer Variance, BNL (100 MB files, quiet network)

15 Transfer Benchmarking Some data taken during “Service Challenge 3” Average file transfer times are still linear with file size, but have larger variance

16 Time vs File Size, BNL (Busy network)

17 Transfer Variance, BNL (100 MB files, busy network)

18 But our concentration was on the framework Given ways to estimate application run time and file transfer time, we want to plug them into an existing framework to make better resource management decisions Could be implemented as a post-processor to optimize DAGs produced by Chimera

19 Workflow Optimization A script parses the DAG, looking for I/O and binaries I/O files are indexed in the Replica Location Service (RLS) A client queries a database for execution parameters and bandwidths The script evaluates execution and transfer times and rewrites the DAG for the fastest plan

20 Workflow Optimization Steps A script parses job submission files, looking for I/O file names, event numbers, and names of binaries Locate files in RLS; determine sizes and provenance A client queries a database for execution-time parameters and end-to-end bandwidths Go job by job (see the sketch below): –Calculate the regeneration time estimate –Calculate the file transfer time estimate –Rewrite job files to perform the fastest instantiation
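A self-contained sketch of this per-job pass. Every type and helper here is a hypothetical stand-in: the real framework queries RLS for replicas and a database for fitted timing parameters, and rewrites Chimera-produced job files rather than returning a plan dictionary:

```python
# Per-job optimization pass: for each job, pick the faster of regenerating
# the output or transferring an existing replica.
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    n_events: int
    output_size_bytes: float
    replica_exists: bool      # is the output already registered in RLS?

@dataclass
class TimingParams:
    startup_s: float          # fixed overhead fitted from history
    per_event_s: float        # per-event cost fitted from history

def optimize_dag(jobs: list[Job], params: TimingParams, bandwidth_bps: float) -> dict[str, str]:
    """Return a plan mapping each job name to 'transfer' or 'execute'."""
    plan = {}
    for job in jobs:
        regen_s = params.startup_s + params.per_event_s * job.n_events
        transfer_s = job.output_size_bytes / bandwidth_bps
        if job.replica_exists and transfer_s < regen_s:
            plan[job.name] = "transfer"   # rewrite the job as a file fetch
        else:
            plan[job.name] = "execute"    # keep the computation in the DAG
    return plan
```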

21 Our Strawman Application ATLAS event reconstruction jobs take ~20 min to compute a 100 MB file File transfer between Boston and BNL takes ~15 s per 100 MB file We created simplified jobs whose average execution times equal the file transfer times, to get a situation closer to the one originally hypothesized This is likely to become more common as data access becomes more contentious and machines/calculations speed up

22 Framework Tests Generate “Non-optimized” DAGs: linear chains that use a random mixture of transfers and calculations to instantiate 10, 20, or 40 files Operate on these DAGs with our optimizer to produce “Optimized” DAGs Submit both “Non-optimized” and “Optimized” DAGs and compare processing times For our particular strawman we expect the “Optimized” DAGs to be 25% faster than the “Non-optimized” ones (an illustrative simulation follows)
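An illustrative simulation of this setup: linear chains whose steps are a random mixture of transfers and calculations, versus chains where the optimizer always takes the cheaper option. The time distributions below are assumptions chosen only to show the mechanism, so the simulated saving will not reproduce the 25% expectation or the measured results:

```python
# Monte Carlo comparison of non-optimized (random choice) vs. optimized
# (cheaper option) linear chains. All distributions are assumptions.
import random

def chain_times(n_steps: int, mean_s: float = 15.0) -> tuple[float, float]:
    """Return (non-optimized, optimized) total times for one random chain."""
    non_opt = opt = 0.0
    for _ in range(n_steps):
        transfer = random.uniform(0.5, 1.5) * mean_s   # assumed spread of transfer times
        compute  = random.uniform(0.5, 1.5) * mean_s   # assumed spread of execution times
        non_opt += random.choice((transfer, compute))  # random mixture of the two methods
        opt     += min(transfer, compute)              # the optimizer's pick
    return non_opt, opt

trials = [chain_times(40) for _ in range(1000)]
saving = 1 - sum(o for _, o in trials) / sum(n for n, _ in trials)
print(f"mean saving ~ {saving:.0%}")
```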

23 Framework Tests

24 Comparison of Results

25 (figure)

26 Optimized Results

27 (figure)

28 Summary Implementation works A 28% time savings is seen Works with crude bandwidth predictions –More sophisticated predictions for dynamic situations would be helpful Most useful when regeneration and transfer times are similar.

