
1 EGEE-II INFSO-RI-031688 Enabling Grids for E-sciencE www.eu-egee.org EGEE and gLite are registered trademarks
The usage of the gLite Workload Management System by the LHC experiments
Simone Campana, CERN; Enzo Miccio, CERN-INFN; Andrea Sciabà, CERN
EGEE User Forum 2007, Manchester (UK)

2 Outline
Monte Carlo production and data analysis in ATLAS and CMS
– The ATLAS production system
– The CMS Remote Analysis Builder
The gLite Workload Management System
– Architecture
– Main functionalities
– Requirements for 2007 and 2008
– WLCG acceptance criteria
Testing the gLite WMS
– Results of the acceptance tests
– Single job submission tests
Usage of the gLite WMS in the experiments
– ATLAS
– CMS
– ALICE and LHCb
Conclusions

3 Monte Carlo production
The LHC experiments need to generate huge amounts of simulated data to validate the reconstruction software, test the computing model and develop physics data analysis
– ~50 million events/month in 2007
– ~100 million events/month in 2008
– MC production done at Tier-2 sites  a distributed activity
Specific tools have been developed by each experiment to manage the production workflow
– ATLAS production system
– CMS Production Agent
– These tools need to be interfaced to one or more Grid workload management systems
 To use different Grids
 To use the same Grid in different ways

4 Example: the ATLAS Production System
A central database of jobs to be run
A “supervisor” for each Grid that takes jobs from the central database, submits them to the Grid, monitors them and checks their outcome
An “executor” acting as an interface to the Grid middleware
– EGEE/WLCG:
 Lexor, using the gLite WMS
 Condor-G direct submission
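The supervisor/executor split described above can be sketched as follows. This is a minimal illustration of the pattern, not the real ATLAS code: the names (`JobStore`, `GridExecutor`, `supervise`) and the trivial "always Done" behaviour are assumptions made for the sketch.

```python
class JobStore:
    """Stands in for the central database of jobs to be run."""
    def __init__(self, jobs):
        self.pending = list(jobs)
        self.done = []

    def fetch(self, n):
        """Hand out up to n pending jobs to a supervisor."""
        batch, self.pending = self.pending[:n], self.pending[n:]
        return batch


class GridExecutor:
    """Interface to one Grid middleware (e.g. the gLite WMS or Condor-G)."""
    def submit(self, job):
        return f"grid-id-{job}"   # pretend the submission succeeded

    def status(self, grid_id):
        return "Done"             # pretend the job finished successfully


def supervise(store, executor, batch_size=2):
    """Take jobs from the store, submit them, and check their outcome."""
    while store.pending:
        for job in store.fetch(batch_size):
            grid_id = executor.submit(job)
            if executor.status(grid_id) == "Done":
                store.done.append(job)


store = JobStore(["evgen-001", "evgen-002", "evgen-003"])
supervise(store, GridExecutor())
print(store.done)   # all three jobs processed
```

One supervisor instance per Grid lets the central database stay middleware-agnostic: only the executor changes between the gLite WMS and Condor-G back-ends.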

5 Data analysis
Data analysis is done either by individuals or as an organized activity (event reconstruction, data reprocessing)
– Reconstruction and reprocessing done at Tier-1 sites
– End-user analysis done at Tier-2 sites
– Datasets can be distributed, or replicated at several sites
Tools exist specifically to submit and manage analysis jobs
– To shield the user from the differences between Grids or job management systems
– To integrate the analysis job workflow with the experiment's data management system
– To implement higher-level job management functionalities

6 Example: data analysis with CRAB
The user develops and compiles their code on the UI (User Interface)
– Common libraries are pre-installed at all CMS sites
Given the datasets to analyze, CRAB splits the task into many jobs and submits them near the data
Jobs are submitted via
– the LCG RB or the gLite WMS (EGEE, Open Science Grid)
– Condor-G (OSG only)
The user retrieves the output once the task has finished
Used for two years by physicists and in data challenges
It is being complemented by an analysis server to automate the management of analysis tasks
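The splitting step above can be illustrated with a toy function: given the total number of events in the dataset and a target number of events per job, produce one event range per job. The function name and signature are assumptions for this sketch, not the CRAB API.

```python
def split_task(total_events, events_per_job):
    """Split a dataset of total_events into per-job half-open event ranges."""
    jobs = []
    first = 0
    while first < total_events:
        last = min(first + events_per_job, total_events)
        jobs.append((first, last))   # one job analyzes events [first, last)
        first = last
    return jobs


ranges = split_task(10_000, 3_000)
print(ranges)   # three full jobs plus a smaller remainder job
```

Each resulting range becomes one Grid job, which CRAB then submits to a site hosting a replica of the corresponding data.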

7 The gLite WMS architecture
The service to submit and manage jobs
– Task Queue: holds jobs not yet dispatched
– Information SuperMarket: caches all information about Grid resources
– Match Maker: selects the best resource for each job
– Job Submission & Monitoring
– Interacts with Data Management, Logging & Bookkeeping, etc.
The WMProxy service optimizes job management and stands between the user and the real WMS
– Service Oriented Architecture (SOA) compliant
 Implemented as a SOAP Web service
– Validates, converts and prepares jobs and sends them to the WM
– Interacts with the L&B via LBProxy (a state store for active jobs)
– Implements most of the new features
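The Match Maker step can be sketched in a few lines: filter the cached resource information (the Information SuperMarket) down to the Computing Elements satisfying the job's requirements, then pick the best-ranked one. The dictionaries and the "most free slots" rank below are illustrative assumptions; the real WMS expresses requirements and rank as ClassAd expressions.

```python
# Toy snapshot of the cached resource information.
resources = [
    {"ce": "ce01.example.org", "free_slots": 0,  "max_wall_hours": 48},
    {"ce": "ce02.example.org", "free_slots": 12, "max_wall_hours": 24},
    {"ce": "ce03.example.org", "free_slots": 5,  "max_wall_hours": 72},
]


def match(job, resources):
    """Return the best resource for the job, or None if nothing matches."""
    # Requirements: the CE must have free slots and a long enough wall-time limit.
    candidates = [r for r in resources
                  if r["free_slots"] > 0
                  and r["max_wall_hours"] >= job["wall_hours"]]
    # Rank: prefer the CE with the most free slots.
    return max(candidates, key=lambda r: r["free_slots"], default=None)


job = {"wall_hours": 20}
print(match(job, resources)["ce"])   # ce02.example.org
```

A job whose requirements match no resource is not rejected outright; as the next slide notes, it stays in the internal task queue until a match appears.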

8 Main functionalities
The gLite WMS offers several advantages over the old LCG WMS
– Bulk submission
 Collections: sets of independent jobs
 New, much more reliable implementation as a compound job submission
– Job sandboxes
 Shared input sandboxes for a collection
 Download/upload of sandboxes via GridFTP, HTTP or HTTPS
– Faster match-making
 “Bulk” matchmaking and ranking for collections
– Internal task queue
 If a job cannot be matched right away, it is kept for some time until it matches
– Resubmission of failed jobs
 A job is resubmitted right away after a middleware- or infrastructure-related failure
 Greatly improves the job success rate
– A limiter mechanism that prevents submission of new jobs if the load exceeds a certain threshold
 Leads to an “artificial”, but desired, limitation of the job submission rate
 Improves the stability of the system
– Last but not least, the gLite WMS is actively developed and maintained, while the LCG RB is “frozen”
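The limiter mechanism in the list above amounts to a simple admission check, sketched here under the assumption of the load threshold of 12 used in the tests reported later in this talk. Function and variable names are illustrative, not the actual WMS implementation.

```python
LOAD_THRESHOLD = 12.0   # the machine-load cut-off used in the acceptance tests


def accept_submission(current_load, threshold=LOAD_THRESHOLD):
    """Return True if the WMS should accept a new job submission right now."""
    return current_load <= threshold


print(accept_submission(8.5))    # True: load is fine, job accepted
print(accept_submission(14.2))   # False: submission refused by the limiter
```

A refused submission fails at the client side, so the caller can simply retry later; this trades peak submission rate for the stability of the service.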

9 Requirements for the gLite WMS
Performance (2007):
– CMS: 50K jobs/day
– ATLAS: 20K production jobs/day + analysis load
Performance (2008):
– CMS: 200K jobs/day (120K to EGEE, 80K to OSG), using <10 WMS entry points
– ATLAS: 100K jobs/day through the WMS, using <10 WMS entry points
Stability (both): <1 restart of the WMS or L&B every month under load

10 WLCG acceptance criteria
Based on the experiment requirements, criteria have been defined to decide whether the gLite WMS is acceptable
– At least 10000 jobs/day submitted for at least five days
– No service restart required for any WMS component
– The WMS performance should not show any degradation during this period
– The number of "stuck" jobs should be less than 1% of the total

11 Testing the gLite WMS
The testing of the gLite WMS is mainly done by the Experiment Integration and Support (EIS) team of WLCG
– Collaboration between EIS, JRA1 (EGEE developers), SA1 and SA3 (EGEE)
– Bugs discovered, fixed and patched, bypassing normal certification procedures
– Huge improvements in stability and performance
The gLite WMS is not yet really in production, but is an "experimental" service
– A few instances at CERN, CNAF and Milan used for tests, the CMS 2006 data challenge (CSA06) and ATLAS MC production

12 Test setup
The latest version of the gLite WMS is installed on dedicated machines at CNAF
– Dual Opteron 2.4 GHz, 4 GB RAM
The WMS is stressed by submitting a large number of jobs
– Collections of a few hundred jobs
– Single jobs
The behaviour of the WMS is closely monitored
– Job status: check that jobs do not become "stuck" or abort due to the WMS
– WMS internal status: components running fine, check for bottlenecks, etc.
– System status: high load, excessive I/O or memory consumption, etc.

13 Results of the acceptance test
115000 jobs submitted in 7 days
– ~16000 jobs/day, well exceeding the acceptance criteria
– The "limiter" prevented submission when the load was very high (>12)
All jobs but 320 were processed normally
– ~0.3% of jobs with problems, well below the required threshold
– Recoverable by the user with a proper command
No stale jobs
The WMS dispatched jobs to Computing Elements with no noticeable delay
Acceptance tests were passed
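The two headline rates on this slide follow directly from the raw counts; a quick check of the arithmetic:

```python
jobs, days, problems = 115_000, 7, 320

daily_rate = round(jobs / days)
print(daily_rate)                         # 16429 jobs/day, above the 10000/day criterion

problem_pct = round(100 * problems / jobs, 2)
print(problem_pct)                        # 0.28 %, below the 1 % threshold
```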

14 Single job submission
Submission of single jobs from different parallel processes has also been studied
– Useful for applications that do not need to submit very large numbers of jobs per user, but have many users
Results
– The time needed for job submission becomes a limiting factor
 Maximum submission rate per thread: ~7000 jobs/day
– The limiter refuses about ~30% of jobs because the load is always near the threshold (12)
 Effective rate: ~4500 jobs/day per thread
– The total submission rate is proportional to the number of threads
[Plot: submission failures due to the limiter when the load reaches 12]
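The effective per-thread rate quoted above follows from the maximum rate and the fraction of submissions the limiter refuses; integer arithmetic keeps the sketch exact:

```python
max_rate = 7000                 # jobs/day per thread when no submission is refused
refused_percent = 30            # ~30% of attempts hit the limiter

effective = max_rate - max_rate * refused_percent // 100
print(effective)                # 4900 jobs/day, of the order of the ~4500 observed
```

The observed ~4500 jobs/day is slightly below this back-of-the-envelope figure, consistent with the extra submission-time overhead mentioned on the slide.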

15 Single job submission (cont.)
[Plots: load and submission rate for Thread #1 and Thread #2]
When the load reaches 12, the limiter decreases the submission rate

16 The WMS in the ATLAS MC production
Big ramp-up in the last months
Reached 20000 jobs/day
Wallclock time lost to failures is usually low
– Validation periods or occasional incidents increase it from time to time

17 ATLAS MC error analysis
30% of errors are due to the WMS + infrastructure (Computing Element and batch farm problems)
Most of the wasted wallclock time comes from problems in data management (75% of the total)

18 The WMS in CMS data analysis
CMS supports submission of job collections via the WMS in CRAB (analysis jobs)
– Tested during the computing/software/analysis challenge (CSA06) in 2006 for ~1 month
– The submission rate of "fake" analysis jobs reached about 16000 jobs/day (2 WMS instances used)
– Globally, the gLite + Condor-G submission systems achieved the goal of 50000 jobs/day
[Plot: submitted vs. successful jobs]

19 ALICE and LHCb experience
ALICE is using the LCG Resource Broker to submit "pilot" jobs
– A pilot job "pulls" real MC production jobs from a central queue
– Pilot jobs are far fewer (by one order of magnitude) than real jobs  much less stringent requirements on the WMS than for ATLAS and CMS
– First tests with the WMS have been very successful
 Huge reduction of the time needed to dispatch pilot jobs
LHCb also plans to abandon the LCG RB for the gLite WMS
– As for ALICE, the WMS is used to send pilot jobs  less stringent requirements
– The gLite WMS is already integrated in DIRAC, the LHCb submission system
– Already used in production for a short time; usage should increase significantly very soon
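The pilot-job "pull" model above can be illustrated in a few lines: one pilot running on a worker node repeatedly pulls real payloads from a central queue, so many real jobs are served by a single Grid submission. Names and the sequential loop are illustrative assumptions, not the ALICE or DIRAC implementation.

```python
from collections import deque

# Central queue of real MC production jobs, maintained by the experiment.
central_queue = deque(["mc-job-1", "mc-job-2", "mc-job-3", "mc-job-4"])


def run_pilot(queue):
    """One pilot job: keep pulling and running real jobs until none remain."""
    executed = []
    while queue:
        executed.append(queue.popleft())   # "run" the real payload
    return executed


print(run_pilot(central_queue))
# one Grid submission (the pilot) executed all four real jobs
```

This is why the requirements on the WMS are an order of magnitude lighter: the WMS only dispatches pilots, while the real job scheduling happens inside the experiment's own queue.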

20 Conclusions
Most reliability problems are understood
– A few minor issues are still being investigated
The WMS is not yet officially in production in EGEE
– The version used for the tests is in certification and will be available for deployment in a couple of weeks
The advantages over the LCG Resource Broker are very significant
– Performance with single job submission still needs to be tuned
The improvements achieved during these months have already greatly reduced the effort required to run the gLite WMS in production activities
– e.g. for the ATLAS Monte Carlo production
All the LHC experiments are ready to use it
– They are either already using it or have finished the testing phase

21 Acknowledgements
Thanks to Julia Andreeva, Gianluca Castellani, Gerhild Maier, Patricia Méndez and Roberto Santinelli for their contributions to this presentation
Thanks to JRA1, SA1 and SA3 for their continuing support

