Production test on EDG-1.4


1 Production test on EDG-1.4
Goal 1: simulate and reconstruct 5000 central Pb-Pb events
- 1 job/event
- Output size: about 1.8 GB/event, so 9 TB in total
- Job duration: 12-24 hrs
Goal 2: analyse these events
- 1 job to analyse all the events
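The size estimate on this slide is simple arithmetic and can be checked with a short sketch. The variable names are illustrative, and decimal units (1 TB = 1000 GB) are assumed, which is what the rounding to 9 TB suggests.

```python
# Back-of-the-envelope check of the production plan on slide 1.
# Assumption: decimal units, 1 TB = 1000 GB.
N_EVENTS = 5000            # central Pb-Pb events, one job per event
GB_PER_EVENT = 1.8         # approximate output size per event
JOB_HOURS = (12, 24)       # quoted range of job duration, in hours

total_tb = N_EVENTS * GB_PER_EVENT / 1000
total_job_hours = tuple(N_EVENTS * h for h in JOB_HOURS)

print(f"Total output: {total_tb:.1f} TB")
print(f"Total job time: {total_job_hours[0]} to {total_job_hours[1]} hours")
```

The second figure is only the summed job duration; it says nothing about wall-clock time, which depends on how many jobs the testbed can run in parallel.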

2 Status
- Started on March 15th
- Stopped on May 31st
- About 450 central Pb-Pb events simulated (6 jobs/day) :-(
- Output registered in the EDG Alice RC
- Output stored on:
  - EDG disk SEs (300)
  - EDG MSS SEs (150)
  - CASTOR at CNAF and CERN (all, registered in the AliEn Data Catalogue)
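The quoted rate of about 6 jobs/day follows from the dates and the event count, as the sketch below shows. The year is not stated on the slide; 2003 is an assumption used only to construct the dates, and the March-to-May day count is the same in any year.

```python
from datetime import date

# Assumption: the year is not given on the slide; 2003 is a placeholder.
start = date(2003, 3, 15)
stop = date(2003, 5, 31)
events_done = 450                                  # "about 450" on the slide

# Events stored on disk SEs and MSS SEs, as listed on the slide.
stored = {"EDG disk SE": 300, "EDG MSS SE": 150}

days = (stop - start).days
rate = events_done / days

print(f"{days} days, about {rate:.1f} events/day")
print(f"Events on EDG SEs: {sum(stored.values())}")
```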

3 Comments
- Average efficiency: 35%
- More jobs would mean lower efficiency
- Application Testbed unstable on the time scale of our job duration (24 h)
- Most of the jobs failed because of service failures
- It takes a long time to track down the errors and recover (i.e., clean up the RC by hand when needed)

4 Failure reasons: RB overloaded
- Service crash: jobs get lost even though they are executing on a WN, and they can no longer be tracked or monitored
- stdout/stderr can't be monitored during execution
- The job might complete correctly and store/register the output on/in the SE/RC
- No Output Sandbox available
- No change of job status

5 Failure reasons: WN disk space full
- Alice jobs produce a 2 GB output
- Sometimes the available disk space on the executing WN fills up and the job crashes

6 Failure reasons: the "Lyon" problem
- WNs publish the total available memory in the IS
- The JDL memory requirement is compared to the published values
- When more than one job is allowed on the WN, the memory is shared
- AliRoot jobs break because they need more memory than is actually available
- Workaround by F. Hernandez
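The mismatch described on this slide can be illustrated with a toy model. Everything in it is hypothetical: the class and function names, the node and job figures, and in particular the assumption that memory is split evenly among job slots. It is not the EDG matchmaking code and it does not represent the workaround mentioned above.

```python
from dataclasses import dataclass


@dataclass
class WorkerNode:
    published_memory_mb: int   # total memory advertised in the information system
    job_slots: int             # number of jobs allowed to run concurrently


def matches_requirement(wn: WorkerNode, required_mb: int) -> bool:
    """Matchmaking as described on the slide: compare against the published total."""
    return wn.published_memory_mb >= required_mb


def memory_per_job_mb(wn: WorkerNode) -> float:
    """Simplifying assumption: memory is shared evenly among the job slots."""
    return wn.published_memory_mb / wn.job_slots


wn = WorkerNode(published_memory_mb=1024, job_slots=2)   # hypothetical node
required_mb = 600                                        # hypothetical job requirement

print(matches_requirement(wn, required_mb))      # the node is selected for the job
print(memory_per_job_mb(wn) >= required_mb)      # but a fully loaded node cannot satisfy it
```

In this model the node passes the check against its published total, yet each of two concurrent jobs effectively gets less than the job asked for.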

7 Behaviours not understood
- Some jobs go to "OutputReady" status after 6-8 days
- MSS jobs fail more frequently (and job information is only available for CNAF jobs)

8 MSS jobs

Category       Jobs
OK               74
LDAP failure     23
RC failure       35
Disk full        16
Lost             32
Wrapper          39
Running          36
Submit           15
---------------------
Total           270
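To read the MSS job counts above as shares of the total, a short sketch can recompute the sum and print each category as a percentage. The dictionary simply copies the slide's numbers; the variable names are illustrative.

```python
# Job counts copied from slide 8.
mss_jobs = {
    "OK": 74,
    "LDAP failure": 23,
    "RC failure": 35,
    "Disk full": 16,
    "Lost": 32,
    "Wrapper": 39,
    "Running": 36,
    "Submit": 15,
}

total = sum(mss_jobs.values())   # should reproduce the "Total" row (270)

# Print categories from most to least frequent, with their share of all jobs.
for category, count in sorted(mss_jobs.items(), key=lambda kv: -kv[1]):
    print(f"{category:<14}{count:>4}{100 * count / total:>7.1f}%")
```

Note that "Running" and "Submit" are presumably jobs that had not yet reached a final state, so the "OK" share is a snapshot rather than a final success rate.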

9 Conclusions
- The EDG Application Testbed is not suitable for large productions (lack of resources)
- Its use is very frustrating: instability, limited functionality, low efficiency
- At the present rate, it would take 18 months to complete the production :-(
- Functionality for data analysis is now missing
- The application testbed is being closed
- Use AliEn for data analysis and wait for LCG-1

