1 The SAM-Grid / LCG Interoperability Test Bed
Gabriele Garzoglio (garzogli@fnal.gov)
Speaker: Pierre Girard (pierre.girard@in2p3.fr)

2 Overview (Gabriele Garzoglio, Sep 28, 2005)
- The Interoperability Test Bed
- Motivations
- Architecture
- Status Report
- Lessons learned / Problems encountered
- Still discussing…
- Conclusions

3 Motivations for the interoperability project
- The SAM-Grid is a convenient meta-computing system for the Run II experiments because it offers:
  - transparent access to the experiment data through SAM
  - integrated application management (job environment preparation, application-sensitive policies, job aggregation)
- But deployment is expensive.
- The idea: DZero will increase its resource pool within the framework of LCG (EGEE), while relying on the SAM-Grid for data and application management.

4 Basic Architecture
[Diagram. Boxes: SAM-Grid, SAM-Grid / LCG Forwarding Node, LCG, SAM-Grid VO-Specific Services. Arrows: Flow of Job Submission; Offers services to …]
- Main issues to track down:
  - Accessibility of the services
  - Usability of the resources
  - Scalability

5 Service/Resource Multiplicity
[Diagram. Elements: SAM-Grid, Forwarding Nodes (FW), LCG Clusters (C), VO-Services (SAM) (S), Network Boundaries. Arrows: Job Flow; Offers Service.]

6 Current Test Bed Configuration
[Diagram. Elements: SAM-Grid, Forwarding Nodes (FW), LCG Clusters (C), VO-Services (SAM) (S), Network Boundaries; some elements are marked "Integration in Progress". Arrows: Job Flow; Offers Service. Sites shown: Wuppertal, CCIN2P3, Clermont-Ferrand, Imperial College, RAL, Lancaster.]

7 Job Scheduling System Adaptation I
- The SAM-Grid sees the FW node as another gateway.
- The SAM-Grid has developed a grid-to-fabric interface (job-manager) that interacts with multiple fabric services (SAM, Monitoring, Environment Preparation); the Batch System is one of them.
- Batch system adaptation is done through a layer of abstraction and implemented via robust local scheduler handlers.

8 Job Scheduling System Adaptation II
- This mechanism is flexible enough to allow the adaptation of SAM-Grid to LCG.
- Job Management (submit, status poll, kill, output gathering, …) is implemented via an LCG “scheduler” handler.
- The handler uses the LCG UI to submit jobs to an LCG broker (logically part of the FW node; in practice it can be anywhere).
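As a rough illustration of the abstraction layer described on slides 7 and 8, the sketch below shows how a job-manager could drive any batch system through one interface. All class and method names are hypothetical and not taken from the SAM-Grid code. The in-memory handler is only a stand-in: a real LCG handler would presumably call the LCG UI at the same four points.

```python
from __future__ import annotations

from abc import ABC, abstractmethod


class SchedulerHandler(ABC):
    """Hypothetical interface the job-manager calls for any batch system."""

    @abstractmethod
    def submit(self, job_description: str) -> str:
        """Submit a job and return a local job identifier."""

    @abstractmethod
    def status(self, job_id: str) -> str:
        """Return the current state of the job."""

    @abstractmethod
    def kill(self, job_id: str) -> None:
        """Cancel the job."""

    @abstractmethod
    def gather_output(self, job_id: str) -> str:
        """Return the job output."""


class InMemoryHandler(SchedulerHandler):
    """Stand-in handler; a real LCG handler would call the LCG UI instead."""

    def __init__(self) -> None:
        self._jobs: dict[str, str] = {}  # job id -> state

    def submit(self, job_description: str) -> str:
        job_id = f"job-{len(self._jobs) + 1}"
        self._jobs[job_id] = "submitted"
        return job_id

    def status(self, job_id: str) -> str:
        return self._jobs[job_id]

    def kill(self, job_id: str) -> None:
        self._jobs[job_id] = "killed"

    def gather_output(self, job_id: str) -> str:
        return f"output of {job_id} ({self._jobs[job_id]})"


def run_and_cancel(handler: SchedulerHandler, job_description: str) -> list[str]:
    """Drive any handler through the same submit / poll / kill sequence."""
    job_id = handler.submit(job_description)
    states = [handler.status(job_id)]
    handler.kill(job_id)
    states.append(handler.status(job_id))
    return states
```

Because `run_and_cancel` depends only on the abstract interface, supporting a new batch system (or LCG) means writing one more handler class, with no change to the calling code.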

9 Overview
- The Interoperability Test Bed
- Motivations
- Architecture
- => Status Report
- Lessons learned / Problems encountered
- Still discussing…
- Conclusions

10 Status Report
- We can submit real DZero data reprocessing and Monte Carlo jobs to LCG via SAM-Grid.
- Jobs land on the available LCG clusters.
- Jobs rely on the SAM station at CCIN2P3 to handle input (binaries and data) and output.
- …see the SAM-Grid monitoring

11 Problems/Lessons Learned I
- Scratch management is the responsibility of the site OR the application.
- DZero requirements on local scratch space:
  - Cannot run on NFS because of intensive I/O
  - Need 4 GB of local space
- SAM-Grid uses job wrappers to do “smart” scratch management (find the best scratch area to use).
- These wrappers rely on the job managers to set up scratch variables ($TMP_DIR, …).
- Under discussion: one criterion for considering a cluster DZero-certified should be that the scratch variables are defined.
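The scratch selection that the job wrappers perform could look roughly like the sketch below. It is an assumption-laden illustration, not the SAM-Grid wrapper: the function names are hypothetical, only `TMP_DIR` comes from the slide, and the check covers the 4 GB requirement but does not try to detect NFS mounts.

```python
import os
import shutil
from typing import Callable, Mapping, Optional, Sequence

REQUIRED_BYTES = 4 * 1024**3  # the 4 GB of local space quoted on the slide


def free_bytes(path: str) -> int:
    """Free space on the file system holding `path`."""
    return shutil.disk_usage(path).free


def pick_scratch(
    env: Mapping[str, str],
    variables: Sequence[str] = ("TMP_DIR",),  # other variable names are site-specific
    required: int = REQUIRED_BYTES,
    free_space: Callable[[str], int] = free_bytes,
) -> Optional[str]:
    """Return the candidate directory with the most free space, or None."""
    best: Optional[str] = None
    best_free = -1
    for name in variables:
        path = env.get(name)
        if not path or not os.path.isdir(path):
            continue  # variable unset, or not pointing at a directory
        available = free_space(path)
        if available >= required and available > best_free:
            best, best_free = path, available
    return best
```

A wrapper would call `pick_scratch(os.environ)` and refuse to start the application when the result is `None`, which is the situation the certification criterion above is meant to prevent.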

12 Problems/Lessons Learned II
- Use of the LCG brokers
  - Experienced problems with disk space for the input sandbox (input sandbox 4 MB, all the rest via SAM)
  - Needed administrative action to resolve the problem
  - Possibly mitigated since we can use multiple brokers (tested with the Wuppertal and CCIN2P3 brokers)

13 Problems/Lessons Learned III
- Job Failure Analysis
  - In general, for a single SAM-Grid job, the forwarding node submits multiple LCG jobs (aggregation management). The output of all the jobs is bundled together in an output sandbox.
  - We observed problems retrieving the output of “aborted” LCG jobs.
  - “Maradona” fails in handling the output.
  - In this case, it is hard to understand what went wrong with the job.

14 Problems/Lessons Learned IV
- Resubmission of non-reentrant jobs
  - Some jobs should not be resubmitted in case of failure. They will be recovered as a separate activity.
  - We had problems overriding the job submission retries from the JDL and the UI configuration.
  - Is this a known bug? A configuration problem on our part?
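For reference, the request to disable retries is normally expressed in the job's JDL through the `RetryCount` attribute. The sketch below only renders such a description as text; the function name, executable and file names are hypothetical. Since the slide reports that this override did not behave as expected on the test bed, it shows the intended request, not a confirmed fix.

```python
def render_jdl(executable: str, arguments: str, retry_count: int = 0) -> str:
    """Render a minimal JDL description as text (illustrative only)."""
    if retry_count < 0:
        raise ValueError("retry_count must be non-negative")
    lines = [
        f'Executable = "{executable}";',
        f'Arguments = "{arguments}";',
        'StdOutput = "stdout.log";',
        'StdError = "stderr.log";',
        'OutputSandbox = {"stdout.log", "stderr.log"};',
        # 0 asks the workload management system not to resubmit on failure
        f"RetryCount = {retry_count};",
    ]
    return "\n".join(lines)
```

Generating the JDL in one place keeps the retry policy explicit per job type, so non-reentrant jobs can default to zero retries while others opt in to more.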

15 Problems/Lessons Learned V
- Network configuration
  - Sites hosting SAM must allow incoming network traffic from the FW node and from all LCG clusters (worker nodes) to allow data handling control and transport.
  - SAM should be modified to provide port range control.

16 Problems/Lessons Learned VI
- SAM configuration
  - SAM can only use TCP-based communication (as expected, UDP does not work in practice on the WAN).
  - SAM had to be modified to allow service accessibility for jobs within private networks (pull-based vs call-back interfaces).
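The difference between call-back and pull-based interfaces can be sketched as follows. With call-backs the service would have to open a connection to the job, which fails when the worker node sits in a private network. With pulling, the job makes every request itself. The classes below are hypothetical in-memory stand-ins, not the SAM API.

```python
from collections import defaultdict, deque


class Station:
    """Hypothetical stand-in for a service reachable from the worker nodes."""

    def __init__(self) -> None:
        self._pending = defaultdict(deque)  # job id -> queued messages

    def post(self, job_id: str, message: str) -> None:
        """Queue a message; the station never opens a connection to the job."""
        self._pending[job_id].append(message)

    def fetch(self, job_id: str) -> list:
        """Hand over and clear whatever is queued for this job."""
        queue = self._pending[job_id]
        messages = list(queue)
        queue.clear()
        return messages


class Job:
    """A job behind a private network: it only makes outbound requests."""

    def __init__(self, job_id: str, station: Station) -> None:
        self.job_id = job_id
        self.station = station
        self.received = []

    def poll(self) -> int:
        """Pull pending messages; returns how many arrived."""
        new = self.station.fetch(self.job_id)
        self.received.extend(new)
        return len(new)
```

The cost of this design is latency and polling load: the job learns about an event only at its next `poll()`, so the polling interval has to be chosen with the number of concurrent jobs in mind.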

17 Still discussing... I
- What does it mean to certify LCG for a certain DZero activity?
- For reprocessing, all the SAM-Grid clusters have undergone an initial certification phase:
  - The cluster processes a well-known dataset, then the results are compared with a reference result.
- What do we do for LCG?
  - Should every individual cluster be certified?
  - Should the LCG as a whole be certified?
- The answer probably depends on the type of activity (Reprocessing, Monte Carlo, Analysis, …).

18 Still discussing... II
- Who operates the SAM-Grid / LCG interoperability system?
- For the SAM-Grid DZero reprocessing, people at the facilities had an interest in having their resources utilized: people at each facility have run operations, submitting jobs to their own facilities.
- Running “operations” means being responsible for the production of the data (routine job submission/monitoring, troubleshooting, facility maintenance/upgrade, …).
- How do we organize the people that operate the LCG interoperability system? Is one responsible person enough?

19 Still discussing... III
- Support on LCG
  - In case something goes wrong on the LCG, DZero has to learn the best channels to request support.
  - What response can DZero expect now and in 2 years?
- As the system becomes more complex, it becomes difficult for the operators to pinpoint the reasons for job failures.
  - LCG will get reports for failures on the SAM-Grid side… and vice versa.

20 Overview
- The Interoperability Test Bed
- Motivations
- Architecture
- Status Report
- Lessons learned / Problems encountered
- Still discussing…
- => Conclusions

21 Conclusions / SAM
- We are moving the test bed to “production” by
  - expanding the system
  - ramping up usage
- We are discussing open issues in operating the interoperability system:
  - LCG certification
  - Organizing the operations
  - Obtaining support for LCG problems
- Our principal target production application is Monte Carlo for DZero.

22 Conclusions / LCG
- Grid batch job environment variables
  - Proposal for standardization made at the last HEPiX and the last Operations Workshop (Bologna): http://edms.cern.ch/document/630962
  - What is the next step? How to proceed with implementation?
- Make middleware (MW) error handling easier
  - By using a well-defined set of MW error codes?
  - Suitable for automatic handling

23 More info at…
- http://www-d0.fnal.gov/computing/grid/doc/SAMGrid-LCG-integration.pdf
- http://www-d0.fnal.gov/computing/grid/doc/SAMGrid-LCG-integration-Lyon-report.pdf
- http://samgrid.fnal.gov:8080/
- http://www-d0.fnal.gov/computing/grid/
- http://d0db.fnal.gov/sam/

