SAM-Grid Status
http://d0db.fnal.gov/sam
  Core SAM development
  SAM-Grid architecture
  Progress
  Future work
Core SAM development
  SAM is a production system
    300 active users
    60,000 file replicas
    5,000 files/day cache turnover (1 TB)
  A fine-tuning example: the Friday afternoon opportunity
    Many users submit several projects for the weekend
    Station has a project limit, O(Ncpus)
    Queue projects, but then how to keep the required data in cache?
    Parallelisation & re-education: N processes per project, not N projects with 1 process each
Multi-Process projects
  [Diagram: a Project Manager serving several Processes]
  Together, the processes see each file once.
  Each process is simple:
    Asks: "Give me a file"
    Responds to: "Here's the path", "Hang on", "None left"
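The request/response loop on this slide can be sketched in a few lines of Python. This is only an illustration of the idea, not the SAM implementation: the class and function names, the `/cache/` path prefix, and the use of threads in place of separate processes are all assumptions made for the example.

```python
import threading
import time
from collections import deque

# The three replies a consumer must understand (wording from the slide).
PATH, HANG_ON, NONE_LEFT = "here's the path", "hang on", "none left"


class ProjectManager:
    """Hypothetical manager: hands each file of a project to exactly one consumer."""

    def __init__(self, files, cached=()):
        self._pending = deque(files)   # files nobody has been given yet
        self._cached = set(cached)     # files already present in the local cache
        self._lock = threading.Lock()

    def stage(self, name):
        """Mark a file as delivered to the cache (stand-in for real data delivery)."""
        with self._lock:
            self._cached.add(name)

    def give_me_a_file(self):
        with self._lock:
            if not self._pending:
                return NONE_LEFT, None
            # Hand out any pending file that is already in the cache.
            for name in self._pending:
                if name in self._cached:
                    self._pending.remove(name)
                    return PATH, "/cache/" + name
            return HANG_ON, None       # files remain, but none is cached yet


def consumer(manager, seen, poll=0.01):
    """A simple consumer: ask, then react to one of the three replies."""
    while True:
        reply, path = manager.give_me_a_file()
        if reply == NONE_LEFT:
            return
        if reply == HANG_ON:
            time.sleep(poll)           # wait for more data to arrive
            continue
        seen.append(path)              # "process" the file


def run_project(files, n_processes, cached=()):
    manager = ProjectManager(files, cached)
    seen = [[] for _ in range(n_processes)]
    workers = [threading.Thread(target=consumer, args=(manager, s)) for s in seen]
    for w in workers:
        w.start()
    for name in files:                 # deliver the remaining files while consumers run
        manager.stage(name)
    for w in workers:
        w.join()
    return seen
```

Calling `run_project(files, 4)` returns one list of paths per consumer. Taken together the lists contain every file exactly once, which is the property the slide describes.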
SAM-Grid Architecture
  [Architecture diagram. Legend: Principal Component, Service, Implementation or Library, Information. Principal components: Job Definition and Management, Monitoring and Information, Data Handling. Other labels: SAM, Grid RC, Condor MMS, Condor-G, GRAM, Grid sensors, Request Broker, Compute Element Resource, Logging and Bookkeeping, Job Scheduler, Info Processor and Converter, Replica Catalog, DH Resource Management, Data Delivery and Caching, Resource Info, Job Client, Job Status Updates, GSI, Batch System, Site Gatekeeper, AAA, MDS-2, Condor Class Ads]
  Job Definition and Management
    Based on the Match Making Service (MMS) of Condor®, through collaboration with the University of Wisconsin CS Group
  Monitoring and Information Services
    Provide a view of the status and history of the system, as well as the information relevant for job and data management
  Data Handling
    The existing SAM system, developed at Fermilab to accommodate high-volume data management, plays a principal role in providing Data Handling services to the Job Management infrastructure
Job Definition and Management
  [Diagram labels: Condor MMS, Condor-G, GRAM, Grid sensors, Request Broker, Compute Element Resource, Job Scheduler, Job Client, Job Status Updates, Batch System, Site Gatekeeper]
  Job Management
    Globus GRAM for inter-operability
    Condor-G for remote submission
    Condor MMS for resource brokerage
  Condor is the Resource Broker
    Collaboration with the Condor group
    Condor members at weekly SAM-Grid meetings
    CVS branch of v6_3_2 with our requested functionality
      Ability to choose globus-scheduler
      External function calls allowed in MMS, so it can query the SAM Db
Monitoring & Information
  [Diagram labels: Grid RC, Logging and Bookkeeping, Info Processor and Converter, Replica Catalog, Resource Info, GSI, AAA, MDS-2, Condor Class Ads]
  Package of information providers to interrogate:
    SAM Station: project progress, disk caches
    Replica Catalogue: file location, size
    Batch Systems: free CPUs
    Resources: OS, code releases present, memory, disk space, …
Data Handling
  [Diagram labels: SAM, DH Resource Management, Data Delivery and Caching]
  Existing SAM system
  Added GridFTP as a transfer protocol (kerb-rcp and bbftp also available)
    Use server certificates issued by the FNAL Kerberized CA
    Delegation of user proxy not (yet) done (accounting, security)
    Server runs as an unprivileged user
  Report bug, receive patch. Apply. Re-build on Linux and Ultrix, repackage, … i.e. very poor support.
  Globus bundles packaged as upd products
  During testing, re-discovered a globus-url-copy bug STILL in the downloadable Globus release!
    Repeat the above procedure? No, take the EDG special globus-url-copy binary.
Future Work
  nth order brokering
    0th order: submit to the site where most data is replicated. Trivial with the Condor additions.
    1st order: sense grid connectivity using WP7 tools as a plugin to Condor
  Inter-site parallelisation: split datasets, move jobs to data
  Dynamic station installation
    To use non-dedicated resources and clean up afterwards
    upd has almost no dependencies on native packages
    Auto-tailoring forced by CDF makes this possible
  Further MC production/SAM integration
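As a rough illustration of what 0th order brokering means, the sketch below picks the site that already holds the largest share of a dataset. It is not the Condor MMS mechanism itself: the function name, the in-memory replica catalogue, and the file and site names are all hypothetical stand-ins.

```python
from collections import Counter


def zeroth_order_site(dataset, replica_catalog):
    """Return (best_site, data_per_site) for a dataset.

    dataset:         mapping of file name -> size (any consistent unit)
    replica_catalog: mapping of file name -> iterable of sites holding a replica
    Both structures are assumptions made for this sketch.
    """
    data_per_site = Counter()
    for name, size in dataset.items():
        for site in replica_catalog.get(name, ()):
            data_per_site[site] += size
    if not data_per_site:
        return None, data_per_site
    # Break ties alphabetically so the choice is deterministic.
    best = max(sorted(data_per_site), key=data_per_site.__getitem__)
    return best, data_per_site


# Hypothetical example: four files, three sites.
dataset = {"f1": 700, "f2": 650, "f3": 680, "f4": 640}
catalog = {
    "f1": {"fnal.gov", "ic.ac.uk"},
    "f2": {"ic.ac.uk"},
    "f3": {"fnal.gov", "ic.ac.uk"},
    "f4": {"wuppertal"},
}
site, totals = zeroth_order_site(dataset, catalog)
print(site, dict(totals))
```

A 1st order broker would go further and weight each site by measured network connectivity rather than by replicated volume alone.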
0th order brokering
  File Count: 99
  Average File Size: 674153
  Total File Size: 66741199
  Total Event Count: 214914
  4 known domains and 3 stations
  At wuppertal:
    4719 Mb ( 7%) from fnal.gov at 0.5 Mb/s
    48539 Mb (73%) from ic.ac.uk at 2.0 Mb/s
    13483 Mb (20%) from pnfs at 0.5 Mb/s
    Transfer time = 18.0 hrs. Plus 2 tape mounts.
  At imperial-test:
    4719 Mb ( 7%) from fnal.gov at 0.5 Mb/s
    48539 Mb (73%) from ic.ac.uk at 10.0 Mb/s
    13483 Mb (20%) from pnfs at 0.5 Mb/s
    Transfer time = 11.8 hrs. Plus 2 tape mounts.
  At central-analysis:
    51909 Mb (78%) from pnfs at 10.0 Mb/s
    14831 Mb (22%) from fnal.gov at 100.0 Mb/s
    Transfer time = 1.5 hrs. Plus 2 tape mounts.
  …but no free cpu!
  [Diagram labels: enstore, tape]
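The transfer-time figures can be approximated with a simple model, sketched here under the assumption that the sources are read one after another, so the estimate is the sum of size divided by rate over all sources. The function names are made up for the example and the slide does not say how the original tool computed its numbers. This model gives about 1.5 hrs for central-analysis, matching the slide, but about 11.5 hrs for imperial-test and 16.9 hrs for wuppertal, somewhat below the quoted 11.8 and 18.0 hrs, so the original calculation presumably includes something this sketch does not (tape mounts are not modelled at all).

```python
def transfer_hours(sources):
    """Estimate hours to fetch a dataset, assuming sources are read sequentially.

    sources: iterable of (size, rate) pairs, with size in the slide's "Mb"
    and rate in the slide's "Mb/s".
    """
    return sum(size / rate for size, rate in sources) / 3600.0


# Volumes and rates copied from the slide.
STATIONS = {
    "wuppertal": [(4719, 0.5), (48539, 2.0), (13483, 0.5)],
    "imperial-test": [(4719, 0.5), (48539, 10.0), (13483, 0.5)],
    "central-analysis": [(51909, 10.0), (14831, 100.0)],
}


def rank_stations(stations):
    """Return (hours, station) pairs, fastest estimated transfer first."""
    return sorted((transfer_hours(src), name) for name, src in stations.items())


for hours, name in rank_stations(STATIONS):
    print(f"{name}: {hours:.1f} hrs")
```

Ranking on transfer time alone would send the job to central-analysis, which is why the slide's closing remark matters: a broker also has to check for free CPUs before committing to a site.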
Conclusions
  SAM production system
    Heavy and increasing D0 use. Fine tuning.
    CDF deployment: no show stoppers
  SAM-Grid taking shape
    Monitoring & Information prototype available
    GridFTP pre-deployment tests. System failed me.
    Remote job submission works.
    Condor-G enhancements allow site matching in MMS by query of the SAM replica catalogue.
  Outreach: SAM offers a unique, working example of a PP grid
    Already some interest in PP data access patterns.
    Expect more interest in real data handling & optimisation.
    The wise learn from other people's mistakes.