The SAM-Grid and the use of Condor-G as a grid job management middleware Gabriele Garzoglio for the SAM-Grid Team Fermilab, Computing Division.


1 The SAM-Grid and the use of Condor-G as a grid job management middleware Gabriele Garzoglio for the SAM-Grid Team Fermilab, Computing Division

2 Gabriele Garzoglio Apr 16, 2004 Overview: Computation in High Energy Physics; The SAM-Grid computing infrastructure; The Job Management and Condor-G; Real life experience; Future work

3 High Energy Physics Challenges. High Energy Physics studies the fundamental interactions of Nature. A few laboratories around the world each provide unique facilities (accelerators) to study particular aspects of the field, so the collaborations are geographically distributed. Experiments become more challenging and expensive every decade, so the collaborations are large groups of people. The phenomena studied are statistical in nature and the events of interest are very rare, so a lot of data (statistics) is needed.

4 A HEP laboratory: Fermilab

5 FNAL Run II detectors

6 DZero FNAL Run II detectors

7 The Size of the D0 Collaboration: ~500 physicists, 72 institutions, 18 countries. [Map: DZero and CDF institutions]

8 Data size for the D0 Experiment. Detector data: 1,000,000 channels; event size 250 KB; event rate ~50 Hz; on-line data rate 12 MBps; 100 TB/year. Total data (detector, reconstructed, simulated): 400 TB/year.
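The detector figures on this slide can be cross-checked with simple arithmetic. The sketch below assumes decimal units (1 KB = 1000 bytes); the "implied live time" is an inference from the slide's numbers, not a figure stated in the talk.

```python
# Figures taken from the slide; decimal units are an assumption.
EVENT_SIZE_BYTES = 250_000     # 250 KB per event
EVENT_RATE_HZ = 50             # ~50 events per second
YEARLY_VOLUME_BYTES = 100e12   # 100 TB/year of detector data


def online_rate_mb_per_s(event_size_bytes, rate_hz):
    """On-line data rate in MB/s: event size times event rate."""
    return event_size_bytes * rate_hz / 1e6


def implied_live_days(yearly_bytes, event_size_bytes, rate_hz):
    """Days of data taking per year needed to reach the yearly volume."""
    seconds = yearly_bytes / (event_size_bytes * rate_hz)
    return seconds / 86_400


rate = online_rate_mb_per_s(EVENT_SIZE_BYTES, EVENT_RATE_HZ)
days = implied_live_days(YEARLY_VOLUME_BYTES, EVENT_SIZE_BYTES, EVENT_RATE_HZ)
print(f"on-line rate: {rate} MB/s, implied live time: {days:.1f} days/year")
```

The product 250 KB x 50 Hz gives 12.5 MB/s, which the slide quotes as 12 MBps.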

9 Typical DZero activities

10 Overview: Computation in High Energy Physics; The SAM-Grid computing infrastructure; The Job Management and Condor-G; Real life experience; Future work

11 The SAM-Grid Project. Mission: enable fully distributed computing for DZero and CDF. Strategy: enhance the distributed data handling system of the experiments (SAM), incorporating standard Grid tools and protocols, and developing new solutions for Grid computing (JIM). History: SAM from 1997, JIM from the end of 2001. Funds: the Particle Physics Data Grid (US) and GridPP (UK). People: computer scientists and physicists from Fermilab and the collaborating institutions.

13 Overview: Computation in High Energy Physics; The SAM-Grid computing infrastructure; The Job Management and Condor-G; Real life experience; Future work

14 Job Management: Requirements. Foster site autonomy. Operate in batch mode: submit and disconnect. Reliability: handle the job request persistently; execute it and retrieve output and/or errors. Flexible automatic resource selection: optimization of various metrics/policies. Fault tolerance: transient service disruption; automatic rematching and resubmitting capabilities. Automatic execution of complex interdependent job structures.

15 Service Architecture [diagram; labels as recovered from the slide]: Site; Resource Selector; Info Collector; Info Gatherer; Match Making; User Interface; Submission; Global Job Queue; Grid Client; Global DH Services; SAM Naming Server; SAM Log Server; Resource Optimizer; SAM DB Server; RCMetaData Catalog; Bookkeeping Service; SAM Stager(s); SAM Station (+other servs); Data Handling; Worker Nodes; Grid Gateway; Local Job Handler (CAF, D0MC, BS,...); JIM Advertise; Local Job Handling; Cluster; AAA; Dist.FS; Info Manager; XML DB server; Site Conf.; Glob/Loc JID map...; Info Providers; MDS; MSS; Cache; Site; Web Serv; Grid Monitoring; User Tools. Legend: flow of job, data, meta-data.

16 Technological choices (2001). Low level resource management: Globus GRAM. Clearly not enough... Condor-G: right components and functionalities, but not enough in 2001... DZero and the Condor Team have been collaborating since then, under the auspices of PPDG, to address the requirements of a large distributed system with distributively owned and shared resources.

17 Condor-G: added functionalities I. Use of the Condor Match Making Service (MMS) as Grid Resource Selector: advertisement of grid site capabilities to the MMS; dynamic $$(gatekeeper) selection for jobs specifying requirements on grid sites. Concurrent submission of multiple jobs to the same grid resource: at any given moment, a grid site is capable of accepting up to N jobs; the MMS was modified to push up to N jobs to the same site in the same negotiation cycle.
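A minimal sketch of the "up to N jobs per site per negotiation cycle" idea follows. It is not the Condor negotiator; the dictionary layout, the `capacity` field, and the second gatekeeper address (`gk.example.org`) are hypothetical and only illustrate the bookkeeping.

```python
def negotiate(jobs, sites):
    """One simplified negotiation cycle.

    Each site advertises a capacity (the N of the slide). A job is matched to
    the first site that satisfies its requirements and still has free slots.
    Returns {job id: gatekeeper URL}; unmatched jobs wait for the next cycle.
    """
    remaining = {site["name"]: site["capacity"] for site in sites}
    matches = {}
    for job in jobs:
        for site in sites:
            if remaining[site["name"]] > 0 and job["requirements"](site):
                matches[job["id"]] = site["gatekeeper_url_"]
                remaining[site["name"]] -= 1
                break
    return matches


# Illustrative input: two sites, four jobs that accept any site.
sites = [
    {"name": "lyon", "gatekeeper_url_": "ccd0.in2p3.fr:2119/jobmanager-runjob", "capacity": 2},
    {"name": "site-b", "gatekeeper_url_": "gk.example.org:2119/jobmanager", "capacity": 1},
]
jobs = [{"id": i, "requirements": lambda site: True} for i in range(4)]
print(negotiate(jobs, sites))
```

With these inputs two jobs go to the first site, one to the second, and the fourth stays unmatched until a later cycle.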

18 Condor-G: added functionalities II. Flexible Match Making logic: the job/resource match criteria should be arbitrarily complex (based on more info than what fits in the classad), stateful (remembering match history), and "pluggable" (by administrators and users). Example: send the job where most of the data are; the MMS contacts the site data handling service to rank a job/site match. This leads to a very thin and flexible "grid broker".
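The "send the job where most of the data are" example can be sketched as a pluggable rank function. In this sketch an in-memory set of cached file names stands in for the site data handling service; the function names and the data layout are assumptions, not the SAM-Grid interface.

```python
def cached_fraction(job_files, cached_files):
    """Rank: fraction of the job's input files already cached at a site."""
    wanted = set(job_files)
    if not wanted:
        return 0.0
    return len(wanted & set(cached_files)) / len(wanted)


def pick_site(job_files, site_caches, rank=cached_fraction):
    """Return the site with the highest rank.

    `rank` is the pluggable part: any callable taking (job_files, cached_files)
    can replace the default. Ties are broken alphabetically by site name.
    """
    return max(sorted(site_caches), key=lambda name: rank(job_files, site_caches[name]))


site_caches = {"lyon": {"a", "b"}, "manchester": {"c"}}
print(pick_site(["a", "b", "c"], site_caches))
```

Keeping the rank function outside the matchmaker is what makes the broker "thin": the site-specific knowledge lives in the plugged-in logic.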

19 Condor-G: added functionalities III. Light clients: a user should be able to submit a job from a laptop and turn it off. Client software (condor_submit, etc.) and queuing service (condor_schedd) should be on different machines. This leads to a 3-tier architecture for Condor-G: client, queuing, execution sites. Security was implemented via X509.

20 Condor-G: added functionalities IV. Resubmission/Rematching logic: if the MMS matched a job to a site that cannot accept it after N submission attempts, the job should be rematched to a different site. Flexible penalization of already failed matches.
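The rematching behaviour described on this slide can be sketched as a small loop. The additive penalty, the parameter names, and the `submit` callback are assumptions made for illustration; the talk does not specify how penalization was implemented.

```python
def run_with_rematch(base_rank, submit, tries_per_site=3, penalty=1.0, max_matches=5):
    """Match a job, try the submission, and rematch on repeated failure.

    base_rank: {site name: rank} as produced by the matchmaker.
    submit:    callable(site) -> bool, True when the site accepts the job.
    A site that fails `tries_per_site` times has `penalty` subtracted from its
    rank, so the next match prefers other sites without excluding it forever.
    Returns (accepting site or None, list of sites matched in order).
    """
    penalties = {site: 0.0 for site in base_rank}
    history = []
    for _ in range(max_matches):
        site = max(sorted(base_rank), key=lambda s: base_rank[s] - penalties[s])
        history.append(site)
        if any(submit(site) for _ in range(tries_per_site)):
            return site, history
        penalties[site] += penalty
    return None, history


# Illustrative run: the preferred site always refuses the job.
print(run_with_rematch({"lyon": 2.0, "manchester": 1.5}, lambda s: s == "manchester"))
```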

21 Overview: Computation in High Energy Physics; The SAM-Grid computing infrastructure; The Job Management and Condor-G; Real life experience; Future work

22 Job flow [diagram; labels as recovered from the slide]: JOB; User Interface; Submission Client; Job Management; Queuing System; Broker; Match Making Service; Information Collector; ext. logic; Execution Site #1 ... Execution Site #n; Computing Element; Storage Element; Data Handling System; Grid Sensors.

Site ClassAd:
    MyType "Machine"
    TargetType "Job"
    Name "ccin2p3-analysis.d0.prd.jobmanager-runjob"
    gatekeeper_url_ "ccd0.in2p3.fr:2119/jobmanager-runjob"
    DbURL "http://ccd0.in2p3.fr:7080/Xindice"
    sam_nameservice_ "IOR:000000000000002a49444c3........."
    station_name_ "ccin2p3-analysis"
    station_experiment_ "d0"
    station_universe_ "prd"
    cluster_architecture_ "Linux+2.4"
    cluster_name_ "LyonsGrid"
    local_storage_path_ "/samgrid/disk"
    local_storage_node_ "ccd0.in2p3.fr"
    schema_version_ "1_1"
    site_name_ "ccin2p3"
    ...

Job ClassAd:
    MyType "Job"
    TargetType "Machine"
    ClusterId 304
    JobType "montecarlo"
    GlobusResource "$$(gatekeeper_url_)"
    Requirements (TARGET.station_name_ == "ccin2p3-analysis" &&...)
    Rank 0.000000
    station_univ "prd"
    station_ex "d0"
    RequestId "11866"
    ProjectId "sam_ccd0_012457_25321_0"
    DbURL "$$(DbURL)"
    cert_subject "/DC=org/DC=doegrids/OU=People/CN=Aditya Nishandar..."
    Env "MATCH_RESOURCE_NAME=$$(name);\
         SAM_STATION=$$(station_name_);\
         SAM_USER_NAME=aditya;..."
    Args "--requestId=11866" "--gridId=sam_ccd0_012457"......

Job description:
    job_type = montecarlo
    station_name = ccin2p3-analysis
    runjob_requestid = 11866
    runjob_numevts = 10000
    d0_release_version = p14.05.01
    jobfiles_dataset = san_jobset2
    minbias_dataset = ccin2p3_minbias_dataset
    sam_experiment = d0
    sam_universe = prd
    group = test
    instances = 1
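The `$$(...)` references in the job ClassAd are filled in from the matched site ClassAd. The sketch below imitates that substitution on plain Python dictionaries; it is a simplified stand-in, not Condor code. Attribute lookup is made case-insensitive so that `$$(name)` resolves to `Name`, and leaving unresolved references untouched is a choice of this sketch rather than documented Condor behaviour.

```python
import re

# Matches $$(attribute) and captures the attribute name.
_DOLLAR_DOLLAR = re.compile(r"\$\$\(([^)]+)\)")


def expand(job_ad, machine_ad):
    """Return a copy of job_ad with $$(attr) replaced by machine_ad values."""
    lookup = {key.lower(): value for key, value in machine_ad.items()}

    def replace(match):
        # Unresolved references are left as-is (an assumption of this sketch).
        return str(lookup.get(match.group(1).lower(), match.group(0)))

    return {
        key: _DOLLAR_DOLLAR.sub(replace, value) if isinstance(value, str) else value
        for key, value in job_ad.items()
    }


# Attribute values copied from the slide; only a subset is used here.
machine_ad = {
    "Name": "ccin2p3-analysis.d0.prd.jobmanager-runjob",
    "gatekeeper_url_": "ccd0.in2p3.fr:2119/jobmanager-runjob",
    "station_name_": "ccin2p3-analysis",
}
job_ad = {
    "ClusterId": 304,
    "GlobusResource": "$$(gatekeeper_url_)",
    "Env": "MATCH_RESOURCE_NAME=$$(name);SAM_STATION=$$(station_name_)",
}
print(expand(job_ad, machine_ad))
```

This is why the job can be written without naming a gatekeeper: the destination is only known once the Match Making Service has chosen a site.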

23 Montecarlo Production Statistics. Started at the beginning of 2004; ramped up in March. 3 sites: Wisconsin (...via Miron), Manchester, Lyon. New sites are joining (UTA, LU, OU, LTU,...). Inefficiency due to the Grid infrastructure << 5%. 30 GB/week = 80,000 events/week (about 1/4 of total production).
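Two quantities follow from the production figures on this slide by simple division. Both are inferences, assuming decimal units and taking "about 1/4" as exactly 0.25; neither number is stated in the talk.

```python
# Figures taken from the slide; decimal units are an assumption.
WEEKLY_VOLUME_BYTES = 30e9   # 30 GB/week produced on the grid
WEEKLY_EVENTS = 80_000       # 80,000 events/week produced on the grid
GRID_SHARE = 0.25            # "about 1/4 of total production"


def bytes_per_event(volume_bytes, events):
    """Average output size per simulated event."""
    return volume_bytes / events


def implied_total_events(grid_events, grid_share):
    """Total weekly production implied by the grid's share."""
    return grid_events / grid_share


print(bytes_per_event(WEEKLY_VOLUME_BYTES, WEEKLY_EVENTS) / 1e3, "KB/event")
print(implied_total_events(WEEKLY_EVENTS, GRID_SHARE), "events/week in total")
```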

24 Overview: Computation in High Energy Physics; The SAM-Grid computing infrastructure; The Job Management and Condor-G; Real life experience; Future work

25 Future work of DZero with Condor. Use of DAGMan to automate the management of interdependent grid job structures. Address potential scalability limits. Investigate non-central brokering service via grid flocking. Integrate/implement a proxy management infrastructure (e.g. MyProxy). All the rest (...fix bugs, improve error reporting, hand holding, sailing...).
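The core idea behind managing interdependent job structures is to release a job only after all the jobs it depends on have finished. The sketch below illustrates that ordering with Python's standard `graphlib` module (Python 3.9+); it does not use DAGMan, and the job names are hypothetical.

```python
from graphlib import TopologicalSorter


def execution_waves(dependencies):
    """Group jobs into waves that can run concurrently.

    dependencies: {job: set of jobs it depends on}.
    Every job in a wave has all of its dependencies in earlier waves.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)  # mark the whole wave as finished
    return waves


# Hypothetical structure: one generation job, two simulations, one merge.
dependencies = {
    "simulate_a": {"generate"},
    "simulate_b": {"generate"},
    "merge": {"simulate_a", "simulate_b"},
}
print(execution_waves(dependencies))
```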

26 Conclusions. The collaboration between DZero and the Condor team has been very fruitful since 2001. DZero has worked together with Condor to enhance the Condor-G framework, in order to address the requirements on distributed computing of a large HEP experiment. DZero is running "production" jobs on the Grid.

27 Acknowledgments: Condor Team, PPDG, DZero, CDF

28 More info at: http://www-d0.fnal.gov/computing/grid/ http://samgrid.fnal.gov:8080/ http://d0db.fnal.gov/sam/

