SAM Job Submission. What is SAM? sam submit …… Data Management Details. Conclusions. Rod Walker, 10th May, GridPP, Manchester.


1 SAM Job Submission. What is SAM? sam submit …… Data Management Details. Conclusions. Rod Walker, 10th May, GridPP, Manchester.

2 What is SAM? SAM is Sequential data Access via Meta-data. The project started in 1997 to handle D0's needs for the Run II data system. Current SAM team includes: Andrew Baranovski, Lauri Loebel-Carpenter, Gabriele Garzoglio, Chris Jozwiak, Lee Lueking*, Carmenita Moore, Igor Terekhov, Julie Trumbo, Sinisa Veseli, Matthew Vranicar, Stephen P. White, Victoria White*. (*project leaders) http://d0db.fnal.gov/sam

3 SAM is a Distributed System. [Diagram components: Database Server(s) (Central Database), Name Server, Global Resource Manager(s), Log server, Station 1 Servers, Station 2 Servers, Station 3 Servers, Station n Servers, Mass Storage System(s). Region labels: Shared Globally, Local, Shared Locally. Arrows indicate control and data flow.]

4 Job Submission. Executable: –Runtime environment: executable and associated files (user specific); experiment environment. Data: –Dataset definition: select by metadata. Converted to LFNs at submit time, i.e. a dataset's contents can change. Build SQL query … then … execute query.
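The "build SQL query, then execute query" step can be illustrated with a small sketch. Everything here is an assumption made for illustration: the `files` table, its columns, and the helper names are hypothetical and are not the SAM schema, and SQLite stands in for the real database. The sketch shows a dataset definition stored as metadata constraints and resolved to LFNs only at submit time, so the same definition can return a different file list later.

```python
import sqlite3

# Hypothetical metadata columns; the real SAM schema is not shown in the slides.
ALLOWED_COLUMNS = {"data_tier", "run_number", "version"}


def build_query(constraints):
    """Build a parameterised SELECT from {column: value} metadata constraints."""
    if not constraints:
        raise ValueError("a dataset definition needs at least one constraint")
    columns = sorted(constraints)
    for column in columns:
        if column not in ALLOWED_COLUMNS:
            raise ValueError(f"unknown metadata column: {column}")
    where = " AND ".join(f"{column} = ?" for column in columns)
    sql = f"SELECT lfn FROM files WHERE {where} ORDER BY lfn"
    return sql, [constraints[column] for column in columns]


def resolve_dataset(conn, constraints):
    """Run the query now and return the matching LFNs (the submit-time snapshot)."""
    sql, params = build_query(constraints)
    return [row[0] for row in conn.execute(sql, params)]


# Toy catalogue with made-up file names.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE files (lfn TEXT PRIMARY KEY, data_tier TEXT, "
    "run_number INTEGER, version TEXT)"
)
conn.executemany(
    "INSERT INTO files VALUES (?, ?, ?, ?)",
    [
        ("run151193_021.raw", "raw", 151193, "p10.15.01"),
        ("run153170_012.raw", "raw", 153170, "p10.15.01"),
    ],
)

definition = {"data_tier": "raw", "run_number": 153170}
first_submit = resolve_dataset(conn, definition)

# A new file matching the definition is registered between two submissions.
conn.execute(
    "INSERT INTO files VALUES (?, ?, ?, ?)",
    ("run153170_021.raw", "raw", 153170, "p10.15.01"),
)
second_submit = resolve_dataset(conn, definition)

print(first_submit)
print(second_submit)
```

Storing the definition rather than the file list is what makes the dataset "change": the second submission picks up the newly registered file without the user editing anything.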

5 Dataset

6 Job Running & Job Control. Components: Client; Local SM (Station Master); Batch System; Process Manager (SAM wrapper script); User Task; Job Manager (Project Master). Message flow: 1. sam submit –defname=mydata –script=myexe; 2. submit to SM; 3. invoke; 4. submit to BS; 5. submission ok; 6. start job; 7. started; 8. invoke; 9. setJobCount/stop; 10. resubmit. Other labels: jobEnd; (Run this exe | on this data).

7 User exe and job control. The user exe calls getNextFile(); job control replies: "Here's the path to a local file: /sam/cache1/boo/mydata1.dat". [Diagram labels: User exe, Job control, WaitFinished, Replica Catalogue, LFN, PFN, Stager, Fetch PFN, BS, Release, steps 1 to 4, Physics & wrapper.]
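The division of labour on this slide (the executable only asks for the next file, job control hands back a local path) can be sketched as a simple loop. This is a toy model under stated assumptions: `ToyJobManager`, `get_next_file`, `release`, and `run_user_task` are hypothetical names, not the SAM client API, and no real staging happens.

```python
class ToyJobManager:
    """Stand-in for job control: hands out local paths one file at a time."""

    def __init__(self, lfns, cache_dir):
        self.pending = list(lfns)
        self.cache_dir = cache_dir
        self.released = []

    def get_next_file(self):
        """Return the local path of the next file, or None when the dataset is done."""
        if not self.pending:
            return None
        lfn = self.pending.pop(0)
        # A real system would look up the LFN and stage the file here.
        return f"{self.cache_dir}/{lfn}"

    def release(self, path):
        """Tell job control the file is no longer in use by this job."""
        self.released.append(path)


def run_user_task(manager, process):
    """The user exe: it knows nothing about where files come from."""
    processed = 0
    while True:
        path = manager.get_next_file()
        if path is None:
            break
        try:
            process(path)
        finally:
            manager.release(path)
        processed += 1
    return processed


seen = []
manager = ToyJobManager(["mydata1.dat", "mydata2.dat"], "/sam/cache1/boo")
count = run_user_task(manager, seen.append)
print(count, seen)
```

Releasing in a `finally` block is a design choice of this sketch: it lets the cache reclaim a file even when the physics code fails on it.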

8 Data Management Replica Catalogue Replication Cache Management

9 Replica Catalogue. Combined with metadata in an Oracle database, although logically distinct. –Query on metadata to create a dataset, a list of LFNs: experiment specific (D0/CDF). –Query on LFN to locate the physical file: generic replica catalogue, e.g. node:/path/to/cache/myfile.dat
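The second, generic query (LFN to physical location) amounts to a lookup that returns zero or more `node:/path` entries. A minimal sketch follows, assuming an in-memory dictionary in place of the Oracle tables; the catalogue contents and the names `split_pfn` and `locate` are invented for illustration.

```python
def split_pfn(pfn):
    """Split 'node:/path/to/file' into (node, path)."""
    node, sep, path = pfn.partition(":")
    if not sep or not node or not path.startswith("/"):
        raise ValueError(f"not a node:/path location: {pfn!r}")
    return node, path


# Hypothetical catalogue: one LFN may have several physical replicas.
REPLICAS = {
    "myfile.dat": [
        "node:/path/to/cache/myfile.dat",
        "othernode:/another/cache/myfile.dat",
    ],
}


def locate(lfn, catalogue=REPLICAS):
    """Return every known (node, path) replica of a logical file name."""
    return [split_pfn(pfn) for pfn in catalogue.get(lfn, [])]


print(locate("myfile.dat"))
print(locate("unknown.dat"))
```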

10 Replica Catalogue. 600,000 files, increasing at 3000/day; 120 TB. 150,000 in cache. 5000 files per day replicated, 5000 destroyed. Half a million queries per day (90% SELECT).

11 Cache Management. 13.6 TB, in several hundred individually managed caches. 1 TB in and out per day (10k files). Cache lifetime ~10 days. Various prescriptions for cache replacement, e.g. first in first out, last use. 70% hit rate (~6000 files/day).
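To make the replacement prescriptions concrete, here is a toy cache that supports first-in-first-out and a least-recently-used policy, and reports its hit rate. Reading "last use" as least-recently-used is an assumption, as are the class and method names; capacity is counted in files rather than bytes to keep the sketch short.

```python
from collections import OrderedDict


class FileCache:
    """Toy file cache with 'fifo' or 'lru' replacement (capacity in files)."""

    def __init__(self, capacity, policy="lru"):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        if policy not in ("fifo", "lru"):
            raise ValueError(f"unknown policy: {policy}")
        self.capacity = capacity
        self.policy = policy
        self.entries = OrderedDict()  # oldest entry first
        self.hits = 0
        self.misses = 0

    def request(self, name):
        """Record a request; return True on a cache hit, False on a miss."""
        if name in self.entries:
            self.hits += 1
            if self.policy == "lru":
                # A hit refreshes the entry only under LRU.
                self.entries.move_to_end(name)
            return True
        self.misses += 1
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)  # evict the oldest entry
        self.entries[name] = True
        return False

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


requests = ["a", "b", "a", "c", "a"]
lru = FileCache(2, "lru")
fifo = FileCache(2, "fifo")
for name in requests:
    lru.request(name)
    fifo.request(name)

print(lru.hit_rate(), fifo.hit_rate())
```

On this short request sequence the LRU cache keeps the repeatedly used file `a` and scores two hits, while FIFO evicts it and scores one; which policy wins in practice depends on the access pattern.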

12 Replication. Easy: use your favourite ftp. BUT …… what could go wrong? –Cache space: cache management. –Network, dead node, corrupted file: retries. –Dead disk, uncached: fail-over. –Sluggish robot, slow delivery: hold job. A stroll through my log file.
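The retry and fail-over items can be sketched as two nested loops: retry the same source a bounded number of times, then mark it as avoided and move to the next replica. This is a minimal sketch, not the station's actual logic; the function name, the `copy` callback, and the default of two attempts per source are assumptions.

```python
import time


def deliver(lfn, locations, copy, max_attempts=2, retry_delay=0.0):
    """Try each replica location in turn, retrying each one a few times.

    `copy(source, lfn)` is a caller-supplied function returning True on success.
    Returns (source_used, avoided_locations); raises if every source fails.
    """
    avoided = []
    for source in locations:
        for attempt in range(1, max_attempts + 1):
            if copy(source, lfn):
                return source, avoided
            if attempt < max_attempts:
                time.sleep(retry_delay)  # back off before retrying this source
        # Give up on this source and fail over to the next location.
        avoided.append(source)
    raise RuntimeError(f"no usable replica for {lfn}; avoided {avoided}")


attempts = []


def fake_copy(source, lfn):
    """Pretend transfer: the disk cache refuses connections, tape works."""
    attempts.append(source)
    return source.startswith("enstore:")


locations = [
    "cab:d0cs015.fnal.gov:/sam/cache/boo",
    "enstore:/pnfs/sam/dzero/copy1/datalogger/initial_runs/d0farm/reco/all",
]
used, avoided = deliver("somefile.raw", locations, fake_copy)
print(used)
print(avoided)
```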

13 05/07/02 16:01:52 imperial-test.SM.imperial-test 11698: Delivery status: Simple Status:
  Code: delivery error (Category SAM Internal)
  Severity level: ERROR
  Generated on 07 May 16:01:51 by eworker
  In the context: executed process samcp cab:d0cs015.fnal.gov:/sam/cache/boo/reco_all_0000151193_021.raw_p10.15.01_000 imperial-test:d0mino.fnal.gov:/sam/cache20/lancs/boo, result:
  EXIT CODE: 256
  STDOUT: Executing Kerberos rcp: /usr/krb5/bin/rcp d0cs015.fnal.gov:/sam/cache/boo/reco_all_0000151193_021.raw_p10.15.01_000 /sam/cache20/lancs/boo
  STDERR: kshd: Logins currently disabled. trying normal rcp (/usr/bsd/rcp) WARNING: NO ENCRYPTION! d0cs015.fnal.gov: Connection refused, method name: samcp
  Recommended action: Please contact sam-admin@fnal.gov
05/07/02 16:01:52 imperial-test.SM.imperial-test 11698: Delivery failed, scheduling retry in 3 seconds
[Annotation: Retry]

14 05/07/02 16:02:35 imperial-test.SM.imperial-test 11698: Delivery status: Simple Status:
  Code: delivery error (Category SAM Internal)
  Severity level: ERROR
  Generated on 07 May 16:02:35 by eworker
  In the context: executed process samcp cab:d0cs015.fnal.gov:/sam/cache/boo/reco_all_0000151193_021.raw_p10.15.01_000 imperial-test:d0mino.fnal.gov:/sam/cache20/lancs/boo, result:
  EXIT CODE: 256
  STDOUT: Executing Kerberos rcp: /usr/krb5/bin/rcp d0cs015.fnal.gov:/sam/cache/boo/reco_all_0000151193_021.raw_p10.15.01_000 /sam/cache20/lancs/boo
  STDERR: kshd: Logins currently disabled. trying normal rcp (/usr/bsd/rcp) WARNING: NO ENCRYPTION! d0cs015.fnal.gov: Connection refused, method name: samcp
  Recommended action: Please contact sam-admin@fnal.gov
05/07/02 16:02:35 imperial-test.SM.imperial-test 11698: Maximum number of retrials exceeded. Will not retry again from this source!
05/07/02 16:02:35 imperial-test.SM.Repler 11698: Will avoid locations: (cab:d0cs015.fnal.gov:/sam/cache/boo)
05/07/02 16:02:35 imperial-test.SM.Repler 11698: No loc is preferred, selecting enstore:/pnfs/sam/dzero/copy1/datalogger/initial_runs/d0farm/reco/all(prl733.24)
[Annotations: Give up on this source. Avoid this location. Get another location from RC, and retry.]

15 05/07/02 16:10:53 imperial-test.SM.imperial-test 11698: Delivery status: Simple Status:
  Code: OK (Category Enstore)
  Severity level: SUCCESS
  Generated on 07 May 16:10:53 by eworker
  In the context: executed process samcp enstore:/pnfs/sam/dzero/copy1/datalogger/initial_runs/d0farm/reco/all/reco_all_0000153170_021.raw_p10.15.01_000 imperial-test:d0mino.fnal.gov:/sam/cache20/lancs/boo, result:
  EXIT CODE: 0
  STDOUT: INFILE=/pnfs/sam/dzero/copy1/datalogger/initial_runs/d0farm/reco/all/reco_all_0000153170_021.raw_p10.15.01_000 OUTFILE=/sam/cache20/lancs/boo FILESIZE=1369320147 LABEL=PRL859 LOCATION=0000_000000000_0000067 DRIVE=d0enmvr9a:/dev/rmt/tps0d1n DRIVE_SN=4560020042 TRANSFER_TIME=160.38 SEEK_TIME=73.47 MOUNT_TIME=25.36 QWAIT_TIME=65.79 TIME2NOW=329.78 STATUS=ok
  STDERR: Completed transferring 1369320147 bytes in 1 files in 329.720216036 sec. Overall rate = 3.96 MB/sec. Drive rate = 8.14 MB/sec. Network rate = 8.13 MB/sec. Exit status
[Annotation: Got it]
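The rates in this log entry can be checked from the STDOUT fields. The sketch below parses the `KEY=value` pairs and recomputes two of the reported figures. That the overall rate is bytes over total elapsed time, that the drive rate is bytes over TRANSFER_TIME, and that "MB" means 2^20 bytes are inferences from the numbers (they reproduce the logged values), not documented definitions.

```python
# Subset of the STDOUT fields and the elapsed time copied from the log entry.
stdout = (
    "FILESIZE=1369320147 TRANSFER_TIME=160.38 SEEK_TIME=73.47 "
    "MOUNT_TIME=25.36 QWAIT_TIME=65.79 TIME2NOW=329.78 STATUS=ok"
)
elapsed_sec = 329.720216036  # from the STDERR "Completed transferring" line

# Split "KEY=value" tokens into a dictionary.
fields = dict(token.split("=", 1) for token in stdout.split())

MB = 1024 * 1024  # assumption: the tool reports binary megabytes
size_mb = int(fields["FILESIZE"]) / MB

overall_rate = size_mb / elapsed_sec
drive_rate = size_mb / float(fields["TRANSFER_TIME"])

# Time spent before any data moved: queue wait, tape mount, and seek.
overhead_sec = sum(
    float(fields[key]) for key in ("QWAIT_TIME", "MOUNT_TIME", "SEEK_TIME")
)
overhead_fraction = overhead_sec / float(fields["TIME2NOW"])

print(f"overall {overall_rate:.2f} MB/sec, drive {drive_rate:.2f} MB/sec")
print(f"overhead {overhead_sec:.2f} sec ({overhead_fraction:.0%} of TIME2NOW)")
```

Under these assumptions about half of the elapsed time goes to queueing, mounting, and seeking on tape, which is why the overall rate is roughly half the drive rate.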

16 05/07/02 15:46:09 imperial-test.SM.PBS BS Adapter 11698: Remembering that job 1760.gw39.hep.ph.ic.ac.uk for project 61983_sam_ is held
--------------------------
05/07/02 16:00:56 imperial-test.SM.imperial-test 11698: Delivery status: Simple Status:
  Code: OK (Category Enstore)
  Severity level: SUCCESS
  Generated on 07 May 16:00:56 by eworker
  In the context: executed process samcp enstore:/pnfs/sam/dzero/copy1/datalogger/initial_runs/d0farm/reco/all/reco_all_0000153170_012.raw_p10.15.01_000 imperial-test:d0mino.fnal.gov:/sam/cache20/lancs/boo, result:
  EXIT CODE: 0
  STDOUT: INFILE=/pnfs/sam/dzero/copy1/datalogger/initial_runs/d0farm/reco/all/reco_all_0000153170_012.raw_p10.15.01_000 OUTFILE=/sam/cache20/lancs/boo FILESIZE=788805399 LABEL=PRL829 LOCATION=0000_000000000_0000025 DRIVE=d0enmvr9a:/dev/rmt/tps0d1n DRIVE_SN=4560020042 TRANSFER_TIME=90.08 SEEK_TIME=45.05 MOUNT_TIME=27.14 QWAIT_TIME=225.50 TIME2NOW=392.28 STATUS=ok
  STDERR: Completed transferring 788805399 bytes in 1 files in 392.221878052 sec. Overall rate = 1.92 MB/sec. Drive rate = 8.35 MB/sec. Network rate = 8.35 MB/sec. Exit status = 0., method name: samcp
  Recommended action: Please contact sam-admin@fnal.gov
---------------------------
05/07/02 1
05/07/02 16:00:57 imperial-test.SM.PBS BS Adapter 11698: Will execute: qrls 1760.gw39.hep.ph.ic.ac.uk
[Annotations: Hold in queue until 1st file delivered. File arrives. Release.]

17 Conclusions. The executable is stupid: no knowledge of data transfer. The job manager does the clever stuff. SAM has a fully featured, tried and tested data management system. No GSI, GridFTP, or CondorG as yet …… but you need more than G's to make a grid!

