INFSO-RI-508833 Enabling Grids for E-sciencE www.eu-egee.org ATLAS DDM Operations - II Monitoring and Daily Tasks Jiří Chudoba ATLAS meeting, 25.9.2007,


1 ATLAS DDM Operations - II: Monitoring and Daily Tasks
Jiří Chudoba
ATLAS meeting, 25.9.2007, CNAF
Enabling Grids for E-sciencE, INFSO-RI-508833, www.eu-egee.org

2 Cloud Status
Scheduled and unscheduled downtimes
  – direct emails from sites
  – EGEE broadcasts
  – GOCDB: https://goc.gridops.org/site/
ARDA Dashboard pages
  – T0 to T1 transfers: http://dashb-atlas-data-tier0.cern.ch/dashboard/request.py/site
  – all other transfers: http://dashb-atlas-data.cern.ch/dashboard/request.py/site

3 VOBoxes at CERN
https://twiki.cern.ch/twiki/bin/view/Atlas/DistributedDataManagementARDAMachines
Separate machines for db services and site services
CNAF:
  – dq2db-cnaf: db services
  – dq2cnaf: site services for CNAF and its T2s
Access via the account ddmusr02
  – limited possibilities, check /tmp/dq2.log
Account ddmusr01 restricted to developers
  – why?
Installation done by developers

4 Panda Monitoring
Panda pages
  – DS on sites: http://gridui01.usatlas.bnl.gov:25880/server/pandamon/query?overview=dslist
  – AOD: http://gridui02.usatlas.bnl.gov:25880/server/pandamon/query?mode=listAODReplications
  – aborted DS: http://gridui02.usatlas.bnl.gov:25880/server/pandamon/query?mode=listAbortedDatasets
  – M4: http://gridui02.usatlas.bnl.gov:25880/server/pandamon/query?mode=listM4

5 More Monitoring
Stephane's overview of disk occupancy: http://lapp.in2p3.fr/atlas/Informatique/Offline/monitor_files_sites/all_sites/list_sites.html
Per data type version, DE cloud: http://www.etp.physik.uni-muenchen.de/ddm/DE/summary.html
Site status monitored by GOC (gstat)
  – http://goc.grid.sinica.edu.tw/gstat/RU-Protvino-IHEP/GIISQuery_Usage_store_.html

6 FTS monitoring
FTS 1.5
  – DE cloud: http://grid.fzk.de/monitoring/fts/transfers.html
  – SARA: http://winnetou.matrix.sara.nl/monitoring/datatransfer/
  – glite-transfer commands:
    glite-transfer-channel-list -s https://fts.grid.sara.nl:8443/glite-data-transfer-fts/services/ChannelManagement

7 Typical tasks
Errors spotted via monitoring
  – check reasons
  – contact site
  – possibly close the FTS channel
  – verify when corrected
  – open FTS channel

8 Deletion of Aborted DS
Mail sent to those responsible for each T1 cloud (usually 1 per week)
Different procedures in different clouds
  – FZK
      - Cedric's script delete_dataset_aborted.py
      - run regularly from a crontab
      - uses: dq2.deleteDatasetReplicas, dq2.deleteDatasetSubscription, dq2.listFilesInDataset, lcg-del, lcg-uf
      - list of DS read from a file
      - part of MyFrameWork: /afs/cern.ch/user/s/serfon/public/ddm/Myframework
      - will be published on Thursday

9 Deletion of Aborted DS II
SARA cloud: wrappers around dq2_cleanup
dq2_delete_aborted.sh:

#!/bin/sh
# delete aborted DS using dq2_cleanup
# start 1 dq2_cleanup instance per site
# input via parameter.
# Parameter 1: list of aborted datasets and sites
# example:
# ideal0_mc12.007042.singlepart_gamma_Et60.simul.HITS.v12003103_tid010675 ITEP
# tested from lxplus, when grid and dq2 environment was set and
# production proxy obtained like this:
#
# source /afs/cern.ch/project/gd/LCG-share/current/etc/profile.d/grid_env.sh
# voms-proxy-init -voms atlas:/atlas/Role=production -valid 96:0
# source /afs/cern.ch/atlas/offline/external/GRID/ddm/pro03/dq2.sh
SITES="SARADISK SARATAPE NIKHEF ITEP IHEP JINR SINP"
DSLIST=$1
# $SITES is left unquoted on purpose so that it splits into one word per site
for SITE in $SITES ; do
    dq2_delete_aborted_site.sh "$DSLIST" "$SITE" &
done

10 Deletion of Aborted DS II
dq2_delete_aborted_site.sh:

#!/bin/sh
# delete aborted DS from a site using dq2_cleanup
#
# Input
# parameter 1: list of aborted DS
# parameter 2: SITENAME
DSLIST=$1
SITE=$2
DQ2_CLEANUP=/afs/cern.ch/atlas/offline/external/GRID/ddm/pro03/dq2_cleanup
# basename keeps the log in the current directory even if DSLIST is given with a path
LOG="${SITE}_$(basename "$DSLIST")_$(date +%Y%m%d_%H%M).log"
touch "$LOG"
# each matching line is passed on unquoted, so all of its fields
# (dataset name, site) reach dq2_cleanup as separate arguments
grep -- "$SITE" "$DSLIST" | while read DS ; do
    $DQ2_CLEANUP $DS >>"$LOG" 2>&1
done
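
The wrapper on slide 9 starts one background job per site and returns immediately, so a failing site is only noticed by reading the logs. The sketch below shows the same fan-out, but it waits for every job and prints the sites whose job exited with a non-zero status. The function name run_per_site and its argument order are illustrative assumptions and are not part of the DQ2 tools. The worker can be any command that takes a dataset list and a site name, as dq2_delete_aborted_site.sh does.

```shell
#!/bin/sh
# Sketch only: run_per_site is a hypothetical helper, not part of DQ2.
# Usage: run_per_site WORKER DSLIST SITE [SITE ...]
# Starts "WORKER DSLIST SITE" in the background for every site, waits for
# all of them, prints the sites whose worker failed (space separated) and
# returns non-zero if any did.
run_per_site() {
    worker=$1
    dslist=$2
    shift 2
    started=""
    failed=""
    for site in "$@"; do
        "$worker" "$dslist" "$site" &
        # remember which PID belongs to which site
        started="$started $!:$site"
    done
    for entry in $started; do
        pid=${entry%%:*}
        site=${entry#*:}
        # wait returns the exit status of that particular job
        wait "$pid" || failed="$failed $site"
    done
    if [ -n "$failed" ]; then
        echo "${failed# }"
        return 1
    fi
    return 0
}
```

Example call, assuming the site script is on the PATH: run_per_site dq2_delete_aborted_site.sh aborted.txt SARADISK NIKHEF ITEP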

11 Integrity checks
Cedric's script
  – http://atlas-sw.cern.ch/cgi-bin/viewcvs-atlas.cgi/offline/Production/swing/scripts/ddm/integrity_check.py?view=log
  – some assumptions (/pnfs access)
Simple compare of dumps:

#!/bin/bash
# read files from a DPM dump and match them with an LFC dump
# DPM dump obtained by
#   select name from Cns_file_metadata where gid=1307 and filesize > 0;
DPM_DUMP=$1
LFC_DUMP=$2
FOUND=$1.found
MISS=$1.miss
while read FN FILEID; do
    # -F: match the file name literally, not as a regular expression
    if grep -qF -- "$FN" "$LFC_DUMP" ; then
        echo "$FN $FILEID" >> "$FOUND"
    else
        echo "$FN $FILEID" >> "$MISS"
    fi
done < "$DPM_DUMP"
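
The loop on this slide runs grep once per file and therefore rescans the whole LFC dump for every DPM entry, which becomes slow for large dumps. A possible alternative is sketched below: awk reads the LFC dump once into a lookup table and then classifies each DPM line with a single lookup. It rests on an assumption that the slide does not confirm: each LFC dump line is a path whose last component is the file name stored in the DPM dump. The function name compare_dumps is hypothetical.

```shell
#!/bin/sh
# Sketch only: compare_dumps is a hypothetical helper.
# Usage: compare_dumps DPM_DUMP LFC_DUMP
# Writes DPM_DUMP.found and DPM_DUMP.miss, like the script on the slide.
# ASSUMPTION: every LFC dump line is a path ending in the file name.
compare_dumps() {
    dpm_dump=$1
    lfc_dump=$2
    # start from empty output files so that both always exist
    : > "$dpm_dump.found"
    : > "$dpm_dump.miss"
    awk -v found="$dpm_dump.found" -v miss="$dpm_dump.miss" '
        # first file (LFC dump): remember the last path component
        FILENAME == ARGV[1] {
            n = split($1, parts, "/")
            seen[parts[n]] = 1
            next
        }
        # second file (DPM dump): one table lookup per line
        NF { print > (($1 in seen) ? found : miss) }
    ' "$lfc_dump" "$dpm_dump"
}
```

Unlike the substring match done by grep, this lookup requires the names to be identical, so a DPM name that is only a prefix of a longer LFC name is reported as missing.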

12 Data loss
https://twiki.cern.ch/twiki/bin/view/Main/AtlasDDMLostFiles
Only production files are treated
Get the list of lost files (provided by a sysadmin)
Remove information about lost files from the SE db (must be done by a sysadmin); see later talk
Delete lost entries from the LFC catalogue
Locate replicas of lost files
  – If they exist, consider replication to the affected SE
  – If they do not exist, remove the lost files from datasets (DQ2 db) and pass the list of really lost files to the prodsys group
DB of lost files: will be part of DQ2
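
The replica step of this procedure amounts to splitting the lost files into two lists: files that still have a replica elsewhere (candidates for replication back to the affected SE) and files with no other replica (really lost, to be removed from datasets and reported to the prodsys group). The sketch below illustrates only that split. The input formats are assumptions made for the example: a lost-file list with one file name per line, and a replica dump with one "FILE SITE" pair per line. How such a dump is obtained from the catalogue is not covered here, and the function name classify_lost_files is hypothetical.

```shell
#!/bin/sh
# Sketch only: classify_lost_files is a hypothetical helper.
# Usage: classify_lost_files LOST_LIST REPLICA_DUMP AFFECTED_SE
# ASSUMED formats: LOST_LIST has one file name per line,
#                  REPLICA_DUMP has "FILE SITE" per line.
# Writes LOST_LIST.rereplicate and LOST_LIST.really_lost.
classify_lost_files() {
    lost_list=$1
    replica_dump=$2
    affected_se=$3
    : > "$lost_list.rereplicate"
    : > "$lost_list.really_lost"
    awk -v se="$affected_se" \
        -v rerep="$lost_list.rereplicate" \
        -v lost="$lost_list.really_lost" '
        # replica dump: only replicas outside the affected SE count
        FILENAME == ARGV[1] {
            if ($2 != se) has_replica[$1] = 1
            next
        }
        # lost list: decide per file
        NF { print $1 > (($1 in has_replica) ? rerep : lost) }
    ' "$replica_dump" "$lost_list"
}
```

Whether a file on the first list is actually replicated back remains an operator decision, as the slide says "consider replication".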

13 T2 cleaning
remove_t2_in_t1.py by Stephane
  – A file is deleted if it fulfills all of the following conditions:
      - The file in the T2 is replicated in the T1DISK of the same cloud
      - The file belongs to a dataset which is not complete at the site
      - The file belongs to a dataset (with _tid) which is not subscribed to the T2 site
        (Be careful: during the DDM migration to 0.3, all subscriptions are removed. You might delete too many files until the subscriptions are put back.)
  – Since v1.4, you can provide a restricted list of datasets to be deleted (even if subscribed)
  – It first scans the LFC catalog at the Tier1 (it is possible to use a local dump of the LFC catalog), then scans the T2 entries in the LFC and deletes duplicated files on the T2 (using lcg-del)
To run:
  python remove_t2_in_t1.py LAPP LPC
or
  python remove_t2_in_t1.py LAPP LPC dataset1 dataset2
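
The deletion conditions on this slide can be read as a single yes/no decision per file. The sketch below encodes that reading as a shell predicate. It is not taken from remove_t2_in_t1.py; the function name and the yes/no arguments are illustrative. One interpretation is assumed and labelled in the code: a dataset named on the v1.4 command-line list is treated as deletable even when it is subscribed, while the other two conditions still apply.

```shell
#!/bin/sh
# Sketch only: should_delete_t2_file is a hypothetical predicate, not code
# from remove_t2_in_t1.py.
# Usage: should_delete_t2_file IN_T1DISK COMPLETE_AT_T2 SUBSCRIBED [LISTED]
# Each argument is "yes" or "no"; LISTED defaults to "no".
# Returns 0 (delete) or 1 (keep).
should_delete_t2_file() {
    in_t1disk=$1        # file also replicated in the T1DISK of the same cloud
    complete_at_t2=$2   # dataset is complete at the T2 site
    subscribed=$3       # dataset is subscribed to the T2 site
    listed=${4:-no}     # dataset was named on the command line (v1.4)
    [ "$in_t1disk" = yes ] || return 1
    [ "$complete_at_t2" = no ] || return 1
    # ASSUMPTION: an explicitly listed dataset overrides the subscription check
    [ "$subscribed" = no ] || [ "$listed" = yes ]
}
```

With the warning about the migration to 0.3 in mind, a check like this keeps nothing back while subscriptions are missing, so a dry run that only prints the candidates is a sensible first step.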

14 More scripts
https://twiki.cern.ch/twiki/bin/view/Atlas/DDMOperationsScripts
Framework in preparation

