Presentation is loading. Please wait.

Presentation is loading. Please wait.

July 26, 2007Parag Mhashilkar, Fermilab1 DZero On OSG: Site And Application Validation Parag Mhashilkar, Fermi National Accelerator Laboratory.

Similar presentations


Presentation on theme: "July 26, 2007Parag Mhashilkar, Fermilab1 DZero On OSG: Site And Application Validation Parag Mhashilkar, Fermi National Accelerator Laboratory."— Presentation transcript:

1 July 26, 2007Parag Mhashilkar, Fermilab1 DZero On OSG: Site And Application Validation Parag Mhashilkar, Fermi National Accelerator Laboratory

2 July 26, 2007Parag Mhashilkar, Fermilab2 Overview DZero & Samgrid Samgrid – OSG Job Forwarding DZero P20 Reprocessing Steps involved in starting Production On A Site Problems faced New sites Sites in steady state operation Shortcomings of the infrastructure Improving efficiency of running jobs on OSG Using OSG resources beyond P20 reprocessing.

3 July 26, 2007Parag Mhashilkar, Fermilab3 DZero & Samgrid SAMGrid (JIM + SAM) DZero’s way of using computing resources on the grid. Job Handling: Job and Information Management (JIM). Data Handling: Sequential Access via Metadata (SAM) Applications supported over the Grid- Monte Carlo Reprocessing Refixing Skimming (Beta Testing) Computing Elements Native Samgrid execution sites OSG forwarding node(s) LCG forwarding node(s) Storage Elements SAM SE SE with SRM interfaces

4 July 26, 2007Parag Mhashilkar, Fermilab4 Samgrid – OSG Job Forwarding Samgri d OSG SAM-Grid / OSG Forwarding Node Flow of Samgrid Job Flow of Local Jobs Offers Services Samgrid client, submission, broker: d0mino0x.fnal.gov Job Forwarding: d0srv015.fnal.gov d0srv047.fnal.gov d0srv066.fnal.gov OSG Sites: Fermilab, USCMS Farm, Oklahoma University, Indiana University, University of Nebraska,… SAM Services OSG Station: osg-ouhep on d0srv047.fnal.gov Storage Elements: SAM SE: ouhep00.nhn.ou.edu, d0srv015.fnal.gov, d0rsam01.fnal.gov, d0srv071.fnal.gov SAM-SRM SE: UNL, SPRACE, UW Madison Durable Location: ouhep00.nhn.ou.edu, d0srv063.fnal.gov, d0srv065.fnal.gov Samgrid Job

5 July 26, 2007Parag Mhashilkar, Fermilab5 DZero P20 Reprocessing Doing production on OSG first time on such a large scale. Process 75 TB (500 million events) of raw data in ~4 months 40 TB of output stored in SAM SE in FNAL. Computing Resources: 12 OSG Sites (FNGP Farm used for merging and not listed in the graph) 2 Samgrid sites (CCIN2P3, WESTGRID) 3 LCG sites (MANCHSTR, LANCASTR, CLERMONT) Resource Utilization: ~1200 jobs running with ~1500 idle on OSG sites

6 July 26, 2007Parag Mhashilkar, Fermilab6 Starting Production On A Site: Steps Site needs to be certified before it is considered for production. Site should satisfy following requirements to run DZero jobs - Worker nodes have outgoing network access Worker nodes have at least 6-8 GB of local storages Certification Means to verify the quality of data produced at a site. Certification jobs are production jobs run with test options. Results from certification runs compared with well known results by DZero experts. Certification could take from few days to couple of weeks. Certification jobs are run only once per site for major changes to the experiment binaries. Considerations Since the certification is fairly time consuming, it is preferable to have bigger sites rather than smaller sites. Considering the amount of data moved between the sites hosting storage elements and the Fermilab, sites with good network connectivity are preferred.

7 July 26, 2007Parag Mhashilkar, Fermilab7 Problems: New Site New site supports the DZero VO, but … VO users are not authenticated (no mapping). Account users mapped to does not exist. Account exists, but, the home directory does not exist. Site has enough scratch space on the worker nodes, but this space is not local. Example: NERSC Made changes to Samgrid infrastructure to support this. Amount of I/O activity affected the performance if several jobs started or were involved in the I/O activity at same time. Worker nodes do not have good (or see effective) bandwidth to Fermilab. Use on site SRM SE to supply data to the worker nodes. Example: UNL Use $OSG_APP to use some pre-uploaded files. Example: LONI (LTU) Resolution In past: talk to the site administrators. Not scalable as number of sites increase. Now, Open a GOC ticket. Involve ‘Troubleshooting Task Force’

8 July 26, 2007Parag Mhashilkar, Fermilab8 Problems: Steady State Operation Random Authentication/Authorization failures. Authorization policy on the site the changes. Service downtime/crashes Reporting site maintenance schedule. No initial notice given to the users. Users find this after their jobs crash. Cleanup of scratch space on the worker nodes VO jobs exiting normally should do the cleanup. If the job is killed because of site policies, who should do the cleanup? VO jobs: Job has already ended and does not have any control. Cleanup tools run by site: Are these tools available through OSG stack. Any attempts to standardize them? Discrepancy in the job status reported Globus reports job is done, CondorG reports job is idle. Affects production. Troubleshooting team investigating the problem.

9 July 26, 2007Parag Mhashilkar, Fermilab9 Problems: Shortcomings of the Infrastructure … These problems are not necessarily OSG specific problems but challenges we faced during our first massive production run on OSG. Ticketing system In the initial phase of this activity, turn around time from GOC was considerably high. This improved over the period of time. Thanks! Supplying data to thousands of worker nodes. Resolved by adding more SAM/SRM SE Implemented queues for data transfers to categorize transfers based on network type (LAN v/s WAN) and type of data. Use local storages whenever possible and available. Relies on worker nodes configured with domain names. Not all sites have worker nodes with FQDN. Resource Selection Service Not all sites advertise to the OSG ReSS making resource selection difficult. Without ReSS, automation is a challenge. Results in either under utilization or over utilization of the resources.

10 July 26, 2007Parag Mhashilkar, Fermilab10 … Problems: Shortcomings of the Infrastructure Lack of Monitoring Service Monalisa was very useful in getting a snapshot of the system like number of idle/running jobs at sites. Monalisa not supported any more. Gratia, takes VO-centric approach to determine the success and failure of the jobs. Job exist status does not fully represent failure or success. DZero jobs are successful if processed data makes it to SAM. OSG does not have good means/services for reporting such VO specific metrics. Enhancing Gratia to allow VO specific plug-in to measure the success of the jobs could be a solution. DZero had to develop a lot of in-house monitoring to overcome this shortage (via DZero-specific XML databases). Wider availability of SRM storages on the OSG Sites often offer NFS-like (POSIX) storages, but the lack of built-in load protection makes them dangerous to use. DZero had a good experience with new products on the market (Pansas at LONI). These systems are costly and not wide spread solutions. SRM storages would be a valuable alternative.

11 July 26, 2007Parag Mhashilkar, Fermilab11 Improving the efficiency on OSG Success/failure analysis based on the log file size for jobs Color Range Meaning blue 0-8k Worker node incompatibility, Lost standard output, OSG no assign aqua 8k-25k Forwarding node crash, service failure, could not start bootstrap executable. pink 25k-80k SAM problem. Could not get RTE, possibly raw files red 80k-160k SAM problem. Could not get raw files, possibly RTE gray 160k-250k Possible D0 runtime crash green >250k OK Initial weeks of production Failure analysis after rigorous troubleshooting with the help of ‘Troubleshooting Task Force’

12 July 26, 2007Parag Mhashilkar, Fermilab12 Using OSG Resources Beyond P20 Reprocessing Monte Carlo on OSG Certification and Site validation policy same as that of reprocessing. MC production on OSG sites ramping up. More OSG sites added to the list of certified sites. Overall production 7.7M MC events in the week (Jul 16 – Jul 22) with OSG production setting a weekly record of 3.1M FNAL Farm is now a part of Fermigrid and will be used to do Primary processing. Under testing phase. Using OSG resources for running other job types like skimming, CAF tree production, etc Doing Analysis on the Grid. …

13 Questions?


Download ppt "July 26, 2007Parag Mhashilkar, Fermilab1 DZero On OSG: Site And Application Validation Parag Mhashilkar, Fermi National Accelerator Laboratory."

Similar presentations


Ads by Google