DZero On OSG: Site And Application Validation
Parag Mhashilkar, Fermi National Accelerator Laboratory
July 26, 2007

Slide 2: Overview
- DZero & SAMGrid
- SAMGrid – OSG job forwarding
- DZero p20 reprocessing
- Steps involved in starting production on a site
- Problems faced: new sites, sites in steady-state operation, shortcomings of the infrastructure
- Improving the efficiency of running jobs on OSG
- Using OSG resources beyond p20 reprocessing

Slide 3: DZero & SAMGrid
- SAMGrid (JIM + SAM) is DZero's way of using computing resources on the grid.
  - Job handling: Job and Information Management (JIM)
  - Data handling: Sequential Access via Metadata (SAM)
- Applications supported over the grid: Monte Carlo, reprocessing, refixing, skimming (beta testing)
- Computing elements: native SAMGrid execution sites, OSG forwarding node(s), LCG forwarding node(s)
- Storage elements: SAM SEs, SEs with SRM interfaces

Slide 4: SAMGrid – OSG Job Forwarding
[Diagram: SAMGrid jobs flow from the SAMGrid side to a SAM-Grid / OSG forwarding node, which submits local jobs to OSG sites.]
- SAMGrid client, submission, broker: d0mino0x.fnal.gov
- Job forwarding nodes (a submission sketch follows below): d0srv015.fnal.gov, d0srv047.fnal.gov, d0srv066.fnal.gov
- OSG sites: Fermilab, USCMS Farm, Oklahoma University, Indiana University, University of Nebraska, …
- SAM services
  - OSG station: osg-ouhep on d0srv047.fnal.gov
  - Storage elements:
    - SAM SE: ouhep00.nhn.ou.edu, d0srv015.fnal.gov, d0rsam01.fnal.gov, d0srv071.fnal.gov
    - SAM-SRM SE: UNL, SPRACE, UW Madison
    - Durable location: ouhep00.nhn.ou.edu, d0srv063.fnal.gov, d0srv065.fnal.gov
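
A minimal sketch (not the actual SAMGrid forwarding code) of the idea behind the forwarding node: an incoming SAMGrid job is translated into a Condor-G submission to an OSG gatekeeper. The gatekeeper name, executable, and directory layout below are hypothetical.

    # Sketch: build a Condor-G (grid universe, pre-WS GRAM) submit description
    # for one forwarded job.  Hostnames and file names are placeholders.

    def condor_g_submit_description(gatekeeper, executable, args, out_dir):
        """Build a Condor-G submit description targeting an OSG gatekeeper."""
        return "\n".join([
            "universe      = grid",
            # gt2 = pre-WS GRAM, the interface OSG gatekeepers exposed at the time
            "grid_resource = gt2 %s/jobmanager-condor" % gatekeeper,
            "executable    = %s" % executable,
            "arguments     = %s" % " ".join(args),
            "output        = %s/job.out" % out_dir,
            "error         = %s/job.err" % out_dir,
            "log           = %s/job.log" % out_dir,
            "queue 1",
        ])

    if __name__ == "__main__":
        print(condor_g_submit_description(
            "gatekeeper.example.edu",        # hypothetical OSG gatekeeper
            "d0_reco_bootstrap.sh",          # hypothetical bootstrap executable
            ["--request-id", "12345"],
            "/scratch/jobs/12345"))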

Slide 5: DZero p20 Reprocessing
- First time doing production on OSG on such a large scale.
- Process 75 TB (500 million events) of raw data in ~4 months; 40 TB of output stored in the SAM SE at FNAL (a back-of-the-envelope rate calculation follows below).
- Computing resources:
  - 12 OSG sites (FNGP farm used for merging and not listed in the graph)
  - 2 SAMGrid sites (CCIN2P3, WESTGRID)
  - 3 LCG sites (MANCHSTR, LANCASTR, CLERMONT)
- Resource utilization: ~1200 jobs running with ~1500 idle on OSG sites
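
To make the campaign numbers above concrete, here is the sustained throughput they imply. The 120-day campaign length is an assumption standing in for "~4 months".

    # Back-of-the-envelope rates implied by the p20 reprocessing numbers:
    # 75 TB / 500 M events in ~4 months (assumed 120 days).

    raw_tb = 75.0
    events = 500e6
    days   = 120.0                                     # assumed ~4 months

    events_per_day = events / days                     # ~4.2 M events/day
    mb_per_event   = raw_tb * 1e6 / events             # ~0.15 MB of raw data/event
    avg_mb_per_s   = raw_tb * 1e6 / (days * 86400.0)   # ~7 MB/s sustained input

    print("%.1f M events/day, %.2f MB/event, %.1f MB/s sustained" %
          (events_per_day / 1e6, mb_per_event, avg_mb_per_s))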

Slide 6: Starting Production on a Site: Steps
- A site must be certified before it is considered for production.
- A site should satisfy the following requirements to run DZero jobs (a probe sketch follows below):
  - Worker nodes have outgoing network access.
  - Worker nodes have at least 6-8 GB of local storage.
- Certification
  - A means to verify the quality of data produced at a site.
  - Certification jobs are production jobs run with test options.
  - Results from certification runs are compared with well-known results by DZero experts.
  - Certification can take from a few days to a couple of weeks.
  - Certification jobs are run only once per site, and repeated only for major changes to the experiment binaries.
- Considerations
  - Since certification is fairly time consuming, bigger sites are preferable to smaller ones.
  - Given the amount of data moved between the sites hosting storage elements and Fermilab, sites with good network connectivity are preferred.
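
A minimal sketch of a worker-node probe for the two requirements listed above (outgoing network access, 6-8 GB of local scratch). The scratch location and the host used for the connectivity test are assumptions.

    # Sketch: check outbound connectivity and free local scratch on a worker node.

    import os
    import socket

    SCRATCH = os.environ.get("OSG_WN_TMP", "/tmp")   # assumed scratch location
    MIN_FREE_GB = 6.0

    def has_outbound_network(host="samgrid.fnal.gov", port=80, timeout=10):
        """Return True if the worker node can open an outgoing TCP connection."""
        try:
            socket.create_connection((host, port), timeout).close()
            return True
        except (socket.error, OSError):
            return False

    def free_scratch_gb(path=SCRATCH):
        """Free space on the local scratch area, in GB."""
        st = os.statvfs(path)
        return st.f_bavail * st.f_frsize / 1e9

    if __name__ == "__main__":
        ok_net = has_outbound_network()
        free = free_scratch_gb()
        print("outbound network: %s, free scratch: %.1f GB" % (ok_net, free))
        if not ok_net or free < MIN_FREE_GB:
            raise SystemExit("worker node does not meet DZero requirements")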

Slide 7: Problems: New Sites
- A new site supports the DZero VO, but…
  - VO users are not authenticated (no mapping).
  - The account users are mapped to does not exist.
  - The account exists, but the home directory does not.
- The site has enough scratch space on the worker nodes, but the space is not local. Example: NERSC.
  - Changes were made to the SAMGrid infrastructure to support this.
  - The amount of I/O activity hurt performance when several jobs started or did I/O at the same time.
- Worker nodes do not have good (or effective) bandwidth to Fermilab.
  - Use an on-site SRM SE to supply data to the worker nodes. Example: UNL.
  - Use $OSG_APP for pre-uploaded files (see the sketch below). Example: LONI (LTU).
- Resolution
  - In the past: talk to the site administrators. Not scalable as the number of sites increases.
  - Now: open a GOC ticket and involve the Troubleshooting Task Force.
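
A minimal sketch of the "$OSG_APP pre-uploaded files" idea above: look for a file pre-staged under the site's shared application area before falling back to a wide-area transfer. The d0/prestage directory layout and the fetch_from_sam() fallback are hypothetical.

    # Sketch: prefer a pre-staged copy under $OSG_APP over a WAN transfer.

    import os

    def locate_prestaged(filename):
        """Return the path of a pre-uploaded file under $OSG_APP, or None."""
        app_area = os.environ.get("OSG_APP")
        if not app_area:
            return None
        candidate = os.path.join(app_area, "d0", "prestage", filename)  # assumed layout
        return candidate if os.path.exists(candidate) else None

    def get_input(filename):
        path = locate_prestaged(filename)
        if path:
            return path                      # cheap: already on site-shared storage
        return fetch_from_sam(filename)      # hypothetical wide-area fallback

    def fetch_from_sam(filename):
        raise NotImplementedError("stand-in for a SAM/SRM transfer of %s" % filename)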

Slide 8: Problems: Steady-State Operation
- Random authentication/authorization failures.
  - The authorization policy on the site changes.
- Service downtime/crashes.
  - Reporting of site maintenance schedules: no advance notice is given to the users; users find out after their jobs crash.
- Cleanup of scratch space on the worker nodes (a cleanup-handler sketch follows below).
  - VO jobs exiting normally should do the cleanup.
  - If the job is killed because of site policies, who should do the cleanup?
    - VO jobs: the job has already ended and has no control.
    - Cleanup tools run by the site: are these tools available through the OSG stack? Any attempts to standardize them?
- Discrepancy in the job status reported: Globus reports the job is done while Condor-G reports it is idle. This affects production; the troubleshooting team is investigating.
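
A minimal sketch of the in-job side of the scratch-cleanup question above: a wrapper that removes its scratch directory on normal exit and on SIGTERM (the usual signal when a site policy kills a job). It cannot help if the job is killed with SIGKILL, which is exactly why site-run cleanup tools are still needed. The scratch location is an assumption.

    # Sketch: clean up local scratch on normal exit and on SIGTERM.

    import atexit
    import os
    import shutil
    import signal
    import sys
    import tempfile

    scratch = tempfile.mkdtemp(prefix="d0job-",
                               dir=os.environ.get("OSG_WN_TMP", "/tmp"))

    def cleanup():
        shutil.rmtree(scratch, ignore_errors=True)

    def on_term(signum, frame):
        cleanup()
        sys.exit(143)                      # conventional exit code for SIGTERM

    atexit.register(cleanup)               # normal exit path
    signal.signal(signal.SIGTERM, on_term) # site kills the job politely

    # ... run the actual payload inside `scratch` here ...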

Slide 9: Problems: Shortcomings of the Infrastructure…
- These are not necessarily OSG-specific problems, but challenges we faced during our first massive production run on OSG.
- Ticketing system
  - In the initial phase of this activity, the turnaround time from the GOC was considerably high. This has improved over time. Thanks!
- Supplying data to thousands of worker nodes
  - Resolved by adding more SAM/SRM SEs.
  - Implemented queues for data transfers, categorizing transfers by network type (LAN vs. WAN) and by type of data (a classification sketch follows below).
  - Use local storage whenever possible and available. This relies on worker nodes being configured with domain names, and not all sites have worker nodes with FQDNs.
- Resource Selection Service
  - Not all sites advertise to the OSG ReSS, making resource selection difficult.
  - Without ReSS, automation is a challenge; the result is either under-utilization or over-utilization of resources.
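
A minimal sketch of the LAN-vs-WAN categorization mentioned above: pick a transfer queue from the worker node's fully qualified domain name. The "local" domain suffix and queue names are illustrative assumptions; the sketch also shows why worker nodes without an FQDN are a problem for this scheme.

    # Sketch: choose a transfer queue based on the worker node's FQDN.

    import socket

    LOCAL_DOMAIN = ".fnal.gov"   # assumed "local" domain for LAN transfers

    def transfer_queue(data_kind):
        """Return a queue name of the form '<lan|wan>-<data_kind>'."""
        fqdn = socket.getfqdn()
        if "." not in fqdn:
            # Worker node not configured with a domain name: cannot tell LAN
            # from WAN, so fall back to the conservative (WAN) queue.
            return "wan-%s" % data_kind
        network = "lan" if fqdn.endswith(LOCAL_DOMAIN) else "wan"
        return "%s-%s" % (network, data_kind)

    if __name__ == "__main__":
        print(transfer_queue("raw"))     # e.g. 'wan-raw' from a remote OSG site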

Slide 10: …Problems: Shortcomings of the Infrastructure
- Lack of a monitoring service
  - MonALISA was very useful for getting a snapshot of the system, such as the number of idle/running jobs at sites, but MonALISA is not supported any more.
  - Gratia determines the success and failure of jobs from the job exit status, but the exit status does not fully represent success or failure for a VO: DZero jobs are successful only if the processed data makes it to SAM.
  - OSG does not have good means/services for reporting such VO-specific metrics. Enhancing Gratia to allow a VO-specific plug-in to measure job success could be a solution (see the sketch below).
  - DZero had to develop a lot of in-house monitoring to overcome this shortage (via DZero-specific XML databases).
- Wider availability of SRM storage on the OSG
  - Sites often offer NFS-like (POSIX) storage, but the lack of built-in load protection makes it dangerous to use.
  - DZero had a good experience with new products on the market (Panasas at LONI), but these systems are costly and not widespread; SRM storage would be a valuable alternative.
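
A minimal sketch of the VO-specific success metric described above: a job counts as successful only if its output has been declared to SAM, not merely because the batch job exited with status 0. sam_file_is_declared() is a hypothetical stand-in for a query against the SAM metadata catalogue, not a real SAM API call.

    # Sketch: VO-level success check for a DZero job.

    def dzero_job_succeeded(exit_status, output_file):
        """VO-level success: the output must have made it into SAM."""
        if exit_status != 0:
            return False
        return sam_file_is_declared(output_file)

    def sam_file_is_declared(filename):
        # Hypothetical placeholder: in reality this would query the SAM
        # catalogue for the file's metadata and storage location.
        raise NotImplementedError("query SAM metadata for %s" % filename)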

Slide 11: Improving the Efficiency on OSG
- Success/failure analysis based on the log file size of jobs (a classification sketch follows below):

  Color   Log size     Meaning
  blue    0-8 kB       Worker node incompatibility, lost standard output, OSG no-assign
  aqua    8-25 kB      Forwarding node crash, service failure, could not start bootstrap executable
  pink    25-80 kB     SAM problem: could not get RTE, possibly raw files
  red     80-160 kB    SAM problem: could not get raw files, possibly RTE
  gray    160-250 kB   Possible DZero runtime crash
  green   >250 kB      OK

- [Plots: failure breakdown during the initial weeks of production vs. after rigorous troubleshooting with the help of the Troubleshooting Task Force.]
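
A minimal sketch of the log-size classification in the table above; the size boundaries are taken from the slide, and the only thing the code does is map a log file size onto the listed categories.

    # Sketch: classify job logs by size, using the boundaries from the table.

    CATEGORIES = [   # (upper bound in bytes, label)
        (  8 * 1024, "blue: worker node incompatibility / lost stdout / OSG no-assign"),
        ( 25 * 1024, "aqua: forwarding node crash, service failure, bootstrap did not start"),
        ( 80 * 1024, "pink: SAM problem - could not get RTE, possibly raw files"),
        (160 * 1024, "red: SAM problem - could not get raw files, possibly RTE"),
        (250 * 1024, "gray: possible DZero runtime crash"),
    ]

    def classify_log(size_bytes):
        """Map a job log size onto the color categories used in the analysis."""
        for upper, label in CATEGORIES:
            if size_bytes < upper:
                return label
        return "green: OK"

    if __name__ == "__main__":
        import os, sys
        for path in sys.argv[1:]:
            print(path, "->", classify_log(os.path.getsize(path)))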

Slide 12: Using OSG Resources Beyond p20 Reprocessing
- Monte Carlo on OSG
  - Certification and site-validation policy is the same as for reprocessing.
  - MC production on OSG sites is ramping up; more OSG sites have been added to the list of certified sites.
  - Overall production of 7.7M MC events in the week of Jul 16 – Jul 22, with OSG production setting a weekly record of 3.1M.
- The FNAL farm is now part of FermiGrid and will be used for primary processing (currently in the testing phase).
- Using OSG resources to run other job types such as skimming, CAF tree production, etc.
- Doing analysis on the grid.
- …

Questions?