Experience with Globus Online at Fermilab
Gabriele Garzoglio
Computing Sector, Fermi National Accelerator Laboratory
GlobusWorld 2012, 4/12/12

Overview
- Integration of workload management and data movement systems with Globus Online (GO)
  1. Center for Enabling Distributed Petascale Science (CEDPS): GO integration with glideinWMS
  2. Data Handling prototype for the Dark Energy Survey (DES)
- Performance tests of GO over 100 Gbps networks
  3. GO on the Advanced Network Initiative (ANI) testbed
- Data movement on OSG for end users
  4. Network for Earthquake Engineering Simulation (NEES)

Fermilab's Interest in GO
- Data movement service for end users
  - Supporting user communities on the Grid
  - Evaluating GO services in the workflows of our stakeholders
- Data movement service integration
  - Evaluate GO as a component of middleware systems, e.g. the glidein Workload Management System
  - Evaluate performance of GO for exascale networks (100 GE)

1. CEDPS
- CEDPS: a five-year project funded by the Department of Energy (DOE)
- Goals
  - Produce technical innovations for rapid and dependable data placement within a distributed high-performance environment, and for the construction of scalable science services for data and computing serving many clients
  - Address performance and functionality troubleshooting of these and other related distributed activities
- Collaborative research
  - Mathematics & Computer Science Division, Argonne National Laboratory
  - Computing Division, Fermi National Accelerator Laboratory
  - Lawrence Berkeley National Laboratory
  - Information Sciences Institute, University of Southern California
  - Dept. of Computer Science, University of Wisconsin-Madison
- Collaborative work by Fermilab, Argonne National Laboratory, and the University of Wisconsin
  - Supporting the integration of data movement mechanisms with the glidein workload management system
  - Integration of asynchronous data stage-out mechanisms in overlay workload management systems

glideinWMS
- Pilot-based WMS that creates, on demand, a dynamically sized overlay Condor batch system on Grid resources to address the complex needs of VOs in running application workflows
- User communities
  - CMS
  - Communities at Fermilab: CDF, DZero, Intensity Frontier experiments (MINOS, MINERvA, NOvA, ...)
  - OSG Factory at UCSD & Indiana Univ., serving OSG VO frontends, including IceCube, Engage, LSST, ...
  - CorralWMS: frontend for the TeraGrid community
  - ATLAS: evaluating glideinWMS interfaced with the PanDA framework for their analysis framework
- User community growing rapidly

glideinWMS Scale of Operations
- [Monitoring plots] CMS production factory & frontend at CERN (upper); OSG factory & CMS analysis frontend at UCSD (lower)
- CMS production serving ~400K jobs; OSG serving ~200K jobs; CMS serving a pool with ~50K jobs; CMS analysis serving a pool with ~25K jobs

Integrating glideinWMS with GO
- Goals
  - Middleware handles data movement, rather than the application
  - Middleware optimizes use of computing resources (CPUs do not block on data movement)
- Users provide data movement directives in the Job Description File (e.g. storage services for I/O)
- glideinWMS procures resources on the Grid and runs jobs using Condor
- Data movement is delegated to the underlying Condor system
- globusconnect is instantiated and the GO plug-in is invoked using the directives in the JDF (sketched below)
- Condor optimizes resource usage
- [Architecture diagram: VO infrastructure (glideinWMS Glidein Factory / WMS pool, VO Frontend, Condor Scheduler, Condor Central Manager); Grid site worker node (glidein, Condor Startd, job); globusonline.org]
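For illustration only, here is a minimal sketch of the shape of such a Condor file-transfer plugin: an executable that Condor invokes with a source file and a destination URL taken from the job's directives, and that hands the file off to Globus Online. The `globusonline://` URL scheme, the endpoint naming, and the placeholder hand-off command are assumptions for this sketch, not the actual CEDPS plugin.

```python
#!/usr/bin/env python
# Illustrative sketch, NOT the actual CEDPS/glideinWMS plugin.
# Condor calls a file-transfer plugin with a source file and a destination URL;
# in the real integration, globusconnect exposes the worker-node sandbox as a
# dynamically created GO endpoint and the transfer is submitted to
# globusonline.org. Here the submission is a placeholder command.

import subprocess
import sys


def parse_destination(url):
    """Split an assumed 'globusonline://<endpoint>/<path>' URL into parts."""
    rest = url[len("globusonline://"):]
    endpoint, _, path = rest.partition("/")
    return endpoint, "/" + path


def submit_go_transfer(local_file, endpoint, remote_path):
    """Hand the file to Globus Online (placeholder command, kept self-contained)."""
    cmd = ["echo", "would transfer", local_file,
           "to", "%s%s" % (endpoint, remote_path)]  # placeholder, not a real GO call
    return subprocess.call(cmd)


def main():
    if len(sys.argv) != 3:
        print("usage: go_plugin <source-file> <globusonline://endpoint/path>")
        return 1
    source, dest_url = sys.argv[1], sys.argv[2]
    endpoint, path = parse_destination(dest_url)
    return submit_go_transfer(source, endpoint, path)


if __name__ == "__main__":
    sys.exit(main())
```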

Validation Test Results
- Tests: modified Intensity Frontier experiment (MINERvA) jobs to transfer the output sandbox to a GO endpoint using the transfer plugin
  - Jobs: 2,636, with 500 running at a time
  - Total files transferred:
  - Up to 500 dynamically created GO endpoints at a given time
- Lessons learned
  - Integration tests successful, with a 95% transfer success rate, stressing the scalability of GO in an unintended way
  - The GO team is working on the scalability issues identified
  - Efficiency and scalability can be increased by modifying the plugin to reuse GO endpoints and by transferring multiple files at the same time (see the sketch below)
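As a rough sketch of the batching idea in the last bullet (hypothetical function and endpoint names; not the actual plugin change), the plugin could collect all of a job's output files into a single transfer request instead of creating one GO endpoint and one transfer per file:

```python
# Illustrative sketch of batching output files into one Globus Online transfer
# request. The request structure, endpoint names, and paths are hypothetical.

def build_batch_request(files, src_endpoint, dst_endpoint, dst_dir):
    """Return one transfer request covering all of the job's output files."""
    return {
        "source": src_endpoint,
        "destination": dst_endpoint,
        "items": [(f, dst_dir + "/" + f.split("/")[-1]) for f in files],
    }


if __name__ == "__main__":
    outputs = ["sandbox/out_001.root", "sandbox/out_002.root", "sandbox/log.txt"]
    request = build_batch_request(outputs, "worker-node-endpoint",
                                  "fnal#scratch", "/scratch/minerva/run42")
    # One request, many files: the plugin would submit 'request' once and then
    # poll its status, instead of creating a new GO endpoint per file.
    print("%d files in one transfer request" % len(request["items"]))
```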

2. Prototype Integration of GO with the DES Data Access Framework
- Motivation
  - Support Dark Energy Survey preparation for data taking
  - See Don Petravick's talk on Wednesday
- The DES Data Access Framework (DAF) uses a network of GridFTP servers to reliably move data across sites
- In March 2011, we investigated the integration of DAF with GO to address two issues:
  1. DAF data transfer parameters were not optimal for both small and large files
  2. Reliability was implemented inefficiently, by sequentially verifying the real file sizes against the DB catalogue

Results and Improvements
- Tested DAF moving 31,000 files (184 GB) with GO vs. UberFTP
- Results
  - Time for transfer + verification is the same (~100 min)
  - Transfer time is 27% faster with GO than with UberFTP
  - Verification time is 50% slower with GO than sequentially with UberFTP
- Proposed improvements
  - Allow specification of source/destination transfer reliability semantics (e.g. same size, same CRC); implemented for size
  - Allow a finer-grained failure model (e.g. specify a number of transfer retries instead of a time deadline)
  - Provide an interface for efficient (pipelined) ls of source/destination files (see the sketch below)
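To illustrate why a pipelined/bulk listing matters for the verification step, here is a minimal sketch, assuming stand-in data structures for the DES catalogue and the remote listing: sizes from one bulk listing are compared against the DB catalogue in memory, instead of issuing one remote stat per file as in the sequential approach.

```python
# Illustrative sketch of batched size verification. 'catalogue' and
# 'remote_listing' are stand-ins for the real DES DB catalogue and for the
# output of a single bulk (pipelined) directory listing.

def verify_sizes(catalogue, remote_listing):
    """Compare expected sizes from the catalogue with one bulk remote listing.

    catalogue:      dict mapping file path -> expected size in bytes (from the DB)
    remote_listing: dict mapping file path -> observed size in bytes (from one ls)
    Returns the list of paths that are missing or whose size does not match.
    """
    bad = []
    for path, expected in catalogue.items():
        observed = remote_listing.get(path)
        if observed is None or observed != expected:
            bad.append(path)
    return bad


if __name__ == "__main__":
    catalogue = {"/des/run1/img_0001.fits": 8388608,
                 "/des/run1/img_0002.fits": 8388608}
    remote_listing = {"/des/run1/img_0001.fits": 8388608}  # second file missing
    print("failed verification: %s" % verify_sizes(catalogue, remote_listing))
```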

3. GO on the ANI Testbed
- Motivation: testing Grid middleware readiness to interface to 100 Gbit/s links on the Advanced Network Initiative (ANI) testbed
- Characteristics
  - GridFTP data transfers (small, medium, large, all sizes)
  - 300 GB of data split into files (8 KB - 8 GB)
  - Network: aggregate 3 x 10 Gbit/s to the bnl-1 test machine
  - Local tests (reference) initiated on bnl-1
  - FNAL and GO tests initiated on the "FNAL initiator"; GridFTP control forwarded through the "VPN gateway"
- Work by Dave Dykstra, with contributions by Raman Verma & Gabriele Garzoglio

Test Results
- [Throughput plots] GO (yellow) does almost as well as the practical maximum (red) for medium-size files
- Working with GO to improve transfer parameters for big and small files
- Small files have very high overhead over wide-area control channels
- GO auto-tuning works better for medium files than for large files
- Counterintuitively, increasing concurrency and pipelining on small files reduced the transfer throughput (see the sketch below)
- Work by Dave Dykstra, with contributions by Raman Verma & Gabriele Garzoglio
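For reference, these are the kinds of knobs being tuned. The sketch below drives globus-url-copy from Python; the options (-p parallel streams, -cc concurrency, -pp pipelining, -fast data-channel reuse) are standard GridFTP client options, but the endpoint URLs and the chosen values are illustrative only and are not the actual ANI testbed settings.

```python
# Sketch of GridFTP transfer-parameter tuning, assuming illustrative endpoints.

import subprocess

SRC = "gsiftp://fnal-initiator.example.net/data/"   # hypothetical source
DST = "gsiftp://bnl-1.example.net/scratch/"         # hypothetical destination


def transfer(profile):
    """Run one recursive globus-url-copy with a tuning profile for the file mix."""
    cmd = ["globus-url-copy", "-vb", "-fast",
           "-p", str(profile["parallel"]),      # TCP streams per transfer
           "-cc", str(profile["concurrency"])]  # simultaneous file transfers
    if profile.get("pipeline"):
        cmd.append("-pp")                        # pipeline commands for many small files
    cmd += ["-r", SRC, DST]
    return subprocess.call(cmd)


if __name__ == "__main__":
    # Large files tend to benefit from parallel streams; many small files from
    # concurrency/pipelining, although the ANI tests found that pushing
    # concurrency and pipelining too far actually reduced small-file throughput.
    large_files = {"parallel": 8, "concurrency": 2, "pipeline": False}
    small_files = {"parallel": 1, "concurrency": 8, "pipeline": True}
    transfer(large_files)
```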

4. Data Movement on OSG for NEES
- Motivation: supporting the NEES group at UCSD in running computations on the Open Science Grid (OSG)
- Goal: perform parametric studies that involve large-scale nonlinear models of structural or soil-structure systems, with a large number of parameters and OpenSees runs
- Application example: nonlinear time-history (NLTH) analyses of an advanced nonlinear finite element (FE) model of a building
  - Probabilistic seismic demand hazard analysis using the "cloud method": 90 bi-directional historical earthquake records
  - Sensitivity of probabilistic seismic demand to FE model parameters
- A. R. Barbosa, J. P. Conte, J. I. Restrepo (UCSD); 30 days on OSG vs. 12 years on a desktop

Success and Challenges
- Jobs submitted from RENCI (NC) to ~20 OSG sites; output collected at RENCI
- The NEES scientist moved 12 TB from the RENCI server to the user's desktop at UCSD using GO
  - Operations: every day, set up the data transfer update for the day (see the sketch below): fire and forget... almost...
- ...there is still no substitute for a good network administrator
  - Initially we had 5 Mbps; eventually 200 Mbps (over a 600 Mbps link)
  - Improvements:
    - Upgrade the ethernet card on the user's desktop
    - Migrate from Windows to Linux
    - Work with the user to use GO
    - Find a good net admin to find and fix a broken fiber at RENCI, when nothing else worked
- Better use of GO on OSG: integrate GO with the Storage Resource Manager (SRM)
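A rough sketch of the "fire and forget" daily routine, under stated assumptions: the GO endpoint names, the output directory layout, and the submission step are all hypothetical stand-ins; in the actual operation the transfer was queued with Globus Online, which then handled retries.

```python
# Illustrative sketch: once a day, collect that day's new output and queue one
# Globus Online transfer for it. Endpoints, paths, and the submission helper
# are assumptions for this sketch only.

import datetime
import os

SRC_ENDPOINT = "renci#nees-output"    # hypothetical GO endpoint at RENCI
DST_ENDPOINT = "ucsd#nees-desktop"    # hypothetical GO endpoint on the desktop
OUTPUT_ROOT = "/data/nees/opensees"   # hypothetical output directory layout


def todays_batch(root):
    """Collect the output directories produced today (one per OSG run)."""
    today = datetime.date.today().isoformat()
    day_dir = os.path.join(root, today)
    if not os.path.isdir(day_dir):
        return []
    return [os.path.join(day_dir, d) for d in sorted(os.listdir(day_dir))]


def submit_daily_transfer(paths):
    """Placeholder for submitting one GO transfer request covering 'paths'."""
    for p in paths:
        print("queue %s -> %s%s" % (p, DST_ENDPOINT, p))  # GO would sync & retry


if __name__ == "__main__":
    submit_daily_transfer(todays_batch(OUTPUT_ROOT))
```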

Conclusions
- Fermilab has worked with the GO team to improve the system for several use cases:
  - Integration with the glidein Workload Management System: stress the "many-globusconnect" dimension
  - Integration with Data Handling for DES: new requirements on reliability semantics
  - Evaluation of performance over 100 Gbps networks: verify transfer-parameter auto-tuning at extreme scale
  - Integration of GO with NEES for regular operations on OSG: usability for GO's intended usage

Acknowledgments
- The Globus Online team, for their support in all of these activities
- The integration of glideinWMS and globusonline.org was done as part of the CEDPS project
- The glideinWMS infrastructure is developed at Fermilab in collaboration with the Condor team from Wisconsin and High Energy Physics experiments. Most of the glideinWMS development work is funded by the US CMS experiment. It is currently used in production by CMS, CDF, DZero, MINOS, and IceCube, with several other VOs evaluating it for their use cases.
- The Open Science Grid (OSG)
- Fermilab is operated by Fermi Research Alliance, LLC under Contract No. DE-AC02-07CH11359 with the United States Department of Energy.

References
1. CEDPS Report: GO Stress Test Analysis - bin/RetrieveFile?docid=4474;filename=GlobusOnline%20PluginAnalysisReport.pdf;version=1
2. DES DAF Integration with GO - ESIntegrationWithGlobusonline
3. GridFTP & GO on the ANI Testbed - 5ico01vXcFsgyIGZH5pqbbGeI7t8/edit?hl=en_US&pli=1
4. OSG User Support of NEES - gageOpenSeesProductionDemo

Conclusions and Lessons Learned
- Logging
  - Need better logging and a means to unify the logs from all GO transfers into a single file; currently the log files stay at the GO endpoints and nothing gets transferred back
- Reusing / reducing the number of Globus Online endpoints
  - In the glideinWMS integration, each file transfer creates a new GO endpoint; Condor hooks could start globusconnect once and reuse it for all the transfers
- Supporting multiple transfers
  - It is inefficient to transfer one file at a time
- Overriding the default GO deadline of one day
  - Fail earlier than the 1-day default: start the transfer in the background and periodically poll it (see the sketch below)
- Need to work with the developers to fit the service to the need...
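A minimal sketch of the polling idea in the last lesson, assuming a stand-in status callback: the transfer is started asynchronously and polled with our own, much shorter deadline rather than waiting for the service-side 1-day default to expire.

```python
# Illustrative sketch of polling a background transfer with a local deadline.
# 'check_status' is a stand-in for whatever status query the integration uses
# (the GO task status in the real system); deadline and interval are examples.

import time


def wait_for_transfer(check_status, deadline_s=3600, poll_s=60):
    """Poll check_status() until a terminal state or until our own deadline.

    check_status() should return one of "ACTIVE", "SUCCEEDED", or "FAILED".
    Returns True on success, False on failure or on hitting the local deadline.
    """
    start = time.time()
    while time.time() - start < deadline_s:
        state = check_status()
        if state == "SUCCEEDED":
            return True
        if state == "FAILED":
            return False
        time.sleep(poll_s)
    return False  # give up well before the service-side 1-day deadline


if __name__ == "__main__":
    # Toy status function: reports ACTIVE twice, then SUCCEEDED.
    states = iter(["ACTIVE", "ACTIVE", "SUCCEEDED"])
    print(wait_for_transfer(lambda: next(states), deadline_s=10, poll_s=0))
```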