End-To-End Solution: GlideinWMS & Globus Online
Parag Mhashilkar (Fermi National Accelerator Laboratory)


Overview  Problem Statement  Proto-type Running jobs on FermiGrid Running jobs outside Fermilab  Deployment Steps  Running a test job  Plugin Internals  Stress Test Results  Conclusion  References 10/19/2011End-To-End Solution: GlideinWMS & GlobusOnline2

Problem Statement  Current Model Intensity Frontier Community uses glideinWMS to run workflows Transfer output files - ○ to Bluearc experiment mount points mounted on worker nodes and many other servers ○ using custom scripts (CPN)  Problem Output files transferred back are owned by glidein user and not by the actual user 10/19/2011End-To-End Solution: GlideinWMS & GlobusOnline3

Proposed Solution  Proposed Solution Provide an end-to-end solution that would integrate well with glideinWMS and the custom data transfer solutions Two sub-projects ○ Setup gridftp server to facilitate data transfer Required component to preserve the file ownership ○ Integrate with the Globus Online Service Optional component that leverages on the modern techniques and solutions like Globus Online as provided by the Grid community Futuristic – CPN script can not work outside Fermigrid setup  Work was done as part of CEDPS activity in the Fermilab 10/19/2011End-To-End Solution: GlideinWMS & GlobusOnline4

Proto-type  Services hosted on FermiCloud VMs  Core setup Setup Gridftp service on FermiCloud VM (if-gridftp.fnal.gov) ○ VM per experiment (minerva-gridftp.fnal.gov, minos-gridftp.fnal.gov, …) ○ VMs are identical Easy to clone for different experiments from a base image ○ Scalability Setup multiple VMs behind LVS If using Globus Online (GO), it will do the load balancing of the gridftp servers Tools to generate - ○ Gridftp access list (gridmap file) populated Fermilab VOMS server ○ Local user accounts from experiment NIS servers Scale testing by Dennis Box ○ Ran ~500 jobs & ~600 files transferred using globus-url-copy Enhance CPN to use globus-url-copy ○ Work in progress by Dennis Box 10/19/2011End-To-End Solution: GlideinWMS & GlobusOnline5

Proto-type Scenario: Computing on FermiGrid resources
[Diagram: jobs running at the FermiGrid site transfer files through if-gridftp.fnal.gov, hosted on FermiCloud, using custom scripts (CPN).]

End-To-End Solution using GlideinWMS & Globus Online
[Diagram: jobs running at a grid site (not FermiGrid) transfer files back through if-gridftp.fnal.gov, hosted on FermiCloud.]

Proto-type […contd.]
• Setup using third-party file transfer services
• Running workflows on non-Fermilab sites:
  ○ Experiment disks are not mounted on the worker nodes.
  ○ The CPN script cannot be used for locking without major changes – is it even possible to use CPN?
• Other solutions:
  ○ Let the infrastructure transfer the files rather than the application.
  ○ Use the Condor file transfer plugin mechanism (Condor v7.6+); a sketch of the plugin interface follows below.
    - A Globus Online plugin, delivered through glideinWMS
    - Reuses the infrastructure from the core setup
    - Register if-gridftp.fnal.gov as a Globus Online endpoint
  ○ The scheme can be extended to other file transfer protocols.
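For context, a Condor file transfer plugin of that era is just an executable that Condor calls as "plugin source destination" (the rigid syntax noted later under Limitations). A minimal, hypothetical skeleton in Python, with a hedged guess at the -classad advertisement that lets Condor discover which URL scheme the plugin handles (attribute names may differ from the real plugin):

#!/usr/bin/env python
# Hypothetical skeleton of a Condor file transfer plugin; NOT the actual
# globusonline_plugin.py. Condor invokes it as: plugin <source> <destination>.
import sys

def advertise():
    # Rough sketch of the ClassAd a plugin prints so Condor knows which
    # URL scheme it supports.
    print('PluginVersion = "0.1"')
    print('PluginType = "FileTransfer"')
    print('SupportedMethods = "globusonline"')

def transfer(source, destination):
    # The real plugin parses the globusonline:// URI, bootstraps Globus Connect
    # and drives the transfer (see "What does the plugin do?"). Here we only
    # sketch the contract: return 0 on success, non-zero on failure.
    sys.stderr.write("would transfer %s -> %s\n" % (source, destination))
    return 0

if __name__ == "__main__":
    if len(sys.argv) == 2 and sys.argv[1] == "-classad":
        advertise()
        sys.exit(0)
    if len(sys.argv) != 3:
        sys.exit(1)
    sys.exit(transfer(sys.argv[1], sys.argv[2]))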

Deployment: End-To-End Solution
• Step 1: Set up a GridFTP server
  ○ In our example: if-gridftp.fnal.gov
• Step 2: Setup related to Globus Online (GO)
  ○ Create a Globus Online account (a few simple steps)
  ○ Create a GO endpoint for the GridFTP server:
      endpoint-add if-go-ep -p if-gridftp.fnal.gov
      endpoint-modify --public if-go-ep
• Step 3: Configure the glideinWMS Frontend to use the GO plugin
  ○ Add the plugin, globusconnect and the setup script as custom scripts in frontend.xml
  ○ Reconfigure the frontend
• Step 4: Run a test job
  ○ File URI in the Condor JDF: globusonline://go_user:endpoint:filepath_on_gridftpserver (a parsing sketch follows below)
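The globusonline:// URI packs the GO user, the endpoint name and the file path into one string. A minimal parsing sketch, assuming the three fields are colon-separated exactly as in the example above (hypothetical helper, not code from the actual plugin):

# Hypothetical helper: split a globusonline:// URI of the form
# globusonline://go_user:endpoint:filepath_on_gridftpserver into its parts.
def parse_go_uri(uri):
    prefix = "globusonline://"
    if not uri.startswith(prefix):
        raise ValueError("not a globusonline:// URI: %s" % uri)
    # Split on the first two colons only, so the file path may itself contain colons.
    go_user, endpoint, path = uri[len(prefix):].split(":", 2)
    return go_user, endpoint, path

# Example from the test job slide:
# parse_go_uri("globusonline://parag:if-go-ep:/tmp/uploaded_data-fnalgrid")
# -> ("parag", "if-go-ep", "/tmp/uploaded_data-fnalgrid")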

Frontend Configuration
....
<file absfname="/tmp/globusconnect" after_entry="True" after_group="False"
      const="True" executable="False" relfname="globusconnect"
      untar="False" wrapper="False"/>
<file absfname="/tmp/globusonline_plugin.py" after_entry="True" after_group="False"
      const="True" executable="False" relfname="globusonline_plugin.py"
      untar="False" wrapper="False"/>
<file absfname="/tmp/condor-go-plugin_setup.sh" after_entry="True" after_group="False"
      const="True" executable="True" untar="False" wrapper="False"/>

Test job: Condor JDF

universe = vanilla
executable = /local/home/testuser/testjobs/info-from-go.sh
initialdir = /local/home/testuser/testjobs/joboutput
output = out.$(cluster).$(process)
error = err.$(cluster).$(process)
log = log.$(cluster).$(process)
should_transfer_files = YES
when_to_transfer_output = ON_EXIT_OR_EVICT
# Input files
transfer_input_files = globusonline://parag:paragtest:/grid/data/parag/data
# Output sandbox transfer
output_destination = globusonline://parag:if-go-ep:/tmp/uploaded_data-fnalgrid
transfer_output_files = data_to_upload
x509userproxy = /local/home/testuser/testjobs/x509up_u11017.fermilab
+DESIRED_Sites = "GR7x1"
Requirements = (stringListMember(GLIDEIN_Site,DESIRED_Sites)) && (Disk > 0 && (Arch == "INTEL" || Arch == "X86_64"))
queue

What does the plugin do?
• Guess:
  ○ X509_USER_PROXY
  ○ Globusconnect (GC)
  ○ GO user, GO endpoint & file location from the URI
• Bootstrap Globusconnect:
  ○ Unpack
  ○ Add the GC endpoint
  ○ Run setup
  ○ Start GC
• Activate the GO endpoint
• Initiate scp between the GC & GO endpoints
• On scp completion:
  ○ Stop GC
  ○ Remove the GC endpoint
  ○ Exit with the status of the transfer
(A control-flow sketch follows below.)
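A rough sketch of how the transfer() body from the earlier skeleton could walk through these steps, shown here for the output direction (destination is the GO URI). Every step below only logs its intent so the sketch stays self-contained; the real globusonline_plugin.py shells out to Globus Connect and the GO tools instead:

# Hypothetical control-flow sketch of the steps listed above; NOT the real plugin.
import os
import sys

def parse_go_uri(uri):
    # Same splitter as the deployment-section sketch.
    go_user, endpoint, path = uri[len("globusonline://"):].split(":", 2)
    return go_user, endpoint, path

def log(msg):
    sys.stderr.write("[go-plugin] %s\n" % msg)

def transfer(source, destination):
    # Guess the proxy, and pull GO user / endpoint / path from the URI.
    proxy = os.environ.get("X509_USER_PROXY", "<unknown>")
    go_user, go_endpoint, remote_path = parse_go_uri(destination)

    log("bootstrap Globus Connect: unpack, add GC endpoint, run setup, start GC")
    log("activate GO endpoint %s as user %s with proxy %s" % (go_endpoint, go_user, proxy))
    log("scp %s -> %s:%s between the GC and GO endpoints" % (source, go_endpoint, remote_path))
    status = 0   # in the real plugin: the exit status of the transfer
    log("on completion: stop GC, remove the GC endpoint")
    return status

if __name__ == "__main__":
    # Condor plugin contract: plugin <source> <destination>
    sys.exit(transfer(sys.argv[1], sys.argv[2]))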

What does the plugin require?
• Not dependent on the glideinWMS release
• Requires Condor 7.6+
• Globus Connect (GC): transferred via custom scripts or pre-installed on the worker node
  ○ Globus Connect version v1.1+
• Transfers
  ○ Can only take place between a GC and a GO endpoint
  ○ File specified in the GO URI format

Stress Tests & Results
• 2636 Minerva jobs submitted
• Output files transferred back through GO

Limitations  Condor X509_USER_PROXY ○ X509_USER_PROXY is correct when called for input file transfers but is different when plugin is called for output file transfers. Getting Logs/Debugging Info back to the submitted ○ All the files go to the GO endpoints and nothing gets transferred back to submit node ○ If there is a problem accessing GO itself, logs are lost. Supporting multiple transfers Syntax too rigid: plugin source destination Hopefully, newer plugin architecture in Condor will address these issues  Globus Online Fail earlier than 1 day default for Globus Online Current limit of 3 active transfers per user. You can have more queued.  Transfer Plugin Which GLOBUSCONNECT ○ We need to do a better job of figuring out which GLOBUSCONNECT to use. Reusing/Reducing number of Globus Online endpoints ○ Each file transfer creates a new GO endpoint. Start the transfer in background and periodically poll it. Fail accordingly rather than the default GO deadline of 1d 10/19/2011End-To-End Solution: GlideinWMS & GlobusOnline15

Conclusion  End-To-End Solution Independent of the glideinWMS release  Integrates WMS with data management layer  Solution using Condor’s transfer plugin architecture, currently supports GO but can be extended to different data transfer services/protocol  Data transfer done by the infrastructure and not by job 10/19/2011End-To-End Solution: GlideinWMS & GlobusOnline16

References  CEDPS  Project:  Test Results:  Condor File transfer plugin: ns/zmiller-cw2011-data-placement.pdf  Globus Connect: 10/19/2011End-To-End Solution: GlideinWMS & GlobusOnline17