Lecture 4: Grid Data Management. Jaime Frey, UW-Madison Condor Group. June 21-25, 2004.

Presentation transcript:

June 21-25, 2004 Lecture4: Grid Data Management 1 Lecture 4: Grid Data Management. Jaime Frey, UW-Madison Condor Group. Slides prepared in part by Scott Koranda, UW-Milwaukee & NCSA. Grid Summer Workshop, June 21-25, 2004

Lecture4: Grid Data Management 2 Motivation? Why is the Grid community concerned with data/file management? Why might you be concerned with data/file management?

June 21-25, 2004 Lecture4: Grid Data Management 3 Motivation: The Data Problem
Motivate our discussion with the large physics experiments (part of GriPhyN and Grid2003)
  Laser Interferometer Gravitational Wave Observatory
    Detect spacetime ripples from black holes & other sources
    Generates data at 10 MB per second, just under 1 TB per day
  Sloan Digital Sky Survey
    Catalog more stars and galaxies than ever before
    More than 15 TB of data catalogs
  Compact Muon Solenoid and ATLAS
    Detect the Higgs boson (a fundamental particle)
    100 MB per second, about 1 petabyte per year (per detector)
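The rates quoted on this slide can be checked with a little arithmetic. The following is a minimal sketch, assuming decimal units (1 MB = 10^6 bytes) and, for the first figure, continuous data taking; the function names are illustrative and not from the lecture.

```python
SECONDS_PER_DAY = 24 * 60 * 60
MB = 10**6
TB = 10**12
PB = 10**15


def volume_per_day_tb(rate_mb_per_s):
    """Data volume in TB accumulated in one day at a sustained rate."""
    return rate_mb_per_s * MB * SECONDS_PER_DAY / TB


def seconds_to_accumulate(volume_bytes, rate_mb_per_s):
    """Recording time in seconds needed to reach a given volume."""
    return volume_bytes / (rate_mb_per_s * MB)


ligo_tb_per_day = volume_per_day_tb(10)
lhc_seconds = seconds_to_accumulate(1 * PB, 100)

print(f"{ligo_tb_per_day:.3f} TB/day")              # 0.864 TB/day
print(f"{lhc_seconds / SECONDS_PER_DAY:.0f} days")  # 116 days
```

At 10 MB per second a day of data is 0.864 TB, which matches "just under 1 TB per day". At 100 MB per second, 1 petabyte corresponds to 10^7 seconds of recording (about 116 days), so the per-year figure is consistent with a detector that records for only part of the year; that reading is an inference, not something the slide states.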

June 21-25, 2004 Lecture4: Grid Data Management 4 Really Two Data Problems
The amount of data
  High-performance tools needed to manage the huge raw volume of data
    Store it
    Move it
  Measure in terabytes, petabytes, and ???
The number of data files
  High-performance tools needed to manage the huge number of filenames
    [...] filenames is expected soon
    Collection of [...] of anything is a lot to handle efficiently

June 21-25, 2004 Lecture4: Grid Data Management 5 Three Data Questions on the Grid Essentially three (3) questions for which you want Grid tools to address 1. What data/files exist? 2. What data/files are where? 3. How do I move data/files from A to B?

June 21-25, 2004 Lecture4: Grid Data Management 6 Three Data Questions on the Grid Examine these questions last to first …because even if you don’t have TBs of data you will want to move files so start with #3 1. What data/files exist? 2. What data/files are where? 3. How do I move data/files from A to B?

June 21-25, 2004 Lecture4: Grid Data Management 7 How to move data/files? Requirements  Fast – as fast as networks and protocols allow I2 sites should expect at least 10 MB/s sustained  Secure Server must only share files with strongly authenticated clients No passwords in the clear or similar  Robust Fault tolerant, time-tested protocol

June 21-25, 2004 Lecture4: Grid Data Management 8 GridFTP Extension to well known File Transfer Protocol (FTP)  Extensions include  Strong authentication, encryption via Globus GSI  Multiple, parallel data channels  Third-party transfers  Tunable network & I/O parameters  Server side processing, command pipelining

June 21-25, 2004 Lecture4: Grid Data Management 9 Necessary Semantics… GridFTP is the protocol A server or client that implements the GridFTP protocol is GridFTP-enabled or Grid-enabled  Often hear “the GridFTP server…” or “the GridFTP client…”  Correct is “the GridFTP-enabled server from the Globus team” or the particular client being used  Let it slide…easier to use the slang…but  Distinction more important soon as groups outside of Globus release GridFTP-enabled clients & servers

June 21-25, 2004 Lecture4: Grid Data Management 10 GridFTP Server
Built on top of wuftpd, our old friend
  A brand new server written from scratch is in beta now…
Most configuration details are the same as for wuftpd
Runs as an inetd (xinetd) service
  1. Connection is attempted on port 2811
  2. Xinetd looks up port in /etc/services and finds responsible service
  3. Xinetd starts service according to configuration, with data from the connection sent on stdin

June 21-25, 2004 Lecture4: Grid Data Management 11 GridFTP Server
From /etc/services
[services]$ tail /etc/services
gsiftp             2811/tcp   #Grid-FTP Server
globus-gatekeeper  2119/tcp   #Globus Gatekeeper
From /etc/xinetd.d/
[xinetd.d]$ cat gsiftp
service gsiftp
{
    socket_type     = stream
    protocol        = tcp
    env             = LD_LIBRARY_PATH=/opt/ldg-2.0/globus/lib
    wait            = no
    user            = root
    server          = /opt/ldg-2.0/globus/sbin/in.ftpd
    server_args     = -l -a -G /opt/ldg-2.0/globus
    log_on_success  += DURATION USERID
    log_on_failure  += USERID
    nice            = 10
    disable         = no
}

June 21-25, 2004 Lecture4: Grid Data Management 12 GridFTP Server Environment variables  LD_LIBRARY_PATH Point to $GLOBUS_LOCATION/lib  GRIDMAP Path to grid-mapfile for authentication Generic GSI environment variable  X509_CERT_DIR Directory in which CA signing certificates held Generic GSI environment variable

June 21-25, 2004 Lecture4: Grid Data Management 13 GridFTP Server
Logging to system log
  On most Linux systems, /var/log/messages
Jun 10 10:46:59 basil gridftpd[21857]: GSSAPI user /DC=org/DC=doegrids/OU=People/CN=Scott Koranda is authorized as skoranda
Jun 10 10:46:59 basil gridftpd[21857]: FTP LOGIN FROM oregano.phys.uwm.edu [ ], skoranda
Uses host certificate for mutual authentication
[ root]# grid-cert-info -file /etc/grid-security/hostcert.pem -subject
/DC=org/DC=doegrids/OU=Services/CN=basil.phys.uwm.edu

June 21-25, 2004 Lecture4: Grid Data Management 14 GridFTP Server
Third-party transfers
  Client directs transfers between two servers
[Diagram: GridFTP client on ygraine.aei.mpg.de; GridFTP servers on basil.phys.uwm.edu and ldas-cit.ligo.caltech.edu. The client sends “move file1 to ldas-cit.ligo.caltech.edu” and file1 travels between the two servers.]

June 21-25, 2004 Lecture4: Grid Data Management 15 GridFTP clients Globus-url-copy GridFTP-compliant client from the Globus team Copy files from one URL to another URL  One URL is usually a gsiftp:// URL  Another URL is usually a file:/ URL  To move a file from remote GridFTP-enabled server to local machine globus-url-copy gsiftp://dataserver.phys.uwm.edu/data/file1 file:/home/skoranda/file1

June 21-25, 2004 Lecture4: Grid Data Management 16 Globus-url-copy Alternative forms for file:/ URLs globus-url-copy gsiftp://dataserver.phys.uwm.edu/data/file1 file://localhost/home/skoranda/file1 globus-url-copy gsiftp://dataserver.phys.uwm.edu/data/file1 file://basil.phys.uwm.edu/home/skoranda/file1 If GridFTP server runs on a non-standard port? globus-url-copy gsiftp://dataserver.phys.uwm.edu:15000/data/file1 file:/home/skoranda/file1

June 21-25, 2004 Lecture4: Grid Data Management 17 Globus-url-copy
To put a file onto the server, reverse the URLs
globus-url-copy file:/home/skoranda/file1 gsiftp://dataserver.phys.uwm.edu/data/file1
By default 1 data channel is used
  average performance
  monitor performance using the -vb flag
$ globus-url-copy -vb gsiftp://ldas-cit.ligo.caltech.edu:15000/usr1/grid/smallfile file:/tmp/smallfile
[...] bytes [...] KB/sec avg [...] KB/sec inst

June 21-25, 2004 Lecture4: Grid Data Management 18 Going fast
Multiple channels dramatically boost the transfer rate
$ globus-url-copy -vb -p 4 gsiftp://ldas-cit.ligo.caltech.edu:15000/usr1/grid/largefile file:/tmp/largefile
[...] bytes [...] KB/sec avg [...] KB/sec inst
Still faster by using large TCP windows
$ globus-url-copy -vb -p 4 -tcp-bs [...] gsiftp://ldas-cit.ligo.caltech.edu:15000/usr1/grid/largefile file:/tmp/largefile
[...] bytes [...] KB/sec avg [...] KB/sec inst
Still faster by using large memory buffers
$ globus-url-copy -vb -p 4 -bs [...] -tcp-bs [...] gsiftp://ldas-cit.ligo.caltech.edu:15000/usr1/grid/largefile file:/tmp/largefile
[...] bytes [...] KB/sec avg [...] KB/sec inst
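Why parallel channels and larger TCP windows help: a single TCP stream can have at most one window of unacknowledged data in flight per round trip, so its throughput is bounded by window / RTT, and several streams raise that bound proportionally. The sketch below illustrates the relationship; the 64 KB window and 60 ms round-trip time are illustrative assumptions, not measurements from the lecture, and the function names are hypothetical.

```python
def max_throughput_mb_s(window_bytes, rtt_s, streams=1):
    """Upper bound on throughput (MB/s) for `streams` TCP connections,
    each limited to one window of data in flight per round trip."""
    return streams * window_bytes / rtt_s / 1e6


def window_for_rate(rate_mb_s, rtt_s):
    """Bandwidth-delay product: bytes that must be in flight to sustain
    `rate_mb_s` over a path with round-trip time `rtt_s`."""
    return rate_mb_s * 1e6 * rtt_s


rtt = 0.060             # assumed 60 ms round trip
small_window = 64 * 1024  # assumed 64 KB window

print(f"{max_throughput_mb_s(small_window, rtt):.2f}")             # 1.09
print(f"{max_throughput_mb_s(small_window, rtt, streams=4):.2f}")  # 4.37
print(f"{window_for_rate(10, rtt):.0f}")                           # 600000
```

Under these assumptions one stream tops out near 1 MB/s, four streams near 4.4 MB/s, and sustaining 10 MB/s on a single stream would need a window of roughly 600 KB.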

June 21-25, 2004 Lecture4: Grid Data Management 19 Faster!
Depending on network & weather you can go very fast!
$ globus-url-copy -vb -p 8 -bs [...] -tcp-bs [...] gsiftp://ldas-cit.ligo.caltech.edu:15000/usr1/grid/largefile file:/tmp/largefile
[...] bytes [...] KB/sec avg [...] KB/sec inst
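When transfers are scripted, the command line can be assembled programmatically. This is a minimal sketch that only builds the argument list from the options shown on the slides (-vb, -p, -bs, -tcp-bs). The helper name build_copy_command and its parameters are hypothetical, and the buffer size in the usage lines is a placeholder, since the values on the original slides did not survive.

```python
import shlex


def build_copy_command(src_url, dst_url, parallel=None, block_size=None,
                       tcp_buffer=None, verbose=False):
    """Return an argv list for globus-url-copy; nothing is executed."""
    argv = ["globus-url-copy"]
    if verbose:
        argv.append("-vb")                      # report transfer performance
    if parallel is not None:
        argv += ["-p", str(parallel)]           # number of parallel data channels
    if block_size is not None:
        argv += ["-bs", str(block_size)]        # memory buffer size in bytes
    if tcp_buffer is not None:
        argv += ["-tcp-bs", str(tcp_buffer)]    # TCP buffer (window) size in bytes
    argv += [src_url, dst_url]
    return argv


cmd = build_copy_command(
    "gsiftp://ldas-cit.ligo.caltech.edu:15000/usr1/grid/largefile",
    "file:/tmp/largefile",
    parallel=4,
    tcp_buffer=1048576,  # placeholder value, not taken from the slides
    verbose=True,
)
print(shlex.join(cmd))
```

The resulting list can be handed to subprocess.run(cmd) on a machine where globus-url-copy is installed and a valid GSI proxy exists.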

June 21-25, 2004 Lecture4: Grid Data Management 20 Third-party transfers
Transfers from server to server directed by client
  Use gsiftp:// URLs for both
  Requires that both servers be configured to allow third-party transfers
$ hostname
basil.phys.uwm.edu
$ globus-url-copy gsiftp://hydra.phys.uwm.edu/tmp/file1 gsiftp://contra.phys.uwm.edu/tmp/file1

June 21-25, 2004 Lecture4: Grid Data Management 21 Debugging
Use -dbg to see control channel communication
$ globus-url-copy -dbg gsiftp://hydra.phys.uwm.edu/tmp/file1 file:/tmp/file1
debug: starting to get gsiftp://hydra.phys.uwm.edu/tmp/file1
debug: connecting to gsiftp://hydra.phys.uwm.edu/tmp/file1
debug: response from gsiftp://hydra.phys.uwm.edu/tmp/file1:
220 hydra.phys.uwm.edu GridFTP Server 1.12 GSSAPI type Globus/GSI wu (gcc32dbg, ) ready.
debug: authenticating with gsiftp://hydra.phys.uwm.edu/tmp/file1
debug: response from gsiftp://hydra.phys.uwm.edu/tmp/file1:
230 User skoranda logged in.
debug: sending command:
FEAT
debug: response from gsiftp://hydra.phys.uwm.edu/tmp/file1:
211-Extensions supported:
 REST STREAM
 ESTO
 ERET
 MDTM
 SIZE
 PARALLEL
 DCAU
211 END

June 21-25, 2004 Lecture4: Grid Data Management 22 Globus-url-copy
Actually a general-purpose URL copying tool
  No GSI authentication used
  Parallel channels and the like won’t work
$ globus-url-copy [...] file:/tmp/yahoo
$ globus-url-copy ftp://ftp.globus.org/banner.msg file:/tmp/banner.msg

June 21-25, 2004 Lecture4: Grid Data Management 23 GridFTP clients
UberFTP
  developed and supported at National Center for Supercomputing Applications (NCSA)
  interactive like our old (insecure) friend ‘ftp’
  use -a GSI for GSI authentication
  supports multiple channels using -c flag
$ uberftp -H hydra.phys.uwm.edu -a GSI
220 hydra.phys.uwm.edu GridFTP Server 1.12 GSSAPI type Globus/GSI wu (gcc32dbg, ) ready.
230 User skoranda logged in.
uberftp>

June 21-25, 2004 Lecture4: Grid Data Management 24 GridFTP clients “Roll your own” Add functionality directly to your applications  Your application find and download its own data?  Your application deliver output data files when finished computing? Globus Toolkit offers APIs to code against  C  Java  Python

June 21-25, 2004 Lecture4: Grid Data Management 25 GridFTP and Firewalls
Nice document by Globus team at [...] Firewall Requirements-5.pdf
Tip: when debugging GridFTP and firewalls
  remember which way connections are established
  1 single data channel: data connection established from client to server
  2 or more data channels: data connection established in the direction the data will flow
  control connection: always from client to server
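The connection rules in the tip above can be written down as a small lookup, which is handy when working out which side's firewall must accept inbound connections. This is a sketch of the rules exactly as stated on the slide; the function name and its return format are hypothetical.

```python
def connection_directions(data_channels, data_flows_to):
    """Return (initiator, acceptor) pairs for the control and data connections.

    data_flows_to is 'client' for a download and 'server' for an upload.
    """
    if data_channels < 1:
        raise ValueError("need at least one data channel")
    if data_flows_to not in ("client", "server"):
        raise ValueError("data_flows_to must be 'client' or 'server'")

    control = ("client", "server")  # control connection: always client to server
    if data_channels == 1:
        data = ("client", "server")  # single channel: client connects to server
    else:
        # two or more channels: the connection is made in the direction the
        # data will flow, so the sending side initiates it
        sender = "server" if data_flows_to == "client" else "client"
        data = (sender, data_flows_to)
    return {"control": control, "data": data}


print(connection_directions(1, "client"))
print(connection_directions(4, "client"))
print(connection_directions(4, "server"))
```

Under these rules a parallel download is the case to watch: the server opens the data connections, so the client side must allow inbound connections.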

June 21-25, 2004 Lecture4: Grid Data Management 26 Hints for Experts
To make GridFTP go really fast
  use fast disks/filesystems
    filesystem should read/write > 30 MB/second
  configure TCP for performance
    See TCP Tuning Guide at [...]
  patch your Linux kernel with the web100 patch
    See [...]
    Important work-around for Linux TCP “feature”
  understand your network path

June 21-25, 2004 Lecture4: Grid Data Management 27 Three Data Questions on the Grid 1. What data/files exist? 2. What data/files are where? 3. How do I move data/files from A to B?

June 21-25, 2004 Lecture4: Grid Data Management 28 What data/files are where?
Requirements
  Catalog 10^8 files and their locations
    What files are where (possibly at more than one place)
    Across multiple sites within a Grid
    Mappings from logical filenames (LFNs) to physical filenames (PFNs) or URLs
  No single point of failure
    No central catalog/server to be a single point of failure

June 21-25, 2004 Lecture4: Grid Data Management 29 Globus Replica Location Service
Globus RLS
Each RLS server usually runs two catalogs
  LRC
    Local replica catalog
    Catalog of what files you have (LFNs) and mappings to URL(s) or PFNs
  RLI
    Replica location index
    Catalog of which files (LFNs) other LRCs in your data grid know about

June 21-25, 2004 Lecture4: Grid Data Management 30 Globus RLS Network of RLS servers inform each other  Each site has LRC with mappings of LFNs to PFNs usually contains the “local” mappings where files located at the site Site at Milwaukee might have this mapping in its LRC H-R gwf → gsiftp://dataserver.phys.uwm.edu/LIGO/H-R gwf  LRC catalog at each site tells remote RLIs what LFNs it has mappings for Milwaukee tells Caltech it has a mapping for H-R gwf So Caltech RLI has mapping H-R gwf → LRC at Milwaukee

June 21-25, 2004 Lecture4: Grid Data Management 31 Globus RLS
[Diagram: two sites, each running an RLS server with an LRC and an RLI]
Site A, rls://serverA:39281, holds file1 and file2
  LRC: file1 → gsiftp://serverA/file1, file2 → gsiftp://serverA/file2
  RLI: file3 → rls://serverB/file3, file4 → rls://serverB/file4
Site B, rls://serverB:39281, holds file3 and file4
  LRC: file3 → gsiftp://serverB/file3, file4 → gsiftp://serverB/file4
  RLI: file1 → rls://serverA/file1, file2 → rls://serverA/file2

June 21-25, 2004 Lecture4: Grid Data Management 32 Globus RLS
Typical way to query RLS network and find files in your Grid
Ask your local LRC “do you know about the file H-R gwf?”
If yes…
  Ask your local LRC for the corresponding URL(s)
  It answers “H-R gwf is at URL gsiftp://basil.phys.uwm.edu/LIGO/H-R gwf”
If no…
  Ask your local RLI “who does know about this file?”
  It answers “The RLS server at MIT knows about this file.”
  Go ask the MIT RLS server “I am told you know about the file H-R gwf…please tell me the URL for it?”
    It answers “H-R gwf is at URL gsiftp://ldas.mit.edu/LIGO/H-R gwf”

June 21-25, 2004 Lecture4: Grid Data Management 33 Globus RLS Quick Review  LFN → logical filename (think of as simple filename)  PFN → physical filename (think of as a URL)  LRC → your local catalog of maps from LFNs to PFNs H-R gwf → gsiftp://dataserver.phys.uwm.edu/LIGO/H-R gwf  RLI → your local catalog of maps from LFNs to LRCs H-R gwf → LRCs at MIT, PSU, Caltech, and UW-M  LRCs inform RLIs about mappings known  Find files on your Grid by querying RLI(s) to get LRC(s), then query LRC(s) to get URL(s)
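The lookup procedure reviewed above (local LRC first, then the RLI to find remote LRCs, then those LRCs) can be modelled with plain dictionaries. This is a toy, in-memory sketch and not the Globus RLS API: the class RLSServer and its methods are hypothetical, and the site names mirror the serverA/serverB example from the earlier slide.

```python
class RLSServer:
    """Toy stand-in for one site's RLS server (an LRC plus an RLI)."""

    def __init__(self, url, network):
        self.url = url
        self.lrc = {}  # LFN -> list of PFNs (URLs) held at this site
        self.rli = {}  # LFN -> set of URLs of remote servers whose LRC knows it
        self.network = network
        network[url] = self

    def register(self, lfn, pfn):
        """Add an LFN -> PFN mapping to the local replica catalog."""
        self.lrc.setdefault(lfn, []).append(pfn)

    def update_remote_rli(self, remote_url):
        """Tell a remote RLI which LFNs this LRC has mappings for."""
        remote = self.network[remote_url]
        for lfn in self.lrc:
            remote.rli.setdefault(lfn, set()).add(self.url)

    def find(self, lfn):
        """Return PFNs for an LFN: local LRC first, then LRCs named by the RLI."""
        if lfn in self.lrc:
            return list(self.lrc[lfn])
        pfns = []
        for url in sorted(self.rli.get(lfn, ())):
            pfns.extend(self.network[url].lrc.get(lfn, []))
        return pfns


network = {}
site_a = RLSServer("rls://serverA:39281", network)
site_b = RLSServer("rls://serverB:39281", network)
site_a.register("file1", "gsiftp://serverA/file1")
site_b.register("file3", "gsiftp://serverB/file3")
site_a.update_remote_rli(site_b.url)
site_b.update_remote_rli(site_a.url)

print(site_a.find("file1"))  # answered by the local LRC
print(site_a.find("file3"))  # found via the local RLI, then site B's LRC
print(site_a.find("nope"))   # unknown everywhere: empty list
```

Note that the RLI stores only "which server knows this LFN", never the URLs themselves, which is why a second query to the remote LRC is always needed.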

June 21-25, 2004 Lecture4: Grid Data Management 34 Globus RLS: Server Perspective
1. Listens on port 39281 (default) for clients
2. Responds to client queries
  what LFNs in local catalog, the LRC?
  what other LRCs know about LFNs?
  checks against access control list for each client
3. Accepts publishing of new LFNs into LRC
  add files to local catalog
4. Sends updates of LRC to other servers
  tell remote RLI catalogs what LFNs you have mappings for locally

June 21-25, 2004 Lecture4: Grid Data Management 35 Globus RLS: Server Perspective
Listens on port 39281 (default) for clients
  Server address is a URL
    rls://dataserver.phys.uwm.edu
    rls://dataserver.phys.uwm.edu:39281
    rls://dataserver
    rls://localhost
  Uses a host certificate to identify itself
    must run as root if host cert is owned by root
    often copy host cert/key to another non-root, limited-privilege account and configure the server to use that copy

June 21-25, 2004 Lecture4: Grid Data Management 36 Globus RLS: Server Perspective Mappings LFNs → PFNs kept in database  Uses generic ODBC interface to talk to any (good) RDBM  MySQL, PostgreSQL, Oracle, DB2,...  All RDBM details hidden from administrator and user well, not quite RDBM may need to be “tuned” for performance but one can start off knowing very little about RDBMs

June 21-25, 2004 Lecture4: Grid Data Management 37 Globus RLS: Server Perspective
Mappings LFNs → LRCs stored in 1 of 2 ways
As a table in the database
  full, complete listing from LRCs that update your RLI
  requires each LRC to send your RLI a full, complete list
    as number of LFNs in catalog grows, this becomes substantial
    10^8 filenames at 64 bytes per filename ~ 6 GB
In memory, in a special hash called a Bloom filter
  10^8 filenames stored in as little as 256 MB
    easy for LRC to create Bloom filter and send over network to RLIs
  can cause RLI to lie when asked if it knows about an LFN
    only false positives
    tunable error rate
    acceptable in many contexts
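The size comparison on this slide follows from simple arithmetic. The sketch below assumes decimal gigabytes for the full list and 256 MB = 256 × 2^20 bytes for the filter. The false-positive estimate is the textbook bound for a Bloom filter with an optimal number of hash functions; it is an assumption about an ideal filter and not a statement about what the RLS implementation achieves.

```python
import math

N_FILENAMES = 10**8
BYTES_PER_NAME = 64

# Full list: every filename is stored and sent.
full_list_gb = N_FILENAMES * BYTES_PER_NAME / 1e9

# Bloom filter: a fixed-size bit array shared by all filenames.
bloom_bytes = 256 * 2**20
bits_per_name = bloom_bytes * 8 / N_FILENAMES

# Textbook false-positive bound with the optimal number of hash functions:
# p = exp(-(m/n) * (ln 2)^2), where m/n is bits per stored name.
ideal_false_positive = math.exp(-bits_per_name * math.log(2) ** 2)

print(f"{full_list_gb:.1f} GB for the full list")  # 6.4 GB
print(f"{bits_per_name:.1f} bits per filename")    # 21.5 bits
print(f"{ideal_false_positive:.1e} ideal false-positive rate")
```

The full list needs 6.4 GB, which the slide rounds to about 6 GB, while a 256 MB filter still leaves roughly 21 bits per filename, enough for a very low error rate in the ideal case.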

June 21-25, 2004 Lecture4: Grid Data Management 38 Globus RLS: Configuring the Server
Single configuration file
  usually $GLOBUS_LOCATION/etc/globus-rls-server.conf
Send server a HUP signal to refresh configuration
  kill -SIGHUP <pid>
Access control
  each “client” is given one or more of
    lrc_read: permission to query the LRC for mappings
    lrc_update: permission to add new mappings in LRC
    rli_read: permission to query RLI for mappings
    rli_update: permission to inform RLI of remote LRC mappings
    stats: permission to query server for statistics
    admin: permission to change configuration on the fly

June 21-25, 2004 Lecture4: Grid Data Management 39 Globus RLS: Configuring the Server
Access control
  access given to certificate subject
    acl /DC=org/DC=doegrids/OU=People/CN=Scott Koranda: lrc_read
  access given to UID mapped in grid-mapfile
    which grid-mapfile is examined is controlled by the GRIDMAP environment variable
    acl skoranda: lrc_read
  must give remote LRCs permission to update your RLI
    remote RLS server uses host certificate to identify itself
    acl /DC=org/DC=doegrids/OU=Services/CN=ldas.mit.edu: rli_update

June 21-25, 2004 Lecture4: Grid Data Management 40 Globus RLS: Configuring the Server
globus-rls-admin tool for configuration
  need GSI credential to talk to server
  must have acl with admin privileges for your credential
  manual page is available
NAME
  globus-rls-admin - Replica Location Service Administration
SYNOPSIS
  globus-rls-admin -A|-a|-C option value|-c option|-D|-d|-e|-p|-q|-r|-S|-s|-t timeout|-u|-v [ rli ] [ pattern ] [ server ]
DESCRIPTION
  The program globus-rls-admin performs administrative operations on a RLS server (see globus-rls-server(8)).
ping the server to see if alive
$ globus-rls-admin -p rls://localhost
ping rls://localhost: 0 seconds

June 21-25, 2004 Lecture4: Grid Data Management 41 Globus RLS: Configuring the Server
Query server for statistics
$ globus-rls-admin -S rls://localhost
Version: [...]
Uptime: 02:46:19
LRC stats
  update method: lfnlist
  update method: bloomfilter
  updates bloomfilter: rls://mini.astro.cf.ac.uk:39281 last 06/15/04 11:39:12
  updates bloomfilter: rls://ygraine.aei.mpg.de:39281 last 12/31/69 18:00:00
  updates bloomfilter: rls://ldas-cit.ligo.caltech.edu:39281 last 12/31/69 18:00:00
  lfnlist update interval: [...]
  bloomfilter update interval: 900
  numlfn: [...]
  numpfn: [...]
  nummap: [...]
RLI stats
  updated by: rls://mini.astro.cf.ac.uk:39281 last 06/15/04 11:47:56
  updated by: rls://ygraine.aei.mpg.de:39281 last 06/15/04 11:25:23
  updated by: rls://ldas-cit.ligo.caltech.edu:39281 last 06/15/04 11:43:31
  updated via bloomfilters

June 21-25, 2004 Lecture4: Grid Data Management 42 Globus RLS: Configuring the Server
Tell LRC what remote RLIs to update
  local LRC should update the RLI at MIT using Bloom filter
$ globus-rls-admin -A rls://ldas.mit.edu rls://localhost
  use -a if updating via lists rather than Bloom filter

June 21-25, 2004 Lecture4: Grid Data Management 43 Globus RLS: Client Perspective Two ways for clients to interact with RLS Server globus-rls-cli simple command-line tool  query  create new mappings “roll your own” client by coding against API  Java  C  Python

June 21-25, 2004 Lecture4: Grid Data Management 44 Globus-rls-cli
Simple query to LRC to find a PFN for an LFN
Note that more than 1 PFN may be returned
$ globus-rls-cli query lrc lfn H-R gwf rls://dataserver:39281
H-R gwf: file://localhost/netdata/s001/S1/R/H/ /H-R gwf
H-R gwf: file://medusa-slave001.medusa.phys.uwm.edu/data/S1/R/H/ /H-R gwf
H-R gwf: gsiftp://dataserver.phys.uwm.edu:15000/data/gsiftp_root/cluster_storage/data/s001/S1/R/H/ /H-R gwf
Server and client behave sanely if the LFN is not found
$ globus-rls-cli query lrc lfn "foo" rls://dataserver
LFN doesn't exist: foo
$ echo $?
1

June 21-25, 2004 Lecture4: Grid Data Management 45 Globus-rls-cli
Be sure to quote the LFN if it has funny characters
$ globus-rls-cli query lrc lfn file& rls://dataserver
[1]
bash: rls://dataserver: No such file or directory
datarobot]$ connect(file): Bad URL: globus_url_parse(file): Error code -3
[1]+ Exit 1 globus-rls-cli query lrc lfn file
datarobot]$ globus-rls-cli query lrc lfn "file&" rls://dataserver
LFN doesn't exist: file&
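When the command is generated by a script, quoting can be left to the standard library rather than done by hand. A minimal sketch using Python's shlex module; the helper name quoted_command is hypothetical, and the LFN is the "file&" example from the slide.

```python
import shlex


def quoted_command(argv):
    """Join an argument list into a shell-safe command string."""
    return shlex.join(argv)


lfn = "file&"
argv = ["globus-rls-cli", "query", "lrc", "lfn", lfn, "rls://dataserver"]

print(quoted_command(argv))
# globus-rls-cli query lrc lfn 'file&' rls://dataserver
```

Passing the argv list directly to subprocess.run (without shell=True) avoids the shell entirely, so the "&" is never interpreted as a background operator in the first place.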

June 21-25, 2004 Lecture4: Grid Data Management 46 Globus-rls-cli
Wildcard searches of LRC supported
  probably a good idea to quote LFN wildcard expression
$ globus-rls-cli query wildcard lrc lfn "H-R *-16.gwf" rls://dataserver:39281
H-R gwf: gsiftp://dataserver.phys.uwm.edu:15000/data/gsiftp_root/cluster_storage/data/s001/S1/R/H/ /H-R gwf
H-R gwf: gsiftp://dataserver.phys.uwm.edu:15000/data/gsiftp_root/cluster_storage/data/s001/S1/R/H/ /H-R gwf

June 21-25, 2004 Lecture4: Grid Data Management 47 Globus-rls-cli
Bulk queries also supported
  obtain PFNs for more than one LFN at a time
$ globus-rls-cli bulk query lrc lfn H-R gwf H-R gwf rls://dataserver
H-R gwf: gsiftp://dataserver.phys.uwm.edu:15000/data/gsiftp_root/cluster_storage/data/s001/S1/R/H/ /H-R gwf
H-R gwf: gsiftp://dataserver.phys.uwm.edu:15000/data/gsiftp_root/cluster_storage/data/s001/S1/R/H/ /H-R gwf

June 21-25, 2004 Lecture4: Grid Data Management 48 Globus-rls-cli
Simple query to RLI to locate an LFN to LRC map
  then query that LRC for the PFN
$ globus-rls-cli query rli lfn H-R gwf rls://dataserver
H-R gwf: rls://ldas-cit.ligo.caltech.edu:39281
$ globus-rls-cli query lrc lfn H-R gwf rls://ldas-cit.ligo.caltech.edu:39281
H-R gwf: gsiftp://ldas-cit.ligo.caltech.edu:15000/archive/S1/L0/LHO/H-R-7140/H-R gwf

June 21-25, 2004 Lecture4: Grid Data Management 49 Globus-rls-cli
Bulk queries to RLI also supported
$ globus-rls-cli bulk query rli lfn H-R gwf H-R gwf rls://dataserver
H-R gwf: rls://ldas-cit.ligo.caltech.edu:39281
H-R gwf: rls://ldas-cit.ligo.caltech.edu:39281
Wildcard queries to RLI may not be supported!
  no wildcards when using Bloom filter updates
$ globus-rls-cli query wildcard rli lfn "H-R *-16.gwf" rls://dataserver
Operation is unsupported: Wildcard searches with Bloom filters

June 21-25, 2004 Lecture4: Grid Data Management 50 Globus-rls-cli
RLS with Bloom filter updates to RLI is fast and efficient
  Bloom filter is a hash of the information in an LRC
  remote LRC creates the Bloom filter and sends it to the RLI
  RLI can test whether a particular LFN is in the LRC’s Bloom filter
    can’t do a wildcard search
    will sometimes lie!
    only false positives
  if you can’t have any false positives, use full list updates
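The properties listed above (membership tests only, occasional false positives, never a false negative) follow from how a Bloom filter works. The sketch below is a generic textbook Bloom filter built on hashlib, not the implementation used inside Globus RLS; the class name, the filter size, and the number of hash functions are all illustrative assumptions.

```python
import hashlib


class BloomFilter:
    """Generic Bloom filter: a bit array plus several hash positions per item."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True only if every position is set; unrelated items can collide,
        # which is the source of false positives.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


# An "LRC" summarizes its LFNs into a filter that could be sent to an RLI.
lrc_lfns = [f"file{i}" for i in range(1000)]
bloom = BloomFilter(num_bits=16384, num_hashes=5)
for name in lrc_lfns:
    bloom.add(name)

missing = [name for name in lrc_lfns if name not in bloom]
probes = [f"other{i}" for i in range(1000)]
false_positives = sum(1 for name in probes if name in bloom)

print(len(missing))       # 0: a stored LFN is never reported as absent
print(false_positives)    # small but possibly non-zero
```

Because the filter keeps only bit positions and not the names themselves, it cannot enumerate or pattern-match its contents, which is why wildcard queries are unavailable with Bloom filter updates.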

June 21-25, 2004 Lecture4: Grid Data Management 51 Globus-rls-cli
Create new LFN → PFN mappings
  use create to create the first mapping for an LFN
$ globus-rls-cli create file1 gsiftp://dataserver/file1 rls://dataserver
  use add to add more mappings for an LFN
$ globus-rls-cli add file1 file://dataserver/file1 rls://dataserver
  use delete to remove a mapping for an LFN
    when the last mapping for an LFN is deleted, the LFN is also deleted
    cannot have an LFN in the LRC without a mapping
$ globus-rls-cli delete file1 file://file1 rls://dataserver
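The create/add/delete semantics on this slide can be captured in a small in-memory model. This is a hypothetical sketch, not the RLS client API: the class LocalReplicaCatalog is invented for illustration, and the choice to raise an error when create is called on an existing LFN (or add on a missing one) is an assumption about reasonable behavior rather than something the slide states.

```python
class LocalReplicaCatalog:
    """Toy LRC: maps each LFN to a non-empty list of PFNs."""

    def __init__(self):
        self._maps = {}

    def create(self, lfn, pfn):
        """Create the first mapping for a new LFN."""
        if lfn in self._maps:
            raise KeyError(f"LFN already exists: {lfn}")  # assumed behavior
        self._maps[lfn] = [pfn]

    def add(self, lfn, pfn):
        """Add a further mapping to an existing LFN."""
        if lfn not in self._maps:
            raise KeyError(f"LFN doesn't exist: {lfn}")  # assumed behavior
        if pfn not in self._maps[lfn]:
            self._maps[lfn].append(pfn)

    def delete(self, lfn, pfn):
        """Remove one mapping; the LFN disappears with its last mapping."""
        pfns = self._maps[lfn]
        pfns.remove(pfn)
        if not pfns:
            del self._maps[lfn]

    def query(self, lfn):
        return list(self._maps.get(lfn, []))


lrc = LocalReplicaCatalog()
lrc.create("file1", "gsiftp://dataserver/file1")
lrc.add("file1", "file://dataserver/file1")
print(lrc.query("file1"))   # two PFNs for one LFN

lrc.delete("file1", "file://dataserver/file1")
lrc.delete("file1", "gsiftp://dataserver/file1")
print(lrc.query("file1"))   # []: the LFN went away with its last mapping
```

Keeping the invariant "an LFN always has at least one PFN" inside delete means callers never have to clean up empty entries themselves.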

June 21-25, 2004 Lecture4: Grid Data Management 52 Globus-rls-cli LRC can also store attributes about LFN and PFNs  size of LFN in bytes?  md5 checksum for a LFN?  ranking for a PFN or URL?  extensible...you choose attributes to create and add  can search catalog on the attributes  attributes limited to strings integers floating point (double) date/time

June 21-25, 2004 Lecture4: Grid Data Management 53 Globus-rls-cli
Create the attribute first, then add values for LFNs
$ globus-rls-cli attribute define md5checksum lfn string rls://dataserver
$ globus-rls-cli attribute add file1 md5checksum lfn string 42947c86b8a08f067b178d56a77b2650 rls://dataserver
Then query on the attribute
$ globus-rls-cli attribute query file1 md5checksum lfn rls://dataserver
md5checksum: string: 42947c86b8a08f067b178d56a77b2650

June 21-25, 2004 Lecture4: Grid Data Management 54 Three Data Questions on the Grid 1. What data/files exist? 2. What data/files are where? 3. How do I move data/files from A to B?

June 21-25, 2004 Lecture4: Grid Data Management 55 Metadata Catalog Metadata catalog  store data about...data!  help answer question about what data exists MCS from Globus still a research project  One realization of a metadata catalog  other projects offer solutions with different capabilities and limitations  very active research on what type of service a metadata catalog should offer  how should metadata information flow from site to site?  is there a single solution for most uses on the Grid?

June 21-25, 2004 Lecture4: Grid Data Management 56 Metadata Catalog
One scenario useful in a Data Grid
  data generated/collected into files at some detector site
  location of data files published into RLS
    H-R gwf → gsiftp://someserver/path/to/H-R gwf
  existence of data files and important metadata published into metadata catalog
    H-R gwf →
      data from detector in Hanford, WA
      raw data file contains all data (no downsampling)
      data starts at GPS time [...]
      file contains 16 seconds of data
      detector was in “science” mode with good noise properties
      a simulated pulsar signal was being injected at the time
      the operator on duty was D. Brown
      the calibration parameters are [...] = [...] and [...] = [...]
      and so on...

Metadata Catalog
To run an application that analyzes the data on the Grid
1. Query metadata catalog for LFNs that contain data of interest
   Q: “Show me files where interferometer was locked and calibration had  < 1.6 for GPS times from to ”
   A: H-R gwf H-R gwf H-R gwf H-R gwf H-R gwf H-R gwf H-R gwf H-R gwf
2. Query RLI catalog to find out where those LFNs/files are known about
   $ globus-rls-cli query rli lfn H-R gwf rls://dataserver
   H-R gwf: rls://ldas-cit.ligo.caltech.edu:39281

Metadata Catalog
3. Query LRC catalog to get URLs for those files of interest
   $ globus-rls-cli query lrc lfn H-R gwf rls://ldas-cit.ligo.caltech.edu:39281
   H-R gwf: gsiftp://ldas-cit.ligo.caltech.edu:15000/archive/S1/L0/LHO/H-R-7140/H-R gwf
4. Move files from storage to analysis site using GridFTP
   $ globus-url-copy -p 4 gsiftp://ldas-cit.ligo.caltech.edu:15000/archive/S1/L0/LHO/H-R-7140/H-R gwf gsiftp://hydra.phys.uwm.edu/skoranda/analysis1/H-R gwf
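The four steps chain together: metadata query, then RLI lookup, then LRC lookup, then transfer. The sketch below walks that chain over plain dictionaries to show how the output of each catalog feeds the next. Everything in it is a hypothetical stand-in: the LFNs (`lfn-a` and so on), the `example.org` hosts, the metadata fields, and the function names are assumptions for illustration, not real catalog contents or APIs. It stops at a transfer plan instead of invoking GridFTP.

```python
# Hypothetical metadata catalog: LFN -> descriptive metadata.
METADATA = {
    "lfn-a": {"site": "Hanford", "locked": True, "gps_start": 100},
    "lfn-b": {"site": "Hanford", "locked": False, "gps_start": 116},
    "lfn-c": {"site": "Hanford", "locked": True, "gps_start": 132},
}

# Hypothetical RLI: LFN -> LRC servers that know about it.
RLI = {
    "lfn-a": ["rls://lrc1.example.org"],
    "lfn-c": ["rls://lrc1.example.org"],
}

# Hypothetical LRCs: server -> {LFN -> physical file names (URLs)}.
LRC = {
    "rls://lrc1.example.org": {
        "lfn-a": ["gsiftp://store.example.org/archive/lfn-a"],
        "lfn-c": ["gsiftp://store.example.org/archive/lfn-c"],
    },
}


def find_lfns(gps_from, gps_to):
    """Step 1: ask the metadata catalog which files hold data of interest."""
    return sorted(lfn for lfn, meta in METADATA.items()
                  if meta["locked"] and gps_from <= meta["gps_start"] < gps_to)


def resolve(lfn):
    """Steps 2 and 3: RLI gives the LRCs, each LRC gives the URLs."""
    pfns = []
    for server in RLI.get(lfn, []):
        pfns.extend(LRC[server].get(lfn, []))
    return pfns


def plan_transfers(gps_from, gps_to, dest_prefix):
    """Step 4 (planning only): pair each source URL with a destination URL."""
    plan = []
    for lfn in find_lfns(gps_from, gps_to):
        pfns = resolve(lfn)
        if pfns:  # skip files no catalog can locate
            plan.append((pfns[0], f"{dest_prefix}/{lfn}"))
    return plan


for src, dest in plan_transfers(100, 200, "gsiftp://analysis.example.org/data"):
    print(src, "->", dest)
```

Each pair in the plan corresponds to one source/destination argument pair for a transfer tool such as globus-url-copy.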

Summary
Metadata catalog, Globus RLS, and Globus GridFTP provide a powerful way to manage data on the Grid and do more science
• figure out what data/files are needed
• find it
• move it
• do science with it!

But… What about a higher-level tool?
We want something that will…
• Locate the data
• Send data to processing sites
• Share the results with other sites
• Allocate and de-allocate storage
• Clean up everything
• Do these reliably, efficiently, and without human supervision

Stork
A scheduler for data placement activities in the Grid
What Condor is for computational jobs, Stork is for data placement
Stork comes with a new concept: “Make data placement a first class citizen in the Grid.”

The Concept
[Figure: the usual three-step view of a job (Stage-in, Execute the job, Stage-out) broken down into individual jobs: Allocate space for input & output data, Stage-in, Execute the job, Release input space, Stage-out, Release output space.]

The Concept
[Figure: the same individual jobs (Allocate space for input & output data, Stage-in, Execute the job, Release input space, Stage-out, Release output space), now each labeled as either a Data Placement Job or a Computational Job.]

The Concept
DAG specification
  DaP A A.submit
  DaP B B.submit
  Job C C.submit
  …..
  Parent A child B
  Parent B child C
  Parent C child D, E
  …..
[Figure: DAGMan reads the DAG specification (nodes A to F) and feeds two queues, the Condor Job Queue and the Stork Job Queue.]
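How a DAG specification like this turns into an ordered stream of work for two queues can be sketched as follows. This is a simplified illustration, not DAGMan's parser or scheduler: it only understands the three line shapes shown on the slide, it assumes that `DaP` nodes go to Stork and `Job` nodes go to Condor, and it assumes nodes D and E are data placement jobs because the slide elides their declarations. Ordering uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# The slide's example; the declarations for D and E are assumptions,
# since the slide elides them.
DAG_SPEC = """\
DaP A A.submit
DaP B B.submit
Job C C.submit
DaP D D.submit
DaP E E.submit
Parent A child B
Parent B child C
Parent C child D, E
"""


def parse_dag(text):
    """Return (node -> kind, node -> set of parents) for a simplified DAG file."""
    kinds, parents_of = {}, {}
    for line in text.splitlines():
        words = line.replace(",", " ").split()
        if not words:
            continue
        if words[0] in ("DaP", "Job"):
            kinds[words[1]] = words[0]
            parents_of.setdefault(words[1], set())
        elif words[0] == "Parent":
            split = words.index("child")
            for child in words[split + 1:]:
                parents_of.setdefault(child, set()).update(words[1:split])
    return kinds, parents_of


def dispatch_order(text):
    """List (node, queue) pairs in an order that respects every dependency."""
    kinds, parents_of = parse_dag(text)
    order = TopologicalSorter(parents_of).static_order()
    return [(node, "stork" if kinds[node] == "DaP" else "condor")
            for node in order]


for node, queue in dispatch_order(DAG_SPEC):
    print(f"{node} -> {queue} queue")
```

A real scheduler would release a child only after its parents have actually finished; the static order here just shows that such an order exists and which queue each node belongs to.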

Why Stork?
• Stork understands the characteristics and semantics of data placement jobs.
• It can make smart scheduling decisions for reliable and efficient data placement.
• It integrates seamlessly with Condor-G.

Failure Recovery and Efficient Resource Utilization
Fault tolerance
• Just submit a bunch of data placement jobs, and then go away.
Control the number of concurrent transfers from/to any storage system
• Prevents overloading
Space allocation and de-allocation
• Makes sure space is available
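Capping concurrent transfers per storage system can be illustrated with a small scheduling sketch. It is not Stork's algorithm: it simply groups pending transfers into rounds so that no host takes part in more than `max_per_host` transfers in the same round. The function name, the round-based model, and the `example.org` URLs are assumptions made for illustration.

```python
from collections import Counter
from urllib.parse import urlparse


def schedule_rounds(transfers, max_per_host):
    """Group (src_url, dest_url) pairs into rounds with a per-host cap."""
    if max_per_host < 1:
        raise ValueError("max_per_host must be at least 1")
    pending = list(transfers)
    rounds = []
    while pending:
        load = Counter()  # host -> transfers already placed in this round
        this_round, deferred = [], []
        for src, dest in pending:
            hosts = {urlparse(src).hostname, urlparse(dest).hostname}
            if all(load[h] < max_per_host for h in hosts):
                for h in hosts:
                    load[h] += 1
                this_round.append((src, dest))
            else:
                deferred.append((src, dest))  # wait for a later round
        rounds.append(this_round)
        pending = deferred
    return rounds


transfers = [
    ("gsiftp://store.example.org/f1", "gsiftp://site-a.example.org/f1"),
    ("gsiftp://store.example.org/f2", "gsiftp://site-b.example.org/f2"),
    ("gsiftp://store.example.org/f3", "gsiftp://site-c.example.org/f3"),
]
for number, batch in enumerate(schedule_rounds(transfers, max_per_host=2), 1):
    print("round", number, [src for src, _ in batch])
```

With a cap of 2 on the shared source host, the third transfer waits for the second round instead of overloading the storage system.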

Support for Heterogeneity
Protocol translation using Stork memory buffer.

Support for Heterogeneity
Protocol translation using Stork Disk Cache.

Flexible Job Representation and Multilevel Policy Support
[
  Type = "Transfer";
  Src_Url = "srb://ghidorac.sdsc.edu/kosart.condor/x.dat";
  Dest_Url = "nest://turkey.cs.wisc.edu/kosart/x.dat";
  ……
  Max_Retry = 10;
  Restart_in = "2 hours";
]
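A retry policy such as `Max_Retry = 10` can be sketched as a small wrapper around a transfer attempt. This is an illustration under stated assumptions, not Stork's implementation: `run_with_retry` and `flaky_transfer` are hypothetical names, a failed attempt is modeled as an `OSError`, and `max_retry` is read as the number of retries allowed after the first attempt. The `Restart_in` time limit is not modeled.

```python
def run_with_retry(transfer, max_retry):
    """Call transfer() until it succeeds or max_retry retries are used up.

    Returns (result, number of attempts made).
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            return transfer(), attempts
        except OSError as error:
            # Give up once the first attempt plus max_retry retries have failed.
            if attempts > max_retry:
                raise RuntimeError(f"gave up after {attempts} attempts") from error


# Hypothetical transfer that fails twice before succeeding.
calls = {"count": 0}


def flaky_transfer():
    calls["count"] += 1
    if calls["count"] < 3:
        raise OSError("connection reset")
    return "done"


result = run_with_retry(flaky_transfer, max_retry=10)
print(result)
```

Keeping the policy in the job description means the person who submits the job does not have to watch it and resubmit by hand.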

Run-time Adaptation
Dynamic protocol selection
[
  dap_type = "transfer";
  src_url = "drouter://slic04.sdsc.edu/tmp/test.dat";
  dest_url = "drouter://quest2.ncsa.uiuc.edu/tmp/test.dat";
  alt_protocols = "nest-nest, gsiftp-gsiftp";
]
[
  dap_type = "transfer";
  src_url = "any://slic04.sdsc.edu/tmp/test.dat";
  dest_url = "any://quest2.ncsa.uiuc.edu/tmp/test.dat";
]
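The effect of `alt_protocols` can be sketched as an ordered list of source/destination URL pairs to try. The code below is an illustration, not Stork's logic: it assumes each `alt_protocols` entry has the form `source-destination`, that a fallback keeps the host and path and only swaps the scheme, and that a caller-supplied `try_transfer` callback (hypothetical) reports success or failure. The `any://` form, where the scheduler chooses the protocol itself, is not modeled.

```python
def swap_scheme(url, scheme):
    """Replace the protocol part of a URL, keeping host and path."""
    return f"{scheme}://{url.split('://', 1)[1]}"


def candidate_urls(src_url, dest_url, alt_protocols=""):
    """Return URL pairs to try: the requested protocols first, then fallbacks."""
    pairs = [(src_url, dest_url)]
    for entry in alt_protocols.split(","):
        entry = entry.strip()
        if entry:
            src_scheme, dest_scheme = entry.split("-")
            pairs.append((swap_scheme(src_url, src_scheme),
                          swap_scheme(dest_url, dest_scheme)))
    return pairs


def transfer_with_fallback(src_url, dest_url, alt_protocols, try_transfer):
    """Try each candidate pair in order; return the first that succeeds."""
    for src, dest in candidate_urls(src_url, dest_url, alt_protocols):
        if try_transfer(src, dest):
            return src, dest
    return None


pairs = candidate_urls(
    "drouter://slic04.sdsc.edu/tmp/test.dat",
    "drouter://quest2.ncsa.uiuc.edu/tmp/test.dat",
    "nest-nest, gsiftp-gsiftp",
)
for src, dest in pairs:
    print(src, "->", dest)
```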

Run-time Adaptation
Run-time Protocol Auto-tuning
[
  link = "slic04.sdsc.edu – quest2.ncsa.uiuc.edu";
  protocol = "gsiftp";
  bs = 1024KB;     // block size
  tcp_bs = 1024KB; // TCP buffer size
  p = 4;           // number of parallel streams
]
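One common rule of thumb for choosing a TCP buffer size is the bandwidth-delay product: the buffer should hold roughly as much data as the link can carry during one round trip, divided among the parallel streams. The sketch below applies that rule. It is not necessarily how Stork derives its tuned values, and the function name and the example link figures (100 Mbit/s, 80 ms) are assumptions for illustration.

```python
def tcp_buffer_bytes(bandwidth_bits_per_s, rtt_ms, streams=1):
    """Per-stream TCP buffer size from the bandwidth-delay product.

    Integer arithmetic: bits/s * ms / 1000 gives bits in flight,
    and dividing by 8 converts bits to bytes.
    """
    if streams < 1:
        raise ValueError("streams must be at least 1")
    bdp_bytes = bandwidth_bits_per_s * rtt_ms // 8000
    return bdp_bytes // streams


# Assumed example link: 100 Mbit/s with an 80 ms round-trip time.
print(tcp_buffer_bytes(100_000_000, 80))             # single stream
print(tcp_buffer_bytes(100_000_000, 80, streams=4))  # four parallel streams
```

More parallel streams mean each stream needs a smaller buffer to keep the link full, which is why the buffer size and `p` are tuned together.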