
1 EGEE-II INFSO-RI-031688 Enabling Grids for E-sciencE www.eu-egee.org WLCG File Transfer Service Sophie Lemaitre – Gavin Mccance Joint EGEE and OSG Workshop on Data Handling in Production Grids, Monterey 25 June 2007

2 FTS overview
The File Transfer Service (FTS) is a data movement fabric service
  – It is a multi-VO service, used to balance usage of site resources according to VO and site policies
Why is it needed?
  – For the user, it provides reliable point-to-point movement of files
  – For the site manager, it provides a reliable and manageable way of serving file movement requests from their experiments
  – For the production manager, it provides the ability to control requests coming from their users
    • Re-ordering, prioritization, …
  – The focus is on the “service”
    • It should make it easy to do these things well

3 Who uses it (1)
The sites use it as part of their fabric
  – It’s designed to make it easier for a multi-VO site to run the transfers of its VOs
  – Tier-1 sites run the FTS servers and are responsible for processing the transfer requests from tier-2s and transferring data between tier-1s
  – Tier-0 export is run from CERN
  – The focus is on the service delivered, the ease of manageability and service monitoring

4 Who uses it (2)
FTS is used by experiment frameworks
  – Typically end-users do not interact directly with it; they interact with their experiment framework
  – Production managers sometimes query it directly to debug / chase problems
The experiment framework decides it wants to move a set of files
  – The experiment framework is responsible for staging-in (for now)
  – It packages up a set of source/destination file pairs and submits transfer jobs to FTS
  – The state of each job is tracked as it progresses through the various transfer stages
  – The experiment framework can poll the status at any time

5 Service APIs
FTS has three basic API groups:
Job submission / tracking
  – Used by experiment frameworks to submit requests
Service / channel management
  – Used by admins and VO production managers to control the service
Statistics tools
  – Provide aggregate statistics on what the service has been doing, current failure rates and classes, etc.
  – This is being done as part of the WLCG monitoring group to make sure the information is available to all interested stakeholders

6 Security model
Transfers are always run using the user’s credential
  – VOMS credential is now used (and renewed as necessary) in FTS 2.0
Authorization to the service is done using:
  – Grid mapfile mechanism, or
  – VOMS role
VO production manager roles
Channel administrator roles
Service manager role

7 User API
Uses a submit / poll pattern with a unique job ID
  – Jobs can contain multiple copy requests
Various polling methods with different levels of detail
  – Overall job status (“is it done yet?”)
  – Job summary
  – Detailed status of individual files, including failures
Job cancellation and priority reshuffling by suitably authorised users
  – i.e. VO production managers
No notification mechanism yet
  – The submit/poll pattern isn’t so efficient…
Much commonality with the Globus RFT API
  – We’ve been talking…
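The submit / poll pattern described on this slide can be sketched as follows. This is a minimal illustration only: `FakeTransferService`, its method names, the state names ("Pending", "Active", "Done") and the `srm://…example` URLs are all hypothetical stand-ins, not the real FTS client API.

```python
import itertools
import time


class FakeTransferService:
    """Hypothetical in-memory stand-in for a transfer service (NOT the FTS API)."""

    def __init__(self):
        self._jobs = {}
        self._ids = itertools.count(1)

    def submit(self, pairs):
        """Accept (source, destination) pairs and return a unique job ID."""
        job_id = f"job-{next(self._ids)}"
        self._jobs[job_id] = {tuple(pair): "Pending" for pair in pairs}
        return job_id

    def status(self, job_id):
        """Overall job status; each poll pretends one more file has finished."""
        files = self._jobs[job_id]
        for pair, state in files.items():
            if state == "Pending":
                files[pair] = "Done"
                break
        return "Done" if all(s == "Done" for s in files.values()) else "Active"


def poll_until_done(service, job_id, interval=0.0, max_polls=100):
    """Poll the overall job status until it is 'Done'; return (state, polls used)."""
    for attempt in range(1, max_polls + 1):
        state = service.status(job_id)
        if state == "Done":
            return state, attempt
        time.sleep(interval)  # the client has to keep asking: no notifications
    raise TimeoutError(f"{job_id} not finished after {max_polls} polls")


if __name__ == "__main__":
    service = FakeTransferService()
    job = service.submit([
        ("srm://site-a.example/f1", "srm://site-b.example/f1"),
        ("srm://site-a.example/f2", "srm://site-b.example/f2"),
    ])
    print(job, poll_until_done(service, job))
```

The polling loop is where the inefficiency noted on the slide shows up: the client must repeatedly ask for status because the service does not push state changes.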

8 Channel concept
For ease of management, the service supports splitting jobs onto multiple “channels”
  – Once a job is submitted to the FTS, it is assigned to a suitable channel for serving
A channel may be:
  – A point-to-point network link (e.g. we manage all the T0-export links in WLCG on a separate channel)
  – Various “catch-all” channels
    • e.g. everything else coming to me, or everything to one of my tier-2 sites
  – More flexible “grouping of sites” channel definitions are on the way
Channels are uni-directional
  – e.g. at CERN we have one set for the export and one set for the import
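One way to picture the assignment of a job to a channel is the sketch below. It assumes that a point-to-point channel is preferred over a catch-all one; the slide does not state the actual matching rule, and the `Channel` type, the `"*"` wildcard and the channel names are hypothetical.

```python
from typing import NamedTuple, Optional


class Channel(NamedTuple):
    name: str
    source: str       # site name, or "*" for "any site" (assumed notation)
    destination: str  # site name, or "*" for "any site"


def assign_channel(channels, source, destination) -> Optional[Channel]:
    """Pick the most specific channel that serves source -> destination.

    Channels are uni-directional, so the reverse direction never matches.
    """
    def specificity(channel):
        # Count how many endpoints are named explicitly rather than wildcarded.
        return (channel.source != "*") + (channel.destination != "*")

    matches = [
        ch for ch in channels
        if ch.source in (source, "*") and ch.destination in (destination, "*")
    ]
    return max(matches, key=specificity, default=None)


if __name__ == "__main__":
    channels = [
        Channel("CERN-RAL", "CERN", "RAL"),  # point-to-point export link
        Channel("STAR-RAL", "*", "RAL"),     # catch-all: everything else into RAL
    ]
    print(assign_channel(channels, "CERN", "RAL"))
    print(assign_channel(channels, "FNAL", "RAL"))
    print(assign_channel(channels, "RAL", "CERN"))
```

The last call returns `None` because neither channel is defined in the RAL to CERN direction, which is why a site needs separate export and import sets.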

9 Channels…
“Channel”: it’s not a great name
  – It always causes confusion... (but we’re ~stuck with the name now)
  – It isn’t tied to a physical network path
  – It’s just a management concept
  – “Queue” might be a better name
All file transfer jobs on the same channel are served as part of the same queue
  – Inter-VO priorities for the queue (e.g. ATLAS gets 75%, CMS gets the rest)
  – Intra-VO priorities within a VO
  – Each channel has its own set of transfer parameters
  – Number of concurrent files running, number of streams, TCP buffer size, etc.
Given the transfers your FTS server is required to support (as defined by experiment computing models and WLCG), channels allow you to split up the management of these as you see fit
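The inter-VO share idea (ATLAS 75%, CMS the rest) can be illustrated by dividing a channel's concurrent-file slots in proportion to the shares. This is a sketch under assumptions: the slide does not say how FTS enforces shares, and `split_slots` with its largest-remainder rounding is a hypothetical helper chosen for illustration.

```python
def split_slots(total_slots, shares):
    """Divide transfer slots among VOs in proportion to their shares.

    Uses largest-remainder rounding so the result always sums to total_slots.
    """
    total_share = sum(shares.values())
    exact = {vo: total_slots * share / total_share for vo, share in shares.items()}
    slots = {vo: int(value) for vo, value in exact.items()}
    leftover = total_slots - sum(slots.values())
    # Hand remaining slots to the VOs with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda vo: exact[vo] - slots[vo], reverse=True)
    for vo in by_remainder[:leftover]:
        slots[vo] += 1
    return slots


if __name__ == "__main__":
    # Share values taken from the slide's example; the slot count is made up.
    print(split_slots(20, {"atlas": 75, "cms": 25}))
```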

10 FTS topology
Simplified tiered infrastructure
FTS servers are located at CERN and Tier-1 sites
To provide full “coverage”, WLCG defines what transfers a given FTS server has to support
FTS servers are independent

11 FTS and data scheduling
FTS provides the reliable and manageable transport layer
It does not (and will not) provide more complex data scheduling
  – Multi-hop transfers
  – Broadcast transfers
  – Dataset collation
But it may be used as the underlying management layer for services providing this
Much of this extra functionality is currently provided in the experiment layer
  – It’s quite computing model dependent
  – e.g. PhEDEx from CMS

12 FTS server architecture
All components are decoupled from each other
  – Each interacts only with the database
Experiments interact via a web-service
VO agents do VO-specific operations (1 per VO)
Channel agents do channel-specific operations (e.g. the transfers)
Monitoring and statistics can be collected via the DB
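The "each component interacts only with the database" design can be sketched with a toy example. Everything here is an assumption made for illustration: the table layout, the state names and the function names are hypothetical, and SQLite stands in for the Oracle database used in production.

```python
import sqlite3


def open_db():
    """Create an in-memory database with a hypothetical transfers table."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE transfers ("
        "id INTEGER PRIMARY KEY, vo TEXT, channel TEXT, "
        "source TEXT, destination TEXT, state TEXT)"
    )
    return db


def frontend_submit(db, vo, channel, source, destination):
    """Web-service role: only records the request, never performs a transfer."""
    cursor = db.execute(
        "INSERT INTO transfers (vo, channel, source, destination, state) "
        "VALUES (?, ?, ?, ?, 'Submitted')",
        (vo, channel, source, destination),
    )
    db.commit()
    return cursor.lastrowid


def channel_agent_step(db, channel):
    """Channel-agent role: pick up submitted work for one channel and finish it."""
    rows = db.execute(
        "SELECT id FROM transfers WHERE channel = ? AND state = 'Submitted' "
        "ORDER BY id",
        (channel,),
    ).fetchall()
    for (transfer_id,) in rows:
        # A real agent would run the copy here; the sketch just marks it done.
        db.execute("UPDATE transfers SET state = 'Done' WHERE id = ?", (transfer_id,))
    db.commit()
    return len(rows)


def monitoring_summary(db):
    """Monitoring role: read-only aggregate statistics straight from the DB."""
    return dict(db.execute("SELECT state, COUNT(*) FROM transfers GROUP BY state"))


if __name__ == "__main__":
    db = open_db()
    frontend_submit(db, "atlas", "CERN-RAL", "srm://a.example/f1", "srm://b.example/f1")
    frontend_submit(db, "cms", "CERN-FNAL", "srm://a.example/f2", "srm://c.example/f2")
    channel_agent_step(db, "CERN-RAL")
    print(monitoring_summary(db))
```

Because the three roles share nothing except the database, each can be restarted or scaled out without the others noticing, which is the property the slide emphasises.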

13 FTS server architecture
Designed for high availability and scalability
User front-end web-service is stateless and (should be) load balanced to provide availability and scalability
  – Service interventions that don’t require a DB schema upgrade can be made with zero user-visible downtime
Agent daemons are designed to scale over multiple nodes as necessary with load
Critical component is the central DB
  – WLCG production services run on Oracle RAC to provide availability and scalability

14 FTS 2.0
FTS 2.0 server: new features
  – Delegation of proxy from the client to the FTS service
  – Improved monitoring capabilities
    • Critical to the operational stability of the ‘overall transfer service’
    • Much more data retained in the database, with some new methods to access it in the admin API
  – Beta SRM 2.2 support
    • This is now being tested on the EGEE pre-production service as part of the SRM 2.2 testing activity
  – Better administration tools
    • Make it easier to run the service
  – Better database model
    • Improve the performance and scalability
  – Placeholders for future functionality
    • Minimise the impact of future upgrade interventions

15 FTS developments
  – Evolve the SRM 2.2 code as we understand the SRM 2.2 implementations (based on feedback from PPS)
  – Incrementally improve service monitoring
    • FTS will have the capacity to give very detailed measurements about the current service level and problems currently being observed with sites
    • Integration with experiment and operations dashboards
    • Design work ongoing
  – Site grouping in channel definition (“clouds”)
    • To make it easier to implement the computing models of CMS and ALICE
    • Code exists: to be tested on pilot service
  – Incrementally improve service administration tools
  – SRM/gridFTP split
  – Notification of job state changes
Not planned
  – Not planning to produce a non-Oracle version
    • Sites with lower production requirements can use restricted Oracle XE

16 FTS current status
Current FTS production status
  – CERN has just moved to FTS 2.0
  – All T1 sites are currently using FTS 1.5
  – More than 10 petabytes exported from CERN since SC4
  – A few more petabytes moved between tier-1 sites and from tier-1 to tier-2 sites
FTS infrastructure runs well
  – CERN and T1 sites ~understand the software
  – Most problems ironed out last year
  – Remainder of the problems understood with experiments and we have a plan to address them
There are still problems with the ‘overall transfer service’

17 Issues 1
“There are still problems with overall transfer service”
The overall system is very complex
  – Understanding the cross-site “end to end transfer service” is still an issue
    • Experiment layer, FTS, SRM at source, SRM at destination, gridFTP servers, network, tape backends
    • It can be done, but the manpower required is significant and is not sustainable in the long term
  – The number of retries needed to get files from A to B is still rather high: reduced efficiency
  – Improving services’ stability is critical (FTS included)
  – Monitoring will help
    • “Understanding the whole system” is our primary focus
    • Can we coordinate the logging / monitoring of FTS and SRMs to improve this situation?
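A rough way to see why a long chain of components leads to many retries: if each component in the chain succeeds independently with some probability, the end-to-end success rate is the product of those probabilities, and the expected number of attempts per file is its reciprocal. The sketch below assumes independence and uses a made-up per-component rate of 0.95; neither is a measurement from the talk.

```python
import math


def end_to_end_success(component_success_rates):
    """Probability that one attempt succeeds, assuming independent components."""
    return math.prod(component_success_rates)


def expected_attempts(success_probability):
    """Mean attempts until first success (geometric distribution)."""
    if not 0 < success_probability <= 1:
        raise ValueError("success probability must be in (0, 1]")
    return 1 / success_probability


if __name__ == "__main__":
    # Seven components, as listed on the slide: experiment layer, FTS,
    # source SRM, destination SRM, gridFTP servers, network, tape backends.
    rates = [0.95] * 7  # illustrative value only
    p = end_to_end_success(rates)
    print(f"end-to-end success per attempt: {p:.3f}")
    print(f"expected attempts per file:     {expected_attempts(p):.2f}")
```

Even with every component looking healthy on its own, the chained success rate drops noticeably, which is consistent with the slide's point that stability of every service in the chain matters.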

18 Issues 2
Behaviour under error conditions is different for different SRM implementations
  – This took a lot of effort to resolve in SRM 1.1
  – The hope is that the SRM 2.2 standard is better in this regard
Still, a conservative deployment schedule must anticipate problems of this type for SRM 2.2 deployment in production
  – The “overall production service” will not be stable until any such integration problems are understood

19 Issues 3
FTS easily lets you throttle channels writing to your storage
  – This was a deployment choice of WLCG
  – But source overloading is still a problem
    • Recently reported by ATLAS (e.g. BNL)
  – It would be good if the SRMs could indicate their busy-ness to FTS by some mechanism, so it could back off
    • The other proposed solution, having all the FTS servers and other SRM clients cooperate (in a data scheduler model) so as not to overload an SRM, is not seen as credible by WLCG
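The "back off when the SRM says it is busy" idea could look something like the sketch below. It is purely hypothetical: the slide describes a wish, not an existing mechanism, and the additive-increase / halve-on-busy rule, the function name and the limits are all assumptions chosen for illustration.

```python
def next_concurrency(current, storage_busy, minimum=1, maximum=20):
    """Adjust the number of concurrent transfers on a channel.

    Halve the concurrency when the storage reports it is busy,
    otherwise probe upwards by one transfer at a time.
    """
    if storage_busy:
        return max(minimum, current // 2)
    return min(maximum, current + 1)


if __name__ == "__main__":
    concurrency = 10
    # Hypothetical sequence of busy signals from a source SRM.
    for busy in [False, False, True, True, False]:
        concurrency = next_concurrency(concurrency, busy)
        print(f"busy={busy!s:5} -> concurrency={concurrency}")
```

The appeal of a rule like this is that each FTS server reacts only to a local signal, so no coordination between servers or other SRM clients is needed.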

20 Summary
FTS is designed as a highly available and scalable service to help sites manage the file transfer requests from their VOs
Focus is upon service management
Current WLCG FTS infrastructure runs well
Problems with the “overall transfer service”
  – Complexity: cross-site debugging is expensive
  – Resilience: too easy to overload services; ‘standard’ interfaces are not always quite standard, especially under error conditions
This is where we need to focus

