Giuseppe Lo Re Workshop Storage INFN 20/03/2006 – CNAF (Bologna)

1 Giuseppe Lo Re Workshop Storage INFN 20/03/2006 – CNAF (Bologna)

2 Outline
- CASTOR1 at CNAF
- CASTOR2 architecture
- CASTOR2 deployment at CNAF
- Test results
- Conclusions

3 CASTOR1
[Architecture diagram] The client (rfopen(…)) talks to one of several stager instances and to RFIOD on the disk servers; file names under /castor/cnaf.infn.it/… are resolved by the NameServer. Tape access goes through VDQM (drive queues) and VMGR (volumes: tape1, tape2, tape3, …) to the tape servers (drive1, drive2, …), where RTCPD and TapeDaemon move data between disk servers and tape.

4 Disk and tape resources
Disk:
- 7 stagers: 4 LHC, 2 non-LHC, 1 SC
- ~15 disk servers
- Total = 75 TB
Tape:
- 10 tape servers, STK 5500 library
- 6 LTO-2 drives with 1200 tapes (240 TB)
- 4 9940B drives (+3 to be installed in the next weeks) with tapes (130 TB => 260 TB)
- Total = 224 TB

5 CASTOR2 Architecture (1)
[Architecture diagram] Client requests (rfopen(…)) enter through the Request Handler (RH); the Stager, Scheduler and rmmaster coordinate around a central Oracle DB. On the disk servers run the Mover, StagerJob and the garbage collector (GC). MigHunter feeds migration streams; RTCPClientD talks to RTCPD and TapeDaemon on the tape servers, with VMGR and VDQM managing volumes and drive queues. The NameServer resolves file names, and all components log to DLF.

6 CASTOR 2 Architecture (2)
Database-centric architecture
- “Surrounding” daemons are stateless
- Important operational decisions can be translated into SQL statements (see the sketch below):
  - preparation of migration or recall streams
  - weighting of file systems used for migration/recall
  - draining of disk servers or file systems
  - garbage collection decisions
- Two databases supported: Oracle and MySQL (not ready)
Request throttling thanks to the Request Handler
Stateless components can be restarted/parallelized easily -> no single point of failure
- stager split into many independent services
- distinction between queries, user requests and admin requests
- fully scalable
Disk access is scheduled
- all user requests are scheduled
- advanced scheduling features for ‘free’ (e.g. fair-share)
- 2 schedulers provided: LSF and Maui (not ready)
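As an illustration (not from the original slides): a minimal sketch of how "draining a disk server" could be expressed as a single SQL statement, assuming a hypothetical simplified schema (DiskServer and FileSystem tables with numeric status codes); the real CASTOR2 catalogue schema may differ.

```sql
-- Hypothetical sketch: drain a disk server by marking all of its file
-- systems as ineligible for new streams. Table/column names and the
-- status code are illustrative, not the real CASTOR2 schema.
UPDATE FileSystem
   SET status = 2                         -- e.g. FILESYSTEM_DRAINING
 WHERE diskServer = (SELECT id FROM DiskServer
                      WHERE name = 'diskserv-san-33');
```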

7 CASTOR 2 Architecture (3)
Pluggable mover
- the client can choose its mover
- rfio and rootd supported at this time; xrootd to come?
Dynamic migration/recall streams
- multiple concurrent requests for the same volume are processed together
- new requests arriving after the stream has started are automatically added to the stream (see the sketch below)
Configurable garbage collector
- the policy depends on the service class
- the decision is implemented in SQL
- the framework automatically deletes marked files
Distributed Logging Facility used by all components
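To illustrate the dynamic-stream idea, a minimal SQL sketch of attaching late arrivals to a running stream, again assuming a hypothetical simplified schema (a Stream2TapeCopy association table, a tapePool column on both sides, and illustrative status codes):

```sql
-- Hypothetical sketch: migration candidates that arrive after a stream
-- has started are attached to the already-running stream so they join
-- the mount in progress. Schema and status codes are illustrative.
INSERT INTO Stream2TapeCopy (parent, child)
SELECT s.id, tc.id
  FROM Stream s, TapeCopy tc
 WHERE s.status  = 3                      -- e.g. STREAM_RUNNING
   AND tc.status = 2                      -- e.g. TAPECOPY_WAITINSTREAMS
   AND tc.tapePool = s.tapePool           -- same destination tape pool
   AND NOT EXISTS (SELECT 1 FROM Stream2TapeCopy stc
                    WHERE stc.parent = s.id
                      AND stc.child  = tc.id);  -- not already linked
```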

8 CNAF deployment
[Deployment diagram]
- Databases on oracle01: stagerdb (Oracle/RHE), dlfdb (Oracle/SL), ns db (Oracle/RHE)
- castor-4: Castor1 services (vdqm, vmgr, ns, cupvd)
- castor: RH, stager, MigHunter, rtcpclientd
- castor-6: DLF, rmmaster, expertd
- castorlsf01: LSF master
- diskserv-san-13 also appears in the setup
- Disk servers: diskserv-san-33, diskserv-san-34, diskserv-san-35, diskserv-san-36, each with 2x2 TB
- Same tape servers as for Castor1

9 Test (1)
Write: 100 jobs writing 1 GB files; average rate 2.3 MB/s per job, total rate 230 MB/s.
Read: 100 jobs reading 1 GB files; average rate 2.7 MB/s per job, total rate 270 MB/s.

10 Test (2)
[Figure: write and read throughput plots]

11 Conclusions
CASTOR2 services are stable and work fine. The admin interfaces are not yet mature. The installation is not easy, but quattor helps a lot. There are no (known) limits on the number of files on disk, and the system provides a better logic for tape recalls (fewer rewinds, mounts and dismounts).
Some more work is needed before production:
- DB configuration (archive log rotation, tablespace sizes, backups)
- tuning of the number of LSF slots for the disk servers
- experience with admin tasks such as draining file systems and servers
- evaluating the LSF+Oracle overhead with small files
The next stress test will be the throughput phase of SC4.

12

13 Client
Several interfaces provided
- rfcp and the RFIO API are backward compatible and can talk to new and old stagers
- the stager API and commands are not backward compatible
No pending connection
- the client opens a port, calls the Request Handler and closes the connection
- it then waits on the port for the Request Replier
SRM provided
- backward compatible, can talk to new and old stagers

14 Migration
The decision which file to migrate, and when, is entirely externalized to one or several ‘hunter’ processes.
Files to be migrated are linked to one or more “Streams”
- a Stream is a container of files to be migrated
- a migration candidate is a TapeCopy with status TAPECOPY_WAITINSTREAMS that is associated with one or several Streams
- when a migration candidate is associated with several Streams, it will be picked up by one of them; this allows for almost 100% drive utilization
A running Stream is associated with a Tape
- however, the same Stream may ‘survive’ several Tapes
- the Stream is destroyed when there are no more eligible TapeCopy candidates
Stream creation and linking of migration candidates to streams are pure DB operations (see the sketch below)
- they could be performed directly with a SQL script
- several scripts for different policies can work concurrently
- a default ‘MigHunter’ is provided
Optionally supports a legacy mode emulating the current CASTOR stager; the new stager will NOT segment files.
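Since the slide says stream creation and candidate linking are pure DB operations that could be done from a SQL script, here is a minimal sketch of what such a script might do, with hypothetical simplified table, sequence and column names (the real CASTOR2 schema may differ):

```sql
-- Hypothetical sketch of a 'hunter' SQL script: create a stream for a
-- tape pool and link all waiting migration candidates to it.
-- Names, the ids_seq sequence and status codes are illustrative.
DECLARE
  newStream NUMBER;
BEGIN
  INSERT INTO Stream (id, tapePool, status)
  VALUES (ids_seq.NEXTVAL, 42, 0)          -- 0 = e.g. STREAM_PENDING
  RETURNING id INTO newStream;

  INSERT INTO Stream2TapeCopy (parent, child)
  SELECT newStream, tc.id
    FROM TapeCopy tc
   WHERE tc.status = 2                     -- TAPECOPY_WAITINSTREAMS
     AND tc.tapePool = 42;                 -- same illustrative pool id
END;
/
```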

15 Recall
Recall differs from migration in that it is usually executed on demand: an active request is waiting for the file to be recalled. However, with the new architecture the decision what to recall, and when, can be externalized (‘hunter’ processes).
A recall candidate is a TapeCopy associated with tape Segments (Tape + Segment).
Requests requiring a recall are scheduled like normal file access requests -> the scheduler can be configured to prevent massive recall attacks.
The “job” simply puts the DiskCopy and SubRequest in WAITRECALL status, creates the tape Segment information in the catalogue and exits.
Use of the recallPolicy attribute in SvcClass (see the sketch below):
- if no recall policy is defined (the recallPolicy attribute is empty in the SvcClass), the job triggers the recall immediately by setting the Tape status to >= TAPE_PENDING
- if a recall policy is defined, the TapeCopy and tape Segment are created without modifying the Tape status:
  - if the tape is already mounted, it will automatically pick up the candidate
  - otherwise an offline process can later decide to trigger the recall by updating the Tape status
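A minimal sketch of the "trigger the recall by updating the Tape status" step described above, with hypothetical simplified names and status codes (the real CASTOR2 schema may differ):

```sql
-- Hypothetical sketch: an offline 'hunter' triggers the recall of
-- waiting candidates on a tape by raising the Tape status.
-- Names, the volume id and status codes are illustrative.
UPDATE Tape
   SET status = 1                      -- e.g. TAPE_PENDING: ask for a mount
 WHERE vid = 'T12345'                  -- illustrative volume id
   AND status = 0                      -- e.g. TAPE_UNUSED: not yet triggered
   AND EXISTS (SELECT 1 FROM Segment seg
                WHERE seg.tape = Tape.id);  -- waiting tape segments exist
```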

16 Garbage collection
As for migration/recall, disk file garbage collection is triggered via a DB update.
Disk files (DiskCopy class in the catalogue) to be garbage collected are marked with a special status: DISKCOPY_GCCANDIDATE.
gcdaemon retrieves the list of local files to be removed and updates the catalogue once the files have been removed (this can be lazy, since the DiskCopy is already marked for GC).
The GC policy deciding which files to remove is configurable per SvcClass and written in PL/SQL directly in the DB (see the sketch below).
A gcWeight attribute of the DiskCopy is provided for externally setting its weight, used when it is compared with other candidates.
- This could be based on experiment policies: e.g. all files beginning with “ABC” should be given a low weight for removal.
- By default all weights are zero.
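A minimal PL/SQL sketch of such a policy, assuming a hypothetical simplified DiskCopy table (id, fileSystem, fileSize, status, gcWeight) and illustrative status codes; the real CASTOR2 policies and schema may differ:

```sql
-- Hypothetical sketch of a per-SvcClass GC policy: mark the lowest
-- weighted disk copies on a file system as GC candidates until the
-- requested amount of space would be reclaimed. Schema is illustrative.
CREATE OR REPLACE PROCEDURE gcPolicy(fsId IN NUMBER,
                                     spaceNeeded IN NUMBER) AS
  freed NUMBER := 0;
BEGIN
  FOR dc IN (SELECT id, fileSize FROM DiskCopy
              WHERE fileSystem = fsId
                AND status = 0            -- e.g. DISKCOPY_STAGED
              ORDER BY gcWeight ASC)      -- lowest weight removed first
  LOOP
    EXIT WHEN freed >= spaceNeeded;
    UPDATE DiskCopy SET status = 8        -- e.g. DISKCOPY_GCCANDIDATE
     WHERE id = dc.id;
    freed := freed + dc.fileSize;
  END LOOP;
END;
/
```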

17 Internal file replication
Disk file replication of “hot” files is supported.
SvcClass attributes regulate the replication:
- maxReplicaNb limits the maximum number of replicas allowed by the SvcClass
- replicationPolicy names a policy to be called if maxReplicaNb is not defined (≤0)
Replication is performed on demand, when a job is started and the file is not on the scheduled file system
- if maxReplicaNb or replicationPolicy allows for it (see the sketch below); otherwise the file system will be forced via a job submission attribute
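A minimal sketch of the maxReplicaNb check, with hypothetical simplified names (a DiskCopy table holding one row per replica; the file id, SvcClass name and status code are illustrative):

```sql
-- Hypothetical sketch: a new replica of a hot file is allowed only
-- while the current replica count is below the SvcClass limit.
-- Names are illustrative, not the real CASTOR2 schema.
SELECT CASE
         WHEN (SELECT COUNT(*) FROM DiskCopy dc
                WHERE dc.castorFile = 1001        -- illustrative file id
                  AND dc.status = 0)              -- e.g. DISKCOPY_STAGED
              < sc.maxReplicaNb
         THEN 1 ELSE 0
       END AS replicationAllowed
  FROM SvcClass sc
 WHERE sc.name = 'default';                       -- illustrative SvcClass
```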

18 File system selection
The file system selection is called from several places:
- when scheduling access for a given client request
- when selecting the best migration candidate
- when selecting a file system for recalling a tape file
The FileSystem table has several attributes updated by external policies based on load and status monitoring (see the sketch below):
- “free” is the free space on the file system
- “weight” reflects the current load, calculated using an associated policy
- “fsDeviation” is the deviation subtracted from the weight every time a new stream is added to the file system; this ensures that the same file system is not selected twice
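A minimal sketch of a weight-based selection using the attributes named above, with a hypothetical simplified FileSystem table and illustrative status code (the real selection logic may differ):

```sql
-- Hypothetical sketch: pick the best (highest weight = least loaded)
-- eligible file system, then apply fsDeviation so the same file system
-- is not immediately selected again. Schema is illustrative.
DECLARE
  chosenFs NUMBER;
BEGIN
  SELECT id INTO chosenFs
    FROM (SELECT id FROM FileSystem
           WHERE free >= 1073741824        -- needs room for a 1 GB file
             AND status = 0                -- e.g. FILESYSTEM_PRODUCTION
           ORDER BY weight DESC)           -- highest weight first
   WHERE ROWNUM = 1;

  UPDATE FileSystem
     SET weight = weight - fsDeviation     -- penalize for the new stream
   WHERE id = chosenFs;
END;
/
```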

19 ‘Hunter’ processes
The ‘Hunter’ processes are not strictly part of the stager itself:
- they can be daemons or cron jobs that run offline and independently of other CASTOR servers
- they implement specific policies, either via calls to the expert system (expertd) or via SQL queries to the catalogue database
The action taken by a Hunter would normally result in the triggering of a CASTOR task, e.g.:
- migration
- recall
- garbage collection
- retry of tape exceptions
- internal replication

20 Logging facility
All new CASTOR components use the Distributed Logging Facility (DLF)
- logs to files and/or a database (Oracle or MySQL)
- web-based GUI for problem tracing using the DLF database (a query sketch is shown below)
The following services currently log to DLF:
- rhserver
- stager
- rtcpclientd, migrator, recaller
- stagerJob
- MigHunter
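When DLF logs to the database, problem tracing can also be done with plain SQL; a minimal sketch against a hypothetical simplified message table (the real DLF schema may differ):

```sql
-- Hypothetical sketch: count recent error-level messages per facility
-- in the DLF database. Table and column names are illustrative.
SELECT facility, COUNT(*) AS errors
  FROM dlf_messages
 WHERE severity <= 3                      -- e.g. error level or worse
   AND logtime > SYSDATE - 1/24           -- within the last hour
 GROUP BY facility
 ORDER BY errors DESC;
```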

21 DLF GUI

22 Instant performance views
Cmonitd has been part of CASTOR since 2002
- a central daemon collecting UDP messages from the tape movers (rtcpd) and the tape daemon (mount/unmount)
The original GUI was written in Python; it was rewritten in Java (Swing) in September ‘04
- Web Start
- drive performance time-series plots

23 Monitoring GUI

