Toward new HSM solution using GPFS/TSM/StoRM integration


1  Toward new HSM solution using GPFS/TSM/StoRM integration
Vladimir Sapunenko (INFN, CNAF), Luca dell’Agnello (INFN, CNAF), Daniele Gregori (INFN, CNAF), Riccardo Zappi (INFN, CNAF), Luca Magnoni (INFN, CNAF), Elisabetta Ronchieri (INFN, CNAF), Vincenzo Vagnoni (INFN, Bologna)
HEPiX 2008, Geneva, 07/05/2008

2  Storage classes @ CNAF
Implementation of 3 storage classes needed for LHC:
Disk0Tape1 (D0T1) → CASTOR
  Space managed by the system
  Data migrated to tape and deleted from disk when the staging area is full
Disk1Tape0 (D1T0) → GPFS/StoRM (in production)
  Space managed by the VO
Disk1Tape1 (D1T1) → CASTOR (production), GPFS/StoRM (production prototype for LHCb only)
  Space managed by the VO (i.e. if the disk is full, the copy fails)
  Large permanent disk buffer with tape back-end and no garbage collection

3  Looking into an HSM solution based on StoRM/GPFS/TSM
Project developed as a collaboration between:
  GPFS development team (US)
  TSM HSM development team (Germany)
  End users (INFN-CNAF)
The main idea is to combine new features of GPFS (v3.2) and TSM (v5.5) with SRM (StoRM) to provide a transparent, GRID-friendly HSM solution.
Information Lifecycle Management (ILM) is used to drive the movement of data between disks and tapes.
The interface between GPFS and TSM is on our shoulders.
Improvements and development are needed on all sides.
Transparent recalls vs. massive (list-ordered, optimized) recalls.

4  What we have now
GPFS and TSM are widely used as separate products.
Built-in functionality in both products to implement backup and archiving from GPFS.
In GPFS v3.2 the concept of "external storage pool" extends the use of policy-driven ILM to tape storage.
Some groups in the HEP world are starting to investigate this solution or have expressed interest in starting.

5  GPFS Approach: "External Pools"
External pools are really interfaces to external storage managers, e.g. HPSS or TSM.
An external pool "rule" defines the script to call to migrate/recall/etc. files:
  RULE EXTERNAL POOL 'PoolName' EXEC 'InterfaceScript' [OPTS 'options']
The GPFS policy engine builds candidate lists and passes them to the external pool scripts (a sketch of such a script is given below).
The external storage manager actually moves the data.
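For illustration only, here is a minimal Python sketch of what such an interface script could look like. The invocation convention (an operation keyword plus a candidate-list file, with the file path after a "--" separator on each line) and the batch size are assumptions; dsmmigrate/dsmrecall are the standard TSM HSM commands. The actual CNAF script (/var/mmfs/etc/hsmControl, referenced in the policy example later) is not shown in the talk.

#!/usr/bin/env python
"""Hypothetical skeleton of a GPFS external-pool interface script.

Assumed calling convention (version-dependent, not taken from the talk):
    hsmControl <operation> <file-list> [<options>]
where <operation> is e.g. TEST, MIGRATE or RECALL, and each line of
<file-list> carries the full file path after a ' -- ' separator.
"""
import subprocess
import sys


def paths_from_filelist(filelist):
    """Extract plain file paths from a GPFS candidate-list file."""
    paths = []
    with open(filelist) as fh:
        for line in fh:
            # keep everything after the ' -- ' separator, if present
            _, sep, path = line.rstrip("\n").partition(" -- ")
            paths.append(path if sep else line.strip())
    return paths


def migrate(paths, files_per_stream=30):
    """Hand candidate files over to TSM HSM in small batches (dsmmigrate)."""
    for start in range(0, len(paths), files_per_stream):
        subprocess.check_call(["dsmmigrate"] + paths[start:start + files_per_stream])


def recall(paths):
    """Bring files back from tape to disk (dsmrecall)."""
    for path in paths:
        subprocess.check_call(["dsmrecall", path])


if __name__ == "__main__":
    op = sys.argv[1].upper()
    if op == "TEST":            # GPFS probes the interface when the policy is applied
        sys.exit(0)
    files = paths_from_filelist(sys.argv[2])
    if op == "MIGRATE":
        migrate(files)
    elif op == "RECALL":
        recall(files)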

6  Storage class Disk1-Tape1
The D1T1 prototype in GPFS/TSM was tested for about two months.
Quite simple when there is no competition between migration and recall:
  D1T1 requires that every file written to disk is copied to tape (and remains resident on disk);
  recalls are needed only in case of data loss (on disk).
Although the D1T1 is a living concept…
Some adjustments were needed in StoRM, basically to place a file on hold for migration until the write operation is completed (SRM "putDone" on the file); one possible mechanism is sketched below.
Definitely positive results of the test with the current testbed hardware.
More tests at a larger scale are needed.
A production model needs to be established.
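One plausible way to keep a file out of the migration stream until StoRM has issued putDone is a per-file marker; the PINPREFIX option in the configuration shown later hints at such a scheme, but the snippet below is purely illustrative and assumes a marker file named <pin prefix><filename> next to the data file, which is not confirmed by the talk.

import os

PIN_PREFIX = ".STORM_T1D1_"   # value from the example configuration; semantics assumed


def ready_for_migration(path, pin_prefix=PIN_PREFIX):
    """Return True if no StoRM pin/marker file exists for this data file.

    Assumption: a marker <pin_prefix><name> sits in the same directory while
    the file is still being written and is removed at srmPutDone.
    """
    directory, name = os.path.split(path)
    return not os.path.exists(os.path.join(directory, pin_prefix + name))


# Example: filter a GPFS candidate list before handing it to the migrate step.
candidates = ["/storage/gpfs_lhcb/LHCb_M-DST/file1.dst",
              "/storage/gpfs_lhcb/LHCb_M-DST/file2.dst"]
to_migrate = [p for p in candidates if ready_for_migration(p)]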

7  Storage class Disk0-Tape1
The prototype is ready and being tested now.
More complicated logic is needed:
  Priorities between reads and writes have to be defined; for example, in the current version of CASTOR migration to tape has absolute priority.
  Logic for reordering recalls ("list-optimized recall"): by tape and by file position inside a tape (sketched below).
The logic is realized by means of special scripts.
First tests are encouraging, even considering the complexity of the problem.
Modifications were requested in StoRM to implement the recall logic and file pinning for files in use.
The identified solutions are simple and linear.
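The reordering idea itself is simple and can be sketched in a few lines of Python. The record layout (file path, tape volume, position on tape) is an assumption for illustration; how tape and position are obtained (e.g. from a TSM database query) is outside this sketch.

from collections import defaultdict

# Hypothetical recall request: (file path, tape volume, position on tape).
requests = [
    ("/gpfs/fileA", "L00595", 412),
    ("/gpfs/fileB", "L00599", 17),
    ("/gpfs/fileC", "L00595", 88),
]


def list_optimized_order(requests):
    """Group recall requests by tape, then sort by position within each tape.

    This mirrors the "list optimized recall" idea: mount each tape once and
    read its files in the order they sit on the medium.
    """
    by_tape = defaultdict(list)
    for path, tape, position in requests:
        by_tape[tape].append((position, path))
    ordered = []
    for tape in sorted(by_tape):                 # one mount per tape
        ordered.extend(path for _, path in sorted(by_tape[tape]))
    return ordered


print(list_optimized_order(requests))
# ['/gpfs/fileC', '/gpfs/fileA', '/gpfs/fileB']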

8  GPFS+TSM tests
So far we have performed full tests of a D1T1 solution (StoRM+GPFS+TSM), and the D0T1 implementation is being developed in close contact with the IBM GPFS and TSM developers.
The D1T1 is now entering its first production phase, being used by LHCb during this month's CCRC08.
So is the D1T0, which is served by the same GPFS cluster but without migrations.
The GPFS/StoRM-based D1T0 has also been in use by ATLAS since February.

9  D1T0 and D1T1 @ CNAF using StoRM/GPFS/TSM
3 StoRM instances
3 major HEP experiments
2 storage classes
12 servers, 200 TB of disk space
3 LTO-2 tape drives

10  Hardware used for the tests
40 TB GPFS file system (v3.2.0-3) served by 4 NSD I/O servers (SAN devices are EMC CX3-80).
FC (4 Gbit/s) interconnection between servers and disk arrays.
TSM v5.5.
2 HSM front-end servers (1 Gb Ethernet), each acting as:
  GPFS client (reads and writes on the file system via LAN)
  TSM client (reads and writes from/to tape via FC)
3 LTO-2 tape drives.
The tape library (STK L5500) is shared between CASTOR and TSM, i.e. both work with the same library.

11  LHCb D1T0 and D1T1 details
[Architecture diagram: GPFS servers and gridftp servers on the Gigabit LAN; GPFS/TSM (HSM) client nodes attached to the FC SAN and FC TAN; TSM server with its DB plus a backup TSM server with a DB mirror; tape drives.]
2 EMC CX3-80 controllers, 4 GPFS servers, 2 StoRM servers, 2 gridftp servers, 2 HSM front-end nodes, 3 LTO-2 tape drives, 1 TSM server, 1/10 Gbps Ethernet, 2/4 Gbps FC.

12  How it works
GPFS performs file system metadata scans according to the ILM policies specified by the administrators.
The metadata scan is very fast (it is not a find…) and is used by GPFS to identify the files which need to be migrated to tape.
Once the list of files is obtained, it is passed to an external process which runs on the HSM nodes and actually performs the migration to TSM; this is in particular what we implemented (a sketch of how such a scan can be triggered is given below).
Note:
  The GPFS file system and the HSM nodes can be kept completely decoupled, in the sense that it is possible to shut down the HSM nodes without interrupting file system availability.
  All components of the system have intrinsic redundancy (GPFS failover mechanisms).
  No need to put in place any kind of HA features (apart from the unique TSM server).
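The scan itself is driven by the standard GPFS policy engine (mmapplypolicy). A minimal sketch of a periodic scan loop follows; the file system path, policy file path, node list and interval are illustrative placeholders (the real values come from the configuration file shown later), and the actual CNAF scheduling logic may well differ.

import subprocess
import time

# Illustrative values; the real ones would come from the configuration file.
FILESYSTEM = "/storage/gpfs_lhcb"
POLICY_FILE = "/var/mmfs/etc/migration_policy.txt"   # hypothetical path
HSM_NODES = "diskserv-san-14,diskserv-san-16"
SCAN_FREQUENCY = 1800                                # seconds, cf. SCANFREQUENCY


def run_policy_scan():
    """Run one GPFS metadata scan; matching files are handed to the external pool script."""
    subprocess.check_call([
        "mmapplypolicy", FILESYSTEM,
        "-P", POLICY_FILE,     # ILM policy with the EXTERNAL POOL / MIGRATE rules
        "-N", HSM_NODES,       # nodes allowed to run the interface script
    ])


if __name__ == "__main__":
    while True:
        run_policy_scan()
        time.sleep(SCAN_FREQUENCY)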

13  Example of an ILM policy

/* Policy implementing T1D1 for LHCb:
   -) 1 GPFS storage pool
   -) 1 SRM space token: LHCb_M-DST
   -) 1 TSM management class
   -) 1 TSM storage pool */

/* Placement policy rules */
RULE 'DATA1' SET POOL 'data1' LIMIT (99)
RULE 'DATA2' SET POOL 'data2' LIMIT (99)
RULE 'DEFAULT' SET POOL 'system'

/* We have 1 space token: LHCb_M-DST. Define 1 external pool accordingly. */
RULE EXTERNAL POOL 'TAPE MIGRATION LHCb_M-DST'
  EXEC '/var/mmfs/etc/hsmControl' OPTS 'LHCb_M-DST'

/* Exclude from migration hidden directories (e.g. .SpaceMan),
   baby files, hidden and weird files. */
RULE 'exclude hidden directories' EXCLUDE WHERE PATH_NAME LIKE '%/.%'
RULE 'exclude hidden file' EXCLUDE WHERE NAME LIKE '.%'
RULE 'exclude empty files' EXCLUDE WHERE FILE_SIZE=0
RULE 'exclude baby files' EXCLUDE
  WHERE (CURRENT_TIMESTAMP-MODIFICATION_TIME)<INTERVAL '3' MINUTE

14  Example of an ILM policy (cont.)

/* Migrate to the external pool according to space token (i.e. fileset). */
RULE 'migrate from system to tape LHCb_M-DST'
  MIGRATE FROM POOL 'system' THRESHOLD(0,100,0)
  WEIGHT(CURRENT_TIMESTAMP-ACCESS_TIME)
  TO POOL 'TAPE MIGRATION LHCb_M-DST'
  FOR FILESET('LHCb_M-DST')
RULE 'migrate from data1 to tape LHCb_M-DST'
  MIGRATE FROM POOL 'data1' THRESHOLD(0,100,0)
  WEIGHT(CURRENT_TIMESTAMP-ACCESS_TIME)
  TO POOL 'TAPE MIGRATION LHCb_M-DST'
  FOR FILESET('LHCb_M-DST')
RULE 'migrate from data2 to tape LHCb_M-DST'
  MIGRATE FROM POOL 'data2' THRESHOLD(0,100,0)
  WEIGHT(CURRENT_TIMESTAMP-ACCESS_TIME)
  TO POOL 'TAPE MIGRATION LHCb_M-DST'
  FOR FILESET('LHCb_M-DST')

15  Example of configuration file

# HSM node list (comma separated)
HSMNODES=diskserv-san-14,diskserv-san-16
# system directory path
SVCFS=/storage/gpfs_lhcb/system
# filesystem scan minimum frequency (in sec)
SCANFREQUENCY=1800
# maximum time allowed for a migrate session (in sec)
MIGRATESESSIONTIMEOUT=4800
# maximum number of migrate threads per node
MIGRATETHREADSMAX=30
# number of files for each migrate stream
MIGRATESTREAMNUMFILES=30
# sleep time for lock file check loop
LOCKSLEEPTIME=2
# pin prefix
PINPREFIX=.STORM_T1D1_
# TSM admin user name
TSMID=xxxxx
# TSM admin user password
TSMPASS=xxxxx
# report period (in sec)
REPORTFREQUENCY=86400
# report email addresses (comma separated)
REPORTEMAILADDRESS=Vladimir.Sapunenko@cnaf.infn.it,Daniele.Gregori@cnaf.infn.it,Luca.dellAgnello@cnaf.infn.it,Angelo.Carbone@bo.infn.it,Vincenzo.Vagnoni@bo.infn.it
# alarm email addresses (comma separated)
ALARMEMAILADDRESS=t1-admin@cnaf.infn.it
# alarm email delay (in sec)
ALARMEMAILDELAY=7200
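To illustrate how a few of these options interact (an illustration of the stated semantics, not the actual implementation): MIGRATESESSIONTIMEOUT bounds a single migrate session, after which a fresh metadata scan picks up whatever is left (this is what produces the periodic "valleys" seen later in Test C), while MIGRATETHREADSMAX and MIGRATESTREAMNUMFILES control the parallelism of each session. A hedged Python sketch:

import concurrent.futures
import subprocess
import time

MIGRATESESSIONTIMEOUT = 4800     # seconds per migrate session
MIGRATETHREADSMAX = 30           # parallel migrate threads per node
MIGRATESTREAMNUMFILES = 30       # files handed to one dsmmigrate call


def migrate_session(candidates):
    """Migrate as many candidate files as possible within one session window."""
    deadline = time.time() + MIGRATESESSIONTIMEOUT
    streams = [candidates[i:i + MIGRATESTREAMNUMFILES]
               for i in range(0, len(candidates), MIGRATESTREAMNUMFILES)]
    with concurrent.futures.ThreadPoolExecutor(MIGRATETHREADSMAX) as pool:
        for stream in streams:
            if time.time() >= deadline:      # session timed out: stop submitting;
                break                        # the next scan will pick up the rest
            pool.submit(subprocess.call, ["dsmmigrate"] + stream)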

16  Example of a report
A first automatic reporting system has been implemented.

---------------------------------------------------------------------------
Start: Sun 04 May 2008 11:38:48 PM CEST
Stop:  Mon 05 May 2008 08:03:15 AM CEST
Seconds: 30267
---------------------------------------------------------------------------
Tape     Files  Failures  File throughput  Total throughput
L00595       5         0    31.0798 MiB/s    0.702259 MiB/s
L00599      10         0    32.4747 MiB/s     1.41891 MiB/s
L00611      57         0    29.0862 MiB/s     6.59165 MiB/s
L00614      47         0    31.5084 MiB/s     6.61944 MiB/s
L00615      46         0    30.3926 MiB/s     6.57133 MiB/s
L00617      47         0    31.1735 MiB/s      6.5116 MiB/s
L00618      62         0    28.4119 MiB/s     6.06469 MiB/s
L00619      44         0    27.0226 MiB/s     4.10937 MiB/s
L00620      53         0    27.1009 MiB/s     7.13976 MiB/s
L00621      66         0    28.9043 MiB/s     6.67269 MiB/s
L00624      44         0    11.4347 MiB/s     5.82468 MiB/s
L00626      62         0    30.4792 MiB/s     6.53114 MiB/s
---------------------------------------------------------------------------
Drive    Files  Failures  File throughput  Total throughput
DRIVE3     218         0    30.2628 MiB/s     25.7269 MiB/s
DRIVE4     197         0    29.5188 MiB/s     23.6487 MiB/s
DRIVE5     128         0    21.5395 MiB/s     15.3819 MiB/s
---------------------------------------------------------------------------
Host             Files  Failures  File throughput  Total throughput
diskserv-san-14    285         0    29.9678 MiB/s     34.0331 MiB/s
diskserv-san-16    258         0    25.6928 MiB/s     30.7245 MiB/s
---------------------------------------------------------------------------
                 Files  Failures  File throughput  Total throughput
Total              543         0    27.9366 MiB/s     64.7575 MiB/s
---------------------------------------------------------------------------

The alarm part is still being developed.
An email with the report is sent every day (the period is configurable via the option file).
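The per-tape, per-drive and per-host breakdowns above can all be derived from one list of per-file migration records. A hedged sketch of such an aggregation follows; the record fields and sample values are assumptions for illustration, not the actual log format used at CNAF.

from collections import defaultdict

# Hypothetical per-file migration record: (tape, drive, host, MiB written, transfer seconds).
records = [
    ("L00595", "DRIVE3", "diskserv-san-14", 2048.0, 66.0),
    ("L00599", "DRIVE4", "diskserv-san-16", 4096.0, 126.0),
]


def summarize(records, key_index, wall_clock_seconds):
    """Aggregate file count, per-file throughput and total throughput by one key."""
    stats = defaultdict(lambda: [0, 0.0, 0.0])       # files, MiB, transfer seconds
    for record in records:
        key = record[key_index]
        stats[key][0] += 1
        stats[key][1] += record[3]
        stats[key][2] += record[4]
    for key, (files, mib, secs) in sorted(stats.items()):
        file_tp = mib / secs if secs else 0.0        # MiB/s while actually writing
        total_tp = mib / wall_clock_seconds          # MiB/s over the report window
        print(f"{key:>16} {files:>6} {file_tp:10.2f} MiB/s {total_tp:10.2f} MiB/s")


summarize(records, key_index=0, wall_clock_seconds=30267)   # breakdown by tape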

17  Description of the tests
Test A
  Data transfer of LHCb files from CERN CASTOR disk to CNAF StoRM/GPFS using the File Transfer Service (FTS).
  Automatic migration of the data files from GPFS to TSM while the data was being transferred by FTS.
  This is a realistic scenario.
Test B
  1 GiB zeroed files created locally on the GPFS file system with migration turned off, then migrated to tape when the writes were finished.
  The migration of zeroed files to tape is faster due to compression → measures the physical limits of the system.
Test C
  Similar to Test B, but with real LHCb data files instead of dummy zeroed files.
  Realistic scenario, e.g. when a long queue of files to be migrated accumulates in the file system during a maintenance.

18  Test A: input files
Most of the files are 4 GiB and 2 GiB in size, with a few other sizes in addition.
The data files are LHCb stripped DSTs.
2477 files, 8 TiB in total.
[Plot: file size distribution]

19  Test A: results
[Plot: black curve, net data throughput from CERN to CNAF vs. time; red curve, net data throughput from GPFS to TSM. Annotations mark when FTS transfers were temporarily interrupted, when only two LTO-2 drives were in use, when a third LTO-2 drive was added, and when a drive was removed.]
8 TiB in total were transferred to tape in 150k seconds (almost 2 days) from CERN.
About 50 MiB/s to tape with two LTO-2 drives and 65 MiB/s with three LTO-2 drives.
Zero tape migration failures, zero retrials.

20  Test A: results (II)
[Plot: retention time on disk, i.e. the time from when a file is written until it is migrated to tape]
Most of the files were migrated within less than 3 hours, with a tail up to 8 hours.
The tail comes from the fact that at some point the CERN-to-CNAF throughput rose to 80 MiB/s, exceeding the maximum tape migration performance at that time, so GPFS/TSM accumulated a queue of files with respect to the FTS transfers.

21  Test A: results (III)
[Plot: distribution of throughput per migration to tape]
The distribution peaks at about 33 MiB/s, which is the maximum sustainable by the LTO-2 drives for LHCb data files.
Due to compression, the actual performance depends on the content of the files…
The tail is mostly due to some tapes showing much smaller throughputs; for this test we reused old tapes no longer used by CASTOR.
What is the secondary peak? It is due to files written at the end of a tape which TSM splits onto a subsequent tape (i.e. it must dismount and mount a new tape to continue writing the file).

22  Intermezzo
Between Test A and Test B we realized that the interface logic was not balancing the load perfectly between the two HSM nodes.
The logic of the interface was then slightly changed in order to improve the performance.

23  Test B: results
The file system was prefilled with 1000 files of 1 GiB each, all filled with zeroes:
  migration to tape was turned off while writing the data to disk,
  and turned on when the prefilling finished.
Hardware compression is very effective for such files.
About 100 MiB/s observed over 10k seconds.
[Plot: net throughput to tape versus time. The small valleys are explained on the next slide, where they are more visible.]
No tape migration failures and no retrials observed.

24  Test C: results
Similar to Test B, but with real LHCb data files taken from the same sample as Test A instead of zeroed files.
[Plot: net throughput to tape versus time]
The valleys clearly visible here have a period of exactly 4800 seconds; they were also partially present in Test A, but not clearly visible in that plot due to the larger binning.
The valleys are due to a tunable feature of our interface: each migration session times out if not finished within 4800 seconds, and after the timeout GPFS performs a new metadata scan and a new migration session is initiated.
4800 seconds is not a magic number; it could be larger or even infinite.
About 70 MiB/s on average, with peaks up to 90 MiB/s.
No tape migration failures and no retrials observed.

25  Conclusions and outlook
The first phase of tests of the T1D1 StoRM/GPFS/TSM-based solution is concluded.
LHCb is now starting the first production experience with such a T1D1 system.
Work is ongoing on a T1D0 implementation in collaboration with the IBM GPFS and TSM HSM development teams.
T1D0 is more complicated, since it must include active recall optimization, concurrency between migrations and recalls, etc.
IBM will introduce efficient ordered-recall features in the next major release of TSM.
While waiting for that release, we are implementing this through an intermediate layer of intelligence between GPFS and TSM, driven by StoRM.
A first proof-of-principle prototype already exists, but this is something to be discussed in a future talk… stay tuned!
A new tape library has recently been acquired at CNAF.
Once the new library is online and the old data files have been repacked to it, the old library will be devoted entirely to TSM production systems and testbeds.
About 15 drives: a much more realistic and interesting scale than 3 drives.

