1  T2 storage issues
M. Biasotto – INFN Legnaro
INFN T1 + T2 cloud workshop, Bologna, November 21 2006

2  T2 issues
- Storage management is the main issue for a T2 site
- CPU and network management are easier:
  - years of experience, stable tools (batch systems, installation, ...)
  - total number of machines for an average T2 is small: ~XX
- Several different issues in storage:
  - hardware: which kind of architecture and technology?
  - hardware configuration and optimization (storage, CPU, network)
  - storage resource managers

3  Hardware
- Which kind of hardware for T2 storage?
  - SAN based on SATA/FC disk-arrays and controllers: flexibility and reliability
  - DAS (Direct Attached Storage) servers: cheap and good performance
  - others? iSCSI, AoE (ATA over Ethernet), ...
- There are already working groups dedicated to this (technology tracking, tests, etc.), but the information is a bit dispersed
- Important, but not really critical? Once you have bought some disks you are stuck with them for years, but mixing different types is usually not a problem.

4  Current status of Italian T2s

Site      Hardware    Storage Manager   TeraBytes
Bari      DAS         dCache            10
Catania   SATA/FC     DPM               19
Frascati  SATA/FC     DPM               6
Legnaro   SATA/FC     DPM               17
Milano    SATA/FC     DPM               3
Napoli    SATA/SCSI   DPM               5
Pisa      SATA/FC     dCache
Roma                  DPM
Torino    SATA/FC

5  Storage configuration
- Optimal storage configuration is not easy, there are many factors to take into consideration:
  - how many TB per server?
  - which RAID configuration? (a rough capacity sketch follows this slide)
  - fine tuning of parameters in disk-arrays, controllers and servers (cache, block sizes, buffer sizes, kernel params, ... a long list)
- Disk-pool architecture: is one large pool enough, or do we need to split?
  - buffer pools (WAN transfer buffer, local WN buffer)?
  - different pools for different activities (production pool, analysis pool)?
- Network configuration: avoid bottlenecks between the servers and the CPUs
- Optimal configuration depends strongly on the application:
  - 2 main (very different) types of access: remote I/O from the WN or local copy to/from the WN; currently remote I/O for CMS and local copy for Atlas
  - production and analysis activities have different access patterns
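As a minimal illustration of the "which RAID configuration / how many TB per server" trade-off, the sketch below computes the usable capacity of a single array. The disk size, array width and hot-spare count are assumptions for the example, not figures from the talk.

```python
# Illustrative helper for the RAID / TB-per-server questions above.
# Disk sizes and array widths are assumptions, not figures from the talk.

def usable_tb(n_disks, disk_tb, raid_level, hot_spares=0):
    """Usable capacity of one array, ignoring file-system overhead."""
    parity = {"raid5": 1, "raid6": 2}[raid_level]
    return (n_disks - hot_spares - parity) * disk_tb

if __name__ == "__main__":
    for level in ("raid5", "raid6"):
        cap = usable_tb(n_disks=16, disk_tb=0.5, raid_level=level, hot_spares=1)
        print(f"16 x 500 GB disks, {level} + 1 hot spare: {cap:.1f} TB usable")
    # With two such arrays per server you get ~13-14 TB/server; whether that is
    # too much depends on how much throughput a single server can actually deliver.
```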

6  Storage configuration
- Optimal configuration varies depending on many factors: there is no single simple solution, every site will have to fine tune its own storage
- But having some guidelines would be useful: leverage the current experience (mostly at the T1)
- It can have huge effects on performance, but it is not so critical: many of these settings can be easily changed and adjusted

7  Storage Resource Manager
- Which Storage Resource Manager for a T2? DPM, dCache, StoRM
  - xrootd protocol required by Alice (where do I put this?)
- The choice of an SRM is a more critical issue: it is much more difficult to change
  - adopting one and learning how to use it is a large investment: know-how in deployment, configuration, optimization, problem finding and solving, ...
  - obvious practical problems if a site already has a lot of data stored
- First half of 2007 last chance for a final decision?
  - of course nothing is ever final, but after that a transition would be much more problematic

8  Requirements
- Performance & scalability: how much is needed for a T2? (a rough sizing sketch follows this slide)
  - WAN bandwidth       ~ 100 MB/s
  - LAN bandwidth       > 300 MB/s ??
  - Disk                ~ 500 TB
  - Concurrent accesses > 300 ??
- Reliability & stability
- Advanced features: data replication, internal monitoring, xxx, xxx
- Cost? (in terms of human and hardware resources)
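A rough sizing sketch based on the target numbers above. The per-server figures (capacity, sustained rate, stream count) are assumptions chosen for illustration; a real site would substitute measured values.

```python
# Rough sizing sketch from the slide's targets (~100 MB/s WAN, >300 MB/s LAN,
# ~500 TB, >300 concurrent accesses). Per-server figures are assumptions.
import math

TARGET_TB = 500
TARGET_LAN_MB_S = 300          # aggregate rate towards the worker nodes
TARGET_STREAMS = 300           # concurrent client accesses

PER_SERVER_TB = 14             # e.g. two RAID6 arrays of 500 GB disks
PER_SERVER_MB_S = 100          # sustained rate one server can realistically deliver
PER_SERVER_STREAMS = 50        # concurrent streams before a server degrades

by_capacity = math.ceil(TARGET_TB / PER_SERVER_TB)
by_rate = math.ceil(TARGET_LAN_MB_S / PER_SERVER_MB_S)
by_streams = math.ceil(TARGET_STREAMS / PER_SERVER_STREAMS)

print(f"servers by capacity:    {by_capacity}")
print(f"servers by LAN rate:    {by_rate}")
print(f"servers by concurrency: {by_streams}")
print(f"site needs the max of the three: {max(by_capacity, by_rate, by_streams)}")
```

With these assumed numbers the disk capacity, not the bandwidth, dictates the number of servers; the point of the exercise is only that all three constraints should be checked.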

9  dCache
- dCache is currently the most mature product
  - used in production for a few years
  - deployed at several large sites: T1 FNAL, T1 FZK, T1 IN2P3, all US-CMS T2s, T2 Desy, ...
- There is no doubt it will satisfy the performance and scalability needs of a T2
- Two key features guarantee performance and scalability:
  - services can be split among different nodes: all access doors (gridftp, srm, dcap) can be replicated, and also the central services (which usually all run on the admin node) can be distributed
  - access queues to manage a high number of concurrent accesses: storage access requests are queued and can be distributed, prioritized and limited based on protocol type or access type (read/write); this buffers temporary high load and avoids server overloading (a simplified sketch of the idea follows this slide)
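The sketch below is a simplified illustration of the access-queue mechanism described in the last bullet: requests beyond a per-protocol limit are queued instead of overloading the pool node. It is not dCache code; the class and limits are invented for the example.

```python
# Simplified illustration of per-protocol access queues; not dCache code.
from collections import deque

class AccessQueue:
    def __init__(self, limits):
        self.limits = limits                      # e.g. {"dcap": 100, "gridftp": 20}
        self.active = {p: 0 for p in limits}
        self.waiting = {p: deque() for p in limits}

    def request(self, protocol, transfer_id):
        """Start the transfer if below the protocol limit, otherwise queue it."""
        if self.active[protocol] < self.limits[protocol]:
            self.active[protocol] += 1
            return "started"
        self.waiting[protocol].append(transfer_id)
        return "queued"

    def finished(self, protocol):
        """A transfer completed: free the slot and promote a queued request."""
        self.active[protocol] -= 1
        if self.waiting[protocol]:
            next_id = self.waiting[protocol].popleft()
            self.active[protocol] += 1
            return next_id
        return None

# Example: a burst of WAN transfers queues up instead of overloading the server.
q = AccessQueue({"dcap": 100, "gridftp": 2})
print([q.request("gridftp", i) for i in range(4)])  # ['started', 'started', 'queued', 'queued']
print(q.finished("gridftp"))                        # 2 (first queued transfer starts)
```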

10  dCache
- A lot of advanced features:
  - data replication (for 'hot' datasets)
  - pool match-making, dynamic and highly configurable
  - pool draining for scheduled maintenance operations
  - grouping and partitioning of pools
  - internal monitoring and statistics tool

11  dCache issues
- Services are heavy and not very efficient: written in Java, they require a lot of RAM and CPU
  - central services can be split; the problem is: do they need to be split, even at a T2 site? Having to manage several dCache admin nodes could be a problem
- More costly in terms of human resources needed
  - more difficult to install, not integrated in the LCG distribution
  - steeper learning curve, documentation needs to be improved
- It is more complex, with more advanced features, and this obviously comes at a cost: does a T2 need the added complexity and features, and can they be afforded?
- Still missing VOMS support and SRM v2, but both should be available soon (where is it better to put this?)

12  INFN dCache experience
- Used in production at Bari since May 2005, building up a lot of experience and know-how
- Overall: good stability and performance
- (Bari plots)

13  INFN dCache experience
- Performance test at CNAF in ??? 2005 (or was it 2006?): ???? demonstrated
- (plots)

14  INFN dCache experience
- Pisa experience: from DPM to dCache (or maybe this should go at the end of the DPM section, where the CMS problems that caused the migration are discussed)

15  StoRM
- Developed in collaboration between INFN-CNAF and ICTP-EGRID (Trieste)
- Designed for disk-based storage: implements an SRM v2 interface on top of an underlying parallel or cluster file-system (GPFS, Lustre, etc.)
  - StoRM takes advantage of the aggregation functionalities of the underlying file-system to provide performance, scalability, load balancing, fault tolerance, ...
  - not bound to a specific file-system: in principle this makes it possible to exploit the very active research and development in the cluster file-system field (a toy sketch of this layering follows this slide)
- Support of SRM v2 functionalities (space reservation, lifetime, file pinning, pre-allocation, ...) and ACLs
- Full VOMS support
- So far StoRM has been penalized by the fact that it supports only SRM v2, while LCG is still running with SRM v1
  - no site could deploy it in production
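To make the layering idea concrete, here is a toy sketch of an SRM-like front-end that only maps SURLs to local paths and delegates all actual storage to whatever cluster file-system is mounted underneath. This is not StoRM code; the class names, the example mount point and the URL mapping are invented for illustration.

```python
# Toy sketch of an SRM-style front-end over a generic POSIX cluster file-system.
# Not StoRM code: names and mapping are illustrative assumptions.
import os

class PosixBackend:
    """Any mounted cluster file-system (GPFS, Lustre, ...) seen as a POSIX tree."""
    def __init__(self, mount_point):
        self.mount_point = mount_point

    def prepare(self, relative_path):
        path = os.path.join(self.mount_point, relative_path)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        return path

class SrmFrontend:
    """Maps SRM-style SURLs to local paths and delegates I/O to the backend."""
    def __init__(self, backend):
        self.backend = backend

    def prepare_to_put(self, surl):
        # e.g. srm://host/cms/store/file.root -> <mount>/cms/store/file.root
        relative = surl.split("/", 3)[3]
        local_path = self.backend.prepare(relative)
        return f"file://{os.path.abspath(local_path)}"  # transfer URL for the client

# The front-end never touches disks directly; striping, replication and
# scalability are whatever the underlying file-system provides.
frontend = SrmFrontend(PosixBackend("./gpfs_mount"))   # stand-in for a GPFS mount
print(frontend.prepare_to_put("srm://se.example.it/cms/store/test/file.root"))
```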

16  StoRM
- Scalability
  - StoRM servers can be replicated
  - centralized database: currently MySQL, possibly others (Oracle) in future releases
- Advanced features provided by the underlying file-system
  - GPFS: data replication, pool vacation

17  StoRM issues
- Not used anywhere in production so far, and few test installations at external sites
  - it is likely that a first field test would turn up a lot of small issues and problems (this shouldn't be a concern in the longer term)
- Installation and configuration not easy, but mostly due to too few deployment tests
  - recent integration with yaim should bring improvements in this area
- No access queue for concurrent access management (to avoid server overloading)
- No internal monitoring
- There could be compatibility issues between the underlying cluster file-system and some VO applications
  - some file-systems have specific requirements on the kernel version

18  INFN StoRM experience
- Obviously CNAF has all the needed know-how on StoRM
- Also GPFS experience within INFN, mostly at CNAF but not only (Catania, Trieste, Genova, ...)
  - overall good in terms of performance, scalability and reliability
- Performance test at CNAF in xxx 2005 (?): StoRM + GPFS testbed
  - plots and results (see A. Forti's slides at Otranto)
- StoRM installations for deployment and functionality tests: Padova (?), Legnaro (GridCC), others?

19  DPM
- DPM is the SRM system supported by LCG, distributed with the LCG middleware
  - Yaim support: easy installation
  - possible migration from an old classic SE
- It is the natural choice for an LCG site that needs SRM and doesn't have (or pose) too many concerns
  - a lot of DPM installations around ...
- VOMS support
- SRM v2 implementation (but still limited functionalities)

20  DPM issues
- Still lacking many functionalities (some of them important):
  - load balancing is very simple (round robin among the file-systems in a pool) and not configurable (illustrated in the sketch after this slide)
  - data replication still buggy in the current release
  - pool draining for server maintenance or decommissioning
  - pool selection based on path
  - internal monitoring
  - support for multi-group pools
- Scalability limits?
  - no problem for the rfio and gridftp services: easily distributed on the pool servers
  - but the central services on the head node? In principle the dpm, dpns and mysql services can be split: not tested yet (will it be necessary? will it be enough?)
  - no access queue like in dCache to manage concurrent access
- DPM/Castor/rfio compatibility issue (where do I put this?)
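The sketch below illustrates what the simple round-robin selection means in practice, next to a hypothetical free-space-aware policy of the kind that was not configurable in DPM at the time. It is not DPM source code; the file-system list and the alternative policy are invented for the example.

```python
# Illustration of round-robin file-system selection vs a hypothetical
# free-space-aware policy; not DPM source code.
import itertools

filesystems = [
    {"server": "disk01", "fs": "/data1", "free_gb": 900},
    {"server": "disk01", "fs": "/data2", "free_gb": 120},
    {"server": "disk02", "fs": "/data1", "free_gb": 4000},
]

# Round robin: every file-system gets the same share of new files,
# regardless of how full or how loaded it is.
round_robin = itertools.cycle(filesystems)
def choose_round_robin():
    return next(round_robin)

# A weighted alternative (not available at the time): prefer the
# file-system with the most free space.
def choose_most_free():
    return max(filesystems, key=lambda f: f["free_gb"])

for _ in range(4):
    rr = choose_round_robin()
    print("round-robin ->", rr["server"], rr["fs"],
          "| most-free ->", choose_most_free()["server"])
```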

21  INFN DPM experience
- Used in production at many INFN sites
  - no major issues or complaints, good overall stability, but never really stressed
  - mention DPM+GPFS at Bologna?
- Stability and reliability: CMS LoadTest
- Performance: MC production, but even in CSA06 the system was not stressed enough
  - so far no evidence of problems or limitations, but the performance values reached are still low
- Pisa experience (here or in the dCache section?)

22  Summary
- dCache
  - mature product, meets all performance and scalability requirements
  - more costly in terms of hardware and human resources
- DPM
  - important features still missing, but this is not a concern in the longer term (no reason why they shouldn't be added)
  - required performance and scalability not proven yet: are there some intrinsic limits?
- StoRM
  - potentially interesting, but must be tried in production
  - required performance and scalability not proven yet: are there some intrinsic limits?

23  Conclusions

24  Acknowledgments

25  Miscellaneous items to add somewhere
- In CMS SC4 and CSA06 the vast majority (almost all) of problems and job failures were related to storage issues
  - bugs, hardware failures, interoperability problems, misconfigurations, ...