
1 HEPiX 2005 Trip Reports, Dantong Yu, June 27, 2005

2 Highlights
- HEPiX was asked and agreed to act as technical advisor to the IHEPCCC on specific questions where it has expertise. Examples given were the status of Linux in HEP and the idea of a virtual organisation for HEP physicists.
- The most recent successful HEP collaboration has been the distribution and widespread acceptance of Scientific Linux. Discussions focused on which versions would need to be supported for the LHC in the next two years.
- LEMON, NGOP and SLAC Nagios monitoring.
- Service Challenge preparation.
- Batch queue systems.
- DoE budget cuts affected SLAC, FNAL, and BNL.

3 Disk, Tape, Storage and File Systems

4 Fermilab Mass Storage
- ENSTORE, dCache and SRM for CDF, D0 and CMS.
- Name space: PNFS. Provides a hierarchical namespace for users' files in Enstore, manages file metadata, and looks like an NFS-mounted file system from user nodes. Stands for "Perfectly Normal File System"; written at DESY.
- ENSTORE hardware: 6 silos; tape drives: 9 LTO1, 14 LTO2, 20 9940, 52 9940B, 8 DLT (4000 & 8000); and 127 commodity PCs.
- 2.6 petabytes of data, 10.8 million files, 25,000 volumes, and a record rate of 27 terabytes/day.

5 dCache
- dCache is deployed on top of ENSTORE or stand-alone.
- Improves performance by serving files from disk caches instead of reading them from tape each time they are needed.
- 100 pool nodes with ~225 terabytes of disk.
- Lessons learned:
  - Use the XFS filesystem on the pool disks.
  - Use direct I/O when accessing files on the local dCache disk (see the sketch below).
  - Users will push the system to its limits. Be prepared.
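A minimal sketch of what the direct-I/O lesson looks like in practice on a Linux pool node; the pool path below is hypothetical, and dCache itself implements this internally:

```python
import os
import mmap

# Minimal sketch of the "use direct I/O on the pool disk" lesson, assuming a
# Linux pool node; the path below is hypothetical. O_DIRECT bypasses the page
# cache, but requires block-aligned offsets, lengths and buffers.
BLOCK = 4096
POOL_FILE = "/pool1/data/000A1B2C3D"   # hypothetical dCache pool file

fd = os.open(POOL_FILE, os.O_RDONLY | os.O_DIRECT)
buf = mmap.mmap(-1, BLOCK)             # anonymous mmap: page-aligned buffer
try:
    nread = os.preadv(fd, [buf], 0)    # read one aligned block from offset 0
    print("read %d bytes without polluting the page cache" % nread)
finally:
    os.close(fd)
```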

6 SRM
- Provides a uniform interface for access to multiple storage systems via the SRM protocol (a conceptual sketch follows below).
- SRM is a broker that works on top of other storage systems:
  - dCache
  - UNIX filesystem
  - Enstore: in development
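A conceptual sketch of the broker idea, not the actual SRM protocol or any real SRM API: one front-end interface dispatching to different back-end storage systems. All class and method names here are invented for illustration.

```python
from abc import ABC, abstractmethod
import shutil

# Conceptual sketch only: illustrates "one interface, many storage back-ends".
# It is not the SRM protocol; class and method names are invented for clarity.
class StorageBackend(ABC):
    @abstractmethod
    def get(self, remote_path: str, local_path: str) -> None: ...
    @abstractmethod
    def put(self, local_path: str, remote_path: str) -> None: ...

class UnixFSBackend(StorageBackend):
    def get(self, remote_path, local_path):
        shutil.copy(remote_path, local_path)      # plain filesystem copy
    def put(self, local_path, remote_path):
        shutil.copy(local_path, remote_path)

class DCacheBackend(StorageBackend):
    def get(self, remote_path, local_path):
        raise NotImplementedError("would call the dCache door here")
    def put(self, local_path, remote_path):
        raise NotImplementedError("would call the dCache door here")

class StorageBroker:
    """Front end that hides which back-end actually holds the data."""
    def __init__(self, backends):
        self.backends = backends
    def get(self, system, remote_path, local_path):
        self.backends[system].get(remote_path, local_path)

broker = StorageBroker({"unixfs": UnixFSBackend(), "dcache": DCacheBackend()})
broker.get("unixfs", "/etc/hostname", "/tmp/hostname.copy")
```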

7 CASPUR file systems: Lustre, AFS, Panasas

8 Components
- High-end Linux units for both servers and clients.
  - Servers: 2-way Intel Nocona 3.4 GHz, 2 GB RAM, 2 QLA2310 2 Gbit HBAs.
  - Clients: 2-way Intel Xeon 2.4+ GHz, 1 GB RAM.
  - OS: SuSE SLES 9 on servers, SLES 9 / RHEL 3 on clients.
- Network: non-blocking GigE switches.
  - CISCO 3570G-24TS (24 ports).
  - Extreme Networks Summit 400-48t (48 ports).
- SAN: Qlogic Sanbox 5200, 32 ports.
- Appliances: 3 Panasas ActiveScale shelves; each shelf had 3 Director Blades and 8 Storage Blades.

9 Panasas Storage Cluster Components (diagram): DirectorBlade and StorageBlade modules with an integrated GE switch; the shelf front holds 1 DB and 10 SB; the shelf rear midplane routes GE and power; battery module (2 power units).

10 Test setup (NFS, AFS, Lustre) (diagram): a load farm of 16 biprocessor nodes at 2.4+ GHz connects over Gigabit Ethernet (CISCO/Extreme switches) to 4 servers, each with two logical disks (LD1, LD2), plus a Lustre MDS, all attached to the QLogic 5200 SAN. On each server, 2 Gigabit Ethernet NICs were bonded (bonding-ALB). LD1 and LD2 could be IFT or DDN arrays; each LD was zoned to a distinct HBA.

11 What we measured
1) Massive aggregate I/O (large files, lmdd)
- All 16 clients were unleashed together; file sizes varied in the range 5-10 GB.
- Gives a good idea of the system's overall throughput.
2) Pileup. This special benchmark was developed at CERN by R. Többicke.
- Emulation of an important use case foreseen in one of the LHC experiments.
- Several (64-128) 2 GB files are first prepared on the file system under test.
- The files are then read by a growing number of reader threads (ramp-up).
- Every thread selects one file at random from the list.
- In a single read act, an arbitrary offset within the file is calculated, and 50-60 KB are read starting at this offset.
- Output is the number of operations times bytes read per time interval.
- Pileup results are important for future service planning.
(A sketch of the Pileup read loop is shown below.)
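A minimal sketch of the Pileup read pattern just described, under the assumption that the 2 GB test files already exist; paths and parameters are invented for illustration, and this is not R. Többicke's original benchmark code.

```python
import random
import threading
import time

# Sketch of the Pileup access pattern: many readers, random file, random
# offset, ~50-60 KB per read, aggregate rate reported per interval.
FILES = ["/fs_under_test/pileup/file%03d.dat" % i for i in range(64)]
READ_SIZE = 55 * 1024          # ~50-60 KB per read
FILE_SIZE = 2 * 1024**3        # 2 GB test files
bytes_read = 0
lock = threading.Lock()

def reader(stop):
    global bytes_read
    while not stop.is_set():
        path = random.choice(FILES)                     # pick one file at random
        offset = random.randrange(0, FILE_SIZE - READ_SIZE)
        with open(path, "rb") as f:
            f.seek(offset)                              # arbitrary offset ...
            data = f.read(READ_SIZE)                    # ... read ~55 KB from it
        with lock:
            bytes_read += len(data)

stop = threading.Event()
for n in range(1, 9):                                   # ramp up the reader count
    threading.Thread(target=reader, args=(stop,), daemon=True).start()
    time.sleep(10)                                      # one reporting interval
    with lock:
        print("%d readers: %.1f MB/s" % (n, bytes_read / 10 / 1e6))
        bytes_read = 0
stop.set()
```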

12 A typical Pileup curve (figure).

13 3) Emulation of a DAQ Data Buffer
- A very common scenario in HEP DAQ architectures.
- Data is constantly arriving from the detector and has to end up on tertiary storage (tapes).
- A temporary storage area on the data's way to tape serves for reorganization of streams, preliminary real-time analysis, and as a safety buffer against interruptions of the archival system.
- Of big interest for service planning: the general throughput of a balanced Data Buffer. A DAQ manager may moderate the data influx (for instance, by tuning certain trigger rates), thus balancing it with the outflux.
- We ran 8 writers and 8 readers, one process per client. Each file was accessed at any given moment by one and only one process. On writer nodes we could moderate the writer speed by adding some dummy "CPU eaters". (A sketch of this writer/reader pattern follows below.)
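A minimal sketch of the balanced DAQ-buffer emulation described above. In the tests each writer and reader was a separate process on its own client node; here they are collapsed into threads on one machine, the buffer directory and chunk sizes are invented, and the "CPU eater" is just a busy loop used to throttle the influx.

```python
import os
import queue
import threading
import time

BUFFER_DIR = "/data_buffer"        # hypothetical temporary storage area
CHUNK = 8 * 1024 * 1024            # 8 MB writes
CHUNKS_PER_FILE = 128              # ~1 GB files
ready = queue.Queue()              # files completely written, not yet migrated

def cpu_eater(iterations):
    x = 0
    for i in range(iterations):    # burn CPU to moderate the write rate
        x += i * i
    return x

def writer(writer_id, throttle):
    seq = 0
    while True:
        path = os.path.join(BUFFER_DIR, "run_%d_%06d.raw" % (writer_id, seq))
        with open(path, "wb") as f:
            for _ in range(CHUNKS_PER_FILE):
                f.write(os.urandom(CHUNK))     # "detector" data arriving
                cpu_eater(throttle)            # tune influx vs. outflux here
        ready.put(path)                        # hand the finished file to a reader
        seq += 1

def reader():
    while True:
        path = ready.get()                     # exactly one process per file
        with open(path, "rb") as f:
            while f.read(CHUNK):               # stream it out (towards "tape")
                pass
        os.remove(path)                        # free the buffer space

for i in range(8):
    threading.Thread(target=writer, args=(i, 50_000), daemon=True).start()
    threading.Thread(target=reader, daemon=True).start()
time.sleep(60)                                 # let the balanced buffer run
```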

14 DAQ Data Buffer (figure).

15 Results for 8 GigE outlets
(Massive I/O write/read, balanced DAQ buffer influx, and Pileup rates, all in MB/sec)
- NFS IFT: massive I/O 704 write / 808 read; DAQ buffer influx 300-390 (390/380); Pileup 80-90
- AFS IFT: massive I/O 397 write / 453 read; Pileup 70
- LUSTRE IFT: massive I/O 790 write / 780 read; Pileup 55-60
- LUSTRE DDN: massive I/O 790 write / 780 read; Pileup -
- PANASAS x2: massive I/O 740 write / 822 read; Pileup 100+
Two remarks:
- Each of the storage nodes had 2 GigE NICs. We tried adding a third NIC to see if we could get more out of a node; the improvement was a modest one of less than 10 percent, so we decided to use 8 NICs on 4 nodes per run.
- The Panasas shelf had 4 NICs, and we report its results here multiplied by 2, to be able to compare it with all the other 8-NIC configurations.

16 Conclusions
1) With 8 GigE NICs in the system, one would expect a throughput in excess of 800 MB/sec for large streaming I/O. Lustre and Panasas can clearly deliver this, and NFS also does quite well. The very fact that we were operating around 800 MB/sec with this hardware means that our storage nodes were well balanced (no bottlenecks; we may even have had a reserve of 100 MB/sec per setup).
2) Pileup results were relatively good for AFS, and best in the case of Panasas. The outcome of this benchmark is correlated with the number of spindles in the system. The two Panasas shelves had 40 spindles, while the 4 storage nodes used 64 spindles, so the Panasas file system was doing a much better job per spindle than any other solution we tested (NFS, AFS, Lustre). (See the per-spindle check below.)
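A quick per-spindle check of that last point, using the Pileup numbers from the results slide; taking midpoints of the quoted ranges is an assumption made here for the arithmetic, not a figure from the talk.

```python
# Pileup throughput per spindle, from the numbers on the results slide.
spindles = {"PANASAS": 40, "NFS": 64, "AFS": 64, "LUSTRE": 64}
pileup_mb_s = {"PANASAS": 100, "NFS": 85, "AFS": 70, "LUSTRE": 57.5}  # midpoints / lower bound

for fs in spindles:
    print("%-8s ~%.1f MB/s per spindle" % (fs, pileup_mb_s[fs] / spindles[fs]))
# PANASAS  ~2.5 MB/s per spindle
# NFS      ~1.3 MB/s per spindle
# AFS      ~1.1 MB/s per spindle
# LUSTRE   ~0.9 MB/s per spindle
```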

17 USCMS Tier 1 Update on Use of IBRIX Fusion FS
CMS decided to implement the IBRIX solution. Why?
- No specialized hardware required.
  - Enabled us to redeploy hardware if this solution did not work.
  - Made the cost of the product less than others with hardware components.
- Provided NFS access.
  - A universal protocol which required no client-side code.
  - Initial decision to only use NFS access because of this.
- IBRIX was very responsive to our requests and to issues we found during the evaluation.
- Purely software solution: no specialized hardware dependencies.
- Comprised of:
  - A highly scalable POSIX-compliant parallel file system.
  - A logical volume manager based on LVM.
  - High availability.
  - A comprehensive management interface which includes a GUI.

18 Current Status
- The thin client is working very well and the system is stable.
- User and group quotas are working as part of our managed disk plan; a few requests for enhancement to the quota subsystem to improve quota management.
- Working on getting users to migrate off all NFS-mounted work areas and data disks.
- IBRIX file systems are exported via NFS in limited numbers and this has been stable; refining admin documentation and knowledge.
- Will add 2 more segment servers and another 2.7 TB of disk; plan for 20 TB by the start of data taking.
- Thin client RPMs are kernel-version dependent; IBRIX is committed to providing RPM updates for security releases in a timely fashion.
- No performance data is available because the current focus is on functionality.

19 SATA Evaluation at FNAL
- SATA is found in commodity or mid-level storage configurations (as opposed to enterprise-level) and cannot be expected to give the same performance as more expensive architectures. SATA controllers can be FC or PCI/PCI-X.
- Pitfalls of a SATA configuration: imperfect firmware, misleading claims by vendors, untested configurations, disruptive upgrades: "you get what you pay for".
- A number of suggestions were made:
  - Careful selection of vendor.
  - Firmware upgrades.
  - Consider paying more if you can be sure of reduced ongoing maintenance costs (including human costs).
  - Understand your needs properly and estimate the cost and effect of disk losses.
  - Decide whether data loss is acceptable or not.

20 Experiences Deploying Xrootd at RAL

21 What is xrootd?
- xrootd (eXtended Root Daemon) was written at SLAC and INFN Padova as part of the work to migrate the BaBar event store from Objectivity to Root I/O.
- It is a fully functional suite of tools for serving data, including server daemons and clients which talk to each other using the xroot protocol.

22 xrootd Architecture (layer diagram): protocol layer, filesystem logical layer, filesystem physical layer, filesystem implementation, and the protocol manager (included in the distribution).

23 Load Balanced Example with MSS (diagram).

24 Benefits
- For users:
  - Jobs don't crash if a disk/server goes down; they back off, contact the olb manager and get the data from somewhere else (a conceptual sketch of this retry pattern follows below).
  - Queues aren't stopped just because 2% of the data is offline.
- For admins:
  - No need for heroic efforts to recover damaged filesystems.
  - Much easier to schedule maintenance.
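A conceptual sketch of that back-off-and-redirect behaviour, not xrootd's actual client code; host names and the fetch() helper are invented for illustration, and the real logic lives inside the xrootd/olb daemons and client library.

```python
import random
import time

DATA_SERVERS = ["disk01.example.org", "disk02.example.org", "disk03.example.org"]

def redirector_pick(path):
    """Stand-in for asking the load-balancing manager which server has the file."""
    return random.choice(DATA_SERVERS)

def fetch(server, path):
    raise ConnectionError("pretend %s is down" % server)   # placeholder data access

def open_with_failover(path, max_attempts=5):
    delay = 1.0
    for attempt in range(max_attempts):
        server = redirector_pick(path)
        try:
            return fetch(server, path)         # hypothetical data access call
        except ConnectionError:
            time.sleep(delay)                  # back off ...
            delay = min(delay * 2, 30)         # ... and ask for another server
    raise RuntimeError("no server could deliver %s" % path)

try:
    open_with_failover("/store/data/file.root", max_attempts=2)
except RuntimeError as err:
    print(err)                                 # the job reports failure, it does not crash
```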

25 Conclusion
- Xrootd has proved to be easy to configure and link to our MSS. Initial indications are that the production service is both reliable and performant.
- This should improve the lives of both users and sysadmins, with big advances in the robustness of the system and its maintainability, without sacrificing performance.
- Talks, software (binaries and source), documentation and example configurations are available at http://xrootd.slac.stanford.edu

26 Grid Middleware and Service Challenge 3

27 BNL Service Challenge 3 (skipped)

28 IdF: Background
- Need to start setting up the Tier2 now to be ready on time.
- The LCG (France) effort has concentrated on the Tier1 until now.
- Technical and financial challenges require time to be solved.
- French HEP institutes are quite small: 100-150 persons, small computing manpower.
- IdF (Ile-de-France, the Paris region) has a large concentration of big HEP labs and physicists.
  - 6 labs, among which DAPNIA: 600, LAL: 350.
  - DAPNIA and LAL have been involved in the Grid effort since the beginning of EDG.
  - 3 EGEE contracts (2 for operation support).

29 Objectives
- Build a Tier2 facility for simulation and analysis.
  - 80% for the 4 LHC experiments, 20% EGEE and local use; 2/3 LHC analysis.
  - Analysis requires a large amount of storage.
- Be ready at LHC startup (2nd half of 2007).
- Resource goals:
  - CPU: 1500 kSI2K (1 kSI2K ~ P4 Xeon 2.8 GHz); largest single-experiment share CMS: 800.
  - Storage: 350 TB of disk (no MSS planned); largest single-experiment share CMS: 220.
  - Network: 10 Gb/s backbone inside the Tier2, 1 or 10 Gb/s external link.
- Need 1.6 M Euros.

30 Storage Challenge
- Efficient use and management of a large amount of storage is seen as the main challenge.
- Plan to participate in SC3.
- 2006: mini Tier2; 2007: production Tier2.

31 SC2 Summary
- SC2 met its throughput goals (achieved >600 MB/s daily average for 10 days), and with more sites than originally planned!
  - A big improvement over SC1.
- But we still don't have something we can call a service.
  - Monitoring is better: we see outages when they happen, and we understand why they happen.
  - First step towards operations guides.
- Some advances in infrastructure and software will happen before SC3:
  - gLite transfer software.
  - SRM service more widely deployed.
  - We have to understand how to incorporate these elements.

32 Service Challenge 3 - Phases
High-level view:
- Setup phase (includes throughput test)
  - 2 weeks sustained in July 2005; "obvious target": the GDB of July 20th.
  - Primary goals: 150 MB/s disk-to-disk to Tier1s; 60 MB/s disk (T0) to tape (T1s).
  - Secondary goals: include a few named T2 sites (T2 -> T1 transfers); encourage remaining T1s to start disk-to-disk transfers.
- Service phase
  - September to end of 2005.
  - Start with ALICE & CMS; add ATLAS and LHCb in October/November.
  - All offline use cases except analysis.
  - More components: WMS, VOMS, catalogs, experiment-specific solutions.
  - Implies a production setup (CE, SE, ...).

33 SC3 - Milestone Decomposition
- File transfer goals (a quick cross-check follows below):
  - Build up disk-to-disk transfer speeds to 150 MB/s, with 1 GB/s out of CERN (SC2 was 100 MB/s, agreed by site).
  - Include tape: transfer speeds of 60 MB/s, with 300 MB/s out of CERN.
- Tier1 goals:
  - Bring in additional Tier1 sites with respect to SC2 (at least with respect to the original plan...).
  - PIC and Nordic most likely added later: SC4?
- Tier2 goals:
  - Start to bring Tier2 sites into the challenge.
  - Agree on the services T2s offer / require.
  - On-going plan (more later) to address these.
- Add additional components:
  - Catalogs, VOs, experiment-specific solutions etc., 3D involvement, ...
  - Choice of software components, validation, fallback, ...
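As a rough cross-check of those numbers, reading the 150 MB/s and 60 MB/s figures as per-Tier1 rates (that reading is an assumption made here for the arithmetic, not a statement from the talk):

```python
# Rough cross-check: how many Tier1s at full rate do the aggregate goals imply?
aggregate_disk_mb_s = 1000    # ~1 GB/s out of CERN, disk-to-disk
per_site_disk_mb_s = 150
aggregate_tape_mb_s = 300     # out of CERN, disk-to-tape
per_site_tape_mb_s = 60

print("disk-to-disk: ~%.1f Tier1s at full rate" % (aggregate_disk_mb_s / per_site_disk_mb_s))  # ~6.7
print("disk-to-tape: ~%.1f Tier1s at full rate" % (aggregate_tape_mb_s / per_site_tape_mb_s))  # 5.0
```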

34 LCG Service Challenges: Planning for Tier2 Sites. Update for the HEPiX meeting, Jamie Shiers, IT-GD, CERN.

35 Executive Summary
- Tier2 issues have been discussed extensively since early this year.
- The role of Tier2s and the services they offer, and require, have been clarified.
- The data rates for MC data are expected to be rather low (limited by available CPU resources).
- The data rates for analysis data depend heavily on the analysis model (and the feasibility of producing new analysis datasets, IMHO).
- LCG needs to provide installation guides / tutorials for DPM, FTS, LFC.
- Tier1s need to assist Tier2s in establishing services.

36 Tier2 and Base S/W Components
- Disk pool manager (of some flavour...), e.g. dCache, DPM, ...
- gLite FTS client (and T1 services).
- Possibly also a local catalog, e.g. LFC, FiReMan, ...
- Experiment-specific s/w and services ('agents').

37 Tier2s and SC3
- The initial goal is for a small number of Tier2-Tier1 partnerships to set up agreed services and gain experience; this will be input to a wider deployment model.
- Need to test transfers in both directions: MC upload and analysis data download.
- Focus is on service rather than "throughput tests".
- As an initial goal, we would propose running transfers over at least several days, e.g. using 1 GB files, showing sustained rates of ~3 files / hour T2->T1 (see the quick rate calculation below).
- More concrete goals for the Service Phase will be defined together with the experiments in the coming weeks, definitely no later than the June 13-15 workshop.
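For scale, a simple back-of-the-envelope calculation of what that proposed sustained rate means:

```python
# What "~3 x 1 GB files per hour" means as a sustained T2->T1 rate.
files_per_hour = 3
file_size_mb = 1024            # 1 GB files
rate_mb_s = files_per_hour * file_size_mb / 3600.0
print("~%.2f MB/s sustained" % rate_mb_s)   # ~0.85 MB/s, i.e. roughly 7 Mbit/s
```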

38 T2s - Concrete Target
- We need a small number of well-identified T2/T1 partners for SC3, as listed above.
- Do not strongly couple T2-T1 transfers to the T0-T1 throughput goals of the SC3 setup phase.
- Nevertheless, target one week of reliable T2->T1 transfers, involving at least two T1 sites, each with at least two T2s, by the end of July 2005.

39 The LCG File Catalog (LFC). Jean-Philippe Baud, Sophie Lemaitre, IT-GD, CERN, May 2005.

40 LCG File Catalog
- Based on lessons learned in the Data Challenges (2004).
- Fixes the performance and scalability problems seen in the EDG catalogs: cursors for large queries; timeouts and retries from the client.
- Provides more features than the EDG catalogs:
  - User-exposed transaction API.
  - Hierarchical namespace and namespace operations.
  - Integrated GSI authentication + authorization.
  - Access control lists (Unix permissions and POSIX ACLs).
  - Checksums.
- Based on an existing code base.
- Supports Oracle and MySQL database backends.

41 Relationships in the Catalog (diagram): an LFN such as /grid/dteam/dir1/dir2/file1.root maps to a GUID; the GUID carries system metadata ("size" => 10234, "cksum_type" => "MD5", "cksum" => "yy-yy-yy") and user-defined metadata; the same entry can have several replicas (e.g. srm://host.example.com/foo/bar on host.example.com) and several symlinks (e.g. /grid/dteam/mydir/mylink).
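A small data-structure sketch of those relationships; this is only a conceptual model for illustration, not the LFC schema or its client API.

```python
from dataclasses import dataclass, field

# Conceptual model of the catalog relationships on the slide above:
# one entry = LFN -> GUID, plus metadata, replicas and symlinks.
@dataclass
class CatalogEntry:
    guid: str
    lfn: str                                        # primary logical file name
    system_metadata: dict = field(default_factory=dict)
    user_metadata: dict = field(default_factory=dict)
    replicas: list = field(default_factory=list)    # SURLs on storage elements
    symlinks: list = field(default_factory=list)    # alternative namespace paths

entry = CatalogEntry(
    guid="xxxxxx-xxxx-xxx-xxx",
    lfn="/grid/dteam/dir1/dir2/file1.root",
    system_metadata={"size": 10234, "cksum_type": "MD5", "cksum": "yy-yy-yy"},
)
entry.replicas.append("srm://host.example.com/foo/bar")
entry.symlinks.append("/grid/dteam/mydir/mylink")
print(entry.guid, "has", len(entry.replicas), "replica(s)")
```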

42 Features
- Namespace operations:
  - All names are in a hierarchical namespace.
  - mkdir(), opendir(), etc.; also chdir().
  - A GUID is attached to every directory and file.
- Security: GSI authentication and authorization.
  - Mapping is done from the client DN to a uid/gid pair; authorization is done in terms of uid/gid (a conceptual sketch follows below).
  - VOMS will be integrated (collaboration with INFN/NIKHEF); VOMS roles appear as a list of gids.
  - Ownership of files is stored in the catalog.
- Permissions implemented:
  - Unix (user, group, all) permissions.
  - POSIX ACLs (groups and users).
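A conceptual sketch of the DN-to-uid/gid mapping and a uid/gid-based permission check; the mapping table, names and policy are invented for illustration and this is not LFC code.

```python
import stat

# Hypothetical DN -> (uid, gid) mapping table.
DN_MAP = {
    "/DC=org/DC=example/CN=Alice Analyst": (10001, 2688),
    "/DC=org/DC=example/CN=Bob Builder":   (10002, 2688),
}

def authorize_read(dn, owner_uid, owner_gid, mode):
    """Decide read access purely in terms of uid/gid and Unix mode bits."""
    uid, gid = DN_MAP[dn]                       # authenticated DN -> uid/gid
    if uid == owner_uid:
        return bool(mode & stat.S_IRUSR)        # owner read bit
    if gid == owner_gid:
        return bool(mode & stat.S_IRGRP)        # group read bit
    return bool(mode & stat.S_IROTH)            # "all" read bit

# A catalog entry owned by uid 10001 / gid 2688 with mode rw-r-----:
print(authorize_read("/DC=org/DC=example/CN=Bob Builder", 10001, 2688, 0o640))  # True (group)
```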

43 LFC Tests
- The LFC has been tested and shown to be scalable to at least 40 million entries and 100 client threads.
- Performance is improved in comparison with the RLSs.
- Stable: continuous running at high load for extended periods of time with no crashes; based on code which has been in production for > 4 years.
- Tuning is required to improve bulk performance.

44 FiReMan Performance - Insert (plot): inserts per second vs. number of threads (1-100), comparing FiReMan single-entry, FiReMan bulk-100, and LFC.

45 FiReMan Performance - Queries (plot): entries returned per second vs. number of threads (1-100), comparing FiReMan single-entry, FiReMan bulk-100, and LFC.

46 Tests Conclusion
- Both LFC and FiReMan offer large improvements over RLS.
- Still some issues remaining: scalability of FiReMan; bulk operations for LFC.
- More work is needed to understand performance and bottlenecks.
- Need to test some real use cases.

47 File Transfer Software and Service for SC3. Gavin McCance, LHC Service Challenge.

48 FTS service
- It provides point-to-point movement of SURLs.
  - Aims to provide reliable file transfer between sites, and that's it!
  - Allows sites to control their resource usage.
  - Does not do 'routing' (e.g. like PhEDEx).
  - Does not deal with GUIDs, LFNs, datasets, collections.
- It is a fairly simple service that provides sites with a reliable and manageable way of serving file movement requests from their VOs (a conceptual sketch of the channel/job model follows below).
- Together with the experiments, we are identifying the places in the software where extra functionality can be plugged in:
  - How the VO software frameworks can load the system with work.
  - Places where VO-specific operations (such as cataloguing) can be plugged in, if required.
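A conceptual sketch of the channel/job model just described, invented for illustration (this is not the FTS schema or API): a channel is a managed point-to-point link, jobs are queued on the channel matching their source and destination, and an agent works through the queue within the site's resource limits.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class TransferJob:
    source_surl: str
    dest_surl: str
    state: str = "Submitted"       # e.g. Submitted -> Active -> Done / Failed

@dataclass
class Channel:
    source_site: str
    dest_site: str
    max_active: int = 10           # sites control their resource usage here
    queue: deque = field(default_factory=deque)

    def submit(self, job):
        self.queue.append(job)

    def run_agent(self, do_transfer):
        """Agent loop: take queued jobs and hand them to the transfer machinery."""
        active = 0
        while self.queue and active < self.max_active:
            job = self.queue.popleft()
            job.state = "Active"
            do_transfer(job)       # e.g. delegate to gridftp or srm-cp
            job.state = "Done"
            active += 1

cern_to_ral = Channel("CERN", "RAL")
cern_to_ral.submit(TransferJob("srm://cern.example/f1", "srm://ral.example/f1"))
cern_to_ral.run_agent(lambda job: print("transferring", job.source_surl))
```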

49 Single channel (diagram).

50 Multiple channels (diagram): a single set of servers can manage multiple channels from a site.

51 What you need to run the server
- An Oracle database to hold the state (MySQL is on the list, but low-priority unless someone screams).
- A transfer server to run the transfer agents:
  - Agents responsible for assigning jobs to channels managed by that site.
  - Agents responsible for actually running the transfer (or for delegating the transfer to srm-cp).
- An application server (tested with Tomcat 5) to run the submission and monitoring portal, i.e. the thing you use to talk to the system.

52 Initial use models considered
- Tier-0 to Tier-1 distribution: proposal is to put the server at the Tier-0 (this was the model used in SC2).
- Tier-1 to Tier-2 distribution: proposal is to put the server at the Tier-1 (push); this is analogous to the SC2 model.
- Tier-2 to Tier-1 upload: proposal is to put the server at the Tier-1 (pull).

53 Summary
- Propose server at Tier-0 and Tier-1: Oracle DB, Tomcat application server, transfer node.
- Propose client tools at T0, T1 and T2: this is a UI / WN type install.
- Evaluation setup: initially at CERN T0, interacting with a T1 a la SC2; then expand to a few agreed T1s interacting with agreed T2s.

54 Disk Pool Manager aims
- Provide a solution for Tier-2s in LCG-2; this implies a few tens of terabytes in 2005.
- Focus on manageability:
  - Easy to install and configure.
  - Low effort for ongoing maintenance.
  - Easy to add/remove resources.
- Support for multiple physical partitions, on one or more disk server nodes.
- Support for different space types: volatile and permanent.
- Support for multiple replicas of hot files within the disk pools.

55 Manageability
- Few daemons to install.
- No central configuration files; disk nodes request to add themselves to the DPM.
- All state is kept in a DB (easy to restore after a crash).
- Easy to remove disks and partitions:
  - Allows simple reconfiguration of the disk pools.
  - The administrator can temporarily remove file systems from the DPM if a disk has crashed and is being repaired.
  - The DPM automatically marks a file system as "unavailable" when it is not contactable.

56 Features
- DPM access via different interfaces:
  - Direct socket interface.
  - SRM v1.
  - SRM v2 basic; also a large part of SRM v2 advanced.
  - Global space reservation (next version).
  - Namespace operations and permissions.
  - Copy and remote get/put (next version).
- Data access: Gridftp, rfio (ROOTD, XROOTD could be easily added).

57 Security
- GSI authentication and authorization:
  - Mapping is done from the client DN to a uid/gid pair; authorization is done in terms of uid/gid.
  - Ownership of files is stored in the DPM catalog, while the physical files on disk are owned by the DPM.
- Permissions implemented on files and directories: Unix (user, group, other) permissions and POSIX ACLs (groups and users).
- Propose to use SRM as the interface to set permissions in the Storage Elements (requires v2.1 minimum, with the directory and permission methods).
- VOMS will be integrated; VOMS roles appear as a list of gids.

58 Architecture
- The Light Weight Disk Pool Manager consists of:
  - The Disk Pool Manager with its configuration and request DB.
  - The Disk Pool Name Server.
  - The SRM servers.
  - The RFIOD and DPM-aware GsiFTP servers.
- How many machines?
  - DPM, DPNS and SRM can be installed on the same one.
  - RFIOD: on each disk server managed by the DPM.
  - GsiFTP: on each disk server managed by the DPM.

59 Status
- DPM will be part of the LCG 2.5.0 release, but is available now for testing.
- Satisfies the gLite requirement for an SRM interface at Tier-2.

60 Thank You

