RHIC/US ATLAS Tier 1 Computing Facility Site Report
Christopher Hollowell
Physics Department, Brookhaven National Laboratory
HEPiX, Upton, NY, USA
October 18, 2004
Facility Overview
● Created in the mid-1990s to provide centralized computing services for the RHIC experiments
● Expanded our role in the late 1990s to act as the Tier 1 computing center for ATLAS in the United States
● Currently employ 28 staff members; planning to add 5 more in the next fiscal year
Facility Overview (Cont.)
● Ramping up resources provided to ATLAS: Data Challenge 2 (DC2) underway
● RHIC Run 5 scheduled to begin in late December 2004
Centralized Disk Storage
● 37 NFS servers running Solaris 9: recently upgraded from Solaris 8
● Underlying filesystems upgraded to VxFS 4.0
– Issue with quotas on filesystems larger than 1 TB
● ~220 TB of Fibre Channel SAN-based RAID5 storage available: added ~100 TB in the past year
Centralized Disk Storage (Cont.)
● Scalability issues with NFS: network-limited to ~70 MB/s per server (75-90 MB/s maximum local I/O) in our configuration; testing of new network storage models, including Panasas and IBRIX, in progress (see the throughput sketch below)
– Panasas tests look promising: 4.5 TB of storage on 10 blades available for evaluation by our user community; DirectFlow client in use on over 400 machines
– Both systems allow NFS export of data
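For reference, a minimal sketch of the kind of sequential-write measurement behind the per-server figures above. The NFS mount point and file name are hypothetical, and a real benchmark would also control client-side caching, block sizes, and concurrent clients.

```python
#!/usr/bin/env python
# Minimal sequential-write probe against an NFS-mounted directory.
# The mount point below is a hypothetical example.
import os
import time

MOUNT = "/nfs/testvolume"                     # hypothetical NFS mount
TARGET = os.path.join(MOUNT, "throughput_probe.dat")
BLOCK = "x" * (1024 * 1024)                   # 1 MB write buffer
TOTAL_MB = 2048                               # 2 GB total, to get past client caching

start = time.time()
f = open(TARGET, "w")
for _ in range(TOTAL_MB):
    f.write(BLOCK)
f.flush()
os.fsync(f.fileno())                          # force the data out to the server
f.close()
elapsed = time.time() - start

print("Sequential write: %.1f MB/s" % (TOTAL_MB / elapsed))
os.remove(TARGET)
```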
Centralized Disk Storage: AFS
● Moving servers from Transarc AFS running on AIX to OpenAFS on Solaris 9
● The move from Transarc to OpenAFS is motivated by Kerberos 4/Kerberos 5 issues and the Transarc AFS end of life
● Total of 7 fileservers and 6 DB servers: 2 DB servers and 2 fileservers already running OpenAFS
● 2 cells
Mass Tape Storage
● Four STK Powderhorn silos, each capable of holding ~6000 tapes
● 1.7 PB of data currently stored
● HPSS version 4.5.1: likely upgrade to version 6.1 or 6.2 after RHIC Run 5
● 45 tape drives available for use
● Latest STK tape technology: 200 GB/tape
● ~12 TB disk cache in front of the system
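A rough nominal-capacity check from the figures above (uncompressed, and ignoring slots used for cleaning cartridges and operations):

4 silos × ~6000 tapes/silo × 200 GB/tape ≈ 4.8 PB nominal capacity, of which 1.7 PB is currently occupied.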
Mass Tape Storage (Cont.)
● PFTP, HSI, and HTAR available as interfaces
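A hedged sketch of scripted HPSS access through the HSI and HTAR interfaces. The HPSS paths and file names are invented for illustration; options should be checked against the local HSI/HTAR documentation.

```python
#!/usr/bin/env python
# Illustrative wrappers around the hsi and htar command-line interfaces to HPSS.
# All paths below are hypothetical examples.
import os

def hpss_put(local_path, hpss_path):
    """Store a local file into HPSS via hsi ('put local : remote')."""
    return os.system('hsi "put %s : %s"' % (local_path, hpss_path))

def hpss_get(local_path, hpss_path):
    """Retrieve a file from HPSS via hsi ('get local : remote')."""
    return os.system('hsi "get %s : %s"' % (local_path, hpss_path))

def hpss_archive_dir(directory, hpss_tar_path):
    """Bundle a directory of small files into one HPSS-resident tar
    archive with htar, avoiding many tiny files on tape."""
    return os.system('htar -cvf %s %s' % (hpss_tar_path, directory))

if __name__ == "__main__":
    hpss_put("run5_calib.dat", "/hpss/user/example/run5_calib.dat")
    hpss_archive_dir("ntuples/", "/hpss/user/example/ntuples.tar")
```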
CAS/CRS Farm
● Farm of 1423 dual-CPU (Intel) systems
– Added 335 machines this year
● ~245 TB of local disk storage (SCSI and IDE)
● Upgrade of the RHIC Central Analysis Servers/Central Reconstruction Servers (CAS/CRS) to Scientific Linux (plus updates) underway: should be complete before the next RHIC run
CAS/CRS Farm (Cont.)
● LSF (5.1) and Condor (6.6.6/6.6.5) batch systems in use; upgrade to LSF 6.0 planned (see the submission sketch below)
● Kickstart used to automate node installation
● GANGLIA plus custom software used for system monitoring
● Phasing out the original RHIC CRS batch system: replacing it with a system based on Condor
● Retiring 142 VA Linux 2U PIII 450 MHz systems after the next purchase
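A minimal sketch of handing a job to the Condor pool through a submit description file. The executable name, arguments, and output file names are hypothetical placeholders.

```python
#!/usr/bin/env python
# Write a Condor submit description file and hand it to condor_submit.
import os

SUBMIT_FILE = "reco_job.submit"

submit_description = """\
universe   = vanilla
executable = run_reco.sh
arguments  = run5_segment_001
output     = reco_001.out
error      = reco_001.err
log        = reco_001.log
queue
"""

f = open(SUBMIT_FILE, "w")
f.write(submit_description)
f.close()

# condor_submit hands the job description to the local schedd
os.system("condor_submit %s" % SUBMIT_FILE)
```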
Security
● Eliminating NIS; complete transition to Kerberos 5/LDAP in progress
● Expect a K5 TGT to X.509 certificate transition in the future: KCA?
● Hardening/monitoring of all internal systems
● Growing web service issues: unknown services being accessed through port 80
Grid Activities
● Brookhaven planning to upgrade external network connectivity from OC12 (622 Mbps) to OC48 (2.488 Gbps) to support ATLAS activity
● ATLAS Data Challenge 2: jobs submitted via Grid3
● GUMS (Grid User Management System)
– Generates grid-mapfiles for gatekeeper hosts (format sketched below)
– In production since May 2004
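For illustration, a small sketch of the grid-mapfile format that GUMS produces for the gatekeepers: each line maps a quoted certificate DN to a local account. The DN and account shown are made up.

```python
#!/usr/bin/env python
# Parse grid-mapfile lines of the form: "<certificate DN>" <local account>
import re

EXAMPLE = '"/DC=org/DC=doegrids/OU=People/CN=Jane Physicist 12345" usatlas1\n'

def parse_gridmap(lines):
    """Return a dict of {certificate DN: local account}."""
    mapping = {}
    for line in lines:
        m = re.match(r'^"(?P<dn>[^"]+)"\s+(?P<account>\S+)', line)
        if m:
            mapping[m.group("dn")] = m.group("account")
    return mapping

print(parse_gridmap([EXAMPLE]))
```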
Storage Resource Manager (SRM)
● SRM: middleware providing dynamic storage allocation and data management services
– Automatically handles network/space allocation failures (see the retry sketch below)
● HRM (Hierarchical Resource Manager)-type SRM server in production
– Accessible from within and outside the facility
– 350 GB cache
– Berkeley HRM 1.2.1
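A conceptual sketch only of the retry-on-transient-failure behaviour such a layer provides on top of raw transfers; the wrapper and the wrapped command are hypothetical and are not the Berkeley HRM client interface.

```python
#!/usr/bin/env python
# Conceptual retry wrapper: re-run a transfer command until it succeeds,
# mimicking how an SRM/HRM layer recovers from network/space failures.
import os
import time

def transfer_with_retry(transfer_command, max_attempts=5, backoff_seconds=60):
    """Run transfer_command, retrying with a fixed backoff on failure."""
    for attempt in range(1, max_attempts + 1):
        status = os.system(transfer_command)
        if status == 0:
            return True
        print("attempt %d failed (status %d), retrying in %ds"
              % (attempt, status, backoff_seconds))
        time.sleep(backoff_seconds)
    return False

if __name__ == "__main__":
    # Placeholder invocation; in practice this wraps whatever transfer
    # client the site provides for a given source and destination.
    transfer_with_retry("true", max_attempts=3, backoff_seconds=5)
```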
dCache
● Provides a global namespace over disparate storage elements
– Hot-spot detection
– Client data access through the libdcap library or the libpdcap preload library (see the sketch below)
● ATLAS and PHENIX dCache pools
– PHENIX pool expanding performance tests to production machines
– ATLAS pool interacts with HPSS using HSI: no way of throttling data transfer requests yet
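A sketch of the preload-library access path: an unmodified binary is started with libpdcap.so preloaded so that its ordinary POSIX open()/read() calls on dCache paths are routed through dcap. The library location and /pnfs path are hypothetical examples.

```python
#!/usr/bin/env python
# Run an existing tool with the dcap preload library so it can read a
# dCache-resident file by path. Paths below are hypothetical.
import os
import subprocess

env = dict(os.environ)
env["LD_PRELOAD"] = "/usr/lib/libpdcap.so"    # hypothetical install location

# Any existing binary can now read the dCache-resident file.
subprocess.call(
    ["md5sum", "/pnfs/usatlas.bnl.gov/data/example/file.root"],
    env=env)
```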