Presentation on theme: "The RHIC-ATLAS Computing Facility at BNL HEPIX – Edinburgh May 24-28, 2004 Tony Chan RHIC Computing Facility Brookhaven National Laboratory."— Presentation transcript:
The RHIC-ATLAS Computing Facility at BNL HEPIX – Edinburgh May 24-28, 2004 Tony Chan RHIC Computing Facility Brookhaven National Laboratory
Outline Background Mass Storage Central Disk Storage Linux Farm Monitoring Security & Authentication Future Developments Summary
Background Brookhaven National Lab (BNL) is a U.S. govt funded multi-disciplinary research laboratory. RACF formed in the mid-90s to address computing needs of RHIC experiments. Became U.S. Tier 1 Center for ATLAS in late 90s. RACF supports HENP and HEP scientific computing efforts and various general services (backup, e-mail, web, off-site data transfer, Grid, etc).
Background (continued) Currently 29 staff members (4 new hires in 2004). RHIC Year 4 just concluded. Performance surpassed all expectations.
Mass Storage 4 StorageTek tape silos managed via HPSS (v 4.5). Using 37 9940B drives (200 GB/tape). Aggregate bandwidth up to 700 MB/s. 10 data movers with 10 TB of disk. Total over 1.5 PB of raw data in 4 years of running (capacity for 4.5 PB).
Central Disk Storage Large SAN served via NFS DST + user home directories + scratch area. 41 Sun servers (E450 & V480) running Solaris 8 and 9. Plan to migrate all to Solaris 9 eventually. 24 Brocade switches & 250 TB of FB RAID5 managed by Veritas. Aggregate 600 MB/s data rate to/from Sun servers on average.
Central Disk Storage (cont.) RHIC and ATLAS AFS cells software repository + user home directories. Total of 11 AIX servers with 1.2 TB for RHIC and 0.5 TB for ATLAS. Transarc on server side, OpenAFS on client side. Considering OpenAFS for server side.
Linux Farm Used for mass processing of data. 1359 rack-mounted, dual-CPU (Intel) servers. Total of 1362 kSpecInt2000. Reliable (about 6 hardware failures per month at current farm size). Combination of SCSI & IDE disks with aggregate of 234+ TB of local storage.
Linux Farm (cont.) Experiments making significant use of local storage through custom job schedulers, data repository managers and rootd. Requires significant infrastructure resources (network, power, cooling, etc). Significant scalability challenges. Advance planning and careful design a must!
Linux Farm Software Custom RH 8 (RHIC) and 7.3 (ATLAS) images. Installed with Kickstart server. Support for compilers (gcc, PGI, Intel) and debuggers (gdb, Totalview, Intel). Support for network file systems (AFS, NFS) and local data storage.
Linux Farm Batch Management New Condor-based batch system with custom PYTHON front-end to replace old batch system. Fully deployed in Linux Farm. Use of Condor DAGman functionality to handle job dependencies. New system solves scalability problems of old system. Upgraded to Condor 6.6.5 (latest stable release) to implement advanced features (queue priority and preemption).
LSF v5.1 widely used in Linux Farm, specially for data analysis jobs. Peak rate of 350 K jobs/week. LSF possibly to be replaced by Condor if the latter can scale to similar peak job rates. Current Condor peak rates of 7 K jobs/week. Condor and LSF accepting jobs through GLOBUS. Condor scalability to be tested in ATLAS DC 2.
Security & Authentication Two layers of firewall with limited network services and limited interactive access through secure gateways. Migration to Kerberos 5 single sign-on and consolidation of password DBs. NIS passwords to be phased-out. Integration of K5/AFS with LSF to solve credential forwarding issues. Will need similar implementation for Condor. Implemented Kerberos certificate authority.
Future Developments HIS/HTAR deployment for UNIX-like access to HPSS. Moving beyond NFS-served SAN with more scalable solutions (Panasas, IBRIX, Lustre, NFS v.4.1, etc). dCache/SRM being evaluated as a distributed storage management solution to exploit high-capacity, low-cost local storage in the 1300+ node Linux Farm. Linux Farm OS upgrade plans (RHEL?).
US ATLAS Grid Testbed Internet HPSS Condor pool Gatekeeper Job manager Disks Grid Job Requests Globus -client 17TB 70MB/S atlas02 aafs amds Mover aftpexp00 GridFtp giis01 Information Server AFS server Globus RLS Server Local Grid development currently focused on monitoring, user management and support for DC2 production activities
Summary RHIC run very successful. Increasing staff levels to support increasing level of computing support activities. On-going evaluation of scalable solutions (dCache, Panasas, Condor, etc) in a distributed computing environment. Increased activity to support upcoming ATLAS DC 2 production.