The RHIC-ATLAS Computing Facility at BNL HEPIX – Edinburgh May 24-28, 2004 Tony Chan RHIC Computing Facility Brookhaven National Laboratory

Outline Background Mass Storage Central Disk Storage Linux Farm Monitoring Security & Authentication Future Developments Summary

Background Brookhaven National Lab (BNL) is a U.S. government-funded, multi-disciplinary research laboratory. The RACF was formed in the mid-90s to address the computing needs of the RHIC experiments and became the U.S. Tier 1 Center for ATLAS in the late 90s. The RACF supports HENP and HEP scientific computing efforts and various general services (backup, email, web, off-site data transfer, Grid, etc.).

Background (continued) Currently 29 staff members (4 new hires in 2004). RHIC Year 4 just concluded. Performance surpassed all expectations.

Staff Growth at the RACF

RACF Structure

Mass Storage 4 StorageTek tape silos managed via HPSS (v4.5), using STK 9940B drives (200 GB/tape). Aggregate bandwidth up to 700 MB/s. 10 data movers with 10 TB of disk. Over 1.5 PB of raw data stored in 4 years of running (capacity for 4.5 PB).

The Mass Storage System

Central Disk Storage Large SAN served via NFS: DST, user home directories and scratch areas. 41 Sun servers (E450 & V480) running Solaris 8 and 9, with plans to eventually migrate all to Solaris 9. 24 Brocade switches and 250 TB of FC RAID5 managed by Veritas. Aggregate data rate of 600 MB/s to/from the Sun servers on average.

Central Disk Storage (cont.) RHIC and ATLAS AFS cells: software repositories and user home directories. Total of 11 AIX servers with 1.2 TB for RHIC and 0.5 TB for ATLAS. Transarc AFS on the server side, OpenAFS on the client side; considering OpenAFS for the server side as well.

The Central Disk Storage System

Linux Farm Used for mass processing of data on rack-mounted, dual-CPU (Intel) servers. Total of 1362 kSpecInt2000. Reliable (about 6 hardware failures per month at current farm size). Combination of SCSI & IDE disks with an aggregate of 234+ TB of local storage.

Linux Farm (cont.) Experiments make heavy use of local storage through custom job schedulers, data repository managers and rootd. The farm requires significant infrastructure resources (network, power, cooling, etc.) and poses real scalability challenges; advance planning and careful design are a must!

The Growth of the Linux Farm

The Linux Farm in the RACF

Linux Farm Software Custom Red Hat 8 (RHIC) and 7.3 (ATLAS) images, installed via a Kickstart server. Support for compilers (gcc, PGI, Intel) and debuggers (gdb, Totalview, Intel). Support for network file systems (AFS, NFS) and local data storage.
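
The slides do not show the installation configuration itself; as a rough illustration of how per-node Kickstart files for such images might be generated, here is a minimal Python sketch. The Kickstart directives, package list, server URL and node names are illustrative assumptions, not the actual RACF setup.

```python
#!/usr/bin/env python
# Hypothetical sketch: generate per-node Kickstart files from a common template.
# The directives, URL, partitioning and packages below are illustrative only.

KICKSTART_TEMPLATE = """install
url --url http://ksserver.example.org/redhat/{release}
lang en_US
keyboard us
network --bootproto dhcp --hostname {hostname}
rootpw --iscrypted {root_hash}
timezone America/New_York
clearpart --all
part / --fstype ext3 --size 1024 --grow
reboot

%packages
openafs
gcc
"""

def write_kickstart(hostname, release="7.3", root_hash="$1$changeme"):
    """Write a Kickstart file for one farm node and return its file name."""
    filename = "%s-ks.cfg" % hostname
    with open(filename, "w") as f:
        f.write(KICKSTART_TEMPLATE.format(hostname=hostname,
                                          release=release,
                                          root_hash=root_hash))
    return filename

if __name__ == "__main__":
    for node in ["rcas0001", "rcas0002"]:   # illustrative node names
        print("wrote", write_kickstart(node))
```

The generated files would then be served by the Kickstart server and referenced at install time.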

Linux Farm Batch Management New Condor-based batch system with a custom Python front-end replaces the old batch system and is now fully deployed in the Linux Farm. Condor DAGMan functionality is used to handle job dependencies. The new system solves the scalability problems of the old one. Upgraded to the latest stable Condor release to implement advanced features (queue priority and preemption).
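
The front-end itself is not detailed in the slides; the hypothetical Python sketch below shows the kind of glue such a front-end involves, writing one Condor submit description per job and a DAGMan file that expresses the dependency between them. The executables, file names and the two-step reco/ana workflow are illustrative assumptions.

```python
#!/usr/bin/env python
# Hypothetical sketch of a Python front-end that emits Condor submit files and
# a DAGMan file for a two-step workflow (reconstruction followed by analysis).
# Paths and executables are illustrative, not the RACF's actual layout.

import os

SUBMIT_TEMPLATE = """universe   = vanilla
executable = {executable}
arguments  = {arguments}
output     = {name}.out
error      = {name}.err
log        = {name}.log
queue
"""

def write_submit(workdir, name, executable, arguments=""):
    """Write a minimal Condor submit description and return its path."""
    path = os.path.join(workdir, name + ".sub")
    with open(path, "w") as f:
        f.write(SUBMIT_TEMPLATE.format(name=name, executable=executable,
                                       arguments=arguments))
    return path

def write_dag(workdir, jobs, dependencies):
    """Write a DAGMan input file; jobs maps job name -> submit file,
    dependencies is a list of (parent, child) pairs."""
    path = os.path.join(workdir, "workflow.dag")
    with open(path, "w") as f:
        for name, sub in jobs.items():
            f.write("JOB %s %s\n" % (name, sub))
        for parent, child in dependencies:
            f.write("PARENT %s CHILD %s\n" % (parent, child))
    return path

if __name__ == "__main__":
    workdir = "."
    jobs = {
        "reco": write_submit(workdir, "reco", "/usr/local/bin/reco.sh"),
        "ana":  write_submit(workdir, "ana",  "/usr/local/bin/ana.sh"),
    }
    dag = write_dag(workdir, jobs, [("reco", "ana")])
    # The DAG is then handed to DAGMan with: condor_submit_dag workflow.dag
    print("wrote", dag)
```

With this DAG, DAGMan only releases the "ana" job once its parent "reco" job has completed successfully, which is the dependency handling referred to above.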

Linux Farm Batch Management (cont.)

LSF v5.1 is widely used in the Linux Farm, especially for data analysis jobs, with a peak rate of 350 K jobs/week. LSF may be replaced by Condor if the latter can scale to similar peak job rates; current Condor peak rates are 7 K jobs/week. Both Condor and LSF accept jobs through Globus. Condor scalability will be tested in ATLAS DC2.

Condor Usage at the RACF

Monitoring Mix of open-source, RCF-designed and vendor-provided monitoring software. Designed for persistency and fault tolerance, near real-time information, and scalability.
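
As a minimal sketch of the kind of near real-time, fault-tolerant polling involved (assuming nothing about the actual RCF-designed tools; node names, probe port and log layout are illustrative), a simple farm poller might look like this:

```python
#!/usr/bin/env python
# Hypothetical farm poller: probe a TCP port on each node, tolerate individual
# failures, and append timestamped results to a plain-text log so that state
# survives a restart. Node names, port and file name are illustrative.

import socket
import time

NODES = ["rcas0001", "rcas0002", "rcas0003"]   # illustrative node names
PORT = 22                                      # e.g. sshd as a liveness probe
INTERVAL = 60                                  # seconds between sweeps
LOGFILE = "farm_status.log"

def probe(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        s = socket.create_connection((host, port), timeout)
        s.close()
        return True
    except (socket.error, socket.timeout):
        return False

def sweep():
    """Probe every node once and append the results to the log."""
    now = time.strftime("%Y-%m-%d %H:%M:%S")
    with open(LOGFILE, "a") as log:
        for host in NODES:
            status = "up" if probe(host, PORT) else "DOWN"
            log.write("%s %s %s\n" % (now, host, status))

if __name__ == "__main__":
    while True:       # a failure on one node never stops the sweep loop
        sweep()
        time.sleep(INTERVAL)
```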

Mass Storage Monitoring

Central Disk Storage Monitoring

Linux Farm Monitoring

Temperature Monitoring

Security & Authentication Two layers of firewall, with limited network services and limited interactive access through secure gateways. Migration to Kerberos 5 single sign-on and consolidation of password DBs; NIS passwords to be phased out. Integration of K5/AFS with LSF to solve credential-forwarding issues; a similar implementation will be needed for Condor. Implemented a Kerberos certificate authority.

Future Developments HSI/HTAR deployment for UNIX-like access to HPSS. Moving beyond the NFS-served SAN to more scalable solutions (Panasas, IBRIX, Lustre, NFS v4.1, etc.). dCache/SRM being evaluated as a distributed storage management solution to exploit the high-capacity, low-cost local storage in the Linux Farm nodes. Linux Farm OS upgrade plans (RHEL?).

US ATLAS Grid Testbed [diagram: Grid job requests arrive from the Internet through a Globus gatekeeper/job manager into a Condor pool, backed by HPSS (17 TB of disk, 70 MB/s), a GridFtp server, an AFS server, a Globus RLS server and an information server.] Local Grid development is currently focused on monitoring, user management and support for DC2 production activities.

Summary The RHIC run was very successful. Staff levels are increasing to support the growing level of computing support activities. On-going evaluation of scalable solutions (dCache, Panasas, Condor, etc.) in a distributed computing environment. Increased activity to support the upcoming ATLAS DC2 production.