Site Report: The Linux Farm at the RCF HEPIX-HEPNT October 22-25, 2002 Ofer Rind RHIC Computing Facility Brookhaven National Laboratory.


1 Site Report: The Linux Farm at the RCF HEPIX-HEPNT October 22-25, 2002 Ofer Rind RHIC Computing Facility Brookhaven National Laboratory

2 RCF - Overview
- Provide computing facilities for RHIC users:
  - General computing environment
    - General interactive tasks (email, document processing, web)
    - Data analysis facility
  - Computing infrastructure for RHIC experiments
    - Code development, repository & distribution
    - Raw data recording & reconstruction
    - Data analysis
- ACF: US Atlas Tier 1 Computing Facility
  - Shared infrastructure and synergy with RCF
- Support staff: 25 FTEs (4 dedicated to Linux Farm)
Ofer Rind - RHIC Computing Facility Site Report

3 RCF - Structure

4 RCF - Component Summary
- Mass Storage Subsystem
  - StorageTek library managed by HPSS
  - 4 silos, 1.2 PB capacity (expanding to 4.5 PB)
  - In Run-2, raw data recorded at a common rate of 70 MB/sec for a total of 170 TB
  - Total data store ~300 TB
- Disk Storage
  - Fibre Channel SAN served by NFS
  - ~110 TB RAID 5
  - 14 Sun 450, Solaris 8 [2-02] (5 Sun 480 coming online)
  - IBM AFS servers (AIX)
- Linux Server Farm

5 Linux Farm Hardware
- 840 1U and 2U servers (pre-'99 towers have been retired)
- 69 kSPECint95, expanding to 100 kSPECint95 (2+ TFLOPS)
- Most have 1 GB memory (at least 500 MB)
- Local SCSI disks up to 140 GB/node
- Allocated by experiment
- Further allocated for Raw Data Reconstruction (CRS) and Reconstructed Data Analysis (CAS)

Vendor    CPU             Nodes  Date
VA Linux  PIII 450 MHz    148    Jun 99
VA Linux  PIII 700 MHz    48     Aug 00
VA Linux  PIII 800 MHz    168    Nov 00
IBM       PIII 1000 MHz   316    Aug 01
IBM       PIII 1400 MHz   160    Oct 02
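As a quick consistency check, the per-purchase node counts listed on this slide can be summed and compared with the stated total of 840 servers. The sketch below is illustrative only: the counts come from the slide, while the tuple layout and the function names are assumptions made for this example.

```python
# Node counts per purchase, as listed on the slide: (vendor, CPU, nodes, date).
# The tuple layout is an illustrative assumption, not an RCF data format.
purchases = [
    ("VA Linux", "PIII 450 MHz", 148, "Jun 99"),
    ("VA Linux", "PIII 700 MHz", 48, "Aug 00"),
    ("VA Linux", "PIII 800 MHz", 168, "Nov 00"),
    ("IBM", "PIII 1000 MHz", 316, "Aug 01"),
    ("IBM", "PIII 1400 MHz", 160, "Oct 02"),
]


def total_nodes(rows):
    """Sum the node-count column of the purchase list."""
    return sum(count for _vendor, _cpu, count, _date in rows)


def nodes_by_vendor(rows):
    """Return a dict mapping each vendor to its total node count."""
    totals = {}
    for vendor, _cpu, count, _date in rows:
        totals[vendor] = totals.get(vendor, 0) + count
    return totals


if __name__ == "__main__":
    print(total_nodes(purchases))
    print(nodes_by_vendor(purchases))
```

The five purchases add up to the 840 servers quoted at the top of the slide, which suggests the total already includes the 160 IBM nodes dated Oct 02.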

6 Linux Farm Software Configuration
- RedHat 7.2 upgraded to 2.4.9-31 kernel
- Image(s) installed via Kickstart server and customized for RCF environment via rpm
- NFS + AFS home directory and file access
- Interactive login allowed on selected nodes
- Job management:
  - (CAS) LSF 4.2, slightly re-architected for robustness. Peak throughput before summer conferences was >150K jobs/week.
  - (CRS) Locally produced Perl-based batch system (AIX needed for HPSS API). Approx. 670K jobs processed for Run-2.
- Expanding use of distributed disk models (rootd, ??)
- Atlas Grid testbed

7 Tracking LSF Usage
[Figure: STAR queues weekly job statistics (week of Oct. 10). Panels: job starts/hr, avg runtime/hr, runtime.]

8 Security and Monitoring
- Security:
  - RCF firewall within BNL site firewall
  - SSH2-only access through gateway bastion nodes (Solaris x86)
  - User access restricted to a subset of systems (CAS only)
- Monitoring:
  - 24 hr. on-call staff for critical systems during RHIC operation
  - Cluster mgmt. software:
    - VACM (VA Linux)
    - xCAT (IBM, http://www.x-cat.org)
  - Cron scripts to "clean" nodes and head off possible problems (memory leaks, full disks, etc.)
  - CTS system for problem reports
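The slide does not show the cron scripts themselves. As a rough illustration of the "full disks" case, a periodic check might look something like the sketch below. The function names and the 90% threshold are assumptions made for this example, not the RCF scripts.

```python
import shutil


def percent_used(total, used):
    """Return used space as a percentage of total (0.0 if total is not positive)."""
    if total <= 0:
        return 0.0
    return 100.0 * used / total


def needs_cleanup(path, threshold_pct=90.0):
    """True if the filesystem holding `path` is at or above the threshold.

    The 90% default is an arbitrary illustrative value.
    """
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return percent_used(usage.total, usage.used) >= threshold_pct


if __name__ == "__main__":
    # A cron job could run a check like this and trigger cleanup or an alert.
    if needs_cleanup("."):
        print("disk nearly full: cleanup needed")
    else:
        print("disk usage OK")
```

Keeping the percentage calculation separate from the filesystem call makes the threshold logic easy to test without touching a real disk.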

9 Farm Alert System
- Web-monitoring (user-accessible) plus paging/email alerts
- Python scripts running locally transfer node status information to a MySQL database
- Notification of problems with NFS/AFS (e.g. stale file handles), LSF daemons, high load, etc.
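The flow described on this slide (nodes report status into a database, and problems are picked out for notification) can be sketched as follows. This is a minimal illustration, not the RCF code: it uses Python's built-in sqlite3 module as a stand-in for MySQL so that it runs without a server, and the table layout, column names, function names, and load threshold are all assumptions.

```python
import sqlite3


def open_status_db(path=":memory:"):
    """Open the status database (sqlite3 stands in for MySQL here)."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS node_status ("
        " node TEXT, checked_at INTEGER, load1 REAL, nfs_ok INTEGER)"
    )
    return conn


def record_status(conn, node, checked_at, load1, nfs_ok):
    """Store one status report from a node (hypothetical schema)."""
    conn.execute(
        "INSERT INTO node_status VALUES (?, ?, ?, ?)",
        (node, checked_at, load1, int(nfs_ok)),
    )
    conn.commit()


def find_alerts(conn, max_load=10.0):
    """Return (node, reason) pairs based on each node's most recent report."""
    rows = conn.execute(
        "SELECT s.node, s.load1, s.nfs_ok FROM node_status s "
        "JOIN (SELECT node, MAX(checked_at) AS latest "
        "      FROM node_status GROUP BY node) m "
        "ON s.node = m.node AND s.checked_at = m.latest "
        "ORDER BY s.node"
    ).fetchall()
    alerts = []
    for node, load1, nfs_ok in rows:
        if not nfs_ok:
            alerts.append((node, "NFS problem"))
        if load1 > max_load:
            alerts.append((node, "high load"))
    return alerts


if __name__ == "__main__":
    db = open_status_db()
    record_status(db, "node001", 100, 0.5, True)
    record_status(db, "node002", 100, 25.0, True)
    print(find_alerts(db))
```

Only the latest report per node is considered, so a node that has recovered stops generating alerts. In a real deployment the returned list would feed the paging/email step.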

10 Network Operation Status
- Perl scripts monitor network service connectivity for all nodes (ssh, yp, etc.)
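The slide names Perl as the language of the actual scripts. For consistency with the other examples, the basic idea of a connectivity probe is sketched here in Python instead. It covers only services on a fixed TCP port such as ssh (port 22). Services reached through the RPC portmapper, such as yp, would need a different probe. The function names, the port map, and the timeout are assumptions for this example.

```python
import socket

# Hypothetical service-to-port map; only fixed-port TCP services fit this probe.
SERVICE_PORTS = {"ssh": 22}


def check_service(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def unreachable(hosts, port, timeout=2.0):
    """Return the hosts whose service on `port` did not answer."""
    return [h for h in hosts if not check_service(h, port, timeout)]


if __name__ == "__main__":
    down = unreachable(["localhost"], SERVICE_PORTS["ssh"])
    print("ssh unreachable on:", down)
```

A successful TCP connect shows only that something is listening on the port, not that the service is healthy, so a probe like this is best treated as a first-level check.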

11 Load Monitoring and History
- MySQL database for usage history
- History available back to Sept. '01 via web interface
[Figure: CPU load averaged over (98) PHENIX machines during the month of September.]
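The plot on this slide shows load averaged over a group of machines. A minimal sketch of that kind of aggregation is given below, assuming the usage history can be read out as (timestamp, node, load) rows. The row format and function name are assumptions for this example, and the database query that would produce the rows is omitted.

```python
from collections import defaultdict


def average_load_by_time(samples):
    """Average load across nodes for each timestamp.

    samples: iterable of (timestamp, node, load) rows (hypothetical layout).
    Returns a dict {timestamp: mean load}, ordered by timestamp.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for timestamp, _node, load in samples:
        sums[timestamp] += load
        counts[timestamp] += 1
    return {ts: sums[ts] / counts[ts] for ts in sorted(sums)}


if __name__ == "__main__":
    rows = [
        (1, "nodeA", 1.0),
        (1, "nodeB", 3.0),
        (2, "nodeA", 2.0),
    ]
    print(average_load_by_time(rows))
```

Each timestamp is averaged over however many nodes reported at that time, so a node that misses a sample does not pull the average toward zero.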

12 Plans for the Near Future
- 160 newly delivered IBM nodes to be brought online
- Expect purchase bid to go out for ~220 more nodes at beginning of FY03 (pending funding approval)
- Scaling up data storage capacity and throughput for Run-3 (up to 10X data increase over Run-2, starting in December)
- Evaluation of LSF 5 and Condor ongoing, with an eye towards distributed disk services
- Expanding Atlas GRID services

