Fermilab Site Report
Mark O. Kaletka, Head, Core Support Services Department, Computing Division

CD mission statement
The Computing Division's mission is to play a full part in the mission of the laboratory and in particular: To proudly develop, innovate, and support excellent and forefront computing solutions and services, recognizing the essential role of cooperation and respect in all interactions between ourselves and with the people and organizations that we work with and serve.

How we are organized

We participate in all areas

Production system capacities

Growth in farms usage

Growth in farms density

Projected growth of computers

Projected power growth

Computer rooms
Provide space, power & cooling for central computers
Problem: increasing luminosity
  – ~2,600 computers in FCC
  – Expect to add ~1,000 systems/year
  – FCC has run out of power & cooling, cannot add utility capacity
New Muon Lab
  – 256 systems for Lattice Gauge theory
  – CDF early buys of 160 systems + 160 CDF existing systems from FCC
  – Developing plan for another room
Wide Band
  – Long term phased plan FY04–08
  – FY04/05 build: 2,880 computers (~$3M)
  – Tape robot room in FY05
  – FY06/07: ~3,000 computers

Computer rooms

Storage and data movement
1.72 PB of data in ATL
  – Ingest of ~100 TB/mo
Many 10s of TB fed to analysis programs each day
Recent work:
  – Parameterizing storage systems for SRM
      – Apply to SAM
      – Apply more generally
  – VO notions in storage systems
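For a sense of scale, the monthly ingest figure can be converted to an average sustained rate. This is a minimal sketch, assuming decimal units (1 TB = 10^12 bytes) and a 30-day month, neither of which the slide specifies; the function name is illustrative.

```python
def sustained_rate_mb_per_s(tb_per_month: float, days_per_month: float = 30.0) -> float:
    """Convert a monthly volume in TB to an average rate in MB/s (decimal units)."""
    bytes_per_month = tb_per_month * 1e12
    seconds_per_month = days_per_month * 24 * 3600
    return bytes_per_month / seconds_per_month / 1e6

rate = sustained_rate_mb_per_s(100)   # ~100 TB/mo from the slide
print(f"{rate:.1f} MB/s")             # 38.6 MB/s
```

This is an around-the-clock average; peak rates into the tape system would be higher.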

FNAL Starlight dark fiber project
FNAL dark fiber to Starlight
  – Completion: mid-June 2004
  – Initial DWDM configuration:
      – One 10 Gb/s (LAN_PHY) channel
      – Two 1 Gb/s (OC48) channels
Intended uses of link
  – WAN network R&D projects
  – Overflow for production traffic:
      – ESnet link to remain production network link
  – Redundant offsite path

General network improvements
Core network upgrades
  – Switch/router (Catalyst 6500s) supervisors upgraded:
      – 720 Gb/s switching fabric (Sup720s); provides 40 Gb/s per slot
  – Initial deployment of 10 Gb/s backbone links
1000BASE-T support expanded
  – Ubiquitous on computer room floors:
      – New farms acquisitions supported on gigabit ethernet ports
  – Initial deployment in a few office areas

Network security improvements
Mandatory node registration for network access
  – Hotel-like temporary registration utility for visitors
  – System vulnerability scan is part of the process
Automated network scan blocker deployed
  – Based on quasi-real time network flow data analysis
  – Blocks outbound & inbound scans
VPN service deployed
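The slide does not describe how the scan blocker decides what to block. The sketch below shows one common heuristic, not necessarily the one deployed: flag a source that contacts more than a threshold number of distinct (host, port) targets within a sliding time window. The `Flow` record layout, the window length, and the threshold are all hypothetical.

```python
from collections import defaultdict, deque
from typing import NamedTuple

class Flow(NamedTuple):
    ts: float    # flow start time in seconds (hypothetical record layout)
    src: str     # source address
    dst: str     # destination address
    dport: int   # destination port

def find_scanners(flows, window_s=60.0, max_targets=100):
    """Return sources that touch more than max_targets distinct
    (dst, dport) pairs within any window of window_s seconds."""
    recent = defaultdict(deque)   # src -> deque of (ts, (dst, dport))
    flagged = set()
    for f in sorted(flows, key=lambda f: f.ts):
        q = recent[f.src]
        q.append((f.ts, (f.dst, f.dport)))
        # Drop entries that have aged out of the window.
        while q and f.ts - q[0][0] > window_s:
            q.popleft()
        if len({target for _, target in q}) > max_targets:
            flagged.add(f.src)
    return flagged

# One host sweeping 150 addresses on port 22 in 15 s, plus one ordinary flow.
flows = [Flow(i * 0.1, "10.0.0.5", f"192.168.1.{i}", 22) for i in range(1, 151)]
flows.append(Flow(1.0, "10.0.0.9", "192.168.1.1", 80))
print(find_scanners(flows))   # {'10.0.0.5'}
```

A production system would also need allow-lists for legitimately chatty hosts (mail gateways, monitoring) and a policy for lifting blocks.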

Central services
Email
  – Spam tagging in place
      – X-Spam-Flag: YES
  – Capacity upgrades for gateways, IMAP servers, virus scanning
  – Redundant load sharing
AFS
  – Completely on OpenAFS
  – SAN for backend storage
  – TiBS backup system
  – DOE-funded SBIR for performance investigations
Windows
  – Two-tier patching system for Windows
      – 1st tier under control of OU (PatchLink)
      – 2nd tier domain-wide (SUS)
      – 0 Sasser infections post-implementation
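Tagging, as opposed to blocking, leaves the filtering decision to the recipient. A minimal sketch of how a client-side filter might test for the `X-Spam-Flag: YES` header, using Python's standard `email` module; the sample messages and addresses are made up for illustration.

```python
from email import message_from_string

def is_tagged_spam(raw_message: str) -> bool:
    """True if the message carries an X-Spam-Flag header set to YES."""
    msg = message_from_string(raw_message)
    return str(msg.get("X-Spam-Flag", "")).strip().upper() == "YES"

tagged = "From: a@example.org\nSubject: offer\nX-Spam-Flag: YES\n\nbody\n"
clean = "From: b@example.org\nSubject: meeting\n\nbody\n"
print(is_tagged_spam(tagged), is_tagged_spam(clean))   # True False
```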

Central services -- backups
Site-wide backup plan is moving forward
  – SpectraLogic T950-5
  – 8 SAIT-1 drives
  – Initial 450 tape capacity for 7 TB pilot project
Plan for modular expansion to over 200 TB

Computer security
Missed by Linux rootkit epidemic
  – but no theoretical reason for immunity
Experimenting w/ AFS cross-cell authentication
  – w/ Kerberos 5 authentication
  – subtle ramifications
DHCP registration process
  – includes security scan, does not (yet) deny access
  – a few VIPs have been tapped during meetings
Vigorous self-scanning program
  – based on nessus
  – maintain database of results
  – look especially for critical vulnerabilities (& deny access)

Run II – D0
D0 reprocessed 600M events in fall 2003
  – using grid style tools, 100M of those events processed offsite at 5 other facilities
  – Farm production capacity is roughly 25M events per week
  – MC production capacity is 1M events per week
  – about 1B events/week on the analysis systems.
Linux SAM station on a 2 TB fileserver to serve the new analysis nodes
  – next step in the plan to reduce D0min
  – station has been extremely performant, expanding the Linux SAM cache
  – station typically delivers about 15 TB of data and 550M events per week.
Rolled out an MC production system that has grid-style job submission
  – JIM component of SAM-Grid
Torque (sPBS) is in use on the most recent analysis nodes
  – has been much more robust than PBS.
Linux fileservers are being used as "project" space
  – physics group managed storage with high access patterns
  – good results.
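The SAM station delivery figures imply an average size per delivered event. A rough sketch, assuming decimal units (1 TB = 10^12 bytes); the result averages over whatever mix of data formats the station serves, which the slide does not break down.

```python
def average_event_size_kb(tb_delivered: float, events: float) -> float:
    """Average kB per event for a given delivered volume and event count."""
    return tb_delivered * 1e12 / events / 1e3

size_kb = average_event_size_kb(15, 550e6)   # 15 TB and 550M events per week
print(f"{size_kb:.1f} kB/event")             # 27.3 kB/event
```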

MINOS & BTeV status
MINOS
  – data taking in early 2005
  – using standard tools
      – Fermi Linux
      – General-purpose farms
      – AFS
      – Oracle
      – enstore & dcache
      – ROOT
BTeV
  – preparations for CD-1 review by DOE
      – included review of online (but not offline) computing
      – novel feature is that much of the Level2/3 trigger software will be part of the offline reconstruction software

US-CMS computing
DC04 Data Challenge and the preparation for the computing TDR
  – preparation for the Physics TDR (P-TDR)
  – roll out of the LCG Grid service and federating it with the U.S. facilities
Develop the required Grid and Facilities infrastructure
  – increase the facility capacity through equipment upgrades
  – commission Grid capabilities through Grid2003 and LCG-1 efforts
  – develop and integrate required functionalities and services
Increase the capability of User Analysis Facility
  – improve how a physicist would use facilities and software
  – facilities and environment improvements
  – software releases, documentation, web presence etc.

US-CMS computing – Tier 1
136 Worker Nodes (Dual 1U Xeon Servers and Dual 1U Athlon)
  – 240 CPUs for Production (174 kSI2000)
  – 32 CPUs for Analysis (26 kSI2000)
All systems purchased in 2003 are connected over gigabit
37 TB of Disk Storage
  – 24 TB in Production for Mass Storage Disk Cache
      – In 2003 we switched to SATA disks in external enclosures connected over Fibre Channel
      – Only marginally more expensive than 3ware-based systems, and much easier to administer.
  – 5 TB of User Analysis Space
      – Highly available, high performance, backed-up space
  – 8 TB Production Space
70 TB of Mass Storage Space
  – Limited by tape purchases and not silo space
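The Tier 1 figures are internally consistent, which a few lines of arithmetic can confirm. The sketch below uses only numbers quoted on the slide; the dictionary keys and the helper name are illustrative.

```python
# Figures quoted on the slide.
production = {"cpus": 240, "ksi2000": 174}
analysis = {"cpus": 32, "ksi2000": 26}
disk_tb = {"mass storage disk cache": 24, "user analysis": 5, "production": 8}

def si2000_per_cpu(pool: dict) -> float:
    """Average SPECint2000 rating per CPU in a pool."""
    return pool["ksi2000"] * 1000 / pool["cpus"]

total_cpus = production["cpus"] + analysis["cpus"]
total_disk = sum(disk_tb.values())

print(si2000_per_cpu(production))   # 725.0
print(si2000_per_cpu(analysis))     # 812.5
print(total_cpus)                   # 272, i.e. 136 dual-CPU nodes
print(total_disk)                   # 37 TB, matching the quoted total
```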

US-CMS computing

US-CMS computing – DC03 & GRID 2003
Over 72K CPU-hours used in a week
100 TB of data transferred across Grid3 sites
Peak numbers of jobs approaching 900
Average numbers during the daytime over 500
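The weekly CPU-hour total can be turned into an average number of CPUs kept busy around the clock, which gives a cross-check against the quoted job counts. A minimal sketch; since the slide says "over 72K", the result is a lower bound.

```python
HOURS_PER_WEEK = 7 * 24

def average_busy_cpus(cpu_hours: float, wall_hours: float = HOURS_PER_WEEK) -> float:
    """Average number of CPUs in use over the wall-clock period."""
    return cpu_hours / wall_hours

avg = average_busy_cpus(72_000)
print(round(avg))   # 429
```

An around-the-clock average of roughly 430 busy CPUs is in line with daytime averages above 500 jobs and peaks approaching 900.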

US-CMS computing – DC04

1st LHC magnet leaving FNAL for CERN

And our science has shown up in some unusual journals…
Her sneakers squeaked as she walked down the halls where Lederman had walked. The 7th floor of the high-rise was where she did her work, and she found her way to the small, functional desk in the back of the pen.
