Slide 1: BNL Tier 1 Service Planning & Monitoring
Bruce G. Gibbard
GDB, 5-6 September 2006

Slide 2: Service Planning (Requirements)
- The Tier 1 Center at BNL serves only ATLAS.
  - Some members of the staff are ATLAS collaborators.
  - Some participate in ATLAS planning:
    - ATLAS computing model
    - ATLAS Computing TDR, etc.
- The base requirement is the agreed fraction of the appropriate revision of the overall ATLAS Tier 1 requirement from the computing model/TDR.
- The BNL Tier 1 site also supplies additional US ATLAS specific capacity.
- The requirement is a total (a sizing sketch follows this slide):
  - CPU, disk, tape, network bandwidth
  - Plus the associated, implementation-dependent:
    - Cyber infrastructure (LAN, LDAP, backup, Grid servers, etc.)
    - Staffing
    - Physical infrastructure (space, power, cooling, fire protection, security, etc.)
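
As an illustration of the sizing arithmetic above, the sketch below computes a site requirement as an agreed fraction of the overall ATLAS Tier 1 requirement plus a US ATLAS specific add-on. All figures (the share and the capacity numbers) are hypothetical placeholders, not BNL's actual 2006 pledge.

```python
# Hypothetical sizing sketch: site requirement = agreed share of the overall
# ATLAS Tier 1 requirement (computing model/TDR) + US ATLAS specific capacity.
# Every number below is an illustrative placeholder, not an actual pledge.

ATLAS_TIER1_TOTAL = {"cpu_ksi2k": 24000, "disk_tb": 14000, "tape_tb": 9000}
BNL_SHARE = 0.23                                            # assumed agreed fraction
US_EXTRA = {"cpu_ksi2k": 1000, "disk_tb": 500, "tape_tb": 300}  # US-specific add-on

def site_requirement(total, share, extra):
    """Scale each capacity by the site's share, then add the US-specific capacity."""
    return {k: total[k] * share + extra.get(k, 0) for k in total}

for resource, amount in site_requirement(ATLAS_TIER1_TOTAL, BNL_SHARE, US_EXTRA).items():
    print(f"{resource}: {amount:,.0f}")
```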

Slide 3: Service Planning (Facility Evolution & Cost Plan)
- Facility staff convert the requirements into a plan for the current and future years, including:
  - A projected set of costed capital equipment procurements, based on:
    - In-house experience
    - Experience at other sites
    - Interactions with vendors, etc.
  - A projected operations budget:
    - Labor
    - Licenses, maintenance, media, small equipment
    - Space and power charges, general overhead
- With the approval of US ATLAS program management, this plan is:
  - Integrated into the overall Computing Facilities Plan, which is then
  - Integrated into the overall Software and Computing Plan, which is then
  - Integrated into the overall US ATLAS Program Plan.

Slide 4: Service Planning (Review and Approval)
- The program and plans are reviewed:
  - By the funding agencies:
    - Department of Energy
    - National Science Foundation
  - Twice yearly:
    - The most detailed review, usually in winter, involves many technically knowledgeable consultants.
- Each review flags issues and makes recommendations.
- US ATLAS program management, with agency approval, allocates funds:
  - Out of common US ATLAS program funding
  - To the Tier 1 for the current year.

Slide 5: Execution of Plan
- Funding is by US fiscal year, starting October 1 rather than January 1.
- Funding typically arrives in two chunks, at the beginning of and halfway through the fiscal year; the details of how much and when are negotiated with program management based on:
  - Competing needs within the program
  - The schedule of capacity requirements
  - When funds can be most effectively spent
- Technology/product evaluation and review is a year-round activity.
- Major equipment procurements typically require one to two months to execute.
- Installation and commissioning typically takes from a couple of weeks to a couple of months to complete.

Slide 6: Maintaining Availability of Services
- The ATLAS Tier 1 at BNL is co-located and co-operated with the RHIC Computing Facility (RCF).
- Redundancy is used in critical elements:
  - Fail-over and/or graceful degradation of services
- Maintenance contracts with appropriate response times cover critical elements.
- 24 x 7 operational coverage of fabric services:
  - Jointly maintained by RCF and ATLAS Tier 1 staff
  - Five years of experience with RHIC 24 x 7 operations
    - When the accelerator runs (25-30 weeks/year)
  - Coverage includes:
    - 16 x 7 on-site staff coverage
      - Two operators extend coverage to weekends and evenings.
    - 24 x 7 on-call staff coverage for critical fabric services
      - On-call activation by automated systems, operators/other staff members, and critical points of contact in the user communities (a minimal paging sketch follows this slide)
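
The bullet on automated on-call activation suggests a simple pattern: a probe detects an unreachable fabric service and pages the on-call list. The sketch below is a minimal, hypothetical version; the host names, ports, and pager address are invented for illustration and do not describe BNL's actual alarm system.

```python
# Minimal sketch of automated on-call activation: probe critical fabric
# services over TCP and email a pager gateway on failure. Host names, ports,
# and addresses are hypothetical; this illustrates the pattern, not BNL's system.
import smtplib
import socket
from email.message import EmailMessage

SERVICES = {"dcache-door.example.bnl.gov": 22125, "hpss-core.example.bnl.gov": 1217}
ONCALL_PAGER = "oncall-pager@example.bnl.gov"   # hypothetical pager gateway address

def service_up(host, port, timeout=10):
    """Return True if a TCP connection to the service succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def page_oncall(host, port):
    """Send a short alarm message to the on-call pager gateway."""
    msg = EmailMessage()
    msg["From"] = "fabric-monitor@example.bnl.gov"
    msg["To"] = ONCALL_PAGER
    msg["Subject"] = f"CRITICAL: {host}:{port} unreachable"
    msg.set_content(f"Automated probe could not reach {host}:{port}.")
    with smtplib.SMTP("localhost") as smtp:
        smtp.send_message(msg)

for host, port in SERVICES.items():
    if not service_up(host, port):
        page_oncall(host, port)
```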

Slide 7: Monitoring
- The facility uses the RT problem-tracking system.
- Substantial use of Nagios-based monitoring by individual subsystems (a plugin sketch in the Nagios convention follows this slide):
  - Working toward facility-wide unification
- Automated monitoring and paging of staff for failures where possible:
  - Physical infrastructure
  - Many common off-the-shelf subsystems
- Complex, software-intensive, or newly deployed systems still require humans for early failure identification:
  - HPSS
  - dCache
- The SFT (Site Functional Tests) suite is not a good monitor of BNL Tier 1 availability:
  - Keeping SFT running continues to demand significant effort at BNL, but it...
  - ... fails to detect problems impacting some critical site services
  - ... reports some failures that have no effect on any significant site service
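
Nagios drives automated monitoring and paging through small check plugins that report status via the standard plugin exit codes (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN) plus one line of status text. The sketch below follows that convention for a hypothetical dCache door check; the host and port are placeholders, not a plugin the facility is known to have used.

```python
#!/usr/bin/env python3
# Sketch of a Nagios-style check plugin: report status through the standard
# plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) plus one line
# of text. The dCache door host/port are hypothetical placeholders.
import socket
import sys
import time

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
HOST, PORT = "dcache-door.example.bnl.gov", 22125   # hypothetical endpoint
WARN_SECONDS = 5.0   # a connection slower than this is flagged as WARNING

def main():
    start = time.time()
    try:
        with socket.create_connection((HOST, PORT), timeout=30):
            elapsed = time.time() - start
    except socket.timeout:
        print(f"CRITICAL: {HOST}:{PORT} connect timed out")
        return CRITICAL
    except OSError as exc:
        print(f"CRITICAL: {HOST}:{PORT} unreachable ({exc})")
        return CRITICAL
    if elapsed > WARN_SECONDS:
        print(f"WARNING: {HOST}:{PORT} slow connect ({elapsed:.1f}s)")
        return WARNING
    print(f"OK: {HOST}:{PORT} reachable in {elapsed:.2f}s")
    return OK

if __name__ == "__main__":
    sys.exit(main())
```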

Slide 8: Monitoring Related Issues
- The facility functions within the context of the Open Science Grid (OSG) and five US ATLAS Tier 2 centers:
  - Relatively few issues of interoperability (OSG ↔ EGEE) at the data transfer, storage, and management levels
  - Significant interoperability issues with accounting, monitoring, allocation, workflow management, etc. do exist:
    - Some are being addressed through OSG.
- Facility functions are convoluted with the PanDA and DDM layers, so reported problems need interpretation, including expertise from the PanDA and DDM teams (a triage sketch follows this slide):
  - Grid production and analysis use the US ATLAS specific "PanDA" job management system.
  - The ATLAS DDM system has a complex interaction with underlying facility services.
  - So there is no automated monitoring for these critical systems yet.
- For Service Challenges (SCs) and ATLAS CSC activities:
  - An on-call list for critical Grid services
  - Accessible through the OSG GOC (IU), the PanDA operations team, and selected other "power" users
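
One way to read "reported problems need interpretation" is as a triage step: before a PanDA or DDM failure is escalated as a Tier 1 fabric problem, check whether the underlying facility services are actually healthy. The sketch below is a hypothetical codification of that logic; the probe endpoints and routing targets are invented for illustration, and in practice this interpretation was done by human experts.

```python
# Hypothetical triage sketch: separate facility (fabric) problems from
# problems in the PanDA/DDM layers above them. The probe endpoints and
# queue names are invented for illustration.
import socket

# Hypothetical critical fabric endpoints underlying PanDA/DDM operations.
FABRIC_PROBES = {
    "dcache":  ("dcache-door.example.bnl.gov", 22125),
    "gridftp": ("gridftp.example.bnl.gov", 2811),
}

def probe(host, port, timeout=10):
    """Return True if a TCP connection to the endpoint succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def triage(report: str) -> str:
    """Route a reported problem: to the fabric queue if a local service is
    down, otherwise to the PanDA/DDM experts for interpretation."""
    failed = [name for name, (h, p) in FABRIC_PROBES.items() if not probe(h, p)]
    if failed:
        return f"'{report}' -> tier1-fabric queue (failed probes: {', '.join(failed)})"
    return f"'{report}' -> PanDA/DDM operations team (fabric services look healthy)"

print(triage("DDM transfer errors to BNL"))
```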

Slide 9: Evolution of Monitoring & Support
- Increase automation, especially for services having an immediate impact on operations:
  - dCache, DDM, PanDA, and other Grid-related services
  - In particular, codify the de-convolution of PanDA and/or DDM problems from underlying Tier 1 operations problems.
- Unify monitoring (under a Nagios umbrella).
- Add an additional operator, allowing expansion of on-site staff coverage to ~24 x 7 − ε.
- Integrate problem report/tracking systems (a ticket-exchange sketch follows this slide):
  - RT ↔ Footprints (OSG GOC at IU)
  - Footprints ↔ GGUS
- Better integrate monitoring and problem resolution with the US ATLAS Tier 2s and with the overall ATLAS effort.
- The target is to establish comprehensive, ATLAS-directed ~24 x 7 operational monitoring and support by January 2007.
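
One common pattern for ticket-system integration of the kind listed above is exchange via each system's email gateway; RT, for example, can create and update tickets from mail sent to a queue address. The sketch below forwards an incoming remote-ticket notification into a local RT queue that way. The addresses and the subject-tag convention are hypothetical, not the interfaces the three systems actually agreed on.

```python
# Hypothetical ticket-exchange sketch: forward a remote ticket (e.g. from
# GGUS or Footprints) into a local RT queue via RT's email gateway. The
# addresses and the "[system #id]" subject tag are invented conventions.
import smtplib
from email.message import EmailMessage

RT_QUEUE_ADDRESS = "tier1-support@example.bnl.gov"   # hypothetical RT mail gateway

def forward_to_rt(remote_system: str, remote_id: str, summary: str, body: str):
    """Create an RT ticket carrying a reference back to the remote ticket,
    so later updates can be correlated via the [system #id] subject tag."""
    msg = EmailMessage()
    msg["From"] = "ticket-bridge@example.bnl.gov"
    msg["To"] = RT_QUEUE_ADDRESS
    msg["Subject"] = f"[{remote_system} #{remote_id}] {summary}"
    msg.set_content(body)
    with smtplib.SMTP("localhost") as smtp:
        smtp.send_message(msg)

forward_to_rt("GGUS", "12345", "SRM transfer failures to BNL",
              "Opened in GGUS; please investigate the Tier 1 storage element.")
```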

