CD FY10 Budget and Tactical Plan Review FY10 Tactical Plan for SCF / System Administration DocDB #3389 Jason Allen 10/06/2009.


1 CD FY10 Budget and Tactical Plan Review FY10 Tactical Plan for SCF / System Administration DocDB #3389 Jason Allen 10/06/2009

2 CD FY10 Budget and Tactical Plan Review 2 FY10 Tactical Plan for SCF / System Administration
The SCF / System Administration plan is executed by the SCF/FEF Department.
FEF Members: Jason Allen, Glenn Cooper, Ed Simmonds, LaDerrick Honeycutt, Ling Ho, Jason Harrington, Etta Burns, Seth Graham, Mark Schmitz, Rennie Scott
Current Customers: D0 Offline, D0 Online, CDF Offline, CDF Online, EAG, Minerva, MiniBoone, Minos, SciBoone, GP Farm, MIPP

3 CD FY10 Budget and Tactical Plan Review 3 FY10 Tactical Plan for SCF / System Administration
Tactical Plan Leader: Jason Allen
Service Activity List: Online Systems Management, Compute Node Management, Server Management, Storage Management, Batch System Management, Event and Incident Management, Problem Management, Operational Planning and Consulting Support, Procurement Support, Professional Development
Project Activity List: Short Term Projects

4 CD FY10 Budget and Tactical Plan Review 4 Service Activity: Online Systems Management
Goals Related to this Activity
– Common goal for all services: Support scientific computing at Fermilab by providing server, compute node, and storage management. Constantly strive to improve operational efficiency while maintaining a high level of customer satisfaction.
Key Metrics
– Tickets per month
– Number of systems
Service Documentation: https://fefweb.fnal.gov/mediawiki
Issues and Risks (specific to this activity, includes allocation impact)
1. Aging hardware with no plans to replace.
2. Greater demands being put on CD staff to keep equipment running.
3. Little control over operational decisions.

5 CD FY10 Budget and Tactical Plan Review 5 Service Activity: Compute Node Management
Goals Related to this Activity
– Common goal for all services: Support scientific computing at Fermilab by providing server, compute node, and storage management. Constantly strive to improve operational efficiency while maintaining a high level of customer satisfaction.
Key Metrics
– Number of compute nodes managed, upgraded, and installed
Service Documentation: https://fefweb.fnal.gov/mediawiki
Issues and Risks (specific to this activity, includes allocation impact)
1. Endemic problems in newly purchased hardware.

6 CD FY10 Budget and Tactical Plan Review 6 Service Activity: Server Management
Goals Related to this Activity
– Common goal for all services: Support scientific computing at Fermilab by providing server, compute node, and storage management. Constantly strive to improve operational efficiency while maintaining a high level of customer satisfaction.
Key Metrics
– Number of servers managed, upgraded, and installed.
Service Documentation: https://fefweb.fnal.gov/mediawiki
Issues and Risks (specific to this activity, includes allocation impact)
1. Moving equipment from one facility to another consumes a huge amount of effort.
2. Quality control of SLF and Fermi-specific packages. Proper testing procedures must be followed to prevent deployment of bad RPMs.
3. Poor hardware support from vendors.
4. VM sprawl.

7 CD FY10 Budget and Tactical Plan Review 7 Service Activity: Batch System Management
Goals Related to this Activity
– Common goal for all services: Support scientific computing at Fermilab by providing server, compute node, and storage management. Constantly strive to improve operational efficiency while maintaining a high level of customer satisfaction.
Key Metrics
– Number of reported batch-system-related Service Desk tickets.
– Number of job slots on the batch system.
– Current and historical status: http://fefweb.fnal.gov/cab
Service Documentation: https://fefweb.fnal.gov/mediawiki
Issues and Risks (specific to this activity, includes allocation impact)
1. Reliant on Torque, a community-supported open source batch system. Poor quality control is a concern.

8 CD FY10 Budget and Tactical Plan Review 8 Service Activity: Procurement Support
Goals Related to this Activity
– Common goal for all services: Support scientific computing at Fermilab by providing server, compute node, and storage management. Constantly strive to improve operational efficiency while maintaining a high level of customer satisfaction.
Key Metrics
– Total dollar amount of approved requisitions.
– Number of reqs approved.
Service Documentation: https://fefweb.fnal.gov/mediawiki
Issues and Risks (specific to this activity, includes allocation impact)
1. Purchasing hardware with endemic problems.
2. Vendor lock-in.

9 CD FY10 Budget and Tactical Plan Review 9 Project Activity: Short Term Projects: Configuration Management
Goals Related to this Activity
1. Reassess current tools and methods used for configuration management; identify strengths and weaknesses.
2. Evaluate potential replacement configuration management tools, such as Puppet.
3. Couple configuration management, provisioning, and monitoring tools more tightly.
4. Improve reporting and auditing of configuration changes.
Key Milestones
– Start: Fourth quarter CY 2009
– End: Second quarter CY 2010
Project Documentation: https://fefweb.fnal.gov/mediawiki
Issues and Risks (specific to this activity, includes allocation impact)
1. Possible service interruption while migrating to a new tool.
2. New tools could be less reliable than the old.

10 CD FY10 Budget and Tactical Plan Review 10 Project Activity: Short Term Projects: Business Intelligence
Goals Related to this Activity
1. Construct a data store from various sources containing asset and operational data.
2. Deploy tools which allow historical and current operational views using ad-hoc or canned reports.
Key Milestones
– Start: First quarter CY 2010
– End: Second quarter CY 2010
Project Documentation: https://fefweb.fnal.gov
Issues and Risks (specific to this activity, includes allocation impact)
1. Reporting incorrect information.
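
The two Business Intelligence goals (a combined data store, then canned reports on top of it) could be prototyped roughly as follows. This is only a minimal sketch, not the department's actual design: the table names, column names, and sample rows are all hypothetical, and SQLite is an assumed stand-in for whatever store the project chooses.

```python
import sqlite3

def build_store(assets, tickets):
    """Load asset and ticket rows into one in-memory SQLite store."""
    db = sqlite3.connect(":memory:")
    # Hypothetical schema: one table per data source.
    db.execute("CREATE TABLE assets (hostname TEXT PRIMARY KEY, customer TEXT)")
    db.execute("CREATE TABLE tickets (id INTEGER PRIMARY KEY, hostname TEXT, opened TEXT)")
    db.executemany("INSERT INTO assets VALUES (?, ?)", assets)
    db.executemany("INSERT INTO tickets VALUES (?, ?, ?)", tickets)
    return db

def tickets_per_month(db):
    """Canned report: ticket count per customer per month (YYYY-MM)."""
    return db.execute(
        """
        SELECT a.customer, substr(t.opened, 1, 7) AS month, COUNT(*) AS n
        FROM tickets t JOIN assets a ON a.hostname = t.hostname
        GROUP BY a.customer, month
        ORDER BY a.customer, month
        """
    ).fetchall()

# Made-up sample rows, for illustration only.
assets = [("node01", "D0 Offline"), ("node02", "CDF Online")]
tickets = [
    (1, "node01", "2009-10-02"),
    (2, "node01", "2009-10-20"),
    (3, "node02", "2009-11-05"),
]
db = build_store(assets, tickets)
report = tickets_per_month(db)
for customer, month, n in report:
    print(f"{customer:12s} {month} {n}")
```

Joining tickets to assets in one store is what makes a metric like "tickets per month" reportable per customer; validating the join keys on load is one way to reduce the "reporting incorrect information" risk listed above.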

11 CD FY10 Budget and Tactical Plan Review 11 Project Activity: Short Term Projects: GPCF Deployment To be discussed in the GPCF presentation.

12 CD FY10 Budget and Tactical Plan Review 12 FY10 FTE: Request vs. Allocation
Level 0/1 Activity: SCF / System Administration
Activity Level 2                               FTEs
Operational Planning and Consulting Support    0.15
Procurement Support                            0.45
Professional Development                       0.5
Batch System Management                        0.5
Compute Server Management                      1.0
Event and Incident Management                  1.5
Online Systems Management                      1.0
Problem Management                             0.5
Storage Management                             0.6
System Administration Management               1.5
Server Management                              2.0
Total                                          10.2

13 CD FY10 Budget and Tactical Plan Review 13 FY10 M&S: Request vs. Allocation
Level 0/1 Activity: SCF / System Administration
Columns: Activity Level 2; Project or Service; Project Priority; M&S Requested; M&S Allocated
Professional Development; Service; ---; $15,000.00
Compute Server Management; Service; ---; $20,000.00
GP Grid; Service; ---; $434,000.00
Server Management; Service; ---; $136,000.00
Total; $605,000.00
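
As a quick arithmetic check, the four M&S line items on this slide can be summed and compared with the stated total. The snippet below is a small sketch; the dictionary and function names are illustrative, and only the dollar figures come from the slide.

```python
from decimal import Decimal

# M&S line items as listed on the slide (activity -> dollars).
ms_items = {
    "Professional Development": Decimal("15000.00"),
    "Compute Server Management": Decimal("20000.00"),
    "GP Grid": Decimal("434000.00"),
    "Server Management": Decimal("136000.00"),
}

def total_ms(items):
    """Sum the line items using exact decimal arithmetic."""
    return sum(items.values(), Decimal("0"))

stated_total = Decimal("605000.00")
computed_total = total_ms(ms_items)
print(f"computed ${computed_total:,.2f}, stated ${stated_total:,.2f}")
```

Decimal is used rather than float so that currency amounts compare exactly.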

14 CD FY10 Budget and Tactical Plan Review 14 Ripple Effect on Shared IT Services
Activity Level 2: GP Grid
Network Connectivity: Expanded Service: 140 switch ports
New, Steady-State service drives

15 CD FY10 Budget and Tactical Plan Review 15 M&S Requests
Level 0/1 Activity: SCF / System Administration
GP Farm core count was determined by FermiGrid Services based on projected experiment need and node retirements.
Request   Description                                       Risk of Reduced Allocation
434k      1120 cores for GP Farm                            Reduced analysis CPU
40k       Racks and related infrastructure                  Must reuse old hardware
41k       Admin and spare machines                          Less operational reliability
40k       Virtualization management software                Could postpone VM rollouts
20k       Replace Lantronix and Opengear console servers    Less reliable console servers could mean more downtime
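
To put the requests on this slide in proportion, the sketch below works out the implied cost per GP Farm core and each request's share of the slide's total. It assumes that "k" means thousands of dollars and that the 434k request covers only the 1120 cores; the variable and function names are illustrative.

```python
# Figures from the M&S request slide: (request in dollars, description).
requests = [
    (434_000, "1120 cores for GP Farm"),
    (40_000, "Racks and related infrastructure"),
    (41_000, "Admin and spare machines"),
    (40_000, "Virtualization management software"),
    (20_000, "Replace Lantronix and Opengear console servers"),
]
GP_FARM_CORES = 1120

def cost_per_core(dollars, cores):
    """Implied price of one core, assuming the request buys only cores."""
    return dollars / cores

def share_of_total(reqs):
    """Map each description to its fraction of the summed requests."""
    total = sum(dollars for dollars, _ in reqs)
    return {desc: dollars / total for dollars, desc in reqs}

per_core = cost_per_core(requests[0][0], GP_FARM_CORES)
shares = share_of_total(requests)

print(f"GP Farm: ${per_core:,.2f} per core")
for desc, frac in shares.items():
    print(f"{frac:6.1%}  {desc}")
```

Under these assumptions the GP Farm request comes to $387.50 per core and accounts for roughly three quarters of the requests listed on this slide.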

16 CD FY10 Budget and Tactical Plan Review 16 Summary of Past Action Items
CDACTIONITEM-211: Need plan to conform D0 CAB cluster to OSE baseline. State: Open. The D0 OSE Taskforce, led by Mike Diesburg, is examining how to bring CAB in line with OSE requirements.
CDACTIONITEM-174: Batch system management (Torque/PBS) group? State: Closed. It was determined after several meetings that FermiGrid Services doesn’t currently have sufficient effort to support the D0 CAB batch system.
CDACTIONITEM-173: Desktop Computer Management: should it be in a different group? State: Closed. Support for CDF Desktops was migrated to Central Services at the beginning of 2009.

17 CD FY09 Tactical Plan Status 17 Tactical Plan Summary: FY09 Accomplishments
– Completed “Take over management of EAG servers” project.
– Completed “Upgrade and migrate CAB status web pages” project. http://fefweb.fnal.gov/cab/
– Completed “Revamp system console and remote power-cycling infrastructure” project.
– Set up a new “interactive/batch cluster” for the Minerva experiment.
– Deployed virtualized clusters supporting high availability for D0 Offline and CDF Online.

18 CD FY09 Tactical Plan Status 18 Tactical Plan Summary: Objectives for FY10
– Maintain existing scientific computing infrastructure for running Fermilab experiments. Scope includes system management, procurement of new systems, retiring old equipment, and troubleshooting technical issues.
– Improve system administration efficiency by streamlining procedures and refining existing system management infrastructure.
– Implement virtualization technologies in an effort to consolidate physical systems and increase operational reliability (not virtualization for its own sake).
– Improve system monitoring (see Business Intelligence project).

19 CD FY09 Tactical Plan Status 19 Tactical Plan Summary: Objectives for FY10 (cont)
– Improve operational reporting.
– Document processes and procedures related to system management with the goal of improving service quality.
– Increase technical proficiency of department members.
– Improve technical proficiency of system administrators.
– Share technical expertise and standardize system administration tools/procedures with other CD departments.
– Support the division’s effort to implement ITIL.
– Promote a safe and harmonious work environment.

20 CD FY09 Tactical Plan Status 20 Tactical Plan Summary: Risk Assessment
– Reduction in available effort due to resignations, budget shortfalls, or reassignments.
– Increased number of requests from customers because of reduced support from scientific staff. This could be a particular problem with D0/CDF as RunII starts to wind down.
– Maintaining high-quality SLF and Fermi RPM releases. Proper testing procedures must be followed to prevent deployment of bad packages.
– Newer hardware may only run SLF 5.x or newer; experiment code must be compatible with the current OS.
– Endemic problems with hardware purchases, especially disk vibration issues, are always a concern.

21 CD FY09 Tactical Plan Status 21 Tactical Plan Summary: Summary
– Significant challenges in FY10, particularly in regard to effort.
– Our focus this year is to maintain stable operations.
– Lots of interesting things happening with virtualization.
– Hope to make significant improvements in configuration management and reporting.

