NERC Published Lessons Learned Summary

NERC Published Lessons Learned Summary November 2016

NERC Lessons Learned - November 2016
- Three NERC Lessons Learned (LL) were published in November 2016:
- LL20161101 - “Redundant Systems may not Cold-Start unless fully intact to prevent Dual Primary Operation”
- LL20161102 - “Failover Configuration Leads to Loss of EMS”
- LL20161103 - “Loss of ICCP due to Database Sizing Issue”

Redundant Systems may not Cold-Start – Details
- The entity deployed a patch on several noncritical but similar hardware installations throughout the organization, in addition to the identical QA/test EMS
- When the patch was executed on the production EMS, communications halted
- The failure was attributed to a corrupt switch configuration (part of a redundant pair)
- The second switch’s configuration was verified intact, and the upgrade process had not yet executed on the second switch

Redundant Systems may not Cold-Start – Details
- The freshly upgraded switch was powered down and traffic was expected to return to normal, but it did not
- The switches were power cycled in various orders in an attempt to restore service, but service was still not restored
- Once the corrupt/missing configuration was identified, it was restored from backup and system functionality resumed immediately

Redundant Systems may not Cold-Start – Details
- The network switches will not forward traffic when the two units do not have matching configurations
- A single switch would not start without its mate in a cold-start situation
- This is to prevent dual primary operation (split-brain scenario), where two isolated switches each think that they are the only operating switch

Redundant Systems may not Cold-Start – Corrective Actions
- Vendor modified their recommended configuration baseline to include the ability to cold-start a single switch after a waiting period
- This balances dual primary protection (split-brain scenario) with the operational need to start a system using a single network switch
- Engineers now have processes to quickly compare configurations with known good baselines during maintenance operations
- Standard commissioning and testing procedures include cold-starting redundant systems when the mate is not present
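The trade-off between dual primary protection and single-switch cold-start can be sketched as a small decision rule. This is an illustrative model only, not the vendor's implementation: the names `ColdStartPolicy`, `solo_start_wait_s`, and `may_forward_traffic` are hypothetical, and the 300-second wait in the usage note is an arbitrary assumed value.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ColdStartPolicy:
    # Hypothetical setting: seconds to wait for the mate before starting alone.
    # None models the original behavior (never start without the mate).
    solo_start_wait_s: Optional[float]


def may_forward_traffic(policy: ColdStartPolicy,
                        mate_seen: bool,
                        configs_match: bool,
                        seconds_since_boot: float) -> bool:
    """Decide whether one switch of a redundant pair may forward traffic."""
    if mate_seen:
        # Both units are up: forward only when their configurations match.
        return configs_match
    if policy.solo_start_wait_s is None:
        # Strict split-brain protection: a lone switch never starts.
        return False
    # Relaxed policy: start alone once the waiting period has elapsed.
    return seconds_since_boot >= policy.solo_start_wait_s
```

With `ColdStartPolicy(None)` a lone switch stays down indefinitely, which matches the behavior seen in the event. With `ColdStartPolicy(300.0)` the same switch begins forwarding once 300 seconds have passed without seeing its mate.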

Redundant Systems may not Cold-Start – Corrective Actions
- These redundant-system concepts must be understood by the technicians, operators, and engineers of redundant systems at all levels
- See the lesson learned document for details
- In addition to common testing, redundant systems should be tested with partial outages
- Ensure backups and disaster recovery procedures are readily available before performing maintenance
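A quick comparison of a running configuration against a known good baseline, as called for in the corrective actions, might look like the sketch below. It assumes both configurations are available as plain text and uses Python's standard `difflib`; the function name `config_drift` and the sample configuration lines are hypothetical and do not reflect any specific switch syntax.

```python
import difflib


def config_drift(baseline: str, running: str) -> list:
    """Return lines that differ between a baseline and a running config.

    Lines prefixed "- " exist only in the baseline (missing from the device);
    lines prefixed "+ " exist only in the running config.
    """
    diff = difflib.ndiff(baseline.splitlines(), running.splitlines())
    return [line for line in diff if line.startswith(("- ", "+ "))]


# Hypothetical example: the running config has lost one line.
baseline_cfg = "hostname sw1\nvlan 10\nvlan 20\n"
running_cfg = "hostname sw1\nvlan 10\n"
drift = config_drift(baseline_cfg, running_cfg)
```

An empty result means the device matches the baseline. Any non-empty result is a prompt to restore from backup or investigate before continuing the maintenance.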

Failover Configuration Leads to Loss of EMS – Details
- A failover was initiated from the EMS-C server to the EMS-D server; the AGC application aborted twice within a minute during EMS-D’s initialization/startup
- This caused an automatic failover to the next backup EMS server in line (EMS-A)
- The same condition was experienced by EMS-A, which initiated another automatic failover to the next backup EMS server in line (EMS-B)
- When the system reached the final available server (EMS-B), all systems were in a DISABLED state

Failover Configuration Leads to Loss of EMS – Corrective Actions
- The entity removed the scheme that initiated an automatic failover after two consecutive AGC failures (within a minute) from the EMS process manager model
- The entity also reviewed all other schemes to ensure that the triggering of an automatic failover is properly defined

Failover Configuration Leads to Loss of EMS – Lessons Learned
- Review all failover configuration settings in the EMS that could initiate an automatic failover of the EMS to determine the value of the scheme
- Remove schemes that are not necessary or could lead to a cascading failover scenario
- Evaluate whether to allow these applications to fail in lieu of automatic failovers
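The cascading failover described in this lesson can be illustrated with a simplified simulation. This is a sketch under stated assumptions, not the EMS process manager's logic: the function `run_failover_chain`, the `STANDBY` and `ACTIVE_AGC_DOWN` state names, and the idea that the same AGC fault recurs on every server are all hypothetical modeling choices. Only the server names and the DISABLED outcome come from the lesson.

```python
def run_failover_chain(chain, agc_aborts_on_startup, auto_failover_on_agc_abort):
    """Walk a failover chain and return (state per server, active server or None).

    chain: servers tried in order, e.g. ["EMS-D", "EMS-A", "EMS-B"].
    agc_aborts_on_startup: callable(name) -> True if AGC aborts on that server.
    auto_failover_on_agc_abort: True models the original scheme.
    """
    state = {name: "STANDBY" for name in chain}
    for name in chain:
        if not agc_aborts_on_startup(name):
            state[name] = "ACTIVE"
            return state, name
        if not auto_failover_on_agc_abort:
            # Scheme removed: AGC is allowed to fail, the server stays in service.
            state[name] = "ACTIVE_AGC_DOWN"
            return state, name
        # Scheme present: the AGC abort disables this server and moves on.
        state[name] = "DISABLED"
    return state, None


chain = ["EMS-D", "EMS-A", "EMS-B"]
same_fault_everywhere = lambda name: True  # assumed common-mode AGC fault

with_scheme = run_failover_chain(chain, same_fault_everywhere, True)
without_scheme = run_failover_chain(chain, same_fault_everywhere, False)
```

In this model, a fault that follows the application from server to server exhausts the chain when the scheme is enabled, leaving no active server. With the scheme removed, the first server remains in service with only AGC down.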

Loss of ICCP due to Database Sizing Issue – Details
- An entity was updating and expanding its state estimator (STE) network model
- This STE update required an additional 13,000 points from the ICCP
- These had already been added to the ICCP database and to the development and staging STE servers
- When the production STE server was updated, it began requesting the 13,000 additional points again and the database table was increased by 26,000 points (13,000 for each of the two production STE servers)

Loss of ICCP due to Database Sizing Issue – Details
- This exceeded the maximum allowed size of the database table and caused the ICCP processes to abort
- Investigation revealed that the database table was 97% full prior to the expansion, and the extra points caused it to exceed its maximum size
- A failback to the previous ICCP database was attempted, but this did not successfully resolve the issue

Loss of ICCP due to Database Sizing Issue – Corrective Actions
- The database table was temporarily resized to accommodate the appropriate number of entries
- More extensive research was done to determine the total size that will be needed through the end of the network model expansion project
- A report was created to compare the current size of database tables to their maximum limit
- This is now reviewed at each ICCP database change
- Two ICCP support staff completed a vendor class on ICCP support and maintenance
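A report that compares table sizes with their limits, and that also accounts for points about to be added, could be sketched as follows. The table name `iccp_points`, the 500,000-row limit, and the 80% warning threshold are hypothetical values chosen for illustration; only the 97% starting utilization and the 26,000 added points (13,000 for each of two servers) come from the lesson.

```python
from dataclasses import dataclass


@dataclass
class TableUsage:
    name: str
    rows_used: int
    rows_max: int


def capacity_report(tables, planned_additions=None, warn_at=0.80):
    """Return (name, percent_used_now, projected_rows, status) per table."""
    planned_additions = planned_additions or {}
    report = []
    for t in tables:
        projected = t.rows_used + planned_additions.get(t.name, 0)
        if projected > t.rows_max:
            status = "OVERFLOW"
        elif projected / t.rows_max >= warn_at:
            status = "WARN"
        else:
            status = "OK"
        percent_now = round(100 * t.rows_used / t.rows_max, 1)
        report.append((t.name, percent_now, projected, status))
    return report


# Hypothetical table that is 97% full, as in the lesson (limit is assumed).
tables = [TableUsage("iccp_points", 485_000, 500_000)]
# 13,000 points requested by each of the two production STE servers.
report = capacity_report(tables, {"iccp_points": 2 * 13_000})
```

Running such a check before each ICCP database change would flag the table as `OVERFLOW` ahead of the update, since the projected 511,000 rows exceed the assumed 500,000-row limit.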

Loss of ICCP due to Database Sizing Issue – Lessons Learned
- Database sizes need to be carefully monitored as a system is expanded
- Sizes should be large enough to accommodate all data being requested, not just what is currently being transferred
- Primary databases, as well as peripherally associated databases, need to be evaluated for size constraints
- The vendor may need to be contacted to verify that database sizes can be increased without causing problems or to provide a more comprehensive validation routine
- Alternative ICCP configurations need to be evaluated to determine if there is a more efficient means to feed data into the staging and development systems

Loss of ICCP due to Database Sizing Issue – Lessons Learned
- ICCP databases should be set up so that external companies cannot inadvertently request data that does not originate in the host utility
- Backup support staff should be fully trained so that discovery of problems does not rest completely with primary support personnel
- Support staff should meet regularly to discuss questions, discoveries, and findings

Lessons Learned Survey Link
- NERC’s goal in publishing lessons learned is to provide industry with technical and understandable information that assists them with maintaining the reliability of the bulk power system
- NERC requests that industry provide input on lessons learned by taking the short survey; a link is provided in the PDF version of each Lesson Learned