
2014 ERCOT Operations Training Seminar
EMS Outages and Lessons Learned (TDSP)
Texas Reliability Entity
Jagan Mandavilli, Bob Collins, Mark Henry

2 Objectives
Upon completing this course of instruction, you will:
● Recognize the typical causes and failure modes for Energy Management System (EMS) tools
● Identify the importance of some of the tools TDSPs use
● Identify the EMS applications critical to your operation
● Recognize the TDSP operator's role in identifying problems and reporting EMS failures
● Identify components of the procedures for operating the system during EMS failures

3 Content
● EMS Failures
  - Communication and control (EMS) failures
    - Inter-Control Center Communications Protocol (ICCP) failures
    - Remote Terminal Unit (RTU) issues
  - EMS application failures
    - State Estimator (SE), RTCA, VSAT, TSAT
    - SCADA
  - Backup Control Center operation
  - Loss of Operator User Interface
  - EMS failures due to database updates
  - Training and live EMS screens on the same display
● Analysis of restorations
● Contributing and root causes, with examples
● Common themes, with examples

4 Definitions
● SCADA – Supervisory Control and Data Acquisition
● EMS – Energy Management System
● RTNET – Real-Time Network Analysis
● RTCA – Real-Time Contingency Analysis
● VSAT – Voltage Stability Analysis Tool
● TSAT – Transient Stability Analysis Tool
● ICCP – Inter-Control Center Communications Protocol
● RTU – Remote Terminal Unit
● EAS – Event Analysis Subcommittee
● EMSTF – Energy Management Systems Task Force

5 Tools and Their Importance
● SCADA
● ICCP
● RTNET
● RTCA
● SCED
● VSAT
● TSAT

6 ERCOT EMS Overview

7 EMS Reliability
● EMS are extremely reliable
● Extremely high industry-wide availability
● Systems usually have redundancy
● Multiple systems are common, with on-the-fly failover
● Backup centers, sometimes manned
● Communications circuits on highly redundant ring networks
● Data handling has built-in error detection and correction
● Support staff available 24x7

8 What Do EMS Problems Look Like?
● Trends flatline
● Data no longer updates
● Color changes
● Alarms
● Strange application results
● Lockup of applications
● Loss of visibility
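Some of these symptoms can also be watched for programmatically. Below is a minimal sketch, not an ERCOT or vendor tool, of flatline detection: flag any telemetered point whose value has not changed within a staleness window. The point name and the 60-second window are illustrative assumptions.

```python
import time

STALE_AFTER_S = 60  # assumed staleness window; real scan rates vary by system

class FlatlineMonitor:
    """Flag telemetered points whose values have stopped changing."""

    def __init__(self):
        self.last_value = {}
        self.last_change = {}

    def update(self, point, value, now=None):
        now = time.time() if now is None else now
        if self.last_value.get(point) != value:
            self.last_value[point] = value
            self.last_change[point] = now

    def stale_points(self, now=None):
        now = time.time() if now is None else now
        return [p for p, t in self.last_change.items()
                if now - t > STALE_AFTER_S]

# Usage: the MW value stops updating, so the point is reported as stale.
mon = FlatlineMonitor()
mon.update("LINE_MW_FLOW", 41.7, now=0)
mon.update("LINE_MW_FLOW", 41.7, now=100)  # unchanged 100 s later
print(mon.stale_points(now=100))           # ['LINE_MW_FLOW']
```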

10 NERC EMS Failure Event Analysis
● NERC and Regional personnel examined the events
  - 81 Category 2b events reported (Oct 26, 2010 – Sep 3, 2013)
  - 64 events thoroughly analyzed and reviewed
  - 54 entities reporting; 20 entities experienced multiple outages
  - Restoration time for partial outages: 18 to 411 min
  - Restoration time for complete outages: 12 to 253 min
  - Vendor diagnostic failures – software and hardware issues
  - Several noticeable themes

11 NERC Lessons Learned from EMS Events #1
● Remote Terminal Units Not on DC Sources
  - The power supply to an RTU for a High Voltage Direct Current (HVDC) converter station was not designed to be fed from station batteries, resulting in a loss of the RTU when all AC feeds to the substation were lost during an event.
● Lesson Learned
  - While the availability of multiple AC sources provides a high degree of reliability for RTUs, entities should evaluate the practicality and feasibility of powering RTUs needed for control, situational awareness, system restoration, and/or post-event analysis from the station batteries.

12 NERC Lessons Learned from EMS Events #2
● EMS System Outage and Effects on System Operations
  - An entity's EMS began to lose data necessary for visibility of portions of its transmission network, causing functionality and/or solution interruptions for some of its EMS operational tools. No loss of load occurred during this event, and it was quickly determined not to be a cyber security event.
● Lessons Learned
  - All entities should have a procedure, such as "Conservative Operations," that lays out the steps they may have to take to ensure reliability. Training should be conducted routinely on all procedures, especially those related to low-probability, high-impact events, regardless of how often the procedures are used.

13 NERC Lessons Learned from EMS Events #3
● EMS Loss of Operator's User Interface Application
  - A control center experienced a loss of control and monitoring functionality of the EMS due to the loss of the operator's user interface application between its primary EMS computer/host server and the system operator consoles.
● Lessons Learned
  - Create a "save case" of settings before and after any change to the system is made. The save case supplies the documentation needed to perform comparisons.
  - Analyze EMS performance on a periodic basis and evaluate whether the system is meeting the needs as designed and intended.
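As an illustration of the save-case idea, here is a minimal sketch: snapshot a settings dictionary to a file before and after a change, then diff the two snapshots. The file names and the flat key/value settings format are assumptions for the example, not any EMS vendor's save-case format.

```python
import json

def save_case(settings, path):
    """Snapshot a settings dictionary before/after a system change."""
    with open(path, "w") as f:
        json.dump(settings, f, indent=2, sort_keys=True)

def diff_cases(before_path, after_path):
    """Return {key: (before, after)} for every setting that changed."""
    with open(before_path) as f:
        before = json.load(f)
    with open(after_path) as f:
        after = json.load(f)
    return {k: (before.get(k), after.get(k))
            for k in set(before) | set(after)
            if before.get(k) != after.get(k)}

# Usage: the diff documents exactly what the change touched.
save_case({"scan_rate_s": 2, "db_mode": "remote"}, "case_before.json")
save_case({"scan_rate_s": 4, "db_mode": "remote"}, "case_after.json")
print(diff_cases("case_before.json", "case_after.json"))  # {'scan_rate_s': (2, 4)}
```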

14 NERC Lessons Learned from EMS Events #4
● SCADA Lockup
  - A Transmission Owner (TO)'s control center experienced a SCADA failure that resulted in a loss of monitoring functionality for more than thirty minutes.
● Lessons Learned
  - It is beneficial for Transmission Operators (TOPs) and TOs to install a "heartbeat monitor" alarm to detect stale or stagnant data.
  - Mismatch thresholds for state estimator alarming should be evaluated periodically for each operating area, allowing optimum sensitivity while minimizing false mismatch alarms.
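One common form of heartbeat monitor is a counter the data source increments on every scan; if the counter stops advancing, the data is stale even when the values look plausible. A minimal sketch, assuming a three-missed-beats alarm threshold (the threshold and counter source are illustrative, not from the NERC lesson):

```python
MISSED_BEFORE_ALARM = 3  # assumed threshold; tune to scan rate and tolerance

def check_heartbeat(previous, current, missed):
    """Count consecutive reads where the heartbeat counter did not advance."""
    missed = missed + 1 if current == previous else 0
    return missed, missed >= MISSED_BEFORE_ALARM

# Usage: the counter freezes at 8, so the third repeated read raises the alarm.
reads = [7, 8, 8, 8, 8]
missed, previous = 0, reads[0]
for current in reads[1:]:
    missed, alarm = check_heartbeat(previous, current, missed)
    if alarm:
        print("HEARTBEAT ALARM: telemetry appears stale")
    previous = current
```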

15 NERC Lessons Learned from EMS Events #5
● Failure of EMS Due to Over-Utilization of Disk Storage
  - Loss of control functionality occurred because the hard disk on the SCADA server was fully utilized.
● Lessons Learned
  - SCADA equipment monitoring should include monitoring of hard disk storage utilization. Purging processes need to be set up to perform periodic cleanup of disk space.
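A sketch of that monitoring and purging using only the Python standard library. The 85% alarm threshold, 30-day retention, and archive path are assumptions; a production SCADA host would feed this into its alarm system rather than print.

```python
import os
import shutil
import time

ALARM_PCT = 85     # assumed alarm threshold for disk utilization
MAX_AGE_DAYS = 30  # assumed retention window for purgeable files

def disk_used_pct(path="/"):
    """Percentage of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total

def purge_old_files(directory):
    """Delete files (e.g. rotated logs, old archives) past the retention window."""
    cutoff = time.time() - MAX_AGE_DAYS * 86400
    for name in os.listdir(directory):
        full = os.path.join(directory, name)
        if os.path.isfile(full) and os.path.getmtime(full) < cutoff:
            os.remove(full)

# Usage: check utilization on a schedule and purge before the disk fills.
if disk_used_pct("/") > ALARM_PCT:
    print("DISK ALARM: utilization high; purging old archive files")
    # purge_old_files("/var/scada/archive")  # hypothetical archive path
```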

16 NERC Lessons Learned from EMS Events #6
● Indistinguishable Screens during a Database Update Led to Loss of SCADA Monitoring and Control
  - During a planned database update and failover, an EMS Operations Analyst inadvertently changed an online SCADA server database mode from "remote" (online) to "local" (local offline copy), which caused a loss of SCADA monitoring and control of Bulk Electric System (BES) facilities.
● Lessons Learned
  - Changing the database mode on a server is not recommended. A future release of EMS software should eliminate the ability to switch database modes on a server.

17 NERC Lessons Learned from EMS Events #7
● Inappropriate System Privileges Cause Loss of SCADA Monitoring
  - An entity experienced a loss of SCADA telemetry – specifically a loss of the channel status indicators – for 76% of its transmission system. The problem occurred during a scheduled SCADA database update that left one of the front-end processors in an abnormal state. An incorrect command was used to remedy the situation, which set the channel status indicators to a failed state.
● Lessons Learned
  - Entities should consider reviewing change-management training to ensure it includes a checklist of the steps required, and educating SCADA support staff on the global impact of commands on the entire SCADA system.

18 NERC Lessons Learned from EMS Events #8
● Loss of EMS – IT Communications Disabled
  - Transmission System Operators lost the ability to authenticate to the EMS, resulting in a loss of monitoring and control functionality for more than 30 minutes.
● Lessons Learned
  - EMS network design should include, where possible, a redundant local authentication server on the same internal network as the primary local authentication server.

19 NERC Lessons Learned from EMS Events #9
● SCADA Failure Resulting in Reduced Monitoring Functionality
  - An entity's primary control center SCADA Management Platform (SMP) servers became unresponsive, resulting in a partial loss of monitoring and control functions for more than 30 minutes. Because this loss of functionality was caused by a conflict between security software configuration changes and core operating system functions, a cyber security event was quickly ruled out, and no loss of load occurred.
● Lessons Learned
  - Registered entities should consider a "multi-site hosting" configuration, which provides flexibility and convenience for rapid recovery of EMS and SCADA functions.

20 NERC Lessons Learned from EMS Events #10
● Failure of Energy Management System While Performing Database Update
  - The EMS failed while a database update was being performed.
● Lessons Learned
  - When the EMS was purchased, the vulnerability of an integrated system architecture was unknown. To eliminate this now-exposed vulnerability, it is recommended that functional separation of the primary from the backup control center be implemented.

21 Number of Reports October 26, 2010 – September 3, 2013

22 Characteristics of EMS Outages

23 Root Causes by Category

24 Contributing Causes by Category

25 Top Root/Contributing Causes (in order; LTA = less than adequate)
● Software failure (A2B6C07)
● Design output scope LTA (A1B2C01)
● Inadequate vendor support of change (A4B5C03)
● Testing of design/installation LTA (A1B4C02)
● Defective or failed part (A2B6C01)
● System interactions not considered (A4B5C05)
● Inadequate risk assessment of change (A4B5C04)
● Insufficient job scoping (A4B3C08)
● Post-modification testing LTA (A2B3C03)
● Inspection/testing LTA (A2B3C02)
● Attention given to wrong issues (A3B3C01)
● Untimely corrective actions to known issue (A4B1C08)

26 Common Themes
1. Software failures
2. Software configuration/installation/maintenance
3. Hardware failures
4. Hardware configuration/installation/maintenance
5. Failover testing weaknesses
6. Testing inadequacies

27 Software Failures – What Is Affected?
● Application software bug/defect
  - Base system – alarms/health check/syncing, etc.
  - Front-end processing
  - Supervisory control applications (SCADA)
  - ICCP
  - User Interface (UI)
  - Relational Database Management Systems (RDBMS)
  - Build process scripts
  - Miscellaneous scripts
● Communication equipment firmware/software bug/defect
  - RTUs
  - Switches
  - Modems
  - Routers
  - Firewalls
● Operating system software bug/defect
  - Unix/Linux/Windows

28 Hardware Failures
● Application servers/nodes
  - Network interface cards
  - Server hard drive control board
  - Auxiliary power regulator control
● Communication equipment
  - RTUs
  - Switches
  - Routers
  - Firewalls
  - Fiber optic cables
  - Time source
● Power sources
  - Uninterruptible Power Supply (UPS)
  - External generators
  - Power cables

29 Failover Testing Weaknesses
● Improper settings preventing the failover
● Improper failover procedure
● System setup issues preventing failover
● Improper patch management between primary/spare/backup servers
● Primary server issues reflected on the spare/backup as well – no isolation
● Improper failover configuration settings
● Improper network device configuration settings for failover
● Design requirements that do not consider failovers
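One of the weaknesses above, improper patch management between primary and spare/backup servers, can be caught before a failover test with a simple manifest comparison. A minimal sketch, assuming each server's installed packages can be exported as a name-to-version map (the package names and versions below are made up):

```python
def manifest_diff(primary, backup):
    """Return packages whose installed versions differ between two servers."""
    return {name: (primary.get(name), backup.get(name))
            for name in set(primary) | set(backup)
            if primary.get(name) != backup.get(name)}

# Usage: the backup missed an ems-core patch, so failover would change behavior.
primary_pkgs = {"ems-core": "5.2.1", "openssl": "1.1.1k"}
backup_pkgs = {"ems-core": "5.2.0", "openssl": "1.1.1k"}
print(manifest_diff(primary_pkgs, backup_pkgs))  # {'ems-core': ('5.2.1', '5.2.0')}
```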

30 Testing Inadequacies
● Inadequate testing
● Improper procedures to test
● Incomplete scope
● Not engaging all the parties involved

31 Software and Hardware Categories and Restoration Times

32 Historical Failure Restoration Data
● Mean complete outage restoration time: 56 minutes
● Mean partial outage restoration time: 43 minutes
● Mean total outage restoration time: 99 minutes

33 Lessons Learned
● Publish information about problems and solutions
● NERC continues review of events with a working group of stakeholders and Regional personnel
● Situational Awareness workshop held in June 2013, with future workshops planned
● Dialogue with vendors to inform and improve

34 Reporting Requirements – NERC Standard EOP-004
● Complete loss of voice communication capability affecting a Bulk Electric System (BES) control center for 30 continuous minutes or more (same as Category 2a of the EAP)
● Complete loss of monitoring capability affecting a BES control center for 30 continuous minutes or more, such that analysis capability (i.e., State Estimator or Contingency Analysis) is rendered inoperable (similar to Category 2b of the EAP)
● Report to ERCOT, TRE, NERC, and DOE per the TRE web link: 004disturbancereports/Pages/Default.aspx

35 Reporting Requirements – NERC Events Analysis
● Category 1f – Unplanned evacuation from a control center facility with Bulk Power System (BPS) SCADA functionality for 30 minutes or more
● Category 1h – Loss of monitoring or control at a control center such that it significantly affects the entity's ability to make operating decisions for 30 continuous minutes or more. Examples include, but are not limited to, the following:
  - Loss of operator ability to remotely monitor BES elements, control them, or both
  - Loss of communications from SCADA RTUs
  - Unavailability of ICCP links, reducing BES visibility
  - Loss of the ability to remotely monitor and control generating units via AGC
  - Unacceptable State Estimator or Contingency Analysis solutions
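Both the EOP-004 criteria on the previous slide and Category 1h here hinge on "30 continuous minutes," so any recovery resets the clock. A minimal sketch of that timing logic (times are in minutes for readability; a real implementation would use wall-clock timestamps and feed an alarm system):

```python
REPORT_AFTER_MIN = 30  # the 30-continuous-minute reporting threshold

class OutageClock:
    """Track whether loss of monitoring has persisted 30 continuous minutes."""

    def __init__(self):
        self.lost_since = None

    def update(self, monitoring_ok, now_min):
        if monitoring_ok:
            self.lost_since = None  # any recovery resets the continuous clock
        elif self.lost_since is None:
            self.lost_since = now_min
        return (self.lost_since is not None
                and now_min - self.lost_since >= REPORT_AFTER_MIN)

# Usage: monitoring is down from t=0; the event becomes reportable at t=30.
clock = OutageClock()
for t, ok in [(0, False), (15, False), (29, False), (30, False)]:
    if clock.update(ok, t):
        print(f"t={t} min: reportable; loss has run {REPORT_AFTER_MIN}+ continuous minutes")
```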

36 What Can Operators Do?
● Watch for failures and unexpected situations
● Determine the criticality of the failure and its impact on the reliability of the grid
● Promptly report failures
● Log the date/time of the failure, a description of alarms/events, and the time of system/function restoration
● Expect EMS failures and be prepared to react
● Have the necessary backup procedures in place and be familiar with them

37 ERCOT Procedures
● Analysis Tool Outages section of the ERCOT Transmission and Security Desk Procedures (Section 3.3)
● Respond to Miscellaneous Issues section of the ERCOT Transmission and Security Desk Procedures (Section 10.1)
● Telemetry and Communications (Operating Guide Section 7)
● Failover procedure
● Loss of ICCP

38 ERCOT Procedures on Telemetry (Sect 10.1, Transmission & Security Desk, Feb. 2014)
Telemetry issues that could affect SCED and/or LMPs:
IF there is a telemetry issue, THEN:
● Ensure the appropriate Control Room personnel are aware of the issue, and
● Instruct the TO/QSE to correct the issue.
IF the TO/QSE cannot fix the issue in a timely manner, THEN:
● Ask the TO/QSE to override the bad telemetry.
IF, for some reason, the TO/QSE cannot override the bad telemetry, THEN:
● Notify the Operations Support Engineer to work with the TO/QSE and/or the ANA group to override the bad telemetry.
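The IF/THEN chain above is effectively a small decision procedure. For illustration only, the sketch below encodes it as a function; the desk procedure text, not this code, is authoritative.

```python
def telemetry_escalation(qse_can_fix, qse_can_override):
    """Return the escalation steps for a telemetry issue, per Sect 10.1."""
    steps = ["Ensure appropriate Control Room personnel are aware of the issue",
             "Instruct the TO/QSE to correct the issue"]
    if not qse_can_fix:
        steps.append("Ask the TO/QSE to override the bad telemetry")
        if not qse_can_override:
            steps.append("Notify the Operations Support Engineer to work with "
                         "the TO/QSE and/or the ANA group to override it")
    return steps

# Usage: worst case, where the TO/QSE can neither fix nor override.
for step in telemetry_escalation(qse_can_fix=False, qse_can_override=False):
    print("-", step)
```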

39 ERCOT Transmission Desk Procedures for Loss of SE/RTNET (Summary, Section 3.3, Feb. 2014)
1. If SE/RTNET has not solved within the last 15 to <30 minutes:
  - Continue to monitor the system,
  - Notify the Operations Support Engineer (OSE), and
  - Refer to Desktop Guide Trans. Desk 2.1 and run through the checklist.
2. Must complete within 30 minutes of the "tool outage": notify the two master QSEs that represent nuclear plants that ERCOT's SE is not functioning and is expected to be functional within approximately [# minutes].
3. If NOT solved within the last 30 minutes, make an Advisory Hotline call to TOs.
4. Notify the Real-Time operator to make a hotline call to QSEs.
5. If unavailable for an extended period of time or topology changes occur, request the OSE run manual studies to ensure system reliability.
6. Post an Advisory message on MIS Public.
…and LOG…
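For illustration, the timeline above can be expressed as a lookup from elapsed outage time to required actions. This is a simplified sketch of the summary, not a substitute for the actual desk procedure:

```python
def se_outage_actions(minutes_unsolved):
    """Map elapsed SE/RTNET outage time to the summarized procedure steps."""
    actions = []
    if 15 <= minutes_unsolved < 30:
        actions += ["Continue to monitor the system",
                    "Notify the Operations Support Engineer (OSE)",
                    "Run the Desktop Guide Trans. Desk 2.1 checklist",
                    "Notify the two nuclear-plant QSEs before 30 minutes elapse"]
    elif minutes_unsolved >= 30:
        actions += ["Make an Advisory Hotline call to TOs",
                    "Have the Real-Time operator make a hotline call to QSEs",
                    "Request OSE manual studies if the outage is extended",
                    "Post an Advisory message on MIS Public"]
    if actions:
        actions.append("Log all notifications and actions")
    return actions

# Usage:
print(se_outage_actions(35))
```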

40 ERCOT Procedures for Loss of Analysis Tools

Tool     | Less than time limit | Greater than time limit | Notes
---------|----------------------|-------------------------|------
SE/RTNET | 15 thru 30 min       | > 30 min                |
RTCA     | 15 thru 30 min       | > 30 min                |
TSAT     | 15 – 20 min          |                         | Manual studies, notify Oncor, no general advisory issued
VSAT     |                      |                         | Manual studies, advisory to TOs, request topology change notification

Similar procedures are followed for other tools, per the Transmission Operations Desk Procedure (Feb 2014).
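Encoded as data, the table might look like the sketch below; a watchdog or operator display could drive its prompts from such a structure. The dictionary layout is an assumption for illustration.

```python
# Outage time limits per analysis tool, transcribed from the table above.
TOOL_LIMITS = {
    "SE/RTNET": {"warn_min": 15, "limit_min": 30, "notes": ""},
    "RTCA":     {"warn_min": 15, "limit_min": 30, "notes": ""},
    "TSAT":     {"warn_min": 15, "limit_min": 20,
                 "notes": "Manual studies, notify Oncor, no general advisory issued"},
    "VSAT":     {"warn_min": None, "limit_min": None,
                 "notes": "Manual studies, advisory to TOs, "
                          "request topology change notification"},
}

def over_limit(tool, minutes_out):
    """True once a tool outage exceeds its procedure time limit."""
    limit = TOOL_LIMITS[tool]["limit_min"]
    return limit is not None and minutes_out > limit

# Usage:
print(over_limit("SE/RTNET", 31))  # True
print(over_limit("TSAT", 18))      # False
```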

41 Other References
● ERCOT Nodal Protocols, Sect 3.10
● ERCOT Nodal Operating Guides, Sect 7
● ERCOT State Estimator Standards
● ERCOT Telemetry Standards
● ERCOT Operating Procedure Manual, Shift Supervisor Desk, Sect 10
● NERC Events Analysis Process
● NERC Standard EOP-004
● NERC EMS Task Force

42 Credits
Much of the information contained in this presentation was previously published by the North American Electric Reliability Corporation (NERC) in a variety of publications. It is the result of extensive review of actual power system events over a two-year period by the EMS Event Task Force.

Questions?

43 EXAM
Please turn your iClicker on and answer each of the following questions.

44
1. Which of the following operator tools can lead to EMS failures?
a) SCADA
b) ICCP
c) RTNET
d) All of the above

45
2. What is the top root/contributing cause of EMS failures?
a) Inadequate vendor support
b) Hardware failure
c) Inadequate testing
d) Software failure

46
3. What action should ERCOT take for a loss of the State Estimator solution lasting longer than 30 minutes?
a) Monitor frequency and hope for the best
b) Make a Hotline call to issue an Advisory to the TOs
c) OOME Up units
d) RUC units offline

47
4. Which of the following steps should an operator take during an EMS failure?
a) Promptly report the failure
b) Determine the criticality of the failure and its impact on the reliability of the grid
c) Log the date/time of the failure
d) Implement backup procedures
e) All of the above

48
5. Which NERC Standard requires reporting of EMS failures?
a) NERC Events Analysis Process
b) NERC Standard TOP
c) NERC Standard EOP-004
d) NERC EMS Task Force