Emergency Database Failover: Impacts & Recovery Plan


1 Emergency Database Failover: Impacts & Recovery Plan
Trey Felton – ERCOT IT

2 Synopsis
ISM – Information Services Master Database
DB – Database
EDW – Electronic Data Warehouse

3 Synopsis
[Diagram labels: Failover; Out of synch (24 hrs)]
An emergency DB failover took place on April 21st, 2008.
The Market DB (which feeds ISM) became unresponsive, and data could not be written or read.
Synchronization issues caused a 24-hour gap in the data, which propagated through to ISM.

4 Synopsis
After the failover, the Physical Standby was brought online.
ISM was rebuilt from source data to recover the affected extracts.

5 Impacts
Market transactions were prevented from updating ISM through the Logical Standby.
The Market DB uses a standby to prevent outages and performance degradation.
The Logical Standby (RSS) fell 24 hours out of synch with the Physical Standby, covering April 21 at 10:44am through April 22 at 11:14am.
Other DBs feeding ISM continued normally; only the Market DB was out of synch.
Because the Market DB has to be kept up, rebuilding the Standby took priority over rebuilding the RSS.
This prolonged the outage to the EDW and the affected extracts.
Prices had to be recalculated and extracts restored from source.
Price adjustments for NSRS were completed on June 5th.
Missing extracts for April 21 through April 30 were completed on July 1st.
Why did recovery take so long?
ISM generates up to 25-35 GB of data per day.
Data was restored from source back to April 1st.
120 terabytes had to be restored in order to roll forward through the transaction gap.
Archive log changes from the 24-hour gap were then applied.
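The recovery described on this slide (restore from source back to a known point, then roll forward through the gap by applying archived changes) can be illustrated with a toy model. This is a minimal sketch under stated assumptions, not ERCOT's procedure or any database product's actual recovery mechanism: `ChangeRecord`, `roll_forward`, the in-memory dictionary standing in for the database, and the sample prices are all hypothetical. It also checks the RSS lag using the two timestamps given on the slide.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, Iterable


@dataclass(frozen=True)
class ChangeRecord:
    """One archived change (hypothetical format): set `key` to `value` at time `ts`."""
    ts: datetime
    key: str
    value: float


def roll_forward(baseline: Dict[str, float],
                 baseline_ts: datetime,
                 archive: Iterable[ChangeRecord],
                 target_ts: datetime) -> Dict[str, float]:
    """Restore a copy of the baseline, then replay archived changes in time order.

    Only changes made after the baseline was taken and at or before target_ts
    are applied, so the result reflects the data as of target_ts.
    """
    state = dict(baseline)  # "restore": start from a copy, leave the backup untouched
    for rec in sorted(archive, key=lambda r: r.ts):  # replay in time order
        if baseline_ts < rec.ts <= target_ts:
            state[rec.key] = rec.value
    return state


if __name__ == "__main__":
    baseline_ts = datetime(2008, 4, 1)  # restore point, as on the slide
    baseline = {"price_A": 40.0, "price_B": 55.0}  # hypothetical sample data
    archive = [
        ChangeRecord(datetime(2008, 4, 21, 11, 0), "price_A", 42.5),  # inside the gap
        ChangeRecord(datetime(2008, 3, 31, 23, 0), "price_B", 50.0),  # before restore point: skipped
        ChangeRecord(datetime(2008, 4, 22, 9, 30), "price_B", 57.0),  # inside the gap
        ChangeRecord(datetime(2008, 5, 2, 8, 0), "price_A", 99.0),    # after target: skipped
    ]
    print(roll_forward(baseline, baseline_ts, archive, datetime(2008, 4, 30, 23, 59)))
    # {'price_A': 42.5, 'price_B': 57.0}

    # RSS lag behind the Physical Standby, from the two timestamps on the slide.
    lag = datetime(2008, 4, 22, 11, 14) - datetime(2008, 4, 21, 10, 44)
    print(lag.total_seconds() / 3600)  # 24.5
```

Sorting before replay ensures that a later change to the same key overwrites an earlier one, and the restored baseline itself is never modified.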

6 Emergency Database Failover
All data was restored with 100% accuracy.
The market systems involved in the April failure run the balancing energy and ancillary services markets; they are not used for wholesale batch or the retail markets.
ERCOT considers this an isolated incident, not a systemic problem.

7 Going Forward
Actions to prevent future occurrences:
Nodal market DBs will use newer hardware, with more fault tolerance and redundancy.
The replication architecture is changing for Nodal: a proof of concept was recently introduced into the Nodal market systems, and testing is underway.
ERCOT is conducting a risk/cost analysis of several options for the Zonal systems, to be presented to TAC in August.
New backup/recovery procedures: a project has been initiated to stabilize our database backup procedures and shorten recovery time.

8 Data Recovery
NOTICE DATE: July 1, 2008
NOTICE TYPE: W-A UPDATE Extracts - Wholesale
CLASSIFICATION: Public
SHORT DESCRIPTION: ERCOT has completed recovery of the missing data for April 21 through April 30, 2008.
INTENDED AUDIENCE: QSEs
DAY AFFECTED: April 21 through April 30, 2008
LONG DESCRIPTION: ERCOT conducted an emergency database failover on April 21, 2008 following a hardware failure. This database failover resulted in an out-of-synch data problem from April 21 through April. ERCOT developed a phased process to attempt to thoroughly recover the missing data. The missing data has been recovered for the following extracts:
Act_Res_Output
Ancillary_Services_Daily
Bids_and_Schedules_Daily
Forecast_Data_Daily
Market_Information_Daily
Sched_and_Actual_Load
Self_Sch_Energy_Services
ASDEPLOYMENTS
A market notice will be sent when the extracts are expected to be posted.

