# Data Mining Techniques to classify inter-area oscillations

Data Mining Techniques to classify inter-area oscillations
Adamantios Marinakis, ABB Corporate Research CH. London, 29/11/2013

Presentation outline

- Problem statement
- Data mining
- Support Vector Machines
- Evolution Strategies
- Random Forests
- Solution – Results
- Conclusion

Wide-Area Monitoring System (WAMS)
- GPS satellite: time stamps
- Voltage and current phasors (V, I over t) measured at many locations and sent over the communication network to the System Protection Center
- Visualization of power system dynamics
- Stability monitoring
- Stability control and blackout prevention

© ABB Group March 31, 2017 | Slide 4

Power Damping Monitoring – PDM Principle
1. Sliding window of minutes length
2. Estimate MIMO state-space model:

   $$x(k+1) = A\,x(k) + B\,u(k) + K\,e(k)$$

   $$y(k) = C\,x(k) + D\,u(k) + e(k)$$

3. Carry out modal analysis
4. Damping and frequency of critical modes
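
The modal analysis step can be illustrated with a minimal sketch. It assumes the eigenvalues of the estimated matrix $A$ are already available and that the sampling period of the discrete-time model is known; the function name and the example numbers are hypothetical and not taken from the PDM implementation.

```python
import cmath
import math


def mode_from_discrete_eigenvalue(z, ts):
    """Return (frequency in Hz, damping ratio) for one eigenvalue z of A.

    z  : complex eigenvalue of the discrete-time state matrix A (z != 1)
    ts : sampling period in seconds (assumed known)
    """
    s = cmath.log(z) / ts                    # equivalent continuous-time eigenvalue
    freq_hz = abs(s.imag) / (2.0 * math.pi)  # oscillation frequency
    damping = -s.real / abs(s)               # damping ratio of the mode
    return freq_hz, damping


# Hypothetical example: a 0.5 Hz mode sampled every 0.1 s.
ts = 0.1
s_true = complex(-0.1, 2.0 * math.pi * 0.5)
z = cmath.exp(s_true * ts)
freq, zeta = mode_from_discrete_eigenvalue(z, ts)
```

A mode would then be flagged as critical when `zeta` falls below a chosen threshold.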

Swissgrid WAMS Collects measurements from PMUs around Europe

And then? Do something more than observing…
What we have:

- An operator can at any moment know what the oscillation modes in its system are
  ⇒ The operator can know its system security status in real time
- Insecure if damping < some value

What would be nice to have:

- Given a candidate operating point, predict its expected oscillatory status.
- Given an observed poorly damped operating point:
  - say what is the reason for this;
  - modify the operating point such that it becomes well damped (insecure → secure).

Model: operating point → security status

What is an “operating point”? At least, how we define it here

- SCADA system (time-stamped data) → input variables:
  - generation, load dispatch
  - line power flows
  - FACTS devices status
  - (PSS status)
- PMU measurements → WAMS → output labels: time-stamped oscillation damping ratios
- Need to time-synchronize them
- Database → Train classifier
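
A minimal sketch of how the two time-stamped sources could be joined into a training database is given below. The record layout, the `max_gap` tolerance, and the damping threshold of 0.03 are illustrative assumptions; the slides only say "some value" for the threshold.

```python
from bisect import bisect_left


def nearest_index(sorted_times, t):
    """Index of the entry in sorted_times (non-empty, ascending) closest to t."""
    i = bisect_left(sorted_times, t)
    if i == 0:
        return 0
    if i == len(sorted_times):
        return len(sorted_times) - 1
    return i if sorted_times[i] - t < t - sorted_times[i - 1] else i - 1


def build_database(scada, wams, max_gap, damping_threshold):
    """Pair each WAMS damping estimate with the nearest SCADA snapshot.

    scada : list of (timestamp, feature_list), sorted by timestamp
    wams  : list of (timestamp, damping_ratio)
    Returns a list of (feature_list, label) rows.
    """
    scada_times = [t for t, _ in scada]
    rows = []
    for t, damping in wams:
        j = nearest_index(scada_times, t)
        if abs(scada_times[j] - t) <= max_gap:  # skip unmatched records
            label = "insecure" if damping < damping_threshold else "secure"
            rows.append((scada[j][1], label))
    return rows


# Hypothetical records: features are [intertie flow, PSS status].
scada = [(0.0, [100.0, 1]), (60.0, [120.0, 0]), (120.0, [90.0, 1])]
wams = [(2.0, 0.08), (58.0, 0.02), (300.0, 0.05)]
rows = build_database(scada, wams, max_gap=10.0, damping_threshold=0.03)
```

The last WAMS record is dropped because no SCADA snapshot lies within the tolerance.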

What is data mining? Apart from a fancy term
An interdisciplinary subfield of computer science. It is the computational process of discovering patterns in large data sets involving methods at the intersection of artificial intelligence, machine learning, statistics and database systems. The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use. It is about analyzing the data

Support Vector Machines A powerful classification technique
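Main idea:

- Find the optimal separating hyperplane ⇒ maximum margin, i.e. maximize the distance to the closest point from either class
- Minimizes generalization error

The hyperplane

$$f(\boldsymbol{x}) = \boldsymbol{w}^T \boldsymbol{x} + b = 0$$

is found by solving:

$$\min_{\boldsymbol{w},\,b} \lVert \boldsymbol{w} \rVert \quad \text{subject to} \quad y_i\left(\boldsymbol{w}^T \boldsymbol{x}_i + b\right) \ge 1, \quad i = 1, \dots, N$$

a QP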

Non-separable classes
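Separable classes:

$$\min_{\boldsymbol{w},\,b} \lVert \boldsymbol{w} \rVert \quad \text{s.t.} \quad y_i\left(\boldsymbol{w}^T \boldsymbol{x}_i + b\right) \ge 1, \quad i = 1, \dots, N$$

Non-separable classes:

$$\min_{\boldsymbol{w},\,b} \lVert \boldsymbol{w} \rVert^2 + C \sum_{i=1}^{N} \xi_i \quad \text{s.t.} \quad y_i\left(\boldsymbol{w}^T \boldsymbol{x}_i + b\right) \ge 1 - \xi_i, \quad \xi_i \ge 0 \;\; \forall i$$

$C$: regularization parameter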
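
To make the role of the slack variables concrete, the sketch below evaluates the non-separable objective for a fixed, hand-picked hyperplane. It does not solve the QP; each $\xi_i$ is simply set to the smallest value that satisfies its constraint, and the data points are made up for illustration.

```python
def soft_margin_objective(w, b, xs, ys, c):
    """Evaluate ||w||^2 + C * sum(xi_i) for a given hyperplane (w, b).

    Each slack xi_i takes its smallest feasible value,
    xi_i = max(0, 1 - y_i * (w^T x_i + b)).
    Returns (objective value, list of slacks).
    """
    norm_sq = sum(wj * wj for wj in w)
    slacks = []
    for x, y in zip(xs, ys):
        margin = y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
        slacks.append(max(0.0, 1.0 - margin))
    return norm_sq + c * sum(slacks), slacks


# Hypothetical 2-D data: the last point lies on the wrong side of the boundary.
xs = [[2.0, 0.0], [0.5, 0.0], [-1.0, 0.0], [0.5, 1.0]]
ys = [1, 1, -1, -1]
objective, slacks = soft_margin_objective([1.0, 0.0], 0.0, xs, ys, c=10.0)
```

Points outside the margin get zero slack, a point inside the margin gets a slack between 0 and 1, and a misclassified point gets a slack above 1.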

And what about nonlinear patterns in the data?
Map into a higher-dimensional feature space

Is there any problem? YES! Number of features may blow up! ⇒

- Computing the mapping can be inefficient
- Using the mapped representation can be inefficient

Is there any solution? YES!

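The “kernel trick”

QP solved by resorting to its dual problem:

$$\max_{\boldsymbol{\alpha}} \; \sum_{i=1}^{N} \alpha_i - \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j \boldsymbol{x}_i^T \boldsymbol{x}_j \quad \text{s.t.} \quad \sum_{i=1}^{N} \alpha_i y_i = 0, \quad 0 \le \alpha_i \le C \;\; \forall i$$

which … finally gives:

$$f(\boldsymbol{x}) = \sum_{i=1}^{N} \alpha_i y_i \boldsymbol{x}_i^T \boldsymbol{x} + b$$

In the mapped feature space, $\boldsymbol{x}_i^T \boldsymbol{x}_j$ is replaced by $\boldsymbol{\phi}(\boldsymbol{x}_i)^T \boldsymbol{\phi}(\boldsymbol{x}_j)$.

Note: We only need $\boldsymbol{\phi}(\boldsymbol{x})^T \boldsymbol{\phi}(\boldsymbol{x}')$, never just $\boldsymbol{\phi}(\boldsymbol{x})$. Hence:

kernel function: $k(\boldsymbol{x}, \boldsymbol{x}') = \boldsymbol{\phi}(\boldsymbol{x})^T \boldsymbol{\phi}(\boldsymbol{x}')$

Kernel matrix $K$ ($N \times N$): $K_{ij} = k(\boldsymbol{x}_i, \boldsymbol{x}_j)$

The kernel should correspond to a dot product in the space defined by $\boldsymbol{\phi}$.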
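
The identity behind the kernel trick can be checked numerically. The sketch below uses a degree-2 polynomial kernel in two dimensions, for which an explicit feature map is easy to write down; the feature map `phi` and the sample points are chosen for illustration and do not come from the slides.

```python
import math


def dot(a, b):
    """Plain dot product of two equal-length lists."""
    return sum(ai * bi for ai, bi in zip(a, b))


def phi(x):
    """Explicit feature map whose dot product equals (1 + x^T x')^2 in 2-D."""
    x1, x2 = x
    r2 = math.sqrt(2.0)
    return [1.0, r2 * x1, r2 * x2, x1 * x1, x2 * x2, r2 * x1 * x2]


def poly2_kernel(x, xp):
    """Degree-2 polynomial kernel, computed without ever forming phi."""
    return (1.0 + dot(x, xp)) ** 2


x, xp = [1.0, 2.0], [3.0, -1.0]
explicit = dot(phi(x), phi(xp))   # 6-dimensional dot product
via_kernel = poly2_kernel(x, xp)  # 2-dimensional dot product, then a square
```

Both routes give the same number, but the kernel route never builds the six-dimensional vectors, which is what keeps the dual problem tractable when the feature space is large.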

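Most used kernels

- Polynomial: $k(\boldsymbol{x}, \boldsymbol{x}') = \left(1 + \boldsymbol{x}^T \boldsymbol{x}'\right)^d$
- Linear: special case of polynomial
- Gaussian: $k(\boldsymbol{x}, \boldsymbol{x}') = e^{-\gamma \lVert \boldsymbol{x} - \boldsymbol{x}' \rVert^2}$

$d$, $\gamma$ etc. are called “kernel hyperparameters”. They have to be chosen by the user.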
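
A small sketch of these kernels and of the resulting kernel matrix follows. It assumes that $\gamma$ multiplies the squared distance in the Gaussian kernel, which is one reading of the flattened formula; the sample points and hyperparameter values are arbitrary.

```python
import math


def polynomial_kernel(x, xp, d):
    """(1 + x^T x')^d; d = 1 gives the linear case up to a constant offset."""
    return (1.0 + sum(a * b for a, b in zip(x, xp))) ** d


def gaussian_kernel(x, xp, gamma):
    """exp(-gamma * ||x - x'||^2); gamma placement is an assumption."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, xp))
    return math.exp(-gamma * sq_dist)


def gram_matrix(xs, kernel):
    """N x N matrix with entries K[i][j] = kernel(xs[i], xs[j])."""
    return [[kernel(a, b) for b in xs] for a in xs]


xs = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
K = gram_matrix(xs, lambda a, b: gaussian_kernel(a, b, gamma=0.5))
```

The matrix is symmetric with ones on the diagonal, since every point has zero distance to itself.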

Phew, now it seems that quite some tuning is required …

The user should choose …

- $C$
- kernel function
- kernel function hyperparameters

Role of the regularization parameter $C$: even more pronounced in an enlarged feature space, where perfect separation can typically be achieved.

- An overly large value of $C$ will lead to an overfit, “too curvy” boundary.
- An overly small $C$ will lead to an overly smooth boundary, with large training error.
- Large $d$, $\gamma$ ⇒ kernel function “too flexible”; a very nonlinear boundary can be achieved.

Proper tuning is essential for good SVM performance.

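Automatic tuning of the SVM hyperparameters: a nonlinear, non-analytical optimization problem

- Choose: $C$, kernel, ($d$, $\gamma$, …)
- Such that: SVM accuracy is maximized
- Kernel choice: binary → continuous

$$k(\boldsymbol{x}, \boldsymbol{x}') = \lambda \left(1 + \boldsymbol{x}^T \boldsymbol{x}'\right)^d + (1 - \lambda)\, e^{-\gamma \lVert \boldsymbol{x} - \boldsymbol{x}' \rVert^2}$$

- SVM accuracy: 10-fold cross-validation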
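
The mixed kernel and the fold bookkeeping of a 10-fold cross-validation can be sketched as follows. As before, $\gamma$ is assumed to multiply the squared distance, and the fold splitter is a plain contiguous split without shuffling, chosen here for brevity rather than taken from the slides.

```python
import math


def mixed_kernel(x, xp, lam, d, gamma):
    """lam * polynomial + (1 - lam) * Gaussian, with 0 <= lam <= 1."""
    dot = sum(a * b for a, b in zip(x, xp))
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, xp))
    return lam * (1.0 + dot) ** d + (1.0 - lam) * math.exp(-gamma * sq_dist)


def k_fold_indices(n, k=10):
    """Split indices 0..n-1 into k contiguous folds of nearly equal size."""
    folds, start = [], 0
    for f in range(k):
        size = n // k + (1 if f < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


x, xp = [1.0, 2.0], [3.0, -1.0]
pure_poly = mixed_kernel(x, xp, lam=1.0, d=2, gamma=0.5)   # polynomial only
pure_gauss = mixed_kernel(x, xp, lam=0.0, d=2, gamma=0.5)  # Gaussian only
folds = k_fold_indices(23, k=10)
```

Setting `lam` to 1 or 0 recovers the two pure kernels, so the discrete kernel choice becomes one more continuous variable for the optimizer.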

The basic cycle of the ES algorithm
[Figure: candidate points (×) annotated with their fitness values 𝑓=…; the cycle alternates between Explore and Exploit]

Mutation: create an offspring out of one parent
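$\tilde{\boldsymbol{y}}$ is created by mutating $\boldsymbol{y}$:

$$\tilde{\boldsymbol{y}} = \boldsymbol{y} + \boldsymbol{z} \quad \text{with} \quad \boldsymbol{z} := \sigma \left(\mathcal{N}_1(0,1), \dots, \mathcal{N}_n(0,1)\right)$$

$\sigma$ is called the mutation strength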

Create 𝜆 offspring out of one parent

- Each variable $l$ has its own mutation strength $\sigma_l$
- Mutation strengths are also mutated:

  $$\tilde{\sigma}_l = \sigma_l\, e^{\zeta_l} \quad \text{with } \zeta_l \text{ sampled from } \tau\,\mathcal{N}(0,1) + \tau'\,\mathcal{N}(0,1)$$

- Each individual carries its mutation strengths’ values: $y = (y_1, \dots, y_l, \dots, \sigma_1, \dots, \sigma_l, \dots)$
- Idea: individuals with more suitable mutation strength values will survive
- Before mutating the individual object parameters, the strategy parameters are first mutated
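
A minimal sketch of this self-adaptive mutation is shown below. It assumes, as in the commonly used scheme, that the $\tau'$ term is drawn once per individual while the $\tau$ term is drawn separately for each variable; the slide does not state which draw is shared, and the numerical values are arbitrary.

```python
import math
import random


def mutate(y, sigmas, rng, tau, tau_prime):
    """Self-adaptive mutation of one individual (y, sigmas).

    The mutation strengths are mutated first, then used to perturb y.
    """
    shared = rng.gauss(0.0, 1.0)  # one draw shared by all variables (assumption)
    new_sigmas = [
        s * math.exp(tau_prime * shared + tau * rng.gauss(0.0, 1.0))
        for s in sigmas
    ]
    new_y = [yi + si * rng.gauss(0.0, 1.0) for yi, si in zip(y, new_sigmas)]
    return new_y, new_sigmas


rng = random.Random(1)
n = 2
tau_prime = 1.0 / math.sqrt(2.0 * n)          # a common default, assumed here
tau = 1.0 / math.sqrt(2.0 * math.sqrt(n))     # a common default, assumed here
child_y, child_sigmas = mutate([5.0, 5.0], [1.0, 1.0], rng, tau, tau_prime)
```

Because the update is multiplicative, the mutation strengths stay positive whatever values are drawn.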

Population 𝜇>1

Another variation operator: Recombination
- Create offspring out of $\varrho$ parents, e.g. ($\varrho = 2$):

  $$y_i = \frac{y_i^A + y_i^B}{2}$$

- Do recombination $\lambda$ times
- Then apply mutation to those offspring
- Parents are selected by uniform random distribution (their fitness is NEVER taken into account)

Notation: $(\mu/\varrho \overset{+}{,} \lambda)$-ES

- $(\mu/\varrho, \lambda)$ selection preferred over $(\mu/\varrho + \lambda)$ selection:
  - better in leaving a local optimum
  - better in following moving optima
  - with the + strategy, a bad $\sigma$ can survive too long
- $\mu > 1$ to carry different strategies
- high selective pressure (usually $\lambda \approx 7 \cdot \mu$) to generate an offspring surplus
- mix strategy parameters (i.e. mutation strengths) by recombining them
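
Putting recombination, self-adaptive mutation, and comma selection together gives a loop like the sketch below. It is a toy illustration under stated assumptions: the sphere function stands in for the real objective (which would be one minus the cross-validated SVM accuracy), and the learning rates and default sizes are common textbook choices rather than values from the slides.

```python
import math
import random


def es_minimize(f, start, sigma0=1.0, mu=3, lam=21, rho=2,
                generations=100, seed=0):
    """Minimal (mu/rho, lam)-ES with intermediate recombination."""
    rng = random.Random(seed)
    n = len(start)
    tau_prime = 1.0 / math.sqrt(2.0 * n)
    tau = 1.0 / math.sqrt(2.0 * math.sqrt(n))
    parents = [(list(start), [sigma0] * n) for _ in range(mu)]
    best_y, best_f = list(start), f(start)
    for _ in range(generations):
        offspring = []
        for _ in range(lam):
            # Parents are drawn uniformly; fitness plays no role here.
            chosen = rng.sample(parents, rho)
            y = [sum(p[0][i] for p in chosen) / rho for i in range(n)]
            s = [sum(p[1][i] for p in chosen) / rho for i in range(n)]
            # Mutate the strategy parameters first, then the object parameters.
            shared = rng.gauss(0.0, 1.0)
            s = [si * math.exp(tau_prime * shared + tau * rng.gauss(0.0, 1.0))
                 for si in s]
            y = [yi + si * rng.gauss(0.0, 1.0) for yi, si in zip(y, s)]
            offspring.append((f(y), y, s))
        # Comma selection: the next parents come from the offspring only.
        offspring.sort(key=lambda item: item[0])
        parents = [(y, s) for _, y, s in offspring[:mu]]
        if offspring[0][0] < best_f:
            best_f, best_y = offspring[0][0], offspring[0][1]
    return best_y, best_f


def sphere(y):
    """Toy objective: sum of squares, minimum at the origin."""
    return sum(v * v for v in y)


best_y, best_f = es_minimize(sphere, [5.0, 5.0])
```

The defaults follow the rule of thumb above with $\lambda = 7\mu$. The best point seen so far is tracked separately because comma selection discards the parents in every generation.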

ES-tuned SVM classifier Coming up with the oscillation damping classifier

Random Forests A promising alternative
- A collection of decision trees
- Basic idea of a decision tree (DT):
  - Greedy algorithm to progressively select the cut-attributes
  - Splitting decided according to some node impurity measure, typically the Gini index
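
The Gini index and its use for scoring a candidate split can be sketched in a few lines. The feature values, labels, and thresholds are made up for illustration.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions (0 for a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_impurity(values, labels, threshold):
    """Size-weighted Gini impurity of the two children of a split."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


# Hypothetical feature (e.g. an intertie flow) and security labels.
values = [1.0, 2.0, 3.0, 4.0]
labels = ["secure", "secure", "insecure", "insecure"]
good_split = split_impurity(values, labels, threshold=2.5)
poor_split = split_impurity(values, labels, threshold=1.5)
```

A greedy tree builder would evaluate such scores over all candidate variables and thresholds at a node and keep the split with the lowest value.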

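Ensemble classifiers General Idea

Why do they work?

- Assume 25 classifiers
- Each with error rate $\varepsilon = 0.35$
- Assume independence among classifiers
- Error rate of the ensemble classifier (a majority vote is wrong when at least 13 of the 25 classifiers are wrong):

$$\sum_{i=13}^{25} \binom{25}{i} \varepsilon^i (1 - \varepsilon)^{25 - i} = 0.06$$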
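
The ensemble error rate can be reproduced with a short calculation, assuming a majority vote over an odd number of independent classifiers. The function name is hypothetical.

```python
from math import comb


def ensemble_error(n_classifiers, eps):
    """Probability that a majority of independent classifiers is wrong.

    Assumes an odd number of classifiers, each with error rate eps.
    """
    k_min = n_classifiers // 2 + 1  # smallest number of wrong votes that loses
    return sum(
        comb(n_classifiers, i) * eps ** i * (1.0 - eps) ** (n_classifiers - i)
        for i in range(k_min, n_classifiers + 1)
    )


err = ensemble_error(25, 0.35)  # approximately 0.06
```

The gain disappears when the individual error rate reaches 0.5, and the independence assumption is the part that rarely holds in practice.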

Random Forests – The algorithm
Given training dataset $\mathcal{D} = \{(\boldsymbol{x}_1, y_1), \dots, (\boldsymbol{x}_n, y_n)\}$

For $b = 1$ to $B$:

1. Draw a bootstrap sample $\mathcal{D}_b$ of size $n$ from $\mathcal{D}$ (i.e. sample $n$ times with replacement)
2. Grow a tree classifier on $\mathcal{D}_b$, where each split is computed as follows:
   - Select $m$ variables at random (from the $p$ variables)
   - Pick the best variable/split-point among the $m$
   - Split the current node into two

Output: the ensemble of $B$ trees

Variance of the average of $B$ trees, each with variance $\sigma^2$ and pairwise correlation $\varrho$:

$$\varrho\,\sigma^2 + \frac{1 - \varrho}{B}\,\sigma^2$$

- Feature importance insight
- Massive parallelization potential
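
The two sources of randomness in the algorithm, and the variance expression, can be sketched as follows. Tree growing itself is left out; the helper names and the sizes used in the example are illustrative assumptions.

```python
import random


def bootstrap_indices(n, rng):
    """Draw n sample indices with replacement from 0..n-1."""
    return [rng.randrange(n) for _ in range(n)]


def out_of_bag(n, sample):
    """Indices never drawn into the bootstrap sample (usable for OOB accuracy)."""
    drawn = set(sample)
    return [i for i in range(n) if i not in drawn]


def candidate_variables(p, m, rng):
    """Pick m distinct variables out of p for one split."""
    return rng.sample(range(p), m)


def ensemble_variance(rho, sigma2, b):
    """rho * sigma^2 + (1 - rho) / B * sigma^2."""
    return rho * sigma2 + (1.0 - rho) / b * sigma2


rng = random.Random(0)
n, p, m = 20, 9, 3
sample = bootstrap_indices(n, rng)
oob = out_of_bag(n, sample)
split_vars = candidate_variables(p, m, rng)
```

Restricting each split to `m` randomly chosen variables lowers the correlation between trees, which shrinks the first term of the variance; growing more trees only shrinks the second term.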

- SCADA system (time-stamped data) → input variables:
  - generation, load dispatch
  - line power flows
  - FACTS devices status
  - (PSS status)
- PMU measurements → WAMS → output labels: time-stamped oscillation damping ratios
- Need for proper feature selection
- Database → Train classifier

Test system - Modified Nordic32 12978 samples, produced by simulations
Generators mostly participating in the Hz mode (based on participation factors from the linear model)

Damping vs. Intertie Cut Correlated, but …
[Figure: damping plotted against intertie cut. Annotations, in source order: 3580 samples; 1643 samples (out of 12978); “Correspond to different PSS being off”; 1271 samples; 4851 samples]

ES-SVM classifier 10-fold cross-validation accuracy
| Input features | Mixed kernel (%) | Radial basis kernel (%) | Polynomial kernel (%) |
|---|---|---|---|
| Only intertie flow | 92.7 | 92.7 | 92.0 |
| Intertie flow & PSS status | 93.4 | 94.0 | 92.8 |
| Dispatch | 95.6 | 95.6 | 95.6 |
| Intertie flow, PSS status & synthetic features | 98.3 | 97.8 | 98.2 |
| Dispatch & PSS status | 98.6 | 97.8 | 98.3 |
| Dispatch, power flows, PSS status & synthetic features | 99.2 | 98.6 | 99.1 |

- 1% - 3% improvement compared to initial guess
- mixed kernel slightly better
- More features ⇒ better performance (even if redundant)

Random Forest classifier Out-of-bag accuracy
| Input features | Accuracy (%) |
|---|---|
| Dispatch, power flows, PSS status & synthetic features | 97.79 |
| PSS, Intertie, Line 18, Line 32 | 98.54 |
| PSS, Intertie, Gen63, Line 16, Line 32 | 98.53 |
| PSS, Intertie, Gen63 & 6 line flows | 98.59 |

- very efficient feature selection
- less accurate than SVM

[Figure labels: lines 18, 16, 32 and Gen63]

Conclusion … and challenges
- WAMS-SCADA link turned out to be an interesting idea
  - At least for the inter-area oscillations case
- SVM achieved higher accuracy
  - proper SVM tuning pays off
- RFs are not much worse, while allowing for very efficient feature selection

Challenges…

- Check on real data
- Computational intensiveness
- Close the loop – Correct the operating point based on the model

Acknowledgment The author gratefully acknowledges the financial support from Marie Curie FP7-IAPP Project: Using real-time measurements for monitoring and management of power transmission dynamics for the smart grid- REAL-SMART, Contract No. PIAP-GA