Intrusion Detection Using Neural Networks and Support Vector Machine
IEEE WCCI IJCNN 2002: World Congress on Computational Intelligence, International Joint Conference on Neural Networks
Srinivas Mukkamala, Guadalupe Janoski, Andrew Sung
Dept. of CS, New Mexico Institute of Mining and Technology
Outline
- Approaches to intrusion detection using neural networks and support vector machines
- DARPA dataset
- Neural Networks
- Support Vector Machines
- Experiments
- Conclusion and Comments
Approaches
- Key ideas: discover useful patterns or features that describe user behavior on a system, and use the set of relevant features to build classifiers that can recognize anomalies and known intrusions
- Neural networks and support vector machines are trained on normal user activity and attack patterns
- Significant deviations from normal behavior are flagged as attacks
DARPA Data for Intrusion Detection
- DARPA (Defense Advanced Research Projects Agency): an agency of the US Department of Defense responsible for developing new technology for use by the military
- Benchmark from a KDD (Knowledge Discovery and Data Mining) competition, built on DARPA data
- Attacks fall into four main categories:
  - DOS: denial of service
  - R2L: unauthorized access from a remote machine
  - U2R: unauthorized access to local superuser (root) privileges
  - Probing: surveillance and other probing
Features: http://kdd.ics.uci.edu/databases/kddcup99/task.html
Neural Networks
- Neuron (神經): receives signals and produces an output signal
- Dendrites (樹突): gather incoming signals
- Soma (中心): combines the signals and decides whether to trigger
- Axon (軸突): carries the output signal
Divide and Conquer
- A single neuron thresholds a line in the plane: w1 X1 + w2 X2 − θ = 0 (inputs X1, X2 are weighted by w1, w2, summed, and passed through the activation to produce the output)
- One line cannot separate all four regions A, B, C, D, so two neurons N1 and N2 each draw a line, and a third neuron N3 combines their ±1 outputs to classify the regions (see the sketch below)
[Figure: perceptron diagram (INPUT, WEIGHT, ACTIVATION, OUTPUT) and the N1/N2/N3 truth tables over regions A-D]
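A minimal Python sketch of this idea. The exact weights in the slide's table are not fully recoverable, so the values below are illustrative: N1 and N2 each threshold a line, and N3 ANDs their outputs to pick out a band that no single line can separate.

```python
import numpy as np

def neuron(x, w, theta):
    # A threshold unit: fires +1 when w1*x1 + w2*x2 - theta > 0, else -1.
    return 1 if np.dot(w, x) - theta > 0 else -1

def n3(x):
    # N1 and N2 each threshold a line in the plane (illustrative weights).
    h1 = neuron(x, np.array([1.0, 1.0]), 0.5)     # line x1 + x2 = 0.5
    h2 = neuron(x, np.array([-1.0, -1.0]), -1.5)  # line x1 + x2 = 1.5, flipped side
    # N3 fires only when both N1 and N2 fire (an AND of half-planes),
    # selecting the band 0.5 < x1 + x2 < 1.5.
    return neuron(np.array([h1, h2]), np.array([1.0, 1.0]), 1.5)

print(n3(np.array([0.5, 0.5])))  # +1: inside the band
print(n3(np.array([2.0, 2.0])))  # -1: outside
```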
Feed Forward Neural Network (FFNN)
- Layers 1, 2, ..., L: (1) decide the architecture, (2) determine the weights automatically
- Activation: hyperbolic tangent, tanh(S) = (e^S − e^−S) / (e^S + e^−S)
- In general, neuron Nj in layer l computes the cumulated signal S_j^(l) = Σ_i w_ij^(l) x_i^(l−1) and outputs x_j^(l) = tanh(S_j^(l))
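A small sketch of one forward pass in Python, following the slide's notation; the convention that each weight matrix carries a first row for the bias input x_0 = 1 is an assumption of the sketch.

```python
import numpy as np

def forward(x, weights):
    """One FFNN forward pass; weights[l] maps layer l to layer l+1 and
    includes a first row for the bias input x_0 = 1."""
    activations = [x]
    for W in weights:                     # layers l = 1, ..., L
        x = np.concatenate([[1.0], x])    # prepend bias x_0^(l-1) = 1
        S = x @ W                         # cumulated signal S_j^(l)
        x = np.tanh(S)                    # activated output x_j^(l) = tanh(S_j^(l))
        activations.append(x)
    return activations                    # all x^(l), needed later for backprop
```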
Training a Neural Network
- Training data: (x_n, y_n); g(x) is the classifier composed of the weights w
- Error function E(w) measures how far g deviates from the training data
- How to minimize E(w)? Stochastic Gradient Descent (SGD):
  - w starts as small random values
  - for T iterations: w_new ← w_old − η · ∇_w(E_n), where η is the learning rate
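The update rule as a Python sketch. E_n and its gradient depend on the model, so grad_En here is a caller-supplied function, an assumption made to keep the sketch generic.

```python
import numpy as np

def sgd(grad_En, N, dim, eta=0.1, T=1000, seed=0):
    """grad_En(w, n) returns the gradient of the error on example n."""
    rng = np.random.default_rng(seed)
    w = 0.01 * rng.normal(size=dim)      # w is random small values at the beginning
    for _ in range(T):                   # for T iterations
        n = rng.integers(N)              # pick one training example at random
        w = w - eta * grad_En(w, n)      # w_new <- w_old - eta * grad_w(E_n)
    return w
```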
Back Propagation Algorithm
- Forward: for l = 1, 2, ..., L, compute S_j^(l) and x_j^(l)
- Backward: for l = L, L−1, ..., 1, compute δ_i^(l)
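A sketch of the backward pass, paired with the forward() sketch above. A squared-error output is an assumption; tanh everywhere lets the deltas reuse the stored activations, since tanh'(S) = 1 − tanh(S)² = 1 − x².

```python
import numpy as np

def backward(activations, weights, target):
    """Compute dE/dw_ij^(l) = x_i^(l-1) * delta_j^(l) for every layer."""
    x_out = activations[-1]
    delta = 2 * (x_out - target) * (1 - x_out ** 2)     # delta^(L), squared error
    grads = [None] * len(weights)
    for l in range(len(weights) - 1, -1, -1):           # l = L, L-1, ..., 1
        x_in = np.concatenate([[1.0], activations[l]])  # layer input incl. bias
        grads[l] = np.outer(x_in, delta)                # gradient for w^(l)
        if l > 0:                                       # propagate delta back,
            delta = (weights[l][1:] @ delta) * (1 - activations[l] ** 2)  # skip bias row
    return grads
```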
Feed Forward NNet: Summary
- Consists of layers 1, 2, ..., L; weight w_ij^(l) connects neuron i in layer (l−1) to neuron j in layer l
- Each neuron forms the cumulated signal S_j^(l) and the activated output x_j^(l); the activation is often tanh
- Minimize E(w) to determine the weights automatically via SGD (Stochastic Gradient Descent):
  - w starts as small random values
  - for T iterations: w_new ← w_old − η · ∇_w(E_n)
  - forward: compute S_j^(l) and x_j^(l); backward: compute δ_i^(l)
  - stop when the desired error rate is met
Support Vector Machine
- A supervised learning method
- Known as the maximum-margin classifier: find the max-margin separating hyperplane
SVM – hard margin
- Separating hyperplane: ⟨w, x⟩ − θ = 0
- Maximize the margin: argmax_{w,θ} 2/∥w∥ subject to y_n(⟨w, x_n⟩ − θ) ≥ 1
- Equivalently: argmin_{w,θ} (1/2)⟨w, w⟩ subject to y_n(⟨w, x_n⟩ − θ) ≥ 1
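A one-step derivation, not spelled out on the slide, of why the margin equals 2/∥w∥, which is what makes the two formulations equivalent:

```latex
% Take x_+ and x_- on the two supporting hyperplanes
% <w, x_+> - theta = +1 and <w, x_-> - theta = -1.
% Their separation along the unit normal w/||w|| is the margin:
\[
\frac{\langle w,\, x_+ - x_- \rangle}{\lVert w \rVert}
  = \frac{(+1) - (-1)}{\lVert w \rVert}
  = \frac{2}{\lVert w \rVert},
\]
% so maximizing 2/||w|| is the same as minimizing (1/2)<w, w>.
```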
Quadratic programming
- Standard QP form: argmin_v (1/2) Σ_i Σ_j a_ij v_i v_j + Σ_i b_i v_i subject to Σ_i r_ki v_i ≥ q_k; solved as V* = quadprog(A, b, R, q)
- Let V = [θ, w1, w2, ..., wD] and adapt the hard-margin problem for quadratic programming:
  - objective: (1/2) Σ_{d=1}^{D} w_d²
  - constraints: (−y_n) θ + Σ_{d=1}^{D} y_n (x_n)_d w_d ≥ 1
- Find A, b, R, q and put them into the QP solver
Adaptation
- V = [θ, w1, w2, ..., wD] = [v_0, v_1, v_2, ..., v_D]
- A, of size (1+D)×(1+D): a_00 = 0; a_0j = 0 and a_i0 = 0 for i, j ≠ 0; for i, j ≠ 0, a_ij = 1 if i = j, else 0
- b, of size (1+D)×1: b_i = 0 for all i
- R, of size N×(1+D): r_n0 = −y_n; r_nd = y_n (x_n)_d for d > 0
- q, of size N×1: q_n = 1
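A Python sketch of this construction. The slide's quadprog(A, b, R, q) is MATLAB-style; cvxopt is assumed here as the solver, and since cvxopt minimizes (1/2)v'Pv + q'v subject to Gv ≤ h, the ≥ constraint Rv ≥ q is passed as (−R)v ≤ −q.

```python
import numpy as np
from cvxopt import matrix, solvers

def hard_margin_svm(X, y):
    """Solve the slide's hard-margin QP for V = [theta, w_1, ..., w_D].
    X is N x D, y is a float array of +/-1 labels."""
    N, D = X.shape
    A = np.zeros((1 + D, 1 + D))                  # quadratic term: (1/2) sum_d w_d^2
    A[1:, 1:] = np.eye(D)                         # theta (v_0) has no quadratic cost
    b = np.zeros(1 + D)                           # no linear term
    R = np.hstack([-y[:, None], y[:, None] * X])  # r_n0 = -y_n, r_nd = y_n (x_n)_d
    q = np.ones(N)                                # q_n = 1
    # If the solver objects to the singular A, add a tiny ridge to its diagonal.
    sol = solvers.qp(matrix(A), matrix(b), matrix(-R), matrix(-q))
    v = np.array(sol['x']).ravel()
    return v[1:], v[0]                            # w*, theta*
```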
SVM – soft margin
- Allow some training errors via slack variables ξ_n
- Tradeoff parameter c:
  - large c: thinner margin, errors penalized heavily
  - small c: thicker margin, errors tolerated
- argmin_{w,θ} (1/2)⟨w, w⟩ + c Σ_n ξ_n subject to y_n(⟨w, x_n⟩ − θ) ≥ 1 − ξ_n and ξ_n ≥ 0
Adaptation
- V = [θ, w1, w2, ..., wD, ξ1, ξ2, ..., ξN]
- A: (1+D+N)×(1+D+N); b: (1+D+N)×1
- R: (2N)×(1+D+N); q: (2N)×1 (N margin constraints plus N constraints ξ_n ≥ 0)
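Extending the previous sketch to the soft-margin case (same assumed cvxopt-style solver), the matrices grow exactly as the sizes on this slide indicate:

```python
import numpy as np

def soft_margin_qp(X, y, c):
    """Build the QP matrices for V = [theta, w_1..w_D, xi_1..xi_N]."""
    N, D = X.shape
    n = 1 + D + N
    A = np.zeros((n, n))                          # (1+D+N) x (1+D+N)
    A[1:1 + D, 1:1 + D] = np.eye(D)               # quadratic cost on w only
    b = np.concatenate([np.zeros(1 + D), c * np.ones(N)])  # linear cost: c * sum_n xi_n
    # First N rows: y_n(<w, x_n> - theta) + xi_n >= 1, the margin with slack.
    R_margin = np.hstack([-y[:, None], y[:, None] * X, np.eye(N)])
    # Last N rows: xi_n >= 0.
    R_slack = np.hstack([np.zeros((N, 1 + D)), np.eye(N)])
    R = np.vstack([R_margin, R_slack])            # (2N) x (1+D+N)
    q = np.concatenate([np.ones(N), np.zeros(N)]) # (2N) x 1
    return A, b, R, q
```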
Primal form and Dual form
- Primal: argmin_{w,θ} (1/2)⟨w, w⟩ + c Σ_n ξ_n subject to y_n(⟨w, x_n⟩ − θ) ≥ 1 − ξ_n, ξ_n ≥ 0; variables: 1+D+N, constraints: 2N
- Dual: argmin_α (1/2) Σ_n Σ_m α_n y_n α_m y_m ⟨x_n, x_m⟩ − Σ_n α_n subject to 0 ≤ α_n ≤ C and Σ_n y_n α_n = 0; variables: N, constraints: 2N+1
Dual form SVM
- Find the optimal α*, then use α* to solve for w* and θ
- α_n = 0: x_n is classified correctly or lies on the margin
- 0 < α_n < C: x_n lies on the margin (free support vector)
- α_n = C: x_n is classified wrongly or lies on the margin
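A small sketch of the second step, assuming a linear kernel: w* follows from the stationarity condition w = Σ_n α_n y_n x_n, and any free support vector (0 < α_n < C), sitting exactly on the margin, pins down θ.

```python
import numpy as np

def recover_primal(alpha, X, y, C, tol=1e-8):
    """Recover w* and theta from the optimal dual alphas (linear kernel)."""
    w = (alpha * y) @ X                              # w* = sum_n alpha_n y_n x_n
    free = np.flatnonzero((alpha > tol) & (alpha < C - tol))
    s = free[0]                                      # any free SV lies on the margin
    theta = X[s] @ w - y[s]                          # from y_s(<w, x_s> - theta) = 1
    return w, theta
```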
Nonlinear SVM
- Nonlinear mapping Φ: X → Φ(X), e.g. {(x)_1, (x)_2} ∈ R² → {1, (x)_1, (x)_2, (x)_1², (x)_2², (x)_1(x)_2} ∈ R⁶
- In the dual: argmin_α (1/2) Σ_n Σ_m α_n y_n α_m y_m ⟨Φ(x_n), Φ(x_m)⟩ − Σ_n α_n subject to 0 ≤ α_n ≤ C, Σ_n y_n α_n = 0
- Need the kernel trick: replace ⟨Φ(x_n), Φ(x_m)⟩ with a kernel computed directly, e.g. (1 + ⟨x_n, x_m⟩)²
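A sketch of the kernel trick for the degree-2 polynomial kernel on the slide: (1 + ⟨x_n, x_m⟩)² equals ⟨Φ(x_n), Φ(x_m)⟩ for the quadratic mapping shown (up to constant scaling of the cross terms), without ever forming Φ explicitly.

```python
import numpy as np

def poly2_kernel(X1, X2):
    """Degree-2 polynomial kernel K(x_n, x_m) = (1 + <x_n, x_m>)^2,
    computed for all pairs of rows at once."""
    return (1.0 + X1 @ X2.T) ** 2

# Gram matrix that replaces <x_n, x_m> in the dual objective:
# K = poly2_kernel(X, X)
```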
Experiments
- Pre-processing: automated parsers process the raw TCP/IP dump data into machine-readable form
- Training: 7312 training records (different types of attacks and normal data), each with 41 features
- Testing: 6980 testing records to evaluate the classifiers
- Support Vector Machine: RBF kernel, C = 1000, 204 support vectors (29 free); accuracy 99.5%; time spent 17.77 sec
- Neural Network: 3-layer 41-40-40-1 FFNN, scaled conjugate gradient, desired error rate = 0.001; accuracy 99.25%; time spent 18 min
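A minimal sketch of the reported SVM configuration (RBF kernel, C = 1000) using scikit-learn, which is an assumption: the paper used its own SVM tooling. Random placeholder data stands in for the 41-feature KDD records and their labels.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = rng.normal(size=(7312, 41))      # 7312 training records, 41 features
y_train = rng.integers(0, 2, size=7312)    # 1 = attack, 0 = normal (placeholder)
X_test = rng.normal(size=(6980, 41))       # 6980 testing records
y_test = rng.integers(0, 2, size=6980)

clf = SVC(kernel="rbf", C=1000)            # kernel and C as reported on the slide
clf.fit(X_train, y_train)
print("accuracy:", clf.score(X_test, y_test))
print("support vectors:", clf.n_support_.sum())
```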
Conclusion and Comments
- Speed: SVM training time is significantly shorter
- SVMs avoid the "curse of dimensionality" via the max-margin formulation
- Accuracy: both classifiers achieve high accuracy
- SVMs only perform binary classification, while IDS requires multi-class identification
- How to determine the features?