Data Mining Approaches for Intrusion Detection Wenke Lee and Salvatore J. Stolfo Computer Science Department Columbia University.

Overview
– Intrusion detection and computer security
– Current intrusion detection approaches
– Our proposed approach
– Data mining
– Classification models for intrusion detection
– Mining patterns from audit data
– System architecture
– Current status
– Research plans

Overview
– Current intrusion detection approaches and problems
– Our proposed approach
– Data mining
– Classification models for intrusion detection
– Mining patterns from audit data
– System architecture
– Current status
– Research plans

Intrusion Detection and Computer Security
– Computer security goals: confidentiality, integrity, and availability
– An intrusion is a set of actions aimed at compromising these security goals
– Intrusion prevention (authentication, encryption, etc.) alone is not sufficient
– Intrusion detection is needed

Intrusion Detection
Primary assumption: user and program activities can be monitored and modeled
Key elements:
– Resources to be protected
– Models of the “normal” or “legitimate” behavior on the resources
– Efficient methods that compare real-time activities against the models and report probable “intrusive” activities

[Figure: agent-based system architecture. Agents: Learning Agent, Base Detection Agent, Meta Detection Agent. Components: Audit Data Preprocessor, Inductive Learning Engine, (Base) Detection Engine, (Meta) Detection Engine, Decision Table, Decision Engine. Data flows: Audit Records, Activity Data, Detection Models, Rules, Evidence, Evidence from Other Agents, Final Assertion, Action/Report.]

[Figure: learning profiles from audit data. Raw tcpdump packet output is turned into Connection Records, and truss output (execve(“/usr/ucb/finger”, …, open(“/dev/zero …, mmap(…...) into a System call Sequence (execve, open, mmap, ...); each is passed to Learning to produce a Profile.]

Intrusion Detection
Two categories of techniques:
– Misuse detection: use patterns of well-known attacks to identify intrusions
– Anomaly detection: use deviation from normal usage patterns to identify intrusions

Current Intrusion Detection Approaches
Misuse detection:
– Record the specific patterns of intrusions
– Monitor current audit trails (event sequences) and match them against the patterns
– Report the matched events as intrusions
– Representation models: expert rules, colored Petri nets, and state transition diagrams

Current Intrusion Detection Approaches
Anomaly detection:
– Establishing the normal behavior profiles
– Observing and comparing current activities with the (normal) profiles
– Reporting significant deviations as intrusions
– Statistical measures as behavior profiles: ordinal and categorical (binary and linear)

Current Intrusion Detection Approaches
Main problems: manual and ad hoc
– Misuse detection:
  – Known intrusion patterns have to be hand-coded
  – Unable to detect any new intrusions (that have no matched patterns recorded in the system)
– Anomaly detection:
  – Selecting the right set of system features to be measured is ad hoc and based on experience
  – Unable to capture sequential interrelation between events

Our Proposed Approach A systematic framework to: –Build good models: select appropriate features of audit data to build intrusion detection models –Build better models: architect a hierarchical detector system that combines multiple detection models –Build updated models: dynamically update and deploy new detection system as needed

Our Proposed Approach Support for the feature selection and model construction process: –Apply data mining algorithms to find consistent inter- and intra- audit record (event) patterns –Use the features and time windows in the discovered patterns to build detection models –A support environment to semi-automate this process

Our Proposed Approach Combining multiple detection models: –Each (base) detector model monitors one aspect of the system –They can employ different techniques and be independent of each other –The learned (meta) detector combines evidence from a number of base detectors

Our Proposed Approach An intelligent agent-based architecture: –learning agents: continuously compute (learn) the detection models –detection agents: use the (updated) models to detect intrusions

Data Mining KDD (Knowledge Discovery in Database): –The process of identifying valid, useful and understandable patterns in data –Steps: understanding the application domain, data preparation, data mining, interpretation, and utilizing the discovered knowledge –Data mining: applying specific algorithms to extract patterns from data

Data Mining Relevant data mining algorithms: –Classification: maps a data item into one of several pre-defined categories –Link analysis: determines relations between fields in the database –Sequence analysis: models sequence patterns

Data Mining
Why is it applicable to intrusion detection?
– Normal and intrusive activities leave evidence in audit data
– From the data-centric point of view, intrusion detection is a data analysis process
– Successful applications in related domains, e.g., fraud detection, fault/alarm management

Building Classifiers for Intrusion Detection
Experiments in constructing classification models for anomaly detection
Two experiments:
– sendmail system call data
– network tcpdump data
Use a meta classifier to combine multiple classification models

Classification Models on sendmail The data: sequence of system calls made by sendmail. Classification models (rules): describe the “normal” patterns of the system call sequences. The rule set is the normal profile of sendmail Detection: calculate the deviation from the profile –large number/high scores of “violations” to the rules in a new trace suggests an exploit

Classification Models on sendmail
The sendmail data:
– Each trace has two columns: the process IDs and the system call numbers
– Normal traces: sendmail and sendmail daemon
– Abnormal traces: sunsendmailcap, syslog-local, syslog-remote, decode, sm5x and sm56a attacks

Classification Models on sendmail Data preprocessing: –Use sliding window to create sequence of consecutive system calls –Label the sequences to create training data:
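The sliding-window preprocessing can be sketched as follows. This is only an illustration, not the authors' code: the function names are hypothetical, and the labeling rule (a window is "normal" if it also occurs in the normal traces, otherwise "abnormal") is an assumption, since the labeling table from the slide is not part of this transcript.

```python
from typing import List, Set, Tuple

Window = Tuple[int, ...]


def sliding_windows(calls: List[int], n: int) -> List[Window]:
    """Return every run of n consecutive system call numbers in a trace."""
    if n <= 0:
        raise ValueError("n must be positive")
    return [tuple(calls[i:i + n]) for i in range(len(calls) - n + 1)]


def label_sequences(windows: List[Window],
                    normal_seqs: Set[Window]) -> List[Tuple[Window, str]]:
    """Attach a class label to each window.

    Assumption (not stated on the slide): a window is "normal" when it
    also appears among the windows taken from normal traces.
    """
    return [(w, "normal" if w in normal_seqs else "abnormal") for w in windows]


if __name__ == "__main__":
    normal = set(sliding_windows([4, 2, 66, 66, 4, 138], 3))
    for window, label in label_sequences(sliding_windows([4, 2, 66, 5, 5], 3), normal):
        print(window, label)
```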

Classification Models on sendmail Experiment 1 - learning patterns of normal sequences: –Each record: n consecutive system calls plus a class label, “normal” or “abnormal” –Training data: sequences from 80% of the normal traces plus some of the attack traces –Testing data: traces not used in training –Use RIPPER to learn specific rules for the minority classes

sendmail Experiment 1 Examples of output RIPPER rules: –if the 2nd system call is vtimes and the 7th is vtrace, then the sequence is “normal” –if the 6th system call is lseek and the 7th is sigvec, then the sequence is “normal” –… –if none of the above, then the sequence is “abnormal”

sendmail Experiment 1
Using the learned rules to analyze a new trace:
– label all sequences according to the rules
– define a region as l consecutive sequences
– define an “abnormal” region as having more “abnormal” sequences than normal ones
– calculate the percentage of “abnormal” regions
– the trace is “abnormal” if the percentage is above a threshold
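A minimal sketch of this region-based scoring, assuming the sequences have already been labeled by the rules. The function names and the `step` parameter (how far consecutive regions are shifted) are assumptions; the slide does not say whether regions overlap.

```python
from typing import List


def abnormal_region_percentage(labels: List[str], region_len: int,
                               step: int = 1) -> float:
    """Percentage of regions that hold more "abnormal" than "normal" labels.

    A region is region_len consecutive sequence labels. step=1 gives
    overlapping regions; step=region_len gives disjoint ones (assumption).
    """
    starts = range(0, len(labels) - region_len + 1, step)
    regions = [labels[s:s + region_len] for s in starts]
    if not regions:
        return 0.0
    abnormal = sum(1 for r in regions if r.count("abnormal") > r.count("normal"))
    return 100.0 * abnormal / len(regions)


def is_abnormal_trace(labels: List[str], region_len: int,
                      threshold_pct: float, step: int = 1) -> bool:
    """Flag the trace when the abnormal-region percentage exceeds the threshold."""
    return abnormal_region_percentage(labels, region_len, step) > threshold_pct
```

The threshold is left as a parameter because the slides do not give a value for it.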

sendmail Experiment 1
Hypothesis: need specific rules of “normal” sequences to detect “unknown/new” intrusions
Some results using various normal vs. abnormal distributions:
– Experiment A: 46% normal, length 11
– Experiment B: 46% normal, length 7
– Experiment C: 54% normal, length 11
– Experiment D: 54% normal, length 7

sendmail Experiment 1
All 4 experiments:
– Training data includes sequences from intrusion traces in bold and italic, and sequences from 80% of the normal sendmail traces
– Percentage of abnormal “regions” of each trace (shown in the table) is used as the intrusion indicator
– The output rule sets contain ~250 rules, each with 2 or 3 attribute tests. This compares with the total of ~1,500 different sequences.
Experiments A and B generate rules that characterize “normal” sequences of length 11 and 7 respectively
Experiments C and D generate rules that characterize “abnormal” sequences of length 11 and 7 respectively

sendmail Experiment
Anomaly detectors A and B perform better than misuse detectors C and D.

Classification Models on sendmail Experiment 2 - learning to predict normal system call: –Each record: n-1 consecutive system calls plus a class label, the nth or the middle system call –Training data: sequences from 80% of the normal traces (no abnormal traces) –Testing data: traces not used in training –Use RIPPER to learn rules

sendmail Experiment 2 Examples of output RIPPER rules: –if the 3rd system call is lstat and the 4th is write, then the 7th is stat –if the 1st system call is sigblock and the 4th is bind, then the 7th is setsockopt –… –if none of the above, then the 7th is open

sendmail Experiment 2 Using the learned rules to analyze a new trace: –predict system calls according to the rules –if a rule is violated, the “violation” score is increased by 100 times the accuracy of the rule –the trace is “abnormal” if the violation score is above a threshold
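The violation score described above can be sketched like this. The `Rule` structure is hypothetical (RIPPER's real output format is different), positions are 0-based, and rules are assumed to be tried in order with an empty-condition default rule last, mirroring the "if none of the above" rule in the examples.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence


@dataclass
class Rule:
    conditions: Dict[int, str]  # 0-based position -> required system call
    predicted: str              # call the rule predicts at the target position
    accuracy: float             # accuracy of the rule, in [0, 1]


def first_matching_rule(rules: List[Rule], seq: Sequence[str]) -> Optional[Rule]:
    """Return the first rule whose conditions all hold for this sequence."""
    for rule in rules:
        if all(seq[pos] == call for pos, call in rule.conditions.items()):
            return rule
    return None


def violation_score(rules: List[Rule], sequences: List[Sequence[str]],
                    target_pos: int) -> float:
    """Add 100 * accuracy for every sequence whose actual call at target_pos
    differs from the prediction of the rule that applies to it."""
    score = 0.0
    for seq in sequences:
        rule = first_matching_rule(rules, seq)
        if rule is not None and seq[target_pos] != rule.predicted:
            score += 100.0 * rule.accuracy
    return score
```

Weighting by accuracy means that breaking a highly reliable rule counts for more than breaking a weak one; the trace is then flagged when the total exceeds a chosen threshold.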

sendmail Experiment 2 Some results: –Experiment A: predict the 11th system call –Experiment B: predict the middle system call in a sequence of length 7 –Experiment C: predict the middle system call in a sequence of length 11 –Experiment D: predict the 7th system call

sendmail Experiment 2
All 4 experiments:
– Training data includes only the sequences from 80% of the normal sendmail traces
– Output rules predict what should be the “normal” nth or the middle system call
– Score of rule “violation” (mismatch) of each trace (shown in the table) is used as the intrusion indicator
– The output rule sets contain ~250 rules, each with 2 or 3 attribute tests. This compares with the total of ~1,500 different sequences.

sendmail Experiment
The 11th (A) and 4th (B) system calls are more predictable

Classification Models on sendmail Lessons learned: –Normal behavior can be established and used to detect anomalous usage –Need to collect near “complete” normal data in order to build the “normal” model –But how do we know when to stop collecting? –Need tools to guide the audit data gathering process

Classification Models on tcpdump
The tcpdump data (part of a public data visualization contest):
– Packets of incoming, outgoing, and internal broadcast traffic
– One trace of normal network traffic
– Three traces of network intrusions

Classification Models on tcpdump
Data preprocessing:
– Extract the “connection” level features:
  – Record connection attempts
  – Monitor data packets and count: # of bytes in each direction, resent rate, hole rate, etc.
  – Watch how the connection is terminated

Classification Models on tcpdump
Data preprocessing:
– Each record has:
  – start time and duration
  – participating hosts and ports (applications)
  – statistics (e.g., # of bytes)
  – flag: “normal” or a connection/termination error
  – protocol: TCP or UDP
– Divide connections into 3 types: incoming, outgoing, and inter-LAN

Classification Models on tcpdump
Building a classifier for each type of connection:
– Use the destination service (port) as the class label
– Training data: 80% of the normal connections
– Testing data: 20% of the normal connections and connections in the 3 intrusion traces
– Apply RIPPER to learn rules

Classification Models on tcpdump The output RIPPER rules describe the “normal” characteristics of the destination services. The rule set is the profile of the normal network traffic. Using the rules to analyze tcpdump traces: –Examine each connection record according to the rules –Calculate the percentage of misclassification (violation of a rule). This percentage is the deviation from the profile.
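As a rough illustration of the deviation measure, the misclassification percentage can be computed as below. The `predict_service` callable stands in for the learned rule set and is hypothetical; records are assumed to be (features, actual destination service) pairs.

```python
from typing import Callable, Dict, List, Tuple

ConnRecord = Tuple[Dict[str, float], str]  # (features, actual destination service)


def misclassification_rate(records: List[ConnRecord],
                           predict_service: Callable[[Dict[str, float]], str]) -> float:
    """Percentage of connection records whose predicted destination service
    differs from the actual one, i.e. the deviation from the normal profile."""
    if not records:
        return 0.0
    wrong = sum(1 for features, service in records
                if predict_service(features) != service)
    return 100.0 * wrong / len(records)
```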

Classification Models on tcpdump
Results: misclassification rate for each type of connection [table not captured in this transcript]
– This model is not very effective in detecting intrusions

Classification Models on tcpdump
Adding temporal features for better models:
– Examine all connections in the past n seconds, and count:
  – the number of connection errors, all other errors, connections to system services, connections to user applications, and connections to the same service as the current connection
  – the average duration and data bytes of all connections, and the same averages for connections to the same service
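A small sketch of how such temporal features might be computed for one connection. The `Conn` record and the feature names are assumptions, and only a subset of the listed counts is shown; an error is taken to be any flag other than "normal", following the record description earlier in the slides.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Conn:
    start: float     # start time in seconds
    service: str     # destination service
    flag: str        # "normal" or an error flag
    duration: float  # seconds
    nbytes: int      # data bytes


def _mean(values: List[float]) -> float:
    return sum(values) / len(values) if values else 0.0


def temporal_features(history: List[Conn], current: Conn,
                      window: float) -> Dict[str, float]:
    """Summarize the connections that started in the `window` seconds
    before the current connection."""
    recent = [c for c in history
              if current.start - window <= c.start < current.start]
    same = [c for c in recent if c.service == current.service]
    return {
        "n_conn": len(recent),
        "n_errors": sum(1 for c in recent if c.flag != "normal"),
        "n_same_service": len(same),
        "avg_duration": _mean([c.duration for c in recent]),
        "avg_bytes": _mean([float(c.nbytes) for c in recent]),
        "avg_duration_same": _mean([c.duration for c in same]),
        "avg_bytes_same": _mean([float(c.nbytes) for c in same]),
    }
```

These values would be appended to each connection record before the classifier is retrained.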

Classification Models on tcpdump
Results of adding the temporal features, with a time window of 30 seconds [table not captured in this transcript]
– Adding temporal statistical features improves the effectiveness of the detection models

How do we obtain the optimal time window length? Effects of time window length on misclassification rate

Classification Models on tcpdump Lessons learned: –Data preprocessing requires extensive domain knowledge –Adding temporal features improves classification accuracy –Need tools to guide (temporal) feature selection

Building Classifiers for Intrusion Detection Meta classifier that combines evidence from multiple detection models: –Build base classifiers that each model one aspect of the system –The meta learning task: each record has a collection of evidence from base classifiers, and a class label “normal” or “abnormal” on the state of the system –Apply a learning algorithm to produce the meta classifier

Mining Patterns from Audit Data Association rules: describe multi-feature (attribute) correlation from a database X => Y, confidence, support: –X and Y are subsets of the attribute values in a record –support is the percentage of records that contain X and Y –confidence is support(X+Y)/support(X)
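The two measures can be computed directly from their definitions. In this sketch each record is modeled as a set of "attribute=value" strings, which is an assumed encoding; confidence is computed from raw counts, which is equal to support(X+Y)/support(X).

```python
from typing import List, Set

Record = Set[str]  # e.g. {"cmd=trn", "arg=rec.humor"} (assumed encoding)


def count_containing(records: List[Record], items: Set[str]) -> int:
    """Number of records that contain every item in `items`."""
    return sum(1 for r in records if items <= r)


def support(records: List[Record], x: Set[str], y: Set[str]) -> float:
    """Fraction of all records that contain both X and Y."""
    if not records:
        return 0.0
    return count_containing(records, x | y) / len(records)


def confidence(records: List[Record], x: Set[str], y: Set[str]) -> float:
    """Fraction of the records containing X that also contain Y."""
    n_x = count_containing(records, x)
    return count_containing(records, x | y) / n_x if n_x else 0.0
```

For example, with 30 records of which 10 contain X and 3 of those also contain Y, the rule X => Y has confidence 0.3 and support 0.1.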

Association Rules
Motivations:
– Audit data can be easily formatted into a database table
– Program executions and user activities have frequent correlation among system features
– Incremental updating of the rule set is easy
An example from the .sh_history file:
– trn => rec.humor, [0.3, 0.1]
– Meaning: 30% of the time when using trn, the user is reading rec.humor; and reading this newsgroup constitutes 10% of all sh commands

Mining Patterns from Audit Data Frequent Episodes: frequent events occurring within a time window X => Y, confidence, support, window: –X and Y are subsets of the attribute values in a record –support is the percentage of (sliding) windows that contain X and Y –confidence is support(X+Y)/support(X)

Frequent Episodes
Motivation:
– Sequence information needs to be included in a detection model
An example from a department’s web log:
– home, research => theory, [0.2, 0.05], [30]
– Meaning: 20% of the time, after the home and research pages are visited (in that order), the theory page is then visited within 30 seconds from when home is visited; and visiting these three pages constitutes 5% of all visits to the web site
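A simplified sketch of episode support and confidence over sliding windows. It assumes events are (timestamp, name) pairs sorted by time and, as a simplification that the slides do not specify, opens one window at each event; the actual frequent-episodes algorithm may define windows differently.

```python
from typing import List, Tuple

Event = Tuple[float, str]  # (timestamp in seconds, event name)


def contains_in_order(names: List[str], pattern: List[str]) -> bool:
    """True if `pattern` occurs in `names` as an ordered subsequence."""
    it = iter(names)
    return all(p in it for p in pattern)  # each `in` consumes the iterator


def windows(events: List[Event], width: float) -> List[List[str]]:
    """One window per event (assumption), covering [t0, t0 + width]."""
    return [[name for t, name in events if t0 <= t <= t0 + width]
            for t0, _ in events]


def episode_stats(events: List[Event], x: List[str], y: List[str],
                  width: float) -> Tuple[float, float]:
    """Return (support, confidence) of the episode rule X => Y."""
    wins = windows(events, width)
    n_x = sum(1 for w in wins if contains_in_order(w, x))
    n_xy = sum(1 for w in wins if contains_in_order(w, x + y))
    sup = n_xy / len(wins) if wins else 0.0
    conf = n_xy / n_x if n_x else 0.0
    return sup, conf
```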

Using the Mined Patterns Guide the audit data gathering process: –Run a program under different settings –For each run, calculate the association rules and frequent episodes from its audit data –Merge them into an aggregate rule set –Stop gathering audit data when no rules can be added from a new run
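The stopping criterion can be sketched as a simple loop. Here `mine_rules` is a hypothetical callable that returns the association rules and frequent episodes of one run as hashable values, and merging is simplified to set union; a real merge step might also combine similar rules.

```python
from typing import Callable, Hashable, Iterable, Set, Tuple, TypeVar

Run = TypeVar("Run")


def gather_until_stable(runs: Iterable[Run],
                        mine_rules: Callable[[Run], Iterable[Hashable]]
                        ) -> Tuple[Set[Hashable], int]:
    """Merge mined rules run by run and stop at the first run that adds
    nothing new. Returns the aggregate rule set and the number of runs used."""
    aggregate: Set[Hashable] = set()
    used = 0
    for audit_data in runs:
        used += 1
        new_rules = set(mine_rules(audit_data)) - aggregate
        if not new_rules:
            break  # this run contributed no new rule: stop gathering
        aggregate |= new_rules
    return aggregate, used
```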

Using the Mined Patterns Support the feature selection process: –System features in the association rules and frequent episodes should be included in the classification models –Time window and features in the frequent episodes suggest additional temporal features should be considered

Using the Mined Patterns
Alternatives and complements to classification models:
– Examine a new audit trace and calculate “violation” scores: missing rules, new rules, deviations in confidence and support, etc.
– Study the “unique” patterns in the trace of a suspected attack to further pinpoint the cause of the intrusion alarms

Using the Mined Patterns tcpdump data revisited: –How to select the right time window? –Hypothesis: the appropriate window should contain stable sets of frequent episodes –Experiments: mine frequent episodes using different window lengths, and count the number of episodes

The optimal time window length for classification has a stable # of episodes
Results on time window length vs. # of episodes: [plot not captured in this transcript]

Using the Mined Patterns
tcpdump data revisited:
– “unique” patterns in intrusion data may provide some insights
– intrusion 3:
  – one of the unique frequent episode rules: dst_srv=“auth” => flag=“unwanted_syn_ack”, [0.82, 0.1], [30]
  – one of the unique association rules: src_srv=“smtp” => duration=0, flag=“unwanted_syn_ack”, dst_srv=“user_apps”, [1.0, 0.38]

Architecture Support
– Dedicated learning agents are responsible for building detection models
– Base and meta detection agents are equipped with learned models
– Detection agents provide new audit data to the learning agents
– Learning agents dispatch updated models
– JAM (Java Agents for Meta-learning) on fraud detection is the model architecture

Current Status
Accomplished:
– Experiments on sendmail and tcpdump data
– Implementation of the association rules and the frequent episodes algorithms. Testing on medium-size data sets (30,000+ records, each with 6+ fields) has been completed.
– Design and 35% of the implementation of a support environment for mining patterns from audit data
– High-level system architecture design

Research Plans To be completed within the next year and a half: –Finish the implementation of the support environment for mining patterns –Experiments on using the algorithms and the environment to gather audit data and select features –Experiments on building meta detection models

Research Plans To be completed within the next year and a half: –Detailed architecture design –Implementing a prototype intrusion detection system –Final evaluation using “standard/public” data sets

Conclusions
– We demonstrated the effectiveness of classification models for intrusion detection
– We propose to use systematic data mining approaches to select the relevant system features to build better detection models
– We propose to use a (meta) learning agent-based architecture to combine multiple models, and to continuously update the detection models