Data Mining for Cyber Threat Analysis Vipin Kumar Army High Performance Computing Research Center Department of Computer Science University of Minnesota.


1 Data Mining for Cyber Threat Analysis
Vipin Kumar, Army High Performance Computing Research Center, Department of Computer Science, University of Minnesota
Project Participants: A. Lazarevic, V. Kumar, J. Srivastava, H. Ramanni, L. Ertoz, M. Joshi, E. Eilertson, S. Ketkar

2 Mining Large Data Sets - Motivation
- Examples: computational simulations, information assurance and network intrusion, sensor networks, homeland defense
- There is often information "hidden" in the data that is not readily evident.
- Human analysts may take weeks to discover useful information.
- Much of the data is never analyzed at all.
[Figures: Computational Simulations, Network Intrusion Detection, Sensor Networks]

3 Data Mining for Homeland Defense
- "Data mining and data warehousing are part of a much larger FBI plan to … discover patterns and relationships that indicate criminal activity" (network intrusions, cyber attacks, terroristic calls, …) - Federal Computer Week, June 3, 2002
- FBI Director Robert Mueller: "New information technology is critical to conducting business in a different way, critical to analyzing and sharing information on a real-time basis"

4 Homeland Defense: Key Issues
- Information fusion from diverse data sources, including intelligence agencies, law enforcement, profiles, …
- Data mining on this information base to uncover latent models and patterns
- Visualization and display tools for understanding the relationships between persons, events, and patterns of behavior
[Diagram: Cultural Data, Intelligence Data, Law Enforcement Data → Information Fusion → Event Recognition, Association Analysis → Threat Predictor, Threat Visualizer]

5 Information Assurance: Introduction
- As the cost of information processing and Internet accessibility falls, more and more organizations are becoming vulnerable to potential cyber threats
- "unlawful attacks and threats of attack against computers, networks, and the information stored therein when done to intimidate or coerce a government or its people" - D. Denning

6 Information Assurance: Intrusion Detection
- Intrusion detection: detecting a set of actions that compromise the integrity, confidentiality, or availability of information resources
  - Viruses and Internet worms
  - Theft of classified information from DOD computers
- Problem of identifying individuals
  - who are using computers without authorization
  - who have legitimate access but are abusing their privileges
- Intrusion Detection System (IDS)
  - combination of software and hardware that attempts to perform intrusion detection
  - raises an alarm when a possible intrusion happens

7 Data Mining on Intelligence Databases
- Purpose: develop methods to identify potential threats by mining intelligence databases
- Example: Forecasting Militarized Interstate Disputes (FORMIDs)
  - Data: social, political, economic, and geographical information for pairs of countries (ratio of military capability, democracy index, level of trade, distance)
  - Predict: the likelihood of militarized interstate disputes (MIDs)
- Overall objective: predict likely instabilities involving pairs of countries
- Collaborators: Sean O'Brien, Center for Army Analysis (CAA); Kosmo Tatalias (NCS)

8 Data Mining in the Commercial World
- Classification / predictive modeling (direct marketing, fraud detection)
- Clustering (market segmentation)
- Association patterns (marketing / sales promotions)
Given its success in commercial applications, data mining holds great promise for analyzing large data sets.
[Figure: decision tree splitting on Employed, # of years in school, Married]

9 Key Technical Challenges
- Large data size: gigabytes of data are common
- High dimensionality: thousands of dimensions are possible
- Spatial/temporal nature of the data: data points that are close in time and space are highly related
- Skewed class distribution: interesting events are very rare - looking for the "needle in a haystack"
- Data fusion and data preprocessing: data from multiple sources; data of multiple types (text, images, voice, …); data cleaning - missing-value imputation, scaling, mismatch handling
"Mining needle in a haystack. So much hay and so little time"

10 Intrusion Detection Research at AHPCRC
- Misuse detection - predictive models
  - Mining needle in a haystack - models must be able to handle skewed class distributions, i.e., the class of interest is much smaller than the other classes
  - Learning from data streams - intrusions are sequences of events
- Anomaly and outlier detection
  - Able to detect novel attacks through outlier detection schemes
  - Detect deviations from "normal" behavior as anomalies
  - Construction of useful features that may be used in data mining
  - Modifying signature-based intrusion detection systems (SNORT) to incorporate anomaly detection algorithms
- Summer Institute projects
  - Implementing anomaly/outlier detection algorithms
  - Investigating algorithms for classification of rare classes
  - Visualization tool for monitoring network traffic and suspicious network behavior

11 Learning from Rare Classes
- Key issue: examples from rare classes get missed in standard data mining analysis
- Over-sampling the small class or under-sampling the large class
- PNrule and related work [Joshi, Agarwal, Kumar, SIAM 2001, SIGMOD 2001]
- RareBoost [Joshi, Agarwal, Kumar, ICDM 2001, KDD 2002]
- SMOTEBoost [Lazarevic et al., in review]
- Classification based on associations - add frequent itemsets as "meta-features" to the original data set

12 PN-rule Learning and Related Algorithms*
- P-phase:
  - cover most of the positive examples with high support
  - seek good recall
- N-phase:
  - remove false positives from the examples covered in the P-phase
  - N-rules give high accuracy and significant support
- Existing techniques can learn erroneous small signatures for the absence of C; PNrule can learn strong signatures for the presence of NC in the N-phase
[Figure: regions of class C vs. NC]
* SIAM 2001, SIGMOD 2001, ICDM 2001, KDD 2002

13 SMOTE and SMOTEBoost
- SMOTE (Synthetic Minority Oversampling Technique) generates artificial examples from the minority (rare) class along the line segments joining minority-class neighbors
- Generalization of the over-sampling technique
- Combining SMOTE and boosting further improves the prediction performance on rare classes
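The interpolation step described above can be sketched as follows. This is a minimal, hypothetical illustration of the SMOTE idea only, not the authors' implementation; the function name, brute-force neighbor search, and parameter choices are assumptions.

```python
import random

def smote(minority, k=3, n_new=100, seed=0):
    """Generate synthetic minority-class examples by interpolating between
    a minority point and one of its k nearest minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # Brute-force k nearest neighbors of x within the minority class
        neighbors = sorted(
            (p for p in minority if p is not x),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)),
        )[:k]
        nb = rng.choice(neighbors)
        gap = rng.random()  # interpolation factor in [0, 1)
        # New point lies on the segment between x and the chosen neighbor
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic
```

Because each synthetic point is a convex combination of two minority examples, it always lies inside the region spanned by the minority class.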

14 SMOTE and SMOTEBoost Results
Experimental results on the modified KDDCup 1999 data set

15 Classification Based on Associations
- Current approaches use confidence-like measures to select the best rules to be added as features into the classifiers
  - This may work well only if each class is well represented in the data set
  - For rare-class problems, some high-recall itemsets could be potentially useful, as long as their precision is not too low
- Our approach:
  - Apply a frequent-itemset generation algorithm to each class
  - Select itemsets to be added as features based on precision, recall, and F-measure
  - Apply a classification algorithm, e.g., RIPPER, to the new data set
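The itemset-selection step above amounts to computing precision, recall, and F-measure per candidate itemset and ranking. A minimal sketch, assuming simple pre-computed support counts (the function name and input format are hypothetical, not from the slides):

```python
def f_measure_rank(itemset_stats, beta=1.0):
    """Rank candidate itemsets as class features by F-measure.
    itemset_stats maps itemset -> (support_in_class, support_total, class_size)."""
    scored = []
    for itemset, (in_class, total, class_size) in itemset_stats.items():
        precision = in_class / total if total else 0.0
        recall = in_class / class_size if class_size else 0.0
        if precision + recall == 0:
            f = 0.0
        else:
            # F-beta measure; beta=1 weights precision and recall equally
            f = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
        scored.append((f, itemset))
    return [i for f, i in sorted(scored, reverse=True)]
```

Ranking by F-measure, rather than precision (confidence) alone, keeps high-recall itemsets whose precision is moderate - exactly the trade-off the slide argues matters for rare classes.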

16 Experimental Results (on modified KDD Cup 1999 data)
For rare classes, rules ordered according to F-measure produce the best results.
[Table: Original RIPPER vs. RIPPER with high-precision rules, high-recall rules, and high-F-measure rules]

17 Anomaly and Outlier Detection
- Main assumptions
  - All anomalous activities need closer inspection
  - Determine a "normal activity profile" and raise an alarm when the state differs from the "normal profile"
  - An expert analyst examines suspicious activity to make the final determination of whether the activity is indeed an intrusion
- Drawbacks
  - Possibly large number of false alarms, and attacks may go unrecognized
- Supervised (with access to normal data) vs. unsupervised (with NO access to normal data) determination of "normal behavior"
[Figure: normal profile vs. anomalous activities, false alarms, missed attacks]
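The "normal profile plus deviation threshold" idea can be sketched in its simplest form with a mean/standard-deviation profile; this is an illustrative toy, not the AHPCRC detector, and the three-sigma threshold is an assumed default.

```python
import statistics

def fit_profile(normal_values):
    """Summarize 'normal' behavior by mean and standard deviation."""
    return statistics.mean(normal_values), statistics.stdev(normal_values)

def is_anomalous(value, profile, threshold=3.0):
    """Flag values more than `threshold` standard deviations from normal."""
    mean, std = profile
    return abs(value - mean) > threshold * std
```

Every flagged value would still go to an analyst for the final determination; a loose threshold trades false alarms against missed attacks, which is the drawback the slide notes.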

18 Outlier Detection
- An outlier is defined as a data point that is very different from the rest of the data (the "normal data") based on some measure of similarity
- Outlier detection approaches:
  - Statistics-based approaches
  - Distance-based techniques
  - Clustering-based approaches
  - Density-based schemes

19 Distance- and Density-Based Schemes
- Distance-based approaches (NN approach) - outliers are points that do not have enough neighbors
- Density-based approaches (LOF approach) - find outliers based on the densities of local neighborhoods
  - The concept of locality becomes difficult to define due to data sparsity in high-dimensional space
- Clustering-based approaches - define outliers as points that do not lie in clusters
  - Implicitly define outliers as background noise
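The distance-based (NN) idea above can be sketched as scoring each point by the distance to its k-th nearest neighbor; points without enough close neighbors get large scores. A hypothetical brute-force illustration (names and the choice of score are assumptions):

```python
def knn_outlier_scores(points, k=2):
    """Score each point by its distance to its k-th nearest neighbor;
    large scores indicate points without enough close neighbors (outliers)."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    scores = []
    for p in points:
        d = sorted(dist2(p, q) for q in points if q is not p)
        scores.append(d[k - 1] ** 0.5)  # distance to k-th nearest neighbor
    return scores
```

An LOF-style density scheme would instead compare each point's local density to that of its neighbors, which behaves better when clusters have very different densities.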

20 Outlier Detection Results (on DARPA'98 data)
Detection rate (false alarm rate fixed at 1%)
[Figure: score values assigned to network connections from bursty attacks]

21 Modifying SNORT
- SNORT contains the simple SPADE (Statistical Packet Anomaly Detection Engine)
  - SPADE only compares the statistics of packets
- Our approach
  - Integrate our implemented outlier detection schemes into SNORT, since it is open source
  - Improve the detection of novel intrusions and suspicious behavior by using sophisticated outlier detection schemes

22 SNN Clustering - Finding Patterns in Noisy Data
- Finds clusters of arbitrary shapes, sizes, and densities
- Handles high-dimensional data
- Density invariant
- Built-in noise removal
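The core of SNN clustering is replacing raw distance with a shared-nearest-neighbor similarity: two points are similar in proportion to how many of their k nearest neighbors they have in common. A minimal sketch of that similarity (a toy illustration with brute-force neighbor lists, not the AHPCRC implementation):

```python
def snn_similarity(points, k=3):
    """Shared-nearest-neighbor similarity matrix: sim[i][j] counts the
    neighbors points i and j share, provided each is in the other's k-NN list."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    n = len(points)
    knn = []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: dist2(points[i], points[j]))
        knn.append(set(order[:k]))
    sim = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Similarity counts only when the neighbor relation is mutual
            if i != j and j in knn[i] and i in knn[j]:
                sim[i][j] = len(knn[i] & knn[j])
    return sim
```

Because the similarity depends on neighbor ranks rather than absolute distances, it is largely insensitive to density variation, and noise points end up with near-zero similarity to everything, which is why noise removal comes built in.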

23 Topics from the Los Angeles Times (Jan. 1989)
3204 articles, words (LA Times, January 1989)
- afghanistan embassi guerrilla kabul moscow rebel soviet troop ussr withdraw
- chancellor chemic export german germani kadafi kohl libya libyan plant poison weapon west
- ahead ball basket beate brea chri coach coache consecut el final finish foul fourth free game grab half halftim hill host jef lead league led left los lost minut miss outscor overal plai player pointer quarter ralli rank rebound remain roundup scor score scorer season shot steal straight streak team third throw ti tim trail victori win won
- ab bengal bowl cincinnati craig dalla denver esiason field football francisco giant jerri joe miami minnesota montana nfl oppon pass pittsburgh quarterback rice rush super table taylor terri touchdown yard

24 Topics from the FBI Web Site
- uss rocket cabin aircraft fuel hughes twa missile redstone (Congressional statement - TWA 800)
- theft art stolen legislation notices recoveries (FBI - Art Theft Program)
- signature ink writings genuine printing symbols (forensic science communications)
- forged memorabilia authentic bullpen (FBI major investigation - Operation Bullpen)
- arabia saudi towers dhahran (June bombing of the Khobar Towers military housing complex in Dhahran, Kingdom of Saudi Arabia - REWARD)
- classified philip quietly ashcroft hanssen drop cia affidavit tenet dedication compromised kgb successor helen volunteered (Agent Robert Philip Hanssen - espionage)

25 Topics from FBI web site afghanistan ramzi bin yousef islamic jihad egyptian bombings egypt pakistan hamas yemen headquartered usama laden kenya tanzania nairobi embassies dar salaam rahman mohamed abdel affiliated camps opposed deserve legat enemies vigilance plots casualties enterprise asian chinese enterprises korean vietnamese italian cartels heroin cosa nostra sicilian lcn firearms firearm bullets ammunition cartridges perpetrating bioterrorism responders credible exposed biological articulated covert hoax wmd assumes

26 Future Applications
- Unclassified telephone call data to be provided by INSCOM
- Goal: to find a terrorist in a haystack
[Graph: nodes = people; edges = telephone calls (date / time / duration)]

27 Conclusions
- Predictive models specifically designed for rare classes can help improve the detection of small attack types
- Simple outlier detection approaches appear capable of detecting anomalies
- Clustering-based approaches show promise in identifying novel attack types
- Integrating data mining techniques into SNORT should improve the detection rate

28 Data Mining Process
Data mining - "non-trivial extraction of implicit, previously unknown, and potentially useful information from data"

29 Modified KDDCup 1999 Data Set
- KDDCup 1999 data is based on the DARPA 1998 data set
- Remove duplicates and merge the new train and test data sets
- Sample 69,980 examples from the merged data set
  - Sample from the neptune and normal subclasses; other subclasses remain intact
- Divide in equal proportion into training and test sets

30 DARPA 1998 Data Set
- The DARPA 1998 data set (prepared and managed by MIT Lincoln Lab) includes a wide variety of intrusions simulated in a military network environment
- 9 weeks of raw TCP dump data
  - 7 weeks for training (5 million connection records)
  - 2 weeks for testing (2 million connection records)
- Connections are labeled as normal or attacks (4 main categories of attacks, 38 attack types)
  - DOS - denial of service
  - Probe - e.g., port scanning
  - U2R - unauthorized access to root privileges
  - R2L - unauthorized remote login to a machine
- Two types of attacks
  - Bursty attacks - involve multiple network connections
  - Non-bursty attacks - involve single network connections

31 Terrorist Threat Analyzer & Predictor (T-TAP)
Operational Capability:
- Ability to match data from multiple sources, resolving structural and semantic conflicts
- Ability to recognize events and behaviors whose (partial) information is available in different data streams
- Ability to identify latent associations between suspects and their linkages to events/behaviors
- Capability to predict threatening activities and behaviors with high probability
- Ability to visualize interesting and significant events and behaviors
Proposed Technical Approach: New effort. Key technologies: high-dimensional data clustering (METIS); spatio-temporal change-point detection; association analysis and belief revision; high-dimensional classification (SVM, NN, boosting)
Task List:
- T1: Develop data fusion algorithms to match diverse intelligence, law enforcement, and cultural data
- T2: Develop event and behavior recognition algorithms across multiple, multi-media data streams
- T3: Association analysis algorithms to determine hidden connections between suspects, events, and behaviors; develop networks of associations
- T4: Predictive models of threatening events and behaviors
- T5: Interestingness- and relevance-based visualization models of significant events and behaviors
Rough Order of Magnitude Cost and Schedule: Tasks 1, 2, and 3 will each proceed in parallel for the first 12 months, with version 1 released at the end of 6 months. At this point tasks 4 and 5 will start and proceed in parallel. Cross-feedback across the various tasks will lead to final, refined tools at the end of the 18-month period.
Deliverables: Database schema and structures to store terrain info. Software implementation of concealed cavities prediction model. User manuals, test reports, database schema. Quarterly technical and status reports, and final report.
Corporate Information: Vipin Kumar (POC), Army Research Center, University of Minnesota, 1100 Washington Avenue SE, Minneapolis, MN. Phone (612) ; AHPCRC, University of Minnesota
[Diagram: Cultural Data, Intelligence Data, Law Enforcement Data → Information Fusion → Event Recognition, Association Analysis → Threat Predictor, Threat Visualizer - T-TAP System]

32 Distributed Virtual Integrated Threat Analysis Database (DVITAD)
Operational Capability:
- Global integrated view across multiple databases: integrated schema, global dictionary and directory, common ontology
- Threat analysis object repository: comprehensive suspect dossiers, temporal activity tracks, association networks
- Interactive database exploration: field- and value-based querying; approximate-match-based retrieval
Proposed Technical Approach: New effort. Key technologies: semantic object matching; clustering large datasets (METIS); latent/indirect association analyzer
Task List:
- T1: Data normalization through the development of wrappers/connectors for various databases
- T2: Integrated schema creation to model information found in all databases
- T3: Matching entities (suspects, events, etc.) across multiple data sources; resolving conflicting attribute values for entities across databases
- T4: Clustering of suspect profiles
- T5: Building networks of hidden associations between suspects
- T6: Constructing temporal activity tracks of events and linking suspects to the tracks
Rough Order of Magnitude Cost and Schedule: Tasks 1 and 2 will each proceed in parallel for the first 12 months, with version 1 released at the end of 6 months. At this point tasks 3, 4, 5, and 6 will start and proceed in parallel. Cross-feedback across the various tasks will lead to final, refined tools at the end of the 18-month period.
Deliverables: Global schema, dictionary, and directory for the integrated database. Software that realizes the virtual integrated DB view. User manuals, test reports, database schema. Quarterly technical and status reports, and final report.
Corporate Information: Vipin Kumar (POC), Army Research Center, University of Minnesota, 1100 Washington Avenue SE, Minneapolis, MN. Phone (612) ; AHPCRC, University of Minnesota
[Diagram: Database Connectors → Information Integration (matching, clustering, profiles, networks, tracks) → Virtual Integrated Database → Threat Analysis Tools]

