A Generic Approach to Big Data Alarms Prioritization


A Generic Approach to Big Data Alarms Prioritization Ossi Askew, Darshit Mody, Ayushi Vyas, Tiffany Branker Pedro Vasseur, Stephan Barabassi

Introduction Identifying and acting upon a possible data leak in a timely manner is a continuing challenge for most organizations. The volume of data being ingested grows every day as organizations place this information in very large data repositories in order to mine insightful patterns for key decision-making tasks. To safeguard this sensitive data from unauthorized access, companies have installed Data Leak Detection (DLD) applications that monitor access to internal data repositories. Whenever the DLD engine detects anomalous data querying behavior, it creates an alarm record in the form of a security log record. Security experts must analyze these alarms in a timely manner to determine whether they are malicious or benign.

Current State of DLD Alarm Handling [Diagram: query functions and applications access Big Data tables (Hadoop); the Data Leak Detection engine, configured by access rules, emits a continuous stream of alarms; security analysts manually review and decide on every unclassified alarm.]

Previous work The first data analysis technique used was an unsupervised data clustering model, followed by identification of the determining attributes of a “data-in-motion” test sample. Teams found that the target UID path of the uploaded datasets triggered the false positives, and discovered a pattern whereby any file upload that took more than a second was classified as a false alarm. The second data analysis technique, a decision tree algorithm, proved more useful across multiple environments because it could consider various attributes of the data. The team leveraged this previous work as much as possible by applying it to the “data at rest” condition, especially in the Big Data security logs.

Primary Approach Improve the efficiency and effectiveness of the analysis effort associated with data leak detection alert logs by using traditional data mining methods, as well as Big Data analytics techniques, that can classify and prioritize true and false positives. Use machine learning methods that can iteratively confirm the nature and priority of the alerts, and hence reduce the time and cost incurred in the manual process of investigating and acting upon malicious data access. For prediction of binomial outcomes, such as true or false, a decision tree model was chosen, specifically the ID3 algorithm. After reviewing several data mining tools, RapidMiner was chosen to test this algorithm.

Primary Approach - continued ID3 (Iterative Dichotomiser 3) builds a decision tree from a fixed set of examples; the resulting tree is used to classify future samples. The examples of the given Example Set have several attributes, and every example belongs to a class (such as yes or no). The leaf nodes of the decision tree contain the class name, whereas a non-leaf node is a decision node: an attribute test with each branch (to another decision tree) being a possible value of the attribute. ID3 uses a feature selection heuristic to decide which attribute goes into a decision node; the heuristic is selected by a criterion parameter. * * From the RapidMiner official documentation
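The ID3 procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the RapidMiner implementation: it uses plain information gain as the selection heuristic, and the toy alarm records (role, component accessed) and their true/false labels are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    # Entropy before the split minus the weighted entropy of each subset
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attr], []).append(label)
    remainder = sum(len(s) / len(labels) * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder

def id3(rows, labels, attrs):
    if len(set(labels)) == 1:          # pure node: emit a leaf
        return labels[0]
    if not attrs:                      # no attributes left: majority class
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))
    children = {}
    for value in {row[best] for row in rows}:
        idx = [i for i, r in enumerate(rows) if r[best] == value]
        children[value] = id3([rows[i] for i in idx],
                              [labels[i] for i in idx],
                              [a for a in attrs if a != best])
    return {best: children}

def classify(tree, row):
    # Walk decision nodes until a leaf (class name) is reached
    while isinstance(tree, dict):
        attr = next(iter(tree))
        tree = tree[attr][row[attr]]
    return tree

# Hypothetical alarm records: which role queried which component
rows = [{"role": "admin", "component": "hdfs"},
        {"role": "admin", "component": "hive"},
        {"role": "guest", "component": "hdfs"},
        {"role": "guest", "component": "hive"}]
labels = ["false", "false", "true", "true"]
tree = id3(rows, labels, ["role", "component"])
```

On this toy data the role attribute alone separates true from false alarms, so ID3 makes it the single decision node.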

Secondary Approach Provide an automated feedback mechanism connecting the Data Leak Detection rules engine, the decision tree training set, and the subject matter expert, in order to improve prediction accuracy through continued learning. The automated feedback should be executed by a programmed algorithm that compares each record's determining attributes and associated target value in the training set with the corresponding record attributes and associated target in the DLD access rules engine.

Desired State of Alarm Handling [Diagram: the incoming stream of alarms now passes through a predicting and prioritizing algorithm, which labels each one as a true or false alarm; query functions and applications access the Big Data tables (Hadoop) monitored by the Data Leak Prevention rules engine and its access rules; automated feedback flows back into the rules engine and training set, while the security analyst makes the final decision on classified alarms and supplies manual feedback.]

Results [Figure: the initial trained decision tree.]

Results - continued The alarms generated by the DLD engine were categorized and prioritized as true and false positives. The RapidMiner tool attained accuracy of over 90 percent after re-selecting the key contributing attributes for the decision tree algorithm and filtering some input variables. After several selections and trials, the ID3 model improved its prediction capabilities by using the attribute selection of Alarm, Component Accessed, and Role, with an adjusted learning criterion of gain_ratio, a minimal size for split of 2, a minimal leaf size of 2 instead of 4, and a minimal gain increased gradually from 0.10 up to 0.90. This produced a new learned tree structure with Role as the root node instead of Component Used.
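The gain_ratio criterion mentioned above normalizes information gain by an attribute's "split information", which penalizes attributes that fragment the data into many small subsets. A minimal sketch of the computation (the records and attribute names below are hypothetical, not the project's data):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(rows, labels, attr):
    n = len(labels)
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attr], []).append(label)
    # Information gain: entropy reduction achieved by splitting on attr
    gain = entropy(labels) - sum(len(s) / n * entropy(s) for s in subsets.values())
    # Split information: entropy of the partition sizes themselves
    split_info = -sum(len(s) / n * math.log2(len(s) / n) for s in subsets.values())
    return gain / split_info if split_info else 0.0

# Hypothetical alarm records
rows = [{"role": "admin", "alarm": "login"},
        {"role": "admin", "alarm": "export"},
        {"role": "guest", "alarm": "login"},
        {"role": "guest", "alarm": "export"}]
labels = ["false", "false", "true", "true"]
```

Here role perfectly predicts the label (ratio 1.0) while alarm carries no information (ratio 0.0), so a gain_ratio criterion would likewise place Role at the root.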

Results - continued A prototype of component i, the automated feedback, was proposed for future testing.

Conclusion/Summary The subject matter experts, the security analysts, will spend less time sorting and reviewing true alarms, since the false alarms have been identified for later, non-urgent review. It is proposed that the time and effort required to manually update the Access Rules violation decision table and the model's training set will be minimized by developing a programmatic approach to component i that could efficiently replace component j. This approach can be generalized and applied to other types of DLDs.

Future Work Design an efficient approach that can anonymize data security alarm records for scoring purposes. Continue enhancing the algorithm for iterative machine learning and re-training, to keep improving the confirmation and prioritization of anomalous querying of Big Data repositories, based on: increasing the accuracy of prediction above 95% in the decision tree component of the algorithm; and the programmatic comparison of existing rules and confirmed alarms using the formula described on the next slide.

Future Work - continued Logic for the execution of component i: If {TSIV1, TSIV2, …, TSIVn, TSTV} of x is not equal to {ARIV1, ARIV2, …, ARIVn, ARTV} of y, then execute component i. Where: TSIV is the value of a contributing Training Set Independent Variable; TSTV is the value of a Training Set Target Variable (e.g. True or False); ARIV is the value of the corresponding Access Rules Independent Variable; ARTV is the value of the corresponding Access Rules Target Variable (e.g. True or False); 1 to n indexes the distinct Independent Variables in an instance of the Training Set; x is a record in the Training Set; and y is a record in the Access Rules table.
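The comparison above can be sketched directly in Python. The field names (role, component) and the record layout are hypothetical illustrations; the essential idea is comparing the tuple of independent variables plus target value between a training-set record and its access-rule counterpart.

```python
def needs_feedback(training_record, access_rule, iv_names):
    # Build {TSIV1..TSIVn, TSTV} of x and {ARIV1..ARIVn, ARTV} of y
    ts = tuple(training_record[k] for k in iv_names) + (training_record["target"],)
    ar = tuple(access_rule[k] for k in iv_names) + (access_rule["target"],)
    # Component i fires only when the two tuples disagree
    return ts != ar

# Hypothetical records: same independent variables, conflicting targets
x = {"role": "guest", "component": "hdfs", "target": "True"}
y = {"role": "guest", "component": "hdfs", "target": "False"}
```

With these records `needs_feedback(x, y, ["role", "component"])` is true, so component i would push the confirmed classification back to the access rules engine; once the rule's target matches the training set, the comparison returns false and no feedback is sent.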

Q & A

Thank you.