
1 Data Mining for Security Applications: Detecting Malicious Executables Mr. Mehedy M. Masud (PhD Student) Prof. Latifur Khan Prof. Bhavani Thuraisingham Department of Computer Science The University of Texas at Dallas

2 Outline and Acknowledgement ● Vision for Assured Information Sharing ● Handling Different Trust levels ● Defensive Operations between Untrustworthy Partners – Detecting Malicious Executables using Data Mining ● Research Funded by Air Force Office of Scientific Research and Texas Enterprise Funds

3 Vision: Assured Information Sharing (Diagram: Agencies A, B and C each publish a data/policy component into a shared coalition data/policy.) 1. Trustworthy partners 2. Semi-trustworthy partners 3. Untrustworthy partners 4. Dynamic trust

4 Our Approach ● Integrate Medicaid claims data and mine it; then enforce policies and determine how much information is lost by enforcing them – Prof. Khan, Dr. Awad (Postdoc) and student workers (MS students) ● Apply game theory and probing techniques to extract information from semi-trustworthy partners – Prof. Murat Kantarcioglu and Ryan Layfield (PhD student) ● Data mining for defensive and offensive operations – e.g., malicious code detection, honeypots – Prof. Latifur Khan and Mehedy Masud ● Dynamic trust levels, peer-to-peer communication – Prof. Kevin Hamlen and Nathalie Tsybulnik (PhD student)

5 Introduction: Detecting Malicious Executables using Data Mining ● What are malicious executables? – They harm computer systems – Viruses, exploits, denial of service (DoS), flooders, sniffers, spoofers, Trojans, etc. – Exploit software vulnerabilities on a victim – May remotely infect other victims – Incur great losses. Example: the Code Red epidemic cost $2.6 billion ● Malicious code detection: traditional approach – Signature based – Requires signatures to be generated by human experts – So, not effective against "zero-day" attacks

6 State of the Art: Automated Detection ● Automated detection approaches: ● Behavioural: analyse behaviours such as source/destination addresses, attachment types, statistical anomalies, etc. ● Content-based: analyse the content of the malicious executable – Autograph (H.-A. Kim, CMU): based on an automated signature generation process – N-gram analysis (Maloof, M. A. et al.): based on mining features and using machine learning.

7 New Ideas ✗ Content-based approaches consider only machine code (byte code). ✗ Is it possible to use higher-level source code for malicious code detection? ✗ Yes: disassemble the binary executable and retrieve the assembly program ✗ Extract important features from the assembly program ✗ Combine them with machine-code features

8 Feature Extraction ✗ Binary n-gram features – Sequence of n consecutive bytes of binary executable ✗ Assembly n-gram features – Sequence of n consecutive assembly instructions ✗ System API call features – DLL function call information
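As a rough illustration of the DLL-call features, the sketch below lists the imported DLL functions of a Windows executable. It uses the pefile library, which is an assumption here; the slides do not name the tool actually used.

```python
# Hypothetical sketch: extract DLL function-call (import) features from a
# Windows executable using the pefile library (an assumption, not the
# tool used in the original work).
import pefile

def extract_api_features(path):
    """Return the set of 'dll:function' names imported by the executable."""
    pe = pefile.PE(path)
    features = set()
    for entry in getattr(pe, "DIRECTORY_ENTRY_IMPORT", []):
        dll = entry.dll.decode(errors="ignore").lower()
        for imp in entry.imports:
            if imp.name:  # some imports are by ordinal and have no name
                features.add(f"{dll}:{imp.name.decode(errors='ignore')}")
    return features
```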

9 The Hybrid Feature Retrieval Model ● Collect training samples of normal and malicious executables. ● Extract features ● Train a Classifier and build a model ● Test the model against test samples

10 Hybrid Feature Retrieval (HFR) ● Training

11 Hybrid Feature Retrieval (HFR) ● Testing

12 Feature Extraction: Binary n-gram features – Features are extracted from the byte code in the form of n-grams, where n = 2, 4, 6, 8, 10 and so on. Example: given an 11-byte sequence 0123456789abcdef012345, the 2-grams (2-byte sequences) are: 0123, 2345, 4567, 6789, 89ab, abcd, cdef, ef01, 0123, 2345; the 4-grams (4-byte sequences) are: 01234567, 23456789, 456789ab, ..., ef012345; and so on. Problem: – Large dataset. Too many features (millions!). Solution: – Use secondary memory and efficient data structures – Apply feature selection
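A minimal sketch of the sliding-window extraction, reproducing the 2-gram example above; in practice the grams would be counted in an on-disk structure to cope with the feature explosion.

```python
# Minimal sketch: sliding-window byte n-grams over an executable's contents.
# Feature names are hex strings, as in the slide's example.
def binary_ngrams(data: bytes, n: int):
    """Yield every n-byte sequence (as a hex string) in the byte stream."""
    for i in range(len(data) - n + 1):
        yield data[i:i + n].hex()

sample = bytes.fromhex("0123456789abcdef012345")
print(list(binary_ngrams(sample, 2)))
# ['0123', '2345', '4567', '6789', '89ab', 'abcd', 'cdef', 'ef01', '0123', '2345']
```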

13 Feature Extraction: Assembly n-gram features – Features are extracted from the assembly programs in the form of n-grams, where n = 2, 4, 6, 8, 10 and so on. Example: given the three instructions "push eax"; "mov eax, dword[0f34]"; "add ecx, eax"; the 2-grams are (1) "push eax"; "mov eax, dword[0f34]" and (2) "mov eax, dword[0f34]"; "add ecx, eax" Problem: – Same problem as binary Solution: – Same solution
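A sketch of the assembly n-gram step; the Capstone disassembler stands in here for the PE disassembler actually used in the experiments, so treat it as an illustrative assumption.

```python
# Sketch of assembly n-gram extraction. The slides use Pedisassem; here the
# Capstone disassembler is an assumed, illustrative substitute.
from capstone import Cs, CS_ARCH_X86, CS_MODE_32

def assembly_ngrams(code: bytes, n: int = 2, base: int = 0x1000):
    md = Cs(CS_ARCH_X86, CS_MODE_32)
    insns = [f"{i.mnemonic} {i.op_str}".strip() for i in md.disasm(code, base)]
    # An assembly n-gram is n consecutive instructions.
    return [tuple(insns[i:i + n]) for i in range(len(insns) - n + 1)]

# "push eax", "mov eax, dword ptr [0xf34]", "add ecx, eax"
code = bytes.fromhex("50" "a1340f0000" "01c1")
print(assembly_ngrams(code, 2))
```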

14 Feature Selection ● Select the best K features ● Selection criterion: Information Gain ● The gain of an attribute A on a collection of examples S is given by Gain(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|S_v| / |S|) · Entropy(S_v)
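A small sketch of the information-gain ranking, assuming binary (present/absent) n-gram features; the toy data at the end is hypothetical.

```python
# Sketch of the information-gain computation used to rank features. Each
# feature is binary (n-gram present/absent), so Values(A) = {0, 1}.
from collections import Counter
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def info_gain(feature_values, labels):
    """Gain(S, A) = Entropy(S) - sum_v |S_v|/|S| * Entropy(S_v)."""
    total = len(labels)
    gain = entropy(labels)
    for v in set(feature_values):
        subset = [l for f, l in zip(feature_values, labels) if f == v]
        gain -= (len(subset) / total) * entropy(subset)
    return gain

# Toy example: a gram present only in malicious samples scores the maximum gain.
print(info_gain([1, 1, 1, 0, 0, 0], ["mal", "mal", "mal", "ben", "ben", "ben"]))  # 1.0
```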

15 Experiments ● Dataset – Dataset1: 838 malicious and 597 benign executables – Dataset2: 1082 malicious and 1370 benign executables – Malicious code collected from VX Heavens (http://vx.netlux.org) ● Disassembly – Pedisassem ( http://www.geocities.com/~sangcho/index.html ) ● Training, Testing – Support Vector Machine (SVM) – C-Support Vector Classifiers with an RBF kernel
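For illustration, a possible training and evaluation setup with scikit-learn's C-SVC; the original experiments may have used a different SVM implementation, and load_feature_vectors is a hypothetical helper, not part of the slides.

```python
# Illustrative training/testing setup. The slides report C-SVC with an RBF
# kernel; scikit-learn's SVC is an assumed stand-in for the actual tool.
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# X: one row of binary/assembly/API features per executable,
# y: 1 = malicious, 0 = benign (hypothetical placeholders).
X, y = load_feature_vectors()  # assumed helper, not part of the slides

clf = SVC(C=1.0, kernel="rbf", gamma="scale")
scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
print("5-fold accuracy: %.3f" % scores.mean())
```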

16 Results ● HFS = Hybrid Feature Set ● BFS = Binary Feature Set ● AFS = Assembly Feature Set

17 Results ● HFS = Hybrid Feature Set ● BFS = Binary Feature Set ● AFS = Assembly Feature Set

18 Results ● HFS = Hybrid Feature Set ● BFS = Binary Feature Set ● AFS = Assembly Feature Set

19 Future Plans ● System calls: – seem to be very useful – need to consider frequency of calls – call sequence patterns (following program paths) – actions immediately preceding or following a call ● Detect malicious code by program slicing – requires further analysis

20 Data Mining to Detect Buffer Overflow Attack Mohammad M. Masud, Latifur Khan, Bhavani Thuraisingham Department of Computer Science The University of Texas at Dallas

21 Introduction ● Goal – Intrusion detection. – e.g.: worm attack, buffer overflow attack. ● Main Contribution – 'Worm' code detection by data mining coupled with 'reverse engineering'. – Buffer overflow detection by combining data mining with static analysis of assembly code.

22 Background ● What is a 'buffer overflow'? – A situation in which a fixed-size buffer is overflowed by a larger input. ● How does it happen? – Example: ... char buff[100]; gets(buff); ... (Figure: the input string is copied into buff on the stack.)

23 Background (cont...) ● Then what? ... char buff[100]; gets(buff); ... (Figure: the oversized input overflows buff and overwrites the saved return address on the stack; the new return address points to the attacker's code placed in the buffer.)

24 Background (cont...) ● So what? – The program may crash, or – the attacker can execute arbitrary code ● It can now – execute any system function – communicate with some host, download some 'worm' code and install it! – open a backdoor to take full control of the victim ● How do we stop it?

25 Background (cont...) ● Stopping buffer overflow – Preventive approaches – Detection approaches ● Preventive approaches – Finding bugs in source code. Problem: can only work when source code is available. – Compiler extension. Same problem. – OS/HW modification ● Detection approaches – Capture code running symptoms. Problem: may require long running time. – Automatically generating signatures of buffer overflow attacks.

26 CodeBlocker (Our approach) ● A detection approach ● Based on the Observation: – Attack messages usually contain code while normal messages contain data. ● Main Idea – Check whether message contains code ● Problem to solve: – Distinguishing code from data

27 ● Statistics to support this observation: (a) On Windows platforms – most web servers (port 80) accept data only; – remote access services (ports 111, 137, 138, 139) accept data only; – Microsoft SQL Servers (port 1434) accept data only; – workstation services (ports 139 and 445) accept data only. (b) On Linux platforms – most Apache web servers (port 80) accept data only; – BIND (port 53) accepts data only; – SNMP (port 161) accepts data only; – most Mail Transport agents (port 25) accept data only; – database servers (Oracle, MySQL, PostgreSQL) at ports 1521, 3306 and 5432 accept data only.

28 Severity of the problem ● It is not easy to recover the actual instruction sequence from a given string of bits

29 Our solution ● Apply data mining ● Formulate the problem as a classification problem (code vs. data) ● Collect a set of training examples containing both classes ● Train a machine learning algorithm on the data to obtain a model ● Test this model against new messages

30 CodeBlocker Model

31 Feature Extraction

32 Disassembly ● We apply the SigFree tool – implemented by Xinran Wang et al. (Penn State)

33 Feature extraction ● Features are extracted using – N-gram analysis – Control flow analysis ● N-gram analysis – What is an n-gram? A sequence of n instructions – Traditional approach: flow of control is ignored (Figure: an assembly program and its corresponding instruction flow graph (IFG); the 2-grams over the linear instruction sequence are 02, 24, 46, ..., CE)

34 Feature extraction (cont...) ● Control-flow based n-gram analysis – Proposed control-flow based approach: flow of control is considered (Figure: the same assembly program and IFG; following control-flow edges adds the extra 2-gram E6, giving 02, 24, 46, ..., CE, E6)
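A sketch of how control-flow-based 2-grams could be collected from an instruction flow graph; the graph, addresses and instructions below are hypothetical, chosen so that a backward jump contributes an extra gram the linear scan would miss.

```python
# Sketch of control-flow-based 2-gram extraction over a hypothetical
# instruction flow graph (IFG): nodes are instruction addresses, edges follow
# both fall-through and jump targets, so grams can span a branch.
def cf_2grams(ifg, labels):
    """ifg: {addr: [successor addrs]}, labels: {addr: instruction text}."""
    grams = set()
    for addr, succs in ifg.items():
        for s in succs:
            grams.add((labels[addr], labels[s]))
    return grams

# Toy IFG: the instruction at 0xE jumps back to 0x6 as well as falling through.
labels = {0x0: "push ebp", 0x2: "mov ebp, esp", 0x4: "xor eax, eax",
          0x6: "inc eax", 0xC: "cmp eax, 0x5", 0xE: "jl 0x6"}
ifg = {0x0: [0x2], 0x2: [0x4], 0x4: [0x6], 0x6: [0xC], 0xC: [0xE], 0xE: [0x6]}
print(cf_2grams(ifg, labels))
```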

35 Feature extraction (cont...) ● Control flow analysis. Generated features: – Invalid Memory Reference (IMR) – Undefined Register (UR) – Invalid Jump Target (IJT) ● Checking IMR – Memory is referenced using register addressing and the register's value is undefined – e.g.: mov ax, [dx + 5] ● Checking UR – Check whether the register's value is set properly ● Checking IJT – Check whether the jump target violates an instruction boundary
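A toy sketch of the undefined-register part of these checks: track which registers have been assigned and flag memory references whose base register is still undefined. The string-based parsing is deliberately crude and only illustrative, not the analysis actually used.

```python
# Sketch of the Undefined Register (UR) / Invalid Memory Reference (IMR)
# check over a linear instruction list.
import re

def undefined_register_refs(insns):
    defined = set()
    flags = []
    for ins in insns:
        mem = re.search(r"\[([a-z]{2,3})\b", ins)   # register used as a base
        if mem and mem.group(1) not in defined:
            flags.append(ins)                        # reference via an undefined register
        dest = re.match(r"\w+\s+(\w+),", ins)        # crude "op dest, src" parse
        if dest:
            defined.add(dest.group(1))               # dest register is now defined
    return flags

print(undefined_register_refs(["mov ax, [dx + 5]", "mov dx, 0x10", "mov bx, [dx]"]))
# ['mov ax, [dx + 5]']  -- dx is undefined at the first reference
```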

36 Feature extraction (cont...) ● Why n-gram analysis? – Intuition: in general, disassembled executables should have a different pattern of instruction usage than disassembled data. ● Why control flow analysis? – Intuition: there should be no invalid memory references or invalid jump targets.

37 Putting it together ● Compute all possible n-grams ● Select best k of them ● Compute feature vector (binary vector) for each training example ● Supply these vectors to the training algorithm
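Putting those steps together, a sketch that keeps the k best grams (reusing the info_gain sketch above) and encodes each training example as a binary presence vector; train_gram_sets and train_labels are hypothetical placeholders.

```python
# Sketch: rank grams by information gain, keep the best k, and build binary
# feature vectors suitable for the SVM training step.
def top_k_grams(gram_sets, labels, k):
    vocab = set().union(*gram_sets)
    scored = sorted(vocab,
                    key=lambda g: info_gain([int(g in s) for s in gram_sets], labels),
                    reverse=True)
    return scored[:k]

def to_binary_vectors(gram_sets, selected):
    return [[int(g in s) for g in selected] for s in gram_sets]

# gram_sets: one set of n-grams per training example (hypothetical data).
selected = top_k_grams(train_gram_sets, train_labels, k=500)
X_train = to_binary_vectors(train_gram_sets, selected)
```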

38 Experiments ● Dataset – Real traces of normal messages – Real attack messages – Polymorphic shellcodes ● Training, Testing – Support Vector Machine (SVM)

39 Results ● CFBn: Control-Flow Based n-gram feature ● CFF: Control-flow feature

40 Novelty / contribution ● We introduce the notion of control flow based n-gram ● We combine control flow analysis with data mining to detect code / data ● Significant improvement over other methods (e.g. SigFree)

41 Advantages ● Fast testing ● Signature-free operation ● Low overhead ● Robust against many obfuscations

42 Limitations ● Need samples of attack and normal messages. ● May not be able to detect a completely new type of attack.

43 Future Work ● Find more features ● Apply dynamic analysis techniques ● Semantic analysis

44 References / suggested readings – X. Wang, C. Pan, P. Liu, and S. Zhu. SigFree: A signature-free buffer overflow attack blocker. In USENIX Security, July 2006. – J. Z. Kolter and M. A. Maloof. Learning to detect malicious executables in the wild. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Seattle, WA, USA, pages 470-478, 2004.

45 Email Worm Detection (behavioural approach) (Figure: the model. Outgoing emails go through feature extraction; the training data is fed to a machine learning algorithm to build the classifier; test data is then classified as clean or infected.)

46 Feature Extraction ● Per-email features – Binary-valued features: presence of HTML; script tags/attributes; embedded images; hyperlinks; presence of binary and text attachments; MIME types of file attachments – Continuous-valued features: number of attachments; number of words/characters in the subject and body ● Per-window features – Number of emails sent; number of unique email recipients; number of unique sender addresses; average number of words/characters per subject and body; average word length; variance in the number of words/characters per subject and body; variance in word length – Ratio of emails with attachments
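An illustrative per-email extractor using Python's standard email package; the exact feature set and parsing used in the original system may differ.

```python
# Sketch of per-email (binary- and continuous-valued) feature extraction.
from email import message_from_string
from email.message import Message

def per_email_features(raw: str) -> dict:
    msg: Message = message_from_string(raw)
    body_parts, attachments = [], []
    for part in msg.walk():
        if part.get_filename():
            attachments.append(part.get_content_type())   # MIME type of attachment
        elif part.get_content_maintype() == "text":
            body_parts.append(part.get_payload(decode=False) or "")
    body = "\n".join(body_parts)
    subject = msg.get("Subject", "")
    return {
        "has_html": int("<html" in body.lower()),
        "has_script": int("<script" in body.lower()),
        "num_attachments": len(attachments),
        "attachment_types": attachments,
        "subject_words": len(subject.split()),
        "body_chars": len(body),
    }
```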

47 Feature Reduction & Selection ● Principal Component Analysis – Reduces higher-dimensional data into a lower dimension – Helps reduce noise and overfitting ● Decision Tree – Used to select the best features
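A sketch of the two reduction strategies with scikit-learn stand-ins (the slides use C4.5/J48 in Weka); load_email_features is a hypothetical helper, not part of the slides.

```python
# Sketch: (a) PCA for dimensionality reduction, (b) a decision tree whose
# learned splits pick out the most useful features.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier

X, y = load_email_features()           # assumed helper, not part of the slides

# (a) Project onto components that keep 95% of the variance.
X_reduced = PCA(n_components=0.95).fit_transform(X)

# (b) Keep only the features the tree actually split on.
tree = DecisionTreeClassifier().fit(X, y)
selected = np.flatnonzero(tree.feature_importances_ > 0)
X_selected = X[:, selected]
```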

48 Experiments ● Dataset – Contains instances of both normal and viral emails – Six worm types: bagle.f, bubbleboy, mydoom.m, mydoom.u, netsky.d, sobig.f – Collected from UC Berkeley ● Training, Testing – Decision Tree: C4.5 algorithm (J48) in Weka – Support Vector Machine (SVM) and Naïve Bayes (NB)

49 Results

50 Conclusion & Future Work ● Three approaches have been tested – Apply the classifier directly – Apply dimension reduction (PCA) and then classify – Apply feature selection (decision tree) and then classify ● The decision tree has the best performance ● Future plans – Combine content-based with behavioural approaches ● Offensive operations – Honeypots, information operations

