U.S. Army Research, Development and Engineering Command Jaime C. Acosta, Ph.D. Using the Longest Common Substring on Dynamic Traces of Malware to Automatically.


1 U.S. Army Research, Development and Engineering Command Jaime C. Acosta, Ph.D. Using the Longest Common Substring on Dynamic Traces of Malware to Automatically Identify Common Behaviors

2 Research Question
Can we reduce redundant analysis by finding common behaviors in malware instances?

3 Malware Analysis
Dynamic Analysis: Run the malware instance (binary) in a controlled environment
–Log all events (registry, memory, sockets, etc.)
–Analyze logs for malicious behavior
–Find similar malware instances based on runtime behavior

4 Malware Analysis
Event Logs
Malware A event codes: 00 01 02 03 …
00 Initialize network socket
01 Establish connection to malicious.com
02 Load library
03 Sleep

5 Malware Instance Similarity
Event n-grams (Rieck et al. 2010)
–Find common n-grams (or sequences of events) in event logs
Event codes:
Malware A: 00 01 02 03 …
Malware B: 01 02 03 02 …
Malware C: 04 02 05 01 02 …
2-grams for Malware A / Malware B: 01, 02; 02, 03
2-grams for Malware A / Malware C: 01, 02
2-grams for Malware B / Malware C: 01, 02

6 Malware Instance Similarity
Event n-grams (Rieck et al. 2010)
–Find common fixed-size n-grams (or sequences of events) in event logs
Event codes:
Malware A: 00 … 01 02 03 …
Malware B: 01 … 02 03 02 …
Malware C: 04 … 02 05 01 02 …
2-grams for Malware A / Malware B: 01, 02; 02, 03
2-grams for Malware A / Malware C: 01, 02
2-grams for Malware B / Malware C: 01, 02
Malware A and Malware B share the most 2-grams, so they are more likely to be of the same type
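The 2-gram comparison on the two slides above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation of Rieck et al.; the function names `ngrams` and `common_ngrams` are hypothetical, and only the trace prefixes shown on slide 5 are used (the elided "…" tails are not modeled).

```python
def ngrams(trace, n):
    """Return the set of all length-n windows (n-grams) in a trace."""
    return {tuple(trace[i:i + n]) for i in range(len(trace) - n + 1)}


def common_ngrams(trace_a, trace_b, n=2):
    """Return the n-grams that occur in both traces."""
    return ngrams(trace_a, n) & ngrams(trace_b, n)


# Trace prefixes as shown on slide 5.
malware_a = ["00", "01", "02", "03"]
malware_b = ["01", "02", "03", "02"]
malware_c = ["04", "02", "05", "01", "02"]

print(sorted(common_ngrams(malware_a, malware_b)))  # [('01', '02'), ('02', '03')]
print(sorted(common_ngrams(malware_a, malware_c)))  # [('01', '02')]
print(sorted(common_ngrams(malware_b, malware_c)))  # [('01', '02')]
```

Because n is fixed, a longer shared sequence only shows up as several overlapping n-grams, which is the limitation the next slides address.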

7 Malware Instance Similarity
Limitations for post analysis
–Lose context given by varied-length sequences
Malware A event codes: 00 01 02 03 04 05 …
Events, in order: Initialize network socket; Establish connection to malicious.com; Load library; Sleep; …; Install a rootkit

8 Malware Instance Similarity
Limitations for post analysis
–Lose context given by varied-length sequences
–Lose commonalities between different types of malware
Malware A: 08 …
Malware B: 06 04 00 01 …
Malware C: 00 …

9 Approach
Common Substrings Algorithm
–Based on the Longest Common Substring
–Finds all common event sequences of minimum (not fixed) length n between trace files in a dataset

10 Approach
Malheur Reference Dataset
–Dynamic traces of 3,131 malware instances
  Generated with CWSandbox
  Trace size ranges from 700 B to 3.4 MB
  Collected in August 2009

11 Approach
Malheur Reference Dataset
–Traces split into 2 sets

                                             Small Set (<100KB)   Large Set (>=100KB)
Total # malware instance trace files         2,071                1,060
Total # events                               1,217,985            17,400,262
Total size of malware instance trace files   44 MB                490 MB

12 Approach
Goal
–Reduce redundant analysis, especially in larger malware
  First, find common substrings within small malware traces
  Next, reduce analysis workload by removing redundancies in larger malware traces

13 Approach – Common Substrings Algorithm
Input: Malware dynamic traces of the small set (size < 100KB)
Events:
Malware A: 00 … 01 02 03 …
Malware B: 01 … 02 06 02 …
Malware C: 04 … 02 03 00 …
Malware D: 04 … 05 06 02 …
Malware E: 02 … 03 00 04 …
Malware F: 04 … 05 06 00 …
Output: Common substrings matrix (all common substrings between pairs of malware traces; X marks unused cells)

     A   B   C   D   E   F
A    X   X   X   X   X   X
B    …   X   X   X   X   X
C    …   …   X   X   X   X
D    …   …   …   X   X   X
E    …   …   …   …   X   X
F    …   …   …   …   …   X

14 Approach – Common Substrings Algorithm
Iteration 0
Malware A: 00 01 02 03 …
Malware B: 01 02 06 02 …

                Malware A
                00    01    02       03
Malware B  01
           02
           06
           02

15 Approach – Common Substrings Algorithm
Iteration 1
Malware A: 00 01 02 03 …
Malware B: 01 02 06 02 …

                Malware A
                00    01    02       03
Malware B  01         01
           02
           06
           02

16 Approach – Common Substrings Algorithm
Iteration 2 – match found, merge with upper left corner
Malware A: 00 01 02 03 …
Malware B: 01 02 06 02 …

                Malware A
                00    01    02       03
Malware B  01         01
           02               01,02
           06
           02

17 Approach – Common Substrings Algorithm
Final Iteration
Malware A: 00 01 02 03 …
Malware B: 01 02 06 02 …

                Malware A
                00    01    02       03
Malware B  01         01
           02               01,02
           06
           02               02

We have 2 common substrings (01,02 and 02). We only keep those with minimum substring length 2
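The table-filling walk-through above can be sketched as the standard longest-common-substring dynamic program, modified to report every maximal common run of at least a minimum length. This is an illustrative sketch under that assumption, not the presenter's code; the name `common_substrings` and the parameter `min_len` are hypothetical.

```python
def common_substrings(trace_a, trace_b, min_len=2):
    """Return all maximal common substrings with at least min_len events.

    run[j + 1] holds the length of the common run ending at trace_a[i] and
    trace_b[j]; a match extends the run from the upper-left (diagonal) cell.
    """
    found = set()
    prev = [0] * (len(trace_b) + 1)
    for i, event_a in enumerate(trace_a):
        run = [0] * (len(trace_b) + 1)
        for j, event_b in enumerate(trace_b):
            if event_a != event_b:
                continue
            run[j + 1] = prev[j] + 1  # merge with the upper-left cell
            # The run is maximal if the next diagonal pair does not match.
            at_end = i + 1 == len(trace_a) or j + 1 == len(trace_b)
            if at_end or trace_a[i + 1] != trace_b[j + 1]:
                length = run[j + 1]
                if length >= min_len:
                    found.add(tuple(trace_a[i - length + 1:i + 1]))
        prev = run
    return found


malware_a = ["00", "01", "02", "03"]
malware_b = ["01", "02", "06", "02"]

print(sorted(common_substrings(malware_a, malware_b, min_len=1)))
# [('01', '02'), ('02',)]
print(sorted(common_substrings(malware_a, malware_b, min_len=2)))
# [('01', '02')]
```

Only two rows of the table are kept in memory at a time, so space grows with the length of one trace while time still grows with the product of both lengths.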

18 Approach – Common Substrings Algorithm
Selecting which Common Substrings to keep

                Malware A
                00    01    02       03
Malware B  01         01
           02               01,02
           06
           02               02

We have 2 common substrings. We only keep those with minimum substring length 2
Common Substrings Matrix (the kept substring 01,02 goes into the cell for Malware A / Malware B; X marks unused cells)

     A       B   C   D   E   F
A    X       X   X   X   X   X
B    01,02   X   X   X   X   X
C                X   X   X   X
D                    X   X   X
E                        X   X
F                            X

19 Approach – Common Substrings Algorithm
Unique common substrings are merged
[Figure: common substrings matrix for Malware A–F with its pairwise cells filled with event sequences such as 01,02 and 02,03,04]
Small set (<100KB) common substrings: 03,02,20,40,35; 03,02,02,02,03; 01,02,02; 00,02; 03,02; …
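Filling the pairwise matrix and merging its cells into one set of unique sequences might look like the sketch below. To stay self-contained it uses a brute-force stand-in (`windows`) that intersects all contiguous sequences of at least `min_len` events, so it also reports non-maximal sequences; it is not the table-based algorithm from the slides, and all names here are hypothetical.

```python
from itertools import combinations


def windows(trace, min_len):
    """All contiguous event sequences of length >= min_len (brute force)."""
    return {tuple(trace[i:j])
            for i in range(len(trace))
            for j in range(i + min_len, len(trace) + 1)}


def common_substrings_matrix(traces, min_len=2):
    """Map each unordered pair of trace names to their common sequences."""
    return {(a, b): windows(traces[a], min_len) & windows(traces[b], min_len)
            for a, b in combinations(sorted(traces), 2)}


def merge_unique(matrix):
    """Union of all cells, so each common sequence is stored only once."""
    return set().union(*matrix.values())


# Trace prefixes from slide 5, used here only as sample input.
traces = {
    "A": ["00", "01", "02", "03"],
    "B": ["01", "02", "03", "02"],
    "C": ["04", "02", "05", "01", "02"],
}
matrix = common_substrings_matrix(traces)
merged = merge_unique(matrix)
print(sorted(merged))
# [('01', '02'), ('01', '02', '03'), ('02', '03')]
```

Each pair is computed once (one triangle of the matrix), and storing the union rather than every cell is what keeps shared sequences from being counted repeatedly.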

20 Approach – Common Substrings Algorithm
Doesn't that take a lot of space?
–Many shared common substrings
–Total size of all unique common substrings was 25 MB
Doesn't that take a lot of processing time?
–Can be run in parallel on separate processes or threads
–GPU

21 Approach
Find and remove common substrings in large set (size >= 100KB)
Small set (<100KB) common substrings: 03,02,20,40,35; 03,02,02,02,03; 01,02,02; 00,02; 03,02; …

Trace        Before removal    After removal   Shared
Malware AA   00 02 03 …        02 03 …         40%
Malware BB   00 01 02 …        00 …            30%
Malware CC   00 01 03 02 …     00 01 …         50%

22 Approach
Find and remove common substrings in large set (size >= 100KB)
Small set (<100KB) common substrings: 03,02,20,40,35; 03,02,02,02,03; 01,02,02; 00,02; 03,02; …

Trace        Before removal    After removal   Shared
Malware AA   00 02 03 …        02 03 …         40%
Malware BB   00 01 02 …        00 …            30%
Malware CC   00 01 03 02 …     00 01 …         50%

Average = 40%

23 Approach
Find and remove common substrings in large set (size >= 100KB)
Small set (<100KB) common substrings: 03,02,20,40,35; 03,02,02,02,03; 01,02,02; 00,02; 03,02; …

Trace        Before removal    After removal   Shared
Malware AA   00 02 03 …        02 03 …         40%
Malware BB   00 01 02 …        00 …            30%
Malware CC   00 01 03 02 …     00 01 …         50%

This process was run several times with minimum substring lengths from 2 to 100
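The removal step and the "% shared" figure can be sketched as follows. The slides do not say how overlapping matches are counted, so this sketch assumes that every event covered by any occurrence of any known common substring counts as shared; `remove_common` and the sample trace are hypothetical.

```python
def remove_common(trace, substrings):
    """Drop every event covered by an occurrence of a known common substring.

    Returns (remaining_events, shared_fraction). Overlapping occurrences are
    all marked as covered, which is an assumption of this sketch.
    """
    covered = [False] * len(trace)
    for sub in substrings:
        n = len(sub)
        for i in range(len(trace) - n + 1):
            if tuple(trace[i:i + n]) == tuple(sub):
                covered[i:i + n] = [True] * n
    remaining = [event for event, hit in zip(trace, covered) if not hit]
    shared = sum(covered) / len(trace) if trace else 0.0
    return remaining, shared


# Hypothetical large-set trace and a single known common substring.
large_trace = ["00", "01", "02", "03", "05"]
known = {("01", "02")}

remaining, shared = remove_common(large_trace, known)
print(remaining)        # ['00', '03', '05']
print(f"{shared:.0%}")  # 40%
```

Averaging `shared` over all large-set traces, once per minimum length, would give one data point per run of the experiment described above.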

24 Results
Analyst's dream: Many long common substrings are shared with the larger set

25 Results
[Figure: results, with regions labeled A, B and C]
A - Not too interesting: finding common pairs of instructions is expected and will not reduce redundant analysis by much

26 Results
[Figure: results, with regions labeled A, B and C]
B - Indicates that analyzing the small traces can reduce the analysis of the larger set by about half

27 Results
[Figure: results, with regions labeled A, B and C]
C - Some reassurance that the dataset was reasonably diverse

28 Contributions
–The common substring algorithm is capable of identifying similarities in dynamic traces of malware
–Redundant event sequences can be identified to reduce analysis
–Commonalities are not limited to short event sequences

29 Future Work
–Use behavior templates
  For example: regular expressions to identify recurring sequences (5 vs. 10 sleep events)
–Develop a user interface
–Optimization
  GPU
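One way a regular-expression behavior template could treat 5 and 10 sleep events as the same behavior is to collapse runs of a repeated event before comparing traces. This is only a sketch of the idea listed as future work: it assumes two-digit event codes and that code 03 is the Sleep event, and `to_template` is a hypothetical name.

```python
import re

# Collapse any run of two or more identical two-digit event codes into "NN+".
RUN = re.compile(r"\b(\d\d)(?: \1)+\b")


def to_template(trace):
    """Rewrite a trace as a string with repeated events collapsed."""
    return RUN.sub(r"\1+", " ".join(trace))


five_sleeps = ["01"] + ["03"] * 5 + ["02"]
ten_sleeps = ["01"] + ["03"] * 10 + ["02"]

print(to_template(five_sleeps))  # 01 03+ 02
print(to_template(five_sleeps) == to_template(ten_sleeps))  # True
```

After this normalization the two traces become identical strings, so a substring comparison would report them as one shared sequence instead of two different ones.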

30 Questions

31 Sample Common Substrings
Retrieve file from server and replace system file
–Load library
–Connect
–Download
–Check if exists
–Remove
–Copy
–Remove evidence

32 Dataset Reference
http://pi1.informatik.uni-mannheim.de/malheur/

