A Comprehensive Approach for Malicious Javascript Detection EJ Jung 12/18/09 with Peter Likarish, Insoon Jo in The 4th International Malicious and Unwanted.


1 A Comprehensive Approach for Malicious JavaScript Detection
EJ Jung, 12/18/09
with Peter Likarish, Insoon Jo
in The 4th International Malicious and Unwanted Software (Malware 2009)

2 Why JavaScript?
- >60% of Internet attacks are on web apps [sans09]
  - SQL injection, cross-site scripting (XSS)
- XSS is the most prevalent bug on the web
  - drive-by download, malicious advertisements, …
  - take over the user’s browser using JavaScript
  - cross-site request forgeries (CSRF)
    - force the browser to execute commands without the user’s consent

3 What has been done before?
- Blacklist-based approaches
  - profiles from known malicious JavaScript
  - domain names and URLs of known bad websites
  - most scanners adopt this
- Sandbox-based approaches
  - run in a virtual machine and check the state change
  - honey* approaches to find new malware
- Limited-capability approaches
  - run with limited function calls
  - use only a subset of JavaScript

4 Limitations
- Blacklist-based approaches
  - zero-day vulnerability
  - cannot respond to new ones spontaneously
- Sandbox-based approaches
  - delay before execution
  - imperfect sandbox might leak
- Limited-capability approaches
  - compatibility issues

5 Good and bad JavaScript
- Clue: Obfuscation!
  - >90% in our dataset

6 De-obfuscation?
- Why not de-obfuscate, then check against a blacklist?
  - complete de-obfuscation is extremely difficult
  - we do use partial de-obfuscation for URL extraction
  - still vulnerable to 0-day attacks
- Detection only needs to know that obfuscation exists
- Good and obfuscated code?
  - copyright, tamper-proofing, protection against reverse-engineering
  - other features to reduce false positives

7 Our approach
- Comprehensive framework consists of
  - a targeted web crawler
  - URL extraction & feedback
  - JavaScript classifiers
- Classifier benefits
  - mitigate 0-day vulnerability
  - smaller delay
  - compatibility with legacy code

8 Preliminaries on classifiers
- Classifiers “learn” from a training set how to classify
  - is this script benign or malicious?
  - probabilistic analysis, decision tree, rule induction, hyperplane, ...
- Example classifier: Naive Bayes
  - widely used in spam filtering

9 Classifier evaluation
- Confusion matrix [thanks to Prof. Press]

10 Precision/NPP
- Precision
  - if the classifier says malicious, how much can we trust this decision?
  - precision = tp/(tp+fp)
  - the higher the precision is, the tougher we can be on the positives
- Negative Predictive Power (NPP)
  - if the classifier says benign, how much can we trust this decision?
  - NPP = tn/(tn+fn)
  - the higher the NPP is, the less risk we take in letting this script run
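The two formulas on this slide can be checked with a few lines of Python. This is a minimal sketch; the counts below are made-up illustration values, not results from the talk.

```python
def precision(tp: int, fp: int) -> float:
    """Share of scripts flagged malicious that really are malicious."""
    flagged = tp + fp
    return tp / flagged if flagged else 0.0


def npp(tn: int, fn: int) -> float:
    """Negative predictive power: share of scripts passed as benign that really are benign."""
    passed = tn + fn
    return tn / passed if passed else 0.0


# Hypothetical confusion-matrix counts, chosen only for illustration.
tp, fp, tn, fn = 18, 2, 970, 10
print("precision:", precision(tp, fp))
print("NPP:", npp(tn, fn))
```

With a heavily imbalanced sample (few malicious scripts among many benign ones), NPP stays close to 1 even for a weak classifier, so precision is usually the more telling of the two numbers.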

11 How to get good classifiers?
- Given a word “stock” in an email, what is the probability of this being spam?
  - we can compute these from the sample set of emails
  - the closer the sample set is to the real Internet, the better this classifier gets
  - -> importance of crawler
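The "stock" question is an application of Bayes' rule, and can be sketched on a toy sample as follows. The function name and the seven example messages are hypothetical, invented for illustration only.

```python
def p_spam_given_word(word, spam_docs, ham_docs):
    """Estimate P(spam | word appears) from two labelled samples via Bayes' rule."""
    total = len(spam_docs) + len(ham_docs)
    p_spam = len(spam_docs) / total
    p_ham = len(ham_docs) / total
    # Likelihoods: how often the word shows up in each class.
    p_word_spam = sum(word in d.lower().split() for d in spam_docs) / len(spam_docs)
    p_word_ham = sum(word in d.lower().split() for d in ham_docs) / len(ham_docs)
    evidence = p_word_spam * p_spam + p_word_ham * p_ham
    return (p_word_spam * p_spam) / evidence if evidence else 0.0


# Toy sample set (made up): 3 spam and 4 legitimate messages.
spam = ["buy stock now", "hot stock tip", "cheap pills"]
ham = ["meeting at noon", "stock report attached", "lunch tomorrow", "see you soon"]
print(p_spam_given_word("stock", spam, ham))
```

The estimate depends entirely on the sample: if the sample over-represents spam relative to real traffic, the prior and therefore the result are skewed, which is the slide's argument for a crawler that collects representative data.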

12 Targeted Crawls
- Based on Heritrix, an open-source crawler
- Initial seeds from popular and blacklisted domains
- Alexa top 500
  - top 500 websites with the most traffic
  - may include some malicious scripts but mostly benign
- Blacklisted domains
  - malekal.com, malwareurl.com
- Feedback from newly found malicious scripts
  - extract URLs from redirections and downloads
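The feedback step can be pictured as pulling URLs out of a script flagged as malicious and queueing unseen hosts for the next crawl. The sketch below is an assumption about how such a loop might look; it is not Heritrix's API (Heritrix is a Java crawler), and the regular expression, function names, and example hosts are hypothetical.

```python
import re
from collections import deque
from urllib.parse import urlparse

# Simplistic URL pattern (assumption): only finds URLs present in plain text,
# so obfuscated scripts would need partial de-obfuscation first.
URL_RE = re.compile(r"https?://[^\s\"'<>)]+")


def extract_urls(script_text):
    """Return every http(s) URL that appears literally in the script."""
    return URL_RE.findall(script_text)


def feed_back(frontier, seen_hosts, malicious_script):
    """Queue URLs from a flagged script whose host has not been crawled yet."""
    added = []
    for url in extract_urls(malicious_script):
        host = urlparse(url).netloc.lower()
        if host and host not in seen_hosts:
            seen_hosts.add(host)
            frontier.append(url)
            added.append(url)
    return added


frontier = deque(["http://seed.example/"])
seen_hosts = {"seed.example"}
script = 'window.location="http://bad.example/landing"; var s="http://cdn.example/x.js";'
print(feed_back(frontier, seen_hosts, script))
```

Tracking hosts rather than full URLs keeps one compromised domain from flooding the queue; whether the original crawler deduplicated this way is not stated on the slide.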

13 Crawled scripts

Dates | Initial seeds | # pages downloaded | # unique scripts
Jan. 26 ~ Feb. 3 | Alexa 500 | 9,028,469 | ~63 million
Jun. 2 ~ 16 | 827 blacklisted domains | 163,938 | 24,269
Jul. 16 ~ Aug | blacklisted domains | 79,696 | 7,602

- Training set: 50,000 benign + 66 malicious scripts from Feb~Mar 2009
- 65 out of 66 obfuscated

14 Is this training set good?
- 10-fold cross validation by 5,000 increments

Classifier | Precision (stdev) | Recall (stdev) | NPP (stdev)
NaiveBayes | 0.808 (0.11) | 0.659 (0.18) | 0.996 (0.0023)
REPTree | 0.884 (0.12) | 0.769 (0.17) | 0.997 (0.0022)
SVM | 0.920 (0.14) | 0.742 (0.17) | 0.997 (0.0021)
RIPPER | 0.882 (0.17) | 0.787 (0.21) | 0.997 (0.0027)
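For readers unfamiliar with 10-fold cross validation, the splitting step can be sketched as below. This is a generic illustration in plain Python, not the authors' evaluation code; how the 5,000-script increments were applied is not stated on the slide and is not modelled here.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists so that every sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of (nearly) equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


for train, test in k_fold_indices(50, k=10):
    # A classifier would be trained on `train` and scored on `test` here.
    print(len(train), len(test))
```

Averaging a metric over the ten test folds, and reporting its standard deviation, yields figures of the kind shown in the table.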

15 Feature extraction
- Identify commonly observed features of malicious JavaScript
  - manually added features (obfuscation)
  - 50 reserved JavaScript keywords
- Important features
  - human readability (obfuscation)
    - >70% alphabetical, 20% < vowels < 60%, <15 characters long, <=2 repetitions
  - eval
    - obfuscation and hiding malicious code
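The human-readability thresholds can be turned into a small check. The sketch below rests on assumptions the slide does not spell out: the test is applied per word, the vowel share is measured against the whole word, and "<=2 repetitions" is read as "no character more than twice in a row". The function names are hypothetical.

```python
import re

VOWELS = set("aeiou")


def is_human_readable(word: str) -> bool:
    """Apply the slide's four thresholds to a single word (interpretation assumed)."""
    if not word or len(word) >= 15:              # <15 characters long
        return False
    letters = [c for c in word if c.isalpha()]
    if len(letters) / len(word) <= 0.70:         # >70% alphabetical
        return False
    vowel_share = sum(c.lower() in VOWELS for c in letters) / len(word)
    if not 0.20 < vowel_share < 0.60:            # 20% < vowels < 60%
        return False
    if re.search(r"(.)\1\1", word):              # same character three times in a row
        return False
    return True


def readability_score(script: str) -> float:
    """Fraction of identifier-like tokens in a script that look human readable."""
    words = re.findall(r"[A-Za-z0-9_$]+", script)
    if not words:
        return 0.0
    return sum(is_human_readable(w) for w in words) / len(words)


print(is_human_readable("document"), is_human_readable("x9f3a0c1"))
print(readability_score("var document = x9f3a0c1;"))
```

A low score suggests machine-generated identifiers and so hints at obfuscation; as an earlier slide notes, benign code can be obfuscated too, so this would be one feature among several rather than a verdict.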

16 Feature evaluation
- Scatterplots: good vs. bad = red vs. blue

17 Helpful features

18 Detection in the real world
- Test data
  - 2 weeks’ data from malwaredomains.com
  - 24,269 unique scripts by MD5
  - 22 malicious scripts found by classifiers
    - all obfuscated
  - 2 found by the latest virus scanner

Classifier | # found | # mal | precision
NaiveBayes | | | %
REPTree (decision tree learner) | | | %
SVM | | | %
Ripper (inductive rule learner) | | | %
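Counting "unique scripts by MD5" amounts to hashing each script body and keeping one copy per digest. A minimal sketch, assuming scripts are available as strings (the helper name is hypothetical):

```python
import hashlib


def unique_by_md5(scripts):
    """Keep the first occurrence of each script, comparing by MD5 of its UTF-8 bytes."""
    seen = set()
    unique = []
    for text in scripts:
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique


crawled = ["a();", "b();", "a();"]
print(len(unique_by_md5(crawled)))
```

Hashing the raw text means two scripts that differ only in whitespace or a random token count as distinct, so the number of truly different scripts may be lower than the MD5 count suggests.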

19 Future work
- Correlation among malicious domains
  - more effective domain-based blacklist
- Language-model classifiers
- Resilience testing
  - feedback from newly found malicious scripts
  - sustain the classifiers’ accuracy
- Combine with other features
  - HTTP and connection information [Seifert08]
- Recall testing with blacklists

