Machine Learning applied to Security, Steve Poulson, 25th Feb 2010.

1 Machine Learning applied to Security Steve Poulson 25th Feb 2010

2 Security Threats – Drivers and Trends (ScanSafe Threat Center). Time spent online is creating a larger user base to steal from. Value of information transmitted: online banking, personal data theft. Communication is fragmenting: alternative platforms are growing rapidly beyond email (webmail, IM, RSS, wikis, VoIP). New platforms are less mature and less well protected, so vulnerable to attack; the more popular the application, the more attention it draws. Hackers/cyber criminals are working faster: AV signatures are failing. Threats are becoming more complex: attackers are looking for new vectors to exploit. Email security is mature and successful; threats are migrating from inbox to browser. Chart labels (axis: Complexity of Threat): Zero-Day Threats, Social Engineering, Email Viruses, Hybrid Worms, Spam, OS Vulnerabilities, Mobile Attacks, Identity Theft, Phishing, IM Threats, Web Viruses, DDoS Attacks, Spyware/Adware

3 Web Security: Trusted Sites Under Attack. Worldwide open web; dynamic and dangerous internet. Over 127m active websites (Netcraft survey). Diagram labels: Graphics, Webmail, New Web Pages, Blogs, Ad Links, Links, Comments, Banner Ads; Backdoors, Rootkits, Trojan Horses, Keyloggers, Worms. Samsung site hijacked as malware host

4 Web Security: Risks of Unfiltered Content. Up to 40% of time spent online is non-business-related (IDC). Productivity; bandwidth congestion; legal liability threat. 37% of users visited an X-rated web site from work (Gartner). Chart: Web Filtering Blocks by Category (%). The Facebook Effect: 32% of our customers are now blocking social networking sites, up from 18% last year

5 Effortless Management. Manage granular policies: directory and custom grouping, web usage quotas, schedules, 50+ URL categories, 60+ content types, custom block/allow lists, email and browser alerts. Generate reports: summary, scheduled, forensic audit, blocked/allowed traffic

6 Ease of Management + Unrivalled SLAs. Ease and speed of deployment; management portal; reporting across multiple locations; database management built in; dedicated expert 24x7x365 support; zero maintenance; automated continuous updates; no patching. 1. Availability: time our service is available to scan traffic; 99.999% guaranteed availability. 2. Latency: additional load time attributable to services; evaluated by 3rd-party analysis. 3. False Positives: pages that were blocked but should not have been; false positive rate < 0.0004%. 4. False Negatives: pages that were not blocked but should have been; false negative rate < 0.0004%. The most comprehensive Service Level Agreements available for Web security

7 Proactive Security. Acceptable, Uncategorized, Prohibited, Malicious. 1 in 5 searches yield malware or inappropriate content. Over 90% of new sites are visited as the result of an Internet search. Case study: Trojan-Download.Win32.IstBar.jl. Provides protection in the zero-hour. Proactive threat detection. The most effective scanner sits at the heart of all web traffic and analyses the largest amount of web traffic, generating the most accurate heuristics in the fastest time. Outbreak Intelligence. SearchAhead

8 Outbreak Intelligence Users are protected by several anti-virus engines at once. However, this is not sufficient due to the variety of exploits and their ability to disguise themselves (polymorphism). Outbreak Intelligence harnesses machine-learning techniques and ScanSafe's dataset to develop novel techniques to detect zero-hour attacks. Uses advanced techniques such as code emulation. However, we must always meet our maximum false-positive rate of 1/250,000 –Just 0.000004! Solutions must scale to millions of requests in real time

9 Industrial Development Constraints Getting FP / FN right – customer expectation Deadlines :( Solutions must scale to 250 million requests per day (and growing) –Involves lots of approximations –Lookup tables in place of actual functions –Fast data structures –Constrains the choice of algorithms, e.g. neural nets or naïve Bayes instead of SVMs

10 Industrial Development Constraints Dataset is continually changing –As the nature of interests across the web –And the vectors targeted by attackers –Constantly change. E.g. the latest Quicktime vulnerability targets in-request headers, bypassing virus detectors entirely. Hence a preference for online models which can be continually updated, rather than those which have to be trained in batch.
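
The preference for online models can be illustrated with a small sketch. The class below is a hypothetical multinomial naïve Bayes whose counts are updated one labelled example at a time, so no batch retraining is needed; the class name, the add-one smoothing and the token lists are assumptions made for illustration, not details taken from the talk.

```python
import math
from collections import Counter, defaultdict


class OnlineNaiveBayes:
    """Multinomial naive Bayes that learns incrementally (illustrative only)."""

    def __init__(self, smoothing=1.0):
        self.smoothing = smoothing
        self.class_counts = Counter()             # examples seen per label
        self.token_counts = defaultdict(Counter)  # token counts per label
        self.vocabulary = set()

    def update(self, tokens, label):
        """Fold one labelled example into the running counts."""
        self.class_counts[label] += 1
        self.token_counts[label].update(tokens)
        self.vocabulary.update(tokens)

    def log_scores(self, tokens):
        """Unnormalised log P(label) + sum of log P(token | label)."""
        total = sum(self.class_counts.values())
        vocab_size = max(len(self.vocabulary), 1)
        scores = {}
        for label, seen in self.class_counts.items():
            counts = self.token_counts[label]
            denom = sum(counts.values()) + self.smoothing * vocab_size
            score = math.log(seen / total)
            for token in tokens:
                score += math.log((counts[token] + self.smoothing) / denom)
            scores[label] = score
        return scores

    def predict(self, tokens):
        scores = self.log_scores(tokens)
        return max(scores, key=scores.get)


model = OnlineNaiveBayes()
model.update(["eval", "unescape", "document"], "malicious")
model.update(["document", "getElementById"], "benign")
print(model.predict(["eval", "unescape"]))
```

Because `update` only increments counters, new traffic can be folded in as it arrives, which is the property the slide asks for.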

11 Dataset Scan approx. 250 million web-requests every day From 45 different countries All traffic is logged for several months We can also archive traffic as it travels through our servers –Which means we can replay hacks several days after the event to investigate them

12 Techniques Employed Supervised Learning –Support Vector Machines for classification and anomaly detection –Some use of Neural Networks –Various probabilistic models such as Naïve Bayes variations Unsupervised Learning –HMMs and more complex variations thereof –Various clustering algorithms, MoG, KNNs –Dimensionality Reduction Algorithms (KPCA) Other –Adaboost, mixtures of experts Disclaimer –Not all are used in end products, and unfortunately we cannot say which techniques are used in which applications.

13 Applications of Machine Learning Inappropriate Web Content Drive-by attacks (first step in an attack) –Malicious JavaScript and other scripts –Malicious Non-Executable Files Actual attacks –Malicious Executable Files Phishing –Use third-party databases –Use models that generate a probability based on URL, request and time of a phishing attack Reputation –Use history of blocks for a URL, the probability of it being a phished URL, and other information, to derive a prior probability of it hosting malware to govern the decision model generating actions from the results of other classifiers

14 Inappropriate Content Basically just document classification. Want to stop bad sites by content – porn, hate,... Good classifier: naïve Bayes – multinomial, Bernoulli, multinomial mixture model. These have problems; in practice add IR techniques such as TF/IDF. SVM approaches are better. Also topic-based – LSA / LDA
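
As a rough illustration of the TF/IDF weighting mentioned on this slide, the sketch below computes term frequency times inverse document frequency for a tiny tokenized corpus. The function name and the particular variant (length-normalised term frequency, unsmoothed logarithmic IDF) are assumptions; several other variants are in common use.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))  # count each term once per document

    weighted = []
    for doc in documents:
        counts = Counter(doc)
        length = len(doc) or 1
        weighted.append({
            term: (count / length) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weighted


docs = [["free", "poker", "poker"], ["free", "news"]]
for weights in tf_idf(docs):
    print(weights)
```

Terms that occur in every document (here "free") get weight zero, so the classifier's attention shifts to terms that discriminate between pages.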

15 Malicious JavaScript Normal document classification works on the presence of words in files. It's also possible to encapsulate other information in models –E.g. naïve Bayes classifiers for email use pseudo-words like sender-tld:info, sender-tld:com and address-known:false, address-known:true to improve accuracy. We use similar methods with JavaScript: we extract words (though not all words) and other features of interest, and feed these to a model
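
A minimal sketch of the pseudo-word idea applied to scripts follows. The specific pseudo-words (`host-tld:`, `uses-eval:`, `long-string:`), the word-length cut-off and the function name are all hypothetical choices for illustration; the talk does not say which features are actually extracted.

```python
import re

IDENTIFIER = re.compile(r"[A-Za-z_$][A-Za-z0-9_$]*")
LONG_STRING = re.compile(r"""(['"])[^'"]{200,}\1""")


def script_features(source, host, max_word_length=20):
    """Mix ordinary words with pseudo-words describing the script's context."""
    words = IDENTIFIER.findall(source)
    # Keep only some words: very long identifiers are usually machine-generated.
    features = [w for w in words if len(w) <= max_word_length]
    # Pseudo-words, in the style of sender-tld:info for email.
    tld = host.rsplit(".", 1)[-1].lower()
    features.append("host-tld:" + tld)
    features.append("uses-eval:" + str("eval" in words).lower())
    features.append("long-string:" + str(bool(LONG_STRING.search(source))).lower())
    return features


print(script_features('eval(unescape("%41"))', "example.info"))
```

The resulting list can be fed to any bag-of-words model, since pseudo-words are treated exactly like ordinary tokens.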

16 Malicious JavaScript Complications arise due to the extreme use of obfuscation techniques by attackers –And also by legitimate vendors (e.g. Google) –And by large Web 2.0 libraries v46f658f5e2260(v46f658f5e3226){ function v46f658f5e4207 () {return 16;} return(parseInt(v46f658f5e3226,v46f658f5e4207()));}function v46f658f5e61f4(v46f658f5e7174){ function v46f658f5ea0cd () {return 2;} var v46f658f5e813e=\'\';for(v46f658f5e9105=0; v46f658f5e9105<v46f658f5e7174.length; v46f658f5e9105+=v46f658f5ea0cd()){ v46f658f5e813e+=(String.fromCharCode(v46f658f5e2260(v46f658f5e7174.substr(v46f658f5e9105, v46f658f5ea0cd()))));}return v46f658f5e813e;} document.write(v46f658f5e61f4(\'3C5343524950543E77696E646F772E7374617475733D2 \')); The above is JavaScript, but where are the features? –An exercise for the reader!
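
Stripped of its generated names, the snippet on this slide walks the string two characters at a time, parses each pair as a base-16 number and turns it into a character. The Python sketch below reproduces that decoding step so the hidden payload can be inspected; `decode_hex_pairs` is a hypothetical helper name, and the input is the (truncated) hex string exactly as it appears on the slide.

```python
def decode_hex_pairs(payload):
    """Decode a string of hex digit pairs, ignoring a trailing odd digit."""
    payload = payload.strip()
    usable = len(payload) - (len(payload) % 2)
    return "".join(
        chr(int(payload[i:i + 2], 16)) for i in range(0, usable, 2)
    )


print(decode_hex_pairs("3C5343524950543E77696E646F772E7374617475733D2 "))
# prints: <SCRIPT>window.status=
```

The decoded text is itself script, which is one reason token-based features are hard to extract from the outer layer alone.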

17 function startAudioFile() { try { var mmed = document.createElement("object"); mmed.setAttribute("classid", "clsid:77829F14-D911-40FF-A2F0-D11DB8D6D0BC"); var mms=''; for(var i=0;i<4120;i+"\x0c\x0c\x0c\x0c""\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c") { "\x0c\x0c\x0c\x0c"= "A"; } setSpId(3); "\x0c\x0c\x0c\x0c"="\x0c\x0c\x0c\x0c"; mmed.SetFormatLikeSample(mms); } catch(e) { } }; Generating features (canonicalizing): tokenize and count frequencies to construct a |V|-dimensional vector. Token vector: function 1, startAudioFile 1, try 1, var 1, mmed 1, document 1, createElement 1, alert 0, foo 0
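
The tokenize-and-count step can be sketched as follows. The vocabulary order and the identifier regex are assumptions, and the sample source is a shortened, cleaned-up fragment of the slide's script rather than the full exploit.

```python
import re
from collections import Counter


def count_vector(source, vocabulary):
    """Map a script to a |V|-dimensional vector of token counts."""
    tokens = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", source)
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


vocabulary = ["function", "startAudioFile", "try", "var", "mmed",
              "document", "createElement", "alert", "foo"]
source = ('function startAudioFile() { try { var mmed = '
          'document.createElement("object"); } catch(e) { } }')
print(count_vector(source, vocabulary))
# prints: [1, 1, 1, 1, 1, 1, 1, 0, 0]
```

Tokens outside the vocabulary (here `object`, `catch`, `e`) are simply dropped, which keeps the vector length fixed regardless of the script.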

18 Case Study: India Times Malware Cocktail. 25 Oct 07: first malware detected for ScanSafe customers. STAT team investigating OI blocks from certain pages on the India Times website, ranked 483rd by traffic. Impacted pages contain script pointing to a remote site containing more iframes pointing to two further sites. One iframe points to an encrypted script which exploits multiple vulnerabilities. Successful exploit results in massive download of malware and assorted other files - over 434 files. Installed malware includes a cocktail of downloader/dropper Trojans, malicious binaries, scripts, cookies and other non-binaries. STAT tested binaries through VirusTotal, and overall detection among signature-based AV vendors is low. India Times notified immediately by STAT to prevent further infection - ScanSafe customers continued to be protected. "This starts an automatic chain of exploit, and all of it was invisible to the user" Mary Landesman - senior researcher

19 Malicious Non-Executable Files Almost no-one opens executables from odd sources any more So instead people use drive-by attacks –They serve a normal file (JavaScript, JPEG, Quicktime movie, animated cursor) –Which is crafted to exploit a vulnerability in a viewer (Internet Explorer, Quicktime, a system library that a viewer depends on) –Which causes code embedded within the file to be executed –Which then downloads The actual executable Or another program to download the main payload.

20 Malicious Non-Executable Files We've already covered JavaScript. But there are a lot of file formats out there. It's not feasible for us to figure out the formats for all these files ourselves –So we have to write an application that can learn a file format. In the case of zero-day attacks, we have no data to compare against –So we can't just create and train a simple binary classifier

21 Malicious Non-Executable Files We'll deal with the second element first. If we can't train a binary classifier –We have to train a unary classifier. Basically this is anomaly detection –Already used in business to help detect fraud –Typically define (sometimes implicitly) a probability distribution over all possible data, and so generate a probability of a particular datum being normal. Use some decision function based on this probability to decide whether or not to block
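
The unary-classifier recipe (fit a distribution to normal data only, then threshold the probability) can be sketched with a deliberately simple model. The byte-frequency distribution, the add-one smoothing and the margin-based decision function below are stand-in assumptions chosen for brevity, not the model used in the product.

```python
import math
from collections import Counter


class ByteFrequencyModel:
    """Trains on normal samples only; scores data by average byte log-probability."""

    def fit(self, samples):
        counts = Counter()
        for sample in samples:
            counts.update(sample)  # iterating bytes yields ints 0..255
        total = sum(counts.values())
        # Add-one smoothing so unseen byte values keep a small probability.
        self.log_prob = [
            math.log((counts[b] + 1) / (total + 256)) for b in range(256)
        ]
        # Lowest score among the training data defines what "normal" looks like.
        self.threshold = min(self.score(s) for s in samples)
        return self

    def score(self, data):
        if not data:
            return 0.0
        return sum(self.log_prob[b] for b in data) / len(data)

    def is_anomalous(self, data, margin=0.5):
        """Decision function: block if well below anything seen in training."""
        return self.score(data) < self.threshold - margin


model = ByteFrequencyModel().fit([b"hello world", b"hello there", b"world hello"])
print(model.is_anomalous(b"hello world"), model.is_anomalous(b"\x90" * 20))
```

No malicious examples are needed at training time, which is the point when the attack is a zero-day.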

22 Malicious Non-Executable Files However we also have to automatically extract features from the file –Could use kernel methods (1-class SVM, bounding hypersphere) –But developing a kernel to capture the latent structure is not easy –And may be expensive to execute Could use probabilistic methods –HMMs are good for sequences But poor at capturing long-range correlations –Algorithms exist for capturing grammars probabilistically But are difficult to implement And may also be expensive in terms of runtime. Another exercise for the reader!

23 Malicious Executable Files The final stage of an attack is downloading an executable Typically blocked using signatures –Effectively quite advanced regular expressions Virus writers now release several variations of their virus over its life-time And release viruses that change themselves as they propagate

24 Malicious Executable Files This all makes signature-based approaches increasingly infeasible –F-Secure now checks every file against 500,000 signatures –McAfee now checks every file against 360,000 signatures The rate at which variants of viruses come out is growing rapidly –The Storm worm launched separate Christmas and New Year versions of its attack within days of each other Vendors are struggling to develop techniques to detect variants using their existing technologies –But continuing to add separate signatures for each new variant is not feasible

25 Malicious Executable Files We seek to investigate machine learning techniques to look into this. Several approaches have been used in the past –Typically binary classifiers using existing virus samples –Techniques include decision trees, self-organising maps, naïve Bayes classifiers, neural networks, SVMs and others –Features are usually library includes, strings, or hex-sequences selected using information theoretic techniques (e.g. information gain) –Some break the executable into a graph, where nodes correspond to blocks of code (most of which are identical between variations) and perform analysis on the graphs to determine similarity.

26 Malicious Executable Files Windows Portable Executable (PE) is a rich format, starts with magic number MZ so easy to detect. This means we can quickly extract features without resorting to disassembly or flow graph construction. Some notable features: 60% of recent malware is obfuscated. We determined that if an executable is obfuscated, there is a greater than 95% probability that it is malware. An executable consists of sections, such as header, text, code and so on. There are generally fewer sections in malicious files than in non-malicious ones. In our analysis, more than 70% of the malware samples consisted of two or three sections, while more than 70% of non-malicious files consisted of four or five sections. Another notable feature relates to peculiarities in the executable structure – for example, some sections in the executable may not be aligned properly. In our analysis, more than 78% of malware revealed an anomaly in the executable structure, while only 5% of non-malicious samples had an anomaly in their structure. If an anomaly exists, there is a more than 93% chance that the sample is malicious. As part of our investigations we also calculated statistics relating to the importing of DLL files. For example, if an executable imports system32.dll, then the sample has a more than 77% chance of being malware and if it imports kernel32.dll, then the sample has a more than 67% chance of being malware.
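
As a hedged sketch of how cheaply the section-count feature can be read without disassembly: a PE file begins with `MZ`, stores the offset of the PE header as a 4-byte little-endian value at offset 0x3C, and the COFF header that follows the `PE\0\0` signature holds a 2-byte `NumberOfSections` field. The function name and the decision to return `None` for non-PE input are illustrative assumptions.

```python
import struct


def pe_section_count(data):
    """Return the number of sections in a PE image, or None if not a PE file."""
    if len(data) < 0x40 or data[:2] != b"MZ":
        return None
    # e_lfanew: offset of the PE signature, stored at 0x3C in the DOS header.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if pe_offset + 8 > len(data) or data[pe_offset:pe_offset + 4] != b"PE\x00\x00":
        return None
    # COFF header: Machine (2 bytes), then NumberOfSections (2 bytes).
    (n_sections,) = struct.unpack_from("<H", data, pe_offset + 6)
    return n_sections


# Build a minimal fake header to exercise the parser.
header = bytearray(0x80)
header[:2] = b"MZ"
struct.pack_into("<I", header, 0x3C, 0x40)
header[0x40:0x44] = b"PE\x00\x00"
struct.pack_into("<H", header, 0x46, 3)
print(pe_section_count(bytes(header)))
```

A count of two or three could then be passed to the classifier as one feature among several, in line with the statistics quoted on the slide.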

27 Malicious Executable Files As discussed there are many classification algorithms at our disposal. Currently we are using the naive-Bayes classification algorithm as it is both accurate and simple to implement. The simplified algorithm (assuming that there are only two classes: malware and non-malware) is given in Equation (1). Where x = [x1, x2, · · ·, xn] is an array of selected features from an executable, P(c|x) is the a posteriori probability that the executable with feature set x is in class c, and P(x|c) is the probability of x occurring in class c.
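
Equation (1) is not reproduced in this transcript. Under the definitions given on the slide, the standard naïve Bayes form would read as follows; the prior P(c), the evidence term P(x) and the factorisation over individual features are the usual textbook assumptions rather than something the slide confirms.

```latex
P(c \mid x) = \frac{P(x \mid c)\, P(c)}{P(x)},
\qquad
P(x \mid c) = \prod_{i=1}^{n} P(x_i \mid c),
\qquad
c \in \{\text{malware}, \text{non-malware}\}
```

With only two classes, the sample is assigned to whichever class has the larger posterior, so P(x) never needs to be computed explicitly.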

28 Malicious Executable Files We used one group of non-malware and 28 released malware groups that had been detected by our analysis team in recent months. Each group contained around 150 to 300 samples. We plotted the results of our experiment. A smooth, dashed curve shows the recognition [...] We are consistently getting more than 90% detection accuracy for malware. The FPR of our system is around 10% and we are trying to reduce this by extracting new features and by developing a new feature selection algorithm.

29 Control flow Graph Can be matched by a graph edit distance and nearest neighbour classifier – slow :(

30 Malicious Executable Files Much like early attempts to classify email using naïve Bayes –Which concentrated only on text –Until someone thought to use the entire context of the email, such as when it was sent, from whom, the domain and TLD of the email address etc. –Which brings us to

31 Website Reputation Classifier Gather information from context –Time of request –TLD, domain of server –Type of URL (IP Address, Domain Name) –Geographic location of server –Details of request (drive-bys may not simulate a browser) –Details of response (server may be misconfigured) –And any other information And use it to alter the prior probability of malware from the default 0.5 Which may help control the FP rate.
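
How a context-derived prior shifts the final decision can be shown with Bayes' rule in odds form. The function name and the idea of summarising the content classifiers as a single likelihood ratio are assumptions made for this sketch.

```python
def posterior_with_prior(likelihood_ratio, prior=0.5):
    """Combine classifier evidence with a reputation-based prior.

    likelihood_ratio: P(evidence | malware) / P(evidence | clean)
    prior:            P(malware) before looking at the content (0 < prior < 1)
    """
    prior_odds = prior / (1.0 - prior)
    posterior_odds = likelihood_ratio * prior_odds
    return posterior_odds / (1.0 + posterior_odds)


# Same evidence, different reputations.
print(posterior_with_prior(4.0))              # default prior of 0.5
print(posterior_with_prior(4.0, prior=0.01))  # well-established, trusted site
```

With the default prior the evidence alone gives a posterior of 0.8; for a site whose context suggests a 1% prior, the same evidence yields under 4%, which is how reputation can hold down the false-positive rate.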

32 Any Questions?

33 Problem Overview Attackers no longer rely on users launching executables Rely on drive-by download techniques to launch an executable without user involvement Examples include –JavaScript exploiting browser vulnerabilities to launch remote executables –Website content (ANIs, WMFs, etc.) exploiting browser and / or operating system vulnerabilities to launch remote executables

34 Problem Overview Things to look out for –Buffer overflows: extraordinarily long field values –Integer overflows: value encoded in 4 bytes is very large: Hard to spot! But could be found by the absence of leading zeros in e.g. 4 byte length fields –Exploit Code May not resemble expected data However raw data in some formats (JPEG, MP3) may be relatively indistinguishable from machine-code.
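
The leading-zeros heuristic for length fields can be sketched as below. The big-endian byte order, the comparison against the bytes remaining in the file and the function name are assumptions; real formats differ in endianness and in what a length field measures.

```python
import struct


def suspicious_length_field(field, bytes_remaining):
    """Flag a 4-byte length field that looks implausibly large."""
    if len(field) != 4:
        raise ValueError("expected a 4-byte field")
    (value,) = struct.unpack(">I", field)  # assumed big-endian
    # Plausible lengths are small, so the most significant byte is usually zero.
    no_leading_zero = field[0] != 0
    return no_leading_zero or value > bytes_remaining


print(suspicious_length_field(b"\x00\x00\x01\x00", 300))  # 256 bytes: plausible
print(suspicious_length_field(b"\xff\xff\xff\xf0", 300))  # near 2**32: suspicious
```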

35 Problem Specification Examine first 300 or so bytes of file Detect if it's normal –If not normal, it's an exploit System should infer file-structure itself to determine normalcy –Infeasible for us to manually break down every file-format into individually interesting features

36 Anomaly Detection Techniques used in machine learning and statistics to detect outliers: data-points (such as file content) which aren't probable (normal) Two broad approaches –Non-Probabilistic Discriminative Classifiers Learn a function that spits out positive or negative depending on some version of the data –Probabilistic Generative Classifiers Find a way of estimating the actual probability of the file being what it appears to be and use that to make a decision

37 Anomaly Detection :: Techniques Non-Probabilistic Classifiers –One-Class Support-Vector Machine (SVM) using Sequence Kernels [Trialed, not implemented] Probability Density Estimation (PDE) –Hidden Markov Model (HMM) [Implemented] –Hierarchical Hidden Markov Model (HHMM) [Not Implemented, Not Planned] –Factorial Hidden Markov Model (FHMM) [Not Implemented, but an avenue for future work]

38 Anomaly Detection :: Classifiers One-Class Support-Vector Machine (SVM) –If a binary (2-class) classifier draws a line between two classes –A unary classifier draws a circle around the data – everything outside the circle is weird. SVMs try to find the best place to place the line, and can work around errors in the dataset They store the line in terms of the inputs it crosses (the support-vectors) They minimise the number of support-vectors they have to store to represent this line. SVMs use kernels to find a way of representing the data such that it's easy to figure out where to place the line
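
The "circle around the data" picture can be made concrete with a much simpler stand-in than an SVM: take the centroid of the training points and the radius that encloses most of them. This is a plain centroid-and-radius sketch, not a one-class SVM (which would optimise the boundary in a kernel-induced feature space and keep only support vectors); the `keep` fraction and function names are assumptions.

```python
import math


def fit_sphere(points, keep=0.95):
    """Return (center, radius) enclosing roughly `keep` of the training points."""
    dim = len(points[0])
    center = [sum(p[i] for p in points) / len(points) for i in range(dim)]
    distances = sorted(math.dist(p, center) for p in points)
    index = min(len(distances) - 1, int(keep * len(distances)))
    return center, distances[index]


def is_outlier(point, center, radius):
    """Everything outside the sphere is treated as weird."""
    return math.dist(point, center) > radius


center, radius = fit_sphere([(0, 0), (1, 0), (0, 1), (1, 1)])
print(is_outlier((0.5, 0.5), center, radius), is_outlier((5, 5), center, radius))
```

Lowering `keep` tolerates a few bad training points, a crude analogue of the way an SVM can work around errors in the dataset.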

39 Anomaly Detection :: Classifiers Kernels can also be used to convert symbolic data (such as strings and sequences) into a tractable numeric form. Kernels can also be chained together to help figure out where to put the line ÿØÿà..JFIF.....H.H..ÿá.§Exif..MM.*........ [10, 23, 34, 0, 0, 0, 23, 0, 23, …, 0, 0, 2, 1]
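
One common way to turn byte sequences into numbers is the n-gram (spectrum) kernel: count every length-n substring and take the inner product of the two count vectors. The sketch below is an assumed example of such a string kernel; the talk does not state which sequence kernel was trialled.

```python
from collections import Counter


def ngram_counts(data, n=2):
    """Count every contiguous n-byte substring of `data`."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def spectrum_kernel(a, b, n=2):
    """Inner product of the n-gram count vectors of two byte strings."""
    counts_a, counts_b = ngram_counts(a, n), ngram_counts(b, n)
    return sum(count * counts_b[gram] for gram, count in counts_a.items())


print(spectrum_kernel(b"abab", b"ab"))   # shares the bigram "ab"
print(spectrum_kernel(b"abc", b"xyz"))   # nothing in common
```

A larger kernel value means the two headers share more local structure, which is the similarity an SVM then uses to place its boundary.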

40 Anomaly Detection :: Classifiers In testing performance (using string kernel) was quite poor –Needed to store a large number of support vectors to remember where the line was –Only detected buffer overflows, not integer overflows –Couldn't be re-trained on the go Arguably all these problems could be solved by a more complex kernel function –But that would increase run-time

41 Anomaly Detection: PDE Probability Density Estimation Return a probability for each file-header indicating how typical it is Approach implemented is a simple Hidden Markov Model (HMM), using various heuristics to help it fit the file-types. What is an HMM?
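
To make "return a probability for each file-header" concrete, the sketch below scores a non-empty observation sequence under an HMM using the forward algorithm in log space. The parameter layout (lists of log-probabilities, one emission dict per state) and the toy two-state model are assumptions for illustration; learning the parameters is a separate step.

```python
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    peak = max(values)
    if peak == float("-inf"):
        return peak
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def forward_log_likelihood(observations, log_start, log_trans, log_emit):
    """Return log P(observations) for a non-empty sequence.

    log_start[s]        log-probability of starting in state s
    log_trans[s][t]     log-probability of moving from state s to state t
    log_emit[s][symbol] log-probability of state s emitting symbol
    """
    n_states = len(log_start)
    alpha = [log_start[s] + log_emit[s][observations[0]] for s in range(n_states)]
    for symbol in observations[1:]:
        alpha = [
            log_sum_exp([alpha[s] + log_trans[s][t] for s in range(n_states)])
            + log_emit[t][symbol]
            for t in range(n_states)
        ]
    return log_sum_exp(alpha)


half = math.log(0.5)
score = forward_log_likelihood(
    "aab",
    log_start=[half, half],
    log_trans=[[half, half], [half, half]],
    log_emit=[{"a": half, "b": half}, {"a": half, "b": half}],
)
print(math.exp(score))
```

In this fully uniform toy model every three-symbol sequence has probability 0.125; a trained model would instead give typical headers a much higher score than malformed ones.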

42 Anomaly Detection :: HMMs Implementation Issues: –How to jointly determine the probabilities of Certain characters appearing in each stage Moving from one stage to another for all stages Answer is the Expectation Maximisation (EM) algorithm –How to figure out the structure of the model in advance Structural Learning problem is the single major problem in machine learning In our case we use heuristics based on reg-exp idea. –Variable Length Sequences Multiply result by constant multiple of probability of file size (normally distributed)
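
The variable-length correction can be sketched in log space: multiplying the sequence probability by a constant times a normal density over file size becomes an addition of logs. The function name, the `scale` constant and the example mean and standard deviation are assumptions; the slide does not give the values used.

```python
import math


def length_adjusted_score(sequence_log_prob, length, mean, std, scale=1.0):
    """Add log(scale * Normal(length; mean, std)) to a sequence log-probability."""
    log_density = (
        -0.5 * math.log(2.0 * math.pi * std ** 2)
        - (length - mean) ** 2 / (2.0 * std ** 2)
    )
    return sequence_log_prob + math.log(scale) + log_density


typical = length_adjusted_score(-10.0, length=100, mean=100, std=10)
unusual = length_adjusted_score(-10.0, length=200, mean=100, std=10)
print(typical > unusual)
```

Two files with identical content scores are thus separated when one has a size that is rare for its type.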
