Applications of Machine Learning in Cisco Web Security

Richard Wheeldon PhD BSc

Cisco Web Security
Cisco, IronPort and ScanSafe
Request time filtering
Categorization and classification
Reputation
Response time filtering
Malware types and attack vectors
Malware detection
Dynamic classification
Other challenges

The Ubiquitous Speaker Slide
Richard Wheeldon
UCL graduate in 1999
PhD from Birkbeck in 2003
Joined Cisco in December 2009
Acknowledgements: Steve Poulson, Bryan Feeney
Slides will be made available after the presentation is over.
Bryan Feeney was a UCL MSc student before joining ScanSafe. He wrote many of the original slides for this talk and much of the ScanSafe OI code. He is currently doing a PhD at UCL.

Cisco, Ironport and ScanSafe
Cisco: world's leading network company
IronPort: leader in anti-spam; provides Web Security Appliances
ScanSafe: world leader in “Security as a Service”; scans 1.8 billion web requests a day and blocks 32 million of them
Cisco: Arguably the world's largest network company, employing people with assets of over $80bn. Known mostly for routers and switches, Cisco also produces firewalls, telephones and servers. Technologies cover security, network management, building management, video conferencing (TelePresence, WebEx), instant messaging, phones and automation. Security products include firewalls, IPSs, management systems, VPNs and mobile security.
IronPort: Acquired by Cisco in 2007. Best known for IronPort AntiSpam, the SenderBase reputation service, and security appliances (ESA).
ScanSafe: Acquired by Cisco in December 2009. Two years ago, ScanSafe was scanning 7 billion requests a month; now we're up to 1.8 billion a day. More data breeds better training. Around 5000 customers, including well-known names such as ICI, Standard Chartered and BMW. A single company may have users. Blocks are up from 70 million a month to 32 million a day, based on a combination of categorization and detection of malware and other online threats. Currently peaking at a bandwidth usage of 8 gigabits per second across 1000 servers.

We’re local

Previous MSc projects
Tree Kernels for CFG similarity, Guangyan Song, 2010
Fast computation of the Kernel of a Tree and applications to Semi-Supervised Learning, Malcolm Reynolds, 2009
Comparing N-gram features for web page classification, Noureen Tejani, 2007

We’re hiring
Positions: Software Developers, QA, Operations, Research
Locations: ScanSafe UK - Bedfont Lakes, Reading, Staines, Edinburgh; Galway, EMEA, US, Worldwide
Graduate recruitment: Cisco are making 500$ investment in the UK over the next five years.

ScanSafe’s SaaS
1. Availability: time our service is available to scan traffic; 99.999% guaranteed availability
2. Latency: additional load time attributable to services; evaluated by 3rd party analysis
3. False positives: pages that were blocked but should not have been
4. False negatives: pages that were not blocked but should have been
Cloud-based scanning: content scanning with AV engines is done on over 1000 scanning servers worldwide.
Ease of management is at the heart of a managed service approach. Using ScanSafe helps remove the burden of updating software and hardware, for example the need to constantly update AV software for new signatures. However, a strength-in-depth approach with AV on the desktop is still recommended.
Zero maintenance: provides easy deployment (relative to appliance-based delivery) and automated continuous updates, with no patching.
A global presence with very high levels of redundancy helps us meet some of the toughest SLAs in the business. Some of the world’s largest telcos have partnered with us, including AT&T, Google and Sprint. We have won numerous industry awards for our services and feature in the leaders quadrant of Gartner’s surveys.
The ScanCenter portal enables the customer administrator to: view reports and configure automated reports; review statistics of all Web activity and threats blocked; create access policies and apply these to specific users or groups.
Report generation: Automated reports are available on overall traffic, bandwidth, blocked URLs, spyware and Web viruses stopped. These are complemented by a comprehensive selection of additional reports, generated daily, which provide in-depth analysis in the form of graphs, tables and exportable data files. Reporting ranges from high-level dashboard views to detailed forensic audits on specific users. This is all backed by our data warehouse of recorded traffic.
Granular policies: block / allow lists, schedules, quotas, 80 categories
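The four service measures above reduce to simple arithmetic. The sketch below is illustrative only: the function names and the example counts are assumptions, not ScanSafe figures. It shows how an availability guarantee translates into a yearly downtime budget and how false positive and false negative rates are usually computed from block decisions.

```python
def downtime_budget_minutes(availability, days=365):
    """Minutes per period that the service may be unavailable."""
    return (1 - availability) * days * 24 * 60

def error_rates(tp, fp, tn, fn):
    """tp/fn count bad pages (blocked / missed); fp/tn count good pages (blocked / allowed)."""
    return {
        "false_positive_rate": fp / (fp + tn),
        "false_negative_rate": fn / (fn + tp),
    }

# 99.999% availability leaves roughly 5.26 minutes of downtime per year.
print(round(downtime_budget_minutes(0.99999), 2))

# Hypothetical counts, purely to show the calculation.
print(error_rates(tp=90, fp=5, tn=995, fn=10))
```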

Risks of Unfiltered Content
Software threats: malware, phishing, botnets
Business threats: productivity loss, bandwidth congestion, legal liability, data leaks

Most web traffic is good
The Web vs. Email
Web | Email
Most web traffic is good | Most email is bad
Easy to find safe sites | Easy to get spam
Harder to get dangerous URLs | Harder to get examples of good mail
Blocking web sites is visible | Blocking email is invisible
Performance gain from white-listing | Performance gain from blocking
Very real-time (<2s) | Not real-time (<Nhrs)

Request time filtering
Motivation: quicker blocks save bandwidth and processing time; if the request is made, the damage may already be done.
Techniques: databases, reputation, rules, trained systems.
Bandwidth and processing time: Saving bandwidth is important for categorized media content. It is also important where bad content dominates over good content. With 7 trillion spam messages a year, the benefit of not processing that data is huge.
Data leak prevention: If malware is already installed on a client network, connections may be made out to botnets. These may leak passwords, credit card details or information about the network. Data Leak Prevention (DLP) is the general term for preventing vital data from getting out of a company’s network.

Category-based filtering
Responsible for the majority of blocks
Focused on high-risk and high-traffic sites
Team of manual categorizers focuses on recently seen, high-volume sites
Database of 10 million URLs and IP addresses covers 97% of traffic
Over 2 million known porn sites

Web Reputation
Feeds: phishing sites, malware sites
Heuristics: in spam but not in ham; age of domain registration; high traffic (e.g. Alexa 1000); scanned but never blocked
Motivation: Blacklists and whitelists aren’t always 100% accurate (or up to date). A number of heuristics indicate potential problems without being automatically malicious. If a site is seen frequently on the service but never blocked, it’s more likely to be safe. If we keep finding malware on it, it probably isn’t. Malware and phishing sites tend to be short-lived. This makes blacklists less valuable but provides a useful metric.
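One way to picture how such heuristics might be combined is a simple additive score. This is a minimal sketch, not the SenderBase or ScanSafe algorithm: the `UrlSignals` fields, the weights and the thresholds are all invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class UrlSignals:
    # Hypothetical per-site signals mirroring the heuristics on the slide.
    seen_in_spam: bool
    seen_in_ham: bool
    domain_age_days: int
    high_traffic: bool
    times_scanned: int
    times_blocked: int

def reputation_score(s):
    """Positive scores suggest a safer site, negative scores a riskier one."""
    score = 0.0
    if s.seen_in_spam and not s.seen_in_ham:
        score -= 3.0                      # in spam but not in ham
    if s.domain_age_days < 30:
        score -= 2.0                      # recently registered domain
    if s.high_traffic:
        score += 2.0                      # e.g. a top-ranked site
    if s.times_scanned >= 1000 and s.times_blocked == 0:
        score += 3.0                      # scanned often, never blocked
    if s.times_blocked > 0:
        score -= 4.0 * s.times_blocked / max(s.times_scanned, 1)
    return score

established = UrlSignals(False, True, 3000, True, 50000, 0)
fresh = UrlSignals(True, False, 3, False, 10, 5)
print(reputation_score(established), reputation_score(fresh))
```

In practice the weights would be learned from labelled traffic rather than set by hand.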

Web Reputation in the WSA

Check your reputation at senderbase.org
Mainly focused on reputation

Keyword-based URL filtering
Keyword rules: Fitness -> Health; Basketball -> Sport; Pizzeria -> Food; Restaurant -> Food; Whore -> Porn
Strange URLs: whorepresents.com, therapistfinder.com, speedofart.com, expertsexchange.com, penisland.com, powergenitalia.it
N.B. powergenitalia.com was a hoax.
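A naive substring matcher makes the problem with the strange URLs concrete. The sketch below assumes the rules are applied as plain substring checks on the hostname, which is an assumption about the mechanism; the rule table is taken from the slide and `joes-pizzeria.example` is a made-up hostname.

```python
# Keyword -> category rules from the slide.
KEYWORD_RULES = {
    "fitness": "Health",
    "basketball": "Sport",
    "pizzeria": "Food",
    "restaurant": "Food",
    "whore": "Porn",
}

def categorize_by_keyword(hostname):
    """Return the category of the first keyword found as a substring, else None."""
    host = hostname.lower()
    for keyword, category in KEYWORD_RULES.items():
        if keyword in host:
            return category
    return None

print(categorize_by_keyword("joes-pizzeria.example"))  # Food
print(categorize_by_keyword("whorepresents.com"))      # Porn, a false positive
```

The second result is wrong because the rule has no notion of word boundaries: "who represents" contains the keyword by accident.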

Recognizing Porn URLs
http://www.penisland.com
Example of the segmentation problem:
P('peni') × P('sland')
P('penis') × P('land')
P('pen') × P('island')
Extends to classification:
P('penis') × P('land') × P(porn|'penis') × P(porn|'land')
P('pen') × P('island') × P(not_porn|'pen') × P(not_porn|'island')
Text segmentation is well studied in speech recognition and NLP. It is a real-life problem for natural languages, particularly eastern and oriental ones: Japanese and Chinese don’t have spaces between words, and Arabic doesn’t have spaces between characters.
There are many problems in generating a real-world solution, caused in part by biases in the training data. For example, our categorized corpus has a bias towards high-risk areas and towards languages used mostly by our customers.
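The products above can be evaluated directly by enumerating every segmentation into known words and scoring each one under both classes. This is a minimal sketch: the vocabulary and all probabilities are made up for illustration, and P(not_porn|w) is taken to be 1 - P(porn|w).

```python
import math

# Made-up unigram probabilities and per-word class probabilities (assumptions).
WORD_PROB = {"pen": 0.004, "island": 0.002, "penis": 0.001, "land": 0.003}
P_PORN_GIVEN_WORD = {"pen": 0.01, "island": 0.02, "penis": 0.95, "land": 0.05}

def segmentations(text):
    """Yield every way of splitting text into words from the vocabulary."""
    if not text:
        yield []
        return
    for i in range(1, len(text) + 1):
        head = text[:i]
        if head in WORD_PROB:
            for rest in segmentations(text[i:]):
                yield [head] + rest

def log_score(words, porn):
    """Sum of log P(w) + log P(class|w), i.e. the log of the slide's product."""
    total = 0.0
    for w in words:
        p_class = P_PORN_GIVEN_WORD[w] if porn else 1 - P_PORN_GIVEN_WORD[w]
        total += math.log(WORD_PROB[w]) + math.log(p_class)
    return total

def best_reading(text):
    """Return (words, is_porn) for the highest-scoring segmentation and class."""
    candidates = [
        (log_score(words, porn), words, porn)
        for words in segmentations(text)
        for porn in (True, False)
    ]
    _, words, porn = max(candidates, key=lambda c: c[0])
    return words, porn

print(best_reading("penisland"))
```

With these particular numbers the "pen island" reading wins and the site is scored as not porn; different probabilities, such as those from a corpus biased towards high-risk content, could flip the result.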

Phishing and Malware Examples
Phishing examples
Malicious examples: www1.scan-projectrf.cz.cc, www1.scan-projectsi.cz.cc, www1.scan-projectst.cz.cc, www1.scan-projectte.cz.cc, www1.scan-projectti.cz.cc
We can construct a graph using a distance measure between URLs, such as edit distance or n-grams. For the phishing example, the known phish sites are labelled and PayPal sites are labelled as good. Methods exist for partitioning graphs based on labels and graph structure.
Other contextual information can be used: time of request; type of URL (IP address, domain name); geographic location of client and server; headers (malware may not look like a known browser).
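As a rough illustration of the graph idea, the sketch below links hostnames whose edit distance is small and spreads known labels along those links. It is a simplified stand-in for the graph-partitioning methods mentioned above, not the method used in production; the distance threshold and the benign host `www.example.org` are assumptions.

```python
from collections import deque

def edit_distance(a, b):
    """Levenshtein distance computed with a rolling row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def propagate_labels(hosts, labels, max_distance=2):
    """Spread known labels to unlabelled hosts reachable via near-duplicate edges."""
    result = dict(labels)
    queue = deque(labels)
    while queue:
        current = queue.popleft()
        for other in hosts:
            if other not in result and edit_distance(current, other) <= max_distance:
                result[other] = result[current]
                queue.append(other)
    return result

hosts = [
    "www1.scan-projectrf.cz.cc",
    "www1.scan-projectsi.cz.cc",
    "www1.scan-projectst.cz.cc",
    "www1.scan-projectte.cz.cc",
    "www1.scan-projectti.cz.cc",
    "www.example.org",
]
known = {"www1.scan-projectrf.cz.cc": "malicious", "www.example.org": "good"}
for host, label in propagate_labels(hosts, known).items():
    print(host, label)
```

One labelled sample is enough to tag the whole family here because the generated hostnames differ by only two characters.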

SearchAhead
Acceptable, Uncategorized, Prohibited, Malicious
If we can identify bad URLs, we can warn before the user clicks.
Search engines are the gateway to Web threats. Over 90% of new sites are visited as the result of an Internet search; that’s where users come across new and potentially suspicious sites. We operate SearchAhead as an “early warning” service for safe searching.

Response Time Scanning
Graphics, Webmail, New Web Pages, Blogs, Ad Links, Links, Comments, Banner Ads
Backdoors, Rootkits, Trojan Horses, Keyloggers, Worms
Trusted sites are targets
Strength-in-depth combination of commercial scanners and in-house technology.
It’s not just suspicious sites that can host malware. Trusted sites host user-generated content in the form of comments, blogs, emails, etc. Trusted sites can become infected, compromised or unwitting passengers for malware. This is happening ever more frequently.

Exploited sites in recent years
Facebook, Times India, Miami Dolphins, Samsung
Exploited sites over the past few years include:
Samsung unknowingly hosted malicious code in the form of trojans which disabled AV programs and logged keystrokes.
Before the Super Bowl in 2007, the Miami Dolphins site was compromised by hackers.
Earlier this year a worm went round on Twitter, caused by hackers taking advantage of a cross-site scripting (XSS) exploit.
In 2007, Times India hosted a cocktail of dropper trojans, malicious binaries and scripts, which were identified by ScanSafe’s OI team.
The well-known Koobface virus spreads by delivering Facebook messages to people who are 'friends' of a Facebook user whose computer has already been infected.

Nothing is safe – not even Twitter!
Category or URL based filtering approaches won’t provide protection from any of these threats.
Strength in depth: a combination of commercial AV and spyware engines, plus added in-house scanning technology (OI).

Signature Databases
From 2006 to 2008, the F-Secure signature database grew from entries to 1.5 million
No vendor can rely exclusively on signatures
In 2007, F-Secure had 500,000 signatures and McAfee had 360,000 signatures. Now, just over 3 years later, ClamAV has signatures, Symantec claim 1.6 million and Panda claim 2 million.
The rate at which variants of viruses come out is growing rapidly. Vendors have realised that adding separate signatures for each new variant isn’t feasible and have added behavioural scanning, heuristics and machine learning technologies to cope.

Zero-hour protection
Vendors take time to release signature updates
Win32.IstBar.jl trojan
Outbreak Intelligence (OI) provides proactive threat detection
A huge data set of traffic to be leveraged

How does OI use Machine Learning?
Approaches: malware detection, anomaly detection, dynamic categorization
Techniques employed: supervised learning, unsupervised learning, sandboxing

Dynamic Classification
Document classification across 80 categories
Increases coverage
Language identification
Identifies inappropriate content: porn is relatively easy; phishing is harder, but not impossible?; hate speech is harder still
We augment our request-time categorization with dynamic classification, based on machine learning of these categories from a training set extracted from our data. In order to do this we first have to handle language identification.
Under phishing we also include fake escrow sites, fake anti-virus sites and fake pharmacy sites. Many of these are now generated automatically by tools which create slight variants of a default template.
Detecting hate sites is particularly difficult because most existing algorithms will find any site matching a hateful subject, whereas what is required is to recognize only those in support of it.

DC for identifying malicious sites
Automated tools generate malicious sites: fake escrow, fake pharmacy, mule recruitment
Examples from Richard Clayton’s 2010 FOSDEM talk
The search for “the most trusted escrow service on the internet” yielded 231 hits in Feb 2010

Malicious Executable Files
The final stage of an attack is frequently downloading an executable
Traditionally blocked using signatures
We use a combination of signature-based scanners and machine learning
The final payload is usually a platform-specific executable, dominated almost entirely by Windows code. Traditionally these are detected by signatures, which are effectively advanced regular expressions and pattern matchers. The best-documented signature format is that of the open source ClamAV engine; a 2009 presentation by Alain Zidouemba provides a lot of detail. With virus writers now releasing several variations of a virus over its lifetime, and with viruses that change themselves as they propagate, signatures alone are increasingly infeasible to rely on.
Machine learning techniques can be used to detect malware. Several approaches have been used in the past, with techniques including decision trees, self-organising maps, Naïve Bayes classifiers, neural networks, SVMs and others. Typically these are binary classifiers trained on existing virus samples, where the selected features may include strings, hex sequences, and system or library calls.
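To make the feature types concrete, the sketch below extracts two of the kinds of features mentioned above from raw bytes: overlapping hex sequences (byte n-grams) and printable strings. The n-gram length, the minimum string length and the sample bytes are assumptions chosen for illustration; the resulting counts would then be fed to whichever binary classifier is in use.

```python
import re
from collections import Counter

def byte_ngrams(data, n=4):
    """Count overlapping n-byte sequences, keyed by their hex representation."""
    return Counter(data[i:i + n].hex() for i in range(len(data) - n + 1))

def printable_strings(data, min_length=4):
    """Extract runs of printable ASCII, similar to the Unix `strings` tool."""
    pattern = rb"[\x20-\x7e]{%d,}" % min_length
    return [m.decode("ascii") for m in re.findall(pattern, data)]

# A made-up byte string standing in for part of an executable.
sample = b"MZ\x90\x00\x03\x00kernel32.dll\x00\x00LoadLibraryA\x00"
print(printable_strings(sample))
print(byte_ngrams(b"\x00\x01\x00\x01\x00", n=2))
```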

Drive-by attacks
Almost no-one opens executables from odd sources any more, so attackers use drive-by attacks instead. A normal file (e.g. Flash, PDF, JavaScript, image file) is crafted to exploit a vulnerability in a viewer or library and execute code embedded within the file.

Flash
“Symantec recently highlighted Flash for having one of the worst security records in We also know first hand that Flash is the number one reason Macs crash. We have been working with Adobe to fix these problems, but they have persisted for several years now. We don’t want to reduce the reliability and security of our iPhones, iPods and iPads by adding Flash”
Steve Jobs, April 2010

The growing threat of Java
Almost as common as Flash
90% of PCs have Java
JDK downloads per month
3.48 million JRE downloads per month
Growth in known vulnerabilities: 29 patched in a single update (Oct 2010)
Growth in exploits reported by Sophos, Symantec, Microsoft and Cisco
Signatures + trained Scanlet
Everybody knows Flash is insecure, which is why the bad guys are switching to Java. Stats are from the 2008 JavaOne conference: 90% of PCs on the internet have Java. Sophos Labs report “a noticeable rise in the number of in-the-wild Java related exploits”. Symantec reported that “in 2008, Java vulnerabilities made up only 11 percent of all vulnerabilities found in browser plug-ins. This percentage increased significantly in 2009 when Java vulnerabilities made up 26 percent”. Our analysis of Wire data supports these findings. Java 1.6 Update 22 (October 2010) contained fixes for 29 separate vulnerabilities (separate CVEs) affecting network components, GUI components and, most critically, Java Web Start. To combat that we have a specific Java Scanlet which is trained on a comprehensive set of malware.

Detecting Malicious JavaScript
Sandboxing: behavioural checking; a good way to beat obfuscation techniques; difficult to constrain
Trained classification: analyse features
We can run JavaScript, ActionScript, VBScript and Java in sandboxes to analyse behaviour, but this is difficult to make robust in the face of obfuscated and (sometimes intentionally) corrupted malware.
Normal document classification works on the presence of “words” in files. It’s possible to use similar methods with JavaScript by extracting words (though not all words) and adding in other features of interest. These can be fed into a statistical model such as a Naïve Bayes classifier.
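The word-based approach can be sketched with a small multinomial Naïve Bayes classifier over identifier tokens. This is an illustrative toy, not the OI implementation: the tokenizer, the Laplace smoothing and the four training snippets are all assumptions, and a real system would train on a large labelled corpus and filter which words to keep.

```python
import math
import re
from collections import Counter

TOKEN_RE = re.compile(r"[A-Za-z_$][A-Za-z0-9_$]*")

def tokens(script):
    """Treat identifier-like runs as the 'words' of a script."""
    return TOKEN_RE.findall(script)

class NaiveBayes:
    def __init__(self):
        self.word_counts = {}        # label -> Counter of token frequencies
        self.doc_counts = Counter()  # label -> number of training scripts
        self.vocab = set()

    def train(self, script, label):
        self.doc_counts[label] += 1
        counts = self.word_counts.setdefault(label, Counter())
        for tok in tokens(script):
            counts[tok] += 1
            self.vocab.add(tok)

    def log_posterior(self, script, label):
        counts = self.word_counts[label]
        total = sum(counts.values())
        logp = math.log(self.doc_counts[label] / sum(self.doc_counts.values()))
        for tok in tokens(script):
            # Laplace smoothing so unseen tokens do not zero out the product.
            logp += math.log((counts[tok] + 1) / (total + len(self.vocab)))
        return logp

    def classify(self, script):
        return max(self.doc_counts, key=lambda label: self.log_posterior(script, label))

model = NaiveBayes()
model.train('document.write(unescape("%3Ciframe"));', "malicious")
model.train("eval(String.fromCharCode(104,105));", "malicious")
model.train("function add(a, b) { return a + b; }", "benign")
model.train("var total = add(1, 2); console.log(total);", "benign")
print(model.classify("eval(unescape(payload));"))
```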

JavaScript Features
v46f658f5e2260(v46f658f5e3226){ function v46f658f5e4207 () {return 16;} return(parseInt(v46f658f5e3226,v46f658f5e4207()));}function v46f658f5e61f4(v46f658f5e7174){ function v46f658f5ea0cd () {return 2;} var v46f658f5e813e=\'\';for(v46f658f5e9105=0; v46f658f5e9105<v46f658f5e7174.length; v46f658f5e9105+=v46f658f5ea0cd()){ v46f658f5e813e+=(String.fromCharCode(v46f658f5e2260(v46f658f5e7174.substr(v46f658f5e9105, v46f658f5ea0cd()))));}return v46f658f5e813e;} document.write(v46f658f5e61f4(\'3C E77696E646F772E D2\'));
The above is JavaScript, but where are the features? An exercise for the reader!
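One possible answer to the exercise is that the features are structural rather than lexical: the names carry no meaning, but their shape and the calls around them do. The sketch below is an assumption about what such features could look like, not the feature set actually used; the regular expression for machine-generated names is a guess based on the sample.

```python
import re

IDENT_RE = re.compile(r"[A-Za-z_$][A-Za-z0-9_$]*")
# Assumption: generated names look like 'v' followed by a long hex run.
HEXISH_RE = re.compile(r"^v[0-9a-f]{8,}$")

def script_features(script):
    """Extract a few obfuscation-oriented features from JavaScript source."""
    idents = IDENT_RE.findall(script)
    generated = [name for name in idents if HEXISH_RE.match(name)]
    return {
        "uses_fromCharCode": "fromCharCode" in idents,
        "uses_document_write": "document.write" in script,
        "uses_eval": "eval" in idents,
        "generated_name_ratio": len(generated) / len(idents) if idents else 0.0,
    }

# An excerpt of the script shown on the slide.
excerpt = (
    "var v46f658f5e813e='';"
    "v46f658f5e813e+=(String.fromCharCode(v46f658f5e2260("
    "v46f658f5e7174.substr(v46f658f5e9105, v46f658f5ea0cd()))));"
)
print(script_features(excerpt))
```

Features like these can be appended to the token counts and passed to the same statistical model.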

Obfuscation
Attackers use obfuscation
But so do legitimate vendors (e.g. Google) and large Web 2.0 libraries
Techniques include: name changes; string concatenation (eval); dynamically loaded/generated/decrypted code (eval); splitting functionality across files

Malicious Non-Executable Files
There are a lot of file formats out there – documents, pictures, videos. For zero-day attacks, we have no data to compare against. Basically this is anomaly detection.
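A very small example of anomaly detection is to model one statistic of known-good files and flag anything far outside that range. The sketch below uses byte entropy and a z-score threshold; both choices, and the baseline samples, are assumptions made for illustration, and a real detector would use many format-specific features.

```python
import math
from collections import Counter
from statistics import mean, stdev

def byte_entropy(data):
    """Shannon entropy of the byte distribution, in bits per byte."""
    if not data:
        return 0.0
    counts = Counter(data)
    n = len(data)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def is_anomalous(sample, baseline, threshold=3.0):
    """Flag a sample whose entropy lies more than `threshold` deviations from the baseline mean."""
    values = [byte_entropy(b) for b in baseline]
    mu, sigma = mean(values), stdev(values)
    if sigma == 0:
        return byte_entropy(sample) != mu
    return abs(byte_entropy(sample) - mu) / sigma > threshold

# Hypothetical baseline of text-like, known-good content.
baseline = [b"hello world " * 20, b"the quick brown fox " * 20, b"lorem ipsum dolor " * 20]
print(is_anomalous(bytes(range(256)), baseline))          # uniform bytes, e.g. packed or encrypted data
print(is_anomalous(b"some plain text here " * 10, baseline))
```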

Development Constraints
Low false positive rate
Robust: tolerant against malformed data
Language-agnostic
Scalable: 1.8 billion requests per day on 1000 servers
Low latency

Back-end processing
If a technique is too slow for real-time scanning, that doesn’t make it useless. Back-end processing can generate lists of good and bad files and help evaluate new techniques.
Just because a technology or implementation isn’t reliable enough or fast enough to run on the scanning towers or appliances doesn’t stop us being able to take advantage of it.

Want to know more?
Cisco 2Q10 Global Threat Report
Richard Clayton: Evil on the Internet
Kaspersky Lab Security News Service
A Plan for Spam
Clayton’s talk is from FOSDEM, February 2010. It gives some live demonstrations of phishing, fake antivirus and pharmacy sites, and talks about mule recruitment and take-down times.

Still want to know more?
Identifying Suspicious URLs: An Application of Large-Scale Online Learning
Peter Norvig, Google: Statistical Learning as the Ultimate Agile Development Tool
Writing ClamAV Signatures, Alain Zidouemba

Take Home Messages
Web Security: challenging and interesting domain; many applications for Machine Learning
ScanSafe and Cisco: many opportunities for collaboration; several opportunities for student projects

Any Questions?
