Presentation on theme: "Applications of Machine Learning in Cisco Web Security"— Presentation transcript:
1 Applications of Machine Learning in Cisco Web Security Richard Wheeldon PhD BSc
2 Cisco Web Security
  Cisco, Ironport and ScanSafe
  Request time filtering
  Categorization and classification
  Reputation
  Response time filtering
  Malware types and attack vectors
  Malware detection
  Dynamic classification
  Other challenges
3 The Ubiquitous Speaker Slide
  Richard Wheeldon
    UCL graduate in 1999
    PhD from Birkbeck in 2003
    Joined Cisco December 2009
  Acknowledgements
    Steve Poulson
    Bryan Feeney
  Slides will be made available after the presentation is over.
  Bryan Feeney was a UCL MSc student before joining ScanSafe. He wrote many of the original slides for this talk and much of the ScanSafe OI code. He is currently doing a PhD at UCL.
4 Cisco, Ironport and ScanSafe
  Cisco
    World’s leading network company
  Ironport
    Leader in anti-spam
    Provide Web Security Appliances
  ScanSafe
    World leader in “Security as a Service”
    Scans 1.8 billion web requests a day
    Blocks 32 million of them
  Cisco
    Arguably the world’s largest network company, employing [...] people with assets of over 80BN$.
    Known mostly for routers and switches, Cisco also produce firewalls, telephones and servers.
    Technologies for security, network management, building management, video conferencing (telepresence, webex), instant messaging, phones and automation.
    Security products include firewalls, IPSs, management systems, VPNs and mobile security.
  Ironport
    Acquired by Cisco in 2007.
    Best known for IronPort AntiSpam, the SenderBase reputation service, and security appliances (ESA).
  ScanSafe
    Acquired by Cisco in December 2009.
    2 years ago, ScanSafe were scanning 7 billion requests a month; now we’re up to 1.8 billion a day. More data breeds better training.
    Around 5000 customers including well-known names such as ICI, Standard Chartered and BMW. A single company may have [...] users.
    Blocks are up from 70 million a month to 32 million a day, based on a combination of categorization and detection of malware and other online threats.
    Currently peaking at a bandwidth usage of 8 gigabit/sec across 1000 servers.
6 Previous MSc projects
  Tree Kernels for CFG similarity. Guangyan Song, 2010
  Fast computation of the Kernel of a Tree and applications to Semi-Supervised Learning. Malcolm Reynolds, 2009
  Comparing N-gram features for web page classification. Noureen Tejani, 2007
7 We’re hiring
  Positions
    Software Developers
    QA, Operations, Research
  Locations
    ScanSafe
    UK - Bedfont Lakes, Reading, Staines, Edinburgh
    Galway, EMEA, US, Worldwide
  Graduate recruitment
    Cisco are making a 500$ investment in the UK over the next five years.
8 ScanSafe’s SaaS
  1. Availability
    Time our service is available to scan traffic
    99.999% guaranteed availability
  2. Latency
    Additional load time attributable to services
    Evaluated by 3rd party analysis
  3. False Positives
    Pages that were blocked but should not have been
  4. False Negatives
    Pages that were not blocked, but should have been
  Cloud based scanning
    Content scanning with AV engines is done on over 1000 worldwide scanning servers.
    Ease of management is at the heart of a managed service approach. Using ScanSafe helps remove the burden of updating software and hardware, for example the need to constantly update AV software for new signatures. However, a strength-in-depth approach with AV on the desktop is still recommended.
  Zero maintenance
    Provides easy deployment (relative to appliance-based delivery)
    Automated continuous updates - no patching
    A global presence with very high levels of redundancy helps us meet some of the toughest SLAs in the business.
    Some of the world’s largest Telcos have partnered with us, including AT&T, Google and Sprint.
    We have won numerous industry awards for our services and feature in the leaders quadrant of Gartner’s surveys.
  The ScanCenter portal enables the customer administrator to:
    View reports and configure automated reports
    Review statistics of all Web activity and threats blocked
    Create access policies and apply these to specific users or groups
  Report generation
    Automated reports are available on overall traffic, bandwidth, blocked URLs, spyware and Web viruses stopped. These are complemented by a comprehensive selection of additional reports, generated daily, which provide in-depth analysis in the form of graphs, tables and exportable data files. Reporting ranges from high-level dashboard views to detailed forensic audits on specific users. This is all backed by our data warehouse of recorded traffic.
  Granular policies
    Block / allow lists
    Schedules
    Quotas
    80 categories
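To put the 99.999% availability figure above in concrete terms, the short calculation below converts an availability percentage into the downtime it permits. It is a back-of-the-envelope sketch: the function name and the 365-day year are illustrative assumptions, not terms of the ScanSafe SLA.

```python
# Convert an availability percentage into permitted downtime.
# Assumption: a 365-day year (leap years ignored).
MINUTES_PER_YEAR = 365 * 24 * 60


def allowed_downtime_minutes(availability_percent, period_minutes=MINUTES_PER_YEAR):
    """Minutes of downtime permitted per period at the given availability."""
    return period_minutes * (1 - availability_percent / 100)


if __name__ == "__main__":
    for level in (99.9, 99.99, 99.999):
        print(f"{level}% -> {allowed_downtime_minutes(level):.2f} minutes per year")
```

Under these assumptions, "five nines" leaves a little over five minutes of downtime a year.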
10 Most web traffic is good
  The Web vs. Email
    Web: most web traffic is good. Email: most is bad.
    Web: easy to find safe sites. Email: easy to get spam.
    Web: harder to get dangerous URLs. Email: harder to get examples of good mail.
    Web: blocking web sites is visible. Email: blocking is invisible.
    Web: performance gain from white-listing. Email: performance gain from blocking.
    Web: very real-time (<2s). Email: not real-time (<Nhrs).
11 Request time filtering
  Motivation
    Quicker blocks save bandwidth and processing time
    If the request is made, the damage may be done
  Techniques
    Databases
    Reputation
    Rules
    Trained systems
  Bandwidth and processing time
    Saving bandwidth is important for categorized media content.
    It is also important where bad content dominates over good content.
    With 7 trillion spam messages a year, the benefit of not processing that data is huge.
  Data Leak Prevention
    If malware is already installed on a client network, connections may be made out to botnets.
    These may leak passwords, credit card details and information about the network.
    Data Leak Prevention (DLP) is the general term for preventing vital data from getting out of a company’s network.
12 Category-based filtering
  Responsible for most blocks
  High-risk and high-traffic
  Manual categorizers
  10 million URLs
  97% of traffic
  2 million porn sites
  Category-based filtering
    Responsible for the majority of blocks
    Focused towards high-risk and high-traffic sites
    Team of manual categorizers focuses on recently seen, high-volume sites
    Database of 10 million URLs and IP addresses covers 97% of traffic
    Over 2 million known porn sites
13 Web Reputation
  Feeds
    Phishing sites
    Malware sites
  Heuristics
    In spam but not in ham
    Age of domain registration
    High traffic, e.g. Alexa 1000
    Scanned but never blocked
  Motivation
    Blacklists and whitelists aren’t always 100% accurate (or up to date).
    There are a number of heuristics which indicate potential problems but which aren’t automatically malicious.
    If a site is frequently seen on the service but never blocked, it’s more likely to be safe. If we keep finding malware on it, it probably isn’t.
    Malware and phishing sites tend to be short-lived. This makes blacklists less valuable but provides a useful metric.
17 Recognizing Porn URLs
  http://www.penisland.com
  Example of segmentation problem
    P('peni') × P('sland')
    P('penis') × P('land')
    P('pen') × P('island')
  Extends to classification
    P('penis') × P('land') × P(porn|'penis') × P(porn|'land')
    P('pen') × P('island') × P(not_porn|'pen') × P(not_porn|'island')
  Text segmentation problem
    Well studied in speech recognition and NLP
    Real life problem for natural languages, particularly eastern and oriental
    Japanese and Chinese don’t have spaces between words
    Arabic doesn’t have spaces between characters
  There are many problems in generating a real-world solution, caused in part by biases in the training data. For example, our categorized corpus has a bias towards high-risk areas and towards languages used mostly by our customers.
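The joint segmentation and classification scoring on this slide can be sketched in a few lines of Python. This is only an illustration: the word probabilities and P(porn|word) values below are made-up toy numbers, fragments outside the toy vocabulary (such as 'peni' and 'sland') are simply discarded rather than smoothed, and P(not_porn|word) is taken as 1 - P(porn|word).

```python
# Hypothetical toy probabilities; real values would be estimated from a corpus.
WORD_PROB = {"pen": 0.004, "island": 0.002, "penis": 0.0005, "land": 0.003}
PORN_GIVEN_WORD = {"pen": 0.05, "island": 0.05, "penis": 0.9, "land": 0.1}


def segmentations(text):
    """Return every split of text into words from the toy vocabulary."""
    if not text:
        return [()]
    results = []
    for i in range(1, len(text) + 1):
        head = text[:i]
        if head in WORD_PROB:
            for rest in segmentations(text[i:]):
                results.append((head,) + rest)
    return results


def score(words, label):
    """Product of P(word) * P(label | word) over the segmentation."""
    total = 1.0
    for word in words:
        p_porn = PORN_GIVEN_WORD[word]
        total *= WORD_PROB[word] * (p_porn if label == "porn" else 1 - p_porn)
    return total


def best_reading(text):
    """Pick the (segmentation, label) pair with the highest score."""
    candidates = [
        (words, label)
        for words in segmentations(text)
        for label in ("porn", "not_porn")
    ]
    return max(candidates, key=lambda c: score(*c))


if __name__ == "__main__":
    print(best_reading("penisland"))
```

With these toy numbers the 'pen' + 'island' reading labelled not_porn scores highest; different corpus statistics could of course flip the outcome, which is the bias problem noted above.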
18 Phishing and Malware Examples
  Phishing examples: [...]
  Malicious examples:
    www1.scan-projectrf.cz.cc
    www1.scan-projectsi.cz.cc
    www1.scan-projectst.cz.cc
    www1.scan-projectte.cz.cc
    www1.scan-projectti.cz.cc
  We can construct a graph using a distance measure between URLs such as edit distance or n-grams.
  For the phishing example, the known phish sites are labelled as bad and PayPal sites are labelled as good.
  Methods exist for partitioning graphs based on labels and graph structure.
  Other contextual information can be used:
    Time of request
    Type of URL (IP address, domain name)
    Geographic location of client and server
    Headers (malware may not look like a known browser)
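The graph idea above can be sketched as follows: hosts are nodes, two hosts are linked when their edit distance is small, and labels spread from known hosts to their unlabelled neighbours. This is a simplified stand-in for proper graph partitioning, not the method used in OI; the distance threshold, the "first label to arrive wins" rule and the use of www.paypal.com as the known-good seed are all assumptions made for illustration.

```python
from collections import deque


def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def propagate_labels(hosts, seed_labels, max_distance=2):
    """Spread seed labels across edges joining hosts within max_distance edits."""
    neighbours = {
        h: [o for o in hosts if o != h and edit_distance(h, o) <= max_distance]
        for h in hosts
    }
    labels = dict(seed_labels)
    queue = deque(seed_labels)
    while queue:
        host = queue.popleft()
        for other in neighbours[host]:
            if other not in labels:  # first label to arrive wins
                labels[other] = labels[host]
                queue.append(other)
    return labels


if __name__ == "__main__":
    hosts = [
        "www1.scan-projectrf.cz.cc",
        "www1.scan-projectsi.cz.cc",
        "www1.scan-projectst.cz.cc",
        "www1.scan-projectte.cz.cc",
        "www1.scan-projectti.cz.cc",
        "www.paypal.com",  # assumed known-good seed
    ]
    seeds = {"www1.scan-projectrf.cz.cc": "malicious", "www.paypal.com": "good"}
    for host, label in sorted(propagate_labels(hosts, seeds).items()):
        print(host, label)
```

A single labelled host is enough here to flag the other four variants, because each differs from it by only two characters.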
19 SearchAhead
  Acceptable
  Uncategorized
  Prohibited
  Malicious
  If we can identify bad URLs we can warn before the user clicks.
  Over 90% of new sites are visited as the result of an Internet search.
  Search engines are the gateway to web threats: that is where users come across new and potentially suspicious sites. We operate SearchAhead as an “early warning” service for safe searching.
20 Response Time Scanning
  Graphics, Webmail, New Web Pages, Blogs, Ad Links, Links, Comments, Banner Ads
  Backdoors, Rootkits, Trojan Horses, Keyloggers, Worms
  Trusted sites are targets
  Strength-in-depth combination of commercial scanners and in-house technology.
  It’s not just suspicious sites that can host malware.
  Trusted sites host user-generated content in the form of comments, blogs, [...], etc.
  Trusted sites can become infected, compromised or unwitting passengers for malware.
  This is happening ever more frequently.
21 Exploited sites in recent years
  Facebook
  Times India
  Miami Dolphins
  Samsung
  Exploited sites over the past few years include:
    Samsung unknowingly hosted malicious code in the form of trojans which disabled AV programs and logged keystrokes.
    Before the Super Bowl in 2007, the Miami Dolphins site was compromised by hackers.
    Earlier this year a worm went round on Twitter, caused by hackers taking advantage of a cross-site scripting (XSS) exploit.
    In 2007, Times India hosted a cocktail of dropper trojans, malicious binaries and scripts which were identified by ScanSafe’s OI team.
    The well-known Koobface virus spreads by delivering Facebook messages to people who are 'friends' of a Facebook user whose computer has already been infected.
22 Nothing is safe, not even Twitter!
  Category or URL based filtering approaches won’t provide protection from any of these threats.
  Strength in depth
    Combination of commercial AV and spyware engines
    Added in-house scanning technology (OI)
23 Signature Databases
  From 2006 to 2008, the F-Secure signature database grew from [...] entries to 1.5 million
  The rate at which variants of viruses come out is growing rapidly
  No vendor can rely exclusively on signatures
  In 2007
    F-Secure had 500,000 signatures
    McAfee had 360,000 signatures
  Now, just over 3 years later
    ClamAV has [...] signatures
    Symantec claim 1.6 million
    Panda claim 2 million
  Vendors have realised that adding separate signatures for each new variant isn’t feasible and have added behavioural scanning, heuristics and machine learning technologies to cope.
24 Zero-hour protection
  Vendors take time to release signature updates
  Win32.IstBar.jl trojan
  Outbreak Intelligence (OI) provides proactive threat detection
  A huge data set of traffic to be leveraged
25 How does OI use Machine Learning?
  Approaches
    Malware detection
    Anomaly detection
    Dynamic categorization
  Techniques employed
    Supervised learning
    Unsupervised learning
    Sandboxing
26 Dynamic Classification
  Document classification across 80 categories
  Increases coverage
  Language identification
  Identifies inappropriate content
    Porn is relatively easy
    Phishing is harder, but not impossible?
    Hate speech is harder still
  We augment our request-time categorization with dynamic classification, based on machine learning of these categories from a training set extracted from our data. In order to do this we first have to handle language identification.
  By phishing, we also include fake escrow sites, fake anti-virus sites and fake pharmacy sites. Many of these are now generated automatically by tools which create slight variants of a default template.
  Detecting hate sites is particularly difficult because most existing algorithms will find sites matching a hateful subject, whereas what is required is to recognize only those in support of it.
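As a rough illustration of learning categories from a labelled training set, here is a minimal multinomial naive Bayes text classifier with add-one smoothing. The talk does not say which algorithm the dynamic classifier uses, so naive Bayes is an assumption chosen for brevity, and the training snippets and category names are invented toy data.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """Build word and document counts from (text, category) pairs."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    vocab = set()
    for text, category in examples:
        words = text.lower().split()
        word_counts[category].update(words)
        doc_counts[category] += 1
        vocab.update(words)
    return word_counts, doc_counts, vocab


def classify(model, text):
    """Return the category with the highest log posterior (add-one smoothing)."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_category, best_score = None, -math.inf
    for category in doc_counts:
        total_words = sum(word_counts[category].values())
        log_score = math.log(doc_counts[category] / total_docs)
        for word in text.lower().split():
            if word in vocab:  # words never seen in training are ignored
                count = word_counts[category][word]
                log_score += math.log((count + 1) / (total_words + len(vocab)))
        if log_score > best_score:
            best_category, best_score = category, log_score
    return best_category


# Invented toy training data; a real system would train on categorized pages.
TRAINING = [
    ("cheap pills online pharmacy no prescription", "fake_pharmacy"),
    ("buy pills cheap discount pharmacy", "fake_pharmacy"),
    ("trusted escrow service secure payment", "fake_escrow"),
    ("escrow service for secure online transactions", "fake_escrow"),
    ("football scores and match reports", "sport"),
    ("league table football results", "sport"),
]

if __name__ == "__main__":
    model = train(TRAINING)
    print(classify(model, "discount pills pharmacy"))
    print(classify(model, "secure escrow payment"))
```

A bag-of-words model like this matches on subject matter only, which is exactly why it would struggle with the hate-speech case described above.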
27 DC for identifying malicious sites
  Automated tools generate malicious sites
    Fake escrow
    Fake pharmacy
    Mule recruitment
  Examples from Richard Clayton’s 2010 FOSDEM talk
  The search for “the most trusted escrow service on the internet” yielded 231 hits in Feb 2010
28 Malicious Executable Files
  The final stage of an attack is frequently downloading an executable
  Traditionally blocked using signatures
  We use a combination of signature-based scanners and machine learning
  The final payload is usually a platform-specific executable, dominated almost entirely by Windows code.
  Traditionally these are detected by signatures, which are effectively advanced regular expressions and pattern matchers. The best-documented signature format is that of the open source ClamAV engine; a 2009 presentation by Alain Zidouemba provides a lot of detail. With virus writers now releasing several variations of a virus over its lifetime, and with viruses that change themselves as they propagate, signatures alone are increasingly infeasible to rely on.
  Machine learning techniques can be used to detect malware. Several approaches have been used in the past, with techniques including decision trees, self-organising maps, naïve Bayes classifiers, neural networks, SVMs and others.
  Typically these are binary classifiers trained on existing virus samples, where the selected features may include strings or hex sequences, and system or library calls.
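To make the "hex sequences as features" idea concrete, the sketch below extracts byte n-grams from a file and labels a new sample by its most similar known sample. Nearest-neighbour matching on Jaccard similarity is used purely because it is short; it is a stand-in for the classifiers listed above, not a description of the ScanSafe scanlets. The byte strings are harmless invented placeholders, not real malware.

```python
def byte_ngrams(data, n=4):
    """Set of all n-byte sequences occurring in data."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def nearest_label(sample, labelled, n=4):
    """Label of the known sample whose n-gram set is most similar to sample's."""
    features = byte_ngrams(sample, n)
    _, label = max(
        labelled,
        key=lambda item: jaccard(features, byte_ngrams(item[0], n)),
    )
    return label


# Invented placeholder "files": a header followed by imported function names.
KNOWN = [
    (b"MZ\x90\x00CreateRemoteThread\x00WriteProcessMemory\x00", "malicious"),
    (b"MZ\x90\x00GetSystemTime\x00MessageBoxA\x00", "benign"),
]

if __name__ == "__main__":
    variant = b"MZ\x90\x00WriteProcessMemory\x00CreateRemoteThread\x00"
    print(nearest_label(variant, KNOWN))
```

Because the features are local byte sequences, a variant that merely reorders its contents still lands close to the original, which an exact-match signature would miss.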
30 Flash
  “Symantec recently highlighted Flash for having one of the worst security records in [...]. We also know first hand that Flash is the number one reason Macs crash. We have been working with Adobe to fix these problems, but they have persisted for several years now. We don’t want to reduce the reliability and security of our iPhones, iPods and iPads by adding Flash”
  Steve Jobs, April 2010
31 The growing threat of Java
  Almost as common as Flash
    90% of PCs have Java
    [...] JDK downloads per month
    3.48 million JRE downloads per month
  Growth in known vulnerabilities
    29 patched in a single update (Oct 2010)
  Growth in exploits reported by Sophos, Symantec, Microsoft and Cisco
  Signatures + trained scanlet
  Everybody knows Flash is insecure, which is why the bad guys are switching to Java.
  Stats are from the 2008 JavaOne conference: 90% of PCs on the internet have Java.
  Sophos Labs report “a noticeable rise in the number of in-the-wild Java related exploits”.
  Symantec reported that “in 2008, Java vulnerabilities made up only 11 percent of all vulnerabilities found in browser plug-ins. This percentage increased significantly in 2009 when Java vulnerabilities made up 26 percent”.
  Our analysis of Wire data supports these findings.
  Java 1.6 Update 22 (October 2010) contained fixes to 29 separate vulnerabilities (separate CVEs) affecting network components, GUI components and, most critically, Java Web Start.
  To combat that we have a specific Java scanlet which is trained from a comprehensive set of malware.
34 Obfuscation
  Attackers use obfuscation
    But so do legitimate vendors (e.g. Google)
    And large Web 2.0 libraries
  Techniques include
    Name changes
    String concatenation (eval)
    Dynamically loaded/generated/decrypted code (eval)
    Splitting functionality across files
35 Malicious Non-Executable Files
  There are a lot of file formats out there: documents, pictures, videos.
  For zero-day attacks, we have no data to compare against.
  Basically this is anomaly detection.
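One very simple form of anomaly detection is to model what known-good files look like and flag anything that deviates strongly. The sketch below uses a single feature, the Shannon entropy of the file's bytes, and a z-score threshold. Both the feature and the threshold of 3 standard deviations are illustrative assumptions; a practical detector would use format-specific features and many more samples.

```python
import math
from collections import Counter
from statistics import mean, stdev


def byte_entropy(data):
    """Shannon entropy of a byte string, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    return -sum((c / total) * math.log2(c / total) for c in Counter(data).values())


def is_anomalous(sample, known_good, threshold=3.0):
    """Flag sample if its entropy is more than threshold standard deviations
    from the mean entropy of the known-good files (needs at least two)."""
    scores = [byte_entropy(d) for d in known_good]
    mu, sigma = mean(scores), stdev(scores)
    value = byte_entropy(sample)
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > threshold


# Invented placeholder "documents" standing in for a corpus of known-good files.
KNOWN_GOOD = [
    b"The quick brown fox jumps over the lazy dog",
    b"Quarterly report: sales rose in all regions",
    b"Meeting notes from the security review",
    b"Please find the attached invoice for March",
]

if __name__ == "__main__":
    packed = bytes(range(256))  # maximally varied bytes, like packed or encrypted data
    print(byte_entropy(packed), is_anomalous(packed, KNOWN_GOOD))
```

No malicious examples are needed to train this, which is the point for zero-day attacks, but the price is a false positive risk on any legitimate file that happens to look unusual.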
36 Development Constraints
  Low false positive rate
  Robust
    Tolerant against malformed data
    Language-agnostic
  Scalable
    1.8 billion requests per day on 1000 servers
  Low latency
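A quick calculation shows what these constraints mean in practice. The request volume and server count are the figures quoted in the talk; the even spread of load across servers and across the day, and the example false positive rate of one in a million, are assumptions made purely for illustration.

```python
# Figures quoted in the talk.
REQUESTS_PER_DAY = 1_800_000_000
SERVERS = 1000
SECONDS_PER_DAY = 24 * 60 * 60


def average_requests_per_second(requests_per_day=REQUESTS_PER_DAY):
    """Mean request rate, assuming load is spread evenly over the day."""
    return requests_per_day / SECONDS_PER_DAY


def average_per_server(requests_per_day=REQUESTS_PER_DAY, servers=SERVERS):
    """Mean request rate per server, assuming load is spread evenly."""
    return average_requests_per_second(requests_per_day) / servers


def false_positives_per_day(false_positive_rate, requests_per_day=REQUESTS_PER_DAY):
    """Expected number of wrongly blocked requests per day."""
    return false_positive_rate * requests_per_day


if __name__ == "__main__":
    print(f"{average_requests_per_second():,.0f} requests/sec overall")
    print(f"{average_per_server():.1f} requests/sec per server")
    # Hypothetical rate: one false positive per million requests.
    print(f"{false_positives_per_day(1e-6):,.0f} false positives/day")
```

Even at a hypothetical one-in-a-million false positive rate, this volume would produce well over a thousand wrongly blocked requests a day, which is why the false positive rate leads the list of constraints.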
37 Back-end processing
  If a technique is too slow for real-time scanning, that doesn’t make it useless.
  Back-end processing can generate lists of good and bad files and help evaluate new techniques.
  Just because a technology or implementation isn’t reliable enough or fast enough to run on the scanning towers or appliances doesn’t stop us being able to take advantage of it.
38 Want to know more?
  Cisco 2Q10 Global Threat Report
  Richard Clayton: Evil on the Internet
  Kaspersky Lab Security News Service
  A plan for Spam
  Clayton’s talk is from FOSDEM, February 2010. It gives some live demonstrations of phishing, fake antivirus and pharmacy sites, and talks about mule recruitment and take-down times.
39 Still want to know more?
  Identifying Suspicious URLs: An Application of Large-Scale Online Learning
  Peter Norvig, Google: Statistical Learning as the Ultimate Agile Development Tool
  Alain Zidouemba: Writing ClamAV Signatures
40 Take Home Messages
  Web Security
    Challenging and interesting domain
    Many applications for Machine Learning
  ScanSafe and Cisco
    Many opportunities for collaboration
    Several opportunities for student projects