Dec 14, 2014, Harvard University

Spam Campaign Cluster Detection Using Redirected URLs and Randomized Sub-Domains
Authors: Abu Awal Md Shoeb, Dibya Mukhopadhyay, Shahid Al Noor, Alan Sprague, and Gary Warner

Outline
- Introduction
- Why spam detection is important
- Why it is difficult to detect
- Our approach
- Results
- Conclusion

Introduction
- Spam email: unsolicited bulk email (also known as junk email) sent to numerous recipients. It is not only annoying but also dangerous.
- Spam campaign: spam emails constructed from the same template to market the same product.
- Redirected URL: the address of the actual website reached from a given URL.
- Randomized sub-domain: different web addresses generated from a single domain or web address.
Note: Email spam, also known as junk email or unsolicited bulk email (UBE), is a subset of electronic spam involving nearly identical messages sent to numerous recipients by email. Host name?

Why Spam Detection is Important
- More than 70% of email is spam!
- Malware-infected users help spammers harm others!
- Detection saves large organizations the extra time spent responding to infections.
- Users can find useful email easily without going through a large set of spam.
- Users are not aware that they are infected!

Why Spam Detection is Difficult
- Attributes of spam change very quickly
  - Variety of subjects
  - Variety of given URLs
  - A given URL shuts down after some time
- Criminals hide their identity through botnets
  - Infected computers are used for the campaign
- Lack of access to instantaneous spam data
  - There is no benefit if a campaign is detected after it achieves its goal

Our Approach
- Why given URL: it is what comes with most spam.
- Why redirected URL: the given URL changes very often, but the redirected URL does not.
- Why campaign, why not spam only: a campaign is a superset of spam. We will show why campaign once we see the results. Campaigns are more meaningful than ...

Block Diagram/Overview
Spam dataset: each spam email may be considered a cluster of size 1
- Level 1: Merge all spam emails that redirect to the same website
- Level 2: Merge each pair of clusters that contain an exact subject in common (exact subject matching, not partial)
- Level 3: Merge each pair of clusters that contain a URL with a common domain
- Remove clusters whose size is smaller than a particular threshold T (threshold to remove outliers)
Result: large spam campaigns, each represented as a cluster

Our Algorithm
- Level 1: Redirected URL-Based Clustering
  - Put all spam emails that have the same redirected URL into the same cluster and assign the redirected URL as the KEY
  - Remaining spam emails are treated as individual clusters
- Level 2: Exact Subject-Based Clustering
  - Merge existing clusters if their subjects match
  - Assign the subject as the KEY if a redirected URL is not already the key
- Level 3: Randomized Sub-Domain-Based Clustering
  - Extract the given URL and merge existing clusters if their given URLs are members of the same domain
- Thresholding
  - Apply different thresholds to discard tiny clusters (outliers)
(Speaker note: mention the threshold values.)
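The three levels and the thresholding step can be sketched with a disjoint-set (union-find) structure. This is a minimal illustration, not the authors' implementation: the record fields `redirected_url`, `subject`, and `domain`, and the names `DisjointSet`, `merge_on`, and `cluster_campaigns`, are all assumptions made for the example. Records whose key is `None` (for instance, no redirected URL) are left as individual clusters at that level.

```python
from collections import defaultdict


class DisjointSet:
    """Minimal union-find over the indices 0..n-1."""

    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        # Walk to the root, halving the path along the way.
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a


def merge_on(ds, spams, key_func):
    """Merge the clusters of all records that share the same non-None key."""
    first_seen = {}
    for index, spam in enumerate(spams):
        key = key_func(spam)
        if key is None:
            continue  # no key at this level: the record stays where it is
        if key in first_seen:
            ds.union(first_seen[key], index)
        else:
            first_seen[key] = index


def cluster_campaigns(spams, threshold):
    """Return clusters (lists of record indices) with at least `threshold` members."""
    ds = DisjointSet(len(spams))
    merge_on(ds, spams, lambda s: s.get("redirected_url"))  # Level 1
    merge_on(ds, spams, lambda s: s.get("subject"))         # Level 2 (exact match)
    merge_on(ds, spams, lambda s: s.get("domain"))          # Level 3
    clusters = defaultdict(list)
    for index in range(len(spams)):
        clusters[ds.find(index)].append(index)
    kept = [members for members in clusters.values() if len(members) >= threshold]
    return sorted(kept, key=len, reverse=True)
```

With a hypothetical list of records, `cluster_campaigns(records, threshold=500)` would return the large clusters, biggest first. Because each level only merges existing clusters, running the three key passes over one shared union-find gives the same final partition as merging level by level.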

Dataset and Tools
- Source: The Center for Information Assurance and Joint Forensics Research (CIA-JFR)
- Spam data (approx. half a million)
  - 15 April 2014 (6 hours) to build the prototype
  - 20 and 21 August 2014 (full days) to test
- Attributes of data: subject, given URL, redirected URL (derived)
- Language: Python (Mechanize library)
(Speaker note: say that Mechanize is used to visit web pages.)

Example: Redirected URL
Given URLs (20 Aug 2014): all of them have the same redirected URL.
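Grouping given URLs by their redirected URL might look like the sketch below. The talk used the Mechanize library to visit pages; this example swaps in the standard library's `urllib.request.urlopen` so that it is self-contained, and it only follows HTTP-level redirects. The function names `resolve_redirect` and `group_by_redirect` and the `resolver` parameter are assumptions for illustration.

```python
from collections import defaultdict
from http.client import HTTPException
from urllib.request import urlopen


def resolve_redirect(url, timeout=10):
    """Return the final URL after HTTP redirects, or None if the fetch fails."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.geturl()
    except (OSError, ValueError, HTTPException):
        return None


def group_by_redirect(given_urls, resolver=resolve_redirect):
    """Group given URLs by redirected URL; return (groups, unresolved)."""
    groups = defaultdict(list)
    unresolved = []
    for url in given_urls:
        target = resolver(url)
        if target is None:
            # A dead or unreachable URL is kept apart rather than clustered.
            unresolved.append(url)
        else:
            groups[target].append(url)
    return dict(groups), unresolved
```

Passing the resolver as a parameter keeps the grouping logic testable without network access, for example by supplying a dictionary's `get` method in place of a live fetch.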

Example: Randomized Sub-Domain
All given URLs have the same domain.
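One way to see that URLs with randomized sub-domains share a domain is to strip the host name down to its last labels. The sketch below is a deliberately naive assumption: it keeps the last two labels, which is wrong for suffixes such as `co.uk` and for raw IP addresses, so a real system would more likely consult the Public Suffix List. The names `base_domain` and `group_by_domain` and the example URLs are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def base_domain(url, labels=2):
    """Return the last `labels` labels of the URL's host name, or None."""
    host = urlsplit(url).hostname
    if not host:
        return None  # e.g. the URL has no scheme, so no host could be parsed
    parts = host.lower().rstrip(".").split(".")
    return ".".join(parts[-labels:])


def group_by_domain(urls):
    """Map each base domain to the given URLs that fall under it."""
    groups = defaultdict(list)
    for url in urls:
        groups[base_domain(url)].append(url)
    return dict(groups)


# Hypothetical given URLs with randomized sub-domains of one domain.
example_urls = [
    "http://x9k2.example.com/offer",
    "http://qq71a.example.com/offer",
    "http://zz0.example.com/deal",
]
```

Here `group_by_domain(example_urls)` places all three URLs under the single key `example.com`, which is the signal used in Level 3 to merge clusters.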

Results: Number of Clusters
Number of clusters after each step:
Dataset       # of Spam  Redirected URL  Same Subject  Randomized Sub-domain  Threshold 500
15 April          60995           17253          1086                    289              4
20 Aug           249389          156077          5145                   1044             18
21 Aug           247922          166645          7938                   1037             15
20, 21 Aug       497311          322722         11878                   1670             26

Results: April 15, 2014
Total spam: 60995
SC 1: 36%, SC 2: 32%, SC 3: 18%, SC 4: 6%

Results: August 20, 2014
Total spam: 249389
SC 1: 36%, SC 2: 30%, SC 3: 11.5%, SC 4: 6.7%

Results: August 21, 2014
Total spam: 247922
SC 1: 34%, SC 2: 32%, SC 3: 17%, SC 4: 3.5%
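As a quick arithmetic check, the four SC shares listed for each day can be summed to see how much of the daily spam they cover together. The figures are the ones on the slides; the names `sc_shares` and `combined_share` are only illustrative.

```python
# Percent of each day's spam in SC 1 to SC 4, as listed on the slides.
sc_shares = {
    "15 April 2014": [36, 32, 18, 6],
    "20 August 2014": [36, 30, 11.5, 6.7],
    "21 August 2014": [34, 32, 17, 3.5],
}


def combined_share(shares):
    """Sum the per-cluster percentages, rounded to one decimal place."""
    return round(sum(shares), 1)


for day, shares in sc_shares.items():
    print(f"{day}: {combined_share(shares)}% of spam in SC 1 to SC 4")
```

The sums come to 92% for 15 April, 84.2% for 20 August, and 86.5% for 21 August.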

Products Advertised in Large Campaigns
15 April 2014 campaigns: 60995 spam emails
(Speaker note: which one comes from which cluster.)

Behavior of Campaigns
Redirected URL, by date:
- 15 April: Yes
- 20 August: No
- 21 August: Yes (less frequent)
(Speaker note: which one comes from which cluster.)

Conclusion - 1
- It is a real-time spam campaign detection approach
  - No predefined model/rule is required
  - It can be applied as soon as spam arrives
- Our approach is very effective
  - Almost 90% of half a million spam emails fall into 4 major campaigns
- It can detect campaigns consistently
  - Regardless of whether a campaign's subject changes
  - Regardless of whether the given URL changes

Conclusion - 2
With large clusters identified, rather than blocking the spam, we need to identify a new approach towards spam campaigns:
- Community awareness
- Law enforcement

☺ Thank You Question Time! Presented By – Abu Awal Md Shoeb
The SECuRE and Trustworthy Computing Lab (SECRETLab)

Problem of Current Approaches
- Content-based: requires longer processing time
- Blacklist-based/IP-based: attackers change the host IP or path
- Whitelist-based: detecting and maintaining the list is not easy
- Challenge-response-based: deadlock when both parties implement it
Note on content-based approaches: they consider several factors, such as the number of words in the page title and body along with their average length, the fraction of visible content and globally popular words, compressibility, and n-gram likelihoods.
