Spam Sinkholing Nick Feamster
Introduction
Goal: Identify bots (and botnets) by observing second-order effects
–Observe application behavior that's likely to contain bot activity (spam is a good candidate: > 85% of spam was coming from bots as of 4Q 2005)
Advantages:
–Direct observation of behavior
–Potentially very wide lens
–Passive
Disadvantage: No ground truth
Spam Collection Overview
Trap mail sent to dead domains
Log IPs
Perform active and passive measurements
–Traceroute
–Passive SYN fingerprints
–DNSBL lookups, etc.
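As a small illustration of the DNSBL lookup step: DNS blocklists are conventionally queried by reversing the octets of the IPv4 address and appending the list's zone, and an A-record answer means the address is listed. The sketch below only builds the query name. The function name and the zone `dnsbl.example.org` are placeholders, not part of the original setup.

```python
import ipaddress


def dnsbl_query_name(client_ip: str, zone: str) -> str:
    """Build the DNS name to query for an IPv4 address in a DNSBL zone."""
    # IPv4Address() validates the input and raises ValueError on bad addresses.
    octets = str(ipaddress.IPv4Address(client_ip)).split(".")
    # Reverse the octets and append the zone (hypothetical zone in the example).
    return ".".join(reversed(octets)) + "." + zone.strip(".")


# Documentation-range address and a placeholder zone.
print(dnsbl_query_name("192.0.2.99", "dnsbl.example.org"))
# prints 99.2.0.192.dnsbl.example.org
```

Resolving that name (for example with a stub resolver) and recording whether an answer comes back would give one DNSBL data point per logged client IP.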
Data Collection Overview
[Diagram; labels: Spammer, DNS, MX lookups, Resolve to sinkhole, Mail Avenger, sendmail, Blowtorch (GTISC), dynamo, rsync (schema on wiki)]
O(100k) pieces of spam per week
Hundreds of domains
Sample Mail Avenger Header
Mail Avenger: highly configurable SMTP server that collects many useful statistics
Database Schema Sample
CREATE TABLE spamtrap_ (
    entrytime       timestamp with time zone default NULL,
    trap_domain     text default NULL,
    client_ip       ip4 default NULL,
    client_port     smallint default NULL,
    traceroute_time timestamp with time zone default NULL,
    to_             text default NULL,
    delivered_to    text default NULL,
    subject         text default NULL,
    xmailer         text default NULL,
    from_           text default NULL,
    id              serial,
    FOREIGN KEY (dnsbl_id) REFERENCES spamtrap_dnsbl (dnsbl_id)
) tablespace dataspace;
Uses for Data
Identification: Low-confidence list of likely bot IPs
Bootstrapping: Use as a starter set for some intractable analysis problems
–Use this low-confidence list to prune DNSBL graph mining
–Feed this information back to ISPs to focus mining
Second-order effects
–Analysis of hosting sites for URLs
–Clustering
Analysis Within Spam Dataset
Clustering to identify groups (coordination suggests likely bot)
–Temporal-based correlation
–Content-based correlation (based on URLs)
Analysis of hosting URLs: Perhaps useful for identifying phishing sites
–Where hosted?
–Transience?
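One way the URL-based content correlation could be sketched: group sending IPs by the hostnames of URLs found in message bodies, and keep hosts advertised by several distinct senders. This is an illustrative sketch only. The function names, the `(client_ip, body)` input shape, and the `min_senders` threshold are assumptions, not the method used for this dataset.

```python
import re
from collections import defaultdict
from urllib.parse import urlsplit

# Rough URL matcher for illustration; real spam needs more careful parsing.
URL_RE = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)


def hosts_in_body(body):
    """Return the set of lowercase hostnames of http(s) URLs in a message body."""
    hosts = set()
    for url in URL_RE.findall(body):
        try:
            host = urlsplit(url).hostname
        except ValueError:
            continue  # skip malformed URLs
        if host:
            hosts.add(host.lower())
    return hosts


def cluster_by_url_host(messages, min_senders=2):
    """messages: iterable of (client_ip, body) pairs (hypothetical input shape).

    Returns {hostname: set of client IPs} for hostnames advertised by at
    least min_senders distinct IPs.
    """
    senders = defaultdict(set)
    for client_ip, body in messages:
        for host in hosts_in_body(body):
            senders[host].add(client_ip)
    return {h: ips for h, ips in senders.items() if len(ips) >= min_senders}
```

Many distinct IPs advertising the same host is only a hint of coordination. A temporal check, such as requiring the messages to arrive within a short window, could be layered on top.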
Correlation: Across Datasets
DNSBL datasets require bootstrapping
–As per SRUTI paper
–Use spam dataset as a graph pruning mechanism
Possibility: Use spam sinkhole as a source for malware. Strip attachments.
–Likely already being done by lots of others
Get information about exfiltration addresses and domains from binary analysis
–Look for those appearing in sinkhole to build confidence and monitor ongoing activity
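The last cross-check could look roughly like this: given domains recovered from binary analysis, flag sinkhole records whose host is one of those domains or a subdomain of one. The sketch is minimal, and its function names and the `(client_ip, host)` record shape are hypothetical.

```python
def matches_indicator(host, indicators):
    """True if host equals an indicator domain or is a subdomain of one."""
    labels = host.lower().rstrip(".").split(".")
    # Check every label-aligned suffix so "notevil.example" does not match "evil.example".
    return any(".".join(labels[i:]) in indicators for i in range(len(labels)))


def sinkhole_hits(records, indicator_domains):
    """records: iterable of (client_ip, host) pairs seen in the sinkhole (assumed shape).

    Returns the records whose host matches a domain from binary analysis.
    """
    indicators = {d.lower().rstrip(".") for d in indicator_domains}
    return [(ip, host) for ip, host in records if matches_indicator(host, indicators)]
```

Counting hits per domain over time would be one way to build confidence and watch for ongoing activity.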