Database Techniques for fighting SPAM Telvis Calhoun CSc 8710 – Advanced Databases Dr. Yingshu Li.


Everybody knows about SPAM
- Spam is unsolicited bulk email sent for profit and general mayhem.
- Botnets are distributed networks of hijacked IPs, and those IPs are hard to track.
- 70 billion emails are sent per day; 70% are spam.

How does Anti-SPAM use DBs?
- Spam databases collect network-layer and application-layer data.
- IP Blacklisting
  - Detects a malicious host during the SMTP dialog.
  - Detection is difficult because of dynamic (DHCP) IP addresses, botnet size, and good IPs used to forward email.
- Content Analysis
  - Detects malicious mail content.
  - Requires that the MTA complete the SMTP connection.
  - Arms race between content filter designers and spammers.

Summary of DB Techniques
- Grey Space Analysis
- Trinity: Peer-to-Peer Database
- Behavioral Blacklisting
- Progressive Scanning
- Content filtering using Bayesian Analysis

Grey Space Analysis
- Characterize the IP space: active vs. grey space.
- IP flow database.
- Detect malicious IPs by extracting dominant scanning ports (DSPs).
- Find DSPs using the relative uncertainty algorithm.

Mining Technique: Relative Uncertainty
- Determines the entropy of the destination ports in the flows database.
- Formula: RU := entropy of the dstPrt distribution ÷ maximum entropy.
- p := number of flows with port[i] ÷ total flows.
- RU close to 1 indicates a roughly even distribution; RU near 0 indicates an uneven distribution, where a few ports dominate.
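The ratio on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `relative_uncertainty`, the `port_space` parameter, and the choice of `log2(min(port_space, number of flows))` as the maximum entropy are assumptions made for the example.

```python
import math
from collections import Counter

def relative_uncertainty(dst_ports, port_space=65536):
    """Entropy of the destination-port distribution divided by its maximum.

    dst_ports holds one destination port per flow. The normalizer used here,
    log2(min(port_space, number of flows)), is an assumption of this sketch.
    """
    total = len(dst_ports)
    max_entropy = math.log2(min(port_space, total)) if total > 1 else 0.0
    if max_entropy == 0.0:
        return 0.0
    counts = Counter(dst_ports)
    # p = flows with a given port / total flows, as defined on the slide.
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return entropy / max_entropy

even_flows = [80, 25, 22, 443]             # every flow hits a different port
skewed_flows = [445] * 97 + [80, 25, 22]   # one port dominates

print(relative_uncertainty(even_flows))    # close to 1: even distribution
print(relative_uncertainty(skewed_flows))  # close to 0: a dominant port
```

In the skewed case port 445 would be the candidate dominant scanning port, since it accounts for almost all of the flows.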

Grey Space Algorithm
1. Isolate flows toward grey space.
2. Find the dominant scanning ports (DSPs).
3. Find outside hosts with DSP flows toward grey and active hosts.
4. Find the inside-host footprint for those outside hosts.
5. Classify each adversary as a hitter or a scanner.

Focused Hitters vs Bad Scanners
- Focused hitters tend to send tens or hundreds of flows to each grey host.
- Bad scanners send one or a few flows to each grey host.
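The hitter/scanner distinction amounts to a flows-per-grey-host measurement. The sketch below assumes a cutoff of 10 flows per grey host (taken loosely from "tens or hundreds"); the function name `classify_adversary` and the `hitter_threshold` parameter are hypothetical and do not come from the paper.

```python
from collections import Counter

def classify_adversary(grey_flows, hitter_threshold=10):
    """Classify one outside host from its flows toward grey space.

    grey_flows: list of grey-host IPs, one entry per flow.
    hitter_threshold: assumed cutoff for the average flows per grey host.
    """
    if not grey_flows:
        return "unknown"
    flows_per_host = Counter(grey_flows)
    average = len(grey_flows) / len(flows_per_host)
    return "focused hitter" if average >= hitter_threshold else "bad scanner"

hitter = ["10.0.0.1"] * 40 + ["10.0.0.2"] * 30    # many flows to few hosts
scanner = ["10.0.0.%d" % i for i in range(50)]    # one flow to each of many hosts

print(classify_adversary(hitter))    # focused hitter
print(classify_adversary(scanner))   # bad scanner
```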

Trinity: Distributed IP Reputation Database
- Botnets send a large amount of data in a short amount of time.
- Trinity uses a distributed in-memory hash table containing IP reputation entries.
- Each peer holds 10 to 50 megabytes of data (833K to 4.17M entries, which works out to about 12 bytes per entry).

Chord Distributed Hash Table
- Distributes data over a large P2P network
  - Quickly find any given item
- Stores key/value pairs
  - The key value controls which node(s) store the value
  - Each node is responsible for some section of the space
- Basic operations
  - Store(key, val)
  - val = Retrieve(key)

Chord (cont)
- Each node chooses an n-bit ID
  - IDs are arranged in a ring
- Each lookup key is also an n-bit ID
  - i.e., the hash of the real lookup key
  - Node IDs and keys occupy the same space!
- Each node is responsible for storing keys "near" its ID
  - Replication is usually between the current and previous node
  - Items can be replicated at multiple successors
  - No single host contains a large fraction of a particular space, to guard against DDoS.
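The ring idea can be illustrated with a single-process simulation. This is only a sketch under stated assumptions: a real Chord deployment routes lookups between peers with finger tables, which are omitted here, and the `Ring` class, the `chord_id` helper, and the 16-bit ID size are hypothetical names and values chosen for readability.

```python
import hashlib
from bisect import bisect_left

ID_BITS = 16  # assumed small ID space so the ring is easy to inspect

def chord_id(name):
    """Hash a node name or lookup key into the shared n-bit ID space."""
    digest = hashlib.sha1(name.encode()).digest()
    return int.from_bytes(digest, "big") % (2 ** ID_BITS)

class Ring:
    """All nodes live in one process; no network routing is modeled."""

    def __init__(self, node_names):
        self.nodes = sorted((chord_id(n), n) for n in node_names)
        self.ids = [node_id for node_id, _ in self.nodes]
        self.tables = {n: {} for n in node_names}

    def successor(self, key):
        """First node whose ID is equal to or follows the key's ID on the ring."""
        index = bisect_left(self.ids, chord_id(key))
        return self.nodes[index % len(self.nodes)][1]  # wrap around the ring

    def store(self, key, val):
        self.tables[self.successor(key)][key] = val

    def retrieve(self, key):
        return self.tables[self.successor(key)].get(key)

ring = Ring(["peer-a", "peer-b", "peer-c"])
ring.store("192.0.2.7", 5)
print(ring.successor("192.0.2.7"), ring.retrieve("192.0.2.7"))
```

Because node names and keys are hashed into the same space, adding or removing a peer only moves the keys in the arc next to it.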

Database Updates
- Compute the number of interval quarters since the last update.
- Shift and update the counters accordingly.
- Determine the site responsible for the entry and send the update over UDP.
- Once received by the owner site, forward the entry to k peers using TCP.
- Updates are commutative, so order doesn't matter.
- Consistency is not required.
- Even if a host goes down, the database can be rebuilt in an hour.
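The first two steps can be read as a sliding window of counters. The sketch below is one possible interpretation, not Trinity's actual data layout: it assumes four counters that each cover a 15-minute quarter of a one-hour interval, and the `ReputationEntry` class and `QUARTER_SECONDS` constant are hypothetical.

```python
QUARTER_SECONDS = 15 * 60  # assumed: a one-hour interval split into four quarters

class ReputationEntry:
    """Per-IP email counters; counters[0] is the current quarter."""

    def __init__(self, now):
        self.counters = [0, 0, 0, 0]
        self.last_quarter = int(now // QUARTER_SECONDS)

    def record(self, now, emails=1):
        # Compute the number of quarters elapsed since the last update.
        quarter = int(now // QUARTER_SECONDS)
        shift = min(max(0, quarter - self.last_quarter), len(self.counters))
        if shift:
            # Shift old counts toward the end and drop those that aged out.
            kept = self.counters[:len(self.counters) - shift]
            self.counters = [0] * shift + kept
            self.last_quarter = quarter
        self.counters[0] += emails

    def total(self):
        return sum(self.counters)

entry = ReputationEntry(now=0)
entry.record(now=10, emails=3)
entry.record(now=20, emails=2)
entry.record(now=QUARTER_SECONDS + 1)
print(entry.counters)  # [1, 5, 0, 0]
```

Updates that fall in the same quarter are plain additions, which is why their arrival order does not change the result.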

Security
- Secure communications for neighbors.
- Limit updates for nodes that have sent more than 100 emails in 10 minutes.
- Falsified source IPs can cause false positives.

Clustering Technique for Behavioral Blacklisting
- Identify spammers that attack many domains.
- The sending pattern is the distribution and frequency of the domains an IP sends to.
- Form clusters of sending patterns.
- Use the clusters to identify new attacks.

Spectral Clustering
- Divide phase: produces a tree whose leaves are the elements of the set.
- Merge phase: starts with each leaf in its own cluster and merges going up the tree.

Vector Generation
- Database contains M(i,j,k): the total number of times that IP 'i' sent to domain 'j' in time slot 'k'.
- Find the total flows for each IP/domain pair across the entire time axis (M').
- Generate the feature vector from M'
  - IP :=
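Collapsing the time axis can be sketched as a sum over k for every (IP, domain) pair. The function name `feature_vectors` and the record layout `(ip, domain, time_slot, count)` are assumptions for illustration; treating row i of M' as the feature vector of IP i is also an assumption of this sketch.

```python
from collections import defaultdict

def feature_vectors(records, domains):
    """Build M' from M(i, j, k) records.

    records: iterable of (ip, domain, time_slot, count) tuples.
    domains: ordered list of domains; it fixes the column order of M'.
    Returns {ip: [total sent to domains[0], total sent to domains[1], ...]}.
    """
    column = {domain: j for j, domain in enumerate(domains)}
    m_prime = defaultdict(lambda: [0] * len(domains))
    for ip, domain, _time_slot, count in records:
        m_prime[ip][column[domain]] += count  # sum across the time axis
    return dict(m_prime)

records = [
    ("198.51.100.1", "a.example", 0, 2),
    ("198.51.100.1", "a.example", 1, 3),
    ("198.51.100.1", "b.example", 0, 1),
    ("198.51.100.2", "b.example", 2, 4),
]
print(feature_vectors(records, ["a.example", "b.example"]))
# {'198.51.100.1': [5, 1], '198.51.100.2': [0, 4]}
```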

Clustering
- Clusters contain IP addresses that send mail to similar sets of domains.
- Define the traffic pattern for each cluster
  - Average the rows (vector contents) for all IPs in the cluster.
- IPxIP matrix of related spam senders.

Classification
- Input: IP vector 'r' := 1 x d vector.
- Use a similarity algorithm to find the closest cluster.
- The spam score is the maximum similarity of r with any cluster.
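Cluster averaging and scoring can be sketched together. The slides do not name the similarity function, so cosine similarity is swapped in here purely as an assumption; the helper names `average_pattern`, `cosine`, and `spam_score` are hypothetical.

```python
import math

def average_pattern(vectors):
    """Traffic pattern of a cluster: the column-wise average of its IP vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]

def cosine(a, b):
    """Cosine similarity; an assumed stand-in for the similarity algorithm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def spam_score(r, cluster_patterns):
    """Maximum similarity of the 1 x d vector r with any cluster pattern."""
    return max(cosine(r, pattern) for pattern in cluster_patterns)

patterns = [
    average_pattern([[4, 0, 0], [6, 0, 0]]),   # cluster that targets domain 0
    average_pattern([[0, 1, 1], [0, 3, 3]]),   # cluster that targets domains 1 and 2
]
print(spam_score([2, 0, 0], patterns))  # matches the first cluster closely
print(spam_score([0, 1, 0], patterns))  # partial match with the second cluster
```

A new IP whose sending pattern lines up with an existing cluster gets a score near 1 even if that IP has never been seen before.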

Progressive Scanner
- Maintains a Feature Instance (FI) database.
- An FI is any feature that can discriminate HAM from SPAM.
- Dynamic features: use any feature that identifies mail (contents, network, etc.).
  - The paper only uses URL links as FIs.

PEC Architecture
- FI states
  - Grey (ambiguous FI)
  - Black (spam FI)
  - White (HAM FI)
- Blacklist module: extracts and hashes FIs.
- Scoreboard module: tracks FI occurrences and timestamp (age).

Competitive Aging and Scoring System (CASS)
- Transitions between states are governed by
  - Score: the number of occurrences of the FI
  - Age: the time since the last score update
- When the score (R) exceeds the score threshold (S), the FI moves from Grey to Black.
- When the age (A) exceeds the age threshold (M), the FI moves from Grey to White.
  - Purge
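The two competing transitions can be sketched as a small state machine. This is an illustrative reading of the slide rather than the paper's implementation: the `FeatureInstance` class and the `observe` and `age_out` functions are hypothetical names, and time is a plain number supplied by the caller.

```python
GREY, BLACK, WHITE = "grey", "black", "white"

class FeatureInstance:
    """One scoreboard entry, for example a hashed URL."""

    def __init__(self, now):
        self.state = GREY
        self.score = 0       # R: number of occurrences seen so far
        self.updated = now   # time of the last score update

def observe(fi, now, score_threshold):
    """Count one occurrence; Grey becomes Black once R exceeds S."""
    fi.score += 1
    fi.updated = now
    if fi.state == GREY and fi.score > score_threshold:
        fi.state = BLACK

def age_out(fi, now, age_threshold):
    """Grey becomes White once the age A (time since last update) exceeds M."""
    if fi.state == GREY and now - fi.updated > age_threshold:
        fi.state = WHITE

bulk = FeatureInstance(now=0)
for t in range(3):
    observe(bulk, now=t, score_threshold=2)
print(bulk.state)  # black: seen often enough within a short time

rare = FeatureInstance(now=0)
observe(rare, now=1, score_threshold=2)
age_out(rare, now=20, age_threshold=10)
print(rare.state)  # white: aged out before reaching the score threshold
```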

Bayesian Content Filtering
- Determine the probability that a message is spam based on its contents.
- Use a Bayesian combination of spam probabilities.

Bayesian Training
- Requires a training corpus of HAM/SPAM.
- Find interesting tokens.
- Create HAM/SPAM token tables.
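Building the token tables can be sketched as two counters plus a per-token spam probability. This is a simplified variant loosely modeled on Graham's "A Plan for Spam", not a faithful copy of it: whitespace tokenization, per-message token counts, and the clamp to the range 0.01 to 0.99 are assumptions of the example, and `train` is a hypothetical name.

```python
from collections import Counter

def train(ham_messages, spam_messages):
    """Return {token: spam probability}; both corpora must be non-empty."""
    ham_table, spam_table = Counter(), Counter()
    for message in ham_messages:
        ham_table.update(set(message.lower().split()))   # count once per message
    for message in spam_messages:
        spam_table.update(set(message.lower().split()))
    probabilities = {}
    for token in set(ham_table) | set(spam_table):
        ham_rate = ham_table[token] / len(ham_messages)
        spam_rate = spam_table[token] / len(spam_messages)
        raw = spam_rate / (ham_rate + spam_rate)
        # Clamp so that no single token is treated as absolute proof.
        probabilities[token] = min(0.99, max(0.01, raw))
    return probabilities

table = train(
    ["meeting at noon", "lunch at noon", "call me now"],
    ["cheap pills", "cheap watches now"],
)
print(table["cheap"], table["noon"], table["now"])
```

Tokens seen only in spam end up at 0.99, tokens seen only in ham at 0.01, and shared tokens such as "now" land in between.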

Classification
- Tokenize the new message.
- Calculate the spam probability for each token.
- Derive the overall spam probability using Bayes' formula.
- Non-spam tokens outweigh spam tokens to prevent false positives.
- Sample message (spam probability = 0.0): "Hi, Just a reminder: don't forget your allergy prescription when you visit New York City today. Mom"
[Figure: Sample Message Spam Probability Table]
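The combination step can be sketched with the naive Bayes product rule. The defaults below (0.4 for unseen tokens, the 15 tokens farthest from 0.5) follow the values described in Graham's "A Plan for Spam"; the function name `spam_probability`, the whitespace tokenizer, and the sample table are assumptions, and the table is expected to hold probabilities strictly between 0 and 1.

```python
def spam_probability(message, token_probs, unknown=0.4, top_n=15):
    """Combine per-token spam probabilities into one message score."""
    tokens = set(message.lower().split())
    probs = [token_probs.get(token, unknown) for token in tokens]
    if not probs:
        return unknown
    # Keep the most "interesting" tokens: those farthest from a neutral 0.5.
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    spam_product = ham_product = 1.0
    for p in probs[:top_n]:
        spam_product *= p
        ham_product *= 1.0 - p
    return spam_product / (spam_product + ham_product)

# Hypothetical token table; real values would come from training.
table = {"cheap": 0.99, "pills": 0.99, "reminder": 0.01, "mom": 0.01,
         "prescription": 0.2}

print(spam_probability("cheap pills", table))                # close to 1
print(spam_probability("reminder mom prescription", table))  # close to 0
```

Strong non-spam tokens pull the product toward zero, which is how a message that mentions a prescription can still score as ham.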

Real World Applications
- TrustedSource.org
- Messaging Security Architecture

Summary
- A variety of database techniques are used in anti-spam technology
  - IP blacklisting
  - Content filtering
- Databases can contain:
  - Network traffic: IP addresses, domains, ports
  - Message content: words, URLs, HTML text
- Challenges:
  - Scalability: must handle many connections or messages.
  - Minimizing false positive rates: cannot classify a HAM message as SPAM.
  - Finding useful SPAM features, using machine learning techniques.

References
- Brodsky, et al., A Distributed Content Independent Method for Spam Detection, HotBots 2007
- Jin, et al., Identifying and Tracking Suspicious Activities through IP Gray Space Analysis, MineNet 2007
- Liu, et al., High-Speed Detection of Unsolicited Bulk Emails, ANCS 2007
- Ramachandran, A., Filtering Spam with Behavioral Blacklisting, CCS 2007
- Cheng, et al., A Divide-and-Merge Methodology for Clustering, ACM Transactions on Database Systems, 2006
- Graham, P., A Plan for Spam
- Secure Computing Corporation, http://trustedsource.org