SCAVENGER: A JUNK MAIL CLASSIFICATION PROGRAM
Rohan Malkhare
Committee: Dr. Eugene Fink, Dr. Dewey Rundus, Dr. Alan Hevner

Outline: Introduction, Related Work, Algorithm, Measurements, Implementation, Future Work

Introduction
Anti-spam efforts:
– Legislation
– Technology
  – White listing of email addresses
  – Black listing of email addresses/domains
  – Challenge-response mechanisms
  – Content filtering
– Learning techniques

Introduction
Learning techniques for spam classification:
– Feature extraction
– Assignment of weights to individual features, representing the predictive strength of a feature
– Combining the weights of extracted features during classification to numerically determine whether a mail is spam or legitimate

Introduction
Current algorithms:
– Words or phrases as features
– Probabilities of occurrence in the spam/legitimate collections as weights
– Bayes' rule or one of its variants for combining weights

Outline: Introduction, Related Work, Algorithm, Measurements, Implementation, Future Work

Related Work
Cohen (1996):
– RIPPER, a rule-learning system
– Rules in a human-comprehensible format
Pantel & Lin (1998):
– Naïve Bayes with words as features
Microsoft Research (1998):
– Naïve Bayes with the mutual-information measure to select the features with the strongest resolving power
– Words and domain-specific attributes of spam used as features

Related Work
Paul Graham (2002): A Plan for Spam
– Very popular algorithm, credited with starting the craze for Bayesian filters
– Uses naïve Bayes with words as features
Bill Yerazunis (2002): CRM114 sparse binary polynomial hashing algorithm
– Most accurate algorithm to date (over 99.7% accuracy)
– Distinctive because of its powerful feature extraction technique
– Uses the Bayesian chain rule for combining weights
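For context, a minimal sketch of how a Graham-style naïve Bayes filter combines per-word spam probabilities. The combining formula is the one from "A Plan for Spam"; the probability table, token selection, and smoothing are simplified away, and the example values are hypothetical:

```python
def graham_combine(tokens, p_spam):
    """Combine per-token spam probabilities p_i using
    P = prod(p_i) / (prod(p_i) + prod(1 - p_i))."""
    prod_p, prod_q = 1.0, 1.0
    for t in tokens:
        if t in p_spam:              # tokens never seen in training are skipped
            prod_p *= p_spam[t]
            prod_q *= 1.0 - p_spam[t]
    return prod_p / (prod_p + prod_q)

# Hypothetical per-word probabilities, for illustration only
table = {"viagra": 0.99, "meeting": 0.02, "free": 0.90}
print(graham_combine(["free", "viagra"], table))   # ~0.999 -> spam
```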

Related Work
CRM114 algorithm, feature extraction:
– Slide a window of 5 words over the incoming text
– Generate order-preserving sub-phrases containing all combinations of the windowed words
– For one window, 2^4 = 16 features are generated
– Very high computational complexity
– E.g., for "Click here to buy Viagra", the features generated would be "Click", "Click here", "Click to", "Click buy", "Click Viagra", "Click here to", "Click here buy", etc.
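A minimal sketch of this windowed sub-phrase generation, following the slide's example (the first word of each window is kept and combined with every ordered subset of the remaining four):

```python
from itertools import combinations

def sbph_features(words, window=5):
    """Sparse binary polynomial hashing, as described above: for each
    5-word window, emit the head word joined with every ordered subset
    of the other window words (2^4 = 16 features per window)."""
    feats = []
    for i in range(len(words) - window + 1):
        head, rest = words[i], words[i + 1:i + window]
        for r in range(len(rest) + 1):
            for combo in combinations(rest, r):   # preserves word order
                feats.append(" ".join([head, *combo]))
    return feats

print(sbph_features("Click here to buy Viagra".split()))
# ['Click', 'Click here', 'Click to', 'Click buy', 'Click Viagra',
#  'Click here to', ...]  (16 features for this single window)
```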

Outline: Introduction, Related Work, Algorithm, Measurements, Implementation, Future Work

Algorithm
Feature extraction:
– Sentences in a message are identified using the delimiting characters ‘.’, ‘?’, ‘!’, ‘;’, ‘ ’
– All possible word-pairings are formed from each sentence
– Commonly occurring words are skipped
– These word-pairings serve as the features used for classification

Algorithm
Feature extraction (continued):
– If the number of words exceeds a constant K, then each series of K words is treated as a sentence
– The value of K is set to 20
– E.g., for "There is a problem in the tables that have been copied to the database", the features formed would be "problem tables", "tables problem", "problem copied", "copied problem", "problem database", "database problem", etc.

Algorithm
Feature extraction (continued):
– The entire subject line is treated as one sentence
– For HTML, all content within ‘ ’ is treated as one sentence
– For a sentence of n words, ‘scavenger’ creates (n-1)*(n-2) features, as compared to the 2^(n-1) created by CRM114
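A minimal sketch of this feature-extraction step. Two assumptions: the blank delimiter on the earlier slide is taken to be the newline, and the stop-word list is a small illustrative stand-in (the actual list of common words skipped by ‘scavenger’ is not given in the slides):

```python
import re

# Illustrative stop-word list; the real list used by 'scavenger' is not
# given in the slides.
STOP_WORDS = {"there", "is", "a", "in", "the", "that", "have", "been", "to", "of"}

def scavenger_features(text, k=20):
    """Split the text into sentences on . ? ! ; and newline (assumed),
    drop common words, treat each run of K words as its own sentence,
    and emit every ordered word-pairing."""
    feats = []
    for sentence in re.split(r"[.?!;\n]", text):
        words = [w for w in sentence.split() if w.lower() not in STOP_WORDS]
        for start in range(0, len(words), k):      # cap sentences at K words
            chunk = words[start:start + k]
            for i in range(len(chunk)):
                for j in range(len(chunk)):
                    if i != j:
                        feats.append(f"{chunk[i]} {chunk[j]}")
    return feats

print(scavenger_features(
    "There is a problem in the tables that have been copied to the database"))
# ['problem tables', 'problem copied', 'problem database',
#  'tables problem', 'tables copied', ...]
```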

Algorithm
Weight assignment:
– Weights represent the predictive strength of features
– Discretized values are assigned as weights, depending on whether the feature is a ‘strong’ or a ‘weak’ piece of evidence
– ‘Strong’ pieces of evidence should have a high impact on the classification decision; ‘weak’ pieces of evidence should have a low impact

Algorithm
Weight assignment (continued):
– Features are categorized into ‘strong’ and ‘weak’ pieces of evidence on the basis of frequency of occurrence in the spam/legitimate collections, exclusivity of occurrence, and heuristic rules such as the distance between the words in the word-pairing and whether the feature comes from the subject or the body
– Only exclusively occurring features are assigned weights; features occurring in both the spam and legitimate collections are ignored

Algorithm
Weight assignment (continued):
– What weights should be selected for the ‘strong’ and the ‘weak’ evidence?
– During classification, the class with more pieces of ‘strong’ evidence should ‘win’, regardless of the number of pieces of ‘weak’ evidence on either side
– In the absence of ‘strong’ evidence on either side, the class with more pieces of ‘weak’ evidence should ‘win’

Algorithm
Weight assignment (continued):
– Intuitively, we would like as much ‘distance’ as possible between the values chosen for the ‘strong’ and ‘weak’ evidence
– We select 0.9 as the weight for ‘strong’ evidence and 0.1 as the weight for ‘weak’ evidence
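A minimal sketch of the weight-assignment step. The exclusivity rule and the 0.9/0.1 weights follow the slides; the strong/weak test below uses only a frequency cutoff, whereas the actual algorithm also applies the heuristic rules described earlier (word distance, subject vs. body), and the cutoff value is a made-up stand-in:

```python
def assign_weights(spam_counts, legit_counts, strong_cutoff=5):
    """Assign discretized weights (0.9 strong, 0.1 weak) to features that
    occur exclusively in one collection; features seen in both collections
    are ignored. strong_cutoff is a hypothetical stand-in for the real
    strong/weak categorization."""
    def weights(own, other):
        return {f: (0.9 if n >= strong_cutoff else 0.1)
                for f, n in own.items() if f not in other}
    return weights(spam_counts, legit_counts), weights(legit_counts, spam_counts)
```

With these values, a single piece of ‘strong’ evidence (0.9) outweighs up to nine pieces of ‘weak’ evidence (9 × 0.1), which gives the "strong wins" behavior described above.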

Algorithm
Combining weights:
– Total spam evidence = sum of the spam weights of the matching features
– Total legitimate evidence = sum of the legitimate weights of the matching features
– If total spam evidence >= M * total legitimate evidence, then the message is spam
– M is the threshold parameter, which can be used as a ‘tuning knob’
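The combining step then reduces to a weighted-sum comparison; a minimal sketch:

```python
def classify(features, spam_weights, legit_weights, m=1.0):
    """Sum the weights of the message's matching features and flag the
    message as spam when total spam evidence >= M * total legitimate
    evidence. M is the tuning-knob threshold from the slide."""
    spam_evidence = sum(spam_weights.get(f, 0.0) for f in features)
    legit_evidence = sum(legit_weights.get(f, 0.0) for f in features)
    return spam_evidence >= m * legit_evidence   # True -> spam
```

Raising M above 1 makes the filter more reluctant to flag spam (fewer false positives, more missed spam), which is the trade-off explored in the measurements below.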

Outline: Introduction, Related Work, Algorithm, Measurements, Implementation, Future Work

Measurements
Precision and recall are used as the measurement parameters:
– Spam precision = messages correctly classified as spam / total messages classified as spam
– Spam recall = messages correctly classified as spam / total spam messages in the testing set
– Precision measures accuracy with respect to false positives
– Recall measures the filter's capacity to catch spam
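In standard notation, with TP the spam messages correctly flagged, FP the legitimate messages flagged as spam, and FN the spam messages that got through, these definitions read:

\[
\text{Spam Precision} = \frac{TP}{TP + FP},
\qquad
\text{Spam Recall} = \frac{TP}{TP + FN}
\]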

Measurements
Testing data:
– Downloaded around 5600 spam messages
– Used around 960 legitimate mails from Dr. Fink's mailbox
Cross-validation:
– K-fold cross-validation for two values of K: K=2 and K=5
– K=2: data divided into 2 equal-sized sets
– K=5: data divided into 5 equal-sized sets
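A minimal sketch of the K-fold protocol, assuming a simple shuffle-and-split (the slides do not specify the splitting procedure beyond equal-sized sets):

```python
import random

def k_fold_splits(messages, k):
    """Yield (training, testing) splits: shuffle the data, cut it into
    K equal-sized folds, and use each fold once as the testing set."""
    msgs = messages[:]
    random.shuffle(msgs)
    fold = len(msgs) // k   # equal-sized folds; any remainder stays in training
    for i in range(k):
        test = msgs[i * fold:(i + 1) * fold]
        train = msgs[:i * fold] + msgs[(i + 1) * fold:]
        yield train, test

# e.g. for K=2 and K=5 as in the experiments:
# for train, test in k_fold_splits(all_messages, 5): ...
```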

Measurements
Comparison with Paul Graham's naïve Bayes algorithm
Implemented Graham's algorithm with two methods of feature extraction:
– Words + phrases as features
– Feature extraction similar to ‘scavenger’

Measurements

ALGORITHM                                    | K=5 SPAM PRECISION (AVG) | K=5 SPAM RECALL (AVG) | K=2 SPAM PRECISION (AVG) | K=2 SPAM RECALL (AVG)
Scavenger (M=1)                              | 100%                     | 99.85%                | 99.92%                   | 99.72%
Naïve Bayes (words + phrases)                | 100%                     | 98.87%                | 99.80%                   | 97.03%
Naïve Bayes (scavenger feature extraction)   | 100%                     | 99.15%                | 99.65%                   | 98.68%

Measurements
Effect of the threshold M: missed spam (%), false positives (%), and spam recall (%) for varying M [table data not recovered]

Measurements

Why does ‘scavenger’ perform better than naïve Bayes?
– Powerful feature extraction (as powerful as CRM114)
– Calculates predictive strength on the basis of frequency of occurrence as well as heuristic rules

Outline: Introduction, Related Work, Algorithm, Measurements, Implementation, Future Work

Implementation
Windows PC-based filter
Runs for individual accounts on IMAP mail servers
Three modules:
– Configuration
– Training
– Classification

Implementation
The classifier runs as a Windows service:
– Connects to the mail server every ten minutes
– Downloads new messages and classifies them
– Moves messages classified as spam to a pre-configured folder on the server

Outline: Introduction, Related Work, Algorithm, Measurements, Implementation, Future Work

Future Work
– Incorporating message headers during the feature extraction step
– Incorporating domain-specific attributes of spam during the weight combination step

Publications
– Dr. William Yerazunis (inventor of CRM114) mentioned the ‘scavenger’ algorithm at the MIT Spam Conference on January 16, 2004
– To be published in the First Conference on Email and Anti-Spam in Palo Alto, California, in July 2004

Acknowledgements
Dr. Eugene Fink
Dr. Dewey Rundus, Dr. Alan Hevner
Dr. Paul Graham, MIT, Boston
Dr. William Yerazunis, MERL, Boston