A Bayesian Approach to Filter Junk E-Mail. Yasir Iqbal, Master Student in Computer Science, Universität des Saarlandes. Seminar: Classification and Clustering.


A Bayesian Approach to Filter Junk E-Mail. Yasir Iqbal, Master Student in Computer Science, Universität des Saarlandes. Seminar: Classification and Clustering Methods for Computational Linguistics

2 Presentation Overview
Problem description (what is the spam problem?)
–Classification problem
–Naïve Bayes classifier
Logical system view
–Feature selection and representation
Results
–Precision and recall
Discussion

3 Spam/junk/bulk emails
The messages you spend time throwing away:
–Spam: unsolicited messages you do not want to receive
–Junk: irrelevant to the recipient, unwanted
–Bulk: mass mailings for business marketing (or to fill up mailboxes, etc.)

4 Problem examples
„You have won!!!!“ – you are almost the winner of $...
“Viagra” – generic Viagra available, order now
“Your order” – your item$ have to be $hipped
“Lose your weight” – no subscription required
“Assistance required” – an amount of 25 million US$
“Get login and password now” – age above 18
“Check this” – hi, your document has an error
“Download it” – free celebrity wallpapers download

5 Motivation
Who should decide what is spam, and how?
How do we get rid of spam automatically?
–Because “time is money”
–and because such emails can carry offensive material
What are computers for? “Let them work.”

6 How to fight? (techniques)
Rule-based filtering of emails
–if $SENDER$ contains “schacht” $ACTION$=$INBOX$
–if $SUBJECT$ contains “Win” $ACTION$=$DELETE$
–if $BODY$ contains “Viagra” $ACTION$=$DELETE$
–Problems: static rules, language dependent; how many rules, and who should define them?
Statistical filters (classifiers) based on message attributes
–Decision trees
–Support vector machines
–Naïve Bayes classifier (the one we’ll discuss)
–Problems: what if no features can be extracted? What about error loss?
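The rule-based approach above amounts to a first-match rule engine. A minimal sketch (the rule list, field names, and the `apply_rules` helper are illustrative, not part of any real mail client):

```python
# Minimal sketch of static rule-based filtering (rules are illustrative).
# Each rule names a message field, a substring to look for, and an action.
RULES = [
    ("sender", "schacht", "INBOX"),
    ("subject", "win", "DELETE"),
    ("body", "viagra", "DELETE"),
]

def apply_rules(message):
    """Return the action of the first matching rule, defaulting to INBOX."""
    for field, pattern, action in RULES:
        if pattern in message.get(field, "").lower():
            return action
    return "INBOX"

mail = {"sender": "spam@example.com", "subject": "You WIN!!!", "body": "..."}
print(apply_rules(mail))  # DELETE
```

The sketch also exposes the problems the slide lists: the rules are static, the matching is language dependent, and someone has to write and maintain every rule by hand.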

7 Classification tasks
A few other classification tasks:
–Text classification (like the mail message)
Content management, information retrieval
–Document classification
Similar to text classification
–Speech recognition
“What do you mean?”: yeah, you understand ;)
–Named entity recognition
“Reading” and “Bath”: cities or simple verbs?
–Biometric sensors for authentication
Fingerprints, faces… to identify someone

8 Training methods
Offline learning:
–Some training data, prepared manually with annotations (used to train the system before testing), e.g.:
“hi, have you thought about online credit?” → spam
“Soha! sorry, cannot reach you at 18:00” → legitimate
Online learning:
–At run time the user increases the “knowledge” of the system through feedback on its decisions. Example: clicking “Spam” or “Not Spam” in the Yahoo mail service.

9 Yahoo Mail (Online learning)

10 Model overview
Steps:
–Training data of annotated emails
–A set of classes
In our case two possible classes; can be further personalized
–Feature extraction (text etc.)
Tokenization, domain-specific features; the most useful features are selected
–Classify (each message/email)
Calculate posterior probabilities
–Evaluate results (precision/recall)
Flow: training/test emails → feature extraction & selection → classify (spam?) → evaluate

11 Message attributes (features)
These are the indicators for classifying messages as “legitimate” or “spam”. Features of a message:
–Words (tokens): free, win, online, enlarge, weight, money, offer…
–Phrases: “FREE!!!”, “only $”, “order now”…
–Special characters: $pecial, grea8, “V i a g r a”
–Mail headers: sender name, to and from addresses, domain name / IP address

12 Feature vector matrix (binary variables)
# | “online” | “Viagra” | “Order now!!!” | “offer” | “win” | SPAM?
(each row is one message: 1 if the feature occurs in it, 0 otherwise; the SPAM? column is YES or NO)
Words/phrases as features: 1 if the feature exists, otherwise 0.
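A row of the matrix above can be built by testing a message against a fixed feature list. A minimal sketch (the vocabulary and the `feature_vector` helper are illustrative):

```python
# Sketch: turn a message into a binary feature vector over a fixed vocabulary.
FEATURES = ["online", "viagra", "order now!!!", "offer", "win"]  # illustrative

def feature_vector(text):
    """Return a 0/1 vector: 1 where the feature occurs in the text."""
    text = text.lower()
    return [1 if f in text else 0 for f in FEATURES]

print(feature_vector("Order NOW!!! Viagra available online"))  # [1, 1, 1, 0, 0]
```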

13 Feature selection
How to select the most prominent features?
–Words/tokens, phrases, header information
Text of the email, HTML markup, header fields, addresses
–Removing insignificant features
Calculate the mutual information between each feature and the class
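For a binary feature indicator and a binary class label, the mutual information used for feature selection can be computed directly from co-occurrence counts. A sketch (the pair-list input format is an assumption for illustration):

```python
import math

def mutual_information(msgs):
    """msgs: list of (has_feature, is_spam) boolean pairs, one per message.
    Returns the mutual information (in bits) between feature and class."""
    n = len(msgs)
    mi = 0.0
    for f in (True, False):
        for c in (True, False):
            p_fc = sum(1 for x, y in msgs if x == f and y == c) / n
            p_f = sum(1 for x, _ in msgs if x == f) / n
            p_c = sum(1 for _, y in msgs if y == c) / n
            if p_fc > 0:  # 0 * log 0 is taken as 0
                mi += p_fc * math.log2(p_fc / (p_f * p_c))
    return mi
```

A feature that perfectly predicts the class yields 1 bit; an independent feature yields 0, so keeping the top-k features by this score discards the uninformative ones.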

14 Conditional probability
Probability of an event B given an observed event A:
–P(B | A) = P(A | B) * P(B) / P(A)
The probability of event A must be > 0 (A must have occurred)
(Diagram: Venn diagram of the feature set A and the class SPAM B; their intersection is the probability that A and B occur together.) We calculate the probability that the observed features belong to the SPAM or the LEGITIMATE class.

15 How to apply this to the problem?
X = {x1, x2, x3, …, xn} is a feature vector
–a set of features, e.g. X = {“online”, “credit”, “now!!!”, …, “Zinc”}
C = {c1, c2, c3, …, ck} is a set of classes
–in our case only two classes, i.e. C = {“SPAM”, “LEGITIMATE”}
P(C = ck | X = x) = P(X = x | C = ck) * P(C = ck) / P(X = x)
–the assumption is made that each feature is independent of the others
P(SPAM | “online credit $”) = P(“online” | SPAM) * P(“credit” | SPAM) * P(“$” | SPAM) * P(SPAM) / (P(“online”) * P(“credit”) * P(“$”))

16 Classification (Naïve Bayes)
P(C_SPAM | x1, x2, …, xn) = P(x1, x2, …, xn | C_SPAM) * P(SPAM) / P(x1, x2, …, xn)
Prior probability
–Say we observe 35% of emails to be junk/spam: P(SPAM) = 0.35 and P(LEGITIMATE) = 0.65
Likelihood (for spam)
–The conditional probability of certain features given a certain class: P(x1, x2, …, xn | C_SPAM) [assuming independence of features]
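Under the independence assumption, the unnormalized posterior for a class is just its prior times the product of per-class feature probabilities. A sketch with illustrative numbers (the probability tables below are made up, not estimates from the paper's corpus):

```python
# Sketch: naive Bayes scoring under the feature-independence assumption.
# All probabilities are illustrative.
PRIOR = {"SPAM": 0.35, "LEGITIMATE": 0.65}
P_FEATURE = {  # P(feature | class)
    "SPAM":       {"online": 0.30, "credit": 0.20, "$": 0.25},
    "LEGITIMATE": {"online": 0.05, "credit": 0.04, "$": 0.02},
}

def score(features, cls):
    """Unnormalized posterior: P(class) * prod(P(feature | class)).
    Dividing by P(features) would normalize, but it cancels when
    comparing the two classes, so it can be skipped."""
    p = PRIOR[cls]
    for f in features:
        p *= P_FEATURE[cls][f]
    return p

feats = ["online", "credit", "$"]
for cls in PRIOR:
    print(cls, score(feats, cls))  # SPAM scores far higher here
```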

17 Classifier
Finally we classify:
–Is the mail spam? P(SPAM | X) / P(LEGITIMATE | X) >= λ
The choice of the threshold λ depends on the “cost” we assign to misclassification
–What cost? Classifying an important email as SPAM is worse; classifying a SPAM email as legitimate is not as bad!
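The cost-sensitive decision above is a simple ratio test. A minimal sketch (the threshold value 99 is illustrative of a setting that strongly protects legitimate mail):

```python
def classify(p_spam, p_legit, threshold=99.0):
    """Flag as spam only if the posterior ratio clears the cost threshold.
    A high threshold (99 here, illustrative) means a legitimate email
    misfiled as spam is treated as far costlier than a missed spam."""
    return "SPAM" if p_spam / p_legit >= threshold else "LEGITIMATE"

print(classify(0.90, 0.10))    # ratio 9   -> LEGITIMATE
print(classify(0.999, 0.001))  # ratio 999 -> SPAM
```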

18 Experiments
Used feature selection to decrease the dimensionality of the features/data
Corpus of 1789 actual emails (1578 junk, 211 legitimate)
Features from the text tokens
–Removed too-rare tokens
–Added about 35 hand-crafted phrase features
–20 non-textual domain-specific features
–Non-alphanumeric characters and the percentage of numeric characters were handy indicators
–Kept the top 500 features by mutual information between classes and features (the greater this value, the more informative the feature)

19 Evaluation
How do we know how good our classifier is?
–Calculate precision and recall!
Precision is the percentage of emails classified as SPAM that are in fact SPAM
Recall is the percentage of all SPAM emails that are correctly classified as SPAM
(Figure: ideal precision/recall curve; axes: recall vs. precision)
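The two definitions above translate directly into counts of true positives, false positives, and false negatives. A minimal sketch (the list-of-booleans input format is an assumption for illustration):

```python
def precision_recall(predicted, actual):
    """predicted/actual: parallel lists of booleans, True = spam.
    Precision = TP / (TP + FP); Recall = TP / (TP + FN)."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(a and not p for p, a in zip(predicted, actual))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# 3 flagged (2 correctly), 3 actual spams (2 caught):
p, r = precision_recall([True, True, True, False], [True, True, False, True])
print(round(p, 2), round(r, 2))  # 0.67 0.67
```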

20 Results

21 Conclusion
Using an automatic filter is very successful
Hand-crafted features enhance performance
Success on this problem is confirmation that the technique can be used in other text categorization tasks
The spam filter could be extended to classify other types of emails, such as “business” or “friends” (subclasses)

22 Discussion
What are we classifying? (Objects)
What are the features?
–What could the features be?
Bayesian classification
–Strong and weak points
–Possible improvements?
–Why Bayesian instead of other methods?
–What are the questionable assumptions?
Subclasses?
How to control error loss?
–When a normal email is moved to the trash… or a junk mail lands in the inbox?

23 Merci, Danke, Muchas Gracias, Ačiū, شكريه
All of you are very patient, thank you!
Special thanks to Irene
–for the opportunity to talk about classification
–for her hard work and help in preparing this talk
Thanks to Sabine and Stefan for conducting this seminar
Thank you (for support): Imran Rauf, Habib Ur Rahman (“حبيب اضط نشت دا”)
…and now? Maybe thanks to the spammers as well 

24 References
Sahami, Dumais, Heckerman, Horvitz: “A Bayesian Approach to Filtering Junk E-Mail”, AAAI-98 Workshop on Learning for Text Categorization, 1998.
Manning, Schütze: “Foundations of Statistical Natural Language Processing”, MIT Press, 1999.

25 Extra slides ;) or … 

26 What are we classifying? (Objects)
–Emails (to be classified as “normal” or “spam”)
What are the features? Indicators for a class
–What could the features be? Words, phrases, headers
Naïve Bayesian classification
–Strong points: high throughput, simple calculation
–Weak point: the assumption of independent features might not always hold true
–Possible improvements? Detect feature dependency
–Why Bayesian instead of other methods? See the strong points

27 What are the questionable assumptions?
–“Features are independent of each other”
Subclasses?
–Emails could be classified into subclasses:
SPAM → “PORNO_SPAM”, “BUSINESS_SPAM”
LEGITIMATE → “BUSINESS”, “APPOINTMENTS”, etc.
How to control error loss?
–When a normal email is moved to the trash… or a junk mail lands in the inbox?

28 Bayesian networks
Naïve Bayes: CLASS → x1, x2, x3, …, xn
–each feature node is influenced only by the class
–features are independent given the class
General Bayesian network: CLASS → x1, x2, x3, …, xn, with edges between feature nodes
–nodes influence parents and siblings
–features are dependent