Spam Filters

What is spam?
- Unsolicited (legally, “no existing relationship”)
- Automated
- Bulk email
- Not necessarily commercial: may also be “flaming” or political

Spam arriving in Michael’s mailbox in August
- You have won a lottery
- Your bank needs your account details
- Money transfer from Nigeria
- Online pharmaceuticals
- Software for sale
- Alarm systems
- Looking for a safe, ethical secondary income?
- Music and film downloads

Why send spam?
- Email is fast, cheap and easy
- Enormous email address lists are available, e.g. from harvesting (or likely addresses can be guessed from dictionaries)
- 7% of email users have bought something
- 100 responses to 10 million emails will produce a profit
- Illegal in the EU, but not in all US states

What’s wrong with spam?
- Wastes time deleting unwanted messages
- Users see offensive material
- Fills up file server storage space
- Some people are vulnerable to confidence tricks
- BrightMail estimated that 8% of email was spam in 2001, and 40% by a later date
- May stall the internet altogether

Combating spam
- Blacklisting: maintain a list of addresses of known spammers
- Greylisting: challenge suspected spam emails, e.g. by asking the sender to answer a question which is simple for a human but difficult for a computer (how many animals are in this picture?). Strictly, this describes a challenge-response scheme; greylisting in the usual sense temporarily rejects mail from unknown senders and accepts it when the sending server retries.
- Munging: to defeat harvesters, e.g. post your email address on the web as “cormac at dublin dot com”
- Litigation: e.g. the anti-spam company Habeas, whose haiku reads “winter into spring, brightly anticipated, like Habeas SWE”
- The EU says all bulk email should be opt-in unless there is an “existing relationship”

Spam filters
- Spam filters are an example of text classification (other examples: classifying by topic, language or author)
- Which is worse: marking a legitimate email as spam, or letting a spam message through?

Rule-based filters
- Some systems allow users to handcraft rules
- Rather than a yes/no outcome, it is best for each rule to have an associated probability, e.g. Barclays → 90%, Ivory Coast → 70%
- But this is time-consuming and tedious
- Users must be “savvy” enough to create the rules
- The rules must be constantly refined as the nature of spam changes
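
The idea of rules with associated probabilities can be sketched as below. This is only an illustration: the rule table reuses the two examples from the slide, while the function name `rule_score` and the choice to report the highest probability among the rules that fire are assumptions, not something the slides specify.

```python
# Hypothetical hand-written rules: each trigger phrase maps to the
# probability (0 to 1) that a message containing it is spam.
RULES = {
    "barclays": 0.90,
    "ivory coast": 0.70,
}


def rule_score(message: str, rules: dict[str, float] = RULES) -> float:
    """Return the highest spam probability among the rules that fire, or 0.0."""
    text = message.lower()
    fired = [prob for phrase, prob in rules.items() if phrase in text]
    # Taking the maximum is an assumed way of combining several matching rules.
    return max(fired, default=0.0)
```

Every new spam theme needs a new entry in `RULES`, which is the maintenance burden the slide points out.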

Adaptive filters
- Learn directly from the data in the user’s mailbox
- Which words are truly characteristic of spam?
- Compare with automatic indexing (stemming, mid-frequency words)

Training vs. test sets
1. Learn the rules on the training data
2. See if the rules work on the test data
- E.g. use the LingSpam corpus (400 spams, 200 legitimate messages sent to the Linguist List)
- Better to build your own corpus: spammers can overcome filters built on just one corpus
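
A minimal sketch of the split described above, assuming the corpus is held as a list of (text, label) pairs. The function name `split_corpus`, the 25% test fraction and the fixed seed are illustrative choices, not taken from the slides.

```python
import random


def split_corpus(messages, test_fraction=0.25, seed=0):
    """Shuffle labelled messages and split them into (train, test) lists."""
    shuffled = list(messages)
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    # The first n_test messages are held out; the rest are used for learning.
    return shuffled[n_test:], shuffled[:n_test]
```

The filter only ever sees the first list while learning; the second list stands in for mail it has never encountered.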

Chi-Squared Test
- Find the most characteristic words in spam / non-spam by the chi-squared test
- (The same test also finds differences between men’s and women’s speech)
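
For a single word, the test can be run on a 2x2 table of message counts. The sketch below uses the standard shortcut formula for a 2x2 chi-squared statistic; the function name and the argument layout are assumptions made for illustration.

```python
def chi_squared(spam_with, legit_with, spam_without, legit_without):
    """Chi-squared statistic for a 2x2 table of message counts.

    Rows: message contains the word / does not contain it.
    Columns: spam / legitimate.
    """
    a, b, c, d = spam_with, legit_with, spam_without, legit_without
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        # A row or column is empty, so the word tells us nothing.
        return 0.0
    return n * (a * d - b * c) ** 2 / denominator
```

With one degree of freedom, a value above about 3.84 is significant at the 5% level, so words can be ranked by this score and the top ones kept as features.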

Mutual Information (1)
- [word, category]: e.g. how often is the word “download” found in spam?
- [word]: e.g. how many messages altogether contain “download”?
- [category]: e.g. how many messages altogether are spam?
- N = total number of messages

Mutual Information (2)
- MI = log2( ([download, spam] * N) / ([download] * [spam]) )
- The higher the MI, the more typical “download” is of spam
- Now that we have found which words are most typical of spam and of legitimate messages, we must use this information to classify the unseen messages in the test set
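
The MI formula translates directly into code. In this sketch the parameter names are assumptions that mirror the bracketed counts above, and the counts in the usage note are made up.

```python
import math


def mutual_information(n_word_cat, n_word, n_cat, n_total):
    """MI = log2( [word, category] * N / ([word] * [category]) ).

    n_word_cat: messages in the category that contain the word
    n_word:     messages of any kind that contain the word
    n_cat:      messages in the category
    n_total:    all messages (N)
    """
    if n_word_cat == 0:
        # The word never occurs in the category: treat MI as minus infinity.
        return float("-inf")
    return math.log2(n_word_cat * n_total / (n_word * n_cat))
```

For instance, with invented counts of 1000 messages, 100 of them spam, and “download” appearing in 50 messages of which 40 are spam, `mutual_information(40, 50, 100, 1000)` evaluates log2(8). A score of 0 means the word occurs in spam exactly as often as chance would predict. This per-word score is often called pointwise mutual information.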

Bayesian Modelling
- Used in expert systems
- We want to work out the probability of the hypothesis given the evidence, P(H | E)
- E.g. P(spam | contains “NOW!”) and P(not spam | contains “NOW!”): which is greater?
- Bayes’ rule: P(H | E) = P(E | H) * P(H) / P(E)
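
As a worked illustration of Bayes’ rule for one piece of evidence, the sketch below computes both posteriors for a single word. The function name and all three input probabilities are invented for the example; P(E) is obtained by summing over the two hypotheses.

```python
def spam_posteriors(p_word_given_spam, p_word_given_legit, p_spam):
    """Apply Bayes' rule to one piece of evidence (a word seen in the message).

    Returns the pair (P(spam | word), P(not spam | word)).
    """
    p_legit = 1.0 - p_spam
    # P(E): total probability of seeing the word in any message.
    p_word = p_word_given_spam * p_spam + p_word_given_legit * p_legit
    return (
        p_word_given_spam * p_spam / p_word,
        p_word_given_legit * p_legit / p_word,
    )


# Made-up figures: "NOW!" occurs in 20% of spam and 1% of legitimate mail,
# and 40% of all mail is spam.
p_spam_given_now, p_legit_given_now = spam_posteriors(0.20, 0.01, 0.40)
```

With these figures the spam posterior is 0.08 / 0.086, roughly 0.93, so “NOW!” on its own already makes spam the more probable hypothesis.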

Combining Evidence (1)
- A Naïve Bayesian model assumes that the pieces of evidence are conditionally independent of one another. Compare:
- Independent events: Toffee Vodka wins the 2:00 at Newmarket; All for Laura wins the 2:35 at Newmarket; Nebraska Tornado wins the 3:15 at Newmarket
- Dependent events: Newcastle beat Birmingham; Newcastle lead Birmingham at half-time; Shearer scores a hat-trick

Combining Evidence (2)
- In a Naïve Bayesian model, P(cheap, v1agra, NOW! | spam) = P(cheap | spam) * P(v1agra | spam) * P(NOW! | spam)
- Now we can find: P(spam | cheap, v1agra, NOW!) = a and P(not spam | cheap, v1agra, NOW!) = b
- Odds on spam given that the message contains these three words = a / b
- In real text, words are conditionally dependent, e.g. “click here”
- Only classify as spam if the odds are at least 100 to 1 on
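
The odds calculation can be sketched as a small naïve Bayes filter. Because P(E) is the same in a and b, it cancels in the ratio a / b, so only the priors and the per-word likelihoods are needed. The function names, the “spam”/“legit” labels and the add-one smoothing are assumptions introduced here; the slides do not say how zero counts should be handled.

```python
from collections import Counter


def train(messages):
    """Count messages per label and, for each label, messages containing each word.

    messages: iterable of (list_of_words, label), label being "spam" or "legit".
    """
    docs = Counter()
    word_docs = {"spam": Counter(), "legit": Counter()}
    for words, label in messages:
        docs[label] += 1
        word_docs[label].update(set(words))
    return docs, word_docs


def spam_odds(words, docs, word_docs):
    """Odds a / b = P(spam | words) / P(not spam | words), naive assumption."""
    odds = docs["spam"] / docs["legit"]  # prior odds P(spam) / P(not spam)
    for w in set(words):
        # Add-one smoothing (an assumption) avoids zero probabilities.
        p_spam = (word_docs["spam"][w] + 1) / (docs["spam"] + 2)
        p_legit = (word_docs["legit"][w] + 1) / (docs["legit"] + 2)
        odds *= p_spam / p_legit
    return odds


def is_spam(words, docs, word_docs, threshold=100.0):
    """Classify as spam only if the odds are at least `threshold` to 1 on."""
    return spam_odds(words, docs, word_docs) >= threshold
```

The high threshold reflects the asymmetry of the two errors: a legitimate message lost to the spam folder costs more than one spam message let through.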

Non-word indicators of spam
- Phrases, e.g. “free money”, “only $”, “over 21”
- Punctuation!!!
- Domain name of sender: .edu is less likely to be spam than .com
- Spam is more likely than legitimate email to be sent at night
- If fewer than 9% of the characters are non-alphanumeric, the message is more likely to be legitimate
- Look for images, colours, HTML tags
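
Two of these indicators are easy to extract, as the sketch below suggests. The function names are hypothetical, and ignoring whitespace when computing the character ratio is an assumption, since the slide does not say how the 9% figure was measured.

```python
def non_alnum_fraction(text):
    """Fraction of non-whitespace characters that are not letters or digits."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    return sum(not ch.isalnum() for ch in chars) / len(chars)


def sender_tld(address):
    """Top-level domain of the sender's address, e.g. 'edu' or 'com'."""
    return address.rsplit(".", 1)[-1].lower()
```

Values like these can be treated as extra pieces of evidence alongside the word features, for example by turning them into tokens such as "tld:edu".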

Evaluation of spam filters
- Junk precision: percentage of messages in the test data classified as junk which truly are junk
- Junk recall: percentage of junk messages in the test data which are classified as junk
- Legitimate precision: percentage of messages in the test data classified as legitimate which truly are legitimate
- Legitimate recall: percentage of legitimate messages in the test data which are classified as legitimate
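
All four measures come from the same calculation applied to a different “positive” class. In this sketch the function name and the label strings are assumptions, and the results are returned as fractions between 0 and 1 rather than percentages.

```python
def precision_recall(actual, predicted, positive):
    """Return (precision, recall) for the class named by `positive`."""
    true_pos = sum(a == positive and p == positive
                   for a, p in zip(actual, predicted))
    predicted_pos = sum(p == positive for p in predicted)
    actual_pos = sum(a == positive for a in actual)
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / actual_pos if actual_pos else 0.0
    return precision, recall
```

Calling it once with `positive="junk"` and once with `positive="legit"` gives the junk and legitimate figures respectively. Legitimate recall is the one to watch most closely, since every miss there is a real message wrongly filed as junk.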

Summary
- The need to create spam filters automatically
- Find words which are typical of spam, and words which are typical of legitimate emails, using training data
- Use this knowledge to automatically classify new emails