How does a computer know what is spam and what is ham?

Attempt 1:

(define (spam? email)
  (cond ((email from known sender)            False)
        ((email contains "viagra")            True)
        ((email begins with "Dear Mr/Mrs.")   True)
        ((email contains URL)                 True)
        ((email contains attachment)          True)
        (...

Problem: a test like (email contains URL) is an indication, NOT a PROOF.

Features and scores:

    Feature                       Score
    from known sender              -50
    contains "viagra"               75
    begins with "Dear Mr/Mrs."      70
    contains URL                    10
    contains attachment

If Total Sum > 100, classify as spam.
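As a concrete illustration, here is a minimal sketch of this scoring idea in Scheme. The feature symbols and the score table are hypothetical stand-ins for however features would actually be detected; the attachment score is missing above, so it is left out here as well.

(define score-table
  '((from-known-sender . -50)
    (contains-viagra   .  75)
    (begins-dear       .  70)
    (contains-url      .  10)))

(define (total-score features)
  ; sum the scores of the features this email exhibits
  (apply +
         (map (lambda (f)
                (let ((entry (assq f score-table)))
                  (if entry (cdr entry) 0)))
              features)))

(define (spam? features)
  ; classify as spam when the total exceeds 100
  (> (total-score features) 100))

; e.g. (total-score '(contains-viagra begins-dear))  =>  145, so spam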

Problems:
- How to determine the scores?
- How to combine the scores?

Key Idea: learn which features are important through examples.

Training Set: lots of emails with correct labels (both spam and ham).
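For concreteness, a training set could be represented as below. This is a hedged sketch with hypothetical feature symbols, not the project's actual file format: each entry pairs a label with the features that email exhibits.

(define training-set
  '((spam . (contains-viagra contains-url))
    (spam . (begins-dear contains-attachment))
    (ham  . (from-known-sender))))
; ...in practice, thousands of labeled emails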

The Naive Bayes Algorithm:

Step 1. Gather statistics inside the Training Set:
- Count the percentage of spam in the Training Set: P(spam)
- Count the percentage of ham in the Training Set: P(ham)
- For every feature F_1, F_2, F_3, ...:
  = Count the percentage of spam with feature F_i: P(F_i | spam)
  = Count the percentage of ham with feature F_i: P(F_i | ham)
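A minimal sketch of Step 1 over the training-set representation sketched above (count-if is a small helper defined here; all names are illustrative):

(define (count-if pred lst)
  ; how many elements of lst satisfy pred
  (if (null? lst)
      0
      (+ (if (pred (car lst)) 1 0)
         (count-if pred (cdr lst)))))

(define (p-label label)
  ; fraction of training examples carrying this label
  (/ (count-if (lambda (ex) (eq? (car ex) label)) training-set)
     (length training-set)))

(define (p-feature f label)
  ; among examples with this label, the fraction exhibiting feature f
  (/ (count-if (lambda (ex) (and (eq? (car ex) label) (memq f (cdr ex))))
               training-set)
     (count-if (lambda (ex) (eq? (car ex) label))
               training-set)))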

The Naive Bayes Algorithm:

Say, F_1 = contains "viagra"
     F_2 = begins with "Dear Mr/Mrs."

From the Training Set, we discovered:

    P(spam) = 0.85           P(ham) = 0.15
    P(F_1 | spam) = 0.2      P(NOT F_1 | spam) = 0.8
    P(F_1 | ham)  =          P(NOT F_1 | ham)  =
    P(F_2 | spam) = 0.99     P(NOT F_2 | spam) = 0.01
    P(F_2 | ham)  =          P(NOT F_2 | ham)  =
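With these numbers, an email that both contains "viagra" and begins with "Dear Mr/Mrs." gets a spam-side score of P(spam) × P(F_1 | spam) × P(F_2 | spam) = 0.85 × 0.2 × 0.99 ≈ 0.168. (This product form relies on the naive assumption that features are independent given the label.) The ham-side score is the same product built from the ham rows above, and Step 2 below picks whichever label scores higher.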

The Naive Bayes Algorithm (summary):

Step 1. Gather statistics inside the Training Set, as above.

Step 2. On a new instance:
- Find which features the new instance has
- Use Bayes' Rule to compute the probability of each label
- Take the most probable label
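Under the naive independence assumption, Bayes' Rule gives P(label | F_1, ..., F_n) proportional to P(label) × P(F_1 | label) × ... × P(F_n | label); the denominator is the same for every label, so it is enough to compare these products. A minimal sketch of Step 2, reusing the illustrative p-label and p-feature helpers from the Step 1 sketch:

(define (score label features)
  ; P(label) * product of P(F_i | label):
  ; proportional to P(label | features) by Bayes' Rule
  (apply * (p-label label)
         (map (lambda (f) (p-feature f label)) features)))

(define (classify features)
  (if (> (score 'spam features) (score 'ham features))
      'spam
      'ham))

; e.g. (classify '(contains-viagra contains-url))  =>  'spam or 'ham,
; depending on the statistics gathered in Step 1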

Example: Optical Character Recognition

GOAL: recognize scanned hand-written numbers

[figure: ASCII renderings of scanned hand-written digits]

Instance – a scanned image of a hand-written number
Labels – 1, 2, 3, 4, 5, 6, 7, 8, 9
Features – (for the project) every 2x2 square of pixels

Steps:
- Turn the image file into a stream of Images (an Abstract Data Type) (done for you)
- Gather feature statistics from the Training File (mostly done for you)
- Implement Bayes' Rule (mostly your own work)
- Evaluate your OCR by guessing labels on the Validation File (mostly done for you)
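To connect this back to Naive Bayes: each 2x2 square of pixels is one feature F_i, and Step 1 counts how often each square's pattern occurs in training images of each digit. A hedged sketch of reading one such feature, assuming a hypothetical (pixel img row col) accessor that returns the character at that position:

(define (feature-at img row col)
  ; the 2x2 block of pixels whose top-left corner is (row, col);
  ; each distinct block pattern plays the role that a feature F_i
  ; played for spam detection
  (list (pixel img row col)
        (pixel img row (+ col 1))
        (pixel img (+ row 1) col)
        (pixel img (+ row 1) (+ col 1))))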