
Exponential Differential Document Count: A Feature Selection Factor for Improving Bayesian Filters
Fidelis Assis 1, William Yerazunis 2, Christian Siefkes 3, Shalendra Chhabra 2,4
1: Empresa Brasileira de Telecomunicações (Embratel), Rio de Janeiro, RJ, Brazil
2: Mitsubishi Electric Research Labs, Cambridge, MA
3: Database and Information Systems Group, Freie Universität Berlin, Berlin-Brandenburg Graduate School in Distributed Information Systems
4: Computer Science and Engineering, University of California, Riverside, CA
CRM114 Team

What is it?
EDDC, also known as the Confidence Factor, is an intuitively and empirically derived factor for automatically detecting and reducing the influence of features with low class-separation power.

Motivation
● To mimic what we humans do when inspecting a message to tell whether it is spam:
– consider mainly the features that carry strong indications;
– discard, or reduce the importance of, those that occur approximately equally in both classes.

How is the Confidence Factor used?
● It is used to adjust the local probability of a feature towards the "don't care" value (0.5) in the Bayes formula, inversely to the feature's class-separation power.
● P_a(F|C) is the adjusted value of P(F|C) after applying the Confidence Factor:
P_a(F|C) = 0.5 + CF(F) · (P(F|C) − 0.5)
● where CF(F) is the Confidence Factor of the feature F.
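As transcribed, the slide's formula lost its additive term; restoring it (the adjustment must land exactly at 0.5 when CF(F) = 0, since that is the "don't care" value) makes the update a plain linear interpolation. A minimal sketch in Python:

```python
def adjust_probability(p: float, cf: float) -> float:
    """Pull a feature's local probability P(F|C) toward the
    "don't care" value 0.5, in proportion to the feature's
    class-separation power: CF = 0 yields exactly 0.5,
    CF = 1 leaves the probability untouched."""
    return 0.5 + cf * (p - 0.5)

# A feature seen almost equally in ham and spam (low CF) contributes
# almost nothing to the Bayes chain...
print(adjust_probability(0.9, 0.05))  # 0.52, close to "don't care"
# ...while a high-confidence feature keeps most of its evidence.
print(adjust_probability(0.9, 0.95))  # 0.88
```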

How is it calculated?
● Goal: derive a factor that reduces the influence of features with approximately the same frequency in both classes, spam and ham.
● First attempt: use the difference of the normalized document counts in the two classes, because it is zero when the frequencies are equal:
CF(F) = ND_{f,h} − ND_{f,s}
where ND_{f,h} is the normalized count of ham messages containing the feature f, and ND_{f,s} the normalized count of spam messages containing f.
● Problem: CF(F) must lie in the interval [0, 1], because the adjusted local probability P_a(F|C) must also lie in that interval!
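The slides do not spell out the normalization; a natural reading, assumed in this sketch, is the document count divided by the total number of messages in that class, so corpora of different sizes become comparable. The sketch also shows why the first attempt fails the [0, 1] requirement:

```python
def normalized_count(docs_with_feature: int, total_docs_in_class: int) -> float:
    """Fraction of a class's messages containing the feature, so that
    ham and spam corpora of different sizes can be compared directly."""
    return docs_with_feature / total_docs_in_class

# First attempt: CF(F) = ND_h - ND_s is zero for equal frequencies,
# as desired, but it is not confined to [0, 1]:
nd_h = normalized_count(5, 1000)    # feature in 0.5% of ham
nd_s = normalized_count(400, 1000)  # feature in 40% of spam
print(nd_h - nd_s)  # -0.395: negative, outside the required interval
```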

How is it calculated?
● The previous problem can be fixed with two simple steps:
– divide by the sum of the normalized counts, keeping the value <= 1;
– take the square, to avoid negative values, keeping it >= 0:
CF(F) = ((ND_{f,h} − ND_{f,s}) / (ND_{f,h} + ND_{f,s}))^2
● Or, in simplified notation: CF(F) = (ΔND_f / ΣND_f)^2, where ΔND_f and ΣND_f are the difference and the sum of the normalized counts.
● But we still have problems when the counts are small: for a feature seen in a single ham message and no spam message, the confidence factor is 1, completely unrealistic!

How is it calculated?
● A new term is then subtracted from the numerator to smooth the results for small counts: the inverse of the sum of the observed (not normalized) counts in both classes:
CF(F) = ((ND_{f,h} − ND_{f,s})^2 − 1/(D_{f,h} + D_{f,s})) / (ND_{f,h} + ND_{f,s})^2
where D_{f,h} is the count of ham messages containing the feature f, and D_{f,s} the count of spam messages containing f.
● Or, using the simplified notation: CF(F) = (ΔND_f^2 − 1/ΣD_f) / ΣND_f^2, which is the core of the EDDC formula.
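Putting the last three slides together, a small Python sketch of the core factor (with an explicit clamp at zero, an implementation detail the slides leave implicit, since the smoothing term can push the numerator negative):

```python
def core_confidence_factor(d_h: int, d_s: int, n_h: int, n_s: int) -> float:
    """Core EDDC confidence factor as built up on the previous slides.

    d_h, d_s -- raw counts of ham/spam documents containing the feature
    n_h, n_s -- total number of ham/spam documents in the training corpus
    """
    nd_h, nd_s = d_h / n_h, d_s / n_s     # normalized counts
    smoothing = 1.0 / (d_h + d_s)         # inverse of the raw counts' sum
    num = (nd_h - nd_s) ** 2 - smoothing  # squared difference, smoothed
    den = (nd_h + nd_s) ** 2              # squared sum keeps the ratio <= 1
    return max(num, 0.0) / den            # clamp: smoothing can go below zero

# The pathological case from the previous slide: a feature seen in one
# ham message and no spam messages is no longer fully trusted.
print(core_confidence_factor(1, 0, 1000, 1000))  # 0.0 instead of 1.0
```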

OSB Features and Intrinsic Weight
● Orthogonal Sparse Bigrams (OSB) is a feature-generation technique that produces bigrams like the four below for the sentence "OSBF is a Bayesian filter":
Bigram / Distance
– OSBF is / 0
– OSBF a / 1
– OSBF Bayesian / 2
– OSBF filter / 3
● For a 5-token sliding window, the intrinsic weight W_f of a feature in OSBF is given experimentally as a decreasing function of d, the number of skipped tokens in the bigram. The closer the tokens, the greater the weight.
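A minimal sketch of OSB feature generation: pair each token with each of the following tokens inside the sliding window, recording the distance d. It reproduces the four example bigrams on this slide:

```python
def osb_features(tokens, window=5):
    """Orthogonal Sparse Bigrams: pair each token with each later token
    inside a sliding window, recording the distance d (the number of
    skipped tokens between the two)."""
    features = []
    for i, head in enumerate(tokens):
        for d, j in enumerate(range(i + 1, min(i + window, len(tokens)))):
            features.append((head, tokens[j], d))
    return features

for head, tail, d in osb_features("OSBF is a Bayesian filter".split()):
    if head == "OSBF":
        print(head, tail, d)
# OSBF is 0 / OSBF a 1 / OSBF Bayesian 2 / OSBF filter 3,
# matching the table above.
```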

Generic EDDC formula
● Finally, adding a term to take the intrinsic weight of the feature into account, we arrive at the generic EDDC formula, built from:
ΔND_f – difference of the normalized counts, in both classes, of documents containing f
ΣND_f – sum of the normalized counts, in both classes, of documents containing f
ΣD_f – sum of the counts, in both classes, of documents containing f
W_f – intrinsic weight of the feature
K1, K2, K3 – empirically defined constants
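The generic formula itself was an image and did not survive transcription, so the following is a hedged sketch only: one form consistent with the derivation and with the "exponential" in the name scales the smoothing term by K1 and raises the core ratio to an exponent shaped by K2, K3 and the intrinsic weight W_f. The exponent form and the constant placement here are assumptions, not the paper's exact expression; the only constant the transcript pins down is K2 = 2 for the original formula (quoted on the next slide).

```python
def generic_eddc(d_h, d_s, n_h, n_s, w_f, k1=1.0, k2=2.0, k3=1.0):
    """Hypothetical generic EDDC confidence factor (a sketch, not the
    paper's exact expression).

    k1 scales the small-count smoothing term (placeholder default 1.0,
    which recovers the core formula); k2 and k3 shape the exponent
    together with the intrinsic weight w_f (assumed > 0).  k2 = 2.0
    mirrors the "original formula" value quoted on the variant slide;
    k3 = 1.0 is a placeholder.
    """
    nd_h, nd_s = d_h / n_h, d_s / n_s
    core = max((nd_h - nd_s) ** 2 - k1 / (d_h + d_s), 0.0) / (nd_h + nd_s) ** 2
    # core is in [0, 1], and a positive exponent keeps it there.  Heavier
    # (closer-token) features get a smaller exponent here, so their
    # confidence decays more slowly: one reading of the "exponential"
    # in EDDC, assumed rather than taken from the paper.
    return core ** (k2 / w_f ** k3)
```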

Variant implemented in CRM114-OSBF and in OSBF-Lua
ND_{f,s} – normalized number of documents containing f in class spam
ND_{f,h} – normalized number of documents containing f in class ham
● Slower decay of the confidence factor as … decreases;
● Requires a higher K2 (best experimental value: 10, versus only 2 in the original formula).

Visualization of the Confidence Factor
● CF(F) is a function of the difference between the frequencies of the feature in the two classes;
● features with small differences in frequency are strongly de-emphasized.
[Surface plot: CF(F) versus the frequency of F in ham and the frequency of F in spam]
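A quick numeric check of the core formula (a sketch assuming 1000 training messages per class) reproduces the shape of the surface plot: near-equal frequencies are crushed toward zero, and so are very small counts, however skewed their ratio.

```python
# Core EDDC confidence factor for a few (ham, spam) document counts,
# out of an assumed 1000 training messages per class.
def cf(d_h, d_s, n=1000):
    nd_h, nd_s = d_h / n, d_s / n
    return max((nd_h - nd_s) ** 2 - 1 / (d_h + d_s), 0.0) / (nd_h + nd_s) ** 2

for d_h, d_s in [(300, 290), (300, 150), (300, 10), (3, 2)]:
    print(f"ham={d_h:3d}  spam={d_s:3d}  ->  CF = {cf(d_h, d_s):.3f}")
# ham=300  spam=290  ->  CF = 0.000   (nearly equal: ignored)
# ham=300  spam=150  ->  CF = 0.100
# ham=300  spam= 10  ->  CF = 0.842
# ham=  3  spam=  2  ->  CF = 0.000   (too few observations to trust)
```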

Classifiers: OSB, OSB Unique and OSBF
● The three classifiers examined in this work are all Bayesian and use Orthogonal Sparse Bigrams for feature generation:
– the OSB classifier takes its name from the feature-generation technique; it is the standard CRM114 classifier with OSB features;
– OSB Unique is a variant of the OSB classifier that ignores duplicate features;
– OSBF is similar to OSB Unique but uses EDDC as a Confidence Factor for feature selection.

ROC Curves for the Full Corpus
● The OSBF curve was generated using the original formula; a second OSBF curve uses the variant from the previous slide;
● the original formula has been showing better performance as the code improves (better pruning, bugs removed).

Conclusions
● The EDDC factor clearly improved the overall performance of the OSB Bayesian filters presented;
● more importantly, it greatly improved performance in the most critical region for spam filtration, where the ham misclassification rate is <= 0.1%;
● it doesn't hurt speed: all three filters are very fast. Even with the large database used in these tests (2 x 46 MB), they were able to process the full corpus in minutes on an AMD Athlon XP.

Thanks!
The CRM114-OSBF source is available at …
The OSBF-Lua source is available at …