A Unified Model of Spam Filtration
William Yerazunis 1, Shalendra Chhabra 2, Christian Siefkes 3, Fidelis Assis 4, Dimitrios Gunopulos 2
1 Mitsubishi Electric Research Labs, Cambridge, MA; 2 University of California, Riverside, CA; 3 GKVI / FU Berlin, Germany; 4 Embratel, Rio de Janeiro, Brazil

“Spam is a Nuisance” Spam Statistics

The Problem of Spam Filtering
Given a large set of readers (who wish to communicate reciprocally with one another, without prearrangement) and a set of spammers (who wish to communicate with readers who do not wish to hear from them), the filtering actions are the steps readers take to maximize their desired communications and minimize their undesired communications.

Motivation
 All current spam filters have very similar performance – why?
 Can we learn anything about current filters by modeling them?
 Does the model point out any obvious flaws in our current designs?

What do you MEAN? Similar Performance?
 Gordon Cormack's “Mr. X” test suite – all filters (heuristic and statistical) get about 99% accuracy.
 Large variance of reported real-user results for all filters – everything from “nearly perfect” to “unusable” on the same software.
 Conclusion: this problem is evil.

Method
 Examine a large number of spam filters
 Think Real Hard (the Feynman method of deduction)
 Come up with a unified model for spam filters
 Examine the model for weaknesses and scalability
 Consider remedies for the weaknesses

The Model
 Models information flow
 Considers stateful vs. stateless computations, and computational complexity
 Simplification: training is usually done “offline”, so we assume a stationary filter configuration (no learning while filtering) for the computational model.

Empirical Stages in the Filtering Pipeline
1. Preprocessing – text-to-text transforms
2. Tokenization
3. Feature generation
4. Statistics-table lookup
5. Mathematical combination
6. Thresholding
Everybody does it this way – including SpamAssassin!
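To make the pipeline shape concrete, here is a minimal Python sketch. The six stage functions are stand-ins invented for this transcript (a naive unigram hash and a naive-Bayes combiner), not any particular filter's code:

```python
import re
from math import prod

# Stand-in implementations of the six stages; only the pipeline shape matters.
def preprocess(msg):                  # 1. text-to-text transform
    return msg.lower()

def tokenize(text):                   # 2. regex tokenization
    return re.findall(r"\S+", text)

def generate_features(tokens):        # 3. token stream -> feature IDs (unigrams)
    return [hash(t) for t in tokens]

def lookup(feature, table):           # 4. statistics-table lookup
    return table.get(feature, 0.5)    # unseen features are neutral

def combine(weights):                 # 5. mathematical combination (naive Bayes)
    s, h = prod(weights), prod(1.0 - w for w in weights)
    return s / (s + h) if s + h else 0.5

def classify(msg, table, threshold=0.5):   # 6. thresholding
    feats = generate_features(tokenize(preprocess(msg)))
    return combine([lookup(f, table) for f in feats]) > threshold
```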

The Obligatory Flowchart
Note that this is now a pipeline – anyone for a scalable solution?

Step 1: Preprocessing: Arbitrary Text-to-Text Transformation
 Character set folding / case folding
 Stopword removal
 MIME normalization / Base64 decoding
 HTML decommenting (“hypertextus interruptus”)
 Heuristic tagging, e.g. “FORGED_OUTLOOK_TAGS”
 Identifying lookalike transformations: ‘@’ instead of ‘a’, ‘$’ instead of ‘S’
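A sketch of a few of these transforms in Python; the heuristic pattern and the lookalike map here are illustrative assumptions, not SpamAssassin's actual rules:

```python
import re

# (name, pattern) pairs; this pattern is invented for illustration only.
HEURISTICS = [("FORGED_OUTLOOK_TAGS", r"x-mailer:.*outlook")]

def preprocess(text):
    text = text.lower()                                   # case folding
    text = re.sub(r"<!--.*?-->", "", text, flags=re.S)    # HTML decommenting
    # lookalike normalization: map a few common substitutions back
    text = text.translate(str.maketrans({"@": "a", "$": "s", "0": "o"}))
    for name, pattern in HEURISTICS:                      # heuristic tagging
        if re.search(pattern, text):
            text += " " + name.lower()   # the tag becomes an ordinary token
    return text

print(preprocess("V<!--x-->I@GR@ from X-Mailer: Outlook"))
# -> "viagra from x-mailer: outlook forged_outlook_tags"
```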

Step 2: Tokenization
 Converting incoming text (and heuristic tags) into features
 A two-step process:
1. Use a regular expression (regex) to segment the incoming text into tokens.
2. Convert the resulting token stream T into features. This conversion is not necessarily 1:1.
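For example (a sketch; this exact regex is an assumption, though it is similar in spirit to CRM114's default of maximal runs of printable, non-blank characters):

```python
import re

TOKEN_RE = re.compile(r"\S+")   # one plausible tokenizing regex

def tokenize(text):
    return TOKEN_RE.findall(text)

print(tokenize("Buy CHEAP meds now!!!"))
# ['Buy', 'CHEAP', 'meds', 'now!!!']
```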

Step 3: Converting Tokens into Features
 A “token” is a piece of text that meets the requirements of the tokenizing regex.
 A “feature” is a mostly-unique identifier that the filter's trained database can convert into a mathematically manipulable value.

Converting Tokens into Features, Part 1: Convert Text to an Arbitrary Integer
For each unknown text token, convert that text into a (preferably unique) arbitrary integer. This conversion can be done by dictionary lookup (i.e. “foo” is the 82,934th word in the dictionary) or by hashing. The output of this stage is a stream of unique integer IDs; call this row vector T.
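Both routes in a short sketch (the dictionary entries and the 32-bit truncation are illustrative choices):

```python
import hashlib

def token_id_hash(token: str) -> int:
    # Hashing route: any stable hash works; a truncated SHA-1 is used here
    # purely for illustration.
    return int.from_bytes(hashlib.sha1(token.encode("utf-8")).digest()[:4], "big")

DICTIONARY = {"the": 1, "foo": 82934}   # illustrative entries only
def token_id_dict(token: str) -> int:
    # Dictionary route: the token's position in a fixed word list.
    return DICTIONARY[token]

T = [token_id_hash(t) for t in ["buy", "cheap", "meds"]]   # the row vector T
```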

Part 2: Matrix-Based Feature Generation
 Use a rectangular feature-generation profile matrix P; each column of P is a vector containing small primes (or zeros).
 The dot product of each column of P against a segment of the token stream T is a unique encoding of the feature at that position.
 The number of features produced per element of T equals the number of columns of P.
 A zero element in a column of P means that token position is disregarded in that particular feature.

Matrix-Based Feature Generation Example – Unigram + Bigram Features
A P matrix for unigram-plus-bigram features is:

P = | 1  2 |
    | 0  3 |

If the incoming token stream T is 10, 20, 30, then successive dot products of the sliding window of T against the columns of P yield:

10 * 1 + 20 * 0 = 10   <-- first position, first column (unigram “10”)
10 * 2 + 20 * 3 = 80   <-- first position, second column (bigram “10 20”)
20 * 1 + 30 * 0 = 20   <-- second position, first column (unigram “20”)
20 * 2 + 30 * 3 = 130  <-- second position, second column (bigram “20 30”)
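A short check of the arithmetic above: sliding a two-token window of T along the stream and dotting it with each column of P reproduces the four feature values.

```python
P = [[1, 2],
     [0, 3]]                      # rows = window positions, columns = features
T = [10, 20, 30]

rows, cols = len(P), len(P[0])
features = []
for pos in range(len(T) - rows + 1):          # slide the window along T
    window = T[pos:pos + rows]
    for c in range(cols):
        features.append(sum(window[r] * P[r][c] for r in range(rows)))

print(features)                   # [10, 80, 20, 130]
```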

Matrix-Based Feature Generation: Nice Side Effects
 Zeroes in a P matrix mean “disregard this location in the token stream”.
 Identical primes within a column of a P matrix allow order-invariant feature outputs, e.g. a matrix such as

P = | 1  3 |
    | 0  3 |

generates the unigram features AND the set of bigrams in an order-invariant style: the second column computes 3*(t1 + t2), so “foo bar” == “bar foo”.

Matrix-Based Feature Generation: Nice Side Effects II
Identical primes in different columns of a P matrix allow order-sensitive, distance-invariant feature outputs, e.g. a matrix such as

P = | 1  2  2 |
    | 0  3  0 |
    | 0  0  3 |

generates the unigram features and the set of bigrams in an order-sensitive, distance-invariant style: columns 2 and 3 both compute 2*t_first + 3*t_second, whether the pair is adjacent or one token apart, so “foo bar” == “foo <anything> bar”.
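Both side effects can be verified with the matrices shown above (the token IDs are arbitrary stand-ins):

```python
def feats(T, P):
    # All dot products of each column of P against a sliding window of T.
    rows, cols = len(P), len(P[0])
    return [sum(T[pos + r] * P[r][c] for r in range(rows))
            for pos in range(len(T) - rows + 1)
            for c in range(cols)]

foo, bar, baz = 10, 20, 7          # arbitrary stand-in token IDs

# Order-invariant bigrams: identical primes within one column.
P_oi = [[1, 3],
        [0, 3]]
assert feats([foo, bar], P_oi)[1] == feats([bar, foo], P_oi)[1]  # 3*(t1+t2)

# Distance-invariant pairs: the same primes (2, 3) in different columns.
P_di = [[1, 2, 2],
        [0, 3, 0],
        [0, 0, 3]]
adjacent = feats([foo, bar, baz], P_di)   # "foo bar baz"
gapped   = feats([foo, baz, bar], P_di)   # "foo baz bar"
# Column 2 pairs (t1, t2); column 3 pairs (t1, t3); both give 2*foo + 3*bar:
assert adjacent[1] == gapped[2] == 2 * foo + 3 * bar
print("invariance properties hold")
```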

Step 4: Feature Weighting by Lookup
Feature lookup tables are pre-built by the learning side of the filter. Examples:

Strict probability:
  LocalWeight = TimesSeenInClass / TimesSeenInAllClasses

Buffered probability:
  LocalWeight = TimesSeenInClass / (TimesSeenInAllClasses + constant)

Document-count certainty:
  Weight = (TimesSeenInThisClass * DocumentsTrainedIntoThisClass) / ((TimesSeenInAllClasses + constant) * TotalDocumentsActuallyTrained)
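The three weighting rules transcribed directly into Python (the default constant of 1.0 is an assumption; the slide does not fix a value):

```python
def strict_probability(times_in_class, times_all_classes):
    return times_in_class / times_all_classes

def buffered_probability(times_in_class, times_all_classes, constant=1.0):
    # The constant damps the weight of rarely-seen features.
    return times_in_class / (times_all_classes + constant)

def document_count_certainty(times_in_class, times_all_classes,
                             docs_in_class, docs_total, constant=1.0):
    return (times_in_class * docs_in_class) / \
           ((times_all_classes + constant) * docs_total)
```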

In This Model, Weight Generators Do Not Have To Be Statistical
 SpamAssassin: weights generated by a genetic optimization algorithm
 SVM: weights generated by linear algebra
 Winnow algorithm:
   uses additively-combined weights stored in the database
   each feature's weight starts at 1.0
   weights are multiplied by a promotion/demotion factor if the result is below/above a predefined threshold during learning
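A minimal sketch of the Winnow scheme as described; the promotion/demotion factors and the per-feature threshold are illustrative values, not taken from the slide:

```python
from collections import defaultdict

class Winnow:
    def __init__(self, promote=1.23, demote=0.83):   # illustrative factors
        self.w = defaultdict(lambda: 1.0)            # every weight starts at 1.0
        self.promote, self.demote = promote, demote

    def score(self, features):
        # Additive combination of the stored per-feature weights.
        return sum(self.w[f] for f in features) / len(features)

    def train(self, features, is_spam, threshold=1.0):
        # Mistake-driven multiplicative update.
        if is_spam and self.score(features) <= threshold:
            for f in features:
                self.w[f] *= self.promote
        elif not is_spam and self.score(features) >= threshold:
            for f in features:
                self.w[f] *= self.demote
```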

Unique Features --> Nonuniform Weights
Because all features are unique, we can have nonuniform weighting schemes:
➢ Nonuniform weights can be compiled into the weighting lookup tables
➢ A row vector of per-column weights can be passed along from the feature-generation stage

Step 5: Weight Combination: the 1st Stateful Step
 Winnow combining
 Bayesian combining
 Chi-squared combining
 Sorted-weight approach – “only use the most extreme N weights found in the document”
 Uniqueness – use only the 1st occurrence of any feature
Output: a state vector – may be of length 1 or longer, and may be nonuniform.
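As one example, a sketch of chi-squared (Fisher) combining in the SpamBayes style; this is an independent reimplementation for illustration, not SpamBayes' code:

```python
import math

def chi2_survival(chi, df):
    # Survival function of the chi-squared distribution, even df only.
    m = chi / 2.0
    term = total = math.exp(-m)
    for i in range(1, df // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def chi2_combine(probs):
    # Combine per-feature spam probabilities into one score in [0, 1].
    probs = [min(max(p, 1e-9), 1.0 - 1e-9) for p in probs]   # avoid log(0)
    n = len(probs)
    S = chi2_survival(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    H = chi2_survival(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    return (S - H + 1.0) / 2.0   # 0.0 = confidently ham, 1.0 = confidently spam

print(chi2_combine([0.99, 0.95, 0.90]))   # strongly spammy evidence -> near 1
```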

Final Threshold
 Comparison to a fixed threshold, or
 Comparison of one part of the output state to another part of the output state (useful when the underlying mathematical model includes a null hypothesis)

Emulation of Other Filtering Methods in the Unified Filtration Model: Whitelists and Blacklists
 Voting-style whitelists
 Prioritized-rule blacklists
--> Compile entries into the lookup tables with superincreasing weights for each whitelisted or blacklisted term (see the sketch below).
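A tiny sketch of that compilation step (the addresses and the base weight are invented for illustration):

```python
# Superincreasing weights: each new entry outweighs the sum of all entries
# before it, so a single list hit dominates any statistical evidence, and
# the highest-priority matching rule always wins the final sum.
table = {}
weight = 1000.0                                  # >> any ordinary feature weight
for term in ["boss@example.com", "mom@example.com"]:       # whitelist entries
    table[hash("from:" + term)] = weight
    weight *= 2.0
for term in ["evil@spammer.example.net"]:                  # blacklist entries
    table[hash("from:" + term)] = -weight
    weight *= 2.0
```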

Emulation of Other Filtering Methods in the Unified Filtration Model: Heuristic Filters
 Use text-to-text translation to insert a text tag string for each heuristic satisfied
 Compile lookup-table entries holding the positive or negative weight of each heuristic text tag string; ignore all other text
 Sum the outputs and threshold
This is the SpamAssassin model.

Emulation of Other Filtering Methods in the Unified Filtration Model: Bayesian Filters
 Text-to-text translation as desired
 Use a unigram (SpamBayes) or digram (DSPAM) P matrix to generate the features
 Preload the lookup tables with local probabilities
 Use Bayes' Rule or chi-squared to combine probabilities, and threshold at some value
This is just about every statistical filter...
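Putting the pieces together, a minimal end-to-end sketch of this Bayesian instantiation: unigram P matrix, a lookup table preloaded with buffered probabilities, log-domain naive-Bayes combining, and a 0.5 threshold. The toy corpus is invented; a real filter would use its trained statistics tables.

```python
import math
import re
from collections import Counter

def featurize(text):                       # unigram P matrix: one feature/token
    return [hash(t) for t in re.findall(r"\S+", text.lower())]

spam_counts, ham_counts = Counter(), Counter()
for msg in ["buy cheap meds now", "cheap meds cheap"]:       # toy spam corpus
    spam_counts.update(featurize(msg))
for msg in ["lunch at noon?", "meds schedule for grandma"]:  # toy ham corpus
    ham_counts.update(featurize(msg))

def local_prob(f):                         # buffered probability from Step 4
    s, h = spam_counts[f], ham_counts[f]
    return (s + 0.5) / (s + h + 1.0)

def spam_score(text):                      # naive-Bayes combining, log domain
    logit = sum(math.log(local_prob(f) / (1.0 - local_prob(f)))
                for f in featurize(text))
    return 1.0 / (1.0 + math.exp(-logit))  # threshold this at, say, 0.5

print(spam_score("cheap meds"))            # high
print(spam_score("lunch schedule"))        # low
```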

Conclusions
 Everybody's filter is more or less the same!
 But filtering on the text alone limits our information horizon.
 We need to broaden the information input horizon... but from where? And how?

Conclusions: Better Information Inputs – Examples
 CAMRAM – use outgoing mail as an automatic whitelist
 Smart Squid – watch the user's web surfing to infer what incoming mail might be legitimate (say, receipts from online merchants)
 Honeypots – sources of new spam and of newly compromised zombie spambots (feeding realtime blacklists, inoculation, and peer-to-peer filter systems). Large ISPs have a big statistical advantage here over single users.

Thank You! questions or comments?