Rajesh Pampapathi, Boris Mirkin, Mark Levene


A Suffix Tree Approach to Text Classification Applied to Email Filtering. Rajesh Pampapathi, Boris Mirkin, Mark Levene. School of Computer Science and Information Systems, Birkbeck College, University of London

Introduction, Outline: Motivation: examples of spam; Suffix tree construction; Document scoring and classification; Experiments and results; Conclusion

1. Standard spam mail Buy cheap medications online, no prescription needed. We have Viagra, Pherentermine, Levitra, Soma, Ambien, Tramadol and many more products. No embarrasing trips to the doctor, get it delivered directly to your door. Experienced reliable service. Most trusted name brands. For your solution click here: http://www.webrx-doctor.com/?rid=1000

5. Embedded message (plus word salad) zygotes zoogenous zoometric zygosphene zygotactic zygoid zucchettos zymolysis zoopathy zygophyllaceous zoophytologist zygomaticoauricular zoogeologist zymoid zoophytish zoospores zygomaticotemporal zoogonous zygotenes zoogony zymosis zuza zoomorphs zythum zoonitic zyzzyva zoophobes zygotactic zoogenous zombies zoogrpahy zoneless zoonic zoom zoosporic zoolatrous zoophilous zymotically zymosterol FreeHYSHKRODMonthQGYIHOCSupply.IHJBUMDSTIPLIBJTJUBIYYXFN * GetJIIXOLDViagraPWXJXFDUUTabletsNXZXVRCBX <http://healthygrow.biz/index.php?id=2> zonally zooidal zoospermia zoning zoonosology zooplankton zoochemical zoogloeal zoological zoologist zooid zoosphere zoochemical & Safezoonal andNGASXHBPnatural & TestedQLOLNYQandEAVMGFCapproved zonelike zoophytes zoroastrians zonular zoogloeic zoris zygophore zoograft zoophiles zonulas zygotic zymograms zygotene zootomical zymes zoodendrium zygomata zoometries zoographist zygophoric zoosporangium zygotes zumatic zygomaticus zorillas zoocurrent zooxanthella zyzzyvas zoophobia zygodactylism zygotenes zoopathological noZFYFEPBmas <http://healthygrow.biz/remove.php>

4. Word salads Buy meds online and get it shipped to your door Find out more here <http://www.gowebrx.com/?rid=1001> a publications website accepted definition. known are can Commons the be definition. Commons UK great public principal work Pre-Budget but an can Majesty's many contains statements statements titles (eg includes have website. health, these Committee Select undertaken described may publications

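Creating a Suffix Tree [Figure: suffix tree built from the strings MEET and FEET, with character-labelled branches below ROOT and node frequencies in parentheses, e.g. (1), (2), (4)]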
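A minimal sketch of how such a tree could be built, assuming a plain (uncompressed) character trie into which every suffix of every training string is inserted, with each node counting the suffixes that pass through it. The names `Node` and `build_suffix_tree` and the optional `max_depth` cut-off are illustrative assumptions, not taken from the slides.

```python
class Node:
    """One character position in the tree; freq counts suffixes passing through it."""

    def __init__(self):
        self.freq = 0
        self.children = {}


def build_suffix_tree(strings, max_depth=None):
    """Insert every suffix of every string, incrementing node frequencies.

    max_depth (an assumption, not from the slides) optionally truncates each
    suffix so the tree does not grow with the full length of long documents.
    """
    root = Node()
    for s in strings:
        for i in range(len(s)):
            suffix = s[i:] if max_depth is None else s[i:i + max_depth]
            node = root
            for ch in suffix:
                node = node.children.setdefault(ch, Node())
                node.freq += 1
    return root


root = build_suffix_tree(["MEET", "FEET"])
for ch in sorted(root.children):
    print(ch, root.children[ch].freq)
```

For MEET and FEET the first-level nodes come out as E (4), F (1), M (1) and T (2), since the suffixes EET and ET of both words pass through E.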

Levels of Information. Characters: the alphabet (and their frequencies) of a class. Matches: between query strings and a class, e.g. s = nviaXgraU>Tabl$$$ets, t = xv^ia$graTab£££lets, Matches(s, t) = {v, ia, gra, Tab, l, ets, $}. But what about overlapping matches? Trees: properties of the class as a whole, such as size and density (complexity).

Document Similarity Measure: The score for a document, d, is the sum of the scores for each of its suffixes, where d(i) is the suffix of d beginning at the ith letter and tau is a tree normalisation coefficient.

Substring Similarity Measure: The score for a match, m = m_0 m_1 m_2 … m_n, is score(m), where T is the tree profile of the class; v(m|T) is a normalisation coefficient based on the properties of T; p(m_t) is the probability of the character, m_t, of the match m; and Φ[p] is a significance function.
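The two measures can be sketched in code under several assumptions that the transcript does not confirm: the match for a suffix is the longest path from the root that agrees with it; p(m_t) is estimated as a node's frequency divided by the total frequency of its siblings; score(m) is v(m|T) times the sum of Φ[p(m_t)] over the matched characters; and the document score is the sum over suffixes divided by tau. All function names are illustrative.

```python
def build_tree(strings):
    """Suffix trie as nested dicts: node = {"freq": int, "children": {char: node}}."""
    root = {"freq": 0, "children": {}}
    for s in strings:
        for i in range(len(s)):
            node = root
            for ch in s[i:]:
                node = node["children"].setdefault(ch, {"freq": 0, "children": {}})
                node["freq"] += 1
    return root


def score_match(suffix, root, phi=lambda p: 1.0, nu=lambda match: 1.0):
    """Score the longest prefix of `suffix` found in the tree (assumed form)."""
    node, total, matched = root, 0.0, []
    for ch in suffix:
        child = node["children"].get(ch)
        if child is None:
            break  # the match ends at the first character not in the tree
        siblings_total = sum(c["freq"] for c in node["children"].values())
        total += phi(child["freq"] / siblings_total)  # assumed estimate of p(m_t)
        matched.append(ch)
        node = child
    return nu("".join(matched)) * total


def score_document(doc, root, phi=lambda p: 1.0, nu=lambda match: 1.0, tau=1.0):
    """Sum the match scores of every suffix d(i), normalised by tau (assumed division)."""
    return sum(score_match(doc[i:], root, phi, nu) for i in range(len(doc))) / tau


tree = build_tree(["MEET", "FEET"])
print(score_document("EET", tree))                    # constant significance function
print(score_match("ET", tree, phi=lambda p: p))       # linear significance function
```

With the constant function the score simply counts matched characters, so longer matches dominate; the other significance functions weight each character by how probable it is in the class tree.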

Decision Mechanism

Specifications of Φ[p] (character level): Constant: 1; Linear: p; Square: p^2; Root: p^0.5; Logit: ln(p) – ln(1 – p); Sigmoid: (1 + exp(–p))^–1. Note: Logit and Sigmoid need to be adjusted to fit in the range [0,1].
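For reference, the six functions can be written down directly. This is a sketch of the raw forms only: the slide does not say how Logit and Sigmoid are adjusted to fit [0,1], so no adjustment is attempted here, and the dictionary name `PHI` is an illustrative choice.

```python
import math

# Raw significance functions of a character probability p.
# Logit is undefined at p = 0 and p = 1; the rescaling into [0, 1]
# mentioned on the slide is not specified there and is left out.
PHI = {
    "constant": lambda p: 1.0,
    "linear": lambda p: p,
    "square": lambda p: p ** 2,
    "root": lambda p: math.sqrt(p),
    "logit": lambda p: math.log(p) - math.log(1.0 - p),
    "sigmoid": lambda p: 1.0 / (1.0 + math.exp(-p)),
}

for name, phi in PHI.items():
    print(name, phi(0.25))
```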

Significance function

Threshold Variation ~ Significance functions ~

Threshold Variation ~ Significance functions ~

Match normalisation: Match unnormalised: 1. Match permutation normalised and match length normalised, where m* is the set of all strings formed by permutations of m, and m’ is the set of all strings of length equal to the length of m.

Match normalisation MUN: match unnormalised; MPN: permutation normalised; MLN: length normalised

Threshold Variation ~ match normalisation ~ Constant significance function unnormalised Constant significance function match normalised

Specifications of tau: Unnormalised: 1; Size(T): the total number of nodes; Density(T): the average number of children of internal nodes; AvFreq(T): the average frequency of nodes.
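A small sketch of how these three tree properties might be computed on a character trie with node frequencies. Two conventions here are assumptions: the root is left out of Size(T) and AvFreq(T) because it carries no character, but it is counted as an internal node for Density(T). The helper names are illustrative.

```python
def build_tree(strings):
    """Suffix trie as nested dicts: node = {"freq": int, "children": {char: node}}."""
    root = {"freq": 0, "children": {}}
    for s in strings:
        for i in range(len(s)):
            node = root
            for ch in s[i:]:
                node = node["children"].setdefault(ch, {"freq": 0, "children": {}})
                node["freq"] += 1
    return root


def tree_stats(root):
    """Return size, density and average node frequency of a non-empty tree."""
    size = freq_total = internal = child_links = 0
    stack = [root]
    while stack:
        node = stack.pop()
        kids = list(node["children"].values())
        if kids:  # an internal node is any node with at least one child
            internal += 1
            child_links += len(kids)
        if node is not root:  # assumption: the root is not counted as a node
            size += 1
            freq_total += node["freq"]
        stack.extend(kids)
    return {
        "size": size,
        "density": child_links / internal,
        "av_freq": freq_total / size,
    }


print(tree_stats(build_tree(["MEET", "FEET"])))
```

Any of the three values could then be passed as tau when scoring a document, so that scores against a large, dense spam tree and a smaller ham tree become comparable.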

Tree normalisation

Androutsopoulos et al. (2000) ~ Ling-Spam Corpus ~
Classifier | Pre-processing | Number of Features | Spam Recall Error | Spam Precision Error
Naïve Bayes (NB) | Lemmatizer + Stop-List | 100 | 17.22% | 0.51%
Suffix Tree (ST) | None | N/A | 2.50% | 0.21%
Naïve Bayes* (NB*) | | Unlimited | 0.84% | 2.86%
Classifier | Pre-processing | Number of Features | Spam Recall Error | Spam Precision Error
Naïve Bayes (NB) | Lemmatizer + Stop-List | 300 | 36.95% | 0%
Suffix Tree (ST) | None | N/A | 3.96% |
Naïve Bayes* (NB*) | | Unlimited | 10.42% |

~ SpamAssassin Corpus ~
Classifier | Pre-processing | False Positive Rate | False Negative Rate
Suffix Tree (ST) | None | 3.50% | 3.25%
Naïve Bayes* (NB*) | Lemmatizer + Stop-List | 10.50% | 1.50%
~ Ling-BKS Corpus ~
Classifier | Pre-processing | False Positive Rate | False Negative Rate
Suffix Tree (ST) | None | 0%
Naïve Bayes* (NB*) | Lemmatizer + Stop-List | 12.25%

Conclusions: Good overall classifier: an improvement on naïve Bayes, but there’s still room for improvement. Can one method ever maintain 100% accuracy? Extending the classifier. Applications to other domains, e.g. web page classification.

Future Work - ODP

Computational Performance
Data Set | Training (s) | Av. Spam (ms) | Av. Ham (ms) | Av. Peak Mem.
LS-FULL (7.40MB) | 63 | 843 | 659 | 765MB
LS-11 (1.48MB) | 36 | 221 | 206 | 259MB
SAeh-11 (5.16MB) | 155 | 504 | 2528 | 544MB
BKS-LS-11 (1.12MB) | 41 | 161 | 222 | 345MB

Experimental Data Sets
Ling-Spam (LS): Spam (481) collected by Androutsopoulos et al.; Ham (2412) from online linguists’ bulletin board
Spam Assassin, Easy (SAe) and Hard (SAh): Spam (1876) and ham (4176) examples donated
BBK: Spam (652) collected by Birkbeck

Androutsopoulos et al. (2000) ~ Ling-Spam Corpus ~
Classifier Configuration | Threshold | No. of Attrib. | Spam Recall | Spam Precision
Bare 0.5 50 81.10% 96.85% Stop-List 82.35% 97.13% Lemmatizer 100 99.02% Lemmatizer + Stop-List 82.78% 99.49% 0.9 200 76.94% 99.46% 76.11% 99.47% 77.57% 99.45% Lemmatizer + Stop-List 78.41% 0.999 73.82% 99.43% 73.40% 300 63.67% 100.00% 63.05%

Androutsopoulos et al. (2000) ~ Ling-Spam Corpus ~
Classifier | Configuration | Spam Recall Error | Spam Precision Error
Naïve Bayes (NB) | Lemmatizer + Stop-List | 17.22% | 0.51%
Suffix Tree (ST) | N/A | 2.5% | 0.21%
Naïve Bayes* (NB*) | | 0.84% | 2.86%
Classifier | Configuration | Spam Recall Error | Spam Precision Error
Naïve Bayes (NB) | Lemmatizer + Stop-List | 36.95% | 0%
Suffix Tree (ST) | N/A | 3.96% |
Naïve Bayes* (NB*) | | 10.42% |

~ SpamAssassin Corpus ~
Classifier | Configuration | Spam Recall | Spam Precision
Naïve Bayes (NB) | Lemmatizer + Stop-List | 82.78% | 99.49%
Suffix Tree (ST) | N/A | 97.50% | 99.79%
Naïve Bayes* (NB*) | | 99.16% | 97.14%

Vector Space Model: “What then?” sang Plato’s ghost, “What then?” (W. B. Yeats). [Figure: word-count vector over the vocabulary what, host, plate, Plato, ghost, then, sang, book.] Word probability: P(w = ‘what’) = 50/1000 = 0.05
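As a rough illustration of the vector space idea, the Yeats line can be turned into a count vector over the slide's eight-word vocabulary. The tokenisation rule (lower-case runs of letters) and the function names are assumptions made for this sketch.

```python
import re
from collections import Counter


def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word occurs in the text.

    Tokens are lower-cased runs of letters, so "Plato's" yields "plato" and a
    stray "s"; tokens outside the vocabulary are simply ignored.
    """
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return [counts[word] for word in vocabulary]


def word_probability(word_count, total_words):
    """Relative frequency of a word, e.g. 50 occurrences in 1000 words."""
    return word_count / total_words


line = '"What then?" sang Plato\'s ghost, "What then?"'
vocabulary = ["what", "host", "plate", "plato", "ghost", "then", "sang", "book"]
print(bag_of_words(line, vocabulary))
print(word_probability(50, 1000))
```

Note that word order is lost: the vector records that "what" and "then" each occur twice, but not that they occur together.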

Creating Profiles Mark

Profiles: Mark Levene: engines, databases, information, search, data. Mike Hu: police, intelligence, criminal, computational, data.

Classification Boris Mirkin Mark Levene Mike Hu SBM SML SMH

Naïve Bayes (similarity measure): For a document d = {d1 d2 d3 … dm} and set of classes c = {c1, c2, ..., cJ}: (1) Where: (2) (3)
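For orientation, here is a sketch of the textbook multinomial naïve Bayes classifier with add-one (Laplace) smoothing, working in log space. It is the standard formulation, not necessarily the exact variant shown in equations (1) to (3) of the slide, and the function names and toy training data are invented for illustration.

```python
import math
from collections import Counter


def train(docs_by_class):
    """docs_by_class maps a class label to a list of tokenised documents."""
    vocab = {w for docs in docs_by_class.values() for d in docs for w in d}
    n_docs = sum(len(docs) for docs in docs_by_class.values())
    model = {}
    for label, docs in docs_by_class.items():
        counts = Counter(w for d in docs for w in d)
        total = sum(counts.values())
        model[label] = {
            "log_prior": math.log(len(docs) / n_docs),
            # add-one smoothing keeps unseen vocabulary words from zeroing the product
            "log_like": {
                w: math.log((counts[w] + 1) / (total + len(vocab))) for w in vocab
            },
        }
    return model


def classify(model, doc):
    """Return the class with the highest log posterior; unknown words are skipped."""
    def log_posterior(label):
        m = model[label]
        return m["log_prior"] + sum(
            m["log_like"][w] for w in doc if w in m["log_like"]
        )
    return max(model, key=log_posterior)


# Toy data, invented for this example only.
model = train({
    "spam": [["buy", "cheap", "meds"], ["cheap", "meds", "online"]],
    "ham": [["linguistics", "conference", "call"], ["call", "for", "papers"]],
})
print(classify(model, ["cheap", "meds"]))
print(classify(model, ["conference", "papers"]))
```

Each word contributes to the score independently of its neighbours, which is the bag-of-words assumption the presentation goes on to criticise.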

Criticisms: Pre-processing is required (stop-word removal; word stemming/lemmatisation; punctuation and formatting). Smallest unit of consideration is a word. Classes (and documents) are bags of words, i.e. each word is independent of all others.

Word Dependencies: Boris Mirkin: means, intelligence, clustering, computational, data. Mike Hu: means, intelligence, criminal, computational, data.

Word Inflections Intelligent Intellig- Intelligence OR intelligent Intelligentsia Intelligible

Success measures: Recall is the proportion of the examples of a class that are correctly classified. If SR is spam recall, then (1 – SR) is the proportion of spam misclassified as ham (false negatives). Precision is the proportion of the examples assigned to a class that are true members of that class, so it reflects how many of the positive assignments are true positives. If SP is spam precision, then (1 – SP) is the proportion of messages classified as spam that are actually ham (false positives).

Success measures: True Positive Rate (TPR) is the proportion of correctly classified examples of the ‘positive’ class. Spam is typically taken as the positive class, so TPR is then the number of spam classified as spam over the total number of spam. False Positive Rate (FPR) is the proportion of the ‘negative’ class erroneously assigned to the ‘positive’ class. Ham is typically taken as the negative class, so FPR is then the number of ham classified as spam over the total number of ham.
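These definitions reduce to a few ratios over the four confusion-matrix counts. The sketch below takes spam as the positive class; the function name and the example counts are made up for illustration.

```python
def success_measures(tp, fp, tn, fn):
    """Compute the measures from confusion counts, with spam as the positive class.

    tp: spam classified as spam    fn: spam classified as ham
    fp: ham classified as spam     tn: ham classified as ham
    """
    return {
        "recall": tp / (tp + fn),      # same quantity as the true positive rate
        "precision": tp / (tp + fp),
        "tpr": tp / (tp + fn),
        "fpr": fp / (fp + tn),
    }


# Hypothetical counts: 100 spam and 200 ham messages.
m = success_measures(tp=90, fp=5, tn=195, fn=10)
print(m)
print("recall error:", 1 - m["recall"], "precision error:", 1 - m["precision"])
```

Spam recall and TPR are the same ratio, whereas precision error (1 – SP) and FPR differ in their denominators: the former is relative to everything labelled spam, the latter to all ham.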

Classifier Structure [Diagram: training data (spam, ham) -> profiling method -> profile representation -> similarity/comparison measure -> decision mechanism or classification criterion -> decision: ham or spam?]

Classification using a suffix tree Method of profiling is construction of the tree (no pre-processing, no post-processing) The tree is a profile of the class. Similarity measure? Decision mechanism?

Threshold Variation ~ match normalisation ~ Constant significance function unnormalised Constant significance function match normalised SPE = spam precision error; HPE = ham precision error

Threshold Variation ~ Significance functions ~ Root function, no normalisation Logit function, no normalisation SPE = spam precision error; HPE = ham precision error

Threshold Variation: Constant significance function (unnormalised). SPE = spam precision error; HPE = ham precision error