A Suffix Tree Approach to Text Classification Applied to Email Filtering Rajesh Pampapathi, Boris Mirkin, Mark Levene School of Computer Science and Information.


1 A Suffix Tree Approach to Text Classification Applied to Email Filtering. Rajesh Pampapathi, Boris Mirkin, Mark Levene. School of Computer Science and Information Systems, Birkbeck College, University of London

2 Introduction – Outline
- Motivation: Examples of Spam
- Suffix Tree construction
- Document scoring and classification
- Experiments and results
- Conclusion

3 Buy cheap medications online, no prescription needed. We have Viagra, Pherentermine, Levitra, Soma, Ambien, Tramadol and many more products. No embarrasing trips to the doctor, get it delivered directly to your door. Experienced reliable service. Most trusted name brands. For your solution click here: 1. Standard spam mail

4 zygotes zoogenous zoometric zygosphene zygotactic zygoid zucchettos zymolysis zoopathy zygophyllaceous zoophytologist zygomaticoauricular zoogeologist zymoid zoophytish zoospores zygomaticotemporal zoogonous zygotenes zoogony zymosis zuza zoomorphs zythum zoonitic zyzzyva zoophobes zygotactic zoogenous zombies zoogrpahy zoneless zoonic zoom zoosporic zoolatrous zoophilous zymotically zymosterol FreeHYSHKRODMonthQGYIHOCSupply.IHJBUMDSTIPLIBJTJUBIYYXFN * GetJIIXOLDViagraPWXJXFDUUTabletsNXZXVRCBX zonally zooidal zoospermia zoning zoonosology zooplankton zoochemical zoogloeal zoological zoologist zooid zoosphere zoochemical & Safezoonal andNGASXHBPnatural & TestedQLOLNYQandEAVMGFCapproved zonelike zoophytes zoroastrians zonular zoogloeic zoris zygophore zoograft zoophiles zonulas zygotic zymograms zygotene zootomical zymes zoodendrium zygomata zoometries zoographist zygophoric zoosporangium zygotes zumatic zygomaticus zorillas zoocurrent zooxanthella zyzzyvas zoophobia zygodactylism zygotenes zoopathological noZFYFEPBmas 5. Embedded message (plus word salad)

5 Buy meds online and get it shipped to your door Find out more here a publications website accepted definition. known are can Commons the be definition. Commons UK great public principal work Pre-Budget but an can Majesty's many contains statements statements titles (eg includes have website. health, these Committee Select undertaken described may publications 4. Word salads

6 Creating a Suffix Tree
[Figure: suffix tree for the strings MEET and FEET, drawn from a ROOT node, with node frequencies shown in parentheses: (1), (2), (4).]
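The construction on this slide can be sketched in a few lines. This is a minimal illustration, assuming an uncompressed character-level tree (a suffix trie) in which each node counts how many inserted suffixes pass through it; the names `Node`, `insert_string`, `build_tree` and `frequency` are hypothetical and not taken from the slides.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One node of a character-level suffix tree (stored as an uncompressed trie)."""
    freq: int = 0                                   # number of suffixes passing through this node
    children: dict = field(default_factory=dict)    # maps a character to a child Node


def insert_string(root: Node, text: str) -> None:
    """Insert every suffix of `text`, incrementing the frequency of each node visited."""
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.children.setdefault(ch, Node())
            node.freq += 1


def build_tree(strings) -> Node:
    """Build one tree (a class profile) from all training strings."""
    root = Node()
    for s in strings:
        insert_string(root, s)
    return root


def frequency(root: Node, path: str) -> int:
    """Frequency of the node reached by following `path` from the root (0 if absent)."""
    node = root
    for ch in path:
        node = node.children.get(ch)
        if node is None:
            return 0
    return node.freq


tree = build_tree(["MEET", "FEET"])
print(sorted(tree.children))     # characters on the first level of the tree
print(frequency(tree, "E"))      # suffixes EET and ET occur in both strings
print(frequency(tree, "MEET"))   # the full string MEET occurs once
```

Storing one node per character keeps the sketch short; a production suffix tree would normally compress single-child chains into one edge.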

7 Levels of Information
- Characters: the alphabet (and their frequencies) of a class.
- Matches: between query strings and a class.
  s = nviaXgraU>Tabl$$$ets
  t = xv^ia$graTab£££lets
  Matches(s, t) = {v, ia, gra, Tab, l, ets, $}
  But what about overlapping matches?
- Trees: properties of the class as a whole: size, density (complexity).

8 Document Similarity Measure
The score for a document, d, is the sum of the scores for each suffix.
- d(i) is the suffix of d beginning at the i-th letter.
- tau is a tree normalisation coefficient.
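The formula itself did not survive in the transcript. One form consistent with the description above is sketched here; the name SCORE, the symbol T for the class tree, and the summation over every letter position i of d are assumptions, not the slide's own notation.

```latex
\mathrm{SCORE}(d, T) \;=\; \tau \sum_{i} \mathrm{score}\bigl(d(i), T\bigr)
```

Here score(d(i), T) is the score of the match between the suffix d(i) and the tree T.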

9 Substring Similarity Measure
The score for a match, m = m_0 m_1 m_2 … m_n, is score(m), where:
- T is the tree profile of the class.
- v(m|T) is a normalisation coefficient based on the properties of T.
- p(m_t) is the probability of the character, m_t, of the match m.
- Φ[p] is a significance function.
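A rough sketch of how these pieces might combine follows. It assumes that score(m) is the normalisation coefficient multiplied by the sum of Φ[p(m_t)] over the characters of the match, and that p(m_t) is estimated as a node's frequency divided by the total frequency of its siblings. Both are assumptions, since the slide's equation is not in the transcript, and the function names are hypothetical.

```python
def build_tree(strings):
    """Suffix trie as nested dicts: node = {"freq": int, "children": {char: node}}."""
    root = {"freq": 0, "children": {}}
    for s in strings:
        for start in range(len(s)):
            node = root
            for ch in s[start:]:
                node = node["children"].setdefault(ch, {"freq": 0, "children": {}})
                node["freq"] += 1
    return root


def match_score(tree, suffix, phi=lambda p: p, nu=1.0):
    """Assumed form: score(m) = nu * sum over t of phi(p(m_t)).

    m is the longest prefix of `suffix` that can be followed from the root.
    p(m_t) is estimated as child frequency / total frequency of all siblings.
    """
    node, total = tree, 0.0
    for ch in suffix:
        child = node["children"].get(ch)
        if child is None:
            break  # the match ends at the first character not found in the tree
        sibling_total = sum(c["freq"] for c in node["children"].values())
        total += phi(child["freq"] / sibling_total)
        node = child
    return nu * total


def document_score(tree, doc, phi=lambda p: p, tau=1.0):
    """Sum the match scores of every suffix d(i) of the document, scaled by tau."""
    return tau * sum(match_score(tree, doc[i:], phi) for i in range(len(doc)))


tree = build_tree(["MEET", "FEET"])
linear = match_score(tree, "EEL")                       # linear significance, phi(p) = p
constant = match_score(tree, "EEL", phi=lambda p: 1.0)  # constant phi gives the match length
total = document_score(tree, "EEL")
print(linear, constant, total)
```

With a constant significance function the match score reduces to the length of the match, which makes it a convenient baseline.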

10 Decision Mechanism

11 Specifications of Φ[p] (character level)
- Constant: 1
- Linear: p
- Square: p^2
- Root: p^0.5
- Logit: ln(p) – ln(1 - p)
- Sigmoid: (1 + exp(-p))^-1
Note: Logit and Sigmoid need to be adjusted to fit in the range [0, 1].
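The six functions are easy to write down directly. The slide does not say how Logit and Sigmoid are adjusted to the range [0, 1], so the min-max rescaling in `rescaled` and the clipping constant `eps` below are assumptions made for illustration only.

```python
import math


def constant(p):
    return 1.0


def linear(p):
    return p


def square(p):
    return p ** 2


def root(p):
    return p ** 0.5


def logit(p, eps=1e-6):
    # Clip p away from 0 and 1 so the logarithms stay finite (assumed safeguard).
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p) - math.log(1.0 - p)


def sigmoid(p):
    return 1.0 / (1.0 + math.exp(-p))


def rescaled(f, lo, hi):
    """Min-max rescale f so that f(lo) maps to 0 and f(hi) maps to 1 (assumed adjustment)."""
    f_lo, f_hi = f(lo), f(hi)
    return lambda p: (f(p) - f_lo) / (f_hi - f_lo)


sigmoid01 = rescaled(sigmoid, 0.0, 1.0)
logit01 = rescaled(logit, 1e-6, 1.0 - 1e-6)

for name, f in [("constant", constant), ("linear", linear), ("square", square),
                ("root", root), ("logit01", logit01), ("sigmoid01", sigmoid01)]:
    print(name, round(f(0.25), 4))
```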

12 Significance function

13 Threshold Variation ~ Significance functions ~

14

15 Match normalisation
- Match unnormalised: 1
- Match permutation normalised
- Match length normalised
m* is the set of all strings formed by permutations of m; m′ is the set of all strings of length equal to the length of m.
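The two normalised formulas are missing from the transcript. A plausible reconstruction, assuming f(m|T) denotes the frequency of the match m in the tree T and writing m' for the set of all strings with the same length as m, is sketched below; the symbol f and the subscripts MUN, MPN and MLN on v are assumptions.

```latex
v_{\mathrm{MUN}}(m \mid T) = 1, \qquad
v_{\mathrm{MPN}}(m \mid T) = \frac{f(m \mid T)}{\sum_{i \in m^{*}} f(i \mid T)}, \qquad
v_{\mathrm{MLN}}(m \mid T) = \frac{f(m \mid T)}{\sum_{i \in m'} f(i \mid T)}
```

Under this reading, each coefficient expresses the frequency of a match relative to the frequencies of comparable strings in the same tree.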

16 Match normalisation MUN: match unnormalised; MPN: permutation normalised; MLN: length normalised

17 Threshold Variation ~ match normalisation ~ Constant significance function unnormalised Constant significance function match normalised

18 Specifications of tau
- Unnormalised: 1
- Size(T): the total number of nodes
- Density(T): the average number of children of internal nodes
- AvFreq(T): the average frequency of nodes
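The three tree statistics can be computed in one traversal. In this sketch the root is counted as an internal node for Density(T) but excluded from Size(T) and AvFreq(T), because it carries no character; that convention, the dict-based tree layout and the function names are assumptions.

```python
def build_tree(strings):
    """Suffix trie as nested dicts: node = {"freq": int, "children": {char: node}}."""
    root = {"freq": 0, "children": {}}
    for s in strings:
        for start in range(len(s)):
            node = root
            for ch in s[start:]:
                node = node["children"].setdefault(ch, {"freq": 0, "children": {}})
                node["freq"] += 1
    return root


def tree_stats(root):
    """Return Size(T), Density(T) and AvFreq(T) for a non-empty tree."""
    size = 0          # number of nodes, root excluded (assumed convention)
    freq_total = 0    # sum of node frequencies
    internal = 0      # nodes that have at least one child, root included
    child_total = 0   # total number of children over all internal nodes
    stack = [(root, True)]
    while stack:
        node, is_root = stack.pop()
        if not is_root:
            size += 1
            freq_total += node["freq"]
        if node["children"]:
            internal += 1
            child_total += len(node["children"])
        stack.extend((child, False) for child in node["children"].values())
    return {
        "size": size,
        "density": child_total / internal,
        "av_freq": freq_total / size,
    }


stats = tree_stats(build_tree(["MEET", "FEET"]))
print(stats)
```

Any one of these values could then be used as the coefficient tau when scoring a document against the tree.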

19 Tree normalisation

20 Androutsopoulos et al. (2000) ~ Ling-Spam Corpus ~
Classifier | Pre-processing | Number of Features | Spam Recall Error | Spam Precision Error
Naïve Bayes (NB) | Lemmatizer + Stop-List | … | …% | 0.51%
Suffix Tree (ST) | None | N/A | 2.50% | 0.21%
Naïve Bayes* (NB*) | Lemmatizer + Stop-List | Unlimited | 0.84% | 2.86%

Classifier | Pre-processing | Number of Features | Spam Recall Error | Spam Precision Error
Naïve Bayes (NB) | Lemmatizer + Stop-List | … | …% | 0%
Suffix Tree (ST) | None | N/A | 3.96% | 0%
Naïve Bayes* (NB*) | Lemmatizer + Stop-List | Unlimited | 10.42% | 0%

21 ~ Ling-BKS Corpus ~
Classifier | Pre-processing | False Positive Rate | False Negative Rate
Suffix Tree (ST) | None | 0% | 0%
Naïve Bayes* (NB*) | Lemmatizer + Stop-List | 0% | 12.25%

~ SpamAssassin Corpus ~
Classifier | Pre-processing | False Positive Rate | False Negative Rate
Suffix Tree (ST) | None | 3.50% | 3.25%
Naïve Bayes* (NB*) | Lemmatizer + Stop-List | 10.50% | 1.50%

22 Conclusions
- Good overall classifier - improvement on naïve Bayes - but there's still room for improvement
- Can one method ever maintain 100% accuracy?
- Extending the classifier
- Applications to other domains - web page classification

23 Future Work - ODP

24 Computational Performance
Data Set | Training (s) | Av. Spam (ms) | Av. Ham (ms) | Av. Peak Mem.
LS-FULL (7.40MB) | … | … | … | … MB
LS-11 (1.48MB) | … | … | … | … MB
SAeh-11 (5.16MB) | … | … | … | … MB
BKS-LS-11 (1.12MB) | … | … | … | … MB

25 Experimental Data Sets
- Ling-Spam (LS): spam (481) collected by Androutsopoulos et al.; ham (2412) from an online linguists' bulletin board.
- Spam Assassin, Easy (SAe) and Hard (SAh): spam (1876) and ham (4176) examples donated.
- BBK: spam (652) collected by Birkbeck.

26 Androutsopoulos et al. (2000) ~ Ling-Spam Corpus ~
Classifier Configuration | Threshold | No. of Attrib. | Spam Recall | Spam Precision
Bare | … | … | 81.10% | 96.85%
Stop-List | … | … | …% | 97.13%
Lemmatizer | … | … | …% | 99.02%
Lemmatizer + Stop-List | … | … | …% | 99.49%
Bare | … | … | …% | 99.46%
Stop-List | … | … | …% | 99.47%
Lemmatizer | … | … | …% | 99.45%
Lemmatizer + Stop-List | … | … | …% | 99.47%
Bare | … | … | …% | 99.43%
Stop-List | … | … | …% | 99.43%
Lemmatizer | … | … | …% | 100.00%
Lemmatizer + Stop-List | … | … | …% | 100.00%

27 Androutsopoulos et al. (2000) ~ Ling-Spam Corpus ~
Classifier | Configuration | Spam Recall Error | Spam Precision Error
Naïve Bayes (NB) | Lemmatizer + Stop-List | 17.22% | 0.51%
Suffix Tree (ST) | N/A | 2.5% | 0.21%
Naïve Bayes* (NB*) | Lemmatizer + Stop-List | 0.84% | 2.86%

Classifier | Configuration | Spam Recall Error | Spam Precision Error
Naïve Bayes (NB) | Lemmatizer + Stop-List | 36.95% | 0%
Suffix Tree (ST) | N/A | 3.96% | 0%
Naïve Bayes* (NB*) | Lemmatizer + Stop-List | 10.42% | 0%

28 ~ SpamAssassin Corpus ~
Classifier | Configuration | Spam Recall | Spam Precision
Naïve Bayes (NB) | Lemmatizer + Stop-List | 82.78% | 99.49%
Suffix Tree (ST) | N/A | 97.50% | 99.79%
Naïve Bayes* (NB*) | Lemmatizer + Stop-List | 99.16% | 97.14%

29

30

31

32 Vector Space Model
"What then?" sang Plato's ghost, "What then?" (W. B. Yeats)
Words: what, host, plate, Plato, ghost, then, sang, book
Word probability: P(w = what) = 50/1000 = 0.05
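The word probability on this slide is a relative frequency: the count of a word divided by the total number of words. A small sketch follows, assuming lower-casing and a simple regular-expression tokeniser; neither detail is specified on the slide, and `word_probabilities` is a hypothetical helper name.

```python
import re
from collections import Counter


def word_probabilities(text):
    """Estimate P(w) as count(w) / total number of words in the text."""
    words = re.findall(r"[a-z']+", text.lower())   # assumed tokenisation rule
    counts = Counter(words)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


probs = word_probabilities("What then? sang Plato's ghost, What then?")
print(probs["what"])   # "what" occurs 2 times out of 7 words

# The slide's own figures: 50 occurrences in a collection of 1000 words.
p_what = 50 / 1000
print(p_what)
```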

33 Creating Profiles Mark

34 Profiles
- Mark Levene: data, databases, information, search, engines
- Mike Hu: data, intelligence, criminal, computational, police

35 Classification
Profiles: Boris Mirkin, Mark Levene, Mike Hu. Scores: S_BM, S_ML, S_MH.

36 Naïve Bayes (similarity measure)
(1) For a document d = {d_1 d_2 d_3 … d_m} and set of classes c = {c_1, c_2, ..., c_J}: Where: (2) (3)
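The slide's three equations are not preserved in the transcript. For orientation only, the textbook form of a naïve Bayes classifier over the words of d is given below; it may differ in detail from what the slide showed, and the count(·) notation and the estimate without smoothing are assumptions.

```latex
\hat{c} = \arg\max_{c_j \in c} P(c_j \mid d), \qquad
P(c_j \mid d) \propto P(c_j) \prod_{i=1}^{m} P(d_i \mid c_j), \qquad
P(d_i \mid c_j) = \frac{\mathrm{count}(d_i, c_j)}{\sum_{w} \mathrm{count}(w, c_j)}
```

The product over individual words is where the independence ("bag of words") assumption enters.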

37 Criticisms
- Pre-processing:
  - Stop-word removal
  - Word stemming/lemmatisation
  - Punctuation and formatting
- Smallest unit of consideration is a word.
- Classes (and documents) are bags of words, i.e. each word is independent of all others.

38 Word Dependencies
- Boris Mirkin: data, intelligence, clustering, computational, means
- Mike Hu: data, intelligence, criminal, computational, means

39 Word Inflections
Intellig- OR intelligent: Intelligent, Intelligence, Intelligentsia, Intelligible

40 Success measures
- Recall is the proportion of correctly classified examples of a class. If SR is spam recall, then (1 - SR) gives the proportion of false negatives.
- Precision is the proportion of examples assigned to a class which are true members of that class. It is a measure of the number of true positives. If SP is spam precision, then (1 - SP) gives the proportion of false positives among the examples assigned to spam.

41 Success measures
- True Positive Rate (TPR) is the proportion of correctly classified examples of the positive class. Spam is typically taken as the positive class, so TPR is then the number of spam classified as spam over the total number of spam.
- False Positive Rate (FPR) is the proportion of the negative class erroneously assigned to the positive class. Ham is typically taken as the negative class, so FPR is then the number of ham classified as spam over the total number of ham.
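The definitions on the two "Success measures" slides can be checked with a small calculation. The sketch below treats spam as the positive class; the function name and the confusion counts in the usage example are made up for illustration.

```python
def success_measures(tp, fp, tn, fn):
    """Compute the measures from confusion counts, with spam as the positive class.

    tp: spam classified as spam    fp: ham classified as spam
    tn: ham classified as ham      fn: spam classified as ham
    """
    return {
        "spam_recall": tp / (tp + fn),      # identical to the true positive rate
        "spam_precision": tp / (tp + fp),
        "false_positive_rate": fp / (fp + tn),
        "false_negative_rate": fn / (tp + fn),
    }


# Hypothetical counts: 100 spam and 100 ham messages.
m = success_measures(tp=90, fp=5, tn=95, fn=10)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

Note that (1 - spam recall) equals the false negative rate, whereas (1 - spam precision) is measured against the messages assigned to spam and so differs from the false positive rate, which is measured against all ham.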

42 Classifier Structure
Training Data (Spam, Ham) -> Profiling Method -> Profile Representation -> Similarity/Comparison Measure -> Decision Mechanism or Classification Criterion -> Decision: Spam or Ham?

43 Classification using a suffix tree
- Method of profiling is construction of the tree (no pre-processing, no post-processing)
- The tree is a profile of the class.
- Similarity measure?
- Decision mechanism?

44 Threshold Variation ~ match normalisation ~ Constant significance function unnormalised Constant significance function match normalised SPE = spam precision error; HPE = ham precision error

45 Threshold Variation ~ Significance functions ~
[Figure panels: root function, no normalisation; logit function, no normalisation.] SPE = spam precision error; HPE = ham precision error

46 Threshold Variation Constant significance function (unnormalised) SPE = spam precision error; HPE = ham precision error

