Published by Beverley Johnston. Modified about 1 year ago.

1
CRM114 Team: KNN and Hyperspace Spam Sorting
Sorting Spam with K-Nearest Neighbor and Hyperspace Classifiers
William Yerazunis (1), Fidelis Assis (2), Christian Siefkes (3), Shalendra Chhabra (1,4)
1: Mitsubishi Electric Research Labs, Cambridge MA
2: Empresa Brasileira de Telecomunicações Embratel, Rio de Janeiro, RJ Brazil
3: Database and Information Systems Group, Freie Universität Berlin, Berlin-Brandenburg Graduate School in Distributed Information Systems
4: Computer Science and Engineering, University of California, Riverside CA

2
Bayesian is Great. Why Worry?
● Typical spam filters are linear classifiers
– Consider the “checkerboard” problem
● Markovian requires the nonlinear features to be textually “near” each other
– can’t be sure that will work forever, because spammers are clever
● Winnow is just a different weighting plus a different chain rule

3
Bayesian is Great. Why Worry?
● Bayesian is only a linear classifier
– Consider the “checkerboard” problem
● Markovian requires the nonlinear features to be textually “near” each other
– can’t be sure of that; spammers are clever
● Winnow is just a different weighting
● KNNs are a very different kind of classifier

4
Typical Linear Separation

7
Nonlinear Decision Surfaces
Nonlinear decision surfaces require tremendous amounts of data.

8
Nonlinear Decision and KNN / Hyperspace
Nonlinear decision surfaces require tremendous amounts of data.

11
KNNs have been around
● Earliest found reference: E. Fix and J. Hodges, Discriminatory Analysis: Nonparametric Discrimination: Consistency Properties
● In 1951!
● Interesting theorem: Cover and Hart (1967). In the large-sample limit, the error rate of a nearest-neighbor classifier is at most twice the error rate of the optimal Bayesian classifier
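The Cover and Hart bound can be written out for the two-class case (spam vs. good). The symbols are introduced here for illustration and do not appear on the slide: R* is the error rate of the optimal Bayesian classifier and R_NN is the asymptotic (infinite training data) error rate of the 1-nearest-neighbor rule.

```latex
R^{*} \;\le\; R_{\mathrm{NN}} \;\le\; 2R^{*}\,(1 - R^{*}) \;\le\; 2R^{*}
```

For example, if the best possible filter errs on 1% of messages (R* = 0.01), the bound says a nearest-neighbor filter with unlimited training data errs on at most 2 × 0.01 × 0.99 = 1.98% of them.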

12
KNNs in one slide!
● Start with a bunch of known things and one unknown thing.
● Find the K known things most similar to the unknown thing.
● Count how many of the K known things are in each class.
● The unknown thing is of the same class as the majority of the K known things.
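The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not the CRM114 code: the function name `knn_classify`, the toy messages, and the choice of "number of shared features" as the similarity measure are all assumptions made for the example.

```python
from collections import Counter

def knn_classify(known, unknown, k):
    """Classify `unknown` by majority vote among its k most similar neighbors.

    known   -- list of (feature_set, label) pairs (the "known things")
    unknown -- feature set of the message to classify
    """
    # Assumed similarity: the number of features shared with the unknown.
    ranked = sorted(known, key=lambda item: len(item[0] & unknown), reverse=True)
    # Count how many of the k nearest known things fall in each class.
    votes = Counter(label for _, label in ranked[:k])
    # The unknown takes the majority class.
    return votes.most_common(1)[0][0]

# Toy data, invented for illustration only.
known = [
    ({"cheap", "pills", "now"}, "spam"),
    ({"buy", "cheap", "watches"}, "spam"),
    ({"meeting", "at", "noon"}, "good"),
    ({"lunch", "at", "noon"}, "good"),
    ({"project", "meeting", "notes"}, "good"),
]
spam_label = knn_classify(known, {"cheap", "pills", "today"}, k=3)
good_label = knn_classify(known, {"meeting", "noon", "tomorrow"}, k=3)
```

Sorting the whole database for every message is the simplest way to find the K nearest neighbors, and it also shows why classification slows down as the database grows.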

13
Issues with Standard KNNs
● How big is the neighborhood K?
● How do you weight your neighbors?
– Equal-vote?
– Some falloff in weight?
– Nearby interaction – the Parzen window?
● How do you train?
– Everything? That gets big...
– And SLOW.

14
Issues with Standard KNNs
● How big is the neighborhood? We will test with 3, 7, 21 and |corpus|
● How do we weight the neighbors? We will try equal-weighting, similarity, Euclidean distance, and combinations thereof.

15
Issues with Standard KNNs
● How do we train?
– To compare with a good Markov classifier we need to use TOE (Train Only Errors)
– This is good in that it really speeds up classification and keeps the database small.
– This is bad in that it violates the Cover and Hart assumptions, so the quality limit theorem no longer applies
– BUT we will train multiple passes to see if an asymptote appears.
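A minimal sketch of Train Only Errors with multiple passes might look like the following. The stand-in classifier (nearest stored example by shared-feature count), the names `classify` and `train_only_errors`, and the toy messages are assumptions for illustration; they are not the classifiers measured in the talk.

```python
def classify(database, features):
    """Stand-in classifier: label of the stored example sharing the most features.

    Returns None when the database is empty, which always counts as an error.
    """
    if not database:
        return None
    best = max(database, key=lambda item: len(item[0] & features))
    return best[1]

def train_only_errors(messages, passes=5):
    """Test each message first; store it only if it was misclassified."""
    database = []
    errors_per_pass = []
    for _ in range(passes):
        errors = 0
        for features, label in messages:
            if classify(database, features) != label:
                database.append((features, label))  # train on the error
                errors += 1
        errors_per_pass.append(errors)
    return database, errors_per_pass

# Toy corpus, invented for illustration only.
messages = [
    ({"cheap", "pills"}, "spam"),
    ({"team", "meeting"}, "good"),
    ({"cheap", "watches"}, "spam"),
    ({"meeting", "notes"}, "good"),
]
database, errors_per_pass = train_only_errors(messages, passes=3)
```

On this toy corpus only two of the four messages end up in the database, and the per-pass error count drops to zero after the first pass, which is the kind of asymptote the multi-pass runs look for.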

16
Issues with Standard KNNs
● We found the “bad” KNNs mimic Cover and Hart behavior: they insert basically everything into a bloated database, sometimes more than once!
● The more accurate KNNs inserted fewer examples into their database.

17
How do we compare KNNs?
● Use the TREC 2005 SA dataset.
● 10-fold validation: train on 90%, test on 10%, repeat for each successive 10% (but remember to clear memory!)
● Run 5 passes (find the asymptote)
● Compare against the OSB Markovian tested at TREC
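The 10-fold procedure can be sketched as a generator of (train, test) splits. The helper name `ten_fold_splits` is hypothetical, and the sketch assumes the corpus has at least as many items as folds; building a fresh classifier from each `train` list is what "clear memory" amounts to here.

```python
def ten_fold_splits(items, folds=10):
    """Yield (train, test) pairs; each successive block serves as the test set once.

    Assumes len(items) >= folds. Any remainder goes into the last fold.
    """
    size = len(items) // folds
    for i in range(folds):
        start = i * size
        stop = len(items) if i == folds - 1 else start + size
        # Train on everything outside the current block, test on the block.
        yield items[:start] + items[stop:], items[start:stop]

# Stand-in corpus of 100 message ids, invented for illustration only.
corpus = list(range(100))
splits = list(ten_fold_splits(corpus))
```

Each fold starts from the `train` list alone, so nothing learned while testing one 10% block leaks into the next.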

18
What do we use as features?
● Use the OSB feature set. This combines nearby words to make short phrases; the phrases are what are matched.
● Example: “this is an example” yields “this is”, “this an”, “this example”
These features are the measurements we classify against.
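A rough sketch of this kind of word-pair feature extraction is shown below. The function name `osb_features`, whitespace tokenization, and the window of 5 tokens are assumptions for illustration (the slide does not state a window size), and the sketch ignores any marking of how far apart the two words were.

```python
def osb_features(text, window=5):
    """Pair each word with each of the next (window - 1) words.

    Simplified sketch: whitespace tokenization, no record of the gap
    between the two words, no hashing.
    """
    tokens = text.split()
    features = []
    for i, first in enumerate(tokens):
        for second in tokens[i + 1 : i + window]:
            features.append(f"{first} {second}")
    return features

features = osb_features("this is an example")
```

For the slide's sentence, the features anchored at "this" are exactly the three shown on the slide; the sketch also emits the pairs anchored at the later words ("is an", "is example", "an example").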

19
Test 1: Equal Weight Voting KNN with K = 3, 7, and 21
Asymptotic accuracy: 93%, 93%, and 94% (good acc: 98%, spam acc: 80% for K = 3 and 7; 96% and 90% for K = 21)
Time: ~50-75 milliseconds/message

20
Test 2: Weight by Hamming^(-1/2) KNN with K = 7 and 21
Asymptotic accuracy: 94% and 92% (good acc: 98%, spam acc: 85% for K = 7; 98% and 79% for K = 21)
Time: ~60 milliseconds/message

21
Test 3: Weight by Hamming^(-1/2) KNN with K = |corpus|
Asymptotic accuracy: 97.8%
Good accuracy: 98.2%; Spam accuracy: 96.9%
Time: 32 msec/message
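With K = |corpus|, every stored example votes, and the vote is weighted by its distance. The sketch below reads "Hamming^(-1/2)" as one over the square root of the Hamming distance between feature sets. That reading, the `min_distance` floor that avoids dividing by zero on exact matches, and the names `hamming` and `weighted_knn` are all assumptions, not details taken from the talk.

```python
from collections import defaultdict

def hamming(a, b):
    """Hamming distance between two feature sets: features present in only one."""
    return len(a ^ b)

def weighted_knn(known, unknown, min_distance=1.0):
    """Every known example votes with weight distance ** -0.5 (K = |corpus|)."""
    scores = defaultdict(float)
    for features, label in known:
        # Floor the distance so an exact match does not divide by zero (assumption).
        d = max(hamming(features, unknown), min_distance)
        scores[label] += d ** -0.5
    return max(scores, key=scores.get), dict(scores)

# Toy data, invented for illustration only.
known = [
    ({"cheap", "pills"}, "spam"),
    ({"cheap", "watches", "now"}, "spam"),
    ({"team", "meeting"}, "good"),
]
label, scores = weighted_knn(known, {"cheap", "pills", "now"})
```

Because distant examples still contribute a small vote, there is no neighborhood size to tune; the weighting alone decides how much each example matters.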

22
Test 4: Weight by N-dimensional radiation model (a.k.a. “Hyperspace”)

23
Test 4: Hyperspace weight, K = |corpus|, d = 1, 2, 3
Asymptotic accuracy: 99.3%
Good accuracy: 99.64%, 99.66%, and 99.59%
Spam accuracy: 98.7%, 98.4%, and 98.5%
Time: 32, 22, and 22 milliseconds/message

24
Test 5: Compare vs. Markov OSB (thin threshold)
Asymptotic accuracy: 99.1%
Good accuracy: 99.6%; Spam accuracy: 97.9%
Time: 31 msec/message

25
Test 6: Compare vs. Markov OSB (thick threshold = 10.0 pR)
● Thick Threshold means:
– Test it first
– If it is wrong, train it.
– If it was right, but only by less than the threshold thickness, train it anyway!
● 10.0 pR units is roughly the range from 10% to 90% certainty.
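The thick-threshold rule boils down to a single training decision per message. The sketch below assumes a signed confidence score in pR-like units where positive leans "good" and negative leans "spam"; that sign convention and the function name `should_train` are illustrative assumptions, not the CRM114 interface.

```python
def should_train(score, true_label, thickness=10.0):
    """Decide whether to train on a message that has just been tested.

    score      -- signed confidence; >= 0 leans "good", < 0 leans "spam" (assumed)
    true_label -- "good" or "spam"
    thickness  -- margin inside which a correct answer is still trained
    """
    predicted = "good" if score >= 0 else "spam"
    if predicted != true_label:
        return True  # wrong: always train
    # Right, but train anyway if the margin is thinner than the threshold.
    return abs(score) < thickness

decisions = [
    should_train(-3.0, "good"),   # wrong answer
    should_train(4.0, "good"),    # right, but inside the threshold
    should_train(25.0, "good"),   # right, with a comfortable margin
]
```

Setting `thickness=0` reduces this to plain Train Only Errors, so the thickness is the only difference between the two training regimes in this sketch.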

26
Test 6: Compare vs. Markov OSB (thick threshold = 10.0 pR)
Asymptotic accuracy: 99.5%
Good accuracy: 99.6%; Spam accuracy: 99.3%
Time: 19 msec/message

30
Conclusions:
● Small-K KNNs are not very good for sorting spam.
● K=|corpus| KNNs with distance weighting are reasonable.
● K=|corpus| KNNs with hyperspace weighting are pretty good.
● But thick-threshold trained Markovs seem to be more accurate, especially in single-pass training.

31
Thank you! Questions?
Full source is available at (licensed under the GPL)
