Slide 1: Sorting Spam with K-Nearest Neighbor and Hyperspace Classifiers
William Yerazunis 1, Fidelis Assis 2, Christian Siefkes 3, Shalendra Chhabra 1,4
1: Mitsubishi Electric Research Labs, Cambridge MA
2: Empresa Brasileira de Telecomunicações - Embratel, Rio de Janeiro, RJ, Brazil
3: Database and Information Systems Group, Freie Universität Berlin, Berlin-Brandenburg Graduate School in Distributed Information Systems
4: Computer Science and Engineering, University of California, Riverside CA

Slide 2: Bayesian is Great. Why Worry?
● Typical spam filters are linear classifiers – consider the “checkerboard” problem.
● Markovian requires the nonlinear features to be textually “near” each other – we can’t be sure that will keep working forever, because spammers are clever.
● Winnow is just a different weighting plus a different chain rule.

Slide 3: Bayesian is Great. Why Worry?
● Bayesian is only a linear classifier – consider the “checkerboard” problem.
● Markovian requires the nonlinear features to be textually “near” each other – we can’t be sure of that; spammers are clever.
● Winnow is just a different weighting.
● KNNs are a very different kind of classifier.

Slides 4–6: Typical Linear Separation
(Figure slides; the plots are not included in the transcript.)

Slide 7: Nonlinear Decision Surfaces
Nonlinear decision surfaces require tremendous amounts of data.

Slide 8: Nonlinear Decision and KNN / Hyperspace
Nonlinear decision surfaces require tremendous amounts of data.

Slide 9: KNNs have been around
● Earliest found reference: E. Fix and J. Hodges, “Discriminatory Analysis: Nonparametric Discrimination: Consistency Properties”

Slide 10: KNNs have been around
● Earliest found reference: E. Fix and J. Hodges, “Discriminatory Analysis: Nonparametric Discrimination: Consistency Properties”
● In 1951!

Slide 11: KNNs have been around
● Earliest found reference: E. Fix and J. Hodges, “Discriminatory Analysis: Nonparametric Discrimination: Consistency Properties”
● In 1951!
● Interesting theorem: Cover and Hart (1967) – the asymptotic error rate of a nearest-neighbor classifier is at most twice the error rate of the optimal Bayesian classifier.

Slide 12: KNNs in one slide!
● Start with a bunch of known things and one unknown thing.
● Find the K known things most similar to the unknown thing.
● Count how many of the K known things are in each class.
● The unknown thing is of the same class as the majority of the K known things.
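A minimal sketch of that procedure, assuming documents are represented as sets of features and using the number of shared features as the similarity measure; the function name and the default K here are placeholders, not the values tested later:

```python
from collections import Counter

def knn_classify(unknown, known, k=7):
    """Classify `unknown` (a set of features) by majority vote among the
    k most similar examples in `known`, a list of (feature_set, label) pairs."""
    # Rank known examples by similarity -- here, the number of shared features.
    ranked = sorted(known, key=lambda ex: len(unknown & ex[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    # The unknown document takes the class of the majority of its k neighbors.
    return votes.most_common(1)[0][0]
```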

Slide 13: Issues with Standard KNNs
● How big is the neighborhood K?
● How do you weight your neighbors?
  – Equal-vote?
  – Some falloff in weight?
  – Nearby interaction – the Parzen window?
● How do you train?
  – Everything? That gets big...
  – And SLOW.

Slide 14: Issues with Standard KNNs
● How big is the neighborhood? We will test with K = 3, 7, 21, and |corpus|.
● How do we weight the neighbors? We will try equal weighting, similarity, Euclidean distance, and combinations thereof.
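A hedged sketch of how such neighbor weightings plug into the vote. The weight functions below are illustrative stand-ins for the kinds of weightings listed on the slide, not the exact formulas used in the tests that follow:

```python
def equal_weight(unknown, example):
    return 1.0                                   # every neighbor votes equally

def similarity_weight(unknown, example):
    return len(unknown & example)                # more shared features, bigger vote

def inverse_root_distance_weight(unknown, example):
    d = len(unknown ^ example)                   # Hamming-style distance on feature sets
    return 1.0 / ((d + 1) ** 0.5)                # falls off as distance^(-1/2); +1 avoids /0

def weighted_vote(unknown, neighbors, weight_fn):
    """Sum per-class weights over the chosen neighbors and pick the heaviest class."""
    scores = {}
    for features, label in neighbors:
        scores[label] = scores.get(label, 0.0) + weight_fn(unknown, features)
    return max(scores, key=scores.get)
```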

Slide 15: Issues with Standard KNNs
● How do we train?
  – To compare with a good Markov classifier we need to use TOE – Train Only Errors.
  – This is good in that it really speeds up classification and keeps the database small.
  – This is bad in that it violates the Cover and Hart assumptions, so the quality limit theorem no longer applies.
  – BUT – we will train multiple passes to see if an asymptote appears.
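A minimal sketch of the TOE loop just described, assuming generic `classify` and `train` callables for whatever classifier is being driven (these names are placeholders, not CRM114 API):

```python
def train_only_errors(corpus, classify, train, passes=5):
    """TOE: an example is added to the database only when the current
    classifier gets it wrong; repeated passes look for an asymptote."""
    for _ in range(passes):
        errors = 0
        for features, label in corpus:
            if classify(features) != label:
                train(features, label)           # only mistakes grow the database
                errors += 1
        if errors == 0:                          # nothing left to learn this pass
            break
```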

Slide 16: Issues with Standard KNNs
● We found that the “bad” KNNs mimic Cover and Hart behavior – they insert basically everything into a bloated database, sometimes more than once!
● The more accurate KNNs inserted fewer examples into their database.

Slide 17: How do we compare KNNs?
● Use the TREC 2005 SA dataset.
● 10-fold validation – train on 90%, test on 10%, repeat for each successive 10% (but remember to clear memory!).
● Run 5 passes (to find the asymptote).
● Compare against the OSB Markovian tested at TREC.
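A sketch of that evaluation protocol, assuming a hypothetical `run_fold` helper that builds a fresh classifier on the training split and returns accuracy on the test split:

```python
def ten_fold_accuracy(messages, run_fold, folds=10):
    """Hold out each successive 10% for testing; train fresh on the other 90%."""
    fold_size = len(messages) // folds
    accuracies = []
    for i in range(folds):
        test = messages[i * fold_size:(i + 1) * fold_size]
        train = messages[:i * fold_size] + messages[(i + 1) * fold_size:]
        # run_fold builds a brand-new classifier each time ("remember to clear memory!")
        accuracies.append(run_fold(train, test))
    return sum(accuracies) / len(accuracies)
```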

Slide 18: What do we use as features?
● Use the OSB feature set. This combines nearby words into short phrases; the phrases are what get matched.
● Example: “this is an example” yields “this is”, “this an”, “this example”.
● These features are the measurements we classify against.
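A simplified sketch of that feature generation, following the slide's example: each word is paired with each of the next few words inside a small window. (The window size here is an assumption, and the real OSB feature set, as I understand it, also records how many words were skipped between the pair, which this sketch omits.)

```python
def osb_features(text, window=4):
    """Pair each word with each of the following words inside a small window."""
    words = text.split()
    features = []
    for i in range(len(words)):
        for j in range(i + 1, min(i + window, len(words))):
            features.append((words[i], words[j]))
    return features

# "this is an example" -> ('this', 'is'), ('this', 'an'), ('this', 'example'), ...
print(osb_features("this is an example"))
```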

Slide 19: Test 1 – Equal-Weight Voting KNN with K = 3, 7, and 21
Asymptotic accuracy: 93%, 93%, and 94%
(good accuracy 98%, spam accuracy 80% for K = 3 and 7; 96% and 90% for K = 21)
Time: ~50–75 milliseconds/message

Slide 20: Test 2 – KNN Weighted by Hamming Distance^(-1/2), K = 7 and 21
Asymptotic accuracy: 94% and 92%
(good accuracy 98%, spam accuracy 85% for K = 7; 98% and 79% for K = 21)
Time: ~60 milliseconds/message

Slide 21: Test 3 – KNN Weighted by Hamming Distance^(-1/2), K = |corpus|
Asymptotic accuracy: 97.8%
Good accuracy: 98.2%; spam accuracy: 96.9%
Time: 32 msec/message

Slide 22: Test 4 – Weight by an N-dimensional radiation model (a.k.a. “Hyperspace”)
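The slide does not give the formula, so the sketch below is only one plausible reading of an “N-dimensional radiation model”: every training document is treated as a point light source whose brightness at the unknown document grows with the features they share and falls off with a power d of the feature-space distance. The weight formula is an assumption for illustration; the exact weighting is in the CRM114 source.

```python
import math

def radiance(unknown, example, d=2):
    """Hypothetical 'radiation' weight (an assumption, not the exact CRM114 formula):
    brightness grows with shared features, falls off with distance**d."""
    shared = len(unknown & example)
    distance = math.sqrt(len(unknown ^ example))   # distance between 0/1 feature vectors
    if distance == 0:
        return float("inf")                        # an identical training document dominates
    return (shared ** 2) / (distance ** d)

def hyperspace_classify(unknown, known, d=2):
    """K = |corpus|: every training document contributes its radiance to its class."""
    scores = {}
    for features, label in known:
        scores[label] = scores.get(label, 0.0) + radiance(unknown, features, d)
    return max(scores, key=scores.get)
```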

Slide 23: Test 4 – Hyperspace weighting, K = |corpus|, d = 1, 2, 3
Asymptotic accuracy: 99.3%
Good accuracy: 99.64%, 99.66%, and 99.59%; spam accuracy: 98.7%, 98.4%, and 98.5%
Time: 32, 22, and 22 milliseconds/message

Slide 24: Test 5 – Compare vs. Markov OSB (thin threshold)
Asymptotic accuracy: 99.1%
Good accuracy: 99.6%; spam accuracy: 97.9%
Time: 31 msec/message

Slide 25: Test 6 – Compare vs. Markov OSB (thick threshold = 10.0 pR)
● Thick threshold means:
  – Test the message first.
  – If it is wrong, train it.
  – If it was right, but only by less than the threshold thickness, train it anyway!
● 10.0 pR units is roughly the range between 10% and 90% certainty.
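A sketch of the thick-threshold rule just described, assuming a hypothetical `score` callable that returns a signed pR-style score (positive means “good”, negative means “spam”) and a `train` callable that updates the classifier:

```python
def thick_threshold_train(corpus, score, train, thickness=10.0):
    """Train on every error, and also on every correct decision that fell
    inside the +/- thickness band around the decision boundary."""
    for features, label in corpus:
        s = score(features)                      # signed pR-style score: > 0 means "good"
        predicted = "good" if s > 0 else "spam"
        if predicted != label or abs(s) < thickness:
            train(features, label)               # wrong, or right but not by enough
```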

Slide 26: Test 6 – Compare vs. Markov OSB (thick threshold = 10.0 pR)
Asymptotic accuracy: 99.5%
Good accuracy: 99.6%; spam accuracy: 99.3%
Time: 19 msec/message

Slide 27: Conclusions
● Small-K KNNs are not very good for sorting spam.

Slide 28: Conclusions
● Small-K KNNs are not very good for sorting spam.
● K = |corpus| KNNs with distance weighting are reasonable.

Slide 29: Conclusions
● Small-K KNNs are not very good for sorting spam.
● K = |corpus| KNNs with distance weighting are reasonable.
● K = |corpus| KNNs with hyperspace weighting are pretty good.

Slide 30: Conclusions
● Small-K KNNs are not very good for sorting spam.
● K = |corpus| KNNs with distance weighting are reasonable.
● K = |corpus| KNNs with hyperspace weighting are pretty good.
● But thick-threshold-trained Markovs seem to be more accurate, especially in single-pass training.

Slide 31: Thank you! Questions?
Full source is available (licensed under the GPL).