
1 A Fast and Scalable Nearest Neighbor Based Classification Taufik Abidin and William Perrizo Department of Computer Science North Dakota State University

2 Outline
- Nearest Neighbor Classification and Its Problems
- SMART-TV (SMall Absolute diffeRence of ToTal Variation): A Fast and Scalable Nearest Neighbor Classification Algorithm
- SMART-TV in Image Classification

3 Classification
Given a (large) TRAINING SET, R(A1, ..., An, C), where C is the class label and (A1, ..., An) are the features, the classification task is to label unclassified objects based on the pre-defined class labels of the objects in the training set.
Prominent classification algorithms: SVM, KNN, Bayesian, etc.
KNN workflow: Training Set + Unclassified Object -> search for the k-nearest neighbors -> vote the class.

4 Problems with KNN
Finding the k-nearest neighbors is expensive when the training set contains millions of objects (very large training sets). The classification time is linear in the size of the training set. Can we make it faster and scalable?
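To make that cost concrete, here is a minimal brute-force k-NN classifier in Python (an illustrative sketch, not code from the slides); every query scans the entire training set, so classification time grows linearly with its size:

```python
import numpy as np
from collections import Counter

def knn_classify(train_X, train_y, x, k=5):
    """Classify x by majority vote among its k nearest training objects.

    Every call scans all of train_X, so the cost per query is O(n * d)
    for n training objects with d features -- the bottleneck described
    above for multi-million-record training sets.
    """
    dists = np.linalg.norm(train_X - x, axis=1)   # distance to every object
    nearest = np.argsort(dists)[:k]               # indices of k smallest
    return Counter(train_y[nearest]).most_common(1)[0][0]
```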

5 P-Tree Vertical Data Structure
The construction steps of P-trees:
1. Convert the data into binary
2. Vertically project each attribute
3. Vertically project each bit position
4. Compress each bit slice into a P-tree
Example: the relation R(A1, A2, A3, A4) with tuples (2,7,6,1), (6,7,6,0), (2,7,5,1), (2,7,5,7), (5,2,1,4), (2,2,1,5), (7,0,1,4) is vertically projected into R[A1], ..., R[A4]; each attribute value is written in 3-bit binary; each bit position forms a bit slice (R11, ..., R43); and each slice is compressed into a P-tree (P11, ..., P43).
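A real P-tree additionally compresses each bit slice into a quadrant tree; the sketch below (illustrative, not the authors' implementation) keeps only the flat vertical bit slices, which is enough to show the root-count (RC) operation the later slides rely on:

```python
import numpy as np

def bit_slices(values, bit_width=3):
    """Vertically decompose a column of unsigned ints into bit slices.

    Slice b holds, for every row, bit b of that row's value.  A real
    P-tree would compress each slice into a quadrant tree; this sketch
    keeps flat bit vectors, which suffices to illustrate root counts.
    """
    values = np.asarray(values, dtype=np.uint64)
    return {b: (values >> b) & 1 for b in range(bit_width)}

def root_count(slice_bits):
    """RC of a slice: the number of 1-bits (the count at the tree root)."""
    return int(slice_bits.sum())

# The A1 column of the example relation R(A1, A2, A3, A4) above:
a1 = [2, 6, 2, 2, 5, 2, 7]
slices = bit_slices(a1, bit_width=3)
# The column sum is recovered from root counts alone: sum_b 2^b * RC(P_b)
total = sum((1 << b) * root_count(s) for b, s in slices.items())
assert total == sum(a1)  # 26
```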

6 Total Variation
The total variation of a set X about the mean, TV(X, mu), measures the total squared separation of the objects in X about mu, defined as follows:

\[ TV(X, \mu) \;=\; \sum_{x \in X} (x - \mu) \cdot (x - \mu) \;=\; \sum_{x \in X} \lVert x - \mu \rVert^{2} \]

(Figure: TV plotted as a function of the reference point over a 5 x 5 grid; in the example the mean coincides with the point x33, so TV(X, mu) = TV(X, x33).)

7 Total Variation (Cont.)
Example with the 2-bit values 10 (binary) = 2 and 11 (binary) = 3: summing per object gives (2^1 x 1 + 2^0 x 0) + (2^1 x 1 + 2^0 x 1) = 5, while summing per bit slice with root counts gives 2^1 x 2 + 2^0 x 1 = 5, the same result.
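Read generally, the example says that the sum of an attribute over a set X can be obtained from root counts alone (our formulation of the slide's arithmetic, in the RC notation used later):

\[ \sum_{x \in X} x_i \;=\; \sum_{b=0}^{b_w - 1} 2^{b} \, \mathrm{RC}\!\left(P_{i,b}\right), \]

where \(b_w\) is the bit-width and \(P_{i,b}\) is the P-tree for bit \(b\) of attribute \(A_i\).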

8 Total Variation (Cont.)
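Expanding TV about an arbitrary point a (a standard identity, consistent with the root-count discussion on slide 11):

\[ TV(X, a) \;=\; \sum_{x \in X} \lVert x - a \rVert^{2} \;=\; \sum_{x \in X} \lVert x \rVert^{2} \;-\; 2\, a \cdot \sum_{x \in X} x \;+\; |X|\, \lVert a \rVert^{2}. \]

The first two sums depend only on X, and can be computed from root counts of the bit slices and of their pairwise ANDs; the dependence on a is then a cheap closed form.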


11 The Independence of RC
- The root-count operations are independent of the point about which the total variation is measured, which allows us to run the operations once in advance and retain the count results.
- In a classification task, the set of classes is known and fixed. Thus, the total variation of an object about its class can be pre-computed.
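A minimal sketch of this pre-computation in Python (illustrative names; in SMART-TV the two cached sums would come from P-tree root counts rather than from raw arrays):

```python
import numpy as np

class ClassTV:
    """Cache the a-independent statistics of a class X once; afterwards
    evaluate TV(X, a) for any point a in O(d), using
    TV(X, a) = sum||x||^2 - 2 a . sum(x) + n ||a||^2."""

    def __init__(self, X):
        X = np.asarray(X, dtype=float)
        self.n = len(X)
        self.sum_x = X.sum(axis=0)      # obtainable from slice root counts
        self.sum_sq = (X ** 2).sum()    # obtainable from RCs of slice ANDs

    def tv(self, a):
        a = np.asarray(a, dtype=float)
        return self.sum_sq - 2 * (a @ self.sum_x) + self.n * (a @ a)
```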

12 Overview of SMART-TV
Preprocessing phase: Large Training Set -> Compute Root Count -> Measure TV of each object -> Store the root count and TV values.
Classifying phase: Unclassified Object -> Approximate the candidate set of NNs -> Search the k-nearest neighbors within the candidate set -> Vote.

13 Preprocessing Phase
1. Compute the root counts of each class Cj, 1 <= j <= number of classes, and store the results. Complexity: O(k d b^2), where k is the number of classes, d is the number of dimensions, and b is the bit-width (the b^2 factor reflecting the pairwise bit-slice products needed for the squared term).
2. Compute the total variation of each object about its class, 1 <= j <= number of classes, and retain the results. Complexity: O(n), where n is the cardinality of the training set. A sketch of the whole phase follows below.
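A sketch of this phase, reusing the ClassTV helper above (the per-class training-set layout is our assumption):

```python
import numpy as np

def preprocess(training_set):
    """training_set: dict mapping class label -> feature matrix X.

    Returns, per class, the cached a-independent statistics and the
    precomputed TV of every object about its own class -- the two
    stored quantities named on slide 12.
    """
    classes = {}
    for label, X in training_set.items():
        class_tv = ClassTV(X)                              # root-count stats
        tv_values = np.array([class_tv.tv(x) for x in X])  # TV of each object
        classes[label] = (X, tv_values, class_tv)
    return classes
```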

14 Classifying Phase
Stored root-count and TV values + Unclassified Object -> Approximate the candidate set of NNs -> Search the k-nearest neighbors from the candidate set -> Vote.

15 Classifying Phase
1. For each class Cj with nj objects, 1 <= j <= number of classes, do the following:
   a. Compute TV(Cj, x), where x is the unclassified object.
   b. Find the hs objects in Cj whose total variation is closest to the total variation of x about Cj, i.e., the objects xi with the smallest absolute differences |TV(Cj, xi) - TV(Cj, x)|. Let A be the array holding these objects.
   c. Store all objectIDs in A into TVGapList.

16 Classifying Phase (Cont.)
2. For each objectID t, 1 <= t <= Len(TVGapList), where Len(TVGapList) equals hs times the total number of classes, retrieve the corresponding object features from the training set, measure the pairwise Euclidean distance between x and xt, and determine the k nearest neighbors of x.
3. Vote the class label for x using the k nearest neighbors.
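Putting steps 1-3 together, a compact sketch of the classifying phase (hypothetical data layout, matching the preprocess sketch above):

```python
import numpy as np
from collections import Counter

def smart_tv_classify(x, classes, hs=25, k=5):
    """classes: dict label -> (X, tv_values, class_tv), where X is the
    class's feature matrix, tv_values[i] = TV(C, X[i]) is precomputed,
    and class_tv evaluates TV(C, a) for a new point (see ClassTV)."""
    cand_X, cand_y = [], []
    for label, (X, tv_values, class_tv) in classes.items():
        gap = np.abs(tv_values - class_tv.tv(x))   # |TV(C, xi) - TV(C, x)|
        for i in np.argsort(gap)[:hs]:             # hs smallest TV gaps
            cand_X.append(X[i])
            cand_y.append(label)
    cand_X = np.asarray(cand_X)
    # Exact distances, but only over hs * (number of classes) candidates:
    dists = np.linalg.norm(cand_X - x, axis=1)
    nearest = np.argsort(dists)[:k]
    return Counter(cand_y[i] for i in nearest).most_common(1)[0][0]
```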

17 Dataset
1. KDDCUP-99 dataset (network intrusion dataset)
– 4.8 million records, 32 numerical attributes
– 6 classes, each containing >10,000 records
– Class distribution:
    Normal        972,780
    IP sweep       12,481
    Neptune     1,072,017
    Port sweep     10,413
    Satan          15,892
    Smurf       2,807,886
– Testing set: 120 records, 20 per class
– 4 synthetic datasets (randomly generated):
    - 10,000 records (SS-I)
    - 100,000 records (SS-II)
    - 1,000,000 records (SS-III)
    - 2,000,000 records (SS-IV)

18 Dataset (Cont.)
2. OPTICS dataset
– 8,000 points, 8 classes (CL-1, CL-2, ..., CL-8)
– 2 numerical attributes
– Training set: 7,920 points
– Testing set: 80 points, 10 per class

19 Dataset (Cont.)
3. IRIS dataset
– 150 samples
– 3 classes (iris-setosa, iris-versicolor, and iris-virginica)
– 4 numerical attributes
– Training set: 120 samples
– Testing set: 30 samples, 10 per class

20 Speed and Scalability
Speed and scalability comparison (k=5, hs=25), by training-set cardinality (x 1000):

Algorithm    10     100    1000    2000    4891
SMART-TV    0.14    0.33    2.01    3.88    9.27
P-KNN       0.89    1.06    3.94   12.44   30.79
KNN         0.39    2.34   23.47   49.28      NA

Machine used: Intel Pentium 4 CPU 2.6 GHz, 3.8 GB RAM, running Red Hat Linux

21 Classification Accuracy (Cont.)
Classification accuracy comparison (SS-III), k=5, hs=25:

Algorithm   Class      TP  FP     P     R     F
SMART-TV    normal     18   0  1.00  0.90  0.95
            ipsweep    20   1  0.95  1.00  0.98
            neptune    20   0  1.00  1.00  1.00
            portsweep  18   0  1.00  0.90  0.95
            satan      17   2  0.90  0.85  0.87
            smurf      20   4  0.83  1.00  0.91
P-KNN       normal     20   4  0.83  1.00  0.91
            ipsweep    20   1  0.95  1.00  0.98
            neptune    15   0  1.00  0.75  0.86
            portsweep  20   0  1.00  1.00  1.00
            satan      14   1  0.93  0.70  0.80
            smurf      20   5  0.80  1.00  0.89
KNN         normal     20   3  0.87  1.00  0.93
            ipsweep    20   1  0.95  1.00  0.98
            neptune    20   0  1.00  1.00  1.00
            portsweep  18   0  1.00  0.90  0.95
            satan      17   1  0.94  0.85  0.89
            smurf      20   0  1.00  1.00  1.00

22 Overall Accuracy
Overall classification accuracy comparison:

Dataset   SMART-TV  P-KNN   KNN
IRIS          0.97   0.71  0.97
OPTICS        0.96   0.99  0.97
SS-I          0.96   0.72  0.89
SS-II         0.92   0.91  0.97
SS-III        0.94   0.91  0.96
SS-IV         0.92   0.91  0.97
NI            0.93   0.91    NA

23 Outline
- Nearest Neighbor Classification and Its Problems
- SMART-TV (SMall Absolute diffeRence of ToTal Variation): A Fast and Scalable Nearest Neighbor Classification Algorithm
- SMART-TV in Image Classification

24 Image Preprocessing
We extracted color and texture features from the original pixels of the images.
Color features: we used the HSV color space and quantized the images into 54 bins, i.e., (6 x 3 x 3) bins.
Texture features: we used multi-resolution Gabor filters with two scales and four orientations (see B.S. Manjunath, IEEE Trans. on Pattern Analysis and Machine Intelligence, 1996).
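A rough sketch of such a feature pipeline (scikit-image used as a stand-in; the slides name no library, and the exact quantization order and Gabor parameters are assumptions):

```python
import numpy as np
from skimage.color import rgb2hsv
from skimage.filters import gabor

def color_features(rgb_img):
    """6 x 3 x 3 = 54-bin HSV histogram, as on the slide."""
    hsv = rgb2hsv(rgb_img)                           # channels in [0, 1]
    h = np.minimum((hsv[..., 0] * 6).astype(int), 5)
    s = np.minimum((hsv[..., 1] * 3).astype(int), 2)
    v = np.minimum((hsv[..., 2] * 3).astype(int), 2)
    bins = (h * 9 + s * 3 + v).ravel()               # joint bin index 0..53
    return np.bincount(bins, minlength=54) / bins.size

def texture_features(gray_img, scales=(0.1, 0.3), n_orient=4):
    """2 scales x 4 orientations of Gabor filters; taking the mean and
    std of each response magnitude yields 16 attributes (this pairing
    into 16 is our assumption)."""
    feats = []
    for freq in scales:
        for i in range(n_orient):
            real, imag = gabor(gray_img, frequency=freq,
                               theta=i * np.pi / n_orient)
            mag = np.hypot(real, imag)
            feats += [mag.mean(), mag.std()]
    return np.asarray(feats)
```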

25 Image Dataset
Corel images (http://wang.ist.psu.edu/docs/related)
– 10 categories; originally, each category has 100 images
– Number of feature attributes: 54 from color features, 16 from texture features
– We randomly generated several larger datasets to evaluate the speed and scalability of the algorithms
– Testing set: 50 images, 5 per category

26 Image Dataset

27 Example on Corel Dataset

28 Results
Per-class accuracy on the Corel dataset:

        SMART-TV                                                 KNN
        k=3                 k=5                 k=7
Class   hs=15 hs=25 hs=35   hs=15 hs=25 hs=35   hs=15 hs=25 hs=35   k=3   k=5   k=7
C1      0.69  0.72  0.75    0.74  0.73  0.78    0.78  0.81  0.78    0.77  0.77  0.79
C2      0.64  0.60  0.59    0.62  0.62  0.62    0.62  0.68  0.64    0.63  0.66  0.73
C3      0.59  0.60  0.65    0.67  0.68  0.62    0.60  0.68  0.76    0.57  0.70  0.68
C4      0.73  0.81  0.79    0.78  0.84  0.79    0.74  0.84  0.87    0.87  0.90  0.88
C5      0.90  0.91  0.92    0.88  0.92  0.92    0.92  0.92  0.93    0.89  0.90  0.94
C6      0.61  0.68  0.70    0.64  0.74  0.66    0.61  0.71  0.72    0.59  0.62  0.68
C7      0.89  0.89  0.92    0.85  0.91  0.93    0.87  0.90  0.92    0.92  0.92  0.94
C8      0.94  0.91  0.93    0.93  0.93  0.93    0.93  0.93  0.93    0.93  0.93  0.96
C9      0.64  0.52  0.57    0.43  0.60  0.71    0.45  0.54  0.72    0.62  0.71  0.54
C10     0.71  0.79  0.79    0.76  0.77  0.77    0.77  0.78  0.79    0.82  0.75  0.78

29 Results: Classification Time (figure)

30 Results: Preprocessing Time (figure)

31 Summary
- A nearest-neighbor-based classification algorithm that begins its classification steps by approximating a candidate set of nearest neighbors.
- The absolute difference of total variation between data points in the training set and the unclassified point is used to approximate the candidates.
- The algorithm is fast, and it scales well to very large datasets. Its classification accuracy is comparable to that of the KNN algorithm.

