A Fast and Scalable Nearest Neighbor Based Classification Taufik Abidin and William Perrizo Department of Computer Science North Dakota State University.

Presentation transcript:

A Fast and Scalable Nearest Neighbor Based Classification Taufik Abidin and William Perrizo Department of Computer Science North Dakota State University

Given a (large) TRAINING SET, R(A1,…,An, C), with C = CLASSES and {A1,…,An} = FEATURES, Classification is: labeling unclassified objects based on the training set.
kNN classification goes as follows: search the training set for the k-Nearest Neighbors of the unclassified object, then vote the class.
(Slide diagram: Training Set → Search for the k-Nearest Neighbors → Vote the class → Classification of the Unclassified Object.)
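For contrast with the vertical approach developed later, here is a minimal sketch of plain horizontal kNN voting (Euclidean distance assumed; the function and variable names are illustrative, not the authors' implementation):

```python
import numpy as np
from collections import Counter

def knn_classify(train_X, train_y, sample, k=5):
    """Classify one unclassified sample by majority vote of its k nearest
    training points (Euclidean distance); requires a full scan of train_X."""
    dists = np.linalg.norm(train_X - sample, axis=1)   # distance to every training point
    nearest = np.argsort(dists)[:k]                    # indices of the k nearest neighbors
    votes = Counter(train_y[i] for i in nearest)       # tally the class labels
    return votes.most_common(1)[0][0]
```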

Database analysis can be broken down into 2 areas: Querying and Data Mining. Data Mining can be broken down into 2 areas: Machine Learning and Association Rule Mining. Machine Learning can be broken down into 2 areas: Clustering and Classification. Clustering can be broken down into 2 areas: Isotropic (round clusters) and Density-based.

Machine Learning usually begins by identifying Near Neighbor Set(s), NNS. In Isotropic Clustering, one identifies round sets (disk-shaped NNSs about a center). In Density Clustering, one identifies cores (dense round NNSs) and then pieces them together. In any Classification based on continuity, we classify a sample based on its NNS class histogram (aka kNN), or we identify isotropic NNSs of centroids (k-means), or we build decision trees with training leaf sets and use them to classify samples that fall to a leaf, or we find class boundaries (e.g., SVM) which distinguish NNSs in one class from NNSs in another.

The basic definition of continuity from elementary calculus shows that NNSs are fundamental: ∀ε>0 ∃δ>0 such that d(x,a)<δ ⟹ d(f(x),f(a))<ε, i.e., for every NNS about f(a) there is an NNS about a that maps inside it. So NNS Search is a fundamental problem to be solved. We discuss NNS Search from a vertical data point of view. With vertically structured data, the only neighborhoods that are easily determined are the cubic or Max neighborhoods (L∞ disks), yet usually we want Euclidean disks. We develop techniques to circumscribe Euclidean disks using intersections of contour sets; the main ones are coordinate projection contours, whose intersections form L∞ disks.

SOME useful NNSs. Given a similarity s: R×R → Reals (e.g., s(x,y)=s(y,x) and s(x,x) ≥ s(x,y) ∀x,y ∈ R), an extension of s to disjoint subsets of R (e.g., single/complete/average link, ...), and C ⊆ R:
disk(C,r) ≡ {x ∈ R | s(x,C) ≥ r}, skin(C,r) ≡ disk(C,r) − C,
ring(C,r2,r1) ≡ disk(C,r2) − disk(C,r1) = skin(C,r2) − skin(C,r1).
Given a [pseudo] distance, d, rather than a similarity, just reverse all the inequalities.
A k-disk of C is: disk(C,k) ≡ C ∪ C' such that |C'| = k and s(x,C) ≥ s(y,C) ∀x ∈ disk(C,k), y ∉ disk(C,k). Define its skin(C,k) ≡ disk(C,k) − C ("skin" stands for "s k immediate neighbors" and is a kNNS of C); cskin(C,k) ≡ the union of all skin(C,k)s, the closed skin; and ring(C,k) = cskin(C,k) − cskin(C,k−1).
(Slide figure: disks and rings of radii r1 and r2 about a single point, C = {a}, and about a general set C.)
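A small sketch of the disk/skin/ring definitions, using a Euclidean distance d rather than a similarity (so the inequalities are reversed, as noted above) and the single-link extension to sets; names are illustrative:

```python
import numpy as np

def _dist_to_set(R, C):
    """Single-link extension: d(x, C) = min over c in C of |x - c|, for every row x of R."""
    return np.min(np.linalg.norm(R[:, None, :] - C[None, :, :], axis=2), axis=1)

def disk(R, C, r):
    """All points of R within distance r of the set C."""
    return R[_dist_to_set(R, C) <= r]

def skin(R, C, r):
    """disk(C, r) minus C itself (here: points at strictly positive distance from C)."""
    d = _dist_to_set(R, C)
    return R[(d <= r) & (d > 0)]

def ring(R, C, r2, r1):
    """disk(C, r2) - disk(C, r1), assuming r1 < r2."""
    d = _dist_to_set(R, C)
    return R[(d <= r2) & (d > r1)]
```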

A definition of Predicate trees (P-trees) based on functionals: Given f: R(A1..An) → Y and S ⊆ Y, define the uncompressed Functional-P-tree as P_{f,S} ≡ a bit map given by P_{f,S}(x) = 1 iff f(x) ∈ S. The predicate for P_{f,S} is the set-containment predicate, f(x) ∈ S. P_{f,S} is a Contour bit map (it bitmaps, rather than lists, the contour points).
If f is a local density (à la OPTICS) and {S_k} is a partition of Y, then {f⁻¹(S_k)} is a clustering! What partition {S_k} of Y should we use? (A binary partition given by a threshold value?) In OPTICS the S_k's are the intervals between crossing points of graph(f) and a threshold line; points below the threshold line are agglomerated into one noise cluster. Weather reporters use equi-width interval partitions (of barometric pressure or temperature, etc.).
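For illustration only, a contour bit map P_{f,S} is simply the boolean image of the predicate f(x) ∈ S over the walked relation. A minimal sketch, assuming R is held as a NumPy array, f is any row functional, and S is either a finite value set or an (lo, hi) interval:

```python
import numpy as np

def contour_bitmap(R, f, S):
    """Uncompressed functional P-tree P_{f,S}: bit i is 1 iff f(R[i]) is in S."""
    vals = np.apply_along_axis(f, 1, R)       # evaluate the functional on every row
    if isinstance(S, tuple):                  # S given as an interval [lo, hi]
        lo, hi = S
        return (vals >= lo) & (vals <= hi)
    return np.isin(vals, list(S))             # S given as a finite set of values
```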

Compressed Functional-P-trees (with equi-width leaf size, ls): (ls)P_{f,S} is a compression of P_{f,S} obtained by doing the following:
1. order or walk R (this converts the bit map to a bit vector);
2. equi-width partition R into segments of size ls (ls = leaf size; the last one can be short);
3. eliminate, and mask to 0, all pure-zero segments (via a Leaf Mask or LM);
4. eliminate, and mask to 1, all pure-one segments (via a Pure1 Mask or PM).
Notes: 1. LM is an existential aggregation of R (1 iff that leaf has a 1-bit); other aggregations are possible (default = existential). 2. There are partitionings other than equi-width (but equi-width will be the default).
Doubly Compressed Functional-P-trees with equi-width leaf sizes (ls1, ls2): each leaf of (ls1)P_{f,S} is an uncompressed bit vector and can be compressed the same way, giving (ls1,ls2)P_{f,S} (ls2 is the 2nd equi-width segment size and ls2 << ls1). Recursive compression can continue ad infinitum: (ls1,ls2,ls3)P_{f,S}, (ls1,ls2,ls3,ls4)P_{f,S}, ...
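A sketch of the single-level compression step just described, under the stated defaults (equi-width leaves, existential leaf mask); the helper name and return layout are assumptions, not the authors' storage format:

```python
import numpy as np

def compress_ptree(bits, ls):
    """One level of (ls)P_{f,S} compression of an uncompressed (walked) bit vector."""
    leaves = [bits[i:i + ls] for i in range(0, len(bits), ls)]   # last leaf may be short
    LM = np.array([leaf.any() for leaf in leaves], dtype=bool)   # Leaf Mask (existential)
    PM = np.array([leaf.all() for leaf in leaves], dtype=bool)   # Pure1 Mask
    # pure-zero leaves (LM=0) and pure-one leaves (PM=1) are eliminated;
    # only mixed leaves need to be stored explicitly
    mixed = {i: leaf for i, leaf in enumerate(leaves) if leaf.any() and not leaf.all()}
    return LM, PM, mixed
```

Recursing the same step on each stored mixed leaf (with a much smaller leaf size ls2) gives the doubly compressed (ls1,ls2)P_{f,S}, and so on.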

BASIC P-trees. For A_i real and f_{i,j}(x) ≡ the j-th bit of the i-th component x_i: { (*)P_{f_{i,j},{1}} ≡ (*)P_{i,j} }, j = b..0, are the basic (*)P-trees of A_i (* = ls1,...,ls_k, k = 0,...). For A_i categorical, with f_{i,a}(x) = 1 if x_i = a ∈ R[A_i], else 0: { (*)P_{f_{i,a},{1}} ≡ (*)P_{i,a} }, a ∈ R[A_i], are the basic (*)P-trees of A_i. For A_i real, the basic P-trees result from binary encoding of the individual real numbers; encodings can be used for any attribute. Note that it is the binary encoding of real attributes which turns an n-tuple scan into a Log2(n)-column AND (making P-tree technology scalable).
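A sketch of deriving the basic P-trees of one attribute as uncompressed bit slices (numeric attributes by binary encoding, categorical attributes by value bitmaps); function names are illustrative, and values are assumed to be nonnegative integers with at most b+1 bits:

```python
import numpy as np

def basic_ptrees_numeric(column, b):
    """Return {j: bit-slice j of the column} for j = b..0 (most significant first)."""
    col = np.asarray(column, dtype=np.int64)
    return {j: ((col >> j) & 1).astype(bool) for j in range(b, -1, -1)}

def basic_ptrees_categorical(column):
    """Return {a: bitmap of (column == a)} for each category a in R[A_i]."""
    col = np.asarray(column)
    return {a: (col == a) for a in np.unique(col)}
```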

Problems with kNN: finding the k-Nearest Neighbor Set from horizontally structured data (record-oriented data) can be expensive for large training sets (containing millions or trillions of tuples):
– linear in the size of the training set (1 scan);
– Closed kNN is much more accurate but requires 2 scans.
Vertically structuring the data can help.

Vertical Predicate-tree (P-tree) structuring: a data table, R(A1..An), containing horizontal structures (records) is vertically partitioned; each vertical bit slice is compressed into a basic P-tree; and the P-trees are processed with multi-operand logical ANDs (vertical scans instead of horizontal record scans).
E.g., for R(A1 A2 A3 A4), project onto each attribute, binary-encode, and split into bit slices R11, R12, R13, R21, ..., R43 (from R[A1], R[A2], R[A3], R[A4]). The basic (1-D) P-tree for R11 is built by recording the truth of the predicate "pure 1" recursively on halves, until purity is reached: Whole file is pure1? false → 0; 1st half pure1? false → 0; 2nd half pure1? false → 0; 1st half of the 2nd half pure1? false → 0; 2nd half of the 2nd half pure1? true → 1; and so on level by level. A half that is pure (pure1 or pure0) ends its branch.
To count the occurrences of a particular tuple value, AND the basic P-trees of its 1-bits with the complements of those of its 0-bits, e.g. P11 ∧ P12 ∧ P13 ∧ P'21 ∧ P'22 ∧ P'23 ∧ P'31 ∧ P'32 ∧ P33 ∧ P41 ∧ P'42 ∧ P'43, accumulating root counts level by level.
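A minimal sketch (hypothetical names, uncompressed boolean vectors standing in for stored P-trees) of the recursive "pure 1 on halves" construction for one bit slice, and of counting matching tuples by ANDing slices rather than scanning records:

```python
import numpy as np

def build_ptree(bits):
    """Record the 'pure 1' predicate recursively on halves of a bit vector.
    Returns True/False for a pure half, else a (left_subtree, right_subtree) pair."""
    if bits.all():
        return True                      # pure-1: branch ends with 1
    if not bits.any():
        return False                     # pure-0: branch ends with 0
    mid = len(bits) // 2
    return (build_ptree(bits[:mid]), build_ptree(bits[mid:]))

def count_pattern(bit_slices, pattern):
    """Root count of tuples matching a bit pattern, e.g.
    pattern = {'P11': 1, 'P12': 1, 'P13': 1, 'P21': 0, ...},
    by a multi-operand AND of slices (complemented where the wanted bit is 0)."""
    mask = np.ones_like(next(iter(bit_slices.values())), dtype=bool)
    for name, wanted in pattern.items():
        mask &= bit_slices[name] if wanted else ~bit_slices[name]
    return int(mask.sum())
```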

Total Variation. The Total Variation of a set X about a point a, TV(a), is the sum of the squared separations of the objects in X from a: TV(a) = Σ_{x∈X} (x−a)∘(x−a). We will use the concept of functional contours (in particular, the TV contours) in this presentation to identify a well-pruned, small superset of the Nearest Neighbor Set of an unclassified sample (which can then be efficiently scanned). First we will discuss functional contours in general, then consider the specific TV contours.

Given f: R(A1..An) → Y and S ⊆ Y, define contour(f,S) ≡ f⁻¹(S). From the derived-attribute point of view, Contour(f,S) = SELECT A1..An FROM R* WHERE R*.A_f ∈ S. If S = {a}, f⁻¹({a}) is Isobar(f, a). There is a DUALITY between functions, f: R(A1..An) → Y, and derived attributes, A_f of R, given by x.A_f ≡ f(x), where Dom(A_f) = Y.
(Slide figures: the relation R(A1,...,An) mapped by f into Y; the extended relation R*(A1,...,An,A_f) with x.A_f = f(x); and, in A1×...×An×Y space, graph(f) = {(a1,...,an, f(a1,...,an)) | (a1,...,an) ∈ R} with contour(f,S) beneath it.)

TV(a) = Σ_{x∈R} (x−a)∘(x−a). Using d as an index variable over the dimensions and i, j, k as bit-slice indexes:
TV(a) = Σ_{x∈R} Σ_{d=1..n} (x_d² − 2 a_d x_d + a_d²)
      = Σ_x Σ_d (Σ_i 2^i x_{d,i})(Σ_j 2^j x_{d,j}) − 2 Σ_x Σ_d a_d (Σ_k 2^k x_{d,k}) + |R| |a|²
      = Σ_{x,d,i,j} 2^{i+j} x_{d,i} x_{d,j} − 2 Σ_{x,d,k} 2^k a_d x_{d,k} + |R| |a|²
TV(a) = Σ_{i,j,d} 2^{i+j} |P_{d,i ∧ d,j}| − Σ_k 2^{k+1} Σ_d a_d |P_{d,k}| + |R| |a|²
Equivalently, since Σ_x x_d = |R| μ_d,
TV(a) = Σ_{x,d,i,j} 2^{i+j} x_{d,i} x_{d,j} + |R| ( −2 Σ_d a_d μ_d + Σ_d a_d a_d ).
The first term does not depend upon a. Thus the simpler derived attribute, TV − TV(μ) (which does not have that 1st term at all), has contours identical to those of TV (just a lowered graph). We also find it useful to post-compose a log to reduce the number of bit slices. The resulting functional is called the High-Dimension-ready Total Variation, HDTV(a).
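A sketch (NumPy, with uncompressed bit slices standing in for P-trees and root counts standing in for P-tree root counts) of evaluating the last formula, TV(a) = Σ_{i,j,d} 2^{i+j} |P_{d,i ∧ d,j}| − Σ_k 2^{k+1} Σ_d a_d |P_{d,k}| + |R||a|². The pairwise AND counts are precomputed once; evaluating TV(a) for a new a then requires no data scan. Names and the `slices[d][i]` layout are assumptions:

```python
import numpy as np

def precompute_root_counts(slices):
    """slices[d][i] = boolean bit-slice i of dimension d (d = 0..n-1).
    Returns rc1[d][k] = |P_dk| and rc2[d][(i,j)] = |P_di AND P_dj|."""
    rc1, rc2 = {}, {}
    for d, dim_slices in slices.items():
        rc1[d] = {k: int(s.sum()) for k, s in dim_slices.items()}
        rc2[d] = {(i, j): int((si & sj).sum())
                  for i, si in dim_slices.items() for j, sj in dim_slices.items()}
    return rc1, rc2

def total_variation(a, rc1, rc2, n_rows):
    """TV(a) from root counts only: sum_{d,i,j} 2^(i+j)|P_di^dj|
       - sum_k 2^(k+1) sum_d a_d |P_dk| + |R||a|^2."""
    term1 = sum((1 << (i + j)) * c for d in rc2 for (i, j), c in rc2[d].items())
    term2 = sum((1 << (k + 1)) * a[d] * c for d in rc1 for k, c in rc1[d].items())
    term3 = n_rows * float(np.dot(a, a))
    return term1 - term2 + term3
```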

From equation 7, f(a) ≡ TV(a) − TV(μ) = |R| ( −2 Σ_d (a_d μ_d − μ_d μ_d) + Σ_d (a_d a_d − μ_d μ_d) ) = |R| Σ_d (a_d − μ_d)² = |R| |a − μ|².
So f(μ) = 0, and letting g(a) ≡ HDTV(a) = ln( f(a) ) = ln|R| + ln|a − μ|².
Taking partial derivatives, ∂g/∂a_d (a) = 2(a_d − μ_d) / |a − μ|², so the gradient of g at a is ∇g(a) = 2(a − μ) / |a − μ|². The length of ∇g(a) depends only on the length of a − μ, so the isobars are hyper-circles centered at μ, and the graph of g is a log-shaped hyper-funnel.
For an ε-contour ring (radius ε about a), go inward and outward along a − μ by ε to the points: inner point b = μ + (1 − ε/|a − μ|)(a − μ) and outer point c = μ + (1 + ε/|a − μ|)(a − μ). Then take g(b) and g(c) as the lower and upper endpoints of a vertical interval, and use the EIN formulas on that interval to get a mask P-tree for the ε-contour (which is a well-pruned superset of the ε-neighborhood of a).
(Slide figure: the hyper-funnel graph of g over (x1, x2), with the ε-contour ring about a and the interval endpoints g(b), g(c).)
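A small sketch of this geometric step: given the mean μ, an unclassified sample a, and a ring radius ε, compute the inner and outer points b and c along a − μ and the HDTV interval [g(b), g(c)] used to build the contour mask. The helper name is hypothetical, and a ≠ μ is assumed:

```python
import numpy as np

def hdtv_interval(a, mu, eps, n_rows):
    """Inner/outer points b, c of the eps-ring about a (along a - mu) and the
    corresponding HDTV interval [g(b), g(c)], where g(x) = ln|R| + ln|x - mu|^2."""
    a, mu = np.asarray(a, float), np.asarray(mu, float)
    diff = a - mu
    dist = np.linalg.norm(diff)               # assumes a != mu
    b = mu + (1 - eps / dist) * diff          # inner point: eps closer to mu
    c = mu + (1 + eps / dist) * diff          # outer point: eps farther from mu
    def g(x):
        return np.log(n_rows) + np.log(np.dot(x - mu, x - mu))
    return b, c, g(b), g(c)
```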

As pre-processing, calculate the basic P-trees for the HDTV derived attribute. To classify a:
1. Calculate b and c (which depend upon a and ε).
2. Form the mask P-tree for training points with HDTV-values in [HDTV(b), HDTV(c)]. (Note: when the camera-ready paper was submitted we were still doing this step by sorting TV(a) values. Now we use the contour approach, which speeds up this step considerably. The performance evaluation graphs in this paper are still based on the old method, however.)
3. Use that P-tree to prune down to the candidate NNS.
4. If the root count of the candidate set is small enough, proceed to scan and assign class votes using, e.g., a Gaussian vote function; else prune further using a dimension projection.
If more pruning is needed (i.e., the HDTV(a) contour is still too big to scan), use a dimension-projection contour (the Dim-i projection P-trees are already computed: they are the basic P-trees of R.A_i). Form that contour mask P-tree and AND it with the HDTV contour P-tree; the result is a mask for the intersection.
We can also note that HDTV can be further simplified (retaining the same contour structure) by using h(a) = |a − μ|. Since we create the derived attribute by scanning the training set anyway, why not just use this very simple function? Then other functionals leap to mind, e.g., h_b(a) = |a − b|.
(Slide figure: the HDTV(x) funnel over (x1, x2) with the ε-contour about a, the values HDTV(b) and HDTV(c), and the contour of the dimension projection f(a) = a1.)
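Putting the steps together, a minimal sketch of the pruning-then-voting loop under stated assumptions: uncompressed boolean masks stand in for the mask P-trees, `hdtv_vals` holds the precomputed HDTV derived attribute for every training row, `train_X`/`train_y` are NumPy arrays, and the extra dimension-projection pruning is illustrated on attribute 0 only. All names are illustrative, not the authors' implementation:

```python
import numpy as np
from collections import defaultdict

def smart_tv_classify(a, mu, eps, hdtv_vals, train_X, train_y,
                      max_candidates=1000, sigma=1.0):
    """Prune with the HDTV contour ring, then vote among the surviving candidates."""
    n_rows = len(train_X)
    diff, dist = a - mu, np.linalg.norm(a - mu)
    b = mu + (1 - eps / dist) * diff                       # step 1: inner point
    c = mu + (1 + eps / dist) * diff                       #          outer point
    def g(x):                                              # HDTV functional
        return np.log(n_rows) + np.log(np.dot(x - mu, x - mu))
    lo, hi = sorted((g(b), g(c)))
    mask = (hdtv_vals >= lo) & (hdtv_vals <= hi)           # step 2: contour mask
    if mask.sum() > max_candidates:                        # optional extra pruning
        mask &= np.abs(train_X[:, 0] - a[0]) <= eps        # dimension-projection contour
    votes = defaultdict(float)
    for x, y in zip(train_X[mask], train_y[mask]):         # steps 3-4: scan candidates only
        d2 = float(np.dot(x - a, x - a))
        votes[y] += np.exp(-d2 / (2 * sigma ** 2))         # Gaussian vote function
    return max(votes, key=votes.get) if votes else None
```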

(Slide graphs over the X-Y plane: TV, with TV(μ) = TV(x33) and TV(x15) marked; TV − TV(μ); HDTV; h(a) = |a − μ|; and h_b(a) = |a − b|.)

A principle: a job is not done until the Mathematics is completed (and, of course, until all the paperwork is also completed). The Mathematics of a research project always includes: 1. proofs of killer-ness; 2. simplifications (everything is simple once fully understood); 3. generalizations (to the widest possible application scope); and 4. insights (teasing out the main issues and underlying mega-truths with full drill-down). Therefore, we need to ask the following questions at this point:
Should we use the vector of medians (the only good choice of middle point in multidimensional space, since the point closest to the mean is also influenced by skewness, just like the mean)? Denoting the vector of medians by med, h_med(a) = |a − med| is an important functional (better than h_μ(a) = |a − μ|?). If we compute the median of an even number of values as the count-weighted average of the middle two values, then in binary columns μ and med coincide.
What about the vector of standard deviations, σ? (Computable with P-trees!) Do we have an improvement of BIRCH here: generating similar comprehensive statistical measures, but much faster and more focused?
We can do the same for any rank statistic (or order statistic), e.g., the vector of 1st or 3rd quartiles, Q1 or Q3, or the vector of k-th rank values (k-th ordinal values). If we preprocess to get the basic P-trees of med and of each mixed quartile vector (e.g., in 2-D add 5 new derived attributes: med, Q1,1, Q1,2, Q2,1, Q2,2, where Qi,j is the i-th quartile of the j-th column), what does this tell us (e.g., what can we conclude about the location of core clusters)? Maybe all we really need is the basic P-trees of the column quartiles, Q1, ..., Qn?
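One of the questions above, whether rank statistics like the median are cheap from P-trees, has a standard positive answer for bit-sliced data: the k-th ordinal value of a column can be found by walking its bit slices from most to least significant while keeping only AND results and root counts, never touching the horizontal records. A sketch under that assumption (hypothetical helper name, uncompressed boolean slices standing in for P-trees):

```python
import numpy as np

def rank_value(slices, k):
    """slices[0] is the most significant bit slice of a column (boolean array over rows).
    Return the k-th smallest value (1-based), using only ANDs and root counts."""
    n_bits = len(slices)
    mask = np.ones_like(slices[0], dtype=bool)   # rows still consistent with the chosen bit prefix
    value = 0
    for i, s in enumerate(slices):               # walk bits from most to least significant
        zeros = mask & ~s                        # remaining candidates whose current bit is 0
        count0 = int(zeros.sum())                # root count of that AND
        if k <= count0:
            mask = zeros                         # the k-th value has a 0 in this bit position
        else:
            k -= count0                          # skip past all the smaller (0-bit) rows
            mask = mask & s                      # the k-th value has a 1 in this bit position
            value += 1 << (n_bits - 1 - i)
    return value
```

The median of a column with |R| rows is then rank_value(slices, (|R|+1)//2), or the count-weighted average of the values at ranks |R|/2 and |R|/2 + 1 for even |R|, as suggested above; the vector of medians, med, applies this column by column.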

Additional Mathematics to enjoy: study the Vector Ordinal Disks (VODs) as alternatives to distance and ordinal disks (kNN disks), where VOD(a,k) = {x | x_d is one of the [closed] k-Nearest Neighbors of a_d for every column d}. Are they easy to compute from P-trees? Do they offer advantages? When? What? Why?
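A literal (horizontal) sketch of the VOD definition, useful as a reference point when judging whether a P-tree computation offers an advantage; the names are hypothetical and this is not a vertical implementation:

```python
import numpy as np

def vod(R, a, k):
    """Vector Ordinal Disk VOD(a, k): rows x of R such that, in every column d,
    x_d is among the closed k nearest values to a_d (ties at the k-th distance kept)."""
    R = np.asarray(R, dtype=float)
    keep = np.ones(len(R), dtype=bool)
    for d in range(R.shape[1]):
        col_dists = np.abs(R[:, d] - a[d])
        kth = np.sort(col_dists)[min(k, len(R)) - 1]   # distance of the k-th nearest value
        keep &= col_dists <= kth                       # closed k-NN condition in column d
    return R[keep]
```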

Dataset
1. KDDCUP-99 Dataset (Network Intrusion Dataset)
– 4.8 million records, 32 numerical attributes
– 6 classes, each containing >10,000 records
– Class distribution: Normal 972,780; IP sweep 12,481; Neptune 1,072,017; Port sweep 10,413; Satan 15,892; Smurf 2,807,886
– Testing set: 120 records, 20 per class
– 4 synthetic datasets (randomly generated): 10,000 records (SS-I); 100,000 records (SS-II); 1,000,000 records (SS-III); 2,000,000 records (SS-IV)

Speed and Scalability
Speed (Scalability) Comparison (k=5, hs=25): table of running time vs. training-set cardinality (×1000) for the SMART-TV, P-KNN, and KNN algorithms (KNN: NA at the largest cardinalities).
Machine used: Intel Pentium 4 CPU 2.6 GHz, 3.8 GB RAM, running Red Hat Linux.

Dataset (Cont.)
2. OPTICS dataset
– 8,000 points, 8 classes (CL-1, CL-2, …, CL-8)
– 2 numerical attributes
– Training set: 7,920 points
– Testing set: 80 points, 10 per class

Dataset (Cont.)
3. IRIS dataset
– 150 samples
– 3 classes (iris-setosa, iris-versicolor, and iris-virginica)
– 4 numerical attributes
– Training set: 120 samples
– Testing set: 30 samples, 10 per class

Overall Accuracy
Overall Classification Accuracy Comparison: table of accuracy on the IRIS, OPTICS, SS-I, SS-II, SS-III, SS-IV, and NI datasets for SMART-TV, P-KNN, and KNN (KNN: NA on NI).

Summary
A nearest-neighbor-based classification algorithm that starts its classification steps by approximating the Nearest Neighbor Set. The total variation functional is used to prune down the NNS candidate set; classification then finishes in the traditional way. The algorithm is fast and scales well to very large datasets. Its classification accuracy is very comparable to that of Closed kNN (which is better than that of kNN).