
Fast Similarity Metric Based Data Mining Techniques Using P-trees: k-Nearest Neighbor Classification  Distance metric based computation using P-trees.


1 Fast Similarity Metric Based Data Mining Techniques Using P-trees: k-Nearest Neighbor Classification  Distance metric based computation using P-trees  A new distance metric, called HOBbit distance  Some useful properties of P-trees  New P-tree Nearest Neighbor classification method - called Closed-KNN

2 Data Mining: extracting knowledge from a large amount of data. Functionalities: feature selection, association rule mining, classification & prediction, cluster analysis, outlier analysis. Information Pyramid: raw data at the bottom, useful information (sometimes 1 bit: Y/N) at the top, with data mining as the process that moves up the pyramid. More data volume = less information.

3 Classification: predicting the class of a data object; also called supervised learning. Training data (class labels are known and supervise the learning):

Feature1  Feature2  Feature3  Class
a1        b1        c1        A
a2        b2        c2        A
a3        b3        c3        B

A sample with unknown class, (a, b, c), is given to the classifier, which outputs the predicted class of the sample. Eager classifier: builds a classifier model in advance, e.g. decision tree induction, neural network. Lazy classifier: uses the raw training data directly, e.g. k-nearest neighbor.

4 Clustering (unsupervised learning – chapter 8): the process of grouping objects into classes with the objective that data objects are similar to the objects in the same cluster and dissimilar to the objects in other clusters. Figure: a two-dimensional space showing 3 clusters. Clustering is often called unsupervised learning or unsupervised classification because the class labels of the data objects are unknown.

5 Distance Metric (used in both classification and clustering): measures the dissimilarity between two data points. A metric is a function d of two n-dimensional points X and Y such that: d(X, Y) is positive definite: if X ≠ Y then d(X, Y) > 0, and if X = Y then d(X, Y) = 0; d(X, Y) is symmetric: d(X, Y) = d(Y, X); and d(X, Y) satisfies the triangle inequality: d(X, Y) + d(Y, Z) ≥ d(X, Z).

6 Various Distance Metrics. Minkowski distance, or L_p distance: d_p(X, Y) = ( Σ i=1..n |x_i − y_i|^p )^(1/p). Special cases: Manhattan distance d_1 (p = 1), Euclidean distance d_2 (p = 2), and Max distance d_∞ (p = ∞), where d_∞(X, Y) = max i=1..n |x_i − y_i|.

7 An Example. In a two-dimensional space let X = (2, 1), Y = (6, 4), and let Z = (6, 1) be the corner point: Manhattan, d_1(X, Y) = XZ + ZY = 4 + 3 = 7; Euclidean, d_2(X, Y) = XY = 5; Max, d_∞(X, Y) = max(XZ, ZY) = XZ = 4. Note that d_1 ≥ d_2 ≥ d_∞: for any positive integer p, d_p(X, Y) = ( Σ i=1..n |x_i − y_i|^p )^(1/p) lies between d_∞ and d_1.
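A quick check of these three values (plain Python, not from the slides; function names are mine):

```python
# A minimal sketch verifying the slide-7 example:
# X = (2, 1), Y = (6, 4) under the Manhattan, Euclidean, and Max metrics.

def minkowski(x, y, p):
    """L_p (Minkowski) distance between two equal-length points."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def max_dist(x, y):
    """L_infinity (Max / chessboard) distance."""
    return max(abs(a - b) for a, b in zip(x, y))

X, Y = (2, 1), (6, 4)
print(minkowski(X, Y, 1))   # Manhattan d1  -> 7.0
print(minkowski(X, Y, 2))   # Euclidean d2  -> 5.0
print(max_dist(X, Y))       # Max d_inf     -> 4
```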

8 Some Other Distances. Canberra distance: d_c(X, Y) = Σ i=1..n |x_i − y_i| / (x_i + y_i). Squared chord distance: d_sc(X, Y) = Σ i=1..n (√x_i − √y_i)^2. Squared chi-squared distance: d_chi(X, Y) = Σ i=1..n (x_i − y_i)^2 / (x_i + y_i).

9 HOBbit Similarity. The Higher Order Bit (HOBbit) similarity between two scalars (integers) A and B is HOBbitS(A, B) = max{ s : a_i = b_i for all 1 ≤ i ≤ s }, i.e. the number of consecutive most-significant bits on which A and B agree, where a_i, b_i are the i-th bits of A and B (left to right) and m is the number of bits. Example (bit positions 1..8):

x1: 0 1 1 0 1 0 0 1    y1: 0 1 1 1 1 1 0 1    HOBbitS(x1, y1) = 3
x2: 0 1 0 1 1 1 0 1    y2: 0 1 0 1 0 0 0 0    HOBbitS(x2, y2) = 4

10 HOBbit Distance (related to Hamming distance). The HOBbit distance between two scalar values A and B is d_v(A, B) = m − HOBbitS(A, B). For the previous example: d_v(x1, y1) = 8 − 3 = 5 and d_v(x2, y2) = 8 − 4 = 4. The HOBbit distance between two points X and Y is d_h(X, Y) = max i=1..n d_v(x_i, y_i). In our example (considering 2-dimensional data X = (x1, x2), Y = (y1, y2)): d_h(X, Y) = max(5, 4) = 5.
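A minimal sketch (mine, not from the slides) of both HOBbit functions, reproducing the example above:

```python
# HOBbit similarity/distance for m-bit integers, checked against the example
# x1 = 01101001, y1 = 01111101 and x2 = 01011101, y2 = 01010000.

def hobbit_similarity(a, b, m=8):
    """Number of consecutive most-significant bits on which a and b agree."""
    s = 0
    for i in range(m - 1, -1, -1):          # i = m-1 is the most significant bit
        if (a >> i) & 1 == (b >> i) & 1:
            s += 1
        else:
            break
    return s

def hobbit_dist_scalar(a, b, m=8):
    """d_v(A, B) = m - HOBbitS(A, B)."""
    return m - hobbit_similarity(a, b, m)

def hobbit_dist_point(x, y, m=8):
    """d_h(X, Y) = max over dimensions of the scalar HOBbit distance."""
    return max(hobbit_dist_scalar(a, b, m) for a, b in zip(x, y))

x1, y1 = 0b01101001, 0b01111101
x2, y2 = 0b01011101, 0b01010000
print(hobbit_similarity(x1, y1), hobbit_similarity(x2, y2))   # 3 4
print(hobbit_dist_point((x1, x2), (y1, y2)))                  # max(5, 4) = 5
```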

11 HOBbit Distance Is a Metric. HOBbit distance is positive definite: d_h(X, Y) = 0 if X = Y, and d_h(X, Y) > 0 if X ≠ Y. HOBbit distance is symmetric. HOBbit distance satisfies the triangle inequality.

12 Neighborhood of a Point. The neighborhood of a target point T with radius r is the set of points S such that X ∈ S if and only if d(T, X) ≤ r. The figures show the width-2r neighborhoods of T under the Manhattan, Euclidean, Max, and HOBbit metrics. If X is a point on the boundary, d(T, X) = r.

13 Decision Boundary. The decision boundary between points A and B is the locus of points X satisfying d(A, X) = d(B, X). The figures show the decision boundaries for the Manhattan, Euclidean, and Max distances, for the cases θ > 45° and θ < 45°. The decision boundary for the HOBbit distance is perpendicular to the axis along which the distance is maximum.

14 Minkowski Metrics. The L_p metrics (aka Minkowski metrics): d_p(X, Y) = ( Σ i=1..n w_i |x_i − y_i|^p )^(1/p), with the weights w_i assumed = 1. Unit disks and dividing lines are shown for p = 1 (Manhattan), p = 2 (Euclidean), p = 3, 4, …, p = max (chessboard), and p = ½, ⅓, ¼, …. d_max ≡ max i |x_i − y_i| ≡ d_∞ ≡ lim p→∞ d_p(X, Y). Proof (sketch): let b ≡ max(a_i), where a_i = |x_i − y_i|. For p large enough, the other a_i^p << b^p (since (a_i / b)^p → 0 whenever a_i < b), so Σ i=1..n a_i^p ≈ k·b^p, where k is the multiplicity of b in the sum; hence ( Σ i=1..n a_i^p )^(1/p) ≈ k^(1/p)·b, and k^(1/p) → 1.
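A small numeric check (mine) of the limit claim, using two of the difference vectors from the table on the next slide:

```python
# d_p approaches the max metric as p grows, e.g. for |x_i - y_i| = (3, 3) and (90, 45).

def d_p(diffs, p):
    """L_p combination of the per-dimension absolute differences."""
    return sum(d ** p for d in diffs) ** (1.0 / p)

for diffs in [(3, 3), (90, 45)]:
    print([round(d_p(diffs, p), 4) for p in (2, 8, 100)], "max:", max(diffs))
# (3, 3):   [4.2426, 3.2715, 3.0209]   max: 3
# (90, 45): [100.6231, 90.0439, 90.0]  max: 90
```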

15 P>1 Minkowski Metrics. The spreadsheet below computes |x1 − y1|^p, |x2 − y2|^p, and d_p(X, Y) = (|x1 − y1|^p + |x2 − y2|^p)^(1/p) for several values of p (MAX = the max metric), with Y = (0, 0):

p      x1      x2      |x1-y1|^p       |x2-y2|^p       d_p(X,Y)
2      0.5     0.5     0.25            0.25            0.7071067812
4      0.5     0.5     0.0625          0.0625          0.5946035575
9      0.5     0.5     0.001953125     0.001953125     0.5400298694
100    0.5     0.5     7.888609E-31    7.888609E-31    0.503477775
MAX    0.5     0.5     -               -               0.5
2      0.7071  0.7071  0.5             0.5             1
3      0.7071  0.7071  0.3535533906    0.3535533906    0.8908987181
7      0.7071  0.7071  0.0883883476    0.0883883476    0.7807091822
100    0.7071  0.7071  8.881784E-16    8.881784E-16    0.7120250978
MAX    0.7071  0.7071  -               -               0.7071067812
2      0.99    0.99    0.9801          0.9801          1.4000714267
8      0.99    0.99    0.9227446944    0.9227446944    1.0796026553
100    0.99    0.99    0.3660323413    0.3660323413    0.9968859946
1000   0.99    0.99    0.0000431712    0.0000431712    0.9906864536
MAX    0.99    0.99    -               -               0.99
2      1       1       1               1               1.4142135624
9      1       1       1               1               1.0800597389
100    1       1       1               1               1.0069555501
1000   1       1       1               1               1.0006933875
MAX    1       1       -               -               1
2      0.9     0.1     0.81            0.01            0.9055385138
9      0.9     0.1     0.387420489     0.000000001     0.9000000003
100    0.9     0.1     0.0000265614    1E-100          0.9
1000   0.9     0.1     1.747871E-46    0               0.9
MAX    0.9     0.1     -               -               0.9
2      3       3       9               9               4.2426406871
3      3       3       27              27              3.7797631497
8      3       3       6561            6561            3.271523198
100    3       3       5.153775E+47    5.153775E+47    3.0208666502
MAX    3       3       -               -               3
6      90      45      531441000000    8303765625      90.232863532
9      90      45      3.874205E+17    7.566806E+14    90.019514317
100    90      45      -               -               90
MAX    90      45      -               -               90

16 P<1 Minkowski Metrics. Writing q = 1/p, d_{1/p}(X, Y) = ( Σ i=1..n |x_i − y_i|^{1/p} )^p, so the same formula extends to p < 1. At p = 0 (the limit as p → 0) the metric does not exist (the limit does not converge). Again with Y = (0, 0):

p      x1     x2     |x1-y1|^p       |x2-y2|^p       d_p(X,Y)
1      0.1    0.1    0.1             0.1             0.2
0.8    0.1    0.1    0.1584893192    0.1584893192    0.237841423
0.4    0.1    0.1    0.3981071706    0.3981071706    0.5656854249
0.2    0.1    0.1    0.6309573445    0.6309573445    3.2
0.1    0.1    0.1    0.7943282347    0.7943282347    102.4
0.04   0.1    0.1    0.9120108394    0.9120108394    3355443.2
0.02   0.1    0.1    0.954992586     0.954992586     112589990684263
0.01   0.1    0.1    0.977237221     0.977237221     1.2676506002E+29
2      0.1    0.1    0.01            0.01            0.1414213562

p      x1     x2     |x1-y1|^p       |x2-y2|^p       d_p(X,Y)
1      0.5    0.5    0.5             0.5             1
0.8    0.5    0.5    0.5743491775    0.5743491775    1.189207115
0.4    0.5    0.5    0.7578582833    0.7578582833    2.8284271247
0.2    0.5    0.5    0.8705505633    0.8705505633    16
0.1    0.5    0.5    0.9330329915    0.9330329915    512
0.04   0.5    0.5    0.9726549474    0.9726549474    16777216
0.02   0.5    0.5    0.9862327045    0.9862327045    5.6294995342E+14
0.01   0.5    0.5    0.9930924954    0.9930924954    6.3382530011E+29
2      0.5    0.5    0.25            0.25            0.7071067812

p      x1     x2     |x1-y1|^p       |x2-y2|^p       d_p(X,Y)
1      0.9    0.1    0.9             0.1             1
0.8    0.9    0.1    0.9191661188    0.1584893192    1.097993846
0.4    0.9    0.1    0.9587315155    0.3981071706    2.14447281
0.2    0.9    0.1    0.9791483624    0.6309573445    10.8211133585
0.1    0.9    0.1    0.9895192582    0.7943282347    326.27006047
0.04   0.9    0.1    0.9957944476    0.9120108394    10312196.9619
0.02   0.9    0.1    0.9978950083    0.954992586     341871052443154
0.01   0.9    0.1    0.9989469497    0.977237221     3.8259705676E+29
2      0.9    0.1    0.81            0.01            0.9055385138

17 Min Dissimilarity Function. The d_min function, d_min(X, Y) = min i=1..n |x_i − y_i|, is strange: it is not even a pseudo-metric. The figure shows its unit disk, and the neighborhood of the blue point relative to the red point (the dividing neighborhood: those points closer to the blue point than to the red one). Major bifurcations! http://www.cs.ndsu.nodak.edu/~serazi/research/Distance.html
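A quick counterexample (mine, not from the slide) confirming that d_min violates both positive definiteness and the triangle inequality, so it cannot be a (pseudo-)metric:

```python
# d_min(X, Y) = min over dimensions of |x_i - y_i|

def d_min(x, y):
    return min(abs(a - b) for a, b in zip(x, y))

X, Y, Z = (0, 0), (1, 0), (1, 1)
print(d_min(X, Y), d_min(Y, Z), d_min(X, Z))      # 0 0 1
print(d_min(X, Y) + d_min(Y, Z) >= d_min(X, Z))   # False: triangle inequality fails
print(X != Y and d_min(X, Y) == 0)                # True: zero distance between distinct points
```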

18 Other Interesting Metrics. Canberra metric: d_c(X, Y) = Σ i=1..n |x_i − y_i| / (x_i + y_i) — a normalized Manhattan distance. Squared chord metric: d_sc(X, Y) = Σ i=1..n (√x_i − √y_i)^2 — already discussed as L_p with p = ½. Squared chi-squared metric: d_chi(X, Y) = Σ i=1..n (x_i − y_i)^2 / (x_i + y_i). HOBbit metric (Higher Order Binary bit): d_H(X, Y) = max i=1..n { m − HOB(x_i, y_i) }, where, for m-bit integers A = a_1..a_m and B = b_1..b_m, HOB(A, B) = max{ s : a_i = b_i for all 1 ≤ i ≤ s } (related to the Hamming distance in coding theory). Scalar product: d_sp(X, Y) = X · Y = Σ i=1..n x_i · y_i. Hyperbolic metrics (which map infinite space 1-1 onto a sphere). Which of these are rotationally invariant? Translationally invariant? Other?
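Minimal sketches of the first three dissimilarities (plain Python, not from the slides; function names are mine), assuming non-negative feature values so the denominators and square roots are defined:

```python
from math import sqrt

def canberra(x, y):
    return sum(abs(a - b) / (a + b) for a, b in zip(x, y) if a + b != 0)

def squared_chord(x, y):
    return sum((sqrt(a) - sqrt(b)) ** 2 for a, b in zip(x, y))

def squared_chi_squared(x, y):
    return sum((a - b) ** 2 / (a + b) for a, b in zip(x, y) if a + b != 0)

print(canberra((1, 2, 3), (2, 2, 6)))        # 1/3 + 0 + 3/9 = 0.666...
print(squared_chord((1, 4), (4, 1)))         # (1-2)^2 + (2-1)^2 = 2
print(squared_chi_squared((1, 2), (3, 2)))   # 4/4 + 0 = 1.0
```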

19 Notations
P1 & P2 : P1 AND P2 (also written P1 ^ P2)
P1 | P2 : P1 OR P2
P′ : the complement P-tree of P
P i,j : basic P-tree for band i, bit j
P i (v) : value P-tree for value v of band i
P i ([v1, v2]) : interval P-tree for the interval [v1, v2] of band i
P 0 : the pure-0 tree, a P-tree whose root node is pure 0
P 1 : the pure-1 tree, a P-tree whose root node is pure 1
rc(P) : root count of P-tree P
N : number of pixels; n : number of bands; m : number of bits
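The following uncompressed bit-vector class is a hedged stand-in (not the actual quadrant-compressed P-tree structure; names are mine) that mimics this notation and can be used to experiment with the operations:

```python
class BitVec:
    """One bit per tuple/pixel; a flat stand-in for a P-tree."""
    def __init__(self, bits):
        self.bits = list(bits)
    def __and__(self, other):
        return BitVec([a & b for a, b in zip(self.bits, other.bits)])
    def __or__(self, other):
        return BitVec([a | b for a, b in zip(self.bits, other.bits)])
    def complement(self):
        return BitVec([1 - a for a in self.bits])
    def rc(self):                        # root count = number of 1s
        return sum(self.bits)

def basic_ptrees(values, m=8):
    """P_{i,j} for j = 1..m (most-significant bit first) of one band's values."""
    return [BitVec([(v >> (m - j)) & 1 for v in values]) for j in range(1, m + 1)]

def value_ptree(values, v):              # P_i(v): 1 where the band value equals v
    return BitVec([1 if x == v else 0 for x in values])

def interval_ptree(values, lo, hi):      # P_i([lo, hi])
    return BitVec([1 if lo <= x <= hi else 0 for x in values])

band = [105, 104, 200, 105, 17]
print(basic_ptrees(band)[0].bits)                              # MSB column: [0, 0, 1, 0, 0]
print(value_ptree(band, 105).rc())                             # 2
print(interval_ptree(band, 104, 107).rc())                     # 3
print((value_ptree(band, 105) & value_ptree(band, 104)).rc())  # 0 (property 5 on the next slide)
```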

20 Properties of P-trees. Properties 1–3 (parts a–d) appear as formulas on the slide and are not reproduced here. 4. rc(P1 | P2) = 0 iff rc(P1) = 0 and rc(P2) = 0. 5. v1 ≠ v2 ⇒ rc{P i (v1) & P i (v2)} = 0. 6. rc(P1 | P2) = rc(P1) + rc(P2) − rc(P1 & P2). 7. rc{P i (v1) | P i (v2)} = rc{P i (v1)} + rc{P i (v2)}, where v1 ≠ v2.

21 k-Nearest Neighbor Classification and Closed-KNN. 1) Select a suitable value for k. 2) Determine a suitable distance metric. 3) Find the k nearest neighbors of the sample using the selected metric. 4) Find the plurality class of the nearest neighbors by voting on the class labels of the NNs. 5) Assign the plurality class to the sample to be classified. T is the target pixel. With k = 3, to find the third nearest neighbor, KNN arbitrarily selects one point from the boundary of the neighborhood; Closed-KNN includes all points on the boundary. Closed-KNN yields higher classification accuracy than traditional KNN.
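A minimal sketch of the Closed-KNN voting idea with an arbitrary metric (plain Python, not the P-tree implementation; names and data are mine):

```python
# Instead of breaking ties arbitrarily at the k-th neighbor, include every
# training point whose distance equals the k-th smallest distance.
from collections import Counter

def closed_knn_classify(training, sample, k, dist):
    """training: list of (point, label); dist: any metric function."""
    dists = sorted(dist(p, sample) for p, _ in training)
    r = dists[k - 1]                       # distance of the k-th nearest neighbor
    voters = [label for p, label in training if dist(p, sample) <= r]
    return Counter(voters).most_common(1)[0][0]

train = [((0, 0), "A"), ((0, 1), "A"), ((1, 1), "B"), ((2, 2), "B"), ((1, 0), "B")]
manhattan = lambda x, y: sum(abs(a - b) for a, b in zip(x, y))
print(closed_knn_classify(train, (0, 0), 3, manhattan))   # -> "A"
```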

22 Searching Nearest Neighbors. We begin the search by finding the exact matches. Let the target sample be T = (v1, v2, …, vn). The initial neighborhood is the point T itself. We then expand the neighborhood along each dimension: along dimension i, [vi] is expanded to the interval [vi − ai, vi + bi] for some positive integers ai and bi. Expansion continues until there are at least k points in the neighborhood.

23 HOBbit Similarity Method for KNN. In this method we match bits of the target to the training data. First, find the training pixels matching the target in all 8 bits of each band (the exact matches). Let b i,j be the j-th bit of the i-th band of the target pixel. Define the target P-trees Pt: Pt i,j = P i,j if b i,j = 1, and Pt i,j = P′ i,j otherwise. Define the precision-value P-trees Pv: Pv i,j = Pt i,1 & Pt i,2 & Pt i,3 & … & Pt i,j, for 1 ≤ j ≤ m.
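A hedged bit-vector sketch of the target and precision-value P-trees for one band (an uncompressed stand-in for the real compressed P-tree structures; names are mine):

```python
def bit(v, j, m=8):                       # j-th bit, counted from the most significant
    return (v >> (m - j)) & 1

def target_ptrees(band_values, target_value, m=8):
    """Pt_{i,j}: basic column P_{i,j} if the target's j-th bit is 1, else its complement."""
    pts = []
    for j in range(1, m + 1):
        col = [bit(v, j, m) for v in band_values]
        if bit(target_value, j, m) == 0:
            col = [1 - b for b in col]    # complement P'_{i,j}
        pts.append(col)
    return pts

def precision_value_ptree(pts, j):
    """Pv_{i,j} = Pt_{i,1} & ... & Pt_{i,j}: 1 where the j high-order bits match the target."""
    out = pts[0]
    for col in pts[1:j]:
        out = [a & b for a, b in zip(out, col)]
    return out

band = [105, 104, 107, 200, 17]
pts = target_ptrees(band, 105)
print(precision_value_ptree(pts, 8))   # exact matches with 105 -> [1, 0, 0, 0, 0]
print(precision_value_ptree(pts, 6))   # 6 high bits agree, i.e. values in [104, 107] -> [1, 1, 1, 0, 0]
```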

24 An Analysis of the HOBbit Method. Let the i-th band value of the target T be vi = 105 = 01101001b. Exact match: [01101001] = [105, 105]. 1st expansion: [0110100-] = [01101000, 01101001] = [104, 105]. 2nd expansion: [011010--] = [01101000, 01101011] = [104, 107]. The neighborhood does not expand evenly on both sides: the target is 105, but the center of [104, 107] is (104 + 107) / 2 = 105.5. It also expands by powers of 2. However, it is computationally very cheap.
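A small snippet (mine, not from the slides) showing how the HOBbit neighborhood of v = 105 grows as low-order bits are dropped, each step doubling the interval width:

```python
v, m = 105, 8
for j in range(m, m - 4, -1):            # j = number of high-order bits kept
    lo = (v >> (m - j)) << (m - j)
    hi = lo + (1 << (m - j)) - 1
    print(j, [lo, hi])
# 8 [105, 105], 7 [104, 105], 6 [104, 107], 5 [104, 111]
```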

25 Perfect Centering Method. The Max distance metric provides a better neighborhood by keeping the target in the center and expanding by 1 on both sides. Initial neighborhood P-tree (exact matching): Pnn = P1(v1) & P2(v2) & P3(v3) & … & Pn(vn). If rc(Pnn) < k: Pnn = P1([v1−1, v1+1]) & P2([v2−1, v2+1]) & … & Pn([vn−1, vn+1]). If still rc(Pnn) < k: Pnn = P1([v1−2, v1+2]) & P2([v2−2, v2+2]) & … & Pn([vn−2, vn+2]), and so on. This is computationally costlier than the HOBbit similarity method, but gives slightly better classification accuracy. Let Pc(i) be the value P-tree for class i; the plurality class is the i that maximizes rc(Pc(i) & Pnn).
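A minimal sketch of the perfect-centering expansion, using plain bit lists in place of interval P-trees (the data and function names are illustrative only):

```python
def interval_col(values, lo, hi):
    return [1 if lo <= v <= hi else 0 for v in values]

def perfect_centering(bands, target, k):
    """bands: per-band value lists (columns of the training set); target: per-band values."""
    r = 0
    while True:
        pnn = [1] * len(bands[0])
        for band, v in zip(bands, target):
            col = interval_col(band, v - r, v + r)
            pnn = [a & b for a, b in zip(pnn, col)]
        if sum(pnn) >= k:                  # rc(Pnn) >= k: enough neighbors found
            return pnn, r
        r += 1                             # widen every interval by 1 on both sides

bands = [[10, 11, 13, 40], [5, 7, 5, 9]]   # two bands, four training pixels
pnn, r = perfect_centering(bands, target=(10, 5), k=2)
print(pnn, r)                              # [1, 1, 0, 0] at radius 2
```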

26 Performance. Experiments were run on two sets of aerial photographs of the Best Management Plot (BMP) of the Oakes Irrigation Test Area (OITA), ND. The data contain 6 bands: red, green, and blue reflectance values, soil moisture, nitrate, and yield (the class label). Band values range from 0 to 255 (8 bits). We consider 8 classes or levels of yield values: 0 to 7.

27 Performance – Accuracy 1997 Dataset:

28 Performance - Accuracy (cont.) 1998 Dataset:

29 Performance - Time. 1997 dataset; both axes on a logarithmic scale.

30 Performance - Time (cont.). 1998 dataset; both axes on a logarithmic scale.

31 3-NEAREST-NEIGHBOR CLASSIFICATION of an unclassified sample a = (a5 a6 a11 a12 a13 a14) = (000000), horizontal approach (the relevant attributes are a5, a6, a11, a12, a13, a14; a10 = C is the class label).

Key  a1 a2 a3 a4 a5 a6 a7 a8 a9 a10=C a11 a12 a13 a14 a15 a16 a17 a18 a19 a20
t12   1  0  1  0  0  0  1  1  0   1    0   1   1   0   1   1   0   0   0   1
t13   1  0  1  0  0  0  1  1  0   1    0   1   0   0   1   0   0   0   1   1
t15   1  0  1  0  0  0  1  1  0   1    0   1   0   1   0   0   1   1   0   0
t16   1  0  1  0  0  0  1  1  0   1    1   0   1   0   1   0   0   0   1   0
t21   0  1  1  0  1  1  0  0  0   1    1   0   1   0   0   0   1   1   0   1
t27   0  1  1  0  1  1  0  0  0   1    0   0   1   1   0   0   1   1   0   0
t31   0  1  0  0  1  0  0  0  1   1    1   0   1   0   0   0   1   1   0   1
t32   0  1  0  0  1  0  0  0  1   1    0   1   1   0   1   1   0   0   0   1
t33   0  1  0  0  1  0  0  0  1   1    0   1   0   0   1   0   0   0   1   1
t35   0  1  0  0  1  0  0  0  1   1    0   1   0   1   0   0   1   1   0   0
t51   0  1  0  1  0  0  1  1  0   0    1   0   1   0   0   0   1   1   0   1
t53   0  1  0  1  0  0  1  1  0   0    0   1   0   0   1   0   0   0   1   1
t55   0  1  0  1  0  0  1  1  0   0    0   1   0   1   0   0   1   1   0   0
t57   0  1  0  1  0  0  1  1  0   0    0   0   1   1   0   0   1   1   0   0
t61   1  0  1  0  1  0  0  0  1   0    1   0   1   0   0   0   1   1   0   1
t72   0  0  1  1  0  0  1  1  0   0    0   1   1   0   1   1   0   0   0   1
t75   0  0  1  1  0  0  1  1  0   0    0   1   0   1   0   0   1   1   0   0

The 3 nearest neighbors after scanning the first three tuples (columns a5 a6 C a11 a12 a13 a14, distance):
t12  0 0 1 0 1 1 0   distance 2
t13  0 0 1 0 1 0 0   distance 1
t15  0 0 1 0 1 0 1   distance 2

Continuing the scan: t16 d=2, don't replace; t21 d=4, don't replace; t27 d=4, don't replace; t31 d=3, don't replace; t32 d=3, don't replace; t33 d=2, don't replace; t35 d=3, don't replace; t51 d=2, don't replace; t53 d=1, replace — t53 (0 0 0 0 1 0 0, distance 1) replaces t15; t55 d=2, don't replace; t57 d=2, don't replace; t61 d=3, don't replace; t72 d=2, don't replace; t75 d=2, don't replace. C=1 wins!

Note that only 1 of the many training tuples at distance 2 from the sample got to vote. We didn't know that distance 2 would be the vote cutoff until the end of the 1st scan. Finding the other distance-2 voters (the closed 3NN set, or C3NN) requires another scan.

32 A 2nd scan is required to find the closed 3NN set (the training table is the same as on the previous slide). Does it change the vote?

3NN set after the 1st scan (columns a5 a6 C a11 a12 a13 a14, distance), with the unclassified sample 0 0 0 0 0 0:
t12  0 0 1 0 1 1 0   distance 2
t13  0 0 1 0 1 0 0   distance 1
t53  0 0 0 0 1 0 0   distance 1
Vote after the 1st scan: C=1.

2nd scan: t12 d=2, already have; t13 d=1, already have; t15 d=2, include it also; t16 d=2, include it also; t21 d=4, don't include; t27 d=4, don't include; t31 d=3, don't include; t32 d=3, don't include; t33 d=2, include it also; t35 d=3, don't include; t51 d=2, include it also; t53 d=1, already have; t55 d=2, include it also; t57 d=2, include it also; t61 d=3, don't include; t72 d=2, include it also; t75 d=2, include it also.

Does it change the vote? YES!

33 Closed 3NNC using P-trees. Instead of scanning horizontally, build the tuple P-tree P_s for the sample s = (000000) over the relevant attributes and AND it with P_C and P_C′ to count class votes. Let all training points in D_s,0 (the disk about the sample s of radius 0) vote first; if there are at least 3 of them, we are done, else go to D_s,1, and so on. The slide shows the vertical bit columns of the training table (a1 … a20, with a10 = C); black denotes a complemented attribute column (rather than ′) and red an uncomplemented one. Here P_s = 00000000000000000, i.e. D_s,0 is empty, so we proceed to S_s,1, the sphere of radius 1 about s.

34 S_s,1: construct the P-tree P_S(s,1) = OR over i ∈ {5, 6, 11, 12, 13, 14} of P_i, where P_i = P_{|s_i − t_i| = 1; |s_j − t_j| = 0, j ≠ i} = P_{S(s_i,1) ∩ S(s_j,0)} for j ∈ {5, 6, 11, 12, 13, 14} − {i}. That is, each P_i is formed by ANDing the column for a_i uncomplemented (red) with the columns of all other relevant attributes complemented (black); OR-ing the six P_i gives P_D(s,1), which marks the training points at distance exactly 1 from s (here t13 and t53).

35 D_s,2: construct the P-tree P_D(s,2) = OR of all double-dimension interval P-trees: P_D(s,2) = OR over i, j ∈ {5, 6, 11, 12, 13, 14} of P_i,j, where P_i,j = P_{S(s_i,1) ∩ S(s_j,1) ∩ S(s_k,0)} for k ∈ {5, 6, 11, 12, 13, 14} − {i, j} (black = complemented attribute column, red = uncomplemented). The slide forms P_5,6, P_5,11, P_5,12, P_5,13, P_5,14, P_6,11, …, P_13,14. After S_s,1 we already have 3 nearest neighbors, so we could quit and declare C=1 the winner; but once D_s,2 is added we have the closed 3-neighborhood, and we declare C=0 the winner!
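A cross-check of this walk-through (plain Python over the flattened attribute table rather than actual compressed P-trees; variable names are mine, values transcribed from the training table on slide 31):

```python
# Distances of the 17 training tuples from the sample s = (000000) over the
# relevant attributes a5, a6, a11, a12, a13, a14, and the resulting votes.
from collections import Counter

keys = ["t12","t13","t15","t16","t21","t27","t31","t32","t33","t35",
        "t51","t53","t55","t57","t61","t72","t75"]
# (a5, a6, a11, a12, a13, a14) for each tuple; class C = a10
attrs = [(0,0,0,1,1,0),(0,0,0,1,0,0),(0,0,0,1,0,1),(0,0,1,0,1,0),
         (1,1,1,0,1,0),(1,1,0,0,1,1),(1,0,1,0,1,0),(1,0,0,1,1,0),
         (1,0,0,1,0,0),(1,0,0,1,0,1),(0,0,1,0,1,0),(0,0,0,1,0,0),
         (0,0,0,1,0,1),(0,0,0,0,1,1),(1,0,1,0,1,0),(0,0,0,1,1,0),(0,0,0,1,0,1)]
cls = [1]*10 + [0]*7

dist = {k: sum(t) for k, t in zip(keys, attrs)}     # Hamming distance from (000000)
r = sorted(dist.values())[2]                        # 3rd-smallest distance = 2
closed = [c for k, c in zip(keys, cls) if dist[k] <= r]
print(sorted(dist.values())[:3])                    # [1, 1, 2]: D(s,0) empty, 2 nbrs at d=1
print(len(closed), Counter(closed))                 # 11 voters: C=1 gets 5, C=0 gets 6 -> C=0 wins
```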

36 Justification for using vertical structures (once again)? For record-based workloads (where the result is a set of records), changing the horizontal record structure and then having to reconstruct it may introduce too much post-processing. For data mining workloads, the result is often a single bit (Yes/No, True/False) or another unstructured result (e.g. a histogram), where there is no reconstructive post-processing and the actual data records need never be involved. The slide illustrates this with a relation R(A1, A2, A3, A4) stored both as horizontal records (010 111 110 001, 011 111 110 000, …) and as vertical bit-slices R11, R12, R13, …, R41, R42, R43.

37 In the previous example, no exact matches (distance-0, i.e. similarity-6, neighbors) were found for the sample; two neighbors were found at a distance of 1 (similarity 5) and nine at distance 2 (similarity 4). All 11 got equal votes, even though the two similarity-5 neighbors are closer than the nine similarity-4 neighbors. A better approach is to weight each vote by its similarity. (The relative vote-weight function used here is a linear function of the similarity; another function of similarity, better but harder to compute, would be a Gaussian.) While we are weighting votes by similarity, we can just as easily weight attributes by relevance, assuming some attributes are more relevant than others; e.g., the relevance weight of a feature attribute could be its correlation with the class label. P-trees accommodate this method very well (in fact, a variation on this theme won the KDD-Cup competition in 2002).
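A hedged sketch of similarity-weighted voting over the closed neighborhood (plain Python; the linear weight and the toy data are illustrative only, not the P-tree implementation):

```python
# Each neighbor's vote is weighted by a linear function of its similarity
# (here, similarity = number of matching attributes, i.e. Hamming similarity).
from collections import defaultdict

def weighted_vote(neighbors, sample):
    """neighbors: list of (attribute_tuple, class_label) in the closed k-NN set."""
    votes = defaultdict(float)
    for attrs, label in neighbors:
        sim = sum(1 for a, s in zip(attrs, sample) if a == s)
        votes[label] += sim                                      # linear weight
    return max(votes, key=votes.get)

nbrs = [((0, 0, 0, 1, 0, 0), 1), ((0, 0, 0, 1, 0, 0), 0), ((0, 0, 1, 1, 0, 0), 0)]
print(weighted_vote(nbrs, (0, 0, 0, 0, 0, 0)))   # class 1: 5  vs  class 0: 5 + 4 = 9 -> 0
```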

38 Association for Computing Machinery KDD-Cup-02: NDSU Team

39 Closed Manhattan Nearest Neighbor Classifier (uses a linear function of Manhattan, i.e. Hamming, similarity). The sample is (000000) and the weight of each relevant attribute is its subscript. The slide shows the vertical columns for a5, a6, a11, a12, a13, a14 and C; black denotes a complemented attribute column and red an uncomplemented one. The vote is even simpler than in the equal-vote case: every tuple votes in accordance with its weighted similarity (if its a_i value agrees with that of (000000), the vote contribution is the subscript i of that attribute, else zero). Thus we can just add up the root counts of each relevant attribute, weighted by its subscript. Class C=1 root counts: rc(P_C & P′_a5) = 4, rc(P_C & P′_a6) = 8, rc(P_C & P′_a11) = 7, rc(P_C & P′_a12) = 4, rc(P_C & P′_a13) = 4, rc(P_C & P′_a14) = 7, so the C=1 vote is 4*5 + 8*6 + 7*11 + 4*12 + 4*13 + 7*14 = 343. Similarly, the C=0 vote is 6*5 + 7*6 + 5*11 + 3*12 + 3*13 + 4*14 = 258.
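As a check, the following bit-column computation (an uncompressed stand-in for the P-tree ANDs and root counts; the columns are transcribed from the training table on slide 31) reproduces the two vote totals:

```python
a5  = [0,0,0,0,1,1,1,1,1,1,0,0,0,0,1,0,0]
a6  = [0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0]
a11 = [0,0,0,1,1,0,1,0,0,0,1,0,0,0,1,0,0]
a12 = [1,1,1,0,0,0,0,1,1,1,0,1,1,0,0,1,1]
a13 = [1,0,0,1,1,1,1,1,0,0,1,0,0,1,1,1,0]
a14 = [0,0,1,0,0,1,0,0,0,1,0,0,1,1,0,0,1]
C   = [1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0]   # a10, the class label

weighted = [(5, a5), (6, a6), (11, a11), (12, a12), (13, a13), (14, a14)]

def vote(class_bit):
    # the sample is (000000), so a tuple "agrees" on a_i exactly where the column bit is 0
    total = 0
    for w, col in weighted:
        total += w * sum(1 for b, c in zip(col, C) if b == 0 and c == class_bit)
    return total

print(vote(1), vote(0))   # 343 258 -> class 1 wins
```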

40 We note that the Closed Manhattan NN classifier uses a pyramidal influence function. It would be much better to use a Gaussian influence function, but that is much harder to implement. One generalization of this method to integer (rather than Boolean) values would be to weight the bit positions b, b−1, …, 0 (high order to low order) with Gaussian weights, so that, at least within each attribute, influences are Gaussian. We can call this method Closed Manhattan Gaussian NN Classification (CMG NNC). Testing the performance of either CM NNC or CMG NNC would make a great paper for this course (or a thesis); improving it in some way would make an even better one.

