
Fast Similarity Metric Based Data Mining Techniques Using P-trees: k-Nearest Neighbor Classification

Presentation transcript:

1 Fast Similarity Metric Based Data Mining Techniques Using P-trees: k-Nearest Neighbor Classification
- Distance metric based computation using P-trees
- A new distance metric, called the HOBbit distance
- Some useful properties of P-trees
- A new P-tree nearest neighbor classification method, called Closed-KNN
These notes contain NDSU confidential and proprietary material. Patents pending on bSQ and P-tree technology.

2 Data Mining: extracting knowledge from a large amount of data.
Functionalities: feature selection, association rule mining, classification & prediction, cluster analysis, outlier analysis.
Information Pyramid: raw data at the base, useful information (sometimes 1 bit: Y/N) at the top, with data mining moving up the pyramid. More data volume = less information.

3 Classification: predicting the class of a data object; also called supervised learning.
Training data (class labels are known and supervise the learning):
  Feature1  Feature2  Feature3  Class
  a1        b1        c1        A
  a2        b2        c2        A
  a3        b3        c3        B
A sample with unknown class, (a, b, c), is given to the classifier, which outputs the predicted class of the sample.
Eager classifier: builds a classifier model in advance, e.g. decision tree induction, neural network.
Lazy classifier: uses the raw training data directly, e.g. k-nearest neighbor.

4 Clustering (unsupervised learning – chapter 8): the process of grouping objects into classes so that the data objects are similar to the objects in the same cluster and dissimilar to the objects in other clusters. (Figure: a two-dimensional space showing 3 clusters.) Clustering is often called unsupervised learning or unsupervised classification because the class labels of the data objects are unknown.

5 Distance Metric (used in both classification and clustering): measures the dissimilarity between two data points. A metric is a function, d, of two n-dimensional points X and Y, such that
- d(X, Y) is positive definite: if X ≠ Y, then d(X, Y) > 0; if X = Y, then d(X, Y) = 0
- d(X, Y) is symmetric: d(X, Y) = d(Y, X)
- d(X, Y) satisfies the triangle inequality: d(X, Y) + d(Y, Z) ≥ d(X, Z)

6 Various Distance Metrics. For any positive integer p, the Minkowski distance or L_p distance is d_p(X, Y) = (Σ i=1 to n |x_i - y_i|^p)^(1/p). Special cases: Manhattan distance (p = 1), Euclidean distance (p = 2), and Max distance (p = ∞), d_∞(X, Y) = max i=1 to n |x_i - y_i|.

7 An Example. In a two-dimensional space, let X = (2, 1), Y = (6, 4), and let Z be the corner point (6, 1):
- Manhattan: d_1(X, Y) = XZ + ZY = 4 + 3 = 7
- Euclidean: d_2(X, Y) = XY = 5
- Max: d_∞(X, Y) = max(XZ, ZY) = XZ = 4
In general, d_1 ≥ d_2 ≥ d_∞. (A quick check in code is sketched below.)
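As an illustrative aside (not part of the original slides), a few lines of Python reproduce the three numbers above; the minkowski helper is a hypothetical name, not code from the talk.

```python
def minkowski(x, y, p):
    """L_p distance between two equal-length numeric sequences (p may be inf)."""
    if p == float("inf"):
        return max(abs(a - b) for a, b in zip(x, y))
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

X, Y = (2, 1), (6, 4)
print(minkowski(X, Y, 1))             # Manhattan: 4 + 3 = 7
print(minkowski(X, Y, 2))             # Euclidean: sqrt(16 + 9) = 5.0
print(minkowski(X, Y, float("inf")))  # Max: max(4, 3) = 4
```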

8 Some Other Distances: the Canberra distance, the squared cord distance, and the squared chi-squared distance (their formulas are given on slide 18).

9 HOBbit Similarity. Higher Order Bit (HOBbit) similarity: HOBbitS(A, B) = max{s : a_i = b_i for all 1 ≤ i ≤ s}, where A and B are two scalars (integers), a_i and b_i are the i-th bits of A and B (counted left to right, i.e. from the most significant bit), and m is the number of bits.
Example (bit positions 1..8):
x1: 0 1 1 0 1 0 0 1    y1: 0 1 1 1 1 1 0 1    HOBbitS(x1, y1) = 3
x2: 0 1 0 1 1 1 0 1    y2: 0 1 0 1 0 0 0 0    HOBbitS(x2, y2) = 4

10 HOBbit Distance (related to Hamming distance). The HOBbit distance between two scalar values A and B is d_v(A, B) = m - HOBbitS(A, B).
For the previous example: d_v(x1, y1) = 8 - 3 = 5 and d_v(x2, y2) = 8 - 4 = 4.
The HOBbit distance between two points X and Y is d_h(X, Y) = max i=1 to n d_v(x_i, y_i). In our example (considering 2-dimensional data), d_h(X, Y) = max(5, 4) = 5. (A small code sketch follows.)
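The following is a minimal Python sketch (not from the slides) of HOBbit similarity and distance, assuming 8-bit unsigned band values as in the example:

```python
M = 8  # number of bits per band value

def hobbit_similarity(a, b, m=M):
    """Number of identical most-significant bits of a and b (0..m)."""
    s = 0
    for i in range(m - 1, -1, -1):              # scan from the high-order bit down
        if ((a >> i) & 1) == ((b >> i) & 1):
            s += 1
        else:
            break
    return s

def hobbit_distance(x, y, m=M):
    """HOBbit distance between points: max over dimensions of m - similarity."""
    return max(m - hobbit_similarity(a, b, m) for a, b in zip(x, y))

# The example above: x1 = 01101001, y1 = 01111101, x2 = 01011101, y2 = 01010000
print(hobbit_similarity(0b01101001, 0b01111101))                            # 3
print(hobbit_distance((0b01101001, 0b01011101), (0b01111101, 0b01010000)))  # max(5, 4) = 5
```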

11 HOBbit Distance Is a Metric. The HOBbit distance is positive definite (if X = Y it is 0; if X ≠ Y it is > 0), it is symmetric, and it satisfies the triangle inequality.

12 Neighborhood of a Point. The neighborhood of a target point T is the set of points S such that X ∈ S if and only if d(T, X) ≤ r. If X is a point on the boundary, then d(T, X) = r. (Figures: neighborhoods of width 2r around T for the Manhattan, Euclidean, Max, and HOBbit distances.)

13 Decision Boundary. The decision boundary between points A and B is the locus of points X satisfying d(A, X) = d(B, X). (Figures: decision boundaries between A and B for the Manhattan, Euclidean, and Max distances, drawn for the cases θ > 45° and θ < 45°.) The decision boundary for the HOBbit distance is perpendicular to the axis that makes the max distance.

14 Minkowski Metrics. L_p-metrics (aka Minkowski metrics): d_p(X, Y) = (Σ i=1 to n w_i |x_i - y_i|^p)^(1/p), with the weights w_i assumed to be 1. (Figures: unit disks and dividing lines for p = 1 (Manhattan), p = 2 (Euclidean), p = 3, 4, ..., p = max (chessboard), and p = 1/2, 1/3, 1/4, ...)
The max metric d_max ≡ max i |x_i - y_i| is the limit: d_max ≡ lim p→∞ d_p(X, Y). Proof (sketch): let a_i = |x_i - y_i| and b ≡ max_i(a_i). For p large enough, the other a_i^p are negligible compared with b^p (since y = x^p grows ever more steeply), so Σ i=1 to n a_i^p ≈ k·b^p, where k is the multiplicity of b in the sum. Hence (Σ i=1 to n a_i^p)^(1/p) ≈ k^(1/p)·b, and k^(1/p) → 1 as p → ∞. (A numerical check is sketched below.)
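A quick numerical check (an editorial sketch, with weights w_i = 1) of this limit in Python; the values mirror the 90/45 rows of the spreadsheet on the next slide:

```python
def d_p(x, y, p):
    """Unweighted L_p distance."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

X, Y = (90, 45), (0, 0)
for p in (1, 2, 6, 9, 30):
    print(p, d_p(X, Y, p))
# d_p falls from 135 toward 90 = max(|90 - 0|, |45 - 0|) as p grows
```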

15 P > 1 Minkowski Metrics. The spreadsheet below evaluates d_p(X, Y) for X = (x1, x2), Y = (y1, y2) and increasing p; the last column, (|x1-y1|^p + |x2-y2|^p)^(1/p), approaches the MAX row as p grows. (Entries the spreadsheet could not display are marked "(not shown)".)

p     x1    y1  x2      y2  |x1-y1|^p      |x2-y2|^p      d_p(X,Y)
2     0.5   0   0.5     0   0.25           0.25           0.7071067812
4     0.5   0   0.5     0   0.0625         0.0625         0.5946035575
9     0.5   0   0.5     0   0.001953125    0.001953125    0.5400298694
100   0.5   0   0.5     0   7.888609E-31   7.888609E-31   0.503477775
MAX   0.5   0   0.5     0                                 0.5
2     0.70  0   0.7071  0   0.5            0.5            1
3     0.70  0   0.7071  0   0.3535533906   0.3535533906   0.8908987181
7     0.70  0   0.7071  0   0.0883883476   0.0883883476   0.7807091822
100   0.70  0   0.7071  0   8.881784E-16   8.881784E-16   0.7120250978
MAX   0.70  0   0.7071  0                                 0.7071067812
2     0.99  0   0.99    0   0.9801         0.9801         1.4000714267
8     0.99  0   0.99    0   0.9227446944   0.9227446944   1.0796026553
100   0.99  0   0.99    0   0.3660323413   0.3660323413   0.9968859946
1000  0.99  0   0.99    0   0.0000431712   0.0000431712   0.9906864536
MAX   0.99  0   0.99    0                                 0.99
2     1     0   1       0   1              1              1.4142135624
9     1     0   1       0   1              1              1.0800597389
100   1     0   1       0   1              1              1.0069555501
1000  1     0   1       0   1              1              1.0006933875
MAX   1     0   1       0                                 1
2     0.9   0   0.1     0   0.81           0.01           0.9055385138
9     0.9   0   0.1     0   0.387420489    0.000000001    0.9000000003
100   0.9   0   0.1     0   0.0000265614   (not shown)    0.9
1000  0.9   0   0.1     0   1.747871E-46   0              0.9
MAX   0.9   0   0.1     0                                 0.9
2     3     0   3       0   9              9              4.2426406871
3     3     0   3       0   27             27             3.7797631497
8     3     0   3       0   6561           6561           3.271523198
100   3     0   3       0   5.153775E+47   5.153775E+47   3.0208666502
MAX   3     0   3       0                                 3
6     90    0   45      0   531441000000   8303765625     90.232863532
9     90    0   45      0   3.874205E+17   7.566806E+14   90.019514317
100   90    0   45      0   (not shown)    (not shown)    90
MAX   90    0   45      0                                 90

16 P < 1 Minkowski Metrics. d_1/p(X, Y) = (Σ i=1 to n |x_i - y_i|^(1/p))^p. For p = 0 the limit as p → 0 does not exist (does not converge). The spreadsheets below evaluate the distance for exponents q < 1.

q     x1   y1  x2   y2  |x1-y1|^q      |x2-y2|^q      d_q(X,Y)
1     0.1  0   0.1  0   0.1            0.1            0.2
0.8   0.1  0   0.1  0   0.1584893192   0.1584893192   0.237841423
0.4   0.1  0   0.1  0   0.3981071706   0.3981071706   0.5656854249
0.2   0.1  0   0.1  0   0.6309573445   0.6309573445   3.2
0.1   0.1  0   0.1  0   0.7943282347   0.7943282347   102.4
0.04  0.1  0   0.1  0   0.9120108394   0.9120108394   3355443.2
0.02  0.1  0   0.1  0   0.954992586    0.954992586    112589990684263
0.01  0.1  0   0.1  0   0.977237221    0.977237221    1.2676506002E+29
2     0.1  0   0.1  0   0.01           0.01           0.1414213562

q     x1   y1  x2   y2  |x1-y1|^q      |x2-y2|^q      d_q(X,Y)
1     0.5  0   0.5  0   0.5            0.5            1
0.8   0.5  0   0.5  0   0.5743491775   0.5743491775   1.189207115
0.4   0.5  0   0.5  0   0.7578582833   0.7578582833   2.8284271247
0.2   0.5  0   0.5  0   0.8705505633   0.8705505633   16
0.1   0.5  0   0.5  0   0.9330329915   0.9330329915   512
0.04  0.5  0   0.5  0   0.9726549474   0.9726549474   16777216
0.02  0.5  0   0.5  0   0.9862327045   0.9862327045   5.6294995342E+14
0.01  0.5  0   0.5  0   0.9930924954   0.9930924954   6.3382530011E+29
2     0.5  0   0.5  0   0.25           0.25           0.7071067812

q     x1   y1  x2   y2  |x1-y1|^q      |x2-y2|^q      d_q(X,Y)
1     0.9  0   0.1  0   0.9            0.1            1
0.8   0.9  0   0.1  0   0.9191661188   0.1584893192   1.097993846
0.4   0.9  0   0.1  0   0.9587315155   0.3981071706   2.14447281
0.2   0.9  0   0.1  0   0.9791483624   0.6309573445   10.8211133585
0.1   0.9  0   0.1  0   0.9895192582   0.7943282347   326.27006047
0.04  0.9  0   0.1  0   0.9957944476   0.9120108394   10312196.9619
0.02  0.9  0   0.1  0   0.9978950083   0.954992586    341871052443154
0.01  0.9  0   0.1  0   0.9989469497   0.977237221    3.8259705676E+29
2     0.9  0   0.1  0   0.81           0.01           0.9055385138

17 Min Dissimilarity Function. The d_min function, d_min(X, Y) = min i=1 to n |x_i - y_i|, is strange: it is not a pseudo-metric. (Figures: its unit disk, and the neighborhood of the blue point relative to the red point, i.e. the dividing neighborhood of points closer to the blue than to the red, which shows major bifurcations.) http://www.cs.ndsu.nodak.edu/~serazi/research/Distance.html

18 Other Interesting Metrics
Canberra metric: d_c(X, Y) = Σ i=1 to n |x_i - y_i| / (x_i + y_i), a normalized Manhattan distance.
Square cord metric: d_sc(X, Y) = Σ i=1 to n (√x_i - √y_i)^2, already discussed as L_p with p = 1/2.
Squared chi-squared metric: d_chi(X, Y) = Σ i=1 to n (x_i - y_i)^2 / (x_i + y_i).
HOBbit metric (High Order Binary bit): d_H(X, Y) = max i=1 to n {m - HOB(x_i, y_i)}, where, for m-bit integers A = a_1..a_m and B = b_1..b_m, HOB(A, B) = max{s : a_i = b_i for all 1 ≤ i ≤ s} (related to the Hamming distance in coding theory).
Scalar product metric: d_s(X, Y) = X · Y = Σ i=1 to n x_i * y_i.
Hyperbolic metrics (which map infinite space 1-1 onto a sphere).
Which are rotationally invariant? Translationally invariant? Other? (Sketches of the first three appear below.)
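For concreteness, here are illustrative Python sketches of the first three dissimilarities above (assuming non-negative coordinates with x_i + y_i > 0 so the denominators are defined); these helpers are editorial, not code from the P-tree system:

```python
from math import sqrt

def canberra(x, y):
    """Normalized Manhattan distance: sum of |x_i - y_i| / (x_i + y_i)."""
    return sum(abs(a - b) / (a + b) for a, b in zip(x, y))

def squared_cord(x, y):
    """Sum of (sqrt(x_i) - sqrt(y_i))^2."""
    return sum((sqrt(a) - sqrt(b)) ** 2 for a, b in zip(x, y))

def squared_chi_squared(x, y):
    """Sum of (x_i - y_i)^2 / (x_i + y_i)."""
    return sum((a - b) ** 2 / (a + b) for a, b in zip(x, y))
```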

19 Notations
P1 & P2 : P1 AND P2 (also written P1 ^ P2)
P1 | P2 : P1 OR P2
P′ : complement P-tree of P
P_i,j : basic P-tree for band i, bit j
P_i(v) : value P-tree for value v of band i
P_i([v1, v2]) : interval P-tree for interval [v1, v2] of band i
P0 : pure-0 tree, a P-tree whose root node is pure0
P1 : pure-1 tree, a P-tree whose root node is pure1
rc(P) : root count of P-tree P
N : number of pixels; n : number of bands; m : number of bits

20 Properties of P-trees. (Properties 1-3 appear as formula images on the original slide.)
4. rc(P1 | P2) = 0 iff rc(P1) = 0 and rc(P2) = 0
5. v1 ≠ v2 implies rc{P_i(v1) & P_i(v2)} = 0
6. rc(P1 | P2) = rc(P1) + rc(P2) - rc(P1 & P2)
7. rc{P_i(v1) | P_i(v2)} = rc{P_i(v1)} + rc{P_i(v2)}, where v1 ≠ v2
(Properties 4 and 6 are checked in the sketch below.)
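The root-count identities can be checked with a flat bit-vector stand-in for a P-tree (an editorial sketch; the hypothetical BitVec class below ignores the compressed tree structure and keeps one bit per pixel):

```python
class BitVec:
    """Uncompressed stand-in for a P-tree: one 0/1 entry per pixel."""
    def __init__(self, bits):
        self.bits = list(bits)
    def __and__(self, other):
        return BitVec(a & b for a, b in zip(self.bits, other.bits))
    def __or__(self, other):
        return BitVec(a | b for a, b in zip(self.bits, other.bits))
    def complement(self):
        return BitVec(1 - a for a in self.bits)
    def rc(self):
        return sum(self.bits)       # root count = number of 1 bits

p1 = BitVec([1, 0, 1, 1, 0, 0, 1, 0])
p2 = BitVec([0, 0, 1, 0, 1, 0, 1, 1])
# Property 6: rc(P1 | P2) = rc(P1) + rc(P2) - rc(P1 & P2)
assert (p1 | p2).rc() == p1.rc() + p2.rc() - (p1 & p2).rc()
# Property 4: rc(P1 | P2) = 0 iff rc(P1) = 0 and rc(P2) = 0
assert ((p1 | p2).rc() == 0) == (p1.rc() == 0 and p2.rc() == 0)
```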

21 k-Nearest Neighbor Classification and Closed-KNN
1) Select a suitable value for k.
2) Determine a suitable distance metric.
3) Find the k nearest neighbors of the sample using the selected metric.
4) Find the plurality class of the nearest neighbors by voting on the class labels of the NNs.
5) Assign the plurality class to the sample to be classified.
T is the target pixel. With k = 3, to find the third nearest neighbor, KNN arbitrarily selects one point from the boundary line of the neighborhood; Closed-KNN instead includes all points on the boundary. Closed-KNN yields higher classification accuracy than traditional KNN. (A sketch of the voting step follows.)
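The closed-KNN idea, voting over every point within the k-th nearest distance rather than an arbitrary k-subset, can be sketched in plain Python (a hypothetical helper over explicit points; the P-tree method computes the same neighborhood with bit-band operations instead):

```python
from collections import Counter

def closed_knn_class(train, target, k, dist):
    """train: list of (point, label) pairs. Returns the plurality class of
    all training points whose distance is <= the k-th nearest distance."""
    ranked = sorted(train, key=lambda pl: dist(pl[0], target))
    kth = dist(ranked[k - 1][0], target)          # distance of the k-th neighbor
    closed = [label for point, label in ranked if dist(point, target) <= kth]
    return Counter(closed).most_common(1)[0][0]
```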

22 Searching Nearest Neighbors. We begin the search by finding the exact matches. Let the target sample be T = <v1, v2, ..., vn>. The initial neighborhood is the point T itself. We then expand the neighborhood along each dimension: along dimension i, [v_i] is expanded to the interval [v_i - a_i, v_i + b_i], for some positive integers a_i and b_i. Expansion continues until there are at least k points in the neighborhood.

23 HOBbit Similarity Method for KNN. In this method, we match bits of the target to the training data. First, find the pixels matching in all 8 bits of each band (the exact matches). Let b_i,j be the j-th bit of the i-th band of the target pixel. Define the target P-tree Pt by
Pt_i,j = P_i,j if b_i,j = 1, and Pt_i,j = P′_i,j otherwise,
and the precision-value P-tree
Pv_i,1..j = Pt_i,1 & Pt_i,2 & Pt_i,3 & ... & Pt_i,j.
(A bit-mask sketch of this construction appears below.)
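A bit-mask sketch of this construction (an editorial illustration under stated assumptions: basic P-trees are represented as Python integers with one bit per pixel, P[i][j] is the basic P-tree for bit j of band i, and ALL_ONES is the pure-1 mask; none of these names come from the talk):

```python
def target_ptree(P, ALL_ONES, i, j, target_bit):
    """Pt_i,j: P_i,j when the target's bit is 1, its complement otherwise."""
    return P[i][j] if target_bit == 1 else (~P[i][j] & ALL_ONES)

def precision_value_ptree(P, ALL_ONES, i, target_bits, j):
    """Pv_i,1..j = Pt_i,1 & ... & Pt_i,j: pixels whose band-i value matches
    the target on its j high-order bits (bits indexed 0..j-1 here)."""
    pv = ALL_ONES
    for bit_pos in range(j):
        pv &= target_ptree(P, ALL_ONES, i, bit_pos, target_bits[bit_pos])
    return pv
```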

24 An Analysis of the HOBbit Method. Let the i-th band value of the target T be v_i = 105 = 01101001b, so the initial interval is [01101001] = [105, 105].
1st expansion: [0110100-] = [01101000, 01101001] = [104, 105]
2nd expansion: [011010--] = [01101000, 01101011] = [104, 107]
The neighborhood does not expand evenly on both sides: the target is 105, while the center of [104, 107] is (104 + 107) / 2 = 105.5. It also expands by powers of 2. On the other hand, it is computationally very cheap. (A small check is sketched below.)
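The interval obtained after leaving the last j bits of the target value unspecified can be computed directly (a tiny editorial check, assuming 8-bit values; hobbit_interval is a hypothetical helper):

```python
def hobbit_interval(v, j):
    """Range of values agreeing with v on all but the j low-order bits."""
    lo = (v >> j) << j          # clear the j low-order bits
    hi = lo + (1 << j) - 1      # set the j low-order bits
    return lo, hi

print(hobbit_interval(105, 1))  # (104, 105)  1st expansion
print(hobbit_interval(105, 2))  # (104, 107)  2nd expansion
print(hobbit_interval(105, 3))  # (104, 111)  width doubles at each step
```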

25 Perfect Centering Method. The max distance metric provides a better neighborhood by keeping the target in the center and expanding by 1 on both sides.
Initial neighborhood P-tree (exact matching): Pnn = P1(v1) & P2(v2) & P3(v3) & ... & Pn(vn)
If rc(Pnn) < k: Pnn = P1([v1-1, v1+1]) & P2([v2-1, v2+1]) & ... & Pn([vn-1, vn+1])
If rc(Pnn) < k: Pnn = P1([v1-2, v1+2]) & P2([v2-2, v2+2]) & ... & Pn([vn-2, vn+2])
and so on. This is computationally costlier than the HOBbit similarity method, but gives slightly better classification accuracy.
Let Pc(i) be the value P-tree for class i. The plurality class is the class i maximizing rc(Pnn & Pc(i)). (A sketch of the loop follows.)
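A sketch of the perfect-centering loop (assumptions: interval value P-trees are again plain integer bit masks; interval_ptree(i, lo, hi) and class_ptree(c) are hypothetical helpers returning the mask of pixels with band-i value in [lo, hi] and the mask of class-c pixels, respectively):

```python
def perfect_centering(interval_ptree, class_ptree, all_ones, target, n_classes, k):
    def rc(mask):                         # root count = number of 1 bits
        return bin(mask).count("1")

    r = 0
    while True:                           # assumes the training set holds >= k pixels
        pnn = all_ones
        for i, v in enumerate(target):
            pnn &= interval_ptree(i, v - r, v + r)
        if rc(pnn) >= k:                  # neighborhood now holds at least k points
            break
        r += 1                            # expand by 1 on both sides in every band
    # plurality class = arg max over classes c of rc(Pnn & Pc(c))
    return max(range(n_classes), key=lambda c: rc(pnn & class_ptree(c)))
```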

26 Performance. Experiments were run on two sets of aerial photographs of the Best Management Plot (BMP) of the Oakes Irrigation Test Area (OITA), ND. The data contains 6 bands: red, green, and blue reflectance values, soil moisture, nitrate, and yield (the class label). Band values range from 0 to 255 (8 bits). We consider 8 classes, or levels, of yield values: 0 to 7.

27 Performance – Accuracy 1997 Dataset:

28 Performance - Accuracy (cont.) 1998 Dataset:

29 Performance - Time. 1997 Dataset (both axes in logarithmic scale):

30 Performance - Time (cont.). 1998 Dataset (both axes in logarithmic scale):

