Presentation on theme: "Machine Learning is based on Near Neighbor Set(s), NNS. Clustering, even density based, identifies near neighbor cores 1 st (round NNS s,  about a center)."— Presentation transcript:

1 Machine Learning is based on Near Neighbor Set(s), NNS. Clustering, even density-based, identifies near-neighbor cores first (round NNSs, ε about a center). Classification is continuity-based, and Near Neighbor Sets (NNS) are the central concept in continuity: ∀ε>0 ∃δ>0 such that d(x,a)<δ ⟹ d(f(x),f(a))<ε, where f assigns a class to a feature vector; i.e., every ε-NNS of f(a) has a δ-NNS of a in its pre-image. If f(Dom) is categorical, then ∃δ>0 such that d(x,a)<δ ⟹ f(x)=f(a).
Caution: for classification, it may be the case that one has to use the continuity in lower dimensions to get a prediction (due to data sparseness). E.g., suppose points 1,2,3,4,5,6,7,8 are all at distance ε from a, with 1,2,3,4 → class C and 5,6,7,8 → class D. Any ε that gives us a vote gives us a tie vote. However, projecting onto the vertical subspace and taking ε/2, the ε/2-neighborhood about a contains only 5 and 6, so it gives class D.
Using horizontal data, NNS derivation requires at least one scan (at least O(n)). L∞-disk NNS can be derived using vertical data in O(log2 n), yet usually Euclidean disks are preferred. (Note: Euclidean and Manhattan disks coincide in binary data sets.) Our solution in a sentence: circumscribe the desired Euclidean ε-neighborhood with functional contours (sets of the type f⁻¹([b,c])) until the intersection is scannable, then scan it for Euclidean ε-neighborhood membership. Advantage: the intersection can be determined before scanning, by creating and ANDing functional-contour P-trees (see the sketch below).
Database analysis can be broken down into 2 areas, Querying and Data Mining. Data Mining can be broken down into 2 areas, Machine Learning and Association Rule Mining. Machine Learning can be broken down into 2 areas, Clustering and Classification. Clustering can be broken down into 2 types, Isotropic (round clusters) and Density-based. Classification can be broken down into 2 types, Model-based and Neighbor-based.
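The "circumscribe, then scan" idea can be sketched in a few lines. This is a minimal illustration only, with assumed names and a plain numpy array standing in for P-trees: the L∞ box about a (the intersection of the coordinate-projection contours) is formed first, and only the surviving candidates are scanned for true Euclidean ε-membership.

```python
import numpy as np

# Minimal sketch of "circumscribe with contours, then scan" (data and names are assumptions;
# a real implementation would AND functional-contour P-trees instead of scanning for the box).
def euclidean_nbrs_via_contours(R, a, eps):
    in_box = np.all(np.abs(R - a) <= eps, axis=1)          # L-infinity contour intersection
    candidates = R[in_box]                                  # now small enough to scan
    keep = np.linalg.norm(candidates - a, axis=1) <= eps    # true Euclidean eps-membership
    return candidates[keep]

R = np.random.rand(1000, 4)                                 # toy training set
a = np.full(4, 0.5)
print(len(euclidean_nbrs_via_contours(R, a, 0.1)))
```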

2 Contours: given f:R(A1..An) → Y and S ⊆ Y, the f-contour(S) = f⁻¹(S). Equivalently, the Af-contour(S) = SELECT x1..xn FROM R* WHERE x.Af ∈ S, where R* is R extended with the derived attribute Af with domain Y (the equivalence is x.Af = f(x1..xn) ∀x ∈ R). If S = {a}, we use f-Isobar(a), equivalently Af-Isobar(a).
If f is a local density and {Sk} is a partition of Y, then {f⁻¹(Sk)} partitions R. (E.g., in OPTICS, f = reachability distance and {Sk} is the partition produced by intersections of graph(f), with respect to a walk of R, and a horizontal line.) A weather map uses an equiwidth interval partition of S = Reals (barometric pressure or temperature contours). A grid is the intersection partition with respect to the dimension projection functions (next slide). A class is a contour under f:R → C, the class map. An L∞ ε-disk about a is the intersection of the ε dimension-projection contours containing a.
[Figure: R(A1..An) mapped by f into Y; graph(f) = {(x, f(x)) | x ∈ R}; an f-contour(S) pulled back from S ⊆ Y; and the extended table R* carrying the derived attribute Af.]
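A contour is just a selection on the materialized derived attribute. The sketch below, with assumed names and toy data, adds Af as a column and selects the rows whose Af value lands in S.

```python
import numpy as np

# Sketch (assumed names): materialize A_f = f(x) as an extra column, then an
# f-contour of S = [lo, hi) is a plain selection on that column.
def f_contour(R, f, lo, hi):
    Af = np.apply_along_axis(f, 1, R)        # derived attribute A_f
    return R[(Af >= lo) & (Af < hi)]         # rows whose A_f falls in S

R = np.random.rand(100, 3)
contour = f_contour(R, lambda x: x.sum(), 1.0, 1.5)   # contour of the sum functional
print(len(contour))
```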

3 GRIDs: given f:R → Y and any partition S = {Sk} of Y, {f⁻¹(Sk)} is the S,f-grid of R (grid cells = contours). If Y = Reals, the j.lo f-grid is produced by agglomerating over the j low bits of Y, one cell per fixed (b-j) high-bit pattern. The j low bits walk [the isobars of] a cell; the b-j high bits identify the cell (lo = extension / hi = intention). Let b-1,...,0 be the b bit positions of Y. The j.lo f-grid is the partition of R generated by f and S = {S(b-1)..(b-j) | S(b-1)..(b-j) = [(b-1)(b-2)...(b-j)0..0, (b-1)(b-2)...(b-j)1..1]}, a partition of Y = Reals.
If F = {fh}, the j.lo F-grid is the intersection partition of the j.lo fh-grids (intersection of partitions). The canonical j.lo grid is the j.lo π-grid, where π = {πd : R → R[Ad] | πd = the d-th coordinate projection}. j.hi gridding is similar (the b-j low bits walk the cell contents / the j high bits identify the cells).
[Figure: for horizontal and vertical bitwidths 3 and 2 respectively, the 2.lo grid and the 1.hi grid over axes labeled 000..111 and 00..11. Want square cells or a square pattern?]
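Gridding by bit positions is just integer masking. A small sketch (names and bit widths are assumptions): the high bits identify a cell, the low j bits walk within it.

```python
# Sketch: j.lo gridding of b-bit integer values; the cell id is the (b-j) high bits,
# and the j low bits walk within the cell.
def jlo_grid(values, j):
    cells = {}
    for v in values:
        cells.setdefault(v >> j, []).append(v)   # drop the j low bits to get the cell id
    return cells

print(jlo_grid(range(8), j=2))                   # {0: [0, 1, 2, 3], 1: [4, 5, 6, 7]}
```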

4 j.lo and j.hi gridding, continued: horizontal_bitwidth = vertical_bitwidth = b iff the j.lo grid = the (b-j).hi grid, e.g., for hb = vb = b = 3 and j = 2.
[Figure: the 2.lo grid and the 1.hi grid drawn over the same 3-bit axes (000..111 on each), showing that they coincide.]

5 SOME useful Near Neighbor Sets (NNS). Given a similarity s : R×R → Reals (i.e., s(x,y) = s(y,x) and s(x,x) ≥ s(x,y) ∀x,y ∈ R) and a C ⊆ R:
Cardinal disks, skins and rings:
disk(C,r) ≡ {x ∈ R | s(x,C) ≥ r}; also = the functional contour f⁻¹([r,∞)), where f(x) = sC(x) = s(x,C).
skin(C,r) ≡ disk(C,r) - C.
ring(C,r2,r1) ≡ disk(C,r2) - disk(C,r1) ≡ skin(C,r2) - skin(C,r1); also = the functional contour sC⁻¹((r1,r2]).
[Figure: for C = {a}, concentric disks of radii r1 and r2 about a, and the same about a set C.]
Ordinal disks, skins and rings:
disk(C,k) ⊆ C′ with |disk(C,k) ∩ C′| = k and s(x,C) ≥ s(y,C) ∀x ∈ disk(C,k), y ∉ disk(C,k).
skin(C,k) = disk(C,k) - C (the skin comes from "s k immediate neighbors" and is a kNNS of C).
ring(C,k) = cskin(C,k) - cskin(C,k-1).
closeddisk(C,k) ≡ the union of all disk(C,k); closedskin(C,k) ≡ the union of all skin(C,k). Note: closeddisk and closedskin(C,k) are redundant, since closeddisk(C,k) = disk(C, s(C,y)) where y is any k-th NN of C.
L∞ skins: skin∞(a,k) = {x | ∀d, xd is one of the k-NNs of ad} - a local normalization?
A distance, d, generates a similarity in many ways, e.g., s(x,y) = 1/(1+d(x,y)); or, if the relationship varies by location, s(x,y) = ρ(x,y)/(1+d(x,y)); or s(x,y) = e^(-d(x,y)²); or the truncated Gaussian s(x,y) = e^(-d(x,y)²/std) - e^(-ε²/std) for d(x,y) ≤ ε, and 0 for d(x,y) > ε. (Vote weighting IS a similarity assignment, so the similarity-to-distance graph IS a vote weighting for classification.)
[Figure: graphs of these similarities as functions of d, each starting at 1 (or at 1 - e^(-ε²/std) for the truncated Gaussian) and decreasing.]
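As an illustration of the cardinal definitions, the sketch below (assumed names; a single point a for C, and the similarity s(x,a) = 1/(1+d(x,a)) from Euclidean d) computes disks and rings as simple similarity-threshold selections.

```python
import numpy as np

# Sketch: cardinal disk / ring about a single point a (C = {a}),
# with similarity s(x,a) = 1/(1 + Euclidean d(x,a)).
def sim(R, a):
    return 1.0 / (1.0 + np.linalg.norm(R - a, axis=1))

def disk(R, a, r):
    return R[sim(R, a) >= r]                     # contour s_a^{-1}([r, inf))

def ring(R, a, r1, r2):
    s = sim(R, a)
    return R[(s > r1) & (s <= r2)]               # contour s_a^{-1}((r1, r2])

R = np.random.rand(200, 2)
a = np.array([0.5, 0.5])
print(len(disk(R, a, 0.8)), len(ring(R, a, 0.6, 0.8)))
```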

6 P-trees. Given f:R(A1..An) → Y and S ⊆ Y, the (uncompressed) Predicate-tree 0Pf,S is defined by: 0Pf,S(x) = 1 (true) iff f(x) ∈ S. 0Pf,S is called a P-tree for short and is just the existential R*-bit map of S on R*.Af. The compressed P-tree, sPf,S, is the compression of 0Pf,S with equi-width leaf size s, as follows (a one-level sketch appears below):
1. Choose a walk of R (this converts 0Pf,S from a bit map to a bit vector).
2. Equi-width partition 0Pf,S with segment size s (s = leafsize; the last segment can be short).
3. Eliminate, and mask to 0, all pure-zero segments (call the mask the NotPure0 Mask, or EM).
4. Eliminate, and mask to 1, all pure-one segments (call the mask the Pure1 Mask, or UM).
Compressing each leaf of sPf,S with leafsize s2 gives s1,s2Pf,S; recursively, s1,s2,s3Pf,S, s1,s2,s3,s4Pf,S, ... (this builds an EM tree and a UM tree).
BASIC P-trees. If Ai is Real or Binary and fi,j(x) ≡ the j-th bit of xi, then {(*)Pfi,j,{1} ≡ (*)Pi,j}, j = b..0, are the basic (*)P-trees of Ai, * = s1..sk. If Ai is Categorical and fi,a(x) = 1 if xi = a, else 0, then {(*)Pfi,a,{1} ≡ (*)Pi,a}, a ∈ R[Ai], are the basic (*)P-trees of Ai.
Notes: the UM masks (e.g., of 2^k,...,2^0 Pi,j, with k = ceiling(log2|R|)) form a (binary) tree. Whenever the EM mask marks a segment as pure-zero, that entire subtree can be eliminated, so a 0-node at level k (lowest level = level 0) with no subtree indicates a 2^k-run of zeros. In this construction, the UM tree is redundant. We call these EM trees the basic binary P-trees. The next slide shows a top-down (easy to understand) construction, and the following slide a (much more efficient) bottom-up construction of the same. We have suppressed the leafsize prefix. (EM = existential aggregation, UM = universal aggregation.)
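A minimal sketch of one compression level, with an assumed list-of-bits representation standing in for the bit vector: EM flags the segments that are not pure zero, UM flags the pure-one segments, and only mixed leaves are retained.

```python
# One compression level of a P-tree with leaf size s (toy representation).
def compress(bitvec, s):
    segs = [bitvec[i:i + s] for i in range(0, len(bitvec), s)]
    EM = [int(any(seg)) for seg in segs]        # NotPure0 mask: 0 => pure-zero segment, eliminated
    UM = [int(all(seg)) for seg in segs]        # Pure1 mask:    1 => pure-one segment, eliminated
    leaves = [seg for seg in segs if any(seg) and not all(seg)]   # only mixed leaves are kept
    return EM, UM, leaves

EM, UM, leaves = compress([0,0,0,0, 1,1,1,1, 0,1,1,0, 0,0,1,0], s=4)
print(EM, UM, leaves)    # [0, 1, 1, 1] [0, 1, 0, 0] [[0, 1, 1, 0], [0, 0, 1, 0]]
```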

7 A data table, R(A1..An), containing horizontal structures (records), is processed vertically (vertical scans): vertically partition the table, compress each vertical bit slice into a basic binary Predicate-tree (P-tree), then process using multi-operand logical ANDs. The example table is:
R(A1 A2 A3 A4)
010 111 110 001
011 111 110 000
010 110 101 001
010 111 101 111
101 010 001 100
010 010 001 101
111 000 001 100
The basic binary P-tree P1,1 for the bit slice R11 is built top-down by recording the truth of the predicate "pure1" recursively on halves, until purity is reached:
1. The whole file is not pure1 → 0.
2. The 1st half is not pure1 → 0.
3. The 2nd half is not pure1 → 0.
4. The 1st half of the 2nd half is not pure1 → 0.
5. The 2nd half of the 2nd half is pure1 → 1.
6. The 1st half of the 1st half of the 2nd half is pure1 → 1.
7. The 2nd half of the 1st half of the 2nd half is not pure1 → 0; but it is pure (pure0), so this branch ends.
E.g., to count the number of occurrences of 111 000 001 100, AND the basic P-trees P11^P12^P13^P'21^P'22^P'23^P'31^P'32^P33^P41^P'42^P'43 and read the result level by level (2^3-level, 2^2-level, 2^1-level), giving the count 2.
[Figure: the horizontal records R[A1]..R[A4] scanned vertically into bit slices R11..R43, and the resulting basic P-trees P11..P43.]
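The top-down construction can be sketched directly from the halving rule above (a toy bit list stands in for the bit slice; the tuple representation of a mixed node is an assumption).

```python
# Sketch: top-down P-tree construction by recursive halving until purity.
def build_ptree(bits):
    if all(b == 1 for b in bits):
        return 1                          # pure-1: whole segment collapses to a 1-node
    if all(b == 0 for b in bits):
        return 0                          # pure-0: whole segment collapses to a 0-node
    mid = len(bits) // 2
    return (build_ptree(bits[:mid]), build_ptree(bits[mid:]))   # mixed: recurse on halves

print(build_ptree([0,1,0,1, 1,1,1,1]))    # (((0, 1), (0, 1)), 1)
```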

8 Top-down construction of basic binary P-trees is good for understanding, but bottom-up is more efficient. Bottom-up construction of P11 is done using an in-order tree traversal and the collapsing of pure siblings as the leaf bits of R11 arrive.
[Figure: the bit slices R11..R43 and the bottom-up construction of P11, with pure sibling pairs collapsed as soon as both are complete.]
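A simplified, level-by-level variant of the bottom-up idea (the construction described above streams leaves in order and collapses siblings as soon as both are available; names and representation here are assumptions):

```python
# Sketch: collapse pure sibling pairs level by level, from the leaf bits upward.
def build_bottom_up(bits):
    level = list(bits)                             # level 0: one node per bit
    while len(level) > 1:
        nxt = []
        for left, right in zip(level[0::2], level[1::2]):
            if left == 0 and right == 0:
                nxt.append(0)                      # pure-0 siblings collapse to a 0-node
            elif left == 1 and right == 1:
                nxt.append(1)                      # pure-1 siblings collapse to a 1-node
            else:
                nxt.append((left, right))          # mixed node keeps its children
        level = nxt
    return level[0]

print(build_bottom_up([0,1,0,1, 1,1,1,1]))         # same tree as the top-down sketch
```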

9 To count occurrences of 7,0,1,4, use the predicate pure111000001100, i.e., AND the basic P-trees
P11 ^ P12 ^ P13 ^ P'21 ^ P'22 ^ P'23 ^ P'31 ^ P'32 ^ P33 ^ P41 ^ P'42 ^ P'43
over the table R(A1 A2 A3 A4):
2 7 6 1   (010 111 110 001)
6 7 6 0   (011 111 110 000)
2 7 5 1   (010 110 101 001)
2 7 5 7   (010 111 101 111)
5 2 1 4   (101 010 001 100)
2 2 1 5   (010 010 001 101)
7 0 1 4   (111 000 001 100)
In the resulting tree, the 2^1-level has the only 1-bit, so the 1-count = 1 * 2^1 = 2. Processing efficiencies? (The prefixed leaf-sizes have been removed.)
[Figure: the P-trees P11..P43 and their complements being ANDed node by node, with callouts: "This 0 makes the entire left branch 0", "These 0s make this node 0", "These 1s and these 0s make this 1".]
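The same count can be reproduced with plain bit columns standing in for the P-trees (a sketch with assumed names; the table is as listed on this slide):

```python
import numpy as np

# Sketch: count occurrences of the pattern 7,0,1,4 = 111 000 001 100 by ANDing
# bit-slice columns (P_{i,j}) or their complements (P'_{i,j}).
R = np.array([[2,7,6,1],[6,7,6,0],[2,7,5,1],[2,7,5,7],
              [5,2,1,4],[2,2,1,5],[7,0,1,4]])               # rows as listed above

def count_pattern(R, pattern, bits=3):
    mask = np.ones(len(R), dtype=int)
    for i, v in enumerate(pattern):
        for j in range(bits):
            p = (R[:, i] >> j) & 1                          # basic P-tree P_{i,j} as a bit column
            mask &= p if (v >> j) & 1 else 1 - p            # use P_{i,j} or its complement
    return int(mask.sum())

print(count_pattern(R, (7, 0, 1, 4)))
```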

10 A useful functional: TV(a) = Σ_{x∈R} (x-a)∘(x-a). Using d as an index variable over the dimensions and i, j, k as bit-slice indexes:
TV(a) = Σ_{x∈R} Σ_{d=1..n} (x_d² - 2 a_d x_d + a_d²)
= Σ_{x∈R} Σ_{d=1..n} (Σ_k 2^k x_dk)² - 2 Σ_{x∈R} Σ_{d=1..n} a_d (Σ_k 2^k x_dk) + |R| |a|²
= Σ_x Σ_d (Σ_i 2^i x_di)(Σ_j 2^j x_dj) - 2 Σ_x Σ_d a_d (Σ_k 2^k x_dk) + |R| |a|²
= Σ_{x,d,i,j} 2^(i+j) x_di x_dj - 2 Σ_d a_d Σ_{x,k} 2^k x_dk + |R| |a|²
TV(a) = Σ_{i,j,d} 2^(i+j) |P_di ∧ P_dj| - Σ_k 2^(k+1) Σ_d a_d |P_dk| + |R| |a|²
Equivalently, TV(a) = Σ_{x,d,i,j} 2^(i+j) x_di x_dj + |R| (-2 Σ_d a_d μ_d + Σ_d a_d a_d).
Note that the first term does not depend upon a. Thus, the derived attribute TV - TV(μ) (which eliminates the 1st term) is much simpler to compute and has identical contours (it just lowers the graph by TV(μ)). We also find it useful to post-compose a log to reduce the number of bit slices. The resulting functional is called the High-Dimension-ready Total Variation, or HDTV(a).
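A quick numerical check of the bit-slice formula for TV(a) (toy data and bit width are assumptions; plain bit columns stand in for the basic P-trees and their AND counts):

```python
import numpy as np

# Sketch: TV(a) computed directly and from bit-slice counts |P_di AND P_dj| and |P_dk|.
def tv_direct(X, a):
    return float(((X - a) ** 2).sum())

def tv_from_counts(X, a, b=3):
    n, dims = X.shape
    bits = (X[:, :, None] >> np.arange(b)) & 1                          # bits[x, d, k]
    term1 = sum(2 ** (i + j) * int((bits[:, d, i] & bits[:, d, j]).sum())
                for d in range(dims) for i in range(b) for j in range(b))
    term2 = sum(2 ** (k + 1) * a[d] * int(bits[:, d, k].sum())
                for d in range(dims) for k in range(b))
    return term1 - term2 + n * float((a ** 2).sum())

X = np.array([[2, 7], [6, 3], [5, 1], [7, 4]])
a = np.array([3.0, 2.0])
print(tv_direct(X, a), tv_from_counts(X, a))                            # both give 61.0
```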

11 From equation 7, f(a) = TV(a) - TV(μ)
= |R| (Σ_d (a_d a_d - μ_d μ_d) - 2 Σ_d (a_d μ_d - μ_d μ_d))
= |R| (Σ_d a_d² - 2 Σ_d μ_d a_d + Σ_d μ_d²)
= |R| |a-μ|², so f(μ) = 0.
g(a) ≡ HDTV(a) = ln(f(a)) = ln|R| + ln|a-μ|².
The value of g(a) depends only on the length of a-μ, so its isobars are hyper-circles centered at μ; the graph of g is a log-shaped hyper-funnel.
For an ε-contour ring (radius ε about a), go inward and outward along a-μ by ε to the points: inner point b = μ + (1 - ε/|a-μ|)(a-μ) and outer point c = μ + (1 + ε/|a-μ|)(a-μ). Then take g(b) and g(c) as the lower and upper endpoints of a vertical interval, and use the EIN formulas on that interval to get a mask P-tree for the ε-contour (which is a well-pruned superset of the ε-neighborhood of a).
[Figure: the HDTV funnel g(x) over the x1-x2 plane, with the ε-contour about a and the interval [g(b), g(c)] marked.]
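A sketch of the interval computation (names are assumptions; only |b-μ| = |a-μ|-ε and |c-μ| = |a-μ|+ε matter for the endpoints):

```python
import numpy as np

# Sketch: endpoints [g(b), g(c)] of the HDTV interval used to mask the eps-contour about a.
def hdtv_interval(a, mu, eps, nR):
    r = np.linalg.norm(a - mu)                   # |a - mu|; assumes eps < r
    g_b = np.log(nR) + np.log((r - eps) ** 2)    # g(b): inner point, at distance r - eps from mu
    g_c = np.log(nR) + np.log((r + eps) ** 2)    # g(c): outer point, at distance r + eps from mu
    return g_b, g_c

print(hdtv_interval(np.array([3.0, 4.0]), np.array([0.0, 0.0]), eps=0.5, nR=1000))
```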

12 If the HDTV circumscribing contour of a is still too populous, use a circumscribing A_d-contour (note: A_d is not a derived attribute at all, but just A_d itself, so we already have its basic P-trees). As pre-processing, calculate basic P-trees for the HDTV derived attribute (or for another hyper-circular contour derived attribute). To classify a:
1. Calculate b and c (they depend on a and ε).
2. Form the mask P-tree for the training points with HDTV-values in [HDTV(b), HDTV(c)].
3. Use that P-tree to prune out the candidate NNS.
4. If the count of candidates is small, proceed to scan and assign class votes using the Gaussian vote function; else prune further using a dimension projection (an A_d-contour).
We can also note that HDTV can be further simplified (retaining the same contours) using h(a) = |a-μ|. Since we create the derived attribute by scanning the training set, why not just use this very simple function? Others leap to mind, e.g., h_b(a) = |a-b|. (Use the voting function G(x) = Gauss(|x-a|) - Gauss(ε), where Gauss(r) = (1/(std·√(2π))) e^(-(r-mean)²/2var), and std, mean, var are with respect to the set of distances from a of the voters, i.e., {r = |x-a| : x a voter}.)
[Figure: the ε-contour about a on the HDTV funnel, the interval [HDTV(b), HDTV(c)], and the contour of the dimension projection f(a) = a_1.]
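The four steps can be sketched end to end. Everything here is an assumed toy stand-in: arrays instead of P-trees, Euclidean distance, and a simple Gaussian-minus-cutoff vote.

```python
import numpy as np

# Sketch of the 4-step classification: prune by the HDTV interval, then Gaussian-vote.
def classify(a, X, y, eps, std):
    mu = X.mean(axis=0)
    r = np.linalg.norm(a - mu)
    g = np.log(len(X)) + np.log(((X - mu) ** 2).sum(axis=1))          # HDTV of each training point
    lo = np.log(len(X)) + np.log((r - eps) ** 2)                      # HDTV(b)
    hi = np.log(len(X)) + np.log((r + eps) ** 2)                      # HDTV(c)
    cand = (g >= lo) & (g <= hi)                                      # steps 1-3: the candidate mask
    d = np.linalg.norm(X[cand] - a, axis=1)                           # step 4: scan the candidates
    votes = np.clip(np.exp(-d**2 / (2*std**2)) - np.exp(-eps**2 / (2*std**2)), 0, None)
    labels = y[cand]
    return max(set(labels.tolist()), key=lambda c: votes[labels == c].sum())

X = np.random.rand(500, 3) + 2.0
y = np.random.randint(0, 2, 500)
print(classify(X[0] + 0.01, X, y, eps=0.3, std=0.2))
```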

13 Graphs of functionals with hyper-circular contours: TV (with TV(μ) = TV(x33) and TV(x15) marked), TV - TV(μ), h(a) = |a-μ|, h_b(a) = |a-b|, and HDTV.
[Figure: surface plots of these functionals over the X-Y plane (axes 1..5), showing their common hyper-circular isobars.]

14 Angular Variation functionals: e.g., AV(a) ≡ (1/|a|) Σ_{x∈R} x∘a. With d an index over the dimensions:
= (1/|a|) Σ_{x∈R} Σ_{d=1..n} x_d a_d
= (1/|a|) Σ_d (Σ_x x_d) a_d   (factor out a_d)
= (|R|/|a|) Σ_{d=1..n} ((Σ_x x_d)/|R|) a_d
= (|R|/|a|) Σ_{d=1..n} μ_d a_d
= (|R|/|a|) μ∘a.
COS_θ(a) ≡ AV(a)/(|μ||R|) = μ∘a/(|μ||a|) = cos(θ_aμ). COS_θ (and AV) have hyper-conic isobars centered on μ. COS_θ and AV have ε-contour(a) = the space between two hyper-cones centered on μ which just circumscribes the Euclidean ε-hyperdisk at a.
Graphs of functionals with hyper-conic contours: e.g., COS_b(a) for any vector b.
[Figure: the cone about μ through a, its intersection (in pink) with the HDTV ε-contour, and the angle θ_aμ.]
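These functionals are one line each in code; the sketch below (assumed toy data) also checks the closed form AV(a) = (|R|/|a|) μ∘a.

```python
import numpy as np

# Sketch: AV(a) and COS_theta(a), plus a check of AV(a) = (|R|/|a|) mu.a.
def AV(X, a):
    return float(X.dot(a).sum()) / np.linalg.norm(a)       # (1/|a|) sum_x x.a

def COS_theta(X, a):
    mu = X.mean(axis=0)
    return float(mu.dot(a)) / (np.linalg.norm(mu) * np.linalg.norm(a))

X = np.random.rand(50, 3)
a = np.array([1.0, 2.0, 0.5])
mu = X.mean(axis=0)
print(AV(X, a), len(X) * mu.dot(a) / np.linalg.norm(a))    # the two agree
print(COS_theta(X, a))
```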

15 Adding up the Gaussian weighted votes for class c. For a training point x, let f(a)_x = (x-a)∘(x-a). With d an index over the dims and i, j, k bit-slice indexes:
f(a)_x = Σ_{d=1..n} (x_d² - 2 a_d x_d + a_d²)
= Σ_{d=1..n} (Σ_k 2^k x_dk)² - 2 Σ_{d=1..n} a_d (Σ_k 2^k x_dk) + |a|²
= Σ_d (Σ_i 2^i x_di)(Σ_j 2^j x_dj) - 2 Σ_{d,k} 2^k a_d x_dk + |a|²
= Σ_d Σ_{i,j} 2^(i+j) x_di x_dj - 2 Σ_{d,k} 2^k a_d x_dk + |a|²
f(a)_x = Σ_{i,j,d} 2^(i+j) (P_di∧dj)_x - Σ_k 2^(k+1) Σ_d a_d (P_dk)_x + |a|².
So β exp(-α f(a)_x) = β exp(-α Σ_{i,j,d} 2^(i+j) (P_di∧dj)_x) · exp(-α |a|²) · exp(α Σ_k 2^(k+1) Σ_d a_d (P_dk)_x).
Adding up the Gaussian weighted votes for class c:
Σ_{x∈c} β exp(-α f(a)_x) = β Σ_{x∈c} ( exp(-α Σ_{i,j,d} 2^(i+j) (P_di∧dj)_x) · exp(-α |a|²) · exp(α Σ_k 2^(k+1) Σ_d a_d (P_dk)_x) ).
Dropping the constant factor β exp(-α |a|²), which is the same for every class:
∝ Σ_{x∈c} exp( -α Σ_{i,j,d} 2^(i+j) (P_di∧dj)_x + Σ_{k,d} α 2^(k+1) a_d (P_dk)_x )
Collecting the diagonal (i=j) terms inside the exp:
= Σ_{x∈c} exp( Σ_{i≠j,d} -α 2^(i+j) (P_di∧dj)_x + Σ_{i=j,d} (α a_d 2^(i+1) - α 2^(2i)) (P_di)_x ).
Inside the exp we have coefficients which do not involve x, each multiplied by a 1-bit or a 0-bit depending on x; thus for fixed i, j, d we either have the x-independent coefficient (if the bit is 1) or we don't (if the bit is 0):
= Σ_{x∈c} ( Π_{i≠j,d} exp(-α 2^(i+j) (P_di∧dj)_x) · Π_{i=j,d} exp(α (a_d 2^(i+1) - 2^(2i)) (P_di)_x) )
= Σ_{x∈c} ( Π_{i≠j,d : (P_di∧dj)_x = 1} exp(-α 2^(i+j)) · Π_{i=j,d : (P_di)_x = 1} exp(α (a_d 2^(i+1) - 2^(2i))) )   (eq 1)
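A quick numerical check of the factorization of a single vote weight exp(-α f(a)_x) into bit-slice factors (toy x, a, α, and bit width are assumptions):

```python
import numpy as np

# Sketch: verify exp(-alpha*|x-a|^2) = exp(-alpha*t1) * exp(alpha*t2) * exp(-alpha*|a|^2),
# where t1, t2 are the bit-slice sums from this slide, for one training point x.
alpha, b = 0.1, 3
x = np.array([5, 2, 7])
a = np.array([3.0, 1.0, 4.0])

direct = np.exp(-alpha * float(((x - a) ** 2).sum()))

bits = (x[:, None] >> np.arange(b)) & 1                    # (P_dk)_x for this x
t1 = sum(2**(i + j) * int(bits[d, i] * bits[d, j])
         for d in range(3) for i in range(b) for j in range(b))
t2 = sum(2**(k + 1) * a[d] * int(bits[d, k]) for d in range(3) for k in range(b))
factored = np.exp(-alpha * t1) * np.exp(alpha * t2) * np.exp(-alpha * float((a**2).sum()))
print(direct, factored)                                    # the two agree
```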

16 Suppose there are two classes, red (-) and green (+), on the ε-cylinder shown. Then the vector connecting the medians (vcm) in YZ space is shown in purple, and the unit vector in the direction of the vector connecting the medians (uvcm) in YZ space is shown in blue. The vector from the midpoint of the medians to s is in orange. The inner product of the blue and the orange vectors is the same as the inner product we would get by doing it in 3D! The point is that the x-component of the red vector of medians and that of the green are identical, so the x-component of the vcm is zero. (A small vcm component means prune that dimension out!)
[Figure: the ε-cylinder with the red and green classes, the vectors described above, and the point s, in x-y-z space.]

