Our Approach: vertical, compressed data structures, variously called either Predicate-trees or Peano-trees.


1 Our Approach
Vertical data, processed horizontally (DBMSs process horizontal data vertically).
Vertical, compressed data structures, variously called either Predicate-trees or Peano-trees (Ptrees in either case).
Ptrees are data-mining-ready, compressed data structures that attempt to address the curse of scalability and the curse of dimensionality.
Ptree Technology is patent pending by North Dakota State University.

2 Predicate tree technology
Current practice: structure data into horizontal records and process them vertically (vertical scans).
Predicate tree technology: vertically project each attribute, then vertically project each bit position of each attribute, then compress each bit slice into a basic Ptree. The basic Ptrees are then ANDed horizontally.

Horizontally structured records R(A1 A2 A3 A4), scanned vertically (eight records; the last two are identical):
A1  A2  A3  A4
010 111 110 001
011 111 110 000
010 110 101 001
010 111 101 111
101 010 001 100
010 010 001 101
111 000 001 100
111 000 001 100

The vertical projections R[A1], R[A2], R[A3], R[A4] split into the bit slices R11 R12 R13, R21 R22 R23, R31 R32 R33, R41 R42 R43. For example, R11 (the first bit of A1 in every record) is 0 0 0 0 1 0 1 1.

The compressed (1-dimensional) Ptree version of R11, denoted P11, is built by recording the truth of the predicate "pure1" in a tree, recursively on halves, until purity is achieved. The compression of R11 into P11 goes as follows:
1. Whole is pure1? false → 0
2. Left half pure1? false → 0. But it is pure (pure0), so this branch ends.
3. Right half pure1? false → 0
4. Left half of right half pure1? false → 0
5. Right half of right half pure1? true → 1. It is pure, so this branch ends.
6. Left half of left half of right half pure1? true → 1
7. Right half of left half of right half pure1? false → 0

P11, level by level:
2^3-level (root): 0
2^2-level: 0 0
2^1-level: 0 1 (children of the right-hand node)
2^0-level: 1 0 (children of the left-hand node at the 2^1-level)

To count occurrences of the tuple (7, 0, 1, 4), use the bit pattern 111 000 001 100 and AND the basic Ptrees, taking the complement (') of each Ptree whose pattern bit is 0:
P11 ^ P12 ^ P13 ^ P'21 ^ P'22 ^ P'23 ^ P'31 ^ P'32 ^ P33 ^ P41 ^ P'42 ^ P'43
The result tree is 0 at the 2^3-level, 0 0 at the 2^2-level and 0 1 at the 2^1-level. The single 1 at the 2^1-level covers two records, so the count is 2.
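The slide's pipeline (bit-slice, compress, AND, count) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the patented implementation: the relation is taken to have eight records with the last record repeated (which is what R11 = 00001011 and the count of 2 imply), and the node encoding (1 for pure1, 0 for pure0, a pair for a mixed half) and all function names are hypothetical.

```python
from functools import reduce

# Assumed relation R(A1 A2 A3 A4): eight records, three bits per attribute.
ROWS = [
    "010 111 110 001", "011 111 110 000", "010 110 101 001", "010 111 101 111",
    "101 010 001 100", "010 010 001 101", "111 000 001 100", "111 000 001 100",
]

def bit_slices(rows):
    """Vertically project every bit position: returns R11, R12, ..., R43."""
    flat = [r.replace(" ", "") for r in rows]
    return [[int(r[j]) for r in flat] for j in range(len(flat[0]))]

def build(bits):
    """Pure1 tree on halves: 1 = pure1, 0 = pure0, (left, right) = mixed."""
    if all(bits):
        return 1
    if not any(bits):
        return 0
    mid = len(bits) // 2
    return (build(bits[:mid]), build(bits[mid:]))

def p_not(node):
    """Complement of a Ptree (the P' operands on the slide)."""
    if isinstance(node, int):
        return 1 - node
    return (p_not(node[0]), p_not(node[1]))

def p_and(a, b):
    """AND two Ptrees without decompressing; pure0 branches end immediately."""
    if a == 0 or b == 0:
        return 0
    if a == 1:
        return b
    if b == 1:
        return a
    left, right = p_and(a[0], b[0]), p_and(a[1], b[1])
    return 0 if left == 0 and right == 0 else (left, right)

def count(node, size):
    """Number of 1-bits represented by a Ptree covering `size` records."""
    if isinstance(node, int):
        return node * size
    return count(node[0], size // 2) + count(node[1], size // 2)

ptrees = [build(s) for s in bit_slices(ROWS)]
P11 = ptrees[0]

def and_for_pattern(pattern):
    """AND P_ij where the pattern bit is 1 and P'_ij where it is 0."""
    operands = [p if bit == "1" else p_not(p) for p, bit in zip(ptrees, pattern)]
    return reduce(p_and, operands)

def count_tuple(pattern):
    return count(and_for_pattern(pattern), len(ROWS))

print(P11)                          # (0, ((1, 0), 1))
print(count_tuple("111000001100"))  # 2
```

The AND never expands a pure0 branch, which is where the speed-up over scanning the uncompressed slices would come from.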

3 RunListTrees? (RLtrees)
To facilitate subsetting (isolating a subset) and processing, a Ptree structure can be constructed as follows:
R11 = 0 0 0 0 1 0 1 1
RL11 (run list, value:start position) = 0:000 1:100 0:101 1:110

Pure1 tree on R11:
1. Whole file is not pure1 → 0
2. 1st half is not pure1 → 0
3. 2nd half is not pure1 → 0
4. 1st half of 2nd half is not pure1 → 0
5. 2nd half of 2nd half is pure1 → 1
6. 1st half of 1st half of 2nd half is pure1 → 1
7. 2nd half of 1st half of 2nd half is not pure1 → 0

Or, a separate NotPure0 index tree (trees could be terminated at any level):
1. Whole file is true → 1
2. 1st half is false → 0
3. 2nd half is true → 1
4. 1st half of 2nd half is true → 1
5. 2nd half of 2nd half is true → 1
6. 1st half of 1st half of 2nd half is true → 1
7. 2nd half of 1st half of 2nd half is false → 0

First, AND the NP0trees. Only the 1-branches of the result need ANDing through list scans. The more operands, the fewer 1-branches.
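As a rough check of the numbers on this slide, the following Python sketch derives the run list of R11 and the node values of both trees in the slide's breadth-first order. The function names are hypothetical, and the traversal (descend only into mixed segments) is an assumption about how the slide's seven steps were generated.

```python
from collections import deque

R11 = [0, 0, 0, 0, 1, 0, 1, 1]

def run_list(bits):
    """(value, start) for every maximal run of equal bits."""
    return [(b, i) for i, b in enumerate(bits) if i == 0 or b != bits[i - 1]]

def node_values(bits, predicate):
    """Breadth-first truth values of `predicate`, descending only into
    mixed segments (segments that are neither pure0 nor pure1)."""
    values = []
    queue = deque([(0, len(bits))])
    while queue:
        lo, hi = queue.popleft()
        segment = bits[lo:hi]
        values.append(int(predicate(segment)))
        if any(segment) and not all(segment):  # mixed: split into halves
            mid = (lo + hi) // 2
            queue.append((lo, mid))
            queue.append((mid, hi))
    return values

pure1 = all      # universal: every position holds a 1
not_pure0 = any  # existential: some position holds a 1

print(" ".join(f"{v}:{start:03b}" for v, start in run_list(R11)))
# 0:000 1:100 0:101 1:110
print(node_values(R11, pure1))      # [0, 0, 0, 0, 1, 1, 0]
print(node_values(R11, not_pure0))  # [1, 0, 1, 1, 1, 1, 0]
```

The two value lists line up with steps 1 to 7 of the Pure1 tree and of the NotPure0 tree respectively.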

4 Other Indexes on RunLists
We could put Pure0-Run, Pure1-Run and even Mixed-Run (or LZV-Run) RunListIndexes on RL:
R11 = 0000101101010101
RL11 = 0:0 1:100 0:101 1:110 01:1000
P1RI11 (start:length) = 100:1 110:2
P0RI11 (start:length) = 000:4 101:1
PLZVRI11 = 1000:1, where the first field is the start and the second is the pattern length (# of consecutive replicas of the pattern)
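The Pure1-Run and Pure0-Run indexes can be reproduced with a short Python sketch. It is only an illustration: the function name is hypothetical, it is applied to the first eight bits of R11 (the part the slide's P1RI and P0RI entries cover), and the pattern-run (LZV) index is not attempted because the slide does not spell out how its length field is computed.

```python
from itertools import groupby

R11_PREFIX = [0, 0, 0, 0, 1, 0, 1, 1]  # first eight bits of R11

def pure_run_index(bits, value):
    """(start, length) of every maximal run of `value` in `bits`."""
    index, pos = [], 0
    for bit, group in groupby(bits):
        length = len(list(group))
        if bit == value:
            index.append((pos, length))
        pos += length
    return index

def show(index):
    """Format entries as on the slide: binary start, decimal length."""
    return " ".join(f"{start:03b}:{length}" for start, length in index)

print(show(pure_run_index(R11_PREFIX, 1)))  # 100:1 110:2
print(show(pure_run_index(R11_PREFIX, 0)))  # 000:4 101:1
```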

5 Best Pure1 tree Implementation? My guess circa 04jan
For n-D Pure1 trees:
1. At any node, if |1-bits| in the tuple set represented by the node < lower threshold, LT, then that node will simply show the 1List, the list of 1-bit positions (use a 0-bit if the count is 0), and have no children.
2. Else, if the tuple set represented by that node < UT = 2^(nm), an upper threshold, leave the bit-slice uncompressed (a P-sequence).
Building such Ptrees bottom up, using in-order ordering:
1. If the 1-count of the next UT-segment ≥ LT, install a P-sequence, else install a 1List.
2. If the current UT-segment node is numbered k*(2^n - 1) and it and all 2^n - 1 predecessors are 1Lists, and the cardinality of the union of said 1Lists < LT, install the union in the parent node. Recurse this collapsing process upward to the root.
Building such Ptrees top down:
1. For datasets larger than UT, recursively build down the pure1 tree.
2. If ever a node has < LT 1-bits, install the 1List and terminate that branch.
3. At the level where the represented tuple set = UT, install a 1List if |1-bits| < LT, else install a P-sequence.
Notes:
1. This method should extend well to data streams. When the data stream exceeds the current allocation (which, for n-D Ptrees, will be a power of 2^n), just grow the root of each Ptree to a new level, using the current Ptree as node 0. Nodes 1, 2, 3, …, 2^n - 1 of the new Ptree will start as 0 nodes without children and will grow as 1Lists until LT is reached; then they will be converted to P-sequences.
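The top-down rules can be sketched for the 1-D case (n = 1) as follows. This is a guess at one reading of the slide, not a reference implementation: LT and UT are passed in as plain integers, the input length is assumed to be a power of two, and the node tags "1List", "Pseq" and "Node" are hypothetical names.

```python
def build_top_down(bits, lt, ut, offset=0):
    """Hybrid Pure1 tree: sparse nodes become 1Lists, UT-sized nodes stay
    uncompressed P-sequences, anything larger is split into halves."""
    ones = [offset + i for i, b in enumerate(bits) if b]
    if not ones:
        return 0                    # "use a 0-bit" when there are no 1-bits
    if len(ones) < lt:
        return ("1List", ones)      # fewer than LT 1-bits: list positions
    if len(bits) <= ut:
        return ("Pseq", list(bits))  # at the UT level: keep the raw bits
    if len(ones) == len(bits):
        return 1                    # pure1 above the UT level
    mid = len(bits) // 2
    return ("Node",
            build_top_down(bits[:mid], lt, ut, offset),
            build_top_down(bits[mid:], lt, ut, offset + mid))

def one_positions(node, offset, size):
    """Recover the 1-bit positions, to show the structure is lossless."""
    if isinstance(node, int):
        return list(range(offset, offset + size)) if node else []
    tag = node[0]
    if tag == "1List":
        return list(node[1])
    if tag == "Pseq":
        return [offset + i for i, b in enumerate(node[1]) if b]
    half = size // 2
    return (one_positions(node[1], offset, half)
            + one_positions(node[2], offset + half, half))

bits = [0] * 8 + [0, 0, 0, 1] + [1, 1, 0, 1]
tree = build_top_down(bits, lt=2, ut=4)
print(tree)
# ('Node', 0, ('Node', ('1List', [11]), ('Pseq', [1, 1, 0, 1])))
print(one_positions(tree, 0, len(bits)))  # [11, 12, 13, 15]
```

With LT = 2 and UT = 4, the empty left half collapses to a single 0, the sparse segment becomes a one-entry 1List, and the dense segment stays as a raw 4-bit P-sequence.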

6 Ptrees
P-trees are all of the following:
Partition-tree: tree of nested partitions. The root is the entire bit sequence. Level-1 is a partition of the root, P(R) = {C_1..C_n}. In level-2 each level-1 component is partitioned by P(C_i) = {C_i,1..C_i,n_i}, i=1..n. In level-3 each level-2 component is partitioned by P(C_i,j) = {C_i,j,1..C_i,j,n_ij}, ...
Partition tree:
            R
        /   …   \
      C_1   …   C_n
     / … \     / … \
C_1,1…C_1,n_1  C_n,1…C_n,n_n
      ...
Predicate-tree: for a predicate on the leaf-nodes of a partition-tree (which also induces predicates on interior nodes using quantifiers). Predicate-tree nodes can be truth-values (Boolean P-tree). The predicate can be quantified existentially (1 or a threshold %) or universally.
Purity-tree: a universally quantified tree (the predicate is "a 1 at every (∀) position": Pure1-tree) or an existentially quantified tree (the predicate is "there exists (∃) a 1 position": NotPure0-tree). We will focus on P1trees.
All Ptrees shown so far were 1-dimensional (recursively partition by halving bit files), but they can be 2-D (recursively quartering, e.g., used for 2-D images), 3-D (recursively eighth-ing), …, or based on purity runs or LZW-runs or …
Vertical, compressed, lossless structures that facilitate fast horizontal AND-processing.
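To make the 2-D (quartering) case and the two quantifiers concrete, here is a small Python sketch. It assumes a square bit image whose side is a power of two and a quadrant order of upper-left, upper-right, lower-left, lower-right; the node encoding (a bare value for a pure quadrant, a (value, children) pair for a mixed one) and the function name are hypothetical.

```python
def quad_tree(img, quantifier, row=0, col=0, size=None):
    """Predicate tree by recursive quartering of a square bit image.
    `quantifier` is all (Pure1, universal) or any (NotPure0, existential)."""
    if size is None:
        size = len(img)
    cells = [img[r][c]
             for r in range(row, row + size)
             for c in range(col, col + size)]
    value = int(quantifier(cells))
    if all(cells) or not any(cells):
        return value                  # pure quadrant: branch ends here
    half = size // 2
    children = [quad_tree(img, quantifier, row + dr, col + dc, half)
                for dr, dc in ((0, 0), (0, half), (half, 0), (half, half))]
    return (value, children)

IMAGE = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
]

print(quad_tree(IMAGE, all))
# (0, [1, 0, (0, [1, 0, 0, 0]), (0, [0, 0, 0, 1])])
print(quad_tree(IMAGE, any))
# (1, [1, 0, (1, [1, 0, 0, 0]), (1, [0, 0, 0, 1])])
```

The two trees share one partition tree and differ only in the value stored at mixed nodes, which is the sense in which the quantifier induces the predicate on interior nodes.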

7 Best Ptree Implementation? My guess circa 04feb
Ptrees can be viewed "breadth first, bottom up". That is, one can view the leaf level (level-0) horizontally and then view level-1 as another bit vector in which each bit represents the truth of the predicate (e.g., pure1) applied to pairs of bits in the leaf level vector. This can be continued upwards for each successive level until the root is reached. The result is a bottom-up construction of the 1-D Ptree.
This construction has the advantage that the leaf level vector can be any length and can grow (e.g., for data streams). Each level up, e.g., level-k, can be grown as new 2^k-granules for that level fill up.
Additionally, the breadth-first bottom-up view of the 2-D Ptree for that bit vector is just the leaf vector with every other level above it left out. The breadth-first bottom-up view of the 3-D Ptree is just the leaf level, then leaving out 2 consecutive levels at a time, going up the levels. 4-D is the same, leaving out triples of levels at a time, etc.
Another advantage of this view is that stripping an additional k levels just above the leaf level amounts to using 2^k-length p-sequences at the leaf, as discussed on a previous slide.
The breadth-first bottom-up (B2U) view doubles the storage requirement over the storage requirement of p-sequences. However, storage is free, so it isn't a big concern. The capture time (on the DII side) is probably much higher, but that is a one-time cost.
On the face of it, compression is gone. However, we are talking about disk space again, which is free. In terms of ANDing speed, the compression is there (just don't descend on purity). However, one should have the NZ trees (for the existential case) as well, in order to distinguish purity. Note that we are replacing all pointers with offsets now, so distinguishing mixed from pure zero cannot be done by "no child pointers".
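A minimal Python sketch of the breadth-first bottom-up (B2U) view follows. It assumes that the pure1 value of a pair is the AND of the two bits below it and the NZ (existential) value is their OR, and that a level only gains an entry once a full pair beneath it exists; the function name and sample vectors are illustrative.

```python
from operator import and_, or_

def b2u_levels(leaf, combine):
    """Level 0 is the leaf bit vector; each higher level holds one bit per
    complete pair of bits in the level below."""
    levels = [list(leaf)]
    while len(levels[-1]) >= 2:
        prev = levels[-1]
        levels.append([combine(prev[i], prev[i + 1])
                       for i in range(0, len(prev) - 1, 2)])
    return levels

R11 = [0, 0, 0, 0, 1, 0, 1, 1]
print(b2u_levels(R11, and_))  # pure1 levels
# [[0, 0, 0, 0, 1, 0, 1, 1], [0, 0, 0, 1], [0, 0], [0]]
print(b2u_levels(R11, or_))   # NZ levels
# [[0, 0, 0, 0, 1, 0, 1, 1], [0, 0, 1, 1], [0, 1], [1]]

# 2-D view of a 16-bit leaf: keep the leaf and every other level above it.
leaf16 = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0]
print(b2u_levels(leaf16, and_)[::2])
# [[1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0], [1, 0, 0, 0], [0]]
```

For a leaf of length 2^k the upper levels together hold 2^k - 1 bits, which is the storage doubling mentioned on the slide. Where the pure1 level shows 0, the matching NZ bit tells mixed (1) from pure0 (0) without any child pointers.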

