Knowledge Discovery in Protected Vertical Information Dr. William Perrizo University Distinguished Professor of Computer Science North Dakota State University,

Knowledge Discovery in Protected Vertical Information Dr. William Perrizo University Distinguished Professor of Computer Science North Dakota State University, Fargo, ND William.perrizo@ndsu.edu

Vertical Data and pTrees Traditionally, a file or dataset comprises one or more horizontal records, each record comprises one or more fields, and each field comprises some number of bits that represent a piece of data. Traditionally, a dataset is structured horizontally as a set of such horizontal records, and is processed vertically, record-by-record. Alternatively, a file or dataset may be structured vertically as a set of columns and processed horizontally, column-by-column. A column-structured dataset may be further vertically decomposed as a set of bit-vectors or bitslices and processed horizontally, bitslice-by-bitslice using logical operations. A pTree is a tree-like organization of these bitslices having a zero bit value for each internal node and a zero or one bit value for each of its leaves (external nodes). The three basic pTree operations are AND, OR and complement. An example follows.

But it's pure0 so this branch ends 0 0 0 0 1 P 11 4. Left half of rt half ? false  0 0 2. Left half pure1? false  0 0 0 1. Whole thing pure1? false  0 5. Rt half of right half? true  1 0 0 1 R 11 0 1 predicate Trees (pTrees): slice by attribute or column Record truth of predicate: "purely 1-bits" in a tree, recursively on halves, until the half is pure. 3. Right half pure1? false  0 0 0 1 0 1 1 1 1 1 0 0 0 1 0 1 1 1 1 1 1 1 0 0 0 0 0 1 0 1 1 0 1 0 1 0 0 1 0 1 0 1 1 1 1 0 1 1 1 1 0 1 1 0 1 0 0 0 1 1 0 0 0 1 0 0 1 0 0 0 1 1 0 1 1 1 1 0 0 0 0 0 1 1 0 0 R 11 R 12 R 13 R 21 R 22 R 23 R 31 R 32 R 33 R 41 R 42 R 43 R[A 1 ] R[A 2 ] R[A 3 ] R[A 4 ] 010 111 110 001 011 111 110 000 010 110 101 001 010 111 101 111 011 010 001 100 010 010 001 101 111 000 001 100 vertically slice off each bit position (12 vertical structures) then compress each bit slice into a tree using a predicate e.g., the compression of R 11 into pTree, P 11 : P 11 pure1? false=0 pure1? true=1 pure1? false=0 1 st, Vertically Processing of Horizontal Data (VPHD) R(A 1 A 2 A 3 A 4 ) 2 7 6 1 6 7 6 0 3 7 5 1 2 7 5 7 3 2 1 4 2 2 1 5 7 0 1 4 for Horizontally structured, record-oriented data, one must scan vertically 010 111 110 001 011 111 110 000 010 110 101 001 010 111 101 111 011 010 001 100 010 010 001 101 111 000 001 100 = Base 10Base 2 P 11 P 12 P 13 P 21 P 22 P 23 P 31 P 32 P 33 P 41 P 42 P 43 0 0 0 0 1 1 0 0 0 0 0 0 1 01 10 0 1 0 0 1 0 0 0 0 1 0 01 0 1 0 0 0 0 1 0 0 0 1 0 0 1 0 01 0 0 0 0 01 0 0 0 0 1 0 0 10 01 ^^ ^ ^ ^ ^^ e.g., find the number of occurences of 7 0 1 4 =2 2 nd, using pTrees find the number of occurences of 7 0 1 4 To count (7,0,1,4) s use 111000001100 P 11 ^P 12 ^P 13 ^P’ 21 ^P’ 22 ^P’ 23 ^P’ 31 ^P’ 32 ^P 33 ^P 41 ^P’ 42 ^P’ 43 7 0 1 4 0 *2 3 0 0 *2 2 =2 0 1 *2 1 *2 0 = A pTree example

Securing Vertical pTree Data How can a pTree database be made secure without causing the processing to be slowed down (by, for instance, requiring a slow decryption step prior to processing)? A first level of security may be provided by re-ordering (e.g., permuting) the pTrees. The re-ordering method or permutation becomes the security key (those who know it can use the data and those who do not, cannot). For big data, key compression techniques might be necessary. A second level of security can be provided through random bit padding (at the bottom and/or top) Padding will randomly place the true starting and ending position of the pTree somewhere in the middle of the structure. This second level is intended to prevent an adversary from discovering the re-ordering by focusing on a smaller part of the dataset (e.g., the first bit or the last bit only). A third level of security is also possible when needed. That third level would involve re-ordering the bits within the actual pTrees. For datasets in which the ordering of the record instances is random or otherwise unknown, this would be unnecessary. However, if there is a known ordering, it might be possible for the clever adversary to discover the starting and ending positions of the pTrees. At this time, this third level seems problematic since it would probably require the encryption and therefore a time consuming decryption step prior to processing. There may be a way to re-order the pTree bits without introducing significant additional processing time? This is "future work". The "bottom" padding might be unnecessary from a security perspective (add little or no additional security, however, we believe it will make it much easier to secure volatile datasets (by providing a large bottom pad that can be overwritten as the dataset grows in cardinality)

Implementations This security scheme can be implemented per dataset or over the entire database as a whole. That is, for each dataset, there could be a massive string of bits in which all the pTrees of just that dataset are embedded (separate massive string of bits for each dataset), or there could be a much more massive string of bits in which all the pTrees of the database are embedded (and intermingled for additional security?). The latter might be the best choice since it does not introduce any additional processing time overhead and it does provide additional security. In general, a security monitor subsystem would be used to implement this security scheme. The security monitor would be responsible for identification, authentication and authorization. The security monitor could manage free space within the massive bit strings (be they per dataset or per database).

Conclusion This vertical database scheme involves at least two levels of security. The first level is a matter of disguising the location of each vertical bitslice structure (e.g. permuting the ordering of the bitslice). The second level is a matter of disguising the bit position location of the pTree within a bigger bit string (to make it harder to focus on the first bits only, etc.). An important fact is that neither of these two schemes imposes almost no additional processing time. The reason we can say that is pTree processing always involves 1. identifying those pTrees that are to be involved in the processing 2. locating those pTrees (which is always a matter of following a pointer and we simple shuffle the order of those pointers so that those who know the shuffle can find the data and those who do not, cannot. 3. loading the pertinent pTrees (which is just as fast as long as you know at which bit position to start the load. Those who know that start position, can load and those who do not, cannot.).

Conclusion continued A third possible level of security was mentioned, that of encrypting or otherwise anonymizing each pTree itself. Again, this may be unnecessary if there is no know or expected record ordering involved. One simple way to [somewhat] anonymize individual pTrees it to shuffle the order of the levels within each pTree. This would be metadata (manage by the security monitor) but once known, could be used with no or very very little additional processing delay, since each level of a multi-level pTree is processed separately and therefore the locating portion of the process involves a pointer and no additional processing would be required on that pointer (it is just that the correct level of the correct pTree to be located would not be in the expected place within the pTree as stored.). Other partial anonymizations, short of full encryption, might suffice in most cases.

Knowledge Discovery in Protected Vertical Information Dr. William Perrizo University Distinguished Professor of Computer Science North Dakota State University,

Similar presentations

Presentation on theme: "Knowledge Discovery in Protected Vertical Information Dr. William Perrizo University Distinguished Professor of Computer Science North Dakota State University,"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Knowledge Discovery in Protected Vertical Information Dr. William Perrizo University Distinguished Professor of Computer Science North Dakota State University,

Similar presentations

Presentation on theme: "Knowledge Discovery in Protected Vertical Information Dr. William Perrizo University Distinguished Professor of Computer Science North Dakota State University,"— Presentation transcript:

Similar presentations

About project

Feedback