
1 Partitioning – A Uniform Model for Data Mining
Anne Denton, Qin Ding, William Jockheck, Qiang Ding, and William Perrizo

2 Motivation
Databases and data warehouses are currently separate systems. Why?
Standard answer: details, details, details …
Our answer: a fundamental issue of representation

3 Relations Revisited
A relation R(A1, A2, …, AN) is a set of tuples.
Any choices at a fundamental level? Yes: a duality between
- an element-based representation
- a space-based representation

4 Duality
Element-based representation: the standard representation of tuples with all their attributes.
Space-based representation: the existence (or count) of a tuple is represented in its attribute space.

5 Similar Dualities in Physics
Particles can be represented by their positions: the particle picture.
At a more fundamental level, particles can be represented as 1-values in a grid of locations: the field picture.

6 Space-Based Representation
Consider standard tuples as vectors in the space of the attribute domains.
Represent each possible attribute combination by one bit:
- 1 if the data item is present
- 0 if it isn't
Allowing counts instead of single bits could be useful for projections.
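
A minimal sketch of this idea in Python; the relation, domains, and tuples are made up for illustration. The element-based set of tuples becomes a 0/1 assignment over every point of the attribute space.

```python
from itertools import product

# Toy relation R(color, size), given element-based as a set of tuples.
# The domains and tuples here are illustrative assumptions.
color_domain = ["red", "green", "blue"]
size_domain = ["S", "M", "L"]
tuples = {("red", "S"), ("blue", "M")}

# Space-based representation: one bit per point of the attribute space,
# 1 if that attribute combination occurs in the relation, 0 if it does not.
space = {point: int(point in tuples)
         for point in product(color_domain, size_domain)}

print(space[("red", "S")])    # 1: present
print(space[("green", "L")])  # 0: absent
```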

7 Space-Based Representation as a Partition
A partition is a set of mutually exclusive and collectively exhaustive subsets of elements.
The space-based representation partitions the attribute space into two sets:
- data items present in the database (1)
- data items not present (0)

8 Usefulness of Space-Based Representation
- No indexes needed: instant value-based access
- Index locking becomes dimensional locking
- Aggregation is very easy due to value-based ordering
- Selections become "and"s (see the sketch below)
What experience do we have with space-based representations?
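
A sketch of "selections become ANDs", assuming a toy relation and per-value bitmaps over its rows; the data is illustrative.

```python
# Per-value bitmaps over the rows of a toy relation; a conjunctive selection
# is then just a bitwise AND of bitmaps.
rows = [("red", "S"), ("blue", "M"), ("red", "M"), ("green", "L")]

def bitmap(attr_index, value):
    """Integer bit vector with bit i set iff row i has the given value."""
    bits = 0
    for i, row in enumerate(rows):
        if row[attr_index] == value:
            bits |= 1 << i
    return bits

# SELECT * WHERE color = 'red' AND size = 'M'  ->  AND of two bitmaps
hits = bitmap(0, "red") & bitmap(1, "M")
print([rows[i] for i in range(len(rows)) if hits >> i & 1])  # [('red', 'M')]
```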

9 Data Cube Representation
One value (e.g., sales) is given in the space of the key attributes:
- space-based with respect to the key attributes
- element-based with respect to the non-key attributes

10 Properties of the Domain Space
Ideally the space should have a distance, a norm, etc.; this is especially important for data mining.
Does that make sense for all domains? Can any domain be mapped to the integers?

11 Can All Domains Be Mapped to Integers?
Simplistic answer: yes! All information in a computer is stored as bits, and any sequence of bits can be interpreted as an integer.
Problems:
- The order may be irrelevant, e.g., hair color
- The order may be wrong, e.g., the sign bit of an int
- Even if the order is correct, the spacing may vary, e.g., float (solution in the paper: intervalization, sketched below)
- The domain may be very large, e.g., movies
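
A minimal sketch of intervalization: map a float attribute onto interval indices so the integer encoding keeps the order with uniform spacing. The equal-width binning and bin count are assumptions for illustration, not necessarily the paper's exact scheme.

```python
# Map floats to equal-width interval indices; order is preserved and the
# resulting integers are uniformly spaced.
def intervalize(values, bins=8):
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0          # avoid division by zero
    return [min(int((v - lo) / width), bins - 1) for v in values]

print(intervalize([0.1, 0.15, 3.7, 9.9]))    # [0, 0, 2, 7]
```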

12 Categorical Attributes (Irrelevant Order)
We need more than one attribute for an appropriate representation.
- Common data mining solution: one attribute per domain value
- Our solution: one attribute per bit slice
Values are corners of a hypercube in log(domain size) dimensions, and distances are given through the MAX metric (see the sketch below).
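
A sketch of the bit-slice encoding for a categorical domain; the domain and its order are illustrative. Each value becomes a corner of a hypercube with ceil(log2(|domain|)) dimensions, and under the MAX (Chebyshev) metric any two distinct values are at distance 1, so the irrelevant order does not distort distances.

```python
import math

# One attribute per bit slice for a categorical domain (illustrative values).
domain = ["black", "blond", "brown", "red"]
bits = max(1, math.ceil(math.log2(len(domain))))

def encode(value):
    """Hypercube corner for a categorical value: one 0/1 coordinate per slice."""
    code = domain.index(value)
    return [(code >> b) & 1 for b in range(bits)]

def max_distance(a, b):
    """Distance between two corners under the MAX (Chebyshev) metric."""
    return max(abs(x - y) for x, y in zip(encode(a), encode(b)))

print(encode("brown"))               # [0, 1]  (code 2 in 2 bits)
print(max_distance("black", "red"))  # 1: all distinct values are equidistant
```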

13 Fundamental Partition (Space-Based Representation)
- Number of dimensions = number of attributes
- Number of represented points = product of all domain sizes
=> exponential in the number of dimensions!
=> We badly need compression!

14 How Do We Handle Size?
The problem is exponential in the number of attributes => how can we reduce the number of attributes?
Review normalization: we can decompose a relation into a set of relations, each of which contains the entire key and one other attribute.
This decomposition is lossless and dependency preserving (for BCNF relations only).

15 Compression for Non-Key Attributes
The fundamental partition contains only one non-zero data point along any non-key dimension.
Represent the number by bit slices. Note: this works for both numerical and categorical attributes.
Original values can be regained by "and"ing, e.g., 5 (binary 101) is bit 0 & bit 1' & bit 2 (see the sketch below).
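
A sketch of bit-sliced storage for a non-key column, with one bit vector per bit position. Bit 0 is taken as the most significant slice to match the slide's "5 (binary 101) is bit 0 & bit 1' & bit 2" example; the column values and 3-bit width are illustrative assumptions.

```python
# Bit-sliced storage: one bit vector per bit position of the column values.
column = [5, 3, 5, 0]
WIDTH = 3

def slice_vector(position):
    """Integer bit vector with bit i set iff row i's value has this slice = 1."""
    mask = 1 << (WIDTH - 1 - position)        # position 0 = most significant bit
    bits = 0
    for i, v in enumerate(column):
        if v & mask:
            bits |= 1 << i
    return bits

all_rows = (1 << len(column)) - 1
# Rows whose value is exactly 5 = binary 101: bit 0 AND (NOT bit 1) AND bit 2.
rows_eq_5 = slice_vector(0) & (all_rows & ~slice_vector(1)) & slice_vector(2)
print([i for i in range(len(column)) if rows_eq_5 >> i & 1])   # [0, 2]
```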

16 Concept Hierarchies
The bit-sliced representation has significant benefits beyond compression: bit slices can be combined into concept hierarchies.
- Highest level: bit 0
- Next level: bit 0 & bit 1
- Next level: bit 0 & bit 1 & bit 2
(see the sketch below)
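
A sketch of how the same bit slices act as a concept hierarchy: matching only the leading k bits of a value groups rows into coarser (small k) or finer (large k) ranges of the domain. The column and 3-bit width are illustrative assumptions.

```python
# Bit slices read as a concept hierarchy over a 3-bit toy column.
column = [5, 3, 5, 0, 7]

def level_members(value, k, width=3):
    """Rows whose value agrees with `value` on the k most significant bits."""
    shift = width - k
    return [i for i, v in enumerate(column) if v >> shift == value >> shift]

print(level_members(5, 1))   # bit 0 only, values 4..7   -> [0, 2, 4]
print(level_members(5, 2))   # bits 0 & 1, values 4..5   -> [0, 2]
print(level_members(5, 3))   # all three bits, exactly 5 -> [0, 2]
```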

17 Compression for Key Attributes
Database-state-independent compression could lead to information loss (counts > 1).
Database-state-dependent compression: a tree structure that eliminates pure subtrees => P-trees (sketched below).
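
A minimal P-tree sketch: recursively halve a bit sequence and collapse "pure" subtrees (all 0s or all 1s) into single leaves. A 1-D layout over a toy bit sequence is assumed here; the actual structure fans out over the key-attribute space.

```python
# Collapse pure subtrees of a bit sequence into single-bit leaves.
def build_ptree(bits):
    if all(b == bits[0] for b in bits):        # pure: store just one bit
        return bits[0]
    mid = len(bits) // 2
    return [build_ptree(bits[:mid]), build_ptree(bits[mid:])]

print(build_ptree([1, 1, 1, 1, 0, 1, 0, 0]))   # [1, [[0, 1], 0]]
```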

18 Other Ideas
Compression is better if attribute values are dense within their domain, so we could use the extent of the occurring values as the domain:
- compression is good
- problems with insertion: storage must be reorganized and index locking has to be reintroduced
…

19 How Good Is Compression So Far?
If all domains are "dense", i.e., all values occur: the size can easily be smaller than the original relation.
If non-key attributes are "sparse": not usually a problem, compression is good; problems arise only in extreme cases, e.g., movies as attribute values!
If key attributes are "sparse": larger potential for problems, but also a large potential for benefit (see data cubes).

20 Are Key Attributes Usually Sparse?
Many key attributes are dense ("structure" attributes as keys):
- automatically generated IDs are usually sequential
- x and y in spatial data mining
- time in data streams
Keys in tables that represent relationships tend to be sparse ("feature" attributes as keys):
- student / course offering / grade
- data cubes!

21 What Have We Gained? (Database Aspects)
- The data simultaneously acts as an index
- No separate index locking (unless the extent domain is used)
- All information is saved as bit patterns: easy "select"
- Other database operations discussed in class

22 What Have We Gained? (Feature Attribute Keys)
Direct mining is possible on relations with feature attribute keys, e.g., student / course offering / grade; rollup can be defined, etc.
Clustering, classification, and ARM can make use of the proximity inherent in the representation.
The bit-wise representation provides a concept hierarchy for non-key attributes; the tree structure provides a concept hierarchy for key attributes.

23 What Have We Gained? (Structure Attribute Keys)
For relations with structure attribute keys, mining requires "and"ing, which produces counts for the feature attributes.
The bit-wise representation provides a concept hierarchy for non-key attributes.
Duality: concept hierarchies in this representation map exactly to the tree structure when the attribute is a key.

24 Mapping Concept Hierarchies: Bit Slices <=> Tree
P-tree: take the key attributes, e.g., x and y, and bit-interleave them:
x = 1 0 0 1
y = 1 1 0 1
=> 1 1 0 1 0 0 1 1
Any two of these digits form a level in the P-tree, or a level in a concept hierarchy (see the sketch below).
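
A sketch of the bit interleaving from this slide, assuming most-significant bits first and the x bit before the y bit at each position; each consecutive pair of interleaved bits then corresponds to one P-tree level.

```python
# Interleave the bits of two key attributes, x bit before y bit at each level.
def interleave(x_bits, y_bits):
    out = []
    for xb, yb in zip(x_bits, y_bits):
        out.extend([xb, yb])
    return out

print(interleave([1, 0, 0, 1], [1, 1, 0, 1]))  # [1, 1, 0, 1, 0, 0, 1, 1]
```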

25 How Could We Use That Duality?
Join with other relations and project off the key attributes (Meta P-trees). Can we do that? We lose uniqueness.
- We can use 1 to represent one or more tuples (equivalent to relational algebra)
- Or we can introduce counts, which can be useful for data mining
The need for non-duplicate-eliminating counts also exists in other applications.

26 How Do Hierarchies Benefit Us in Databases?
- Multi-granularity locking
- Subtrees form suitable units for storage in a block
- Fast access! Access time is proportional to the number of levels in the tree, or the number of bits for bit slices

27 Summary
The space-based representation has many benefits:
- value-based access and storage
- no separate index needed
- rollups are easy
P-trees:
- follow from systematic compression
- benefit from concept hierarchies

