1 Tough Choices. Materialize nothing: compute every cell on demand; worst query response time; no space requirements. Materialize part of the data cube: many cells are computable from other cells, but which cells to materialize? More cells = better query performance. Materialize the entire data cube: best query response time; excessive space requirements.
2 Data Value Hypercube versus ordinary data cubes. DATA VALUE HYPERCUBES store data-record indices, whereas existing data cubes can only store data aggregates. DATA VALUE HYPERCUBES are generated as quickly as existing data cubes.
3 Remember this? Now it doesn’t matter. [Figure: OLTP and OLAP; UNSTRUCTURED DATA and STRUCTURED DATA. Labels: Multi-Dimensional Databases, XML, EDI, Spreadsheets, Web Pages, RSS, Web Log, Voice recognition, Instant Messaging, Wikis, Content Management, Document Management, Taxonomies, Ontologies, Multimedia, Legacy Databases, Relational Databases, Main Frame Databases; +80%, -80%.]
4 Hypercube. Hypercubes are constructed so that each cell corresponds to a unique combination of database attribute values. 3 attributes require at least 2^3 = 8 cells.
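The count of 8 follows from deciding, for each of the 3 attributes, whether it is kept or rolled up. A minimal sketch, assuming the Customer, Part, and Supplier attributes used later in the deck; `ATTRIBUTES` and `all_views` are illustrative names, not part of the original slides:

```python
from itertools import combinations

# Assumed attribute names, taken from the deck's later example.
ATTRIBUTES = ("Customer", "Part", "Supplier")

def all_views(attributes):
    """Return every subset of the attributes, from the empty view to the full one."""
    views = []
    for size in range(len(attributes) + 1):
        views.extend(combinations(attributes, size))
    return views

views = all_views(ATTRIBUTES)
for view in views:
    # The empty combination is the "None" view.
    print("".join(view) or "None")
print(len(views))
```

With d attributes the same enumeration yields 2^d views, which is why the number of cells grows quickly with the number of dimensions.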
6 [Figure: lattice of views. Node labels: CustomerPartSupplier; CustomerPart, PartSupplier; Customer, Part, Supplier; None.]
7 [Figure: lattice of views with their attribute values. CustomerPartSupplier: Boeing or Lockheed, with Cockpit, Jet Engine, or Wing, with Delta or FedEx. CustomerSupplier: Boeing or Lockheed, with Delta or FedEx. PartSupplier: Boeing or Lockheed, with Cockpit, Jet Engine, or Wing. CustomerPart: Cockpit, Jet Engine, or Wing, with Delta or FedEx. Supplier: Boeing, Lockheed. Customer: Delta, FedEx. Part: Cockpit, Jet Engine, Wing. None.]
9 Sales by view. This is entirely fictional data.
CustomerPartSupplier: Boeing/Cockpit/Delta $10; Boeing/Cockpit/FedEx $20; Lockheed/Cockpit/Delta $30; Lockheed/Cockpit/FedEx $40; Boeing/Jet Engine/Delta $50; Boeing/Jet Engine/FedEx $60; Lockheed/Jet Engine/Delta $70; Lockheed/Jet Engine/FedEx $80; Boeing/Wing/Delta $90; Boeing/Wing/FedEx $100; Lockheed/Wing/Delta $110; Lockheed/Wing/FedEx $120.
PartSupplier: Boeing/Cockpit $30; Boeing/Jet Engine $110; Boeing/Wing $190; Lockheed/Cockpit $70; Lockheed/Jet Engine $150; Lockheed/Wing $230.
CustomerPart: Cockpit/Delta $40; Cockpit/FedEx $60; Jet Engine/Delta $120; Jet Engine/FedEx $140; Wing/Delta $200; Wing/FedEx $220.
CustomerSupplier: Boeing/Delta $150; Boeing/FedEx $180; Lockheed/Delta $210; Lockheed/FedEx $240.
Part: Cockpit $100; Jet Engine $260; Wing $420.
Supplier: Boeing $330; Lockheed $450.
Customer: Delta $360; FedEx $420.
All: $780.
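Every smaller view on this slide can be derived from the twelve CustomerPartSupplier cells by summing. The sketch below, with illustrative names (`BASE`, `roll_up`, `views`) that are not from the slides, rolls the base cells up to all eight views. View names are built in Supplier, Part, Customer order, so the slide's "CustomerPart" appears here as "PartCustomer".

```python
from collections import defaultdict
from itertools import combinations

DIMENSIONS = ("Supplier", "Part", "Customer")

# Base cells copied from the CustomerPartSupplier view on the slide.
BASE = {
    ("Boeing", "Cockpit", "Delta"): 10,
    ("Boeing", "Cockpit", "FedEx"): 20,
    ("Lockheed", "Cockpit", "Delta"): 30,
    ("Lockheed", "Cockpit", "FedEx"): 40,
    ("Boeing", "Jet Engine", "Delta"): 50,
    ("Boeing", "Jet Engine", "FedEx"): 60,
    ("Lockheed", "Jet Engine", "Delta"): 70,
    ("Lockheed", "Jet Engine", "FedEx"): 80,
    ("Boeing", "Wing", "Delta"): 90,
    ("Boeing", "Wing", "FedEx"): 100,
    ("Lockheed", "Wing", "Delta"): 110,
    ("Lockheed", "Wing", "FedEx"): 120,
}

def roll_up(base, keep):
    """Sum the base cells, grouping by the dimension positions listed in keep."""
    totals = defaultdict(int)
    for key, sales in base.items():
        totals[tuple(key[i] for i in keep)] += sales
    return dict(totals)

views = {}
for size in range(len(DIMENSIONS) + 1):
    for keep in combinations(range(len(DIMENSIONS)), size):
        name = "".join(DIMENSIONS[i] for i in keep) or "All"
        views[name] = roll_up(BASE, keep)

print(views["Part"])
print(views["All"])
```

The totals agree with the slide, for example $780 for All and $330 and $450 for the two suppliers.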
10 Lattice Notation. A lattice is denoted as (L, <=). L = the set of elements (queries). <= is the dependence relation. ancestor(a) = {b | a <= b}. descendant(a) = {b | b <= a}. Every element is its own descendant and ancestor. next(a) = the immediate proper ancestors of a: next(a) = {b | a < b, and there is no c with a < c and c < b}.
11 Lattice Diagrams Lattice diagrams are graphs. Elements are nodes. There is an edge from a to b iff b is in next(a). There is a path downward from y to x iff x <= y.
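The lattice operations can be tried out on the three-attribute example. This is a sketch under one assumption: each query is modelled as the set of attributes it groups by, and a <= b is taken to mean that a's attributes are a subset of b's. The names `ELEMENTS`, `leq`, `ancestors`, `descendants`, and `next_of` are illustrative.

```python
from itertools import combinations

ATTRIBUTES = ("part", "supplier", "customer")

# Every query is modelled as the frozenset of attributes it groups by.
ELEMENTS = [frozenset(c)
            for n in range(len(ATTRIBUTES) + 1)
            for c in combinations(ATTRIBUTES, n)]

def leq(a, b):
    """a <= b, assumed here to mean: a's attributes are a subset of b's."""
    return a <= b

def ancestors(a):
    return {b for b in ELEMENTS if leq(a, b)}

def descendants(a):
    return {b for b in ELEMENTS if leq(b, a)}

def next_of(a):
    """Immediate proper ancestors: a < b with no c strictly between a and b."""
    proper = [b for b in ELEMENTS if leq(a, b) and a != b]
    return {b for b in proper
            if not any(leq(c, b) and c != b for c in proper)}

# One diagram edge from a to b for every b in next(a).
edges = [(a, b) for a in ELEMENTS for b in next_of(a)]

part = frozenset({"part"})
print(sorted(sorted(b) for b in next_of(part)))
print(len(edges))
```

For three attributes the diagram has 8 nodes and 12 edges, and next((part)) contains exactly (part, supplier) and (part, customer).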
12 Hypercube Algebra. Simple data warehouse example. Parts are purchased from suppliers and then sold to customers. Three dimensions: Part, Supplier, and Customer. The measure of interest is total sales. For each cell (p, s, c), store the total sales of part p that was bought from supplier s and sold to customer c. Users are interested in consolidated sales. Example: what is the total sales of a given part p to a given customer c? This query is answered by looking up the value in cube cell (p, ALL, c). Many cells are computable from other cells; these are dependent cells. Example: cell (p, ALL, c) is the sum of cells (p, s1, c), …, (p, sn, c). [Table: CustomerPart sales. Cockpit/Delta $40; Cockpit/FedEx $60; Jet Engine/Delta $120; Jet Engine/FedEx $140; Wing/Delta $200; Wing/FedEx $220.]
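The dependent-cell rule can be written as a single function. The sketch below computes a cell on demand from the base cells rather than looking it up in a materialized cube; `ALL`, `BASE`, and `cell` are illustrative names, and the base values are the fictional sales figures used in this deck, keyed as (part, supplier, customer).

```python
ALL = "ALL"

# Base cells (p, s, c) -> total sales, using the deck's fictional data.
BASE = {
    ("Cockpit", "Boeing", "Delta"): 10,
    ("Cockpit", "Boeing", "FedEx"): 20,
    ("Cockpit", "Lockheed", "Delta"): 30,
    ("Cockpit", "Lockheed", "FedEx"): 40,
    ("Jet Engine", "Boeing", "Delta"): 50,
    ("Jet Engine", "Boeing", "FedEx"): 60,
    ("Jet Engine", "Lockheed", "Delta"): 70,
    ("Jet Engine", "Lockheed", "FedEx"): 80,
    ("Wing", "Boeing", "Delta"): 90,
    ("Wing", "Boeing", "FedEx"): 100,
    ("Wing", "Lockheed", "Delta"): 110,
    ("Wing", "Lockheed", "FedEx"): 120,
}

def cell(p, s, c, base=BASE):
    """Value of cube cell (p, s, c); ALL in a position sums over that dimension."""
    wanted = (p, s, c)
    return sum(sales for key, sales in base.items()
               if all(w == ALL or w == k for w, k in zip(wanted, key)))

# (Cockpit, ALL, Delta) = (Cockpit, Boeing, Delta) + (Cockpit, Lockheed, Delta)
print(cell("Cockpit", ALL, "Delta"))
print(cell(ALL, ALL, ALL))
```

A cube that materializes (p, ALL, c) would return the same number from a lookup; the function shows what that stored value must equal.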
13 The Dependence Relation on Queries. Consider two queries Q1 and Q2. Q1 <= Q2 iff Q1 can be answered using only the results of Q2. We then say that Q1 is dependent on Q2. For example, the query (part) can be answered using only the results of the query (part, customer), so (part) <= (part, customer). Some queries are not comparable with each other using the <= operator. For example, (part) !<= (customer) and (customer) !<= (part). [Table: CustomerPart sales. Cockpit/Delta $40; Cockpit/FedEx $60; Jet Engine/Delta $120; Jet Engine/FedEx $140; Wing/Delta $200; Wing/FedEx $220.]
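As an illustration of the dependence relation, the sketch below answers (part) using only the (part, customer) results from the slide. It assumes the measure is a sum, for which the subset test on attributes is sufficient; `depends_on` and `answer_from` are illustrative names.

```python
def depends_on(q1, q2):
    """Q1 <= Q2, assumed here to hold when every attribute of Q1 appears in Q2."""
    return set(q1) <= set(q2)

def answer_from(view, view_attrs, query_attrs):
    """Answer query_attrs using only the materialized view over view_attrs."""
    if not depends_on(query_attrs, view_attrs):
        raise ValueError(f"{query_attrs} cannot be answered from {view_attrs}")
    keep = [view_attrs.index(a) for a in query_attrs]
    result = {}
    for key, sales in view.items():
        group = tuple(key[i] for i in keep)
        result[group] = result.get(group, 0) + sales
    return result

# The (part, customer) view, copied from the slide.
PART_CUSTOMER = {
    ("Cockpit", "Delta"): 40, ("Cockpit", "FedEx"): 60,
    ("Jet Engine", "Delta"): 120, ("Jet Engine", "FedEx"): 140,
    ("Wing", "Delta"): 200, ("Wing", "FedEx"): 220,
}

by_part = answer_from(PART_CUSTOMER, ("part", "customer"), ("part",))
print(by_part)
print(depends_on(("part",), ("customer",)))
```

The result matches the deck's Part view ($100, $260, $420), while `depends_on` reports that (part) and (customer) are not comparable in either direction.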
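14 B-TREE LOGIC: EASIER THAN IT LOOKS. [Figure: tree of letters, listed by row from the leaves up to the root: A C E G I K M O Q S U W Y Z; B F J N R V X; D L T; H P.]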
15 B-TREE LOGIC: B IS FOR BALANCED. GIVEN 3RD-ORDER B-TREE WITH THE NUMBERS: INSERT INSERT INSERT 51 Insert any number < 20 and becomes the root. Insert any number > 50 and becomes the root. Insert any number > 20 and < 50 and it becomes the root.
16 B-Tree Forest. Construction time for the tree forest is $O\left(\sum_{1 \le i \le d} \log n_i\right)$, where d is the number of query dimensions and n_i is the number of attributes in the database at level i.
17 B-Tree Forest. A Balanced B-Tree Forest is the data structure that is used to represent a Hypercube. Each dimension in the Hypercube is represented by a separate B-Tree. B-Trees are great for storing sparse data and have fast insertion and search characteristics: O(log n) per operation, so O(n log n) to insert n keys.
18 B-Tree Forest. A binary tree forest consists of multiple levels of binary trees. Each level represents a cube dimension. A binary tree consists of nodes: stems or leaves. Stem nodes point to left and right binary trees. Leaf nodes point to a linked list of fact table IDs. A linked list of fact table IDs points to fact table entries with identical attribute values. A depth-first search on a binary tree forest produces the result of a GROUP BY clause.
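The structure described above can be sketched in a few lines. This is not the author's implementation: plain Python dicts, traversed in sorted key order, stand in for the balanced trees, and a Python list stands in for the linked list of fact table IDs. `Leaf`, `insert`, and `group_by` are illustrative names, and the last fact row is a hypothetical repeat added only to show two IDs sharing one leaf.

```python
class Leaf:
    """Leaf payload: fact table IDs that share identical attribute values."""
    def __init__(self):
        self.fact_ids = []

def insert(forest, values, fact_id):
    """Walk one level per dimension, creating nodes as needed, then record the ID."""
    node = forest
    for value in values[:-1]:
        node = node.setdefault(value, {})
    node.setdefault(values[-1], Leaf()).fact_ids.append(fact_id)

def group_by(node, prefix=()):
    """Depth-first traversal in key order; yields (attribute values, fact IDs)."""
    for value in sorted(node):
        child = node[value]
        if isinstance(child, Leaf):
            yield prefix + (value,), child.fact_ids
        else:
            yield from group_by(child, prefix + (value,))

# (part, supplier, customer, sales); the last row is a hypothetical repeat.
FACTS = [
    ("Cockpit", "Boeing", "Delta", 10),
    ("Cockpit", "Boeing", "FedEx", 20),
    ("Cockpit", "Lockheed", "Delta", 30),
    ("Cockpit", "Boeing", "Delta", 15),
]

forest = {}
for fact_id, (part, supplier, customer, _sales) in enumerate(FACTS):
    insert(forest, (part, supplier, customer), fact_id)

rows = list(group_by(forest))
for key, fact_ids in rows:
    total = sum(FACTS[i][3] for i in fact_ids)
    print(key, fact_ids, total)
```

Each yielded row corresponds to one group of a GROUP BY over part, supplier, and customer, and the fact IDs lead straight back to the underlying records.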
19 B-Tree Forest in Reverse: A primer. [Figure: the sales views from slide 9 shown with three trees. Supplier Tree: Boeing, Lockheed. Parts Tree: Cockpit, Wing, Jet Engine. Customer Tree: Delta, FedEx.]
20 Extensive B-Trees Are Common. Suppliers: BOEING, GENERAL DYNAMICS, LOCKHEED MARTIN, HONEYWELL INT’L, NORTHROP GRUMMAN, UNITED TECHNOLOGIES. Parts: AVIONICS, ELEVATOR, JET ENGINE, AILERON, FLIGHT CONTROLS, STABILIZER, COCKPIT, FIN, FUSELAGE, RUDDER, WING, LANDING GEAR. Customers: SOUTHWEST, DHL, DELTA, VIRGIN, FEDEX. But let’s keep it simple for now.
21 Incoming Data Stream. [Figure: DATA FLOW. The twelve CustomerPartSupplier sales records arrive in 2 intervals of data flow, labelled Chunk 1 and Chunk 2. One chunk holds the Boeing Cockpit, Lockheed Cockpit, and Boeing Jet Engine records for Delta and FedEx ($10 to $60); the other holds the Lockheed Jet Engine, Boeing Wing, and Lockheed Wing records for Delta and FedEx ($70 to $120). The aggregate views from slide 9 are shown alongside.]
22 Setting up Fact & Dimension Tables. The incoming records (Chunk 1, Chunk 2) are encoded with the following tables.
Global String Table (UNSORTED), String: ID. Boeing: 0; Lockheed: 1; Cockpit: 2; Jet Engine: 3; Wing: 4; Delta: 5; FedEx: 6.
Supplier Dimension Table (SORTED), String: String ID, ID. Boeing: 0, 0; Lockheed: 1, 1.
Part Dimension Table (SORTED), String: String ID, ID. Cockpit: 2, 0; Jet Engine: 3, 1; Wing: 4, 2.
Customer Dimension Table (SORTED), String: String ID, ID. Delta: 5, 0; FedEx: 6, 1.
Fact Table columns: ID, Supplier, Part, Customer, Sales (Sales values up to $120).
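The encoding step can be sketched as follows. This is an assumption-laden illustration rather than the author's code: dimension IDs are assigned in sorted order, as the SORTED label suggests, and the record order follows the incoming data stream from the previous slide. `RECORDS`, `dimension_ids`, and `fact_table` are illustrative names.

```python
# Rebuild the twelve fictional records in the order they arrive:
# (supplier, part, customer, sales), with sales rising from $10 to $120.
RECORDS = []
sales = 10
for part in ("Cockpit", "Jet Engine", "Wing"):
    for supplier in ("Boeing", "Lockheed"):
        for customer in ("Delta", "FedEx"):
            RECORDS.append((supplier, part, customer, sales))
            sales += 10

def dimension_ids(values):
    """Assign IDs 0, 1, 2, ... to the distinct values in sorted order."""
    return {v: i for i, v in enumerate(sorted(set(values)))}

supplier_ids = dimension_ids(r[0] for r in RECORDS)
part_ids = dimension_ids(r[1] for r in RECORDS)
customer_ids = dimension_ids(r[2] for r in RECORDS)

# Fact table rows: (ID, Supplier, Part, Customer, Sales), strings replaced by IDs.
fact_table = [
    (fact_id, supplier_ids[s], part_ids[p], customer_ids[c], amount)
    for fact_id, (s, p, c, amount) in enumerate(RECORDS)
]

for row in fact_table:
    print(row)
```

Replacing strings with small integer IDs keeps the fact table compact and makes comparisons inside the trees cheap.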
23 Let’s just say ‘Parts’ is the most significant data of interest. [Figure: Fact Table with columns ID, Supplier, Part, Customer, Sales; IDs 0 to 11 with Sales $10 to $120 in steps of $10.]
24 Understanding Nested B-Trees. [Figure, repeated across animation steps: Fact Table with columns ID, Supplier, Part, Customer, Sales; IDs 0 to 11 with Sales $10 to $120 in steps of $10.]
25 Understanding Nested B-Trees. [Figure: the Fact Table (ID, Supplier, Part, Customer, Sales; $10 to $120) shown with the dimension tables (Supplier: Boeing 0, 0; Lockheed 1, 1. Part: Cockpit 2, 0; Jet Engine 3, 1; Wing 4, 2. Customer: Delta 5, 0; FedEx 6, 1) and the tree node labels Cockpit, Jet Engine, Wing; B B B L L L; D D D D D D F F F F F F.]
26 Making a B-Tree Forest: Drilling down the Hypercube to a Single Data Value. [Figure: the Fact Table (ID, Supplier, Part, Customer, Sales; $10 to $120) next to the forest. Part nodes: Cockpit, Jet Engine, Wing. Under each part: Boeing, Lockheed. Under each supplier: Delta, FedEx.]
27 Data Structure & Concept Side by Side. Do you see the Data Value Hypercube to the left? [Figure: the B-tree forest (Cockpit, Jet Engine, Wing; Boeing and Lockheed under each part; Delta and FedEx under each supplier) beside the lattice of views from slide 7.]
28 Network Data Stream. Record fields: Protocol, Content, ID, Destination IP, Source IP, Time Stamp. Only showing 2 out of 16 NETWORK DATA STREAM dimensions.
GLOBAL String Table, String: ID. SMB: 0; LDAP: 1; SSH: 2; AOL: 3; JPEG: 4; ENGLISH: 5; ZIP: 6; COMPRESS: 7; GIFF: 8; POP: 9; SMTP: 10; IMAP: 11; FTP: 12; TELNET: 13; SKYPE: 14; CMS: 15; FRENCH: 16; RUSSIAN: 17; BMP: 18; BASIC SOURCE: 19; C SOURCE: 20; DISCOVER: 21.
CONTENT Dimension Table, String: String Table ID, ID. BASIC SOURCE: 19, 0; BMP: 18, 1; C SOURCE: 20, 2; CMS: 15, 3; COMPRESS: 7, 4; DISCOVER: 21, 5; ENGLISH: 5, 6; FRENCH: 16, 7; GIFF: 8, 8; JPEG: 4, 9; RUSSIAN: 17, 10; ZIP: 6, 11.
PROTOCOL Dimension Table, String: String Table ID, ID. AOL: 3, 0; FTP: 12, 1; IMAP: 11, 2; LDAP: 1, 3; POP: 9, 4; SKYPE: 14, 5; SMB: 0, 6; SMTP: 10, 7; SSH: 2, 8; TELNET: 13, 9.
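The relationship between the global string table and the two dimension tables can be reproduced directly. In this sketch the global table keeps strings in the order shown on the slide, and each dimension table lists its members alphabetically with a fresh ID. Which strings belong to which dimension is supplied by hand here; `GLOBAL_STRINGS`, `PROTOCOLS`, and `dimension_table` are illustrative names.

```python
# Global string table in the order shown on the slide; the list index is the string ID.
GLOBAL_STRINGS = [
    "SMB", "LDAP", "SSH", "AOL", "JPEG", "ENGLISH", "ZIP", "COMPRESS",
    "GIFF", "POP", "SMTP", "IMAP", "FTP", "TELNET", "SKYPE", "CMS",
    "FRENCH", "RUSSIAN", "BMP", "BASIC SOURCE", "C SOURCE", "DISCOVER",
]
string_id = {s: i for i, s in enumerate(GLOBAL_STRINGS)}

# Membership of each dimension is given by hand for this illustration.
PROTOCOLS = {"SMB", "LDAP", "SSH", "AOL", "POP", "SMTP", "IMAP",
             "FTP", "TELNET", "SKYPE"}
CONTENT = set(GLOBAL_STRINGS) - PROTOCOLS

def dimension_table(members):
    """Sorted rows of (string, global string ID, dimension ID)."""
    return [(s, string_id[s], dim_id)
            for dim_id, s in enumerate(sorted(members))]

protocol_table = dimension_table(PROTOCOLS)
content_table = dimension_table(CONTENT)

for row in content_table:
    print(row)
```

Sorting gives each dimension a dense, ordered ID range that is independent of the order in which strings happened to arrive.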
29 B-TREE Notation. Example node: FTP B (1,3). FTP = Attribute Name; B = Node B; 1 = Level; 3 = Record Number.
30 NETWORK DATA STREAM: “Protocols” B-TREE. Nodes: POP B (1,9); AOL B (1,7); IMAP B (1,8); SKYPE B (1,4); FTP B (1,3); LDAP B (1,1); TELNET B (1,6); SMTP B (1,5); SSH B (1,2); SMB B (1,0).
31 Notation. Example node: BMP 4 B (7,9)(7,9)(7,9)(7,9). BMP = Attribute Name; 4 = Record Count; each (7,9) = (Chunk, Record Number). Tree nodes not only contain data aggregates but a linked list of data record indices.
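A node in this notation can be modelled as a small class. The sketch is illustrative only: `ValueNode` is a made-up name, a Python list stands in for the linked list of record indices, and the record count is derived from that list rather than stored separately. The sample indices are taken from the BASIC SOURCE entry that spans two chunks in the “Content” B-trees.

```python
from dataclasses import dataclass, field

@dataclass
class ValueNode:
    """Tree node for one attribute value."""
    name: str
    # (chunk, record number) pairs; a list stands in for the linked list.
    records: list = field(default_factory=list)

    @property
    def count(self):
        # The aggregate shown in the notation is the record count.
        return len(self.records)

    def add(self, chunk, record_number):
        self.records.append((chunk, record_number))

    def __str__(self):
        refs = " ".join(f"({c},{r})" for c, r in self.records)
        return f"{self.name} {self.count} {refs}"

node = ValueNode("BASIC SOURCE")
node.add(1, 15)   # last matching record of chunk 1
node.add(2, 0)    # first matching records of chunk 2
node.add(2, 1)
print(node)
```

Because the indices stay on the node, the matching records can be fetched directly instead of being located with a second query.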
32 “Content” B-Trees. Each entry reads: attribute name, record count, record indices.
B (1,0) AOL: C SOURCE 1 (1,4); BMP 1 (1,3); BASIC SOURCE 3 (1,0) (1,1) (1,2).
B (1,1) FTP: CMS 1 (1,5).
B (1,2) IMAP: COMPRESS 1 (1,6).
B (1,3) LDAP: DISCOVER 2 (1,7) (1,8).
B (1,4) POP: FRENCH 1 (1,9).
B (1,5) SKYPE: GIFF 1 (1,10).
B (1,6) SMB: JPEG 2 (1,11) (1,12).
B (1,7) AOL: RUSSIAN 1 (1,14).
B (1,8) SSH: ZIP 3 (2,10) (2,11) (2,12); C SOURCE 4 (2,3) (2,4) (2,5) (2,6); BMP 1 (2,2); BASIC SOURCE 3 (1,15) (2,0) (2,1); RUSSIAN 3 (2,7) (2,8) (2,9).
33 B-Tree Forest. [Figure: the “Protocols” B-tree (POP B (1,9); AOL B (1,7); IMAP B (1,8); SKYPE B (1,4); FTP B (1,3); LDAP B (1,1); TELNET B (1,6); SMTP B (1,5); SSH B (1,2); SMB B (1,0)) with a Pointer to the “Content” tree B (1,0) AOL (C SOURCE 1 (1,4); BMP 1 (1,3); BASIC SOURCE 3 (1,0) (1,1) (1,2)). In the pointer B (1,0): 1 = Level; 0 = Index of Tree at the same level.]
34 [Figure: the complete B-tree forest, combining the “Content” B-trees from slide 32 (B (1,0) AOL through B (1,8) SSH) with the “Protocols” B-tree from slide 30.]
35 Conclusion. B-tree forests are limited to data aggregates. Data aggregates only identify the existence of a dimensional combination; they do not provide access to complete data records. With current OLAP implementations, examining data records requires issuing additional database queries, which is inefficient. We solve this problem by extending a balanced B-tree forest to include references to data records. We call this new type of hypercube the data value cube. Thus, for our data cube, tree nodes contain not only data aggregates but also a linked list of data record indices.
36 THE Q&A Stephen A. Broeker