Capturing Database Transformations

1 Capturing Database Transformations
David Sergio Matusevich University of Houston DBMS GROUP

2 Organization
Introduction
Classification
Program
Case Study
Conclusions

3 Extending ER Models to Capture Database Transformations to Build Data Sets for Data Mining Analysis
Carlos Ordonez, Sofian Maabout, David Sergio Matusevich, Wellington Cabrera. Extending ER Models to Capture Database Transformations to Build Data Sets for Data Mining. Data & Knowledge Engineering (DKE), 2014, Elsevier.

4 Data Mining Projects
Data mining projects usually require the preparation of a dataset created specifically to answer the particular question asked by the user. For example, given the database of a cellular phone company, we might ask: what percentage of users will change data plans with the advent of a new smartphone? Answering this requires a dataset in which at least one of the columns is “CHANGED_PLAN” and another could be “BEFORE_NEW_DEVICE”. Many of the intermediate tables created for this project might remain in the database, using resources.

5 Motivation: Saving Work
Different users might ask similar questions, leading to the creation of tables that are virtually identical, cluttering the system. For instance, a researcher might want to answer the question: “What percentage of men between the ages of 18 and 35 will change data plans when a new device is introduced?” If the researcher is not aware of the previous user's project, some of the intermediate tables created might be exact duplicates of the ones created before.

6 Contribution
In this work we:
Present a classification of the most common transformations used in data mining.
Propose an extension of the ER model to keep track of the intermediate tables created.
Introduce a tool designed to simplify the use of naming conventions, keep track of attributes and keys, and facilitate the recognition of duplicate tables.

7 Building a data set for data mining
Building a data mining dataset involves successive rounds of aggregation and denormalization.
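To make the two kinds of rounds concrete, here is a minimal sketch of one aggregation step followed by one denormalization step, run in an in-memory SQLite database. The tables `customer` and `phone_call`, their columns, and the sample rows are hypothetical and are not taken from the talk; SQLite's `CREATE TABLE ... AS SELECT` stands in for the `SELECT ... INTO` form used in the slides.

```python
import sqlite3


def build_data_set(conn):
    """Run one aggregation round, then one denormalization round."""
    conn.executescript("""
        -- Hypothetical source tables (assumption, not from the talk).
        CREATE TABLE customer(id INTEGER PRIMARY KEY, plan TEXT);
        CREATE TABLE phone_call(
            call_id INTEGER PRIMARY KEY,
            id      INTEGER REFERENCES customer(id),
            minutes REAL
        );
        INSERT INTO customer VALUES (1, 'basic'), (2, 'premium');
        INSERT INTO phone_call VALUES (10, 1, 5.0), (11, 1, 7.0), (12, 2, 3.0);

        -- Round 1, aggregation: one row per customer.
        CREATE TABLE call_totals AS
            SELECT id, SUM(minutes) AS total_minutes
            FROM phone_call
            GROUP BY id;

        -- Round 2, denormalization: bring attributes together by key.
        CREATE TABLE data_set AS
            SELECT c.id, c.plan, t.total_minutes
            FROM customer c JOIN call_totals t ON c.id = t.id;
    """)
    return conn.execute(
        "SELECT id, plan, total_minutes FROM data_set ORDER BY id"
    ).fetchall()


rows = build_data_set(sqlite3.connect(":memory:"))
```

Note that `call_totals` is exactly the kind of intermediate table that stays behind in the database after the data set has been built.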

8 Note: If the database is not static, the transformation tables must also be updated. This could be resource intensive, so the update can be deferred: a transformation table is updated only when it is reused. We also limit ourselves to transformations that happen inside the database. Transformations happening outside the DBMS, such as those performed by Extract-Transform-Load (ETL) tools, are not considered here.
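The deferred update described in this note can be sketched as follows. This is only an illustration of the "update on reuse" idea; the class name `LazyTransformationTable` and its methods are hypothetical, and a plain Python list plays the role of the source table.

```python
class LazyTransformationTable:
    """Rebuilds a derived table only when it is reused after a source change."""

    def __init__(self, name, build):
        self.name = name
        self._build = build    # callable that recomputes the table's rows
        self._rows = None
        self._stale = True     # nothing has been computed yet
        self.rebuilds = 0

    def source_changed(self):
        # Cheap: only remember that the stored rows may be out of date.
        self._stale = True

    def rows(self):
        # The expensive rebuild happens only here, at reuse time.
        if self._stale:
            self._rows = self._build()
            self._stale = False
            self.rebuilds += 1
        return self._rows


source = [3, 5]                                   # stand-in for a source table
totals = LazyTransformationTable("T_total", lambda: [sum(source)])

first = totals.rows()        # first use: the table is built
source.append(10)
totals.source_changed()
source.append(1)
totals.source_changed()      # two changes, but no rebuild yet
second = totals.rows()       # reuse: one rebuild covers both changes
```

Two source changes cost a single rebuild, which is the saving the note is after; the price is that the first reuse after a change is slower.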

9 Model vs Theory
Entity-Relationship (ER) Model: Entities, Relationships
Relational Model: Tables, Foreign Keys
Note: clarify input to the program.

10 Well Formed Queries
We define a “well formed query” as one that complies with the following requirements:
It always produces a table with a primary key and a potentially empty set of non-key attributes.
Each join operator is computed based on a foreign key from the referencing table and a primary key from the referenced table.
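The join requirement can be checked mechanically once key metadata is available. The sketch below is one possible check, not the tool's implementation: the `Table` record and `join_is_well_formed` are hypothetical names, the key assignments for S1 and S2 are assumptions made for the example, and foreign key columns are assumed to carry the same names as the primary key columns they reference (as in a join on S1.I=S2.I).

```python
from dataclasses import dataclass, field


@dataclass
class Table:
    name: str
    primary_key: tuple
    # Maps a tuple of foreign key columns to the name of the referenced table.
    foreign_keys: dict = field(default_factory=dict)


def join_is_well_formed(referencing, referenced, columns):
    """True when `columns` is a foreign key of `referencing` that points to
    `referenced` and coincides with the primary key of `referenced`."""
    cols = tuple(columns)
    points_to_referenced = referencing.foreign_keys.get(cols) == referenced.name
    matches_primary_key = cols == referenced.primary_key
    return points_to_referenced and matches_primary_key


# Assumed keys: S1 has PK (I); S2 has PK (I, J) and a foreign key I -> S1.
S1 = Table("S1", ("I",))
S2 = Table("S2", ("I", "J"), {("I",): "S1"})

ok = join_is_well_formed(S2, S1, ["I"])    # FK of S2 against PK of S1
bad = join_is_well_formed(S1, S2, ["I"])   # S1 holds no FK pointing to S2
```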

11 Database Transformation Queries

12 Data Sets

13 The Transformation Tables
In order to allow for easy reuse, transformation tables must incorporate into their metadata:
The query that created them.
An indication of whether the entities come from a source table or another transformation table (provenance).
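One way to hold these two pieces of metadata is a small record per transformation table. This is a sketch only: `TransformationTable`, `register`, and the in-memory `catalog` are hypothetical names, and the two queries are q1 and q2 from the sample script in this talk.

```python
from dataclasses import dataclass, field


@dataclass
class TransformationTable:
    name: str
    query: str                                      # the query that created it
    provenance: dict = field(default_factory=dict)  # input table -> origin


catalog = {}  # transformation tables registered so far, by name


def register(name, query, inputs):
    """Record a new transformation table and where each of its inputs came from."""
    provenance = {
        table: "transformation" if table in catalog else "source"
        for table in inputs
    }
    entry = TransformationTable(name, query, provenance)
    catalog[name] = entry
    return entry


t1 = register(
    "T1",
    "SELECT S2.I,S2.J,A4,A5,A6,A7,K2,K3 INTO T1 "
    "FROM S1 JOIN S2 ON S1.I=S2.I WHERE A6>10",
    ["S1", "S2"],
)
t2 = register(
    "T2",
    "SELECT I, sum(A4) AS X2, sum(A5) AS X3, max(1) AS k "
    "INTO T2 FROM T1 GROUP BY I",
    ["T1"],
)
```

Here `t1.provenance` marks S1 and S2 as source tables, while `t2.provenance` marks T1 as a transformation table.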

14 Transformations
Principal idea: transformations are grouped together when they share a primary key.

15 Classification of transformations
We distinguish two mutually exclusive database transformations:
Denormalization, which brings attributes from other entities into the transformation entity or simply combines existing attributes.
Aggregation, which creates a new attribute by grouping rows and computing a summarization.
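A rough way to tell the two classes apart from the query text is to look for a GROUP BY clause or an aggregate function call. The sketch below is a heuristic written for illustration, with hypothetical names; it is not the rule used by the tool presented in this talk, and it may label some queries differently than the tool does (for example, a join over previously aggregated tables).

```python
import re

# Common SQL aggregate functions followed by an opening parenthesis.
AGGREGATE_CALL = re.compile(r"\b(?:sum|count|avg|min|max)\s*\(", re.IGNORECASE)
GROUP_BY = re.compile(r"\bgroup\s+by\b", re.IGNORECASE)


def classify(query):
    """Heuristic: a query that groups or summarizes rows is an aggregation;
    anything else is treated as a denormalization."""
    text = re.sub(r"/\*.*?\*/", " ", query, flags=re.DOTALL)  # drop comments
    if GROUP_BY.search(text) or AGGREGATE_CALL.search(text):
        return "Aggregation"
    return "Denormalization"


q1 = ("SELECT S2.I,S2.J,A4,A5,A6,A7,K2,K3 INTO T1 "
      "FROM S1 JOIN S2 ON S1.I=S2.I WHERE A6>10;")
q2 = ("SELECT I, sum(A4) AS X2, sum(A5) AS X3, max(1) AS k /* k is FK */ "
      "INTO T2 FROM T1 GROUP BY I;")
q3 = "SELECT 1 AS k, min(X3) AS minX3, max(X3) AS maxX3 INTO T3 FROM T2;"

labels = [classify(q) for q in (q1, q2, q3)]
```

Column names such as minX3 or maxX3 do not trigger the aggregate pattern, because the pattern requires the function name to be followed directly by a parenthesis.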

16 The CASE statement
Example:
    SELECT .. CASE WHEN A1='married' OR A2='employed' THEN 1 ELSE 0 END AS binaryVariable FROM ..
The CASE statement does not have a relational algebra translation. It derives a binary attribute not present before in the database, and might even introduce NULLs.
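The following sketch runs the CASE expression in an in-memory SQLite database to show both effects: the derived binary attribute, and the NULL that appears when the ELSE branch is left out. The three sample rows and the column layout of S1 are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical rows; only the column names A1 and A2 come from the slide.
    CREATE TABLE S1(I INTEGER PRIMARY KEY, A1 TEXT, A2 TEXT);
    INSERT INTO S1 VALUES (1, 'married', 'unemployed'),
                          (2, 'single',  'employed'),
                          (3, 'single',  'unemployed');
""")

# With ELSE: every row gets 0 or 1.
with_else = conn.execute("""
    SELECT I,
           CASE WHEN A1='married' OR A2='employed' THEN 1 ELSE 0 END
               AS binaryVariable
    FROM S1 ORDER BY I
""").fetchall()

# Without ELSE: rows matching no WHEN branch get NULL (None in Python).
without_else = conn.execute("""
    SELECT I,
           CASE WHEN A1='married' OR A2='employed' THEN 1 END
               AS binaryVariable
    FROM S1 ORDER BY I
""").fetchall()
```

Row 3 matches no branch, so it yields 0 in the first query and NULL in the second.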

17 Sample Database
In this simple example, S1 could be a table of transactions, S2 a table of products, and S3 could contain details about the products. S2 (product) is a detail table of S1 (transaction); S3 holds the product details.

18 Sample Script
    /* q0: T0, universe */
    SELECT I, /* I is the record id, or point id mathematically */
      CASE WHEN A1='married' OR A2='employed' THEN 1 ELSE 0 END AS Y, /* binary target variable */
      A3 AS X1 /* 1st variable */
    INTO T0
    FROM S1;

    /* q1: denormalize and filter valid records */
    SELECT S2.I,S2.J,A4,A5,A6,A7,K2,K3
    INTO T1
    FROM S1 JOIN S2 ON S1.I=S2.I
    WHERE A6>10;

    /* q2: aggregate */
    SELECT I, sum(A4) AS X2, sum(A5) AS X3, max(1) AS k /* k is FK */
    INTO T2
    FROM T1
    GROUP BY I;

    /* q3: get min, max */
    SELECT 1 AS k, min(X3) AS minX3, max(X3) AS maxX3
    INTO T3
    FROM T2;

    /* q4: math transform */
    SELECT I,
      log(X2) AS X2, /* 2nd variable */
      (X3-minX3)/(maxX3-minX3) AS X3 /* 3rd variable, range [0,1] */
    INTO T4
    FROM T2 JOIN T3 ON T2.K=T3.K; /* get the min/max */

    /* q5: denormalize, gather attribute from referenced table S3 */
    SELECT I,J,A7,A8
    INTO T5
    FROM T1 JOIN S3 ON T1.K2=S3.K2;

    /* q6: aggregate with CASE */
    SELECT I, sum(CASE WHEN A7='Y' THEN A8 ELSE 0 END) AS X4
    INTO T6
    FROM T5
    GROUP BY I;

    /* q7: data set, star join;
       this data set can be used for: logistic regression, decision tree, SVM */
    SELECT T0.I,X1,X2,X3,X4,Y
    INTO X
    FROM T0 JOIN T4 ON T0.I=T4.I JOIN T6 ON T0.I=T6.I;
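A script like this one already encodes which table is derived from which. As a sketch of how that lineage could be recovered from the text alone, the function below pairs each `INTO` target with the tables named after `FROM` and `JOIN`. The function name `table_lineage` is hypothetical, and the approach assumes simple statements: no subqueries, and no semicolons or SQL keywords inside string literals.

```python
import re


def table_lineage(script):
    """Map each table created with SELECT ... INTO to the tables it reads."""
    # Remove /* ... */ comments first so words inside them are not parsed.
    script = re.sub(r"/\*.*?\*/", " ", script, flags=re.DOTALL)
    lineage = {}
    for statement in script.split(";"):
        target = re.search(r"\bINTO\s+(\w+)", statement, re.IGNORECASE)
        if target is None:
            continue  # not a SELECT ... INTO statement
        inputs = re.findall(r"\b(?:FROM|JOIN)\s+(\w+)", statement, re.IGNORECASE)
        lineage[target.group(1)] = inputs
    return lineage


script = """
/* q1: denormalize and filter valid records */
SELECT S2.I,S2.J,A4,A5,A6,A7,K2,K3 INTO T1
FROM S1 JOIN S2 ON S1.I=S2.I WHERE A6>10;
/* q2: aggregate */
SELECT I, sum(A4) AS X2, sum(A5) AS X3, max(1) AS k /* k is FK */
INTO T2 FROM T1 GROUP BY I;
/* q5: denormalize, gather attribute from referenced table S3 */
SELECT I,J,A7,A8 INTO T5 FROM T1 JOIN S3 ON T1.K2=S3.K2;
"""

lineage = table_lineage(script)
```

Stripping comments before matching matters here: the q5 comment contains the word "from", which would otherwise be read as a FROM clause.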

19 Something Tool To be done after I speak to Wellington…

20 Tool Development
The program should create a database of queries.
Notes: better key; regular expressions and query matching; no log: db of queries…
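A database of queries can double as a duplicate detector if each query is stored under a fingerprint of its normalized text. The sketch below shows one possible design, not the tool's actual one: `normalize`, `QueryStore`, and the `query_catalog` table are hypothetical names, and the normalization (drop comments, ignore the target table name, collapse whitespace, lowercase) is deliberately crude. Lowercasing also folds string literals, so 'Y' and 'y' would be treated as equal.

```python
import hashlib
import re
import sqlite3


def normalize(query):
    """Reduce a query to a canonical text so trivial variants compare equal."""
    text = re.sub(r"/\*.*?\*/", " ", query, flags=re.DOTALL)        # comments
    text = re.sub(r"\bINTO\s+\w+", " ", text, flags=re.IGNORECASE)  # target name
    text = re.sub(r"\s+", " ", text).strip().rstrip(";").strip()
    return text.lower()


class QueryStore:
    """Keeps one row per distinct (normalized) query."""

    def __init__(self, conn):
        self.conn = conn
        conn.execute(
            "CREATE TABLE IF NOT EXISTS query_catalog("
            "fingerprint TEXT PRIMARY KEY, table_name TEXT, query TEXT)"
        )

    def register(self, table_name, query):
        """Return the table that already holds this query's result, or
        record the query and return `table_name` when it is new."""
        fingerprint = hashlib.sha256(normalize(query).encode("utf-8")).hexdigest()
        row = self.conn.execute(
            "SELECT table_name FROM query_catalog WHERE fingerprint = ?",
            (fingerprint,),
        ).fetchone()
        if row is not None:
            return row[0]
        self.conn.execute(
            "INSERT INTO query_catalog VALUES (?, ?, ?)",
            (fingerprint, table_name, query),
        )
        return table_name


store = QueryStore(sqlite3.connect(":memory:"))
first = store.register("T2", "SELECT I, sum(A4) AS X2 INTO T2 FROM T1 GROUP BY I;")
second = store.register("MYTAB", "select  I, SUM(A4) as X2 into MYTAB\nfrom T1 group by I")
```

The second query differs only in case, spacing, and target name, so the store answers with the existing table T2 instead of recording a duplicate.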

21 Tool Output
Denormalization: T0(I,Y,X1, PK(I), FK(S1.I));
Denormalization: T1(I,J,A4,A5,A6,A7,K2,K3, PK(I,J), FK(S2.I,S2.J),FK(S3.K2));
Aggregation: T2(I,X2,X3,K, PK(I), FK(S1.I));
Aggregation: T3(K,minX3,maxX3 ,PK(K));
Aggregation: T4(I,X2,X3 ,PK(I) ,FK(S1.I));
Denormalization: T5(I,J,A7,A8, PK(I,J), FK(S2.I,S2.J));
Aggregation: T6(I,X4, PK(I), FK(S1.I));
Denormalization: X(I,X1,X2,X3,X4,Y, PK(I), FK(S1.I));
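Because every output line follows the same shape, it can be read back into a structured record, for example to compare attributes and keys across tables. The parser below is a sketch based only on the lines shown on this slide; `parse_tool_line` is a hypothetical helper and not part of the tool.

```python
import re


def parse_tool_line(line):
    """Split one 'Kind: Table(attrs, PK(...), FK(...));' line into its parts."""
    match = re.match(r"\s*(\w+):\s*(\w+)\s*\((.*)\)\s*;\s*$", line)
    if match is None:
        raise ValueError(f"unrecognized line: {line!r}")
    kind, table, body = match.groups()

    pk = re.search(r"\bPK\(([^)]*)\)", body)
    fks = re.findall(r"\bFK\(([^)]*)\)", body)
    # What is left after removing the key clauses is the attribute list.
    rest = re.sub(r"\b(?:PK|FK)\([^)]*\)", "", body)

    return {
        "kind": kind,
        "table": table,
        "attributes": [a.strip() for a in rest.split(",") if a.strip()],
        "primary_key": [c.strip() for c in pk.group(1).split(",")] if pk else [],
        "foreign_keys": [[c.strip() for c in fk.split(",")] for fk in fks],
    }


parsed = parse_tool_line(
    "Denormalization: T1(I,J,A4,A5,A6,A7,K2,K3, PK(I,J), FK(S2.I,S2.J),FK(S3.K2));"
)
```

The parser tolerates the uneven spacing around commas that appears in the output above.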

22 Program Detail
Script:
    SELECT I, CASE WHEN (A1='married' OR A2='employed') THEN 1 ELSE 0 END AS Y, A3 AS X1 INTO TABLE0 FROM S1;
Output:
    Denormalization: T0(I,Y,X1,PK(I),FK(S1.I));
The output of the code identifies the type of transformation (denormalization or aggregation), the attributes present in the new table, and information about primary and foreign keys. Furthermore, it changes the name of the table to a ‘normalized’ name.
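The output line has a regular shape: transformation type, table name, then the attributes followed by the key clauses. As a sketch, a formatter that reproduces that shape could look like the following; `describe` is a hypothetical helper, and how the tool derives the normalized table name (TABLE0 becoming T0) is not specified here, so the name is simply passed in.

```python
def describe(kind, table, attributes, primary_key, foreign_keys=()):
    """Format one output line: 'Kind: Table(attrs,PK(...),FK(...));'."""
    parts = list(attributes)
    parts.append("PK(" + ",".join(primary_key) + ")")
    parts.extend("FK(" + ",".join(fk) + ")" for fk in foreign_keys)
    return f"{kind}: {table}({','.join(parts)});"


line = describe(
    "Denormalization",
    "T0",                      # normalized name, supplied by the caller
    ["I", "Y", "X1"],
    ["I"],
    [["S1.I"]],
)
```

For the script on this slide, the call above yields the same text as the output shown.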

23 Denormalization

24 Aggregation

25 Future Extensions
Extend the program to search the database for transformation tables that might have already been created.
Incorporate it as a plugin of a major DBMS. This would allow considerable savings in time and resources when preparing datasets.
Create a plugin for modeling software to show the new tables created, as well as the metadata stored when using the program.
Introduce a workflow chart for the query plan.

26 Conclusions
We proposed a minimal extension to the ER model to represent data transformations in an ER diagram.
We introduced an algorithm to extend an existing ER model, keeping the data set in mind as the final goal.
The extended model helps analysts reuse existing tables or views.
It also helps in understanding complex SQL queries at a high level.
Our work bridges the gap between a logical database model represented by a standard ER model and a physical database model represented by SQL queries.

27 The AdventureWorks Database