Design Metrics for Data Warehouse Evolution
G. Papastefanatos (1), P. Vassiliadis (2), A. Simitsis (3), Y. Vassiliou (1)
(1) National Technical University of Athens, Athens, Hellas (Greece)
(2) University of Ioannina, Ioannina, Hellas (Greece)
(3) HP Labs, Palo Alto, California, USA
ER'08, Barcelona, October Outline Motivation Graph-based modeling & DW Evolution Metrics for data warehouse evolution Evaluation Conclusions
Motivation
[Figure labels: WWW, Act1, Act2, Act3, Act4, Act5]
Data warehouses are evolving environments, e.g.:
– A dimension is removed or renamed
– The structure of a dimension table is updated
– A fact table is completely decoupled from a dimension
– The measures of a fact table change
– An ETL source is modified, etc.
Evolution Effects
Software and data artifacts around the warehouse (e.g., ETL activities, materialized views, reports) are affected:
– Syntactically, i.e., they become invalid
– Semantically, i.e., they must conform to the new source database semantics
Adaptation to the new semantics
– is a time-consuming task
– is handled manually by the administrators/developers in most cases
Evolution-driven design is missing
We would like to know…
Can we measure and quantify, in a principled way, the vulnerability of certain parts of a data warehouse environment and find the constructs that are most sensitive to evolution?
Can we predict and quantify the impact of a change on the rest of the system?
What are the “right” measures for evaluating the quality of a data warehouse design with respect to its evolution capabilities?
Data Warehouse Schema Evolution: Our approach
Mechanism for performing what-if analysis for potential changes of database configurations
Graph-based representation of database constructs (i.e., relations, views, constraints, queries)
Annotation of the graph with rules for adapting queries to database schema evolution
[Figure labels: Evolving databases; Evolving applications; Queries; Database Schema; Graph-based modeling for uniform representation; Rules for Handling Evolution; Metrics for Evaluating Evolution Design]
Graph-based representation
Graph annotation with rules
Graph Adaptation
Event: add attribute Phone to relation EMP
Rule annotated on the query graph: ON attribute addition TO EMP THEN propagate
Annotated query graph: Q: SELECT EID, Name FROM EMP
Transformed query graph: Q: SELECT EID, Name, Phone FROM EMP
[Figure: graph of query Q over relation EMP, with attribute nodes EID and Name connected by edges labeled S and map-select; in the transformed graph a Phone node is added with its own S and map-select edges]
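The adaptation step on this slide can be illustrated with a minimal Python sketch. It is not Hecataeus code: the `Query` class, the `rules` dictionary, and the choice to treat a missing rule as "block" are hypothetical, introduced only to show how a propagate rule could turn the first query into the second.

```python
from dataclasses import dataclass, field


@dataclass
class Query:
    """Hypothetical stand-in for a query node and its annotated rules."""
    relation: str
    columns: list
    # (event type, relation) -> policy, e.g. "propagate" or "block" (assumed encoding)
    rules: dict = field(default_factory=dict)

    def sql(self):
        return f"SELECT {', '.join(self.columns)} FROM {self.relation}"


def add_attribute(query, relation, attribute):
    """Apply an 'attribute addition' event to the query according to its rule."""
    # Assumption: with no rule present, the event is blocked.
    policy = query.rules.get(("attribute addition", relation), "block")
    applies = relation == query.relation and policy == "propagate"
    if applies and attribute not in query.columns:
        query.columns.append(attribute)
    return query


q = Query("EMP", ["EID", "Name"], {("attribute addition", "EMP"): "propagate"})
add_attribute(q, "EMP", "Phone")
print(q.sql())

blocked = Query("EMP", ["EID", "Name"], {("attribute addition", "EMP"): "block"})
add_attribute(blocked, "EMP", "Phone")
print(blocked.sql())
```

With the propagate rule the query gains the Phone column; with a block rule it stays unchanged.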
Simple Metrics
Simple: in-degree, out-degree, degree
EMP.Emp# is more “important” than EMP.SAL with respect to how many nodes depend directly on it
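A small Python sketch of the simple degree metrics follows. The toy edge list is hypothetical, and it assumes that an edge (u, v) means "u depends on v", so that the in-degree of a node counts the nodes depending directly on it.

```python
from collections import defaultdict

# Hypothetical toy graph; an edge (u, v) is assumed to mean "u depends on v".
edges = [
    ("Q1.Emp#", "EMP.Emp#"),
    ("Q2.Emp#", "EMP.Emp#"),
    ("V.Emp#", "EMP.Emp#"),
    ("Q1.SAL", "EMP.SAL"),
]


def degrees(edges):
    """Return in-degree, out-degree and total degree for every node."""
    in_deg = defaultdict(int)
    out_deg = defaultdict(int)
    for src, dst in edges:
        out_deg[src] += 1
        in_deg[dst] += 1
    nodes = set(in_deg) | set(out_deg)
    return {
        n: {"in": in_deg[n], "out": out_deg[n], "degree": in_deg[n] + out_deg[n]}
        for n in nodes
    }


simple = degrees(edges)
print(simple["EMP.Emp#"]["in"], simple["EMP.SAL"]["in"])  # 3 1
```

In this toy graph EMP.Emp# has the higher in-degree, which is the sense in which the slide calls it more "important".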
Transitive Metrics
Transitive: in-degree, out-degree, degree
The variant with a view + query is more “complicated” with respect to how many nodes are involved in the propagation of EMP.Emp# towards the end
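The difference between the two variants can be sketched in Python. The sketch assumes that the transitive in-degree of a node is the number of distinct nodes that depend on it directly or indirectly; the exact definition used in the talk may differ. The two edge lists are hypothetical, with an edge (u, v) read as "u depends on v".

```python
def transitive_in_degree(edges, target):
    """Count the distinct nodes that reach `target` through dependency edges."""
    dependents = {}
    for src, dst in edges:
        dependents.setdefault(dst, set()).add(src)
    seen = set()
    stack = [target]
    while stack:
        node = stack.pop()
        for dep in dependents.get(node, ()):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    seen.discard(target)  # ignore the node itself if the graph has a cycle
    return len(seen)


# Variant 1: the query reads EMP.Emp# directly.
direct = [("Q.Emp#", "EMP.Emp#")]
# Variant 2: the query reads a view, which in turn reads EMP.Emp#.
with_view = [("V.Emp#", "EMP.Emp#"), ("Q.Emp#", "V.Emp#")]

print(transitive_in_degree(direct, "EMP.Emp#"))     # 1
print(transitive_in_degree(with_view, "EMP.Emp#"))  # 2
```

The simple in-degree of EMP.Emp# is 1 in both variants; only the transitive count reveals the extra node introduced by the view.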
Zoomed-out degrees
Only top-level nodes are retained
Only one edge between modules is retained, weighted with the number of edges suppressed
Simple degrees; Transitive degrees
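A possible way to zoom out is sketched below in Python. The naming convention "MODULE.attribute", the sample edges, and the decision to drop edges that stay inside one module are all assumptions made for the illustration.

```python
from collections import Counter


def zoom_out(edges):
    """Collapse attribute-level edges into weighted module-level edges."""
    weights = Counter()
    for src, dst in edges:
        # Assumption: the top-level module is the part of the name before the first dot.
        m_src, m_dst = src.split(".")[0], dst.split(".")[0]
        if m_src != m_dst:  # assumption: intra-module edges are not kept
            weights[(m_src, m_dst)] += 1
    return weights


# Hypothetical detailed graph: query Q reads view V, which reads relation EMP.
detailed = [
    ("Q.Emp#", "V.Emp#"),
    ("Q.Name", "V.Name"),
    ("V.Emp#", "EMP.Emp#"),
    ("V.Name", "EMP.Name"),
    ("V", "V.Emp#"),  # edge inside module V, suppressed without contributing a weight
]

zoomed = zoom_out(detailed)
print(dict(zoomed))  # {('Q', 'V'): 2, ('V', 'EMP'): 2}
```

Degrees can then be computed on the zoomed-out graph, either counting each retained edge once or summing its weight.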
Entropy-based metrics
Probability that a node v is affected by an event occurring on another node y_i: [formula]
Examples: P(Q|V) = 1/3, P(Q|EMP) = 1/3, P(V|WORKS) = 1/2
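The example values are consistent with one plausible reading, sketched here in Python: the probability is the number of dependency paths from v to y_i divided by the number of paths from v to all other nodes. Both this formula and the graph (query Q reads view V, which reads relations EMP and WORKS) are assumptions chosen to match the numbers on the slide, not a statement of the authors' definition.

```python
from fractions import Fraction

# Assumed dependency graph: Q depends on V; V depends on EMP and WORKS.
graph = {"Q": ["V"], "V": ["EMP", "WORKS"], "EMP": [], "WORKS": []}


def count_paths(graph, src, dst):
    """Number of distinct paths from src to dst (graph must be acyclic)."""
    if src == dst:
        return 1
    return sum(count_paths(graph, nxt, dst) for nxt in graph[src])


def prob_affected(graph, v, y):
    """Assumed definition: paths(v, y) / sum of paths(v, y_j) over all other nodes y_j."""
    total = sum(count_paths(graph, v, other) for other in graph if other != v)
    if total == 0:
        return Fraction(0)
    return Fraction(count_paths(graph, v, y), total)


print(prob_affected(graph, "Q", "V"))      # 1/3
print(prob_affected(graph, "Q", "EMP"))    # 1/3
print(prob_affected(graph, "V", "WORKS"))  # 1/2
```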
Entropy-based metrics (continued)
Entropy of a node v: [formula]
The entropy expresses how sensitive a node v is to a random event on the graph.
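As a rough illustration, the Python sketch below assumes the entropy is the standard Shannon entropy, H(v) = -Σ P(v|y_i) log2 P(v|y_i), taken over the probabilities that v is affected by events on other nodes. The input distributions are hypothetical: they assume Q can be affected through three nodes with probability 1/3 each, and V through two nodes with probability 1/2 each.

```python
import math


def entropy(probabilities):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


# Hypothetical distributions for two nodes.
h_q = entropy([1 / 3, 1 / 3, 1 / 3])
h_v = entropy([1 / 2, 1 / 2])

print(round(h_q, 3))  # 1.585
print(h_v)            # 1.0
```

Under this reading, a node that can be reached by events from more places, with probability spread evenly among them, gets a higher entropy.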
Testbed Configuration
TPC-DS benchmark: Web Sales schema with 3 variants
– Original (1 fact, 13 dimensions)
– Surrounded with views
– Customer dimensions merged
Distribution of Evolution Events
Operation                  | Distribution 1 | Distribution 2
Rename Measure             | 29% (15)       | 0% (0)
Add Measure                | 25% (13)       | 0% (0)
Rename Dimension Attribute | 21% (11)       | 0% (0)
Add Dimension Attribute    | 15% (8)        | 37% (25)
Delete Measure             | 6% (3)         | 0% (0)
Delete Dimension Attribute | 4% (2)         | 44% (30)
Delete FKs                 | 0%             | 13% (9)
Delete Dimension Table     | 0%             | 6% (4)
Distribution 1: recorded from the Greek public sector
Distribution 2: migration to a pure star schema
Evaluating effectiveness
Effectiveness
– how well our metrics can “forecast” the impact of events over the different constructs of the schema
Configuration
– we used mainly Distribution 1 of events (real data)
– we tested nine configurations based on
  variations of the schema: Web Sales (WS), Web Sales extended with views (WS-views), star variant of Web Sales (WS-star)
  variations of the policy: Block-All, Propagate-All, Mixture
[Figure] Events affecting dimensions: (a) WS schema; (b) WS-star schema
[Figure] Events affecting views: WS-views schema
[Figure] Events affecting queries: (a) WS schema; (b) WS-star schema
[Figure] Comparison of design configurations for Distribution 1: (a) only affected queries; (b) all affected nodes
[Figure] Comparison of design configurations for Distribution 2: (a) only affected queries; (b) all affected nodes
Conclusions
A framework for handling the impact of changes in a DW environment
A set of metrics for DW evolution
– simple
– transitive
– entropy-based
An extensive experimental evaluation based on both real and synthetic datasets
Platform: Hecataeus
– A tool for visualizing and performing what-if analysis for evolution scenarios
Gracias!
Hecataeus: A tool for visualizing and performing what-if analysis for evolution scenarios
Questions?