Lineage Tracing for General Data Warehouse Transformations Yingwei Cui and Jennifer Widom Computer Science Department, Stanford University Presentation.


1 Lineage Tracing for General Data Warehouse Transformations Yingwei Cui and Jennifer Widom Computer Science Department, Stanford University Presentation by Aaron St.Clair

2 Outline
• What is lineage tracing?
• Why is tracing lineage data important?
• How can we find lineage data?
• Performance results

3 Data Warehouses
• Integrate data from multiple sources
• Data undergoes series of transformations
• Transformations vary in complexity
[Figure: Data Source 1, Data Source 2, …, Data Source N → Transformation → Summarized Data]

4 Lineage Tracing
• Identifying the specific data items in the sources that derive a given data item in the warehouse
• Allows
  – In-depth data analysis
  – Data mining
  – Authorization management
  – View update
  – Efficient warehouse recovery

5 An Example
Selects items whose last quarter sales are more than twice the average of the last three quarters' sales

6 An Example

7 Lineage Granularity
• Coarse-Grained
  – Schema-level, attribute mapping
• Fine-Grained
  – Set of source data items

8 Existing Work
• Mostly coarse-grained lineage
• Existing methods for fine-grained lineage
  – Extra annotation
  – Developer-defined weak inverses
  – Statistical estimation
• Can’t handle complex procedural transformations

9 Tracing Lineage - Definitions
• Data set – set of data items without duplicates
• Transformation – any procedure that takes data sets as input and produces data sets as output
  – Stable (no spurious output)
  – Deterministic (under some conditions)
• Lineage of a data item – set of input data items that contribute to that item

10 Determining Contributions
Need to find relevant data items
  – Easy for simple relational operators
  – Difficult for procedural transformations
Select positives vs. Aggregation and sum
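The contrast on this slide can be sketched in a few lines of Python. This is only an illustration, not the authors' algorithm: the function names and the sample data are hypothetical, and the aggregation case assumes the whole input set is summed into a single value.

```python
def select_positives(items):
    """Filter: keep only the items greater than zero."""
    return {x for x in items if x > 0}


def lineage_of_selected(output_item, items):
    """For a filter, an output item derives from the identical input item."""
    return {x for x in items if x == output_item}


def total(items):
    """Aggregation: sum of all input items."""
    return sum(items)


def lineage_of_total(output_item, items):
    """For a sum over the whole set, every input item contributes."""
    return set(items) if total(items) == output_item else set()


# Hypothetical sample data.
inputs = {-2, 3, 5}
selected = select_positives(inputs)                  # the filter output
filter_lineage = lineage_of_selected(3, inputs)      # a single input item
sum_lineage = lineage_of_total(total(inputs), inputs)  # all input items
```

For the filter, the lineage of an output item is one input item; for the sum, the lineage is the entire input set, which is why the relevant items are harder to pin down once a transformation combines its inputs.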

11 Lineage Tracing
Use of hierarchical model
  – Transformation classes
  – Schema mappings
  – Defined inverses

12 Transformation Classes
• The transformation class determines the lineage tracing procedure
• For a dispatcher:
  – Apply the transformation to each input item i in turn
  – If the output item is in T({i}), add i to the lineage of that output item
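The dispatcher procedure can be sketched as follows. This is a minimal sketch, assuming a transformation is a Python function from a set of items to a set of items; `trace_dispatcher` and `split_words` are hypothetical names chosen for illustration.

```python
def trace_dispatcher(transform, output_item, input_items):
    """Lineage for a dispatcher: every input item i whose individual
    result transform({i}) contains the output item."""
    return {i for i in input_items if output_item in transform({i})}


def split_words(lines):
    """Hypothetical dispatcher: each line independently yields its words."""
    return {word for line in lines for word in line.split()}


# Hypothetical sample data.
lines = {"red apple", "green apple", "red car"}
apple_lineage = trace_dispatcher(split_words, "apple", lines)
red_lineage = trace_dispatcher(split_words, "red", lines)
```

Note that this calls the transformation once per input item, so its cost grows with the size of the input set.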

13 Schema Mappings
• Defined schema for input and output of a transformation
• Backward key-maps
  – A_key ← g(B)
  – T1
• Forward key-maps
  – f(A) → B_key
  – T4
• Backward total-maps
  – A ← g(B)
  – T5
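One way to read the key-maps is as a shortcut for tracing: instead of re-running the transformation, compare attribute values. The sketch below assumes that a forward key-map f sends input attributes to the output key and that a backward key-map g recovers the input key from output attributes. The function names, the record layout, and the sample data are all hypothetical.

```python
def trace_forward_key_map(f, out_key, output_item, input_items):
    """Lineage = input items i whose mapped value f(i) equals
    the key of the output item."""
    wanted = out_key(output_item)
    return [i for i in input_items if f(i) == wanted]


def trace_backward_key_map(g, in_key, output_item, input_items):
    """Lineage = input items whose key equals g(output item)."""
    wanted = g(output_item)
    return [i for i in input_items if in_key(i) == wanted]


# Hypothetical sample data.
orders = [
    {"id": 1, "product": "pen", "qty": 2},
    {"id": 2, "product": "ink", "qty": 1},
    {"id": 3, "product": "pen", "qty": 5},
]

# Forward key-map: a per-product total is keyed by the input's product.
pen_total = {"product": "pen", "total": 7}
forward_lineage = trace_forward_key_map(
    lambda i: i["product"], lambda o: o["product"], pen_total, orders
)

# Backward key-map: the input key can be computed from the output item.
tagged = {"order_ref": "ORD-3", "qty": 5}
backward_lineage = trace_backward_key_map(
    lambda o: int(o["order_ref"].split("-")[1]), lambda i: i["id"], tagged, orders
)
```

Neither tracer ever calls the transformation itself, which is the appeal of having schema information available.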

14 Provided Inverses/Tracing Procedures
• Best case: someone has defined a function mapping output items to the input items in their lineage
• Nothing is known about the efficiency of that function

15 Property Hierarchy

16 Finding Lineage
Recursively apply algorithms based on the transformation type until we reach top level
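A minimal sketch of tracing backward through a sequence of transformations, one step at a time. It assumes the intermediate results of each step are stored and that each step comes with a tracer chosen for its transformation type; here both steps happen to be dispatchers. All names and data are hypothetical.

```python
from functools import partial


def trace_dispatcher(transform, output_item, input_items):
    """Tracer for dispatchers: inputs whose individual result contains the output."""
    return {i for i in input_items if output_item in transform({i})}


def split_words(lines):
    """Hypothetical first step: split each line into words."""
    return {word for line in lines for word in line.split()}


def to_upper(words):
    """Hypothetical second step: upper-case each word."""
    return {word.upper() for word in words}


def trace_sequence(steps, output_item):
    """steps: list of (tracer, input_items) pairs, first transformation first.
    Walk the steps from last to first, replacing the current set of items
    with their lineage in the previous step's input."""
    frontier = {output_item}
    for tracer, input_items in reversed(steps):
        frontier = {i for o in frontier for i in tracer(o, input_items)}
    return frontier


# Hypothetical sample data; `words` is the stored intermediate result.
lines = {"red apple", "green pear"}
words = split_words(lines)
steps = [
    (partial(trace_dispatcher, split_words), lines),
    (partial(trace_dispatcher, to_upper), words),
]
source_lineage = trace_sequence(steps, "APPLE")
```

A step of a different transformation type would simply supply a different tracer function in its pair.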

17 Optimizations
• Indexing input data set improves performance
• Functional index using the schema optimizes queries of the form F(i) = v
• Store auxiliary or intermediate views in the warehouse
• Reduce number by composing transformations
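The functional-index idea can be sketched with a plain dictionary: compute F(i) once per input item, then answer each query F(i) = v with a lookup instead of a scan. This is an in-memory illustration under that assumption, not the warehouse implementation; the names and data are hypothetical.

```python
from collections import defaultdict


def build_functional_index(f, input_items):
    """Map each value f(i) to the list of input items that produce it."""
    index = defaultdict(list)
    for item in input_items:
        index[f(item)].append(item)
    return index


def lookup(index, v):
    """Return all items i with f(i) == v, without scanning the input."""
    return index.get(v, [])


# Hypothetical sample data.
orders = [
    {"id": 1, "product": "pen", "qty": 2},
    {"id": 2, "product": "ink", "qty": 1},
    {"id": 3, "product": "pen", "qty": 5},
]
by_product = build_functional_index(lambda i: i["product"], orders)
pen_orders = lookup(by_product, "pen")
```

The index is built in one pass over the input and can then serve every trace that uses the same function F.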

18 Transformation Graphs
• Create a tracing sequence for each path from input to output in the graph
• Combine the results of each sequence

19 Performance
• 1 GB warehouse
• Schema mapping better than transformation class-specific algorithms
• Indexing helps
• Combining attributes reduces trace time

20 Questions?

