Presentation on theme: "Tracing the Lineage of View Data in a Warehousing Environment Seminar : “Digital Information Curation“ Winterterm 2005/2006 Siniša Avramović"— Presentation transcript:
Tracing the Lineage of View Data in a Warehousing Environment Seminar : “Digital Information Curation“ Winterterm 2005/2006 Siniša Avramović email@example.com
Outline - The Lineage Problem : Introduction and Definition -Motivating Examples -Basic Assumptions and Definitions -Derivation Tracing Algorithms for SPJ and ASPJ Views
Outline II -Derivation Tracing in a DWH Environment * handling intermediate results (Materialization vs. Recomputing) * The Multisource Problem * Example :DWH with Derivation Tracing Support - Discussion
Introduction Data Warehouse : integration of data from multiple, heterogeneous, distributed, and possibly very large data sources. Queries posed to the DW : - for analytical purposes (OLAP,OLAM) - often of high complexity (join, aggregation) Materialized Views : intermediate results of query processing stored in DW for query efficiency improvement.
Introduction II In general : view definition - mapping from base data to view data
Introduction III In general : view definition - mapping from base data to view data View definition
Introduction II View Data Lineage : inverse mapping from view of data item to the base relation data that produced it View data lineage * In this presentation, we deal with SPJ and ASPJ views
Introduction III Unfortunately : inverse mapping not as straightforward as view computing. Here not only view definition needed, but also some additional information required. Data Warehousing Environment introduces some additional challenges to the lineage problem. Questions to be Answered : * how to trace lineage in a distributed database environment * how to deal with inaccessible or inconsistent sources
Motivating Examples σ state=‘ca‘ π store_name, item_name, num_sold California item(item_id,item_name,category); store(store_id,store_name,city,state); sales(store_id,item_id,price,num_sold); CREATE VIEW California AS SELECT store.store_name,item.item_name, sales.num_sold FROM store,item,sales WHERE sales.store_id = store.store_id AND sales.item_id = item.item_id AND store.sate = “CA“
Basic Assumptions Class of views defined over base relations using relation algebra operation ( σ,π,, ⍺ ) considered SPJ and ASPJ Views Set Semantics considered Basic Definitions Derivation of view tupels : tupels of base relations which produce the View tupel. Those tupel said to contribute to t* * t : materialized view tupel
Basic Definitions: Tupel Derivation for an Operator σ state=‘ca‘ π store_name, item_name, num_sold California T1T2T3 - OP : any relational operator ( σ,π, ⋈, ⍺ ) over T1,T2,T3 -T=Op(T1,..,T3); -t ∊ T ; t‘ s Derivation in T1…T3 : Op -1 〈T1,..,T3〉 (t)=〈T 1 *,…. T 3 *〉 ; T 1 *,…. T 3 * maximal subsets of T1,…,T3 such that: a) the derivation tupel sets T i derive exactly t b) Each tuple in derivation in fact contribute something to t
Basic Definitions: Tupel Derivation for Aggregation
Basic Definitions: Tupel Derivation for Views Derivation of view tuple set T contains all tuples that contribute to any view tuple in set T Any View V with complex definition can be broken into intermediate views, and compute tuples derivation recursively by tracing the hierarchy of intermediate views.
SPJ View Derivation Tracing Derivation tracing queries queries written for specific view definition, such that if applied to database D, it return t‘s derivation in D. Given database D, with base relations R 1,…R m ; view definition v over R 1,…R m and tuple t ∊ v ( R 1,…R m ); TQ tv is the derivation tracing for t and v if : TQ tv (D)=v -1 D (t) * v -1 D (t) : t‘s derivation over D according to v
SPJ View Derivation Tracing SPJ canonical form : using a sequence of algebraic transformations- all SPJ views can be transformed into the form : ∏ A (σ C ( R 1 ⋈ …. ⋈ R m )) Split Operator : given table T with schema T, Split breaks T into list of tables, each table in the list is projection onto A ⊆ T
ASPJ View Derivation Tracing Problems with ASPJ Views : - most ASPJ Views not traceable without storing additional information (intermediate results). - no simple canonical form, since selection operators cannot be pushed above or below aggregation operators. Basic Idea : ASPJ canonical form transform general ASPJ views into sequences of ⍺, π, σ, ⋈ by commuting some SPJ operators ASPJ Segment. Each segment include aggregation operators. View defined by one ASPJ segment – one-level ASPJ view.
ASPJ View Derivation Tracing Given a one–level ASPJ view V : and t ∊ T ; t‘s derivation :
Recursive Derivation Tracing Algorithm for Multi-Level ASPJ Views Problem : multiple queries required for tracing computations. Idea : - transfer the view into ASPJ canonical form -Divide into set of ASPJ segments -Define intermediate views for each segment
Recursive Derivation Tracing Algorithm for Multi-Level ASPJ Views Problem : multiple queries required for tracing computations. Idea : -Trace recursively trough hierarchy of intermediate Views top-down list -Use at each level a one-level ASPJ query to compute derivations -Concatenate local results to form derivation of whole view tuple list
Derivation Tracing in a Warehousing Environment Until now, always assumed that all base relations and intermediate results accessible for tracing computations. By looking at DWH as integration of heterogeneous and distributed data sources, following problems may arise: - Efficiency problem : querying remote sources with join and aggregation operation, recomputing intermediate results for each query very inefficient in the distributed environment. - Consistence problem : view refreshing and view maintenance problems - Legacy sources : Views defined on inaccessible, legacy sources not traceable
Derivation Tracing in a Warehousing Environment Solutions to the intermediate aggregation results problem: The “Recomputation Approach“: -Recompute the intermediate results for each query. -No permanent extra storage, tracing process takes much longer, especially in distributed environments. The “Materialized View“ Approach -Maintain auxiliary materialized views with intermediate results in the DWH. -Less computation required but.. -extra storage and maintenance costs
Storing Derivation Views in a Multi-Source DWH Various strategies for storing auxiliary views : simple extreme solution : store all base relations or store less information with higher tracing computation cost. Goal : intermediate scheme with low tracing cost and modest extra storage and maintenance costs. Idea : Derivation Views -break down view definition again to ASPJ segments. -Only view defined by lowest segments directly over base relations -Those views can be computed by a simple selection with split operator
Storing Derivation Views in a Multi-Source DWH All_clothing DV_All_clothing
Warehousing System Supporting Derivation Tracing I - Auxiliary view AllClothing maintained for tracing tuples from Clothing - AllClothing records the intermediate aggregation results - to trace tuples from AllClothing, derivation view DV_AllClothing is maintained.
Warehousing System Supporting Derivation Tracing I
Tracing the Lineage of View Data in a Warehousing Environment Questions ?