“Lineage/Provenance” Workgroup Report
Birgitta, Amol, Ihab, Thomas, Anish, Martin, Matthias

2 Why lineage?
Users – Want lineage
  – Trace results
      Huge data warehouses, complicated queries/views
      Understand processes/workflows (biomedical databases, genomics databases, etc.)
Systems – Need lineage
Closed & complete representation models
  – Typically: Boolean constraints among tuples

3 Lineage in Uncertain and Probabilistic Databases
Closedness of operations
Complete representation models
Capture semantics of relational operations w/constraints on the data
  – Extensional semantics
      Identify “safe” plans
  – Intensional semantics
      Need to track constraints
  – Recursive vs. transitive
Query processing issues/opportunities
  – Highly system-specific
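To make the extensional/intensional distinction concrete, here is a minimal Python sketch. It assumes tuple-independent inputs and a lineage formula in DNF; the names `lineage_probability`, `naive_extensional`, and the tuple ids `r1`, `s1`, `s2` are hypothetical and chosen only for illustration. The intensional route evaluates the tracked Boolean constraint over all possible worlds, while the naive extensional route multiplies probabilities as if the clauses were independent.

```python
from itertools import product


def lineage_probability(dnf, probs):
    """Exact probability that a DNF lineage formula is true.

    dnf   -- list of clauses, each a set of tuple ids (a conjunction)
    probs -- dict: tuple id -> marginal probability (independence assumed)
    """
    variables = sorted({v for clause in dnf for v in clause})
    total = 0.0
    # Enumerate every possible world (exponential; illustration only).
    for world in product([False, True], repeat=len(variables)):
        present = {v for v, bit in zip(variables, world) if bit}
        if any(clause <= present for clause in dnf):
            weight = 1.0
            for v, bit in zip(variables, world):
                weight *= probs[v] if bit else 1.0 - probs[v]
            total += weight
    return total


def naive_extensional(dnf, probs):
    """Treats clauses as independent events; wrong when clauses share tuples."""
    p_none = 1.0
    for clause in dnf:
        p_clause = 1.0
        for v in clause:
            p_clause *= probs[v]
        p_none *= 1.0 - p_clause
    return 1.0 - p_none


# Hypothetical join result whose lineage is (r1 AND s1) OR (r1 AND s2).
probs = {"r1": 0.5, "s1": 0.5, "s2": 0.5}
lineage = [{"r1", "s1"}, {"r1", "s2"}]

exact = lineage_probability(lineage, probs)  # 0.5 * (1 - 0.5 * 0.5) = 0.375
naive = naive_extensional(lineage, probs)    # 1 - (1 - 0.25) ** 2 = 0.4375
```

The two numbers differ because both clauses depend on `r1`. A plan is "safe" in this sense when the extensional computation is guaranteed to match the exact one; otherwise the constraints have to be tracked.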

4 Approximations
Granularities of lineage
  – Schema-level, record-level, external
Avoid expensive cases
  – E.g.: WHERE count(*)=3 → (A & B & C) OR (B & C & D) OR …
      → Approximate lineage distributions for “expensive” predicates
      → Convolution-like summary of the impact of input tuples on the output distribution
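One way to read the "convolution-like summary" is sketched below in Python. Instead of expanding the lineage of a count predicate into a disjunction over all 3-subsets, the distribution of the count is built up one input tuple at a time. This is an illustrative sketch, not the workgroup's method: it assumes independent input tuples, and the function name `count_distribution` is hypothetical.

```python
def count_distribution(probs):
    """Distribution of the number of present tuples.

    probs -- list of marginal probabilities, one per input tuple
             (tuples are assumed independent).
    Returns a list dist where dist[k] = P(count == k).
    """
    dist = [1.0]  # with no tuples, the count is 0 with certainty
    for p in probs:
        nxt = [0.0] * (len(dist) + 1)
        for k, mass in enumerate(dist):
            nxt[k] += mass * (1.0 - p)   # tuple absent: count unchanged
            nxt[k + 1] += mass * p       # tuple present: count + 1
        dist = nxt
    return dist


# Four hypothetical input tuples A, B, C, D, each present with probability 0.5.
dist = count_distribution([0.5, 0.5, 0.5, 0.5])
p_count_3 = dist[3]  # C(4, 3) / 2**4 = 0.25
```

Each step convolves the running distribution with one tuple's two-point distribution, so the cost is quadratic in the number of tuples rather than growing with the number of 3-subsets.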

5 Uncertainty in the Lineage Itself
Not sure where information comes from (→ external source), attach confidences to lineage?
Uncertainty in data integration
“Probabilistic rules”
Anonymize lineage/show multiple explanations
Aggregate lineage/granularity

6 Privacy Issues
May not be allowed to expose exact lineage
Query lineage, explain lineage, or use summaries/approximations

7 Relation to Graphical Models
Encoding issues
  – E.g., Bayesian nets, additional CPTs
Qualitative issues
  – Changes in the uncertainty, inference
  – Exploit metadata/relationships between input variables
  – Updates in the lineage

8 Presentation of Lineage
Navigate through different granularities
Aggregate lineage/show summaries

9 Data Integration w/Lineage
Patch lineage pointers?
Identify regularities/common patterns in lineage to reduce uncertainty
Detect dependencies among data items from different databases
Supporting data mining tasks, lineage as additional metadata

