Constraint Processing Techniques for Improving Join Computation: A Proof of Concept Anagh Lal & Berthe Y. Choueiry Constraint Systems Laboratory Department.

Constraint Processing Techniques for Improving Join Computation: A Proof of Concept Anagh Lal & Berthe Y. Choueiry Constraint Systems Laboratory Department of Computer Science & Engineering University of Nebraska-Lincoln

An illustrative example  Join query SELECT R1.A,R1.B,R1.C FROM R1,R2 WHERE R1.A=R2.A AND R1.B=R2.B AND R1.C=R2.C  10 tuples in 3 nested tuples

Advantages  Direct Savings of number of tuple comparisons Savings in I/O for next operator Space reduction of materialized join queries  Future applications Use for query size estimation Assist in high-level analysis of data & in data mining

Our contributions  A new representation of a join query as a Constraint Satisfaction Problem (CSP)  A new sorting-based bundling algorithm Suitable for CSPs with fewer and larger constraints (i.e., join) Improves memory usage  A new sort-merge join algorithm for producing (dynamically) bundled tuples Yields compact representation, saves memory space  Identification of possible applications Data analysis Materialized views Assisting query-size estimation  Suggested, not yet demonstrated

Constraint Satisfaction Problem  Given P = ( V, D, C ) V = {V i }, a set of variables D = {D Vi }, the set of their respective domains C is a set of constraints restricting the acceptable combination of values for variables. Solution is a consistent assignment of values to variables  Query: find 1 solution, all solutions, etc. V3V3 {d} {a, b, d}{a, b, c} {c, d, e, f} V4V4 V2V2 V1V1

Solving CSPs  Typically, DFS & backtracking  Improvement Static bundling [Freuder 91] Dynamic bundling [our group] –Based on dynamically identifying symmetries –Guaranteed never less efficient than non-bundling, static bundling Without bundlingStatic bundling S cd, e, f d V1V1 V2V2 Dynamic bundling ce, fd d V1V1 V2V2 S cefd d V1V1 V2V2 S V3V3 {d} {a, b, d}{a, b, c} {c, d, e, f} V4V4 V2V2 V1V1

Modeling Join as a CSP  Attributes of relations  CSP variables  Attribute values  variable domains  Relations  relational constraints  Join conditions  join-condition constraints SELECT R1.A,R1.B,R1.C FROM R1,R2 WHERE R1.A=R2.A AND R1.B=R2.B AND R1.C=R2.C

Sorting-based bundling  Heuristic for variable ordering Place variables linked by join conditions as close to each other as possible R1.A R2.A R1.B R2.B R1.C R2.C R1 R2  Sort relations using above ordering  Next: Compute bundles of variable ahead in variable ordering ( R1.A )

Bundling an attribute  Partition of a constraint Tuples of the relation having the same value of R1.A  Compare projected tuples of first partition with those of another partition  Compare with every other partition to get complete bundle Partition Unequal partitions Symmetric partitions Bundle {1, 5}

Join using dynamic bundling Select next- variable Compute next valid bundle Found bundle? Last variable? Move to previous variable Undo previous assignment 1 st in Ordering? No Yes Output one tuple Start Stop Yes No Assign bundle

Finding the valid bundle {1, 5, x} {1, 5, y, z} Common {1, 5} 1.Compute a bundle for the attribute 2.Check bundle validity with future constraints 3.If no common value found GOTO 1  Assign variable with the surviving values in the bundle

Analysis of overheads  For Bundling Additional data structures: 2 arrays, 1 pointer Only 1 array may become cumbersome  Array size is largest when all the values of a variable are in one bundle But, this case also leads to best savings!  Improved implementation Use of Bitmaps?

Progressive Merge Join  PMJ: A sort-merge algorithm by [Dittrich et al. 03]  Provides early results Assists in query size-estimation  Two main phases Sorting: starts producing results in this phase Merging phase: merges sorted runs  We use the framework of the PMJ for our external join.  Implemented & evaluated with the XXL library We use the same library for our implementation

Preliminary experiments  Data sets Random: 2 relations R1, R2 with same schema as example –Each relation: 10’000 tuples –Memory size: 4’000 tuples –Page size 200 tuples Real-world problem: 3 relations, 4 attributes  Compaction rate achieved Random problem: 1.48 –Savings compensate for even worst case (of the current experimental implementation) Real-world problem: 2.26 (69 tuples in 32 nested tuples)

Related work  Join algorithms Well established algorithms Do not focus on exploiting symmetry  Database compression Output results are not compressed Compression at value level, not tuple level

Related work (contd)  [Mamoulis & Papadias 1998] Join using FC for spatial DB Restricted to binary constraints No compaction of solution space  [Bayardo et al. 1996] Reduce the number of the intermediate tuples of a sequence of joins  [Rich et al. 1993] Do not compact join attribute values Does not detect redundancy present in the grouped sub- relations

Future work  Refine implementation Use of lighter data structures  Test usefulness in the context of Constraint DBs Values are continuous intervals, e.g. spatial database  Conduct thorough evaluations of overall performance & overhead (memory & CPU) on different data distributions  Investigate benefit of using bundling query size estimation materialized views

Research supported by CAREER Award #0133568 from NSF

DB vs. CSP terminology

Bundling relations: Data structures  Considering the portion of the relation in memory  Current-Inst: To store the current instantiations of past variables V p of R1.  Current-Constraint: selection of R’: Past variable values equal Current-Inst Current variable V c > all previous instantiations of V c

Bundling relations: Computing bundles (Algorithm 1)  NEXT-PARTITION( p ) returns the first unchecked partition in Current-Constraint following the partition p.  Sorted constraints Checking equality of tuples is efficient

Bundling relations: Data structures  Processed-Values : Cumulatively stores non- representative values of bundles  Computing bundles of V c  Values of V c in it are ignored  Partition p is marked as checked when: Value(p) is in an instantiation bundle p is selected for comparing with other partitions to check for bundles

Join computation: In memory  Two subsets of relations (some pages) in memory: Algorithm to find result of joining the two. Join computed as a search –Finding all solutions After finding one solution, search resumes from same depth –Algorithm shown can be entered at any “depth” in the search Uses Algorithm 1 to find bundles for assigning to variables

Join computation: In memory  Join as a search (Algo. 2)  BACKTRACK Variable[depth] in Current-Inst reset Processed-Values for the variable emptied Value in Current- Solution reset Current-Constraint re- computed  Undoes the effects of the previous instantiation. Expanded on next slide

Join computation: In memory  COMMON(b i, bundles) subset of b i consistent using join-condition constraints  For equality COMMON Intersection  Empty result of COMMON inconsistency BACKTRACK

Constraint Processing Techniques for Improving Join Computation: A Proof of Concept Anagh Lal & Berthe Y. Choueiry Constraint Systems Laboratory Department.

Similar presentations

Presentation on theme: "Constraint Processing Techniques for Improving Join Computation: A Proof of Concept Anagh Lal & Berthe Y. Choueiry Constraint Systems Laboratory Department."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Constraint Processing Techniques for Improving Join Computation: A Proof of Concept Anagh Lal & Berthe Y. Choueiry Constraint Systems Laboratory Department.

Similar presentations

Presentation on theme: "Constraint Processing Techniques for Improving Join Computation: A Proof of Concept Anagh Lal & Berthe Y. Choueiry Constraint Systems Laboratory Department."— Presentation transcript:

Similar presentations

About project

Feedback