# Top k Knapsack Joins and Closure Early Results Witold LITWIN & Thomas Schwarz U. Paris Dauphine, France

## Presentation on theme: "Top k Knapsack Joins and Closure Early Results Witold LITWIN & Thomas Schwarz U. Paris Dauphine, France"— Presentation transcript:

Top k Knapsack Joins and Closure Early Results Witold LITWIN & Thomas Schwarz U. Paris Dauphine, France witold.litwin@dauphine.frwitold.litwin@dauphine.fr Santa Clara U., CA, tjschwarz@scu.edutjschwarz@scu.edu 1

Knapsack Join (KS-Join) The join defined by the sum of the join attributes being at most some constant Father of 4 kids wishing to buy toys for at most 100€ total A person wishing to buy a computer tower, a screen, a printer and a desk for at most 1000 € …. 2

Knapsack Join (KS-Join) Traditional join – R1 Join R2 on c1 = c2 KS - join – R1 Join R2 on c1 + c2 ≤ C Syntax legal for FROM clause in Access, SQL Server… 3

Top k Knapsack Join (KS-Join) Top k items with respect to the descending order on the constant Usually, only a few items the most close to the constant are of interest Select TOP 1 * from Toys T1, Toys T2, Toys T3, Toys T4 Where T1.Price + T2.Price + T3.Price + T4.Price ≤ 100 and T1.Id < T2.Id and T2.Id < T3.Id and T3.Id < T4.Id Order by T1.Price + T2.Price + T3.Price + T4.Price Desc; 4

Top k Knapsack Join (KS-Join) Top k Knapsack joins are of obvious interest How DBMSs deal with ? Nested loop – To our best knowledge Result: execution time makes the SQL capability useless for a larger data set – Consider our example for just 1000 toys to choose from – FYI, 1K-tuple table & 3-way KS-join killed SQL Server 5

Our Goal : Optimizing Top k KS-Joins Algorithms provably faster than usual nested loop – Formulate the algorithm – Prove the complexity, storage & processing costs KS-optimized Nested Loop Self-join Nested Loop Sort Merge KS – Join Indices Distributed KS – Join Indices 6

Our Goal : Optimizing Top k KS-Joins Early Results Only for Top k KS-Joins (TkKS-Joins) Only the formal analysis as yet Many variants of TkKS-Join queries left for future work – See the paper 7

Knapsack Problem (KP) NP-hard optimization problem Among most studied Input: – A set O of objects {o 1,...o n } – An m-d subspace called knapsack K with – values b i, 1 ≤ i ≤ m, represent each the i-th dimension's capacity of the knapsack – Vector c j represents the benefit of the object j if in the knapsack 8

Knapsack Problem (KP) Input (continued): – The knapsack's constraints matrix with entries a i,j ; 1 ≤ j ≤ n ; – Each entry stores the constraint value for each object j in each dimension i (price, size, volume...). Output: – A set O' of objects stored in the knapsack. 9

Knapsack Problem (KP) Binary variable x j ; x j  {0, 1}, indicates the selection of the object j into the knapsack – (x j = 1) for object j in and (x j = 0) otherwise – x j is 0–1 decision variable 10

Knapsack Problem (KP) Select the elements of O’ which maximize the total profit of the selected objects Provided the match of the knapsack constraints 11

Knapsack Problem (KP) Formally, maximize: Subject to: 12

Knapsack Problem (KP) The most frequently investigated case is the 1-d one – I.e., i = 1 Often, or perhaps even the most often, the KP concept designates implicitly this case. Frequently, in addition, one also sets every c j to c j = a j. Both conditions are ours below – unless we state otherwise The m-d one is referred to then, if needed, as multidimensional (MKP). 13

Knapsack Problem (KP) The general research orientation for KP and MKP Find a heuristic providing acceptable approximate result – For the possibly largest data set – In the fastest time, – Or acceptable time – Given necessary constraints on the computer system used. 14

KP / TkKS -Join Our research orientation follows the database approach Find an exact result – For a reasonably practical problem subspace – For a database size data Say, 1Ktuples per table at least 15

KP / TkKS -Join Find an exact result (continued) – In the fastest time – Or acceptable time Minutes at most – Given necessary constraints on the computer system used Mainly storage cost 16

Knapsack Problem (KP) Our reasonably practical problem subspace at present: – As we already stated c j = a j – 1-d space – Fixed # of objects for the knapsack Join instead of closure 17

Knapsack Problem (KP) Our reasonably practical problem subspace at present (continued): – One tuple = one potential selection – One object = one tuple with distinct ID No objects selected twice in a tuple for the knapsack Closure, MKP… left for the future 18

Nested loop TkKS-Join Basic cost for tables with n 1 …n m tuples – O (n 1 *…*n m ) To accelerate the calculus start with: – Evaluation of the restrictions t i < C – Evaluation of t i ≤ C – (Min 1 +…+Min j +…+Min m ) for any j ≠ i DBMS may easily maintain the Min j statistics Cost can be O(m) or even O (1) only – Idem for C ≥ Max 1 +…+Max m ? 19

Nested loop TkKS-Join Self-joint of a table with its copies Since KS-join is commutative one may avoid doubles – E.g. if we have tuple (t 1, t 2 ) then we should not have the tuple tuples (t 2, t 1 ) – In general, we need only one tuple from all its permutations The optimizing cuts the complexity and calculus time by half, at least Final word: we may have – O (n 1 *…*n m /S), where S ≥ 1 20

Sort-Merge TkKS-Join 2-way join 21 C =150 150

Sort-Merge TkKS-Join Processing cost of 3-way TkKS-Join – O (n 1 +n 2 ) in general – O ((n 1 +n 2 )/2) for self-join For n-way TkKS-Join – O (n m *…n 3 (n 2 + n 1 )) in general – For self-join ? E.g. For 16K-tuple R1 and R2 tables m-way join accelerates 8K times – 1sec instead of 2+ hours See the paper 22

KS-Join Index A relational table I KS with at least the attributes (C, t 1.Id,…, t m.Id) – Here C = t 1.c+…+t m.c – Also t 1.Id <… < t m.Id Can be also seen as a materialized view Some or all t i.c should be useful as well E.g. for queries with additional restrictions on individual prices 23

KS-Join Index I KS should be implemented as file sorted on C first – Then, on other key or non-key attributes of interest – E.g., a B-tree or trie… Storage cost: – O (n 1 *…*n m ) in general – Half of it or less for copies of the same table 3-way indices may be in RAM More should be typically on flash or disk 24

KS-Join Index Processing cost – O (Log p (n 1 *…*n m ) ) or less, according to the storage cost, where p is the tree fan-out Expected practical figures – ms for RAM, e.g., 3-way KS-Join index for 1K-tuple tables – under 10 ms for flash – under 100 ms for the disk, e.g., 4-way KS-Join index for our 1K-tuple tables 25

KS-Join Index Maintainance cost – High processing cost – E.g., 1 insert into our 1K tables generates 1M new entries Main drawback of KS-Indices at present Efficient processing is an open problem 26

Composing KS-Join Indices TkKS-Join calculus can compound existing KS- Indices m-way & n-way indices may speed up (m+n)- way TkKS Join Through the sort-merge algorithm applied to both indices Seconds may suffice for up to 6-way joins – E.g., for our 1K relations 27

Scalable-Distributed TkKS-Join Index Speeds up the calculus of even larger joins Using the parallel distributed processing Dozens of seconds may suffice for an 8-way join – Over our favorite 1Ktuple relations – With two 4-way KS-Indices – Each being distributed over 1K nodes – Through, e.g., RP* SDDS Maintainance time speeds-up as well 28

Scalable-Distributed TkKS-Join Index C = 900 ; arrows show nodes to join in parallel 29 100350800 9900 1050450700

Conclusion TkKS-Joins are potentially useful Our optimizations may speed up the processing by orders of magnitude Queries with TkKS-Joins become then practical With all the usual disclaimers, the results appear ready for mainstream DBMSs 30

Future Work Deeper formal analysis Experiments More TkKS-Join query types – See the paper Thank You for Your Attention 31

Download ppt "Top k Knapsack Joins and Closure Early Results Witold LITWIN & Thomas Schwarz U. Paris Dauphine, France"

Similar presentations