Enumerating Large Query Results Benny Kimelfeld IBM Almaden Research Center Sara Cohen The Hebrew University of Jerusalem Yehoshua Sagiv The Hebrew University of Jerusalem 25th International Conference on Data Engineering Shanghai, 2009 ICDE2009

Large Query Results
[Figure: a query whose result is a huge number of answers, arriving over time.]
Many answers! Can’t you be faster? Bad answers? Maybe a new query?

Tutorial Goal
In today’s world, users are not willing to wait for answers
–Online querying: provide some (“top-k”) results, then use paging for the remaining results
Previous work on returning top-k answers (heuristics) often does not guarantee:
–Fast runtime
–Best results
In this talk:
–The goal is not to present solutions to specific problems
–The goal is to present general techniques for efficient (ranked) enumeration with guarantees

Overview
–Introduction
–Lawler-Murty’s Ranked Enumeration
–Maximal Answers under Hereditary Properties
–Additional Techniques
–Summary & Concluding Remarks

Tractability of Enumeration
–Decision: input x; output Yes | No (1 bit)
–Optimization: input x; output y = opt{ z | property_x(z) } (usually, O(|x|) bits)
–Enumeration: input x; output a_1, a_2, a_3, …, a_{2^|x|}, …, a_m
For decision and optimization algorithms, efficient means polynomial time, linear time, log-space, …
For an enumeration algorithm: what is “efficient”?

Standard Notions of Tractability
Combined complexity: input = data + query
–Often implies that any algorithm must be exponential in the worst case
–That doesn’t help! What meaning can be given to the notion of efficient? What about special cases where the output happens to be small?
Data complexity: fix the query; input = data
–Often implies a polynomial bound on the size of the output
–But then the core problem is missed: the output is no longer “huge”
What else?

Tractability of Enumeration
–Polynomial total time: running time is polynomial in input + output
–Incremental polynomial time: delay before answer i is polynomial in input + i
–Polynomial delay: delay between successive answers is poly(input)
Each notion implies the one listed before it (polynomial delay ⇒ incremental polynomial time ⇒ polynomial total time)
If answers are ranked, we prefer enumeration in ranked order

Examples of Complexity Results
[Figure: each result below is placed next to one of the three notions: polynomial total time, incremental polynomial time, polynomial delay.]
–Acyclic CQs [Yannakakis81]
–Acyclic CQs w/ monotonic ORDER BY [KS06]
–Not general CQs! [ChandraMerlin77]
–Full Disjunctions [KanzaS03]
–Full Disjunctions [CS05]
–Full Disjunctions [C. et al. 06]
–Max Cliques [Johnson et al. 88]
–Loopless paths by inc. weight [Yen71]
–Horn-clause solutions [CreignouHebrard97]

Intuition: What’s the Problem?
We need to create all the answers without needless repetition
–That is, when we print an answer to the output, we need to validate that it hasn’t been printed previously
Recursive algorithm executions can take exponential time (many sub-answers may lead to empty results; we can’t wait that long…)
When enumerating in ranked order, we cannot generate all answers and then sort
–Otherwise, we get neither polynomial delay nor incremental polynomial time

Bottom Line: Lawler-Murty gives a general reduction
Enumeration in ranked order → Optimization under constraints
–Optimization under constraints: find the top answer under inclusion & exclusion constraints (often quite simple, but not always!)
–If the optimization takes poly. time, then the enumeration has poly. delay

Problem Formulation
Input: O = a collection of objects
Answers: A = a set of answers a ⊆ O; huge, implicitly described by a constraint over O’s subsets…
score( a ) is high means a is a high-quality answer
Goal: Enumerate all the answers in ranked order (that is, by decreasing score), e.g., 32, 31, 28, …
Required complexity: Polynomial delay
Special case: Top-k answers

Example 1: Graph Search
Input: Data graph G, set of keywords K
O = The nodes of the graph G
A = Node sets a of size |K| that contain all the keywords of K
score( a ) = 1 / (min size of a subtree containing a)

Example 2: k-Best Perfect Matchings
Input: Weighted, bipartite graph G
O = Edges of the graph G
A = Matchings: sets of edges that are pairwise disjoint & cover all nodes
score( a ) = ∑_{e ∈ a} weight(e)

Example 3: Ranked Queries
Input: Database D, query Q
O = Mappings: (Query symbol → DB item)
A = Matches a of the query in the database
score( a ): IR / ORDER BY / …

What’s the Problem?
–1st (top) answer (e.g., score 32): an optimization problem. Assumption: efficiently solvable
–2nd answer (e.g., score 31): ?
–…
–k-th answer (e.g., score 17): must be ≠ the previous (k-1) answers and best among the remaining answers
Conceivably, much more complicated than finding the 1st? How to handle this constraint? Moreover, k may be very large!

Lawler-Murty’s Method [K. G. Murty, 1968] [E. L. Lawler, 1972]

1. Find & Print the Top Answer In principle, at this point we should find the second-best answer. But instead…

2. Partition the Remaining Answers Each partition is defined by a distinct set of simple constraints

3. Find the Top of each Set

4. Find & Print the Second Answer Next answer: best among all the top answers in the partitions

5. Further Divide the Chosen Partition … and so on …

A Partition is Defined by Constraints
Two types of constraints:
–Inclusion constraint: “Must contain object o”
–Exclusion constraint: “Must not contain object o”
A partition is defined by a set I ∪ E of inclusion and exclusion constraints
Recall: a is the top answer of the partition defined by I ∪ E. How do we further partition after removing a?

Partitioning a Partition
a = top(partition); the set to split is the partition ( I, E ) minus { a }
It is split into sub-partitions P 1 = ( I 1, E 1 ), P 2 = ( I 2, E 2 ), …, P 5 = ( I 5, E 5 ), each extending I and E with further inclusion (✓) and exclusion (✗) constraints
[Figure: the rows of ✓/✗ constraints defining the original partition and each of P 1, …, P 5.]

Complementary Details
A partition is represented as a triple ( I, E, a )
–I and E are sets of inclusion and exclusion constraints, resp. (lists of objects); a is the top answer of the partition
Current triples ( I, E, a ) are stored in a priority queue Q, prioritized by score( a )
–Initially, Q = { ( ∅, ∅, a opt ) }
In each iteration,
1. The top triple ( I t, E t, a t ) is extracted from Q
2. The answer a t is printed
3. The new nonempty sub-partitions ( I i, E i, a i ) are inserted into Q
… until Q is empty
–Top-k: until k answers have been printed
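The loop above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `top(I, E)` is a hypothetical oracle that returns the best answer containing I and avoiding E (or `None`), the sub-partition scheme ("agree with a on the first i-1 free objects, disagree on the i-th") is one standard way to split a partition minus its top answer, and the 3x3 weight matrix is made up. The brute-force `top` only stands in for a real constrained optimizer.

```python
import heapq
from itertools import count, permutations


def lawler_murty(objects, top, score):
    """Yield answers (frozensets of objects) by decreasing score."""
    objects = list(objects)
    tie = count()  # tie-breaker, so the heap never compares two answers
    queue = []

    def push(inc, exc):
        best = top(inc, exc)  # top answer of the partition (inc, exc)
        if best is not None:  # only nonempty partitions enter the queue
            heapq.heappush(queue, (-score(best), next(tie), inc, exc, best))

    push(frozenset(), frozenset())
    while queue:
        _, _, inc, exc, a = heapq.heappop(queue)
        yield a
        # Split (partition - {a}) into disjoint sub-partitions.
        agree_inc, agree_exc = inc, exc
        for o in objects:
            if o in inc or o in exc:
                continue
            if o in a:  # disagree with a here: exclude o
                push(agree_inc, agree_exc | {o})
                agree_inc = agree_inc | {o}
            else:       # disagree with a here: include o
                push(agree_inc | {o}, agree_exc)
                agree_exc = agree_exc | {o}


# Toy instance (hypothetical): perfect matchings of a complete 3x3 bipartite
# graph; edge (i, j) has weight W[i][j].
W = [[7, 2, 1], [3, 6, 4], [5, 8, 9]]
EDGES = [(i, j) for i in range(3) for j in range(3)]


def score(matching):
    return sum(W[i][j] for i, j in matching)


def top(inc, exc):
    """Brute-force stand-in for a constrained optimization algorithm."""
    candidates = (frozenset(enumerate(p)) for p in permutations(range(3)))
    feasible = [m for m in candidates if inc <= m and not (m & exc)]
    return max(feasible, key=score, default=None)


ranked = list(lawler_murty(EDGES, top, score))
```

Because `lawler_murty` is a generator, a top-k variant only needs to stop pulling answers after k of them, for example with `itertools.islice`.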

Enumeration → Optimization
Bottom line: Lawler-Murty gives a reduction
Enumeration in ranked order → Optimization under constraints
–Optimization under constraints: find the top answer under inclusion & exclusion constraints (often quite simple, but not always!)
–If the optimization takes poly. time, then the enumeration has poly. delay
Example: Perfect matchings by decreasing weight

Perfect Matchings by Dec. Weight
[Figure: a bipartite graph; edges have weights (not specified).]
Top matching: a perfect matching such that the total sum of edge weights is maximal
Efficiently solvable: Hungarian Algorithm (1955), Blossom Algorithm (1965), …

Max. Perfect Matching w/ Constraints [Figure: the same graph, with exclusion (✗) and inclusion (✓) constraints marked on edges; edges have weights (not specified).]

Handling Exclusion Constraints Excluded edges are simply removed!

Handling Inclusion Constraints Non-inclusion edges incident to nodes of inclusion edges are removed

That’s All! It is now the original problem (w/o constraints)! So, we can use the Hungarian Algorithm, the Blossom Algorithm, … Result: perfect matchings by decreasing weight
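The two edge-removal rules can be sketched as below. Everything here is illustrative: the node names, the weights, and the helper names are hypothetical, and the brute-force solver is only a stand-in for the Hungarian Algorithm. The sketch also assumes, as happens inside Lawler-Murty, that the inclusion edges are pairwise disjoint and not excluded.

```python
from itertools import permutations


def apply_constraints(weights, include, exclude):
    """Return the weighted edges that survive the constraints."""
    pinned_left = {u for u, _ in include}
    pinned_right = {v for _, v in include}
    kept = {}
    for (u, v), w in weights.items():
        if (u, v) in exclude:
            continue  # excluded edges are simply removed
        if (u, v) not in include and (u in pinned_left or v in pinned_right):
            continue  # non-inclusion edge touching a node of an inclusion edge
        kept[(u, v)] = w
    return kept


def max_perfect_matching(weights, left, right):
    """Unconstrained optimum by brute force (stand-in for Hungarian)."""
    best, best_weight = None, None
    for perm in permutations(right):
        matching = list(zip(left, perm))
        if all(edge in weights for edge in matching):
            total = sum(weights[edge] for edge in matching)
            if best is None or total > best_weight:
                best, best_weight = frozenset(matching), total
    return best


def constrained_top(weights, left, right, include, exclude):
    """Top matching under constraints = plain optimum on the reduced graph."""
    return max_perfect_matching(apply_constraints(weights, include, exclude), left, right)


LEFT, RIGHT = ["l0", "l1", "l2"], ["r0", "r1", "r2"]
WEIGHTS = {
    ("l0", "r0"): 7, ("l0", "r1"): 2, ("l0", "r2"): 1,
    ("l1", "r0"): 3, ("l1", "r1"): 6, ("l1", "r2"): 4,
    ("l2", "r0"): 5, ("l2", "r1"): 8, ("l2", "r2"): 9,
}

with_inclusion = constrained_top(WEIGHTS, LEFT, RIGHT, {("l0", "r1")}, set())
with_both = constrained_top(WEIGHTS, LEFT, RIGHT, {("l0", "r1")}, {("l2", "r2")})
```

The point of the design is that the solver never sees the constraints: they are compiled away into the input graph.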

More: Keyword Proximity Search
Lawler-Murty’s method was used in [KS06] for solving the problem of keyword proximity search
–Input: Data graph G, set of keywords K
–Answers: Non-redundant subtrees of G that contain K
–Score: 1/(total weight)
In other words, “top-k Steiner trees”
2 problems:
1. Opt. w/ constraints is NP-hard, even for 2 kw’s. Solution: constraints are carefully constructed so that only tractable constraints are generated
2. No bound on #kw’s → NP-hard even w/o constraints
 ▪ A bound on K is often reasonable (data complexity)
 ▪ Otherwise, approximations can be used

Ranked vs. Approximate Order
–Ranked order: if a is printed before b, then score( a ) ≥ score( b ). Example: 32, 31, 28, 27, 21, 12
–C-approximate order: if a is printed before b, then score( b ) ≤ C · score( a ). Example: 31, 32, 28, 27, 21, 12
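As a small illustration of the two notions, the sketch below checks whether a printed sequence of scores is in C-approximate order. It assumes the reading "an answer printed later never has a score above C times the score of an earlier answer" and assumes positive scores; the function name is hypothetical.

```python
def is_c_approximate_order(scores, c):
    """True if every later score is at most c times every earlier score."""
    lowest_so_far = float("inf")
    for s in scores:
        if s > c * lowest_so_far:  # s exceeds C times some earlier score
            return False
        lowest_so_far = min(lowest_so_far, s)
    return True


exact = [32, 31, 28, 27, 21, 12]
swapped = [31, 32, 28, 27, 21, 12]
```

With C = 1 the check coincides with exact ranked order, so `swapped` fails it but passes for a slightly larger C.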

Generalized Lawler-Murty
Lawler-Murty’s reduction can be generalized:
Enumeration in a C-approximate ranked order → Approximate optimization
–Approximate optimization: find a C-approximation of the top answer under inclusion & exclusion constraints
–If the approximation takes poly. time, then the enumeration has poly. delay

Bottom Line
The Hereditary Properties algorithm gives a reduction:
Enumeration → Input-Restricted Enumeration (often quite simple, but not always!)
–If the restricted version takes poly. time, then poly. delay
–If the restricted version takes inc. poly. time, then inc. poly. time
–If the restricted version takes poly. total time, then poly. total time

Problem Formulation
Input: O = a collection of objects; P = a property that is (1) polynomially verifiable and (2) hereditary or connected-hereditary
Answers: A = the maximal subsets a ⊆ O that satisfy the property P
Goal: Enumerate all the maximal answers efficiently

Maximal Answers: Details
Given P and O, a subset a of O is a maximal answer if:
1. a satisfies P, and
2. there is no additional object o that can be added to a while preserving P
[Figure: subsets of O, marked “satisfies P” and “does not satisfy P”.]
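The two conditions translate directly into a check. The sketch below is illustrative only: `satisfies` is a hypothetical predicate standing for P, and the small graph with "is a clique" as the property is a made-up example.

```python
from itertools import combinations


def is_maximal_answer(a, objects, satisfies):
    """Condition 1: a satisfies P. Condition 2: no object can be added."""
    a = frozenset(a)
    if not satisfies(a):
        return False
    return not any(satisfies(a | {o}) for o in objects if o not in a)


# Toy graph (hypothetical): a triangle 1-2-3 with a pendant edge 3-4.
NODES = [1, 2, 3, 4]
EDGES = {frozenset(e) for e in [(1, 2), (2, 3), (1, 3), (3, 4)]}


def is_clique(nodes):
    return all(frozenset(pair) in EDGES for pair in combinations(nodes, 2))
```

For example, `is_maximal_answer({1, 2}, NODES, is_clique)` is false because node 3 can still be added.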

Maximal Answers: Full Disjunctions = Generalization of Outer-Join Operator
Climates (Country, Climate):
–Canada, diverse
–Bahamas, tropical
–UK, temperate
Accommodations (Country, City, Hotel, Stars):
–Canada, Toronto, Plaza, 4
–Canada, London, Ramada, 3
–Bahamas, Nassau, Hilton, (no value)
Sites (Country, City, Site):
–Canada, London, Air Show
–Canada, (no value), Mount Logan
–UK, London, Buckingham
–UK, London, Hyde Park
P = “join consistent and connected”
A = … [Figure: the maximal sets of join-consistent and connected tuples, shown by tuple number.]

Types of Properties
P is polynomially verifiable if we can check in polynomial time whether a set a satisfies P
P is hereditary if: a satisfies P ⇒ a’ satisfies P, ∀ a’ ⊆ a
Suppose there is a binary relationship defined over the objects (i.e., they are graph nodes)
P is connected hereditary if: a is connected and a satisfies P ⇒ a’ satisfies P, ∀ connected a’ ⊆ a
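For intuition, heredity can be tested exhaustively on a tiny universe. The sketch below is a brute-force check (exponential, so only for toy inputs) that relies on a simple observation: P is hereditary exactly when removing any single object from a satisfying set keeps P. The path graph and both predicates are hypothetical examples.

```python
from itertools import combinations


def all_subsets(objects):
    objects = list(objects)
    for size in range(len(objects) + 1):
        for combo in combinations(objects, size):
            yield frozenset(combo)


def is_hereditary_on(objects, satisfies):
    """Check heredity on this universe by removing one object at a time."""
    for a in all_subsets(objects):
        if satisfies(a) and not all(satisfies(a - {o}) for o in a):
            return False
    return True


# Toy graph (hypothetical): the path 1 - 2 - 3.
NODES = [1, 2, 3]
EDGES = {frozenset(e) for e in [(1, 2), (2, 3)]}


def is_clique(nodes):
    return all(frozenset(pair) in EDGES for pair in combinations(nodes, 2))


def is_connected(nodes):
    nodes = set(nodes)
    if not nodes:
        return True
    seen, stack = set(), [next(iter(nodes))]
    while stack:
        u = stack.pop()
        if u in seen:
            continue
        seen.add(u)
        stack.extend(v for v in nodes if v not in seen and frozenset((u, v)) in EDGES)
    return seen == nodes
```

"Is a clique" passes the check, while "is connected" fails it ({1, 2, 3} is connected but {1, 3} is not), which is why connectivity-based properties need the connected-hereditary notion.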

Examples of Properties
Hereditary properties:
–Is a Clique
–Is a Bipartite Matching
–Is a Forest
Connected-hereditary properties:
–Is a Tree
–Is Join Consistent and Connected
–Is Homomorphic to a Subtree of a given Labeled Tree
Not polynomially verifiable:
–Is 3-colorable

Example 1: Full Disjunctions
Input: FD query Q, database D
O = The tuples of D from the relations in Q
A = Maximal sets a of join consistent and connected tuples
P = “join consistent and connected”

Example 2: Maximal Tree Answers
Input: Tree query Q, XML doc X
O = The nodes of X
A = Maximal sets a of nodes, such that a induces a subtree homomorphic to a subtree of Q
P = “homomorphic to a subtree of Q”

Example 3: Maximal Bipartite Matchings
Input: Bipartite graph G
O = The edges of G
A = Maximal matchings
P = “is a matching”

Example 4: Maximal Cliques
Input: Graph G
O = The nodes of G
A = Maximal cliques
P = “is a clique”

Strategy
Recall:
–Lawler-Murty’s method reduced the enumeration problem to an optimization problem
–Runtime: polynomial delay if the optimization is in polynomial time
For this problem:
–Reduce the enumeration problem to a restricted version of itself
–Runtime depends on that of the restricted enumeration problem

The Restricted Version
Input: O = a collection of objects that almost satisfies P; P = a property that is (1) polynomially verifiable and (2) hereditary or connected-hereditary
Answers: A = the maximal subsets a ⊆ O that satisfy the property P
Goal: Enumerate all the maximal answers efficiently

Almost Satisfies O almost satisfies P if there is an object o ∈ O, such that O - { o } satisfies P
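A direct transcription of the definition, with "is a matching" as a made-up property over edges written as (left node, right node) pairs. The function and variable names are hypothetical.

```python
def almost_satisfies(objects, satisfies):
    """True if removing some single object leaves a set that satisfies P."""
    objects = frozenset(objects)
    return any(satisfies(objects - {o}) for o in objects)


def is_matching(edges):
    """No two edges share a left node or a right node."""
    edges = frozenset(edges)
    lefts = {u for u, _ in edges}
    rights = {v for _, v in edges}
    return len(lefts) == len(edges) == len(rights)


one_extra_edge = {("a", "x"), ("b", "y"), ("a", "y")}
full_square = {("a", "x"), ("a", "y"), ("b", "x"), ("b", "y")}
```

Here `one_extra_edge` almost satisfies the property (drop the edge a-y), whereas `full_square` does not, since any three of its edges still share a node.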

Example: Maximal Bipartite Matchings This set of edges almost satisfies “is a bipartite matching” Enumeration problem: Find all maximal bipartite matchings from this set of edges

Example: Maximal Bipartite Matchings One maximal bipartite matching

Example: Maximal Bipartite Matchings Another maximal bipartite matching For the restricted enumeration problem there are always at most 2 answers, and they can be found in polynomial time
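The "at most 2 answers" claim can be made concrete for the case where the input is a matching plus one extra edge. The sketch below is an illustration with hypothetical names, not code from the talk; edges are (left node, right node) pairs.

```python
def restricted_matchings(matching, extra):
    """Maximal matchings inside matching + {extra}; there are at most two."""
    matching = frozenset(matching)
    u, v = extra
    # Edges of the matching that share a node with the extra edge.
    clash = {e for e in matching if e[0] == u or e[1] == v}
    if not clash:
        return [matching | {extra}]  # the whole input is already a matching
    return [
        matching,                      # answer 1: leave the extra edge out
        (matching - clash) | {extra},  # answer 2: keep it, drop what clashes
    ]


M = {("a", "x"), ("b", "y"), ("c", "z")}
answers = restricted_matchings(M, ("a", "y"))
```

Both answers are found with a single pass over the matching, which is the polynomial-time behavior the restricted version needs.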

Reduction Complexity Results
Given an algorithm that solves the restricted version, we will show an algorithm that solves the general (unrestricted) version
Complexity of Restricted Version → Complexity of Unrestricted Version:
–Polynomial → Poly. Delay
–Inc. Poly. Time → Inc. Poly. Time
–Poly. Total Time → Poly. Total Time

Our Method [CKS, To appear]

1. Find & Print & Store a Maximal Answer One answer can always be found in polynomial time. Now we should look for another answer. But instead…

2. For each Remaining Object not in the Set Add the object to the set of items found, to create a restricted version U of the problem

3. Enumerate Solutions to the Restricted Problem Enumerate solutions for U (this is the reduction)

4. For each Solution to the Restricted Problem Maximally extend it to get an answer to the original problem (extending is always polynomial) … and so on … Continue to enumerate answers to U. Continue to add other nodes from O to form new sets for U
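Steps 1 to 4 can be sketched as a queue-based loop. This is a simplified variant written for illustration, not the authors' algorithm: it covers hereditary properties only, it assumes the empty set satisfies P, it stores every answer in a plain set instead of an index structure, and it prints each answer when it leaves the queue. `satisfies` and `restricted` are hypothetical callbacks, instantiated here for maximal cliques on a made-up graph.

```python
from collections import deque
from itertools import combinations


def extend_maximally(a, objects, satisfies):
    """Greedy extension; one pass suffices when P is hereditary."""
    a = set(a)
    for o in objects:
        if o not in a and satisfies(a | {o}):
            a.add(o)
    return frozenset(a)


def enumerate_maximal(objects, satisfies, restricted):
    objects = list(objects)
    first = extend_maximally((), objects, satisfies)  # step 1
    seen, queue = {first}, deque([first])
    while queue:
        a = queue.popleft()
        yield a
        for o in objects:  # step 2: a + {o} almost satisfies P
            if o in a:
                continue
            for sub in restricted(a, o):  # step 3: restricted problem
                b = extend_maximally(sub, objects, satisfies)  # step 4
                if b not in seen:  # never repeat an answer
                    seen.add(b)
                    queue.append(b)


# Toy instance (hypothetical): maximal cliques of a small graph.
NODES = [1, 2, 3, 4, 5]
EDGES = {frozenset(e) for e in [(1, 2), (2, 3), (1, 3), (3, 4), (4, 5)]}


def is_clique(nodes):
    return all(frozenset(pair) in EDGES for pair in combinations(nodes, 2))


def restricted_cliques(clique, o):
    """Maximal cliques inside clique + {o}, where clique is a maximal clique."""
    neighbors = {v for v in clique if frozenset((o, v)) in EDGES}
    return [frozenset(clique), frozenset(neighbors | {o})]


cliques = list(enumerate_maximal(NODES, is_clique, restricted_cliques))
```

Each dequeued answer triggers one restricted problem per remaining object, so the work between two printed answers stays polynomial whenever the restricted problem is polynomial.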

Some More Details Each answer generated must be stored, so that we do not repeat answers –Use an index structure Printing actually happens at different points depending on the parity of the level of recursion –This allows for polynomial delay Memory efficient versions –For hereditary properties we have a memory efficient version

DB Problems for which this is Useful In the context of incomplete information –Look for maximal answers, not complete answers Full disjunctions in poly delay –A generalization of the outer-join to any number of relations Maximal matches to tree queries in poly delay

Ranked Order
The algorithm returns answers in arbitrary order
Question: Can we return answers efficiently in ranked order?
Answer: In general, no
–Cannot return answers with more objects first (ranking function is the number of objects in the set)
–Famous result on NP-hardness of node-deletion for hereditary and connected-hereditary properties [Lewis, Yannakakis, STOC 78]
Can return in ranked order for monotonically c-determined ranking functions (details omitted)

Enumeration → Restricted Enumeration
Bottom line: the Hereditary Properties algorithm gives a reduction
Enumeration → Input-Restricted Enumeration (often quite simple, but not always!)
–If the restricted version takes poly. time, then poly. delay
–If the restricted version takes inc. poly. time, then inc. poly. time
–If the restricted version takes poly. total time, then poly. total time

Bottom Line
The technique shown next gives a reduction:
Enumeration → Decision of non-emptiness under constraints (often quite simple, but not always!)
–If the decision takes poly. time, then the enumeration has poly. delay

Recursive Partition of the Output
Input: O = a collection of objects; A = answers a ⊆ O
Enumeration algorithm:
–Choose an object o ∈ O
–Enumerate all the answers that contain o (✓)
–Enumerate all the answers that do not contain o (✗)
We need an algorithm for a generalized problem

Generalized Enumeration
Goal: Enumerate all the answers a, s.t. I ⊆ a and a ∩ E = ∅
Enumerate( I, E ):
  If I ∪ E = O, then print( I ) and return; otherwise:
  Choose an object o ∈ O – ( I ∪ E )
  Enumerate( I ∪ { o }, E )
  Enumerate( I, E ∪ { o } )
Problem: a recursive call can be empty (no answers), which gives exponential delay!
Fix: make each recursive call only if ≥1 answers satisfy its constraints, so that I & E are always satisfiable

Reduction: Enumeration → Non-emptiness
Enumerate( I, E ):
  If I ∪ E = O, then print( I ) and return; otherwise:
  Choose an object o ∈ O – ( I ∪ E )
  If ≥1 answers satisfy I ∪ { o }, E, then Enumerate( I ∪ { o }, E )
  If ≥1 answers satisfy I, E ∪ { o }, then Enumerate( I, E ∪ { o } )
If each test takes poly. time, we get polynomial delay!
Bottom line, we get a reduction: enumerate A with polynomial delay → decide if an answer that satisfies I and E exists
Often, o should be carefully chosen…
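The pseudocode maps naturally onto a recursive Python generator. In this sketch `nonempty(I, E)` is a hypothetical decision oracle, and the toy answer set ("all subsets of exactly K objects") was chosen only because its non-emptiness test is a one-line count; objects are taken in a fixed order rather than carefully chosen.

```python
def enumerate_answers(objects, nonempty):
    """Yield every answer a with the help of a non-emptiness oracle."""
    objects = list(objects)

    def rec(inc, exc, i):
        if i == len(objects):  # I and E together cover O
            yield inc
            return
        o = objects[i]
        if nonempty(inc | {o}, exc):  # some answer contains I + {o}, avoids E
            yield from rec(inc | {o}, exc, i + 1)
        if nonempty(inc, exc | {o}):  # some answer contains I, avoids E + {o}
            yield from rec(inc, exc | {o}, i + 1)

    if nonempty(frozenset(), frozenset()):  # is there any answer at all?
        yield from rec(frozenset(), frozenset(), 0)


# Toy answer set (hypothetical): subsets of OBJECTS with exactly K objects.
OBJECTS = "abcd"
K = 2


def nonempty(inc, exc):
    return len(inc) <= K <= len(OBJECTS) - len(exc)


answers = list(enumerate_answers(OBJECTS, nonempty))
```

Because every recursive call is made only when its branch is known to contain an answer, each path down the recursion ends in a printed answer after at most |O| levels.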

Enumeration → Decision
Bottom line: the technique gives a reduction
Enumeration → Decision of non-emptiness under constraints (often quite simple, but not always!)
–If the decision takes poly. time, then the enumeration has poly. delay

Comparison with Lawler-Murty
Lawler-Murty:
–Enumeration in ranked order
–Polynomial delay
–Reduces to optimization under constraints
–Space cost can be linear in the output, possibly exp(input)
Recursive Partition:
–No order (except for very specific cases)
–Polynomial delay (usually shorter delay!!)
–Reduces to nonemptiness under constraints
–PSPACE

Recursion into Sub-Problems
[Figure: the input O, a collection of objects, is solved by recursion into sub-problems.]
Problem! The number of sub-answers can be exp(input)

Iterators over Poly-Delay Algorithms
Problem: Recursive calls enumerate many sub-answers (which are not final answers)
–We cannot let the recursive method call terminate!
Idea: Enumeration algorithm as an iterator (also called a co-routine)
–iterator.first(): Start the execution until the first answer is generated, and yield
–iterator.next(): Resume execution from the last output, until the next output (or termination); then yield
Now, instead of recursive method calls, use recursive iterators …
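Python generators behave like the iterators described here: the first `next()` call plays the role of first(), and every later call resumes execution from the last yield. The sketch below is a made-up illustration (both generator functions and the "every third sub-answer" rule are hypothetical); the `log` list records which sub-answers were actually computed.

```python
def sub_answers(n, log):
    """Inner iterator: produces sub-answers one at a time."""
    for i in range(n):
        log.append(i)  # record that sub-answer i was really computed
        yield i


def final_answers(n, log):
    """Outer iterator: pulls from the inner iterator only when needed."""
    inner = sub_answers(n, log)  # nothing runs until the first pull
    for s in inner:
        if s % 3 == 0:  # pretend every third sub-answer is a final answer
            yield ("answer", s)


log = []
it = final_answers(10**9, log)
first = next(it)              # like iterator.first()
computed_after_first = list(log)
second = next(it)             # like iterator.next()
computed_after_second = list(log)
```

Although the inner iterator could produce a billion sub-answers, only four of them are computed by the time the second final answer is returned.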

Sub-Problems + Iterators
[Figure: the input O, a collection of objects, with sub-problems handled by Iterator 1 and Iterator 2, each driven through first() and next().]

Past Uses of Techniques Keyword Proximity Search: Unranked enumeration [KS05] –Used both techniques discussed –Can be combined with heuristics to get efficient heuristically ranked enumeration Full disjunctions: [C. et al. 06] –Used iterators Maximally joining probabilistic relations: [KS07] –Recursive partition (in a non-trivial fashion)

Summary Complexity classes –Polynomial total time –Incremental polynomial time –Polynomial delay

Summary General frameworks for solving enumeration problems –Lawler-Murty: Reduction to optimization –Hereditary properties: Reduction to special case of enumeration –Recursive partition: Reduction to decision problem –Iterators

In theory, theory and practice are the same. In practice, they are not.

These have been implemented and work! Examples: Full disjunctions, Keyword proximity search (approximate, ranked)

Conclusion: Take Home Message 1 Analyze enumeration problems using complexity classes appropriate for enumeration

Conclusion: Take Home Message 2 Frameworks presented may be usable for your problem – plug and play –Allows you to focus on solving “standard” types of problems

Thank you! Questions?
