AnHai Doan & Alon Halevy Department of Computer Science & Engineering University of Washington Efficiently Ordering Query Plans for Data Integration.

1 AnHai Doan & Alon Halevy Department of Computer Science & Engineering University of Washington Efficiently Ordering Query Plans for Data Integration

2 Data Integration Challenge
Query: Find Olympus cameras on sale and their reviews
(Brand, Cameras): (Olympus, C-3000)
(Cameras, Reviews): (C-3000, review-article-1)
Sources: TARGET.COM, WAL-MART.COM, BESTBUY.COM, EPINIONS.COM, DPREVIEW.COM, CONSUMERREPORTS.ORG

3 Architecture of a Data Integration System
Components: Query Reformulator, Query Optimizer, Execution Engine
Query: Find Olympus cameras on sale and their reviews
Intermediate results: logical query plans, physical query execution plans
Logical query plans shown in the figure
– TARGET.COM, EPINIONS.COM
– TARGET.COM, DPREVIEW.COM
– TARGET.COM, CONSUMERREPORTS.COM
– WAL-MART.COM, EPINIONS.COM
– BESTBUY.COM, CONSUMERREPORTS.COM
Answers = UNION of outputs of all logical query plans
Must execute multiple plans!

4 Ordering Query Plans
Time to & quality of first answers is important!
– executing all plans is expensive or infeasible
– plans tend to vary significantly in their utility
  – coverage, execution time, monetary cost, ...
Solution
– find query plans in decreasing order of utility
– execute best plans first
– abort query execution as soon as
  – satisfactory answer is found, or
  – resource limits have been reached
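The execution policy on this slide can be sketched as a simple loop. This is a minimal illustration, not the authors' system: the function name `run_until_satisfied`, the `enough_answers` threshold, the `budget` parameter, and the toy data in `FAKE_OUTPUT` are all hypothetical, and plans are assumed to arrive already sorted by decreasing utility.

```python
from typing import Callable, Iterable, List, Set, Tuple

Plan = Tuple[str, ...]  # a logical plan, written as a tuple of source names


def run_until_satisfied(
    ordered_plans: Iterable[Plan],
    execute: Callable[[Plan], Set[str]],
    cost: Callable[[Plan], float],
    enough_answers: int,
    budget: float,
) -> Tuple[Set[str], List[Plan]]:
    """Execute plans best-first; stop on a satisfactory answer or a resource limit."""
    answers: Set[str] = set()
    executed: List[Plan] = []
    spent = 0.0
    for plan in ordered_plans:
        if spent + cost(plan) > budget:  # resource limit would be exceeded
            break
        answers |= execute(plan)         # answers = union of plan outputs
        spent += cost(plan)
        executed.append(plan)
        if len(answers) >= enough_answers:  # satisfactory answer found
            break
    return answers, executed


# Toy data (invented for illustration): each plan returns a fixed answer set.
FAKE_OUTPUT = {
    ("TARGET", "EPINIONS"): {"a", "b"},
    ("TARGET", "DPREVIEW"): {"b", "c"},
    ("BESTBUY", "CONSUMERREPORTS"): {"d"},
}

answers, executed = run_until_satisfied(
    list(FAKE_OUTPUT), lambda p: FAKE_OUTPUT[p], lambda p: 1.0,
    enough_answers=3, budget=10.0,
)
```

With these toy values the loop stops after the second plan, because three distinct answers have been collected by then.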

5 Our Contributions
Formally defined plan-ordering problem
– does not assume any specific utility measure
– models dependencies among plans
Developed three efficient solutions
– GREEDY: exploits utility monotonicity
– iDRIPS: exploits source similarity
– STREAMER: exploits source similarity, plan independence, utility-diminishing returns
– work with a broad range of utility measures
– find the best plans very fast

6 Problem Definition
Utility measure
– plan coverage: number of new answers returned by a plan
– execution time, monetary fee
– plan utility depends on plans previously executed!
Plan-ordering problem
– modify query reformulator so that given user query and utility measure, it outputs
  – best plan p1
  – next best plan p2, assuming p1 has been executed
  – next best plan p3, assuming p1 & p2 have been executed, ...
– focus on finding first few best plans
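To make the dependence on previously executed plans concrete, here is a brute-force sketch that uses coverage (number of new answers) as the utility. It is not one of the talk's algorithms: it assumes every plan's answer set is known in advance, which a real system would not have, and the names `order_by_coverage` and `toy` are hypothetical.

```python
def order_by_coverage(plan_answers, k):
    """Return the first k plans, each chosen as the best plan given those already executed.

    plan_answers maps a plan name to the set of answers it would return.
    Utility of a plan = number of answers not yet returned by earlier plans.
    """
    remaining = dict(plan_answers)
    seen = set()
    ordering = []
    while remaining and len(ordering) < k:
        # Re-evaluate every remaining plan against what has been seen so far.
        best = max(remaining, key=lambda p: len(remaining[p] - seen))
        ordering.append((best, len(remaining[best] - seen)))
        seen |= remaining.pop(best)
    return ordering


# Toy answer sets (invented): p2 looks good in isolation but adds nothing after p1.
toy = {"p1": {1, 2, 3, 4}, "p2": {1, 2, 3}, "p3": {5, 6}}
ordering = order_by_coverage(toy, 3)
# ordering == [("p1", 4), ("p3", 2), ("p2", 0)]
```

Ranked in isolation, p2 (three answers) would come before p3 (two answers); once p1 has been executed, p2 contributes nothing new, so p3 moves ahead of it.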

7 Current Query Reformulator: the Bucket Algorithm [Levy et al., VLDB-96]
Collect sources into buckets
– sources in a bucket can return answer to a certain part of query
Take cross product of buckets
– to form logical query plans
Example: Find Olympus cameras on sale and their reviews
Bucket B1        Bucket B2
V1: TARGET       V4: EPINIONS
V2: WAL-MART     V5: DPREVIEW
V3: BESTBUY      V6: CONSUMERREPORT
Logical query plans: V1V4, V1V5, ..., V3V5, V3V6
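The cross-product step can be sketched with the standard library. This only illustrates how candidate plans are enumerated from the two buckets on the slide; the function name `logical_plans` is hypothetical, and any further processing the Bucket Algorithm applies to candidate plans is left out.

```python
from itertools import product


def logical_plans(buckets):
    """Enumerate candidate plans: one source from each bucket, in bucket order."""
    return list(product(*buckets))


# Buckets from the slide: B1 covers the "on sale" part, B2 the "reviews" part.
B1 = ["V1", "V2", "V3"]  # TARGET, WAL-MART, BESTBUY
B2 = ["V4", "V5", "V6"]  # EPINIONS, DPREVIEW, CONSUMERREPORT

plans = logical_plans([B1, B2])
# 9 plans, from ("V1", "V4") through ("V3", "V6")
```

The number of plans is the product of the bucket sizes, which is why executing all of them quickly becomes expensive as sources are added.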

8 Our GREEDY Algorithm
Properties
– linear run time
– broadly applicable
  – many practical utility measures are monotonic [Yerneni et al., EDBT-98]
Utility monotonicity
– holds if replacing a source by a “better” source yields a better plan
– e.g., cost(Vi Vj) = cost(Vi) + cost(Vj)
Finds best plan
– by local comparison of sources
Removes best plan & finds next best plan, ...
Figure: sources V1 to V6 in buckets B1 and B2, and plan V1V5
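For the additive cost example on this slide, the best plan is found by picking the cheapest source in each bucket independently. The sketch below also enumerates the next best plans with a standard best-first search over index tuples; that enumeration is a swapped-in textbook technique, not necessarily how GREEDY removes the best plan. The function name `k_best_plans` and all cost values are invented for illustration.

```python
import heapq


def k_best_plans(buckets, k):
    """Return up to k (total_cost, plan) pairs in increasing cost order.

    buckets is a list of dicts mapping source name -> cost; plan cost is the
    sum of its sources' costs (the monotonic, additive case).
    """
    if not buckets or any(not b for b in buckets):
        return []
    # Local comparison: rank the sources inside each bucket by cost.
    ranked = [sorted(b.items(), key=lambda kv: kv[1]) for b in buckets]

    def total(idx):
        return sum(ranked[b][i][1] for b, i in enumerate(idx))

    start = (0,) * len(ranked)      # cheapest source from every bucket
    heap = [(total(start), start)]
    seen = {start}
    out = []
    while heap and len(out) < k:
        cost, idx = heapq.heappop(heap)
        out.append((cost, tuple(ranked[b][i][0] for b, i in enumerate(idx))))
        # Successors: move one bucket to its next-cheapest source.
        for b in range(len(idx)):
            if idx[b] + 1 < len(ranked[b]):
                nxt = idx[:b] + (idx[b] + 1,) + idx[b + 1:]
                if nxt not in seen:
                    seen.add(nxt)
                    heapq.heappush(heap, (total(nxt), nxt))
    return out


# Invented costs for the six sources on the slides.
B1 = {"V1": 2, "V2": 3, "V3": 5}
B2 = {"V4": 4, "V5": 1, "V6": 3}
top3 = k_best_plans([B1, B2], 3)
# top3 == [(3, ("V1", "V5")), (4, ("V2", "V5")), (5, ("V1", "V6"))]
```

Because the cost is additive, any plan's successors cost at least as much as the plan itself, so popping from the heap yields plans in nondecreasing cost order.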

9 Source Similarity
Two sources are similar
– if replacing one by the other changes plan utility very little
Large domains often have many similar sources
– similar in monetary fee, access time, coverage, etc.
Key idea
– similar sources can be grouped and treated as a single source
Example
– sources: V1: time = 2, fee = 3; V2: time = 3, fee = 4; V4: time = 1, fee = 2
– plans: V1V4: time = 3, fee = 5, utility(V1V4) = 0.5; V2V4: time = 4, fee = 6, utility(V2V4) = 0.7
– abstract source V12: time = [2,3], fee = [3,4]
– abstract plan V12V4: time = [3,4], fee = [5,6], utility(V12V4) = [0.4,0.7]
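The interval values on this slide can be reproduced with a few lines of interval arithmetic. The sketch assumes that time and fee add up across the sources of a plan, which matches the slide's numbers (V1V4: time 2 + 1 = 3, fee 3 + 2 = 5). The `Interval` class and the helpers `point` and `hull` are hypothetical names; the utility function itself is not given on the slide, so it is not modeled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    lo: float
    hi: float

    def __add__(self, other):
        # Sum of two ranges: add the lower bounds and add the upper bounds.
        return Interval(self.lo + other.lo, self.hi + other.hi)


def point(x):
    """A concrete source has an exact value, i.e. a zero-width interval."""
    return Interval(x, x)


def hull(*intervals):
    """An abstract source spans the values of all sources grouped under it."""
    return Interval(min(i.lo for i in intervals), max(i.hi for i in intervals))


time = {"V1": point(2), "V2": point(3), "V4": point(1)}
fee = {"V1": point(3), "V2": point(4), "V4": point(2)}

# Abstract source V12 groups V1 and V2.
time["V12"] = hull(time["V1"], time["V2"])  # Interval(lo=2, hi=3)
fee["V12"] = hull(fee["V1"], fee["V2"])     # Interval(lo=3, hi=4)

# Abstract plan V12V4 stands for both V1V4 and V2V4.
plan_time = time["V12"] + time["V4"]        # Interval(lo=3, hi=4)
plan_fee = fee["V12"] + fee["V4"]           # Interval(lo=5, hi=6)
```

Any bound computed for the abstract plan holds for every concrete plan it represents, which is what lets a search discard whole groups of plans at once.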

10 Grouping Sources to Find Best Plan: DRIPS Algorithm [Haddawy et al., UAI-95]
Figure panels: Source Grouping, Branch & Bound Search, Dominance graph
Source Grouping
– bucket B1: V123 groups V12 and V3; V12 groups V1 and V2
– bucket B2: V456 groups V4 and V56; V56 groups V5 and V6
Branch & Bound Search
– plans shown: V123V456, V12V456, V3V456, V1V456, V2V456, V3V4, V3V56
– utility values shown: [0.5, 0.8], [0.4, 0.6], [0.1, 0.3], 0.8, [0.6, 0.7], [0.1, 0.7]
Dominance graph
– plans shown: V3V4, V3V56, V1V456, V2V456
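The branch-and-bound idea can be sketched as a best-first search over abstract plans. This is a simplified illustration and not the DRIPS implementation: it assumes utility is the negated sum of source costs so that interval bounds are easy to compute, it always refines the first abstract source in a plan, and the cost values and names (`Source`, `leaf`, `group`, `best_plan`) are invented.

```python
import heapq
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class Source:
    name: str
    lo: float             # lower bound on cost
    hi: float             # upper bound on cost
    children: tuple = ()  # empty for a concrete source


def leaf(name, cost):
    return Source(name, cost, cost)


def group(name, *children):
    return Source(name, min(c.lo for c in children), max(c.hi for c in children), children)


def utility_bounds(plan):
    """Assumed utility = -(sum of source costs); returns (lower, upper) bounds."""
    return -sum(s.hi for s in plan), -sum(s.lo for s in plan)


def best_plan(roots):
    """Refine abstract plans best-first, discarding dominated ones."""
    tie = count()                      # tie-breaker so plans are never compared
    start = tuple(roots)
    best_lo, hi = utility_bounds(start)
    heap = [(-hi, next(tie), start)]
    while heap:
        _, _, plan = heapq.heappop(heap)
        abstract = [i for i, s in enumerate(plan) if s.children]
        if not abstract:
            # Concrete plan with the highest upper bound: nothing left can beat it.
            return tuple(s.name for s in plan)
        i = abstract[0]
        for child in plan[i].children:
            refined = plan[:i] + (child,) + plan[i + 1:]
            lo, hi = utility_bounds(refined)
            if hi < best_lo:           # dominated: some other plan is surely better
                continue
            best_lo = max(best_lo, lo)
            heapq.heappush(heap, (-hi, next(tie), refined))
    return None


# Grouping hierarchy from the slide, with invented costs.
V1, V2, V3 = leaf("V1", 5), leaf("V2", 7), leaf("V3", 2)
V4, V5, V6 = leaf("V4", 4), leaf("V5", 1), leaf("V6", 3)
V123 = group("V123", group("V12", V1, V2), V3)
V456 = group("V456", V4, group("V56", V5, V6))

winner = best_plan([V123, V456])
# winner == ("V3", "V5")
```

With these costs the search never expands V12V456: its upper bound stays below that of V3V456, so the four concrete plans under V1 and V2 are never evaluated individually.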

11 Extending DRIPS: iDRIPS & STREAMER
iDRIPS (iterative DRIPS)
– applies DRIPS to find best plan
– removes best plan, re-groups sources
– applies DRIPS to find second best plan, ...
Observation
– iDRIPS may re-establish dominance relations many times
Challenge: recycle dominance relations
Solution: STREAMER
– applicable when utility-diminishing returns holds
– exploits plan independence
Figure: plans V2V456, V3V4, V3V56, V1V456

12 The STREAMER Algorithm
Figure panels: First Iteration, Second Iteration
– abstract plans shown: V1V456, V2V456, V3V56, V3V4
– concrete plans shown: V1V4, V1V5, V1V6, V2V4, V2V5, V2V6
– annotation: still true if utility-diminishing returns holds + V3V4 is independent of V1V456
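The "still true if" annotation can be illustrated with coverage as the utility. The slides do not define the two properties, so this sketch rests on two assumed readings: "utility-diminishing returns" means a plan's utility never increases as more plans are executed, and "independent" means executing one plan leaves the other's utility unchanged. The plan names `e`, `p`, `q` and their answer sets are invented.

```python
def coverage(plan, executed, answers):
    """Number of new answers `plan` adds after the plans in `executed` have run."""
    seen = set().union(*(answers[p] for p in executed))
    return len(answers[plan] - seen)


def independent(p, q, answers):
    """For coverage, two plans are independent when their answer sets are disjoint."""
    return answers[p].isdisjoint(answers[q])


# e is executed first; p dominated q before e ran.
answers = {"e": {1, 2}, "p": {3, 4, 5}, "q": {2, 6}}

before = (coverage("p", [], answers), coverage("q", [], answers))       # (3, 2)
after = (coverage("p", ["e"], answers), coverage("q", ["e"], answers))  # (3, 1)

# p is independent of e, so its utility is unchanged; q's utility can only drop.
# The relation "p is at least as good as q" therefore survives executing e.
still_dominates = independent("p", "e", answers) and after[0] >= after[1]
```

Under these assumptions a dominance relation established in one iteration does not need to be recomputed in the next, which is the reuse the STREAMER slides describe.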

13 Summary & Experiments
Empirical evaluation of iDRIPS and STREAMER
– seven non-monotonic utility classes
– for five classes: source grouping worked
– both algorithms found first 100 plans very fast
– STREAMER outperformed iDRIPS (when it is applicable)

Algorithm   Applicable when                Evaluation
GREEDY      utility monotonicity           O(nm^2 k^2)
iDRIPS      source similarity              empirical
STREAMER    source similarity,             empirical
            utility-diminishing returns,
            plan independence

14 Related Work
Query reformulation algorithms
– BUCKET [Levy et al., VLDB-96], INVERSE-RULE [Duschka & Genesereth, PODS-97], MINICON [Pottinger & Levy, VLDB-00]
– our solutions generalize to all of these
Ordering query plans
– [Levy et al., AAAI-96], [Florescu et al., VLDB-97], [Naumann et al., VLDB-99], [Leser & Naumann, FQAS-00], ...
– only considered in restricted settings
Query optimization
– many works at all levels
– most works optimize cost to get all answers

15 Conclusions
Ordering query plans is important & difficult
Contributions
– formally defined problem
– identified interesting problem properties
  – utility monotonicity
  – source similarity
  – plan independence
  – utility-diminishing returns
– developed 3 solutions: GREEDY, iDRIPS, STREAMER
– solutions can handle a broad range of utility measures
– showed that solutions find best plans very fast

