Query Processing in the Presence of Limited Source Capabilities. Chen Li, Information and Computer Science, UC Irvine.

Presentation transcript:

1 Query Processing in the Presence of Limited Source Capabilities Chen Li Information and Computer Science UC Irvine

2 Information integration: support seamless access to autonomous and heterogeneous information sources (legacy databases, plain text files, bibliography servers).

3 Mediation architecture Mediator Wrapper Source 1 Wrapper Source 2 Source n

4 Limited source capabilities. Traditional DBs support “select * from R” queries, e.g., “Get all books” from a relation R(author, title) with tuples (Ullman, DBSI), (Knuth, TeX), …

5 Limited source capabilities (cont). In many environments, however, complete scans of relations may not be possible; a source may only answer queries such as “Given an author, return the books” over R(author, title). Reasons: legacy databases or structured files with limited interfaces; security/privacy; performance concerns.

6 Example: Web search forms

7 Research summary. Some of my work on query processing in the presence of limited source capabilities:
Generating a feasible plan for a query: demo, SIGMOD'1998.
Optimizing large-join queries: ICDT’1999.
Describing source capabilities and computing capabilities of mediators: SIGMOD'1999.
Computing maximal answers to a query by borrowing information from sources not in the query: ICDE'2000, TODS 2001.
Deciding whether all the answers to a query can be computed and testing relative query containment: ICDT'2001, Journal of VLDB.
Other work: e.g., answering queries using views with binding patterns [RSU95], query rewriting for semi-structured data [PV99], etc.

8 Describing source capabilities [SIGMOD’99]. Attribute adornments: f: free; b: bound; u: unspecified; o[S]: optional, if chosen, must be from a list S of constants; c[S]: chosen from a list S of constants. A search form is represented as multiple templates over (Title, Author, ISBN, Format, Subject):
u u b u u
b f u u u
f b u u u
o[] u u o[] o[]

9 Binding patterns: a common kind of source limitation. Attributes carry adornments: b (bound) or f (free). Example: R(Author, Title) with the restriction “given an author, return the books” is written R(Author^b, Title^f). A relation can have multiple binding patterns.
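
A binding pattern can be pictured as a guard in front of the relation. The following is only a minimal sketch: the `Source` class, its method names, and the choice to let a caller also supply values for free attributes are assumptions made for illustration, not part of the talk. The two tuples come from the author/title example on the earlier slides.

```python
from typing import List, Optional, Sequence, Tuple

Row = Tuple[str, ...]


class Source:
    """A relation that answers only queries matching one of its binding patterns."""

    def __init__(self, name: str, rows: List[Row], patterns: Sequence[str]):
        self.name = name
        self.rows = rows
        self.patterns = list(patterns)  # e.g. ["bf"]: attribute 1 bound, attribute 2 free
        self.accesses = 0               # number of queries actually sent to the source

    def query(self, bindings: Sequence[Optional[str]]) -> List[Row]:
        """bindings[i] is a constant, or None when attribute i is left free."""
        if not any(self._legal(p, bindings) for p in self.patterns):
            raise ValueError(f"{self.name}: no binding pattern accepts {list(bindings)}")
        self.accesses += 1
        return [row for row in self.rows
                if all(b is None or b == v for b, v in zip(bindings, row))]

    @staticmethod
    def _legal(pattern: str, bindings: Sequence[Optional[str]]) -> bool:
        # Every 'b' attribute must receive a value; 'f' attributes may or may not.
        return len(pattern) == len(bindings) and all(
            adorn != "b" or value is not None
            for adorn, value in zip(pattern, bindings))


R = Source("R", [("Ullman", "DBSI"), ("Knuth", "TeX")], ["bf"])
print(R.query(["Knuth", None]))   # the books by Knuth
try:
    R.query([None, None])         # a full scan is not a legal access
except ValueError as err:
    print("rejected:", err)
```

Listing several adornment strings in `patterns` models a relation with multiple binding patterns.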

10 Rest of the talk. Part I: Optimizing large-join queries. Part II: Deciding whether all the answers to a query can be computed.

11 Part I: Optimizing large-join queries. Given a query Q on relations with restrictions: Can we answer Q, i.e., find an executable plan that answers the query while observing the source access patterns? How can we answer Q efficiently?

12 Example: three movie sources
R(Star, Movie): (Reeves, The Matrix), (Connery, The Rock), …
S(Movie, Studio): (The Matrix, Warner Bros), (American Beauty, DreamWorks), …
T(Movie, Year): (The Rock, 1996), (The Matrix, 1999), …

13 Query Q: “Find the movies made by Warner Bros. in 1999 in which Keanu Reeves starred.” SQL: SELECT Movie FROM R NATURAL JOIN S NATURAL JOIN T WHERE Star = 'reeves' AND Studio = 'warner' AND Year = 1999; The query binds Star = reeves on R(Star, Movie), Studio = warner on S(Movie, Studio), and Year = 1999 on T(Movie, Year).

14 Answer Q. Plan P0: select Star=reeves on R(Star, Movie), Studio=warner on S(Movie, Studio), and Year=1999 on T(Movie, Year), then join the three results on Movie.

15 What if the sources have limited capabilities? R(Star^b, Movie^f) requires a star name; S(Movie^b, Studio^f) requires a movie title; T(Movie^b, Year^f) requires a movie title.

16 Does P0 work? No! P0 sends Star=reeves to R(Star^b, Movie^f), Studio=warner to S(Movie^b, Studio^f), and Year=1999 to T(Movie^b, Year^f), but S and T do not support queries that leave Movie unbound.

17 A feasible plan P1: send Star=reeves to R(Star^b, Movie^f) to get the Reeves movies; send each of them to S(Movie^b, Studio^f) and keep the Reeves movies by warner; send each of those to T(Movie^b, Year^f) and keep the Reeves movies by warner of 1999.

18 Another feasible plan P2: send Star=reeves to R(Star^b, Movie^f) to get the Reeves movies; send each of them to T(Movie^b, Year^f) and keep the Reeves movies of 1999; send each of those to S(Movie^b, Studio^f) and keep the Reeves movies of 1999 by warner.

19 Question 1: query answerability Given Q on relations with restrictions, can we process its conditions by accessing relations with legal patterns?

20 Consider SPJ queries. Select-project-join queries (conjunctive queries): q(X) :- g_1(X_1), …, g_n(X_n), where in each subgoal g_i(X_i), g_i is a relation and X_i is a tuple of variables/constants. Example: SELECT Movie FROM R NATURAL JOIN S NATURAL JOIN T WHERE Star = 'reeves' AND Studio = 'warner' AND Year = 1999; corresponds to q(M) :- R(reeves,M), S(M,warner), T(M,1999).

21 Algorithm “Inflationary”: testing answerability of Q. Sources: R(Star^b, Movie^f), S(Movie^b, Studio^f), T(Movie^b, Year^f). Query: q(M) :- R(reeves,M), S(M,warner), T(M,1999). Start with the bound positions B = {reeves, warner, 1999}. Check which subgoals can be processed given B: R(reeves,M) can. More positions become bound: add {M} to B. More subgoals can now be processed: S(M,warner) and T(M,1999). All subgoals are answerable, so Q is answerable.
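
The answerability test on this slide can be sketched as a fixpoint loop. This is an illustrative reading of the slide, not the authors' code: the function name `inflationary`, the tuple encoding of subgoals, and the convention that variables start with an uppercase letter are all assumptions of the sketch.

```python
from typing import Dict, List, Sequence, Set, Tuple

Subgoal = Tuple[str, Tuple[str, ...]]   # (relation name, arguments)


def is_var(term: str) -> bool:
    """Convention of this sketch: variables start with an uppercase letter."""
    return term[:1].isupper()


def inflationary(subgoals: Sequence[Subgoal],
                 patterns: Dict[str, List[str]]) -> Tuple[List[Subgoal], List[Subgoal]]:
    """Return (answerable subgoals in a feasible order, subgoals left unanswerable)."""
    bound: Set[str] = set()             # variables bound so far; constants are always bound
    order: List[Subgoal] = []
    remaining = list(subgoals)
    progress = True
    while remaining and progress:
        progress = False
        for goal in list(remaining):
            rel, args = goal
            # A subgoal is answerable if some pattern has all its 'b' positions bound.
            answerable = any(
                all(adorn != "b" or not is_var(t) or t in bound
                    for adorn, t in zip(p, args))
                for p in patterns[rel])
            if answerable:
                order.append(goal)
                remaining.remove(goal)
                bound.update(t for t in args if is_var(t))
                progress = True
    return order, remaining


patterns = {"R": ["bf"], "S": ["bf"], "T": ["bf"]}
q = [("R", ("reeves", "M")), ("S", ("M", "warner")), ("T", ("M", "1999"))]
order, stuck = inflationary(q, patterns)
print([rel for rel, _ in order], stuck)
```

The query is answerable exactly when the second component comes back empty.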

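22 Question 2: how to generate efficient plans? Consider the number of source accesses. Plan P1 over R(Star^b, Movie^f), S(Movie^b, Studio^f), T(Movie^b, Year^f): 1 access to R with Star=reeves returns 12 reeves movies; 12 accesses to S leave 4 reeves movies by warner; 4 accesses to T give the reeves movies by warner of 1999. Number of source accesses: 1 + 12 + 4 = 17.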

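23 Cost of plan P2. Plan P2 over R(Star^b, Movie^f), S(Movie^b, Studio^f), T(Movie^b, Year^f): 1 access to R with star=reeves returns 12 reeves movies; 12 accesses to T leave 1 reeves movie of 1999; 1 access to S gives the reeves movies by warner of 1999. Number of source accesses: 1 + 12 + 1 = 14.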
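
To make the cost comparison concrete, here is a small simulation that counts source accesses for a left-to-right plan. It is a sketch under stated assumptions: the data is synthetic (movie names `m1` to `m12`, the studio and year values) and was chosen only so that the two orders reproduce the totals 17 and 14 quoted on the slides; `execute` and `match` are hypothetical helper names, and each relation has a single binding pattern.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Subgoal = Tuple[str, Tuple[str, ...]]
Env = Dict[str, str]


def is_var(term: str) -> bool:
    """Convention of this sketch: variables start with an uppercase letter."""
    return term[:1].isupper()


def match(args: Sequence[str], row: Sequence[str], env: Env) -> Optional[Env]:
    """Extend env so that args agrees with row, or return None on a mismatch."""
    env = dict(env)
    for term, value in zip(args, row):
        if is_var(term):
            if env.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return env


def execute(plan, tables, patterns):
    """Run subgoals left to right; return (answers, number of source accesses)."""
    envs: List[Env] = [{}]
    accesses = 0
    for rel, args in plan:
        sent = set()                      # distinct queries sent to this source
        next_envs = []
        for env in envs:
            key = tuple(env.get(t) if is_var(t) else t
                        for adorn, t in zip(patterns[rel], args) if adorn == "b")
            if None in key:
                raise ValueError(f"{rel}{args} is not executable at this point")
            if key not in sent:
                sent.add(key)
                accesses += 1
            for row in tables[rel]:       # mediator-side filtering of returned tuples
                extended = match(args, row, env)
                if extended is not None:
                    next_envs.append(extended)
        envs = next_envs
    return envs, accesses


movies = [f"m{i}" for i in range(1, 13)]  # synthetic: 12 Reeves movies
tables = {
    "R": [("reeves", m) for m in movies],
    "S": [(m, "warner" if i < 4 else "other") for i, m in enumerate(movies)],
    "T": [(m, "1999" if i == 0 else "1998") for i, m in enumerate(movies)],
}
patterns = {"R": "bf", "S": "bf", "T": "bf"}
R, S, T = ("R", ("reeves", "M")), ("S", ("M", "warner")), ("T", ("M", "1999"))
P1, P2 = [R, S, T], [R, T, S]
print(execute(P1, tables, patterns)[1], execute(P2, tables, patterns)[1])
```

Both orders return the same answer; only the number of queries sent to the sources differs.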

24 How to generate efficient plans? Challenges: — Often source statistics hard to get — Search space different from a traditional optimizer (e.g., System-R) Ordering subgoals Considering left-deep trees versus bushy trees Deciding join methods (e.g., hash join, nested-loop join) Need to consider feasible plans! Cost model: total number of source accesses — Reason: each source access is expensive! network traffic/delay, dynamic source availability, source charges — Results extendable to more general cost models, e.g.:

25 Left-deep trees versus bushy trees. [Figure: a left-deep tree and a bushy tree, each joining R, S, T, U.] Result: the left-deep trees are guaranteed to include an optimal plan.

26 Complexity. The problem of finding an optimal feasible plan is NP-hard (proof: by reduction from the Vertex Cover problem). Since the number of subgoals could be large, we want approximation algorithms that find near-optimal plans quickly.

27 Case 1: no source statistics Algorithm CHAIN: — Greedy approach — At each step, find a subgoal with the lowest cost

28 CHAIN. Sources: R(Studio^b, Movie^f, Star^f), S(Movie^b, Year^f), T(Star^b, Addr^f). After R is accessed, the intermediate result holds 2 movies (Shrek, Matrix) and 3 stars (Murphy, Diaz, Reeves), so S needs 2 accesses and T needs 3: choose S next! Collect source information as we process subgoals.

29 CHAIN: Properties. Does not need source statistics; it only needs the results returned from the sources. Polynomial time: O(n^2), where n is the number of subgoals. It is n-competitive: Cost(PLAN_chain) <= n * Cost(PLAN_opt).
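
A greedy planner in the spirit of CHAIN might look like the sketch below. It is one possible reading of the slides rather than the published algorithm: "cost" of a subgoal is taken to be the number of distinct queries it would need given the current intermediate result, each relation has one binding pattern, and the studio constant `s1`, the years, and the addresses are made-up values (only the movie and star names appear on the slide).

```python
from typing import Dict, List, Optional, Sequence, Tuple

Env = Dict[str, str]


def is_var(term: str) -> bool:
    """Convention of this sketch: variables start with an uppercase letter."""
    return term[:1].isupper()


def match(args: Sequence[str], row: Sequence[str], env: Env) -> Optional[Env]:
    """Extend env so that args agrees with row, or return None on a mismatch."""
    env = dict(env)
    for term, value in zip(args, row):
        if is_var(term):
            if env.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return env


def chain(subgoals, tables, patterns):
    """Greedy plan: repeatedly run the answerable subgoal needing the fewest accesses."""
    envs: List[Env] = [{}]
    plan, total = [], 0
    remaining = list(subgoals)
    while remaining:
        best = None                       # (cost, index in remaining)
        for i, (rel, args) in enumerate(remaining):
            keys = {tuple(env.get(t) if is_var(t) else t
                          for adorn, t in zip(patterns[rel], args) if adorn == "b")
                    for env in envs}
            if any(None in key for key in keys):
                continue                  # a bound attribute has no value yet
            if best is None or len(keys) < best[0]:
                best = (len(keys), i)
        if best is None:
            raise ValueError("no feasible plan: remaining subgoals are not answerable")
        cost, i = best
        rel, args = remaining.pop(i)
        total += cost
        next_envs = []
        for env in envs:                  # join the returned tuples with the partial results
            for row in tables[rel]:
                extended = match(args, row, env)
                if extended is not None:
                    next_envs.append(extended)
        envs = next_envs
        plan.append((rel, args))
    return plan, total, envs


tables = {
    "R": [("s1", "Shrek", "Murphy"), ("s1", "Shrek", "Diaz"), ("s1", "Matrix", "Reeves")],
    "S": [("Shrek", "2001"), ("Matrix", "1999")],
    "T": [("Murphy", "a1"), ("Diaz", "a2"), ("Reeves", "a3")],
}
patterns = {"R": "bff", "S": "bf", "T": "bf"}
query = [("T", ("St", "A")), ("S", ("M", "Y")), ("R", ("s1", "M", "St"))]
plan, total, answers = chain(query, tables, patterns)
print([rel for rel, _ in plan], total)
```

No statistics are consulted: the costs are read off the tuples already returned, which matches the "collect source information as we process subgoals" remark.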

30 Case 2: source statistics available Algorithm: PARTITION — Grouping approach: group subgoals into clusters — Find an optimal subplan within each cluster — Combine subplans to construct a plan

31 PARTITION. Group 1: the subgoals answerable given the initial constants in Q. Their new bound arguments make new subgoals answerable, which form group 2, and so on up to group k. Find optimal subplans p1, p2, …, pk for the groups and concatenate them into a complete plan.

32 PARTITION: Properties. Needs source statistics. Guarantees an optimal plan if there are only 1 or 2 clusters. No bound on how far its plan can be from optimal when it misses the optimal plans.
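
The grouping step of PARTITION can be sketched as follows. This covers only the clustering shown on the slide; finding the optimal subplan inside each group needs source statistics and is not attempted here. The function name `partition_groups` and the encoding of subgoals are assumptions of the sketch.

```python
from typing import Dict, List, Sequence, Set, Tuple

Subgoal = Tuple[str, Tuple[str, ...]]


def is_var(term: str) -> bool:
    """Convention of this sketch: variables start with an uppercase letter."""
    return term[:1].isupper()


def answerable(goal: Subgoal, bound: Set[str], patterns: Dict[str, List[str]]) -> bool:
    rel, args = goal
    return any(all(adorn != "b" or not is_var(t) or t in bound
                   for adorn, t in zip(p, args))
               for p in patterns[rel])


def partition_groups(subgoals: Sequence[Subgoal],
                     patterns: Dict[str, List[str]]) -> List[List[Subgoal]]:
    """Group k holds the subgoals that first become answerable in round k."""
    bound: Set[str] = set()
    groups: List[List[Subgoal]] = []
    remaining = list(subgoals)
    while remaining:
        # Answerability is judged against the bindings available at the start of the round.
        group = [g for g in remaining if answerable(g, bound, patterns)]
        if not group:
            raise ValueError("query is not feasible")
        groups.append(group)
        for _, args in group:
            bound.update(t for t in args if is_var(t))
        remaining = [g for g in remaining if g not in group]
    return groups


patterns = {"R": ["bf"], "S": ["bf"], "T": ["bf"]}
q = [("R", ("reeves", "M")), ("S", ("M", "warner")), ("T", ("M", "1999"))]
print(partition_groups(q, patterns))
```

For the three-source movie query this yields two groups, R alone and then S with T.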

33 Performance analysis Test bed: — 15 source relations; number of subgoals: 1 – 10 — For each number of subgoals, ran 1000 random queries — Other factors considered: number of binding patterns for a relation number of constants in a query cardinalities of attributes Results: — Both algorithms generate good plans — PARTITION takes more time than CHAIN — PARTITION generates better plans than CHAIN

34 Probability of missing optimal plans. Average probability of missing optimal plans: CHAIN <= 25%, PARTITION <= 5%.

35 Difference from the optimal plans. Average difference: CHAIN < 5%, PARTITION < 2%.

36 Extending to other cost models Different sources have different costs: — CHAIN is still n-competitive — PARTITION still finds an optimal plan if the number of groups is <= 2 Consider size of data transferred: — CHAIN is still n-competitive — PARTITION may miss the optimal plans even if the number of groups is <= 2

37 Part II: Deciding whether all the answers to a query can be computed, given a query Q on relations with restrictions.

38 Motivation. Take conjunctive queries (CQ’s) as an example. If the Inflationary algorithm terminates while some subgoals are not answerable, can we say there is no way to compute Q’s answers? No!

39 Example: A movie database
r(Star, Movie): (Harrison Ford, Air Force One), (Henry Fonda, On Golden Pond), (Kevin Spacey, American Beauty), …
s(Movie, Award): (On Golden Pond, “Oscar, Best Actor”), (On Golden Pond, “Oscar, Best Actress”), (American Beauty, “Oscar, Best Picture”), …
Q(Award) :- r(henry fonda,Movie), s(Movie,Award)

40 Limited access patterns. r(Star^b, Movie^f): should provide a star. s(Movie^b, Award^f): should provide a movie. (Tables r and s as on slide 39.)

41 Answering Q given the restrictions r(Star^b, Movie^f) and s(Movie^b, Award^f): Q(Award) :- r(henry fonda,Movie), s(Movie,Award). (Tables r and s as on slide 39.)

42 The answer is complete. We did not retrieve all the tuples from the relations r(Star^b, Movie^f) and s(Movie^b, Award^f). Still we computed all tuples in the answer to the query Q(Award) :- r(henry fonda,Movie), s(Movie,Award). (Tables r and s as on slide 39.)

43 Run the Inflationary algorithm on Q(Award) :- r(henry fonda,Movie), s(Movie,Award) with r(Star^b, Movie^f) and s(Movie^b, Award^f): every subgoal is answerable.

44 How about Q’? Q’(Award) :- r(henry fonda,Movie), s(Movie,Award), r(Star,Movie). Inflationary will find that subgoal r(Star,Movie) is not answerable: variable Star cannot be bound. Can we say Q’ cannot be answered? No! It is equivalent to the old query Q, so we can answer Q’ by answering Q!

45 Observations of binding patterns If a relation does not have an “all-free” binding pattern, then after certain queries are sent to this relation, there can always be some tuples that have not been retrieved.

46 General questions Given a query on relations with limited access patterns, can we compute its complete answer by accessing the relations with legal patterns? — If so, called “Stable” queries The example shows that the solution is more than running the Inflationary algorithm

47 General questions (cont). Given a query Q, if Inflationary claims that some subgoals are not answerable, how do we know whether there is another equivalent query Q’ on which Inflationary “succeeds”? Notice that infinitely many queries are equivalent to Q. Furthermore, even if Inflationary were able to say that all these equivalent queries have some unanswerable subgoal, how do we know that there is no “magical” plan that can compute all the answers to Q?

48 Query stability. A query Q on relations with binding patterns is stable if, for any database, we can compute its complete answer by accessing the relations with legal patterns. The complete answer is the answer we could compute if we could retrieve all the tuples from the relations. Use partial tuples to derive the complete answer: we need reasoning!

49 Feasible CQ’s A CQ is feasible if it has a feasible (i.e., executable or answerable) order of all its subgoals. Lemma: A feasible CQ is stable. Testing feasibility of a CQ: Inflationary algorithm

50 What if Q is not feasible? Our example shows that an infeasible query could still be stable

51 Testing stability of a CQ. Theorem: A CQ Q is stable iff its minimal equivalent Q_m is feasible. The minimal equivalent query Q_m is unique.

52 Main idea of the proof. Construct two databases D1 and D2 of the relations. They have the same observable tuples, but yield different answers to the query Q. Thus, we cannot tell whether the computed answer is complete or not.

53 Another way to test CQ stability. Given Q: q(X) :- g_1(), …, g_k(), g_{k+1}(), …, g_n(), compute all executable (answerable) subgoals of Q, wlog denoted g_1(), …, g_k(), and let Q’: q(X) :- g_1(), …, g_k(). If all subgoals become executable, then Q is stable. Otherwise, test equivalence between Q and Q’. Theorem: Q is stable iff Q and Q’ are equivalent.

54 Advantage of the second approach. It could be extended to other query classes, e.g., CQ’s with comparisons (“year > 1995”). Notice that in these classes, the “minimal equivalent query” of a query might not be unique or might be hard to find.

55 Two algorithms for testing stability of CQ’s Algorithm CQStable — Minimize Q, get its minimal equivalent Q m — Test feasibility of Q m by calling Inflationary Algorithm CQStable* — Compute “executable” Q’ from Q — If all subgoals become executable, then Q is stable — Otherwise, test equivalence between Q and Q’ CQStable* is more efficient than CQStable Testing stability of a CQ is NP-complete.
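
The second test (CQStable*) can be sketched in a few dozen lines. This is an illustration under assumptions, not the authors' implementation: equivalence of Q and Q’ is checked by searching for a containment mapping from Q to Q’ that fixes the head (the other direction holds because Q’ keeps a subset of Q's subgoals), the search is plain backtracking and therefore exponential in the worst case, and all function names are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Subgoal = Tuple[str, Tuple[str, ...]]


def is_var(term: str) -> bool:
    """Convention of this sketch: variables start with an uppercase letter."""
    return term[:1].isupper()


def executable_subgoals(subgoals, patterns):
    """Inflationary pass: return (executable subgoals, the rest)."""
    bound, done, rest = set(), [], list(subgoals)
    progress = True
    while rest and progress:
        progress = False
        for goal in list(rest):
            rel, args = goal
            if any(all(a != "b" or not is_var(t) or t in bound for a, t in zip(p, args))
                   for p in patterns[rel]):
                done.append(goal)
                rest.remove(goal)
                bound.update(t for t in args if is_var(t))
                progress = True
    return done, rest


def has_containment_mapping(head: Sequence[str], src: List[Subgoal],
                            dst: List[Subgoal]) -> bool:
    """Is there a mapping of src's variables that fixes the head and sends
    every subgoal of src onto some subgoal of dst?"""
    def extend(i: int, h: Dict[str, str]) -> bool:
        if i == len(src):
            return True
        rel, args = src[i]
        for rel2, args2 in dst:
            if rel2 != rel or len(args2) != len(args):
                continue
            h2, ok = dict(h), True
            for t, u in zip(args, args2):
                if is_var(t):
                    if h2.setdefault(t, u) != u:
                        ok = False
                        break
                elif t != u:
                    ok = False
                    break
            if ok and extend(i + 1, h2):
                return True
        return False
    return extend(0, {v: v for v in head if is_var(v)})


def cq_stable_star(head, subgoals, patterns) -> bool:
    executable, rest = executable_subgoals(subgoals, patterns)
    if not rest:
        return True                       # feasible, hence stable
    return has_containment_mapping(head, list(subgoals), executable)


movie = {"r": ["bf"], "s": ["bf"]}
q_prime = [("r", ("henry fonda", "Movie")), ("s", ("Movie", "Award")),
           ("r", ("Star", "Movie"))]
print(cq_stable_star(("Award",), q_prime, movie))
```

On Q’ from the earlier slide the mapping sends Star to the constant henry fonda, so the redundant subgoal folds onto r(henry fonda, Movie) and the query is reported stable.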

56 Other classes of queries Unions of CQs: — Still we need to “minimize” the query first — two algorithms for testing stability

57 CQs with arithmetic comparisons. More complicated due to potential equality among variables. Need to consider all total orderings. We give an algorithm for testing stability.

58 Datalog queries. Testing stability is undecidable. We give a sufficient condition for the stability of Datalog queries.

59 Dynamic computability of the complete answer to CQs. For a nonstable CQ Q, its complete answer might still be computable for certain databases.

60 An example. Q1: ans(B) :- r(a,B,C), s(C,D). Not stable. For the following database, we can still compute Q1’s complete answer: {b1,b2}.
r(A^b, B^f, C^f): (a, b1, c1), (a, b2, c2), (a, b2, c3), …
s(C^f, D^b): (c1, d1), (c2, d2), …
p(D^f): d1, d2, …

61 Change the head argument. Q2: ans(D) :- r(a,B,C), s(C,D). Still not stable. For the same database, we cannot compute Q2’s complete answer.
r(A^b, B^f, C^f): (a, b1, c1), (a, b2, c2), (a, b2, c3), …
s(C^f, D^b): (c1, d1), (c2, d2), …
p(D^f): d1, d2, …

62 Difference between Q1 and Q2. With r(A^b, B^f, C^f) and s(C^f, D^b): Q1: ans(B) :- r(a,B,C), s(C,D); Q2: ans(D) :- r(a,B,C), s(C,D). Q1’s head argument B is bound by the executable subgoal r(a,B,C). Q2’s head argument D is not bound by the executable subgoal r(a,B,C).

63 Generalization. q(X) :- g_1(X_1), …, g_k(X_k), g_{k+1}(X_{k+1}), …, g_n(X_n). Executable subgoals: E = g_1(X_1), …, g_k(X_k). If all arguments in X are bound in E, we might compute its complete answer; the computability is database dependent. If some arguments in X are not bound in E, we can never compute its complete answer, unless the relation computed from the subgoals in E is empty.
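
The case split on this slide reduces to checking whether every head variable occurs in an executable subgoal. The sketch below assumes the query has already been found nonstable; the function name `outlook` and the returned labels are illustrative only.

```python
from typing import Dict, List, Sequence, Tuple

Subgoal = Tuple[str, Tuple[str, ...]]


def is_var(term: str) -> bool:
    """Convention of this sketch: variables start with an uppercase letter."""
    return term[:1].isupper()


def executable_subgoals(subgoals: Sequence[Subgoal], patterns: Dict[str, List[str]]):
    """Inflationary pass: return (executable subgoals E, the rest)."""
    bound, done, rest = set(), [], list(subgoals)
    progress = True
    while rest and progress:
        progress = False
        for goal in list(rest):
            rel, args = goal
            if any(all(a != "b" or not is_var(t) or t in bound for a, t in zip(p, args))
                   for p in patterns[rel]):
                done.append(goal)
                rest.remove(goal)
                bound.update(t for t in args if is_var(t))
                progress = True
    return done, rest


def outlook(head: Sequence[str], subgoals: Sequence[Subgoal],
            patterns: Dict[str, List[str]]) -> str:
    """Classify a nonstable CQ by whether its head is bound in E."""
    executable, rest = executable_subgoals(subgoals, patterns)
    if not rest:
        return "feasible"
    bound_in_e = {t for _, args in executable for t in args if is_var(t)}
    if all(not is_var(t) or t in bound_in_e for t in head):
        return "database-dependent"
    return "not computable unless E yields no tuples"


patterns = {"r": ["bff"], "s": ["fb"]}
body = [("r", ("a", "B", "C")), ("s", ("C", "D"))]
print(outlook(("B",), body, patterns))   # Q1: head B is bound by r(a,B,C)
print(outlook(("D",), body, patterns))   # Q2: head D is not bound in E
```

This mirrors the Q1/Q2 contrast from the preceding slides: the same body gives different outlooks depending only on the head.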

64 A decision tree It guides the planning process of computing the complete answer to a query. Two approaches while traversing the tree: — optimistic — pessimistic

65

66 Conclusion We worked on research problems on query processing and optimization in the presence of limited query capabilities There are still research issues in this area

67 Other ongoing research projects. Peer-based distributed data integration and sharing. Data cleansing to improve information quality.

68 Peer-based data integration and sharing. [Figure: peers connected through a network; each peer comprises a User Interface, Wrapper, Metadata Manager, Query Engine, and Data Repository, and serves users and passive sources.]

69 The RACCOON Project on Peer-based Data Integration and sharing

70 The FLAMINGO PROJECT: CLEANSING DATA TO IMPROVE INFORMATION QUALITY