1 Query Processing in the Presence of Limited Source Capabilities Chen Li Information and Computer Science UC Irvine.

1 Query Processing in the Presence of Limited Source Capabilities Chen Li Information and Computer Science UC Irvine

2 Information integration Legacy database Plain text files Biblio sever Support seamless access to autonomous and heterogeneous information sources.

3 Mediation architecture Mediator Wrapper Source 1 Wrapper Source 2 Source n

4 Limited Source Capabilities Ullman DBSI Knuth TeX … … author title Get all books. support “select * from R” queries Traditional DBs

5 Limited source capabilities (cont) Ullman DBSI Knuth TeX … … author title Given an author, return the books. However, in many environments, complete scans of relations may not be possible Reasons: – Legacy databases or structured files: limited interfaces – Security/Privacy – Performance concerns

6 Example: Web search forms www.imdb.com

7 Research summary Some of my work on query processing in the presence of limited source capabilities: Generating a feasible plan for a query: demo SIGMOD'1998.demo SIGMOD'1998 Optimizing large-join queries: ICDT’1999.ICDT’1999 Describing source capabilities and computing capabilities of mediators: SIGMOD'1999.SIGMOD'1999 Computing maximal answers to a query by borrowing information from sources not in the query : ICDE'2000, TODS 2001.ICDE'2000TODS 2001 Deciding whether all the answers to a query can be computed and testing relative query containment: ICDT'2001, Journal of VLDB.ICDT'2001Journal of VLDB Other work: e.g., answering queries using views with binding patterns [RSU95], query rewriting for semi-structured data [PV99], etc.

8 Attribute adornments: f: free b: bound u: unspecified o[S]: optional, if chosen, must be from a list S of constants c[S]: chosen from a list S of constants A search form is represented as multiple templates: (Title, Author, ISBN, Format, Subject) u u b u u b f u u u f b u u u o[] u u o[] o[] [SIGMOD’99] Describing source capabilities

9 Binding patterns Common source limitations Attributes with adornments: — b: bound — f: free Example: R(Author, Title) — Given an author, return the books. — R(Author b, Title f ) A relation can have multiple binding patterns.

10 Part I: Optimizing large-join queries Part II: Deciding whether all the answers to a query can be computed Rest of the talk

11 Given a query Q on relations with restrictions: Can we answer Q? — I.e., find an executable plan to answer the query while observing the source access patterns How to answer Q efficiently? Part I: Optimizing large-join queries

12 Example: three movie sources R(Star, Movie) Reeves The Matrix Connery The Rock … … The Matrix Warner Bros American Beauty …… DreamWorks S(Movie,Studio) The Rock 1996 The Matrix …… 1999 T(Movie,Year)

13 “Find the movies made by Warner Bros. in 1999 and in which Keanu Reeves starred?” SQL: SELECT Movie FROM R JOIN S JOIN T WHERE Star = ’reeves’ AND Studio = ’warner’ AND Year = 1999; R(Star, Movie) S(Movie,Studio) T(Movie,Year) reeves 1999 warner Query Q

14 Answer Q Plan P 0 Star=reeves R(Star, Movie) Studio=warner S(Movie,Studio) Year=1999 T(Movie,Year) Movie

15 What if limited source capabilities? R(Star b, Movie f ): requires a Star name S(Movie b,Studio f ): requires a Movie title T(Movie b,Year f ): requires a Movie title

16 Does P 0 work? No! Since S and T do not support the queries. Star=reeves Studio=warner Year=1999 R(Star b, Movie f ) S(Movie b,Studio f ) T(Movie b,Year f )

17 A feasible plan P1 Star=reeves R(Star b, Movie f ) S(Movie b,Studio f ) T(Movie b,Year f ) Reeves movies by warner of 1999 Reeves movies by warner

18 Another feasible plan P2 Star=reeves R(Star b, Movie f ) S(Movie b,Studio f ) T(Movie b,Year f ) Reeves movies Reeves movies of 1999 Reeves movies of 1999 by warner

19 Question 1: query answerability Given Q on relations with restrictions, can we process its conditions by accessing relations with legal patterns?

20 Consider SPJ Queries Select-project-join queries (conjunctive queries): q(X) :- g 1 (X 1 ),…, g n (X n ) — subgoal g i (X i ): g i is a relation, X i is a tuple of variables/constants Example: SELECT Movie FROM R JOIN S JOIN T WHERE Star = ’reeves’ AND Studio = ’warner’ AND Year = 1999; q(M) :- R(reeves,M),S(M,warner),T(M,1999)

21 Algorithm “Inflationary”: Testing answerability of Q R(Star b, Movie f ), S(Movie b,Studio f ), T(Movie b,Year f ) Check what subgoals can be processed given B More subgoals can be processed All subgoals are answerable, so Q is answerable q(M) :- R(reeves,M),S(M,warner),T(M,1999) More positions become bound: add {M} to B M MM Bound positions: B = {reeves, warner, 1999} reeves warner1999

22 Question 2: generate efficient plans? Number of source accesses: 1 + 12 + 4 = 17 Plan P 1 R(Star b, Movie f ) S(Movie b,Studio f ) T(Movie b,Year f ) Consider number of source accesses. 1 Star=reeves 12 reeves movies by warner of 1999 4 reeves movies by warner

23 Cost of plan P2 Plan P 2 1 star=reeves R(Star b, Movie f ) S(Movie b,Studio f ) T(Movie b,Year f ) 12 reeves movies 1 of 1999 reeves movies by warner of 1999 Number of source accesses: 1 + 12 + 1 = 14

24 How to generate efficient plans? Challenges: — Often source statistics hard to get — Search space different from a traditional optimizer (e.g., System-R) Ordering subgoals Considering left-deep trees versus bushy trees Deciding join methods (e.g., hash join, nested-loop join) Need to consider feasible plans! Cost model: total number of source accesses — Reason: each source access is expensive! network traffic/delay, dynamic source availability, source charges — Results extendable to more general cost models, e.g.:

25 Left-deep trees versus bushy trees S R T U T R S U Result: Left-deep trees guarantee to include an optimal plan. Left-deep tree Bushy tree

26 Complexity The problem of finding the optimal feasible plans is NP-hard — Proof: by reduction from the Vertex Cover Problem Since the number of subgoals could be large, we want approximation algorithms for finding near-optimal plans quickly

27 Case 1: no source statistics Algorithm CHAIN: — Greedy approach — At each step, find a subgoal with the lowest cost

28 CHAIN R(Studio b, Movie f, Star f ) S(Movie b,Year f ) T(Star b,Addr f ) Movie Star 2 movies 3 stars Choose S next! Collect source information as we process subgoals. Murphy Diaz Reeves Shrek Matrix

29 CHAIN: Properties Does not need source statistics Polynomial time: O(n 2 ), n is the number of subgoals — Only needs results returned from the sources It is n-competitive: Cost(PLAN chain ) <= n * Cost(PLAN opt )

30 Case 2: source statistics available Algorithm: PARTITION — Grouping approach: group subgoals into clusters — Find an optimal subplan within each cluster — Combine subplans to construct a plan

31 PARTITION answerable subgoals given initial constants in Q new answerable subgoals new bound args group 1 group 2 new answerable subgoals new bound args group k … optimal subplan p1 optimal subplan p2 optimal subplan pk complete plan

32 PARTITION: Properties Need source statistics Guarantees an optimal plan if 1 or 2 clusters No bound when missing optimal plans

33 Performance analysis Test bed: — 15 source relations; number of subgoals: 1 – 10 — For each number of subgoals, ran 1000 random queries — Other factors considered: number of binding patterns for a relation number of constants in a query cardinalities of attributes Results: — Both algorithms generate good plans — PARTITION takes more time than CHAIN — PARTITION generates better plans than CHAIN

34 Average probability of missing optimal plans: CHAIN <= 25% PARTITION <= 5% Probability of missing optimal plans

35 Average difference: CHAIN < 5% PARTITION < 2% Difference from the optimal plans

36 Extending to other cost models Different sources have different costs: — CHAIN is still n-competitive — PARTITION still finds an optimal plan if the number of groups is <= 2 Consider size of data transferred: — CHAIN is still n-competitive — PARTITION may miss the optimal plans even if the number of groups is <= 2

37 Given a query Q on relations with restrictions: Deciding whether all the answers to a query can be computed Part II

38 Take Conjunctive queries (CQ’s) as an example If the Inflationary algorithm terminates while some subgoals are not answerable, can we say there is no way to compute Q’s answers? — No! Motivation

39 r(Star, Movie) s(Movie, Award) Harrison FordAir Force One Henry FondaOn Golden Pond Kevin SpaceyAmerican Beauty …… On Golden PondOscar, Best Actor On Golden Pond Oscar, Best Actress American BeautyOscar, Best Picture …… Example: A movie database Q(Award) :- r(henry fonda,Movie), s(Movie,Award)

40 r(Star b, Movie f ) s(Movie b, Award f ) Harrison FordAir Force One Henry FondaOn Golden Pond Kevin SpaceyAmerican Beauty …… On Golden PondOscar, Best Actor On Golden Pond Oscar, Best Actress American BeautyOscar, Best Picture …… Limited access patterns Should provide a star.Should provide a movie.

41 Harrison FordAir Force One Henry FondaOn Golden Pond Kevin SpaceyAmerican Beauty …… On Golden PondOscar, Best Actor On Golden Pond Oscar, Best Actress American BeautyOscar, Best Picture …… Answering Q given the restrictions r(Star b, Movie f ) s(Movie b, Award f ) Q(Award) :- r(henry fonda,Movie), s(Movie,Award)

42 Harrison FordAir Force One Henry FondaOn Golden Pond Kevin SpaceyAmerican Beauty …… On Golden PondOscar, Best Actor On Golden Pond Oscar, Best Actress American BeautyOscar, Best Picture …… The answer is complete We did not retrieve all the tuples from the relations. Still we computed all tuples in the answer to the query. r(Star b, Movie f ) s(Movie b, Award f ) Q(Award) :- r(henry fonda,Movie), s(Movie,Award)

43 Run Inflationary algorithm Every subgoal would be answerable r(Star b, Movie f ) s(Movie b, Award f ) Q(Award) :- r(henry fonda,Movie), s(Movie,Award)

44 How about Q’? Q’(Award) :- r(henry fonda,Movie), s(Movie,Award),r(Star,Movie) Inflationary will find that subgoal r(start,Movie) is not answerable — variable Star cannot be bound Can we say Q’ cannot be answered? — No! — It is essentially equivalent to the old query Q — Thus we can answer Q’ by answering Q!

45 Observations of binding patterns If a relation does not have an “all-free” binding pattern, then after certain queries are sent to this relation, there can always be some tuples that have not been retrieved.

46 General questions Given a query on relations with limited access patterns, can we compute its complete answer by accessing the relations with legal patterns? — If so, called “Stable” queries The example shows that the solution is more than running the Inflationary algorithm

47 General questions (cont) Given a query Q, if Inflationary claims that some subgoals are not answerable, how do we know whether there is another equivalent query Q’, such that Inflationary “succeeds” on Q’? Notice that there are infinite number of equivalent queries of Q Furthermore, even if Inflationary were able to say that all these equivalent queries have some unanswerable subgoal, how do we know if there isn’t any “magical” plan that can compute all the answers to Q?

48 Query stability A query Q on relations with binding patterns is stable if for any database, we can compute its complete answer by accessing the relations with legal patterns. The complete answer is the computable answer if we could retrieve all the tuples from the relations. Use partial tuples to derive the complete answer: we need reasoning!

49 Feasible CQ’s A CQ is feasible if it has a feasible (i.e., executable or answerable) order of all its subgoals. Lemma: A feasible CQ is stable. Testing feasibility of a CQ: Inflationary algorithm

50 What if Q is not feasible? Our example shows that an infeasible query could still be stable

51 Testing stability of a CQ Theorem: A CQ Q is stable iff its minimal equivalent Q m is feasible. Minimal equivalent query Q m Q m is unique

52 Main idea of the proof Construct two databases of the relations They have the same observable tuples, but yield different answers to the query Thus, we cannot tell whether the computed answer is complete or not Same observable tuples Database D1 Database D2 Different answers to Q

53 Another way to test CQ stability Q: q(X) :- g 1 (),…, g k (), g k+1 (), …, g n () Q’: q(X) :- g 1 (),…, g k () Compute all executable (answerable) subgoals of Q, wlog, denoted as g 1 (),…, g k () If all subgoals become executable, then Q is stable Otherwise, test equivalence between Q and Q’ Theorem: Q is stable iff Q and Q’ are equivalent

54 Advantage of the second approach Could be extended to other query classes — E.g., CQ’s with comparisons (“year > 1995”) Notice that in these classes, “minimal equivalent query” of a query might not be unique or hard to find.

55 Two algorithms for testing stability of CQ’s Algorithm CQStable — Minimize Q, get its minimal equivalent Q m — Test feasibility of Q m by calling Inflationary Algorithm CQStable* — Compute “executable” Q’ from Q — If all subgoals become executable, then Q is stable — Otherwise, test equivalence between Q and Q’ CQStable* is more efficient than CQStable Testing stability of a CQ is NP-complete.

56 Other classes of queries Unions of CQs: — Still we need to “minimize” the query first — two algorithms for testing stability

57 CQs with arithmetic comparisons More complicated due to potential equality among variables Need to consider all total ordering An algorithm for the testing stability

58 Datalog queries Testing stability undecidable Give a sufficient condition for stability of Datalog

59 Dynamic computability of complete answer to CQs For a nonstable CQ Q, for certain database, its complete answer might be computed.

60 An example Q1: ans(B) :- r(a,B,C),s(C,D) Not stable For the following database, we can still compute Q1’s complete answer: {b1,b2}. d1 d2 … r(A b, B f, C f ) ab1 …… c1 ab2c2 ab2c3 … d1 … c1 d2c2 … s(C f, D b )p(D f )

61 Change the head argument Q2: ans(D) :- r(a,B,C),s(C,D) Still not stable For the database, we cannot compute Q2’s complete answer. d1 d2 … r(A b, B f, C f ) ab1 …… c1 ab2c2 ab2c3 … d1 … c1 d2c2 … s(C f, D b )p(D f )

62 Difference between Q1 and Q2 b f f f b Q1: ans(B) :- r(a,B,C),s(C,D) Q2: ans(D) :- r(a,B,C),s(C,D) Q1’s head argument B is bound by the executable subgoal r(a,B,C). Q2’s head argument D is not bound by the executable subgoal r(a,B,C).

63 Generalization q(X) :- g 1 (X 1 ), …, g k (X k ), g k+1 (X k+1 ), …, g n (X n ) Executable subgoals: E = g 1 (X 1 ),…, g k (X k ) If all arguments in X are bound in E: — we might compute its complete answer. — The computability is database dependent. If some arguments in X are not bound in E: — we can never compute its complete answer. — Unless the relation after the subgoals in E is empty.

64 A decision tree It guides the planning process of computing the complete answer to a query. Two approaches while traversing the tree: — optimistic — pessimistic

66 Conclusion We worked on research problems on query processing and optimization in the presence of limited query capabilities There are still research issues in this area

67 Peer-based distributed data integration and sharing Data Cleansing to improve information quality Other On-going research projects

68 Network Peer User Interface Wrapper Metadata Manager Query Engine Data Repository Peer Passive Source User Passive Source User Peer-based data integration and sharing

69 The RACCOON Project on Peer-based Data Integration and sharing http://www-db.ics.uci.edu/pages/raccoon/

70 The FLAMINGO PROJECT: CLEANSING DATA TO IMPROVE INFORMATION QUALITY http://www-db.ics.uci.edu/pages/flamingo/

1 Query Processing in the Presence of Limited Source Capabilities Chen Li Information and Computer Science UC Irvine.

Similar presentations

Presentation on theme: "1 Query Processing in the Presence of Limited Source Capabilities Chen Li Information and Computer Science UC Irvine."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

1 Query Processing in the Presence of Limited Source Capabilities Chen Li Information and Computer Science UC Irvine.

Similar presentations

Presentation on theme: "1 Query Processing in the Presence of Limited Source Capabilities Chen Li Information and Computer Science UC Irvine."— Presentation transcript:

Similar presentations

About project

Feedback