1 Query Planning with Limited Source Capabilities Chen Li Stanford University Edward Y. Chang University of California, Santa Barbara.


1 1 Query Planning with Limited Source Capabilities Chen Li Stanford University Edward Y. Chang University of California, Santa Barbara

2 2 Motivation
Heterogeneous information sources on the WWW
Information-integration systems
Limited query capabilities
–Music stores: amazon.com, cdnow.com.
–Must specify a value of Artist or Title.
–The sources do not answer queries such as “Give me all your information about CDs.”

3 3 Example
Sources  View Schemas            Must Bind
1        v1(Song, CD)            Song
2        v2(CD, Artist, Price)   CD
3        v3(CD, Artist, Price)   Artist
Query: “Find the prices of CDs containing a song titled Friends.”
v1(Friends, CD) ⋈ v2(CD, Artist, Price)
v1(Friends, CD) ⋈ v3(CD, Artist, Price)

4 4 Source tuples
[Figure: tuples of v1(Song, CD), v2(CD, Artist, Price), v3(CD, Artist, Price)]
Not all the tuples can be retrieved from the sources due to the restrictions.

5 5 Traditional approach: consider one join at a time.
v1 ⋈ v2: {$15}
v1 ⋈ v3: empty, because there is no binding for Artist.
[Figure: tuples of v1(Song, CD), v2(CD, Artist, Price), v3(CD, Artist, Price)]

6 6 Our approach: retrieve as many tuples as possible.
v1 ⋈ v2: {$15}
v1 ⋈ v3: {$10}
This approach could save the user $15 - $10 = $5!
[Figure: tuples of v1(Song, CD), v2(CD, Artist, Price), v3(CD, Artist, Price), with some tuples marked X]

7 7 Observations
Access views not in a join to retrieve bindings;
Recursive process;
Some tuples in the answer cannot be retrieved.
[Figure: tuples of v1(Song, CD), v2(CD, Artist, Price), v3(CD, Artist, Price), with some tuples marked X]

8 8 Questions
How to compute the maximal answer?
When should we access sources not in a query?
What sources should be accessed?

9 9 Source views
A set of source views V with binding patterns:
–b: a value must be specified for the attribute
–f: free
Each view schema uses a set of global attributes.
Hypergraph representation (attributes Song, CD, Artist, Price):
v1(Song, CD): bf
v2(CD, Artist, Price): bff
v3(CD, Artist, Price): fbf

10 10 Queries
A query Q includes:
–Input attributes: I;
–Output attributes: O.
Input attribute: {Song}
Output attribute: {Price}
[Hypergraph: v1(Song, CD), v2(CD, Artist, Price), v3(CD, Artist, Price)]

11 11 Connections
Connection: a set of views that connect I and O in Q.
Meaning: natural join of the views.
Universal-relation-like assumptions, but connections can be generated in various ways.
T1 = {v1, v2}, T2 = {v1, v3}
[Hypergraph: v1(Song, CD), v2(CD, Artist, Price), v3(CD, Artist, Price)]

12 12 Question 1: Computing the maximal answer
Translate a query and source views into a Datalog program.
Borrowed the idea from Duschka and Levy [IJCAI-97].
–We eliminate useless source accesses.
Why Datalog programs? Recursion.

13 13 Constructing program Π(Q,V)
Connection rules:
  ans(P) :- V1(s1, C) & V2(C, A, P)
  ans(P) :- V1(s1, C) & V3(C, A, P)
Fact rule:
  song(s1) :-
Rules for v1(Song, CD):
  α-rule:      V1(S, C) :- song(S) & v1(S, C)
  Domain rule: cd(C) :- song(S) & v1(S, C)
Rules for v2(CD, Artist, Price):
  V2(C, A, P) :- cd(C) & v2(C, A, P)
  artist(A) :- cd(C) & v2(C, A, P)
  price(P) :- cd(C) & v2(C, A, P)
Rules for v3(CD, Artist, Price):
  V3(C, A, P) :- artist(A) & v3(C, A, P)
  cd(C) :- artist(A) & v3(C, A, P)
  price(P) :- artist(A) & v3(C, A, P)

14 14 Binding assumptions:
–A binding for an attribute is from the attribute’s domain;
–Do not allow the “strategy” of trying all the possible strings to “test” the source (may not terminate);
–Any binding is either obtained from the query, or from a tuple returned by a source query.
The program Π(Q,V) computes the maximal answer.

15 15 Question 2: when to access off-query sources?
Hypergraph over attributes A, B, C, D, E, F, with binding patterns:
v1(A, C): bf
v2(A, B, C): ffb
v3(C, D): bf
v4(C, E): ff
v5(E, F): bf
Query: Input: A = a1; Output: D = ?
Connections: T1 = {v1, v3}, T2 = {v2, v3}
Not all the views need to be accessed.

16 16 T1: accessing sources outside T1 is NOT necessary.
[Hypergraph: v1(A, C), v3(C, D)]
T2: accessing sources outside T2 is necessary to get C bindings.
[Hypergraph: v2(A, B, C), v3(C, D)]

17 17 Independent connections
A connection T is independent if all the views in T can be queried starting from the input attributes as the initial bindings and using only the views in T.
T1 is independent. [Hypergraph: v1(A, C), v3(C, D)]
T2 is not independent: it needs C bindings. [Hypergraph: v2(A, B, C), v3(C, D)]
Theorem: off-connection source accesses are only necessary for nonindependent connections.

18 18 Question 3: what sources should be accessed?
A view v is relevant to connection T if we may miss some answers to T when v is not used.
The relevant views of T2 are: v2, v3, v1, v4.
How to find all the relevant views of a nonindependent connection?
[Hypergraph: v1(A, C), v2(A, B, C), v3(C, D), v4(C, E), v5(E, F)]

19 19 Kernel
A kernel of a connection is a minimal set of attributes that need to be initially bound in addition to the input attributes to query the full connection.
A connection may have multiple kernels.
T1 has one kernel: {} [Hypergraph: v1(A, C), v3(C, D)]
T2 has one kernel: {C} [Hypergraph: v2(A, B, C), v3(C, D)]
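
The kernel definition can be checked by brute force on the slides' small example. The sketch below is only an illustration, not the authors' algorithm: the binding patterns in `VIEWS` are read from the flattened hypergraph on the Question 2 slide, the names `VIEWS`, `queryable`, and `kernels` are hypothetical, and it assumes that every attribute of a queried view becomes available as a binding.

```python
from itertools import combinations

# view name -> (attributes, must-bind attributes); read from the slides' hypergraph (assumption)
VIEWS = {
    "v1": (("A", "C"), {"A"}),
    "v2": (("A", "B", "C"), {"C"}),
    "v3": (("C", "D"), {"C"}),
    "v4": (("C", "E"), set()),
    "v5": (("E", "F"), {"E"}),
}

def queryable(bound_attrs, view_names):
    """Views among view_names that can eventually be queried from bound_attrs."""
    known, reached = set(bound_attrs), set()
    progress = True
    while progress:
        progress = False
        for name in sorted(view_names):
            attrs, must_bind = VIEWS[name]
            if name not in reached and must_bind <= known:
                reached.add(name)
                known.update(attrs)  # returned tuples supply new bindings
                progress = True
    return reached

def kernels(connection, inputs):
    """All inclusion-minimal attribute sets that make the whole connection queryable."""
    candidates = sorted({a for v in connection for a in VIEWS[v][0]} - set(inputs))
    found = []
    for size in range(len(candidates) + 1):
        for combo in combinations(candidates, size):
            extra = set(combo)
            if any(k <= extra for k in found):
                continue  # a smaller kernel is already contained in this set
            if queryable(set(inputs) | extra, connection) == set(connection):
                found.append(extra)
    return found

print(kernels({"v1", "v3"}, {"A"}))  # T1: [set()]
print(kernels({"v2", "v3"}, {"A"}))  # T2: [{'C'}]
```

Under these assumptions a connection is independent exactly when its only kernel is the empty set, which is the case for T1 but not for T2.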

20 20 Algorithm FIND_REL: Finding relevant views of a connection
Find all the relevant views of connection T2 = {v2, v3}:
(1) Compute queryable views: {v1, v2, v3, v4, v5};
(2) Find a kernel K of T2: K = {C};
(3) Compute all the views that can help produce bindings for the attributes in K: R = {v1, v2, v4};
(4) Return R ∪ T2 = {v1, v2, v3, v4}.
[Hypergraph: v1(A, C), v2(A, B, C), v3(C, D), v4(C, E), v5(E, F)]

21 21 Constructing an efficient program
Compute the relevant views for each connection;
Take the union of all these relevant source views;
Use these views to construct a new program;
Remove useless rules.

22 22 Conclusions
A query-planning framework to compute the maximal answer to a query (Duschka and Levy [IJCAI-97]).
Techniques for telling when to access off-query views;
Algorithms:
–finding all the relevant sources for a query;
–constructing an efficient program.

23 23 Other related work
Rajaraman, Sagiv, and Ullman [PODS-95]:
–Shows how to find an equivalent query rewriting using views with binding restrictions;
–We give the maximal rewriting of a query.
Optimizing conjunctive queries with binding restrictions:
–Yerneni, Li, Garcia-Molina, and Ullman [ICDT-99];
–Florescu et al. [SIGMOD-99].
Testing connection containment:
–Li [Stanford-CS-TR 2000], using results of monadic programs to prove the problem is decidable.

24 24 Predicates
EDB predicates   IDB predicates
v1(S, C)         V1(S, C)
v2(C, A, P)      V2(C, A, P)
v3(C, A, P)      V3(C, A, P)
                 cd(C)
                 song(S)
                 artist(A)
                 price(P)
                 ans(P)
α-predicates: V1, V2, V3
Domain predicates: cd, song, artist, price

25 25 Evaluating program Π(Q,V)
Assume the right side of an α-rule or a domain rule is: domA1(A1), …, domAp(Ap), vi(A1, …, Am)
Once we have bindings for domA1(A1), …, domAp(Ap), evaluate the rule and populate the domain predicates and α-predicate.
Repeat until no more facts can be derived.
Compute the maximal answer to the query.
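
The evaluation loop above can be sketched in Python for the CD example. This is a minimal sketch under stated assumptions, not the authors' implementation: the tuples in `SOURCES` are hypothetical (the slides' figure with the actual tuples did not survive), chosen only so that the two joins yield $15 and $10, and the names `SOURCES`, `evaluate`, and `answer` are illustrative.

```python
# Hypothetical source contents (assumption; not taken from the slides' figure).
SOURCES = {
    "v1": {"schema": ("Song", "CD"), "bound": ("Song",),
           "tuples": {("Friends", "c1")}},
    "v2": {"schema": ("CD", "Artist", "Price"), "bound": ("CD",),
           "tuples": {("c1", "a1", 15)}},
    "v3": {"schema": ("CD", "Artist", "Price"), "bound": ("Artist",),
           "tuples": {("c1", "a1", 10)}},
}

def evaluate(sources, initial_bindings):
    """Query sources with known bindings until no new fact can be derived."""
    domains = {attr: set(vals) for attr, vals in initial_bindings.items()}
    retrieved = {name: set() for name in sources}  # plays the role of the alpha-predicates
    changed = True
    while changed:
        changed = False
        for name, src in sources.items():
            schema = src["schema"]
            bound_idx = [schema.index(a) for a in src["bound"]]
            for t in src["tuples"]:
                # A tuple is returned only if every must-bind attribute
                # is supplied with a value we already know.
                if not all(t[i] in domains.get(schema[i], set()) for i in bound_idx):
                    continue
                if t not in retrieved[name]:
                    retrieved[name].add(t)
                    changed = True
                # Domain rules: every returned value becomes a usable binding.
                for attr, val in zip(schema, t):
                    if val not in domains.setdefault(attr, set()):
                        domains[attr].add(val)
                        changed = True
    return retrieved, domains

def answer(retrieved, song):
    """Connection rules: join v1 with v2 and with v3 on CD, project Price."""
    cds = {cd for (s, cd) in retrieved.get("v1", ()) if s == song}
    prices = set()
    for name in ("v2", "v3"):
        prices |= {p for (cd, _artist, p) in retrieved.get(name, ()) if cd in cds}
    return prices

retrieved, _ = evaluate(SOURCES, {"Song": {"Friends"}})
print(sorted(answer(retrieved, "Friends")))  # [10, 15]
```

With these hypothetical tuples, dropping v2 from `SOURCES` leaves v3 unreachable, because no Artist binding is ever produced; that mirrors the "empty, no binding for Artist" case of the one-join-at-a-time approach.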

26 26 Forward-closure
Given views W ⊆ V, and attributes X, the forward-closure of X given W, denoted f-closure(X,W), is the set of views in W that can be eventually queried by using the views in W, starting from the initial bindings X.
f-closure({A}, {v1, v2, v3}) = {v1, v2, v3}
f-closure({D}, {v1, v2, v3}) = {}
[Hypergraph: v1(A, C), v2(A, B, C), v3(C, D), v4(C, E), v5(E, F)]
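
The forward-closure can be computed with a simple fixpoint. The sketch below assumes the binding patterns read from the flattened hypergraph on the Question 2 slide and assumes that every attribute of a queried view yields new bindings; `VIEWS` and `f_closure` are illustrative names, not the authors' code.

```python
# view name -> (attributes, must-bind attributes); read from the slides' hypergraph (assumption)
VIEWS = {
    "v1": (("A", "C"), {"A"}),
    "v2": (("A", "B", "C"), {"C"}),
    "v3": (("C", "D"), {"C"}),
    "v4": (("C", "E"), set()),
    "v5": (("E", "F"), {"E"}),
}

def f_closure(x, w):
    """Views in w that can eventually be queried using only views in w, starting from bindings x."""
    known, reached = set(x), set()
    progress = True
    while progress:
        progress = False
        for name in sorted(w):
            attrs, must_bind = VIEWS[name]
            if name not in reached and must_bind <= known:
                reached.add(name)
                known.update(attrs)  # attributes of a queried view become bound
                progress = True
    return reached

print(sorted(f_closure({"A"}, {"v1", "v2", "v3"})))  # ['v1', 'v2', 'v3']
print(sorted(f_closure({"D"}, {"v1", "v2", "v3"})))  # []
```

Applied to all five views with A as the initial binding, the same function returns {v1, v2, v3, v4, v5}, which agrees with step (1) of FIND_REL.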

27 27 Backward-closure
The backward-closure of a set of attributes X, denoted b-closure(X), is the set of views that can help retrieve bindings for X.
b-closure(C) = {v1, v2, v4}
Lemma: All backward-closures of a connection are the same.
[Hypergraph: v1(A, C), v2(A, B, C), v3(C, D), v4(C, E), v5(E, F)]

28 28 BF-chain, backward-closure
[Figure: a BF-chain (labels: free, bound, free) and the backward-closure on the hypergraph of v1(A, C), v2(A, B, C), v3(C, D), v4(C, E), v5(E, F)]
b-closure(C) = {v1, v2, v4}

29 29 Other possibilities of obtaining bindings
Cached data: for a cached tuple ti(a1, a2) for view vi(A1, A2), add the following rules to the program Π(Q, V):
  vi(a1, a2) :-
  domA1(a1) :-
  domA2(a2) :-
Domain knowledge:
–student(name, dept, GPA).
–dept = CS, Physics, Chemistry, etc.

30 30 Computing a partial answer
Independent connections: complete answers are computable.
Nonindependent connections: access some relevant views.
May terminate evaluating the program after some results are computed.

