1 Query Planning with Limited Source Capabilities Chen Li Stanford University Edward Y. Chang University of California, Santa Barbara.

Presentation transcript:

1 Query Planning with Limited Source Capabilities Chen Li Stanford University Edward Y. Chang University of California, Santa Barbara

2 Motivation
Heterogeneous information sources on the WWW.
Information-integration systems.
Limited query capabilities:
– Music stores: amazon.com, cdnow.com.
– A query must specify a value of Artist or Title.
– The sources do not answer queries such as “Give me all your information about CDs.”

3 Example
Source  View schema             Must bind
1       v1(Song, CD)            Song
2       v2(CD, Artist, Price)   CD
3       v3(CD, Artist, Price)   Artist
Query: “Find the prices of CDs containing a song titled Friends.”
Joins that answer the query:
– v1(Friends, CD) ⋈ v2(CD, Artist, Price)
– v1(Friends, CD) ⋈ v3(CD, Artist, Price)

4 Source tuples
[Figure: sample tuples of v1(Song, CD), v2(CD, Artist, Price), and v3(CD, Artist, Price).]
Not all the tuples can be retrieved from the sources because of the binding restrictions.

5 Traditional approach: consider one join at a time.
– v1 ⋈ v2: {$15}
– v1 ⋈ v3: empty, because there is no binding for Artist.

6 Our approach: retrieve as many tuples as possible.
– v1 ⋈ v2: {$15}
– v1 ⋈ v3: {$10}
This approach could save the user $15 - $10 = $5!

7 Observations
– Access views that are not in a join to retrieve bindings;
– The process is recursive;
– Some tuples in the answer cannot be retrieved.

8 Questions
– How to compute the maximal answer?
– When should we access sources not in a query?
– What sources should be accessed?

9 Source views
A set of source views V with binding patterns:
– b: a value must be specified for the attribute;
– f: free.
Each view schema uses a set of global attributes (here: Song, CD, Artist, Price).
Hypergraph representation, with the binding pattern of each view:
– v1(Song, CD): b f
– v2(CD, Artist, Price): b f f
– v3(CD, Artist, Price): f b f

10 Queries
A query Q includes:
– Input attributes: I;
– Output attributes: O.
Example over v1(Song, CD), v2(CD, Artist, Price), v3(CD, Artist, Price):
– Input attributes: {Song}
– Output attributes: {Price}

11 Connections
Connection: a set of views that connect I and O in Q.
Meaning: the natural join of the views.
Universal-relation-like assumptions, but connections can be generated in various ways.
Example over v1(Song, CD), v2(CD, Artist, Price), v3(CD, Artist, Price): T1 = {v1, v2}, T2 = {v1, v3}.

12 Question 1: Computing the maximal answer
Translate a query and source views into a Datalog program.
The idea is borrowed from Duschka and Levy [IJCAI-97].
– We eliminate useless source accesses.
Why Datalog programs? Recursion.

13 Constructing program Π(Q,V)
Connection rules:
  ans(P) :- V1(s1, C) & V2(C, A, P)
  ans(P) :- V1(s1, C) & V3(C, A, P)
Fact rule:
  song(s1) :-
Rules for v1(Song, CD):
  α-rule:      V1(S, C) :- song(S) & v1(S, C)
  Domain rule: cd(C) :- song(S) & v1(S, C)
Rules for v2(CD, Artist, Price):
  V2(C, A, P) :- cd(C) & v2(C, A, P)
  artist(A) :- cd(C) & v2(C, A, P)
  price(P) :- cd(C) & v2(C, A, P)
Rules for v3(CD, Artist, Price):
  V3(C, A, P) :- artist(A) & v3(C, A, P)
  cd(C) :- artist(A) & v3(C, A, P)
  price(P) :- artist(A) & v3(C, A, P)
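The per-view rules on this slide follow a regular pattern, so they can be generated mechanically. The sketch below is only an illustration of that pattern, not the authors' implementation: the function name `build_view_rules`, the dictionary layout, and the use of full attribute names as Datalog variables are assumptions made for this example.

```python
# Hypothetical encoding of the three music views: name -> (attributes, binding pattern).
CD_VIEWS = {
    "v1": (("Song", "CD"), "bf"),
    "v2": (("CD", "Artist", "Price"), "bff"),
    "v3": (("CD", "Artist", "Price"), "fbf"),
}

def build_view_rules(views):
    """Return, as strings, the rule defining each upper-case view predicate
    plus one domain rule per free attribute of the view."""
    rules = []
    for name, (attrs, pattern) in views.items():
        args = ", ".join(attrs)
        # One domain atom for every attribute that must be bound.
        guards = [f"{a.lower()}({a})" for a, p in zip(attrs, pattern) if p == "b"]
        body = " & ".join(guards + [f"{name}({args})"])
        # Rule for the view predicate, e.g. V1(...) :- song(...) & v1(...).
        rules.append(f"{name.upper()}({args}) :- {body}")
        # Every free attribute yields new bindings for its domain predicate.
        for a, p in zip(attrs, pattern):
            if p == "f":
                rules.append(f"{a.lower()}({a}) :- {body}")
    return rules

rules = build_view_rules(CD_VIEWS)
for rule in rules:
    print(rule)
```

Up to variable naming, the eight generated rules correspond to the eight per-view rules listed on the slide; the connection rules and the fact rule depend on the query and are not generated here.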

14 Binding assumptions
– A binding for an attribute is from the attribute’s domain;
– We do not allow the “strategy” of trying all possible strings to “test” the source (it may not terminate);
– Any binding is obtained either from the query or from a tuple returned by a source query.
The program Π(Q,V) computes the maximal answer.

15 Question 2: when to access off-query sources?
Views and binding patterns (hypergraph over attributes A, B, C, D, E, F):
– v1(A, C): b f
– v2(A, B, C): f f b
– v3(C, D): b f
– v4(C, E): f f
– v5(E, F): b f
Query: Input: A = a1. Output: D = ?
Connections: T1 = {v1, v3}, T2 = {v2, v3}.
Not all the views need to be accessed.

16 T1 = {v1(A, C), v3(C, D)}: accessing sources outside T1 is NOT necessary.
T2 = {v2(A, B, C), v3(C, D)}: accessing sources outside T2 is necessary to get C bindings.

17 Independent connections
A connection T is independent if all the views in T can be queried starting from the input attributes as the initial bindings and using only the views in T.
– T1 = {v1(A, C), v3(C, D)} is independent.
– T2 = {v2(A, B, C), v3(C, D)} is not independent: it needs C bindings.
Theorem: off-connection source accesses are only necessary for nonindependent connections.
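The independence test can be sketched as a small fixpoint loop: keep querying any view of the connection whose bound attributes are already known, and check whether every view was eventually reached. This is a minimal sketch under assumptions: the name `is_independent` and the dictionary encoding are invented here, and the binding patterns are the ones read off the hypergraph in the running example.

```python
# Assumed encoding of the example views: name -> (attributes, binding pattern).
VIEWS = {
    "v1": ("AC", "bf"),
    "v2": ("ABC", "ffb"),
    "v3": ("CD", "bf"),
    "v4": ("CE", "ff"),
    "v5": ("EF", "bf"),
}

def is_independent(inputs, connection):
    """True if every view in `connection` can be queried starting from
    `inputs` and using only views of the connection itself."""
    known = set(inputs)          # attributes for which bindings exist
    pending = set(connection)    # views not yet queryable
    progress = True
    while pending and progress:
        progress = False
        for v in sorted(pending):
            attrs, pattern = VIEWS[v]
            required = {a for a, p in zip(attrs, pattern) if p == "b"}
            if required <= known:
                # The view can be queried; all its attributes become known.
                known |= set(attrs)
                pending.discard(v)
                progress = True
    return not pending

print(is_independent({"A"}, {"v1", "v3"}))  # T1
print(is_independent({"A"}, {"v2", "v3"}))  # T2
```

For T1 the loop reaches v1 and then v3; for T2 neither view can be queried because no C binding is available inside the connection.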

18 Question 3: what sources should be accessed?
A view v is relevant to connection T if we may miss some answers to T when v is not used.
Views: v1(A, C), v2(A, B, C), v3(C, D), v4(C, E), v5(E, F).
The relevant views of T2 are: v2, v3, v1, v4.
How to find all the relevant views of a nonindependent connection?

19 Kernel
A kernel of a connection is a minimal set of attributes that need to be initially bound, in addition to the input attributes, to query the full connection.
A connection may have multiple kernels.
– T1 = {v1(A, C), v3(C, D)} has one kernel: {}
– T2 = {v2(A, B, C), v3(C, D)} has one kernel: {C}

20 Algorithm FIND_REL: Finding relevant views of a connection
Find all the relevant views of connection T2 = {v2, v3}, given views v1(A, C), v2(A, B, C), v3(C, D), v4(C, E), v5(E, F):
(1) Compute queryable views: {v1, v2, v3, v4, v5};
(2) Find a kernel K of T2: K = {C};
(3) Compute all the views that can help produce bindings for the attributes in K: R = {v1, v2, v4};
(4) Return R ∪ T2 = {v1, v2, v3, v4}.
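The four steps of FIND_REL can be traced in code on the slide's example. The sketch below is one possible reading, not the authors' algorithm as published: the helper names (`f_closure`, `find_kernel`, `b_closure`, `find_rel`), the brute-force kernel search, and the exact backward-propagation rule are assumptions chosen so that the example reproduces the sets shown on the slide.

```python
from itertools import combinations

# Assumed encoding of the example views: name -> (attributes, binding pattern).
VIEWS = {
    "v1": ("AC", "bf"), "v2": ("ABC", "ffb"), "v3": ("CD", "bf"),
    "v4": ("CE", "ff"), "v5": ("EF", "bf"),
}

def bound_attrs(v):
    attrs, pattern = VIEWS[v]
    return {a for a, p in zip(attrs, pattern) if p == "b"}

def free_attrs(v):
    attrs, pattern = VIEWS[v]
    return {a for a, p in zip(attrs, pattern) if p == "f"}

def f_closure(start, views):
    """Views in `views` that can eventually be queried from bindings `start`."""
    known, reached, changed = set(start), set(), True
    while changed:
        changed = False
        for v in views:
            if v not in reached and bound_attrs(v) <= known:
                reached.add(v)
                known |= set(VIEWS[v][0])
                changed = True
    return reached

def find_kernel(inputs, connection):
    """Smallest extra attribute set that makes the whole connection queryable
    (brute force over subsets; fine for slide-sized examples)."""
    attrs = set().union(*(VIEWS[v][0] for v in connection)) - set(inputs)
    for size in range(len(attrs) + 1):
        for extra in combinations(sorted(attrs), size):
            if f_closure(set(inputs) | set(extra), connection) == set(connection):
                return set(extra)

def b_closure(targets, views):
    """Views that can help produce bindings for `targets`: a view helps if it
    returns a needed attribute; its own bound attributes then become needed."""
    needed, helpers, changed = set(targets), set(), True
    while changed:
        changed = False
        for v in views:
            if v not in helpers and free_attrs(v) & needed:
                helpers.add(v)
                needed |= bound_attrs(v)
                changed = True
    return helpers

def find_rel(inputs, connection):
    queryable = f_closure(inputs, VIEWS)      # step (1)
    kernel = find_kernel(inputs, connection)  # step (2)
    helpers = b_closure(kernel, queryable)    # step (3)
    return helpers | set(connection)          # step (4)

print(sorted(find_rel({"A"}, {"v2", "v3"})))
```

With input attribute A, the call for T2 goes through the same intermediate sets as the slide: all five views are queryable, the kernel is {C}, and the helping views are v1, v2, and v4.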

21 Constructing an efficient program
– Compute the relevant views for each connection;
– Take the union of all these relevant source views;
– Use these views to construct a new program;
– Remove useless rules.

22 Conclusions
– A query-planning framework to compute the maximal answer to a query (Duschka and Levy [IJCAI-97]);
– Techniques for telling when to access off-query views;
– Algorithms:
  – finding all the relevant sources for a query;
  – constructing an efficient program.

23 Other related work
Rajaraman, Sagiv, and Ullman [PODS-95]:
– Shows how to find an equivalent query rewriting using views with binding restrictions;
– We give the maximal rewriting of a query.
Optimizing conjunctive queries with binding restrictions:
– Yerneni, Li, Garcia-Molina, and Ullman [ICDT-99];
– Florescu et al. [SIGMOD-99].
Testing connection containment:
– Li [Stanford-CS-TR 2000], using results on monadic programs to prove that the problem is decidable.

24 Predicates
EDB predicates: v1(S, C), v2(C, A, P), v3(C, A, P).
IDB predicates:
– α-predicates: V1(S, C), V2(C, A, P), V3(C, A, P);
– domain predicates: cd(C), song(S), artist(A), price(P);
– ans(P).

25 Evaluating program Π(Q,V)
Assume the right side of an α-rule or a domain rule is: domA1(A1), …, domAp(Ap), vi(A1, …, Am).
Once we have bindings for domA1(A1), …, domAp(Ap), evaluate the rule and populate the domain predicates and the α-predicate.
Repeat until no more facts can be derived.
The result is the maximal answer to the query.
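The evaluation loop described on this slide can be simulated against mocked sources. The following is a sketch under stated assumptions, not the system described in the talk: the tuples in `SOURCES` are invented for illustration (chosen so that the prices $15 and $10 from the earlier example appear), each view has a single must-bind attribute, and `query_source`, `retrieve_all`, and `answer` are hypothetical helper names.

```python
# Hypothetical source contents; only the prices 15 and 10 echo the slides.
SOURCES = {
    "v1": {"attrs": ("Song", "CD"), "must_bind": "Song",
           "tuples": [("Friends", "c1"), ("Friends", "c2")]},
    "v2": {"attrs": ("CD", "Artist", "Price"), "must_bind": "CD",
           "tuples": [("c1", "a1", 15)]},
    "v3": {"attrs": ("CD", "Artist", "Price"), "must_bind": "Artist",
           "tuples": [("c2", "a1", 10), ("c3", "a2", 12)]},
}

def query_source(name, value):
    """Simulate one source call: the must-bind attribute is set to `value`."""
    src = SOURCES[name]
    pos = src["attrs"].index(src["must_bind"])
    return [t for t in src["tuples"] if t[pos] == value]

def retrieve_all(initial):
    """Repeat source queries until no new binding appears (a fixpoint)."""
    domains = {attr: set(vals) for attr, vals in initial.items()}
    retrieved = {name: set() for name in SOURCES}
    asked = set()  # (view, value) pairs already sent to a source
    changed = True
    while changed:
        changed = False
        for name, src in SOURCES.items():
            for value in list(domains.get(src["must_bind"], ())):
                if (name, value) in asked:
                    continue
                asked.add((name, value))
                changed = True
                for t in query_source(name, value):
                    retrieved[name].add(t)
                    # Every returned value becomes a binding for its attribute.
                    for attr, val in zip(src["attrs"], t):
                        domains.setdefault(attr, set()).add(val)
    return retrieved

def answer(retrieved, song):
    """Evaluate both connections (v1 with v2, v1 with v3) on retrieved tuples."""
    prices = set()
    for s, cd in retrieved["v1"]:
        if s != song:
            continue
        for view in ("v2", "v3"):
            prices |= {p for c, _artist, p in retrieved[view] if c == cd}
    return prices

retrieved = retrieve_all({"Song": {"Friends"}})
print(sorted(answer(retrieved, "Friends")))
```

In this made-up data the artist binding a1 is obtained from v2 and then used to query v3, which is how the $10 tuple becomes reachable; the tuple for artist a2 is never retrieved because no binding for a2 is ever produced.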

26 Forward-closure
Given views W ⊆ V and attributes X, the forward-closure of X given W, denoted f-closure(X, W), is the set of views in W that can eventually be queried by using the views in W, starting from the initial bindings X.
Views: v1(A, C), v2(A, B, C), v3(C, D), v4(C, E), v5(E, F).
– f-closure({A}, {v1, v2, v3}) = {v1, v2, v3}
– f-closure({D}, {v1, v2, v3}) = {}

27 Backward-closure
The backward-closure of a set of attributes X, denoted b-closure(X), is the set of views that can help retrieve bindings for X.
Lemma: All backward-closures of a connection are the same.
Views: v1(A, C), v2(A, B, C), v3(C, D), v4(C, E), v5(E, F).
b-closure(C) = {v1, v2, v4}

28 BF-chain, backward-closure
[Figure: a BF-chain (attributes labeled free, bound, free) and the backward-closure, drawn on the hypergraph of v1(A, C), v2(A, B, C), v3(C, D), v4(C, E), v5(E, F).]
b-closure(C) = {v1, v2, v4}

29 Other possibilities of obtaining bindings
Cached data: for a cached tuple ti(a1, a2) of view vi(A1, A2), add the following rules to the program Π(Q, V):
  vi(a1, a2) :-
  domA1(a1) :-
  domA2(a2) :-
Domain knowledge:
– student(name, dept, GPA).
– dept = CS, Physics, Chemistry, etc.

30 Computing a partial answer
– Independent connections: complete answers are computable.
– Nonindependent connections: access some relevant views.
– The evaluation of the program may be terminated after some results are computed.