
1 Searching and Integrating Information on the Web
Seminar 2: Data Integration
Professor Chen Li, UC Irvine

2 Motivation
Support seamless access to autonomous and heterogeneous information sources.
Example sources: legacy database, plain text files, biblio server.

3 Applications
Comparison shopping: what is the lowest price of the DVD “The Matrix”?
Supply-chain management
[Figure: an integrator connects Buyer 1, Buyer 2, …, Buyer M with Supplier 1, Supplier 2, …, Supplier M.]

4 Mediation architecture
[Figure: a mediator sits on top of wrappers; each wrapper accesses one source (Source 1, Source 2, …, Source n).]
Projects: TSIMMIS (Stanford), Garlic (IBM), Infomaster (Stanford), Disco (INRIA), Information Manifold (AT&T), Hermes (UMD), Tukwila (UW), InfoSleuth (MCC), …

5 Challenges
Sources are heterogeneous:
–Different data models: relational, object-oriented, XML, …
–Different schemas and representations: “Keanu Reeves” or “Reeves, Keanu” or “Reeves, K.” etc.
Describe source contents
Use source data to answer queries
Sources have limited query capabilities
Data quality
…

6 Outline
Basics: theories of conjunctive queries
Global-as-view (GAV) approach to data integration
Local-as-view (LAV) approach to data integration

7 Basics: conjunctive queries
Reading: Ashok K. Chandra and Philip M. Merlin, “Optimal implementation of conjunctive queries in relational data bases,” STOC, 77-90, 1977.
Fundamental for data integration:
–Source content description
–Query description
–Plan formulation

8 Conjunctive Queries (CQ’s)
Most common form of query; equivalent to select-project-join (SPJ) queries
Useful for data integration
Form: q(X) :- p1(X1), p2(X2), …, pn(Xn)
Head q(X) represents the query answers
Body p1(X1), p2(X2), …, pn(Xn) represents the query conditions
–Each pi(Xi) is called a subgoal
–Shared variables represent join conditions
–Constants represent “Attribute=const” selection conditions
–A relation can appear in multiple predicates (subgoals)

9 Conjunctive Queries: example
Relations: student(name,courseNum), course(number,instructor)
SELECT name FROM student, course WHERE student.courseNum=course.number AND instructor='Li';
Equivalent to: ans(SN) :- student(SN,CN), course(CN,'Li')
–Predicates student and course correspond to relation names
–Two subgoals: student(SN,CN) and course(CN,'Li')
–Variables: SN, CN. Constant: 'Li'
–Shared variable, CN, corresponds to “student.courseNum=course.number”
–Variable SN in the head: the answer to the query

10 Answer to a CQ
For a CQ Q on database D, the answer Q(D) is the set of heads of Q obtained if we:
–Substitute constants for variables in the body of Q in all possible ways
–Require all subgoals to be true
Example: ans(SN) :- student(SN,CN), course(CN,'Li')
–Tuples are also called “EDB” (extensional database) facts: student(Jack,184), student(Tom,215), …, course(184,Li), course(215,Li), …
–Answer “Jack”: SN → Jack, CN → 184
–Answer “Tom”: SN → Tom, CN → 215
–Answer “Jack”: SN → Jack, CN → 215 (duplicate eliminated)
[Figure: the Student and Course tables.]
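The definition of Q(D) above can be sketched directly in Python. This is a minimal illustration, not part of the original slides: the `Var` marker, the `evaluate` function, and the sample facts (including `student(Jack, 215)`, which the slide's third answer implies) are assumptions made for the example.

```python
from collections import namedtuple

# Hypothetical representation: Var marks a variable; any other term is a constant.
Var = namedtuple("Var", "name")

def evaluate(head, body, db):
    """Return the set of head tuples produced by every substitution that
    turns all body subgoals into facts of db (a dict: relation -> set of tuples)."""
    def extend(subst, args, fact):
        # Try to extend subst so that the subgoal arguments match the fact.
        subst = dict(subst)
        for term, value in zip(args, fact):
            if isinstance(term, Var):
                if subst.setdefault(term, value) != value:
                    return None  # variable already bound to a different constant
            elif term != value:
                return None      # constant in the subgoal does not match
        return subst

    substs = [{}]
    for pred, args in body:
        substs = [s2 for s in substs for fact in db.get(pred, ())
                  if len(fact) == len(args)
                  for s2 in [extend(s, args, fact)] if s2 is not None]
    # Collecting heads in a set eliminates duplicates such as the second "Jack".
    return {tuple(s[t] if isinstance(t, Var) else t for t in head) for s in substs}

SN, CN = Var("SN"), Var("CN")
db = {
    "student": {("Jack", 184), ("Tom", 215), ("Jack", 215)},
    "course": {(184, "Li"), (215, "Li")},
}
# ans(SN) :- student(SN, CN), course(CN, 'Li')
answers = evaluate((SN,), [("student", (SN, CN)), ("course", (CN, "Li"))], db)
print(sorted(answers))
```

Three substitutions satisfy the body, but the answer set holds only two tuples because the two derivations of Jack collapse into one.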

11 Query containment
For two queries Q1 and Q2, we say Q1 is contained in Q2, denoted Q1 ⊆ Q2, if for any database D, we have Q1(D) ⊆ Q2(D).
We say Q1 and Q2 are equivalent, denoted Q1 ≡ Q2, if Q1 ⊆ Q2 and Q2 ⊆ Q1.
Example:
Q1: ans(SN) :- student(SN,CN), course(CN,'Li')
Q2: ans(SN) :- student(SN,CN), course(CN,INS)
We have: Q1 ⊆ Q2.

12 Another example
Q1: p(X,Y) :- r(X,W), b(W,Z), r(Z,Y)
Q2: p(X,Y) :- r(X,W), b(W,W), r(W,Y)
We have: Q2 ⊆ Q1
Proof:
–For any DB D, suppose p(x,y) is in Q2(D). Then there is a w such that r(x,w), b(w,w), and r(w,y) are in D.
–For Q1, consider the substitution: X → x, W → w, Z → w, Y → y.
–Thus the head of Q1 becomes p(x,y), meaning that p(x,y) is also in Q1(D).
In general, how to test containment of CQ’s?
–Containment mappings
–Canonical databases

13 Containment mappings
Mapping from variables of CQ Q2 to variables of CQ Q1, such that:
–Head of Q2 becomes head of Q1
–Each subgoal of Q2 becomes some subgoal of Q1
  –It is not necessary that every subgoal of Q1 is the target of some subgoal of Q2.
Example:
Q1: p(X,Y) :- r(X,W), b(W,Z), r(Z,Y)
Q2: p(X,Y) :- r(X,W), b(W,W), r(W,Y)
–Containment mapping from Q1 to Q2: X → X, Y → Y, W → W, Z → W
–No containment mapping from Q2 to Q1:
  –For b(W,W) in Q2, its only possible target in Q1 is b(W,Z)
  –However, we cannot have both W → W and W → Z, since a variable cannot be mapped to two different variables

14 Example of containment mappings
C1: p(X) :- a(X,Y), a(Y,Z), a(Z,W)
C2: p(X) :- a(X,Y), a(Y,X)
Containment mapping from C1 to C2: X → X, Y → Y, Z → X, W → Y
No containment mapping from C2 to C1. Proof:
–For the two heads, the mapping must have X → X
–For a(X,Y) in C2, its target in C1 can only be a(X,Y) (since X → X). Thus Y → Y.
–However, for a(Y,X) in C2, its target, which must be a(Y,X), does not exist in C1.
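The search for a containment mapping can be sketched as a brute-force enumeration. This is an illustrative sketch only: the names `Var`, `substitute`, and `containment_mapping` are hypothetical, queries are assumed safe (every head variable also occurs in the body), and the exponential enumeration is chosen for clarity rather than efficiency.

```python
from collections import namedtuple
from itertools import product

# Hypothetical representation: Var marks a variable; any other term is a constant.
Var = namedtuple("Var", "name")

def substitute(mapping, args):
    """Apply a variable mapping to a tuple of terms; constants map to themselves."""
    return tuple(mapping[t] if isinstance(t, Var) else t for t in args)

def containment_mapping(src, dst):
    """Return a containment mapping from query src to query dst, or None.
    A query is (head_args, body), where body is a list of (predicate, args)."""
    src_head, src_body = src
    dst_head, dst_body = dst
    src_vars = sorted({t for _, args in src_body for t in args if isinstance(t, Var)})
    dst_terms = sorted({t for _, args in dst_body for t in args}, key=str)
    dst_atoms = {(pred, tuple(args)) for pred, args in dst_body}
    # Try every assignment of src variables to terms of dst.
    for image in product(dst_terms, repeat=len(src_vars)):
        mapping = dict(zip(src_vars, image))
        if (substitute(mapping, src_head) == tuple(dst_head)
                and all((pred, substitute(mapping, args)) in dst_atoms
                        for pred, args in src_body)):
            return mapping
    return None

X, Y, Z, W = Var("X"), Var("Y"), Var("Z"), Var("W")
Q1 = ((X, Y), [("r", (X, W)), ("b", (W, Z)), ("r", (Z, Y))])
Q2 = ((X, Y), [("r", (X, W)), ("b", (W, W)), ("r", (W, Y))])
C1 = ((X,), [("a", (X, Y)), ("a", (Y, Z)), ("a", (Z, W))])
C2 = ((X,), [("a", (X, Y)), ("a", (Y, X))])

print(containment_mapping(Q1, Q2))  # a mapping exists
print(containment_mapping(Q2, Q1))  # no mapping exists
print(containment_mapping(C1, C2))  # a mapping exists
print(containment_mapping(C2, C1))  # no mapping exists
```

On the two slide examples the function finds the mappings listed above and reports that none exists in the reverse directions.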

15 Theorem of Containment Mappings
Theorem: Q1 ⊆ Q2 iff there is a containment mapping from Q2 to Q1.
Notice: the direction is the “opposite.”
Proof (“If”):
–Suppose μ is a containment mapping from Q2 to Q1
–For any DB D, let tuple t be in Q1(D)
–t is produced by a substitution σ on the variables of Q1 that makes all of Q1’s subgoals facts in D
–Therefore, σ ∘ μ (apply μ, then σ) is a substitution for the variables of Q2 that produces t
–Thus each t in Q1(D) must be in Q2(D)
[Figure: Q1: p(X) :- G1, G2, …, Gk and Q2: p(X) :- H1, H2, …, Hj, with μ drawn from Q2 to Q1 and σ from Q1 to D.]
Q1: p(X,Y) :- r(X,W), b(W,Z), r(Z,Y)
Q2: p(X,Y) :- r(X,W), b(W,W), r(W,Y)

16 Proof (only if)
Key idea: frozen CQ
Use a unique constant to replace each variable
Frozen Q is a DB consisting of all the subgoals of Q, with the chosen constants substituted for variables
This DB is called a “canonical database” of the query.
Example:
–Q1: p(X,Y) :- r(X,W), b(W,Z), r(Z,Y)
–Frozen Q1: X replaced by constant x0, W by constant w0, Z by z0, Y by y0
–Result: DB with {r(x0,w0), b(w0,z0), r(z0,y0)}

17 Proof (only if), cont.
Let Q1 ⊆ Q2. Let D be the frozen Q1.
Let τ be the substitution from those constants back to the variables in Q1.
–Since we chose a unique constant for each variable, this substitution exists.
Since Q1 ⊆ Q2, the “frozen” head of Q1 must be in Q2(D). Thus there is a substitution σ from Q2 to D.
We can show that τ ∘ σ (apply σ, then τ) is a containment mapping from Q2 to Q1:
–The head of Q2 is mapped to the head of Q1.
–Each subgoal in Q2 is mapped to a subgoal in Q1.
[Figure: Q1: p(X) :- G1, G2, …, Gk and Q2: p(X) :- H1, H2, …, Hj, with the substitutions drawn between Q2, D, and Q1.]
Q1: p(X,Y) :- r(X,W), b(W,Z), r(Z,Y)
Q2: p(X,Y) :- r(X,W), b(W,W), r(W,Y)

18 Testing query containment
To test Q1 ⊆ Q2:
–Get a canonical DB D of Q1.
–Compute Q2(D).
–If Q2(D) contains the frozen head of Q1, then Q1 ⊆ Q2; otherwise not.
Testing containment between CQ’s is NP-complete.
Some polynomial-time algorithms exist in special cases.
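The three-step test above can be sketched as follows. This is a minimal illustration under stated assumptions: `Var`, `freeze`, `evaluate`, and `is_contained` are hypothetical names, and each variable is frozen into a tagged tuple `("frozen", name)` that is assumed not to occur as a constant in either query.

```python
from collections import namedtuple

# Hypothetical representation: Var marks a variable; any other term is a constant.
Var = namedtuple("Var", "name")

def freeze(term):
    # Replace a variable by a constant unique to it; leave constants unchanged.
    return ("frozen", term.name) if isinstance(term, Var) else term

def evaluate(head, body, facts):
    """Evaluate a CQ over a set of facts of the form (predicate, args)."""
    substs = [{}]
    for pred, args in body:
        extended = []
        for s in substs:
            for fact_pred, fact_args in facts:
                if fact_pred != pred or len(fact_args) != len(args):
                    continue
                s2 = dict(s)
                if all((s2.setdefault(t, v) == v) if isinstance(t, Var) else t == v
                       for t, v in zip(args, fact_args)):
                    extended.append(s2)
        substs = extended
    return {tuple(s[t] if isinstance(t, Var) else t for t in head) for s in substs}

def is_contained(q1, q2):
    """Test q1 ⊆ q2 by evaluating q2 on the canonical database of q1."""
    head1, body1 = q1
    head2, body2 = q2
    canonical_db = {(pred, tuple(freeze(t) for t in args)) for pred, args in body1}
    frozen_head = tuple(freeze(t) for t in head1)
    return frozen_head in evaluate(head2, body2, canonical_db)

X, Y, Z, W = Var("X"), Var("Y"), Var("Z"), Var("W")
Q1 = ((X, Y), [("r", (X, W)), ("b", (W, Z)), ("r", (Z, Y))])
Q2 = ((X, Y), [("r", (X, W)), ("b", (W, W)), ("r", (W, Y))])
print(is_contained(Q2, Q1), is_contained(Q1, Q2))
```

For the running example this agrees with the earlier slides: Q2 ⊆ Q1 holds, while Q1 ⊆ Q2 does not, because the canonical database of Q1 has no fact of the form b(c,c).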

19 Extending CQ’s
CQ’s with built-in predicates:
–We can add more conditions to variables in a CQ.
–Example: student(name,GPA,courseNum), course(number,instructor,year)
  ans(SN) :- student(SN,G,CN), course(CN,'Li',Y), G >= 3.5
  ans(SN) :- student(SN,G,CN), course(CN,'Li',Y), G >= 3.5, Y < 2002
–More results on CQ’s with built-in predicates
Datalog queries:
–a (possibly infinite) set of CQ’s with (possibly) recursion
–Example: r(Parent,Child)
–Query: finding all ancestors of Tom
  ancestor(P,C) :- r(P,C)
  ancestor(P,C) :- ancestor(P,X), r(X,C)
  result(P) :- ancestor(P,'tom')
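The recursive ancestor program can be evaluated bottom-up by applying the rules until no new facts appear. The sketch below uses naive fixpoint iteration; the function name `ancestors` and the sample parent facts are assumptions made for illustration.

```python
def ancestors(parent_facts):
    """Naive bottom-up evaluation of
         ancestor(P,C) :- r(P,C)
         ancestor(P,C) :- ancestor(P,X), r(X,C)
    over a set of (parent, child) pairs."""
    ancestor = set(parent_facts)              # first rule
    while True:
        # second rule: join the current ancestor facts with r on X
        derived = {(p, c) for (p, x) in ancestor
                   for (x2, c) in parent_facts if x == x2}
        if derived <= ancestor:               # fixpoint reached
            return ancestor
        ancestor |= derived

# Hypothetical sample data for r(Parent, Child).
r = {("ann", "bob"), ("bob", "tom"), ("tom", "zoe")}
# result(P) :- ancestor(P, 'tom')
result = {p for (p, c) in ancestors(r) if c == "tom"}
print(sorted(result))
```

Each iteration corresponds to unfolding the recursive rule one more level, which is why the program stands for an unbounded family of conjunctive queries.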

20 Further Reading
Jeff Ullman, “Principles of Database and Knowledge-Base Systems,” Computer Science Press, 1988, Volume 2.

21 Outline
Basics: theories of conjunctive queries
Global-as-view (GAV) approach to data integration
Local-as-view (LAV) approach to data integration

22 GAV approach to data integration
Readings:
–Jeffrey Ullman, Information Integration Using Logical Views, ICDT 1997.
–Ramana Yerneni, Chen Li, Hector Garcia-Molina, and Jeffrey Ullman, Computing Capabilities of Mediators, SIGMOD 1999.

23 Global-as-view Approach
Sources: R1(Dealer,City), R2(Dealer,Make,Year)
Mediator exports views defined on source relations:
med(Dealer,City,Make,Year) = R1 ⋈ R2
A query is posted on mediator views:
SELECT * FROM med WHERE Year = '2001';
ans(D,C,M) :- med(D,C,M,'2001')
Mediator expands query to source queries:
SELECT * FROM R1, R2 WHERE R1.Dealer = R2.Dealer AND Year = '2001';
ans(D,C,M) :- R1(D,C), R2(D,M,'2001')
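The expansion step can be illustrated with a small in-memory sketch. Everything here is an assumption made for the example: the sample tuples, the function names `med`, `query_via_view`, and `query_expanded`, and the use of integer years.

```python
# Hypothetical source contents.
R1 = {("d1", "irvine"), ("d2", "la"), ("d3", "sf")}                 # R1(Dealer, City)
R2 = {("d1", "honda", 2001), ("d2", "ford", 1999), ("d4", "bmw", 2001)}  # R2(Dealer, Make, Year)

def med():
    """Mediator view: med(Dealer, City, Make, Year) = R1 join R2 on Dealer."""
    return {(d, c, m, y) for (d, c) in R1 for (d2, m, y) in R2 if d == d2}

def query_via_view(year):
    """ans(D,C,M) :- med(D,C,M,year), evaluated on the materialized view."""
    return {(d, c, m) for (d, c, m, y) in med() if y == year}

def query_expanded(year):
    """ans(D,C,M) :- R1(D,C), R2(D,M,year), evaluated directly on the sources,
    with the selection on Year applied to R2 before the join."""
    return {(d, c, m) for (d2, m, y) in R2 if y == year
            for (d, c) in R1 if d == d2}

print(query_via_view(2001))
print(query_expanded(2001))
```

Both functions return the same answer, since the expansion only replaces the view by its definition. Dealer d4 sells a 2001 car but never appears, because it has no city tuple in R1 and so is absent from the join.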

24 GAV Approach (cont)
Project: TSIMMIS at Stanford
Advantages:
–User queries easy to define
–Plan generation is straightforward
Disadvantages:
–Not all source information is exported:
  –What if users want to get dealers that may not have the city information?
  –Those dealers are not “visible.”
–Not easily scalable: every time a new source is added, mediator views need to be changed
Research issues:
–Efficient query execution?
–Deal with limited source capabilities?

25 Limited source capabilities
Complete scans of relations not possible
Reasons:
–Legacy databases or structured files: limited interfaces
–Security/Privacy
–Performance concerns
Example 1: legacy databases with restrictive interfaces. Given an author, return the books.
author    title
Ullman    DBMS
Knuth     TeX
…         …

26 Another example: Web search forms
[Figure: search form at www.imdb.com]

27 Problems
How to describe source restrictions?
How to compute mediator restrictions from sources?
How to answer queries efficiently given these restrictions?
How to compute as many answers as possible to a query?
…

28 Describe source capabilities: using attribute adornments
f: free
b: bound
u: unspecified
c[S]: chosen from a list S of constants, e.g., “state”
o[S]: optional; if chosen, must be from a list S of constants
A search form is represented as multiple templates over (Title, Author, ISBN, Format, Subject):
–b f u u u (form 1)
–f b u u u (form 1)
–u u u o[] o[] (form 2)
–u u b u u (form 3)
[Figure: a search page with three forms, labeled 1, 2, and 3.]

29 Computing mediator restrictions
Motivation: we do not want users to be frustrated by submitting a query that cannot be answered by the mediator
Example:
–Source 1: book(author, title, price)
  –Capability: “bff”
  –I.e., we must provide an author, and can get title and price info
–Source 2: review(title, reviewer, rate)
  –Capability: “bff”
  –I.e., we must provide a book title, and can get other info
–Mediator view: MedView(A,T,P,RV,RT) :- book(A,T,P), review(T,RV,RT)
–Query on the mediator view:
  –Ans(RT) :- MedView(A,'db',P,RV,RT)
  –I.e., “find the review rates of DB books”
–But the mediator cannot answer this query, since we do not know the authors.
We want to tell the user beforehand what queries can be answered
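Whether a mediator query is answerable under “b”/“f” adornments can be checked by looking for an order of source calls in which every bound attribute is already known. The sketch below is an assumption-laden illustration, not the algorithm of the cited paper: it handles only the adornments b and f, and the name `executable_order` is hypothetical.

```python
def executable_order(subgoals, bound):
    """Find an order in which source subgoals can be called.
    subgoals: list of (name, variables, adornment), adornment a string of 'b'/'f'.
    bound: variables whose values are supplied by the query.
    Returns a list of subgoal names, or None if no feasible order exists."""
    bound = set(bound)
    remaining = list(subgoals)
    order = []
    while remaining:
        for subgoal in remaining:
            name, variables, adornment = subgoal
            # A source is callable once all its 'b' positions are bound.
            if all(v in bound for v, a in zip(variables, adornment) if a == "b"):
                order.append(name)
                bound.update(variables)   # the call binds all its variables
                remaining.remove(subgoal)
                break
        else:
            return None                   # no remaining source is callable
    return order

# MedView(A,T,P,RV,RT) :- book(A,T,P), review(T,RV,RT)
subgoals = [
    ("book", ("A", "T", "P"), "bff"),
    ("review", ("T", "RV", "RT"), "bff"),
]
print(executable_order(subgoals, {"T"}))   # query binds only the title
print(executable_order(subgoals, {"A"}))   # query binds the author
```

With only the title bound, review can be called but book never becomes callable, so the function returns None, matching the slide's conclusion. With the author bound, book is called first and passes the title to review.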

30 Solutions: Compute mediator capabilities
Need algorithms that do the following:
Given
–Source relations with restrictions.
–Mediator views defined on source relations:
  –Union
  –Join
  –Selection
  –Projection
Main idea of the algorithms
–compute restrictions on mediator views
–minimize number of view templates

31 “Union” views
Assumption:
–MedView :- V1 ∪ V2
–We want to get all tuples from two sources that satisfy a query condition
–No mediator post-processing power
Table to compute view adornments
–E.g., “f, o[s3] → o[s3]”
–“c[s2], o[s3] → c[s2 ∩ s3]”
–Invalid combination: “b, u → -”
[Table: adornments of V1 against adornments of V2.]

32 “Union” views with postprocessing
Mediator can postprocess results from a source, and check if the results satisfy certain conditions
Thus some entries are more “relaxing”
–Essentially: “o” can be treated as “f”, and “u” can be treated as “f”
–E.g., “f, o[s3] → f” instead of “o[s3]”
–“c[s2], o[s3] → c[s2]” instead of “c[s2 ∩ s3]”
–“b, u → b” instead of “invalid combination”
[Table: adornments of V1 against adornments of V2.]

33 “Join” views with passing bindings
Assumption:
–MedView :- V1 JOIN V2
–The mediator can pass bindings from V1 to V2
–So the join order matters
[Table: adornments of V1 against adornments of V2.]

34 Other views
Union
Join
Selection
Projection
Multiple views

35 Concise template description
Some adornments subsume other adornments
E.g.: “f” subsumes “b”, since every query supported by “b” is also supported by “f”
Adornment graph: “subsumption” relationships
Use the graph to “compress” templates: experiments shrank 26 templates to 8
[Figure: adornment graph over f, b, c, o, u. One kind of edge from n1 to n2 means adornment n1 is at least as restrictive as adornment n2; a second kind means the same holds if the constant set of n1 is a subset of that of n2.]

36 Outline
Basics: theories of conjunctive queries
Global-as-view (GAV) approach to data integration
Local-as-view (LAV) approach to data integration

37 Local-as-view (LAV) approach
There are global predicates, e.g., “car,” “person,” “book,” etc. They can be seen as mediator views
The content of each source is described using these global predicates
A query to the mediator is also defined on the global predicates
The mediator finds a way to answer the query using the source contents
[Figure: mediator on top of the sources.]

38 Example
Global predicates: Loc(Dealer,City), Sell(Dealer,Make,Year)
Source content defined on global predicates:
S1(Dealer,City) :- Loc(Dealer,City)
S2(Dealer,Make,Year) :- Sell(Dealer,Make,Year)
In general, each definition could be more complicated, rather than a direct copy.
Queries defined on global predicates. Q: ans(D,M,Y) :- Loc(D,'irvine'), Sell(D,M,Y)
–Users do not know source views. The mediator decides how to use source views to answer queries.
–“Answering queries using views”: ans(D,M,Y) :- S1(D,'irvine'), S2(D,M,Y)

39 Another LAV Example
Mediator predicates: car(C), sell(Car,Dealer), loc(Dealer,City)
Views:
–v1(x) :- car(x)
–v2(x) :- car(x), sell(x,d)
–v3(x,d) :- sell(x,d), loc(d,'la')
–v4(x) :- sell(x,d), loc(d,'la')
Query: q(x) :- car(x), sell(x,d), loc(d,'la')

40 Open-world assumption (OWA) and closed-world assumption (CWA)
W1(Make,Dealer) :- car(Make,Dealer)
W2(Make,Dealer) :- car(Make,Dealer)
CWA:
–W1 and W2 have all car tuples: W1 = W2 = all car tuples.
–E.g.: W1 and W2 are computed from the same car table in a database.
OWA:
–W1 and W2 have some car tuples.
–E.g.: W1 and W2 are from two different web sites.

41 Projects using the LAV approach
Projects: Information Manifold, Infomaster, Tukwila, …
Advantages:
–Scalable: new sources easy to add without modifying the mediator views
–All we need to do is to define the new source using the existing mediator views (predicates)
Disadvantages:
–Hard to decide how to answer a query using views

42 Reading
Alon Halevy, Answering Queries Using Views: A Survey.

43 Answering queries using views
Source views can be complicated: SPJs, arithmetic comparisons, …
Not easy to decide how to answer a query using source views
View: V(D,C,M,Y) :- Loc(D,C), Sell(D,M,Y)
Query: ans(D,M) :- Loc(D,'irvine'), Sell(D,M,Y)
Rewriting: ans(D,M) :- V(D,'irvine',M,Y)
–“Equivalent rewriting”: computes the “same” answer as the query
–A rewriting can join multiple source views
This problem exists in many other applications:
–data warehousing
–web caching
–query optimization

44 Arithmetic comparisons
Comparisons can make the problem even trickier
View: V(D,C,M,Y) :- Loc(D,C), Sell(D,M,Y), Y < 1970
Query: ans(D,M) :- Loc(D,'irvine'), Sell(D,M,Y)
Rewriting: ans(D,M) :- V(D,'irvine',M,Y)
–Contained rewriting: only retrieves cars before 1970.
Query: ans(D,M) :- Loc(D,'irvine'), Sell(D,M,Y), Y < 1960
Rewriting: ans(D,M) :- V(D,'irvine',M,Y), Y < 1960

45 Dropping attributes in views
Drop “Year” in the view: V(D,C,M) :- Loc(D,C), Sell(D,M,Y), Y < 1970
A variable in a CQ is called:
–“distinguished”: if it appears in the query’s head
–“nondistinguished”: otherwise
The problem becomes even harder when we have nondistinguished variables.
Query: ans(D,M) :- Loc(D,'irvine'), Sell(D,M,Y), Y < 1960
–No rewriting, since we do not have “Year” information.
Query: ans(D,M) :- Loc(D,'irvine'), Sell(D,M,Y), Y < 1980
–Contained rewriting: ans(D,M) :- V(D,'irvine',M)

46 Problems
How to answer a query using views?
We will focus on the case where both the query and the views are simply conjunctive.

47 Query Expansion
For each query P on views, we can expand P using the view definitions, and get a new query, denoted as P^exp, on the base tables.
P^exp can be considered to be the “real” meaning of the query.
Example:
–View: V(D,C,M) :- Loc(D,C), Sell(D,M,Y)
–A query P using V: ans(D,M) :- V(D,'la',M)
–Expansion: ans(D,M) :- Loc(D,'la'), Sell(D,M,Y)
In general:
Query P: ans() :- v1(), v2(), …, vk()
Expansion P^exp: ans() :- p1,1(), …, p1,i1(), …, pk,1(), …, pk,ik()
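Expansion amounts to replacing each view subgoal by the view's body, with head variables bound to the call arguments and the remaining view variables renamed apart. The sketch below is illustrative: `Var` and `expand` are hypothetical names, view heads are assumed to list distinct variables, and the fresh names (such as `Y_1`) are assumed not to clash with variables already in the query.

```python
from collections import namedtuple
from itertools import count

# Hypothetical representation: Var marks a variable; any other term is a constant.
Var = namedtuple("Var", "name")

def expand(query_body, views):
    """Unfold view subgoals into base-relation subgoals.
    query_body: list of (view_name, call_args).
    views: dict view_name -> (head_args, body), body a list of (pred, args)."""
    fresh = count(1)
    expansion = []
    for view_name, call_args in query_body:
        head_args, view_body = views[view_name]
        subst = dict(zip(head_args, call_args))   # head variable -> call argument
        suffix = next(fresh)
        for pred, args in view_body:
            new_args = []
            for term in args:
                if isinstance(term, Var) and term not in subst:
                    # Nondistinguished view variable: rename it apart.
                    subst[term] = Var(f"{term.name}_{suffix}")
                new_args.append(subst.get(term, term))
            expansion.append((pred, tuple(new_args)))
    return expansion

D, C, M, Y = Var("D"), Var("C"), Var("M"), Var("Y")
views = {"V": ((D, C, M), [("Loc", (D, C)), ("Sell", (D, M, Y))])}
# P: ans(D,M) :- V(D,'la',M)
expansion = expand([("V", (D, "la", M))], views)
print(expansion)
```

The result corresponds to Loc(D,'la'), Sell(D,M,Y_1), the slide's expansion up to the renaming of Y. Renaming matters when the same view is used more than once in P, since each use has its own hidden Year value.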

48 Rewritings
Given a query Q and a set of views V:
–A conjunctive query P is called a “rewriting” of Q using V if P only uses views in V, and P computes a partial answer of Q. That is: P^exp ⊆ Q. A rewriting is also called a “contained rewriting” (CR).
–A conjunctive query P is called an “equivalent rewriting” (ER) of Q using V if P only uses views in V, and P computes the exact answer of Q. That is: P^exp ≡ Q.
–A query P is called a “maximally-contained rewriting” (MCR) of Q using V if P is a union of CRs of Q using V, and for any CR P1 of Q, the answer to P contains the answer to query P1, that is, P1^exp ⊆ P^exp.
See earlier slides for examples
Notice that all these definitions depend on the language of the rewriting considered. Here we consider “conjunctive queries.”

49 Focus: MiniCon algorithm
MiniCon Algorithm: Rachel Pottinger and Alon Levy, “A scalable algorithm for answering queries using views,” VLDB 2000.
See also: The Shared-variable-bucket algorithm by Prasenjit Mitra: “An Algorithm for Answering Queries Efficiently Using Views,” in Proceedings of the Australasian Database Conference, Jan 2001.
Formulation:
–Input: a conjunctive query Q and a set V of conjunctive views
–Output: a maximally-contained rewriting (MCR) of Q using V
Main idea:
–For each query subgoal and for each view
  –Check if the view can be used to “answer” the query subgoal, and if so, in what “form”
  –Some “shared” variables are treated carefully
–Combine views to answer all query subgoals
  –Reduced to a set-cover problem

50 Example
Query: q(x) :- car(x), sell(x,d), loc(d,'la')
Views:
–v1(x) :- car(x)
–v2(x) :- car(x), sell(x,d)
–v3(x,d) :- sell(x,d), loc(d,'la')
–v4(x) :- sell(x,d), loc(d,'la')

51 MCDs (“enhanced Buckets”)
For query subgoal car(x), its MCD includes all views that can answer this subgoal:
–v1(x), v2(x)
MCD of query subgoal sell(x,d):
–v3(x,d) only
–but not v2(x)! Because:
  –Variable d is nondistinguished in v2, i.e., it is not exported.
  –Variable d is shared by another query subgoal, loc(d,'la'). If we were to use v2(x) to answer query subgoal sell(x,d), we could not get the dealer info to join with the other view to answer loc(d,'la').
MCD of query subgoal loc(d,'la'):
–v3(x,d)

52 Multi-subgoal MCD
MCD of query subgoals sell(x,d), loc(d,'la'):
–v4(x)
–If v4(x) is used to answer query subgoal sell(x,d), then the query subgoal loc(d,'la') must be answered using v4(x) as well.
–The reason is that d is shared by two query subgoals, and the corresponding variable in v4(x) is not exported.

53 General rules
For a query subgoal G and a view subgoal H in view W, the MiniCon algorithm considers a mapping from G to H
In this mapping, a query variable X is mapped to a view variable A
Four possible cases:
–Case 1: X is distinguished, A is distinguished. OK.
  –A is exported, so it can join with other views.
–Case 2: X is nondistinguished, A is distinguished. OK.
  –Same as above
–Case 3: X is distinguished, A is nondistinguished. NOT OK.
  –X needs to be in the answer, but A is not exported.
–Case 4: X is nondistinguished, A is nondistinguished.
  –Then all the query subgoals using X must be able to be mapped to other subgoals in view W.
  –Reason: since A is not exported in W, it’s impossible for W to join with other views to answer conditions involving X.
  –I.e., “either NONE or ALL.”

54 Combine MCDs to cover query subgoals
Problem:
–q(x) :- car(x), sell(x,d), loc(d,'la')
–v1(x) :- car(x)
–v2(x) :- car(x), sell(x,d)
–v3(x,d) :- sell(x,d), loc(d,'la')
–v4(x) :- sell(x,d), loc(d,'la')
MCDs:
–car(x): v1(x), v2(x)
–sell(x,d): v3(x,d)
–loc(d,'la'): v3(x,d)
–sell(x,d), loc(d,'la'): v4(x)
Contained rewritings, using MCDs to cover all query subgoals without overlap:
–P1: q(x) :- v1(x), v3(x,d), v3(x,d)
–P2: q(x) :- v2(x), v3(x,d), v3(x,d)
–P3: q(x) :- v1(x), v4(x)
–P4: q(x) :- v2(x), v4(x)
MCR: union of these four contained rewritings.
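The combination step can be sketched as an exhaustive search for sets of MCDs that cover every query subgoal exactly once. This is only an illustration of the covering idea, not the MiniCon implementation: the function name `combine_mcds` is hypothetical, MCDs are given by hand as (view atom, covered subgoal indexes), and variable bookkeeping is omitted.

```python
from itertools import combinations

def combine_mcds(num_subgoals, mcds):
    """Return every combination of MCDs that covers each query subgoal
    exactly once. mcds: list of (view_atom, frozenset of subgoal indexes)."""
    all_subgoals = frozenset(range(num_subgoals))
    rewritings = []
    for size in range(1, len(mcds) + 1):
        for combo in combinations(mcds, size):
            covered = [i for _, subgoals in combo for i in subgoals]
            # Same count and same set means: full cover, no overlap.
            if len(covered) == num_subgoals and frozenset(covered) == all_subgoals:
                rewritings.append([atom for atom, _ in combo])
    return rewritings

# Query subgoals: 0 = car(x), 1 = sell(x,d), 2 = loc(d,'la')
mcds = [
    ("v1(x)", frozenset({0})),
    ("v2(x)", frozenset({0})),
    ("v3(x,d)", frozenset({1})),
    ("v3(x,d)", frozenset({2})),
    ("v4(x)", frozenset({1, 2})),
]
for body in combine_mcds(3, mcds):
    print("q(x) :- " + ", ".join(body))
```

The search yields the four rewritings P1 to P4 listed above. The no-overlap condition is what rules out combinations such as v1(x), v3(x,d), v4(x).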

55 Related references
Other algorithms on answering queries using views (AQUV):
–Bucket, Inverse-rule
Generating efficient equivalent rewritings of queries using views:
–CoreCover algorithm: [Afrati, Li, Ullman, SIGMOD’01]
Handling arithmetic comparisons and dropped attributes:
–[Afrati, Li, Mitra, PODS’02]
–[Afrati, Li, Mitra, EDBT’04]

