Presentation is loading. Please wait.

Presentation is loading. Please wait.

1 Corso di Rappresentazione della Informazione e della Conoscenza Anno Accademico 2007-2008 Matteo Palmonari Query Processing in Data Integration Materiale.

Similar presentations


Presentation on theme: "1 Corso di Rappresentazione della Informazione e della Conoscenza Anno Accademico 2007-2008 Matteo Palmonari Query Processing in Data Integration Materiale."— Presentation transcript:

1 1 Corso di Rappresentazione della Informazione e della Conoscenza Anno Accademico Matteo Palmonari Query Processing in Data Integration Materiale organizzato da presentazioni di: L. Tanca, S. Costantini, J. Sun & L. Zhao, M. Van Der Wielen, Ir. R. Vdovjak, Kambhampati & Knoblock, S. Bergamaschi

2 2 An introduction to data integration Prof. Letizia Tanca Politecnico di Milano Technologies for Information Systems

3 3 An Online Shopper’s Information Integration Problem El Cheapo: “Where can I get the cheapest copy (including shipping cost) of Wittgenstein’s Tractatus Logicus-Philosophicus within a week?” ? Information Integration addall.com“One-World”Mediation amazon.com A1books.com half.com barnes&noble.com

4 4 What is a Data Integration System? –Uniform (same query interface to all sources) –Access to (queries; eventually updates too) –Multiple (we want many, but 2 is hard too) –Autonomous (DBA doesn’t report to you) –Heterogeneous (data models are different) –Structured (or at least semi-structured) –Data Sources (not only databases). A system providing:

5 5 Virtual Integration Architecture Data source wrapper Data source wrapper Data source wrapper Sources can be: relational, hierarchical (IMS), structured files, web sites. Mediator: User queries Mediated schema Data source catalog Reformulator Optimizer Execution engine

6 6 Query Model in Virtual Integration User formulates query in terms of his/her ontology on the mediated (or “global”) schema System reformulates queries in terms of sub-queries for each source (“local” schema) Structure of the query model should be more intuitive for the user

7 7 Reformulation Problem Given: –A query Q posed over the mediated schema –Descriptions of the data sources Find: –A query Q’ over the data source relations, such that: Q’ provides only correct answers to Q, and Q’ provides all possible answers from to Q given the sources.

8 8 Mapping between the global logical schema and the single source schemata (logical view definition) Two basic approaches – GAV (Global As View) – LAV (Local As View) Can be used also in case of different data models In that case a model transformation is required

9 9 GAV (Global As View) Up to now we supposed that the global schema be derived from the integration process of the data source schemata Thus the global schema is expressed in terms of the data source schemata Such approach is called the Global As View approach

10 10 The other possible ways… LAV (Local As View) The global schema has been designed independently of the data source schemata The relationship (mapping) between sources and global schema is obtained by defining each data source as a view over the global schema GLAV (Global and Local As View) The relationship (mapping) between sources and global schema is obtained by defining a set of views, some over the global schema and some over the data sources

11 11 GGlobal schema G schemata SSource schemata S Mapping M between sources and global schema: a set of assertionsMapping M between sources and global schema: a set of assertions q S  q G q G  q S Intuitively, the first assertion specifies that the concept represented by a view (query) q S over a source schema S corresponds to the concept specified by q G over the global schema. Viceversa for the second assertion. Mapping between data sources and global schema

12 12 A data integration system is a triple (G, S, M)A data integration system is a triple (G, S, M) The query to the integrated system are posed in terms of G and specify which data of the virtual database we are interested inThe query to the integrated system are posed in terms of G and specify which data of the virtual database we are interested in The problem is understanding which real data (in the data sources) correspond to those virtual dataThe problem is understanding which real data (in the data sources) correspond to those virtual data Mapping between data sources and mediated schema

13 13 GAV A GAV mapping is a set of assertions, one for each element g of G g  q S That is, the mapping specifies g as a query q S over the data sources. This means that the mapping tells us exactly how the element g is computed.  OK for stable data sources  Difficult to extend with a new data source

14 14 GAV example SOURCE 1 Product(Code, Name, Description, Warnings, Notes, CatID) Category(ID, Name, Description) Version(ProductCode, VersionCode, Size, Color, Name, Description, Stock, Price) SOURCE 2 Product(Code, Name, Size, Color, Description, Type, Price, Q.ty) Tipe(TypeCode, Name, Description) n.b.: WE DO NOT CARE ABOUT DATA TYPES…

15 15 SOURCE 1 Product(Code, Name, Description, Warnings, Notes, CatID) Version(ProductCode, VersionCode, Size, Color, Name, Description, Stock, Price) SOURCE 2 Product(Code, Name, Size, Color, Description, Type, Price, Q.ty) GLOBAL SCHEMA CREATE VIEW GLOB-PROD AS SELECT Code AS PCode, VersionCode as VCode, Version.NAme AS Name, Size, Color, Version.Description as Description, CatID, Version.Price, Stock FROM SOURCE1.Product, SOURCE1.Version WHERE Code = ProductCode UNION SELECT Code AS PCode, null as VCode, Name, Size, Color, Description,Type as CatID, Price, Q.ty AS Stock FROM SOURCE2.Product

16 16 GAV Suppose now we introduce a new sourceSuppose now we introduce a new source The simple view we have just created is to be modifiedThe simple view we have just created is to be modified In the simplest case we only need to add a union with a new SELECT-FROM- WHERE clauseIn the simplest case we only need to add a union with a new SELECT-FROM- WHERE clause This is not true in general, view definitions may be much more complexThis is not true in general, view definitions may be much more complex

17 17 GAV Quality depends on how well we have compiled the sources into the global schema through the mapping Whenever a source changes or a new one is added, the global schema needs to be reconsidered Query processing can be based on some sort of unfolding Example: one already seen

18 18 LAV The Local-as-View approach, see Figure, describes local sources as a view defined in terms of the global schema. The global schema is predefined and for each local source is described how it delivers information to the global schema. Each mapping associates the entities in the source schemas by way of a query over the global schema. Thereby, each source schema is defined as a view over the global schema, hence the name “Local-as-View”.

19 19 LAV A mapping LAV is a set of assertions, one for each element s of each source S s  q G Thus the content of each source is characterized in terms of a view q G over the global schema OK if the global schema is stable, e.g. based on a domain ontology or an enterprise modelOK if the global schema is stable, e.g. based on a domain ontology or an enterprise model It favours extensibilityIt favours extensibility Query processing much more complexQuery processing much more complex

20 20 LAV Quality depends on how well we have characterized the sources High modularity and extensibility (if the global schema is well designed, when a source changes or is added, only its definition is to be updated) Query processing needs reasoning

21 21 SOURCE 1 Product(Code, Name, Description, Warnings, Notes, CatID) Version(ProductCode, VersionCode, Size, Color, Name, Description, Stock, Price) SOURCE 2 Product(Code, Name, Size, Color, Description, Type, Price, Q.ty) GLOBAL SCHEMA GLOB-PROD (PCode, VCode, Name, Size, Color, Description, CatID, Price, Stock) In this case we have to express the sources as views over the global schema

22 22 GLOBAL SCHEMA GLOB-PROD (PCode, VCode, Name, Size, Color, Description, CatID, Price, Stock) SOURCE 2 Product (Code, Name, Size, Color, Description, Type, Price, Q.ty) CREATE VIEW SOURCE2.Product AS SELECT PCode AS Code, Name, Size, Color, Description, CatID as Type, Price, Stock AS Q.ty FROM GLOB-PROD

23 23 GLOBAL SCHEMA GLOB-PROD (PCode, VCode, Name, Size, Color, Description, CatID, Price, Stock) SOURCE 1 Product(Code, Name, Description, Warnings, Notes, CatID) Version(ProductCode, VersionCode, Size, Color, Name, Description, Stock, Price) CREATE VIEW SOURCE1.Product AS SELECT Pcode AS Code, ? Name ?, ? Description ?, ? Warnings ?, ? Notes ?, CatID FROM GLOB-PROD CREATE VIEW SOURCE1. Version AS SELECT Pcode AS ProductCode, VCode as VersionCode, Size, Color, Name, ? Description ?, Stock, Price) FROM GLOB-PROD N.B.: Some information is lacking: either we don’t know where description has to be taken from, or there is no correspondent value (e.g.Warnings, Notes, etc..).  A difficult job

24 24 The GLAV approach Is a combination of LAV and GAV. Part of the mappings are of LAV type, part of GAV type. It combines the advantages of the two approaches.

25 Kambhampati & KnoblockInformation Integration on the Web (MA-1)25 Approaches for relating source & Mediator Schemas Global-as-view (GAV): express the mediated schema relations as a set of views over the data source relations Local-as-view (LAV): express the source relations as views over the mediated schema. Can be combined…? Virtual vs Materialized “View” Refresher Let’s compare them in a movie Database integration scenario.. Differences minor for data aggregation…

26 Kambhampati & KnoblockInformation Integration on the Web (MA-1)26 Global-as-View Mediated schema: Movie(title, dir, year, genre), Schedule(cinema, title, time). Create View Movie AS select * from S1 [S1(title,dir,year,genre)] union select * from S2 [S2(title, dir,year,genre)] union [S3(title,dir), S4(title,year,genre)] select S3.title, S3.dir, S4.year, S4.genre from S3, S4 where S3.title=S4.title Express mediator schema relations as views over source relations

27 Kambhampati & KnoblockInformation Integration on the Web (MA-1)27 Global-as-View Mediated schema: Movie(title, dir, year, genre), Schedule(cinema, title, time). Create View Movie AS select * from S1 [S1(title,dir,year,genre)] union select * from S2 [S2(title, dir,year,genre)] union [S3(title,dir), S4(title,year,genre)] select S3.title, S3.dir, S4.year, S4.genre from S3, S4 where S3.title=S4.title Express mediator schema relations as views over source relations Mediator schema relations are Virtual views on source relations

28 Kambhampati & KnoblockInformation Integration on the Web (MA-1)28 Global-as-View: Example 2 Mediated schema: Movie(title, dir, year, genre), Schedule(cinema, title, time). Create View Movie AS [S1(title,dir,year)] select title, dir, year, NULL from S1 union [S2(title, dir,genre)] select title, dir, NULL, genre from S2 Null values Express mediator schema relations as views over source relations

29 Kambhampati & KnoblockInformation Integration on the Web (MA-1)29 Global-as-View: Example 2 Mediated schema: Movie(title, dir, year, genre), Schedule(cinema, title, time). Source S4: S4(cinema, genre) Create View Movie AS select NULL, NULL, NULL, genre from S4 Create View Schedule AS select cinema, NULL, NULL from S4. But what if we want to find which cinemas are playing comedies? “Lossy Mediation” Express mediator schema relations as views over source relations

30 Kambhampati & KnoblockInformation Integration on the Web (MA-1)30 Local-as-View: example 1 Mediated schema: Movie(title, dir, year, genre), Schedule(cinema, title, time). S1(title,dir,year,genre) S3(title,dir) S5(title,dir,year), year >1960 Create Source S1 AS select * from Movie Create Source S3 AS select title, dir from Movie Create Source S5 AS select title, dir, year from Movie where year > 1960 AND genre=“Comedy” Sources are “materialized views” of mediator schema Express source schema relations as views over mediator relations

31 Kambhampati & KnoblockInformation Integration on the Web (MA-1)31 Mediated schema: Movie(title, dir, year, genre), Schedule(cinema, title, time). S1(title,dir,year,genre) S3(title,dir) S5(title,dir,year), year >1960 Create Source S1 AS select * from Movie Create Source S3 AS select title, dir from Movie Create Source S5 AS select title, dir, year from Movie where year > 1960 AND genre=“Comedy” Creat Source S4 AS select cinema, genre from Movie m, Schedule s where m.title=s.title S4(Cinema,Genre) Express source schema relations as views over mediator relations

32 Kambhampati & KnoblockInformation Integration on the Web (MA-1)32 Local-as-View: Example 2 Mediated schema: Movie(title, dir, year, genre), Schedule(cinema, title, time). Source S4: S4(cinema, genre) Create Source S4 select cinema, genre from Movie m, Schedule s where m.title=s.title Now if we want to find which cinemas are playing comedies, there is hope! Express source schema relations as views over mediator relations

33 Kambhampati & KnoblockInformation Integration on the Web (MA-1)33 GAV vs. LAV Mediated schema: Movie(title, dir, year, genre), Schedule(cinema, title, time). Source S4: S4(cinema, genre) Lossy mediation

34 Kambhampati & KnoblockInformation Integration on the Web (MA-1)34 GAV vs. LAV Not modular –Addition of new sources changes the mediated schema Can be awkward to write mediated schema without loss of information Query reformulation easy –reduces to view unfolding (polynomial) –Can build hierarchies of mediated schemas Best when –Few, stable, data sources –well-known to the mediator (e.g. corporate integration) Garlic, TSIMMIS, HERMES, MOMIS Modular--adding new sources is easy Very flexible--power of the entire query language available to describe sources Reformulation is hard –Involves answering queries only using views (can be intractable—see below) Best when –Many, relatively unknown data sources –possibility of addition/deletion of sources Information Manifold, InfoMaster, Emerac, Havasu

35 35 Views: Sound & Completeness The terms sound, exact and complete are used to express to what degree the extent of a view corresponds to its definition. A view defined over some data source is sound if it provides a subset of the available data in the data source that corresponds to the definition. –It delivers only, but not necessarily all answers to its definition. The answers it does deliver might be incomplete, but they are correct (sound). If a view is complete, it provides a superset of the available data in the data source that corresponds to the definition. It delivers all answers to its definition, and maybe more. –Since the set of answers might contain more than the answers corresponding to the views definition, but does contain all answers corresponding to the views definition, the term complete is being used. A view is exact if it provides all and only data corresponding to the definition.

36 36 Query answering in GAV Query answering in the Global-as-View happens through query unfolding. Unfolding queries to the local sources is relatively easy compared to the Local as-View approach in which queries posted to the global schema have to be rewritten before being able to answer the query.

37 37 Query answering in GAV Because each global entity is defined as a view over the local schemas, the total number of GAV-rules necessary to define the mappings in a data- integration system corresponds to the number of entities in the global schema. the extent of a global schema defined by GAV-rules is assumed to be exact.

38 38 Query answering in LAV The Local-as-View approach, see Figure, describes local sources as a view defined in terms of the global schema. The global schema is predefined and for each local source is described how it delivers information to the global schema. Each mapping associates the entities in the source schemas by way of a query over the global schema. Thereby, each source schema is defined as a view over the global schema, hence the name “Local-as-View”.

39 39 Query answering in LAV Each entity in the local schemas is defined as a view over the global schema. Therefore, the total number of LAV-rules necessary to define the mappings in a data-integration system corresponds to the number of entities in the local schemas. Processing queries in the Local-as-View approach is difficult because the only knowledge of the global-schema available is through the views representing the sources. Such view only provides partial information about the data. Since the mapping associated to each source as a view over the global schema it is not clear how to use the sources in order to answer queries expressed over the global schema. Therefore extracting information from an integration system using the Local-as-View approach is a complex task because one has to answer queries with incomplete information. Views defined by LAV-rules might be sound or complete, in terms of 3.1.

40 40 Introduction l Traditional Integration Formalisms F Global as View (GAV)  Mediated schema as views over data sources F Local as View (LAV)  Data sources as views of mediated schema Med. Schema T GAV: T :- S1, S2, S3 LAV: S1  S1S2S3

41 41 Some DB Theory Schema Conjunctive queries For example: Give me students together with their advisors who took courses given by their advisors after the 1st trimester 2003.

42 42 DB Theory SQL SELECT Advises.prof, Advises.student FROM Registered, Teaches, Advises WHERE Registered.c-number = Teaches.c-number and Registered.trimester=Teaches.trimester and Advises.prof=Teaches.prof and Advises.student=Registered.student and Registered.trimester > “2003\1” DATALOG Q(prof, student) :- Registered(student,course, trimester), Teaches(prof,course, trimester), Advises(prof, student), trimester > “2003\1” head body

43 43 Problem Definition l GAV-like Definition – Definition in Datalog 9DC : SkilledPerson(PID, “Doctor”) : - H : Doctor(SID, h, l, s, e) 9DC : SkilledPerson(PID, “EMT”): - H : EMT(SID, h, vid, s, e) 9DC : SkilledPerson(PID, “EMT”): - FS : Schedule(PID, vid), FS : FirstResponse(vid, s, l, d), FS : Skills(PID, “medical”)

44 44 Problem Definition l LAV-like Definition – Inclusion Definition LH : CritBed(bed, hosp, room, PID, status)  H : CritBed(bed, hosp, room), H : Patient(PID, bed, status) LH : EmergBed(bed, hosp, room, PID, status)  H : EmergBed(bed, hosp, room), H : Patient(PID, bed, status)

45 Dipartimento di Informatica Università degli Studi di L’Aquila 45 Deductive Databases » Tables viewed as predicates. » Ops on tables expressed as “datalog” rules › (Horn clauses, without function symbols) Enames(Name) :- Employe(Name, SSN) [Projection] Wealthy-Employee(Name):- Employee(Name,SSN), Salary(SSN,Money),Money> 10 [Selection] Ed(Name, Dname):- Employee(Name, SSN), E_Dependents(SSN, Dname) [Join] ERelated(Name,Dname) :- Ed(Name,Dname) ERelated(Name,Dname) :- Ed(Name,D1), ERelated(D1,D2) [Recursion]

46 Dipartimento di Informatica Università degli Studi di L’Aquila 46 More datalog terminology A datalog program is a set of datalog rules. A program with one rule is a conjunctive query. We distinguish EDB predicates and IDB predicates » EDB’s are stored in the database, › appear only in rule bodies » IDB’s are intensionally defined, › appear in both bodies and heads.

47 Dipartimento di Informatica Università degli Studi di L’Aquila 47 Approaches to Specifying Schema Descriptions » Global-as-View: express the mediated schema relations as a set of views over the data source relations » Local-as-View: express the source relations as views over the mediated schema.

48 48 DB Theory, query containment A query Q’ is contained in Q if for all databases D, the set of tuples computed for Q’ is a subset of those computed for Q, i.e., Q’  Q iff  D Q’(D)  Q(D) A query Q’ is equivalent to Q iff Q’  Q and Q  Q’ Example : Q’(prof, student) :- Registered(student,course, trimester), Teaches(prof,course, trimester), Advises(prof, student), trimester > “2003\2” is contained in Q(prof, student) :- Registered(student,course, trimester), Teaches(prof,course, trimester), Advises(prof, student), trimester > “2003\1”

49 49 Query containment check Create a canonical database D that is the “frozen” body of Q’ Compute Q(D) If Q(D) contains the frozen head of Q’ then Q’  Q otherwise not Example : Q’: p3(x,y) :- arc(x,z),arc(z,w),arc(w,y) Q: path(x,y) :- arc(x,y) path(x,y) :- path(x,z), path(z,y) 1. Freeze Q’ with say 0,1,2,3 for x,z,w,y respectively: Q’(0,3) :- arc(0,1), arc(1,2), arc(2,3) D={arc(0,1), arc(1,2), arc(2,3)} 2. Compute Q(D): Q(D)={(0,1), (1,2), (2,3), (0,2), (1,3), (0,3)} 3. The frozen head of Q’ = (0,3) is in the Q(D)  Q’  Q

50 50 Query reformulation using views Problem: Given a user query Q and view definitions V={V1..Vn} Q’ is an Equivalent Rewriting of Q using V iff Q’ refers only to views in V, Q’ is equivalent to Q Q’ is a Maximally-contained Rewriting of Q using V iff Q’ refers only to views in V, Q’  Q, there is no such rewriting Q1 that Q’  Q1  Q, Q1  Q

51 Kambhampati & KnoblockInformation Integration on the Web (MA-1)51 Reformulation in LAV Given a set of views V1,…,Vn, and a query Q, can we answer Q using only the answers to V1,…,Vn? –Notice the use of materialized (pre-computed) views ONLY materialized views should be used… Approaches –Bucket algorithm [Levy; 96] –Inverse rules algorithm [Duschka, 99] –Hybrid versions SV-Bucket [2001], MiniCon [2001]

52 Kambhampati & KnoblockInformation Integration on the Web (MA-1)52 Reformulation in LAV: The issues Query: Find all the years in which Zhang Yimou released movies. Select year from movie M where M.dir=yimou Q(y) :- movie(T,D,Y,G),D=yimou Not executable Q(y) :- S1(T,D,Y,G), D=yimou (1) Q(y) :- S1(T,D,Y,G), D=yimou Q(y) :- S5(T,D,Y), D=yimou (2) Which is the better plan? What are we looking for? --equivalence? --containment? --Maximal Containment --Smallest plan?

53 Kambhampati & KnoblockInformation Integration on the Web (MA-1)53 Maximal Containment Query plan should be sound and complete –Sound implies that Query should be contained in the Plan (I.e., tuples satisfying the query are subset of the tuples satisfying the plan –Completeness? –Traditional DBs aim for equivalence Query contains the plan; Plan contains the query –Impossible We want all query tuples that can be extracted given the sources we have –Maximal containment (no other query plan, which “contains” this plan is available) P contains Q if P |= Q (exponential even for “conjunctive” (SPJ) query plans)

54 Kambhampati & KnoblockInformation Integration on the Web (MA-1)54 The world of Containment Consider Q 1 (.) :- B 1 (.) Q 2 (.) :- B 2 (.) Q 1  Q 2 (“contained in”) if the answer to Q 1 is a subset of Q 2 –Basically holds if B 1 (x) |= B 2 (x) Given a query Q, and a query plan Q 1, –Q 1 is a sound query plan if Q 1 is contained in Q –Q 1 is a complete query plan if Q is contained in Q 1 –Q 1 is a maximally contained query plan if there is no Q 2 which is a sound query plan for Q 1, such that Q 1 is contained in Q 2

55 Kambhampati & KnoblockInformation Integration on the Web (MA-1)55 Computing Containment Checks Consider Q 1 (.) :- B 1 (.) Q 2 (.) :- B 2 (.) Q 1  Q 2 (“contained in”) if the answer to Q 1 is a subset of Q 2 –Basically holds if B 1 (x) |= B 2 (x) –(but entailment is undecidable in general… boo hoo) Containment can be checked through containment mappings, if the queries are “conjunctive” (select/project/join queries, without constraints) [ONLY EXPONENTIAL!!!--aren’t you relieved?] m is a containment mapping from Vars(Q 2 ) to Vars(Q 1 ) if ¶m maps every subgoal in the body of Q 2 to a subgoal in the body of Q 1 ·m maps the head of Q 2 to the head of Q 1 Eg: Q1(x,y) :- R(x), S(y), T(x,y) Q2(u,v) :- R(u), S(v) Containment mapping: [u/x ; v/y]

56 Kambhampati & KnoblockInformation Integration on the Web (MA-1)56 Reformulation Algorithms Bucket algorithm –Cartesian product of buckets –Followed by “containment” check Inverse Rules –plan fragments for mediator relations V1V2 Q(.) :- V1() & V2() S11() :- V1() S12 :- V1() S21() :- V2() S22 :- V2() S00() :- V1(), V2() S11 S12 S00 S21 S22 S00 V1() :- S11() V1() :- S12() V1() :- S00() V2() :- S21() V2() :- S22() V2() :- S00() Q(.) :- V1() & V2() Bucket Algorithm Inverse Rules [Levy] [Duschka] P 1 contains P 2 if P 2 |= P 1

57 Kambhampati & KnoblockInformation Integration on the Web (MA-1)57 Movie(T,D,Y,G) :- S1(T,D,Y,G) Movie(T,D,?y,?g ) :- S3(T,D) Movie(T,D,Y,?g) :- S5(T,D,Y), y>1960 Movie(?t,?d,?y,G) :- S4(C,G) Schedule(C,?t,?ti) :- S4(C,G) Movie(T,D,Y,G) S1(T,D,Y,G) S3(T,D) S5(T,D,Y) S4(C,G) Skolemize Sf s5-1 (T,D,Y) Sched(C,T,Ti) Q(C,G) :- Movie(T,D,Y,G), Schedule(C,T,Ti) Skolemize Sf s4-1 (G,C)

58 Kambhampati & KnoblockInformation Integration on the Web (MA-1)58 Take-home vs. In-class (The “Tastes Great/Less Filling” debate of CSE 494) More time consuming –To set, to take and to grade Caveat: people tend to over-estimate the time taken to do the take-home since the don’t factor out the “preparation” time that is interleaved with the exam time..but may be more realistic Probably will be given on Th 6 th and will be due by 11 th evening. (+/-) Less time-consuming –To take (sort of like removing a bandage..) –and definitely to grade… Scheduled to be on Tuesday, May 11 th 2:40— 4:30

59 Kambhampati & KnoblockInformation Integration on the Web (MA-1)59 Interactive Review on 5/4 A large (probably most?) of the class on 5/4 will be devoted to interactive semester review Everyone should come prepared with at least 3 topics/ideas that they got out of the class –You will be called in random order and you should have enough things to not repeat what others before you said. What you say will then be immortalized on the class homepage –See the *old* acquired wisdom page on the class page.

60 Kambhampati & KnoblockInformation Integration on the Web (MA-1)60 Bucket Algorithm: Populating buckets ¶For each subgoal in the query, place relevant views in the subgoal’s bucket Inputs: Q(x):- r 1 (x,y) & r 2 (y,x) V 1 (a):-r 1 (a,b) V 2 (d):-r 2 (c,d) V 3 (f):- r 1 (f,g) & r 2 (g,f) r 1 (x,y) V 1 (x),V 3 (x) r 2 (y,x) V 2 (x), V 3 (x) Buckets:

61 Kambhampati & KnoblockInformation Integration on the Web (MA-1)61 Combining Buckets ËFor every combination in the Cartesian products from the buckets, check containment in the query Candidate rewritings: Q’ 1 (x) :- V 1 (x) & V 2 (x) X Q’ 2 (x) :- V 1 (x) & V 3 (x) X Q’ 3 (x) :- V 3 (x) & V 2 (x) X Q’ 4 (x) :- V 3 (x) & V 3 (x) * r 1 (x,y) V 1 (x),V 3 (x) r 2 (y,x) V 2 (x), V 3 (x) Bucket Algorithm will check all possible combinations r 1 (x,y)r 2 (y,x) Buckets: Q(x):- r 1 (x,y) & r 2 (y,x) R 1 (x,b) & R 2 (c,x)

62 Kambhampati & KnoblockInformation Integration on the Web (MA-1)62 Complexity of finding maximally contained plans in LAV Complexity does change if the sources are not “conjunctive queries” –Sources as unions of conjunctive queries (NP-hard) Disjunctive descriptions –Sources as recursive queries (Undecidable) Comparison predicates Complexity is less dependent on the query –Recursion okay; but inequality constraints lead to NP-hardness Complexity also changes based on Open vs. Closed world assumption True source contents (of Big Two) Advertised description “All cars” [Abiteboul & Duschka, 98] You can “reduce” the complexity by taking a conjunctive query that is an upperbound.  This just pushes the complexity to minimization phase Advertised description “Toyota” U “Olds” Big Two

63 Kambhampati & KnoblockInformation Integration on the Web (MA-1)63 Query Optimization Challenges -- Deciding what to optimize --Getting the statistics on sources --Doing the optimization We will first talk about reformulation level challenges

64 Kambhampati & KnoblockInformation Integration on the Web (MA-1)64 Local Completeness Information If sources are incomplete, we need to look at each one of them. Often, sources are locally complete. Movie(title, director, year) complete for years after 1960, or for American directors. Question: given a set of local completeness statements, is a query Q’ a complete answer to Q? True source contents Advertised description Guarantees (LCW; Inter-source comparisons) Problems: 1. Sources may not be interested in giving these!  Need to learn  hard to learn! 2. Even if sources are willing to give, there may not be any “big enough” LCWs Saying “I definitely have the car with vehicle ID XXX is useless

65 Dipartimento di Informatica Università degli Studi di L’Aquila Data Integration: Global-as- View in Answer Set Programming S. Costantini Università degli Studi di L’Aquila

66 Dipartimento di Informatica Università degli Studi di L’Aquila 66 Computing Answers in Data Integration Systems » Answer set programming (ASP) gives a semantics to datalog (logic) programs with negation (and possibly disjunction in the heads) » ASP-based specification of a data integration system » ASP-based specification of extensions (plausible answers, repairs)

67 Dipartimento di Informatica Università degli Studi di L’Aquila 67 Computing Answers in Data Integration Systems Methodology works for first-order queries (and Datalog extensions), and many ICs › LAV: Bravo and Bertossi, IJCAI’03 › GAV: Costantini and Formisano, ASP’03 (experiments and a working program)

68 Dipartimento di Informatica Università degli Studi di L’Aquila 68 GaV in ASP » Instead of unfolding w.r.t. each query, helper model so as to get a general inference engine » Easily express integrity constraints » Cope with incomplete information (also resulting form adding new sources)

69 Dipartimento di Informatica Università degli Studi di L’Aquila 69 GaV: an example

70 Dipartimento di Informatica Università degli Studi di L’Aquila 70 Front-end: Global Schema through a Helper Model person(X):- entity(1,X).% meta-concepts + naming organization(X):- entity(2,X). member(X,Y):- relat(1,X,Y). student(X):- entity(3,X). university(X):- entity(4,X). enrolled(X,Y):- relat(2,X,Y). age(X,Y):- attr(1,X,Y). % Constraints on student and enrolled enrolledE(X):- student(X),university(Y),enrolled(X,Y). :- not student(X),enrolled(X,Y). :- student(X),not enrolledE(X).

71 Dipartimento di Informatica Università degli Studi di L’Aquila 71 Back-end: Mapping to the Source Schemas entity(1,X):- s(1,X). entity(2,X):- s(2,X). entity(4,X):- s(5,X). % student is described implicitly as the domain of an unknown relationship entity(3,X):- s(3,X,Y). % student is described implicitly as the domain of one of the sources of enrolled entity(3,X):- s(4,X,Z).

72 Dipartimento di Informatica Università degli Studi di L’Aquila 72 Back-end: Mapping to the Source Schemas % Relationships relat(1,X,Y):- s(7,X,Z),s(8,Z,Y). relat(2,X,Y):- s(4,X,Y). relat(2,X,Y):-s(9,X,Y). attr(1,X,Y):- s(3,X,Y). attr(1,X,Y):- s(6,X,Y,Z).

73 Dipartimento di Informatica Università degli Studi di L’Aquila 73 Helper Model: Relational Inference Engine » Meta-rules for is_a chaining: relat(M,X,Y):- schema(N,E1,E2), entity(E1,X), entity(E2,Y), is_a_r(N,M), relat(N,X,Y). entity(M,X):- is_a_e(N,M), entity(N,X).

74 Dipartimento di Informatica Università degli Studi di L’Aquila 74 Dealing with Incomplete Information » Assume to add a new source for Enrolled, that in the back-end becomes, e.g.,: relat(2,X,Y):- s(4,X,Y). relat(2,X,Y):- s(9,X,Y). » No connection to student: but, one who is enrolled to a University should be a student

75 MOMIS – 75 MOMIS (Mediator envirOnment for Multiple Information Sources) The MOMIS Query Manager A query execution example

76 MOMIS – 76 The MOMIS Query Manager reformulates the query taking into account the mappings defined among the local classes and the global classes of the GVV (Global Virtual View). The mappings are defined by using a GAV (Global as View) approach: each global class of the GVV is expressed by means of the full-disjunction operator over the local classes. Query Execution Query rewriting Query rewriting  GAV approach:  GAV approach: the query is processed by means of unfolding Fusion and Reconciliation Fusion and Reconciliation of the local answers into the global answer  Object Identification  Object Identification : Join conditions among local classes  Inconsistencies:  Inconsistencies: Resolution functions to deal with conflits

77 MOMIS – 77 Query rewriting and execution Global Virtual View (GVV) query q 0 = scqG1  scqG2 Hotel s Service s Hotel s Service s single class query scqG1 single class query scqG2 L3scqG 1 L1scqG 2 L2scqG 1 L1scqG 1 L2scqG 2 VENERE hotel s facilitie s map_hotel s Local Schem a SAPERVIAGGIARE hotel s Local Schem a GUIDACAMPEGGI facilitie s map_hotelshotel s facilitie s hotel s facilitie s Local Schem a Query execution on the local sources

78 MOMIS – 78 Query rewriting SELECT H.name, H.address, H.city, H.price, S.facility, S.structure_name, S.structure_city FROM hotels as H, services as S WHERE H.city = S.structure_city and H.name = S.structure_name and H.city = 'rimini‘ and H.price > 50 and H.price < 80 and S.facility = 'air conditioning' order by H.price q 0 = SELECT H.name, H.address, H.city, H.price FROM hotels as H WHERE (H.city = 'rimini' ) and (H.price > 50) and (H.price < 80) scqG1 = SELECT S.facility, S.structure_name, S.structure_city FROM services as S WHERE (S.facility = 'air conditioning') scqG2 = SELECT hotels.name2, hotels.address, hotels.price, hotels.city FROM hotels WHERE ((city) = ('rimini') and ((price) > (50) and (price) < (80))) L3scqG1 = SELECT maps_hotels.hotels_name2, maps_hotels.hotels_city FROM maps_hotels WHERE (hotels_city) = ('rimini') L2scqG1 = SELECT hotels.name, hotels.address, hotels.city FROM hotels WHERE (city) = ('rimini') L1scqG1 = SELECT facilities_hotels.hotel_name2, facilities_hotels.hotels_city, facilities_hotels.facility FROM facilities_hotels WHERE (facility) = ('air conditioning') L1scqG2 = SELECT facilities_campings.campings_name, facilities_campings.campings_city, facilities_campings.name FROM facilities_campings WHERE (name) = ('air conditioning') L2scqG2 = SINGLE CLASS QUERIES UNFOLDING

79 MOMIS – 79 Fusion and Reconciliation L3scqG1 result set L1scqG2 result set L2scqG1 result set L1scqG1 result set L2scqG2 result set VENERE SAPERVIAGGIAREGUIDACAMPEGGI partial results full join full join L1scqG1 full join L2scqG1 full join L3scqG1 full join L1scqG2 full join L2scqG2 scqG1 result set scqG2 result set q 0 result set join scqG1 join scqG2

80 MOMIS – 80 Fusion and Reconciliation scqG2 result set= L1scqG2 full join L2scqG2 guidacampeggi.facilities full outer join venere.facilities on ( (venere.facilities.facility) = (guidacampeggi.facilities.name) AND (venere.facilities.hotels_city) = (guidacampeggi.facilities.campings_city) AND (venere.facilities.hotel_name2) = (guidacampeggi.facilities.campings_name)) scqG1 result set = L1scqG1 full join L2scqG1 full join L3scqG1 saperviaggiare.hotels full outer join venereEn.hotels on ( ((venereEn.hotels.name2) = (saperviaggiare.hotels.name) AND (venereEn.hotels.city) = (saperviaggiare.hotels.city))) full outer join venereEn.maps_hotels on ( ((venereEn.maps_hotels.hotels_name2) = (saperviaggiare.hotels.name) AND (venereEn.maps_hotels.hotels_city) = (saperviaggiare.hotels.city)) OR ((venereEn.maps_hotels.hotels_name2) = (venereEn.hotels.name2) AND (venereEn.maps_hotels.hotels_city) = (venereEn.hotels.city))) SELECT H.name, H.address, H.city, H.price, S.facility, S.structure_name, S.structure_city FROM hotels as H, facilities as S WHERE (H.city = S.structure_city ) AND (H.name = S.structure_name ) ORDER BY H.price ASC q 0 = scqG1 result set join scqG2 result set

81 MOMIS – 81 Query Execution The MOMIS Query Manager at work


Download ppt "1 Corso di Rappresentazione della Informazione e della Conoscenza Anno Accademico 2007-2008 Matteo Palmonari Query Processing in Data Integration Materiale."

Similar presentations


Ads by Google