1 Oblivious Querying of Data with Irregular Structure.

1 1 Oblivious Querying of Data with Irregular Structure

2 2 Based on Several Works
Queries with Incomplete Answers –Yaron Kanza, Werner Nutt, Shuky Sagiv
Flexible Queries –Yaron Kanza, Shuky Sagiv
SQL4X –Sara Cohen, Yaron Kanza, Shuky Sagiv
Computing Full Disjunctions –Yaron Kanza, Shuky Sagiv

3 3 Agenda Why is it difficult to query semistructured data? Queries with incomplete answers (QwIA) Flexible queries (FQ) Oblivious querying = QwIA + FQ Using QwIA and FQ for information integration

5 5 The Semistructured Data Model Data is described as a rooted labeled directed graph Nodes represent objects Edges represent relationships between objects Atomic values are attached to atomic nodes
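
The data model above can be sketched in a few lines of Python. This is only an illustration: the class name `Database`, its methods, and the object ids used below are assumptions made for the example, not part of the original work.

```python
from collections import defaultdict


class Database:
    """A rooted, labeled, directed graph with atomic values on some nodes."""

    def __init__(self, root):
        self.root = root
        self.edges = defaultdict(list)  # source oid -> list of (label, target oid)
        self.values = {}                # atomic node oid -> atomic value

    def add_edge(self, source, label, target):
        self.edges[source].append((label, target))

    def set_value(self, oid, value):
        self.values[oid] = value

    def children(self, oid, label):
        """Targets of the edges that leave `oid` and carry `label`."""
        return [t for (l, t) in self.edges[oid] if l == label]


# A hypothetical fragment of a movie database (the oids are made up).
db = Database(root=1)
db.add_edge(1, "Movie", 11)
db.add_edge(11, "Title", 23)
db.add_edge(11, "Year", 24)
db.add_edge(11, "Actor", 21)
db.add_edge(21, "Name", 30)
db.set_value(23, "Star Wars")
db.set_value(24, 1977)
db.set_value(30, "Mark Hamill")

# Follow Movie edges from the root, then Title edges, and read the atomic values.
titles = [db.values[t]
          for m in db.children(db.root, "Movie")
          for t in db.children(m, "Title")]
print(titles)
```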

6 6 A Movie Database Example [Figure: a rooted labeled graph of a movie database with numbered nodes. Edge labels: Movie, Film, T.V. Series, Actor, Title, Name, Year. Atomic values: Star Wars, Dune, Twin Peaks, Léon, Magnolia, 1977, 1984, Mark Hamill, Harrison Ford, Kyle MacLachlan, Natalie Portman.]

7 7 XML that Encodes the Semistructured Data [XML listing; the markup did not survive extraction. Remaining text content: Star Wars, 1977, Mark Hamill, Harrison Ford, …]

8 8 Consider a Query that Requests Movies, Actors that Acted in the Movies and the Movies’ Year of Release. What Should be the Form of the Query? [Figure: the movie database graph.]

9 9 Incomplete Data [Figure: the movie database graph, annotated to show one movie that has a year attribute and another movie whose year is missing.]

10 10 Variations in Structure [Figure: the movie database graph, annotated to show a movie below an actor in one part of the database and an actor below a movie in another part.]

11 11 Ontology Variations [Figure: the movie database graph, annotated to show a Movie label and a Film label.] Dealing with ontology variations is beyond the scope of this talk

12 12 Irregular Data Data is incomplete –Missing values of attributes in objects Data has structural variations –Relationships between objects are represented differently in different parts of the database Data has ontology variations –Different labels are used to describe objects of the same type

13 13 Irregular data does not conform to a strict schema. Queries over irregular data should not be rigid patterns. The schema cannot guide a user in formulating a query.

14 14 In Which Cases is it Difficult to Formulate Queries over Semistructured Data?
The description of the schema is large (e.g., a DTD of XML) –It is difficult to use the schema when formulating queries
Data is contributed by many users in a variety of designs –The query should deal with different structures of data
The structure of the database is changed frequently –Queries should be rewritten frequently

15 15 Can Regular Expressions Help in Querying Irregular Data? In many cases, regular expressions can be used to query irregular data Yet, regular expressions are –Not efficient – it is difficult to evaluate regular expressions –Not intuitive – it is difficult for a naïve user to formulate regular expressions

16 16 More on Using Regular Expressions When querying irregular data, the size of the regular expression could be exponential in the number of labels in the database –For n types of objects, there are n! possible hierarchies –For an object with n attributes, there are 2^n subsets of missing attributes
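
To give a feel for how quickly these two counts grow, the short sketch below tabulates n! and 2^n for a few values of n. The helper name `pattern_counts` is made up for this illustration.

```python
import math


def pattern_counts(n):
    """Return (number of hierarchies of n object types, number of subsets of n attributes)."""
    return math.factorial(n), 2 ** n


for n in (3, 5, 10):
    hierarchies, subsets = pattern_counts(n)
    print(f"n={n}: {hierarchies} possible hierarchies, {subsets} subsets of missing attributes")
```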

17 17 Agenda Why is it difficult to query semistructured data? Queries with incomplete answers (QwIA) Flexible queries (FQ) Oblivious querying = QwIA + FQ Using QwIA and FQ for information integration

18 18 Queries with Incomplete Answers We have developed queries that deal with incomplete data in a novel way and return incomplete answers The queries return maximal answers rather than complete answers Different query semantics admit different levels of incompleteness

19 19 Queries with Incomplete Answers
In order of increasing level of incompleteness:
–Queries with complete answers
–Queries with AND Semantics
–Queries with Weak Semantics
–Queries with OR Semantics

20 20 Queries and Matchings The queries are labeled rooted directed graphs Query nodes are variables Matchings are assignments of database objects to the query variables according to –the constraints specified in the query, and –the semantics of the query

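21 21 Constraints On Complete Matchings
Root Constraint: Satisfied if the query root is mapped to the db root
Edge Constraint: Satisfied if a query edge with label l is mapped to a database edge with label l
[Figure: the query root r is mapped to the database root 1; a query edge from x to y with label l is mapped to a database edge from Node 12 to Node 25 with label l.]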

22 22 A Complete Matching [Figure: a movie database graph (edge labels Movie, Actor, Uncredited Actor, Director, Producer, Name, Title, Year, Date of birth) and a query graph with variables r, x, y, z, u, v, each mapped to a database node.] All the nodes are mapped to non-null values. The root constraint and all the edge constraints are satisfied

23 23 No Complete Matching Exists! Consider the case where Node 35 is removed from the database [Figure: the same database and query, with Node 35 (Date of birth, 14 May 1944) removed.]

24 24 Not Every Partial Assignment is an Incomplete Matching [Figure: the same database and query, with a partial assignment in which v is mapped to Node 31 and u is NULL.] This is not a matching, since the sequence of labels from the database root to Node 31 is different from any sequence of labels that starts at the query root and ends in variable v

25 25 The Reachability Constraint on Partial Matchings A query node v that is mapped to a database object o satisfies the reachability constraint if there is a path from the query root to v, such that all edge constraints along this path are satisfied [Figure: a query with nodes r, x, y, z, w, v and edge labels l1 to l6, a database, and the alternative paths from the query root to v.]

26 26 “And” Matchings A partial matching is an AND matching if –The root constraint is satisfied –The reachability constraint is satisfied by every query node that is mapped to a database node –If a query node is mapped to a database node, all the incoming edge constraints are satisfied [Figure: a query with root r, variables x, y, z and edge labels Director, Actor, Producer.]

27 27 An AND Matching [Figure: the movie database and the query with variables r, x, y, z, u, v; u is mapped to NULL.]

28 28 Suppose that we remove the edges that are labeled with Uncredited Actor. In an AND matching, Node z must be null! [Figure: the movie database and the query with variables r, x, y, z, u, v.]

29 29 Weak Satisfaction of Edge Constraints Edge Constraint: Is weakly satisfied if either –It is satisfied (as defined earlier), or –One (or more) of its nodes is mapped to a null value [Figure: the cases in which a query edge from x to y with label l is weakly satisfied.]

30 30 Weak Matchings A partial matching is a weak matching if –The root constraint is satisfied –The reachability constraint is satisfied by every query node that is mapped to a database node –Every edge constraint is weakly satisfied

31 31 A Weak Matching [Figure: the movie database and the query, with a partial matching that contains null values; the edges that are weakly satisfied are marked.]

32 32 [Figure: four ways of mapping a query edge from x to y with label l, some of them with null endpoints.] In a weak matching, all four options are permitted. In an AND matching, only the first three options are permitted

33 33 Consider the case where edges labeled with Producer are removed. In a weak matching, Node z must be null! [Figure: the movie database and the query with variables r, x, y, z, u, v.]

34 34 “OR” Matchings A partial matching is an OR matching if –The root constraint is satisfied –The reachability constraint is satisfied by every query node that is mapped to a database node

35 35 An OR Matching [Figure: the movie database and the query, with a partial matching that contains null values; one edge is marked as not weakly satisfied.]

36 36 Increasing Level of Incompleteness A complete matching is an AND matching An AND matching is a weak matching A weak matching is an OR matching
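
The hierarchy of semantics can be made concrete with a small checker that returns the strictest class a given assignment belongs to. This is a minimal sketch built directly from the definitions on the preceding slides; the function names, the encoding of edges as (source, label, target) triples, the use of `None` for null, and the toy query and database are all assumptions made for illustration.

```python
def edge_satisfied(db_edges, m, edge):
    """Both endpoints are non-null and the database has the corresponding labeled edge."""
    x, label, y = edge
    return (m.get(x) is not None and m.get(y) is not None
            and (m[x], label, m[y]) in db_edges)


def reachable_nodes(query_edges, query_root, db_edges, m):
    """Query nodes that can be reached from the query root along satisfied edges only."""
    reached = {query_root}
    changed = True
    while changed:
        changed = False
        for edge in query_edges:
            x, _, y = edge
            if x in reached and y not in reached and edge_satisfied(db_edges, m, edge):
                reached.add(y)
                changed = True
    return reached


def classify(query_nodes, query_edges, query_root, db_edges, db_root, m):
    """Return 'complete', 'AND', 'weak' or 'OR' (the strictest class), or None if not a matching."""
    if m.get(query_root) != db_root:
        return None  # root constraint violated
    mapped = {v for v in query_nodes if m.get(v) is not None}
    if not mapped <= reachable_nodes(query_edges, query_root, db_edges, m):
        return None  # reachability constraint violated
    sat = {e: edge_satisfied(db_edges, m, e) for e in query_edges}
    if mapped == set(query_nodes) and all(sat.values()):
        return "complete"
    # AND: every edge that enters a non-null node is satisfied.
    if all(sat[e] for e in query_edges if m.get(e[2]) is not None):
        return "AND"
    # Weak: every edge is satisfied or has a null endpoint.
    if all(sat[e] or m.get(e[0]) is None or m.get(e[2]) is None for e in query_edges):
        return "weak"
    return "OR"


# A hypothetical DAG query and database.
nodes = ["r", "x", "y", "z"]
q = [("r", "a", "x"), ("r", "b", "y"), ("x", "c", "z"), ("y", "d", "z")]
db = {(1, "a", 2), (1, "b", 4), (2, "c", 3)}

print(classify(nodes, q, "r", db, 1, {"r": 1, "x": 2, "y": None, "z": 3}))
print(classify(nodes, q, "r", db, 1, {"r": 1, "x": 2, "y": 4, "z": 3}))
```

In the first assignment z is non-null but its incoming edge from y has a null source, so the assignment is a weak matching and not an AND matching. In the second, the edge from y to z has two non-null endpoints and no database counterpart, so only the OR semantics accepts it.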

37 37 Maximal Matchings
Matchings are represented as tuples of oids and null values
A tuple t1 subsumes a tuple t2 if t1 is the result of replacing some null values in t2 by non-null values, e.g., t1 = (1, 5, 2, null) subsumes t2 = (1, null, 2, null)
A matching is maximal if no other matching subsumes it
A query result consists of maximal matchings only
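
Subsumption and the maximality filter translate almost word for word into code. The sketch below assumes that matchings are Python tuples with `None` standing for null; the function names and the third example tuple are made up for illustration.

```python
def subsumes(t1, t2):
    """t1 subsumes t2 if t1 is t2 with some (at least one) null replaced by a non-null value."""
    return (len(t1) == len(t2) and t1 != t2
            and all(b is None or a == b for a, b in zip(t1, t2)))


def maximal(matchings):
    """Keep only the matchings that no other matching subsumes (duplicates are dropped)."""
    unique = list(dict.fromkeys(matchings))
    return [t for t in unique if not any(subsumes(o, t) for o in unique)]


t1 = (1, 5, 2, None)
t2 = (1, None, 2, None)
t3 = (1, None, 3, 7)   # a hypothetical third matching

print(subsumes(t1, t2))
print(maximal([t1, t2, t3]))
```

This filter compares every pair of matchings, so it is quadratic in the number of candidates; it illustrates the definition and is not meant as an evaluation algorithm.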

38 38 On the Complexity of Computing Queries with Incomplete Answers The size of the result can be exponential in the size of the input (database and query) –Note that the same is true when joining relations – the size of the result can be exponential in the size of the input (database and query) Instead of using data complexity (where the runtime depends only on the size of the database), we use input-output complexity

39 39 Input-Output Complexity In input-output complexity, the time complexity is a function of the size of the query, the size of the database, and the size of the result.

40 40 The Motivation for Using I/O Complexity Measuring the time complexity with respect to the size of the input does not distinguish between the following two cases: –An algorithm that does an exponential amount of work simply because the size of the output is exponential in the size of the input –An algorithm that does an exponential amount of work even when the query result is small; in this case, either the algorithm is naïve (e.g., it unnecessarily computes subsumed matchings) or the problem is hard

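41 41 I/O Complexity of Query Evaluation (lower bounds are for non-emptiness)
Query columns, in the order of the slide: Cyclic Query, DAG Query, Tree Query, Path Query (each row lists its cells in that order; a cell may span several columns)
–Complete semantics: NP-Complete, PTIME
–AND semantics: NP-Complete, PTIME
–Weak semantics: PTIME
–OR semantics: PTIME
Recent Results (PODS’03)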

42 42 Filter Constraints Constraints that filter the results (i.e., the maximal matchings) There are –Weak filter constraints (the constraint is satisfied if a variable in the constraint is null) –Strong filter constraints (all variables must be non-null for satisfaction) Existence constraint: !x is true if x is not null

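43 43 I/O Complexity of Query Evaluation with Existence Constraints (lower bounds are for non-emptiness)
Query columns, in the order of the slide: Cyclic Query, DAG Query, Tree Query, Path Query (each row lists its cells in that order; a cell may span several columns)
–Complete semantics: NP-Complete, PTIME
–AND semantics: NP-Complete, PTIME
–Weak semantics: NP-Complete, PTIME
–OR semantics: NP-Complete, PTIME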

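44 44 I/O Complexity of Query Evaluation with Weak Equality/Inequality Constraints (lower bounds are for non-emptiness)
Query columns, in the order of the slide: Cyclic Query, DAG Query, Tree Query, Path Query (each row lists its cells in that order; a cell may span several columns)
–Strong semantics: NP-Complete, PTIME
–AND semantics: NP-Complete, PTIME
–Weak semantics: NP-Complete, PTIME
–OR semantics: NP-Complete, PTIME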

45 45 Query Containment Query containment for queries with incomplete answers is defined differently from query containment for queries with complete answers. Q1 ⊆ Q2 if for every database D, every matching of Q1 w.r.t. D is subsumed by a matching of Q2 w.r.t. D. Query containment (query equivalence) is useful for the development of optimization techniques

46 46 Containment in AND Semantics Homomorphism between the query graphs is necessary and sufficient for containment [Figure: two queries Q1 and Q2 with edge labels l1 to l4 and a homomorphism between them, illustrating Q1 ⊆ Q2.] Deciding whether one query is contained in another is NP-Complete

47 47 Containment in OR Semantics The following is a necessary and sufficient condition for query containment in OR semantics: for every spanning tree T1 of the contained query, there is a spanning tree T2 of the containing query, such that there is a homomorphism from T2 to T1. Deciding containment –is in Π^P_2 –is NP-Complete if the containee is a tree –is polynomial if the container is a tree

48 48 Containment in Weak Semantics Similar to containment in OR Semantics, with the following difference Instead of checking homomorphism between spanning trees, we check homomorphism between graph fragments –A graph fragment is a restriction of the query to a subset of the variables that includes the query root such that every node in the fragment is reachable from the root

49 49 Agenda Why is it difficult to query semistructured data? Queries with incomplete answers (QwIA) Flexible queries (FQ) Oblivious querying = QwIA + FQ Using QwIA and FQ for information integration

50 50 Flexible Queries To deal with structural variations in the data, we have developed flexible queries

51 51 Flexible Queries
In order of increasing level of flexibility:
–Rigid Queries
–Semiflexible Queries
–Flexible Queries

52 52 Movie Database Example [Figure: a query with root r, variables x, y, z and edge labels Actor and Movie.] A query that finds all pairs of actors that acted in the same movie. However, if actors are descendants of movies in the database, the query has to be reformulated. Instead, we propose new ways of matching queries to databases

53 53 Rigid matchings and complete matchings are the same. Returning rigid matchings is the usual semantics for queries (e.g., XQuery, Lorel, XML-QL, etc.)

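54 54 Constraints On Rigid Matchings
Root Constraint: Satisfied if the query root is mapped to the db root
Edge Constraint: Satisfied if a query edge with label l is mapped to a database edge with label l
[Figure: the query root r is mapped to the database root 1; a query edge from x to y with label l is mapped to a database edge from Node 12 to Node 25 with label l.]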

55 55 A Rigid Matching [Figure: the movie database graph and a query with variables r, x, y and edge labels Actor and Movie. The assignment 1, 14, 29 is marked as a rigid matching; the assignment 1, 25, 12 is marked as not a rigid matching.]

56 56 A Semiflexible Matching
–The query root is mapped to the db root
–A query node with an incoming label l is mapped to a db node with an incoming label l
–The image of every query path is embedded in some database path
–A strongly connected component (SCC) is mapped to an SCC
[Figure: the query root r is mapped to the db root 1; a query edge with label l between x and y is shown against db nodes 9 and 11.]

57 57 A Semiflexible Matching
–The query root is mapped to the db root
–A query node with an incoming label l is mapped to a db node with an incoming label l
–The image of every query path is embedded in some database path
–A strongly connected component (SCC) is mapped to an SCC
The last two conditions cannot be verified locally, i.e., by considering one query edge at a time
[Figure: the query root r is mapped to the db root 1; a query edge with label l between x and y is shown against db nodes 9 and 11.]
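
A naive check of these conditions can be sketched as follows, under the assumption that both the query and the database are DAGs (so the SCC condition is vacuous) and that every query node is mapped. In a DAG, a set of nodes lies on one path exactly when every two of them are related by reachability, which is what the sketch tests. It enumerates all root-to-leaf query paths, so it can take exponential time and is not the polynomial procedure referred to in these slides. All names and the toy data are made up for illustration.

```python
from itertools import combinations


def reaches(db_edges, a, b):
    """True if a == b or the database has a directed path from a to b."""
    seen, stack = set(), [a]
    while stack:
        n = stack.pop()
        if n == b:
            return True
        if n not in seen:
            seen.add(n)
            stack.extend(t for (s, _, t) in db_edges if s == n)
    return False


def query_paths(query_edges, node):
    """All paths from `node` to a leaf of a DAG query (possibly exponentially many)."""
    successors = [y for (x, _, y) in query_edges if x == node]
    if not successors:
        return [[node]]
    return [[node] + p for y in successors for p in query_paths(query_edges, y)]


def is_semiflexible(query_edges, query_root, db_edges, db_root, m):
    if m[query_root] != db_root:
        return False
    # A query node with incoming label l must map to a db node with an incoming label l.
    for (_, label, y) in query_edges:
        if not any(l == label and t == m[y] for (_, l, t) in db_edges):
            return False
    # The image of every query path must lie on one database path.
    for path in query_paths(query_edges, query_root):
        image = [m[v] for v in path]
        for a, b in combinations(image, 2):
            if not (reaches(db_edges, a, b) or reaches(db_edges, b, a)):
                return False
    return True


# Hypothetical database: an actor below a movie (11 -> 21) and a movie below an actor (14 -> 29).
db = [(1, "Movie", 11), (11, "Actor", 21), (1, "Actor", 14), (14, "Movie", 29)]
query = [("r", "Movie", "x"), ("x", "Actor", "y")]

print(is_semiflexible(query, "r", db, 1, {"r": 1, "x": 11, "y": 21}))  # also a rigid matching
print(is_semiflexible(query, "r", db, 1, {"r": 1, "x": 29, "y": 14}))  # movie below actor
print(is_semiflexible(query, "r", db, 1, {"r": 1, "x": 11, "y": 14}))  # unrelated movie and actor
```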

58 58 The Semiflexible Matchings [Figure: the movie database graph and the query with variables r, x, y and edge labels Actor and Movie; the matchings listed include 1, 14, 29 and 1, 25, 12.] We get all the actor-movie pairs

59 59 [Figure: two queries with variables r, x, y and edge labels Actor and Movie.] Under semiflexible semantics, these two queries are equivalent. The user does not have to know if movies are above or below actors in the database

60 60 Another Example of a Semiflexible Matching [Figure: the movie database graph and a dag query with variables r, x, y, z and edge labels Actor and Movie.] We get pairs of actors that acted in the same movie. It is impossible to get this pair by means of a rigid matching, since the query is a dag and the db is a tree

61 61 A Flexible Matching
–The query root is mapped to the db root
–A query node with an incoming label l is mapped to a db node with an incoming label l
–An edge is mapped to two nodes on one path
Notice that a path in the query is not necessarily mapped to a path in the db
[Figure: the query root r is mapped to the db root 1; a query edge with label l between x and y is mapped to db nodes 9 and 11.]

62 62 An Example of a Flexible Query
Query variables below the root r (incoming edge label: meaning):
–x (Director): a director
–y (Name): the director name
–z (Movie): a movie of the director
–v (Title): the title of the movie
–u (Actor): an actor in the movie
–w (Name): the name of the actor

63 63 [Figure: the movie database and the flexible query with variables r, x, y, z, v, u, w.] A query edge is mapped to two db nodes on one path. This flexible matching is neither a rigid matching nor a semiflexible matching

64 64 Why are semiflexible matchings sometimes preferred to flexible matchings? [Figure: the movie database and a query with variables r, x, y and edge labels Movie and Producer.] In this flexible matching, a producer is given with a movie that he directed but did not produce

65 65 [Figure: the movie database and the query with variables r, x, y and edge labels Movie and Producer.] In semiflexible semantics, the problem is solved since the image of a query path is embedded in a database path

66 66 Differences Between the Semiflexible and Flexible Semantics On a technical level, in flexible matchings –Query paths are not necessarily embedded in database paths –SCC’s are not necessarily mapped to SCC’s On a conceptual level, in the semiflexible semantics, nodes are “semantically related” if they are on the same path, and hence –Query paths are embedded in database paths In the flexible semantics, this condition is relaxed: –Query edges are embedded in database paths

67 67 Increasing Level of Flexibility A rigid matching is a semiflexible matching A semiflexible matching is a flexible matching

68 68 Verifying that Mappings are Semiflexible Matchings Is a given mapping of query nodes to database nodes a semiflexible matching? –Not as simple as for rigid matchings (no local test, i.e., need to consider paths rather than edges) In a dag query, the number of paths may be exponential –Yet, verifying is in polynomial time In a cyclic query, the number of paths may be infinite –Yet, verifying is in exponential time

69 69 Verifying that a Mapping is a Semiflexible Matching
Query columns, in the order of the slide: Cyclic Query, DAG Query, Tree Query, Path Query (each row lists its cells in that order; a cell may span several columns)
–Path Database: No matchings, PTIME
–Tree Database: No matchings, PTIME
–DAG Database: No matchings, PTIME
–Cyclic Database: coNP, PTIME

70 70 Input-Output Complexity of Query Evaluation for the Semiflexible Semantics Next slide summarizes results about the input-output complexity –Polynomial for a dag query and a tree database (or simpler cases) Rather difficult to prove, even when the query is a tree, since there is no local test for verifying that mappings are semiflexible matchings –Exponential lower bounds for other cases

71 71 I/O Complexity for SF Semantics (lower bounds are for non-emptiness)
Query columns, in the order of the slide: Cyclic Query, DAG Query, Tree Query, Path Query (each row lists its cells in that order; a cell may span several columns)
–Path Database: Result is empty, PTIME
–Tree Database: Result is empty, PTIME
–DAG Database: Result is empty, NP-Complete
–Cyclic Database: NP-Hard (in Σ^P_2), NP-Hard (in Σ^P_2), NP-Complete
Data Complexity is Polynomial in all Cases

72 72 Query Evaluation for the Flexible Semantics The database is replaced with a relationship graph which is a graph, such that –The nodes are the nodes of the database –Two nodes are connected by an edge if there is a path between them in the database (the direction of the path is unimportant) The query is evaluated under rigid semantics w.r.t. the relationship graph
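
The construction of the relationship graph can be sketched as follows. The sketch assumes the database is given as a list of (source, target) pairs with labels omitted, and it represents each undirected relationship edge as a `frozenset` of two nodes; both choices, the function names, and the toy database are illustrative assumptions.

```python
def descendants(edges, start):
    """All nodes reachable from `start` by a directed path of length at least one."""
    children = {}
    for s, t in edges:
        children.setdefault(s, []).append(t)
    seen, stack = set(), list(children.get(start, []))
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(children.get(n, []))
    return seen


def relationship_graph(nodes, edges):
    """Connect two distinct nodes if the database has a path between them in either direction."""
    rel = set()
    for a in nodes:
        for b in descendants(edges, a):
            if a != b:
                rel.add(frozenset((a, b)))
    return rel


# Hypothetical tree database: root 1, movies 11 and 12, actors 21 (under 11) and 22 (under 12).
nodes = [1, 11, 12, 21, 22]
edges = [(1, 11), (1, 12), (11, 21), (12, 22)]
rel = relationship_graph(nodes, edges)

print(frozenset((21, 1)) in rel)    # actor 21 and the root are on one path
print(frozenset((11, 22)) in rel)   # movie 11 and actor 22 are not
```

The sketch runs one traversal per node, which keeps the construction polynomial in the size of the database.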

73 73 I/O Complexity of Query Evaluation for the Flexible Semantics Results follow from a reduction to query evaluation under the rigid semantics Tree query –Input-Output complexity is polynomial DAG query –Testing for non-emptiness is NP-Complete

74 74 Query Containment Q1 ⊆ Q2 if for every database D, the set of matchings of Q1 w.r.t. D is contained in the set of matchings of Q2 w.r.t. D. We assume that –Both queries have the same set of variables

75 75 Complexity of Query Containment Under the semiflexible semantics, Q1 ⊆ Q2 iff the identity mapping from the variables of Q2 to the variables of Q1 is a semiflexible matching of Q2 w.r.t. Q1. Thus, containment is –in coNP when Q1 is a cyclic graph and Q2 is either a dag or a cyclic graph –in polynomial time in all other cases. Under the flexible semantics, query containment is always in polynomial time

76 76 Database Equivalence D1 and D2 are equivalent if for all queries Q, the set of matchings of Q w.r.t. D1 is equal to the set of matchings of Q w.r.t. D2. Both databases must have the same set of objects and the same root

77 77 Complexity of Database Equivalence For the semiflexible semantics, deciding equivalence of databases is –in polynomial time if both databases are dags –in coNP if one of the databases has cycles For the flexible semantics, deciding equivalence of databases is polynomial in all cases

78 78 Database Transformation [Figure: two databases rooted at MDB (Node 1) with Actor and Movie edges over Dustin Hoffman, Hook, Harrison Ford, Star Wars and Mark Hamill; the first is a DAG and the second is a tree.] A DAG has become a TREE! The databases are equivalent under both the flexible and semiflexible semantics

79 79 Transforming a Database into a Tree Reasons for transforming a database into an equivalent tree database: –Evaluation of queries over a tree database is more efficient –In a graphical user interface, it is easier to represent trees than DAGs or cyclic graphs –Storing the data in a serial form (e.g., XML) requires no references

80 80 Transformation into a Tree There are algorithms for –Testing if a database can be transformed into an equivalent tree database, and –Performing the transformation For the semiflexible semantics –The algorithms are polynomial For the flexible semantics –The algorithms are exponential

81 81 Implementing Flexible Queries Flexible queries were implemented in SQL4X In an SQL4X query, relations and XML documents are queried simultaneously A query result can be either a relation or an XML document

82 82 An SQL4X Query (a query under the Flexible Semantics)
QUERY AS RELATION
SELECT text(y) as director, text(v) as title
FROM x Director of 'MDB.xml', y Name of x, z Movie of x, v Title of z
[Figure: the query graph with root r and variables x, y, z, v; edge labels Director, Name, Movie, Title.]

83 83 An SQL4X Query (a query under the Flexible Semantics). Constraints can be added
QUERY AS RELATION
SELECT text(y) as director, text(v) as title
FROM x Director of 'MDB.xml', y Name of x, z Movie of x, v Title of x
WHERE text(v) = 'Star Wars'
[Figure: the query graph with root r and variables x, y, z, v; edge labels Director, Name, Movie, Title.]

84 84 An SQL4X Query (a query under the Flexible Semantics). Relations and XML Documents can be queried simultaneously
QUERY AS RELATION
SELECT text(x) as director, text(v) as title, Budget
FROM x Director of 'MDB.xml', y Name of x, z Movie of x, v Title of x, FilmBudgets
WHERE text(v) = FilmBudgets.Title
FilmBudgets: a relation with data about film budgets, with columns Title and Budget
[Figure: the query graph with root r and variables x, y, z, v; edge labels Director, Name, Movie, Title.]

85 85 Agenda Why is it difficult to query semistructured data? Queries with incomplete answers (QwIA) Flexible queries (FQ) Oblivious querying = QwIA + FQ Using QwIA and FQ for information integration

86 86 Combining the Paradigms In oblivious querying: –The user does not have to know where data is incomplete –The user does not have to know the exact structure of the data The paradigm of flexible queries and the paradigm of queries with incomplete answers should be combined

87 87 Flexible Queries with Incomplete Answers A flexible query w.r.t. a database is actually a rigid query w.r.t. the relationship graph Evaluating a query in AND-semantics (weak semantics, OR-Semantics) w.r.t. the relationship graph produces a flexible query that returns maximal answers rather than complete answers

88 88 Consider the case where Node 25 and Node 33 are removed [Figure: the movie database and the flexible query with variables r, x, y, z, v, u, w; Node 25 and Node 33 are shown removed.]

89 89 A Flexible matching which is also an incomplete (maximal) matching [Figure: the movie database without Node 25 and Node 33, and the flexible query with variables r, x, y, z, v, u, w; u and w are mapped to NULL.]

90 90 Agenda Why is it difficult to query semistructured data? Queries with incomplete answers (QwIA) Flexible queries (FQ) Oblivious querying = QwIA + FQ Using QwIA and FQ for information integration

91 91 Full Disjunction Intuitively, the full disjunction of a given set of relations is the join of these relations that does not discard dangling tuples Dangling tuples are padded with nulls Only maximal tuples are retained in the full disjunction (as in the case of QwIA)

92 92 The Full Disjunction of the Given Relations
Movies
m-id | title | year | language
1 | Zelig | 1983 | English
2 | Antz | 1998 | English
3 | Armageddon | 1998 | English
4 | Fantasia | 1940 | English
Actors
a-id | name | date-of-birth
1 | Woody Allen | 1/12/1935
2 | Bruce Willis | 19/3/1955
3 | Julia Roberts | 28/10/1967
Acted-in
a-id | m-id | role
1 | 1 | Zelig
1 | 2 | Z
2 | 3 | Harry
Actors-that-Directed
a-id | m-id
1 | 1
The full disjunction (⊥ denotes a null value)
m-id | title | year | language | a-id | name | date-of-birth | role
1 | Zelig | 1983 | English | 1 | Woody Allen | 1/12/1935 | Zelig
2 | Antz | 1998 | English | 1 | Woody Allen | 1/12/1935 | Z
3 | Armageddon | 1998 | English | 2 | Bruce Willis | 19/3/1955 | Harry
4 | Fantasia | 1940 | English | ⊥ | ⊥ | ⊥ | ⊥
⊥ | ⊥ | ⊥ | ⊥ | 3 | Julia Roberts | 28/10/1967 | ⊥

93 93 The Full Disjunction of the Given Relations
[Tables: the Movies relation and the full disjunction, as on slide 92.]
The full disjunction does not include subsumed tuples
This tuple will not be in the full disjunction (⊥ denotes a null value):
m-id | title | year | language | a-id | name | date-of-birth | role
1 | Zelig | 1983 | English | ⊥ | ⊥ | ⊥ | ⊥

94 94 The Full Disjunction of the Given Relations
[Tables: the four input relations and the full disjunction, as on slide 92.]
The full disjunction does not include tuples that are based on Cartesian Product rather than join, such as the following tuple (⊥ denotes a null value):
m-id | title | year | language | a-id | name | date-of-birth | role
4 | Fantasia | 1940 | English | 3 | Julia Roberts | 28/10/1967 | ⊥

95 95 In the Full Disjunction of a Given Set of Relations: Every tuple of the input is a part of at least one tuple of the output. Tuples are joined as in a natural join, padded with null values. The result includes only “maximal connected portions”.
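
The three properties above can be checked on the movie example with a brute-force sketch in Python. This is not the algorithm from the talk: it enumerates every combination of tuples and therefore takes exponential time. The function names, the dictionary representation of tuples, and the use of `None` for the null value are assumptions made for illustration.

```python
from itertools import combinations, product

NULL = None  # plays the role of the null padding

movies = [
    {"m-id": 1, "title": "Zelig", "year": 1983, "language": "English"},
    {"m-id": 2, "title": "Antz", "year": 1998, "language": "English"},
    {"m-id": 3, "title": "Armageddon", "year": 1998, "language": "English"},
    {"m-id": 4, "title": "Fantasia", "year": 1940, "language": "English"},
]
actors = [
    {"a-id": 1, "name": "Woody Allen", "date-of-birth": "1/12/1935"},
    {"a-id": 2, "name": "Bruce Willis", "date-of-birth": "19/3/1955"},
    {"a-id": 3, "name": "Julia Roberts", "date-of-birth": "28/10/1967"},
]
acted_in = [
    {"a-id": 1, "m-id": 1, "role": "Zelig"},
    {"a-id": 1, "m-id": 2, "role": "Z"},
    {"a-id": 2, "m-id": 3, "role": "Harry"},
]
directed = [{"a-id": 1, "m-id": 1}]


def join_consistent(s, t):
    """True if s and t agree on every attribute they share."""
    return all(s[a] == t[a] for a in s.keys() & t.keys())


def connected(tuples):
    """True if the tuples form one component, linking tuples that share an attribute."""
    seen, frontier = {0}, [0]
    while frontier:
        i = frontier.pop()
        for j, t in enumerate(tuples):
            if j not in seen and tuples[i].keys() & t.keys():
                seen.add(j)
                frontier.append(j)
    return len(seen) == len(tuples)


def full_disjunction(relations):
    attrs = sorted({a for rel in relations for t in rel for a in t})
    candidates = []
    # Pick one tuple from each relation of every non-empty subset of relations.
    for k in range(1, len(relations) + 1):
        for rels in combinations(relations, k):
            for combo in product(*rels):
                if connected(combo) and all(
                    join_consistent(s, t) for s, t in combinations(combo, 2)
                ):
                    merged = dict.fromkeys(attrs, NULL)  # pad with nulls
                    for t in combo:
                        merged.update(t)
                    candidates.append(merged)

    def subsumed_by(small, big):
        return small != big and all(
            small[a] is NULL or small[a] == big[a] for a in attrs
        )

    # Keep only maximal tuples, without duplicates.
    result = []
    for c in candidates:
        if c not in result and not any(subsumed_by(c, o) for o in candidates):
            result.append(c)
    return result


fd = full_disjunction([movies, actors, acted_in, directed])
for row in fd:
    print(row)
```

The connectivity test is what rules out the Cartesian-product tuple that pairs Fantasia with Julia Roberts, and the subsumption test is what removes the padded Zelig tuple.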

96 96 Motivation for Full Disjunctions: Full disjunctions were proposed by Galindo-Legaria as an alternative to outerjoins [SIGMOD’94]. Rajaraman and Ullman suggested using full disjunctions for information integration [PODS’96].

97 97 Computing Full Disjunctions for γ-acyclic Relation Schemas: Rajaraman and Ullman have shown how to evaluate the full disjunction by a sequence of natural outerjoins when the relation schemas are γ-acyclic. Hence, the full disjunction can be computed in polynomial time under input-output complexity (that is, polynomial in the combined size of the input and the output) when the relation schemas are γ-acyclic.
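
As a rough illustration of the outerjoin approach, the Python sketch below folds a natural full outerjoin over Actors, Acted-in and Movies. Actors-that-Directed is left out so that the three schemas form a simple chain, which is assumed here to be γ-acyclic. The function `natural_full_outerjoin`, the `(attributes, rows)` representation and the use of `None` for nulls are assumptions of this sketch, and the join order is simply the order along the chain, not a rule taken from Rajaraman and Ullman.

```python
NULL = None  # plays the role of the null padding


def natural_full_outerjoin(left, right):
    """Natural full outerjoin of two relations given as (attribute list, rows)."""
    l_attrs, l_rows = left
    r_attrs, r_rows = right
    shared = [a for a in l_attrs if a in r_attrs]
    out_attrs = l_attrs + [a for a in r_attrs if a not in l_attrs]
    padding = dict.fromkeys(out_attrs, NULL)
    out_rows, matched_right = [], set()
    for l in l_rows:
        matched = False
        for i, r in enumerate(r_rows):
            # A null on the left never matches anything, as in SQL.
            if all(l[a] is not NULL and l[a] == r[a] for a in shared):
                out_rows.append({**l, **r})
                matched_right.add(i)
                matched = True
        if not matched:
            out_rows.append({**padding, **l})  # dangling left tuple
    for i, r in enumerate(r_rows):
        if i not in matched_right:
            out_rows.append({**padding, **r})  # dangling right tuple
    return out_attrs, out_rows


actors = (["a-id", "name", "date-of-birth"], [
    {"a-id": 1, "name": "Woody Allen", "date-of-birth": "1/12/1935"},
    {"a-id": 2, "name": "Bruce Willis", "date-of-birth": "19/3/1955"},
    {"a-id": 3, "name": "Julia Roberts", "date-of-birth": "28/10/1967"},
])
acted_in = (["a-id", "m-id", "role"], [
    {"a-id": 1, "m-id": 1, "role": "Zelig"},
    {"a-id": 1, "m-id": 2, "role": "Z"},
    {"a-id": 2, "m-id": 3, "role": "Harry"},
])
movies = (["m-id", "title", "year", "language"], [
    {"m-id": 1, "title": "Zelig", "year": 1983, "language": "English"},
    {"m-id": 2, "title": "Antz", "year": 1998, "language": "English"},
    {"m-id": 3, "title": "Armageddon", "year": 1998, "language": "English"},
    {"m-id": 4, "title": "Fantasia", "year": 1940, "language": "English"},
])

attrs, rows = natural_full_outerjoin(
    natural_full_outerjoin(actors, acted_in), movies
)
for row in rows:
    print(row)
```

On this data the two outerjoins yield five tuples: three joined tuples, Julia Roberts padded with nulls, and Fantasia padded with nulls.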

98 98 Weak Semantics Generalizes Full Disjunctions: Relations can be converted into a semistructured database. The full disjunction can be expressed as the union of several queries that are evaluated under weak semantics. We have developed an algorithm that uses this generalization to compute full disjunctions in polynomial time under input-output complexity, even when the relation schemas are cyclic.

99 99 Generalizing Full Disjunctions: In a full disjunction, tuples are joined according to equality constraints, as in a natural join (or equi-join). We can generalize full disjunctions to support constraints that are not merely equality among attributes.

100 100 Example
Movies (m-id, title, year, language, location)
Actors (a-id, name, date-of-birth)
Acted-in (a-id, m-id, role)
Actors-that-Directed (a-id, m-id)
Historical-Events (name, date, description)
Historical-Sites (Country, State, City, Site)
The date of the historical event is a date in the year when the movie was released. The filming location is near the historical site.
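
A non-equality constraint of this kind can be thought of as a predicate over a pair of tuples that replaces the equality test of a natural join. The Python sketch below shows only the first constraint (event date within the release year). The predicate name and the two events are hypothetical and made up purely for illustration. The "near" constraint is not sketched, because the slides do not define a distance measure.

```python
from datetime import date


def released_in_year_of(movie, event):
    """True if the event's date falls in the movie's release year."""
    return event["date"].year == movie["year"]


movie = {"m-id": 2, "title": "Antz", "year": 1998}

# Hypothetical events, made up purely for illustration.
events = [
    {"name": "event A", "date": date(1998, 3, 14), "description": "placeholder"},
    {"name": "event B", "date": date(1940, 6, 1), "description": "placeholder"},
]

# Only events that satisfy the constraint may be joined with the movie.
joinable = [e["name"] for e in events if released_in_year_of(movie, e)]
print(joinable)
```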

101 101 Another Way of Generalizing Full Disjunctions: Use OR-Semantics. OR-semantics is used rather than weak semantics when tuples are joined. This relaxes the requirement that every pair of tuples should be join consistent. Instead, a tuple of the full disjunction is only required to be generated by database tuples that form a connected subgraph, but need not be pairwise join consistent.

102 102 Example
Employees (e-id, ename, city, dept-no)
Departments (dept-no, dname, building)
Located-in (building, city, street)
Employee: (007, James Bond, London, 6)
Department: (6, MI-6, 10)
Located-in: (10, Liverpool, King)
The Full Disjunction:
e-id | ename | city | dept-no | dept-no | dname | building | city | street
007 | James Bond | London | 6 | 6 | MI-6 | 10 | ⊥ | ⊥
⊥ | ⊥ | ⊥ | ⊥ | 6 | MI-6 | 10 | Liverpool | King

103 103 Example
Employees (e-id, ename, city, dept-no)
Departments (dept-no, dname, building)
Located-in (building, city, street)
Employee: (007, James Bond, London, 6)
Department: (6, MI-6, 10)
Located-in: (10, Liverpool, King)
The Full Disjunction under OR-Semantics:
e-id | ename | city | dept-no | dept-no | dname | building | city | street
007 | James Bond | London | 6 | 6 | MI-6 | 10 | Liverpool | King
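
The difference between the two results in the James Bond example can be sketched in Python as two tests on the same three tuples. The function names are assumptions, and so is the reading of "connected subgraph" used here: an edge links two tuples that share at least one attribute and agree on all the attributes they share.

```python
from itertools import combinations

employee = {"e-id": "007", "ename": "James Bond", "city": "London", "dept-no": 6}
department = {"dept-no": 6, "dname": "MI-6", "building": 10}
located_in = {"building": 10, "city": "Liverpool", "street": "King"}


def agree(s, t):
    """True if s and t agree on every attribute they share."""
    return all(s[a] == t[a] for a in s.keys() & t.keys())


def pairwise_consistent(tuples):
    """Requirement of the ordinary full disjunction: every pair is join consistent."""
    return all(agree(s, t) for s, t in combinations(tuples, 2))


def connected_by_joins(tuples):
    """Assumed OR-semantics test: the tuples are connected through joinable pairs."""
    seen, frontier = {0}, [0]
    while frontier:
        i = frontier.pop()
        for j, t in enumerate(tuples):
            if j not in seen and tuples[i].keys() & t.keys() and agree(tuples[i], t):
                seen.add(j)
                frontier.append(j)
    return len(seen) == len(tuples)


all_three = [employee, department, located_in]
print(pairwise_consistent(all_three))  # False: London and Liverpool disagree on city
print(connected_by_joins(all_three))   # True: linked through dept-no and building
```

The employee and the location disagree on city, so the three tuples cannot form one tuple of the ordinary full disjunction, yet they are connected through the department, which is why OR-semantics combines them into a single tuple.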

104 104 Information Integration from Heterogeneous Sources [Diagram labels: Data Source; Query and Relation (three pairs); Integrated Relation]

105 105 [Diagram labels: Data Source; Query and Relation (three pairs); Integrated Relation] We use queries that combine flexible semantics and weak semantics: - The queries are insensitive to changes in the data - The queries are easy to formulate

106 106 [Diagram labels: Data Source; Query and Relation (three pairs); Integrated Relation] The integration of the relations is done with a full disjunction of the computed relations.

107 107 Conclusion: Flexible and semiflexible queries facilitate easy and intuitive querying of semistructured databases – Querying the database even when the user is oblivious to the structure of the database – Queries are insensitive to variations in the structure of the database

108 108 Conclusion (continued): Queries in AND semantics, OR semantics or weak semantics facilitate easy and intuitive querying of incomplete databases – Querying the database even when the user is oblivious to missing data – Queries return maximal answers rather than complete answers

109 109 Conclusion (continued): The two paradigms of flexible queries and queries with maximal answers can be combined. The combination of the paradigms can facilitate integration of information from heterogeneous sources.

110 110 Conclusion (continued): Full disjunctions can be computed using queries in weak semantics. Full disjunctions can be generalized so that relations are joined using constraints that are not merely equality constraints.

111 111 Thank You Questions?

