
1 1 25 Nov 2001, SDBI 2001. Sigalit Sima Sigalit Batiashvili Kdoshim. Representing & Querying XML With Incomplete Information. Serge Abiteboul, Luc Segoufin, Victor Vianu.

2 2 Goals
Motivation
- Data on the WEB
- Incompleteness problem
Representation System
Refine Algorithm
Querying The Incomplete Information
- CWA approach
- Answer using sub queries (CWA+OWA)

3 3 Introduction & Motivation

4 4 Data On The WEB
Partial information is known
- expiration of data
- unavailable sites
- modification of data, etc.
Irregular structure
- self-describing
Semistructured Data

5 5 Data On The Web
Problem: data no longer fits into tables (no rigid structure).
We want: to apply database-like functionality to access data on the WEB.
Focus: the XML-ized portion of the WEB.

6 6 XML: eXtensible Markup Language
The lingua franca of the WEB
Facilitates the use of database techniques to manage WEB data
Brings order
- nested tags (similar to record structure)
- ordered sub-elements
- structure (DTD, XML-Schema)
DTD (Document Type Definition): defines constraints on the XML document structure

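7 7 XML Example
<person> <name>John Smith</name> <addr>Green Field Park N.Y.</addr> <email>john.smith@infineon.com</email> </person>
The same record as a row of a Person table: name = John Smith, addr = Green.., email = john.smith@..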

8 8 View XML As Trees
[Figure: the person document drawn as a tree. The root person has children name (John Smith), addr (Green Field Park N.Y.) and email (john.smith@infineon.com).]
DTD: person -> name addr email*
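
The tree view of the person document can be sketched in a few lines of Python. This is only an illustration: the tag names come from the DTD on the slide, the standard library parser `xml.etree.ElementTree` is used, and the helper names `to_tree` and `show` are made up for this example.

```python
import xml.etree.ElementTree as ET

# Document reconstructed from the slide's example (tag names from its DTD).
DOC = """<person>
  <name>John Smith</name>
  <addr>Green Field Park N.Y.</addr>
  <email>john.smith@infineon.com</email>
</person>"""

def to_tree(element):
    """Return (label, value, children) for an element, recursively."""
    text = (element.text or "").strip()
    return (element.tag, text or None, [to_tree(child) for child in element])

def show(node, depth=0):
    """Render the tree as indented lines, one node per line."""
    label, value, children = node
    line = "  " * depth + label + (f" = {value}" if value else "")
    return [line] + [l for child in children for l in show(child, depth + 1)]

tree = to_tree(ET.fromstring(DOC))
print("\n".join(show(tree)))
```

Each element becomes a node whose label is the tag and whose value is the text it contains, which is the shape the data model of the talk builds on.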

9 9 Webhouse
Warehouse: a collection of information from many sources
Webhouse
- a collection of web site sources
- context: XML
- holds a DTD that describes the structure of the sources

10 10 Webhouse Maintenance
The webhouse is continuously enriched by exploring web sites.
Technique: crawling the web.

11 11 Webhouse
Information held in the webhouse is never complete. Why?
- Dynamic nature of WEB data
- Limited storage capacity
- Expiration of data
- Modification of data
- Etc.

12 12 The Problem
Posing a query against the webhouse may yield an incomplete answer:
- documents satisfying the query are missing from the webhouse
- the relevant data is missing from a document

13 13 Solution
Two main approaches:
Closed World Assumption (CWA): if some information does not appear explicitly, it does not hold.
- possible method: Best Effort
- possible method: Fetch Data
Open World Assumption (OWA): anything not ruled out is possible.

14 14 Solution Methods
Best Effort: answer according to the available information.
Fetch Data: search the sources for additional information to provide a complete answer.

15 15 Fetch Data
We use the Fetch Data approach.
We would like to be able to define what additional resources we are looking for. How?
Define the missing portion of the data using the available information, and thus determine which additional exploration of WEB sources is needed.

16 16 Example
Given the DTD:
catalog -> product+
product -> name price cat picture*
cat -> subcat
Query 1: Find the name, price & subcategories of electronics products with price < $200
Query tree: catalog / product / (name, price<200, cat=elec / subcat)

17 17 Answer to query 1
catalog
- product: Canon, 120, elec, camera
- product: Nikton, 199, elec, camera
- product: Sony, 175, elec, cdplayer

18 18 Given the DTD:
catalog -> product+
product -> name price cat picture*
cat -> subcat
Query 2: Find the name & pictures of all cameras with a picture
Query tree: catalog / product / (name, cat=elec / subcat=camera, picture)

19 19 Answer Strategy
Query 1 covers: elec products with price < $200
Query 2 asks for: cameras with a picture
We already have.. the overlap: cameras with a picture and price < $200

20 20 Answer Strategy (cont)
Query 1 covers: elec products with price < $200
Query 2 asks for: cameras with a picture
We need..: cameras with a picture and price >= $200
- No need to query the Web for the whole query
- Define the missing information
- Reduce the search space

21 21 Representation System

22 22 Framework Define the data model -for the webhouse repository (XML data) Define constraint model -simplified DTD Define query language Define the representation system for the incomplete information

23 23 Data Model
T_data = (t, λ, ν)
- t: rooted tree with node set N
- λ: N -> Σ, labeling function (Σ = node labels)
- ν: N -> Q, value mapping (Q = data values)
Example: a catalog with two product nodes, each with children name, price and cat / subcat, carrying the values (Canon, 120, elec, camera) and (Nikton, 199, elec, camera).

24 24 Data Tree Prefix T data =

25 25 Tree Type
T_type = (Σ, r, μ)
- Σ: element names
- r: root label (here: catalog)
- μ: the DTD as regular expressions, μ(a) = a1^w1 a2^w2 … an^wn
Each multiplicity wi is one of:
- wi = 1: exactly one child labeled ai
- wi = ?: at most one child labeled ai
- wi = +: at least one child labeled ai
- wi = *: zero or more children labeled ai

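26 26 Tree Type Satisfaction
T_type: catalog -> product+; product -> name price cat picture* (name, price and cat with multiplicity 1); cat -> subcat
T_data: catalog / product / (nikon, 199, cat=elec / subcat=camera, picture c.jpeg)
T_data satisfies T_type.
rep(T_type) = { T_data : T_data satisfies T_type }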
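
As a rough sketch of what "T_data satisfies T_type" means, the check below walks a data tree and compares child counts against the multiplicities 1, ?, + and *. The `Node` class, the dictionary encoding of the tree type and the function name `satisfies` are assumptions made for this example, not the notation of the paper; order of children and value conditions are ignored.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    label: str
    value: object = None
    children: list = field(default_factory=list)

# Tree type of the catalog example: label -> {child label: multiplicity}.
CATALOG_TYPE = {
    "catalog": {"product": "+"},
    "product": {"name": "1", "price": "1", "cat": "1", "picture": "*"},
    "cat": {"subcat": "1"},
}

# Multiplicity -> (minimum, maximum) number of children; None means unbounded.
BOUNDS = {"1": (1, 1), "?": (0, 1), "+": (1, None), "*": (0, None)}

def satisfies(node, tree_type, root_label=None):
    """Check the child multiplicities of every node against the tree type."""
    if root_label is not None and node.label != root_label:
        return False
    allowed = tree_type.get(node.label, {})  # labels without a rule are leaves
    counts = {}
    for child in node.children:
        if child.label not in allowed:
            return False
        counts[child.label] = counts.get(child.label, 0) + 1
    for label, multiplicity in allowed.items():
        low, high = BOUNDS[multiplicity]
        seen = counts.get(label, 0)
        if seen < low or (high is not None and seen > high):
            return False
    return all(satisfies(child, tree_type) for child in node.children)

good = Node("catalog", None, [Node("product", None, [
    Node("name", "nikon"), Node("price", 199),
    Node("cat", "elec", [Node("subcat", "camera")]),
    Node("picture", "c.jpeg"),
])])
empty = Node("catalog")  # violates product+

print(satisfies(good, CATALOG_TYPE, "catalog"))
print(satisfies(empty, CATALOG_TYPE, "catalog"))
```

The first tree is the one shown on the slide; the second fails because a catalog needs at least one product.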

27 27 Prefix-Selection Query We defined the structure of webhouse data using Tree Types. It is natural to define a pattern based query (tree format). The matching will thus be done by browsing the input tree. Such a query is called a PS query.

28 28 PS-Query Example
T_type: catalog -> product+; product -> name price cat picture*; cat -> subcat
T_query: catalog / product / (name, price<200, cat=elec / subcat)
The query tree combines data prefixes with constraints (=elec, <200).
T_query = q = (t, λ, cond)
- t: rooted tree
- λ: labeling function
- cond: constraints on node values, such as =elec and <200

29 29 PS-Query Answer Denoted q(t’) where t’ is the data tree. Consists of a prefix of this tree matching the corresponding query tree nodes.

30 30 Answer Example
Query: catalog / product / (name, price<200, cat=elec / subcat)
Answer:
- product: Canon, 120, elec, camera
- product: Nikton, 199, elec, camera
Note: q1(t'), q2(t') share tree data prefixes (root and maybe more)
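
The way a PS-query selects a prefix of the data tree can be sketched as a recursive match. This is a simplified reading of the slides, not the paper's formal definition: a data node is kept when its label and value fit the query node and every child of the query node is matched by at least one data child. The helpers `node`, `pattern`, `product` and `answer`, and the product named Zoom, are hypothetical.

```python
def node(label, value=None, *children):
    return {"label": label, "value": value, "children": list(children)}

def pattern(label, cond=None, *children):
    return {"label": label, "cond": cond, "children": list(children)}

def answer(data, query):
    """Return the prefix of `data` matched by `query`, or None if no match."""
    if data["label"] != query["label"]:
        return None
    if query["cond"] is not None and not query["cond"](data["value"]):
        return None
    kept = []
    for q_child in query["children"]:
        candidates = (answer(d, q_child) for d in data["children"])
        matches = [m for m in candidates if m is not None]
        if not matches:
            return None  # a required branch of the pattern is missing
        kept.extend(matches)
    return node(data["label"], data["value"], *kept)

def product(name, price, cat, subcat, *pictures):
    return node("product", None, node("name", name), node("price", price),
                node("cat", cat, node("subcat", subcat)),
                *[node("picture", p) for p in pictures])

CATALOG = node("catalog", None,
               product("Canon", 120, "elec", "camera", "c.jpg"),
               product("Nikton", 199, "elec", "camera"),
               product("Sony", 175, "elec", "cdplayer"),
               product("Zoom", 450, "elec", "camera"))  # hypothetical product

# Query 1: name, price and subcategory of elec products with price < 200.
QUERY_1 = pattern("catalog", None, pattern(
    "product", None,
    pattern("name"),
    pattern("price", lambda v: v < 200),
    pattern("cat", lambda v: v == "elec", pattern("subcat"))))

result = answer(CATALOG, QUERY_1)
for p in result["children"]:
    print([c["value"] for c in p["children"]])
```

The answer keeps only the nodes named in the pattern, so pictures are dropped even for products that have one, and the 450 product is excluded by the price condition.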

31 31 Incomplete Information
Data available: prefixes of the data tree, enriched by previous queries
Missing portion: simply define the missing information, using the initial T_type and the queries
Together these form the Incomplete Tree

32 32 Conditional Tree Type
Provides extensions to Tree Types:
1. A Tree Type with a condition function on the tree nodes.
2. Allows context-dependent structure definition. HOW?
Example tree type: dealer with UsedCars and NewCars, each with ad* children; an ad has model and year.
Corresponding DTD:
dealer -> UsedCars | NewCars
UsedCars -> ad*
NewCars -> ad*
ad -> model year | model

33 33 Specialization
[Figure: the dealer tree type before and after specialization. The label ad is replaced by adUsed under UsedCars and by adNew under NewCars.]
Σ = {dealer, UsedCars, NewCars, ad, model, year}
Σ' = Σ ∪ {adNew, adUsed}
σ: Σ' -> Σ

34 34 Specialization
[Figure: the dealer tree type before and after specialization. The label ad is replaced by adUsed under UsedCars and by adNew under NewCars.]
σ(adNew) = σ(adUsed) = ad
σ: Σ' -> Σ
CT type =
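
A small sketch of the specialization idea, assuming the reading that used-car ads carry model and year while new-car ads carry only a model (the slide does not state which is which, so the `RULES` table is a guess). The names `sigma`, `conforms` and `erase` are hypothetical; `sigma` plays the role of the mapping from the specialized alphabet back to the original one.

```python
SIGMA = {"dealer", "UsedCars", "NewCars", "ad", "model", "year"}
SPECIALIZED = {"adUsed": "ad", "adNew": "ad"}
SIGMA_PRIME = SIGMA | set(SPECIALIZED)

def sigma(label):
    """Specialization mapping: send each label of SIGMA_PRIME back to SIGMA."""
    return SPECIALIZED.get(label, label)

# Assumed rules over the specialized alphabet: which children a label may have.
RULES = {
    "UsedCars": ["adUsed"],
    "NewCars": ["adNew"],
    "adUsed": ["model", "year"],
    "adNew": ["model"],
}

def conforms(tree):
    """Check that every child label is allowed under its (specialized) parent."""
    label, children = tree
    allowed = RULES.get(label)
    if allowed is not None and any(c[0] not in allowed for c in children):
        return False
    return all(conforms(c) for c in children)

def erase(tree):
    """Replace each specialized label in a (label, children) tree by its image."""
    label, children = tree
    return (sigma(label), [erase(c) for c in children])

used = ("dealer", [("UsedCars", [("adUsed", [("model", []), ("year", [])])])])
bad_new = ("dealer", [("NewCars", [("adNew", [("model", []), ("year", [])])])])

print(conforms(used), conforms(bad_new))
print(erase(used))
```

After `erase`, both kinds of ad are plain ad nodes again, which is how one label can have different structure depending on its context.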

35 35 Incomplete Tree catalog product name price<200 cat=elec subcat Query 1: Find the name, price & subcategories of electronics products with price < $200 A tree representing the incomplete information.

36 36 Incomplete Tree (cont)
Query 1: Find the name, price & subcategories of electronics products with price < $200
Answer: catalog with products (Canon, 120, elec, camera), (Nikton, 199, elec, camera), (Sony, 175, elec, cdplayer)
What Is Missing??

37 37 What Is Missing?
product1*: name, cat!=elec / subcat, picture*, price. All products whose category is not electronics.
product2*: name, cat=elec / subcat, picture*, price>=200. All electronics products with price >= 200.

38 38 Incomplete Tree T
Available Information: prefix of a full data tree (T_data)
- catalog with products (Canon, 120, elec, camera), (Nikton, 199, elec, camera), (Sony, 175, elec, cdplayer)
Missing Information: conditional tree type (CT_type)
- product1*: name, cat!=elec / subcat, picture*, price
- product2*: name, cat=elec / subcat, picture*, price>=200

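39 39 Query 2: Find the name & pictures of all cameras with picture
[Figure: the data tree after Query 2. It shows a product (Olympus, elec, camera, o.jpg), a picture c.jpg, the known products (Canon, 120, elec, camera), (Nikton, 199, elec, camera), (Sony, 175, elec, cdplayer), type annotations 3, 3 and 2a on data nodes, and the missing types product1* and product2*.]
What Is Missing??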

40 40 What Is Missing?
product1*: name, cat!=elec / subcat, picture*, price. All products whose category is not electronics.

41 41 What Is Missing?
product1*: name, cat!=elec / subcat, picture*, price. All products whose category is not electronics.
product2b*: name, cat=elec / subcat!=camera, picture*, price>=200. All electronics products with price >= 200 whose subcategory is not camera.

42 42 What Is Missing?
product1*: name, cat!=elec / subcat, picture*, price
product2b*: name, cat=elec / subcat!=camera, picture*, price>=200
product2c: name, cat=elec / subcat=camera, price>=200. All cameras with price >= 200 & no picture.

43 43 Incomplete Tree Definition
A tree T which consists of the following:
- A data tree T_data: represents the known data; uses labels from Σ
- A conditional tree type CT_type: represents the missing portion of the data; uses the specialized alphabet Σ'
- A data labeling mapping λ' from T_data nodes to elements of Σ'. E.g. λ'(n ∈ N | λ(n)=product) = {product, product3…}

44 44 Rep(T) Definition
Rep(T) is the set of trees represented by an incomplete tree T.
T_data ∈ Rep(T): a possible completion of the prefix of the available data tree given by T.

45 45 Rep(T) Definition (cont)
Given a T_type: student -> name addr id
Query q: student / name=shlomo
T_data: student / (shlomo, addr, id)
T_data ∈ Rep(T)

46 46 Acquiring Incomplete Information Refine Algorithm

47 47 Acquiring Incomplete Info.
How is this done on the WEB? Simply by using answers to queries.
We now show how this can be done against the representation system.
Assumption: the input tree is a single document described by a tree type. Several documents can be merged into a single one.

48 48 Refine Motivation
Each query posed against the webhouse defines additional constraints.
Answers to these queries help us refine the partial information.
We describe this partial information using an incomplete tree.
As we query the webhouse for more information, we want to be able to define the current incomplete information.

49 49 Refine Motivation (cont)
product2*: name, cat=elec / subcat, picture*, price>=200
Missing: all products with price >= 200

50 50 Refine Motivation (cont)
product2*: name, cat=elec / subcat, picture*, price>=200
A stronger constraint refines product2 into:
- product2b*: name, cat=elec / subcat!=camera, picture*, price>=200
- product2c: name, cat=elec / subcat=camera, price>=200, no picture

51 51 Refine Algorithm Refine the incomplete information Input T: incomplete tree q: PS-query A: = q(T) answer to q Output T’: incomplete tree compatible with the answer A to q

52 52 Refine Algorithm
The webhouse answers q with A = q(T).
q^-1(A): the set of trees compatible with the answer to q.
But we only need the trees that also match the incomplete tree so far: Rep(T) ∩ q^-1(A).
Rep(T') = Rep(T) ∩ q^-1(A)
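
The identity Rep(T') = Rep(T) ∩ q^-1(A) can be illustrated on a toy, finite universe. The sketch below is not the Refine algorithm: it simply enumerates every candidate document and filters, which the real algorithm avoids by working on the incomplete tree directly. Documents are flattened to sets of (name, price) pairs, and the universe, the query `q` and the helper `all_docs` are invented for the example.

```python
from itertools import combinations

# Hypothetical universe of (name, price) products a document might contain.
UNIVERSE = [("Canon", 120), ("Sony", 175), ("Zoom", 450)]

def q(doc):
    """Toy query: the products of `doc` with price < 200."""
    return frozenset(p for p in doc if p[1] < 200)

def all_docs(products):
    """Every subset of the universe, i.e. every candidate document."""
    return [frozenset(c) for r in range(len(products) + 1)
            for c in combinations(products, r)]

rep_T = all_docs(UNIVERSE)            # nothing known yet: 8 candidates
A = frozenset({("Canon", 120)})       # observed answer to q

# q^-1(A) restricted to Rep(T): candidates on which q returns exactly A.
rep_T_prime = [doc for doc in rep_T if q(doc) == A]

for doc in rep_T_prime:
    print(sorted(doc))
```

Only two candidates survive: both contain Canon and not Sony, and they differ on the expensive product, about which the answer says nothing.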

53 53 Refine Output Defines a new incomplete tree T’ In order to do so we need to define 1.CT type to represent the missing portion 2.T data to represent the available data Step 1

54 54 Refine Algorithm – step 1 1.Compute the conditional tree type of the negation of q. I.e. Conditional tree for trees which return an empty answer to q.

55 55 Refine – step 1
1. Compute the conditional tree type of the negation of q.
For each node a of the query tree t_q, with children a1, a2, …, an, three new types are defined, written t_a, t̄_a and t̂_a:
- cond'(t_a) = true
- cond'(t̄_a) = ¬cond_q(a)
- cond'(t̂_a) = cond_q(a)
Define σ': the labels for the new types are defined as specializations of label 'a', i.e. a = σ'(t_a) = σ'(t̄_a) = σ'(t̂_a)

56 56 Refine – step 1 (cont)
We defined the cond' mapping of CT'.
We defined the specialization mapping σ'.
Let's define the μ' rules (r: the root of the query tree; t̄_a and t̂_a are the specializations of a whose conditions are ¬cond_q(a) and cond_q(a)):
- root' has type (t̄_r ∨ t̂_r)
- t_a -> t_a1* … t_an*: accept everything
- t̄_a -> t_a1* … t_an*: accept everything below a, because the condition of q is not satisfied there
- t̂_a -> ∨_i (t_a1* … t̄_ai* t̂_ai* … t_an*): one of the children must not satisfy a condition of q

57 57 Refine - step 1 Example
Query t_q: product / (cat=elec / subcat=camera, picture)
Negation q^-1, viewed as a simple disjunction:
- product: cat!=elec / subcat, picture
- product: cat=elec / subcat!=camera, picture
- product: cat=elec / subcat=camera, no picture
Negation computation complexity: O(|q| * |Σ|), where |q| is the size of the query tree and |Σ| the max number of children
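
The three alternatives of the negated query follow a "first failure" pattern: each alternative keeps the conditions checked so far and violates the next one. The sketch below reproduces that pattern on plain strings. It is a simplification that treats the query as a flat chain of conditions (the slide's subcat sits below cat), and the function name `negate` and the list `QUERY_2` are hypothetical.

```python
def negate(steps):
    """Return the alternatives under which a chain of query conditions fails.

    `steps` is a list of (holds, fails) string pairs, ordered from the top
    of the query downwards. Alternative k keeps the first k conditions and
    violates the next one, so the alternatives are mutually exclusive.
    """
    alternatives = []
    for k, (_, fails) in enumerate(steps):
        kept = [holds for holds, _ in steps[:k]]
        alternatives.append(kept + [fails])
    return alternatives

# Query 2 of the talk: cameras (cat = elec, subcat = camera) with a picture.
QUERY_2 = [
    ("cat = elec", "cat != elec"),
    ("subcat = camera", "subcat != camera"),
    ("at least one picture", "no picture"),
]

for alternative in negate(QUERY_2):
    print("product: " + ", ".join(alternative))
```

The number of alternatives grows linearly with the number of conditions, which is in line with the modest cost quoted for the negation step.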

58 58 Refine – step 1 Example
CT' = CT ∩ q^-1
CT:
- product1*: name, cat!=elec / subcat, picture*, price
- product2*: name, cat=elec / subcat, picture*, price>=200
q^-1:
- product: cat!=elec / subcat, picture
- product: cat=elec / subcat!=camera, picture
- product: cat=elec / subcat=camera, no picture
Note: this intersection yields exactly the missing types product1, product2b and product2c. We show this next.

59 59 Refine – step 1 Example
CT (product1*, product2*) ∩ (product: cat!=elec / subcat, picture) yields in CT':
product1*: name, cat!=elec / subcat, picture*, price

60 60 Refine – step 1 Example
CT (product1*, product2*) ∩ (product: cat=elec / subcat!=camera, picture) yields in CT':
product2b*: name, cat=elec / subcat!=camera, picture*, price>=200

61 61 Refine – step 1 Example
CT (product1*, product2*) ∩ (product: cat=elec / subcat=camera, no picture) yields in CT':
product2c: name, cat=elec / subcat=camera, price>=200

62 62 Node ids
Assumption: persistent node ids. Distinct queries against an XML document return nodes with the same id iff the nodes are identical.
Example: two query answers both contain product &231. One shows (canon, elec, camera, 120); the other also shows the picture c.jpg. Because the ids are equal, they are combined into a single product &231 (canon, elec, camera, c.jpg, 120).

63 63 Node ids Assumption (cont)
A crucial assumption.
Makes it possible to enrich the information about a given node through consecutive queries.
Otherwise, the representation system would become too large to handle: it would need to be extended to keep track of the various possible ways of matching nodes returned by different queries.

64 64 Refine Output Defines a new incomplete tree T’ In order to do so we need to define 1.CT type to represent the missing portion 2.T data to represent the available data Step 2

65 65 Refine – step 2
T'_data is the join between T_data and A. To compute it:
- Nodes in both A and T_data: compute the intersection. E.g. product
- Nodes in T_data but not in A: the node type is specialized using the CT' we just computed. E.g. product3
- Nodes in A but not in T_data: refinement of an existing type. E.g. product2a
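
With persistent node ids, the join of T_data and A reduces to merging what is known about each id. The sketch below shows only that merging part, on a flat encoding where each product id maps to its known fields; the type specialization described in the list above is left out. The function `join`, the field names and the id &777 are assumptions made for this example.

```python
def join(t_data, answer):
    """Merge two {node id: {field: value}} maps, combining nodes with equal ids."""
    merged = {node_id: dict(fields) for node_id, fields in t_data.items()}
    for node_id, fields in answer.items():
        merged.setdefault(node_id, {}).update(fields)
    return merged

# Known data after a first query (no pictures were requested).
T_DATA = {
    "&231": {"name": "canon", "price": 120, "cat": "elec", "subcat": "camera"},
}

# Answer to a later query that returns pictures but no prices.
A = {
    "&231": {"name": "canon", "cat": "elec", "subcat": "camera", "picture": "c.jpg"},
    "&777": {"name": "olympus", "cat": "elec", "subcat": "camera", "picture": "o.jpg"},
}

T_DATA_PRIME = join(T_DATA, A)
for node_id, fields in sorted(T_DATA_PRIME.items()):
    print(node_id, fields)
```

Node &231 ends up with both its price and its picture, while &777 is a node that appears in A only. Without stable ids there would be no way to tell that the two canon entries describe the same product.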

66 66 Drawback – The Blowup Problem
Given a tree type: root -> a b
and n queries q_i (1 <= i <= n) with empty answers, where q_i: root / (a=i, b=i)
Let's follow the CT construction, where CT_qi belongs to the incomplete tree based on queries q_1 … q_i

67 67 The Blowup Problem
Query q_1: root / (a=1, b=1)
Incomplete tree T_q1: T_data is empty
CT_q1: root -> (a_1 b) ∨ (a b_1), i.e. a!=1 or b!=1

68 68 The Blowup Problem
Query q_2: root / (a=2, b=2)
CT_q2:
1. Compute q_2^-1, the negation of q_2: root -> (a_2 b) ∨ (a b_2), i.e. a!=2 or b!=2

69 69 The Blowup Problem
Query q_2: root / (a=2, b=2)
CT_q2:
2. Compute the intersection q_2^-1 ∩ CT_q1

70 70 The Blowup Problem
2. Compute the intersection q_2^-1 ∩ CT_q1
CT_q1: root(a!=1, b) ∨ root(a, b!=1)
q_2^-1: root(a!=2, b) ∨ root(a, b!=2)
Result:
- root(a!=1 and a!=2, b)
- root(a, b!=1 and b!=2)
- root(a!=2, b!=1)
- root(a!=1, b!=2)
Continuing the computation with q_3 … q_n yields: |CT_q3| = 4*2 = 2^3 = 8, …, |CT_qn| = 2^n
The Refine algorithm yields a disjunction of 2^n new types (multiplicity statements): an exponential blowup of the representation system.
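
The doubling can be reproduced with a short simulation. Each root type is encoded as a pair (values excluded for a, values excluded for b), and intersecting with the negation of q_i distributes over the existing disjuncts. The encoding and the function names `negation`, `intersect` and `refine_all` are choices made for this sketch, not the data structures of the Refine algorithm.

```python
def negation(i):
    """Negation of q_i (a = i and b = i): either a != i or b != i."""
    return {(frozenset({i}), frozenset()), (frozenset(), frozenset({i}))}

def intersect(ct, neg):
    """Distribute the intersection over both disjunctions of root types."""
    return {(a1 | a2, b1 | b2) for a1, b1 in ct for a2, b2 in neg}

def refine_all(n):
    """Root types left after n empty answers to q_1 ... q_n."""
    ct = {(frozenset(), frozenset())}  # no constraint at the start
    for i in range(1, n + 1):
        ct = intersect(ct, negation(i))
    return ct

for n in range(1, 6):
    print(n, len(refine_all(n)))
```

For n = 2 the four types are exactly the four results listed above, and each further query doubles the count.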

71 71 Avoiding The Blowup We consider two ways of avoiding the exponential blowup of incomplete trees: Provide Extension to the incomplete tree. conjunctive incomplete trees Put some restrictions on the tree type and the queries.

72 72 Conjunctive Incomplete Tree
So far types are defined only as disjunctions, i.e. root -> a_1 b ∨ a b_1
Define a type as a conjunction of disjunctions:
root -> (a_1 b ∨ a b_1)
root -> (a_1 b ∨ a b_1) ∧ … ∧ (a_n b ∨ a b_n)
a_i and b_i are specializations of a and b, respectively:
cond(a_i) = (!= i), 1 <= i <= n
cond(b_i) = (!= i), 1 <= i <= n

73 73 Conjunctive Incomplete Tree
With conjunction: the incomplete information can be represented using only n conjunctions of disjunctions.
Without conjunction: algorithm Refine yields a disjunction of 2^n multiplicity statements.
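
The saving can be checked on the root / (a, b) example: n clauses of the form "a != i or b != i" accept exactly the same documents as the 2^n expanded root types. The sketch below compares the two forms on small inputs; the function names are hypothetical and a document is reduced to its pair of values (a, b).

```python
from itertools import product

def conjunctive_accepts(a, b, n):
    """n clauses, one per query: (a != i or b != i) for i = 1..n."""
    return all(a != i or b != i for i in range(1, n + 1))

def disjunctive_types(n):
    """The 2^n expanded types: each query i is blamed on either a or b."""
    types = []
    for choice in product("ab", repeat=n):
        not_a = {i for i, c in enumerate(choice, 1) if c == "a"}
        not_b = {i for i, c in enumerate(choice, 1) if c == "b"}
        types.append((not_a, not_b))
    return types

def disjunctive_accepts(a, b, n):
    """Accept if some expanded type allows both values."""
    return any(a not in not_a and b not in not_b
               for not_a, not_b in disjunctive_types(n))

n = 4
print(len(disjunctive_types(n)), "expanded types versus", n, "clauses")
print(all(conjunctive_accepts(a, b, n) == disjunctive_accepts(a, b, n)
          for a in range(n + 2) for b in range(n + 2)))
```

Both forms describe the same set of documents; only the size of the description differs.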

74 74 Heuristics
To deal with the case when the incomplete tree is already too large to be practical:
- Shrink the incomplete tree by asking critical additional queries that help to eliminate the missing portion.
- Lose some information: allows a trade-off of accuracy against the size of the incomplete tree.

75 75 Acquiring Partial Information Summary Webhouse is acquired using answers to queries Each answer refines our partial information Partial information is described using incomplete trees We compute the new incomplete tree at each stage using Refine algorithm

76 76 Querying Incomplete Trees

77 77 Answering Queries Remember.. The known data is of the format product name cat subcat picture price * product2a name cat=elec subcat=camera picture price  200 product3 name cat=elec subcat  camera price cameras with price  200 elec products (not cameras) with price  200

78 78 Answering Queries
Given query 3: Find the name, price & pictures of all cameras with price < $100 that have at least one picture.
Query tree: product / (name, cat=elec / subcat=camera, picture+, price<100)
We can provide a complete answer to query 3 using the available information.

79 79 Answering Queries
Given query 4: Find all cameras
Query tree: product / (name, cat=elec / subcat=camera, picture*, price)
No complete answer is available from the known information. We can do the following:
1. Provide the complete list of cameras with price < 200
2. Provide the complete list of cameras with a picture
3. Tell the user there may be more cameras (that are expensive and have no pictures)

80 80 Answering Queries Provides an incomplete answer to the query given the knowledge available No data source access for further information Next.. Mediator Approach: Provide a complete answer but seek the webhouse only for the missing information. The incomplete tree is used as a guide to the mediator.

81 81 Mediator Approach
Additional queries may have to be generated against the input document to obtain the information needed to fully answer the query.
Generated query: product / (name, cat=elec / subcat=camera, picture with multiplicity 0, price>=200)
Seek the web only for cameras with price >= 200 and no picture

82 82 Mediator Approach (cont)
Assumption: the generated queries are local.
Local queries: queries that explore the input document starting from the nodes already available.
[Figure: incomplete tree T, its data tree T_data with a node n, and a PS-query q.]

83 83 Local Query
Local ps-query: p@n
- p: ps-query
- n: node in T_data
[Figure: incomplete tree T, its data tree T_data, a PS-query q, and the local query rooted at node n of T_data.]

84 84 Local Query
L: { p@n | p a local query }
We want the set of queries to collect the additional information needed to fully answer a given ps-query.
T' is obtained by extending each node n of T_data for which p@n ∈ L with p@n(T), where T ∈ rep(T).
L completes T if q(T) = q(T').
[Figure: T_data with nodes n_1 … n_k, extended by the answers to p@n_1 … p@n_k.]

85 85 Local Query
Using local queries helps us avoid redoing the work already done by previous queries.
We want the set of queries L to be non-redundant:
1. No query in L returns nodes that already exist in T.
2. No new node is returned by two distinct queries of L.
3. Queries in L should always return a non-empty answer.

86 86 Mediator Approach Conclusion
The mediator approach combines the CWA and OWA semantics.
CWA: describes the missing information, i.e. some facts are not known.
OWA: some data that is still unknown may exist.

87 87 Assumptions

88 88 Order
Assumption: no order is required in our representation system.
1. Original XML documents define an order on elements. Moving to the tree representation loses the original ordering.
2. The source DTD may describe the order of children at each node type.
3. Queries may use ordering in their selection patterns.

89 89 Branching
Assumption: a PS-query tree pattern allows just one child with a given label.
Branching: allows multiple children with the same label.
[Figure: example tree with nodes root, product, camera, cdplayer.]

90 90 Branching
T_data: root with n children labeled a: a_1, a_2, …, a_n
q: branching ps-query whose root has n children labeled a, with conditions b=1, b=2, …, b=n
q(T) requires the description of n! possibilities of assigning the n values of b to a_1 … a_n
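
The n! figure is just the number of ways to hand the values 1..n to the nodes a_1 … a_n, one value per node. A brief sketch with the standard library makes the growth concrete; the function name `assignments` and the node names are illustrative only.

```python
from itertools import permutations
from math import factorial

def assignments(n):
    """All ways of assigning the b-values 1..n to the nodes a1..an."""
    nodes = [f"a{i}" for i in range(1, n + 1)]
    return [dict(zip(nodes, values))
            for values in permutations(range(1, n + 1))]

for n in range(1, 7):
    print(n, len(assignments(n)), factorial(n))
```

Already at n = 6 there are 720 cases to describe, which is why the representation system restricts query patterns to one child per label.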

91 91 References
- Representing and Querying XML with Incomplete Information. Serge Abiteboul, Luc Segoufin, Victor Vianu.
- Incomplete Information and XML. Presentation. http://www-rocq.inria.fr/~abiteboul
- A Web Odyssey: from Codd to XML. Victor Vianu.
- Incomplete Information in Relational Databases. Tomasz Imielinski and Witold Lipski, Jr.

