
Information Preserving XML Schema Embedding. Philip Bohannon (Bell Laboratories), Wenfei Fan (Univ of Edinburgh & Bell Labs), Michael Flaster (Bell Laboratories).


1 Information Preserving XML Schema Embedding
Philip Bohannon, Bell Laboratories
Wenfei Fan, Univ of Edinburgh & Bell Labs
Michael Flaster, Bell Laboratories
PPS Narayan, Bell Laboratories

2 XML mapping
XML mapping σ_d : I(S1) → I(S2)
Instance-level: from XML instances of a given source DTD schema S1 to XML trees of a predefined target DTD schema S2
Information preserving (lossless)
XML data exchange, migration, integration, P2P, …
[Figure: the XML mapping takes an XML tree T of S1 to an XML tree of S2]

3 Example: XML mapping – source DTD
Source schema S1:
db → class*
class → cno, title, type
type → (regular + project)
regular → prereq
prereq → class*
DTD: (E, P, r). E: element types; r: root; P: element type definitions A → α, where
α ::= PCDATA | ε | B1, …, Bk | B1 + … + Bk | B*
Graph representation:
– concatenation B1, …, Bk: AND edge (solid)
– disjunction B1 + … + Bk: OR edge (dashed)
– Kleene star B*: STAR edge (with edge label *)
[Figure: graph of S1 over db, class, cno, title, type, regular, project, prereq, with STAR edges db → class and prereq → class]
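
The graph representation can be made concrete with a small sketch. The dictionary encoding and the names `S1`, `EDGE_KIND` and `edges` are assumptions made for illustration and do not come from the slides; element types without a production on the slide (cno, title, project) are simply treated as leaves.

```python
# Hypothetical encoding of the source DTD S1: each element type maps to
# (production kind, child types). Types with no entry are leaves.
S1 = {
    "db": ("star", ["class"]),
    "class": ("concat", ["cno", "title", "type"]),
    "type": ("choice", ["regular", "project"]),
    "regular": ("concat", ["prereq"]),
    "prereq": ("star", ["class"]),
}

# Production kind -> edge kind, following the slide's graph representation.
EDGE_KIND = {"concat": "AND", "choice": "OR", "star": "STAR"}


def edges(dtd):
    """List the labeled edges (parent, child, kind) of the DTD graph."""
    return [
        (parent, child, EDGE_KIND[kind])
        for parent, (kind, children) in dtd.items()
        for child in children
    ]


for edge in edges(S1):
    print(edge)
```

The recursion in S1 (class reaches itself through type, regular and prereq) shows up as a cycle in the edge list, which is why the schema is a graph rather than a tree.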

4 Example: XML mapping – target DTD
Target schema S2:
[Figure: graph of S2 over the element types school, courses, current, history, course, basic, cno, credit, semester, term, year, title, category, mandatory, advanced, regular, lab, seminar, project, required, prereq, gpa, students, student, ssn, name, taking; several edges are STAR edges]

5 Information preserving XML mapping
Objective: find an XML mapping σ_d : I(S1) → I(S2) such that
Type safety: for any XML tree T of S1, σ_d(T) is an XML document that conforms to the predefined target schema S2
Information preserving:
– Invertibility: there exists an inverse σ_d^-1 : I(S2) → I(S1) such that for any XML tree T of S1, T = σ_d^-1(σ_d(T)). The source T can be recovered from the target σ_d(T)
– Query preservation w.r.t. a query language L: there is a query-rewriting function F: L → L such that for any Q in L and any T of S1, Q(T) = F(Q)(σ_d(T)). All queries in L on the source can be answered on the target

6 Challenge: different structures
S1 and S2 have vastly different structures: graph similarity (simulation) does not work here!
[Figure: the graphs of S1 and S2 side by side]

7 Challenge: data integration
Multiple sources are to be mapped to a single target: the target schema must have a larger information capacity, so it cannot be similar to the sources
[Figure: two source schemas, S1 (db, class, cno, title, type, regular, project, prereq) and S1’ (db, student, ssn, name, taking, cno), mapped into the target S2 (school, courses, current, history, students, student, ...)]

8 About query preservation: XML query languages
Regular XPath:
Q ::= ε | A | Q/text() | Q/Q | Q ∪ Q | Q* | Q[q]
q ::= Q | Q/text() = ‘c’ | position() = k | q ∧ q | q ∨ q | not q
An XPath fragment: Q//Q instead of Q*
Example: a regular XPath query over S1: find all prerequisites of CS331
Q1: class[cno/text() = ‘CS331’]/(type/regular/prereq/class)*
Query rewriting turns Q1 into a query over S2:
Q2: courses/current/course[basic/cno/text() = ‘CS331’]/(category/mandatory/regular/required/prereq/course)*
[Figure: graph of S1]

9 Challenge: information preservation for XML
For relational data w.r.t. relational calculus (L), invertibility (calculus dominance) and query preservation (dominance) coincide [Hull 84]
Separation:
(a) There is an invertible XML mapping that is NOT query preserving w.r.t. XPath.
(b) There is an XML mapping that is query preserving w.r.t. XPath without position() but is NOT invertible.
Complexity: it is undecidable to determine, for an XML mapping defined in any language subsuming FO, whether it is (a) invertible, or (b) query preserving w.r.t. any query language with projection. Deciding information preservation is therefore beyond reach for XML mappings defined in XQuery/XSLT
Other results:
– query preservation w.r.t. regular XPath is stronger than invertibility
– sufficient conditions under which the two coincide

10 Previous work
XML mappings defined in XQuery/XSLT: no guidance on
– type safety: for any XML tree T of S1, is σ_d(T) guaranteed to conform to the predefined (recursive) target schema S2?
– how to ensure information preservation
Schema mapping: to derive instance-level mapping
– similarity flooding, Cupid, Clio, TransSCM, …
– cannot guarantee information preservation
Information preservation in traditional data models: not directly applicable to XML mappings
No prior work has considered information-preserving XML mapping

11 Our approach
A systematic way to find XML mappings commonly used in practice:
find a schema mapping (embedding) σ : S1 → S2 with certain properties, if there is any
derive an instance-level mapping σ_d : I(S1) → I(S2) from σ
– automatically guarantee information preservation
– accommodate integration (multiple sources)
Input: source DTD S1 = (E1, P1, r1), target DTD S2 = (E2, P2, r2); similarity matrix att( ) on element type names: att(A, B) in [0, 1] indicates how close A ∈ E1 is to B ∈ E2
Output: schema embedding σ = (λ( ), path( ))

12 Schema embedding σ = (λ( ), path( ))
λ : E1 → E2, type mapping: λ(r1) = r2 and att(A, λ(A)) > 0
path(A, B) maps an edge (A, B) in S1 to a unique path from λ(A) to λ(B) in S2: A1[position( ) = k1]/…/An[position( ) = kn]
– path type: AND (OR, STAR) edge to AND (OR, STAR) path (solid/star edges, solid + at least 1 dashed, solid edges + *)
Information capacity
– prefix-free: if P1(A) = A1, …, An, path(A, Ai) is NOT a prefix of any path(A, Aj) for j ≠ i; similarly for P1(A) = A1 + … + An.
Type safety – valid mapping
Is there a schema embedding for the following?
[Figure: example schema pairs S1 and S2 over A, B, C]
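
The prefix-free condition is easy to check mechanically. The sketch below is only an illustration: it represents a path as a list of step names and ignores the position( ) qualifiers, and the helper names `is_prefix` and `prefix_free` are invented here. The first test uses the paths given for class later in the deck; the second pair of paths is a hypothetical violation.

```python
def is_prefix(p, q):
    """True if path p (a list of steps) is a prefix of path q."""
    return len(p) <= len(q) and q[:len(p)] == p


def prefix_free(paths):
    """True if no path in the list is a prefix of another path in the list."""
    return not any(
        is_prefix(p, q)
        for i, p in enumerate(paths)
        for j, q in enumerate(paths)
        if i != j
    )


# Paths assigned to the children of class: cno, title, type.
class_paths = [["basic", "cno"], ["basic", "semester", "title"], ["category"]]
print(prefix_free(class_paths))

# Hypothetical violation: the first path is a prefix of the second.
print(prefix_free([["B"], ["B", "C"]]))
```

Two paths may share leading steps, as basic/cno and basic/semester/title do; only the case where one whole path is the beginning of another is ruled out.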

13 Example: Schema embedding
[Figure: S1 with A, B, C and S2 with A, B, C]
λ(A) = A, λ(B) = B, λ(C) = C
path(A, B) = A/B
path(A, C) = B/C
Unfolding: the prefix-free condition
query translation: B/C
[Figure: S1 and S2 over A and B, with edges labeled 1 and 2]
Schema embedding: NO
Graph simulation: YES
Schema embedding is not a mild generalization of graph simulation

14 Schema embedding: example
λ(db) = school, λ(class) = course
path(db, class) = courses/current/course
– mapping edge to path
– STAR edge to STAR path
– Graph similarity? NO
[Figure: S1 (db, class) and S2 (school, courses, current, history, course, students, student, gpa, name, ssn, taking)]

15 Schema embedding: example
λ(type) = category, λ(A) = A
path(class, cno) = basic/cno
path(class, title) = basic/semester/title
path(class, type) = category
AND (STAR) edges to AND (STAR) paths
Relative path: relative to course
[Figure: S1 (class, cno, title, type) and S2 (course, basic, cno, credit, semester, term, year, title, category)]

16 Schema embedding: example
λ(X) = X
path(type, regular) = mandatory/regular
path(type, project) = advanced/project
OR edges to OR paths
[Figure: S1 (type, regular, project) and S2 (category, mandatory, advanced, regular, lab, seminar, project)]
λ(X) = X
path(regular, prereq) = required/prereq
path(prereq, class) = course
[Figure: S1 (regular, prereq, class) and S2 (regular, gpa, required, prereq, course)]

17 Deriving instance-level mapping
Each schema embedding σ : S1 → S2 determines an XML mapping σ_d : I(S1) → I(S2)
Path types and prefix-free
Given an XML tree T1 of S1, σ_d(T1) constructs an instance T2 of S2 top-down, by mapping A-elements of T1 to λ(A)-nodes in T2:
– the root of T2 is mapped from the root of T1;
– for each λ(A)-element in T2 mapped from an A-element of T1, generate path(A, B) in T2 for each B-child of the A-element;
– when all the elements in T2 mapped from nodes in T1 are fully expanded, add necessary “default” elements to T2 such that T2 satisfies S2.
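
The top-down construction can be sketched as follows, under simplifying assumptions that are not in the slides: a tree is a (tag, children) pair without text content, position( ) qualifiers are dropped, every step of a path except the last is shared between siblings (so the STAR edge is assumed to be the last step), and the final step that pads T2 with default elements is left out. `LAMBDA` and `PATH` collect the type and path mappings from the example slides; `get_or_create` and `map_tree` are hypothetical helper names.

```python
# Type mapping and path mapping taken from the example slides.
LAMBDA = {
    "db": "school", "class": "course", "type": "category",
    "cno": "cno", "title": "title", "regular": "regular",
    "project": "project", "prereq": "prereq",
}
PATH = {
    ("db", "class"): ["courses", "current", "course"],
    ("class", "cno"): ["basic", "cno"],
    ("class", "title"): ["basic", "semester", "title"],
    ("class", "type"): ["category"],
    ("type", "regular"): ["mandatory", "regular"],
    ("type", "project"): ["advanced", "project"],
    ("regular", "prereq"): ["required", "prereq"],
    ("prereq", "class"): ["course"],
}


def get_or_create(children, tag):
    """Return the child with this tag, creating it if it is missing."""
    for child in children:
        if child[0] == tag:
            return child
    node = (tag, [])
    children.append(node)
    return node


def map_tree(source, lam=LAMBDA, path=PATH):
    """Map a source tree (tag, children) to a target tree, top-down."""
    tag, kids = source
    target = (lam[tag], [])
    for kid in kids:
        steps = path[(tag, kid[0])]
        cursor = target
        # Intermediate steps are shared between siblings (simplification).
        for step in steps[:-1]:
            cursor = get_or_create(cursor[1], step)
        mapped = map_tree(kid, lam, path)
        # The path must end at the image of the child's type.
        assert mapped[0] == steps[-1]
        cursor[1].append(mapped)
    return target


t1 = ("db", [("class", [("cno", []), ("title", []), ("type", [("project", [])])])])
print(map_tree(t1))
```

Because the paths are prefix-free, each mapped node sits at a distinct position in T2, which is the property that lets the source tree be read back from the target.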

18 Properties of schema embedding
Theorem: The XML mapping σ_d : I(S1) → I(S2) derived from a schema embedding σ : S1 → S2 is
– well defined (type safety),
– invertible (with a quadratic-time inverse), and
– query preserving w.r.t. regular XPath (query rewriting: linear-time data complexity, quadratic-time combined complexity)
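
For the simplest kind of query, a plain sequence of child steps, the rewriting amounts to replacing each source edge by its path. The sketch below covers only that case and is not the rewriting function F of the theorem, which handles all of regular XPath (qualifiers, union, Kleene star). `PATH` repeats the path mapping of the running example, and `rewrite_steps` is a hypothetical helper name.

```python
# Path mapping of the running example (source edge -> target path).
PATH = {
    ("db", "class"): ["courses", "current", "course"],
    ("class", "cno"): ["basic", "cno"],
    ("class", "title"): ["basic", "semester", "title"],
    ("class", "type"): ["category"],
    ("type", "regular"): ["mandatory", "regular"],
    ("type", "project"): ["advanced", "project"],
    ("regular", "prereq"): ["required", "prereq"],
    ("prereq", "class"): ["course"],
}


def rewrite_steps(context, steps, path=PATH):
    """Rewrite child steps over S1, starting at element type `context`,
    into the corresponding sequence of steps over S2."""
    out = []
    for step in steps:
        out.extend(path[(context, step)])
        context = step  # the next step starts from the type just reached
    return "/".join(out)


print(rewrite_steps("db", ["class"]))
print(rewrite_steps("class", ["cno"]))
print(rewrite_steps("class", ["type", "regular", "prereq", "class"]))
```

The three results are exactly the pieces from which the target query of the prerequisite example is assembled: the path to course, the path inside the qualifier, and the body of the starred subexpression.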

19 Integration: multiple sources
λ(db) = school, λ(X) = X
path(db, student) = students/student
path(taking, cno) = cno
pairwise disjoint path mappings from S1, S1’ to S2
[Figure: sources S1 (db, class, cno, title, type, regular, project, prereq) and S1’ (db, student, ssn, name, taking, cno) mapped into the target S2 (school, courses, current, history, students, student, gpa, name, ssn, taking, cno, ...)]

20 Schema embedding vs. graph simulation
Definition:
– embedding: mapping edges to paths
– simulation: mapping edges to edges
Restructuring:
– embedding: various DTD constructs, different structures
– simulation: source and target schemas with similar structures
Information preservation for XML mappings:
– embedding: automatically guarantees both invertibility and query preservation w.r.t. regular XPath
– simulation: no
Data integration:
– embedding: multiple source DTDs to a single target schema
– simulation: no
A systematic method to define information-preserving XML mappings

21 Complexity: finding schema embedding
Input: two DTD schemas S1 and S2, and a similarity matrix att( )
Output: a schema embedding σ : S1 → S2 such that qual(σ, att) is maximal, if there is any
qual(σ, att) is the sum of att(A, λ(A)) for all A in S1
Theorem: It is NP-complete to determine whether or not there is a schema embedding from S1 to S2, even when S1 and S2 are nonrecursive and consist of concatenation types only.
Efficient algorithms are necessarily heuristic:
– find a local embedding for each DTD production of S1
– assemble the local embeddings into a schema embedding
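
The quality measure is a plain sum and can be written down directly. In this sketch the similarity values in `ATT` are made up for illustration (the slides give no concrete numbers), and the check that every matched pair has positive similarity mirrors the requirement att(A, λ(A)) > 0 on the type mapping.

```python
def qual(lam, att):
    """Sum of att(A, lam(A)) over all source types A in the type mapping."""
    for a, b in lam.items():
        # A valid type mapping needs att(A, lam(A)) > 0.
        if att.get((a, b), 0.0) <= 0.0:
            raise ValueError(f"att({a}, {b}) must be positive")
    return sum(att[(a, b)] for a, b in lam.items())


# Hypothetical similarity values, not taken from the slides.
ATT = {("db", "school"): 0.5, ("class", "course"): 0.75, ("type", "category"): 0.25}
LAM = {"db": "school", "class": "course", "type": "category"}

print(qual(LAM, ATT))
```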

22 Computing local embedding – fixed type mapping
Input: a production A → P(A) in source DTD S1, target schema S2
Output: σ0 = (λ0, path0), a partial embedding from P(A) to S2
Example: find λ0( ) from types in P(A) to types of S2, and path0( )
[Figure: S1 (type, regular, project) and S2 (category, mandatory, advanced, regular, lab, seminar, project)]
If λ0 is given: an O(|P(A)| |S2|) algorithm findPath to find a local embedding (depth-first search, checking each S2 subtree only once)
When λ0 is not fixed, the local embedding problem is NP-hard
Heuristic: randomized findPath to find both λ0 and path0 (randomly pick a possible type-node match in the search) ...
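
The core of a search like findPath is a depth-first traversal of the target schema graph that visits each node at most once. The sketch below shows only that core and is not the authors' findPath: it ignores path types and the prefix-free condition. The adjacency list `S2_FRAGMENT` is an assumed layout of part of S2, chosen to be consistent with the paths basic/cno and basic/semester/title; `find_path` is a hypothetical name.

```python
# Assumed fragment of the target schema graph (parent -> children).
S2_FRAGMENT = {
    "course": ["basic", "category"],
    "basic": ["cno", "credit", "semester"],
    "semester": ["term", "year", "title"],
}


def find_path(graph, start, goal, seen=None):
    """Depth-first search for a path of child steps from start to goal.

    Returns the list of steps below start, or None if goal is unreachable.
    Each node is expanded at most once thanks to the shared `seen` set.
    """
    if seen is None:
        seen = set()
    if start in seen:
        return None
    seen.add(start)
    for child in graph.get(start, []):
        if child == goal:
            return [child]
        rest = find_path(graph, child, goal, seen)
        if rest is not None:
            return [child] + rest
    return None


print(find_path(S2_FRAGMENT, "course", "title"))
print(find_path(S2_FRAGMENT, "course", "gpa"))
```

With λ0(class) = course and λ0(title) = title fixed, the first call recovers the path used for the edge (class, title) in the example.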

23 Assembling local embeddings
Input: C(A), a set of local embeddings for each A in the source DTD (initialized via randomized findPath); a target schema S2
Output: σ = (λ, path), a schema embedding from S1 to S2, if any
Theorem: The assemble-embedding problem is NP-complete even when S1 and S2 are nonrecursive.
Conflict: type mapping, prefix free
Three heuristic algorithms:
1. Fix an order O on S1 types via qual( ), pick a local embedding σ_A from C(A) in O, and increment σ with σ_A if no conflict
2. Assume a random order O on S1 types, then do the same as (1)
3. Reduction to the MAX-Weight-Independence-Set problem, leveraging an existing tool for that problem.
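
The first two heuristics share one greedy loop and differ only in the order in which source types are visited. The sketch below is a rough illustration of that loop under stated assumptions: a local embedding is a pair (λ0, path0) of dictionaries, the order is passed in rather than computed from qual( ), and only type-mapping conflicts are checked (the prefix-free conflict is omitted). The function name `assemble` and the candidate data are hypothetical.

```python
def assemble(candidates, order):
    """Greedily merge one local embedding per source type, in the given order.

    candidates maps each source type A to a list of (lam0, path0) pairs.
    Returns (lam, path), or None if some type has no conflict-free candidate.
    """
    lam, path = {}, {}
    for a in order:
        for lam0, path0 in candidates[a]:
            # Type-mapping conflict: a type already mapped to another target.
            if all(lam.get(t, target) == target for t, target in lam0.items()):
                lam.update(lam0)
                path.update(path0)
                break
        else:
            return None  # every candidate for this type conflicts
    return lam, path


# Hypothetical candidate sets for two productions of S1.
CANDIDATES = {
    "db": [({"db": "school", "class": "course"},
            {("db", "class"): ["courses", "current", "course"]})],
    "class": [
        ({"class": "student", "cno": "cno"}, {("class", "cno"): ["taking", "cno"]}),
        ({"class": "course", "cno": "cno"}, {("class", "cno"): ["basic", "cno"]}),
    ],
}

print(assemble(CANDIDATES, ["db", "class"]))
```

Here the first candidate for class is rejected because class was already mapped to course, so the second one is taken. A greedy pass like this can fail even when an embedding exists, which is consistent with the NP-completeness result above.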

24 Experimental evaluation
Benchmark
– XMark (99 type nodes in its original form)
– Real-life DTDs: SIGMOD (13), PSD (121), mondial (70), etc.
– Generating target schemas by adding noise: changing edges to paths, mutating names, inserting new subtrees
Selectivity/accuracy of att( ): [0, 1] (1.0: exact match)
Target schemas with 75% noise: XMark (581-748), SIGMOD (54-96), PSD (712-820), mondial (395-496)
System
– 933 MHz/1.0 GHz Pentium III, 256 MB memory
– QUALEX: a tool for MAX-Weight-Independence-Set
– Algorithms implemented in Java

25 Experimental result – target size
XMark (acc 0.75). RandomOrder and MAXSet-Reduction perform well

26 Experimental result – running time required
XMark (acc 0.75). In seconds for schemas of hundreds of nodes

27 Experimental result – different source schemas
Various source schemas (acc 0.75). RandomOrder finds solutions more than 90% of the time, in seconds

28 Summary
Information preservation: the first study for XML mappings
– more intriguing than its relational counterparts: separation, equivalence, complexity of invertibility and query preservation
– important for data exchange, migration, integration, P2P, …
Schema embedding:
– mapping edges to paths
– captures various DTD constructs, supports restructuring
– automatically guarantees information preservation
– accommodates multiple sources to a single target
– NP-complete, but with efficient and effective heuristics
A practical solution for finding information-preserving XML mappings

