1 Schema & Schema Integration
Carsten Karl, Dennis Schade, Thorsten Dollmann


2 Outline
XTRACT: system for inferring DTDs from a set of XML documents
Incremental validation of XML documents

3 Schema & XML Databases
Databases need a schema
DTDs serve the role of the schema of the document
Efficient storage of XML data
Optimization of XML queries
DTDs are not mandatory!

4 XTRACT
Goal: infer DTDs from a set of XML documents

5 Problem Simplification and Abstraction
Infer a DTD for each tag separately:
- Separate the example sequences for each tag
- Infer a “good” DTD for each tag
The resulting document DTD is a composition of all inferred per-tag DTDs
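
The per-tag separation can be sketched in a few lines of Python. This is only an illustration, not the XTRACT implementation: the function name `example_sequences` and the two sample documents are assumptions, loosely modeled on the book/author/editor tags of the example that follows.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def example_sequences(documents):
    """Map each tag to the set of child-tag sequences observed for it."""
    sequences = defaultdict(set)
    for text in documents:
        root = ET.fromstring(text)
        # iter() visits the root and every descendant element
        for element in root.iter():
            sequences[element.tag].add(tuple(child.tag for child in element))
    return dict(sequences)

# Hypothetical sample documents (not taken from the slides)
docs = [
    "<book><title/><author><name/><age/></author><editor><name/></editor></book>",
    "<book><title/><author><name/></author></book>",
]
seqs = example_sequences(docs)
```

Each entry of `seqs` is then an independent inference problem: one set of example sequences, one regular expression to find.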

6-17 Example
[Figure: element tree over the tags book, title, author, editor, name, age]
[Table: Tag / Example sequence set, with one sequence for book, two for author and one for editor; the sequences themselves are missing here]

18 What is a “good” DTD?
Given the example sequence set I = { ab, abab, ababab }
Possible DTDs: (ab)*, (a|b)*, (ab|abab|ababab), ab|ab(ab|abab)
Each candidate DTD is rated in two columns, Concise and Precise (ratings shown: Yes, No, Yes, Somewhat)

19 What is a “good” DTD? (ctd.)
A good DTD D must satisfy two restrictions:
- R1: D should be concise
- R2: D should be precise
Minimum Description Length (MDL) quantifies and resolves the tradeoff between R1 and R2

20 The MDL Principle
The MDL principle states: the best theory to infer from a given set of data is the one that minimizes the sum of
1. the length of the theory, in bits
2. the length of the data, in bits, when encoded with the help of the theory
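
Written as a formula, the principle reads as follows. The symbols are assumptions introduced here for illustration: D ranges over candidate DTDs (the theories), I is the set of example sequences (the data), L(D) is the length of D and L(I | D) the length of I when encoded with the help of D.

```latex
\hat{D} \;=\; \arg\min_{D} \bigl[\, L(D) + L(I \mid D) \,\bigr]
```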

21 Overview of XTRACT System
[Figure: three modules: Generalization, Factoring, MDL Module]
Input sequences: I = { ab, abab, ac, ad, bc, bd, bbd, bbbe }
S_g = I ∪ { (ab)*, (a|b)*, b*d, b*e }
S_f = S_g ∪ { (a|b)(c|d), b*(d|e) }
Inferred DTD: (ab)* | (a|b)(c|d) | b*(d|e)

22 MDL Subsystem
In order to use the MDL principle, we need to
- define the theory description length
- define the data description length
- solve the resulting minimization problem

23 MDL Coding Scheme
Description length of a DTD:
- number of characters of the DTD
Cost of encoding the example sequences:
- encoding of b in terms of DTD a | b | c is 1 (position of b in the DTD), cost 1
- encoding of bbb in terms of DTD b* is 3 (number of repetitions of b), cost 1
- encoding of b in terms of DTD b is ε (empty), cost 0

24-28 MDL Subsystem Minimization
Input sequences: ab, abb, abbb, abbbb, abbbbb
Candidate DTDs and their MDL cost (DTD length plus cost of encoding the covered sequences):
- (a|b)*: DTD length 6, encoding costs 3, 4, 5, 6, 7 (for ab: 1 for the repetition count, 1 for a, 1 for b); total given as 30
- ab*: DTD length 3, encoding cost 1 per sequence; total 3 + 1 + 1 + 1 + 1 + 1 = 8
- abb: DTD length 3, encoding cost 0; total 3, covers only abb
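
The minimization step can be illustrated with a brute-force search over subsets of candidate DTDs. This is a sketch under simplifying assumptions, not the algorithm used by XTRACT: the per-sequence cost functions are hand-written from the coding scheme of the slides (one unit for a repetition count, one unit per choice in an alternation), every input sequence is also offered as its own candidate DTD with cost 0, and all names are hypothetical.

```python
import re
from itertools import combinations

inputs = ["ab", "abb", "abbb", "abbbb", "abbbbb"]

# candidate DTD -> cost of encoding one covered sequence (simplified scheme)
candidates = {
    "(a|b)*": lambda s: 1 + len(s),  # repetition count + one choice per symbol
    "ab*": lambda s: 1,              # repetition count of b
}
# each input sequence is also a candidate DTD that encodes itself for free
candidates.update({s: (lambda _s: 0) for s in inputs})

def covers(dtd, s):
    """True if the DTD, read as a regular expression, matches all of s."""
    return re.fullmatch(dtd, s) is not None

def total_cost(subset):
    """DTD lengths plus cheapest encoding of every input; None if not covered."""
    cost = sum(len(d) for d in subset)
    for s in inputs:
        options = [candidates[d](s) for d in subset if covers(d, s)]
        if not options:
            return None
        cost += min(options)
    return cost

def best_subset():
    best = None
    names = list(candidates)
    for k in range(1, len(names) + 1):
        for subset in combinations(names, k):
            c = total_cost(subset)
            if c is not None and (best is None or c < best[0]):
                best = (c, subset)
    return best

best_cost, best_dtds = best_subset()
```

Exhaustive search is only workable for a handful of candidates; it is used here to keep the example short.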

29 Overview of XTRACT System
[Figure: three modules: Generalization, Factoring, MDL Module]
Input sequences: I = { ab, abab, ac, ad, bc, bd, bbd, bbbe }
S_g = I ∪ { (ab)*, (a|b)*, b*d, b*e }
S_f = S_g ∪ { (a|b)(c|d), b*(d|e) }
Inferred DTD: (ab)* | (a|b)(c|d) | b*(d|e)

30 Generalization Subsystem
Goal:
- Infer regular expressions from example sequences
- Produce candidate DTDs such as a*bc, (abc)*, (a|b|c)*, ((ab)*c)*
- Generate more general DTDs
Two heuristics:
- DiscoverSeqPattern(s, r): s = abbbbc => ab*c
- DiscoverOrPattern(s, d): s = abacbc => (a|b|c)*
Candidate DTDs are generated by calling the above functions for appropriate values of r and d
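
A much simplified version of the sequence heuristic can be written with a backreference: any block that repeats at least r times in a row is collapsed into a starred term. This is an illustrative sketch only; the actual heuristic chooses which pattern to replace and iterates with auxiliary symbols, which this one-pass version (hypothetical name `discover_seq_pattern`) does not attempt.

```python
import re

def discover_seq_pattern(s: str, r: int = 2) -> str:
    """Collapse blocks repeated at least r times in a row into (block)*."""
    # (.+?) is the shortest candidate block, \1{r-1,} its further repetitions
    pattern = re.compile(r"(.+?)\1{%d,}" % (r - 1))

    def replace(match):
        block = match.group(1)
        return f"{block}*" if len(block) == 1 else f"({block})*"

    return pattern.sub(replace, s)

print(discover_seq_pattern("abbbbc", r=2))
```

Raising r makes the heuristic more conservative: with r = 3, a block that occurs only twice in a row is left untouched.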

31 DiscoverSeqPattern Example
The pattern must occur at least two times: r = 2
abababcabcababc => (ab)*cabc(ab)*c => ((ab)*c)*

32-41 DiscoverOrPattern Example
Given: the example sequence s = axcxac and the distance parameter d = 2
Step 1: Partition the sequence
Step 2: Replace each pattern a_1 … a_n by (a_1 | … | a_n)*, which gives a(x|c)*ac
x is an auxiliary symbol introduced by DiscoverSeqPattern, x = ((de)*e)*; substituting it back gives a(((de)*e)*|c)*ac
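
The two steps can be sketched as follows. The partition rule used here is an assumption about how the distance parameter works: two occurrences of the same symbol at most d positions apart are forced into the same segment, overlapping segments are merged, and every resulting segment is replaced by a starred alternation of its distinct symbols. The function name is hypothetical.

```python
def discover_or_pattern(s: str, d: int) -> str:
    """Replace segments with nearby repeated symbols by (a1|...|an)*."""
    # Step 1: collect spans between occurrences of a symbol within distance d
    spans, last_seen = [], {}
    for j, symbol in enumerate(s):
        i = last_seen.get(symbol)
        if i is not None and j - i <= d:
            spans.append((i, j))
        last_seen[symbol] = j

    # merge overlapping spans into segments
    segments = []
    for lo, hi in sorted(spans):
        if segments and lo <= segments[-1][1]:
            segments[-1][1] = max(segments[-1][1], hi)
        else:
            segments.append([lo, hi])

    # Step 2: rewrite each segment as an alternation of its distinct symbols
    out, pos = [], 0
    for lo, hi in segments:
        out.append(s[pos:lo])
        distinct = list(dict.fromkeys(s[lo:hi + 1]))  # keeps first-seen order
        out.append("(" + "|".join(distinct) + ")*")
        pos = hi + 1
    out.append(s[pos:])
    return "".join(out)

print(discover_or_pattern("axcxac", 2))
```

With d = 2 only the two x occurrences are close enough to form a segment, so the leading a and the trailing ac stay as they are; a larger d pulls more of the sequence into the alternation.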

42-48 Factoring Subsystem
Goal: Combine different candidates to derive more compact, factored DTDs
Example candidate set S_g = { ac, ad, bc, bd }:
ac | ad | bc | bd => a(c|d) | b(c|d) => (a|b)(c|d)
Reduces the MDL description length of the candidate DTDs
Adoption of factoring algorithms for Boolean expressions
Use a heuristic algorithm for selecting subsets of candidate DTDs that give a good factored form

49 Factoring Subsystem Heuristics
Choose subsets S of candidate DTDs from S_g such that
- the DTDs in S have a common prefix p or suffix s
- the number of DTDs with this common prefix in S_g is high

50 Factoring Prefixes
Input sequences: abcddd, abceee, abcfff, abcggg
Candidate DTDs: abcd*, abce*, abcf*, abcg*
Factored DTD: abc(d*|e*|f*|g*)
Longer prefixes result in MDL cost reduction
The factored DTD covers all input sequences
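
Prefix factoring for this kind of candidate set can be sketched with a common-prefix computation. The sketch assumes flat candidates made of single-letter symbols with optional *, + or ? quantifiers and no parentheses; `factor_common_prefix` is a hypothetical helper, not part of XTRACT.

```python
import os

def factor_common_prefix(dtds):
    """Pull the longest common symbol prefix out of a list of flat DTDs."""
    prefix = os.path.commonprefix(dtds)
    # do not separate a symbol from the quantifier that follows it
    while prefix and any(
        d[len(prefix):len(prefix) + 1] in {"*", "+", "?"} for d in dtds
    ):
        prefix = prefix[:-1]
    if not prefix or len(dtds) < 2:
        return "|".join(dtds)
    rests = [d[len(prefix):] for d in dtds]
    return f"{prefix}({'|'.join(rests)})"

print(factor_common_prefix(["abcd*", "abce*", "abcf*", "abcg*"]))
```

The unfactored alternation abcd*|abce*|abcf*|abcg* has 23 characters and the factored form 16, which is where the description-length saving comes from.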


52 Factoring Subsystem Heuristics
Choose subsets S of candidate DTDs from S_g such that
- the DTDs in S have a common prefix p or suffix s
- the number of DTDs with this common prefix in S_g is high
The overlap between every pair of DTDs D, D’ in S should be minimal

53-54 Factoring Subsystem Overlap
Input sequences: eab, eabb, eabbb, eababab
Candidate DTDs: e(a|b)*, eab*
Factored form: e((a|b)*|ab*)
The new factored form has a much higher MDL cost!
It does not cover more input sequences than e(a|b)*

55 Experimental Validation
Comparison of XTRACT with IBM DDbE (Data Description by Example)
Synthetic documents
- Randomly generated example sequences for synthetic DTDs
Real-life documents
- Example documents from different sources, e.g. the Newspaper Association of America

56 Synthetic Documents
Synthetic DTDs:
1. abcde|efgh|ij|klm
2. (a|b|c|d|f)*gh
3. (a|b|c)d*e*(fgh)*
4. (abcd)*|(e|f|g)*|h|(ijklm)*
5. a*|(b|c|d|e|f)*|gh|(i|j|k)*|(lmn)*
- XTRACT recovers every one of them
- DDbE shows serious weaknesses
  - Recovers only the first one correctly
  - Deduced DTDs are over-generalizations
  - Does not even cover all example sequences
  - Level of factoring is limited

57 Real Life Documents
Columns: No; Simplified DTD; DTD obtained by XTRACT; DTD obtained by DDbE
1; a|b|c|d|e
2; (a|b|c|d|e)*
3; ab*c*; (ab+c*)|(ac*)
4; a*b?c?d?; (a+b(c|(c?d))?)|((b|a+)?cd)|((a+|b)?d)|((a+|b)?c)|(a+|b)
5; (a(bc)+d)*; (a(bc)*d)*; (a|b|c|d)+
6; (ab?c*d?)*; -; (a|b|c|d)+

58 Conclusion
The MDL principle is used to control the tradeoff between model simplicity and model generalisation
General-purpose tool to extract regular expressions from example documents
Experimental results provide strong support
Future work:
- The generalization subsystem should detect patterns containing ? nested within Kleene stars, e.g. (a(bc)?)*
- Enhance the system to detect even more complex DTDs

59 Incremental Validation of XML Documents

60 Abstraction of XML and DTDs
XML documents are abstracted as labeled ordered trees (LOT)
- element content and attribute values are ignored
A DTD is abstracted as an extended CFG
- start symbol (root)
- productions: associate to each label a regular expression that specifies the acceptable labels of the list of children of a node with the given label
A LOT satisfies a DTD ⇔ the tree is a derivation of the grammar

61 DTDs: Abstraction & Example
root: cars
cars → used new
used → car*
new → car*
car → (year | ε) model
[Figure: example tree with root cars and children used and new; the car nodes below them have year and model children, with the values 95 Tigra, 94 Astra, Mini, Boxster 03]
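
Checking a labeled ordered tree against such a grammar amounts to matching, at every node, the sequence of child labels against the regular expression of the node's label. The sketch below does this for the car grammar using Python's `re` module. The tree encoding (a pair of label and child list), the trick of writing each child label followed by a space, and the sample trees are assumptions made for the illustration.

```python
import re

# Productions of the car grammar; a child list is encoded as "label " per child
DTD = {
    "cars": r"used new ",
    "used": r"(car )*",
    "new": r"(car )*",
    "car": r"(year )?model ",
    "year": r"",   # leaves: no children allowed
    "model": r"",
}
ROOT = "cars"

def satisfies(node, dtd=DTD):
    """Check the node's child labels against its production, then recurse."""
    label, children = node
    word = "".join(child[0] + " " for child in children)
    if label not in dtd or re.fullmatch(dtd[label], word) is None:
        return False
    return all(satisfies(child, dtd) for child in children)

def is_valid(tree, dtd=DTD, root=ROOT):
    return tree[0] == root and satisfies(tree, dtd)

def car(*labels):
    """Helper building a car node with the given leaf children."""
    return ("car", [(label, []) for label in labels])

good = ("cars", [("used", [car("year", "model"), car("model")]),
                 ("new", [car("year", "model")])])
bad = ("cars", [("used", [car("model", "year")]), ("new", [])])
```

This validates from scratch by visiting every node; the remaining slides are about avoiding exactly that after a small update.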

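62 Tree Satisfying DTD, General Case
[Figure: a node labeled α with children labeled α_1 … α_k, which root the subtrees s_1 … s_k]
The root label is the start symbol, and for every node with production α → r_α the child label sequence α_1 … α_k is in L(r_α)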

63 Incremental Validation Problem Statement
For each valid tree T: given a series of update commands,
- efficiently decide if the updated tree T’ is valid
- efficiently update the auxiliary structure A(T) and T

64 Updates (1): Node Renaming u(v_i, ·)
[Figure: node r with children v_1 … v_k and subtrees s_1 … s_k; the update gives child v_i a new label]

65 Incremental Validation of Strings
Renaming u(σ_i, b) in the string σ_1 … σ_n, with respect to the regular language specified by the NFA N = (Σ, Q, q_0, F, δ)
- validating the updated string from scratch: O(n |Q|^2 log |Q|)
- maintain auxiliary information:
  Pre(i) = δ(q_0, σ_1 … σ_{i-1})
  Post(i) = { s | δ(s, σ_{i+1} … σ_n) ∩ F ≠ ∅ }
σ_1 … σ_{i-1} b σ_{i+1} … σ_n is valid ⇔ there exist s_1 ∈ Pre(i), s_2 ∈ Post(i) such that s_2 ∈ δ(s_1, b)
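
The Pre/Post test can be tried out on a small automaton. The sketch below is an illustration under assumptions: positions are 0-based (so `pre[i]` plays the role of Pre(i+1) in the 1-based notation of the slide), the NFA is given as a dictionary from (state, symbol) to a set of states, and the example automaton for (ab)* is made up for the purpose.

```python
def step(delta, states, symbol):
    """All states reachable from `states` by reading one symbol."""
    return {t for s in states for t in delta.get((s, symbol), ())}

def pre_sets(delta, q0, word):
    """pre[i] = states reachable from q0 after reading word[:i]."""
    pre = [{q0}]
    for symbol in word:
        pre.append(step(delta, pre[-1], symbol))
    return pre

def post_sets(delta, states, final, word):
    """post[i] = states from which word[i+1:] can lead into a final state."""
    post = [set() for _ in word]
    current = set(final)
    for i in range(len(word) - 1, -1, -1):
        post[i] = current
        current = {s for s in states if step(delta, {s}, word[i]) & current}
    return post

def rename_keeps_valid(delta, pre, post, i, b):
    """Is the word still accepted after position i is renamed to b?"""
    return bool(step(delta, pre[i], b) & post[i])

# Hypothetical NFA for (ab)*: state 0 is initial and final
delta = {(0, "a"): {1}, (1, "b"): {0}}
states, q0, final = {0, 1}, 0, {0}
word = "abab"
pre = pre_sets(delta, q0, word)
post = post_sets(delta, states, final, word)
```

The check itself touches only `pre[i]` and `post[i]`, so its cost does not depend on the length of the word; keeping the two tables current after the update is the expensive part.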

66 Validating a Renaming u(σ_i, b)
[Figure: the NFA N runs from q_0 over σ_1 … σ_{i-1} into Pre(i), and from Post(i) over σ_{i+1} … σ_n into q_F]
Validation of one update in O(1), given precomputed Pre and Post
But the update requires recomputation of Pre(i+1), Pre(i+2), … and of Post(i-1), Post(i-2), …

67 Transition Relation Definition
[Figure: a run from state q to state q’ over σ_i … σ_j, split after position m]
T_{i,j} = { (q, q’) | q’ ∈ δ(q, σ_i … σ_j) }
T_{i,j} = T_{i,m} ∘ T_{m+1,j}

68 Divide-and-conquer approach
Transition relation tree Τ_n (n = 2^k)
- root: T_{1,2^k}
- node T_{i,j} has children T_{i,m} and T_{m+1,j} (m is the midpoint of the range i … j)
- leaves T_{i,i}, 1 ≤ i ≤ n
- number of nodes: n + (n/2) + … + 2 + 1 = 2n-1
- balanced → Τ_n has depth log n

69 Transition Relation Trees
[Figure: transition relation tree over the positions 1 … 8]
Root: T_{1,8}
Level 2: T_{1,4}, T_{5,8}
Level 3: T_{1,2}, T_{3,4}, T_{5,6}, T_{7,8}
Leaves: T_{1,1}, T_{2,2}, T_{3,3}, T_{4,4}, T_{5,5}, T_{6,6}, T_{7,7}, T_{8,8}

70 Updating Τ_n
The affected nodes lie on the path from a leaf to the root
Bottom-up recomputation of the T_{i,j}:
- each T_{i,j} with children T_{i,m} and T_{m+1,j}, at least one of which has been recomputed, is replaced by T_{i,m} ∘ T_{m+1,j}
→ O(log n) recomputations
The updated string is valid if (q_0, f) ∈ T_{1,n} for some f ∈ F
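
The bottom-up recomputation can be sketched with an array-based balanced tree, in the style of a segment tree. The sketch assumes a word whose length is a power of two, 0-based positions, relations stored as Python sets of state pairs, and a made-up automaton for (ab)*; the class and function names are hypothetical.

```python
def compose(r1, r2):
    """Relational composition: (p, r) if (p, q) in r1 and (q, r) in r2."""
    return {(p, r) for (p, q) in r1 for (q2, r) in r2 if q == q2}

class TransitionTree:
    def __init__(self, delta, states, word):
        self.delta, self.states = delta, states
        self.n = len(word)  # assumed to be a power of two
        # heap layout: node 1 is the root, leaves sit at n .. 2n-1
        self.rel = [None] * (2 * self.n)
        for i, symbol in enumerate(word):
            self.rel[self.n + i] = self.leaf(symbol)
        for v in range(self.n - 1, 0, -1):
            self.rel[v] = compose(self.rel[2 * v], self.rel[2 * v + 1])

    def leaf(self, symbol):
        """Transition relation of a single symbol."""
        return {(q, t) for q in self.states
                for t in self.delta.get((q, symbol), ())}

    def rename(self, i, symbol):
        """Relabel position i and recompute the path up to the root."""
        v = self.n + i
        self.rel[v] = self.leaf(symbol)
        v //= 2
        while v >= 1:
            self.rel[v] = compose(self.rel[2 * v], self.rel[2 * v + 1])
            v //= 2

    def is_valid(self, q0, final):
        return any((q0, f) in self.rel[1] for f in final)

# Hypothetical NFA for (ab)*: state 0 is initial and final
delta = {(0, "a"): {1}, (1, "b"): {0}}
tree = TransitionTree(delta, {0, 1}, "abab")
```

A rename touches one leaf and its ancestors only, so the number of compositions per update is logarithmic in the length of the word.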

71 Maintenance of the Structure and Validation in O(log n)
[Figure: transition relation tree over σ_1 … σ_8 after the update u(6, ·); the recomputed nodes are T_{6,6}, T_{5,6}, T_{5,8}, T_{1,8}]
If (q_0, q_F) ∈ T_{1,8}, then the string is valid

72 Insertions and Deletions
Positions of nodes in the string can change
The length n of the string is dynamic
→ Recomputing the entire tree Τ_n is necessary
New approach based on B-trees:
- the tree structure can be incrementally maintained
- the tree is still balanced and has depth O(log n)

73 Transition B-Trees (2-3 Trees)
[Figure: 2-3 tree with leaves T_1, T_2, T_3, T_5, T_6, T_7, T_9 over the symbols σ_1, σ_2, σ_3, σ_5, σ_6, σ_7, σ_9 and inner nodes T_a, T_b, T_c]
T_a = T_1 ∘ T_2
If (q_0, q_F) ∈ T_a ∘ T_b ∘ T_c, then the string is valid

74-77 Transition B-Trees (2-3 Trees) for O(log n) Insertions and Deletions
[Figure sequence: σ_8 and then σ_4 are inserted into the 2-3 tree, adding the leaves T_8 and T_4; the overfull node over T_3 … T_6 is split into (T_3, T_4) and (T_5, T_6), and the inner nodes become T_a, T_d, T_e, T_c below two new nodes T_f, T_g]

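78 Auxiliary Structures for Incremental DTD Validation
[Figure: node r with children v_1 … v_k and subtrees s_1 … s_k; an auxiliary structure for the regular expression of r is kept over the labels of its children and is consulted when an update u(v_i, ·) renames child v_i]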

79 XML Schema Validation
XML Schema provides a mechanism to decouple element names from their types and thus allows context-dependent definitions of their structure
An update to a single node may have global repercussions for the typing of the tree
Need more theory:
- Specialized DTDs, binary tree encoding, non-deterministic tree automata, …
- details are left to the interested reader…

80 Review
Given m updates on a tree of size n:
- incrementally validate a DTD in O(m log n)
- validate XML Schema in O(m log² n)
Weakness: only updates that affect one node at a time are considered

81 Summary
XTRACT as a tool to infer DTDs from a set of example XML documents
An approach to incrementally validate an XML document after an update
Questions?
